Teldar
Scale-to-Zero AI/ML Infrastructure on Google Cloud
Industry:
AI & Machine Learning Consulting
Background
Teldar is a Swiss AI consultancy that handles demanding machine learning workloads. Because their projects require high-end hardware like NVIDIA H100 GPUs, keeping infrastructure running 24/7 is financially impractical. They needed a way to access peak computing power exactly when a model needs to run, without paying for that hardware while it sits idle.
Challenges
Teldar faced several challenges managing GPU-intensive workloads in the cloud:
- Idle GPU Costs: Premium A3 GPU instances incurred significant costs during downtime between workloads.
- Manual Provisioning: Spinning up specialized GPU VMs manually slowed down project execution.
- Operational Overhead: Developers were spending time managing infrastructure instead of building models.
- Deployment Consistency: Ensuring identical environments across runs was difficult without automation.
Solutions
Zazmic designed and implemented an automated, event-driven scale-to-zero architecture on Google Cloud.
- Automated GPU Lifecycle Management: Google Cloud Functions automatically provision and terminate GPU-enabled VMs based on workload demand, ensuring resources exist only when needed.
- Spot GPU Optimization: The platform leverages GCP Spot VMs, enabling access to NVIDIA H100 performance at a significantly reduced cost.
- Infrastructure as Code: Terraform defines all cloud resources, providing repeatable, reliable deployments with minimal manual effort.
- Production Handover: Zazmic delivered detailed runbooks and documentation, enabling Teldar to manage the platform independently.
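As a rough illustration of the lifecycle logic described above, the following Python sketch shows how a function might decide whether the GPU VM should exist, and what a Spot VM request body for the Compute Engine `instances.insert` API could look like. It is not Teldar's or Zazmic's actual code: the names `FleetState`, `decide_action`, and `spot_gpu_instance_body`, the default `a3-highgpu-8g` machine type, and the default network are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FleetState:
    """Hypothetical snapshot of workload demand and VM presence."""
    pending_jobs: int   # workloads waiting to run
    running_jobs: int   # workloads currently executing on the VM
    vm_exists: bool     # whether the GPU VM is currently provisioned


def decide_action(state: FleetState) -> str:
    """Return "create", "delete", or "noop" for the GPU VM."""
    has_work = state.pending_jobs > 0 or state.running_jobs > 0
    if has_work and not state.vm_exists:
        return "create"   # demand appeared: provision the VM
    if not has_work and state.vm_exists:
        return "delete"   # no demand left: scale to zero
    return "noop"


def spot_gpu_instance_body(name: str, zone: str, source_image: str,
                           machine_type: str = "a3-highgpu-8g") -> dict:
    """Build a request body for a Spot VM (machine type is an assumption)."""
    return {
        "name": name,
        "machineType": f"zones/{zone}/machineTypes/{machine_type}",
        "scheduling": {
            "provisioningModel": "SPOT",            # Spot pricing
            "instanceTerminationAction": "DELETE",  # remove the VM on preemption
            "onHostMaintenance": "TERMINATE",       # GPU VMs cannot live-migrate
            "automaticRestart": False,
        },
        "disks": [{
            "boot": True,
            "autoDelete": True,  # the boot disk is removed with the VM
            "initializeParams": {"sourceImage": source_image},
        }],
        # Assumes the project's default network; a real setup may differ.
        "networkInterfaces": [{"network": "global/networks/default"}],
    }
```

In a setup like this, a function triggered by a job queue or a scheduler could call `decide_action` on each event and pass the resulting body to the Compute Engine API only when the answer is "create". Deleting the VM and its boot disk when no work remains is what keeps idle cost near zero.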
Outcomes
The new platform transformed how Teldar runs high-demand AI workloads:
- Significant Cost Reduction: GPU compute costs are incurred only while models are actively running.
- Faster Time to Execution: Automated provisioning removes delays associated with manual setup.
- Operational Autonomy: Teldar’s team manages the infrastructure without ongoing external support.
- Production-Ready Stability: Consistent, predictable environments reduce operational risk and technical debt.
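The cost effect of scale-to-zero can be expressed as simple arithmetic: an always-on VM is billed for every hour in the month, while an on-demand VM is billed only for the hours it runs. The sketch below is a simplified model with hypothetical function names. It ignores startup time, storage, and Spot discounts, and any rates passed to it are placeholders, not actual Google Cloud prices or Teldar figures.

```python
def monthly_gpu_cost(hourly_rate: float, active_hours: float,
                     always_on: bool, hours_in_month: float = 730.0) -> float:
    """Compute cost for one month under a simplified billing model."""
    # Always-on VMs are billed for the full month; scale-to-zero VMs
    # are billed only for the hours they actually run.
    billed_hours = hours_in_month if always_on else min(active_hours, hours_in_month)
    return hourly_rate * billed_hours


def savings_fraction(active_hours: float, hours_in_month: float = 730.0) -> float:
    """Fraction of always-on compute cost avoided by scaling to zero."""
    return 1.0 - min(active_hours, hours_in_month) / hours_in_month
```

Under this model the saving depends only on utilization, not on the hourly rate: a VM that runs for a tenth of the month avoids about nine tenths of the always-on compute cost.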
Conclusion
By shifting to an automated scale-to-zero architecture, Teldar can run demanding AI workloads without the burden of fixed GPU costs. The solution turns high-end infrastructure into an on-demand utility, allowing the team to scale efficiently, control spend, and focus on building and delivering models rather than managing infrastructure.