Location: San Carlos
Dear Candidate,
Before submitting your resume, please check the location above – we will not be able to review your resume or provide feedback if you are not located where the vacancy is based.
Thank you for your understanding.
We are seeking skilled Cloud and ML Infrastructure Engineers to lead the buildout of our AWS foundation and our LLM platform. You will design, implement, and operate services that are scalable, reliable, and secure.
Given the broad scope of the role, experience in LLM/ML infrastructure and IoT infrastructure is a strong plus. On the ML infrastructure side, you will build the stack that powers retrieval-augmented generation (RAG) and application workflows built with frameworks like LangChain. Experience with AWS IoT services is a plus.
You will work closely with other engineers and product management. The ideal candidate is hands-on, comfortable with ambiguity, and excited to build from first principles.
Key Responsibilities
Cloud Infrastructure Setup and Maintenance
Design, provision, and maintain AWS infrastructure using IaC tools such as AWS CDK or Terraform.
Build CI/CD and testing for apps, infra, and ML pipelines using GitHub Actions, CodeBuild, and CodePipeline.
Operate secure networking with VPCs, PrivateLink, and VPC endpoints. Manage IAM, KMS, Secrets Manager, and audit logging.
LLM Platform and Runtime
Stand up and operate model endpoints using Amazon Bedrock and/or SageMaker; evaluate when to use ECS/EKS, Lambda, or Batch for inference jobs.
Build and maintain application services that call LLMs through clean APIs, with streaming, batching, and backoff strategies.
Implement prompt and tool execution flows with LangChain or similar, including agent tools and function calling.
RAG Data Systems and Vector Search
Design chunking and embedding pipelines for documents, time series, and multimedia. Orchestrate with Step Functions or Airflow.
Operate vector search using OpenSearch Serverless, Aurora PostgreSQL with pgvector, or Pinecone. Tune recall, latency, and cost.
Build and maintain knowledge bases and data syncs from S3, Aurora, DynamoDB, and external sources.
Evaluation, Observability, and Cost Governance
Create offline and online eval harnesses for prompts, retrievers, and chains. Track quality, latency, and regression risk.
Instrument model and app telemetry with CloudWatch and OpenTelemetry. Build token usage and cost dashboards with budgets and alerts.
Add guardrails, rate limits, fallbacks, and provider routing for resilience.
Safety, Privacy, and Compliance
Implement PII detection and redaction, access controls, content filters, and human-in-the-loop review where needed.
Use Bedrock Guardrails or policy services to enforce safety standards. Maintain audit trails for regulated environments.
Data Pipeline Construction
Build ingestion and processing pipelines for structured, unstructured, and multimedia data. Ensure integrity, lineage, and cataloging with Glue and Lake Formation.
Optimize bulk data movement and storage in S3, Glacier, and tiered storage. Use Athena for ad-hoc analysis.
IoT Deployment Management
Manage infrastructure that deploys to and communicates with edge devices. Support secure messaging, identity, and over-the-air updates.
Analytics and Application Support
Partner with product and application teams to integrate retrieval services, embeddings, and LLM chains into user-facing features.
Provide expert troubleshooting for cloud and ML services with an emphasis on uptime and performance.
Performance Optimization
Tune retrieval quality, context window use, and caching with Redis or Bedrock Knowledge Bases.
Optimize inference with model selection, quantization where applicable, GPU/CPU instance choices, and autoscaling strategies.
What Will Make You Successful:
- End-to-End Ownership: Drives work from design through production, including on-call and continuous improvement.
- LLM Systems Experience: Shipped or operated LLM-powered applications in production. Familiar with RAG design, prompt versioning, and chain orchestration using LangChain or similar.
- AWS Depth: Strong with core AWS services such as VPC, IAM, KMS, CloudWatch, S3, ECS/EKS, Lambda, Step Functions, Bedrock, and SageMaker.
- Data Engineering Skills: Comfortable building ingestion and transformation pipelines in Python. Familiar with Glue, Athena, and event-driven patterns using EventBridge and SQS.
- Security Mindset: Applies least privilege, secrets management, network isolation, and compliance practices appropriate to sensitive data.
- Evaluation and Metrics: Uses quantitative evals, A/B testing, and live metrics to guide improvements.
- Clear Communication: Explains tradeoffs and aligns partners across product, security, and application engineering.
Bonus Points:
- 4+ years working with serverless or container platforms on AWS.
- Experience with vector databases, OpenSearch, or pgvector at scale.
- Hands-on with Bedrock Guardrails, Knowledge Bases, or custom policy engines.
- Familiarity with GPU workloads, Triton Inference Server, or TensorRT-LLM.
- Experience with big data tools for large-scale processing and search.
- Background in aviation data or other safety-critical domains.
- DevOps or DevSecOps experience automating CI/CD for ML and app services.
Required qualifications:
- Scope: Independently delivers features and subsystems.
- Contributions: Builds CI/CD pipelines, deploys ML endpoints (Bedrock, SageMaker), develops RAG pipelines and vector search integrations, manages infra security (IAM, KMS).
- Requirements: 3–5 years in cloud, infrastructure, or ML systems; hands-on with AWS services; experience with APIs, data pipelines, and at least one ML/LLM integration.
Location:
- This is a hybrid role and requires working from our San Carlos, CA office at least three days a week, with the option to work remotely the remaining days.
Dear Candidate,
In an era of rapid technological advancement and the constant evolution of artificial intelligence, we at Zazmic believe in analyzing resumes not only through automated tools but also through interaction with a live recruiter. We value an individualized approach to each candidate and strive to make the hiring process friendlier and more efficient.
Because we value your time and that of our colleagues, we invite you to provide additional information that will help us better understand your profile and how it aligns with the job description. Your initiative will help us make a more informed decision when considering your candidacy.
Please note that Zazmic reserves the right not to respond to a candidate’s application if we conclude that the candidate does not meet our requirements for any reason. Please understand this as part of our commitment to an efficient and fair hiring process.
Thank you for your understanding and participation in our recruitment process.
Best regards,
The Zazmic Team