USA | Remote
Senior
Full Time
4 days ago
💰 $150,000 - $220,000
Site Reliability Engineer, Kubernetes, AWS, Terraform, AI/ML, Hybrid Infrastructure, GPU Computing, Slurm, DevOps
Requirements
- 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
- Proven hands-on experience building and managing production infrastructure with Terraform
- Expert-level knowledge of Kubernetes architecture and operations in large-scale environments
- Experience with HPC job schedulers, specifically Slurm, for GPU-intensive AI workloads
- Experience managing bare metal infrastructure, including server provisioning and lifecycle management
- Strong scripting and automation skills (Python, Go, Bash)
What You'll Do
- Architect and maintain the core computing platform using Kubernetes on AWS and on-premises
- Develop and manage infrastructure using Infrastructure-as-Code (Terraform)
- Design, build, and optimize AI/ML job scheduling and orchestration systems that integrate Slurm with Kubernetes clusters
- Provision, manage, and maintain on-premises bare metal server infrastructure for GPU computing
- Implement and manage platform networking and storage solutions for hybrid environments
- Develop an observability stack for platform health and automate operational tasks
- Collaborate with AI researchers and ML engineers to build tools and workflows
- Automate the lifecycle of single-tenant, managed deployments
Nice to Have
- Experience with CI/CD systems (GitLab CI, Jenkins, ArgoCD) and building developer tooling
- Familiarity with FinOps principles and cloud cost optimization strategies
- Knowledge of Kubernetes networking (Calico, Cilium) and storage (Ceph, Rook) solutions
- Experience in multi-region or hybrid cloud environments
Benefits
- Medical, dental, and vision benefits
- Annual wellness stipend
- Mental health support
- Life, short-term disability (STD), and long-term disability (LTD) income insurance plans
- Unlimited PTO
- Generous paid parental leave
- Flexible schedule
- 12 paid US company holidays
- Quarterly personal productivity stipend
- One-time stipend for home office upgrades
- 401(k) plan with company match
- Tax savings programs
- Learning / education stipend
- Participation in talks and conferences
- Employee Resource Groups
- AI enablement workshops / sessions
