    Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

    Deepgram
    USA | Remote
    Senior
    Full Time
    4 days ago
    💰 $150,000 - $220,000
    Site Reliability Engineer, Kubernetes, AWS, Terraform, AI, ML, Hybrid Infrastructure, GPU Computing, Slurm, DevOps

    Requirements

    • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
    • Proven hands-on experience building and managing production infrastructure with Terraform
    • Expert-level knowledge of Kubernetes architecture and operations in large-scale environments
    • Experience with HPC job schedulers, specifically Slurm, for GPU-intensive AI workloads
    • Experience managing bare metal infrastructure including server provisioning and lifecycle management
    • Strong scripting and automation skills (Python, Go, Bash)

    What You'll Do

    • Architect and maintain the core computing platform using Kubernetes on AWS and on-premises
    • Develop and manage infrastructure using Infrastructure-as-Code (Terraform)
    • Design, build, and optimize AI/ML job scheduling and orchestration systems integrating Slurm with Kubernetes clusters
    • Provision, manage, and maintain on-premise bare metal server infrastructure for GPU computing
    • Implement and manage platform networking and storage solutions for hybrid environments
    • Develop an observability stack for platform health and automate operational tasks
    • Collaborate with AI researchers and ML engineers to build tools and workflows
    • Automate the lifecycle of single-tenant, managed deployments

    Nice to Have

    • Experience with CI/CD systems (GitLab CI, Jenkins, ArgoCD) and building developer tooling
    • Familiarity with FinOps principles and cloud cost optimization strategies
    • Knowledge of Kubernetes networking (Calico, Cilium) and storage (Ceph, Rook) solutions
    • Experience in multi-region or hybrid cloud environments

    Benefits

    • Medical, dental, vision benefits
    • Annual wellness stipend
    • Mental health support
    • Life, STD, LTD Income Insurance Plans
    • Unlimited PTO
    • Generous paid parental leave
    • Flexible schedule
    • 12 paid US company holidays
    • Quarterly personal productivity stipend
    • One-time stipend for home office upgrades
    • 401(k) plan with company match
    • Tax Savings Programs
    • Learning / Education stipend
    • Participation in talks and conferences
    • Employee Resource Groups
    • AI enablement workshops / sessions

    About Deepgram

    Deepgram provides AI-powered speech-to-text technology, along with audio intelligence, text-to-speech, and voice agent APIs.

    San Francisco, CA
    100 - 250
    AI & Machine Learning