United States
Remote
Senior
Full Time
1 day ago
$96,000 - $192,000
Site Reliability Engineer, AI Agents, Platform Engineering, ML Infrastructure, Terraform, Kubernetes, AWS
Requirements
- 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or in a similar role in a production environment
- Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
- Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
- Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
- Proficiency with Infrastructure as Code tools, particularly Terraform
- Experience with containerization and orchestration, particularly Docker and Kubernetes
- Solid understanding of cloud infrastructure, preferably AWS
- Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
- Experience designing and operating observability, monitoring, and alerting systems
- Experience implementing incident response procedures and participating in on-call rotations
- Strong collaboration skills across data, AI, and engineering teams
- High-ownership mindset in a fast-moving, high-stakes production environment
What You'll Do
- Design, build, and operate the infrastructure layer supporting AI agent workflows in production
- Ensure reliability, scalability, and observability of agentic systems across internal and external products
- Design and develop platform services, APIs, SDKs, and self-service capabilities that let engineering teams easily consume AI infrastructure and agent platform services
- Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
- Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
- Use Infrastructure as Code (IaC) tools such as Terraform to provision and manage AWS cloud infrastructure
- Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
- Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
- Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
- Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
- Implement access controls and security best practices across AI infrastructure environments
- Document architecture, runbooks, and best practices to support knowledge sharing across the team
Nice to Have
- Experience building or operating infrastructure for agent-based or LLM-powered systems
- Familiarity with agent orchestration frameworks (e.g., LangGraph, CrewAI, or similar)
- Background in data infrastructure, including familiarity with Airflow, Kafka, Spark, or data lake tooling
- Experience with CI/CD pipelines and deployment automation for AI/ML workloads
- Exposure to evaluation frameworks and model performance monitoring at scale
- Experience working in fast-moving 0→1 environments or platform-building teams
- Experience building SDKs, developer tooling, or internal platform products with a strong focus on usability and adoption
- Experience with Cloudflare's cloud platform and product ecosystem, including networking, security, performance, and Zero Trust solutions
