    Site Reliability Engineer - AI Agents

    Kraken
    United Kingdom, Chile, South Africa, Colombia, Spain, Czech Republic, Sweden, Brazil, Canada, Cyprus
    Remote
    Senior
    Full Time
    1 day ago
    Site Reliability Engineer, AI, Infrastructure, Platform Engineering, Terraform, Kubernetes, AWS, ML Infrastructure, MLOps

    Requirements

    • 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, or Platform Engineer, or in a similar role, in a production environment
    • Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
    • Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
    • Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
    • Proficiency with Infrastructure as Code tools, particularly Terraform
    • Experience with containerization and orchestration, particularly Kubernetes and Docker
    • Solid understanding of cloud infrastructure, preferably AWS
    • Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
    • Experience designing and operating observability, monitoring, and alerting systems
    • Experience implementing incident response procedures and participating in on-call rotations
    • Strong collaboration skills working across data, AI, and engineering teams
    • High ownership mindset in a fast-moving, high-stakes production environment

    What You'll Do

    • Design, build, and operate the infrastructure layer supporting AI agent workflows in production
    • Ensure reliability, scalability, and observability of agentic systems across internal and external products
    • Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services
    • Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
    • Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
    • Use Infrastructure as Code (IaC) tools such as Terraform to provision and manage AWS cloud infrastructure
    • Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
    • Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
    • Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
    • Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
    • Implement access controls and security best practices across AI infrastructure environments
    • Document architecture, runbooks, and best practices to support knowledge sharing across the team

    Nice to Have

    • Experience building or operating infrastructure for agent-based or LLM-powered systems
    • Familiarity with agent orchestration frameworks (e.g., LangGraph, CrewAI, or similar)
    • Background in data infrastructure, including familiarity with Airflow, Kafka, Spark, or data lake tooling
    • Experience with CI/CD pipelines and deployment automation for AI/ML workloads
    • Exposure to evaluation frameworks and model performance monitoring at scale
    • Experience working in fast-moving 0→1 environments or platform-building teams
    • Experience building SDKs, developer tooling, or internal platform products with a strong focus on usability and adoption
    • Experience with Cloudflare's cloud platform and product ecosystem, including networking, security, performance, and Zero Trust solutions

    About Kraken

    Kraken is a cryptocurrency exchange platform that provides parachain auctions, staking, and index services.

    San Francisco, CA
    1000 - 5000 employees
    Finance