Remote
Mid Level
Full Time
about 1 month ago
SRE, Site Reliability Engineering, Kubernetes, AWS, Terraform, Remote, Infrastructure, Automation
Requirements
- Deep hands-on experience with Kubernetes in production (EKS preferred)
- Experience debugging node pressure, networking issues, and deployment failures at scale
- Strong experience operating production infrastructure on AWS, including organizational boundaries, IAM, and networking
- Experience automating infrastructure using Terraform or Terragrunt at scale, including module design and state management
- Solid understanding of Linux systems, including disk, memory, networking, and failure modes
- Experience supporting stateful systems such as databases, queues, and storage systems
- Ability to debug and reason about performance and reliability issues in production
- Comfortable owning systems end-to-end, including on-call responsibilities
What You'll Do
- Own and operate production systems end-to-end
- Turn a fast-growing, stateful system into a predictable, well-automated platform
- Handle provisioning, scaling, rebalancing, and recovery of infrastructure
- Reduce operational stress by designing safe automation for traffic-heavy workloads
- Build tooling and patterns to scale systems without scaling human effort
- Operate EKS clusters with Karpenter autoscaling, Cilium networking, and ArgoCD-driven GitOps deployments
- Manage and evolve a multi-account AWS organization, including provisioning, networking, access control, and cross-account connectivity
- Maintain the Terraform/Terragrunt IaC platform, including modules and automated pipelines
- Improve operational tooling around deploys, schema changes, backups, restores, and incident response
- Reduce operational load by identifying and eliminating repeat pain points through code and automation
- Optimize cloud spend
- Participate in on-call and incident response, with a focus on reducing incidents
Nice to Have
- Experience with GitOps workflows (ArgoCD) and CI/CD pipelines (GitHub Actions)
- Experience building AI agent-enabled base-level infrastructure services
- Familiarity with multi-region infrastructure and consistency/availability tradeoffs
Benefits
- Remote work
- Autonomy in choosing work and projects
- Transparency in company operations and strategy
- Product-led company with strong product-market fit
- Well-funded, with over $100M raised
- Meeting-free days to prioritize building time
- Ambitious and optimistic company culture
- Supportive and professional team environment
