Remote
Senior
Full Time
10 days ago
💰$116,633 - $181,243
remote, site_reliability_engineer, senior, infrastructure, cloud, aws, ci_cd, automation, observability, wikimedia
Requirements
- Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible)
- Proficiency in at least one programming language (e.g., Python, Go, or similar)
- Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP
- Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD, or similar)
- Familiarity with progressive delivery approaches such as canary and blue-green deployments
- Experience with incident response, on-call practices, and leading postmortems
- Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets
- Experience in observability: metrics, logging, and distributed tracing (e.g., Prometheus, OpenTelemetry)
- Ability to work effectively in a distributed, cross-functional environment
- Strong documentation and communication skills
What You'll Do
- Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
- Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
- Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
- Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
- Partner with engineering team members to embed reliability best practices early in the development lifecycle
- Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab and ArgoCD, enabling automated, reliable deployments with support for progressive delivery strategies such as canary and blue-green releases
- Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
- Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
- Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
- Reduce operational toil by identifying repetitive work and implementing automation-first solutions
- Contribute to and evolve internal platform capabilities that standardize infrastructure and improve scalability across teams
- Collaborate with a global team that communicates asynchronously
- Mentor peers in areas of technical and operational strength
Nice to Have
- Familiarity with Wikimedia or other open source projects
- Experience managing and troubleshooting event streaming platforms at scale (e.g., Kafka, Kinesis, or similar)
- Hands-on experience with cloud platforms such as AWS and/or GCP
- Familiarity with data lake architectures and large-scale data processing frameworks (e.g., Iceberg, Flink, Spark)
- Experience with continuous profiling and performance optimization tools
- Experience working with or contributing to open source projects, particularly in infrastructure or data ecosystems
- Prior participation in the Wikimedia movement
Benefits
- Competitive and equitable salary
- Remote-first work environment
- Inclusive and equitable workplace
- Opportunities for professional growth and learning
