Senior Site Reliability Engineer, Wikimedia Enterprise

Wikimedia

Remote

Senior

Full Time

10 days ago

💰$116,633 - $181,243

remote, site_reliability_engineer, senior, infrastructure, cloud, aws, ci_cd, automation, observability, wikimedia

Requirements

•Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible)
•Proficiency in at least one programming language (e.g., Python, Go, or similar)
•Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP
•Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD, or similar)
•Familiarity with progressive delivery approaches such as canary and blue-green deployments
•Experience with incident response, on-call practices, and leading postmortems
•Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets
•Experience in observability: metrics, logging, and distributed tracing (e.g., Prometheus, OpenTelemetry)
•Ability to work effectively in a distributed, cross-functional environment
•Strong documentation and communication skills

What You'll Do

•Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
•Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
•Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
•Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
•Partner with engineering team members to embed reliability best practices early in the development lifecycle
•Design, implement, and optimize CI/CD and GitOps workflows (using tools such as GitLab and ArgoCD) that enable automated, reliable deployments and support progressive delivery strategies such as canary and blue-green releases
•Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
•Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
•Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
•Reduce operational toil by identifying repetitive work and implementing automation-first solutions
•Contribute to and evolve internal platform capabilities that standardize infrastructure and improve scalability across teams
•Collaborate with a global team that communicates asynchronously
•Mentor peers in areas of technical and operational strength

Nice to Have

•Familiarity with Wikimedia or other open source projects
•Experience managing and troubleshooting event streaming platforms at scale (e.g., Kafka, Kinesis, or similar)
•Hands-on experience with cloud platforms such as AWS and/or GCP
•Familiarity with data lake architectures and large-scale data processing frameworks (e.g., Iceberg, Flink, Spark)
•Experience with continuous profiling and performance optimization tools
•Experience working with or contributing to open source projects, particularly in infrastructure or data ecosystems
•Prior participation in the Wikimedia movement

Benefits

•Competitive and equitable salary
•Remote-first work environment
•Inclusive and equitable workplace
•Opportunities for professional growth and learning
