    Senior Site Reliability Engineer, Wikimedia Enterprise

    Wikimedia
    Remote
    Senior
    Full Time
    10 days ago
    Salary: $116,633 - $181,243
    Tags: remote, site reliability engineer, senior, infrastructure, cloud, AWS, CI/CD, automation, observability, Wikimedia

    Requirements

    • Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible)
    • Proficiency in at least one programming language (e.g., Python, Go, or similar)
    • Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP
    • Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD, or similar)
    • Familiarity with progressive delivery approaches such as canary and blue-green deployments
    • Experience with incident response, on-call practices, and leading postmortems
    • Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets
    • Experience with observability (metrics, logging, and distributed tracing; e.g., Prometheus, OpenTelemetry)
    • Ability to work effectively in a distributed, cross-functional environment
    • Strong documentation and communication skills

    What You'll Do

    • Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
    • Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
    • Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
    • Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
    • Partner with engineering team members to embed reliability best practices early in the development lifecycle
    • Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab and ArgoCD, enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
    • Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
    • Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
    • Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
    • Reduce operational toil by identifying repetitive work and implementing automation-first solutions
    • Contribute to and evolve internal platform capabilities that standardize infrastructure and improve scalability across teams
    • Collaborate with a global team that communicates asynchronously
    • Mentor peers in areas of technical and operational strength

    Nice to Have

    • Familiarity with Wikimedia or other open source projects
    • Experience managing and troubleshooting event streaming platforms at scale (e.g., Kafka, Kinesis, or similar)
    • Hands-on experience with cloud platforms such as AWS and/or GCP
    • Familiarity with data lake architectures and large-scale data processing frameworks (e.g., Iceberg, Flink, Spark)
    • Experience with continuous profiling and performance optimization tools
    • Experience working with or contributing to open source projects, particularly in infrastructure or data ecosystems
    • Prior participation in the Wikimedia movement

    Benefits

    • Competitive and equitable salary
    • Remote-first work environment
    • Inclusive and equitable workplace
    • Opportunities for professional growth and learning

    About Wikimedia

    The Wikimedia Foundation encourages the development and distribution of free educational content through projects such as Wikipedia.

    San Francisco, CA, US
    500 - 1000
    Media & Entertainment