Senior Site Reliability Engineer, Wikimedia Enterprise

Full time, engineering, DevOps, remote from 🇧🇷
Open to candidates in: Brazil
Jobgether
🏭 Not specified
📍 N/A
👤 Not specified

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, Wikimedia Enterprise in Brazil.

This role sits at the intersection of large-scale infrastructure engineering and mission-driven technology powering global knowledge distribution systems. You will help design, operate, and evolve highly available, high-performance API and data infrastructure that supports large-scale reuse of Wikimedia content worldwide. The position involves deep technical ownership of reliability, scalability, and observability for critical services. You will work in a fully distributed, globally collaborative environment alongside experienced SREs, software engineers, and platform teams. The role combines hands-on engineering, incident response, and long-term reliability strategy. It also offers the opportunity to contribute to systems that directly impact how knowledge is accessed and reused across the internet. You will operate in a fast-paced, product-focused engineering culture with strong emphasis on automation, experimentation, and continuous improvement.


Accountabilities

In this role, you will be responsible for ensuring the reliability, scalability, and performance of large-scale distributed systems that power data and API services. You will:

  • Define, track, and continuously improve SLOs, SLIs, and error budgets for critical services
  • Design and enhance observability systems including metrics, logging, and distributed tracing
  • Participate in incident response, on-call rotations, and post-incident reviews to drive continuous improvement
  • Build and maintain CI/CD and GitOps pipelines enabling secure, automated, and reliable deployments
  • Implement infrastructure-as-code and automation-first practices to reduce operational toil
  • Design and operate scalable cloud infrastructure across production environments
  • Drive capacity planning, performance optimization, and resilience testing (including chaos engineering practices)
  • Improve developer experience by enabling self-service infrastructure and streamlined workflows
  • Collaborate with security, software, and release engineering teams to embed reliability and security best practices
  • Optimize infrastructure cost and efficiency using FinOps principles without compromising availability
  • Develop and maintain operational metrics such as MTTR, MTTD, and incident frequency
  • Contribute to platform engineering initiatives that standardize infrastructure across teams
  • Mentor peers and promote best practices in SRE, automation, and systems reliability

Requirements

This position requires strong expertise in site reliability engineering, distributed systems, and cloud infrastructure, along with a proactive and collaborative mindset. You should have:

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
  • Strong experience with infrastructure-as-code tools such as Terraform and/or Ansible
  • Proficiency in at least one programming language (Python, Go, or similar)
  • Hands-on experience with cloud platforms such as AWS, GCP, or Azure
  • Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD, or similar tools)
  • Strong understanding of SRE principles including SLOs, SLIs, and error budgets
  • Experience with observability tooling such as Prometheus, OpenTelemetry, or equivalent
  • Proven experience in incident response, on-call operations, and postmortem analysis
  • Ability to operate and optimize large-scale distributed systems with high availability requirements
  • Strong communication and collaboration skills in distributed, remote-first environments
  • Ability to document systems clearly and contribute to shared engineering knowledge
  • Strong ownership mindset, with a focus on automation, reliability, and continuous improvement
  • Adaptability to fast-evolving, technology-driven environments

Benefits

  • Remote-first work model with global collaboration
  • Opportunity to work on high-impact systems supporting global knowledge platforms
  • Exposure to large-scale distributed systems and modern cloud-native architectures
  • Culture of engineering excellence, automation, and continuous improvement
  • Strong emphasis on learning, experimentation, and open collaboration
  • Competitive compensation adjusted to location and experience
  • Inclusive and diverse work environment with global team exposure
  • Opportunity to contribute to open knowledge infrastructure used worldwide

How Jobgether works: We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.