Posted 03 Jun 26

Senior Site Reliability Engineer, Wikimedia Enterprise

full time, engineering, devops, remote from 🇧🇷

Open to candidates in: Brazil

Jobgether


This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, Wikimedia Enterprise in Brazil.

This role sits at the intersection of large-scale infrastructure engineering and mission-driven technology powering global knowledge distribution systems. You will help design, operate, and evolve highly available, high-performance API and data infrastructure that supports large-scale reuse of Wikimedia content worldwide. The position involves deep technical ownership of reliability, scalability, and observability for critical services. You will work in a fully distributed, globally collaborative environment alongside experienced SREs, software engineers, and platform teams. The role combines hands-on engineering, incident response, and long-term reliability strategy. It also offers the opportunity to contribute to systems that directly impact how knowledge is accessed and reused across the internet. You will operate in a fast-paced, product-focused engineering culture with a strong emphasis on automation, experimentation, and continuous improvement.

Accountabilities

In this role, you will be responsible for ensuring the reliability, scalability, and performance of large-scale distributed systems that power data and API services. You will:

Define, track, and continuously improve SLOs, SLIs, and error budgets for critical services
Design and enhance observability systems including metrics, logging, and distributed tracing
Participate in incident response, on-call rotations, and post-incident reviews to drive continuous improvement
Build and maintain CI/CD and GitOps pipelines enabling secure, automated, and reliable deployments
Implement infrastructure-as-code and automation-first practices to reduce operational toil
Design and operate scalable cloud infrastructure across production environments
Drive capacity planning, performance optimization, and resilience testing (including chaos engineering practices)
Improve developer experience by enabling self-service infrastructure and streamlined workflows
Collaborate with security, software, and release engineering teams to embed reliability and security best practices
Optimize infrastructure cost and efficiency using FinOps principles without compromising availability
Develop and maintain operational metrics such as MTTR, MTTD, and incident frequency
Contribute to platform engineering initiatives that standardize infrastructure across teams
Mentor peers and promote best practices in SRE, automation, and systems reliability

Requirements

This position requires strong expertise in site reliability engineering, distributed systems, and cloud infrastructure, along with a proactive and collaborative mindset. You should have:

5+ years of experience in SRE, DevOps, or infrastructure engineering roles
Strong experience with infrastructure-as-code tools such as Terraform and/or Ansible
Proficiency in at least one programming language (Python, Go, or similar)
Hands-on experience with cloud platforms such as AWS, GCP, or Azure
Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or ArgoCD)
Strong understanding of SRE principles including SLOs, SLIs, and error budgets
Experience with observability tooling such as Prometheus, OpenTelemetry, or equivalent
Proven experience in incident response, on-call operations, and postmortem analysis
Ability to operate and optimize large-scale distributed systems with high availability requirements
Strong communication and collaboration skills in distributed, remote-first environments
Ability to document systems clearly and contribute to shared engineering knowledge
Strong ownership mindset, with a focus on automation, reliability, and continuous improvement
Adaptability to fast-evolving, technology-driven environments

Benefits

Remote-first work model with global collaboration
Opportunity to work on high-impact systems supporting global knowledge platforms
Exposure to large-scale distributed systems and modern cloud-native architectures
Culture of engineering excellence, automation, and continuous improvement
Strong emphasis on learning, experimentation, and open collaboration
Competitive compensation adjusted to location and experience
Inclusive and diverse work environment with global team exposure
Opportunity to contribute to open knowledge infrastructure used worldwide

How Jobgether works: We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
