Posted 2026-05-21

Site Reliability Engineer

Description

We're looking for a Site Reliability Engineer (SRE) to join our Platform Delivery and Reliability Services (PDRS) team and help us grow and maintain our Observability Platform.

Responsibilities

Build, deploy, and maintain telemetry pipelines and observability platforms that provide real-time insight into system performance and reliability.
Design and integrate monitoring and observability solutions with an automation-first mindset.
Collaborate with Development and Infrastructure teams to ensure observability requirements — metrics, traces, logs, and alerts — are built into the design of new systems.
Apply SRE principles to reduce repetitive work and shorten the time to detect and resolve issues.
Help troubleshoot and prevent production issues to keep our systems stable and performant.
Share knowledge and contribute to team documentation and best practices.

Requirements

Hands-on experience as a Site Reliability Engineer or in a similar infrastructure/operations role.
Good coding and scripting skills — Bash, Python, or similar — including experience with code reviews.
Solid understanding of Linux OS and distributed systems.
Experience with Time Series Databases and visualisation tools such as Grafana.
Familiarity with alerting and monitoring tools like Prometheus, Icinga, Datadog, or Nagios.
Experience with CI/CD tools (Jenkins, GitLab, GitHub Actions) and Infrastructure as Code (Terraform, Ansible, or similar).
Experience with logging platforms such as Splunk, Elasticsearch, Loki, Fluentd, or Vector is a plus.
Knowledge of container orchestration platforms (Kubernetes, Nomad, OpenShift) and cloud environments (AWS, Azure, OpenStack) is a plus.
Interest or experience in AI tools is considered an advantage.

Benefits

Well-being allowance
Learning and development opportunities
Inclusion networks
Charity days
Long service awards
Social events and activities
Private medical insurance
Life assurance and income protection
Employee Assistance Programme
Pension
