Site Reliability Engineer
Flutter Technology is looking for a Site Reliability Engineer to safeguard the stability, uptime, and efficiency of our essential gaming and betting platforms across our worldwide operations. The position combines engineering skill with operational proficiency, including on-call support, to keep services continuously available for millions of users globally. As a member of Flutter Functions, you will work closely with development teams, infrastructure experts, and business partners to maintain high-performance, scalable systems supporting our iGaming and Sports platforms in several markets.
You will be the expert responsible for building and managing enterprise-level observability, disaster recovery, and business continuity features across our AWS Cloud environment. The role requires a passion for system reliability and a proactive approach to spotting and fixing potential issues before they affect customers. You will be responsible for ensuring our systems are resilient, recoverable, and subjected to regular fire drills and extensive testing. This role follows a hybrid approach to working, allowing you to combine working from home with working in our modern offices.
- Maintain 99.9%+ uptime for the Observability platform that monitors and provides insights for systems serving millions of concurrent users.
- Design and support complete monitoring, alerting, and observability systems.
- Take responsibility for the tooling infrastructure that connects with various cloud services and platforms such as Grafana, Splunk, and CloudWatch.
- Conduct capacity planning and performance optimisation to ensure systems can handle peak loads during major sporting events.
- Establish and uphold Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all essential services with assistance from Service Management.
- Collaborate with Service Management to foster continuous improvement via blameless post-mortems.
- Detect repeated failure trends across the platform and work with product teams on resilience upgrades.
- Work together with Service Management on post-incident reviews, offering technical insights and assisting in the adoption of preventative measures.
- Support the development and upkeep of detailed runbooks and incident response procedures.
- Deploy and maintain comprehensive monitoring dashboards and visualisation tools for real-time system visibility.
- Create custom dashboards and visual analytics for business metrics, technical indicators, and operational insights.
- Configure and optimise data ingestion from diverse sources including time-series databases, log aggregation systems, cloud monitoring services, and custom APIs.
- Implement and refine alerting rules and notification workflows.
- Develop and sustain application performance monitoring (APM) capabilities, incorporating instrumentation and telemetry collection into the current observability ecosystem.
- Work together with development teams to define, implement, and instrument custom business and technical metrics.
- Own and maintain the chaos testing framework and tools, defining standard failure scenarios.
- Support product teams in running chaos tests safely and consistently, and carry out disaster recovery fire drills.
- Apply chaos engineering principles to proactively identify system weaknesses and vulnerabilities.
- Work alongside development teams to improve application reliability and deployment processes.
- Mentor junior team members and contribute to the development of SRE practices across Flutter.
- Participate in architecture reviews and provide reliability expertise for new system designs.
- Document procedures, troubleshooting guides, and system architecture for knowledge sharing.
- Extensive experience with monitoring and observability tools including Prometheus, Grafana, ELK stack, or similar enterprise-scale solutions (required)
- Proven experience operating cloud platforms such as AWS, Azure, or Google Cloud Platform (required)
- Extensive experience applying and sustaining reliability engineering practices in production environments supporting 24/7/365 operations (required)
- Experience delivering and operating systems in security-compliant, highly regulated environments (required)
- Strong scripting, programming, and infrastructure-as-code abilities in Python, Go, Bash, TypeScript, or Terraform (required)
- Proven experience with CI/CD pipelines and tools including Jenkins, GitLab CI, Azure DevOps, GitHub Actions, or similar (required)
- Working knowledge of database technologies including SQL databases (PostgreSQL, MySQL) and NoSQL solutions (required)
- Ability to produce comprehensive, clear, and actionable technical documentation for operational procedures and runbooks (required)
- Experience working in an agile environment alongside cross-functional teams (required)
- Proficiency with containerisation technologies including Docker and Kubernetes (required)
- Previous software engineering experience (nice-to-have)
- AWS certifications (nice-to-have)
- Experience in highly regulated industries such as gaming, financial services, or healthcare (nice-to-have)
- Discretionary annual bonus
- 30 days of paid leave
- Health and dental insurance for you, your partner, and your children
- Personal life insurance and disability coverage
- Wellbeing fund
- Continuous learning support for certifications and career growth
- 550 EUR gift for a newborn family member
- 26 weeks of maternity leave at 100% pay
- 4 weeks of secondary (paternity) leave at 100% pay
- Sports card membership
- Monthly food vouchers
- Pension benefits
Flutter is the world’s leading online sports betting and gaming company, operating some of the most innovative, diverse and distinctive brands in the sector, such as FanDuel, Sky Betting and Sky Gaming, Paddy Power, PokerStars, Betfair, Sportsbet, Tombola, Adjarabet, Sisal, Snai, Betnacional, Junglee Games and MaxBet. We have an unparalleled portfolio of world-class brands, global scale and a challenger mindset, through which we excite and entertain our customers in a safe and sustainable way. Using our collective power, the Flutter Edge, we aim to disrupt our sector, learning from the past to create a better future for our customers, colleagues and communities.
