Staff Observability Engineer
FanDuel is looking for a Staff Observability Engineer to design, build, and mature the observability ecosystem that underpins our platform and services. You will deliver deep visibility into system behavior by combining system telemetry with user signals to provide a holistic view of performance, reliability, and user experience. You’ll also explore how AI and machine learning can enhance observability, from intelligent alerting and anomaly detection to accelerating root cause analysis. This is a hands-on role. You’ll partner closely with engineering and product teams to deliver scalable observability capabilities, serve as a subject matter expert in monitoring, alerting, and incident management, and equip teams with self-service insights and tooling. By connecting system behavior to real user impact and leveraging AI-assisted workflows to surface issues faster, you’ll drive improvements in reliability, performance, and data-informed decision-making across the organization.
- Contribute to defining and driving the observability strategy and roadmap across multiple teams, aligning with business priorities and engineering goals.
- Design and improve scalable observability capabilities that provide actionable insights into system health, performance, and user experience.
- Establish and standardize best practices for monitoring, alerting, incident management, and postmortems across the organization.
- Drive operational excellence by evolving incident management, on-call practices, and post-incident learning, ensuring systemic improvements over local fixes.
- Lead cross-team initiatives to improve end-to-end reliability, identifying systemic risks and driving their resolution.
- Leverage automation and AI-assisted workflows to accelerate root cause analysis and reduce operational toil at scale.
- Partner with engineering and product leadership to translate observability insights into strategic roadmap decisions.
- Identify trends across system and user signals to proactively detect, prevent, and mitigate large-scale issues.
- Optimize observability platforms for cost, scalability, and long-term sustainability.
- Mentor engineers and raise reliability and observability maturity across the organization.
- In addition to the responsibilities outlined above, employees may be required to perform other duties as assigned by the Company to ensure operational flexibility and meet evolving business needs.
- Significant hands-on experience in observability engineering, SRE, platform engineering, or related roles, with a track record of driving impact beyond individual teams (required).
- Strong expertise in monitoring and observability, with significant hands-on experience in Datadog (required).
- Experience defining and driving observability or reliability strategy across teams or domains (required).
- Proficiency with Kubernetes, cloud infrastructure (AWS), and infrastructure-as-code tools (Terraform) (required).
- Proven ability to influence technical direction and decision-making across multiple teams and stakeholders (required).
- Deep understanding of distributed systems principles (e.g. consistency, availability, partition tolerance) and their real-world trade-offs (required).
- Experience defining and implementing SLOs, SLIs, and alerting strategies, including user-centric and business-aligned metrics (required).
- Strong software engineering fundamentals, with proficiency in at least one modern programming language (e.g. Go, Java, Python, or TypeScript) (required).
- Ability to design scalable systems, build tooling and automation, and operate effectively within large, complex codebases (required).
- Experience driving large-scale improvements through automation, reducing organizational toil, and eliminating classes of recurring issues (required).
- Strong analytical skills, with the ability to translate technical signals into business and customer impact (required).
- Excellent communication and stakeholder management skills, with the ability to influence both technical and non-technical audiences (required).
- A mindset of ownership, with a focus on long-term impact, scalability, and continuous improvement (required).
- Medical, vision, and dental insurance.
- Life insurance.
- Disability insurance.
- A 401(k) matching program.
- Paid personal time off.
- 14 paid company holidays.
- Paid sick time in accordance with all applicable state and federal laws.
- Short-term or long-term incentive compensation, including cash bonuses and stock program participation.
- Fertility and family planning programs.
- Mental health support.
- Fitness benefits.
- Commuter benefits.
- Pet insurance.
FanDuel Group is an innovative sports-tech entertainment company that is changing the way consumers engage with their favorite sports, teams, and leagues. The premier gaming destination in North America, FanDuel Group consists of a portfolio of leading brands across gaming, sports betting, daily fantasy sports, advance-deposit wagering, and TV/media, including FanDuel, Stardust Casino, and TVG. The company is based in New York with US offices in Los Angeles, Atlanta, and Jersey City, as well as global offices in Canada and Scotland. The company’s affiliates have offices worldwide, including in Ireland, Portugal, Romania, and Australia. FanDuel Group is a subsidiary of Flutter Entertainment, the world’s largest sports betting and gaming operator, which has a portfolio of globally recognized brands and is traded on the New York Stock Exchange (NYSE: FLUT).
