Posted 2026-06-02

Senior Site Reliability Engineer

Description

As a Senior Site Reliability Engineer, you'll build and scale the critical infrastructure behind every product. In this role, you'll take on complex challenges across global data centers, multiple cloud platforms, and on-premise systems, designing automation-first solutions that elevate performance and eliminate operational friction. You'll be trusted to drive stability at scale, influence architectural decisions, and build tools that empower our teams to move fast and deliver reliably. This is where your impact won't just be felt, it'll be foundational.

Responsibilities

Drive stability and scalability across our global compute platform spanning numerous data centers, multiple public clouds, and on-premise environments, serving as the foundation for every product.
Operate and evolve our GitOps delivery model, using Rancher Fleet and Flux with Helm to deploy core cluster services and application workloads declaratively and repeatably.
Build self-healing, fault-tolerant infrastructure and internal tooling that eliminates repetitive operational work and reduces toil for both platform and application teams.
Own cluster autoscaling and capacity strategy, including Karpenter, HPA and KEDA, and predictive scaling driven by event and calendar data.
Define SLOs and reliability metrics for platform components, using Datadog and our logging pipeline to surface cluster and workload health.
Support technical growth by sharing knowledge, participating in design discussions, and contributing to a collaborative team culture, including on-call rotation.

Requirements

Bachelor's degree in Computer Science or relevant education, experience, and training.
At least 4 years managing distributed cloud and on-premise environments at scale, with strong hands-on AWS experience.
Exposure to GCP, vSphere, or Nutanix is a plus.
Deep expertise in container orchestration with Kubernetes, including the ability to design, scale, and troubleshoot complex workloads.
Strong experience developing automation and infrastructure tooling in languages such as Go and Python.
Working knowledge of networking and Linux-based systems, including container runtimes such as Docker and containerd, packet-level debugging, and kernel troubleshooting.
Experience with Infrastructure as Code (IaC) and configuration management tools to ensure scalable and repeatable infrastructure provisioning.

Benefits

Comprehensive health benefits, including various medical plans, dental, and vision.
Wellbeing Program with free access to resources such as therapy sessions with Lyra Mental Health Solution, the Lyra Employee Assistance Program, the Calm App, Virtual Yoga Classes, and many more.
14 weeks of 100% paid parental leave for all global team members, plus workplace lactation support.
Partnership with Care.com for care finder and backup childcare for U.S. employees.
Range of benefits that support family planning, including adoption, surrogacy, fertility treatments, and more.
Flexible PTO if you work in the United States.
Pet insurance benefit to help save on vet expenses for accidents, illnesses, and more.
Gym Reimbursement for part of your gym membership once you enroll in a health insurance plan.
Financial Planning support, from 401(k) matching to programs such as Origin Financial.
Commuter benefits to help you save money on your daily commute.
Tuition Reimbursement Program for educational courses related to your position or another position at DraftKings.
