Posted 2026-06-19

Senior Machine Learning Operations Engineer

Description

The Senior MLOps Engineer treats ML systems as software systems and owns the path from a trained model to a production endpoint that meets its latency, cost, and reliability budgets. That path covers both batch scoring (SageMaker Batch Transform, Snowflake Cortex / Snowpark ML, dbt-orchestrated scoring) and real-time inference (SageMaker real-time endpoints, Lambda + Bedrock, sub-second feature serving). The engineer builds the platform that data scientists and ML engineers ship on: a feature store with guaranteed online/offline parity, a model registry, CI/CD for ML, drift and quality monitoring, and champion/challenger and shadow deployment scaffolding. The role requires a software-engineering-first mindset: distributed systems, observability, and on-call instincts are the foundation, and ML literacy is what makes those skills effective here. GenAI integration experience is a plus, not a requirement.

Responsibilities
  • Stand up and operate BetMGM's ML platform on AWS (SageMaker Training, Model Registry, Pipelines, Endpoints, Batch Transform) and Snowflake (Snowpark ML, Cortex), with Terraform-managed infrastructure.
  • Build self-service scaffolds that let data scientists ship a model end-to-end without a ticket queue — cookie-cutter project templates with CI, drift monitoring, alerting, IaC, and Snowflake connectivity pre-baked.
  • Design and operate batch scoring pipelines — SageMaker Batch Transform, dbt-orchestrated scoring against Snowflake, Snowpark ML — with explicit freshness and cost SLAs.
  • Design and operate real-time inference paths — SageMaker real-time endpoints, Lambda + Bedrock for GenAI, API Gateway — with stated latency budgets (typically sub-100ms) and graceful degradation under load.
  • Own the feature store (SageMaker Feature Store, Tecton, or Feast) with guaranteed online/offline parity — training-serving skew is treated as an incident, not a tradeoff.
  • Build CI/CD for ML — model registry, automated retraining triggers, model versioning, lineage from feature → training run → deployed model → live prediction.
  • Implement champion/challenger, shadow deployments, and canary releases as platform primitives so individual model teams do not reinvent them per project.
  • Stand up drift detection, data quality, and model performance monitoring (Evidently, Arize, or SageMaker Model Monitor — pick one and standardize) with paging that routes to humans who can fix it.
  • Own MLOps incident response — production model failures are SEV events with postmortems.
  • Right-size endpoints and tune batch caching, request batching, and autoscaling. State cost-per-prediction targets up front and meet them.
  • Integrate LLM APIs (Bedrock, Anthropic, OpenAI) into production paths — RAG pipelines, agent eval frameworks, prompt versioning, cost and latency observability.
  • Partner with the Helix team on AI personalization workloads as they ramp toward March Madness 2027.
  • Direct AI coding agents (Claude Code, Cursor, GitHub Copilot, dbt Copilot) as a force multiplier across infrastructure code, eval suites, and model-serving glue — designing work for agents to do, not just accepting their suggestions.
  • Partner with the data engineering team on shared standards (Terraform modules, CI/CD patterns, observability, lineage).
  • Work alongside data scientists and analytics partners to land the right interfaces between research and production, with a clear opinion about where that boundary sits.
  • Coordinate with Entain India and contractor ML partners as workloads consolidate onto the BetMGM-owned platform.

Requirements
  • BS or MS in Computer Science, Math, Statistics, Machine Learning, or other STEM field — or equivalent practical experience. Practical experience wins ties; a PhD is neither required nor a tiebreaker. (required)
  • 5+ years shipping software in production — Python, Docker, Kubernetes or ECS, CI/CD, distributed systems debugging — including time on-call. (required)
  • 3+ years operating ML in production — you have owned a model in prod that served real traffic, with stated latency and cost budgets and a runbook you wrote. (required)
  • AWS depth across the SageMaker surface (Training, Endpoints, Batch Transform, Model Registry, Pipelines) plus the supporting cast (IAM, Lambda, ECS, S3, Secrets Manager, VPC). (required)
  • Snowflake fluency — Snowpark ML, Cortex, dbt-orchestrated batch scoring, RBAC for ML workloads. (required)
  • IaC for ML — Terraform + SageMaker Pipelines or equivalent. No manual console deployments to production. (required)
  • Feature store experience — SageMaker Feature Store, Tecton, or Feast — with explicit ownership of online/offline parity. (required)
  • Champion/challenger, shadow, and canary deployment patterns as production muscle, not blog-post familiarity. (required)
  • Drift and model monitoring — Evidently, Arize, WhyLabs, or SageMaker Model Monitor — wired to a paging path. (required)
  • Software-engineering-first mindset — you treat ML systems as systems, not notebooks. (required)
  • GenAI in production — Bedrock, Anthropic, or OpenAI APIs integrated into live systems; RAG pipelines; vector DBs (Snowflake Cortex Search, pgvector, Pinecone); evaluation frameworks (Langfuse or in-house). (nice-to-have)
  • Snowflake-native ML — Snowpark Container Services, Cortex AISQL, Cortex Agents — for workloads that do not need to leave the warehouse. (nice-to-have)
  • Streaming feature engineering — Kafka, Flink, or Snowpipe Streaming — for sub-second features. (nice-to-have)
  • Fine-tuning experience — LoRA, QLoRA, instruction tuning, eval-driven iteration — with an honest read on when fine-tuning beats prompting. (nice-to-have)
  • A track record of shipping more with AI in the engineering loop than without. (nice-to-have)
  • Regulated-industry experience (gaming, fintech, healthcare) — comfort with model risk, audit, and lineage requirements. (nice-to-have)

Benefits
  • Medical, Dental, Vision, Life, and Disability Insurance
  • 401(k) with company match
  • Pre-tax spending accounts including health care FSA and commuter savings
  • Flexible paid time off
  • Professional development reimbursement and ongoing skills training opportunities
  • Employee resource groups
  • Swag, ticket giveaways, and more!