Agent Evaluation Intern

Description

We are hiring an intern to work on evaluation and reliability infrastructure for a real-world LLM agent system in the UA performance marketing field. The agent performs multi-step reasoning, retrieves context, selects tools, executes actions, handles user confirmations, and interacts with external services.

The goal of this internship is to build transferable expertise in agent evaluation engineering: evaluating tool use, measuring trajectory quality, designing benchmarks, analyzing traces, comparing model and prompt variants, and improving the reliability of agentic AI systems.

This role is ideal for someone interested in future opportunities in LLM agent evaluation, AI safety evaluation, research engineering, LLMOps, or applied AI infrastructure.

Responsibilities

Research state-of-the-art evaluation frameworks for agentic workflows in both industry and academic research.
Apply that research to build automated evaluation pipelines that run agent scenarios, capture execution artifacts, score results, and detect regressions.
Evaluate tool-use behavior, including whether the agent selects the right tool, passes correct arguments, avoids unnecessary calls, and handles tool errors appropriately.
Analyze agent trajectories using traces, logs, intermediate steps, and final outputs to identify reasoning failures, context misuse, hallucinated assumptions, and brittle workflow patterns.
Design metrics for agent reliability, including success rate, tool-call precision, argument accuracy, recovery rate, retry count, latency, cost, and safety-related failure rates.
Create reusable evaluation datasets from synthetic cases, golden workflows, and real anonymized executions.
Support experiments comparing prompts, model providers, tool descriptions, memory strategies, context construction methods, and execution modes.
Help build human evaluation workflows and rubrics for judging agent correctness, faithfulness, usefulness, and risk awareness.
Work with engineers to translate evaluation findings into better tests, monitoring signals, tool interfaces, prompts, and guardrails.
Potentially write research papers and submit them to scientific conferences.

Requirements

Currently pursuing, or recently graduated from, a Master’s or PhD program in Computer Science, Artificial Intelligence, Machine Learning, Software Engineering, Data Science, or a related field.
Strong Python fundamentals and interest in AI systems.
Curious about how LLM agents work, fail, and improve.
Interested in evaluation methodology, not just application building.
Comfortable reading logs, traces, test cases, and structured data.
Detail-oriented and able to define clear, measurable criteria for ambiguous agent behavior.
Prior experience with LLMs, LangChain-style agent frameworks, tool calling, pytest, data analysis, or observability tools is helpful but not required.
