Agent Evaluation Intern
We are hiring an intern to work on evaluation and reliability infrastructure for a real-world LLM agent system in UA performance marketing. The agent performs multi-step reasoning, retrieves context, selects tools, executes actions, handles user confirmations, and interacts with external services.
The goal of this internship is to build transferable expertise in agent evaluation engineering: evaluating tool use, measuring trajectory quality, designing benchmarks, analyzing traces, comparing model and prompt variants, and improving the reliability of agentic AI systems.
This role is ideal for someone interested in future opportunities in LLM agent evaluation, AI safety evaluation, research engineering, LLMOps, or applied AI infrastructure.
Responsibilities
- Survey state-of-the-art evaluation frameworks for agentic workflows in both industry and academic research.
- Apply that research to build automated evaluation pipelines that run agent scenarios, capture execution artifacts, score results, and detect regressions.
- Evaluate tool-use behavior, including whether the agent selects the right tool, passes correct arguments, avoids unnecessary calls, and handles tool errors appropriately.
- Analyze agent trajectories using traces, logs, intermediate steps, and final outputs to identify reasoning failures, context misuse, hallucinated assumptions, and brittle workflow patterns.
- Design metrics for agent reliability, including success rate, tool-call precision, argument accuracy, recovery rate, retry count, latency, cost, and safety-related failure rates.
- Create reusable evaluation datasets from synthetic cases, golden workflows, and real anonymized executions.
- Support experiments comparing prompts, model providers, tool descriptions, memory strategies, context construction methods, and execution modes.
- Help build human evaluation workflows and rubrics for judging agent correctness, faithfulness, usefulness, and risk awareness.
- Work with engineers to translate evaluation findings into better tests, monitoring signals, tool interfaces, prompts, and guardrails.
- Potentially write research papers for publication at scientific conferences.
Qualifications
- Currently pursuing, or recently graduated with, a Master’s or PhD degree in Computer Science, Artificial Intelligence, Machine Learning, Software Engineering, Data Science, or a related field.
- Strong Python fundamentals and interest in AI systems.
- Curious about how LLM agents work, fail, and improve.
- Interested in evaluation methodology, not just application building.
- Comfortable reading logs, traces, test cases, and structured data.
- Detail-oriented and able to define clear, measurable criteria for ambiguous agent behavior.
- Prior experience with LLMs, LangChain-style agent frameworks, tool calling, pytest, data analysis, or observability tools is helpful but not required.



