Research Intern — Coding LLMs
We are looking for research interns to work on foundational areas for coding language models, including pre-training data, mid-training data, synthetic data generation, evaluation, and agentic coding.
What you will do
- Explore data-centric methods for improving coding LLMs, including data filtering, quality assessment, deduplication, data mixture, and diversity analysis.
- Build synthetic data and evaluation pipelines for code generation, code editing, repo-level reasoning, tool use, and multi-step coding tasks.
- Run experiments to analyse how data, model, and training strategies affect coding capabilities.
- Work with large-scale code corpora, developer activity data, and agentic coding trajectories.
What we are looking for
- Strong programming skills in Python (required).
- Solid understanding of machine learning and large language models (required).
- Familiarity with LLM pre-training, mid-training, code models, data curation, evaluation, agents, or tool use (required).
- Strong experiment design, data analysis, and problem-solving skills (required).
- Interest in code intelligence, software engineering automation, and agentic coding (required).
- Experience with code data processing, GitHub-scale data, synthetic data, LLM evaluation, semantic deduplication, or agentic coding (preferred).
- Research experience, publications, or open-source projects in related areas (nice-to-have).
What we offer
- Access to large-scale real-world coding data and agentic trajectories.
- Rich compute resources and model APIs for fast research iteration.
- Opportunities to work on real-world coding model applications and the full model development loop.



