AI Scientist via Synthetic Task Scaling: Training ML Research Agents
To enable AI to conduct machine learning research autonomously, the core challenge is: where does training data come from? This paper proposes a fully automated pipeline for synthesizing ML challenge tasks compatible with the SWE-agent framework, spanning three phases: environment synthesis (topic sampling, dataset proposal, and code generation), self-debugging verification, and large-scale trajectory collection. Synthetic tasks carry dual quality guarantees: datasets are verified via the HuggingFace API to anchor tasks in real data, while code is validated through a self-debugging loop for executability. Validation is performed on the MLGym benchmark: GPT-5 serves as the teacher model generating solution trajectories, which are then distilled into Qwen3-4B and Qwen3-8B student models. Results show AUP gains of 9% and 12%, respectively, on real ML tasks, providing a scalable training path toward AI systems capable of autonomous research.
The Core Problem: Where Does Training Data for AI Researchers Come From?
The aspiration for AI to conduct autonomous scientific discovery has long been one of the field's most ambitious goals. Recent systems—AI Scientist, Co-Scientist, AlphaEvolve—have demonstrated that AI can already perform basic research tasks: formulating hypotheses, running experiments, and integrating findings. Yet a fundamental gap remains between these impressive demonstrations and truly capable research agents: **principled training methodology**.
Most existing research agent systems are scaffolded architectures built on top of powerful foundation models. They rely on pre-existing model knowledge rather than a systematic approach to teaching agents how to actually *do* machine learning—iteratively debugging failed experiments, improving upon baselines, reasoning through research decisions in structured ways.
This paper from Princeton University and Microsoft Research, authored by Ziyang Cai and Harkirat Behl (arXiv: 2603.17216), addresses precisely this gap. Their key insight: **the bottleneck isn't model capability—it's training data.** And their solution is to synthesize that training data at scale through an automated, no-human-in-the-loop pipeline that generates executable ML research environments from scratch.
The SWE-Agent Framework: Where Agents Live
Before diving into the synthesis pipeline, it's worth understanding the execution framework these agents operate within. The authors build on **SWE-agent**, a task-agnostic framework where agents interact with a code execution environment through a structured action space: reading files, editing code, running bash commands, and submitting implementations.
MLGym—the evaluation benchmark used in this paper—uses the same SWE-agent framework. Each ML task in MLGym consists of:
- A task description explaining the goal
- A dataset description (where applicable)
- Starter code implementing a baseline solution
- An evaluation script returning a scalar score
The agent's objective is to improve upon the baseline starter code, with up to 50 rounds of interaction. Each round, the agent produces a reasoning trace and an action. Multiple submissions are allowed, reflecting the iterative nature of real ML research. The synthetic pipeline is specifically designed to generate tasks compatible with this exact execution environment.
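The round structure above can be sketched as a simple loop. This is an illustrative stand-in, not the actual SWE-agent or MLGym API: `agent_step` and `run_action` are hypothetical placeholders for the policy call and the execution environment.

```python
# Hypothetical sketch of the SWE-agent-style interaction loop used in MLGym.
# `agent_step` and `run_action` are illustrative stand-ins, not real API calls.

MAX_ROUNDS = 50  # MLGym allows up to 50 rounds per episode

def agent_step(history):
    """Stand-in for the policy model: returns (reasoning, action)."""
    # In the real system this queries the LLM with the full interaction history.
    return "try a stronger baseline", {"type": "submit"}

def run_action(action):
    """Stand-in for the execution environment (file I/O, bash, scoring)."""
    if action["type"] == "submit":
        return {"score": 0.82, "submitted": True}
    return {"observation": "ok", "submitted": False}

def run_episode():
    history, best_score = [], None
    for _ in range(MAX_ROUNDS):
        reasoning, action = agent_step(history)
        result = run_action(action)
        history.append((reasoning, action, result))
        if result.get("submitted"):
            score = result["score"]
            best_score = score if best_score is None else max(best_score, score)
            # Multiple submissions are allowed; the agent keeps iterating.
    return best_score, history

best, hist = run_episode()
```

The key design point is that submission is just another action: the agent can submit early, observe its score, and keep refining, which mirrors how human ML researchers iterate.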
The Three-Phase Synthetic Pipeline
The centerpiece of this paper is a **fully automated pipeline** for generating ML research tasks at scale. The pipeline has three distinct phases:
Phase 1: Environment Synthesis
Step 1: Topic Sampling
The pipeline begins by prompting GPT-5 to generate *n* distinct machine learning topics. This sampling step maximizes diversity—covering computer vision, NLP, reinforcement learning, time-series forecasting, game theory, and more. From 1,000 sampled topics, the pipeline ultimately generates and validates approximately 500 viable tasks.
Step 2: Task and Dataset Proposal
For each sampled topic, the teacher model performs a dual role: it writes a task description (specifying what the agent must optimize) and proposes a specific HuggingFace dataset to ground the task in.
Here lies one of the paper's key engineering decisions: **HuggingFace API validation**. Rather than accepting the model's dataset proposal at face value, the pipeline uses the HuggingFace Search API to verify that the proposed dataset actually exists. If a matching dataset is found, the pipeline fetches sample rows and incorporates them into the task description—grounding the synthetic task in real data distributions. If no match is found, the task is discarded entirely.
This validation step serves two critical purposes. First, it prevents hallucinated tasks that reference nonexistent datasets—which would produce misleading training signal. Second, by grounding tasks in real HuggingFace datasets, it ensures agents learn to work with actual ML data formats, real feature distributions, and genuine problem structures rather than fabricated toy examples.
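The validation logic can be sketched as follows. The paper does not specify the exact API calls, so the search and sample-fetching functions are abstracted behind injectable callables (in practice something like `huggingface_hub.HfApi().list_datasets(search=...)` could fill that role); all names here are assumptions.

```python
# Illustrative sketch of the dataset-validation step, with the HuggingFace
# search abstracted behind a callable so it can be stubbed for testing.

def validate_dataset(proposed_name, search_fn, fetch_samples_fn, n_samples=3):
    """Return (task_is_valid, sample_rows) for a proposed dataset name."""
    matches = search_fn(proposed_name)
    if not matches:
        return False, []              # hallucinated dataset: discard the task
    dataset_id = matches[0]
    samples = fetch_samples_fn(dataset_id, n_samples)
    return True, samples              # ground the task description in real rows

# Stubbed example: "imdb" exists in the catalog, "made-up-dataset" does not.
catalog = {"imdb": [{"text": "great movie", "label": 1}]}
search = lambda name: [name] if name in catalog else []
fetch = lambda ds, n: catalog[ds][:n]

ok, rows = validate_dataset("imdb", search, fetch)
bad, _ = validate_dataset("made-up-dataset", search, fetch)
```

Discarding on a failed lookup (rather than retrying with a different dataset) keeps the pipeline simple at the cost of some yield, which the large topic pool compensates for.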
Tasks that don't require a dataset—such as game-theoretic agent tasks—are also permitted, adding another dimension of diversity to the task corpus.
Step 3: Config and Starter Code Generation
With a validated task description and dataset in hand, the pipeline generates the full task infrastructure:
- Task configuration files compatible with the MLGym execution environment
- Dataset configuration files
- Baseline implementation code (`baseline.py`) that agents will attempt to improve
- Evaluation code (`evaluate.py`) that scores agent submissions
- Any necessary helper utilities
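To make the evaluation contract concrete, here is a hypothetical shape for a generated `evaluate.py`: it scores a submission and emits a single scalar, as MLGym-style tasks require. The file names, JSON format, and accuracy metric are illustrative assumptions, not the paper's actual generated code.

```python
# Hypothetical shape of a generated evaluate.py: score a submission and
# print one scalar, matching the MLGym-style evaluation contract.
import json

def evaluate(predictions, labels):
    """Accuracy as the scalar task score."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

if __name__ == "__main__":
    # In a real task these would be loaded from the agent's submission
    # (e.g. a predictions.json file) and the held-out labels.
    preds = [1, 0, 1, 1]
    labels = [1, 0, 0, 1]
    print(json.dumps({"score": evaluate(preds, labels)}))
```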
Code generation is the most error-prone step in the pipeline—LLM-generated code frequently contains bugs, import errors, shape mismatches, or incorrect API calls. This is why Phase 2 is essential.
Phase 2: Environment Verification with Self-Debugging
Raw code generation is not enough—the pipeline must verify that generated tasks are actually executable and produce meaningful evaluation outputs. This is handled by **plugging each new task into the MLGym execution environment and running a GPT-5 agent on it**.
The verification run serves multiple purposes:
1. It establishes baseline performance for the task
2. It validates that the task setup is correct and runnable
3. It captures at least one valid agent trajectory
When errors occur during verification—and they frequently do—the pipeline employs a **self-debugging loop** rather than immediately discarding the broken task:
- With probability *p_debug*: error logs are fed back to the teacher model, which regenerates the starter code with the error context as additional input
- With probability *1 - p_debug*: code generation restarts from scratch
- The loop continues for at most *k* iterations
- Only after exhausting all debug attempts is a task discarded
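The loop above can be sketched in a few lines. The control flow (debug with probability *p_debug*, restart otherwise, give up after *k* attempts) follows the paper's description, but `generate_code` and `run_task` are illustrative stand-ins for the teacher-model call and the MLGym verification run.

```python
import random

# Sketch of the self-debugging loop; generate_code and run_task are
# illustrative stand-ins for the teacher model and the MLGym verification run.

def synthesize_with_debugging(task_spec, generate_code, run_task,
                              p_debug=0.7, k=5, rng=random):
    """Return working starter code for task_spec, or None after k failures."""
    error_context = None
    for _ in range(k):
        code = generate_code(task_spec, error_context)
        ok, error_log = run_task(code)           # verification run
        if ok:
            return code
        if rng.random() < p_debug:
            error_context = error_log            # regenerate with error feedback
        else:
            error_context = None                 # restart from scratch
    return None                                  # exhausted: discard the task
```

The occasional from-scratch restart matters: feeding error logs back usually helps, but it can also trap the model in a bad design, so mixing in fresh generations avoids committing every attempt to one flawed draft.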
This iterative approach is qualitatively different from simple rejection sampling. By allowing the model to see and respond to its own errors, the pipeline extracts significantly more value from each generation attempt—increasing effective task yield without requiring human intervention. The entire verification process runs **without any human supervision** and is designed for massive parallelization.
Phase 3: Trajectory Generation and Filtering
Large-Scale Parallel Sampling
Verified tasks are deployed to an HPC cluster for large-scale trajectory collection. Each task runs on a dedicated GPU, with a target of **256 trajectories per task**. Even with verified tasks, trajectory collection isn't perfectly reliable—cluster-level instabilities from file system and containerization issues cause additional failures. The paper's Figure 2 shows the heterogeneous distribution of valid trajectory counts across different tasks, reflecting the unsupervised and somewhat stochastic nature of the collection process.
Two-Stage Trajectory Filtering
Not all collected trajectories are useful for training. The pipeline applies two filtering criteria:
1. **Success filter**: Only trajectories where the agent completes at least one successful submission are retained. This filters out pathological cases where agents are stuck in infinite debugging loops, never producing valid output.
2. **Length filter**: Trajectories exceeding 48K tokens are discarded. During SFT training, retained trajectories are further truncated to 32K tokens to fit training context windows.
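Both filters reduce to a short pass over the collected data. The trajectory schema here (dicts with `submitted_successfully` and `num_tokens` fields) is an assumption for illustration; the thresholds are the paper's.

```python
# Sketch of the two-stage trajectory filter described above.
# The trajectory schema (dict fields) is an illustrative assumption.

MAX_KEEP_TOKENS = 48_000        # length filter: discard above this
SFT_TRUNCATE_TOKENS = 32_000    # further truncation applied at training time

def filter_trajectories(trajectories):
    kept = []
    for traj in trajectories:
        if not traj["submitted_successfully"]:    # success filter
            continue
        if traj["num_tokens"] > MAX_KEEP_TOKENS:  # length filter
            continue
        kept.append(traj)
    return kept

trajs = [
    {"submitted_successfully": True,  "num_tokens": 20_000},  # kept
    {"submitted_successfully": False, "num_tokens": 10_000},  # no valid submit
    {"submitted_successfully": True,  "num_tokens": 60_000},  # too long
]
```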
After aggregation and filtering, the pipeline yields approximately **34,000 high-quality agent trajectories** forming the SFT training dataset.
Knowledge Distillation: GPT-5 Teaches Qwen3
The training methodology is **supervised fine-tuning (SFT) on teacher-generated trajectories**—a form of knowledge distillation from a powerful frontier model to smaller open-weight models:
| Role | Model |
|------|-------|
| Teacher | GPT-5 (OpenAI frontier) |
| Students | Qwen3-4B & Qwen3-8B (Alibaba open-weight) |
The scale of this dataset is worth appreciating: 34,000 agentic trajectories covering 500 diverse ML tasks, each consisting of multi-turn interactions with code execution, structured reasoning, and iterative submission loops. This is the kind of rich, structured experience data that would be virtually impossible to collect from human researchers at comparable scale and cost.
MLGym Benchmark Results
The benchmark: MLGym contains 13 ML challenges spanning computer vision, NLP, reinforcement learning, and simple game agents. Each task proceeds in rounds (up to 50), with the agent producing reasoning and actions (file browsing, code editing, bash execution, submission) each round. Multiple submissions are allowed.
Key metric: AUP (Area Under Performance Curve)—introduced by MLGym's authors to aggregate performance across tasks with different score scales and optimization directions into a single comparable number.
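The exact AUP formula is defined in the MLGym paper; the sketch below only illustrates the underlying idea under simplifying assumptions (scores normalized to [0, 1], higher is better): track the best score achieved so far at each round and take the normalized area under that curve, so agents that improve earlier score higher.

```python
# Simplified illustration of an area-under-performance-curve metric.
# This is NOT the exact MLGym AUP definition, just the core idea.

def aup(scores_per_round):
    """Normalized area under the best-so-far curve, assuming scores in [0, 1]
    with higher-is-better."""
    best, area = 0.0, 0.0
    for s in scores_per_round:
        best = max(best, s)   # performance curve is the running best score
        area += best
    return area / len(scores_per_round)

# An agent that reaches 0.8 early beats one that reaches it late:
early = aup([0.5, 0.8, 0.8, 0.8])   # 0.725
late = aup([0.5, 0.5, 0.5, 0.8])    # 0.575
```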
Baselines compared: GPT-4o, GPT-5, Qwen3-4B (base), Qwen3-8B (base)
Results (aggregated across 64 runs per task):
| Comparison | Result |
|------------|--------|
| SFT-Qwen3-4B vs. base Qwen3-4B | **+9% AUP** |
| SFT-Qwen3-8B vs. base Qwen3-8B | **+12% AUP** |
| Tasks improved (out of 13) | **9/13** |
These are meaningful, non-trivial improvements. A 9-12% AUP gain from fine-tuning a 4-8B parameter model on synthetic data—with no task-specific training—demonstrates that the synthetic pipeline captures transferable research skills, not just benchmark-specific tricks.
The one notable exception: the MS-COCO task shows no improvement. The authors attribute this to their pipeline's insufficient coverage of complex starter code structures—a recognized limitation they propose addressing by conditioning task synthesis on high-quality existing codebases.
Limitations and Honest Caveats
The authors are admirably transparent about what their work has and hasn't established:
Evaluation breadth: The entire evaluation is on a single benchmark (MLGym). Whether gains transfer to MLE-Bench, MLRC-Bench, or NanoGPT Speedrunning remains empirically open. Some portion of the gains may reflect format familiarity—adapting to SWE-agent's specific interaction conventions—rather than genuine ML research capability improvement.
Missing ablations: The paper does not ablate individual pipeline components. The relative contributions of HuggingFace validation, the self-debugging loop, success-only filtering, trajectory truncation, and teacher model quality are each untested in isolation. Understanding which components matter most is critical for future work.
Teacher model bias: The pipeline inherits GPT-5's limitations. Tasks GPT-5 cannot solve never enter the training set, creating a systematic blind spot for particularly challenging problems—precisely the cases where training signal would be most valuable for capability development.
SFT limitations: Supervised fine-tuning on successful trajectories doesn't explicitly optimize for exploration or novelty. The student model learns to mimic successful patterns but doesn't receive direct incentive to discover genuinely new approaches or push beyond the teacher's capability envelope.
Future Directions
The authors identify several promising extensions:
1. **Reinforcement learning**: Synthetic tasks provide natural reward signals (task evaluation scores), making them suitable for RL training. The challenges are the long GPU jobs each rollout requires and the heterogeneous reward scales across tasks, which make reward shaping non-trivial.
2. **Richer task distributions**: Conditioning synthesis on high-quality codebases (like NanoGPT implementations) to cover more complex starter code patterns that the current pipeline misses.
3. **Literature search integration**: Enabling agents to search ML literature during trajectory sampling, encouraging discovery-oriented behavior rather than pure optimization—a step toward genuine scientific novelty.
4. **Cross-benchmark extension**: The pipeline is framework-agnostic and should extend naturally to MLE-Bench's Kaggle-style challenges and other evaluation suites, enabling broader capability assessment.
Broader Context: A New Training Paradigm
This work connects to several important parallel research threads:
SWE-Smith parallel: SWE-Smith (Yang et al., 2025) applies an analogous approach to software engineering—synthesizing test-breaking code instances across Python repositories to train SWE-bench agents. Both papers make the same core argument: the key to capable agents is large-scale synthetic experience in *executable environments*, not passive knowledge from static text corpora.
Addressing the ideation-execution gap: Recent work (Si et al., 2025) has documented a pronounced gap between LLMs' ability to generate research ideas and their ability to execute those ideas successfully. This paper directly targets the execution side, building agents that can reliably implement, debug, and improve ML pipelines through structured experience.
Scaling implications: The pipeline generates 500 tasks and 34,000 trajectories from 1,000 topic samples with no human supervision. The approach is trivially scalable—more topics, more tasks per topic, longer debug loops, more trajectories per task—as compute becomes available. The ceiling is not yet in sight.
Conclusion
This paper presents a pragmatic, scalable path toward AI systems capable of autonomous ML research. By combining three key innovations—HuggingFace-grounded task synthesis, iterative self-debugging verification, and large-scale teacher trajectory distillation—the authors demonstrate measurable improvements in agentic ML research performance on a standardized benchmark.
The honest accounting of limitations is refreshing: the paper doesn't claim to have solved AI research, only to have established a viable training paradigm. The core contribution stands regardless: you can train meaningfully better ML research agents by providing them with tens of thousands of trajectories of structured experience in synthetic-but-grounded research environments, generated entirely automatically. This is a practical path forward for building AI scientists, and one that doesn't require waiting for the next generation of base models to arrive.
---
Authors: Ziyang Cai (Princeton University), Harkirat Behl (Microsoft Research)
arXiv: 2603.17216 | Published: March 19, 2026