EnvFactory: Scaling Tool-Use Agents via Executable Environment Synthesis and Robust Reinforcement Learning

This paper introduces EnvFactory, a fully automated framework addressing two critical bottlenecks in agentic reinforcement learning (Agentic RL) for developing tool-use capabilities in large language models: the lack of scalable, robust execution environments and the absence of authentic training data that captures implicit human reasoning. Existing approaches rely on costly real-world APIs, hallucination-prone LLM simulators, or single-turn synthetic environments, with synthetic trajectories often over-specified—resembling instruction sequences rather than natural human intent. EnvFactory autonomously explores and validates real-world resources to discover state-executable tool environments, then synthesizes natural multi-turn trajectories via topology-aware sampling and calibration-refined generation, producing grounded queries with implicit intent. Using only 85 validated environments spanning 7 domains, EnvFactory generated 2,575 SFT and RL trajectories. Despite using just one-fifth the number of environments compared to prior work, the method demonstrates excellent training efficiency and downstream performance, boosting Qwen3-series models by up to 15% on BFCLv3, 8.6% on MCP-Atlas, and 6% on conversational benchmarks such as τ²-Bench and VitaBench. EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.

Background and Context

The integration of tool-use capabilities into large language models (LLMs) has emerged as a central objective in contemporary artificial intelligence research, with agentic reinforcement learning (Agentic RL) identified as the pivotal mechanism for achieving robust autonomous operation. Despite the theoretical promise of this approach, the field has been significantly constrained by two persistent structural bottlenecks: the scarcity of scalable, robust execution environments and the absence of authentic training data that accurately captures implicit human reasoning. Current methodologies for addressing these gaps predominantly rely on expensive, real-world application programming interfaces (APIs), which are often unstable or restricted, or they utilize LLM-based simulators that are prone to hallucinations and fail to reflect ground-truth system states. Furthermore, existing synthetic environments are typically limited to single-turn interactions and are constructed from pre-collected documentation, resulting in training data that is overly prescriptive. These synthetic trajectories often resemble rigid instruction sequences rather than natural, multi-turn human interactions, thereby introducing a distributional bias that severely undermines the effectiveness of reinforcement learning algorithms.

To address these limitations, the paper introduces EnvFactory, a fully automated framework that tackles environment construction and data synthesis together. EnvFactory autonomously explores and validates real-world resources to discover state-executable tool environments, which removes the need for hand-written environments or costly API subscriptions. Discovered environments must be executable and must also keep their state consistent across calls, a property that matters for stable training. Because training no longer depends on external APIs or hallucination-prone simulators, the resulting environments give agents a more reliable foundation, and the training failures that environmental instability causes in earlier methods are avoided.

In the domain of data synthesis, EnvFactory employs innovative sampling and refinement strategies to generate natural, multi-turn interaction trajectories. The framework utilizes topology-aware sampling to capture the complex dependencies and interaction logic between different tools, ensuring that the generated trajectories align with natural human usage patterns. This is complemented by a calibration-refined generation process that adjusts the semantic expression of the trajectories, transforming mechanical instruction sequences into natural dialogues imbued with implicit human intent. The resulting data includes grounded queries that reflect the nuanced, often unspoken, reasoning processes of human users. This combination of strategies not only enhances the diversity of the training data but also improves its adaptability to reinforcement learning algorithms, enabling agents to learn decision-making strategies from more complex and realistic interaction modes.
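The difference between an over-specified query and one with implicit intent can be made concrete with a small sketch. The record layout, the tool name `search_flights`, and the `is_over_specified` heuristic below are all hypothetical illustrations, not the paper's data format or filtering method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    name: str          # hypothetical tool name
    arguments: dict    # arguments the agent is expected to produce


@dataclass
class Turn:
    user_query: str
    tool_calls: List[ToolCall]


def is_over_specified(turn: Turn) -> bool:
    """Crude heuristic (an assumption): a query reads like an instruction
    if it spells out the name of a tool the agent is supposed to choose."""
    text = turn.user_query.lower()
    return any(call.name.lower() in text for call in turn.tool_calls)


# Both turns expect the same tool call; only the user's phrasing differs.
calls = [ToolCall("search_flights", {"origin": "SFO", "destination": "JFK"})]

scripted = Turn(
    "Call search_flights with origin SFO and destination JFK.",
    calls,
)
natural = Turn(
    "I need to be in New York for a Friday morning meeting; "
    "I'm flying out of San Francisco.",
    calls,
)
```

In the scripted turn the user has already done the agent's work by naming the tool and its arguments. In the natural turn the agent must infer the tool, the airports, and the date constraint from context, which is the kind of reasoning the synthesized trajectories are meant to exercise.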

Deep Analysis

EnvFactory's pipeline is automated end to end, starting with environment validation. The framework scans real-world resources for candidate tool interfaces and validates each candidate for executability and state consistency. This step ensures that training environments are stable and reliable, which directly addresses the environmental fragility of earlier Agentic RL approaches. Verifying that tools are state-executable gives agents a sandbox in which to learn with far less exposure to the undefined behaviors and system errors common in live API interactions. Automated validation also reduces the human effort needed to prepare training environments, so the pool of environments can grow quickly.
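One way to picture the two validation criteria is a replay check: run the same probe calls twice from a fresh copy of the initial state and require that nothing raises (executability) and that both runs agree on results and final state (state consistency). This is a minimal sketch under that assumption; `CandidateEnv`, `validate`, and the example tools are hypothetical names, and the paper's actual validation procedure may differ.

```python
import copy
import itertools
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical convention: a tool takes (state, arguments), may mutate state,
# and returns a result.
Tool = Callable[[dict, dict], object]


@dataclass
class CandidateEnv:
    initial_state: dict
    tools: Dict[str, Tool] = field(default_factory=dict)


def run_calls(env: CandidateEnv, calls: List[Tuple[str, dict]]):
    """Execute calls in order on a private copy of the initial state."""
    state = copy.deepcopy(env.initial_state)
    results = [env.tools[name](state, args) for name, args in calls]
    return results, state


def validate(env: CandidateEnv, probe_calls: List[Tuple[str, dict]]) -> bool:
    try:
        first = run_calls(env, probe_calls)
        second = run_calls(env, probe_calls)
    except Exception:
        return False           # a tool raised: not executable
    return first == second     # same calls must give same results and state


def add_item(state, args):
    state["cart"].append(args["item"])
    return len(state["cart"])


def broken_lookup(state, args):
    return state["missing_key"]      # raises KeyError


_hidden = itertools.count()


def leaky_counter(state, args):
    return next(_hidden)             # depends on state outside the environment


good = CandidateEnv({"cart": []}, {"add_item": add_item})
broken = CandidateEnv({"cart": []}, {"lookup": broken_lookup})
leaky = CandidateEnv({}, {"tick": leaky_counter})
```

Here `good` passes, `broken` fails the executability check, and `leaky` fails the consistency check because its result depends on hidden state that a reset does not restore.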

Once the environments are established, EnvFactory proceeds to synthesize training data using its topology-aware sampling and calibration refinement modules. Topology-aware sampling analyzes the structural relationships between tools, identifying which tools are frequently used in conjunction and in what order. This analysis allows the framework to generate trajectories that are structurally coherent and reflect the logical flow of human task execution. The calibration refinement module then steps in to enhance the naturalness of these trajectories. It adjusts the language and intent of the interactions to ensure they are not merely a list of commands but rather a fluid dialogue that mirrors how humans naturally communicate with software systems. This process results in the creation of grounded queries that contain implicit intent, providing the agent with a richer context for learning how to interpret and respond to user requests.
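The sampling idea can be sketched as a random walk over a tool dependency graph, so that every sampled sequence respects which tool can follow which. The graph, the tool names, and the uniform choice of successor below are assumptions made for illustration; the paper's sampler may weight or constrain transitions differently.

```python
import random
from typing import Dict, List

# Hypothetical dependency graph: an edge A -> B means B can consume A's output.
TOOL_GRAPH: Dict[str, List[str]] = {
    "search_flights": ["get_fare_rules", "book_flight"],
    "get_fare_rules": ["book_flight"],
    "book_flight": ["send_confirmation"],
    "send_confirmation": [],
}


def sample_tool_chain(graph: Dict[str, List[str]], start: str,
                      max_len: int, rng: random.Random) -> List[str]:
    """Walk the graph from `start`, choosing a successor uniformly at random,
    until max_len tools are collected or a tool with no successors is reached."""
    chain = [start]
    while len(chain) < max_len:
        successors = graph[chain[-1]]
        if not successors:
            break
        chain.append(rng.choice(successors))
    return chain


def is_valid_chain(graph: Dict[str, List[str]], chain: List[str]) -> bool:
    """Every consecutive pair in the chain must be an edge of the graph."""
    return all(b in graph[a] for a, b in zip(chain, chain[1:]))


rng = random.Random(0)  # seeded so a run can be reproduced
chain = sample_tool_chain(TOOL_GRAPH, "search_flights", max_len=4, rng=rng)
```

A sampled chain would then be handed to the refinement stage, which rewrites the corresponding user turns so that they express a goal rather than list the tool calls.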

Experiments show that the framework reaches strong performance with far fewer environments than earlier efforts. The research team used only 85 validated tool environments spanning seven domains, about one-fifth of the number used in prior work, and generated 2,575 SFT and RL trajectories from them. Ablation studies examined the individual contributions of topology-aware sampling and calibration refinement, and showed that both are needed to produce trajectories with implicit intent and structural coherence. The results suggest that data quality, rather than the sheer number of environments, drives the performance improvements in Agentic RL.
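The scale figures above imply two numbers that the text does not state directly. The short calculation below derives them from the reported values. The prior-work count is only an estimate, on the assumption that "one-fifth" is a rounded ratio.

```python
environments = 85      # validated environments reported in the text
trajectories = 2575    # SFT and RL trajectories reported in the text

# Average number of trajectories generated per environment.
per_environment = trajectories / environments      # about 30.3

# Environment count implied for prior work if 85 is one-fifth of it.
implied_prior_environments = environments * 5      # 425
```

On these figures, each environment yields roughly 30 trajectories, against an implied baseline of about 425 environments in prior work.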

Industry Impact

The performance gains from EnvFactory were measured across multiple benchmark suites. Qwen3-series models trained on EnvFactory-generated data improved in their ability to interact with tools and follow complex user instructions. On BFCLv3, which measures tool use in a variety of contexts, the Qwen3 models gained up to 15%, indicating that they became more proficient at selecting and executing the correct tools for a given task. On MCP-Atlas, the gain was up to 8.6%, which is consistent with better sequential decision-making.

EnvFactory also delivered gains on conversational benchmarks. On τ²-Bench and VitaBench, models trained with EnvFactory data improved by up to 6%. This suggests that the implicit intent and natural phrasing in the synthetic trajectories help agents handle multi-turn conversations with users. That these gains came from only 85 environments underscores the efficiency of the approach, making it a viable option for organizations without access to large collections of real-world APIs or the compute to train on massive datasets.

The implications of EnvFactory extend beyond immediate performance metrics to the broader ecosystem of AI development. By providing a scalable and robust foundation for Agentic RL, the framework lowers the barrier to entry for researchers and developers seeking to build advanced AI agents. The automated nature of environment discovery and data synthesis means that organizations can rapidly iterate on their agent designs without being bottlenecked by the manual effort of environment creation. This efficiency is particularly valuable in industrial settings, where the cost and time associated with developing and maintaining tool-use capabilities can be prohibitive. EnvFactory offers a pathway to deploy sophisticated agents more quickly and at a lower cost, accelerating the adoption of AI technologies in complex business environments.

Outlook

The introduction of EnvFactory marks a significant step forward in the evolution of agentic reinforcement learning, shifting the paradigm from manual, resource-intensive data preparation to automated, scalable synthesis. The framework's success in generating high-quality training data with a minimal number of environments suggests that future research will increasingly focus on the quality and structure of training data rather than just the scale of the model or the volume of data. The topology-aware sampling and calibration refinement techniques employed by EnvFactory provide a new template for generating data that captures the nuances of human intent and interaction logic. As these methods are refined and expanded, they are likely to be adopted by other research groups, leading to a broader improvement in the state of the art for tool-use agents.

Looking ahead, the potential for EnvFactory to serve as a foundational infrastructure for Agentic RL is substantial. As the framework is extended to cover more domains and integrate with a wider variety of tools, it will enable the development of more versatile and autonomous AI systems. The ability to automatically discover and validate new environments will allow agents to adapt to new tools and platforms with minimal human intervention, enhancing their robustness and generalization capabilities. This adaptability is crucial for the long-term viability of AI agents in dynamic real-world environments where tools and interfaces are constantly evolving.

Furthermore, the emphasis on implicit intent and natural interaction in EnvFactory's data synthesis process points to a future where AI agents are not just efficient tool users but also empathetic and intuitive collaborators. By learning from data that reflects the subtle cues and unspoken needs of human users, agents will be able to provide more personalized and context-aware assistance. This shift towards more natural and intuitive human-computer interaction has the potential to transform how humans work with AI, making it a more seamless and productive part of daily life. As the field continues to advance, EnvFactory stands as a testament to the power of automated, intelligent data synthesis in unlocking the full potential of agentic AI systems.

The broader impact of EnvFactory also includes its contribution to the open-source community. By providing a transparent and reproducible framework for environment discovery and data synthesis, EnvFactory encourages collaboration and innovation among researchers worldwide. The availability of such tools democratizes access to high-quality training data, allowing smaller teams and independent researchers to compete with larger organizations in the development of advanced AI agents. This democratization is essential for fostering a diverse and vibrant AI ecosystem, where innovation is driven by a wide range of perspectives and use cases. As EnvFactory continues to evolve, it is poised to play a central role in shaping the future of agentic AI, driving progress in tool use, complex reasoning, and human-machine interaction.