AXPO: Exploratory Policy Optimization for Bridging the Thinking-Acting Gap in Multimodal Agent Reasoning

This paper addresses the pervasive 'thinking-acting gap' in multimodal agent reasoning by introducing AXPO (Agent eXplorative Policy Optimization), a novel policy optimization algorithm. Existing reinforcement learning methods for tool use suffer from severely suppressed learning signals: models call tools in only ~30% of rollouts, and in ~40% of problems every tool call in a rollout group fails. AXPO fixes the thinking prefix and resamples the tool call and subsequent actions for these completely wrong subgroups, and combines this with an uncertainty-based prefix selection strategy that enhances model exploration. Across nine multimodal benchmarks, the SFT+AXPO pipeline consistently outperforms SFT+GRPO on both average Pass@1 and Pass@4 metrics. Notably, at the 8B parameter scale, SFT+AXPO surpasses a 32B base model on Pass@4 while using only one-quarter of the parameters.

Background and Context

The evolution of multimodal large language models has reached a critical juncture where internal reasoning capabilities, often termed extended reasoning, are no longer sufficient for complex real-world problem-solving. While visual language models have demonstrated impressive potential in handling abstract logic and internal knowledge retrieval, many practical tasks require interaction with external environments. This necessity introduces the core challenge of Agentic Reasoning: the model must seamlessly interleave internal cognitive processes, referred to as "Thinking," with external interactions, known as "Acting" or tool usage. The research highlights a structural asymmetry between these two modes, defining it as the "Thinking-Acting Gap." This gap is not merely a conceptual distinction but a significant barrier to effective agent performance, particularly when employing standard reinforcement learning frameworks.

Standard reinforcement learning approaches, such as Group Relative Policy Optimization (GRPO), struggle significantly with this duality. The study identifies two critical diagnostic symptoms that manifest during training. First, there is a profound lack of exploration; models attempt to use external tools in only approximately 30% of rollout episodes. This low utilization rate indicates that models prefer the safety of internal reasoning over the perceived risk of external interaction. Second, when models do attempt to use tools, the failure rate is alarmingly high. In roughly 40% of problem instances, every tool call within a group of rollouts fails completely. This high frequency of total failure leads to a suppression of learning signals. Because the entire trajectory is penalized without providing nuanced feedback on which part of the action failed, the model struggles to learn effective tool-use strategies, creating a vicious cycle of avoidance and error.
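
To make the signal-suppression argument concrete, the following is a minimal sketch of a GRPO-style group-relative advantage together with the two diagnostics mentioned above. The reward values, the (used_tool, reward) rollout format, and the function names are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward normalized by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def diagnostics(groups):
    """groups: list of rollout groups; each rollout is a (used_tool, reward) pair.

    Returns (tool_use_rate, all_wrong_rate), where all_wrong_rate is the share of
    tool-using groups in which every tool-using rollout received zero reward.
    """
    rollouts = [r for g in groups for r in g]
    tool_use_rate = sum(1 for used, _ in rollouts if used) / len(rollouts)
    with_tools = [g for g in groups if any(used for used, _ in g)]
    all_wrong = [g for g in with_tools
                 if all(rew == 0 for used, rew in g if used)]
    all_wrong_rate = len(all_wrong) / len(with_tools) if with_tools else 0.0
    return tool_use_rate, all_wrong_rate

# Case 1: every rollout in the group fails, so every advantage is zero
# and the group contributes no gradient.
uniform_failure = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Case 2 (hypothetical): rollouts 0 and 1 used a tool and failed, rollout 2
# answered correctly without a tool. Both tool rollouts get the same negative
# advantage, which says "avoid this" but not which part of the call was wrong.
mixed = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

toy_groups = [
    [(True, 0.0), (True, 0.0), (False, 1.0), (False, 0.0)],
    [(False, 1.0), (False, 1.0), (False, 0.0), (True, 1.0)],
]
tool_use_rate, all_wrong_rate = diagnostics(toy_groups)
```

In this toy setting the first case yields no update at all, while the second pushes the policy away from tool use without distinguishing a sound decision to call a tool from a badly formed call, which is one way to read the cycle of avoidance described above.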

To address these systemic issues, the research introduces AXPO (Agent eXplorative Policy Optimization), a novel policy optimization algorithm specifically designed to bridge the thinking-acting divide. The primary objective of AXPO is to mitigate the suppression of learning signals and enhance the model's willingness to explore external tools. By targeting the specific failure modes identified in standard reinforcement learning, AXPO aims to provide a more robust framework for training multimodal agents. The algorithm seeks to unlock the true potential of these models by ensuring that tool usage is not only attempted more frequently but also learned from more effectively, thereby reducing the performance gap between internal reasoning and external action.

Deep Analysis

AXPO introduces a sophisticated mechanism for handling "completely wrong" tool-use subgroups, which are the primary source of learning signal suppression in traditional methods. The core innovation lies in its ability to decouple the internal reasoning process from the external action execution. When the algorithm identifies a subgroup of rollouts where all tool calls have failed, it does not discard the entire trajectory. Instead, it employs a strategy of "fixing the thinking prefix and resampling the action." This means that the initial phase of the model's internal reasoning, which led to the decision to use a tool, is preserved. Only the tool call itself and the subsequent execution steps are resampled. This approach ensures that the model retains credit for its correct internal logic while receiving targeted feedback on its external interaction, providing a much more precise learning signal than binary success or failure.
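
The "fix the thinking prefix, resample the action" step can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the Rollout fields, the sample_action and score callables, and the n_resample parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    thinking_prefix: str  # reasoning up to the decision to call a tool
    action: str           # tool call plus everything generated after it
    used_tool: bool
    reward: float

def resample_all_wrong_subgroup(
    group: List[Rollout],
    sample_action: Callable[[str], str],  # hypothetical: continue generation from a prefix
    score: Callable[[str], float],        # hypothetical: reward for a completed action
    prefix_index: int = 0,
    n_resample: int = 4,
) -> List[Rollout]:
    """If every tool-using rollout in the group failed, keep one thinking
    prefix fixed and draw fresh tool calls and continuations from it."""
    tool_rollouts = [r for r in group if r.used_tool]
    # Only act on "completely wrong" tool-use subgroups.
    if not tool_rollouts or any(r.reward > 0 for r in tool_rollouts):
        return []
    prefix = tool_rollouts[prefix_index].thinking_prefix
    fresh = []
    for _ in range(n_resample):
        action = sample_action(prefix)  # the prefix itself is not regenerated
        fresh.append(Rollout(prefix, action, True, score(action)))
    return fresh

# Toy usage with stand-in callables (a real setup would call the policy model).
failed_group = [
    Rollout("The label is too small; I should zoom in.", "crop(box=A)", True, 0.0),
    Rollout("I need a closer look at the chart.", "crop(box=A)", True, 0.0),
    Rollout("I can answer directly.", "", False, 0.0),
]
fresh = resample_all_wrong_subgroup(
    failed_group,
    sample_action=lambda prefix: "crop(box=B)",
    score=lambda action: 1.0 if "box=B" in action else 0.0,
)
```

Because the resampled rollouts share one prefix, any difference in their rewards can be attributed to the tool call and what follows it, which is the more targeted signal the paragraph above describes.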

Complementing this resampling strategy is an uncertainty-based prefix selection mechanism. AXPO evaluates the model's uncertainty during the generation of the thinking prefix to dynamically select which trajectories are most valuable for optimization. This mechanism prioritizes prefixes that offer high exploration value without deviating too far from correct reasoning paths. By focusing on these uncertain yet promising prefixes, AXPO enhances the model's exploratory capabilities in a controlled manner. This prevents the training process from being destabilized by the high variance inherent in tool usage, ensuring that the model learns from errors that are informative rather than random noise. The combination of fixed prefixes and selective resampling creates a stable environment for learning complex tool-use behaviors.

The efficacy of AXPO was validated through comprehensive experiments across nine widely used multimodal benchmarks. The study used Qwen3-VL-Thinking models of varying parameter sizes as base models to test the robustness of the findings. The results demonstrated that the SFT+AXPO pipeline consistently outperformed the standard SFT+GRPO approach. Specifically, SFT+AXPO achieved an average improvement of 1.8 percentage points in both Pass@1 and Pass@4 metrics. While this numerical gain may appear modest, it holds on both metrics, including Pass@4, which counts a problem as solved when at least one of four sampled attempts is correct. The improvement therefore reflects gains both in single-attempt accuracy and in the coverage of correct solutions across multiple samples.

A particularly striking finding of the study concerns model scale. The SFT+AXPO-trained model with 8 billion parameters surpassed a 32-billion-parameter base model on the Pass@4 metric, even though the 8B model uses only one-quarter of the parameters of its larger counterpart. This result suggests that algorithmic efficiency can partly compensate for model scale, offering a cost-effective pathway to high-performance agents. Ablation studies further confirmed that both the fixed thinking prefix mechanism and the uncertainty-based selection were critical contributors to this success, supporting the AXPO design.
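
The uncertainty-based prefix selection discussed in this section could be sketched roughly as below. The text does not specify how uncertainty is measured, so this example assumes mean per-token entropy over the thinking prefix and simply picks the most uncertain candidate; both the measure and the selection rule are assumptions, and the function names are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def prefix_uncertainty(step_distributions):
    """Assumed uncertainty score: mean token entropy across the prefix."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)

def select_prefix(candidates):
    """candidates maps a prefix id to its list of per-token distributions.
    Returns the id with the highest assumed uncertainty score."""
    return max(candidates, key=lambda name: prefix_uncertainty(candidates[name]))

# Toy distributions over a two-token vocabulary.
candidates = {
    "confident": [[0.99, 0.01], [0.98, 0.02]],
    "uncertain": [[0.5, 0.5], [0.6, 0.4]],
}
chosen = select_prefix(candidates)
```

A practical variant would likely also bound the score from above, in line with the stated goal of not drifting too far from correct reasoning paths; that constraint is omitted here because the text does not say how it is enforced.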

Industry Impact

The introduction of AXPO has profound implications for the development and deployment of multimodal agents in industrial settings. By providing a theoretical and practical solution to the thinking-acting gap, the algorithm enables the creation of more reliable and efficient agents. The emphasis on distinguishing between internal reasoning and external tool calling offers a new paradigm for designing reinforcement learning training pipelines. This distinction is crucial for future research, as it highlights the need for specialized optimization techniques that account for the unique challenges of agentic workflows. The success of AXPO suggests that current standard methods may be insufficient for complex agent tasks, necessitating a shift towards more nuanced policy optimization strategies.

From a deployment perspective, the ability of a smaller model to match or exceed a larger one is a game-changer for cost and latency management. The study demonstrates that an 8B model optimized with AXPO can outperform a 32B base model on Pass@4 with 75% fewer parameters. This efficiency gain is particularly valuable for edge devices and large-scale concurrent services where resources are constrained. Lower latency and reduced computational costs make it feasible to deploy sophisticated multimodal agents in real-time applications, such as autonomous robotics, interactive customer service, and real-time data analysis. The democratization of high-performance agent capabilities through algorithmic optimization rather than sheer scale could accelerate the adoption of AI agents across various sectors.

For the open-source community, AXPO provides a reproducible and efficient optimization framework that can be integrated into existing training pipelines. This accessibility fosters innovation by allowing researchers and developers to experiment with advanced agent training techniques without requiring massive computational resources. The local resampling and uncertainty-guided strategies employed by AXPO are not limited to multimodal tasks; they offer potential applications in other domains involving sequential decision-making and tool use, such as code generation and automated workflow orchestration. By providing a robust foundation for these tasks, AXPO contributes to the broader advancement of agentic AI technologies.

Outlook

Looking ahead, the AXPO algorithm sets a new benchmark for evaluating and training multimodal agents. The significant performance gains observed in the study suggest that future research will likely focus on further refining policy optimization techniques to address other aspects of the thinking-acting gap. As models become more complex and the variety of external tools expands, the need for robust exploration strategies will only increase. The uncertainty-based prefix selection mechanism, in particular, offers a promising direction for managing the trade-off between exploration and exploitation in increasingly dynamic environments. Researchers may explore extending this mechanism to handle more complex multi-step tool interactions and long-horizon planning tasks.

The industrial trajectory indicated by the study points towards a future where model size is less of a bottleneck for agent performance. As companies seek to deploy AI agents at scale, the efficiency gains offered by algorithms like AXPO will be critical. The ability to achieve high performance with smaller models allows for more flexible deployment architectures, including hybrid cloud-edge systems. This trend could lead to the emergence of specialized, lightweight agents that are tailored for specific tasks, rather than relying on monolithic general-purpose models. The focus will likely shift from scaling parameters to scaling algorithmic intelligence and training efficiency.

Furthermore, the success of AXPO in bridging the thinking-acting gap may inspire similar innovations in other areas of artificial intelligence. The principles of fixing correct reasoning paths while resampling erroneous actions could be applied to domains such as natural language processing, where models often struggle with complex instruction following. Similarly, the uncertainty-based selection mechanism could enhance the reliability of autonomous systems that must make critical decisions under uncertainty. As the field of agentic AI continues to evolve, the insights provided by AXPO will serve as a foundational reference for developing more capable, efficient, and reliable intelligent systems. The journey towards fully autonomous multimodal agents is being paved by such algorithmic breakthroughs, promising a future where AI agents can seamlessly interact with the world with human-like reasoning and action capabilities.

The long-term impact of AXPO will also be felt in the standardization of agent evaluation metrics. The study's emphasis on Pass@1 and Pass@4 highlights the importance of measuring not just single-best performance but also the diversity and robustness of agent outputs. As the industry moves towards more complex agentic applications, these metrics will become increasingly important for assessing the true utility of AI systems. The AXPO framework provides a template for how such evaluations can be conducted rigorously, ensuring that progress in agent development is measured accurately and meaningfully. This focus on robust evaluation will help guide the development of future algorithms and models, ensuring that they are not only powerful but also reliable and safe for real-world deployment.
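
For readers unfamiliar with the metric, the sketch below shows one common way to compute Pass@k: the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled attempts per problem and c the number of correct ones. Whether the study uses this estimator or a direct count over exactly k samples is not stated in the text, so treat this as an assumption; the per-problem counts are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without replacement
    from n sampled attempts of which c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average Pass@k over problems, given the number of correct attempts for each."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical benchmark: 4 attempts per problem, correct counts for 4 problems.
counts = [0, 1, 4, 2]
p1 = benchmark_pass_at_k(counts, n=4, k=1)  # mean single-attempt accuracy
p4 = benchmark_pass_at_k(counts, n=4, k=4)  # share of problems solved at least once
```

With k = 1 the estimator reduces to the average fraction of correct attempts, and with k = n it reduces to the share of problems that have at least one correct attempt, which is why the two numbers together capture both single-attempt accuracy and coverage.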