What is AXPO and how does it fix the thinking-acting gap in AI agents?

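AXPO is a policy optimization algorithm that keeps a rollout's thinking prefix fixed and resamples only the tool call and the actions that follow it, and only for groups in which every rollout failed. This targets the 40% failure rate of tool calls, giving the model a precise learning signal without discarding reasoning that was already sound.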
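The freeze-and-resample idea can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: a rollout is modeled as a thinking prefix plus one action string (the tool call and everything after it), and `Rollout`, `resample_failed_group`, `sample_action`, `is_correct`, and `max_tries` are hypothetical names introduced here.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    thinking: str   # reasoning prefix, kept fixed during resampling
    action: str     # tool call plus everything generated after it
    reward: float   # 1.0 if the final answer is correct, else 0.0


def resample_failed_group(
    thinking: str,
    group: List[Rollout],
    sample_action: Callable[[str, random.Random], str],  # hypothetical policy sampler
    is_correct: Callable[[str], bool],                   # hypothetical reward check
    rng: random.Random,
    max_tries: int = 8,                                  # assumed resampling budget
) -> List[Rollout]:
    """If every rollout sharing `thinking` failed, draw new actions for that prefix."""
    if any(r.reward > 0 for r in group):
        return group  # the group already contains a success; leave it untouched

    extra: List[Rollout] = []
    for _ in range(max_tries):
        # Only the action is regenerated; the thinking prefix is reused as-is.
        action = sample_action(thinking, rng)
        reward = 1.0 if is_correct(action) else 0.0
        extra.append(Rollout(thinking, action, reward))
        if reward > 0:
            break  # stop once the group contains at least one success
    return group + extra
```

Stopping at the first success is a design choice made for this sketch. The source only states that the tool call and its subsequent actions are resampled for all-wrong groups, not how many samples are drawn.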

Why does this matter for AI efficiency and model scaling?

It suggests that algorithmic optimization can outweigh raw scale. An 8B model trained with SFT+AXPO outperforms the 32B base model on Pass@4 while using only 25% of the parameters, which lowers compute costs.

What are the broader implications for future AI development?

Its local resampling and uncertainty-guided prefix selection could extend beyond multimodal models to code generation and automated workflows, offering a general framework for sequential decision-making tasks.

AXPO: Exploratory Policy Optimization for Closing the Thinking-Acting Gap in Multimodal Agentic Reasoning

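This paper addresses the "thinking-acting gap" that is common in multimodal agentic reasoning and proposes AXPO, a new policy optimization algorithm. When existing reinforcement learning methods handle tool calls, low tool usage (only about 30%) and a high call failure rate often suppress the learning signal. AXPO fixes the thinking prefix and, for subgroups in which every rollout is wrong, resamples the tool call and its subsequent actions. Combined with an uncertainty-based prefix selection strategy, this improves the model's ability to explore. Across nine multimodal benchmarks, SFT+AXPO outperforms SFT+GRPO on both average Pass@1 and average Pass@4. Notably, at the 8B parameter scale, SFT+AXPO surpasses the 32B base model on Pass@4 with only one quarter of its parameters.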
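To illustrate why an all-wrong group suppresses the learning signal, here is a small sketch of group-relative advantage normalization in the style of GRPO. The function name, the use of the population standard deviation, and the `eps` value are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout failed: all advantages are zero, so the gradient carries no signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One resampled rollout succeeded: it gets a positive advantage, the rest negative.
one_success = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

When every reward in a group is identical, each advantage is zero and the group contributes nothing to the update. Resampling until at least one rollout succeeds restores a nonzero contrast within the group.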

Sources

arXiv