Why does reinforcement learning cause LLMs to collapse during tool use?

RL training frequently causes anomalous probability spikes in control tokens, disrupting the structured execution pipeline and preventing correct output formatting.

How do supervised signals help stabilize multi-step tool use?

They provide explicit optimization directions. Interleaving supervised fine-tuning with RL significantly improves stability, preventing catastrophic performance drops.

What is the main trade-off of this approach?

Stability improves, but performance degrades in out-of-distribution evaluations. Properly managing the learning rate is critical for maintaining generalization capabilities.

多步工具使用強化學習崩潰機理及監督信號修復策略

本文深入探討大語言模型在多步工具使用中應用強化學習時的穩定性挑戰。研究發現，儘管模型具備底層工具呼叫能力，但強化學習常導致災難性效能崩潰，表現為特定控制符號概率異常飆升，破壞結構化執行流程。作者系統評估了多種監督信號，包括離線策略監督、提示引導和錯誤範例監督，並對比了同步與交錯訓練策略。實驗表明，將監督微調與強化學習交錯進行能顯著提升訓練穩定性，但在分布外評估中效能有所下降。研究還分析了學習率對泛化能力的影響，揭示了理解強化學習失敗模式的重要性，為構建魯棒的多步工具使用智慧體提供了新範式。

Sources

arXiv