What is SDAR and what problem does it solve?

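SDAR (Self-Distilled Agentic Reinforcement Learning) is a training framework that keeps reinforcement learning as the main optimizer and adds on-policy self-distillation (OPSD) as an auxiliary objective controlled by a sigmoid gate. It targets two problems in multi-turn agent interactions: trajectory-level rewards that are too sparse for long-horizon tasks, and the instability that arises when OPSD is applied directly.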

How much does SDAR improve performance over existing methods?

Tests on Qwen2.5 and Qwen3 models show SDAR outperforms GRPO by 9.4%, 7.0%, and 10.2% on the ALFWorld, WebShop, and Search-QA benchmarks, respectively, while avoiding the instability of a naive GRPO+OPSD combination.

What is the practical value and future outlook of SDAR?

SDAR can be integrated as a plug-and-play module into existing training pipelines, improving reliability in long-horizon tasks such as customer service and code generation. Its dynamic weighting principle also points toward multi-teacher distillation and adaptive reward shaping.

SDAR: A New Method for Training Reinforcement Learning Agents Based on a Self-Distillation Gating Mechanism

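Reinforcement learning has become the core paradigm for post-training large language model agents, but its trajectory-level reward signal supervises long-horizon interactions too sparsely. On-Policy Self-Distillation (OPSD) provides dense token-level guidance by introducing privileged context, yet applying it directly in multi-turn agent settings aggravates instability and makes it hard to tell whether a teacher rejection stems from a skill-retrieval deficiency or from improper utilization. This paper proposes SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective and keeps reinforcement learning as the backbone optimizer. The method maps discrete token-level signals to a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens while softly attenuating negative rejections. On Qwen2.5 and Qwen3 series models, SDAR significantly outperforms GRPO on the ALFWorld, WebShop, and Search-QA benchmarks, with gains of 9.4%, 7.0%, and 10.2% respectively, and it effectively avoids the instability of naive GRPO+OPSD.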
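The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the gap definition (teacher log-probability minus student log-probability per token), the `temperature` and `aux_coef` parameters, and the function names are all assumptions made for illustration. The summary only states that token-level signals are mapped to a sigmoid gate and that OPSD serves as an auxiliary term next to the RL loss.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_weights(teacher_logps, student_logps, temperature=1.0):
    """Map per-token gaps to soft weights in (0, 1).

    Assumption: gap = teacher log-prob - student log-prob on the sampled token.
    A positive gap (teacher endorses the token) gives a weight above 0.5;
    a negative gap (teacher rejects it) is softly attenuated toward 0.
    In a real training loop these weights would be treated as constants
    (no gradient flows through the gate); that is also an assumption here.
    """
    return [
        sigmoid((t - s) / temperature)
        for t, s in zip(teacher_logps, student_logps)
    ]


def gated_total_loss(rl_loss, distill_losses, gates, aux_coef=0.1):
    """RL loss stays the backbone; gated distillation is an auxiliary term.

    `aux_coef` is a hypothetical mixing coefficient, not a value from the paper.
    """
    aux = sum(g * d for g, d in zip(gates, distill_losses)) / len(gates)
    return rl_loss + aux_coef * aux


# Toy example with three tokens: endorsed, rejected, and neutral.
gates = gate_weights([-0.5, -3.0, -1.0], [-2.0, -0.5, -1.0])
total = gated_total_loss(1.0, [1.0, 1.0, 1.0], gates, aux_coef=0.2)
```

In the toy example the first token has a positive gap and receives a weight above 0.5, the second has a negative gap and is down-weighted rather than dropped, and the third has zero gap and sits exactly at 0.5. A soft gate of this kind avoids the hard on/off decision that a threshold would impose on rejected tokens.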