SDAR's Gated Self-Distillation Tackles Sparse Rewards in LLM Agent Reinforcement Learning
Reinforcement learning has become the cornerstone of agent training for large language models, yet sparse reward signals in long-horizon tasks remain a persistent bottleneck. SDAR addresses this by treating online policy self-distillation as a gated auxiliary objective while keeping reinforcement learning as the primary optimizer. A sigmoid gate maps discrete token-level signals into soft weights, amplifying distillation on teacher-approved tokens while softly attenuating negative rejections. Across Qwen2.5 and Qwen3 models, SDAR outperforms GRPO by 9.4% on ALFWorld, 7.0% on Search-QA, and 10.2% on WebShop.
Background and Context
Reinforcement learning has established itself as the dominant paradigm for post-training large language model agents, primarily because it allows for the direct optimization of final task rewards. However, this approach faces a fundamental structural challenge: reward signals are typically sparse, assigned to the entire interaction trajectory rather than to individual steps. For complex tasks requiring long-horizon planning and multi-step reasoning, this coarse supervision leaves models without precise feedback at intermediate stages.
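To make the sparsity concrete, here is a minimal PyTorch sketch of how a GRPO-style trainer spreads a single trajectory-level reward across every token; the function and variable names are illustrative, not from the paper.

```python
import torch

def grpo_advantages(rewards, seq_lens):
    """Spread one trajectory-level reward over every token (GRPO-style).

    rewards: (G,) tensor, one scalar reward per sampled trajectory.
    seq_lens: list of token counts for each trajectory.
    """
    # Group-relative normalization: compare each trajectory to its peers.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Every token in a trajectory inherits the same coarse advantage;
    # no token learns which intermediate step actually helped or hurt.
    return [a.expand(n) for a, n in zip(adv, seq_lens)]

# e.g. two successes and two failures in a group of four rollouts:
# grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]), [37, 52, 41, 29])
```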
To address this sparsity, researchers have turned to Online Policy Self-Distillation (OPSD), which leverages a teacher branch with privileged context to provide dense, token-level guidance. While OPSD performs well in single-turn or simple environments, applying it directly in multi-turn agent scenarios introduces significant instability. In these longer interactions, errors accumulate and amplify quickly, and the system struggles to distinguish failures caused by faulty skill retrieval from those caused by improper skill utilization. This ambiguity produces misleading learning signals, particularly when the teacher issues negative rejections that reflect a contextual misunderstanding rather than a fundamental lack of capability.
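For orientation, the following sketch shows the kind of dense, token-level objective OPSD implies: a teacher branch that sees privileged context supervises every student token through a KL term. The divergence direction and the masking scheme are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def opsd_distill_loss(student_logits, teacher_logits, action_mask):
    """Dense token-level KL from a privileged-context teacher to the student.

    student_logits, teacher_logits: (batch, seq_len, vocab);
    teacher_logits are assumed detached (no gradient through the teacher).
    action_mask: (batch, seq_len) float, 1.0 on tokens the agent generated.
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    log_teacher = F.log_softmax(teacher_logits, dim=-1)
    log_student = F.log_softmax(student_logits, dim=-1)
    # Per-token KL(teacher || student): supervision at every position,
    # not just one scalar at the end of the trajectory.
    kl = (teacher_probs * (log_teacher - log_student)).sum(dim=-1)
    return (kl * action_mask).sum() / action_mask.sum().clamp_min(1.0)
```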
Deep Analysis
The proposed SDAR (Self-Distillation Agent Reinforcement Learning) framework addresses these limitations by redefining the relationship between reinforcement learning and self-distillation. Rather than simply stacking the two methods, SDAR keeps reinforcement learning as the primary optimizer, ensuring global convergence on task rewards, while treating OPSD as a gated auxiliary objective. The core innovation is a signal-mapping mechanism that converts discrete token-level distillation signals into continuous sigmoid gating values.
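The summary does not specify exactly what quantity feeds the gate, so the sketch below assumes the per-token teacher-student log-probability gap and an illustrative temperature tau:

```python
import torch

def sigmoid_gate(teacher_logp, student_logp, tau=1.0):
    """Map a per-token teacher-student gap to a soft weight in (0, 1).

    teacher_logp, student_logp: (batch, seq_len) log-probabilities of the
    sampled tokens under the teacher and student branches.
    """
    gap = teacher_logp - student_logp  # > 0 when the teacher endorses the token
    return torch.sigmoid(gap / tau)    # smooth weight, never a hard 0/1 decision
```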
This design employs asymmetric processing logic: when the teacher approves a token, indicating a positive gap, the gate sharply increases the distillation intensity, forcing the agent to mimic the high-quality decision. When the teacher issues a negative rejection, SDAR does not forcibly suppress the agent's output; instead, it softly attenuates the weight of the negative signal. This asymmetry mitigates the noise arising from incomplete skill retrieval or imperfect utilization strategies and prevents the training collapse often seen in naive combinations of GRPO and OPSD.
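Putting the pieces together, one plausible form of the combined objective, with lam a hypothetical auxiliary coefficient rather than a value reported in the paper:

```python
def sdar_style_loss(rl_loss, per_token_kl, gate, action_mask, lam=0.1):
    """Primary RL objective plus sigmoid-gated auxiliary distillation.

    gate -> 1 (teacher approval, positive gap): imitation is amplified.
    gate -> 0 (negative rejection): the term fades out softly instead of
    actively pushing the student away from its own output.
    """
    gate = gate.detach()  # use the gate as a weight, not a gradient path
    denom = action_mask.sum().clamp_min(1.0)
    distill = (gate * per_token_kl * action_mask).sum() / denom
    return rl_loss + lam * distill
```

Because rejected tokens are merely down-weighted rather than penalized, a mistaken teacher judgment costs the student little, which is precisely the stability property the ablations below probe.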
Empirical Validation
Empirical validation of SDAR demonstrates its robustness across multiple representative agent benchmarks: ALFWorld for text-environment interaction, WebShop for e-commerce simulation, and Search-QA for search-based question answering. Experiments on both Qwen2.5 and Qwen3 series models confirm the method's generalizability and effectiveness. SDAR significantly outperformed the GRPO baseline, improving ALFWorld by 9.4%, Search-QA by 7.0%, and WebShop accuracy by a substantial 10.2%.
Crucially, ablation studies revealed that SDAR avoids the multi-turn instability inherent in naive GRPO+OPSD combinations. As model scale increased, SDAR consistently surpassed various hybrid RL-OPSD baselines, demonstrating reliability across architectures. The gain is not merely statistical: it translates into agents that complete complex tasks with higher accuracy and stability, addressing a critical bottleneck in current agent development.
Outlook
From an industry perspective, SDAR offers a practical resolution to the persistent tension between sparse supervision and the noise that denser signals introduce. For the open-source community, it provides a plug-and-play module that enhances agent performance without requiring complex architectural modifications, enabling more efficient post-training pipelines. In industrial applications, where agents are increasingly deployed in customer service, automated office workflows, and code generation, stabilizing multi-turn interactions is paramount for safety and usability.
SDAR's soft gating mechanism directly addresses these deployment needs by reducing erratic behavior. Furthermore, this work highlights that merely increasing the density of supervision signals is insufficient; the key lies in dynamic weighting based on signal credibility. This insight paves the way for future research into more complex teacher-student interactions, multi-teacher distillation, and adaptive reward shaping, ultimately driving the evolution of agents from those that can merely complete tasks to those that do so reliably, efficiently, and consistently.