STARE: Surprise-Guided Token-Level Advantage Reweighting for Stable Policy Entropy

To address the policy entropy collapse widely observed when training large language models with verifiable-reward reinforcement learning (e.g., GRPO), this paper proposes STARE, a stabilization method. Through first-order gradient analysis, the authors uncover a token-level credit assignment mismatch and show that entropy evolution decomposes into the product of trajectory-level advantage and an entropy sensitivity function, revealing an advantage-surprise quadrant structure with near-critical properties. STARE uses batch-level surprise quantiles to identify a subset of critical tokens and selectively reweights their effective advantage, and it adds a target-entropy gated feedback mechanism for stable entropy regulation. Across models from 1.5B to 32B parameters and tasks including short and long chain-of-thought reasoning as well as multi-turn tool use, STARE maintains stable policy entropy over thousands of training steps. On the AIME24 and AIME25 benchmarks, STARE achieves a 4-8% average accuracy improvement over baselines such as DAPO, with reflective tokens and response length growing in tandem, which the authors read as a healthy balance between exploration and exploitation.

Background and Context

In the post-training phase of large language models, reinforcement learning with verifiable rewards has become the dominant paradigm for improving complex reasoning, and Group Relative Policy Optimization (GRPO) is among its most prominent algorithms. These methods share a persistent problem: during training, the policy distribution often sharpens rapidly and policy entropy drops steeply, a phenomenon known as "policy entropy collapse." The collapse limits the model's capacity to explore and can destabilize training or trap optimization in local optima. The core contribution of the work summarized here is a systematic first-order gradient analysis of token-level entropy dynamics within the GRPO framework. The analysis attributes entropy collapse to a specific root cause: a mismatch in token-level credit assignment.
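For readers less familiar with GRPO, the group-relative advantage it relies on can be sketched in a few lines. This is an illustration only and is not taken from the paper: the function names, the `eps` constant, and the use of the population standard deviation are assumptions made for the example. What matters for the discussion is that one scalar advantage per response is shared by all of that response's tokens.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_tokens(advantages, response_lengths):
    """Give every token in a response the same trajectory-level advantage."""
    return [[a] * n for a, n in zip(advantages, response_lengths)]

# Hypothetical group of four responses with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(advantages, [3, 5, 2, 4])
```

Because the advantage is constant along a response, a confident filler token and a pivotal decision token receive identical credit, which is the kind of mismatch the analysis targets.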

The study shows that the entropy change attributable to a single token does not occur in isolation. Instead, it decomposes into the product of a trajectory-level advantage and an entropy sensitivity function specific to the next-token distribution. This decomposition exposes an "advantage-surprise" quadrant structure, with the system exhibiting near-critical properties. Building on these insights, the authors propose STARE (Surprise-Guided Token-Level Advantage Reweighting for Stable Policy Entropy). The method keeps policy entropy stable through fine-grained, token-level interventions, targeting a bottleneck that has long limited reinforcement learning for large language models.
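The product structure can be checked numerically in a toy setting. The sketch below is not the paper's derivation: it assumes a single softmax distribution whose logits are updated directly by one vanilla policy-gradient step on one sampled token, and all function names are hypothetical. Under those assumptions, the first-order entropy change is the learning rate times the advantage times a sensitivity term that depends only on the current distribution and the sampled token.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_sensitivity(probs, token):
    """First-order entropy change per unit (lr * advantage) for the sampled token.

    Assumption: logits move by lr * A * (onehot(token) - probs).
    """
    h = entropy(probs)
    surprise = [-math.log(p) for p in probs]
    # Distribution-wide term, identical for every sampled token.
    baseline = sum(p * p * (s - h) for p, s in zip(probs, surprise))
    return probs[token] * (surprise[token] - h) - baseline

def predicted_entropy_change(logits, token, advantage, lr):
    return lr * advantage * entropy_sensitivity(softmax(logits), token)

def actual_entropy_change(logits, token, advantage, lr):
    """Apply the same logit update exactly and measure the entropy difference."""
    probs = softmax(logits)
    stepped = [
        z + lr * advantage * ((1.0 if k == token else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]
    return entropy(softmax(stepped)) - entropy(probs)

# Toy peaked distribution over three tokens.
probs = [0.7, 0.2, 0.1]
logits = [math.log(p) for p in probs]
pred = predicted_entropy_change(logits, token=0, advantage=1.0, lr=1e-3)
act = actual_entropy_change(logits, token=0, advantage=1.0, lr=1e-3)
```

For this peaked distribution, a positive advantage on the most likely token gives a negative predicted entropy change, while the same advantage on the rarest token gives a positive one, which is one way to read the quadrant structure described above.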

Deep Analysis

From a technical perspective, STARE ties its engineering directly to the theoretical analysis. The algorithm first computes quantiles of token surprise (commonly defined as the negative log-probability of the sampled token) over a batch, and uses them to identify the subset of critical tokens with the greatest impact on entropy change. These tokens typically sit at key decision points, where prediction uncertainty is decisive for overall policy entropy. STARE does not adjust all tokens uniformly; it selectively reweights the effective advantage of the critical tokens only. The reweighting adapts each token's contribution to the gradient update according to its surprise level, suppressing the influence of high-confidence tokens that drive entropy down too quickly while preserving the exploratory signal carried by high-surprise tokens.
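A quantile-based reweighting step of this kind might look roughly as follows. The exact selection rule and weights used by STARE are not given in this summary, so everything specific here is an assumption: the 0.2 and 0.8 quantile cutoffs, the single `strength` scale, and the choice to damp only the two quadrants that a first-order analysis suggests lower entropy (positive advantage on low-surprise tokens, negative advantage on high-surprise tokens).

```python
import math

def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

def reweight_advantages(surprises, advantages, strength, low_q=0.2, high_q=0.8):
    """Damp the advantage of tail tokens assumed to push entropy down.

    surprises:  per-token -log p(token) for every token in the batch (flattened)
    advantages: the trajectory-level advantage copied to each of those tokens
    strength:   0.0 leaves advantages untouched, 1.0 zeroes the selected ones
    """
    low_cut = quantile(surprises, low_q)
    high_cut = quantile(surprises, high_q)
    reweighted = []
    for s, a in zip(surprises, advantages):
        lowers_entropy = (a > 0 and s <= low_cut) or (a < 0 and s >= high_cut)
        reweighted.append(a * (1.0 - strength) if lowers_entropy else a)
    return reweighted

# Hypothetical flattened batch of five tokens.
surprises = [0.1, 0.5, 1.0, 2.0, 4.0]
advantages = [1.0, 1.0, 1.0, -1.0, -1.0]
new_advantages = reweight_advantages(surprises, advantages, strength=0.5)
```

Only tokens in the surprise tails are touched, so most of the batch keeps its original advantage and the intervention stays sparse.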

A further component of STARE is a target-entropy gated feedback mechanism. It continuously monitors the deviation between the current policy entropy and a preset target interval and adjusts the intensity of the reweighting accordingly. This closed-loop control keeps policy entropy within the target range throughout training, avoiding both the noise introduced by excessive exploration and the loss of diversity caused by premature convergence.
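One simple way to realize such a gate is a proportional controller with a dead band. The sketch below is an assumption-laden illustration rather than the authors' controller: the function name, the `gain` parameter, and the clamping range are all hypothetical. The reweighting strength grows when entropy falls below the target interval, shrinks when entropy rises above it, and is left unchanged while entropy stays inside the interval.

```python
def update_strength(strength, current_entropy, target_low, target_high,
                    gain=0.1, max_strength=1.0):
    """Return the reweighting strength to use for the next training step."""
    if current_entropy < target_low:
        # Entropy is collapsing: damp entropy-lowering tokens more strongly.
        strength += gain * (target_low - current_entropy)
    elif current_entropy > target_high:
        # Entropy is too high: relax the intervention.
        strength -= gain * (current_entropy - target_high)
    # Inside the band the gate is closed and the strength is unchanged.
    return min(max(strength, 0.0), max_strength)

# Hypothetical entropy trace measured over successive training steps.
strength = 0.0
history = []
for measured in [0.45, 0.28, 0.22, 0.35, 0.60]:
    strength = update_strength(strength, measured, target_low=0.3, target_high=0.5)
    history.append(strength)
```

Keeping a dead band means the controller does nothing while entropy is healthy, so the baseline training signal is left intact most of the time.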

Industry Impact

The experimental evaluation covers language models ranging from 1.5 billion to 32 billion parameters and three representative families of reasoning tasks: short chain-of-thought (Short CoT), long chain-of-thought (Long CoT), and multi-turn tool use. The results indicate that STARE keeps policy entropy within the target band over thousands of reinforcement learning training steps. On the AIME24 and AIME25 reasoning benchmarks, STARE outperforms DAPO and other competitive baselines, with an average accuracy improvement of 4% to 8%. Ablation studies further indicate that the gain is not merely due to an increase in parameter count but stems from a healthy balance between exploration and exploitation.

Specifically, as training progresses, the number of reflective tokens generated under STARE and the response length grow in tandem, indicating that the model deepens its reasoning without sacrificing breadth of exploration. This balance supports the view that stable entropy control plays a critical role in final model performance on complex reasoning tasks. For the open-source community, the method adds to the RL post-training toolkit, and its analysis framework offers a new perspective on optimizing credit assignment. In industrial settings, stable policy entropy means a lower risk of training collapse and more predictable compute consumption, both of which matter when training large-scale reasoning models.

Outlook

STARE offers an interpretable and efficient approach to reinforcement learning post-training of large language models. Its emphasis on balancing exploration and exploitation is directly relevant to tasks that require creativity and diversity, such as open-domain question answering and code generation. As large models move toward more complex cognitive tasks, maintaining both the diversity and the stability of the policy is likely to become a core concern, and surprise-guided reweighting may serve as a reference point for future reinforcement learning algorithm design.

By keeping policy entropy stable, STARE allows models to explore a wider range of reasoning paths without converging prematurely, which is particularly relevant for applications where robustness and adaptability in complex environments are paramount. The near-critical properties identified in the advantage-surprise quadrant suggest a delicate balance point that, when maintained, maximizes learning efficiency. Future research may build on STARE's framework to refine these control mechanisms further, potentially leading to more robust and capable reasoning models.
