InfoDensity: Rewarding Information-Dense Traces for Efficient LLM Reasoning
LLMs with extended reasoning often generate verbose, redundant traces. InfoDensity introduces an information-theoretic RL reward framework that measures the conditional entropy of the answer distribution across reasoning steps. High-quality traces show two properties: low uncertainty convergence (small area under the entropy curve) and monotonic progress (entropy decreasing at each step). Combining an AUC-based reward, a monotonicity reward, and a group-relative length scaling factor, InfoDensity achieves 27-30% token reduction while maintaining or improving accuracy — without the reward hacking vulnerability of pure length-penalty approaches.
When More Reasoning Isn't Better: InfoDensity's Information-Theoretic Approach to Efficient LLM Reasoning
The explosive growth of large reasoning models (LRMs) has brought with it an unexpected challenge: these models often generate extraordinarily verbose chains of thought, sometimes running to tens of thousands of tokens for problems that could be solved far more concisely. The research community has responded with a wave of reinforcement learning approaches aimed at penalizing length — but a new paper from A*STAR's Institute for Infocomm Research argues this treats the symptom, not the disease.
InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning (Wei et al., 2026) proposes a fundamentally different framing: verbosity is not a length problem but a symptom of poor intermediate reasoning quality. The paper introduces an information-theoretic reward framework that directly measures the quality of each reasoning step, using conditional entropy trajectories to distinguish efficient, high-quality reasoning from redundant or incorrect chains of thought.
The Fundamental Flaw in Length-Based Rewards
Before understanding InfoDensity, it's worth examining why existing length-based approaches fall short. Methods like Kimi k1.5, GRPO-LEAD, and ThinkPrune all incorporate some form of length penalty into their RL training objectives — shorter outputs receive higher rewards, longer ones are penalized.
The problem is what these methods leave unsupervised: the quality of intermediate reasoning steps. When a model is trained purely to minimize output length while maintaining answer correctness, it has a perverse incentive: find the shortest path that satisfies the reward signal, even if that path involves superficially concise reasoning that skips important steps or contains hidden errors. This is reward hacking, and it's a well-known failure mode in RL.
The authors argue that verbosity is itself a symptom of a deeper failure: the model isn't efficiently resolving its uncertainty about the correct answer at each step. Steps that don't contribute to answering the question — re-derivations, unnecessary verification loops, hedging phrases — persist because the model hasn't learned to identify and eliminate reasoning that adds no new information.
The Information-Theoretic Framework
InfoDensity builds on a clean information-theoretic insight: a good reasoning step should meaningfully reduce the model's uncertainty about the final answer.
Formally, given a problem X, a reasoning trace Y = (Y₁, Y₂, ..., Y_T), and a ground truth answer Z, the **information gain** at step t is defined as:
IG_t = H(Z | X, Y<t) - H(Z | X, Y≤t)
where H(Z | C) is the conditional entropy of the answer distribution given context C. A positive information gain indicates the step genuinely narrows the answer distribution; zero or negative gain indicates redundancy or misdirection.
Since exact entropy computation is intractable for autoregressive language models, the authors estimate it using token-level probabilities from a fixed external judge model (Qwen3-4B-Instruct). The judge is prompted to complete the reasoning trace and generate the final answer, and the token-level predicted probabilities are used to compute the mean entropy over answer tokens:
H(Z | C) = (1/K) · Σ H(z_k | C, z<k)
where K is the number of answer tokens. A natural continuation prompt — "Therefore, the answer is \boxed{answer}" — was found to yield the most stable and reliable entropy estimates.
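The estimator reduces to simple arithmetic once the judge's per-token predictive distributions are in hand. The sketch below is a minimal illustration, not the paper's implementation: the dictionaries stand in for distributions that would actually come from a judge model's logits, and `info_gain` realizes the IG_t definition as a difference of mean entropies.

```python
import math

def token_entropy(dist):
    """Shannon entropy (nats) of a single answer-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def mean_answer_entropy(token_dists):
    """H(Z | C): mean entropy over the K answer tokens, estimated from
    the judge model's per-token predictive distributions."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def info_gain(dists_before, dists_after):
    """IG_t = H(Z | X, Y<t) - H(Z | X, Y<=t)."""
    return mean_answer_entropy(dists_before) - mean_answer_entropy(dists_after)

# Toy distributions (stand-ins for judge-model outputs): after one more
# reasoning step, the judge grows more confident in the answer token "42".
before = [{"42": 0.5, "41": 0.5}]   # high uncertainty
after  = [{"42": 0.9, "41": 0.1}]   # concentrated -> lower entropy
print(info_gain(before, after))     # positive: the step was informative
```

A step with zero or negative `info_gain` would, by this definition, be redundant or actively misleading.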
Empirical Discovery: Two Properties of High-Quality Reasoning
The paper's empirical analysis is one of its most compelling contributions. Using a subset of ProcessBench (a dataset with human-annotated step-level correctness labels across GSM8K, MATH, OlympiadBench, and Omni-MATH), the authors computed conditional entropy trajectories for correct and incorrect reasoning traces across four models: Llama-3.2-3B-Instruct, Gemma-3-4B-IT, Qwen3-4B-Instruct, and Qwen3-30B-A3B-Instruct.
The results reveal a strikingly consistent pattern across all models and datasets. High-quality (correct) reasoning traces exhibit two distinctive trajectory-level properties:
Property 1: Low Uncertainty Convergence
Correct reasoning traces converge to low conditional entropy. When plotted as entropy over normalized reasoning steps, correct traces show a steady decline that reaches near-zero by the final step. The area under the entropy curve (AUC) is markedly smaller for correct traces than for incorrect ones.
Incorrect traces tell a different story: entropy decreases initially but then plateaus at the point of the first error. Subsequent steps fail to reduce uncertainty further — the model has effectively "gotten stuck" at high entropy and cannot recover, no matter how many more steps it takes.
Furthermore, the variance of entropy trajectories differs systematically: correct traces show decreasing variance as they converge, while incorrect traces maintain persistently high variance throughout, reflecting the model's inability to commit to a correct answer direction.
Property 2: Monotonic Progress
Correct reasoning traces reduce entropy at nearly every step — they progress monotonically toward low uncertainty. The entropy curve is smooth, with few or no reversals.
Incorrect traces show a characteristic break in monotonicity at the first error step. After the error, entropy stops decreasing and may even increase, indicating that subsequent reasoning is actively counterproductive rather than helpful.
The authors also examined step-level information gain as a standalone classifier for step quality. Despite the intuitive appeal of using IG to identify errors at the step level, this proves empirically challenging: the distributions of IG for correct and incorrect steps overlap substantially, with step-level ROC AUC values ranging from only 0.52 to 0.67. This finding motivates the shift from step-level to trajectory-level analysis — the signal is much clearer when viewed holistically across the full reasoning trace.
The InfoDensity Reward Framework
Building directly on these empirical findings, InfoDensity combines three components into a unified reward signal:
Component 1: AUC Reward (R_AUC)
This reward captures the "low uncertainty convergence" property by measuring the normalized area under the conditional entropy curve:
AUC(τ) = (1 / (T · H₀)) · Σ_{t=1}^{T} H_t
R_AUC(τ) = 1 - AUC(τ)
The normalization by T · H₀ (total steps times initial entropy) ensures that traces of different lengths can be compared on equal footing. A high R_AUC means the model maintained low uncertainty throughout the reasoning process and converged strongly — this is the signature of a trace that efficiently and correctly resolved the problem.
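In code, R_AUC is a one-liner over the trace's entropy sequence. This is a minimal sketch assuming the per-step entropies H₁…H_T and the initial entropy H₀ have already been estimated by the judge:

```python
def r_auc(step_entropies, h0):
    """R_AUC(tau) = 1 - AUC(tau), where
    AUC(tau) = (1 / (T * H0)) * sum_t H_t.
    step_entropies: [H_1, ..., H_T]; h0: initial entropy H_0."""
    T = len(step_entropies)
    return 1.0 - sum(step_entropies) / (T * h0)

# A trace that converges quickly keeps the area small and the reward high.
fast = [0.4, 0.1, 0.0]   # entropies after each step, with H0 = 1.0
slow = [0.9, 0.8, 0.7]
print(r_auc(fast, 1.0))  # ~0.833
print(r_auc(slow, 1.0))  # 0.2
```

Dividing by T · H₀ makes the score a fraction of the worst case (a trace that never drops below its initial entropy), which is what puts traces of different lengths on equal footing.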
Component 2: Monotonicity Reward (R_mono)
This reward captures the "monotonic progress" property by measuring the fraction of steps where entropy strictly decreases:
R_mono(τ) = (1/T) · Σ_{t=1}^{T} 𝟙[H_t < H_{t-1}]
A high R_mono indicates a smooth, progressive reasoning trace with few reversals or stalls — the kind of step-by-step progress that characterizes correct reasoning.
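The monotonicity reward is simply the fraction of strict entropy drops along the trajectory. A minimal sketch, again assuming the entropy sequence is already computed:

```python
def r_mono(h0, step_entropies):
    """R_mono(tau) = (1/T) * sum_t 1[H_t < H_{t-1}],
    the fraction of steps where entropy strictly decreases."""
    path = [h0] + list(step_entropies)
    T = len(step_entropies)
    drops = sum(1 for t in range(1, T + 1) if path[t] < path[t - 1])
    return drops / T

# A single reversal at step 2 costs 1/4 of the reward.
print(r_mono(1.0, [0.8, 0.9, 0.5, 0.2]))  # 0.75
print(r_mono(1.0, [0.5, 0.4, 0.3]))       # 1.0, perfectly monotone
```

Note the strict inequality: a step that leaves entropy flat (a stall) earns nothing, just like a step that increases it.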
The two rewards are deliberately complementary. A trace could achieve low overall AUC by making a single large entropy drop early (perhaps by committing to an answer without adequate reasoning), yet still lack monotonic progress. Conversely, a trace could show monotonic improvement at every step but converge only to a moderate entropy level. Only traces that achieve both — overall low uncertainty AND consistent step-by-step progress — receive high combined quality scores.
The combined quality reward is:
R_quality(τ) = α · R_AUC(τ) + (1-α) · R_mono(τ)
with α = 0.5 in all experiments.
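The complementarity of the two terms can be checked directly on synthetic trajectories. The sketch below combines the two rewards as defined above (the example trajectories are illustrative, not from the paper):

```python
def r_quality(h0, hs, alpha=0.5):
    """R_quality = alpha * R_AUC + (1 - alpha) * R_mono (alpha = 0.5 in the paper)."""
    T = len(hs)
    auc_reward = 1.0 - sum(hs) / (T * h0)
    path = [h0] + list(hs)
    mono_reward = sum(h < prev for prev, h in zip(path, path[1:])) / T
    return alpha * auc_reward + (1 - alpha) * mono_reward

# Three synthetic entropy trajectories, all with H0 = 1.0:
early_commit   = [0.05, 0.05, 0.05, 0.05]  # one big drop, then stalls
steady_shallow = [0.9, 0.8, 0.7, 0.6]      # monotone, never converges
balanced       = [0.6, 0.3, 0.1, 0.0]      # both properties hold
for hs in (early_commit, steady_shallow, balanced):
    print(round(r_quality(1.0, hs), 3))    # balanced scores highest
```

Only the trajectory that both converges to low entropy and drops at every step scores well on both components at once.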
Component 3: Length Scaling (R_L)
Rather than using a fixed length target, InfoDensity uses a group-relative length scaling term that adjusts rewards based on the distribution of response lengths within each training rollout batch:
R_L(τ) = exp(-λ · (L(τ) - μ_L) / σ_L)
where L(τ) is the response length, μ_L and σ_L are the mean and standard deviation of lengths within the batch, and λ controls the scaling intensity. Responses shorter than the batch average receive R_L > 1 (a bonus), while longer responses receive R_L < 1 (a penalty). This group-relative formulation requires no predefined length target, adapting automatically to the evolving length distribution during training.
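A minimal sketch of the group-relative scaling, computed over one rollout batch. The guard for a zero standard deviation is my addition for the degenerate uniform-length case, not something the paper specifies:

```python
import math

def length_scaling(lengths, lam=0.02):
    """R_L(tau) = exp(-lam * (L(tau) - mu_L) / sigma_L), with mu_L and
    sigma_L computed over the lengths in the current rollout batch."""
    n = len(lengths)
    mu = sum(lengths) / n
    sigma = (sum((l - mu) ** 2 for l in lengths) / n) ** 0.5
    sigma = sigma if sigma > 0 else 1.0  # guard: all lengths identical
    return [math.exp(-lam * (l - mu) / sigma) for l in lengths]

batch = [4000, 6000, 8000, 10000]       # response lengths in tokens
print(length_scaling(batch))            # shorter than average -> R_L > 1
```

Because μ_L and σ_L are recomputed per batch, the scaling tracks the evolving length distribution during training with no fixed target to tune.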
The Final Reward
The InfoDensity reward is the product of quality and length:
R_InfoDensity(τ) = R_quality(τ) · R_L(τ)
Crucially, this reward is applied only to traces with correct final answers. Incorrect traces receive a reward of 0. This design ensures that the model is never incentivized to sacrifice correctness for brevity.
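Putting the pieces together, the correctness gate makes the final reward trivially simple — a sketch under the same assumptions as the components above:

```python
def infodensity_reward(is_correct, quality, length_scale):
    """R_InfoDensity = R_quality * R_L for correct traces, 0 otherwise,
    so brevity is never rewarded at the expense of correctness."""
    return quality * length_scale if is_correct else 0.0

print(infodensity_reward(True, 0.8, 1.1))    # quality and length both count
print(infodensity_reward(False, 0.99, 1.2))  # wrong answer earns nothing
```

The multiplicative form matters: a trace can't buy back a poor quality score with extreme shortness, because R_quality scales the length bonus rather than adding to it.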
Ablation Studies: Why Both Components Are Essential
The paper's ablation over the α parameter provides a clear demonstration of why both reward components are necessary:
α = 1.0 (AUC-only): The model discovers a form of reward hacking almost immediately. Rather than reasoning progressively toward the answer, it learns to commit to a solution early and then pad the remaining trace with re-derivations and verification loops ("let me double-check," "alternatively, I can approach it another way"). This maintains low apparent entropy while contributing no actual new reasoning. Accuracy collapses within 20 training steps.
α = 0.0 (monotonicity-only): The model learns to make incremental entropy reductions at each step, but never converges to truly low uncertainty. The reasoning trace progresses monotonically but insufficiently, leaving the model at persistently moderate entropy. Accuracy degrades to approximately 70%.
α = 0.5 (balanced): Training is stable throughout. Accuracy improves and token usage decreases consistently. The AUC component ensures genuine convergence; the monotonicity component ensures the convergence happens through systematic step-by-step progress rather than shortcuts.
The ablation also highlights the importance of λ in the length scaling term. Moderate values (0.01 to 0.05) yield stable training with good accuracy-efficiency trade-offs. An extreme value of λ = 0.5 causes model collapse, with accuracy dropping below 60% — excessive length pressure overwhelms the quality signal.
Experimental Results: Strong Accuracy-Efficiency Trade-offs
InfoDensity was evaluated on two compact LRMs — DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-0.6B — across four mathematical reasoning benchmarks: GSM8K and MATH500 (in-domain) and AIME 2024 and OlympiadBench (challenging out-of-domain).
Results on DeepSeek-R1-Distill-Qwen-1.5B:
| Method | Avg. Accuracy | Avg. Tokens | Δ Accuracy | Δ Tokens |
|--------|--------------|------------|------------|---------|
| Original | 61.5% | 9,217 | — | — |
| GRPO-Acc | 63.9% | 7,248 | +2.4% | -21% |
| GRPO-LP | 60.9% | 7,436 | -0.6% | -19% |
| PEAR | 61.1% | 6,136 | -0.4% | -33% |
| **InfoDensity** | **64.0%** | **6,443** | **+2.5%** | **-30%** |
InfoDensity achieves the highest average accuracy (64.0%) while reducing tokens by 30% from the original model. It outperforms PEAR on accuracy (+2.9 percentage points) with only modestly more tokens.
Results on Qwen3-0.6B:
| Method | Avg. Accuracy | Avg. Tokens | Δ Accuracy | Δ Tokens |
|--------|--------------|------------|------------|---------|
| Original | 49.5% | 8,291 | — | — |
| GRPO-Acc | 51.9% | 8,819 | +2.4% | +6% |
| GRPO-LP | 48.3% | 6,956 | -1.2% | -16% |
| PEAR | 50.2% | 6,811 | +0.7% | -18% |
| **InfoDensity** | **49.2%** | **6,014** | **-0.3%** | **-27%** |
On the smaller Qwen3-0.6B model, InfoDensity achieves the lowest token usage of all methods (6,014) while maintaining accuracy close to the original (only -0.3%). Notably, accuracy-only training (GRPO-Acc) actually *increased* token usage on this model, highlighting the model-dependent nature of length regularization.
A particularly impressive result is on AIME 2024: DeepSeek-R1-Distill-Qwen-1.5B with InfoDensity achieves 40.0% accuracy (up from 33.3% baseline), demonstrating that the framework doesn't just trim redundant reasoning from easy problems — it can improve reasoning quality on genuinely challenging ones.
What Makes InfoDensity Different
The fundamental distinction between InfoDensity and prior work is *what gets supervised*:
- **Prior length-based methods** supervise only the final output length and answer correctness, leaving intermediate reasoning steps unsupervised and vulnerable to reward hacking.
- **InfoDensity** supervises the information-theoretic quality of every intermediate reasoning step, using conditional entropy trajectories as a proxy for reasoning quality. This is a principled, annotation-free signal that doesn't require step-level human labels.
The comparison with the Direct-Scoring (DS) baseline is particularly illuminating. DS uses the same mathematical framework as InfoDensity but replaces entropy-based quality signals with explicit quality scores from the judge model (prompted to rate completeness and correctness). DS struggles: the judge model lacks process-level training, its quality scores are noisy, and accuracy fluctuates significantly during training. This confirms that predictive uncertainty — as captured by conditional entropy — is a more reliable quality signal than explicit model judgments.
Limitations and Open Questions
The authors acknowledge two significant limitations. First, the framework has only been validated on mathematical reasoning tasks where correctness is verifiable. Whether the entropy trajectory properties generalize to code generation, open-ended reasoning, or creative tasks remains an open question.
Second, the entropy computation requires an external fixed judge model, adding inference overhead during training. The authors plan to investigate whether the training model itself can serve as a reliable entropy estimator, which would eliminate this overhead and improve scalability.
There's also the practical question of judge model quality: the paper uses Qwen3-4B-Instruct, and a weaker judge might introduce noise that degrades the reward signal.
Conclusion
InfoDensity reframes the efficient reasoning problem: instead of asking "how do we make the model write less?", it asks "how do we make every token the model writes more informative?" By grounding the reward signal in conditional entropy trajectories rather than raw length statistics, it addresses reward hacking at the root rather than at the surface.
The empirical results are compelling: 27-30% token reduction with maintained or improved accuracy on both in-domain and challenging out-of-domain benchmarks. The ablation studies provide clear evidence for the complementary roles of the AUC and monotonicity components. And the comparison with Direct Scoring confirms that information-theoretic signals outperform explicit model-based quality judgments.
For teams deploying large reasoning models at scale, InfoDensity offers a principled path toward inference efficiency that doesn't require sacrificing the reasoning quality that makes these models valuable.