InfoDensity: Rewarding Information-Dense Traces for Efficient LLM Reasoning
LLMs with extended reasoning often generate verbose, redundant traces. InfoDensity introduces an information-theoretic RL reward framework that measures the conditional entropy of the answer distribution across reasoning steps. High-quality traces show two properties: low uncertainty convergence (small area under the entropy curve) and monotonic progress (entropy decreasing at each step). Combining an AUC-based reward, a monotonicity reward, and a group-relative length scaling factor, InfoDensity achieves 27-30% token reduction while maintaining or improving accuracy — without the reward hacking vulnerability of pure length-penalty approaches.
When More Reasoning Isn't Better: InfoDensity's Information-Theoretic Approach to Efficient LLM Reasoning
The explosive growth of large reasoning models (LRMs) has brought with it an unexpected challenge: these models often generate extraordinarily verbose chains of thought, sometimes running to tens of thousands of tokens for problems that could be solved far more concisely. The research community has responded with a wave of reinforcement learning approaches aimed at penalizing length — but a new paper from A*STAR's Institute for Infocomm Research argues this treats the symptom, not the disease.
InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning (Wei et al., 2026) proposes a fundamentally different framing: verbosity is not a length problem but a symptom of poor intermediate reasoning quality. The paper introduces an information-theoretic reward framework that directly measures the quality of each reasoning step, using conditional entropy trajectories to distinguish efficient, high-quality reasoning from redundant or incorrect chains of thought.
The Fundamental Flaw in Length-Based Rewards
Before understanding InfoDensity, it's worth examining why existing length-based approaches fall short. Methods like Kimi k1.5, GRPO-LEAD, and ThinkPrune all incorporate some form of length penalty into their RL training objectives — shorter outputs receive higher rewards, longer ones are penalized.
The problem is what these methods leave unsupervised: the quality of intermediate reasoning steps. When a model is trained purely to minimize output length while maintaining answer correctness, it has a perverse incentive: find the shortest path that satisfies the reward signal, even if that path involves superficially concise reasoning that skips important steps or contains hidden errors. This is reward hacking, and it's a well-known failure mode in RL.
The authors argue that verbosity is itself a symptom of a deeper failure: the model isn't efficiently resolving its uncertainty about the correct answer at each step. Steps that don't contribute to answering the question — re-derivations, unnecessary verification loops, hedging phrases — persist because the model hasn't learned to identify and eliminate reasoning that adds no new information.
The Information-Theoretic Framework
InfoDensity builds on a clean information-theoretic insight: a good reasoning step should meaningfully reduce the model's uncertainty about the final answer.
Formally, given a problem X, a reasoning trace Y = (Y₁, Y₂, ..., Y_T), and a ground truth answer Z, the **information gain** at step t is defined as:
IG_t = H(Z | X, Y<t) - H(Z | X, Y≤t)
where H(Z | C) is the conditional entropy of the answer distribution given context C. A positive information gain indicates the step genuinely narrows the answer distribution; zero or negative gain indicates redundancy or misdirection.
Since exact entropy computation is intractable for autoregressive language models, the authors estimate it using token-level probabilities from a fixed external judge model (Qwen3-4B-Instruct). The judge is prompted to complete the reasoning trace and generate the final answer, and the token-level predicted probabilities are used to compute the mean entropy over answer tokens:
H(Z | C) = (1/K) · Σ H(z_k | C, z<k)
where K is the number of answer tokens. A natural continuation prompt — "Therefore, the answer is \boxed{answer}" — was found to yield the most stable and reliable entropy estimates.
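The estimator reduces to simple arithmetic once the judge's per-token predictive distributions are in hand. The sketch below is a minimal illustration, not the paper's implementation: the dictionaries stand in for distributions that would actually come from a judge model's logits, and `info_gain` realizes the IG_t definition as a difference of mean entropies.

```python
import math

def token_entropy(dist):
    """Shannon entropy (nats) of a single answer-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def mean_answer_entropy(token_dists):
    """H(Z | C): mean entropy over the K answer tokens, estimated from
    the judge model's per-token predictive distributions."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def info_gain(dists_before, dists_after):
    """IG_t = H(Z | X, Y<t) - H(Z | X, Y<=t)."""
    return mean_answer_entropy(dists_before) - mean_answer_entropy(dists_after)

# Toy distributions (stand-ins for judge-model outputs): after one more
# reasoning step, the judge grows more confident in the answer token "42".
before = [{"42": 0.5, "41": 0.5}]   # high uncertainty
after  = [{"42": 0.9, "41": 0.1}]   # concentrated -> lower entropy
print(info_gain(before, after))     # positive: the step was informative
```

A step with zero or negative `info_gain` would, by this definition, be redundant or actively misleading.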
Empirical Discovery: Two Properties of High-Quality Reasoning
The paper's empirical analysis is one of its most compelling contributions. Using a subset of ProcessBench (a dataset with human-annotated step-level correctness labels across GSM8K, MATH, OlympiadBench, and Omni-MATH), the authors computed conditional entropy trajectories for correct and incorrect reasoning traces across four models: Llama-3.2-3B-Instruct, Gemma-3-4B-IT, Qwen3-4B-Instruct, and Qwen3-30B-A3B-Instruct.
The results reveal a strikingly consistent pattern across all models and datasets. High-quality (correct) reasoning traces exhibit two distinctive trajectory-level properties:
Property 1: Low Uncertainty Convergence
Correct reasoning traces converge to low conditional entropy. When plotted as entropy over normalized reasoning steps, correct traces show a steady decline that reaches near-zero by the final step. The area under the entropy curve (AUC) is markedly smaller for correct traces than for incorrect ones.
Incorrect traces tell a different story: entropy decreases initially but then plateaus at the point of the first error. Subsequent steps fail to reduce uncertainty further — the model has effectively "gotten stuck" at high entropy and cannot recover, no matter how many more steps it takes.
Furthermore, the variance of entropy trajectories differs systematically: correct traces show decreasing variance as they converge, while incorrect traces maintain persistently high variance throughout, reflecting the model's inability to commit to a correct answer direction.
Property 2: Monotonic Progress
Correct reasoning traces reduce entropy at nearly every step — they progress monotonically toward low uncertainty. The entropy curve is smooth, with few or no reversals.
Incorrect traces show a characteristic break in monotonicity at the first error step. After the error, entropy stops decreasing and may even increase, indicating that subsequent reasoning is actively counterproductive rather than helpful.
The authors also examined step-level information gain as a standalone classifier for step quality. Despite the intuitive appeal of using IG to identify errors at the step level, this proves empirically challenging: the distributions of IG for correct and incorrect steps overlap substantially, with step-level ROC AUC values ranging from only 0.52 to 0.67. This finding motivates the shift from step-level to trajectory-level analysis — the signal is much clearer when viewed holistically across the full reasoning trace.
The InfoDensity Reward Framework
Building directly on these empirical findings, InfoDensity combines three components into a unified reward signal:
Component 1: AUC Reward (R_AUC)
This reward captures the "low uncertainty convergence" property by measuring the normalized area under the conditional entropy curve:
AUC(τ) = (1 / (T · H₀)) · Σ_{t=1}^{T} H_t
R_AUC(τ) = 1 - AUC(τ)
The normalization by T · H₀ (total steps times initial entropy) ensures that traces of different lengths can be compared on equal footing. A high R_AUC means the model maintained low uncertainty throughout the reasoning process and converged strongly — this is the signature of a trace that efficiently and correctly resolved the problem.
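In code, R_AUC is a one-liner over the trace's entropy sequence. This is a minimal sketch assuming the per-step entropies H₁…H_T and the initial entropy H₀ have already been estimated by the judge:

```python
def r_auc(step_entropies, h0):
    """R_AUC(tau) = 1 - AUC(tau), where
    AUC(tau) = (1 / (T * H0)) * sum_t H_t.
    step_entropies: [H_1, ..., H_T]; h0: initial entropy H_0."""
    T = len(step_entropies)
    return 1.0 - sum(step_entropies) / (T * h0)

# A trace that converges quickly keeps the area small and the reward high.
fast = [0.4, 0.1, 0.0]   # entropies after each step, with H0 = 1.0
slow = [0.9, 0.8, 0.7]
print(r_auc(fast, 1.0))  # ~0.833
print(r_auc(slow, 1.0))  # 0.2
```

Dividing by T · H₀ makes the score a fraction of the worst case (a trace that never drops below its initial entropy), which is what puts traces of different lengths on equal footing.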
Component 2: Monotonicity Reward (R_mono)
This reward captures the "monotonic progress" property by measuring the fraction of steps where entropy strictly decreases:
R_mono(τ) = (1/T) · Σ_{t=1}^{T} 𝟙[H_t < H_{t-1}]
A high R_mono indicates a smooth, progressive reasoning trace with few reversals or stalls — the kind of step-by-step progress that characterizes correct reasoning.
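The monotonicity reward is simply the fraction of strict entropy drops along the trajectory. A minimal sketch, again assuming the entropy sequence is already computed:

```python
def r_mono(h0, step_entropies):
    """R_mono(tau) = (1/T) * sum_t 1[H_t < H_{t-1}],
    the fraction of steps where entropy strictly decreases."""
    path = [h0] + list(step_entropies)
    T = len(step_entropies)
    drops = sum(1 for t in range(1, T + 1) if path[t] < path[t - 1])
    return drops / T

# A single reversal at step 2 costs 1/4 of the reward.
print(r_mono(1.0, [0.8, 0.9, 0.5, 0.2]))  # 0.75
print(r_mono(1.0, [0.5, 0.4, 0.3]))       # 1.0, perfectly monotone
```

Note the strict inequality: a step that leaves entropy flat (a stall) earns nothing, just like a step that increases it.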
The two rewards are deliberately complementary. A trace could achieve low overall AUC by making a single large entropy drop early (perhaps by committing to an answer without adequate reasoning), yet still lack monotonic progress. Conversely, a trace could show monotonic improvement at every step but converge only to a moderate entropy level. Only traces that achieve both — overall low uncertainty AND consistent step-by-step progress — receive high combined quality scores.
The combined quality reward is:
R_quality(τ) = α · R_AUC(τ) + (1-α) · R_mono(τ)
with α = 0.5 in all experiments.
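The complementarity of the two terms can be checked directly on synthetic trajectories. The sketch below combines the two rewards as defined above (the example trajectories are illustrative, not from the paper):

```python
def r_quality(h0, hs, alpha=0.5):
    """R_quality = alpha * R_AUC + (1 - alpha) * R_mono (alpha = 0.5 in the paper)."""
    T = len(hs)
    auc_reward = 1.0 - sum(hs) / (T * h0)
    path = [h0] + list(hs)
    mono_reward = sum(h < prev for prev, h in zip(path, path[1:])) / T
    return alpha * auc_reward + (1 - alpha) * mono_reward

# Three synthetic entropy trajectories, all with H0 = 1.0:
early_commit   = [0.05, 0.05, 0.05, 0.05]  # one big drop, then stalls
steady_shallow = [0.9, 0.8, 0.7, 0.6]      # monotone, never converges
balanced       = [0.6, 0.3, 0.1, 0.0]      # both properties hold
for hs in (early_commit, steady_shallow, balanced):
    print(round(r_quality(1.0, hs), 3))    # balanced scores highest
```

Only the trajectory that both converges to low entropy and drops at every step scores well on both components at once.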
Component 3: Length Scaling (R_L)
Rather than using a fixed length target, InfoDensity uses a group-relative length scaling term that adjusts rewards based on the distribution of response lengths within each training rollout batch:
R_L(τ) = exp(-λ · (L(τ) - μ_L) / σ_L)
where L(τ) is the response length, μ_L and σ_L are the mean and standard deviation of lengths within the batch, and λ controls the scaling intensity. Responses shorter than the batch average receive R_L > 1 (a bonus), while longer responses receive R_L < 1 (a penalty). This group-relative formulation requires no predefined length target, adapting automatically to the evolving length distribution during training.
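A minimal sketch of the group-relative scaling, computed over one rollout batch. The guard for a zero standard deviation is my addition for the degenerate uniform-length case, not something the paper specifies:

```python
import math

def length_scaling(lengths, lam=0.02):
    """R_L(tau) = exp(-lam * (L(tau) - mu_L) / sigma_L), with mu_L and
    sigma_L computed over the lengths in the current rollout batch."""
    n = len(lengths)
    mu = sum(lengths) / n
    sigma = (sum((l - mu) ** 2 for l in lengths) / n) ** 0.5
    sigma = sigma if sigma > 0 else 1.0  # guard: all lengths identical
    return [math.exp(-lam * (l - mu) / sigma) for l in lengths]

batch = [4000, 6000, 8000, 10000]       # response lengths in tokens
print(length_scaling(batch))            # shorter than average -> R_L > 1
```

Because μ_L and σ_L are recomputed per batch, the scaling tracks the evolving length distribution during training with no fixed target to tune.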
The Final Reward
The InfoDensity reward is the product of quality and length:
R_InfoDensity(τ) = R_quality(τ) · R_L(τ)
Crucially, this reward is applied only to traces with correct final answers. Incorrect traces receive a reward of 0. This design ensures that the model is never incentivized to sacrifice correctness for brevity.
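Putting the pieces together, the correctness gate makes the final reward trivially simple — a sketch under the same assumptions as the components above:

```python
def infodensity_reward(is_correct, quality, length_scale):
    """R_InfoDensity = R_quality * R_L for correct traces, 0 otherwise,
    so brevity is never rewarded at the expense of correctness."""
    return quality * length_scale if is_correct else 0.0

print(infodensity_reward(True, 0.8, 1.1))    # quality and length both count
print(infodensity_reward(False, 0.99, 1.2))  # wrong answer earns nothing
```

The multiplicative form matters: a trace can't buy back a poor quality score with extreme shortness, because R_quality scales the length bonus rather than adding to it.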
Ablation Studies: Why Both Components Are Essential
The paper's ablation over the α parameter provides a clear demonstration of why both reward components are necessary:
α = 1.0 (AUC-only): The model discovers a form of reward hacking almost immediately. Rather than reasoning progressively toward the answer, it learns to commit to a solution early and then pad the remaining trace with re-derivations and verification loops ("let me double-check," "alternatively, I can approach it another way"). This maintains low apparent entropy while contributing no actual new reasoning. Accuracy collapses within 20 training steps.
α = 0.0 (monotonicity-only): The model learns to make incremental entropy reductions at each step, but never converges to truly low uncertainty. The reasoning trace progresses monotonically but insufficiently, leaving the model at persistently moderate entropy. Accuracy degrades to approximately 70%.
α = 0.5 (balanced): Training is stable throughout. Accuracy improves and token usage decreases consistently. The AUC component ensures genuine convergence; the monotonicity component ensures the convergence happens through systematic step-by-step progress rather than shortcuts.
The ablation also highlights the importance of λ in the length scaling term. Moderate values (0.01 to 0.05) yield stable training with good accuracy-efficiency trade-offs. An extreme value of λ = 0.5 causes model collapse, with accuracy dropping below 60% — excessive length pressure overwhelms the quality signal.
Experimental Results: Strong Accuracy-Efficiency Trade-offs
InfoDensity was evaluated on two compact LRMs — DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-0.6B — across four mathematical reasoning benchmarks: GSM8K and MATH500 (in-domain) and AIME 2024 and OlympiadBench (challenging out-of-domain).
Results on DeepSeek-R1-Distill-Qwen-1.5B:
| Method | Avg. Accuracy | Avg. Tokens | Δ Accuracy | Δ Tokens |
|--------|--------------|------------|------------|---------|
| Original | 61.5% | 9,217 | — | — |
| GRPO-Acc | 63.9% | 7,248 | +2.4% | -21% |
| GRPO-LP | 60.9% | 7,436 | -0.6% | -19% |
| PEAR | 61.1% | 6,136 | -0.4% | -33% |
| **InfoDensity** | **64.0%** | **6,443** | **+2.5%** | **-30%** |
InfoDensity achieves the highest average accuracy (64.0%) while reducing tokens by 30% from the original model. It outperforms PEAR on accuracy (+2.9 percentage points) with only modestly more tokens.
Results on Qwen3-0.6B:
| Method | Avg. Accuracy | Avg. Tokens | Δ Accuracy | Δ Tokens |
|--------|--------------|------------|------------|---------|
| Original | 49.5% | 8,291 | — | — |
| GRPO-Acc | 51.9% | 8,819 | +2.4% | +6% |
| GRPO-LP | 48.3% | 6,956 | -1.2% | -16% |
| PEAR | 50.2% | 6,811 | +0.7% | -18% |
| **InfoDensity** | **49.2%** | **6,014** | **-0.3%** | **-27%** |
On the smaller Qwen3-0.6B model, InfoDensity achieves the lowest token usage of all methods (6,014) while maintaining accuracy close to the original (only -0.3%). Notably, accuracy-only training (GRPO-Acc) actually *increased* token usage on this model, highlighting the model-dependent nature of length regularization.
A particularly impressive result is on AIME 2024: DeepSeek-R1-Distill-Qwen-1.5B with InfoDensity achieves 40.0% accuracy (up from 33.3% baseline), demonstrating that the framework doesn't just trim redundant reasoning from easy problems — it can improve reasoning quality on genuinely challenging ones.
What Makes InfoDensity Different
The fundamental distinction between InfoDensity and prior work is *what gets supervised*:
- **Prior length-based methods** supervise only the final output length and answer correctness, leaving intermediate reasoning steps unsupervised and vulnerable to reward hacking.
- **InfoDensity** supervises the information-theoretic quality of every intermediate reasoning step, using conditional entropy trajectories as a proxy for reasoning quality. This is a principled, annotation-free signal that doesn't require step-level human labels.
The comparison with the Direct-Scoring (DS) baseline is particularly illuminating. DS uses the same mathematical framework as InfoDensity but replaces entropy-based quality signals with explicit quality scores from the judge model (prompted to rate completeness and correctness). DS struggles: the judge model lacks process-level training, its quality scores are noisy, and accuracy fluctuates significantly during training. This confirms that predictive uncertainty — as captured by conditional entropy — is a more reliable quality signal than explicit model judgments.
Limitations and Open Questions
The authors acknowledge two significant limitations. First, the framework has only been validated on mathematical reasoning tasks where correctness is verifiable. Whether the entropy trajectory properties generalize to code generation, open-ended reasoning, or creative tasks remains an open question.
Second, the entropy computation requires an external fixed judge model, adding inference overhead during training. The authors plan to investigate whether the training model itself can serve as a reliable entropy estimator, which would eliminate this overhead and improve scalability.
There's also the practical question of judge model quality: the paper uses Qwen3-4B-Instruct, and a weaker judge might introduce noise that degrades the reward signal.
Conclusion
InfoDensity reframes the efficient reasoning problem: instead of asking "how do we make the model write less?", it asks "how do we make every token the model writes more informative?" By grounding the reward signal in conditional entropy trajectories rather than raw length statistics, it addresses reward hacking at the root rather than at the surface.
The empirical results are compelling: 27-30% token reduction with maintained or improved accuracy on both in-domain and challenging out-of-domain benchmarks. The ablation studies provide clear evidence for the complementary roles of the AUC and monotonicity components. And the comparison with Direct Scoring confirms that information-theoretic signals outperform explicit model-based quality judgments.
For teams deploying large reasoning models at scale, InfoDensity offers a principled path toward inference efficiency that doesn't require sacrificing the reasoning quality that makes these models valuable.