Minimal RLVR Training Needed: Boosting LLM Reasoning via Rank-1 Trajectory Extrapolation

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as the dominant paradigm for boosting reasoning in large language models, yet the geometric properties of its parameter trajectories have long gone unexplored. This paper reveals that RLVR weight trajectories exhibit remarkably low rank and high predictability: downstream performance gains are primarily captured by a rank-1 approximation of parameter increments, with projection magnitude evolving nearly linearly across training steps. Leveraging this insight, the authors propose RELEX, a computation-efficient method that estimates the rank-1 subspace from a short observation window and uses linear regression to extrapolate future checkpoints without any additional training. Experiments on Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base show that RELEX needs only 15% of full RLVR training steps to match or surpass complete RLVR performance on both in-domain and out-of-domain benchmarks. Remarkably, RELEX extrapolates to steps 10–20× beyond the observation window at zero additional training cost; for instance, it predicts performance at step 1000 after observing only the first 50 steps. Ablation studies confirm that increasing subspace rank or employing non-linear modeling yields no further gains. The method's success stems from the denoising effect of rank-1 projection on stochastic optimization noise.

Background and Context

Reinforcement Learning with Verifiable Rewards (RLVR) has firmly established itself as the dominant paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its proven efficacy in improving mathematical reasoning and logical deduction, the academic community has largely overlooked the intrinsic geometric structure of the parameter update trajectories generated during this process. Traditional research efforts have predominantly focused on designing more complex reward functions or refining optimization algorithms, neglecting the fundamental laws governing how model weights evolve during training. This gap in understanding has left a critical question unanswered: what is the true geometric nature of the path taken by model parameters when optimized via RLVR?

Recent investigations have begun to challenge the assumption that these trajectories are high-dimensional, chaotic random walks. Instead, the evidence suggests that parameter updates exhibit remarkably low-rank structure and a high degree of predictability. This insight shifts the focus from algorithmic complexity to geometric simplicity: the vast majority of performance gains on downstream tasks can be captured by a rank-1 approximation of the parameter increments, and the magnitude of the projection onto that direction evolves nearly linearly across training steps. Together, these findings change the perspective on how LLMs learn complex reasoning skills through reinforcement learning.

Deep Analysis

The core theoretical contribution of this research is a systematic account of how geometrically simple RLVR training turns out to be. The study demonstrates that model weight updates do not scatter randomly in high-dimensional space but instead concentrate along a single dominant direction. This rank-1 structure implies that the multi-dimensional adjustments behind reasoning improvements are effectively driven by one primary vector of change. The projection of the updates onto that direction grows almost linearly with the number of training steps, which makes it possible to predict future model states without continuing to train.
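
One way to write these two observations down is sketched below. The notation is an assumption of this summary, not necessarily the paper's own: \theta_t denotes the flattened weights at training step t, v a unit vector for the dominant direction, c_t the projection magnitude, and a, b the coefficients of a fitted line.

```latex
% Rank-1 structure: the increment from the initial weights lies mostly along v.
\Delta\theta_t \;=\; \theta_t - \theta_0 \;\approx\; c_t\, v,
\qquad \lVert v \rVert_2 = 1,
\qquad c_t \;=\; \langle \Delta\theta_t,\, v \rangle .

% Near-linear growth of the projection magnitude over training steps.
c_t \;\approx\; a\, t + b .

% Combining the two gives an estimate of the weights at a future step T.
\hat{\theta}_T \;=\; \theta_0 + (a\, T + b)\, v .
```

Under this reading, predicting a future checkpoint requires only the direction v and the two scalars a and b, all of which can be estimated from early checkpoints.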

Building on this geometric insight, the authors propose RELEX (REinforcement Learning EXtrapolation), a novel method designed on the philosophy of "less is more." RELEX abandons the traditional, lengthy iterative training process in favor of an observation-based extrapolation strategy. The method operates by collecting early weight update data within a very short observation window and utilizing techniques such as Singular Value Decomposition (SVD) to estimate the rank-1 subspace of parameter changes. Once this subspace is identified, linear regression is employed to fit the evolution trend of the projection magnitude over training steps, allowing for the prediction of weight states at any future step.
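
The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data and not the authors' implementation: the function name `rank1_linear_extrapolate`, the choice to flatten all weights into a single vector, and the synthetic trajectory are assumptions made for this example. The paper may, for instance, apply the decomposition per weight matrix.

```python
import numpy as np


def rank1_linear_extrapolate(theta0, checkpoints, steps, target_step):
    """Extrapolate flattened weights to target_step (illustrative sketch).

    theta0      : (d,) initial weights
    checkpoints : (n, d) weights saved inside the observation window
    steps       : (n,) training step of each checkpoint
    """
    deltas = np.asarray(checkpoints) - theta0  # parameter increments, shape (n, d)
    # The top right singular vector is the dominant update direction.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]
    magnitudes = deltas @ direction  # projection magnitude per checkpoint
    # Least-squares line: magnitude ~ slope * step + intercept.
    slope, intercept = np.polyfit(np.asarray(steps, dtype=float), magnitudes, deg=1)
    return theta0 + (slope * target_step + intercept) * direction


# Synthetic trajectory (hypothetical): linear drift along one hidden
# direction plus small isotropic noise standing in for optimizer noise.
rng = np.random.default_rng(0)
d = 200
theta0 = rng.normal(size=d)
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
steps = np.array([10, 20, 30, 40, 50])
checkpoints = (
    theta0
    + 0.01 * steps[:, None] * true_dir
    + 0.001 * rng.normal(size=(len(steps), d))
)

predicted = rank1_linear_extrapolate(theta0, checkpoints, steps, target_step=1000)
noise_free = theta0 + 0.01 * 1000 * true_dir
rel_error = np.linalg.norm(predicted - noise_free) / np.linalg.norm(noise_free - theta0)
print(f"relative error at step 1000: {rel_error:.3f}")
```

The sign of a singular vector is arbitrary, but the sketch is unaffected: flipping `direction` also flips the fitted magnitudes, so their product stays the same.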

A critical innovation within RELEX is its inherent denoising mechanism. By projecting parameter updates onto a rank-1 subspace, the method effectively filters out high-frequency noise generated during stochastic optimization processes. This denoising effect ensures that only the most informative update directions are retained, significantly improving the accuracy of the extrapolation. Unlike traditional methods that require continuous gradient calculations or the maintenance of complex optimizer states, RELEX generates future checkpoints without any additional backpropagation or model training once the subspace is estimated. This approach not only reduces computational overhead but also prevents performance degradation caused by noise accumulation, ensuring stable performance growth even in unseen training phases.
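
The denoising effect can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the paper's experiment: it builds a synthetic rank-1 trajectory, adds isotropic Gaussian noise as a stand-in for stochastic optimization noise, and compares the distance to the noise-free trajectory before and after projecting onto the top singular direction. The helper name `rank1_project` is hypothetical.

```python
import numpy as np


def rank1_project(deltas):
    """Best rank-1 approximation of an (n, d) matrix of parameter increments."""
    u, s, vt = np.linalg.svd(deltas, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0])


# Synthetic data (hypothetical): n checkpoints drifting along one direction.
rng = np.random.default_rng(1)
n, d = 20, 500
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
steps = np.arange(1, n + 1)
clean = np.outer(0.01 * steps, direction)         # noise-free increments
noisy = clean + 0.001 * rng.normal(size=(n, d))   # observed increments

err_raw = np.linalg.norm(noisy - clean)
err_projected = np.linalg.norm(rank1_project(noisy) - clean)
print(f"error before projection: {err_raw:.4f}, after: {err_projected:.4f}")
```

In this toy setting the noise is spread over all d coordinates while the signal occupies one direction, so discarding everything outside that direction removes most of the noise and little of the signal.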

Industry Impact

Extensive experiments on three models from the Qwen family (Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base) validate the efficacy of the RELEX framework. The results indicate that RELEX requires only 15% of the full RLVR training steps to match or surpass the performance of complete training on both in-domain and out-of-domain benchmarks. For instance, on the Qwen3-8B-Base model, checkpoints generated using only a small number of early training steps achieved scores on mathematical reasoning benchmarks comparable to those of models trained for thousands of steps. This dramatic reduction in required training steps represents a significant leap in computational efficiency for the industry.

The extrapolation capabilities of RELEX further highlight its potential impact. The method can predict performance at steps 10 to 20 times beyond the observation window at zero additional training cost. A notable example from the study shows that observing only the first 50 training steps allows for an accurate prediction of model performance at step 1000, with performance continuing to improve as the extrapolation extends. This capability offers a new strategic option for researchers, enabling them to rapidly assess potential performance in the early stages of training and allocate computational resources more flexibly.
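
As a quick check of the arithmetic, the 50-to-1000 example sits at the upper end of the quoted 10 to 20 times range when the factor is read as the ratio of target step to window length. That reading, and the helper name `extrapolation_factor`, are assumptions of this sketch.

```python
def extrapolation_factor(window_steps: int, target_step: int) -> float:
    """Ratio of the target step to the length of the observation window."""
    if window_steps <= 0:
        raise ValueError("window_steps must be positive")
    return target_step / window_steps


factor = extrapolation_factor(window_steps=50, target_step=1000)
print(factor)  # 20.0
```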

Ablation studies further confirm the minimalism of the RELEX design. Increasing the subspace rank to two or higher, or employing non-linear modeling techniques, yielded no additional performance gains. This finding reinforces the sufficiency of the rank-1 approximation, suggesting that the dominant component of RLVR trajectories is adequate to explain most performance variations. Any attempt to capture higher-dimensional details appears redundant, underscoring the efficiency of focusing on the primary direction of parameter change. This simplicity not only reduces computational costs but also democratizes access to advanced LLM optimization, allowing researchers and developers with limited resources to participate effectively in model refinement.
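
A rough way to see why higher ranks might add little is to check how much of the trajectory's spectral energy the leading singular value already captures. The sketch below does this on synthetic data. It is only a proxy and rests on assumptions: the paper's ablation measures downstream benchmark performance rather than spectral energy, and the function name `explained_energy` and the toy trajectory are hypothetical.

```python
import numpy as np


def explained_energy(deltas, k):
    """Fraction of the squared Frobenius norm captured by the top-k singular values."""
    s = np.linalg.svd(deltas, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))


# Synthetic increments (hypothetical): one dominant direction plus noise.
rng = np.random.default_rng(2)
n, d = 20, 500
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
deltas = np.outer(0.01 * np.arange(1, n + 1), direction)
deltas = deltas + 0.001 * rng.normal(size=(n, d))

for k in (1, 2, 3):
    print(f"rank {k}: {explained_energy(deltas, k):.4f} of the energy")
```

If the first component already accounts for nearly all of the energy, the additional directions gained by raising the rank are dominated by noise in this toy setting, which is consistent with the ablation result reported above.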

Outlook

The introduction of RELEX marks a significant shift in how the AI community approaches the optimization of LLM reasoning capabilities. By revealing the low-rank nature of RLVR trajectories, this research provides a new theoretical entry point for future studies. It invites exploration into whether other optimization algorithms exhibit similar geometric structures and how these insights can be leveraged to design even more efficient fine-tuning methods. The success of RELEX suggests that the field may benefit from a broader re-evaluation of optimization dynamics, moving away from brute-force computational scaling toward more geometrically informed strategies.

For the industrial sector, RELEX offers a practical solution to the escalating costs of training large models. By drastically reducing the computational resources required for RLVR, it enables faster iteration cycles and reduces the uncertainty associated with long-term training projects. This efficiency gain is particularly valuable in commercial applications where time-to-market and operational costs are critical factors. Furthermore, the ability to predict long-term performance from short observation windows allows for more agile decision-making in model development pipelines.

Ultimately, RELEX is not merely a tool for accelerating training but a profound insight into the optimization dynamics of deep models. It challenges the prevailing notion that complex reasoning requires complex, high-dimensional parameter updates. Instead, it proposes that simplicity and geometric structure are key to unlocking the full potential of LLMs. As the field continues to evolve, the principles underlying RELEX are likely to influence the design of next-generation training algorithms, paving the way for more efficient, interpretable, and accessible AI systems.