What is RELEX and how does it work?

RELEX is a compute-efficient method that requires no training beyond a short RLVR observation window. It estimates a rank-1 subspace from the checkpoints in that window and uses linear regression to extrapolate future checkpoints.

Why does this matter for LLM development?

RELEX matches or exceeds full RLVR performance using only 15% of the training steps, and it can extrapolate 10–20× beyond the observation window at zero training cost. This substantially reduces the compute needed for reasoning optimization.

What should researchers watch for next?

Researchers should watch whether this low-rank property also holds for other optimization algorithms, and how it can be used to design more efficient fine-tuning strategies.

Only Minimal RLVR Training Needed: A Leap in LLM Reasoning via Rank-1 Trajectory Extrapolation

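Reinforcement learning with verifiable rewards (RLVR) has become the dominant paradigm for improving the reasoning ability of large language models, but the geometry of its parameter trajectories has long gone largely unexplored. This paper shows that RLVR weight trajectories are extremely low-rank and highly predictable: downstream performance gains are captured mainly by a rank-1 approximation of the parameter deltas, and the magnitude of that projection evolves almost linearly with the number of training steps. Building on this, the authors propose RELEX, a compute-efficient method that estimates the rank-1 subspace from only a short observation window and uses linear regression to extrapolate future checkpoints, with no additional model training. Experiments on three models, Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, show that RELEX needs only 15% of the full RLVR training steps to match or exceed full RLVR performance on both in-domain and out-of-domain benchmarks. More surprisingly, RELEX can extrapolate, at zero training cost, to points 10–20 times beyond the observation window; for example, observing only the first 50 steps is enough to predict the performance improvement after 1,000 steps. Ablations confirm that neither increasing the subspace rank nor using nonlinear modeling yields further gains, and that the method's success stems from the denoising effect of the rank-1 projection on stochastic optimization noise.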
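
To make the rank-1 extrapolation idea concrete, here is a minimal sketch in Python with NumPy. It is one plausible reading of the summary above, not the authors' implementation. The function name relex_extrapolate, the choice to flatten all parameters into a single vector, and the use of the leading right singular vector of the stacked parameter deltas as the rank-1 direction are all assumptions made for illustration; the paper may, for example, work per weight matrix instead.

```python
import numpy as np


def relex_extrapolate(theta0, checkpoints, steps, target_step):
    """Extrapolate a flattened parameter vector to `target_step`.

    Hypothetical helper, not the paper's code.
    theta0      : (d,) base-model parameters before RLVR
    checkpoints : (k, d) parameters observed during the short window
    steps       : (k,) training steps at which they were observed
    """
    deltas = checkpoints - theta0  # (k, d) parameter deltas

    # Rank-1 subspace (assumed): leading right singular vector of the deltas.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]  # (d,), unit norm

    # Projection magnitude of each observed delta onto that direction.
    coeffs = deltas @ direction  # (k,)

    # Linear regression of projection magnitude on training step.
    slope, intercept = np.polyfit(steps, coeffs, deg=1)

    # Predicted checkpoint: move along the rank-1 direction only.
    predicted = slope * target_step + intercept
    return theta0 + predicted * direction


if __name__ == "__main__":
    # Synthetic trajectory: linear drift along one direction plus noise.
    rng = np.random.default_rng(0)
    d = 200
    true_dir = rng.normal(size=d)
    true_dir /= np.linalg.norm(true_dir)
    theta0 = rng.normal(size=d)

    steps = np.arange(10, 51, 10)  # observe steps 10..50
    noise = 1e-3 * rng.normal(size=(len(steps), d))
    checkpoints = theta0 + 0.01 * steps[:, None] * true_dir + noise

    pred = relex_extrapolate(theta0, checkpoints, steps, target_step=1000)
    target = theta0 + 0.01 * 1000 * true_dir  # noise-free trajectory at step 1000
    rel_err = np.linalg.norm(pred - target) / np.linalg.norm(target - theta0)
    print(f"relative error at step 1000: {rel_err:.4f}")
```

The synthetic demo mirrors the 50-step to 1,000-step example only in shape; it says nothing about real model behavior. Because the prediction moves only along the estimated direction, noise components orthogonal to it are discarded, which is the denoising effect the abstract refers to. For real model sizes, a full SVD of a k-by-d matrix is likely impractical, and one would probably compute the direction from the small k-by-k Gram matrix of the deltas instead.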

Sources

arXiv