DelTA is a discriminative token credit assignment method for RLVR (reinforcement learning with verifiable rewards). It dynamically estimates per-token coefficients and uses them to re-weight the RLVR objective when training LLMs.

Why is DelTA important?

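In standard RLVR, high-frequency formatting tokens interfere with the policy-gradient update and weaken its ability to separate high-reward responses from the rest. DelTA suppresses these shared or weakly discriminative directions so that the update focuses on truly discriminative tokens.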

How does DelTA perform?

Across seven math benchmarks, DelTA beats the strongest same-scale baseline by an average of 3.26 points on Qwen3-8B-Base and 2.62 points on Qwen3-14B-Base. It also generalizes well to code generation and out-of-domain evaluations.

DelTA: A Reinforcement Learning Optimization Method Based on Discriminative Token Credit Assignment

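This paper examines reinforcement learning with verifiable rewards (RLVR) for large language models and analyzes how response-level rewards are turned into token-level probability updates. It finds that the standard policy-gradient update direction is essentially a linear discriminator that adjusts token probabilities through positive-side and negative-side centroids. This mechanism is easily disturbed by high-frequency formatting tokens, which weakens its ability to distinguish high-reward responses. To address this, the authors propose DelTA, which estimates token coefficients to amplify side-specific gradient directions and suppress shared or weakly discriminative ones. The method re-weights the self-normalized RLVR surrogate objective so that the effective centroids become more contrastive. On seven math benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively, and it generalizes well to code generation and out-of-domain evaluations.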
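The centroid view can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: one-hot token counts stand in for token gradient directions, the responses are invented, and `token_coefficients` is a hypothetical scoring rule, not the estimator used in the paper. It only shows how down-weighting tokens shared by both sides makes the self-normalized positive and negative centroids more contrastive.

```python
from collections import Counter
from math import sqrt


def side_centroid(responses, coeff=None):
    """Self-normalized, coefficient-weighted mean of one-hot token features."""
    totals = Counter()
    for tokens in responses:
        for tok in tokens:
            # Without coefficients every token counts equally.
            totals[tok] += 1.0 if coeff is None else coeff.get(tok, 0.0)
    norm = sum(totals.values())
    return {tok: w / norm for tok, w in totals.items()} if norm > 0 else {}


def token_coefficients(pos, neg):
    """Hypothetical score in [0, 1]: 0 for tokens equally frequent on both sides."""
    p, n = side_centroid(pos), side_centroid(neg)
    coeff = {}
    for tok in set(p) | set(n):
        fp, fn = p.get(tok, 0.0), n.get(tok, 0.0)
        coeff[tok] = abs(fp - fn) / (fp + fn)
    return coeff


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Invented responses: formatting tokens ("<think>", "\n", "</think>") are shared.
pos = [["<think>", "\n", "factor", "</think>", "\n"],
       ["<think>", "\n", "factor", "check", "</think>", "\n"]]  # high reward
neg = [["<think>", "\n", "guess", "</think>", "\n"],
       ["<think>", "\n", "guess", "guess", "</think>", "\n"]]   # low reward

coeff = token_coefficients(pos, neg)
plain = cosine(side_centroid(pos), side_centroid(neg))
weighted = cosine(side_centroid(pos, coeff), side_centroid(neg, coeff))
print(f"centroid similarity: plain={plain:.3f}, re-weighted={weighted:.3f}")
```

With unweighted counts the two centroids overlap heavily because the formatting tokens dominate both sides. After re-weighting, only the side-specific tokens contribute, so the similarity between the centroids drops.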

Sources

arXiv