What is the DemoPSD framework?

DemoPSD is a disagreement-regulated policy self-distillation framework designed to counter privileged information leakage in LLM reasoning training. Rather than fitting the full teacher distribution, it adopts teacher guidance selectively, steering the student toward a reverse KL barycenter objective (a weighted geometric combination of the teacher and student distributions) that balances learning from the teacher with preserving the student's own reasoning abilities.
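For reference, a reverse KL barycenter of two distributions has a standard closed form. The notation below is an assumption made for illustration, since the summary gives no symbols: \pi_S is the student's next-token distribution, \pi_T is the teacher's, and \lambda_t in [0, 1] is a per-token teacher weight. The exact objective used by DemoPSD may differ in detail.

```latex
\pi^{*}_{t}
  = \arg\min_{q}\;
    (1-\lambda_t)\,\mathrm{KL}\!\left(q \,\|\, \pi_S\right)
    + \lambda_t\,\mathrm{KL}\!\left(q \,\|\, \pi_T\right),
\qquad
\pi^{*}_{t}(y)
  = \frac{\pi_S(y)^{\,1-\lambda_t}\,\pi_T(y)^{\,\lambda_t}}
         {\sum_{y'} \pi_S(y')^{\,1-\lambda_t}\,\pi_T(y')^{\,\lambda_t}}
```

Under this form, \lambda_t = 0 recovers the student distribution and \lambda_t = 1 recovers the teacher distribution, so the weight directly sets how much teacher guidance is adopted at each token.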

Why does DemoPSD matter for LLM training?

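In conventional on-policy self-distillation (OPSD), a teacher conditioned on privileged information supplies dense token-level supervision. This tends to cause overfitting, suppress exploration, and leak privileged information: the student encodes answer-dependent shortcuts that are available only during training, which can cause performance to collapse at test time. DemoPSD is shown theoretically to decay this leakage while preserving the student's capacity to explore.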

How does DemoPSD perform in practice?

In experiments across four scientific domains in SciKnowEval, DemoPSD outperforms GRPO and SDPO while maintaining higher training entropy. It also generalizes robustly to the out-of-distribution GPQA benchmark.

DemoPSD: A Disagreement-Regulated Policy Self-Distillation Framework for Tackling Privileged Information Leakage

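Recent LLM reasoning training methods based on on-policy self-distillation (OPSD) are practical, but dense token-level supervision from a teacher model conditioned on privileged information tends to cause overfitting, suppress exploration, and lead to privileged information leakage, in which the student model encodes answer-dependent shortcuts that are unavailable at test time. To address this, the paper proposes the DemoPSD framework, which resolves these issues by selectively adopting teacher guidance. Instead of fitting the full teacher distribution, the method steers the student toward a reverse KL barycenter objective, that is, a weighted geometric combination of the teacher and student distributions, to balance learning from the teacher against preserving the student's own reasoning ability. By measuring distributional divergence to adaptively control the degree of mixing at each token position, DemoPSD is shown theoretically to provide leakage decay and exploration preservation. Extensive experiments on four scientific domains of SciKnowEval show that DemoPSD outperforms GRPO and SDPO, maintains higher training entropy, and generalizes robustly on the out-of-distribution GPQA benchmark.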
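The weighted geometric combination described above can be illustrated with a small numerical sketch. This is not the paper's implementation. The function names are hypothetical, and the gate in `mixing_weight` (teacher weight decaying exponentially with KL(student || teacher), scaled by a made-up `beta`) is an assumption: the summary says only that the mixing degree is controlled per token by a measured distributional divergence, not how.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def geometric_barycenter(student, teacher, lam):
    """Normalized weighted geometric mean: q(y) ∝ student(y)^(1-lam) * teacher(y)^lam.

    Both inputs must be strictly positive. lam is the teacher weight in [0, 1].
    """
    logits = [
        (1 - lam) * math.log(s) + lam * math.log(t)
        for s, t in zip(student, teacher)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    unnormalized = [math.exp(x - peak) for x in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def barycenter_objective(q, student, teacher, lam):
    """Weighted reverse-KL objective that the geometric mean minimizes."""
    return (1 - lam) * kl(q, student) + lam * kl(q, teacher)


def mixing_weight(student, teacher, beta=1.0):
    """HYPOTHETICAL gate (an assumption, not from the source):
    the teacher weight shrinks as student/teacher disagreement grows."""
    return math.exp(-beta * kl(student, teacher))


if __name__ == "__main__":
    # Toy next-token distributions over a three-token vocabulary.
    student = [0.7, 0.2, 0.1]
    teacher = [0.1, 0.2, 0.7]
    lam = mixing_weight(student, teacher)
    target = geometric_barycenter(student, teacher, lam)
    print("teacher weight:", lam)
    print("distillation target:", target)
    print("objective at target:", barycenter_objective(target, student, teacher, lam))
```

With `lam = 0` the target is the student's own distribution and with `lam = 1` it is the teacher's, so any gate of this kind interpolates between ignoring and fully adopting the teacher at each token position.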

Sources

arXiv