DemoPSD: Disagreement-Regulated Policy Self-Distillation Framework for Overcoming Privileged Information Leakage
Recent methods for training large language model reasoning based on On-Policy Self-Distillation (OPSD) offer practical utility, but the teacher model's dense token-level supervision under privileged information easily causes overfitting, suppresses exploration, and triggers privileged information leakage, in which student models encode answer-dependent shortcuts that are unavailable at test time. To address these issues, we propose DemoPSD, a framework built on selective adoption of teacher guidance. Rather than fitting the full teacher distribution, DemoPSD steers the student toward a reverse KL barycenter objective, a weighted geometric combination of the teacher and student distributions, thereby balancing knowledge acquisition from the teacher with preservation of the student's own reasoning abilities. By measuring distributional divergence and adaptively controlling the mixing intensity at each token position, DemoPSD provably achieves leakage decay and exploration retention. Extensive experiments across four scientific domains in SciKnowEval demonstrate that DemoPSD outperforms both GRPO and SDPO, maintains higher training entropy, and generalizes robustly to the out-of-distribution GPQA benchmark.
Background and Context
Training the reasoning capabilities of Large Language Models (LLMs) has increasingly relied on On-Policy Self-Distillation (OPSD) as an efficient paradigm. In this framework, a single model plays two roles at once: as teacher it has access to privileged information, and as student it must learn from the teacher's outputs without that information. While OPSD offers practical utility, recent work has identified intrinsic flaws in how it operates. The core issue arises when the teacher operates under privileged information conditions: the dense, token-level supervision signals it generates often cause the student to overfit to specific patterns within the training domain. This overfitting in turn suppresses the model's exploration of novel solutions in unfamiliar scenarios.
A more fundamental defect is the phenomenon of "privileged information leakage." During training, the student inadvertently learns answer-dependent shortcuts that work only because the teacher has access to privileged data. At test time that privileged information is absent, the shortcuts become invalid, and model performance drops sharply. This leakage is a structural failure of current self-distillation methods, because it creates a dependency on information that does not exist during inference. To address these compounding issues of overfitting and leakage, researchers have developed the DemoPSD framework. The approach reworks the knowledge transfer mechanism within self-distillation around the concept of "selective adoption of teacher guidance," offering both a theoretical perspective and a practical path for making complex reasoning training more robust.
Deep Analysis
From a technical implementation standpoint, DemoPSD abandons the conventional approach of directly fitting the complete teacher distribution. Instead, it introduces a finer-grained mechanism known as the "reverse KL barycenter objective." The framework measures the divergence between the teacher's distribution and the student's distribution and uses it as a regulating factor to dynamically construct a target that is a weighted geometric combination of the two. This target balances two goals: incorporating the high-quality reasoning paths provided by the teacher and preserving the student's existing reasoning capabilities. Because the student does not imitate the teacher's full output distribution, it is steered toward a compromise that mitigates the risk of encoding privileged shortcuts.
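The link between a "reverse KL barycenter" and a "weighted geometric combination" can be made explicit with a short derivation. The notation below is an assumption introduced for illustration and is not taken from the paper: $p_T$ and $p_S$ are the teacher and student next-token distributions over a vocabulary $\mathcal{V}$, $\lambda \in [0,1]$ is the mixing weight at a given token position, and $Z$ is a normalizing constant. The derivation assumes the barycenter is taken with the target $q$ in the first argument of each KL term.

```latex
\begin{aligned}
q^{*} &= \arg\min_{q}\; \lambda\,\mathrm{KL}(q \,\|\, p_T) + (1-\lambda)\,\mathrm{KL}(q \,\|\, p_S) \\
      &= \arg\min_{q}\; \sum_{v \in \mathcal{V}} q(v)\,\log q(v)
         - \sum_{v \in \mathcal{V}} q(v)\,\log\!\big(p_T(v)^{\lambda}\, p_S(v)^{1-\lambda}\big) \\
      &= \arg\min_{q}\; \mathrm{KL}\!\big(q \,\|\, \tilde{p}_{\lambda}\big) - \log Z,
\qquad
\tilde{p}_{\lambda}(v) = \frac{p_T(v)^{\lambda}\, p_S(v)^{1-\lambda}}{Z},
\quad
Z = \sum_{u \in \mathcal{V}} p_T(u)^{\lambda}\, p_S(u)^{1-\lambda}.
\end{aligned}
```

Since $\log Z$ does not depend on $q$ and a KL divergence is minimized when its two arguments coincide, the minimizer is $q^{*} = \tilde{p}_{\lambda}$, the normalized geometric mixture. Setting $\lambda = 1$ recovers the teacher distribution and $\lambda = 0$ leaves the student distribution unchanged.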
Operationally, DemoPSD adaptively regulates the mixing intensity at each token position based on the measured distributional divergence. Rather than applying uniform supervision across all tokens, the framework assesses the value of the teacher's guidance position by position. Where the distributional difference is large, indicating high potential value in the teacher's guidance, the model prioritizes absorbing the teacher's information. Conversely, where the difference is small or the student is already highly confident, the system retains more of the student's original output. This selective mechanism is shown theoretically to achieve "leakage decay," which weakens the student's dependency on privileged information, while also ensuring "exploration retention," which keeps the model from stagnating in local optima during dense distillation.
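The per-token mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice of KL(student || teacher) as the divergence measure, and the mapping `1 - exp(-divergence / temperature)` from divergence to mixing weight are all hypothetical. The article specifies only that the weight is regulated by distributional divergence and that larger divergence means more teacher influence. All probabilities are assumed to be strictly positive, as they would be after a softmax.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mixing_weight(divergence, temperature=1.0):
    """Hypothetical mapping from divergence in [0, inf) to a weight in [0, 1).

    Zero divergence gives weight 0 (keep the student); larger divergence
    moves the weight toward 1 (absorb more of the teacher).
    """
    return 1.0 - math.exp(-divergence / temperature)


def geometric_mixture(teacher, student, lam):
    """Normalized weighted geometric combination of two distributions."""
    unnormalized = [(t ** lam) * (s ** (1.0 - lam)) for t, s in zip(teacher, student)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def token_targets(teacher_seq, student_seq, temperature=1.0):
    """Build one (weight, target distribution) pair per token position."""
    targets = []
    for teacher, student in zip(teacher_seq, student_seq):
        # Assumption: disagreement is measured from the student's side.
        divergence = kl_divergence(student, teacher)
        lam = mixing_weight(divergence, temperature)
        targets.append((lam, geometric_mixture(teacher, student, lam)))
    return targets


if __name__ == "__main__":
    # Two positions over a two-token vocabulary: one where the teacher
    # disagrees with the student, one where they agree exactly.
    teacher_seq = [[0.9, 0.1], [0.5, 0.5]]
    student_seq = [[0.5, 0.5], [0.5, 0.5]]
    for lam, target in token_targets(teacher_seq, student_seq):
        print(round(lam, 3), [round(x, 3) for x in target])
```

At a position where teacher and student agree, the weight is zero and the target equals the student's own distribution, so no supervision pressure is applied there. Where they disagree, the target moves part of the way toward the teacher, and the `temperature` parameter (again an assumption) controls how quickly that happens.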
Industry Impact
DemoPSD was validated through extensive experiments on the SciKnowEval benchmark, covering four scientific domains to assess performance on complex scientific reasoning tasks. The results show that DemoPSD outperforms both baselines it is compared against, GRPO (Group Relative Policy Optimization) and SDPO (Self-Distillation with Policy Optimization). A key metric in these evaluations is training entropy: DemoPSD maintained higher training entropy than its counterparts. This higher entropy is consistent with the framework's claimed ability to suppress overfitting and maintain diverse exploration, preventing the model from collapsing into narrow, over-specialized decision paths.
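For readers unfamiliar with the metric, training entropy is commonly computed as the Shannon entropy of the policy's next-token distribution, averaged over token positions. The sketch below assumes that definition; the article does not state exactly how the paper measures it, and the function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_sequence_entropy(seq_probs):
    """Average per-token entropy over the positions of one sampled response."""
    if not seq_probs:
        raise ValueError("seq_probs must contain at least one distribution")
    return sum(token_entropy(p) for p in seq_probs) / len(seq_probs)


if __name__ == "__main__":
    # A peaked distribution has low entropy; a flat one has high entropy.
    peaked = [0.97, 0.01, 0.01, 0.01]
    flat = [0.25, 0.25, 0.25, 0.25]
    print(token_entropy(peaked), token_entropy(flat))
    print(mean_sequence_entropy([peaked, flat]))
```

A policy whose average entropy falls toward zero during training is placing nearly all probability on a single token at each step, which is the collapse into narrow decision paths described above.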
To test generalization, researchers also evaluated DemoPSD on the out-of-distribution (OOD) GPQA benchmark. The framework generalized robustly, showing a smaller performance degradation on unseen data distributions than the baseline models. Ablation studies further indicated that dynamically adjusting the distribution mixing ratio helps the model identify and filter out spurious correlations that rely on privileged information, so that it learns reasoning grounded in the problem itself rather than in statistical artifacts. These findings shed light on information flow within self-distillation mechanisms and highlight the framework's potential to enhance the reliability of LLMs in high-stakes scientific applications.
Outlook
DemoPSD represents a notable correction to existing LLM training paradigms, offering tools of practical value to both the open-source community and industrial applications. In industry, deploying large models in vertical domains often faces two challenges: data distribution shifts and privacy protection. The "leakage decay" property emphasized by DemoPSD helps build more reliable reasoning systems, because a model that does not depend on information available only during training behaves more predictably once that information is gone; this may in turn reduce compliance risks associated with data leakage. Additionally, the framework's ability to maintain high training entropy suggests that models may retain strong generalization under resource constraints, which could help reduce the computational cost of large-scale fine-tuning.
Looking forward, the reverse KL barycenter objective proposed by DemoPSD provides a mathematical framework for designing more sophisticated self-supervised learning algorithms. Future research may explore extending it to multimodal domains or to other policy optimization settings within reinforcement learning. By combining theoretical derivation with experimental validation, this work offers a feasible way past current bottlenecks in LLM reasoning training. It may help drive next-generation reasoning models toward greater generality and robustness, so that AI systems can carry out complex logical deductions more independently and reliably in diverse, real-world environments.