Quantifying the Faithfulness of Confidence Expressions in Large Reasoning Models: Challenges and an Evaluation Framework

This paper investigates a reliability gap in Large Reasoning Models (LRMs): the faithfulness of their confidence expressions, which the authors term Faithful Calibration (FC). While LRMs expose extended reasoning traces to demonstrate their thought processes, the confidence a model communicates through language is often severely misaligned with its internal uncertainty. Existing evaluation methods struggle with the characteristics of LRM chain-of-thought outputs, which lack clear step boundaries, exhibit structural inconsistency, and involve complex conditional dependencies. To address these challenges, the authors propose a quantification framework that evaluates FC by comparing linguistic decisiveness against three sources of internal uncertainty: token-level probabilities, hidden-state representations, and sampling response consistency. The study also introduces a prefix-conditioned sampling method to control conditional and structural variation across reasoning trajectories. Experimental results reveal that reasoning behavior alone does not automatically improve confidence faithfulness, and that prompt interventions which work for non-reasoning models do not carry over to reasoning contexts. Significant disagreements among different confidence estimators on the same trajectory expose the fragility of current evaluation approaches. The study establishes FC as an independent reliability and alignment objective for LRMs, with particular relevance to high-stakes application scenarios.

Background and Context

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide spectrum of tasks, yet a critical barrier to their trustworthy deployment remains the faithful expression of uncertainty. This concept, termed Faithful Calibration (FC), demands a close alignment between a model’s internal state of uncertainty and the confidence it communicates through its linguistic output. Standard LLMs already face this challenge, and it becomes considerably harder for Large Reasoning Models (LRMs). These models generate extended reasoning traces, often referred to as Chain-of-Thought (CoT) outputs, to solve problems step by step. Users naturally interpret these lengthy, detailed derivations as evidence of deep deliberation, professional competence, and high confidence. However, this intuitive trust may be misplaced if the model’s internal uncertainty is not accurately reflected in its external expression.

Existing evaluation methodologies are ill-equipped to handle the unique characteristics of LRM outputs. Traditional FC assessment paradigms were primarily designed for short-text generation tasks, where boundaries between steps are clear and structures are relatively simple. In contrast, LRM reasoning trajectories lack distinct step boundaries, exhibit structural inconsistencies, and encode complex conditional dependencies throughout the entire sequence. These features make it exceptionally difficult to estimate the model’s internal confidence at any given point in the reasoning process. Consequently, it remains unclear whether LRMs can express their confidence faithfully, leaving a potential reliability risk that has not been systematically quantified.

To address these challenges, this research introduces a quantification framework designed to systematically evaluate the Faithful Calibration of LRMs. The core innovation of this framework lies in its multi-dimensional approach to measuring internal uncertainty. Rather than relying on a single metric, the framework correlates linguistic decisiveness with three distinct sources of internal uncertainty: token-level probability distributions, hidden-state representations, and sampling response consistency. By integrating these signals, the framework aims to capture the model’s level of certainty during the reasoning process with finer granularity than previous methods allowed. This approach seeks to bridge the gap between the model’s internal state and its external verbal output, providing a more robust basis for assessing reliability.
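
To make the comparison between expressed and internal confidence concrete, the Python sketch below computes two of the three internal signals named above and a simple gap against a decisiveness score. It is an illustration under assumptions, not the authors' implementation: the function names, the geometric-mean formula for token confidence, the majority-share formula for sampling consistency, and the absolute-difference gap are all hypothetical choices, and the decisiveness score is assumed to come from a separate scorer that maps an answer's wording to [0, 1]. The hidden-state signal is left out because it would require a trained probe on model activations.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of an answer span (assumed formula)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def consistency_confidence(sampled_answers: Sequence[Hashable]) -> float:
    """Share of sampled answers that agree with the most frequent answer."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    top_count = Counter(sampled_answers).most_common(1)[0][1]
    return top_count / len(sampled_answers)


def faithfulness_gap(decisiveness: float, internal_confidence: float) -> float:
    """Absolute gap between expressed decisiveness and internal confidence.

    Both inputs are assumed to lie in [0, 1]; a smaller gap means the wording
    tracks the internal signal more closely.
    """
    return abs(decisiveness - internal_confidence)


if __name__ == "__main__":
    # Made-up inputs for illustration only.
    logprobs = [math.log(0.9), math.log(0.6), math.log(0.8)]
    answers = ["B", "B", "A", "B"]
    decisiveness = 0.95  # hypothetical score for a very assertive answer

    print("token confidence:", round(token_confidence(logprobs), 3))
    print("sampling consistency:", consistency_confidence(answers))
    print("gap vs. consistency:",
          round(faithfulness_gap(decisiveness, consistency_confidence(answers)), 3))
```

Computing the gap separately for each internal signal, rather than averaging the signals first, keeps any disagreement between them visible.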

Furthermore, recognizing the high variance and complexity inherent in LRM reasoning trajectories, the study develops a prefix-conditioned sampling method. This technique is crucial for controlling conditional and structural variations across different reasoning paths, ensuring that the evaluation results are both fair and comparable. By standardizing the conditions under which reasoning traces are generated, the framework can isolate the effects of the reasoning process itself on confidence expression. This methodological rigor lays the groundwork for a more accurate estimation of internal confidence in long-form text generation, setting a new standard for how we evaluate the reliability of next-generation reasoning models.
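
The text does not spell out the sampling procedure, so the following Python sketch shows only one plausible reading of prefix conditioning: every sampled continuation shares the same leading portion of a single reasoning trace, so that differences in the final answers cannot be attributed to differences in the earlier reasoning. The `Sampler` interface, the whitespace-token cut, and `toy_sampler` are assumptions made for illustration; a real setup would call an actual model and would likely use the model's own tokenizer.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical sampler interface: (prompt, rng) -> final answer string.
Sampler = Callable[[str, random.Random], str]


def shared_prefix(trace: str, keep_fraction: float) -> str:
    """Keep the leading fraction of a trace, cut on whitespace tokens."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    tokens = trace.split()
    return " ".join(tokens[: int(len(tokens) * keep_fraction)])


def prefix_conditioned_answers(sample_fn: Sampler, question: str, trace: str,
                               keep_fraction: float, n_samples: int,
                               seed: int = 0) -> List[str]:
    """Draw n_samples answers that all condition on the same trace prefix."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    prompt = f"{question}\n{shared_prefix(trace, keep_fraction)}"
    rng = random.Random(seed)
    return [sample_fn(prompt, rng) for _ in range(n_samples)]


def majority_agreement(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the share of samples that match it."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def toy_sampler(prompt: str, rng: random.Random) -> str:
    """Stand-in for a model call: the answer varies only if the key step is missing."""
    if "x=7" in prompt:
        return "42"
    return rng.choice(["42", "41"])


if __name__ == "__main__":
    trace = "first note that x=7 then multiply by six"
    for fraction in (0.0, 0.5):
        answers = prefix_conditioned_answers(toy_sampler, "What is 6x?", trace, fraction, 8)
        print(fraction, majority_agreement(answers))
```

Cutting on a token fraction rather than on reasoning steps is deliberate here, since the traces in question lack clear step boundaries.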

Deep Analysis

The experimental evaluation of this framework was conducted across a diverse set of mainstream Large Reasoning Models, various datasets, and different prompt scenarios to ensure a comprehensive assessment of performance. The results show that faithful confidence expression remains a major hurdle for LRMs. Contrary to the assumption that extended reasoning automatically leads to better calibration, the study found that the act of reasoning itself does not inherently improve the faithfulness of confidence expressions. This implies that even when a model generates seemingly detailed and logical reasoning steps, its internal uncertainty may not be properly verbalized. As a result, users may be misled into believing the model is more certain than it actually is, creating a dangerous illusion of competence.

A particularly striking finding is the failure of prompt interventions that have proven effective for non-reasoning models. Prompting strategies designed to improve calibration in standard LLMs were found to be ineffective when applied to LRMs. This suggests that the introduction of a reasoning mechanism changes how the model expresses internal uncertainty. The complex, multi-step nature of reasoning appears to disrupt the calibration logic that works in simpler generation tasks, so that corrective measures developed for standard LLMs no longer work. This highlights a need for new calibration strategies specifically tailored to the architectural and operational characteristics of reasoning models.

Additionally, the study uncovered significant disagreements among different confidence estimators when evaluating the same reasoning trajectory. For instance, estimates derived from token-level probabilities often diverged sharply from those based on hidden-state representations or sampling consistency. This lack of consensus exposes the fragility of current evaluation approaches, which often rely on single metrics to gauge reliability. The divergence indicates that no single internal signal is sufficient to capture the full picture of a model’s confidence. Instead, a multi-perspective assessment is necessary to accurately reflect the model’s reliability, as different metrics may capture different aspects of uncertainty that are not always correlated.
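
One way to check for this kind of disagreement is to compare how different estimators rank the same set of trajectories. The Python sketch below computes pairwise Spearman rank correlations between estimator scores using only the standard library. The choice of Spearman correlation, the estimator names, and the scores are assumptions for illustration; the study's own disagreement measure is not described here.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(ranks(x), ranks(y))


def pairwise_agreement(
    estimates: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Spearman correlation for every pair of estimators over the same trajectories."""
    return {
        (a, b): spearman(estimates[a], estimates[b])
        for a, b in combinations(estimates, 2)
    }


if __name__ == "__main__":
    # Made-up confidence scores for four trajectories, one list per estimator.
    scores = {
        "token": [0.9, 0.8, 0.4, 0.2],
        "hidden": [0.7, 0.9, 0.5, 0.1],
        "sampling": [0.2, 0.4, 0.8, 0.9],
    }
    for pair, rho in pairwise_agreement(scores).items():
        print(pair, round(rho, 3))
```

A correlation near 1 means two estimators order the trajectories the same way; values near 0 or below indicate the kind of divergence described above.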

The prefix-conditioned sampling method played a vital role in revealing these discrepancies by controlling for structural variations. By ensuring that comparisons were made under consistent conditions, the study could isolate the specific impact of the reasoning process on confidence expression. This control mechanism allowed the researchers to demonstrate that the observed misalignments were not merely artifacts of varying output lengths or structures, but were intrinsic to how LRMs process and express uncertainty. The findings underscore the complexity of the problem and the inadequacy of existing tools in addressing it, pointing to the need for more sophisticated evaluation frameworks.

Industry Impact

The implications of these findings for the industry are profound, particularly as Large Reasoning Models are increasingly deployed in high-stakes environments. The study establishes Faithful Calibration as an independent and critical objective for reliability and alignment in LRMs. In sectors such as medical diagnosis, legal advisory, and financial risk management, the accuracy of a model’s confidence expression is directly tied to the safety and trustworthiness of the decisions made. If a model expresses overconfidence in an incorrect reasoning path, or conversely, expresses excessive caution in a correct one, the consequences can be severe. Therefore, ensuring that LRMs faithfully communicate their uncertainty is not just a technical nuance but a fundamental requirement for ethical and safe AI deployment.

This research highlights a significant gap in current model development practices. While much effort has been directed toward improving the accuracy and complexity of reasoning capabilities, the calibration of confidence expressions has been largely overlooked. The finding that reasoning behavior does not automatically enhance faithfulness suggests that developers cannot assume that better reasoning leads to better reliability. Instead, specific optimization efforts must be dedicated to FC, potentially involving adjustments in model architecture, training strategies, or post-processing techniques. Ignoring this aspect could lead to the widespread deployment of models that appear competent but are fundamentally unreliable in their self-assessment.

The evaluation framework and the identified methodological vulnerabilities provide valuable guidance for both the open-source community and industrial developers. By exposing the fragility of single-metric evaluation approaches, the study encourages the adoption of more robust, multi-dimensional assessment protocols. This shift is essential for building more resilient and trustworthy AI systems. Developers are urged to critically evaluate the uncertainty expression mechanisms of LRMs before deployment, ensuring that they meet the rigorous standards required for high-risk applications. The study serves as a wake-up call, emphasizing that reliability is as important as capability in the next generation of AI systems.

Moreover, the failure of existing prompt interventions in reasoning contexts signals a need for new tools and techniques. The industry must invest in developing calibration methods that are specifically designed for the unique challenges posed by long-chain reasoning. This includes exploring new ways to integrate confidence signals into the training process and designing architectures that inherently support faithful uncertainty expression. The research provides a clear direction for future innovation, urging the community to prioritize FC as a key area of focus to prevent the deployment of models that could mislead users in critical decision-making scenarios.

Outlook

Looking ahead, the establishment of Faithful Calibration as a distinct and critical alignment objective for Large Reasoning Models opens new avenues for research and development. The current study provides a foundational framework for quantifying this issue, but significant work remains to be done. Future research should focus on designing model architectures that are intrinsically calibrated to express uncertainty faithfully. This may involve novel training objectives that explicitly optimize for the alignment between internal uncertainty states and external linguistic expressions. By embedding FC into the core design of LRMs, developers can create systems that are not only more accurate but also more transparent and trustworthy in their self-assessments.

The divergence among different confidence estimators identified in this study suggests that hybrid approaches may be necessary for accurate evaluation. Future frameworks could combine token-level probabilities, hidden-state analyses, and sampling consistency into a unified metric that captures the full spectrum of uncertainty. Additionally, the prefix-conditioned sampling method introduced here can be expanded to cover a wider range of reasoning scenarios and model types, providing a more comprehensive understanding of how different architectures handle uncertainty. This expanded evaluation capability will be crucial for benchmarking the reliability of new models as they emerge.

Furthermore, the failure of traditional prompt interventions highlights the need for new calibration techniques tailored to reasoning models. Research into adaptive prompting, dynamic confidence adjustment, and post-hoc correction methods specific to long-chain outputs could yield significant improvements. These techniques must account for the complex conditional dependencies and structural variations inherent in reasoning traces. By developing tools that can dynamically adjust confidence expressions based on real-time internal signals, developers can enhance the reliability of LRMs in real-world applications.

Finally, the industry must prioritize the integration of FC into the standard development lifecycle of LRMs. This involves not only technical innovation but also the establishment of industry standards and best practices for evaluating and reporting confidence calibration. As LRMs become more prevalent in high-stakes domains, the ability to trust their uncertainty expressions will be a key differentiator between reliable and risky AI systems. By addressing the challenges of Faithful Calibration, the AI community can move closer to deploying reasoning models that are not only intelligent but also honest and dependable in their communication of knowledge and doubt.