Agent Fidelity: Can Open Large Models Explain Closed Models?

This paper investigates the applicability of mechanistic interpretability to closed API models and introduces the concept of 'agent fidelity'—assessing how effectively open models can infer the behavior of closed ones when only limited interfaces like log probabilities are available. The study systematically evaluates agent fidelity across three levels (prediction, attribution, and representation) for four major model families: Llama, Qwen, GPT, and Gemini. The experiments reveal that high consistency at the prediction level severely overestimates fidelity at the attribution level: models may agree on answers but fundamentally disagree on their reasoning. Furthermore, the researchers uncover an 'access validity reversal'—white-box signals such as attention patterns, while stable across models, prove poor predictors of causal attribution, whereas black-box input ablation methods are surprisingly more accurate. The study warns that mechanistic insights from open models cannot be naively transferred to closed targets, offering important guidance for interpretability research.

Background and Context

The field of mechanistic interpretability has long operated under the assumption that full access to a model's internals is a prerequisite for understanding its decision-making. This paradigm relies on the ability to inspect weights, activations, and attention mechanisms directly. However, deployed artificial intelligence is now dominated by closed API models, such as those offered by major technology firms, which expose only the final output tokens and their associated log probabilities. This restriction creates a significant "proxy problem" for researchers and auditors: how can one reliably infer the internal logic of a black-box system when the only available data points are surface-level predictions? The paper under review addresses this gap by introducing the concept of "agent fidelity," a measure of how effectively open-weight models can serve as proxies for closed models.

The study systematically defines agent fidelity across three distinct dimensions: prediction, attribution, and representation. By doing so, it moves beyond simple accuracy comparisons to explore whether open models can genuinely explain the reasoning behind a closed model's outputs. The research team selected four major model families—Llama, Qwen, GPT, and Gemini—to conduct a comprehensive evaluation. This selection ensures that the findings are not limited to a single architectural lineage but reflect broader trends across different training methodologies and data distributions. The core hypothesis challenges the prevailing notion that insights gained from open models can be naively transferred to closed targets, suggesting instead that the lack of internal access fundamentally alters the reliability of interpretability techniques.

To establish a rigorous baseline, the researchers constructed an evaluation framework that quantifies the divergence between open and closed models at multiple levels. The study emphasizes that while open models are often used as surrogates for auditing or debugging closed systems, this practice may lead to significant misinterpretations if not properly validated. The paper argues that current interpretability methods often assume a direct mapping between the internal mechanisms of open and closed models, an assumption that breaks down when internal access to the closed model is unavailable. By systematically testing this boundary, the research aims to provide a more pragmatic benchmark for the field, highlighting the limitations of using open models as proxies and warning against overconfidence in cross-model generalizations.

Deep Analysis

The methodology is designed to isolate specific aspects of model behavior and compare them across the open-closed divide. At the prediction level, the researchers measured consistency by comparing the outputs of open and closed models on binary classification tasks, using log-odds as a scalar reading that is compatible with API access. Because log-odds can be computed from returned log probabilities alone, this allows a direct comparison of how similarly the models score the same inputs. At the attribution level, the study applied leave-one-out attribution, which removes one input component at a time and records the change in the final output. This method helps infer the causal logic behind decisions without requiring access to internal weights. Finally, at the representation level, the analysis focused on the similarity of internal activation states, providing a deeper look into how information is processed within the models.
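The prediction-level and attribution-level readings described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_score` and `TOY_WEIGHTS` are hypothetical stand-ins for an API call that returns a log-odds value, and the choice of whole tokens as the unit to remove is also an assumption.

```python
import math
from typing import Callable, Sequence


def log_odds(logprob_yes: float, logprob_no: float) -> float:
    """Scalar reading for a binary task: log p(yes) - log p(no)."""
    return logprob_yes - logprob_no


def leave_one_out(
    tokens: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> list[float]:
    """Attribution of token i = score(full input) - score(input without token i)."""
    base = score(tokens)
    return [
        base - score(list(tokens[:i]) + list(tokens[i + 1:]))
        for i in range(len(tokens))
    ]


# Hypothetical scorer standing in for a closed-model API call (assumption):
# it returns a log-odds value as a sum of per-token weights.
TOY_WEIGHTS = {"great": 2.0, "not": -1.5, "boring": -2.0}


def toy_score(tokens: Sequence[str]) -> float:
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)


# Log-odds from two log probabilities, as an API might return them.
reading = log_odds(math.log(0.8), math.log(0.2))

# One attribution value per input token.
attributions = leave_one_out(["the", "film", "was", "great"], toy_score)
```

Each attribution costs one additional model call per removed component, so the method needs nothing beyond input-output access, which is why it remains usable against a closed API.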

The experimental setup involved eleven models spanning the four selected families, evaluated primarily through zero-shot or few-shot inference on pre-trained models rather than task-specific fine-tuning. This strategy was chosen so that the assessment of agent fidelity would generalize and not be biased by task-specific training adjustments. The results revealed a striking discrepancy: high consistency at the prediction level severely overestimates fidelity at the attribution level. Many models that agreed on the final answers disagreed fundamentally on which parts of the input drove those answers. This finding challenges the assumption that agreement on predictions implies agreement on mechanism: two models can arrive at the same conclusion via entirely different pathways.
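The gap between the two levels can be made concrete with a small sketch. The source does not state which agreement statistics the paper uses, so the choices here are assumptions: sign agreement of log-odds for the prediction level and Spearman rank correlation of per-token attributions for the attribution level. All numbers are invented for illustration and are not results from the study.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def prediction_agreement(log_odds_a, log_odds_b):
    """Fraction of inputs on which both models pick the same class."""
    return sum((p > 0) == (q > 0) for p, q in zip(log_odds_a, log_odds_b)) / len(log_odds_a)


# Illustrative values only (assumption): log-odds on four inputs.
open_log_odds = [2.1, -0.7, 1.3, -1.9]
closed_log_odds = [1.8, -0.4, 0.9, -2.2]

# Illustrative values only (assumption): per-token attributions on one input.
open_attr = [0.1, 0.9, 0.2, 0.05]
closed_attr = [0.8, 0.1, 0.05, 0.3]

pred_level = prediction_agreement(open_log_odds, closed_log_odds)
attr_level = spearman(open_attr, closed_attr)
```

In this toy case the two models agree on every label while their attribution rankings are negatively correlated, which is the kind of mismatch the paragraph above describes.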

A particularly significant discovery in the study is the phenomenon of "access validity reversal." The researchers observed that white-box signals, such as attention patterns and perturbation magnitudes, while stable across different models, are poor predictors of causal attribution. In contrast, black-box input ablation methods, which rely solely on input-output relationships, proved surprisingly more accurate in capturing the factors that influence model outputs. This reversal indicates that the most accessible internal signals in open models may not be the most relevant for understanding the causal mechanisms of closed models. The study further confirmed through ablation experiments that prediction-level consistency alone is insufficient to justify transferring mechanistic insights to closed targets, and that stricter attribution consistency checks are needed.
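One way to picture this comparison is to score each candidate signal against a reference attribution for the target model. The sketch below uses top-k overlap as the comparison, which is an assumption since the source does not name the statistic; the variable names are hypothetical and the values are invented to illustrate the pattern, not taken from the paper.

```python
def top_k(scores, k):
    """Indices of the k components with the largest absolute score."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(order[:k])


def top_k_overlap(signal, reference, k):
    """Share of the reference's top-k components that the signal also ranks top-k."""
    return len(top_k(signal, k) & top_k(reference, k)) / k


# Illustrative values only (assumption), one entry per input token.
target_loo = [0.05, 1.2, -0.9, 0.1, 0.02]       # reference: target's leave-one-out attribution
proxy_attention = [0.40, 0.10, 0.05, 0.35, 0.10]  # white-box signal read from an open model
proxy_loo = [0.1, 1.0, -0.7, 0.05, 0.2]          # black-box ablation run on the same open model

attention_overlap = top_k_overlap(proxy_attention, target_loo, k=2)
ablation_overlap = top_k_overlap(proxy_loo, target_loo, k=2)
```

Absolute values are used so that strongly negative attributions count as important; whether that matches the paper's convention is not stated in the source.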

Industry Impact

The implications of these findings are profound for the open-source community and the broader AI research ecosystem. For researchers relying on open-weight models to audit or understand commercial black-box systems, the study serves as a critical cautionary tale. It warns against the overinterpretation of white-box metrics, such as attention heads, which may appear stable and interpretable in open models but fail to correlate with the actual decision-making processes of closed APIs. This disconnect means that conclusions drawn from open models about the behavior of closed models may be misleading, potentially leading to incorrect assessments of safety, bias, or reliability in deployed systems. The research underscores the need for a more nuanced understanding of the limitations of proxy-based interpretability.

In the context of industrial deployment, where most enterprises depend on closed APIs due to performance, cost, or proprietary constraints, the study provides a theoretical foundation for model auditing and debugging. It highlights that simple prediction alignment is not a sufficient proof of interpretability, urging the development of new evaluation standards that can measure the reliability of black-box explanations. By demonstrating that black-box ablation methods can be more effective than white-box signals in certain contexts, the research offers practical guidance for engineers who need to diagnose issues in closed systems without violating intellectual property rights or service terms. This shift in perspective could lead to more robust and legally compliant methods for ensuring the trustworthiness of AI systems in high-stakes environments.

Furthermore, the study impacts the regulatory and ethical landscape of AI by exposing the risks of assuming that transparency in open models translates to transparency in closed ones. If auditors and regulators rely on open-model proxies to assess the safety of closed models, they may miss critical vulnerabilities or biases that are not captured by surface-level predictions. The research calls for a reevaluation of current audit practices, advocating for methods that explicitly account for the fidelity gap between open and closed systems. This could influence how AI safety standards are developed, ensuring that they are based on empirical evidence of proxy reliability rather than theoretical assumptions about model similarity.

Outlook

Looking forward, this research opens a new avenue for study in the domain of mechanistic interpretability, specifically focusing on how to build robust explanatory frameworks under conditions of restricted access. The identification of the "access validity reversal" suggests that future work should prioritize the development of black-box-centric interpretability techniques that do not rely on the assumption of internal structural similarity between open and closed models. Researchers are encouraged to explore hybrid approaches that combine the stability of white-box signals with the causal accuracy of black-box ablation methods, potentially leading to more effective tools for auditing and debugging. The open-sourcing of the code and results from this study will accelerate empirical research in this area, allowing the community to test and refine these new methodologies across a wider range of models and tasks.

The study also points to the need for more sophisticated evaluation metrics that go beyond simple prediction accuracy. Future benchmarks should incorporate rigorous attribution consistency checks to ensure that open models are not just mimicking the outputs of closed models but are also capturing their underlying reasoning processes. This shift could lead to the development of new standards for "proxy fidelity," providing a clearer understanding of when and how open models can be trusted as surrogates for closed ones. As the AI industry continues to rely heavily on closed APIs, these advancements will be crucial for maintaining transparency and accountability in the deployment of large language models.

Finally, the research highlights the importance of studying how interpretability findings transfer across models. By systematically analyzing the boundaries of agent fidelity, the study provides a roadmap for understanding the transferability of mechanistic insights. This knowledge will be valuable for developers who seek to leverage the transparency of open models to improve the safety and reliability of closed systems. As the field matures, integrating these insights into practical tools and frameworks will be essential for making the benefits of mechanistic interpretability available even where full model access is not possible. The study thus serves as a foundational step toward a more rigorous and realistic approach to AI interpretability in a world dominated by closed APIs.

Sources

arXiv