Proxy Fidelity: Can Open Large Models Explain Closed Models?

This paper examines the limits of mechanistic interpretability under closed API settings and introduces "proxy fidelity," a measure of whether measurements taken on open large models support reliable inferences about the behavior of closed models. The research team systematically evaluates proxy fidelity at three levels: prediction, attribution, and representation. In experiments on 11 models spanning four families (Llama, Qwen, GPT, and Gemini), the authors find that predictive fidelity severely overestimates attribution fidelity: agreement on answers often masks deep disagreement in the underlying reasoning. The paper also reports an "access effectiveness inversion": white-box signals such as attention patterns are stable but predict causal attribution only weakly, whereas black-box input ablation experiments capture causal attribution more accurately. These findings indicate that mechanistic interpretability insights cannot be automatically transferred to closed targets, and that agreement at the prediction level alone does not justify such transfers. This is a pointed warning for the open-source community when assessing the validity of model interpretability tools.

Background and Context

The field of mechanistic interpretability (MI) has long operated under the assumption that understanding the internal mechanics of large language models is essential for ensuring their safety and reliability. However, a significant structural barrier has emerged: the vast majority of commercially deployed models are accessible only through closed Application Programming Interfaces (APIs). These interfaces typically expose only the output token probabilities, withholding the internal hidden states, activations, and gradients that are critical for deep mechanistic analysis. This asymmetry in data access creates a fundamental "proxy problem." When researchers must rely on open-source models as proxies to understand closed, proprietary systems, it becomes unclear whether measurements taken on the open models can yield reliable inferences about the behavior of the closed targets. The core challenge lies in determining whether the internal signals of an open model, such as Llama or Qwen, can accurately reflect the decision-making processes of black-box models like GPT or Gemini.

This disconnect is particularly problematic because the most impactful models in industry are often the least transparent. Existing interpretability methods predominantly depend on white-box access, allowing researchers to inspect attention heads, residual streams, and activation patterns directly. Consequently, many conclusions drawn about model behavior may be artifacts of the specific architecture or training data of open-source models, failing to generalize to the more complex, commercially valuable closed models. Without a rigorous framework to assess the validity of these proxy relationships, the open-source community risks building interpretability tools and theories that are ineffective when applied to the real-world systems that dominate the market. Establishing a metric for "proxy fidelity" is therefore not merely an academic exercise but a critical necessity for ensuring that interpretability research remains relevant and effective in a landscape dominated by closed APIs.

To address this gap, the research team developed a systematic methodology for evaluating proxy fidelity across three distinct levels of abstraction: prediction, attribution, and representation. By defining these layers, the study aims to dissect where and why the alignment between open and closed models breaks down. The evaluation framework is designed to be API-compatible, meaning it can be applied even when internal model states are inaccessible. This approach allows for a direct comparison between the capabilities of open models as proxies and the actual behavior of closed models. The study focuses on identifying the specific conditions under which open models can serve as valid surrogates, providing a foundational benchmark for future research into cross-model interpretability. The goal is to move beyond anecdotal evidence and provide a quantifiable measure of how well open models can explain their closed counterparts.

Deep Analysis

The experimental design is notable for its breadth, covering eleven models from four major families: Llama, Qwen, GPT, and Gemini. This selection means the findings are not tied to a single architectural paradigm or training methodology. The researchers employed a multi-level evaluation strategy. For representation-level fidelity in binary classification tasks, they used log-odds as a scalar measure, a choice that is compatible with API access because log-odds can be computed from output probabilities alone. For attribution-level analysis, the team implemented leave-one-out (LOO) attribution, which masks parts of the input one at a time and records the resulting change in the output. This gives a granular view of how specific input tokens contribute to the final prediction. Applying the same evaluation standards to every model reduces the confounding effect of structural differences, which makes differences in fidelity easier to attribute to the proxy relationship itself.
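The leave-one-out procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `[MASK]` placeholder, the function names, and the toy scoring function `toy_prob_positive` are all hypothetical. The only requirement is a callable that returns the probability of the positive class, which is the kind of signal an API can expose.

```python
import math
from typing import Callable, List


def log_odds(p: float, eps: float = 1e-9) -> float:
    """Convert a probability to log-odds, clipping to avoid division by zero."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def loo_attribution(
    tokens: List[str],
    prob_positive: Callable[[List[str]], float],
    mask: str = "[MASK]",  # assumed placeholder; the source does not name one
) -> List[float]:
    """Score each token by the drop in log-odds when that token is masked."""
    base = log_odds(prob_positive(tokens))
    scores = []
    for i in range(len(tokens)):
        ablated = tokens[:i] + [mask] + tokens[i + 1:]
        scores.append(base - log_odds(prob_positive(ablated)))
    return scores


def toy_prob_positive(tokens: List[str]) -> float:
    """Hypothetical stand-in for a black-box classifier (not a real API)."""
    weights = {"great": 2.0, "terrible": -2.0, "not": -0.5}
    z = sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


example_tokens = ["the", "movie", "was", "great"]
example_scores = loo_attribution(example_tokens, toy_prob_positive)
for token, score in zip(example_tokens, example_scores):
    print(f"{token:>6}: {score:+.3f}")
```

Because the function only queries output probabilities, the same code could in principle be pointed at an open model or at a closed one, which is what makes the resulting attribution vectors comparable across the two.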

The results reveal a striking discrepancy between predictive fidelity and attribution fidelity. Predictive fidelity, which measures the agreement between the open and closed models on final answers, was found to severely overestimate attribution fidelity. In many cases, models exhibited high consistency in their outputs, which could suggest they were solving the problem in the same way. However, deeper analysis showed that this surface-level agreement often masked profound disagreements in the underlying reasoning logic. Two models might arrive at the same answer through entirely different causal pathways, meaning that an interpretability tool built on the open model's internal signals would fail to explain the closed model's actual decision process. This finding challenges the common assumption that output consistency implies mechanistic similarity, highlighting a critical blind spot in current interpretability practices.
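To make the gap between the two levels concrete, the sketch below computes a prediction-level score (label agreement) and an attribution-level score (mean Spearman rank correlation between per-token attribution vectors) for a proxy and a target. The choice of Spearman correlation, the function names, and the toy inputs are assumptions made for illustration; the source does not state which agreement statistic the authors used.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation, computed as Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def prediction_fidelity(proxy_labels: Sequence[int], target_labels: Sequence[int]) -> float:
    """Fraction of examples on which proxy and target give the same label."""
    matches = sum(p == t for p, t in zip(proxy_labels, target_labels))
    return matches / len(proxy_labels)


def attribution_fidelity(
    proxy_attr: Sequence[Sequence[float]],
    target_attr: Sequence[Sequence[float]],
) -> float:
    """Mean per-example rank correlation between attribution vectors."""
    rhos = [spearman(p, t) for p, t in zip(proxy_attr, target_attr)]
    return sum(rhos) / len(rhos)


# Invented toy values: the two models agree on every label but
# rank the input tokens in opposite order.
toy_pred = prediction_fidelity([1, 0], [1, 0])
toy_attr = attribution_fidelity(
    [[0.9, 0.1, 0.0], [0.5, 0.2, 0.1]],
    [[0.0, 0.1, 0.9], [0.1, 0.2, 0.5]],
)
print(toy_pred, toy_attr)
```

In this constructed case the prediction-level score is perfect while the attribution-level score is as low as it can be, which is the pattern the paragraph above warns about, shown here in its most extreme form.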

Perhaps the most significant discovery is the phenomenon of "access effectiveness inversion." Traditional white-box signals, such as attention patterns and perturbation magnitudes, were observed to be highly stable across different models. However, this stability did not translate into predictive power for causal attribution. In other words, attention patterns may look similar from one model to the next without pointing to the input factors that causally drive the output. Conversely, black-box input ablation experiments, which treat the model as an opaque function, proved to be more accurate in capturing causal attribution than the available white-box signals. This inversion suggests that the internal structures of large language models are not directly comparable across different training regimes or architectures, and that simpler, black-box methods may sometimes outperform complex mechanistic analyses when bridging the open-closed divide.
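One way to picture the inversion is to ask which proxy-side signal better recovers the tokens that matter to the target. The sketch below compares two candidate signals against a target attribution vector using top-k overlap. Everything here is hypothetical: the overlap metric, the function names, and the numbers, which are invented toy values rather than results from the paper.

```python
from typing import List, Sequence, Set


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k tokens with the largest absolute score."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(order[:k])


def top_k_overlap(a: Sequence[float], b: Sequence[float], k: int) -> float:
    """Jaccard overlap between the top-k token sets of two score vectors."""
    ta, tb = top_k(a, k), top_k(b, k)
    return len(ta & tb) / len(ta | tb)


# Invented toy values for a five-token input.
target_loo: List[float] = [0.1, 1.8, 0.0, 1.2, 0.05]          # black-box ablation on the target
proxy_attention: List[float] = [0.40, 0.05, 0.30, 0.05, 0.20]  # white-box signal from the proxy
proxy_loo: List[float] = [0.0, 1.5, 0.1, 0.2, 0.9]             # black-box ablation on the proxy

attention_overlap = top_k_overlap(proxy_attention, target_loo, k=2)
ablation_overlap = top_k_overlap(proxy_loo, target_loo, k=2)
print(f"attention vs target: {attention_overlap:.2f}")
print(f"ablation  vs target: {ablation_overlap:.2f}")
```

A comparison of this shape needs white-box access only on the proxy side. The target is queried purely through input ablation, so the check remains possible when the target sits behind an API.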

Industry Impact

These findings have profound implications for the open-source AI community and the broader industry of model development. For researchers in mechanistic interpretability, the study serves as a stark warning: insights derived from open models cannot be automatically transferred to closed targets. This necessitates a shift in methodology, moving away from the assumption that open-source models are perfect stand-ins for proprietary systems. Researchers must now adopt a more cautious approach, explicitly testing the proxy fidelity of their interpretability tools before applying them to closed models. This could lead to the development of new evaluation benchmarks that specifically measure the transferability of interpretability insights, ensuring that tools designed for open models are validated for use in black-box contexts.

For industry practitioners, the results suggest that relying on open-source interpretability tools to audit or understand closed commercial models may lead to significant biases and errors. If the internal reasoning of a closed model differs substantially from that of an open proxy, audits based on the proxy's mechanics may miss critical vulnerabilities or biases present in the closed system. This highlights the need for the development of new evaluation standards and hybrid methods that can effectively bridge the gap between white-box mechanistic analysis and black-box auditing. Companies investing in AI safety and compliance must recognize that current interpretability solutions may be insufficient for the models they actually use, potentially requiring significant investment in custom auditing frameworks that do not rely on open-source proxies.

Furthermore, the study underscores the importance of developing more robust attribution methods that can operate effectively in the absence of white-box access. The failure of traditional white-box signals to predict causal attribution in closed models points to a need for alternative techniques that can infer internal logic from input-output behavior alone. This could spur innovation in areas such as causal inference, counterfactual analysis, and black-box optimization, providing new tools for understanding complex AI systems. By highlighting the limitations of current approaches, the research encourages the community to explore more nuanced and realistic models of interpretability that account for the realities of API-based access.

Outlook

The introduction of the "proxy fidelity" framework marks a significant step forward in the rigorous evaluation of mechanistic interpretability. By providing a structured way to assess the validity of open models as proxies for closed systems, the study offers a valuable resource for future research. The open-sourcing of the code and results further facilitates this progress, allowing other researchers to build upon these findings and develop more effective interpretability tools. As the AI industry continues to rely on increasingly complex and closed models, the ability to accurately understand their internal workings will remain a critical challenge. This research provides a crucial baseline for addressing that challenge, emphasizing the need for caution and methodological rigor.

Looking ahead, the field of mechanistic interpretability must adapt to the reality of a predominantly closed AI ecosystem. This will likely involve a greater emphasis on black-box and hybrid methods, as well as a more critical examination of the assumptions underlying current interpretability techniques. The discovery of the "access effectiveness inversion" suggests that simplicity may sometimes be superior to complexity in certain contexts, prompting a reevaluation of the value placed on intricate mechanistic analyses. Researchers will need to develop new metrics and benchmarks that can accurately capture the nuances of cross-model behavior, ensuring that interpretability tools remain effective and reliable.

Ultimately, this study serves as a call to action for the AI community to rethink its approach to model transparency. While open-source models remain valuable for research and development, they are not a panacea for understanding the black-box systems that dominate the industry. By acknowledging the limitations of proxy fidelity and developing new methods to bridge the gap between open and closed models, the community can make significant strides toward more transparent, safe, and reliable AI systems. The insights provided by this research are essential for navigating the complex landscape of modern AI, ensuring that interpretability efforts are both scientifically sound and practically useful.
