What is "proxy fidelity"?

Proxy fidelity measures how reliably measurements taken on open models can be used to infer the behavior of closed models. The paper evaluates it at three levels: prediction, attribution, and representation.

Why does high predictive fidelity not imply high attribution fidelity?

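Models that agree on answers can still diverge sharply in the reasoning behind them, so answer-level agreement often masks these differences. White-box signals such as attention patterns are stable but predict causal attribution only weakly, whereas black-box input ablation captures it more accurately.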
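The gap between the two kinds of agreement can be illustrated with a small sketch. It is not the paper's metric: the function names, the choice of Spearman rank correlation as the attribution comparison, and the toy numbers are all assumptions made for illustration.

```python
from typing import List, Sequence


def prediction_agreement(proxy_labels: Sequence[str], target_labels: Sequence[str]) -> float:
    """Fraction of inputs on which proxy and target give the same answer."""
    if len(proxy_labels) != len(target_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(p == t for p, t in zip(proxy_labels, target_labels))
    return matches / len(proxy_labels)


def _ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation, used here as an assumed attribution-agreement score."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for constant rankings
    return cov / (var_a * var_b) ** 0.5


# Toy data (made up): both models answer identically on four inputs...
proxy_labels = ["A", "B", "A", "C"]
target_labels = ["A", "B", "A", "C"]
# ...but rank the importance of four input tokens in opposite order.
proxy_attr = [0.9, 0.1, 0.4, 0.2]
target_attr = [0.1, 0.8, 0.2, 0.5]

print(prediction_agreement(proxy_labels, target_labels))  # 1.0
print(spearman(proxy_attr, target_attr))                  # -1.0
```

In this toy case the proxy looks perfect at the prediction level while its attribution ranking is fully reversed, which is the kind of mismatch a prediction-only evaluation would miss.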

What do these findings mean for AI interpretability research?

Mechanistic interpretability insights cannot be automatically transferred to closed targets. The community needs new evaluation standards beyond prediction-level agreement.

Proxy Fidelity: Can Open Large Models Explain Closed Models?

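This paper examines the limits of mechanistic interpretability in closed API settings and introduces the core concept of "proxy fidelity," which assesses whether measurements taken on open large models can validly be used to infer the behavior of closed models. The research team systematically evaluates proxy fidelity at three levels: prediction, attribution, and representation. In extensive experiments on eleven models spanning four families (Llama, Qwen, GPT, and Gemini), the study finds that predictive fidelity severely overestimates attribution fidelity: agreement between models on answers often masks large divergences in their reasoning logic. The paper reveals an "access-validity inversion": white-box signals such as attention patterns are stable yet predict causal attribution very weakly, whereas black-box input ablation experiments capture causal attribution more accurately. These findings indicate that mechanistic interpretability insights do not transfer automatically to closed targets and that prediction-level agreement alone is not enough to justify such a transfer, an important caution for the open-source community when evaluating the validity of model interpretability tools.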
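Black-box input ablation needs only the ability to query a model, so it can be run against a closed API. The following is a minimal leave-one-out sketch under stated assumptions: `score_fn` is a hypothetical callable that returns a scalar score for a token list, and `toy_score` is a made-up stand-in for a real model. The paper's actual ablation procedure may differ.

```python
from typing import Callable, List, Sequence


def ablation_attribution(
    tokens: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> List[float]:
    """Leave-one-out attribution: the score drop when each token is removed.

    Only the model's outputs are used, so no weights or activations are needed.
    """
    base = score_fn(list(tokens))
    attributions = []
    for i in range(len(tokens)):
        ablated = [t for j, t in enumerate(tokens) if j != i]
        attributions.append(base - score_fn(ablated))
    return attributions


def toy_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for a closed model: a fixed per-token weight sum."""
    weights = {"not": -2.0, "good": 1.5}
    return sum(weights.get(t, 0.0) for t in tokens)


sentence = ["the", "movie", "is", "not", "good"]
print(ablation_attribution(sentence, toy_score))
# [0.0, 0.0, 0.0, -2.0, 1.5]
```

Running the same function against a proxy and a target yields two attribution vectors that can then be compared directly. Each input costs one query per token plus one for the baseline.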

Sources

arXiv