What norm-fixation problem does this research address?

When coordinate-indexed objects such as steering vectors are transferred across checkpoints, RMSNorm architectures require fixing the symmetry under the sign-permutation group $B_d$. Permutation-only alignment is incomplete and causes systematic errors.

Why does this matter for model interpretability?

Many existing interpretability tools assume LayerNorm-style permutation symmetry and produce unreliable results on RMSNorm models. The study reports that $B_d$-based alignment recovers 91.1% of cross-run coordinates after 1500 fine-tuning steps, versus 60.3% for endpoint matching.

What should researchers and practitioners watch for?

All interpretability claims must state their norm assumptions explicitly to be reproducible. The community needs B_d-aware alignment methods, and practitioners should verify sign-consistency in model merging and fine-tuning workflows.

Sign-Permutation Coordinate Transfer and Norm Fixation in RMSNorm Transformers

This paper investigates the norm-fixation problem that arises when coordinate-indexed objects, such as steering vectors and sparse autoencoders, are transferred across checkpoints in modern large language model pipelines. The authors show that the residual-stream norms of RMSNorm architectures are symmetric under the sign-permutation group $B_d$, so alignment via permutations alone is incomplete. They introduce a sign-marginalized Hungarian matching algorithm and prove that raw signed-correlation matching has a structural accuracy ceiling under decorrelated coordinates, which sign marginalization removes. Experiments show that composing local $B_d$ norms for coordinate-preserving transfer recovers 91.1% of cross-run coordinates at 1500 steps, compared with 60.3% for endpoint matching. On TinyLlama SAE reconstruction, Qwen sentiment steering, and refusal steering, $B_d$-norm-based alignment far exceeds permutation-only baselines. The framework further shows that sign-aware transfer of optimizer state during training preserves trajectory consistency, and that interpretability claims must be stated relative to explicit norms to be reproducible.

Background and Context

Modern large language model (LLM) pipelines increasingly need to transfer coordinate-indexed objects across model checkpoints. These objects include steering vectors, sparse autoencoder (SAE) features, Top-k neuron sets, and attribution lists, which are central to model editing, interpretability analysis, and intervention. However, such a transfer is only well-defined if the model's residual-stream norms are fixed. Without a consistent normalization framework, the coordinates of the model's internal representations are ambiguous, and aligning or transferring features between training stages or model variants produces significant errors.

A fundamental theoretical gap has been identified in how current tools handle normalization symmetries. Previous research often assumed that alignment could be achieved through permutations alone, corresponding to the permutation group $S_d$. This assumption holds for architectures using LayerNorm, where the residual stream is symmetric under $S_d$ (up to a global sign flip). However, the majority of modern LLMs use RMSNorm, which drops LayerNorm's mean subtraction and keeps a generic per-channel gain. This architectural choice changes the symmetry group of the residual stream. For RMSNorm architectures, the symmetry group expands to the signed permutation group $B_d = S_d \ltimes \{\pm 1\}^d$. Each channel can therefore flip its sign independently, a degree of freedom that permutation-only alignment ignores entirely.
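The symmetry difference described above can be checked numerically. The following is a minimal sketch, not code from the paper: `rms_norm`, `layer_norm`, and `act` are hypothetical helper names, and the convention that an element of $B_d$ acts as `y[i] = sign[i] * x[perm[i]]` is an assumption made for illustration.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: divide by the root-mean-square, then apply a per-channel gain.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def layer_norm(x, gain, bias, eps=1e-6):
    # LayerNorm: subtract the mean, divide by the standard deviation, then gain and bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

def act(perm, sign, x):
    # Assumed action of (perm, sign) in B_d on the last axis: y[i] = sign[i] * x[perm[i]].
    return sign * x[..., perm]

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
gain = rng.uniform(0.5, 1.5, size=d)
bias = rng.normal(size=d)
perm = rng.permutation(d)
sign = rng.choice([-1.0, 1.0], size=d)
sign[:2] = [1.0, -1.0]  # force mixed signs so the test is not a global flip
ones = np.ones(d)

# RMSNorm commutes with a signed permutation once the gain is permuted.
rms_signed_ok = np.allclose(rms_norm(act(perm, sign, x), gain[perm]),
                            act(perm, sign, rms_norm(x, gain)))
# LayerNorm commutes with a pure permutation ...
ln_perm_ok = np.allclose(layer_norm(act(perm, ones, x), gain[perm], bias[perm]),
                         act(perm, ones, layer_norm(x, gain, bias)))
# ... but not with independent per-channel sign flips (the mean changes) ...
ln_signed_ok = np.allclose(layer_norm(act(perm, sign, x), gain[perm], sign * bias[perm]),
                           act(perm, sign, layer_norm(x, gain, bias)))
# ... while a global sign flip is still a symmetry.
ln_global_flip_ok = np.allclose(layer_norm(-x, gain, -bias), -layer_norm(x, gain, bias))

print(rms_signed_ok, ln_perm_ok, ln_signed_ok, ln_global_flip_ok)
```

For generic random inputs, the first, second, and fourth checks are expected to hold and the third to fail, which mirrors the $S_d$ versus $B_d$ distinction in the text.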

This oversight has led to a systemic failure in many existing model editing and interpretability methods. By incorrectly assuming a simpler norm structure, these tools introduce systematic biases when applied to RMSNorm-based models. The recent study highlights that ignoring the sign-permutation symmetry results in an incomplete alignment process. Consequently, any attempt to transfer coordinate-indexed objects without accounting for $B_d$ symmetry is theoretically flawed, rendering subsequent alignment tools ineffective and potentially producing misleading results in critical applications such as sentiment steering or refusal intervention.

Deep Analysis

To address the incompleteness of permutation-only alignment, the authors introduce a sign-marginalized Hungarian matching algorithm. Rather than treating coordinates as an unordered set to be matched by permutation, the method explicitly handles the sign-permutation symmetry inherent in RMSNorm. The core result is a proof that raw signed-correlation matching, under decorrelated coordinates, has a structural accuracy ceiling. This ceiling is determined solely by the proportion of positive signs in the true norm, so high accuracy is impossible without addressing the sign dimension directly.
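The ceiling can be reproduced on synthetic data. The sketch below is an illustration under assumptions not taken from the paper: activations are i.i.d. Gaussian (hence decorrelated), the second run is a noisy signed permutation of the first, and `cross_corr` is a hypothetical helper. It uses SciPy's `linear_sum_assignment` as the Hungarian-style solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cross_corr(u, v):
    # Pearson correlation between every column of u and every column of v.
    u = (u - u.mean(axis=0)) / u.std(axis=0)
    v = (v - v.mean(axis=0)) / v.std(axis=0)
    return u.T @ v / u.shape[0]

rng = np.random.default_rng(0)
d, n = 64, 4000
perm_true = rng.permutation(d)
sign_true = rng.choice([-1.0, 1.0], size=d)

# Run A has decorrelated coordinates; run B is a signed permutation of A plus noise.
acts_a = rng.normal(size=(n, d))
acts_b = sign_true * acts_a[:, perm_true] + 0.1 * rng.normal(size=(n, d))

# Raw matching: maximize the signed correlation itself, ignoring possible sign flips.
corr = cross_corr(acts_b, acts_a)  # corr[i, j] = corr(B column i, A column j)
_, cols = linear_sum_assignment(corr, maximize=True)

raw_accuracy = float(np.mean(cols == perm_true))
positive_fraction = float(np.mean(sign_true > 0))
print(raw_accuracy, positive_fraction)
```

In this toy setting a sign-flipped channel has a strongly negative correlation with its true partner, so the solver avoids that pairing and the accuracy tracks the fraction of positive signs.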

The proposed algorithm eliminates this structural limitation through sign marginalization. By marginalizing over the per-channel signs, the algorithm removes the ambiguity that prevents accurate matching, which allows a more precise recovery of the true norm transformation between checkpoints. The technical implementation focuses on coordinate-preserving transfer rather than function-level merging. This distinction matters because it keeps the meaning of individual coordinates consistent throughout the model's fine-tuning process, providing a sound basis for downstream tasks.
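A minimal sketch of sign-aware matching follows. It approximates sign marginalization by matching on the absolute correlation (that is, maximizing over the two possible signs) and then reading each sign off the matched entry; the paper's exact marginalization may differ. The function name `signed_hungarian_match`, the helper `cross_corr`, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cross_corr(u, v):
    # Pearson correlation between every column of u and every column of v.
    u = (u - u.mean(axis=0)) / u.std(axis=0)
    v = (v - v.mean(axis=0)) / v.std(axis=0)
    return u.T @ v / u.shape[0]

def signed_hungarian_match(acts_src, acts_tgt):
    # Returns (perm, sign) such that acts_tgt[:, i] ~ sign[i] * acts_src[:, perm[i]].
    corr = cross_corr(acts_tgt, acts_src)
    # Matching on |corr| makes the assignment indifferent to per-channel sign flips.
    rows, cols = linear_sum_assignment(np.abs(corr), maximize=True)
    # The sign of each matched correlation gives the per-channel flip.
    return cols, np.sign(corr[rows, cols])

rng = np.random.default_rng(0)
d, n = 64, 4000
perm_true = rng.permutation(d)
sign_true = rng.choice([-1.0, 1.0], size=d)
acts_a = rng.normal(size=(n, d))
acts_b = sign_true * acts_a[:, perm_true] + 0.1 * rng.normal(size=(n, d))

perm_hat, sign_hat = signed_hungarian_match(acts_a, acts_b)
perm_accuracy = float(np.mean(perm_hat == perm_true))
sign_accuracy = float(np.mean(sign_hat == sign_true))
print(perm_accuracy, sign_accuracy)
```

On this synthetic example both the permutation and the signs are expected to be recovered, in contrast to matching on the raw signed correlation.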

The study further demonstrates that composing local $B_d$ norms preserves coordinate identity across a fine-tuning trajectory. By saving the local $B_d$ norm at each checkpoint along the same baseline, the researchers constructed a mechanism that tracks coordinate changes precisely. This mechanism corrects not only the permutation order of the coordinates but also the sign flip of each individual coordinate channel. This dual correction keeps the transferred objects functionally equivalent to their originals, which permutation-only alignment cannot guarantee.
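Composition of local alignments can be sketched as group multiplication in $B_d$. The example below is illustrative only: the `(perm, sign)` representation, the helpers `act`, `compose`, and `cosine`, and the random per-checkpoint alignments are assumptions rather than the paper's implementation.

```python
import numpy as np

def act(g, x):
    # Assumed action of g = (perm, sign) in B_d: y[i] = sign[i] * x[perm[i]].
    perm, sign = g
    return sign * x[..., perm]

def compose(g2, g1):
    # Single element equivalent to applying g1 first and g2 second.
    perm1, sign1 = g1
    perm2, sign2 = g2
    return perm1[perm2], sign2 * sign1[perm2]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d, num_checkpoints = 256, 5

# Hypothetical local alignments, one per consecutive pair of checkpoints.
local = [(rng.permutation(d), rng.choice([-1.0, 1.0], size=d))
         for _ in range(num_checkpoints)]

# Compose them into one end-to-end alignment, starting from the identity.
total = (np.arange(d), np.ones(d))
for g in local:
    total = compose(g, total)

# Reference: carry a steering vector through every checkpoint in turn.
steer = rng.normal(size=d)
stepwise = steer
for g in local:
    stepwise = act(g, stepwise)

bd_transfer = act(total, steer)                    # permutation and signs
sd_transfer = act((total[0], np.ones(d)), steer)   # permutation only, signs dropped

cos_bd = cosine(bd_transfer, stepwise)
cos_sd = cosine(sd_transfer, stepwise)
print(cos_bd, cos_sd)
```

The composed $B_d$ element reproduces the step-by-step transfer, while dropping the signs leaves a vector whose cosine similarity to the reference is a sum of randomly signed terms and can even be negative.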

Industry Impact

The experimental validation of this framework reveals significant performance gaps between $B_d$-norm-based alignment and traditional permutation-only baselines. In a coordinate recovery experiment involving 1500 steps of fine-tuning on the same baseline, the proposed method recovered 91.1% of cross-run coordinates. In stark contrast, the traditional endpoint matching method, which relies solely on permutations, managed to recover only 60.3%. This substantial gain is not merely a result of routing through the baseline but is directly attributable to the correct handling of sign symmetry. The data underscores the practical necessity of $B_d$ normalization for reliable model operations.

In specific application tasks, the gap is even larger. In the TinyLlama sparse autoencoder (SAE) reconstruction task, the normalized mean squared error (NMSE) was 0.004 under $B_d$ normalization and 1.08 under permutation-only $S_d$ normalization. An error of that size indicates that permutation-only methods fail to capture the structure of the features, leading to near-total reconstruction failure. The implications for research relying on SAEs for mechanistic interpretability are severe, as standard methods may be analyzing noise rather than meaningful features.

The impact on steering tasks is equally dramatic. In Qwen sentiment steering, the $B_d$ norm preserved 95.8% of the steering effect. However, under $S_d$ normalization, this effectiveness dropped precipitously to 17.2%. More critically, in refusal steering tasks, the use of $S_d$ normalization caused the steering sign to reverse, completely negating the intervention and potentially inducing the opposite behavior. These results demonstrate that ignoring sign symmetry does not just reduce efficiency; it can actively invert the intended model behavior, posing significant risks for safety and control applications.

Outlook

The framework also shows that sign-aware transfer of optimizer state during training preserves trajectory consistency. The AdamW state, when transferred using the $B_d$ norm, maintains the recovered trajectory. In contrast, states aligned only by permutations deviate from the functionally equivalent checkpoint trajectory. The benefits of $B_d$ normalization therefore extend beyond static feature transfer to the training dynamics themselves, keeping optimization paths consistent. This matters for distributed training and model merging strategies, where maintaining state consistency is essential.

The study also identifies a requirement for reproducibility in interpretability research: interpretability claims must be stated relative to explicit norms. Without a stated norm assumption, results from different labs or tools may be incomparable or even contradictory. This calls for a shift in community standards, with researchers explicitly declaring the normalization framework used in their analyses. It also suggests that many past interpretability findings may need re-evaluation under the correct $B_d$ symmetry constraints.

For the broader industry, sign-permutation transfer offers a way to improve model merging strategies and fine-tuning efficiency. By reducing the performance degradation caused by norm inconsistencies, companies can build more robust model intervention tools. Future research should focus on the efficient computation and transmission of $B_d$ norms in large-scale models. Applying the framework to other architectures could further strengthen the theoretical foundations of LLM interpretability and alignment, moving the field toward more standardized and reliable practice.
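The optimizer-state result discussed in this section can be illustrated on a single residual-indexed parameter vector. The sketch below is a toy under stated assumptions, not the paper's procedure: `adamw_step` is a hand-written AdamW update with decoupled weight decay, the gradient is assumed to transform like the parameter, and the state values are random. The first moment carries the sign of the gradient and so takes the full signed permutation, while the second moment is built from squared gradients and takes the permutation only.

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    # One AdamW update with bias correction and decoupled weight decay.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

rng = np.random.default_rng(0)
d, t = 32, 10
perm = rng.permutation(d)
sign = rng.choice([-1.0, 1.0], size=d)
sign[:2] = [1.0, -1.0]  # force mixed signs

theta = rng.normal(size=d)
m = rng.normal(size=d) * 0.1         # first moment (signed)
v = rng.uniform(0.01, 0.1, size=d)   # second moment (non-negative)
grad = rng.normal(size=d)

# Reference: step in the original coordinates, then move the result to the new basis.
theta_next, _, _ = adamw_step(theta, m, v, grad, t)
expected = sign * theta_next[perm]

# B_d transfer: parameter, gradient and first moment get signs; second moment does not.
theta_bd, _, _ = adamw_step(sign * theta[perm], sign * m[perm], v[perm],
                            sign * grad[perm], t)

# Permutation-only state transfer: the first moment loses its sign flips.
theta_sd, _, _ = adamw_step(sign * theta[perm], m[perm], v[perm],
                            sign * grad[perm], t)

bd_consistent = np.allclose(theta_bd, expected)
sd_consistent = np.allclose(theta_sd, expected)
print(bd_consistent, sd_consistent)
```

In this toy case the $B_d$-transferred state reproduces the reference update exactly, while the permutation-only state already diverges after one step. Weight matrices would need the same transformation applied along whichever axes are indexed by the residual coordinates.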

The transition from permutation-only to sign-permutation-aware alignment marks a significant maturation in the field of mechanistic interpretability. As LLMs continue to grow in size and complexity, the ability to precisely track and manipulate internal representations becomes increasingly vital. The $B_d$ norm framework provides the necessary mathematical rigor to ensure that these manipulations are accurate and reproducible. This research not only solves a specific technical bottleneck but also establishes a new standard for how we understand and interact with the internal workings of modern language models. The implications for safety, control, and scientific understanding of AI systems are far-reaching, urging the community to adopt more rigorous theoretical standards in their daily work.

Sources

arXiv