The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

This paper systematically investigates two recurring phenomena in Transformer language models: massive activations (extreme outlier values in hidden channels) and attention sinks (tokens attracting disproportionate attention regardless of semantic relevance). Through mechanistic analysis of the Llama and Qwen model families, the authors trace a complete causal pathway: SwiGLU feed-forward blocks act as directional quadratic amplifiers, generating extreme activations whenever a token representation aligns with a shared trigger direction.

RMSNorm then transforms spike tokens into sparse, near-constant vectors, collapsing their key projections into low-dimensional subspaces that are geometrically separable from non-sink keys; this separability is the root cause of attention sinks.

Ablation experiments demonstrate that both phenomena can be suppressed independently without degrading performance, revealing their co-occurrence as an architectural artifact. Accepted at ICML 2026.

The Spike, the Sparse and the Sink: Complete Causal Account

Background

Pre-norm decoder-only Transformers exhibit two anomalous phenomena: massive activations (extreme outlier values concentrated in a few hidden channels) and attention sinks (tokens that attract disproportionate attention regardless of semantic relevance). Both complicate quantization, pruning, and KV-cache management.
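To make the first phenomenon concrete, here is a minimal detection sketch; it assumes a HuggingFace-style Llama checkpoint, and the model name and prompt are illustrative rather than the paper's setup.

```python
# Minimal probe for massive activations: log the largest hidden-state
# magnitude per layer. Assumes a HuggingFace-style Llama checkpoint; the
# model name and prompt are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

inputs = tok("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    a = h.float().abs()  # h: (batch, seq, d_model)
    # A massive activation shows up as a max hundreds of times the median,
    # confined to a few fixed channels.
    print(layer, round(a.max().item(), 1), round(a.median().item(), 3))
```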

Life Cycle of Massive Activations

Step-up blocks (early FFN layers) inject extreme values, the residual stream passively propagates them, and step-down blocks (late layers) neutralize them with equal-magnitude values of opposite sign. In Llama 2 7B, block 4 injects and block 62 neutralizes.
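A hypothetical probe for this life cycle, continuing the sketch above: hook each FFN block and record its signed peak contribution to one residual-stream channel. The channel index below is illustrative, not a value reported in the paper, and the layer layout follows the HF Llama implementation.

```python
# Hypothetical life-cycle probe, continuing the sketch above (reuses `model`
# and `inputs`). SPIKE_CHANNEL is an illustrative index, not from the paper.
import torch

SPIKE_CHANNEL = 2533

contribs = {}
def make_hook(idx):
    def hook(module, args, output):
        v = output[..., SPIKE_CHANNEL].flatten()
        contribs[idx] = v[v.abs().argmax()].item()  # signed peak contribution
    return hook

handles = [layer.mlp.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]
with torch.no_grad():
    model(**inputs)
for h in handles:
    h.remove()

# A step-up block shows a large contribution early in the stack; its matching
# step-down block shows a late contribution of similar magnitude, opposite sign.
for idx, v in sorted(contribs.items()):
    print(idx, round(v, 1))
```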

Core Mechanism: SwiGLU as Directional Quadratic Amplifier

Under a near-identity gating approximation, the SwiGLU output reduces to a quadratic form with a rank-one dominant structure. All spike channels share a common trigger direction, so a token representation that aligns with this direction produces synchronized activation across them.
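A toy numeric check of this view, a sketch under the stated approximation rather than the paper's exact derivation: for large pre-activations SiLU(z) ≈ z, so one SwiGLU channel behaves like (g·x)(u·x), which is quadratic in the projection of x onto the trigger direction.

```python
# Toy numeric check of the quadratic-amplifier view, not the paper's exact
# derivation. With near-identity gating, SiLU(g @ x) ~ g @ x for large
# pre-activations, so one SwiGLU channel behaves like (g @ x) * (u @ x):
# quadratic in the projection of x onto the shared trigger direction.
import numpy as np

rng = np.random.default_rng(0)
d = 256
g = rng.standard_normal(d); g /= np.linalg.norm(g)  # gate row
u = g + 0.1 * rng.standard_normal(d)                # up row, nearly aligned
trigger = g                                         # unit trigger direction

def silu(z):
    return z / (1.0 + np.exp(-z))

for scale in [1, 2, 4, 8]:
    x = scale * trigger
    out = silu(g @ x) * (u @ x)  # one channel, before the down-projection
    print(scale, round(float(out), 2))  # roughly quadruples per doubling
```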

From Spikes to Sinks: The Normalization Bridge

RMSNorm transforms spike tokens in three ways: it bounds their magnitude, sparsifies them, and makes them nearly constant. As a result, their key projections collapse into low-dimensional subspaces, creating geometric separability from non-sink keys.
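A toy illustration of this bridge (the numbers are illustrative, not measurements from the paper): normalize a vector carrying one massive activation and observe that the output is bounded near sqrt(d), dominated by a single channel, and essentially independent of the spike's size.

```python
# Toy illustration of the normalization bridge; the channel index and spike
# magnitude are illustrative, not measurements from the paper.
import numpy as np

d = 4096
x = np.random.default_rng(1).standard_normal(d)
x[1512] = 3000.0                    # one massive activation

rms = np.sqrt(np.mean(x ** 2))      # dominated by the spike: ~ |spike| / sqrt(d)
y = x / rms                         # RMSNorm with unit gain weights

print(round(float(y[1512]), 2))     # ~ sqrt(d) = 64: bounded, whatever the spike
print(int(np.sum(np.abs(y) > 1)))   # 1: sparsified to essentially one channel
# Every spike token maps to nearly this same vector, so its key projection
# W_k @ y falls near a single direction: sink keys become separable.
```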

Ablations

  • Sandwich norm reduces peak spike magnitude from 3818 to 520
  • DynamicTanh eliminates massive activations entirely (see the sketch after this list)
  • Head dimension controls sink formation: the sink rate grows from 4.1% at d_head = 8 to 46.0% at d_head = 128
  • Both phenomena are independently suppressible without performance loss
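For the second ablation, here is a sketch of DynamicTanh (DyT) as a bounded drop-in replacement for RMSNorm; the module follows the published DyT formulation, but the initialization and usage details are assumptions, not necessarily this paper's exact setup.

```python
# Sketch of DynamicTanh (DyT) as a bounded drop-in replacement for RMSNorm.
# The formulation follows the published DyT module; initialization details
# here are assumptions, not necessarily this paper's exact setup.
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    def __init__(self, dim: int, alpha0: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha0))  # learnable scale
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # |tanh| <= 1, so every output entry is bounded by |weight| + |bias|;
        # a spike cannot propagate past this layer.
        return self.weight * torch.tanh(self.alpha * x) + self.bias

spiky = torch.randn(1, 8, 4096)
spiky[0, 0, 123] = 3818.0  # peak magnitude quoted in the ablation above
print(DynamicTanh(4096)(spiky).abs().max().item())  # ~1.0: spike is capped
```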

Practical Impact

The analysis yields concrete guidance for quantization, KV-cache management, and architecture design.
