# Attention Residuals Paper: Kimi Rewrites the 10-Year-Old Residual Connection
Moonshot AI's AttnRes paper challenges the decade-old fixed residual connection paradigm in Transformers. The core innovation replaces fixed accumulation with softmax attention: each layer learns a pseudo-query to compute attention weights over all preceding layer outputs. A Block AttnRes variant reduces the memory overhead for large-scale models. Validated on Kimi Linear (a 48B-parameter MoE trained on 1.4T tokens), the method consistently outperforms baselines on MMLU, GPQA-Diamond, BBH, Math, and HumanEval with minimal overhead.
Moonshot AI's Attention Residuals (AttnRes) paper, published March 16, 2026, challenges the fundamental design of fixed additive residual connections in Transformers—a design that has gone essentially unchanged since the original Transformer paper in 2017.
## The Problem: What's Wrong with Fixed Residuals?
Standard residuals: `h_l = F_l(h_{l-1}) + h_{l-1}` (fixed 1:1 weighting)
Three systematic issues:
1. **PreNorm Dilution**: Layer normalization compresses inter-layer variance; fixed residuals progressively dilute learned representations, making deeper layers contribute less
2. **Information Access Limitation**: Each layer can only see the previous layer's output—distant layers (5 or 10 steps back) are inaccessible without the "telephone game" distortion of intermediate transforms
3. **Uneven Gradient Propagation**: Gradients flow unevenly across depths, affecting training stability
## The AttnRes Solution
Replace the fixed addition with learnable depth-wise attention:

h_l = F_l(h_{l-1}) + Σ_{j<l} α_{l,j} · h_j

where the weights α_{l,j} come from a softmax over scores between a learned pseudo-query for layer l and each preceding output h_j. Each layer selectively attends to ALL prior layers with learned importance weights. This provides adaptive selection (dynamic weighting based on context), full historical visibility, and weights learned end to end.
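The depth-wise attention can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: scoring each prior hidden state by a dot product with the pseudo-query `q` is an assumption, since the exact scoring function is not specified here.

```python
import numpy as np

def attnres_combine(prev_outputs, q):
    """Depth-wise attention over all preceding layer outputs.

    prev_outputs: list of (d,) hidden states h_0 .. h_{l-1}
    q:            (d,) learned pseudo-query for the current layer
                  (assumed dot-product scoring; a sketch, not the
                  paper's exact formulation)
    """
    H = np.stack(prev_outputs)                # (l, d)
    scores = H @ q / np.sqrt(H.shape[1])      # scaled dot products, (l,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # softmax: weights sum to 1
    return alpha @ H, alpha                   # attended residual input

rng = np.random.default_rng(0)
prev = [rng.standard_normal(64) for _ in range(5)]
q = rng.standard_normal(64)
mixed, alpha = attnres_combine(prev, q)
```

The returned `mixed` vector replaces the single fixed residual term h_{l-1}; a uniform `alpha` would recover plain averaging, while a one-hot `alpha` would recover a skip to one specific earlier layer.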
## Block AttnRes: Making It Practical
Pure AttnRes has O(L²) overhead. **Block AttnRes** partitions layers into blocks of k layers, applying attention within blocks and standard residuals between blocks. This reduces overhead from O(L²) to O(L·k)—practical at scale.
## Experimental Results (Kimi Linear, 48B MoE)
Key finding: Block AttnRes matches the performance of a standard PreNorm baseline trained with ~1.25x more compute, and the lower loss along the scaling curve suggests the advantage grows with model size.

At scale, a 1.25x compute advantage means:
- A frontier model that would cost $1B to train on the baseline recipe costs roughly $800M (a ~$200M saving), or
- The same budget produces a proportionally stronger model
## Open Questions
- Independent reproduction needed (only Kimi's MoE tested so far)
- Does the 1.25x advantage hold at 1B, 7B, 70B parameter scales?
- Performance on long-context tasks?
- Sensitivity to block size k?
The paper and code are on GitHub. If independent validation confirms the results across architectures, AttnRes could become a fundamental building block of next-generation large models.