Kimi Launches Attention Residuals with 1.25x Compute Advantage
Moonshot AI's Kimi team released the Attention Residuals (AttnRes) paper, proposing softmax attention over previous layers as a replacement for the Transformer's decade-old fixed residual connections. Validated on the Kimi Linear 48B MoE model, AttnRes matches a baseline trained with ~1.25x more compute while adding under 4% training overhead and under 2% inference latency. The paper and code are open-sourced on GitHub.
On March 16, 2026, Moonshot AI's Kimi team published a paper introducing **Attention Residuals (AttnRes)**, a novel architecture that replaces the fixed additive residual connections in Transformers with an attention-based mixing mechanism across layers. Its practical variant, Block AttnRes, demonstrates a **~1.25x compute advantage** over standard PreNorm baselines.
The Problem: Fixed Residual Connections Are a Known Limitation
Standard Transformer residuals follow a simple rule: each layer's output is added directly to its input, the residual stream, with a fixed 1:1 weight. This creates three problems:
- **PreNorm dilution**: the residual stream's magnitude grows with depth, so each new layer's relative contribution shrinks
- **Rigid information flow**: each layer can only read the accumulated stream handed to it by the previous layer, never specific earlier layers
- **Uneven gradient propagation**: gradients distribute unevenly across depths
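To make the fixed-weighting limitation concrete, here is a minimal sketch of a standard PreNorm residual update. The sublayer and dimensions are toy stand-ins, not anything from the paper; the point is that every layer's contribution enters the stream with the same implicit 1:1 weight.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard layer normalization over the feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def prenorm_block(h, sublayer):
    # Fixed additive residual: the sublayer output is added to the
    # residual stream with an implicit 1:1 weight, regardless of depth.
    return h + sublayer(layer_norm(h))

# Toy linear sublayer (stands in for attention or an MLP).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(16, 16))
f = lambda x: x @ W

h = rng.normal(size=(4, 16))   # (tokens, features)
for _ in range(8):             # stack of 8 layers, all weighted 1:1
    h = prenorm_block(h, f)
```

Because the addition is unweighted, deeper layers cannot down-weight a noisy early contribution or emphasize a particularly useful one; that inflexibility is what AttnRes targets.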
The Core Innovation: Attention Replaces Addition
Instead of fixed additive residuals, AttnRes lets each layer **selectively weight contributions from all prior layers** using a depth-wise attention mechanism analogous to token-level attention. Each layer learns which previous layers are most relevant for its computation.
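The mechanism can be sketched as attention where the "sequence" axis is depth rather than token position. This is an illustrative reconstruction under assumptions of mine (the function name, the per-token query from the newest state, and the projection shapes are hypothetical), not the paper's actual formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn_residual(prior_states, query_proj, key_proj):
    # prior_states: list of (tokens, d) outputs from layers 0..l-1.
    # A depth-wise query from the newest state attends over all prior
    # layers; the residual input becomes a learned convex mixture of
    # them instead of just the immediately preceding state.
    H = np.stack(prior_states)            # (l, tokens, d)
    q = H[-1] @ query_proj                # (tokens, d_k) query at current depth
    K = H @ key_proj                      # (l, tokens, d_k) one key per layer
    # Score each prior layer per token: (tokens, l)
    scores = np.einsum('td,ltd->tl', q, K) / np.sqrt(q.shape[-1])
    w = softmax(scores)                   # per-token weights over depth
    return np.einsum('tl,ltd->td', w, H)  # mixed residual input

rng = np.random.default_rng(1)
d, dk = 16, 8
states = [rng.normal(size=(4, d)) for _ in range(5)]
mixed = attn_residual(states, rng.normal(size=(d, dk)), rng.normal(size=(d, dk)))
```

The softmax over the depth axis is what lets a layer emphasize, say, layer 2's features over layer 4's, which a fixed additive stream cannot do.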
Block AttnRes makes this practical by partitioning layers into blocks and applying attention over block-level representations, reducing the memory and communication overhead from quadratic in depth to quadratic in the much smaller number of blocks.
Results on Kimi Linear (48B MoE)
When integrated into Kimi Linear—a 48-billion parameter Mixture-of-Experts model—AttnRes demonstrated:
- Improvements across reasoning, coding, and general evaluation benchmarks
- **Block AttnRes matches the performance of a baseline trained with ~1.25x more compute**
- Lower loss along the scaling curve than the PreNorm baseline, with the advantage widening at larger scale
What 1.25x Compute Means in Practice
At a $100M training budget, a 1.25x efficiency gain is worth ~$20M in saved compute: the same quality for $80M. At $1B scale, that's ~$200M. As frontier models push toward trillion-parameter scale, architectural efficiency gains of this magnitude have enormous commercial implications.
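The savings arithmetic above follows directly from the compute-equivalence claim: matching a baseline trained with 1.25x more compute means buying that baseline's quality for budget/1.25.

```python
def compute_savings(budget, advantage=1.25):
    # A model that matches a baseline trained with `advantage`x more
    # compute effectively buys that quality at budget/advantage cost;
    # the saving is the difference.
    return budget - budget / advantage

print(compute_savings(100e6))  # ~2.0e7, i.e. ~$20M at a $100M budget
print(compute_savings(1e9))    # ~2.0e8, i.e. ~$200M at $1B
```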
The Kimi team has published the paper and code on GitHub for community verification—an open approach consistent with Chinese AI companies' recent strategy of building credibility through transparent technical disclosure.
In-Depth Analysis and Industry Outlook
From a broader perspective, this development reflects how quickly architectural research now moves from laboratories into production models, and many in the industry expect 2026 to be a pivotal year for AI commercialization. On the technical front, large-model inference efficiency keeps improving while deployment costs fall, putting advanced AI capabilities within reach of more small and mid-sized enterprises. On the market front, enterprise expectations for AI investment are shifting from long-term strategic value toward short-term, quantifiable returns.