Kimi Launches Attention Residuals with 1.25x Compute Advantage
Moonshot AI's Kimi team released the Attention Residuals (AttnRes) paper, proposing softmax attention over previous layers as a replacement for the Transformer's decade-old fixed residual connections. Validated on the Kimi Linear 48B MoE model, AttnRes matches a baseline trained with ~1.25x more compute while adding under 4% training overhead and under 2% inference latency. The paper and code are open-sourced on GitHub.
On March 16, 2026, Moonshot AI's Kimi team published a paper introducing **Attention Residuals (AttnRes)**, a novel architecture that replaces the fixed additive residual connections in Transformers with an attention-based mixing mechanism across layers. Its practical variant, Block AttnRes, demonstrates a **~1.25x compute advantage** over standard PreNorm baselines.
The Problem: Fixed Residual Connections Are a Known Limitation
Standard Transformer residuals follow a simple rule: each layer's output is added directly to its input, the residual stream, with a fixed 1:1 weight. This creates three problems:
- **PreNorm dilution**: the residual stream's magnitude grows with depth, so each new layer's relative contribution shrinks
- **Rigid information flow**: each layer can only read the accumulated stream handed to it by the previous layer, never specific earlier layers
- **Uneven gradient propagation**: gradients distribute unevenly across depths
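To make the fixed-weighting limitation concrete, here is a minimal sketch of a standard PreNorm residual update. The sublayer and dimensions are toy stand-ins, not anything from the paper; the point is that every layer's contribution enters the stream with the same implicit 1:1 weight.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard layer normalization over the feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def prenorm_block(h, sublayer):
    # Fixed additive residual: the sublayer output is added to the
    # residual stream with an implicit 1:1 weight, regardless of depth.
    return h + sublayer(layer_norm(h))

# Toy linear sublayer (stands in for attention or an MLP).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(16, 16))
f = lambda x: x @ W

h = rng.normal(size=(4, 16))   # (tokens, features)
for _ in range(8):             # stack of 8 layers, all weighted 1:1
    h = prenorm_block(h, f)
```

Because the addition is unweighted, deeper layers cannot down-weight a noisy early contribution or emphasize a particularly useful one; that inflexibility is what AttnRes targets.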
The Core Innovation: Attention Replaces Addition
Instead of fixed additive residuals, AttnRes lets each layer **selectively weight contributions from all prior layers** using a depth-wise attention mechanism analogous to token-level attention. Each layer learns which previous layers are most relevant for its computation.
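The mechanism can be sketched as attention where the "sequence" axis is depth rather than token position. This is an illustrative reconstruction under assumptions of mine (the function name, the per-token query from the newest state, and the projection shapes are hypothetical), not the paper's actual formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn_residual(prior_states, query_proj, key_proj):
    # prior_states: list of (tokens, d) outputs from layers 0..l-1.
    # A depth-wise query from the newest state attends over all prior
    # layers; the residual input becomes a learned convex mixture of
    # them instead of just the immediately preceding state.
    H = np.stack(prior_states)            # (l, tokens, d)
    q = H[-1] @ query_proj                # (tokens, d_k) query at current depth
    K = H @ key_proj                      # (l, tokens, d_k) one key per layer
    # Score each prior layer per token: (tokens, l)
    scores = np.einsum('td,ltd->tl', q, K) / np.sqrt(q.shape[-1])
    w = softmax(scores)                   # per-token weights over depth
    return np.einsum('tl,ltd->td', w, H)  # mixed residual input

rng = np.random.default_rng(1)
d, dk = 16, 8
states = [rng.normal(size=(4, d)) for _ in range(5)]
mixed = attn_residual(states, rng.normal(size=(d, dk)), rng.normal(size=(d, dk)))
```

The softmax over the depth axis is what lets a layer emphasize, say, layer 2's features over layer 4's, which a fixed additive stream cannot do.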
Block AttnRes makes this practical by partitioning layers into blocks and applying attention over block-level representations, reducing the memory and communication overhead from quadratic in depth to quadratic in the much smaller number of blocks.
Results on Kimi Linear (48B MoE)
When integrated into Kimi Linear—a 48-billion parameter Mixture-of-Experts model—AttnRes demonstrated:
- Improvements across reasoning, coding, and general evaluation benchmarks
- **Block AttnRes matches the performance of a baseline trained with ~1.25x more compute**
- Lower loss along the scaling curve than the PreNorm baseline, with the advantage widening at larger scale
What 1.25x Compute Means in Practice
At a $100M training budget, a 1.25x efficiency gain is worth ~$20M in saved compute: the same quality for $80M. At $1B scale, that's ~$200M. As frontier models push toward trillion-parameter scale, architectural efficiency gains of this magnitude have enormous commercial implications.
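The savings arithmetic above follows directly from the compute-equivalence claim: matching a baseline trained with 1.25x more compute means buying that baseline's quality for budget/1.25.

```python
def compute_savings(budget, advantage=1.25):
    # A model that matches a baseline trained with `advantage`x more
    # compute effectively buys that quality at budget/advantage cost;
    # the saving is the difference.
    return budget - budget / advantage

print(compute_savings(100e6))  # ~2.0e7, i.e. ~$20M at a $100M budget
print(compute_savings(1e9))    # ~2.0e8, i.e. ~$200M at $1B
```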
The Kimi team has published the paper and code on GitHub for community verification—an open approach consistent with Chinese AI companies' recent strategy of building credibility through transparent technical disclosure.
In-Depth Analysis and Industry Outlook
From a broader perspective, this development reflects how quickly architectural research now moves from laboratories into production models, and many in the industry expect 2026 to be a pivotal year for AI commercialization. On the technical front, large-model inference efficiency keeps improving while deployment costs fall, putting advanced AI capabilities within reach of more small and mid-sized enterprises. On the market front, enterprise expectations for AI investment are shifting from long-term strategic value toward short-term, quantifiable returns.