# Attention Residuals Paper: Kimi Rewrites the 10-Year-Old Residual Connection
Moonshot AI's AttnRes paper challenges the decade-old fixed residual connection paradigm in Transformers. The core innovation replaces fixed accumulation with softmax attention: each layer learns a pseudo-query to compute attention weights over all preceding layer outputs. A Block AttnRes variant reduces the memory overhead for large-scale models. Validated on Kimi Linear (a 48B-parameter MoE trained on 1.4T tokens), the method consistently outperforms baselines on MMLU, GPQA-Diamond, BBH, Math, and HumanEval with minimal overhead.
Moonshot AI's Attention Residuals (AttnRes) paper, published March 16, 2026, challenges the fundamental design of fixed additive residual connections in Transformers—a design that has gone essentially unchanged since the original Transformer paper in 2017.
## The Problem: What's Wrong with Fixed Residuals?
Standard residuals: `h_l = F_l(h_{l-1}) + h_{l-1}` (fixed 1:1 weighting)
Three systematic issues:
1. **PreNorm Dilution**: Layer normalization compresses inter-layer variance; fixed residuals progressively dilute learned representations, making deeper layers contribute less
2. **Information Access Limitation**: Each layer can only see the previous layer's output—distant layers (5 or 10 steps back) are inaccessible without the "telephone game" distortion of intermediate transforms
3. **Uneven Gradient Propagation**: Gradients flow unevenly across depths, affecting training stability
## The AttnRes Solution
Replace the fixed addition with learnable depth-wise attention:

h_l = F_l(h_{l-1}) + Σ_{j<l} α_{l,j} · h_j

where the weights α_{l,j} come from a softmax over scores between a learned pseudo-query for layer l and each preceding output h_j. Each layer selectively attends to ALL prior layers with learned importance weights. This provides adaptive selection (dynamic weighting based on context), full historical visibility, and weights learned end to end.
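The depth-wise attention can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: scoring each prior hidden state by a dot product with the pseudo-query `q` is an assumption, since the exact scoring function is not specified here.

```python
import numpy as np

def attnres_combine(prev_outputs, q):
    """Depth-wise attention over all preceding layer outputs.

    prev_outputs: list of (d,) hidden states h_0 .. h_{l-1}
    q:            (d,) learned pseudo-query for the current layer
                  (assumed dot-product scoring; a sketch, not the
                  paper's exact formulation)
    """
    H = np.stack(prev_outputs)                # (l, d)
    scores = H @ q / np.sqrt(H.shape[1])      # scaled dot products, (l,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # softmax: weights sum to 1
    return alpha @ H, alpha                   # attended residual input

rng = np.random.default_rng(0)
prev = [rng.standard_normal(64) for _ in range(5)]
q = rng.standard_normal(64)
mixed, alpha = attnres_combine(prev, q)
```

The returned `mixed` vector replaces the single fixed residual term h_{l-1}; a uniform `alpha` would recover plain averaging, while a one-hot `alpha` would recover a skip to one specific earlier layer.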
## Block AttnRes: Making It Practical
Pure AttnRes has O(L²) overhead. **Block AttnRes** partitions layers into blocks of k layers, applying attention within blocks and standard residuals between blocks. This reduces overhead from O(L²) to O(L·k)—practical at scale.
## Experimental Results (Kimi Linear, 48B MoE)
Key finding: Block AttnRes matches the performance of a standard PreNorm baseline trained with ~1.25x more compute, and the lower loss along the scaling curve suggests the advantage grows with model size.

At scale, a 1.25x compute advantage means:
- A frontier model that would cost $1B to train on the baseline recipe costs roughly $800M (a ~$200M saving), or
- The same budget produces a proportionally stronger model
## Open Questions
- Independent reproduction needed (only Kimi's MoE tested so far)
- Does the 1.25x advantage hold at 1B, 7B, 70B parameter scales?
- Performance on long-context tasks?
- Sensitivity to block size k?
The paper and code are on GitHub. If independent validation confirms the results across architectures, AttnRes could become a fundamental building block of next-generation large models.