MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 9 min read · Boris Tamarkin
MLA
Attention
Deepseek
Flash Attention
KV Cache
Training
Inference

MLA weight absorption: what we kept and what we dropped for the C++ specialists

Multi-Head Latent Attention in production: why DeepSeek's absorbed decode path is the right choice for KV cache, why it is the wrong choice for training, and how the C++ specialist ensemble uses both.

Multi-Head Latent Attention (MLA) is the one piece of the DeepSeek-V3 architecture that keeps showing up in every candidate attention stack for MegaCpp's specialist SLMs. The draw is obvious: a compressed latent c_kv an order of magnitude smaller than concatenated K/V, a KV-cache that fits in one B200 for sequences that would otherwise need two, and a numerically clean separation of the positional (RoPE) and content (NoPE) parts of the query.
The trap is that the public DeepSeek inference path does not just use MLA, it uses a specific reformulation called weight absorption, and that reformulation is correct for decode and wrong for training. This post walks through the analysis we did to pin that down, and which pieces of MLA survived into the C++ specialists.

What MLA looks like when we train it

The straightforward MLA path is what our MegaCpp training code does today. The KV projection produces a low-rank latent:

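# row-vector convention: tokens are rows, weights are (in_dim, out_dim)
c_kv   = x @ W_dkv              # (B, T, kv_lora_rank)
K_nope = norm(c_kv) @ W_uk      # (B, T, H, d_nope)
V      = norm(c_kv) @ W_uv      # (B, T, H, d_v)
K      = concat(K_nope, K_rope) # K_rope: (B, T, 1, d_rope), broadcast across heads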

The full K and V are materialised per layer, Flash Attention consumes Q, K, and V as three dense tensors with the standard shapes, and softmax(Q K^T / sqrt(d)) V is fused into one kernel with O(T) activation memory. With recompute_kv_upproj=True (our training default), only c_kv ([B, T, kv_lora_rank]) and K_rope ([B, T, 1, d_rope]) are saved for backward; K and V are regenerated from the latent in the backward pass.
This is a classic activation/FLOP trade and it works cleanly with every dense backend: FA3, FA4, FlexAttention, Pallas/Splash, SDPA, and the manual fallback. The numerical model is one attention operator per block with one set of Q/K/V inputs and one softmax. The checked-in Fused MLA projection sample is the small public-safe version of that idea: recompute the latent projection chain in backward, not the attention algebra itself.
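As a rough sizing aid for the recompute trade described above, the sketch below counts the bytes held for backward per attention block when only the latent and the RoPE key are saved, against saving the expanded K and V. The function names are illustrative and not taken from the MegaCpp code, and the shapes (B=8, T=8192, H=24, DeepSeek-style head dimensions, BF16) are assumptions picked for the example rather than measured values.

```python
BYTES_BF16 = 2

def saved_bytes_with_recompute(B, T, kv_lora_rank, d_rope):
    """Bytes kept for backward when only c_kv and K_rope are saved."""
    c_kv = B * T * kv_lora_rank
    k_rope = B * T * 1 * d_rope  # one shared RoPE key per token
    return (c_kv + k_rope) * BYTES_BF16

def saved_bytes_without_recompute(B, T, H, d_nope, d_rope, d_v):
    """Bytes for the expanded per-head K and V alone, if they were saved."""
    k = B * T * H * (d_nope + d_rope)
    v = B * T * H * d_v
    return (k + v) * BYTES_BF16

# Assumed example shapes, not measured values.
B, T, H = 8, 8192, 24
lean = saved_bytes_with_recompute(B, T, kv_lora_rank=512, d_rope=64)
full = saved_bytes_without_recompute(B, T, H, d_nope=128, d_rope=64, d_v=128)
print(f"latent-only save:  {lean / 2**20:.0f} MiB")
print(f"expanded K/V save: {full / 2**20:.0f} MiB")
```

Under these assumptions the latent-only save is 72 MiB per block against 960 MiB for the expanded tensors, which is the gap the recompute flag is buying.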

What weight absorption is, algebraically

DeepSeek's inference code carries a second path, attn_impl != "naive", that rewrites the score computation by absorbing the KV up-projection into Q and keeping attention in latent space. The identity is basic linear algebra:

# row-vector convention: K_nope = c_kv @ W_uk, with c_kv the normalised latent
Q_nope @ K_nope^T  =  Q_nope @ (c_kv @ W_uk)^T
                   =  (Q_nope @ W_uk^T) @ c_kv^T

Let Q' = Q_nope @ W_uk^T. Attention over the NoPE part is now Q' @ c_kv^T instead of Q_nope @ K_nope^T. The value side absorbs the same way:

output  =  attn_weights @ V
        =  attn_weights @ (c_kv @ W_uv)
        =  (attn_weights @ c_kv) @ W_uv
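Both identities are easy to check numerically. The sketch below does so for one head with toy sizes, using only the standard library. It assumes row-vector tokens and weights stored as (rank, out_dim), so that K_nope = c_kv @ W_uk and V = c_kv @ W_uv; that storage layout and the helper names are assumptions made for the example, not a description of any real checkpoint.

```python
import random

def matmul(a, b):
    """Plain nested-list matrix product: (n, k) @ (k, m) -> (n, m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def rand(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, d_nope, d_v, rank = 5, 4, 3, 6   # toy sizes for one head

q_nope = rand(T, d_nope, rng)
c_kv = rand(T, rank, rng)           # stands in for the normalised latent
w_uk = rand(rank, d_nope, rng)      # assumed layout: (rank, d_nope)
w_uv = rand(rank, d_v, rng)         # assumed layout: (rank, d_v)

# Key side: expand K_nope first, or absorb W_uk into the query.
k_nope = matmul(c_kv, w_uk)
scores_expanded = matmul(q_nope, transpose(k_nope))
q_absorbed = matmul(q_nope, transpose(w_uk))
scores_absorbed = matmul(q_absorbed, transpose(c_kv))

# Value side: any weight matrix works for the associativity check.
attn = rand(T, T, rng)
out_expanded = matmul(attn, matmul(c_kv, w_uv))
out_absorbed = matmul(matmul(attn, c_kv), w_uv)

key_err = max_abs_diff(scores_expanded, scores_absorbed)
value_err = max_abs_diff(out_expanded, out_absorbed)
print(key_err, value_err)
```

Both errors come out at floating-point rounding level, which is all the identity promises: the two forms are the same function, and differ only in which intermediate gets materialised.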

The net result is that the KV-cache stores c_kv (kv_lora_rank dims) plus K_rope (d_rope dims per KV head), and never materialises the full per-head K and V tensors. For decode of one token against a long history that is exactly the right optimisation: a DeepSeek-V3-scale cache drops from the multi-megabyte per-token regime to the sub-megabyte regime, and the sequence-length axis is walked over a compact latent instead of a fat K/V.

The price is that the attention score is no longer a single Q @ K^T. It is a sum of two terms that must be combined before softmax:

scores = (Q' @ c_kv^T) + (Q_rope @ K_rope^T)

Softmax is not distributive over addition, so the two terms have to be summed inside whatever kernel computes attention. No public Flash Attention variant supports this split-score contract. FA3 and FA4 expect a single Q/K/V triple. FlexAttention exposes a score_mod hook but not a "pre-softmax additive score from a second QK product". Pallas/Splash and SDPA make the same single-product assumption.
Therefore, to use the absorbed form in a fused kernel you would have to write that kernel yourself.
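A small numeric illustration of why the sum has to happen before normalisation. The scores are made-up numbers for one query against three cached tokens; nothing here is taken from a real kernel.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy scores for one query against three cached tokens (made-up numbers).
nope_scores = [2.0, 0.0, -1.0]   # stands in for Q' @ c_kv^T
rope_scores = [-1.0, 1.0, 0.5]   # stands in for Q_rope @ K_rope^T

# Correct: add the two score terms, then normalise once.
joint = softmax([a + b for a, b in zip(nope_scores, rope_scores)])

# Incorrect: normalise each term separately and add the results.
separate = [a + b for a, b in zip(softmax(nope_scores), softmax(rope_scores))]

print([round(p, 3) for p in joint])
print([round(p, 3) for p in separate])
```

The separately normalised version does not even sum to one, and rescaling it does not recover the joint distribution, so a kernel that only accepts one score product cannot be composed from two calls.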

The FLOP argument, for training

With DeepSeek-V3 defaults (H=128 heads, d_nope=128, d_rope=64, d_v=128, kv_lora_rank=512) the attention core FLOPs separate cleanly. The standard expand-then-attend path spends 2*B*H*T^2*320 FLOPs in the attention core: qk_head_dim=192 on the Q @ K^T product plus v_head_dim=128 on attn @ V. The absorbed path pays attention against the full kv_lora_rank on both sides: 2*B*H*T^2*512 for the NoPE score, 2*B*H*T^2*64 for the RoPE score, and 2*B*H*T^2*512 on the value side, for a total of 2*B*H*T^2*1088.
The projection FLOPs (W_uk absorption into Q plus the W_uv projection after attention) equal the FLOPs saved by skipping the explicit KV up-projection, so projections net to zero.

The ratio is 1088 / 320 = 3.4x more attention core FLOPs under absorption. For our NAM heads (H=24, kv_lora_rank=512) the constant changes but not the ratio; the attention core is still 3.4x heavier under absorption. The ratio improves when kv_lora_rank drops, but kv_lora_rank must stay at least as large as d_nope (and in practice 4x larger) for the low-rank approximation to carry the representational weight it is supposed to carry.
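The arithmetic is short enough to reproduce directly. The helper below is a hypothetical name introduced for this note; it returns the per-(query, key) width of the attention core for each path, since the common 2*B*H*T^2 factor cancels in the ratio. The kv_lora_rank sweep uses illustrative values, not configurations we trained.

```python
def core_flops_per_pair(d_nope, d_rope, d_v, kv_lora_rank):
    """Per-(query, key) multiply-accumulate width of the attention core.

    Multiply by 2 * B * H * T**2 to get FLOPs; that factor is shared by
    both paths and cancels in the ratio.
    """
    expanded = (d_nope + d_rope) + d_v               # Q @ K^T, then attn @ V
    absorbed = kv_lora_rank + d_rope + kv_lora_rank  # NoPE score, RoPE score, value
    return expanded, absorbed

expanded, absorbed = core_flops_per_pair(d_nope=128, d_rope=64, d_v=128, kv_lora_rank=512)
print(expanded, absorbed, absorbed / expanded)  # 320 1088 3.4

# Sweep kv_lora_rank to see how the penalty moves (illustrative values).
for rank in (128, 256, 512):
    e, a = core_flops_per_pair(128, 64, 128, rank)
    print(rank, round(a / e, 2))
```

The head count never enters the ratio, which is why changing H alone leaves the 3.4x figure untouched; only the head dimensions and kv_lora_rank move it.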

The activation-memory argument is worse. In standard MLA with recompute_kv_upproj=True, the saved tensors are c_kv and K_rope, a few tens of megabytes. Under absorption we would still save those, plus the expanded Q' = Q_nope @ W_uk^T at shape (B, T, H, kv_lora_rank), which is larger than standard Q at (B, T, H, qk_head_dim). More importantly, without Flash Attention we would have to save the materialised (B, H, T, T) attention-weight tensor to backprop through the split-score softmax. At B=8, H=24, T=8192 in BF16 that alone is 24 GiB per block (8 * 24 * 8192^2 elements at 2 bytes each). Catastrophic is the technical term.
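The size of that attention-weight tensor follows directly from its shape. A minimal sketch, with a hypothetical helper name and the same B, H, and T as the example in the text; the longer sequence lengths in the loop are illustrative.

```python
def attn_weight_bytes(B, H, T, bytes_per_elem=2):
    """Size of a materialised (B, H, T, T) attention-weight tensor."""
    return B * H * T * T * bytes_per_elem

size = attn_weight_bytes(B=8, H=24, T=8192)   # BF16 -> 2 bytes per element
print(f"{size / 2**30:.1f} GiB")               # 24.0 GiB

# The tensor grows quadratically with sequence length.
for T in (4096, 16384, 65536):
    print(T, f"{attn_weight_bytes(8, 24, T) / 2**30:.0f} GiB")
```

That is 24 GiB (about 25.8 GB) for a single block at T=8192, and the quadratic growth means every doubling of sequence length multiplies it by four.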

The training verdict is unambiguous: weight absorption is a decode optimisation. For training on the kind of sequence lengths our specialists see (4K up to 64K packed context graphs from the v4 context-graph sampler), expand-then-attend with Flash Attention and latent-only activation save is strictly cheaper on both FLOPs and memory.

Why inference is the opposite story

The inference regime inverts every term in that calculation. Decode runs at T_query = 1 against a T_kv that grows to tens of thousands of tokens. The attention core FLOPs scale as T_query * T_kv, so the 3.4x constant sits on top of a tiny number. The activation-memory argument vanishes entirely because decode does not backpropagate; there is no attention-weight tensor saved for backward, and there is no Flash Attention benefit to forfeit because the incremental decode kernel is a trivial softmax over the history. The only thing that actually scales is the KV-cache itself, and that is exactly what absorption shrinks.

Concretely, under absorption the per-token KV-cache entry is c_kv plus K_rope: kv_lora_rank + H_kv * d_rope scalars. Under the standard path the cache entry is per-head K and V: H * (d_nope + d_rope) + H * d_v scalars. For DeepSeek-V3-scale attention heads (H=128, d_nope=128, d_rope=64, d_v=128, kv_lora_rank=512, one shared RoPE key) that is 576 scalars against 40,960, roughly a 71x reduction. For a 1M-token C++ context that is the difference between a KV-cache that fits in one B200 and a KV-cache that does not fit in two.
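The two cache-entry formulas can be evaluated side by side. In the sketch below the function names are hypothetical, h_kv=1 reflects the single shared RoPE key implied by the K_rope shape [B, T, 1, d_rope], and the H=24 row assumes the NAM heads reuse the DeepSeek-V3 head dimensions, which is an assumption for the example.

```python
def cache_scalars_absorbed(kv_lora_rank, d_rope, h_kv=1):
    """Per-token, per-layer entry: latent c_kv plus the RoPE key."""
    return kv_lora_rank + h_kv * d_rope

def cache_scalars_expanded(H, d_nope, d_rope, d_v):
    """Per-token, per-layer entry: full per-head K and V."""
    return H * (d_nope + d_rope) + H * d_v

absorbed = cache_scalars_absorbed(kv_lora_rank=512, d_rope=64)
for H in (128, 24):  # DeepSeek-V3 head count, then the NAM head count
    expanded = cache_scalars_expanded(H, d_nope=128, d_rope=64, d_v=128)
    print(H, expanded, absorbed, round(expanded / absorbed, 1))
```

The absorbed entry does not depend on the head count at all, so the saving scales linearly with H: about 71x at 128 heads and about 13x at 24 heads under these assumptions.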

A secondary benefit for the serving path is that c_kv is a single low-rank tensor, which composes cleanly with a paged KV-cache, with block-sparse attention, and with the attention-sink mitigations described in the long-context post. Paged allocation on a 512-dim latent is a lot easier than paged allocation on a 192-dim per-head K plus a 128-dim per-head V across H heads. The serving-side KV cache and paged attention article keeps that cache contract reader-visible as one latent row plus one RoPE row per token instead of a head-by-head K/V slab.

What survived into the C++ specialists

MegaCpp's specialist ensemble uses MLA in two forms, chosen per regime.

| Regime | Form | KV-cache entry | Kernel path |
| --- | --- | --- | --- |
| Training | expand-then-attend | n/a (no cache) | Flash Attention |
| Serving | absorbed | c_kv + K_rope per token | split-score softmax |
| Long-ctx eval | absorbed | c_kv + K_rope per token | split-score softmax |

Training uses standard MLA. Every specialist is trained with the expand-then-attend path, recompute_kv_upproj=True, and Flash Attention on the dense attention blocks. The KV projection generates c_kv; K_nope, V, and the concatenated K are produced on the fly; only c_kv and K_rope are saved for backward.
On the MoE-heavy production hybrid this is paired with the MLA-up-proj selective recompute (--recompute-modules mla_up_proj), which is one of the named modules in the golden configuration. The reason this works is that MLA up-projection is cheap to recompute from the latent and expensive to store in activations; it is the exact kind of operation selective recompute was built for.

Inference uses absorbed MLA. For the serving path and long-context eval we run the absorbed form: Q absorbs W_uk, the KV-cache stores only c_kv and K_rope, attention is computed in latent space as a sum of the NoPE and RoPE score terms, and the W_uv projection happens after the attention weights have been multiplied with c_kv.
The split-score softmax is handled in a custom path; the usual Flash Attention kernels are not involved, which is fine because decode is not the phase whose wall-clock we are protecting.
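To make the sequence of steps concrete, here is a minimal single-head decode step written two ways, absorbed and expand-then-attend, and checked against each other. It is a toy sketch in plain Python, not the custom serving path: the weight layouts, the 1/sqrt(d_nope + d_rope) scale, and every name in it are assumptions made for the example.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def vec(n, rng):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

rng = random.Random(0)
T_kv, d_nope, d_rope, d_v, rank = 6, 4, 2, 3, 5   # toy sizes, one head

# Cache: one latent row and one RoPE row per token.
c_kv = [vec(rank, rng) for _ in range(T_kv)]
k_rope = [vec(d_rope, rng) for _ in range(T_kv)]
# Assumed layouts: w_uk[r][d] maps latent -> K_nope, w_uv[r][d] maps latent -> V.
w_uk = [vec(d_nope, rng) for _ in range(rank)]
w_uv = [vec(d_v, rng) for _ in range(rank)]
q_nope, q_rope = vec(d_nope, rng), vec(d_rope, rng)
scale = 1.0 / math.sqrt(d_nope + d_rope)

def decode_absorbed():
    # Absorb W_uk into the query once: q_abs[r] = sum_d q_nope[d] * w_uk[r][d].
    q_abs = [dot(q_nope, w_uk[r]) for r in range(rank)]
    # Split score: latent term plus RoPE term, summed before softmax.
    scores = [(dot(q_abs, c) + dot(q_rope, kr)) * scale for c, kr in zip(c_kv, k_rope)]
    p = softmax(scores)
    # Weighted sum in latent space, then a single W_uv projection.
    latent = [sum(p[t] * c_kv[t][r] for t in range(T_kv)) for r in range(rank)]
    return [sum(latent[r] * w_uv[r][d] for r in range(rank)) for d in range(d_v)]

def decode_expanded():
    # Reference path: materialise per-token K_nope and V from the latent.
    k_nope = [[sum(c[r] * w_uk[r][d] for r in range(rank)) for d in range(d_nope)] for c in c_kv]
    v = [[sum(c[r] * w_uv[r][d] for r in range(rank)) for d in range(d_v)] for c in c_kv]
    scores = [(dot(q_nope, kn) + dot(q_rope, kr)) * scale for kn, kr in zip(k_nope, k_rope)]
    p = softmax(scores)
    return [sum(p[t] * v[t][d] for t in range(T_kv)) for d in range(d_v)]

out_a, out_e = decode_absorbed(), decode_expanded()
max_err = max(abs(a - b) for a, b in zip(out_a, out_e))
print(max_err)
```

The absorbed version never builds a per-token K or V: it touches each cache row twice (once for the score, once for the weighted sum) and applies W_uv a single time at the end, which is the property the serving path relies on.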

We explicitly do not mix the two. The checkpoint format is the standard one: W_dkv, W_uk, W_uv, and the RoPE projections as distinct tensors. The inference loader reshapes W_uk into the per-head [H, d_nope, kv_lora_rank] layout and fuses it into the Q path once at model load; nothing about the trained weights changes. This has the useful side-effect that a single checkpoint can drive either path, and we can A/B the absorbed inference path against the naive inference path without retraining. That loader-time separation is also why shared MLA adapter boundaries matter: the seam is what keeps train-time MLA and decode-time MLA from leaking into each other's builder path.
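A load-time rearrangement of this kind might look like the sketch below. The on-disk layout is an assumption: the example pretends W_uk is stored as one (kv_lora_rank, H * d_nope) matrix, which the text does not specify, and both function names are hypothetical. Only the target per-head [H, d_nope, kv_lora_rank] layout comes from the description above.

```python
def to_per_head(w_uk, H, d_nope):
    """Rearrange an assumed (rank, H * d_nope) matrix into [H][d_nope][rank]."""
    rank = len(w_uk)
    assert all(len(row) == H * d_nope for row in w_uk)
    return [
        [[w_uk[r][h * d_nope + d] for r in range(rank)] for d in range(d_nope)]
        for h in range(H)
    ]

def absorb(q_nope, w_heads):
    """q_nope: [H][d_nope] for one token -> absorbed query [H][rank]."""
    return [
        [sum(q_h[d] * w_h[d][r] for d in range(len(q_h))) for r in range(len(w_h[0]))]
        for q_h, w_h in zip(q_nope, w_heads)
    ]

# Toy check with H=2, d_nope=2, rank=3 (made-up numbers).
w_uk = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]   # (rank, H * d_nope)
w_heads = to_per_head(w_uk, H=2, d_nope=2)
q_nope = [[1, 0], [0, 1]]
print(absorb(q_nope, w_heads))  # [[1, 5, 9], [4, 8, 12]]
```

Because the rearrangement is a pure re-indexing of the trained tensor, it can be done once at load and leaves the checkpoint itself untouched, which is what lets one checkpoint drive both paths.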

The parts that did not survive

A few variants that sounded plausible did not make it into the specialist stack.

We considered absorbing at training time for "consistency" with the inference path. Rejected on the 3.4x attention-core FLOP penalty, the activation-memory blow-up, and the Flash Attention incompatibility. There is no version of the cost curve where absorbed training is cheaper than expand-then-attend training on our sequence lengths.

We considered a custom Flash-Attention-shaped kernel that fuses the split-score softmax. This is tractable on paper (the two score products can share the same tiling schedule if the RoPE term is cached per KV block), but the engineering budget against the FLOP win is terrible: attention is a single-digit percentage of the production hybrid's compute breakdown, and an absorbed kernel that only covers the forward is useless without a matching backward, which no public kernel (FA3/FA4, CUTLASS hopper_fmha, TileLang sparse_mla_bwd) supports. FA3/FA4 backward is documented as "no plans, accuracy open problem" by upstream. For the specialists, writing a proprietary FP8-plus-split-score attention backward is not where the hours go.

We considered a quantised c_kv cache (FP8 e4m3fn) for decode. This is on the roadmap but not in production. The per-token scale granularity interacts with the RoPE term in a way we have not finished measuring, and the current FP8 attention backward literature is empty; the inference-only variant is doable but has to carry its own calibration pass.

What the analysis leaves us with

The MLA absorption question is the clearest example in our stack of a transformation that is simultaneously an optimisation and a de-optimisation depending on which axis you measure. Train with the standard path because FLOPs and Flash Attention memory dominate. Serve with the absorbed path because KV-cache dominates. Keep one checkpoint format and switch regimes at load time, not at training time. And never read a single-kernel microbenchmark as a claim about full-model throughput; we learned that lesson the expensive way on the FP8 Mamba scan, and this was the cheap version of the same mistake waiting to happen.
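The reason one checkpoint can serve both regimes is that the two paths are algebraically the same computation. The following is a minimal single-head sketch in plain Python, with toy sizes and made-up names (`w_uk`, `w_uv`, `expand_then_attend`, `absorbed_decode` are illustrative, not the production module's identifiers). It checks that attending over c_kv with an absorbed query gives the same output as expanding K and V first.

```python
import math
import random


def matmul(a, b):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical, chosen only to keep the demo small).
T, D_LATENT, D_NOPE, D_ROPE, D_V = 5, 8, 4, 2, 4

rng = random.Random(0)
c_kv = rand(T, D_LATENT, rng)       # cached latent, one row per past token
k_rope = rand(T, D_ROPE, rng)       # cached positional key lane
w_uk = rand(D_LATENT, D_NOPE, rng)  # K up-projection: k_nope = c_kv @ w_uk
w_uv = rand(D_LATENT, D_V, rng)     # V up-projection: v = c_kv @ w_uv
q_nope = rand(1, D_NOPE, rng)       # content part of one decode query
q_rope = rand(1, D_ROPE, rng)       # positional part of the same query
scale = 1.0 / math.sqrt(D_NOPE + D_ROPE)


def expand_then_attend():
    """Training-style path: expand K and V from c_kv, then attend."""
    k_nope = matmul(c_kv, w_uk)
    v = matmul(c_kv, w_uv)
    nope = matmul(q_nope, transpose(k_nope))[0]
    rope = matmul(q_rope, transpose(k_rope))[0]
    probs = softmax([(a + b) * scale for a, b in zip(nope, rope)])
    return matmul([probs], v)[0]


def absorbed_decode():
    """Decode-style path: fold w_uk into the query, attend over c_kv directly,
    and apply the value up-projection after the weighted sum."""
    q_abs = matmul(q_nope, transpose(w_uk))      # 1 x D_LATENT
    nope = matmul(q_abs, transpose(c_kv))[0]     # latent NoPE scores
    rope = matmul(q_rope, transpose(k_rope))[0]  # RoPE scores, unchanged
    probs = softmax([(a + b) * scale for a, b in zip(nope, rope)])
    latent_out = matmul([probs], c_kv)           # 1 x D_LATENT
    return matmul(latent_out, w_uv)[0]


out_standard = expand_then_attend()
out_absorbed = absorbed_decode()
max_diff = max(abs(a - b) for a, b in zip(out_standard, out_absorbed))
print(f"max abs difference: {max_diff:.2e}")
```

In this sketch the fold of `w_uk` into the query happens on every call to keep the algebra visible; a serving loader would presumably do that reshaping once at model load, which is the point of keeping the transform at the loader boundary.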

For the C++ specialists specifically, the combination of MLA training + absorbed MLA decode + the v4 context-graph sampler is what makes 64K repo-level reasoning affordable at inference. The decode KV-cache is small enough that one specialist can hold the full context of a realistic translation unit plus its headers and call graph on a single GPU; the training path is fast enough that we can retrain any of the eight specialists in a week of calendar time per checkpoint. Neither property holds under the naive combination of the two.
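To make the cache-size side concrete, here is a back-of-the-envelope sketch comparing a latent cache (c_kv plus the K_rope lane) with a dense per-head K/V cache at 64K tokens. Every number below is a hypothetical assumption for illustration: the head dimensions follow DeepSeek-V3-style values, while the head count, the number of attention layers, and the BF16 cache dtype are invented and do not describe the specialists' actual configuration.

```python
# All values are illustrative assumptions, not the specialists' real config.
KV_LORA_RANK = 512     # latent width (DeepSeek-V3-style)
ROPE_DIM = 64          # positional key lane width
K_HEAD_DIM = 192       # dense per-head key width (128 NoPE + 64 RoPE)
V_HEAD_DIM = 128       # dense per-head value width
N_HEADS = 16           # hypothetical head count
N_ATTN_LAYERS = 12     # hypothetical number of attention layers in the hybrid
BYTES_PER_ELEM = 2     # BF16 cache, assumed
CONTEXT_TOKENS = 64 * 1024


def mla_row_bytes(kv_lora_rank: int, rope_dim: int, bytes_per_elem: int) -> int:
    """Bytes cached per token per layer: one c_kv row plus one K_rope row."""
    return (kv_lora_rank + rope_dim) * bytes_per_elem


def dense_row_bytes(n_heads: int, k_dim: int, v_dim: int, bytes_per_elem: int) -> int:
    """Bytes cached per token per layer when every head stores full K and V."""
    return n_heads * (k_dim + v_dim) * bytes_per_elem


mla_row = mla_row_bytes(KV_LORA_RANK, ROPE_DIM, BYTES_PER_ELEM)
dense_row = dense_row_bytes(N_HEADS, K_HEAD_DIM, V_HEAD_DIM, BYTES_PER_ELEM)

total_mla = mla_row * N_ATTN_LAYERS * CONTEXT_TOKENS
total_dense = dense_row * N_ATTN_LAYERS * CONTEXT_TOKENS

gib = 1024 ** 3
print(f"per token per layer: MLA {mla_row} B, dense {dense_row} B")
print(f"64K context: MLA {total_mla / gib:.2f} GiB, dense {total_dense / gib:.2f} GiB")
```

The ratio between the two rows scales with the head count, so the saving grows with wider attention; the absolute totals depend entirely on the assumed layer count and dtype.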

FAQ

Frequently asked questions

Is recompute_kv_upproj already a form of weight absorption?
No. recompute_kv_upproj keeps the standard dense MLA attention operator and only regenerates K and V from c_kv during backward to save activation memory. Weight absorption changes the score algebra itself so decode can attend over c_kv directly.
Could FlexAttention's score_mod express the absorbed MLA score?
Not cleanly. FlexAttention's public contract still starts from a normal query, key, value call and applies score_mod after the scalar QK score already exists. That is useful for masks, biases, and soft caps, but absorbed MLA needs a second pre-softmax product, (Q' @ c_kv^T), summed with the RoPE product before softmax, plus the value-side projection after attention over c_kv. That is why the article treats absorbed MLA as a custom decode path rather than a drop-in FlexAttention training route.
Why is weight absorption a load-time serving transform instead of a training-format change?
Because the checkpoint can stay in the standard MLA form while the serving loader reshapes W_uk into the absorbed Q path once at model load. That keeps training on the cheaper expand-and-attend path and still lets decode switch to the smaller c_kv plus K_rope cache without retraining or forking the checkpoint format. The boundary is the same one described in Public MLA integration patterns for Megatron and Shared MLA adapter boundaries.
Does absorbed MLA have a model-load cost?
Yes. Absorption is not a checkpoint rewrite, but serving still has to reshape and fuse the up-projection into the decode path before the first request. That one-time transform can create temporary allocation pressure, especially if the serving stack also calibrates a quantized c_kv cache. We keep it at the loader boundary so training checkpoints remain standard and load-time cost is measured separately from decode throughput.
If decode later quantizes the MLA cache, what is the safe boundary?
The quantization target is the semantic latent cache, c_kv, not a collapsed "everything in the KV row" blob. K_rope remains a separate positional lane because the absorbed decode score is still the sum of a latent NoPE product and a RoPE product. That is why KV cache and paged attention describes the cache as two rows per token, and why the local shared MLA adapter sample keeps train-time MLA and decode-time MLA separated at the adapter boundary.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

Pallas

JAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.

Splash

The stable TPU attention family used for dense or local-mask lanes before MegaCpp drops to narrower planner-driven sparse contracts.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

CUTLASS

NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.

Paged attention

The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
