MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 12 min read · David Gornshtein
MLA
Triton
H200
Blackwell
RoPE
KV Cache
Fused Kernels

Fused MLA on Hopper and Blackwell: projection, RoPE, and the KV cache that ships

The NVIDIA side of Multi-Latent Attention in the MegaCpp ensemble: a fused down-norm-up projection, a fused split-RoPE-concat Triton kernel, a compressed KV cache, and how it all lands on Megatron-Core.


The cross-path MLA story (which parts of DeepSeek-V3 we trained with, why weight absorption is the wrong choice for the training path, what survived into inference) is already covered in our post "MLA weight absorption: what we kept and what we dropped for the C++ specialists". This one is narrower: how MLA runs on NVIDIA once you care about kernels. Fused down-projection / RMSNorm / up-projection. Fused split / RoPE / concat in one Triton kernel. A compressed KV cache that stores c_kv plus a single RoPE-applied key row instead of full K/V. And the Blackwell tensor-core path through Transformer Engine. Everything here is the part of MLA that only matters if you are pointing it at H200 or GB10.

The adapter seam that makes this manageable is described separately in Shared MLA adapter boundaries and the narrower Public MLA integration patterns for Megatron.

If you want the checked-in proof surfaces before the full kernel story, the minimal companion files are Fused MLA projection sample for FusedDownNormUp, MLA integration pattern sample plus MLA shared adapter sample for the adapter seam, and Sparse MLA FP8 dispatch sample for the FP8 wrapper-dispatch boundary that shows up again once MLA runs through Transformer Engine.

The shortest checked-in reading path is MegaCpp model wiring examples, then MLA shared adapter sample, MLA integration pattern sample, Fused MLA projection sample, and finally Sparse MLA FP8 dispatch sample. That order keeps the compatibility seam visible before the kernel details.

For first touch, keep four MLA terms separate:

Those names connect this kernel article back to MLA weight absorption for the latent/cache boundary and forward to Sparse MLA dimension generalization for the shape vocabulary that reappears as MLA dimensions get generalized.

When the article later mentions GB10, read sm_100a and sm_121a as target labels, not as generic product branding. sm_100a is the datacenter Blackwell lane, while sm_121a is the GB10 consumer lane. That distinction matters for MLA because the Transformer Engine path can be shared at the framework level while still carrying different tensor-core and numeric guardrails underneath.

Why MegaCpp cares about this

At specialist scale the attention block is still the compute cost we optimise hardest, and MLA's compressed latent makes it a numerically clean place to fuse things. The training path is unambiguous: expand, Flash Attention, done. Where kernels start to matter is between the projection, the norm, the split into nope / rope, the rotary application, and the recombine. On H200 that sequence without fusion is five kernel launches per attention per layer per microbatch step; on a depth-56 hybrid preset that becomes the kind of cost we do not want to pay twice. The other question - whether MLA pays off for the specialist SLM sizes at all - is what the compressed KV cache answers: a roughly 4x smaller KV cache at 4-bit quantization relative to bf16 is a deployment unlock for our on-device inference targets.

What we built in the public MegaCpp MLA path

The baseline MLA training path is the reference here. It takes x, does the down-projection (either separate w_dq + w_dkv or a fused w_dqkv), RMSNorm on the latent, up-projection through w_uq and w_ukv, reshape to (B, T, H, qk_nope_head_dim + qk_rope_head_dim) on the Q side, split the KV side into low_rank_main (kv_lora_rank) and low_rank_rope (qk_rope_head_dim), apply partial RoPE only to the rope portion, concatenate back, and hand the lot to Flash Attention. recompute_kv_upproj=True is the default training knob: backward recomputes the large H * (d_nope + d_v) tensor from the small kv_lora_rank latent. recompute_q_upproj is optional and wraps norm / up-project / split / RoPE / concat in one cp.checkpoint so inductor has the right adjacency to fuse RMSNorm and matmul into a single Triton kernel on the compile path. We explicitly do not use the weight-absorbed training path; the rationale is spelled out in the dedicated post, and the one-line summary is that the FLOP increase and the loss of Flash Attention compatibility dominate the KV-memory win that only applies at decode.
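A quick element count shows why recomputing the KV up-projection is an attractive default. The sketch below compares the expanded H * (d_nope + d_v) tensor with the kv_lora_rank latent, per token and per layer. The head count and per-head dims are illustrative assumptions, not MegaCpp preset values; kv_lora_rank=512 is the figure quoted later in this article.

```python
# Back-of-envelope activation count for recompute_kv_upproj.
# ASSUMPTION: n_heads, d_nope and d_v are illustrative, not MegaCpp preset values.
def kv_activation_elements(n_heads: int, d_nope: int, d_v: int, kv_lora_rank: int):
    """Return (expanded, latent) element counts per token per layer."""
    expanded = n_heads * (d_nope + d_v)  # tensor produced by the KV up-projection
    latent = kv_lora_rank                # tensor kept when the up-projection is recomputed
    return expanded, latent


expanded, latent = kv_activation_elements(
    n_heads=16, d_nope=128, d_v=128, kv_lora_rank=512
)
ratio = expanded / latent
print(f"expanded={expanded} latent={latent} ratio={ratio:.1f}x")
```

Under these assumed shapes the saved activation shrinks by the printed ratio, at the price of one extra up-projection matmul in backward.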

The fused projection building block centers on FusedDownNormUp. It is a standalone autograd.Function that fuses y = W_ukv @ RMSNorm(W_dkv @ x), saving (x, rrms, w_dkv, w_ukv, rms_weight) for backward and recomputing latent = W_dkv @ x and normed = latent * rrms during the backward pass. The point is memory: the large kv_lora_rank-sized latent is not held across the block, only the tiny per-token rrms scalar is. Backward reassembles the full RMSNorm gradient formula directly (grad_latent = rrms * (grad_normed - latent * rrms^2 * dot / d)) so we do not pay for an autograd.grad pass through F.rms_norm. An important caveat: this is a standalone building block and an experimental unit-test target. The live MultiLatentAttention.forward path in the baseline training lane still uses the separate down / norm / up chain, and the mla_fused_down_proj flag in the config enables only a single-matmul fused down-projection (w_dqkv), not the full down+norm+up fusion. Megatron-Core has the same seam on a different axis with its LayerNormColumnParallelLinear; on our path today inductor fuses RMSNorm + matmul into a single Triton kernel (triton_tem_fused__fused_rms_norm_mm_t) when we get the adjacency right in the checkpoint wrapper, which is the reason _q_norm_upproj_and_rope is laid out the way it is. The compact checked-in sketch of that building block is Fused MLA projection sample.
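The closed-form RMSNorm gradient quoted above can be checked against finite differences without any framework. This is a minimal sketch, not the FusedDownNormUp code: it assumes `dot` means the per-token sum of grad_normed * latent, that grad_normed is the gradient with respect to the normalized value before the learned weight is applied, and that rrms = 1 / sqrt(mean(latent^2) + eps). The helper names are hypothetical.

```python
import math


def rms_forward(latent, eps=1e-6):
    """Return (normed, rrms) with rrms = 1 / sqrt(mean(latent^2) + eps)."""
    d = len(latent)
    rrms = 1.0 / math.sqrt(sum(v * v for v in latent) / d + eps)
    return [v * rrms for v in latent], rrms


def rms_backward(latent, rrms, grad_normed):
    """Closed form: rrms * (grad_normed - latent * rrms^2 * dot / d)."""
    d = len(latent)
    dot = sum(g * v for g, v in zip(grad_normed, latent))  # assumed meaning of `dot`
    return [rrms * (g - v * rrms * rrms * dot / d) for g, v in zip(grad_normed, latent)]


def numeric_grad(latent, grad_normed, h=1e-6):
    """Central finite differences of loss = sum(grad_normed * normed)."""
    def loss(x):
        normed, _ = rms_forward(x)
        return sum(g * n for g, n in zip(grad_normed, normed))

    grads = []
    for i in range(len(latent)):
        up = list(latent)
        up[i] += h
        down = list(latent)
        down[i] -= h
        grads.append((loss(up) - loss(down)) / (2 * h))
    return grads


latent = [0.5, -1.25, 2.0, 0.75]
grad_normed = [0.1, -0.2, 0.3, 0.4]
_, rrms = rms_forward(latent)  # the only per-token value that needs saving
analytic = rms_backward(latent, rrms, grad_normed)
numeric = numeric_grad(latent, grad_normed)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print("max abs error:", max_err)
```

Only `rrms` and the inputs are needed to rebuild the gradient, which is the memory argument made above: the latent itself is recomputed, not stored.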

Fused RoPE lives in two Triton kernel surfaces: the MLA-specific path and the generic dense-attention path. The MLA version is the hotter one. The reason MLA needs its own fused RoPE kernel is the partial-RoPE layout: only rope_dim = qk_rope_head_dim // 2 * 2 of each head gets rotated, the rest pass through. A naive PyTorch path allocates three intermediate tensors per application (q_nope, q_rope, rotated q_rope) and torch.cats them back. The fused Triton RoPE kernels - Q-side forward and backward, KV-side forward and backward, plus the packed thd layout variant - do the split, the rotation, and the concat in a single grid launch. Autotune covers BLOCK_H in {1, 2, 4, 8, 16, 32, 64, 128} keyed by emb_dim and head_num. The kernel operates in place on Q for forward and on DO (grad output) for backward, and the backward formula is the transpose of the forward rotation: dx1 = dy1*cos - dy2*sin, dx2 = dy1*sin + dy2*cos. Packed thd mode uses cu_seqlens to walk the sequence boundaries per token and find the right cos/sin row. On the bthd path the token index is just pid_m % seq_len.
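The split / rotate / concat step and its transpose backward can be written out for a single head in plain Python. This is a reference-style sketch, not the Triton kernel: it works out of place on lists, assumes the nope slice comes first and the rope slice last, and uses the split-half convention y1 = x1*cos + x2*sin, y2 = -x1*sin + x2*cos. The function names and the base of 10000 are illustrative assumptions.

```python
import math


def rope_forward(head, nope_dim, cos, sin):
    """Rotate only the trailing rope slice; the nope slice passes through."""
    rope = head[nope_dim:]
    half = len(rope) // 2
    x1, x2 = rope[:half], rope[half:]
    y1 = [a * c + b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    y2 = [-a * s + b * c for a, b, c, s in zip(x1, x2, cos, sin)]
    return head[:nope_dim] + y1 + y2


def rope_backward(grad_out, nope_dim, cos, sin):
    """Transpose of the forward rotation applied to the incoming gradient."""
    rope = grad_out[nope_dim:]
    half = len(rope) // 2
    dy1, dy2 = rope[:half], rope[half:]
    dx1 = [a * c - b * s for a, b, c, s in zip(dy1, dy2, cos, sin)]
    dx2 = [a * s + b * c for a, b, c, s in zip(dy1, dy2, cos, sin)]
    return grad_out[:nope_dim] + dx1 + dx2


nope_dim, rope_dim, position = 3, 4, 3
half = rope_dim // 2
angles = [position * 10000.0 ** (-2.0 * i / rope_dim) for i in range(half)]
cos = [math.cos(a) for a in angles]
sin = [math.sin(a) for a in angles]

x = [1.0, 2.0, 3.0, 0.5, -1.0, 2.0, 4.0]
dy = [0.3, -0.1, 0.2, 1.5, 0.25, -2.0, 0.75]
y = rope_forward(x, nope_dim, cos, sin)

# Adjoint check: <dy, R x> must equal <R^T dy, x>.
lhs = sum(a * b for a, b in zip(dy, y))
rhs = sum(a * b for a, b in zip(rope_backward(dy, nope_dim, cos, sin), x))
# A rotation is orthogonal, so the transpose also undoes the forward pass.
roundtrip = rope_backward(y, nope_dim, cos, sin)
print(abs(lhs - rhs), max(abs(a - b) for a, b in zip(roundtrip, x)))
```

The adjoint check is the property a backward kernel has to satisfy; the round trip is a convenient extra test that only works because the rotation is orthogonal.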

The generic (non-MLA) fused RoPE for Q and K lives in a separate Triton kernel from the MLA-specific one because the layout is different: it rotates the full head dimension, not a rope subportion, and it groups ROPE_GROUP_SIZE=4 heads per thread block so cos/sin loads are shared. With GQA it specializes on n_heads_q and n_heads_k as tl.constexpr and skips the K write when the head index exceeds n_heads_k. The split-half convention is the same as the MLA kernel (y1 = x1*cos + x2*sin, y2 = -x1*sin + x2*cos), which means we maintain one RoPE numerical contract across both kernels and the reference partial-rotary implementation. The long-context piecewise and YaRN RoPE variants are precomputation-only and feed both the fused and reference paths.
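The head grouping and the GQA write skip can be pictured as a per-block write plan. The sketch below is a guess at the shape of that logic, not the kernel itself: it assumes one block per group of ROPE_GROUP_SIZE consecutive head indices, and it reads "exceeds n_heads_k" as "head index is not a valid K head" (h >= n_heads_k). `plan_writes` is a hypothetical helper.

```python
ROPE_GROUP_SIZE = 4  # heads handled per thread block, as quoted in the text


def plan_writes(n_heads_q, n_heads_k, group_size=ROPE_GROUP_SIZE):
    """For each head group, list the Q and K head indices one block would write.

    ASSUMPTION: blocks cover consecutive head indices up to n_heads_q, and the
    K write is skipped for indices that are not valid K heads.
    """
    n_groups = (n_heads_q + group_size - 1) // group_size
    plan = []
    for group in range(n_groups):
        start = group * group_size
        heads = range(start, min(start + group_size, n_heads_q))
        q_writes = list(heads)
        k_writes = [h for h in heads if h < n_heads_k]  # GQA: fewer K heads than Q heads
        plan.append((group, q_writes, k_writes))
    return plan


plan = plan_writes(n_heads_q=8, n_heads_k=2)
for group, q_writes, k_writes in plan:
    print(f"block {group}: Q heads {q_writes}, K heads {k_writes}")
```

Every head in a group shares the same token position, so one cos/sin load serves the whole group; that is the saving the grouping is after.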

Weight absorption specific to the fused NVIDIA path is subtle. The classic DeepSeek inference path absorbs W_uk into Q and W_uv into the output projection, leaving attention to operate in kv_lora_rank dims. On NVIDIA the interesting thing is that Megatron now has an absorbed-weights MLA variant, and the public MLA shared-runtime sample has explicit adapter hooks for it. We do not route training through the absorbed variant; we do allow it for inference spec selection when the decode shape is the dominant workload, because it changes the KV cache from storing K and V expansions to storing the kv_lora_rank latent plus the single rope'd key row. The key algebraic identity is simple (with c_kv stored as rows and K_nope = c_kv @ W_uk^T, Q_nope @ K_nope^T == (Q_nope @ W_uk) @ c_kv^T), but the kernel consequence is that FA3/FA4 cannot fuse the absorbed path, and the attention dot products move into kv_lora_rank=512 dims rather than qk_head_dim=192, which is why it is strictly a decode-side choice.
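The absorption identity is easy to confirm numerically on toy matrices. This sketch assumes a row-major layout: Q_nope is (T_q, d_nope), c_kv is (T_kv, rank), and W_uk is (d_nope, rank) so that K_nope = c_kv @ W_uk^T. Those shape conventions and the tiny sizes are assumptions made for illustration; only the nope part of the score is covered.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Toy sizes: T_q=2, T_kv=4, d_nope=3, rank=2 (illustrative only).
q_nope = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0]]              # (T_q, d_nope)
w_uk = [[1.0, 0.0], [2.0, -1.0], [0.5, 4.0]]              # (d_nope, rank)
c_kv = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [2.0, 2.0]]  # (T_kv, rank)

# Expanded path: materialise K_nope, then score in d_nope dims.
k_nope = matmul(c_kv, transpose(w_uk))                    # (T_kv, d_nope)
scores_expanded = matmul(q_nope, transpose(k_nope))       # (T_q, T_kv)

# Absorbed path: fold W_uk into Q, then score directly against the latent.
q_absorbed = matmul(q_nope, w_uk)                         # (T_q, rank)
scores_absorbed = matmul(q_absorbed, transpose(c_kv))     # (T_q, T_kv)

max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(scores_expanded, scores_absorbed)
    for a, b in zip(row_a, row_b)
)
print("max score difference:", max_diff)
```

In the toy case the absorbed dot product is shorter, but with the real figures quoted above (rank 512 against a 192-wide head) it is longer, which is the FLOP cost the paragraph refers to.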

The training-shaped decode path keeps an MLAKVCache with two buffers per layer: low_rank_main of shape (B, T_max, kv_lora_rank) and rope_caches of shape (B, T_max, 1, qk_rope_head_dim). The RoPE key is cached once per token, broadcast across the H query heads at read time. update enforces uniform cache_seqlens across batch (because T_new is a single position advance per update call), writes at self.cache_seqlens[0], and advance bumps the counter after all layers have updated. That last invariant - advance the counter once after the full layer stack, not per layer - is the thing that tripped the earliest bring-up, because per-layer advance would have silently written into the same cache slot N times. The serving-side memory angle is the same one discussed in KV Cache and Paged Attention for the MegaCpp Specialist Ensemble.
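The update / advance contract is easier to see in a toy cache. The class below is a pure-Python stand-in, not the real MLAKVCache: buffers are nested lists, the singleton head axis of rope_caches is dropped, and the class name and constructor arguments are hypothetical. It only models the invariants described above: uniform cache_seqlens, writes at cache_seqlens[0], and one advance per decode step.

```python
class ToyMLAKVCache:
    """Toy stand-in for the cache described above; names and layout are illustrative."""

    def __init__(self, n_layers, batch, t_max, kv_lora_rank, rope_dim):
        def zeros(width):
            return [[[[0.0] * width for _ in range(t_max)]
                     for _ in range(batch)] for _ in range(n_layers)]

        self.low_rank_main = zeros(kv_lora_rank)  # per layer: (B, T_max, kv_lora_rank)
        self.rope_caches = zeros(rope_dim)        # per layer: (B, T_max, rope_dim)
        self.cache_seqlens = [0] * batch

    def update(self, layer, c_kv_new, k_rope_new):
        """Write one new position for every batch row of a single layer."""
        if len(set(self.cache_seqlens)) != 1:
            raise ValueError("cache_seqlens must be uniform across the batch")
        pos = self.cache_seqlens[0]
        for b, (c_row, k_row) in enumerate(zip(c_kv_new, k_rope_new)):
            self.low_rank_main[layer][b][pos] = list(c_row)
            self.rope_caches[layer][b][pos] = list(k_row)

    def advance(self):
        """Bump the shared counter once, after every layer has called update."""
        self.cache_seqlens = [n + 1 for n in self.cache_seqlens]


cache = ToyMLAKVCache(n_layers=3, batch=2, t_max=4, kv_lora_rank=2, rope_dim=2)
for step in range(2):
    for layer in range(3):
        value = float(10 * step + layer)
        cache.update(layer, [[value, value]] * 2, [[-value, -value]] * 2)
    cache.advance()  # once per decode step, not once per layer

print(cache.cache_seqlens, cache.low_rank_main[2][0][:2])
```

Because update never touches the counter, every layer in a step writes to the same position in its own buffer, and the position only moves when advance is called.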

The quantized variant downstream of the compressed cache uses PolarQuant and TurboQuant. Both reach roughly 3.8x compression at 4 bits versus bf16 with minimal quality loss; TurboQuant (random orthogonal rotation plus per-coordinate scalar quantization) is the successor we prefer, and PolarQuant is retained for analytical guarantees we still rely on for one specialist.
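The gap between the ideal 4x and the quoted 3.8x can be turned into an effective bit budget. The arithmetic below only restates the numbers in the text; reading the remainder as per-block metadata such as scales is an assumption, since the article does not say where the overhead goes.

```python
def effective_bits(baseline_bits: float, compression: float) -> float:
    """Bits stored per value implied by a compression ratio against a baseline width."""
    return baseline_bits / compression


ideal = 16 / 4                  # ratio if only the 4-bit codes were stored
bits = effective_bits(16, 3.8)  # implied storage per value at the quoted 3.8x
overhead = bits - 4             # ASSUMPTION: attributed to metadata such as scales
print(f"ideal={ideal:.1f}x bits/value={bits:.2f} overhead={overhead:.2f} bits")
```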

How it lands in MegaCpp

The public MLA shared-runtime sample is an adapter surface. One adapter wraps the standard MLASelfAttention, one wraps FusedMLASelfAttention (the Transformer Engine tensor-core path), and one wraps the upstream absorbed-weights variant when it is available. The adapters exist for one reason: to pass a pipeline-layer offset to the underlying class when the upstream constructor supports it, so pipeline-parallel stage placement lines up without forking the MLA implementation. The adaptation step rewrites the GPT layer spec in place instead of rebuilding the whole object graph.
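One way to pass an argument only when the upstream constructor supports it is to inspect the signature at build time. The sketch below shows that pattern with hypothetical stand-in classes; the keyword name `pp_layer_offset` and both class names are assumptions for illustration, and the checked-in adapter may detect support differently.

```python
import inspect


def build_with_optional_offset(cls, *args, pp_layer_offset=None, **kwargs):
    """Instantiate cls, forwarding pp_layer_offset only if its __init__ declares it."""
    params = inspect.signature(cls.__init__).parameters
    if pp_layer_offset is not None and "pp_layer_offset" in params:
        kwargs["pp_layer_offset"] = pp_layer_offset
    return cls(*args, **kwargs)


class OldUpstreamAttention:
    """Hypothetical upstream class that predates the offset argument."""

    def __init__(self, layer_number):
        self.layer_number = layer_number
        self.pp_layer_offset = None


class NewUpstreamAttention:
    """Hypothetical upstream class that accepts the offset."""

    def __init__(self, layer_number, pp_layer_offset=0):
        self.layer_number = layer_number
        self.pp_layer_offset = pp_layer_offset


old = build_with_optional_offset(OldUpstreamAttention, 3, pp_layer_offset=8)
new = build_with_optional_offset(NewUpstreamAttention, 3, pp_layer_offset=8)
print(old.pp_layer_offset, new.pp_layer_offset)
```

The adapter stays a thin wrapper this way: it never reimplements attention, it only decides which keyword arguments the wrapped class can receive.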

The checked-in adapter sample is intentionally narrow about that job. It handles pipeline-layer offsets and rotary normalization, and its declared scope is MLA-specific compatibility only. That is a useful boundary because it keeps upstream attention-builder drift separate from MLA-specific wiring bugs instead of turning the adapter into a second attention implementation.

The attention layer spec is built by asking the current transformer implementation for the right attention submodules and then threading the MLA-relevant options through unchanged. On the Transformer Engine path that includes QK layernorm, multi-latent attention, QK L2 norm, TE op fusion, Kitchen integration, TE activation selection, and the MLA down-projection fusion flag. The point is that every knob that matters for MLA fusion is a passthrough, not a MegaCpp reinvention; we own the adaptation seams, Megatron owns the module implementation details.

The hybrid MLA / DSA interleave is a MegaCpp-specific layout. The deep-hybrid full spec selects MLA for some attention ranks and DSA for others; MTP (multi-token-predict) layers always get MLA, never DSA, because the MTP head needs dense attention semantics the sparse path does not provide. Each branch builds its own attention spec through the same shared builder logic.
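The selection rule can be sketched as a small function over layer kinds. The article does not state the actual interleave pattern, so the set of DSA positions below is made up, and `select_attention` with its string labels is a hypothetical helper. The only rule taken from the text is that MTP layers always resolve to MLA.

```python
def select_attention(layer_kinds, dsa_layers):
    """Pick an attention flavour per layer.

    layer_kinds: list of "decoder" or "mtp" entries, one per layer.
    dsa_layers: set of layer indices requested as DSA (illustrative pattern).
    """
    spec = []
    for idx, kind in enumerate(layer_kinds):
        if kind == "mtp":
            spec.append("mla")  # MTP layers never take the sparse path
        elif idx in dsa_layers:
            spec.append("dsa")
        else:
            spec.append("mla")
    return spec


layer_kinds = ["decoder"] * 4 + ["mtp"]
# Index 4 is requested as DSA on purpose to show the MTP override.
spec = select_attention(layer_kinds, dsa_layers={1, 3, 4})
print(spec)
```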

Blackwell tensor-core path: H200 and GB10 both route MLA through Transformer Engine. H200 gets the full FusedMLASelfAttention bf16/fp8 path; GB10 pins disable_rht=True because the Random Hadamard Transform is not stable on that target, which shifts MLA QK numerics enough that GB10 results are tracked separately from H200 results. That sm_100a / sm_121a split is the same consumer-vs-datacenter boundary described in What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths and The FA4 Catalog on Blackwell: shared branding does not imply one tensor-core or memory contract.

What lands as-is: the MLA module, the fused MLA RoPE Triton kernels, the MLA KV-cache path, and the PolarQuant and TurboQuant wrappers. What gets lifted but guarded: the standalone fused down-norm-up path, still experimental. What moves to Megatron: the attention layer spec, the distributed-optimizer integration, TP/PP/SP wiring, and the FP8 communication layer. The optimizer-side shard and parity assumptions behind that handoff are easier to read next to Distributed Optimizer Stress. The weight-absorbed training path is intentionally dropped.

Design choices that survived validation

The fused MLA RoPE kernels landed as part of a larger Megatron-optimization wave (alongside fused Mamba conv, PP with parameter-count-weighted stage partitioning, TP all-reduce overlap, sequence parallelism for norm/dropout, ZeRO-1 distributed optimizer, FP8 comm, and EP load balancing). MLA-specific bugs we fixed on the way: latent-dim mismatch when resuming from non-MLA checkpoints, TP all-reduce placement for MLA's asymmetric Q/KV projections, FlexAttention score_mod ignoring per-head RoPE frequencies, and gradient checkpointing interacting with the in-place fused MLA RoPE kernel. That last one is the same recompute-boundary problem described more generally in Activation checkpointing deep dive. We also tried weight-absorbed training: it lost on FLOPs (attention dot products move from qk_head_dim=192 to kv_lora_rank=512) and on Flash Attention compatibility (absorbed attention has to compute and sum nope/rope score components separately, which breaks every fused FA kernel we use). Autotune for fused_mla_rope lands at BLOCK_H=16 or 32 on H200 depending on emb_dim; on GB10 the kernel is bandwidth-bound and autotune prefers a larger block to amortise loads.
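The FLOPs half of the weight-absorption argument above is simple arithmetic. The sketch below counts only the multiply-accumulates in the QK score dot product, using the two widths quoted in the text. The sequence length and head count are illustrative assumptions, and the function name is hypothetical. It also ignores the separate RoPE score component that the absorbed path has to add on top.

```python
def qk_score_macs(seq_len: int, num_heads: int, dot_width: int, causal: bool = True) -> int:
    """Multiply-accumulates for the QK^T score matrix of one layer (illustrative)."""
    # Causal attention scores only the lower triangle, diagonal included.
    pairs = seq_len * (seq_len + 1) // 2 if causal else seq_len * seq_len
    return pairs * num_heads * dot_width


SEQ_LEN, NUM_HEADS = 4096, 16          # assumed for illustration, not from the article
QK_HEAD_DIM, KV_LORA_RANK = 192, 512   # widths quoted in the text

standard = qk_score_macs(SEQ_LEN, NUM_HEADS, QK_HEAD_DIM)
absorbed = qk_score_macs(SEQ_LEN, NUM_HEADS, KV_LORA_RANK)
ratio = absorbed / standard
print(f"absorbed / standard score MACs = {ratio:.2f}x")
```

The ratio depends only on the two dot-product widths, so it does not change with the assumed sequence length or head count.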

Production checklist

What MLA fuses on NVIDIA

| Stage | Naive launches per attention | Fused path | Memory note |
| --- | --- | --- | --- |
| Down + RMSNorm + Up | 3 GEMMs + 1 norm | FusedDownNormUp autograd.Function | only rrms saved, latent recomputed in bwd |
| Q-side split + RoPE + concat | 3 ops + cat | MLA-specific Triton kernel | in-place on Q, autotune over BLOCK_H |
| KV-side rope row + concat | 2 ops + cat | KV-side fused kernel | one RoPE'd key row stored, not full K |
| KV cache (decode) | full K, V | compressed c_kv + 1 RoPE row | ~4x smaller at 4-bit |
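To make the last row concrete, here is a back-of-the-envelope sketch of per-token, per-layer cache size. kv_lora_rank=512 comes from this article. qk_rope_head_dim=64 is an assumption (it matches the DeepSeek-V3 configuration, but the text here does not state it), and the helper name is hypothetical. The sketch reads the ~4x as 16-bit versus 4-bit storage of the same compressed entries. It ignores quantization scale metadata, so a real ratio would land somewhat under 4x.

```python
def cache_bytes_per_token(kv_lora_rank: int, qk_rope_head_dim: int, bits: int) -> float:
    """Bytes for one token's compressed entry: c_kv plus one RoPE'd key row."""
    elems = kv_lora_rank + qk_rope_head_dim
    return elems * bits / 8


KV_LORA_RANK = 512       # from the text
QK_ROPE_HEAD_DIM = 64    # assumption, not stated in this article

bf16_bytes = cache_bytes_per_token(KV_LORA_RANK, QK_ROPE_HEAD_DIM, bits=16)
int4_bytes = cache_bytes_per_token(KV_LORA_RANK, QK_ROPE_HEAD_DIM, bits=4)
print(bf16_bytes, int4_bytes, bf16_bytes / int4_bytes)
```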

The fused projection's autograd contract:

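import torch

class FusedDownNormUp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w_dkv, rms_weight, w_ukv, eps):
        # Down-projection into the latent space.
        latent = x @ w_dkv.T
        # Reciprocal RMS over the latent dimension.
        rrms = torch.rsqrt(latent.pow(2).mean(-1, keepdim=True) + eps)
        normed = latent * rrms * rms_weight
        # Up-projection out of the latent space.
        y = normed @ w_ukv.T
        # Save rrms but not latent or normed; backward recomputes them from x.
        ctx.save_for_backward(x, rrms, w_dkv, w_ukv, rms_weight)
        return y
    # backward is not shown in this excerpt.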
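The backward half of that contract is not shown above. The sketch below is one way the recompute could look, written in plain Python for a single token so it runs without PyTorch or a GPU. The helper names are hypothetical, weight gradients are left out for brevity, and the result is checked against a finite difference. It is not the shipped implementation.

```python
import math


def matvec(m, v):
    """m @ v for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matvec_t(m, v):
    """m.T @ v for a list-of-rows matrix."""
    return [sum(m[i][j] * v[i] for i in range(len(m))) for j in range(len(m[0]))]


def forward(x, w_dkv, rms_weight, w_ukv, eps=1e-6):
    latent = matvec(w_dkv, x)
    rrms = 1.0 / math.sqrt(sum(l * l for l in latent) / len(latent) + eps)
    normed = [l * rrms * g for l, g in zip(latent, rms_weight)]
    return matvec(w_ukv, normed), rrms  # only rrms is kept as an activation


def backward(dy, x, rrms, w_dkv, rms_weight, w_ukv):
    latent = matvec(w_dkv, x)                      # recomputed, not saved
    n_hat = [l * rrms for l in latent]             # normalized latent
    d_normed = matvec_t(w_ukv, dy)
    d_rms_weight = [d * n for d, n in zip(d_normed, n_hat)]
    d_n_hat = [d * g for d, g in zip(d_normed, rms_weight)]
    # RMSNorm backward: subtract the component along n_hat, then rescale.
    mean_dot = sum(d * n for d, n in zip(d_n_hat, n_hat)) / len(n_hat)
    d_latent = [rrms * (d - n * mean_dot) for d, n in zip(d_n_hat, n_hat)]
    return matvec_t(w_dkv, d_latent), d_rms_weight


# Tiny illustrative shapes: d_in=3, latent=2, d_out=3.
x = [0.5, -1.0, 2.0]
w_dkv = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6]]
rms_weight = [1.5, 0.5]
w_ukv = [[0.7, -0.8], [0.9, 1.0], [-1.1, 1.2]]

y, rrms = forward(x, w_dkv, rms_weight, w_ukv)
dx, d_rms_weight = backward([1.0, 1.0, 1.0], x, rrms, w_dkv, rms_weight, w_ukv)


def loss(xv):
    return sum(forward(xv, w_dkv, rms_weight, w_ukv)[0])


# Central finite difference on x as an independent check of dx.
h = 1e-6
fd = []
for i in range(len(x)):
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    fd.append((loss(xp) - loss(xm)) / (2 * h))
max_err = max(abs(a - b) for a, b in zip(dx, fd))
print(max_err)
```

The trade is one extra down-projection GEMM in backward in exchange for not holding the latent and normalized activations between forward and backward.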
FAQ

Frequently asked questions

Why not train with the absorbed-weights MLA path?
Because it raises attention FLOPs and breaks the fused Flash Attention paths the training stack depends on. The cache win matters at decode time, not in the training path.
Where does FusedDownNormUp actually land today?
As an experimental building block, not the default live path. The shipped training lane still uses separate down, norm, and up steps and relies on compiler adjacency to fuse the hot pieces.
What is this article not claiming yet?
It is not claiming a settled H200 latency number for FusedDownNormUp, a public root cause for GB10 RHT instability, or a universal win for packed thd layout over bthd. Those are measurement targets, not conclusions. The safe reading is narrower: the checked-in samples define the adapter, projection, FP8 dispatch, and dimension contracts; deployment claims still need receipt-local benchmarks.
Where are the checked-in sample surfaces for the MLA adapter and FP8 dispatch seams?
Use Fused MLA projection sample for FusedDownNormUp, MLA integration pattern sample and MLA shared adapter sample for the adapter boundary, and Sparse MLA FP8 dispatch plus Sparse MLA FP8 dispatch sample for the wrapper-vs-storage dispatch contract.
Why are GB10 MLA results tracked separately from H200 if both routes go through Transformer Engine?
Because the shared bridge does not imply the same numeric or kernel contract. H200 keeps the full FusedMLASelfAttention bf16/fp8 path, while GB10 pins disable_rht=True and ends up with a different stability and autotune surface. That is close enough to share the integration story, but not close enough to merge the receipts.
Why does the Sparse MLA FP8 dispatch sample belong next to the adapter samples?
Because the adapter seam is where MLA-specific module wiring meets wrapper-backed tensor surfaces. Sparse MLA FP8 dispatch sample is the checked-in public-safe reminder that a Transformer Engine FP8 wrapper is not interchangeable with an ordinary bf16 tensor just because the logical dtype looks familiar. That is exactly the kind of edge a narrow adapter seam is supposed to isolate.
Why keep the adapter seam and the FP8 dispatch seam next to this kernel article instead of folding everything into one MLA story?
Because the kernel lane only becomes meaningful after two earlier contracts are already honest. The adapter seam owns layer-spec, pipeline-offset, and RoPE normalization; the FP8 dispatch seam owns wrapper-backed storage and scale metadata. This article starts after those boundaries are settled, when the question becomes which fused projection, RoPE, and cache surfaces are actually worth keeping. The shortest cross-read is Public MLA integration patterns for Megatron, Shared MLA adapter boundaries, and Sparse MLA FP8 dispatch.
What do kv_lora_rank and qk_rope_head_dim mean in the checked-in surfaces?
kv_lora_rank is the latent width the KV cache and up-projection share, while qk_rope_head_dim is the RoPE-only slice that the fused MLA kernels rotate in place. The public-safe decoder order is MLA weight absorption for the latent/cache vocabulary, Fused MLA projection sample for the projection path, and Sparse MLA dimension generalization for how those widths later reappear as d_total and d_v.
Why keep separate MLA-specific and generic RoPE kernels?
Because MLA rotates only a RoPE sub-portion of each head while the generic dense-attention path rotates the full head layout. The numerical contract is shared, but the tensor layout is not.
What has to be true before the fused projection path graduates from experimental?
Backward parity has to stay within noise against the unfused baseline, peak-memory savings have to survive a full training-shaped run rather than a single microbenchmark, and the path has to keep a clean fallback when compiler or backend conditions change.
Which checked-in file should a first-touch reader open before this article?
Start with the MegaCpp model wiring examples section on MLA integration and Sparse MLA, then MLA shared adapter sample for the narrow compatibility seam, and only then return here for the kernel story. That keeps the builder boundary explicit before the fused-kernel details.
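The "rotate only the RoPE slice" contract from the answers above can be written down as an unfused reference. This is a plain-Python sketch for a single head, not the Triton kernel. The function name, the half-split pairing convention, the base of 10000, and placing the RoPE slice last in the head are all assumptions made for illustration. It also returns a new list, where the fused kernel is described as working in place.

```python
import math
from typing import List


def split_rope_concat(head: List[float], rope_dim: int, pos: int,
                      base: float = 10000.0) -> List[float]:
    """Rotate the trailing rope_dim entries of one head; pass the rest through."""
    if rope_dim % 2 != 0 or rope_dim > len(head):
        raise ValueError("rope_dim must be even and fit inside the head")
    split = len(head) - rope_dim
    nope, rope = head[:split], head[split:]      # split
    half = rope_dim // 2
    out = list(rope)
    for i in range(half):                        # RoPE on the slice only
        theta = pos * base ** (-2.0 * i / rope_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = rope[i], rope[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return nope + out                            # concat


# Illustrative head: 4 pass-through entries followed by a 4-wide RoPE slice.
head = [1.0, 2.0, 3.0, 4.0, 0.5, -0.5, 1.5, 2.5]
rotated = split_rope_concat(head, rope_dim=4, pos=7)
unrotated = split_rope_concat(head, rope_dim=4, pos=0)
print(rotated)
```

Two properties make a convenient parity check for any fused version: the non-RoPE entries come back unchanged, and the rotated slice keeps its norm.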
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

Topic hubs