Shared MLA adapter boundaries
Why MegaCpp keeps MLA-specific normalization behind one shared adapter seam instead of leaking MLA conditionals through the whole attention builder stack.

The point of a shared MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns adapter is not abstraction for its own sake. The point is to contain drift.
MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns support tends to create pressure in exactly the wrong places: layer-spec construction, positional handling, attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: public-safe MLA integration patterns-module selection, and pipeline offset plumbing. If those conditions spread through the generic builder path, every later attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: public-safe MLA integration patterns change becomes harder to reason about. A shared adapter is the cheaper boundary. It normalizes the MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns-only pieces and leaves the rest of the stack boring.
What problem this boundary solves
MegaCpp's public sample is intentionally small, but it encodes a real design rule: keep MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns-specific compatibility in one adapter contract.
That matters because upstream MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split surfaces in this area are not static.
Recent MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split-LM issue reports show exactly the kind of drift a narrow seam is
meant to isolate: local layer-spec initialization can miss MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns-specific
k_channels and v_channels, packed-sequence rotary handling can leak tensor
shape assumptions, and unsupported MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: public-safe MLA integration patterns backends can fall through to
NoBackend. A shared adapter keeps those MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns-only corrections in one handoff
instead of teaching every generic attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: public-safe MLA integration patterns path the same exceptions.
That also makes failures easier to classify. If the issue is missing MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns channel metadata, packed-sequence rotary shape, or backend selection, the adapter can turn it into one local compatibility decision or one explicit unsupported path. The generic builder should not need a separate branch for each failure mode.
Why one shared adapter is safer than many tiny special cases
There are three practical wins.
- reviewability improves, because the MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns contract has a clear home
- upstream upgrades get cheaper, because compatibility edits stay localized
- public documentation gets more honest, because we can point to one visible boundary instead of implying MLA is just a generic attention toggle
The adapter should answer only boundary questions before the generic builder runs: which MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns layer metadata is required, which positional or packed-sequence shape needs normalization, and whether the requested backend is a supported path. If a change does not affect one of those answers, it belongs outside the adapter.
This is the same architectural reason people isolate position-bias or cross-entropy fusion boundaries instead of threading ad hoc switches through the whole model stack. Once a feature changes the contract of layer construction, the safest default is to contain it.
Example -> article -> upstream docs
- example: Shared MLA adapter sample
- related article: Public-safe MLA integration patterns for Megatron
- upstream docs: Megatron CoreQuick term guideMegatron CoreThe NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.GroundingAbout: Porting to Megatron friction About: Nemotron-style recipe as pure Megatron CLI Example: Mamba3 TP mixer sample MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: public-safe MLA integration patterns docs and MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split-LM issue reports on layer-spec and backend support
Frequently asked questions
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.
Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…