Author Mamba3 spec inside Megatron
Why an author-pure Mamba3 path still needs an explicit pre-projection RMSNorm when it is wrapped into a Megatron-local Mamba stack.

This seam is easy to describe badly. It is not just "drop an author model into Megatron." The real issue is that the surrounding stack has assumptions about where normalization happens.
In the author Mamba3 path, the projection is a plain linear surface. In the Megatron-local path, normalization may already be fused elsewhere or replaced by an identity surface. If that surrounding norm is not actually doing work, the author path must restore the missing pre-projection RMSNorm explicitly.
That is why this public example keeps the contrast visible: one lane restores the norm contract, the other leaves the residual stream unconstrained.
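The contrast between the two lanes can be sketched numerically. The snippet below is a minimal, dependency-free illustration, not the checked-in sample: `rms_norm`, `linear`, `author_lane`, and the toy weights are all hypothetical names and values chosen for this sketch. It assumes the host norm slot is an identity and shows what happens to the projection output when the residual stream is rescaled.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one feature vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def linear(x, rows):
    """Plain bias-free projection: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def author_lane(x, in_proj_rows, norm_weight, explicit_pre_norm):
    """Hypothetical wrapped lane; the host norm slot is assumed to be a no-op."""
    h = x  # identity norm surface: nothing happens to the input here
    if explicit_pre_norm:
        h = rms_norm(h, norm_weight)  # restore the norm right before in_proj
    return linear(h, in_proj_rows)


# Toy inputs (assumed values, for illustration only).
x = [0.5, -1.0, 2.0, 0.25]
x_scaled = [10.0 * v for v in x]  # same direction, 10x larger residual stream
in_proj_rows = [[0.1, 0.2, -0.3, 0.4], [0.5, -0.1, 0.0, 0.2]]
norm_weight = [1.0, 1.0, 1.0, 1.0]

restored = author_lane(x, in_proj_rows, norm_weight, explicit_pre_norm=True)
restored_scaled = author_lane(x_scaled, in_proj_rows, norm_weight, explicit_pre_norm=True)
bare = author_lane(x, in_proj_rows, norm_weight, explicit_pre_norm=False)
bare_scaled = author_lane(x_scaled, in_proj_rows, norm_weight, explicit_pre_norm=False)
```

With the explicit norm, `restored` and `restored_scaled` agree up to the epsilon term. Without it, `bare_scaled` is ten times `bare`: whatever scale the residual stream carries flows straight into the projection.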
It sits on the same boundary described in porting to Megatron friction: a wrapped module can be locally correct and still violate the host runtime contract at the seam.
If you want the checked-in local starter pack before the prose, open the MegaCpp model wiring examples, then the author-spec seam sample for the compact seam and the fused residual-plus-RMSNorm example for the surrounding norm boundary.
If these terms are new
- The author path is the narrow Mamba3 block contract copied from the model author's surface: its own projection order, norm assumptions, and state-space mixer geometry.
- The Megatron-local path is the host-framework lane around that block: fused norm placement, residual plumbing, and any distributed wrappers.
- An identity norm surface means the host wrapper still exposes a norm slot, but that slot is effectively a no-op and therefore does not restore the author block's expected pre-projection scaling.
- Pre-projection RMSNorm means the explicit norm immediately before in_proj; this is the step the checked-in seam restores when the host lane exposes an identity norm surface.
- The compact seam is a teaching-sized checked-in receipt. It proves one boundary condition cleanly; it does not claim the whole production feature matrix.
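For reference, the usual RMSNorm definition and its position relative to in_proj can be written out as follows. The symbols are notation chosen for this note, not names from the checked-in sample: d is the feature width, g the learned gain, ε a small stabilizer, and W_in the in_proj weight.

```latex
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}}\; g_i,
\qquad
u = W_{\mathrm{in}}\,\mathrm{RMSNorm}(x)
```

With an identity norm surface and no explicit norm, the projection input collapses to u = W_in x, so any rescaling of x by a constant c rescales u by the same c.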
The checked-in example is explicit about that boundary: the author-spec seam sample marks the working lane as uses_identity_norm=True and explicit_pre_norm=True, while the same compact receipt leaves tensor parallel, context parallel, packed sequences, and inference unsupported on purpose. That keeps this article grounded in one contract instead of quietly smuggling in a broader distributed story.
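One way to picture such a receipt is a small frozen record. This is a sketch under assumptions: only uses_identity_norm and explicit_pre_norm come from the sample described above. The class name `SeamReceipt`, the `supports_*` fields, and both helper methods are hypothetical and exist only to make the scope explicit.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SeamReceipt:
    """Hypothetical summary of what a compact seam claims and what it does not."""

    uses_identity_norm: bool   # host norm slot is a no-op
    explicit_pre_norm: bool    # lane adds RMSNorm itself before in_proj
    # Everything below is deliberately out of scope for the compact seam.
    supports_tensor_parallel: bool = False
    supports_context_parallel: bool = False
    supports_packed_sequences: bool = False
    supports_inference: bool = False

    def norm_contract_restored(self) -> bool:
        # If the host slot is an identity, the lane must supply the norm itself.
        return self.explicit_pre_norm or not self.uses_identity_norm

    def unsupported(self):
        # Names of every feature flag that is switched off.
        return [
            f.name
            for f in fields(self)
            if f.name.startswith("supports_") and not getattr(self, f.name)
        ]


working = SeamReceipt(uses_identity_norm=True, explicit_pre_norm=True)
broken = SeamReceipt(uses_identity_norm=True, explicit_pre_norm=False)
```

Here `working.norm_contract_restored()` holds, `broken.norm_contract_restored()` does not, and `working.unsupported()` lists all four out-of-scope features.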
Why this is worth showing publicly
This is exactly the kind of integration bug that disappears in abstract design diagrams. Everything still "looks like Mamba3." The failure is in the seam between author assumptions and host-framework assumptions.
The public example is useful because it shows the real rule in a compact form: if the embedding or mixer path expects a fused norm and the wrapped module does not supply it, that norm has to be put back explicitly.
It is also the input-side sibling of Mamba linear CE parity deep dive: that post keeps the output-and-loss boundary visible, while this one keeps the pre-projection norm boundary visible.
The most useful runtime seam from the research packet is narrower than a full architecture rewrite: the host-side norm slot often also owns the residual-fork and saved-input surface used by fused runtime paths. That is why this article treats explicit pre-projection RMSNorm as boundary repair rather than style. If the host lane silently replaces that boundary with an identity surface, the wrapped author block can be locally correct and still violate the surrounding runtime contract. Transformer Engine bridge on NVIDIA is the closest runtime-side companion.
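The idea that one slot owns both the norm and the residual fork can be sketched as below. This is an illustrative toy, not Megatron or Transformer Engine code: `fused_add_rms_norm` and `identity_slot` are hypothetical functions that operate on plain Python lists, and the values are made up.

```python
import math


def fused_add_rms_norm(hidden, residual, weight, eps=1e-6):
    """Sketch of a norm slot that also owns the residual fork.

    Returns (normed, new_residual): the normalized input for the mixer and
    the pre-norm sum that is saved for the next residual add.
    """
    if residual is None:
        new_residual = list(hidden)
    else:
        new_residual = [h + r for h, r in zip(hidden, residual)]
    mean_sq = sum(v * v for v in new_residual) / len(new_residual)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms * w for v, w in zip(new_residual, weight)]
    return normed, new_residual


def identity_slot(hidden, residual):
    """Same residual fork, but the 'norm' hands the sum through unscaled."""
    if residual is None:
        new_residual = list(hidden)
    else:
        new_residual = [h + r for h, r in zip(hidden, residual)]
    return list(new_residual), new_residual


normed, saved = fused_add_rms_norm([1.0, 2.0], [0.5, -0.5], [1.0, 1.0])
passthrough, saved_identity = identity_slot([1.0, 2.0], [0.5, -0.5])
```

Both slots save the same residual, so downstream plumbing looks identical. Only the first output differs: `normed` has unit RMS, while `passthrough` still carries the raw stream scale. That is the gap the explicit pre-projection norm has to close.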
What the compact seam proves, and what it does not
This compact article proves one rule: if the host path removes or bypasses the norm surface that the author block expects, the wrapped author lane must put that norm back explicitly before projection.
It also proves what this receipt is not trying to cover. The checked-in seam does not advertise TP, CP, packed-sequence, or inference support. Those are separate ownership boundaries. The shortest public-safe follow-on for that broader surface is the tensor-parallel Mamba mixer example, which shows integer head/group sharding, packed in_proj ownership, and the replicated angle_proj rule that the compact seam intentionally leaves out.
For the cost and runtime side of that same story, continue to Mamba 3 parallel performance.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- RMSNorm: Root-mean-square normalization used as an explicit contract seam in the wrapped Mamba3 and Megatron integration paths.
- Mamba3: A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
- TP: Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP, which is heavy on bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.
- CP: Context parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length, cheap on NVLink and expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory; that's TP or FSDP's job.
- Megatron: Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…
- Transformer Engine: NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.