MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 3 min read · David Gornshtein
Mamba3
Megatron
RMSNorm
Spec

Author Mamba3 spec inside Megatron

Why an author-pure Mamba3 path still needs an explicit pre-projection RMSNorm when it is wrapped into a Megatron-local Mamba stack.

This seam is easy to describe badly. It is not just "drop an author model into Megatron." The real issue is that the surrounding stack has assumptions about where normalization happens.

In the author Mamba3 path, the projection is a plain linear surface. In the Megatron-local path, normalization may already be fused elsewhere or replaced by an identity surface. If that surrounding norm is not actually doing work, the author path must restore the missing pre-projection RMSNorm explicitly.

That is why this public example keeps the contrast visible: one lane restores the norm contract, the other leaves the residual stream unconstrained.
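
The contrast between the two lanes can be sketched without any framework. This is a minimal illustration in plain Python, not the checked-in sample: the helper names (`rms_norm`, `identity_norm`, `linear`, `author_lane`) and the list-based vectors are assumptions made for the example. It only shows that, when the host norm slot is an identity, the projection input is scale-constrained only if the explicit pre-projection norm is applied.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> Vector:
    """Scale x by the reciprocal of its root mean square, then by a per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def identity_norm(x: Sequence[float]) -> Vector:
    """Hypothetical host norm slot that exists but does no work."""
    return list(x)


def linear(x: Sequence[float], rows: Sequence[Sequence[float]]) -> Vector:
    """Plain bias-free linear projection: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def author_lane(
    x: Sequence[float],
    in_proj_rows: Sequence[Sequence[float]],
    norm_weight: Sequence[float],
    host_norm: Callable[[Sequence[float]], Vector],
    explicit_pre_norm: bool,
) -> Vector:
    """Run the host norm slot, optionally restore RMSNorm, then project."""
    h = host_norm(x)
    if explicit_pre_norm:
        # Boundary repair: put the missing norm back immediately before in_proj.
        h = rms_norm(h, norm_weight)
    return linear(h, in_proj_rows)


if __name__ == "__main__":
    rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    gain = [1.0, 1.0]
    for scale in (1.0, 100.0):
        x = [3.0 * scale, 4.0 * scale]
        print(scale,
              author_lane(x, rows, gain, identity_norm, explicit_pre_norm=True),
              author_lane(x, rows, gain, identity_norm, explicit_pre_norm=False))
```

With the explicit norm, the projected values are nearly independent of the input scale; without it, they grow with whatever magnitude the residual stream happens to carry.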

It sits on the same boundary described in porting to Megatron friction: a wrapped module can be locally correct and still violate the host runtime contract at the seam.

If you want the checked-in local starter pack before the prose, open the MegaCpp model wiring examples, then author-spec seam sample for the compact seam and fused residual-plus-RMSNorm example for the surrounding norm boundary.

If these terms are new

  • The author path is the narrow Mamba3 block contract copied from the model author's surface: its own projection order, norm assumptions, and state-space mixer geometry.
  • The Megatron-local path is the host-framework lane around that block: fused norm placement, residual plumbing, and any distributed wrappers.
  • An identity norm surface means the host wrapper still exposes a norm slot, but that slot is effectively a no-op and therefore does not restore the author block's expected pre-projection scaling.
  • Pre-projection RMSNorm means the explicit norm immediately before in_proj; this is the step the checked-in seam restores when the host lane exposes an identity norm surface.
  • The compact seam is a teaching-sized checked-in receipt. It proves one boundary condition cleanly; it does not claim the whole production feature matrix.
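
For readers who prefer the definition in symbols, the standard RMSNorm formula and its placement before in_proj can be written as follows. The symbols are assumptions introduced here for illustration: d is the hidden width, g the learned per-channel gain, ε a small stabilizer, and W_in the in_proj weight.

```latex
\mathrm{RMSNorm}(x)_i \;=\; \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}}\; g_i ,
\qquad
u \;=\; W_{\mathrm{in}}\,\mathrm{RMSNorm}(x)
```

An identity norm surface corresponds to replacing RMSNorm(x) with x in the second expression, which is the case the explicit pre-projection norm repairs.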

The checked-in example is explicit about that boundary: author-spec seam sample marks the working lane as uses_identity_norm=True and explicit_pre_norm=True, while the same compact receipt leaves tensor parallel, context parallel, packed sequences, and inference unsupported on purpose. That keeps this article grounded in one contract instead of quietly smuggling in a broader distributed story.
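
One way to keep that boundary machine-checkable is a small config record with a validator. The sketch below is hypothetical: only the flag names `uses_identity_norm` and `explicit_pre_norm` come from the article, while `SeamLaneConfig`, `validate_lane`, and the four feature fields are names invented for this example.

```python
from dataclasses import dataclass

# Hypothetical feature switches the compact seam leaves unsupported on purpose.
UNSUPPORTED_FEATURES = (
    "tensor_parallel",
    "context_parallel",
    "packed_sequences",
    "inference",
)


@dataclass(frozen=True)
class SeamLaneConfig:
    # These two flag names appear in the article; everything else is assumed.
    uses_identity_norm: bool
    explicit_pre_norm: bool
    tensor_parallel: bool = False
    context_parallel: bool = False
    packed_sequences: bool = False
    inference: bool = False


def validate_lane(cfg: SeamLaneConfig) -> None:
    """Reject lanes that step outside the one contract the seam covers."""
    requested = [name for name in UNSUPPORTED_FEATURES if getattr(cfg, name)]
    if requested:
        raise NotImplementedError(
            "compact seam does not cover: " + ", ".join(requested)
        )
    if cfg.uses_identity_norm and not cfg.explicit_pre_norm:
        # The host slot does no work, so nothing would constrain the input scale.
        raise ValueError("identity host norm requires explicit_pre_norm=True")
    # A real host norm combined with an explicit pre-norm is left unjudged here.


if __name__ == "__main__":
    validate_lane(SeamLaneConfig(uses_identity_norm=True, explicit_pre_norm=True))
    print("working lane accepted")
```

Failing loudly on the unsupported switches mirrors the intent described above: the receipt covers one contract and refuses to imply more.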

Why this is worth showing publicly

This is exactly the kind of integration bug that disappears in abstract design diagrams. Everything still "looks like Mamba3." The failure is in the seam between author assumptions and host-framework assumptions.

The public example is useful because it shows the real rule in a compact form: if the embedding or mixer path expects a fused norm and the wrapped module does not supply it, that norm has to be put back explicitly.

It is also the input-side sibling of Mamba linear CE parity deep dive: that post keeps the output-and-loss boundary visible, while this one keeps the pre-projection norm boundary visible.

The most useful runtime seam from the research packet is narrower than a full architecture rewrite: the host-side norm slot often also owns the residual-fork and saved-input surface used by fused runtime paths. That is why this article treats explicit pre-projection RMSNorm as boundary repair rather than style. If the host lane silently replaced that boundary with an identity surface, the wrapped author block can be locally correct and still violate the surrounding runtime contract. Transformer Engine bridge on NVIDIA is the closest runtime-side companion.
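
To make "the norm slot also owns the residual fork" concrete, here is an unfused reference sketch of a residual-add-plus-RMSNorm boundary in plain Python. It is an assumption-laden illustration, not the fused kernel or the checked-in sample: the function name `add_rms_norm` and the list-based vectors are invented for the example. The point is that one call produces both the updated residual stream and the normalized input for the block.

```python
import math
from typing import List, Sequence, Tuple


def add_rms_norm(
    residual: Sequence[float],
    delta: Sequence[float],
    weight: Sequence[float],
    eps: float = 1e-6,
) -> Tuple[List[float], List[float]]:
    """Reference (unfused) residual-add followed by RMSNorm.

    Returns (normed, new_residual): the normalized branch feeds the block,
    and the updated residual continues down the stream.
    """
    # Residual fork: the sum is kept as the new residual stream.
    new_residual = [r + d for r, d in zip(residual, delta)]
    mean_sq = sum(v * v for v in new_residual) / len(new_residual)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms * w for v, w in zip(new_residual, weight)]
    return normed, new_residual


if __name__ == "__main__":
    normed, new_residual = add_rms_norm([1.0, 2.0], [2.0, 2.0], [1.0, 1.0])
    print(normed, new_residual)
```

If the host slot is swapped for an identity, both outputs of this boundary change meaning at once, which is why the article treats the explicit norm as a repair of the boundary rather than a stylistic choice.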

What the compact seam proves, and what it does not

This compact article proves one rule: if the host path removes or bypasses the norm surface that the author block expects, the wrapped author lane must put that norm back explicitly before projection.

It also proves what this receipt is not trying to cover. The checked-in seam does not advertise TP, CP, packed-sequence, or inference support. Those are separate ownership boundaries. The shortest public-safe follow-on for that broader surface is tensor-parallel Mamba mixer example, which shows integer head/group sharding, packed in_proj ownership, and the replicated angle_proj rule that the compact seam intentionally leaves out. For the cost and runtime side of that same story, continue to Mamba 3 parallel performance.

FAQ

Frequently asked questions

Why is the RMSNorm called out explicitly here?
Because the whole bug class is about an assumed normalization step disappearing at the integration seam. This article isolates that contract instead of folding it into a broader wrapper story. The quickest local proof surface is author-spec seam sample, and the closest normalization-side companion is fused residual-plus-RMSNorm example, which makes the surrounding residual-plus-RMSNorm boundary explicit.
What does uses_identity_norm=True mean in the checked-in example?
It means the host-side wrapper still has a place where a norm would normally live, but the sample intentionally models that place as doing no work. That is why author-spec seam sample has to pair uses_identity_norm=True with explicit_pre_norm=True on the working lane.
Does adding explicit pre-projection RMSNorm also recreate the fused host fast path?
No. It restores the missing input-scale contract before in_proj, but it does not by itself recreate the fused residual-add-plus-RMSNorm boundary around the block. That wider boundary is where launch count and memory traffic get cut back down again. Author-spec seam sample proves the input-side repair; fused residual-plus-RMSNorm example is the next checked-in proof surface for the fused boundary the compact seam is not trying to reproduce. That sample also makes the fast-path guard explicit: the fused lane is only taken when the residual, delta, and weight tensors are CUDA, contiguous, shape-aligned, and dtype-aligned.
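
The guard described in that answer can be sketched as a pure predicate. This is a hypothetical illustration rather than the checked-in sample: `TensorMeta` stands in for real tensor properties, and the reading of "shape-aligned" used here (residual and delta share a shape, and the weight matches the last dimension) is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical stand-in for the tensor properties the guard inspects."""
    device: str
    contiguous: bool
    shape: Tuple[int, ...]
    dtype: str


def can_use_fused_lane(residual: TensorMeta, delta: TensorMeta, weight: TensorMeta) -> bool:
    """Return True only when all fast-path preconditions hold."""
    tensors = (residual, delta, weight)
    if not all(t.device.startswith("cuda") for t in tensors):
        return False
    if not all(t.contiguous for t in tensors):
        return False
    # Assumed meaning of shape-aligned: same activation shape, per-channel weight.
    if residual.shape != delta.shape:
        return False
    if weight.shape != residual.shape[-1:]:
        return False
    return residual.dtype == delta.dtype == weight.dtype


if __name__ == "__main__":
    act = TensorMeta("cuda", True, (4, 8), "bfloat16")
    gain = TensorMeta("cuda", True, (8,), "bfloat16")
    print(can_use_fused_lane(act, act, gain))
```

Any failed check would route the call to an unfused reference path instead of the fused lane, so the guard affects speed and not the result.
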
Why not just remove the host pre-norm and follow the author spec literally?
Because the author-side BC/QK-style norm only protects specific internal projections. It does not stabilize the surrounding residual stream or replace the fused residual-plus-norm boundary the host stack expects. In practice that turns "more author-pure" into a broken seam: numerically wrong if the residual path stays unconstrained, and slower if the missing boundary gets rebuilt out of extra unfused PyTorch ops.
Does norm_before_gate close this seam by itself?
No. Megatron Core exposes norm_before_gate as a mixer-local choice, but that is still inside the Mamba mixer surface. The seam in this article is the wrapper contract before in_proj: whether the residual stream has already crossed a real norm boundary or only an identity slot. Treating the mixer-local flag as a drop-in replacement would hide the same ownership problem the compact example is trying to keep visible. For the wider distributed mixer surface, continue to tensor-parallel Mamba mixer example.
Could a custom fused internal-norm kernel bridge both contracts?
Potentially, but that would be a new runtime contract, not a switch on this compact seam. It would need to own the residual fork, input-scale boundary, projection-local BC/QK-style norm, saved-tensor policy, dtype guards, and backward path together. Until that proof exists, keep the author-spec seam sample and the fused residual-plus-RMSNorm example as separate receipts.
Could I repair this seam with LayerNorm instead?
LayerNorm may still stabilize a wrapper, but it is not the same narrow repair. RMSNorm rescales by the root mean square, while the RMSNorm paper explicitly motivates the method by removing LayerNorm's re-centering step. For this seam, the goal is not "add any norm." The goal is "restore the missing RMS-shaped input surface the wrapped block expected." If you swap norm families, treat that as a new host contract and revalidate the wrapper accordingly.
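
The difference between the two norm families is easy to see numerically. The sketch below uses plain Python with no learned gain or bias, which is a simplification made for the example; the function names are illustrative only. Adding a constant offset to the input leaves the LayerNorm output unchanged because of re-centering, while the RMSNorm output shifts.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Rescale by the root mean square; no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


def layer_norm(x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Subtract the mean (re-centering), then rescale by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    shifted = [v + 10.0 for v in x]
    print("layer_norm:", layer_norm(x), layer_norm(shifted))
    print("rms_norm:  ", rms_norm(x), rms_norm(shifted))
```

A block that was written against an RMS-shaped input therefore sees a different surface under LayerNorm, which is why the swap counts as a new host contract.
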
How is this different from Mamba linear CE parity?
This post is about the input-side seam before in_proj: whether the wrapped author block still receives the normalization step it expects. Mamba linear CE parity deep dive is the output-side companion: whether the logits-to-loss boundary still matches what the runtime expects after an output-layer swap.
Does repairing the pre-projection norm also prove the full authored Mamba3 mixer path?
No. This seam repairs the host-side input contract only. It does not, by itself, prove the wider authored mixer features such as trapezoidal discretization, data-dependent A, complex RoPE on the state projections, or MIMO arithmetic. Those belong to the mixer-implementation lane described in Mamba3 kernel journey and the surrounding hybrid context in Mamba-3 hybrid.
Where should I go next for the wider architecture context?
Read Mamba-3 hybrid for the backbone story, and migration policy: native Megatron vs narrow custom seams for the broader porting rule.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

RMSNorm

Root-mean-square normalization used as an explicit contract seam in the wrapped Mamba3 and Megatron integration paths.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

CP

Context parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.
