MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 14 min readDavid Gornshtein
Mamba3
State Space
MIMO
Performance
Parallelism

Mamba 3 Parallel Performance: Where It Beat Attention, and Where It Lost

MIMO scaling, chunk-size behavior, the PsiV cache trade-off, and an honest tally of where a Mamba 3 hybrid outran pure attention on NVIDIA H200 and where it did not.

MegaCpp
Focused on applied C++ model engineering
Article Preview
Mamba 3 Parallel Performance: Where It Beat Attention, and Where It Lost
Published 14 min readDavid Gornshtein

The question that gates everything else for a C++ specialist model is blunt: does a Mamba 3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode block actually go faster than an attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns block, at the sequence lengths we care about, on the hardware we have? The MegaCpp stack gave us a first answer. This post is the performance side of that answer: MIMO scaling, chunk-size behavior, the PsiV cache trade-off, and a frank tally of where the hybrid pulled ahead of pure attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns and where it did not.

All numbers below come from two NVIDIA H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 training hosts, a smaller debug-validation system, and the TPU v6e lane we use for XLA ablations. Configuration labels are project shorthand; shapes are real.

If you want the checked-in local decoder before the tables, start with the MegaCpp example index, then tensor-parallel Mamba3 mixer sample for ownership and sharding, Mamba3 3D-to-2D shared-memory sample for the legality-sensitive kernel surface, Mamba3 PsiV cache scaffold for the fail-closed cache gate, and Mamba-3 porting notes for TPU for the cross-backend doc companion. The surrounding runtime lanes live in Mamba 3 kernel journey, training speed by feature, the training on H200: eight GPU, and tensor parallel and sharding.

If these performance terms are new

  • chunk_size is the number of sequence tokens processed inside one scan chunk before the recurrent state is handed to the next chunk. It is a kernel tiling and state-handoff size, not micro-batch size.
  • R or mimo_rank is the number of V-like channels carried through the same MIMO scan per head.
  • TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding geometry means how heads and groups divide across tensor-parallel ranks. If local heads or groups do not divide evenly, the mixer cannot assign an integer shard to each rank.
  • PsiV is the per-chunk product v * psi. It is a kernel-side intermediate, not a separate model family.
  • MBS means micro-batch size, the number of sequences this rank carries through one forward/backward step before gradient accumulation moves on.
  • Occupancy is the fraction of a GPU's execution slots the kernel can keep active at once; here it is a proxy for how much register pressure is choking launch parallelism.

The quickest checked-in decoder for those terms is MegaCpp example index, author-path Mamba3 spec near-copy for the author-path contract, tensor-parallel Mamba3 mixer sample for sharding ownership, Mamba3 PsiV cache scaffold for the fail-closed cache gate, and Mamba3 3D-to-2D shared-memory sample for the layout-legality repair that underpins the TMA story. The external baseline is similarly narrow: Mamba is the selective-SSM starting point, and NVIDIA's empirical Mamba study is the primary-source reason to keep "pure SSM" and "hybrid SSM-plus-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns" performance claims separate. The adjacent memory and alternative-mixer hubs are separate on purpose: M2RNN and Engram cover the extra memory tiers, and Gated DeltaNet, hyper-connections, and DynamicTanh cover the other recurrent or residual experiments around the same hybrid stack.

Why this matters

The hybrid is expensive to ship. A Mamba-3 MIMO kernel is register-heavy, the fork has to be disciplined across hosts, and the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample compiler surface is not as battle-tested as PyTorch's own ops. Paying that ship cost only makes sense if the hybrid wins where it is supposed to: at 16K to 64K context on C++ snippets packed with cross-file structure. A careful perf accounting keeps the hybrid argument honest. When the hybrid wins it is because O(N) scan cost beats O(N^2) attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns at the sequence lengths our training data actually uses; when it loses it is usually because the kernel sits off its register ceiling or because a micro-optimization got announced before nsys said anything.

1. The shapes that matter

Before percentages, the geometry. The MIMO scan is parameterized by (H, G, N, P, R, chunk_size, B, S):

  • H=16 heads per Mamba layer
  • G=1 B/C group (we keep ngroups=1 for the author-pure contract)
  • N=64 state dimension
  • P=64 head width
  • R=4 MIMO rank, four up-projections sharing one scan
  • chunk_size=16 for the MIMO kernel (not the 256 we used in the Mamba 2 reference)
  • B=1, S=8192 per rank at the reference shape, MBS=8 in practice

Those numbers are locked by AuthorMamba3Config. The config layer refuses overrides that do not satisfy H = hidden_size * expand / head_dim because the author kernel assumes a specific head count; silent mismatches on SSM head geometry corrupt gradients in ways that only show up after hours of training. The compact checked-in anchor for that fixed-shape contract is author-path Mamba3 spec near-copy, while the distributed ownership version lives in tensor-parallel Mamba3 mixer sample.

The local TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding mixer sample makes the distributed side concrete. It keeps in_proj column-parallel, packs [z, x, B, C, dd_dt, dd_A, trap] into that projection, leaves angle_proj replicated across TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding ranks, and requires both nheads and ngroups to divide cleanly by TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding world size. That is why TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding geometry shows up in a performance article rather than only in a model-spec article: those ownership rules affect local memory shape, communication, and which launch shapes are even legal.

2. MIMO is where Mamba 3 earns its FLOPs

Mamba 2 already reframed the selective scan as a structured state-space duality and made it a matrix operation. Mamba 3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode MIMO adds a rank-R outer product to the state update. The PsiV tensor that dominates the kernel is a per-chunk pointwise product:

psi_v[cs, r, p] = v[b, chunk_start + cs, h, p] * psi[h, r, p]

where psi is the learned MIMO_V parameter of shape (H, R, P). At R=4, each head carries four up-projected channels of V through the scan at once. Arithmetic intensity goes up without widening the head or adding heads.

In practice, MIMO is how we get attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-like representational width out of an O(N) kernel. For C++ tokens, where one head needs to track both "what scope am I in" and "what type does this identifier bind to", a single scan with four channels behaves closer to four narrow scans than to one wider scan, and the perf profile stays linear.

The measured price: the MIMO scan is register-heavy. On the reference mid-sized hybrid at MBS=8, nsys captures on an NVIDIA H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 host show:

Kernel Time Regs Smem Occupancy
mamba_mimo_fwd ~1.19 s 239 196 KiB ~6.2 %
mamba_mimo_bwd_fwd ~1.03 s 255 196 KiB ~6.2 %
mamba_mimo_bwd_bwd ~2.11 s 255 228 KiB ~12.5 %

Three observations. First, the double-backward kernel is the tall pole; it runs at about 12.5 percent occupancy at 255 regs per thread, effectively at the compiler ceiling for this launch shape (65536 / (2 * 128) = 256). Second, forward and first-backward are both at about 6.2 percent occupancy, meaning the scan kernels are not memory-bound; they are register-bound on a compute-bound workload. Third, the backward is larger than the forward by more than 2x, which is where optimization pressure belongs.

3. Chunk size behavior

We ran the MIMO forward kernel across eleven parameter shapes to sanity-check correctness and pick a chunk size. Target tolerance was rel_err < 0.1; in practice every shape came in below 0.01:

shape (N, P, R, chunk, BB) stable_max_rel max_abs
16, 64, 4, 8, 128 0.006 0.28
32, 64, 4, 16, 256 0.007 0.90
64, 64, 4, 16, 256 0.008 0.54
128, 64, 4, 16, 256 0.008 1.04
256, 64, 4, 8, 256 0.009 1.34
64, 128, 4, 16, 256 0.005 0.58
128, 32, 4, 16, 256 0.007 0.96
128, 128, 4, 8, 256 0.008 0.85
128, 64, 8, 8, 256 0.006 0.32
128, 64, 2, 32, 256 0.005 2.56
128, 64, 1, 64, 256 0.009 6.41

Two observations drove the chunk-size choice. Larger chunks (chunk=64, R=1) pushed max_abs up roughly 10x without materially helping throughput, because the register window grew with R * P. Smaller chunks (chunk=8) were fine on correctness but spent more time on launch overhead and inter-chunk state plumbing. The sweet spot for the reference shape is chunk=16, which lets us keep R=4 without asking the compiler for more shared memory than the target NVIDIA kernel configuration could sustain.

On the 14-gradient backward test at the smallest shape, stable_max_rel landed between 0.004 and 0.012 with bad_frac < 0.05 everywhere. That is the tolerance we carry forward as the correctness gate when we patch the kernel.

4. The PsiV recomputation tax

PsiV is computed from scratch in fwd, in bwd_fwd, and again inside bwd_bwd. Three independent materializations per forward-backward iteration of a tensor whose two inputs are stable within the iteration. The next cache pass is about turning that into one materialization plus two loads.

The cache is an activation checkpoint, not a hash table. Save PsiV to gmem during fwd, pass it into the backward kernels as an extra input, skip the recompute. Shape (B, S, H, R, P), BF16, chunk-contiguous layout. For the reference shape at MBS=8 that is about 5.6 GiB of extra per-rank memory, fine inside the roughly 132 GiB peak we measured on the H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 runs. Expected envelope is 1.5 to 2.3 percent total TFLOP/s. The public-safe implementation posture for that idea is the same one shown in Mamba3 PsiV cache scaffold: make the gate and refusal path explicit before treating the path as a measured win.

There is a real failure mode this design is ready to accept: if the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample compiler is already CSE-ing psi_v = v * psi across its scheduling stages (hoisting the load and keeping the product in a register across back-to-back ct.mma calls), the runtime cost is already near zero and we get nothing. That is why the first step is a Python-level materialization to measure the ceiling. If the Python hack does not move nsys numbers, the whole pursuit is archived. The checked-in scaffold uses an explicit env gate, default OFF, so the run refuses loudly instead of pretending the cache landed.

5. What beat pure attention

Three places, measurable.

Context length. At 32k to 64k tokens of C++, quadratic attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns cost becomes the dominant line item in an otherwise well-tuned stack. A Mamba 3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode layer at the same shape runs O(N) per token with a constant that is not small (around 1.2 s per step for MIMO forward at the reference shape is nothing to brag about) but it does not grow with sequence length. On our v4 context-graph snippets (up to 64k tokens of Callers -> Target -> Callees), the hybrid spends most of its FLOPs on the scan and reserves attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns for the handful of layers that actually need content-addressable lookup. That claim is meant to stay paired with the architecture-facing article Mamba 3 + Transformers: Why MegaCpp uses a hybrid stack for C++, which is where the retrieval-versus-mixing split is argued at the block-family level rather than only at the timing table level.

Per-head information density. MIMO at R=4 means each head carries four channels through the same scan. Equivalent attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-based width would require either more heads (more KV cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack) or a wider head dimension (more per-op compute). On NVIDIA H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200, the MIMO path is cheaper at the same representational capacity for the shapes in this post.

Long-sequence memory stability. With a minority of attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns blocks, KV-cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack growth at long context is small. For inference with a 64k prompt, the peak live set stays well under what an equivalent all-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns stack would demand, which lets us keep a larger micro-batch at eval time. That budget story is the serving-side companion to A Memory-Budget Anatomy for One Specialist on H200:8. That part of the story connects directly to MLA and weight absorption and KV cache and paged attention: the hybrid only pays off if the long-context serving path also keeps cache growth under control.

6. Where it lost

Three places, and we do not pretend otherwise.

Forward-only P1 was a wash. We flipped TL_DISABLE_TMA_LOWER and TL_DISABLE_WARP_SPECIALIZED from True to False on the MIMO forward kernel and added TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE, expecting 20 to 30 percent forward speedup. We measured over 19 samples at MBS=8 on an 8x NVIDIA H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 host:

Metric Baseline Selective P1 Delta
Throughput (TFLOP/s) 183.016 183.005 -0.006 %
Iter 1 lm loss 11.8775 11.8775 identical
Iter 25 lm loss 5.3296 5.1818 -0.15
Val test iter 25 5.2686 5.1094 -0.16
Peak reserved (GiB) 131.924 132.686 +0.76

Throughput delta is inside measurement noise. Reason: forward is a small fraction of the iteration (~1.2 s of ~5.5 s total), so a 25 percent speedup on forward-only moves the whole step by about one percent, which is exactly what the noise envelope swallows. The loss delta is BF16 FMA-ordering noise from a different kernel schedule; iter-1 loss is bit-identical, which is the sanity check we actually trust.

Backward kernels hit a real TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample bug when TMA was enabled. mamba_mimo_bwd_fwd and mamba_mimo_bwd_bwd used three rank-3 shared-memory descriptors, which TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample's TMA lowering cannot handle (InputDim() == 2 assertion, "Cannot detect TMA layout"). We fixed it by flattening 3D shared memory to 2D via zero-copy reshapes (qk_dot_shared[c, r1, r2] -> [c, r1 * R + r2], Q[B, S, R, G, N] -> [B, S*R, G, N]). Correctness survived: 14 gradient tensors at rel_err 0.0038 to 0.0116, bit-for-bit with the TMA-off baseline inside BF16 rounding. Full performance measurement is still pending a fresh NVIDIA H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 slot with both forward and backward patched. If you want the smaller checked-in proof surfaces for that sentence, read near-copy 3D-to-2D shared-memory sample for the layout rewrite itself and Upstream PRs we wrote for TileLang and Megatron-Core for the exact compiler-failure lane and its upstream follow-up.

Small-context regime. Pure attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns has a floor MIMO does not beat at small context. On a 4K-context ablation sweep on a TPU v6e-x4 slice, the dense Transformer baseline landed at a lower loss at nearly identical tokens-per-second than the AdamW hybrid over the same budget. The gap closes at longer context and with the MIMO path enabled, but at 4K tokens the Mamba blocks spend their O(N) cheapness on sequence lengths that do not exercise it. The hybrid becomes clearly dominant at 16K or 64K, which is where our v4 data actually lives; at 4K, ship dense.

7. What comes after

The concrete list of perf work still on the table, sized by realistic gain:

  1. Full P1: TMA plus warp specialization on fwd and both backward kernels, gated by the 3D-to-2D TMA layout fix. Modeled gain is in the 5 to 10 percent range; measurement is still pending a fresh NVIDIA H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 slot.
  2. PsiV cache (P2): intra-step activation checkpoint removes two of three recomputes, modeled at 1.5 to 2.3 percent. Phase A Python scaffold first, abandon if it does not move nsys.
  3. MBS=10 at fp8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper param-gather: orthogonal win, paired with the Liger main-head backward fix. A point or two if the micro-batch headroom actually opens.
  4. We do not ship the P3 register split of bwd_bwd. The short version is that the claimed 30 to 50 percent kernel speedup did not survive a careful line-by-line read of the reverse-scan live set.

Overall the public MegaCpp Mamba path answered the gating question with a qualified yes: Mamba 3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode MIMO is the cheaper kernel at the context lengths we train at, it represents per-head information at better density than attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns does, and the costs are concentrated in bwd_bwd where we already have optimization paths ready. The wins are real; we just report them honestly, including the ones that came in at noise.

What held up, and what did not

Kept: MIMO at R=4 and chunk=16 on the reference shape; the 3D-to-2D TMA layout fix on the backward kernels; the PsiV cache (P2) as the next perf pass with a Phase A Python gate; honest reporting of the -0.006 percent selective P1 delta; BF16 correctness gates at rel_err below about 0.012; attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns as a minority path for sharp retrieval; dense attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns at 4K-only ablations. Thrown away: the P3 register split, which did not justify its complexity; larger chunks with R=1, which raised max_abs without throughput gain; and any claim that forward-only P1 delivered a meaningful speedup. The residual risk is still straightforward: the full P1 number needs a fresh H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 measurement pass.

The surrounding runtime story is easier to read alongside Mamba 3 kernel journey, Dynamo and torch.compile Breakage on a Mamba-3 Hybrid, and The Torch 2.12 journey: scan-kernel wins only matter if the compile lane stays stable and the backward path stays honest on the same shapes. The serving-side consequence also shows up in KV cache and paged attention, where long-context cache pressure is one reason the hybrid can justify its complexity.

FAQ

Frequently asked questions

Where did Mamba 3 clearly beat attention in this stack?+
At the long-context regime the training corpus actually uses, where O(N) scan cost and smaller cache growth matter more than the small-context floor of dense attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.. Mamba 3 + Transformers: Why MegaCpp uses a hybrid stack for C++ is the architecture-facing companion to that performance claim.
Why was forward-only P1 treated as a wash?+
Because the measured throughput delta was inside noise once the full step was counted. A local forward improvement is not enough if it barely moves end-to-end training time. The runtime optimization receipt is the smaller checked-in reminder that MegaCpp keeps runtime fixes only when the full receipt moves.
Why is 4K context not the acceptance regime for this article?+
Because the whole hybrid claim here is about the long-context lane where scan cost, cache growth, and register-heavy kernels actually matter. At 4K tokens, the quadratic penalty of dense attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand. is still cheap enough that the hybrid does not get to cash in its main advantage yet.
Do we know the exact crossover between 4K and 16K?+
Not from this public-safe receipt. The honest statement is narrower: the 4K ablation did not justify the hybrid by itself, while the 16K-to-64K lane is where scan cost and cache growth made the case. The missing follow-up is a dedicated sweep between those points, not a retroactive threshold inferred from two endpoints. The companion architecture framing stays in Mamba 3 + Transformers: Why MegaCpp uses a hybrid stack for C++.
What is the next practical optimization after this article?+
The next realistic candidate is still the PsiV cache and the fully measured TMA-enabled backward path, because the real cost concentration is in backward kernels rather than in headline forward-only tweaks. The local proof split is intentional: Mamba3 PsiV cache scaffold keeps the gate honest, while near-copy 3D-to-2D shared-memory sample keeps the TMA-legality repair visible.
What did the 3D-to-2D TMA fix actually change?+
Only the shared-memory view and the index mapping, not the algebra. The checked-in checked-in sample flattens q_shared from (chunk, R, N) to (chunk * R, N) and qk_dot_shared from (chunk, R, R) to (chunk, R * R), then remaps the row and offset arithmetic so the output shape stays the same. The win was a TMA-compatible layout, not a different Mamba update rule.
Why does tensor-parallel geometry belong in a performance article?+
Because end-to-end cost depends on whether the mixer can be split cleanly across ranks. The checked-in TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node. sample makes that explicit: local heads and groups must divide evenly, in_proj is sharded, and angle_proj stays replicated. That ownership pattern changes both communication and memory residency.
Why not replace the remaining attention blocks outright?+
Because the hybrid stack still wants attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand. for sharp retrieval-style behavior, bounded dense-context work, and compatibility with the serving surfaces that already exist. The measured Mamba win is real, but it is not universal enough to erase the mixed-block design.
How does this performance article relate to M2RNN, Engram, or Gated DeltaNet?+
Those are adjacent model features, not the main kernel lane measured here. M2RNN and Engram cover the extra memory tiers that ride beside the hybrid backbone. Gated DeltaNet, hyper-connections, and DynamicTanh cover the alternative mixer and residual-side experiments. This article is narrower: the Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and… scan lane itself.
Where is the smallest local receipt for the env-gated optimization story?+
Read Mamba3 PsiV cache scaffold for the explicit cache gate and refusal contract, then runtime optimization receipts sample for the smaller "keep the change only if the full receipt moves" rule that this article applies to P1 and P2 alike.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

Paged attention

The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

Topic hubs