MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 15 min readDavid Gornshtein
Upstream
TileLang
Megatron
Kernels
Open Source

Upstream PRs we wrote for TileLang and Megatron-Core

A focused walk-through of the TileLang and Megatron-Core upstream PRs we have prepared: the bug, the fix, and what each contribution unblocks in our training stack.

MegaCpp
Focused on applied C++ model engineering
Article Preview
Upstream PRs we wrote for TileLang and Megatron-Core
Published 15 min readDavid Gornshtein

This is the focused tour of the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey and Megatron-CoreQuick term guideMegatron CoreThe NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.GroundingAbout: Porting to Megatron friction About: Nemotron-style recipe as pure Megatron CLI Example: Mamba3 TP mixer sample contributions in this batch. Six packs, two repos, one common shape: every one of them is a bug that crashed or silently corrupted a real training run before it became a public reproducer and write-up. Each section answers what broke, what the fix is, and what it unblocks. The adjacent runtime-heavy lane for Mamba-3, Sparse-MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: public-safe MLA integration patterns Reference: fused MLA on NVIDIA, Liger, and DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample sits in Upstream PRs we wrote for Mamba-3, Sparse-MLA, Liger and DSA, while the broader filing cadence and pack taxonomy live in Upstream PRs: how a small training shop ends up patching everyone else's libraries.

Why MegaCpp cares about this

TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey (the kernel-DSL surface we use for every Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode backward kernel and for SparseMLA) and Megatron-CoreQuick term guideMegatron CoreThe NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.GroundingAbout: Porting to Megatron friction About: Nemotron-style recipe as pure Megatron CLI Example: Mamba3 TP mixer sample (parallelism, dispatch, and module plumbing) are two of our most aggressive dependencies. Both move quickly. TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey in particular ships maintainer PRs faster than we can keep up; the rebase tax on local patch stack compounds, which is why upstreaming is the cheaper path in the long run. TileLang TMA and H200 reality is the runtime-side picture of why those compiler details are not optional cleanup.

That maintenance pressure is the operational reason How we keep a patch lane exists as a separate article: upstreaming is cheaper than carrying a growing local diff stack indefinitely.

The packs in this set break into three sub-shapes. Compiler/lowering bugs that block a path the kernel author intended to use (TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey TMA bulk-copy on rank-3 smem; FloorMod divide-by-zero in LayoutInference). Dispatcher/integration gaps in MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split that crash or silently slow training on normal hardware (Hopper FLCE ValueError; the Mamba LinearCrossEntropyModule rebase-miss; Float16Module blanket bf16 casting Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode's fp32 contract). And kernel-level memory rewrites that do not change the math but cut the working set by an order of magnitude (DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample _compute_index_scores).

In practice, we validate each fix against a real training workload before we turn it into a clean upstream reproducer. That keeps the public submission grounded in behavior we have already seen on real training shapes. The checked-in MegaCpp model wiring examples catalog is the public-safe entry point for the concrete receipts discussed below, and How we keep a patch lane is the local companion when you want the evidence rules behind those receipts. If you want the bridge-level reason these six packs exist, read Porting To Megatron-Core Is Harder Than It Looks first: that post names the adapter seams, while this one isolates the subset that turned into upstream TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey or MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split work.

If these pack terms are new

  • A pack is one upstream-ready bug bundle: reproducer, write-up, validation, and current target-repo status.
  • TMA means Hopper's Tensor Memory Accelerator bulk-copy path; in this lane it matters because compiler legality can decide whether the kernel even compiles.
  • A regression pack is a bug bundle kept after the fix merged upstream so a revert becomes a clean test failure instead of another vague runtime crash.
  • A rebase-miss is the specific failure mode where a later merge restores an older file snapshot and silently drops an earlier fix.
  • MTP means multi-token prediction: a model-head path that predicts more than one future token and therefore touches the same post-processing and loss wiring that linear-CE fusion needs.

How we validate the work

The first version of a fix usually lands in the codebase that exposed the bug. From there we reduce it to a smaller public reproducer and a clean upstream-facing patch. The mechanics are intentionally simple: isolate the failing condition, confirm the fix on the real workload, then restate it in a form maintainers can run and review quickly.

Every pack has a self-contained reproducer. It depends only on the target library at the referenced upstream revision plus a small dependency set. It prints a clear sentinel when the bug fires and another when the fix is validated. The executable result is the source of truth; the write-up is there to explain it. In this lane the shortest checked-in anchors are TileLang TMA bulk-copy near-copy, Mamba3 MIMO 3D-to-2D smem near-copy, Megatron Hopper FLCE near-copy, and Mamba linear-CE parity near-copy.

Pack 08 - TileLang TMA bulk-copy on rank-3 smem

The bug, in one line: LowerBulkCopy used to assert shared_layout->InputDim() == 2 and refused to lower any rank-3 (or higher) shared-memory descriptor, which crashed the Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode MIMO backward kernels the moment we flipped TMA lowering on for Hopper.

The fix is already upstream. TileLang PR #746 replaces the hard ICHECK(InputDim()==2) with a LOG(WARNING) plus a fallback to LowerNormalCopy. Rank-3+ smem layouts now compile; TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey prints a warning and emits a non-bulk cp.async instead of aborting. Follow-ups #761 (1D TMA support) and #2005 (1D TMA regression test) cover the same area: #761 adds the 1D TMA transfer path, contiguous-condition checks, and sized store interface, while #2005 keeps the no-warp-specialization 1D TMA load/store case from silently regressing. The deeper lowering background for why this guard mattered is TileLang TMA bulk copy 3D shared-memory deep dive, and the checked-in public proof surface is TileLang TMA bulk-copy near-copy.

The pack still exists as regression coverage. Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode MIMO backward kernels structurally use three rank-3 smem descriptors (qk_dot_shared is [chunk_size, R, R]; Q/K loads land in [chunk_size, R, N]). They rely on the warn-and-fallback to compile under TL_DISABLE_TMA_LOWER=False. A silent revert would crash the build with a generic TVM error deep in a kernel log instead of a clean test signal, so pack 08 now serves as a CI tripwire: three configurations (3D smem bf16, a 4D variant, and the production-shaped MIMO qk_dot_shared layout), each validated on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 SXM and on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story (sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough). For readers crossing from compiler regression coverage into GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story naming, GB10 stack parity for MegaCpp keeps the consumer sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough target separate from datacenter sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough and tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough claims. It buys us the ability to keep TL_DISABLE_TMA_LOWER=False on Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode MIMO bwd, which is the precondition for the rest of the TMA pipelining work.

Pack 13 - TileLang FloorMod divide-by-zero in LayoutInference

The bug. Inside a T.Parallel(...) loop body, indexing of the form csr % R / csr // R (with R a Python-int compile-time constant closed over by an outer @tilelang.jit function) crashes LayoutInference with Check failed: pb->value != 0 (0 vs. 0) : Divide by zero. The crash fires in tvm::arith::TryConstFold<tvm::tir::FloorMod> while normalizing the iter-map for the parallel loop's output buffer fragment. The divisor has constant-folded to 0 even though the real Python value is R = 4; downstream substitution would resolve it, so the right behavior is to defer rather than abort.

The trigger is the kind of indexing pattern that appears after the rank-3 shared-memory flatten used in Mamba3 MIMO 3D-to-2D shared-memory deep dive: a T.Parallel(fused_chunk_size, N) loop adding a per-R bias via q_bias_frag[csr % R, n], followed by a loop that decomposes csr into csr // R and csr % R. With TL_DISABLE_TMA_LOWER=True, the kernel compiles; with it False, normalization fires on the transient zero and aborts. The crash is easiest to interpret next to Mamba3 MIMO 3D-to-2D smem near-copy, because that example keeps the flattened descriptor contract visible. The crash is host-side, so no CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 device is required to reproduce it.

The fix is upstream territory. The cleanest patch is in TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey's TVM constant-folding path: have TryConstFold<FloorMod> return NullOpt when the divisor transiently folds to zero instead of asserting. The architecturally cleaner fix is in the iter-map normalization pass: preserve the symbolic FloorMod until the divisor is pinned to a non-zero PrimExpr. Either way the right authors are the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey/TVM maintainers; pack 13 is a bug report with a reproducer, not a PR.

The obvious algebraic rewrite (csr - (csr // R) * R in place of csr % R) does not work: RewriteSimplifier canonicalizes the subtraction form back to FloorMod before LayoutInference runs (verified against TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey 0.1.8). The current temporary mitigation is to keep TL_DISABLE_TMA_LOWER=True and TL_DISABLE_WARP_SPECIALIZED=True on every affected backward kernel; this costs roughly 20% end-to-end throughput on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 vs the TMA-on projection but keeps compilation alive. We flip both flags back once the upstream fix lands.

Pack 10 - Megatron Hopper FLCE: land #3345 and add a non-Blackwell fallback

The bug. Megatron-CoreQuick term guideMegatron CoreThe NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.GroundingAbout: Porting to Megatron friction About: Nemotron-style recipe as pure Megatron CLI Example: Mamba3 TP mixer sample's fused linear cross-entropy dispatcher on dev is Blackwell-only. Its platform probe raises ValueError(f"Unsupported architecture: {cc[0]}") for any device whose major capability is not 10. That is every H100/H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200, every GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story, every A100, every Ada L40. The first call with --cross-entropy-loss-fusion --cross-entropy-fusion-impl linear on those devices crashes the forward pass.

The fix has two tiers. Tier A: land the open Megatron-LM PR #3345 (feat/hopper-kernels), which adds the Hopper (cc=9) entry point with CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: TileLang TMA bulk-copy sample kernels and rewires the dispatcher so the cc=10 and cc=9 branches share a gpu_entry symbol. The public PR also carries an explicit compatibility fix for nvidia-cute-dsl >= 4.4, so the public reproducer has to stamp that package boundary rather than only the MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split commit. Tier B: add a soft fallback for every other capability; when no native kernel exists, emit a RuntimeWarning and fall back to the unfused vocab-parallel CE implementation. Tier B is ~40 lines that wrap fused_vocab_parallel_cross_entropy with the same forward/backward signatures as the Blackwell/Hopper entries. The checked-in proof surface for the Tier A/Tier B boundary is Megatron Hopper FLCE near-copy. For a first-touch reader, the important distinction is simple: this is a dispatcher and packaging bug, not a math bug in the loss itself. The CE formula is fine; the wrong architecture gate prevents the good path from being selected.

Why #3345 mattered enough to validate early. The fusion is the difference between fitting a long-context config on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 and OOM-ing. The non-fused path materializes a [s, b, V] logits tensor and a same-shape grad_logits buffer, roughly 7 GiB of avoidable peak per microbatch at our hidden=4096, vocab=151552 shape. We cherry-picked #3345 onto our dev pin, ran a full training step on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200, and confirmed the Hopper kernels compile and bit-match F.cross_entropy to within bf16 tolerance. What remains is a clean H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 validation run against an unpatched tree.

The operator-level fallout from this path is easier to see in Megatron FLCE on Hopper, which focuses on the runtime and memory consequences rather than the filing sequence.

Pack 11 - Megatron Mamba LinearCrossEntropyModule rebase-miss

Megatron-LM PR #3226 wired LinearCrossEntropyModule into both the GPT model implementation and the Mamba model implementation on MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split's dev branch. PR #3207, "Reapply 'Add MTP support for hybrid models'", merged shortly afterward but rebased from a pre-#3226 snapshot of the Mamba model implementation. The replay clobbered the Mamba side of #3226: the output layer reverted to plain tensor_parallel.ColumnParallelLinear, the self.fuse_linear_cross_entropy flag was dropped, and post_process was changed to post_process or self.mtp_process in a way that affected both the main decoder head and the MTP head.

Hybrid Mamba models on dev then cannot take the linear-CE fusion path even when the flag is set. GPT models keep the fusion; Mamba models silently fall through to the materialize-[s, b, V]-logits path - the same ~7 GiB of avoidable peak per microbatch as pack 10.

The fix is the diff PR #3226 originally landed: re-import LinearCrossEntropyModule, restore self.fuse_linear_cross_entropy in __init__, swap ColumnParallelLinear(...) back to LinearCrossEntropyModule(...) for output_layer, and route forward() through the fused output layer when the flag is set, mirroring the GPT model path. One file, exactly the diff that was overwritten.

The current workaround is a runtime class-swap in the public MegaCpp linear-CE shim, behind an environment toggle, installed at import time. It has been running in production with the Liger CE kernel routed via the same installer. The reason to file upstream anyway is that runtime monkey-patching is a regression magnet; restoring the diff in-tree makes the wiring visible to the GPT linear-CE functional tests and removes the runtime install. The clean reader-facing anchors here are Megatron Hopper FLCE near-copy, Mamba linear-CE parity near-copy, and author Mamba3 spec near-copy.

The same pattern shows up again in Author Mamba3 spec inside Megatron: once MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split wiring drifts from the model's real contract, the cleanup has to restore the architecture boundary before any performance claim means much.

Pack 12 - DSA _compute_index_scores memory

The DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample indexer's score function uses einsum("sbhd,skd->sbhsk", q.float(), k.float()) to build an [sq, b, h, sk] fp32 intermediate, ReLUs it, multiplies by per-head weights[..., None], and reduces over the head axis to produce [b, sq, sk]. The intermediate is sq * b * h * sk * 4 bytes; at sq=sk=4096, b=8, h=32 that is 16 GiB of fp32 working set, allocated, consumed once, discarded.

The fix accumulates directly into the [b, sq, sk] output buffer one head at a time via torch.bmm. Materialize k_bds = k.float().permute(1, 2, 0).contiguous() once ([b, d, sk], ~4 MiB), then loop over heads computing logits_h = bmm(q_h, k_bds), applying relu, multiplying by per-head weights, and add_-ing into the fp32 accumulator. FLOP count and arithmetic intensity are identical; the per-head logits tile becomes the largest live tensor instead of the full [sq, b, h, sk] block. Working set drops from ~16 GiB to ~268 MiB, ~60x. Numerical drift versus the einsum is max |a-b| / max(|a|, eps) = 1.9e-7 at production shape, far below any downstream topk stability threshold; gradient parity is verified via torch.autograd.gradcheck at small shape.

There is an open Megatron-LM PR #4039, "Fused Indexer Loss Kernel", that addresses the same memory problem with a split-K Triton kernel: about 60% memory saving with a 32% perf hit and TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding support explicitly deferred, according to the PR body. Our per-head streaming accumulator gets about 89% memory saving with no perf hit and no TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding deferral. Opening a competing PR would stall both, so pack 12 ships as a comment on #4039 framed as a complementary approach for the bf16 fallback paths. The local memory-pressure story behind why that difference matters is DSA indexer memory fix, and the checked-in shape receipt is DSA indexer memory near-copy.

Pack 16 - Megatron Float16Module silently casts Mamba3 fp32-contract params

Float16Module.__init__ in the MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks What Megatron Can and Cannot Split module wrapper walks every parameter of the wrapped module and casts to bf16 (or fp16) indiscriminately. The wrapper legitimately owns activation, input, and output precision at the pipeline boundary; the bug is that it also rewrites parameters whose producer contract says they must remain fp32, with no opt-out for the wrapped module. Upstream Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode deliberately keeps several parameters in fp32 because the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey scan kernel's dispatch signature requires fp32: Q_BIAS, K_BIAS (which is C_bias/B_bias), D, dt_bias, and on MIMO paths mimo_x_bias, mimo_z_bias, mimo_o_bias. Float16Module does not know about that contract and silently overrides it. The checked-in public counterpart for that ownership boundary is author Mamba3 spec near-copy.

Mamba3.forward computes DT = F.softplus(dd_dt + self.dt_bias). With dt_bias cast to bf16, DT is bf16 and goes into mamba_mimo_fwd_kernel, whose signature declares DT: T.Tensor([B, H, S], T.float32). On stacks where the kernel's argument validation fires first you get a clean RuntimeError: kernel mamba_mimo_fwd_kernel input DA_CS dtype expected float32, but got bfloat16. On stacks where validation order is different you get silent garbage and grad_norm=NaN on iter 1. D and the rest fail the same way on varlen and MIMO paths.

An earlier per-forward pre-hook re-cast the fp32 params on every step, which nsys showed cost ~305 ms/iter in .data.float() copies. The current shim patches Float16Module.__init__ to call the original initializer and then walk submodules of type Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode, restoring each fp32-contract parameter to fp32 exactly once. It installs alongside the MIMO __post_init__ patch in the same file and eliminates the per-step copies.

Pack 16 is filed against NVIDIA/Megatron-LM as a generic Float16Module contract bug, not a Mamba-only workaround request. The proposal is for Float16Module to honor a per-module fp32 contract (an opt-out attribute the wrapped module sets) rather than silently rewriting dtypes. The reproducer is shared with pack 05 and split by stage: the bf16 stage triggers the cast symptom (pack 16); the gqa_unpatched/gqa_patched stages trigger the Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode-side GQA bug (pack 05).

How MegaCpp currently uses these fixes

Three of the six packs currently land as runtime patches in the MegaCpp codebase and stay there until upstream catches up. The Mamba LinearCrossEntropyModule reroute (pack 11) is a runtime class-swap behind an environment toggle; the Float16Module Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode fp32 restore (pack 16) is a one-shot init patch in the shim file; the Hopper FLCE bypass (pack 10) is a dispatcher probe that swaps in the Liger or Apple CCE entry when native fusion raises. Two more (TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey FloorMod in pack 13 and the rank-3 smem regression guard in pack 08) cost throughput rather than correctness; we keep TL_DISABLE_TMA_LOWER=True on the affected backward kernels until upstream lands the FloorMod fix, then flip it. The DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample _compute_index_scores rewrite (pack 12) lands as a direct in-tree patch with no runtime probe. If you want the public-safe split between "recipe-native" and "runtime patch surface" rather than just the prose summary, NAM56R runtime patch surface sample is the compact receipt.

The throughput math is unsentimental. Hopper FLCE (pack 10) plus the DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample indexer rewrite (pack 12) together unblock the long-context microbatch sizes we want; without them we OOM the H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 budget at the shapes the model actually ships at. The Mamba CE wiring (pack 11) closes the asymmetry between GPT and Mamba models on the same fusion flag. The Float16Module fix (pack 16) eliminates grad_norm=NaN on iter 1 of any Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode training run using the upstream wrapper. The TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey fixes (08, 13) are the ones we cannot land ourselves; they gate a ~20% Hopper backward win we are deliberately leaving on the table until the upstream FloorMod fix lands.

Publishing checklist

  • Every pack has a self-contained reproducer pinned to a named upstream revision plus a short dependency list, and it prints a clear sentinel on failure and on the validated path.
  • Reproducers stamp host capability and dependency versions in their first lines of output.
  • Packs that overlap with an open upstream PR are filed as comments on that PR, never as competing PRs.
  • Packs that have already shipped fixes upstream are repurposed as local regression tripwires, not re-filed.
  • Any local workaround points back to the public upstream thread it is waiting on.
  • Formatting follows the target project's normal contribution rules before a diff is attached.
  • The "post it" decision is human-typed, not automated.
  • Reproducers and pack bodies contain no host-identifying labels, non-public storage URIs, non-public branch codes, or employee names other than the named authors.
  • A successful rerun of the public reproducer is the source of truth for "ready", not a prose checklist.
  • Hopper FLCE stays described as a lane-specific validation until the upstream Hopper path and the unfused fallback are both easy to rerun from the public reproducer.
FAQ

Frequently asked questions

I need to keep an upstream fix from regressing locally. When is a regression pack still worth carrying?+
Because a merged fix can still regress or be bypassed by a later refactor. Keeping a small reproducible pack turns that risk into a visible test failure instead of a vague runtime crash months later.
I need to decide between a PR, an issue, or a comment on an existing thread. What makes that call?+
Because sometimes the right next move belongs to maintainers, overlaps an active upstream thread, or needs design agreement before code. For those cases, a precise reproducer is more useful than a competing patch.
Does the TensorRT-LLM DSA gather/scatter work change the Megatron pack 12 decision?+
No. That closed, unmerged thread is inference-side gather/scatter and K-cache work, while pack 12 is about avoiding a training-side score slab in _compute_index_scores. The TensorRT-LLM PR is useful comparison material, but it is not a replacement proof surface for the MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and… memory fix. The local background stays in DSA indexer memory fix, and the checked-in proof surface stays DSA indexer memory near-copy.
I need to explain the Hopper FLCE lane without overstating support. Why split it into two tiers?+
Because the first tier is the real Hopper kernel path and the second tier is the honest fallback when no native kernel exists for a device. Keeping those two responsibilities separate makes the runtime claim easier to verify and easier to review upstream.
Why does the Hopper FLCE reproducer need to print the CuTe DSL package version?+
Because the open Hopper PR has already needed a compatibility fix at that layer. If the reproducer only names a MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and… revision, a later failure can look like a kernel correctness problem when it is really a dependency-boundary mismatch.
How do I avoid mistaking a TileLang fallback for real TMA coverage?+
Treat "compiled" as only the first gate. For TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments. TMA work, the reproducer should also print whether the lowered module still contains TMA copies, using the same kind of compiler-phase check TileLang exposes through module_has_tma(). If a rank-3 smem case compiles because it intentionally fell back to cp.async, the pack should say that explicitly rather than counting it as a bulk-copy pass.
How should precision workarounds stay removable instead of becoming permanent lore?+
Tie every workaround to the upstream issue or PR that would delete it, and make the reproducer fail when the wrapped module finally exposes an fp32 opt-out contract. If the comment only says "Mamba needs fp32", the shim will survive forever; if it says "remove this once Float16Module honors the module-owned fp32 parameter contract", the deletion path is reviewable.
Which packs here are already merged upstream, and which ones still depend on maintainer action?+
Pack 08 is the cleanest merged case: the fix is upstream and the local pack now exists mostly as regression coverage. Pack 13 is the opposite shape: it is a compiler-bug reproducer that still needs maintainer-side resolution rather than a "ready to merge" local patch. Pack 12 intentionally stays on the existing indexer thread as a comment lane instead of competing with it. Packs 10, 11, and 16 still matter locally because the runtime boundary is already patched here, but the upstream goal is to remove those local shims by restoring the MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and… contract in-tree.
I want to publish one of these packs upstream. What evidence has to exist first?+
Real-lane validation, a minimal reproducer, and a body that avoids internal-only references. If any of those is missing, the pack is still local preparation, not a public-ready upstream contribution. In concrete terms, a first reader should be able to jump from the prose here to one checked-in proof surface such as Megatron Hopper FLCE near-copy or Mamba linear-CE parity near-copy and see the contract directly. How we keep a patch lane is the companion when you need the repo-side source policy behind that standard.
I want to audit the checked-in proof surfaces behind these six packs. Where should I start?+
I am debugging a TileLang compiler bug versus a Megatron integration bug. Which local examples should I open first?+
For TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments. compiler and lowering bugs, start with TileLang TMA bulk-copy near-copy and Mamba3 MIMO 3D-to-2D smem near-copy. For MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and… dispatcher and integration bugs, start with Megatron Hopper FLCE near-copy, Mamba linear-CE parity near-copy, and author Mamba3 spec near-copy. Porting To Megatron-Core Is Harder Than It Looks is the bridge-level explanation connecting both sets.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

CuTe DSL

The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

tcgen05

The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.

WGMMA

Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.