MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 15 min readDavid Gornshtein
Upstream
Mamba3
Sparse Mla
Liger
DSA
Kernels

Upstream PRs we wrote for Mamba-3, Sparse-MLA, Liger and DSA

A focused walk-through of the Mamba-3, Sparse-MLA, Liger-Kernel and DSA upstream PRs we have prepared: the bug, the fix, and where each one currently sits.

MegaCpp
Focused on applied C++ model engineering
Article Preview
Upstream PRs we wrote for Mamba-3, Sparse-MLA, Liger and DSA
Published 15 min readDavid Gornshtein

This is the focused tour of the Mamba-3 and Sparse-MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries side of the current upstream batch, plus two adjacent packs (Liger FLCE backward and DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample CUDA-graphQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample safety) that share the same shape. Eight packs across five repos: state-spaces/mamba, tile-ai/tilelang, NVIDIA/TransformerEngine, linkedin/Liger-Kernel, and NVIDIA/Megatron-LM. Each section answers what broke, what the fix is, and where the pack sits today. The sibling compiler-and-Megatron lane is covered in Upstream PRs we wrote for TileLang and Megatron-Core.

The checked-in MegaCpp model wiring examples page is the local receipt index for this upstream chain. The repo links in References are the public upstream half.

Why MegaCpp cares about this

Two architectural choices set the bar. Our hybrid presets put Mamba-3 SSM blocks in series with attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, so we live inside state-spaces/mamba and its TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample backward kernels every step. We run Sparse-MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries in the absorbed configuration on Hopper because the working-set savings are decisive at long context, so we live inside the tilelang_sparse_mla_* kernel family on the same step.

Both families are research code drafted at one configuration and not yet generalized. The Mamba-3 MIMO backward kernel covers G == 1 (MHA) and G == H (per-head) but not the GQA middle. The Sparse-MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample kernels hardcode the DeepSeek V3.2 dimensions (576 / 512) and bail out elsewhere. The Mamba-3 SISO backward computes the same V @ dO^T chunk-GEMM three times. The Sparse-MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries backward stores P/dP shared buffers in bf16 where they need fp32. None of these crash a regression test the maintainer wrote; all of them crash or corrupt our training run.

The DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample and Liger packs are here for the same reason. DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample's CPU-syncing validations (torch.equal, tensor.any()) crash CUDA graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample, gating one of our highest-impact perf wins. Liger's LigerFusedLinearCrossEntropyFunction(reduction="none") returns a sensible-shaped forward tensor but corrupts gradients on any non-uniform grad_output because the kernel reads grad_output[0] and broadcasts it. Both bugs hide until you flip the flag you want to flip.

For the runtime-side view rather than the upstream-pack view, read Mamba3 kernel journey, Sparse MLA dimension generalization, Sparse MLA FP8 dispatch, and DSA and CUDA graph safety.

How we validate the work

The first version of these fixes lands close to the workload that exposed the bug, then gets reduced into a public reproducer and an upstream-facing patch. In a few cases we also keep a temporary application-level mitigation while the upstream path is still open. Each pack has a self-contained reproducer with sentinels such as BUG_REPRODUCED, FIX_VALIDATED, and MEMORY_SAVE_VERIFIED; the executable outcome is the source of truth.

Pack 02 - SparseMLA dimension generalization

The fused SparseMLA TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample kernels are hardcoded for DeepSeek V3.2 dimensions. The dispatcher admits only query.size(-1) == 576 and v_channels == 512; the forward path asserts dim_plus_tail_dim == 576; the backward path hardcodes D = 512; both also assert dim == next_power_of_2(dim). Any other MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries shape (for example d_total=128, v_channels=64) falls through to the unfused path that materializes the full [B, H, S, S] attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns block and can OOM at long context.

The fix is four small edits: drop the legacy dimension guard in the absorbed-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns dispatcher; plumb d_v through SparseMLA.apply() so the autograd Function knows the value channel count; relax the kernel-level assertions to dim % 16 == 0 (the real warp-op constraint); read D from o.shape[-1] (or the new d_v argument) in the backward instead of hardcoding 512. The kernel is already parameterized over dimensions; nothing about the math changes.

We validated this first in the MegaCpp codebase because without it the fused path is dead code for the hybrid presets discussed here. The pack is explicit about validation scope: the reproducer validates dimension plumbing (compile and launch at d_total=128, v_channels=64), not end-to-end convergence parity or fp64 gradcheck for the generalized kernel. The checked-in SparseMLA dimension generalization near-copy is the compact public proof surface for that scope. The precision fix in pack 14 is intentionally separate so the dimension patch can land independently.

Pack 03 - SparseMLA FP8 dispatch hazard

With FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper training, Transformer EngineQuick term guideTransformer EngineNVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.GroundingAbout: Transformer Engine on H200 and Blackwell-class GPUs: the bridge we use Reference: NVIDIA Transformer Engine documentation Reference: Transformer Engine FP8 and FP4 primer wraps tensors in QuantizedTensor (Float8Tensor). The wrapper lies in several ways: .dtype returns the logical bf16, hiding FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper storage; .data_ptr() returns NULL (real data is at ._data.data_ptr()); .to(), .contiguous(), .reshape() do not unwrap; only .dequantize(), .float(), .permute(), .unsqueeze() do. Handing those wrappers to the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample SparseMLA kernel gives the kernel NULL data pointers.

TE caveat: newer Transformer EngineQuick term guideTransformer EngineNVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.GroundingAbout: Transformer Engine on H200 and Blackwell-class GPUs: the bridge we use Reference: NVIDIA Transformer Engine documentation Reference: Transformer Engine FP8 and FP4 primer wrapper paths can turn the hard pointer failure into an explicit dense fallback. The RuntimeError: data pointer expected non-NULL may stop being the first symptom, but the operator can still lose the intended FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper storage path if dispatch reaches a bf16 dequantized view instead of an FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper-aware kernel. The underlying hazard remains. The dispatch fix is about correctness of intent, not only crash prevention.

The fix detects QuantizedTensor inputs in _fused_sparse_mla_absorbed() and dispatches to an FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper-aware variant (SparseMLA_FP8 with T.float8_e4m3fn GEMMs) instead of choosing only from the logical dtype. For models without that variant, .dequantize() before the kernel call is a correctness fallback that loses the FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper storage win. The pack is filed against NVIDIA/TransformerEngine because the bug is in the wrapper's contract, not the kernel. The checked-in SparseMLA FP8 dispatch near-copy shows that boundary directly. The operational side of that wrapper boundary is the same one described in Transformer Engine bridge on NVIDIA: vendor wrappers only help when the dtype and storage contract stays explicit.

The upstream-facing tradeoff is also worth naming plainly. Once an FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper path needs extra dequantize, transpose, or requantize work around backward, a narrow kernel win can disappear at the end-to-end level while the wrapper contract gets harder to reason about. That is why this pack stays framed as an intent and storage-contract fix, not just a crash fix.

That separation from pack 02 is intentional. MLA and weight absorption explains why the absorbed SparseMLA path already carries a shape and layout contract; the FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper pack adds a second contract at the wrapper boundary, where storage, scales, and pointers have to stay truthful too. A path that silently falls back to bf16 can preserve a toy result while still invalidating the bandwidth and memory story the operator thought they enabled.

Pack 14 - SparseMLA backward precision (P/dP in accum_dtype)

In the SparseMLA backward kernel, the shared-memory buffers P_shared_cast and dP_shared_cast are allocated with dtype (bf16) before being consumed by the dKV gradient GEMM. P and dP have a wide dynamic range (exp of scaled scores, then multiplied by dO accumulations); bf16 storage loses precision in the dKV path and drifts versus an fp32-reference backward. The fix is one line per buffer: allocate with accum_dtype (fp32) instead.

Pack 14 is still waiting on a checked-in example bundle, so the body deliberately limits itself to the code-level change and the intended validation target (improved dKV accuracy against fp32/fp64 reference, no material Hopper regression). Filing a precision fix without numerical evidence wastes a maintainer's time. That is why this article links the surrounding proven packs and examples, but does not pretend pack 14 already has the same proof surface.

The evidence bar is higher here than for pack 02 because the failure is long-context weighted. A small debug shape can look fine while the dKV path only starts to drift once backward accumulates many score products. The right public proof is therefore a checked-in context-length sweep plus grad-accuracy evidence, not one short-shape compile/run receipt. Until that exists, Loss curves and divergence playbook is the better companion read than another launch log.

Pack 04 - Mamba3 SISO backward: eliminate redundant V @ dO^T

the public Mamba SISO backward DQKV kernel sample computes tl.dot(v_block, tl.trans(do_block)) (a CHUNK_SIZE x CHUNK_SIZE GEMM) three times in the inner chunk loop, each followed by the same causal-decay mask. All three produce identical results before diverging into dADT, dK, dQ. The fix computes vdot_block once, applies the mask once into vdot_masked, and reuses it for all three consumers. Net change: -25 lines, two redundant tl.dot calls and two mask applications removed per chunk.

This is honestly a code-clarity / CSE-regression-robustness PR, not a perf PR. On Triton 3.7 + H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 the compiler already CSEs the three dots automatically; measured speedup at our shapes is within timing noise (sigma about +/-2%). The value is in making the CSE explicit so that the fusion is a property of our source rather than of whichever Triton version is installed, guarding against future MLIR pass-ordering regressions, and shrinking the RECOMPUTE_MASK path.

There is also public compiler context for keeping that source-level CSE explicit. Triton issue #8695 documents a Blackwell backward compile failure with the same operand ... does not dominate this use error family. That issue does not prove our kernel trips the identical pass interaction, but it is useful cautionary context for avoiding dependence on fragile backward-kernel compiler behavior. The adjacent checked-in author Mamba3 spec near-copy is useful here because it shows the same bias toward explicit contracts over implicit compiler luck. On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 all seven gradient tensors (dQ, dK, dV, dADT, dQK_dot, dD, d_issm_state) are bitwise identical between original and patched, including under varlen.

Pack 05 - Mamba3 MIMO GQA backward (missing 1 < G < H branch)

mamba_mimo_bwd_combined only handles two reduction cases: G == 1 (MHA) and G == H (per-head). Intermediate grouping (for example ngroups=8, nheads=128, giving 16 heads per group) hits else: raise ValueError("G value of {G} is not currently supported!") or, on a slightly different code path, produces silently incorrect gradients. G=1 and G=H both pass; the bug only surfaces in the GQA middle.

The fix is a third branch for 1 < G < H where H % G == 0. Inside it, compute the bias gradients first (dq_bias, dk_bias) by summing over batch and seq before reducing dq/dk - the bias grads have shape [H, R, N] and must come from the un-reduced dq/dk. Then reshape dq/dk from [B, S, R, H, N] to [B, S, R, G, hpg, N] and sum over dim=4. Same fix in the public Mamba varlen backward kernel sample. Roughly 15 lines per kernel.

On 8xH200 the patched kernel runs ngroups=8, nheads=128, d_inner=8192, headdim=64 at 279 TFLOP/s steady-state for 20 iterations with no NaN. The reproducer's gqa_unpatched stage raises; gqa_patched produces finite grads bitwise-identical across reruns. Out of scope: the B/C layout (r, g, n) vs (g, r, n) latent bug only triggered at TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding>1 with ngroups>1, and the Megatron Float16Module cast (pack 16) which shares this reproducer but targets a different repo. The reader-facing link for the adjacent ownership seam is author Mamba3 spec near-copy.

Pack 07 - Mamba3 MIMO bwd: 3D to 2D smem refactor for TMA compatibility

TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample's LowerBulkCopy requires shared_layout->InputDim() == 2 to emit TMA copies. Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode MIMO backward kernels carry three rank-3 smem descriptors that block the TMA path: qk_dot_shared is [chunk_size, R, R]; the Q/K loads in mamba_mimo_bwd_fwd_kernel and mamba_mimo_bwd_bwd_kernel land in [chunk_size, R, N]; the QK_DOT global tensor is [B, H, S, R, R]; two register fragments (qk_dot_frag, dgamma_diag_prereduce_frag) are also rank-3.

The fix is a flatten that does not change the math. [chunk_size, R, R] becomes [chunk_size, R * R]; every [c, r1, r2] indexer becomes [c, r1 * R + r2]. The signature Q: [B, S, R, G, N] becomes Q: [B, S * R, G, N] with callers passing q.view(B, S * R, G, N) (zero-copy). Smem footprint and register pressure are unchanged. Verified on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story (sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough): all 14 gradient tensors land well under the repo's 0.10 tolerance, bit-for-bit with the pre-patch TMA-off baseline within bf16 rounding. GB10 stack parity for MegaCpp is the local decoder for why that consumer target is not the same claim as datacenter sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough or tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough support. The checked-in Mamba3 MIMO 3D-to-2D smem near-copy is the public proof surface for this legality-preserving rewrite. H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 perf is a follow-up comment, not a filing precondition.

For the reader-side kernel explanation instead of the filing packet, Mamba3 MIMO 3D-to-2D shared-memory deep dive walks through the same rewrite from the staging-layout side and pairs it with TileLang TMA and H200 reality.

Independently of TMA the flatten has merit: simpler descriptors and forward-compatibility with cp.async.bulk.tensor.3d once TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample adds rank-3 TMA. Once flattened, the kernel becomes eligible for TMA pipelining on Hopper - but the TMA-on path is gated by pack 13 (the FloorMod DBZ in LayoutInference). The two packs land in order: 07 first (refactor, no behavior change), 13 second (compiler fix upstream), TL_DISABLE_TMA_LOWER=True workaround off third.

Pack 09 - Liger FLCE reduction="none" backward silently corrupts gradients

LigerFusedLinearCrossEntropyFunction.apply(..., reduction="none") returns a [BT] forward loss that looks reasonable, but the saved grad_input and grad_weight are scaled in backward by element_mul_kernel, which assumes grad_output is a scalar and reads only grad_output[0]. Any non-uniform per-token grad_output (loss-mask weighting, document-boundary masking, per-token scaling) silently produces the wrong gradient for every row except the first.

loss.sum().backward() gives grad_output = [1, 1, ...]; reading the first element returns 1.0 and the math coincidentally matches reduction="sum" - silent pass. (loss * loss_mask).sum().backward() gives grad_output = loss_mask; the scalar read returns loss_mask[0] and scales every row by that one value, with measured max|delta grad_hidden| = 4.7e-2 versus eager PyTorch (bf16 noise floor is 5e-3). Downstream: a Megatron LM-head integration with loss-mask before reduction reports grad_norm = NaN on iter 1 and crashes with CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 illegal memory access on iter 2.

The fix replaces the scalar element_mul_kernel with a row-broadcast kernel when grad_output is a tensor: grad_input.mul_(grad_output.unsqueeze(-1).to(grad_input.dtype)) for the input side, and either redo the chunked MM in backward using saved per-chunk grad_logits or recompute them. Cost is one extra chunked matmul in backward; forward memory footprint is unchanged. The pack also asks for a docstring note that reduction="none" was forward-only on the previous implementation. The checked-in Liger FLCE reduction-none near-copy shows the broken-vs-safe contract directly.

For the operator-side explanation instead of the upstream-pack summary, Liger FLCE reduction none is the companion piece that focuses on why this gradient contract matters in normal training code.

Upstream context: Liger-Kernel issue #968 is now closed, but the row-broadcast backward fix described here did not land there. Draft PR #1126 remains open and currently chooses the fail-closed path of raising on reduction='none' backward. Open PR #1182 is narrower: it makes reduction kwarg handling explicit at the wrapper boundary, but it does not land the unreduced backward fix. Our pack carries the working row-broadcast fix as a comment on #1126. Our temporary application-level mitigation is to call Liger with reduction="mean", rescale by n_valid, and broadcast back. That is exact when the caller's loss_maskQuick term guideloss_maskThe per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.GroundingAbout: document masking and curriculum Example: packed rows schema sample Example: FIM long-context metadata sample equals (labels != ignore_index) and a uniform-loss approximation otherwise.

Pack 01 - DSA CUDA-graph safety

The DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns helper contains several CPU-syncing ops that are fine in eager mode and fatal under CUDA graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample. torch.equal(finite, expected), torch.equal(key_positions, ...), and torch.equal(mask[bi], ref_mask) validations all call .item() internally and force a cudaStreamSynchronize. _scatter_topk_into_index_mask uses if torch.any(idx_chunk < 0): if valid_topk.any(): ... - two scalar reductions with branching. All trigger cudaErrorStreamCaptureUnsupported under --cuda-graph-impl transformer_engine.

The fix gates torch.equal() validations on torch.cuda.is_current_stream_capturing() so they run in eager mode but are bypassed under capture, and rewrites the branchy scatter into a branchless clamp / scatter / fixup using any(dim=-1) (last-dim, not scalar; no CPU sync). Verified on 8xH200 with attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, Mamba, and MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack scopes captured: training completes cleanly and loss convergence is identical to the non-CG baseline within noise. The checked-in DSA CUDA-graph safety near-copy is the reader-facing proof surface for that contract. Pack 01 ships as an issue first because the relevant code is already active upstream and the maintainer can either land it directly or invite a PR.

The practical runtime version of pack 01 is DSA and CUDA graph safety, which is the better follow-on if you care more about why capture broke than how the issue gets filed upstream.

How MegaCpp currently uses these fixes

The Mamba-3 packs (04, 05, 07) are scoped as PRs against state-spaces/mamba in that order, to keep reviewer fatigue down on the same maintainer team. The Sparse-MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries packs split across three repos: 02 to tile-ai/tilelang, 03 to NVIDIA/TransformerEngine (the wrapper, not the kernel), and 14 held until it has a complete example bundle. The Liger pack (09) is scoped as a comment on draft PR #1126. The DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample pack (01) is scoped as an issue against NVIDIA/Megatron-LM. Temporary mitigations stay documented until the corresponding upstream path lands. For first-touch readers, that "temporary mitigation" phrase means a small runtime patch layer that keeps training alive while upstream review happens; NAM56R runtime patch surface sample is the compact checked-in decoder for that boundary.

What is still pending

This lane is no longer one uniform "not filed yet" bucket. The adjacent public context already matters: TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample PR #746 is merged, Liger issue #968 is closed, draft Liger PR #1126 remains open, Liger PR #1182 remains open, and Megatron-LM PR #4039 remains open. What is still pending on our side is the filing wave itself: the Mamba bundle (04, 05, 07), the SparseMLA pair (02, 03), and the DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample write-up still need their final public posts, while pack 14 stays held until it has a checked-in example bundle. The goal is to attach each fix to the right live thread rather than open redundant parallel ones.

Publishing checklist

FAQ

Frequently asked questions

Why split this work into many small packs instead of one large upstream drop?+
Because each bug belongs to a different repo, maintainer set, and validation scope. Narrow packs are easier to review, easier to reproduce, and much less likely to hide unrelated failures behind one giant patch.
What counts as validation for a pack?+
A reproducer with a clear sentinel plus evidence that matches the claim. Some packs only claim compile or launch parity, while others claim numerical parity or a validated runtime win. The scope has to stay explicit. In practice that means a reader can inspect one concrete example and see the claim directly: SparseMLA dimension generalization near-copy for non-DeepSeek dimensions, SparseMLA FP8 dispatch near-copy for wrapper hazards, or DSA CUDA-graph safety near-copy for graph-capture safety.
Why keep temporary mitigations locally while the upstream path is still open?+
Because the production lane still has to run. A local mitigation keeps the workload alive, but it should stay documented as temporary and tied to a clear removal condition once the upstream outcome is known. The local receipt index for those temporary seams is MegaCpp model wiring examples, especially the near-copy examples linked in this article.
Why is pack 14 still held back instead of filed now?+
Because it is a precision fix without a checked-in numerical proof bundle yet. This article keeps the scope honest on purpose: a code-level suspicion is not enough reason to spend maintainer review time until the grad-accuracy evidence is attached.
Why is the FP8 pack filed against Transformer Engine instead of TileLang?+
Because the first broken promise is at the wrapper boundary, not in the MMA math. If a tensor wrapper reports logical bf16 while hiding FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes. storage or a different data pointer, the kernel cannot make a correct dispatch choice from surface-level tensor metadata. Transformer Engine bridge on NVIDIA is the runtime-side follow-up on that boundary.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

tcgen05

The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

loss_mask

The per-token training mask that decides which positions contribute to loss after packing, FIM rearrangement, or documentation-aware masking.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Topic hubs