MegaCpp Engineering: Applied C++ model systems
By David Gornshtein, 10 min read
Tags: Upstream, Open Source, Megatron, TileLang, Mamba, Liger

Upstream PRs: how a small training shop ends up patching everyone else's libraries

A guided tour of the upstream contributions we are submitting back to the open-source training stack, the cadence we hold ourselves to, and the categories that keep showing up.

If you train a frontier-shaped model on a stack made of Megatron-LM, TileLang, Liger-Kernel, state-spaces/mamba, and several NVIDIA reference kernels, you will sooner than you expect become the de facto maintainer of all of them. Not officially, but in the sense that the bugs you hit are real, the maintainers may not have hit them yet, and your training run does not get to wait for the upstream sprint cycle. This post is the honest tour of the upstream PR pipeline we keep open: why it exists, the checklist before filing, the categories the work falls into, and the cadence we settled on.

Why this matters

The selfish reason first: every patch we carry locally is a future merge conflict. We pin active upstream revisions, and we sometimes carry a small set of open upstream patches on top. Those branches move every week, and every patch we never upstreamed has to be manually rebased by someone who still remembers why it exists. The half-life of "I'll write it up later" is roughly one model preset.

The less selfish reason is that downstream users benefit when the rest of the ecosystem does the same thing. The DSA indexer memory lane and the TileLang LowerBulkCopy warn-and-fallback illustrate the same pattern: once a failure surface is written up cleanly, it stops being only our private merge-conflict tax. This arrangement only works long-term if teams that hit a bug write it up properly.

Both examples are easier to interpret with their local companions: "DSA indexer memory fix" for the memory-footprint lane and "Mamba3 MIMO 3D to 2D shared-memory deep dive" for the lowering refactor lane.

The third reason is calibration. Writing an upstream PR forces you to separate "broken in our integration" from "broken in the library". A surprising number of bugs evaporate at that boundary. Of the sixteen packs in the current queue, several turned out to be already fixed upstream or aimed at the wrong upstream surface once we re-checked them carefully. The checklist below exists because of that.

This page works best as the map for the queue, with "Upstream PRs for Mamba-3, Sparse-MLA, Liger, and DSA" and "Upstream PRs for TileLang and Megatron-Core" carrying the narrower receipts. The bridge-level background is "Porting To Megatron-Core Is Harder Than It Looks", which shows why adapter friction turns into upstream work.

What goes into a submission pack

A pack is the bundle of artifacts we need to file one upstream contribution. Our filing checklist is the source of truth for whether a pack is ready, and it has the following shape.

For this queue, "self-contained" also means a reader can jump from the overview to a checked-in proof surface without opening any internal tree. The main anchors are DSA CUDA-graph safety sample, DSA indexer memory sample, SparseMLA FP8 dispatch near-copy, SparseMLA dimension generalization near-copy, Megatron Hopper FLCE near-copy, and Mamba3 MIMO 3D-to-2D smem near-copy. Those are the compact public-safe receipts behind the pack families summarized below.

First, a markdown template for the issue or PR body. It is written in English, avoids non-public infrastructure or branch labels, and cites only the public identifiers needed to explain the bug.

Second, a self-contained reproducer. The reproducer must run as a single documented command against a clean checkout of the target library at a named SHA, with the dependency versions pinned next to it. It must print a BUG_REPRODUCED (or equivalent) sentinel when it triggers the bug, and a FIX_VALIDATED sentinel when run against the patched code. The reproducer also stamps the host capability in its first line of output so the maintainer can tell at a glance whether their own machine should reproduce.
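A minimal sketch of the sentinel contract described above. The `check_bug` probe, the fields in the host stamp, and the exit codes are hypothetical choices made for illustration; a GPU reproducer would also stamp the device compute capability and the pinned dependency versions.

```python
import platform


def host_stamp() -> str:
    """First line of output, so a maintainer can judge whether their host should reproduce."""
    return f"HOST python={platform.python_version()} machine={platform.machine()}"


def run_reproducer(check_bug) -> int:
    """Print the host stamp, then exactly one sentinel.

    `check_bug` is a hypothetical zero-argument probe that returns True when
    the bug triggers against the library under test.
    """
    print(host_stamp())
    if check_bug():
        print("BUG_REPRODUCED")
        return 1  # assumed convention: non-zero when the bug is present
    print("FIX_VALIDATED")
    return 0


# Placeholder probe standing in for the real check against a pinned upstream SHA.
exit_code = run_reproducer(lambda: False)
```

Keeping the sentinel strings fixed is what lets a manifest or a CI job grep the output instead of parsing free-form logs.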

Third, a validation-manifest entry. This records which host last validated the reproducer, what the exit code was, and which sentinels were printed. The manifest is the only thing we trust when we ask "is this pack ready"; we do not trust the date in the markdown body, and we do not trust anyone's memory.
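One way such an entry could be modeled, as a sketch only. The field names, the `pack_id` format, and the rule that readiness requires both sentinels from a run dated today are assumptions for illustration, not our actual manifest schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ManifestEntry:
    pack_id: str         # hypothetical identifier, e.g. "pack-01"
    host: str            # host that last validated the reproducer
    validated_on: date   # date of that validation run
    exit_code: int       # exit code recorded for the run
    sentinels: tuple     # sentinels observed in the output


def is_ready(entry: ManifestEntry, today: date) -> bool:
    """Ready only if the validation is from today and both sentinels were seen."""
    fresh = entry.validated_on == today
    proved = {"BUG_REPRODUCED", "FIX_VALIDATED"} <= set(entry.sentinels)
    return fresh and proved


entry = ManifestEntry(
    pack_id="pack-01",
    host="host-a",
    validated_on=date(2024, 1, 2),
    exit_code=0,
    sentinels=("BUG_REPRODUCED", "FIX_VALIDATED"),
)
print(is_ready(entry, today=date(2024, 1, 2)))  # fresh and proved
print(is_ready(entry, today=date(2024, 1, 3)))  # stale: yesterday's run does not count
```

The readiness check deliberately ignores anything written in the markdown body; only the recorded run counts.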

Fourth, the upstream-state field. Before filing, the target project gets searched for relevant issues and pull requests, and the result is recorded as a new report, an overlap with an existing thread, or something already fixed. If it is already fixed, the pack does not get filed; it gets repurposed as regression coverage. The TileLang LowerBulkCopy 3D shared-memory case is exactly that story: by the time it was re-checked, the relevant warn-and-fallback work had already shipped, so the reproducer stayed as regression coverage instead of becoming a duplicate bug report.
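The three outcomes map cleanly onto three actions. The sketch below encodes that mapping; the enum and function names are hypothetical, and the action strings simply restate the policy in this post.

```python
from enum import Enum


class UpstreamState(Enum):
    NEW_REPORT = "new"
    OVERLAPS_THREAD = "overlap"
    ALREADY_FIXED = "fixed"


def next_action(state: UpstreamState) -> str:
    """Translate the recorded upstream state into what happens to the pack."""
    if state is UpstreamState.ALREADY_FIXED:
        return "keep the reproducer as regression coverage"
    if state is UpstreamState.OVERLAPS_THREAD:
        return "comment on the existing thread"
    return "file a new issue or PR"


for state in UpstreamState:
    print(f"{state.name}: {next_action(state)}")
```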

Fifth, an explicit "post it" gate. No pack, however ready, is filed without a human typing the words. We never automated past that gate.

The checklist we actually run

The pre-flight is short on purpose, because long checklists do not get followed. In order:

  1. The reproducer passes today on a host we can name. Not last week. Not "the manifest says it passed". Today.
  2. The template markdown renders cleanly in GitHub's preview pane. Code fences survive, tables survive, and no non-public links bleed through.
  3. The reproducer is attached as a file or a gist, not pasted inline. Hundreds of lines of Python in an issue body burn reviewer attention.
  4. No non-public-only language: no infrastructure references, no non-public branch codes, no employee names other than the named authors.
  5. For Megatron PRs we run the project's required formatter before the diff goes in, because the upstream PR template makes that a hard checklist item.
  6. The filing approval is recorded by the people doing the work.
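Parts of this pre-flight can be checked mechanically before a human looks at the preview pane. The sketch below is one hypothetical lint for items 2 to 4: it flags unbalanced code fences, oversized inline code, and terms from a caller-supplied denylist. The 40-line threshold and the denylist contents are assumptions, and none of this replaces reading the rendered body.

```python
FENCE = "`" * 3  # built indirectly so this sketch can itself sit inside a fenced block


def lint_body(markdown: str, denylist: list[str], max_inline_code_lines: int = 40) -> list[str]:
    """Return a list of human-readable problems found in an issue or PR body."""
    problems = []
    lines = markdown.splitlines()
    fence_lines = [i for i, line in enumerate(lines) if line.lstrip().startswith(FENCE)]

    # An odd number of fence markers means a block never closes in the preview.
    if len(fence_lines) % 2:
        problems.append("unbalanced code fence")

    # Count lines that sit between each opening fence and its closing fence.
    inside = sum(close - open_ - 1 for open_, close in zip(fence_lines[::2], fence_lines[1::2]))
    if inside > max_inline_code_lines:
        problems.append("reproducer pasted inline; attach it as a file or gist")

    lowered = markdown.lower()
    for term in denylist:
        if term.lower() in lowered:
            problems.append(f"non-public term present: {term}")
    return problems


body = "Crash during capture.\n" + FENCE + "python\nprint('repro')\n" + FENCE + "\n"
print(lint_body(body, denylist=["internal-cluster"]))
```

A lint like this only narrows what the human gate has to look at; the approval itself stays manual.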

The checklist has caught real things. The FP8 dispatch hazard for SparseMLA was originally aimed at TileLang because that is where the kernel lives. Reading the body in preview made it obvious the bug is actually in the Transformer Engine Float8Tensor wrapper (.dtype lies, .data_ptr() returns NULL, .contiguous() does not unwrap), so the target shifted accordingly. Similarly, an early manifest pointed one Mamba backward note at the wrong repository even though the public sample being changed lived elsewhere. Both targeting errors were fixed before anything was filed.

The cadence

We submit in waves, not as a stream. A wave is two to four packs filed within a small window, batched by target repository, then a two-to-three-day pause before the next wave. The reason is simple: maintainers have inboxes. If six issues land on the same repository on the same morning, two may get triaged and the other four may sit. If two land, both are more likely to get triaged.

Within a wave we follow a coarse priority order. First come defensive fixes against bugs that crash training, such as the DSA CUDA-graph capture issue or the Megatron Float16Module cast that breaks Mamba3's fp32 contract. Second come fixes that can piggyback on an already-open upstream pull request, where a comment is more useful than a competing patch. Third come larger refactors that need real maintainer attention, such as SparseMLA dimension generalization or the Mamba3 MIMO 3D-to-2D shared-memory refactor. Fourth come bug reports for which there is not yet a fix to offer, such as the TileLang FloorMod divide-by-zero in LayoutInference.

The cadence rule that takes the longest to internalize is that "ready" does not mean "filed today". A pack can sit in the ready state for a week while waiting for the right wave; the cost of holding a ready pack is low, and the cost of dumping six issues on one maintainer is not.
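The wave rule can be sketched as a small planner: sort ready packs by priority, group them by target repository, and cut each group into waves. The `Pack` fields and `plan_waves` are hypothetical names, and the sketch only enforces the upper bound on wave size; the pause between waves and the decision to hold a lone leftover pack are left to a human.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pack:
    name: str
    repo: str
    priority: int  # 1 = crash fix, 2 = piggyback comment, 3 = refactor, 4 = report without a fix


def plan_waves(ready: list[Pack], max_per_wave: int = 4) -> list[list[Pack]]:
    """Batch ready packs by repository, highest priority first, at most max_per_wave each."""
    by_repo = defaultdict(list)
    for pack in sorted(ready, key=lambda p: p.priority):
        by_repo[pack.repo].append(pack)

    waves = []
    for repo_packs in by_repo.values():
        for start in range(0, len(repo_packs), max_per_wave):
            waves.append(repo_packs[start:start + max_per_wave])
    return waves


ready = [
    Pack("refactor", "repo-a", 3),
    Pack("crash-fix", "repo-a", 1),
    Pack("bug-report", "repo-b", 4),
    Pack("comment", "repo-a", 2),
]
for wave in plan_waves(ready, max_per_wave=2):
    print([p.name for p in wave])
```

With a cap of two, the crash fix and the piggyback comment go out together, and the refactor waits for the next wave against the same repository.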

The categories

The packs cluster into a small number of recurring shapes. Once we noticed the shapes, the writing got faster, because each shape has a template.

The first category is CUDA-graph and graph-capture safety. Library code from a year ago routinely contains torch.equal(...), if tensor.any(), or if torch.any(idx < 0) constructs that force an implicit host sync (effectively a cudaStreamSynchronize) and crash graph capture with cudaErrorStreamCaptureUnsupported. The fix is one of two moves: gate validations on torch.cuda.is_current_stream_capturing(), or rewrite the branchy logic into a branchless clamp/scatter/fixup. Pack 01 (DSA in Megatron-LM) is the canonical example.
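Both fixes fit in a few lines. This is a generic sketch on a toy gather, not the Megatron-LM DSA code: `gather_checked`, `gather_branchless`, and the choice to zero out invalid rows are hypothetical. It runs on CPU, where the capture check is simply False.

```python
import torch


def _capturing() -> bool:
    # Only query capture state when CUDA is actually usable on this host.
    return torch.cuda.is_available() and torch.cuda.is_current_stream_capturing()


def gather_checked(table: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Option 1: keep the validation, but skip it while a graph is being captured."""
    if not _capturing():
        # bool(...) on a device tensor forces a host sync, so it must not run in capture.
        if bool(torch.any(idx < 0)):
            raise ValueError("negative index")
    return table[idx.clamp(min=0)]


def gather_branchless(table: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Option 2: no Python branch on device data; clamp, gather, then fix up."""
    valid = idx >= 0
    rows = table[idx.clamp(min=0)]
    # Rows fetched through a clamped (invalid) index are replaced with zeros.
    return torch.where(valid.unsqueeze(-1), rows, torch.zeros_like(rows))


table = torch.arange(6.0).reshape(3, 2)
idx = torch.tensor([2, -1, 0])
out = gather_branchless(table, idx)
print(out)
```

The two options are not equivalent: the gated version still raises in eager mode and silently clamps during capture, while the branchless version behaves identically in both modes. Which one fits depends on whether the validation is a debugging aid or part of the contract.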

The second category is dispatch hazards introduced by tensor-wrapper types. Float8Tensor, QuantizedTensor, and other __torch_dispatch__-based wrappers have a habit of looking like a normal tensor at the Python level (.dtype, .shape, and .contiguous() all return plausible answers) while being unsafe to hand to a raw CUDA or TileLang kernel. Pack 03 is the FP8 SparseMLA case. The Megatron Float16Module blanket bf16 cast in pack 16 is the same family in reverse: an upstream "helper" silently rewrites tensor dtypes that another library expects to keep in fp32, and the result is either a clean dispatch error or silent NaN.

The third category is kernel-side numerical and dimension correctness. The SparseMLA dimension generalization (pack 02), the SparseMLA backward accum_dtype precision fix (pack 14), and the missing GQA branch in Mamba3 MIMO backward (pack 05) sit here. None is glamorous; they are all either "this kernel hardcoded one shape and crashes on every other shape" or "this buffer was bf16 where it needed fp32 and the gradient drifted".

The fourth category is memory-footprint reductions. The DSA _compute_index_scores per-head streaming accumulator (pack 12) is the loudest: an einsum that materialized a 16 GiB intermediate, replaced with a per-head bmm that reuses a 268 MiB output buffer. Math unchanged, working set ~60x smaller. These patches almost always overlap with at least one in-flight upstream PR, which is why they go in as comments rather than competing PRs.
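The "~60x" figure follows directly from the two sizes quoted above. A quick check, taking both numbers at face value in binary units:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

einsum_intermediate = 16 * GIB  # size quoted for the materialized einsum intermediate
per_head_buffer = 268 * MIB     # size quoted for the reused per-head output buffer

ratio = einsum_intermediate / per_head_buffer
print(f"working set shrinks by about {ratio:.1f}x")  # prints: working set shrinks by about 61.1x
```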

The fifth category is integration and dispatcher gaps in Megatron-LM. The Hopper FLCE dispatcher (pack 10) crashes with ValueError: Unsupported architecture: 9 on every cc!=10 device because the Blackwell entry was the only branch wired in. The Mamba LinearCrossEntropyModule wiring (pack 11) was correctly added in one PR and silently reverted three weeks later by a rebase-miss in another. These are the easiest packs to write and the hardest to file, because the right framing is "your CI did not catch this".
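The dispatcher gap has a simple shape. The sketch below is a hypothetical stand-in, not Megatron-LM's code: `select_flce_backend` and the backend names are invented for illustration, and only the error message format is taken from the crash quoted above.

```python
def select_flce_backend(cc_major: int) -> str:
    """Hypothetical dispatcher keyed on the compute-capability major version."""
    backends = {
        9: "hopper",      # the kind of entry whose absence produces the crash on cc 9
        10: "blackwell",  # the only branch wired in, per the report above
    }
    try:
        return backends[cc_major]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {cc_major}") from None


print(select_flce_backend(9))
print(select_flce_backend(10))
```

A table-driven dispatcher also makes the CI gap visible: a test that iterates over every supported major version fails as soon as an entry goes missing.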

The sixth category is toolchain and compiler bugs in TileLang. One example is the FloorMod divide-by-zero in LayoutInference when TMA lowering is enabled on Mamba3 backward kernels. Another is the now-fixed LowerBulkCopy InputDim==2 assert, which remains useful as a regression guard. These are bug reports, not patch drops; the right fix lives inside TVM's iter-map normalizer.

The seventh category is legality-preserving refactors that unlock a lowering path. Pack 07 (Mamba3 MIMO backward 3D->2D smem flatten) is the entire category. Every [c, r1, r2] indexer becomes [c, r1*R + r2]; smem footprint and register pressure are identical, and gradients match the unflattened baseline to within bf16 rounding. The reason to do it is that TileLang's TMA bulk-copy lowering requires InputDim()==2; once the descriptors are 2D, the backward kernel becomes eligible for TMA pipelining on Hopper.
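The index rewrite itself is plain row-major flattening of the two inner axes. The sketch below checks, for hypothetical extents, that the mapping hits every 2D slot exactly once and that each element keeps the same linear offset, which is why the footprint does not change.

```python
from itertools import product


def flat_index(c: int, r1: int, r2: int, R: int) -> tuple[int, int]:
    """Map a 3D [c, r1, r2] index to the 2D [c, r1*R + r2] form."""
    return c, r1 * R + r2


C, R1, R = 2, 3, 4  # hypothetical extents; R is the size of the innermost axis

seen = set()
offsets_match = True
for c, r1, r2 in product(range(C), range(R1), range(R)):
    c2, j = flat_index(c, r1, r2, R)
    seen.add((c2, j))
    offset_3d = (c * R1 + r1) * R + r2   # row-major offset in the [C, R1, R] buffer
    offset_2d = c2 * (R1 * R) + j        # row-major offset in the [C, R1*R] buffer
    offsets_match = offsets_match and offset_3d == offset_2d

print(len(seen) == C * R1 * R, offsets_match)
```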

Honest about state

None of the sixteen packs has been filed yet. Some are still blocked on fresh repro evidence. The SparseMLA precision fix remains a code-level note without a checked-in reproducer bundle, so it does not yet meet the bar described here. The Float16Module cast note shares reproducer coverage with another runtime issue but would still be tracked separately if filed. The LowerBulkCopy note has already been re-classified from "issue to file" to "local regression guard" because the underlying fix has shipped upstream.

These packs are written alongside the training work rather than by a dedicated open-source liaison. The packs that get written are the ones whose absence would cost more rebase time than the writeup costs. That filter is blunt, and it is also why the packs that do exist are concrete enough for someone else to validate.

Production checklist

  • Every patch we carry locally gets either an upstream pack or an explicit decision not to file, recorded in the filing checklist.
  • Every pack has a reproducer that runs against a named upstream SHA and prints a sentinel.
  • Reproducers stamp the host capability and the dependency versions in their first lines of output.
  • The validation manifest is the source of truth for "is this pack ready", not the markdown body.
  • We file in waves of two to four packs, batched by target repository, with two to three days between waves.
  • We never open a competing PR against an open upstream PR; we comment on the existing thread instead.
  • We do not file packs that are already fixed upstream; we repurpose them as regression tests.
  • The filing decision is always made by a human and recorded.
  • No non-public infrastructure references, no non-public branch codes, and no employee names other than the named authors.
  • For Megatron PRs we run the project's formatter before any diff is attached.

Frequently asked questions

What is the difference between an open upstream PR and a pack that is still in our queue?
An open upstream PR is already on a public maintainer lane and waiting on review, follow-up evidence, or merge. A pack that is still in our queue is earlier than that: it may be locally ready but waiting for the right filing wave, it may overlap an existing public thread where a comment is better than a competing PR, or it may still need a fresh reproducer rerun before we can post it. That is why "not merged yet" is too coarse a label for this queue.

Where should I start if I want the checked-in proof surfaces behind this queue?
Start with "MegaCpp model wiring examples" for the public-safe example index, then use "Upstream PRs for Mamba-3, Sparse-MLA, Liger, and DSA" and "Upstream PRs for TileLang and Megatron-Core" for the narrower pack-level receipts. This overview is the filing map; those three companion pages are the shortest path to the concrete reproducer and near-copy surfaces.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

Mamba

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.