MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 7 min readDavid Gornshtein
TileLang
H200
Hopper
Kernels
TMA

TileLang TMA and H200 reality

Why TileLang shared-memory legality and TMA lowering on Hopper-class GPUs should be treated as concrete compiler contracts rather than assumed backend magic.

MegaCpp
Focused on applied C++ model engineering
Article Preview
TileLang TMA and H200 reality
Published 7 min readDavid Gornshtein

The useful way to talk about TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingHistory: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 is not to ask whether the kernel is mathematically correct. The useful question is whether the lowering accepts the shared-memory layout and TMA path the kernel actually requests. The deeper follow-up lives in TileLang TMA bulk copy 3D SMEM deep dive, with the neighboring layout case in Mamba3 MIMO 3D to 2D SMEM deep dive.

That is why MegaCpp keeps small legality-style samples. A compact reproducer is often more valuable than one more benchmark chart when the problem lives in the compiler contract. This article is the narrow companion to Upstream PRs we wrote for TileLang and Megatron-Core and the broader kernel history in Mamba3 kernel journey.

The same contract-first view shows up in DSA CUDA graph safety deep dive, where a path can be numerically fine and still fail because the runtime contract was wrong. On the attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns side, The FA4 Catalog on Blackwell applies the same rule to backend eligibility instead of lowering legality. That is also why this article belongs next to Kernels that pay for themselves: a legality reproducer is worth keeping only if it explains a contract boundary we will hit again.

One important boundary up front: this article is about ordinary TMA bulk-copy legality on Hopper or H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200, not the datacenter-only TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 repro walkthrough cluster path. For the consumer-vs-datacenter split on sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro / sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough / sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough, TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 repro walkthrough, and the public stopping point at the observed .nv.capmercQuick term guide.nv.capmercObserved section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.GroundingAbout: GB10 driver gates warning History: libcuda patch lane Example: GB10 gate repro metadata block, the right companion is What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths plus the checked-in GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story bundle around GB10 gate matrix, full GB10 tensor-path probe source, and compact gate-walk mirror. If you need the naming layer before either article, use MegaCpp model glossary.

First-touch terms

If you only read one section before reading the examples, read this one.

What the checked-in samples actually prove

The compact sample TileLang TMA bulk-copy SMEM sample is intentionally small. tma_bulk_copy_layout(shape) rewrites a logical (d0, d1, d2) view into (d0 * d1, d2), and requires_layout_fix(...) makes the teaching point explicit: the example is about whether the lowered width and layout shape fit the backend contract, not whether the copy math would be meaningful.

The near-copy version TileLang TMA bulk-copy SMEM near-copy preserves the real comparison surface. bulk_copy_3d_contract(...) names the natural 3D source and destination, bulk_copy_2d_contract(...) shows the flattened form, and compare_layouts(...) keeps both views side by side. That is the exact reason to keep a near-copy around: it shows that the intended data movement is stable while the compiler-facing contract changes. The local catalog in MegaCpp model wiring example index is the quickest way to see where this sample sits relative to the neighboring Mamba and GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story receipts.

The neighboring Mamba example makes the same point on a more realistic kernel surface. Mamba3 3D-to-2D SMEM near-copy preserves the q_shared, k_shared, and qk_dot_shared staging shapes that forced the rewrite, while Mamba3 3D-to-2D SMEM sample isolates the flattening rule itself.

If you need the adjacent "who owns explicit memory placement?" lane, pair this article with Our honest experience with CuTe DSL and the checked-in MegaCpp model wiring example index. Those are the public-safe anchors for the opposite trade-off: instead of proving a legality rewrite, you deliberately take manual control over TMA, WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron, and staging layout.

The useful comparison boundary to keep straight is that TileLang TMA bulk-copy SMEM near-copy is the local "same payload, different lowering contract" proof, while the separate probe in the GB10 TMA multicast probe surface is a GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story-side check for cluster multicast vocabulary. They share TMA terms, but they are not proving the same hardware path.

The Blackwell-side contrast stays outside the H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 proof: tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough names the newer tensor-generation instruction family, and TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: TileLang TMA bulk-copy companion sample Reference: NVFP4 inference on GB10 is the associated Blackwell tensor-memory scratchpad. This article only uses those names to keep the Hopper TMA legality lane from being confused with the adjacent GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story gate walk.

Why legality fails before math

On Hopper, TMA is attractive because one thread can issue a bulk tensor copy while the rest of the block keeps computing. But that fast path only exists if the compiler can describe the movement using a legal descriptor, legal shared-memory staging, and legal synchronization. If the lowering cannot map the source view onto that contract, you do not have a slow TMA path. You have no TMA path.

That is the reason the public TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingHistory: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample Reference: Mamba3 kernel journey story here is compiler-facing instead of benchmark-facing. The point is not that every 3D view is bad. The point is that some logical 3D views need to be rewritten into a 2D staging shape before the lowering can emit the intended async copy. TileLang TMA bulk copy 3D SMEM deep dive is the smallest bulk-copy version of that lesson; Mamba3 MIMO 3D to 2D SMEM deep dive is the version where the same issue shows up on a real kernel surface.

Ordinary TMA bulk copy is not TMA multicast

Ordinary TMA bulk copy is the Hopper path for moving a tile between global memory and one block's shared memory using a tensor map and async completion machinery. That is the path these H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 articles talk about.

TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 repro walkthrough is the cluster-scoped form where one global-memory tile is copied into the shared memory of multiple blocks in the same cluster. The checked-in public probe for that lives in the GB10 TMA multicast probe surface, which uses cp.async.bulk.tensor.2d.shared::cluster.global.tile ... multicast::cluster. That probe belongs to the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story and datacenter-Blackwell evidence story, not to the H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 legality story in this article.

FAQ

Frequently asked questions

Is this article about math correctness or compiler legality?+
Compiler legality. The main question is whether the lowering accepts the memory layout and TMA request the kernel is making.
Why keep tiny legality-style samples around?+
Because when the problem is in the lowering contract, a compact reproducer is usually more useful than one more benchmark receipt.
Which checked-in files should I open first?+
Start with TileLang TMA bulk copy SMEM sample for the compact rule, then TileLang TMA bulk copy SMEM near-copy for the side-by-side 3D versus 2D contract, then MegaCpp model wiring example index to place those files in the wider TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments., Mamba, and GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. example catalog.
What should a legality-style repro include?+
The requested layout, the exact lowering surface being exercised, and the smallest checked-in sample that still fails or passes. If any of those are missing, the next reader has to rediscover what the contract even was.
Which TMA legality checks should I inspect before blaming TileLang?+
Start with the descriptor-facing checks, not the benchmark. The CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes. tensor-map contract keeps the global address and lower-rank strides on 16-byte boundaries, constrains the innermost copy box when no interleave is used, and ties swizzled boxes to fixed byte windows such as the 128-byte swizzle family. In this repo, that is why the compact TileLang TMA bulk copy SMEM sample and the side-by-side near-copy keep the logical 3D payload separate from the flattened compiler-facing layout before making any performance claim.
Where do proxy fences fit into this?+
After legality. A kernel can have a legal TMA layout and still need fence.proxy.async insertion so Hopper sees a valid transition from generic shared-memory traffic to async-proxy traffic. That is a different contract surface from the 3D-versus-2D layout issue, but it is part of the same lowering story.
Is the TMA path here the same thing as TMA multicast on Blackwell?+
No. This article is about whether a Hopper-class lowering can materialize a legal bulk-copy layout into shared memory. TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster. is the cluster-scoped datacenter path that shows up in the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. gate articles and in the GB10 TMA multicast probe surface; it should not be inferred from a plain H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks. bulk-copy repro.
Where does TMEM fit into this picture?+
It does not fit into the H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks. legality path directly. TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable. belongs to Blackwell tensor-generation and UMMAQuick term guideUMMAA Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.-style paths, while this article is about Hopper TMA moving tiles between global memory and ordinary shared-memory staging. For the Blackwell-side boundary, use What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths.
How is this different from CuTe DSL or CUTLASS-style kernels?+
This article is about compiler-managed legality: can TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments. lower a requested bulk-copy layout into a valid Hopper TMA path? The CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang. lane is where you accept the opposite bargain and own staging, swizzles, and instruction selection more explicitly. Use Our honest experience with CuTe DSL and the local MegaCpp model wiring example index when you need that side of the boundary.
Which checked-in file should I open if I want a direct TileLang-versus-CuTe mapping before the articles?+
Start with MegaCpp model wiring example index, Our honest experience with CuTe DSL, and TileLang TMA bulk copy SMEM near-copy. If you want the smaller primitive-by-primitive checked-in appendix after those surfaces, use TileLang TMA bulk copy 3D SMEM deep dive plus the compact and near-copy TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments. samples in this repo.
Where should I go if the Blackwell-side terms still feel underspecified?+
Use MegaCpp model glossary for the naming layer, then compact gate-walk mirror and GB10 gate matrix for the smallest checked-in map of how tcgen05.*, TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable., TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster., and the observed .nv.capmercQuick term guide.nv.capmercObserved section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops. stop differ from this Hopper legality lane.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

tcgen05

The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.

TMA multicast

The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.

.nv.capmerc

Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.

TMEM

Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.

UMMA

A Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

CUTLASS

NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.

CuTe DSL

The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

sm_100

Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

WGMMA

Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.