MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 3 min readDavid Gornshtein
TileLang
TMA
Hopper
Smem
Deep Dive

TileLang TMA bulk-copy 3D shared-memory deep dive

A deeper reproducer-driven look at why TileLang TMA bulk-copy paths can fail on shared-memory layout legality before the math is even the problem.

MegaCpp
Focused on applied C++ model engineering
Article Preview
TileLang TMA bulk-copy 3D shared-memory deep dive
Published 3 min readDavid Gornshtein

This class of bug matters because it looks like a kernel problem but is often a lowering problem. The checked-in example keeps the shape of that failure close to the original repro pack: the intended data movement is fine, but the shared-memory/TMA layout contract is not.

That distinction matters on Hopper-class hardware. It is easy to misdiagnose this as a kernel-math problem rather than a layout/lowering problem. In practice, the first failure can happen much earlier: the lowering cannot map the requested form onto the expected TMA/shared-memory contract, so the form is rejected or de-optimized before the intended fast path is even in play.

For the 128-byte TMA swizzle case, the useful preflight is small: keep the global tensor base 128-byte aligned, make the shared-memory destination obey the swizzle repeat boundary the CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 guide illustrates with alignas(1024), and keep the shared box's inner dimension no wider than the 128-byte swizzle span, while still allocating enough shared memory for the full swizzle width when the inner box is narrower. If the natural 3D tile violates that shape, padding or flattening to a compiler-visible 2D contract is less suspicious than hoping the lowerer can recover the descriptor later.

The checked-in TileLang TMA bulk-copy SMEM compact example and TileLang TMA bulk-copy SMEM near-copy are the narrow local proof surfaces that make this article safer than the original repro bundle alone. The compact sample makes the rewrite explicit with tma_bulk_copy_layout(shape) -> (d0 * d1, d2), while the near-copy keeps the natural 3D and flattened 2D contracts side by side so you can verify that the payload stays the same while the compiler-facing form changes. Use the MegaCpp model wiring example index if you need the neighboring Mamba and GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story receipts after that.

One practical seam is that "compiled" is not the same thing as "the intended bulk async path survived." If replacing the async copy with a scalar-load loop makes the kernel sane, or if the profiler shows no cp.async.bulk.tensor traffic, treat that as a lowering or fallback miss rather than proof that the original 3D TMA path worked.

The same rule applies to TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample version boundaries. Older LowerBulkCopy paths could reject rank-3 shared-memory descriptors before code generation; newer paths can compile by warning and falling back to an ordinary copy. That fallback is safer than a compiler abort, but it is still not evidence that the bulk TMA path survived. The adjacent public-safe write-up is our TileLang upstream handoff, and the shape-preserving kernel rewrite is covered in the Mamba3 3D-to-2D shared-memory deep dive.

TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: TileLang TMA bulk-copy companion sample Reference: NVFP4 inference on GB10 is a different Blackwell tensor-memory surface from this TMA bulk-copy seam, so a successful descriptor rewrite here should not be read as proof that tensor-memory allocation or matrix-instruction paths were exercised.

Example -> article -> upstream docs

FAQ

Frequently asked questions

If the kernel compiles, does that prove the bulk-copy path actually survived?+
No. A compile can still land on a safer fallback while the real bulk async path disappears. The fastest proof bar is a profiler receipt: if cp.async.bulk.tensor events are missing and transferred bytes stay at zero, treat it as a lowering-path miss rather than proof that the original 3D TMA contract worked.
What should I check after the layout shape looks legal?+
Check synchronization lowering, not just tensor ranks. TMA uses async proxy operations, so ordinary CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes. work and bulk async copies need the right proxy fence or barrier ordering when they exchange data through shared memory. If the flattening contract is sane but the kernel still corrupts or hangs, inspect the emitted path for fence.proxy.async-style ordering before blaming the math.
What descriptor detail can still break a legal-looking tile?+
The tensor map is part of the contract too. The CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes. TMA path passes a descriptor object to the kernel, so the generated signature still has to preserve that descriptor as a grid-constant input rather than treating it like an ordinary mutable pointer. If the shape, swizzle, and fences all look right but the emitted 3D load is rejected, check the descriptor handoff before rewriting the math.
Is a LowerBulkCopy fallback good enough?+
It is good enough to avoid treating the compiler as broken, but it is not a performance proof. If the post-fallback kernel no longer emits the bulk tensor-copy instruction, count the result as a safe correctness lane and keep the 3D-to-2D rewrite or descriptor fix on the table for the actual fast path.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

TMEM

Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

CUTLASS

NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.