TileLang TMA bulk-copy 3D shared-memory deep dive
A deeper reproducer-driven look at why TileLang TMA bulk-copy paths can fail on shared-memory layout legality before the math is even the problem.

This class of bug matters because it looks like a kernel problem but is often a lowering problem. The checked-in example keeps the shape of that failure close to the original repro pack: the intended data movement is fine, but the shared-memory/TMA layout contract is not.
That distinction matters on Hopper-class hardware. It is easy to misdiagnose this as a kernel-math problem rather than a layout/lowering problem. In practice, the first failure can happen much earlier: the lowering cannot map the requested form onto the expected TMA/shared-memory contract, so the form is rejected or de-optimized before the intended fast path is even in play.
For the 128-byte TMA swizzle case, the useful preflight is small: keep the
global tensor base 128-byte aligned, make the shared-memory destination obey
the swizzle repeat boundary the CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 guide illustrates with alignas(1024),
and keep the shared box's inner dimension no wider than the 128-byte swizzle
span, while still allocating enough shared memory for the full swizzle width
when the inner box is narrower. If the natural 3D tile violates that shape,
padding or flattening to a compiler-visible 2D contract is less suspicious than
hoping the lowerer can recover the descriptor later.
The checked-in TileLang TMA bulk-copy SMEM compact example
and TileLang TMA bulk-copy SMEM near-copy
are the narrow local proof surfaces that make this article safer than the
original repro bundle alone. The compact sample makes the rewrite explicit with
tma_bulk_copy_layout(shape) -> (d0 * d1, d2), while the near-copy keeps the
natural 3D and flattened 2D contracts side by side so you can verify that the
payload stays the same while the compiler-facing form changes. Use the
MegaCpp model wiring example index if you need
the neighboring Mamba and GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story receipts after that.
One practical seam is that "compiled" is not the same thing as "the intended
bulk async path survived." If replacing the async copy
with a scalar-load loop makes the kernel sane, or if the profiler shows no
cp.async.bulk.tensor traffic, treat that as a lowering or fallback miss
rather than proof that the original 3D TMA path worked.
The same rule applies to TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample version boundaries. Older LowerBulkCopy
paths could reject rank-3 shared-memory descriptors before code generation;
newer paths can compile by warning and falling back to an ordinary copy. That
fallback is safer than a compiler abort, but it is still not evidence that the
bulk TMA path survived. The adjacent public-safe write-up is
our TileLang upstream handoff, and
the shape-preserving kernel rewrite is covered in the
Mamba3 3D-to-2D shared-memory deep dive.
TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: TileLang TMA bulk-copy companion sample Reference: NVFP4 inference on GB10 is a different Blackwell tensor-memory surface from this TMA bulk-copy seam, so a successful descriptor rewrite here should not be read as proof that tensor-memory allocation or matrix-instruction paths were exercised.
Example -> article -> upstream docs
- example: TileLang TMA bulk-copy SMEM near-copy
- compact companion: TileLang TMA bulk-copy SMEM compact example
- article: this deep-dive route, plus the broader context article TileLang TMA and H200 reality
- adjacent checked-in catalog: MegaCpp model wiring example index
- upstream docs: CUDA asynchronous copies and Tensor Memory Accelerator and CUTLASS CuTe TMA tensors
Frequently asked questions
If the kernel compiles, does that prove the bulk-copy path actually survived?+
cp.async.bulk.tensor events are missing and transferred bytes stay at zero, treat it as a lowering-path miss rather than proof that the original 3D TMA contract worked.What should I check after the layout shape looks legal?+
fence.proxy.async-style ordering before blaming the math.What descriptor detail can still break a legal-looking tile?+
Is a LowerBulkCopy fallback good enough?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.
A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.
NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.
CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.
Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.