MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 8 min readDavid Gornshtein
Mamba3
Smem
TileLang
Kernels
Deep Dive

Mamba3 MIMO 3D-to-2D shared-memory deep dive

Why some Mamba3-style kernels need an explicit 3D-to-2D shared-memory legality rewrite before the backend will accept the tile layout.

MegaCpp
Focused on applied C++ model engineering
Article Preview
Mamba3 MIMO 3D-to-2D shared-memory deep dive
Published 8 min readDavid Gornshtein

The deeper point of the 3D-to-2D shared-memory example is that layout legality is its own contract. A kernel can be conceptually correct and still fail because backend lowering may accept a narrower shared-memory layout than the original kernel-side tensor view. That is the same class of issue described in TileLang TMA and H200 reality and the broader Mamba3 kernel journey: backend legality is often a layout contract, not a math contract.

The checked-in reference example keeps that lesson visible without dragging in the whole training stack.

The practical split is useful:

  • the compact example explains the flattening rule
  • the reference example preserves the actual shared-memory and indexing surfaces that had to be rewritten: Q/K shared-memory staging and the qk_dot_shared tile

That makes it easier to see why this was not a math rewrite. It was a layout rewrite done to preserve the same indexing semantics while making the shared-memory view easier for the lowering path to accept. It is the same kind of narrow, defensible kernel work discussed in Kernels that pay for themselves: the legality rewrite is worth keeping only because it preserves a hot path without broadening the contract far beyond what the backend actually accepts.

The hardware split is worth saying plainly. Hopper TMA can describe and move rank-3 payloads, but the compute lane that wants to consume the staged tile is still a matrix lane: wgmmaQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron wants a 2D shared-memory view with bank-safe swizzle and leading-dimension rules the lowering can prove. The rewrite is therefore not "TMA cannot do 3D." It is "the shared-memory form feeding the Tensor Core step still has to look 2D enough for the backend and the hardware contract."

That also keeps this Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode legality story separate from the TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 repro walkthrough questions on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story, where the cluster-copy probe belongs to the Blackwell capability lane rather than this Hopper shared-memory flattening example.

The research packet sharpens that failure surface further. The real conflict was not "rank-3 tensors exist." It was the intersection of a logical rank-3 staging view with the exact SWIZZLE_128B interleave and matrix-leading-dimension proof that the wgmmaQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron consumer wants. In practice that shows up as a lowering problem, not a math problem: the NVRTC-facing path would have to invent a 2D matrix interpretation on your behalf or reject the program. The explicit flatten wins because it stops asking the lowering to guess.

First-touch terms

  • Q/K staging here means the temporary shared-memory tiles that hold slices of Q and K before the dot-product stage.
  • qk_dot_shared is the shared-memory tile that holds the intermediate dot-product fragment. In the checked-in example it starts life as a logical (chunk_size, R, R) view.
  • 3D-to-2D rewrite means flattening those logical 3D shared-memory views into 2D tiles while keeping the same payload and logical indexing semantics.
  • TMA means Hopper's Tensor Memory Accelerator bulk-copy path, where the descriptor and shared-memory layout have to satisfy the lowering contract before the fast path can turn on.
  • SWIZZLE_128B means the tensor-map shared-memory bank swizzle mode that asks the backend to present the staged tile as a bank-friendlier matrix operand. In this article it is the signal that the logical 3D view has crossed into the wgmma-facing layout contract, not proof that TMA itself cannot address rank-3 payloads.
  • Legality means the backend can actually lower the staging form. It does not mean the original 3D tensor view was mathematically wrong.

Once those terms are clear, Mamba3 kernel journey is the tooling hub and Mamba 3 parallel performance is the measured runtime hub for the same kernel family.

What the local files prove

The Mamba3 MIMO 3D-to-2D shared-memory compact example is the teaching version. flatten_3d_tile_to_2d(depth, rows, cols) makes the rewrite explicit by returning (depth * rows, cols), and tile_is_layout_compatible(...) captures the narrow legality guard the sample cares about.

The Mamba3 MIMO near-copy shared-memory reference example preserves the kernel-facing contracts. build_3d_contract() keeps the natural shapes for q_shared, k_shared, and qk_dot_shared; build_2d_contract() shows the flattened replacement; remap_q_index(...) and remap_qk_dot_offset(...) make the semantic preservation explicit. That is the important engineering move: the storage view changes, but the logical indexing story does not.

If you want the shortest checked-in legality receipt instead of the whole near-copy file, explain_legality(...) is the smallest bridge. It compares the 3D and flattened 2D q_shared / qk_dot_shared surfaces and returns the narrow tma_compatible_width guard the sample treats as the TMA-facing check. That helper is intentionally narrower than a full lowering receipt: it proves the sample's preserved width and remap boundary, not that every real swizzle or descriptor combination is automatically legal.

What changes and what stays the same

What changes is the shared-memory staging shape. For example, a logical q_shared tile of (CHUNK_SIZE, R, N) becomes (CHUNK_SIZE * R, N), and qk_dot_shared changes from (CHUNK_SIZE, R, R) to (CHUNK_SIZE, R * R).

What stays the same is the kernel's meaning. The same Q/K fragments are staged. The same dot products are represented. The same output shape survives. The rewrite is there to make the lowering happy, not to change the algebra.

That is why this article is paired with the broader TMA legality article. The CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 and CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary documentation for TMA both frame the async-copy path as descriptor- and coordinate-driven; that makes layout restatement a normal part of getting onto the fast path. In other words: if the lowering accepts only the flatter staging form, the right question is "does the rewrite preserve semantics?" not "did we betray the original pretty tensor view?"

That distinction is also why the rewrite is not a hidden runtime tax. Once the descriptor is built, the coordinate translation lives in the tensor-map contract rather than in a per-element software remap, so the cost is code-shape and compiler-legality complexity more than extra runtime math. The handoff to the decode side stays clean for the same reason: a TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample prefill lane that stages (batch * seq, head_dim) and a CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary decode lane that keeps a richer logical layout can still agree on the same byte order in global memory.

The offset algebra is the practical reason that handoff stays cheap. In the natural 3D view the linear offset is b * seq * head_dim + s * head_dim + h; in the flattened 2D view it is (b * seq + s) * head_dim + h. Those are the same byte address. So the rewrite changes the shared-memory matrix contract that wgmmaQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron consumes, but it does not require a second global-memory repack just to move from prefill staging into the later decode-side interpretation.

The flatter view also matches why the swizzle is there in the first place. Hopper shared memory still exposes 32 banks, so a naive column-oriented consumer can serialize badly if too many lanes land on the same bank. The 128-byte swizzle is what spreads those accesses into a bank-safe matrix operand, and that only works cleanly when the staged tile is presented as the kind of 2D surface wgmmaQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron already knows how to consume.

The compiler failure surface is narrower than "3D is illegal." The real conflict is the bundle of a rank-3 shared-memory view plus the exact swizzle and leading-dimension proof the wgmmaQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron consumer wants. At that point the backend would have to invent a matrix interpretation on your behalf or reject the program. The explicit flatten wins because it makes that matrix interpretation visible instead of implicit.

Why this is worth keeping as a separate article

The Mamba example is the bridge between the toy legality sample and the real kernel story. TileLang TMA bulk copy 3D SMEM deep dive shows the same issue in the smallest possible bulk-copy example. This article shows where that same legality pressure lands on real kernel surfaces: Q/K staging, qk_dot_shared, and index remapping that must remain stable.

If you care about how this lands upstream rather than just why the layout changes, the filing-side companion is Upstream PRs we wrote for Mamba-3, Sparse-MLA, Liger and DSA. Pack 07 there keeps the claim narrow on purpose: legality-preserving flatten first, compiler-side TMA enablement later.

Example -> article -> upstream docs

The checked-in MegaCpp model wiring examples groups the compact and near-copy files under "TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample TMA and shared-memory legality," which is the local reason this article keeps those two proof surfaces together instead of collapsing them into one sample.

That split also explains why TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample and CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary can both be right while using opposite authoring styles. TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample takes the imperative route and flattens early so lowering sees an obviously legal matrix tile. CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary can often keep the logical 3D surface and compose it with a legal physical layout at compile time. The byte-level storage contract can still match across the handoff, so the difference is compiler strategy, not a hidden runtime transpose.

If the next question is whether this sort of legality seam is enough to justify moving a kernel into a more explicit stack, Our honest experience with CuTe DSL is the next repo-local decision log.

FAQ

Frequently asked questions

Was this a math rewrite?+
No. The point of the example is that the math and indexing semantics stay the same while the shared-memory layout is rewritten into a form the backend lowering path will accept.
Why keep both a compact example and a near-copy reference example?+
Because the compact version teaches the flattening rule, while the near-copy version preserves the real Q/K staging and qk_dot_shared surfaces that engineers actually need for comparison.
What is the core engineering lesson here?+
Shared-memory legality is its own contract. A kernel can be conceptually correct and still fail if the backend only accepts a narrower layout surface than the original tensor view suggests.
If Hopper TMA can move 3D tensors, why does this kernel still flatten to 2D?+
Because the bottleneck is not pure copy support. The staged tile still has to feed a wgmmaQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family. Tensor Core path that wants a 2D shared-memory view with bank-safe swizzle the lowering can prove, so the rewrite happens at the compute-facing boundary rather than because Hopper forgot how to copy rank-3 payloads.
What specifically made the lowering unhappy?+
Not 3D payloads by themselves. The bad combination was a rank-3 shared-memory view plus the swizzled 2D matrix contract that wgmmaQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family. expects. The flatten works because it makes the physical matrix interpretation explicit instead of leaving the backend to infer it.
Which local functions show that the rewrite preserved semantics?+
remap_q_index(...) and remap_qk_dot_offset(...) in the near-copy file make the remapping explicit, while build_3d_contract() and build_2d_contract() let you compare the old and new staging forms directly. That is why the article keeps linking the compact and near-copy samples together instead of collapsing them into one file.
What is the smallest checked-in helper that shows the legality boundary directly?+
explain_legality(...) in the checked-in sample is the shortest receipt. It compares the 3D and flattened 2D q_shared / qk_dot_shared contracts and returns the width guard the sample treats as the TMA-facing legality check. It is a sample-level legality proxy, not a substitute for the full lowering path.
Why do q_shared and qk_dot_shared flatten differently?+
Because they preserve different matrix roles after the rewrite. q_shared and k_shared keep N as the contiguous feature-width dimension, so the outer CHUNK_SIZE and R axes merge into matrix rows. qk_dot_shared already represents per-chunk R x R dot-product fragments, so the chunk axis stays as rows while the rank pair collapses into columns. The rule is not "flatten every 3D tile the same way." It is "make the Tensor Core-facing 2D contract explicit without changing the byte-order story."
Why is the checked-in example worth keeping if the compact example already explains the idea?+
Because the near-copy path preserves the real Q/K staging and qk_dot_shared layout pressures that triggered the legality rewrite. The compact example teaches the rule; the checked-in example shows where the rule came from. Mamba3 kernel journey is the wider backend story once the staging rewrite itself is clear.
Why can CuTe often keep a 3D view when TileLang has to flatten it?+
Because CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface. can encode the logical shape and the physical swizzle as one layout algebra and let compile-time layout composition resolve the flat offsets. TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.'s lowering surface is narrower here, so the safe move is to make the 2D Tensor Core view explicit in the kernel.
Is this the same thing as TMA multicast?+
No. This article is about staging legality for a Hopper bulk-copy path. TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster. is the cluster-scoped path discussed in GB10 Blackwell tensor paths and probed in the GB10 TMA multicast probe kernel.
Does the TileLang prefill to CuTe decode handoff need a layout conversion pass?+
Not if both sides preserve the same global-memory byte order. The surface representation changes, but the flattened TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments. staging view and the algebraic CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface. layout can still meet on the same linear storage contract, so the real seam is ownership and compile model, not a separate runtime transpose.
Does the flatten rewrite repack the tensor in HBM?+
No. The rewrite merges outer indices into the shared-memory matrix-row coordinate, but the underlying linear byte order can stay the same. That is why the checked-in example can explain the handoff with build_3d_contract(), build_2d_contract(), and the remap_* helpers instead of introducing a second storage format.
Where should I go once the staging rewrite itself is clear?+
Mamba3 kernel journey is the backend-tooling continuation, while Mamba 3 parallel performance is the measured runtime continuation. Mamba3 PsiV cache scaffold is the runtime-gating companion for one nearby optimization path.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

TMA multicast

The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.

WGMMA

Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

CuTe DSL

The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Topic hubs