MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 8 min readDavid Gornshtein
Mamba3
CUDA
TileLang
Pallas
CuTe DSL
Kernels
H200
TPU

The Mamba 3 Kernel Journey: CUDA, Pallas, TileLang, and an Honest Look at CuTe DSL

How the Mamba 3 kernel stack works in MegaCpp: TileLang on H200, Pallas on TPU v6e, a CuTe DSL port that was evaluated but not adopted, and what each attempt showed.

MegaCpp
Focused on applied C++ model engineering
Article Preview
The Mamba 3 Kernel Journey: CUDA, Pallas, TileLang, and an Honest Look at CuTe DSL
Published 8 min readDavid Gornshtein

Shipping a hybrid Mamba-3-plus-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns backbone for a C++ code model forces the same conversation three times: once for CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200, once for TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries, and once for the kernel DSL you trust enough to keep patching. This post is that conversation written down.

Short version: TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample is the CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-side kernel surface for the MIMO-heavy Mamba path on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200, PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample is the selective TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries kernel surface where a custom kernel is actually justified, torch_xla.experimental.scan still owns the main TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries state update path, and a CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample port was useful as a reference but not worth carrying as a second production lane. The architecture-level companion is Mamba 3 + Transformers: why MegaCpp uses a hybrid stack for C++.

If these kernel terms are new

The first-touch vocabulary here is easy to overload, so define it before arguing about performance:

The checked-in decoder for those terms is the pair MegaCpp example index and Pallas kernel selection notes. The local teaching-sized examples that matter most in this article are Mamba3 3D-to-2D shared-memory sample and Mamba3 PsiV cache scaffold. For the exact upstream-facing compiler bug surface behind the 3D shared-memory story, keep Upstream PRs we wrote for TileLang and Megatron-Core nearby too: that article is the cleaner handoff when the question becomes "which lowering failure, exactly, and what changed upstream?" The external primary-source split is equally simple: the Mamba repository defines the selective SSM family, while the JAX Pallas docs define the TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries custom-kernel surface this article keeps distinct from plain XLA. The cluster split stays deliberate: Author Mamba3 spec inside Megatron is the input-side seam, Mamba linear CE parity deep dive is the output-side seam, and this post is the kernel and backend seam.

Why this matters

Kernel-tool choice for a hybrid-model stack looks like a pure performance decision and turns out to be an operational one. Two kernel paths can lower to very similar machine code on supported shapes and still differ massively in how easy they are to patch, debug, and keep aligned with the surrounding runtime.

For a stack that already has to keep attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns kernels, state-space kernels, parallelism code, and precision policy in one working lane, iteration cost is a first-class metric. The question is not only "which kernel is faster?" It is also "which kernel can absorb one more correctness fix without turning the fork into a second product?"

1. What we actually ship on H200

The production CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-side path for the Mamba-3 half of the hybrid lives on the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample side of the stack, not in handwritten CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 C++. In practical terms, that means the MIMO kernels are maintained as a small, explicit patch surface instead of a fully separate kernel family.

That choice is reflected in the local proof surfaces:

The important public-safe summary is that MegaCpp keeps the CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-side Mamba path on the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample lane because the fixes it needs are small, inspectable, and reproducible there.

2. The TMA layout fix, in plain language

The most useful first-touch explanation in this whole article is also the simplest one: some kernel DSL lowerings accept only narrower shared-memory layouts than the model-side logic would naturally produce.

That is what the 3D-to-2D rewrite is about. It is not "new math." It is a layout-legality repair. The checked-in sample says it directly:

  • model-side logic wants a logical (depth, rows, cols) tile
  • the backend lowering wants a legal 2D tile
  • an explicit reshape can preserve the same payload while making the lowering happy

That distinction matters because it stops readers from confusing compiler legality work with architecture changes. The right companion here is the near-copy layout companion sample for the closer contract and Mamba3 MIMO 3D-to-2D shared-memory deep dive for the longer explanation.

It also shows why this article stops at kernel legality. A legal tile layout is not the same thing as a full tensor-parallel or end-to-end runtime answer. That follow-on lives in tensor-parallel Mamba3 mixer sample and Mamba 3 parallel performance.

3. PsiV and why the cache stays fail-closed

PsiV is easy to oversell because it sounds like a free reuse opportunity: if the same product is needed several times, why not materialize it once and pass it along?

The checked-in example deliberately refuses to hand-wave that answer. In Mamba3 PsiV cache scaffold, the public contract is explicit:

  • the cache is a scaffold, not a shipped optimization
  • if the gate is turned on before the implementation exists, the correct behavior is to refuse the run
  • the gate is intentionally explicit in the checked-in sample, so operators can tell the difference between "scaffold present" and "optimization shipped"

That last point is exactly why example-grounded wording matters. A vague "P2 is behind an env gate" summary is less useful than a checked-in gate name and a clear fail-closed behavior.

4. TPU v6e: XLA first, Pallas second

The TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries side of the same stack is easier to understand if you start from the default and only then add the custom-kernel lane.

The default is plain XLA around a shape-stable scan. The checked-in note Mamba-3 porting notes for TPU puts the rules in blunt form:

That is the right first-touch explanation for readers who only know PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample as "the TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries kernel language." In this stack, PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample is a selective tool, not the main story. The main story is still compile-stable state updates.

Where PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample does show up usefully is on the attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns side of the hybrid or on specialized TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries kernels where the tiling and mask rules are worth making explicit. The local policy note is Pallas kernel selection notes, and the adjacent post is Mamba-3 fused trapezoidal scan on TPU v6e.

One more TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries boundary is worth keeping explicit here: scan or scan_layers is not a generic compile-speed knob for every XLA lane. If the stacked block stops being structurally homogeneous, the safe fallback is the plain loop rather than forcing scan_layers across an incompatible body. The local XLA breakage matrix keeps that contract narrow: the traced block still has to stay structurally homogeneous and traceable, and current public-safe guidance keeps custom PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample-backed hotspots outside that traced surface. That is why this kernel article keeps "shape-stable scan first, plain loop when the body is not scan-friendly, selective PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample for explicit hotspots" as the main rule instead of presenting scan_layers as the universal answer.

5. TileLang versus CuTe DSL

The honest A/B here is not "which one looked more sophisticated?" It is "which one justified becoming a permanent lane?"

TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample won that comparison for operational reasons:

  • the changes MegaCpp actually needed were small enough to express directly in the existing TileLang surface
  • the checked-in legality and cache experiments already explain the live bug surfaces in that language
  • carrying a second full production kernel lane would have increased fork cost faster than it increased model quality

That does not make the CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample port useless. It still has reference value. A reference path is often enough when the goal is to validate an identity or a layout argument rather than to own two independent production implementations.

6. What the measurements support, and what they do not

Supported by the local public-safe proof surfaces:

Not supported tightly enough for stronger public copy:

The rule behind all three is the same one that shows up across MegaCpp posts: keep the proof surface visible. If the checked-in example says "scaffold only," the article should not promote it to "shipped." If the checked-in example says "layout legality," the article should not sell it as a new algorithm.

If you care about which of these kernel decisions actually changed step time rather than compile legality, the measured continuation is Mamba 3 parallel performance.

FAQ

Frequently asked questions

Why did TileLang win over the CuTe DSL port?+
Not because it was obviously more elegant. It won because the maintenance and iteration cost was lower while still covering the kernel surfaces MegaCpp actually needed to patch.
Why not keep a separate backward split as a hedge?+
Because a second kernel lane is only worth promoting when the checked-in proof surface shows both lower live-state pressure and a measured end-to-end win. In this cluster, the safer public rule is to keep split-kernel and CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface. ideas as diagnostic reference material until the examples prove more than local layout legality. That is why the maintained path still points readers at the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments. layout samples, the fail-closed PsiV cache scaffold, and the measured continuation in Mamba 3 parallel performance, rather than presenting an unmeasured split as a production hedge.
What does TMA mean in the concrete checked-in examples?+
It means the Hopper bulk-copy lowering that only turns on once the shared-memory descriptor is legal for the compiler path. The closest checked-in bridge is Mamba3 3D-to-2D shared-memory sample, which shows why flattening a logical 3D tile into a legal 2D tile matters.
Why keep the TPU and CUDA kernel stories in one article?+
Because they serve the same hybrid model surface. Seeing both lanes together makes it easier to separate architecture decisions from backend-specific implementation choices.
Why not treat scan_layers as the default TPU speed trick?+
Because it is a structure gate, not just a version gate. The traced block still has to stay homogeneous and compatible with the current XLA scan contract. When that is not true, the safe baseline is the plain loop, not forcing scan_layers across an incompatible body, and the public-safe TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels. lane in this cluster already keeps custom PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract. kernels as selective exceptions rather than as the default traced body. The compact decoder for that boundary is Torch 2.12 XLA breakage matrix.
Why is the PsiV cache still treated as experimental?+
Because the checked-in proof surface still treats it as a fail-closed scaffold. That is the correct public wording until the implementation and measurement story are both real. The shortest local decoder is Mamba3 PsiV cache scaffold, which names the exact env gate and the refusal behavior instead of hand-waving about "a possible cache path."
Where is the sharding and runtime continuation after this kernel article?+
The shortest grounded follow-on is tensor-parallel Mamba3 mixer sample for the TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node. ownership surface and Mamba 3 parallel performance for the measured cost side.
Where do the non-kernel Mamba seams live in this cluster?+
Read Author Mamba3 spec inside Megatron for the pre-projection norm seam, Mamba linear CE parity deep dive for the output-and-loss seam, and Mamba3 PsiV cache scaffold for the fail-closed runtime gate around one unfinished optimization idea.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

CuTe DSL

The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

Pallas

JAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

CUTLASS

NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

TPU

Google's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

JAX

A separate frontend above PJRT/libtpu. In these TPU posts it mainly matters as the owner of NamedSharding, PartitionSpec, and the optional call_jax or Pallas-adjacent bridge lanes.

Topic hubs