Tags: Mamba 3, CUDA, TileLang, Pallas, CuTe DSL, Kernels, H200, TPU

The Mamba 3 Kernel Journey: CUDA, Pallas, TileLang, and an Honest Look at CuTe DSL

How the Mamba 3 kernel stack actually shipped in our nanochat POC: TileLang on H200, Pallas on TPU v6e, a CuTe DSL port that never made it, and the verdicts that came out of each attempt.

10 min read · David Gornshtein

Shipping a hybrid Mamba 3 plus Transformer backbone for a C++ codegen model forces the same conversation three times, once per backend: CUDA on H200, Pallas on TPU v6e, and the DSL layer on top of each. This post is that conversation written down - what we tried, what carried to production, what we abandoned.

Short version: TileLang is how the MIMO kernels ship on H200 today, Pallas is how the SSD scan runs on TPU v6e, a CuTe DSL port of the MIMO kernel exists on disk and has not proven out, and our one serious TileLang versus CuTe comparison killed the CuTe path on ROI.

What We Actually Ship on H200

The production kernel stack for the Mamba 3 half of the hybrid is upstream mamba_ssm on commit 31f3d7baba, with local working-tree patches on three TileLang files (mamba_ssm/modules/mamba3.py, mamba_ssm/ops/tilelang/mamba3/mamba3_mimo_bwd.py, mamba_ssm/ops/tilelang/mamba3/mamba3_mimo_bwd_varlen.py) plus one correctness patch on mamba_ssm/ops/triton/mamba3/mamba3_siso_combined.py that caches ctx.saved_tensors for gradient checkpointing compatibility.

That fork is small on purpose. Patches go through an idempotent in-place applier (apply_mamba3_mimo_p1_patches.py) that crashes loudly if the upstream decorator block moves, and we md5-reconcile the working tree across both H200 hosts before any long run. The mamba3_siso_combined.py tweak exists because accessing ctx.saved_tensors twice on a recomputed node raises under gradient checkpointing.
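The actual patch isn't reproduced here, but the pattern it implements is small enough to sketch. A minimal, hedged illustration of the memoize-on-first-access idea (the `Doubler` op and `_saved_once` helper are hypothetical stand-ins, not the real SISO code):

```python
import torch

def _saved_once(ctx):
    # Read ctx.saved_tensors exactly once and memoize the tuple; under
    # gradient checkpointing, a second access on a recomputed node raises.
    if not hasattr(ctx, "_saved_cache"):
        ctx._saved_cache = ctx.saved_tensors
    return ctx._saved_cache

class Doubler(torch.autograd.Function):
    # Toy stand-in for the real op, just to exercise the caching pattern.
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * 2.0

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = _saved_once(ctx)
        (x2,) = _saved_once(ctx)   # second access hits the cache, not ctx
        assert x is x2
        return grad_out * 2.0
```

The real patch applies the same one-read-then-cache discipline inside `mamba3_siso_combined.py`'s backward.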

The CUDA path for Mamba 3 lives inside TileLang, not hand-written CUDA C++. The kernels that matter are mamba_mimo_fwd, mamba_mimo_bwd_fwd, and mamba_mimo_bwd_bwd from upstream's tilelang/mamba3/, plus varlen variants. Everything else - in_proj fusion, RMSNorm, residual - runs through Transformer Engine and regular PyTorch.

TileLang P1: What It Buys, What It Costs

P1 is our internal label for a targeted perf pass on the MIMO kernels: flip the disabled-by-default TMA and warp-specialization flags to enabled. Upstream ships them off:

```python
tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
```

Flipping them lets the TileLang compiler emit Hopper-class TMA descriptors for bulk gmem-to-smem async copies, and warp-group pipelining instead of plain mma.sync. We also add TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True on every kernel that did not already have it, so the downstream smem pressure does not push us past the dynamic cap on smaller GPUs.
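Concretely, the shipped change amounts to inverting those keys behind the env gate. A hedged sketch of the shape of the patch (the gate name is from this post; `import tilelang` and the surrounding decorator context are assumed, not shown):

```python
import os

# Sketch only -- this slots into the upstream pass_configs decorator block;
# assumes `import tilelang` in the surrounding file.
_p1 = os.environ.get("CPPMEGA_MAMBA3_P1", "0") == "1"

pass_configs = {
    # Flipped from upstream's hard-coded True when the P1 gate is on:
    tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: not _p1,
    tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: not _p1,
    # Added everywhere to keep merged smem under the dynamic cap:
    tilelang.PassConfigKey.TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True,
}
```

With the gate off, the dict is byte-identical to upstream, which is what lets the applier stay idempotent.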

This is eight TL_DISABLE_* flips and five new aggressive-merge insertions across four files. The patch is idempotent and env-gated behind CPPMEGA_MAMBA3_P1=1, default OFF. On GB10 (sm_121a, 99 KiB smem per SM) all three kernel groups compile cleanly with TMA and warp-spec enabled, and all eleven forward parametrized shapes pass the standard correctness tolerance. The combined backward passes at rel_err 0.004 - 0.012, well inside our 0.05 gate.

Then we tried to run it on H200 with backward enabled. Forward compiled fine. mamba_mimo_bwd_fwd and mamba_mimo_bwd_bwd blew up inside TileLang's TMA lowering pass:

```
tvm.error.InternalError: Check failed: (shared_layout->InputDim() == 2) is false: Cannot detect TMA layout.
```

Root cause: the backward kernels use three rank-3 shared-memory descriptors (qk_dot_shared[c, r1, r2] and (B, S, R, G, N) reads of Q); TileLang's TMA lowering only handles 2D layouts. Fix is mechanical: flatten to 2D via zero-copy reshape (qk_dot_shared[c, r1, r2] -> [c, r1 * R + r2], Q[B, S, R, G, N] -> [B, S*R, G, N]). No arithmetic changes, just a view.
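The flatten really is just a view, which is easy to convince yourself of outside the kernel. A quick check with torch on illustrative shapes (the real descriptors live in smem inside the kernel; these dimensions are made up for the demo):

```python
import torch

# Illustrative shapes only, not the kernel's actual extents.
c, R1, R2 = 3, 4, 5
qk = torch.arange(c * R1 * R2, dtype=torch.float32).reshape(c, R1, R2)

flat = qk.view(c, R1 * R2)                 # rank-3 -> rank-2, zero-copy
assert flat.data_ptr() == qk.data_ptr()    # same storage, no data movement

# Index mapping [c, r1, r2] -> [c, r1 * R2 + r2] is exact.
for r1 in range(R1):
    for r2 in range(R2):
        assert torch.equal(flat[:, r1 * R2 + r2], qk[:, r1, r2])
```

Because the view shares storage and the index map is affine, the generated loads are unchanged; only the layout rank seen by the TMA lowering pass drops to 2.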

We shipped it on branch tma-layout-fix-3d-to-2d, with an applier and a unified diff. Correctness on GB10 survived: 14 gradient tensors at rel_err 0.0038 - 0.0116, bit-for-bit with the TMA-off baseline inside BF16 rounding.

On H200 we then hedged and tried shipping only the forward flip, with backward kernels left unpatched. Over a 25-iter bench at MBS=8 NAM56R, the measured delta was -0.006 percent TFLOP/s. Inside noise. The honest read is that mamba_mimo_fwd at 1192 ms is a small fraction of the 5540 ms iteration, and a 20 - 30 percent speedup on forward moves the step by roughly one percent, which our variance swallows. The full P1 win needs both the TMA layout fix and the backward flips together, and that measurement is still pending an H200 slot.

PsiV and the Register Ceiling

While P1 was sitting on the H200 queue, we wrote up two follow-on designs. One shipped as scaffolding; one is shelved.

The PsiV cache (P2 in our internal sequence) is the straightforward one. PsiV appears five times in the MIMO kernel loop body across fwd, bwd_fwd, and bwd_bwd, recomputed from scratch each time even though its two inputs (psi as a module parameter, V as a per-step activation) are stable within a single forward-backward iteration. The plan is to save PsiV to gmem inside fwd, pass it to the backward kernels via ctx.save_for_backward, and skip two of the three recomputes. Shape (B, S, H, R, P), BF16, about 5.6 GiB extra per rank at NAM56R MBS=8, fine inside our 132 GiB peak. Modeled envelope is 1.5 - 2.3 percent total TFLOP/s.
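The save-in-forward, reuse-in-backward plumbing is standard autograd. A toy, hedged sketch of the pattern (the `PsiVCached` class and its reduction output are hypothetical stand-ins; the real kernel returns the MIMO outputs, not a sum):

```python
import torch

class PsiVCached(torch.autograd.Function):
    # Toy stand-in: materialize psi_v = v * psi once in fwd and hand the
    # product to backward via save_for_backward instead of recomputing it.
    @staticmethod
    def forward(ctx, v, psi):
        psi_v = v * psi                      # computed exactly once
        ctx.save_for_backward(v, psi, psi_v)
        return psi_v.sum(dim=-1)             # placeholder for the kernel output

    @staticmethod
    def backward(ctx, grad_out):
        v, psi, psi_v = ctx.saved_tensors    # cached product, no recompute
        g = grad_out.unsqueeze(-1).expand_as(psi_v)
        return g * psi, g * v
```

In the real kernel the saved tensor is the 5.6 GiB gmem buffer described above, and the two backward kernels read it instead of re-deriving the product.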

The non-obvious part is the failure mode we have to check first. TileLang's scheduler may already be CSE-ing psi_v = v * psi across the kernel's internal stages - hoisting the load once, keeping the product in a register between back-to-back ct.mma calls. If it is, the cache buys nothing. That is why the plan starts with a Phase A Python prototype: materialize psi_v at the Python level before the kernel call, hand it in as if it were V, and measure. If the Python-level materialization does not move nsys, the whole pursuit is archived and a superseded note goes into the design doc. The env gate raises NotImplementedError by design until a perf number justifies turning it on.

The register-pressure split (P3) is the one we decided not to ship. It proposed cutting mamba_mimo_bwd_bwd in two - a state-reverse pass producing a dstates_per_chunk gmem tensor, and a chunk-local pass consuming it plus re-deriving PsiV and qk_dot - to drop both passes under 150 registers and double the 12.5 percent occupancy. Pitch: 500 - 900 ms per step, about one percent total TFLOP/s, on a compute-bound kernel at AI 479 with the H200 ridge at 206.

On a careful read of mamba3_mimo_bwd.py the split did not survive:

  • The loop-carried dstates_frag is updated at the end of each reverse chunk via T.gemm(q_shared, dPhiO_scaled_frag, dstates_frag, clear_accum=False) and carried into the next iteration. Pass 1 still needs q_shared, dPhiO_shared, and dstates_frag live - the exact fragments the split was supposed to drop.
  • Honestly separating the passes costs an extra [B, H, nchunks, chunk_size * R, P] buffer (3x bigger than dstates_per_chunk, ~200 MiB per sample) plus re-derivation of rotated-and-trap-scaled Q/K inside Pass 1. Realistic post-split register count is 200 - 220, not the 140 the design claimed.
  • GB10 is not a viable correctness platform for this work. At the shapes we would need for a baseline-versus-split diff, the upstream forward kernel itself fails to compile on GB10 at TMA desc init error 716 (small shape) and Auto-tuning failed: No configuration successfully compiled (NAM56R small). Any correctness validation has to run on H200, and H200 time is gated.
  • H200 SSH access was broken on the day we ran the audit. No H200 means no perf measurement, which meant no ROI validation for a week of kernel work.

Decision: do not ship P3. Pursue the PsiV cache instead, which removes three fragment tiles from bwd_bwd's inner live-set for 2 - 3 days of implementation rather than 8 - 12 days for the split, with the same expected ~1 - 2 percent envelope. The clean engineering win here was not implementing the optimization, which sounds like a joke but is genuinely how we saved a week of kernel time we did not have.

Pallas on TPU v6e

The TPU side of the same stack runs under XLA, on v6e-4 for 4k-context rapid ablations and v6e-8 for the 16k and 64k context phases. There is no TileLang path on TPU; the kernels are a mix of torch_xla.experimental.scan for the SSD recurrence and a chunked matmul-based reference for within-chunk mixing.

The chunked reference is where we caught most of our correctness bugs before the CUDA kernel did. The SSD dual decomposition must include both components: cross-chunk state accumulation (sequential over nchunks = seq / chunk_size) and within-chunk attention-like mixing. Drop the within-chunk term and tokens can only see information compressed into a chunk-boundary state; local context mixing is gone. Every reviewer proposed dropping it at some point, and the chunked reference test caught each attempt.
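The decomposition is easy to state precisely on a scalar-decay diagonal SSM, which is far simpler than the real MIMO kernel but exhibits the same two terms. A minimal sketch, assuming the recurrence h_t = a_t h_{t-1} + B_t x_t, y_t = C_t h_t (all names here are illustrative, not the kernel's):

```python
import torch

def ssd_sequential(x, a, B, C):
    # x: (S, P) inputs, a: (S,) per-step decay, B/C: (S, N) in/out projections.
    S, P = x.shape
    h = x.new_zeros(B.shape[1], P)
    y = x.new_zeros(S, P)
    for t in range(S):
        h = a[t] * h + B[t, :, None] * x[t, None, :]   # (N, P) state update
        y[t] = C[t] @ h
    return y

def ssd_chunked(x, a, B, C, chunk):
    # Dual decomposition: within-chunk attention-like mixing PLUS a
    # sequential cross-chunk state carry. Dropping either term is wrong.
    S, P = x.shape
    h = x.new_zeros(B.shape[1], P)
    y = x.new_zeros(S, P)
    for s0 in range(0, S, chunk):
        sl = slice(s0, min(s0 + chunk, S))
        xc, ac, Bc, Cc = x[sl], a[sl], B[sl], C[sl]
        A = torch.cumprod(ac, dim=0)                   # cumulative decay in chunk
        # Cross-chunk term: every token reads the carried-in state.
        y[sl] = A[:, None] * (Cc @ h)
        # Within-chunk term: causal attention-like mixing, decay-weighted.
        M = torch.tril((A[:, None] / A[None, :]) * (Cc @ Bc.T))
        y[sl] += M @ xc
        # Carry the chunk-final state forward.
        h = A[-1] * h + ((A[-1] / A)[:, None] * Bc).T @ xc
    return y
```

The chunked reference test is exactly an `allclose` between these two paths at several chunk sizes; deleting the `M @ xc` line makes it fail immediately, which is how each "drop the within-chunk term" proposal died.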

We use F.rms_norm (parameterless) for B/C QK-norm to match nanochat's attention QK-norm rather than nn.RMSNorm - mixing the two leaves orphan learnable weights that never receive gradients and silently break training. Complex RoPE uses per-dimension frequencies, not a single scalar angle per position, because a single frequency collapses the rotation to one dimension.

A Pallas kernel does live in the TPU tree, but for a different purpose. We maintain a content-dependent sparse attention prototype - importance scoring, query-tile union selection, and a Pallas sparse attention kernel with online softmax, parameters aligned to the v6e MXU (Bq=256, l'=256, H=128, Bk=1024). It is for the attention minority of the hybrid, not the SSM majority. Supports up to 128k sequence length with a top-n=8 selection. Prototype committed; hardware validation receipt still open.
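The Pallas source isn't reproduced here, but the online-softmax accumulation it relies on fits in a few lines. A hedged torch sketch with toy sizes instead of the MXU-aligned Bq/Bk (single query vector, no sparsity or tile selection):

```python
import torch

def online_softmax_attention(q, K, V, block=4):
    # Stream over key blocks keeping a running max (m), denominator (l),
    # and rescaled output accumulator (acc) -- one pass, never materializing
    # the full score row. This is the standard online-softmax recurrence.
    m = torch.tensor(float("-inf"))
    l = torch.tensor(0.0)
    acc = torch.zeros(V.shape[1])
    for s in range(0, K.shape[0], block):
        scores = K[s:s + block] @ q                # (b,) partial logits
        m_new = torch.maximum(m, scores.max())
        scale = torch.exp(m - m_new)               # rescale old running sums
        p = torch.exp(scores - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[s:s + block]
        m = m_new
    return acc / l
```

In the sparse kernel the inner loop runs only over the selected key tiles from the query-tile union, but the rescaling identity is the same, which is what makes block-skipping exact rather than approximate.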

Main Pallas trade-off: compile time. torch_xla.experimental.scan rewrites the SSM loop to avoid @while_loop overhead. The fused HLO is cheaper per step but much more expensive to compile the first time. We eat one long compile on rank-0 at process start; without it, the Python-level loop over chunks dominates at 4k context on v6e-8.

TileLang versus CuTe DSL

The one honest A-B we ran on the kernel stack itself was between TileLang and a CuTe DSL port of the MIMO kernel. Our CuTe DSL artifact lives at .tmp/mamba3_mimo_cutile/ - a "cuTile" port mirroring the upstream TileLang MIMO kernels. The intent was to have a non-TileLang code path for the MIMO kernels, with a CUTLASS-style template structure instead of a DSL compiler, so we could make targeted changes (for example adding PsiV as an extra output of fwd) without fighting a scheduler.

Verdict after a few weeks of on-and-off work: not worth shipping. Three reasons, none about correctness.

First, compile-time cost. The MIMO kernel is parameterized across (N, P, R, chunk_size, BB) and instantiating it through CUTLASS templates took minutes per cold build. TileLang's JIT takes seconds per shape because it caches at the TIR level. In a nightly sweep with ~20 configurations the tax is real.

Second, the targeted changes we actually wanted (PsiV cache, 3D-to-2D smem flatten, TMA flag flips) are all expressible inside TileLang's existing decorator block. The P1 patches are literally two lines per kernel. CuTe's advantage of "do whatever you want" mattered less than TileLang's advantage of "land a fix in an hour" - when the TMA layout bug hit, the TileLang fix landed the same day. A CuTe port would have required reflowing the smem allocation by hand.

Third, the CuTe artifact did not match TileLang on correctness once we pushed R > 4 or P=128. TileLang passes all eleven parametrized shapes; our CuTe port passed five. Chasing the last six was a kernel-author week the ROI math did not justify, given the low-hanging wins all live on the TileLang path anyway.

The port stays on disk as a reference implementation: the psi_v = v * psi identity for the P2 cache is trivially checkable side by side against it. Killing a candidate path while keeping its reference value is a reasonable end-state.

What We Learned About DSL Choice

DSL choice is an operational cost, not a performance cost. Both TileLang and CuTe land the same PTX for the shapes we run. The difference is iteration speed: flag flips, shape tuning, and small correctness fixes land cheaper in TileLang because the compiler's surface area is closer to what we want to change. For a training loop patched in place via an env-gated applier, TileLang pays for itself on maintenance alone.

Kernel perf work is measurement-bound, not design-bound. P1 selective-forward was a wash (-0.006 percent) because forward is a small fraction of the iteration. P3 looked like 30 - 50 percent on paper; a line-by-line read collapsed that estimate to 1 - 2 percent. Real kernel wins come from an nsys capture that names the actual bottleneck, not a design that names a plausible one.

Current state on H200: upstream TileLang MIMO kernels, three working-tree patches, env-gated P1 pipeline, P2 PsiV cache scaffolded and waiting for a Phase A nsys number, P3 rejected. On TPU v6e: torch_xla.experimental.scan for the SSM, Pallas sparse attention prototype for the attention minority, chunked matmul reference as the correctness ground truth. Nothing glamorous, which is exactly the sign the kernel stack is converging.

References

  • mamba3_mimo_p1_notes.md
  • mamba3_mimo_p2_psiv_cache_design.md
  • mamba3_mimo_p3_register_split_design.md
  • mamba_fork_canonical_2026_04_14.md
  • mamba_integration_log.md
  • mamba_review_followup_plan.md
  • v4_architecture.md
  • TENSOR_PARALLELISM.md
  • CURRENT_STATE.md
David Gornshtein • Datasunrise OÜ