The Mamba 3 Kernel Journey: CUDA, Pallas, TileLang, and an Honest Look at CuTe DSL
How the Mamba 3 kernel stack works in MegaCpp: TileLang on H200, Pallas on TPU v6e, a CuTe DSL port that was evaluated but not adopted, and what each attempt showed.

Shipping a hybrid Mamba-3-plus-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns backbone for a C++ code model forces the same conversation three times: once for CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200, once for TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries, and once for the kernel DSL you trust enough to keep patching. This post is that conversation written down.
Short version: TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample is the CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-side kernel surface for the MIMO-heavy
Mamba path on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200, PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample is the selective TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries kernel surface where a custom
kernel is actually justified, torch_xla.experimental.scan still owns the main
TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries state update path, and a CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample port was useful as a reference but not
worth carrying as a second production lane. The architecture-level companion is
Mamba 3 + Transformers: why MegaCpp uses a hybrid stack for C++.
If these kernel terms are new
The first-touch vocabulary here is easy to overload, so define it before arguing about performance:
- TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample is the CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-side kernel DSL/compiler surface used for the main Mamba MIMO kernels on H200.
- CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample is the CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample-family Python DSL we evaluated as a reference port, not the maintained production lane for this Mamba kernel. For the broader CUTLASS/CuTe ownership boundary, use Our honest experience with CuTe DSL.
- TMA means Tensor Memory Accelerator: the Hopper bulk-copy path that can move tiles into shared memory efficiently, but only when the descriptor shape and lowering contract stay legal.
- PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample is a TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries kernel DSL inside the JAXQuick term guideJAXA separate frontend above PJRT/libtpu. In these TPU posts it mainly matters as the owner of NamedSharding, PartitionSpec, and the optional call_jax or Pallas-adjacent bridge lanes.GroundingAbout: libtpu and JAX interaction Reference: libtpu / PJRT / JAX ownership boundaries Reference: Pallas on TPU/XLA ecosystem. In MegaCpp terms, Pallas on TPU is the "reach for this only when plain XLA lowering is no longer enough" tool, not the default answer to every TPU kernel.
- XLA is the compiled graph/runtime layer underneath the TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries lane, so the practical boundary in this cluster is whether a change stays inside one shape-stable XLA graph or widens into a separate bridge or custom-kernel surface.
- SSD scan is the recurrent state update in the Mamba path. On TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries the safest public summary is still "keep the scan shape-stable and keep boundary metadata inside the compiled path."
- SISO means single-input, single-output; MIMO means multi-input, multi-output.
- PsiV is the elementwise product of two already-available kernel inputs. The checked-in cache example treats materializing it as an explicit fail-closed experiment, not as a silently enabled optimization.
The checked-in decoder for those terms is the pair MegaCpp example index and Pallas kernel selection notes. The local teaching-sized examples that matter most in this article are Mamba3 3D-to-2D shared-memory sample and Mamba3 PsiV cache scaffold. For the exact upstream-facing compiler bug surface behind the 3D shared-memory story, keep Upstream PRs we wrote for TileLang and Megatron-Core nearby too: that article is the cleaner handoff when the question becomes "which lowering failure, exactly, and what changed upstream?" The external primary-source split is equally simple: the Mamba repository defines the selective SSM family, while the JAX Pallas docs define the TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries custom-kernel surface this article keeps distinct from plain XLA. The cluster split stays deliberate: Author Mamba3 spec inside Megatron is the input-side seam, Mamba linear CE parity deep dive is the output-side seam, and this post is the kernel and backend seam.
Why this matters
Kernel-tool choice for a hybrid-model stack looks like a pure performance decision and turns out to be an operational one. Two kernel paths can lower to very similar machine code on supported shapes and still differ massively in how easy they are to patch, debug, and keep aligned with the surrounding runtime.
For a stack that already has to keep attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns kernels, state-space kernels, parallelism code, and precision policy in one working lane, iteration cost is a first-class metric. The question is not only "which kernel is faster?" It is also "which kernel can absorb one more correctness fix without turning the fork into a second product?"
1. What we actually ship on H200
The production CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-side path for the Mamba-3 half of the hybrid lives on the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample side of the stack, not in handwritten CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 C++. In practical terms, that means the MIMO kernels are maintained as a small, explicit patch surface instead of a fully separate kernel family.
That choice is reflected in the local proof surfaces:
- MegaCpp example index treats TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample shared memory and TMA legality as one of the main checked-in kernel categories.
- Mamba3 3D-to-2D shared-memory sample reduces the layout issue to one simple idea: a legal 2D tile can preserve the same logical payload as a natural 3D tile.
- TileLang TMA and H200 reality is the adjacent post when you want the H200-specific rollout story rather than the vocabulary.
The important public-safe summary is that MegaCpp keeps the CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-side Mamba path on the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample lane because the fixes it needs are small, inspectable, and reproducible there.
2. The TMA layout fix, in plain language
The most useful first-touch explanation in this whole article is also the simplest one: some kernel DSL lowerings accept only narrower shared-memory layouts than the model-side logic would naturally produce.
That is what the 3D-to-2D rewrite is about. It is not "new math." It is a layout-legality repair. The checked-in sample says it directly:
- model-side logic wants a logical
(depth, rows, cols)tile - the backend lowering wants a legal 2D tile
- an explicit reshape can preserve the same payload while making the lowering happy
That distinction matters because it stops readers from confusing compiler legality work with architecture changes. The right companion here is the near-copy layout companion sample for the closer contract and Mamba3 MIMO 3D-to-2D shared-memory deep dive for the longer explanation.
It also shows why this article stops at kernel legality. A legal tile layout is not the same thing as a full tensor-parallel or end-to-end runtime answer. That follow-on lives in tensor-parallel Mamba3 mixer sample and Mamba 3 parallel performance.
3. PsiV and why the cache stays fail-closed
PsiV is easy to oversell because it sounds like a free reuse opportunity: if the same product is needed several times, why not materialize it once and pass it along?
The checked-in example deliberately refuses to hand-wave that answer. In Mamba3 PsiV cache scaffold, the public contract is explicit:
- the cache is a scaffold, not a shipped optimization
- if the gate is turned on before the implementation exists, the correct behavior is to refuse the run
- the gate is intentionally explicit in the checked-in sample, so operators can tell the difference between "scaffold present" and "optimization shipped"
That last point is exactly why example-grounded wording matters. A vague "P2 is behind an env gate" summary is less useful than a checked-in gate name and a clear fail-closed behavior.
4. TPU v6e: XLA first, Pallas second
The TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries side of the same stack is easier to understand if you start from the default and only then add the custom-kernel lane.
The default is plain XLA around a shape-stable scan. The checked-in note Mamba-3 porting notes for TPU puts the rules in blunt form:
- keep the scan body shape-stable
- materialize sequence or segment identifiers explicitly
- prefer plain XLA fusion around the scan by default
- use PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample only when custom tiling or mask handling clearly pays for itself
That is the right first-touch explanation for readers who only know PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample as "the TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries kernel language." In this stack, PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample is a selective tool, not the main story. The main story is still compile-stable state updates.
Where PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample does show up usefully is on the attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns side of the hybrid or on specialized TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries kernels where the tiling and mask rules are worth making explicit. The local policy note is Pallas kernel selection notes, and the adjacent post is Mamba-3 fused trapezoidal scan on TPU v6e.
One more TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries boundary is worth keeping explicit here: scan or scan_layers
is not a generic compile-speed knob for every XLA lane. If the stacked block
stops being structurally homogeneous, the safe fallback is the plain loop
rather than forcing scan_layers across an incompatible body. The local XLA
breakage matrix keeps that contract narrow: the traced block still has to stay
structurally homogeneous and traceable, and current public-safe guidance keeps
custom PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample-backed hotspots outside that traced surface. That is why this
kernel article keeps "shape-stable scan first, plain loop when the body is not
scan-friendly, selective PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample for explicit hotspots" as the main rule instead
of presenting scan_layers as the universal answer.
5. TileLang versus CuTe DSL
The honest A/B here is not "which one looked more sophisticated?" It is "which one justified becoming a permanent lane?"
TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample won that comparison for operational reasons:
- the changes MegaCpp actually needed were small enough to express directly in the existing TileLang surface
- the checked-in legality and cache experiments already explain the live bug surfaces in that language
- carrying a second full production kernel lane would have increased fork cost faster than it increased model quality
That does not make the CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample Example: tcgen05 gate matrix sample port useless. It still has reference value. A reference path is often enough when the goal is to validate an identity or a layout argument rather than to own two independent production implementations.
6. What the measurements support, and what they do not
Supported by the local public-safe proof surfaces:
- TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample as the maintained CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-side lane for the Mamba MIMO path
- shared-memory legality fixes framed as layout rewrites rather than math changes
- a fail-closed PsiV cache scaffold instead of a silently enabled cache
- XLA-first TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.GroundingAbout: Torch XLA / PJRT reality History: TPU v6e host bring-up Reference: libtpu / PJRT ownership boundaries guidance with PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample as a selective tool rather than a default
Not supported tightly enough for stronger public copy:
- "CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingAbout: CuTe DSL experiments Example: MegaCpp model wiring example index Reference: TileLang and CuTe boundary is the real production answer and TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample is temporary"
- "every PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.GroundingAbout: Pallas on TPU Example: Pallas kernel selection note Example: XLA Pallas bridge receipt sample rewrite is automatically better than plain XLA lowering"
- "a plausible cache reuse idea is already a shipped optimization"
The rule behind all three is the same one that shows up across MegaCpp posts: keep the proof surface visible. If the checked-in example says "scaffold only," the article should not promote it to "shipped." If the checked-in example says "layout legality," the article should not sell it as a new algorithm.
If you care about which of these kernel decisions actually changed step time rather than compile legality, the measured continuation is Mamba 3 parallel performance.
Frequently asked questions
Why did TileLang win over the CuTe DSL port?+
Why not keep a separate backward split as a hedge?+
What does TMA mean in the concrete checked-in examples?+
Why keep the TPU and CUDA kernel stories in one article?+
Why not treat scan_layers as the default TPU speed trick?+
scan_layers across an incompatible body, and the public-safe TPUQuick term guideTPUGoogle's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels. lane in this cluster already keeps custom PallasQuick term guidePallasJAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract. kernels as selective exceptions rather than as the default traced body. The compact decoder for that boundary is Torch 2.12 XLA breakage matrix.Why is the PsiV cache still treated as experimental?+
Where is the sharding and runtime continuation after this kernel article?+
Where do the non-kernel Mamba seams live in this cluster?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.
CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.
JAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.
A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.
A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
Google's Tensor Processing Unit accelerator/runtime surface, where the important boundary in these posts is usually XLA or PJRT ownership rather than handwritten GPU kernels.
The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
A separate frontend above PJRT/libtpu. In these TPU posts it mainly matters as the owner of NamedSharding, PartitionSpec, and the optional call_jax or Pallas-adjacent bridge lanes.