Entity Hub

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

A curated Mamba3 reading path: why MegaCpp kept a hybrid stack, how the kernels evolved across CUDA, TileLang, and Pallas on TPU, and where the runtime wins actually held.

This hub works best in sequence. Start with the hybrid-model story and the author-level spec, then move into the kernel lane, and finish with the runtime and cache surfaces that make the Mamba3 path concrete.

mamba3

state-space

MIMO

ssm

cache

scaffold

Curated set

Articles in reading order

Why this hub

Best if you want the Mamba3 lane as one connected engineering story instead of scattered kernel notes.

Model and Runtime Contract

Read these first to understand why Mamba3 stayed in the stack.

01
April 18, 2026•8 min read•David Gornshtein
Mamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++
A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and which parts are design choice versus published literature.
The top-level explanation of why MegaCpp kept a Mamba3-plus-transformer hybrid for C++ workloads.
Mamba3
Transformers
Hybrid
State Space
02
April 18, 2026•10 min read•MegaCpp Engineering
Hybrid Layer Interleaving: Why A/M/E/R Schedules Need Real Execution Plans
A code-grounded explanation of how interleaved schedules work for NAM52 and NAM56R-style hybrid models, based on hybrid pattern notes, scheduling examples, and authoritative parallelism references.
The execution-plan companion that explains why A/M/E/R schedules need typed layer roles instead of vague hybrid branding.
Hybrid Models
Scheduling
Mamba
MoE
03
April 19, 2026•3 min read•David Gornshtein
Author Mamba3 spec inside Megatron
Why an author-pure Mamba3 path still needs an explicit pre-projection RMSNorm when it is wrapped into a Megatron-local Mamba stack.
The author-side spec readback once the model had to fit Megatron and the wider stack honestly.
Mamba3
Megatron
RMSNorm
Spec
04
April 19, 2026•2 min read•David Gornshtein
Fail-closed hybrid pattern translation
Why MegaCpp refuses to silently remap unsupported hybrid block families when translating NAM56R-style patterns into Megatron-native plans.
The translation boundary for unsupported hybrid block families once the Mamba3 lane has to survive a Megatron-native plan.
Megatron
Hybrid Models
Pattern Translation
NAM56R
05
April 18, 2026•14 min read•David Gornshtein
Mamba 3 Parallel Performance: Where It Beat Attention, and Where It Lost
MIMO scaling, chunk-size behavior, the PsiV cache trade-off, and an honest tally of where a Mamba 3 hybrid outran pure attention on NVIDIA H200 and where it did not.
Where the Mamba3 path beat attention, where it lost, and why the answer changes with parallel layout.
Mamba3
State Space
MIMO
Performance

Kernel Lane

These are the concrete implementation notes once the architecture story is clear.

06
April 18, 2026•8 min read•David Gornshtein
The Mamba 3 Kernel Journey: CUDA, Pallas, TileLang, and an Honest Look at CuTe DSL
How the Mamba 3 kernel stack works in MegaCpp: TileLang on H200, Pallas on TPU v6e, a CuTe DSL port that was evaluated but not adopted, and what each attempt showed.
The end-to-end kernel story across CUDA, Pallas, TileLang, and CuTe DSL tradeoffs.
Mamba3
CUDA
TileLang
Pallas
07
April 19, 2026•8 min read•David Gornshtein
Mamba3 MIMO 3D-to-2D shared-memory deep dive
Why some Mamba3-style kernels need an explicit 3D-to-2D shared-memory legality rewrite before the backend will accept the tile layout.
A focused deep dive into one of the more hardware-shaped Mamba3 kernel surfaces.
Mamba3
SMEM
TileLang
Kernels
08
April 18, 2026•5 min read•David Gornshtein
Mamba-3 fused trapezoidal scan on TPU v6e
How we took the Mamba-3 trapezoidal SSM update from a CUDA Triton kernel to a Pallas/XLA-friendly scan on TPU v6e, and what survived the deployment port.
The TPU-side kernel note once the Mamba3 lane had to survive a very different backend.
Mamba3
TPU
v6e
Pallas

Cache and Follow-Through

These finish the picture with runtime and adjacent architecture surfaces.

09
April 19, 2026•4 min read•David Gornshtein
Mamba3 PsiV cache scaffold
Why the Mamba3 PsiV cache path is published as a scaffold with a fail-closed gate instead of a silent fallback.
The cache scaffold and runtime shape decisions that matter once the kernels themselves are in place.
Mamba3
Cache
Scaffold
Runtime
10
April 18, 2026•12 min read•David Gornshtein
M2RNN and Engram: The Memory Subsystem Inside the Hybrid
Where matrix-state RNN layers, causal n-gram Engram branches, and the learned concept bank fit inside our Mamba 3 + Transformer hybrid — and which pieces remain useful in the public memory stack.
A useful adjacent read when the Mamba3 story turns into broader memory and recurrence design.
M2RNN
Engram
Memory
Hybrid
11
April 18, 2026•15 min read•David Gornshtein
Upstream PRs we wrote for Mamba-3, Sparse-MLA, Liger and DSA
A focused walk-through of the Mamba-3, Sparse-MLA, Liger-Kernel and DSA upstream PRs we have prepared: the bug, the fix, and where each one currently sits.
The upstream follow-through when the Mamba3 lane had to be pushed beyond the local tree.
Upstream
Mamba3
Sparse-MLA
Liger

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.

GB10 and Blackwell Bring-Up

Modal Training and Benchmark Operations

MLA Integration, Dispatch, and Weight Absorption

Evaluation, Benchmarks, and Verifier Loops

Megatron Parallelism and Layout Boundaries

TPU Sparse Attention and Pallas Kernels

H200 Training and Kernel Bring-Up

TPU v6e and XLA Runtime Surfaces

C++ Data Pipelines and Corpus Packaging

MoE, Routing, and Distributed Model Splits