Entity Hub

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

A curated Mamba3 reading path: why MegaCpp kept a hybrid stack, how the kernels evolved across CUDA, TileLang, and TPU, and where the runtime wins actually held.

This hub works best in sequence. Start with the hybrid-model story and the author-level spec, then move into the kernel lane and finally the runtime or cache surfaces that make the Mamba3 path concrete.

mamba3
state-space
MIMO
ssm
cache
scaffold
Curated set
11
Articles in reading order
Why this hub

Best if you want the Mamba3 lane as one connected engineering story instead of scattered kernel notes.

Model and Runtime Contract

Read these first to understand why Mamba3 stayed in the stack.

  1. 01
    April 18, 2026 · 8 min read · David Gornshtein

    Mamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++

    A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and which parts are design choice versus published literature.

    The top-level explanation of why MegaCpp kept a Mamba3-plus-transformer hybrid for C++ workloads.

    Mamba3
    Transformers
    Hybrid
    State Space
  2. 02
    April 18, 2026 · 10 min read · MegaCpp Engineering

    Hybrid Layer Interleaving: Why A/M/E/R Schedules Need Real Execution Plans

    A code-grounded explanation of how interleaved schedules work for NAM52 and NAM56R-style hybrid models, based on hybrid pattern notes, scheduling examples, and authoritative parallelism references.

    The execution-plan companion that explains why A/M/E/R schedules need typed layer roles instead of vague hybrid branding.

    Hybrid Models
    Scheduling
    Mamba
    MoE
  3. 03
    April 19, 2026 · 3 min read · David Gornshtein

    Author Mamba3 spec inside Megatron

    Why an author-pure Mamba3 path still needs an explicit pre-projection RMSNorm when it is wrapped into a Megatron-local Mamba stack.

    A readback of the author-side spec once the model had to fit cleanly into Megatron and the wider stack.

    Mamba3
    Megatron
    RMSNorm
    Spec
  4. 04
    April 19, 2026 · 2 min read · David Gornshtein

    Fail-closed hybrid pattern translation

    Why MegaCpp refuses to silently remap unsupported hybrid block families when translating NAM56R-style patterns into Megatron-native plans.

    The translation boundary for unsupported hybrid block families once the Mamba3 lane has to survive a Megatron-native plan.

    Megatron
    Hybrid Models
    Pattern Translation
    NAM56R
  5. 05
    April 18, 2026 · 14 min read · David Gornshtein

    Mamba 3 Parallel Performance: Where It Beat Attention, and Where It Lost

    MIMO scaling, chunk-size behavior, the PsiV cache trade-off, and an honest tally of where a Mamba 3 hybrid outran pure attention on NVIDIA H200 and where it did not.

    Where the Mamba3 path beat attention, where it lost, and why the answer changes with parallel layout.

    Mamba3
    State Space
    MIMO
    Performance

Kernel Lane

These are the concrete implementation notes once the architecture story is clear.

  1. 07
    April 19, 2026 · 8 min read · David Gornshtein

    Mamba3 MIMO 3D-to-2D shared-memory deep dive

    Why some Mamba3-style kernels need an explicit 3D-to-2D shared-memory legality rewrite before the backend will accept the tile layout.

    A focused deep dive into one of the more hardware-shaped Mamba3 kernel surfaces.

    Mamba3
    SMEM
    TileLang
    Kernels
  2. 08
    April 18, 2026 · 5 min read · David Gornshtein

    Mamba-3 fused trapezoidal scan on TPU v6e

    How we took the Mamba-3 trapezoidal SSM update from a CUDA Triton kernel to a Pallas/XLA-friendly scan on TPU v6e, and what survived the deployment port.

    The TPU-side kernel note once the Mamba3 lane had to survive a very different backend.

    Mamba3
    TPU
    v6e
    Pallas

Cache and Follow-Through

These finish the picture with runtime and adjacent architecture surfaces.

  1. 09
    April 19, 2026 · 4 min read · David Gornshtein

    Mamba3 PsiV cache scaffold

    Why the Mamba3 PsiV cache path is published as a scaffold with a fail-closed gate instead of a silent fallback.

    The cache scaffold and runtime shape decisions that matter once the kernels themselves are in place.

    Mamba3
    Cache
    Scaffold
    Runtime
  2. 10
    April 18, 2026 · 12 min read · David Gornshtein

    M2RNN and Engram: The Memory Subsystem Inside the Hybrid

    Where matrix-state RNN layers, causal n-gram Engram branches, and the learned concept bank fit inside our Mamba 3 + Transformer hybrid, and which pieces remain useful in the public memory stack.

    A useful adjacent read when the Mamba3 story turns into broader memory and recurrence design.

    M2RNN
    Engram
    Memory
    Hybrid
  3. 11
    April 18, 2026 · 15 min read · David Gornshtein

    Upstream PRs we wrote for Mamba-3, Sparse-MLA, Liger and DSA

    A focused walk-through of the Mamba-3, Sparse-MLA, Liger-Kernel and DSA upstream PRs we have prepared: the bug, the fix, and where each one currently sits.

    The upstream follow-through when the Mamba3 lane had to be pushed beyond the local tree.

    Upstream
    Mamba3
    Sparse MLA
    Liger

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.