Topic Hub

MLA Integration, Dispatch, and Weight Absorption

A curated reading path for Multi-Head Latent Attention (MLA): the weight-absorption contract, Megatron-safe integration boundaries, dispatch and FP8 edges, and the adapter surfaces that keep MLA connected to the rest of the stack.

This hub is for readers who keep seeing MLA across H200, Megatron, and Blackwell posts and need one grounded path through the integration story. Start with the architecture and boundary documents, then move into dispatch or cache surfaces, and finish with the adapter and upstream follow-through.

MLA

weight absorption

dispatch

FP8

Megatron

adapters

Curated set

Articles in reading order

Why this hub

Best if you want MLA treated as a real system boundary with concrete implementation tradeoffs, not just as a glossary term.

Architecture and Boundaries

Read these first to understand what MLA changes and where the clean seams actually are.

01
April 18, 2026•9 min read•Boris Tamarkin
MLA weight absorption: what we kept and what we dropped for the C++ specialists
Multi-Head Latent Attention in production: why DeepSeek's absorbed decode path is the right choice for KV cache, why it is the wrong choice for training, and how the C++ specialist ensemble uses both.
The core architectural readback for what MLA changes in projection layout, KV handling, and the weight-absorption contract.
MLA
Attention
DeepSeek
Flash Attention
Read article
02
April 19, 2026•3 min read•David Gornshtein
Public MLA integration patterns for Megatron
How MegaCpp keeps MLA-specific compatibility logic behind a narrow adapter seam instead of scattering it through the whole builder path.
The best integration overview once the concept has to survive real Megatron ownership boundaries.
MLA
Megatron
Attention
Integration
Read article
03
April 19, 2026•2 min read•David Gornshtein
Shared MLA adapter boundaries
Why MegaCpp keeps MLA-specific normalization behind one shared adapter seam instead of leaking MLA conditionals through the whole attention builder stack.
The practical boundary document for where adapters, shared modules, and MLA responsibilities should stop leaking into each other.
MLA
Megatron
Attention
Adapters
Read article

Kernel, Cache, and Dispatch Surfaces

Once the boundary is clear, these explain the parts of MLA that become hardware-shaped.

04
April 18, 2026•12 min read•David Gornshtein
Fused MLA on Hopper and Blackwell: projection, RoPE, and the KV cache that ships
The NVIDIA side of Multi-Latent Attention in the MegaCpp ensemble: a fused down-norm-up projection, a fused split-RoPE-concat Triton kernel, a compressed KV cache, and how it all lands on Megatron-Core.
The NVIDIA-side implementation lane once MLA has to compete on real H200 and Blackwell hardware.
MLA
Triton
H200
Blackwell
Read article
05
April 19, 2026•3 min read•David Gornshtein
Sparse MLA dimension generalization
Why SparseMLA kernels that hardcode DeepSeek-sized dimensions fail to scale down cleanly to NAM56R-style shapes, and what a generalized contract changes.
A focused read on how the sparse MLA path generalizes dimensions without turning into one-off kernel glue.
Sparse MLA
Dimensions
Kernels
NAM56R
Read article
06
April 19, 2026•2 min read•David Gornshtein
Sparse MLA FP8 dispatch
Why SparseMLA needs an FP8-aware dispatch contract when Transformer Engine wrappers hide FP8 storage behind a bf16-looking logical surface.
The dispatch and precision edge where MLA starts colliding with FP8 policy and runtime safety.
Sparse MLA
FP8
Transformer Engine
Dispatch
Read article
07
April 18, 2026•6 min read•David Gornshtein
KV Cache and Paged Attention for the MegaCpp Specialist Ensemble
Per-specialist KV cache layout, MLA cache after weight absorption, paged attention adoption status, and what changes between H200 and GB10, including the MegaCpp serving plan.
A useful companion once MLA decisions start affecting cache residency and long-context serving surfaces.
KV Cache
MLA
Paged Attention
FA3
Read article

Adapters and Upstream Follow-Through

These complete the story once MLA has to live inside the broader model stack.

08
April 18, 2026•11 min read•David Gornshtein
The adapter stack: how LoRA, QLoRA, and hot-swap compose MegaCpp specialists
The LoRA, QLoRA, DoRA, VeRA, and DyLoRA family behind MegaCpp specialists, the registry and lifecycle that turn adapters into versioned releases, the hot-swap runtime, and the inference-facing API they power.
The adapter-side companion for understanding how MLA interacts with LoRA-family seams and runtime metadata.
LoRA
QLoRA
Adapters
PEFT
Read article
09
April 18, 2026•15 min read•David Gornshtein
Upstream PRs we wrote for Mamba-3, Sparse-MLA, Liger and DSA
A focused walk-through of the Mamba-3, Sparse-MLA, Liger-Kernel and DSA upstream PRs we have prepared: the bug, the fix, and where each one currently sits.
What changed when the MLA lane had to move beyond the local tree and into upstream-facing patches.
Upstream
Mamba3
Sparse MLA
Liger
Read article

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.

GB10 and Blackwell Bring-Up

Modal Training and Benchmark Operations

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

Evaluation, Benchmarks, and Verifier Loops

Megatron Parallelism and Layout Boundaries

TPU Sparse Attention and Pallas Kernels

H200 Training and Kernel Bring-Up

TPU v6e and XLA Runtime Surfaces

C++ Data Pipelines and Corpus Packaging

MoE, Routing, and Distributed Model Splits