Topic Hub

MLA Integration, Dispatch, and Weight Absorption

A curated MLA reading path: the weight-absorption contract, Megatron-safe integration boundaries, dispatch and FP8 edges, and the adapter surfaces that keep MLA connected to the rest of the stack.

This hub is for readers who keep seeing MLA across H200, Megatron, and Blackwell posts and need one grounded path through the integration story. Start with the architecture and boundary documents, then move into dispatch or cache surfaces, and finish with the adapter and upstream follow-through.

MLA
weight absorption
dispatch
FP8
Megatron
adapters
Curated set: 9 articles in reading order
Why this hub

Best if you want MLA as a real system boundary with concrete implementation tradeoffs, not just as a glossary term.

Architecture and Boundaries

Read these first to understand what MLA changes and where the clean seams actually are.

  1. 01
    April 18, 2026 · 9 min read · Boris Tamarkin

    MLA weight absorption: what we kept and what we dropped for the C++ specialists

    Multi-Head Latent Attention in production: why DeepSeek's absorbed decode path is the right choice for KV cache, why it is the wrong choice for training, and how the C++ specialist ensemble uses both.

    The core architectural readback for what MLA changes in projection layout, KV handling, and the weight-absorption contract.

    MLA
    Attention
    DeepSeek
    Flash Attention
  2. 02
    April 19, 2026 · 3 min read · David Gornshtein

    Public MLA integration patterns for Megatron

    How MegaCpp keeps MLA-specific compatibility logic behind a narrow adapter seam instead of scattering it through the whole builder path.

    The best integration overview once the concept has to survive real Megatron ownership boundaries.

    MLA
    Megatron
    Attention
    Integration
  3. 03
    April 19, 2026 · 2 min read · David Gornshtein

    Shared MLA adapter boundaries

    Why MegaCpp keeps MLA-specific normalization behind one shared adapter seam instead of leaking MLA conditionals through the whole attention builder stack.

    The practical boundary document for where adapters, shared modules, and MLA responsibilities should stop leaking into each other.

    MLA
    Megatron
    Attention
    Adapters

Kernel, Cache, and Dispatch Surfaces

Once the boundary is clear, these explain the parts of MLA that become hardware-shaped.

  1. 04
    April 18, 2026 · 12 min read · David Gornshtein

    Fused MLA on Hopper and Blackwell: projection, RoPE, and the KV cache that ships

    The NVIDIA side of Multi-Head Latent Attention in the MegaCpp ensemble: a fused down-norm-up projection, a fused split-RoPE-concat Triton kernel, a compressed KV cache, and how it all lands on Megatron-Core.

    The NVIDIA-side implementation lane once MLA has to compete on real H200 and Blackwell hardware.

    MLA
    Triton
    H200
    Blackwell
  2. 05
    April 19, 2026 · 3 min read · David Gornshtein

    Sparse MLA dimension generalization

    Why SparseMLA kernels that hardcode DeepSeek-sized dimensions fail to scale down cleanly to NAM56R-style shapes, and what a generalized contract changes.

    A focused read on how the sparse MLA path generalizes dimensions without turning into one-off kernel glue.

    Sparse MLA
    Dimensions
    Kernels
    NAM56R
  3. 06
    April 19, 2026 · 2 min read · David Gornshtein

    Sparse MLA FP8 dispatch

    Why SparseMLA needs an FP8-aware dispatch contract when Transformer Engine wrappers hide FP8 storage behind a bf16-looking logical surface.

    The dispatch and precision edge where MLA starts colliding with FP8 policy and runtime safety.

    Sparse MLA
    FP8
    Transformer Engine
    Dispatch
  4. 07
    April 18, 2026 · 6 min read · David Gornshtein

    KV Cache and Paged Attention for the MegaCpp Specialist Ensemble

    Per-specialist KV cache layout, MLA cache after weight absorption, paged attention adoption status, and what changes between H200 and GB10, including the MegaCpp serving plan.

    A useful companion once MLA decisions start affecting cache residency and long-context serving surfaces.

    KV Cache
    MLA
    Paged Attention
    FA3

Adapters and Upstream Follow-Through

These complete the story once MLA has to live inside the broader model stack.

  1. 08
    April 18, 2026 · 11 min read · David Gornshtein

    The adapter stack: how LoRA, QLoRA, and hot-swap compose MegaCpp specialists

    The LoRA, QLoRA, DoRA, VeRA, and DyLoRA family behind MegaCpp specialists, the registry and lifecycle that turn adapters into versioned releases, the hot-swap runtime, and the inference-facing API they power.

    The adapter-side companion for understanding how MLA interacts with LoRA-family seams and runtime metadata.

    LoRA
    QLoRA
    Adapters
    PEFT
  2. 09
    April 18, 2026 · 15 min read · David Gornshtein

    Upstream PRs we wrote for Mamba-3, Sparse-MLA, Liger and DSA

    A focused walk-through of the Mamba-3, Sparse-MLA, Liger-Kernel and DSA upstream PRs we have prepared: the bug, the fix, and where each one currently sits.

    What changed when the MLA lane had to move beyond the local tree and into upstream-facing patches.

    Upstream
    Mamba-3
    Sparse MLA
    Liger
