KV Cache and Paged Attention for the MegaCpp Specialist Ensemble
Per-specialist KV cache layout, MLA cache after weight absorption, paged attention adoption status, and what changes between H200 and GB10 - including the MegaCpp serving plan.

MegaCpp serves eight specialist SLMs behind a router, each holding its own KV cache. The dominant memory line at decode is cache, not weights: per specialist, per request, per token, per layer. This post covers the cache layout we ship, what MLA's compressed latent buys versus standard FA3, where the paged-KV path stands, and what the MegaCpp serving plan does differently on H200 vs GB10.
One distinction matters throughout this article: a block table is the per-step map from logical token positions to physical cache blocks, while prefix caching is the higher-level policy that decides whether already-filled blocks can be reused at all. The narrow checked-in proof surfaces are the Dense FA4 KV-cache decode sample, the Exact mask contract cache sample, the MLA shared adapter sample, and the Mamba3 PsiV cache scaffold.
Why MegaCpp cares about this
Training runs causal attention without a cache, so the cache cost is zero. At serving the picture inverts. Standard FA3 cache for one specialist scales as 2 * B * T_max * H * head_dim * bpe per attention layer; for the depth-52 hybrid with 13 attention layers, batch=8, T_max=8192, H=24, head_dim=128, bf16, that is ~10 GB per specialist. Eight specialists co-resident want the better part of an H200. Two things change the picture. The hybrid pattern: only attention layers carry a KV cache; Mamba layers carry an SSM state cache whose size is constant in sequence length, not O(T_max * H * head_dim). MLA: the compressed-latent cache replaces full per-head K and V with one kv_lora_rank-wide latent plus a small RoPE'd key fragment.
Paged attention sits on top as the substrate for shared block pools with prefix sharing.
What we built in the MegaCpp training stack
Three KV cache implementations live in the MegaCpp training stack.
The contiguous FA3 cache layout uses tensors shaped (n_cache_slots, B, T_max, H, head_dim) for K and V (FA3-style: time before heads). Position is tracked per batch element via a cache_seqlens int32 tensor; flash_attn_with_kvcache writes new K and V into the cache in place at those offsets. attn_layer_map maps global layer index to cache slot index, allocating slots only for attention layers in a hybrid pattern. On depth-52 with 13 attention layers, this drops cache memory by ~75% vs "one slot per layer".
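The slot remap can be sketched in a few lines. This is an illustration only: the helper name `build_attn_layer_map` and the "every fourth layer is attention" pattern are assumptions chosen to give 13 attention layers out of 52, not the actual layer pattern.

```python
def build_attn_layer_map(is_attention):
    """Map global layer index -> cache slot index, for attention layers only."""
    layer_map = {}
    for layer_idx, is_attn in enumerate(is_attention):
        if is_attn:
            # Slots are handed out densely in layer order.
            layer_map[layer_idx] = len(layer_map)
    return layer_map


# Assumption: depth 52, one attention layer closing each group of four.
pattern = [(i % 4) == 3 for i in range(52)]
attn_layer_map = build_attn_layer_map(pattern)

n_cache_slots = len(attn_layer_map)
savings = 1 - n_cache_slots / len(pattern)
print(n_cache_slots, f"{savings:.0%}")  # 13 75%
```

Non-attention layers simply have no entry in the map, so a lookup miss doubles as a "this layer has no KV slot" signal.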
The MLA KV cache is the compressed-latent layout. Per attention layer it stores low_rank_caches[layer_idx] of shape (B, T_max, kv_lora_rank) and rope_caches[layer_idx] of shape (B, T_max, 1, qk_rope_head_dim). With kv_lora_rank=512 and qk_rope_head_dim=64, each cached token costs 2 * (512 + 64) = 1152 bytes per layer in bf16, vs 2 * 24 * 128 * 2 = 12,288 bytes per layer for standard MHA - about 10x compression before any quantisation.
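The per-token figures follow directly from the per-token slices of those shapes. A small sketch that derives them from shapes rather than hard-coding the arithmetic; `bytes_per_token` is an illustrative helper, not part of the stack:

```python
from math import prod

BF16_BYTES = 2


def bytes_per_token(shapes_per_token, bpe=BF16_BYTES):
    """Sum element counts of the per-token slices and convert to bytes."""
    return sum(prod(shape) for shape in shapes_per_token) * bpe


kv_lora_rank, qk_rope_head_dim = 512, 64
num_heads, head_dim = 24, 128

# MLA: one latent row plus one shared RoPE key fragment per token.
mla = bytes_per_token([(kv_lora_rank,), (1, qk_rope_head_dim)])
# Standard MHA: full per-head K and V per token.
mha = bytes_per_token([(num_heads, head_dim), (num_heads, head_dim)])

print(mla, mha, round(mha / mla, 2))  # 1152 12288 10.67
```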
The quantized KV variant is the full-K/V compression lane we evaluated for non-MLA specialists. It stays separate from the MLA cache layout: the quantizer expects full per-head K and V tensors, not the compressed latent. In the current serving stack, MLAKVCache is the production choice and the full-K/V quantized lane remains optional for non-MLA shapes.
The KV update path branches on cache kind. MLA.forward stores (low_rank_main, key_rope) into the MLA cache view and reads it back through the same per-layer accessor. The standard FA3 path uses flash_attn_with_kvcache with cache_seqlens and the layer's (k_cache, v_cache) slot. Both advance after the last layer.
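The "advance after the last layer" rule is the part that is easy to get wrong, so here is a toy sketch of it. `ContiguousCacheSketch` and `decode_step` are hypothetical stand-ins that use Python lists instead of tensors; they only model the shared write position, not attention itself.

```python
class ContiguousCacheSketch:
    """Toy contiguous cache: one token list per (slot, sequence)."""

    def __init__(self, n_cache_slots, batch_size):
        self.slots = [[[] for _ in range(batch_size)] for _ in range(n_cache_slots)]
        self.cache_seqlens = [0] * batch_size  # shared write position per sequence

    def write(self, slot, batch_idx, kv):
        pos = self.cache_seqlens[batch_idx]
        seq = self.slots[slot][batch_idx]
        # Every layer must see the same position within one decode step.
        assert len(seq) == pos, "layer wrote at a stale or already-advanced position"
        seq.append(kv)

    def advance(self, n_tokens=1):
        self.cache_seqlens = [p + n_tokens for p in self.cache_seqlens]


def decode_step(cache, step):
    for slot in range(len(cache.slots)):
        for b in range(len(cache.cache_seqlens)):
            cache.write(slot, b, (step, slot))
    cache.advance()  # once per step, after the last attention layer


cache = ContiguousCacheSketch(n_cache_slots=13, batch_size=2)
for step in range(3):
    decode_step(cache, step)
print(cache.cache_seqlens)  # [3, 3]
```

Advancing inside the layer loop would trip the assertion on the second slot, which is the failure mode the single end-of-step advance avoids.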
Paged attention: what is wired, what is deferred
The paged path lives under a paged-KV block manager and adapter pair. The block pool is (num_blocks, n_cache_layers, block_size, num_heads, head_dim) for both K and V, allocated once at engine startup. Sequences hold a list of block indices; the adapter materialises a (B, max_blocks_per_seq) int32 block_table that FA3's paged-KV decode path reads directly. Free blocks live in a LIFO list for O(1) alloc/free; a prefix-cache dictionary maps content-based keys to block indices, with reference counting and LRU eviction for back-pressure.
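The bookkeeping described above fits in a short sketch. Everything here is illustrative: the class and method names are hypothetical, the pool tensors are omitted, prefix keys are treated as opaque strings, and padding unused block-table entries with -1 is an assumption.

```python
from collections import OrderedDict


class BlockManagerSketch:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # LIFO: pop/append at the tail
        self.refcount = [0] * num_blocks
        self.prefix = OrderedDict()          # content key -> block, oldest first
        self.cached = set()                  # blocks currently held by the prefix cache

    def alloc(self):
        if not self.free:
            self._evict_one()
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        # Prefix-cached blocks stay resident until LRU eviction reclaims them.
        if self.refcount[block] == 0 and block not in self.cached:
            self.free.append(block)

    def publish(self, key, block):
        self.prefix[key] = block
        self.cached.add(block)

    def lookup(self, key):
        block = self.prefix.get(key)
        if block is not None:
            self.prefix.move_to_end(key)  # mark as most recently used
            self.refcount[block] += 1
        return block

    def _evict_one(self):
        victim = next((k for k, b in self.prefix.items() if self.refcount[b] == 0), None)
        if victim is None:
            raise MemoryError("no free or evictable KV blocks")
        block = self.prefix.pop(victim)
        self.cached.discard(block)
        self.free.append(block)


def build_block_table(seq_blocks, pad=-1):
    """Pad per-sequence block lists into a (B, max_blocks_per_seq) table."""
    width = max(len(blocks) for blocks in seq_blocks)
    return [blocks + [pad] * (width - len(blocks)) for blocks in seq_blocks]


mgr = BlockManagerSketch(num_blocks=4)
seq_a = [mgr.alloc(), mgr.alloc()]
mgr.publish("prefix-0", seq_a[0])
seq_b = [mgr.lookup("prefix-0"), mgr.alloc()]  # shares the first block with seq_a
seq_c = [mgr.alloc()]
table = build_block_table([seq_a, seq_b, seq_c])
print(table)  # [[3, 2], [3, 1], [0, -1]]
```

In a real adapter the table would be an int32 tensor on the device; the list form here only shows the indirection.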
Two adapter modes. When block_manager.block_size % 256 == 0, the FA3 zero-copy path can pass the pool tensors plus block_table straight into flash_attn_with_kvcache. Below 256 the adapter falls back to a gathered per-layer view: correct, but no longer zero-copy.
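A sketch of the mode choice and of what the gather fallback has to do, using nested lists in place of pool tensors. `select_adapter_mode` and `gather_sequence` are hypothetical names; the real fallback would gather tensors per layer rather than loop over tokens.

```python
def select_adapter_mode(block_size, alignment=256):
    """Zero-copy only when the block size meets the kernel's alignment rule."""
    return "zero_copy" if block_size % alignment == 0 else "gather"


def gather_sequence(pool_layer, block_row, seqlen, block_size):
    """Copy one sequence's tokens out of the block pool into contiguous order."""
    out = []
    for pos in range(seqlen):
        block = block_row[pos // block_size]          # which physical block
        out.append(pool_layer[block][pos % block_size])  # offset inside it
    return out


block_size = 4
# Toy pool for one layer: 3 blocks of 4 token slots each.
pool_layer = [[f"b{b}t{t}" for t in range(block_size)] for b in range(3)]
gathered = gather_sequence(pool_layer, block_row=[2, 0], seqlen=6, block_size=block_size)

print(select_adapter_mode(256), select_adapter_mode(64))  # zero_copy gather
print(gathered)  # ['b2t0', 'b2t1', 'b2t2', 'b2t3', 'b0t0', 'b0t1']
```

The copy in `gather_sequence` is exactly the cost the zero-copy path avoids: the kernel follows the block table itself instead of reading a materialised contiguous view.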
Adoption status: the paged-KV block manager is structurally complete but deferred from production. The contiguous FA3 path is the production substrate today. Paged-KV serving is wired through the model-level paged adapter contract with explicit runtime checks, but the execute-and-prove step for the scheduler-managed paged decode lane is still open.
Constraints we hit: page_block_size must be a multiple of 256 for FA3 zero-copy, smaller blocks improved prefix-sharing granularity but lost the fast path, and the fall-back gather is a measurable hit at high QPS. Prefix cache identity is content-based, so the same prompt across different specialists or LoRA bundles is not mistakenly aliased. Diagnostic counters for block reads, writes, and advances are non-optional. Reuse is block-granular, not token-granular.
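One way to get content-based keys that stay separate across specialists and LoRA bundles is to chain block hashes from a namespace seed. The sketch below is an assumption about how such a key could be built, not the stack's actual key format; `block_keys`, `reusable_tokens`, and the namespace strings are all hypothetical.

```python
import hashlib


def block_keys(token_ids, block_size, namespace):
    """Chained content keys, one per *full* block of the prompt."""
    keys = []
    parent = namespace.encode()  # assumption: specialist + LoRA identity seeds the chain
    for i in range(len(token_ids) // block_size):
        chunk = token_ids[i * block_size:(i + 1) * block_size]
        digest = hashlib.sha256(parent + b"|" + ",".join(map(str, chunk)).encode())
        parent = digest.digest()  # each key depends on the whole prefix before it
        keys.append(digest.hexdigest())
    return keys


def reusable_tokens(cached_keys, token_ids, block_size, namespace):
    """Count prompt tokens covered by already-cached leading blocks."""
    hits = 0
    for key in block_keys(token_ids, block_size, namespace):
        if key not in cached_keys:
            break
        hits += 1
    return hits * block_size


block_size = 4
prompt = list(range(10))  # two full blocks plus a 2-token tail
cached = set(block_keys(prompt, block_size, "spec0/lora-a"))

same_prefix = reusable_tokens(cached, list(range(8)) + [99, 100], block_size, "spec0/lora-a")
partial = reusable_tokens(cached, list(range(6)) + [99] * 4, block_size, "spec0/lora-a")
other_ns = reusable_tokens(cached, prompt, block_size, "spec1/lora-a")
print(same_prefix, partial, other_ns)  # 8 4 0
```

The middle case shows the block-granular limit: six tokens match, but only the one fully matching block (four tokens) is reusable.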
Per-specialist budget
For one specialist on one H200 device - depth 52, 13 attention layers, MLA with kv_lora_rank=512, qk_rope_head_dim=64, bf16 cache, T_max=8192:
Per-token MLA cost per layer: (512 + 64) * 2 = 1152 bytes. Across 13 attention layers per token: about 15 KB. Per sequence at T_max=8192: about 120 MB. At batch=8 concurrent serving: just under 1 GB per specialist. Eight specialists co-resident: roughly 7-8 GB cache for the whole ensemble at full T_max and batch=8. Fits alongside FP8 weights and the working set on one H200.
Same shape with standard MHA: 2 * 24 * 128 * 2 = 12,288 bytes per token per layer, about 160 KB per token across 13 attention layers, about 1.3 GB per sequence at T_max=8192, about 10 GB per specialist at batch=8, about 80 GB across eight. That is why MLA matters for serving workloads regardless of training considerations.
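The two budgets can be rolled up with the same helper, which makes it easy to re-run the numbers for a different batch or T_max. `cache_budget` is an illustrative function; sizes are reported in decimal GB.

```python
def cache_budget(bytes_per_token_layer, n_attn_layers=13, t_max=8192,
                 batch=8, n_specialists=8):
    """Roll a per-token, per-layer cost up to the whole ensemble (bytes)."""
    per_token = bytes_per_token_layer * n_attn_layers
    per_sequence = per_token * t_max
    per_specialist = per_sequence * batch
    return {
        "per_token": per_token,
        "per_sequence": per_sequence,
        "per_specialist": per_specialist,
        "ensemble": per_specialist * n_specialists,
    }


mla_budget = cache_budget(1152)    # (512 + 64) * 2 bytes
mha_budget = cache_budget(12288)   # 2 * 24 * 128 * 2 bytes

for name, budget in (("MLA", mla_budget), ("MHA", mha_budget)):
    row = ", ".join(f"{k}={v / 1e9:.3f} GB" for k, v in budget.items())
    print(f"{name}: {row}")
```

At these settings the MLA ensemble lands a little under 8 GB and the MHA ensemble a little over 80 GB, matching the rounded figures above.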
The Mamba state cache holds SSM state per Mamba layer at O(d_state * n_groups * d_inner). Constant in T_max, a few MB per Mamba layer per request - negligible vs attention cache, but a real line. The DSA indexer K cache is a separate small cache for the DSA indexer's per-layer K tensor on DSA-enabled specialists.
H200 vs GB10
Three things change between H200 and GB10.
Headroom: GB10's unified memory pool is comparable in size to H200's HBM but has far lower bandwidth, and unified memory means CPU and GPU contend for the same pool, so expandable-segment allocation is on by default.
Attention backend: H200 runs FA3, while GB10 with sm_121a lacks the same fast-path coverage and keeps the bounded dense decode lane as the production target; the GB10 stack parity write-up for MegaCpp covers that consumer-Blackwell target split.
Precision: GB10 ships NVFP4-capable weights, but the cache remains bf16 because the 4-bit full-K/V path is not the same thing as a validated MLA-cache compression lane.
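These differences can be captured as a per-device profile. The sketch below is a hypothetical configuration shape, not the stack's actual config: the `ServingProfile` fields, the backend labels, and the H200 value for `expandable_segments` (not stated above) are assumptions. `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is the PyTorch allocator setting and has to be in the environment before CUDA is initialised.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingProfile:
    attention_backend: str     # label only; hypothetical names
    cache_dtype: str
    expandable_segments: bool


PROFILES = {
    # Assumption: the H200 allocator setting is not given in the text; left off here.
    "h200": ServingProfile("fa3", "bf16", False),
    "gb10": ServingProfile("dense_decode", "bf16", True),
}


def apply_profile(device, env=os.environ):
    """Return the profile and set allocator env vars it requires."""
    profile = PROFILES[device]
    if profile.expandable_segments:
        # Respect an operator override if one is already present.
        env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    return profile


demo_env = {}
gb10 = apply_profile("gb10", demo_env)
print(gb10.attention_backend, gb10.cache_dtype, demo_env)
```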
How it lands in MegaCpp
The serving plan keeps the MLA KV cache as the per-specialist cache type, contiguous FA3 as today's substrate, and the paged path as the deferred-but-wired track B. The eight-specialist ensemble runs above eight cache domains rather than collapsing everything into one generic cache manager.
The production MLA helper in the Megatron stack keeps the same KV layout as the MegaCpp training stack. The DSA path uses an absorbed-MLA variant only on the decode lane, while training MLA stays on the standard expand-then-attend formulation. Production traffic remains on contiguous FA3 until the scheduler-managed paged sparse decode lane clears validation.
Frequently asked questions
Why is paged attention not the default if the implementation exists?
Why not make the 4-bit KV lane the MLA default?
How fine-grained is prefix caching in practice?
Does the per-specialist layout table mean paged KV is the production default?
What protects prefix-cache reuse when LoRA bundles churn?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
KV cache: The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
Paged attention: The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.
Block table: The per-sequence integer map from logical token ranges to the physical KV blocks the decode kernel should read next.
Prefix cache: The reuse policy above paged KV that lets later requests point at already-filled cache blocks when they begin with the same validated token prefix.
Attention-layer map: The layer-index remap that allocates cache slots only for attention-bearing layers in a hybrid stack instead of reserving one slot per layer unconditionally.
MLA: Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.
Serving: How we actually serve an eight-specialist C++ ensemble: a top-level router, per-specialist continuous-batch schedulers, paged KV per model,…
H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
GB10: Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
sm_121a: Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.