MegaCpp Engineering · Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 13 min read · David Gornshtein
Inference
Serving
Ensemble
KV Cache
Scheduler
SLO

Serving the eight: router, per-specialist scheduler, and the KV layout that keeps them honest

How we actually serve an eight-specialist C++ ensemble: a top-level router, per-specialist continuous-batch schedulers, paged KV per model, admission control, and the SLOs we publish.

The MegaCpp ensemble is eight specialists, not one model. That single architectural fact reshapes every serving decision we make. A monolithic generalist can often live inside one vLLM-style model engine with one scheduler, one KV-cache manager, and one admission domain. Our ensemble still borrows that lower-level serving pattern inside each specialist, but not above them: the router sits one layer higher and keeps the eight models out of one shared scheduling or eviction domain.
This post is about the stack above the kernels: router, per-specialist scheduler, KV layout, admission control, and published SLOs. Kernel choices and NVFP4 layout are covered elsewhere, the cache budget underneath this control plane is spelled out in KV cache and paged attention, and the mixed attention-plus-SSM model shape behind several specialists is expanded in Hybrid Layer Interleaving.

For a first read, six terms do most of the work in this article. A router is the small front-door classifier plus rules layer that decides which specialist should answer a request. A model engine is the per-model serving runtime that owns decode for one loaded model replica; in vLLM-style systems it is usually where batching, KV residency, and token generation meet. Continuous batching means the scheduler keeps mixing decode steps from requests that are already resident instead of waiting for one request to finish before admitting the next. A paged KV pool means the attention cache is stored in fixed-size token blocks that can be reused and evicted independently instead of as one giant contiguous slab per sequence.
A prefix cache is the hash-indexed map from an already-seen prompt prefix to those existing KV blocks. Admission control is the explicit "can this request fit and still meet its deadline?" gate that runs before decode. TTFT is time-to-first-token, ITL is inter-token latency, and the article's SLOs are the latency and correctness bounds surfaced to callers. The quickest checked-in companions are the MegaCpp example index, dense FA4 KV-cache decode sample, exact-token sparse telemetry sample, and FA4 receipt summary sample.

Why this matters

An ensemble exposes the same product surface as a single model and is much harder to serve well. The cost of a bad serving design is not a slow demo; it is silent specialist starvation, KV evictions that cascade across replicas, tail latency that wanders by intent class, and a debugging story where nobody can answer "which model actually answered this request" without grepping logs. We learned the hard way that the only way to make eight specialists feel like one product is to keep their boundaries crisp inside the serving stack: one router on top, one scheduler per specialist, one paged-KV pool per specialist, one published SLO per specialist.
This post is the design that survived two redesigns, and its cache-isolation assumptions line up with the residency story in KV cache and paged attention.

1. The shape of the problem

An incoming request is a chat-like blob of C++ context: a prompt, optional repo snippets, optional tool outputs, and a caller-declared intent (codegen, debug, build-fix, review, or unspecified). It arrives with a priority integer, an optional deadline, and a preferred decoding policy (greedy, typical, temperature-sampled). The serving stack has to decide three things before any token comes out:

  1. Which specialist (or specialists) answer this request.
  2. Which instance of that specialist takes it, on which GPU, in which batch.
  3. How its KV footprint is paid for, and what to preempt if it will not fit.

Each decision has its own timescale. Routing is per-request and happens once. Scheduling is per-token and happens thousands of times per second. KV allocation sits underneath scheduling and decides whether the next decode step can even run. We kept the three decisions in three different components on purpose, because conflating them is how serving systems end up with router logic leaking into kernel dispatch.

2. The top-level router

The router is a small model plus a rules layer. The small model is a distilled classifier in the tens of millions of parameters, trained on labeled dogfood traffic and curated examples from our enriched-corpus dataset family; it produces a distribution over the eight specialists plus a reject class. The rules layer overrides it in two cases: when the caller declared an intent that maps directly to a specialist (debug traces always go to Debug-SLM, build files to Build-SLM), and when the prompt trips a structural detector (SFINAE-heavy headers route to Template-SLM regardless of the classifier).
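
The override order above can be sketched in a few lines. This is a minimal illustration, not the production router: the `Request` type, the `INTENT_OVERRIDES` map, and the toy `looks_sfinae_heavy` detector are all hypothetical names, and mapping the `build-fix` intent to Build-SLM is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical intent-to-specialist map; the article only names Debug-SLM and Build-SLM.
INTENT_OVERRIDES = {"debug": "Debug-SLM", "build-fix": "Build-SLM"}


@dataclass
class Request:
    prompt: str
    intent: Optional[str] = None  # caller-declared intent, may be absent


def looks_sfinae_heavy(prompt):
    # Toy stand-in for the structural detector; the real detector is not described.
    return prompt.count("enable_if") + prompt.count("decltype") >= 3


def route(req, classifier_probs):
    """Return a specialist name, or "reject" if that class wins the classifier."""
    # Rule 1: a declared intent that maps directly to a specialist wins.
    if req.intent in INTENT_OVERRIDES:
        return INTENT_OVERRIDES[req.intent]
    # Rule 2: a structural detector hit overrides the classifier.
    if looks_sfinae_heavy(req.prompt):
        return "Template-SLM"
    # Otherwise take the classifier's top-1 over specialists plus the reject class.
    return max(classifier_probs, key=classifier_probs.get)
```

Keeping the rules ahead of the classifier call means a declared intent never depends on classifier calibration.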

Shadow dispatch and what the router does not do

The router outputs a primary specialist and, for high-stakes requests, a shadow specialist. The shadow is only dispatched when the primary's top-1 probability is below a threshold tuned per intent. For pure codegen traffic the threshold is low and we almost never shadow. For review traffic, where the cost of a wrong specialist is a confidently wrong review, the threshold is high and we shadow more often.
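
As a rough sketch of that threshold rule, assuming the classifier output is a plain specialist-to-probability map: the threshold values and the `pick_primary_and_shadow` name are hypothetical, since the article only says the codegen threshold is low and the review threshold is high.

```python
# Hypothetical per-intent thresholds; only their relative order comes from the article.
SHADOW_THRESHOLD = {"codegen": 0.40, "review": 0.85}
DEFAULT_THRESHOLD = 0.60  # assumed fallback for other intents


def pick_primary_and_shadow(probs, intent):
    """Return (primary, shadow_or_None) from a specialist -> probability map."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    primary, top1 = ranked[0]
    threshold = SHADOW_THRESHOLD.get(intent, DEFAULT_THRESHOLD)
    # Shadow only when the primary's confidence falls below the intent's threshold.
    if top1 < threshold and len(ranked) > 1:
        return primary, ranked[1][0]
    return primary, None
```

With a top-1 probability of 0.55, this sketch shadows a review request but not a codegen request.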

That shadow traffic is not throwaway work. It is the calibration stream for later threshold tuning and promotion decisions: when the primary and shadow disagree, the router gets a labeled example of where its confidence boundary was too loose or too strict without trying to relearn on the live request path.

A few things the router deliberately does not do. No token-level reassignment — once routed, a request stays on its specialist for the whole generation. No cross-specialist output stitching; the quality penalty is visible. No online learning from real-time feedback; the classifier is retrained offline on labeled traffic.

3. One scheduler per specialist, not one across all of them

The instinct is to run one global scheduler that picks the best GPU for each request across all specialists. We built that first and threw it away. Specialists have different KV-per-token footprints (hybrid ratios and head counts vary), different maximum contexts, and different ideal decode batch sizes. A global scheduler has to carry all of that simultaneously; admission becomes a constraint solver and tail latency gets worse, not better, because requests sit behind decisions they have no dependency on.

Each specialist instance now runs its own continuous-batching scheduler. The in-repo primitive is a continuous-batch scheduler that sits between incoming requests and a paged-KV block manager for that specialist. Its contract is small and explicit, and it is only practical because each specialist keeps its own KV budget instead of sharing one giant pool as discussed in KV cache and paged attention:

  • Hold a waiting queue ordered by (priority DESC, arrival ASC).
  • Try prefix-cache reuse before allocating fresh blocks.
  • Admit a request only if enough free blocks exist to cover its prompt plus at least one decode step.
  • Preempt the lowest-priority running sequence when a strictly-higher-priority request is waiting.
  • Group within the scheduler by adapter identity, because adapter swaps are the second most expensive thing we do after KV eviction.
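
A minimal sketch of the queue ordering, admission check, and preemption rule from the contract above. The class and method names are hypothetical, prefix-cache reuse and adapter grouping are left out for brevity, and a real scheduler would re-queue the preempted sequence instead of dropping it.

```python
import heapq
import itertools
import math


class SpecialistScheduler:
    def __init__(self, total_blocks, block_size):
        self.free_blocks = total_blocks
        self.block_size = block_size
        self._arrival = itertools.count()
        self.waiting = []   # heap of (-priority, arrival, req_id, prompt_len)
        self.running = {}   # req_id -> (priority, blocks_held)

    def submit(self, req_id, priority, prompt_len):
        # Negated priority gives (priority DESC, arrival ASC) ordering on a min-heap.
        heapq.heappush(self.waiting, (-priority, next(self._arrival), req_id, prompt_len))

    def blocks_needed(self, prompt_len):
        # Prompt plus at least one decode step.
        return math.ceil((prompt_len + 1) / self.block_size)

    def try_admit(self):
        """Admit the head of the queue if it fits; return its id or None."""
        if not self.waiting:
            return None
        neg_prio, _, req_id, prompt_len = self.waiting[0]
        need = self.blocks_needed(prompt_len)
        if need > self.free_blocks:
            self._preempt_for(-neg_prio)
        if need > self.free_blocks:
            return None
        heapq.heappop(self.waiting)
        self.free_blocks -= need
        self.running[req_id] = (-neg_prio, need)
        return req_id

    def _preempt_for(self, waiting_priority):
        if not self.running:
            return
        victim = min(self.running, key=lambda r: self.running[r][0])
        prio, blocks = self.running[victim]
        if prio < waiting_priority:  # only strictly lower priority is preempted
            del self.running[victim]
            self.free_blocks += blocks
```

The sketch preempts at most one victim per admission attempt, which keeps the worst-case cost of a single `try_admit` call bounded.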

The router sits above the eight schedulers as a thin fan-out. It picks a specialist, picks one of its replicas (least-loaded queue depth, with a small penalty for replicas currently preempting), and hands off. The scheduler does not know about routing; the router does not know about blocks. This boundary is the single most important structural decision in the stack.
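
The replica pick can be sketched as a single scoring function. The `Replica` type and the `PREEMPT_PENALTY` value are hypothetical; the article only says the penalty is small.

```python
from dataclasses import dataclass

PREEMPT_PENALTY = 2  # hypothetical penalty, expressed in queue-depth units


@dataclass
class Replica:
    name: str
    queue_depth: int
    preempting: bool = False


def pick_replica(replicas):
    """Least-loaded replica, with a small penalty while a replica is preempting."""
    def score(r):
        return r.queue_depth + (PREEMPT_PENALTY if r.preempting else 0)
    return min(replicas, key=score)
```

The router only reads queue depth and a preempting flag here, so it still knows nothing about blocks.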

4. KV cache layout across specialists

Paged KV is non-negotiable for continuous batching; we inherited the design from vLLM's paged-attention and block-table model and stayed close to canonical. Paged attention here is the kernel-facing contract: decode reads K and V through block indices instead of assuming one contiguous cache buffer.
A block table is the per-sequence integer table that maps logical token positions to the physical KV blocks that currently hold them. Prefix caching is the cache-manager policy above that kernel contract: reuse already-filled blocks when a new request starts with the same token prefix. What is specialist-specific is the block size, the pool size, and the prefix-cache key.

| Specialist | Block size (tokens) | Why |
|---|---|---|
| Template-SLM | 8 | Long, repetitive header sequences; deep cross-request reuse |
| STL-SLM | 8 | Same prefix-heavy pattern over <ranges> / <algorithm> |
| Algo-, Memory-, Concurrency-, Systems-, Build-, Debug-SLM | 16 | Default; balances reuse vs block-table indexing cost |
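
To make the block-table indirection concrete, here is a minimal sketch of the lookup a decode step performs, assuming the block table is a plain list of physical block ids. The function name is illustrative.

```python
def physical_slot(block_table, token_pos, block_size):
    """Map a logical token position to (physical_block, offset_within_block)."""
    logical_block, offset = divmod(token_pos, block_size)
    return block_table[logical_block], offset
```

For example, with `block_table = [7, 2, 9]`, token 19 lives in physical block 9 at offset 3 when the block size is 8, and in physical block 2 at offset 3 when the block size is 16. Smaller blocks give finer-grained reuse at the cost of a longer table.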

Pool size is set from the GPU's free HBM after weights and a reserved activation budget; we deliberately avoid dynamic pool growth because every growth event is a serving stall waiting to happen. The prefix-cache key is (specialist_id, adapter_id, hashed_token_prefix); including the adapter is what stops the cross-adapter cache poisoning that bit us once on a Debug-SLM A/B.

That reuse check also stays deliberately compact on the hot path. Prefix matching only pays for itself if prefill can answer "have I seen this exact specialist, adapter, and prefix before?" quickly enough that lookup is cheaper than rebuilding the blocks. The cache-side continuation is KV cache and paged attention.
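
One way to keep that lookup cheap is to hash the prefix one block at a time and chain the hash, so each full block gets its own key. The sketch below assumes that per-block chaining (the article does not say how the prefix is hashed), and `prefix_keys` and `PrefixCache` are hypothetical names.

```python
import hashlib


def prefix_keys(specialist_id, adapter_id, tokens, block_size):
    """One key per full block; each hash covers all tokens up to the end of that block."""
    keys = []
    h = hashlib.sha256()
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        for t in tokens[start:start + block_size]:
            h.update(t.to_bytes(4, "little"))
        keys.append((specialist_id, adapter_id, h.hexdigest()))
    return keys


class PrefixCache:
    def __init__(self):
        self._blocks = {}  # key -> physical block id

    def insert(self, key, block_id):
        self._blocks[key] = block_id

    def longest_hit(self, keys):
        """Physical blocks for the longest run of leading keys already cached."""
        hits = []
        for k in keys:
            if k not in self._blocks:
                break
            hits.append(self._blocks[k])
        return hits
```

Because the adapter id is part of every key, the same token prefix under a different adapter misses the cache instead of reusing another adapter's blocks.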

Hybrid layers change the math

Several specialists interleave attention with Mamba3 SSM blocks. SSM layers do not have a KV cache in the usual sense; they have per-step conv state and SSM recurrence state that are bytes-per-layer, not bytes-per-token. So a hybrid specialist's KV footprint per token is lower than that of a pure-attention model of the same parameter count, and its preempt-and-resume path has to snapshot SSM state separately. The scheduler treats the SSM snapshot as part of the sequence's serialized state; the block manager only owns the attention KV. The architectural reason those specialists exist at all is covered in Mamba 3 + Transformers, while the execution-plan consequences show up in Hybrid Layer Interleaving.

The practical consequence is that a hybrid specialist with a heavy Mamba3 ratio can hold more concurrent sequences in the same HBM than a pure-attention model of the same parameter budget, but the per-sequence preempt cost is higher, because we have to copy the SSM state to host-pinned memory on eviction and copy it back on resume. The scheduler accounts for that asymmetry when it picks a preemption victim: at equal priority it prefers to preempt a pure-attention sequence over a hybrid one.
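
The victim preference can be expressed as a two-part sort key. This is a sketch under the assumption that the scheduler sees both kinds of sequence, as the paragraph implies; the `Running` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Running:
    seq_id: str
    priority: int
    is_hybrid: bool


def pick_victim(running, incoming_priority):
    """Lowest-priority sequence strictly below the incoming request, or None."""
    candidates = [s for s in running if s.priority < incoming_priority]
    if not candidates:
        return None
    # Lowest priority first; at equal priority, pure-attention (False) sorts before hybrid (True).
    return min(candidates, key=lambda s: (s.priority, s.is_hybrid))
```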

That advantage is conditional rather than magical. At short contexts, a dense hybrid-state snapshot can be comparable to or slightly costlier than evicting a small pure-attention cache, so the scheduler does not treat hybrid preemption as free. The benefit shows up as contexts stretch: attention KV keeps growing with the token window while the SSM side stays a bounded snapshot, so the preemption slope is flatter for long-running hybrid requests than for a pure-attention lane under the same pressure. For the capacity side of that same trade, continue with Long context and attention sinks and Mamba 3 + Transformers.
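
A back-of-the-envelope version of that slope argument, with every number hypothetical: 8 KV heads, head dimension 128, 2-byte elements, 1 MiB of SSM state per layer, a 32-layer pure-attention model, and a hybrid with 8 attention and 24 SSM layers. None of these figures describe the actual specialists.

```python
def attention_kv_bytes(tokens, attn_layers, kv_heads, head_dim, bytes_per_elem=2):
    # K and V, per attention layer, per token: grows linearly with context.
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * tokens


def ssm_state_bytes(ssm_layers, state_bytes_per_layer):
    # Fixed-size conv + recurrence state: independent of context length.
    return ssm_layers * state_bytes_per_layer


def preempt_bytes(tokens, attn_layers, ssm_layers, kv_heads=8, head_dim=128,
                  state_bytes_per_layer=1 << 20):
    """Sequence state that preemption has to account for, under the assumptions above."""
    return (attention_kv_bytes(tokens, attn_layers, kv_heads, head_dim)
            + ssm_state_bytes(ssm_layers, state_bytes_per_layer))
```

Under these assumed numbers the hybrid costs more than the pure-attention model at 128 tokens and less at 8192 tokens, with the crossover near 256 tokens. That is the shape of the trade described above; the real crossover depends on the real layer mix.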

5. Admission control and SLOs

Admission is where the serving stack acquires its honesty. A request is admitted when (a) the chosen specialist's scheduler has room for prompt + one decode step, (b) the caller's deadline is achievable given current queue depth, and (c) admitting does not push another in-flight request past its own deadline. If any of those fails, we either preempt a strictly-lower-priority request, return a typed 429-equivalent with a retry-after, or shadow-route to a less-loaded specialist when the router said the second-choice was viable.
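
The three conditions and the three fallbacks can be sketched as one gate. All parameter names are hypothetical, the deadline and slack estimates are taken as given inputs, and the order in which the fallbacks are tried is an assumption; the article lists them without ranking.

```python
import math
from enum import Enum


class Verdict(Enum):
    ADMIT = "admit"
    PREEMPT = "preempt"
    SHADOW_ROUTE = "shadow_route"
    RETRY_AFTER = "retry_after"


def admit(prompt_len, free_blocks, block_size, est_wait_s, deadline_s,
          min_slack_of_inflight_s, added_delay_s,
          lower_priority_running=False, second_choice_viable=False):
    fits = math.ceil((prompt_len + 1) / block_size) <= free_blocks   # (a)
    meets_deadline = deadline_s is None or est_wait_s <= deadline_s  # (b)
    harmless = added_delay_s <= min_slack_of_inflight_s              # (c)
    if fits and meets_deadline and harmless:
        return Verdict.ADMIT
    # Assumed fallback order: preempt, then shadow-route, then typed retry-after.
    if lower_priority_running:
        return Verdict.PREEMPT
    if second_choice_viable:
        return Verdict.SHADOW_ROUTE
    return Verdict.RETRY_AFTER
```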

What we publish

Per-specialist SLOs, surfaced on the serving dashboard and in the response headers:

  • Time-to-first-token (TTFT) p50/p95/p99.
  • Inter-token latency (ITL) p50/p95/p99.
  • Admission-to-first-token p95 (admission queueing visible to the caller).
  • Per-specialist correctness floor: rolling 24h C++ compilation-pass rate.
  • Preempted-once fraction; this is a transparency knob, not an SLO, and we publish it because it matters to callers building latency-sensitive pipelines.
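
For readers who want to reproduce the p50/p95/p99 figures from their own latency samples, a nearest-rank percentile is one simple definition. The dashboard's exact estimator is not stated in this article, so treat this as an assumption.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[max(rank, 1) - 1]


def latency_summary(samples_ms):
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```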

We do not publish a single ensemble latency number. If someone asks, we provide one on demand, but it never goes on the dashboard, because nobody can act on it.

Backpressure and circuit breakers

Two failure modes have to be caught before they propagate. The first is a specialist whose correctness SLO is drifting: a model refresh ships a regression and the rolling compilation-pass rate drops below the floor. The router carries a circuit breaker per specialist that, on sustained breach, demotes that specialist to shadow-only and lets a configured fallback specialist take primary traffic for the affected intent. The breaker resets only when the rolling rate climbs back above the floor for a sustained window; we do not flap.
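
The "sustained breach, sustained recovery" behavior is a hysteresis loop. The sketch below counts consecutive observations in each direction; the class name and the `trip_after` and `reset_after` defaults are hypothetical, and the real windows are not given here.

```python
class CorrectnessBreaker:
    """Trips after `trip_after` consecutive breaches; resets after `reset_after` healthy readings."""

    def __init__(self, floor, trip_after=3, reset_after=5):
        self.floor = floor
        self.trip_after = trip_after
        self.reset_after = reset_after
        self.shadow_only = False
        self._streak = 0  # consecutive observations arguing for a state change

    def observe(self, pass_rate):
        if self.shadow_only:
            wants_change = pass_rate > self.floor   # recovery evidence
            needed = self.reset_after
        else:
            wants_change = pass_rate < self.floor   # breach evidence
            needed = self.trip_after
        self._streak = self._streak + 1 if wants_change else 0
        if self._streak >= needed:
            self.shadow_only = not self.shadow_only
            self._streak = 0
        return self.shadow_only
```

A single healthy reading in the middle of a breach streak resets the count, which is what keeps the breaker from flapping on noisy windows.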

The second is a load spike against a single specialist. Admission already returns a typed retry-after when the chosen scheduler is full, but if the spike is sustained the router escalates: it lowers the shadow threshold for that specialist (so the second-choice catches more low-confidence traffic), and if that is not enough it raises the priority floor for the affected intent. The result is that low-priority traffic gets throttled before high-priority traffic ever sees a deadline miss. Both knobs are operator-visible on the serving dashboard, and both are reversible without a deploy.

The dashboard matters here as readback, not as the control loop itself. The real breaker decision still has to live inline with admission and per-request routing, because the useful signal is whether this request can fit the current lane budget now; a panel that updates after the fact is for operators, not for protecting the hot path.

Fresh lanes also start under tighter supervision than warm ones. A new adapter rotation or a just-refreshed specialist begins with more shadowing and a more conservative promotion posture until the correctness window fills in. That keeps the router from treating "no history yet" as evidence of health, and it gives the lane enough real traffic to calibrate before it owns a full intent class without help.

6. Adapters, quantization, and the dispatch matrix

Each specialist replica carries its base NVFP4 weights plus a small number of adapter rotations. Adapters are LoRA-shaped delta tensors held in BF16 because the rank is low and the bandwidth cost is dominated by the base weights anyway. The scheduler groups by adapter identity inside its batch so the dispatch kernel can fold the LoRA delta in a single pass. Cross-adapter batching is technically supported and operationally avoided; it doubles the activation traffic for marginal batch fill.

The deeper reason is latency isolation, not just code simplicity. Once one hot batch mixes materially different adapter rotations, the cheap requests start paying the bookkeeping and memory pattern of the heaviest delta in the group. Keeping batches narrow by adapter identity is therefore the serving-side extension of the adapter lifecycle described in the adapter stack, not an arbitrary scheduler preference. It also keeps heterogeneous-rank LoRA batching out of the default hot path: systems like Punica and S-LoRA only make mixed-adapter concurrency cheap by optimizing that dispatch surface explicitly, while our common lane stays simpler unless we actually need that broader multi-tenant behavior.
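
A minimal sketch of single-adapter batch assembly, assuming ready sequences arrive as `(seq_id, adapter_id)` pairs. Picking the adapter with the most ready sequences is an illustrative policy, not necessarily the one the scheduler uses.

```python
from collections import defaultdict


def group_by_adapter(ready, max_batch):
    """Return (adapter_id, seq_ids) for one batch that contains a single adapter."""
    buckets = defaultdict(list)
    for seq_id, adapter_id in ready:
        buckets[adapter_id].append(seq_id)
    if not buckets:
        return None, []
    # Illustrative policy: serve the adapter with the most ready sequences first.
    adapter_id = max(buckets, key=lambda a: len(buckets[a]))
    return adapter_id, buckets[adapter_id][:max_batch]
```

Every sequence in the returned batch shares one adapter, so a dispatch step has exactly one LoRA delta to apply.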

The dispatch surface looks like:

# pseudocode for the per-specialist dispatch step
batch = scheduler.assemble_step()              # group by adapter, decode/prefill split
kv_view = block_manager.view(batch.sequences)  # paged-KV table for FA-Blackwell / SDPA
adapter = adapter_pool.get(batch.adapter_id)   # LoRA delta in BF16

with serving_telemetry(batch):
    logits = engine.forward(
        tokens=batch.tokens,
        kv=kv_view,
        adapter=adapter,
        positions=batch.positions,
        mode=batch.mode,                        # "prefill" | "decode" | "spec_verify"
    )
    sampled = sampler.sample(logits, batch.policies)

scheduler.commit(batch, sampled)

spec_verify is the path that speculative decoding uses; it shares everything with decode except the K-token block shape. In serving terms, verify is still one request inside one specialist scheduler slot, not a second routed request. Keeping it in the same dispatch surface is what lets the spec-decode path inherit the same admission, preemption, and KV-rollback bookkeeping, which is the same integration point described in Speculative Decoding Inside an Eight-Specialist Ensemble.

That shared dispatch surface only works because admission stays stricter than plain decode. We do not reserve for "one more token" and hope the speculative window fits later; we admit only when the specialist can afford the prompt plus the full draft window it may have to verify. If that extra reservation does not fit, the request waits or falls back to the non-speculative lane. In other words, speculative decoding is a latency optimization that lives under the same capacity gate, not an exception to it. The cache-side continuation is KV cache and paged attention, and the policy-side continuation is Speculative decoding for ensembles.
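
The stricter reservation can be sketched as a block count that covers the full draft window up front. Function names and lane labels are hypothetical, and preferring the non-speculative lane over waiting is an assumption; the article lists both options without ranking them.

```python
import math


def blocks_to_reserve(prompt_len, block_size, draft_window=0):
    """Blocks for the prompt plus one decode step, or plus the full draft window when speculating."""
    extra = max(draft_window, 1)
    return math.ceil((prompt_len + extra) / block_size)


def choose_lane(prompt_len, free_blocks, block_size, draft_window):
    if blocks_to_reserve(prompt_len, block_size, draft_window) <= free_blocks:
        return "speculative"
    # Assumed preference: fall back to plain decode before waiting.
    if blocks_to_reserve(prompt_len, block_size) <= free_blocks:
        return "non_speculative"
    return "wait"
```

With a 30-token prompt, 16-token blocks, and a 4-token draft window, the speculative lane needs three blocks while plain decode needs two, so a specialist with only two free blocks falls back in this sketch.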

What we kept and what we threw away

Kept: a small classifier plus rules at the top, one continuous-batching scheduler per specialist, one paged-KV pool per specialist, prefix-cache keyed by (specialist, adapter, prefix-hash), per-specialist SLOs published as the only latency contract, adapter-grouped batches, a single dispatch surface for prefill/decode/spec-verify, and SSM state owned by the sequence rather than the block manager.

Threw away: a global cross-specialist scheduler, token-level rerouting mid-generation, cross-specialist output stitching, online drafter learning, dynamic KV-pool growth, cross-adapter batching at scale, and an aggregate ensemble latency number on the dashboard.
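To make the (specialist, adapter, prefix-hash) key from the kept list concrete, here is a minimal sketch. `PrefixKey` and `prefix_key` are hypothetical names, and SHA-256 over little-endian token IDs is an assumption made for illustration; the article does not say which hash the stack uses.

```python
import hashlib
from typing import NamedTuple, Sequence


class PrefixKey(NamedTuple):
    """Hypothetical cache key: all three fields must match for reuse."""
    specialist: str
    adapter_id: str
    prefix_hash: str


def prefix_key(specialist: str, adapter_id: str, tokens: Sequence[int]) -> PrefixKey:
    """Build a prefix-cache key from non-negative 32-bit token IDs."""
    h = hashlib.sha256()
    for t in tokens:
        h.update(t.to_bytes(4, "little"))
    return PrefixKey(specialist, adapter_id, h.hexdigest())


# Usage: the same prompt under two adapters hashes identically
# but still lands in two distinct cache entries.
cache = {}
a = prefix_key("review", "lora-a", [1, 2, 3])
b = prefix_key("review", "lora-b", [1, 2, 3])
cache[a] = "blocks-for-lora-a"
```

Because the adapter is part of the key rather than a filter applied after lookup, a hit can never hand one adapter's KV blocks to another.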

The boundary between routing, scheduling, and KV management is the only thing keeping the eight specialists feeling like one product. Every redesign we made on either side of that boundary stuck; every redesign that tried to dissolve the boundary got rolled back inside a week. The follow-on serving questions are downstream of that split: cache residency and block management in KV cache and paged attention, decode-path acceleration in speculative decoding for ensembles, and long-window stability in long context and attention sinks.

FAQ

Frequently asked questions

I am tempted to build one global scheduler. Why did this stack reject that design?
Because the specialists do not share one stable cost model. KV-per-token footprint, maximum context, adapter mix, and preemption penalty differ enough that a global queue couples unrelated latency domains and makes tail behavior worse.
I am debugging eviction or reuse drift. Why keep one KV pool per specialist?
It isolates prefix reuse and eviction pressure, and it prevents cache mistakes from crossing specialist or adapter boundaries. The pool boundary is operationally cheaper than debugging cross-specialist contamination after the fact.
I am reading the dashboard. What do TTFT and ITL actually tell me?
TTFT is time-to-first-token: how long the caller waits from request arrival to the first emitted token. ITL is inter-token latency: how long the caller waits between later decode tokens once generation has started. We keep both because TTFT is dominated by admission plus prefill, while ITL is dominated by the steady-state decode path.
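As a worked example of those two definitions, the sketch below derives both numbers from one request's timestamps. `ttft_and_itl` is a hypothetical helper, and the timestamps are assumed to be seconds on a single monotonic clock.

```python
from typing import List, Sequence, Tuple


def ttft_and_itl(arrival: float, token_times: Sequence[float]) -> Tuple[float, List[float]]:
    """Compute TTFT and the per-gap ITL samples for one request.

    `arrival` is the request arrival time; `token_times` are the emission
    times of each generated token, in order.
    """
    if not token_times:
        raise ValueError("no tokens were emitted for this request")
    ttft = token_times[0] - arrival
    # One ITL sample per gap between consecutive emitted tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


# Usage: arrival at t=10.0 s, three tokens emitted afterwards.
ttft, itl = ttft_and_itl(10.0, [10.5, 10.75, 11.0])
```

A request that emits a single token has a TTFT but no ITL samples, which is one reason the two are tracked as separate series.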
I need to validate prefix-cache correctness across adapters. Why is adapter_id in the key?
Because two requests can share the same prompt prefix and still need different KV blocks if they are running with different LoRA deltas. The checked-in cache-side version of that rule is easiest to read next to KV cache and paged attention, the dense FA4 KV-cache decode sample, and the exact-token sparse telemetry sample: the block mapping can be identical while the effective weights are not.
I need to know when shadow dispatch should turn on. What is the trigger?
Only when router confidence is below an intent-specific threshold or when the request type has a high wrong-answer cost, such as review traffic. For routine codegen, shadowing stays rare because the extra cost usually outweighs the benefit.
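A minimal sketch of that trigger, with loud assumptions: the threshold values, the intent names other than review, and the `should_shadow` helper are all hypothetical and chosen only to illustrate the two conditions.

```python
# Hypothetical intent-specific confidence thresholds (illustrative values).
CONFIDENCE_THRESHOLDS = {"codegen": 0.55, "review": 0.80}
DEFAULT_THRESHOLD = 0.70  # assumed fallback for intents without an entry

# Request types assumed to carry a high wrong-answer cost.
HIGH_COST_INTENTS = {"review"}


def should_shadow(intent: str, confidence: float) -> bool:
    """Shadow when the intent is high-cost or router confidence is low."""
    if intent in HIGH_COST_INTENTS:
        return True
    return confidence < CONFIDENCE_THRESHOLDS.get(intent, DEFAULT_THRESHOLD)
```

With these illustrative values, a confident codegen request is never shadowed, while review traffic is always shadowed regardless of confidence.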
Why does speculative decoding need a stricter admission check than plain decode?
Because a successful draft burst can need several future KV blocks at once, not just the next decode token. If the specialist only budgets for prompt plus one step, speculation can turn a healthy batch into a mid-generation capacity fault. The safe rule is simple: reserve for the whole draft window up front or do not enable the speculative path for that request. The local continuation is Speculative decoding for ensembles and KV cache and paged attention.
Why does the router stay out of preemption and block-pool decisions?
Because those decisions depend on per-specialist cache residency, adapter mix, and sequence-state cost that only the scheduler and block manager can see. Keeping the router at the specialist-selection boundary prevents request classification from turning into cache policy, which is the same separation spelled out in KV cache and paged attention and observability and SLO dashboards.
Why do fresh adapter rotations start with heavier shadowing?
Because a newly promoted lane has the least trustworthy correctness history. Extra shadowing gives the router a safer comparison window while the correctness floor, queue behavior, and adapter-specific cache patterns fill in on real traffic. Once that window is stable, the lane can own traffic with normal thresholds. The operator-side continuation is observability and SLO dashboards, and the artifact-side background is the adapter stack.
Why must the breaker live inline instead of on the dashboard?
Because breaker action is part of admission control, not a reporting workflow. If overload decisions wait for a dashboard refresh, the scheduler is already late. The dashboard is where operators confirm what happened; the router is where the request-level throttling and fallback decision has to happen.
Should I turn latency estimates into admission constants?
No. Treat them as shape checks, not policy. The useful lesson is directional: attention KV cost grows with the live token window, hybrid recurrent state is a separate bounded sequence-state cost, shadow dispatch consumes real capacity, and mixed-adapter batches can inherit the heaviest lane's bookkeeping. The admission controller should still read the live per-specialist queue, cache, adapter, and deadline state before deciding. The local continuations are KV cache and paged attention, speculative decoding for ensembles, and observability and SLO dashboards.
I am triaging a serving incident. Which metric should I check first?
Admission-to-first-token p95 and the per-specialist correctness floor. Together they tell you whether the problem is capacity pressure, a bad model refresh, or both. Observability and SLO dashboards is the better local handoff when you need dashboard interpretation rather than architecture.
I need to answer "which specialist served this request." Where should that show up?
In the request trace and response metadata, not in a post-hoc dashboard guess. The router should stamp the primary specialist, optional shadow specialist, adapter identity, admission result, and breaker state before the request enters the per-specialist scheduler. The serving dashboard can aggregate those fields later, but the incident receipt has to preserve the per-request path first. That keeps router calibration, KV pressure, and correctness-breach triage tied to the same evidence trail described in observability and SLO dashboards.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Serving

How MegaCpp stabilized a GB10-oriented vLLM lane with an on-disk overlay, text-only model registration, and a deliberate keep-disabled list for…

scheduler

The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.

continuous batching

The serving scheduler policy that keeps admitting decode and prefill work into the same rolling batch window instead of waiting for one whole batch to finish first.

Paged attention

The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.

block table

The per-sequence integer map from logical token ranges to the physical KV blocks the decode kernel should read next.

Prefix cache

The reuse policy above paged KV that lets later requests point at already-filled cache blocks when they begin with the same validated token prefix.

SLO

A single service-level objective such as deadline miss rate or first-token latency floor that the serving stack publishes per specialist rather than as one blended ensemble number.

SLOs

The small set of per-specialist service-level objectives that the router and operator dashboards use to decide admission, shadowing, and rollback behavior.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

vLLM

How MegaCpp stabilized a GB10-oriented vLLM lane with an on-disk overlay, text-only model registration, and a deliberate keep-disabled list for…