Serving the eight: router, per-specialist scheduler, and the KV layout that keeps them honest
How we actually serve an eight-specialist C++ ensemble: a top-level router, per-specialist continuous-batch schedulers, paged KV per model, admission control, and the SLOs we publish.

The MegaCpp ensemble is eight specialists, not one model. That single architectural fact reshapes every serving decision we make. A monolithic generalist can often live inside one vLLM-style model engine with one scheduler, one KV-cache manager, and one admission domain. Our ensemble still borrows that lower-level serving pattern inside each specialist, but not above them: the router sits one layer higher and keeps the eight models out of one shared scheduling or eviction domain.
This post is about the stack above the kernels: router, per-specialist scheduler, KV layout, admission control, and published SLOs. Kernel choices and NVFP4 layout are covered elsewhere, the cache budget underneath this control plane is spelled out in KV cache and paged attention, and the mixed attention-plus-SSM model shape behind several specialists is expanded in Hybrid Layer Interleaving.
For first touch, six terms do most of the work in this article. A router is the small front-door classifier plus rules layer that decides which specialist should answer a request. A model engine is the per-model serving runtime that owns decode for one loaded model replica; in vLLM-style systems it is usually where batching, KV residency, and token generation meet. Continuous batching means the scheduler keeps mixing decode steps from requests that are already resident instead of waiting for one request to finish before admitting the next. A paged KV pool means the attention cache is stored in fixed-size token blocks that can be reused and evicted independently instead of as one giant contiguous slab per sequence.
A prefix cache is the hash-indexed map from an already-seen prompt prefix to those existing KV blocks. Admission control is the explicit "can this request fit and still meet its deadline?" gate that runs before decode. TTFT is time-to-first-token, ITL is inter-token latency, and the article's SLOs are the latency and correctness bounds surfaced to callers. The quickest checked-in companions are the MegaCpp example index, dense FA4 KV-cache decode sample, exact-token sparse telemetry sample, and FA4 receipt summary sample.
Why this matters
An ensemble exposes the same product surface as a single model and is much harder to serve well. The cost of a bad serving design is not a slow demo; it is silent specialist starvation, KV evictions that cascade across replicas, tail latency that wanders by intent class, and a debugging story where nobody can answer "which model actually answered this request" without grepping logs. We learned the hard way that the only way to make eight specialists feel like one product is to keep their boundaries crisp inside the serving stack: one router on top, one scheduler per specialist, one paged-KV pool per specialist, one published SLO per specialist.
This post is the design that survived two redesigns, and its cache-isolation assumptions line up with the residency story in KV cache and paged attention.
1. The shape of the problem
An incoming request is a chat-like blob of C++ context: a prompt, optional repo snippets, optional tool outputs, and a caller-declared intent (codegen, debug, build-fix, review, or unspecified). It arrives with a priority integer, an optional deadline, and a preferred decoding policy (greedy, typical, temperature-sampled). The serving stack has to decide three things before any token comes out:
- Which specialist (or specialists) answer this request.
- Which instance of that specialist takes it, on which GPU, in which batch.
- How its KV footprint is paid for, and what to preempt if it will not fit.
Each decision has its own timescale. Routing is per-request and happens once. Scheduling is per-token and happens thousands of times per second. KV allocation sits underneath scheduling and decides whether the next decode step can even run. We kept the three decisions in three different components on purpose, because conflating them is how serving systems end up with router logic leaking into kernel dispatch.
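As a rough illustration of the request shape described above, the fields could be modelled as follows. The class names, field names, and defaults are hypothetical and are not the production schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Intent(Enum):
    CODEGEN = "codegen"
    DEBUG = "debug"
    BUILD_FIX = "build-fix"
    REVIEW = "review"
    UNSPECIFIED = "unspecified"


class DecodePolicy(Enum):
    GREEDY = "greedy"
    TYPICAL = "typical"
    TEMPERATURE = "temperature"


@dataclass(frozen=True)
class Request:
    """Hypothetical request envelope; only the fields the article names."""
    prompt: str
    priority: int                              # higher means more urgent
    intent: Intent = Intent.UNSPECIFIED        # caller-declared, optional
    policy: DecodePolicy = DecodePolicy.GREEDY
    deadline_s: Optional[float] = None         # None means no deadline
    repo_snippets: Tuple[str, ...] = ()
    tool_outputs: Tuple[str, ...] = ()
```

Keeping the envelope immutable is one way to make sure the router, scheduler, and block manager all reason about the same request without any of them mutating it.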
2. The top-level router
The router is a small model plus a rules layer. The small model is a distilled classifier in the tens of millions of parameters, trained on labeled dogfood traffic and curated examples from our enriched-corpus dataset family; it produces a distribution over the eight specialists plus a reject class. The rules layer overrides it in two cases: when the caller declared an intent that maps directly to a specialist (debug traces always go to Debug-SLM, build files to Build-SLM), and when the prompt tripwires a structural detector (SFINAE-heavy headers route to Template-SLM regardless of the classifier).
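A minimal sketch of how the rules layer might sit in front of the classifier output. The function names, the marker strings in the structural detector, and the mapping of the `build-fix` intent to Build-SLM are assumptions made for illustration; the classifier itself is represented only by its output distribution.

```python
from typing import Dict, Optional

# Hypothetical declared-intent overrides. The article names Debug-SLM and
# Build-SLM as direct targets; the exact intent strings are assumed here.
INTENT_OVERRIDES = {"debug": "Debug-SLM", "build-fix": "Build-SLM"}


def looks_sfinae_heavy(prompt: str) -> bool:
    """Toy structural detector; a real one would inspect headers properly."""
    markers = ("std::enable_if", "std::void_t", "decltype(")
    return sum(prompt.count(m) for m in markers) >= 3


def route(prompt: str, intent: Optional[str], probs: Dict[str, float]) -> str:
    """Apply the two rule overrides, then fall back to the classifier top-1."""
    if intent in INTENT_OVERRIDES:
        return INTENT_OVERRIDES[intent]
    if looks_sfinae_heavy(prompt):
        return "Template-SLM"
    # probs covers the eight specialists plus the reject class.
    return max(probs, key=probs.get)
```

The ordering matters: the rules run first, so the classifier never gets a vote on traffic that a declared intent or a structural tripwire already decided.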
Shadow dispatch and what the router does not do
The router outputs a primary specialist and, for high-stakes requests, a shadow specialist. The shadow is only dispatched when the primary's top-1 probability is below a threshold tuned per intent. For pure codegen traffic the threshold is low and we almost never shadow. For review traffic, where the cost of a wrong specialist is a confidently wrong review, the threshold is high and we shadow more often.
That shadow traffic is not throwaway work. It is the calibration stream for later threshold tuning and promotion decisions: when the primary and shadow disagree, the router gets a labeled example of where its confidence boundary was too loose or too strict without trying to relearn on the live request path.
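The shadow trigger can be sketched as a per-intent threshold on the primary's top-1 probability. The threshold values and names below are hypothetical; the article gives no numbers, only that codegen uses a low threshold and review a high one.

```python
from typing import Dict, Optional, Tuple

# Hypothetical per-intent thresholds, chosen only to show the direction.
SHADOW_THRESHOLD = {"codegen": 0.40, "review": 0.85}
DEFAULT_THRESHOLD = 0.60


def pick_primary_and_shadow(
    probs: Dict[str, float], intent: str
) -> Tuple[str, Optional[str]]:
    """Return (primary, shadow); shadow is None when confidence is high enough."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    primary = ranked[0]
    threshold = SHADOW_THRESHOLD.get(intent, DEFAULT_THRESHOLD)
    if probs[primary] < threshold and len(ranked) > 1:
        return primary, ranked[1]
    return primary, None
```

With the same classifier output, a review request shadows where a codegen request does not, which is the asymmetry the thresholds are meant to encode.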
A few things the router deliberately does not do. No token-level reassignment — once routed, a request stays on its specialist for the whole generation. No cross-specialist output stitching; the quality penalty is visible. No online learning from real-time feedback; the classifier is retrained offline on labeled traffic.
3. One scheduler per specialist, not one across all of them
The instinct is to run one global scheduler that picks the best GPU for each request across all specialists. We built that first and threw it away. Specialists have different KV-per-token footprints (hybrid ratios and head counts vary), different maximum contexts, and different ideal decode batch sizes. A global scheduler has to carry all of that simultaneously; admission becomes a constraint solver and tail latency gets worse, not better, because requests sit behind decisions they have no dependency on.
Each specialist instance now runs its own continuous-batching scheduler. The in-repo primitive is a continuous-batch scheduler that sits between incoming requests and a paged-KV block manager for that specialist. Its contract is small and explicit, and it is only practical because each specialist keeps its own KV budget instead of sharing one giant pool as discussed in KV cache and paged attention:
- Hold a waiting queue ordered by (priority DESC, arrival ASC).
- Try prefix-cache reuse before allocating fresh blocks.
- Admit a request only if enough free blocks exist to cover its prompt plus at least one decode step.
- Preempt the lowest-priority running sequence when a strictly-higher-priority request is waiting.
- Group within the scheduler by adapter identity, because adapter swaps are the second most expensive thing we do after KV eviction.
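The queue ordering, block-based admission, and strict-priority preemption from the contract above can be sketched as follows. This is a simplified illustration, not the in-repo scheduler: prefix-cache reuse and adapter grouping are left out, a preempted sequence loses its original arrival order when requeued, and all class and method names are hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Seq:
    seq_id: str
    priority: int
    prompt_tokens: int
    blocks: int = 0  # KV blocks currently held


class SpecialistScheduler:
    def __init__(self, total_blocks: int, block_size: int) -> None:
        self.free_blocks = total_blocks
        self.block_size = block_size
        self.running: List[Seq] = []
        self._arrival = itertools.count()
        # Min-heap on (-priority, arrival) gives priority DESC, arrival ASC.
        self._waiting: List[Tuple[int, int, Seq]] = []

    def submit(self, seq: Seq) -> None:
        heapq.heappush(self._waiting, (-seq.priority, next(self._arrival), seq))

    def _blocks_needed(self, seq: Seq) -> int:
        # Prompt plus one decode step, rounded up to whole blocks.
        return -(-(seq.prompt_tokens + 1) // self.block_size)

    def step(self) -> Optional[Seq]:
        """Admit the head of the queue if it fits, preempting at most once."""
        if not self._waiting:
            return None
        head = self._waiting[0][2]
        need = self._blocks_needed(head)
        if need > self.free_blocks and self.running:
            victim = min(self.running, key=lambda s: s.priority)
            if victim.priority < head.priority:  # strictly higher priority only
                self.running.remove(victim)
                self.free_blocks += victim.blocks
                victim.blocks = 0
                self.submit(victim)  # requeue for a later resume
        if need > self.free_blocks:
            return None
        heapq.heappop(self._waiting)
        head.blocks = need
        self.free_blocks -= need
        self.running.append(head)
        return head
```

An equal-priority arrival never evicts a running sequence in this sketch, which matches the "strictly-higher-priority" wording of the contract.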
The router sits above the eight schedulers as a thin fan-out. It picks a specialist, picks one of its replicas (least-loaded queue depth, with a small penalty for replicas currently preempting), and hands off. The scheduler does not know about routing; the router does not know about blocks. This boundary is the single most important structural decision in the stack.
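Replica selection as described here reduces to a small scoring function. The sketch below assumes the penalty is expressed in queue slots; the constant and the `Replica` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Replica:
    name: str
    queue_depth: int
    preempting: bool = False


PREEMPT_PENALTY = 2  # hypothetical value, measured in queue slots


def pick_replica(replicas: Sequence[Replica]) -> Replica:
    """Least-loaded replica, with a small penalty for ones mid-preemption."""
    def load(r: Replica) -> int:
        return r.queue_depth + (PREEMPT_PENALTY if r.preempting else 0)
    return min(replicas, key=load)
```

Note that the router only reads queue depth and a preemption flag; it never looks at block counts, which keeps the routing and KV concerns on their own sides of the boundary.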
4. KV cache layout across specialists
Paged KV is non-negotiable for continuous batching; we inherited the design from vLLM's paged-attention and block-table model and stayed close to canonical. Paged attention here is the kernel-facing contract: decode reads K and V through block indices instead of assuming one contiguous cache buffer.
A block table is the per-sequence integer table that maps logical token positions to the physical KV blocks that currently hold them. Prefix caching is the cache-manager policy above that kernel contract: reuse already-filled blocks when a new request starts with the same token prefix. What is specialist-specific is the block size, the pool size, and the prefix-cache key.
| Specialist | Block size (tokens) | Why |
|---|---|---|
| Template-SLM | 8 | Long, repetitive header sequences; deep cross-request reuse |
| STL-SLM | 8 | Same prefix-heavy pattern over <ranges> / <algorithm> |
| Algo-, Memory-, Concurrency-, Systems-, Build-, Debug-SLM | 16 | Default; balances reuse vs block-table indexing cost |
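To make the block-table indirection concrete, here is a minimal sketch of the lookup a decode step performs, using the block sizes from the table. The function name and the example block ids are made up for illustration.

```python
from typing import List, Tuple


def locate(block_table: List[int], position: int, block_size: int) -> Tuple[int, int]:
    """Map a logical token position to (physical block id, offset in block)."""
    logical_block, offset = divmod(position, block_size)
    return block_table[logical_block], offset


# The same logical position lands in different blocks at different block sizes.
table = [7, 2, 9]                              # physical block ids, in logical order
template_slm = locate(table, 20, block_size=8)   # third logical block
default_slm = locate(table, 20, block_size=16)   # second logical block
```

Smaller blocks mean a longer table for the same context, which is the indexing cost the default block size of 16 trades against finer-grained reuse.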
Pool size is set from the GPU's free HBM after weights and a reserved activation budget; we deliberately avoid dynamic pool growth because every growth event is a serving stall waiting to happen. The prefix-cache key is (specialist_id, adapter_id, hashed_token_prefix); including the adapter is what stops the cross-adapter cache poisoning that bit us once on a Debug-SLM A/B.
That reuse check also stays deliberately compact on the hot path. Prefix matching only pays for itself if prefill can answer "have I seen this exact specialist, adapter, and prefix before?" quickly enough that lookup is cheaper than rebuilding the blocks. The cache-side continuation is KV cache and paged attention.
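A sketch of the two calculations in this section: the three-part prefix-cache key and the static pool size. The choice of SHA-256 over 4-byte token ids and the function names are assumptions; the article only specifies the shape of the key and that the pool is sized once from free HBM.

```python
import hashlib
from typing import Sequence, Tuple


def prefix_cache_key(
    specialist_id: str, adapter_id: str, tokens: Sequence[int]
) -> Tuple[str, str, str]:
    """Key is (specialist_id, adapter_id, hashed_token_prefix)."""
    digest = hashlib.sha256()
    for t in tokens:
        digest.update(t.to_bytes(4, "little"))  # assumes ids fit in 32 bits
    return specialist_id, adapter_id, digest.hexdigest()


def pool_blocks(
    free_hbm_bytes: int, activation_reserve_bytes: int, bytes_per_block: int
) -> int:
    """Static pool size: free HBM after weights, minus the activation reserve."""
    usable = max(free_hbm_bytes - activation_reserve_bytes, 0)
    return usable // bytes_per_block
```

Two requests with identical token prefixes but different adapters produce the same hash and still miss each other in the cache, because the adapter is part of the key rather than part of the hash input.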
Hybrid layers change the math
Several specialists interleave attention with Mamba3 SSM blocks. SSM layers do not have a KV cache in the usual sense; they have per-step conv state and SSM recurrence state that are bytes-per-layer, not bytes-per-token. So a hybrid specialist's KV footprint per token is lower than that of a pure-attention model of the same parameter count, and its preempt-and-resume path has to snapshot SSM state separately. The scheduler treats the SSM snapshot as part of the sequence's serialized state; the block manager only owns the attention KV. The architectural reason those specialists exist at all is covered in Mamba 3 + Transformers, while the execution-plan consequences show up in Hybrid Layer Interleaving.
The practical consequence is that a hybrid specialist with a heavy Mamba3 ratio can hold more concurrent sequences in the same HBM than a pure-attention model of the same parameter budget, but the per-sequence preempt cost is higher, because we have to copy the SSM state to host-pinned memory on eviction and copy it back on resume. The scheduler accounts for that asymmetry when it picks a preemption victim: at equal priority it prefers to preempt a pure-attention sequence over a hybrid one.
That advantage is conditional rather than magical. At short contexts, a dense hybrid-state snapshot can be comparable to or slightly costlier than evicting a small pure-attention cache, so the scheduler does not treat hybrid preemption as free. The benefit shows up as contexts stretch: attention KV keeps growing with the token window while the SSM side stays a bounded snapshot, so the preemption slope is flatter for long-running hybrid requests than for a pure-attention lane under the same pressure. For the capacity side of that same trade, continue with Long context and attention sinks and Mamba 3 + Transformers.
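The crossover can be shown with back-of-envelope arithmetic. Every number below (layer counts, head shape, 2-byte KV entries, 4 MiB of SSM state per layer) is a made-up assumption chosen to make the shape of the trade visible; none of it describes a real specialist.

```python
def kv_bytes_per_token(
    attn_layers: int, kv_heads: int, head_dim: int, dtype_bytes: int
) -> int:
    # Factor 2 covers both K and V.
    return 2 * attn_layers * kv_heads * head_dim * dtype_bytes


def sequence_state_bytes(
    context_tokens: int,
    attn_layers: int,
    ssm_layers: int,
    ssm_state_bytes_per_layer: int,
    kv_heads: int = 8,
    head_dim: int = 128,
    dtype_bytes: int = 2,
) -> int:
    """Bytes that must move on preemption: growing KV plus fixed SSM state."""
    kv = context_tokens * kv_bytes_per_token(attn_layers, kv_heads, head_dim, dtype_bytes)
    ssm = ssm_layers * ssm_state_bytes_per_layer
    return kv + ssm


MIB = 2**20
for ctx in (256, 1024, 16384):
    pure = sequence_state_bytes(ctx, attn_layers=32, ssm_layers=0,
                                ssm_state_bytes_per_layer=0)
    hybrid = sequence_state_bytes(ctx, attn_layers=8, ssm_layers=24,
                                  ssm_state_bytes_per_layer=4 * MIB)
    print(ctx, pure // MIB, hybrid // MIB)
```

Under these assumed numbers the hybrid sequence carries more state at 256 tokens, the two are equal at 1024 tokens, and the pure-attention sequence is more than three times larger at 16384 tokens, which is the flatter preemption slope described above.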
5. Admission control and SLOs
Admission is where the serving stack acquires its honesty. A request is admitted when (a) the chosen specialist's scheduler has room for prompt + one decode step, (b) the caller's deadline is achievable given current queue depth, and (c) admitting does not push another in-flight request below its own deadline. If any of those fail, we either preempt a strictly-lower-priority request, return a typed 429-equivalent with a retry-after, or shadow-route to a less-loaded specialist when the router said the second-choice was viable.
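The three admission conditions can be sketched as a single predicate. The latency estimates are passed in as plain numbers because the article does not say how they are derived; the `LaneState` fields and the `Verdict` names are hypothetical, and the preempt and shadow-route fallbacks are omitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ADMIT = "admit"
    RETRY_LATER = "retry_later"  # stands in for the typed 429-equivalent


@dataclass
class LaneState:
    free_blocks: int
    block_size: int
    est_queue_wait_s: float         # current queueing estimate
    est_prefill_s_per_token: float
    min_inflight_slack_s: float     # tightest deadline slack among running requests
    added_delay_s: float            # delay this admission would impose on them


def admit(prompt_tokens: int, deadline_s: Optional[float], lane: LaneState) -> Verdict:
    blocks_needed = -(-(prompt_tokens + 1) // lane.block_size)
    fits = blocks_needed <= lane.free_blocks                       # condition (a)
    eta = lane.est_queue_wait_s + prompt_tokens * lane.est_prefill_s_per_token
    on_time = deadline_s is None or eta <= deadline_s              # condition (b)
    harmless = lane.added_delay_s <= lane.min_inflight_slack_s     # condition (c)
    if fits and on_time and harmless:
        return Verdict.ADMIT
    return Verdict.RETRY_LATER
```

Condition (c) is the one that is easy to forget: a request can fit and meet its own deadline while still pushing a neighbour past theirs.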
What we publish
Per-specialist SLOs, surfaced on the serving dashboard and in the response headers:
- Time-to-first-token (TTFT) p50/p95/p99.
- Inter-token latency (ITL) p50/p95/p99.
- Admission-to-first-token p95 (admission queueing visible to the caller).
- Per-specialist correctness floor: rolling 24h C++ compilation-pass rate.
- Preempted-once fraction; this is a transparency knob, not an SLO, and we publish it because it matters to callers building latency-sensitive pipelines.
We do not publish a single ensemble latency number. Asked once, given on demand, never on the dashboard, because nobody can act on it.
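For readers who want to reproduce the published latency figures from their own traces, a minimal sketch follows. It assumes per-token emit timestamps in seconds and uses the nearest-rank percentile definition; the dashboard's actual estimator is not specified in this article, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def ttft(arrival_s: float, token_times_s: Sequence[float]) -> float:
    """Time from request arrival to the first emitted token."""
    return token_times_s[0] - arrival_s


def itl(token_times_s: Sequence[float]) -> List[float]:
    """Gaps between consecutive emitted tokens."""
    return [b - a for a, b in zip(token_times_s, token_times_s[1:])]


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def latency_summary(samples: Sequence[float]) -> Dict[str, float]:
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
```

Running `latency_summary` separately per specialist, rather than over pooled samples, mirrors the decision not to publish one blended ensemble number.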
Backpressure and circuit breakers
Two failure modes have to be caught before they propagate. The first is a specialist whose correctness SLO is drifting: a model refresh ships a regression and the rolling compilation-pass rate drops below the floor. The router carries a circuit breaker per specialist that, on sustained breach, demotes that specialist to shadow-only and lets a configured fallback specialist take primary traffic for the affected intent. The breaker resets only when the rolling rate climbs back above the floor for a sustained window; we do not flap.
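The no-flap behaviour can be sketched as a breaker that needs a sustained run of observations in either direction before it changes state. The class name, the use of consecutive observations as the "sustained window", and the example values are assumptions.

```python
class CorrectnessBreaker:
    """Open means the specialist is demoted to shadow-only."""

    def __init__(self, floor: float, sustain: int) -> None:
        self.floor = floor
        self.sustain = sustain  # consecutive observations needed to flip state
        self.open = False
        self._streak = 0

    def observe(self, pass_rate: float) -> bool:
        breaching = pass_rate < self.floor
        # Count only observations that argue for leaving the current state.
        if breaching != self.open:
            self._streak += 1
        else:
            self._streak = 0
        if self._streak >= self.sustain:
            self.open = not self.open
            self._streak = 0
        return self.open
```

A single good reading while open, or a single bad reading while closed, resets nothing but the streak, so the specialist does not bounce between primary and shadow-only on noise.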
The second is a load spike against a single specialist. Admission already returns a typed retry-after when the chosen scheduler is full, but if the spike is sustained the router escalates: it raises the shadow threshold for that specialist (so more of its low-confidence traffic qualifies for the second choice), and if that is not enough it raises the priority floor for the affected intent. The result is that low-priority traffic gets throttled before high-priority traffic ever sees a deadline miss. Both knobs are operator-visible on the serving dashboard, and both are reversible without a deploy.
The dashboard matters here as readback, not as the control loop itself. The real breaker decision still has to live inline with admission and per-request routing, because the useful signal is whether this request can fit the current lane budget now; a panel that updates after the fact is for operators, not for protecting the hot path.
Fresh lanes also start under tighter supervision than warm ones. A new adapter rotation or a just-refreshed specialist begins with more shadowing and a more conservative promotion posture until the correctness window fills in. That keeps the router from treating "no history yet" as evidence of health, and it gives the lane enough real traffic to calibrate before it owns a full intent class without help.
6. Adapters, quantization, and the dispatch matrix
Each specialist replica carries its base NVFP4 weights plus a small number of adapter rotations. Adapters are LoRA-shaped delta tensors held in BF16 because the rank is low and the bandwidth cost is dominated by the base weights anyway. The scheduler groups by adapter identity inside its batch so the dispatch kernel can fold the LoRA delta in a single pass. Cross-adapter batching is technically supported and operationally avoided; it doubles the activation traffic for marginal batch fill.
The deeper reason is latency isolation, not just code simplicity. Once one hot batch mixes materially different adapter rotations, the cheap requests start paying the bookkeeping and memory pattern of the heaviest delta in the group. Keeping batches narrow by adapter identity is therefore the serving-side extension of the adapter lifecycle described in the adapter stack, not an arbitrary scheduler preference. It also keeps heterogeneous-rank LoRA batching out of the default hot path: systems like Punica and S-LoRA only make mixed-adapter concurrency cheap by optimizing that dispatch surface explicitly, while our common lane stays simpler unless we actually need that broader multi-tenant behavior.
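Adapter grouping itself is a small bookkeeping step. The sketch below assumes each runnable sequence is a `(seq_id, adapter_id)` pair and that the next batch is drawn from the largest group; both the representation and the selection rule are illustrative choices, not the scheduler's documented policy.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_adapter(seqs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Bucket runnable sequence ids by adapter id, preserving arrival order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for seq_id, adapter_id in seqs:
        groups[adapter_id].append(seq_id)
    return dict(groups)


def next_batch(seqs: Iterable[Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Pick the adapter with the most runnable sequences (assumed rule)."""
    groups = group_by_adapter(seqs)
    adapter_id = max(groups, key=lambda a: len(groups[a]))
    return adapter_id, groups[adapter_id]
```

Each batch produced this way carries exactly one adapter id, which is what lets the dispatch step fetch a single LoRA delta per forward pass.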
The dispatch surface looks like:
```python
# pseudocode for the per-specialist dispatch step
batch = scheduler.assemble_step()               # group by adapter, decode/prefill split
kv_view = block_manager.view(batch.sequences)   # paged-KV table for FA-Blackwell / SDPA
adapter = adapter_pool.get(batch.adapter_id)    # LoRA delta in BF16
with serving_telemetry(batch):
    logits = engine.forward(
        tokens=batch.tokens,
        kv=kv_view,
        adapter=adapter,
        positions=batch.positions,
        mode=batch.mode,                        # "prefill" | "decode" | "spec_verify"
    )
sampled = sampler.sample(logits, batch.policies)
scheduler.commit(batch, sampled)
```
spec_verify is the path that speculative decoding uses; it shares everything with decode except the K-token block shape. In serving terms, verify is still one request inside one specialist scheduler slot, not a second routed request. Keeping it in the same dispatch surface is what lets the spec-decode path inherit the same admission, preemption, and KV-rollback bookkeeping, which is the same integration point described in Speculative Decoding Inside an Eight-Specialist Ensemble.
That shared dispatch surface only works because admission stays stricter than plain decode. We do not reserve for "one more token" and hope the speculative window fits later; we admit only when the specialist can afford the prompt plus the full draft window it may have to verify. If that extra reservation does not fit, the request waits or falls back to the non-speculative lane. In other words, speculative decoding is a latency optimization that lives under the same capacity gate, not an exception to it. The cache-side continuation is KV cache and paged attention, and the policy-side continuation is Speculative decoding for ensembles.
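The stricter reservation can be sketched as a block count that covers the prompt plus the whole draft window, with a fallback to the plain-decode reservation. Function names and the lane labels are hypothetical, and whether a request that fits neither lane waits or is rejected is left to the caller.

```python
def blocks_for(tokens: int, block_size: int) -> int:
    """Whole KV blocks needed to hold the given number of tokens."""
    return -(-tokens // block_size)


def admission_blocks(prompt_tokens: int, block_size: int, draft_window: int = 0) -> int:
    """Blocks to reserve at admission; draft_window=0 means plain decode."""
    lookahead = max(draft_window, 1)  # plain decode reserves one decode step
    return blocks_for(prompt_tokens + lookahead, block_size)


def choose_lane(
    prompt_tokens: int, block_size: int, draft_window: int, free_blocks: int
) -> str:
    if admission_blocks(prompt_tokens, block_size, draft_window) <= free_blocks:
        return "speculative"
    if admission_blocks(prompt_tokens, block_size) <= free_blocks:
        return "plain"
    return "wait"
```

With a 30-token prompt, 16-token blocks, and a 4-token draft window, the speculative lane needs three blocks where plain decode needs two, so a lane with exactly two free blocks falls back rather than gambling on the third.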
What we kept and what we threw away
Kept: a small classifier plus rules at the top, one continuous-batchingQuick term guidecontinuous batchingThe serving scheduler policy that keeps admitting decode and prefill work into the same rolling batch window instead of waiting for one whole batch to finish first.GroundingKV cache and paged attention observability and SLO dashboards schedulerQuick term guideschedulerThe per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.Groundingobservability and SLO dashboards KV cache and paged attention per specialist, one paged-KVQuick term guidePaged attentionThe decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample pool per specialist, prefix-cacheQuick term guidePrefix cacheThe reuse policy above paged KV that lets later requests point at already-filled cache blocks when they begin with the same validated token prefix.GroundingAbout: KV cache and paged attention Reference: vLLM automatic prefix caching keyed by (specialist, adapter, prefix-hash), per-specialist SLOsQuick term guideSLOsThe small set of per-specialist service-level objectives that the router and operator dashboards use to decide admission, shadowing, and rollback behavior.Groundingobservability and SLO dashboards published as the only latency contract, adapter-grouped batches, a single dispatch surface for prefill/decode/spec-verify, and SSM state owned by the sequence rather than the block manager.
Threw away: a global cross-specialist scheduler, token-level rerouting mid-generation, cross-specialist output stitching, online drafter learning, dynamic KV-pool growth, cross-adapter batching at scale, and an aggregate ensemble latency number on the dashboard.
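The (specialist, adapter, prefix-hash) key from the kept list can be sketched as follows. This is an illustration under assumptions, not the production key: `PrefixKey`, `prefix_key`, the SHA-256 choice, and the 4-byte token encoding are all hypothetical.

```python
import hashlib
from typing import NamedTuple, Sequence


class PrefixKey(NamedTuple):
    specialist: str
    adapter: str
    prefix_hash: str


def prefix_key(specialist: str, adapter: str, token_ids: Sequence[int]) -> PrefixKey:
    """Build a cache key scoped to one specialist and one adapter."""
    h = hashlib.sha256()
    for tok in token_ids:
        # Fixed-width encoding so [1, 23] and [12, 3] hash differently.
        h.update(int(tok).to_bytes(4, "little"))
    return PrefixKey(specialist, adapter, h.hexdigest())
```

Because the specialist and adapter are separate fields of the key, two requests with an identical token prefix but different adapters never resolve to the same cache entry, even though their prefix hashes match.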
The boundary between routing, scheduling, and KV management is the only thing keeping the eight specialists feeling like one product. Every redesign we made on either side of that boundary stuck; every redesign that tried to dissolve the boundary got rolled back inside a week. The follow-on serving questions are downstream of that split: cache residency and block management in KV cache and paged attention, decode-path acceleration in speculative decoding for ensembles, and long-window stability in long context and attention sinks.
Frequently asked questions
I am tempted to build one global scheduler. Why did this stack reject that design?
I am debugging eviction or reuse drift. Why keep one KV pool per specialist?
I am reading the dashboard. What do TTFT and ITL actually tell me?
TTFT is time-to-first-token: how long the caller waits from request arrival to the first emitted token. ITL is inter-token latency: how long the caller waits between later decode tokens once generation has started. We keep both because TTFT is dominated by admission plus prefill, while ITL is dominated by the steady-state decode path.
I need to validate prefix-cache correctness across adapters. Why is adapter_id in the key?
I need to know when shadow dispatch should turn on. What is the trigger?
Why does speculative decoding need a stricter admission check than plain decode?
Why does the router stay out of preemption and block-pool decisions?
Why do fresh adapter rotations start with heavier shadowing?
Why must the breaker live inline instead of on the dashboard?
Should I turn latency estimates into admission constants?
I am triaging a serving incident. Which metric should I check first?
I need to answer "which specialist served this request." Where should that show up?
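The TTFT and ITL definitions given in the FAQ above can be made concrete with a small sketch. The function names and the timestamp layout (one arrival time plus one emit time per token, all in seconds) are assumptions for illustration, not the dashboard's actual pipeline.

```python
from typing import List, Sequence


def ttft(arrival: float, token_times: Sequence[float]) -> float:
    """Time from request arrival to the first emitted token."""
    return token_times[0] - arrival


def itl(token_times: Sequence[float]) -> List[float]:
    """Gaps between consecutive emitted tokens once generation has started."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]
```

For a request that arrives at 10.0 and emits tokens at 10.5, 10.75, and 11.0, `ttft` covers the admission-plus-prefill wait and `itl` yields one gap per subsequent decode step.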
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Serving: How MegaCpp stabilized a GB10-oriented vLLM lane with an on-disk overlay, text-only model registration, and a deliberate keep-disabled list for…
Scheduler: The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.
Continuous batching: The serving scheduler policy that keeps admitting decode and prefill work into the same rolling batch window instead of waiting for one whole batch to finish first.
Paged attention: The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.
Block table: The per-sequence integer map from logical token ranges to the physical KV blocks the decode kernel should read next.
Prefix cache: The reuse policy above paged KV that lets later requests point at already-filled cache blocks when they begin with the same validated token prefix.
SLO: A single service-level objective such as deadline miss rate or first-token latency floor that the serving stack publishes per specialist rather than as one blended ensemble number.
SLOs: The small set of per-specialist service-level objectives that the router and operator dashboards use to decide admission, shadowing, and rollback behavior.
KV cache: The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
vLLM: How MegaCpp stabilized a GB10-oriented vLLM lane with an on-disk overlay, text-only model registration, and a deliberate keep-disabled list for…