vLLM on GB10: the overlay, the registration fixes, and the paths we kept off
How MegaCpp stabilized a GB10-oriented vLLM lane with an on-disk overlay, text-only model registration, and a deliberate keep-disabled list for serving paths that were not yet honest.

vLLM support for the MegaCpp serving lane on GB10 was not one bug and one fix. It was a bundle of compatibility cuts: a packaging decision, a model-registration decision, a checkpoint-loading decision, and a few deliberate non-decisions where the honest answer was "leave this path off until it is actually correct." This post records that state from the bench lane itself, and it reads best together with GB10 journey, Torch 2.13 on GB10, and GB10 stack parity for MegaCpp.
The short version is simple. Runtime monkey-patching was not durable enough for a multiprocessing engine, so we moved to an on-disk overlay. The base Qwen3.5 registration was resolving to the wrong serving shape for the text-only checkpoints we were actually using, so we registered explicit text-only classes. Some serving paths stayed disabled because they were not yet operationally honest: they either depended on worker-init behavior we had not stabilized, or they introduced training/serving complexity without a validated payoff on the current lane. That keep-disabled posture follows the same rule as inference serving stack and how we keep a patch lane: a feature is not "ready" just because one narrow smoke lane printed text once.
One process-model decoder matters just as much. spawn here means the worker
starts from a fresh interpreter import graph rather than inheriting the
parent's already-mutated module objects. That is why a parent-only registry
patch could appear to work in logs and still fail in the actual vLLM serving
topology.
Two more first-touch decoders help. An overlay here means a small checked-in patch bundle copied into the pinned vLLM source tree before install, not a runtime hook. WeightsMapper is the loader-side rename map that rewrites checkpoint keys into the names the existing vLLM loader already expects. MRoPE is the multi-axis rotary-position contract Qwen3.5 expects on the text path, so the text-only class still has to preserve that positional interface even though the vision branch is absent.
Why MegaCpp needed an overlay instead of a runtime hook
The first attempt was the tempting one: patch vLLM at runtime in the parent process, register the text-only class, and launch a smoke test. The parent process did call the patch code, and the first logs even showed the architecture resolving to Qwen3_5ForCausalLM. But the workers were launched through spawn, which means child processes re-import Python modules from disk instead of inheriting the patched in-memory registry from the parent. In other words, the apparent success was misleading. The registration existed in the wrong process.
That is the key reason MegaCpp switched from a parent-process hook to an overlay strategy. Once the modified files live on disk inside the image, every worker process imports the same patched module graph. That solves the actual failure mode instead of only making the parent process look healthy.
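The failure mode is easy to reproduce outside vLLM. The sketch below is an analogy, not the bench lane's code: it mutates a standard-library module in the parent, then asks a fresh interpreter (the same starting point a spawned worker has) whether it can see the change. The attribute name TEXT_ONLY_REGISTERED is made up for illustration.

```python
import json
import subprocess
import sys

# Parent-side "runtime patch": mutate an already-imported module in memory.
# TEXT_ONLY_REGISTERED is a made-up marker standing in for a registry entry.
json.TEXT_ONLY_REGISTERED = True

# A fresh interpreter imports json from disk, as a spawned worker would.
CHILD_CHECK = "import json; print(hasattr(json, 'TEXT_ONLY_REGISTERED'))"


def child_sees_patch() -> bool:
    """Run the check in a new interpreter and report what it observed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CHECK],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


# The parent sees its own mutation; the child does not.
parent_sees_patch = hasattr(json, "TEXT_ONLY_REGISTERED")
```

The parent reports the marker and the child does not, which is the same shape as a registry entry that exists only in the launcher process.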
The bench lane records that conclusion directly: the strategy-B smoke test was marked inconclusive, not successful, because the runtime patch did not survive worker spawn, and the follow-up options were all variants of the same lesson: use the official plugin mechanism or patch the installed module itself. The current GB10 stack chose the second route because it is simpler to audit and deterministic inside an image build. That keeps the plugin alternative honest too: a plugin route only counts if the plugin is installed as an importable package in every worker environment, not merely registered in the parent launcher.
That plugin boundary is narrower than it first sounds. Upstream vLLM loads plugins from installable Python entry points and filters them through VLLM_PLUGINS, so the plugin lane only solves this problem when the same package is already baked into the worker environment. For this GB10 image, the on-disk overlay was the smaller deterministic cut: one pinned image, one bounded diff, and no extra dependency on worker-side package discovery.
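A minimal sketch of why the filter alone is not enough, under stated assumptions: the function below is not vLLM's loader, the plugin name megacpp_text_only is hypothetical, and the filter is assumed to be a comma-separated allowlist where an unset variable means "load everything discovered."

```python
def select_plugins(discovered: list[str], env: dict[str, str]) -> list[str]:
    """Return the discovered plugin names that pass an allowlist filter.

    Assumption: an unset VLLM_PLUGINS keeps every discovered plugin, and a
    set value is a comma-separated list of names to keep.
    """
    raw = env.get("VLLM_PLUGINS")
    if raw is None:
        return list(discovered)
    allowed = {name.strip() for name in raw.split(",") if name.strip()}
    return [name for name in discovered if name in allowed]


# The allowlist can only keep what the environment actually discovered.
with_package = select_plugins(
    ["megacpp_text_only"], {"VLLM_PLUGINS": "megacpp_text_only"}
)
without_package = select_plugins([], {"VLLM_PLUGINS": "megacpp_text_only"})
```

Here with_package contains the plugin and without_package is empty: naming a plugin in the variable does nothing for a worker environment where the package was never installed.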
What the overlay actually changed
The GB10 image is built around a pinned toolchain, then overlays patched vLLM files into the checked-out source tree before installing vLLM editable. The Dockerfile documents the intent clearly: clone vLLM at a pinned commit, copy the overlay files into that tree, and verify that the text-only class is importable before the image is considered sane.
That choice matters for two reasons.
First, it makes the serving lane reproducible. There is no hidden startup hook whose success depends on import order or process topology. The image contains the exact model registry and loader logic that the workers will use.
Second, it makes the diff reviewable. MegaCpp can point to a bounded checked-in overlay bundle and say exactly which surfaces diverge from upstream. For a fast-moving upstream like vLLM, that is a much healthier operational posture than carrying an invisible runtime rewrite. That is the same bounded-diff argument behind how we keep a patch lane.
The adjacent local runtime keeps the same preference for explicit runtime shape. The launcher carries concrete engine kwargs such as enforce_eager and gpu_memory_utilization, and the worker extension is referenced by a fixed dotted import path. That is not the article's exact overlay bundle, but it is local evidence that the serving lane was treated as a replayable contract rather than as an ad-hoc startup patch.
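The copy step in that image recipe amounts to mirroring a small directory onto the checked-out tree. The sketch below shows the idea in Python rather than as Dockerfile lines; the function name and the file layout inside demo are assumptions for illustration, not the contents of the real overlay bundle.

```python
import shutil
import tempfile
from pathlib import Path


def apply_overlay(overlay_root: Path, source_root: Path) -> list[Path]:
    """Copy every overlay file onto the same relative path in the source tree."""
    written = []
    for src in sorted(overlay_root.rglob("*")):
        if src.is_file():
            dest = source_root / src.relative_to(overlay_root)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            written.append(dest.relative_to(source_root))
    return written


def demo() -> tuple[list[str], str]:
    """Build a throwaway tree and overlay, apply the overlay, report the result."""
    with tempfile.TemporaryDirectory() as tmp:
        tree = Path(tmp) / "vllm_src"
        overlay = Path(tmp) / "overlay"
        # Hypothetical file layout, for illustration only.
        rel = Path("model_executor") / "models" / "registry.py"
        (tree / rel).parent.mkdir(parents=True)
        (tree / rel).write_text("UPSTREAM\n")
        (overlay / rel).parent.mkdir(parents=True)
        (overlay / rel).write_text("PATCHED\n")
        written = apply_overlay(overlay, tree)
        return [p.as_posix() for p in written], (tree / rel).read_text()
```

Returning the list of written paths is deliberate: it is the bounded diff, so a build log can state exactly which upstream files were replaced.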
The model-registration fix
The next problem was model shape, not process shape. The checkpoints under evaluation were text-only Qwen3.5 derivatives, but the default vLLM registry path for Qwen3_5ForCausalLM was still effectively the multimodal-oriented wrapper path. MegaCpp needed the architecture name to resolve to a text-only handler that understood the actual checkpoint naming and loading constraints.
The overlay does exactly that. In the patched registry, Qwen3_5ForCausalLM is mapped to Qwen3_5ForCausalLMTextOnly, and Qwen3_5MoeForCausalLM is mapped to Qwen3_5MoeForCausalLMTextOnly. The corresponding model module adds those text-only subclasses and marks them as MRoPE-capable so the text path still provides the three-axis position inputs expected by the model configuration.
This was not cosmetic registration cleanup. Without it, the engine could select a class that was structurally wrong for the text-only checkpoints even before weight loading began.
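The mapping itself is small. As a sketch of its shape only: the two class names on the right come from the overlay described above, while the module name qwen3_5, the upstream wrapper name, and the resolve_architecture helper are assumptions made for this example.

```python
# Registry entries as (module name, class name) pairs. The module name
# "qwen3_5" is an assumption; the class names come from the article.
TEXT_ONLY_OVERRIDES = {
    "Qwen3_5ForCausalLM": ("qwen3_5", "Qwen3_5ForCausalLMTextOnly"),
    "Qwen3_5MoeForCausalLM": ("qwen3_5", "Qwen3_5MoeForCausalLMTextOnly"),
}

# Stand-in for the upstream table; "SomeMultimodalWrapper" is a placeholder.
UPSTREAM_EXAMPLE = {
    "Qwen3_5ForCausalLM": ("qwen3_5", "SomeMultimodalWrapper"),
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
}


def resolve_architecture(name: str, upstream: dict) -> tuple[str, str]:
    """Prefer the text-only override, else fall back to the upstream entry."""
    if name in TEXT_ONLY_OVERRIDES:
        return TEXT_ONLY_OVERRIDES[name]
    return upstream[name]
```

Because the override lives in a file on disk, every worker that imports the registry resolves the architecture name the same way.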
The checkpoint-loading fix
Registration alone was not enough because the checkpoint names still did not line up cleanly with what vLLM expected. That is a checkpoint-contract problem before it is a loader problem, which is why it belongs next to checkpoint format and resume rather than next to generic registry folklore.
The checked-in loader notes show the mismatch in plain terms. The available text-only checkpoints used nested model.language_model.* prefixes and unfused projection naming, while vLLM expected model.* names together with fused parameter groups such as gate_up_proj, qkv_proj, and the linear-attention fused projections. MegaCpp did not try to manufacture a new checkpoint format for that. Instead, it relied on vLLM's existing fusion path and only fixed the naming seam that blocked it.
The text-only loader subclasses apply a WeightsMapper that rewrites model.language_model. to model. and skips mtp. and visual. prefixes. That is the important boundary. The overlay does not reimplement vLLM's parameter fusion rules; it just delivers names to the existing loader in the shape that lets those rules fire.
That was the principled cut. A custom one-off checkpoint conversion would have increased maintenance burden and hidden future drift. Letting vLLM perform its own stacked-parameter assembly after a minimal prefix rewrite is much easier to defend.
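The prefix rewrite is simple enough to show in a few lines. This is a standalone sketch of the rename-and-skip behavior described above, not vLLM's WeightsMapper class; the function name and the example keys are illustrative.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")

# Prefixes taken from the article: one rename, two skips.
RENAME_PREFIXES = {"model.language_model.": "model."}
SKIP_PREFIXES = ("mtp.", "visual.")


def remap_weights(weights: Iterable[Tuple[str, T]]) -> Iterator[Tuple[str, T]]:
    """Drop skipped prefixes and rewrite renamed ones; pass tensors through."""
    for name, tensor in weights:
        if name.startswith(SKIP_PREFIXES):
            continue
        for old, new in RENAME_PREFIXES.items():
            if name.startswith(old):
                name = new + name[len(old):]
                break
        yield name, tensor


# Integers stand in for tensors in this example.
example = [
    ("model.language_model.layers.0.self_attn.q_proj.weight", 0),
    ("visual.patch_embed.weight", 1),
    ("mtp.layers.0.weight", 2),
    ("lm_head.weight", 3),
]
remapped = list(remap_weights(example))
```

The tensors are never touched. Only names change, which is what leaves the fused-parameter assembly to the existing loader.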
One more boundary stayed outside the overlay on purpose: fused-projection and
MRoPE shard math. The text-only class still has to respect Qwen3.5's
interleaved positional path and vLLM's fused qkv_proj expectations. A prefix
rewrite can repair the namespace seam; it does not rescue arbitrary tensor
parallel geometries or broken head-group alignment. That is why the validated
GB10 lane stayed on the bounded smoke shapes we actually exercised instead of
claiming the overlay had solved every larger Qwen3.5 serving shape.
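To make "head-group alignment" concrete, here is a minimal sketch of the usual divisibility rule for sharding a fused QKV projection across tensor-parallel ranks. It is an assumption-level simplification: it ignores MRoPE sections and the linear-attention layers, and the head counts in the usage lines are made-up numbers, not Qwen3.5's configuration.

```python
def tp_shape_is_valid(num_heads: int, num_kv_heads: int, tp_size: int) -> bool:
    """Check whether attention heads split evenly across tensor-parallel ranks.

    Query heads must divide evenly. KV heads must either divide evenly or be
    replicated a whole number of times when there are fewer of them than ranks.
    """
    if tp_size < 1 or num_heads % tp_size != 0:
        return False
    if num_kv_heads >= tp_size:
        return num_kv_heads % tp_size == 0
    return tp_size % num_kv_heads == 0


# Made-up head counts, for illustration only.
even_split = tp_shape_is_valid(num_heads=16, num_kv_heads=4, tp_size=2)
replicated_kv = tp_shape_is_valid(num_heads=16, num_kv_heads=4, tp_size=8)
uneven_split = tp_shape_is_valid(num_heads=16, num_kv_heads=4, tp_size=3)
```

A prefix rewrite has no influence on any of these checks, which is why the overlay makes no claim about geometries the lane did not exercise.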
Why some serving paths stayed disabled
It is useful to separate "not fixed yet" from "intentionally off."
One disabled path was the runtime-patch approach itself. After the H100 smoke test showed that spawn-based workers did not inherit the parent-side registry mutation, keeping that path alive would have meant pretending the system was safer than it was. MegaCpp deferred that route and documented the real requirement: a plugin loaded in every worker or a patched module installed on disk.
Another path stayed off in the adjacent training lane: colocated vLLM during GSPO remained disabled and the stable run continued with use_vllm=False. That decision was operational, not ideological. The checked-in run notes are explicit that the non-vLLM run was stable, already showing the desired reward trend, and not worth destabilizing while the serving-side loader story was still under active repair.
Any future revisit of colocated vLLM still needs two separate proofs this article does not claim: an explicit VRAM envelope for the shared training-plus-generation lane, and a sleep or unload policy that makes the handoff between backward and generation predictable. Without those two, "colocate" is just another unstable smoke mode, not an honest serving or RL contract.
There is also a more mechanical keep-disabled choice in the GB10 smoke harness: it uses a constrained configuration with gpu_memory_utilization=0.70, max_model_len=2048, and a switch between compiled and eager execution through enforce_eager. That is not a production tuning guide. It is a bounded validation lane. MegaCpp kept the larger, more aggressive serving envelope out of this smoke path until the basic registration and load semantics were honest.
Those knobs deserve a first-touch decode because they are easy to overread.
gpu_memory_utilization is simply the fraction of GPU memory vLLM reserves for KV cache and
executor state, while enforce_eager keeps this smoke lane on the eager route
instead of mixing loader validation with a second compile/debug story. They are
stability controls for the validation lane, not proof of a wider serving
envelope.
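As a sketch of how that bounded envelope can be written down as data, the helper below builds the keyword arguments with the values quoted for the smoke harness. The helper name and the checkpoint path are hypothetical; the three knob names are the ones discussed above.

```python
def smoke_engine_kwargs(
    model: str,
    eager: bool = True,
    gpu_memory_utilization: float = 0.70,
    max_model_len: int = 2048,
) -> dict:
    """Build engine kwargs for the bounded smoke lane and reject bad values."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    return {
        "model": model,
        "gpu_memory_utilization": gpu_memory_utilization,
        "max_model_len": max_model_len,
        "enforce_eager": eager,
    }


# "/path/to/checkpoint" is a placeholder, not a real location.
kwargs = smoke_engine_kwargs("/path/to/checkpoint")
# With vLLM installed, these would be passed as: LLM(**kwargs)
```

Keeping the envelope in one function makes the compiled-versus-eager switch a single argument and keeps the smoke lane's limits visible in review.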
The broader rule is simple: we do not enable a serving path just because a one-process smoke test can be made to print text. We enable it only when the process model, registry path, and weight-loading path all match the real serving topology. That is the same proof boundary behind GB10 Blackwell tensor paths: what we actually proved. Capability claims only count when the real runtime path is the one being exercised.
Why the GB10 lane needed its own explicit stack
The GB10 container is not a generic CUDA image with vLLM installed on top. It pins a specific CUDA 13.2 toolchain, a nightly PyTorch line, a flashinfer revision, a vLLM commit, and then applies the overlay. That matters because the goal of this lane was not merely to get text generation once. The goal was to make the serving lane repeatable on Blackwell-class hardware where upstream support was still moving quickly, which is the same reproducibility argument behind GB10 stack parity for MegaCpp and GB10 driver gates and false capability signals.
The image recipe makes that contract explicit: pinned revisions first, then overlay, then one import-level sanity check for the text-only class. MegaCpp treats that as a patch lane, not as a transient shell session. That is the right operational shape for infrastructure that will be rerun.
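An import-level sanity check of that kind can be a few lines run in a fresh interpreter at the end of the build. The sketch below is an assumption about its shape, not the Dockerfile's actual command; the helper name is hypothetical, and it is exercised here against a standard-library class so it runs without vLLM installed.

```python
import importlib


def class_is_importable(module_name: str, class_name: str) -> bool:
    """Return True only if the module imports and exposes the named class."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return isinstance(getattr(module, class_name, None), type)


# Standard-library stand-ins; a real build would name the patched vLLM module.
found = class_is_importable("json", "JSONDecoder")
missing_class = class_is_importable("json", "NoSuchClass")
missing_module = class_is_importable("no_such_module_xyz", "Anything")
```

Running the check in a new process rather than in a shell that already imported the module matters for the same reason the overlay does: it exercises the import graph a worker will actually see.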
What we would still change upstream
The current overlay is a practical answer, not the ideal endpoint.
The cleaner upstream shape would be one of these:
- A plugin-based registration path that vLLM guarantees to load in every worker process.
- A first-class text-only Qwen3.5 registration and loader path upstream so local remapping is unnecessary.
- Better documentation around worker spawn semantics for anyone tempted to rely on parent-side runtime registration.
Until then, the on-disk overlay is the honest mechanism because it matches the real multiprocessing boundary.
The MegaCpp takeaway
The interesting part of this work is not that a few files were patched. It is where the boundary was drawn.
MegaCpp did not fork the whole serving stack. It kept a narrow overlay that does three concrete jobs: resolve the architecture to the right text-only class, repair the checkpoint prefix mismatch, and make those fixes visible to spawned worker processes. At the same time, it kept unstable serving paths disabled instead of promoting a fragile smoke-test result into a production claim.
That is what a healthy patch lane looks like. Fix the real boundary. Do not hide the remaining ones.
Frequently asked questions
What is the overlay in this vLLM lane?
A small checked-in patch bundle copied into the pinned vLLM source tree before install, not a runtime hook.
Why did parent-process runtime registration fail?
The workers were launched through spawn, so they re-imported modules from disk instead of inheriting the parent's in-memory registry mutation.
What does WeightsMapper fix here?
It rewrites model.language_model. into the model. namespace that vLLM's existing loader and fused-parameter assembly already understand.
What does MRoPE-capable mean in this article?
The text-only class still provides the three-axis position inputs that the Qwen3.5 model configuration expects, even though the vision branch is absent.
Why does the overlay not solve arbitrary tensor-parallel sizes?
A tensor-parallel size only works when the fused qkv_proj shards still line up with the interleaved MRoPE head groups that the text path expects. The overlay keeps vLLM's existing fused-projection math intact; it does not patch every incompatible TP geometry into validity.
Why leave some vLLM paths disabled if a smoke test can already print text?
Because a path is enabled only when the process model, registry path, and weight-loading path all match the real serving topology.
Why not use the official vLLM plugin mechanism instead of an overlay?
A plugin route only counts if the plugin is installed as an importable package in every worker environment. For this image, the on-disk overlay was the smaller deterministic cut.
Does sleep mode make colocated vLLM part of the validated GB10 contract?
No. Colocated vLLM stayed disabled, and any revisit still needs an explicit VRAM envelope and a sleep or unload policy, neither of which this article claims.
Which local runtime knobs made this a real validation lane?
The smoke harness's gpu_memory_utilization, max_model_len, and enforce_eager settings. Those are stability knobs for a narrow smoke envelope, not a promise that the wider serving envelope was already production-ready.
What is locally proven here versus only externally documented?
Locally exercised: the on-disk overlay, the text-only registration, and the prefix rewrite on the bounded smoke shapes. The plugin route is described from upstream vLLM behavior; this lane chose the overlay instead.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
GB10: Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
TP: Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP, which is heavy on bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.
Serving: How we actually serve an eight-specialist C++ ensemble: a top-level router, per-specialist continuous-batch schedulers, paged KV per model,…
KV cache: The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
CUDA: NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.