Grounded engineering note from the MegaCpp stack

Published April 19, 2026 • 9 min read • David Gornshtein

PyTorch

GB10

vLLM

CUDA

Training

Serving

Torch 2.13 on GB10: the serving and training stack we actually chose

Q: Did Torch 2.13 by itself solve GB10?

No. The working stack is Torch 2.13 plus the matching CUDA, compiler, and source-built extension layer.

Q: Does TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0" mean one wheel stack should work everywhere?

No. That variable tells a source build which CUDA compute-capability targets to emit code for. It does not erase platform-tag, ABI, or host-architecture differences between aarch64 GB10 images and x86_64 cloud builders, which is why this lane still rebuilt FlashInfer, mamba_ssm, and vLLM from source instead of trusting one prebuilt wheel stack across both environments.

Q: Why is vLLM discussed as the forcing function?

Because the serving breakage was concentrated around the executor and model registration surface, so the stack shape followed that integration boundary.

Q: What is locally proven here versus only externally documented?

Locally proven: the checked-in GB10 bundle pins GCC 15, CUDA 13.2, Python 3.13, Torch 2.13 nightly cu132, and source builds for the extension-heavy layer, plus the adjacent training receipts for the CUDA-graph boundary. Externally documented: NVIDIA's broader Blackwell compatibility and tuning rules. Those docs explain the architecture contract; the MegaCpp bundle proves only the specific stack lane we actually ran.

Q: Why do sm_121a, sm_100a, tcgen05, or libcuda show up in a Torch stack article?

Because a GB10 stack choice only makes sense against the hardware and driver contract it is serving. sm_121a is the exact GB10 consumer lane we built for; sm_120f is the family compile target we often use for real GB10 kernels; sm_100a and tcgen05 are the datacenter-side terms that explain what this article is not claiming; and libcuda is the user-space loader boundary that belongs in the adjacent conservative gating article, not in a version-only success story. GB10 repro bundle overview and GB10 repro walkthrough are the shortest checked-in decoders for that split.

Q: Where do FA4, NVFP4, or TPU/XLA fit relative to this Torch article?

FA4 and NVFP4 are neighboring decisions in the same broader ecosystem, but they own different questions: FA4 is the attention-backend eligibility story, NVFP4 is the serving-format story, and this post is the toolchain and runtime-stack story. TPU/XLA is a separate substrate and runtime-ownership story built around PJRT rather than CUDA.

A public, evidence-based write-up of the stack choices around Torch 2.13, CUDA 13.2, GCC 15, GB10, and vLLM compatibility in the MegaCpp workflow.

When people ask whether Torch 2.13 is “ready” on GB10, the useful answer is not a yes-or-no. The useful answer is: ready for which lane, with which compiler, with which CUDA toolchain, and with which serving engine constraints.

The MegaCpp evidence points to a very specific stack choice. For the GB10-shaped serving lane, we pinned Ubuntu 24.04, GCC 15, CUDA 13.2.51, Python 3.13, Torch 2.13 nightly for cu132, FlashInfer from source, mamba_ssm from source, and a source-built vLLM checkout with a pinned overlay. For the training lane, we kept the GB10 launcher aligned with that toolchain family but treated CUDA-graph capture and MoE shape behavior as the real compatibility boundary rather than pretending that “Torch 2.13 support” alone solved the whole stack.

That distinction matters because the serving problem and the training problem failed in different ways.

The serving stack was a toolchain-compatibility problem first

The closer GB10-specific companions are the GB10 journey write-up and the vLLM GB10 overlay note. This post is about the stack choice; those two are about how that stack failed and then stabilized in practice.

The checked-in local map of that stack is MegaCpp example index for the recipe/runtime surfaces and GB10 repro bundle overview for the full GB10 image and build lane. The public-safe boundary files that sit next to this article are GB10 repro walkthrough, GB10 gate matrix, and GB10 public claims note. Together they keep the toolchain recipe, the staged GB10 gate walk, and the public claim boundary in one checked-in path.

The quickest checked-in decoder in the local bundle is GB10 repro walkthrough together with GB10 repro bundle overview. Those two files separate toolchain compatibility from hardware-capability claims so this article does not have to smuggle tensor-path assumptions into a Torch packaging discussion.

One first-touch decoder helps the rest of this article. sm_121a is the consumer-Blackwell target reported on GB10, while sm_100a is the architecture-specific datacenter Blackwell target used by B200-style cubins and tcgen05 examples. MegaCpp's serving and training stack was chosen for the sm_121a lane we could actually run, not for the richer TMEM or tcgen05 paths documented for SM100.

The related sm_120f label is the family-common GB10 compile target we use when the recipe needs a broad consumer-Blackwell build target instead of the exact architecture-specific sm_121a lane.
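The three label shapes above (baseline such as sm_100, architecture-specific such as sm_121a, family-specific such as sm_120f) can be decoded mechanically. The following is a minimal sketch, not part of the MegaCpp bundle: `SmTarget` and `parse_sm_target` are hypothetical names, and the suffix-to-kind mapping simply follows the vocabulary used in this article.

```python
import re
from typing import NamedTuple


class SmTarget(NamedTuple):
    major: int
    minor: int
    kind: str  # "baseline", "arch-specific", or "family-specific"


# Suffix meanings as used in this article: no suffix is the baseline target,
# "a" is architecture-specific, "f" is family-specific.
_KINDS = {"": "baseline", "a": "arch-specific", "f": "family-specific"}


def parse_sm_target(label: str) -> SmTarget:
    """Split a label such as 'sm_121a' into capability digits and suffix kind."""
    # The last digit is the minor version; everything before it is the major.
    match = re.fullmatch(r"sm_(\d+)(\d)([af]?)", label)
    if match is None:
        raise ValueError(f"unrecognized target label: {label!r}")
    major, minor, suffix = match.groups()
    return SmTarget(int(major), int(minor), _KINDS[suffix])


if __name__ == "__main__":
    for label in ("sm_100", "sm_100a", "sm_120f", "sm_121a"):
        print(label, parse_sm_target(label))
```

Keeping the suffix kind separate from the numeric capability makes it harder to confuse "the device reports 12.1" with "this kernel was compiled for an architecture-specific target".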

Two adjacent terms also deserve a first-touch decode here. FA4 means the FlashAttention-4 backend family and belongs to the attention-kernel decision, not the package-version decision. NVFP4 means the serving-side Blackwell FP4 format and belongs to deployment precision, not to whether Torch 2.13 imports. If the question turns into TPU ownership terms such as PJRT, PyTorch/XLA, or Pallas, you have already left this CUDA stack and should switch to the TPU/XLA articles instead of trying to stretch the GB10 story to fit.

The public GB10 repro bundle includes a dedicated image build aimed at full GB10 toolchain parity. That image pins gcc-15 and g++-15, installs CUDA Toolkit 13.2, moves Python to 3.13, and then installs Torch from the nightly cu132 index with torch>=2.13.0.dev0. It also builds causal_conv1d, mamba_ssm, FlashInfer, and vLLM from source instead of relying on a mixed wheel stack. That is the most important signal in the whole record: MegaCpp did not treat GB10 as a place for opportunistic binary compatibility. It treated GB10 as a place where source-level rebuilds were the safe default.
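A pin list like the one above is easy to check from inside the image. The sketch below is illustrative only: the expected prefixes come from this article, while `pin_mismatches`, `observe_local`, and the prefix-matching rule are assumptions rather than anything shipped in the bundle. `torch.__version__` and `torch.version.cuda` are the standard PyTorch version attributes; the GCC version is left for the caller to supply.

```python
import sys

# Pins taken from the article; matching by version prefix is an assumption.
EXPECTED_PREFIXES = {
    "python": "3.13",
    "torch": "2.13",
    "cuda": "13.2",
    "gcc": "15",
}


def pin_mismatches(observed: dict) -> list:
    """Return human-readable problems; an empty list means every pin matched."""
    problems = []
    for name, prefix in EXPECTED_PREFIXES.items():
        value = observed.get(name)
        if value is None:
            problems.append(f"{name}: not reported")
        elif not (value == prefix or value.startswith(prefix + ".")):
            problems.append(f"{name}: expected {prefix}.x, got {value}")
    return problems


def observe_local() -> dict:
    """Collect what this interpreter can see; torch is optional here."""
    observed = {"python": ".".join(map(str, sys.version_info[:3]))}
    try:
        import torch
    except ImportError:
        return observed
    observed["torch"] = torch.__version__
    if torch.version.cuda is not None:  # None on CPU-only builds
        observed["cuda"] = torch.version.cuda
    return observed


if __name__ == "__main__":
    for line in pin_mismatches(observe_local()) or ["all reported pins match"]:
        print(line)
```

Reporting every mismatch at once, instead of failing on the first, keeps a drifted image from being debugged one variable at a time.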

The same Dockerfile also sets TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0", which is a concise way to say that the stack was being kept compatible across Hopper (9.0), datacenter Blackwell (10.0), and GB10-class consumer Blackwell (12.0) targets in one build recipe. That is not the same as claiming one wheel set works everywhere. It is the opposite: one recipe, multiple architectures, source rebuild where needed.
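To make the arch-list semantics concrete, here is a small sketch of how one might check whether a given list covers a device. It assumes the usual CUDA rule that a cubin built for capability X.y runs on devices X.z with z >= y, and that a `+PTX` entry can be JIT-compiled on any equal or newer capability. The function names are hypothetical, suffixed entries such as `10.0a` are deliberately not handled, and the check says nothing about wheel platform tags or host ABI.

```python
def parse_arch_list(value: str) -> list:
    """Parse '9.0;10.0;12.0' style entries into (major, minor, has_ptx)."""
    entries = []
    # Entries may be separated by semicolons or spaces.
    for raw in value.replace(";", " ").split():
        has_ptx = raw.endswith("+PTX")
        number = raw[:-4] if has_ptx else raw
        major, minor = number.split(".")
        entries.append((int(major), int(minor), has_ptx))
    return entries


def covers(arch_list: str, device: tuple) -> bool:
    """True if some entry yields code that a (major, minor) device can run."""
    for major, minor, has_ptx in parse_arch_list(arch_list):
        # Binary compatibility: same major, entry minor not above the device's.
        same_major_binary = major == device[0] and minor <= device[1]
        # PTX can be JIT-compiled forward to equal or newer capabilities.
        ptx_jit = has_ptx and (major, minor) <= tuple(device)
        if same_major_binary or ptx_jit:
            return True
    return False


if __name__ == "__main__":
    gb10 = (12, 1)  # the capability behind the sm_121 labels in this article
    print(covers("9.0;10.0;12.0", gb10))
    print(covers("9.0;10.0", gb10))
```

Under those assumptions the 12.0 entry is what makes the list usable on a 12.1 device; the 9.0 and 10.0 entries serve the other lanes built from the same recipe.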

This is also the place to keep the proof boundary straight. The local evidence proves that this toolchain recipe produced a runnable GB10 serving and training lane. It does not prove that GB10 gained SM100-only tensor features such as TMEM-backed tcgen05.mma; those remain a separate, narrower hardware-claims story grounded in the GB10 gate bundle and the adjacent tensor-path articles.

Why GCC 15 and CUDA 13.2 were not optional details

The image comments are explicit about matching the GB10 toolchain: GCC 15 is described as matching the 15.2.0 on GB10, and CUDA Toolkit 13.2.51 is described as matching cuda-toolkit-13-2. Those comments are not decoration. They explain why the image was built the hard way.

If you change two or three variables at once, it becomes impossible to tell whether a failure came from Torch, CUDA, the host compiler, or one of the source extensions under vLLM. The public MegaCpp bundle avoids that ambiguity by aligning the major host-side toolchain choices up front. In practice, that means Torch 2.13 on GB10 should be discussed as “Torch 2.13 plus CUDA 13.2 plus GCC 15 plus source-built extensions,” not as an isolated library upgrade.

That is also why the stack uses Python 3.13 consistently in the serving image. Once the choice was made to rebuild the extension-heavy layer anyway, keeping the interpreter aligned with the image contract became less risky than depending on a looser prebuilt ecosystem.

vLLM compatibility was the real forcing function

The best evidence that vLLM compatibility drove the stack shape is not a README claim. It is the amount of explicit patching around model registration and import surfaces.

The public build recipe checks out vLLM at a fixed commit, overlays seventeen patched files, and then runs an import sanity test against Qwen3_5ForCausalLMTextOnly. The related public samples say the patch exists because vLLM needed a text-only path and model-registry adjustments that survive subprocess re-imports. That is a stronger claim than “we tweaked config.” It means the serving lane was blocked at the model-executor layer, so the stack chose to own a pinned vLLM overlay rather than wait for upstream drift to settle.
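An import sanity test that also exercises the "survives subprocess re-imports" property can be sketched as below. This is an assumption-laden illustration, not the bundle's actual test: `import_survives_subprocess` is a hypothetical helper, and the module path that exposes Qwen3_5ForCausalLMTextOnly is not given in this article, so the caller has to supply it.

```python
import subprocess
import sys


def import_survives_subprocess(module: str, attr: str, timeout: float = 120.0) -> bool:
    """Import `module` in a fresh interpreter and confirm `attr` resolves."""
    # The child starts with no state from this process, so an overlay that
    # only patches the parent interpreter would fail here.
    code = (
        "import importlib, sys\n"
        "mod = importlib.import_module(sys.argv[1])\n"
        "getattr(mod, sys.argv[2])\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, module, attr],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in check against the standard library; swap in the overlay's
    # module path and class name when running inside the serving image.
    print(import_survives_subprocess("json", "JSONDecoder"))
```

Running the check in a child interpreter is the point: registration fixes applied by monkey-patching in the parent would pass an in-process import and still break once worker processes re-import the package.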

The local article-level explainer for that exact Qwen3_5ForCausalLMTextOnly rewrite is vLLM GB10 overlay and disabled paths.

The checked-in public-safe second source for that same bounded-smoke posture is the GB10 bundle itself together with vLLM GB10 overlay and disabled paths. Those surfaces keep enforce_eager, gpu_memory_utilization, pinned imports, and the "bounded validation lane first" rule visible without pointing readers at sibling-repo implementation paths.
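As a rough illustration of what a bounded smoke configuration could look like, the sketch below builds the keyword arguments separately from the engine so they can be inspected without a GPU. `enforce_eager` and `gpu_memory_utilization` are existing vLLM engine arguments; the helper name, the 0.5 default, and the placeholder model path are assumptions, not values from the MegaCpp bundle.

```python
def bounded_smoke_kwargs(model: str, memory_fraction: float = 0.5) -> dict:
    """Build conservative engine arguments for a first validation run."""
    if not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")
    return {
        "model": model,
        # Skip CUDA-graph capture so failures point at eager execution only.
        "enforce_eager": True,
        # Leave headroom instead of letting the engine claim most of the GPU.
        "gpu_memory_utilization": memory_fraction,
    }


if __name__ == "__main__":
    kwargs = bounded_smoke_kwargs("<path-to-model>")  # placeholder path
    print(kwargs)
    # Inside the serving image one would then construct the engine, e.g.:
    #   from vllm import LLM
    #   llm = LLM(**kwargs)
```

Separating the argument dict from engine construction also makes the posture easy to diff in review before any model is loaded.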

This is the correct engineering move when the incompatibility is narrow and reproducible. It keeps the stack legible. Torch 2.13 is not being blamed for every issue, and vLLM is not being treated as a black box either.

The portability notes around the image make the same point from another angle: GB10 is aarch64, Modal is x86_64, and the notes explicitly warn that wheels are cross-incompatible, so the Modal image rebuilds source dependencies with matching versions. That is exactly the kind of detail that gets lost in casual “works on my machine” summaries and exactly the kind of detail that should drive stack design.
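The aarch64 versus x86_64 split is visible in wheel filenames themselves, because the last dash-separated field of a wheel name is its platform tag. A minimal sketch of that check follows; the helper names and the example filename are hypothetical, and a matching platform tag is necessary but does not by itself guarantee a compatible CUDA or compiler ABI.

```python
import platform
from typing import Optional


def wheel_platform_tags(filename: str) -> list:
    """Return the platform tags encoded in a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    # name-version[-build]-python-abi-platform
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    # Compressed tag sets are joined with dots.
    return parts[-1].split(".")


def wheel_fits_machine(filename: str, machine: Optional[str] = None) -> bool:
    """True if any platform tag is 'any' or ends with the host machine name."""
    machine = machine or platform.machine()
    return any(
        tag == "any" or tag.endswith("_" + machine)
        for tag in wheel_platform_tags(filename)
    )


if __name__ == "__main__":
    name = "example_ext-0.1.0-cp313-cp313-linux_x86_64.whl"  # hypothetical file
    print(wheel_fits_machine(name, "aarch64"))
    print(wheel_fits_machine(name, "x86_64"))
```

A check like this catches the obvious cross-architecture mistake early, which is the cheap half of the problem; the source rebuilds described above address the other half.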

It also explains why this article keeps the lower driver lane separate. libcuda is the Linux user-space CUDA driver library that loads cubins and routes helper paths, but a patched or suggestive libcuda path is not what made Torch 2.13 usable on GB10. The usable stack story here is compiler plus CUDA plus extension-build parity.

The training lane had a different boundary: CUDA graphs and dynamic shapes

That boundary also connects directly to FP8 in the training stack, because the graph-capture decision was not only a GB10 quirk. It was part of the wider rule that precision and compile posture had to be measured together.

For this stack, FP8 was therefore a runtime contract as much as a dtype choice: it had to share the same graph-capture boundaries, MoE exclusions, and receipt discipline as the rest of the GB10 training lane.

The GB10 single-device training launcher in the MegaCpp tree is useful because it does not pretend the hard part is package installation. The script spends its explanatory effort on runtime behavior: stream-mismatch warning suppression, single-device distributed-optimizer avoidance, and most importantly the CUDA-graph boundary around dropless MoE.

The launcher says the dropless MoE path has dynamic shapes and cannot be fully captured in a CUDA graph, with the failure surfacing as a CPU-to-CUDA copy during graph capture inside the MoE all-to-all dispatcher. The resulting choice is to use Transformer Engine graph capture through an explicit safe-scope allowlist, leaving the MoE MLP path uncaptured.
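The allowlist idea can be sketched in a few lines: every block stays eager unless its name is explicitly listed. The scope names in `SAFE_SCOPES` are the ones quoted from the public launcher sample in the FAQ of this article. The `moe_dispatch` and `moe_mlp` labels and the `split_capture_plan` helper are hypothetical, used here only to show the partition.

```python
# Scope names as quoted from the public launcher sample in this article's FAQ.
SAFE_SCOPES = ("attn", "mamba", "moe_router", "moe_preprocess")


def split_capture_plan(blocks, allowlist):
    """Partition block names into graph-captured and eager lists, preserving order."""
    allowed = set(allowlist)
    captured = [name for name in blocks if name in allowed]
    eager = [name for name in blocks if name not in allowed]
    return captured, eager


# Hypothetical block labels; "moe_dispatch" and "moe_mlp" stand in for the dynamic-shape path.
blocks = ["attn", "mamba", "moe_router", "moe_preprocess", "moe_dispatch", "moe_mlp"]
captured, eager = split_capture_plan(blocks, SAFE_SCOPES)

print("captured:", captured)
print("eager:   ", eager)
```

The design point is the default: a block that nobody has listed runs eagerly, so a new dynamic-shape path cannot slip into capture by accident.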

The smallest checked-in graph-capture receipts for that boundary are CUDA graph environment defaults sample, CUDA graph block validation sample, distributed CUDA graph runtime sample, and NAM56R CUDA graph launcher sample.

That is the exact kind of stack choice that matters more than broad version headlines. Torch 2.13 may be the right foundation for the lane, but the working training configuration still depends on where graph capture is scoped and which dynamic paths are left outside it.
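To make "dynamic paths" concrete, the toy example below counts how many tokens each expert receives in two batches of the same size. It is a plain-Python sketch with made-up routing assignments, not the MegaCpp dispatcher. It shows only that a dropless scheme, which keeps every token, produces per-expert sizes that depend on the data.

```python
from collections import Counter


def tokens_per_expert(assignments, num_experts):
    """Count tokens routed to each expert; dropless routing keeps every token."""
    counts = Counter(assignments)
    return [counts.get(expert, 0) for expert in range(num_experts)]


# Two made-up batches of eight tokens each, with top-1 expert ids per token.
batch_a = [0, 0, 1, 3, 3, 3, 2, 0]
batch_b = [1, 1, 1, 1, 2, 3, 3, 0]

counts_a = tokens_per_expert(batch_a, num_experts=4)
counts_b = tokens_per_expert(batch_b, num_experts=4)

print("batch A per-expert token counts:", counts_a)
print("batch B per-expert token counts:", counts_b)
print("same total, different shapes:", sum(counts_a) == sum(counts_b), counts_a != counts_b)
```

A replayed graph assumes the same buffer sizes on every replay, so per-expert buffers that change from batch to batch are a natural fit for the eager side of the boundary.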

In other words, the MegaCpp training evidence says: do not talk about GB10 readiness as if it were only a package-resolution question. The package set got the lane to runnable. The capture scope and runtime constraints made it stable.

What the stack choice really was

From the public evidence, the cleanest summary is this:

For GB10-shaped serving, MegaCpp chose a source-built stack around Torch 2.13 nightly cu132, CUDA 13.2.51, GCC 15, Python 3.13, FlashInfer from source, mamba_ssm from source, and pinned-overlay vLLM.
For GB10-shaped training, MegaCpp stayed in the same toolchain family but treated CUDA-graph scope, MoE shape behavior, and single-device runtime details as first-class compatibility constraints.
For cross-environment serving, MegaCpp explicitly avoided assuming wheel portability between aarch64 GB10 targets and x86_64 cloud builders.
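A pinned stack like this can be checked mechanically before anything is built. The sketch below compares observed version strings against the pins named above. The `EXPECTED_PINS` table mirrors the versions in this article, while the helper names and the observed strings in the example are assumptions for illustration (only CUDA 13.2.51 appears in the text).

```python
# Pins as named in this article; comparison is by leading version components.
EXPECTED_PINS = {"gcc": "15", "cuda": "13.2", "python": "3.13", "torch": "2.13"}


def version_matches(observed: str, pin: str) -> bool:
    """True when the observed version starts with the pin at a component boundary."""
    pin_parts = pin.split(".")
    return observed.split(".")[: len(pin_parts)] == pin_parts


def check_pins(observed: dict, pins: dict = EXPECTED_PINS) -> dict:
    """Return {name: (pin, observed_value)} for every missing or mismatched pin."""
    mismatches = {}
    for name, pin in pins.items():
        value = observed.get(name)
        if value is None or not version_matches(value, pin):
            mismatches[name] = (pin, value)
    return mismatches


# Illustrative observed values; a real check would query the image itself.
observed = {
    "gcc": "15.1.0",
    "cuda": "13.2.51",
    "python": "3.13.2",
    "torch": "2.13.0.dev0+cu132",
}
print("mismatches:", check_pins(observed))
print("with wrong compiler:", check_pins({**observed, "gcc": "14.2.0"}))
```

Comparing whole components rather than string prefixes keeps a pin of 2.13 from accepting something like 2.130.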

That is a good stack because it is honest about where compatibility really lives. Not in one version number, but in the interface between compiler, CUDA toolkit, Torch ABI, extension builds, and runtime behavior.

What I would not simplify away

There is a strong temptation to compress this story into “Torch 2.13 fixed GB10” or “vLLM is compatible now.” The public bundle and launcher history do not support that kind of simplification.

What they support is narrower and more useful. Torch 2.13 on CUDA 13.2 is a workable base for the MegaCpp GB10 lane when the compiler is kept at GCC 15, the extension-heavy packages are rebuilt from source, and vLLM is treated as a pinned integration surface rather than as an interchangeable wheel. Training then adds a second layer of constraints around graph capture and dynamic-shape MoE behavior.

That is not a marketing sentence. It is a reproducibility sentence. And for this topic, reproducibility is the only thing that matters.

The silicon-facing side of the same story is in GB10 journey, while the training-runtime boundary is easier to read next to FP8 in the training stack.

Frequently asked questions

Did Torch 2.13 by itself solve GB10?

No. The working stack is Torch 2.13 plus the matching CUDA, compiler, and source-built extension layer.

Does TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0" mean one wheel stack should work everywhere?

No. That variable tells a source build which CUDA compute-capability targets to emit code for. It does not erase platform-tag, ABI, or host-architecture differences between aarch64 GB10 images and x86_64 cloud builders, which is why this lane still rebuilt FlashInfer, mamba_ssm, and vLLM from source instead of trusting one prebuilt wheel stack across both environments.
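A small sketch makes the limit of that variable visible: parsing it yields compute-capability targets and nothing about the host. The `parse_arch_list` helper is hypothetical and handles only plain `major.minor` entries with an optional `+PTX` suffix. Suffixed forms are out of scope for this example.

```python
def parse_arch_list(value: str):
    """Parse a TORCH_CUDA_ARCH_LIST-style string into (major, minor, ptx) tuples."""
    targets = []
    # Entries may be separated by semicolons or spaces.
    for item in value.replace(" ", ";").split(";"):
        item = item.strip()
        if not item:
            continue
        ptx = item.endswith("+PTX")
        core = item[: -len("+PTX")] if ptx else item
        major, minor = core.split(".")
        targets.append((int(major), int(minor), ptx))
    return targets


targets = parse_arch_list("9.0;10.0;12.0")
print(targets)

# Nothing in the parsed result names aarch64, x86_64, a Python ABI, or a platform tag.
# Those properties come from the machine and toolchain that run the build.
```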

Why is vLLM discussed as the forcing function?

Because the serving breakage was concentrated around the executor and model registration surface, so the stack shape followed that integration boundary.

What should I read next if I care more about runtime behavior than package versions?

Read GB10 journey for the bring-up story and FP8 in the training stack for the training-side precision and graph-capture boundary.

How is the CUDA-graph scope expressed in the public samples?

As an allowlist, not as a whole-model capture switch. The public launcher sample sets --cuda-graph-impl transformer_engine with --cuda-graph-scope attn mamba moe_router moe_preprocess, while the article's training claim still stops before the dropless MoE MLP and all-to-all dispatcher path. That is why the graph-capture story belongs next to NAM56R CUDA graph launcher sample, not in a broad "Torch 2.13 fixed training" headline.
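For readers who want to see how a multi-value scope flag behaves, here is a minimal argparse sketch that accepts the two flags quoted above. The parser is an illustration written for this article, not the launcher's real argument handling, and the defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the two flags quoted in the answer above."""
    parser = argparse.ArgumentParser(description="CUDA-graph scope flags (sketch)")
    parser.add_argument("--cuda-graph-impl", default="none")
    # nargs="+" collects one or more scope names into a list.
    parser.add_argument("--cuda-graph-scope", nargs="+", default=[])
    return parser


argv = [
    "--cuda-graph-impl", "transformer_engine",
    "--cuda-graph-scope", "attn", "mamba", "moe_router", "moe_preprocess",
]
args = build_parser().parse_args(argv)

print("impl: ", args.cuda_graph_impl)
print("scope:", args.cuda_graph_scope)
```

Because the scope is a list of names, anything not listed is simply absent from it, which matches the allowlist reading in the answer.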

What is locally proven here versus only externally documented?

Locally proven: the checked-in GB10 bundle pins GCC 15, CUDA 13.2, Python 3.13, Torch 2.13 nightly cu132, and source builds for the extension-heavy layer, plus the adjacent training receipts for the CUDA-graph boundary. Externally documented: NVIDIA's broader Blackwell compatibility and tuning rules. Those docs explain the architecture contract; the MegaCpp bundle proves only the specific stack lane we actually ran.

Why do sm_121a, sm_100a, tcgen05, or libcuda show up in a Torch stack article?

Because a GB10 stack choice only makes sense against the hardware and driver contract it is serving. sm_121a is the exact GB10 consumer lane we built for; sm_120f is the family compile target we often use for real GB10 kernels; sm_100a and tcgen05 are the datacenter-side terms that explain what this article is not claiming; and libcuda is the user-space loader boundary that belongs in the adjacent conservative gating article, not in a version-only success story. GB10 repro bundle overview and GB10 repro walkthrough are the shortest checked-in decoders for that split.

I need the smallest checked-in GB10 proof surfaces. What should I open first?

Use GB10 repro bundle overview for the target and feature matrix, GB10 repro walkthrough for the compile-label and toolchain consequences, baseline arch-patch proof sample for the narrow arch-rewrite receipt, driver signal versus runtime proof sample for the "driver-visible signal is not runtime proof" rule, and compact gate-walk mirror plus GB10 gate matrix for the staged tcgen05 gate walk.

Where do FA4, NVFP4, or TPU/XLA fit relative to this Torch article?

FA4 and NVFP4 are neighboring decisions in the same broader ecosystem, but they own different questions: FA4 is the attention-backend eligibility story, NVFP4 is the serving-format story, and this post is the toolchain and runtime-stack story. TPU/XLA is a separate substrate and runtime-ownership story built around PJRT rather than CUDA.

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

sm_120f

Family-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a.

tcgen05

The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.

libcuda

The user-space NVIDIA driver library that owns module load, metadata validation, and the helper-cubin patch lane in the GB10 experiments.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

sm_100

Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

vLLM

How MegaCpp stabilized a GB10-oriented vLLM lane with an on-disk overlay, text-only model registration, and a deliberate keep-disabled list for…

Grounding: vLLM on GB10: the overlay, the registration fixes, and the paths we kept off

PJRT

The TPU runtime interface between frontend code and the backend executor; it is the ownership seam between JAX/Torch-XLA frontends and libtpu.

tcgen05.mma

The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.

TMEM

Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Pallas

JAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.

Serving

How MegaCpp stabilized a GB10-oriented vLLM lane with an on-disk overlay, text-only model registration, and a deliberate keep-disabled list for…

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

NVFP4

NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.
