MegaCpp Engineering: Applied C++ model systems

Grounded engineering note from the MegaCpp stack

Published April 19, 2026 • 10 min read • David Gornshtein

MegaCpp

GB10

PyTorch

CUDA

Build Systems

vLLM

GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint

Q: Why did MegaCpp use a nightly PyTorch build for GB10 parity?

Because the nightly cu132 wheel was part of the same compatibility surface as CUDA 13.2 and the Blackwell-oriented toolchain. It was a runtime constraint, not a generic "latest is better" preference.

Q: Why rebuild so many dependencies from source?

Because wheel reuse was not enough across aarch64 and x86_64, and several extension-heavy packages needed to be rebuilt against the same compiler, CUDA, and PyTorch contract to keep receipts comparable.

Why MegaCpp mirrored the GB10 software stack so exactly: PyTorch 2.13 cu132 nightly, GCC 15, CUDA 13.2, rebuilt source dependencies, and the device-specific constraints that made parity operational rather than cosmetic.

By David Gornshtein

When people hear "stack parity," they often picture convenience: make one environment look like another so debugging feels tidier. In MegaCpp, the GB10 parity stack was not about tidiness. It was about keeping the same compiler, ABI, wheel, and source-build assumptions across two different execution environments so that a result meant the same thing in both places.

The core parity target was narrow and explicit:

Ubuntu 24.04
GCC 15.2
CUDA toolkit 13.2.51
PyTorch 2.13 cu132 nightly
source-built flashinfer, vllm, and selected kernel-side dependencies at pinned commits
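A minimal sketch of how a parity target like this could be checked mechanically. The version patterns come from the list above; the dictionary keys, the `contract_drift` helper, and the assumed nightly version shape (`2.13.0.devYYYYMMDD+cu132`) are illustrative assumptions, not the checked-in MegaCpp tooling.

```python
import re

# Hypothetical contract table: keys and patterns are illustrative.
# Each pattern must match the whole observed version string.
EXPECTED = {
    "ubuntu": r"24\.04(\.\d+)?",
    "gcc": r"15\.2(\.\d+)*",
    "cuda_toolkit": r"13\.2\.51",
    "torch": r"2\.13\.0\.dev\d+\+cu132",  # assumed nightly version shape
}


def contract_drift(observed, expected=EXPECTED):
    """Return human-readable mismatches; an empty list means no drift."""
    problems = []
    for key, pattern in expected.items():
        got = observed.get(key)
        if got is None:
            problems.append(f"{key}: missing")
        elif re.fullmatch(pattern, got) is None:
            problems.append(f"{key}: got {got!r}, expected /{pattern}/")
    return problems
```

Using full-string patterns rather than prefix comparison avoids accepting a string such as `15.20` where `15.2.x` was intended.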

That exact bundle shows up directly in the build recipe used to mirror the GB10 serving stack: the image installs GCC 15 from the Ubuntu toolchain PPA, installs CUDA 13.2, and then force-reinstalls PyTorch 2.13.0.dev* from the cu132 nightly index before rebuilding the rest of the stack from source where wheels are not a safe substitute. The narrow reason this mattered becomes clearer if you read it beside GB10 journey: the toolchain was part of the bring-up proof, not post-hoc packaging trivia. For quick term definitions, use The MegaCpp model glossary. For the adjacent history and why GB10 needed its own lane, continue with GB10 journey. When you want the checked-in evidence lane, start with GB10 repro walkthrough.

This is the environment-contract side of GB10 journey and Torch 2.13 on GB10 serving and training stack: the point is not that the package list is modern, but that the toolchain is close enough to make receipts comparable.

For a first-time reader: sm_100, sm_100a, and sm_121a are not interchangeable names for "Blackwell." sm_100 is the baseline Blackwell target name in the compiler vocabulary, sm_100a is the architecture-specific datacenter Blackwell target used by B200-class cubins, and sm_121a is the architecture-specific consumer-Blackwell target reported on GB10. sm_120f is the family-specific consumer-Blackwell compile target we often prefer for shipping kernels when family-common optimizations matter more than one exact device label, because it keeps you on the consumer-Blackwell lane without pretending GB10 turned into an SM100 target. Use The MegaCpp model glossary for the target-name family, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the boundary that makes those names matter, and GB10 arch patch probe sample for the checked-in probe lane.
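The naming rule in the paragraph above can be sketched as a small parser. This is an illustration only: `classify_target` and its return shape are hypothetical, and the suffix meanings are taken from the text (no suffix is baseline, `a` is architecture-specific, `f` is family-specific).

```python
import re

# Suffix meanings as described in the text; not an NVIDIA or MegaCpp API.
_KINDS = {"": "baseline", "a": "architecture-specific", "f": "family-specific"}


def classify_target(name):
    """Split an sm_* target name into (major, minor) capability and kind."""
    match = re.fullmatch(r"sm_(\d{2,})([af]?)", name.lower())
    if match is None:
        raise ValueError(f"not an sm_* target name: {name!r}")
    digits, suffix = match.groups()
    # The last digit is the minor version, the rest is the major version,
    # so sm_121a maps to capability (12, 1).
    capability = (int(digits[:-1]), int(digits[-1]))
    return {"capability": capability, "kind": _KINDS[suffix]}


def same_major(a, b):
    """True when two targets share a major capability (the same 'lane')."""
    return classify_target(a)["capability"][0] == classify_target(b)["capability"][0]
```

Under this sketch, `same_major("sm_120f", "sm_121a")` holds while `same_major("sm_100a", "sm_121a")` does not, which is the distinction the paragraph draws between the consumer and datacenter lanes.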

Why parity mattered

MegaCpp had a practical problem, not an aesthetic one. The GB10 stack was aarch64, while the mirrored environment used for external validation and portability work was x86_64. That means wheel-level parity is impossible in the naive sense: even if the package names match, the binaries do not. The bench-side parity recipe says this plainly: wheels are cross-incompatible, so source dependencies have to be rebuilt against matching versions instead of copied over blindly.
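To make the cross-incompatibility concrete, here is a simplified sketch that reads the platform tag out of a wheel filename and compares it with a machine string. The helper names and the example filenames are hypothetical, and real installers use a fuller tag-matching algorithm than this suffix check.

```python
def wheel_platform_tags(filename):
    """Return the platform tags encoded in a wheel filename.

    Wheel names follow name-version[-build]-python-abi-platform.whl, so the
    platform field is always the last dash-separated component.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    platform_field = filename[: -len(".whl")].split("-")[-1]
    return platform_field.split(".")  # compressed tag sets are dot-separated


def wheel_matches_machine(filename, machine):
    """Simplified check: pure-Python wheels match anything, binary wheels
    must carry a platform tag ending in the machine name."""
    return any(
        tag == "any" or tag.endswith("_" + machine)
        for tag in wheel_platform_tags(filename)
    )
```

On a real host the `machine` argument would typically come from `platform.machine()`, which reports values such as `aarch64` or `x86_64`.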

This is the real meaning of parity in a compiler-heavy ML stack. It is not "same pip install output." It is "same effective runtime contract": same PyTorch generation, same CUDA toolchain generation, same compiler family, same pinned source revisions for the packages that actually compile kernels or extend the runtime. That runtime-contract phrasing is deliberate, and it matches the claim-boundary discipline in GB10 driver gates and false capability signals: a software-visible path only means something if the surrounding environment is comparable enough to trust the receipt.

For MegaCpp that mattered because several behavior boundaries are stack-sensitive:

whether a package can be installed from a wheel at all
whether a source package builds cleanly against the chosen PyTorch and CUDA pair
whether runtime-compiled kernels target the same capability and shared-memory assumptions
whether a serving or benchmark receipt can be compared to device-local results without hidden environment drift

The nightly wheel was a constraint, not a preference signal

The GB10 parity build pins PyTorch to 2.13 cu132 from the nightly index. That decision is worth stating directly because it is easy to misread. Nightly here is not a style choice and not a vague desire to be "latest." It is a compatibility constraint.

The parity scripts install PyTorch from the nightly cu132 channel and then build the rest of the stack around it. The reason is simple: the rest of the environment is already targeting CUDA 13.2 and a current Blackwell-oriented toolchain. Once that is true, the PyTorch choice is no longer an isolated package decision. It becomes part of one coupled compiler surface.

In practice, that surface included:

torch 2.13.0.dev* from cu132
cuda-toolkit-13-2
cuda-python / cuda-bindings in the 13.2 line
editable or source installs for vllm, flashinfer, and other extension-heavy packages

That is why the dependency notes distinguish between ordinary PyPI pins and "custom-built editable installs." The latter are the packages that cannot be treated as interchangeable wheels without losing control of the build surface.
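One way to surface that distinction on a live environment is the `direct_url.json` record that installers write for direct and editable installs (PEP 610). The sketch below is an assumption about how such an audit could look, not the MegaCpp dependency tooling; `classify_direct_url` and `install_kind` are hypothetical names.

```python
import json
from importlib import metadata


def classify_direct_url(direct_url_text):
    """Classify an install from the contents of its direct_url.json, if any."""
    if direct_url_text is None:
        return "index"  # no record: installed by name from a package index
    info = json.loads(direct_url_text)
    if info.get("dir_info", {}).get("editable"):
        return "editable"
    if "vcs_info" in info:
        # Source install from a VCS checkout; report the pinned commit.
        return "vcs@" + info["vcs_info"].get("commit_id", "unknown")
    return "direct"  # local or remote archive/directory given explicitly


def install_kind(package):
    """Look up an installed distribution and classify how it was installed."""
    dist = metadata.distribution(package)  # raises PackageNotFoundError if absent
    return classify_direct_url(dist.read_text("direct_url.json"))
```

Running `install_kind` over the extension-heavy packages gives a quick split between ordinary index pins and the custom-built installs the dependency notes call out.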

GCC 15 and CUDA 13.2 were part of the same story

The parity image does not only pin PyTorch. It also installs GCC 15 and CUDA 13.2 explicitly, then exports the expected CUDA paths before any source builds begin. That matters because MegaCpp was not just importing Python modules. It was building and rebuilding extension code, JIT-capable libraries, and runtime-sensitive CUDA packages on top of that environment. In this article, CUDA 13.2 is therefore shorthand for one GB10-specific compiler/runtime lane: the compile targets, ptxas surface behavior, extension builds, and loader/runtime expectations that had to agree before a GB10 receipt meant anything. The same build-surface coupling is visible again in GB10 libcuda driver patch lane and why it still is not silicon proof: libcuda is the NVIDIA user-space CUDA loader library, so once the evidence lane moves below Python, sloppy environment drift stops being a minor inconvenience.

The shortest reader-first route for exact compile-target rules is What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the target boundary and GB10 repro walkthrough for the checked-in rulebook. This post stays on the operational question: why the stack had to mirror GB10 so closely.

In other words, the stack boundary was not "Python package management." The real boundary was C++ and CUDA compilation. If GCC, NVCC, and PyTorch are not aligned closely enough, the resulting environment may install successfully and still fail to produce comparable runtime behavior.

That is why the parity recipe is structured in the order it is:

install system toolchain
install CUDA toolkit
install the exact PyTorch nightly lane
rebuild kernel-sensitive dependencies from source
overlay the exact vllm modifications needed by the MegaCpp serving path

This is less glamorous than benchmark plots, but it is the difference between a reproducible environment and an anecdotal one.
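A hedged sketch of a check that could run between the toolchain steps and the source builds: collect what GCC, NVCC, and PyTorch report and confirm that the CUDA lines agree. The function names are hypothetical. The sketch relies on `gcc -dumpfullversion`, on the `release X.Y, VX.Y.Z` line that `nvcc --version` prints, and on `torch.version.cuda`; missing tools are reported as `None` rather than treated as errors.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract the full toolkit version from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+), V(\d+\.\d+\.\d+)", text)
    return match.group(2) if match else None


def tool_output(cmd):
    """Run a tool and return its stdout, or None when it is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def toolchain_report():
    """Gather compiler, toolkit, and PyTorch versions from the current host."""
    gcc = tool_output(["gcc", "-dumpfullversion"])
    nvcc = tool_output(["nvcc", "--version"])
    report = {
        "gcc": gcc.strip() if gcc else None,
        "nvcc": parse_nvcc_release(nvcc) if nvcc else None,
        "torch": None,
        "torch_cuda": None,
    }
    try:
        import torch  # optional: absent on hosts without PyTorch installed
        report["torch"] = torch.__version__
        report["torch_cuda"] = torch.version.cuda
    except ImportError:
        pass
    return report


def cuda_lines_agree(report):
    """True when nvcc and PyTorch's build-time CUDA share major.minor."""
    if not report.get("nvcc") or not report.get("torch_cuda"):
        return False
    return report["nvcc"].split(".")[:2] == report["torch_cuda"].split(".")[:2]
```

Failing fast on `cuda_lines_agree(toolchain_report())` before any extension build keeps a mismatched toolkit from producing binaries that install cleanly but do not belong to the same lane.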

Why rebuilding from source was unavoidable

The dependency inventory for the GB10 stack makes the constraint visible. Several key packages are not consumed as ordinary released wheels. They are installed editable or from source at pinned commits, including vllm, flashinfer, mamba_ssm, and other kernel-adjacent components.

That happened for two separate reasons.

First, architecture. aarch64 on one side and x86_64 on the other means prebuilt wheels are not portable across the two environments.

Second, stack shape. Even on one architecture, the environment carried source-level overlays and pinned revisions that were part of the actual working runtime. The dependency notes explicitly call out a vllm commit plus a file overlay, plus additional divergence between the GB10 stack and the Modal-side stack. Once you are in that world, "install the nearest stable wheel" is no longer a real parity strategy. The cleanest continuation on that point is vLLM on GB10: the overlay, the registration fixes, and the paths we kept off, because it shows which divergences were part of the working runtime contract and which ones were deliberately excluded from the parity lane.

Parity mattered because device constraints were real, not hypothetical

The strongest reason this mattered for MegaCpp is that GB10 was not just another CUDA machine with fewer SMs. The codebase carries an explicit GB10 shared-memory preflight because the sm_121 lane keeps a 99 KiB maximum shared-memory limit per thread block even though compute capability 12.0 exposes 128 KB of shared-memory capacity per SM. The preflight module explains the operational impact directly: several TileLang kernels can naively emit 140+ KiB of shared-memory descriptors, which compiles but then fails at launch with an opaque runtime error unless the aggressive shared-memory merge flag is enabled.

That logic is not a comment-only note. It is enforced behavior.

The README summarizes the production reason for shipping the preflight: on sm_121, training is refused unless every relevant TileLang kernel declares TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE=True, because otherwise kernels can compile and only fail at first launch.
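The shape of such a refusal can be sketched in a few lines. This is not the shipped preflight module: the `preflight` function, the `PreflightError` type, and the kernel-to-flags mapping are hypothetical, and only the flag name and the 99 KiB limit come from the text above.

```python
SM121_MAX_SMEM_PER_BLOCK = 99 * 1024  # per-block launch limit from the text
MERGE_FLAG = "TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE"


class PreflightError(RuntimeError):
    """Raised when a kernel would be allowed to fail only at first launch."""


def preflight(kernels, capability):
    """Refuse to proceed on sm_121 unless every kernel enables the merge flag.

    kernels: mapping of kernel name -> dict of declared compile flags
             (a hypothetical registry shape used for illustration).
    capability: (major, minor) compute capability of the target device.
    Returns the sorted names of the kernels that were checked.
    """
    if capability != (12, 1):
        return []  # the guard only applies to the sm_121 lane
    missing = sorted(
        name for name, flags in kernels.items() if flags.get(MERGE_FLAG) is not True
    )
    if missing:
        raise PreflightError(
            f"{MERGE_FLAG}=True not declared for {', '.join(missing)}; "
            f"kernels may exceed the {SM121_MAX_SMEM_PER_BLOCK} byte "
            "per-block shared-memory limit and fail at first launch"
        )
    return sorted(kernels)
```

The design point is that the check runs before training starts, so the failure is a named refusal with a kernel list instead of an opaque launch error.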

This is exactly the kind of device-specific behavior that makes parity matter. If MegaCpp validated adjacent serving and benchmark layers on a stack that drifted away from the GB10 runtime assumptions, then a "working" result would not tell us much. It could be hiding a compiler difference, a dependency-build difference, or a missing device-specific guard that only the GB10-shaped environment would reveal.

That same device-specific layer is why a few below-Python terms still appear in this stack article. tcgen05, TMEM, TMA multicast, libcuda, and the observed .nv.capmerc / .nv.merc.rela.* boundary explain why a seemingly small stack drift can create very misleading "support" signals. This post does not try to re-teach those terms. It only needs the narrower operational conclusion: parity had to preserve the same build, loader, and runtime surface that those lower-level boundaries depend on. The .nv.capmerc name stays observational-only here; it is the literal metadata section name visible in the public cubin lane, not a documented NVIDIA API surface.

For quick definitions, use The MegaCpp model glossary. For the reader-first proof boundary, continue with What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths. For the checked-in evidence lane, start with GB10 repro walkthrough.

The point was comparability, not identical hardware

The parity work did not pretend that two machines become identical once their package lists look similar. The bench recipe says the opposite: one side is aarch64, the other is x86_64, so exact wheel reuse is off the table from the start.

The goal was therefore narrower and more useful: preserve the parts of the stack that determine build behavior and runtime expectations, then compare outcomes across environments with fewer hidden variables.

In practice, that often meant one aligned source-build recipe rather than one portable wheel set. The local GB10 stack notes keep that distinction explicit: the toolchain could stay in one family while extension-heavy packages were still rebuilt per host architecture instead of being copied blindly between aarch64 and x86_64. That is the operational shape of parity here: shared compiler and runtime rules, local rebuilds where binary artifacts stop being interchangeable.

That is why the parity image keeps the same broad stack story:

same Ubuntu generation
same GCC generation
same CUDA generation
same PyTorch nightly lane
same pinned source commits for extension-heavy packages
same local overlay strategy where upstream releases did not yet match the working stack

For MegaCpp, that was enough to make benchmark and serving evidence interpretable. Without it, every receipt would have had a built-in disclaimer: maybe the result is real, or maybe it is just a toolchain mismatch.

What this means operationally

The GB10 parity stack is a good example of a broader MegaCpp rule: version notes should describe the active compatibility surface, not just list packages.

A useful stack note answers questions like these:

which parts were wheel-installed versus source-built?
which CUDA generation was assumed by the runtime?
which compiler generation built the local extensions?
which packages were pinned to commits because released artifacts were insufficient?
which device-specific constraints made that exact stack necessary?

That is the reason this parity work deserves its own write-up. It was not just a porting convenience for one benchmark harness. It was the environment contract that made later benchmark, serving, and kernel evidence comparable enough to trust.

FAQ

Frequently asked questions

Why did MegaCpp use a nightly PyTorch build for GB10 parity?

Because the nightly cu132 wheel was part of the same compatibility surface as CUDA 13.2 and the Blackwell-oriented toolchain. It was a runtime constraint, not a generic "latest is better" preference.

Why rebuild so many dependencies from source?

Because wheel reuse was not enough across aarch64 and x86_64, and several extension-heavy packages needed to be rebuilt against the same compiler, CUDA, and PyTorch contract to keep receipts comparable.

Why does ABI parity matter for this stack?

Because the risky packages here are not pure Python. They load C++ and CUDA extension code into the same process as PyTorch, Triton, CUDA, and the serving runtime. If those extensions were built against a different compiler or runtime generation, the package name could still look right while the binary interface is wrong. That is why this article treats GCC, CUDA, PyTorch, and source-built dependencies as one compatibility surface instead of separate install steps.

Why is the vllm overlay part of the parity story instead of a serving-only footnote?

Because parity here means matching the extension and runtime surface that the GB10 lane actually used. Once vllm is source-built and locally overlaid, that overlay is part of the environment contract, not optional commentary.

What does TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE actually change on GB10?

It changes the shared-memory layout story, not the hardware limit. On GB10, the useful effect is that buffers whose lifetimes do not overlap can reuse the same shared-memory bytes, so a kernel that looks too large in a naive staging plan can still fit under the 99 KiB launch boundary. That is why the flag belongs in the parity story: without the same layout rule, two environments can compile the same kernel family and still disagree at first launch.
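The arithmetic behind that answer can be sketched in a few lines. This is a toy model, not TileLang's allocator: the `Buffer` type, the buffer names, sizes, and step numbers are all hypothetical, and the merged figure is the peak of simultaneously live bytes, which is a lower bound that a real allocator may exceed because of alignment or fragmentation.

```python
from dataclasses import dataclass

KIB = 1024
SMEM_LIMIT_BYTES = 99 * KIB  # the 99 KiB launch boundary named above


@dataclass(frozen=True)
class Buffer:
    name: str
    size_bytes: int
    first_step: int  # first pipeline step that touches the buffer
    last_step: int   # last pipeline step that touches it (inclusive)


def naive_bytes(buffers):
    """Naive staging plan: every buffer gets its own bytes for the whole kernel."""
    return sum(b.size_bytes for b in buffers)


def peak_live_bytes(buffers):
    """Largest total size of buffers that are live during the same step."""
    if not buffers:
        return 0
    last = max(b.last_step for b in buffers)
    return max(
        sum(b.size_bytes for b in buffers if b.first_step <= step <= b.last_step)
        for step in range(last + 1)
    )


# Hypothetical kernel: two load tiles early, two different staging buffers late.
plan = [
    Buffer("load_a", 32 * KIB, 0, 1),
    Buffer("load_b", 32 * KIB, 0, 1),
    Buffer("stage_c", 32 * KIB, 2, 3),
    Buffer("stage_out", 32 * KIB, 2, 3),
]

print("naive fits:", naive_bytes(plan) <= SMEM_LIMIT_BYTES)
print("merged fits:", peak_live_bytes(plan) <= SMEM_LIMIT_BYTES)
```

In this toy plan the naive total is 128 KiB, over the boundary, while no more than 64 KiB is ever live at once, so a lifetime-aware layout fits and a naive one does not. That is the sense in which two environments with different layout rules can disagree at first launch on the same kernel.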

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

sm_100

Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.

.nv.merc.rela.*

Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary.

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

sm_120f

Family-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a.

tcgen05

The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.

TMA multicast

The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.

.nv.capmerc

Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.

TMEM

Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

libcuda

The user-space NVIDIA driver library that owns module load, metadata validation, and the helper-cubin patch lane in the GB10 experiments.

PTX

NVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.

vLLM

How MegaCpp stabilized a GB10-oriented vLLM lane with an on-disk overlay, text-only model registration, and a deliberate keep-disabled list for…

Grounding

vLLM on GB10: the overlay, the registration fixes, and the paths we kept off

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.
