Entity Hub

GB10 and Blackwell Bring-Up

A curated GB10 and Blackwell reading path: consumer-versus-datacenter tensor paths, driver-visible false positives, arch-patch repros, and the serving and precision choices that survived contact with the hardware.

This hub is for readers who want the GB10 lane in the right order. Start with the broad war story and the tensor-path proof summary, then move into the gate-by-gate repro pieces and finally the stack, serving, and precision follow-through.

GB10

tcgen05

driver-research

sm121a

libcuda

nvfp4

Curated set

Articles in reading order

Why this hub

Best if you care about what GB10 actually proved, where tcgen05 evidence stops, and how those hardware limits changed the rest of the MegaCpp stack.

Start Here

Build the hardware and runtime picture before drilling into the patch lanes.

01
April 18, 2026 • 17 min read • David Gornshtein
Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story
Field notes from bringing the MegaCpp SLM Ensemble up on NVIDIA GB10 and DGX Spark: silicon surprises, NaN bisects that ate days, regressions caused by our own patches, and the software-stack choices that held.
The broad GB10 war story: what was attempted, what held up, and what turned out to be wishful thinking.
GB10
Blackwell
SM121A
NVFP4
02
April 20, 2026 • 11 min read • David Gornshtein
What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths
Our GB10 tests show that some Blackwell datacenter-targeted SASS can be accepted and executed on consumer silicon, but they do not prove that the Blackwell Tensor Core Generation 5 matrix-instruction path (tcgen05.mma) physically executes on GB10. Older, stronger claims overstate what the evidence supports.
The shortest accurate summary of which Blackwell tensor-path claims are backed by public evidence and which are still missing.
GB10
Blackwell
CUDA
Tensor Core
03
April 20, 2026 • 9 min read • David Gornshtein
Why Driver-Visible Paths Can Look Like Hardware Support on GB10, Even When Silicon Proof Is Missing
A field report on GB10 reverse engineering: how libcuda tables, helper cubins, and signed capability metadata can make tcgen05 look reachable from software while still falling short of proving that the underlying silicon really exposes the same path as B200 or GB100.
Read this before trusting any driver-visible capability bit as proof of real hardware execution.
GB10
Blackwell
CUDA
Driver Research

Gate Walk and Patch Lanes

These are the concrete repros and gate-by-gate explanations once the top-level claim is clear.

04
April 20, 2026 • 8 min read • David Gornshtein
Reproducing the sm_100a -> sm_121a Cubin Patch on GB10: CUDA/C++ Code, ELF Edits, and the Exact Point Where tcgen05 Stops
A practical GB10 reproduction guide for the narrow result we can defend publicly: a patched sm_100a baseline cubin executes on GB10, while tcgen05-oriented probes stop at later driver-side gates rather than producing a publication-grade tcgen05 proof.
The public arch-field repro from sm_100a to sm_121a, including the exact point where tcgen05 stops.
GB10
Blackwell
CUDA
C++
05
April 20, 2026 • 9 min read • David Gornshtein
Inside the GB10 Driver Patch Lane: libcuda Tables, Helper Cubins, Linux Hooks, and Why Deeper Patching Still Is Not tcgen05 Proof
A public-safe walkthrough of the deeper GB10 driver research lane: what was patched in libcuda, what changed in the cubin and toolchain path, where Linux- and loader-level hooks entered the picture, and why that deeper progress still stops short of publication-grade tcgen05 proof.
The deeper driver patch lane and why even aggressive patching still does not count as clean tensor-path proof.
GB10
Blackwell
CUDA
libcuda
06
April 19, 2026 • 10 min read • David Gornshtein
GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint
Why MegaCpp mirrored the GB10 software stack so exactly: PyTorch 2.13 cu132 nightly, GCC 15, CUDA 13.2, rebuilt source dependencies, and the device-specific constraints that made parity operational rather than cosmetic.
The toolchain and runtime choices that made the GB10 lane stable enough to test honestly.
MegaCpp
GB10
PyTorch
CUDA

Serving and Precision Follow-Through

Once the bring-up and gate story is understood, these explain the downstream execution choices.

07
April 19, 2026 • 9 min read • David Gornshtein
Torch 2.13 on GB10: the serving and training stack we actually chose
A public, evidence-based write-up of the stack choices around Torch 2.13, CUDA 13.2, GCC 15, GB10, and vLLM compatibility in the MegaCpp workflow.
The serving and training stack we actually chose once the GB10 environment stopped moving.
PyTorch
GB10
vLLM
CUDA
08
April 18, 2026 • 4 min read • David Gornshtein
NVFP4 Inference for the MegaCpp SLM Ensemble
Why we train in FP16/BF16 and ship in NVFP4, what Blackwell and GB10 actually give us, and which kernels survive the trip from B200 to DGX Spark.
The precision policy for the GB10 inference lane once tensor-path assumptions were narrowed.
NVFP4
Blackwell
GB10
Inference
09
April 19, 2026 • 10 min read • David Gornshtein
vLLM on GB10: the overlay, the registration fixes, and the paths we kept off
How MegaCpp stabilized a GB10-oriented vLLM lane with an on-disk overlay, text-only model registration, and a deliberate keep-disabled list for serving paths that were not yet honest.
The serving overlay and the GB10-specific paths we intentionally kept disabled.
vLLM
GB10
Serving
Inference

Keep exploring

Adjacent topic hubs

These hubs cover nearby parts of the blog without turning the archive into a giant taxonomy.

Modal Training and Benchmark Operations

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

MLA Integration, Dispatch, and Weight Absorption

Evaluation, Benchmarks, and Verifier Loops

Megatron Parallelism and Layout Boundaries

TPU Sparse Attention and Pallas Kernels

H200 Training and Kernel Bring-Up

TPU v6e and XLA Runtime Surfaces

C++ Data Pipelines and Corpus Packaging

MoE, Routing, and Distributed Model Splits