Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story
Field notes from bringing the MegaCpp SLM Ensemble up on NVIDIA GB10 and DGX Spark: silicon surprises, NaN bisects that ate days, regressions caused by our own patches, and the software-stack choices that held.

The GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough / DGX Spark looks, from the marketing slide, like a small Blackwell. It is not. It is a different ISA wearing the Blackwell logo, with a desktop-class die, a roughly 273 GB/s LPDDR5X memory bus, and a software stack that assumes you are running on a B200 until you prove otherwise. This post is the unvarnished account of bringing our hybrid SSM/attention/MoE stack up on this box: the silicon traps, the NaN hunts that turned out to be our own patches, and the software-stack recipe that finally held.
Why this matters
GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough is the first Blackwell-branded box most teams will actually touch with their own hands. Treating it as a small B200 is the failure mode that costs the most time. It shares the brand and the FP4 datatype but not the tensor-memory hardware, not the SMEM budget, not FlashAttention-4, and not the bandwidth headroom that makes B200 inference look easy. We brought up GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough because it is the right surface for kernel validation, single-node smoke tests, and end-to-end architectural sanity checks under the unified-memory ceiling ā and because the cheap ways it fails are the cheap ways our customers' GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough deployments will fail. This is what it took to get a clean training step.
Three first-touch terms are worth keeping separate from the start. FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample means
the FlashAttention-4 backend family, not "any fast attention on Blackwell."
NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference means NVIDIA's Blackwell low-precision serving format, not "all FP4 on
this box." And GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough's 128 GB unified LPDDR5X memory is a system-style memory
pool with a very different bandwidth story from H200 or B200 HBM, which is why
the memory-lane articles stay separate. For quick definitions of those terms
and the neighboring GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough labels, use The MegaCpp model glossary.
For the operational follow-on, continue with GB10 Stack Parity for MegaCpp.
For the checked-in evidence lane, start with GB10 repro walkthrough.
Two more labels show up throughout this article. OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.GroundingAbout: NVFP4 inference on GB10 Example: driver signal versus runtime proof sample is the warp-level
matrix path that still executes on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough; TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity is the datacenter-only staging
surface behind tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough. Keeping those separate is the fastest way to avoid
reading a B200 receipt as if it already proved something about GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough. For that
narrower proof boundary, continue with What Our GB10 Experiments Actually
Prove About Blackwell Consumer vs Datacenter Tensor Paths.
1. What GB10 actually is
The first thing to internalise is that sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is not a small sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough. The Blackwell umbrella covers two architecturally distinct chips: datacenter (sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough, B200) and consumer-class (sm_120a RTX 5090, sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough). In NVIDIA's compiler docs, that a suffix marks an architecture-specific target, which is why these names are not interchangeable aliases. NVIDIA's own forum reps put it bluntly: GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough's tensor cores are "closer to the GeForce Ampere-style MMA model". RT cores and DLSS silicon took the die budget that would have gone to TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity and tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough on the datacenter parts. Use The MegaCpp model glossary for the target-name family, GB10 Stack Parity for MegaCpp for the toolchain consequences, and GB10 repro walkthrough when you want the checked-in evidence lane.
| Property | B200 (sm_100a) | GB10 (sm_121a) |
|---|---|---|
| SM count | 132 | 48 |
| Memory | HBM3e ~8 TB/s | LPDDR5X ~273 GB/s, 128 GB unified |
| Dynamic SMEM budget | ~232 KiB | ~99 KiB |
tcgen05.* family / TMEM |
yes | absent |
| 2-SM TMA multicast | yes | cluster cap 1, effectively absent |
Hopper wgmma.mma_async |
n/a (deprecated) | n/a (deprecated) |
| FlashAttention-4 cubins | yes | rejected at driver |
| Tensor-core peak | ~2,250 BF16 / ~9,000 FP4 TFLOPS | ~100 BF16 / ~400 FP4 TFLOPS |
The folk wisdom that "FP4 doesn't work on consumer Blackwell" is wrong but understandable. FP4 does work on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough via warp-level OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.GroundingAbout: NVFP4 inference on GB10 Example: driver signal versus runtime proof sample. What does not work is the tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough-coupled UTCOMMA path, which is what most CUTLASS NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference examples hard-code. For first touch, that means tcgen05.* is the datacenter Blackwell tensor-core and TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity family, while the surviving GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough lane is the warp-level mma.sync / OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.GroundingAbout: NVFP4 inference on GB10 Example: driver signal versus runtime proof sample path plus smaller single-CTA TMA copies. The checked-in probes keep the hard terms separate on purpose: tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.GroundingAbout: GB10 cubin patch repro Example: GB10 repro walkthrough Example: GB10 gate walkthrough is the smallest TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity-allocation probe, tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is the load-side follow-on, and tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails is the matrix instruction itself. They do not prove the same thing, and none of them upgrades GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough into a datacenter-tensor-path receipt. Use The MegaCpp model glossary for the term family, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the proof boundary, and GB10 gate matrix for the checked-in gate walk.
The same caution applies to TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary Example: GB10 repro walkthrough Reference: full GB10 tensor-path probe source. In PTXQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Reference: GB10 TMA multicast probe surface, multicast::clusterQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary Example: GB10 repro walkthrough Reference: full GB10 tensor-path probe source means one clustered cp.async.bulk.tensor transfer trying to fan a tile out to multiple CTAs. In this corpus, the checked-in bundle keeps that as one local multicast probe rather than as a synonym for the whole family. It is adjacent to the tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough family, not a hidden synonym for tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails, and not a deployment-grade GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough success receipt. Use The MegaCpp model glossary for the term family, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the claim boundary, and full GB10 tensor-path probe source for the exact checked-in probe.
That lower-level split is where the gate-metadata terms enter. Once the clustered-copy or tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough probes move past the obvious front-door checks, the checked-in GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough lane starts talking about reservedSmem, .nv.infoQuick term guide.nv.infoObserved per-kernel metadata records edited in the GB10 repro lane before the later integrity-protected boundary.GroundingAbout: GB10 driver gates warning Example: GB10 gate repro Example: GB10 gate walkthrough per-kernel records, .nv.capmercQuick term guide.nv.capmercObserved section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.GroundingAbout: GB10 driver gates warning History: libcuda patch lane Example: GB10 gate repro, and .nv.merc.rela.*Quick term guide.nv.merc.rela.*Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary.GroundingAbout: GB10 driver gates warning Example: GB10 gate walkthrough Reference: GB10 tensor-path proof summary because those are the later loader and capability-metadata boundaries the public-safe repro actually hits. They are not alternate kernel families. They are the names of the deeper stops that explain why "the cubin got farther" still does not mean "the tensor path executed cleanly." For the reader-first map, continue from The MegaCpp model glossary to Reproducing the sm_100a to sm_121a cubin patch on GB10, then use GB10 gate matrix before opening the lower-level patch helpers.
2. The toolchain dance
For first touch, CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: Torch 2.13 on GB10 stack here means the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough compiler/runtime surface under Triton, PyTorch 2.13, and source-built extensions, not a generic H200-ops or benchmark lane. Use The MegaCpp model glossary for the target-name and term-family definitions, GB10 Stack Parity for MegaCpp for the environment contract, and GB10 repro walkthrough for the checked-in rulebook.
Bringing up our hybrid pretraining recipe on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough took five separate fixes before the first iteration produced a finite loss. None of them were exotic; all of them were easy to miss if you assumed GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough behaved like a smaller B200. The environment-contract side of the same story lives in GB10 Stack Parity for MegaCpp.
ptxas
Triton ships its own ptxasQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Reference: GB10 TMA multicast probe surface (12.8), which does not know what sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is and bails with Value 'sm_121a' is not defined for option 'gpu-name'. Point Triton at the system CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: Torch 2.13 on GB10 stack 13.0+ ptxasQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Reference: GB10 TMA multicast probe surface instead:
if not os.environ.get("TRITON_PTXAS_PATH"):
for ptxas in ["/usr/local/cuda/bin/ptxas", shutil.which("ptxas")]:
if ptxas and os.path.exists(ptxas):
os.environ["TRITON_PTXAS_PATH"] = ptxas
break
If you do have to bridge an older Triton stack with an sm_121a -> sm_120a wrapper, isolate that lane's TRITON_CACHE_DIR too. TRITON_CACHE_DIR is Triton's on-disk compile cache, and reusing one shared cache between native and aliased targets turns the compile workaround into a provenance problem: the next run can silently pick up cubins built for the wrong target string. The public-safe follow-on is the same as the broader patch-lane rule in How we keep a patch lane: keep the workaround explicit, keep its cache isolated, and do not let a temporary bridge become invisible state.
is_big_gpu
PyTorch Inductor refuses max_autotune_gemm if the GPU has fewer than 68 SMs. GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough has 48. The local override was tiny:
os.environ["TORCHINDUCTOR_MAX_AUTOTUNE_GEMM"] = "1"
import torch
import torch._inductor.utils as inductor_utils
inductor_utils.is_big_gpu = lambda index=0: True
That override only reopens an Inductor heuristic gate. It does not make GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough a datacenter-class device, and it does not rescue B200-sized tiles or SMEM-heavy configs. The hardware envelope is still the 48-SM, 99 KiB lane from Section 1.
Some Triton configs still fail with shared-memory errors during autotune. That is fine; autotune handles it. The deeper point is that GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough's 99 KiB SMEM budget is lower than the SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Reference: GB10 stack parity-shaped defaults many kernels assume, so every B200-sized tile has to be re-checked on the consumer lane.
MFU
The default MFU calculation in our trainer divides by H100's BF16 peak (~989 TFLOPS), which on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough reports a depressing single-digit percent. With the correct denominator (~62 TFLOPS BF16, ~500 TFLOPS NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference), MFU comes out closer to low double digits, which is roughly what the silicon can do given the bandwidth wall.
Liger graph break
LigerFusedLinearCrossEntropyFunction calls target_mask.sum().item() internally, forcing a torch.compile graph break and tanking Liger's throughput below the unfused baseline.
3. The kernel layer: TileLang wins, cuTile is a dead end on this box
The reader-first explanation for why these GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough kernels flatten out starts with The MegaCpp model glossary for the term families, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the consumer-vs-datacenter boundary, and TileLang TMA bulk-copy sample for a compact checked-in example of the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough-side lowering constraints. The section only reads cleanly if you keep that boundary explicit: the local wins are on the OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.GroundingAbout: NVFP4 inference on GB10 Example: driver signal versus runtime proof sample and small-copy side, while the deeper gate-metadata names belong to the separate cubin-patch and driver-proof articles rather than to the training hot path itself.
The ensemble's hot path is the Mamba-3 MIMO backward-of-backward (bwd_bwd) kernel. We tried three independent paths to beat the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample baseline on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough. All three lost. The same consumer-vs-datacenter split kept reappearing: kernels that win on B200 are often paying for on-chip state that GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough simply does not have.
cuTile Python rewrite
This was the most thorough attempt. Five algorithmic variants - fused monolithic, nested @ct.function per phase, 3-kernel split, hoisted loop invariants, full ct.static_iter unroll - all regressed against the 2-kernel A/B split baseline. The full unroll was several times slower. The 3-kernel split that won by a third on B200 (TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity, 228 KiB SMEM) regressed by a few percent on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough. The launch-overhead versus live-set trade-off flips the moment you change SMEM budget. The lesson is simple: never assume a cuTile variant that wins on one GPU will win on another. Re-sweep on the target hardware.
CuTe DSL hot-path port
This was the most fun and the most humbling. Here, CuTe DSL means the CUTLASS-style Python surface where layout, staging, and instruction choice are still explicit rather than hidden behind a high-level GEMM call. We got cute.nvgpu.warp.MmaF16BF16Op + TMA + persistent scheduler running on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough out of the box via the GeForce-Blackwell dense-GEMM pattern: pass "sm_120" as the SmemAllocator capacity key, and do not use CUTE_DSL_ARCH=sm_120a overrides because the cubin loader rejects them. The hand-written batched GEMM at L=256 ran in roughly 10 microseconds. torch.bmm on the same shape ran in roughly the same time. cuBLAS on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough already matches a hand-written CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: tcgen05 gate matrix sample Example: TileLang TMA bulk-copy sample kernel at small BF16 shapes. The TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample advantage is not GEMM efficiency; it is that TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample fuses about ten GEMMs plus on the order of 150 elementwise ops plus rotary plus reductions into one CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: Torch 2.13 on GB10 stack kernel with 16 CTAs each running 16 chunks in on-chip state. cuTile Python structurally cannot do that because it has to split into at least two kernels with gmem temporaries. The roughly 4x gap is the kernel-structure tax, not the instruction tax.
Triton M2RNN autotune sweep
This was the most anticlimactic lane. The sweep moved search around the edges, but it never closed the structural gap to the fused TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample path on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough. That is why the practical verdict here stays boring: TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample remained the hot-path choice, and the other lanes stayed receipts, not replacements.
4. NaN, NaN, NaN: a bisect that wasn't
The hardest two days of this project had nothing to do with kernels and everything to do with grad norm: nan. The symptom was simple: the canonical multi-GPU H200 hybrid training was producing finite gradients on day one ("the golden run") and grad norm: nan on every iteration two days later.
The obvious suspect was a recent commit that rewrote the MTP and main-head Liger CE patches from reduction="none" to reduction="mean" plus broadcast, explicitly to fix a silent grad corruption from an upstream Liger issue. Clean hypothesis: the fix broke training. Empirical reality: not the fix. We ran five mutations at HEAD - drop MTP, prefer native Hopper CE, fall back to vanilla CE on logits, revert the Mamba3 regional-compile change, and drop selective recompute. All five produced lm loss ~12 (finite), grad norm: nan on iter 1. The bug was upstream of those candidates, which is exactly why Training on 8x H200 SXM: the operator playbook treats provenance and receipts as part of the runtime contract.
So we set up a proper bisect on a clean checkout, with PYTHONPATH precedence and an import-path pre-check against the intended runtime package. Then we tested the claimed-golden commit itself under the same environment. NaN. Iter 1: finite loss, NaN grad norm. Same on the other claimed "last known finite". Same on HEAD. Three commits, including the one that allegedly produced healthy gradients twenty-four hours earlier, all produced NaNs under the same environment. The conclusion was forced by the data: the regression was not in our source. The software environment had drifted between the golden measurement and our bisect.
The two highest-probability culprits were straightforward:
- The vendored
megatron-lmcheckout had no.gitdirectory, so the Megatron-LM version was unverifiable. - The
state-spaces-mambafork carried uncommitted patches, including an FP32 upcast ofdd_dt + self.dt_biasbefore softplus and GQA-branch changes in the backward kernels.
5. The other NaN: a mutation we made ourselves
While that NaN was being investigated, GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough produced a different and equally instructive failure: cudaErrorMisalignedAddress at mamba_mimo_fwd_kernel, immediately on iter 0, during forward.
Root cause traced in an afternoon, once we stopped trusting the env gate. A "P1: enable TMA + warp specialization in Mamba3 MIMO kernels" commit had at some point been applied to the installed mamba_ssm site-packages files, not to a copy. The local Mamba patch helper had no restore path, which is exactly the maintenance failure How we keep a patch lane is meant to stop. The gate that was supposed to keep that patch disabled correctly skipped reapplying it on the next run - but the disk state was already mutated. Every subsequent Python import picked up the patched kernels regardless of the intended off-path.
The diff was tiny:
- tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
- tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
+ tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: False,
+ tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: False,
+ tilelang.PassConfigKey.TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True,
Plus @autotune(...) enabled. On the pinned TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample revision, the TMA-lower path produces bulk-copy descriptors that assume aligned multi-byte boundaries. Combined with the tile shapes in mamba_mimo_fwd_kernel, that produces unaligned addresses on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough ā the exact alignment-bug class that the CUTLASS sm_120/sm_121 issue tracker catalogues.
The practical boundary turned out to be descriptor ownership as much as tile shape. On this box, the safe rule is to treat TMA descriptors as host-owned receipts, not something producer warps should improvise inside the kernel. The compact local bridge is TileLang TMA bulk-copy 3D shared-memory deep dive together with the checked-in TileLang TMA bulk-copy sample: once the descriptor is created on the host side and passed down as immutable state, the question returns to layout legality instead of racing descriptor construction against the copy itself.
That ownership cut also matches NVIDIA's tensor-memory API surface: tensor maps are host-created descriptor objects that the kernel receives as opaque state, not ad-hoc device-side bookkeeping. On GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough that distinction mattered because "fix the kernel" was too vague; the durable repair was to move descriptor creation back to the host and then retune the staging plan around the smaller consumer-Blackwell SMEM envelope.
The hardware-envelope side of the fix was just as important as the patch rollback. Once we treated GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough as a 99 KiB dynamic-SMEM box on a 273 GB/s memory bus, the viable kernel shapes got narrower fast: smaller chunks, fewer warps, and host-built tensor maps beat every attempt to preserve B200-sized staging on the consumer die. That is not glamorous tuning. It is the cost of admitting that the memory system is part of the contract.
The key tuning point from the validation packet is that those knobs were one envelope, not four independent tricks. Host-built tensor maps only became stable once the staging plan also shrank to GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough-sized chunks and narrower warp/stage settings; keeping a B200-sized live set while "just fixing descriptors" still pushed the lane back into SMEM pressure or alignment trouble. The most useful checked-in companions are the TileLang TMA bulk-copy sample, Mamba3 MIMO SMEM legality sample, and the deeper reader-facing walkthroughs TileLang TMA bulk-copy 3D shared-memory deep dive plus Mamba3 MIMO 3D-to-2D SMEM deep dive.
That is also why the small checked-in examples matter. TileLang TMA bulk-copy sample isolates the descriptor-ownership side of the fix, while Mamba3 MIMO shared-memory layout example isolates the 3D-to-2D layout repair. Together they explain the shape of the fix without pretending the tiny examples are full training-kernel receipts.
Two lessons. First: never patch installed site-packages in place. The linear-CE patch already does this correctly via monkey-patch at import time; the mamba3 P1 patch needs to be reworked to write to a mamba_ssm_p1/ shadow tree and patch at import. Second: env gates do not protect against irreversible disk mutations. If your "off" path leaves the system in the "on" state, your gate is a label, not a switch.
The research packet also sharpened one reader-facing distinction that is easy to miss in a training diary: single-node finite training on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough and multi-GPU FP8 stability are different proof surfaces. The GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough cut was valuable precisely because it removed the collective dimension and showed that the hybrid stack, the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough-sized kernel envelope, and the local FP8 lane could all stay finite together. It did not settle the later distributed interaction, which is why the precision-specific follow-on stays FP8 in the training stack: what shipped and what we rolled back rather than this bring-up story.
That distinction is the point: FP8 on a single GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough is a local precision-and-kernel smoke, not a replacement for the distributed FP8 stability evidence that the larger H200 lane still has to provide.
6. What we did finally validate on GB10
After the P1 disk state was reverted, we ran a 13-layer hybrid cut (1 MLA + 3 DSA + 4 MoE + 4 Mamba3/M2RNN + 1 MTP) end-to-end on a single GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough and got finite gradients across every config we could reasonably build:
- BF16 with unfused attention, several iterations: finite, healthy loss decay.
- FP8 tensorwise at small and medium MBS: finite across tens of iterations.
- Plus TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample SparseMLA BF16, Liger MTP, Liger main-head, and DSA indexer fused: finite, byte-identical iter-1 grad to the no-SparseMLA run.
- Plus the full runtime configuration with hashed n-gram features, structure features, the Mamba3 MIMO path, the eight-group setting, and recompute enabled: finite, grad norm decaying smoothly across ten iterations.
- Plus the true per-layer dims (hidden=3584, ffn=18944, 28 heads) at moderate MBS: finite.
- Plus the canonical microbatch setting with the index-cache path enabled: finite, peak memory in the high-80s of GB, validation PPL in the expected band.
The model-side meaning of that 13-layer mix is the same public contract shown in NAM56R pattern composition sample and NAM56R feature placement receipt: the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough lane matters because it exercised a real hybrid layout, not a generic dense placeholder.
That last one is as close to the canonical golden config as a single GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough can physically run. Every component that the full training runtime uses and that fits on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough produces finite gradients on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough. The NaN that haunted the multi-GPU system lives in the intersection of EP=8 collective backward, megatron-lm SHA drift, and TE FP8 tensorwise behaviour, none of which a single GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough can exercise.
What we kept and what we threw away
Kept: TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample for Mamba3 MIMO bwd_bwd on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough, the OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.GroundingAbout: NVFP4 inference on GB10 Example: driver signal versus runtime proof sample-based NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference path, RHT-disabled NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference recipe, the system ptxasQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Reference: GB10 TMA multicast probe surface for Triton, the is_big_gpu patch and capture_scalar_outputs knob, GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough as the validation surface for kernel correctness and single-node smoke tests, env-pinned and SHA-pinned dependencies for any "golden" claim, monkey-patch-at-import for kernel mutations.
Threw away: cuTile for the Mamba MIMO forward-backward-backward kernel path on this box, FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample source builds (silicon-blocked), trtllm-gen FMHA on SM12x, in-place mutation of installed site-packages, "the same recipe wins on every Blackwell" as a working assumption, and any throughput claim from a system whose Megatron-LM checkout has no .git directory.
If we had a do-over: pin megatron-lm to a SHA in a real .git checkout, snapshot the mamba_ssm fork state at every "golden" measurement, and never let a patch helper write to installed site-packages. Those three rules would have saved most of the time behind this writeup.
The toolchain half of the same story is Torch 2.13 on GB10: the serving and training stack we actually chose, and the maintenance lesson generalizes into how we keep a patch lane.
Frequently asked questions
Is GB10 a small B200?+
What do sm_100a, sm_121a, sm_120f, and the a suffix mean here?+
sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro. is the datacenter Blackwell target used by B200-class kernels. sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. is the exact GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. target the driver reports on DGX Spark. sm_120fQuick term guidesm_120fFamily-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a. is the consumer-Blackwell family compile target we often use for shipping kernels when we want family-common optimizations without pretending GB10 turned into an SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane. target. The a suffix matters because NVIDIA treats it as an architecture-specific target, not a generic family label. Use The MegaCpp model glossary for quick definitions, GB10 Stack Parity for MegaCpp for the operational consequences, and GB10 repro walkthrough for the checked-in example lane.What is tcgen05 in plain English, and why does it matter here?+
tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe. is the smallest allocation-side probe, tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes. is the load-side follow-on, and tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof. is the matrix instruction itself. Keeping those names separate is the whole point: moving one gate deeper does not turn the later probes into a clean runtime receipt. Use The MegaCpp model glossary for quick definitions, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the proof boundary, and GB10 gate matrix for the checked-in gate walk.What is OMMA in plain English here?+
tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path. / TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable. lane dropped out. In this article, OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable. means "the FP4 lane that actually executed on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.," not "every Blackwell FP4 example carries over unchanged." Use The MegaCpp model glossary for the term family and What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the proof boundary.What held up best on GB10 after the bring-up churn?+
What changed once we treated GB10 like a 99 KiB box instead of a small B200?+
What does "host-built tensor map" mean in practice?+
Why mention TRITON_CACHE_DIR in a GB10 bring-up story?+
sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. toolchain bridge is not just a compiler flag problem. If one run uses a wrapper or alias path for ptxasQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics. and the next run uses a native target, a shared Triton cache can blur the boundary between the two receipts. The safe practice is to isolate or clear TRITON_CACHE_DIR whenever that target-routing rule changes, for the same reason we keep patch lanes and golden runs tied to explicit state rather than ambient disk leftovers.Does forcing is_big_gpu = True mean GB10 behaves like a datacenter GPU?+
Where should I read next for FA4, NVFP4, or memory-fit terms?+
What is the main operational lesson beyond kernels?+
What did GB10 prove and what did it not prove?+
tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path. parity, which is why the narrower claim boundary stays in What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths and Why Driver-Visible Paths Can Look Like Hardware Support on GB10, Even When Silicon Proof Is Missing.Why does this article stop at a single-GB10 validation cut instead of settling the multi-GPU FP8 NaN story?+
Why say TMA multicast is "effectively absent" instead of simply "unsupported"?+
tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof., which avoids telling a stronger story than the evidence supports. Use The MegaCpp model glossary for quick definitions, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the narrower claim boundary, Reproducing the sm_100a to sm_121a cubin patch on GB10 for the neighboring checked-in probe order, and GB10 repro walkthrough before dropping to the checked-in probe source.What is .nv.capmerc, and why does a training story mention it?+
.nv.capmercQuick term guide.nv.capmercObserved section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops. is the observed section-name family for the later GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. metadata boundary where the public cubin-patch lane still stops, and .nv.merc.rela.*Quick term guide.nv.merc.rela.*Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary. is the companion section family that shows up alongside it in that same gate walk. They are not training knobs and not public API terms we verified in NVIDIA docs. They show up here only as reminders that some GB10 boundaries live below Python packaging and below model code. Use The MegaCpp model glossary for quick definitions, Why Driver-Visible Paths Can Look Like Hardware Support on GB10, Even When Silicon Proof Is Missing for the wording discipline, GB10 libcuda driver patch lane and why it still is not silicon proof for the history of the deeper stop, and GB10 gate matrix before the broader walkthrough.Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.
Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.
The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.
Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.
Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.
Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary.
Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
Family-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a.
Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.
Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.
The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.
The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.
Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.
The older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.
The user-space NVIDIA driver library that owns module load, metadata validation, and the helper-cubin patch lane in the GB10 experiments.
FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.
NVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.
The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.
CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.
Observed per-kernel metadata records edited in the GB10 repro lane before the later integrity-protected boundary.
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.
NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.
A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.