MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 17 min readDavid Gornshtein
GB10
Blackwell
SM121A
NVFP4
TileLang
Transformer Engine
Training

Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story

Field notes from bringing the MegaCpp SLM Ensemble up on NVIDIA GB10 and DGX Spark: silicon surprises, NaN bisects that ate days, regressions caused by our own patches, and the software-stack choices that held.

MegaCpp
Focused on applied C++ model engineering
Article Preview
Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story
Published •17 min read•David Gornshtein

The GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough / DGX Spark looks, from the marketing slide, like a small Blackwell. It is not. It is a different ISA wearing the Blackwell logo, with a desktop-class die, a roughly 273 GB/s LPDDR5X memory bus, and a software stack that assumes you are running on a B200 until you prove otherwise. This post is the unvarnished account of bringing our hybrid SSM/attention/MoE stack up on this box: the silicon traps, the NaN hunts that turned out to be our own patches, and the software-stack recipe that finally held.

Why this matters

GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough is the first Blackwell-branded box most teams will actually touch with their own hands. Treating it as a small B200 is the failure mode that costs the most time. It shares the brand and the FP4 datatype but not the tensor-memory hardware, not the SMEM budget, not FlashAttention-4, and not the bandwidth headroom that makes B200 inference look easy. We brought up GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough because it is the right surface for kernel validation, single-node smoke tests, and end-to-end architectural sanity checks under the unified-memory ceiling — and because the cheap ways it fails are the cheap ways our customers' GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough deployments will fail. This is what it took to get a clean training step.

Three first-touch terms are worth keeping separate from the start. FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample means the FlashAttention-4 backend family, not "any fast attention on Blackwell." NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference means NVIDIA's Blackwell low-precision serving format, not "all FP4 on this box." And GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough's 128 GB unified LPDDR5X memory is a system-style memory pool with a very different bandwidth story from H200 or B200 HBM, which is why the memory-lane articles stay separate. For quick definitions of those terms and the neighboring GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough labels, use The MegaCpp model glossary. For the operational follow-on, continue with GB10 Stack Parity for MegaCpp. For the checked-in evidence lane, start with GB10 repro walkthrough. Two more labels show up throughout this article. OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.GroundingAbout: NVFP4 inference on GB10 Example: driver signal versus runtime proof sample is the warp-level matrix path that still executes on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough; TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity is the datacenter-only staging surface behind tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough. Keeping those separate is the fastest way to avoid reading a B200 receipt as if it already proved something about GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough. For that narrower proof boundary, continue with What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths.

1. What GB10 actually is

The first thing to internalise is that sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is not a small sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough. The Blackwell umbrella covers two architecturally distinct chips: datacenter (sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough, B200) and consumer-class (sm_120a RTX 5090, sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough). In NVIDIA's compiler docs, that a suffix marks an architecture-specific target, which is why these names are not interchangeable aliases. NVIDIA's own forum reps put it bluntly: GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough's tensor cores are "closer to the GeForce Ampere-style MMA model". RT cores and DLSS silicon took the die budget that would have gone to TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity and tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough on the datacenter parts. Use The MegaCpp model glossary for the target-name family, GB10 Stack Parity for MegaCpp for the toolchain consequences, and GB10 repro walkthrough when you want the checked-in evidence lane.

Property B200 (sm_100a) GB10 (sm_121a)
SM count 132 48
Memory HBM3e ~8 TB/s LPDDR5X ~273 GB/s, 128 GB unified
Dynamic SMEM budget ~232 KiB ~99 KiB
tcgen05.* family / TMEM yes absent
2-SM TMA multicast yes cluster cap 1, effectively absent
Hopper wgmma.mma_async n/a (deprecated) n/a (deprecated)
FlashAttention-4 cubins yes rejected at driver
Tensor-core peak ~2,250 BF16 / ~9,000 FP4 TFLOPS ~100 BF16 / ~400 FP4 TFLOPS

The folk wisdom that "FP4 doesn't work on consumer Blackwell" is wrong but understandable. FP4 does work on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough via warp-level OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.GroundingAbout: NVFP4 inference on GB10 Example: driver signal versus runtime proof sample. What does not work is the tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough-coupled UTCOMMA path, which is what most CUTLASS NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference examples hard-code. For first touch, that means tcgen05.* is the datacenter Blackwell tensor-core and TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity family, while the surviving GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough lane is the warp-level mma.sync / OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.GroundingAbout: NVFP4 inference on GB10 Example: driver signal versus runtime proof sample path plus smaller single-CTA TMA copies. The checked-in probes keep the hard terms separate on purpose: tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.GroundingAbout: GB10 cubin patch repro Example: GB10 repro walkthrough Example: GB10 gate walkthrough is the smallest TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity-allocation probe, tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is the load-side follow-on, and tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails is the matrix instruction itself. They do not prove the same thing, and none of them upgrades GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough into a datacenter-tensor-path receipt. Use The MegaCpp model glossary for the term family, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the proof boundary, and GB10 gate matrix for the checked-in gate walk.

The same caution applies to TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary Example: GB10 repro walkthrough Reference: full GB10 tensor-path probe source. In PTXQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Reference: GB10 TMA multicast probe surface, multicast::clusterQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary Example: GB10 repro walkthrough Reference: full GB10 tensor-path probe source means one clustered cp.async.bulk.tensor transfer trying to fan a tile out to multiple CTAs. In this corpus, the checked-in bundle keeps that as one local multicast probe rather than as a synonym for the whole family. It is adjacent to the tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough family, not a hidden synonym for tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails, and not a deployment-grade GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough success receipt. Use The MegaCpp model glossary for the term family, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the claim boundary, and full GB10 tensor-path probe source for the exact checked-in probe.

That lower-level split is where the gate-metadata terms enter. Once the clustered-copy or tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough probes move past the obvious front-door checks, the checked-in GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough lane starts talking about reservedSmem, .nv.infoQuick term guide.nv.infoObserved per-kernel metadata records edited in the GB10 repro lane before the later integrity-protected boundary.GroundingAbout: GB10 driver gates warning Example: GB10 gate repro Example: GB10 gate walkthrough per-kernel records, .nv.capmercQuick term guide.nv.capmercObserved section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.GroundingAbout: GB10 driver gates warning History: libcuda patch lane Example: GB10 gate repro, and .nv.merc.rela.*Quick term guide.nv.merc.rela.*Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary.GroundingAbout: GB10 driver gates warning Example: GB10 gate walkthrough Reference: GB10 tensor-path proof summary because those are the later loader and capability-metadata boundaries the public-safe repro actually hits. They are not alternate kernel families. They are the names of the deeper stops that explain why "the cubin got farther" still does not mean "the tensor path executed cleanly." For the reader-first map, continue from The MegaCpp model glossary to Reproducing the sm_100a to sm_121a cubin patch on GB10, then use GB10 gate matrix before opening the lower-level patch helpers.

2. The toolchain dance

For first touch, CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: Torch 2.13 on GB10 stack here means the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough compiler/runtime surface under Triton, PyTorch 2.13, and source-built extensions, not a generic H200-ops or benchmark lane. Use The MegaCpp model glossary for the target-name and term-family definitions, GB10 Stack Parity for MegaCpp for the environment contract, and GB10 repro walkthrough for the checked-in rulebook.

Bringing up our hybrid pretraining recipe on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough took five separate fixes before the first iteration produced a finite loss. None of them were exotic; all of them were easy to miss if you assumed GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough behaved like a smaller B200. The environment-contract side of the same story lives in GB10 Stack Parity for MegaCpp.

ptxas

Triton ships its own ptxasQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Reference: GB10 TMA multicast probe surface (12.8), which does not know what sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is and bails with Value 'sm_121a' is not defined for option 'gpu-name'. Point Triton at the system CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: Torch 2.13 on GB10 stack 13.0+ ptxasQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Reference: GB10 TMA multicast probe surface instead:

if not os.environ.get("TRITON_PTXAS_PATH"):
    for ptxas in ["/usr/local/cuda/bin/ptxas", shutil.which("ptxas")]:
        if ptxas and os.path.exists(ptxas):
            os.environ["TRITON_PTXAS_PATH"] = ptxas
            break

If you do have to bridge an older Triton stack with an sm_121a -> sm_120a wrapper, isolate that lane's TRITON_CACHE_DIR too. TRITON_CACHE_DIR is Triton's on-disk compile cache, and reusing one shared cache between native and aliased targets turns the compile workaround into a provenance problem: the next run can silently pick up cubins built for the wrong target string. The public-safe follow-on is the same as the broader patch-lane rule in How we keep a patch lane: keep the workaround explicit, keep its cache isolated, and do not let a temporary bridge become invisible state.

is_big_gpu

PyTorch Inductor refuses max_autotune_gemm if the GPU has fewer than 68 SMs. GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough has 48. The local override was tiny:

os.environ["TORCHINDUCTOR_MAX_AUTOTUNE_GEMM"] = "1"
import torch
import torch._inductor.utils as inductor_utils

inductor_utils.is_big_gpu = lambda index=0: True

That override only reopens an Inductor heuristic gate. It does not make GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough a datacenter-class device, and it does not rescue B200-sized tiles or SMEM-heavy configs. The hardware envelope is still the 48-SM, 99 KiB lane from Section 1.

Some Triton configs still fail with shared-memory errors during autotune. That is fine; autotune handles it. The deeper point is that GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough's 99 KiB SMEM budget is lower than the SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Reference: GB10 stack parity-shaped defaults many kernels assume, so every B200-sized tile has to be re-checked on the consumer lane.

MFU

The default MFU calculation in our trainer divides by H100's BF16 peak (~989 TFLOPS), which on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough reports a depressing single-digit percent. With the correct denominator (~62 TFLOPS BF16, ~500 TFLOPS NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference), MFU comes out closer to low double digits, which is roughly what the silicon can do given the bandwidth wall.

Liger graph break

LigerFusedLinearCrossEntropyFunction calls target_mask.sum().item() internally, forcing a torch.compile graph break and tanking Liger's throughput below the unfused baseline.

3. The kernel layer: TileLang wins, cuTile is a dead end on this box

The reader-first explanation for why these GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough kernels flatten out starts with The MegaCpp model glossary for the term families, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the consumer-vs-datacenter boundary, and TileLang TMA bulk-copy sample for a compact checked-in example of the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough-side lowering constraints. The section only reads cleanly if you keep that boundary explicit: the local wins are on the OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.GroundingAbout: NVFP4 inference on GB10 Example: driver signal versus runtime proof sample and small-copy side, while the deeper gate-metadata names belong to the separate cubin-patch and driver-proof articles rather than to the training hot path itself.

The ensemble's hot path is the Mamba-3 MIMO backward-of-backward (bwd_bwd) kernel. We tried three independent paths to beat the TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample baseline on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough. All three lost. The same consumer-vs-datacenter split kept reappearing: kernels that win on B200 are often paying for on-chip state that GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough simply does not have.

cuTile Python rewrite

This was the most thorough attempt. Five algorithmic variants - fused monolithic, nested @ct.function per phase, 3-kernel split, hoisted loop invariants, full ct.static_iter unroll - all regressed against the 2-kernel A/B split baseline. The full unroll was several times slower. The 3-kernel split that won by a third on B200 (TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity, 228 KiB SMEM) regressed by a few percent on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough. The launch-overhead versus live-set trade-off flips the moment you change SMEM budget. The lesson is simple: never assume a cuTile variant that wins on one GPU will win on another. Re-sweep on the target hardware.

CuTe DSL hot-path port

This was the most fun and the most humbling. Here, CuTe DSL means the CUTLASS-style Python surface where layout, staging, and instruction choice are still explicit rather than hidden behind a high-level GEMM call. We got cute.nvgpu.warp.MmaF16BF16Op + TMA + persistent scheduler running on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough out of the box via the GeForce-Blackwell dense-GEMM pattern: pass "sm_120" as the SmemAllocator capacity key, and do not use CUTE_DSL_ARCH=sm_120a overrides because the cubin loader rejects them. The hand-written batched GEMM at L=256 ran in roughly 10 microseconds. torch.bmm on the same shape ran in roughly the same time. cuBLAS on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough already matches a hand-written CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingAbout: CuTe DSL experiments Example: tcgen05 gate matrix sample Example: TileLang TMA bulk-copy sample kernel at small BF16 shapes. The TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample advantage is not GEMM efficiency; it is that TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample fuses about ten GEMMs plus on the order of 150 elementwise ops plus rotary plus reductions into one CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: Torch 2.13 on GB10 stack kernel with 16 CTAs each running 16 chunks in on-chip state. cuTile Python structurally cannot do that because it has to split into at least two kernels with gmem temporaries. The roughly 4x gap is the kernel-structure tax, not the instruction tax.

Triton M2RNN autotune sweep

This was the most anticlimactic lane. The sweep moved search around the edges, but it never closed the structural gap to the fused TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample path on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough. That is why the practical verdict here stays boring: TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample remained the hot-path choice, and the other lanes stayed receipts, not replacements.

4. NaN, NaN, NaN: a bisect that wasn't

The hardest two days of this project had nothing to do with kernels and everything to do with grad norm: nan. The symptom was simple: the canonical multi-GPU H200 hybrid training was producing finite gradients on day one ("the golden run") and grad norm: nan on every iteration two days later.

The obvious suspect was a recent commit that rewrote the MTP and main-head Liger CE patches from reduction="none" to reduction="mean" plus broadcast, explicitly to fix a silent grad corruption from an upstream Liger issue. Clean hypothesis: the fix broke training. Empirical reality: not the fix. We ran five mutations at HEAD - drop MTP, prefer native Hopper CE, fall back to vanilla CE on logits, revert the Mamba3 regional-compile change, and drop selective recompute. All five produced lm loss ~12 (finite), grad norm: nan on iter 1. The bug was upstream of those candidates, which is exactly why Training on 8x H200 SXM: the operator playbook treats provenance and receipts as part of the runtime contract.

So we set up a proper bisect on a clean checkout, with PYTHONPATH precedence and an import-path pre-check against the intended runtime package. Then we tested the claimed-golden commit itself under the same environment. NaN. Iter 1: finite loss, NaN grad norm. Same on the other claimed "last known finite". Same on HEAD. Three commits, including the one that allegedly produced healthy gradients twenty-four hours earlier, all produced NaNs under the same environment. The conclusion was forced by the data: the regression was not in our source. The software environment had drifted between the golden measurement and our bisect.

The two highest-probability culprits were straightforward:

  1. The vendored megatron-lm checkout had no .git directory, so the Megatron-LM version was unverifiable.
  2. The state-spaces-mamba fork carried uncommitted patches, including an FP32 upcast of dd_dt + self.dt_bias before softplus and GQA-branch changes in the backward kernels.

5. The other NaN: a mutation we made ourselves

While that NaN was being investigated, GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough produced a different and equally instructive failure: cudaErrorMisalignedAddress at mamba_mimo_fwd_kernel, immediately on iter 0, during forward.

Root cause traced in an afternoon, once we stopped trusting the env gate. A "P1: enable TMA + warp specialization in Mamba3 MIMO kernels" commit had at some point been applied to the installed mamba_ssm site-packages files, not to a copy. The local Mamba patch helper had no restore path, which is exactly the maintenance failure How we keep a patch lane is meant to stop. The gate that was supposed to keep that patch disabled correctly skipped reapplying it on the next run - but the disk state was already mutated. Every subsequent Python import picked up the patched kernels regardless of the intended off-path.

The diff was tiny:

-        tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
-        tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
+        tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: False,
+        tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: False,
+        tilelang.PassConfigKey.TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True,

Plus @autotune(...) enabled. On the pinned TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample revision, the TMA-lower path produces bulk-copy descriptors that assume aligned multi-byte boundaries. Combined with the tile shapes in mamba_mimo_fwd_kernel, that produces unaligned addresses on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough — the exact alignment-bug class that the CUTLASS sm_120/sm_121 issue tracker catalogues.

The practical boundary turned out to be descriptor ownership as much as tile shape. On this box, the safe rule is to treat TMA descriptors as host-owned receipts, not something producer warps should improvise inside the kernel. The compact local bridge is TileLang TMA bulk-copy 3D shared-memory deep dive together with the checked-in TileLang TMA bulk-copy sample: once the descriptor is created on the host side and passed down as immutable state, the question returns to layout legality instead of racing descriptor construction against the copy itself.

That ownership cut also matches NVIDIA's tensor-memory API surface: tensor maps are host-created descriptor objects that the kernel receives as opaque state, not ad-hoc device-side bookkeeping. On GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough that distinction mattered because "fix the kernel" was too vague; the durable repair was to move descriptor creation back to the host and then retune the staging plan around the smaller consumer-Blackwell SMEM envelope.

The hardware-envelope side of the fix was just as important as the patch rollback. Once we treated GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough as a 99 KiB dynamic-SMEM box on a 273 GB/s memory bus, the viable kernel shapes got narrower fast: smaller chunks, fewer warps, and host-built tensor maps beat every attempt to preserve B200-sized staging on the consumer die. That is not glamorous tuning. It is the cost of admitting that the memory system is part of the contract.

The key tuning point from the validation packet is that those knobs were one envelope, not four independent tricks. Host-built tensor maps only became stable once the staging plan also shrank to GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough-sized chunks and narrower warp/stage settings; keeping a B200-sized live set while "just fixing descriptors" still pushed the lane back into SMEM pressure or alignment trouble. The most useful checked-in companions are the TileLang TMA bulk-copy sample, Mamba3 MIMO SMEM legality sample, and the deeper reader-facing walkthroughs TileLang TMA bulk-copy 3D shared-memory deep dive plus Mamba3 MIMO 3D-to-2D SMEM deep dive.

That is also why the small checked-in examples matter. TileLang TMA bulk-copy sample isolates the descriptor-ownership side of the fix, while Mamba3 MIMO shared-memory layout example isolates the 3D-to-2D layout repair. Together they explain the shape of the fix without pretending the tiny examples are full training-kernel receipts.

Two lessons. First: never patch installed site-packages in place. The linear-CE patch already does this correctly via monkey-patch at import time; the mamba3 P1 patch needs to be reworked to write to a mamba_ssm_p1/ shadow tree and patch at import. Second: env gates do not protect against irreversible disk mutations. If your "off" path leaves the system in the "on" state, your gate is a label, not a switch.

The research packet also sharpened one reader-facing distinction that is easy to miss in a training diary: single-node finite training on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough and multi-GPU FP8 stability are different proof surfaces. The GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough cut was valuable precisely because it removed the collective dimension and showed that the hybrid stack, the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough-sized kernel envelope, and the local FP8 lane could all stay finite together. It did not settle the later distributed interaction, which is why the precision-specific follow-on stays FP8 in the training stack: what shipped and what we rolled back rather than this bring-up story.

That distinction is the point: FP8 on a single GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough is a local precision-and-kernel smoke, not a replacement for the distributed FP8 stability evidence that the larger H200 lane still has to provide.

6. What we did finally validate on GB10

After the P1 disk state was reverted, we ran a 13-layer hybrid cut (1 MLA + 3 DSA + 4 MoE + 4 Mamba3/M2RNN + 1 MTP) end-to-end on a single GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough and got finite gradients across every config we could reasonably build:

  • BF16 with unfused attention, several iterations: finite, healthy loss decay.
  • FP8 tensorwise at small and medium MBS: finite across tens of iterations.
  • Plus TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample SparseMLA BF16, Liger MTP, Liger main-head, and DSA indexer fused: finite, byte-identical iter-1 grad to the no-SparseMLA run.
  • Plus the full runtime configuration with hashed n-gram features, structure features, the Mamba3 MIMO path, the eight-group setting, and recompute enabled: finite, grad norm decaying smoothly across ten iterations.
  • Plus the true per-layer dims (hidden=3584, ffn=18944, 28 heads) at moderate MBS: finite.
  • Plus the canonical microbatch setting with the index-cache path enabled: finite, peak memory in the high-80s of GB, validation PPL in the expected band.

The model-side meaning of that 13-layer mix is the same public contract shown in NAM56R pattern composition sample and NAM56R feature placement receipt: the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough lane matters because it exercised a real hybrid layout, not a generic dense placeholder.

That last one is as close to the canonical golden config as a single GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough can physically run. Every component that the full training runtime uses and that fits on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough produces finite gradients on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough. The NaN that haunted the multi-GPU system lives in the intersection of EP=8 collective backward, megatron-lm SHA drift, and TE FP8 tensorwise behaviour, none of which a single GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough can exercise.

What we kept and what we threw away

Kept: TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample for Mamba3 MIMO bwd_bwd on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough, the OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.GroundingAbout: NVFP4 inference on GB10 Example: driver signal versus runtime proof sample-based NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference path, RHT-disabled NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference recipe, the system ptxasQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Reference: GB10 TMA multicast probe surface for Triton, the is_big_gpu patch and capture_scalar_outputs knob, GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary Example: GB10 repro walkthrough as the validation surface for kernel correctness and single-node smoke tests, env-pinned and SHA-pinned dependencies for any "golden" claim, monkey-patch-at-import for kernel mutations.

Threw away: cuTile for the Mamba MIMO forward-backward-backward kernel path on this box, FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample source builds (silicon-blocked), trtllm-gen FMHA on SM12x, in-place mutation of installed site-packages, "the same recipe wins on every Blackwell" as a working assumption, and any throughput claim from a system whose Megatron-LM checkout has no .git directory.

If we had a do-over: pin megatron-lm to a SHA in a real .git checkout, snapshot the mamba_ssm fork state at every "golden" measurement, and never let a patch helper write to installed site-packages. Those three rules would have saved most of the time behind this writeup.

The toolchain half of the same story is Torch 2.13 on GB10: the serving and training stack we actually chose, and the maintenance lesson generalizes into how we keep a patch lane.

FAQ

Frequently asked questions

Is GB10 a small B200?+
No. This article exists largely to show why that assumption wastes time. The ISA, memory system, and tensor-path story are different enough that reuse has to be proven, not assumed.
What do sm_100a, sm_121a, sm_120f, and the a suffix mean here?+
sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro. is the datacenter Blackwell target used by B200-class kernels. sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. is the exact GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. target the driver reports on DGX Spark. sm_120fQuick term guidesm_120fFamily-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a. is the consumer-Blackwell family compile target we often use for shipping kernels when we want family-common optimizations without pretending GB10 turned into an SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane. target. The a suffix matters because NVIDIA treats it as an architecture-specific target, not a generic family label. Use The MegaCpp model glossary for quick definitions, GB10 Stack Parity for MegaCpp for the operational consequences, and GB10 repro walkthrough for the checked-in example lane.
What is tcgen05 in plain English, and why does it matter here?+
It is the datacenter-Blackwell tensor-core and TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable. instruction family. It matters because many Blackwell writeups accidentally assume that every Blackwell-branded box has that path. GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. does not. In the checked-in GB10 lane, tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe. is the smallest allocation-side probe, tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes. is the load-side follow-on, and tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof. is the matrix instruction itself. Keeping those names separate is the whole point: moving one gate deeper does not turn the later probes into a clean runtime receipt. Use The MegaCpp model glossary for quick definitions, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the proof boundary, and GB10 gate matrix for the checked-in gate walk.
What is OMMA in plain English here?+
It is the warp-level matrix path that stayed viable on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. after the datacenter-only tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path. / TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable. lane dropped out. In this article, OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable. means "the FP4 lane that actually executed on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.," not "every Blackwell FP4 example carries over unchanged." Use The MegaCpp model glossary for the term family and What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the proof boundary.
What held up best on GB10 after the bring-up churn?+
TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments. for the Mamba hot path, the OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.-based FP4 route, and a tightly pinned toolchain. The broader stack story is in Torch 2.13 on GB10: the serving and training stack we actually chose.
What changed once we treated GB10 like a 99 KiB box instead of a small B200?+
Kernel tuning got more literal: smaller chunks, narrower warp counts, and host-built TMA descriptors stopped more failures than trying to preserve SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.-sized tiles. The main shift was from "port the B200 shape" to "rebuild the staging plan for the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. memory envelope." For the concrete legality and descriptor side, continue with TileLang TMA bulk-copy 3D shared-memory deep dive and Mamba3 MIMO 3D-to-2D SMEM deep dive.
What does "host-built tensor map" mean in practice?+
It means the tensor or TMA descriptor is created on the CPU and handed to the kernel as opaque state, instead of being assembled on-device by producer warps. On GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. that ownership split mattered because descriptor legality and kernel tuning were separate problems: TileLang TMA bulk-copy sample is the small descriptor-side receipt, and Mamba3 MIMO shared-memory layout example is the small layout-side companion.
Why mention TRITON_CACHE_DIR in a GB10 bring-up story?+
Because an sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. toolchain bridge is not just a compiler flag problem. If one run uses a wrapper or alias path for ptxasQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics. and the next run uses a native target, a shared Triton cache can blur the boundary between the two receipts. The safe practice is to isolate or clear TRITON_CACHE_DIR whenever that target-routing rule changes, for the same reason we keep patch lanes and golden runs tied to explicit state rather than ambient disk leftovers.
Does forcing is_big_gpu = True mean GB10 behaves like a datacenter GPU?+
No. It only bypasses PyTorch Inductor's SM-count heuristic so autotune can consider Triton GEMM templates on a 48-SM device. The actual winning GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. configs still had to respect the smaller SMEM and bandwidth envelope, which is why the keepers in this article are smaller tiles, narrower warp counts, and GB10-specific MFU denominators rather than "B200 settings on a small box."
Where should I read next for FA4, NVFP4, or memory-fit terms?+
Use The FA4 Catalog on Blackwell for FlashAttention-4 eligibility and execute-proof boundaries, NVFP4 Inference for the MegaCpp SLM Ensemble for the narrower FP4 serving lane, and CPU Offload and Startup Memory Calibration on H200 and GB10 when the question turns into startup headroom rather than tensor-path legality.
What is the main operational lesson beyond kernels?+
Never mutate installed packages in place, and keep every golden claim tied to a pinned environment and a restorable state. In practice that means a real version-controlled checkout, a recorded toolchain, and a shadow-copy patch path instead of in-place edits. That is the same maintenance rule behind how we keep a patch lane.
What did GB10 prove and what did it not prove?+
It proved a finite hybrid single-node training cut, a working OMMAQuick term guideOMMAThe older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.-based FP4 lane, and a software stack that can be made repeatable. It did not prove datacenter-style tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path. parity, which is why the narrower claim boundary stays in What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths and Why Driver-Visible Paths Can Look Like Hardware Support on GB10, Even When Silicon Proof Is Missing.
Why does this article stop at a single-GB10 validation cut instead of settling the multi-GPU FP8 NaN story?+
Because those are different proof surfaces. The single-GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. cut tells us the hybrid stack, the GB10-sized kernel envelope, and the local FP8 lane can stay finite when the collective dimension is removed. It does not settle the later EP or collective-scaling interaction that only shows up once the run becomes truly distributed. For that boundary, continue with FP8 and FP8 rollout and Torch 2.13 on GB10: the serving and training stack we actually chose.
Why say TMA multicast is "effectively absent" instead of simply "unsupported"?+
Because the PTXQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics. docs describe clustered multicast as a real instruction form, but the public GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. lane never turns that into a deployment-grade multi-CTA receipt. In the checked-in examples, the local multicast probe stays separate from tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof., which avoids telling a stronger story than the evidence supports. Use The MegaCpp model glossary for quick definitions, What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths for the narrower claim boundary, Reproducing the sm_100a to sm_121a cubin patch on GB10 for the neighboring checked-in probe order, and GB10 repro walkthrough before dropping to the checked-in probe source.
What is .nv.capmerc, and why does a training story mention it?+
Because the bring-up story only makes sense if you keep the lower proof surfaces in view. .nv.capmercQuick term guide.nv.capmercObserved section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops. is the observed section-name family for the later GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. metadata boundary where the public cubin-patch lane still stops, and .nv.merc.rela.*Quick term guide.nv.merc.rela.*Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary. is the companion section family that shows up alongside it in that same gate walk. They are not training knobs and not public API terms we verified in NVIDIA docs. They show up here only as reminders that some GB10 boundaries live below Python packaging and below model code. Use The MegaCpp model glossary for quick definitions, Why Driver-Visible Paths Can Look Like Hardware Support on GB10, Even When Silicon Proof Is Missing for the wording discipline, GB10 libcuda driver patch lane and why it still is not silicon proof for the history of the deeper stop, and GB10 gate matrix before the broader walkthrough.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

tcgen05

The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.

TMEM

Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.

sm_100

Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.

.nv.merc.rela.*

Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

sm_120f

Family-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a.

tcgen05.alloc

Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.

tcgen05.ld

Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.

tcgen05.mma

The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.

TMA multicast

The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.

.nv.capmerc

Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.

OMMA

The older operand-mma path that still matters on consumer Blackwell when tcgen05 and TMEM-coupled paths are unavailable.

libcuda

The user-space NVIDIA driver library that owns module load, metadata validation, and the helper-cubin patch lane in the GB10 experiments.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

PTX

NVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.

CuTe DSL

The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

.nv.info

Observed per-kernel metadata records edited in the GB10 repro lane before the later integrity-protected boundary.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

WGMMA

Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.

NVFP4

NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

Topic hubs