What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths
Our GB10 tests show that some Blackwell datacenter-targeted SASS can be accepted and executed on consumer silicon, but they do not prove that the Blackwell Tensor Core Generation 5 matrix-instruction path (tcgen05.mma) physically executes on GB10. Older, stronger claims overstate what the evidence supports.

The practical question behind these GB10 experiments was simple: when a Blackwell-datacenter cubin fails on GB10, is that because the silicon physically lacks the path, because the driver blocks it, or because we are mixing evidence from several different layers and telling ourselves a cleaner story than the data supports?
The answer is narrower than some early drafts made it sound.
Read this page in three layers. First, the narrow positive result is that GB10 can accept and execute at least some sm_100a-targeted SASS after a small architecture-field rewrite in the cubin. Second, tcgen05 and TMA-adjacent probes hit later driver-side gates that the baseline arithmetic kernel never touches. Third, none of that is the same as proving that GB10 physically executes tcgen05.mma the way B200 or GB100 does.
The target names matter before the deeper terms do. In NVIDIA's compiler vocabulary, sm_100 is the baseline Blackwell target name, sm_100a is the architecture-specific datacenter target used by B200-class cubins, sm_120f is the family-specific consumer-Blackwell target, and sm_121a is the exact GB10 device target the driver reports on DGX Spark. These labels share the Blackwell brand, but they do not promise the same kernel contract. For the term map, use MegaCpp model glossary.
For the staged GB10 explanation, continue with Reproducing the sm_100a to sm_121a cubin patch on GB10 and the checked-in GB10 repro walkthrough.
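As a reading aid for the target names above, here is a minimal Python sketch that splits a label into its numeric part and its suffix. The helper names (`Target`, `parse_target`, `describe`, `same_contract`) are hypothetical and exist only for this illustration. The suffix meanings are taken from the paragraph above, and the "only identical labels share a contract" rule is a deliberately conservative assumption rather than an NVIDIA compatibility statement.

```python
import re
from typing import NamedTuple


class Target(NamedTuple):
    number: int  # numeric part of the label, e.g. 100, 120, 121
    suffix: str  # "" (baseline), "a" (architecture-specific), "f" (family-specific)


# Suffix meanings as described in this article.
_SUFFIX_MEANING = {
    "": "baseline",
    "a": "architecture-specific",
    "f": "family-specific",
}


def parse_target(label: str) -> Target:
    """Split a label such as 'sm_121a' into (121, 'a')."""
    m = re.fullmatch(r"sm_(\d+)([af]?)", label)
    if m is None:
        raise ValueError(f"not a recognised target label: {label!r}")
    return Target(int(m.group(1)), m.group(2))


def describe(label: str) -> str:
    t = parse_target(label)
    return f"{label}: {_SUFFIX_MEANING[t.suffix]} target, number {t.number}"


def same_contract(a: str, b: str) -> bool:
    """Conservative assumption: only identical labels count as the same kernel contract."""
    return parse_target(a) == parse_target(b)
```

Under that assumption, `same_contract("sm_100a", "sm_121a")` is false even though both labels carry the Blackwell brand, which is the point of the paragraph above.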
For first touch, tcgen05 means the SM100/datacenter Tensor Core Generation 5 family, and TMEM is the tensor-memory space those instructions are written around. That is why a receipt saying only "Blackwell" is not enough: the real question is whether the receipt is about a consumer warp-level path or about the datacenter TMEM-coupled path.
The public ladder here is older warp-level mma.sync, then Hopper wgmma, then Blackwell SM100 tcgen05.
One reader-first decode order helps. Start with the target label (sm_100, sm_100a, sm_120f, sm_121a). Then ask whether the probe is about the datacenter TMEM-coupled family (tcgen05.alloc, tcgen05.ld, tcgen05.mma) or about the adjacent clustered-copy question (TMA multicast). Only after that should you drop to metadata names like reservedSmem, .nv.info per-kernel records, and .nv.capmerc. That is also the order used by GB10 stack parity for MegaCpp, Training the MegaCpp SLM Ensemble on GB10, and the checked-in GB10 repro walkthrough.
For engineering purposes, that means the safe rule is still the conservative one: treat tcgen05 on GB10 as unavailable unless you have a real end-to-end execution result for that exact instruction family. Driver-visible hints are not enough.
The narrow positive result
The strongest positive result in the GB10 test lane is also the cleanest one. A trivial arithmetic cubin compiled for sm_100a loaded and ran on GB10 after only the low architecture bits in ELF e_flags were rewritten from sm_100a to sm_121a. The smallest checked-in restatement of that exact fact is baseline arch-patch proof sample.
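To make the shape of that rewrite concrete, here is a minimal Python sketch of editing only masked bits in the `e_flags` field of a little-endian ELF64 header. It is not the checked-in patcher. The function names are hypothetical, and the sketch takes `arch_mask` and `new_arch` as parameters on purpose: this article says only that "the low architecture bits" change, so the exact bit width and the numeric values for each target are assumptions a reader would have to confirm against a real cubin. The one fixed fact used is the standard ELF64 layout, where `e_flags` is a 4-byte field at byte offset 48.

```python
import struct

# ELF64 header layout: e_ident(16) + e_type(2) + e_machine(2) + e_version(4)
# + e_entry(8) + e_phoff(8) + e_shoff(8) = 48, then the 4-byte e_flags field.
E_FLAGS_OFFSET_ELF64 = 48


def read_e_flags(image: bytes) -> int:
    """Return e_flags from a little-endian ELF64 image."""
    if image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if image[4] != 2 or image[5] != 1:  # EI_CLASS == ELFCLASS64, EI_DATA == ELFDATA2LSB
        raise ValueError("this sketch handles little-endian ELF64 only")
    return struct.unpack_from("<I", image, E_FLAGS_OFFSET_ELF64)[0]


def patch_arch_bits(image: bytes, arch_mask: int, new_arch: int) -> bytes:
    """Return a copy of `image` with only the bits under `arch_mask` replaced.

    `arch_mask` and `new_arch` are caller-supplied assumptions; the real
    bit layout is not specified in this article.
    """
    if new_arch & ~arch_mask:
        raise ValueError("new_arch has bits outside arch_mask")
    old = read_e_flags(image)
    new = (old & ~arch_mask & 0xFFFFFFFF) | new_arch
    out = bytearray(image)
    struct.pack_into("<I", out, E_FLAGS_OFFSET_ELF64, new)
    return bytes(out)
```

Every byte outside the 4-byte field is left untouched, which matches the claim above that nothing else in the cubin was changed for the baseline arithmetic kernel.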
The important part is not the patch itself. The important part is what happened after the patch:
[cuModuleLoadDataEx] CUDA_SUCCESS
[cuLaunchKernel] CUDA_SUCCESS
[cuCtxSynchronize] CUDA_SUCCESS
# out[0..7]: 00000001 00000003 00000005 00000007 00000009 0000000b 0000000d 0000000f
That is enough to establish three things.
First, the user-space driver really does contain software gating at the architecture-identification layer. Second, GB10's instruction path will accept at least some SASS originally emitted for the datacenter Blackwell line. Third, the absence of an immediate Xid or hard decoder fault means "consumer Blackwell" and "datacenter Blackwell" are not separated by one single binary hardware switch at the very first instruction boundary.
That is a meaningful result. It is just not the same as "GB10 has working tcgen05.mma." The same distinction is why NVFP4 inference, the FA4 catalog on Blackwell, and the driver signal versus runtime proof sample all keep driver-visible hints separate from execution-grade proof.
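The difference between a driver-visible hint and an execution-grade receipt can be checked mechanically. The Python sketch below parses a log in the same textual shape as the receipt quoted in this section and accepts it only if load, launch, and synchronize all report `CUDA_SUCCESS` and output words were actually printed. The function names and the exact log format are assumptions based on the excerpt above, not a documented tool interface.

```python
import re

# The receipt quoted in this section, reproduced verbatim.
RECEIPT = """\
[cuModuleLoadDataEx] CUDA_SUCCESS
[cuLaunchKernel] CUDA_SUCCESS
[cuCtxSynchronize] CUDA_SUCCESS
# out[0..7]: 00000001 00000003 00000005 00000007 00000009 0000000b 0000000d 0000000f
"""

REQUIRED_CALLS = ("cuModuleLoadDataEx", "cuLaunchKernel", "cuCtxSynchronize")


def parse_receipt(text):
    """Return ({call_name: status}, [output words as ints])."""
    calls = dict(re.findall(r"^\[(\w+)\][ \t]+(\w+)$", text, flags=re.MULTILINE))
    m = re.search(r"^# out\[\d+\.\.\d+\]:[ \t]*(.+)$", text, flags=re.MULTILINE)
    values = [int(tok, 16) for tok in m.group(1).split()] if m else []
    return calls, values


def is_execution_receipt(text):
    """True only if every required call succeeded and output words are present."""
    calls, values = parse_receipt(text)
    all_ok = all(calls.get(name) == "CUDA_SUCCESS" for name in REQUIRED_CALLS)
    return all_ok and bool(values)
```

A log that stops at `[cuModuleLoadDataEx] CUDA_ERROR_INVALID_IMAGE` fails this check because the launch and synchronize lines never appear. A successful load on its own would fail it for the same reason.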
The four software gates we found
Once we moved from a baseline arithmetic kernel to tcgen05 / TMA-oriented probes, the driver path turned out to be layered.
Our GB10 smtest lane consistently exposed four gates before the cubin reached a usable execution state:
- ELF e_flags architecture validation
- loader-side reserved shared-memory symbol bookkeeping such as .nv.reservedSmem.offset0 and .nv.reservedSmem.cap
- mutable per-kernel capability metadata in .nv.info records
- the observed deeper integrity-protected capability block surfaced as .nv.capmerc.text records plus .nv.merc.rela.*
The important engineering lesson is not just that there are four gates. It is that each gate answers a different question.
Gate 1 is the simple identity check: does the cubin even claim the right target? Gate 2 is loader-side reserved shared-memory bookkeeping: the driver is still rejecting missing reservedSmem symbols before anything like normal execution begins. Gate 3 is mutable per-kernel capability metadata in .nv.info records. Gate 4 is the observed deeper integrity-protected capability block surfaced as .nv.capmerc.text records plus .nv.merc.rela.*. That is the point where naive cubin surgery stops being enough.
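One way to keep the four gates apart when inspecting a cubin is to map the section and symbol names you see onto gate numbers. The Python sketch below does that with simple prefix matching. The table and function names are hypothetical, the prefixes are the literal names observed in this article, and the caller is assumed to have listed the names with some ELF tool of their choice. Finding a name says only that the gate's surface is present. It does not tell you whether the driver would accept it.

```python
# (gate number, short description, observed name prefixes)
GATE_SURFACES = [
    (1, "ELF e_flags architecture validation", ()),
    (2, "reserved shared-memory symbol bookkeeping", (".nv.reservedSmem.",)),
    (3, "per-kernel capability metadata", (".nv.info",)),
    (4, "integrity-protected capability block", (".nv.capmerc", ".nv.merc.rela.")),
]


def surfaces_present(names):
    """Return the sorted gate numbers whose surfaces appear in `names`.

    Gate 1 is always included because every cubin carries an ELF header.
    """
    found = {1}
    for number, _description, prefixes in GATE_SURFACES:
        if any(name.startswith(prefix) for name in names for prefix in prefixes):
            found.add(number)
    return sorted(found)


def deepest_gate(names):
    """Highest-numbered gate whose surface is present."""
    return surfaces_present(names)[-1]
```

For a baseline arithmetic kernel whose name list contains none of the gate 2 or gate 4 prefixes, `deepest_gate` stays at 1 or 3. That fits the earlier observation that the baseline kernel never touches the later gates.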
For the staged explanation, continue directly with Reproducing the sm_100a to sm_121a cubin patch on GB10, the checked-in GB10 repro walkthrough, and the compact GB10 gate matrix. The reserved shared-memory symbol patcher only proves that the weak undefined reservedSmem layer is mutable enough to move the failure. The kernel capability-record patcher does the same for selected .nv.info per-kernel records. Neither script turns gate 4 into runtime proof.
That wording is intentionally conservative. Public NVIDIA docs explain tcgen05, TMEM, and clustered TMA instructions, but this article grounds .nv.capmerc and .nv.merc.rela.* only against the checked-in gate walk and the literal section names visible in the public bundle. We do not present .nv.capmerc as an official NVIDIA API surface; the claim here is only that this observed signed capability boundary is where the public-safe repro still fails.
External public evidence reinforces that restraint rather than weakening it. We did not find a public, reproducible GB10 receipt that carries a tcgen05 / TMEM probe through the later .nv.capmerc-named boundary into a stable tcgen05.mma execution result. Negative evidence is not proof of impossibility by itself, but it removes the usual shortcut where someone else's public repro would justify stronger language than our checked-in bundle supports.
This matters because public discussions about GB10 support often collapse all of these layers into one sentence like "the driver blocks it" or "the hardware lacks it." That hides the real structure of the problem. Some things are byte-patchable. Some things are not. And the fact that you can move through the earlier gates does not tell you what would happen if the exact tcgen05 kernel reached a fully valid submission state.
What the tcgen05 probes actually hit
The conservative GB10 evidence stops at gate 4.
The checked-in probes keep the hard terms separate on purpose because they are not one bucket. tcgen05.alloc asks whether the TMEM allocation side can even advance past the early metadata gates. tcgen05.ld asks about the load side of that same family. tcgen05.mma asks about the matrix instruction itself. The clustered TMA multicast probe is adjacent rather than identical: it is one local copy-path probe inside the broader TMA multicast family, not the name of the whole family and not a successful tcgen05.mma receipt in disguise. None of those surfaces produced a clean public-safe GB10 execution receipt. The isolated tcgen05.alloc lane, for example, can be moved through the earlier gates and still ends in CUDA_ERROR_INVALID_IMAGE once the observed .nv.capmerc-named integrity layer becomes decisive.
For non-kernel readers, the split is simpler than the mnemonics make it sound. tcgen05.alloc is the smallest "can the TMEM-side setup even begin?" question. tcgen05.ld is the "can data be pulled through that setup?" follow-on. tcgen05.mma is the actual matrix-operation question. TMA multicast is the neighboring clustered-copy question rather than the matrix path itself. The checked-in bundle separates them because moving one question forward does not automatically answer the others.
The important split is that the minimal TMEM allocation probe source isolates tcgen05.alloc, while the full GB10 tensor-path probe source keeps tcgen05.ld, tcgen05.mma, and the clustered TMA multicast probe separate in the broader bundle. For the staged command flow, continue with the checked-in GB10 repro walkthrough and GB10 gate matrix.
That is the key distinction that older drafts blurred.
There is a big difference between:
- proving the driver can be pushed farther than its default routing policy,
- proving a helper or wrapper path exists inside libcuda,
- proving a kernel can be submitted and hang,
- and proving that tcgen05.mma is a stable, physically present, usable execution path on GB10 silicon.
Only the last one would justify a public claim of B200-style tcgen05 availability. We do not have that proof.
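One way to keep those four kinds of evidence from blurring together is to name them explicitly. This is a minimal sketch, not part of the checked-in bundle: the `Evidence` enum and `justifies_availability_claim` function are hypothetical names, and the members simply mirror the four bullets above.

```python
from enum import Enum


class Evidence(Enum):
    """The four kinds of evidence listed above (hypothetical labels)."""
    DRIVER_PUSHED_PAST_ROUTING = "driver pushed farther than its default routing policy"
    HELPER_PATH_IN_LIBCUDA = "helper or wrapper path exists inside libcuda"
    KERNEL_SUBMITTED_AND_HUNG = "kernel can be submitted and hang"
    STABLE_EXECUTION_ON_SILICON = "stable, physically present, usable execution path"


def justifies_availability_claim(observed):
    """Only stable execution on silicon supports a public availability claim."""
    return Evidence.STABLE_EXECUTION_ON_SILICON in observed


if __name__ == "__main__":
    # What the experiments described on this page have produced so far.
    so_far = {
        Evidence.DRIVER_PUSHED_PAST_ROUTING,
        Evidence.HELPER_PATH_IN_LIBCUDA,
        Evidence.KERNEL_SUBMITTED_AND_HUNG,
    }
    print(justifies_availability_claim(so_far))
```

The check deliberately ignores how many weaker signals have accumulated: three suggestive observations do not add up to the fourth.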
Why older stronger claims are stale
Some of our earlier exploratory notes leaned too hard on the most exciting interpretation of the data. That is normal in a live reverse-engineering session and unacceptable in a publication.
The strongest overreach looked like this:
- a helper or routing patch can be made to reach deeper driver paths,
- therefore the silicon probably carries the full capability,
- therefore GB10 effectively "has" the datacenter path if we patch enough bytes.
That leap is too large.
A more honest reading is:
- the software stack clearly contains layered product gating,
- at least some datacenter-targeted SASS decodes and executes on GB10,
- the driver ships enough capability-related machinery to make the path look tantalizingly close,
- but the publication-grade tcgen05 proof is still missing.
That is why we are treating the earlier stronger wording as stale. It was useful as a research hypothesis. It is not the standard we want attached to a public article or a customer-facing example repo.
Consumer Blackwell vs datacenter Blackwell
What should an engineer conclude from this if they just want to ship kernels?
The practical conclusion is not mysterious:
- GB10 is not a small B200.
- Driver-visible datacenter artifacts do not make it one.
- If a path depends on TMEM, tcgen05, or other datacenter-only assumptions, you should treat GB10 as a separate target with its own kernel contract.
That conclusion matches the rest of the GB10 bring-up story across this site. In inference, we already treat GB10's OMMA-based FP4 lane as real while keeping TMEM-coupled paths off. In the FA4 catalog, we already gate GB10 separately from the sm_100a line.
The tensor-path experiments fit the same pattern: shared branding, partial decode overlap, different operational contract.
The split is wider than one instruction mnemonic. The same research lane that keeps tcgen05 in the unsupported bucket also keeps clustered multi-CTA copy assumptions and CTA-pair / 2-SM tensor cooperation on the datacenter side of the fence. In reader terms: if a kernel design depends on clustered delivery or two-SM tensor cooperation as part of its normal contract, treat that as another sign you are outside the safe GB10 subset rather than one patch away from a supported path.
That is also why public B200 or GB100 capability tables are only partial context here. They can be perfectly accurate descriptions of the datacenter contract and still tell you nothing about whether GB10 inherits TMEM, clustered TMA delivery, or 2-SM tensor cooperation. For GB10, the trustworthy rule is still receipt first: use the datacenter docs to decode the vocabulary, then look for a GB10 execution-grade result before treating any of those paths as available.
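In runtime policy, the receipt-first rule can be expressed as a dispatch on the device's compute capability instead of on shared branding. The sketch below is an assumption-laden illustration: `TensorContract`, `select_contract`, and the two constants are hypothetical names, the datacenter flags restate the datacenter contract as this page describes it, and the `(major, minor)` tuple is assumed to come from the runtime's device query (10.0 for a B200-class part, 12.1 for GB10).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorContract:
    """Which datacenter-style features a kernel is allowed to assume."""
    name: str
    tmem: bool
    tcgen05: bool
    cluster_multicast: bool
    two_sm_cooperation: bool


# Datacenter contract as described on this page.
DATACENTER = TensorContract("sm_100-class datacenter", True, True, True, True)

# False here means "no GB10 execution-grade receipt yet",
# not a claim about what the silicon physically contains.
GB10_CONSUMER = TensorContract("GB10 consumer", False, False, False, False)


def select_contract(compute_capability):
    """Map a (major, minor) compute capability to a kernel contract."""
    if compute_capability == (10, 0):
        return DATACENTER
    if compute_capability == (12, 1):
        return GB10_CONSUMER
    # Unknown devices get no contract instead of an optimistic guess.
    raise ValueError(f"no receipt-backed contract for {compute_capability}")


if __name__ == "__main__":
    contract = select_contract((12, 1))
    print(contract.name, "tcgen05 allowed:", contract.tcgen05)
```

Raising on unknown capabilities is a deliberate choice in this sketch: a device with no receipts falls through to an error, not to the nearest-looking contract.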
Another way to keep the boundary straight is TMEM versus the narrower consumer low-precision lane. On B200-class SM100, NVIDIA's public docs place tcgen05 on the TMEM-coupled datacenter path.
On the GB10 side of this site, the low-precision receipts we do trust live in a different contract: the consumer sm_120f / sm_121a lane sometimes described in neighboring posts with local shorthand like OMMA, plus the NVFP4 and FA4 examples that stay inside that consumer staging model. Those receipts matter, but they answer a different question than tcgen05, which is why NVFP4 Inference for the MegaCpp SLM Ensemble, The FA4 Catalog on Blackwell, and GB10 stack parity for MegaCpp can all be true without upgrading this article into TMEM parity.
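The target names used here encode more than a device label: the digits carry the compute capability, an `a` suffix marks an architecture-specific target, and an `f` suffix marks a family-specific one. A small parser makes that explicit. This is a sketch under stated assumptions: `Target` and `parse_target` are hypothetical helper names, and the pattern only covers the `sm_<digits>[a|f]` forms that appear on this page.

```python
import re
from typing import NamedTuple

# Last digit is the minor version; the digits before it are the major version.
_TARGET_RE = re.compile(r"^sm_(\d+)(\d)([af]?)$")
_SCOPES = {"": "baseline", "a": "architecture-specific", "f": "family-specific"}


class Target(NamedTuple):
    major: int
    minor: int
    scope: str


def parse_target(name):
    """Split a name such as 'sm_121a' into compute capability and scope."""
    match = _TARGET_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized target name: {name!r}")
    major, minor, suffix = match.groups()
    return Target(int(major), int(minor), _SCOPES[suffix])


if __name__ == "__main__":
    for name in ("sm_100", "sm_100a", "sm_120f", "sm_121a"):
        print(name, parse_target(name))
```

Seen this way, the arch-patch result crosses from a 10.0 architecture-specific target to a 12.1 architecture-specific one at the cubin-identity layer, which is a different statement from the two targets sharing a feature set.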
The main correction here is about proof discipline. A capability table, helper cubin, or partially patched submission path is not a shipping contract.
The conservative public rule
The right public rule is stricter than the most optimistic early research note.
If you are writing documentation, examples, or runtime policy for GB10:
- you may say that some sm_100a SASS executes on GB10 after an arch-field rewrite;
- you may say that multiple driver-side gates exist before tcgen05 probes can run;
- you may say that driver-visible capability machinery can make unsupported paths look deceptively close;
- you should not say that GB10 has proven working tcgen05.mma parity with B200 or GB100.
That is also why the new public examples for this topic focus on the baseline arch-patch probe, the gate matrix, and the difference between a software-visible signal and runtime proof. Those are the parts we can defend cleanly.
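To make the "software-visible signal" side concrete, the sketch below reads the two kinds of cubin metadata that the gate matrix talks about: the raw `e_flags` word in the ELF header and the section names associated with the later gates. It is a minimal illustration, not the checked-in tooling. It assumes the cubin is a little-endian ELF64 image, `inspect_cubin` is a hypothetical helper name, and it makes no claim about which `e_flags` bits hold the architecture field.

```python
import struct

# ELF64 little-endian layouts from the ELF specification.
ELF_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
SECTION_HEADER = struct.Struct("<IIQQQQIIQQ")

# Section-name prefixes this page associates with gates 3 and 4.
GATE_SECTION_PREFIXES = (".nv.info", ".nv.capmerc", ".nv.merc.rela.")


def inspect_cubin(image):
    """Return the raw e_flags and gate-related section names of an ELF64 image."""
    header = ELF_HEADER.unpack_from(image, 0)
    ident = header[0]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    e_shoff, e_flags = header[6], header[7]
    e_shentsize, e_shnum, e_shstrndx = header[11], header[12], header[13]
    if e_shnum == 0:
        # Extended section numbering is out of scope for this sketch.
        return {"e_flags": e_flags, "gate_sections": []}

    sections = [
        SECTION_HEADER.unpack_from(image, e_shoff + index * e_shentsize)
        for index in range(e_shnum)
    ]
    # The section-name string table is itself a section;
    # fields 4 and 5 are its file offset and size.
    strtab_offset, strtab_size = sections[e_shstrndx][4], sections[e_shstrndx][5]
    strtab = image[strtab_offset:strtab_offset + strtab_size]

    names = []
    for section in sections:
        start = section[0]  # sh_name: byte offset into the string table
        end = strtab.index(b"\x00", start)
        names.append(strtab[start:end].decode("ascii"))

    gate_sections = [n for n in names if n.startswith(GATE_SECTION_PREFIXES)]
    return {"e_flags": e_flags, "gate_sections": gate_sections}
```

Everything this function returns is metadata about the image. Finding or rewriting any of it shows what the loader will look at, not what the silicon executes, which is exactly the distinction the public examples are meant to preserve.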
What we are publishing instead
For this topic we are publishing three things and drawing one line.
The three things:
- a compact baseline probe showing what the positive sm_100a result really means;
- a compact example showing why driver-visible support is not runtime proof;
- a near-copy gate-matrix example showing exactly where the tcgen05 path still stops.
The line:
We are not publishing "6-byte patch unlocks tcgen05 on GB10" as a settled statement.
That line is worth keeping. It saves future engineers from inheriting an evidence problem disguised as a success story.
This narrower claim lines up with Training the MegaCpp SLM Ensemble on GB10, NVFP4 Inference for the MegaCpp SLM Ensemble, and The FA4 Catalog on Blackwell: consumer Blackwell is a real target, but only on the exact kernel and precision paths that have execution-grade evidence rather than suggestive driver metadata.
Frequently asked questions
What is the strongest positive claim we can make?
Some sm_100a SASS executes on GB10 after an architecture-field rewrite. That still falls well short of proving working tcgen05.mma.
What do sm_100, sm_100a, and sm_121a mean in these receipts?
sm_100 is the baseline Blackwell target name in NVIDIA's compiler vocabulary. sm_100a is the architecture-specific datacenter Blackwell/B200 target the original cubins were built for, and sm_121a is the architecture-specific consumer Blackwell GB10 target that the driver reports on DGX Spark. NVIDIA's compiler docs also use sm_120f for the family-specific consumer-Blackwell target. The a suffix matters because NVIDIA treats it as architecture-specific, while f means family-specific. The arch-patch result is interesting because it crosses that target boundary at the cubin-identity layer, not because it erases the hardware differences. The shortest checked-in handoff in this repo is the MegaCpp GB10 example index, Reproducing the sm_100a to sm_121a cubin patch on GB10, and GB10 stack parity for MegaCpp; the deeper implementation matrix remains the GB10 repro bundle overview and GB10 repro walkthrough.
Why do some GB10 notes mention sm_120f when this article talks about sm_121a?
sm_121a is the device target the driver reports on GB10. sm_120f is the family target NVIDIA exposes for consumer-Blackwell kernels that want family-common features without binary-locking to one exact device target. That compile-target nuance matters for production GB10 kernels and performance tuning, but it does not turn the sm_100a cubin-patch result into tcgen05 proof. The reader-facing handoff on this site is GB10 stack parity for MegaCpp, Training the MegaCpp SLM Ensemble on GB10, and NVFP4 Inference for the MegaCpp SLM Ensemble. The checked-in walkthrough that sits behind those posts is GB10 repro walkthrough, and the broader raw example map comes last in the MegaCpp GB10 example index.
What is tcgen05 in plain English?
tcgen05 means Tensor Core Generation 5 instructions on the SM100/datacenter side of Blackwell. PTX is the lower-level parallel-thread execution vocabulary NVIDIA documents for these instruction names, not a proof that a consumer device can execute every named path. The checked-in public probes split that family on purpose: tcgen05.alloc is the Tensor Memory allocation instruction, tcgen05.ld is the Tensor Memory load instruction, and tcgen05.mma is the matrix instruction itself. If you want the cleanest explanation first, read Reproducing the sm_100a to sm_121a cubin patch on GB10 and Why driver-visible paths can look like hardware support on GB10. The staged checked-in bundle docs come next in GB10 repro walkthrough and GB10 gate matrix. The raw compact mirror is the compact gate-walk mirror.
What is TMEM in plain English?
TMEM is the Tensor Memory used by tcgen05 paths. In this GB10 lane it is mostly a warning label: when a probe depends on TMEM semantics, you should read it as an SM100-side contract first, then ask whether GB10 has a clean execution receipt for that exact contract. So far, the public-safe answer is no. The shortest contrast set is Reproducing the sm_100a to sm_121a cubin patch on GB10, GB10 stack parity for MegaCpp, and NVFP4 Inference for the MegaCpp SLM Ensemble. The deeper checked-in bundle overview is GB10 repro bundle overview.
What is OMMA in plain English here?
OMMA is local shorthand for the narrower warp-level consumer-Blackwell low-precision lane that still has real GB10 execution receipts. That is a different claim from SM100-side tcgen05 plus TMEM. The reader-safe contrast is simple: OMMA belongs to the GB10 posts that explain the surviving consumer path, while tcgen05 belongs to the datacenter probe family this article keeps in the unsupported-until-proven bucket. The shortest follow-ons are NVFP4 Inference for the MegaCpp SLM Ensemble, Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story, and The FA4 Catalog on Blackwell: Variants, sm Guards, and Runtime Selection.
If GB10 has a real NVFP4 or OMMA-style low-precision lane elsewhere on this site, why does that still not prove tcgen05 here?
That lane runs under the consumer sm_120f / sm_121a contract, with its own tiling, staging, and backend limits. The tcgen05.alloc, tcgen05.ld, and tcgen05.mma probes in this article are explicitly testing the stronger SM100 plus TMEM contract instead. In other words, a real GB10 NVFP4 or local-OMMA receipt is evidence for the consumer lane, not inheritance of the datacenter lane. The shortest reader-first contrast is NVFP4 Inference for the MegaCpp SLM Ensemble, The FA4 Catalog on Blackwell, GB10 stack parity for MegaCpp, and the checked-in GB10 public claims guardrail.
What is .nv.capmerc in plain English?
.nv.capmerc is the literal section-name family we can see on the later capability boundary in the checked-in cubins. Reader-first meaning: it is the deeper metadata layer where the public-safe GB10 lane still stops after the easier byte patches move the earlier failures. It is not presented here as a public NVIDIA programming interface, and it is not a synonym for successful tcgen05 execution. Start with Why driver-visible paths can look like hardware support on GB10, then GB10 gate matrix, then the checked-in compact gate-walk mirror.
What do the four gates mean in plain English?
Gate 1 is the architecture field in the ELF header's e_flags. Gate 2 is loader-side reserved shared-memory bookkeeping through weak undefined reservedSmem symbols. Gate 3 is mutable .nv.info per-kernel metadata records. Gate 4 is the observed deeper integrity-protected capability block surfaced as the .nv.capmerc text metadata section plus .nv.merc.rela.*. If you want the staged explanation, read Reproducing the sm_100a to sm_121a cubin patch on GB10, then GB10 repro walkthrough, then GB10 gate matrix.
What do the reserved shared-memory symbol patcher and kernel capability-record patcher actually prove?
They prove that the earlier failures can be moved, not that GB10 executes tcgen05. Start with Reproducing the sm_100a to sm_121a cubin patch on GB10 and Why driver-visible paths can look like hardware support on GB10 for the reader-facing explanation, then GB10 gate matrix for the compact staged view. The lower-level patch helpers come last: the reserved shared-memory symbol patcher moves the failure past weak undefined reservedSmem names, and the kernel capability-record patcher trims selected .nv.info per-kernel records so the image reaches the deeper capability block. Neither script converts the lane into a publication-grade execution receipt.
Why does the public-safe lane still end at CUDA_ERROR_INVALID_IMAGE after the earlier patchers?
tcgen05.alloc lane starts at CUDA_ERROR_NOT_FOUND, moves to CUDA_ERROR_INVALID_IMAGE after the weak reservedSmem symbols are patched, and still ends at CUDA_ERROR_INVALID_IMAGE after selected .nv.info records are trimmed. The useful result is that the failure surface became more precise, not that the kernel reached runtime proof. The shortest staged receipts are GB10 repro walkthrough and GB10 gate matrix.
Why is gate 4 treated differently from the earlier gates?
.nv.capmerc and related records behave like integrity-protected capability metadata, not just mutable loader-side bookkeeping. The reader-facing explanation is Why driver-visible paths can look like hardware support on GB10 together with Reproducing the sm_100a to sm_121a cubin patch on GB10. The compact bundle summary comes next in GB10 gate matrix. In this corpus we treat .nv.capmerc only as the observed section-name family visible in the checked-in cubins, not as a public NVIDIA programming surface; this pass did not verify any NVIDIA primary doc that promotes it beyond that.
Does this article claim that .nv.capmerc is a documented cryptographic lock?
.nv.capmerc and .nv.merc.rela.* are the literal later section-name families we can observe in the checked-in bundle, and "integrity-protected" is the cautious reader-facing description of the boundary they appear to represent in practice. We are not claiming that NVIDIA publishes .nv.capmerc as a public DRM or signature API. The safer reading is narrower: public docs put TMEM and tcgen05 on the SM100/datacenter side, while our GB10 repro still stops at the later capability boundary. The reader-facing handoff is Why driver-visible paths can look like hardware support on GB10, Inside the GB10 driver patch lane, and the checked-in GB10 gate matrix.
What is TMA multicast in plain English?
cp.async.bulk.tensor...multicast::cluster: one TMA transfer tries to deliver a tile to multiple CTAs in a cluster instead of one CTA. In our checked-in probes, the clustered TMA multicast probe is one local copy-path example inside that broader TMA multicast family, not a synonym for the whole family and not shorthand for tcgen05 success. The cleanest first explanation here is Reproducing the sm_100a to sm_121a cubin patch on GB10, which keeps multicast in the same cautionary bucket as the other advanced probes. The checked-in bundle docs come next in GB10 repro walkthrough and the GB10 repro bundle overview. The adjacent copy-path example is the TileLang TMA bulk-copy checked-in example. The public-safe result is still negative: this probe is not a published GB10 success receipt.
Does this article say anything about 2-SM MMA or CTA-pair tensor paths?
tcgen05.mma: they stay on the datacenter side of the contract until there is a clean GB10 execution receipt for that exact path. The research packet groups those wider cooperative tensor paths with the same unsupported cluster of assumptions as TMEM-coupled tcgen05, so they belong in the same "unsupported until proven" bucket rather than the "one more patch should unlock it" bucket. The safer adjacent follow-ons are Training the MegaCpp SLM Ensemble on GB10, NVFP4 Inference for the MegaCpp SLM Ensemble, and The FA4 Catalog on Blackwell.
Do public B200 or GB100 capability tables prove that GB10 should have the same tensor path?
tcgen05, clustered TMA, or CTA-pair tensor cooperation, but we still require a GB10 execution-grade receipt before treating any of those paths as available on GB10. For the GB10 paths with cleaner evidence, read NVFP4 Inference for the MegaCpp SLM Ensemble, The FA4 Catalog on Blackwell, and Training the MegaCpp SLM Ensemble on GB10.
What came before tcgen05, and what should I look at on GB10 instead?
mma.sync; Hopper adds warpgroup wgmma; Blackwell SM100 adds tcgen05 on the TMEM-coupled datacenter side. On GB10, the safer adjacent question is not "where is the tcgen05 success example?" but "which consumer-Blackwell path has an execution-grade receipt?" Start with NVFP4 Inference for the MegaCpp SLM Ensemble, Training the MegaCpp SLM Ensemble on GB10, and The FA4 Catalog on Blackwell. The supporting checked-in bundle note for compile-target nuance is GB10 repro walkthrough. The adjacent copy-path example is the TileLang TMA bulk-copy checked-in example.
Where can I see tcgen05.alloc, tcgen05.ld, tcgen05.mma, and TMA multicast separately?
tcgen05.alloc, while the full GB10 tensor-path probe source contains separate alloc, ld, mma, and clustered TMA multicast probes. The compact checked-in mirror is the compact gate-walk mirror.
What should I open first if all of these names are new?
Which related posts cover the GB10 paths we do trust?
libcuda lane.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
tcgen05.mma: The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.
tcgen05: The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.
sm_100a: Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.
e_flags: The ELF header field whose low architecture bits are rewritten from sm_100a to sm_121a in the public GB10 arch-patch lane.
sm_100: Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.
reservedSmem: Observed weak undefined reserved shared-memory symbols that mark the earlier loader-side bookkeeping gate in the GB10 lane.
.nv.info: Observed per-kernel metadata records edited in the GB10 repro lane before the later integrity-protected boundary.
.nv.merc.rela.*: Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary.
GB10: Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
sm_121a: Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.
Family-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a.
tcgen05.alloc: Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.
tcgen05.ld: Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.
mma.sync: The older warp-level matrix-multiply path that predates Hopper WGMMA and the SM100 tcgen05 family.
TMA multicast: The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.
.nv.capmerc: Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.
WGMMA: Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.
TMEM: Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.
PTX: NVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.
CUDA: NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
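The e_flags field and the observed section-name families defined above can be inspected directly in a cubin without any patching. The following is a minimal Python sketch, assuming the cubin is a little-endian ELF64 image. The function name read_elf64_summary and the OBSERVED_PREFIXES tuple are hypothetical names introduced for illustration. The sketch reports the raw e_flags value and section names only; it does not decode which e_flags bits carry the architecture, and it is not one of the checked-in patch helpers.

```python
import struct

# Section-name prefixes that this article reports as observed gate markers.
# They are labels copied from the text, not a documented NVIDIA interface.
OBSERVED_PREFIXES = (".nv.info", ".nv.capmerc", ".nv.merc.rela")


def read_elf64_summary(blob: bytes) -> dict:
    """Return the raw e_flags value and section names of an ELF64 image."""
    if blob[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if blob[4] != 2 or blob[5] != 1:
        raise ValueError("this sketch handles only little-endian ELF64")

    # Standard ELF64 header offsets: e_shoff at byte 40, e_flags at byte 48,
    # e_shentsize / e_shnum / e_shstrndx at bytes 58, 60, and 62.
    (e_shoff,) = struct.unpack_from("<Q", blob, 40)
    (e_flags,) = struct.unpack_from("<I", blob, 48)
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", blob, 58)

    def section_header(index: int):
        # Each ELF64 section header stores sh_name at +0 and sh_offset at +24.
        base = e_shoff + index * e_shentsize
        (sh_name,) = struct.unpack_from("<I", blob, base)
        (sh_offset,) = struct.unpack_from("<Q", blob, base + 24)
        return sh_name, sh_offset

    names = []
    if e_shnum:
        # The section-name string table is itself a section.
        _, strtab_off = section_header(e_shstrndx)
        for i in range(e_shnum):
            sh_name, _ = section_header(i)
            start = strtab_off + sh_name
            end = blob.index(b"\x00", start)  # names are NUL-terminated
            names.append(blob[start:end].decode("ascii", "replace"))

    observed = [n for n in names if n.startswith(OBSERVED_PREFIXES)]
    return {"e_flags": e_flags, "sections": names, "observed": observed}
```

To use it, read a cubin into memory with open(path, "rb").read() and pass the bytes to read_elf64_summary. Comparing the returned e_flags and observed lists before and after a patch step shows which header field or section family that step touched. It says nothing about whether the driver will accept the image.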