MegaCpp Engineering: Applied C++ model systems

Article

Grounded engineering note from the MegaCpp stack

Published April 20, 2026 • 11 min read • David Gornshtein

GB10

Blackwell

CUDA

Tensor Core

tcgen05

Driver Research

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

Q: What is the strongest positive claim we can make?

That some sm_100a SASS executes on GB10 after an architecture-field rewrite. That still falls well short of proving working tcgen05.mma.

Q: What do sm_100, sm_100a, and sm_121a mean in these receipts?

sm_100 is the baseline Blackwell target name in NVIDIA's compiler vocabulary. sm_100a is the architecture-specific datacenter Blackwell/B200 target the original cubins were built for, and sm_121a is the architecture-specific consumer Blackwell GB10 target that the driver reports on DGX Spark. NVIDIA's compiler docs also use sm_120f for the family-specific consumer-Blackwell target. The a suffix matters because NVIDIA treats it as architecture-specific, while f means family-specific. The arch-patch result is interesting because it crosses that target boundary at the cubin-identity layer, not because it erases the hardware differences. The shortest checked-in handoff in this repo is the MegaCpp GB10 example index, Reproducing the sm_100a to sm_121a cubin patch on GB10, and GB10 stack parity for MegaCpp; the deeper implementation matrix remains the GB10 repro bundle overview and GB10 repro walkthrough.

Q: Why do some GB10 notes mention sm_120f when this article talks about sm_121a?

sm_121a is the device target the driver reports on GB10. sm_120f is the family target NVIDIA exposes for consumer-Blackwell kernels that want family-common features without binary-locking to one exact device target. That compile-target nuance matters for production GB10 kernels and performance tuning, but it does not turn the sm_100a cubin-patch result into tcgen05 proof. The reader-facing handoff on this site is GB10 stack parity for MegaCpp, Training the MegaCpp SLM Ensemble on GB10, and NVFP4 Inference for the MegaCpp SLM Ensemble. The checked-in walkthrough that sits behind those posts is GB10 repro walkthrough, and the broader raw example map comes last in the MegaCpp GB10 example index.

Q: What is tcgen05 in plain English?

In NVIDIA's PTX naming, tcgen05 means Tensor Core Generation 5 instructions on the SM100/datacenter side of Blackwell. PTX is the lower-level parallel-thread execution vocabulary NVIDIA documents for these instruction names, not a proof that a consumer device can execute every named path. The checked-in public probes split that family on purpose: tcgen05.alloc is the Tensor Memory allocation instruction, tcgen05.ld is the Tensor Memory load instruction, and tcgen05.mma is the matrix instruction itself. If you want the cleanest explanation first, read Reproducing the sm_100a to sm_121a cubin patch on GB10 and Why driver-visible paths can look like hardware support on GB10. The staged checked-in bundle docs come next in GB10 repro walkthrough and GB10 gate matrix. The raw compact mirror is the compact gate-walk mirror.

Q: What is TMEM in plain English?

TMEM is Tensor Memory, the dedicated memory space used by SM100 tcgen05 paths. In this GB10 lane it is mostly a warning label: when a probe depends on TMEM semantics, you should read it as an SM100-side contract first, then ask whether GB10 has a clean execution receipt for that exact contract. So far, the public-safe answer is no. The shortest contrast set is Reproducing the sm_100a to sm_121a cubin patch on GB10, GB10 stack parity for MegaCpp, and NVFP4 Inference for the MegaCpp SLM Ensemble. The deeper checked-in bundle overview is GB10 repro bundle overview.

Q: What is OMMA in plain English here?

In this article family, OMMA is local shorthand for the narrower warp-level consumer-Blackwell low-precision lane that still has real GB10 execution receipts. That is a different claim from SM100-side tcgen05 plus TMEM. The reader-safe contrast is simple: OMMA belongs to the GB10 posts that explain the surviving consumer path, while tcgen05 belongs to the datacenter probe family this article keeps in the unsupported-until-proven bucket. The shortest follow-ons are NVFP4 Inference for the MegaCpp SLM Ensemble, Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story, and The FA4 Catalog on Blackwell: Variants, sm Guards, and Runtime Selection.

Q: What is .nv.capmerc in plain English?

In this article set, .nv.capmerc is the literal section-name family we can see on the later capability boundary in the checked-in cubins. Reader-first meaning: it is the deeper metadata layer where the public-safe GB10 lane still stops after the easier byte patches move the earlier failures. It is not presented here as a public NVIDIA programming interface, and it is not a synonym for successful tcgen05 execution. Start with Why driver-visible paths can look like hardware support on GB10, then GB10 gate matrix, then the checked-in compact gate-walk mirror.

Q: What do the four gates mean in plain English?

Gate 1 is the cubin identity check in ELF e_flags. Gate 2 is loader-side reserved shared-memory bookkeeping through weak undefined reservedSmem symbols. Gate 3 is mutable .nv.info per-kernel metadata records. Gate 4 is the observed deeper integrity-protected capability block surfaced as the .nv.capmerc text metadata section plus .nv.merc.rela.*. If you want the staged explanation, read Reproducing the sm_100a to sm_121a cubin patch on GB10, then GB10 repro walkthrough, then GB10 gate matrix.

Our GB10 tests show that some Blackwell datacenter-targeted SASS can be accepted and executed on consumer silicon, but they do not prove that the Blackwell Tensor Core Generation 5 matrix-instruction path (tcgen05.mma) physically executes on GB10. Older stronger claims overstate what the evidence supports.

The practical question behind these GB10 experiments was simple: when a Blackwell-datacenter cubin fails on GB10, is that because the silicon physically lacks the path, because the driver blocks it, or because we are mixing evidence from several different layers and telling ourselves a cleaner story than the data supports?

The answer is narrower than some early drafts made it sound.

Read this page in three layers. First, the narrow positive result is that GB10 can accept and execute at least some sm_100a-targeted SASS after a small architecture-field rewrite in the cubin. Second, tcgen05 and TMA-adjacent probes hit later driver-side gates that the baseline arithmetic kernel never touches. Third, none of that is the same as proving that GB10 physically executes tcgen05.mma the way B200 or GB100 does.

The target names matter before the deeper terms do. In NVIDIA's compiler vocabulary, sm_100 is the baseline Blackwell target name, sm_100a is the architecture-specific datacenter target used by B200-class cubins, sm_120f is the family-specific consumer-Blackwell target, and sm_121a is the exact GB10 device target the driver reports on DGX Spark. These labels share the Blackwell brand, but they do not promise the same kernel contract. For the term map, use the MegaCpp model glossary. For the staged GB10 explanation, continue with Reproducing the sm_100a to sm_121a cubin patch on GB10 and the checked-in GB10 repro walkthrough.
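The naming convention above can be decoded mechanically. The following is a minimal sketch, assuming the usual `sm_<major><minor>[a|f]` spelling in which the last digit is the minor version; `SmTarget` and `parse_sm_target` are hypothetical helper names for illustration, not part of any NVIDIA tool or of the checked-in bundle.

```python
import re
from typing import NamedTuple


class SmTarget(NamedTuple):
    major: int
    minor: int
    scope: str  # "baseline", "arch-specific", or "family-specific"


# Suffix meaning as described in this article: "a" is architecture-specific,
# "f" is family-specific, and no suffix is the baseline target name.
_SCOPES = {"": "baseline", "a": "arch-specific", "f": "family-specific"}


def parse_sm_target(name: str) -> SmTarget:
    """Split a target label such as 'sm_121a' into version and scope."""
    m = re.fullmatch(r"sm_(\d+)(\d)([af]?)", name)
    if m is None:
        raise ValueError(f"not an sm_* target label: {name!r}")
    return SmTarget(int(m.group(1)), int(m.group(2)), _SCOPES[m.group(3)])


for label in ("sm_100", "sm_100a", "sm_120f", "sm_121a"):
    print(label, parse_sm_target(label))
```

Two labels that parse to different tuples should be read as different kernel contracts, even when both carry the Blackwell brand.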

For first touch, tcgen05 means the SM100/datacenter Tensor Core Generation 5 family, and TMEM is the tensor-memory space those instructions are written around. That is why a receipt saying only "Blackwell" is not enough: the real question is whether the receipt is about a consumer warp-level path or about the datacenter TMEM-coupled path. The public ladder here is older warp-level mma.sync, then Hopper wgmma, then Blackwell SM100 tcgen05.

One reader-first decode order helps. Start with the target label (sm_100, sm_100a, sm_120f, sm_121a). Then ask whether the probe is about the datacenter TMEM-coupled family (tcgen05.alloc, tcgen05.ld, tcgen05.mma) or about the adjacent clustered-copy question (TMA multicast). Only after that should you drop to metadata names like reservedSmem, .nv.info per-kernel records, and .nv.capmerc. That is also the order used by GB10 stack parity for MegaCpp, Training the MegaCpp SLM Ensemble on GB10, and the checked-in GB10 repro walkthrough.

For engineering purposes, that means the safe rule is still the conservative one: treat tcgen05 on GB10 as unavailable unless you have a real end-to-end execution result for that exact instruction family. Driver-visible hints are not enough.

The narrow positive result

The strongest positive result in the GB10 test lane is also the cleanest one. A trivial arithmetic cubin compiled for sm_100a loaded and ran on GB10 after only the low architecture bits in ELF e_flags were rewritten from sm_100a to sm_121a. The smallest checked-in restatement of that exact fact is the baseline arch-patch proof sample.
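As a rough illustration of what "rewriting only the low architecture bits" involves, here is a minimal sketch, assuming the cubin is a little-endian ELF64 image, where the standard header places the 32-bit e_flags field at byte offset 48. The function name, the mask, and the example values are hypothetical placeholders: this sketch does not state NVIDIA's actual e_flags encoding, and it is not the checked-in patcher.

```python
import struct

E_FLAGS_OFFSET = 48  # e_flags position in a standard ELF64 file header


def rewrite_arch_bits(cubin: bytes, arch_mask: int, new_arch: int) -> bytes:
    """Return a copy of `cubin` with only the masked e_flags bits replaced.

    `arch_mask` and `new_arch` are caller-supplied assumptions; every
    other bit of e_flags and every other byte of the file is preserved.
    """
    if len(cubin) < E_FLAGS_OFFSET + 4:
        raise ValueError("image too short to hold an ELF64 header")
    if cubin[:4] != b"\x7fELF" or cubin[4] != 2 or cubin[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    if new_arch & ~arch_mask:
        raise ValueError("new_arch has bits outside arch_mask")
    (flags,) = struct.unpack_from("<I", cubin, E_FLAGS_OFFSET)
    patched = (flags & ~arch_mask & 0xFFFFFFFF) | new_arch
    out = bytearray(cubin)
    struct.pack_into("<I", out, E_FLAGS_OFFSET, patched)
    return bytes(out)
```

The design point is the narrowness of the edit: one masked field changes and nothing else, which is why the result says something about the identity check and nothing yet about later gates.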

The important part is not the patch itself. The important part is what happened after the patch:

[cuModuleLoadDataEx]    CUDA_SUCCESS
[cuLaunchKernel]        CUDA_SUCCESS
[cuCtxSynchronize]      CUDA_SUCCESS
# out[0..7]: 00000001 00000003 00000005 00000007 00000009 0000000b 0000000d 0000000f

That is enough to establish three things.

First, the user-space driver really does contain software gating at the architecture-identification layer. Second, GB10's instruction path will accept at least some SASS originally emitted for the datacenter Blackwell line. Third, the absence of an immediate Xid or hard decoder fault means "consumer Blackwell" and "datacenter Blackwell" are not separated by one single binary hardware switch at the very first instruction boundary.
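One way to keep "it loaded" separate from "it ran and produced the right words" is to check a receipt mechanically. The sketch below assumes the log layout shown above; `is_execution_receipt` is a hypothetical helper, and the expected pattern `2*i + 1` is read off the printed output rather than taken from the kernel source.

```python
import re

# The three driver calls that appear in the receipt above.
REQUIRED_CALLS = ("cuModuleLoadDataEx", "cuLaunchKernel", "cuCtxSynchronize")


def parse_receipt(text):
    """Extract per-call status strings and the hex output words."""
    statuses = dict(re.findall(r"^\[(\w+)\][ \t]+(\w+)[ \t]*$", text, flags=re.M))
    m = re.search(r"^# out\[\d+\.\.\d+\]:[ \t]*(.+)$", text, flags=re.M)
    words = [int(w, 16) for w in m.group(1).split()] if m else []
    return statuses, words


def is_execution_receipt(text, expected_words):
    """True only if every call succeeded AND the output matches exactly."""
    statuses, words = parse_receipt(text)
    calls_ok = all(statuses.get(c) == "CUDA_SUCCESS" for c in REQUIRED_CALLS)
    return calls_ok and words == list(expected_words)


receipt = """\
[cuModuleLoadDataEx]    CUDA_SUCCESS
[cuLaunchKernel]        CUDA_SUCCESS
[cuCtxSynchronize]      CUDA_SUCCESS
# out[0..7]: 00000001 00000003 00000005 00000007 00000009 0000000b 0000000d 0000000f
"""
print(is_execution_receipt(receipt, [2 * i + 1 for i in range(8)]))
```

A log that shows a successful module load but no verified output would fail this check, which is the same line the article draws between driver-visible hints and execution-grade proof.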

That is a meaningful result. It is just not the same as "GB10 has working tcgen05.mma." The same distinction is why NVFP4 inference, the FA4 catalog on Blackwell, and the driver signal versus runtime proof sample all keep driver-visible hints separate from execution-grade proof.

The four software gates we found

Once we moved from a baseline arithmetic kernel to tcgen05 / TMA-oriented probes, the driver path turned out to be layered.

Our GB10 smtest lane consistently exposed four gates before the cubin reached a usable execution state:

ELF e_flags architecture validation
loader-side reserved shared-memory symbol bookkeeping such as .nv.reservedSmem.offset0 and .nv.reservedSmem.cap
mutable per-kernel capability metadata in .nv.info records
the observed deeper integrity-protected capability block surfaced as .nv.capmerc.text records plus .nv.merc.rela.*

The important engineering lesson is not just that there are four gates. It is that each gate answers a different question.

Gate 1 is the simple identity check: does the cubin even claim the right target? Gate 2 is loader-side reserved shared-memory bookkeeping: the driver is still rejecting missing reservedSmem symbols before anything like normal execution begins. Gate 3 is mutable per-kernel capability metadata in .nv.info records. Gate 4 is the observed deeper integrity-protected capability block surfaced as .nv.capmerc.text records plus .nv.merc.rela.*. That is the point where naive cubin surgery stops being enough.
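Because each gate answers a different question, it helps to record where a probe stopped and derive the permitted claim from that, instead of writing the claim by hand. The following is a minimal sketch of that bookkeeping; the `Gate` enum and `claim_scope` function are hypothetical names, and the wording of the returned strings is ours, not driver output.

```python
from enum import IntEnum
from typing import Optional


class Gate(IntEnum):
    E_FLAGS = 1        # ELF e_flags architecture validation
    RESERVED_SMEM = 2  # weak undefined reservedSmem symbol bookkeeping
    NV_INFO = 3        # mutable per-kernel .nv.info records
    CAPMERC = 4        # .nv.capmerc.text records plus .nv.merc.rela.*


def claim_scope(stopped_at: Optional[Gate], output_verified: bool) -> str:
    """Map a probe outcome to the strongest claim it can support."""
    if stopped_at is not None:
        return (
            f"no execution claim: load stops at gate "
            f"{int(stopped_at)} ({stopped_at.name})"
        )
    if output_verified:
        return "execution receipt for this exact kernel only"
    return "loaded, but not an execution receipt"


# The baseline arithmetic cubin after the e_flags rewrite:
print(claim_scope(None, output_verified=True))
# A probe that reaches the integrity-protected boundary:
print(claim_scope(Gate.CAPMERC, output_verified=False))
```

Note that passing gates 1 through 3 never upgrades the claim on its own; only a verified run does.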

For the staged explanation, continue directly with Reproducing the sm_100a to sm_121a cubin patch on GB10, the checked-in GB10 repro walkthrough, and the compact GB10 gate matrix. The reserved shared-memory symbol patcher only proves that the weak undefined reservedSmem layer is mutable enough to move the failure. The kernel capability-record patcher does the same for selected .nv.info per-kernel records. Neither script turns gate 4 into runtime proof.

That wording is intentionally conservative. Public NVIDIA docs explain tcgen05, TMEM, and clustered TMA instructions, but this article grounds .nv.capmerc and .nv.merc.rela.* only against the checked-in gate walk and the literal section names visible in the public bundle. We do not present .nv.capmerc as an official NVIDIA API surface; the claim here is only that this observed signed capability boundary is where the public-safe repro still fails.
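Checking which literal section names a cubin carries needs nothing beyond the generic ELF64 layout. The sketch below walks the section header table and groups names by the prefixes this article associates with gates 3 and 4. It assumes a little-endian ELF64 image; `section_names` and `gate_sections` are hypothetical helpers, and the prefix table reflects only the names quoted in this article, not a documented NVIDIA interface. Gate 2 is absent because reservedSmem entries are symbols, not sections.

```python
import struct

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header
SHDR = struct.Struct("<IIQQQQIIQQ")        # ELF64 section header

# Section-name prefixes quoted in this article, mapped to gate numbers.
GATE_PREFIXES = {".nv.info": 3, ".nv.capmerc": 4, ".nv.merc.rela": 4}


def section_names(blob: bytes) -> list:
    """Return every section name in header-table order."""
    (ident, _type, _machine, _version, _entry, _phoff, shoff, _flags,
     _ehsize, _phentsize, _phnum, shentsize, shnum, shstrndx) = EHDR.unpack_from(blob, 0)
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    headers = [SHDR.unpack_from(blob, shoff + i * shentsize) for i in range(shnum)]
    # Fields 4 and 5 of a section header are sh_offset and sh_size.
    str_off, str_size = headers[shstrndx][4], headers[shstrndx][5]
    strtab = blob[str_off:str_off + str_size]
    names = []
    for hdr in headers:
        start = hdr[0]  # sh_name: byte offset into the string table
        end = strtab.index(b"\x00", start)
        names.append(strtab[start:end].decode("ascii"))
    return names


def gate_sections(blob: bytes) -> dict:
    """Group section names under the gate their prefix is associated with."""
    found = {}
    for name in section_names(blob):
        for prefix, gate in GATE_PREFIXES.items():
            if name.startswith(prefix):
                found.setdefault(gate, []).append(name)
    return found
```

Seeing a gate-4 section name in this listing only tells you the boundary exists in that cubin; it says nothing about whether the driver will accept it.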

External public evidence reinforces that restraint rather than weakening it. We did not find a public, reproducible GB10 receipt that carries a tcgen05 / TMEM probe through the later .nv.capmerc-named boundary into a stable tcgen05.mma execution result. Negative evidence is not proof of impossibility by itself, but it removes the usual shortcut where someone else's public repro would justify stronger language than our checked-in bundle supports.

This matters because public discussions about GB10 support often collapse all of these layers into one sentence like "the driver blocks it" or "the hardware lacks it." That hides the real structure of the problem. Some things are byte-patchable. Some things are not. And the fact that you can move through the earlier gates does not tell you what would happen if the exact tcgen05 kernel reached a fully valid submission state.

What the `tcgen05` probes actually hit

The conservative GB10 evidence stops at gate 4.

The checked-in probes keep the hard terms separate on purpose because they are not one bucket. tcgen05.alloc asks whether the TMEM allocation side can even advance past the early metadata gates. tcgen05.ld asks about the load side of that same family. tcgen05.mma asks about the matrix instruction itself.
The clustered TMA multicast probe is adjacent rather than identical: it is one local copy-path probe inside the broader TMA multicast family, not the name of the whole family and not a successful tcgen05.mma receipt in disguise. None of those surfaces produced a clean public-safe GB10 execution receipt.
The isolated tcgen05.alloc lane, for example, can be moved through the earlier gates and still ends in CUDA_ERROR_INVALID_IMAGE once the observed .nv.capmerc-named integrity layer becomes decisive.

For non-kernel readers, the split is simpler than the mnemonics make it sound. tcgen05.alloc is the smallest "can the TMEM-side setup even begin?" question. tcgen05.ld is the "can data be pulled through that setup?" follow-on. tcgen05.mma is the actual matrix-operation question. TMA multicast is the neighboring clustered-copy question rather than the matrix path itself. The checked-in bundle separates them because moving one question forward does not automatically answer the others.

The important split is that the minimal TMEM allocation probe source isolates tcgen05.alloc, while the full GB10 tensor-path probe source keeps tcgen05.ld, tcgen05.mma, and the clustered TMA multicast probe separate in the broader bundle. For the staged command flow, continue with the checked-in GB10 repro walkthrough and GB10 gate matrix.
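One way to keep these probes from collapsing into one bucket is to record each probe's outcome on its own. The sketch below is a minimal illustration, not the checked-in bundle: ProbeOutcome, Stage, and classify are hypothetical names, and the only real-world values are the two CUresult codes from the CUDA driver API header.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# CUresult values as defined in the CUDA driver API header (cuda.h).
CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_IMAGE = 200


class Stage(Enum):
    """How far one probe got. Labels are hypothetical, chosen for this sketch."""
    NOT_RECORDED = "not recorded in this sketch"
    REJECTED_BY_DRIVER = "rejected by the driver"
    ACCEPTED_ONLY = "accepted, but no verified execution"
    EXECUTION_RECEIPT = "executed with verified output"


@dataclass(frozen=True)
class ProbeOutcome:
    probe: str                     # e.g. "tcgen05.alloc"
    result: Optional[int]          # CUresult returned to the probe, None if unknown
    output_verified: bool = False  # True only if device output was checked


def classify(outcome: ProbeOutcome) -> Stage:
    """Classify one probe independently of every other probe."""
    if outcome.result is None:
        return Stage.NOT_RECORDED
    if outcome.result != CUDA_SUCCESS:
        return Stage.REJECTED_BY_DRIVER
    if not outcome.output_verified:
        return Stage.ACCEPTED_ONLY
    return Stage.EXECUTION_RECEIPT


# Only the alloc outcome is taken from the text above. The other probes are
# deliberately left unrecorded here rather than guessed.
outcomes = [
    ProbeOutcome("tcgen05.alloc", CUDA_ERROR_INVALID_IMAGE),
    ProbeOutcome("tcgen05.ld", None),
    ProbeOutcome("tcgen05.mma", None),
    ProbeOutcome("TMA multicast", None),
]
for o in outcomes:
    print(f"{o.probe}: {classify(o).value}")
```

Because classify looks at one record at a time, advancing the alloc probe cannot change what is recorded for ld, mma, or the multicast probe, which is the same separation the bundle keeps.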

That is the key distinction that older drafts blurred.

There is a big difference between:

proving the driver can be pushed farther than its default routing policy,
proving a helper or wrapper path exists inside libcuda,
proving a kernel can be submitted and hang,
and proving that tcgen05.mma is a stable, physically present, usable execution path on GB10 silicon.

Only the last one would justify a public claim of B200-style tcgen05 availability. We do not have that proof.
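The four-way distinction above can be written down as an ordered evidence scale. This is a minimal sketch under our own naming: the Evidence enum and supports_availability_claim are hypothetical, and the ordering simply follows the list above.

```python
from enum import IntEnum


class Evidence(IntEnum):
    """Evidence levels, in the order the list above gives them."""
    DRIVER_PUSHED_PAST_DEFAULT_ROUTING = 1
    HELPER_PATH_EXISTS_IN_LIBCUDA = 2
    KERNEL_SUBMITS_AND_HANGS = 3
    STABLE_EXECUTION_ON_SILICON = 4


def supports_availability_claim(observed):
    """Return True only if the strongest level was itself observed.

    Weaker signals do not accumulate: any number of them still leaves the
    public availability claim unjustified.
    """
    return Evidence.STABLE_EXECUTION_ON_SILICON in set(observed)


# Illustrative input only: three weaker signals and no execution receipt.
observed = [
    Evidence.DRIVER_PUSHED_PAST_DEFAULT_ROUTING,
    Evidence.HELPER_PATH_EXISTS_IN_LIBCUDA,
    Evidence.KERNEL_SUBMITS_AND_HANGS,
]
print(supports_availability_claim(observed))
```

The design choice is that the check is a membership test, not a threshold or a sum, so a pile of suggestive signals can never be promoted into a receipt.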

Why older stronger claims are stale

Some of our earlier exploratory notes leaned too hard on the most exciting interpretation of the data. That is normal in a live reverse-engineering session and unacceptable in a publication.

The strongest overreach looked like this:

a helper or routing patch can be made to reach deeper driver paths,
therefore the silicon probably carries the full capability,
therefore GB10 effectively "has" the datacenter path if we patch enough bytes.

That leap is too large.

A more honest reading is:

the software stack clearly contains layered product gating,
at least some datacenter-targeted SASS decodes and executes on GB10,
the driver ships enough capability-related machinery to make the path look tantalizingly close,
but the publication-grade tcgen05 proof is still missing.

That is why we are treating the earlier stronger wording as stale. It was useful as a research hypothesis. It is not the standard we want attached to a public article or a customer-facing example repo.

Consumer Blackwell vs datacenter Blackwell

What should an engineer conclude from this if they just want to ship kernels?

The practical conclusion is not mysterious:

GB10 is not a small B200.
Driver-visible datacenter artifacts do not make it one.
If a path depends on TMEM, tcgen05, or other datacenter-only assumptions, you should treat GB10 as a separate target with its own kernel contract.

That conclusion matches the rest of the GB10 bring-up story across this site. In inference, we already treat GB10's OMMA-based FP4 lane as real while keeping TMEM-coupled paths off. In the FA4 catalog, we already gate GB10 separately from the sm_100a line.
The tensor-path experiments fit the same pattern: shared branding, partial decode overlap, different operational contract.

The split is wider than one instruction mnemonic. The same research lane that keeps tcgen05 in the unsupported bucket also keeps clustered multi-CTA copy assumptions and CTA-pair / 2-SM tensor cooperation on the datacenter side of the fence. In reader terms: if a kernel design depends on clustered delivery or two-SM tensor cooperation as part of its normal contract, treat that as another sign you are outside the safe GB10 subset rather than one patch away from a supported path.
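In runtime policy, that separate kernel contract can be expressed as a conservative gate. The sketch below is illustrative only: KernelContract, allowed_on, and the example kernel names are hypothetical, and it assumes the usual mapping of sm_100 to compute capability (10, 0) and sm_121 to (12, 1).

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed mapping for this sketch: sm_100 -> (10, 0), sm_121 -> (12, 1).
B200_CLASS = (10, 0)
GB10 = (12, 1)


@dataclass(frozen=True)
class KernelContract:
    """What a kernel design assumes about the device (hypothetical fields)."""
    name: str
    needs_tmem: bool = False
    needs_tcgen05: bool = False
    needs_cluster_multicast: bool = False
    needs_two_sm_tensor_coop: bool = False

    def datacenter_only(self) -> bool:
        # Any one of these assumptions puts the kernel outside the GB10 subset.
        return (self.needs_tmem or self.needs_tcgen05
                or self.needs_cluster_multicast
                or self.needs_two_sm_tensor_coop)


def allowed_on(contract: KernelContract, cc: Tuple[int, int]) -> bool:
    """Receipt-first policy: unknown devices and unproven paths are refused."""
    if cc == B200_CLASS:
        return True
    if cc == GB10:
        return not contract.datacenter_only()
    return False  # out of scope for this sketch, so refuse


warp_level_fp4 = KernelContract("warp_level_fp4")              # hypothetical
tmem_gemm = KernelContract("tmem_gemm", needs_tmem=True,
                           needs_tcgen05=True)                 # hypothetical
for kernel in (warp_level_fp4, tmem_gemm):
    print(kernel.name, "on GB10:", allowed_on(kernel, GB10))
```

The gate defaults to refusal, so a new device or a new datacenter-side assumption has to be added deliberately instead of being inherited from the B200 path.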

That is also why public B200 or GB100 capability tables are only partial context here. They can be perfectly accurate descriptions of the datacenter contract and still tell you nothing about whether GB10 inherits TMEM, clustered TMA delivery, or 2-SM tensor cooperation. For GB10, the trustworthy rule is still receipt first: use the datacenter docs to decode the vocabulary, then look for a GB10 execution-grade result before treating any of those paths as available.

Another way to keep the boundary straight is TMEM versus the narrower consumer low-precision lane. On B200-class SM100, NVIDIA's public docs place tcgen05 on the TMEM-coupled datacenter path.
On the GB10 side of this site, the low-precision receipts we do trust live in a different contract: the consumer sm_120f / sm_121a lane sometimes described in neighboring posts with local shorthand like OMMA, plus the NVFP4 and FA4 examples that stay inside that consumer staging model. Those receipts matter, but they answer a different question than tcgen05, which is why NVFP4 Inference for the MegaCpp SLM Ensemble, The FA4 Catalog on Blackwell, and GB10 stack parity for MegaCpp can all be true without upgrading this article into TMEM parity.

The main correction here is about proof discipline. A capability table, helper cubin, or partially patched submission path is not a shipping contract.

The conservative public rule

The right public rule is stricter than the most optimistic early research note.

If you are writing documentation, examples, or runtime policy for GB10:

you may say that some sm_100a SASS executes on GB10 after an arch-field rewrite;
you may say that multiple driver-side gates exist before tcgen05 probes can run;
you may say that driver-visible capability machinery can make unsupported paths look deceptively close;
you should not say that GB10 has proven working tcgen05.mma parity with B200 or GB100.

That is also why the new public examples for this topic focus on the baseline arch-patch probe, the gate matrix, and the difference between a software-visible signal and runtime proof. Those are the parts we can defend cleanly.

What we are publishing instead

For this topic we are publishing three things and drawing one line.

The three things:

a compact baseline probe showing what the positive sm_100a result really means;
a compact example showing why driver-visible support is not runtime proof;
a near-copy gate-matrix example showing exactly where the tcgen05 path still stops.

The line:

We are not publishing "6-byte patch unlocks tcgen05 on GB10" as a settled statement.

That line is worth keeping. It saves future engineers from inheriting an evidence problem disguised as a success story.

This narrower claim lines up with Training the MegaCpp SLM Ensemble on GB10, NVFP4 Inference for the MegaCpp SLM Ensemble, and The FA4 Catalog on Blackwell: consumer Blackwell is a real target, but only on the exact kernel and precision paths that have execution-grade evidence rather than suggestive driver metadata.

FAQ

Frequently asked questions

What is the strongest positive claim we can make?

That some sm_100a SASS executes on GB10 after an architecture-field rewrite. That still falls well short of proving working tcgen05.mma.

What do sm_100, sm_100a, and sm_121a mean in these receipts?

sm_100 is the baseline Blackwell target name in NVIDIA's compiler vocabulary. sm_100a is the architecture-specific datacenter Blackwell/B200 target the original cubins were built for, and sm_121a is the architecture-specific consumer Blackwell GB10 target that the driver reports on DGX Spark. NVIDIA's compiler docs also use sm_120f for the family-specific consumer-Blackwell target. The a suffix matters because NVIDIA treats it as architecture-specific, while f means family-specific. The arch-patch result is interesting because it crosses that target boundary at the cubin-identity layer, not because it erases the hardware differences. The shortest checked-in handoff in this repo is the MegaCpp GB10 example index, Reproducing the sm_100a to sm_121a cubin patch on GB10, and GB10 stack parity for MegaCpp; the deeper implementation matrix remains the GB10 repro bundle overview and GB10 repro walkthrough.
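For readers who want the naming rule in executable form, here is a small sketch that splits a target name into its parts. SmTarget and parse_sm_target are hypothetical helpers, not part of any NVIDIA tool, and the sketch assumes the last digit is the minor version and the leading digits are the major version.

```python
import re
from typing import NamedTuple


class SmTarget(NamedTuple):
    major: int
    minor: int
    kind: str  # "baseline", "architecture-specific", or "family-specific"


_KIND = {"": "baseline", "a": "architecture-specific", "f": "family-specific"}


def parse_sm_target(name: str) -> SmTarget:
    """Split a name like 'sm_121a' into (major, minor, kind)."""
    m = re.fullmatch(r"sm_(\d+)(\d)([af]?)", name)
    if m is None:
        raise ValueError(f"not a recognised sm target name: {name!r}")
    return SmTarget(int(m.group(1)), int(m.group(2)), _KIND[m.group(3)])


source = parse_sm_target("sm_100a")   # target the original cubins were built for
dest = parse_sm_target("sm_121a")     # target the driver reports on GB10
family = parse_sm_target("sm_120f")   # family-specific consumer target

for t in (source, dest, family):
    print(t)

# The arch patch crosses a major-version boundary at the identity layer only.
print("crosses major boundary:", source.major != dest.major)
```

Parsing the label says nothing about what the silicon can execute; it only makes the boundary that the arch-field rewrite crosses explicit.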

Why do some GB10 notes mention sm_120f when this article talks about sm_121a?

sm_121a is the device target the driver reports on GB10. sm_120f is the family target NVIDIA exposes for consumer-Blackwell kernels that want family-common features without binary-locking to one exact device target. That compile-target nuance matters for production GB10 kernels and performance tuning, but it does not turn the sm_100a cubin-patch result into tcgen05 proof. The reader-facing handoff on this site is GB10 stack parity for MegaCpp, Training the MegaCpp SLM Ensemble on GB10, and NVFP4 Inference for the MegaCpp SLM Ensemble. The checked-in walkthrough that sits behind those posts is GB10 repro walkthrough, and the broader raw example map comes last in the MegaCpp GB10 example index.

What is tcgen05 in plain English?

In NVIDIA's PTX naming, tcgen05 means Tensor Core Generation 5 instructions on the SM100/datacenter side of Blackwell. PTX is the lower-level parallel-thread execution vocabulary NVIDIA documents for these instruction names, not a proof that a consumer device can execute every named path. The checked-in public probes split that family on purpose: tcgen05.alloc is the Tensor Memory allocation instruction, tcgen05.ld is the Tensor Memory load instruction, and tcgen05.mma is the matrix instruction itself. If you want the cleanest explanation first, read Reproducing the sm_100a to sm_121a cubin patch on GB10 and Why driver-visible paths can look like hardware support on GB10. The staged checked-in bundle docs come next in GB10 repro walkthrough and GB10 gate matrix. The raw compact mirror is the compact gate-walk mirror.

What is TMEM in plain English?

TMEM is Tensor Memory, the dedicated memory space used by SM100 tcgen05 paths. In this GB10 lane it is mostly a warning label: when a probe depends on TMEM semantics, you should read it as an SM100-side contract first, then ask whether GB10 has a clean execution receipt for that exact contract. So far, the public-safe answer is no. The shortest contrast set is Reproducing the sm_100a to sm_121a cubin patch on GB10, GB10 stack parity for MegaCpp, and NVFP4 Inference for the MegaCpp SLM Ensemble. The deeper checked-in bundle overview is GB10 repro bundle overview.

What is OMMA in plain English here?

In this article family, OMMA is local shorthand for the narrower warp-level consumer-Blackwell low-precision lane that still has real GB10 execution receipts. That is a different claim from SM100-side tcgen05 plus TMEM. The reader-safe contrast is simple: OMMA belongs to the GB10 posts that explain the surviving consumer path, while tcgen05 belongs to the datacenter probe family this article keeps in the unsupported-until-proven bucket. The shortest follow-ons are NVFP4 Inference for the MegaCpp SLM Ensemble, Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story, and The FA4 Catalog on Blackwell: Variants, sm Guards, and Runtime Selection.

If GB10 has a real NVFP4 or OMMA-style low-precision lane elsewhere on this site, why does that still not prove tcgen05 here?

Because those receipts answer a narrower consumer-Blackwell question. They show that a GB10 low-precision serving lane works inside the consumer sm_120f / sm_121a contract, with its own tiling, staging, and backend limits. The tcgen05.alloc, tcgen05.ld, and tcgen05.mma probes in this article are explicitly testing the stronger SM100 plus TMEM contract instead. In other words, a real GB10 NVFP4 or local-OMMA receipt is evidence for the consumer lane, not inheritance of the datacenter lane. The shortest reader-first contrast is NVFP4 Inference for the MegaCpp SLM Ensemble, The FA4 Catalog on Blackwell, GB10 stack parity for MegaCpp, and the checked-in GB10 public claims guardrail.

What is .nv.capmerc in plain English?

In this article set, .nv.capmerc is the literal section-name family we can see on the later capability boundary in the checked-in cubins. Reader-first meaning: it is the deeper metadata layer where the public-safe GB10 lane still stops after the easier byte patches move the earlier failures. It is not presented here as a public NVIDIA programming interface, and it is not a synonym for successful tcgen05 execution. Start with Why driver-visible paths can look like hardware support on GB10, then GB10 gate matrix, then the checked-in compact gate-walk mirror.

What do the four gates mean in plain English?

Gate 1 is the cubin identity check in ELF e_flags. Gate 2 is loader-side reserved shared-memory bookkeeping through weak undefined reservedSmem symbols. Gate 3 is mutable .nv.info per-kernel metadata records. Gate 4 is the observed deeper integrity-protected capability block surfaced as the .nv.capmerc text metadata section plus .nv.merc.rela.*. If you want the staged explanation, read Reproducing the sm_100a to sm_121a cubin patch on GB10, then GB10 repro walkthrough, then GB10 gate matrix.
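To make gate 1 concrete, here is a minimal Python sketch of reading the e_flags field from a little-endian ELF64 header, which is the field the arch patch rewrites. The offset follows the standard ELF64 header layout. The helper names (`read_e_flags`, `replace_low_bits`, `make_test_header`) are hypothetical, and the mask and value passed to `replace_low_bits` are placeholders: this note does not state the exact bit layout of the architecture field, so the sketch does not assume one.

```python
import struct

ELF_MAGIC = b"\x7fELF"
# Standard ELF64 header layout: e_ident(16) + e_type(2) + e_machine(2)
# + e_version(4) + e_entry(8) + e_phoff(8) + e_shoff(8) = byte offset 48.
E_FLAGS_OFFSET = 48


def read_e_flags(image: bytes) -> int:
    """Return the 32-bit e_flags field of a little-endian ELF64 image."""
    if len(image) < 64 or image[:4] != ELF_MAGIC:
        raise ValueError("not an ELF image")
    # e_ident[4] == 2 means ELFCLASS64; e_ident[5] == 1 means little-endian.
    if image[4] != 2 or image[5] != 1:
        raise ValueError("this sketch handles only little-endian ELF64")
    (e_flags,) = struct.unpack_from("<I", image, E_FLAGS_OFFSET)
    return e_flags


def replace_low_bits(e_flags: int, new_value: int, mask: int) -> int:
    """Replace the bits selected by `mask` with `new_value`.

    ASSUMPTION: the real architecture-field mask and the encodings for
    sm_100a and sm_121a are not given in this note; callers must supply them.
    """
    return (e_flags & ~mask & 0xFFFFFFFF) | (new_value & mask)


def make_test_header(e_flags: int) -> bytes:
    """Build a synthetic 64-byte ELF64 header for exercising the reader."""
    header = bytearray(64)
    header[0:4] = ELF_MAGIC
    header[4] = 2  # ELFCLASS64
    header[5] = 1  # little-endian
    struct.pack_into("<I", header, E_FLAGS_OFFSET, e_flags)
    return bytes(header)
```

Reading the field back after a patch only confirms that the identity label changed. It says nothing about gates 2 through 4, which is why the article treats the arch patch as a narrow result.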

What do the reserved shared-memory symbol patcher and kernel capability-record patcher actually prove?

They prove that gates 2 and 3 are real software-visible stages, not that GB10 has working tcgen05. Start with Reproducing the sm_100a to sm_121a cubin patch on GB10 and Why driver-visible paths can look like hardware support on GB10 for the reader-facing explanation, then GB10 gate matrix for the compact staged view. The lower-level patch helpers come last: the reserved shared-memory symbol patcher moves the failure past weak undefined reservedSmem names, and the kernel capability-record patcher trims selected .nv.info per-kernel records so the image reaches the deeper capability block. Neither script converts the lane into a publication-grade execution receipt.

Why does the public-safe lane still end at CUDA_ERROR_INVALID_IMAGE after the earlier patchers?

Because the earlier patchers only move the stop from one software-visible layer to the next. In the checked-in walkthrough, the tcgen05.alloc lane starts at CUDA_ERROR_NOT_FOUND, moves to CUDA_ERROR_INVALID_IMAGE after the weak reservedSmem symbols are patched, and still ends at CUDA_ERROR_INVALID_IMAGE after selected .nv.info records are trimmed. The useful result is that the failure surface became more precise, not that the kernel reached runtime proof. The shortest staged receipts are GB10 repro walkthrough and GB10 gate matrix.
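The progression above can be written down as data, which makes the "more precise failure, still no runtime proof" reading mechanical. This is a hedged Python sketch, not the checked-in tooling: `StageReceipt`, `is_execution_grade`, and the stage labels are hypothetical names introduced for illustration. The numeric codes are the values of the corresponding CUresult enumerators in cuda.h.

```python
from dataclasses import dataclass

# CUresult values as defined in cuda.h.
CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_IMAGE = 200
CUDA_ERROR_NOT_FOUND = 500


@dataclass(frozen=True)
class StageReceipt:
    """One observed stop in the gate walk (hypothetical record type)."""
    stage: str
    load_result: int
    kernel_ran: bool = False
    output_verified: bool = False


def is_execution_grade(receipt: StageReceipt) -> bool:
    """A receipt counts only if the image loaded, ran, and was checked."""
    return (
        receipt.load_result == CUDA_SUCCESS
        and receipt.kernel_ran
        and receipt.output_verified
    )


# The tcgen05.alloc lane as described in the walkthrough paragraph above.
TCGEN05_ALLOC_LANE = [
    StageReceipt("start of lane", CUDA_ERROR_NOT_FOUND),
    StageReceipt("weak reservedSmem symbols patched", CUDA_ERROR_INVALID_IMAGE),
    StageReceipt("selected .nv.info records trimmed", CUDA_ERROR_INVALID_IMAGE),
]
```

Under this rule, none of the three stages qualifies, because every one of them stops at module load. A changed error code moves the stop. It does not produce a receipt.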

Why is gate 4 treated differently from the earlier gates?

Because the observed .nv.capmerc and related records behave like integrity-protected capability metadata, not just mutable loader-side bookkeeping. The reader-facing explanation is Why driver-visible paths can look like hardware support on GB10 together with Reproducing the sm_100a to sm_121a cubin patch on GB10. The compact bundle summary comes next in GB10 gate matrix. In this corpus we treat .nv.capmerc only as the observed section-name family visible in the checked-in cubins, not as a public NVIDIA programming surface; this pass did not verify any NVIDIA primary doc that promotes it beyond that.
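One way to keep the mutable and integrity-protected layers apart when triaging a cubin is to bucket its section names by the gate they belong to. The Python sketch below assumes you already have a list of section names from whatever ELF listing tool you use. The function names are hypothetical, and the prefix matching is an assumption based only on the section-name families named in this article (.nv.info, .nv.capmerc, .nv.merc.rela.*). Gates 1 and 2 are not section-based (header field and symbols), so they are out of scope here.

```python
from typing import Iterable, Optional


def gate_for_section(name: str) -> Optional[int]:
    """Map a section name to gate 3 or gate 4, or None if unrelated.

    ASSUMPTION: per-kernel sections carry the family name as a prefix,
    for example ".nv.info.<kernel>". Exact suffix shapes are not specified
    in this note.
    """
    if name == ".nv.capmerc" or name.startswith(".nv.capmerc."):
        return 4
    if name.startswith(".nv.merc.rela"):
        return 4
    if name == ".nv.info" or name.startswith(".nv.info."):
        return 3
    return None


def deepest_gate(section_names: Iterable[str]) -> Optional[int]:
    """Return the deepest gate family present in the listing, if any."""
    gates = [g for g in map(gate_for_section, section_names) if g is not None]
    return max(gates) if gates else None
```

Finding a gate 4 section in a listing only tells you the image carries that metadata family. Whether the driver accepts the image is still decided at load time.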

Does this article claim that .nv.capmerc is a documented cryptographic lock?

No. In this article set, .nv.capmerc and .nv.merc.rela.* are the literal later section-name families we can observe in the checked-in bundle, and "integrity-protected" is the cautious reader-facing description of the boundary they appear to represent in practice. We are not claiming that NVIDIA publishes .nv.capmerc as a public DRM or signature API. The safer reading is narrower: public docs put TMEM and tcgen05 on the SM100/datacenter side, while our GB10 repro still stops at the later capability boundary. The reader-facing handoff is Why driver-visible paths can look like hardware support on GB10, Inside the GB10 driver patch lane, and the checked-in GB10 gate matrix.

What is TMA multicast in plain English?

It is the clustered copy form of cp.async.bulk.tensor...multicast::cluster: one TMA transfer tries to deliver a tile to multiple CTAs in a cluster instead of one CTA. In our checked-in probes, the clustered TMA multicast probe is one local copy-path example inside that broader TMA multicast family, not a synonym for the whole family and not shorthand for tcgen05 success. The cleanest first explanation here is Reproducing the sm_100a to sm_121a cubin patch on GB10, which keeps multicast in the same cautionary bucket as the other advanced probes. The checked-in bundle docs come next in GB10 repro walkthrough and the GB10 repro bundle overview. The adjacent copy-path example is the TileLang TMA bulk-copy checked-in example. The public-safe result is still negative: this probe is not a published GB10 success receipt.

Does this article say anything about 2-SM MMA or CTA-pair tensor paths?

Only in the same conservative way it talks about tcgen05.mma: they stay on the datacenter side of the contract until there is a clean GB10 execution receipt for that exact path. The research packet groups those wider cooperative tensor paths with the same unsupported cluster of assumptions as TMEM-coupled tcgen05, so they belong in the same "unsupported until proven" bucket rather than the "one more patch should unlock it" bucket. The safer adjacent follow-ons are Training the MegaCpp SLM Ensemble on GB10, NVFP4 Inference for the MegaCpp SLM Ensemble, and The FA4 Catalog on Blackwell.

Do public B200 or GB100 capability tables prove that GB10 should have the same tensor path?

No. Those tables can be accurate descriptions of the datacenter contract without saying anything about consumer GB10 parity. In this article family we use them to decode names like TMEM, tcgen05, clustered TMA, or CTA-pair tensor cooperation, but we still require a GB10 execution-grade receipt before treating any of those paths as available on GB10. For the GB10 paths with cleaner evidence, read NVFP4 Inference for the MegaCpp SLM Ensemble, The FA4 Catalog on Blackwell, and Training the MegaCpp SLM Ensemble on GB10.

What came before tcgen05, and what should I look at on GB10 instead?

The public ladder is narrower than "Blackwell tensor cores" as one blob. Older tensor-core explanations usually mean warp-level mma.sync; Hopper adds warpgroup wgmma; Blackwell SM100 adds tcgen05 on the TMEM-coupled datacenter side. On GB10, the safer adjacent question is not "where is the tcgen05 success example?" but "which consumer-Blackwell path has an execution-grade receipt?" Start with NVFP4 Inference for the MegaCpp SLM Ensemble, Training the MegaCpp SLM Ensemble on GB10, and The FA4 Catalog on Blackwell. The supporting checked-in bundle note for compile-target nuance is GB10 repro walkthrough. The adjacent copy-path example is the TileLang TMA bulk-copy checked-in example.

Where can I see tcgen05.alloc, tcgen05.ld, tcgen05.mma, and TMA multicast separately?

The reader-facing path is Reproducing the sm_100a to sm_121a cubin patch on GB10, then GB10 repro walkthrough, then GB10 gate matrix. The checked-in probe sources come after that: the minimal TMEM allocation probe source isolates tcgen05.alloc, while the full GB10 tensor-path probe source contains separate alloc, ld, mma, and clustered TMA multicast probes. The compact checked-in mirror is the compact gate-walk mirror.

What should I open first if all of these names are new?

Use MegaCpp model glossary for the name map, this article for the claim boundary, Reproducing the sm_100a to sm_121a cubin patch on GB10 for the staged walkthrough, and Why driver-visible paths can look like hardware support on GB10 for the gate-by-gate warning story. After that, the checked-in GB10 repro walkthrough and MegaCpp GB10 example index are the safest entry points into the checked-in examples.

Which related posts cover the GB10 paths we do trust?

NVFP4 Inference for the MegaCpp SLM Ensemble and The FA4 Catalog on Blackwell are the cleaner follow-ons for the precision and kernel families with execution-grade evidence. For the unsupported-path side of the same vocabulary, the next two stops are Why driver-visible paths can look like hardware support on GB10 for the four-gate warning story and Inside the GB10 driver patch lane for the copied-libcuda lane.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

tcgen05.mma

The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.

tcgen05

The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

e_flags

The ELF header field whose low architecture bits are rewritten from sm_100a to sm_121a in the public GB10 arch-patch lane.

sm_100

Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.

reservedSmem

Observed weak undefined reserved shared-memory symbols that mark the earlier loader-side bookkeeping gate in the GB10 lane.

.nv.info

Observed per-kernel metadata records edited in the GB10 repro lane before the later integrity-protected boundary.

.nv.merc.rela.*

Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

sm_120f

Family-specific consumer Blackwell compile target used when kernels should target family-common features without pinning to one exact device label such as sm_121a.

tcgen05.alloc

Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.

tcgen05.ld

Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.

mma.sync

The older warp-level matrix-multiply path that predates Hopper WGMMA and the SM100 tcgen05 family.

TMA multicast

The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.

.nv.capmerc

Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.

WGMMA

Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.

TMEM

Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.

PTX

NVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

