Reproducing the sm_100a -> sm_121a Cubin Patch on GB10: CUDA/C++ Code, ELF Edits, and the Exact Point Where tcgen05 Stops
A practical GB10 reproduction guide for the narrow result we can defend publicly: a patched sm_100a baseline cubin executes on GB10, while tcgen05-oriented probes stop at later driver-side gates rather than producing a publication-grade tcgen05 proof.

This article is the step-by-step version of the conservative GB10 claim.
What we can defend publicly is narrow:
- a simple sm_100a cubin can be patched so it loads and executes on GB10;
- the same workflow becomes a layered gate walk once the kernel uses tcgen05-oriented instructions;
- the clean public evidence still stops short of proving working tcgen05.mma parity on GB10.
A quick vocabulary guardrail helps. NVIDIA's compiler docs treat sm_100a and compute_100a as architecture-specific targets, not generic Blackwell-family payloads. In this article, sm_100a is the source cubin target, sm_121a is the GB10 target we patch into ELF metadata, tcgen05 is the SM100 tensor-core instruction family, and TMEM is the tensor-memory space that family uses. The patch changes the cubin's declared identity; it does not prove GB10 exposes the SM100 tensor path.
That is enough to be useful. It tells you what to reproduce, which patches matter, which commands to run, and exactly where the public-safe story stops.
What you need
The public repro bundle for this article lives here:
- Repro bundle overview
- GB10 bundle build recipe
- GB10 guided run script
- CUDA Driver API loader example
- host attribute reader
- baseline arithmetic probe source
- minimal TMEM allocation probe source
- full GB10 tensor-path probe source
- ELF arch-field patcher
- reserved shared-memory symbol patcher
- kernel capability-record patcher
- GB10 gate matrix
- GB10 repro walkthrough
- GB10 claim-scope guardrails
Two checked-in mirrors are useful before you touch the C++ path: the arch-patch probe mirror shows the narrow arch-patch proof shape, and the driver signal and runtime proof mirror keeps driver-visible signals separate from runtime proof.
The receipts behind the article were collected on GB10 with CUDA 13.2 and driver 595.58.03. If your exact environment differs, treat the command flow as the stable part and the exact offsets or return paths as environment-specific.
The baseline result in one screen
The narrow positive result is a trivial arithmetic kernel compiled for sm_100a, then patched so the ELF arch field says sm_121a.
The kernel is intentionally boring:
extern "C" __global__ void k_baseline(int* out) {
    out[threadIdx.x] = threadIdx.x * 2 + 1;
}
That source is published in the baseline kernel source.
The public-safe receipt is the one that matters:
# device: NVIDIA GB10 sm_121
[cuModuleLoadDataEx] CUDA_SUCCESS
[cuLaunchKernel] CUDA_SUCCESS
[cuCtxSynchronize] CUDA_SUCCESS
# out[0..7]: 00000001 00000003 00000005 00000007 00000009 0000000b 0000000d 0000000f
That proves three specific things.
- The user-space driver performs a software-visible architecture check at the cubin level.
- GB10 will accept and execute at least some SASS originally emitted for the datacenter Blackwell line.
- This is not the same thing as proving GB10 exposes working tcgen05.mma.
The third point is the one worth repeating, because it is where reverse-engineering writeups usually start to overclaim.
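To check a receipt like the one above against the kernel source, a host-side mirror of the kernel arithmetic is enough. The sketch below is illustrative only; the function names are hypothetical and are not part of the bundle.

```python
# Host-side mirror of k_baseline: out[threadIdx.x] = threadIdx.x * 2 + 1.
# Function names here are hypothetical helpers, not bundle code.

def expected_baseline(n_threads: int) -> list[int]:
    """Return the values the kernel should write for threads 0..n_threads-1."""
    return [(t * 2 + 1) & 0xFFFFFFFF for t in range(n_threads)]


def format_receipt_words(values: list[int]) -> str:
    """Format values like the receipt line: 8-digit lowercase hex, space separated."""
    return " ".join(f"{v:08x}" for v in values)


if __name__ == "__main__":
    # Compare this line by eye (or in a script) with the "# out[0..7]" receipt line.
    print(format_receipt_words(expected_baseline(8)))
```

Comparing the printed words with the `# out[0..7]` line confirms that the kernel body ran, which is a stronger signal than the three CUDA_SUCCESS lines alone.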
Step 1: build the baseline cubin
From the bundle directory:
make clean
make build-baseline
Under the hood, the baseline build is equivalent to:
nvcc -arch=sm_100a --cubin -std=c++17 -lineinfo \
-Xptxas -gno-tmem-access-check \
kernel_baseline.cu -o kernel_baseline_100a.cubin
The -gno-tmem-access-check flag is harmless for the baseline kernel and keeps the command line aligned with the deeper probes that follow.
Step 2: patch only the ELF arch field
The smallest patch in the whole workflow is the ELF arch-field patcher. It rewrites only the low architecture bits in ELF e_flags and preserves the upper bits.
Run it directly:
./patch_elf.py kernel_baseline_100a.cubin kernel_baseline_patched.cubin sm_100a sm_121a
The source notes summarize the key field values like this:
sm_100a -> low 16 bits 0x6402
sm_121a -> low 16 bits 0x7902
So the patch is changing the architecture identity at the cubin metadata layer, not rewriting the kernel body. That distinction matters because the a suffix marks an architecture-specific target in NVIDIA's feature-target model.
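As a rough illustration of what "rewrite the low bits, preserve the upper bits" means, here is a minimal sketch. It is not the bundle's patch_elf.py: the function name is hypothetical, and it assumes the cubin is a 64-bit little-endian ELF image, where e_flags is the 4-byte field at header offset 48. The two 16-bit values come from the source notes quoted above.

```python
import struct

E_FLAGS_OFFSET = 48  # ELF64 header layout: e_flags is a 4-byte field at 0x30


def patch_arch_bits(image: bytes, old_low16: int, new_low16: int) -> bytes:
    """Return a copy of image with the low 16 bits of e_flags replaced.

    Assumption: image is a 64-bit little-endian ELF file. The upper 16 bits
    of e_flags and every other byte of the file are left untouched.
    """
    if len(image) < E_FLAGS_OFFSET + 4 or image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if image[4] != 2 or image[5] != 1:  # EI_CLASS == ELFCLASS64, EI_DATA == LSB
        raise ValueError("this sketch handles only 64-bit little-endian ELF")
    (flags,) = struct.unpack_from("<I", image, E_FLAGS_OFFSET)
    if flags & 0xFFFF != old_low16:
        raise ValueError(f"unexpected low bits: {flags & 0xFFFF:#06x}")
    patched = bytearray(image)
    struct.pack_into("<I", patched, E_FLAGS_OFFSET, (flags & 0xFFFF0000) | new_low16)
    return bytes(patched)
```

Under those assumptions, calling `patch_arch_bits(data, 0x6402, 0x7902)` on the bytes of the sm_100a cubin would produce the sm_121a-labelled image; the guard on the old value keeps the sketch from silently rewriting a file that was built for a different target.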
If you want to inspect the result yourself:
readelf -h kernel_baseline_100a.cubin | grep -E "Flags|Machine"
readelf -h kernel_baseline_patched.cubin | grep -E "Flags|Machine"
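If you would rather confirm the "metadata only" claim byte by byte, a small diff is enough. The sketch below assumes a 64-bit ELF layout (e_flags at bytes 48 to 51) and uses hypothetical helper names; it checks that every differing byte between two images falls inside that field.

```python
E_FLAGS_START = 48  # assumption: ELF64 header, e_flags occupies bytes 48..51
E_FLAGS_END = 52


def differing_offsets(a: bytes, b: bytes) -> list[int]:
    """Return the byte offsets at which two equal-length images differ."""
    if len(a) != len(b):
        raise ValueError("images differ in length; not a pure in-place patch")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def only_e_flags_changed(a: bytes, b: bytes) -> bool:
    """True if the images differ, and only inside the e_flags field."""
    offsets = differing_offsets(a, b)
    return bool(offsets) and all(E_FLAGS_START <= i < E_FLAGS_END for i in offsets)
```

A `True` result for the original and patched baseline cubins supports the statement that the kernel body was not rewritten.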
Step 3: load and run the patched cubin on GB10
The published CUDA Driver API loader example uses the CUDA Driver API directly. It loads the cubin, resolves the kernel symbol, launches it, synchronizes, and prints the output buffer.
Build and run:
make run-baseline
Or run the loader explicitly:
g++ -O2 -std=c++17 -I/usr/local/cuda/include loader.cpp -o loader -L/usr/local/cuda/lib64 -lcuda
./loader kernel_baseline_patched.cubin k_baseline 32
This is the exact point where the public-safe positive claim ends. The patched baseline cubin executes. That is real. It is useful. It is still narrower than “GB10 has datacenter Blackwell tensor-path parity.”
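When you script the loader, it helps to reduce each receipt to "which Driver API stage failed first, if any". The sketch below assumes the loader prints lines in the `[stage] STATUS` shape shown in the baseline receipt; the parser and its function names are hypothetical and not part of the bundle.

```python
import re

# Matches receipt lines such as "[cuLaunchKernel] CUDA_SUCCESS".
RECEIPT_LINE = re.compile(r"^\[(?P<stage>\w+)\]\s+(?P<status>\w+)\s*$")


def parse_receipt(log: str) -> list[tuple[str, str]]:
    """Extract (stage, status) pairs; comment lines starting with '#' are skipped."""
    stages = []
    for line in log.splitlines():
        match = RECEIPT_LINE.match(line.strip())
        if match:
            stages.append((match.group("stage"), match.group("status")))
    return stages


def first_failure(log: str):
    """Return the first (stage, status) that is not CUDA_SUCCESS, or None."""
    for stage, status in parse_receipt(log):
        if status != "CUDA_SUCCESS":
            return stage, status
    return None


if __name__ == "__main__":
    baseline = """# device: NVIDIA GB10 sm_121
[cuModuleLoadDataEx] CUDA_SUCCESS
[cuLaunchKernel] CUDA_SUCCESS
[cuCtxSynchronize] CUDA_SUCCESS"""
    print(first_failure(baseline))  # None: every stage in the baseline receipt succeeded
```

A `None` result only says that load, launch, and synchronize returned success; checking the output buffer is still a separate step.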
Step 4: move from baseline arithmetic to a tcgen05 probe
The next published probe is the minimal TMEM allocation probe source. It isolates tcgen05.alloc without mixing in the larger mma or TMA path.
The interesting part of that kernel is this block:
if (threadIdx.x == 0) {
    uint32_t smem_ptr = __cvta_generic_to_shared(&tmem_addr);
    asm volatile(
        "tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], 32;\n"
        :: "r"(smem_ptr));
    asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n");
}
That is where the story stops being a simple arch-byte patch and becomes a gate walk. For the wider probe map, the full GB10 tensor-path probe source keeps tcgen05.ld, tcgen05.mma, and a clustered TMA copy example separate so the alloc-side failure is not confused with a matrix-op success.
The reservedSmem gate is loader bookkeeping, and PTX or ptxas terminology in the build log should not be read as proof that the final tensor path executed.
Build the alloc-only cubin:
make build-alloc
Then try the first launch attempt:
./loader alloc_patched.cubin k_tcgen05_alloc 128
On the public-safe path, that does not complete as a working GB10 kernel.
The four gates we found
The public repro bundle keeps the gate structure explicit because collapsing it into one sentence hides the engineering reality.
The four gates are:
- ELF e_flags architecture validation
- weak undefined reservedSmem symbols such as .nv.reservedSmem.offset0 and .nv.reservedSmem.cap
- .nv.info per-kernel capability records
- .nv.capmerc.text per-kernel metadata plus .nv.merc.rela.* integrity metadata
You can also see this summarized in the GB10 gate matrix and the compact gate-walk mirror.
The important distinction is that the gates are not interchangeable.
- Gate 1 is simple identity metadata.
- Gate 2 is loader-side symbol plumbing.
- Gate 3 is mutable per-kernel metadata that the loader still consumes as bounds data, not decorative labels.
- Gate 4 is integrity-protected capability metadata coupled to the encoded instruction stream.
That last transition is where naive cubin surgery stops being enough.
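One way to keep the gates from blurring together in your own tooling is to encode them as data. The sketch below restates only what this article says about each gate; the `Gate` type and helper name are hypothetical, and the patcher file names are the ones used in the commands in this walkthrough.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gate:
    number: int
    location: str           # where the gate shows up in the cubin
    kind: str                # how the article characterizes it
    patcher: Optional[str]   # public patch script, or None if the public lane stops here


GATES = [
    Gate(1, "ELF e_flags", "identity metadata", "patch_elf.py"),
    Gate(2, ".nv.reservedSmem.* weak undefined symbols",
         "loader-side symbol plumbing", "patch_symbols.py"),
    Gate(3, ".nv.info per-kernel records",
         "mutable per-kernel metadata consumed as bounds data", "patch_nvinfo.py"),
    Gate(4, ".nv.capmerc.text plus .nv.merc.rela.*",
         "integrity-protected capability metadata", None),
]


def public_stop_gate(gates=GATES) -> Gate:
    """Return the first gate with no public patcher: where the public lane stops."""
    return next(g for g in gates if g.patcher is None)


if __name__ == "__main__":
    print(public_stop_gate().number)  # 4
```

Keeping `patcher` as `None` for gate 4 makes the stopping point explicit instead of implying that one more script would finish the job.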
Step 5: patch the reservedSmem symbols
The first tcgen05.alloc failure in the log moves from the arch gate to symbol resolution. The source notes report weak undefined symbols like these:
.nv.reservedSmem.offset0
.nv.reservedSmem.cap
The published patcher for that stage is the reserved shared-memory symbol patcher. It is a diagnostic gate mover, not a support claim, and it does not create a working tcgen05 result.
Run it in place on the alloc probe:
./patch_symbols.py alloc_patched.cubin alloc_patched.cubin \
.nv.reservedSmem.offset0 .nv.reservedSmem.cap
Then retry the load:
./loader alloc_patched.cubin k_tcgen05_alloc 128
At this point the error changes, which tells you the earlier gate was real. It does not tell you that the final tcgen05 path is now proven available.
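To see for yourself which symbols are weak and undefined before and after this step, you can scan the symbol table directly. This sketch does not reproduce what the bundle's patcher does. It assumes you have already extracted the raw bytes of a 64-bit little-endian `.symtab` and its string table by some other means, and the function name is hypothetical.

```python
import struct

# Elf64_Sym layout: st_name (u32), st_info (u8), st_other (u8),
# st_shndx (u16), st_value (u64), st_size (u64) = 24 bytes.
ELF64_SYM = struct.Struct("<IBBHQQ")
STB_WEAK = 2   # symbol binding stored in the high nibble of st_info
SHN_UNDEF = 0  # section index used for undefined symbols


def weak_undefined(symtab: bytes, strtab: bytes) -> list[str]:
    """Return names of symbols that are both weak and undefined."""
    names = []
    for offset in range(0, len(symtab) - ELF64_SYM.size + 1, ELF64_SYM.size):
        st_name, st_info, _other, st_shndx, _value, _size = ELF64_SYM.unpack_from(
            symtab, offset
        )
        if (st_info >> 4) == STB_WEAK and st_shndx == SHN_UNDEF:
            end = strtab.index(b"\0", st_name)  # names are NUL-terminated
            names.append(strtab[st_name:end].decode("ascii", "replace"))
    return names
```

Running the scan on the cubin before and after the symbol patch gives a concrete record of what changed at gate 2, independent of the loader's error text.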
Step 6: strip selected .nv.info records
The next patcher in the bundle is the kernel capability-record patcher. It removes selected tcgen05-specific records from the .nv.info records attached to the named kernel.
Run it like this:
./patch_nvinfo.py alloc_patched.cubin alloc_patched_info.cubin k_tcgen05_alloc
Then load the rewritten cubin:
./loader alloc_patched_info.cubin k_tcgen05_alloc 128
This is the exact step where the public-safe story still stops at CUDA_ERROR_INVALID_IMAGE. The useful lesson is not that those records are optional; it is that they expose a later loader-consumed bounds check. The deeper gate is still in the way.
The exact point where the public-safe path stops
The clean public lane stops at gate 4: .nv.capmerc.text per-kernel metadata together with .nv.merc.rela.*.
That is why the conservative public wording is:
- the baseline arithmetic cubin executes after an arch-field patch;
- tcgen05-oriented probes hit additional driver-side gates;
- the clean public evidence does not prove working tcgen05.mma parity on GB10.
This is also why the bundle separates the parent repro lane from the deeper driver research lane. The public-safe lane is about what we can show cleanly with source files, patch scripts, and reproducible receipts. It is not a place to smuggle in a stronger silicon claim than the receipts support.
One-command walkthrough
If you want the compact path instead of running each step manually, the bundle already includes a one-command walkthrough script:
./run.sh
That script runs:
make all
./query_attrs
make run-baseline
make probe-alloc-gates
The result is exactly the sequence this article describes: a working baseline, then a staged tcgen05.alloc gate walk that still stops before publication-grade proof.
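If you want the same sequence from Python, for example to capture return codes per step, a thin wrapper is enough. This is a sketch under assumptions, not a port of run.sh: it assumes the four commands are run in that order from the bundle directory, and the stop-on-first-failure policy and the `run_steps` name are choices made for illustration.

```python
import subprocess

# The four steps the walkthrough script is described as running, in order.
STEPS = [
    ["make", "all"],
    ["./query_attrs"],
    ["make", "run-baseline"],
    ["make", "probe-alloc-gates"],
]


def run_steps(steps=STEPS, runner=subprocess.run):
    """Run each step in order; stop after the first non-zero return code.

    Returns a list of (command string, return code) for the steps that ran.
    The runner parameter exists so the sequence can be exercised without
    invoking make.
    """
    results = []
    for cmd in steps:
        return_code = runner(cmd).returncode
        results.append((" ".join(cmd), return_code))
        if return_code != 0:
            break
    return results


if __name__ == "__main__":
    for command, code in run_steps():
        print(f"{command}: exit {code}")
```

Whether the gate-probe step exits non-zero when a probe hits a gate depends on the bundle's Makefile, so treat a non-zero code there as something to read in the log, not as a failure of the walkthrough.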
What works, and what does not
What works in the public-safe bundle:
- building an sm_100a baseline cubin;
- rewriting the ELF arch field to sm_121a;
- loading and executing the patched baseline on GB10;
- reproducing the staged failure movement for the alloc-only tcgen05 probe;
- seeing that the driver path is layered rather than a single yes/no hardware switch.
What does not work in the public-safe bundle:
- a clean end-to-end tcgen05.alloc execute receipt on GB10;
- a clean end-to-end tcgen05.mma execute receipt on GB10;
- a public proof that GB10 physically exposes datacenter Blackwell tcgen05 parity;
- a claim that deeper helper paths or routing knowledge inside libcuda are the same thing as runtime proof.
That distinction is the whole point of publishing the bundle this way.
Why this article exists next to the other GB10 posts
The other two GB10 posts explain the meaning of the result and the difference between driver-visible hints and runtime proof:
- What our GB10 experiments actually prove about Blackwell tensor paths
- Why driver-visible paths can look like hardware support on GB10, even when silicon proof is missing
This article is the practical companion. It is here so another engineer can repeat the exact cubin patch, load path, and gate walk without guessing where the public evidence starts and where it stops.
Frequently asked questions
Does the baseline patch prove tcgen05.mma works on GB10?+
sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro. cubin can be patched to load and execute as sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.. The tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path. probes still hit later gates.Why keep sm_100a and sm_121a so explicit?+
a suffix is the architecture-specific contract. Treating those targets as a generic Blackwell family hides the exact thing the repro is testing.Where do sm_100, sm_100a, and sm_121a fit in this repro?+
sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane. is the broader SM100 / datacenter Blackwell target vocabulary you will see in NVIDIA's PTXQuick term guidePTXNVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics. and CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop. material. sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro. is the architecture-specific datacenter target used to build the original cubin, and sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. is the architecture-specific GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. target written into the ELF metadata for the baseline patch. That rewrite is a loader/runtime fact, not proof that GB10 inherits every SM100 tensor path. For the wider target map, pair this article with What our GB10 experiments actually prove and The FA4 Catalog on Blackwell.What do tcgen05.alloc, tcgen05.ld, tcgen05.mma, and TMA multicast mean here?+
tcgen05.alloc is the Tensor Memory allocation-side probe, tcgen05.ld is the Tensor Memory load-side probe, and tcgen05.mma is the matrix-operation question. TMA multicast is separate: it is a clustered copy shape around cp.async.bulk.tensor...multicast::cluster, not a synonym for the matrix path. The minimal TMEM allocation probe source isolates the first question, while the full GB10 tensor-path probe source keeps the load, matrix, and clustered-copy examples apart.
What do reservedSmem, .nv.info, .nv.capmerc, and .nv.merc.rela.* prove?
reservedSmem names are weak reserved shared-memory symbols consumed by the loader, .nv.info records are mutable per-kernel metadata, and .nv.capmerc plus .nv.merc.rela.* are the later observed metadata section families where the public-safe lane still stops. The section names are reported as observed cubin evidence in this bundle, not as a public NVIDIA programming API. The shortest local cross-checks are the GB10 gate matrix, the GB10 claim-scope guardrails, and "Why driver-visible paths can look like hardware support on GB10".
Where do PTX, CuTe, and CUTLASS fit?
The relevant vocabulary is tcgen05, TMEM, TMA, and SM100. For the implementation-facing handoff, read "The FA4 Catalog on Blackwell" and "Our honest experience with CuTe DSL" after the GB10 gate walk.
Where is the shortest checked-in proof map?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
sm_100a: Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.
sm_121a: Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.
tcgen05: The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.
sm_100: Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.
TMEM: Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.
reservedSmem: Observed weak undefined reserved shared-memory symbols that mark the earlier loader-side bookkeeping gate in the GB10 lane.
.nv.info: Observed per-kernel metadata records edited in the GB10 repro lane before the later integrity-protected boundary.
.nv.merc.rela.*: Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary.
GB10: Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
tcgen05.alloc: Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.
tcgen05.ld: Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.
tcgen05.mma: The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.
TMA multicast: The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.
.nv.capmerc: Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.
FA4: FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.
CUTLASS: NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.
CuTe DSL: The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.
CuTe: CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.
PTX: NVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.
e_flags: The ELF header field whose low architecture bits are rewritten from sm_100a to sm_121a in the public GB10 arch-patch lane.
CUDA: NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
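To make the ELF header entry in the term list concrete, here is a minimal C++ sketch of reading and rewriting the flags field of a little-endian ELF64 image held in memory. The header offsets come from the generic ELF64 layout. Everything else is an assumption for illustration: the helper names read_e_flags and rewrite_arch_bits are hypothetical, and the architecture mask and replacement value are caller-supplied placeholders, not a statement of how NVIDIA encodes sm_100a or sm_121a. Check the flags reported by readelf -h on real cubins before choosing either.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Generic ELF64 layout: the header is 64 bytes long and the 32-bit
// e_flags field sits at byte offset 48.
constexpr std::size_t kElf64HeaderSize = 64;
constexpr std::size_t kEFlagsOffset = 48;

// True when the buffer starts with a little-endian ELF64 header.
inline bool is_le_elf64(const std::vector<std::uint8_t>& image) {
    return image.size() >= kElf64HeaderSize &&
           image[0] == 0x7F && image[1] == 'E' &&
           image[2] == 'L' && image[3] == 'F' &&
           image[4] == 2 &&  // ELFCLASS64
           image[5] == 1;    // ELFDATA2LSB
}

// Reads e_flags into out_flags. Returns false (and leaves out_flags
// untouched) if the image is not a little-endian ELF64 file.
inline bool read_e_flags(const std::vector<std::uint8_t>& image,
                         std::uint32_t& out_flags) {
    if (!is_le_elf64(image)) return false;
    std::uint32_t flags = 0;
    for (std::size_t i = 4; i-- > 0;) {
        flags = (flags << 8) | image[kEFlagsOffset + i];
    }
    out_flags = flags;
    return true;
}

// Replaces only the bits selected by arch_mask and keeps every other
// flag bit as it was.
// ASSUMPTION: arch_mask and new_arch_bits are hypothetical inputs. The
// real bit layout must be taken from the cubins being patched.
inline bool rewrite_arch_bits(std::vector<std::uint8_t>& image,
                              std::uint32_t arch_mask,
                              std::uint32_t new_arch_bits) {
    assert((new_arch_bits & ~arch_mask) == 0 && "new bits must fit the mask");
    std::uint32_t flags = 0;
    if (!read_e_flags(image, flags)) return false;
    const std::uint32_t patched = (flags & ~arch_mask) | new_arch_bits;
    for (std::size_t i = 0; i < 4; ++i) {
        image[kEFlagsOffset + i] =
            static_cast<std::uint8_t>((patched >> (8 * i)) & 0xFFu);
    }
    return true;
}
```

A caller would load the cubin into a byte vector, call read_e_flags to record the original value, call rewrite_arch_bits with a mask and value derived from real cubins, and write the buffer back out. A successful rewrite here only changes what the loader sees in the header. It says nothing about the later tcgen05 gates.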