MegaCpp Engineering: Applied C++ model systems


Article

Grounded engineering note from the MegaCpp stack

Published April 20, 2026 • 8 min read • David Gornshtein

GB10

Blackwell

CUDA

C++

Cubin

tcgen05

Driver Research

Reproducing the sm_100a -> sm_121a Cubin Patch on GB10: CUDA/C++ Code, ELF Edits, and the Exact Point Where tcgen05 Stops

Q: Does the baseline patch prove tcgen05.mma works on GB10?

No. It proves that a trivial sm_100a cubin can be patched to load and execute as sm_121a. The tcgen05 probes still hit later gates.

Q: Why keep sm_100a and sm_121a so explicit?

Because the a suffix is the architecture-specific contract. Treating those targets as a generic Blackwell family hides the exact thing the repro is testing.

Q: Where do sm_100, sm_100a, and sm_121a fit in this repro?

sm_100 is the broader SM100 / datacenter Blackwell target vocabulary you will see in NVIDIA's PTX and CUTLASS material. sm_100a is the architecture-specific datacenter target used to build the original cubin, and sm_121a is the architecture-specific GB10 target written into the ELF metadata for the baseline patch. That rewrite is a loader/runtime fact, not proof that GB10 inherits every SM100 tensor path. For the wider target map, pair this article with What our GB10 experiments actually prove and The FA4 Catalog on Blackwell.

Q: What do tcgen05.alloc, tcgen05.ld, tcgen05.mma, and TMA multicast mean here?

Use PTX as the vocabulary source and the checked-in bundle as the proof boundary. tcgen05.alloc is the Tensor Memory allocation-side probe, tcgen05.ld is the Tensor Memory load-side probe, and tcgen05.mma is the matrix-operation question. TMA multicast is separate: it is a clustered copy shape around cp.async.bulk.tensor...multicast::cluster, not a synonym for the matrix path. The minimal TMEM allocation probe source isolates the first question, while the full GB10 tensor-path probe source keeps the load, matrix, and clustered-copy examples apart.

Q: Where do PTX, CuTe, and CUTLASS fit?

PTX and CUTLASS are reference material for the instruction names and Blackwell kernel families; they do not replace a GB10 execution receipt. CuTe and CUTLASS matter most in the adjacent FA4 and GEMM posts, while this repro uses them only to decode terms like tcgen05, TMEM, TMA, and SM100. For the implementation-facing handoff, read The FA4 Catalog on Blackwell and Our honest experience with CuTe DSL after the GB10 gate walk.

A practical GB10 reproduction guide for the narrow result we can defend publicly: a patched sm_100a baseline cubin executes on GB10, while tcgen05-oriented probes stop at later driver-side gates rather than producing a publication-grade tcgen05 proof.

By David Gornshtein

MegaCpp

Focused on applied C++ model engineering


This article is the step-by-step version of the conservative GB10 claim.

What we can defend publicly is narrow:

a simple sm_100a cubin can be patched so it loads and executes on GB10;
the same workflow becomes a layered gate walk once the kernel uses tcgen05-oriented instructions;
the clean public evidence still stops short of proving working tcgen05.mma parity on GB10.

A quick vocabulary guardrail helps. NVIDIA's compiler docs treat sm_100a and compute_100a as architecture-specific targets, not generic Blackwell-family payloads. In this article, sm_100a is the source cubin target, sm_121a is the GB10 target we patch into ELF metadata, tcgen05 is the SM100 tensor-core instruction family, and TMEM is the tensor-memory space that family uses. The patch changes the cubin's declared identity; it does not prove GB10 exposes the SM100 tensor path.

That is enough to be useful. It tells you what to reproduce, which patches matter, which commands to run, and exactly where the public-safe story stops.

What you need

The public repro bundle for this article lives here:

Two checked-in mirrors are useful before you touch the C++ path: the arch-patch probe mirror shows the narrow arch-patch proof shape, and the driver signal and runtime proof mirror keeps driver-visible signals separate from runtime proof.

The receipts behind the article were collected on GB10 with CUDA 13.2 and driver 595.58.03. If your exact environment differs, treat the command flow as the stable part and the exact offsets or return paths as environment-specific.

The baseline result in one screen

The narrow positive result is a trivial arithmetic kernel compiled for sm_100a, then patched so the ELF arch field says sm_121a.

The kernel is intentionally boring:

extern "C" __global__ void k_baseline(int* out) {
    out[threadIdx.x] = threadIdx.x * 2 + 1;
}

That source is published in the baseline kernel source.

The public-safe receipt is the one that matters:

# device: NVIDIA GB10  sm_121
[cuModuleLoadDataEx]                CUDA_SUCCESS
[cuLaunchKernel]                    CUDA_SUCCESS
[cuCtxSynchronize]                  CUDA_SUCCESS
# out[0..7]: 00000001 00000003 00000005 00000007 00000009 0000000b 0000000d 0000000f
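The output line in that receipt can be checked against the kernel arithmetic without a GPU. The following is a minimal host-side sketch; the function names are hypothetical and are not part of the bundle.

```python
def expected_baseline(n_threads: int) -> list[int]:
    """Mirror the baseline kernel body: out[tid] = tid * 2 + 1."""
    return [tid * 2 + 1 for tid in range(n_threads)]


def format_receipt(values: list[int], count: int = 8) -> str:
    """Format the first `count` words like the receipt: 8-digit lowercase hex."""
    words = " ".join(f"{v:08x}" for v in values[:count])
    return f"# out[0..{count - 1}]: {words}"


if __name__ == "__main__":
    # 32 matches the thread count passed to the loader in the article.
    print(format_receipt(expected_baseline(32)))
```

If the loader prints anything other than this sequence, the cubin loaded but did not run the arithmetic you compiled.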

That proves three specific things.

The user-space driver performs a software-visible architecture check at the cubin level.
GB10 will accept and execute at least some SASS originally emitted for the datacenter Blackwell line.
This is not the same thing as proving GB10 exposes working tcgen05.mma.

The third point is the one worth repeating, because it is where reverse-engineering writeups usually start to overclaim.

Step 1: build the baseline cubin

From the bundle directory:

make clean
make build-baseline

Under the hood, the baseline build is equivalent to:

nvcc -arch=sm_100a --cubin -std=c++17 -lineinfo \
  -Xptxas -gno-tmem-access-check \
  kernel_baseline.cu -o kernel_baseline_100a.cubin

The -gno-tmem-access-check flag is harmless for the baseline kernel and keeps the command line aligned with the deeper probes that follow.

Step 2: patch only the ELF arch field

The smallest patch in the whole workflow is the ELF arch-field patcher. It rewrites only the low architecture bits in ELF e_flags and preserves the upper bits.

Run it directly:

./patch_elf.py kernel_baseline_100a.cubin kernel_baseline_patched.cubin sm_100a sm_121a

The source notes summarize the key field values like this:

sm_100a -> low 16 bits 0x6402
sm_121a -> low 16 bits 0x7902

So the patch is changing the architecture identity at the cubin metadata layer, not rewriting the kernel body. That distinction matters because the a suffix marks an architecture-specific target in NVIDIA's feature-target model.
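To make the metadata-only nature of the edit concrete, here is a minimal sketch of an e_flags rewrite. It is not the bundle's patch_elf.py: the function name, the constant names, and the 16-bit mask are assumptions taken from the field values quoted above, and it assumes a little-endian ELF64 image, where e_flags is the 32-bit word at byte offset 48 of the header.

```python
import struct

E_FLAGS_OFFSET = 48   # e_flags position in an ELF64 header
ARCH_MASK = 0xFFFF    # assumption: arch identity sits in the low 16 bits
ARCH_BITS = {"sm_100a": 0x6402, "sm_121a": 0x7902}  # values from the source notes


def patch_arch(image: bytes, src: str, dst: str) -> bytes:
    """Return a copy of `image` with only the low arch bits of e_flags rewritten."""
    if image[:4] != b"\x7fELF" or image[4] != 2 or image[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    (flags,) = struct.unpack_from("<I", image, E_FLAGS_OFFSET)
    if (flags & ARCH_MASK) != ARCH_BITS[src]:
        raise ValueError(f"low bits {flags & ARCH_MASK:#06x} do not match {src}")
    # Keep the upper bits untouched; swap in the destination arch bits.
    patched = (flags & ~ARCH_MASK & 0xFFFFFFFF) | ARCH_BITS[dst]
    out = bytearray(image)
    struct.pack_into("<I", out, E_FLAGS_OFFSET, patched)
    return bytes(out)
```

The source-arch check is deliberate: refusing to patch an image whose low bits are not the expected value keeps the tool from silently relabeling an unrelated cubin.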

If you want to inspect the result yourself:

readelf -h kernel_baseline_100a.cubin | grep -E "Flags|Machine"
readelf -h kernel_baseline_patched.cubin | grep -E "Flags|Machine"
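If readelf is not available, the same two header fields can be read directly. This is a small sketch under the assumption of a little-endian ELF64 header (e_machine at byte offset 18, e_flags at byte offset 48); the function name is hypothetical. It also reports the upper byte of the low 16 bits, because in the two values quoted above that byte is 0x64 (100) and 0x79 (121). Treat that match with the SM number as an observation about these two values, not a documented encoding.

```python
import struct


def read_arch_fields(image: bytes) -> dict:
    """Read e_machine and e_flags from a little-endian ELF64 header."""
    (machine,) = struct.unpack_from("<H", image, 18)  # e_machine: 16-bit
    (flags,) = struct.unpack_from("<I", image, 48)    # e_flags: 32-bit
    low16 = flags & 0xFFFF
    return {
        "machine": machine,
        "flags": flags,
        "low16": low16,
        "high_byte": low16 >> 8,  # observed to equal the SM number in both quoted values
    }


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fh:
        fields = read_arch_fields(fh.read(64))
    print(f"Machine: {fields['machine']}  Flags: {fields['flags']:#010x}  "
          f"low16: {fields['low16']:#06x}")
```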

Step 3: load and run the patched cubin on GB10

The published CUDA Driver API loader example loads the cubin, resolves the kernel symbol, launches it, synchronizes, and prints the output buffer.

Build and run:

make run-baseline

Or run the loader explicitly:

g++ -O2 -std=c++17 -I/usr/local/cuda/include loader.cpp -o loader -L/usr/local/cuda/lib64 -lcuda
./loader kernel_baseline_patched.cubin k_baseline 32

This is the exact point where the public-safe positive claim ends. The patched baseline cubin executes. That is real. It is useful. It is still narrower than “GB10 has datacenter Blackwell tensor-path parity.”

Step 4: move from baseline arithmetic to a tcgen05 probe

The next published probe is the minimal TMEM allocation probe source. It isolates tcgen05.alloc without mixing in the larger mma or TMA path.

The interesting part of that kernel is this block:

if (threadIdx.x == 0) {
    uint32_t smem_ptr = __cvta_generic_to_shared(&tmem_addr);
    asm volatile(
        "tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], 32;\n"
        :: "r"(smem_ptr));
    asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n");
}

That is where the story stops being a simple arch-byte patch and becomes a gate walk. For the wider probe map, the full GB10 tensor-path probe source keeps tcgen05.ld, tcgen05.mma, and a clustered TMA copy example separate so the alloc-side failure is not confused with a matrix-op success.

The reservedSmem gate is loader bookkeeping, and PTX or ptxas terminology in the build log should not be read as proof that the final tensor path executed.

Build the alloc-only cubin:

make build-alloc

Then try the first launch attempt:

./loader alloc_patched.cubin k_tcgen05_alloc 128

On the public-safe path, that does not complete as a working GB10 kernel.

The four gates we found

The public repro bundle keeps the gate structure explicit because collapsing it into one sentence hides the engineering reality.

The four gates are:

ELF e_flags architecture validation
weak undefined reservedSmem symbols such as .nv.reservedSmem.offset0 and .nv.reservedSmem.cap
.nv.info per-kernel capability records
.nv.capmerc.text per-kernel metadata plus .nv.merc.rela.* integrity metadata

You can also see this summarized in the GB10 gate matrix and the compact gate-walk mirror.

The important distinction is that the gates are not interchangeable.

Gate 1 is simple identity metadata.
Gate 2 is loader-side symbol plumbing.
Gate 3 is mutable per-kernel metadata that the loader still consumes as bounds data, not decorative labels.
Gate 4 is integrity-protected capability metadata coupled to the encoded instruction stream.

That last transition is where naive cubin surgery stops being enough.
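The ordering can be written down as a small model. The sketch below is illustrative only: it encodes the gate order and the three patcher scripts named in this article for the tcgen05 alloc probe, and the enum and function names are hypothetical. It does not talk to the driver.

```python
from enum import IntEnum


class Gate(IntEnum):
    ARCH_FLAGS = 1     # ELF e_flags architecture validation
    RESERVED_SMEM = 2  # weak undefined .nv.reservedSmem.* symbols
    NV_INFO = 3        # .nv.info per-kernel capability records
    CAPMERC = 4        # .nv.capmerc.text plus .nv.merc.rela.* metadata


# Patcher scripts from the article, keyed by the gate each one moves past.
# Gate 4 has no entry: the public-safe lane has no patcher for it.
PUBLIC_PATCHERS = {
    Gate.ARCH_FLAGS: "patch_elf.py",
    Gate.RESERVED_SMEM: "patch_symbols.py",
    Gate.NV_INFO: "patch_nvinfo.py",
}


def first_blocking_gate(applied: set[str]) -> Gate:
    """Return the first gate, in order, that the applied patchers do not cover."""
    for gate in Gate:
        patcher = PUBLIC_PATCHERS.get(gate)
        if patcher is None or patcher not in applied:
            return gate
    raise AssertionError("unreachable while gate 4 has no public patcher")


if __name__ == "__main__":
    done: set[str] = set()
    for script in ["patch_elf.py", "patch_symbols.py", "patch_nvinfo.py"]:
        done.add(script)
        print(f"after {script}: blocked at gate {first_blocking_gate(done).value}")
```

Because the loop walks the gates in order, skipping an earlier patcher always reports the earlier gate, which matches the way each failure in the log only becomes visible after the previous one is moved.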

Step 5: patch the reservedSmem symbols

The first tcgen05.alloc failure in the log moves from the arch gate to symbol resolution. The source notes report weak undefined symbols like these:

.nv.reservedSmem.offset0
.nv.reservedSmem.cap
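"Weak undefined" has a precise meaning in an ELF symbol table: the binding in the upper four bits of st_info is STB_WEAK (2) and st_shndx is SHN_UNDEF (0). A minimal sketch of that test for a single 24-byte Elf64_Sym entry follows; the function name is hypothetical, and locating the symbol table inside a cubin is left out.

```python
import struct

STB_WEAK = 2          # ELF symbol binding: weak
SHN_UNDEF = 0         # ELF section index: undefined
SYM_FMT = "<IBBHQQ"   # Elf64_Sym: st_name, st_info, st_other, st_shndx, st_value, st_size


def is_weak_undefined(entry: bytes) -> bool:
    """Classify one little-endian Elf64_Sym entry (24 bytes)."""
    _name, info, _other, shndx, _value, _size = struct.unpack(SYM_FMT, entry)
    binding = info >> 4  # upper nibble of st_info holds the binding
    return binding == STB_WEAK and shndx == SHN_UNDEF
```

A symbol that is weak but defined, or undefined but global, does not match, which is why this check is narrower than simply listing unresolved names.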

The published patcher for that stage is the reserved shared-memory symbol patcher. It is a diagnostic gate mover, not a support claim, and it does not create a working tcgen05 result.

Run it in place on the alloc probe:

./patch_symbols.py alloc_patched.cubin alloc_patched.cubin \
  .nv.reservedSmem.offset0 .nv.reservedSmem.cap

Then retry the load:

./loader alloc_patched.cubin k_tcgen05_alloc 128

At this point the error changes, which tells you the earlier gate was real. It does not tell you that the final tcgen05 path is now proven available.

Step 6: strip selected .nv.info records

The next patcher in the bundle is the kernel capability-record patcher. It removes selected tcgen05-specific records from the .nv.info records attached to the named kernel.

Run it like this:

./patch_nvinfo.py alloc_patched.cubin alloc_patched_info.cubin k_tcgen05_alloc

Then load the rewritten cubin:

./loader alloc_patched_info.cubin k_tcgen05_alloc 128

This is the exact step where the public-safe story still stops at CUDA_ERROR_INVALID_IMAGE. The useful lesson is not that those records are optional; it is that they expose a later loader-consumed bounds check. The deeper gate is still in the way.

The exact point where the public-safe path stops

The clean public lane stops at gate 4: .nv.capmerc.text per-kernel metadata together with .nv.merc.rela.*.

That is why the conservative public wording is:

the baseline arithmetic cubin executes after an arch-field patch;
tcgen05-oriented probes hit additional driver-side gates;
the clean public evidence does not prove working tcgen05.mma parity on GB10.

This is also why the bundle separates the parent repro lane from the deeper driver research lane. The public-safe lane is about what we can show cleanly with source files, patch scripts, and reproducible receipts. It is not a place to smuggle in a stronger silicon claim than the receipts support.

One-command walkthrough

If you want the compact path instead of running each step manually, the bundle already includes a one-command walkthrough script:

./run.sh

That script runs:

make all
./query_attrs
make run-baseline
make probe-alloc-gates

The result is exactly the sequence this article describes: a working baseline, then a staged tcgen05.alloc gate walk that still stops before publication-grade proof.

What works, and what does not

What works in the public-safe bundle:

building an sm_100a baseline cubin;
rewriting the ELF arch field to sm_121a;
loading and executing the patched baseline on GB10;
reproducing the staged failure movement for the alloc-only tcgen05 probe;
seeing that the driver path is layered rather than a single yes/no hardware switch.

What does not work in the public-safe bundle:

a clean end-to-end tcgen05.alloc execute receipt on GB10;
a clean end-to-end tcgen05.mma execute receipt on GB10;
a public proof that GB10 physically exposes datacenter Blackwell tcgen05 parity;
a claim that deeper helper paths or routing knowledge inside libcuda are the same thing as runtime proof.

That distinction is the whole point of publishing the bundle this way.

Why this article exists next to the other GB10 posts

The other two GB10 posts explain the meaning of the result and the difference between driver-visible hints and runtime proof:

This article is the practical companion. It is here so another engineer can repeat the exact cubin patch, load path, and gate walk without guessing where the public evidence starts and where it stops.

FAQ

Frequently asked questions

Does the baseline patch prove tcgen05.mma works on GB10?

No. It proves that a trivial sm_100a cubin can be patched to load and execute as sm_121a. The tcgen05 probes still hit later gates.

Why keep sm_100a and sm_121a so explicit?

Because the a suffix is the architecture-specific contract. Treating those targets as a generic Blackwell family hides the exact thing the repro is testing.

Where do sm_100, sm_100a, and sm_121a fit in this repro?

sm_100 is the broader SM100 / datacenter Blackwell target vocabulary you will see in NVIDIA's PTX and CUTLASS material. sm_100a is the architecture-specific datacenter target used to build the original cubin, and sm_121a is the architecture-specific GB10 target written into the ELF metadata for the baseline patch. That rewrite is a loader/runtime fact, not proof that GB10 inherits every SM100 tensor path. For the wider target map, pair this article with What our GB10 experiments actually prove and The FA4 Catalog on Blackwell.
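As a rough illustration of what "written into the ELF metadata" means mechanically, the sketch below reads and rewrites the `e_flags` word of a little-endian ELF64 image held in memory. The offset of `e_flags` (byte 48) comes from the standard ELF64 header layout. The helper names are hypothetical, and `arch_mask` and `new_arch_bits` are caller-supplied placeholders: this sketch does not assert which bits carry the architecture in a real cubin, so take those values from the checked-in patch script rather than from here.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Byte offset of e_flags in a 64-bit ELF header: it follows e_ident (16),
// e_type (2), e_machine (2), e_version (4), e_entry (8), e_phoff (8), e_shoff (8).
constexpr std::size_t kEFlagsOffset64 = 48;

// True when the buffer starts with a little-endian ELF64 header.
inline bool is_le_elf64(const std::vector<std::uint8_t>& image) {
    return image.size() >= 64 &&
           image[0] == 0x7f && image[1] == 'E' && image[2] == 'L' && image[3] == 'F' &&
           image[4] == 2 /* ELFCLASS64 */ && image[5] == 1 /* ELFDATA2LSB */;
}

// Reads the 32-bit little-endian e_flags word, or nullopt for non-ELF64 input.
inline std::optional<std::uint32_t> read_e_flags(const std::vector<std::uint8_t>& image) {
    if (!is_le_elf64(image)) return std::nullopt;
    std::uint32_t value = 0;
    for (int i = 3; i >= 0; --i) {
        value = (value << 8) | image[kEFlagsOffset64 + static_cast<std::size_t>(i)];
    }
    return value;
}

// Replaces only the bits selected by arch_mask and leaves every other flag bit alone.
// ASSUMPTION: arch_mask and new_arch_bits are placeholders chosen by the caller;
// the real cubin bit layout is not defined by this sketch.
inline bool patch_arch_bits(std::vector<std::uint8_t>& image,
                            std::uint32_t arch_mask,
                            std::uint32_t new_arch_bits) {
    const auto flags = read_e_flags(image);
    if (!flags) return false;
    const std::uint32_t patched = (*flags & ~arch_mask) | (new_arch_bits & arch_mask);
    for (std::size_t i = 0; i < 4; ++i) {
        image[kEFlagsOffset64 + i] = static_cast<std::uint8_t>((patched >> (8 * i)) & 0xffu);
    }
    return true;
}
```

Masking instead of overwriting the whole word keeps the edit as narrow as the baseline patch is meant to be: only the architecture field changes, and nothing else in the header is touched.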

What do tcgen05.alloc, tcgen05.ld, tcgen05.mma, and TMA multicast mean here?

Use PTX as the vocabulary source and the checked-in bundle as the proof boundary. tcgen05.alloc is the Tensor Memory allocation-side probe, tcgen05.ld is the Tensor Memory load-side probe, and tcgen05.mma is the matrix-operation question. TMA multicast is separate: it is a clustered copy shape around cp.async.bulk.tensor...multicast::cluster, not a synonym for the matrix path. The minimal TMEM allocation probe source isolates the first question, while the full GB10 tensor-path probe source keeps the load, matrix, and clustered-copy examples apart.

What do reservedSmem, .nv.info, .nv.capmerc, and .nv.merc.rela.* prove?

They prove that the failure path has layers, not that the final tensor path works. reservedSmem names are weak reserved shared-memory symbols consumed by the loader, .nv.info records are mutable per-kernel metadata, and .nv.capmerc plus .nv.merc.rela.* are the later observed metadata section families where the public-safe lane still stops. The section names are reported as observed cubin evidence in this bundle, not as a public NVIDIA programming API. The shortest local cross-checks are the GB10 gate matrix, GB10 claim-scope guardrails, and Why driver-visible paths can look like hardware support on GB10.
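One way to keep those layers apart while triaging a cubin's symbol and section names is a small classifier like the sketch below. It only encodes the name families mentioned in this answer. The enum, the function names, and the prefix-matching rule are assumptions made for illustration; they are not taken from the bundle or from any NVIDIA tool, and a match says nothing about whether a gate can be passed.

```cpp
#include <string_view>

// Layers named in this FAQ answer; the labels themselves are illustrative.
enum class GateLayer {
    ReservedSmemSymbol,     // weak reserved shared-memory symbols consumed by the loader
    MutableKernelInfo,      // .nv.info per-kernel metadata records
    LaterObservedMetadata,  // .nv.capmerc and .nv.merc.rela.* section families
    Other
};

constexpr bool has_prefix(std::string_view name, std::string_view prefix) {
    return name.substr(0, prefix.size()) == prefix;
}

// ASSUMPTION: families are matched by substring or prefix. The article reports
// these as observed name families, not as a documented naming scheme.
constexpr GateLayer classify_name(std::string_view name) {
    if (name.find("reservedSmem") != std::string_view::npos) {
        return GateLayer::ReservedSmemSymbol;
    }
    if (has_prefix(name, ".nv.capmerc") || has_prefix(name, ".nv.merc.rela.")) {
        return GateLayer::LaterObservedMetadata;
    }
    if (has_prefix(name, ".nv.info")) {
        return GateLayer::MutableKernelInfo;
    }
    return GateLayer::Other;
}
```

Feeding it the names printed by an ELF dump gives a quick view of which layer each entry belongs to before any editing is attempted.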

Where do PTX, CuTe, and CUTLASS fit?

PTX and CUTLASS are grounding material for the instruction names and Blackwell kernel families; they do not replace a GB10 execution receipt. CuTe and CUTLASS matter most in the adjacent FA4 and GEMM posts, while this repro uses them only to decode terms like tcgen05, TMEM, TMA, and SM100. For the implementation-facing handoff, read The FA4 Catalog on Blackwell and Our honest experience with CuTe DSL after the GB10 gate walk.

Where is the shortest checked-in proof map?

Start with the GB10 repro walkthrough, then read the GB10 gate matrix and GB10 claim-scope guardrails.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

tcgen05

The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.

sm_100

Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.

TMEM

Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.

reservedSmem

Observed weak undefined reserved shared-memory symbols that mark the earlier loader-side bookkeeping gate in the GB10 lane.

.nv.info

Observed per-kernel metadata records edited in the GB10 repro lane before the later integrity-protected boundary.

.nv.merc.rela.*

Observed companion metadata section family that appears with .nv.capmerc in the deeper GB10 gate boundary.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

tcgen05.alloc

Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.

tcgen05.ld

Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.

tcgen05.mma

The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.

TMA multicast

The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.

.nv.capmerc

Observed section-name family for the deeper integrity-protected metadata boundary where the public-safe GB10 gate walk still stops.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

CUTLASS

NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.

CuTe DSL

The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

PTX

NVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.

e_flags

The ELF header field whose low architecture bits are rewritten from sm_100a to sm_121a in the public GB10 arch-patch lane.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
