# GB10 Walkthrough

This walkthrough follows the same staged GB10 probe flow used throughout the
bundle, while keeping the public interpretation conservative.

If you only want the shortest path, run [the one-command runner](./run.sh). If you want the
actual source and scripts, keep these files open while you work:

- [The Driver API loader](./loader.cpp)
- [The host attribute reader](./query_attrs.cpp)
- [The baseline arithmetic kernel](./kernel_baseline.cu)
- [The minimal alloc probe](./kernel_alloc_only.cu)
- [The exploratory tensor-path probe](./kernel_sm100a.cu)
- [The ELF arch-field patch helper](./patch_elf.py)
- [The reserved-SMEM symbol patch helper](./patch_symbols.py)
- [The capability-record strip helper](./patch_nvinfo.py)

## 1. Build the source artifacts

```bash
make all
```

That produces:

- `kernel_baseline_100a.cubin`
- `kernel_baseline_patched.cubin`
- `alloc_100a.cubin`
- `alloc_patched.cubin`
- `kernel_sm100a.cubin`
- `kernel_sm100a_patched.cubin`
- `loader`
- `query_attrs`

## 2. Inspect what the driver reports

```bash
./query_attrs
```

This records what the driver reports for the unmodified host, and serves as the
baseline before any cubin or driver patching.

The readback source is [the host attribute reader](./query_attrs.cpp).
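
If the compiled helper is not at hand, a similar readback can be approximated from Python with `ctypes`. This is a minimal sketch, not the contents of `query_attrs.cpp`: it assumes the driver library is reachable as `libcuda.so.1` and reads only the compute capability through the Driver API. The function name `query_compute_capability` is hypothetical.

```python
import ctypes

# Values from the CUdevice_attribute enum in cuda.h.
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76
CUDA_SUCCESS = 0


def query_compute_capability(ordinal=0, libname="libcuda.so.1"):
    """Return (major, minor) for a device, or None if it cannot be read.

    `libname` is an assumption about where the driver library lives.
    """
    try:
        cuda = ctypes.CDLL(libname)
    except OSError:
        return None  # no driver library on this host

    if cuda.cuInit(0) != CUDA_SUCCESS:
        return None

    device = ctypes.c_int()
    if cuda.cuDeviceGet(ctypes.byref(device), ordinal) != CUDA_SUCCESS:
        return None

    values = []
    for attr in (CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR,
                 CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR):
        out = ctypes.c_int()
        rc = cuda.cuDeviceGetAttribute(ctypes.byref(out), attr, device)
        if rc != CUDA_SUCCESS:
            return None
        values.append(out.value)
    return tuple(values)


if __name__ == "__main__":
    print(query_compute_capability())
```

A `None` result only means the sketch could not reach a device; it says nothing about what `./query_attrs` would print.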

## 3. Run the narrow positive proof

```bash
make run-baseline
```

This should show:

- `cuModuleLoadDataEx == CUDA_SUCCESS`
- `cuLaunchKernel == CUDA_SUCCESS`
- `cuCtxSynchronize == CUDA_SUCCESS`
- output values `1, 3, 5, ..., 15`

Interpretation:

- GB10 accepted and executed a baseline cubin that was originally compiled for
  `sm_100a`;
- this proves a loader/runtime fact, not full datacenter-path parity.
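
The output check can be made mechanical rather than eyeballed. The sketch below assumes the elided sequence is the eight odd integers from 1 to 15 and that you have already collected the loader's output values into a Python list. The helper name `mismatched_indices` is hypothetical, and how you parse the loader's actual output is left open.

```python
# Assumed expansion of "1, 3, 5, ..., 15": the odd integers from 1 to 15.
EXPECTED = list(range(1, 16, 2))


def mismatched_indices(values):
    """Return the positions where observed values differ from EXPECTED."""
    values = list(values)
    if len(values) != len(EXPECTED):
        raise ValueError(
            f"expected {len(EXPECTED)} values, got {len(values)}"
        )
    return [i for i, (got, want) in enumerate(zip(values, EXPECTED))
            if got != want]


if __name__ == "__main__":
    observed = [1, 3, 5, 7, 9, 11, 13, 15]
    bad = mismatched_indices(observed)
    print("baseline output matches" if not bad else f"mismatch at {bad}")
```

An empty result supports only the narrow claim above: the baseline kernel ran and wrote the expected values.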

The three files behind this step are:

- [The baseline arithmetic kernel](./kernel_baseline.cu)
- [The ELF arch-field patch helper](./patch_elf.py)
- [The Driver API loader](./loader.cpp)
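
To see what an arch-field patch changes, it can help to read the ELF header of the original and patched cubins side by side. The sketch below reads `e_machine` and `e_flags` from an ELF64 little-endian header. It assumes, without confirmation from `patch_elf.py`, that the architecture field of interest lives in `e_flags`, and it deliberately does not decode the bits. The function name is hypothetical.

```python
import struct

EM_CUDA = 190  # e_machine value registered for NVIDIA CUDA ELF images


def read_elf64_header_fields(data):
    """Return (e_machine, e_flags) from the start of an ELF64 LE image."""
    if len(data) < 52 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("expected ELF64 little-endian")
    # ELF64 layout: e_machine is a 16-bit field at offset 18,
    # e_flags is a 32-bit field at offset 48.
    (e_machine,) = struct.unpack_from("<H", data, 18)
    (e_flags,) = struct.unpack_from("<I", data, 48)
    return e_machine, e_flags


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            machine, flags = read_elf64_header_fields(f.read(64))
        print(f"{path}: e_machine={machine} e_flags=0x{flags:08x}")
```

Running it on `kernel_baseline_100a.cubin` and `kernel_baseline_patched.cubin` should show whether the two headers differ in `e_flags`. If they do not, the patch touches a different field than this sketch assumes.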

## 4. Walk the minimal `tcgen05.alloc` lane

```bash
make build-alloc
./loader alloc_patched.cubin k_tcgen05_alloc 128 || true
./patch_symbols.py alloc_patched.cubin alloc_patched.cubin \
  .nv.reservedSmem.offset0 .nv.reservedSmem.cap
./loader alloc_patched.cubin k_tcgen05_alloc 128 || true
./patch_nvinfo.py alloc_patched.cubin alloc_patched_info.cubin k_tcgen05_alloc
./loader alloc_patched_info.cubin k_tcgen05_alloc 128 || true
```

Expected shape of the walk:

1. before symbol patching: `CUDA_ERROR_NOT_FOUND`
2. after symbol patching: `CUDA_ERROR_INVALID_IMAGE`
3. after `.nv.info` stripping: still `CUDA_ERROR_INVALID_IMAGE`

Interpretation:

- the early gates are software- and metadata-driven;
- the public lane still stops before a clean `tcgen05` completion receipt.
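
The expected walk can be written down as data, so that a run which diverges from it is caught at the first differing stage. The numeric codes in this sketch are the `CUresult` values from `cuda.h`. The stage labels mirror the list above. The helper name `first_divergence` is hypothetical, and the sketch assumes you record one result name per stage by hand or from the loader's output.

```python
# CUresult values as defined in cuda.h.
CU_RESULT = {
    "CUDA_SUCCESS": 0,
    "CUDA_ERROR_INVALID_IMAGE": 200,
    "CUDA_ERROR_NOT_FOUND": 500,
}

# Stage order and expected results, copied from the walk described above.
EXPECTED_WALK = [
    ("before symbol patching", "CUDA_ERROR_NOT_FOUND"),
    ("after symbol patching", "CUDA_ERROR_INVALID_IMAGE"),
    ("after .nv.info stripping", "CUDA_ERROR_INVALID_IMAGE"),
]


def first_divergence(observed):
    """Return (stage, expected, got) for the first mismatch, else None.

    `observed` is a list of CUresult names, one per stage, in order.
    """
    if len(observed) != len(EXPECTED_WALK):
        raise ValueError("need exactly one observed result per stage")
    for (stage, want), got in zip(EXPECTED_WALK, observed):
        if got not in CU_RESULT:
            raise ValueError(f"unknown result name: {got}")
        if got != want:
            return (stage, want, got)
    return None


if __name__ == "__main__":
    run = ["CUDA_ERROR_NOT_FOUND",
           "CUDA_ERROR_INVALID_IMAGE",
           "CUDA_ERROR_INVALID_IMAGE"]
    print(first_divergence(run) or "walk matches the expected shape")
```

A divergence is worth recording as-is rather than explaining away. A different driver version may plausibly gate at a different stage.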

The exact artifacts for this step are:

- [The minimal alloc probe](./kernel_alloc_only.cu)
- [The reserved-SMEM symbol patch helper](./patch_symbols.py)
- [The capability-record strip helper](./patch_nvinfo.py)
- [The Driver API loader](./loader.cpp)

## 5. Optional fuller probes

```bash
make run-full-alloc || true
make run-full-mma || true
make run-full-tma || true
```

These targets are included because the articles discuss them, not because they
already establish public proof.

The fuller exploratory source is [the tensor-path probe](./kernel_sm100a.cu).

## 6. If you want the deeper driver patch lane

Read [the copied-driver lane overview](./driver_patch_lane/README.md) and then
[the copied-driver patch helper](./driver_patch_lane/patch_libcuda.py).

That lane is user-space `libcuda` research and should be treated separately
from the public-safe baseline and gate-walk claims.

## 7. Recommended writeup discipline

Before publishing or summarizing the results, re-read
[the public wording guardrail](./public_claims.md). The point of this bundle is not only
to make the steps repeatable, but also to keep the wording tied to the exact
stage of proof you actually reached.