NVFP4 Inference for the MegaCpp SLM Ensemble
Why we train in FP16/BF16 and ship in NVFP4, what Blackwell and GB10 actually give us, and which kernels survive the trip from B200 to DGX Spark.

We train the SLM ensemble in FP16/BF16 on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 and B200 and serve it in NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.Groundingprecision recipe: FP16, BF16, FP8, NVFP4
on Blackwell-class hardware. This post covers the serving-side contract: how a
BF16 master checkpoint becomes an NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.Groundingprecision recipe: FP16, BF16, FP8, NVFP4 deployment artifact, which Blackwell
features the kernels actually rely on, and where GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story (DGX Spark, sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough)
forces us to diverge from B200 (sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough). If your real question is KV
residency, paged attentionQuick term guidePaged attentionThe decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack, or schedulerQuick term guideschedulerThe per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.GroundingAbout: inference serving stack Reference: observability and SLO dashboards Reference: KV cache and paged attention design rather than weight format, the
cleaner follow-ups are KV cache and paged attention
and Inference serving stack.
For the training-side dtype story and the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story proof boundary, pair this with Precision recipe: FP16, BF16, FP8, NVFP4 and GB10 Blackwell tensor paths: what we actually proved.
One naming seam matters up front. sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough is the datacenter Blackwell target
used by B200-class tensor-path work. sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story consumer target the
driver reports on DGX Spark. The shortest checked-in decoders in this public
tree are GB10 arch patch probe sample,
GB10 driver signal versus runtime proof sample,
and GB10 public claims guardrail.
Read together, they answer three different first-touch questions: what the
device reports, what an arch rewrite can and cannot prove, and why
driver-visible hints still do not count as runtime proof.
Why this matters
Inference is usually bandwidth-bound, not compute-bound, on the model shapes
this post is about. That is why NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.Groundingprecision recipe: FP16, BF16, FP8, NVFP4 is useful when large weight reads
dominate. Block-scaled just means the packed FP4 values travel with
higher-precision scales so the kernel can reconstruct a usable range block by
block rather than tensor-wide.
The narrow claim here is not "FP4 beats FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper everywhere." It is simpler: shrinking large weight tensors can help in serving even when the same low-precision move would not pay back in training. We also do not quantize everything. DSA indexer fused-memory sample and M2RNN mixer spec sample are the checked-in counterexamples where fusion, layout, or scan behavior matters more than weight dtype.
The quantization contract
From a BF16 master checkpoint we produce an NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.Groundingprecision recipe: FP16, BF16, FP8, NVFP4 artifact with three pieces per quantized tensor:
- NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.Groundingprecision recipe: FP16, BF16, FP8, NVFP4 weights in E2M1 format, packed two values per byte.
- A block-scale tensor, typically E4M3, aligned with the block layout the serving kernel expects.
- A higher-precision tensor-level scale or amax term used when reconstructing the runtime scale.
Staying canonical matters. NVIDIA's public NVFP4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.Groundingprecision recipe: FP16, BF16, FP8, NVFP4 and CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingAbout: CuTe DSL experiments Example: TileLang TMA bulk-copy sample docs use the same block-scaled framing, and keeping that layout lets us use the vendor kernels instead of maintaining a translation layer.
The layers we deliberately keep out of this lane are the SSM blocks, norms, the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack router, the LM head, and small irregular DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample-style linears whose surrounding traffic dominates the actual math.
B200 and GB10 are different lanes
B200 (sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough) and GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story (sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough) share the Blackwell name but not one
execution model. The serving-relevant subset is:
| Feature | B200 (sm_100a) |
GB10 (sm_121a) |
|---|---|---|
| Block-scaled FP4 path | yes | narrower warp-level block-scaled MMA lane |
tcgen05.* and TMEM |
yes | not part of the public-safe proof set |
| Typical tile assumptions | larger datacenter tiles | retiled smaller for GB10 limits |
| Memory system | HBM3e class | 273 GB/s LPDDR5x |
For glossary purposes, sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro is the baseline Blackwell compiler-family name, while the staged datacenter probes split tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.GroundingAbout: GB10 cubin patch repro Example: GB10 repro walkthrough Example: GB10 gate walkthrough, tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough, and tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails into separate gates. TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: GB10 stack parity Reference: TileLang TMA bulk-copy companion sample and TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 repro walkthrough sit beside that datacenter path, not inside the narrower GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story serving proof.
The clean GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story claim on this site is deliberately narrow: some
consumer-Blackwell FP4-shaped work is real, but the public evidence still stops
short of proving B200 or GB100 tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough parity on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story. That is why the
checked-in GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story bundle keeps separate what the device reports, what the staged
gate walk shows, and what remains unproven.
For attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns and backend selection, the same split still matters. On B200 we can use the datacenter Blackwell line. On GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story we keep the serving lane on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story-eligible fallbacks such as SDPA rather than pretending datacenter FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample and consumer Blackwell are interchangeable. Flash Attention 4 in practice and The FA4 Catalog on Blackwell cover that adjacent boundary.
Where NVFP4 pays back most
Grouped MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack expert GEMMs are the clearest case because expert weights dominate parameter volume. Dense FFN and projection paths are the next obvious fit. The common thread is simple: weight-heavy serving kernels benefit most from a smaller artifact.
The same rule explains the exclusions. If a surface is small, irregular, scan-dominated, or accuracy-sensitive, the better move is usually to keep it in BF16 or FP32 and fix the surrounding layout or kernel contract instead.
Frequently asked questions
Should I quantize the whole model to NVFP4 on GB10?+
What is NVFP4 in plain language?+
What does block-scaled mean in the vendor docs?+
What do sm_100a and sm_121a mean here?+
sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro. is the datacenter Blackwell target used by B200-class tensor-path work. sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. is the exact GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. consumer target the driver reports. We keep those lanes separate because one name does not imply one shared kernel contract.What do tcgen05.alloc, tcgen05.ld, and tcgen05.mma mean in the checked-in bundle?+
tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe. is the earliest tensor-memory allocation probe, tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes. adds the follow-on load path, and tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof. is the deeper matrix-multiply probe further down the gate walk. The shortest public-safe map is compact gate-walk mirror, GB10 gate matrix, and GB10 repro walkthrough.Where does TMA multicast fit in the GB10 serving story?+
If GB10 ships a real FP4 lane here, why do the checked-in tcgen05 probes still fail?+
tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path. probes test the stronger datacenter SM100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane. plus TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable. contract instead, so a real GB10 FP4 deployment lane does not imply proven tcgen05 or TMEM parity with B200 or GB100.Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.
Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.
Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.
Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.
Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.
The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.
The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.
The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.
Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.
NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.
NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.
FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.
DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.
The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.
A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.