nvfp4 · blackwell · gb10 · inference · quantization · cutlass

NVFP4 Inference for the MegaCpp SLM Ensemble

Why we train in FP16/BF16 and ship in NVFP4, what Blackwell and GB10 actually give us, and which kernels survive the trip from B200 to DGX Spark.

11 min read · David Gornshtein
The MegaCpp SLM ensemble is trained in FP16/BF16 on H200 and B200, and served in NVFP4 on Blackwell-class hardware. This post describes the inference path: how we move from a BF16 master checkpoint to an NVFP4 deployment artifact, which Blackwell features the kernels actually use, and where the GB10 (DGX Spark, sm_121a) silicon forces the design to diverge from B200 (sm_100a). It is written for engineers who already know what mma.sync is and want to know what to build, not what to read.

The short version: train in FP16/BF16, quantize weights to NVFP4 with a per-block (E4M3) scale and a per-tensor FP32 amax, run GEMMs through mma.sync.kind::mxf4nvf4.block_scale on B200 and through the warp-level OMMA path on GB10, keep the SSM and elementwise tail in BF16, and stop trying to put FP4 anywhere it does not pay back the quantize overhead. Everything below is the work behind that sentence.

Why NVFP4 and not FP8 for inference

FP8 looked attractive for training, and we spent a session proving that it is not, at least not for the NAM56R hybrid (27 Mamba3 + 13 attention + 12 MoE). The numbers were unambiguous: the BF16 baseline at MBS=6, GBS=48 ran at 158 TFLOP/s and 16 percent MFU on H200x8; the same configuration with the TE FP8 hybrid recipe also ran at 158 TFLOP/s, a zero-percent speedup. Smaller batches were marginally worse. The reason is structural: GEMMs are only 23.5 percent of compute on this model, with SSM at 27.5 percent and elementwise at 14.7 percent, and TE pays an amax quantize/dequantize on every GEMM plus carries a BF16 master copy alongside the FP8 weights. --fp8-param-gather reclaims about 5 GiB but does not move the clock. The full data is in [fp8_optimization_session_2026_04_13.md].

Inference is a different shape. The forward pass has no optimizer state, no master weights, no amax history to maintain across steps; the per-tensor scale is computed once at quantize time and baked in. With NVFP4 the weight footprint drops by roughly 4x from BF16, which on bandwidth-bound silicon (and Blackwell inference is bandwidth-bound for anything above batch 1) translates almost directly into tokens per second. The MegaCpp SLMs are small enough that activation memory is not the constraint; weight bandwidth is. NVFP4 wins for the same reason FP8 lost during training: the cost model inverts when there is no backward pass and no optimizer.
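The bandwidth argument can be made concrete with a back-of-envelope roofline: in the weight-bandwidth-bound decode regime, every generated token streams the full weight set once, so the decode ceiling is memory bandwidth divided by weight bytes. A minimal sketch, assuming a hypothetical 4B-parameter SLM and a 0.5-bit-per-weight scale overhead for NVFP4 (both illustrative numbers, not ensemble specifics):

```python
def decode_tokens_per_sec(params_b: float, bits_per_weight: float,
                          mem_bw_gb_s: float) -> float:
    """Upper bound on single-stream decode rate when weight streaming
    dominates: tokens/s = bandwidth / total weight bytes."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gb_s * 1e9 / weight_bytes

# Hypothetical 4B-parameter SLM on GB10 (273 GB/s LPDDR5X):
bf16 = decode_tokens_per_sec(4.0, 16.0, 273)   # ~34 tok/s ceiling
# NVFP4: 4 bits per weight + one E4M3 scale byte per 16 elements = 4.5 bits
nvfp4 = decode_tokens_per_sec(4.0, 4.5, 273)   # ~121 tok/s ceiling
print(round(bf16, 1), round(nvfp4, 1), round(nvfp4 / bf16, 2))
```

The ratio is 16/4.5, about 3.6x rather than a clean 4x, because the block scales themselves cost bandwidth; that is the "roughly 4x" in the text with the scale tax made explicit.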

We also looked at NVFP4 inside the DSA indexer as a research probe and concluded it is not worth porting. The indexer linears are 64 to 512, 3584 to 64, and 3584 to 8 — three of four are below the FP8 cuBLAS crossover, let alone the FP4 crossover, and the inner _compute_index_scores is bandwidth-bound on a 2.15 GB FP32 accumulator that is unaffected by input dtype. FP8 already cost +11.4 percent on the indexer linear forward in our Modal B200 bench. The fix there is to fuse the einsum, ReLU, weighted-sum, and topk into a single kernel (DeepSeek's fp8_index is the reference), not to change the dtype. The full investigation is in [fp8_research_session_2026_04_14.md]. The same logic applies to NVFP4: small GEMMs do not amortise the quantize tax. We use NVFP4 where it earns its keep, which is the bulk MoE expert GEMMs, the attention projections, and the dense FFN linears.

The quantization recipe

The training-to-inference handoff is mechanical and we keep it that way to make checkpoint exchange between H200 training runs and GB10 or B200 serving boxes a one-step transformation. From a BF16 master checkpoint we produce an NVFP4 artifact with three pieces per quantized tensor:

  1. NVFP4 weight blocks (4-bit elements, E2M1, packed two per byte) with a 16-element block size along the K dimension.
  2. A per-block E4M3 scale (one FP8 byte per 16 NVFP4 elements), which is what the kind::mxf4nvf4.block_scale MMA consumes directly.
  3. A per-tensor FP32 amax used to recompose the global scale at load time.
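The two-level scaling behind those three pieces can be sketched in NumPy. This models the recipe (per-block scale mapping the block amax onto the E2M1 maximum of 6, plus a per-tensor FP32 amax); the real artifact additionally re-quantizes the block scales to E4M3 and packs two 4-bit codes per byte, which the sketch omits:

```python
import numpy as np

# The eight representable E2M1 magnitudes (sign bit handled separately).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4(w, block=16):
    """Return (dequantized weights, magnitude indices, per-block scales,
    per-tensor amax) for a simplified round-to-nearest NVFP4 quantizer."""
    w = np.asarray(w, dtype=np.float32)
    amax = float(np.abs(w).max())                 # piece 3: per-tensor FP32 amax
    wb = w.reshape(-1, block)                     # 16-element blocks along K
    bmax = np.abs(wb).max(axis=1, keepdims=True)
    scale = np.where(bmax > 0, bmax / E2M1[-1], 1.0)  # piece 2: per-block scale
    # (the real artifact stores `scale` re-quantized to E4M3, one byte per block)
    q = wb / scale
    mag_idx = np.abs(np.abs(q)[..., None] - E2M1).argmin(axis=-1)  # 3-bit magnitude
    deq = (np.sign(q) * E2M1[mag_idx] * scale).reshape(w.shape)    # + 1 sign bit
    return deq, mag_idx, scale, amax
```

Values that land exactly on the scaled E2M1 grid round-trip losslessly; everything else rounds to the nearest grid point within its block.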

This is the NVIDIA-canonical NVFP4 layout, which matters because it is what the CUTLASS BlockScaledMmaOp and the matching mma.sync family expect with no additional shuffling. We do not invent a custom layout; the cost of being non-canonical is that you lose the vendor kernels and have to maintain your own, which on a moving target like Blackwell is not a fight worth picking. Activations are quantized on the fly to NVFP4 at the GEMM input boundary using a per-tile scale derived from a running calibration; the SSM state and the residual stream stay in BF16.

Layers we do not quantize: the SSM scan kernels (Mamba3 SSD and the M2RNN combined kernels), all LayerNorms and RMSNorms, the router GEMM in MoE (too small and too sensitive), and the LM head. The SSM kernels are not GEMMs in the cuBLAS sense and the available FP8/FP4 SSD variants regress accuracy without a throughput win on the sizes we ship. RMSNorm in BF16 with FP32 accumulator is well below the tensor-core ceiling and there is nothing to gain by moving it.

Blackwell features we actually use

B200 (sm_100a) and GB10 (sm_121a) are marketed under the same Blackwell umbrella but are two different ISAs in practice. The capability matrix in [gb10_sm121_hardware.md] is the source of truth; the inference-relevant subset is:

Both silicon families expose extended mma.sync with kind::f8f6f4, kind::mxf8f6f4.block_scale, and kind::mxf4nvf4.block_scale, provided the compile target carries the a suffix (sm_100a, sm_120a, sm_121a). Both have TMA via cp.async.bulk.tensor in single-CTA form, swizzled ldmatrix/stmatrix, setmaxnreg.inc/.dec warp specialization, and DSMEM. So the "fast NVFP4 GEMM with TMA-loaded operands and block-scaled tensor cores" recipe compiles and runs on both.

What only B200 has: the tcgen05.* family (mma, ld, st, alloc, cp), Tensor Memory (TMEM, 256 KiB/SM), 2-SM UMMA, and TMA multicast. The GB10 die simply does not have the silicon — RT cores and DLSS hardware took the budget that would have gone to TMEM and tcgen05 on the datacenter chip. NVIDIA's own framing is that GB10 tensor cores are "closer to the GeForce Ampere-style MMA model" with FP4 and FP8 bolted on. Practically, this means every CUTLASS or FlashInfer kernel that hard-codes the tcgen05 path will not target GB10 and never will, and we have to maintain a separate kernel selection for the two targets. The dead-end list (FA4, CuTe DSL tcgen05/UMMA, trtllm-gen FMHA, B200 tile configs reused on GB10) is documented in [blackwell_feature_sweep_2026_04_12.md].

The other GB10-specific constraint that drives kernel choice is shared memory. B200 has 228 KiB SMEM per SM and a 232 KiB CUTLASS dynamic budget; GB10 has roughly 128 KiB physical and a 99 KiB CUTLASS dynamic budget (sm120_smem_capacity_bytes = 101376). The B200 default 128x256x256 mainloop tiling overflows GB10 SMEM at compile time, so every kernel we ship for GB10 is re-tiled — typically 128x128x128 or 64x128x128 with a smaller pipeline depth — and the tile selection is part of the deployment artifact, not a runtime decision.
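The re-tiling constraint falls out of simple arithmetic. The model below is a simplification (it ignores epilogue buffers, mbarrier storage, and alignment padding, so the real CUTLASS allocator charges somewhat more), but it already shows why the B200 default tile cannot pipeline under the GB10 budget:

```python
def smem_per_stage(m, n, k, bits=4, block=16):
    """SMEM bytes per pipeline stage for a block-scaled FP4 mainloop tile."""
    a = m * k * bits // 8               # A tile, packed two FP4 values per byte
    b = n * k * bits // 8               # B tile, same packing
    scales = (m * k + n * k) // block   # one E4M3 byte per 16 elements
    return a + b + scales

def max_stages(m, n, k, budget):
    return budget // smem_per_stage(m, n, k)

GB10_BUDGET = 101376        # sm120_smem_capacity_bytes
B200_BUDGET = 232 * 1024    # CUTLASS dynamic budget on sm_100a

# B200 default 128x256x256: 1 stage on GB10 -- a pipelined mainloop needs
# at least 2, so the config cannot be instantiated there.
print(max_stages(128, 256, 256, GB10_BUDGET))   # 1
print(max_stages(128, 256, 256, B200_BUDGET))   # 4
print(max_stages(128, 128, 128, GB10_BUDGET))   # 5
```

Under this model the 128x128x128 re-tile gives GB10 a five-deep pipeline where the B200 tile could not even double-buffer, which is the shape of the trade the deployment artifact bakes in.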

Kernel choices, by layer

The inference kernel mix is selected per layer class. The selection is intentionally narrow because the cost of carrying alternative paths is real and the test matrix grows multiplicatively.

Dense FFN and attention projections (NVFP4 weights, BF16 activations or NVFP4 activations). On B200 we use the CUTLASS BlockScaledMmaOp path with kind::mxf4nvf4.block_scale, persistent scheduler, TMA bulk tensor loads, and tcgen05-coupled accumulation in TMEM. On GB10 we use the same block-scaled MMA but at warp level (the OMMA family — warp-level FP4 MMA is present on sm_121a, only the TMEM-coupled UTCOMMA variant is absent), with TMA loads still in single-CTA form, swizzled SMEM, and the re-tiled mainloop. The two paths share most of the algorithmic structure and diverge at the inner loop class.

MoE expert grouped GEMMs. This is where NVFP4 pays the largest dividend, because the expert weight matrices are the bulk of the model parameters. On B200 the CUTLASS NVFP4 grouped GEMM is the right call. On GB10 we use the same kernel pattern but route through the cuBLAS 13.2 path where it is officially tuned for Spark, falling back to the in-tree CUTLASS kernel at 4.4.2 or later only when grouping shapes fall outside the cuBLAS heuristic. We do not use the TRT-LLM nvfp4_gemm_cutlass MoE path on anything older than TRT-LLM 1.3.0rc2 with CUTLASS 4.4.2 because the combination produced silent numerical corruption — that is on the dead-paths list and we treat it as such.
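The routing policy reduces to a build-time check. In the sketch below the shape boundary is a placeholder of our invention, not the actual cuBLAS heuristic, and plain version tuples stand in for the 1.3.0rc2 / 4.4.2 pins:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GroupedGemmShape:
    m: int
    n: int
    k: int
    groups: int

def select_moe_backend(shape, target="sm_121a",
                       trtllm=(1, 3, 0), cutlass=(4, 4, 2)):
    # Dead path: older TRT-LLM/CUTLASS combos silently corrupt NVFP4 on GB10.
    if trtllm < (1, 3, 0) or cutlass < (4, 4, 2):
        raise RuntimeError("nvfp4_gemm_cutlass pre-1.3.0rc2/4.4.2 is dead-listed")
    if target != "sm_121a":
        return "cutlass_nvfp4_grouped"   # B200: in-tree CUTLASS grouped GEMM
    # Placeholder boundary for the cuBLAS 13.2 Spark-tuned sweet spot:
    if shape.n >= 256 and shape.k >= 256:
        return "cublas_13_2"
    return "cutlass_nvfp4_grouped"       # fallback outside the heuristic

print(select_moe_backend(GroupedGemmShape(64, 4096, 3584, groups=12)))
```

The point of encoding it this way is that the dead-path check fails loudly at build time instead of reproducing the silent-corruption failure mode at serve time.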

Attention. On B200, FlashAttention-class kernels with TMEM and tcgen05 are available and we use them. On GB10, FA4 is silicon-blocked and trtllm-gen FMHA has no SM12x cubins; the working path is CUTLASS PR #3030 (BF16 CpAsync + TMA + FP8 inline-PTX, benchmarked on real GB10) or PyTorch SDPA via the efficient-attention backend. SDPA matches FA2/FA3 throughput on GB10, and the source-build cost of FA2/FA3 buys nothing, so we ship SDPA on GB10 and FlashAttention-on-Blackwell on B200. The MLA path is the same shape — FlashInfer FA2 prefill/decode/MLA all work on GB10; the trtllm-gen MLA path does not.

SSM (Mamba3 SSD, M2RNN, MIMO). BF16 throughout, with fp32 bias/D/dt tensors preserved as fp32. We removed the precautionary non-fp8, non-fp4 guards in author_mamba3_spec.py and m2rnn_spec.py during the FP8 training experiments and confirmed they were never load-bearing — TE's wrap of TELayerNormColumnParallelLinear and TERowParallelLinear handles the fp32 tensors cleanly. The status table in [fp8_path_status.md] records the per-path PASS/FAIL outcomes that informed this choice. For inference we keep the SSM kernels in BF16 because the throughput is set by SSM scan latency, not by the surrounding linears, and quantizing the linears below BF16 inside the SSM block produces no measurable end-to-end win on our serving sizes.

What this means in numbers

Peak ceilings, from the NVIDIA Blackwell whitepaper and the dev-forum confirmations collected in [gb10_sm121_hardware.md]:

  • B200 (sm_100a): 2250 TFLOPS BF16 (FP32 accumulate), 4500 TFLOPS FP8, 9000 TFLOPS FP4, 8 TB/s HBM3e.
  • GB10 DGX Spark (sm_121a): ~100 TFLOPS BF16, ~200 TFLOPS FP8, ~400 TFLOPS FP4 (1 PFLOP sparse spec), 273 GB/s LPDDR5X.
  • RTX 5090 (sm_120a, same family as GB10): 209 TFLOPS BF16, 838 TFLOPS FP8, 1676 TFLOPS FP4, 1792 GB/s GDDR7.

The B200 to GB10 BF16 gap is roughly 22x at the tensor-core ceiling, and the memory-bandwidth gap is about 30x. For inference workloads on small to medium SLMs the bandwidth gap dominates: NVFP4 buys back roughly 4x weight-bandwidth pressure relative to BF16 on the same silicon, which is why a serving box built around GB10 or RTX 5090 is viable for the MegaCpp ensemble at all. The NVIDIA dev-forum post titled "SM121 CUTLASS Kernel Optimization Results" reports 356 TFLOPS sustained NVFP4 on a single GB10 with a tuned MoE grouped GEMM; that is the order-of-magnitude target for our serving kernels and broadly consistent with what we see end-to-end on the ensemble. GB10 is explicitly not a training target — it is bandwidth-bound at 273 GB/s LPDDR5X and our smoke-test runs converge cleanly but slowly. Production training stays on H200 and B200; production inference runs on whichever Blackwell tier matches the SLA.
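The quoted gaps are straight division over the spec table above:

```python
# Peak numbers from the spec table; ratios back the "roughly 22x" and
# "about 30x" figures in the text.
b200_bf16, gb10_bf16 = 2250.0, 100.0   # TFLOPS, BF16 tensor-core ceiling
b200_bw, gb10_bw = 8000.0, 273.0       # GB/s, HBM3e vs LPDDR5X
print(round(b200_bf16 / gb10_bf16, 1))  # 22.5x compute gap
print(round(b200_bw / gb10_bw, 1))      # 29.3x bandwidth gap
```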

Toolchain pins that matter

The Blackwell ecosystem moved fast enough through 2026 Q1 that pins matter more than usual. The current pinned set, with rationale:

CUTLASS 4.4.2 or later — earlier versions have an SM120 SMEM carve-out bug that surfaces as silent miscompilation on GB10. NVIDIA's own quote, from cutlass#3144: "SM120 (RTX 6000, 5090, etc.) and SM121 (Spark) only support 99 KiB smem. There is a fix for some carve-out calculations in SM120 kernels included in 4.4.2 and CUTLASS ToT which TRT-LLM has not updated to on ToT." TRT-LLM bumped to 4.4.2 on 2026-04-09; we follow that pin.

TensorRT-LLM 1.3.0rc2 or later — combined with CUTLASS 4.4.2, this is the first version of the nvfp4_gemm_cutlass MoE path that produces correct results on GB10. Older combinations are on our dead-paths list.

CUDA 13.2 with PTX ISA 8.8 minimum — required for .target sm_121a. PTX 8.7 errors out with "version 8.7 does not support .target sm_121a". Driver 595.45.04 or later.

cuDNN 9.13 or 9.17 — both work for the BF16 paths. NVFP4 mm_fp4 raises cudnnGraphNotSupportedError on SM120, so we do not route NVFP4 through cuDNN; SDPA via the efficient-attention backend remains fast.

CUTLASS Python DSL — the basic BF16/FP16 warp-MMA + TMA + persistent-scheduler pattern from blackwell_geforce/dense_gemm.py works on sm_121a; the BlockScaledMmaOp, MmaF16BF16Op, _S2TCopyBase, and Ld32x32bOp variants reject sm_121a at the admissible_archs allowlist level. We work around this with the MmaSM120BlockScaledOp.__post_init__ patch documented in [gb10_software_stack.md], which lets us reuse the BF16 hot path; for NVFP4 we drop to CUTLASS C++ on sm_121a rather than fight the DSL admissible-archs check.
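The workaround pattern, reduced to its essentials, is to widen the allowlist before the __post_init__ check runs. The class below is an illustrative stand-in, not the real MmaSM120BlockScaledOp; only the patching shape matches what [gb10_software_stack.md] documents:

```python
from dataclasses import dataclass

@dataclass
class BlockScaledOp:
    """Stand-in for a DSL op that rejects archs at construction time."""
    arch: str
    admissible_archs: tuple = ("sm_100a", "sm_120a")

    def __post_init__(self):
        if self.arch not in self.admissible_archs:
            raise ValueError(f"{self.arch} not in admissible archs")

# Patch: extend the allowlist, then run the original check unchanged.
_orig_post_init = BlockScaledOp.__post_init__

def _patched(self):
    self.admissible_archs = tuple(self.admissible_archs) + ("sm_121a",)
    _orig_post_init(self)

BlockScaledOp.__post_init__ = _patched

op = BlockScaledOp("sm_121a")   # now constructs instead of raising
```

Because the original check still runs, genuinely unsupported archs keep failing; the patch only admits the one target the allowlist is wrong about.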

What we are not doing

Three things deserve explicit non-goals, because they keep coming up.

We are not pursuing FP4 inside the DSA indexer. The Modal B200 bench in [fp8_research_session_2026_04_14.md] showed FP8 already costs +11.4 percent on the indexer linear forward at our shapes; FP4 cannot beat that on 64 to 512 GEMMs. The real optimization is the fused einsum kernel, which is dtype-orthogonal.

We are not building FA2 or FA3 from source on GB10. Empirical perf matches SDPA, the build is fragile (CUTLASS 4.3 plus a patched setup.py plus a flash_api.cpp arch-check edit), and the maintenance burden is real. SDPA is the answer on GB10.

We are not maintaining a B200-only NVFP4 path that uses tcgen05 or TMEM exclusively. The kernels we ship are written so the same algorithmic structure runs on both targets, with the inner-loop class (UTCOMMA on sm_100a, OMMA on sm_121a) and the tile sizes selected at build time per target. The deployment artifact carries both kernel sets and the loader picks the right one. The cost is two compile passes; the benefit is that we do not maintain two implementations of the same model.

Where this is going

CUTLASS PR #3030 (SM120 FlashAttention) is unmerged but usable; we will fold it into the GB10 attention path when it lands. The CuTe DSL tcgen05 admissible-archs unblocking has no NVIDIA timeline and we are not waiting. TE has no sm_121 path and the inference stack does not need one. NVFP4 numerical fidelity against the BF16 reference is gated by a separate per-checkpoint eval harness; the harness output is what we trust, not the dtype headline.

References

  • [fp8_optimization_session_2026_04_13.md]
  • [fp8_path_status.md]
  • [fp8_research_session_2026_04_14.md]
  • [blackwell_feature_sweep_2026_04_12.md]
  • [gb10_sm121_hardware.md]
  • [gb10_software_stack.md]
David Gornshtein • Datasunrise OÜ