The MegaCpp precision recipe: FP16, BF16, FP8 and NVFP4 in one stack
How MegaCpp picks a numerical format per op, per device, and per phase: FP16 only as a floor, BF16 as the steady state, FP8 in selected GEMMs, and NVFP4 for Blackwell inference.

We use four numerical tiers in the MegaCpp stack and we use them deliberately. BF16 is the steady state for training compute. FP16 exists mainly as a fallback on older or development hardware. FP8 is opt-in on a curated set of GEMMs. NVFP4 is the inference target on Blackwell. The recipe is not a global flag. It is a per-op, per-device, per-phase contract.
Why the generation boundary matters
The public architectural line is simple. Hopper-class H200 systems belong to the FP8, BF16, and FP16 training story. Blackwell adds NVFP4. If an article collapses those into one universal statement, it becomes misleading fast.
That is why the recipe is asymmetric. Training formats are chosen around numerical stability and kernel support on the current training device. Inference formats are chosen around the deployment device.
The four tiers
| Tier | Main role | Safe public framing |
|---|---|---|
| FP16 | fallback | useful on older or development hardware |
| BF16 | training default | the steady-state training format on modern hardware |
| FP8 | selective acceleration | opt-in on large GEMM-heavy surfaces when the hardware and kernels support it |
| NVFP4 | low-precision serving | a Blackwell-era inference format, not a Hopper training format |
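A minimal sketch of what a per-op, per-device, per-phase contract could look like, assuming a plain lookup function. The names `Tier`, `pick_tier`, `FP8_OPT_IN_OPS`, the op names, and the device labels are hypothetical and are not MegaCpp's actual API. Blackwell training falling back to BF16 is also an assumption, since the article only places NVFP4 on the inference side.

```python
from enum import Enum


class Tier(Enum):
    FP16 = "fp16"
    BF16 = "bf16"
    FP8 = "fp8"
    NVFP4 = "nvfp4"


# Hypothetical opt-in set; the real curated GEMM list is not given in the article.
FP8_OPT_IN_OPS = frozenset({"mlp_up_proj", "mlp_down_proj"})


def pick_tier(op: str, device: str, phase: str) -> Tier:
    """Resolve a tier from (op, device, phase). Illustrative only."""
    if phase not in ("training", "inference"):
        raise ValueError(f"unknown phase: {phase!r}")
    # FP16 is only the fallback on older or development hardware.
    base = Tier.FP16 if device == "legacy" else Tier.BF16
    if phase == "inference":
        # NVFP4 is a Blackwell inference format; other devices keep their base tier.
        return Tier.NVFP4 if device == "blackwell" else base
    # Training: FP8 only for opted-in ops on Hopper, otherwise the base tier.
    if device == "hopper" and op in FP8_OPT_IN_OPS:
        return Tier.FP8
    return base
```

The point of the shape is that no single argument decides the answer: the same op resolves to different tiers on different devices and in different phases.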
BF16 is the default training floor
BF16 is the training default because it keeps the main training path simple and stable. That includes the optimizer master-precision story: optimizer-state precision should be discussed separately from the model's compute precision.
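One commonly cited reason for BF16's stability, which the article does not spell out, is dynamic range: BF16 keeps the 8-bit exponent of FP32, while FP16 has a 5-bit exponent. The sketch below illustrates that with the standard library only. `to_bf16`, `to_fp16`, and `fits_fp16` are hypothetical helper names, and the BF16 conversion is a software emulation for finite inputs, not a hardware path.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite value to bfloat16 (round-to-nearest-even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then drop the low 16 bits of the float32 pattern.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_fp16(x: float) -> float:
    """Round a value to IEEE half precision via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_fp16(x: float) -> bool:
    """True if x is representable in FP16 without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


large = 70000.0   # above the FP16 maximum of 65504
small = 1e-8      # below the smallest FP16 subnormal

print(fits_fp16(large), to_bf16(large))
print(to_fp16(small), to_bf16(small))
```

In this emulation the large value overflows FP16 but survives in BF16 with coarser rounding, and the small value flushes to zero in FP16 but stays nonzero in BF16.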
FP8 belongs on selected surfaces, not everywhere
Selective FP8 rollout belongs on layer families where Hopper's FP8 path is publicly supported and where cast overhead does not dominate the kernel. That usually means large projection-heavy GEMMs, not every small or irregular layer in the model.
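As a rough illustration of "cast overhead does not dominate", the sketch below compares GEMM work with the number of elements that would need casting. Everything here is an assumption for illustration: the helper names, the alignment of 16, and the `min_flops_per_cast` threshold are hypothetical, and real eligibility depends on the kernels actually available.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Multiply-add work for an (m x k) @ (k x n) GEMM."""
    return 2 * m * n * k


def cast_elements(m: int, n: int, k: int) -> int:
    """Elements cast to FP8 if both operands are converted on every call."""
    return m * k + k * n


def fp8_eligible(m: int, n: int, k: int, *, align: int = 16,
                 min_flops_per_cast: float = 256.0) -> bool:
    """Hypothetical gate: aligned shapes and enough math per cast element."""
    aligned = all(dim % align == 0 for dim in (m, n, k))
    intensity = gemm_flops(m, n, k) / cast_elements(m, n, k)
    return aligned and intensity >= min_flops_per_cast


print(fp8_eligible(8192, 4096, 4096))  # large projection-style GEMM
print(fp8_eligible(64, 64, 64))        # small GEMM, little math per cast
print(fp8_eligible(8192, 4100, 4096))  # large but not aligned to 16
```

The ratio simplifies to 2mn/(m+n), so it grows with the output dimensions and is independent of k. This model is pessimistic when weights are cast once and reused across calls.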
Checkpoint and recompute precision is a separate surface again. It should be treated independently from both forward GEMM precision and optimizer-state precision.
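One way to keep these surfaces independent is to give each its own field rather than one shared dtype setting. The sketch below assumes a frozen dataclass; `PrecisionSurfaces`, `enable_fp8_gemms`, and the field names are hypothetical, and the `"fp32"` optimizer-state value is a placeholder, not something the article states.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PrecisionSurfaces:
    forward_gemm: str          # execution dtype for whitelisted GEMMs
    optimizer_state: str       # dtype of optimizer state
    checkpoint_recompute: str  # dtype of checkpoint and recompute buffers


def enable_fp8_gemms(cfg: PrecisionSurfaces) -> PrecisionSurfaces:
    """Switch only the forward GEMM surface; every other surface is untouched."""
    return replace(cfg, forward_gemm="fp8")


# "fp32" for optimizer state is a placeholder assumption, not a value from the article.
baseline = PrecisionSurfaces(
    forward_gemm="bf16",
    optimizer_state="fp32",
    checkpoint_recompute="bf16",
)
with_fp8 = enable_fp8_gemms(baseline)
print(with_fp8)
```

Because the config is immutable and each surface is named, turning on FP8 GEMMs cannot silently change what checkpoints or optimizer state are stored in.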
NVFP4 belongs to the Blackwell inference story
NVFP4 is the inference target on Blackwell. The training master remains BF16; conversion to NVFP4 happens at quantization time rather than during Hopper-side training. That separation matters because public NVIDIA documentation places NVFP4 on Blackwell, not Hopper.
The practical takeaway is also narrower than a headline speedup claim. NVFP4 is useful because of footprint and Blackwell-era low-precision inference support, not because one universal speedup number applies across every device shape.
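The footprint side can be estimated with simple arithmetic. The sketch below assumes the layout NVIDIA has publicly described for NVFP4: 4-bit values, one 8-bit scale per 16-element block, and one FP32 scale per tensor. The function names are hypothetical, and the result is a storage estimate only, not a speed claim.

```python
import math


def bf16_bytes(n_elements: int) -> int:
    """Storage for a BF16 tensor: two bytes per element."""
    return 2 * n_elements


def nvfp4_bytes(n_elements: int, block: int = 16) -> int:
    """Estimated NVFP4 storage: 4-bit payload, 1-byte block scales, one FP32 tensor scale."""
    payload = math.ceil(n_elements / 2)      # two 4-bit values per byte
    scales = math.ceil(n_elements / block)   # one 8-bit scale per block
    return payload + scales + 4              # plus a 4-byte per-tensor scale


n = 4096 * 4096  # one square projection weight, as an example shape
print(bf16_bytes(n), nvfp4_bytes(n), bf16_bytes(n) / nvfp4_bytes(n))
```

Under these assumptions each element costs about 4.5 bits, so the weight footprint shrinks by roughly 3.5x relative to BF16 regardless of device shape.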
Practical takeaway
A clean public precision story should say this:
- H200 is a Hopper platform, so the public training story is BF16 plus optional FP8.
- Blackwell introduces NVFP4, so the public inference story can add NVFP4 there.
- Precision policy should stay per-surface rather than collapsing into one global "low precision" label.
Frequently asked questions
Is NVFP4 a training checkpoint dtype in this recipe?
No. NVFP4 is the later export artifact for Blackwell inference. That is why this article keeps the BF16 plus selected FP8 training story separate from the serving-side NVFP4 story.
If selected GEMMs run in FP8, should checkpoint or recompute buffers also move to FP8?
No. FP8 is an execution dtype on selected math surfaces, not a blanket storage policy for long-lived training buffers. Checkpoint and recompute storage stay on the BF16-side training surface, and the lower-precision cast happens only when a whitelisted GEMM is dispatched.
What makes a GEMM eligible for FP8 here?
What does "shape-aligned" mean for FP8?
Should optimizer state follow the selected execution dtype?
Terms used in this article
- NVFP4: NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.
- FP8: Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
- RoPE: Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.
- Transformer Engine: NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.
- Training: A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…
- H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.