Published · 2 min read · David Gornshtein
Tags: BF16, FP16, FP8, NVFP4, Mixed Precision, Training, Inference

The MegaCpp precision recipe: FP16, BF16, FP8 and NVFP4 in one stack

How MegaCpp picks a numerical format per op, per device, and per phase: FP16 only as a floor, BF16 as the steady state, FP8 in selected GEMMs, and NVFP4 for Blackwell inference.

We use four numerical tiers in the MegaCpp stack and we use them deliberately. BF16 is the steady state for training compute. FP16 exists mainly as a fallback on older or development hardware. FP8 is opt-in on a curated set of GEMMs. NVFP4 is the inference target on Blackwell. The recipe is not a global flag. It is a per-op, per-device, per-phase contract.

Why the generation boundary matters

The public architectural line is simple. Hopper-class H200 systems belong to the FP8, BF16, and FP16 training story. Blackwell adds NVFP4. If an article collapses those into one universal statement, it becomes misleading fast.

That is why the recipe is asymmetric. Training formats are chosen around numerical stability and kernel support on the current training device. Inference formats are chosen around the deployment device.
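The asymmetry can be written down as a small lookup keyed by phase and device. The following is a minimal sketch, not the MegaCpp implementation: the names `Phase`, `Device`, and `default_format` are hypothetical, and only the combinations this article actually describes are filled in. Anything else raises instead of guessing.

```python
from enum import Enum


class Phase(Enum):
    TRAINING = "training"
    INFERENCE = "inference"


class Device(Enum):
    LEGACY = "legacy"        # older or development hardware
    HOPPER = "hopper"        # for example H200
    BLACKWELL = "blackwell"


# Hypothetical table: only combinations named in the article are listed.
_DEFAULTS = {
    (Phase.TRAINING, Device.LEGACY): "fp16",
    (Phase.TRAINING, Device.HOPPER): "bf16",   # FP8 is a per-GEMM opt-in on top
    (Phase.INFERENCE, Device.BLACKWELL): "nvfp4",
}


def default_format(phase: Phase, device: Device) -> str:
    """Return the default numerical format for a (phase, device) pair."""
    try:
        return _DEFAULTS[(phase, device)]
    except KeyError:
        raise ValueError(
            f"no documented default for {phase.value} on {device.value}"
        ) from None


print(default_format(Phase.TRAINING, Device.HOPPER))
print(default_format(Phase.INFERENCE, Device.BLACKWELL))
```

Raising on undocumented pairs keeps the table from turning into the "one universal statement" the section warns against.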

The four tiers

| Tier | Main role | Safe public framing |
| --- | --- | --- |
| FP16 | fallback | useful on older or development hardware |
| BF16 | training default | the steady-state training format on modern hardware |
| FP8 | selective acceleration | opt-in on large GEMM-heavy surfaces when the hardware and kernels support it |
| NVFP4 | low-precision serving | a Blackwell-era inference format, not a Hopper training format |

BF16 is the default training format

BF16 is the training default because it keeps the main training path simple and stable. That includes the optimizer master-precision story: optimizer-state precision should be discussed separately from the model's compute precision.
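One concrete reason BF16 tends to be the stable choice is dynamic range: BF16 keeps the 8 exponent bits of FP32, while FP16 has 5. The sketch below illustrates that general property with the Python standard library only. It is not taken from the MegaCpp code, and `to_fp16` and `to_bf16` are hypothetical helper names that emulate rounding on the CPU.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; values past the FP16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Round x to BF16 by keeping the top 16 bits of its float32 encoding.

    Uses round-to-nearest-even. NaN and values near the float32 limit are
    not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (70000.0, 1e-8, 1.001):
    print(value, to_fp16(value), to_bf16(value))
```

Large and tiny magnitudes survive in BF16 but overflow or flush to zero in FP16, while FP16 resolves values near 1.0 more finely. That trade is why FP16 paths usually need extra range management and BF16 paths usually do not.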

FP8 belongs on selected surfaces, not everywhere

Selective FP8 rollout belongs on layer families where Hopper's FP8 path is publicly supported and where cast overhead does not dominate the kernel. That usually means large projection-heavy GEMMs, not every small or irregular layer in the model.
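A back-of-the-envelope way to see why size matters: a GEMM of shape (M, K) x (K, N) performs about 2·M·N·K floating-point operations, while casting both inputs touches M·K + K·N elements. The sketch below computes that ratio as a rough proxy. It is an illustration under simplifying assumptions (it ignores scale updates, memory bandwidth, and kernel launch cost), not a measured cost model, and the function names are hypothetical.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Multiply-add count for an (m, k) x (k, n) matrix multiply."""
    return 2 * m * n * k


def cast_elements(m: int, n: int, k: int) -> int:
    """Elements that must be cast when both GEMM inputs go to FP8."""
    return m * k + k * n


def flops_per_cast_element(m: int, n: int, k: int) -> float:
    """Rough proxy: how much math each cast element 'buys'."""
    return gemm_flops(m, n, k) / cast_elements(m, n, k)


for shape in ((4096, 4096, 4096), (64, 64, 64)):
    print(shape, flops_per_cast_element(*shape))
```

The ratio simplifies to 2·M·N / (M + N), so it grows with the output dimensions: a large projection amortizes each cast over far more math than a small layer does.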

Checkpoint and recompute precision is a separate surface again. It should be treated independently from both forward GEMM precision and optimizer-state precision.
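One way to keep these surfaces independent is to give each its own field in the policy object, so that opting a GEMM in to FP8 cannot silently change storage dtypes. This is a minimal sketch with hypothetical names (`PrecisionPolicy`, `enable_fp8`, and the layer name `mlp.up_proj`). The `fp32` optimizer-state value is an assumption for illustration; the article does not name that dtype.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PrecisionPolicy:
    forward_gemm: str           # default execution dtype for GEMMs
    fp8_gemms: frozenset        # names of GEMMs opted in to FP8
    checkpoint_recompute: str   # storage dtype for checkpoint/recompute buffers
    optimizer_state: str        # storage dtype for optimizer state


def enable_fp8(policy: PrecisionPolicy, gemm_names) -> PrecisionPolicy:
    """Opt more GEMMs in to FP8 without touching any other surface."""
    return replace(policy, fp8_gemms=policy.fp8_gemms | frozenset(gemm_names))


base = PrecisionPolicy(
    forward_gemm="bf16",
    fp8_gemms=frozenset(),
    checkpoint_recompute="bf16",
    optimizer_state="fp32",  # assumption: not specified in the article
)
tuned = enable_fp8(base, ["mlp.up_proj"])  # hypothetical layer name
print(tuned)
```

Because the dataclass is frozen and `enable_fp8` only replaces one field, the checkpoint and optimizer surfaces can change only through an explicit, separate edit.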

NVFP4 belongs to the Blackwell inference story

NVFP4 is the inference target on Blackwell. The training master remains BF16; conversion to NVFP4 happens at quantization time rather than during Hopper-side training. That separation matters because public NVIDIA documentation places NVFP4 on Blackwell, not Hopper.
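To make "conversion at quantization time" concrete, here is a simplified CPU sketch of block-scaled 4-bit quantization in the style of NVFP4. It assumes the publicly described layout: E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with one scale per 16-element block. It is not the MegaCpp export path. The real format stores each block scale in FP8 and adds a per-tensor scale, whereas this sketch keeps the scale as a plain Python float and breaks rounding ties toward the smaller magnitude.

```python
import math

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable magnitudes
BLOCK = 16                                             # elements per scale


def quantize_block(block):
    """Map one block of floats to (scale, E2M1-valued codes)."""
    amax = max(abs(v) for v in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        target = abs(v) / scale
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))  # nearest grid point
        codes.append(math.copysign(mag, v))
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a quantized block."""
    return [c * scale for c in codes]


weights = [12.0, -1.0, 0.3, 5.0] + [0.0] * (BLOCK - 4)
scale, codes = quantize_block(weights)
print(scale, codes[:4], dequantize_block(scale, codes)[:4])
```

The BF16 master weights are the input to a step like this, and only the exported artifact carries the 4-bit codes and scales.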

The practical takeaway is also narrower than a headline speedup claim. NVFP4 is useful because of footprint and Blackwell-era low-precision inference support, not because one universal speedup number applies across every device shape.
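The footprint side of that claim is simple arithmetic. The sketch below assumes the publicly described NVFP4 layout of 4-bit values plus one 1-byte scale per 16 weights, and ignores the small per-tensor scale. The function names and the one-billion-weight figure are illustrative, not numbers from the MegaCpp stack.

```python
def bf16_bytes(num_weights: int) -> int:
    """Weight storage at 2 bytes per value."""
    return 2 * num_weights


def nvfp4_bytes(num_weights: int, block: int = 16) -> int:
    """Weight storage with packed 4-bit values and one 1-byte scale per block."""
    packed = (num_weights + 1) // 2               # two 4-bit values per byte
    scales = (num_weights + block - 1) // block   # one scale byte per block
    return packed + scales


n = 1_000_000_000  # illustrative model size
ratio = bf16_bytes(n) / nvfp4_bytes(n)
print(bf16_bytes(n), nvfp4_bytes(n), round(ratio, 2))
```

Under these assumptions the effective cost is 4.5 bits per weight, so the saving over BF16 is somewhat less than the 4x that a bare "16 bits to 4 bits" comparison suggests. Whether that footprint turns into a speedup depends on the device and shapes involved.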

Practical takeaway

A clean public precision story should say this:

  1. H200 is a Hopper platform, so the public training story is BF16 plus optional FP8.
  2. Blackwell introduces NVFP4, so the public inference story can add NVFP4 there.
  3. Precision policy should stay per-surface rather than collapsing into one global "low precision" label.

Frequently asked questions

Is NVFP4 a training checkpoint dtype in this recipe?
No. The training checkpoint stays on the training-side policy surface, while NVFP4 is the later export artifact for Blackwell inference. That is why this article keeps the BF16 plus selected FP8 training story separate from the serving-side NVFP4 story.
If selected GEMMs run in FP8, should checkpoint or recompute buffers also move to FP8?
No. In this recipe, FP8 is an execution dtype on selected math surfaces, not a blanket storage policy for long-lived training buffers. Checkpoint and recompute storage stay on the BF16-side training surface, and the lower-precision cast happens only when a whitelisted GEMM is dispatched.
What makes a GEMM eligible for FP8 here?
Eligibility is a whitelist decision, not just "matrix multiply." The useful FP8 targets are large, shape-aligned projection-style GEMMs where Tensor Core throughput can pay back casting and scale-management overhead. Normalization, residual, RoPE, checkpoint, and recompute surfaces stay out of that fast path because they are either numerically sensitive or storage-oriented rather than clean GEMM execution surfaces. The rollout details live in "FP8 in the training stack."
What does "shape-aligned" mean for FP8?
For the public NVIDIA Transformer Engine path, FP8 linear support is documented for tensors whose dimensions are divisible by 16. In this recipe, awkward or small shapes stay on the BF16 path unless the cast, scale update, and Tensor Core work are all likely to pay back the extra routing cost.
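The whitelist and alignment rules from the two answers above can be combined into one eligibility check. This is a minimal sketch, not the MegaCpp dispatcher: the layer names in `FP8_WHITELIST` and the `MIN_DIM` size threshold are hypothetical placeholders. The divisible-by-16 rule is the only constraint taken from the text.

```python
FP8_WHITELIST = frozenset({      # hypothetical layer names
    "attn.qkv_proj",
    "attn.out_proj",
    "mlp.up_proj",
    "mlp.down_proj",
})
ALIGNMENT = 16   # dimensions must be divisible by 16 (from the text above)
MIN_DIM = 1024   # hypothetical cutoff for "large enough to pay back the cast"


def fp8_eligible(name: str, m: int, n: int, k: int) -> bool:
    """Return True if this GEMM should take the FP8 path, else stay on BF16."""
    if name not in FP8_WHITELIST:
        return False                       # whitelist first, not "any matmul"
    if any(dim % ALIGNMENT for dim in (m, n, k)):
        return False                       # awkward shapes stay on BF16
    return min(m, n, k) >= MIN_DIM         # small shapes stay on BF16


print(fp8_eligible("mlp.up_proj", 8192, 4096, 4096))
print(fp8_eligible("mlp.up_proj", 8192, 4100, 4096))
print(fp8_eligible("norm.rms", 8192, 4096, 4096))
```

Checking the whitelist before the shape keeps normalization, residual, and RoPE surfaces off the fast path even when their shapes happen to be aligned.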
Should optimizer state follow the selected execution dtype?
No. Optimizer moments and master-update math are a stability surface, not a kernel-selection surface. FP8 can be an execution dtype for a whitelisted GEMM, and NVFP4 can be an inference export format, but neither one should turn optimizer state, checkpoint storage, or recompute buffers into low-precision storage by default.

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NVFP4

NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.