Megatron FLCE on Hopper
Why Hopper-ready fused linear cross entropy is an output-layer contract as much as a kernel choice, and why shape-compatible alternatives are not enough.

It is tempting to describe fused linear cross entropy as a kernel detail. That is too shallow.
On the Hopper path, FLCE is also an output-layer contract. The model path has to present the fused loss surface the runtime expects. A plain column-parallel layer may be shape-compatible and still fail to preserve the intended fused loss path.
That is why this public sample keeps the comparison narrow: one lane exposes a plain output layer, the other exposes a fused linear-plus-cross-entropy path that is actually aligned with the Hopper runtime contract.
The checked-in proof surface is intentionally small: the Megatron Hopper FLCE near-copy compares the plain path against the fused path without smuggling in the rest of the Megatron runtime.
Why this matters beyond one kernel
The real engineering lesson is that parity checks have to happen at the loss boundary, not just the tensor-shape boundary. Once a stack uses fused output and loss handling, a "close enough" output module is not close enough.
One distributed constraint is worth naming plainly. Megatron's
tensor-parallel cross-entropy path masks targets outside a rank's local
vocabulary interval and then reconciles per-rank results with cross-rank MAX
and SUM reductions. That means local tensor-shape parity still is not enough
if the output boundary no longer preserves the distributed loss contract.
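As a rough illustration of the reduction pattern described above, the following single-process sketch simulates the ranks with plain Python lists. It is not Megatron's code: the function names, the even vocabulary split, and the one-token scope are all assumptions made for the example, and the all-reduces are stood in for by `max()` and `sum()` over per-rank values.

```python
import math


def reference_ce(logits, target):
    """Cross entropy for one token over the full, unsharded vocabulary."""
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[target]


def shard_bounds(vocab, tp_size):
    """Contiguous [start, end) vocabulary interval per simulated rank (assumed even split)."""
    assert vocab % tp_size == 0, "sketch assumes an evenly divisible vocabulary"
    per_rank = vocab // tp_size
    return [(r * per_rank, (r + 1) * per_rank) for r in range(tp_size)]


def vocab_parallel_ce(logits, target, tp_size):
    """Simulated tensor-parallel cross entropy for one token."""
    shards = shard_bounds(len(logits), tp_size)

    # MAX reduction: every rank contributes its local maximum.
    global_max = max(max(logits[s:e]) for s, e in shards)

    # Target masking + SUM reduction: ranks that do not own the target
    # contribute 0, so the sum recovers the single shifted target logit.
    target_logit = sum(
        (logits[target] - global_max) if s <= target < e else 0.0
        for s, e in shards
    )

    # SUM reduction: every rank contributes its local softmax denominator.
    denom = sum(
        sum(math.exp(x - global_max) for x in logits[s:e]) for s, e in shards
    )
    return math.log(denom) - target_logit


def local_only_ce(logits, target, tp_size):
    """Hypothetical broken path: the owning rank normalizes over its shard only."""
    for s, e in shard_bounds(len(logits), tp_size):
        if s <= target < e:
            return reference_ce(logits[s:e], target - s)
    raise ValueError("target outside vocabulary")
```

With four simulated ranks, `vocab_parallel_ce` agrees with `reference_ce`, while `local_only_ce` returns a smaller loss for the same input because its denominator covers only one shard. Every tensor in the broken path has a plausible shape, which is why the comparison has to be made on the loss value.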
The memory receipt has the same boundary problem. A fused lane should keep the
vocabulary sweep inside the loss boundary, carrying only chunk-local logits plus
the running maximum and denominator state needed for softmax normalization. If a
supposed Hopper-ready path first materializes the full [B, S, V] logit slab,
it may still match the scalar loss while proving a different memory class.
That is the same design pressure visible in the Mamba CE parity work. The bug surface is small, but the consequence is broad because the output path sits on every training step.
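The memory-class distinction above can be sketched for a single token. This is an illustrative pure-Python sketch, not the fused kernel: `naive_loss`, `chunked_loss`, and the `peak` counter are hypothetical names introduced here, and a real lane would sweep over a `[B, S, V]` problem on the GPU. The point is only that both paths return the same scalar while holding very different numbers of logits at once.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_loss(hidden, weight, target):
    """Materialize all V logits, then compute the loss. Returns (loss, peak live logits)."""
    logits = [dot(hidden, row) for row in weight]
    m = max(logits)
    loss = m + math.log(sum(math.exp(x - m) for x in logits)) - logits[target]
    return loss, len(logits)


def chunked_loss(hidden, weight, target, chunk):
    """Sweep the vocabulary in chunks, carrying only a running max and denominator."""
    running_max = float("-inf")
    running_denom = 0.0
    target_logit = None
    peak = 0
    for start in range(0, len(weight), chunk):
        # Only this chunk's logits are alive at any point in the sweep.
        logits = [dot(hidden, row) for row in weight[start:start + chunk]]
        peak = max(peak, len(logits))

        # Rescale the old denominator to the new maximum before adding the chunk.
        new_max = max(running_max, max(logits))
        running_denom = running_denom * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in logits
        )
        running_max = new_max

        # Capture the target logit when its chunk passes by.
        if start <= target < start + len(logits):
            target_logit = logits[target - start]

    loss = running_max + math.log(running_denom) - target_logit
    return loss, peak
```

A receipt that reports only the scalar cannot tell these two functions apart. One that also records the peak count of live logits can, since the chunked path never holds more than `chunk` of them.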
Frequently asked questions
Why is tensor-shape parity still not enough on tensor-parallel FLCE?
What should a useful FLCE parity receipt name?
When should the receipt prove that no full-logit fallback was used?
Do padded vocabulary rows change what the receipt should prove?
What should the Hopper-ready claim avoid overstating?
What Hopper-specific detail belongs in the receipt?