By David Gornshtein · 2 min read

Megatron FLCE on Hopper

Why Hopper-ready fused linear cross entropy is an output-layer contract as much as a kernel choice, and why shape-compatible alternatives are not enough.


It is tempting to describe fused linear cross entropy as a kernel detail. That is too shallow.

On the Hopper path, FLCE is also an output-layer contract. The model path has to present the fused loss surface the runtime expects. A plain column-parallel layer may be shape-compatible and still fail to preserve the intended fused loss path.

That is why this public sample keeps the comparison narrow: one lane exposes a plain output layer, the other exposes a fused linear-plus-cross-entropy path that is actually aligned with the Hopper runtime contract.

The checked-in proof surface is intentionally small: the Megatron Hopper FLCE near-copy compares the plain path against the fused path without smuggling in the rest of the Megatron runtime.

Why this matters beyond one kernel

The real engineering lesson is that parity checks have to happen at the loss boundary, not just the tensor-shape boundary. Once a stack uses fused output and loss handling, a "close enough" output module is not close enough.

One distributed constraint is worth naming plainly. Megatron's tensor-parallel cross-entropy path masks targets outside a rank's local vocabulary interval and then reconciles per-rank results with cross-rank MAX and SUM reductions. That means local tensor-shape parity is still not enough if the output boundary no longer preserves the distributed loss contract.
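The masking and reduction steps can be sketched in plain Python for a single token. This is an illustration under stated assumptions, not Megatron's implementation: `vocab_parallel_loss` and `reference_loss` are hypothetical names, ranks are simulated as list slices, and ordinary loops stand in for the cross-rank collectives.

```python
import math


def reference_loss(logits, target):
    """Unsharded cross entropy for one token: logsumexp(logits) - logits[target]."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits)) - logits[target]


def vocab_parallel_loss(shards, target):
    """Simulate tensor-parallel cross entropy for one token.

    `shards` holds one logit slice per simulated rank, covering contiguous
    vocabulary intervals in rank order.
    """
    # Stand-in for the cross-rank MAX reduction: global maximum logit.
    global_max = max(max(shard) for shard in shards)

    partial_sums = []
    partial_targets = []
    start = 0
    for shard in shards:
        end = start + len(shard)
        # Local exp-sum over this rank's vocabulary interval.
        partial_sums.append(sum(math.exp(x - global_max) for x in shard))
        # Target masking: only the rank that owns the target contributes
        # its (shifted) logit; every other rank contributes zero.
        owns_target = start <= target < end
        partial_targets.append(
            shard[target - start] - global_max if owns_target else 0.0
        )
        start = end

    # Stand-ins for the cross-rank SUM reductions that make the loss global.
    denominator = sum(partial_sums)
    target_logit = sum(partial_targets)
    return math.log(denominator) - target_logit
```

Dropping either the masking or one of the reductions leaves every local shape unchanged and still changes the loss, which is why the comparison has to be made on the scalar produced at this boundary.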

The memory receipt has the same boundary problem. A fused lane should keep the vocabulary sweep inside the loss boundary, carrying only chunk-local logits plus the running maximum and denominator state needed for softmax normalization. If a supposed Hopper-ready path first materializes the full [B, S, V] logit slab, it may still match the scalar loss while proving a different memory class.
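A minimal sketch of that vocabulary sweep, for one token and in plain Python, looks like the following. The function names and the `chunk` parameter are assumptions made for illustration; a real fused kernel does this per tile on the GPU, but the running-state bookkeeping is the same idea.

```python
import math


def full_logit_loss(hidden, weight_rows, target):
    """Reference lane: materializes all V logits before computing the loss."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in weight_rows]
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits)) - logits[target]


def chunked_loss(hidden, weight_rows, target, chunk=2):
    """Fused-style lane: at most `chunk` logits exist at any time.

    State carried across chunks is only the running maximum, the running
    softmax denominator, and the target logit once it has been seen.
    """
    running_max = -math.inf
    running_denom = 0.0
    target_logit = None
    for start in range(0, len(weight_rows), chunk):
        rows = weight_rows[start:start + chunk]
        # Chunk-local logits: the only slice of the vocabulary in memory.
        logits = [sum(h * w for h, w in zip(hidden, row)) for row in rows]
        new_max = max(running_max, max(logits))
        # Rescale the old denominator to the new maximum, then add this chunk.
        running_denom = running_denom * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in logits
        )
        running_max = new_max
        if start <= target < start + len(rows):
            target_logit = logits[target - start]
    return running_max + math.log(running_denom) - target_logit
```

Both functions return the same scalar, which is exactly the point of the paragraph above: scalar equality cannot tell the two memory classes apart, so the receipt has to say which lane ran.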

That is the same design pressure visible in the Mamba CE parity work. The bug surface is small, but the consequence is broad because the output path sits on every training step.

FAQ

Frequently asked questions

Why is tensor-shape parity still not enough on tensor-parallel FLCE?
Because the distributed loss path owns more than the local fused kernel. A lane can preserve local math and still break the real contract if it drops the partition-local target masking or the cross-rank reductions that make the loss global again.
What should a useful FLCE parity receipt name?
Name the output layer class, the local vocabulary shard shape, where the max and sum reductions happen, and which token count owns the final normalization. The receipt should also name whether logits were kept chunk-local through the fused boundary or materialized as a full vocabulary tensor before loss; that is the difference between checking FLCE behavior and only checking scalar-loss equivalence. Together, these fields keep the receipt focused on the same output-and-loss seam as the checked-in near-copy instead of proving a different full-logit materialization path. The same reduction-boundary discipline is the reason "Liger FLCE reduction none" is a useful adjacent read.
When should the receipt prove that no full-logit fallback was used?
When another consumer still asks the model for raw logits. Policy-logprob capture, top-k validation, or entropy tracking can turn a fused output-and-loss path back into a materialized logits path even when the scalar loss still matches. The receipt should name whether those consumers were disabled, moved behind the fused boundary, or tested in a separate unfused lane. That keeps this training-step proof separate from distillation and RL scoring and evaluation verifier receipts, where candidate scoring is intentional rather than a fused-path leak.
Do padded vocabulary rows change what the receipt should prove?
Yes, but only if the receipt records them explicitly. A Hopper or tensor-parallel lane may carry a padded vocabulary shape while the model still owns a smaller logical vocabulary. The useful receipt names both counts, the local shard shape, and the masking rule that keeps padding or out-of-range targets out of the global loss denominator. That is the same static-shape discipline discussed in "XLA SPMD tokenizer and vocab", but here it is checked at the fused output-and-loss boundary.
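One way to picture such a masking rule is the sketch below. It is an assumption-laden illustration rather than the lane's actual rule: `masked_mean_loss` is a hypothetical name, the `ignore_index=-100` default is borrowed from the common PyTorch convention, and padded columns are assumed to sit at the end of each logit row.

```python
import math


def masked_mean_loss(logit_rows, targets, logical_vocab, ignore_index=-100):
    """Mean cross entropy over tokens whose targets are real vocabulary ids.

    Each row may be wider than `logical_vocab` (padded vocabulary). Padded
    columns never enter the softmax, and ignored or out-of-range targets
    never enter the token count used for normalization.
    """
    total = 0.0
    counted = 0
    for logits, target in zip(logit_rows, targets):
        if target == ignore_index or not (0 <= target < logical_vocab):
            continue  # keep this token out of the loss denominator
        real = logits[:logical_vocab]  # drop padded vocabulary columns
        m = max(real)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in real))
        total += log_sum_exp - real[target]
        counted += 1
    mean = total / counted if counted else 0.0
    return mean, counted
```

Returning the token count alongside the loss makes it easy for a receipt to record which count owned the final normalization.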
What should the Hopper-ready claim avoid overstating?
It should not collapse native-kernel support and fallback behavior into one claim. The public Megatron-LM Hopper FLCE PR frames the Hopper work as extending the Blackwell fused linear cross-entropy path while keeping the same linear-plus-cross-entropy API and vocabulary chunking idea. A useful receipt still names whether the run used that native Hopper path, a soft fallback, or an intentionally unfused comparison lane.
What Hopper-specific detail belongs in the receipt?
Name where the logit tile lives between matrix multiply and cross-entropy. The public PR describes the Hopper path as keeping MMA results in registers and running cross-entropy immediately in the same loop, instead of handing the epilogue a separate tensor-memory staging path. That makes the receipt stronger than a generic "runs on H100" claim: it ties the loss check to the Hopper execution shape that TMA-fed shared-memory tiles and warpgroup MMA are meant to support.
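The fields these answers ask for can be gathered into one record. The sketch below is hypothetical throughout: the class name, field names, and the three execution-path labels are invented here for illustration and do not come from Megatron-LM or the checked-in near-copy.

```python
from dataclasses import dataclass, field

EXECUTION_PATHS = ("native_hopper", "soft_fallback", "unfused")


@dataclass
class FlceParityReceipt:
    """Hypothetical record of what one FLCE parity run actually exercised."""

    output_layer_class: str
    local_vocab_shard_shape: tuple
    logical_vocab_size: int
    padded_vocab_size: int
    max_reduction_site: str
    sum_reduction_site: str
    normalizing_token_count: int
    logits_chunk_local: bool
    execution_path: str
    raw_logit_consumers: list = field(default_factory=list)

    def problems(self):
        """Return a list of reasons the receipt does not support a fused claim."""
        issues = []
        if self.execution_path not in EXECUTION_PATHS:
            issues.append("unknown execution path")
        if self.padded_vocab_size < self.logical_vocab_size:
            issues.append("padded vocabulary smaller than logical vocabulary")
        if self.execution_path == "native_hopper" and not self.logits_chunk_local:
            issues.append("native path claimed but full logits were materialized")
        if self.logits_chunk_local and self.raw_logit_consumers:
            issues.append("raw-logit consumers active on a chunk-local lane")
        if self.normalizing_token_count <= 0:
            issues.append("no tokens in the loss denominator")
        return issues
```

An empty `problems()` list does not prove parity on its own; it only shows that the receipt names the same seam the loss comparison is supposed to test.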
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

XLA SPMD

The explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

XLA

The compiler/runtime layer that lowers frontend tensor programs into executable TPU or accelerator graphs, with shape stability and ownership boundaries as the main operational concerns here.

Tokenizer

A deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

Mamba

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…