MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
Published · 2 min read · David Gornshtein
Tags: Liger, FLCE, Cross Entropy, Hopper

Liger FLCE reduction=none

Why Liger fused linear cross entropy can go wrong on the reduction='none' backward path, why mean stays correct, and how the scaled-mean workaround restores the intended sum contract.

This bug is worth documenting publicly because it is not a vague optimization problem. It is a contract bug with a narrow surface.

The broken lane is reduction='none' on the fused linear-cross-entropy backward path. The known-good lane is reduction='mean'. The practical workaround is to keep the kernel on the mean path and scale by the number of valid targets to recover per-token sum semantics.

That matters because the failure mode is not only a small numerical drift. In the real training path it can surface as corrupted gradients or NaN grad norms.

Why the workaround is worth documenting

The workaround is not pretending the bug is gone. It is documenting a safe lane that preserves the intended training semantics while the broken reduction path is still unstable.

The upstream issue trail adds one useful refinement. The first reduction='none' reports were mostly about API acceptance and forward shape: issue #488 reported that the fused loss still returned a scalar, issue #872 narrowed that to the loss wrapper assertion, and PR #879 merged the small acceptance fix. Issue #968 showed the real blocker on the backward path once ignore_index-style masking entered the contract. Draft PR #1126 therefore matters not because it makes the path correct, but because it shows the safer failure mode: turning a silent bad-update lane into an explicit failure instead. Open PR #1182 is useful in a different way: its current diff handles the reduction kwarg explicitly at the wrapper boundary instead of always inferring it from num_items_in_batch, but it still does not make unreduced backward mathematically safe.

The checked-in Liger FLCE reduction-none near-copy keeps the workaround honest: stay on reduction='mean', rescale by the valid target count, and compare that scalar against explicit masked-sum semantics. That contract is exact when the effective mask is just (labels != ignore_index). If the caller needs arbitrary per-token weighting or routing, the workaround becomes an approximation and the unfused loss path is the honest fallback until the broken backward lane is actually repaired.
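The scalar contract described above can be checked with a small, dependency-free sketch. This is plain Python, not the fused kernel: `mean_over_valid` is a hypothetical stand-in for whatever the reduction='mean' lane returns, `IGNORE_INDEX = -100` is an assumed sentinel, and the sample logits are made up for illustration. Under those assumptions it shows that rescaling by the valid-target count recovers the masked sum, while a blanket batch * sequence factor does not.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel; use whatever the caller's labels use


def token_losses(logits, labels, ignore_index=IGNORE_INDEX):
    """Per-token cross entropy; ignored rows contribute exactly 0.0."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            losses.append(0.0)
            continue
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[label])
    return losses


def count_valid(labels, ignore_index=IGNORE_INDEX):
    return sum(1 for label in labels if label != ignore_index)


def masked_sum(logits, labels):
    """Reference contract: explicit sum over valid tokens only."""
    return sum(token_losses(logits, labels))


def mean_over_valid(logits, labels):
    """Hypothetical stand-in for the scalar returned by the mean lane."""
    n_valid = count_valid(labels)
    return sum(token_losses(logits, labels)) / n_valid if n_valid else 0.0


def scaled_mean(logits, labels):
    """Workaround: mean lane, rescaled by the number of valid targets."""
    return mean_over_valid(logits, labels) * count_valid(labels)


def blanket_scaled(logits, labels):
    """Wrong factor: batch * sequence also counts ignored rows."""
    return mean_over_valid(logits, labels) * len(labels)


# Made-up sample: four tokens, three classes, one ignored label.
LOGITS = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0], [-0.5, 2.5, 0.0]]
LABELS = [0, IGNORE_INDEX, 2, 1]

if __name__ == "__main__":
    print("masked sum    :", masked_sum(LOGITS, LABELS))
    print("scaled mean   :", scaled_mean(LOGITS, LABELS))
    print("blanket factor:", blanket_scaled(LOGITS, LABELS))
```

The comparison only holds because the mask here is exactly (labels != ignore_index); any other weighting changes the reference sum and the equality no longer applies.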

Frequently asked questions

Why use mean and rescale instead of insisting on reduction='none'?
Because the mean path keeps the fused kernel on the known-good scalar contract. Rescaling by the number of valid targets, not by a blanket batch * sequence factor, restores the intended masked-sum training semantics without stepping back into the unstable reduction='none' backward path.
Why call this a scalar-backward contract?
reduction='mean' and reduction='sum' hand the fused backward path one upstream scalar. reduction='none' hands it a token-length vector, so a chunked FLCE backward has to route the matching token slice into each local chunk before applying the gradient. The checked-in Liger FLCE contract example therefore stays on the scalar lane; the checked-in chunked loss sample is the safer local pattern when the caller owns token-local weighting or routing.
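The routing requirement in that answer can be sketched in plain Python. Everything here is illustrative: the function names, the chunk size, and the sample values are assumptions, and `chunked_backward_misrouted` is one hypothetical way a slice could be misaligned, not a claim about what the Liger kernel actually does. The sketch computes the gradient with respect to the logits only, which is enough to show why a scalar upstream is insensitive to slicing while a vector upstream is not.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel


def local_grad(row, label):
    """d(loss_i)/d(logits_i): softmax minus one-hot, zero for ignored rows."""
    if label == IGNORE_INDEX:
        return [0.0] * len(row)
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    grad = [e / z for e in exps]
    grad[label] -= 1.0
    return grad


def reference_backward(logits, labels, upstream):
    """Unchunked reference: token i is scaled by upstream[i]."""
    return [[upstream[i] * g for g in local_grad(row, labels[i])]
            for i, row in enumerate(logits)]


def chunked_backward(logits, labels, upstream, chunk_size):
    """Chunked version that routes the matching upstream slice per chunk."""
    grads = []
    for start in range(0, len(logits), chunk_size):
        end = min(start + chunk_size, len(logits))
        g_slice = upstream[start:end]  # slice aligned with this chunk's tokens
        for offset, i in enumerate(range(start, end)):
            grads.append([g_slice[offset] * g
                          for g in local_grad(logits[i], labels[i])])
    return grads


def chunked_backward_misrouted(logits, labels, upstream, chunk_size):
    """Hypothetical bug: every chunk reuses the first upstream slice."""
    grads = []
    for start in range(0, len(logits), chunk_size):
        end = min(start + chunk_size, len(logits))
        g_slice = upstream[0:end - start]  # wrong for every chunk after the first
        for offset, i in enumerate(range(start, end)):
            grads.append([g_slice[offset] * g
                          for g in local_grad(logits[i], labels[i])])
    return grads


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Made-up sample: five tokens, three classes, one ignored label.
LOGITS = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0],
          [-0.5, 2.5, 0.0], [0.3, -0.2, 0.9]]
LABELS = [0, IGNORE_INDEX, 2, 1, 0]
UPSTREAM_VECTOR = [1.0, 2.0, 0.5, 3.0, 0.25]   # reduction='none' style
UPSTREAM_SCALAR = [0.5] * len(LABELS)          # one scalar broadcast to all tokens
```

With `UPSTREAM_SCALAR`, both chunked variants agree with the reference because every slice holds the same value. With `UPSTREAM_VECTOR`, only the aligned slice does, which is the extra ownership a native reduction='none' backward would have to take on.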
What would a native none fix have to own?
It has to own more than wrapper acceptance. A correct native path needs to align each token's upstream gradient with the chunk that produced that token, including ignored-label rows, before applying the local linear-gradient update. The local upstream PR review is useful here because it separates fail-closed backward behavior from wrapper-level reduction plumbing; the checked-in reduction-none near-copy therefore keeps the public workaround on the scalar lane until that vector ownership is actually repaired.
What if the caller needs token-level weights?
Do not hide that behind the scaled-mean shortcut. The shortcut is exact only for the valid-token mask case above; arbitrary per-token weights should stay in a caller-owned unreduced path such as the checked-in chunked fused linear cross-entropy sample, then apply weighting outside the fused Liger backward lane until native FLCE supports that gradient contract.
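A short pure-Python sketch makes the limit of the shortcut concrete. The names `weighted_loss` and `scaled_mean_shortcut` are hypothetical, the weights and logits are made up, and the per-token losses are computed in plain Python as a stand-in for a caller-owned unreduced path. Under those assumptions, the shortcut matches the weighted loss only when every valid token has weight 1.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel


def per_token_loss(logits, labels):
    """Unreduced cross entropy; ignored rows contribute 0.0."""
    out = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            out.append(0.0)
            continue
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(log_z - row[label])
    return out


def weighted_loss(logits, labels, weights):
    """Caller-owned path: weight the unreduced losses outside the fused lane."""
    return sum(w * l for w, l in zip(weights, per_token_loss(logits, labels)))


def scaled_mean_shortcut(logits, labels):
    """Scaled-mean workaround: mean over valid tokens times the valid count."""
    losses = per_token_loss(logits, labels)
    n_valid = sum(1 for label in labels if label != IGNORE_INDEX)
    return (sum(losses) / n_valid) * n_valid if n_valid else 0.0


# Made-up sample: four tokens, three classes, one ignored label.
LOGITS = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0], [-0.5, 2.5, 0.0]]
LABELS = [0, IGNORE_INDEX, 2, 1]
UNIT_WEIGHTS = [1.0, 1.0, 1.0, 1.0]     # equivalent to the plain valid-token mask
TOKEN_WEIGHTS = [0.5, 1.0, 2.0, 1.0]    # arbitrary per-token weighting

if __name__ == "__main__":
    print("shortcut        :", scaled_mean_shortcut(LOGITS, LABELS))
    print("unit weights    :", weighted_loss(LOGITS, LABELS, UNIT_WEIGHTS))
    print("arbitrary weight:", weighted_loss(LOGITS, LABELS, TOKEN_WEIGHTS))
```

The first two printed values agree up to floating-point rounding; the third does not, so a caller with non-uniform weights cannot recover its loss from the scalar the mean lane returns.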