Liger FLCE reduction=none
Why Liger fused linear cross entropy can go wrong on the reduction='none' backward path, why mean stays correct, and how the scaled-mean workaround restores the intended sum contract.

This bug is worth writing up publicly because it is not a vague optimization problem. It is a contract bug with a narrow surface.
The broken lane is reduction='none' on the fused linear-cross-entropy
backward path. The known-good lane is reduction='mean'. The practical
workaround is to keep the kernel on the mean path and scale by the number of
valid targets to recover per-token sum semantics.
That matters because the failure mode is not only a small numerical drift. In the real training path it can surface as corrupted gradients or NaN grad norms.
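The arithmetic behind the workaround can be sketched without any framework. The following is a minimal pure-Python illustration, not Liger code: the helper names `token_nll` and `masked_losses` and the `-100` sentinel are assumptions made for this example. It shows that a mean over valid targets, multiplied by the valid-target count, reproduces the masked sum, while multiplying by the raw token count does not once any label is ignored.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel for masked labels (hypothetical for this sketch)


def token_nll(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def masked_losses(logits_rows, labels):
    """Return (masked_sum, masked_mean, n_valid) over non-ignored tokens."""
    per_token = [
        token_nll(row, y)
        for row, y in zip(logits_rows, labels)
        if y != IGNORE_INDEX
    ]
    n_valid = len(per_token)
    total = sum(per_token)
    return total, total / n_valid, n_valid


# Four tokens, vocabulary of three; the second token is masked out.
logits_rows = [
    [2.0, 0.5, -1.0],
    [0.1, 0.2, 0.3],
    [1.5, -0.5, 0.0],
    [0.0, 0.0, 0.0],
]
labels = [0, IGNORE_INDEX, 2, 1]

masked_sum, masked_mean, n_valid = masked_losses(logits_rows, labels)

# Workaround: take the mean, then scale by the number of valid targets.
scaled_mean = masked_mean * n_valid

# Scaling by the raw token count overshoots whenever a label is ignored.
wrong_scale = masked_mean * len(labels)
```

In this sketch `scaled_mean` matches `masked_sum` up to floating-point rounding, while `wrong_scale` is larger by the factor `len(labels) / n_valid`.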
Why the workaround is worth documenting
The workaround is not pretending the bug is gone. It is documenting a safe lane that preserves the intended training semantics while the broken reduction path is still unstable.
The upstream issue trail adds one useful refinement. The first
reduction='none' reports were mostly about API acceptance and forward shape:
issue #488 reported that the fused loss still returned a scalar, issue
#872 narrowed that to the loss wrapper assertion, and PR #879 merged the
small acceptance fix. Issue #968 showed the real blocker on the backward path
once ignore_index-style masking entered the contract. Draft PR #1126
therefore matters not because it makes the path correct, but because it shows
the safer failure mode: turning a silent bad-update lane into an explicit
failure instead. Open PR #1182 is useful in a different way: its current diff
handles the reduction kwarg explicitly at the wrapper boundary instead of
always inferring it from num_items_in_batch, but it still does not make
unreduced backward mathematically safe.
The checked-in Liger FLCE reduction-none near-copy
keeps the workaround honest: stay on reduction='mean', rescale by the valid
target count, and compare that scalar against explicit masked-sum semantics.
That contract is exact when the effective mask is just (labels != ignore_index). If the caller needs arbitrary per-token weighting or routing,
the workaround becomes an approximation and the unfused loss path is the honest
fallback until the broken backward lane is actually repaired.
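The exactness claim, and where it stops holding, can be written out directly. The notation below is introduced only for this sketch: \ell_i is the per-token loss, m_i the mask derived from ignore_index, N_v the valid-target count, w_i a hypothetical caller-supplied token weight, and c any single scalar applied to the mean loss.

```latex
\begin{align*}
m_i &= \mathbb{1}[\,y_i \neq \texttt{ignore\_index}\,], \qquad N_v = \sum_i m_i \\
L_{\text{mean}}(\theta) &= \frac{1}{N_v} \sum_i m_i \,\ell_i(\theta) \\
N_v \, L_{\text{mean}}(\theta) &= \sum_i m_i \,\ell_i(\theta) = L_{\text{sum}}(\theta) \\
N_v \, \nabla_\theta L_{\text{mean}}(\theta) &= \sum_i m_i \,\nabla_\theta \ell_i(\theta) = \nabla_\theta L_{\text{sum}}(\theta) \\[6pt]
L_w(\theta) &= \sum_i w_i \, m_i \,\ell_i(\theta), \qquad
c \,\nabla_\theta L_{\text{mean}}(\theta) = \frac{c}{N_v} \sum_i m_i \,\nabla_\theta \ell_i(\theta)
\end{align*}
```

The first four lines hold because N_v depends only on the labels, so it is a constant with respect to the parameters. The last line shows the limit: a single scalar c can match the weighted gradient for arbitrary per-token gradients only when w_i = c / N_v for every valid token, that is, when the weights are uniform. Any non-uniform weighting needs the per-token vector, which the scalar lane does not carry.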
Frequently asked questions
Why use mean and rescale instead of insisting on reduction='none'?
… batch * sequence factor, restores the intended masked-sum training semantics without stepping back into the unstable reduction='none' backward path.
Why call this a scalar-backward contract?
mean and sum hand the fused backward path one upstream scalar. none hands it a token-length vector, so a chunked FLCE backward has to route the matching token slice into each local chunk before applying the gradient. The checked-in Liger FLCE contract example therefore stays on the scalar lane; the checked-in chunked loss sample is the safer local pattern when the caller owns token-local weighting or routing.
What would a native none fix have to own?
… reduction plumbing; the checked-in reduction-none near-copy therefore keeps the public workaround on the scalar lane until that vector ownership is actually repaired.
What if the caller needs token-level weights?
Then the scaled-mean workaround is only an approximation, and the unfused loss path is the honest fallback until the broken backward lane is actually repaired.
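The difference between the scalar lane and the vector lane can be made concrete with a small framework-free sketch. This is not the Liger kernel or its API: `chunked_backward`, `token_grad`, and the `-100` sentinel are hypothetical names chosen for illustration, and the gradient is taken with respect to the logits only. With a scalar upstream gradient every chunk receives the same value; with a per-token upstream vector each chunk has to receive its own slice.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel for masked labels (hypothetical for this sketch)


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def token_grad(row, label, upstream):
    """Gradient of upstream * nll(row, label) w.r.t. the logits; zero if ignored."""
    if label == IGNORE_INDEX:
        return [0.0] * len(row)
    probs = softmax(row)
    return [
        upstream * (p - (1.0 if j == label else 0.0))
        for j, p in enumerate(probs)
    ]


def chunked_backward(logits_rows, labels, upstream, chunk_size):
    """Hypothetical chunked backward over the token dimension.

    `upstream` is either one float (scalar lane: sum, or an already scaled
    mean) or a list with one float per token (vector lane: reduction='none').
    """
    grads = []
    for start in range(0, len(labels), chunk_size):
        end = min(start + chunk_size, len(labels))
        if isinstance(upstream, float):
            local = [upstream] * (end - start)  # same scalar for every chunk
        else:
            local = upstream[start:end]  # route the matching token slice
        chunk = zip(logits_rows[start:end], labels[start:end], local)
        for row, y, g in chunk:
            grads.append(token_grad(row, y, g))
    return grads


logits_rows = [
    [2.0, 0.5, -1.0],
    [0.1, 0.2, 0.3],
    [1.5, -0.5, 0.0],
    [0.0, 0.0, 0.0],
    [-0.3, 0.8, 0.4],
]
labels = [0, IGNORE_INDEX, 2, 1, 0]
token_upstream = [1.0, 1.0, 0.5, 2.0, 0.25]  # per-token weights from the caller

# Unchunked reference versus a chunked pass with correct slice routing.
reference = chunked_backward(logits_rows, labels, token_upstream, chunk_size=len(labels))
chunked = chunked_backward(logits_rows, labels, token_upstream, chunk_size=2)

# Scalar lane: no routing decision exists, so chunking cannot misplace anything.
scalar_lane = chunked_backward(logits_rows, labels, 1.0, chunk_size=2)
```

Here `chunked` agrees with `reference` only because each chunk reads `upstream[start:end]`. That slice bookkeeping is the extra ownership the vector lane requires, and it is exactly what the scalar lane avoids.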