MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
By David Gornshtein, 11 min read
Tags: Fire, Dash, Redo, Plasticity, Nam52, NAM56R, Training

FIRE, DASH, and ReDo as one plasticity toolkit

How three separate plasticity ideas fit into one toolkit, what the public samples actually show, and which design choices are worth preserving as the stack evolves.


The plasticity toolkit is interesting because it is not just "we added FIRE." It combines three distinct interventions with different time scales: FIRE for phase boundaries, DASH for periodic directional shrinkage, and ReDo for recycling dormant neurons. Like activation checkpointing policy and Muon on Hopper and Blackwell, it is a training-control surface where timing matters as much as the knob itself. The public sample is small enough to inspect directly, and the public-facing writeups are enough to show the division of labor without overclaiming a one-to-one paper implementation. If you want the shorter operator-facing continuation after this overview, the companion post FIRE, DASH, and ReDo in practice is the direct follow-up.

Many plasticity discussions collapse everything into one magic lever. This toolkit does the opposite. It treats plasticity as a maintenance stack. One tool repairs geometry at a boundary, one tool nudges weights during training, and one tool revives neurons that have gone quiet. That decomposition is the reason the implementation is worth studying. It is also why later integrations should preserve structure and timing, not just names.

The source map is unusually clear once it is presented through public materials. The toolkit sample shows the combined control surface, the companion post FIRE, DASH, and ReDo in practice shows how the same pieces behave once they meet a real lane, and the public references for FIRE, DASH, and ReDo are useful as background context. That is enough to ground the engineering story without leaning on unpublished context or claiming exact external parity where the public sample already says enough.

The toolkit works because the methods are scheduled differently

The first mistake people make with plasticity work is trying to apply every intervention at the same cadence. This toolkit does not do that.

FIRE is a boundary operation. It projects 2D weight matrices toward orthogonality with Newton-Schulz iterations and is meant for moments when the training regime changes, such as context extension or other curriculum transitions. That is a one-shot structural reset.
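To make the boundary operation concrete, here is a minimal plain-Python sketch of a Newton-Schulz projection. It assumes the classic cubic iteration X <- 1.5 X - 0.5 X X^T X applied after Frobenius normalization; the public sample's actual coefficients, step count, and tensor library are not reproduced here, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(w, steps=20):
    """Push a full-rank 2D matrix toward its nearest (semi-)orthogonal factor.

    Assumption: classic cubic Newton-Schulz, not necessarily the variant
    used by the public sample.
    """
    # Frobenius normalization keeps every singular value at or below 1,
    # which is inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in w for v in row))
    x = [[v / fro for v in row] for row in w]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which drifts to 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x


w = [[2.0, 0.5], [0.3, 1.0]]
q = newton_schulz_orthogonalize(w)
gram = matmul(q, transpose(q))  # close to the identity if q is orthogonal
```

The Gram matrix `q @ q^T` is the natural check: the closer it is to the identity, the closer the rewritten matrix is to an isometry.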

DASH is much lighter. It looks at row-wise cosine similarity between weights and gradients and shrinks rows whose updates are tracking their own current direction too closely. That makes it a periodic maintenance action during training rather than a rare phase-change tool.

ReDo is different again. It needs activity diagnostics over time. ReDoDiagnostics attaches hooks, tracks EMA-style activity, and lets recycle_dormant_neurons() reinitialize rows that have effectively dropped out. That means it is not about geometry in the same sense as FIRE, and it is not about directional shrinkage like DASH. It is about waking neurons back up.

| Method | Public code surface | Time scale | Main failure mode it targets |
| FIRE | orthogonalization and target-selection helpers | Phase boundary | Loss of isometry and stale geometry between regimes |
| DASH | directional-shrinkage step | Periodic in-training | Rows over-aligning with their own gradients |
| ReDo | dormant-unit diagnostics and recycle helpers | Periodic with accumulated diagnostics | Dormant MLP neurons |

This separation is the key design win. The toolkit is not three variants of the same knob. It is three maintenance layers aimed at different times and different failure modes.
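One way to picture the three cadences is a small planner that decides, per step, which interventions are due. This is an illustrative sketch only: `plan_interventions`, its parameters, and the interval values are hypothetical and are not taken from the public sample.

```python
def plan_interventions(step, phase_boundaries, dash_every, redo_every):
    """Return the interventions due at `step`, each on its own cadence.

    Hypothetical helper: FIRE fires only at declared phase boundaries,
    DASH on a short maintenance interval, ReDo on a longer diagnostic one.
    """
    plan = []
    if step in phase_boundaries:
        plan.append("fire")
    if dash_every > 0 and step > 0 and step % dash_every == 0:
        plan.append("dash")
    if redo_every > 0 and step > 0 and step % redo_every == 0:
        plan.append("redo")
    return plan


boundaries = {1000}  # e.g. the step where context length is extended
for step in (7, 50, 500, 1000):
    print(step, plan_interventions(step, boundaries, dash_every=50, redo_every=500))
```

Keeping the three conditions independent is the point: a shared interval would force a structural reset, a shrink step, and a dormancy repair to run on the same schedule.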

What the FIRE implementation really adds in practice

The public sample makes a strong point: the theory is nice, but a working training system still has to solve multiple engineering problems that short method summaries do not really discuss. The most important one is optimizer state staleness.

After FIRE rewrites a weight matrix, the optimizer's stored state still describes the old basis. If you keep Adam-style exp_avg and exp_avg_sq, or Muon momentum buffers, the next updates can partially undo the re-orthogonalization. The public sample addresses that with selective optimizer-state reset, not a global wipe. reset_optimizer_states_for_fired_params() clears state only for parameters that were actually touched.

That is not cosmetic. It is the difference between a clean intervention and a self-canceling one.

The second half of that contract is lazy re-initialization. Clearing the matching slots is useful precisely because the next optimizer step rebuilds fresh statistics for the rewritten local shard instead of trying to reinterpret old momentum in a new basis. That is why the reset helper belongs to the FIRE surface itself rather than to some later cleanup pass.
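The reset-then-lazy-rebuild contract can be sketched with a toy optimizer whose state is a plain dictionary. Everything here is a stand-in: the scalar "parameters", `momentum_step`, and `reset_state_for_touched` are hypothetical names used for illustration, not the sample's `reset_optimizer_states_for_fired_params()` implementation.

```python
def momentum_step(params, grads, state, lr=0.1, beta=0.9):
    """Toy momentum update over scalar parameters keyed by name."""
    for name, grad in grads.items():
        # Lazy creation: a missing slot is rebuilt from scratch on first use.
        slot = state.setdefault(name, {"momentum": 0.0, "steps": 0})
        slot["momentum"] = beta * slot["momentum"] + grad
        slot["steps"] += 1
        params[name] -= lr * slot["momentum"]


def reset_state_for_touched(state, touched):
    """Drop state only for rewritten parameters; return the names cleared."""
    return [name for name in touched if state.pop(name, None) is not None]


params = {"attn.q_proj": 1.0, "mlp.up_proj": 1.0}
grads = {"attn.q_proj": 0.5, "mlp.up_proj": 0.5}
state = {}

momentum_step(params, grads, state)
momentum_step(params, grads, state)

# Pretend FIRE just rewrote attn.q_proj: its old momentum is now stale.
cleared = reset_state_for_touched(state, ["attn.q_proj"])

# The next step rebuilds fresh statistics for the touched parameter only.
momentum_step(params, grads, state)
```

After the final step the touched parameter has a one-step-old slot while the untouched one keeps its full history, which is the selective behavior the paragraph above describes.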

The second important addition is DTensor safety. The public sample explicitly makes DASH and FIRE safe under FSDP2-style sharding. Helpers such as _local_tensor_if_dtensor() and _match_grad_to_local_shard() exist because the real parameter seen by the optimizer may be a shard, not a monolithic tensor. A notebook implementation can ignore that. A training system cannot.

The operational bar is stricter than "supports DTensor". Interventions need to write back into the local shard in place and make reset or recycle decisions from shard-symmetric statistics, not from whichever rank happened to observe a spike first. Otherwise one rank mutates a parameter family while another still thinks the old state is live, which is exactly the distributed skew this toolkit is supposed to prevent.

The same shard discipline matters again on the diagnostic side. ReDo thresholds only mean the same thing on every worker if the activity statistic is reduced to a layer-wide view before any rank decides a neuron is dormant. A local-only threshold can make one shard recycle capacity that another still considers healthy, which is how a "plasticity fix" turns into distributed disagreement.
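The disagreement described above is easy to reproduce in a toy setting. The sketch below assumes each rank holds its own EMA activity vector for the same four neurons and simulates the collective with a plain average; the layout, the threshold, and all function names are illustrative assumptions rather than the sample's code.

```python
def dormant(ema, tau):
    """Flag neurons whose activity, relative to the layer mean, is at most tau."""
    mean = sum(ema) / len(ema)
    if mean <= 0:
        return [False] * len(ema)  # no signal to normalize against
    return [value / mean <= tau for value in ema]


def all_reduce_mean(per_rank):
    """Stand-in for a cross-rank mean reduction over per-neuron statistics."""
    n = len(per_rank)
    return [sum(values) / n for values in zip(*per_rank)]


rank_ema = [
    [1.0, 0.9, 0.02, 1.1],  # rank 0 saw almost no activity on neuron 2
    [1.0, 1.0, 0.30, 0.9],  # rank 1 saw neuron 2 firing modestly
]
tau = 0.1

# Local-only decisions: the ranks disagree about neuron 2.
local = [dormant(ema, tau) for ema in rank_ema]

# Reduced decision: every rank would compute the same mask.
shared = dormant(all_reduce_mean(rank_ema), tau)
```

With local statistics, rank 0 would recycle neuron 2 while rank 1 keeps it; after the reduction both ranks agree that it stays.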

The third addition is parameter targeting. The default path in the toolkit is careful about what it touches. Two-dimensional projection weights are in scope. Embeddings, head weights, scalar state parameters, and various bias-like tensors are generally not. That is especially important in hybrid architectures where not every learnable parameter represents the same kind of geometry.

touched = apply_fire(model, targets=get_fire_targets(model, mode="context_extension"))
reset_optimizer_states_for_fired_params(optimizer, touched)

That short sequence encodes a lot of engineering judgment: select a topology-aware target set, rewrite only the intended matrices, and invalidate only the optimizer state that became stale.

Why DASH and ReDo belong in the same module

At first glance, DASH and ReDo seem unrelated. One is about cosine alignment of rows and gradients. The other is about dormant neuron detection. The reason they belong together is that they are both trying to prevent late-training rigidity, just on different observables.

dash_step() is the lighter-weight tool. In the checked-in tensor sketch, a row whose gradient keeps pointing in the same direction as the row itself is treated as a candidate for bounded shrinkage. That makes the sample a maintenance heuristic, not a claim that every paper-level DASH schedule uses the same trigger.

The checked-in DASH sample is intentionally a bounded tensor-mode rule, not a claim of exact paper-parity scheduling. It computes row-wise cosine similarity, applies a thresholded penalty, and clamps the shrink factor so the intervention stays maintenance-shaped instead of turning into a hidden reinitialization path. That public-safe shape is still the useful engineering point: DASH should act as a graded pressure-release move that can coexist with the optimizer, not as an all-or-nothing reset.
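The three steps named above (row-wise cosine, thresholded penalty, clamped shrink factor) can be sketched in plain Python. The threshold, strength, and clamp values are hypothetical, as is the sign convention of comparing the raw gradient to the row; the checked-in `dash_step()` may differ on all of these.

```python
import math


def cosine(u, v, eps=1e-12):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def dash_shrink(weights, grads, threshold=0.5, strength=0.2, min_factor=0.95):
    """Shrink rows whose gradient aligns with the row itself, within a bound.

    Assumed rule: penalty grows linearly above `threshold`, and the shrink
    factor is clamped to [min_factor, 1.0] so no row is ever zeroed out.
    """
    new_rows, factors = [], []
    for w_row, g_row in zip(weights, grads):
        penalty = max(0.0, cosine(w_row, g_row) - threshold)
        factor = min(1.0, max(min_factor, 1.0 - strength * penalty))
        factors.append(factor)
        new_rows.append([factor * w for w in w_row])
    return new_rows, factors


weights = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
grads = [[2.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
shrunk, factors = dash_shrink(weights, grads)
```

In this example the first row is fully aligned and hits the clamp, the second is orthogonal and is left alone, and the third gets a graded factor between the two.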

ReDo is much more targeted. It looks for neurons that have effectively stopped firing, based on normalized EMA activity. The reinitialization path then restores incoming weights at a normal scale and damps outgoing weights, which is a sensible compromise between waking the neuron up and avoiding a destabilizing spike.

That normalization step matters more than the headline "recycle dormant neurons." The public sample computes dormancy relative to the layer mean rather than from a raw activation magnitude, which is what keeps the threshold about under-used capacity instead of about whichever layer simply runs at a smaller absolute scale. In distributed lanes that denominator is exactly the piece that has to agree across shards before any rank recycles rows.
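A minimal single-process sketch of that pipeline follows: EMA tracking, dormancy relative to the layer mean, then a recycle step that re-draws incoming weights and damps outgoing ones. The decay, threshold, initialization scale, and damping factor are all hypothetical, and the function names are not the sample's `ReDoDiagnostics` or `recycle_dormant_neurons()` API.

```python
import math
import random


def update_ema(ema, batch_activity, decay=0.99):
    """Blend the latest per-neuron mean |activation| into the running EMA."""
    return [decay * e + (1.0 - decay) * a for e, a in zip(ema, batch_activity)]


def dormant_indices(ema, tau=0.1):
    """Neurons whose EMA activity is at most tau times the layer mean."""
    mean = sum(ema) / len(ema)
    if mean <= 0:
        return []  # no signal to normalize against
    return [i for i, value in enumerate(ema) if value / mean <= tau]


def recycle(w_in, w_out, indices, rng, out_damp=0.1):
    """Re-draw incoming rows at a normal scale and damp outgoing columns.

    w_in is [hidden, fan_in] (one row per neuron); w_out is [out, hidden].
    """
    fan_in = len(w_in[0])
    std = 1.0 / math.sqrt(fan_in)  # assumed init scale
    for i in indices:
        w_in[i] = [rng.gauss(0.0, std) for _ in range(fan_in)]
        for row in w_out:
            row[i] *= out_damp  # keep the revived neuron from spiking the output


ema = [1.0, 0.01, 0.9]
idx = dormant_indices(ema)
w_in = [[0.5, 0.5], [0.0, 0.0], [0.3, -0.2]]
w_out = [[1.0, 2.0, 3.0]]
recycle(w_in, w_out, idx, random.Random(0))
```

Because the threshold is a ratio to the layer mean, the same `tau` can be reused across layers that run at different absolute activation scales.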

The subtle but important local insight is that ReDo is activation-family dependent. The checked-in sample discussion explicitly connects dormant-neuron pressure to relu2, while also discussing SwiGLU as a way to reduce the need for ReDo-style maintenance. That matters because it prevents the toolkit from turning into dogma. If the activation choice changes the dormant-neuron problem, the right amount of ReDo also changes. The operational version of that caution shows up again in FIRE, DASH, and ReDo in practice, where the same toolkit is discussed as lane maintenance rather than as a universal recipe.

Two scheduler details are worth keeping explicit. In the paper-level DASH story, strong alignment is evidence that a direction is still carrying useful features, so the heavier forgetting pressure belongs on weakly aligned or misaligned directions instead. The checked-in tensor sample is narrower than that schedule and should be read as a bounded maintenance sketch rather than as an exact coefficient-for-coefficient reproduction. ReDo has the opposite distributed constraint: the dormancy threshold is only fair if layer-wide activity statistics are reduced across shards before any rank decides a neuron is dead, otherwise one shard can recycle capacity that another still considers healthy.

Hybrid blocks are why targeting matters

NAM52 and NAM56R-style stacks make all of this harder because the architecture is heterogeneous on purpose. A, M, E, and R do not all want the same intervention.

Attention projections are natural FIRE targets because they are 2D linear maps with a clear geometric story. MLP projections are similar. Some Mamba-style projections can also make sense. But one-dimensional state parameters, convolution kernels, and topology-specific auxiliary structures do not all benefit from the same orthogonalization logic.

The implementation discussion is especially useful here because it spells out which parameter classes should be touched and which should be excluded. That is the sort of evidence a production port actually needs. A vague instruction like "apply FIRE to the model" is too blunt for a hybrid system.

The same targeting logic shows up in the context-extension mode. Rather than treating every 2D parameter equally, get_fire_targets() can narrow the intervention to the Q/K surfaces that matter most for extending context. That is a better operational story than uniform global treatment because it respects the block topology.
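A targeting rule of this kind can be sketched as a filter over parameter names and shapes. The name fragments, the example parameter list, and `select_fire_targets` itself are hypothetical; the real `get_fire_targets()` may key on module types or other metadata instead.

```python
# Hypothetical naming conventions, used only for this illustration.
EXCLUDED_NAME_PARTS = ("embed", "lm_head", "bias", "norm")
CONTEXT_EXTENSION_PARTS = ("q_proj", "k_proj")


def select_fire_targets(named_shapes, mode="default"):
    """Pick 2D projection weights, optionally narrowed to Q/K surfaces."""
    targets = []
    for name, shape in named_shapes:
        if len(shape) != 2:
            continue  # biases, norm scales, conv kernels, state scalars
        if any(part in name for part in EXCLUDED_NAME_PARTS):
            continue  # embeddings and head weights stay out of scope
        if mode == "context_extension" and not any(
            part in name for part in CONTEXT_EXTENSION_PARTS
        ):
            continue
        targets.append(name)
    return targets


named_shapes = [
    ("embed.weight", (32000, 512)),
    ("blocks.0.attn.q_proj.weight", (512, 512)),
    ("blocks.0.attn.k_proj.weight", (512, 512)),
    ("blocks.0.attn.v_proj.weight", (512, 512)),
    ("blocks.0.mlp.up_proj.weight", (2048, 512)),
    ("blocks.0.mlp.up_proj.bias", (2048,)),
    ("blocks.0.norm.weight", (512,)),
    ("blocks.1.mamba.conv1d.weight", (512, 1, 4)),
    ("blocks.1.mamba.dt_bias", (512,)),
    ("lm_head.weight", (32000, 512)),
]

default_targets = select_fire_targets(named_shapes)
context_targets = select_fire_targets(named_shapes, mode="context_extension")
```

The default mode keeps the four 2D projection weights, while the context-extension mode narrows the same list to the Q and K projections only.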

| Block family | Likely toolkit role | Why |
| ablock / attention projections | Strong FIRE candidates | Geometry matters directly for Q/K/V and output projections |
| mblock linear projections | Conditional FIRE candidates | Some 2D maps benefit; state scalars do not |
| eblock feed-forward projections | More DASH/ReDo/FIRE depending on activation path | Large 2D MLP surfaces and dormant-neuron risk |
| rblock / recurrent-specific state | Usually narrower targeting | Many parameters are not natural FIRE surfaces |

That table is the real operational takeaway. Plasticity is not a global property. It is a block-local maintenance problem.

What the tests prove, and what they do not

The test coverage around the toolkit is useful because it shows the sample is not relying on hand-wavy claims. The checks cover FIRE's effect on a proxy geometry metric, verify that the model still runs after intervention, and exercise the broader plasticity wiring. That means the toolkit has crossed the line from concept to maintained code.

But the tests also reveal the right humility. A passing unit test does not prove that a late-phase FIRE pass improves convergence in every NAM56R lane. A passing ReDo test does not prove that every dormant-neuron issue is solved. The toolkit should therefore be read as a set of grounded mechanisms with operational constraints, not as a guaranteed universal win.

That is exactly why the public sample is so valuable. It preserves the mismatch between an elegant method-level story and messy training reality: optimizer state has to be reset, sharded tensors have to be handled locally, and activation choice changes whether dormant-neuron recycling is even the right tool.

What later integrations should preserve from this work

The port should keep the decomposition, not just the names. That means at least four things.

  1. Preserve phase-boundary FIRE as a topology-aware targeted intervention.
  2. Preserve selective optimizer-state reset for touched parameters.
  3. Preserve the distinction between periodic DASH and diagnostic-driven ReDo.
  4. Preserve the idea that block family and activation family determine which tool is appropriate.

The main thing worth preserving is the separation of responsibilities. FIRE fits best as a curriculum-boundary utility, DASH as a lightweight periodic maintenance option, and ReDo only where the activation path and hook surfaces make the signal trustworthy. Flattening all three into a single switch would discard most of the design value the public sample makes visible. That is also why this post pairs well with training speed by feature: the right question is not "which plasticity flag exists?" but "which intervention is worth its maintenance and step-time cost on this lane?"

The main thing to avoid is flattening the toolkit into a single feature flag. Once that happens, all the useful timing and targeting discipline disappears. The same block-aware caution shows up in GateSkip and FlexiDepth after the router, and the strongest contribution here is showing that plasticity support can be modular, code-grounded, and still operational.

Why this matters beyond one implementation snapshot

The deepest value of the toolkit is not that it proves one external reference right. It is that it turns plasticity from an abstract training slogan into a set of maintainable engineering surfaces. That is a big difference.

FIRE gives a principled way to repair geometry at transitions. DASH gives a cheap maintenance move for rows that are becoming too self-aligned. ReDo gives a direct response to dead neurons when the activation family makes that a real problem. Put together, they form a credible answer to a common late-training complaint: the model is still updating, but it is learning less than it should.

That is why this work should survive beyond one sample snapshot. Not because every run needs all three methods, but because the public sample already shows the harder part: how to separate cadences, target the right parameter families, and keep the interventions compatible with sharding and optimizer state. That is the part worth keeping.

FAQ

Frequently asked questions

Should all three methods be enabled on every run?
No. FIRE is a phase-boundary tool, DASH is periodic maintenance, and ReDo only makes sense where activity diagnostics are trustworthy. The public sample is valuable because it keeps those cadences separate instead of flattening them into one training toggle.
Why does FIRE reset optimizer state for touched parameters?
Because the weight basis changed. Keeping stale Muon or Adam-style state would partially undo the intervention, which is exactly why the sample couples apply_fire(...) with reset_optimizer_states_for_fired_params(...); Muon on Hopper and Blackwell is the adjacent optimizer-side view.
Why is the reset helper enough instead of a custom optimizer restart path?
Because the sample relies on the optimizer's normal lazy state creation on the next step. The important part is deleting only the slots attached to rewritten weights so fresh moments are rebuilt for the new shard-local basis without wiping unrelated state elsewhere in the model.
When does ReDo become less compelling?
When the activation path already reduces dormant-neuron pressure. The sample's own discussion is careful here: relu2-style lanes have a stronger ReDo story than smoother SwiGLU-heavy lanes, so the diagnostics need to stay activation-aware instead of turning into blanket policy.
Where does targeting matter most?
In hybrid stacks where attention, recurrent, and expert blocks do not share the same geometry. That is the same "choose the right block, not the biggest knob" lesson as activation checkpointing policy.
Why does FIRE stay off 1D parameters?
Because FIRE is a matrix-shape intervention, not a generic "touch every weight" pass. Biases and norm scales do not have the 2D geometry that the orthogonalization step is trying to repair, and rewriting them would change feature scaling without buying the isometry benefit that makes FIRE useful in the first place.
Why does ReDo need cross-shard sync before recycling?
Because dormancy is a layer-relative judgment, not a per-rank intuition. If the layer mean is computed locally, two workers can disagree about whether the same unit is dead, and then the recycle step stops being a controlled intervention and starts being state skew.
Why is the ReDo sync scheduled instead of running on every step?
Because the collective is only needed when the toolkit is about to make a dormancy decision. Local EMA buffers can accumulate cheaply between checks, and then one scheduled reduction produces the shared layer-wide denominator every rank needs before recycling. That keeps the communication cost attached to the maintenance cadence instead of charging every training step for a statistic that only matters when ReDo is actually going to act. The cadence-side continuation is FIRE, DASH, ReDo in practice, and the sharding-side background is FSDP2 pain and payoff.
Why use normalized EMA activity instead of one raw activation cutoff?
Because raw magnitudes are layer-scale dependent. ReDo becomes portable across different widths and activation families only when a neuron's activity is judged relative to the layer mean, which is also why the denominator has to be made consistent across shards before any recycle step runs.
Why not drive FIRE, DASH, and ReDo from one shared interval?
Because the checked-in scheduler keeps them as different runtime decisions: FIRE at a phase boundary, DASH on a maintenance cadence, and ReDo on its own diagnostic cadence. Collapsing all three into one interval would blur the line between a structural reset, a lightweight shrink step, and a dormancy repair pass, which is exactly the discipline this toolkit is trying to preserve.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

A / M / E / R

MegaCpp shorthand for the four main block families: attention, Mamba/state-space, expert/MoE, and recurrent tail layers.

scheduler

The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.

ablock

The attention-heavy block family in MegaCpp's A/M/E/R notation.

mblock

The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.

eblock

The expert / MoE block family in MegaCpp's A/M/E/R notation.

DTensor

PyTorch's mesh-backed distributed-tensor abstraction: one logical tensor with explicit shard or replica metadata across ranks.

rblock

The recurrent tail block family in MegaCpp's A/M/E/R notation.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.