MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
Published · 8 min read · David Gornshtein
Tags: Ablation, Training, STP, FIRE, MTP, NAM52, NAM56R

What changed after the 10K-step gate: the ablations that stayed honest

A grounded reading of training changes after the configured 10K-step gate: STP activation, auxiliary-head timing, plasticity scheduling, and why later ablations are more trustworthy than warmup-era receipts.

Overview: The most useful ablations usually begin after the configured 10K-step gate because that is when this training policy stops over-weighting warmup behavior. A delayed STP activation gate postpones the trajectory-straightness regularizer by design, multi-token prediction (MTP) is treated as substrate-sensitive rather than a default win, and the plasticity toolkit is scheduled around real phase changes rather than tiny startup windows. If you compare feature sets only at step 20, step 100, or even step 1000, you mostly learn how initialization behaves. If you compare after the configured gate, you start learning how the model family trains under its intended schedule.

The training code already tells you this if you read it literally. The STP loss sample defines STP as a curvature penalty on hidden-state trajectories; it is not supposed to matter before it is enabled. Public training configuration and runtime notes describe STP as an optional auxiliary term with a delayed start and separate weight. The public SOTA ablation writeup also treats MTP as a lane-specific decision: it regressed on the fixed 4K substrate and stays parked until a substrate-matched rerun says otherwise. The lesson is not that auxiliary features are bad. The lesson is that honest ablations must be aligned to the activation schedule and substrate of the feature being studied.

That becomes even more important in hybrid families such as NAM52 and NAM56R. In the local notation, A is an attention block, M is a Mamba block, E is an expert or MoE block, and R is a recurrent block. The same taxonomy also appears in block-level naming such as ablock, mblock, eblock, rblock, and cblock across the public samples. A feature may interact with one part of that stack much earlier than another. So the headline question is not "does the feature help?" but "when is the feature actually live enough to measure?"

Warmup receipts mostly measure the wrong thing

Very short receipts look rigorous because they are easy to compare, but they often collapse three different effects into one number: startup transients, immature optimizer state, and the actual feature under test. The public samples contain multiple examples of this failure mode.

The cleanest one is MTP. The public ablation writeup does not promote or reject it from a generic prior; it parks MTP after a substrate-matched regression on the fixed 4K lane. That is a stronger operational rule than a paper-level expectation because it ties the decision to the lane where the feature was actually measured. For gated work, the useful question is not whether MTP is good in the abstract, but whether it still earns its cost once the target lane is stable and comparable.

The same logic applies to STP. The STP loss sample makes the feature conceptually cheap, but the training stack still delays it so that the base model can establish representations first. The checked-in STP activation note makes the contract even tighter: hidden states can be collected early enough to prove the path is wired, while the auxiliary term is only charged after the chosen start step. Once you know that a delayed STP start exists, a pre-start ablation becomes almost meaningless. You are comparing a dormant option against another dormant option and attributing the result to a feature that is not yet contributing gradient signal.
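The delayed-start contract can be sketched in a few lines. This is an illustration only: the names `effective_stp_weight`, `stp_start_step`, `stp_weight`, the inclusive start step, and the default values are assumptions made for the example, not the repo's actual argument surface.

```python
def effective_stp_weight(step, stp_start_step, stp_weight):
    """Weight applied to the STP term at `step`; zero while the feature is dormant.

    Assumption: the start step is inclusive.
    """
    return stp_weight if step >= stp_start_step else 0.0


def combine_losses(base_loss, stp_loss, step, stp_start_step=10_000, stp_weight=0.1):
    """Combine the losses while keeping each component separately loggable.

    The default start step and weight are hypothetical placeholders.
    """
    w = effective_stp_weight(step, stp_start_step, stp_weight)
    return {
        "train/loss": base_loss + w * stp_loss,
        "train/base_loss": base_loss,
        # The STP value is still computed before the start step, which proves
        # the path is wired, but it contributes nothing to the optimized loss.
        "train/stp_loss": stp_loss,
        "stp_active": w > 0.0,
    }
```

Under these assumptions, an STP-on run and an STP-off run optimize the same total loss at every step before `stp_start_step`, so any difference measured there cannot be attributed to the regularizer.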

Plasticity tools follow the same pattern from another angle. The public toolkit notes are explicit that FIRE is a phase-boundary tool, DASH is periodic, and ReDo is useful only when the activation family can produce dormant neurons in the first place. Those mechanisms are not intended to show their value in the first few hundred steps of a fresh run. A short receipt can capture overhead, but it cannot tell you whether the intervention improves long-run plasticity.
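A rough sketch of why short receipts miss these tools, assuming a hypothetical scheduler. The function name, the period parameters, and the idea that ReDo runs on its own period are illustrative assumptions, not the toolkit's real interface.

```python
def plasticity_events(step, phase_boundaries, dash_period, redo_period,
                      activation_can_go_dormant):
    """Return the plasticity interventions that would fire at `step`.

    All parameters are hypothetical; only the three scheduling styles
    (boundary-driven, periodic, conditional) come from the toolkit notes.
    """
    events = []
    if step in phase_boundaries:
        # FIRE: a phase-boundary tool, so it fires only at listed boundaries.
        events.append("FIRE")
    if step > 0 and step % dash_period == 0:
        # DASH: periodic.
        events.append("DASH")
    if activation_can_go_dormant and step > 0 and step % redo_period == 0:
        # ReDo: skipped when the activation family cannot produce dormant neurons.
        events.append("ReDo")
    return events
```

With a boundary at step 20000 and a DASH period of 1000, no intervention fires during the first few hundred steps, so a receipt taken there can only record overhead.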

| Surface | What a short receipt sees | What a receipt after the configured gate sees |
| --- | --- | --- |
| STP | Mostly disabled or weakly coupled | Real trajectory-straightness regularization on live hidden trajectories |
| MTP | Extra path that can look like a generic regression | Substrate-matched rerun in the lane where the head would operate |
| FIRE | Usually irrelevant unless a curriculum shift happens | True phase-transition reset behavior |
| DASH / ReDo | Local perturbation without stable baseline | Whether plasticity maintenance helps late training |

The practical rule is simple: if the code delays a feature, the ablation must delay its conclusion.

What STP actually changes after the cutoff

The STP implementation is unusually transparent. The loss samples ordered triples (s, r, t) from a hidden-state trajectory and penalizes curvature with 1 - cos(h[t] - h[r], h[r] - h[s]). That matters because it tells you what STP is and what it is not. It is not a second language-model head. It is not another token-level classification target. It is a geometric prior over hidden-state evolution.
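The penalty is small enough to write out in plain Python. This is a sketch of the formula above, not the repo's implementation: uniform triple sampling, the zero-vector handling, and treating `h` as a simple list of vectors along the trajectory are all assumptions.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero-length step carries no direction
    return dot / (norm_u * norm_v)


def stp_penalty(h, s, r, t):
    """1 - cos(h[t] - h[r], h[r] - h[s]) for an ordered triple s < r < t."""
    if not s < r < t:
        raise ValueError("triple must be ordered: s < r < t")
    later_step = [a - b for a, b in zip(h[t], h[r])]
    earlier_step = [a - b for a, b in zip(h[r], h[s])]
    return 1.0 - cosine(later_step, earlier_step)


def stp_loss(h, num_triples, rng):
    """Average the penalty over uniformly sampled ordered triples (assumed sampler)."""
    total = 0.0
    for _ in range(num_triples):
        s, r, t = sorted(rng.sample(range(len(h)), 3))
        total += stp_penalty(h, s, r, t)
    return total / num_triples
```

A perfectly straight trajectory scores 0 (up to floating-point rounding), a right-angle turn scores 1, and a full reversal scores 2, which is what makes the term a geometric prior rather than a prediction target.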

That geometry-based design is exactly why early measurements are easy to misread. At the start of training, hidden-state trajectories are still being organized by the main objective. A curvature regularizer can either appear inert or look deceptively expensive, depending on how noisy those first trajectories are. After the configured gate, the same regularizer is applied to a representation space that has more settled structure. That is the first point where an STP-on versus STP-off comparison starts to answer a real question.

The base training code also preserves STP as a separable knob. The argument surface keeps the STP weight distinct from the primary loss and logs it as its own auxiliary component. That separation is important for receipts. If an ablation changes the total loss after the configured gate, you want to know whether the difference came from the base objective, the auxiliary term itself, or a throughput tradeoff that changed effective optimization rate.

A minimal honest receipt therefore needs at least three channels: base loss, STP loss, and throughput. The repo does not require a giant dashboard to make this point; the contract can stay small.

ablation_window:
  compare_from_step: 10000
  report:
    - train/loss
    - train/stp_loss
    - tok_per_sec
    - active_pattern

The active_pattern field matters in hybrids. A run with a mostly A-heavy schedule can expose STP differently from a schedule with more M, E, or R pressure, even if the top-line model name is still NAM52 or NAM56R. The checked-in NAM56R Megatron plan sample keeps that contract explicit with a fixed pattern string, an expanded role list, mtp_depths: 0, and a fail-closed default instead of hiding the lane shape behind a vague preset name.
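A receipt reader for the small contract above could look like the following. It is a sketch under stated assumptions: the per-step record layout (a list of dicts keyed by the reported metric names plus `step`) and the function name are hypothetical.

```python
from statistics import mean


def summarize_post_gate(records, compare_from_step=10_000):
    """Average the receipt channels over steps at or after the gate.

    `records` is assumed to be a list of dicts with the keys used below.
    """
    window = [r for r in records if r["step"] >= compare_from_step]
    if not window:
        raise ValueError("no records at or after the gate; nothing to compare")
    patterns = {r["active_pattern"] for r in window}
    if len(patterns) != 1:
        # Fail closed: a changing pattern would mix topology with feature effects.
        raise ValueError(f"active_pattern changed inside the window: {sorted(patterns)}")
    return {
        "active_pattern": patterns.pop(),
        "steps": len(window),
        "train/loss": mean(r["train/loss"] for r in window),
        "train/stp_loss": mean(r["train/stp_loss"] for r in window),
        "tok_per_sec": mean(r["tok_per_sec"] for r in window),
    }
```

Raising instead of silently averaging across pattern changes keeps the summary aligned with the fail-closed default described above.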

Hybrid patterns make timing more important, not less

One reason the local notation is useful is that it forces you to think in blocks instead of slogans. NAM52 and NAM56R are not generic dense transformers. They are patterned hybrids, and the pattern notation explains why two runs with the same parameter count can react differently to the same ablation.

In the public samples, A, M, E, and R are not decorative. They are the training topology. A hybrid pattern string encodes where attention, state-space, MoE, and recurrent pressure are actually placed. The cost of an auxiliary head or a plasticity intervention may concentrate in only one of those categories. That means an honest gated ablation should preserve the pattern string, not reduce everything to "feature on" and "feature off."
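To make the pattern lens concrete, here is a minimal sketch that expands a pattern string into a role list. It assumes one letter per layer and uses made-up example strings; the real pattern grammar in the samples may be richer.

```python
from collections import Counter

# Letter-to-role mapping taken from the A/M/E/R notation described above.
ROLES = {"A": "attention", "M": "mamba", "E": "expert", "R": "recurrent"}


def expand_pattern(pattern):
    """Expand a pattern string into a per-layer role list (one letter per layer assumed)."""
    unknown = sorted(set(pattern) - set(ROLES))
    if unknown:
        # Fail closed rather than guess what an unrecognized letter means.
        raise ValueError(f"unknown block letters: {unknown}")
    return [ROLES[ch] for ch in pattern]


def same_topology(pattern_a, pattern_b):
    """Two runs share a topology only if their role lists match layer for layer."""
    return expand_pattern(pattern_a) == expand_pattern(pattern_b)


def block_mix(pattern):
    """Count how many layers of each role a pattern contains."""
    return Counter(expand_pattern(pattern))
```

Two hypothetical patterns such as "AMAM" and "AAMM" have the same block mix but not the same topology, which is one way two runs of equal size can still respond differently to the same feature.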

This is also where the public hybrid pattern sample becomes relevant. The ported Megatron-side code in that sample keeps the same idea alive: hybrid structure and MTP configuration are first-class runtime contracts. The port is not merely copying names. It is preserving the fact that a block mix and an auxiliary path interact structurally.

For gated analysis, that yields a better comparison matrix.

| Family | Pattern lens | Ablation question that survives warmup |
| --- | --- | --- |
| NAM52 | Mostly hybrid A/M with targeted extras | Does STP or MTP improve the settled optimization path? |
| NAM56R | Larger mixed A/M/E/R family | Which auxiliary terms are still worth paying for after the stack is fully live? |
| Public port lanes | Megatron-native hybrid specs | Which earlier ablations transfer as real runtime knobs rather than one-off experiments? |

Without the pattern lens, a short receipt invites the wrong conclusion: "feature X is slower." With the pattern lens, the better conclusion is: "feature X is slow or useful under this block topology and after this activation schedule."

Why gated ablations are the first ones worth operationalizing

The repo contains several examples where a feature's apparent cost changes once neighboring issues are fixed. That is why the 10K cutoff is methodological, not mystical. It gives the training system enough time to move from setup behavior to operating behavior.

The MTP evidence is the clearest operational example. The public ablation writeup labels the substrate and parks MTP after a matched regression instead of carrying an earlier expectation forward. That means the operational baseline for a new backend, new launcher, or new substrate should usually keep MTP off first, then add a later MTP ablation once the lane is known-good.

STP follows the same principle via delayed activation. Plasticity tools follow it via event-based scheduling. The checked-in toolkit note is useful here because it keeps the interventions separate: FIRE is for phase boundaries, DASH is a periodic shrinkage rule, and ReDo is a dormant-neuron recycling pass. Once you line those up, the correct ablation order becomes obvious:

  1. Establish a stable base lane.
  2. Let delayed auxiliaries actually turn on.
  3. Compare only in the interval where the feature is live.
  4. Keep the hybrid pattern fixed while comparing.
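The four steps above can be sketched as a single guard that returns the interval worth comparing. The function and parameter names are hypothetical, and treating "stable" and "live" as plain step numbers is a simplifying assumption.

```python
def comparison_window(base_stable_step, feature_start_step, last_step,
                      pattern_a, pattern_b):
    """Return the steps over which a feature-on vs feature-off comparison is meaningful."""
    if pattern_a != pattern_b:
        # Step 4: the hybrid pattern must stay fixed across the two runs.
        raise ValueError("hybrid pattern differs between runs")
    # Steps 1-3: wait for both a stable base lane and a live feature.
    start = max(base_stable_step, feature_start_step)
    if start > last_step:
        raise ValueError("run ends before the feature is live")
    return range(start, last_step + 1)
```

For example, with a base lane judged stable at step 2000 and a feature that starts at step 10000, the window begins at 10000; a run that stops at step 1000 yields no valid window at all.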

What to carry forward into future receipts

The best part of this setup is not any single feature. It is the discipline of separating dormant, warming, and active regimes. That discipline should survive future ports and future training reports.

For practical reporting, the carry-forward checklist is short:

| Requirement | Why it matters |
| --- | --- |
| Preserve model family and pattern string | Avoid mixing topology changes with feature changes |
| State the feature's activation schedule | Prevent dormant-window receipts from being over-interpreted |
| Report throughput with the loss | Distinguish algorithmic gain from rate distortion |
| Use a post-gate comparison window when possible | Measure active behavior rather than warmup quirks |
| Keep references file-level and code-grounded | Make the claim reproducible by rereading the repo |

The core lesson is narrow but durable. A good ablation is not just a pair of numbers. It is a timing claim. The repo already encodes the timing: STP starts late, MTP is substrate-sensitive, FIRE is for boundaries, and hybrid patterns shape every comparison. Once you accept that, the post-gate window stops looking arbitrary. It becomes the first interval where the experiment is measuring active behavior rather than startup noise.

FAQ

Frequently asked questions

Why can the STP path be wired before the STP loss is active?
Because the checked-in activation note separates two questions: whether hidden states are being collected correctly and whether the auxiliary term is allowed to affect optimization yet. That makes it easier to catch wiring mistakes without pretending the feature was already live in the warmup window.

Why does the hybrid pattern string belong in a post-10K ablation receipt?
Because the same auxiliary can look cheaper or more useful simply because the run moved between different A/M/E/R mixes. A gated comparison is only honest when the model family and active pattern stay fixed while the feature window changes. The execution-side view is easier to keep straight if you read Hybrid Layer Interleaving next to NAM56R launch policy.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

ablock

The attention-heavy block family in MegaCpp's A/M/E/R notation.

mblock

The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.

eblock

The expert / MoE block family in MegaCpp's A/M/E/R notation.

rblock

The recurrent tail block family in MegaCpp's A/M/E/R notation.

A / M / E / R

MegaCpp shorthand for the four main block families: attention, Mamba/state-space, expert/MoE, and recurrent tail layers.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…
