Grounded engineering note from the MegaCpp stack
Published by David Gornshtein · 2 min read
Tags: Megatron, Restoration, Migration, Ops

Restoration without git history

How MegaCpp reconstructs a Megatron training tree when the code survives but the original commit graph does not.

The hardest restore is not a clean clone. It is a machine image or archive where the code still exists, some receipts still exist, but the Git history that used to explain the tree is gone.

In that situation the only honest workflow is reconstructive: choose the most plausible upstream base, replay the narrow local patch surface, and then prove the rebuilt tree still behaves like the lane you meant to recover. That is why this short note sits next to Restoring a Megatron training tree without git history, Checkpoint format and resume, and How we keep a patch lane.

Three terms to know on first read:

  • The upstream base is the closest verifiable upstream snapshot that explains the surviving tree with the smallest local delta.
  • The patch surface is the bounded set of local seams still required on top of that base.
  • The canary is the smallest launch or resume proof that shows the rebuilt tree still behaves like the intended lane.
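
One way to make "smallest local delta" concrete is to score each candidate snapshot by how many lines separate it from the surviving tree. The sketch below is only an illustration under assumptions: trees are held in memory as a mapping from relative path to file lines, and the names `delta_size` and `pick_base` are hypothetical, not part of any MegaCpp or Megatron tooling.

```python
import difflib

Tree = dict[str, list[str]]  # relative path -> file lines (assumed in-memory form)


def delta_size(surviving: Tree, candidate: Tree) -> int:
    """Count lines removed or added to turn the candidate into the surviving tree."""
    changed = 0
    for path in surviving.keys() | candidate.keys():
        old = candidate.get(path, [])  # a file missing on one side counts in full
        new = surviving.get(path, [])
        matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                changed += (i2 - i1) + (j2 - j1)
    return changed


def pick_base(surviving: Tree, candidates: dict[str, Tree]) -> tuple[str, int]:
    """Return the candidate name with the smallest delta, plus that delta."""
    scored = {name: delta_size(surviving, tree) for name, tree in candidates.items()}
    # Sorting first makes ties resolve by name instead of by insertion order.
    best = min(sorted(scored), key=lambda name: scored[name])
    return best, scored[best]
```

A raw line count is a crude score. It only narrows the candidates; the base guess still has to be documented and then confirmed by the canary.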

The quickest public-safe starter kit is:

Together they separate base plan, local seam inventory, and proof launch.

What the workflow preserves

The goal is not to reconstruct every historical commit. The goal is to recover a training tree whose runtime is explainable again.

That means preserving three things:

  • a plausible upstream base
  • a narrow replayable patch layer
  • a verification path that proves the rebuilt tree still behaves correctly

That target is narrower than full archaeology and more useful for recovery. It matches the evidence-first posture in Checkpoint format and resume: restore the executable contract first, then widen the tree only if the verified lane still needs more context.

The public-safe sequence

  1. Reconstruct the likely upstream-shaped execution plan.
  2. Mark the local seams that still sit above that base.
  3. Rebuild the launch contract that made the original lane comparable.
  4. Run the smallest canary that still exercises the restored path.

The checked-in examples above are small on purpose because they keep each part of that sequence inspectable instead of hiding it inside one opaque launcher or archive.

The practical way to keep that scope honest is to inventory three separate contracts instead of diffing everything at once. The plan receipt says what upstream-shaped model you think you have, the runtime patch-surface receipt lists only the live seams that still sit above that base, and the launch contract records the fixed policy that has to surround the generated args. When those three surfaces agree, you have a restoration story. When they do not, you usually have an archive that still runs but no longer explains itself.
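
The agreement check between the three surfaces can be sketched as plain data plus one comparison function. Everything below is an assumption made for illustration: the class names, the field names (`base_ref`, `generated_args`, `seams`, `fixed_policy`, `required_seams`), and the three specific checks are hypothetical and do not describe the project's actual receipt format.

```python
from dataclasses import dataclass


@dataclass
class PlanReceipt:
    base_ref: str         # hypothetical label for the chosen upstream snapshot
    generated_args: dict  # args produced by the upstream-shaped plan


@dataclass
class PatchSurfaceReceipt:
    base_ref: str
    seams: list           # names of the live local seams above the base


@dataclass
class LaunchContract:
    base_ref: str
    fixed_policy: dict    # policy that surrounds the generated args
    required_seams: list  # seams the launch expects to be present


def disagreements(plan: PlanReceipt, patches: PatchSurfaceReceipt,
                  launch: LaunchContract) -> list[str]:
    """List the ways the three receipts fail to tell one story."""
    problems = []
    if len({plan.base_ref, patches.base_ref, launch.base_ref}) != 1:
        problems.append("receipts name different upstream bases")
    missing = sorted(set(launch.required_seams) - set(patches.seams))
    if missing:
        problems.append(f"launch needs seams not in the patch surface: {missing}")
    shared = plan.generated_args.keys() & launch.fixed_policy.keys()
    clashes = sorted(k for k in shared
                     if plan.generated_args[k] != launch.fixed_policy[k])
    if clashes:
        problems.append(f"fixed policy contradicts generated args: {clashes}")
    return problems
```

An empty list means the three surfaces agree. Any entry is a concrete place where the archive runs but no longer explains itself.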

Frequently asked questions

What is the main goal of restoration without git history?
Not full history recovery. The goal is to recover a known-good tree with enough evidence that the rebuilt runtime is understandable and repeatable.
Why start from an upstream base instead of editing the surviving tree in place?
Because a plausible upstream base plus a narrow patch layer is something you can reason about later. An opaque surviving tree gives you no clean place to compare, replay, or retire local changes.
What makes a restoration trustworthy?
A documented base guess, a bounded patch surface, and a canary that proves the rebuilt tree still behaves like the lane you intended to recover.
What should the canary prove before the rebuilt tree is trusted?
It should compare the restored lane against the smallest runtime contract that made the old tree meaningful: the reconstructed plan, the bounded patch surface, the launch policy, and one resume or training-step receipt. A launch that merely starts is weaker than a launch that preserves the same checkpoint boundary, parallelism shape, and deterministic proof surface; that is why this note links the restore canary back to Checkpoint format and resume and Determinism and bit-exact runs.
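
As a rough sketch of that comparison, a canary can diff a small receipt from the reference lane against one from the restored lane. The receipt keys used here (`resume_step`, `parallelism`, `loss_fingerprint`) and both function names are hypothetical. Hashing the raw IEEE-754 bits of the losses is one possible way to express a bit-exact proof surface, not necessarily the one the project uses.

```python
import hashlib
import struct

CANARY_KEYS = ("resume_step", "parallelism", "loss_fingerprint")  # assumed keys


def loss_fingerprint(losses: list[float]) -> str:
    """Hash the exact bits of each loss so near-equal values do not match."""
    packed = b"".join(struct.pack("<d", value) for value in losses)
    return hashlib.sha256(packed).hexdigest()


def canary_mismatches(reference: dict, restored: dict) -> list[str]:
    """Return the receipt keys that are missing or differ between the two lanes."""
    return [
        key
        for key in CANARY_KEYS
        if key not in reference
        or key not in restored
        or reference[key] != restored[key]
    ]
```

A missing key counts as a mismatch on purpose, so a receipt that silently drops the checkpoint boundary or the parallelism shape cannot pass by omission.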
How do I compare surviving local seams when the old commit IDs are gone?
Compare the delta, not the missing hash. Git's patch-id --stable is designed to compute a reasonably stable identifier for patch text with line numbers ignored, which makes it a useful public-safe check for whether two local seams are materially the same change even if the original commit graph is gone.
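
A minimal sketch of that check, assuming `git` is on the PATH and each seam is available as unified diff text: `stable_patch_id` and `make_diff` are hypothetical helper names, and the sample diff is invented purely to show that hunk line numbers do not affect the id.

```python
import shutil
import subprocess


def stable_patch_id(diff_text: str) -> str:
    """Return the id that `git patch-id --stable` prints for one diff on stdin."""
    result = subprocess.run(
        ["git", "patch-id", "--stable"],
        input=diff_text,
        capture_output=True,
        text=True,
        check=True,
    )
    fields = result.stdout.split()
    if not fields:
        raise ValueError("git patch-id produced no output; is the input a diff?")
    # Each output line is "<patch-id> <commit-id>"; only the first field is needed.
    return fields[0]


def make_diff(start_line: int, added: str) -> str:
    """Build a tiny illustrative diff whose hunk starts at start_line."""
    return (
        "diff --git a/train.py b/train.py\n"
        "--- a/train.py\n"
        "+++ b/train.py\n"
        f"@@ -{start_line},3 +{start_line},3 @@\n"
        " a\n"
        "-b\n"
        f"+{added}\n"
        " d\n"
    )


if __name__ == "__main__" and shutil.which("git"):
    same_change = stable_patch_id(make_diff(10, "c")) == stable_patch_id(make_diff(40, "c"))
    other_change = stable_patch_id(make_diff(10, "c")) == stable_patch_id(make_diff(10, "e"))
    print("shifted copy matches:", same_change)
    print("different edit matches:", other_change)
```

Matching ids only say that the patch text is materially the same. Whether the seam is still load-bearing on the rebuilt base is a separate question for the canary.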
What belongs in the minimal patch surface?
Only the runtime seams that still change behavior after the upstream-shaped base is reconstructed. In the checked-in sample that means keeping recipe-native facts separate from fused-loss glue, hybrid schedule glue, or shared layer-spec helpers instead of replaying every surviving local helper as if it were equally load-bearing.
Which checked-in files should I read first?

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

Runtime boundaries

Why MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…