Restoring a Megatron training tree without git history
How MegaCpp treats restoration as a base-plus-patch-plus-canary workflow when the working tree survived but the original .git metadata did not.

The interesting restoration problem is not a fresh clone. It is the machine or archive that still has code, caches, and maybe checkpoints, but no usable Git history. In that situation the right question is not "how do I recover the old commit graph?" The right question is "how do I rebuild an explainable training tree with the smallest defensible local delta?"
That is why restoration in this article family looks closer to "Migration policy: native Megatron vs narrow custom seams", "Checkpoint format and resume", and "How we keep a patch lane" than to normal Git recovery.
Four terms matter on a first read:
- The upstream base is the nearest verifiable upstream snapshot that explains the recovered tree cleanly.
- The patch surface is the bounded local runtime layer still needed on top of that base.
- The launch contract is the explicit policy that makes the restored run comparable to the original one.
- The canary is the smallest post-restore launch or resume proof that still exercises the intended lane.
The quickest checked-in proof surfaces are NAM56R Megatron plan sample, NAM56R runtime patch surface sample, and NAM56R launch contract sample.
What to restore first
Start with the surfaces that preserve execution truth rather than directory shape.
In practice the safest order is:
- recover the most plausible upstream base
- map the local patch surface that still sits above it
- restore the launch and checkpoint assumptions that make the lane comparable
- replay the smallest canary that still exercises the rebuilt stack
That evidence-first order is the same habit behind "Determinism and bit-exact runs" and "Checkpoint format and resume": recover the contracts that change runtime meaning before worrying about cosmetic tree completeness.
When several upstream snapshots look plausible, compare only the surviving behavioral delta instead of treating the missing commit hash as the unit of truth. Git's stable patch ID is a useful public sanity check here because it tracks patch content rather than commit identity, and the stable mode keeps the result insensitive to file-diff reordering.
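Comparing candidates by surviving delta can be sketched at the file level. The example below is a minimal illustration, not the project's tooling: it assumes each candidate snapshot and the recovered tree are unpacked as plain directories, and the helper names (`tree_digest`, `surviving_delta`, `rank_candidates`) are hypothetical. It hashes file contents rather than computing Git patch IDs, so it is a coarser check than `git patch-id --stable` run over an actual diff.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path, suffixes=(".py",)) -> dict[str, str]:
    """Map each relative source path under root to a SHA-256 of its bytes."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def surviving_delta(base: Path, recovered: Path) -> dict[str, list[str]]:
    """Classify files as added, removed, or modified relative to a candidate base."""
    b, r = tree_digest(base), tree_digest(recovered)
    return {
        "added": sorted(set(r) - set(b)),
        "removed": sorted(set(b) - set(r)),
        "modified": sorted(p for p in set(b) & set(r) if b[p] != r[p]),
    }


def rank_candidates(candidates: dict[str, Path], recovered: Path) -> list[tuple[str, int]]:
    """Order candidate bases by how many files differ from the recovered tree."""
    scored = []
    for name, base in candidates.items():
        delta = surviving_delta(base, recovered)
        scored.append((name, sum(len(paths) for paths in delta.values())))
    return sorted(scored, key=lambda item: item[1])
```

The candidate with the smallest count is only a starting hypothesis. The files listed under "added" and "modified" are the first draft of the patch surface and still need review before the base is accepted.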
Why the base-plus-patch split matters
The checked-in starter kit is small on purpose.
NAM56R Megatron plan sample captures the upstream-shaped execution plan you are trying to recover. NAM56R runtime patch surface sample marks the behaviors that still depend on a local seam after the base is back. NAM56R launch contract sample keeps cluster policy and launch proof separate from the recovered recipe itself.
That split is valuable because it makes restoration reviewable. Later readers can ask three narrow questions instead of one vague one:
- Is the guessed base plausible?
- Is the local seam still bounded?
- Did the canary actually prove the right runtime contract?
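The three review questions can be written down as a small record so that each one gets an explicit answer. This is a sketch under stated assumptions: the `RestorationRecord` class, its field names, and the idea of a numeric patch budget are all hypothetical and do not come from the checked-in samples.

```python
from dataclasses import dataclass


@dataclass
class RestorationRecord:
    """Hypothetical review record for one restored tree."""

    base_ref: str              # guessed upstream snapshot (tag, release, or archive name)
    base_evidence: list[str]   # observations that support the guess
    patch_files: list[str]     # files that still carry a local seam above the base
    patch_budget: int          # assumed upper bound on how many files the seam may touch
    canary_scope: str          # what the canary was meant to exercise
    canary_passed: bool        # outcome of the canary run

    def review(self) -> dict[str, bool]:
        """Answer the three review questions independently."""
        return {
            "base_plausible": bool(self.base_evidence),
            "seam_bounded": len(self.patch_files) <= self.patch_budget,
            "canary_proved_contract": self.canary_passed and bool(self.canary_scope.strip()),
        }

    def restored(self) -> bool:
        """Treat the tree as restored only when every question is answered yes."""
        return all(self.review().values())
```

Keeping the answers separate means a failed review points at the base guess, the seam, or the canary, rather than at "the restoration" as a whole.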
What not to trust
A missing .git directory does not make the surviving tree authoritative.
Generated outputs, cached wheels, copied virtual environments, and stale build products can all make a broken tree look healthy for one launch. Restoration should therefore trust checked-in proof surfaces and canary receipts more than directory folklore.
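One way to make that distrust concrete is to list the artifacts that could be masking a broken tree before any launch is attempted. The sketch below is illustrative only: the directory names and suffixes are assumptions about a typical Python project, not a list taken from MegaCpp, and `suspect_artifacts` is a hypothetical helper. The `pyvenv.cfg` check relies on the fact that Python's `venv` module writes that file at the root of every environment it creates.

```python
from pathlib import Path

# Assumed patterns for generated or copied material; adjust for the actual tree.
SUSPECT_DIR_NAMES = {"__pycache__", "build", "dist", ".venv", "venv"}
SUSPECT_SUFFIXES = {".whl", ".pyc", ".so", ".o"}


def suspect_artifacts(root: Path) -> list[str]:
    """Return relative paths that look generated, cached, or copied in."""
    found = set()
    for path in root.rglob("*"):
        rel = path.relative_to(root).as_posix()
        if path.is_dir():
            is_venv = (path / "pyvenv.cfg").is_file()
            if path.name in SUSPECT_DIR_NAMES or path.name.endswith(".egg-info") or is_venv:
                found.add(rel + "/")  # trailing slash marks a directory
        elif path.suffix in SUSPECT_SUFFIXES:
            found.add(rel)
    return sorted(found)
```

A non-empty result does not mean the tree is broken. It means a successful launch may be leaning on those paths, so the canary should ideally run from an environment rebuilt without them.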
If the rebuilt tree cannot pass the smallest meaningful canary, it is not restored yet. A one-off launch that "seems to work" is weaker evidence than a small canary whose scope is explicit.
How checkpointing and restoration meet
Checkpoint recovery is part of the restoration story, not a separate concern.
If the restored tree cannot correctly interpret the checkpoint kind, the layout, or the intended resume consumer, then the tree is still missing part of its runtime contract. That is why "Checkpoint format and resume" belongs in the same reading cluster as this article: a restored code tree that cannot honestly resume is only half restored.
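The three checkpoint properties named above can be checked before a resume is attempted. The sketch below assumes a checkpoint is described by a small manifest dictionary with `kind`, `layout`, and `resume_consumer` keys. That manifest shape, the key names, and the `checkpoint_contract_gaps` helper are hypothetical and are not the project's or Megatron's checkpoint format.

```python
# Hypothetical manifest keys mirroring the three properties in the text.
REQUIRED_FIELDS = ("kind", "layout", "resume_consumer")


def checkpoint_contract_gaps(manifest: dict, supported: dict[str, set[str]]) -> list[str]:
    """List every checkpoint property the restored tree cannot account for."""
    gaps = []
    for field in REQUIRED_FIELDS:
        value = manifest.get(field)
        if value is None:
            gaps.append(f"{field}: missing from manifest")
        elif value not in supported.get(field, set()):
            gaps.append(f"{field}: {value!r} not understood by restored tree")
    return gaps
```

An empty list is a precondition for the resume canary, not a substitute for it: the canary still has to load the checkpoint and take a step.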
Frequently asked questions
Can I treat the surviving working tree as the source of truth if it still launches?
What actually proves restoration succeeded?
What exactly is the launch contract here?
How do I choose the upstream base when the old commit graph is gone?
Which checked-in files should I read first when rebuilding the lane?