MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
By David Gornshtein, 3 min read
Tags: Megatron, Migration, Restoration, Ops

Restoring a Megatron training tree without git history

How MegaCpp treats restoration as a base-plus-patch-plus-canary workflow when the working tree survived but the original .git metadata did not.

The interesting restoration problem is not a fresh clone. It is the machine or archive that still has code, caches, and maybe checkpoints, but no usable Git history. In that situation the right question is not "how do I recover the old commit graph?" The right question is "how do I rebuild an explainable training tree with the smallest defensible local delta?"

That is why restoration in this article family looks closer to "Migration policy: native Megatron vs narrow custom seams", "Checkpoint format and resume", and "How we keep a patch lane" than to normal Git recovery.

Four terms to know on first read:

  • The upstream base is the nearest verifiable upstream snapshot that explains the recovered tree cleanly.
  • The patch surface is the bounded local runtime layer still needed on top of that base.
  • The launch contract is the explicit policy that makes the restored run comparable to the original one.
  • The canary is the smallest post-restore launch or resume proof that still exercises the intended lane.

The quickest checked-in proof surfaces are NAM56R Megatron plan sample, NAM56R runtime patch surface sample, and NAM56R launch contract sample.

What to restore first

Start with the surfaces that preserve execution truth rather than directory shape.

In practice the safest order is:

  1. recover the most plausible upstream base
  2. map the local patch surface that still sits above it
  3. restore the launch and checkpoint assumptions that make the lane comparable
  4. replay the smallest canary that still exercises the rebuilt stack
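
The ordering above can be read as a gate sequence: each step has to produce evidence before the next one runs. The following is a minimal sketch under that reading. The step names and the check callables are hypothetical placeholders, not functions from the MegaCpp tree.

```python
from typing import Callable, List, Tuple

# Each step is (name, check). The checks are hypothetical stand-ins for
# whatever evidence a real restoration step would produce.
Step = Tuple[str, Callable[[], bool]]


def run_restoration(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that fails.

    Returns the names of the steps that passed, so a reviewer can see
    how far the restoration actually got.
    """
    passed: List[str] = []
    for name, check in steps:
        if not check():
            break  # later steps depend on earlier evidence
        passed.append(name)
    return passed


if __name__ == "__main__":
    demo_steps: List[Step] = [
        ("upstream_base", lambda: True),
        ("patch_surface", lambda: True),
        ("launch_and_checkpoint", lambda: False),  # simulated failure
        ("canary", lambda: True),
    ]
    print(run_restoration(demo_steps))
```

Stopping at the first failure is a deliberate choice in this sketch: a canary that runs on top of an unverified base or launch contract would not prove much.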

That evidence-first order is the same habit behind "Determinism and bit-exact runs" and "Checkpoint format and resume": recover the contracts that change runtime meaning before worrying about cosmetic tree completeness.

When several upstream snapshots look plausible, compare only the surviving behavioral delta instead of treating the missing commit hash as the unit of truth. Git's stable patch ID (git patch-id --stable) is a useful sanity check here because it tracks patch content rather than commit identity, and the stable mode keeps the result insensitive to file-diff reordering.
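
As a hedged illustration of that sanity check, the sketch below pipes diff text into `git patch-id --stable` and returns the resulting ID. It assumes `git` is on the PATH. The helper name `stable_patch_id`, its `cwd` parameter, and the two tiny sample diffs are made up for the example.

```python
import subprocess
from typing import Optional


def stable_patch_id(diff_text: str, cwd: Optional[str] = None) -> str:
    """Return the ID that `git patch-id --stable` computes for a diff."""
    result = subprocess.run(
        ["git", "patch-id", "--stable"],
        input=diff_text,
        capture_output=True,
        text=True,
        check=True,
        cwd=cwd,
    )
    fields = result.stdout.split()
    if not fields:
        raise ValueError("git patch-id produced no output; is the diff empty?")
    # Output is "<patch-id> <commit-id>"; only the patch ID matters here.
    return fields[0]


# Two hypothetical single-file diffs used only to show the property.
FILE_A_DIFF = (
    "diff --git a/a.txt b/a.txt\n"
    "--- a/a.txt\n"
    "+++ b/a.txt\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
)
FILE_B_DIFF = (
    "diff --git a/b.txt b/b.txt\n"
    "--- a/b.txt\n"
    "+++ b/b.txt\n"
    "@@ -1 +1 @@\n"
    "-foo\n"
    "+bar\n"
)

if __name__ == "__main__":
    forward = stable_patch_id(FILE_A_DIFF + FILE_B_DIFF)
    reordered = stable_patch_id(FILE_B_DIFF + FILE_A_DIFF)
    print(forward == reordered)
```

If the local delta computed against two candidate bases yields the same stable patch ID, the two candidates explain the surviving patch equally well, and the choice has to be settled by other evidence such as dependency pins or the launch contract.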

Why the base-plus-patch split matters

The checked-in starter kit is small on purpose.

NAM56R Megatron plan sample captures the upstream-shaped execution plan you are trying to recover. NAM56R runtime patch surface sample marks the behaviors that still depend on a local seam after the base is back. NAM56R launch contract sample keeps cluster policy and launch proof separate from the recovered recipe itself.

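That split is valuable because it makes restoration reviewable. Later readers can ask three narrow questions instead of one vague one, one per surface: which upstream-shaped plan is being recovered, which behaviors still depend on a local seam, and which cluster policy and launch proof keep the run comparable.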

What not to trust

A missing .git directory does not make the surviving tree authoritative.

Generated outputs, cached wheels, copied virtual environments, and stale build products can all make a broken tree look healthy for one launch. Restoration should therefore trust checked-in proof surfaces and canary receipts more than directory folklore.

If the rebuilt tree cannot pass the smallest meaningful canary, it is not restored yet. A one-off launch that "seems to work" is weaker evidence than a small canary whose scope is explicit.
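
One way to keep directory folklore out of the evidence is to fingerprint only the files that are not obvious build residue. The sketch below is an assumption-laden example: the directory names and suffixes in the skip lists are illustrative guesses, not a list taken from the MegaCpp tree, and `trusted_fingerprint` is a hypothetical helper.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Assumed examples of artifacts that should not count as restoration evidence.
UNTRUSTED_DIRS = {"__pycache__", ".venv", "venv", "build", "dist", ".cache"}
UNTRUSTED_SUFFIXES = {".whl", ".pyc", ".so", ".o"}


def trusted_fingerprint(root: Path) -> Dict[str, str]:
    """Map relative path -> SHA-256 for files outside the skip lists."""
    fingerprint: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Skip anything that lives under a generated or cached directory.
        if any(part in UNTRUSTED_DIRS for part in rel.parts):
            continue
        # Skip compiled or packaged outputs wherever they sit.
        if path.suffix in UNTRUSTED_SUFFIXES:
            continue
        fingerprint[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return fingerprint
```

A fingerprint like this is only an input to review. It says which files are worth comparing against a candidate base; it does not replace the canary.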

How checkpointing and restoration meet

Checkpoint recovery is part of the restoration story, not a separate concern.

If the restored tree cannot correctly interpret the checkpoint kind, the layout, or the intended resume consumer, then the tree is still missing part of its runtime contract. That is why "Checkpoint format and resume" belongs in the same reading cluster as this article: a restored code tree that cannot honestly resume is only half restored.
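
The three properties named above (kind, layout, resume consumer) can be written down as a small record and compared explicitly. This is a sketch with hypothetical field values; the real checkpoint metadata format is not described in this article, so nothing here should be read as the actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CheckpointContract:
    # Field names mirror the prose; the values used with them are hypothetical.
    kind: str
    layout: str
    resume_consumer: str


def contract_mismatches(expected: CheckpointContract,
                        found: CheckpointContract) -> List[str]:
    """List every field where the restored tree disagrees with the checkpoint."""
    mismatches: List[str] = []
    for field in ("kind", "layout", "resume_consumer"):
        want, got = getattr(expected, field), getattr(found, field)
        if want != got:
            mismatches.append(f"{field}: expected {want!r}, found {got!r}")
    return mismatches


if __name__ == "__main__":
    expected = CheckpointContract("distributed", "sharded", "trainer")
    found = CheckpointContract("distributed", "flat", "trainer")
    for line in contract_mismatches(expected, found):
        print(line)
```

An empty mismatch list is a precondition for a resume canary, not a substitute for running it.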

FAQ

Frequently asked questions

Can I treat the surviving working tree as the source of truth if it still launches?
No. A tree without Git history can still contain stale generated files, half-applied local edits, or environment artifacts that hide real drift.
What actually proves restoration succeeded?
The smallest launch or resume canary that exercises the restored training path, plus evidence that the canary matches the documented base, patch surface, and launch contract.
What exactly is the launch contract here?
It is the explicit launcher/runtime policy that keeps the restored lane comparable to the original one: process layout, key runtime toggles, and the small set of policy decisions the canary is supposed to preserve.
How do I choose the upstream base when the old commit graph is gone?
Start from the smallest stable surfaces that still explain runtime behavior: model and launcher arguments, dependency pins, checkpoint expectations, and the bounded patch surface you know you still need. Hashing or diffing those surfaces against candidate upstream snapshots is more trustworthy than trying to recreate the full history from memory, because the right base is the one that minimizes unexplained local delta while still matching the launch contract.
Which checked-in files should I read first when rebuilding the lane?
Start with NAM56R Megatron plan sample, NAM56R runtime patch surface sample, and NAM56R launch contract sample. Then keep Distributed debugging notes nearby so the canary stays narrow and mechanism-focused.
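
The base-selection rule from the FAQ (minimize unexplained local delta while still matching the launch contract) can be sketched as a small ranking function. Everything here is a hypothetical illustration: the `Candidate` record, the fingerprint format (relative path mapped to content hash), and the boolean launch-contract flag are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Candidate:
    name: str
    fingerprint: Dict[str, str]      # relative path -> content hash
    matches_launch_contract: bool    # assumed to be decided elsewhere


def unexplained_delta(recovered: Dict[str, str],
                      candidate: Dict[str, str],
                      patch_surface: Set[str]) -> Set[str]:
    """Paths that differ from the candidate and are not known local patches."""
    paths = set(recovered) | set(candidate)
    differing = {p for p in paths if recovered.get(p) != candidate.get(p)}
    return differing - patch_surface


def choose_base(recovered: Dict[str, str],
                candidates: List[Candidate],
                patch_surface: Set[str]) -> Optional[Candidate]:
    """Pick the contract-matching candidate with the least unexplained delta."""
    eligible = [c for c in candidates if c.matches_launch_contract]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda c: len(unexplained_delta(recovered, c.fingerprint, patch_surface)),
    )
```

Candidates that fail the launch contract are excluded before ranking, so a snapshot that happens to diff cleanly but cannot reproduce the original launch policy never wins.
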
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

Debugging

A grounded guide for debugging Modal failures in MegaCpp: cold starts, multi-GPU hangs, image drift, detached collector issues, and volume or…

Runtime boundaries

Why MegaCpp pays first-compile and recompile costs in exchange for steady-state throughput, and the operational rules that keep torch.compile,…