Migration policy: native Megatron vs narrow custom seams
Why MegaCpp ports only what Megatron or Nemotron do not already provide, and why ambiguous mappings should fail closed instead of being reinterpreted silently.

The easiest way to make a migration story unreadable is to port everything. A clean migration policy does the opposite. Reuse native Megatron or Nemotron surfaces where they are real, and keep only the irreducible local seams custom.
MegaCpp's migration policy is useful because it states that boundary directly. It prefers translation layers, fail-closed mappings, and narrow local seams instead of one large downstream fork.
What the policy is actually buying
This is not only a code-organization preference. It makes the stack easier to verify and easier to explain publicly.
- native surfaces stay close to upstream docs and runtime behavior
- custom seams remain enumerated and auditable
- ambiguous mappings stop early instead of silently drifting
That is why the translator, recipe, MLA adapter, and recurrent mixer examples belong together. They are all examples of the same rule: keep the custom seam as small as possible and make it obvious where it begins.
The maintenance payoff is that every custom surface has to name its owner. A pattern translator can refuse an unknown block, an MLA adapter can stay scoped to the handoff Megatron already understands, and a recurrent mixer can remain a separate spec instead of pretending to be a native attention block. That is a smaller promise than a full fork, but it is also easier to audit when upstream adds a real surface later.
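A minimal sketch of what "refuse an unknown block" could look like, assuming a one-letter block pattern in the A/M/E/R shorthand. The token meanings, the `HANDOFFS` table, and every target label here are hypothetical illustrations, not MegaCpp's or Megatron's actual identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    NATIVE = "native"       # handled by an upstream surface
    CUSTOM = "custom-seam"  # handled by an enumerated local spec


@dataclass(frozen=True)
class Handoff:
    token: str
    owner: Owner
    target: str  # hypothetical spec label, not a real Megatron name


class UnknownBlockError(ValueError):
    """Raised when a pattern token has no declared handoff."""


# Assumed mapping, for illustration only: every token names its owner.
HANDOFFS = {
    "A": Handoff("A", Owner.NATIVE, "attention"),
    "M": Handoff("M", Owner.NATIVE, "mamba"),
    "E": Handoff("E", Owner.NATIVE, "moe"),
    "R": Handoff("R", Owner.CUSTOM, "recurrent-mixer"),
}


def translate(pattern: str) -> list[Handoff]:
    """Translate a block pattern, refusing any token without a handoff."""
    plan = []
    for position, token in enumerate(pattern):
        handoff = HANDOFFS.get(token)
        if handoff is None:
            # Fail closed: stop here instead of guessing a nearby block type.
            raise UnknownBlockError(
                f"no handoff declared for token {token!r} at position {position}"
            )
        plan.append(handoff)
    return plan


def custom_seams(plan: list[Handoff]) -> list[str]:
    """List the distinct custom targets so the seams stay enumerable."""
    return sorted({h.target for h in plan if h.owner is Owner.CUSTOM})
```

Under these assumptions, `translate("AMER")` yields four handoffs with one custom seam, while a pattern such as `"AXM"` raises at the unknown token rather than being reinterpreted.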
The review order should stay mechanical: first keep supported tokens on the native path, then place experimental features in named receipts, and only then justify a custom seam. The fail-closed translation sample, feature placement sample, and shared MLA adapter sample show that split without turning the migration policy into a broad compatibility layer.
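The review order can be sketched as a short decision function. This is an illustration under stated assumptions: the `Feature` record, the `NATIVE_TOKENS` set, and the idea that a "receipt" is simply a named record are all hypothetical, chosen only to show the three checks running in a fixed order.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of tokens that already have a native path (illustrative only).
NATIVE_TOKENS = frozenset({"A", "M", "E"})


@dataclass(frozen=True)
class Feature:
    name: str
    token: Optional[str] = None          # block token, if the feature has one
    receipt: Optional[str] = None        # name of the receipt that owns it
    justification: Optional[str] = None  # why a custom seam is needed


def place(feature: Feature) -> str:
    """Apply the review order: native first, receipt second, seam last."""
    if feature.token in NATIVE_TOKENS:
        return "native"
    if feature.receipt:
        return f"receipt:{feature.receipt}"
    if feature.justification:
        return "custom-seam"
    # Nothing claimed this feature, so refuse it rather than pick a default.
    raise ValueError(f"feature {feature.name!r} has no placement")
```

Because the checks run in a fixed order, a feature that carries a supported token never reaches the custom-seam branch, and a feature with no token, receipt, or justification is rejected outright.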
Frequently asked questions
What happens when a block family has no native handoff?
When is a native handoff not a custom seam?
[…] --multi-latent-attention path, so the MegaCpp side should treat that as a native handoff unless a real compatibility gap appears.
How should experimental memory features graduate?
Why not keep a broad downstream fork until upstream catches up?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.
The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.
MLA (Multi-Latent Attention): an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.
Megatron: Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…
Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.