MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 2 min read · David Gornshtein
Migration
Megatron
Nemotron
Porting Policy

Migration policy: native Megatron vs narrow custom seams

Why MegaCpp ports only what Megatron or Nemotron do not already provide, and why ambiguous mappings should fail closed instead of being reinterpreted silently.

The easiest way to make a migration story unreadable is to port everything. A clean migration policy does the opposite: reuse native Megatron or Nemotron surfaces where they are real, and keep only the irreducible local seams custom.

MegaCpp's migration policy is useful because it states that boundary directly. It prefers translation layers, fail-closed mappings, and narrow local seams instead of one large downstream fork.

What the policy is actually buying

This is not only a code-organization preference. It makes the stack easier to verify and easier to explain publicly.

  • native surfaces stay close to upstream docs and runtime behavior
  • custom seams remain enumerated and auditable
  • ambiguous mappings stop early instead of silently drifting

That is why the translator, recipe, MLA adapter, and recurrent mixer examples belong together. They are all examples of the same rule: keep the custom seam as small as possible and make it obvious where it begins.
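The fail-closed rule can be sketched as a small pattern translator. This is an illustrative sketch, not MegaCpp's actual translator: the symbol table, the target names, and `UnsupportedBlockError` are all hypothetical, and the A/M/E/R meanings (attention, Mamba, expert, recurrent) are assumed from the block-family shorthand.

```python
class UnsupportedBlockError(ValueError):
    """Raised when a pattern symbol has no declared handoff."""


# Hypothetical symbol tables. The target names are placeholders,
# not real Megatron or Nemotron identifiers.
NATIVE_HANDOFF = {
    "A": "native_attention",
    "M": "native_mamba",
    "E": "native_moe",
}
CUSTOM_SEAMS = {
    "R": "custom_recurrent_mixer",
}


def translate_pattern(pattern: str) -> list[tuple[str, str]]:
    """Map each block symbol to (path, target); refuse anything unknown."""
    plan = []
    for pos, sym in enumerate(pattern):
        if sym in NATIVE_HANDOFF:
            plan.append(("native", NATIVE_HANDOFF[sym]))
        elif sym in CUSTOM_SEAMS:
            plan.append(("custom", CUSTOM_SEAMS[sym]))
        else:
            # Fail closed: never substitute a nearby block family.
            raise UnsupportedBlockError(
                f"unknown block symbol {sym!r} at position {pos}; refusing to guess"
            )
    return plan
```

The sketch has no default branch. An unknown symbol stops the translation instead of being mapped to the closest supported block.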

The maintenance payoff is that every custom surface has to name its owner. A pattern translator can refuse an unknown block, an MLA adapter can stay scoped to the handoff Megatron already understands, and a recurrent mixer can remain a separate spec instead of pretending to be a native attention block. That is a smaller promise than a full fork, but it is also easier to audit when upstream adds a real surface later.
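One way to keep custom seams enumerated and owned is a small registry that rejects anonymous entries. The sketch below is a hypothetical illustration; `CustomSeam`, `SeamRegistry`, and the field names are assumptions made for this example, not part of the MegaCpp stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomSeam:
    name: str    # e.g. "recurrent_mixer" (illustrative)
    owner: str   # who answers for this seam
    reason: str  # why no native surface covers it yet


class SeamRegistry:
    """Enumerates custom seams so the full list can be audited."""

    def __init__(self) -> None:
        self._seams: dict[str, CustomSeam] = {}

    def register(self, seam: CustomSeam) -> None:
        if not seam.owner.strip():
            raise ValueError(f"seam {seam.name!r} must name an owner")
        if seam.name in self._seams:
            raise ValueError(f"seam {seam.name!r} is already registered")
        self._seams[seam.name] = seam

    def graduate(self, name: str) -> CustomSeam:
        """Drop one seam once upstream provides a native surface for it."""
        return self._seams.pop(name)

    def audit(self) -> list[str]:
        """Return the names of all remaining custom seams, sorted."""
        return sorted(self._seams)
```

Because `graduate` removes a single entry, one seam can become native without touching the others.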

The review order should stay mechanical: first keep supported tokens on the native path, then place experimental features in named receipts, and only then justify a custom seam. The fail-closed translation sample, feature placement sample, and shared MLA adapter sample show that split without turning the migration policy into a broad compatibility layer.
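The review order above can be written as an ordered check. This is a minimal sketch under stated assumptions: the `Placement` enum, the `place` function, and the feature names in the usage lines are hypothetical and only mirror the three steps described in the text.

```python
from enum import Enum


class Placement(Enum):
    NATIVE = "native"
    EXPERIMENTAL = "experimental"
    CUSTOM_SEAM = "custom_seam"


def place(
    feature: str,
    native: set[str],
    experimental: set[str],
    seam_justifications: dict[str, str],
) -> Placement:
    """Apply the review order: native first, experimental next, seam last."""
    if feature in native:
        return Placement.NATIVE
    if feature in experimental:
        return Placement.EXPERIMENTAL
    # A custom seam is accepted only with a written justification.
    if seam_justifications.get(feature, "").strip():
        return Placement.CUSTOM_SEAM
    raise LookupError(f"no placement for {feature!r}; failing closed")


# Illustrative inputs (all names are assumptions):
native = {"attention", "mla"}
experimental = {"ngram_enrichment"}
seams = {"recurrent_mixer": "no native recurrent block spec"}
print(place("recurrent_mixer", native, experimental, seams))
```

The order of the checks matters. A feature that is already native never reaches the seam branch, even if someone has also written a justification for it.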

FAQ

Frequently asked questions

What happens when a block family has no native handoff?
The migration should keep the supported symbols on the native handoff, preserve the unsupported family as an explicit seam, and fail closed instead of substituting a nearby attention, dense, or expert block. The fail-closed pattern translation sample shows the refusal behavior, while the NAM56R block taxonomy sample shows the recurrent tail named as custom rather than hidden inside the recipe translator.
When is a native handoff not a custom seam?
When upstream already documents the feature as a runtime surface, the migration should not wrap it just to keep the story symmetrical. MLA is the useful example: Megatron Core exposes a --multi-latent-attention path, so the MegaCpp side should treat that as a native handoff unless a real compatibility gap appears.
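As a rough illustration of treating MLA as a native handoff, the sketch below passes the flag through and does not wrap it. Only `--multi-latent-attention` comes from the text above. The function name, its parameters, and the `compat_gap` convention are hypothetical.

```python
from typing import List, Optional


def mla_launch_args(use_mla: bool, compat_gap: Optional[str] = None) -> List[str]:
    """Return launcher arguments for MLA without adding a local wrapper.

    compat_gap is a hypothetical hook: a non-empty description means the
    native path is not sufficient and an explicit seam must be declared.
    """
    if not use_mla:
        return []
    if compat_gap:
        # Fail closed: a real gap needs a named seam, not a silent workaround.
        raise NotImplementedError(
            f"MLA compatibility gap requires an explicit seam: {compat_gap}"
        )
    return ["--multi-latent-attention"]
```

The function adds nothing of its own on the native path. If a gap is reported, it stops and does not reinterpret the request.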
How should experimental memory features graduate?
Keep them behind named seams until the handoff is real. The M2RNN memory article treats the matrix-state mixer as a narrow Megatron spec boundary, while the feature placement sample keeps n-gram enrichment and recurrent blocks in separate slots. That split makes it clear which pieces are native reuse, which pieces are explicit custom seams, and which pieces are still feature-side experiments.
Why not keep a broad downstream fork until upstream catches up?
Because a fork makes every future upstream feature look like a merge conflict. The narrower policy keeps adapters small enough to delete: fail-closed pattern translation, MLA handoff, and recurrent-mixer specs can each graduate or stay custom independently. If one seam becomes native, the rest of the recipe does not need to move with it.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
