Landing the Mamba 3 + Transformer Interleave Ratio: What the Ablations Told Us to Throw Away
How the hybrid layer pattern for our C++ specialist converged: AEMEAEDE versus dense versus GDN, what the NAM52 and NAM56R ablations settled, and the features we cut on the data.

The interesting question is not "should we use a hybrid" - that one was answered by every frontier lab independently the year before we got here. The interesting question is "which ratio, in which order, at our model size, for C++". This post walks through the ablation work that actually settled that question for our nanochat POC, names the patterns we kept and the ones we cut, and is honest about which decisions were made on loss curves and which were made on training stability.
All runs referenced below use our internal preset names (NAM52, NAM56R, variants of both). Shapes are concrete even when the names are internal.
The Candidate Space
For a decoder-only backbone with Mamba 3 token-mixers and Transformer attention blocks, the interleaving question has a handful of live axes:
- Attention-to-Mamba ratio: A:M somewhere between 1:2 and 1:9. Pure A and pure M are the trivial endpoints.
- Placement: which depths carry attention. Early, middle, or late third of the stack, or a spread.
- Block pattern: whether we run attention and Mamba as separate blocks, or inside a Nemotron-style A/M/E separated-block pattern where MLP lives in its own block.
- Block variants on top of that: Engram, mHC, DSA (exact-token sparse attention as a per-layer choice), MTP.
We built a 20-variant ablation matrix on TPU v6e-x4 at 4k context for rapid iteration, then narrowed to the handful that survived and reran the survivors on H200 at NAM52 and NAM56R shapes. The matrix lives in the v4 architecture note; the short version is that dense and MoE baselines, multiple attention-to-Mamba ratios, several routing variants, and the full feature-stack (Engram, mHC, DSA, MTP, MoE) were compared against each other at a uniform step budget. Evaluation is an LLM-as-a-judge C++ review pipeline over generated completions plus standard loss.
What the 100-Step Results Said
The operational win was the H200 100-step AdamW sweep, because it told us which patterns were even stable before we spent compute on loss curves. Nine patterns finished 100 steps; the ones with concrete numbers in our notes:
| Run | Preset | Final Loss | gnorm | Tok/sec | Status |
|---|---|---|---|---|---|
| r3_adamw (dense baseline) | nam52_h200_dense_no_mtp_v1 | 5.43 | 0.57 | 508 | BEST |
| r3_ref-aw | nam52_h200_dense_ref_v1 | 6.84 | 0.76 | 629 | OK |
| r4_mtp1 | nam52_h200_dense_mtp1_v1 | 6.79 | 0.79 | 644 | OK |
| r4_gdn6 (GDN6 AEMEDAEME) | nam52_gdn6_h200_dense_v1 | 6.67 | 0.91 | 618 | OK |
| r4_fullstack | nam52_fullstack_h200_dense_v1 | 6.84 | 0.91 | 518 | OK |
| r3_gdn-aw (GDN, no mamba) | nam52_gdn_nomamba_h200_dense_v1 | 6.88 | 1.11 | 511 | OK |
| r4_hybrid_adamw (AEMEAEDE) | nam52_hybrid_md_h200_dense_v1 | 7.06 | 0.72 | 512 | OK |
| r3_dyt-aw (Dynamic Tanh) | nam52_h200_dense_dyt_v1 | 8.02 | 241.0 | 647 | UNSTABLE |
| r3_attn-aw (AttnRes) | nam52_h200_dense_attnres_v1 | 25.91 | 18.9M | 652 | DIVERGED |
Several things are settled by this table.
First, at 100 steps on 4k context the dense Transformer baseline is ahead of every hybrid variant. That is expected - the hybrid's advantage is at long context, and at 4k the Mamba layers are spending their O(N) cheapness on sequence lengths that do not exercise it. The dense baseline's 5.43 at 508 tok/sec is the number every hybrid has to beat, and none of them did at this shape.
Second, the gap between variants is smaller than the gap to divergence. AttnRes diverged (loss 25.91, gnorm 18.9M) and DyT went unstable (gnorm 241) at the same config where AEMEAEDE, GDN6, fullstack, and AdamW dense all converged cleanly. Architecture experiments have to get past "does this converge at our LR schedule" before they get to compete on loss, and some of them didn't. We cut DyT and AttnRes from the tree right here.
Third, Muon was a universal instability source at NAM52 scale. Every --muon run NaN'd at step 4 or 5 (one made it to step 32) except r7_splitqkv_hybrid, which combined split-QKV with the AEMEAEDE hybrid and landed at loss 4.23 after 50 steps - the best Muon run we had, and the best overall number at 50 steps. The implication for the hybrid ratio question: the AEMEAEDE pattern is not just a loss number, it is the only pattern we have where Muon trains at all. If we want Muon's compute efficiency, the hybrid pattern is a prerequisite, not a preference.
The Pattern We Kept
After the ablation sweep and the Muon stability test, the pattern we carried into the longer runs is Mamba-majority with a minority of attention blocks biased toward the middle and later third of the network. Concrete shape, at our production NAM56R depth:
- Roughly 7 Mamba 3 layers per 1 attention layer. Attention is a minority everywhere.
- Attention placement biased to middle + late third. Early layers embed tokens and accumulate local state; attention there is wasted. By the middle of the network, representations are abstract enough that attention lookups hit meaningful keys, and the quadratic cost is paid against features that justify it.
- Nemotron-style separated blocks (A / M / E independent), which lets us apply module-specific optimizer routing and gradient checkpointing independently per block. Mamba blocks are skipped by our selective checkpoint policy; attention blocks are not.
- MIMO rank 4 on the Mamba side, one group (ngroups=1), chunk_size=16, RoPE fraction 1.0. Numbers come from the AuthorMamba3Config contract; see the parallel-performance post for the geometry.
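As a sketch of what "Mamba-majority, attention biased middle-and-late" means mechanically, here is a hypothetical helper that builds a per-layer pattern at roughly 7 M per 1 A. The function name and the exact placement rule are illustrative, not the production config:

```python
# Hypothetical sketch of the layer-pattern construction described above:
# Mamba-majority at roughly 7:1, attention slots spread over the middle
# and late two-thirds of the stack. Illustrative only.

def build_layer_pattern(depth: int, attn_ratio: int = 8) -> list:
    """Return a per-layer list of 'A' (attention) or 'M' (Mamba 3) labels."""
    n_attn = max(1, depth // attn_ratio)  # ~1 attention layer per attn_ratio
    start = depth // 3                    # skip the early third entirely
    span = depth - start
    # evenly space the attention slots across the remaining depth
    attn_idx = {start + (i * span) // n_attn + span // (2 * n_attn)
                for i in range(n_attn)}
    return ['A' if i in attn_idx else 'M' for i in range(depth)]

pattern = build_layer_pattern(depth=24, attn_ratio=8)
```

At depth 24 this yields 3 attention layers, none of them in the first third of the stack, which is the shape of the bias described above.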
This pattern gave us the g1k5 result - the 50-step Muon + split-QKV + hybrid run that landed at 4.23, our best number through step 50. It is also what we run in the long NAM56R training plan because it is the only variant that stays numerically sane under the optimizer we actually want to use.
What the Data Told Us to Throw Away
Being explicit about cuts matters as much as the pattern we kept, because the ablations eliminated more than they confirmed.
DyT (Dynamic Tanh) and AttnRes. Both diverged or went unstable on the same LR schedule where the base stack trained cleanly. The 2.6-point loss gap on DyT plus the 241 gnorm makes this an easy call. The b16 and ahel tracks are closed on our side.
GDN without Mamba. nam52_gdn_nomamba_h200_dense_v1 (r3_gdn-aw) converged cleanly but came in at 6.88, behind dense (5.43), behind GDN6 (6.67), and behind the AEMEAEDE hybrid (7.06). GDN as a Mamba replacement did not justify itself on this dataset; the 57go track closed.
Fullstack (Engram + A-MoD + ngram + mHC + MTP + structure + MoE, all on at once). nam52_fullstack_h200_dense_v1 trained at 6.84 loss, 518 tok/sec - no advantage over dense. "All features on" is not strictly better than a thoughtful subset. We carry A-MoD + ngram_hash + Engram forward; we drop MoDA, un-bottlenecked structure, and fullstack-as-default.
MoDA. 15,426 tok/sec against the baseline's ~20,800 is a 25.8 percent throughput tax for a routing variant that fullstack did not pay back in loss. The "MoDA detach fix" was still in-tree when the bench ran and has since landed, but the throughput gap at fixed architecture matters more than the bug note. Out.
Structure embeddings at full width. Adopted only after a separate optimization pass cut them from 5 separate nn.Embedding tables to one unified table with offsets, added a low-rank bottleneck (structure_bottleneck_dim=64), and replaced softmax+mask with learned-scale weighting - 3 kernel launches instead of ~15. The full-width variant was throwing FLOPs at a low-rank signal.
ngram_hash. Same pattern. 16 separate tables unified into one nn.Embedding with offsets, vectorized hash, tuned embed_dim=16 and table_size=200000. Throughput 20,838 versus baseline 20,781 - free at the tuned dimensions, 22x fewer kernel launches.
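The "one table + offsets" trick behind both of these cuts is worth making concrete. A minimal sketch, with illustrative sizes (the demo below uses a small table_size; the post's tuned values are 16 tables, table_size=200000, embed_dim=16):

```python
import torch
import torch.nn as nn

# Sketch of the unified-table trick: instead of N separate nn.Embedding
# lookups (N kernel launches), concatenate the tables into one and shift
# each table's indices by a per-table offset, so one lookup serves all N.

class UnifiedHashEmbedding(nn.Module):
    def __init__(self, n_tables=16, table_size=200_000, embed_dim=16):
        super().__init__()
        self.table = nn.Embedding(n_tables * table_size, embed_dim)
        # offsets[i] shifts table i's indices into its slice of the big table
        self.register_buffer("offsets", torch.arange(n_tables) * table_size)

    def forward(self, idx):
        # idx: (batch, seq, n_tables), values in [0, table_size)
        return self.table(idx + self.offsets)  # one lookup, not n_tables

emb = UnifiedHashEmbedding(table_size=1_000)       # small size for the demo
out = emb(torch.zeros(2, 4, 16, dtype=torch.long))  # (2, 4, 16, 16)
```

Index 0 of table 1 resolves to row 1000 of the unified table, which is what makes the single fused lookup equivalent to 16 separate ones.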
MoE placement. We keep MoE but ablations favored MoE on alternate layers over MoE on every layer at our FLOP budget, and a 64-expert top-4 + shared expert over both the 8-expert legacy config and 128-expert top-8. Ultra-fine expert parallelism did not pay off at our size. The shared expert paired well with Engram, matching the MoE+Engram synergy hypothesis.
The Optimizer Wiring That Actually Mattered
Two things we were not expecting to matter for "interleaving ratio" ended up being prerequisites for any hybrid result at all.
First, Muon does not train non-2D parameters. Mamba introduces 1D params (A_log, dt_bias, D, conv1d.bias) and 3D params (conv1d.weight at shape (conv_dim, 1, d_conv)). Newton-Schulz orthogonalization requires 2D matrices; pushing conv1d.weight through Muon crashes. The fix is an ndim != 2 filter that routes non-2D params to AdamW. Without it the hybrid lane is dead on arrival.
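A minimal sketch of that filter, assuming a plain partition by tensor rank (the real routing may carry more per-group metadata):

```python
import torch

# ndim != 2 routing: Newton-Schulz orthogonalization only makes sense for
# 2D weight matrices, so everything else (Mamba's 1D A_log, dt_bias, D,
# conv1d.bias, and the 3D depthwise conv1d.weight) falls back to AdamW.

def split_params_for_muon(model: torch.nn.Module):
    muon_params, adamw_params = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (muon_params if p.ndim == 2 else adamw_params).append(p)
    return muon_params, adamw_params

# Toy stand-in: a 2D linear weight plus a 3D depthwise conv and two biases.
m = torch.nn.Sequential(torch.nn.Linear(8, 8),
                        torch.nn.Conv1d(8, 8, 4, groups=8))
muon, adamw = split_params_for_muon(m)
```

On this toy module, only the linear weight is Muon-eligible; the two biases and the (8, 1, 4) depthwise conv weight all route to AdamW.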
Second, LR separation is not optional. The repo's tuning has distinct LRs for lm_head (0.004 * scale), embeddings (0.2 * scale), resid (0.005), and x0 (0.5). Merging any of these into one group is a 50x LR tax on lm_head and a near-instant divergence. The Mamba-AdamW group has to be its own entry in adam_groups. The 4.23-at-50 Muon+split-QKV+hybrid run preserved those groups verbatim; merged runs died by step 5.
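A sketch of what that group separation looks like in code. The LRs mirror the numbers above (lm_head 0.004 * scale, embeddings 0.2 * scale, x0 0.5, everything else 0.005); the name-matching rules and the toy model are stand-ins for the real routing:

```python
import torch

# Illustrative AdamW group wiring: each named bucket gets its own LR entry,
# and anything unmatched (Mamba params, resid) lands in the 0.005 group.
def build_adam_groups(model: torch.nn.Module, scale: float = 1.0):
    buckets = {"lm_head": 0.004 * scale, "embed": 0.2 * scale, "x0": 0.5}
    assigned, groups = set(), []
    for key, lr in buckets.items():
        params = [p for n, p in model.named_parameters() if key in n]
        assigned.update(id(p) for p in params)
        groups.append({"params": params, "lr": lr})
    rest = [p for _, p in model.named_parameters() if id(p) not in assigned]
    groups.append({"params": rest, "lr": 0.005})  # its own entry, never merged
    return groups

class Toy(torch.nn.Module):  # stand-in model, not the real architecture
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(10, 4)
        self.mamba_mixer = torch.nn.Linear(4, 4)
        self.lm_head = torch.nn.Linear(4, 10)

groups = build_adam_groups(Toy())
```

Collapsing these groups into one is exactly the 50x lm_head LR tax described above: lm_head would inherit the 0.2 embedding LR instead of its own 0.004.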
Initialization, Because It Killed A Week
Two initialization cuts that came out of the review cycle and would have silently degraded every hybrid run:
A_log range. The official Mamba 2 Simple init draws A ~ Uniform(1, 16) and stores A_log = log(A) so that A = -exp(A_log) ends up in [-16, -1]. Our earlier proposals drew A_log ~ Uniform(-log(64), -log(1)), which produces A in [-64, -1] - four times the decay rate. The practical effect is that at the high-decay tail, the SSM state forgets within one token, and the model looks like an expensive causal pointwise operator. We now match the official init.
conv1d.weight init. An earlier attempt used a linear-style uniform_(-s, s) with s = sqrt(3)/sqrt(n_embd) ~ 0.048. The correct fan-in for a depthwise 1D conv with d_conv=4 gives s ~ 0.866 under Kaiming uniform. Linear init is 18x too small; conv1d starts as near-zero, and the convolutional path contributes nothing for the first thousand steps. We now leave conv1d at PyTorch's default, which is already Kaiming.
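The arithmetic behind the 18x claim, for the record. The width n_embd = 1280 is an assumption chosen to reproduce the post's s ~ 0.048; the exact width is not stated here:

```python
import math

# For a depthwise 1D conv each output channel sees only its own channel,
# so fan_in = d_conv, and the Kaiming-uniform bound is sqrt(3 / fan_in).
# The broken init used the linear fan-in (n_embd) instead.

d_conv, n_embd = 4, 1280                 # n_embd is an assumed width
s_linear = math.sqrt(3.0 / n_embd)       # linear-style bound: ~0.048
s_conv = math.sqrt(3.0 / d_conv)         # depthwise fan-in bound: ~0.866
ratio = s_conv / s_linear                # ~18x
```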
Neither of these is about "hybrid ratio" in the naive sense, but both are reasons the hybrid pattern works in our hands. Architecture is not just a layer order; it is also the initialization contract every block has to satisfy.
The PyTorch vs Official Mamba 3 Delta
To be concrete: we adopted twelve features from the official Mamba 3 release, all behind config flags, all backward-compatible, all covered by regression tests. Input-dependent A, MIMO at rank 4, optional removal of conv1d, learned angle rates, single-pass trapezoidal, per-head bias with init 1.0, learned RMSNorm for B/C, QK-dot skip connection, rope_fraction, output norm group size, Triton SISO kernel, and mod-2-pi angle wrapping.
The three critical gaps we closed were input-dependent A, MIMO, and optional conv1d removal. What remains different from upstream is kernel-level: MIMO in our implementation runs a PyTorch loop over ranks where upstream runs a fused TileLang kernel, and RMSNormGated runs through nn.RMSNorm where upstream uses a fused Triton path. Those are perf differences, not correctness differences, and they exist specifically because we wanted the XLA/TPU lane to work.
What we have that upstream does not: TPU/XLA support via torch_xla.experimental.scan, document-boundary masking inside the scan and the conv and the RoPE, Nemotron-style separated blocks, MoE + MoD + Engram + mHC + DSA + MTP integration, selective gradient checkpointing that skips Mamba, and a torch.compile wrapper for the Triton kernels under inductor.
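To make "all behind config flags" concrete, here is a hypothetical flag surface for the adopted features. Only ngroups, chunk_size, and rope_fraction are names taken from this post; the rest are illustrative stand-ins, with defaults matching the adopted settings where stated:

```python
from dataclasses import dataclass

# Hypothetical config sketch; field names other than ngroups, chunk_size,
# and rope_fraction are invented for illustration.
@dataclass
class Mamba3BlockConfig:
    input_dependent_A: bool = True   # critical gap closed: input-dependent A
    mimo_rank: int = 4               # critical gap closed: MIMO at rank 4
    use_conv1d: bool = True          # critical gap closed: conv1d is removable
    ngroups: int = 1
    chunk_size: int = 16
    rope_fraction: float = 1.0
    per_head_bias_init: float = 1.0

cfg = Mamba3BlockConfig()
```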
What Comes Next
Two things are still open on the interleaving question.
The first is long-context ablation. The 100-step H200 sweep ran at the 4k-context shape because it is cheap and parallelizable. The hybrid's actual win is at 16k - 64k context on the v4 context-graph packer's Callers -> Target -> Callees snippets. That sweep is queued; we expect the hybrid-to-dense loss gap to invert once the sequence length starts exercising the O(N) advantage.
The second is MIMO-scan asymmetry between training and decode. The official fused kernel uses a single shared state across the R ranks during training, while our PyTorch loop uses R independent states. Decode correctly uses a shared state. That train-decode asymmetry may be leaving loss on the table; closing it requires a custom scan kernel, which is on the roadmap behind the TileLang P1 and PsiV work from the parallel-performance post.
The honest summary is that the ratio, the placement, and the optimizer wiring are all settled for now: Mamba-majority, attention biased middle-and-late, Nemotron-separated blocks, MIMO rank 4, Muon on 2D params with AdamW on the rest, LR groups preserved. Everything else in the ablation matrix is either shipping (MoE 64+1, A-MoD, Engram, structure with bottleneck, ngram_hash unified) or cut (DyT, AttnRes, GDN-no-mamba, MoDA, structure at full width, 128-expert MoE, fullstack-as-default). The shape of the model is what the data said to keep after we threw away what the data said to throw away.
References
- mamba_integration_log.md
- mamba_review_followup_plan.md
- mamba3_adoption_report_2026-03-18.md
- v4_architecture.md
- nanochat_cpp_model.md
- architecture_and_eval_en.md
- docs/design/03-model-architecture.md
- docs/design/14-structure-aware-attention-and-feature-integration-plan.md
- docs/design/13-gated-attention-v1-spec.md
- CHANGELOG.md
- CURRENT_STATE.md