MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 10 min readMegaCpp Engineering
Megatron
Tensor Parallel
Pipeline Parallel
MoE
Mamba
NAM56R

What Megatron Can and Cannot Split

A grounded look at split-friendly and split-hostile model surfaces: TP, SP, PP, EP, recurrent state, side embeddings, and why some boundaries remain architectural rather than automatic.

MegaCpp
Focused on applied C++ model engineering
Article Preview
What Megatron Can and Cannot Split
Published 10 min readMegaCpp Engineering

MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks is excellent at partitioning regular dense computation. It is much less magical on heterogeneous ownership surfaces such as expert routing, recurrent state transitions, side-channel embeddings, and topology-sensitive pipeline boundaries. The practical skill is not maximizing how much gets split, but deciding which surfaces should be split aggressively and which should remain explicit architectural boundaries.

MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks-style systems are compelling because they offer a powerful vocabulary for scale: tensor parallelismQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding splits weight matrices, sequence parallelismQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel changes activation ownership on the TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding axis, context parallelismQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample Reference: context parallel and sequence parallel changes long-sequence ownership across ranks, pipeline parallelismQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample partitions depth, and expert parallelismQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding distributes expert banks. On slides that can look close to universal. In code it is not universal at all.

The right question is therefore not "can MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks split the model?" The right question is "which surfaces preserve their meaning when split?" The answer is visible in current runtimes because the stack supports TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel, PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample, optional virtual staging, expert-distribution modes, and hybrid family layouts. Reading those code paths makes one thing clear: regular dense math is naturally split-friendly, but control-heavy or topology-bearing surfaces still need explicit ownership rules.

What Megatron splits very well

The strongest targets are the ones the framework was effectively designed around: repeated dense projections and repeated depth. AttentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: public-safe MLA integration patterns Reference: shared MLA adapter boundaries projections, dense MLP projections, and the inner body of standard blocks all have predictable shapes and communication patterns. Those properties are exactly what tensor and sequence parallelismQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel want.

That is why TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding and SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel remain the most transferable parts of the MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks story. They work best when the runtime can treat the computation as regular linear algebra with known collective boundaries. CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample Reference: context parallel and sequence parallel solves a different problem: not dense matrix ownership, but long-context sequence ownership. The dense path does not become trivial, but it stays legible enough for compiler transforms, sharding strategies, and checkpoint rules to agree.

Public distributed-training notes reflect that orientation. They are full of explicit decomposition logic, but the places where decomposition is cleanest are the places where shapes and ownership are stable. Likewise, the training configuration exposes TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel, PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample, virtual-stage, and expert-distribution flags because these are not theoretical axes. They are the real knobs the runtime can use when the underlying structure is cooperative.

Surface Split quality Why it works
Dense attention projections Excellent predictable shapes and collectives
Dense MLP projections Excellent same regularity as attention projections
Repeated layer depth Strong PP and virtual staging map well when boundaries are meaningful
Embedding and output head edges Good, but explicit works once edge ownership is fixed

That is the part people usually mean when they say MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks scales well. They are not wrong. They are just describing the regular half of the system.

Pipeline splitting still depends on architectural topology

Pipeline parallelismQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample is often described as "split the layers by depth." That description is incomplete. Real pipeline splitting is only valid after the system decides what each stage owns besides the obvious repeated block range.

Modern runtimes make this explicit. The runtime has to reason not only about how many layers land on each stage, but also about which stage owns embeddings, heads, RoPEQuick term guideRoPERotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.GroundingAbout: fused MLA on NVIDIA History: long context and attention sinks Reference: shared MLA adapter boundaries-related state, auxiliary embeddings, and family-specific support modules. That means PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample is not merely arithmetic partitioning. It is architectural partitioning.

This is one reason pattern notation matters. A family like NAM56RQuick term guideNAM56RA concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.GroundingAbout: NAM56R Megatron translation About: MegaCpp model glossary Example: NAM56R Megatron plan sample is not just a depth number. AEMEAEMEAEMRQuick term guideAEMEAEMEAEMRA concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.GroundingAbout: MegaCpp model glossary Example: NAM56R Megatron plan sample Example: NAM56R pattern composition sample implies heterogeneous block roles across the stack. If you cut depth carelessly, you may preserve layer count while damaging ownership semantics. A stage boundary that is fine for dense ablockQuick term guideablockThe attention-heavy block family in MegaCpp's A/M/E/R notation.GroundingAbout: SLM architecture Example: block taxonomy sample repetition may be awkward if it slices through an eblockQuick term guideeblockThe expert / MoE block family in MegaCpp's A/M/E/R notation.GroundingAbout: SLM architecture Example: block taxonomy sample routing-heavy region or an rblockQuick term guiderblockThe recurrent tail block family in MegaCpp's A/M/E/R notation.GroundingAbout: SLM architecture Example: block taxonomy sample state-carrying region.

The practical lesson is simple: PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample works best when the runtime exposes the topology rather than hiding it. Weighted partitioning, explicit stage boundaries, and family-aware placement are signs of maturity, not of failure. Edge-heavy stages can justify fewer ordinary layers because embedding or output-head ownership is already real stage work before the repeated body is counted.

The runtime-facing continuation of that point is DualPipe and 3D parallelism on NVIDIA. Once the model is stage-cut, embeddings, heads, auxiliary losses, and microbatch accounting stop being background detail and become part of the split contract.

Dense repeated body   -> split by regular depth rules
Family-specific edges -> assign explicitly
Topology-heavy seams  -> make visible instead of pretending they are generic

That block describes what the current stack is already doing in spirit, even when the exact implementation details keep evolving.

Expert weights split cleanly; expert routing does not

MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack is the clearest example of the difference between split-friendly math and split-hostile control flow. Expert parallelismQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding is excellent at distributing expert banks. Once the heavy feed-forward weights are organized per expert, sharding them across workers is natural. That is the good part.

The hard part is routing. Token assignment to experts is data dependent. Load is uneven. Communication often requires all-to-all behavior or carefully managed dispatch/combination steps. None of that disappears just because expert weights were partitioned successfully.

The training configuration exposes several MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack-related choices precisely because the runtime has to care about these distinctions. The stack needs to know not only whether MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack is enabled, but also how expert data parallelismQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP and expert tensor parallelismQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding are configured, whether specific distribution or packingQuick term guidePacked rowsWhy packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…GroundingPacked rows as the real training contract Tokenized enriched packed rows on TPU: feeding structure to XLA without recompiles modes are legal, and how those choices interact with checkpointing and recomputation. Those are routing-and-ownership concerns, not just GEMM concerns. The important systems win is overlap: dispatch improvements matter when permutation and return traffic can hide under adjacent micro-batch work, not merely when expert weights are sharded.

MoE surface Split status Real limitation
Expert weights Very splittable natural EP target
Expert compute kernels Often highly optimizable still governed by kernel contracts
Token routing Only partially abstractable live token distribution drives communication
Zero-token expert handling Must stay explicit autograd and reduction assumptions can break

This is why "MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks solves MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack scaling" needs to be read narrowly. It solves a major fraction of the expert-weight problem. It does not erase routing as a systems problem.

Recurrent and Mamba-style blocks resist the usual split intuition

Hybrid families complicate the picture further because not every block is attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: public-safe MLA integration patterns Reference: shared MLA adapter boundaries-shaped. NAM56RQuick term guideNAM56RA concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.GroundingAbout: NAM56R Megatron translation About: MegaCpp model glossary Example: NAM56R Megatron plan sample keeps A, M, E, and R roles visible for a reason. The M and R surfaces carry different semantics from a standard attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: public-safe MLA integration patterns Reference: shared MLA adapter boundaries block, and those semantics matter when deciding what can be partitioned mechanically.

A recurrent transition cares about state evolution across steps. A MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode-style mixer may have fused and split execution paths, internal convolution/scan structure, and different performance sensitivities from attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: public-safe MLA integration patterns Reference: shared MLA adapter boundaries. Those are not reasons to avoid splitting altogether. They are reasons to avoid pretending that the same split rule applies everywhere.

This is where many architecture discussions go wrong. They ask whether the framework can split the model, when the real question is whether the state semantics survive the split. Sometimes they do. Sometimes the clean answer is to split the dense projections around the block while leaving the state-bearing contract explicit. If an R block crosses a virtual pipeline boundary, the recurrent state has to cross that boundary with the activations too; otherwise the split is only cosmetic.

That is not a weakness of the framework. It is an admission that architecture still exists.

That is why the smallest honest description of these blocks is not "unsplittable," but "partially split with an explicit state contract." The checked-in TP-aware Mamba mixer sample shows the familiar tensor-parallel story still applying to in_proj, local head and group slices, and the out_proj rejoin, while helper surfaces such as angle_proj stay replicated on purpose. That is a useful correction to the usual oversimplified rule: some projection edges still shard cleanly, but the state-bearing handoff remains architectural.

Side-channel embeddings and auxiliary inputs are ownership problems first

Another class of surfaces that resists generic partitioning is side-channel input. The stack contains optional structure-aware embedding paths and other non-default input enrichments. Those are not naturally split-first abstractions. They are representational features whose placement has to be designed.

If an auxiliary embedding should exist only on the first stage, or should remain visible after a stage reshuffle, or has to preserve checkpoint compatibility when the split geometry changes, then the problem is no longer only about tensor shapes. It is about ownership.

This is why it is misleading to ask whether MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks can split "the whole model." It can split many compute surfaces inside the model. It cannot automatically invent the ownership rules for every auxiliary channel. Someone still has to define where those modules live, who materializes them, and how checkpoint state is interpreted.

Checkpointing and recompute expose the same boundary

One of the most useful signs of the true split boundary is that checkpoint and recompute logic tend to fail in the same places that naive partitioning fails. Regular dense regions are easier to shard, checkpoint, and replay. Irregular ownership regions accumulate edge cases.

The training configuration is valuable here because it makes many of those interactions explicit: options around pipeline shape, virtual stages, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack distribution, compile behavior, and related runtime toggles all affect what can be recomputed cleanly and what must remain visible to the runtime. A system does not need that many knobs if everything is truly generic.

That does not mean the design is bad. It means the design is honest. Once the model combines dense math, expert routing, recurrent state, and side inputs, the runtime needs declared boundaries.

Category Usually safe to split aggressively? Usually needs extra ownership logic?
Dense projections Yes rarely
Repeated depth body Yes, with stage planning sometimes
Expert banks Yes yes, around routing
Long-context token ownership Yes, when the model is CP-aware yes, around sequence exchange and gathers
Recurrent/stateful surfaces Sometimes often
Side-channel embeddings Sometimes usually

The recurring pattern is obvious: regular compute loves splitting, heterogeneous ownership does not.

The best systems do not maximize splitting; they maximize useful splitting

The strongest design lesson from the repo is that the right goal is not universal automatic partitioning. The right goal is useful splitting plus visible exceptions. Dense math should be split aggressively. Expert weights should be split aggressively. Routing, state transitions, and edge ownership should remain explicit where necessary.

That is a more mature target than trying to force every surface into a generic decomposition rule. It is also more maintainable. Engineers can debug a system where exceptions are visible and justified. They struggle in systems that claim everything is generic while quietly encoding special cases everywhere.

The same point also explains why hybrid-family notation is worth preserving. If a report mentions NAM52 or NAM56RQuick term guideNAM56RA concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.GroundingAbout: NAM56R Megatron translation About: MegaCpp model glossary Example: NAM56R Megatron plan sample, that is not branding. It is a reminder that the stack contains different block families with different split behavior. A pattern string like AEMEAEMEAEMRQuick term guideAEMEAEMEAEMRA concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.GroundingAbout: MegaCpp model glossary Example: NAM56R Megatron plan sample Example: NAM56R pattern composition sample is useful because it signals where automatic assumptions stop being safe.

What Megatron does not split for you

MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks can partition dense tensors, pipeline stages, expert groups, and long-context work when the ownership model is explicit. It does not erase the need to define architectural boundaries.

Distributed checkpointing, often shortened to DCPQuick term guideDCPDistributed checkpoint planning and restore contract used by the Megatron-style save-and-resume lane.GroundingAbout: Checkpoint format and resume History: Restoration without git history Example: FSDP sharding sample, is not a compute split axis. It governs how state is saved and restored.

Activation recompute and checkpointing are also not partitioning modes. They change the memory and compute tradeoff, but they do not decide where parameters, tokens, or experts live.

Two taxonomy corrections matter in practice. First, SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel and CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample Reference: context parallel and sequence parallel are not interchangeable token-splitting names. SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel is a TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding companion for activation layout; CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample Reference: context parallel and sequence parallel is a long-context ownership strategy. Second, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding is not a substitute for TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding or PP. EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding distributes expert banks and routed-token transport; TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding and PPQuick term guidePPPipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.GroundingAbout: parallelism map overview Example: pipeline parallel sample Example: pipeline activation sample still own dense math and depth placement.

The remaining hard cases are the ones where ownership is conditional or family-specific: routed control flow, auxiliary losses, cross-stage side channels, recurrent state handoff, and expert-local bookkeeping. Those still need explicit contracts even when the dense path is fully under MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks control.

So what can Megatron split, and what can it not?

MegatronQuick term guideMegatronWhy lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…GroundingPorting To Megatron-Core Is Harder Than It Looks can split regular dense computation extremely well. It can split expert banks well. It can split depth well once stage boundaries are chosen meaningfully. It cannot automatically dissolve routing complexity, recurrent state semantics, or auxiliary-input ownership into linear algebra.

That is the boundary that matters in practice. Once you accept it, the framework stops fighting the architecture. The runtime can do what it is good at on regular regions and leave the hard ownership surfaces explicit. That is not a compromise. It is what real scalability looks like when the model is not uniform.

FAQ

Frequently asked questions

Why is pipeline parallelism not just equal layer counts?+
Because stage boundaries also own embeddings, heads, and family-specific seams. A lighter stage can need fewer ordinary layers simply because its non-block ownership is already expensive.
Why doesn't expert routing become "solved" once expert weights shard cleanly?+
Because routing is live control flow, not just static weight placement. Expert banks can shard cleanly, but token assignment, all-to-all traffic, shared-expert paths, and auxiliary-loss accounting still depend on the current batch and schedule. Expert parallel and MoE sharding and the checked-in Expert-parallel routing sample keep that boundary concrete.
What does virtual staging buy you, and what does it not remove?+
It can improve depth partitioning and bubble economics on regular stacks, but it does not erase family-specific seams. Recurrent state, side-channel embeddings, auxiliary losses, and MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble. routing still need explicit ownership across the stage cut. DualPipe and 3D parallelism on NVIDIA is the runtime follow-on, with the checked-in Pipeline parallel sample as the smallest local reminder that stage count alone is not the contract.
Is distributed checkpointing another split axis?+
No. PyTorch Distributed CheckpointQuick term guideDCPDistributed checkpoint planning and restore contract used by the Megatron-style save-and-resume lane. saves and loads distributed state_dicts in SPMD style and requires the target model's sharding information at load time so it can reshard state onto a different trainer count or parallel layout. That makes checkpointing the mechanism that preserves ownership choices across restarts, not the mechanism that decides how forward compute is split. Checkpoint format and resume is the local continuation of that boundary.
Are sequence parallelism and context parallelism interchangeable?+
No. SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP. is a tensor-parallel companion for activation layout inside TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.-aware execution, while CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job. is a long-context ownership strategy across ranks. They can coexist, but they solve different bottlenecks, which is why the clean local follow-ons are Context parallel and sequence parallel and Sequence, context, and expert splits.
Why is context parallelism not one stable layout rule in a hybrid stack?+
Because a hybrid A/M/R lane can change what "correct token ownership" means from one block family to the next. An attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.-oriented block can stay sequence-sharded across CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job. ranks, while a state-bearing path may need a different handoff contract around recurrent state or chunk-boundary semantics before the next block runs. The split is still real, but the hard part is the explicit layout transition between block families rather than one permanent "CP layout" that survives the whole stack. The local follow-ons are Context parallel and sequence parallel, Sequence, context, and expert splits, and Porting to Megatron: where the friction really came from.
Which local files show the split-friendly versus split-hostile boundary most directly?+
Use the checked-in Pipeline parallel sample for regular depth splits, Expert-parallel routing sample for routed control flow, and MoE auxiliary-loss collection sample for the stage-local bookkeeping that generic matrix splitting does not remove.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

PP

Pipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.

EP

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

SP

Sequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.

CP

Context parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.

DCP

Distributed checkpoint planning and restore contract used by the Megatron-style save-and-resume lane.

DualPipe

Bidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP — each GPU still owns its stage — only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic and peak activation memory stays similar to plain PP.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

A / M / E / R

MegaCpp shorthand for the four main block families: attention, Mamba/state-space, expert/MoE, and recurrent tail layers.

AEMEAEMEAEMR

A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.

ablock

The attention-heavy block family in MegaCpp's A/M/E/R notation.

eblock

The expert / MoE block family in MegaCpp's A/M/E/R notation.

rblock

The recurrent tail block family in MegaCpp's A/M/E/R notation.

DP

Data parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

Mamba

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

Topic hubs