Manual Splits and What They Cost
A grounded look at explicit pipeline boundaries, pipe-delimited patterns, weighted partitioning, and the maintenance cost of forcing stage shapes by hand in hybrid attention, MoE, and recurrent stacks.

Manual splits are sometimes the only way to make a hybrid stack trainable, especially when attention, MoE, recurrent blocks, and auxiliary embeddings do not partition cleanly. But they carry a tax: every explicit stage boundary becomes a maintenance contract for RoPE state, embeddings, loss heads, metadata, and schedule assumptions. The right use of manual splits is tactical, not ideological.
When people discuss pipeline parallelism, they often skip over the part that actually breaks systems: deciding where the split points go. A clean transformer with repeated identical blocks can be partitioned by count. A real hybrid stack cannot. Once the model contains different block families, optional side inputs, expert-heavy layers, and recurrent segments, a naive equal split becomes a proxy for "hope the runtime will sort it out later." It usually will not.
That is why MegaCpp keeps a real path for explicit boundaries. The important detail is not merely that the model can be partitioned. The important detail is that the runtime acknowledges two different partitioning modes: automatic partitioning and pipe-delimited explicit boundaries in the pattern string. That makes the split decision visible, auditable, and debuggable.
The local Pipeline parallel sample shows the automatic and weighted stage builder, while the DualPipe stage contract sample shows the stricter output-lifetime and auxiliary-loss rules that can make an explicit boundary part of the schedule contract, not just the layer map.
Why explicit boundaries exist at all
The pipeline runtime in MegaCpp does not hide what a stage really owns. create_pipeline_stage builds a stage from concrete layer spans, attaches embeddings to the first stage, attaches the head to the last stage, and wires in RoPE buffers, optional n-gram or structure embeddings, and stage-local window sizes. That is already enough to explain why manual boundaries matter. A stage is not just "some layers." It is a bundle of responsibilities.
The runtime contract around pipe-delimited nem_pattern supports this directly. If explicit PP boundaries are provided, the runtime expects the delimiters to yield exactly as many segments as the requested number of stages. That sounds strict because it is strict. Manual splits are effectively part of the program.
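To make that contract concrete, here is a minimal sketch of the kind of check it implies. The function names, the error messages, and the restriction to the four letters A, M, E, and R are assumptions for illustration; this is not the MegaCpp parser.

```python
from typing import List, Tuple

# Assumption: only the four block-family letters named in this article are legal.
VALID_BLOCKS = set("AMER")


def split_pattern(pattern: str, num_stages: int) -> List[str]:
    """Split a pipe-delimited pattern into one block string per stage."""
    segments = pattern.split("|")
    if len(segments) != num_stages:
        raise ValueError(
            f"pattern has {len(segments)} segments but {num_stages} stages were requested"
        )
    for index, segment in enumerate(segments):
        if not segment:
            raise ValueError(f"stage {index} is empty")
        unknown = set(segment) - VALID_BLOCKS
        if unknown:
            raise ValueError(f"stage {index} has unknown block letters: {sorted(unknown)}")
    return segments


def layer_spans(segments: List[str]) -> List[Tuple[int, int]]:
    """Turn per-stage block strings into half-open global layer index ranges."""
    spans, start = [], 0
    for segment in segments:
        spans.append((start, start + len(segment)))
        start += len(segment)
    return spans


if __name__ == "__main__":
    stages = split_pattern("AEME|AEME|AEMR", num_stages=3)
    print(stages, layer_spans(stages))
```

Failing loudly on a mismatch, rather than silently rebalancing, is what keeps the explicit boundary auditable.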
| Split mode | Benefit | Cost |
|---|---|---|
| Automatic equal partition | Minimal setup | Ignores heterogeneous layer weight |
| Weighted automatic partition | Better first approximation | Still heuristic |
| Manual pipe-delimited boundaries | Exact operator ownership | Permanent maintenance burden |
The reason teams still use manual boundaries is simple: hybrid stacks can be very asymmetric. An E block with expert routing, a heavy M block, and a dense A block do not cost the same thing.
Patterns like AEMEAEMEAEMR are more than notation
The pattern notation matters because it encodes where asymmetry comes from. In this naming, A means attention, M means Mamba-style state-space layers, E means expert/MoE, and R means recurrent. That is already more operationally useful than saying "a mixed architecture." It tells you why the split problem is hard.
A pattern like AEMEAEMEAEMR is not just a decorative label. It predicts that the partitioning problem will not be uniform across depth. Some stages will want attention-side buffers and RoPE-heavy behavior. Some will hit expert routing and dispatch collectives. Some will end with recurrent state handling. If you split that model only by raw layer count, you are pretending these blocks impose identical runtime cost. They do not.
The repo language around ablock, mblock, eblock, rblock, and cblock is useful here because it gives teams a practical vocabulary for stage composition. Even when the exact training schedule evolves, the notation helps preserve one key truth: the model contains qualitatively different segments, not just repeated blocks.
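A small sketch shows how an equal-count split hides that asymmetry. The per-block cost weights below are invented for illustration, not measured MegaCpp numbers; real weights would have to come from profiling.

```python
from typing import Dict, List

# Hypothetical relative per-block costs (assumption, not profiled data).
ASSUMED_COST: Dict[str, float] = {"A": 1.0, "M": 1.5, "E": 3.0, "R": 2.0}


def equal_count_split(pattern: str, num_stages: int) -> List[str]:
    """Split by raw layer count, giving any remainder to the earliest stages."""
    base, extra = divmod(len(pattern), num_stages)
    stages, start = [], 0
    for index in range(num_stages):
        size = base + (1 if index < extra else 0)
        stages.append(pattern[start:start + size])
        start += size
    return stages


def stage_costs(stages: List[str], cost: Dict[str, float] = ASSUMED_COST) -> List[float]:
    """Sum the assumed per-block cost for every stage."""
    return [sum(cost[block] for block in stage) for stage in stages]


if __name__ == "__main__":
    stages = equal_count_split("AEMEAEMEAEMR", num_stages=4)
    for stage, total in zip(stages, stage_costs(stages)):
        print(f"{stage}: {total}")
```

Every stage gets three layers, yet under these assumed weights the stage that happens to hold two E blocks is the most expensive one, and the slowest stage sets the pace of the pipeline.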
The hidden cost is stage-local plumbing
The biggest mistake in discussions of manual splits is treating them as if the cost were only balancing FLOPs. In MegaCpp, the more persistent cost was plumbing. Every explicit boundary decides where side data must exist and where it must not.
The first stage owns token embedding and several optional input-side embeddings. The last stage owns the head and also receives auxiliary pieces such as MTP-related references when enabled. RoPE buffers are provided broadly enough to survive resume and split changes. Relation-bias handling needs access on all stages that compute the corresponding additive bias, which is the same ownership wrinkle explored in Structure embeddings and relation bias.
That means a manual split changes more than layer count. It changes where the runtime must preserve non-obvious state.
```yaml
# schematic pattern with explicit PP boundaries
nem_pattern: "AEME|AEME|AEMR"
pipeline_parallel_size: 3
weighted_pipeline_split: true
```
The value of an explicit pattern like this is that it makes the decision inspectable. The cost is that every future modifier now has to preserve its assumptions.
| Stage concern | Why manual split affects it |
|---|---|
| Token embedding | Must stay on the first stage |
| LM head | Must stay on the last stage |
| RoPE buffers | Need consistency across resumes and stage changes |
| Structure or platform embeddings | First-stage ownership matters |
| Relation bias | Can require cross-stage awareness |
| Aux losses / MTP hooks | Typically last-stage anchored |
This is why manual splits often feel worse over time than they did on day one. The initial change is easy. The ongoing burden is keeping all of this aligned as the model evolves.
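One way to keep those concerns aligned is to write the ownership rules down as data that can be checked. The sketch below is a hypothetical manifest, not MegaCpp's stage builder; in particular, tying relation bias to stages that contain an A block is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StageOwnership:
    token_embedding: bool
    input_side_embeddings: bool
    lm_head: bool
    aux_loss_hooks: bool
    rope_buffers: bool
    relation_bias: bool


def ownership_for(stage_index: int, segments: List[str]) -> StageOwnership:
    """Derive what a stage must hold from its position and its block letters."""
    is_first = stage_index == 0
    is_last = stage_index == len(segments) - 1
    return StageOwnership(
        token_embedding=is_first,
        input_side_embeddings=is_first,
        lm_head=is_last,
        aux_loss_hooks=is_last,
        # RoPE buffers are provided broadly so they survive split changes.
        rope_buffers=True,
        # Assumption: only stages with attention blocks compute the additive bias.
        relation_bias="A" in segments[stage_index],
    )


if __name__ == "__main__":
    segments = "AEME|MEME|AEMR".split("|")
    for index in range(len(segments)):
        print(index, ownership_for(index, segments))
```

A manifest like this can be diffed whenever a delimiter moves, which turns "did this boundary change who owns the relation bias?" into a mechanical question.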
Weighted automatic partitioning is better than equal split, but not enough
The runtime also includes weighted partitioning for MoE-aware layouts. That is important because it shows the team did not jump straight from naive splitting to hard-coded splits. There is a middle ground: heuristics that recognize some layers cost more than others.
That middle ground is useful, but it does not eliminate the need for explicit boundaries. Weighted partitioning helps when asymmetry is broad and predictable. It helps less when the topology itself matters. For example, if a stage needs to end before a recurrent transition, or if you want expert-heavy blocks clustered away from a fragile communication boundary, the problem is not just weight. It is semantics.
So the real decision tree looks like this:
- Start with automatic partitioning when the model is homogeneous enough.
- Use weighted partitioning when different block families have meaningfully different cost.
- Use manual boundaries when topology or ownership rules matter more than heuristics.
That ordering is important because manual splits should be the last resort that remains explicit, not the first tool used out of habit.
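The weighted middle step in that ordering can be sketched as a contiguous partition that minimizes the cost of the most expensive stage. This is a generic dynamic program under assumed per-layer costs, not the heuristic the MegaCpp runtime uses; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def weighted_split(costs: Sequence[float], num_stages: int) -> List[Tuple[int, int]]:
    """Return half-open layer spans that minimize the maximum per-stage cost."""
    n = len(costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need between 1 and len(costs) stages")
    prefix = [0.0]
    for value in costs:
        prefix.append(prefix[-1] + value)

    inf = float("inf")
    # best[k][i]: smallest achievable max-stage cost for the first i layers in k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the stage boundaries.
    bounds, i = [n], n
    for k in range(num_stages, 0, -1):
        i = cut[k][i]
        bounds.append(i)
    bounds.reverse()
    return [(bounds[s], bounds[s + 1]) for s in range(num_stages)]


if __name__ == "__main__":
    layer_costs = [1, 1, 1, 1, 4, 4]  # four light layers, then two heavy ones
    print(weighted_split(layer_costs, num_stages=2))
```

Note what the sketch cannot express: it only sees numbers. A rule such as "end this stage before the recurrent transition" has no representation in a cost vector, which is exactly where explicit boundaries take over.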
Manual boundaries can also freeze old assumptions
Another cost of explicit splits is that they preserve history, not just intent. A boundary that made sense before a model changed may become the wrong boundary after side features, compile strategy, or expert implementation change. The pipeline code comments about interleaving and schedule equivalence are a reminder that schedulers evolve. What counted as a balanced stage at one point can become a bad stage later.
The same is true for MoE behavior. The expert path in MegaCpp includes variable-split dispatch, compile-disabled outer routing, and logic that avoids padded equal-split behavior where it wastes memory. If the cost profile of eblocks changes, the old split might still be syntactically valid while being operationally bad.
This is the tax manual splits impose: they make topology explicit, but they also make topology sticky.
What manual splits are good for
Despite all of that, manual splits are not a mistake. They are often exactly the right tool in hybrid systems. They are good for preserving intentional stage ownership when the model has non-uniform blocks. They are good for aligning expensive collectives away from fragile boundaries. And they are good for making the partitioning decision inspectable when debugging a pipeline schedule.
The key is to be honest about the price. Manual splits are a control surface, not a simplification. Every explicit delimiter in a pattern string is a promise that the surrounding runtime assumptions still hold.
That is why the best use of manual splits in MegaCpp was not "we prefer hand tuning." The best use was "the model is heterogeneous enough that implicit heuristics are no longer a sufficient explanation."
The real cost is organizational, not just technical
The deepest cost is that manual splits create long-lived operational knowledge. New contributors have to understand why the split exists, what side effects it protects, and which invariants they must re-check when adding a new block family. If that knowledge is not written down, the split degrades into superstition.
That is why pattern notation and explicit stage construction matter so much. They convert a hidden arrangement into a debuggable contract. The contract is still costly, but at least it is legible.
In practice, that is the trade worth making. When the stack is simple, let heuristics win. When the stack becomes AEMEAEMEAEMR with real ablock, eblock, mblock, and rblock asymmetry, use manual splits deliberately and assume they are part of the architecture, not just part of the launch script.
Schedule mechanics are part of the split cost
The comments in the pipeline runtime around virtual pipeline parallelism are a useful reminder that a split is also a schedule decision. The implementation references Megatron-style interleaving and the relationship between schedule tables, forward stage index, and rounds of microbatches. That means a manual split is never only about parameter placement. It also affects how bubbles, warmup, and flush behavior are experienced by the actual runtime.
This becomes especially important in heterogeneous models. If one stage contains an eblock-heavy segment and another stage contains lighter ablocks, a schedule that is technically valid can still create synchronization pressure or idle windows that are hard to diagnose just from high-level metrics. Manual splits can correct that, but the correction itself becomes one more piece of scheduling knowledge that has to stay alive across future changes.
The checked-in DualPipe stage contract sample makes the strict version explicit. DualPipeV treats pipeline degree and explicit boundaries as one joint contract: a pp_degree of N implies 2N total stage slots, non-terminal stages still have to preserve auxiliary-loss signals, and stage outputs cannot be freed early. A split that looks valid under plain PP can still be the wrong split once an overlapped or interleaved schedule owns the lane.
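The 2N-slot rule can be illustrated with a short sketch. It assumes the V-shape placement in which rank 0 holds the first and last slots, rank 1 the second and second-to-last, and so on; the function names are hypothetical and the real stage contract sample may enforce more than this.

```python
from typing import Dict, List, Tuple


def dualpipev_slot_owners(pp_degree: int) -> List[int]:
    """For each of the 2N stage slots, return the rank that owns it (V-shape)."""
    total = 2 * pp_degree
    return [slot if slot < pp_degree else total - 1 - slot for slot in range(total)]


def place_pattern(pattern: str, pp_degree: int) -> Dict[int, List[Tuple[int, str]]]:
    """Map a pipe-delimited pattern onto ranks, requiring exactly 2N segments."""
    segments = pattern.split("|")
    if len(segments) != 2 * pp_degree:
        raise ValueError(
            f"pp_degree={pp_degree} needs {2 * pp_degree} segments, got {len(segments)}"
        )
    placement: Dict[int, List[Tuple[int, str]]] = {rank: [] for rank in range(pp_degree)}
    for slot, rank in enumerate(dualpipev_slot_owners(pp_degree)):
        placement[rank].append((slot, segments[slot]))
    return placement


if __name__ == "__main__":
    print(place_pattern("AE|ME|AE|MR", pp_degree=2))
    try:
        # Valid for plain PP=3, but not a legal layout for pp_degree=3 here.
        place_pattern("AEME|AEME|AEMR", pp_degree=3)
    except ValueError as err:
        print("rejected:", err)
```

The three-segment pattern that is fine under plain PP is rejected outright, which is the mechanical form of "valid under plain PP, wrong under an overlapped schedule."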
Resume, refactor, and feature growth all make old splits worse
A split that is correct for one revision can become subtly wrong after a refactor. The stage builder comments about resuming from a different split are especially revealing because they admit the problem directly. Some buffers and references are passed broadly not because that is elegant, but because the runtime needs enough continuity to survive stage-layout changes.
That is the long-term operational cost of manual boundaries. They pin architectural intent at a moment in time. If later work adds new side embeddings, changes recurrent-state handling, or moves an auxiliary loss surface, every explicit split has to be re-evaluated. Otherwise the code still runs, but the split becomes a fossil from an older model.
Checkpoint and resume policy make that more concrete. A split is not only a forward-pass decision; it is also part of the saved ownership story. The same state that has to survive a refactor also has to survive restart, which is why the local continuation in Checkpoint format and resume treats stage-layout changes as something the runtime must carry explicitly rather than infer after the fact.
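As an illustration of carrying the layout explicitly, the sketch below records the split alongside a checkpoint and, on resume, reports which layers changed stage. The record format and function names are assumptions for this example, not the MegaCpp checkpoint format.

```python
from typing import Dict, List, Tuple

Placement = Tuple[int, int]  # (stage index, index of the layer within that stage)


def layout_record(pattern: str) -> Dict[str, object]:
    """Describe a split so it can be stored next to the checkpoint."""
    segments = pattern.split("|")
    layer_to_stage: List[Placement] = [
        (stage, local)
        for stage, segment in enumerate(segments)
        for local in range(len(segment))
    ]
    return {
        "pattern": pattern,
        "blocks": "".join(segments),
        "layer_to_stage": layer_to_stage,
    }


def plan_resume(saved: Dict[str, object], current_pattern: str) -> Dict[int, Tuple[Placement, Placement]]:
    """Return {global layer: (old placement, new placement)} for layers that moved."""
    current = layout_record(current_pattern)
    if saved["blocks"] != current["blocks"]:
        raise ValueError("block sequence changed; this is not just a re-split")
    moves = {}
    pairs = zip(saved["layer_to_stage"], current["layer_to_stage"])
    for layer, (old, new) in enumerate(pairs):
        if old != new:
            moves[layer] = (old, new)
    return moves


if __name__ == "__main__":
    saved = layout_record("AEME|AEME|AEMR")
    print(plan_resume(saved, "AEM|EAEME|AEMR"))
```

Separating "the same blocks in a different split" from "a different model" is the distinction a resume path needs before it can decide whether remapping state is even meaningful.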
The right habit is to treat manual splits as versioned architecture, not a temporary launch tweak.
Terms used in this article
- PP (pipeline parallelism): cuts the model by depth, so each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble that shrinks roughly in proportion to 1/microbatches, so you need many microbatches per step to amortize it. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.
- RoPE: Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.
- DualPipe: Bidirectional pipeline schedule: forward chunks from one end and backward chunks from the other end of the pipeline run concurrently and meet in the middle, overlapping F / B / weight-grad work. Same per-GPU layer ownership as plain PP (each GPU still owns its stage); only the order of compute and activation-send changes. Benefit: the pipeline bubble shrinks versus standard 1F1B, so throughput recovers without changing where weights live. Cost: trickier scheduler logic, and peak activation memory stays similar to plain PP.
- Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
- AEMEAEMEAEMR: A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.
- ablock: The attention-heavy block family in MegaCpp's A/M/E/R notation.
- mblock: The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.
- MoE: Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
- eblock: The expert / MoE block family in MegaCpp's A/M/E/R notation.
- rblock: The recurrent tail block family in MegaCpp's A/M/E/R notation.
- DualPipeV: DualPipe V-shape variant: each physical GPU owns two virtual stages from opposite ends of the pipeline (GPU 0 holds the first and last blocks, GPU 1 the second and second-to-last, etc.) instead of one contiguous slice. Benefit: halves per-GPU peak activation memory at the same GPU count, because two half-depth stages keep fewer microbatches in flight than one full-depth stage. Cost: more complex scheduling and non-contiguous weight placement, which is useful when plain DualPipe's peak activation memory is what's blocking you.
- scheduler: The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.
- Mamba: A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
- Megatron: Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…