MegaCpp EngineeringApplied C++ model systems

</>

Article

Grounded engineering note from the MegaCpp stack

Published April 18, 20266 min readMegaCpp Engineering

Context Parallel

Sequence Parallel

Long Context

Tensor Parallel

Context Parallel and Sequence Parallel: Similar Names, Different Jobs

Q: Is sequence parallelism enough for long-context training?

No. SP is a local TP-layout optimization. It reduces activation duplication inside TP-aware execution, but it does not by itself solve global sequence ownership once context length becomes the dominant limit.

Q: What if tensor parallelism and expert parallelism are both enabled?

Megatron Core treats sequence parallelism as required when EP and TP are combined. The operational reason is that expert dispatch leaves the dense TP layout and the return path still has to restore a clean sequence-sharded activation contract for the next TP-aware block.

An explanation of SP versus CP using TP-aware helpers, long-context bring-up patterns, and hybrid model design.

By MegaCpp Engineering

MegaCpp

Focused on applied C++ model engineering

Article Preview

Published April 18, 2026•6 min read•MegaCpp Engineering

Context Parallel and Sequence Parallel: Similar Names, Different Jobs

Reader-first version: SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample is the "keep TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding activations sharded longer" tool, while CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample is the "one rank should not own the whole context" tool. The shortest local proof surfaces are 3D parallelism sample, Long-context examples guide, and Chunk boundary remap sample.

The confusion is understandable. Both features mention the sequence dimension, both can reduce memory pressure, and both become visible in long-context discussions. But the cleaner distinction is the operational one. In Megatron-style tensor-parallel code, sequence-parallel behavior shows up directly in TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding helper calls such as scatter_to_sequence_parallel_region and gather_from_sequence_parallel_region. Context-aware long-context lanes show up in model recipes and distributed bring-up notes where the question is no longer "should this TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding shard keep a shorter local activation view?" but "can this run avoid a full-context residency problem at all?"

That distinction matters even more once the model stops being a pure attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns stack. The project vocabulary already uses A, M, E, and R for attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, Mamba, expert, and recurrent families, with local names like ablockQuick term guideablockThe attention-heavy block family in MegaCpp's A/M/E/R notation.GroundingAbout: SLM architecture Example: block taxonomy sample, mblockQuick term guidemblockThe state-space or Mamba-family block in MegaCpp's A/M/E/R notation.GroundingAbout: SLM architecture Example: block taxonomy sample, eblockQuick term guideeblockThe expert / MoE block family in MegaCpp's A/M/E/R notation.GroundingAbout: SLM architecture Example: block taxonomy sample, and rblockQuick term guiderblockThe recurrent tail block family in MegaCpp's A/M/E/R notation.GroundingAbout: SLM architecture Example: block taxonomy sample. Pattern strings such as AEMEAEMEAEMRQuick term guideAEMEAEMEAEMRA concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.GroundingAbout: MegaCpp model glossary Example: NAM56R pattern composition sample Example: NAM56R Megatron plan sample are not decorative. They are a reminder that different block families stress memory, collectives, and token layout in different ways. SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample and CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample touch that stress map at different layers of the stack.

Sequence Parallelism Lives Inside TP-Aware Execution

The clearest evidence for SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample is the boring, practical kind: helper placement. In public Megatron-CoreQuick term guideMegatron CoreThe NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.GroundingAbout: Porting to Megatron friction About: Nemotron-style recipe as pure Megatron CLI Example: Mamba3 TP mixer sample style code, embedding activations are scattered to the sequence-parallel region when the relevant config is enabled, and the same family of helpers appears around TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding-aware flow. That is not the signature of a global long-context strategy. It is a local contract about where activations should remain sharded so tensor-parallel execution does not immediately pay back all of its memory savings at the next layer boundary.

The important part is not the existence of one scatter call. It is the implied lifetime of the sharded layout. If activations are scattered and then gathered again one operator later, SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample turns into ceremony. If the layout persists across enough compute to avoid redundant activation residency, then the helper calls are doing real work.

if self.config.sequence_parallel:
    embeddings = tensor_parallel.scatter_to_sequence_parallel_region(embeddings)

That pattern is a strong clue about ownership. SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample belongs near TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding-aware module boundaries, collective utilities, and embedding or projection paths that can tolerate a sequence-sharded view. It is mostly about activation economics and sharding discipline, not about making extreme context lengths possible by itself.

Mechanism	Scale of decision	Primary goal	Typical companion
Sequence parallelism	Local layer/layout contract on the TP axis	Lower activation residency under TP	Tensor parallelism
Context parallelism	Whole-sequence placement strategy across ranks	Make long contexts fit and scale	Long-context attention or other CP-aware sequence lanes
Early gather	Boundary decision	Restore full sequence semantics	Often the main cost leak

This is why people get misled by the common phrase "both split the sequence." Technically true, operationally incomplete.

Context Parallelism Starts Where Full-Sequence Residency Breaks the Plan

CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample belongs in that second category. It is a response to global sequence pressure, not only to local TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding memory duplication. When the sequence gets long enough, the main design question shifts from "did we shard activations cleanly within TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding?" to "where does the sequence live, how often do we need the full view, and which block families truly require that global materialization?"

This is especially visible in hybrid systems. An A block stresses attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns state and KV movement. An M block often changes the cost model because recurrence-like state can reduce or reshape sequence pressure. An E block adds routing, exchange, and expert residency concerns that are orthogonal to the attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns layout itself. An R block changes state propagation again. A pattern like AEMEAEMEAEMRQuick term guideAEMEAEMEAEMRA concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.GroundingAbout: MegaCpp model glossary Example: NAM56R pattern composition sample Example: NAM56R Megatron plan sample therefore does not imply one clean answer for "sequence splitting." It implies a per-family analysis of when local sequence sharding is enough and when the global sequence footprint becomes the dominant constraint.

The Real Cost Is Usually the Gather Boundary

Both SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample and CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample can be undermined by premature gathers. This is the main conceptual bug, and it is more common than explicit math mistakes. People say they are running sequence-sharded execution, but the implementation reconstructs a full token view at the next normalization, projection, or convenience wrapper. At that point, the optimization is real on paper and mostly absent in residency.

Helper placement is useful because it exposes that boundary explicitly. You can ask a concrete question: where is the next gather_from_sequence_parallel_region, and is it semantically necessary? If the answer is "we just needed a simpler downstream API," that is not an architectural requirement. That is a local abstraction leak.

The same discipline should be applied to CP. A long-context layout only earns its keep if full-context materialization is rare and justified. If an allegedly CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample-aware path rebuilds the whole sequence around every major block family, then the system is paying the coordination cost of CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample without preserving the main benefit.

Taxonomy correction: SP, CP, DCP, and recompute are not the same thing

Sequence parallelismQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample is a tensor-parallel companion. It keeps activations sharded across TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding ranks for longer and is usually discussed together with TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, not as an alternative to it. In Megatron-style MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack lanes, EP + TP usually implies SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample; Megatron CoreQuick term guideMegatron CoreThe NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.GroundingAbout: Porting to Megatron friction About: Nemotron-style recipe as pure Megatron CLI Example: Mamba3 TP mixer sample makes that dependency explicit in the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack docs so the dense activation layout stays compatible with expert dispatch and return.

Context parallelismQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample is a long-context distribution strategy. It partitions sequence work across ranks when full-context residency becomes the bottleneck. In Megatron CoreQuick term guideMegatron CoreThe NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.GroundingAbout: Porting to Megatron friction About: Nemotron-style recipe as pure Megatron CLI Example: Mamba3 TP mixer sample that includes both attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-oriented CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample documentation and an explicit mamba_context_parallel API for state-space models, so CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample is not limited to transformer-only paths. CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample is still not the same thing as SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample: CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample changes sequence ownership across ranks, while SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample changes activation layout inside TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding-aware execution.

Distributed checkpointing, often abbreviated DCPQuick term guideDCPDistributed checkpoint planning and restore contract used by the Megatron-style save-and-resume lane.GroundingAbout: Checkpoint format and resume History: Restoration without git history Example: FSDP sharding sample, is not a compute-parallel axis at all. It is a state save and restore mechanism.

Activation recompute, selective checkpointing, and similar memory-saving techniques are also not parallel axes. They trade extra compute for lower activation residency, but they do not define where model, sequence, or expert work is partitioned.

What Good Integration Looks Like

A good SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample integration leaves obvious evidence in helper placement, avoids unnecessary gathers, and does not pretend to solve the global long-context problem. A good CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample integration is visible in training ladders, topology decisions, and module contracts that can operate on partitioned sequence ownership without constantly asking for a dense global view.

That is the practical takeaway from the code and docs. Use SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample to make TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding-aware execution economically sane. Use CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.GroundingAbout: parallelism map overview Example: chunk boundary remap sample when the sequence itself is large enough that one-rank ownership is the wrong abstraction. In hybrid stacks, especially ones that mix A, M, E, and R families, be explicit about where each contract begins and ends.

FAQ

Frequently asked questions

Is sequence parallelism enough for long-context training?+

No. SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP. is a local TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.-layout optimization. It reduces activation duplication inside TP-aware execution, but it does not by itself solve global sequence ownership once context length becomes the dominant limit.

What if tensor parallelism and expert parallelism are both enabled?+

Megatron CoreQuick term guideMegatron CoreThe NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges. treats sequence parallelismQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP. as required when EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size. and TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node. are combined. The operational reason is that expert dispatch leaves the dense TP layout and the return path still has to restore a clean sequence-sharded activation contract for the next TP-aware block.

What should I inspect first if the memory win disappears?+

Inspect the next gather boundary. The fastest way to lose the benefit of SPQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP. or CPQuick term guideCPContext parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job. is to rebuild a full token view too early through a convenience gather.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

Grounding

Context parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.

Grounding

Sequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.

Grounding

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

Grounding

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

Grounding

DCP

Distributed checkpoint planning and restore contract used by the Megatron-style save-and-resume lane.

Grounding

A / M / E / R

MegaCpp shorthand for the four main block families: attention, Mamba/state-space, expert/MoE, and recurrent tail layers.

Grounding

AEMEAEMEAEMR

A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.

Grounding

ablock

The attention-heavy block family in MegaCpp's A/M/E/R notation.

Grounding

mblock

The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.

Grounding

eblock

The expert / MoE block family in MegaCpp's A/M/E/R notation.

Grounding

rblock

The recurrent tail block family in MegaCpp's A/M/E/R notation.

Grounding

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Grounding

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Grounding

Topic hubs

Entity Hub

Megatron Parallelism and Layout Boundaries

A curated Megatron reading path: the parallelism map, what actually splits, how NVIDIA and TPU wrappers differ, and the migration surfaces around NAM56R-style layouts.

MegaCpp Engineering • MegaCppMore posts →

Context Parallel and Sequence Parallel: Similar Names, Different Jobs

Context Parallel and Sequence Parallel: Similar Names, Different Jobs

Sequence Parallelism Lives Inside TP-Aware Execution

Context Parallelism Starts Where Full-Sequence Residency Breaks the Plan

The Real Cost Is Usually the Gather Boundary

Taxonomy correction: SP, CP, DCP, and recompute are not the same thing

What Good Integration Looks Like

Read next

References

Frequently asked questions

Terms used in this article

Continue with a curated reading path

Megatron Parallelism and Layout Boundaries