H200 Memory Geometry for the Hybrid Stack
How weights, gradients, optimizer state, activations, routing scratch, runtime reserve, and fragmentation stack up on one H200 device in a hybrid training stack.

On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: training speed anatomy on H200, memory fit in a hybrid stack is decided by geometry, not by a single parameter-count number. A useful capacity model splits the budget into replicated weights, tensor-parallel shards, expert-parallel shards, gradients, optimizer state, activations, routing scratch, runtime reserve, and allocator overhead, and that split is the only reliable way to predict whether a launch actually clears the HBM limit.
Why MegaCpp cares
The specialist stack is a hybrid of attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns blocks, Mamba blocks, and MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack blocks, so one device budget is shaped by several different ownership rules at once. Some tensors replicate everywhere, some split across TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, some split only across EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding, and some exist only during backward or serving. That means “the model is about this big” is not enough for launch planning. The usable question is how the full memory geometry lands on one GPU after TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding, DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP, optimizer assignment, and recompute policy have all taken effect.
The pre-flight answer should live in a capacity model that names what it counts: dense layers, recurrent or state-space blocks, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack blocks, activation checkpointing, sharding strategy, and precision format. On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: training speed anatomy on H200, 141 GB is the physical HBM figure, not the part a planner should spend to zero as model budget. A realistic estimator leaves runtime reserve and allocator headroom outside the model columns before it calls the remaining space usable. Precision choices such as BF16, FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper, and low-bit weight formats then set the byte cost of each tensor family. That is a much better starting point than a spreadsheet approximation.
What a useful estimator should expose
The central output should track categories such as replicated parameters, tensor-parallel parameter shards, expert-parallel parameter shards, gradients, optimizer state, activations, routing scratch, feature-specific activations, allocator overhead, and runtime reserve. That is more useful than a single total because it exposes which axis is actually dominant.
The first split is parameter ownership. Embeddings and norms are usually replicated. AttentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, MLP, and recurrent projection weights are often sliced by tensor parallelismQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding. Expert banks are distributed by expert parallelismQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding. A common ordering is tensor parallelQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding first, then pipeline partitioning, then state sharding, then expert placement. That ordering is the reason a serious estimator does not pretend every parameter shrinks with every degree.
That point is easy to miss in practice because hybrid models encourage people to talk in aggregate counts. But the runtime does not load an aggregate count. It loads concrete tensor families with different ownership rules. Embeddings, routers, and norms remain visible on every relevant rank. AttentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns projections can be sliced by TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding because their algebra was written for it. Expert banks can be handed to EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding because token dispatch later reconstructs the ownership boundary at runtime. The geometry is therefore not just a memory story. It is the consequence of what the execution model is legally allowed to split.
The second split is optimizer geometry. A stack does not pay one flat optimizer tax. Large matrix paths, embeddings, norms, routers, and other special-case tensors often carry different optimizer state. That matters because one change in parameter grouping can swing the optimizer column even when the raw parameter count does not move much.
The third split is the activation surface. In the estimator, ordinary activations, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack routing state, and feature activations are distinct fields. In the runtime, that matches reality. The dense-model path carries residual and normalization state through attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns blocks; the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch path builds dispatch counts, rank-local receives, and combine buffers for the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack route; and the feature side adds smaller but still real activation surfaces for optional structure-aware paths. That is why this category has to be managed explicitly: sequence parallelismQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel reduces activation memory roughly with tensor-parallel degree, and selective activation recomputation is often a better lever than blunt full checkpointing when the goal is to shrink the activation column without overpaying in extra compute.
This is also where hybrid depth matters more than people expect. A stack with more attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-heavy layers pays for a different activation mix than a stack with more Mamba or expert-heavy layers, even when total parameter count is similar. Dense attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-side residuals dominate one profile, routed-expert metadata and dispatch scratch dominate another, and sequence-state work dominates a third. The estimator's category split is valuable precisely because it lets us change architecture without losing the ability to reason about where memory is going.
The fourth split is routing scratch. Expert parallelismQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding is not just expert parameters. The dispatch path also needs token counts, receive buffers, combine buffers, and sometimes extra transport metadata. Even with sparse expert math, the routing subsystem still allocates memory, which is why it deserves its own line rather than being smeared into activations.
The fifth split is the runtime tail. Any realistic estimate includes allocator overhead and runtime reserve, and it should warn when headroom gets too small. That is the estimator admitting that launch-time behavior is not purely steady-state math. Compile peaks, collectives, allocator scars, and scratch buffers create a real tail, which is exactly why the runtime tail stays visible in the model.
The allocator and collective tail is also where "safe on average" becomes unsafe in production. A launch can look fine if you only inspect final resident state after the step settles. But startup has to pass process-group construction, scratch allocation, compile-time graph work, and the first real backward. MegaCpp treats those as part of memory geometry rather than as unrelated incidents. That is a more honest model of how H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: training speed anatomy on H200 jobs fail.
Serving adds one more dimension that trainingQuick term guideTrainingWhat actually sets training speed on H200 in public MegaCpp reporting: compile warmup policy, block mix, memory shape, and why local wins often fail…GroundingTraining speed anatomy on H200 SLM training in MegaCpp: what the stack optimizes for and what stays explicit often does not pay: KV cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack. The engine and serving stack distinguish contiguous cache and paged cache substrates, and only attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns layers consume them. In a hybrid model that interleaves non-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns layers, that matters a lot. Mamba blocks do not contribute to KV residency the way attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns blocks do, so the serving geometry of a hybrid stack is better than a pure-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns stack of equal depth.
That only stays true if the estimate carries an explicit attention-layer mapQuick term guideattention-layer mapThe layer-index remap that allocates cache slots only for attention-bearing layers in a hybrid stack instead of reserving one slot per layer unconditionally.GroundingAbout: KV cache and paged attention Example: shared MLA adapter sample Reference: Mamba3 PsiV cache scaffold example instead of silently allocating cache against total depth. In practice, serving memory should budget KV against the attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-owned layers and keep the recurrent-state pool separate for the Mamba side; otherwise a hybrid stack gets priced like a dense transformer and the estimate stops matching the real serving substrate. The local follow-through for that split is KV Cache and Paged Attention for the MegaCpp Specialist Ensemble.
The cleanest summary is to ask what scales each surface.
| Memory surface | Main scaling factors | Primary code surface | What it does not care about |
|---|---|---|---|
| Replicated params | width, vocab, replicated features | the capacity model | EP, SP, CP |
| TP-sharded params | width and TP degree | the parallelism plan | expert ownership |
| EP-sharded params | expert count, expert width, EP degree | the expert runtime and dispatch path | dense attention projections |
| Gradients | parameter geometry plus DP mode | the capacity model | serving cache layout |
| Optimizer state | optimizer assignment by tensor type | the capacity model | SP, CP |
| Activations | microbatch, sequence length, recompute policy | the dense-model runtime | EP on non-expert residuals |
| Routing scratch | routed tokens and top-k load | the expert-dispatch path | TP on dense-only layers |
| KV cache | serving batch and context length | the serving runtime | training-only backward state |
| Fragmentation and reserve | allocator and runtime behavior | estimator plus runtime-debug notes | any one sharding axis directly |
What H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: training speed anatomy on H200 buys you is enough headroom that these categories can trade against each other instead of forcing a single desperate choice. Public change notes show the pattern directly: sequence parallelismQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel shrank the activation term, selective recompute replaced full checkpointing, and a fused recurrent-convolution path removed a large transient-memory spike. Those were not isolated optimizations. They were geometry repairs.
Another way to say this is that H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: training speed anatomy on H200 changes the optimization order. On a smaller device, you often begin with emergency compression or aggressive offload just to launch. On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: training speed anatomy on H200, you can instead choose the cleaner structural fix: better sharding of the right parameter family, a narrower recompute boundary, or a reduction in transient expert-routing scratch. That tends to produce systems that are easier to reason about and easier to carry into production.
# The per-device budget surfaces that matter before launch.
Estimate(
replicated_params_gb=...,
tensor_parallel_params_gb=...,
expert_parallel_params_gb=...,
gradients_gb=...,
matrix_optimizer_gb=...,
other_optimizer_gb=...,
activations_gb=...,
routing_gb=...,
runtime_reserve_gb=...,
overhead_gb=1.5,
)
How it lands in production
The production lift is to preserve the same memory geometry in launch planning and recipe review.
That means keeping replicated tensors explicit, keeping Muon and AdamW state separate, budgeting MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack routing scratch as its own term, and refusing to spend the full nominal HBM number as if runtime reserve did not exist. It also means respecting axis-specific ownership rules. Expert banks can be EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding-sharded, but routers and embeddings do not disappear just because EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding is enabled. AttentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns and Mamba projections can be TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding-sharded, but the routing subsystem is a separate cost center.
It also means teaching the launch surface to answer the right questions. If a proposed recipe change increases expert count, the first follow-up should not be “what is the new parameter count?” It should be “which bucket grows: EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding-sharded expert banks, routed-token scratch, or both?” If a precision change is proposed, the question is not just what happens to weights; it is what happens to gradients, optimizer state, communication buffers, and whether the runtime tail now dominates instead.
Design choices that held up
Public code and documentation samples give the short list of memory decisions that held up.
The fused recurrent-convolution path stayed because it cut peak memory meaningfully compared with a naive grouped implementation. Sequence parallelismQuick term guideSPSequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.GroundingAbout: parallelism map overview Example: 3D parallelism sample Reference: context parallel and sequence parallel stayed because it reduces activation memory roughly with tensor-parallel degree. Selective activation recomputation stayed because it beat a blunt full-checkpoint strategy. Explicit routing accounting stayed because expert parallelismQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding is not just “expert weights somewhere else”; the dispatch and combine path is part of the budget.
What we did not keep is the habit of talking about fit as if only parameters mattered. On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: training speed anatomy on H200, the winner is the configuration where weights, gradients, optimizer state, activations, routing scratch, KV cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack, and reserve all fit together with real headroom.
That is the real meaning of memory geometry. It is not an aesthetic way to present a budget. It is a way to stop using a scalar mental model for a multidimensional runtime.
Production checklist
- Run the estimator with the real TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding, DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP, precision, and feature flags.
- Keep embeddings, norms, routers, and other replicated tensors visible in the budget.
- Track Muon and AdamW state separately when reviewing optimizer-group changes.
- Budget MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch and combine scratch whenever EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding is active.
- Cross-check the activation estimate against the actual recompute policy in the dense-model runtime.
- Add KV cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack for serving scenarios instead of assuming the trainingQuick term guideTrainingWhat actually sets training speed on H200 in public MegaCpp reporting: compile warmup policy, block mix, memory shape, and why local wins often fail…GroundingTraining speed anatomy on H200 SLM training in MegaCpp: what the stack optimizes for and what stays explicit budget covers them.
- Leave nontrivial headroom for compile peaks and fragmentation on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: training speed anatomy on H200.
Frequently asked questions
What does that routing scratch look like on one concrete batch?+
4096 tokens on a rank with top_k=2, 128 total experts, capacity_factor=1.2, hidden_dim=4096, and 4 local experts in bf16. Then capacity_e = ceil((4096 * 2 / 128) * 1.2) = 77, the dispatch-count surface is 4096 * 2 * 4 bytes or about 32 KB, and the receive plus combine buffers land at 77 * 4096 * 4 * 2 bytes each, or about 2.5 MB each. That is already about 5 MB of live routing scratch for one layer before you count the expert math itself, which is why EP capacity planning sample, Expert-parallel routing sample, and A Memory-Budget Anatomy for One Specialist on H200:8 keep routing separate from dense activations.Why not use one optimizer multiplier for the whole model?+
Is 141 GB the budget a launch planner should treat as usable model space?+
141 GB is the H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks. physical-memory figure, not the part a capacity model should collapse into weights or KV and then spend to zero. The safer question is how much room remains after runtime reserve, allocator scars, and other tail costs stay outside the model columns, and whether weights, optimizer state, activations, routing scratch, and any serving cache still fit with headroom. That is the same seam behind the OOM Debugging Playbook for H200 Training Runs and A Memory-Budget Anatomy for One Specialist on H200:8.Should the estimator hard-code 0.90 * 141 GB as the safe H200 budget?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.
Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.
Data parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.
The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
Context parallelism splits the sequence itself along the token axis. On 8× H200 with a 128K-token sample and CP=8 each GPU processes 16K local tokens; during attention the GPUs ring-exchange KV chunks so every one still sees the full past. Cost: a ring of KV sends that scales with sequence length — cheap on NVLink, expensive across nodes. Weights replicate on every CP GPU; only activations and the KV cache shard along sequence. Use CP when the sequence is too long for one GPU's KV cache, not to reduce weight memory — that's TP or FSDP's job.
Sequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.
What actually sets training speed on H200 in public MegaCpp reporting: compile warmup policy, block mix, memory shape, and why local wins often fail…
Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.
The layer-index remap that allocates cache slots only for attention-bearing layers in a hybrid stack instead of reserving one slot per layer unconditionally.
Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…