Activation Recompute Boundaries in Hybrid Stacks
Why selective recompute has to align with module boundaries, communication edges, and graph-safe surfaces in hybrid training systems.

Activation recompute only looks like a generic memory lever until you place it inside a hybrid training stack. Recompute too high and you replay collectives or metadata paths that can diverge across ranks. Recompute too low and you save very little. The durable answer is selective recompute at module boundaries that are graph-safe, shard-safe, and topology-aware.
There is a strong temptation to describe activation recompute as a binary switch: on means lower memory, off means higher speed. That story is too shallow for hybrid models. Once attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack, state-space, and recurrent blocks live in the same stack, the real question is where the recompute boundary sits.
Recompute is a placement problem
The important design choice is not whether recompute exists. It is whether the chosen boundary is explicit enough to reason about.
Named module surfaces do two useful things. First, they let memory savings concentrate where activations are actually large. Second, they make failures debuggable. If a run deadlocks, regresses throughput, or breaks graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample, it is much easier to inspect a concrete wrapped surface than a giant checkpoint over the whole block.
That becomes a correctness issue in distributed training. Divergent recompute decisions across ranks can replay different collectives and deadlock the run. Metadata mismatches can break checkpoint replay. Compiled blocks need stable grouping so that backward replay is structurally identical to forward.
| Recompute placement | Typical outcome |
|---|---|
| Whole block, too high | Saves memory but risks replaying collectives or rank-divergent control flow |
| Micro-op, too low | Safer but often too little memory relief |
| Selective submodule boundary | Best compromise when the module contract is stable |
| Topology-blind placement | Looks simple but breaks hybrid-specific assumptions |
Why hybrid stacks complicate the boundary
In a dense transformer, checkpoint placement is already a tradeoff between memory and replay cost. In a hybrid stack, it also becomes a topology problem.
AttentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-heavy paths concentrate pressure around QKV projection, the attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns core, and output projection. MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack paths concentrate pressure around expert activations and routing-related tensors. State-space and recurrent paths have their own scan or state-retention surfaces. A boundary that is ideal for attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns can be wrong for MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack or recurrent blocks.
That is why selective recompute is most useful when it targets the block surface that actually dominates activation memory. In some hybrid lanes that surface is the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack activation path. In others it is core attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns. The lesson is not "turn on recompute." The lesson is "place recompute where the memory is and where replay is structurally safe."
Communication seams are the dangerous boundaries
The sharpest edge is collective replay. If one rank replays an all-reduce or all-to-all and another rank does not, the run can deadlock outright. Replay is only safe when the replayed graph is structurally identical across all participants.
That leads to four practical rules:
- Recompute must not depend on rank-local branching.
- Module boundaries need deterministic shapes and metadata.
- Graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample and recompute must agree on signatures and kwargs.
- Communication-heavy seams should stay outside checkpoint boundaries unless the replay contract has been proven.
These rules sound conservative, but they are cheaper than debugging a distributed deadlock caused by an apparently innocent checkpoint wrapper.
Graph-safe boundaries are not always memory-optimal boundaries
The best memory-saving boundary is not always the best production boundary. Sometimes the largest tensors sit next to graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample seams, fused kernels, or collective-heavy code paths that are fragile under replay. In those cases, the better production choice is a slightly smaller memory win on a boundary that remains stable under compilation and distribution.
That is why many hybrid systems settle on a block-aware policy instead of a fully generic per-operator policy. The block-aware policy gives up some theoretical precision in exchange for a boundary that operators, compilers, and distributed runtimes can all agree on.
Practical rule of thumb
Use the highest recompute boundary that remains deterministic across ranks and safe under graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample, but no higher.
For attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, that often means core attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns or the full block in eager mode.
For MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack, that usually means expert compute but not dispatch.
For Mamba-style or recurrent layers, that often means a narrow in-module recompute instead of whole-block replay.
One concrete way to keep that rule auditable is to name the recompute surface instead of only naming the layer. Public Megatron CoreQuick term guideMegatron CoreThe NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.GroundingAbout: Porting to Megatron friction About: Nemotron-style recipe as pure Megatron CLI Example: Mamba3 TP mixer sample policies expose selective module names such as core attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, MLP, layer norm, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack activation, shared experts, and MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries up-projection; the useful MegaCpp habit is the same: name the replayable compute island, and keep routing or communication decisions outside it.
Once the stack is hybrid, boundary placement is no longer a detail. It is the feature.
Frequently asked questions
What changes when the recurrent or SSM lane is also CUDA-graph captured?+
What if sparse routing leaves an expert with no local tokens?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.
DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
A bounded family of tensor shapes captured or compiled as one stable runtime topology instead of treating every dynamic case as one global worst case.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.
Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.
CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.