MegaCpp EngineeringApplied C++ model systems

</>

Article

Grounded engineering note from the MegaCpp stack

Published April 18, 20264 min readDavid Gornshtein

Activation Recompute

Checkpointing

Memory

Distributed Training

Activation Recompute Boundaries in Hybrid Stacks

Q: What changes when the recurrent or SSM lane is also CUDA-graph captured?

Treat graph capture as a stricter version of the same boundary rule. A captured replay needs stable launch topology, shape classes, and buffer addresses, so a whole-block checkpoint is fragile if the recurrent or SSM path can change scan shape, state-buffer lifetime, or host-side control. A shape class is just a fixed replay bucket with bounded scan and buffer sizes; rare shapes still need their own bucket or an eager fallback. The safer public MegaCpp policy is to bucket fixed shapes, keep rare dynamic cases outside capture, and place recompute on a named in-module compute island instead of the whole hybrid wrapper. DSA and CUDA graph safety is the local graph-capture companion.

Why selective recompute has to align with module boundaries, communication edges, and graph-safe surfaces in hybrid training systems.

By David Gornshtein

MegaCpp

Focused on applied C++ model engineering

Article Preview

Activation Recompute Boundaries in Hybrid Stacks

Published April 18, 2026•4 min read•David Gornshtein

Activation recompute only looks like a generic memory lever until you place it inside a hybrid training stack. Recompute too high and you replay collectives or metadata paths that can diverge across ranks. Recompute too low and you save very little. The durable answer is selective recompute at module boundaries that are graph-safe, shard-safe, and topology-aware.

There is a strong temptation to describe activation recompute as a binary switch: on means lower memory, off means higher speed. That story is too shallow for hybrid models. Once attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack, state-space, and recurrent blocks live in the same stack, the real question is where the recompute boundary sits.

Recompute is a placement problem

The important design choice is not whether recompute exists. It is whether the chosen boundary is explicit enough to reason about.

Named module surfaces do two useful things. First, they let memory savings concentrate where activations are actually large. Second, they make failures debuggable. If a run deadlocks, regresses throughput, or breaks graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample, it is much easier to inspect a concrete wrapped surface than a giant checkpoint over the whole block.

That becomes a correctness issue in distributed training. Divergent recompute decisions across ranks can replay different collectives and deadlock the run. Metadata mismatches can break checkpoint replay. Compiled blocks need stable grouping so that backward replay is structurally identical to forward.

Recompute placement	Typical outcome
Whole block, too high	Saves memory but risks replaying collectives or rank-divergent control flow
Micro-op, too low	Safer but often too little memory relief
Selective submodule boundary	Best compromise when the module contract is stable
Topology-blind placement	Looks simple but breaks hybrid-specific assumptions

Why hybrid stacks complicate the boundary

In a dense transformer, checkpoint placement is already a tradeoff between memory and replay cost. In a hybrid stack, it also becomes a topology problem.

AttentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-heavy paths concentrate pressure around QKV projection, the attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns core, and output projection. MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack paths concentrate pressure around expert activations and routing-related tensors. State-space and recurrent paths have their own scan or state-retention surfaces. A boundary that is ideal for attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns can be wrong for MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack or recurrent blocks.

That is why selective recompute is most useful when it targets the block surface that actually dominates activation memory. In some hybrid lanes that surface is the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack activation path. In others it is core attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns. The lesson is not "turn on recompute." The lesson is "place recompute where the memory is and where replay is structurally safe."

Communication seams are the dangerous boundaries

The sharpest edge is collective replay. If one rank replays an all-reduce or all-to-all and another rank does not, the run can deadlock outright. Replay is only safe when the replayed graph is structurally identical across all participants.

That leads to four practical rules:

Recompute must not depend on rank-local branching.
Module boundaries need deterministic shapes and metadata.
Graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample and recompute must agree on signatures and kwargs.
Communication-heavy seams should stay outside checkpoint boundaries unless the replay contract has been proven.

These rules sound conservative, but they are cheaper than debugging a distributed deadlock caused by an apparently innocent checkpoint wrapper.

Graph-safe boundaries are not always memory-optimal boundaries

The best memory-saving boundary is not always the best production boundary. Sometimes the largest tensors sit next to graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample seams, fused kernels, or collective-heavy code paths that are fragile under replay. In those cases, the better production choice is a slightly smaller memory win on a boundary that remains stable under compilation and distribution.

That is why many hybrid systems settle on a block-aware policy instead of a fully generic per-operator policy. The block-aware policy gives up some theoretical precision in exchange for a boundary that operators, compilers, and distributed runtimes can all agree on.

Practical rule of thumb

Use the highest recompute boundary that remains deterministic across ranks and safe under graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.GroundingAbout: DSA and CUDA graph safety Example: DSA CUDA graph safety sample Example: CUDA graph block validation sample, but no higher.

For attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, that often means core attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns or the full block in eager mode.

For MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack, that usually means expert compute but not dispatch.

For Mamba-style or recurrent layers, that often means a narrow in-module recompute instead of whole-block replay.

One concrete way to keep that rule auditable is to name the recompute surface instead of only naming the layer. Public Megatron CoreQuick term guideMegatron CoreThe NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.GroundingAbout: Porting to Megatron friction About: Nemotron-style recipe as pure Megatron CLI Example: Mamba3 TP mixer sample policies expose selective module names such as core attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, MLP, layer norm, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack activation, shared experts, and MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries up-projection; the useful MegaCpp habit is the same: name the replayable compute island, and keep routing or communication decisions outside it.

Once the stack is hybrid, boundary placement is no longer a detail. It is the feature.

FAQ

Frequently asked questions

What changes when the recurrent or SSM lane is also CUDA-graph captured?+

Treat graph captureQuick term guideCUDA GraphsCUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph. as a stricter version of the same boundary rule. A captured replay needs stable launch topology, shape classesQuick term guideshape classA bounded family of tensor shapes captured or compiled as one stable runtime topology instead of treating every dynamic case as one global worst case., and buffer addresses, so a whole-block checkpoint is fragile if the recurrent or SSM path can change scan shape, state-buffer lifetime, or host-side control. A shape class is just a fixed replay bucket with bounded scan and buffer sizes; rare shapes still need their own bucket or an eager fallback. The safer public MegaCpp policy is to bucket fixed shapes, keep rare dynamic cases outside capture, and place recompute on a named in-module compute island instead of the whole hybrid wrapper. DSA and CUDA graph safety is the local graph-capture companion.

What if sparse routing leaves an expert with no local tokens?+

Do not let that case decide whether a rank enters the recompute wrapper. The safe boundary is above the dynamic skip decision or below dispatch on a compute-only expert path with deterministic metadata, so every rank keeps the same synchronization story even when local token counts differ. In MegaCpp terms, route and combine stay outside the replay island; the replay island is the expert math that can be re-run without re-issuing communication. Expert parallel and MoE sharding covers the routing side.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

Grounding

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

Grounding

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Grounding

shape class

A bounded family of tensor shapes captured or compiled as one stable runtime topology instead of treating every dynamic case as one global worst case.

Grounding

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Grounding

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

Grounding

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

Grounding

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

Grounding

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Grounding

David Gornshtein • MegaCppMore posts →

Activation Recompute Boundaries in Hybrid Stacks

Recompute is a placement problem

Why hybrid stacks complicate the boundary

Communication seams are the dangerous boundaries

Graph-safe boundaries are not always memory-optimal boundaries

Practical rule of thumb

Read next

References

Frequently asked questions

Terms used in this article