Activation checkpointing deep dive: why per-block policies beat one global switch
Full, selective, and narrow recompute across attention, MoE, Mamba-style, and recurrent blocks: what saves memory, what costs too much compute, and why a per-block policy usually wins.

This post covers the ablation history behind a practical checkpointing policy. Full checkpointing everywhere was too expensive. Per-operator selective activation checkpointing helped in a few places but became hard to reason about at system level. What held up was a per-block policy: attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns blocks use full-block or framework-level selective recompute, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack blocks recompute expert GEMMs only, MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode-style blocks avoid full checkpointing and keep a narrow conv-plus-projection recompute, and recurrent blocks use full checkpointing plus a small in-module recompute.
Why this matters
Hybrid models do not have one dominant activation bottleneck. AttentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack, MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode-style sequence layers, and recurrent blocks all concentrate memory in different operators, and the cost of recomputing those operators is very different. A MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode selective scan is expensive to rerun. Core attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns is moderately expensive. A standard MLP is usually much cheaper. One global flag throws away most of that structure. Per-block policy keeps most of the memory benefit without turning the runtime into a maze of special cases.
FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper adds another constraint. Standard PyTorch checkpointing does not preserve FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper amax history across recompute, while Transformer EngineQuick term guideTransformer EngineNVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.GroundingAbout: Transformer Engine on H200 and Blackwell-class GPUs: the bridge we use Reference: NVIDIA Transformer Engine documentation Reference: Transformer Engine FP8 and FP4 primer checkpointing does. If the checkpoint boundary and autocast scope are mismatched, the training curve can drift long before the failure is obvious.
The mechanisms that matter
MegaCpp needed four mechanisms that are often all called "checkpointing," even though they behave very differently.
Manual block checkpointing wraps each block forward with torch.utils.checkpoint.checkpoint(..., use_reentrant=False). In practice, the main model forward decides layer by layer using block type, layer index, and a spacing policy. This is the eager-mode path.
Inductor automatic rematerialization is the compiled regional path. A compile helper translates gradient_checkpointing=True into an activation-memory budget, then inductor inserts recompute nodes to satisfy that budget. Compiled rematerialization and manual checkpointing do not compose well. If both are active in the same region, they can double-count work, slow training sharply, and still save less memory than expected.
CPU offload checkpointing trades recompute for host-link traffic. Instead of rerunning the whole block, it copies large saved inputs to pinned CPU memory and brings them back during backward. A finer-grained variant can use saved-tensor hooks above a size threshold. This is CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-only and only useful when recompute is expensive enough to justify the transfer.
Transformer EngineQuick term guideTransformer EngineNVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.GroundingAbout: Transformer Engine on H200 and Blackwell-class GPUs: the bridge we use Reference: NVIDIA Transformer Engine documentation Reference: Transformer Engine FP8 and FP4 primer checkpointing is the FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper-safe path. It preserves amax history across the recompute boundary and is the right choice whenever an FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper-autocast block must be checkpointed.
Operator-local recompute is the last piece. Recurrent blocks can rerun only the recurrence instead of the whole block. MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries can save compact latent KV state and regenerate full K and V during backward. MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode-style layers can rerun only the convolution and projection pieces. Those narrow cuts matter because they often recover most of the memory win at a fraction of the wall-clock cost of full-block checkpointing.
The per-block ablation history
Attention blocks
The useful comparison was between no checkpointing, full-block checkpointing, framework-level selective core-attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns recompute, and a custom per-operator policy at the SDPA boundary. The landed policy was simple: full-block checkpointing in eager mode and framework-level selective recompute in the standard compiled configuration. That boundary worked because MLAQuick term guideMLAMulti-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.GroundingAbout: MLA and weight absorption Reference: fused MLA on NVIDIA Reference: shared MLA adapter boundaries up-projection was already being recomputed from compact latent state, so recomputing core attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns captured the expensive part without duplicating the rest.
MoE blocks
MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack made the tradeoff much clearer. Full-block checkpointing was too expensive because backward had to replay dispatch, permutation, collective traffic, expert compute, and combine. The winning policy was selective expert-GEMM recompute only. That left dispatch metadata alone and reran just the cheapest part of the MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack chain. It delivered the largest memory win in the stack while keeping throughput cost small.
Mamba-style blocks
MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode-style layers were the opposite of MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack. Full-block checkpointing was a bad trade because selective scan is expensive to rerun and tends to dominate the block cost. It also interacted badly with FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper packed-token paths when recompute re-entered packingQuick term guidePacked rowsWhy packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…GroundingPacked rows as the real training contract Tokenized enriched packed rows on TPU: feeding structure to XLA without recompiles logic with already quantized inputs. The narrow conv-plus-projection recompute path was much better. It recovered meaningful memory and cost little in throughput because convolution backward already performed part of that work.
Recurrent blocks
Recurrent blocks benefited from a combination of coarse and narrow recompute. The recurrence chain alone can pin several gigabytes, so rerunning just that chain is highly effective. In practice, full-block checkpointing plus the narrow recurrence recompute produced the cleanest result: memory close to the narrow path with a simpler block-level rule.
Why custom per-op SAC usually lost
Custom per-operator SAC looked attractive because it promised exact control over what to save and what to recompute. In practice it lost on complexity.
One policy saved only expensive operators. That worked on attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-heavy stacks but missed important buffer-heavy work in MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack paths. Another policy used a raw tensor-size threshold. That was easy to explain but unstable because compact latent states could fall just below the threshold and get recomputed even when they were the wrong tensors to rerun. A block-aware operator policy worked better, but by then it was effectively rebuilding the block-level policy in a more fragile form. The result was clear: if your intended rule already depends on block identity and runtime context, it is usually cleaner to encode that policy at the block boundary.
The platform-specific lessons
On compiled CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 paths with an activation-memory budget below full retention, let the compiler own rematerialization. That path depends on gradient checkpointing being exposed to the compile layer, because the compile layer turns that signal into the inductor budget. If a higher-level configuration silently disables the flag, the compiled rematerialization lane disappears.
On TPU-class systems, autotuned rematerialization already does much of the work. Manual checkpointing still helps for attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns-heavy regions, but CPU offload is not available, and MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack tradeoffs differ because dispatch buffers scale differently than they do on CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 systems.
CPU offload is reserved for the narrow cases where recompute is expensive and host-link bandwidth is mostly idle. It is not a default training strategy. It is a situational escape hatch.
The policy that survived
The practical policy is short:
- AttentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns: full-block checkpointing in eager mode, compiler or framework-level selective recompute in compiled mode.
- MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack: recompute expert GEMMs, not dispatch.
- MambaQuick term guideMambaA grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode-style blocks: avoid full-block checkpointing; use narrow conv-plus-projection recompute.
- Recurrent blocks: combine full-block checkpointing with narrow recurrence recompute.
- FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper: use an FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper-safe checkpoint path whenever recompute crosses an FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper-autocast boundary.
That policy is less elegant than a single global flag, but it matches where memory is actually spent and where recompute is actually cheap.
Frequently asked questions
Why is the selective-scan core not the Mamba checkpoint target?+
Why not offload checkpointed tensors to CPU instead of recomputing them?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.