MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
By David Gornshtein, 5 min read
Activation Checkpointing
Selective Recompute
Mamba
MoE
MLA
H200
Ablation

Activation checkpointing deep dive: why per-block policies beat one global switch

Full, selective, and narrow recompute across attention, MoE, Mamba-style, and recurrent blocks: what saves memory, what costs too much compute, and why a per-block policy usually wins.


This post covers the ablation history behind a practical checkpointing policy. Full checkpointing everywhere was too expensive. Per-operator selective activation checkpointing (SAC) helped in a few places but became hard to reason about at system level. What held up was a per-block policy: attention blocks use full-block or framework-level selective recompute, MoE blocks recompute expert GEMMs only, Mamba-style blocks avoid full checkpointing and keep a narrow conv-plus-projection recompute, and recurrent blocks use full checkpointing plus a small in-module recompute.

Why this matters

Hybrid models do not have one dominant activation bottleneck. Attention, MoE, Mamba-style sequence layers, and recurrent blocks all concentrate memory in different operators, and the cost of recomputing those operators is very different. A Mamba selective scan is expensive to rerun. Core attention is moderately expensive. A standard MLP is usually much cheaper. One global flag throws away most of that structure. A per-block policy keeps most of the memory benefit without turning the runtime into a maze of special cases.

FP8 adds another constraint. Standard PyTorch checkpointing does not preserve FP8 amax history across recompute, while Transformer Engine checkpointing does. If the checkpoint boundary and autocast scope are mismatched, the training curve can drift long before the failure is obvious.

The mechanisms that matter

MegaCpp needed five mechanisms that are often all called "checkpointing," even though they behave very differently.

Manual block checkpointing wraps each block forward with torch.utils.checkpoint.checkpoint(..., use_reentrant=False). In practice, the main model forward decides layer by layer using block type, layer index, and a spacing policy. This is the eager-mode path.
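The layer-by-layer decision described above can be sketched as a small pure function. This is a minimal illustration, not the MegaCpp implementation: the `SpacingPolicy` fields, the block-type labels, and the set of block types that receive a full-block checkpoint are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpacingPolicy:
    """Hypothetical spacing knobs: checkpoint every `every`-th layer from `start`."""
    every: int = 1
    start: int = 0

    def __post_init__(self) -> None:
        if self.every < 1:
            raise ValueError("every must be >= 1")


# Assumed labels for block types that take a full-block checkpoint in eager mode.
FULL_CHECKPOINT_TYPES = frozenset({"attention", "recurrent"})


def should_checkpoint(block_type: str, layer_idx: int, spacing: SpacingPolicy) -> bool:
    """Decide whether this layer's forward is wrapped in a block checkpoint."""
    if block_type not in FULL_CHECKPOINT_TYPES:
        return False
    if layer_idx < spacing.start:
        return False
    return (layer_idx - spacing.start) % spacing.every == 0


# In an eager forward loop the decision would be used roughly like this:
#   if should_checkpoint(kind, i, spacing):
#       x = torch.utils.checkpoint.checkpoint(block, x, use_reentrant=False)
#   else:
#       x = block(x)
```

Keeping the decision separate from the wrapping call makes the spacing rule testable without a GPU.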

Inductor automatic rematerialization is the compiled regional path. A compile helper translates gradient_checkpointing=True into an activation-memory budget, then Inductor inserts recompute nodes to satisfy that budget. Compiled rematerialization and manual checkpointing do not compose well. If both are active in the same region, they can double-count work, slow training sharply, and still save less memory than expected.
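One way to keep the two mechanisms from overlapping is to give every region exactly one recompute owner. The sketch below assumes a boolean `region_compiled` signal and a default budget of 0.5; both are hypothetical, and the enum and function names are not taken from the MegaCpp code.

```python
from enum import Enum


class RecomputeOwner(Enum):
    NONE = "none"          # keep every activation
    MANUAL = "manual"      # eager path: per-block torch.utils.checkpoint
    COMPILER = "compiler"  # compiled path: Inductor rematerialization


def recompute_owner(gradient_checkpointing: bool, region_compiled: bool) -> RecomputeOwner:
    """Pick exactly one mechanism per region so the two never stack."""
    if not gradient_checkpointing:
        return RecomputeOwner.NONE
    return RecomputeOwner.COMPILER if region_compiled else RecomputeOwner.MANUAL


def activation_budget(owner: RecomputeOwner, compiled_budget: float = 0.5) -> float:
    """Fraction of activations to retain; 1.0 means full retention.

    The 0.5 default is an illustrative assumption, not a recommended value.
    """
    if not 0.0 <= compiled_budget <= 1.0:
        raise ValueError("compiled_budget must be in [0, 1]")
    return compiled_budget if owner is RecomputeOwner.COMPILER else 1.0
```

In recent PyTorch releases a fraction like this is commonly passed through `torch._functorch.config.activation_memory_budget`; treat that as an assumption to verify against the version in use.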

CPU offload checkpointing trades recompute for host-link traffic. Instead of rerunning the whole block, it copies large saved inputs to pinned CPU memory and brings them back during backward. A finer-grained variant can use saved-tensor hooks above a size threshold. This is CUDA-only and only useful when recompute is expensive enough to justify the transfer.
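The size-threshold variant can be sketched as a pair of pack and unpack hooks. This is a simplified illustration under stated assumptions: `make_offload_hooks` and `min_bytes` are hypothetical names, and the pinned staging buffer and asynchronous copy a real implementation would want are omitted.

```python
def make_offload_hooks(min_bytes: int):
    """Build (pack, unpack) hooks that move large CUDA tensors to host memory.

    Works on any tensor-like object exposing numel(), element_size(),
    .device.type, and .to(...), which torch.Tensor does.
    """

    def pack(t):
        nbytes = t.numel() * t.element_size()
        if t.device.type != "cuda" or nbytes < min_bytes:
            return (None, t)           # keep small or non-CUDA tensors in place
        return (t.device, t.to("cpu"))  # remember where the tensor came from

    def unpack(packed):
        device, t = packed
        return t if device is None else t.to(device)

    return pack, unpack
```

With PyTorch, hooks of this shape can be installed with `torch.autograd.graph.saved_tensors_hooks(pack, unpack)` around the forward pass; `torch.autograd.graph.save_on_cpu(pin_memory=True)` is the built-in variant without a size threshold.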

Transformer Engine checkpointing is the FP8-safe path. It preserves amax history across the recompute boundary and is the right choice whenever an FP8-autocast block must be checkpointed.
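To see why the amax history matters, consider a toy model of delayed scaling. The code below is not Transformer Engine's implementation; the `AmaxHistory` class, the window length, and the amax values are all assumptions chosen to make the effect visible. It shows a recompute that records a second history entry for the same step, and a recompute that restores the pre-forward history first.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


class AmaxHistory:
    """Toy rolling window of per-step absolute maxima."""

    def __init__(self, length: int, initial: float) -> None:
        self.window = deque([initial], maxlen=length)

    def record(self, amax: float) -> None:
        self.window.append(amax)

    def scale(self) -> float:
        return E4M3_MAX / max(self.window)


def forward(history: AmaxHistory, amax: float) -> float:
    """Record this step's amax and return the scale the step would use."""
    history.record(amax)
    return history.scale()


def checkpointed_forward(history: AmaxHistory, amax: float):
    """Run forward and also return a snapshot of the history before the step."""
    pre = deque(history.window, maxlen=history.window.maxlen)
    return forward(history, amax), pre


def preserving_recompute(history: AmaxHistory, amax: float, pre: deque) -> float:
    """Restore the pre-forward history, then rerun forward."""
    history.window = deque(pre, maxlen=pre.maxlen)
    return forward(history, amax)


# Naive path: recompute records the same step twice and evicts older history.
naive = AmaxHistory(length=2, initial=8.0)
naive_fwd = forward(naive, 2.0)
naive_bwd = forward(naive, 2.0)

# Preserving path: recompute sees the same history as the original forward.
kept = AmaxHistory(length=2, initial=8.0)
kept_fwd, snapshot = checkpointed_forward(kept, 2.0)
kept_bwd = preserving_recompute(kept, 2.0, snapshot)
```

In this toy, the naive recompute ends up with a different scale than the original forward, while the preserving recompute reproduces it. That mismatch is the kind of silent divergence the paragraph above warns about.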

Operator-local recompute is the last piece. Recurrent blocks can rerun only the recurrence instead of the whole block. MLA can save compact latent KV state and regenerate full K and V during backward. Mamba-style layers can rerun only the convolution and projection pieces. Those narrow cuts matter because they often recover most of the memory win at a fraction of the wall-clock cost of full-block checkpointing.
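The MLA case lends itself to a quick size comparison. The sketch below counts saved elements per token for the compact latent state versus fully expanded K and V. Every dimension in the example call is a hypothetical value picked for illustration, not a MegaCpp model configuration.

```python
def saved_elems_per_token(
    latent_dim: int,
    rope_dim: int,
    n_heads: int,
    k_head_dim: int,
    v_head_dim: int,
) -> tuple[int, int]:
    """Return (latent, full) element counts saved per token.

    latent: compressed KV latent plus the small RoPE-carrying slice.
    full:   per-head K and V after up-projection.
    """
    latent = latent_dim + rope_dim
    full = n_heads * (k_head_dim + v_head_dim)
    return latent, full


# Hypothetical dimensions, for illustration only.
latent, full = saved_elems_per_token(
    latent_dim=512, rope_dim=64, n_heads=128, k_head_dim=192, v_head_dim=128
)
bytes_per_elem = 2  # BF16
latent_bytes = latent * bytes_per_elem
full_bytes = full * bytes_per_elem
ratio = full / latent
```

With these assumed dimensions the latent state is roughly 71 times smaller than the expanded K and V, which is why saving the latent and regenerating K and V in backward is an attractive narrow cut.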

The per-block ablation history

Attention blocks

The useful comparison was between no checkpointing, full-block checkpointing, framework-level selective core-attention recompute, and a custom per-operator policy at the SDPA boundary. The landed policy was simple: full-block checkpointing in eager mode and framework-level selective recompute in the standard compiled configuration. That boundary worked because the MLA up-projection was already being recomputed from compact latent state, so recomputing core attention captured the expensive part without duplicating the rest.

MoE blocks

MoE made the tradeoff much clearer. Full-block checkpointing was too expensive because backward had to replay dispatch, permutation, collective traffic, expert compute, and combine. The winning policy was selective expert-GEMM recompute only. That left dispatch metadata alone and reran just the cheapest part of the MoE chain. It delivered the largest memory win in the stack while keeping throughput cost small.

Mamba-style blocks

Mamba-style layers were the opposite of MoE. Full-block checkpointing was a bad trade because selective scan is expensive to rerun and tends to dominate the block cost. It also interacted badly with FP8 packed-token paths when recompute re-entered packing logic with already quantized inputs. The narrow conv-plus-projection recompute path was much better. It recovered meaningful memory and cost little in throughput because convolution backward already performed part of that work.

Recurrent blocks

Recurrent blocks benefited from a combination of coarse and narrow recompute. The recurrence chain alone can pin several gigabytes, so rerunning just that chain is highly effective. In practice, full-block checkpointing plus the narrow recurrence recompute produced the cleanest result: memory close to the narrow path with a simpler block-level rule.

Why custom per-op SAC usually lost

Custom per-operator SAC looked attractive because it promised exact control over what to save and what to recompute. In practice it lost on complexity.

One policy saved only expensive operators. That worked on attention-heavy stacks but missed important buffer-heavy work in MoE paths. Another policy used a raw tensor-size threshold. That was easy to explain but unstable because compact latent states could fall just below the threshold and get recomputed even when they were the wrong tensors to rerun. A block-aware operator policy worked better, but by then it was effectively rebuilding the block-level policy in a more fragile form. The result was clear: if your intended rule already depends on block identity and runtime context, it is usually cleaner to encode that policy at the block boundary.
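The threshold failure mode is easy to reproduce in miniature. The sketch below compares a raw size threshold with a lookup keyed on block type and tensor role. The tensor names, sizes, and rule table are hypothetical and only illustrate the shape of the problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Saved:
    name: str        # hypothetical tensor role
    block_type: str  # hypothetical block label
    nbytes: int


def threshold_policy(t: Saved, min_save_bytes: int) -> str:
    """Raw size rule: save large tensors, recompute everything smaller."""
    return "save" if t.nbytes >= min_save_bytes else "recompute"


# Block-aware rules: the decision depends on what the tensor is, not its size.
BLOCK_RULES = {
    ("attention", "mla_latent_kv"): "save",    # compact, and the source for K/V regeneration
    ("attention", "full_kv"): "recompute",     # regenerated from the latent state
    ("moe", "expert_gemm_out"): "recompute",   # expert GEMMs are rerun in backward
}


def block_aware_policy(t: Saved, default: str = "save") -> str:
    return BLOCK_RULES.get((t.block_type, t.name), default)


MIB = 1 << 20
latent = Saved("mla_latent_kv", "attention", MIB - 1024)  # just under the threshold
by_size = threshold_policy(latent, min_save_bytes=MIB)
by_block = block_aware_policy(latent)
```

Here the size rule recomputes the latent state because it lands just under the threshold, while the block-aware rule saves it. Once the rule table is keyed on block type, though, it is already a block-level policy written at operator granularity.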

The platform-specific lessons

On compiled CUDA paths with an activation-memory budget below full retention, let the compiler own rematerialization. That path depends on gradient checkpointing being exposed to the compile layer, because the compile layer turns that signal into the Inductor budget. If a higher-level configuration silently disables the flag, the compiled rematerialization lane disappears.

On TPU-class systems, autotuned rematerialization already does much of the work. Manual checkpointing still helps for attention-heavy regions, but CPU offload is not available, and MoE tradeoffs differ because dispatch buffers scale differently than they do on CUDA systems.

CPU offload is reserved for the narrow cases where recompute is expensive and host-link bandwidth is mostly idle. It is not a default training strategy. It is a situational escape hatch.

The policy that survived

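The practical policy is short:

Attention blocks: full-block checkpointing in eager mode, framework-level selective core-attention recompute in the standard compiled configuration.

MoE blocks: recompute expert GEMMs only and leave dispatch metadata alone.

Mamba-style blocks: no full-block checkpointing; keep the narrow conv-plus-projection recompute.

Recurrent blocks: full-block checkpointing plus the narrow in-module recurrence recompute.

FP8-autocast blocks: checkpoint through Transformer Engine so amax history survives recompute.

CPU offload: a situational escape hatch, not a default.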

That policy is less elegant than a single global flag, but it matches where memory is actually spent and where recompute is actually cheap.
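The per-block rules from the ablation sections fit in a single lookup. The sketch below is one possible encoding; the enum members, block-type strings, and `policy_for` function are hypothetical names, while the mapping itself follows the policies described in this post.

```python
from enum import Enum


class Recompute(Enum):
    FULL_BLOCK = "full_block"
    SELECTIVE_CORE_ATTENTION = "selective_core_attention"
    EXPERT_GEMM_ONLY = "expert_gemm_only"
    CONV_PROJECTION_ONLY = "conv_projection_only"
    FULL_BLOCK_PLUS_RECURRENCE = "full_block_plus_recurrence"


# Block types whose policy does not depend on eager vs compiled mode.
_FIXED = {
    "moe": Recompute.EXPERT_GEMM_ONLY,
    "mamba": Recompute.CONV_PROJECTION_ONLY,
    "recurrent": Recompute.FULL_BLOCK_PLUS_RECURRENCE,
}


def policy_for(block_type: str, compiled: bool) -> Recompute:
    """Map a block type and execution mode to its recompute policy."""
    if block_type == "attention":
        # Eager: full-block checkpoint. Compiled: framework-level selective recompute.
        return Recompute.SELECTIVE_CORE_ATTENTION if compiled else Recompute.FULL_BLOCK
    try:
        return _FIXED[block_type]
    except KeyError:
        raise ValueError(f"no recompute policy for block type {block_type!r}") from None
```

Raising on an unknown block type is a deliberate choice in this sketch: a new block family should get an explicit policy rather than silently inheriting a default.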

Frequently asked questions

Why is the selective-scan core not the Mamba checkpoint target?
Because the selective scan is the stateful recurrent mixer, not a cheap projection island. The Mamba paper describes input-dependent state-space parameters that rule out the older convolution shortcut and require a hardware-aware parallel algorithm in recurrent mode, so replaying the full scan spends the extra work on the expensive sequence-state path instead of the cheap conv/projection surfaces. MegaCpp keeps the Mamba rule narrow: recompute conv/projection-side tensors and leave the scan-owned state path to the kernel boundary. For the surrounding seam, see The Mamba 3 Kernel Journey and Author Mamba3 spec inside Megatron.
Why not offload checkpointed tensors to CPU instead of recomputing them?
CPU offload is a fallback, not the default, on H200-class systems. A PCIe Gen5 x16 link is roughly 64 GB/s per direction, while H200 HBM3e is 4.8 TB/s. As a bandwidth floor, a one-way 2 GB host transfer costs about 31 ms before overheads, while touching 2 GB through HBM is about 0.4 ms before kernel overheads. That gap is why MegaCpp only reaches for offload when the tensor is genuinely expensive to recompute, the transfer can be hidden behind other work, or the alternative is a hard OOM.
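The bandwidth floor quoted above is simple arithmetic and can be checked directly. The sketch uses the figures from the answer (64 GB/s, 4.8 TB/s, 2 GB) with decimal units; `floor_ms` is a hypothetical helper name, and real transfers add latency and overheads on top of this floor.

```python
def floor_ms(nbytes: float, bytes_per_second: float) -> float:
    """Lower bound on the time to move nbytes at a given bandwidth, in milliseconds."""
    return nbytes / bytes_per_second * 1e3


TENSOR_BYTES = 2e9        # 2 GB
PCIE_GEN5_X16 = 64e9      # roughly 64 GB/s per direction
H200_HBM3E = 4.8e12       # 4.8 TB/s

pcie_ms = floor_ms(TENSOR_BYTES, PCIE_GEN5_X16)  # about 31 ms one way
hbm_ms = floor_ms(TENSOR_BYTES, H200_HBM3E)      # about 0.4 ms
gap = pcie_ms / hbm_ms                           # about 75x
```

A round trip (copy out in forward, copy back in backward) doubles the host-link figure unless the copies overlap with other work.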
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Mamba

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.