CPU Offload and Startup Memory Calibration on H200 and GB10
How MegaCpp picks microbatch and offload knobs at boot, the zero-copy pinned offload paths, the AdamW-only optimizer offload trade-offs, and what shipped versus what stayed experimental.

The two memory questions before every CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 training run are "what microbatch fits" and "what stays on the GPU". The answer is not a one-time benchmark; it is a calibration loop that combines a feature-aware estimator, a persistent records catalog, and two distinct offload paths (activation and optimizer state). On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 we mostly use the loop to confirm headroom; on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story it is the only thing standing between us and a 47-minute compile-then-OOM cycle. This post walks through how the loop is wired, what the offload mechanisms actually cost, and which knobs remained useful in sustained training.
That calibration loop is easiest to read next to H200 memory geometry, training on H200 eight-GPU machines, and training the MegaCpp SLM Ensemble on GB10, because the same model lands against very different memory limits on those boxes.
Why MegaCpp cares about this
H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 has 141 GB of HBM3e per device and a fast NVLink fabric; GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story (DGX Spark, sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough) has 128 GB of unified memory and consumer-tier interconnect. The same model definition with the same flags runs differently on the two. For the first-touch hardware naming layer, GB10 stack parity for MegaCpp explains why the consumer sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough lane should not be read as datacenter sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough or tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough proof. A representative configuration may also enable a long list of memory-hungry features at the same time: hybrid Mamba-3 plus Transformer blocks, multi-token prediction (MTP), engram memory, structure embeddings, n-gram hash tables, FIM masking, and the eight-specialist routing. If those stack-local names are not fresh, SLM architecture, MegaCpp model glossary, and the checked-in instruction-aware FIM sample are the shortest local decoder ring. A naive "set device_batch_size and pray" approach used to lose us a full GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story day per misconfigured launch.
The calibration loop replaces that with three artifacts: a pre-flight estimate that accounts for every enabled feature, a persistent JSONL catalog of past launches and their OOM outcomes, and a small set of escape hatches (automatic fit search, AdamW offload, activation offload) the runtime can pull when the estimate is tight.
What we built in the public MegaCpp calibration path
The pre-flight estimator carries a hardware table with usable HBM per chip for the SKUs we care about — H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200: 141.0 GB, GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story: 128.0 GB unified, plus B200, GB200, H100, A100, L40S, and T4 — and a dtype byte table that knows about bf16, fp8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper, and nvfp4Quick term guideNVFP4NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 Reference: NVFP4 inference (0.5 bytes per element). The estimate is feature-aware: it walks the configuration, adds parameter bytes per layer including the right Mamba state-space contribution and MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack expert count, optimizer-state bytes for AdamW plus Muon (Muon's two momentum buffers vs AdamW's two moments), gradient bytes, and activation bytes per microbatch, then subtracts a configurable headroom reserve. The Mamba contribution is the trickiest because the SSM state caches scale with seq_len * d_state * n_groups rather than with n_embd alone; that audit caught a 6 GB blowup on the depth-52 hybrid preset before any GPU touched it.
The automatic fit search explores the joint space of tensor parallelismQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, expert parallelismQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding, data parallelismQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP, device batch size, and checkpointing for valid combinations that fit. On CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 we pin checkpointing off in this search because Nemotron Nano 3 trains 30B without recompute and our shape behaves the same way. The candidate scoring is not "minimize peak HBM"; it is shape-aware. Large multi-GPU CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack runs prefer sharded-expert meshes (target around 16 local experts per device on 8x H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200, dropping toward 8 on larger hosts) before they chase extra DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP, because EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding=1 maximizes DPQuick term guideDPData parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.GroundingAbout: parallelism map overview Example: FSDP sharding sample Reference: FSDP on CUDA and Megatron DDP on paper but recreates the routing/gather OOMs have already been observed. The selection reason is recorded with each candidate so you can inspect why one combination beat another.
The startup calibration loop makes the process persistent. Each boot writes a JSONL record with the schema version, hardware shape, requested config, estimator output, and runtime outcome. We also maintain a small set of retryable memory failures that mean "the estimator was too optimistic, retry with safer knobs" rather than "there is a logic bug"; that list covers compile-time HBM OOMs, runtime CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 OOMs, XLA resource exhaustion, and Triton shared-memory or resource limits. Historical records are then used to demote shapes that have already OOMed on the same hardware. CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 and XLA history stay separated so one backend does not pollute the other's fit decisions.
The runtime introspection layer dumps torch.cuda.memory_stats() (current and peak allocated/reserved/active, allocation retries, OOM count, inactive split bytes) and the top-N allocations from torch.cuda.memory_snapshot() when the snapshot recorder is active. It is gated behind a debug-only environment flag and prints once per phase boundary. The fragmentation signal (allocation retries plus inactive split bytes) is what catches the slow-creep OOMs that look like leaks but are really allocator fragmentation under FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample reshard churn.
That is also why the receipt stores phase boundaries instead of only final peak memory. A launch that fits at step 50 and fails at step 10,000 is usually not missing the initial size estimate; it is drifting through allocator fragmentation, curriculum shape changes, or a new overlap between prefetch and reshard buffers.
Meta-device initialization removes the double-allocation tax. We build the model on torch.device("meta"), then move it to empty CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 storage, then run weight initialization in place. We never allocate the full model on CPU first. For the depth-52 hybrid preset this is the difference between "boot fits in the calibrator's headroom budget" and "OOM during construction before the first forward". The same idiom makes resume-from-checkpoint cheap because we materialize empty and then load weights in shards. The checked-in meta-init memory bootstrap keeps that bootstrap path visible in minimal form.
The system has two offload mechanisms.
The activation-offload path has two modes. The coarse mode is a block-level offload path that replaces gradient checkpointing for a whole transformer block: instead of discarding the saved hidden state and recomputing, it copies it to pinned CPU memory and pulls it back during backward. Recomputation is gone; PCIe or unified-memory traffic takes its place. The fine-grained mode uses torch.autograd.graph.saved_tensors_hooks to selectively offload saved tensors above a min_numel threshold from registered submodules. Both use async CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 streams to overlap D2H/H2D with compute. This is CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-only; XLA/TPU keeps standard gradient checkpointing.
The operational guard is to reserve the pinned-host budget before the first step, not discover it during backward. Lazy pinning can turn an offload win into a synchronization cliff, and it can collide with communication buckets even when GPU HBM still looks safe.
The optimizer-offload path wraps an existing AdamW optimizer in a CPU-backed stepper. The contract is deliberately narrow: Muon parameters always stay on the GPU because orthogonalization needs fast matmul; AdamW parameters are candidates for offload based on offload_fraction and the largest tensors go first to maximize per-byte savings. Steps copy gradients D2H async, run the CPU AdamW step, copy params back H2D async. The wrapper preserves the torch.optim.Optimizer interface so the training loop does not branch. A guard returns the original optimizer unchanged when offload_fraction <= 0 or CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 is unavailable.
For the Muon side of that split, Muon optimizer on NVIDIA and the checked-in FSDP2 Muon local shard sample are the local references to read next; this article only uses Muon to explain why AdamW is the offload target.
How it lands in MegaCpp
The estimator and the calibration catalog graduate almost as-is. The estimator becomes the boot-time gate: MegaCpp refuses to launch a CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 training job whose estimated peak exceeds 95% of the device's usable HBM, period. A debug-only override exists for kernel development, but it is not used in standard launch flows. Automatic fit search becomes the default microbatch picker; the operator only specifies a global token budget and the search picks a tensor/expert/data-parallel layout plus device batch size deterministically.
The startup calibration store lifts as the cross-platform record layer; the CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 wrapper merges into the production calibrator as a lightweight namespace. Records persist under the MegaCpp data root and are never indexed by identifying metadata, so they replicate cleanly across hosts.
Meta-device initialization is the lifted idiom. Every model construction in MegaCpp goes through the meta-device path; we removed the legacy "build on CPU then .to(cuda)" code paths.
The activation-offload split survives. Whole-block offload becomes the supported coarse primitive. Fine-grained saved-tensor offload ships as an opt-in that is wired only on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story, where unified memory removes the separate HBM-to-host staging step for tensors above the size threshold; it still consumes memory bandwidth, so the target list stays explicit. On H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 it is available but rarely beats plain checkpointing. The runtime flag for this path enumerates its target regions explicitly rather than trying to infer them from the model.
The CPU AdamW offload path remains available but is used selectively. It ships behind a feature flag, defaults off on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 (where AdamW state for the eight specialists fits comfortably), and defaults to offload_fraction=0.5 on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story for the largest specialists. We do not offload Muon, ever. The CPU step's float64 reduction tail can be a few percent of step time when offload_fraction is high, so the practical rule is: only offload enough AdamW state to keep peak headroom above 8 GB; do not chase the maximum.
What stayed experimental: the multi-tier offload scheme that paged optimizer state across pinned host memory, NVMe, and a remote shared store. The complexity-versus-savings ratio was bad and the failure modes were operationally awful. The MegaCpp deployment stack has two tiers — GPU and pinned host — and that is the contract.
Ablations and what we kept
The estimator audit history is the most useful artifact in this whole stack. Every time the estimator under-counted, we either added the missing feature term or tightened the headroom reserve. The current model accounts for 26+ features explicitly: Engram, MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack (Token Choice plus shared plus null experts), Mamba-3, DSAQuick term guideDSADeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.GroundingAbout: DSA and CUDA graph safety History: DSA index cache patch Example: DSA CUDA graph safety sample, mHC (intra plus cross-layer HC), MTP, MoD, n-gram hash, structure embeddings, Tree FFN, relation bias, NCP, TOP, GateSkip, FIM/IFIM, FIRE/ReDo/DASH, gradient checkpointing, FSDP, TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding, EP. The "feature-aware" tag is not marketing; missing any of them and the estimate is wrong by gigabytes.
Calibration on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 mostly confirms what auto_fit already picked. The catalog still earns its keep because the same shape that fits at compile time can OOM during a particular curriculum phase when sequence length grows or when MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack token routing concentrates. Recording those events lets the next launch start from a safer microbatch automatically.
Calibration on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story is the load-bearing story. The unified-memory ceiling is hard, the compile takes long enough that retry costs money, and the consumer-tier silicon (sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough, no tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough, no TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity, ~128 KiB physical SMEM/SM versus 228 KiB on H100/H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200) means the kernel mix differs and the activation footprint is not transferable from H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200. The catalog has caught:
- A Mamba-3 SSM-state blowup at the 8K-context curriculum step, fixed by demoting microbatch by 1.
- A first-compile temporary that fits in 141 GB on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 and does not in 128 GB on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story. The retry rule (compile-time OOM bumps the headroom floor by 8 GB on the next attempt) prevents the second 47-minute compile from happening.
- A pinned-buffer collision between the CPU-offload optimizer step and the FSDP2Quick term guideFSDP2PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.GroundingAbout: FSDP2 on XLA TPU History: FSDP2 pain and payoff Example: FSDP sharding sample bucket buffers; the rule is now "offload_fraction must leave at least 8 GB of pinned-host headroom for NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.GroundingAbout: NCCL and collective hangs Example: pipeline parallel sample Reference: training on 8x H200".
The activation-offload ablation set: on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200, the coarse whole-block offload path is roughly compute-equivalent to standard checkpointing because PCIe is the bottleneck and we have plenty of headroom anyway; it remains useful for the rare configuration where recompute breaks compile. On GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story, activation offload is useful only when the target list is small enough that bandwidth pressure stays below the next fit tier; the long-context attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns block is the case that cleared that bar.
The optimizer-offload ablation set: offload_fraction=0.0 is the default everywhere. At 0.5 on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story we recover roughly enough HBM to enable the next-larger microbatch, but only after the AdamW CPU step is parallelized over OpenMP threads (single-threaded was a 12% step regression). At 1.0 we lose more in step time than we gain in throughput. The largest-first selection rule matters; an even split across param groups loses the largest-tensor benefit and ends up paying overhead per group with no proportional savings.
What we threw out:
- Predictive offload schedulersQuick term guideschedulerThe per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.GroundingAbout: inference serving stack Reference: observability and SLO dashboards Reference: KV cache and paged attention that promised to swap optimizer state in and out per layer. Too clever, too fragile, never beat the simple "split AdamW once at boot" rule.
- A learned estimator trained on past records. The hand-written feature accounting is more accurate and more debuggable.
- Estimator runtime calibration ("compile, measure, redo"). The compile cost makes it useless on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story; on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 the headroom is wide enough that it never fired.
Public checklist
- Always boot through the automatic fit search; do not hand-set
device_batch_sizefor routine launches. - Refuse to launch when estimated peak exceeds 95% of usable HBM.
- Build models via
create_model_on_device(meta-device +to_empty+init_weights); never allocate the full model on CPU first. - Persist every launch into the calibration catalog with outcome and OOM type; rerank future candidates from the history.
- Treat the listed
RETRYABLE_MEMORY_FAILURE_TYPESas estimator misses, not bugs; bump the headroom floor on retry. - Keep Muon parameters on the GPU; never route them through the CPU optimizer wrapper.
- Cap AdamW offload at the fraction that keeps at least 8 GB of pinned-host headroom for NCCLQuick term guideNCCLNVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.GroundingAbout: NCCL and collective hangs Example: pipeline parallel sample Reference: training on 8x H200.
- Enable fine-grained activation offload only on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story and only for explicitly enumerated submodule targets.
- Gate memory-debug dumps behind a debug-only environment flag; record allocation retries and inactive split bytes per phase to catch fragmentation creep.
- Two offload tiers only: GPU and pinned host. Do not ship NVMe or remote tiers.
Calibration snapshot
| Hardware | Offload path | Primary use |
|---|---|---|
| H200 | activation offload (pinned) | claw back activation memory when PCIe idle |
| H200 | AdamW optimizer offload | fit larger dbs on memory-binding presets |
| GB10 | activation offload | required on tight-memory presets |
| GB10 | optimizer offload | bypasses the compile-then-OOM cycle |
# boot-time calibration: pick microbatch from the estimator plus probe
from memory_runtime import estimator, probe
plan = estimator.plan(model, preset)
dbs = probe.fit(model, plan, headroom_gb=8)
Frequently asked questions
What should I tune first after a compile-then-OOM?+
Why not push offload_fraction all the way to 1.0 on GB10?+
Why does GB10 activation offload use an explicit target list?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.
Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.
The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.
Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.
Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.
Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.
Data parallelism replicates the whole model on every GPU and each GPU trains on a different slice of the batch (global_bs = local_bs × DP). After backward, gradients all-reduce across the DP GPUs so every replica ends the step with identical weights. Cost: one all-reduce per step sized to the full model — on 8× H200 a 70B model is about 140 GB of gradient traffic every step. Plain DDP keeps the whole model + optimizer state on every GPU; FSDP / ZeRO-3 shards them across the DP mesh to recover that memory. Use DP to raise throughput, not to fit a bigger model — that's FSDP's job.
DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.
PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.
NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.
The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.
Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.