By David Gornshtein · 12 min read
NCCL
H200
Distributed
MegaCpp

NCCL and collective hangs: the H200 multi-host timeout playbook

Allreduce stragglers, NCCL deadlocks, P2P env vars, ibverbs quirks, and the liveness/timeout playbook we run on MegaCpp's H200 multi-host CUDA lanes.

Most of the genuinely expensive debugging on our H200 fleet was not about the model. It was about NCCL, the NVIDIA Collective Communications Library that owns allreduce, allgather, reduce-scatter, and related multi-GPU exchange: bootstrap failures, plugin regressions, watchdog timeouts firing in the wrong place, and recovery paths that replaced one failure mode with another. Some terms up front: bootstrap is the communicator-creation phase before step 0, the watchdog is the timeout layer around collectives, and the heartbeat monitor is the liveness signal trying to decide whether ranks are still making progress. This post is the playbook we landed on: the env vars we set, the ones we unset, the retry logic in the training entrypoint, and the liveness rules we enforce before declaring a run healthy. The quickest checked-in proof surfaces are the compile runtime env sample, the compile warmup policy sample, the pipeline-parallel compile warmup sample, and the regional compile runtime sample. This post pairs naturally with the distributed-optimizer stress harness, because collective-order bugs and optimizer drift are often two faces of the same distributed mistake. The transport view also connects directly to Hybrid FSDP/DDP on NVIDIA and expert parallel and MoE sharding, where the same collectives show up as wrapper and routing constraints.

Why this matters

NCCL failures are bimodal. Either the run never starts (you bisect an env var that should have been unset at the launcher), or it starts, looks healthy for some minutes, and then a collective on rank 3 quietly waits forever for rank 5 to finish a kernel that is itself waiting for rank 3. Both look identical in the launcher: the parent process sits there with no useful message. The watchdog either fires too early (during compile warmup) or too late (after a real hang). Every fix has a blast radius: an env var that saves a single-host lane breaks the multi-host one, a heartbeat extension that survives compile lets a real hang sleep through the night.

Compile warmup on our dense+MoE preset takes long enough that NCCL defaults assume something has died. We had to teach the watchdog and its heartbeat monitor the difference between "compute is busy" and "the communicator is dead," teach the launcher to scrub plugin envs that leak in from the host shell, and teach the retry path which failure classes are eligible for a lazy-init second attempt. That distinction matters most on the longer H200 receipts described in training on H200 eight-GPU machines, where compile and steady-state phases have very different timing signatures. The checked-in compile runtime env sample, regional compile runtime sample, and distributed debugging notes are the smallest public-safe surfaces showing that timeout and liveness contract, and they pair naturally with Profiler and performance reports when the question is whether a stall belongs to bootstrap, compile warmup, or steady-state collectives.

1. The operating environment

The CUDA side runs on H200 hosts with 8 GPUs each. Training ranges from single-host H200:8 to multi-host jobs of up to 4xH200:8. The fabric differs by host: some sit on plain IP with NCCL's default socket path; others have a vendor multi-node plugin present in LD_LIBRARY_PATH. Both had to work, and both had to fail gracefully when the plugin environment leaked into a single-host run. In practice this means the launcher contract from training on H200 eight-GPU machines and the recovery discipline from modal debugging playbook have to agree on when a host is sick versus merely slow.

The stack is PyTorch 2.12 nightly (2.12.0.dev20260304+cu130), whatever NCCL ships with that wheel, Triton 3.6, and Python 3.13. DDP is the production path for most rungs; FSDP2 lanes and expert-parallel MoE lanes add their own collective patterns on top.

2. The five hang classes we learned to name

Everything we dealt with falls into one of these buckets:

  1. Communicator bootstrap failure. ncclRemoteError, socketPollConnect ... Connection refused, remote process exited, Failed to initialize any NET plugin. The ranks never get a working communicator; the job dies before step 0.
  2. Watchdog firing during compile. torch.compile's Triton JIT takes 15-20 minutes on the larger dense+MoE preset, and NCCL's default 600 s heartbeat watchdog kills ranks that do not run a collective during compile. The surviving ranks then hang on the next collective.
  3. Allreduce stragglers. One rank is 100-300 ms slower than the others per step, usually on a specific GPU, sometimes correlating with a specific PCIe slot. The collective blocks the fast ranks; effective throughput falls to the straggler's rate.
  4. Plugin / env mismatch. A multi-node NCCL tuner config leaks into a single-node bench and the bootstrap fails with the plugin error above. Or the opposite: a single-node config runs on a multi-host lane and IB never gets used.
  5. Coalesced-op unsupported. Backend nccl does not support allgather_into_tensor_coalesced. Not a hang, but it looks like one: the run gets past bootstrap, into compile warmup, and then dies a few tens of seconds later with that string. We saw it on replay rungs more than on mainline.
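
A first-pass triage over the classes above can be sketched as a substring match on the launcher log. This is a minimal sketch, not the MegaCpp matcher: the class labels and the `classify_hang` name are illustrative, and only the error strings quoted in the list are used. Classes 2 and 3 have no log signature in this post, so the sketch returns `None` for them and leaves the call to timing evidence.

```python
from typing import Optional, Sequence, Tuple

# Error strings quoted in the list above. Labels are illustrative names only.
# The plugin string is checked first because it appears in both class 1 and
# class 4, and the env mismatch is the more specific diagnosis.
SIGNATURES: Sequence[Tuple[str, Tuple[str, ...]]] = (
    ("plugin_env_mismatch", ("Failed to initialize any NET plugin",)),
    ("bootstrap_failure", (
        "ncclRemoteError",
        "socketPollConnect",
        "Connection refused",
        "remote process exited",
    )),
    ("coalesced_op_unsupported", ("allgather_into_tensor_coalesced",)),
)


def classify_hang(log_text: str) -> Optional[str]:
    """Return the first matching class label, or None when no string matches.

    None covers watchdog-during-compile and straggler cases, which need
    timing data rather than an error string.
    """
    for label, needles in SIGNATURES:
        if any(needle in log_text for needle in needles):
            return label
    return None
```

A `None` result is not a clean bill of health; it only means the log carries none of the known strings.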

3. Env vars: what we set, and why

The default environment policy lives in the main training entrypoint and applies to every CUDA rank unless the operator has set the variable by hand. The defaults are:

CUDA_DEVICE_MAX_CONNECTIONS=1
TORCH_NCCL_AVOID_RECORD_STREAMS=1
NCCL_NVLS_ENABLE=0
NVTE_DP_AMAX_REDUCE_INTERVAL=0
NVTE_ASYNC_AMAX_REDUCTION=1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_P2P_NET_CHUNKSIZE=524288
TORCH_NCCL_HIGH_PRIORITY=1
TOKENIZERS_PARALLELISM=False

CUDA_DEVICE_MAX_CONNECTIONS=1 limits each device to a single hardware work queue, so kernels run in issue order and NCCL overlaps with compute deterministically; without it we saw intermittent backward stalls. TORCH_NCCL_AVOID_RECORD_STREAMS=1 removes a sync per collective. NCCL_NVLS_ENABLE=0 is the single most impactful line for bootstrap reliability on hosts without NVLink SHARP (NVLS); the probing stall masquerades as a hang. NCCL_P2P_NET_CHUNKSIZE=524288 matters more for pipeline parallelism than for pure DDP, but the cost of setting it defensively is zero. TORCH_NCCL_HIGH_PRIORITY=1 puts NCCL work on high-priority CUDA streams so compute/comm overlap is not starved.

That same NCCL_NVLS_ENABLE=0 choice also keeps fragmented communicator layouts away from brittle NVLS multicast allocation paths. Once a run creates many small FSDP2 or MoE sub-communicators, NVLS can fail during communicator creation in a way that reads like noisy CUDA/NCCL bootstrap churn rather than a clean "too many groups" explanation.

The useful mental model is that NVLS is not an infinite free speedup. On H200 NVSwitch lanes there is a finite multicast budget underneath it, so a run that fans out into many small communicators can exhaust that budget long before the operator sees a clean topology error. In that regime, NCCL_NVLS_ENABLE=0 acts less like a generic "disable a fancy feature" switch and more like a way to force the stack back onto the stable point-to-point path when communicator fragmentation matters more than the theoretical NVLS win.
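
The "unless the operator has set the variable by hand" rule amounts to a set-if-absent pass over the environment. The sketch below shows one way to write it; `apply_env_defaults` is a hypothetical name, not the function in the training entrypoint, and the values are copied from the defaults listed in this section.

```python
import os
from typing import Dict, MutableMapping

# Values copied from the defaults listed in this section.
NCCL_DEFAULTS: Dict[str, str] = {
    "CUDA_DEVICE_MAX_CONNECTIONS": "1",
    "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
    "NCCL_NVLS_ENABLE": "0",
    "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
    "NVTE_ASYNC_AMAX_REDUCTION": "1",
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
    "NCCL_P2P_NET_CHUNKSIZE": "524288",
    "TORCH_NCCL_HIGH_PRIORITY": "1",
    "TOKENIZERS_PARALLELISM": "False",
}


def apply_env_defaults(env: MutableMapping[str, str] = os.environ) -> Dict[str, str]:
    """Set each default only when the key is absent; return what was applied.

    Call this before CUDA or NCCL is initialised, because these variables
    are read at initialisation time.
    """
    applied: Dict[str, str] = {}
    for key, value in NCCL_DEFAULTS.items():
        if key not in env:
            env[key] = value
            applied[key] = value
    return applied
```

Logging the returned dict at startup records which values came from the policy and which the operator supplied.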

Four more env vars are set specifically for regional-compile FSDP2 runs (_configure_regional_compile_nccl_timeouts):

TORCH_NCCL_ASYNC_ERROR_HANDLING=3
TORCH_NCCL_BLOCKING_WAIT=1
TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=<compile_timeout_minutes*60>
NCCL_TIMEOUT=<compile_timeout_minutes*60>

ASYNC_ERROR_HANDLING=3 aborts on NCCL error rather than raising asynchronously: fail-fast, not a ghost collective. BLOCKING_WAIT=1 makes collectives block instead of spin; combined with the long heartbeat timeout, it gives us clean tracebacks instead of watchdog-abort mystery. Current preference is to extend the timeout on the specific child process groups after FSDP2 or Megatron-style code creates them, and to keep the older torch.distributed.new_group patch only as a compatibility fallback for lanes that cannot reach those groups explicitly. The point is the same either way: extra groups must not keep the default short timeout during compile warmup. That extra-group problem is exactly why FSDP-heavy lanes like Hybrid FSDP/DDP on NVIDIA cannot treat NCCL timeouts as a launcher-only concern.
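
The overlay itself is simple arithmetic on one input. A minimal sketch, assuming the only parameter is the compile timeout in minutes; `regional_compile_timeout_overlay` is a hypothetical stand-in and not the body of `_configure_regional_compile_nccl_timeouts`:

```python
from typing import Dict


def regional_compile_timeout_overlay(compile_timeout_minutes: int) -> Dict[str, str]:
    """Build the four-variable overlay shown above from a single timeout value."""
    if compile_timeout_minutes <= 0:
        raise ValueError("compile_timeout_minutes must be positive")
    seconds = str(compile_timeout_minutes * 60)
    return {
        "TORCH_NCCL_ASYNC_ERROR_HANDLING": "3",
        "TORCH_NCCL_BLOCKING_WAIT": "1",
        # Both timeouts derive from the same number so they cannot drift apart.
        "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC": seconds,
        "NCCL_TIMEOUT": seconds,
    }
```

Deriving both timeouts from one value keeps a heartbeat window and a collective timeout from being tuned independently and disagreeing about when a rank is dead.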

4. The env vars we unset

Bench hosts often inherit a multi-node IB tuning env from the login shell. On a single-node 2-GPU lane that env is poison. The relevant symptom is a Failed to initialize any NET plugin early in bootstrap. The fix is a sanitize_single_node_nccl_env function that unsets the leaked vars and forces a plain intra-node path:

unset NCCL_NET
unset NCCL_CROSS_NIC
unset NCCL_NET_GDR_LEVEL
unset NCCL_TUNER_CONFIG_PATH
unset NCCL_IB_ADAPTIVE_ROUTING
unset NCCL_IB_FIFO_TC
unset NCCL_IB_QPS_PER_CONNECTION
unset NCCL_IB_TC
unset NCCL_NVLS_CHUNKSIZE
unset NCCL_P2P_NET_CHUNKSIZE
export NCCL_NET_PLUGIN=none
export NCCL_IB_DISABLE=1
export NCCL_IBEXT_DISABLE=1

We also scrub the vendor IB lib path out of LD_LIBRARY_PATH before launching. NCCL_IB_DISABLE=1 and NCCL_NET_PLUGIN=none are aggressive; they are right for single-node lanes and wrong for multi-host. On the single-node fallback, the dual disable matters: NCCL_IB_DISABLE=1 keeps core NCCL off verbs while NCCL_IBEXT_DISABLE=1 keeps external IB plugins from half-attaching to a fabric the lane does not actually have. The launcher knows which lane it is starting, so the sanitiser only runs where it belongs. The multi-host flavour keeps IB on and the plugin present but still sets NCCL_NVLS_ENABLE=0 and the heartbeat envs from the defaults. We do not set NCCL_DEBUG=INFO by default; we turn it on only for bench quartet runs where we are actively chasing a bootstrap issue.
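
For launchers written in Python, the same sanitising step can be sketched over a copy of the environment. This is not the real `sanitize_single_node_nccl_env`. The `vendor_lib_marker` argument is a hypothetical way to recognise the vendor IB directory, because the actual path is not given in this post.

```python
from typing import Dict, Mapping

# Variables unset by the shell snippet above.
LEAKED_VARS = (
    "NCCL_NET", "NCCL_CROSS_NIC", "NCCL_NET_GDR_LEVEL", "NCCL_TUNER_CONFIG_PATH",
    "NCCL_IB_ADAPTIVE_ROUTING", "NCCL_IB_FIFO_TC", "NCCL_IB_QPS_PER_CONNECTION",
    "NCCL_IB_TC", "NCCL_NVLS_CHUNKSIZE", "NCCL_P2P_NET_CHUNKSIZE",
)
# Variables exported by the shell snippet above.
FORCED_VARS = {
    "NCCL_NET_PLUGIN": "none",
    "NCCL_IB_DISABLE": "1",
    "NCCL_IBEXT_DISABLE": "1",
}


def sanitize_single_node_env(env: Mapping[str, str], vendor_lib_marker: str) -> Dict[str, str]:
    """Return a cleaned copy of env for a single-node lane; the input is untouched."""
    clean = {k: v for k, v in env.items() if k not in LEAKED_VARS}
    clean.update(FORCED_VARS)
    # Drop any LD_LIBRARY_PATH entry that contains the (assumed) vendor marker.
    kept = [
        path for path in clean.get("LD_LIBRARY_PATH", "").split(":")
        if path and vendor_lib_marker not in path
    ]
    if kept:
        clean["LD_LIBRARY_PATH"] = ":".join(kept)
    else:
        clean.pop("LD_LIBRARY_PATH", None)
    return clean
```

Returning a copy rather than mutating `os.environ` lets the launcher pass the result straight to `subprocess.Popen(..., env=clean)` and keep its own environment intact for a later multi-host launch.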

5. The watchdog timeout story

This is the one we got wrong first, more than once. The default 600 s heartbeat kills ranks during compile. Classic fix:

TORCH_NCCL_ENABLE_MONITORING=0
TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=7200
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT=7200

These three land in base_train automatically when LOCAL_RANK is detected. With them, the DDP+compile lane went from "hanging and NaN" to a steady-state DDP-on-MoE throughput band on the recompilation-fix receipt.

What went wrong first: we disabled monitoring on lanes that should have kept it on. TORCH_NCCL_ENABLE_MONITORING=0 makes NCCL stop exporting the heartbeat signal, so a real hang looks exactly like a slow compile. On a two-hour replay we only learned that the wrong way. Current policy: compile-sensitive lanes start with monitoring disabled and a long heartbeat window, then tighten timeout behavior on the relevant process groups after warmup instead of assuming mid-run env-var flips will change NCCL behavior. The long heartbeat window (7200 s) stays on for the whole run; cheap insurance, and compile can re-trigger on a retry.
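
Tightening timeouts on process groups after warmup can be sketched with the private PyTorch helper named in the FAQ, `torch.distributed.distributed_c10d._set_pg_timeout`. The phase names, the 600 s steady-state value, and both function names are assumptions for illustration. The torch import is deferred so the timeout policy can be checked on a machine without PyTorch.

```python
from datetime import timedelta
from typing import Iterable


def phase_timeout(phase: str, warmup_s: int = 7200, steady_s: int = 600) -> timedelta:
    """Pick a per-group timeout for a phase; steady_s is an assumed value."""
    if phase == "warmup":
        return timedelta(seconds=warmup_s)
    if phase == "steady":
        return timedelta(seconds=steady_s)
    raise ValueError(f"unknown phase: {phase!r}")


def set_group_timeouts(groups: Iterable[object], timeout: timedelta) -> int:
    """Apply timeout to each already-initialised process group; return the count."""
    # Deferred import: requires PyTorch with torch.distributed initialised.
    from torch.distributed import distributed_c10d

    count = 0
    for group in groups:
        # Private API; its signature may change between PyTorch releases.
        distributed_c10d._set_pg_timeout(timeout, group)
        count += 1
    return count
```

A lane would call `set_group_timeouts(child_groups, phase_timeout("warmup"))` once the FSDP2 or Megatron-style groups exist, then repeat the call with `"steady"` after the compile_warmup line has been logged.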

Second mistake: applying these vars uniformly to retry re-execs. Plain DDP lanes were fine; expert-parallel lanes needed MEGACPP_SKIP_CUDA_BOOTSTRAP_BARRIER and lazy NCCL init, and generic CUDA retry re-execs crashed early under lazy init when eager would have worked. The landed policy in the main training entrypoint is:

| Retry class | NCCL init | Bootstrap barrier | Watchdog envs |
|---|---|---|---|
| Generic CUDA retry child | Eager | Standard | Standard |
| Expert-parallel retry child | Lazy | Skipped | Standard |
| Known retry-eligible startup error | Lazy | Skipped | Standard, retry once |

The narrow eager-NCCL retry matcher now treats ncclRemoteError, remote process exited, socketPollConnect, and Connection refused as lazy-retry-eligible bootstrap failures. That list was built from reproducible bootstrap receipts on H200 hosts.
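
The table and the matcher list can be read as one small decision function. The sketch below makes that concrete under one assumption that the table does not state: a matched startup error takes precedence over the lane type. `RetryPolicy` and `select_retry_policy` are illustrative names, not the entrypoint's own.

```python
from dataclasses import dataclass
from typing import Optional

# Strings the retry matcher treats as lazy-retry-eligible, per the text above.
RETRY_ELIGIBLE = (
    "ncclRemoteError",
    "remote process exited",
    "socketPollConnect",
    "Connection refused",
)


@dataclass(frozen=True)
class RetryPolicy:
    lazy_nccl_init: bool          # False means eager init
    skip_bootstrap_barrier: bool  # False means the standard barrier
    retry_once: bool              # True only for the known-error row


def select_retry_policy(expert_parallel: bool, startup_error: Optional[str] = None) -> RetryPolicy:
    """Map a retry child to a row of the policy table."""
    if startup_error and any(s in startup_error for s in RETRY_ELIGIBLE):
        # Assumed precedence: a known bootstrap error wins over lane type.
        return RetryPolicy(lazy_nccl_init=True, skip_bootstrap_barrier=True, retry_once=True)
    if expert_parallel:
        return RetryPolicy(lazy_nccl_init=True, skip_bootstrap_barrier=True, retry_once=False)
    return RetryPolicy(lazy_nccl_init=False, skip_bootstrap_barrier=False, retry_once=False)
```

Watchdog envs are "Standard" in every row, so the sketch leaves them out of the policy object.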

6. Liveness checks we actually run

The rules we enforce before calling a run "healthy":

  • Compile warmup completed. The log contains a compile_warmup timing line. Without it, no step-0 claim is valid. Typical warmup values on the 8-GPU DDP lane after cache warm are in the low tens of seconds; cold is a couple of minutes.
  • Step 0 is compile-contaminated. We classify step 00000 as contaminated by default. It routinely shows numbers in the low thousands of tok/sec on a stack that will steady-state in the high six figures of tok/sec. Any bisect that uses step 0 as a data point is discarded.
  • Step 1 is the first real number. Step 1 + step 2 + step 3 is the minimum receipt. A single-step receipt is a rumour.
  • Peak memory printed. If the receipt does not include peak memory, we do not use it for throughput bisects. Peak memory catches the "accidental FSDP resharding" class that looks fine on tok/s but shows a 2-3x memory swing.
  • End marker present. Runs that die inside compile often leave plausible-looking partial logs. A run without a clean end marker is classified as a startup/compile stall, not a throughput receipt.

The contract is visible in concrete receipts. Successful historical replays come back with a compile_warmup line in the tens of seconds, a contaminated step 0 in the low thousands of tok/sec, then steps 1, 2, and 3 climbing into the steady-state band, with peak memory in the high teens of GiB on the H200:8 lane. Without the NCCL timeout fixes and the compile-warmup tolerance we would have classified some of those receipts as hung.
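
The liveness rules above can be checked mechanically against a run log. The sketch below assumes a log format that this post does not specify: a line containing `compile_warmup`, step lines starting with `step NNNNN`, a line containing `peak_mem`, and an end marker string (`RUN_END` here). All of those tokens are hypothetical placeholders for whatever the real logs emit.

```python
import re
from typing import List

# Assumed step-line shape: "step 00001 ..." at the start of a line.
STEP_RE = re.compile(r"^step (\d+)\b", re.MULTILINE)


def liveness_violations(log: str, end_marker: str = "RUN_END") -> List[str]:
    """Return the liveness rules this log fails; an empty list means healthy."""
    problems: List[str] = []
    if "compile_warmup" not in log:
        problems.append("no compile_warmup timing line")
    steps = {int(number) for number in STEP_RE.findall(log)}
    # Step 0 is compile-contaminated, so only steps 1 to 3 count as a receipt.
    if not {1, 2, 3} <= steps:
        problems.append("steps 1-3 not all present")
    if "peak_mem" not in log:
        problems.append("no peak memory line")
    if end_marker not in log:
        problems.append("no clean end marker")
    return problems
```

A log that stops inside compile fails several rules at once, which matches the "startup/compile stall, not a throughput receipt" classification.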

7. Stragglers, the coalesced-op class, and multi-host

Stragglers we handle structurally. We keep TORCH_NCCL_HIGH_PRIORITY=1 and TORCH_NCCL_AVOID_RECORD_STREAMS=1 on, wrap reduce-scatter chains in OverlappedGradReducer.wait_all() with the coalescing manager (opt-out via MEGACPP_DISABLE_COALESCING_MANAGER=1), and pad bucket sizes via pad_buckets_for_high_nccl_busbw=True aligned to 65536 elements. When a specific rank is consistently slow we swap the GPU on the host; PCIe contention and cooling variance are real, and "re-rank" is not a software fix. We watch per-rank step time in the log and the /status API; when the delta exceeds roughly five percent steady-state we investigate. Rare, and always physical. The corresponding throughput-side interpretation is in training speed by feature, but the operational rule stays transport-first here.
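
The per-rank check can be sketched as a comparison against the median step time. The roughly five percent threshold comes from the paragraph above. Measuring the delta against the median, rather than the fastest rank or the mean, is an assumption of this sketch, as is the `find_stragglers` name.

```python
from statistics import median
from typing import Dict, List


def find_stragglers(step_time_s: Dict[int, float], threshold: float = 0.05) -> List[int]:
    """Return ranks whose step time exceeds the median by more than threshold."""
    if not step_time_s:
        return []
    baseline = median(step_time_s.values())
    limit = baseline * (1.0 + threshold)
    return sorted(rank for rank, seconds in step_time_s.items() if seconds > limit)
```

Feed it steady-state numbers only; a compile-contaminated step 0 would make every rank look slow or none at all.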

The allgather_into_tensor_coalesced class looks like a hang and is not. Some historical commits called the coalesced variant; the PyTorch+NCCL combo on the replay host did not support it. The process crashes cleanly after compile warmup with a clear backend message. Fix: commit-family classification, not an env var. Bisects landing in this class are treated as known-bad; the replay moves to an earlier rung.

Most incidents in this class are version-matrix or device-placement mismatches, not transport deadlocks, which is why we classify them separately from true hangs before touching any watchdog knobs.

Multi-host lanes add the trouble you would expect: bootstrap takes longer (NVLS off, IB on, init timeout matched to heartbeat); NCCL_P2P_NET_CHUNKSIZE=524288 is a reasonable starting point and on narrower inter-host links we have tuned down, never up; TORCH_NCCL_BLOCKING_WAIT=1 matters more than on single-host because async errors across hosts are harder to attribute. We do not publish a production multi-host receipt; that work is still exploratory.

8. Operator rules we now enforce

  1. Never inherit IB env into a single-node lane. Scrub at the launcher, always. The sanitiser runs unconditionally; if the lane needs IB, the launcher re-enables it explicitly after.
  2. Never kill a run while a collective is outstanding. Ranks left mid-collective take the NCCL communicator with them and the next attempt fails bootstrap. Always prefer the control-plane stop (the /control API) over kill -9.
  3. Never trust a single-step receipt. Step 0 is compile-contaminated, step 1 is the first real number, three steps is the minimum.
  4. Always print the stack line at start: torch.__version__, flash_attn.__version__, triton.__version__, and the NCCL major version from torch.cuda.nccl.version(). Without it, post-mortem is guesswork.
  5. Retry on known bootstrap errors only. The eager-NCCL retry matcher is the source of truth. New failure classes get a regression test before they get added.
  6. Keep the watchdog window greater than compile warmup. 7200 s is overkill for most runs and the right number when warmup is variable and the cost of being wrong is a full restart.

What we kept and what we threw away

We kept the env defaults, the regional-compile timeout overlay, per-group timeout propagation for internally-created collectives, the single-node sanitiser and its multi-host counterpart, the narrow eager-NCCL retry matcher, the lazy-init retry path for expert-parallel children, the long heartbeat window for the whole run, and the five-rule liveness contract. We never kill -9 a rank mid-collective. That last operator rule lines up with the more general run-control discipline in modal debugging playbook.

We threw away disabling monitoring globally (now back on after warmup), uniform retry policy across lane types, blanket NCCL_DEBUG=INFO, and chasing stragglers in software when the real fix is a GPU swap. We do not currently run PyTorch's NCCL flight recorder (TORCH_NCCL_TRACE_BUFFER_SIZE); it would have shortened several "which rank hung first" investigations and we should wire it in. A small per-rank step-time watchdog emitting a structured event on deviation is the other obvious next step. The rest of the playbook stays.

FAQ

Why does compile warmup look like a hang on these lanes?
Because Triton and torch.compile can keep ranks busy long enough that the default NCCL heartbeat assumes the communicator died, even though the job is still compiling. The symptom looks like a transport failure; the root cause is often compile skew.
Why extend timeouts on child process groups instead of only changing launcher env vars?
Because FSDP2 and Megatron-style optimizers create additional process groups inside the job, and those groups need the same timeout contract as the root communicator. If only the launcher env is long-lived, the internally-created groups still fire the short watchdog during compile warmup. When the lane owns those groups explicitly, prefer the direct per-group override; patching torch.distributed.new_group is the compatibility fallback.
When is a collective issue probably hardware, not software?
When one rank stays a few percent slower than the others across clean retries and the slowdown follows a specific GPU or host slot. At that point the safer fix is a swap, not another env-var theory, because the transport contract is already clean and the asymmetry is physical.
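One way to make the "same slot, every clean retry" rule concrete is to intersect the slow slots across retries. The sketch below is illustrative only: the function name, the slot labels, and the 2% default threshold are assumptions, and the input is assumed to be mean step times already aggregated per physical slot.

```python
import statistics


def persistent_stragglers(retries, threshold=0.02):
    """Return slots slower than the per-retry median by more than `threshold` in every retry.

    `retries` is a list of dicts, one per clean retry, mapping a physical slot
    label (for example "host-a/gpu3") to that slot's mean step time in seconds.
    """
    suspects = None
    for timings in retries:
        median = statistics.median(timings.values())
        slow = {slot for slot, t in timings.items() if t > median * (1 + threshold)}
        # A slot stays suspect only if it was slow in every retry seen so far.
        suspects = slow if suspects is None else suspects & slow
    return sorted(suspects or [])
```

A slot that survives the intersection is a swap candidate; a slowdown that moves between slots from retry to retry yields an empty list and points back toward software.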
Which checked-in sample should I read before changing timeout policy?
Start with the compile runtime env sample for the launcher-side environment overlay, then the compile warmup policy sample, regional compile runtime sample, and PP compile warmup sample for the "compile first, collectives later" contract. Those checked-in samples make the compile-era timeout story easier to reason about than a pile of shell snippets, but they do not model the full child process-group timeout override path by themselves; for that boundary, pair them with Dynamo and compile breakage and The Compile-Time Tax We Accept for Runtime Speed.
Which API actually changes a child process-group timeout after startup?
Use torch.distributed.distributed_c10d._set_pg_timeout(...) on the relevant group. The important part is not the private-looking name; it is that child communicators need a real post-init timeout change, because the launcher env vars were already consumed when the root NCCL groups came up.
Why can compile-time autotune look like a collective hang?
Because TORCHINDUCTOR_DISTRIBUTED_MAX_AUTOTUNE_GEMM=1 adds a cross-rank autotune sync while Inductor is still benchmarking kernels. On compile-heavy multi-rank lanes that creates a second synchronization surface before the main training collectives are stable, so an autotune delay can fail with the same timeout family as bootstrap or watchdog trouble. That is why our H200 compile-sensitive defaults keep it off unless the lane is explicitly prepared for the wider autotune path. For the compile-policy side of that trade, continue into The Compile-Time Tax We Accept for Runtime Speed.
Why is allgather_into_tensor_coalesced its own failure class instead of a normal hang?
Because the failure is often semantic before it is transport-level. A real hang usually means some rank never reached the collective. The coalesced path adds a stricter packing contract: every rank has to build the same flattened tensor list with the same shapes, ordering, and placement assumptions. If one rank packages a different payload, the run can die with a backend rejection or watchdog-looking error even though the real bug is the coalesced-payload contract, not the network. Treat it as its own class: prove the packed inputs match across ranks before you start debugging sockets or IB.
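A cheap way to prove the packed inputs match is to hash an ordered description of the payload on each rank and compare the hashes. The sketch below assumes each tensor is summarized as a `(shape, dtype, device_type)` triple; both function names are hypothetical, and in a real job the per-rank signatures would have to be exchanged first, for example with `torch.distributed.all_gather_object`.

```python
import hashlib


def payload_signature(tensor_specs):
    """Hash an ordered list of (shape, dtype, device_type) descriptions.

    Order matters on purpose: the coalesced contract includes tensor ordering.
    Use the device type ("cuda"), not the device index, which differs per rank.
    """
    digest = hashlib.sha256()
    for shape, dtype, device_type in tensor_specs:
        digest.update(repr((tuple(shape), str(dtype), str(device_type))).encode("utf-8"))
    return digest.hexdigest()


def mismatched_ranks(signatures):
    """Given one signature per rank (list index = rank), return ranks that differ from the majority."""
    counts = {}
    for sig in signatures:
        counts[sig] = counts.get(sig, 0) + 1
    majority = max(counts, key=counts.get)
    return [rank for rank, sig in enumerate(signatures) if sig != majority]
```

Running this check just before the coalesced call turns a watchdog-looking failure into a named rank with a named payload difference.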
Why set both NCCL_IB_DISABLE=1 and NCCL_IBEXT_DISABLE=1 on a single-node fallback?
Because NCCL_IB_DISABLE=1 only keeps NCCL's built-in verbs path from probing InfiniBand. An external IB plugin can still half-attach, churn during bootstrap, and turn a local-only lane into a noisy plugin failure. Our safe single-node fallback disables both paths together, then lets the launcher re-enable IB only on true multi-host lanes. The compile runtime env sample grounds the launcher-side env layer, and distributed debugging notes keeps the failure-family framing narrow; the exact external-plugin disable rule itself comes from the current NVIDIA plugin contract rather than from a checked-in repro.
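The launcher-side rule can be sketched as a pure function over the environment. The function name and the `num_hosts` parameter are hypothetical; the two variable names are the ones named above, and "re-enable" is modelled here as removing the overrides so NCCL falls back to its defaults.

```python
def nccl_transport_overlay(num_hosts, base_env=None):
    """Return a copy of `base_env` with the IB disable flags set for this lane shape."""
    env = dict(base_env or {})
    if num_hosts <= 1:
        # Single-node fallback: disable the built-in verbs path and the external plugin together.
        env["NCCL_IB_DISABLE"] = "1"
        env["NCCL_IBEXT_DISABLE"] = "1"
    else:
        # True multi-host lane: drop any inherited overrides so IB is probed again.
        env.pop("NCCL_IB_DISABLE", None)
        env.pop("NCCL_IBEXT_DISABLE", None)
    return env
```

Returning a new dict instead of mutating `os.environ` keeps the overlay easy to diff and log before the worker processes are spawned.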
Where should a suspected NCCL hang show up first: dashboard, receipt, or trace?
Usually all three, but in order. The dashboard should tell you which lane or pool is degraded, the receipt should tell you whether the failure was bootstrap, compile-era timeout, or steady-state collective drift, and the heavy trace should come last if the first two still cannot explain the stall. In checked-in form that means Observability and the three dashboards first, then compile/runtime receipt sample plus runtime optimization receipts, then a heavier profile if the lane is still ambiguous.
Why not leave NCCL_DEBUG=INFO on for every run?
Because it turns a liveness tool into background noise. NVIDIA documents the NCCL debug environment as an experimental/debugging surface rather than a production default, and the failure classes in this article need different first receipts: bootstrap needs init logs, compile-era stalls need timeout and warmup receipts, and steady-state stragglers need per-rank timing. Keep NCCL_DEBUG=INFO as a scoped repro knob, then turn it off once the lane has a stable classification.
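One possible shape for a "scoped repro knob" is a context manager that sets the variables for a single repro launch and restores the previous values afterwards. `scoped_env` is a hypothetical helper, and the sketch assumes the repro is launched from inside the `with` block, since NCCL reads these variables when the communicator initializes.

```python
import contextlib
import os


@contextlib.contextmanager
def scoped_env(**overrides):
    """Temporarily set environment variables, restoring prior values on exit."""
    previous = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old in previous.items():
            if old is None:
                os.environ.pop(key, None)  # the variable did not exist before
            else:
                os.environ[key] = old
```

A bootstrap repro might wrap its launch in `with scoped_env(NCCL_DEBUG="INFO", NCCL_DEBUG_SUBSYS="INIT"):` so that the verbose logging cannot leak into later runs started from the same process.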
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

EP

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

PP

Pipeline parallelism cuts the model by depth: each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... on the forward pass, and gradients flow back in the reverse order on the backward pass. Cost: a pipeline bubble of roughly (stages − 1)/(microbatches + stages − 1) of each step, so you need many microbatches per step to amortize it. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is small stage-boundary activations, not full weight or gradient tensors.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Modal

A container-first GPU execution surface with explicit image, GPU, Volume, Secret, and detached-launch primitives that MegaCpp uses for isolated benchmark and validation lanes.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

MoE

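Mixture of experts: a layer design in which a router sends each token to a small subset of expert feed-forward networks instead of one dense FFN. The linked post covers Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.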

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

Topic hubs