Dynamo and torch.compile Breakage on a Mamba-3 Hybrid
Graph breaks, recompile storms, guard explosions, and cache-hygiene rules we landed while keeping torch.compile useful on MegaCpp's hybrid Mamba-3 + Transformer stack.

MegaCpp's training core is a hybrid: attention blocks (ABlock), Mamba-3 SSM
blocks (MBlock), expert blocks (EBlock, MoE), and Engram blocks in a
repeating pattern across 52 layers. A graph break here means TorchDynamo stops
tracing at one piece of Python, runs that slice eagerly, and resumes tracing on
the other side. Recompilation means the previously compiled graph no longer
matches the runtime guards for shapes, dtypes, or Python-side state, so
TorchDynamo and Inductor build another version. The TPU/XLA compile story lives
in Graph recompilation hell; this one is
specifically about Dynamo and Inductor on CUDA.
Why this matters
On a hybrid model the difference between a compiled run that merely boots and
one that actually pays off is tens of minutes of first-compile wall clock and a
low single-digit percent steady-state tax that compounds across the whole
training wave. Dynamo's defaults assume small, homogeneous models; a 52-block
mixed-architecture net with an opaque Triton kernel per Mamba layer, top-k MoE
routing, and a padded expert dispatch path is the exact shape those defaults
were not designed for. The failure modes are easy to misread - a recompile
storm looks like a hang, a guard explosion looks like a NaN, an autotune OOM
looks like a kernel bug - and the tooling tells you about each one only by
printing at the wrong level. What follows is the set of knobs, disable points,
and cache rules that finally made torch.compile a net positive for us on
CUDA.
1. Ground rules we ended up with
Six configuration lines in the main training entrypoint do most of the work. They are set before any compile happens, before Dynamo is allowed to trace anything. The local proof surface is mirrored in smaller checked-in form by Compile warmup policy sample and Compile runtime env sample:
torch._dynamo.config.capture_scalar_outputs = True
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.accumulated_cache_size_limit = 256
torch._dynamo.config.automatic_dynamic_shapes = False
torch._dynamo.config.assume_static_by_default = True
torch._dynamo.config.enable_compiler_collectives = False
Each line is a scar.
capture_scalar_outputs=True lets Dynamo trace through ops that return Python
scalars (.item()-adjacent patterns in MoD/MoE bookkeeping) without forcing a
graph break at the site. We do not actually want those sites traced - we want
the break to move into a torch.compiler.disable region we control - but we want
that to be our choice, not whatever Dynamo would have done by default.
cache_size_limit=64 and accumulated_cache_size_limit=256 exist because our
deep hybrid preset has 52 compiled blocks; the per-callsite default of 8
guarantees a cache-eviction storm. The accumulated cap is the total budget
across every compiled callsite in the process. Hitting either triggers the
recompile_limit=64 log line, and once you see it the next step will be slow:
Dynamo silently falls back to eager on the callsite that hit the limit.
automatic_dynamic_shapes=False plus assume_static_by_default=True is the
main load-bearing default. Letting Dynamo auto-promote dims to dynamic after it
sees two values is how you get a run where step 1 compiles for 20 minutes, step
2 recompiles for 12, and step 3 recompiles again because dbs wiggled by one
during a gradient-accumulation warmup. The key point is static-first, not
static-only. We mark dynamic dims by hand, explicitly, and only where we want
them. Concretely, the first batch axis is marked with
torch._dynamo.maybe_mark_dynamic(t, 0) on warmup inputs; everything else
stays static by construction.
enable_compiler_collectives=False disables the experimental knob that tries
to coordinate Dynamo across DDP ranks. It interacted poorly with our
regional compile setup and
produced ranks with divergent guard trees, which then deadlocked on a collective
that one rank decided to inline and another did not.
2. The graph breaks we accepted
We run fullgraph=False. That is a choice.
MBlock.forward is permanently wrapped in @torch.compiler.disable. We tried
the alternatives - per-block compile with the Mamba kernel allowed to recompile,
allow_in_graph(mamba_chunk_scan_combined), splitting MBlock into a compiled
outer shell and disabled inner - and each one eventually failed. The
Dynamo-traces-through-it path crashed because Dynamo still walked into the
Triton kernel and hit .data_ptr() on a FakeTensor proxy. Per-block compile
made the Mamba graph breaks worse, not better, because each block carried its
own guard tree and the guards did not all match across blocks of the same type.
So: MBlock is black-boxed. The whole-model compile runs with breaks at each
Mamba layer. We count them and cap them. On our deep hybrid preset that is 13
breaks per forward. Each break costs a little (sync plus Python dispatch
reentry); collectively, on H200:8 DDP, it is a fixed one to two percent tax
measured at steady state.
Four other disable points exist inside the main model runtime module and friends, gating things Dynamo cannot safely see:
- state mutations in the API server hand-off surface
- the DTensor-safe embedding dispatch
- a fused MoE entry that forces a graph break because F.grouped_mm goes through a path Inductor cannot fuse across EBlock boundaries
- the score_mod wrapper that adapts around the current softcap ABI mismatch
None of these are "nice to have" disables. Each earned its decorator by crashing a run.
3. The graph breaks we fought
The MoE overflow counter
The worst graph-break incident was not a graph break. It was a recompile storm that looked like one.
MoE._overflow_total was a Python int. It incremented every forward().
Dynamo specialized on its value. Every step produced a new guard, a new cache
key, a new compile. On 8-GPU DDP the behavior manifested as "compile hangs and
NaN", and the team misdiagnosed it as a Muon bf16 interaction, the same kind of
misleading early-run surface discussed in Loss curves and the divergence playbook,
and worked around it for weeks with a no-compile escape hatch.
The real root cause was Dynamo hitting recompile_limit=64 on the MoE
callsite, falling back to eager for MoE, disagreeing with DDP's reducer about
which parameters had run, and producing silent grad-sync drops. Converting the
counter to register_buffer makes it a tensor, takes it off the guard path, and
restores a stable compile cache. The fix recovered thousands of tokens per
second on 8-GPU DDP that had previously been hanging or NaN-ing.
The useful refinement from the research packet is that not every mutable field deserves to live in the same state class. If a router statistic or overflow tracker is runtime bookkeeping rather than model identity, keeping it as a non-persistent buffer is usually the cleaner contract: it stays tensor-shaped for compile, but it does not pretend to be checkpoint-defining state.
The lesson: any Python scalar touched by compiled code becomes part of the guard tree. If it increments, you have a time bomb.
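A minimal sketch of the buffer-not-int rule, assuming a hypothetical `OverflowTracker` module and a simple "tokens beyond capacity" definition of overflow; the real MoE bookkeeping is not shown in this article.

```python
import torch
import torch.nn as nn


class OverflowTracker(nn.Module):
    """Hypothetical router-side bookkeeping module."""

    def __init__(self) -> None:
        super().__init__()
        # Tensor state instead of a Python int attribute. persistent=False
        # keeps the counter out of state_dict(), so it is not checkpointed.
        self.register_buffer(
            "overflow_total", torch.zeros((), dtype=torch.long), persistent=False
        )

    def forward(self, tokens_per_expert: torch.Tensor, capacity: int) -> torch.Tensor:
        # Count tokens beyond capacity with tensor ops only (no .item()).
        overflow = (tokens_per_expert - capacity).clamp(min=0).sum()
        self.overflow_total.add_(overflow)
        return overflow


tracker = OverflowTracker()
tracker(torch.tensor([5, 2, 9]), capacity=4)
tracker(torch.tensor([4, 4, 6]), capacity=4)
print(int(tracker.overflow_total))
```

The counter still changes every step, but as a tensor its value is data flowing through the graph rather than a Python constant that a guard has to pin.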
The padded MoE path
MoE dispatch wants dynamic shapes (per-expert token counts). Dynamo with
automatic_dynamic_shapes=False does not want them. We reconciled the two with
a padded dispatch: tokens are bucketed to the next power of two of expert
capacity, the dense matmul runs on the padded shape, and a mask selects valid
outputs. The padded path has a static shape, which means it compiles once per
bucket size instead of once per observed distribution.
The checked-in Expert-parallel routing sample and MoE dispatch fast paths sample are the smallest local proof of that coupling. One keeps the capacity math explicit, the other keeps the permutation and communication cost explicit. If either side changes, the compile contract changes with it.
The tradeoff is explicit: roughly 25 percent padding overhead in the worst case, for a fully compilable graph that does not recompile when the routing distribution shifts. We measured the alternative and it was worse in every dimension: slower steady state, longer first compile, unpredictable tail.
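To make the bucketing idea concrete, here is a small sketch in plain Python. It reads "next power of two" literally and applies it to raw per-expert token counts; the helper names and the sample counts are assumptions for illustration, and the real dispatch path may bucket on a different quantity.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def bucket_shapes(token_counts: list[int]) -> set[int]:
    """Distinct padded lengths a static-shape graph would be compiled for."""
    return {next_pow2(count) for count in token_counts}


# Seven distinct observed counts collapse into two static shapes.
observed = [97, 101, 110, 128, 130, 200, 255]
print(sorted(bucket_shapes(observed)))  # [128, 256]
```

The number of compiled variants is bounded by the number of buckets, not by how many different routing distributions the run happens to see.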
The one dynamic axis we do mark
The global-batch dimension is the only genuinely dynamic axis in the training
graph. Gradient accumulation, auto-fit retries, and the final-batch-of-epoch
case all vary it. We mark it with torch._dynamo.maybe_mark_dynamic(t, 0) on
the warmup step, exactly once, and automatic_dynamic_shapes=False prevents
Dynamo from inferring any other dim as dynamic.
When Dynamo sees mark_dynamic on a tensor whose shape happens to match another
tensor's shape it might have inferred as dynamic earlier, it will create a new
symbolic int and try to reconcile. With automatic_dynamic_shapes=False that
reconciliation does not happen, and the run stays on the static path. This is
exactly what we want: the one dynamic axis is opt-in, not Dynamo-inferred.
4. Compile cache hygiene
Where the cache lives
TORCHINDUCTOR_CACHE_DIR is set explicitly at import time in the main training
entrypoint. It defaults to the project cache root, falls back to a
process-private temporary cache if that is unwritable, and is reported in the
status API. The checked-in local companion is Compile runtime env sample,
which keeps cache location and runtime env reporting visible without requiring a
full training launch.
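A sketch of the "project cache root, else process-private temp dir" fallback, under stated assumptions: the `inductor_cache` subdirectory name and the `resolve_inductor_cache_dir` helper are hypothetical, and the status-API reporting is left out.

```python
import os
import tempfile
from pathlib import Path


def resolve_inductor_cache_dir(project_root: str) -> str:
    """Prefer the project cache root; fall back to a private temp dir."""
    preferred = Path(project_root) / "inductor_cache"  # hypothetical layout
    try:
        preferred.mkdir(parents=True, exist_ok=True)
        # Probe writability by creating (and auto-deleting) a scratch file.
        with tempfile.NamedTemporaryFile(dir=preferred):
            pass
        chosen = str(preferred)
    except OSError:
        chosen = tempfile.mkdtemp(prefix="inductor_cache_")
    # Must run before anything triggers a compile in this process.
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = chosen
    return chosen
```

Returning the chosen path makes it easy to log or report which of the two locations a given run actually used.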
TORCHINDUCTOR_FX_GRAPH_CACHE=1 and TORCHINDUCTOR_AUTOGRAD_CACHE=1 are
enabled in the H200 bench launchers. The autograd cache is the difference
between a 15-minute first compile and a 30-second second compile for the same
model on the same host.
Ephemeral storage bites
On hosted H200 runs the inductor cache previously filled the ephemeral mount while compiling the deep hybrid preset with full enriched features. Padded MoE is still compilable, but the cache footprint is large enough that a single cache-clear run on a fresh host can take an hour of lazy backward compile before steady state. We moved the cache to a persistent volume and wired an explicit warm-cache sync step into the launcher so that bench hosts inherit a warm cache instead of recompiling from zero.
Cache sync across hosts
The public cache-plumbing examples pin the contract: a bench host starting up
should refresh the expected tokenizer artifact, set TORCHINDUCTOR_CACHE_DIR
to the shared path, and skip re-seeding if the hash matches. That contract
exists because concurrent cache sync on the same host can trash a warm cache;
the safe path is "skip if already synced" and "refuse to sync into a
non-writable path."
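The two safety rules can be sketched as follows. The marker-file name, the `sync_warm_cache` helper, and the idea of passing the seed hash in as a string are all assumptions for illustration; the sketch also does not take a lock, so it does not by itself protect against two syncs racing on the same host.

```python
import os
import shutil
from pathlib import Path

MARKER = ".cache_seed_hash"  # hypothetical marker file name


def sync_warm_cache(src: Path, dst: Path, seed_hash: str) -> str:
    """Seed dst from src unless dst already holds the same seed."""
    dst.mkdir(parents=True, exist_ok=True)
    if not os.access(dst, os.W_OK):
        raise PermissionError(f"refusing to sync into non-writable path: {dst}")
    marker = dst / MARKER
    if marker.exists() and marker.read_text().strip() == seed_hash:
        return "skipped"  # already synced: leave the warm cache alone
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # Write the marker last so an interrupted copy is not treated as done.
    marker.write_text(seed_hash)
    return "synced"
```

Writing the marker only after the copy completes means a host that crashed mid-sync will re-seed on the next start instead of trusting a partial cache.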
Reset discipline
torch._dynamo.reset() is called at exactly one site - after a CUDA retry
re-exec that rebuilds the model. Anywhere else it is a bug. Resetting Dynamo
invalidates all cached graphs, and on the deep hybrid that is 15 to 20 minutes
of re-compile. We once had a helpful piece of auto-fit code that called
reset() on every shape-change candidate, and it made the retry loop feel like
it was hung.
Suppressing errors
torch._dynamo.config.suppress_errors = True and torch._dynamo.config.disable = True are used in
exactly two places, both behind compile-off guards. They exist for operator
footguns and not as a general "make compile problems go away" switch. We do not
ship with suppress in the default hot path - if something does not compile, we
want the error.
5. The NCCL heartbeat interaction
torch.compile's Triton JIT on the deep hybrid preset takes 15 to 20 minutes
on H200:8 cold. NCCL's default heartbeat monitor kills any rank that does not
run a collective during that window. The symptom is a process torn down with a
timeout error deep inside a compile pass, and a remaining rank hanging on the
next collective.
The fix is three env vars we set automatically when LOCAL_RANK is detected:
TORCH_NCCL_ENABLE_MONITORING=0
TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=7200
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT=7200
Plus TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=0 because the autotune subprocess OOMed
on large matmuls and returned inf ms for legitimate configs, which then
poisoned the cache with a bad pick. The compile-side lesson is that compile
warmup and the distributed watchdog have no native handshake, so we impose one.
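A sketch of how the "set automatically when LOCAL_RANK is detected" step could look. The `apply_compile_warmup_env` helper is hypothetical, and the choice not to override values the operator already exported is an assumption, not something the text above states.

```python
import os

# Values taken from the list above, plus the autotune switch.
COMPILE_WARMUP_ENV = {
    "TORCH_NCCL_ENABLE_MONITORING": "0",
    "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC": "7200",
    "TORCH_DISTRIBUTED_DEFAULT_TIMEOUT": "7200",
    "TORCHINDUCTOR_MAX_AUTOTUNE_GEMM": "0",
}


def apply_compile_warmup_env(environ=os.environ) -> list[str]:
    """Set watchdog and autotune defaults for distributed launches only.

    Returns the names that were newly set. Values that are already
    present are kept (assumption: operator overrides win).
    """
    if "LOCAL_RANK" not in environ:
        return []  # single-process run: leave the environment untouched
    applied = []
    for name, value in COMPILE_WARMUP_ENV.items():
        if name not in environ:
            environ[name] = value
            applied.append(name)
    return applied
```

This has to run before the process group is initialized and before the first compile, since both read their settings from the environment at startup.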
The broader troubleshooting literature also helps with triage discipline here. Long cold-starts, watchdog kills, and backend autotune memory pressure can all surface as the same vague "compile hung" complaint. Treat them as different classes of failure: rank coordination, cache reuse, and autotune search breadth are separate control knobs, so they should be bisected separately instead of collapsed into one bucket.
6. Noise we learned to ignore
Dynamo prints a lot. Some of it matters, most does not. We keep a short allowlist:
| Log signal | Action |
|---|---|
| triton._C.libtriton.native_specialize_impl warnings during warmup | Ignore - expected, not a break |
| graph break log lines matching known Mamba sites | Count; ignore if within expected bound |
| accumulated_cache_size_limit hit | Always a regression, alert |
| Autotune "Ignoring this choice" | Ignore unless correlated with a step-time jump; if correlated, autotune OOMed |
Anything above the expected Mamba break count is a regression. Cache-limit hits are treated the same way; we have an alert on the log line.
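The allowlist can be sketched as a small classifier. The substring patterns below are illustrative assumptions rather than exact PyTorch log formats, the `triage` helper is hypothetical, and the bound of 13 is the per-forward break count quoted earlier for the deep hybrid preset.

```python
import re

# First matching rule wins. Patterns are assumed substrings, not
# guaranteed to match real log output verbatim.
RULES = [
    (re.compile(r"native_specialize_impl"), "ignore"),
    (re.compile(r"accumulated_cache_size_limit"), "alert"),
    (re.compile(r"Ignoring this choice"), "check_step_time"),
    (re.compile(r"[Gg]raph break"), "count"),
]

EXPECTED_MAMBA_BREAKS = 13  # breaks per forward on the deep hybrid preset


def triage(lines):
    """Return (action per line, True if the break count exceeds the bound)."""
    actions = []
    breaks = 0
    for line in lines:
        action = "unclassified"
        for pattern, name in RULES:
            if pattern.search(line):
                action = name
                break
        if action == "count":
            breaks += 1
        actions.append(action)
    return actions, breaks > EXPECTED_MAMBA_BREAKS
```

Feeding it the log lines from a single forward pass gives a yes-or-no regression signal on break count, while any "alert" action can be routed straight to paging.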
Current compile policy
The current policy keeps the six Dynamo config lines, the MBlock disable, the
four surgical disable points around DTensor/MoE/score-mod/API, the padded MoE
path, the single explicit dynamic axis, the separate Inductor and autograd
caches, the reset-exactly-once rule, and the NCCL heartbeat trio. The
buffer-not-int rule is treated as a hard lint item on anything compiled code
touches.
The policy does not treat fullgraph=True as a near-term goal; the Mamba
chunk-scan custom op it would require to close the one to two percent break
overhead is substantial work and remains deferred. It also excludes compiler
collectives, any use of torch._dynamo.reset() outside the retry re-exec, and
any dynamic axis we did not mark explicitly. The MoE-counter-as-int pattern is gone from the
codebase, and compile-disable guards remain scoped rather than broad.
torch.compile is genuinely load-bearing once these rules are in place.
Without them it is a liability. The difference is not the compiler; it is the
stance you compile with.
That stance also frames The Torch 2.12 journey, Mamba 3 parallel performance, and Sequence, context, and expert splits in the hybrid stack: compile behavior, kernel economics, and ownership boundaries all have to be read on the same lane or the diagnosis drifts into folklore.
Frequently asked questions
Why not force fullgraph=True and eliminate all breaks?
Why keep compiler collectives disabled?
When is allow_in_graph worth trying instead of disabling the whole block?
torch.compiler.allow_in_graph() is an escape hatch, not the default integration path, and if you need a boundary to stay opaque through the whole compile stack, the supported route is a real custom op with torch.library plus FakeTensor/meta behavior. In practice: use allow_in_graph for a Dynamo-only seam; use a custom op if you own the kernel boundary and want that boundary to stay stable.
How do I tell whether the failure is in Dynamo tracing or deeper in the backend?
TORCH_LOGS="graph_breaks,guards,recompiles" will usually tell you whether you are looking at trace churn, guard churn, or a true backend crash. If it is still ambiguous, the fastest ablation is to keep the same callsite and change the backend: backend="eager" tests Dynamo capture without backend lowering, while backend="aot_eager" keeps AOTAutograd in the loop without asking Inductor to lower the result.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- ABlock: The attention-heavy block family in MegaCpp's A/M/E/R notation.
- MBlock: The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.
- EBlock: The expert / MoE block family in MegaCpp's A/M/E/R notation.
- DTensor: PyTorch's mesh-backed distributed-tensor abstraction: one logical tensor with explicit shard or replica metadata across ranks.
- Compile: Why MegaCpp treats regional compile as a runtime-boundary decision rather than a blanket switch, and how compile ordering stays tied to distributed…
- NCCL: NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.
- Mamba: A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
- Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
- H200: NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
- MoE: Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
- CUDA: NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.