MegaCpp Engineering: Applied C++ model systems

Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 11 min read • David Gornshtein

Torch Compile

Dynamo

Mamba

MegaCpp

Dynamo and torch.compile Breakage on a Mamba-3 Hybrid

Graph breaks, recompile storms, guard explosions, and cache-hygiene rules we landed while keeping torch.compile useful on MegaCpp's hybrid Mamba-3 + Transformer stack.

MegaCpp's training core is a hybrid: attention blocks (ABlock), Mamba-3 SSM blocks (MBlock), expert blocks (EBlock, MoE), and Engram blocks in a repeating pattern across 52 layers. A graph break here means TorchDynamo stops tracing at one piece of Python, runs that slice eagerly, and resumes tracing on the other side. A recompilation means the previously compiled graph no longer matches the runtime guards for shapes, dtypes, or Python-side state, so TorchDynamo and Inductor build another version.

The TPU/XLA compile story lives in Graph recompilation hell; this one is specifically about Dynamo and Inductor on CUDA.

Why this matters

On a hybrid model the difference between a compiled run that merely boots and one that actually pays off is tens of minutes of first-compile wall clock and a low single-digit percent steady-state tax that compounds across the whole training wave. Dynamo's defaults assume small, homogeneous models; a 52-block mixed-architecture net with an opaque Triton kernel per Mamba layer, top-k MoE routing, and a padded expert dispatch path is the exact shape those defaults were not designed for. The failure modes are easy to misread - a recompile storm looks like a hang, a guard explosion looks like a NaN, an autotune OOM looks like a kernel bug - and the tooling tells you about each one only by printing at the wrong level.

What follows is the set of knobs, disable points, and cache rules that finally made torch.compile a net positive for us on CUDA.

1. Ground rules we ended up with

Six configuration lines in the main training entrypoint do most of the work. They are set before any compile happens, before Dynamo is allowed to trace anything. The local proof surface is mirrored in smaller checked-in form by Compile warmup policy sample and Compile runtime env sample:

torch._dynamo.config.capture_scalar_outputs = True
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.accumulated_cache_size_limit = 256
torch._dynamo.config.automatic_dynamic_shapes = False
torch._dynamo.config.assume_static_by_default = True
torch._dynamo.config.enable_compiler_collectives = False

Each line is a scar.

capture_scalar_outputs=True lets Dynamo trace through ops that return Python scalars (.item()-adjacent patterns in MoD/MoE bookkeeping) without forcing a graph break on the site. In the end we do not want those scalars traced; we want the break moved into a torch.compiler.disable region we control. But we want that to be our choice, not whatever Dynamo would have done by default.

cache_size_limit=64 and accumulated_cache_size_limit=256 exist because our deep hybrid preset has 52 compiled blocks; the per-callsite default of 8 guarantees a cache-eviction storm. The accumulated cap is the total budget across every compiled callsite in the process. Hitting either triggers the recompile_limit=64 log line, and once you see it the next step will be slow: Dynamo silently falls back to eager on the callsite that hit the limit.

automatic_dynamic_shapes=False plus assume_static_by_default=True is the main load-bearing default. Letting Dynamo auto-promote dims to dynamic after it sees two values is how you get a run where step 1 compiles for 20 minutes, step 2 recompiles for 12, and step 3 recompiles again because dbs wiggled by one during a gradient-accumulation warmup. The key point is static-first, not static-only. We mark dynamic dims by hand, explicitly, and only where we want them. Concretely, the first batch axis is marked with torch._dynamo.maybe_mark_dynamic(t, 0) on warmup inputs; everything else stays static by construction.

enable_compiler_collectives=False disables the experimental knob that tries to coordinate Dynamo across DDP ranks. It interacts poorly with our regional compile setup and produced ranks with divergent guard trees, which then deadlock on a collective that one rank decided to inline and another did not.

2. The graph breaks we accepted

We run fullgraph=False. That is a choice.

MBlock.forward is permanently wrapped in @torch.compiler.disable. We tried the alternatives - per-block compile with the Mamba kernel allowed to recompile, allow_in_graph(mamba_chunk_scan_combined), splitting MBlock into a compiled outer shell and disabled inner - and each one eventually failed. The Dynamo-traces-through-it path crashed because Dynamo still walked into the Triton kernel and hit .data_ptr() on a FakeTensor proxy. Per-block compile made the Mamba graph breaks worse, not better, because each block carried its own guard tree and the guards did not all match across blocks of the same type.

So: MBlock is black-boxed. The whole-model compile runs with breaks at each Mamba layer. We count them and cap them. On our deep hybrid preset that is 13 breaks per forward. Each break costs a little (sync plus Python dispatch reentry); collectively, on H200:8 DDP, it is a fixed one to two percent tax measured at steady state.

Four other disable points exist inside the main model runtime module and friends, gating things Dynamo cannot safely see:

state mutations in the API server hand-off surface
the DTensor-safe embedding dispatch
a fused MoE entry that forces a graph break because F.grouped_mm goes through a path Inductor cannot fuse across EBlock boundaries
the score_mod wrapper that adapts around the current softcap ABI mismatch

None of these are "nice to have" disables. Each earned its decorator by crashing a run.

3. The graph breaks we fought

The MoE overflow counter

The worst graph-break incident was not a graph break. It was a recompile storm that looked like one.

MoE._overflow_total was a Python int. It incremented every forward(). Dynamo specialized on its value. Every step produced a new guard, a new cache key, a new compile. On 8-GPU DDP the behavior manifested as "compile hangs and NaN", and the team misdiagnosed it as a Muon bf16 interaction, the same kind of misleading early-run surface discussed in Loss curves and the divergence playbook, and worked around it for weeks with a no-compile escape hatch.

The real root cause was Dynamo hitting recompile_limit=64 on the MoE callsite, falling back to eager for MoE, disagreeing with DDP's reducer about which parameters had run, and producing silent grad-sync drops. Converting the counter to register_buffer makes it a tensor, takes it off the guard path, and restores a stable compile cache. The fix recovered thousands of tokens per second on 8-GPU DDP that had previously been hanging or NaN-ing.

The useful refinement from the research packet is that not every mutable field deserves to live in the same state class. If a router statistic or overflow tracker is runtime bookkeeping rather than model identity, keeping it as a non-persistent buffer is usually the cleaner contract: it stays tensor-shaped for compile, but it does not pretend to be checkpoint-defining state.

The lesson: any Python scalar touched by compiled code becomes part of the guard tree. If it increments, you have a time bomb.
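The buffer-not-int rule can be sketched as follows. `RouterStats`, `overflow_total`, and `record` are hypothetical names for illustration, not the MegaCpp module; the point is only that the counter lives in a non-persistent tensor buffer and is updated in place.

```python
import torch
import torch.nn as nn


class RouterStats(nn.Module):
    """Hypothetical bookkeeping holder; names are illustrative only."""

    def __init__(self) -> None:
        super().__init__()
        # A 0-dim tensor buffer instead of a Python int attribute, so the
        # running value is tensor data rather than a Python scalar.
        # persistent=False keeps it out of state_dict / checkpoints.
        self.register_buffer(
            "overflow_total", torch.zeros((), dtype=torch.long), persistent=False
        )

    def record(self, overflow_mask: torch.Tensor) -> None:
        # In-place tensor update; no Python-side integer changes per step.
        self.overflow_total.add_(overflow_mask.sum())


stats = RouterStats()
stats.record(torch.tensor([True, False, True]))
```

After the call above the buffer holds 2, and `stats.state_dict()` is empty because the buffer is non-persistent.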

The padded MoE path

MoE dispatch wants dynamic shapes (per-expert token counts). Dynamo with automatic_dynamic_shapes=False does not want them. We reconciled the two with a padded dispatch: tokens are bucketed to the next power of two of expert capacity, the dense matmul runs on the padded shape, and a mask selects valid outputs. The padded path has a static shape, which means it compiles once per bucket size instead of once per observed distribution.

The checked-in Expert-parallel routing sample and MoE dispatch fast paths sample are the smallest local proof of that coupling. One keeps the capacity math explicit, the other keeps the permutation and communication cost explicit. If either side changes, the compile contract changes with it.

The tradeoff is explicit: roughly 25 percent padding overhead in the worst case, for a fully compilable graph that does not recompile when the routing distribution shifts. We measured the alternative and it was worse in every dimension: slower steady state, longer first compile, unpredictable tail.
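The bucketing idea can be illustrated with plain integer math. This is a sketch under one reading of the text (round a per-expert token count up to the next power of two and keep a validity mask); `next_power_of_two` and `pad_plan` are hypothetical helpers, and the sketch does not try to reproduce the overhead figure quoted above, which depends on how capacity is chosen.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def pad_plan(tokens_for_expert: int) -> tuple[int, list[bool]]:
    """Return the padded bucket size and a mask marking the valid slots."""
    bucket = next_power_of_two(tokens_for_expert)
    mask = [i < tokens_for_expert for i in range(bucket)]
    return bucket, mask


# Different observed token counts collapse onto a small set of static shapes.
observed = [5, 6, 7, 8, 9]
buckets = sorted({pad_plan(n)[0] for n in observed})
```

Here five different routing outcomes map to two bucket sizes, which is why the compiled graph count tracks bucket sizes rather than observed distributions.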

The one dynamic axis we do mark

The global-batch dimension is the only genuinely dynamic axis in the training graph. Gradient accumulation, auto-fit retries, and the final-batch-of-epoch case all vary it. We mark it with torch._dynamo.maybe_mark_dynamic(t, 0) on the warmup step, exactly once, and automatic_dynamic_shapes=False prevents Dynamo from inferring any other dim as dynamic.

When Dynamo sees mark_dynamic on a tensor whose shape happens to match another tensor's shape it might have inferred as dynamic earlier, it will create a new symbolic int and try to reconcile. With automatic_dynamic_shapes=False that reconciliation does not happen, and the run stays on the static path. This is exactly what we want: the one dynamic axis is opt-in, not Dynamo-inferred.

4. Compile cache hygiene

Where the cache lives

TORCHINDUCTOR_CACHE_DIR is set explicitly at import time in the main training entrypoint. It defaults to the project cache root, falls back to a process-private temporary cache if that is unwritable, and is reported in the status API. The checked-in local companion is Compile runtime env sample, which keeps cache location and runtime env reporting visible without requiring a full training launch.
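The default-then-fallback behavior might look like the sketch below. `resolve_inductor_cache_dir` is a hypothetical helper, not the entrypoint's actual code; it assumes the variable is set before Inductor is first used.

```python
import os
import tempfile


def resolve_inductor_cache_dir(preferred: str) -> str:
    """Use `preferred` if it is writable, else a process-private temp dir."""
    try:
        os.makedirs(preferred, exist_ok=True)
        # Creating a temp file is a direct writability probe.
        probe = tempfile.NamedTemporaryFile(dir=preferred)
        probe.close()
        chosen = preferred
    except OSError:
        chosen = tempfile.mkdtemp(prefix="inductor-cache-")
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = chosen
    return chosen
```

Returning the chosen path makes it easy to report the effective cache location, for example in a status endpoint.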

TORCHINDUCTOR_FX_GRAPH_CACHE=1 and TORCHINDUCTOR_AUTOGRAD_CACHE=1 are enabled in the H200 bench launchers. The autograd cache is the difference between a 15-minute first compile and a 30-second second compile for the same model on the same host.

Ephemeral storage bites

On hosted H200 runs the inductor cache previously filled the ephemeral mount while compiling the deep hybrid preset with full enriched features. Padded MoE is still compilable, but the cache footprint is large enough that a single cache-clear run on a fresh host can take an hour of lazy backward compile before steady state. We moved the cache to a persistent volume and wired an explicit warm-cache sync step into the launcher so that bench hosts inherit a warm cache instead of recompiling from zero.

Cache sync across hosts

The public cache-plumbing examples pin the contract: a bench host starting up should refresh the expected tokenizer artifact, set TORCHINDUCTOR_CACHE_DIR to the shared path, and skip re-seeding if the hash matches. That contract exists because concurrent cache sync on the same host can trash a warm cache; the safe path is "skip if already synced" and "refuse to sync into a non-writable path."
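The two safety rules ("skip if already synced" and "refuse to sync into a non-writable path") can be sketched with a marker file. Everything here is an assumption for illustration: the `.synced_hash` marker name, the hashing scheme, and the `sync_cache` helper are hypothetical, and the sketch does not take a lock, so it does not by itself make concurrent syncs safe.

```python
import hashlib
import os
import shutil

MARKER = ".synced_hash"  # hypothetical marker file name


def tree_hash(root: str) -> str:
    """Hash relative paths and sizes of every file under root (marker excluded)."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            if name == MARKER:
                continue
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            digest.update(f"{rel}:{os.path.getsize(path)}\n".encode())
    return digest.hexdigest()


def sync_cache(src: str, dst: str) -> str:
    """Copy a warm cache from src to dst unless dst already matches."""
    os.makedirs(dst, exist_ok=True)
    if not os.access(dst, os.W_OK):
        raise PermissionError(f"refusing to sync into non-writable path: {dst}")
    want = tree_hash(src)
    marker = os.path.join(dst, MARKER)
    if os.path.exists(marker):
        with open(marker) as fh:
            if fh.read().strip() == want:
                return "skipped"
    shutil.copytree(src, dst, dirs_exist_ok=True)
    with open(marker, "w") as fh:
        fh.write(want)
    return "synced"
```

A second call with an unchanged source returns "skipped" without touching the destination, which is the property that protects a warm cache from being rewritten.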

Reset discipline

torch._dynamo.reset() is called at exactly one site - after a CUDA retry re-exec that rebuilds the model. Anywhere else it is a bug. Resetting Dynamo invalidates all cached graphs, and on the deep hybrid that is 15 to 20 minutes of re-compile. We once had a helpful piece of auto-fit code that called reset() on every shape-change candidate, and it made the retry loop feel like it was hung.

Suppressing errors

torch._dynamo.config.suppress_errors = True and .disable = True are used in exactly two places, both behind compile-off guards. They exist for operator footguns and not as a general "make compile problems go away" switch. We do not ship with suppress in the default hot path - if something does not compile, we want the error.

5. The NCCL heartbeat interaction

torch.compile's Triton JIT on the deep hybrid preset takes 15 to 20 minutes on H200:8 cold. NCCL's default heartbeat monitor kills any rank that does not run a collective during that window. The symptom is a process torn down with a timeout error deep inside a compile pass, and a remaining rank hanging on the next collective.

The fix is three env vars we set automatically when LOCAL_RANK is detected:

TORCH_NCCL_ENABLE_MONITORING=0
TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=7200
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT=7200

Plus TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=0 because the autotune subprocess OOMed on large matmuls and returned inf ms for legitimate configs, which then poisoned the cache with a bad pick. The compile-side lesson is that compile warmup and the distributed watchdog have no native handshake, so we impose one.

The broader troubleshooting literature also helps with triage discipline here. Long cold-starts, watchdog kills, and backend autotune memory pressure can all surface as the same vague "compile hung" complaint. Treat them as different classes of failure: rank coordination, cache reuse, and autotune search breadth are separate control knobs, so they should be bisected separately instead of collapsed into one bucket.
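One way to impose that handshake is a small launcher helper. `apply_compile_warmup_env` is a hypothetical function; the variable names and values are the ones listed above. Not overriding values an operator already set is an assumption of this sketch, not something the article specifies, and the helper has to run before the process group is initialized.

```python
import os

# Values taken from the list above.
WATCHDOG_ENV = {
    "TORCH_NCCL_ENABLE_MONITORING": "0",
    "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC": "7200",
    "TORCH_DISTRIBUTED_DEFAULT_TIMEOUT": "7200",
    "TORCHINDUCTOR_MAX_AUTOTUNE_GEMM": "0",
}


def apply_compile_warmup_env(env=None) -> list[str]:
    """Set the watchdog/autotune variables when running under a launcher.

    Returns the names that were newly set. Does nothing without LOCAL_RANK.
    """
    env = os.environ if env is None else env
    if "LOCAL_RANK" not in env:
        return []
    applied = []
    for key, value in WATCHDOG_ENV.items():
        if key not in env:  # keep explicit operator overrides
            env[key] = value
            applied.append(key)
    return applied
```

Passing a plain dict as `env` makes the helper easy to check without mutating the real process environment.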

6. Noise we learned to ignore

Dynamo prints a lot. Some of it matters, most does not. We keep a short allowlist:

| Log signal | Action |
| --- | --- |
| `triton._C.libtriton.native_specialize_impl` warnings during warmup | Ignore - expected, not a break |
| `graph break` log lines matching known Mamba sites | Count; ignore if within expected bound |
| `accumulated_cache_size_limit` hit | Always a regression, alert |
| Autotune "Ignoring this choice" | Ignore unless correlated with a step-time jump; if correlated, autotune OOMed |

Anything above the expected Mamba break count is a regression. Cache-limit hits are treated the same way; we have an alert on the log line.
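The allowlist can be turned into a small log filter. This is a sketch: the patterns are substrings taken from the table above, real log text varies across PyTorch and Triton versions, and `classify` and `graph_breaks_regressed` are hypothetical helpers.

```python
import re

# Order matters: the alert rule is checked first.
RULES = [
    (re.compile(r"accumulated_cache_size_limit"), "alert"),
    (re.compile(r"native_specialize_impl"), "ignore"),
    (re.compile(r"graph break", re.IGNORECASE), "count"),
    (re.compile(r"Ignoring this choice"), "check_step_time"),
]


def classify(line: str) -> str:
    """Map one log line to an action from the allowlist."""
    for pattern, action in RULES:
        if pattern.search(line):
            return action
    return "unclassified"


def graph_breaks_regressed(lines, expected_breaks: int) -> bool:
    """True when more graph-break lines appear than the expected bound."""
    seen = sum(1 for line in lines if classify(line) == "count")
    return seen > expected_breaks
```

With the deep hybrid preset the expected bound would be the known per-forward break count; anything above it trips the regression check.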

Current compile policy

The current policy keeps the six Dynamo config lines, the MBlock disable, the four surgical disable points around DTensor/MoE/score-mod/API, the padded MoE path, the single explicit dynamic axis, the separate Inductor and autograd caches, the reset-exactly-once rule, and the NCCL heartbeat trio. The buffer-not-int rule is treated as a hard lint item on anything compiled code touches.

The policy does not treat fullgraph=True as a near-term goal; the Mamba chunk-scan custom op it would require to close the one to two percent break overhead is substantial work and remains deferred. It also excludes compiler collectives, any use of torch._dynamo.reset() outside the retry re-exec, and any uncomputed dynamic axis. The MoE-counter-as-int pattern is gone from the codebase, and compile-disable guards remain scoped rather than broad.

torch.compile is genuinely load-bearing once these rules are in place. Without them it is a liability. The difference is not the compiler; it is the stance you compile with.

That stance also frames The Torch 2.12 journey, Mamba 3 parallel performance, and Sequence, context, and expert splits in the hybrid stack: compile behavior, kernel economics, and ownership boundaries all have to be read on the same lane or the diagnosis drifts into folklore.

FAQ

Frequently asked questions

Why not force fullgraph=True and eliminate all breaks?

Because on this hybrid stack the forced full-graph route pushes failure into unsupported custom-op and Triton paths. A bounded, counted set of known breaks is cheaper than pretending the whole model is one healthy graph. The local decision surface is Opaque-kernel compile wrapper sample, which shows the same choice in a smaller checked-in example.

Why keep compiler collectives disabled?+

Because the feature solves one real class of distributed compileQuick term guideCompileWhy MegaCpp treats regional compile as a runtime-boundary decision rather than a blanket switch, and how compile ordering stays tied to distributed… bugs, but it is not free. On this lane the dominant risk is still rank-divergent guard trees interacting with regional compile, manual disable seams, and long cold-start behavior. A coordination feature is only a win here if it preserves the same stable callsites and watchdog behavior we already trust.

When is allow_in_graph worth trying instead of disabling the whole block?

Only when the failure is in Dynamo's frontend and the downstream stack already handles the function. PyTorch's current guidance is stricter than a lot of old field lore: torch.compiler.allow_in_graph() is an escape hatch, not the default integration path, and if you need a boundary to stay opaque through the whole compile stack the supported route is a real custom op with torch.library plus FakeTensor/meta behavior. In practice: use allow_in_graph for a Dynamo-only seam; use a custom op if you own the kernel boundary and want that boundary to stay stable.

How do I tell whether the failure is in Dynamo tracing or deeper in the backend?

Start with logging before retuning knobs: TORCH_LOGS="graph_breaks,guards,recompiles" will usually tell you whether you are looking at trace churn, guard churn, or a true backend crash. If it is still ambiguous, the fastest ablation is to keep the same callsite and change the backend: backend="eager" tests Dynamo capture without backend lowering, while backend="aot_eager" keeps AOTAutograd in the loop without asking Inductor to lower the result.
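The backend ablation can be sketched as a small ladder that reports the first backend at which the same callsite fails. The helper names first_failing_backend and make_torch_runner are hypothetical; the backend strings "eager", "aot_eager", and "inductor" are real torch.compile backend names. The ladder logic itself does not import torch, so it can be exercised with a stub runner.

```python
from typing import Callable, Optional, Tuple

# Ordered from least to most compiler machinery in the loop.
LADDER: Tuple[str, ...] = ("eager", "aot_eager", "inductor")

# What a first failure at each rung most likely points at.
STAGE = {
    "eager": "Dynamo capture (tracing, guards, graph breaks)",
    "aot_eager": "AOTAutograd (functionalization, FakeTensor, joint graph)",
    "inductor": "Inductor lowering or code generation",
}


def first_failing_backend(
    run: Callable[[str], None], ladder: Tuple[str, ...] = LADDER
) -> Optional[str]:
    """Call run(backend) up the ladder; return the first backend that raises."""
    for backend in ladder:
        try:
            run(backend)
        except Exception:
            return backend
    return None  # every rung passed


def make_torch_runner(fn, *example_args):
    """Build a runner that compiles the same callsite with a given backend."""
    import torch  # imported lazily so the ladder logic stays torch-free

    def run(backend: str) -> None:
        torch._dynamo.reset()  # drop cached graphs so each rung recompiles
        torch.compile(fn, backend=backend)(*example_args)

    return run
```

A typical use would be culprit = first_failing_backend(make_torch_runner(model_block, sample_input)), then reading STAGE[culprit] when the result is not None. Running with TORCH_LOGS="graph_breaks,guards,recompiles" set in the environment gives the log context for whichever rung failed.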

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

ablock

The attention-heavy block family in MegaCpp's A/M/E/R notation.

mblock

The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.

eblock

The expert / MoE block family in MegaCpp's A/M/E/R notation.

DTensor

PyTorch's mesh-backed distributed-tensor abstraction: one logical tensor with explicit shard or replica metadata across ranks.

Compile

Why MegaCpp treats regional compile as a runtime-boundary decision rather than a blanket switch, and how compile ordering stays tied to distributed…

Grounding: Regional compile without losing the plot

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.

Mamba

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
