GateSkip and FlexiDepth after the router
How MegaCpp treats dynamic-depth features as bookkeeping and wiring problems after the router, not just as a paper-level skipping idea.

The interesting part of dynamic depth is not the first router score. The interesting part is everything that comes after it: residual semantics, loss bookkeeping, frozen-adapter wiring, and a reliable accounting of which tokens actually used which layers.
That is where the MegaCpp examples are strongest. They do not stop at the idea of skipping compute. They show what the runtime still has to preserve after the router has already made its decision.
That post-router ownership problem also shows up elsewhere in MegaCpp. The post "What Megatron can and cannot split" draws the cleaner boundary when the question is "which subsystem owns the branch?", and "The MoE routing we actually shipped" is the production-routing companion when the question is "what survives contact with a real runtime?"
The router is only the start of the contract
GateSkip and FlexiDepth both sound simple when reduced to one sentence. Route a token, skip a layer, save compute. But the local examples make the real cost visible.
For GateSkip, the skip decision still has to live inside a residual stream that remains well-defined when some tokens do less work than others. That is why the examples split the surface into:
- a residual router example
- explicit loss bookkeeping
- block-taxonomy context so the skip decision stays tied to a real block family
For FlexiDepth, the examples go farther. They preserve not only the skip logic but also the adapter surface and the frozen-backbone wiring. That is the right public lesson: once dynamic depth is attached to a pretrained or partially frozen stack, the question is no longer just "which layer was skipped?" It is also "which moving part is still allowed to learn?"
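The "which moving part is still allowed to learn?" question can be made concrete with a small sketch. Everything here is an assumption for illustration: the parameter names, the `backbone.` / `router.` / `adapter.` prefixes, and the helper functions are hypothetical and are not taken from the MegaCpp examples.

```python
# Hypothetical parameter registry. Names, prefixes, and values are illustrative only.
PARAMS = {
    "backbone.layer0.mlp.weight": [0.1, 0.2],
    "router.layer0.weight": [0.0],
    "adapter.layer0.weight": [0.0, 0.0],
}

# Assumption: only router and adapter parameters may learn; the rest stays frozen.
TRAINABLE_PREFIXES = ("router.", "adapter.")


def split_trainable(params, trainable_prefixes=TRAINABLE_PREFIXES):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable = [n for n in params if n.startswith(trainable_prefixes)]
    frozen = [n for n in params if not n.startswith(trainable_prefixes)]
    return trainable, frozen


def changed_frozen(before, after, frozen_names):
    """Return the frozen parameter names whose values differ between two snapshots."""
    return [n for n in frozen_names if before[n] != after[n]]


trainable, frozen = split_trainable(PARAMS)

# Simulate one update that touches only the router.
after_step = dict(PARAMS)
after_step["router.layer0.weight"] = [0.3]
violations = changed_frozen(PARAMS, after_step, frozen)
```

The point of the second helper is that freezing is a claim worth checking: a snapshot comparison after a step is a cheap way to confirm the backbone really did not move.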
The checked-in examples also keep the implementation contract narrow. GateSkip preserves residual contiguity with elementwise gating and bookkeeping rather than a shape-changing gather/scatter path. FlexiDepth keeps both branches static-shape friendly by pairing skip decisions with explicit adapter wiring and a frozen-backbone story. That separation is why the two features stay legible in real training code instead of dissolving into one generic router abstraction.
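To illustrate what "elementwise gating rather than a shape-changing gather/scatter path" means, here is a minimal pure-Python sketch. The function name and the list-of-lists layout are assumptions made for readability; a real implementation would operate on tensors. It shows only the residual semantics and the shape stability, not a compute saving, because the branch output is still supplied for every token.

```python
def gated_residual(x, branch_out, gate):
    """Elementwise gated residual: y[t] = x[t] + gate[t] * branch_out[t].

    x, branch_out: lists of token vectors with identical shapes.
    gate: one scalar per token; 0.0 leaves that token's residual unchanged.
    The output always has the same number of tokens as the input.
    """
    assert len(x) == len(branch_out) == len(gate)
    return [
        [xi + g * bi for xi, bi in zip(x_t, b_t)]
        for x_t, b_t, g in zip(x, branch_out, gate)
    ]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
branch = [[0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]
gate = [1.0, 0.0, 0.5]  # token 0 fully uses the block, token 1 skips, token 2 is partial

y = gated_residual(x, branch, gate)
```

Because no token is removed or reordered, every downstream consumer sees the same sequence layout whether a token skipped or not. A gather/scatter variant would instead compact the kept tokens before the block and scatter results back afterward, which changes shapes from step to step.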
Why MegaCpp keeps these surfaces separate
The local split between GateSkip and FlexiDepth is useful because the two ideas pay different operational costs.
GateSkip in this pack is primarily about token-wise gating and the accounting that follows from it. The bookkeeping sample matters because sparsity pressure is easy to describe badly. If the runtime cannot show how the gate loss, budget pressure, and actual token path line up, then the feature is only half real.
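One way to make gate loss, budget pressure, and the actual token path line up is to report them together from a single function. The sketch below is hedged: the function names, the mean-gate sparsity term, the 0.5 threshold, the loss weight, and the budget endpoints are all assumptions chosen for illustration, not values from the MegaCpp samples.

```python
def linear_budget(step, total_steps, start=1.0, end=0.5):
    """Fraction of tokens allowed through the gated branch, decayed linearly.

    Assumes total_steps > 0. The start and end fractions are illustrative.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def gateskip_report(ce_loss, gates, step, total_steps,
                    sparsity_weight=0.1, threshold=0.5):
    """Collect the loss terms and the realized token path in one record."""
    sparsity = sum(gates) / len(gates)  # mean gate value as the pressure term
    kept = sum(1 for g in gates if g >= threshold) / len(gates)
    budget = linear_budget(step, total_steps)
    return {
        "ce": ce_loss,
        "gate_sparsity": sparsity,
        "total": ce_loss + sparsity_weight * sparsity,
        "budget": budget,
        "kept_fraction": kept,
        "over_budget": kept > budget,
    }


report = gateskip_report(ce_loss=2.0, gates=[1.0, 0.0, 1.0, 0.0],
                         step=100, total_steps=100)
```

Keeping `budget` and `kept_fraction` in the same record is the design choice that matters here: if the two drift apart, the log shows it directly instead of leaving it to be inferred from the loss curve.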
FlexiDepth is more structural. The examples preserve layer-usage stats, adapter-side movement, and a frozen-backbone story. That makes FlexiDepth less like a routing paper and more like a controlled migration path for dynamic depth on top of an existing model.
The bookkeeping surfaces also differ in a way papers often compress away. The GateSkip samples keep CE loss, gate sparsity pressure, and a linearly decaying token budget in one explicit control surface so later hard-budget inference is still reading the same training-side story. The FlexiDepth samples instead track mean layers used, overall and per-layer skip rate, and a squared total-usage penalty over router scores. That difference is a good reason not to hide both features behind one generic "skip loss" label.
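The FlexiDepth-side statistics can be sketched the same way. This is an illustration under stated assumptions: `scores[l][t]` is a hypothetical layout for the router score of layer `l` on token `t`, the 0.5 threshold is arbitrary, and the penalty is written as the mean over tokens of the squared sum of scores across layers, which is one plausible reading of "squared total-usage penalty" rather than a confirmed formula from the samples.

```python
def flexidepth_stats(scores, threshold=0.5):
    """Usage accounting over router scores.

    scores[l][t]: router score in [0, 1] for layer l and token t (assumed layout).
    A token counts as using a layer when its score is at or above the threshold.
    """
    num_layers = len(scores)
    num_tokens = len(scores[0])
    used = [[1 if s >= threshold else 0 for s in layer] for layer in scores]

    per_layer_skip = [1.0 - sum(layer) / num_tokens for layer in used]
    layers_per_token = [
        sum(used[l][t] for l in range(num_layers)) for t in range(num_tokens)
    ]
    mean_layers_used = sum(layers_per_token) / num_tokens
    overall_skip = 1.0 - mean_layers_used / num_layers

    # Assumed penalty form: mean over tokens of (sum of scores across layers)^2.
    usage_per_token = [
        sum(scores[l][t] for l in range(num_layers)) for t in range(num_tokens)
    ]
    usage_penalty = sum(u * u for u in usage_per_token) / num_tokens

    return {
        "mean_layers_used": mean_layers_used,
        "overall_skip_rate": overall_skip,
        "per_layer_skip_rate": per_layer_skip,
        "usage_penalty": usage_penalty,
    }


# Two layers, two tokens: token 0 uses both layers, token 1 skips layer 0.
stats = flexidepth_stats([[1.0, 0.0], [1.0, 1.0]])
```

Note that the skip rates are computed from thresholded decisions while the penalty is computed from the raw scores. That split is why a single "skip loss" number would hide information: the penalty is what the optimizer sees, and the rates are what the runtime actually did.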
Why this belongs in core blocks rather than in a generic routing folder
These examples live next to Engram, mHC, n-gram embeddings, and block taxonomy for a reason. In MegaCpp, routing is not treated as a free-floating policy module. It is attached to real block families and real residual paths.
That is important because a skip surface can interact badly with branch mixing or residual alternatives if the runtime pretends they are independent. The residual-path and mHC-adjacent examples make that risk explicit. Dynamic depth is not only a router problem. It is a stream-integrity problem.
What the public examples prove
The useful claim is narrower than "we support dynamic depth."
The examples prove that MegaCpp has a public-safe contract for:
- token-wise residual gating
- skip-loss and usage accounting
- frozen-backbone adapter wiring for dynamic-depth variants
- block-family-aware placement of these features in a larger hybrid model
That is enough to support a serious architectural claim. It shows the feature exists as a runtime surface rather than as a research aspiration.
Prior art and context
The general idea is not unique. Mixture-of-Depths is the clearest direct prior art for dynamic token-wise depth allocation. FlexiDepth-style work extends that idea toward pretrained-model adaptation, while older adaptive-compute papers such as ACT, PonderNet, Universal Transformers, and Depth-Adaptive Transformer show the longer history of learned variable compute. GateSkip sits closer to residual-gated layer skipping. MegaCpp's local contribution is narrower and more practical: the public examples show how these ideas survive contact with block taxonomy, residual contracts, adapter wiring, and training-time bookkeeping.
Frequently asked questions
Is dynamic depth just a router feature?
Why keep GateSkip and FlexiDepth separate?
Do these examples prove a CUDA speedup?
Which checked-in files show the budget and adapter sides most directly?