Published by David Gornshtein · 4 min read
Tags: Routing, Dynamic Depth, GateSkip, FlexiDepth, Core Blocks

GateSkip and FlexiDepth after the router

How MegaCpp treats dynamic-depth features as bookkeeping and wiring problems after the router, not just as a paper-level skipping idea.


The interesting part of dynamic depth is not the first router score. The interesting part is everything that comes after it: residual semantics, loss bookkeeping, frozen-adapter wiring, and a reliable accounting of which tokens actually used which layers.

That is where the MegaCpp examples are strongest. They do not stop at the idea of skipping compute. They show what the runtime still has to preserve after the router has already made its decision.

That post-router ownership problem also shows up elsewhere in MegaCpp. The post "What Megatron can and cannot split" draws the cleaner boundary when the question is "which subsystem owns the branch?", and the post on the MoE routing we actually shipped is the production-routing companion when the question is "what survives contact with a real runtime?"

The router is only the start of the contract

GateSkip and FlexiDepth both sound simple when reduced to one sentence. Route a token, skip a layer, save compute. But the local examples make the real cost visible.

For GateSkip, the skip decision still has to live inside a residual stream that remains well-defined when some tokens do less work than others. That is why the examples split the surface into:

  • a residual router example
  • explicit loss bookkeeping
  • block-taxonomy context so the skip decision stays tied to a real block family

For FlexiDepth, the examples go further. They preserve not only the skip logic but also the adapter surface and the frozen-backbone wiring. That is the right public lesson: once dynamic depth is attached to a pretrained or partially frozen stack, the question is no longer just "which layer was skipped?" It is also "which moving part is still allowed to learn?"
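The "which moving part is still allowed to learn?" question can be made concrete with a small sketch. This is not the checked-in wiring sample; it is a plain-Python illustration under stated assumptions. The `Param` and `FlexiDepthLayer` classes, the parameter names, and the `.block.` naming rule are all hypothetical. The only thing taken from the article is the contract itself: the pretrained block stays frozen while the router and adapter remain trainable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Param:
    """Hypothetical stand-in for a framework parameter with a trainable flag."""
    name: str
    trainable: bool = True


@dataclass
class FlexiDepthLayer:
    """Hypothetical layer: a pretrained block plus a router and a light adapter."""
    index: int
    params: List[Param] = field(init=False)

    def __post_init__(self) -> None:
        prefix = f"layers.{self.index}"
        # Parameter names are illustrative, not taken from the MegaCpp code.
        self.params = [
            Param(f"{prefix}.block.attn.weight"),
            Param(f"{prefix}.block.mlp.weight"),
            Param(f"{prefix}.router.weight"),
            Param(f"{prefix}.adapter.weight"),
        ]


def freeze_backbone(layers: List[FlexiDepthLayer]) -> None:
    """Freeze every '.block.' parameter; leave router and adapter trainable."""
    for layer in layers:
        for p in layer.params:
            p.trainable = ".block." not in p.name


def trainable_names(layers: List[FlexiDepthLayer]) -> List[str]:
    """List the parameters an optimizer would still be allowed to update."""
    return [p.name for layer in layers for p in layer.params if p.trainable]


layers = [FlexiDepthLayer(i) for i in range(2)]
freeze_backbone(layers)
names = trainable_names(layers)
```

Listing the trainable names after wiring is a cheap audit: if anything containing `.block.` shows up in `names`, the frozen-backbone story has already been broken before the first optimizer step.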

The checked-in examples also keep the implementation contract narrow. GateSkip preserves residual contiguity with elementwise gating and bookkeeping rather than a shape-changing gather/scatter path. FlexiDepth keeps both branches static-shape friendly by pairing skip decisions with explicit adapter wiring and a frozen-backbone story. That separation is why the two features stay legible in real training code instead of dissolving into one generic router abstraction.
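As a rough illustration of what "elementwise gating rather than a shape-changing gather/scatter path" means, here is a minimal plain-Python sketch. It assumes the residual update takes the form h + g * block(h) with one gate value per token; the function names and the toy block are hypothetical and are not the MegaCpp sample code.

```python
from typing import Callable, List

Vector = List[float]


def gated_residual(
    hidden: List[Vector],
    block: Callable[[Vector], Vector],
    gates: List[float],
) -> List[Vector]:
    """Return h + g * block(h) per token; the output shape equals the input shape."""
    out = []
    for h, g in zip(hidden, gates):
        # In this dense sketch the block runs for every token, so no compute
        # is saved. A gate of 0.0 simply leaves the residual stream unchanged.
        update = block(h)
        out.append([h_i + g * u_i for h_i, u_i in zip(h, update)])
    return out


def toy_block(h: Vector) -> Vector:
    """Hypothetical stand-in for a transformer block."""
    return [2.0 * x for x in h]


hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gates = [1.0, 0.0, 0.5]  # full update, skipped, half-weighted
out = gated_residual(hidden, toy_block, gates)
```

Every token keeps its slot in the sequence, so downstream layers never see a reordered or shortened tensor. The trade-off is that this form only expresses the skip decision; it does not by itself avoid the skipped work.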

Why MegaCpp keeps these surfaces separate

The local split between GateSkip and FlexiDepth is useful because the two ideas pay different operational costs.

GateSkip in this pack is primarily about token-wise gating and the accounting that follows from it. The bookkeeping sample matters because sparsity pressure is easy to describe badly. If the runtime cannot show how the gate loss, budget pressure, and actual token path line up, then the feature is only half real.
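One way to check that gate loss, budget pressure, and the actual token path line up is to report them from a single function. The sketch below is an assumption-heavy illustration, not the bookkeeping sample itself: the L1-style sparsity term, the 0.5 activity threshold, the budget endpoints, the linear schedule parameters, and the `sparsity_weight` value are all hypothetical choices made for the example.

```python
import math
from typing import Dict, List


def token_budget(step: int, total_steps: int,
                 start: float = 1.0, end: float = 0.5) -> float:
    """Linearly decay the allowed fraction of active tokens from start to end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def cross_entropy(target_probs: List[float]) -> float:
    """Mean negative log-probability assigned to the correct tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def gateskip_report(target_probs: List[float], gates: List[float],
                    step: int, total_steps: int,
                    sparsity_weight: float = 0.1) -> Dict[str, object]:
    """Keep CE, sparsity pressure, and budget on one explicit control surface."""
    ce = cross_entropy(target_probs)
    # Assumed sparsity pressure: mean gate magnitude (an L1-style penalty).
    sparsity = sum(abs(g) for g in gates) / len(gates)
    budget = token_budget(step, total_steps)
    # Assumed notion of "active": gate above a fixed 0.5 threshold.
    active = sum(1 for g in gates if g > 0.5) / len(gates)
    return {
        "ce": ce,
        "gate_sparsity": sparsity,
        "budget": budget,
        "fraction_active": active,
        "over_budget": active > budget,
        "loss": ce + sparsity_weight * sparsity,
    }


report = gateskip_report([0.5, 0.5], [1.0, 0.0, 1.0, 0.0], step=50, total_steps=100)
```

Because the budget and the measured active fraction sit in the same record as the loss terms, a mismatch such as `over_budget` being true late in training is visible at the point where the loss is logged rather than discovered later at inference.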

FlexiDepth is more structural. The examples preserve layer-usage stats, adapter-side movement, and a frozen-backbone story. That makes FlexiDepth less like a routing paper and more like a controlled migration path for dynamic depth on top of an existing model.

The bookkeeping surfaces also differ in a way papers often compress away. The GateSkip samples keep CE loss, gate sparsity pressure, and a linearly decaying token budget in one explicit control surface so later hard-budget inference is still reading the same training-side story. The FlexiDepth samples instead track mean layers used, overall and per-layer skip rate, and a squared total-usage penalty over router scores. That difference is a good reason not to hide both features behind one generic "skip loss" label.
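The FlexiDepth-side quantities named above can be sketched in a few lines. This is a hedged illustration rather than the checked-in sample: it assumes router scores in [0, 1] arranged as one row per token and one column per layer, a hypothetical 0.5 threshold for "layer used", and one plausible reading of the squared total-usage penalty, namely the per-token sum of router scores, squared, then averaged over tokens.

```python
from typing import Dict, List


def flexidepth_stats(scores: List[List[float]],
                     threshold: float = 0.5) -> Dict[str, object]:
    """scores[t][l] is the router score of token t at layer l."""
    num_tokens = len(scores)
    num_layers = len(scores[0])

    # A layer counts as used by a token when its score clears the threshold.
    used = [[s > threshold for s in row] for row in scores]
    mean_layers_used = sum(sum(row) for row in used) / num_tokens

    per_layer_skip_rate = [
        1.0 - sum(used[t][l] for t in range(num_tokens)) / num_tokens
        for l in range(num_layers)
    ]
    overall_skip_rate = 1.0 - mean_layers_used / num_layers

    # Assumed penalty form: square each token's total score, then average.
    usage_penalty = sum(sum(row) ** 2 for row in scores) / num_tokens

    return {
        "mean_layers_used": mean_layers_used,
        "overall_skip_rate": overall_skip_rate,
        "per_layer_skip_rate": per_layer_skip_rate,
        "usage_penalty": usage_penalty,
    }


# Two tokens, four layers: token 0 uses three layers, token 1 uses two.
scores = [[1.0, 1.0, 0.0, 1.0],
          [1.0, 0.0, 0.0, 1.0]]
stats = flexidepth_stats(scores)
```

Under this reading the penalty grows quadratically with how many layers a token leans on, which is a different pressure from a per-gate sparsity term and is part of why a single shared "skip loss" label would blur the two features.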

Why this belongs in core blocks rather than in a generic routing folder

These examples live next to Engram, mHC, n-gram embeddings, and block taxonomy for a reason. In MegaCpp, routing is not treated as a free-floating policy module. It is attached to real block families and real residual paths.

That is important because a skip surface can interact badly with branch mixing or residual alternatives if the runtime pretends they are independent. The residual-path and mHC-adjacent examples make that risk explicit. Dynamic depth is not only a router problem. It is a stream-integrity problem.

What the public examples prove

The useful claim is narrower than "we support dynamic depth."

The examples prove that MegaCpp has a public-safe contract for:

  • token-wise residual gating
  • skip-loss and usage accounting
  • frozen-backbone adapter wiring for dynamic-depth variants
  • block-family-aware placement of these features in a larger hybrid model

That is enough to support a serious architectural claim. It shows the feature exists as a runtime surface rather than as a research aspiration.

Prior art and context

The general idea is not unique. Mixture-of-Depths is the clearest direct prior art for dynamic token-wise depth allocation. FlexiDepth-style work extends that idea toward pretrained-model adaptation, while older adaptive-compute papers such as ACT, PonderNet, Universal Transformers, and Depth-Adaptive Transformer show the longer history of learned variable compute. GateSkip sits closer to residual-gated layer skipping. MegaCpp's local contribution is narrower and more practical: the public examples show how these ideas survive contact with block taxonomy, residual contracts, adapter wiring, and training-time bookkeeping.

Frequently asked questions

Is dynamic depth just a router feature?
No. The router score is the easy part. The hard part is preserving residual semantics, loss accounting, and adapter ownership after different tokens take different paths.
Why keep GateSkip and FlexiDepth separate?
Because they pay different operational costs. GateSkip is mostly about token-wise gating and bookkeeping, while FlexiDepth carries a stronger frozen-backbone and adapter-wiring story.
Do these examples prove a CUDA speedup?
No. They prove the safer prerequisite: the skip choice can be represented without changing tensor shapes, the GateSkip loss path can keep CE, sparsity, and budget in one control surface, and FlexiDepth can keep the full block frozen while the router and adapter remain trainable. Exact kernel synchronization cost or end-to-end speedup would need a separate benchmark, so this article treats the local code as a wiring and bookkeeping contract rather than a performance claim.
Which checked-in files show the budget and adapter sides most directly?
The GateSkip residual-budget sample is the shortest checked-in decoder for the token-budget side. The FlexiDepth skip-router sample and the FlexiDepth frozen adapter wiring sample are the shortest decoders for the cheap-path and frozen-adapter side.

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

MoE

Mixture of Experts. The linked post covers Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Megatron

Why lifting a hybrid attention/Mamba/MoE stack into Megatron-Core is a multi-adapter exercise: base config mapping, layer specs, mixer protocol, and…