MegaCpp Engineering: Applied C++ model systems
Grounded engineering note from the MegaCpp stack
David Gornshtein, 7 min read
Tags: Mixture of Depths, MoDA, MTP, Speculative Decoding, Training

MoD, MoDA, and MTP: Dynamic Depth and Multi-Token Heads

How we allocate compute per layer with Mixture-of-Depths, cross-attend across layers with MoDA, and train multi-token prediction heads that double as a draft source for self-speculative decoding.

Three features in the hybrid stack all sit at the "what does each layer do for each token" level of the design: Mixture-of-Depths (MoD) decides which tokens a layer actually processes, MoDA lets attention heads reach across layer depth, and Multi-Token Prediction (MTP) trains a shared block to predict several future tokens per step. They overlap enough that people conflate them, but they are genuinely different.

If these terms are new

  • MoD means Mixture-of-Depths: a router decides whether a token should pay for a given block.
  • MoDA means a depth-aware attention path where a layer can read K/V from earlier layers at the same position.
  • MTP means multi-token prediction: an auxiliary head predicts several future tokens from one main forward pass.
  • GateSkip is the MoD mode that keeps all tokens on the main path but gates the block contribution softly instead of gathering and scattering token subsets.
  • roll-and-mask means keeping tensor shapes fixed while shifting ids and labels left and masking the invalid tail with ignore_index.

The smallest checked-in explainer pack is the MoD routing surface sample, MoDr router bookkeeping sample, and Shared-block MTP sample.

Why this stack cares about this

Training compute is not uniform across tokens. A whitespace token, an opening brace, or a common identifier like i does not deserve the same budget as a function-name token inside a long definition or an if condition in a dense code region. MoD is the routing answer to that observation.

MoDA is a different axis. Instead of "do I run this block for this token", it asks "can my attention heads reach keys and values from earlier layers at the same position?" That gives the stack cross-depth context with relatively small parameter growth.

MTP is the third axis. At train time it is an auxiliary loss head that predicts multiple future tokens through a shared block reused K times. At inference time the same weights can become a drafter for self-speculative decoding, although the current public inference lane still keeps that integration disabled.

What we built

Mixture-of-Depths routing

The MoD implementation is large, but the first-touch contract is simple. The checked-in MoD routing surface sample keeps the real split visible:

  • topk is the original gathered-token MoD path
  • threshold is the static-shape-friendly compare-against-a-threshold path
  • gateskip keeps all tokens on the path and gates the block contribution softly

That sample also exposes the most important operational distinction directly: topk and threshold are gather/scatter modes, while gateskip is the all-tokens lane. That is why gateskip remains the default shipped mode.

That split lines up with the external literature. The original Mixture-of-Depths paper treats sparse depth as a top-k token-capacity decision per layer, while newer residual-gating work such as GateSkip keeps all tokens on the main path and learns how much of the block output to retain instead of compacting token subsets first. That makes gateskip feel less like a minor variant of gathered-token MoD and more like a different operating point.
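
The difference between the gathered-token lane and the all-tokens lane can be sketched in a few lines. This is a minimal NumPy illustration, not the checked-in sample: `block`, `mod_topk`, `mod_gateskip`, the sigmoid gate, and the router scores are all hypothetical stand-ins, and the threshold mode is left out.

```python
import numpy as np

def block(x):
    # Hypothetical stand-in for the residual branch of a transformer block.
    return np.tanh(x)

def mod_topk(x, scores, capacity):
    """Gathered-token routing: only the top-`capacity` tokens per sequence run the block."""
    B, T, C = x.shape
    out = x.copy()
    idx = np.argsort(-scores, axis=1)[:, :capacity]   # (B, capacity) token indices
    rows = np.arange(B)[:, None]
    picked = x[rows, idx]                             # gather: (B, capacity, C)
    out[rows, idx] = picked + block(picked)           # scatter results back to (B, T, C)
    return out

def mod_gateskip(x, scores):
    """All tokens stay on the path; a soft gate scales the block contribution."""
    gate = 1.0 / (1.0 + np.exp(-scores))              # sigmoid gate in (0, 1), shape (B, T)
    return x + gate[..., None] * block(x)

x = np.ones((1, 4, 2))
scores = np.array([[0.1, 0.9, 0.5, 0.2]])
routed = mod_topk(x, scores, capacity=2)   # tokens 1 and 2 pass through the block
gated = mod_gateskip(x, scores)            # every token gets a scaled contribution
```

The top-k lane changes the token count the block sees, which is what forces gather and scatter steps around it. The gated lane never changes shapes.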

Two bugs from the ablation history matter here. First, MoD plus relation bias crashed because the wrapped block saw compacted (B, T') shapes while the relation-bias path still expected (B, T) shapes. Second, the attention-scorer bridge used to hardcode dense attention in a way that broke side-output accounting, and the full-block MoD path dropped the MoDA depth-KV contract.

MoDA depth-KV buffering

MoDA is intentionally small. The live runtime uses a DepthKVBuffer contract: earlier layers push their K/V views into a depth buffer, and later layers read the concatenated depth K/V as extra context. The key engineering rule is not the class name. It is the detach. Depth K/V must stay out of autograd or the memory story collapses immediately on long context.
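
The push/read contract can be sketched as follows. This is a bookkeeping illustration only: `DepthKVBufferSketch`, its method names, and the (B, T, L, H, D) layout are assumptions, not the runtime's DepthKVBuffer. NumPy has no autograd, so a copy stands in for the detach step.

```python
import numpy as np

class DepthKVBufferSketch:
    """Hypothetical stand-in for the depth-KV buffer contract."""

    def __init__(self):
        self.keys = []     # one (B, T, H, D) array per contributing layer
        self.values = []

    def push(self, k, v):
        # In an autograd framework this is where K/V would be detached.
        # Here a copy plays that role: the buffer holds no link to the source.
        self.keys.append(np.array(k, copy=True))
        self.values.append(np.array(v, copy=True))

    def read(self):
        # Stack along a new depth axis: (B, T, L, H, D) for L pushed layers,
        # so each position sees the K/V of earlier layers at that position.
        return np.stack(self.keys, axis=2), np.stack(self.values, axis=2)

buf = DepthKVBufferSketch()
k0 = np.zeros((1, 3, 2, 4))
v0 = np.ones((1, 3, 2, 4))
buf.push(k0, v0)            # layer 0 contributes
buf.push(k0 + 1, v0 + 1)    # layer 1 contributes
depth_k, depth_v = buf.read()
```

Every push adds another full (B, T, H, D) slice, so the retained state grows with each participating layer.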

The practical scaling rule is simple: this buffer grows with sequence length, layer participation, and KV width at the same time. That makes MoDA different from a small per-layer metadata feature. Even when each layer only appends one extra depth slice, the retained K/V grows with the same long-context pressure that already stresses attention-heavy lanes, which is why the packet treated its memory formula as a first-class open question rather than as an implementation footnote.

That is why MoDA stays off by default in shipped presets. It is a useful context signal, but the memory bill is real. The trade reads more honestly next to Memory-budget anatomy and Sequence, Context, and Expert Splits in the Hybrid Stack than it does in a paper-only framing.
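
To put a number on that bill, a first-order estimate of the retained buffer is batch * sequence * contributing_layers * n_kv_heads * head_dim * bytes_per_element * 2, with the final factor covering K and V. The sketch below just evaluates that product. The function name and the example configuration are hypothetical, not a shipped preset.

```python
def depth_kv_bytes(batch, sequence, contributing_layers,
                   n_kv_heads, head_dim, bytes_per_element):
    """Retained depth-KV buffer size in bytes (buffer only, not attention reads)."""
    kv_factor = 2  # one tensor for K, one for V
    return (batch * sequence * contributing_layers
            * n_kv_heads * head_dim * bytes_per_element * kv_factor)

# Hypothetical configuration: 8k context, 8 contributing layers, 16-bit elements.
size = depth_kv_bytes(batch=1, sequence=8192, contributing_layers=8,
                      n_kv_heads=8, head_dim=128, bytes_per_element=2)
print(size / 2**20, "MiB")  # 256.0 MiB
```

Doubling the sequence length or the number of contributing layers doubles the figure, which is why the estimate belongs next to the rest of the memory budget.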

Multi-Token Prediction and drafting

The MTP module implements the shared-block design reflected directly in the checked-in Shared-block MTP sample. One block is reused K times, step weights are precomputed, and the roll-and-mask path keeps the training graph shape-stable. That is the whole reason the public article can talk about MTP as a practical training feature rather than a paper-only head.
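
A minimal NumPy sketch of the roll-and-mask idea, assuming labels of shape (B, T) and an ignore index of -1. The function below is an illustration with a hypothetical signature, not the checked-in helper.

```python
import numpy as np

IGNORE_INDEX = -1  # positions with this label are skipped by the loss

def roll_and_mask(labels, shift, ignore_index=IGNORE_INDEX):
    """Shift labels left by `shift` along time and mask the wrapped-around tail.

    The output keeps the input shape, so every prediction depth sees the
    same tensor shape instead of a progressively shorter slice.
    """
    if shift < 1:
        raise ValueError("shift must be >= 1")
    rolled = np.roll(labels, -shift, axis=1)   # np.roll returns a new array
    rolled[:, -shift:] = ignore_index          # wrapped entries are not valid targets
    return rolled

labels = np.array([[10, 11, 12, 13, 14]])
print(roll_and_mask(labels, shift=2))  # [[12 13 14 -1 -1]]
```

The alternative, slicing `labels[:, shift:]`, yields a different length for every depth, and that per-depth shape change is exactly what the fixed-shape version avoids.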

That shared-block story is also now visible in upstream substrate docs. Megatron-Core's MTP implementation supports a repeated-layer mode where one MTP layer can be applied across multiple prediction depths, and its pipeline guidance puts MTP on the last PP stage because that is where the extra loss is computed. The practical takeaway matches the sample here: shared-block MTP is a real systems choice, not just a toy simplification.

One cast-and-preserve helper turned out to be load-bearing: the vocabulary shard markers used by tensor-parallel output heads must survive the dtype cast inside MTP. Otherwise the fused loss path stops seeing the same ownership contract as the main head.
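
The failure mode is easy to reproduce by analogy. In the NumPy sketch below, a plain dtype cast returns a fresh array that has lost an attribute attached to the original, while a small helper copies it across. `MarkedArray`, `cast_and_preserve`, and the `vocab_shard` attribute name are hypothetical. They only illustrate the pattern and are not the helper used in the stack.

```python
import numpy as np

class MarkedArray(np.ndarray):
    """ndarray subclass that can carry ad-hoc ownership markers as attributes."""

def cast_and_preserve(weight, dtype, marker_names=("vocab_shard",)):
    """Cast to `dtype`, then copy the named markers onto the result."""
    out = weight.astype(dtype)
    for name in marker_names:
        if hasattr(weight, name):
            setattr(out, name, getattr(weight, name))
    return out

w = np.zeros((4, 2), dtype=np.float32).view(MarkedArray)
w.vocab_shard = (0, 4)                      # hypothetical marker: owned vocab rows

plain = w.astype(np.float16)                # marker is not carried over
kept = cast_and_preserve(w, np.float16)     # marker survives the cast
```

Code downstream of the cast that checks the marker to decide which loss path applies would see it on `kept` but not on `plain`.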

The drafting path reuses the trained MTP module at inference time. The math is credible. The missing public-safe piece is still the engine work around KV-cache rollback, acceptance sampling, and hardware-specific receipts.
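
To show why rollback and acceptance are engine work rather than a property of the weights, here is a simplified greedy-verification sketch. It is not the acceptance-sampling scheme an engine would ship. The function name and the token lists are hypothetical, and `verified_tokens` is assumed to hold the main model's own choice at each drafted position plus one extra position.

```python
def accept_draft(draft_tokens, verified_tokens):
    """Keep the longest prefix where the draft agrees with the main model.

    Returns (accepted_tokens, n_rollback), where n_rollback counts drafted
    positions whose cached K/V would have to be discarded.
    """
    n_ok = 0
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            break
        n_ok += 1
    accepted = list(draft_tokens[:n_ok])
    # The main model's token at the first unmatched position is always usable:
    # it is either the correction or, if all drafts matched, a bonus token.
    if n_ok < len(verified_tokens):
        accepted.append(verified_tokens[n_ok])
    return accepted, len(draft_tokens) - n_ok

tokens, rollback = accept_draft([5, 6, 7], [5, 6, 9, 1])
print(tokens, rollback)  # [5, 6, 9] 1
```

Each rejected draft position leaves stale cache entries behind, so a drafter only helps once the engine can roll them back cheaply.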

How it lands in production

The public feature surface exposes these paths selectively because the three features have genuinely different cost profiles, and those costs only make sense when you read them against Sequence, Context, and Expert Splits in the Hybrid Stack and Throughput vs quality knobs.

MoD: the checked-in routing sample makes the shipped stance readable: gateskip is the default, threshold is the static-shape-friendly ablation, and topk remains available when explicit gathered-token routing is the point of the experiment.

MoDA: the public configuration surface is intentionally minimal. The main contract is the depth-KV buffer, and the default shipped answer is still "off unless the extra memory cost is justified."

MTP: the production training path uses a dedicated fast MTP layer rather than a plain per-depth head stack. The checked-in shared-block sample grounds the part that matters most for first-touch readers: K depths reuse one block, step weights are explicit, and roll-and-mask preserves one stable graph.

Ablations and what we kept

Snippets from the ablation history that shaped these decisions:

  • gateskip sits on the throughput frontier and trains most stably
  • MoDA is measurable but expensive on long context
  • MTP-as-loss is cheap once the output-head ownership and cast rules are right
  • MTP-as-drafter is only interesting once the inference engine catches up

Production checklist

  • MoD must default to gateskip in shipped presets. threshold and topk are ablation modes, not defaults.
  • MoDA K/V must stay detached from autograd.
  • FastMTP requires the output-head ownership markers to survive dtype casts.
  • MTP on XLA must use the safe non-fused loss path.
  • MTP as drafter is not part of the current public inference contract.

Three features, three axes of impact:

Feature | What it skips or adds | Default | Dominant cost
MoD | skips low-importance tokens per layer | gateskip | router plus gather or mask path
MoDA | cross-depth K/V attention buffer | off | depth-KV memory on long context
MTP | K-step future-token auxiliary loss | K=1 or 3 | fused CE plus roll-and-mask

FastMTP uses a K-step roll-and-mask loop like this:

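total = 0.0  # weighted sum of the K per-depth auxiliary losses
for k in range(K):
    h = shared_block(h)  # the same block is reused at every depth
    # Shape-stable shift by k + 1: the invalid tail is masked, not sliced
    h_k = roll_and_mask(h, shift=k + 1)
    # labels_k: targets for depth k, with the tail set to ignore_index
    loss_k = liger_fused_linear_ce(h_k, lm_head_weight, labels_k)
    total += step_weights[k] * loss_k
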
FAQ

Frequently asked questions

What is the shortest way to separate MoD from MoDA?
MoD decides whether a token pays for a layer at all; MoDA changes what context an attention head can read by letting it see K/V from earlier layers at the same position.
What does "roll-and-mask" mean in FastMTP?
It means the tensors keep their original (B, T, C) shapes while the future-token target is represented by rolling ids and labels left, then masking the tail with ignore_index=-1 so XLA and TorchDynamo still see one static graph. The checked-in Shared-block MTP sample is the compact receipt for that contract.
Why use roll-and-mask instead of dynamic slicing for MTP?
Because the point is not only convenience. Keeping one fixed tensor shape helps compilers and XLA-style runtimes hold onto one stable graph across all K prediction depths, while dynamic slicing turns the training head into a shape-changing branch. That is why the article treats roll-and-mask as part of the feature contract, not just as one implementation detail.
Why is gateskip the default shipped MoD mode?
Because it gave the cleanest throughput-versus-loss trade-off without paying the gather, scatter, and sorting overhead of the heavier routing modes. The MoD routing surface sample shows that boundary directly.
Why is MTP not already exposed as the default public drafter?
Because the weights alone are not enough. The inference engine still needs KV-cache rollback, an acceptance path, and hardware-specific benchmarking before drafting becomes a reliable net speedup.
Why does shared-block MTP change pipeline layout at all?
Because the auxiliary MTP losses are computed alongside the model's final output path, so the extra work naturally accumulates on the last pipeline stage. That is why current upstream guidance treats MTP as a layout concern, not as a free extra head you can scatter arbitrarily across PP stages.
How should I size MoDA's depth-KV buffer?
Start with batch * sequence * contributing_layers * n_kv_heads * head_dim * bytes_per_element * 2, where the final factor is for K and V. That estimate is only the retained buffer; the attention call still has to read it. Put it next to Memory-budget anatomy before turning MoDA on for long-context runs.
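Which checked-in files explain the three features fastest?
The smallest checked-in explainer pack: the MoD routing surface sample, the MoDr router bookkeeping sample, and the Shared-block MTP sample.
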
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

PP

Pipeline parallelism cuts the model by depth — each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches — you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

detached

A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…