MoD, MoDA, and MTP: Dynamic Depth and Multi-Token Heads
How we allocate compute per layer with Mixture-of-Depths, cross-attend across layers with MoDA, and train multi-token prediction heads that double as a draft source for self-speculative decoding.

Three features in the hybrid stack all sit at the "what does each layer do for each token" layer of the design: Mixture-of-Depths (MoD) decides which tokens a layer actually processes, MoDA lets attention heads reach across layer depth, and Multi-Token Prediction (MTP) trains a shared block to predict several future tokens per step. They overlap enough that people conflate them, but they are genuinely different.
If these terms are new
- MoD means Mixture-of-Depths: a router decides whether a token should pay for a given block.
- MoDA means a depth-aware attention path where a layer can read K/V from earlier layers at the same position.
- MTP means multi-token prediction: an auxiliary head predicts several future tokens from one main forward pass.
- GateSkip is the MoD mode that keeps all tokens on the main path but gates the block contribution softly instead of gathering and scattering token subsets.
- roll-and-mask means keeping tensor shapes fixed while shifting ids and labels left and masking the invalid tail with ignore_index.
The smallest checked-in explainer pack is the MoD routing surface sample, MoDr router bookkeeping sample, and Shared-block MTP sample.
Why this stack cares about this
Training compute is not uniform across tokens. A whitespace token, an opening
brace, or a common identifier like i does not deserve the same budget as a
function-name token inside a long definition or an if condition in a dense
code region. MoD is the routing answer to that observation.
MoDA is a different axis. Instead of "do I run this block for this token", it asks "can my attention heads reach keys and values from earlier layers at the same position?" That gives the stack cross-depth context with relatively small parameter growth.
MTP is the third axis. At train time it is an auxiliary loss head that predicts multiple future tokens through a shared block reused K times. At inference time the same weights can become a drafter for self-speculative decoding, although the current public inference lane still keeps that integration disabled.
What we built
Mixture-of-Depths routing
The MoD implementation is large, but the first-touch contract is simple. The checked-in MoD routing surface sample keeps the real split visible:
- topk is the original gathered-token MoD path
- threshold is the static-shape-friendly compare-against-a-threshold path
- gateskip keeps all tokens on the path and gates the block contribution softly
That sample also exposes the most important operational distinction directly:
topk and threshold are gather/scatter modes, while gateskip is the
all-tokens lane. That is why gateskip remains the default shipped mode.
That split lines up with the external literature. The original
Mixture-of-Depths paper treats sparse depth as a top-k token-capacity decision
per layer, while newer residual-gating work such as GateSkip keeps all tokens
on the main path and learns how much of the block output to retain instead of
compacting token subsets first. That makes gateskip feel less like a minor
variant of gathered-token MoD and more like a different operating point.
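
The difference between the gathered-token lane and the all-tokens lane can be sketched in a few lines. This is a minimal illustration, not the checked-in sample: the function names `mod_topk` and `mod_gateskip`, the sigmoid gate, and the residual form `x + gate * block(x)` are assumptions chosen for clarity, and the threshold mode is left out because the article does not spell out its mechanics.

```python
import torch


def mod_topk(x, scores, block, capacity):
    """Gathered-token path: only the `capacity` highest-scoring tokens run `block`."""
    # x: (B, T, C), scores: (B, T)
    idx = scores.topk(capacity, dim=1).indices.sort(dim=1).values  # (B, T'), sequence order
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))      # (B, T', C)
    picked = x.gather(1, gather_idx)                               # compacted (B, T', C)
    out = x.clone()                                                # skipped tokens pass through
    out.scatter_(1, gather_idx, picked + block(picked))            # scatter results back
    return out


def mod_gateskip(x, scores, block):
    """All-tokens path: every token runs `block`; a soft gate scales its contribution."""
    gate = torch.sigmoid(scores).unsqueeze(-1)                     # (B, T, 1)
    return x + gate * block(x)                                     # shapes stay (B, T, C)
```

In the top-k sketch the wrapped block only ever sees the compacted (B, T') view, while the gated sketch never changes shape. Any side path that still assumes (B, T) has to be reconciled with the first form and not with the second.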
Two bugs from the ablation history matter here. First, MoD plus relation bias
crashed because the wrapped block saw compacted (B, T') shapes while the
relation-bias path still expected (B, T) shapes. Second, the attention-scorer
bridge used to hardcode dense attention in a way that broke side-output
accounting, and the full-block MoD path dropped the MoDA depth-KV contract.
MoDA depth-KV buffering
MoDA is intentionally small. The live runtime uses a DepthKVBuffer contract:
earlier layers push their K/V views into a depth buffer, and later layers read
the concatenated depth K/V as extra context. The key engineering rule is not
the class name. It is the detach. Depth K/V must stay out of autograd or the
memory story collapses immediately on long context.
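
A minimal sketch of that push/read contract, assuming PyTorch tensors. Only the name DepthKVBuffer comes from the text above; the method names `push` and `read` and the choice to stack along a new depth axis are assumptions made for illustration, not the runtime's actual interface.

```python
import torch


class DepthKVBuffer:
    """Hypothetical sketch: holds detached K/V views pushed by earlier layers."""

    def __init__(self):
        self.keys = []
        self.values = []

    def push(self, k, v):
        # detach() cuts the autograd edge, so backward through a later layer
        # never reaches earlier layers via the depth path.
        self.keys.append(k.detach())
        self.values.append(v.detach())

    def read(self):
        # Requires at least one push. Stacks along a new depth axis:
        # per-layer (B, T, H, D) becomes (B, depth, T, H, D).
        return torch.stack(self.keys, dim=1), torch.stack(self.values, dim=1)
```

Because the stored views are detached, nothing read from the buffer carries gradient history, even when the tensors that were pushed did.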
The practical scaling rule is simple: this buffer grows with sequence length, layer participation, and KV width at the same time. That makes MoDA different from a small per-layer metadata feature. Even when each layer only appends one extra depth slice, the retained K/V grows with the same long-context pressure that already stresses attention-heavy lanes, which is why the packet treated its memory formula as a first-class open question rather than as an implementation footnote.
That is why MoDA stays off by default in shipped presets. It is a useful context signal, but the memory bill is real. The trade reads more honestly next to Memory-budget anatomy and Sequence, Context, and Expert Splits in the Hybrid Stack than it does in a paper-only framing.
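
For a rough sense of scale, the sizing estimate given in the FAQ at the end of this article can be turned into a small helper. The function name and the example configuration are illustrative assumptions; only the product of terms comes from the article.

```python
def depth_kv_bytes(batch, sequence, contributing_layers, n_kv_heads, head_dim,
                   bytes_per_element):
    """Retained depth-KV buffer size in bytes; the trailing 2 covers K and V."""
    return (batch * sequence * contributing_layers * n_kv_heads * head_dim
            * bytes_per_element * 2)


# Hypothetical configuration: 64K context, 8 contributing layers,
# 8 KV heads of width 128, 2-byte (bf16) elements.
size = depth_kv_bytes(batch=1, sequence=65536, contributing_layers=8,
                      n_kv_heads=8, head_dim=128, bytes_per_element=2)
print(size / 1024**3, "GiB")  # 2.0 GiB
```

This counts only the retained buffer. The attention call that reads it adds its own traffic and temporaries on top.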
Multi-Token Prediction and drafting
The MTP module implements the shared-block design reflected directly in the checked-in Shared-block MTP sample. One block is reused K times, step weights are precomputed, and the roll-and-mask path keeps the training graph shape-stable. That is the whole reason the public article can talk about MTP as a practical training feature rather than a paper-only head.
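
Roll-and-mask itself is small enough to show directly. The sketch below assumes PyTorch and a (B, T) label tensor; the helper name and signature are illustrative and not taken from the checked-in sample. The value -1 follows the ignore_index used elsewhere in this article.

```python
import torch

IGNORE_INDEX = -1


def roll_and_mask(labels, shift, ignore_index=IGNORE_INDEX):
    """Shift labels left by `shift` and mask the wrapped-around tail."""
    if not 1 <= shift <= labels.size(-1):
        raise ValueError("shift must be between 1 and the sequence length")
    rolled = torch.roll(labels, shifts=-shift, dims=-1)  # new tensor, same shape
    rolled[..., -shift:] = ignore_index                  # wrapped values are not valid targets
    return rolled


labels = torch.tensor([[10, 11, 12, 13, 14]])
print(roll_and_mask(labels, shift=2))  # tensor([[12, 13, 14, -1, -1]])
```

The output has the same shape as the input for every shift, which is the property that lets a compiler see one graph instead of one graph per prediction depth.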
That shared-block story is also now visible in upstream substrate docs. Megatron-Core's MTP implementation supports a repeated-layer mode where one MTP layer can be applied across multiple prediction depths, and its pipeline guidance puts MTP on the last PP stage because that is where the extra loss is computed. The practical takeaway matches the sample here: shared-block MTP is a real systems choice, not just a toy simplification.
One cast-and-preserve helper turned out to be load-bearing: the vocabulary shard markers used by tensor-parallel output heads must survive the dtype cast inside MTP. Otherwise the fused loss path stops seeing the same ownership contract as the main head.
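
The failure mode is easy to reproduce in isolation: a dtype cast returns a new tensor, and custom Python attributes set on the original do not travel with it. The sketch below assumes PyTorch; the marker names `vocab_parallel` and `shard_dim` are hypothetical placeholders, since the article does not name the real attributes.

```python
import torch

# Hypothetical marker names, used only for illustration.
MARKER_ATTRS = ("vocab_parallel", "shard_dim")


def cast_preserving_markers(weight, dtype, marker_attrs=MARKER_ATTRS):
    """Cast `weight` to `dtype` and copy ownership markers onto the result."""
    out = weight.to(dtype)  # a real cast yields a new tensor without custom attributes
    for name in marker_attrs:
        if hasattr(weight, name):
            setattr(out, name, getattr(weight, name))
    return out
```

With a helper of this shape, whatever downstream code inspects the markers sees the same values on the cast tensor as on the original head weight.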
The drafting path reuses the trained MTP module at inference time. The math is credible. The missing public-safe piece is still the engine work around KV-cache rollback, acceptance sampling, and hardware-specific receipts.
How it lands in production
The public feature surface exposes these paths selectively because the three features have genuinely different cost profiles, and those costs only make sense when you read them against Sequence, Context, and Expert Splits in the Hybrid Stack and Throughput vs quality knobs.
MoD: the checked-in routing sample makes the shipped stance readable:
gateskip is the default, threshold is the static-shape-friendly ablation,
and topk remains available when explicit gathered-token routing is the point
of the experiment.
MoDA: the public configuration surface is intentionally minimal. The main contract is the depth-KV buffer, and the default shipped answer is still "off unless the extra memory cost is justified."
MTP: the production training path uses a dedicated fast MTP layer rather than a plain per-depth head stack. The checked-in shared-block sample grounds the part that matters most for first-touch readers: K depths reuse one block, step weights are explicit, and roll-and-mask preserves one stable graph.
Ablations and what we kept
Snippets from the ablation history that shaped these decisions:
- gateskip sits on the throughput frontier and trains most stably
- MoDA is measurable but expensive on long context
- MTP-as-loss is cheap once the output-head ownership and cast rules are right
- MTP-as-drafter is only interesting once the inference engine catches up
Production checklist
- MoD must default to gateskip in shipped presets. threshold and topk are ablation modes, not defaults.
- MoDA K/V must stay detached from autograd.
- FastMTP requires the output-head ownership markers to survive dtype casts.
- MTP on XLA must use the safe non-fused loss path.
- MTP as drafter is not part of the current public inference contract.
Three features, three axes of impact:
| Feature | What it skips or adds | Default | Dominant cost |
|---|---|---|---|
| MoD | skips low-importance tokens per layer | gateskip | router plus gather or mask path |
| MoDA | cross-depth K/V attention buffer | off | depth-KV memory on long context |
| MTP | K-step future-token auxiliary loss | K=1 or 3 | fused CE plus roll-and-mask |
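FastMTP uses a K-step roll-and-mask loop like this:

    total = 0.0
    for k in range(K):
        h = shared_block(h)                    # one block, reused at every depth
        h_k = roll_and_mask(h, shift=k + 1)    # fixed shapes, invalid tail masked
        loss_k = liger_fused_linear_ce(h_k, lm_head_weight, labels_k)
        total += step_weights[k] * loss_k      # precomputed per-step weights
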
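
A runnable approximation of that loop, under stated assumptions: plain `torch.nn.functional.cross_entropy` on materialized logits stands in for the fused linear cross-entropy kernel, the roll-and-mask is applied to the labels, and `labels` is taken to be the main head's next-token targets. The function name and the step-weight values in the usage lines are hypothetical.

```python
import torch
import torch.nn.functional as F


def shared_block_mtp_loss(h, labels, shared_block, lm_head_weight, step_weights,
                          ignore_index=-1):
    """Weighted sum of K future-token losses, reusing one block at every depth."""
    total = h.new_zeros(())
    for k, weight in enumerate(step_weights):
        h = shared_block(h)                                   # same weights at each depth
        labels_k = torch.roll(labels, shifts=-(k + 1), dims=-1)
        labels_k[..., -(k + 1):] = ignore_index               # mask the wrapped tail
        logits = h @ lm_head_weight.t()                       # (B, T, V), non-fused path
        loss_k = F.cross_entropy(logits.flatten(0, 1), labels_k.flatten(),
                                 ignore_index=ignore_index)
        total = total + weight * loss_k
    return total


# Usage with toy sizes; sequence length must exceed the number of depths.
block = torch.nn.Linear(8, 8)
head = torch.randn(11, 8, requires_grad=True)
loss = shared_block_mtp_loss(torch.randn(2, 6, 8), torch.randint(0, 11, (2, 6)),
                             block, head, step_weights=[1.0, 0.5, 0.25])
```

Materializing the full logits is the memory cost a fused kernel avoids, so this form is only suitable for small sizes or for a backend where the fused path is unavailable.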
Frequently asked questions
What is the shortest way to separate MoD from MoDA?
What does "roll-and-mask" mean in FastMTP?
It means keeping fixed (B, T, C) shapes while the future-token target is represented by rolling ids and labels left, then masking the tail with ignore_index=-1 so XLA and TorchDynamo still see one static graph. The checked-in Shared-block MTP sample is the compact receipt for that contract.
Why use roll-and-mask instead of dynamic slicing for MTP?
Why is gateskip the default shipped MoD mode?
Why is MTP not already exposed as the default public drafter?
Why does shared-block MTP change pipeline layout at all?
How should I size MoDA's depth-KV buffer?
A first estimate is batch * sequence * contributing_layers * n_kv_heads * head_dim * bytes_per_element * 2, where the final factor is for K and V. That estimate is only the retained buffer; the attention call still has to read it. Put it next to Memory-budget anatomy before turning MoDA on for long-context runs.
Which checked-in files explain the three features fastest?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- PP: Pipeline parallelism cuts the model by depth, so each GPU gets a contiguous range of layers. 32 transformer blocks on 8× H200 with PP=8 puts 4 layers on each GPU. Weights and optimizer state live only on the GPU owning that stage; activations flow GPU0→GPU1→... forward and back on the reverse pass. Cost: a pipeline bubble of roughly 1/microbatches, so you need many microbatches per step to amortize. Use PP to scale past a single NVLink island across nodes, because what crosses the wire is tiny stage-boundary activations, not full tensors.
- KV Cache: The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.
- Megatron Core: The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.
- detached: A launch style where the job outlives the caller session and MegaCpp preserves durable run identity, manifests, and later receipt collection instead of relying on live terminal output.
- Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
- Training: A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…