Gated DeltaNet, hyper-connections, and DynamicTanh inside the hybrid stack
How Gated DeltaNet, cross-layer hyper-connections, dynamic tanh normalization, attention residuals, and gated attention compose inside the MegaCpp hybrid stack: what augments, what replaces, and what survived ablation.

The hybrid Mamba 3 plus Transformer interleave is the load-bearing decision for MegaCpp's C++ specialist, but the layer pattern by itself is not what moved the loss curves. What changed the numbers was a small set of cross-cutting residual, normalization, and gating components that we tested between blocks: Gated DeltaNet as a third token-mixer, hyper-connections as a residual-stream replacement, DynamicTanh as a normalization-light experiment, MoonshotAI-style attention residuals, and a learned sigmoid gate on attention output. Some of these landed and some did not. This post walks through what each one does in the MegaCpp training stack, where it sits in the layer interleave, what it replaces or augments, and what we are taking forward into production, as a deeper follow-on to Mamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++.
If these terms are new
- Gated DeltaNet is the recurrent delta-rule token mixer in this cluster: an alternative sequence mixer that can occupy a hybrid slot instead of an attention or Mamba slot.
- Hyper-connections or mHC are multi-stream residual mechanics. They change how block outputs are aggregated across streams; they are not another token mixer.
- DynamicTanh or DyT is a bounded learnable activation used here as a normalization-light replacement experiment, not as a new block family.
- Gated attention here means a learned sigmoid gate on the attention output after projection.
- Attention residuals here mean a residual-path alternative that aggregates prior block summaries, not a change to the attention kernel itself.
The fastest checked-in proof surfaces are DeltaNet + hyper-connection sample, mHC stream residual sample, mHC fused static sample, and mHC branch mixer sample. For the adjacent memory-tier article in the same cluster, continue to M2RNN and Engram.
Why MegaCpp cares about this
The frontier-architecture literature converged in 2025 and 2026 on a small set of recurring tricks: linear-attention sequence mixers as a cheap alternative to softmax attention, multi-stream residuals as a remedy for gradient-flow degeneracies in deep stacks, and learned bounded activations such as DyT and gated attention as cures for the attention-sink and massive-activation pathologies that show up around 16K-context training. We reproduced or ported every one of these in the MegaCpp training stack, then ablated them at both small dense scale and the production hybrid shape. We did this because reading papers is cheap and reproducing them is the only honest way to know which ones are real for our shape, optimizer, and corpus. The verdicts are not what the abstracts predicted, and they also connect directly to the sink-mitigation story in Long context and attention sinks and the broader attention-boundary discussion in Attention validity and structure.
What we built in the MegaCpp training stack
Gated DeltaNet is our drop-in alternative to both standard causal attention and Mamba-style sequence layers, exposing the same surrounding contract so the layer interleave can swap a D slot in for an attention or Mamba slot without touching the rest of the block. The math is the gated delta rule from the Gated DeltaNet paper [Gated Delta Networks, Yang et al.] and the OLMo-Hybrid implementation [OLMo 2 Hybrid, Allen AI]: a recurrent state S_t = g_t * S_{t-1} + beta_t * k_t outer (v_t - S_{t-1} @ k_t) with output o_t = q_t @ S_t. On CUDA we route to a fused Triton implementation; off CUDA we keep two reference paths, a per-timestep recurrence and a chunked version that splits the time loop into smaller subgraphs so XLA tracing stays tractable.
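The recurrence above is easiest to check against a plain per-timestep loop. The following is a minimal NumPy sketch of such a loop, not the MegaCpp reference path: the function name, the single-head unbatched shapes, and the (d_k, d_v) state orientation are assumptions made for illustration. It follows the update as written in the text; published implementations may order the decay and the prediction-error step differently.

```python
import numpy as np

def gated_delta_recurrence(q, k, v, g, beta):
    """Per-timestep loop for S_t = g_t*S_{t-1} + beta_t * k_t outer (v_t - pred_t).

    q, k: (T, d_k); v: (T, d_v); g, beta: (T,) decay and write strength per step.
    The state S is stored as (d_k, d_v), so the prose "S @ k" is computed as k @ S.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        pred = k[t] @ S                      # value the old state returns for key k_t
        S = g[t] * S + beta[t] * np.outer(k[t], v[t] - pred)
        out[t] = q[t] @ S                    # read the updated state with the query
    return out, S

# Same unit-norm key written twice with different values.
k = np.array([[1.0, 0.0], [1.0, 0.0]])
v = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
out, S = gated_delta_recurrence(q=k, k=k, v=v, g=np.ones(2), beta=np.ones(2))
```

With g = 1, beta = 1, and a unit-norm key, the second write replaces the first value stored under that key instead of adding to it. That overwrite is the delta-rule behavior that separates this mixer from a purely additive linear-attention state.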
The layer fuses six projections, runs a depthwise causal convolution on Q, K, and V with kernel size 4, computes the log-space gate g = -exp(A_log) * softplus(A + dt_bias), applies a doubled sigmoid beta in the negative-eigenvalue variant, and ends with an output gate fed through fused RMSNorm-gating or a pure PyTorch fallback before the output projection. It supports per-document boundary resets from document IDs, which is what lets us pack the training corpus without the recurrence bleeding state across document boundaries. Reader-first version: doc_ids are the inspectable packed-row labels, while cu_seqlens is the compact boundary form the recurrent kernel actually consumes. That conversion is the reset contract, not optional packing trivia.
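The gate arithmetic in the layer description can be sketched in a few lines of plain Python. The scalar helpers below are illustrative only: the names are hypothetical, real layers compute these per head over tensors, and reading the doubled sigmoid as a write strength in (0, 2) is an assumption based on the negative-eigenvalue wording.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    # Stable for large |x| in either direction.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def log_decay(a_log, a, dt_bias):
    """g = -exp(A_log) * softplus(a + dt_bias); always negative, so exp(g) is in (0, 1)."""
    return -math.exp(a_log) * softplus(a + dt_bias)

def write_strength(b, allow_negative_eigenvalues=True):
    """beta = sigmoid(b), or 2 * sigmoid(b) in the doubled variant.

    For a unit-norm key, I - beta * k k^T has eigenvalue 1 - beta along k,
    which only becomes negative when beta can exceed 1.
    """
    s = sigmoid(b)
    return 2.0 * s if allow_negative_eigenvalues else s

decay = math.exp(log_decay(a_log=0.0, a=0.3, dt_bias=0.1))
```

Keeping the gate in log space means the per-step decay is recovered with one exp and can be accumulated across a chunk by summation rather than repeated multiplication.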
That slot-compatibility is the same block-contract idea used throughout Mamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++. It also has to be read with Packed rows as the real training contract: the recurrent slot is only safe when packed-document edges stay explicit all the way through doc_ids -> cu_seqlens.
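The doc_ids -> cu_seqlens conversion can be sketched as follows. This is a minimal pure-Python version under one stated assumption: each document occupies a contiguous run inside the packed row, so a boundary exists wherever the ID changes. The function name is hypothetical and is not the MegaCpp helper.

```python
def doc_ids_to_cu_seqlens(doc_ids):
    """Turn per-token document IDs into cumulative boundary offsets.

    Assumes each document is a contiguous run of equal IDs in the packed row.
    The result starts at 0 and ends at len(doc_ids).
    """
    if not doc_ids:
        return [0]
    cu = [0]
    for i in range(1, len(doc_ids)):
        if doc_ids[i] != doc_ids[i - 1]:
            cu.append(i)          # a new document starts at position i
    cu.append(len(doc_ids))
    return cu

def reset_positions(cu_seqlens):
    """Interior offsets: positions where recurrent state must be reset."""
    return cu_seqlens[1:-1]

cu = doc_ids_to_cu_seqlens([7, 7, 7, 9, 9, 12])   # -> [0, 3, 5, 6]
```

The interior offsets are exactly the positions where a recurrent mixer has to zero its state, which is why the conversion carries the reset contract rather than being a packing detail.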
Hyper-connections are our port of cross-layer manifold-constrained hyper-connections from [mHC: Manifold-Constrained Hyper-Connections, Xie et al., DeepSeek-AI]. They replace the single residual stream with n_streams parallel streams (we use four) and insert three small learned matrices around each block: an aggregation matrix H_pre that reads the streams down to one hidden state, a distribution matrix H_post that writes the block output back, and a Sinkhorn-constrained mixing matrix H_res that re-mixes the streams. We ship both a static-logit variant with per-layer parameters and a dynamic variant in which H depends on the current activations. Both initialize to identity-like behavior so a step-zero model is mathematically equivalent to a single-stream residual. The hot path is fused because at 52 layers and four streams the reference two-kernel mix-plus-distribute path was one of the largest GPU time buckets. The shipped fast path is the fused forward and residual-update surface, which is where the wall-clock win paid for itself at the production hybrid shape; backward stays on the explicit PyTorch formulas for now. That makes hyper-connections part of the same repeated-hot-path accounting discussed in Training speed by feature: which parts of the stack really move step time.
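The read, write, and re-mix roles of the three matrices can be shown for a single token. The NumPy sketch below is an illustration under assumptions, not the fused path: H_pre and H_post are reduced to length-n vectors, the function names are hypothetical, and the initialization shown is only one identity-like choice that makes the step-zero equivalence easy to verify.

```python
import numpy as np

def mhc_block_step(streams, block_fn, h_pre, h_post, h_res):
    """One block with multi-stream residuals, for a single token.

    streams: (n, D) parallel residual streams
    h_pre:   (n,)   aggregation weights, reads the streams down to one state
    h_post:  (n,)   distribution weights, writes the block output back
    h_res:   (n, n) stream re-mixing matrix
    """
    x = h_pre @ streams                      # (D,) block input
    y = block_fn(x)                          # (D,) block output
    return h_res @ streams + np.outer(h_post, y)

def identity_like_init(n):
    # Average on read, broadcast on write, no re-mixing.
    return np.full(n, 1.0 / n), np.ones(n), np.eye(n)

n, D = 4, 6
x0 = np.arange(D, dtype=float)
streams = np.tile(x0, (n, 1))                # every stream starts as a copy of x0
block = lambda h: 2.0 * h + 1.0              # stand-in for an attention or MLP block
h_pre, h_post, h_res = identity_like_init(n)
new_streams = mhc_block_step(streams, block, h_pre, h_post, h_res)
```

Under this initialization every stream ends up equal to x0 + block(x0), which is what a single-stream residual would produce. The streams only start to differ once the learned matrices move away from their initial values.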
The constraint is not decorative. The paper's point is that widening the residual graph only stays useful if the mixer preserves identity-like behavior instead of turning into an arbitrary gain stage, and the checked-in mHC branch mixer sample shows the public-safe local version of that rule: row-and-column normalization for three-or-more branches, plus a plain softmax fallback when there are only two.
If you want the readable routing math instead of only the deployment knobs, mHC branch mixer sample is the checked-in surface where pooled branch scoring, Sinkhorn normalization, and the two-branch softmax fallback stay explicit.
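The normalization rule described above can be sketched directly. This is a hedged NumPy illustration rather than the checked-in sample: the function names, the square (n, n) logit shape, and the default iteration count, temperature, and epsilon are assumptions, and the sketch runs in float64 for simplicity.

```python
import numpy as np

def sinkhorn(logits, iters=5, temperature=1.0, eps=1e-6):
    """Alternate row and column normalization toward a doubly stochastic matrix."""
    m = np.exp((logits - logits.max()) / temperature)   # positive entries
    for _ in range(iters):
        m = m / (m.sum(axis=1, keepdims=True) + eps)    # rows sum to about 1
        m = m / (m.sum(axis=0, keepdims=True) + eps)    # columns sum to about 1
    return m

def softmax_rows(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def branch_mixing_matrix(logits, **sinkhorn_kwargs):
    """Sinkhorn for three or more branches, plain row softmax for two."""
    if logits.shape[0] >= 3:
        return sinkhorn(logits, **sinkhorn_kwargs)
    return softmax_rows(logits)

logits3 = np.array([[0.5, -0.2, 0.1],
                    [0.0, 0.3, -0.4],
                    [0.2, 0.1, 0.0]])
mix3 = branch_mixing_matrix(logits3, iters=20)
mix2 = branch_mixing_matrix(np.array([[0.3, -0.1], [0.0, 0.2]]))
```

Because both row sums and column sums are pushed toward one, the mixer can redistribute mass between branches but cannot scale the total up or down, which is the identity-preserving property the constraint is there to protect.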
DynamicTanh implements the DyT layer from [Transformers without Normalization, Zhu et al.]: y = gamma * tanh(alpha * x) + beta with a learnable scalar alpha and per-channel affine. We evaluated four modes: full replacement, an equivalent alias, a selective mode where DyT is used only on attention-related normalization sites while MLP norms stay on RMSNorm, and a hybrid wrapper that runs RMSNorm followed by DyT. The factory is purpose-aware across attention blocks, MLP blocks, embeddings, and final layers, which is what makes selective mode possible. We did this because preliminary ablations suggested DyT helped attention norms and hurt MLP norms, so the decision had to stay coupled to the attention-path reasoning rather than treated as an isolated normalization swap.
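The DyT formula and the purpose-aware selection can be sketched together. The NumPy code below is an assumption-laden illustration: the class and function names, the mode strings, the initial alpha of 0.5, and the choice to keep embedding and final sites on RMSNorm in selective mode are all hypothetical, and the alias mode is omitted.

```python
import numpy as np

class DynamicTanh:
    """y = gamma * tanh(alpha * x) + beta with scalar alpha and per-channel affine."""

    def __init__(self, dim, alpha_init=0.5):
        self.alpha = float(alpha_init)   # one learnable scalar for the whole layer
        self.gamma = np.ones(dim)        # per-channel scale
        self.beta = np.zeros(dim)        # per-channel shift

    def __call__(self, x):
        return self.gamma * np.tanh(self.alpha * x) + self.beta

def norm_kind(site, mode):
    """Pick a normalization per site: 'attention', 'mlp', 'embedding', or 'final'."""
    if mode == "full":
        return "dyt"
    if mode == "selective":
        return "dyt" if site == "attention" else "rmsnorm"
    if mode == "hybrid":
        return "rmsnorm_then_dyt"
    if mode == "off":
        return "rmsnorm"
    raise ValueError(f"unknown DyT mode: {mode}")

layer = DynamicTanh(dim=8)
y = layer(np.linspace(-1e6, 1e6, 8))
```

Unlike RMSNorm, the layer computes no statistics over the input: the output is bounded by tanh alone, so its behavior depends entirely on how well the single alpha tracks the activation scale.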
The failure mode also turned out to be more specific than "tanh bad, RMSNorm
good." The unstable parameter was the single global alpha: one scalar sees
gradient from the whole B x T x D activation volume, so full DyT replacement
quietly turns one normalization knob into a whole-stack accumulator. That is
why DyT stayed useful as an attention-path probe next to
Long context and attention sinks while failing
the broader production filter.
That is also why Attention validity and structure
keeps DyT behind the gate-first recommendation order: the next useful question
was sink mitigation, not a whole-stack norm swap.
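The accumulation argument for alpha can be made concrete by writing out the parameter gradients of y = gamma * tanh(alpha * x) + beta. The NumPy sketch below uses hypothetical function names and small illustrative shapes; it only shows which axes each gradient sums over and is not a reproduction of the instability itself.

```python
import numpy as np

def dyt_forward(x, alpha, gamma, beta):
    return gamma * np.tanh(alpha * x) + beta

def dyt_param_grads(x, alpha, gamma, upstream):
    """Gradients of sum(upstream * y) with respect to alpha and gamma.

    x, upstream: (B, T, D); gamma: (D,).
    """
    t = np.tanh(alpha * x)
    # alpha is a scalar: its gradient is a single sum over all B*T*D elements.
    grad_alpha = np.sum(upstream * gamma * x * (1.0 - t * t))
    # gamma is per-channel: each entry only sums over the B*T positions.
    grad_gamma = np.sum(upstream * t, axis=(0, 1))
    return grad_alpha, grad_gamma

rng = np.random.default_rng(0)
B, T, D = 2, 3, 4
x = rng.normal(size=(B, T, D))
upstream = rng.normal(size=(B, T, D))
gamma, beta, alpha = np.ones(D), np.zeros(D), 0.5
grad_alpha, grad_gamma = dyt_param_grads(x, alpha, gamma, upstream)
```

Every activation in the layer contributes a term to the one scalar gradient, while each gamma entry only aggregates over batch and time. Replacing every norm site with DyT repeats that pattern at each site, which is the accumulator effect described above.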
The Block AttnRes variant is the MoonshotAI-style alternative to multi-stream residuals. Instead of n parallel streams, it keeps a list of N << L block-summary representations and replaces the standard residual sum with a softmax attention pass: each sub-layer computes its input via softmax(w · rms_norm(B)) · B over completed block representations plus the current partial sum, where w is a learned per-layer pseudo-query initialized to zero so the step-0 weights are uniform and the sub-layer input starts as a plain average of the stored representations, the closest analogue of the standard residual sum. It supports MoonshotAI's dual-application design (separate pseudo-queries before attention and before MLP, so block_size counts both sub-layer ops) and has a Full mode that attends over every prior residual state at O(LD) memory. Memory is O(ND) in block mode; the implementation pre-allocates a (max_blocks+1, B, T, D) buffer and detaches stored block reps so the backward graph does not span across blocks, which is essential for compatibility with our gradient-checkpointing policy.
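The aggregation step softmax(w · rms_norm(B)) · B can be sketched in NumPy. The function names and the epsilon are assumptions, and because NumPy has no autograd the detach behavior described above is not represented; only the forward arithmetic is shown.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attn_res_input(blocks, w):
    """Sub-layer input as a softmax-weighted mix of stored block representations.

    blocks: (N, B, T, D) completed block summaries plus the current partial sum
    w:      (D,)         learned per-layer pseudo-query
    """
    scores = np.einsum("nbtd,d->nbt", rms_norm(blocks), w)
    scores = scores - scores.max(axis=0, keepdims=True)     # stable softmax over N
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)
    return np.einsum("nbt,nbtd->btd", weights, blocks)

rng = np.random.default_rng(0)
blocks = rng.normal(size=(3, 2, 4, 8))
x_in = attn_res_input(blocks, w=np.zeros(8))                # zero-initialized pseudo-query
```

With w at zero every score is equal, so the output is the plain mean over the N stored representations. The weights only become input-dependent once w moves away from zero.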
That forward story was friendlier than the backward one. Zero-initialized
pseudo-queries recover a uniform residual average at step 0, but the earliest
stored block states still collect gradient from every later block that looks
back into the buffer. That mismatch is the practical reason AttnRes could read
like a plausible residual alternative in paper terms and still fail MegaCpp's
short production screen.
The gated-attention variant is the simplest of the bunch: a learned sigmoid gate applied after c_proj, with two modes (headwise, one scalar per head, and perchannel, one scalar per (head, head_dim) pair). Gate parameters initialize to zero so sigmoid(0) = 0.5, meaning every head starts at half strength and learns to open or close. The point of the gate is sink mitigation: heads that latch onto the BOS sink can learn to close themselves rather than dragging the average representation. We replicated the gate across the standard, DSA, and clustered-sparse attention paths so --gated_attention is a single switch regardless of which attention backend is active for a given layer, keeping the control surface aligned with the sink story rather than fragmented per backend.
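The two gate modes differ only in parameter shape. The NumPy sketch below is hypothetical in its function names and in one placement detail: it applies the gate to a tensor still laid out as (batch, time, heads, head_dim), and it does not claim to show where that tensor sits relative to c_proj in MegaCpp.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gate(n_heads, head_dim, mode):
    """Zero-initialized gate logits, so every gate starts at sigmoid(0) = 0.5."""
    if mode == "headwise":
        return np.zeros((n_heads, 1))          # one scalar per head
    if mode == "perchannel":
        return np.zeros((n_heads, head_dim))   # one scalar per (head, head_dim) pair
    raise ValueError(f"unknown gate mode: {mode}")

def apply_gate(attn_out, gate_logits):
    """attn_out: (B, T, H, Dh). The gate broadcasts over batch and time."""
    return attn_out * sigmoid(gate_logits)

attn_out = np.ones((2, 3, 4, 8))
half = apply_gate(attn_out, init_gate(4, 8, "headwise"))

# A head that has learned to close itself contributes almost nothing.
closed = init_gate(4, 8, "headwise")
closed[0] = -20.0
gated = apply_gate(attn_out, closed)
```

Because the gate is a broadcast multiply on the attention output, it does not depend on which attention backend produced that output, which is what lets one switch cover every path.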
The cluster split is deliberate. M2RNN and Engram cover the memory-tier additions around the same hybrid backbone. This article stays on the alternative token-mixer, residual, and normalization-side experiments.
How it lands in production
The production MegaCpp package consumes the hybrid stack through a Megatron MambaStack whose layer types are MAMBA | ATTENTION | MLP | MOE | GDN. In production, the GDN symbol resolves to upstream Megatron GatedDeltaNet, with Megatron-native parallel projections and output normalization. So GDN is being lifted from the upstream contract as-is. We are not forking the kernel; we are mapping the same layer-type symbol onto the production substrate. The recurrence kernel is the same Triton path the MegaCpp training stack uses. Only the surrounding block scaffolding becomes Megatron-native.
Hyper-connections are the opposite story. The fused mHC kernels and Sinkhorn fp32 normalization are not yet part of upstream Megatron, so MegaCpp keeps mHC behind a fail-closed configuration surface. Today that surface carries four streams, five Sinkhorn iterations, a temperature of 1.0, epsilon 1e-6, two dynamic modes, a fused-ops toggle, and a recompute-group-size knob, all validated and frozen. Practically, the production stack ships with mHC enabled at inherited preset defaults, the fused kernels are imported when the host is CUDA and fused operations are enabled, and the dynamic mode is wired but not on by default while we settle the optimizer interaction.
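A fail-closed configuration surface of this kind can be sketched as a frozen dataclass that rejects invalid values at construction time. Everything below is a hypothetical illustration: the class name, the field names other than n_streams, sinkhorn_iters, layer_indices, and fused_ops, the single dynamic flag standing in for the two dynamic modes, and the default recompute group size are assumptions, not the MegaCpp schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class MhcConfig:
    n_layers: int
    layer_indices: Tuple[int, ...]       # explicit list of layers that use mHC
    n_streams: int = 4
    sinkhorn_iters: int = 5
    sinkhorn_temperature: float = 1.0
    sinkhorn_eps: float = 1e-6
    dynamic: bool = False                # static path is the default
    fused_ops: bool = True
    recompute_group_size: int = 1        # assumed default, for illustration only

    def __post_init__(self):
        if self.n_streams <= 1:
            raise ValueError("n_streams must be > 1")
        if self.sinkhorn_iters <= 0:
            raise ValueError("sinkhorn_iters must be positive")
        if self.sinkhorn_temperature <= 0 or self.sinkhorn_eps <= 0:
            raise ValueError("temperature and eps must be positive")
        if self.recompute_group_size <= 0:
            raise ValueError("recompute_group_size must be positive")
        if len(set(self.layer_indices)) != len(self.layer_indices):
            raise ValueError("layer_indices contains duplicates")
        bad = [i for i in self.layer_indices if not 0 <= i < self.n_layers]
        if bad:
            # Fail closed: an out-of-range index is an error, never silently dropped.
            raise ValueError(f"layer_indices out of range: {bad}")

    def use_fused_kernels(self, on_cuda: bool) -> bool:
        # Fused kernels are only selected when requested and the host is CUDA.
        return self.fused_ops and on_cuda

cfg = MhcConfig(n_layers=52, layer_indices=(0, 1, 2))
```

Freezing the dataclass means a validated configuration cannot be mutated later in the run, so the values that were checked are the values that are used.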
The checked-in DeltaNet + hyper-connection sample helps keep those jobs separate. Its choose_hybrid_layer(...) helper can switch a layer onto deltanet or mamba, but uses_mhc only turns on around the attention-family lane. That is the useful local distinction: GDN changes which token mixer occupies the slot, while mHC changes how hidden streams are carried across block boundaries.
That static-versus-dynamic split matters more than the raw flag list suggests. The checked-in mHC stream residual sample is the quickest way to see it: static mHC means the multi-stream residual contract is fixed once the layer surface is chosen, while dynamic mHC means the routing weights are recomputed from the live hidden state and therefore carry extra optimizer and recompute pressure. The compact DeltaNet + hyper-connection sample and mHC fused static sample keep the same distinction visible in smaller surfaces. That is also why production keeps the static fused path as the receipted default and treats dynamic mHC as a narrower opt-in rather than "the same feature, but smarter." If you want the budget view beside the residual-design view, Memory budget anatomy is the right companion: the dynamic route is not scary because of one extra tensor alone, it stays narrow because it widens the live routing and recompute surface exactly where the fused static path is supposed to stay predictable.
DynamicTanh and AttnRes are not landing in the production stack. The ablation killed them, and the production path keeps RMSNorm (WrappedTorchNorm) on every norm site. Gated attention is a research-repo-only feature today: our production attention is the upstream Megatron self-attention with FA4 routing, and the sigmoid gate would be applied after the linear projection rather than inside the attention kernel. That makes it small enough to add as a wrapper if a future regression motivates it, but not worth carrying without that motivation.
The kernel boundary is roughly: GDN's recurrence is a Triton kernel today and stays one. The mHC mix/distribute fused forward ops are Triton today, while backward still uses the explicit PyTorch chain; we have a Pallas port roughly sketched for the TPU path but it is not what we ship on the GPU side. DyT is two pointwise ops, no kernel work needed. AttnRes is small einsums plus a softmax, also no kernel work needed. The gated-attention gate is a single broadcast multiply.
Ablations and what we kept
The 100-step AdamW sweep at the small dense ablation shape is the table that decided this. The dense Transformer baseline finished at loss 5.43 at 508 tok/s. The Gated DeltaNet variants ranked roughly as follows, best first: dense baseline, then GDN-6 hybrid (loss 6.67), then GDN-no-Mamba (loss 6.88), then Mamba-majority hybrid (loss 7.06). At 100 steps and 4K context none of the hybrids beat dense, which is the expected story: hybrid wins are a long-context phenomenon. The GDN runs all converged cleanly. The full-stack run (every feature on) finished at loss 6.84 with gnorm 0.91, within noise of the dense reference. The two ablations that did not finish cleanly were DyT (loss 8.02 with gnorm 241, marked unstable) and AttnRes (loss 25.91 with gnorm 18.9M, diverged). Both got cut from the tree on this single run, which is the same keep-or-cut discipline behind Throughput vs quality knobs.
Two longer-running anecdotes round out the story. First, mHC at the production hybrid depth produced something the paper did not predict: the original implementation cost about a third of step time at 52 layers with four streams, which is why the fused kernels exist. Once fused, mHC's overhead became acceptable. The reported paper improvement did not reproduce as such on our corpus; we kept mHC because it composes cleanly with the rest of the stack and because removing it would require re-tuning every dependent preset, not because it is a giant loss win. Second, gated attention is a sink-mitigation tool that pairs naturally with the long-context work. It broadened coverage for the sparse-attention family, but it stayed an opt-in research fallback rather than becoming a production preset.
The GDN result is just as specific. The research comparison sharpened the crossover story rather than weakening it: below the long-context regime, dense attention still tends to win on simplicity and short-horizon fit, but around the 8K-context boundary the linear mixer starts collecting the scaling dividend that justifies keeping GDN slots in the hybrid pattern. That is why the production story is "reserve GDN for the long-context lane," not "replace attention everywhere."
Dynamic mHC is the same kind of narrow decision. The extra routing tensors are not the real blocker by themselves; the bigger cost is sequential Sinkhorn work, extra recompute pressure, and a wider numerical surface for the optimizer. Static mHC kept the topology benefit while avoiding that per-token control cost, which is why it became the default even though dynamic mode stayed available for research.
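The cost difference between the two modes shows up in how often the Sinkhorn normalization runs. The NumPy sketch below is a hypothetical illustration: the function names, the mean-pooled hidden state, and the linear projection w_dyn that offsets the logits are assumptions about what "dynamic" could look like, not the MegaCpp formulation.

```python
import numpy as np

def sinkhorn(logits, iters=5, eps=1e-6):
    m = np.exp(logits - logits.max())
    for _ in range(iters):
        m = m / (m.sum(axis=1, keepdims=True) + eps)
        m = m / (m.sum(axis=0, keepdims=True) + eps)
    return m

def static_mix(streams, logits):
    """streams: (T, n, D). One shared (n, n) mixer, normalized once per call."""
    h = sinkhorn(logits)
    return np.einsum("ij,tjd->tid", h, streams)

def dynamic_mix(streams, logits, w_dyn):
    """Per-token mixer: the logits are offset by a projection of the pooled state.

    w_dyn: (D, n * n). Sinkhorn runs once per token instead of once per call.
    """
    n_tokens, n, _ = streams.shape
    pooled = streams.mean(axis=1)                          # (T, D)
    offsets = (pooled @ w_dyn).reshape(n_tokens, n, n)     # (T, n, n)
    out = np.empty_like(streams)
    for t in range(n_tokens):
        out[t] = sinkhorn(logits + offsets[t]) @ streams[t]
    return out

rng = np.random.default_rng(0)
streams = rng.normal(size=(3, 4, 8))
logits = 2.0 * np.eye(4)
same = dynamic_mix(streams, logits, w_dyn=np.zeros((8, 16)))
```

With w_dyn at zero the dynamic path reproduces the static result exactly, but it still pays for one Sinkhorn loop per token. Each of those loops is sequential in its iterations, and that is the per-token control cost the static path avoids.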
The ablation history is explicit: the GDN integration is treated as stable, the DyT result is marked unstable in the same accelerator sweep, the AttnRes result is marked diverged, and the mHC fused-kernel note records the forward-side savings while keeping the shipped backward path explicit.
Production checklist
- The hybrid layer interleave declares which slots are `GDN` and which are `MAMBA`, `ATTENTION`, `MLP`, or `MOE` via the `--hybrid-layer-pattern` upstream argument; production presets bias the GDN slots to depths where attention is not paying for itself.
- mHC requires `n_streams > 1`, positive `sinkhorn_iters`, a fail-closed `layer_indices` list, and a `fused_ops` toggle that defaults to off outside CUDA. The static path is the receipted default; dynamic mode stays behind an explicit opt-in.
- The gated-attention gate is initialized at `sigmoid(0) = 0.5` everywhere and must remain so; checkpoint loaders verify gate-param shape against `(n_head,)` for headwise mode and `(n_head, head_dim)` for perchannel.
- DyT is off in production. If it ever comes back it must enter through the `create_norm_layer(purpose=...)` factory, never by direct `RMSNorm -> DyT` substitution, because the factory's purpose-awareness is what makes the selective and hybrid modes auditable.
- AttnRes is off in production and disabled in the preset registry; the module remains for research exploration only.
- GDN's per-document recurrence reset uses `doc_ids -> cu_seqlens`. Any ingest pipeline change that breaks `doc_ids` monotonicity will silently bleed state across documents.
- The fused mHC mix/distribute kernel has a `MEGACPP_FUSED_MIX_DISTRIBUTE=0` escape hatch that drops to the reference 2-kernel path; keep it as a debug switch and never delete it.
- Sink-related sparse attention paths (DSA, clustered-sparse) must keep the mirrored gate parameter alive: the gate is mathematically a no-op when set to 1.0, but its absence from the state dict will fail strict loads on every checkpoint produced after the gating switch landed.
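The `doc_ids -> cu_seqlens` item in the checklist can be sketched in a few lines. This is a minimal illustration for a single packed row, assuming `doc_ids` is a plain list of integers that is non-decreasing within the row; the function name `doc_ids_to_cu_seqlens` and the choice to raise on a monotonicity violation are assumptions made here for illustration, not the MegaCpp ingest code. Padding conventions are not modeled.

```python
def doc_ids_to_cu_seqlens(doc_ids):
    """Convert per-token document labels for one packed row into
    cumulative sequence offsets.

    Hypothetical helper: assumes doc_ids is non-decreasing inside the row.
    """
    if not doc_ids:
        return [0]
    cu_seqlens = [0]
    for pos in range(1, len(doc_ids)):
        if doc_ids[pos] < doc_ids[pos - 1]:
            # Assumption: fail loudly instead of letting a broken row
            # produce wrong boundaries downstream.
            raise ValueError(f"doc_ids not monotonic at position {pos}")
        if doc_ids[pos] != doc_ids[pos - 1]:
            # A label change marks the first token of a new document.
            cu_seqlens.append(pos)
    # Close the last document at the end of the row.
    cu_seqlens.append(len(doc_ids))
    return cu_seqlens


# Three documents of lengths 3, 2, and 1 packed into one row.
offsets = doc_ids_to_cu_seqlens([0, 0, 0, 1, 1, 2])
```

Here `offsets` is `[0, 3, 5, 6]`: each adjacent pair of offsets delimits one document, which is the form a recurrent kernel can use to reset its state at document edges.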
What survived ablation
The clean way to read the ablation is to separate production defaults from
research-only seams. The production lane kept two things: Gated DeltaNet as the
long-context token-mixer slot, and static four-stream mHC as the cross-layer
residual topology. Everything else either stayed as an opt-in research hook or
was dropped for stability.
| Feature | Where it lives | Final lane | Why it stayed or moved out |
|---|---|---|---|
| Gated DeltaNet (GDN) | hybrid token-mixer slots | kept on the long-context lane | short-context dense attention still wins, but GDN collects the scaling dividend around the 8K crossover |
| mHC, static 4-stream | residual graph at block boundaries | kept as the production default | preserves the topology benefit without per-token routing tensors or sequential Sinkhorn pressure |
| mHC, dynamic routing | residual graph at block boundaries | research-only opt-in | adds a per-token 4x4 routing matrix, sequential Sinkhorn work, and wider optimizer/recompute pressure |
| Gated attention | attention output wrapper | research-only fallback | small enough to keep as a wrapper, but not strong enough to own a production preset |
| DynamicTanh (DyT) | norm replacement experiment | dropped | the global alpha accumulates O(B*T*D) gradient mass and destabilizes quickly |
| AttnRes | residual-path alternative | dropped | zero-init pseudo-queries keep the forward tame but destabilize the backward graph |
Minimal sketch of the kept production lane:
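```python
def production_hybrid_block(streams, *, use_gdn_slot):
    # Static four-stream mHC mix ahead of the token mixer.
    h = mhc_pre_mix_static(streams, n_streams=4)
    # Token mixer: GDN on long-context slots, otherwise attention or Mamba.
    h = gdn(h) if use_gdn_slot else attention_or_mamba(h)
    # MLP applied to the normalized activations.
    h = mlp(wrapped_torch_norm(h))
    # Static mHC mix of the block output with the residual streams.
    return mhc_post_mix_static(streams, h)
```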
That resolves the contradiction in one glance: the production lane did not keep
DyT, AttnRes, or gated attention as defaults. It kept GDN plus static mHC,
and left the other seams either disabled or research-only.
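The DyT row in the table is easier to read with the gradient written out. The sketch below assumes the commonly published DyT form, `gamma * tanh(alpha * x) + beta` with a single scalar `alpha`; whether the MegaCpp module matches that form exactly is an assumption, and the function names are hypothetical. It shows that the gradient with respect to `alpha` is one sum over every activation element, so its worst-case magnitude grows with `B * T * D`.

```python
import math


def dyt_forward(x, alpha, gamma=1.0, beta=0.0):
    """Elementwise DyT over a flat list of activations."""
    return [gamma * math.tanh(alpha * v) + beta for v in x]


def dyt_alpha_grad(x, upstream, alpha, gamma=1.0):
    """Gradient of the loss with respect to the single scalar alpha.

    `upstream` holds dLoss/dy for each element. Every activation
    contributes one term, so the sum runs over all B * T * D elements
    of the flattened activation tensor.
    """
    return sum(
        g * gamma * v * (1.0 - math.tanh(alpha * v) ** 2)
        for v, g in zip(x, upstream)
    )


# Same per-element statistics, ten times more elements.
small = dyt_alpha_grad([0.5] * 100, [1.0] * 100, alpha=0.5)
large = dyt_alpha_grad([0.5] * 1000, [1.0] * 1000, alpha=0.5)
```

In this aligned-sign example `large` is ten times `small`. Real activations have mixed signs that partly cancel, so this is the worst-case scaling rather than a prediction of any particular run.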
Frequently asked questions
Which of these features actually survived into the production-facing stack?
Gated DeltaNet and static mHC survived into the production-facing stack. Dynamic mHC and gated attention remain opt-in research surfaces. DynamicTanh and AttnRes did not survive the production lane. For the kernel and deployment side of the residual path, continue to Multi-Head Cross fused on Blackwell.
Why keep hyper-connections if the paper gains did not reproduce cleanly?
Why is Gated DeltaNet framed as a slot-compatible mixer?
GDN slot into the hybrid pattern without rewriting every surrounding block contract.
Do Gated DeltaNet and mHC compete for the same layer slot?
D changes the token-mixer slot to DeltaNet, while mHC is the cross-layer residual topology around the block boundary. One changes what the layer computes; the other changes how hidden streams are mixed between layers.
How is Gated DeltaNet different from M2RNN or Engram in this cluster?
What does fail-closed mean for mHC in practice?
Why does mHC insist on Sinkhorn-normalized mixing instead of arbitrary learned branch weights?
What actually broke DynamicTanh and AttnRes in the short ablations?
DyT's `alpha` acted like one global accumulator over the whole activation volume, so a small normalization knob became unstable quickly. AttnRes failed on the backward side: zero-initialized pseudo-queries looked harmless in the forward pass but piled too much gradient mass into the earliest buffered states. That is why both were treated as architectural stability issues rather than near-miss tunings.
Why is AttnRes kept out once production mHC owns the cross-layer path?
mHC already owns the multi-stream cross-layer mixing surface, and mHC stream residual sample shows that residual bookkeeping explicitly. Stacking AttnRes on top would not just "add one more idea"; it would put two learned cross-layer routing schemes on the same boundary and make the ablation story much harder to trust.
What is `doc_ids -> cu_seqlens` doing here?
`doc_ids` are the inspectable per-token document labels in the packed row, while `cu_seqlens` is the compact cumulative-length form the recurrent kernel consumes. Converting the first into the second is what resets the recurrence at real document edges instead of bleeding state across adjacent samples. The packing-side explanation lives in Packed rows as the real training contract, and the mask-validity side stays aligned with Attention validity and structure.
What is the actual per-token routing object in dynamic mHC?
It is a per-token `4x4` stream-mixing matrix computed from the live activations and normalized before the streams are mixed. That keeps the raw routing object small, but it also explains why the production decision is about live control flow rather than just parameter count: mHC stream residual sample exposes the `mhc_dynamic` switch, and mHC branch mixer sample shows the Sinkhorn-style normalization that makes dynamic routing more than a plain weighted sum.
If dynamic mHC does not automatically blow the memory budget, why is it still research-only?
Dynamic mHC adds per-token routing work, sequential Sinkhorn normalization, and extra recompute pressure on the hot path, so the operational question is "how wide did the live routing surface become?" rather than just "how many bytes did it add?" Memory budget anatomy and Multi-Head Cross fused on Blackwell: from reference einsum to Triton are the two adjacent receipts for that boundary.
Which checked-in files should I inspect first if I want the local proof surfaces?
Where do GateSkip and FlexiDepth fit relative to this article?