M2RNN and Engram: The Memory Subsystem Inside the Hybrid
Where matrix-state RNN layers, causal n-gram Engram branches, and the learned concept bank fit inside our Mamba 3 + Transformer hybrid — and which pieces remain useful in the public memory stack.

The hybrid stack described here is not "a transformer with some Mamba sprinkled in". It is a deliberate memory hierarchy: Mamba 3 SSM blocks for bulk long-range recurrence, attention blocks for sharp content-addressed lookups, M2RNN layers for matrix-valued recurrent state, an Engram branch that runs local causal n-gram features in parallel with the residual stream, and an optional concept-retrieval block that reads from a learned bank of global patterns. Here M2RNN means a matrix-state recurrent mixer rather than a generic RNN cell, and Engram means the local n-gram memory branch we wire into the model block, not a vague catch-all memory label. This post is about the three of those that are not vanilla attention or Mamba: M2RNN, Engram, and the concept bank.
It fits best beside Mamba 3 + Transformers, Mamba3 kernel journey, and packed rows as the real training contract, because the memory stack only stays honest when its kernel path, hybrid schedule, and packed-document contract are read together. If you want the checked-in surfaces first, start with the M2RNN mixer spec sample, the Engram branch sample, and the Engram + mHC stack sample.
If these memory terms are new
- M2RNN here means a matrix-state recurrent mixer: per-head state is a small matrix rather than a vector-only recurrent state.
- Engram here means the local causal n-gram side branch, not an external retrieval database.
- CBlock here means the optional concept-retrieval tier, not a generic industry-standard block name.
- mHC means multi-stream hyper-connections layered around blocks; it is a residual-stream mechanism adjacent to this memory stack, not the same thing as Engram or M2RNN.
The closest related topic hubs in this cluster are Mamba 3 + Transformers for the backbone split and Gated DeltaNet, hyper-connections, and DynamicTanh for the alternative mixer and residual-side experiments.
Why this setup matters
C++ training data has several memory demands that standard attention does not satisfy cheaply. Long files need linear-in-length recurrence, which Mamba handles. Local n-gram patterns (operator idioms, template boilerplate, four-token sequences that are effectively copied everywhere) want a cheap causal smoother, not a full attention head. Cross-document abstractions ("this is a linked-list insert regardless of tokens") want a global pattern bank you can cross-attend to without paying T x T. M2RNN occupies a different niche again: a matrix-valued recurrent state that can store more per step than a vector SSM without breaking into full attention.
None of these are critical-path on their own. What matters is that layering them is cheap — the features are additive to the residual and most of them are zero-initialised — so ablations are clean, and the wins stack.
What this memory stack includes
M2RNN
The research M2RNN module implements the layer. In plain language, it keeps a small matrix per head as recurrent state, so each token update can write more structure than a vector-only state-space mixer usually can. The core recurrence is straightforward:
z_t = tanh(h_{t-1} @ W + k_t ⊗ v_t)
h_t = f_t * h_{t-1} + (1 - f_t) * z_t
y_t = q_t @ h_t
The state h is a matrix, not a vector — at head dims K=64, V=16 the per-head state is 64x16. That extra capacity versus a vector SSM is the whole point. _softplus_decay_gate produces the forget factor f in SSM style (exp(-A * softplus(x + dt_bias))), and the per-layer setup is input_proj then optional causal_conv1d, then split into q, k, v plus the decay and output gates, then recurrence, residual v*D, output gate, norm, output_proj.
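To make the recurrence concrete, here is a minimal sequential sketch in PyTorch. It is a hypothetical reference loop written for this article, not the checked-in `_torch_m2rnn_forward`: the tensor layouts, the per-head `W` of shape (V, V), and the function names are assumptions. At K=64, V=16 each head carries 64 x 16 = 1024 state entries.

```python
import torch
import torch.nn.functional as F


def softplus_decay_gate(x, A, dt_bias):
    """SSM-style forget factor f = exp(-A * softplus(x + dt_bias)); lies in (0, 1] for A >= 0."""
    return torch.exp(-A * F.softplus(x + dt_bias))


def m2rnn_reference(q, k, v, f, W):
    """Sequential loop over time (illustrative only; layouts are assumptions).

    q, k: (B, T, H, K)   v: (B, T, H, V)   f: (B, T, H)   W: (H, V, V)
    Returns y: (B, T, H, V) and the final matrix state h: (B, H, K, V).
    """
    B, T, H, K = k.shape
    V = v.shape[-1]
    h = torch.zeros(B, H, K, V, dtype=q.dtype, device=q.device)
    ys = []
    for t in range(T):
        outer = k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(-2)   # k_t outer v_t: (B, H, K, V)
        cand = torch.tanh(torch.matmul(h, W) + outer)            # candidate matrix state
        gate = f[:, t].unsqueeze(-1).unsqueeze(-1)               # (B, H, 1, 1)
        h = gate * h + (1.0 - gate) * cand                       # forget-gated blend
        ys.append(torch.einsum("bhk,bhkv->bhv", q[:, t], h))     # y_t = q_t @ h_t
    return torch.stack(ys, dim=1), h


torch.manual_seed(0)
B, T, H, K, V = 2, 5, 3, 64, 16
q, k, v = torch.randn(B, T, H, K), torch.randn(B, T, H, K), torch.randn(B, T, H, V)
f = softplus_decay_gate(torch.randn(B, T, H), A=torch.ones(H), dt_bias=torch.zeros(H))
W = 0.1 * torch.randn(H, V, V)
y, h = m2rnn_reference(q, k, v, f, W)
y_hold, _ = m2rnn_reference(q, k, v, torch.ones(B, T, H), W)    # f = 1 keeps the zero state
```

Because the state is a convex blend of the previous state and a tanh output, its entries stay within [-1, 1] in this sketch, which is one reason a step-by-step loop is a convenient correctness reference even though it is far too slow for training.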
The file has three forward paths chosen at import time. The default is the XMA Triton kernel (_xma_m2rnn_forward) when xma.functional.m2rnn is importable. If not, we fall back to a pure-PyTorch sequential loop (_torch_m2rnn_forward) wrapped with @torch.compiler.disable so dynamo does not try to unroll the step loop and explode the graph. The wrapper reads MEGACPP_STARTUP_TRACE once at import as a dynamo-friendly constant instead of polling os.environ every step. These integration details matter more than the math when you want regional compile to stay stable across epochs, and the checked-in regional compile M2RNN wrapper sample shows the public boundary directly.
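The import-time selection pattern can be sketched with the standard library alone. This is an illustration under assumptions: the `"1"` encoding of the trace flag, the path labels, and `select_forward_path` are hypothetical, and the sketch leaves out the `@torch.compiler.disable` decoration that the real fallback loop carries.

```python
import os
import warnings

# Read the flag once at import so the hot path sees a plain Python constant.
# The "1" encoding is an assumption made for this sketch.
STARTUP_TRACE = os.environ.get("MEGACPP_STARTUP_TRACE", "0") == "1"

try:
    # Optional kernel dependency named in the text; absent in most environments.
    from xma.functional import m2rnn as _xma_kernel
except ImportError:
    _xma_kernel = None


def select_forward_path(kernel=_xma_kernel):
    """Return a label for the forward path (hypothetical helper)."""
    if kernel is not None:
        return "xma_triton"
    warnings.warn(
        "XMA kernel not importable; falling back to the sequential PyTorch loop",
        RuntimeWarning,
    )
    return "torch_reference"


# Decided once, at import, rather than on every forward call.
FORWARD_PATH = select_forward_path()
if STARTUP_TRACE:
    print(f"[m2rnn] forward path: {FORWARD_PATH}")
```

Resolving the choice at import keeps environment lookups and import probing out of the per-step code that a tracing compiler has to look at.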
The Megatron bridge around that layer is the glue for training inside a Megatron stack. The production bridge adapts the M2RNN layer to Megatron's mixer protocol: it accepts the standard module-construction arguments, transposes from [seq, batch, hidden] to [batch, seq, hidden] for the forward path, returns (output, None) so the surrounding residual path stays correct, and can run the pure-PyTorch recurrence rather than the kernel when compile compatibility matters. A small config bridge reconciles the earlier attribute names with Megatron's transformer configuration when both exist, which is the same narrow seam summarized by the M2RNN mixer spec sample.
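The layout handling in that bridge can be sketched as a thin adapter. The class below is hypothetical and only illustrates the shape contract; transposing the output back to [seq, batch, hidden] is an assumption about what the surrounding stack expects, and the real bridge also handles construction arguments and config reconciliation that are not shown.

```python
import torch
import torch.nn as nn


class SeqFirstMixerAdapter(nn.Module):
    """Hypothetical adapter: lets a batch-first layer serve a seq-first caller."""

    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, hidden_states: torch.Tensor):
        # [seq, batch, hidden] -> [batch, seq, hidden] for the wrapped layer.
        x = hidden_states.transpose(0, 1).contiguous()
        y = self.inner(x)
        # Assumed: the caller wants seq-first output plus a second slot, left as None.
        return y.transpose(0, 1).contiguous(), None


class CumSumOverSeq(nn.Module):
    """Toy batch-first layer whose result depends on which axis is the sequence."""

    def forward(self, x):
        return x.cumsum(dim=1)


adapter = SeqFirstMixerAdapter(CumSumOverSeq())
x_sbh = torch.randn(5, 2, 8)                    # [seq, batch, hidden]
out, second = adapter(x_sbh)
```

The toy inner layer is deliberately layout-sensitive: if the adapter forgot a transpose, the cumulative sum would run over the batch axis and the result would differ from `x_sbh.cumsum(dim=0)`.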
Engram
The local causal n-gram branch runs in parallel with the main block: given (B, T, C) hidden states it produces (B, T, C) features that get added back into the residual. In this article, Engram means that branch specifically: a cheap causal smoother over short local code motifs, not a separate retrieval database. The original mode (gated=False, conv_kernel=0) is the minimal version: project to a bottleneck, compute causal local averages at orders 2/3/4 via avg_pool1d, mix per-order, project back out, zero-init. The upgraded mode (gated=True, conv_kernel=4) adds two things on top. First, a DeepSeek-style context-aware gating step: alpha = sigmoid(RMSNorm(h)ᵀ · RMSNorm(k) / sqrt(d)), where h is the pre-norm residual and k is the linearly projected n-gram features. Second, a grouped causal convolution with SiLU activation on top. Two conv implementations are selectable: maxtext_depthwise uses nn.Conv1d(groups=D) on CUDA and unrolls by hand on XLA, while xla_safe uses a manual unfold-and-sum on CUDA and the same manual loop on XLA. The per-device split exists because the fast CUDA conv path is not well optimised on TPU.
There is subtle work in _same_doc_shift_mask and _causal_local_average: when documents are packed together, the causal smoother must not leak across document boundaries. Engram threads doc_ids through every shift so an n-gram window that would cross a document boundary is treated as left-pad zeros. This was a real regression in earlier runs: packing was correct but Engram was quietly mixing documents, and the attention validity tests did not cover it until we added them.
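The boundary rule can be illustrated with a small shift-and-mask sketch. These functions are hypothetical re-implementations written for this article, not the checked-in `_same_doc_shift_mask` and `_causal_local_average`, and they assume each document occupies a contiguous run of positions within a packed row.

```python
import torch


def same_doc_shift_mask(doc_ids, shift):
    """mask[b, t] is True when token t - shift exists and shares t's document."""
    B, T = doc_ids.shape
    mask = torch.zeros(B, T, dtype=torch.bool, device=doc_ids.device)
    if shift == 0:
        mask[:] = True
    elif shift < T:
        mask[:, shift:] = doc_ids[:, shift:] == doc_ids[:, : T - shift]
    return mask


def doc_aware_causal_average(x, doc_ids, order):
    """Causal mean over `order` taps; taps from another document count as zeros."""
    T = x.shape[1]
    out = x.clone()                                           # tap 0: the token itself
    for s in range(1, order):
        shifted = torch.zeros_like(x)
        shifted[:, s:] = x[:, : max(T - s, 0)]
        out = out + shifted * same_doc_shift_mask(doc_ids, s).unsqueeze(-1)
    return out / order


x = torch.arange(1.0, 5.0).view(1, 4, 1)                     # tokens 1, 2, 3, 4
packed = torch.tensor([[0, 0, 1, 1]])                        # two documents in one row
single = torch.zeros(1, 4, dtype=torch.long)                 # one document

isolated = doc_aware_causal_average(x, packed, order=2)      # 0.5, 1.5, 1.5, 3.5
leaky = doc_aware_causal_average(x, single, order=2)         # 0.5, 1.5, 2.5, 3.5
```

The third position is the interesting one: with the packed ids it averages 3 with a zero instead of with the previous document's 2, which is the leak the regression tests are meant to catch.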
The companion _RMSNorm class is deliberately small: no learnable parameters, and the forward is a single F.rms_norm call with weight=None. An earlier version did x.pow(2).mean(-1) and a manual .float() upcast. The manual path broke fusion under Inductor (the explicit dtype change was a fusion boundary) and overflowed in bf16 for large activations. The fix both stabilised bf16 training and reclaimed throughput.
That bug is easier to reason about after packed rows as the real training contract and attention validity and structure: the failure was not "Engram is weak", it was "a local-memory branch violated the same document boundary assumptions as the rest of the stack".
The engram "package" and concepts
The brief asks about an engram/ package, and it is worth being honest: the Engram path discussed here is a single module rather than a larger package. The broader "engram concept" (external learned memory the model reads from without writing) is represented in two places. The local half is EngramBranch above. The global half is the concept-retrieval module. CBlock is a cross-attention block where the queries come from hidden states and the keys/values come from a learned Embedding(n_concepts, concept_dim), which is the concept bank. There is no causal mask because the concepts are global prototypes, no RoPE on the concept K/V because the concepts are order-invariant, softmax is done in fp32 for stability, and the output projection is zero-initialised so dropping a CBlock into any layer position is identity at step 0. Concepts are read-only: the bank is learned by gradient descent, never written during the forward pass. That is the distinction we care about: Engram is the local reader, the concept bank is the global reader, and neither is a write-enabled episodic memory in this shipment.
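The concept-bank read can be sketched as a single-head cross-attention module. This is a hypothetical illustration, not the CBlock implementation: the single head, the separate K/V projections, and the residual add inside `forward` are assumptions, while the learned `Embedding` bank, the missing causal mask and RoPE, the fp32 softmax, and the zero-initialised output projection follow the description above.

```python
import torch
import torch.nn as nn


class ConceptCrossAttention(nn.Module):
    """Single-head sketch of a read-only concept-bank block (hypothetical)."""

    def __init__(self, d_model, n_concepts, concept_dim):
        super().__init__()
        self.bank = nn.Embedding(n_concepts, concept_dim)     # learned; never written in forward
        self.q_proj = nn.Linear(d_model, concept_dim, bias=False)
        self.k_proj = nn.Linear(concept_dim, concept_dim, bias=False)
        self.v_proj = nn.Linear(concept_dim, concept_dim, bias=False)
        self.out_proj = nn.Linear(concept_dim, d_model, bias=False)
        nn.init.zeros_(self.out_proj.weight)                   # block is identity at step 0

    def forward(self, x):
        q = self.q_proj(x)                                     # (B, T, Dc)
        k = self.k_proj(self.bank.weight)                      # (N, Dc); no positional encoding
        v = self.v_proj(self.bank.weight)                      # (N, Dc)
        scores = q @ k.t() / k.shape[-1] ** 0.5                # (B, T, N); no causal mask
        attn = torch.softmax(scores.float(), dim=-1).to(x.dtype)
        return x + self.out_proj(attn @ v)


torch.manual_seed(0)
block = ConceptCrossAttention(d_model=32, n_concepts=16, concept_dim=24)
x = torch.randn(2, 5, 32)
y = block(x)
y.sum().backward()
```

The cost per token scales with the number of concepts rather than with sequence length, and the zero-initialised projection still receives a gradient, so the block can start contributing after the first optimiser step.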
For naming, this is one place where MegaCpp model glossary matters: cblock is safe only because this article defines the retrieval role explicitly, rather than assuming every reader already knows which flavor of block naming is meant.
How it lands in the public sample
The production layer keeps the pieces that paid off and drops the more speculative surfaces.
M2RNN in the public sample
The production M2RNN seam lives in the Megatron spec layer, plus a public configuration sample. That configuration holds d_model, k_head_dim (default 64), v_head_dim (default 16), conv_kernel (default 4), gradient clipping, a residual flag, and A/dt init ranges; a single builder pulls those from the Megatron config with sensible defaults. The big rewrite is the kernel path: the public M2RNN Triton sample provides a Triton M2RNN scan (m2rnn_scan_triton) that is a drop-in replacement for _torch_m2rnn_forward. On our reference hybrid geometries (B=2, S=4096, H=8, K=64, V=16) the Triton path is dramatically faster than the Python reference loop — the exact multiplier depends on hardware, but the reference loop is orders of magnitude slower and is used only as a deliberate debug path via an explicit fallback-kernel switch. If Triton is not importable, the wrapper warns loudly rather than silently degrading: running the reference loop in production is not something we want to discover from a throughput dashboard or in the throughput accounting discussed in Mamba3 parallel performance.
The config bridge between the earlier M2RNN configuration surface and the Megatron transformer configuration is intentionally simple: there is a single entry point, and M2RNN reads its fields directly from the Megatron config via stable attribute access rather than through an extra shim layer.
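A frozen dataclass plus a single builder is one way to picture that seam. The sketch is hypothetical: the class name, the `m2rnn_*` attribute names, and the `use_residual` field name are invented for illustration, and the gradient-clipping and A/dt init fields are left out because their defaults are not stated here. Only the three numeric defaults come from the text.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class M2RNNConfigSketch:
    d_model: int
    k_head_dim: int = 64        # default from the text
    v_head_dim: int = 16        # default from the text
    conv_kernel: int = 4        # default from the text
    use_residual: bool = True   # hypothetical name for the residual flag

    @property
    def state_entries_per_head(self) -> int:
        return self.k_head_dim * self.v_head_dim


def build_m2rnn_config(megatron_cfg) -> M2RNNConfigSketch:
    """Single entry point: read fields by attribute, fall back to defaults."""
    return M2RNNConfigSketch(
        d_model=megatron_cfg.hidden_size,
        k_head_dim=getattr(megatron_cfg, "m2rnn_k_head_dim", 64),    # hypothetical attribute
        v_head_dim=getattr(megatron_cfg, "m2rnn_v_head_dim", 16),    # hypothetical attribute
        conv_kernel=getattr(megatron_cfg, "m2rnn_conv_kernel", 4),   # hypothetical attribute
        use_residual=getattr(megatron_cfg, "m2rnn_use_residual", True),
    )


cfg = build_m2rnn_config(SimpleNamespace(hidden_size=512))
custom = build_m2rnn_config(SimpleNamespace(hidden_size=512, m2rnn_v_head_dim=32))
```

Freezing the dataclass means a field cannot be patched after construction, so every override has to go through the builder where it is visible.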
Engram in practice
The public Engram config sample defines EngramConfig and NgramHashConfig, both fail-closed. EngramConfig.from_args validates layer indices, n-gram orders, bottleneck dim, dropout, conv kernel, and conv impl (must be "xla_safe" or "maxtext_depthwise"). The EngramBranch implementation itself is what matters architecturally: a small causal n-gram memory branch with doc-id threading, a fused RMSNorm path, and explicit convolution-mode selection. NgramHashConfig is adjacent rather than identical: it handles hashed n-gram token embeddings, not the main Engram residual branch. The Engram branch sample and Engram + mHC stack sample are the checked-in proof surfaces for that split.
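Fail-closed validation of that kind can be sketched as follows. Everything here is an illustrative stand-in for `EngramConfig.from_args`: the function and class names, the argument list, the default bottleneck of 128, and the exact bounds are assumptions. The two allowed `conv_impl` strings come from the text.

```python
from dataclasses import dataclass
from typing import Tuple

ALLOWED_CONV_IMPLS = ("xla_safe", "maxtext_depthwise")


@dataclass(frozen=True)
class EngramConfigSketch:
    layers: Tuple[int, ...]
    ngram_orders: Tuple[int, ...]
    bottleneck_dim: int
    dropout: float
    conv_kernel: int
    conv_impl: str


def make_engram_config(n_layers, layers, ngram_orders=(2, 3, 4), bottleneck_dim=128,
                       dropout=0.0, conv_kernel=4, conv_impl="xla_safe"):
    """Reject bad input outright instead of silently repairing it."""
    if not layers or any(not 0 <= i < n_layers for i in layers):
        raise ValueError(f"layer indices must lie in [0, {n_layers}): {list(layers)}")
    if not ngram_orders or any(o < 2 for o in ngram_orders):
        raise ValueError(f"n-gram orders must all be >= 2: {list(ngram_orders)}")
    if bottleneck_dim <= 0:
        raise ValueError(f"bottleneck_dim must be positive: {bottleneck_dim}")
    if not 0.0 <= dropout < 1.0:
        raise ValueError(f"dropout must lie in [0, 1): {dropout}")
    if conv_kernel < 0:
        raise ValueError(f"conv_kernel must be >= 0 (0 disables the conv): {conv_kernel}")
    if conv_impl not in ALLOWED_CONV_IMPLS:
        raise ValueError(f"conv_impl must be one of {ALLOWED_CONV_IMPLS}: {conv_impl!r}")
    return EngramConfigSketch(tuple(layers), tuple(ngram_orders), bottleneck_dim,
                              dropout, conv_kernel, conv_impl)


ok = make_engram_config(n_layers=8, layers=[1, 5], conv_impl="maxtext_depthwise")
```

Raising on an unknown `conv_impl` rather than falling back to a default is the point of fail-closed: a typo in a launch script stops the run instead of quietly selecting a slower path.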
Concepts and CBlock
The concept bank (CBlock) is easiest to describe as a cross-document retrieval block. In the public glossary, cblock means a lightweight coordination or connector block; in this memory-stack context it is the optional concept-retrieval tier rather than the default path. The reason it stays optional is pragmatic: in ablation runs the concept bank was additive but small, and it is parameter-heavy at useful n_concepts. The design remains worth keeping because the underlying idea has strong precedents in Memorizing Transformers and Flamingo. The important public question here is not paper lineage but operational fit: for the memory stack discussed here, the default tiers are M2RNN + Engram + Mamba + attention, not five always-on subsystems, and they still inherit the packed-sequence constraints described in packed rows as the real training contract. For the public naming around cblock, use the hybrid layout notes, the core block taxonomy sample, and MegaCpp model glossary rather than treating it as a field-standard term.
Ablations and what we kept
The published notes and companion articles tell a consistent story about what survives contact with real hardware:
Memory-subsystem roles at a glance:
| Module | Role | Cost profile | Default |
|---|---|---|---|
| Mamba 3 | bulk long-range recurrence | O(N), low bandwidth | on |
| Attention | sharp content lookups | O(N^2) on its minority share | on |
| M2RNN | matrix-valued recurrent state | matrix state per head | on |
| Engram | local causal n-gram smoother | cheap, additive residual | on |
| CBlock | cross-doc concept retrieval | parameter-heavy | off |
- M2RNN kernel path dominates. The pure-PyTorch loop is a correctness reference, not a training path. The XMA/Triton kernels are the only sensible choice once the model is past toy size.
- Engram gated + conv=4 is the default. Earlier variants without conv were fine but measurably weaker on code benchmarks. The conv adds real capacity; the gate prevents it from drowning out the main block.
- F.rms_norm fused path in Engram's _RMSNorm is not optional. The manual bf16 variance path overflows on real activations and breaks Inductor fusion. This is the single largest regression we fixed in the Engram subsystem.
- Cross-document bleed in Engram's conv and n-gram pools is a correctness bug, not a quality one. Packed sequences + Engram without doc_ids threading silently mixes documents. Regression tests for cross-document isolation now cover both the n-gram pool and the conv kernel.
- mHC (multi-stream hyper-connections) is an adjacent residual-stream seam that can layer around Engram-bearing blocks. A bug we found during review: mHC without Engram layers was silently no-op because the mHC layer list fell back to the empty engram layer list. It now defaults to all layers with a warning when --mhc is used without an explicit engram layer set.
- The full stack (Mamba + Engram + mHC + MTP + MoD + structure + ngram hash) is measurable in the throughput tables: baseline MoE + Mamba + Engram on the training bench is our reference, adding Engram + mHC costs memory bandwidth but adds capacity on code tasks, which is easier to reason about alongside Mamba3 parallel performance and memory budget anatomy.
- M2RNN + regional_compile works, M2RNN + whole-model compile does not without the @torch.compiler.disable break. Keeping the Triton call outside the compiled graph is non-negotiable.
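The mHC fallback fix in the list above can be sketched as a small resolver. The function name and signature are hypothetical; the behaviour (follow the explicit engram layer set when present, otherwise warn and cover all layers instead of silently covering none) is taken from the bullet.

```python
import warnings


def resolve_mhc_layers(mhc_enabled, engram_layers, n_layers):
    """Return the layer indices that mHC should wrap (illustrative helper)."""
    if not mhc_enabled:
        return []
    if engram_layers:
        return sorted(set(engram_layers))
    # The old behaviour returned the empty engram list here, making mHC a silent no-op.
    warnings.warn(
        "--mhc set without an explicit engram layer set; applying mHC to all layers",
        UserWarning,
    )
    return list(range(n_layers))


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = resolve_mhc_layers(True, [], n_layers=4)       # warns, covers every layer

explicit = resolve_mhc_layers(True, [2, 0, 2], n_layers=4)    # follows the explicit set
disabled = resolve_mhc_layers(False, [1], n_layers=4)         # mHC off: nothing to wrap
```

Surfacing the fallback as a warning keeps the run alive while making the implicit choice visible in the logs.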
Production checklist
- The production M2RNN config should be built through the canonical builder. Any direct construction must still pass all fields; the dataclass is frozen.
- The fallback-kernel switch is a debug knob. Never force the pure PyTorch path in a real training run; if Triton fails to import, fix the environment before starting.
- Engram's conv_impl must match the training device: maxtext_depthwise on CUDA, xla_safe on TPU. Crossing them trains correctly but loses throughput.
- Packed sequences + Engram require doc_ids threaded through every branch. Without it, Engram leaks across document boundaries.
- _RMSNorm inside Engram must use F.rms_norm(x, (D,), weight=None, eps=eps). Do not reintroduce the manual x.pow(2).mean(-1) variance path.
- When enabling mHC, always specify engram layer indices explicitly. Relying on the default fallback list is a known silent-noop trap.
- Concept bank (CBlock) is off by default in the public memory-stack recipe described here. Turning it on is an ablation, not the baseline configuration.
- Keep M2RNN outside whole-model compile. regional_compile only, with the step loop under @torch.compiler.disable.
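The conv_impl item above says crossing the modes still trains correctly. One way to see why is that a depthwise causal convolution and a manual shift-and-sum compute the same numbers. The sketch below is an assumption-laden illustration, not the two checked-in implementations: function names are invented, bias and the SiLU activation are omitted, and the shift loop stands in for the "manual" path in spirit only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def depthwise_causal_conv(x, conv):
    """x: (B, T, D); conv: nn.Conv1d(D, D, k, groups=D, bias=False)."""
    k = conv.kernel_size[0]
    xt = F.pad(x.transpose(1, 2), (k - 1, 0))                 # left-pad: output t sees only <= t
    return conv(xt).transpose(1, 2)


def shift_and_sum_causal_conv(x, weight):
    """Same result without a conv op. weight: (D, k); tap k - 1 is the current token."""
    T = x.shape[1]
    D, k = weight.shape
    out = torch.zeros_like(x)
    for j in range(k):
        s = k - 1 - j                                          # how far back tap j looks
        shifted = torch.zeros_like(x)
        shifted[:, s:] = x[:, : max(T - s, 0)]
        out = out + shifted * weight[:, j]
    return out


torch.manual_seed(0)
D, k = 8, 4
conv = nn.Conv1d(D, D, kernel_size=k, groups=D, bias=False)
x = torch.randn(2, 10, D)
with torch.no_grad():
    via_conv = depthwise_causal_conv(x, conv)
    via_shift = shift_and_sum_causal_conv(x, conv.weight[:, 0, :])
```

Since the two paths agree numerically, the choice between them is about which one a given backend compiles and schedules well, which is why the checklist frames a mismatch as a throughput problem.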
Frequently asked questions
When does M2RNN earn its place next to Mamba?
Why is doc_ids threading mandatory for Engram?
Packed rows place several documents in one sequence, so without doc_ids an Engram n-gram window or conv tap can reach across a document boundary and silently mix documents. Threading doc_ids through every shift is the masking rule that prevents that leak.
Why keep xla_safe on TPU instead of forcing maxtext_depthwise everywhere?
maxtext_depthwise is the CUDA-oriented fast path in this public memory branch, while TPU/XLA keeps xla_safe so Engram's grouped causal convolution stays inside the static-shape, device-friendly contract. Crossing the modes can still train, but it turns the convolution seam into a throughput problem rather than proving portability. The checked-in Engram branch sample shows the explicit mode boundary, and XLA vs CUDA stack decisions explains why backend-specific ownership is the safer rule.
Is the concept bank part of the default shipped stack?
CBlock stays as an optional retrieval tier for ablations and targeted experiments. The default memory recipe here is Mamba plus attention plus M2RNN plus Engram, with the concept bank left off unless there is a specific reason to pay the parameter cost.
Why not run CBlock on every layer if it adds global memory?
How is this different from Gated DeltaNet or hyper-connections?
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
Attention: The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
doc_ids: The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.
Mamba3: A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
RMSNorm: Root-mean-square normalization used as an explicit contract seam in the wrapped Mamba3 and Megatron integration paths.
RoPE: Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.
AttentionValidity: The validity carrier built from row-level counts or masks so sparse or structured attention paths know which token prefix is real without re-inferring it inside the compiled region.
Packed rows: Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…
CUDA: NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
MoE: Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
A grounded architectural read of the MegaCpp small-model stack: hybrid patterns, block semantics, schedule ownership, and why names like ablock,…