Published · 12 min read · David Gornshtein

M2RNN and Engram: The Memory Subsystem Inside the Hybrid

Where matrix-state RNN layers, causal n-gram Engram branches, and the learned concept bank fit inside our Mamba 3 + Transformer hybrid — and which pieces remain useful in the public memory stack.

The hybrid stack described here is not "a transformer with some Mamba sprinkled in". It is a deliberate memory hierarchy: Mamba 3 SSM blocks for bulk long-range recurrence, attention blocks for sharp content-addressed lookups, M2RNN layers for matrix-valued recurrent state, an Engram branch that runs local causal n-gram features in parallel with the residual stream, and an optional concept-retrieval block that reads from a learned bank of global patterns. Here M2RNN means a matrix-state recurrent mixer rather than a generic RNN cell, and Engram means the local n-gram memory branch we wire into the model block, not a vague catch-all memory label. This post is about the three of those that are not vanilla attention or Mamba: M2RNN, Engram, and the concept bank.

It fits best beside Mamba 3 + Transformers, Mamba3 kernel journey, and packed rows as the real training contract, because the memory stack only stays honest when its kernel path, hybrid schedule, and packed-document contract are read together. If you want the checked-in surfaces first, start with the M2RNN mixer spec sample, the Engram branch sample, and the Engram + mHC stack sample.

If these memory terms are new

  • M2RNN here means a matrix-state recurrent mixer: per-head state is a small matrix rather than a vector-only recurrent state.
  • Engram here means the local causal n-gram side branch, not an external retrieval database.
  • CBlock here means the optional concept-retrieval tier, not a generic industry-standard block name.
  • mHC means multi-stream hyper-connections layered around blocks; it is a residual-stream mechanism adjacent to this memory stack, not the same thing as Engram or M2RNN.

The closest related topic hubs in this cluster are Mamba 3 + Transformers for the backbone split and Gated DeltaNet, hyper-connections, and DynamicTanh for the alternative mixer and residual-side experiments.

Why this setup matters

C++ training data has several memory demands that standard attention does not satisfy cheaply. Long files need linear-in-length recurrence, and Mamba handles that. Local n-gram patterns (operator idioms, template boilerplate, four-token sequences that are effectively copied everywhere) want a cheap causal smoother, not a full attention head. Cross-document abstractions ("this is a linked-list insert regardless of tokens") want a global pattern bank you can cross-attend to without paying T x T. M2RNN occupies a different niche again: a matrix-valued recurrent state that can store more per step than a vector SSM without breaking into full attention.

None of these are critical-path on their own. What matters is that layering them is cheap — the features are additive to the residual and most of them are zero-initialised — so ablations are clean, and the wins stack.

What this memory stack includes

M2RNN

The research M2RNN module implements the layer. In plain language, it keeps a small matrix per head as recurrent state, so each token update can write more structure than a vector-only state-space mixer usually can. The core recurrence is straightforward:

h_t = tanh(h_{t-1} @ W + k_t ⊗ v_t)
h_t = f_t * h_{t-1} + (1 - f_t) * h_t
y_t = q_t @ h_t

The state h is a matrix, not a vector — at head dims K=64, V=16 the per-head state is 64x16. That extra capacity versus a vector SSM is the whole point. _softplus_decay_gate produces the forget factor f in SSM style (exp(-A * softplus(x + dt_bias))), and the per-layer setup is input_proj then optional causal_conv1d, then split into q, k, v plus the decay and output gates, then recurrence, residual v*D, output gate, norm, output_proj.
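The recurrence and the decay gate above can be sketched for a single head in plain Python. This is a reference-style loop under stated assumptions, not the module's code: the helper names (softplus_decay, m2rnn_head_reference), the list-of-lists layout, and passing precomputed forget factors are all hypothetical, and plain Python is used only so the sketch runs without dependencies.

```python
import math

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_decay(x, a, dt_bias):
    # SSM-style forget factor f = exp(-A * softplus(x + dt_bias)); in (0, 1] for A >= 0.
    return math.exp(-a * softplus(x + dt_bias))

def m2rnn_head_reference(q, k, v, f, w):
    """Sequential single-head recurrence with a K x V matrix state (hypothetical helper).

    q, k: T vectors of length K; v: T vectors of length V;
    f: T scalar forget factors; w: V x V state-transition matrix.
    Returns T output vectors of length V.
    """
    K, V = len(k[0]), len(v[0])
    h = [[0.0] * V for _ in range(K)]
    ys = []
    for t in range(len(q)):
        # Candidate state: tanh(h_{t-1} @ W + outer(k_t, v_t)).
        cand = [
            [math.tanh(sum(h[i][m] * w[m][j] for m in range(V)) + k[t][i] * v[t][j])
             for j in range(V)]
            for i in range(K)
        ]
        # Gated blend of the previous state and the candidate.
        h = [[f[t] * h[i][j] + (1.0 - f[t]) * cand[i][j] for j in range(V)]
             for i in range(K)]
        # Readout: y_t = q_t @ h_t, a length-V vector.
        ys.append([sum(q[t][i] * h[i][j] for i in range(K)) for j in range(V)])
    return ys
```

A loop like this is only useful as a correctness reference: at K=64, V=16 every step touches the full 64x16 state, which is the work the kernel paths exist to parallelise.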

The file has three forward paths chosen at import time. The default is the XMA Triton kernel (_xma_m2rnn_forward) when xma.functional.m2rnn is importable. If not, we fall back to a pure-PyTorch sequential loop (_torch_m2rnn_forward) wrapped with @torch.compiler.disable so dynamo does not try to unroll the step loop and explode the graph. The wrapper reads MEGACPP_STARTUP_TRACE once at import as a dynamo-friendly constant instead of polling os.environ every step. These integration details matter more than the math when you want regional compile to stay stable across epochs, and the checked-in regional compile M2RNN wrapper sample shows the public boundary directly.

The Megatron bridge around that layer is the glue for training inside a Megatron stack. The production bridge adapts the M2RNN layer to Megatron's mixer protocol: it accepts the standard module-construction arguments, transposes from [seq, batch, hidden] to [batch, seq, hidden] for the forward path, returns (output, None) so the surrounding residual path stays correct, and can run the pure-PyTorch recurrence rather than the kernel when compile compatibility matters. A small config bridge reconciles the earlier attribute names with Megatron's transformer configuration when both exist, which is the same narrow seam summarized by the M2RNN mixer spec sample.

Engram

The local causal n-gram branch runs in parallel with the main block: given (B, T, C) hidden states it produces (B, T, C) features that get added back into the residual. In this article, Engram means that branch specifically: a cheap causal smoother over short local code motifs, not a separate retrieval database. The original mode (gated=False, conv_kernel=0) is the minimal version: project to a bottleneck, compute causal local averages at orders 2/3/4 via avg_pool1d, mix per-order, project back out, zero-init. The upgraded mode (gated=True, conv_kernel=4) adds two things on top. First, a DeepSeek-style context-aware gating step: alpha = sigmoid(RMSNorm(h)ᵀ · RMSNorm(k) / sqrt(d)), where h is the pre-norm residual and k is the linearly projected n-gram features. Second, a grouped causal convolution with SiLU activation on top. Two conv implementations are selectable: maxtext_depthwise uses nn.Conv1d(groups=D) on CUDA and unrolls by hand on XLA, while xla_safe uses a manual unfold-and-sum on CUDA and the same manual loop on XLA. The per-device split exists because the fast CUDA conv path is not well optimised on TPU.
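The gating formula above can be written out for a single token as a short sketch. It assumes a parameter-free RMSNorm and one scalar gate per token; the function names are hypothetical, and how the branch applies alpha to the n-gram features is not shown because the text does not specify it.

```python
import math

def rms_norm(x, eps=1e-6):
    # Parameter-free RMSNorm: x / sqrt(mean(x^2) + eps).
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def engram_gate(h, k, eps=1e-6):
    # alpha = sigmoid(<RMSNorm(h), RMSNorm(k)> / sqrt(d)), one scalar per token.
    d = len(h)
    hn, kn = rms_norm(h, eps), rms_norm(k, eps)
    score = sum(a * b for a, b in zip(hn, kn)) / math.sqrt(d)
    return 1.0 / (1.0 + math.exp(-score))
```

Because both inputs are normalised, the score is bounded by roughly sqrt(d) in magnitude: aligned h and k push alpha toward 1, orthogonal inputs give exactly 0.5, and opposed inputs push it toward 0.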

There is subtle work in _same_doc_shift_mask and _causal_local_average: when documents are packed together, the causal smoother must not leak across document boundaries. Engram threads doc_ids through every shift so an n-gram window that would cross a document boundary is treated as left-pad zeros. This was a real regression in earlier runs: packing was correct but Engram was quietly mixing documents, and the attention validity tests did not cover it until we added them.
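The boundary rule above can be shown on a single feature channel. This is a minimal sketch, not the _causal_local_average implementation: it assumes zero-padding semantics in which masked positions contribute zero and the sum is always divided by the order, and the function name and list layout are illustrative.

```python
def causal_local_average(x, doc_ids, order):
    """Causal mean over the last `order` positions of a 1-D feature sequence.

    Positions before the start of the row, or belonging to a different
    document than position t, contribute zero (left-pad semantics), and
    the sum is always divided by `order`.
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for shift in range(order):
            s = t - shift
            if s >= 0 and doc_ids[s] == doc_ids[t]:
                total += x[s]
        out.append(total / order)
    return out

# Two documents packed into one row: tokens 0-2 are doc 0, tokens 3-5 are doc 1.
x = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
doc_ids = [0, 0, 0, 1, 1, 1]
print(causal_local_average(x, doc_ids, order=2))
# [0.5, 1.5, 2.5, 5.0, 15.0, 25.0]
```

At position 3 the window would reach back to the last token of document 0. With the mask that token contributes zero, so the result is 5.0 instead of the leaked 6.5.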

The companion _RMSNorm class is deliberately small: no learnable parameters, and the forward is a single F.rms_norm call with weight=None. An earlier version did x.pow(2).mean(-1) and a manual .float() upcast. The manual path broke fusion under Inductor (the explicit dtype change was a fusion boundary) and overflowed in bf16 for large activations. The fix both stabilised bf16 training and reclaimed throughput.

That bug is easier to reason about after packed rows as the real training contract and attention validity and structure: the failure was not "Engram is weak", it was "a local-memory branch violated the same document boundary assumptions as the rest of the stack".

The engram "package" and concepts

The brief asks about an engram/ package, and it is worth being honest: the Engram path discussed here is a single module rather than a larger package. The broader "engram concept" (external learned memory the model reads from without writing) is represented in two places. The local half is EngramBranch above. The global half is the concept-retrieval module. CBlock is a cross-attention block where the queries come from hidden states and the keys/values come from a learned Embedding(n_concepts, concept_dim), which is the concept bank. There is no causal mask because the concepts are global prototypes, no RoPE on the concept K/V because the concepts are order-invariant, softmax is done in fp32 for stability, and the output projection is zero-initialised so dropping a CBlock into any layer position is identity at step 0. Concepts are read-only: the bank is learned by gradient descent, never written during the forward pass. That is the distinction we care about: Engram is the local reader, the concept bank is the global reader, and neither is a write-enabled episodic memory in this shipment.
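The identity-at-step-0 property described above can be checked with a small single-token sketch. It assumes a single head, a 1/sqrt(D) score scale, and a residual add around the block; the function and weight names are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def matvec(mat, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]

def cblock_token(h, bank, w_q, w_k, w_v, w_out):
    """Read-only cross-attention from one hidden state into a concept bank.

    h: hidden vector (C); bank: n_concepts rows of concept_dim;
    w_q: D x C; w_k and w_v: D x concept_dim; w_out: C x D.
    No causal mask and no positional encoding are applied to the bank.
    """
    q = matvec(w_q, h)
    keys = [matvec(w_k, c) for c in bank]
    vals = [matvec(w_v, c) for c in bank]
    scale = 1.0 / math.sqrt(len(q))
    attn = softmax([scale * sum(a * b for a, b in zip(q, key)) for key in keys])
    read = [sum(p * val[j] for p, val in zip(attn, vals)) for j in range(len(q))]
    # Residual add: with w_out all zeros the block returns h unchanged.
    return [x + y for x, y in zip(h, matvec(w_out, read))]

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [1.0, 2.0]
print(cblock_token(h, bank, identity, identity, identity, zeros))
# [1.0, 2.0]
```

The bank is only ever read in this function, which matches the read-only contract: the values in it would change through the optimiser, not through the forward pass.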

For naming, this is one place where MegaCpp model glossary matters: cblock is safe only because this article defines the retrieval role explicitly, rather than assuming every reader already knows which flavor of block naming is meant.

How it lands in the public sample

The production layer keeps the pieces that paid off and drops the more speculative surfaces.

M2RNN in the public sample

The production M2RNN seam lives in the Megatron spec layer, plus a public configuration sample. That configuration holds d_model, k_head_dim (default 64), v_head_dim (default 16), conv_kernel (default 4), gradient clipping, a residual flag, and A/dt init ranges; a single builder pulls those from the Megatron config with sensible defaults. The big rewrite is the kernel path: the public M2RNN Triton sample provides a Triton M2RNN scan (m2rnn_scan_triton) that is a drop-in replacement for _torch_m2rnn_forward. On our reference hybrid geometries (B=2, S=4096, H=8, K=64, V=16) the Triton path is dramatically faster than the Python reference loop — the exact multiplier depends on hardware, but the reference loop is orders of magnitude slower and is used only as a deliberate debug path via an explicit fallback-kernel switch. If Triton is not importable, the wrapper warns loudly rather than silently degrading: running the reference loop in production is not something we want to discover from a throughput dashboard or in the throughput accounting discussed in Mamba3 parallel performance.

The config bridge between the earlier M2RNN configuration surface and the Megatron transformer configuration is intentionally simple: there is a single entry point, and M2RNN reads its fields directly from the Megatron config via stable attribute access rather than through an extra shim layer.
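A frozen config plus a single builder, as described above, might look like the following sketch. The class name, the builder name, and the m2rnn_* attribute names are hypothetical; only the defaults (64, 16, 4) come from the text, and the gradient-clipping, residual, and A/dt init fields are left out for brevity.

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass(frozen=True)
class M2RNNConfigSketch:
    d_model: int
    k_head_dim: int = 64
    v_head_dim: int = 16
    conv_kernel: int = 4

def build_m2rnn_config(megatron_cfg):
    # Single entry point: read fields by plain attribute access and fall
    # back to the documented defaults when an attribute is absent.
    return M2RNNConfigSketch(
        d_model=megatron_cfg.hidden_size,
        k_head_dim=getattr(megatron_cfg, "m2rnn_k_head_dim", 64),
        v_head_dim=getattr(megatron_cfg, "m2rnn_v_head_dim", 16),
        conv_kernel=getattr(megatron_cfg, "m2rnn_conv_kernel", 4),
    )

# A stand-in for the transformer configuration object.
cfg = build_m2rnn_config(SimpleNamespace(hidden_size=1024, m2rnn_k_head_dim=32))
print(cfg)
# M2RNNConfigSketch(d_model=1024, k_head_dim=32, v_head_dim=16, conv_kernel=4)
```

Freezing the dataclass means a layer cannot quietly mutate its geometry after construction, so the builder remains the one place where defaults are decided.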

Engram in practice

The public Engram config sample defines EngramConfig and NgramHashConfig, both fail-closed. EngramConfig.from_args validates layer indices, n-gram orders, bottleneck dim, dropout, conv kernel, and conv impl (must be "xla_safe" or "maxtext_depthwise"). The EngramBranch implementation itself is what matters architecturally: a small causal n-gram memory branch with doc-id threading, a fused RMSNorm path, and explicit convolution-mode selection. NgramHashConfig is adjacent rather than identical: it handles hashed n-gram token embeddings, not the main Engram residual branch. The Engram branch sample and Engram + mHC stack sample are the checked-in proof surfaces for that split.
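Fail-closed validation of the fields listed above can be sketched with a frozen dataclass. The class name, field names, and the specific bounds (orders of at least 2, dropout in [0, 1), unique layer indices) are assumptions for illustration; only the two allowed conv_impl values and the 2/3/4 and 4 defaults come from the text.

```python
from dataclasses import dataclass

ALLOWED_CONV_IMPLS = ("xla_safe", "maxtext_depthwise")

@dataclass(frozen=True)
class EngramConfigSketch:
    layers: tuple
    bottleneck_dim: int
    conv_impl: str
    ngram_orders: tuple = (2, 3, 4)
    dropout: float = 0.0
    conv_kernel: int = 4  # 0 disables the convolution

    def __post_init__(self):
        # Fail closed: reject anything unexpected instead of coercing it.
        if any(i < 0 for i in self.layers):
            raise ValueError(f"negative layer index in {self.layers}")
        if len(set(self.layers)) != len(self.layers):
            raise ValueError(f"duplicate layer index in {self.layers}")
        if not self.ngram_orders or any(n < 2 for n in self.ngram_orders):
            raise ValueError(f"invalid n-gram orders {self.ngram_orders}")
        if self.bottleneck_dim <= 0:
            raise ValueError("bottleneck_dim must be positive")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")
        if self.conv_kernel < 0:
            raise ValueError("conv_kernel must be >= 0")
        if self.conv_impl not in ALLOWED_CONV_IMPLS:
            raise ValueError(f"conv_impl must be one of {ALLOWED_CONV_IMPLS}")
```

Raising at construction time means a typo such as conv_impl="xla-safe" stops the run before any training step, rather than silently selecting some default path.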

Concepts and CBlock

The concept bank (CBlock) is easiest to describe as a cross-document retrieval block. In the public glossary, cblock means a lightweight coordination or connector block; in this memory-stack context it is the optional concept-retrieval tier rather than the default path. The reason it stays optional is pragmatic: in ablation runs the concept bank was additive but small, and it is parameter-heavy at useful n_concepts. The design remains worth keeping because the underlying idea has strong precedents in Memorizing Transformers and Flamingo. The important public question here is not paper lineage but operational fit: for the memory stack discussed here, the default tiers are M2RNN + Engram + Mamba + attention, not five always-on subsystems, and they still inherit the packed-sequence constraints described in packed rows as the real training contract. For the public naming around cblock, use the hybrid layout notes, the core block taxonomy sample, and MegaCpp model glossary rather than treating it as a field-standard term.

Ablations and what we kept

The published notes and companion articles tell a consistent story about what survives contact with real hardware:

Memory-subsystem roles at a glance:

Module    | Role                          | Cost profile                  | Default
----------|-------------------------------|-------------------------------|--------
Mamba 3   | bulk long-range recurrence    | O(N), low bandwidth           | on
Attention | sharp content lookups         | O(N^2) on its minority share  | on
M2RNN     | matrix-valued recurrent state | matrix state per head         | on
Engram    | local causal n-gram smoother  | cheap, additive residual      | on
CBlock    | cross-doc concept retrieval   | parameter-heavy               | off

  • M2RNN kernel path dominates. The pure-PyTorch loop is a correctness reference, not a training path. The XMA/Triton kernels are the only sensible choice once the model is past toy size.
  • Engram gated + conv=4 is the default. Earlier variants without conv were fine but measurably weaker on code benchmarks. The conv adds real capacity; the gate prevents it from drowning out the main block.
  • F.rms_norm fused path in Engram's _RMSNorm is not optional. The manual bf16 variance path overflows on real activations and breaks Inductor fusion. This is the single largest regression we fixed in the Engram subsystem.
  • Cross-document bleed in Engram's conv and n-gram pools is a correctness bug, not a quality one. Packed sequences + Engram without doc_ids threading silently mixes documents. Regression tests for cross-document isolation now cover both the n-gram pool and the conv kernel.
  • mHC (multi-stream hyper-connections) is an adjacent residual-stream seam that can layer around Engram-bearing blocks. A bug we found during review: mHC without Engram layers was silently no-op because the mHC layer list fell back to the empty engram layer list. It now defaults to all layers with a warning when --mhc is used without an explicit engram layer set.
  • The full stack (Mamba + Engram + mHC + MTP + MoD + structure + ngram hash) is measurable in the throughput tables: baseline MoE + Mamba + Engram on the training bench is our reference; adding Engram + mHC costs memory bandwidth but adds capacity on code tasks, which is easier to reason about alongside Mamba3 parallel performance and memory budget anatomy.
  • M2RNN + regional_compile works, M2RNN + whole-model compile does not without the @torch.compiler.disable break. Keeping the Triton call outside the compiled graph is non-negotiable.
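The mHC fallback fix in the list above can be sketched as a small resolver. The function name and signature are hypothetical; the behaviour follows the description: an explicit Engram layer set is used as-is, and enabling mHC without one falls back to all layers with a warning instead of an empty list.

```python
import warnings

def resolve_mhc_layers(n_layers, mhc_enabled, engram_layers=None):
    # mHC disabled: nothing to wrap.
    if not mhc_enabled:
        return []
    # Explicit Engram layer set: follow it, deduplicated and ordered.
    if engram_layers:
        return sorted(set(engram_layers))
    # No explicit set: warn and cover every layer rather than silently no-op.
    warnings.warn(
        "mHC enabled without explicit Engram layers; applying mHC to all layers",
        stacklevel=2,
    )
    return list(range(n_layers))
```

The earlier bug corresponds to returning the empty Engram list in the last branch, which made the mHC setting a silent no-op.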

Production checklist

  • The production M2RNN config should be built through the canonical builder. Any direct construction must still pass all fields; the dataclass is frozen.
  • The fallback-kernel switch is a debug knob. Never force the pure PyTorch path in a real training run; if Triton fails to import, fix the environment before starting.
  • Engram's conv_impl must match the training device: maxtext_depthwise on CUDA, xla_safe on TPU. Crossing them trains correctly but loses throughput.
  • Packed sequences + Engram require doc_ids threaded through every branch. Without it, Engram leaks across document boundaries.
  • _RMSNorm inside Engram must use F.rms_norm(x, (D,), weight=None, eps=eps). Do not reintroduce the manual x.pow(2).mean(-1) variance path.
  • When enabling mHC, always specify engram layer indices explicitly. Relying on the default fallback list is a known silent-noop trap.
  • Concept bank (CBlock) is off by default in the public memory-stack recipe described here. Turning it on is an ablation, not the baseline configuration.
  • Keep M2RNN outside whole-model compile. regional_compile only, with the step loop under @torch.compiler.disable.
Frequently asked questions

When does M2RNN earn its place next to Mamba?
When you want more per-step state capacity than a vector SSM can cheaply hold, especially on long code sequences. It only pays off in production if the Triton or XMA kernel path is healthy; the pure PyTorch loop is a correctness fallback, not a serious training path. The quickest checked-in anchor is the M2RNN mixer spec sample, then the compile-boundary companion regional compile M2RNN wrapper sample.
Why is doc_ids threading mandatory for Engram?
Because packed documents otherwise bleed into each other through local averages and convolutions. If the n-gram branch crosses a document boundary, it stops being a local memory feature and starts corrupting sequence semantics. The checked-in Engram branch sample shows the exact doc_ids masking rule that prevents that leak.
Why keep xla_safe on TPU instead of forcing maxtext_depthwise everywhere?
Because the convolution ownership is device-specific. maxtext_depthwise is the CUDA-oriented fast path in this public memory branch, while TPU/XLA keeps xla_safe so Engram's grouped causal convolution stays inside the static-shape, device-friendly contract. Crossing the modes can still train, but it turns the convolution seam into a throughput problem rather than proving portability. The checked-in Engram branch sample shows the explicit mode boundary, and XLA vs CUDA stack decisions explains why backend-specific ownership is the safer rule.
Is the concept bank part of the default shipped stack?
No. CBlock stays as an optional retrieval tier for ablations and targeted experiments. The default memory recipe here is Mamba plus attention plus M2RNN plus Engram, with the concept bank left off unless there is a specific reason to pay the parameter cost.
Why not run CBlock on every layer if it adds global memory?
Because it changes the bottleneck. M2RNN and Engram are local state updates, while a concept-bank query is a retrieval plus cross-attention pass over a learned store. That can be useful for targeted long-range facts, but making it always-on would tax decode bandwidth on every token. Keeping it off by default preserves the cheap local memory tiers and makes concept retrieval an explicit ablation.
How is this different from Gated DeltaNet or hyper-connections?
M2RNN and Engram are memory-tier additions: a matrix-state recurrent mixer and a local causal n-gram branch. Gated DeltaNet, hyper-connections, and DynamicTanh covers a different surface: an alternative recurrent token mixer plus residual-stream mechanics. The checked-in split reflects that difference directly: M2RNN mixer spec sample and Engram branch sample live on the memory side, while DeltaNet + hyper-connection sample lives on the alternative-mixer side.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

doc_ids

The fixed-width per-token document identifiers that keep packed rows auditable and let TPU masking respect document boundaries.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

RMSNorm

Root-mean-square normalization used as an explicit contract seam in the wrapped Mamba3 and Megatron integration paths.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

AttentionValidity

The validity carrier built from row-level counts or masks so sparse or structured attention paths know which token prefix is real without re-inferring it inside the compiled region.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Architecture

A grounded architectural read of the MegaCpp small-model stack: hybrid patterns, block semantics, schedule ownership, and why names like ablock,…