MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 12 min read · David Gornshtein
Performance
Kernels
Mamba3
MoE
MLA
Transformer Engine

Training speed by feature: which parts of the stack really move step time

A grounded feature-by-feature look at training speed across a modern hybrid stack: Mamba fused paths, memory-traffic cleanup, MLA pieces, MoE dispatch, routing bridges, and feature taxes that should stay experimental.

Not every interesting feature moves training speed in the same way. The biggest durable wins usually come from removing repeated hot-path work: fused Mamba or state-space updates, fused residual math, narrow MLA ingress fusion, and especially MoE dispatch-plus-compute cleanup. A hot path here means code that runs every step, over large enough tensors, often enough that shaving small overhead there changes end-to-end wall time. Some features are speed enablers because they route work to a better backend. Others are quality or architecture features that should be treated as measured taxes, not presumed accelerators. The practical job is to separate hot-path wins from feature costs and to keep exact measurements for both, which is also why H200 bringup and naming and Muon on Hopper and Blackwell keep returning to explicit lane definitions. The quickest checked-in proof surfaces are goodput tracker sample, measured optimization receipts, expert parallel routing sample, and MLA integration pattern sample. If you need the article-family view around these measurements, Training on 8x H200 SXM: the operator playbook keeps this feature slice next to the operator and lane-level companions.

The easy way to talk about speed is to say everything matters. The useful way is to ask a narrower question: which code paths are executed often enough, over tensors large enough, that cleaning them up changes the wall-clock reality of training? A serious training stack answers that question in a grounded way. It exposes throughput and goodput reporting, preserves feature flags in launch surfaces, and keeps a reproducible record of which optimizations are worth keeping and which are still experiments, which is also the point of Profiler and receipts and Throughput vs quality knobs. Four terms recur below. A throughput knob is any compile, routing, fusion, data, or observability setting that can move end-to-end step time. A receipt is the per-run record that shows what the knob actually did. A dashboard is the rolling surface that shows whether the effect persists across many receipts. A trace is the heavier drill-down artifact you open only when the receipt still cannot explain the result. Goodput here means the fraction of wall time spent doing useful training-step work, while badput is the wall time lost to compile, checkpoint, data, eval, or idle overhead.
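The goodput definition above can be made concrete with a few lines of bookkeeping. The following is a minimal sketch, not the checked-in goodput tracker sample: the class name `GoodputTracker` and its methods are hypothetical, and the timings are made-up illustrative numbers. Only the badput categories come from the text.

```python
from dataclasses import dataclass, field

# Badput categories named in the article; everything else here is a hypothetical sketch.
BADPUT_CATEGORIES = ("compile", "checkpoint", "data", "eval", "idle")


@dataclass
class GoodputTracker:
    step_seconds: float = 0.0
    badput_seconds: dict = field(
        default_factory=lambda: {c: 0.0 for c in BADPUT_CATEGORIES}
    )

    def add_step(self, seconds: float) -> None:
        """Record wall time spent in useful training-step work."""
        self.step_seconds += seconds

    def add_badput(self, category: str, seconds: float) -> None:
        """Record wall time lost to one of the named overhead categories."""
        if category not in self.badput_seconds:
            raise ValueError(f"unknown badput category: {category}")
        self.badput_seconds[category] += seconds

    def wall_seconds(self) -> float:
        return self.step_seconds + sum(self.badput_seconds.values())

    def goodput(self) -> float:
        """Fraction of wall time spent on useful step work."""
        wall = self.wall_seconds()
        return self.step_seconds / wall if wall > 0 else 0.0


# Illustrative numbers only: 90 s of steps, 10 s of overhead.
tracker = GoodputTracker()
tracker.add_step(90.0)
tracker.add_badput("compile", 6.0)
tracker.add_badput("checkpoint", 4.0)
print(f"goodput = {tracker.goodput():.2f}")  # goodput = 0.90
```

Keeping badput split by category is the point of the design: a feature that speeds up the step but adds compile time shows up as a shift between buckets rather than as an unexplained change in one number.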

Start from observability, not from intuition

A stack can only have a sane speed conversation if it measures the right thing. The MegaCpp examples already have the right ingredients: goodput accounting, temporal performance reporting, and machine-checkable result records. That matters because otherwise every feature discussion degrades into profiler screenshots and memory of “it felt faster.”

| Surface | What it measures | Why it matters for feature evaluation |
|---|---|---|
| Goodput accounting | Useful training time versus badput categories | Separates model progress from compile, idle, or data overhead |
| Temporal performance tracking | Step-level throughput, tokens, and peak memory | Shows feature tax or gain over time instead of at one lucky step |
| Stable report output | Comparison-ready summaries | Makes comparison shareable instead of anecdotal |
| Structured result schema | Structural invariants for results | Prevents incomplete ablations from being treated as final |

That measurement layer changes the whole conversation. Once it exists, features stop being judged by enthusiasm and start being judged by whether they improve useful work per unit wall time, and whether they do so reliably. The checked-in goodput tracker sample and measured optimization receipts are the compact proof surfaces for that measurement layer. When the question becomes "is this drifting over weeks of runs?" rather than "did this one knob help this one lane?", the handoff is Observability and the three dashboards.
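One way to read "structural invariants for results" is as a check that refuses to accept a receipt with missing context. The sketch below assumes a receipt is a plain dictionary; the field names, the `validate_receipt` helper, and the example values are all hypothetical and are not taken from the checked-in schema.

```python
# Hypothetical required fields for a per-run receipt.
REQUIRED_FIELDS = ("model_name", "topology", "features", "tokens_per_second", "goodput")


def validate_receipt(receipt: dict) -> list:
    """Return a list of problems; an empty list means the receipt is complete."""
    problems = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in receipt]
    if "goodput" in receipt and not 0.0 <= receipt["goodput"] <= 1.0:
        problems.append("goodput must be a fraction in [0, 1]")
    if "tokens_per_second" in receipt and receipt["tokens_per_second"] <= 0:
        problems.append("tokens_per_second must be positive")
    if "features" in receipt and not isinstance(receipt["features"], dict):
        problems.append("features must map flag name to state")
    return problems


# Placeholder values, not measurements.
complete = {
    "model_name": "example-hybrid",
    "topology": "example-topology",
    "features": {"fused_moe": True},
    "tokens_per_second": 1000.0,
    "goodput": 0.9,
}
incomplete = {"model_name": "example-hybrid", "tokens_per_second": 1000.0}

print(validate_receipt(complete))    # []
print(validate_receipt(incomplete))
```

An ablation whose receipt fails this kind of check is simply not eligible for comparison, which is what keeps incomplete runs from being treated as final.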

The measurement layer is not literally free, but that is the wrong objection in most tuning work. Goodput counters and structured receipts add some bookkeeping cost, yet they are still cheaper than promoting a feature on one lucky trace and then rediscovering later that compile badput, checkpoint badput, or data stalls erased the gain. The observability-overhead question is useful mainly as a reminder to measure that tax once and then keep it visible, not as a reason to run blind.

The hot-path wins are the ones worth defaulting on

The strongest speed features in the current tree share one property: they eliminate repeated work inside paths that dominate the training loop.

Mamba-related fused work fits that pattern. In hybrid lanes where M blocks are active, any fused state update or scan cleanup hits a core loop, not a side branch. That is why Mamba-side fusion belongs in the “likely worth keeping” bucket. If the model uses many Mamba layers, repeated elementwise and state-update overhead compounds quickly, which is the same reason Mamba 3 parallel performance and Mamba 3 kernel journey spend so much time on seemingly small inner-loop cleanup.

The same logic explains the value of residual-path fusion. Small-looking operations matter because they run constantly. Elementwise launches on activation-shaped tensors are easy to dismiss one by one and expensive in aggregate. When those are fused into fewer passes over memory, the gain is rarely dramatic at one instruction boundary and often very real end to end.
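A back-of-envelope traffic count shows why fewer passes over memory matter. The sketch below is an idealized model that ignores caches and launch latency; the tensor shape, the two-op residual pattern `y = (x + residual) * scale`, and the helper name are illustrative assumptions.

```python
def elementwise_traffic_bytes(num_elements, bytes_per_element, kernels):
    """Idealized memory traffic for a chain of elementwise kernels.

    kernels: list of (tensor_reads, tensor_writes) per kernel launch, counting
    only activation-shaped tensors (scalars and small broadcasts are ignored).
    """
    return sum((r + w) * num_elements * bytes_per_element for r, w in kernels)


# Illustrative activation tensor: 4096 x 4096 elements in a 2-byte dtype.
n = 4096 * 4096
bytes_per_element = 2

# Unfused: an add kernel (read x, read residual, write tmp),
# then a scale kernel (read tmp, write y).
unfused = [(2, 1), (1, 1)]
# Fused: one kernel reads x and residual and writes y.
fused = [(2, 1)]

unfused_bytes = elementwise_traffic_bytes(n, bytes_per_element, unfused)
fused_bytes = elementwise_traffic_bytes(n, bytes_per_element, fused)
print(f"unfused: {unfused_bytes / 2**20:.0f} MiB")  # unfused: 160 MiB
print(f"fused:   {fused_bytes / 2**20:.0f} MiB")    # fused:   96 MiB
print(f"ratio:   {fused_bytes / unfused_bytes:.2f}")  # ratio:   0.60
```

Under this model the fused version moves 60% of the bytes and needs one launch instead of two. The saving per call is small, which is exactly why it only matters on paths that run every layer of every step.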

MLA is a more mixed case, but still instructive. Narrow ingress fusion is the kind of optimization that often survives scrutiny because it removes reshape/apply/reshape style overhead from a hot boundary. Broader projection fusion, by contrast, is something to treat more skeptically until it continues to beat library improvements and compiler evolution. The general lesson is not “MLA fusion is good” or “MLA fusion is bad.” It is “keep the narrow wins that repeatedly hit hot ingress paths, and force the bigger fusions to prove themselves,” which is the same split argued in MLA and weight absorption and Fused MLA on NVIDIA.

MoE is the largest obvious speed surface

If one feature family deserves to be called a first-order throughput concern, it is MoE. A fused MoE implementation makes that visible by comparing a standard route-permute-pad-batched-gemm-unpad-unpermute shape against a tighter route-sort-jagged-gemm-weighted-scatter shape. That is why Fused MoE and DeepEP on NVIDIA and Expert parallel and MoE sharding are speed posts as much as architecture posts.

standard:
route -> permute -> pad -> batched_gemm -> unpad -> unpermute

fused:
route -> sort_by_expert -> jagged_gemm_fused -> weighted_scatter

That is not just an implementation detail. It is a map of where speed disappears. Every extra permute, pad, and unpad stage is another opportunity for memory traffic and dispatch overhead to dominate the useful expert compute. When MoE underperforms, the culprit is often not the GEMM itself but the work around it.
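The two pipeline shapes can be compared with a small NumPy reference. This is a sketch under stated assumptions, not the stack's fused kernel: it uses top-1 routing, a Python loop where a real implementation would use a grouped GEMM, and hypothetical function names. Its purpose is to show that both shapes compute the same result while the padded one multiplies many more rows.

```python
import numpy as np


def padded_dispatch(x, expert_ids, gate, weights):
    """route -> permute -> pad -> batched_gemm -> unpad -> unpermute (top-1)."""
    num_experts, d_in, _ = weights.shape
    counts = np.bincount(expert_ids, minlength=num_experts)
    capacity = int(counts.max())  # every expert is padded to the busiest one
    padded = np.zeros((num_experts, capacity, d_in), dtype=x.dtype)
    slots = np.zeros(len(x), dtype=np.int64)
    fill = np.zeros(num_experts, dtype=np.int64)
    for t, e in enumerate(expert_ids):  # permute + pad
        slots[t] = fill[e]
        padded[e, fill[e]] = x[t]
        fill[e] += 1
    out_padded = padded @ weights       # batched GEMM, padding rows included
    out = out_padded[expert_ids, slots]  # unpad + unpermute
    return out * gate[:, None], num_experts * capacity


def jagged_dispatch(x, expert_ids, gate, weights):
    """route -> sort_by_expert -> jagged_gemm -> weighted_scatter (top-1)."""
    num_experts, _, d_out = weights.shape
    order = np.argsort(expert_ids, kind="stable")
    counts = np.bincount(expert_ids, minlength=num_experts)
    offsets = np.concatenate(([0], np.cumsum(counts)))
    out = np.zeros((len(x), d_out), dtype=x.dtype)
    for e in range(num_experts):  # one contiguous segment per expert, no padding
        rows = order[offsets[e]:offsets[e + 1]]
        if rows.size:
            out[rows] = (x[rows] @ weights[e]) * gate[rows, None]
    return out, len(x)


rng = np.random.default_rng(0)
tokens, d_in, d_out, num_experts = 16, 8, 4, 4
x = rng.standard_normal((tokens, d_in))
weights = rng.standard_normal((num_experts, d_in, d_out))
# Deliberately skewed routing: one hot expert and three cold ones.
expert_ids = np.array([0] * 10 + [1] * 3 + [2] * 2 + [3] * 1)
gate = rng.uniform(0.5, 1.0, size=tokens)

out_padded, rows_padded = padded_dispatch(x, expert_ids, gate, weights)
out_jagged, rows_jagged = jagged_dispatch(x, expert_ids, gate, weights)
assert np.allclose(out_padded, out_jagged)
print(f"GEMM rows: padded={rows_padded} jagged={rows_jagged}")  # padded=40 jagged=16
```

With one hot expert, the padded path multiplies 40 rows to produce 16 useful ones. The skew in this example is contrived, but it illustrates how load imbalance turns padding into wasted compute and memory traffic.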

This is also where EP and speed meet directly. Expert parallelism is not only a scaling feature. It changes the cost model of the whole step. Routing, sorting, combine, and cross-rank ownership are part of step time. That is why the combined TP + SP + EP + FSDP2 + compile lane matters for performance too. If the lane only barely works semantically, the throughput number will be meaningless. Once the lane is healthy, MoE optimization becomes one of the highest-payoff speed investments in the stack. The checked-in expert parallel routing sample and MoE loss collection sample are the fastest local readbacks when that routing and ownership surface is the real speed question.
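Cross-rank ownership becomes a step-time question as soon as routing is uneven. The sketch below uses the glossary's example geometry (64 experts, EP=8, so 8 experts per rank) and a made-up routing assignment; the helper names and the contiguous expert-to-rank mapping are assumptions for illustration.

```python
def tokens_per_rank(expert_ids, num_experts, ep_size):
    """Count tokens landing on each EP rank, assuming contiguous expert ownership."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    experts_per_rank = num_experts // ep_size
    load = [0] * ep_size
    for e in expert_ids:
        load[e // experts_per_rank] += 1
    return load


def imbalance(load):
    """Busiest rank relative to the mean; 1.0 means perfectly balanced."""
    return max(load) / (sum(load) / len(load))


# Made-up routing: expert 3 is hot, experts 8..63 each receive one token.
expert_ids = [3] * 50 + list(range(8, 64))
load = tokens_per_rank(expert_ids, num_experts=64, ep_size=8)
print(load)                        # [50, 8, 8, 8, 8, 8, 8, 8]
print(f"{imbalance(load):.2f}x")   # 3.77x
```

Because every rank waits for the slowest one at the combine all-to-all, the busiest rank sets the pace of the layer. That is why an imbalance ratio like this belongs in the same receipt as the throughput number.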

Public-facing architecture notes reinforce this. MoE, grouped GEMM, DSA, and MTP should stay explicitly visible in topology and throughput discussions. That is the correct shape: MoE is not a feature you mention in model prose and then ignore in speed accounting.

Some features are backend selectors, not direct kernels

Another important category is the backend bridge. These are features that may not themselves be a new fused kernel but still matter because they route work onto a better-maintained fast path. A backend bridge here means a dispatcher or compatibility seam that keeps model code stable while routing one narrow operation onto the healthiest vendor or upstream implementation. Operationally, that matters because one bridge can preserve the fast path across library upgrades, backend swaps, or fallback conditions without scattering backend-specific decisions through the model. In production engineering, that can be as valuable as writing a new kernel by hand.

Dispatcher boundaries serve that role. The production lesson is straightforward: centralize backend choice when possible. If a better vendor path exists for a narrow operation, the right architecture is often a disciplined dispatcher with solid fallback behavior, not hardwired backend-specific branches scattered across model code. That keeps the maintenance burden near the tradeoff surface in Compile-time vs runtime tradeoffs instead of smearing it across the whole model. The checked-in MLA integration pattern sample and index-cache patch nearcopy are compact proof surfaces for that bridge pattern. The adjacent runtime consequence is training speed anatomy on H200: once a backend bridge keeps the right fast path active, the next question is whether that path is dominant enough to move whole-step goodput.

This matters for speed because it changes the maintenance cost of staying fast. A dispatcher can inherit improvements from upstream backends. A custom path has to keep justifying itself against that moving baseline. It also changes what the receipt has to prove: requested backend is not enough, observed backend and runtime mode belong in the record too, which is exactly the line held by GPU profile receipt sample and FA4 receipt summary sample.
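The requested-versus-observed distinction is easy to lose unless the dispatcher itself reports it. The following is a minimal sketch of that pattern; the `BackendDispatcher` class, the backend names, and the toy operation are hypothetical and do not correspond to any real library API.

```python
from typing import Any, Callable, Dict, List, Tuple


class BackendDispatcher:
    """Hypothetical dispatcher: one seam that picks a backend and records what ran."""

    def __init__(self) -> None:
        self._backends: Dict[str, Tuple[Callable[[], bool], Callable[..., Any]]] = {}
        self._order: List[str] = []

    def register(self, name: str, is_available: Callable[[], bool],
                 fn: Callable[..., Any]) -> None:
        self._backends[name] = (is_available, fn)
        self._order.append(name)

    def run(self, requested: str, *args: Any):
        # Try the requested backend first, then fall back in registration order.
        candidates = [requested] + [n for n in self._order if n != requested]
        for name in candidates:
            entry = self._backends.get(name)
            if entry is not None and entry[0]():
                result = entry[1](*args)
                record = {"requested_backend": requested, "observed_backend": name}
                return result, record
        raise RuntimeError(f"no available backend for request {requested!r}")


def reference_scaled_add(x, y, scale):
    """Toy operation standing in for a narrow hot-path op."""
    return [a + scale * b for a, b in zip(x, y)]


dispatcher = BackendDispatcher()
# Pretend the vendor path is unavailable in this environment.
dispatcher.register("vendor_fused", lambda: False, reference_scaled_add)
dispatcher.register("reference", lambda: True, reference_scaled_add)

result, record = dispatcher.run("vendor_fused", [1.0, 2.0], [10.0, 20.0], 0.5)
print(result)  # [6.0, 12.0]
print(record)  # {'requested_backend': 'vendor_fused', 'observed_backend': 'reference'}
```

The returned record is what would be copied into a receipt. A run that silently fell back to the reference path then shows up as a mismatch between the two fields instead of as an unexplained slowdown.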

Feature taxes should be treated honestly

Not every feature is supposed to be a speed win. Some are architecture or quality features that impose extra compute, memory, or bookkeeping. The mistake is not that they exist. The mistake is pretending they are “free enough” without measurement.

In this bucket belong things like STP and other auxiliary-loss or metadata-heavy features. A feature tax here means a deliberate throughput, memory, or bookkeeping cost paid for architecture or quality reasons, not a hidden regression we pretend is free. These features may be good ideas. They are not baseline speed features. The correct question is not “are they elegant?” It is “what do they do to throughput, memory, and convergence-adjusted productivity?” That is why feature-tax arguments have to stay attached to Throughput vs quality knobs rather than float as architecture opinions.

| Feature family | Likely speed role | Default stance |
|---|---|---|
| Fused Mamba path | Direct hot-path speed win | Default on when the model uses it |
| Fused residual helpers | Repeated small wins that compound | Default on |
| Narrow MLA ingress fusion | Direct local win | Default on if validated |
| Broad MLA projection fusion | Conditional / needs repeated proof | Keep selective |
| Fused MoE dispatch and compute | Major throughput lever | Treat as core optimization |
| Backend bridge / dispatcher | Indirect speed enabler | Keep centralized |
| STP and similar aux features | Measured tax | Keep experimental |

That table is the practical decision surface for a production stack. The core rule is easy: hot-path wins can graduate to defaults; taxes must keep proving their value.

NAM56R is a good example of why feature accounting matters

The NAM56R family is a good illustration because it concentrates several feature families in one model description: hybrid block patterns, MoE, DSA, MTP, and hardware-specific throughput claims. Public recipe samples preserve the pattern layer, while public status notes keep measured configurations and throughput on H200-class systems visible. That is also why H200 bringup and naming matters to speed work: a fuzzy lane label produces a fuzzy throughput claim.

That means “training speed by feature” cannot be separated from “training speed by model shape.” A feature that is minor on a dense lane can become a first-order concern on a hybrid AEMEAEMEAEMR lane. A DSA optimization that matters when full and shared layers interleave may be irrelevant on a simpler topology. The only sane answer is to keep exact model naming, exact topology, and exact feature state in the same measurement record.

An index-cache optimization is a good example of a feature whose value depends on shape. The patch exists because adjacent DSA layers share enough top-k structure that recomputing indexer work every time is wasteful. That is not a universal speed truth. It is a shape- and architecture-aware optimization. But when the pattern fits, it is exactly the kind of repeated overhead reduction worth keeping.
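The reuse idea can be sketched without reference to the actual patch. In the toy model below, layers tagged "full" run the indexer and layers tagged "shared" reuse the most recent cached indices. The layer tags follow the article's wording, but the reuse policy, the function names, and the random scores are assumptions for illustration, not a description of the index-cache patch.

```python
import numpy as np


def topk_indices(scores, k):
    """Indices of the k largest scores per row, sorted for a stable layout."""
    top = np.argpartition(-scores, k - 1, axis=-1)[..., :k]
    return np.sort(top, axis=-1)


def run_layers(layer_kinds, scores_per_layer, k, reuse):
    """Return the indices each layer uses and how many indexer calls were made."""
    cache = None
    indexer_calls = 0
    selected = []
    for kind, scores in zip(layer_kinds, scores_per_layer):
        if kind == "full" or not reuse or cache is None:
            cache = topk_indices(scores, k)  # the expensive indexer work
            indexer_calls += 1
        selected.append(cache)               # shared layers reuse the cached indices
    return selected, indexer_calls


rng = np.random.default_rng(0)
layer_kinds = ["full", "shared", "shared", "full", "shared"]
scores_per_layer = [rng.standard_normal((4, 32)) for _ in layer_kinds]

sel_reuse, calls_reuse = run_layers(layer_kinds, scores_per_layer, k=8, reuse=True)
sel_naive, calls_naive = run_layers(layer_kinds, scores_per_layer, k=8, reuse=False)
print(calls_reuse, calls_naive)  # 2 5
```

Reuse is an approximation: a shared layer attends over indices chosen from the preceding full layer's scores, not its own. Whether that is acceptable depends on how much top-k structure adjacent layers really share, which is why the saving is shape-dependent rather than universal.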

The same goes for a streamlined MTP layer. It is not just “MTP exists.” It is a narrow design that bypasses a more complex path and avoids some SP/TP workspace burden in the shared block. That is a speed-relevant implementation choice, not just a feature flag, and it is part of why speculative decoding inside an eight-specialist ensemble can reuse the same MTP surface instead of adding a separate drafter model. The checked-in NAM56R NeMo recipe sample, NAM56R pattern composition sample, MLA integration pattern sample, index-cache patch nearcopy, and MTP shared-block sample are the compact local proof surfaces for those shape-specific claims.

How to decide what graduates into production

The production rule should be conservative and repeatable.

  1. Keep features that remove repeated work from the inner training loop.
  2. Prefer narrow, validated fusion over giant fused abstractions that may age badly.
  3. Centralize backend selection rather than scattering backend-specific logic.
  4. Treat quality or architecture features as opt-in taxes until repeated measurements say otherwise.
  5. Record every meaningful ablation with exact model names, topology, and observability output.

That rule sounds procedural because it is. Most bad speed decisions happen when a team skips procedure and promotes a feature because it sounds important.
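The rules above can be turned into a small gate over matched receipts. This is a hedged sketch: the receipt fields, the `should_graduate` helper, and the thresholds (three matched runs, a 2% median speedup, no regressing run) are hypothetical choices for illustration, not the stack's actual promotion criteria.

```python
from statistics import median

# Context every receipt must carry before it counts as evidence (hypothetical names).
REQUIRED_CONTEXT = ("model_name", "topology", "features")


def should_graduate(pairs, min_runs=3, min_speedup=1.02):
    """Decide whether a feature graduates to a default.

    pairs: list of (baseline_receipt, candidate_receipt) from matched runs.
    """
    if len(pairs) < min_runs:
        return False  # one lucky run is not enough
    speedups = []
    for base, cand in pairs:
        for receipt in (base, cand):
            if any(key not in receipt for key in REQUIRED_CONTEXT):
                return False  # incomplete ablations are not evidence
        if (base["model_name"], base["topology"]) != (cand["model_name"], cand["topology"]):
            return False      # runs must be matched on model and topology
        speedups.append(base["step_seconds"] / cand["step_seconds"])
    return median(speedups) >= min_speedup and min(speedups) >= 1.0


def receipt(step_seconds, fused_moe):
    """Build a placeholder receipt; the values are illustrative, not measurements."""
    return {
        "model_name": "example-hybrid",
        "topology": "example-topology",
        "features": {"fused_moe": fused_moe},
        "step_seconds": step_seconds,
    }


pairs = [
    (receipt(1.00, False), receipt(0.94, True)),
    (receipt(1.00, False), receipt(0.95, True)),
    (receipt(1.00, False), receipt(0.97, True)),
]
print(should_graduate(pairs))      # True
print(should_graduate(pairs[:2]))  # False
```

The gate is deliberately boring. It encodes the procedure so that a feature cannot be promoted on the strength of a single trace or an unlabeled lane.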

speed_defaults:
  fused_mamba: true
  fused_residual: true
  fused_mla_ingress: true
  fused_moe: true
  backend_dispatch: auto
experimental_taxes:
  stp: false
  extra_aux_losses: false
observability:
  goodput: true
  temporal_perf: true
  measurements: true

That config is illustrative. It captures the right posture: enable the repeated hot-path wins, keep dispatcher logic on, and force experimental taxes to justify themselves.

The durable lesson

The durable lesson is that speed is rarely improved by a single giant trick. It is improved by repeatedly removing unnecessary work from the surfaces the model hits every step, then preserving enough measurement context that the gain can survive handoff and re-testing. If the next question is how those per-feature wins aggregate into one H200 lane, training speed anatomy on H200 is the direct continuation.

That is why the local measurement layer matters as much as the kernels. Without it, the team cannot tell whether a feature is a real acceleration, a small tax with quality upside, or just a one-run illusion.

FAQ

Frequently asked questions

Which feature family usually matters most for step time?
MoE cleanup, because the routing, sorting, padding, and scatter work around expert compute can dominate the useful math if left unfused. Expert parallel routing sample and MoE loss collection sample are the quickest checked-in proofs that the expert path is more than one GEMM.

Why are backend bridges treated as speed features at all?
Because routing work onto a healthier vendor or upstream fast path can be as valuable as writing a new custom kernel, while being cheaper to maintain. The checked-in GPU profile receipt sample and FA4 receipt summary sample are the shortest proof surfaces because they record requested and observed fast-path truth instead of stopping at a generic "dispatcher" slogan.

Where do goodput and badput fit into per-feature decisions?
They keep feature discussions honest. If a new fusion wins one operator but compile, checkpoint, or idle badput grows enough to erase the gain, the feature did not improve the lane that operators actually run. The checked-in goodput tracker sample, measured optimization receipts, and compile/runtime receipt sample are the narrow proof surfaces; training speed anatomy on H200 is the lane-level companion.

Is observability itself a feature tax?
Yes, but usually a small and worthwhile one. Counters, receipts, and summary records do consume some time and memory, yet that tax is easier to budget than the repeated misreads that happen when teams compare features without matched goodput and badput receipts.

Why are STP and similar objectives called taxes here?
Because they should be judged as opt-in costs that may buy quality, not as presumed accelerators. They need explicit measurement to justify their place. Measured optimization receipts is the local model for "show the tax in a receipt," and Observability and the three dashboards is the next article when the tax needs to stay visible across many runs instead of one ablation.

When does the DSA index-cache reuse seam actually matter for speed?
When adjacent DSA layers reuse enough top-k structure that rebuilding the same indexer state each time becomes visible in step time. That makes it a shape-dependent hot-path cleanup, not a universal rule: on a simpler topology it may barely move the lane, while on a NAM56R-style interleaved pattern it is exactly the kind of repeated overhead reduction worth keeping. The quickest local readbacks are index-cache patch nearcopy, NAM56R pattern composition sample, and DSA CUDA graph safety deep dive.

Which local files make the MoE speed claim concrete?
Start with expert parallel routing sample and MoE loss collection sample. One keeps routing and ownership visible, the other keeps the extra MoE-side bookkeeping visible, so “MoE speed” does not collapse into “one big GEMM got faster.”

What counts as a hot path in this article?
A hot path is any code surface that runs every step, over large enough tensors, often enough that shaving small overhead there changes end-to-end step time. Expert parallel routing sample, MTP shared-block sample, and MoE dispatch fast-path sample are concrete local examples of surfaces that stay hot enough to matter.

Why prefer narrow fusion over giant fused abstractions?
Because narrow fusion usually targets a repeated bottleneck with bounded maintenance cost. Bigger fused surfaces can still win, but they have to keep proving that the extra complexity survives changing compilers, libraries, and model shapes.

Which checked-in files ground the measurement side most directly?
Start with goodput tracker sample for wall-time accounting, measured optimization receipts and GPU profile receipt sample for matched deltas, distributed example index for the feature families that stay visible in training receipts, and distributed debugging notes for the "one narrow receipt per failure family" rule.

Which files separate throughput knobs from traces most cleanly?
Use compile/runtime receipt sample and MoE dispatch fast-path sample for the knob side, GPU profile receipt sample for the matched-trace side, and Observability and the three dashboards when the question becomes how those per-run surfaces roll up over time.

Where should I go next if I want the H200 lane-level view instead of feature slices?
Read training speed anatomy on H200 for lane structure first, then H200 bringup and naming if you need the naming and topology contract behind those measurements.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

EP

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

MoE

Mixture of Experts. The linked posts cover Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

DSA

DeepSeek Sparse Attention: a sparse-attention lane where routing and masking logic must stay faithful to the score path without breaking runtime constraints such as CUDA graph capture.

AEMEAEMEAEMR

A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.

TP

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

SP

Sequence parallelism is a TP-region activation saver — not a separate mesh. Plain TP leaves layernorm / dropout / residual activations replicated on every TP GPU; SP keeps those intermediates sharded along the sequence axis so each TP GPU holds only 1/TP of them. Cost: same bandwidth as plain TP — the single all-reduce becomes an all-gather + reduce-scatter pair. Weights identical to plain TP; only the activation tensors shrink. Turn on whenever TP is on — near-free memory savings, which is what makes long contexts fit under TP.

CUDA Graphs

CUDA's capture-and-replay execution model, where hidden host sync points or Python-side branching break an otherwise valid GPU work graph.

FSDP2

PyTorch's Fully Sharded Data Parallel v2 wrapper API. On CUDA it shards parameters, gradients, and optimizer state across the data-parallel group; in the TPU/XLA posts here it is usually a memory-goal analogy, not the actual eager wrapper mechanism.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
