MegaCpp Engineering • Applied C++ model systems

Article

Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 12 min read • David Gornshtein

Triton

Kernels

Performance

MoE

MLA

Mamba3

Kernels that pay for themselves

Which custom kernels and fused paths in MegaCpp are worth their maintenance cost, which ones are borderline, and which ones belong behind a fallback or in experiments.

By David Gornshtein

MegaCpp

Focused on applied C++ model engineering

Every custom kernel is a liability until it proves otherwise. It ties you to compiler behavior, backend quirks, memory-layout assumptions, and a testing burden that plain PyTorch code does not have. The only honest reason to keep one is that it pays for itself. In MegaCpp, a handful of kernels clearly do. Others are still good experiments, but they do not deserve unconditional residency in MegaCpp.

Why MegaCpp cares

MegaCpp is downstream of an active research and bring-up process, not a blank slate. That means the real question is not “can we write a fused kernel for this?” but “should this fused path survive contact with a long-lived product codebase?” If a kernel only wins in a narrow benchmark and loses on maintainability, it should stay in experiments. If it removes a persistent bottleneck across real training runs, it belongs in the platform.

MegaCpp exposes the right evidence surfaces for this decision. Goodput tracking measures useful training time against wall time. Time-series reporting measures throughput and memory over time. Stable report schemas make ablations comparable. The rest of the answer comes from the public fused-path implementations and associated articles such as Fused MoE and DeepEP on NVIDIA: what actually shipped, Mamba3 kernel journey, and Fused MLA on Hopper and Blackwell: projection, RoPE, and the KV cache that ships.
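As a rough illustration of the goodput idea, the sketch below computes useful training time as a fraction of wall time. The function name, the badput category names, and all the numbers are hypothetical; this is not the MegaCpp tracker, only the arithmetic it implies.

```python
def goodput(wall_seconds, badput_seconds):
    """Fraction of wall time spent on useful training work.

    badput_seconds maps a category name (hypothetical, e.g. "compile",
    "idle") to the seconds lost to it.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    lost = sum(badput_seconds.values())
    if lost > wall_seconds:
        raise ValueError("badput cannot exceed wall time")
    return (wall_seconds - lost) / wall_seconds


# Hypothetical runs doing the same amount of training work.
# The fused run saves 40 s of step time but adds 70 s of compile churn.
baseline = goodput(1000.0, {"compile": 20.0, "idle": 30.0})
fused = goodput(1030.0, {"compile": 90.0, "idle": 30.0})
```

With these made-up numbers the fused run has faster steps yet a lower goodput and a longer wall time, which is the kind of result a kernel-only benchmark would hide.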

When you need the checked-in example map behind those claims, use the MegaCpp model wiring examples for NAM56R, GB10, TileLang, and MLA wiring anchors, then jump to the sibling kernel example catalog for the lower-level FA4, fused RoPE, residual, and mHC receipts. Here mHC means multi-stream hidden-state mixing and hyper-connections, the cross-layer path covered in mHC fused on Blackwell.

What we built in MegaCpp

The surviving kernel set falls into three buckets: definitely pays for itself, conditionally worth keeping, and not worth a permanent product dependency.

Surface	Files	Why it exists	Keep level
Fused MoE path	the fused MoE kernel module	Replace a multi-stage route/permute/pad/GEMM/unpermute pipeline	Keep
Mamba fused update path	fused Mamba-related modules and public MegaCpp articles	Remove repeated state-update overhead from a hot recurrent lane	Keep
Fused residual and mHC helpers	Public residual and bias-dropout-add helpers	Collapse repeated elementwise work around every block	Keep
Fused RoPE Q+K	Public Triton RoPE kernels	Share loads and reduce launches on a hot attention-adjacent path	Keep
MLA fused RoPE	Public MLA rotary kernels	Narrow MLA-specific hot-path win	Keep
MLA fused projection	Public MLA projection kernels	Reduce projection-stage traffic with custom autograd	Borderline
Custom backend dispatch wrappers	Public backend dispatch layer	Choose vendor/library fast paths with stable fallbacks	Keep
Small one-off fusions without repeated hot-path evidence	various experiments	Save a little work without moving step time enough	Do not keep by default

The best example of a kernel paying for itself is the fused MoE kernel module. The file is explicit about what it replaces. The standard path has distinct routing, permutation, padding, grouped compute, unpadding, and weighted recombination stages. The fused path reduces that to a much tighter route-sort-compute-scatter flow and keeps multiple implementations available depending on backend support. That is the model of a justified complex kernel: a large avoided memory-movement bill, a fallback path for correctness, and a hot enough usage pattern that the savings recur throughout training.
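To make the route-sort-compute-scatter shape concrete, here is a minimal pure-Python sketch for top-1 routing. It is not the MegaCpp kernel: the function name, the scalar "tokens", and the experts-as-callables are all assumptions made for illustration. The point is only that sorting by expert gives each expert one contiguous group, so no padding stage is needed.

```python
def moe_route_sort_compute_scatter(tokens, expert_ids, gate_weights, experts):
    """Top-1 MoE forward: sort by expert, run each group once, scatter back.

    tokens:       list of scalar activations (stand-ins for token vectors)
    expert_ids:   chosen expert index per token (routing already done)
    gate_weights: gate weight per token
    experts:      list of callables, each mapping a list to a list
    """
    # Sort: token indices ordered by expert, so groups are contiguous.
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    out = [0.0] * len(tokens)
    start = 0
    while start < len(order):
        expert = expert_ids[order[start]]
        end = start
        while end < len(order) and expert_ids[order[end]] == expert:
            end += 1
        # Compute: one call per expert over exactly its own tokens.
        group = [tokens[i] for i in order[start:end]]
        results = experts[expert](group)
        # Scatter: weighted write back to the original token positions.
        for i, value in zip(order[start:end], results):
            out[i] = gate_weights[i] * value
        start = end
    return out


# Hypothetical experts: one doubles its input, the other adds one.
demo_experts = [lambda xs: [2.0 * x for x in xs], lambda xs: [x + 1.0 for x in xs]]
demo_out = moe_route_sort_compute_scatter(
    [1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0], [0.5, 1.0, 1.0, 0.25], demo_experts
)
```

A real fused kernel does the same bookkeeping on device with grouped GEMMs; the sketch only shows why the separate pad and unpad stages disappear.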

The Mamba fused path belongs in the same top tier. Related MegaCpp posts describe a fused trapezoidal update replacing chains of small elementwise work around state-space recurrence. That kind of kernel usually pays for itself because recurrent layers amplify launch overhead: a “small” inefficiency repeated across blocks and timesteps becomes a real budget item. MegaCpp should keep that class of optimization whenever the hybrid architecture still relies on those layers, as in Mamba3 kernel journey and Mamba3 fused trapezoidal on TPU.

The public fused residual helper is a more subtle but still convincing keeper. The file contains fused residual-add, residual-scale-add, lerp-style mixing, and branch-composition helpers. The comments in the file point to profiler evidence where the unfused path consumed a meaningful fraction of the step. That is exactly the threshold we want. Residual math is boring, but it happens everywhere. Boring repeated work is often where fusion pays for itself fastest.

The public bias-dropout-add helper is also a good lesson. It is not a giant custom Triton kernel; it leans on @torch.compile to let bias, dropout, and residual addition become a visible optimization unit. That still counts as a fused path worth keeping. MegaCpp should prefer this kind of low-drama, compiler-visible fusion over bespoke kernels when the result is comparable.

The public Triton kernel pack contains another honest keeper: fused RoPE Q+K. Applying RoPE to Q and K in one launch with shared trig loads is the kind of narrow improvement that is easy to underestimate. But it sits on a hot path, has a clear semantic contract, and does not drag a giant maintenance surface behind it. That is a textbook “small but worth it” kernel.
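The "shared trig loads" idea can be sketched without a GPU. The pure-Python function below rotates Q and K rows in one pass and computes each cos/sin pair once for both. It is an illustration, not the Triton kernel: the function name, the interleaved (even, odd) pairing, and the base of 10000 are assumptions.

```python
import math


def rope_qk_shared_trig(q, k, positions, base=10000.0):
    """Apply rotary embedding to Q and K rows, sharing cos/sin per (pos, pair).

    q, k:      lists of rows; each row is a list of floats of even length
    positions: one integer position per row
    """
    dim = len(q[0])
    if dim % 2 != 0:
        raise ValueError("head dimension must be even")
    q_out, k_out = [], []
    for q_row, k_row, pos in zip(q, k, positions):
        q_new, k_new = list(q_row), list(k_row)
        for j in range(dim // 2):
            theta = pos * base ** (-2.0 * j / dim)
            # Computed once, then reused for both the Q pair and the K pair.
            c, s = math.cos(theta), math.sin(theta)
            for src, dst in ((q_row, q_new), (k_row, k_new)):
                x, y = src[2 * j], src[2 * j + 1]
                dst[2 * j] = x * c - y * s
                dst[2 * j + 1] = x * s + y * c
        q_out.append(q_new)
        k_out.append(k_new)
    return q_out, k_out
```

Because each pair is a plain 2D rotation, the dot product between a rotated query and a rotated key depends only on the difference of their positions, which is the contract a fused kernel has to preserve.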

There is also a useful negative lesson in the same neighborhood. Some kernels look attractive because they are mathematically neat or locally elegant, but they do not remove enough whole-path work. A path that turns three tiny elementwise ops into one kernel is not automatically a product win. If it does not sit on a repeated hot path, if torch.compile could have handled it anyway, or if the resulting code becomes brittle across backends, it does not pay for itself. MegaCpp is most valuable where it has already internalized that distinction.

The stricter filter from the run reports is that permanent kernels usually earn their keep only when they remove a real HBM round trip, a padding-heavy staging phase, or a graph-breaking seam the compiler cannot legally fuse away. If the same intermediates still spill to memory and torch.compile can already see the same contraction, the custom path is competing mostly on elegance. Those are the kernels that age badly once compiler passes or vendor libraries catch up.

The MLA story is split. The public MLA fused-RoPE kernel is easy to justify: it encodes MLA-specific query and key/value rotary handling in one narrow Triton path and avoids extra staging work. The checked-in Fused MLA projection sample is harder. Its goal is good: fuse down-projection, normalization, and up-projection while recomputing intermediates on backward. But that kind of custom autograd fusion is exactly the area where vendor libraries and compiler improvements can catch up quickly. If the win shrinks, the maintenance cost dominates. So the right conclusion is not “delete it,” but “treat it as conditional,” which is easier to interpret alongside Fused MLA on Hopper and Blackwell: projection, RoPE, and the KV cache that ships and The FA4 Catalog on Blackwell: Variants, sm Guards, and Runtime Selection.

That conditional label matters because the stronger keepers here survive run-level comparisons, not only isolated kernel benches. Fused MoE, the Mamba update path, and the repeated loss or dispatch seams keep paying back their maintenance cost after compile churn, fallback behavior, and backend drift are counted. Larger MLA projection fusion still has to keep re-earning that status, which is why the article keeps it in the experimental bucket instead of promoting it by default.

The backend dispatch layer deserves mention because a kernel can pay for itself indirectly. The dispatch layer routes cross-entropy and normalization to the best available backend, with stable fallbacks for other devices and modes. A solid dispatch surface can be more valuable than yet another custom kernel because it lets the product adopt fast vendor code without scattering backend logic through the model. That is a complexity-saving mechanism, and complexity saved is part of the payoff equation.

That is especially visible in the loss path. The dispatch layer contains several cross-entropy variants: current plain execution, Liger-routed execution, CCE-backed execution, chunked execution, and row-sharded vocab-parallel handling. The point is not that every one of those is a “kernel we keep.” The point is that the dispatch layer protects the model from having to know which exact fast path is valid on which lane. In production, this can save more engineering time than squeezing one extra micro-optimization into a custom Triton file.
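A dispatch seam of this kind can be sketched as an ordered registry with a cached selector. Everything below is hypothetical: the backend names, the `(device, mode)` key, and the "fast" function (which here just reuses the plain math) stand in for whatever the real dispatch layer checks and calls.

```python
import math
from functools import lru_cache


def _plain_cross_entropy(logits, target):
    """Reference loss for one row: logsumexp(logits) - logits[target]."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits)) - logits[target]


def _fast_cross_entropy(logits, target):
    """Placeholder for a vendor or fused path; same math in this sketch."""
    return _plain_cross_entropy(logits, target)


# Priority-ordered registry: (name, supports(device, mode), implementation).
# The last entry accepts everything, so selection always has a fallback.
_REGISTRY = [
    ("fast", lambda device, mode: device == "cuda" and mode == "train", _fast_cross_entropy),
    ("plain", lambda device, mode: True, _plain_cross_entropy),
]


@lru_cache(maxsize=None)
def select_cross_entropy(device, mode):
    """Return (name, fn) for the first backend that supports this lane."""
    for name, supports, fn in _REGISTRY:
        if supports(device, mode):
            return name, fn
    raise RuntimeError("no cross-entropy backend registered")
```

The model code calls `select_cross_entropy(device, mode)` and never names a backend itself; the cache keeps the per-step cost to a dictionary lookup.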

The right comparison there is not Python dispatch overhead in isolation. The dispatch seam pays for itself when it keeps the model on the backend that avoids whole-path costs: materializing a larger loss tensor than necessary, launching a chain of smaller fallback kernels, or giving up a fused path that was saving HBM traffic on every step. If the lane-level alternative is extra materialization, extra launches, or extra bandwidth, the microsecond-scale routing cost is not the interesting number.

That is also why the dispatch layer belongs in the same keep discussion as the kernels themselves. A cached Python selector can cost microseconds; the backend it selects can avoid gigabytes of temporary logits or keep a loss path chunked enough to stay inside memory. If the chosen backend prevents a whole materialization or HBM round trip, the selector overhead is effectively noise compared with the path it prevented.
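One way a loss path avoids the full logits tensor is to walk the vocabulary in chunks and keep a running log-sum-exp. The sketch below shows that technique in pure Python for a single token. It is not the Liger or CCE implementation; the function name, argument layout, and chunk size are assumptions.

```python
import math


def chunked_cross_entropy(hidden, weight_rows, target, chunk=2):
    """Cross-entropy for one token without holding all logits at once.

    hidden:      list of floats (the final hidden state)
    weight_rows: one list of floats per vocabulary entry
    target:      index of the correct vocabulary entry
    chunk:       how many vocabulary rows to score per step
    """
    running_max = -math.inf
    running_sum = 0.0
    target_logit = None
    for start in range(0, len(weight_rows), chunk):
        rows = weight_rows[start:start + chunk]
        # Only this chunk's logits exist at any moment.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in rows]
        if start <= target < start + len(rows):
            target_logit = logits[target - start]
        new_max = max(running_max, max(logits))
        # Rescale the old sum to the new max, then add this chunk.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in logits
        )
        running_max = new_max
    if target_logit is None:
        raise IndexError("target is outside the vocabulary")
    return running_max + math.log(running_sum) - target_logit
```

Peak temporary storage is one chunk of logits rather than the whole vocabulary, which is the materialization the dispatch seam is protecting against.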

Another sign that a kernel pays for itself is when the implementation contains explicit fallback reasoning rather than just a fast path. MegaCpp is strongest where a fused path can still fall back cleanly to a simpler implementation when hardware, layout, or compiler conditions change. This is not accidental defensive programming. It is part of the kernel’s economic case. A fused path that cannot degrade gracefully is much more expensive to carry.

How it lands in MegaCpp

MegaCpp should keep the kernels and fused paths that meet four conditions.

They operate on a hot path that repeats throughout training.
They remove whole stages of movement or launch overhead, not just one micro-op.
They have a clean fallback path.
They can be judged with goodput and structured report data, not just microbenchmarks.
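The four conditions above can be written down as a small checklist object, so a keep decision reports exactly which condition failed. This is only a sketch; the class and field names are hypothetical and not part of MegaCpp.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class KernelEvidence:
    """One boolean per keep condition (field names are illustrative)."""
    repeated_hot_path: bool
    removes_whole_stage: bool
    clean_fallback: bool
    run_level_evidence: bool


def failed_conditions(evidence):
    """Names of the conditions this kernel does not meet, in order."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


def meets_keep_bar(evidence):
    """A kernel is a default keeper only if every condition holds."""
    return not failed_conditions(evidence)
```

Returning the failed names rather than a bare boolean keeps the review conversation about the missing evidence instead of about taste.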

By that standard, the default keep set is straightforward: fused MoE, Mamba fused update, fused residual and mHC helpers, fused RoPE Q+K, MLA fused RoPE, and backend dispatch layers.

The conditional set is also straightforward: larger MLA projection fusion and any custom path whose advantage depends too heavily on a specific nightly or backend version.

The product rule should look something like this:

[fast_paths]
fused_moe = true
mamba_fused_update = true
fused_residuals = true
fused_rope_qk = true
mla_fused_rope = true
mla_fused_projection = "experimental"
backend_dispatch = "required"

This is not a literal production config. It is the operational contract extracted from the public implementation pattern.
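One possible reading of those three value kinds (`true`, `"experimental"`, `"required"`) is sketched below. The semantics are an assumption made for illustration: `true` means on by default but may be switched off, `"experimental"` means off unless explicitly opted in, and `"required"` means it cannot be switched off. The function and argument names are hypothetical.

```python
# Mirrors the illustrative [fast_paths] table as a plain dict.
FAST_PATHS = {
    "fused_moe": True,
    "mamba_fused_update": True,
    "fused_residuals": True,
    "fused_rope_qk": True,
    "mla_fused_rope": True,
    "mla_fused_projection": "experimental",
    "backend_dispatch": "required",
}


def is_enabled(policy, name, allow_experimental=False, disabled=frozenset()):
    """Resolve one fast path under the assumed semantics described above."""
    value = policy[name]
    if value == "required":
        if name in disabled:
            raise ValueError(f"{name} is required and cannot be disabled")
        return True
    if value == "experimental":
        return allow_experimental and name not in disabled
    if value is True:
        return name not in disabled
    return False
```

Under this reading, a default run uses every keeper and the dispatch layer, and the projection fusion only turns on when a lane opts in.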

There is also a governance implication for MegaCpp. The kernel catalog should not grow by default. New fused work should have to displace an existing bottleneck or remove a class of backend pain. If a proposal only says “this benchmark kernel is faster,” that is insufficient. It should also explain why a compiler-visible fusion, a vendor path, or a dispatch-layer solution is not enough. The product should be biased toward fewer custom kernels, not more.

The checked-in sample tree also makes the surviving pattern narrower than "any fast kernel." The keepers are mostly boundary seams: paired K/V row gather, fused RoPE ingress, the loss-path dispatch helpers, and the fused MoE route-sort-compute collapse all remove staging, graph-break, or materialization costs the compiler still does not absorb cleanly. That is why The Triton kernels we actually maintain in-tree, Triton row gather pair sample, and Fused RoPE QK sample are better neighbors for this post than another dense-GEMM microbenchmark.

Ablations and what we kept

MegaCpp already exposes the ingredients for a keep-or-drop policy that is better than taste.

First, goodput accounting separates useful training work from badput categories such as compilation and idle time. That keeps kernel conversations honest. A path that lowers one kernel’s local runtime but increases compile churn or synchronization waste can still lose at the run level.

Second, a time-series performance report should record step history, tokens, revisions, and peak memory. That makes it possible to see whether a fused path improves training stability over time or just front-loads wins into a short benchmark.

Third, a stable report schema turns ablations into structured comparisons. If a fused path matters, it should survive comparison under a stable schema. If it does not, then the team is keeping it on faith.
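A schema-checked comparison can be very small. The sketch below validates two run reports against a fixed field set before computing deltas, so an ablation cannot silently compare mismatched reports. The field names and types are hypothetical, chosen to echo the quantities mentioned above (revision, throughput, peak memory, goodput); they are not the MegaCpp schema.

```python
# Hypothetical report schema: field name -> required Python type.
REPORT_SCHEMA = {
    "revision": str,
    "tokens_per_second": float,
    "peak_memory_bytes": int,
    "goodput": float,
}


def validate_report(report):
    """Raise ValueError unless the report matches the schema exactly."""
    missing = sorted(set(REPORT_SCHEMA) - set(report))
    extra = sorted(set(report) - set(REPORT_SCHEMA))
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing} extra={extra}")
    for name, expected in REPORT_SCHEMA.items():
        if not isinstance(report[name], expected):
            raise ValueError(f"{name} must be {expected.__name__}")


def compare_reports(baseline, candidate):
    """Return candidate-minus-baseline deltas for every numeric field."""
    validate_report(baseline)
    validate_report(candidate)
    return {
        name: candidate[name] - baseline[name]
        for name, expected in REPORT_SCHEMA.items()
        if expected in (int, float)
    }
```

Rejecting a malformed report outright is deliberate: a keep/drop decision made on a partial comparison is the "keeping it on faith" case the paragraph above warns about.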

This matters because “kernel value” is often nonlinear. A path can look neutral in a microbenchmark and still pay for itself in a real run by stabilizing memory use, avoiding padding blowups, or preserving compile-friendly graphs. It can also look great in isolation and lose at the run level because it adds compile churn or makes fallback behavior worse. Without structured reports, teams tend to remember only the impressive local benchmark and forget the operational bill.

That is also why this article belongs next to regional compile without losing the plot: a fused path that looks elegant in isolation can still be the wrong keep decision once compile churn and lane stability enter the cost model.

RoPE is the cleanest example of why synthetic compile benches can lie. Short, cache-friendly loops can flatter a stack of compiled micro-kernels because the GPU's local caches are doing too much of the work, while the real long-context lane pushes both the compiled path and the fused Triton path back toward the same HBM ceiling. That is why fused RoPE Q+K stays in the keep set on run-level evidence and portability, not because one tiny benchmark happened to look pretty on a warm cache.

What this suggests for MegaCpp is simple:

Keep fused MoE because it removes whole categories of padding and dispatch overhead.
Keep the Mamba fused path because recurrent-state math punishes unfused execution.
Keep fused residual and mHC helpers because they save work on every block and comments in the file already connect them to profiler pressure.
Keep fused RoPE Q+K and MLA fused RoPE because they are narrow, understandable, and hot.
Keep backend dispatch surfaces because they reduce product complexity while preserving access to fast backends.
Treat larger MLA projection fusion and similar custom autograd stacks as experiments until they consistently beat baseline libraries in real training reports.

One more practical filter is readability of the contract. fused_rope_qk and MLA RoPE are easy to describe in one sentence. So are fused residual helpers and the MoE stage collapse. When a kernel cannot be explained succinctly in terms of which whole-path costs it removes, that is often a warning sign that the complexity is outrunning the payoff. MegaCpp should be skeptical of any kernel whose justification depends on a long chain of caveats.

Production checklist

Require every kept kernel to have an explicit fallback path.
Judge kernels by run-level goodput, not microbenchmarks alone.
Use a stable before/after performance report for every keep/drop comparison.
Require schema-checked reports for any keep/drop decision.
Prefer compiler-visible fusion patterns when they deliver comparable wins.
Re-review conditional kernels whenever backend libraries improve.

Frequently asked questions

What does it mean for a kernel to “pay for itself”?

It means the kernel improves run-level goodput, stability, or both enough to justify its maintenance cost. A microbenchmark win by itself is not enough.

Why keep dispatch layers in the same article as fused kernels?

Because dispatch logic can save more engineering time than one extra custom kernel by routing to the best backend while preserving stable fallbacks and keeping backend-specific logic out of the model code.

How can a dispatch layer pay for itself if it is not itself a kernel?

Because it can preserve the real win. If the dispatch seam keeps the model on the fused or sharded backend that avoids extra tensor materialization, extra kernel launches, or extra bandwidth on the hot lane, then that is the payoff. The dispatch code is only the selector; the important comparison is the whole execution path it prevents.

Why does the loss path get the same keep treatment as custom kernels?

Because large-vocabulary training can turn the final projection plus cross-entropy into a giant temporary logits tensor. If a Liger-style or Cut Cross Entropy-style path computes that loss without materializing the full logits matrix, the product win is not a prettier kernel; it is avoiding a memory bill that can decide whether the run fits at all.

Why are some MLA kernels treated as conditional keepers?

Because custom autograd fusion is exactly the area where vendor libraries and compiler improvements can erase the original advantage, turning a one-time win into long-term maintenance drag.

Why is MLA fused projection treated more cautiously than MLA fused RoPE?

Because MLA fused RoPE is a narrow hot-path contraction with a small semantic surface, while fused projection has to carry a larger custom-autograd and recompute contract. That makes the projection path more sensitive to compiler changes, backend-library catch-up, and fallback complexity, so it has to keep earning its place in run-level reports instead of graduating on microbenchmarks alone.

What actually graduates a borderline kernel into the keep set?

Run-level evidence that survives the real lane: repeatable goodput improvement, stable fallbacks, no new compile churn, and a contract simple enough that another engineer can still explain why the kernel exists.

What would move MLA fused projection from "experiment" to "drop"?

A compiled baseline that matches its long-context memory savings within a small margin and keeps the wider graph faster. Once the win only survives in one awkward edge case, or only after paying for a graph-breaking custom autograd seam, the recompute-heavy MLA projection path stops paying rent.

What is the fastest reason to demote a kernel from the keep set?

A backend or compiler upgrade that removes the run-level win. If a vendor path or compiler-visible fusion now matches the performance and the custom path mostly pays maintenance cost, the kernel stopped paying for itself.

Why can a short compile benchmark still underrate a kept kernel?

Because some synthetic loops flatter a stack of compiled micro-kernels with warm-cache behavior that does not survive the real lane. RoPE is the cleanest example: a short cache-friendly bench can make a compiled path look closer than it really is, while the long-context run pushes both paths back to the same HBM ceiling and makes the fused path's lower movement bill visible again. That is why the keep/drop rule here stays anchored to run-level receipts and Profiler-guided optimization, not to one pretty short-loop number.

Does a kept fast path mean the fallback path can be deleted?

No. A fast path only earns default status if the simpler fallback still exists for the cases it does not honestly cover. Attention-heavy lanes are the clearest example: if a vendor backend drops backward coverage, grouped-query support, variable-length support, or one hardware lane, the product still needs a correct SDPA, cuDNN, or simpler kernel route behind it. That is the same contract kept explicit in Flash Attention 4 in practice and The FA4 Catalog on Blackwell.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.
