Published April 18, 2026 • 9 min read • MegaCpp Engineering

Kernels

H200

MoE

Attention

Triton

Systems

Kernel Catalog and Impact: Why the Runtime Needed a Real Map

A grounded tour of the kernel catalog across attention, sparse MLA, MoE, MTP, and dispatch/combine paths, with emphasis on why naming the kernel family and backend contract changed system-level decisions.

By MegaCpp Engineering

The important performance lesson in MegaCpp was not “switch to a faster kernel.” It was that the project needed an explicit kernel catalog: which family handled attention, which family handled expert dispatch and combine, which family handled sparse MLA, and which family was only a donor or deferred substrate. Once those boundaries were named and documented, system-level decisions about memory, launchers, and model variants became much more trustworthy.

When teams talk about kernel work, they often compress everything into a single heroic path: one fused kernel, one extension, one benchmark chart. The public evidence here points in the opposite direction. The working system is a catalog, not a monolith. Different block families use different kernel backends, and the backend choice affects not only raw speed but also memory materialization, autograd structure, optimizer assumptions, and the comparability of performance reports.

That is why the public MoE substrate notes matter so much. They state the target explicitly: not merely fewer saved tensors, but lower end-to-end runtime materialization and HBM traffic. That is the right lens for understanding kernel impact in a real training stack.

The cleanest companion posts here are kernels that pay for themselves, Triton kernels we maintain, and Sparse MLA FP8 dispatch, because each one zooms into a single row of the catalog instead of collapsing all backend work into one headline. For checked-in proof surfaces, start with Kernel samples and MegaCpp model wiring examples. This article is the map; those sources show the individual rows.

The catalog starts with model structure, not with CUDA code

Before listing any backend, it helps to remember why a catalog exists at all. NAM56R is already a mixed architecture. In the recipe layer it is declared as AEMEAEMEAEMR, which means the runtime must support at least attention-style layers, expert layers, Mamba-family layers, and recurrent-style or selective layers. A single kernel family cannot carry all of that.
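A minimal sketch of how the pattern string implies a set of required kernel families. The letter meanings follow the A/M/E/R glossary used in this article; the `BLOCK_FAMILIES` table and the `required_families` helper are hypothetical names chosen for illustration, not MegaCpp APIs.

```python
from collections import Counter

# Assumption: one letter per block, with meanings taken from the A/M/E/R glossary.
BLOCK_FAMILIES = {
    "A": "attention",
    "M": "mamba",
    "E": "expert",
    "R": "recurrent",
}


def required_families(pattern: str) -> Counter:
    """Count how many blocks of each family a pattern string asks the runtime to serve."""
    unknown = set(pattern) - BLOCK_FAMILIES.keys()
    if unknown:
        raise ValueError(f"unknown block letters: {sorted(unknown)}")
    return Counter(BLOCK_FAMILIES[letter] for letter in pattern)


counts = required_families("AEMEAEMEAEMR")
# attention=3, expert=5, mamba=3, recurrent=1
```

Even this small count shows four distinct families in one twelve-block pattern, which is why a single backend choice cannot describe the whole model.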

The MegaCpp runtime reflects this directly.

The main model runtime module owns attention-side integration, precision plans, and block composition.
The main MoE runtime module and the MoE dispatch runtime module own router-adjacent, dispatch, and combine behavior.
MegaCpp contributes sparse MLA and native Hopper-facing components under dedicated modules.

This is already a kernel catalog in embryo. The performance problem is not “find the best kernel,” but “assign the right kernel family to the right structural surface.”

Model surface	Typical backend family	Why the distinction matters
Dense attention	Flash-attention family and related attention kernels	Throughput, causal masking, layout assumptions
Sparse MLA	TileLang sparse MLA kernels	Different tensor layout, different scale handling
MoE dispatch/combine	Triton-first substrate plus donor-inspired structure	Materialization and routing metadata dominate
MTP / CE path	Native Hopper-oriented kernels where available	Avoid logits materialization and extra reshapes

The table is useful because it turns “kernel optimization” into a routing problem across real subsystems.
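One way to read the table above is as a lookup keyed by model surface. The sketch below transcribes the four rows into a small registry; the `CatalogRow` type, the surface keys, and `backend_for` are illustrative assumptions, not names from the MegaCpp code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogRow:
    surface: str
    backend_family: str
    why_it_matters: str


# Rows transcribed from the table above; the registry structure is hypothetical.
CATALOG = {
    row.surface: row
    for row in (
        CatalogRow("dense_attention", "flash-attention family",
                   "throughput, causal masking, layout assumptions"),
        CatalogRow("sparse_mla", "TileLang sparse MLA kernels",
                   "different tensor layout, different scale handling"),
        CatalogRow("moe_dispatch_combine", "Triton-first substrate",
                   "materialization and routing metadata dominate"),
        CatalogRow("mtp_ce", "native Hopper-oriented kernels",
                   "avoid logits materialization and extra reshapes"),
    )
}


def backend_for(surface: str) -> CatalogRow:
    """Fail loudly when a surface has no named row instead of guessing a backend."""
    try:
        return CATALOG[surface]
    except KeyError:
        raise KeyError(f"no catalog row for surface {surface!r}") from None
```

The design choice worth noting is the loud failure: an unnamed surface is treated as a catalog gap rather than silently falling through to a default kernel.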

It also pairs naturally with the MegaCpp block glossary. ablock questions usually land in attention kernels, projection kernels, positional handling, and their surrounding layout adapters. eblock questions land in routing, dispatch, combine, grouped GEMM, and metadata motion. mblock questions land in state-space mixers, selective scan style kernels, and their runtime boundary code. rblock questions land in recurrent persistence logic and state-carrying update paths. cblock is useful when discussing a composite runtime region that bundles multiple sub-operations under one scheduling or checkpointing policy. A real kernel catalog should make those distinctions visible because different block families fail, scale, and optimize for different reasons.

Attention kernels were a family, not a single switch

The top-level model file makes this obvious. The main model runtime module imports the project’s flash-attention entry points and also references decode and cache-aware helpers. That means “attention kernel” already means more than one thing: training-time full attention, decode-time single-token paths, and backend-specific variants.

The public implementation also reflects a practical split between what is patched in from outside and what is kept as the project’s own high-level contract. That separation matters because a training stack cannot tolerate every attention experiment rewriting the whole module surface.

The impact of a clean attention catalog is mostly defensive, and it feeds directly into serving boundaries like the ones in the inference serving stack.

It prevents backend-parity claims from being made too early.
It makes it clear whether a benchmark changed the entire module path or only a low-level kernel.
It keeps decode, cache, and full-attention cases from being collapsed into one benchmark story.

That same discipline shows up again in sparse MLA.

Sparse MLA needed its own kernel family and layout contract

The MegaCpp sparse MLA path is unusually explicit in public code. It wraps TileLang fused sparse MLA forward and backward kernels behind a single interface, and it makes the layout adaptation visible enough to reason about the integration cost.

The FP8 variant goes further. Public MegaCpp code shows a cached kernel build keyed by launch parameters while still returning BF16 output from an FP8 compute path. Those details are not trivia. They tell you exactly why sparse MLA belongs in its own catalog row.
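A cached build keyed by launch parameters can be sketched with the standard library alone. This is not the MegaCpp implementation: the function name, the parameter set (`heads`, `dim`, `topk`, `out_dtype`), and the `build_log` list are all assumptions made so the caching behavior is observable.

```python
from functools import lru_cache

build_log = []  # records each (hypothetical) compilation


@lru_cache(maxsize=None)
def get_sparse_mla_kernel(heads: int, dim: int, topk: int, out_dtype: str = "bf16"):
    """Build once per distinct launch-parameter tuple, then reuse the result."""
    build_log.append((heads, dim, topk, out_dtype))  # stand-in for an expensive compile

    def kernel(*args):
        raise NotImplementedError("placeholder: a real build would return a compiled kernel")

    return kernel


k1 = get_sparse_mla_kernel(16, 576, 2048)
k2 = get_sparse_mla_kernel(16, 576, 2048)  # cache hit: same object, no rebuild
k3 = get_sparse_mla_kernel(16, 576, 1024)  # new launch parameters: one more build
```

The cache key is part of the contract: any parameter that changes the generated kernel but is left out of the key would silently return the wrong build.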

It is not just a different implementation of dense attention. It has a different data contract, different scale metadata, and different caching behavior.

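# Ingress: swap the first two axes into the kernel-facing layout and force a dense copy.
q = query.permute(1, 0, 2, 3).contiguous()
# Keep only the first slice along the head axis (0:1) after the same axis swap.
kv = key.permute(1, 0, 2, 3)[:, :, 0:1, :].contiguous()
out, lse = kernel(q, kv, q_scale, kv_scale, indices)
# Egress: swap back and flatten the last two axes into np_ * hnv.
output = out.permute(1, 0, 2, 3).contiguous().reshape(sq, b, np_ * hnv)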

That schematic captures the key point: a sparse MLA kernel family imposes its own ingress and egress layout rules. Once that is true, the choice of kernel affects not only speed but the surrounding adapter code, memory traffic, and error surface.
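To make the adapter cost concrete, the sketch below traces shapes through the schematic above using plain tuples, with no tensor library needed. The axis interpretation (sequence-first input, batch-first kernel layout) is inferred from the final reshape, and the example dimensions are hypothetical.

```python
from math import prod


def permute(shape, order):
    """Shape after a permute with the given axis order."""
    return tuple(shape[axis] for axis in order)


def adapter_plan(query_shape, key_shape, hnv):
    """Trace shapes through the ingress and egress steps of the schematic."""
    sq, b, np_, _ = query_shape
    q = permute(query_shape, (1, 0, 2, 3))
    kv_full = permute(key_shape, (1, 0, 2, 3))
    kv = kv_full[:2] + (1,) + kv_full[3:]  # the 0:1 slice keeps a single head
    out = (b, sq, np_, hnv)                # kernel output shape implied by the final reshape
    output = (sq, b, np_ * hnv)
    # Each .contiguous() on a permuted or sliced view typically materializes a copy.
    copied = {"q": prod(q), "kv": prod(kv), "out": prod(out)}
    return {"q": q, "kv": kv, "output": output, "copied_elements": copied}


# Hypothetical dimensions, chosen only to make the shapes readable.
plan = adapter_plan(query_shape=(4096, 2, 16, 576), key_shape=(4096, 2, 1, 576), hnv=512)
```

Counting copied elements per boundary is a cheap way to see that the adapter, not just the kernel body, contributes to memory traffic.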

The MoE catalog was about substrate, not just compute

The public kernel-substrate decision record is the clearest catalog document in MegaCpp. It explicitly compares Triton, vLLM modular fused MoE, SGLang fused MoE Triton, Megatron-Core MoE, MegaBlocks, fusedswiglu, TensorRT-LLM, and CUTLASS or CuTe. More importantly, it labels each one by role: chosen first, donor, deferred donor, or deferred substrate.

That is exactly what a kernel catalog should do. It should not only say what exists. It should say what each option is for.

Option	Role in the catalog	Reason
Triton	Chosen first	Best fit with the current public stack and lowest integration cost
vLLM modular fused MoE	Donor	Good decomposition for permute, unpermute, and weighted finalize
SGLang fused MoE Triton	Donor	Useful Triton organization and top-k handling
Megatron-Core MoE	Donor	Strong training-system boundaries for router, dispatch, experts, combine
MegaBlocks	Donor	Metadata and topology ideas
fusedswiglu	Narrow donor	Direct fused gate-up activation shape
CUTLASS / CuTe	Deferred substrate	High ceiling, higher integration burden

The impact of this document is bigger than a backend choice. It stops the project from pretending every kernel source is equally ready for direct adoption. It also keeps the main goal focused on end-to-end traffic and intermediate materialization rather than isolated arithmetic throughput.
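The role labels in the table above can be treated as data rather than prose. The sketch below encodes only the rows that appear in the table; the `Role` enum and the helper function are illustrative assumptions.

```python
from enum import Enum


class Role(Enum):
    CHOSEN_FIRST = "chosen first"
    DONOR = "donor"
    NARROW_DONOR = "narrow donor"
    DEFERRED_SUBSTRATE = "deferred substrate"


# Transcribed from the table above; nothing beyond its rows is encoded here.
OPTIONS = {
    "Triton": Role.CHOSEN_FIRST,
    "vLLM modular fused MoE": Role.DONOR,
    "SGLang fused MoE Triton": Role.DONOR,
    "Megatron-Core MoE": Role.DONOR,
    "MegaBlocks": Role.DONOR,
    "fusedswiglu": Role.NARROW_DONOR,
    "CUTLASS / CuTe": Role.DEFERRED_SUBSTRATE,
}


def by_role(options, role):
    """Names of all options carrying the given role, in sorted order."""
    return sorted(name for name, r in options.items() if r is role)


integration_candidates = by_role(OPTIONS, Role.CHOSEN_FIRST)
donors = by_role(OPTIONS, Role.DONOR)
```

With roles as data, a review can ask a mechanical question, such as whether a proposed integration source is actually labeled as an integration candidate.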

That distinction is why the runtime catalog matters. A fast compute kernel can still lose if it requires the wrong staging buffers.
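A toy, pure-Python sketch of the permute, unpermute, and weighted-finalize decomposition shows where staging buffers come from. Tokens are scalars and experts are simple functions here; none of this mirrors the actual Triton substrate, and all names are hypothetical.

```python
def dispatch(tokens, topk_ids):
    """Permute: group (token, slot) pairs by expert into per-expert staging lists."""
    staged = {}
    for tok_idx, expert_ids in enumerate(topk_ids):
        for slot, expert in enumerate(expert_ids):
            staged.setdefault(expert, []).append((tok_idx, slot, tokens[tok_idx]))
    return staged


def combine(staged_outputs, topk_weights, num_tokens):
    """Unpermute plus weighted finalize: scatter expert outputs back, scaled by router weight."""
    out = [0.0] * num_tokens
    for rows in staged_outputs.values():
        for tok_idx, slot, value in rows:
            out[tok_idx] += topk_weights[tok_idx][slot] * value
    return out


tokens = [1.0, 2.0, 3.0]
topk_ids = [[0, 1], [1, 2], [0, 2]]
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
experts = {0: lambda x: x + 1, 1: lambda x: 2 * x, 2: lambda x: -x}

staged = dispatch(tokens, topk_ids)
staged_out = {e: [(i, s, experts[e](x)) for i, s, x in rows] for e, rows in staged.items()}
result = combine(staged_out, topk_weights, len(tokens))
staged_elements = sum(len(rows) for rows in staged.values())
```

Note that `staged_elements` equals tokens times top-k: the staging buffer grows with routing fan-out regardless of how fast each expert runs, which is the materialization cost the substrate notes target.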

The donor labeling also prevented a lot of bad engineering behavior. Without it, every attractive upstream kernel starts to look like a near-term integration candidate. With it, reviewers can say something sharper: this source is useful as a decomposition donor, this one is useful for metadata ideas, this one is useful only as a ceiling reference, and this other one is not worth pulling into the training path until the surrounding ABI and materialization story are under control. That is a systems decision, not a benchmark vanity choice.

Native Hopper-oriented kernels changed some ceilings

The MegaCpp side also contains targeted Hopper-facing work outside the main MoE substrate conversation. Public code exposes a multitoken prediction cross-entropy path that uses a native Hopper-oriented kernel so logits do not need to be materialized in the usual way. The public implementation also makes clear that masked positions are handled inside the kernel rather than in a generic fallback path.

This is a good example of why a catalog can improve system design even before it improves every benchmark. Once a path is recognized as a separate kernel family with its own capability profile, the launcher and model stack can decide when to use it and when not to. Without that, a specialized kernel either gets overused or forgotten.

The broader lesson is that some kernels are system-level enablers more than raw-throughput stars. Avoiding logits materialization can matter as much as a few percentage points of arithmetic speed, especially in large sequence or multitask training regimes.
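The idea of avoiding full logits materialization can be illustrated with a chunked, masked cross-entropy in plain Python. This is a reference sketch of the chunking idea only, not the native Hopper kernel; the function name, the toy data, and the chunk sizes are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def chunked_masked_ce(hidden, weight, targets, mask, chunk):
    """Mean cross-entropy over unmasked rows, holding at most `chunk` rows of logits."""
    total, kept = 0.0, 0
    for start in range(0, len(hidden), chunk):
        stop = start + chunk
        # Only this [chunk, vocab] block of logits exists at any one time.
        block = [[dot(h, w) for w in weight] for h in hidden[start:stop]]
        for logits, target, keep in zip(block, targets[start:stop], mask[start:stop]):
            if not keep:
                continue  # masked positions contribute nothing to the loss
            peak = max(logits)
            lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
            total += lse - logits[target]
            kept += 1
    return total / kept if kept else 0.0


hidden = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vocabulary of 3
targets = [0, 2, 1, 1]
mask = [True, True, False, True]

full = chunked_masked_ce(hidden, weight, targets, mask, chunk=len(hidden))
small = chunked_masked_ce(hidden, weight, targets, mask, chunk=1)
```

Both calls compute the same loss; only the peak size of the logits block differs, which is the property that matters when tokens times vocabulary is large.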

That same lesson appears in how MegaCpp treats launch seams. Several modules are not “the kernel” in the narrow sense, but they matter just as much because they determine whether a good kernel is fed cleanly or surrounded by expensive copies, reshapes, and compatibility buffers. In practice, many performance wins come from moving a boundary so that a kernel family receives the layout it actually wants, rather than from rewriting the arithmetic body itself. A kernel catalog that ignores adapters and launch objects is incomplete.

Kernel impact was mostly about reducing ambiguity

The most practical impact of the catalog was that it made benchmark interpretation less sloppy.

If a sparse MLA result improved, reviewers could ask whether TileLang FP8 caching, scale handling, or layout adapters changed. If an MoE lane improved, reviewers could ask whether the gain came from better Triton substrate staging, less routing materialization, or a donor-inspired dispatch shape. If an attention configuration changed, it could be traced to the flash-attention family rather than being conflated with sparse MLA or decode behavior.

That kind of separation is what keeps optimization work cumulative instead of anecdotal.

Question	Without a catalog	With a catalog
“Why did this benchmark improve?”	Hard to know which backend moved	Usually attributable to one kernel family or adapter seam
“Can we compare these runs?”	Easy to compare unlike with unlike	Easier to see when backend families differ
“What should we optimize next?”	Chases symptoms	Targets a named substrate or launch seam

The impact, in other words, was operational clarity.
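A small sketch of what attributable comparison can look like: each benchmark run carries a record of which backend served each surface, and a diff names the rows that moved. The record shape and the backend labels are hypothetical.

```python
def backend_diff(run_a, run_b):
    """Return the surfaces whose backend label differs between two run records."""
    a, b = run_a["backends"], run_b["backends"]
    return {
        surface: (a.get(surface), b.get(surface))
        for surface in sorted(set(a) | set(b))
        if a.get(surface) != b.get(surface)
    }


# Hypothetical run records; the labels are illustrative, not real configuration keys.
baseline = {
    "name": "baseline",
    "backends": {
        "dense_attention": "flash-attention",
        "sparse_mla": "tilelang-bf16",
        "moe_dispatch": "triton",
    },
}
candidate = {
    "name": "candidate",
    "backends": {
        "dense_attention": "flash-attention",
        "sparse_mla": "tilelang-fp8",
        "moe_dispatch": "triton",
    },
}

diff = backend_diff(baseline, candidate)
# {"sparse_mla": ("tilelang-bf16", "tilelang-fp8")}
```

An empty diff means the two runs are comparable at the family level; a non-empty diff tells the reviewer which catalog row to read before trusting the speedup.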

Why the catalog matters for future work too

A project like this will keep gaining new kernels. Some will be direct implementations, some will remain donor references, and some will stay deferred because they require a new extension or ABI lane. The right response is not to hide that diversity. It is to keep the catalog explicit.

MegaCpp already shows what that looks like.

Pattern notation identifies which model surfaces exist.
Runtime files keep family-specific adapters near their real call sites.
Public notes declare which backend is chosen, deferred, or donor-only.
Specialized kernel modules document their layout and scale contracts.

The next benefit is educational for maintainers. When someone says a benchmark moved after a kernel change, the catalog gives reviewers a checklist. Was it the dense attention family, the sparse MLA family, the MoE dispatch family, or a launch adapter around one of them? Did the change affect only an ablock region, or did it alter an eblock dispatch path that would never show up in a dense-only benchmark? Those questions sound simple, but they are exactly what keeps future optimization work grounded instead of turning every speedup into folklore.

That is the foundation for honest systems work. It lets the project say not just “we have a faster kernel,” but “this model surface is now served by this backend family, under this contract, with these tradeoffs.”

For a mixed architecture like NAM56R, that is the only definition of a useful kernel improvement.

Frequently asked questions

Why call this a catalog instead of just a list of kernels?

Because the point is not enumeration. The point is to map kernel families to model surfaces, launch seams, and donor status so optimization decisions stay attributable and reviewable.

Why do donor and deferred-substrate labels matter?

They prevent the team from treating every upstream kernel as equally ready for direct adoption. Some sources are best kept as reference material or decomposition donors rather than immediate integration candidates.

Which checked-in files show the catalog rows most directly?

Use Kernel samples for the attention, sparse MLA, and loss-path rows; MegaCpp model wiring examples for the recipe/runtime map around those rows; and MegaCpp model glossary, Mamba3 kernel journey, and Our honest experience with CuTe DSL when the question is about adapter seams, donor choices, or why one family stayed a reference surface instead of shipping as a local keep.

Why do adapters and launch seams belong in a kernel article?

Because a good kernel can still lose if it is fed through the wrong layout, staging buffer, or compatibility seam. In practice many wins come from fixing the boundary around the kernel, not only from rewriting the arithmetic body.

What should a new catalog row prove before it gets trusted?

It should name the model surface, the required layout, the metadata that has to cross the boundary, the fallback path, and the receipt that proves which backend actually executed. The checked-in kernel examples deliberately separate those questions: attention validity, Sparse MLA FP8 dispatch, row-gather staging, and chunked loss all expose different boundary facts instead of pretending one kernel label proves the whole runtime path.
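The five facts listed above can be checked mechanically before a row is accepted. The sketch below is one possible shape for that check; the `CatalogRowProposal` type, its field names, and the example values are assumptions for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class CatalogRowProposal:
    model_surface: str = ""
    required_layout: str = ""
    boundary_metadata: str = ""
    fallback_path: str = ""
    execution_receipt: str = ""


def missing_facts(row: CatalogRowProposal) -> list:
    """List the required facts that the proposal leaves blank, in declaration order."""
    return [f.name for f in fields(row) if not getattr(row, f.name).strip()]


# Hypothetical, partially filled proposal.
proposal = CatalogRowProposal(
    model_surface="sparse MLA",
    required_layout="batch-first q, single-head kv",
    boundary_metadata="q_scale, kv_scale, indices",
)
gaps = missing_facts(proposal)
# ["fallback_path", "execution_receipt"]
```

A row with a non-empty `gaps` list is still a proposal, not a trusted catalog entry.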

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

CuTe DSL

The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

AttentionValidity

The validity carrier built from row-level counts or masks so sparse or structured attention paths know which token prefix is real without re-inferring it inside the compiled region.

CUTLASS

NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.

AEMEAEMEAEMR

A concrete NAM56R-style hybrid pattern string that encodes the ordered A/M/E/R block mix.

ablock

The attention-heavy block family in MegaCpp's A/M/E/R notation.

mblock

The state-space or Mamba-family block in MegaCpp's A/M/E/R notation.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

eblock

The expert / MoE block family in MegaCpp's A/M/E/R notation.

rblock

The recurrent tail block family in MegaCpp's A/M/E/R notation.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

vLLM

How MegaCpp stabilized a GB10-oriented vLLM lane with an on-disk overlay, text-only model registration, and a deliberate keep-disabled list for…

Grounding: vLLM on GB10: the overlay, the registration fixes, and the paths we kept off

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Topic hubs

Topic Hub

H200 Training and Kernel Bring-Up

A curated path through the H200 lane: operator bring-up, step-time anatomy, memory pressure, and the NVIDIA kernel surfaces that actually moved the stack.

Topic Hub

MoE, Routing, and Distributed Model Splits

A curated path through the expert stack: what the specialist path changed, how routing works, and how the parallelism map constrains the model layout.
