Kernel Catalog and Impact: Why the Runtime Needed a Real Map
A grounded tour of the kernel catalog across attention, sparse MLA, MoE, MTP, and dispatch/combine paths, with emphasis on why naming the kernel family and backend contract changed system-level decisions.

The important performance lesson in MegaCpp was not “switch to a faster kernel.” It was that the project needed an explicit kernel catalog: which family handled attention, which family handled expert dispatch and combine, which family handled sparse MLA (Multi-Latent Attention), and which family was only a donor or deferred substrate. Once those boundaries were named and documented, system-level decisions about memory, launchers, and model variants became much more trustworthy.
When teams talk about kernel work, they often compress everything into a single heroic path: one fused kernel, one extension, one benchmark chart. The public evidence here points in the opposite direction. The working system is a catalog, not a monolith. Different block families use different kernel backends, and the backend choice affects not only raw speed but also memory materialization, autograd structure, optimizer assumptions, and the comparability of performance reports.
That is why the public MoE substrate notes matter so much. They state the target explicitly: not merely fewer saved tensors, but lower end-to-end runtime materialization and HBM traffic. That is the right lens for understanding kernel impact in a real training stack.
The cleanest companion posts here are kernels that pay for themselves, Triton kernels we maintain, and Sparse MLA FP8 dispatch, because each one zooms into a single row of the catalog instead of collapsing all backend work into one headline. For checked-in proof surfaces, start with Kernel samples and MegaCpp model wiring examples. This article is the map; those sources show the individual rows.
The catalog starts with model structure, not with CUDA code
Before listing any backend, it helps to remember why a catalog exists at all.
NAM56R is already a mixed architecture. In the recipe layer it is declared as AEMEAEMEAEMR, which means the runtime must support at least attention-style layers, expert layers, Mamba-family layers, and recurrent-style or selective layers. A single kernel family cannot carry all of that.
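As a minimal sketch of why the pattern string already implies a catalog, the snippet below counts the block families in the declared pattern. The letter meanings follow the A/M/E/R notation used in this article; the names `BLOCK_FAMILIES` and `required_families` are hypothetical and are not MegaCpp identifiers.

```python
from collections import Counter

# Letter meanings follow the A/M/E/R block notation; the dict itself is illustrative.
BLOCK_FAMILIES = {
    "A": "attention",
    "M": "mamba / state-space",
    "E": "expert / MoE",
    "R": "recurrent",
}


def required_families(pattern: str) -> Counter:
    """Count how many blocks of each family a pattern string asks for."""
    unknown = set(pattern) - set(BLOCK_FAMILIES)
    if unknown:
        raise ValueError(f"unknown block letters: {sorted(unknown)}")
    return Counter(BLOCK_FAMILIES[letter] for letter in pattern)


counts = required_families("AEMEAEMEAEMR")
for family, n in counts.items():
    print(f"{family}: {n}")
```

Four distinct families appear in a twelve-block pattern, so at least four kernel surfaces need a named owner before any backend is compared.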
The MegaCpp runtime reflects this directly.
- The main model runtime module owns attention-side integration, precision plans, and block composition.
- The main MoE runtime module and the MoE dispatch runtime module own router-adjacent, dispatch, and combine behavior.
- MegaCpp contributes sparse MLA and native Hopper-facing components under dedicated modules.
This is already a kernel catalog in embryo. The performance problem is not “find the best kernel,” but “assign the right kernel family to the right structural surface.”
| Model surface | Typical backend family | Why the distinction matters |
|---|---|---|
| Dense attention | Flash-attention family and related attention kernels | Throughput, causal masking, layout assumptions |
| Sparse MLA | TileLang sparse MLA kernels | Different tensor layout, different scale handling |
| MoE dispatch/combine | Triton-first substrate plus donor-inspired structure | Materialization and routing metadata dominate |
| MTP / CE path | Native Hopper-oriented kernels where available | Avoid logits materialization and extra reshapes |
The table is useful because it turns “kernel optimization” into a routing problem across real subsystems.
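One way to make the table operational is to hold it as data that launchers and reviewers can query. The sketch below is an assumption about shape only: `CatalogRow`, `CATALOG`, `backend_for`, and the surface keys are hypothetical names, and the row contents are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogRow:
    surface: str         # model surface being served
    backend_family: str  # kernel family assigned to that surface
    concerns: tuple      # what a reviewer should check for this row


# Rows mirror the table above; the dictionary keys are hypothetical identifiers.
CATALOG = {
    "dense_attention": CatalogRow(
        "Dense attention",
        "Flash-attention family and related attention kernels",
        ("throughput", "causal masking", "layout assumptions"),
    ),
    "sparse_mla": CatalogRow(
        "Sparse MLA",
        "TileLang sparse MLA kernels",
        ("tensor layout", "scale handling"),
    ),
    "moe_dispatch_combine": CatalogRow(
        "MoE dispatch/combine",
        "Triton-first substrate plus donor-inspired structure",
        ("materialization", "routing metadata"),
    ),
    "mtp_ce": CatalogRow(
        "MTP / CE path",
        "Native Hopper-oriented kernels where available",
        ("logits materialization", "extra reshapes"),
    ),
}


def backend_for(surface_key: str) -> CatalogRow:
    """Look up the catalog row for a surface, failing loudly if none exists."""
    try:
        return CATALOG[surface_key]
    except KeyError:
        raise KeyError(f"no catalog row for surface {surface_key!r}") from None


print(backend_for("sparse_mla").backend_family)
```

Failing on an unknown surface is a deliberate choice in this sketch: a benchmark for a surface with no row is exactly the ambiguity the catalog is meant to remove.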
It also pairs naturally with the MegaCpp block glossary. ablock questions usually land in attention kernels, projection kernels, positional handling, and their surrounding layout adapters. eblock questions land in routing, dispatch, combine, grouped GEMM, and metadata motion. mblock questions land in state-space mixers, selective-scan-style kernels, and their runtime boundary code. rblock questions land in recurrent persistence logic and state-carrying update paths. cblock is useful when discussing a composite runtime region that bundles multiple sub-operations under one scheduling or checkpointing policy. A real kernel catalog should make those distinctions visible because different block families fail, scale, and optimize for different reasons.
Attention kernels were a family, not a single switch
The top-level model file makes this obvious. The main model runtime module imports the project’s flash-attention entry points and also references decode and cache-aware helpers. That means “attention kernel” already means more than one thing: training-time full attention, decode-time single-token paths, and backend-specific variants.
The public implementation also reflects a practical split between what is patched in from outside and what is kept as the project’s own high-level contract. That separation matters because a training stack cannot tolerate every attention experiment rewriting the whole module surface.
The impact of a clean attention catalog is mostly defensive, and it feeds directly into serving boundaries like the ones in the inference serving stack.
- It prevents backend-parity claims from being made too early.
- It makes it clear whether a benchmark changed the entire module path or only a low-level kernel.
- It keeps decode, cache, and full-attention cases from being collapsed into one benchmark story.
That same discipline shows up again in sparse MLA.
Sparse MLA needed its own kernel family and layout contract
The MegaCpp sparse MLA path is unusually explicit in public code. It wraps TileLang fused sparse MLA forward and backward kernels behind a single interface, and it makes the layout adaptation visible enough to reason about the integration cost.
The FP8 variant goes further. Public MegaCpp code shows a cached kernel build keyed by launch parameters while still returning BF16 output from an FP8 compute path. Those details are not trivia. They tell you exactly why sparse MLA belongs in its own catalog row.
It is not just a different implementation of dense attention. It has a different data contract, different scale metadata, and different caching behavior.
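The caching behavior can be illustrated with a plain memoized builder. This is a sketch under assumptions, not the MegaCpp or TileLang code: `get_kernel`, its parameters (`heads`, `head_dim`, `topk`), and the placeholder kernel body are all hypothetical. It only shows the idea of one build per distinct launch-parameter tuple.

```python
from functools import lru_cache

build_count = 0  # how many times the (pretend) expensive build actually ran


@lru_cache(maxsize=None)
def get_kernel(heads: int, head_dim: int, topk: int, dtype: str = "fp8"):
    """Stand-in for an expensive kernel compile, cached per launch-parameter tuple."""
    global build_count
    build_count += 1

    def kernel(*args):
        # Placeholder body: a real kernel would compute in FP8 and return BF16 output.
        return {"compute_dtype": dtype, "output_dtype": "bf16"}

    return kernel


k1 = get_kernel(16, 64, 128)
k2 = get_kernel(16, 64, 128)  # same key: served from the cache
k3 = get_kernel(16, 64, 256)  # different topk: triggers a new build
print(build_count)
```

The point for the catalog is that the cache key is part of the contract. Any launch parameter that changes the generated kernel has to be in the key, or two different launches silently share one build.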
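
    # Ingress: swap the first two axes and force contiguous buffers.
    q = query.permute(1, 0, 2, 3).contiguous()
    # Keep only the first slice along axis 2, preserving it as a size-1 axis.
    kv = key.permute(1, 0, 2, 3)[:, :, 0:1, :].contiguous()
    out, lse = kernel(q, kv, q_scale, kv_scale, indices)
    # Egress: swap the axes back and flatten the last two into one.
    output = out.permute(1, 0, 2, 3).contiguous().reshape(sq, b, np_ * hnv)
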
That schematic captures the key point: a sparse MLA kernel family imposes its own ingress and egress layout rules. Once that is true, the choice of kernel affects not only speed but the surrounding adapter code, memory traffic, and error surface.
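The layout rules can be checked without a GPU by tracing shapes alone. The sketch below assumes the query and key arrive as `[sq, b, np, hn]` and that the kernel returns `[b, sq, np, hnv]`. Both are inferences from the final reshape in the schematic, not documented facts. `permute_shape` and `sparse_mla_shapes` are hypothetical helpers.

```python
def permute_shape(shape, order):
    """Return the shape a tensor would have after permuting its axes."""
    return tuple(shape[i] for i in order)


def sparse_mla_shapes(sq, b, np_, hn, hnv):
    """Trace shapes through the ingress/egress adapter without running a kernel."""
    query = (sq, b, np_, hn)  # assumed input layout
    key = (sq, b, np_, hn)    # assumed input layout

    q = permute_shape(query, (1, 0, 2, 3))       # [b, sq, np, hn]
    kv_full = permute_shape(key, (1, 0, 2, 3))   # [b, sq, np, hn]
    kv = kv_full[:2] + (1,) + kv_full[3:]        # [b, sq, 1, hn]

    out = (b, sq, np_, hnv)                      # assumed kernel output layout
    swapped = permute_shape(out, (1, 0, 2, 3))   # [sq, b, np, hnv]
    output = (swapped[0], swapped[1], swapped[2] * swapped[3])
    return {"q": q, "kv": kv, "out": out, "output": output}


shapes = sparse_mla_shapes(sq=8, b=2, np_=4, hn=6, hnv=5)
for name, shape in shapes.items():
    print(name, shape)
```

Each permute followed by `.contiguous()` in the schematic is a full copy of that tensor, so the traced shapes are also a rough count of the adapter's extra memory traffic.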
The MoE catalog was about substrate, not just compute
The public kernel-substrate decision record is the clearest catalog document in MegaCpp. It explicitly compares Triton, vLLM modular fused MoE, SGLang fused MoE Triton, Megatron-Core MoE, MegaBlocks, fusedswiglu, TensorRT-LLM, and CUTLASS or CuTe. More importantly, it labels each one by role: chosen first, donor, deferred donor, or deferred substrate.
That is exactly what a kernel catalog should do. It should not only say what exists. It should say what each option is for.
| Option | Role in the catalog | Reason |
|---|---|---|
| Triton | Chosen first | Best fit with the current public stack and lowest integration cost |
| vLLM modular fused MoE | Donor | Good decomposition for permute, unpermute, and weighted finalize |
| SGLang fused MoE Triton | Donor | Useful Triton organization and top-k handling |
| Megatron-Core MoE | Donor | Strong training-system boundaries for router, dispatch, experts, combine |
| MegaBlocks | Donor | Metadata and topology ideas |
| fusedswiglu | Narrow donor | Direct fused gate-up activation shape |
| CUTLASS / CuTe | Deferred substrate | High ceiling, higher integration burden |
The impact of this document is bigger than a backend choice. It stops the project from pretending every kernel source is equally ready for direct adoption. It also keeps the main goal focused on end-to-end traffic and intermediate materialization rather than isolated arithmetic throughput.
That distinction is why the runtime catalog matters. A fast compute kernel can still lose if it requires the wrong staging buffers.
The donor labeling also prevented a lot of bad engineering behavior. Without it, every attractive upstream kernel starts to look like a near-term integration candidate. With it, reviewers can say something sharper: this source is useful as a decomposition donor, this one is useful for metadata ideas, this one is useful only as a ceiling reference, and this other one is not worth pulling into the training path until the surrounding ABI and materialization story are under control. That is a systems decision, not a benchmark vanity choice.
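The role labels become enforceable once they are data rather than prose. The following sketch encodes the table above. The `Role` enum, `MOE_OPTIONS`, and especially the gate in `may_enter_training_path` are illustrative assumptions. The source states the roles but does not describe an automated check.

```python
from enum import Enum


class Role(Enum):
    CHOSEN_FIRST = "chosen first"
    DONOR = "donor"
    NARROW_DONOR = "narrow donor"
    DEFERRED_SUBSTRATE = "deferred substrate"


# Option names and roles are copied from the table above.
MOE_OPTIONS = {
    "Triton": Role.CHOSEN_FIRST,
    "vLLM modular fused MoE": Role.DONOR,
    "SGLang fused MoE Triton": Role.DONOR,
    "Megatron-Core MoE": Role.DONOR,
    "MegaBlocks": Role.DONOR,
    "fusedswiglu": Role.NARROW_DONOR,
    "CUTLASS / CuTe": Role.DEFERRED_SUBSTRATE,
}


def may_enter_training_path(option: str) -> bool:
    """Hypothetical review gate: only the chosen-first substrate is integrated directly."""
    return MOE_OPTIONS[option] is Role.CHOSEN_FIRST


donors = sorted(name for name, role in MOE_OPTIONS.items() if role is Role.DONOR)
print(donors)
print(may_enter_training_path("Triton"), may_enter_training_path("MegaBlocks"))
```

A gate like this does not stop anyone from borrowing ideas from a donor. It only makes direct adoption an explicit role change instead of a quiet import.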
Native Hopper-oriented kernels changed some ceilings
The MegaCpp side also contains targeted Hopper-facing work outside the main MoE substrate conversation. Public code exposes a multi-token prediction (MTP) cross-entropy path that uses a native Hopper-oriented kernel so logits do not need to be materialized in the usual way. The public implementation also makes clear that masked positions are handled inside the kernel rather than in a generic fallback path.
This is a good example of why a catalog can improve system design even before it improves every benchmark. Once a path is recognized as a separate kernel family with its own capability profile, the launcher and model stack can decide when to use it and when not to. Without that, a specialized kernel either gets overused or forgotten.
The broader lesson is that some kernels are system-level enablers more than raw-throughput stars. Avoiding logits materialization can matter as much as a few percentage points of arithmetic speed, especially in large sequence or multitask training regimes.
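To see why skipping logits materialization is possible at all, here is a pure-Python sketch of chunked cross-entropy with masked positions skipped inside the loss loop. It is a generic illustration of the idea, not the native Hopper kernel. All function names and the toy data are hypothetical.

```python
import math


def logits_for(hiddens, weight):
    """Project hidden vectors onto the vocabulary: one row of logits per token."""
    return [[sum(h * w for h, w in zip(hid, row)) for row in weight] for hid in hiddens]


def ce_sum(logits, targets, mask):
    """Sum of per-token cross-entropy over unmasked tokens, plus their count."""
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, mask):
        if not keep:
            continue  # masked positions contribute nothing
        m = max(row)
        total += m + math.log(sum(math.exp(x - m) for x in row)) - row[target]
        count += 1
    return total, count


def full_ce(hiddens, weight, targets, mask):
    """Reference path: builds the whole [tokens x vocab] logits matrix first."""
    total, count = ce_sum(logits_for(hiddens, weight), targets, mask)
    return total / count


def chunked_ce(hiddens, weight, targets, mask, chunk=2):
    """Holds at most [chunk x vocab] logits at any one time."""
    total, count = 0.0, 0
    for s in range(0, len(hiddens), chunk):
        e = s + chunk
        part, n = ce_sum(logits_for(hiddens[s:e], weight), targets[s:e], mask[s:e])
        total += part
        count += n
    return total / count


hiddens = [[0.1, -0.2, 0.3], [0.5, 0.4, -0.1], [-0.3, 0.2, 0.2], [0.0, 0.1, -0.4], [0.7, -0.6, 0.2]]
weight = [[0.2, 0.1, -0.3], [-0.1, 0.4, 0.2], [0.3, -0.2, 0.1], [0.0, 0.5, -0.5]]
targets = [0, 3, 1, 2, 1]
mask = [True, True, False, True, True]

full = full_ce(hiddens, weight, targets, mask)
chunked = chunked_ce(hiddens, weight, targets, mask, chunk=2)
print(math.isclose(full, chunked, rel_tol=1e-9))
```

The two paths agree on the loss while the chunked one never holds more than `chunk` rows of logits, which is the system-level property the paragraph above describes.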
That same lesson appears in how MegaCpp treats launch seams. Several modules are not “the kernel” in the narrow sense, but they matter just as much because they determine whether a good kernel is fed cleanly or surrounded by expensive copies, reshapes, and compatibility buffers. In practice, many performance wins come from moving a boundary so that a kernel family receives the layout it actually wants, rather than from rewriting the arithmetic body itself. A kernel catalog that ignores adapters and launch objects is incomplete.
Kernel impact was mostly about reducing ambiguity
The most practical impact of the catalog was that it made benchmark interpretation less sloppy.
If a sparse MLA result improved, reviewers could ask whether TileLang FP8 caching, scale handling, or layout adapters changed. If an MoE lane improved, reviewers could ask whether the gain came from better Triton substrate staging, less routing materialization, or a donor-inspired dispatch shape. If an attention configuration changed, it could be traced to the flash-attention family rather than being conflated with sparse MLA or decode behavior.
That kind of separation is what keeps optimization work cumulative instead of anecdotal.
| Question | Without a catalog | With a catalog |
|---|---|---|
| “Why did this benchmark improve?” | Hard to know which backend moved | Usually attributable to one kernel family or adapter seam |
| “Can we compare these runs?” | Easy to compare unlike with unlike | Easier to see when backend families differ |
| “What should we optimize next?” | Chases symptoms | Targets a named substrate or launch seam |
The impact, in other words, was operational clarity.
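The comparability question in the table reduces to a small diff over per-surface backend labels. The sketch below assumes each benchmark run records a mapping from surface to backend family. The `backend_diff` helper and the label strings are hypothetical.

```python
def backend_diff(run_a: dict, run_b: dict) -> dict:
    """Return the surfaces whose backend family differs between two runs."""
    surfaces = sorted(set(run_a) | set(run_b))
    return {
        s: (run_a.get(s), run_b.get(s))
        for s in surfaces
        if run_a.get(s) != run_b.get(s)
    }


# Hypothetical run records; the label strings are illustrative only.
baseline = {
    "dense_attention": "flash-attention",
    "sparse_mla": "tilelang-bf16",
    "moe_dispatch": "triton",
}
candidate = {
    "dense_attention": "flash-attention",
    "sparse_mla": "tilelang-fp8",
    "moe_dispatch": "triton",
}

changed = backend_diff(baseline, candidate)
print(changed)
```

An empty diff means the two runs are like-for-like at the backend level. A non-empty diff names the catalog rows a reviewer should examine before attributing the speedup.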
Why the catalog matters for future work too
A project like this will keep gaining new kernels. Some will be direct implementations, some will remain donor references, and some will stay deferred because they require a new extension or ABI lane. The right response is not to hide that diversity. It is to keep the catalog explicit.
MegaCpp already shows what that looks like.
- Pattern notation identifies which model surfaces exist.
- Runtime files keep family-specific adapters near their real call sites.
- Public notes declare which backend is chosen, deferred, or donor-only.
- Specialized kernel modules document their layout and scale contracts.
The next benefit is educational for maintainers. When someone says a benchmark moved after a kernel change, the catalog gives reviewers a checklist. Was it the dense attention family, the sparse MLA family, the MoE dispatch family, or a launch adapter around one of them? Did the change affect only an ablock region, or did it alter an eblock dispatch path that would never show up in a dense-only benchmark? Those questions sound simple, but they are exactly what keeps future optimization work grounded instead of turning every speedup into folklore.
That is the foundation for honest systems work. It lets the project say not just “we have a faster kernel,” but “this model surface is now served by this backend family, under this contract, with these tradeoffs.”
For a mixed architecture like NAM56R, that is the only definition of a useful kernel improvement.