Published April 18, 2026 • 14 min read • David Gornshtein

Tags: Triton, Kernels, GPU, MegaCpp Training Stack

The Triton Kernels We Actually Maintain In-Tree

Which custom Triton kernels we keep in the training stack, how we autotune them without getting burned, and the numerical tests that keep us honest.

This codebase has gone through several waves of "let's just write a Triton kernel for that." Most of those kernels are gone. A small set stayed, because they either saved measurable wall-clock on real training runs or removed a graph break that torch.compile could not otherwise close. This post is the honest list of what is currently in-tree, why each one earns its keep, and how we keep the numerics from silently drifting underneath us. If you want the public-safe proof surface before the narrative, start with Kernel examples overview and the Kernel examples catalog: they are the checked-in map behind the keep set described here.

Why a small set, not a large one

The same selection logic shows up from the framework side in kernel catalog and impact and from the rollout side in kernels that pay for themselves. This article is the narrower "what stayed in-tree" version of that same argument.

Custom Triton kernels are one of the cheapest-looking and most expensive-to-maintain things you can put in a training repo. They are cheap to write because the language is pleasant; they are expensive because the autotuner, the compiler, and the runtime stack change underneath you across PyTorch and Triton releases, and a kernel that was a percent faster than the framework alternative six months ago can be a percent slower today, or worse, silently incorrect. Every Triton kernel in the tree is a promise to keep paying the verification bill as the stack moves.

The other reason it matters is correctness drift. Tolerances written in haste become "what the kernel produces today" rather than "what the math says". We have been burned more than once. So we want a small, defensible set of kernels, each justified by a real bottleneck and each guarded by tests that compare against a recomputed reference, not against last week's output.

When this article mentions a PTX fallback lane, it means NVIDIA's parallel-thread-execution instruction layer used as a portability check, not a separate MegaCpp kernel family. The maintained unit stays the Triton kernel and its reference test; PTX is only the lower-level vocabulary we use when checking which GPU generations still have a legal path.

What's actually in the tree

The kernels that survived fall into a handful of families: a fused RoPE family for Q and K, plus a pair of 3D row-gather kernels used by doc-masking and mixture-of-depth gather/scatter; a fused residual family covering residual-plus-scale add, residual-plus-add-plus-RMSNorm, and a few hierarchy-compose variants; the mHC dynamic-weights family with a Sinkhorn-normalized autotuned variant; the MLA-specific partial RoPE for Q and K with THD layout; a public Mamba fused trapezoidal pre/diag pair that replaced most of the backward-side elementwise churn; and a fused ReLU-squared helper used by the ReLU-squared MoE FFN lane. The checked-in proof surface for that list is Kernel examples overview, with the exact keep-set jump table in Kernel examples catalog.

Two first-touch decoders help here. mHC is MegaCpp shorthand for multi-stream hidden-state mixing: instead of carrying one running stream between layers, the block can mix several hidden-state streams with explicit residual-ownership rules, which is why the most useful checked-in grounding lives in the Hybrid examples overview, the mHC stream residual sample, and the MegaCpp model glossary. THD layout is not a public standards term we expect readers to already know; here it is local shorthand for the tensor contract expected by the MLA partial-RoPE lane. The clean checked-in decoders are MLA integration pattern sample and the MLA row in the Kernel examples catalog, not an external standards document.

We intentionally do not carry a hand-written fused RMSNorm or fused-linear-cross-entropy kernel. The public kernel routing sends rms_norm and fused_linear_cross_entropy to Liger when available, to a chunked CCE variant for the vocab-parallel shard case, and to a plain PyTorch path otherwise. Writing a custom fused RMSNorm was never worth the maintenance burden when Liger exists and gets fuzzed by a much larger user base.
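The routing described above can be sketched as a small pure-Python decision function. This is an illustration, not the in-tree router: the function name and backend labels are hypothetical, and the precedence (vocab-parallel loss goes to chunked CCE before Liger is considered) is an assumption, since the text does not state which rule wins when both apply.

```python
import importlib.util


def liger_available() -> bool:
    # liger_kernel is the import name of the Liger Kernel package.
    return importlib.util.find_spec("liger_kernel") is not None


def route_kernel(op: str, *, has_liger: bool, vocab_parallel: bool = False) -> str:
    """Return a backend label for `op` (hypothetical routing helper)."""
    if op not in ("rms_norm", "fused_linear_cross_entropy"):
        raise ValueError(f"unrouted op: {op}")
    # Assumption: the vocab-parallel shard case takes precedence for the loss.
    if op == "fused_linear_cross_entropy" and vocab_parallel:
        return "chunked_cce"
    if has_liger:
        return "liger"
    # Plain PyTorch path when no vendor kernel is importable.
    return "pytorch"
```

A caller would typically evaluate `liger_available()` once at startup and pass the result in, so the decision stays testable without the package installed.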

The checked-in public-safe proof surface for that fallback lane is Chunked fused linear cross-entropy sample: it keeps the bounded-memory contract visible without pretending the vendor fused path is ours to maintain.
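To make the bounded-memory idea concrete, here is a minimal sketch of a chunked linear cross-entropy forward in plain PyTorch. It is not the checked-in sample: the function name, the chunk size, and the single-device layout are assumptions, and the vocab-parallel sharding is left out entirely. The point it shows is that only one (chunk, vocab) block of logits exists at a time.

```python
import torch
import torch.nn.functional as F


def chunked_linear_cross_entropy(hidden, weight, targets, chunk_size=1024):
    """Mean cross-entropy of (hidden @ weight.T) without building all logits.

    hidden: (N, D), weight: (V, D), targets: (N,) int64.
    """
    total = hidden.new_zeros((), dtype=torch.float32)
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        t = targets[start:start + chunk_size]
        # Only a (chunk, V) block of logits is alive in this iteration.
        logits = (h @ weight.t()).float()
        total = total + F.cross_entropy(logits, t, reduction="sum")
    return total / hidden.shape[0]
```

In training, autograd would still retain each chunk's logits for backward unless the chunk is recomputed or its gradient is produced inside the loop, so this sketch bounds forward memory only.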

That is why Liger FLCE reduction=none is a useful adjacent post: it documents the exact moment where "delegate to upstream" is still right, but only on the subset of the upstream contract that is actually stable.

Kernel	In tree	Reason
Fused Q+K RoPE	Yes	Shared cos/sin, single launch, hot path
3D row gather	Yes	Doc-masking/MoD gather without graph break
Fused residual / RMSNorm hybrids	Yes	Profile-driven on backward
mHC dynamic weights	Yes	Sinkhorn fusion, autotuned
MLA partial RoPE	Yes	THD layout, no fallback equivalent
Mamba3 trapezoidal pre/diag	Yes	Replaced majority of backward elementwise
Fused ReLU squared	Yes	Required for ReLU-squared MoE FFN lanes
Custom fused RMSNorm	No	Delegated to Liger
Fused linear cross-entropy	No	Delegated to Liger / chunked CCE
Bias+dropout+add Triton	No	JIT-script path is good enough
Custom MLA projection	No	cuBLAS grouped GEMMs caught up
Homegrown Mamba3 SSM scan	No	Official Mamba3 SISO kernel wins

The selection rule

Three criteria must clear before a Triton kernel goes into the tree. If a proposed kernel does not clear all three, it stays in an experimental lane or we delete it.

First, it must remove a real bottleneck that profiling agrees on. "Elementwise add accounts for the majority of the backward step in nsys" was the wedge that let the Mamba3 trapezoidal kernels in. A kernel that saves half a microsecond on a path that already lives inside a fused graph does not qualify.

Second, it must not prevent torch.compile from doing useful work elsewhere. That means the kernel has to be wrapped as a torch.autograd.Function with strides we trust, and it needs a pure-PyTorch fallback we can force via an environment flag. Every surviving kernel has a plain fallback path we can A/B against.
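A minimal sketch of that wrapper-plus-fallback shape, assuming a residual-plus-scale add as the op. The flag name `MEGACPP_FORCE_PLAIN_RESIDUAL` is hypothetical, and the "custom" forward is a PyTorch stand-in where a real wrapper would launch the Triton kernel.

```python
import os

import torch

FALLBACK_FLAG = "MEGACPP_FORCE_PLAIN_RESIDUAL"  # hypothetical flag name


class _ResidualScaleAdd(torch.autograd.Function):
    """Stand-in for a fused kernel wrapper: out = x + scale * residual."""

    @staticmethod
    def forward(ctx, x, residual, scale):
        ctx.scale = scale
        # A real wrapper would launch the Triton kernel here on contiguous inputs.
        return x.contiguous() + scale * residual.contiguous()

    @staticmethod
    def backward(ctx, grad_out):
        # d(out)/dx = 1, d(out)/d(residual) = scale; the Python float gets no grad.
        return grad_out, grad_out * ctx.scale, None


def residual_scale_add(x, residual, scale: float):
    if os.environ.get(FALLBACK_FLAG, "0") == "1":
        # Plain PyTorch path, forced on for A/B comparisons.
        return x + scale * residual
    return _ResidualScaleAdd.apply(x, residual, scale)
```

Reading the flag at call time rather than at import time keeps the A/B switch usable inside a single test process.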

Third, it must have numerical tests at training precision and fp32, with tolerances that match the math, not tolerances that match the current output. The fused Q-K RoPE kernel is the canonical survivor by all three criteria: it shares cos and sin loads across heads and fuses Q and K into a single launch, it coexists with FA3, FA4, and Pallas attention without interfering with their compile paths, and it has a dedicated parity suite covering forward, backward, GQA, fp16, bf16, fp32, and the older-SM PTX fallback lane we still care about for portability checks.
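For concreteness, the math such a kernel has to match can be written as a short fp32 reference. This sketch assumes the rotate-half RoPE convention, a (batch, seq, heads, head_dim) layout, and a base of 10000; the article does not say which convention the in-tree kernel uses, and the helper names here are illustrative.

```python
import torch


def rotate_half(x):
    # Split the head dimension in two halves and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def qk_rope_reference(q, k, cos, sin):
    """q: (B, T, Hq, D), k: (B, T, Hk, D), cos/sin: (T, D). Computes in fp32."""
    cos = cos.float()[None, :, None, :]  # one cos/sin table shared by all heads
    sin = sin.float()[None, :, None, :]
    q, k = q.float(), k.float()
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin


def make_cos_sin(seq_len, head_dim, base=10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (T, D/2)
    angles = torch.cat((angles, angles), dim=-1)                   # (T, D)
    return angles.cos(), angles.sin()
```

Because Q and K broadcast against the same tables, GQA shapes (more Q heads than K heads) need no special casing in the reference.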

Autotune discipline

Triton's autotuner is a great tool that will happily wreck you if you treat it as free. Practical rules from repeated tuning work:

Keep autotune configs short and meaningful. The dynamic-weights kernel carries a small, hand-curated set of (BLOCK_M, BLOCK_N, num_warps, num_stages) points; the Cartesian product would be much larger and most of it is identically slow.
Pin keys explicitly. The key=[...] argument of triton.autotune names the arguments whose value changes trigger a fresh tuning pass; including only the dimensions that actually change shape behaviour avoids spurious re-autotunes when an unrelated argument flips.
Always run autotune in a subprocess on the H200 lane. We hit a Triton stream regression where autotune crashes propagated into the live training process; the subprocess pattern turned that from "training dies" into "first compile is slower."
Cap the workspace. Keep Inductor's GEMM autotune workspace capped by default because the workspace can OOM on the largest matmuls and poison the cache with inf ms picks.
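The subprocess rule above can be sketched with the standard library alone. This is an assumption-laden illustration, not the in-tree runner: the child is expected to print its winning config as one JSON line, and the function names and default config values are hypothetical.

```python
import json
import subprocess
import sys

# Hypothetical safe default used when tuning fails.
DEFAULT_CONFIG = {"BLOCK_M": 64, "BLOCK_N": 64, "num_warps": 4, "num_stages": 2}


def autotune_in_subprocess(child_code: str, timeout_s: float = 600.0):
    """Run tuning code in a child interpreter; return its JSON result or None."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None  # the crash stays in the child process
    try:
        # The last stdout line is taken as the result.
        return json.loads(proc.stdout.strip().splitlines()[-1])
    except (IndexError, json.JSONDecodeError):
        return None


def pick_config(child_code: str):
    # A failed tuning run costs a slower config, not the training process.
    return autotune_in_subprocess(child_code) or DEFAULT_CONFIG
```

The design choice is that every failure mode of the child (crash, timeout, garbage output) collapses to the same outcome for the parent: fall back to a known-safe config.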

The autotune cache also has to live somewhere stable. We pin it to the per-run scratch directory under the persistent volume; the boot disk is too small and shared autotune cache across runs has bitten us when an adjacent run picked a config that did not fit the current shapes. For a checked-in compile-boundary decoder that pairs with that rule, use Opaque kernel compile wrapper sample and Flex attention compile sample: both keep the "custom kernel inside a compiled graph" boundary visible without exposing internal runtime glue.
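A minimal sketch of the per-run cache pinning, assuming the caches are steered through the `TRITON_CACHE_DIR` and `TORCHINDUCTOR_CACHE_DIR` environment variables that Triton and Inductor read. The directory layout and the helper name are hypothetical.

```python
import os
from pathlib import Path


def pin_compile_caches(scratch_root, run_id, env=None):
    """Point Triton and Inductor caches at a per-run directory."""
    env = os.environ if env is None else env
    run_dir = Path(scratch_root) / run_id / "compile_cache"
    triton_dir = run_dir / "triton"
    inductor_dir = run_dir / "inductor"
    for d in (triton_dir, inductor_dir):
        d.mkdir(parents=True, exist_ok=True)
    # Set these before torch or triton are imported in the training process.
    env["TRITON_CACHE_DIR"] = str(triton_dir)
    env["TORCHINDUCTOR_CACHE_DIR"] = str(inductor_dir)
    return run_dir
```

Keying the directory on the run id is what prevents an adjacent run's picks from being replayed against different shapes.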

Numerical tests that keep us honest

Every surviving kernel has a public kernel regression test that does the same three things in some order: build a known input distribution, run the Triton kernel and a recomputed pure-PyTorch reference at fp32, and assert max-absolute and max-relative error against tolerances written next to the math, not next to the current output.

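import torch

def test_fused_qk_rope_matches_reference():
    # Known input distribution; the make_qk arguments are elided here.
    q, k, cos, sin = make_qk(...)
    # Triton kernel at training precision.
    out_q_t, out_k_t = fused_qk_rope(q, k, cos, sin)
    # Recomputed pure-PyTorch reference on fp32 inputs.
    out_q_r, out_k_r = qk_rope_reference(q.float(), k.float(), cos, sin)
    torch.testing.assert_close(out_q_t.float(), out_q_r, atol=1e-3, rtol=1e-3)
    torch.testing.assert_close(out_k_t.float(), out_k_r, atol=1e-3, rtol=1e-3)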

The pattern is dull on purpose. The discipline is that any kernel whose tolerances change is presumed wrong until shown otherwise; bumping atol to make a failing test pass requires an explicit rationale in the commit record. We have caught at least three regressions where a kernel author "fixed" a failing test by loosening the tolerance and the underlying drift was real.

The tolerance table is intentionally asymmetric too. In practice fp16 kernels usually stay in the tighter 1e-3 neighborhood, while bf16 kernels only get a looser operator window if they also clear a short model-slice cosine check before promotion. A bf16 kernel that passes allclose only because the window widened, then drifts materially once stacked through layers, is not "close enough"; it stays in the candidate lane or gets removed.
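The asymmetry follows from the formats themselves: fp16 keeps 10 mantissa bits and bf16 keeps 7, so their machine epsilons differ by a factor of eight. The sketch below pairs a per-dtype operator window with a cosine gate for bf16. The bf16 window, the 0.999 cosine threshold, and all function names are hypothetical; only the fp16 1e-3 value echoes the text.

```python
import torch
import torch.nn.functional as F

# Machine epsilon per dtype: 2**-10 for fp16, 2**-7 for bf16.
EPS = {dt: torch.finfo(dt).eps for dt in (torch.float16, torch.bfloat16)}

# Hypothetical operator windows; the bf16 entry is an illustrative value.
TOLERANCES = {
    torch.float16: dict(atol=1e-3, rtol=1e-3),
    torch.bfloat16: dict(atol=1e-2, rtol=1e-2),
}


def passes_operator_window(out, ref_fp32, dtype):
    return torch.allclose(out.float(), ref_fp32, **TOLERANCES[dtype])


def passes_cosine_check(candidate, reference, min_cosine=0.999):
    cos = F.cosine_similarity(
        candidate.float().flatten(), reference.float().flatten(), dim=0
    )
    return cos.item() >= min_cosine


def may_promote(out, ref_fp32, slice_out, slice_ref, dtype):
    ok = passes_operator_window(out, ref_fp32, dtype)
    if dtype is torch.bfloat16:
        # The looser bf16 window only counts if the stacked slice still agrees.
        ok = ok and passes_cosine_check(slice_out, slice_ref)
    return ok
```

Here `slice_out` and `slice_ref` stand for the outputs of a short stack of layers run with and without the kernel, which is where per-operator error accumulates.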

The other test class is graph-shape stability. For each surviving kernel we have a "compile this small model with and without the kernel and assert the FX graph is structurally equivalent" test. That catches the case where a kernel breaks a Dynamo fusion window without changing any per-element output. That same "pay only for kernels that survive real graph pressure" rule is the reason kernel catalog and impact and kernels that pay for themselves stay close to this article.
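A simplified stand-in for that structural check, using torch.fx symbolic tracing rather than the Dynamo-captured graph a real test would inspect. The kernel is modelled as a leaf module so the comparison covers the ops around it; every class and function name here is hypothetical, and the "custom" module is a PyTorch placeholder for a kernel launch.

```python
import torch
from torch import fx, nn


class KernelSlot(nn.Module):
    """Marker base class for the op being swapped (hypothetical)."""


class PlainReluSquared(KernelSlot):
    def forward(self, x):
        return torch.relu(x).square()


class CustomReluSquared(KernelSlot):
    def forward(self, x):
        # Stand-in for a custom kernel launch.
        return torch.where(x > 0, x * x, torch.zeros_like(x))


class LeafSlotTracer(fx.Tracer):
    def is_leaf_module(self, m, module_qualified_name):
        # Keep the swapped op opaque so only its surroundings are compared.
        return isinstance(m, KernelSlot) or super().is_leaf_module(m, module_qualified_name)


class TinyBlock(nn.Module):
    def __init__(self, act):
        super().__init__()
        self.up = nn.Linear(8, 16)
        self.act = act
        self.down = nn.Linear(16, 8)

    def forward(self, x):
        return x + self.down(self.act(self.up(x)))


def graph_signature(model):
    graph = LeafSlotTracer().trace(model)
    # (op kind, target name) per node, in graph order.
    return [(n.op, getattr(n.target, "__name__", str(n.target))) for n in graph.nodes]
```

Asserting `graph_signature(TinyBlock(PlainReluSquared())) == graph_signature(TinyBlock(CustomReluSquared()))` then fails only when the ops around the slot change, which is the class of regression that per-element parity tests cannot see.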

What we kept and threw away

We kept the seven kernels above, the three-criterion selection rule, the autotune-in-subprocess default, the per-run scratch for autotune cache, and the rule that any kernel without an env-var fallback gets deleted.

We threw away every "let's fuse this micro-op for fun" kernel, the global Triton autotune cache shared across runs, the practice of writing tolerances by running the kernel and copying the number, and the homegrown Mamba3 scan once the official SISO kernel landed. We also threw away a fused MLA projection that briefly outperformed cuBLAS grouped GEMMs and stopped doing so within two PyTorch nightlies; the maintenance bill outran the win.

The throughline: a Triton kernel earns its place by removing a profile-confirmed bottleneck that no upstream library will close, with a fallback we can force on, and with numerical tests that compare against the math. Anything else is technical debt waiting to be discovered by a stranger at 2 a.m.

How a kernel actually lands in the tree

The lifecycle of a kernel is short on paper and long in practice. It starts as an experimental script run from a notebook against a pinned tensor shape and a pure-PyTorch reference. Once the script reproduces a measurable speedup against the reference on the actual training shape, not a microbenchmark shape, the author attaches the motivating Nsight Systems slice. The kernel then moves into a candidate module with the three guards: a torch.autograd.Function wrapper, an environment-flag fallback, and a parity test that compares against recomputed math on the real shape family.

The reverse path is also documented. Any kernel whose parity tests degrade across two consecutive PyTorch nightlies, or whose Inductor lowering shifts in a way that makes the compiled fallback faster, gets removed. We have done this twice in the last year: once for a fused MLA projection that cuBLAS grouped GEMMs caught up with after a Hopper algorithm cache update, and once for a custom fused bias+dropout+add that was beaten by Inductor's own lowering after a 2.11 nightly. Removing kernels is part of the maintenance policy, not an embarrassment.

A removal plan should be almost as explicit as the admission plan. We do not jump directly from "custom kernel on" to "delete the file." First we prove the upstream path on the same shape family, then flip the env-var fallback so the plain path is the default for one review cycle, and only then delete the kernel once parity and graph-shape checks stay green. That retirement discipline is the kernel-side version of How we keep a patch lane: local speed paths are allowed, undocumented drift is not.

The Inductor interaction

Triton kernels do not exist in isolation; they live inside graphs that torch.compile is also trying to optimise. Two interactions matter. The first is fusion windows: a custom kernel inserted as an opaque torch.autograd.Function boundary behaves like a graph break, and the surrounding ops must form fusable subgraphs on either side. We have hit cases where adding a custom kernel saved 5% on the kernel itself and lost 8% on the surrounding ops because the fusion window collapsed; the kernel was reverted.

The second is autotune interaction: Triton and Inductor each search their own schedules, and a custom boundary can keep the better upstream schedule from even being tried. The practical implication is that "add a Triton kernel here" is not a local decision. It changes how the surrounding region compiles. We require a graph-shape stability test for every new kernel, comparing the FX graph of a small model with and without the kernel; that test catches the fusion-window collapse class of regression before the kernel reaches main.

The public-safe escape hatch is narrower than "keep fighting Dynamo until it traces the kernel." If a boundary has to stay opaque because the backward path reaches for low-level details such as fake-tensor-hostile pointer or stride facts, the more stable pattern is a real custom op with fake or meta behavior so the surrounding block can still compile cleanly around it. Compile-time vs runtime tradeoffs and the checked-in Opaque kernel compile wrapper sample are the compact proof surface for that pattern.

That is also why the fallback rule is load-bearing rather than ceremonial. A custom kernel wrapped as an opaque autograd boundary can win its own microbenchmark and still make the compiled region around it worse by collapsing a fusion window the upstream compiler would have used. Being able to flip back to the plain path is how we tell whether the kernel improved the full region or only the line item it replaced. If the fallback plus compiler wins, the custom kernel is already on borrowed time.
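The region-level arithmetic is worth writing down once. The timings below are hypothetical and chosen only to mirror the shape of the problem: a kernel that gets 5% faster while the ops around it get 8% slower.

```python
def region_delta(kernel_before_ms, kernel_after_ms, around_before_ms, around_after_ms):
    """Compare the custom-kernel region against the plain-fallback region."""
    before = kernel_before_ms + around_before_ms
    after = kernel_after_ms + around_after_ms
    return {
        "kernel_change": kernel_after_ms / kernel_before_ms - 1.0,
        "region_change": after / before - 1.0,
        "keep_custom": after < before,
    }


# Hypothetical timings: kernel 2.0 -> 1.9 ms, surrounding ops 6.0 -> 6.48 ms.
result = region_delta(2.0, 1.9, 6.0, 6.48)
```

With these numbers the kernel line item improves by 5% while the whole region goes from 8.0 ms to 8.38 ms, about 4.75% slower, so `keep_custom` comes out false even though the microbenchmark looks like a win.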

Autotune receipts belong in that same admission test. A kernel that only looks good after a wide Triton search that pollutes the cache or needs retry loops in the compile process is not a stable keep. On the H200 lane we isolate autotune in a subprocess, persist the winning cache on durable scratch, and treat cache reuse as part of the proof. If the upstream path wins once those boundaries are enforced, the custom kernel does not stay. That is the same "measure the whole compiled region, not the inner op" rule used in Compile-time vs runtime tradeoffs and Kernels that pay for themselves.

That same interaction predicts which kernels upstream invalidates first. Dense GEMM replacements and standard MoE routing helpers tend to have the shortest half-life, while narrow layout and boundary kernels survive longer when they encode a local contract the compiler still cannot see cleanly. That split is the same one behind Dynamo and compile breakage and Migration policy: native Megatron vs narrow custom seams.

Where we still might add a kernel

There are two open candidates we have looked at and not landed. The first is a fused MoE expert sort kernel that would replace the public dispatcher's argsort-heavy path. Profiling shows argsort at a few percent of step time on the deep MoE preset, but the upstream Transformer Engine sort path has caught up enough that the win is small and the maintenance bill would be real.

The public upstream stack is also much less primitive than it used to be. Megatron Core now documents DeepEP-backed dispatch, GroupedGEMM, router fusion, and permute fusion as the fast path for serious MoE training, and its fused MoE surface already treats permute -> FC1 -> activation -> FC2 -> unpermute as one native lane rather than five separate chores. That does not make every MoE routing idea "solved," but it does mean a local Triton sort kernel only stays interesting once the routing contract is genuinely non-standard. The operator follow-ons are Fused MoE and DeepEP on NVIDIA and MoE routing we actually shipped.

The second is a fused per-document RoPE reset kernel that would replace the current prefix-reset pattern. Profiling shows a real local cost there too, but the reset semantics are entangled with packed-row and document-boundary contracts strongly enough that we have not yet seen a version whose maintenance bill is clearly worth paying.

The remaining frontier is narrower than it used to be: unusual quantization unpack paths, state-space recurrences, and truly irregular sparse indexing are better candidates than mainstream transformer math. Even there, the keep rule does not change: the kernel has to attach to a stable local contract rather than race upstream on standard dense work. The closest adjacent examples are NVFP4 inference, Mamba3 kernel journey, and Packed rows as the real training contract.

The checked-in examples make that frontier concrete. Triton row gather sample shows the single-tensor staging contract, and Triton row gather pair sample shows the paired K/V variant with the same strict contiguous-CUDA fast path and obvious eager fallback. Opaque kernel compile wrapper sample is the compile-side twin: it keeps one fragile custom op opaque so the surrounding compiled region can still lower cleanly. Those are the kinds of seams that tend to outlive dense Triton rewrites.

The list above is the current state. It will move; the rules will not. Profile-confirmed bottleneck, env-var fallback, parity tests at fp32, graph-shape stability test, removal when the upstream catches up. Anything that respects those rules is welcome; anything that does not is dead code from the moment it lands.

FAQ

Frequently asked questions

Why do we keep so few Triton kernels in-tree?

Because most of the maintenance cost is in autotune drift, compiler drift, and correctness drift, not in writing the kernel once.

What must every surviving kernel have?

A profile-confirmed reason to exist, an env-var fallback, parity tests against recomputed math, and graph-shape stability coverage.

Why is the autotune cache kept per run instead of in one shared global directory?

Because autotune picks are not timeless truths; they are receipts from one shape family, one compiler stack, and one hardware/runtime posture. A shared global cache makes it too easy for one run to replay a stale or oversized candidate set into another run and turn "first compile was fast" into "the wrong kernel shape won quietly." The per-run scratch rule keeps that boundary inspectable, and training speed anatomy on H200 plus compile-time vs runtime tradeoffs are the follow-on reads when the cache story starts to dominate the measurement.

Why can a seemingly small custom wrapper still break torch.compile?

Because the failure is usually in the opaque boundary, not the arithmetic. Once a torch.autograd.Function reaches for low-level layout facts like raw data pointers or strict stride metadata, Dynamo is no longer tracing ordinary tensor math; it is staring at a boundary it cannot legally peer through. That is how a tiny wrapper can collapse a much larger fusion window around it.

Does Compiled Autograd retire the fallback rule for custom kernels?

No. It can help by capturing a larger backward graph even when the forward path hit graph breaks, so it is a useful diagnostic and sometimes a real rescue path for training. But it still adds backward-start overhead and can recompile more often when autograd structure or shapes move around, so it does not turn an opaque custom boundary into a free one. If the plain compiled fallback wins on the full region, the custom kernel is still on borrowed time.

Why is a custom MoE sort kernel still not an automatic keep?

Because profile heat alone is not enough. Standard top-k MoE routing is exactly where upstream sort-and-grouped-execution paths keep catching up, so a local sort kernel only earns a place if the routing algorithm is genuinely non-standard and still wins after the full compiled region is measured. Otherwise it belongs in the experimental lane, not in the keep set. That is not just a local taste call anymore. Public Megatron Core guidance now recommends --moe-grouped-gemm, --moe-router-fusion, --moe-permute-fusion, and the DeepEP-backed flex dispatcher for high-performance MoE training, while the fused MoE API already internalizes the usual permute and grouped-execution steps. For the standard top-k path, a custom Triton sort kernel therefore has to beat a fairly complete native stack rather than filling a missing primitive.

What kinds of new kernels still clear the bar?

Usually boundary kernels, not replacements for mainstream dense math. Packed decode paths, local recurrent scan or update paths, and awkward gather or scatter seams are still plausible keeps; plain GEMMs and standard expert routing usually are not.

Which adjacent post covers the upstream-delegation line most clearly?

Liger FLCE reduction=none is the cleanest example of "delegate when upstream is good enough."

What are mHC and THD layout in plain English?

mHC is the multi-stream hidden-state mixing path used in MegaCpp hybrid blocks, not a generic upstream Megatron feature. Hybrid examples overview, mHC stream residual sample, and MegaCpp model glossary are the quickest public-safe decoders. THD layout in this article is local tensor-layout shorthand for the fused MLA partial-RoPE kernel contract, not a broad public term with one canonical external definition; MLA integration pattern sample and the MLA row in the Kernel examples catalog are the right proof surfaces.

Which checked-in files show the keep set most directly?

Start with Kernel examples overview and Kernel examples catalog, then jump to Fused RoPE QK sample, Triton row gather pair sample, Fused residual add RMSNorm sample, Fused ReLU squared sample, Hybrid examples overview, and Opaque kernel compile wrapper sample. That set covers the hot-path ingress kernels, block-boundary kernels, activation helper, hybrid naming, and compile-boundary maintenance story.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

RMSNorm

Root-mean-square normalization used as an explicit contract seam in the wrapped Mamba3 and Megatron integration paths.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

MLA

Multi-Latent Attention: an attention layout that keeps a compressed latent path plus a small RoPE-carrying slice instead of a full dense per-head K/V expansion.

RoPE

Rotary positional embedding: the complex-plane rotation applied to a chosen Q/K slice so attention carries relative position without a learned absolute-position table.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

PTX

NVIDIA's low-level parallel-thread execution ISA and adjacent ptxas toolchain surface used here when discussing generated copy and tensor-path mnemonics.

Pallas

JAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.

Transformer Engine

NVIDIA's Transformer Engine library path for accelerated Transformer modules and lower-precision training surfaces such as FP8, kept behind optional adapter seams in these posts.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

NVFP4

NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
