MegaCpp EngineeringApplied C++ model systems
</>
Article
Grounded engineering note from the MegaCpp stack
Published 17 min readDavid Gornshtein
CuTe DSL
Cutlass
Kernels
Gpu

Our honest experience with CuTe DSL

What we tried to build with CuTe DSL, where it held up, where it lost to alternatives, and the chunks we rewrote back to Triton or kept in CUDA.

MegaCpp
Focused on applied C++ model engineering
Article Preview
Our honest experience with CuTe DSL
Published 17 min readDavid Gornshtein

We spent a meaningful chunk of Q1 2026 learning CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 and Blackwell-class hardware. Some of that work ended up in the tree, most of it did not, and a couple of experiments taught us enough to change how we think about kernel stacks. This post is the honest retrospective, because we kept getting asked "so is CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample worth it?" and the answer really depends on what you are trying to do.

Why this matters

Kernel stack choices are sticky. Once a critical kernel is in CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample, your container builds, your test machines, your CI lanes, and your debugging muscle memory all bend around it. CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample is also the path NVIDIA itself ships its newest reference kernels through (FlashAttention-4 most visibly), so any team running on Hopper or Blackwell ends up with at least a transitive dependency on it whether they meant to or not. The question is no longer whether to learn it; it is which kernels to actually own and which to consume. The shortest checked-in catalog for that decision is MegaCpp model wiring examples, and the shortest local "where does cuTile stop and CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample start?" path is this article together with TileLang TMA and H200 reality and TileLang TMA bulk copy 3D SMEM deep dive.

The other reason it matters: the per-kernel engineering cost is materially higher than the paths in The Triton Kernels We Actually Maintain In-Tree, and the torch.compile integration story still has sharp edges. Misjudging where on the cost curve a particular kernel lives is how you spend a quarter porting something that was already fast.

That is easiest to see when you line this post up with the more execution-focused follow-ups: Flash Attention 4 in practice covers the backend that actually survived into the training stack, Mamba3 kernel journey shows where the Mamba path stayed elsewhere, and GB10 Blackwell tensor paths: what we actually proved is the clearest record of where the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story capability matrix stopped being guesswork.

1. Context and scope

CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample, for our purposes, is the Python surface shipped as nvidia-cutlass-dsl (we used 4.4.1 and 4.4.2) that lets you write CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample-style kernels in Python, JIT-compile them, and call them without an ahead-of-time CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 build. The headline attractions were the FlashAttention-4 integration (flash-attn-4 4.0.0b4, a pure-Python CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel set) and the ability to hand-write WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron-backed GEMMs for shapes that cuBLAS and Triton do not corner well.

Our two concrete workstreams:

  1. Running FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 and B200 via the CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel, both as a drop-in attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns backend and as a nested kernel inside flex_attention(kernel_options={"BACKEND": "FLASH"}).
  2. Exploring whether hand-written CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample GEMMs could replace or augment our TileLang fused scan for the Mamba MIMO backward-backward kernel on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 and GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story.

The rest of the post is what actually happened in those two tracks.

First-touch terms

If you want the compiler-managed counterpart first, start with TileLang TMA and H200 reality, TileLang TMA bulk copy 3D SMEM deep dive, and the checked-in TileLang TMA bulk-copy SMEM near-copy. If you want the consumer-versus-datacenter Blackwell stop line, use GB10 Blackwell tensor paths: what we actually proved plus the checked-in GB10 gate matrix. That set is the shortest checked-in bridge for deciding when TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample legality should stay compiler-managed versus when CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample should own TMA, WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron, and staging directly.

For the target-name boundary, the exact NVIDIA labels matter in normal prose as much as they do in probes: sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro is the baseline Blackwell name, sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough is the datacenter Blackwell target, and sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story target. The datacenter-oriented tensor path is the separate tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough lane, with tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.GroundingAbout: GB10 cubin patch repro Example: GB10 repro walkthrough Example: GB10 gate walkthrough, tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough, tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails, TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 repro walkthrough, TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity, and UMMAQuick term guideUMMAA Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.GroundingAbout: GB10 tensor-path proof summary Example: TileLang TMA bulk-copy sample kept out of the broad GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story claim. The same conservative reading applies to FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper here: a BF16 CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample receipt on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story does not imply the FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper path is healthy.

2. What the learning curve actually costs

Everyone says CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample has a learning curve. It does, and that curve is steeper than people who have already absorbed CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample concepts tend to notice. For an engineer who is competent in Triton, comfortable in CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200, and has never written CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample kernels, the ramp looks roughly like this:

Concretely: the 3-GEMM fused CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 matched cuBLAS on the same chain. It did not beat it. At small shapes, launch overhead is comparable to compute time, so the fusion eliminates launch overhead but has no headroom to do more. That was a useful result, because it told us exactly where the CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample payoff curve starts and stops.

3. What we kept

  1. The FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel lane itself, as a first-class backend in the training stack. The checked-in FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample examples and receipts carry the pinned dependency story, and the kernels JIT-compile on first call, which adds a one-time warmup we account for in our throughput measurements.
  2. The specific FlexAttention + FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample wiring, where the backend is selected via flex_attention(kernel_options={"BACKEND": "FLASH"}) and wrapped in torch.compiler.disable() at the call site. That single wrap is what made the preset lanes compile and run end-to-end without Inductor trying to lower a CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel into its outer compile graph.
  3. The empirical GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story / sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough capability matrix. When we went in, the working assumption was that CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample was "fully blocked" on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough. Our actual probes found that the bf16 path works; fp16 and fp8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper paths have sharper restrictions that we documented rather than guessed at. That pivot kept us from burning weeks on the wrong assumption.
  4. The 3-GEMM fused WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron proof on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200. We do not ship that particular fused GEMM in training, but the checked-in kernel examples stay around as a reference for anyone who needs to cut a specific shape that cuBLAS does not serve well. It is the cleanest example in our public tree of a hand-written WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron kernel that is correct at bf16 and inside the ballpark of cuBLAS on a real shape.

4. What we rewrote back

  1. The fused Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode MIMO bwd_bwd kernel stayed in TileLang. We explored porting it to CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample and hit three structural walls: smem-layout round-tripping (WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron reads swizzled smem; StMatrix writes plain layouts, so each inter-GEMM intermediate wants an extra smem-to-smem copy, roughly a few microseconds of overhead per hop), manual stage/pipeline management for ten-plus GEMMs, and the absence of the automatic multi-stage pipelining that TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample provides for free.
  2. A custom CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample fused softcap+causal kernel for H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200. We built a prototype that worked at one shape and lost decisively to FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample with softcap on every other shape. We deleted it.
  3. A second-attempt CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample grouped GEMM for MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch. cuBLAS grouped GEMMs caught up before we finished, and our Python-level dispatch logic was already eating most of the headroom.

5. Sharp edges we hit

The compile seam was narrower than the ecosystem slogans made it sound. The checked-in FlexAttention compile sample keeps three facts separate on purpose: compile the outer flex_attention surface only on CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200, probe the optional FLASH or CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary imports up front, and then prove actual execution with Dense FA4 execute proof sample or FA4 receipt summary sample. That split is what kept import success or a preset name from being mistaken for a CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary kernel surviving inside the compiled graph.

The other sharp edge is that "backward diverged" is often too vague to be a useful diagnosis. Once a kernel depends on TMA staging, swizzled shared-memory layouts, and tightly bounded tile shapes, a shape that passes one benchmark does not automatically validate nearby shapes. That is one reason this post keeps coming back to narrow proof surfaces instead of broad ecosystem claims: the failure mode is often in the staging contract, not in the top-line algorithm name.

On Hopper, the fragile part was usually not the attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns formula itself but the staging geometry around it. The nearby shapes that worried us were the ones that pushed against 128-byte TMA alignment, swizzled shared-memory layouts, or barrier timing between a producer refill and the next WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron consumer. That does not prove one universal root cause for every divergence report, but it does explain why one passing benchmark was never enough to bless a whole neighborhood of shapes. The closest checked-in contrast is TileLang TMA bulk copy 3D SMEM deep dive plus TileLang TMA bulk-copy SMEM near-copy: they show the kind of staging contract that compiler-managed lowering can keep consistent for you.

6. Comparisons we actually measured

For the one configuration where we have clean head-to-head numbers, a 3-GEMM chain on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 at small bf16 shapes:

Variant Wall time Notes
CuTe DSL fused 3-GEMM, one launch ~36 microseconds Our hand-written kernel
CuTe DSL, three separate WGMMA launches ~64 microseconds Same body, no fusion
torch.bmm / cuBLAS, three launches ~36 microseconds Stock baseline

That is the reality check. Fusion buys ~1.78x by killing launch overhead. Matching cuBLAS at a single fused shape is not a win; it is a tie. The interesting fusion wins start appearing only when you cross ten-plus GEMMs with non-trivial intermediates, which is exactly where CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample's manual smem management collapses unless you have a large engineering budget.

One later ecosystem contrast made that trade easier to explain. The older Triton TMA experiments we reviewed paid a host-side descriptor tax before the useful math even started, because tensor-map setup was still living outside the launched kernel path. CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample and the compiler-managed lane in TileLang TMA and H200 reality avoid that specific setup pattern, which is one reason we stopped treating "has TMA" as proof that two kernel stacks pay the same staging cost.

Numbers we did not chase to a clean comparison: CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample versus TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample for the full Mamba MIMO backward-backward kernel on B200. We stopped early because the structural gap was obvious after the 3-GEMM fusion result and the missing multi-stage pipelining analysis. "n/a" is the honest answer; we did not do the 600-plus line port.

7. Where CuTe DSL earns its keep

After all that, we still think CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample is the right tool in two situations:

  1. When you need to write a new atom or a GEMM kernel for a shape or dtype combination that cuBLAS and Triton cannot corner well, and you need Hopper or Blackwell features (WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron, TMA, swizzled smem) that Triton does not expose cleanly. For these, Python-level CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample is meaningfully more productive than writing CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample in C++.
  2. When you are consuming a third-party CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel like FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample that was written by people who have already paid the learning curve cost. In that case you are essentially using it as a prebuilt kernel, and the job is wiring it into your compile graph without triggering inductor lowering of the kernel body itself.

There is one more narrow lesson from the later ecosystem work: higher-level bridge tooling only matters if it removes manual pipeline ownership without hiding the backend boundary. A compiler-managed route that emits CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary-shaped code can be useful for experiments, but it does not change the main product rule in this post. MegaCpp still prefers TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample or Triton where they already own the fusion and compile story honestly, and reaches for raw CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample only when that extra control is the actual point.

That split is easier to state now because the upstream tools are more explicit about what they do and do not automate. CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample documents cutlass.pipeline helpers such as PipelineProducer, PipelineConsumer, and TMA-oriented pipeline types, but those are still coordination primitives rather than automatic schedule synthesis. In parallel, TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample's target="cutedsl" backend made the opposite trade visible: keep TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample's higher-level lowering, layout inference, and runtime packaging, then emit CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample only after the compiler has already owned those choices. That is useful for shrinking boilerplate, but it still does not turn raw CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample into the default answer for deep fused kernels. The real ownership question stays the same: who decides stage scheduling, shared-memory layout, and fallback behavior once the kernel leaves its happy path?

The most meaningful future improvement is therefore not "more bespoke wrappers around CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample." It is a cleaner native-operator or dispatcher path that lets a CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary-backed kernel stay legible to the compile stack without pretending the backend boundary disappeared. If a higher-level tool only saves boilerplate while keeping ownership honest, that is useful. If it hides which lane owns scheduling, staging, and fallback behavior, it makes the product story worse rather than better.

The bridge-tooling question is therefore practical, not aspirational. A higher-level compiler route is only interesting if it removes manual stage and layout ownership while still leaving a readable proof surface when the lowering goes wrong. The closest checked-in reading path for that trade is TileLang TMA and H200 reality, TileLang TMA bulk copy 3D SMEM deep dive, and Upstream PRs we wrote for TileLang and Megatron-Core: together they show why compiler-managed lowering is attractive and why it still has to earn trust on real kernels before it replaces a simpler lane.

Where it does not earn its keep, for us:

  1. Replacing a mature TileLang or Triton fused kernel that is already at or near speed-of-light for your shapes. The automatic fusion and pipelining you get for free in those compilers is worth more than the hand-written flexibility CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample gives you.
  2. Writing simple elementwise-plus-reduction kernels, where Triton or Liger is strictly easier to author, test, and debug.

What we kept and what we threw away

We kept the FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample backend, the FlexAttention+FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample wiring with torch.compiler.disable() at the call site, the empirical GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story capability matrix, the 3-GEMM WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron proof as a reference, and the strict version pinning of the CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample stack. We kept the rule that benchmarks include a JIT warmup and that step 0 is excluded from any throughput plot involving a CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel.

We threw away the hand-written CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample fused softcap+causal kernel, the second-attempt grouped GEMM for MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch, the assumption that CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample would beat TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample on the Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode backward, and the inherited belief that GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story was "fully blocked" for CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample. We also threw away the previous mental model of "Triton for everything GPU-side, CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 when Triton can't." CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample pushed us toward a more honest three-tier view: Triton for memory-bound and elementwise-heavy work, TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample for large fused kernels that need automatic pipelining and deep fusion, and CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample for hand-written atoms, GEMMs, and integrating third-party kernels like FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample. CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 C++ is still the escape hatch for things none of the above cover.

The short answer to "is CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample worth it?" is: for specific shapes and for consuming FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample, yes; as a general replacement for Triton or TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample in an established training codebase, no. The learning curve is real, the per-kernel engineering cost is higher than in Triton or TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample, and the ecosystem (especially around torch.compile integration) still has sharp edges that you will have to work around. We kept what earned its place and rewrote the rest back.

# FA4 via CuTe DSL, called from a FlexAttention backend path
import torch
from fa4_cute import fa4_forward

@torch.compiler.disable()
def attn(q, k, v, block_mask):
    # step 0 is excluded from throughput plots; warmup must run first
    return fa4_forward(q, k, v, block_mask=block_mask, causal=True)
FAQ

Frequently asked questions

Where did CuTe DSL clearly earn its keep for MegaCpp?+
Consuming FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. as a ready-made backend and cutting narrow WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.-heavy kernels where Triton or cuBLAS did not already own the shape.
What would materially improve CuTe DSL's case in a mainstream training stack?+
A cleaner native-operator route that lets CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface. kernels stay inside the compile graph without bespoke wrappers or graph breaks. Until that becomes routine, we still treat raw CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang. as a narrow tool for specific kernels and for consuming already-proven backends like FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell..
Why did the Mamba backward port stay out of CuTe DSL?+
Because once the fusion problem crossed into many GEMMs plus staged intermediates, the manual shared-memory and pipeline management cost grew faster than the likely speedup.
Why is step-0 warmup called out so aggressively?+
Because a CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang. kernel’s first JIT compilation can dominate the first measurement. Treating that as steady-state throughput produces fake wins or fake regressions.
What does "WGMMA-backed" mean in this post?+
It means the hot path is built around Hopper or Blackwell warpgroup matrix instructions rather than a simpler compiler-generated matmul. That is the narrow zone where CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang. can be worth the authoring cost, because you get direct control over layout, staging, and instruction shape.
How is CuTe DSL different from TileLang or cuTile Python?+
TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments. and cuTile Python are the compiler-managed side of the world: they can legalize or lower a higher-level tile description for you. CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang. is the explicit-control side where you own more of the staging, swizzles, TMA, and matrix-instruction choices directly. That extra control is why the learning curve and debugging cost are higher.
Why can GB10 run the BF16 receipts while FP16 and FP8 still hit toolchain walls?+
Because the public-safe BF16 lane is the simpler unquantized path. The blocked lanes we kept tripping over were the more specialized mixed-precision compiler and kernel-image paths, so a BF16 pass on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. never meant the stricter FP16 or FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes. route was already healthy.
Does this post claim GB10 has clean tcgen05 or UMMA proof?+
No. The GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof. part of this article is the narrower BF16 warp-MMA + TMA compatibility lane. The stricter tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path. / TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable. / UMMAQuick term guideUMMAA Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths. story belongs to the datacenter-Blackwell evidence path and to the separate GB10 gate articles that stop at the public-safe proof boundary.
Where should I look for the exact tcgen05.alloc, tcgen05.ld, tcgen05.mma, and TMA multicast examples instead of the summary wording here?+
Use minimal TMEM allocation probe source for the alloc-only lane, full GB10 tensor-path probe source for the ld, mma, and multicast probes, and GB10 Blackwell tensor paths for the conservative claim boundary around them.
Which public examples should I open first if I want the concrete proof surfaces?+
Start with Dense FA4 execute proof sample for the narrow FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell. runtime lane, then MegaCpp model wiring examples for the local catalog, then TileLang TMA and H200 reality and TileLang TMA bulk copy 3D SMEM deep dive for the shortest checked-in explanation of where compiler-managed tile DSLs stop and CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang. starts.
Which public example should I open if I want the "TileLang vs CuTe DSL" bridge before this retrospective?+
Use TileLang TMA bulk copy 3D SMEM deep dive together with TileLang TMA bulk-copy SMEM near-copy. That pair is the shortest checked-in side-by-side for the compiler-managed route before reading the broader retrospective.
Which public example should I open if I want the compile-safe FlexAttention seam instead of the broader retrospective?+
Start with FlexAttention compile sample. It keeps CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.-only compilation of the outer FlexAttention surface separate from the optional FLASH or CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface. backend probe, then pair it with Dense FA4 execute proof sample and FA4 receipt summary sample so compile acceptance, execute proof, and backend truth stay separate.
What should I read if I want the higher-level bridge-tooling angle without overcommitting to CuTe DSL?+
Use TileLang TMA and H200 reality, TileLang TMA bulk copy 3D SMEM deep dive, and Upstream PRs we wrote for TileLang and Megatron-Core. That trio is the cleanest checked-in map of the compiler-managed side of the same problem: when automatic lowering helps, where it still fails, and why that still does not turn every fused kernel into a good CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang. candidate.
What changed upstream that made the bridge-tooling story less hand-wavy?+
Two things. CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop. now exposes clearer CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang. pipeline helpers for producer/consumer coordination, which is useful but still leaves pipeline ownership with the kernel author. Separately, TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.'s target="cutedsl" backend shows a compiler-managed route that lowers into CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface. DSL while keeping target selection, runtime packaging, and GEMM-path requirements explicit. That makes bridge tooling more believable for experiments, but it still does not erase the main cost called out in this article: somebody still owns schedule synthesis and shared-memory layout once the fusion gets deep.
Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

CuTe DSL

The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.

CuTe

CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.

CUTLASS

NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.

sm_121a

Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.

tcgen05.alloc

Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.

tcgen05.ld

Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.

tcgen05.mma

The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.

tcgen05

The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.

TMA multicast

The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.

WGMMA

Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.

TMEM

Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.

UMMA

A Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.

FA4

FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.

sm_100

Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.

sm_100a

Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.

Megatron Core

The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.

TileLang

A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.

H200

NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

GB10

Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Mamba3

A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…

MoE

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.