Our honest experience with CuTe DSL
What we tried to build with CuTe DSL, where it held up, where it lost to alternatives, and the chunks we rewrote back to Triton or kept in CUDA.

We spent a meaningful chunk of Q1 2026 learning CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 and Blackwell-class hardware. Some of that work ended up in the tree, most of it did not, and a couple of experiments taught us enough to change how we think about kernel stacks. This post is the honest retrospective, because we kept getting asked "so is CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample worth it?" and the answer really depends on what you are trying to do.
Why this matters
Kernel stack choices are sticky. Once a critical kernel is in CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample, your container builds, your test machines, your CI lanes, and your debugging muscle memory all bend around it. CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample is also the path NVIDIA itself ships its newest reference kernels through (FlashAttention-4 most visibly), so any team running on Hopper or Blackwell ends up with at least a transitive dependency on it whether they meant to or not. The question is no longer whether to learn it; it is which kernels to actually own and which to consume. The shortest checked-in catalog for that decision is MegaCpp model wiring examples, and the shortest local "where does cuTile stop and CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample start?" path is this article together with TileLang TMA and H200 reality and TileLang TMA bulk copy 3D SMEM deep dive.
The other reason it matters: the per-kernel engineering cost is materially higher than the paths in The Triton Kernels We Actually Maintain In-Tree, and the torch.compile integration story still has sharp edges. Misjudging where on the cost curve a particular kernel lives is how you spend a quarter porting something that was already fast.
That is easiest to see when you line this post up with the more execution-focused follow-ups: Flash Attention 4 in practice covers the backend that actually survived into the training stack, Mamba3 kernel journey shows where the Mamba path stayed elsewhere, and GB10 Blackwell tensor paths: what we actually proved is the clearest record of where the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story capability matrix stopped being guesswork.
1. Context and scope
CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample, for our purposes, is the Python surface shipped as nvidia-cutlass-dsl (we used 4.4.1 and 4.4.2) that lets you write CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample-style kernels in Python, JIT-compile them, and call them without an ahead-of-time CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 build. The headline attractions were the FlashAttention-4 integration (flash-attn-4 4.0.0b4, a pure-Python CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel set) and the ability to hand-write WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron-backed GEMMs for shapes that cuBLAS and Triton do not corner well.
Our two concrete workstreams:
- Running FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 and B200 via the CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel, both as a drop-in attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns backend and as a nested kernel inside
flex_attention(kernel_options={"BACKEND": "FLASH"}). - Exploring whether hand-written CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample GEMMs could replace or augment our TileLang fused scan for the Mamba MIMO backward-backward kernel on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 and GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story.
The rest of the post is what actually happened in those two tracks.
First-touch terms
- CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample is NVIDIA's kernel-building stack for tiled GEMMs and related tensor-core paths. CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary is the layout and tiling language inside that world; CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample is the Python surface we actually used. The quickest checked-in decoder if you need to separate CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample/CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary ownership from compiler-managed tile DSLs is this article together with TileLang TMA and H200 reality, TileLang TMA bulk copy 3D SMEM deep dive, and the MegaCpp model wiring examples.
- CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample means explicit ownership over layout, staging, and instruction choice. You can reach TMA, WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron, and similar hardware features directly, but you also inherit the pipeline and shared-memory bookkeeping cost yourself. The checked-in Kernel examples overview and MegaCpp model wiring examples are the local public-safe anchors for where we actually keep that kind of proof surface.
- cuTile Python is a different, higher-level tile DSL. It is useful to keep the names separate because cuTile is compiler-managed while CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample is the "manual control" side of the same vocabulary family. The shortest checked-in bridge between the two ownership models is TileLang TMA and H200 reality, TileLang TMA bulk copy 3D SMEM deep dive, and the MegaCpp model wiring examples.
sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro,sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough,sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough are target labels from NVIDIA's compiler vocabulary.sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro is the baseline Blackwell target name,sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough is the architecture-specific datacenter Blackwell target, andsm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is the architecture-specific GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story consumer target. That split matters here because a CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel that runs on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story is not automatically proving the stronger datacenter-only lane.- WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron means Hopper or Blackwell warpgroup matrix instructions. It is the main reason we tolerated the authoring cost in narrow GEMM-heavy kernels.
- TMA means descriptor-driven bulk async copies between global memory and shared memory. In this article it mostly matters as a staging primitive that CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample can own directly and that TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample can sometimes legalize for you.
tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough / UMMAQuick term guideUMMAA Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.GroundingAbout: GB10 tensor-path proof summary Example: TileLang TMA bulk-copy sample belongs to the datacenter-Blackwell lane tied to TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity, not to the narrow GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story consumer receipts in this article. When we talk about GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story here, the clean public-safe result is the BF16 warp-MMA + TMA lane rather than a publication-gradetcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough proof. If you need the checked-intcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.GroundingAbout: GB10 cubin patch repro Example: GB10 repro walkthrough Example: GB10 gate walkthrough /tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough /tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails and TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 repro walkthrough surfaces instead of this summary, use GB10 Blackwell tensor paths: what we actually proved, minimal TMEM allocation probe source, and full GB10 tensor-path probe source.- FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample means FlashAttention-4, which in our stack mostly matters as a consumer of CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample rather than as a generic label for "fast attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns." The shortest checked-in runtime proof surfaces are Dense FA4 execute proof sample and FA4 receipt summary sample.
If you want the compiler-managed counterpart first, start with TileLang TMA and H200 reality, TileLang TMA bulk copy 3D SMEM deep dive, and the checked-in TileLang TMA bulk-copy SMEM near-copy. If you want the consumer-versus-datacenter Blackwell stop line, use GB10 Blackwell tensor paths: what we actually proved plus the checked-in GB10 gate matrix. That set is the shortest checked-in bridge for deciding when TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample legality should stay compiler-managed versus when CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample should own TMA, WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron, and staging directly.
For the target-name boundary, the exact NVIDIA labels matter in normal prose as much as they do in probes: sm_100Quick term guidesm_100Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 cubin patch repro is the baseline Blackwell name, sm_100aQuick term guidesm_100aDatacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Example: GB10 repro walkthrough is the datacenter Blackwell target, and sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough is the GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story target. The datacenter-oriented tensor path is the separate tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough lane, with tcgen05.allocQuick term guidetcgen05.allocDocumented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.GroundingAbout: GB10 cubin patch repro Example: GB10 repro walkthrough Example: GB10 gate walkthrough, tcgen05.ldQuick term guidetcgen05.ldDocumented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough, tcgen05.mmaQuick term guidetcgen05.mmaThe Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.GroundingAbout: GB10 tensor-path proof summary Example: sm_100a cubin patch repro Reference: GB10 claim-scope guardrails, TMA multicastQuick term guideTMA multicastThe cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Example: GB10 repro walkthrough, TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.GroundingAbout: GB10 tensor-path proof summary Reference: NVFP4 inference on GB10 Reference: GB10 stack parity, and UMMAQuick term guideUMMAA Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.GroundingAbout: GB10 tensor-path proof summary Example: TileLang TMA bulk-copy sample kept out of the broad GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story claim. The same conservative reading applies to FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper here: a BF16 CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample receipt on GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story does not imply the FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper path is healthy.
2. What the learning curve actually costs
Everyone says CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample has a learning curve. It does, and that curve is steeper than people who have already absorbed CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample concepts tend to notice. For an engineer who is competent in Triton, comfortable in CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200, and has never written CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample kernels, the ramp looks roughly like this:
- Week 1: getting a trivial smem-tiled GEMM to compile and run, with correct results at one shape. Most of the time is spent reading CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample documentation to translate "tensor of shape" into a CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary layout with the right strides and swizzles, and understanding which atoms your architecture exposes (for H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 sm_90a that means
warpgroup.MmaF16BF16Opfor WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron). - Week 2: getting the same kernel to run at a second and third shape without regressing, which forces you to understand stage/pipeline parameters and smem lifetime. We got
warpgroup.MmaF16BF16Opcorrectly emitting WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 with a single-GEMM result in the low single-digit microseconds before we were confident we understood what we had written. - Week 3: trying to fuse multiple GEMMs in one launch. The 3-GEMM fusion we built kept K in smem across GEMMs and accumulated into an LKQ tile, exact bf16 match against the reference. It ran at roughly half the wall time of three separate WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron launches: a clean ~1.78x fusion benefit that we can actually measure.
- Week 4+: trying to fuse ten or more GEMMs with elementwise ops and reductions in between. This is where the curve goes vertical. The manual smem-lifetime management grows nonlinearly, and everything we tried past the 3-GEMM fusion started eating working days without delivering measurable speedups.
Concretely: the 3-GEMM fused CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 matched cuBLAS on the same chain. It did not beat it. At small shapes, launch overhead is comparable to compute time, so the fusion eliminates launch overhead but has no headroom to do more. That was a useful result, because it told us exactly where the CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample payoff curve starts and stops.
3. What we kept
- The FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel lane itself, as a first-class backend in the training stack. The checked-in FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample examples and receipts carry the pinned dependency story, and the kernels JIT-compile on first call, which adds a one-time warmup we account for in our throughput measurements.
- The specific FlexAttention + FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample wiring, where the backend is selected via
flex_attention(kernel_options={"BACKEND": "FLASH"})and wrapped intorch.compiler.disable()at the call site. That single wrap is what made the preset lanes compile and run end-to-end without Inductor trying to lower a CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel into its outer compile graph. - The empirical GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story / sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough capability matrix. When we went in, the working assumption was that CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample was "fully blocked" on sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough. Our actual probes found that the bf16 path works; fp16 and fp8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper paths have sharper restrictions that we documented rather than guessed at. That pivot kept us from burning weeks on the wrong assumption.
- The 3-GEMM fused WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron proof on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200. We do not ship that particular fused GEMM in training, but the checked-in kernel examples stay around as a reference for anyone who needs to cut a specific shape that cuBLAS does not serve well. It is the cleanest example in our public tree of a hand-written WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron kernel that is correct at bf16 and inside the ballpark of cuBLAS on a real shape.
4. What we rewrote back
- The fused Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode MIMO bwd_bwd kernel stayed in TileLang. We explored porting it to CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample and hit three structural walls: smem-layout round-tripping (WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron reads swizzled smem;
StMatrixwrites plain layouts, so each inter-GEMM intermediate wants an extra smem-to-smem copy, roughly a few microseconds of overhead per hop), manual stage/pipeline management for ten-plus GEMMs, and the absence of the automatic multi-stage pipelining that TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample provides for free. - A custom CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample fused softcap+causal kernel for H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200. We built a prototype that worked at one shape and lost decisively to FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample with softcap on every other shape. We deleted it.
- A second-attempt CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample grouped GEMM for MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch. cuBLAS grouped GEMMs caught up before we finished, and our Python-level dispatch logic was already eating most of the headroom.
5. Sharp edges we hit
- FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample standalone is not compatible with
torch.compiletoday. Without custom-op registration (upstream RFC filed, not yet landed at the time of the fix), the only supported compile path isflex_attention(BACKEND="FLASH")via PyTorch Inductor.dynamic=Trueis not supported with the FLASH backend. We document that loudly in our backend matrix. - CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample standalone with our pinned layer presets produced numerically divergent backward passes (gnorm in the millions after a handful of steps) on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200. The "passing" FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample lanes that we originally thought were CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample were in fact Triton FlexAttention, because
moba_flex_backendwas unset and defaulted to Triton. Once we forcedmoba_flex_backend="flash"on the same presets, the CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample FLASH kernel diverged. We filed upstream and verified the MHA (non-GQA) path separately to rule out a GQA-only issue. - Version pinning is sharper than for Triton or CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200.
nvidia-cutlass-dsl4.4.1 vs 4.4.2,flash-attn-44.0.0b4, and theapache-tvm-ffi/torch-c-dlpack-extpair have to line up exactly. Mixing wheels from two container builds gave us import-time failures that masqueraded as missing symbols. We therefore keep the exact version story in the checked-in examples and receipts instead of treating those packages as independently upgradeable. - JIT compile warmup is noisy. The first call into any CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel takes noticeably longer than a steady-state call. If you benchmark without a warmup, or if your throughput plot includes step 0, you will get misleading numbers.
- sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.GroundingAbout: GB10 tensor-path proof summary Example: GB10 cubin patch repro Example: GB10 repro walkthrough (GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story) has real, not imagined, restrictions. The bf16 path works; fp16 and fp8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper have sharp edges. We rebuilt the capability matrix from actual probe runs rather than trusting the "fully blocked" priors we inherited.
The compile seam was narrower than the ecosystem slogans made it sound. The checked-in FlexAttention compile sample keeps three facts separate on purpose: compile the outer flex_attention surface only on CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200, probe the optional FLASH or CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary imports up front, and then prove actual execution with Dense FA4 execute proof sample or FA4 receipt summary sample. That split is what kept import success or a preset name from being mistaken for a CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary kernel surviving inside the compiled graph.
The other sharp edge is that "backward diverged" is often too vague to be a useful diagnosis. Once a kernel depends on TMA staging, swizzled shared-memory layouts, and tightly bounded tile shapes, a shape that passes one benchmark does not automatically validate nearby shapes. That is one reason this post keeps coming back to narrow proof surfaces instead of broad ecosystem claims: the failure mode is often in the staging contract, not in the top-line algorithm name.
On Hopper, the fragile part was usually not the attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns formula itself but the staging geometry around it. The nearby shapes that worried us were the ones that pushed against 128-byte TMA alignment, swizzled shared-memory layouts, or barrier timing between a producer refill and the next WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron consumer. That does not prove one universal root cause for every divergence report, but it does explain why one passing benchmark was never enough to bless a whole neighborhood of shapes. The closest checked-in contrast is TileLang TMA bulk copy 3D SMEM deep dive plus TileLang TMA bulk-copy SMEM near-copy: they show the kind of staging contract that compiler-managed lowering can keep consistent for you.
6. Comparisons we actually measured
For the one configuration where we have clean head-to-head numbers, a 3-GEMM chain on H200Quick term guideH200NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.GroundingAbout: training on 8x H200 Reference: H200 memory geometry Reference: training speed anatomy on H200 at small bf16 shapes:
| Variant | Wall time | Notes |
|---|---|---|
| CuTe DSL fused 3-GEMM, one launch | ~36 microseconds | Our hand-written kernel |
| CuTe DSL, three separate WGMMA launches | ~64 microseconds | Same body, no fusion |
torch.bmm / cuBLAS, three launches |
~36 microseconds | Stock baseline |
That is the reality check. Fusion buys ~1.78x by killing launch overhead. Matching cuBLAS at a single fused shape is not a win; it is a tie. The interesting fusion wins start appearing only when you cross ten-plus GEMMs with non-trivial intermediates, which is exactly where CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample's manual smem management collapses unless you have a large engineering budget.
One later ecosystem contrast made that trade easier to explain. The older Triton TMA experiments we reviewed paid a host-side descriptor tax before the useful math even started, because tensor-map setup was still living outside the launched kernel path. CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample and the compiler-managed lane in TileLang TMA and H200 reality avoid that specific setup pattern, which is one reason we stopped treating "has TMA" as proof that two kernel stacks pay the same staging cost.
Numbers we did not chase to a clean comparison: CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample versus TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample for the full Mamba MIMO backward-backward kernel on B200. We stopped early because the structural gap was obvious after the 3-GEMM fusion result and the missing multi-stage pipelining analysis. "n/a" is the honest answer; we did not do the 600-plus line port.
7. Where CuTe DSL earns its keep
After all that, we still think CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample is the right tool in two situations:
- When you need to write a new atom or a GEMM kernel for a shape or dtype combination that cuBLAS and Triton cannot corner well, and you need Hopper or Blackwell features (WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron, TMA, swizzled smem) that Triton does not expose cleanly. For these, Python-level CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample is meaningfully more productive than writing CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample in C++.
- When you are consuming a third-party CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel like FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample that was written by people who have already paid the learning curve cost. In that case you are essentially using it as a prebuilt kernel, and the job is wiring it into your compile graph without triggering inductor lowering of the kernel body itself.
There is one more narrow lesson from the later ecosystem work: higher-level bridge tooling only matters if it removes manual pipeline ownership without hiding the backend boundary. A compiler-managed route that emits CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary-shaped code can be useful for experiments, but it does not change the main product rule in this post. MegaCpp still prefers TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample or Triton where they already own the fusion and compile story honestly, and reaches for raw CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample only when that extra control is the actual point.
That split is easier to state now because the upstream tools are more explicit about what they do and do not automate. CUTLASSQuick term guideCUTLASSNVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.GroundingTileLang TMA bulk-copy sample documents cutlass.pipeline helpers such as PipelineProducer, PipelineConsumer, and TMA-oriented pipeline types, but those are still coordination primitives rather than automatic schedule synthesis. In parallel, TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample's target="cutedsl" backend made the opposite trade visible: keep TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample's higher-level lowering, layout inference, and runtime packaging, then emit CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample only after the compiler has already owned those choices. That is useful for shrinking boilerplate, but it still does not turn raw CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample into the default answer for deep fused kernels. The real ownership question stays the same: who decides stage scheduling, shared-memory layout, and fallback behavior once the kernel leaves its happy path?
The most meaningful future improvement is therefore not "more bespoke wrappers around CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample." It is a cleaner native-operator or dispatcher path that lets a CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.GroundingExample: MegaCpp model wiring example index Reference: TileLang and CuTe boundary-backed kernel stay legible to the compile stack without pretending the backend boundary disappeared. If a higher-level tool only saves boilerplate while keeping ownership honest, that is useful. If it hides which lane owns scheduling, staging, and fallback behavior, it makes the product story worse rather than better.
The bridge-tooling question is therefore practical, not aspirational. A higher-level compiler route is only interesting if it removes manual stage and layout ownership while still leaving a readable proof surface when the lowering goes wrong. The closest checked-in reading path for that trade is TileLang TMA and H200 reality, TileLang TMA bulk copy 3D SMEM deep dive, and Upstream PRs we wrote for TileLang and Megatron-Core: together they show why compiler-managed lowering is attractive and why it still has to earn trust on real kernels before it replaces a simpler lane.
Where it does not earn its keep, for us:
- Replacing a mature TileLang or Triton fused kernel that is already at or near speed-of-light for your shapes. The automatic fusion and pipelining you get for free in those compilers is worth more than the hand-written flexibility CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample gives you.
- Writing simple elementwise-plus-reduction kernels, where Triton or Liger is strictly easier to author, test, and debug.
What we kept and what we threw away
We kept the FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample backend, the FlexAttention+FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample wiring with torch.compiler.disable() at the call site, the empirical GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story capability matrix, the 3-GEMM WGMMAQuick term guideWGMMAHopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.GroundingAbout: GB10 tensor-path proof summary History: GB10 journey Reference: upstream PR: TileLang and Megatron proof as a reference, and the strict version pinning of the CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample stack. We kept the rule that benchmarks include a JIT warmup and that step 0 is excluded from any throughput plot involving a CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample kernel.
We threw away the hand-written CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample fused softcap+causal kernel, the second-attempt grouped GEMM for MoEQuick term guideMoEToken Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.GroundingThe MoE Routing We Actually Shipped Sequence, Context, and Expert Splits in the Hybrid Stack dispatch, the assumption that CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample would beat TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample on the Mamba3Quick term guideMamba3A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…GroundingMamba 3 + Transformers: Why MegaCpp Uses a Hybrid Stack for C++ MegaCpp model glossary: patterns, blocks, and what names like NAM52 and NAM56R encode backward, and the inherited belief that GB10Quick term guideGB10Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.GroundingAbout: GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint About: GB10 tensor-path proof summary History: Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story was "fully blocked" for CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample. We also threw away the previous mental model of "Triton for everything GPU-side, CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 when Triton can't." CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample pushed us toward a more honest three-tier view: Triton for memory-bound and elementwise-heavy work, TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample for large fused kernels that need automatic pipelining and deep fusion, and CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample for hand-written atoms, GEMMs, and integrating third-party kernels like FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample. CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 C++ is still the escape hatch for things none of the above cover.
The short answer to "is CuTe DSLQuick term guideCuTe DSLThe CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.GroundingTileLang TMA bulk-copy sample tcgen05 gate matrix sample worth it?" is: for specific shapes and for consuming FA4Quick term guideFA4FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.GroundingAbout: FA4 catalog on Blackwell About: FlashAttention 4 in practice Example: Dense FA4 execute proof sample, yes; as a general replacement for Triton or TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample in an established training codebase, no. The learning curve is real, the per-kernel engineering cost is higher than in Triton or TileLangQuick term guideTileLangA CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.GroundingAbout: TileLang TMA and H200 reality History: upstream PR: TileLang and Megatron Example: TileLang TMA bulk-copy sample, and the ecosystem (especially around torch.compile integration) still has sharp edges that you will have to work around. We kept what earned its place and rewrote the rest back.
# FA4 via CuTe DSL, called from a FlexAttention backend path
import torch
from fa4_cute import fa4_forward
@torch.compiler.disable()
def attn(q, k, v, block_mask):
# step 0 is excluded from throughput plots; warmup must run first
return fa4_forward(q, k, v, block_mask=block_mask, causal=True)
Frequently asked questions
Where did CuTe DSL clearly earn its keep for MegaCpp?+
What would materially improve CuTe DSL's case in a mainstream training stack?+
Why did the Mamba backward port stay out of CuTe DSL?+
Why is step-0 warmup called out so aggressively?+
What does "WGMMA-backed" mean in this post?+
How is CuTe DSL different from TileLang or cuTile Python?+
Why can GB10 run the BF16 receipts while FP16 and FP8 still hit toolchain walls?+
sm_121aQuick term guidesm_121aConsumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro. never meant the stricter FP16 or FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes. route was already healthy.Does this post claim GB10 has clean tcgen05 or UMMA proof?+
tcgen05Quick term guidetcgen05The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path. / TMEMQuick term guideTMEMBlackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable. / UMMAQuick term guideUMMAA Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths. story belongs to the datacenter-Blackwell evidence path and to the separate GB10 gate articles that stop at the public-safe proof boundary.Where should I look for the exact tcgen05.alloc, tcgen05.ld, tcgen05.mma, and TMA multicast examples instead of the summary wording here?+
ld, mma, and multicast probes, and GB10 Blackwell tensor paths for the conservative claim boundary around them.Which public examples should I open first if I want the concrete proof surfaces?+
Which public example should I open if I want the "TileLang vs CuTe DSL" bridge before this retrospective?+
Which public example should I open if I want the compile-safe FlexAttention seam instead of the broader retrospective?+
What should I read if I want the higher-level bridge-tooling angle without overcommitting to CuTe DSL?+
What changed upstream that made the bridge-tooling story less hand-wavy?+
target="cutedsl" backend shows a compiler-managed route that lowers into CuTeQuick term guideCuTeCUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface. DSL while keeping target selection, runtime packaging, and GEMM-path requirements explicit. That makes bridge tooling more believable for experiments, but it still does not erase the main cost called out in this article: somebody still owns schedule synthesis and shared-memory layout once the fusion gets deep.Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
The CUTLASS Python / CuTe DSL surface used for low-level tensor-program experiments and comparisons with TileLang.
CUTLASS's tensor-expression building block that underlies the more explicit CuTe DSL programming surface.
NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.
Consumer Blackwell cubin target used by GB10/DGX Spark and the patched destination in the public arch-field repro.
Documented tcgen05 allocation-side instruction family member used by the checked-in GB10 allocation probe.
Documented Tensor Memory load-side instruction in the tcgen05 family, kept separate from alloc and mma in the checked-in GB10 probes.
The Blackwell tcgen05 matrix-multiply-accumulate instruction family. On GB10, the public evidence still stops before a clean execution-grade proof.
The Blackwell tensor-generation instruction family that covers alloc, load, and mma paths beyond the older dense consumer path.
The cluster-scoped cp.async.bulk.tensor multicast form that attempts one tensor copy into shared memory of multiple CTAs in a cluster.
Hopper's warpgroup matrix-multiply path between the older mma.sync lane and Blackwell's tcgen05 family.
Blackwell tensor-memory scratch storage used by datacenter-oriented tensor paths; the public GB10 evidence treats it as unavailable.
A Blackwell unified-mma family referenced in the hardware/software notes alongside tcgen05-era tensor paths.
FlashAttention 4 family and dense-attention catalog used as an execution-validated comparison point on Blackwell.
Baseline Blackwell compiler target name in NVIDIA's architecture vocabulary, distinct from the architecture-specific and family-specific targets used elsewhere in the GB10 lane.
Datacenter Blackwell cubin target used by GB100/B200-class paths and by the source cubins in the public GB10 arch-patch repro.
The NVIDIA framework surface MegaCpp ports into through narrow adapters, layer specs, and runtime ownership bridges.
A CUDA kernel DSL/compiler surface used here for explicit tile layouts, shared-memory legality fixes, and TMA-oriented kernel experiments.
NVIDIA's Hopper H200 GPU platform, typically discussed here as an 8-GPU training node with large HBM capacity and NVLink-connected ranks.
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.
Consumer Grace Blackwell GB10 / DGX Spark bring-up lane used to separate driver-visible gates, patched cubin signals, and real execution proof.
Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
A grounded look at why MegaCpp combines Mamba-style state-space blocks with a smaller number of attention blocks for long-context C++ work, and…
Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.