Site search

Search the published MegaCpp archive

Use this page as both a search surface and a discovery map: topic hubs for reading order, term-driven entry points for hard concepts, and the full archive filter underneath.

178 published articles

8 guided topic hubs

Search by topic, term, or proof surface

Looking for the full archive layout instead? Open the blog index.

Start with a reading path

Use the strongest cluster entry points before free-form search

The archive is broad enough that raw keyword search can hide the best entry point. These hubs are the shortest grounded routes into the busiest technical lanes.

Browse all topic hubs

Topic hub

GB10 and Blackwell Bring-Up

178 linked articles

A curated GB10 and Blackwell reading path: consumer-versus-datacenter tensor paths, driver-visible false positives, arch-patch repros, and the serving or precision choices that survived contact with the hardware.

9 curated in-hub reads

Recent entry point

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

April 20, 2026

Topic hub

Modal Training and Benchmark Operations

178 linked articles

A curated Modal reading path: when the rented-GPU surface was useful, what broke on multi-GPU launches, how receipts were recorded, and how we kept the lane debuggable.

7 curated in-hub reads

Recent entry point

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

April 20, 2026

Topic hub

Mamba3 Architecture, Kernels, and Runtime Tradeoffs

178 linked articles

A curated Mamba3 reading path: why MegaCpp kept a hybrid stack, how the kernels evolved across CUDA, TileLang, and TPU, and where the runtime wins actually held.

11 curated in-hub reads

Recent entry point

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

April 20, 2026

Topic hub

MLA Integration, Dispatch, and Weight Absorption

178 linked articles

A curated MLA reading path: the weight-absorption contract, Megatron-safe integration boundaries, dispatch and FP8 edges, and the adapter surfaces that keep MLA connected to the rest of the stack.

9 curated in-hub reads

Recent entry point

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

April 20, 2026

Topic hub

Evaluation, Benchmarks, and Verifier Loops

178 linked articles

A curated evaluation reading path: verifier-first harnesses, ablation structure, benchmark receipts, and the evidence rules that keep comparisons from collapsing into anecdotes.

12 curated in-hub reads

Recent entry point

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

April 20, 2026

Topic hub

Megatron Parallelism and Layout Boundaries

178 linked articles

A curated Megatron reading path: the parallelism map, what actually splits, how NVIDIA and TPU wrappers differ, and the migration surfaces around NAM56R-style layouts.

9 curated in-hub reads

Recent entry point

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

April 20, 2026

Topic hub

TPU Sparse Attention and Pallas Kernels

178 linked articles

A curated TPU sparse-attention reading path: block-sparse contracts, Pallas kernel choices, SPMD sharding, and the runtime surfaces that keep long-context TPU work stable.

9 curated in-hub reads

Recent entry point

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

April 20, 2026

Topic hub

H200 Training and Kernel Bring-Up

178 linked articles

A curated path through the H200 lane: operator bring-up, step-time anatomy, memory pressure, and the NVIDIA kernel surfaces that actually moved the stack.

15 curated in-hub reads

Recent entry point

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

April 20, 2026

Grounded term entry points

Jump from a technical term to the right part of the corpus

These shortcuts are for concepts that show up across the blog but usually need one strong starting document or hub instead of a generic results list.

tcgen05

GB10 / Blackwell proof lane

Tensor-path proofs, `sm_100a` versus `sm_121a`, and the point where tcgen05 evidence stops.

Cross-link: Start with the proof summary

PJRT

TPU/XLA ownership boundaries

libtpu, PJRT, JAX, and Torch/XLA ownership splits with checked-in examples and receipts.

Cross-link: Broader TPU/XLA hub

FSDP2

Megatron and wrapper boundaries

Parallelism surfaces, wrapper seams, and migration paths once FSDP2 enters the stack.

Cross-link: Open the Megatron reading path

Mamba3

Mamba3 architecture and kernels

Hybrid-model rationale, kernel evolution, and cache scaffolds in one reading order.

Cross-link: Model contract first

MLA

MLA systems and dispatch

Weight absorption, adapter boundaries, sparse dispatch, and cache-side consequences in one path.

Cross-link: Architecture-first article

Verifier loop

Evaluation and benchmark evidence

Verifier-first evals, ablation structure, benchmark receipts, and profiler-backed comparisons.

Cross-link: Verifier-first starting point

TileLang

TileLang and TMA reality checks

Blackwell and H200 kernel experiments, TMA bulk-copy constraints, and what survived testing.

Cross-link: Related NVIDIA kernel lane

compile_commands.json

C++ data and semantic indexing

C++ corpus construction, semantic graphs, compile database inputs, and indexed dataset preparation.

Cross-link: Data-pipeline hub

Filter the full archive

Browse the archive

Filter by topic and search the published notes

Narrow the blog by feature family, runtime surface, or keyword.

Need a guided path?

Use the topic hubs for short reading orders across the H200, TPU/XLA, data-pipeline, and MoE lanes.

Open topic hubs

178 results

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

Featured

April 20, 2026

11 min read

GB10

Blackwell

Our GB10 tests show that some Blackwell datacenter-targeted SASS can be accepted and executed on consumer silicon, but they do not prove that the Blackwell Tensor Core Generation 5 matrix-instruction path (tcgen05.mma) physically executes on GB10. Older, stronger claims overstate what the evidence supports.

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

Why Driver-Visible Paths Can Look Like Hardware Support on GB10, Even When Silicon Proof Is Missing

Inside the GB10 Driver Patch Lane: libcuda Tables, Helper Cubins, Linux Hooks, and Why Deeper Patching Still Is Not tcgen05 Proof

Reproducing the sm_100a -> sm_121a Cubin Patch on GB10: CUDA/C++ Code, ELF Edits, and the Exact Point Where tcgen05 Stops

Author Mamba3 spec inside Megatron

Clustered sparse on TPU: the planner stages

Converting parquet token shards into Megatron indexed datasets

DSA and CUDA graph safety

DSA CUDA graph safety deep dive

DSA index-cache patch

DSA indexer memory fix

DSA indexer memory fix deep dive

Fail-closed hybrid pattern translation

GateSkip and FlexiDepth after the router

GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint

How to express a Nemotron-style recipe as pure Megatron CLI

libtpu, PJRT, JAX, and ownership boundaries

Liger FLCE reduction=none

Mamba linear CE parity deep dive

Mamba3 MIMO 3D-to-2D shared-memory deep dive

Mamba3 PsiV cache scaffold

Megatron bin/idx pipeline from parquet token shards

Megatron FLCE on Hopper

Migration policy: native Megatron vs narrow custom seams

NAM56R launch policy

NAM56R Megatron translation

Packed rows as the real training contract

Protobuf, the 2 GB Wall, and Why MegaCpp Prefers Shards Over Giant Messages

Public MLA integration patterns for Megatron

Regional compile without losing the plot

Restoration without git history

Restoring a Megatron training tree without git history

Shared MLA adapter boundaries

Sparse MLA dimension generalization

Sparse MLA FP8 dispatch

TileLang TMA and H200 reality

TileLang TMA bulk-copy 3D shared-memory deep dive

Torch 2.13 on GB10: the serving and training stack we actually chose

Torch 2.12 TPU/XLA breakage matrix: wheel pain, cache misses, and the workarounds that actually mattered

Torch/XLA 2.11 expectations vs TPU reality

vLLM on GB10: the overlay, the registration fixes, and the paths we kept off

What changed after the 10K-step gate: the ablations that stayed honest

Activation checkpointing deep dive: why per-block policies beat one global switch

Activation Checkpointing Policy: The Per-Block Pareto That Held Up

Activation Recompute Boundaries in Hybrid Stacks

Activations and how we split them

The adapter stack: how LoRA, QLoRA, and hot-swap compose MegaCpp specialists

Attention sinks and telemetry on TPU: measure without turning observability into the bug

Attention Validity and Structure-Aware Attention

Block-sparse attention on TPU v6e: block masks, MXU-friendly tiles, and stable contracts

Checkpoint Format and Resume: What We Save, and What We Test

The Clang semantic indexer: translation units, call graphs, and the perf wall

Code Deduplication at Scale: MinHash, LSH, and What a 142-Repo C++ Catalog Actually Looks Like

Communication cost and overlap: NCCL on H200, XLA collectives on TPU v6e

Compile Commands and Semantic Graphs: Why C++ Training Needs Real Build Context

The Compile-Time Tax We Accept for Runtime Speed

Context Parallel and Sequence Parallel: Similar Names, Different Jobs

Building a C/C++ corpus for training: what we keep, what we throw away, and why

Data Enhancements: Why structure IDs, AST features, and the Clang graph earn their cost

The C/C++ Data Preparation Pipeline, End to End