MegaCpp Blog

Engineering Blog

Training systems, kernels, data pipelines, and production notes

Technical articles on training systems, kernels, TPU and GPU bring-up, data pipelines, and the work required to build AI for C++ engineers.

178

Published posts

H200

GPU bring-up lane

TPU

PJRT and XLA track

SLM

Specialist model family

Training, serving, and systems notes

Archive order follows published dates from the generated article metadata.

Technical deep dives on training, serving, data preparation, and model systems for C++ specialists.

Curated reading paths

Explore the archive by topic hub

Follow the H200, TPU/XLA, data-pipeline, and MoE lanes through short reading orders instead of relying on raw publish date alone.

Open topic hubs

Browse the archive

Filter by topic and search the published notes

Narrow the blog by feature family, runtime surface, or keyword.

Need a guided path?

Use the topic hubs for short reading orders across the H200, TPU/XLA, data-pipeline, and MoE lanes.

Featured

April 20, 2026 · 11 min read

GB10

Blackwell

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

Our GB10 tests show that some Blackwell datacenter-targeted SASS can be accepted and executed on consumer silicon, but they do not prove that the Blackwell Tensor Core Generation 5 matrix-instruction path (tcgen05.mma) physically executes on GB10. Earlier, stronger claims overstate what the evidence supports.

What Our GB10 Experiments Actually Prove About Blackwell Consumer vs Datacenter Tensor Paths

Why Driver-Visible Paths Can Look Like Hardware Support on GB10, Even When Silicon Proof Is Missing

Inside the GB10 Driver Patch Lane: libcuda Tables, Helper Cubins, Linux Hooks, and Why Deeper Patching Still Is Not tcgen05 Proof

Reproducing the sm_100a -> sm_121a Cubin Patch on GB10: CUDA/C++ Code, ELF Edits, and the Exact Point Where tcgen05 Stops

Author Mamba3 spec inside Megatron

Clustered sparse on TPU: the planner stages

Converting parquet token shards into Megatron indexed datasets

DSA and CUDA graph safety

DSA CUDA graph safety deep dive

DSA index-cache patch

DSA indexer memory fix

DSA indexer memory fix deep dive

Fail-closed hybrid pattern translation

GateSkip and FlexiDepth after the router

GB10 Stack Parity for MegaCpp: Torch 2.13 cu132, GCC 15, CUDA 13.2, and the Nightly Constraint

How to express a Nemotron-style recipe as pure Megatron CLI

libtpu, PJRT, JAX, and ownership boundaries

Liger FLCE reduction=none

Mamba linear CE parity deep dive

Mamba3 MIMO 3D-to-2D shared-memory deep dive

Mamba3 PsiV cache scaffold

Megatron bin/idx pipeline from parquet token shards

Megatron FLCE on Hopper

Migration policy: native Megatron vs narrow custom seams

NAM56R launch policy

NAM56R Megatron translation

Packed rows as the real training contract

Protobuf, the 2 GB Wall, and Why MegaCpp Prefers Shards Over Giant Messages

Public MLA integration patterns for Megatron

Regional compile without losing the plot

Restoration without git history

Restoring a Megatron training tree without git history

Shared MLA adapter boundaries

Sparse MLA dimension generalization

Sparse MLA FP8 dispatch

TileLang TMA and H200 reality

TileLang TMA bulk-copy 3D shared-memory deep dive

Torch 2.13 on GB10: the serving and training stack we actually chose

Torch 2.12 TPU/XLA breakage matrix: wheel pain, cache misses, and the workarounds that actually mattered

Torch/XLA 2.11 expectations vs TPU reality

vLLM on GB10: the overlay, the registration fixes, and the paths we kept off

What changed after the 10K-step gate: the ablations that stayed honest

Activation checkpointing deep dive: why per-block policies beat one global switch

Activation Checkpointing Policy: The Per-Block Pareto That Held Up

Activation Recompute Boundaries in Hybrid Stacks

Activations and how we split them

The adapter stack: how LoRA, QLoRA, and hot-swap compose MegaCpp specialists

Attention sinks and telemetry on TPU: measure without turning observability into the bug

Attention Validity and Structure-Aware Attention

Block-sparse attention on TPU v6e: block masks, MXU-friendly tiles, and stable contracts

Checkpoint Format and Resume: What We Save, and What We Test

The Clang semantic indexer: translation units, call graphs, and the perf wall

Code Deduplication at Scale: MinHash, LSH, and What a 142-Repo C++ Catalog Actually Looks Like

Communication cost and overlap: NCCL on H200, XLA collectives on TPU v6e

Compile Commands and Semantic Graphs: Why C++ Training Needs Real Build Context

The Compile-Time Tax We Accept for Runtime Speed

Context Parallel and Sequence Parallel: Similar Names, Different Jobs

Building a C/C++ corpus for training: what we keep, what we throw away, and why

Data Enhancements: Why structure IDs, AST features, and the Clang graph earn their cost

The C/C++ Data Preparation Pipeline, End to End

C++ Data Versioning and Schema: How to Keep Training Rows Stable While the Corpus Evolves

The C++ Eval Suites, Verifiers, and the Compile-Then-Test Wall

Inside the MegaCpp C++ tokenizer: fixed vocab, BPE, and per-specialist sub-vocabs

CPU Offload and Startup Memory Calibration on H200 and GB10

Our honest experience with CuTe DSL

Building the C++ Training Data Pipeline: What Worked, What Broke

Data Poisoning Drills and Refusal Behavior for the MegaCpp Specialists

Data Shuffling and Seed Discipline

Dataloader throughput and stalls: making the input pipeline a first-class perf concern

Dataset Versions v2 to v6: The Long-Form Ablation History

v2 to v6: Four Generations of the C++ Dataset, and Why We Kept Them All

Determinism and bit-exact runs: what we guard and where we accept drift

Distillation, best-of-N, and verifier-grounded RL in the post-training loop

Distributed Optimizer Stress: Drift, All-Gather vs Reduce-Scatter, and Muon Gotchas

Document masking and the curriculum: what to feed each specialist first

DualPipe and 3D Parallelism on H200 and GB10