Building the MegaCpp Training Corpus: Data, Tokenization, and Document Masking
How MegaCpp curates C++ data for eight specialist models, tokenizes it for long-context training, and prevents cross-document leakage during sequence packing.

MegaCpp is not one model; it is a family of eight C++ specialists that share a tokenizer, a data pipeline, and a curriculum, but diverge in which slice of the corpus they consume most heavily. The quality ceiling of every specialist is set long before a single optimizer step runs — it is set in the data preparation stack. This post walks through that stack end to end: which repositories feed the corpus, how we turn raw commits into uint16/uint32 token streams, how we pack them into 4K, 16K, and 64K sequences without letting documents contaminate each other, and how the eight specialists are differentiated inside one shared dataset.
Why C++ Is Hard to Feed a Model
Unlike natural language, a C++ training example rarely stands alone. A single modified function depends on base classes three headers away, on template instantiations in a different translation unit, and on a call graph that is only knowable after the compiler has walked compile_commands.json. Naively shuffling source files into a tokenizer would teach the model to autocomplete syntax but not to reason about a repository. The MegaCpp pipeline therefore layers four dataset versions, from "pure diff" to "full Clang semantic graph", and trains the model on progressively longer contexts so local syntax is learned before global reasoning is demanded of it.
The Eight Specialists
The operational corpus — the one actually wired into the Megatron launchers — is eight public C/C++ repositories, cloned shallow and pinned to explicit tags. Each specialist is defined by which region of that corpus it over-samples during fine-tuning, while the base model sees all eight. The canonical list is documented in [data_preparation.md]:
llvm/llvm-project at llvmorg-19.1.0 — the compilers/toolchain specialist. Modern C++17/20 idioms, pass infrastructure, IR manipulation.
boostorg/boost at boost-1.86.0 with submodules — the template-heavy specialist. Expression templates, SFINAE, CRTP-grade metaprogramming.
torvalds/linux at v6.10 — the systems/C specialist. Kernel patterns, macro-driven dispatch, RCU, lockless data structures.
fmtlib/fmt at 11.0.0 — the "small, high-quality C++" specialist. Compact API design, constexpr, zero-cost abstraction patterns.
google/googletest at v1.15.0 — the testing specialist. Fixture/macro patterns, death tests, mocking idioms.
abseil/abseil-cpp at tip — the Google-commons specialist. absl:: containers, synchronization primitives, Flags/Status.
facebook/folly at tip — the high-performance C++ specialist. Futures, coroutines, lock-free queues.
grpc/grpc at v1.67.0 — the large-service specialist. Cross-language glue, async state machines, codegen consumers.
The rationale for exactly these eight is pragmatic: license-clean (Apache 2.0, BSL-1.0, GPL-2.0 headers, MIT), no credentials needed, combined ~15 GB after shallow clone, and collectively they cover the shapes of C++ that production teams actually ship — low-level kernel C, heavy-template generic C++, modern application C++, and service-framework C++. We deliberately keep the operational list small so the data pipeline is reproducible on a single workstation.
That operational list is a subset of a much larger catalog we track for future specialists and corpus expansion. The extended catalog in [cpp-training-corpus-repos.md] enumerates 142 repositories across 16 categories — OS kernels (Linux, FreeBSD, XNU, seL4, Zephyr), compilers and runtimes (GCC, CPython, Ruby, V8, LuaJIT, musl), databases (PostgreSQL, SQLite, RocksDB, DuckDB, ClickHouse), networking stacks (curl, nginx, HAProxy, gRPC, ZeroMQ), browsers, game engines (Unreal, CryEngine, PhysX), the GNOME and KDE ecosystems, ML/scientific libraries, crypto, and embedded RTOSes. Each entry is tagged by on-disk size bucket (S/M/L/H) so we can budget ingestion. The catalog also documents the awkward sources — SQLite's Fossil repo, Chromium/V8/Fuchsia on googlesource, VLC/x264 on VideoLAN GitLab, Unreal requiring an Epic-linked GitHub account ([cpp-training-corpus-repos.md]) — so a future corpus expansion does not re-discover the same infrastructure traps.
Tokenizer
All eight specialists share one tokenizer. It has a 131 072-token vocabulary — large enough that uint16 would overflow, which is why the Megatron .bin files are written in uint32 ([data_preparation.md]). The implementation is a hybrid: a fixed hand-curated C++ vocabulary (keywords, operators, common stdlib identifiers) merged with a learned BPE layer using BERT-style whitespace handling, implemented in nanochat/cpp_tokenizer.py on the HuggingFace tokenizers backend ([data_preparation.md]).
The fixed-vocab half matters for C++: tokens like std::, ->, ::, constexpr, template<, #include get stable single-token representations instead of being shattered across subword merges. That keeps attention maps interpretable and helps the model learn structural regularities early. The BPE half absorbs identifiers, literals, and comments. The artifact (tokenizer.json, ~2.2 MiB) is not vendored into the cppmega repo — it is owned by the nanochat checkout and copied into ${MEGACPP_DATA_ROOT}/tokenizer/tokenizer.json during stage 2 of the pipeline. Every checkpoint must therefore be paired with the nanochat commit hash that produced its tokenizer, otherwise decoding silently drifts.
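To make the fixed-vocab/BPE split concrete, here is a toy longest-match pre-tokenizer in pure Python. This is an illustration of the idea, not the nanochat implementation; the FIXED_VOCAB excerpt is hypothetical, and the real hybrid runs on the HuggingFace tokenizers backend.

```python
# Illustrative sketch (not the nanochat implementation): a greedy longest-match
# pass that reserves fixed C++ tokens before any BPE merges would run.
# FIXED_VOCAB is a hypothetical excerpt of the hand-curated table.
FIXED_VOCAB = ["#include", "template<", "constexpr", "std::", "::", "->"]
_FIXED = sorted(FIXED_VOCAB, key=len, reverse=True)  # prefer the longest match

def pretokenize(text: str) -> list[str]:
    """Split text into fixed tokens and residual pieces (BPE absorbs the rest)."""
    out, i = [], 0
    while i < len(text):
        for tok in _FIXED:
            if text.startswith(tok, i):
                out.append(tok)
                i += len(tok)
                break
        else:
            # No fixed token here: emit one char; a real implementation would
            # accumulate a residual span and hand it to the learned BPE layer.
            out.append(text[i])
            i += 1
    return out
```

The point the sketch makes: `std::` and `->` survive as atomic units no matter what identifiers surround them, which is what keeps their representations stable across contexts.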
A detail worth calling out: the tokenizer also prepends a BOS token to every document. We use that downstream for document-boundary inference in attention masking — see the doc-masking section.
The Five-Stage Pipeline
The MegaCpp data build is orchestrated by scripts/data/prepare_data.sh and has five stages ([data_preparation.md]):
Stage 1 — download. prepare_download_megacpp.sh shallow-clones the eight repos into ${MEGACPP_DATA_ROOT}/cpp_raw/. Pinned refs are baked into the script so re-running on a fresh machine produces the same tree up to upstream retagging. It is idempotent; existing directories are skipped.
Stage 2 — tokenize. prepare_tokenize_megacpp.py dispatches to nanochat/scripts/data/run_clang_pipeline.sh, which runs libclang-based semantic indexing over each project, emits enriched JSONL at one document per semantic chunk (≤4096 tokens), tokenizes with the hybrid BPE, streams into parquet shards of 50 000 docs each plus a val_shard.parquet, and drops a _COMPLETE sentinel ([data_preparation.md]). Output lands in ${MEGACPP_DATA_ROOT}/parquet/clang_semantic_4k_v10/. We deliberately delegate to nanochat rather than vendoring the clang indexer — it is several thousand lines of maintained upstream code pulling in libclang bindings, and duplicating it would be worse than a path dependency.
Stage 3 — format. prepare_format_megacpp.py converts parquet into Megatron's .bin/.idx format. The .bin is flat packed token IDs in uint32; the .idx is an MMIDIDX\x00\x00-magic header plus sizes, pointers, and a doc index matching Megatron-core's MMapIndexedDataset reader ([data_preparation.md]). Implementation prefers megatron.core.datasets.indexed_dataset.IndexedDatasetBuilder and falls back to a raw writer when megatron-core is absent.
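A simplified sketch of what the fallback raw writer does, under the assumed layout described above (9-byte magic, uint64 version, 1-byte dtype code, two uint64 counts, then sizes/pointers/doc-index arrays). The dtype code constant is illustrative; the authoritative table lives in megatron-core's indexed_dataset module, and the real path prefers IndexedDatasetBuilder.

```python
import struct
import numpy as np

MAGIC = b"MMIDIDX\x00\x00"
DTYPE_CODE_UINT32 = 7  # assumption for illustration, not the canonical code

def write_bin_idx(prefix: str, docs: list[np.ndarray]) -> None:
    """Write flat uint32 tokens to <prefix>.bin and a matching <prefix>.idx."""
    sizes = np.array([len(d) for d in docs], dtype=np.int32)
    # Byte offset of each document inside the flat .bin stream (4 bytes/token).
    pointers = np.zeros(len(docs), dtype=np.int64)
    np.cumsum(sizes[:-1].astype(np.int64) * 4, out=pointers[1:])
    doc_idx = np.arange(len(docs) + 1, dtype=np.int64)
    with open(prefix + ".bin", "wb") as f:
        for d in docs:
            f.write(d.astype(np.uint32).tobytes())
    with open(prefix + ".idx", "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<Q", 1))                  # format version
        f.write(struct.pack("<B", DTYPE_CODE_UINT32))  # token dtype code
        f.write(struct.pack("<Q", len(sizes)))         # number of sequences
        f.write(struct.pack("<Q", len(doc_idx)))       # number of documents + 1
        f.write(sizes.tobytes())
        f.write(pointers.tobytes())
        f.write(doc_idx.tobytes())
```

The sizes/pointers split is what lets the reader memmap the .bin and seek to any document without scanning.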
Stage 4 — cache. prepare_cache_megacpp.py memmaps the .bin/.idx and reports docs, tokens, dtype. We intentionally do not pre-build Megatron's GPTDataset sample index here, because that index is a function of --seed, --seq-length, --global-batch-size, and --train-iters and Megatron rebuilds it in seconds at the first training launch ([data_preparation.md]). Pre-caching would be a fragility tax.
Stage 5 — verify. verify_dataset_megacpp.py confirms both files exist and are non-empty, parses the index, asserts max(token_id) < 131072, and prints the first 64 tokens of document 0. It exits non-zero on any failure — no silent fallbacks, because an undetected vocab mismatch poisons every downstream checkpoint.
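The core of that stage-5 check fits in a few lines. A minimal sketch (the real verify_dataset_megacpp.py also parses the index and checks both files):

```python
import numpy as np

VOCAB_SIZE = 131_072  # matches the shared tokenizer's vocabulary

def verify_bin(path: str) -> None:
    """Fail loudly on an empty .bin or any out-of-vocab token id."""
    tokens = np.memmap(path, dtype=np.uint32, mode="r")
    assert tokens.size > 0, f"{path} is empty"
    max_id = int(tokens.max())
    assert max_id < VOCAB_SIZE, f"vocab overflow: {max_id} >= {VOCAB_SIZE}"
    print(f"{tokens.size} tokens, first 64: {tokens[:64].tolist()}")
```

Because memmap never loads the whole file, this stays cheap even on multi-GB binaries; the max() scan is the only full pass.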
Four Dataset Versions, One Corpus
Within that pipeline, the corpus exists at four increasing levels of structural enrichment, documented in [DATA_GENERATION_STATUS_EN.md]:
v2 and v3 — the base datasets. v2 extracts full files pre- and post-commit; v3 is a structured inline-diff view where removed lines are emitted as C++ comments (// Removed: ...) and added lines as live code, under a synthesized file header. Two header styles exist: v3_doxygen uses Javadoc-style /** @file ... @brief ... */ headers, v3_simple uses plain // File: ... comments. [training_data_examples.md] shows a concrete Abseil commit rendered in both. The raw v2/v3 archives live at /home/dave/commit_chains_new/*.jsonl.gz (~1.6 TB, 27.6 M documents) and the tokenized uint16 binaries at /home/dave/final_bin_data/. This layer teaches the model what a diff looks like and what "before/after" means in source form.
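A toy rendering of the v3_simple style makes the inline-diff view concrete. The function below is illustrative only; field names and the exact header text are assumptions, not the pipeline's schema.

```python
# Toy v3_simple-style rendering: removed lines become "// Removed:" comments,
# added lines appear as live code, under a plain "// File:" header.
# This sketch is illustrative; the real pipeline's schema differs in detail.
def render_v3_simple(path: str, removed: list[str], added: list[str]) -> str:
    lines = [f"// File: {path}"]
    lines += [f"// Removed: {line}" for line in removed]
    lines += added
    return "\n".join(lines)
```

For example, a one-line rename renders as a `// Removed:` comment immediately followed by its replacement, so the model sees before and after in one token stream.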
v4 — the Tree-sitter context graph. For each modified function we build a strict 64K window containing the target plus its direct callers and direct callees, extracted with a Tree-sitter AST walker in Rust. An early bug emitted empty files because of a JSON schema deserialization mismatch; that is fixed, the Rust binary is recompiled, and the pipeline now saturates 40 cores writing into /mnt/v4_data/v4_context_graph_output/v4_extracted/ ([DATA_GENERATION_STATUS_EN.md]). v4 is an approximate graph — cheap and fast, good enough for 16K/64K curriculum.
v5 — the Clang semantic graph. v5 reaches 100 %-accurate semantic relationships by driving Clang with each project's compile_commands.json and walking git history incrementally. It runs on a dedicated GKE cluster (v5-clang-cluster, 50 v5-clang-worker pods). The deployment unblocked after the node service account 8067557205-compute@developer.gserviceaccount.com was granted roles/artifactregistry.reader, which cleared an ImagePullBackOff on the heavy LLM-toolchain image ([DATA_GENERATION_STATUS_EN.md]). v5 is the ground-truth layer — slower, but it is what the long-context specialists learn repository reasoning from.
v6 — the enriched parquet. v6 does not replace the text; it augments each record with structural metadata emitted by the Rust cpp-chunker, so the model can learn code structure via backpropagation rather than by inference ([training_data_examples.md]). Added columns are structure_ids (per-character category, one of nine: other, preamble, func_sig, func_body, class_decl, class_member, comment, typedef, namespace), chunk_boundaries ({char_offset, kind, name, dep_level, is_leaf}), call_edges ({caller_idx, callee_idx}), and type_edges ({type_idx, user_idx}). The text column is unchanged, so the format is backwards compatible: a naive dataloader ignores the extra columns and sees flat text, while the structure-aware dataloader feeds structure_ids as input embeddings and call_edges/type_edges as learned relation bias in attention (Variant C of the design). Output will live at gs://nanochat-training-data-2026/data/cpp_enriched_16k/ and cpp_enriched_64k/.
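The embedding half of that structure-aware path can be sketched as follows. Dimensions and class names here are illustrative; the actual Variant C design also feeds call_edges/type_edges as relation bias inside attention, which this sketch omits.

```python
import torch
import torch.nn as nn

NUM_STRUCTURE_KINDS = 9  # other, preamble, func_sig, func_body, class_decl, ...

class StructureAwareEmbedding(nn.Module):
    """Illustrative input layer: token embeddings plus a learned embedding
    of the nine structure_ids categories, summed elementwise."""

    def __init__(self, vocab_size: int = 131_072, d_model: int = 64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.struct = nn.Embedding(NUM_STRUCTURE_KINDS, d_model)

    def forward(self, input_ids: torch.Tensor, structure_ids: torch.Tensor) -> torch.Tensor:
        # A structure-unaware dataloader can pass all-zero structure_ids and
        # recover plain token embeddings plus a constant learned bias.
        return self.tok(input_ids) + self.struct(structure_ids)
```

This is what backwards compatibility buys: the same checkpoint architecture degrades gracefully when the extra columns are absent.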
Curriculum: 4K → 16K → 64K
Training proceeds in four phases of progressively longer context, mapped to those four dataset versions in [corpus_curriculum_mapping.md].
Phase 1 (4K context) is syntax mastery. The model is fed v2_simple, v3_simple, v2_doxygen, and v3_doxygen as pre-tokenized .bin files, packed into 4096-token sequences by dataloader.py via memmap over the uint16 binaries. No deep call graphs — just dense code-plus-diff-plus-comment at short range. The point is to learn C++ as a language before we ask it to understand a project.
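The packing step in that phase reduces to slicing fixed windows off a memmapped stream. A simplified sketch in the spirit of dataloader.py (shuffling, epoch handling, and label shifting omitted):

```python
import numpy as np

SEQ_LEN = 4096

def iter_packed(bin_path: str, seq_len: int = SEQ_LEN):
    """Yield contiguous seq_len-token windows from a flat uint16 binary."""
    stream = np.memmap(bin_path, dtype=np.uint16, mode="r")
    n_full = stream.size // seq_len  # drop the ragged tail
    for i in range(n_full):
        yield np.asarray(stream[i * seq_len : (i + 1) * seq_len])
```

Because the windows ignore document boundaries, this is exactly the "sequence packing" whose contamination risk the doc-masking section below addresses.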
Phase 2 (16K context) is file-level reasoning. We reuse v4_context_graph but truncate the loader to 16 384 tokens. Because the v4 packing algorithm places the most critical context (target → direct callers → direct callees) nearest the modification, the 16K window captures file-local and immediate cross-file dependencies without wasting tokens on far-away code.
Phase 3 (64K context) is full repository-graph reasoning. Here we use full v4_context_graph plus v5_clang_graph, with max_seq_len=65536. Up to 64 000 tokens of heavily interconnected C++ — callers-of-callers, base interfaces, template instantiation chains — are injected immediately before the target modification so the model can trace variables, inheritance, and side effects across many files.
Phase 4 is structure-aware training and can overlap Phases 2–3 because v6_enriched is backwards compatible. The same compilable C++ is served with structure_ids, chunk_boundaries, call_edges, and type_edges so the model receives structure and dependency-level embeddings at the input layer and learned relation bias in attention.
Eight specialists then differentiate off this shared foundation by weighted over-sampling: the LLVM-heavy specialist up-weights llvm-project shards, the Boost specialist up-weights Boost shards, and so on. The base model sees everything uniformly.
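The over-sampling scheme can be sketched as a reweighting of per-repo shard counts. The boost factor and the helper below are illustrative, not the launcher's actual sampling code:

```python
# Illustrative sketch of specialist differentiation: multiply the home repo's
# corpus share by a boost factor, then renormalize to a sampling distribution.
def specialist_weights(shard_counts: dict[str, int],
                       home_repo: str,
                       boost: float = 4.0) -> dict[str, float]:
    raw = {repo: n * (boost if repo == home_repo else 1.0)
           for repo, n in shard_counts.items()}
    total = sum(raw.values())
    return {repo: w / total for repo, w in raw.items()}
```

With boost=1.0 this collapses to the uniform distribution the base model trains on, which is the property that lets all nine runs share one corpus.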
Document Masking: Why Packing Is Not Free
At 4K context the naive trick of concatenating short documents into one fixed-length sequence — "sequence packing" — is nearly free. At 16K and 64K it becomes catastrophic unless you mask document boundaries. Without masking, a token in Document B can causally attend to tokens in Document A just because they share a packed sequence. The model then hallucinates dependencies between unrelated files, and every gain from long-context training is eaten by cross-document contamination ([doc_masking_design_en.md]).
Our solution is a doc_ids tensor aligned with input_ids, where tokens inside the same document share an id and tokens in different documents do not. Rather than storing doc_ids in the parquet, we infer it on-the-fly from the BOS token that the tokenizer already prepends to every document:
doc_ids = torch.cumsum(input_ids == BOS_TOKEN_ID, dim=1) - 1
This requires zero changes to the data pipeline, runs in O(T), and can be computed in the dataloader or at the top of GPT.forward() ([doc_masking_design_en.md]).
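The rule the mask must enforce is easiest to see in its dense form: position j may attend to position i only if i ≤ j and both positions carry the same doc id. A sketch, with an illustrative BOS id (this is the predicate a FlexAttention mask_mod expresses per (q_idx, kv_idx) pair; the dense tensor is for exposition, not for 64K production use):

```python
import torch

BOS_TOKEN_ID = 0  # illustrative id

def doc_causal_mask(input_ids: torch.Tensor) -> torch.Tensor:
    """Return (B, T, T) bool mask, True where attention is allowed."""
    doc_ids = torch.cumsum(input_ids == BOS_TOKEN_ID, dim=1) - 1
    same_doc = doc_ids[:, :, None] == doc_ids[:, None, :]
    T = input_ids.size(1)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=input_ids.device))
    return same_doc & causal
```

Positions where this mask is False are exactly the ones whose pre-softmax scores must be −∞.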
From there, the mask is routed per attention backend. On CUDA with PyTorch ≥ 2.5 the primary path is FlexAttention: a custom mask_mod combines causal masking with document boundaries, and the resulting BlockMask is block-sparse so entire all-False blocks are skipped — minimal MFU overhead even at 64K. Softcapping composes naturally as a score_mod. The alternative CUDA path is Flash Attention 3 varlen: convert doc_ids to cu_seqlens, unpad (B, T, H, D) into a flat (total_tokens, H, D), call flash_attn_varlen_func, and re-pad the output. FA2 supports the same signature. For small contexts a 2D SDPA mask fallback exists, but it is memory-bound — a 64K 2D mask is 4 GB per sample, so we cap SDPA at T ≤ 8192. On TPU both Pallas FlashAttention (q_segment_ids/kv_segment_ids) and JAX Splash Attention (SegmentIds plus fused attn_logits_soft_cap) are supported; torch_xla hides the Pallas details.
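The varlen handoff in the FA3 path amounts to converting doc_ids into a cumulative-lengths vector. A sketch for one packed sequence (the int32 prefix-sum shape is what flash_attn_varlen_func consumes):

```python
import torch

def doc_ids_to_cu_seqlens(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (T,) non-decreasing ids for one packed sequence.
    Returns cumulative sequence lengths, starting at 0 and ending at T."""
    boundaries = torch.nonzero(doc_ids[1:] != doc_ids[:-1]).flatten() + 1
    zero = torch.zeros(1, dtype=torch.int32)
    end = torch.tensor([doc_ids.numel()], dtype=torch.int32)
    return torch.cat([zero, boundaries.to(torch.int32), end])
```

Each adjacent pair in the result delimits one document inside the packed sequence, which is what lets the kernel treat the batch as a ragged set of independent sequences.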
Mamba blocks need extra care. The SSM hidden state would leak across document boundaries just as attention would, so we compute reset_mask = (doc_ids[:, 1:] != doc_ids[:, :-1]) and zero the SSM carry at each boundary. On XLA, where the scan is a compiled loop and we cannot stop-and-restart mid-sequence, we fold the reset into the scan body as a multiplicative mask on the carry state. The kernel_size=4 conv1d at the head of each Mamba block also leaks, because each output depends on the three preceding tokens; we mask the conv1d input buffer at boundary positions so those three tokens cannot belong to a prior document ([doc_masking_design_en.md]).
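The multiplicative-reset trick can be shown on a toy scalar scan. The decay constant stands in for the real SSM dynamics; the point is only that folding the reset into the step as a multiplier on the carry zeroes the state at boundaries without breaking a compiled loop:

```python
import torch

def scan_with_reset(x: torch.Tensor, doc_ids: torch.Tensor, decay: float = 0.9) -> torch.Tensor:
    """Toy linear scan: carry decays within a document, resets at boundaries.
    x, doc_ids: (T,). Stand-in for the real SSM recurrence."""
    reset = torch.zeros_like(x, dtype=torch.bool)
    reset[1:] = doc_ids[1:] != doc_ids[:-1]  # True where a new document starts
    out, h = [], torch.zeros(())
    for t in range(x.numel()):
        # Multiplicative reset folded into the step: the carry is multiplied
        # by 0 at a boundary, so no state crosses documents.
        h = h * decay * (~reset[t]).float() + x[t]
        out.append(h)
    return torch.stack(out)
```

On XLA the same multiplier rides inside the scan body, which is why no stop-and-restart of the compiled loop is needed.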
The success bar is quantitative: attention scores across different doc_ids must be exactly −∞ pre-softmax (verified by unit tests), MFU must not regress by more than 3–5 % versus naive packing, and at 16K+ packed sequences val_bpb must beat the unmasked baseline. On short 4K runs with few documents per sequence, the improvement is negligible and that is expected — the real signal shows up as context grows.
Where It Lands
The finished artifacts are flat: ${MEGACPP_DATA_ROOT}/megatron/clang_semantic_4k_v10_{train,valid}.{bin,idx}, with the tokenizer sibling at ${MEGACPP_DATA_ROOT}/tokenizer/tokenizer.json. Launchers like scripts/remote_smoke_h200_dsa_9_4_m.sh hard-code --data-path, --tokenizer-type HuggingFaceTokenizer, --tokenizer-model, and --split 98,1,1, so as long as ${REMOTE_ROOT}/data/ mirrors ${MEGACPP_DATA_ROOT} (bench3 uses /mnt/data/cppmega-root/data, europe uses /home/dave/cppmega-root/data) training picks it up with no edits ([data_preparation.md]).
Two honesty notes are worth repeating. The raw corpus (cpp_raw/, ~15 GB) lives outside git; if you need bitwise reproducibility after upstream retagging, mirror it to cold storage before discarding. And the clang indexer is order-sensitive on filesystem enumeration, so while the streaming parquet writer is seeded (--seed=42) the overall pipeline is not guaranteed bitwise-reproducible across kernels. Pin the sha256sum of the resulting .bin in the experiment log alongside the nanochat commit that produced the tokenizer.
That is the MegaCpp data stack: eight pinned repos, one 131 072-token hybrid tokenizer, four progressively enriched dataset versions (v2 → v6), a 4K→16K→64K curriculum, and a document-masking layer that makes long-context packing honest. The specialists are distinguishable not by separate pipelines but by how they weight this shared, structurally-aware corpus.