Tokenizer evolution for C++ code: from v2 proposal to v3 shipped
How the MegaCpp C++ tokenizer evolved from a 32K v1 through a 48K v2 proposal to the 65K v3 release: what we proposed, what corpus frequency analysis told us, and what it did for downstream eval.

Tokenizer evolution for C++ code: from v2 proposal to v3 shipped
The tokenizerQuick term guideTokenizerA deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…GroundingInside the MegaCpp C++ tokenizer: fixed vocab, BPE, and per-specialist sub-vocabs is the part of the model nobody looks at until it is wrong. For a C++ specialist family that trains at 4K, 16K, and 64K context, "wrong" is expensive: a bad tokenizerQuick term guideTokenizerA deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…GroundingInside the MegaCpp C++ tokenizer: fixed vocab, BPE, and per-specialist sub-vocabs wastes context, degrades the attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns map, and silently inflates loss on the exact patterns the model is supposed to be best at. This is the story of how the MegaCpp tokenizerQuick term guideTokenizerA deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…GroundingInside the MegaCpp C++ tokenizer: fixed vocab, BPE, and per-specialist sub-vocabs went from v1 (32K, mostly OK) through a v2 proposal (48K, never fully shipped) to v3 (65K, what we run today), and what the corpus frequency analysis taught us in between.
For first touch, BPE means byte-pair encoding: start from tiny units and repeatedly merge frequent adjacent pairs into larger tokens; a fixed token is a hand-reserved vocabulary entry that exists whether or not merge learning would discover it; and a vocab size is the total number of token IDs the embedding and output head must support. This post sits directly underneath the data evolution story in v2 to v6: Four Generations of the C++ Dataset, and Why We Kept Them All; for the lower-level fixed-vocab and learned-merge mechanics, the companion piece is Inside the MegaCpp C++ tokenizer. For the storage and runtime handoff after tokenization, the adjacent contracts are Building the C++ Training Data Pipeline: What Worked, What Broke and Packed rows as the real training contract.
Where v1 hurt
The v1 tokenizerQuick term guideTokenizerA deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…GroundingInside the MegaCpp C++ tokenizer: fixed vocab, BPE, and per-specialist sub-vocabs was a 32,768-token hybrid: 1,600 fixed (specials, keywords, operators, preprocessor, punctuation, common STL identifiers, numbers 0-999, diff markers, structural whitespace) and 31,168 learned BPE. It was good enough for 4K syntax pretraining but had real holes that we kept hitting:
- No number-pattern tokens. Hex (
0xFF), floats (3.14), scientific notation (1e10), and binary literals (0b1010) were all shattered into multi-token sequences. - Pure BPE on the learned half. Nothing exploited C++ morphology, so common
stems like
init,read,buffer,value,index, andnodehad to be re-discovered as merges. - Wasted reserved slots. Roughly 254 fixed slots in v1 were unused.
- No GPU or accelerator tokens. CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200, ROCm, and XLA names were often fragmented.
- Missing C++23/26 surface area such as
flat_map,mdspan, andsource_location.
The fix was not "add more BPE merges." Pure BPE, no matter how many merges,
will not reliably give you a single-token __device__ or cudaMalloc unless
you seed it. So the proposal was hybrid: more fixed, smarter learned.
v2 (48K) as a fallback, v3 (65K) as the primary target
The v2 proposal sized the vocabulary at 49,152 with 5,535 fixed and about 43,617 morpheme-aware BPE entries. The v3 proposal sized it at 65,536 with the same fixed structure and about 58,336 BPE entries. Vocabulary-scaling work at the time had been arguing that mainstream LLMs were often under-vocabularied for code; for a 270M-877M parameter C++ model, 48K-64K read as the sane range. We treated 65K as the primary target and 48K as a fallback if 65K turned out to be too memory-heavy at the embedding layer. It did not.
The public-safe recipe contract now exposes vocab_size: 65536 in NAM56R
recipe contract sample.
On the tokenizerQuick term guideTokenizerA deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…GroundingInside the MegaCpp C++ tokenizer: fixed vocab, BPE, and per-specialist sub-vocabs-design side, the checked-in evidence is indirect by design:
Reference corpus and tokenizer pinning notes
defines the rule that tokenizerQuick term guideTokenizerA deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…GroundingInside the MegaCpp C++ tokenizer: fixed vocab, BPE, and per-specialist sub-vocabs artifacts must be pinned as saved files plus a
revision record, and Data preparation notes
keeps special-token settings part of the public contract instead of a hidden
trainer detail.
The proposed fixed-vocab layout was deliberately re-zoned compared to v1:
0-63: special tokens64-319: C++ keywords including C++23/26 additions320-639: operators640-799: preprocessor and attributes800-1199: number-pattern tokens1200-1499: punctuation and indent levels1500-4499: STL and stdlib surface4500+: domain-specific bands plus morpheme-aware learned BPE
That is what the proposal looked like. What we shipped is not what the proposal looked like, because we ran corpus frequency analysis before locking the layout.
Vocabulary frequency analysis: the part that changed our minds
Before committing fixed slots to GPU, SQL, query/DB, C++23/26, and testing tokens, we ran the proposed fixed vocabulary against the actual corpus. The first run was on a 150K-document compilable C++ sample. The second run was on the full deduplicated v4 corpus: 1,056,993 documents, 642,028,236 identifiers, and about 22.3 GB of source.
The headline numbers were:
- 93% coverage of the 697 proposed fixed tokens.
- Morphemes massively outweighed domain tokens: 88.3M morpheme hits versus 6.94M domain-token hits, a 12.7x ratio.
- Comments were 24.2% of the corpus by bytes and 15.3% of all lines, so comment tokenization quality was not optional.
- Unicode was rare enough that Unicode normalization was not a priority for this C++-specialist lane.
- Four-space indent, two-space indent, and tabs all had enough presence to deserve first-class handling.
- Multi-space runs showed up often enough to justify a few dedicated whitespace-oriented tokens.
Then the uncomfortable findings: a lot of proposed domain tokens were actually generic C++ identifiers that BPE would learn perfectly well as merges.
The "generic word" problem
The proposed domain bands reserved around 1,700 fixed slots for GPU, SQL,
query/DB, C++23-26, and testing tokens. Many of those words did appear in the
corpus but still had no business taking fixed slots. Examples: query,
Status, map, enum, chunk, expected, stride, transfer,
receiver.
Only tokens with distinctive naming patterns such as
__double_underscore__, SCREAMING_CASE_MACROS, or prefix-style API names
like cublasSgemm and ncclAllReduce clearly benefited from fixed treatment.
That pushed the design toward compounds and special patterns, not generic
English words.
Categories we dropped entirely
The full-corpus analysis also identified categories where an entire band added little value:
- MongoDB
$-prefixed operators - Redis commands
- CMake keywords, which were mostly filtered out before tokenization
- narrow C++ ORM vocab
- protobuf keywords that were just generic English words
Within categories we kept, we pruned to compound and special-pattern tokens
only. That is why ClientContext or CompletionQueue survived where generic
terms like Status or Server did not.
Final budget
After all the cuts, the domain-token budget went from 697 proposed to about 415 final. The 282 freed slots went directly to BPE merges, taking the learned-vocabulary count from roughly 58,336 to roughly 58,618 in v3. Small in absolute terms, but every freed slot is a merge that can absorb something the model actually sees frequently, which is exactly what the morpheme analysis said to invest in.
Namespaces matter
One specific finding changed BPE seeding: namespace-qualified references were so
dominated by std:: that it deserved early learned treatment. After v3 shipped,
STL-heavy code became easier to read in attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns views because patterns like
std::vector<std::string> compressed more cleanly and let the model spend its
attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns budget on the type structure rather than on repeated namespace noise.
What the tokenizer did for downstream eval
The reason any of this matters is downstream loss and downstream behavior. We saw three concrete effects after v3 replaced v1:
- Lower bits-per-byte at every context length we measured, with the biggest gains on STL-heavy code and number-heavy code.
- Better effective context. The same packed-row window fit more useful code before cropping because namespace prefixes, literals, and common accelerator identifiers expanded less.
- Cleaner attentionQuick term guideAttentionThe token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.GroundingAbout: fused MLA on NVIDIA Reference: shared MLA adapter boundaries Reference: public-safe MLA integration patterns on structural patterns. Tokens like
template<,constexpr,#include,::,->, andstd::became easier to inspect in attention visualizations.
There were also things v3 did not fix. The BPE half can still fragment long compound identifiers from rare or niche C++ ecosystems because the corpus did not contain enough hits to learn them as strong merges.
The fixed-token spec, briefly
A fixed token here means a vocabulary entry that is reserved up front rather than discovered by merge learning. That is the right tool for control markers, common structural symbols, and a small number of domain patterns where splitting would be predictably wasteful. Learned BPE still does most of the compression work; the fixed set exists to keep especially important patterns from being left to chance.
The trainer-side flow looks like this in public-safe sketch form:
# trainer side: load the v3 fixed-vocab spec and emit the runtime tokenizer
spec = load_fixed_token_spec()
tok = build_runtime_tokenizer(
spec,
target_vocab=65_536,
bpe_merges=58_618,
seed_morphemes=True,
seed_namespaces=("std::",),
)
save_runtime_tokenizer(tok)
The shipped fixed-token spec lives in a dedicated tokenizerQuick term guideTokenizerA deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…GroundingInside the MegaCpp C++ tokenizer: fixed vocab, BPE, and per-specialist sub-vocabs artifact. We will
not inline it here because it is large and structural, but its top-level shape
mirrors the bands described above: special tokens, keywords, operators,
preprocessor, punctuation, STL/stdlib, number patterns, indent levels, the
surviving domain bands, and a reserved tail. Every entry has a stable ID and
every band is sized with empirical headroom. The quickest local proof bundle is
Reference corpus and tokenizer pinning notes
for the artifact-pinning rule, Data preparation notes
for why special-token settings are part of the pipeline contract, and NAM56R
recipe contract sample
for the downstream vocab_size: 65536 constraint the runtime already consumes.
The platform-specific consequences of freezing that vocabulary size show up most
clearly in Vocab and Tokenizer Plumbing on TPU: What XLA SPMD Makes You Decide
Up Front.
Lessons we want to keep
Three lessons from this iteration are worth keeping for future tokenizerQuick term guideTokenizerA deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…GroundingInside the MegaCpp C++ tokenizer: fixed vocab, BPE, and per-specialist sub-vocabs work:
- Run the frequency analysis before locking the layout.
- Distrust generic English words in domain bands.
- Treat comments as real training content, not as noise.
Frequently asked questions
Why did v3 cut fixed domain tokens instead of adding more?+
What exactly counts as a fixed token here?+
What changed the most after the frequency sweep?+
Why does vocab size matter outside tokenizer training?+
Why is there no cross-tokenizer comparison table here?+
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.
A deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…
The explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.
The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.
Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…
NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.