MegaCpp Engineering: Applied C++ model systems
Article
Grounded engineering note from the MegaCpp stack
Published · 7 min read · David Gornshtein
Tokenizer
BPE
C++
Vocab
Data

Tokenizer evolution for C++ code: from v2 proposal to v3 shipped

How the MegaCpp C++ tokenizer evolved from a 32K v1 through a 48K v2 proposal to the 65K v3 release: what we proposed, what corpus frequency analysis told us, and what it did for downstream eval.

The tokenizer is the part of the model nobody looks at until it is wrong. For a C++ specialist family that trains at 4K, 16K, and 64K context, "wrong" is expensive: a bad tokenizer wastes context, degrades the attention map, and silently inflates loss on the exact patterns the model is supposed to be best at. This is the story of how the MegaCpp tokenizer went from v1 (32K, mostly OK) through a v2 proposal (48K, never fully shipped) to v3 (65K, what we run today), and what the corpus frequency analysis taught us in between.

For first touch, BPE means byte-pair encoding: start from tiny units and repeatedly merge frequent adjacent pairs into larger tokens; a fixed token is a hand-reserved vocabulary entry that exists whether or not merge learning would discover it; and a vocab size is the total number of token IDs the embedding and output head must support. This post sits directly underneath the data evolution story in v2 to v6: Four Generations of the C++ Dataset, and Why We Kept Them All; for the lower-level fixed-vocab and learned-merge mechanics, the companion piece is Inside the MegaCpp C++ tokenizer. For the storage and runtime handoff after tokenization, the adjacent contracts are Building the C++ Training Data Pipeline: What Worked, What Broke and Packed rows as the real training contract.
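
To make the merge loop concrete, here is a toy sketch of BPE merge learning. It is a teaching example that works on whole words at character level, not the MegaCpp trainer; the function names (`learn_bpe_merges`, `merge_pair`, `encode`) are made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy BPE)."""
    # Every word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically for determinism.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges


def encode(word, merges):
    """Apply learned merges, in learning order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

With the word list `["std", "std", "std", "str"]` and two merges, the learned merges are `s+t` and then `st+d`, so `std` encodes as one token while `str` still splits into `st` and `r`. A fixed token skips this discovery process entirely: it gets an ID regardless of how often it appears.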

Where v1 hurt

The v1 tokenizer was a 32,768-token hybrid: 1,600 fixed (specials, keywords, operators, preprocessor, punctuation, common STL identifiers, numbers 0-999, diff markers, structural whitespace) and 31,168 learned BPE. It was good enough for 4K syntax pretraining but had real holes that we kept hitting:

  1. No number-pattern tokens. Hex (0xFF), floats (3.14), scientific notation (1e10), and binary literals (0b1010) were all shattered into multi-token sequences.
  2. Pure BPE on the learned half. Nothing exploited C++ morphology, so common stems like init, read, buffer, value, index, and node had to be re-discovered as merges.
  3. Wasted reserved slots. Roughly 254 fixed slots in v1 were unused.
  4. No GPU or accelerator tokens. CUDA, ROCm, and XLA names were often fragmented.
  5. Missing C++23/26 surface area such as flat_map, mdspan, and source_location.

The fix was not "add more BPE merges." Pure BPE, no matter how many merges, will not reliably give you a single-token __device__ or cudaMalloc unless you seed it. So the proposal was hybrid: more fixed, smarter learned.
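
One way to picture "seeding" is a pre-split step that carves fixed tokens out of the text before any learned merges run. The sketch below is an assumption about how such a step could look, not the shipped implementation; `build_fixed_splitter` and the example token set are hypothetical.

```python
import re


def build_fixed_splitter(fixed_tokens):
    """Return a function that splits text into ("fixed", s) and ("learned", s) pieces."""
    if not fixed_tokens:
        raise ValueError("fixed_tokens must not be empty")
    # Longest tokens first, so a longer fixed token wins over a shorter one
    # that happens to be its prefix.
    ordered = sorted(fixed_tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(tok) for tok in ordered))

    def split(text):
        pieces = []
        pos = 0
        for match in pattern.finditer(text):
            if match.start() > pos:
                # Everything between fixed tokens is left to the learned BPE half.
                pieces.append(("learned", text[pos:match.start()]))
            pieces.append(("fixed", match.group()))
            pos = match.end()
        if pos < len(text):
            pieces.append(("learned", text[pos:]))
        return pieces

    return split
```

For example, splitting `__device__ void f() { cudaMalloc(&p, n); }` with the fixed set `{"__device__", "cudaMalloc", "::"}` keeps `__device__` and `cudaMalloc` as atomic pieces and leaves the rest for learned merges. A real pre-splitter would also need identifier-boundary checks so that `cudaMalloc` is not carved out of a longer name such as `cudaMallocHost`; this sketch omits them.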

v2 (48K) as a fallback, v3 (65K) as the primary target

The v2 proposal sized the vocabulary at 49,152 with 5,535 fixed and about 43,617 morpheme-aware BPE entries. The v3 proposal sized it at 65,536 with the same fixed structure and about 58,336 BPE entries. Vocabulary-scaling work at the time had been arguing that mainstream LLMs were often under-vocabularied for code; for a 270M-877M parameter C++ model, 48K-64K read as the sane range. We treated 65K as the primary target and 48K as a fallback if 65K turned out to be too memory-heavy at the embedding layer. It did not.
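
The memory concern at the embedding layer is simple arithmetic: the table grows linearly with vocabulary size. The sketch below assumes a hidden size of 1024 purely for illustration (the article does not state the model width) and a hypothetical helper `embedding_params`.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameter count of the input embedding plus output head.

    With tied weights the two share one table; untied doubles the cost.
    """
    tables = 1 if tied else 2
    return vocab_size * d_model * tables


V1, V2, V3 = 32_768, 49_152, 65_536
D_MODEL = 1024  # hypothetical hidden size, NOT taken from the article

# Sanity check on the v2 split quoted above: fixed + learned = total.
assert 5_535 + 43_617 == V2

for name, vocab in (("v1", V1), ("v2", V2), ("v3", V3)):
    params = embedding_params(vocab, D_MODEL)
    print(f"{name}: vocab={vocab:>6}  tied embedding params={params / 1e6:.1f}M")
```

Under that assumed width, moving from 48K to 65K adds about 16.8M tied parameters, which is the kind of number that has to be weighed against a 270M-877M total.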

The public-safe recipe contract now exposes vocab_size: 65536 in the NAM56R recipe contract sample. On the tokenizer-design side, the checked-in evidence is indirect by design: the Reference corpus and tokenizer pinning notes define the rule that tokenizer artifacts must be pinned as saved files plus a revision record, and the Data preparation notes keep special-token settings part of the public contract instead of a hidden trainer detail.
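
As an illustration of "saved files plus a revision record", the sketch below hashes every file in an artifact directory and stores the digests next to a revision label. The record layout and the function names are assumptions made for this example; the actual pinning notes may define a different format.

```python
import hashlib
from pathlib import Path


def build_revision_record(artifact_dir, revision, vocab_size):
    """Hash every file under `artifact_dir` into a hypothetical revision record."""
    root = Path(artifact_dir)
    files = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        files[path.relative_to(root).as_posix()] = digest
    return {"revision": revision, "vocab_size": vocab_size, "files": files}


def verify_revision_record(artifact_dir, record):
    """Return True only if the files on disk still match the recorded digests."""
    current = build_revision_record(
        artifact_dir, record["revision"], record["vocab_size"]
    )
    return current["files"] == record["files"]
```

The point of the design is that a trainer or eval job can refuse to start when `verify_revision_record` returns False, instead of silently running with a tokenizer that no longer matches the checkpoint.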

The proposed fixed-vocab layout was deliberately re-zoned compared to v1:

  • 0-63: special tokens
  • 64-319: C++ keywords including C++23/26 additions
  • 320-639: operators
  • 640-799: preprocessor and attributes
  • 800-1199: number-pattern tokens
  • 1200-1499: punctuation and indent levels
  • 1500-4499: STL and stdlib surface
  • 4500+: domain-specific bands plus morpheme-aware learned BPE
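
The proposed zoning can be written down as a small lookup table. This sketch encodes only the band boundaries listed above (the proposal, not the shipped layout); `band_for_id` and `fixed_band_sizes` are illustrative names.

```python
from bisect import bisect_right

# (first token ID, band name), taken from the proposed layout above.
PROPOSED_BANDS = [
    (0, "special tokens"),
    (64, "C++ keywords"),
    (320, "operators"),
    (640, "preprocessor and attributes"),
    (800, "number-pattern tokens"),
    (1200, "punctuation and indent levels"),
    (1500, "STL and stdlib surface"),
    (4500, "domain bands plus learned BPE"),
]
_STARTS = [start for start, _ in PROPOSED_BANDS]


def band_for_id(token_id, vocab_size=65_536):
    """Map a token ID to the name of the proposed band that contains it."""
    if not 0 <= token_id < vocab_size:
        raise ValueError(f"token id {token_id} outside [0, {vocab_size})")
    return PROPOSED_BANDS[bisect_right(_STARTS, token_id) - 1][1]


def fixed_band_sizes():
    """Slot count of each band that has an explicit upper bound (all but the last)."""
    return {
        name: next_start - start
        for (start, name), (next_start, _) in zip(PROPOSED_BANDS, PROPOSED_BANDS[1:])
    }
```

Summing `fixed_band_sizes()` gives 4,500 slots below the open-ended tail, with the STL and stdlib band alone taking 3,000 of them.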

That is what the proposal looked like. What we shipped is not what the proposal looked like, because we ran corpus frequency analysis before locking the layout.

Vocabulary frequency analysis: the part that changed our minds

Before committing fixed slots to GPU, SQL, query/DB, C++23/26, and testing tokens, we ran the proposed fixed vocabulary against the actual corpus. The first run was on a 150K-document compilable C++ sample. The second run was on the full deduplicated v4 corpus: 1,056,993 documents, 642,028,236 identifiers, and about 22.3 GB of source.
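
A minimal sketch of that kind of check, assuming identifier-shaped tokens and a plain regex scan: count identifiers across documents, then ask what fraction of the proposed tokens ever appear. The real analysis script is not shown in this article, so `identifier_counts` and `coverage` are hypothetical stand-ins.

```python
import re
from collections import Counter

# C/C++-style identifiers; operators and punctuation are ignored in this sketch.
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifier_counts(documents):
    """Count every identifier occurrence across an iterable of source strings."""
    counts = Counter()
    for doc in documents:
        counts.update(IDENT_RE.findall(doc))
    return counts


def coverage(proposed_tokens, counts, min_hits=1):
    """Return (fraction of proposed tokens seen at least `min_hits` times, missed tokens)."""
    proposed = list(proposed_tokens)
    missed = [tok for tok in proposed if counts[tok] < min_hits]
    ratio = (len(proposed) - len(missed)) / len(proposed) if proposed else 0.0
    return ratio, missed
```

Coverage alone is a weak signal: a token can be present in the corpus and still be a poor use of a fixed slot, which is why raw hit counts per token matter as much as the ratio.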

The headline numbers were:

  1. 93% coverage of the 697 proposed fixed domain tokens.
  2. Morphemes massively outweighed domain tokens: 88.3M morpheme hits versus 6.94M domain-token hits, a 12.7x ratio.
  3. Comments were 24.2% of the corpus by bytes and 15.3% of all lines, so comment tokenization quality was not optional.
  4. Unicode was rare enough that Unicode normalization was not a priority for this C++-specialist lane.
  5. Four-space indent, two-space indent, and tabs all had enough presence to deserve first-class handling.
  6. Multi-space runs showed up often enough to justify a few dedicated whitespace-oriented tokens.

Then the uncomfortable finding: a lot of proposed domain tokens were actually generic C++ identifiers that BPE would learn perfectly well as merges.

The "generic word" problem

The proposed domain bands reserved around 1,700 fixed slots for GPU, SQL, query/DB, C++23-26, and testing tokens. Many of those words did appear in the corpus but still had no business taking fixed slots. Examples: query, Status, map, enum, chunk, expected, stride, transfer, receiver.

Only tokens with distinctive naming patterns such as __double_underscore__, SCREAMING_CASE_MACROS, or prefix-style API names like cublasSgemm and ncclAllReduce clearly benefited from fixed treatment. That pushed the design toward compounds and special patterns, not generic English words.
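
The distinction can be approximated with a few pattern checks. The classifier below is a rough heuristic written for this article, not the rule set used in the analysis; the prefix list in particular is an illustrative assumption.

```python
import re

DUNDER_RE = re.compile(r"^__[A-Za-z0-9_]+__$")
# Upper-case words joined by at least one underscore, e.g. CUDA_CHECK.
SCREAMING_RE = re.compile(r"^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+$")
# Illustrative vendor/API prefixes; a real list would come from corpus statistics.
API_PREFIXES = ("cuda", "cublas", "cudnn", "nccl", "hip")


def naming_pattern(identifier):
    """Classify an identifier as dunder, screaming_case, prefixed_api, or generic."""
    if DUNDER_RE.match(identifier):
        return "dunder"
    if SCREAMING_RE.match(identifier):
        return "screaming_case"
    for prefix in API_PREFIXES:
        rest = identifier[len(prefix):]
        # Require a capital letter right after the prefix: cublas + Sgemm.
        if identifier.startswith(prefix) and rest[:1].isupper():
            return "prefixed_api"
    return "generic"
```

Under this heuristic `__device__`, `CUDA_CHECK`, `cublasSgemm`, and `ncclAllReduce` all land in a distinctive class, while `query`, `Status`, and `map` fall through to `generic`, which matches the direction of the pruning described here.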

Categories we dropped entirely

The full-corpus analysis also identified categories where an entire band added little value:

  • MongoDB $-prefixed operators
  • Redis commands
  • CMake keywords, which were mostly filtered out before tokenization
  • narrow C++ ORM vocab
  • protobuf keywords that were just generic English words

Within categories we kept, we pruned to compound and special-pattern tokens only. That is why ClientContext or CompletionQueue survived where generic terms like Status or Server did not.

Final budget

After all the cuts, the domain-token budget went from 697 proposed to about 415 final. The 282 freed slots went directly to BPE merges, taking the learned-vocabulary count from roughly 58,336 to roughly 58,618 in v3. Small in absolute terms, but every freed slot is a merge that can absorb something the model actually sees frequently, which is exactly what the morpheme analysis said to invest in.
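
The reallocation is easy to verify by hand. The snippet below just restates the figures from this section as arithmetic; the article gives them as approximate ("about", "roughly"), so treat the exact values as nominal.

```python
VOCAB_SIZE = 65_536
PROPOSED_DOMAIN_TOKENS = 697
FINAL_DOMAIN_TOKENS = 415      # "about 415" in the text
PROPOSED_BPE_ENTRIES = 58_336  # "roughly 58,336" in the text

freed_slots = PROPOSED_DOMAIN_TOKENS - FINAL_DOMAIN_TOKENS
final_bpe_entries = PROPOSED_BPE_ENTRIES + freed_slots

# The total vocabulary is unchanged; slots only move from fixed to learned.
fixed_before = VOCAB_SIZE - PROPOSED_BPE_ENTRIES
fixed_after = VOCAB_SIZE - final_bpe_entries

print(f"freed slots:       {freed_slots}")
print(f"final BPE entries: {final_bpe_entries}")
print(f"fixed slots:       {fixed_before} -> {fixed_after}")
```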

Namespaces matter

One specific finding changed BPE seeding: namespace-qualified references were so dominated by std:: that it deserved early learned treatment. After v3 shipped, STL-heavy code became easier to read in attention views because patterns like std::vector<std::string> compressed more cleanly and let the model spend its attention budget on the type structure rather than on repeated namespace noise.
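
A quick way to see that kind of dominance is to count the name in front of every `::`. This is a hedged sketch with a hypothetical helper, `namespace_prefix_counts`; a regex cannot tell namespaces from class scopes, so it over-counts qualifiers like `Foo::bar`.

```python
import re
from collections import Counter

# Captures the identifier immediately before each "::".
NAMESPACE_RE = re.compile(r"\b([A-Za-z_]\w*)::")


def namespace_prefix_counts(source):
    """Count how often each name appears as a `name::` qualifier in `source`."""
    return Counter(NAMESPACE_RE.findall(source))


def prefix_share(counts, prefix):
    """Fraction of all qualifiers that use `prefix` (0.0 when there are none)."""
    total = sum(counts.values())
    return counts[prefix] / total if total else 0.0
```

On a snippet such as `std::vector<std::string> v; boost::asio::io_context io; std::move(x);` this yields three `std` hits out of five qualifiers. Run over a real corpus, the same count is what would justify seeding `std::` as an early merge.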

What the tokenizer did for downstream eval

The reason any of this matters is downstream loss and downstream behavior. We saw three concrete effects after v3 replaced v1:

  1. Lower bits-per-byte at every context length we measured, with the biggest gains on STL-heavy code and number-heavy code.
  2. Better effective context. The same packed-row window fit more useful code before cropping because namespace prefixes, literals, and common accelerator identifiers expanded less.
  3. Cleaner attention on structural patterns. Tokens like template<, constexpr, #include, ::, ->, and std:: became easier to inspect in attention visualizations.
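
Bits-per-byte is the metric that makes tokenizers with different vocabularies comparable, because it normalizes by the raw text rather than by token count. The sketch below assumes per-token negative log-likelihoods in nats, as most training frameworks report them; the function names are illustrative.

```python
import math


def bits_per_byte(token_nlls_nats, text):
    """Total model loss in bits divided by the UTF-8 byte length of the text."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    total_bits = sum(token_nlls_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes


def tokens_per_byte(num_tokens, text):
    """Expansion ratio: how many tokens the tokenizer spends per byte of source."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    return num_tokens / n_bytes
```

Because the denominator is bytes, a tokenizer that emits fewer tokens for the same source only wins on this metric if the model's total loss over those tokens also drops; lower per-token loss alone does not guarantee it.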

There were also things v3 did not fix. The BPE half can still fragment long compound identifiers from rare or niche C++ ecosystems because the corpus did not contain enough hits to learn them as strong merges.

The fixed-token spec, briefly

A fixed token here means a vocabulary entry that is reserved up front rather than discovered by merge learning. That is the right tool for control markers, common structural symbols, and a small number of domain patterns where splitting would be predictably wasteful. Learned BPE still does most of the compression work; the fixed set exists to keep especially important patterns from being left to chance.

The trainer-side flow looks like this in public-safe sketch form:

# trainer side: load the v3 fixed-vocab spec and emit the runtime tokenizer
spec = load_fixed_token_spec()

tok = build_runtime_tokenizer(
    spec,
    target_vocab=65_536,
    bpe_merges=58_618,
    seed_morphemes=True,
    seed_namespaces=("std::",),
)
save_runtime_tokenizer(tok)

The shipped fixed-token spec lives in a dedicated tokenizer artifact. We will not inline it here because it is large and structural, but its top-level shape mirrors the bands described above: special tokens, keywords, operators, preprocessor, punctuation, STL/stdlib, number patterns, indent levels, the surviving domain bands, and a reserved tail. Every entry has a stable ID and every band is sized with empirical headroom. The quickest local proof bundle is Reference corpus and tokenizer pinning notes for the artifact-pinning rule, Data preparation notes for why special-token settings are part of the pipeline contract, and NAM56R recipe contract sample for the downstream vocab_size: 65536 constraint the runtime already consumes. The platform-specific consequences of freezing that vocabulary size show up most clearly in Vocab and Tokenizer Plumbing on TPU: What XLA SPMD Makes You Decide Up Front.

Lessons we want to keep

Three lessons from this iteration are worth keeping for future tokenizer work:

  1. Run the frequency analysis before locking the layout.
  2. Distrust generic English words in domain bands.
  3. Treat comments as real training content, not as noise.

FAQ

Frequently asked questions

Why did v3 cut fixed domain tokens instead of adding more?
Because corpus analysis showed morphemes and structural patterns mattered more than reserving slots for generic English identifiers.

What exactly counts as a fixed token here?
A reserved token that the tokenizer keeps as a stable vocabulary entry whether or not merge learning would rediscover it. In practice that means control markers, punctuation-like structural patterns, number-oriented forms, and a small set of domain compounds with clear compression value.

What changed the most after the frequency sweep?
We pruned whole token bands, reclaimed slots for BPE merges, and promoted namespace and number-pattern handling as higher-value investments.

Why does vocab size matter outside tokenizer training?
Because embedding tables and output heads depend on the token-ID range. The tokenizer choice is not just compression policy; it also sets part of the runtime shape budget.

Why is there no cross-tokenizer comparison table here?
Because this article is grounded in the v1-to-v3 migration and the checked-in MegaCpp contracts. Comparing expansion ratios against general code tokenizers is useful, but it needs the same C++ slice, normalization rules, and measurement script before the numbers belong in a public write-up.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

NAM56R

A concrete MegaCpp hybrid family name whose meaning lives in the launch pattern, feature placement, and runtime constraints rather than in one marketing label.

Tokenizer

A deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and…

XLA SPMD

The explicit TPU sharding mode where one compiled program carries placement rules instead of rank-local imperative code.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.