Grounded engineering note from the MegaCpp stack

Published April 18, 2026 • 10 min read • David Gornshtein

Tokenizer

BPE

C++

Vocab

Inside the MegaCpp C++ tokenizer: fixed vocab, BPE, and per-specialist sub-vocabs

A deep look at the tokenizer we ship: half hand-curated vocabulary, half learned BPE, what changed between v2 and v3, where the collisions live, and how per-specialist sub-vocabs fall out of the shared 64K layout.

By David Gornshtein

MegaCpp

Focused on applied C++ model engineering

The tokenizer story is usually told at the summary level: we grew from 32K to 48K and then to a 64K-class v3 layout, seeded some morphemes, and locked the fixed-token bands. That summary is still almost useless for engineering. What actually mattered was the frequency analysis on the real corpus, the collisions between fixed and learned slots, and the per-specialist sub-vocab story, which is not a separate tokenizer but a discipline of BPE seeding and runtime ID masking. Here, runtime ID masking means applying a per-specialist allow-list at logits time while keeping one shared tokenizer and one stable ID layout.

Why MegaCpp cares about this

The closest companion pieces are tokenizer v2 to v3, which tells the versioning story, and the detailed corpus construction write-up, which explains where the token-frequency evidence came from in the first place.

A bad tokenizer is the cheapest way to degrade a code model. It wastes context, shatters high-frequency patterns into multi-token sequences, confuses the attention map, and silently inflates loss on the identifiers the model should be best at. For specialists training at 4K, 16K, and 64K, every percentage point of expansion ratio costs real compute at 64K and real answer quality at 4K. The other reason: our model family is a set of specialists sharing one 64K vocabulary. Their practical working sets differ enough that "per-specialist sub-vocab" is a useful abstraction even when no separate tokenizer file exists.

What we built in MegaCpp

The tokenizer is a hybrid of two parts: hand-curated fixed tokens and learned BPE. The two "halves" are not equal in size: the v3 manifest assigns IDs 7200-65535 to the learned BPE band, so the fixed part is the smaller share of the 65,536 slots. The tokenizer implementation wraps the HuggingFace tokenizers backend with BERT-style whitespace handling and a custom decoder that knows the difference between a standalone added token and a BPE suffix fragment.

The fixed half

The fixed half covers things BPE cannot be trusted to learn well:

Special tokens (IDs 0-63 in v3). <PAD>, <UNK>, <BOS>, <EOS>, fill-in-middle (FIM) markers, <CODE_START>/<CODE_END>, thinking tokens (<THINK_START>/<THINK_END> and sub-variants for error, fix, trace, verify, plan), tool-call tokens (<QUERY_TOOL>, <TOOL_RESULT>), compile/script markers, diff and comment delimiters, file separators, and a reserved tail. Control surface, not learnable.
C++ keywords, operators, preprocessor directives, attributes. Extended through C++23/26 (constinit, co_await, co_yield, co_return, contract_assert, _Atomic, and the attribute set).
Number patterns. A dedicated band for hex prefixes (0x, 0X), common byte values (0x00, 0xFF, 0x80), 32-bit magic constants (0xDEADBEEF, 0xCAFEBABE), float suffixes and common float literals, scientific notation, binary literal prefixes, and C++23 integer suffixes including z and uz for size_t. Before this band existed, 0xDEADBEEF was five or six BPE tokens; now it is one.
Punctuation and indent tokens. Explicit tokens for " ", " ", " ", and "\t".
STL and stdlib identifiers at high frequency.
Domain bands: GPU/accelerator tokens (CUDA runtime, cuBLAS, cuDNN, Thrust/CUB, CUTLASS, NCCL, atomics, graph API), ROCm/HIP mirrors, TPU/XLA op names (MHLO dialect, Pallas/Mosaic surface), SQL keywords, query/DB tokens, C++23/26 library surface, and testing/build-framework tokens (GTest, Catch2, Boost.Test).

Each added token is registered through HuggingFace's added-tokens mechanism, which is what the production tokenizer path uses to distinguish "this token is a full word" from "this token is a BPE fragment." That distinction drives the decoder's space-reconstruction heuristics, because a C++ identifier like end_point may arrive as end + _ + po + int through the pre-tokenizer and has to decode without a stray space between po and int.
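To make the "fixed tokens are matched as whole units" idea concrete, here is a minimal sketch in plain Python. It is not the HuggingFace backend and not our production path: the `FIXED_TOKENS` list and the `split_fixed` helper are hypothetical names, and the real added-token matcher has richer rules. It only shows why a literal such as 0xDEADBEEF can come out as one token while the surrounding text is left for BPE.

```python
import re

# Hypothetical sample of fixed tokens; the real manifest is far larger.
FIXED_TOKENS = ["0xDEADBEEF", "0x", "__device__", "std::", "co_await"]

# Longest-first alternation so "0xDEADBEEF" wins over its prefix "0x".
_FIXED_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(FIXED_TOKENS, key=len, reverse=True))
)


def split_fixed(text):
    """Split text into (piece, is_fixed) pairs; non-fixed pieces go on to BPE."""
    pieces, pos = [], 0
    for match in _FIXED_RE.finditer(text):
        if match.start() > pos:
            pieces.append((text[pos:match.start()], False))
        pieces.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces


print(split_fixed("x = 0xDEADBEEF;"))
```

The `is_fixed` flag is the same kind of information the decoder needs later when it decides whether a piece is a full word or a fragment.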

The learned half

The learned half is BPE, but it is BPE seeded aggressively on corpus-measured morphemes. The frequency analysis ran a code-aware Rust scanner with parallel workers across 1,435,084 C++ source files — 22.3 GB, 333 open-source projects — in 51.4 seconds. The headline number that shaped BPE seeding was morpheme dominance: 88.3M morpheme hits across 128 proposed morphemes, against 6.94M total hits for the 697 proposed fixed domain tokens. A 12.7x ratio, and it pointed directly at what to invest in.

The reason those numbers were trusted is that the scan is code-aware rather than regex-first. Identifier counts only help tokenizer design if comments, string payloads, and path-like noise do not dominate the top of the table.

Morpheme classes seeded into BPE: common components (value/index/offset/node/ptr/buffer/count/context, 59.9M hits over 52 items), C++ stems (init/read/write/create/start/format/lock/parse/find/alloc/insert, 22.9M over 30), prefixes (proto, sub, non, multi, meta, mono; 4.2M over 24), and suffixes (1.2M over 22). Total ~99% coverage of the proposed morpheme set. Aggressive early merges (init, read, write, buffer, value, index, node, ptr) mean initialize tokenizes as init + ialize, not fragment soup. After seeding, BPE ran a standard merge schedule until the remaining budget filled.
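The seeding step can be sketched as follows. This is an illustration under assumptions, not the trainer we run: `seed_merges` and `apply_merges` are hypothetical helpers, the seed list is the handful of morphemes named above, and the encoder is a simplified one that applies merges strictly in priority order. A real build would continue with a standard frequency-driven merge schedule after the seeded merges.

```python
def seed_merges(morphemes):
    """Return character-level merges that build each morpheme left to right."""
    merges, seen = [], set()
    for morpheme in morphemes:
        left = morpheme[0]
        for ch in morpheme[1:]:
            pair = (left, ch)
            if pair not in seen:  # shared prefixes are only registered once
                seen.add(pair)
                merges.append(pair)
            left += ch
    return merges


def apply_merges(word, merges):
    """Apply merges in priority order, as a simplified BPE encoder would."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


SEEDS = ["init", "read", "write", "buffer", "value", "index", "node", "ptr"]
MERGES = seed_merges(SEEDS)

# With only the seeded merges, the stem is already one symbol and the tail
# is still single characters; later learned merges would fuse the tail.
print(apply_merges("initialize", MERGES))
```

Because the seeded merges sit at the front of the merge list, the stem boundary is fixed before any corpus-learned merge can cut across it.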

One specific seeding decision paid off disproportionately: std:: accounts for 6.8M of the 28.2M total namespace-qualified references, an order of magnitude ahead of the next entries (llvm:: 552K, boost:: 517K, detail:: 495K, cutlass:: 443K, absl:: 322K). Making sure std:: merged early meant std::vector<std::string> becomes three tokens instead of five or six, and attention visualizations over STL-heavy code got visibly cleaner after v3 shipped.
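The headline ratios in this section are easy to re-derive from the quoted counts. The short check below uses only numbers stated above; the variable names are ours. The per-class hits sum to 88.2M rather than the 88.3M headline, which is consistent with each class figure being rounded to one decimal.

```python
# (item count, hits in millions) per morpheme class, as quoted in the text.
classes = {
    "components": (52, 59.9),
    "stems": (30, 22.9),
    "prefixes": (24, 4.2),
    "suffixes": (22, 1.2),
}

morpheme_items = sum(n for n, _ in classes.values())
class_hits_m = round(sum(h for _, h in classes.values()), 1)

# Morpheme hits vs. fixed-domain-token hits (both in millions).
dominance = round(88.3 / 6.94, 1)

# std:: share of namespace-qualified references, and its lead over llvm::.
std_share_pct = round(6.8 / 28.2 * 100, 1)
std_vs_llvm = round(6.8e6 / 552e3, 1)

print(morpheme_items, class_hits_m, dominance, std_share_pct, std_vs_llvm)
```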

v2 to v3: what actually changed in token-frequency terms

The initial v2 and v3 sizing plans looked clean on paper — 1,700 fixed slots for GPU/SQL/query/DB/C++23-26/testing domains, organized into neat bands from 5300 to 6999. We ran the frequency analysis before committing those slots, which is where things got uncomfortable.

The first uncomfortable finding was the generic word problem. Of 473 domain tokens found in the smaller 150K-doc analysis, 209 were generic C++ identifiers appearing in 2% to 30% of documents. Words like query, Status, map, enum, chunk, expected, stride, transfer, receiver — technically valid "domain" tokens under the original taxonomy, but wastes of fixed slots because BPE learns them perfectly well as merges. We cut them. The initial budget went from 1,700 down to ~216 on the 150K analysis, an 87% reduction.

The second uncomfortable finding, from the full 1.06M-document corpus run, confirmed the direction but softened the magnitude. More documents surfaced more edge-case hits for borderline tokens (ODBC, ROCm, some C++23 types), so the final domain budget settled at ~415 tokens. The categories we cut entirely on the full-corpus pass:

MongoDB $-prefixed operators (39 tokens, 148 total hits across 1.06M docs). C++ corpora simply do not contain MongoDB query DSLs.
Redis commands (26 tokens, 430 hits). Only SREM had any real usage, and even that was incidental.
CMake (22 tokens, 1,685 hits). Almost entirely stray references in comments, because CMakeLists.txt is filtered at the file-type stage.
C++ ORM tokens (13 tokens, 401 hits). Negligible across the ORM band.
Protobuf keywords as a category (18 tokens, 1.39M hits — but every single token was a generic English word that BPE learns for free).
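A quick way to see why these categories were cut is hits per reserved token. The check below uses only the counts quoted above; the dictionary layout is ours. Protobuf is left out because it was cut for genericness rather than for low frequency.

```python
# (reserved tokens, total hits on the full-corpus pass) per cut category.
cut_categories = {
    "MongoDB operators": (39, 148),
    "Redis commands": (26, 430),
    "CMake": (22, 1685),
    "C++ ORM": (13, 401),
}

hits_per_token = {
    name: round(hits / tokens, 1)
    for name, (tokens, hits) in cut_categories.items()
}

# Budget reduction on the 150K-document analysis: 1,700 slots down to ~216.
reduction_pct = round((1700 - 216) / 1700 * 100)

for name, value in hits_per_token.items():
    print(f"{name}: {value} hits per token")
print(f"150K-pass budget reduction: {reduction_pct}%")
```

Even the best of these categories averages well under a hundred hits per token across the whole corpus, which is far too little to justify a fixed slot.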

Within surviving categories we pruned to compound and special-pattern tokens only. gRPC kept ClientContext, ServerBuilder, CompletionQueue, dropped Status, Channel, Server. C++23 types kept source_location, flat_map, mdspan, dropped expected (108K generic hits). C++23 ranges kept cartesian_product and zip_transform, dropped stride and chunk. The surviving rule of thumb: fixed slots only for tokens with distinctive naming patterns — double-underscore, SCREAMING_CASE, or prefix-style API names like cublasSgemm, ncclAllReduce, __device__.
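The rule of thumb can be approximated with a few regular expressions. This is a hedged sketch, not the filter we ran: the pattern names and `has_distinctive_pattern` are hypothetical, and the compound rules are our reading of "compound and special-pattern tokens". It reproduces most of the keep/drop examples above but not all of them: mdspan was kept in practice yet matches none of these patterns, so naming shape alone was clearly not the whole decision.

```python
import re

SCREAMING_CASE = re.compile(r"^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+$")
PREFIX_API = re.compile(r"^[a-z]{2,}[A-Z][A-Za-z0-9]*$")      # cublasSgemm
SNAKE_COMPOUND = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)+$")    # flat_map
CAMEL_COMPOUND = re.compile(r"^(?:[A-Z][a-z0-9]+){2,}$")      # ClientContext


def has_distinctive_pattern(token):
    """True if the token's shape suggests BPE would not learn it for free."""
    if "__" in token:  # double-underscore names such as __device__
        return True
    return any(
        pattern.match(token)
        for pattern in (SCREAMING_CASE, PREFIX_API, SNAKE_COMPOUND, CAMEL_COMPOUND)
    )


kept = ["ClientContext", "ServerBuilder", "source_location", "flat_map",
        "cartesian_product", "cublasSgemm", "ncclAllReduce", "__device__"]
dropped = ["Status", "Channel", "Server", "expected", "stride", "chunk"]

print([t for t in kept if not has_distinctive_pattern(t)])   # pattern misses
print([t for t in dropped if has_distinctive_pattern(t)])    # false keeps
```

A shape check like this is only a first filter; the corpus frequency table still has the final say on whether a candidate earns a slot.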

The 282 slots freed by those cuts went directly to BPE merges (282 matches the 697 proposed fixed domain tokens minus the ~415 kept). In v3 the learned vocabulary rose from 58,336 to ~58,618. That is small in absolute terms, but every freed slot is a merge that absorbs something the model actually sees.

Vocabulary collisions

Collisions between the hand-curated half and the learned half show up in two places. First, suffix-vs-standalone ambiguity: a token like s, is, or, if, in could be a BPE suffix (char + s = chars) or a standalone identifier. The decoder's _is_bpe_suffix logic is context-dependent: if the previous token is a fixed added token like char, a single-char s attaches as a suffix; if the previous token is a BPE fragment, a common short word stays standalone. We maintain an allow-list of common short words so the decoder does not collapse them into the previous identifier. Second, the underscore-identifier case: the pre-tokenizer splits on _, so end_point arrives as end + _ + po + int, and the decoder tracks an in_underscore_id state so subsequent fragments continue joining. Neither case is research, but both are load-bearing: a tokenizer whose decode round-trip is off by a space every few hundred tokens silently corrupts every SFT example and verifier check.
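A round-trip check for cases like these can be sketched with a small harness. Everything here is illustrative: `round_trip_failures` is a hypothetical helper, and `toy_encode`, `naive_decode`, and `exact_decode` are stand-ins for a real tokenizer, chosen only so the example runs without any dependency. The naive decoder shows the "stray space" failure mode described above.

```python
import re


def round_trip_failures(encode, decode, samples):
    """Return (original, decoded) pairs where decode(encode(x)) != x."""
    failures = []
    for text in samples:
        decoded = decode(encode(text))
        if decoded != text:
            failures.append((text, decoded))
    return failures


def toy_encode(text):
    """Toy pre-tokenizer: alphanumeric runs, '_', whitespace runs, other chars."""
    return re.findall(r"[A-Za-z0-9]+|_|\s+|[^\sA-Za-z0-9_]", text)


def naive_decode(tokens):
    """Buggy decoder: always inserts a space between tokens."""
    return " ".join(tokens)


def exact_decode(tokens):
    """Lossless decoder for the toy encoder."""
    return "".join(tokens)


SAMPLES = ["end_point", "chars = end_point;", "if (x) return;"]

print(round_trip_failures(toy_encode, naive_decode, SAMPLES))
print(round_trip_failures(toy_encode, exact_decode, SAMPLES))
```

In a real test suite the samples would be the suffix-vs-standalone and underscore-identifier cases, and the encode/decode pair would be the shipped tokenizer.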

Per-specialist sub-vocabs

We ship one shared 64K tokenizer, not per-specialist tokenizers. The strongest checked-in evidence is the fixed-token manifest for the v3 C++ tokenizer, which declares _total_vocab: 65536 and describes IDs 7200-65535 as the learned BPE band. Each specialist has a characteristic distribution over that shared vocabulary, which is useful to think about as a sub-vocab. A systems-C specialist spikes on __attribute__, __builtin_*, likely/unlikely, byte-value hex literals, and the preprocessor band, with almost no hits on CUTLASS/CUDA. A template-heavy generic C++ specialist saturates STL and Boost, touches the number-pattern band lightly, and uses attributes more than CUDA. A GPU specialist spikes on __global__, __device__, cudaMalloc, threadIdx, cublasSgemm, and atomics; a TPU/Pallas specialist lights up mhlo.*, pallas.program_id, BlockSpec, and GridSpec.
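One way to turn "characteristic distribution" into something measurable is to compute the smallest set of IDs that covers most of a specialist's token mass. The sketch below is an assumption-laden illustration: `band_of` and `working_set` are hypothetical helpers, the 0.99 coverage default is arbitrary, and everything between the special tokens and ID 7200 is lumped into a single "fixed" band because the text only pins the special range (0-63) and the learned range (7200-65535).

```python
from collections import Counter

SPECIAL_END = 64       # IDs 0-63 are special tokens in v3
LEARNED_START = 7200   # IDs 7200-65535 are the learned BPE band
VOCAB_SIZE = 65536


def band_of(token_id):
    """Coarse band label for a token ID in the shared layout."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token id {token_id} outside [0, {VOCAB_SIZE})")
    if token_id < SPECIAL_END:
        return "special"
    if token_id < LEARNED_START:
        return "fixed"
    return "learned"


def working_set(token_ids, coverage=0.99):
    """Smallest set of most-frequent IDs covering `coverage` of the stream."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    kept, mass = [], 0
    for token_id, n in counts.most_common():
        if mass >= coverage * total:
            break
        kept.append(token_id)
        mass += n
    return sorted(kept)


# Toy stream: one learned ID dominates, one fixed ID is common, one is rare.
stream = [7200] * 90 + [100] * 9 + [5] * 1
print(working_set(stream, coverage=0.95))
print({tid: band_of(tid) for tid in set(stream)})
```

A set computed this way from a specialist's training mix is also a natural candidate for a runtime allow-list, with the usual caveat that the tail it drops is exactly where a too-tight mask would hurt.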

Platform vocabulary is separate. A dedicated platform-metadata layer defines a 113-entry label-to-ID space consumed by an nn.EmbeddingBag(mode='sum') path, not by the text tokenizer. Six categories (OS, RTOS, GPU, architecture, compiler, C++ standard), up to 20 IDs per document. A prefix emitter can render the same info as a // platform: ... comment that does go through the tokenizer, but the ID-embedding path is primary.
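The sum-mode aggregation is simple enough to show without a framework. The sketch below mirrors what a sum-mode embedding bag computes for one document, using plain lists instead of nn.EmbeddingBag so it runs anywhere. The `platform_vector` helper, the 2-dimensional toy table, and the example IDs are all hypothetical.

```python
NUM_PLATFORM_IDS = 113   # size of the platform label-to-ID space
MAX_IDS_PER_DOC = 20     # upper bound on labels attached to one document


def platform_vector(table, platform_ids):
    """Sum the embedding rows for one document's platform IDs."""
    if len(platform_ids) > MAX_IDS_PER_DOC:
        raise ValueError(f"at most {MAX_IDS_PER_DOC} platform IDs per document")
    dim = len(table[0])
    out = [0.0] * dim
    for pid in platform_ids:
        if not 0 <= pid < len(table):
            raise ValueError(f"platform id {pid} outside the table")
        for j, value in enumerate(table[pid]):
            out[j] += value
    return out


# Toy table: row i is [i, i / 2] so the sums are easy to read.
table = [[float(i), float(i) * 0.5] for i in range(NUM_PLATFORM_IDS)]

# One document tagged with three platform labels (IDs are made up).
print(platform_vector(table, [1, 2, 4]))
```

The point of summing is that the result is one fixed-size vector per document regardless of how many labels it carries, which is why this path never needs to touch the text token stream.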

The practical effect is that "per-specialist sub-vocab" is emergent from training mix and runtime ID masking, not from duplicating tokenizer packages. We considered cold-freezing unused domain bands at inference (zeroing softmax over the CUDA band for a systems-C specialist, for example) but have not shipped it. The BPE band is shared, added-token IDs are stable across specialists, and the merge schedule is fixed, which keeps weight sharing and ensemble-time routing simpler than per-specialist tokenizers.

How it lands in MegaCpp

That downstream handoff is easiest to understand next to the C++ data-preparation pipeline and SLM data, because tokenizer changes only stay honest if the dataset and checkpoint receipts stay pinned to the same revision.

In production the tokenizer is a build-time output, not a runtime library. The runtime tokenizer JSON (~2.2 MiB) is produced by the upstream build and copied into the tokenizer directory the training launchers point at. The data-prep pipeline consumes it by path, so bumping the tokenizer means bumping the upstream tokenizer revision and re-running the downstream prep stages. The final validation stage asserts max(token_id) < vocab_size, so mismatched pairs fail fast.
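The fail-fast check amounts to a range assertion over each tokenized shard. A minimal sketch, assuming token IDs arrive as a plain iterable of ints; `validate_token_ids` is a hypothetical name and the lower-bound check is our addition:

```python
def validate_token_ids(token_ids, vocab_size):
    """Raise ValueError if any token ID falls outside [0, vocab_size)."""
    ids = list(token_ids)
    if not ids:
        return  # an empty shard has nothing to violate
    lowest, highest = min(ids), max(ids)
    if lowest < 0 or highest >= vocab_size:
        raise ValueError(
            f"token ids span [{lowest}, {highest}] but vocab_size is {vocab_size}; "
            "dataset and tokenizer revisions are probably mismatched"
        )


validate_token_ids([0, 63, 7200, 65535], vocab_size=65536)  # passes silently

try:
    # Data tokenized with the 64K-class vocabulary, checked against a 48K size.
    validate_token_ids([0, 63, 7200, 65535], vocab_size=49152)
except ValueError as err:
    print(f"rejected: {err}")
```

The check cannot detect two tokenizer builds with the same size but different ID assignments, which is why the revision pin still has to travel with the data.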

We lift the tokenizer into MegaCpp as-is. The only difference: the production launcher uses a HuggingFace tokenizer type with an explicit vocab-size flag and model directory, and never re-trains at launch.

Ablations and what we kept

The ablations worth preserving in the design history:

v1 at 32K was clearly under-vocabularied for code. BPE-only, no morpheme seeding, no number-pattern band, no thinking tokens. The fix was not "more BPE merges"; pure BPE on a bigger budget does not recover single-token __device__ or cudaMalloc. We committed to hybrid early.
The v2 proposal at 48K was a useful fallback but never shipped; the measured embedding cost of 65K was fine at our model sizes, so we went straight to v3.
The proposed 1,700 domain slots were almost entirely wrong. We kept 415 and freed the rest to BPE. The freed slots are not glamorous but they are real merges on high-frequency morphemes.
We considered a Unicode normalization pass (accents and math symbols). The corpus is 99.98% ASCII; the pass was not worth the cost of a non-invertible transform in an otherwise round-tripping pipeline.
We considered per-specialist vocabularies. We rejected them: shared weights and stable IDs matter more for ensemble routing than a small per-specialist efficiency win. Runtime ID masking is the lighter-weight alternative when we need it.

Production checklist

The tokenizer build output is tied to a specific upstream revision; that revision is recorded with every checkpoint the dataset feeds.
Added-token IDs are stable across versions for the lifetime of a specialist family; thinking, tool-call, and compile tokens have pinned IDs so SFT-formatted data survives tokenizer rebuilds.
Vocabulary size is asserted at every stage: data preparation, training launcher flags, and inference loader. Mismatches fail fast, not silently.
Decode round-trip is pinned by unit tests for the suffix-vs-standalone and underscore-identifier cases. Changes to _is_bpe_suffix require regression coverage.
Per-specialist runtime ID masking is a feature flag, off by default; it is a tool for ensemble routing, not a requirement.
Platform-info IDs are a separate metadata table from the token vocabulary and are consumed through an embedding-bag path, not through the text tokenizer.

Vocab snapshot

Layer	Slot count (approx)	Source	Notes
Fixed, hand-curated	thousands	keywords, punctuation, operators, morphemes	stable across versions
Learned BPE	tens of thousands	corpus frequencies	rebuilt v2 -> v3
Reserved and special	small	`<doc>`, `<mask>`, tool tokens	never reassigned
Per-specialist working set	subset of total	BPE seeding + runtime ID masking	no separate tokenizer file

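# runtime ID masking for a specialist: disallow IDs outside the working set
import numpy as np

def apply_specialist_mask(logits, specialist_ids):
    """Add -inf to every logit whose ID is outside the specialist allow-list."""
    mask = np.full(logits.shape[-1], -np.inf, dtype=np.float32)
    mask[specialist_ids] = 0.0  # allowed IDs pass through unchanged
    return logits + mask  # apply before softmax; broadcasts over batch dims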
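To see what the mask does to the output distribution, here is a dependency-free sketch of a masked softmax. It is an illustration, not the inference path: `masked_softmax` is a hypothetical helper, and the guards reflect our assumption that an empty or out-of-range allow-list should be rejected up front rather than allowed to produce NaNs.

```python
import math


def masked_softmax(logits, allowed_ids):
    """Softmax over logits with every ID outside allowed_ids forced to 0."""
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("empty allow-list: every token would be masked")
    if any(i < 0 or i >= len(logits) for i in allowed):
        raise ValueError("allow-list contains IDs outside the vocabulary")
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)  # finite, because at least one ID is allowed
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 5.0, 3.0]

# ID 1 has the highest logit but is outside the allow-list, so it cannot
# be produced at all; the remaining mass is shared by IDs 0 and 2.
probs = masked_softmax(logits, allowed_ids=[0, 2])
print(probs)
```

The example also shows the failure mode behind the off-by-default choice: if the right token is outside the allow-list, its probability is exactly zero, and no sampling temperature brings it back.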

Frequently asked questions

Are per-specialist sub-vocabs separate tokenizer files?

No. The shipped design is one shared 64K tokenizer whose working set shifts by training mix and, optionally, runtime ID masking.

Why keep platform labels out of the text tokenizer?

Because they behave like document metadata, not like tokens in the source stream. The text-tokenizer path is still about matching and decoding text, while the platform path is a separate bag of label IDs aggregated once per document and then broadcast as metadata. The quickest checked-in proof surface is Platform embedding sample, which keeps the aggregation mode explicit instead of pretending OS, compiler, or hardware tags are normal text tokens.

Why is runtime ID masking off by default?

Because it is a routing tool, not a tokenizer requirement. A too-aggressive mask can turn a specialist-choice mistake into a hard decoding failure, while the shared vocabulary already keeps IDs stable across specialists. The safer default is to keep the shared softmax available and enable masking only for controlled ensemble-routing experiments.

Is runtime ID masking a speed optimization?

No. The mask is a correctness and routing guard, not a cheaper softmax. The vocabulary projection still computes the full row, and in mixed precision an over-tight allow-list can leave the softmax with unstable logits instead of a useful distribution. The speed win comes from better BPE seeding and fewer fixed-token mistakes, not from trying to make logits-time masking replace the shared tokenizer.

What should a masking experiment record?

Record both the tokenizer build revision and the exact allowed-ID set. Masking does not change tokenizer identity, so an experiment is only comparable when the shared vocabulary pin and the specialist mask are both part of the receipt.
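A receipt for a masking experiment could look like the sketch below. The field names, the `masking_receipt` helper, and the example revision string are all hypothetical; the only requirement taken from the answers above is that the tokenizer revision and the exact allowed-ID set are both captured. The digest is added so two receipts can be compared without diffing long ID lists.

```python
import hashlib
import json


def masking_receipt(tokenizer_revision, allowed_ids, vocab_size=65536):
    """Build a comparable record of a masking experiment's vocabulary setup."""
    ids = sorted(set(allowed_ids))
    if ids and (ids[0] < 0 or ids[-1] >= vocab_size):
        raise ValueError("allowed IDs must lie inside [0, vocab_size)")
    digest = hashlib.sha256(json.dumps(ids).encode("utf-8")).hexdigest()
    return {
        "tokenizer_revision": tokenizer_revision,
        "vocab_size": vocab_size,
        "allowed_ids": ids,                 # the exact set, order-normalized
        "allowed_ids_sha256": digest,       # quick equality check
    }


# "example-rev" is a placeholder, not a real build revision.
a = masking_receipt("example-rev", [7200, 64, 100])
b = masking_receipt("example-rev", [100, 64, 7200, 64])
print(a["allowed_ids_sha256"] == b["allowed_ids_sha256"])
```

Normalizing the ID list before hashing means the digest depends only on which IDs are allowed, not on how the allow-list happened to be assembled.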

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

CUTLASS

NVIDIA CUTLASS kernel library and reference surface used for dense GEMM, FA4, and CuTe DSL interop.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Pallas

JAX's kernel language for writing explicit TPU kernels when stock XLA lowering is not enough for the required tile, memory-layout, or masking contract.

Tokenizer

How the MegaCpp C++ tokenizer evolved from a 32K v1 through a 48K v2 proposal to the 65K v3 release: what we proposed, what corpus frequency…

Grounding: Tokenizer evolution for C++ code: from v2 proposal to v3 shipped

Platform vocab

The compact per-document platform-ID vocabulary that travels beside token IDs and is embedded separately from the BPE rows.

NCCL

NVIDIA's collective-communication library for all-reduce, all-gather, reduce-scatter, and point-to-point transport on CUDA multi-GPU lanes.
