Building the C++ Training Data Pipeline: What Worked, What Broke
An honest walkthrough of how the MegaCpp training data pipeline was built: source selection, filtering, dedup, tokenization, document masking, and the quality gates that catch our own mistakes.
The broad retrospective: what worked, what broke, and how the pipeline settled.