MegaCpp EngineeringApplied C++ model systems

</>

Article

Grounded engineering note from the MegaCpp stack

Published April 19, 20266 min readDavid Gornshtein

Protobuf

Serialization

Streaming

Checkpoints

Data

Training Infra

Protobuf, the 2 GB Wall, and Why MegaCpp Prefers Shards Over Giant Messages

Why large-message serialization becomes fragile near protobuf's practical limits, and how MegaCpp's checkpoint and data paths avoid single huge payloads by using sharded files, streaming conversion, and explicit completion markers.

By David Gornshtein

MegaCpp

Focused on applied C++ model engineering

Article Preview

Protobuf, the 2 GB Wall, and Why MegaCpp Prefers Shards Over Giant Messages

Published April 19, 2026•6 min read•David Gornshtein

If you move enough model state or dataset payload through one serialized object, you eventually stop having a software architecture problem and start having a failure-domain problem. Protobuf's well-known size ceiling near 2 GB is one expression of that boundary. Even below the hard cap, giant messages are painful: allocation spikes, retry cost, partial-write ambiguity, and miserable debuggability.

The interesting question for MegaCpp was not whether protobuf is "good" or "bad." The question was narrower: what should the training and data pipelineQuick term guideData pipelineAn honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…GroundingBuilding the C++ Training Data Pipeline: What Worked, What Broke SLM data: what the pipeline optimizes for and why the loader contract matters most do so that single-message limits do not become the control plane for checkpointing or corpus transport? The answer visible in the repos is consistent: avoid monolithic payloads, write shardable artifacts, use atomic per-file promotion, and require explicit completion markers before consumers advance.

The 2 GB limit is real, but the practical pain starts earlier

Protocol Buffers officially document that any proto in serialized form must be smaller than 2 GiB across supported implementations, which is why the ecosystem treats payloads near that size as out of bounds. That is the headline limit. The operational limit is lower.

Long before a message reaches that ceiling, large-object serialization creates four problems:

one failure can invalidate the whole transfer rather than one shard
producer and consumer both need enough contiguous memory headroom for the whole object
retries become expensive because they repeat the whole blob
partial writes and partial reads are harder to distinguish from valid-but-incomplete state

That is exactly the class of pain that large checkpoints and streaming corpora trigger.

What the checkpoint code avoids

The checkpoint manager in the training repo does not rely on one giant serialized checkpoint message. It keeps two storage shapes instead.

First, a distributed checkpointQuick term guideDCPDistributed checkpoint planning and restore contract used by the Megatron-style save-and-resume lane.GroundingAbout: Checkpoint format and resume History: Restoration without git history Example: FSDP sharding sample path writes a directory of shards through torch.distributed.checkpoint, where each rank writes its own dataQuick term guideData pipelineAn honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…GroundingBuilding the C++ Training Data Pipeline: What Worked, What Broke SLM data: what the pipeline optimizes for and why the loader contract matters most directly and rank 0 separately writes small JSON metadata. The code comment is explicit about the reason: this avoids gathering the full model to rank 0 and is materially faster for sharded runs.

Second, the non-DCPQuick term guideDCPDistributed checkpoint planning and restore contract used by the Megatron-style save-and-resume lane.GroundingAbout: Checkpoint format and resume History: Restoration without git history Example: FSDP sharding sample path writes separate files for model weights, optimizer state, metadata, and optional extra state. Those files are promoted atomically via *.tmp plus os.replace(...), including metadata sidecars and emergency-save pointers. That is a very different failure surface from "serialize everything into one message and hope the write completes."

The DCPQuick term guideDCPDistributed checkpoint planning and restore contract used by the Megatron-style save-and-resume lane.GroundingAbout: Checkpoint format and resume History: Restoration without git history Example: FSDP sharding sample-vs-rank-0 split matters more as checkpoints grow. Once state is large enough that protobuf-style size ceilings are even relevant, rank-0 gather stops being just a speed problem and becomes the widest failure domain in the save path. A sharded checkpoint keeps persistence aligned with the distributed state that already exists during training.

The same grain also helps at load time. A distributed loader can use checkpoint metadata and the current model shard layout to read only the dataQuick term guideData pipelineAn honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…GroundingBuilding the C++ Training Data Pipeline: What Worked, What Broke SLM data: what the pipeline optimizes for and why the loader contract matters most each rank needs, so a restart or topology change does not have to reconstruct one giant in-memory object before redistribution.

The rotation logic also shows the same design instinct. Old local checkpoints are not deleted immediately if background remote upload threads are still alive or have marked gcs_upload_ok as failed. The code prefers disk pressure over permanent dataQuick term guideData pipelineAn honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…GroundingBuilding the C++ Training Data Pipeline: What Worked, What Broke SLM data: what the pipeline optimizes for and why the loader contract matters most loss. Again, the pattern is file-by-file durability, not trust in one monolithic transfer succeeding.

The resume boundary is the same story in smaller form. A manifest or success marker is only useful if it advances after every required shard has landed and been promoted, which is the durability contract described in Checkpoint format and resume.

What the data path avoids

The dataQuick term guideData pipelineAn honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…GroundingBuilding the C++ Training Data Pipeline: What Worked, What Broke SLM data: what the pipeline optimizes for and why the loader contract matters most-preparation scripts show the same bias away from giant serialized payloads.

The streaming JSONL-to-parquet converter tails a JSONL file that may still be growing, accumulates rows only up to a configured rows_per_file threshold, writes a parquet shard, validates that shard by reopening it, and only then moves on. When input growth stops for long enough, it writes the final train shard, a validation shard, and a _COMPLETE sentinel.

The companion upload and download scripts are also shard-oriented. They copy parquet files individually and use _COMPLETE as the promotion marker rather than assuming that seeing some files means the dataset is ready. On the MegaCpp side, the same logic is carried into the checked-in streaming parquet helper: growing JSONL in, bounded parquet shards out, explicit sentinel at the end.

That choice matters because a corpus is not safer just because it is "serialized." A 100 GB corpus wrapped in a giant logical message is still a 100 GB problem. Columnar shards plus explicit end-of-stream signaling are much easier to retry, validate, and resume.

Shard size is part of the same reliability boundary. For parquet streams, the practical target is usually a row-group size that fits comfortably inside one dataQuick term guideData pipelineAn honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…GroundingBuilding the C++ Training Data Pipeline: What Worked, What Broke SLM data: what the pipeline optimizes for and why the loader contract matters most-loader worker's memory budget, often in the low hundreds of megabytes rather than a multi-gigabyte lump; for tar-style WebDataset streams, larger roughly gigabyte-scale shards are common because the reader is sequential. The exact row count should follow the schema and worker RAM, not a universal magic number.

The checkpoint path uses the same local-first separation. Fast local shard writes answer "is this save complete?" first, while remote object drain is a later state with its own retries and cleanup. That keeps training coupled to a short durability boundary instead of to one long WAN-bound upload.

Why this is better than one huge message

The repo evidence supports a narrow claim: MegaCpp repeatedly chose boundaries that keep failure local.

Problem	Giant-message design	Sharded-file design used here
checkpoint save	one large serialize/write step	DCP shard dirs or separate model/optimizer/meta files
interrupted write	ambiguous whole-object corruption	`*.tmp` then `os.replace(...)` per artifact
remote copy lag	easy to delete the only good blob	keep local copies while upload threads are pending or failed
dataset streaming	one huge export unit	parquet shards with row thresholds
end-of-stream detection	infer from transport success	explicit `_COMPLETE` sentinel

None of this means protobuf should never be used. It means protobuf-sized thinking is the wrong abstraction for heavyweight training artifacts. Once payloads are large enough to make 2 GB even relevant, the safer design is usually to stop pretending the object is singular.

What we changed or deliberately avoided

From the code and article history, the practical design choices are clear:

avoid rank-0 gather as the only checkpoint shape for heavily sharded runs
avoid single-file checkpoint authority when DCPQuick term guideDCPDistributed checkpoint planning and restore contract used by the Megatron-style save-and-resume lane.GroundingAbout: Checkpoint format and resume History: Restoration without git history Example: FSDP sharding sample shard directories are the natural write format
avoid treating partial presence as readiness; use sidecar metadata and _COMPLETE markers
avoid non-atomic final writes for large artifacts; write temp files and promote with rename
avoid dataset transport that depends on one oversized serialized payload staying valid end to end

This is not an anti-protobuf argument. It is an argument for choosing the right serialization grain. Small control messages are fine. Giant model and dataQuick term guideData pipelineAn honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…GroundingBuilding the C++ Training Data Pipeline: What Worked, What Broke SLM data: what the pipeline optimizes for and why the loader contract matters most artifacts want chunking, validation, and promotion semantics that survive interruption.

The broader lesson

The 2 GB protobuf limit is a useful warning sign because it forces an architectural question: if one message can brick the operation, why is the operation shaped like one message at all?

MegaCpp's repos answer that question with a boring but durable pattern: many files, explicit metadata, atomic promotion, resumable copies, and completion markers. That pattern is less elegant than "one object in, one object out," but it is much closer to how large training systems actually survive.

FAQ

Frequently asked questions

What still counts as the commit point when the payload is sharded?+

The commit point should stay small even when the payload is huge. For checkpoints that usually means a metadata index or iteration pointer that advances only after every required shard has landed; for streamed datasets it means a readiness marker like _COMPLETE that appears only after the last validated shard is in place. That separation is the real reliability win: the large artifacts stay file-shaped and resumable, while the thing that says "safe to consume" remains cheap to rewrite and easy to audit. The local companion surfaces for that boundary are Checkpoint format and resume and Converting parquet token shards into Megatron indexed datasets.

Why not push one checkpoint blob straight to object storage?+

Object stores have their own grain size. Amazon S3, for example, caps a single PUT at 5 GB and recommends multipart upload once objects reach 100 MB; multipart upload lets a client retry only the interrupted parts, but unfinished uploads must still be completed or aborted so uploaded parts stop being retained and billed. That fits the same MegaCpp pattern: commit local shards first, drain remote storage in retryable pieces, and let confirmed remote durability drive cleanup instead of making the training step depend on one giant WAN write.

What happens if remote checkpoint upload falls behind local saves?+

The local checkpoint is still the short commit boundary, but the system has to treat the remote drain as a capacity-managed queue. If upload retries lag long enough, cleanup should prefer keeping the last locally durable checkpoints over deleting the only good copy; once remote durability is confirmed, eviction can free local NVMe headroom for the next save. That is the same separation used by checkpoint resume markers and should be visible in SLO dashboards, not hidden inside a best-effort uploader.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

DCP

Distributed checkpoint planning and restore contract used by the Megatron-style save-and-resume lane.

Grounding

SLO

A single service-level objective such as deadline miss rate or first-token latency floor that the serving stack publishes per specialist rather than as one blended ensemble number.

Grounding

Data pipeline

An honest walkthrough of how the MegaCpp training data pipeline was built — source selection, filtering, dedup, tokenization, document masking, and…

Grounding

Topic hubs

Topic Hub

C++ Data Pipelines and Corpus Packaging

A curated archive for the C++ data path: corpus selection, semantic enrichment, packaging into training artifacts, and the file-level durability choices that keep the pipeline sane.

David Gornshtein • MegaCppMore posts →

Protobuf, the 2 GB Wall, and Why MegaCpp Prefers Shards Over Giant Messages

The 2 GB limit is real, but the practical pain starts earlier

What the checkpoint code avoids

What the data path avoids

Why this is better than one huge message

What we changed or deliberately avoided

The broader lesson

Read next

References

Frequently asked questions

Terms used in this article

Continue with a curated reading path

C++ Data Pipelines and Corpus Packaging