Grounded engineering note from the MegaCpp stack
By David Gornshtein · 8 min read
Structure Aware
C++
Embeddings
Attention Bias
Training

Structure Embeddings and Relation Bias: Teaching the Model That Code Has Shape

How per-token structure IDs, chunk boundaries, and call/type edges become input embeddings and attention bias in the MegaCpp stack, what the ablations kept, and what ships in deployment.

C++ source is not a flat token stream. It has a preamble, functions, classes, call edges, type edges, and a dependency order imposed by headers. For two training years we watched the model rediscover that structure poorly from whitespace and identifiers alone. This post is about the two features we built to put the structure in the input instead of hoping the model infers it: learned structure embeddings added at the input, and a relation bias added to attention logits. It also tells the honest story of what survived ablations and what we dropped on the way into deployment.

As a quick orientation: structure_ids are the coarse token-level code-region labels, and dep_levels are the lightweight dependency-depth buckets. A structure embedding adds those aligned feature planes on the input side. Relation bias converts chunk-pair relations into additive attention logits instead of new token embeddings.

Why the deployed MegaCpp stack cares about this

A C++ corpus that is enriched at pre-tokenization time carries a lot of cheap supervision: structure categories per character (preamble, function body, class member, comment, typedef, namespace), AST depth and sibling index from tree-sitter, the node type of the token's surrounding AST node, chunk boundaries with dependency levels computed from includes and type-uses, and cross-chunk edges (caller/callee, type dependency). All of that is computable once in a Rust chunker plus a tree-sitter pass and stored in the enriched parquet schema we ship to the dataloader.

The question is what to do with it at train time. Two shapes of feature dominate: per-token scalars (structure category, dep level, AST depth, sibling index, node type bucket) and per-chunk-pair relations (call, type, same-level, adjacent-level, preamble-to-code). The first naturally becomes additive input embeddings. The second naturally becomes an ALiBi-style additive attention bias. We prototyped both and kept the one that paid rent.

What we built in the training prototypes

The input-side piece is a structure-embedding layer. Its category set covers the nine character-level structure labels emitted by our Rust chunking pipeline: preamble, function signature, function body, class declaration, class member, comment, typedef, namespace, and other. Its relation taxonomy is deliberately richer than a binary edge marker: caller-to-callee and callee-to-caller are separate planes, as are type-uses versus type-used-by, and the two dependency-level relations (same depth, adjacent depth) let the model see the structural spine of the translation unit.

The main input-level module takes up to five per-token ID streams — structure, dependency level, AST depth, sibling index, and AST node type — and produces a (B, T, n_embd) tensor that gets added to the standard token and position embeddings. The implementation went through several rewrites. The current version uses one concatenated embedding vocabulary rather than five separate lookups, a learned low-rank bottleneck, a linear up-projection to model dimension, and per-component learned scalar scales. All weights are zero-initialized, so attaching the module to an existing checkpoint is a step-zero no-op and the signal has to be earned by gradient descent. The configuration can enable every component, only the core always-available features, or an explicit subset; offsets and clamp bounds are precomputed so the forward path stays shape-static and XLA-friendly.
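
The stacked lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the MegaCpp module: the class name, the vocabulary sizes, and the initialization details are assumptions. In particular, the sketch zero-initializes only the up-projection, which is already enough to make the module a step-zero no-op; zeroing every factor of the product at once would also zero all of their gradients, so how the real module balances this is not something this sketch claims to reproduce.

```python
import numpy as np


class StackedStructureEmbedding:
    """Hypothetical sketch: one concatenated table, low-rank bottleneck, up-projection."""

    def __init__(self, vocab_sizes, n_embd, bottleneck_dim=64, seed=0):
        # vocab_sizes maps component name -> number of distinct IDs (assumed values).
        self.names = list(vocab_sizes)
        sizes = np.array([vocab_sizes[n] for n in self.names], dtype=np.int64)
        # Precomputed once so the forward pass has no data-dependent shapes.
        self.offsets = np.cumsum(sizes) - sizes   # first table row of each component
        self.max_ids = sizes - 1                  # clamp bound per component
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0.0, 0.02, size=(int(sizes.sum()), bottleneck_dim))
        self.up = np.zeros((bottleneck_dim, n_embd))  # zero => output is zero at step 0
        self.scales = np.ones(len(self.names))        # per-component scalar scales

    def __call__(self, ids):
        # ids maps component name -> (B, T) integer array.
        stacked = np.stack([ids[n] for n in self.names], axis=-1)  # (B, T, K)
        rows = np.clip(stacked, 0, self.max_ids) + self.offsets     # (B, T, K)
        looked_up = self.table[rows]                                # one gather: (B, T, K, D)
        mixed = (looked_up * self.scales[:, None]).sum(axis=-2)     # (B, T, D)
        return mixed @ self.up                                      # (B, T, n_embd)


emb = StackedStructureEmbedding({"structure": 9, "dep_level": 4}, n_embd=8, bottleneck_dim=4)
ids = {
    "structure": np.array([[0, 8, 99]]),  # 99 is out of range and clamps to 8
    "dep_level": np.array([[0, 3, 3]]),
}
out = emb(ids)  # shape (1, 3, 8), all zeros while `up` is untrained
```

The output is added to the token and position embeddings, so a zero result leaves an existing checkpoint's activations unchanged. Clamping before adding the offset keeps an out-of-range ID inside its own component's slice of the shared table.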

The per-token AST features come from a tree-sitter-cpp pass. We parse each C++ source file, walk the AST to paint per-character arrays for depth, sibling index, and a node-type bucket, then downsample to token level by sampling at each token's first character. The node-type bucketing mirrors the Rust chunking side: ten coarse ranges for declarations, statements, expressions, types, literals, operators, parameters, scope qualifiers, and miscellany. Keeping the Python and Rust mappings bit-identical is a recurring maintenance tax, but it is the only way the same training graph sees the same integers regardless of which enrichment path produced the dataset.
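
The paint-then-sample step can be illustrated without a parser. In the sketch below, the spans stand in for what an AST walk would yield; the function names, the span tuples, and the token boundaries are all made up for the example and are not the pipeline's real data.

```python
def paint_depth(n_chars, spans):
    """Paint a per-character depth array from (start, end, depth) spans.

    Spans are half-open character ranges listed in pre-order, so a child
    node listed after its parent overwrites the parent's value.
    """
    depth = [0] * n_chars
    for start, end, d in spans:
        for i in range(max(start, 0), min(end, n_chars)):
            depth[i] = d
    return depth


def sample_at_token_starts(char_values, token_starts):
    """Downsample to token level by reading the value at each token's first character."""
    return [char_values[s] for s in token_starts]


source = "int f(){}"
# Hypothetical spans: whole function at depth 1, declarator and body at depth 2.
spans = [(0, 9, 1), (4, 7, 2), (7, 9, 2)]
char_depth = paint_depth(len(source), spans)   # [1, 1, 1, 1, 2, 2, 2, 2, 2]

# Hypothetical tokenization: "int", " ", "f", "()", "{}"
token_starts = [0, 3, 4, 5, 7]
token_depth = sample_at_token_starts(char_depth, token_starts)  # [1, 1, 2, 2, 2]
```

Sampling at the first character is a deliberate simplification: a token that straddles two AST nodes takes the label of the node it starts in.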

The enriched training schema carries the per-token structure IDs, dependency levels, AST depth, sibling index, and AST node-type buckets, plus chunk boundaries, chunk dependency levels, and edge lists for calls and type dependencies. The loader helpers are small but load-bearing: they coerce mixed array encodings into aligned sequences and keep training ingest from doing that work ad hoc. The quality of that upstream signal depends on the build-aware extraction described in Compile commands and semantic graphs.
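
A loader helper of this kind might look like the sketch below. It is an assumption-laden illustration, not the project's code: the function names are hypothetical, and filling a missing plane with zeros while rejecting a length mismatch is one possible policy among several.

```python
from array import array


def coerce_plane(value, n_tokens, pad=0):
    """Turn one feature column into a plain list of ints of length n_tokens."""
    if value is None:
        # Assumption: an absent plane is filled with a neutral ID.
        return [pad] * n_tokens
    if hasattr(value, "to_pylist"):      # e.g. a pyarrow array
        seq = value.to_pylist()
    elif hasattr(value, "tolist"):       # e.g. numpy.ndarray or array.array
        seq = value.tolist()
    else:                                # list, tuple, or other iterable
        seq = list(value)
    seq = [int(v) for v in seq]
    if len(seq) != n_tokens:
        # Misaligned planes are rejected rather than silently padded or cut.
        raise ValueError(f"plane has {len(seq)} entries, expected {n_tokens}")
    return seq


def align_planes(row, names, n_tokens):
    """Coerce every requested plane of one row to the same token length."""
    return {name: coerce_plane(row.get(name), n_tokens) for name in names}


row = {"structure_ids": (2, 2, 3), "dep_levels": array("i", [0, 1, 1])}
planes = align_planes(row, ["structure_ids", "dep_levels", "ast_depth"], n_tokens=3)
```

Centralizing the coercion in one place means the training loop only ever sees lists of equal length, whatever container the parquet reader happened to return.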

The attention-side piece is a relation-bias table with shape (num_relation_types, num_heads): nine relations times the head count, so only a few hundred parameters. The forward path takes a chunk-level relation mask (B, R, C, C) built from the edge lists, plus a (B, T) token-to-chunk mapping. It combines relation planes into a per-head chunk-level bias with an einsum, then promotes that to (B, H, T, T) for attention. Invalid tokens, meaning those outside any chunk, are masked out. The table starts at zero so existing checkpoints reload cleanly, and the bias is added to attention logits before softmax, just like ALiBi. The chunk-level intermediate matters: at roughly C = 64 and T = 4k, the memory cost is tiny compared with a full token-token relation tensor.
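
The shapes in that forward path are easier to follow in code. The NumPy sketch below follows the description (einsum at chunk level, gather to token level, mask invalid tokens); the function name, the use of -1 for tokens outside any chunk, and the toy values are assumptions made for illustration.

```python
import numpy as np


def relation_bias(table, rel_mask, tok2chunk):
    """Sketch of a chunk-level relation bias promoted to token level.

    table:     (R, H) per-relation, per-head scalars
    rel_mask:  (B, R, C, C) 0/1 planes; rel_mask[b, r, i, j] = 1 if chunk i relates to chunk j
    tok2chunk: (B, T) chunk index per token, -1 for tokens outside any chunk
    returns:   (B, H, T, T) additive bias for pre-softmax attention logits
    """
    # Combine relation planes per head while still at chunk resolution.
    chunk_bias = np.einsum("brij,rh->bijh", rel_mask.astype(table.dtype), table)  # (B, C, C, H)

    valid = tok2chunk >= 0
    idx = np.where(valid, tok2chunk, 0)          # safe index for invalid tokens
    b = np.arange(idx.shape[0])[:, None, None]
    # Gather chunk_bias[b, chunk(query), chunk(key)] for every token pair.
    tok_bias = chunk_bias[b, idx[:, :, None], idx[:, None, :]]  # (B, T, T, H)

    # Zero out any pair that involves a token outside every chunk.
    pair_valid = valid[:, :, None] & valid[:, None, :]
    tok_bias = tok_bias * pair_valid[..., None]
    return tok_bias.transpose(0, 3, 1, 2)        # (B, H, T, T)


table = np.array([[1.0, 2.0], [10.0, 20.0]])     # R = 2 relations, H = 2 heads
rel_mask = np.zeros((1, 2, 2, 2))
rel_mask[0, 0, 0, 1] = 1                         # relation 0: chunk 0 -> chunk 1
rel_mask[0, 1, 1, 0] = 1                         # relation 1: chunk 1 -> chunk 0
tok2chunk = np.array([[0, 1, -1]])               # third token is outside any chunk
bias = relation_bias(table, rel_mask, tok2chunk) # shape (1, 2, 3, 3)
```

On memory: with R = 9, C = 64, and T = 4096 (reading the 4k above literally), the chunk-level mask holds 9 × 64 × 64 = 36,864 entries per sequence, against 9 × 4096 × 4096 = 150,994,944 for a token-level relation tensor, a factor of 4,096.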

The broader prototype surface also included a document-level platform-label input and a TreeFFN-style chunk-graph enrichment path. The checked-in platform embedding sample shows the broadcast document-label shape, and the structure graph enricher sample shows the pool -> update -> scatter contract. Both stayed experimental rather than becoming part of the default deployment path.

How it lands in production

The deployed port is deliberately narrower than the prototype surface. MegaCpp keeps the input-level structure feature as a regular embedding-side addition: the same stacked lookup plus low-rank bottleneck, the same zero-init behavior, and the same notion of a small default component set versus a fuller opt-in set. The checked-in structure embedding contract sample shows the validation seam that normalizes extra structure inputs before they mix into the main token embeddings.

The ingest seam stays simple: carry the five aligned structure planes from the enriched batch into the embedding path without renaming or recomputing them in the hot loop. The checked-in structure embedding components sample and data index show the exact token-level fields that survive this boundary.

Three things from the earlier implementation did not cross the boundary:

  • Chunk-pair relation bias as a default attention feature. We removed the full relation-bias path after later ablations stopped paying for it and token-compacted attention paths made the integration brittle.
  • Tree-style chunk-graph enrichment. It worked in ablations, but once enriched data plus structure embeddings plus ngram hash were enabled, the extra gain was smaller than the cost of carrying message passing through every step.
  • Document-level platform labels. They did not earn a permanent place on the production corpus.

What does carry forward is the input-level structure embedding with the stacked bottleneck, gated by a feature flag and defaulting to the "core" component set (structure + dep_level). MegaCpp injects that additive embedding alongside the token embedding path. It stays a regular model feature rather than becoming a custom kernel path because the cost is already low.

Ablations and what we kept

The two structure-aware features split cleanly on whether they survived ablation:

| Feature | Module | Ports to MegaCpp | Default in prod | Why |
|---|---|---|---|---|
| Input-level structure embedding (core: structure + dep_level) | Structure inputs, embedding seam | yes | on | Largest single win in the enriched-data table |
| Stacked single-lookup bottleneck (dim=64) | Structure inputs | yes | on | Cuts param count and ~12 kernel launches/step |
| Tree-style chunk graph enricher | Graph enricher sample | experimental | off | Marginal once enriched data + ngram hash are on |
| Relation bias (chunk pair -> per-head logit add) | Relation sample | no | off | Marginal in ablations, brittle under token compaction |
| Document-level platform labels | Platform embedding sample | no | n/a | Production corpus did not justify the extra parameters |

We ran the structure-aware features across three overlapping experiments: a no-enrichment baseline, a structure-core rung, and a full stack (structure + tree_ffn + relation_bias + ngram hash). The enriched data is consistently the largest single win in the training-throughput and loss tables, and most of that win comes from the input-level embeddings, not the graph or the attention bias. Two concrete observations from the ablation record shaped the port:

  • The single-lookup stacked embedding with a 64-dim bottleneck cut structure-embedding parameter count several-fold and removed roughly a dozen kernel launches per step. Earlier versions used separate embeddings per component, a softmax over component weights, and a mask for absent components. That older weighting path also had an accidental fp32 allocation that hurt bf16 throughput. Both are gone.
  • The chunk-graph update path went through several rewrites to cut down quadratic work. The final version works, but it remained too expensive for the value it added in production settings.

The broader lesson was smaller than the prototype menu. Cheap aligned input planes survived, but graph-time enrichment and relation bias did not. Once token compaction or other sequence surgery rewrites physical token order, a fixed chunk-pair logit bias drifts unless logical anchors survive the rewrite; the input-side structure planes are much more robust. That is why this article stays paired with Compile commands and semantic graphs, Tokenized enriched packed rows on TPU, and Packed rows as the real training contract: more of the durable structure story lives in preprocessing, packing, and retrieval than in a per-step runtime graph block.
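
A toy example makes the drift concrete. The sketch assumes compaction is a gather over kept token positions; the variable names and values are illustrative only.

```python
def compact(values, keep):
    """Apply the same gather that compaction applies to the token sequence."""
    return [values[i] for i in keep]


token_to_chunk = [0, 0, 1, 1, 2, 2]   # logical anchor: chunk index per token
structure_ids = [3, 3, 5, 5, 7, 7]    # token-aligned input plane
keep = [0, 2, 3, 5]                   # positions that survive compaction

# Per-token planes follow their tokens through the gather, so they stay aligned.
compacted_structure = compact(structure_ids, keep)   # [3, 5, 5, 7]

# The chunk mapping is only correct if it is rewritten with the same gather.
remapped = compact(token_to_chunk, keep)             # [0, 1, 1, 2]

# A bias built from the pre-compaction mapping describes the old layout.
stale = token_to_chunk[:len(keep)]                   # [0, 0, 1, 1]
mismatches = [i for i, (a, b) in enumerate(zip(remapped, stale)) if a != b]  # [1, 3]
```

Half of the surviving positions would receive a bias meant for a different chunk if the mapping were not rewritten, which is the brittleness the ablations ran into.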

Deployment checklist

The minimum public-safe config surface looks like this:

structure_features:
  enabled: true
  active_components: core   # structure + dep_level
  bottleneck_dim: 64
  relation_bias: false
  tree_graph: false

  • Build this through validated configuration entry points rather than ad hoc objects.
  • The dataloader should emit all five aligned token-level planes (structure_ids, dep_levels, ast_depth, sibling_index, ast_node_type) even when only the core components are active.
  • Node-type bucketing must stay consistent across preprocessing and training.
  • Keep the structure embedding zero-initialised when attaching it to new or converted checkpoints. Accidentally non-zero values at step zero shift the loss curve and make ablation results non-comparable.
  • Relation-bias and tree-graph paths should default to false in production configs.
  • The chunk-level bias path, if ever re-enabled, should not be combined with token compaction in the same layer without re-validation.
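
One way to read "validated configuration entry points" is a small frozen dataclass that rejects bad combinations at construction time. The sketch below mirrors the field names of the config above; the class name, the "all" keyword, the component names beyond structure and dep_level, and the token_compaction flag are assumptions added for illustration.

```python
from dataclasses import dataclass

CORE_COMPONENTS = ("structure", "dep_level")
ALL_COMPONENTS = CORE_COMPONENTS + ("ast_depth", "sibling_index", "ast_node_type")


@dataclass(frozen=True)
class StructureFeatureConfig:
    enabled: bool = True
    active_components: object = "core"  # "core", "all" (assumed name), or an explicit tuple
    bottleneck_dim: int = 64
    relation_bias: bool = False
    tree_graph: bool = False
    token_compaction: bool = False      # hypothetical flag, not in the config above

    def __post_init__(self):
        if self.bottleneck_dim <= 0:
            raise ValueError("bottleneck_dim must be positive")
        unknown = set(self.components()) - set(ALL_COMPONENTS)
        if unknown:
            raise ValueError(f"unknown components: {sorted(unknown)}")
        if self.relation_bias and self.token_compaction:
            # Mirrors the last checklist item: this pairing needs re-validation first.
            raise ValueError("relation_bias with token compaction needs re-validation")

    def components(self):
        """Resolve the active component names for this config."""
        if not self.enabled:
            return ()
        if self.active_components == "core":
            return CORE_COMPONENTS
        if self.active_components == "all":
            return ALL_COMPONENTS
        return tuple(self.active_components)


cfg = StructureFeatureConfig()            # production-style defaults
subset = StructureFeatureConfig(active_components=("structure", "ast_depth"))
```

Failing at construction keeps a mistyped component name or an unvalidated flag combination from reaching a training run.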

Frequently asked questions

Why did structure embeddings survive while relation bias did not?
Because the additive input signal kept paying for itself in ablations, while the attention-side relation bias remained marginal and became brittle once token-compacted paths entered the runtime.
Is relation bias just ALiBi for code chunks?
Not exactly. ALiBi adds a fixed distance penalty to token-position attention scores; our prototype used learned per-relation, per-head chunk-pair planes. The failure mode rhymes, though: once packing or token compaction rewrites physical order without stable logical anchors, an attention-logit bias can describe the old neighborhood instead of the one the layer is actually attending over. That is why the production lane keeps the durable signal in token-aligned input fields and treats chunk graph context as preprocessing, retrieval, or explicitly revalidated metadata.
Why materialize structure before token packing?
Because AST and chunk labels start as source-span facts, while subword tokenizers can split one identifier across several tokens. Resolving that once in preprocessing gives the model token-aligned fields and keeps the embedding seam shape-only at runtime; the same boundary is why Tokenized enriched packed rows on TPU treats span-aware materialization as part of the data contract.
Why not feed the whole AST into the model?
Because the durable contract is token-aligned, not tree-shaped. The full AST is still useful upstream for Compile commands and semantic graphs, retrieval chunking, and source-span labels, but carrying a raw tree or per-step graph through attention would reopen the cost and brittleness that made relation bias and graph enrichment stay off. That is also why the references below separate GraphCodeBERT-style data-flow structure and cAST-style structural chunking from the compact embedding contract sample.

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

Attention

The token-mixing path that turns Q/K/V style projections into context-aware activations. On MLA pages here it refers to the concrete attention module boundary, not the A/M/E/R block-family shorthand.

Compile command database

The compilation-command manifest that grounds semantic indexing, compiler receipts, and structure-aware data enrichment.

Semantic indexing

The structure-aware indexing lane that turns compile commands and parsed symbols into reusable training metadata.

Training

A grounded walkthrough of how the project approaches small-language-model training: explicit stack specs, memory-first patches, hybrid blocks, and…

Packed rows

Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…