Structure Embeddings and Relation Bias: Teaching the Model That Code Has Shape
How per-token structure IDs, chunk boundaries, and call/type edges become input embeddings and attention bias in the MegaCpp stack, what the ablations kept, and what ships in deployment.

C++ source is not a flat token stream. It has a preamble, functions, classes, call edges, type edges, and a dependency order imposed by headers. For two training years we watched the model rediscover that structure poorly from whitespace and identifiers alone. This post is about the two features we built to put the structure in the input instead of hoping the model infers it: learned structure embeddings added at the input, and a relation bias added to attention logits. It also tells the honest story of what survived ablations and what we dropped on the way into deployment.
As a first pass at the terms: structure_ids are the coarse token-level code-region labels, dep_levels are the lightweight dependency-depth buckets, structure embedding means adding those aligned feature planes at the input side, and relation bias means converting chunk-pair relations into additive attention logits instead of new token embeddings.
Why the deployed MegaCpp stack cares about this
A C++ corpus that is enriched at pre-tokenization time carries a lot of cheap supervision: structure categories per character (preamble, function body, class member, comment, typedef, namespace), AST depth and sibling index from tree-sitter, the node type of the token's surrounding AST node, chunk boundaries with dependency levels computed from includes and type-uses, and cross-chunk edges (caller/callee, type dependency). All of that is computable once in a Rust chunker plus a tree-sitter pass and stored in the enriched parquet schema we ship to the dataloader.
The question is what to do with it at train time. Two shapes of feature dominate: per-token scalars (structure category, dep level, AST depth, sibling index, node type bucket) and per-chunk-pair relations (call, type, same-level, adjacent-level, preamble-to-code). The first naturally becomes additive input embeddings. The second naturally becomes an ALiBi-style additive attention bias. We prototyped both and kept the one that paid rent.
What we built in the training prototypes
The input-side piece is a structure-embedding layer. Its category set covers the nine character-level structure labels emitted by our Rust chunking pipeline: preamble, function signature, function body, class declaration, class member, comment, typedef, namespace, and other. Its relation taxonomy is deliberately richer than a binary edge marker: caller-to-callee and callee-to-caller are separate planes, as are type-uses versus type-used-by, and the two dependency-level relations (same depth, adjacent depth) let the model see the structural spine of the translation unit.
The main input-level module takes up to five per-token ID streams — structure, dependency level, AST depth, sibling index, and AST node type — and produces a (B, T, n_embd) tensor that gets added to the standard token and position embeddings. The implementation went through several rewrites. The current version uses one concatenated embedding vocabulary rather than five separate lookups, a learned low-rank bottleneck, a linear up-projection to model dimension, and per-component learned scalar scales. All weights are zero-initialized, so attaching the module to an existing checkpoint is a step-zero no-op and the signal has to be earned by gradient descent. The configuration can enable every component, only the core always-available features, or an explicit subset; offsets and clamp bounds are precomputed so the forward path stays shape-static and XLA-friendly.
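A minimal forward-only sketch of that layout in NumPy can make the shapes concrete. The class name, the vocabulary sizes, and the way scales are applied before the up-projection are illustrative assumptions, not the MegaCpp implementation.

```python
import numpy as np

class StackedStructureEmbedding:
    """Hypothetical sketch: one stacked lookup, low-rank bottleneck, up-projection."""

    def __init__(self, vocab_sizes, n_embd, bottleneck_dim=64):
        self.names = list(vocab_sizes)
        sizes = np.array([vocab_sizes[n] for n in self.names], dtype=np.int64)
        # Precomputed offsets give each component its own slice of one table.
        self.offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))
        # Precomputed clamp bounds keep out-of-range IDs inside their slice.
        self.max_ids = sizes - 1
        total = int(sizes.sum())
        # Zeros stand in for the zero-init described above: output is 0 at step zero.
        self.table = np.zeros((total, bottleneck_dim), dtype=np.float32)
        self.up_proj = np.zeros((bottleneck_dim, n_embd), dtype=np.float32)
        self.scales = np.zeros(len(self.names), dtype=np.float32)

    def forward(self, ids):
        """ids: dict name -> (B, T) int array. Returns (B, T, n_embd)."""
        stacked = np.stack([ids[n] for n in self.names], axis=-1)   # (B, T, K)
        stacked = np.clip(stacked, 0, self.max_ids) + self.offsets  # one ID space
        low = self.table[stacked]                                   # (B, T, K, bottleneck)
        # Per-component learned scalar scales, then sum over components.
        mixed = (low * self.scales[None, None, :, None]).sum(axis=2)
        return mixed @ self.up_proj                                 # (B, T, n_embd)
```

The result is added to the token and position embeddings. One caveat about this sketch specifically: in a purely multiplicative chain, zeroing every factor also zeroes every gradient, so a trainable version presumably leaves at least one factor non-zero (as LoRA does). The text above does not say which factor that is.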
The per-token AST features come from a tree-sitter-cpp pass. We parse each C++ source file, walk the AST to paint per-character arrays for depth, sibling index, and a node-type bucket, then downsample to token level by sampling at each token's first character. The node-type bucketing mirrors the Rust chunking side: ten coarse ranges for declarations, statements, expressions, types, literals, operators, parameters, scope qualifiers, and miscellany. Keeping the Python and Rust mappings bit-identical is a recurring maintenance tax, but it is the only way the same training graph sees the same integers regardless of which enrichment path produced the dataset.
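The downsampling step is simple enough to show in a few lines. This is a sketch under stated assumptions: the helper name, the toy depth values, and the token boundaries are made up for illustration, and the real pass reads character offsets from the tokenizer.

```python
def downsample_to_tokens(char_values, token_starts, default=0):
    """Sample a per-character feature array at each token's first character."""
    n = len(char_values)
    # Out-of-range starts fall back to a default instead of raising.
    return [char_values[s] if 0 <= s < n else default for s in token_starts]

# Toy input: "int f(){}" painted with hypothetical AST depths per character.
source = "int f(){}"
char_depth = [1, 1, 1, 0, 2, 3, 3, 4, 4]
# Hypothetical token boundaries: "int", "f", "()", "{}".
token_starts = [0, 4, 5, 7]

token_depth = downsample_to_tokens(char_depth, token_starts)
print(token_depth)  # [1, 2, 3, 4]
```

Sampling at the first character is a deliberate simplification: a token that straddles two AST nodes takes the label of the node it starts in.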
The enriched training schema carries the per-token structure IDs, dependency levels, AST depth, sibling index, and AST node-type buckets, plus chunk boundaries, chunk dependency levels, and edge lists for calls and type dependencies. The loader helpers are small but load-bearing: they coerce mixed array encodings into aligned sequences and keep training ingest from doing that work ad hoc. The same upstream signal quality depends on the build-aware extraction described in Compile commands and semantic graphs.
The attention-side piece is a relation-bias table with shape (num_relation_types, num_heads): nine relations times the head count, so only a few hundred parameters. The forward path takes a chunk-level relation mask (B, R, C, C) built from the edge lists, plus a (B, T) token-to-chunk mapping. It combines relation planes into a per-head chunk-level bias with an einsum, then promotes that to (B, H, T, T) for attention. Invalid tokens, meaning those outside any chunk, are masked out. The table starts at zero so existing checkpoints reload cleanly, and the bias is added to attention logits before softmax, just like ALiBi. The chunk-level intermediate matters: at roughly C = 64 and T = 4k, the memory cost is tiny compared with a full token-token relation tensor.
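The forward path described above can be sketched in NumPy as follows. The function name, the use of -1 to mark tokens outside any chunk, and the choice to zero the bias for invalid pairs are assumptions made for this example.

```python
import numpy as np

def relation_bias(table, rel_mask, token_to_chunk):
    """table: (R, H); rel_mask: (B, R, C, C); token_to_chunk: (B, T), -1 = no chunk.
    Returns an additive (B, H, T, T) bias for pre-softmax attention logits."""
    # Mix relation planes into a per-head chunk-level bias, laid out (B, C, C, H).
    chunk_bias = np.einsum("brij,rh->bijh", rel_mask.astype(table.dtype), table)
    valid = token_to_chunk >= 0                      # (B, T)
    idx = np.where(valid, token_to_chunk, 0)         # safe index for invalid tokens
    b = np.arange(idx.shape[0])[:, None, None]       # (B, 1, 1)
    # Promote chunk pairs to token pairs: query chunk on rows, key chunk on columns.
    token_bias = chunk_bias[b, idx[:, :, None], idx[:, None, :]]  # (B, T, T, H)
    # Zero out every pair that touches a token outside any chunk.
    pair_valid = valid[:, :, None] & valid[:, None, :]
    token_bias = token_bias * pair_valid[..., None]
    return token_bias.transpose(0, 3, 1, 2)          # (B, H, T, T)
```

The arithmetic behind the memory claim, reading "4k" as 4096: with R = 9 and C = 64 the chunk-level mask holds 9 × 64 × 64 = 36,864 entries per sequence, against 9 × 4096 × 4096 ≈ 151 million for a token-level relation tensor, a factor of 4096. This sketch still materializes the final (B, H, T, T) bias, which any additive logit bias needs in some form.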
The broader prototype surface also included a document-level platform-label input and a TreeFFN-style chunk-graph enrichment path. The checked-in platform embedding sample shows the broadcast document-label shape, and the structure graph enricher sample shows the pool -> update -> scatter contract. Both stayed experimental rather than becoming part of the default deployment path.
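For readers unfamiliar with the pool -> update -> scatter pattern, here is a generic NumPy sketch. It is not the checked-in sample: mean pooling, a single tanh update over one adjacency matrix, and a residual scatter are all assumptions chosen to keep the example short.

```python
import numpy as np

def pool_update_scatter(x, token_to_chunk, adjacency, w_update):
    """x: (T, D) token states; token_to_chunk: (T,) with -1 for no chunk;
    adjacency: (C, C), row i lists the chunks that send to chunk i; w_update: (D, D)."""
    num_chunks = adjacency.shape[0]
    valid = token_to_chunk >= 0
    chunk_ids = token_to_chunk[valid]
    # Pool: mean of token states per chunk (empty chunks stay at zero).
    sums = np.zeros((num_chunks, x.shape[1]), dtype=np.float64)
    np.add.at(sums, chunk_ids, x[valid])
    counts = np.bincount(chunk_ids, minlength=num_chunks)
    pooled = sums / np.maximum(counts, 1)[:, None]
    # Update: one round of message passing over chunk edges.
    updated = np.tanh((adjacency @ pooled) @ w_update)
    # Scatter: add each chunk's update back onto its own tokens.
    out = x.astype(np.float64)
    out[valid] += updated[chunk_ids]
    return out
```

Even in this toy form the cost profile is visible: the update runs on every forward step, which is the per-step overhead the graph path has to justify.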
How it lands in production
The deployed port is deliberately narrower than the prototype surface. MegaCpp keeps the input-level structure feature as a regular embedding-side addition: the same stacked lookup plus low-rank bottleneck, the same zero-init behavior, and the same notion of a small default component set versus a fuller opt-in set. The checked-in structure embedding contract sample shows the validation seam that normalizes extra structure inputs before they mix into the main token embeddings.
The ingest seam stays simple: carry the five aligned structure planes from the enriched batch into the embedding path without renaming or recomputing them in the hot loop. The checked-in structure embedding components sample and data index show the exact token-level fields that survive this boundary.
Three things from the earlier implementation did not cross the boundary:
- Chunk-pair relation bias as a default attention feature. We removed the full relation-bias path after later ablations stopped paying for it and token-compacted attention paths made the integration brittle.
- Tree-style chunk-graph enrichment. It worked in ablations, but once enriched data plus structure embeddings plus ngram hash were enabled, the extra gain was smaller than the cost of carrying message passing through every step.
- Document-level platform labels. They did not earn a permanent place on the production corpus.
What does carry forward is the input-level structure embedding with the stacked bottleneck, gated by a feature flag and defaulting to the "core" component set (structure + dep_level). MegaCpp injects that additive embedding alongside the token embedding path. It stays a regular model feature rather than becoming a custom kernel path because the cost is already low.
Ablations and what we kept
The two structure-aware features split cleanly on whether they survived ablation:
| Feature | Module | Ports to MegaCpp | Default in prod | Why |
|---|---|---|---|---|
| Input-level structure embedding (core: structure + dep_level) | Structure inputs, embedding seam | yes | on | Largest single win in the enriched-data table |
| Stacked single-lookup bottleneck (dim=64) | Structure inputs | yes | on | Cuts param count and ~12 kernel launches/step |
| Tree-style chunk graph enricher | Graph enricher sample | experimental | off | Marginal once enriched data + ngram hash are on |
| Relation bias (chunk pair -> per-head logit add) | Relation sample | no | off | Marginal in ablations, brittle under token compaction |
| Document-level platform labels | Platform embedding sample | no | n/a | Production corpus did not justify the extra parameters |
We ran the structure-aware features across three overlapping experiments: a no-enrichment baseline, a structure-core rung, and a full stack (structure + tree_ffn + relation_bias + ngram hash). The enriched data is consistently the largest single win in the training-throughput and loss tables, and most of that win comes from the input-level embeddings, not the graph or the attention bias. Two concrete observations from the ablation record shaped the port:
- The single-lookup stacked embedding with a 64-dim bottleneck cut structure-embedding parameter count several-fold and removed roughly a dozen kernel launches per step. Earlier versions used separate embeddings per component, a softmax over component weights, and a mask for absent components. That older weighting path also had an accidental fp32 allocation that hurt bf16 throughput. Both are gone.
- The chunk-graph update path went through several rewrites to cut down quadratic work. The final version works, but it remained too expensive for the value it added in production settings.
The broader lesson was smaller than the prototype menu. Cheap aligned input planes survived, but graph-time enrichment and relation bias did not. Once token compaction or other sequence surgery rewrites physical token order, a fixed chunk-pair logit bias drifts unless logical anchors survive the rewrite; the input-side structure planes are much more robust. That is why this article stays paired with Compile commands and semantic graphs, Tokenized enriched packed rows on TPU, and Packed rows as the real training contract: more of the durable structure story lives in preprocessing, packing, and retrieval than in a per-step runtime graph block.
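One way to picture the drift is with a toy compaction step. The keep indices and feature values below are invented, and real compaction is more involved, but the alignment problem has the same shape.

```python
def compact(values, keep):
    """Gather a per-token sequence at the kept physical positions."""
    return [values[i] for i in keep]

token_to_chunk = [0, 0, 1, 1, 2, 2]
structure_ids = [3, 3, 5, 5, 7, 7]
keep = [0, 2, 3, 5]  # hypothetical compaction keeps these positions

# Per-token planes stay aligned as long as they are gathered with `keep`.
new_structure = compact(structure_ids, keep)   # [3, 5, 5, 7]
new_chunks = compact(token_to_chunk, keep)     # [0, 1, 1, 2]

# A bias built from the pre-compaction map is indexed by stale positions.
stale_chunks = token_to_chunk[: len(keep)]     # [0, 0, 1, 1]
drifted = [i for i, (a, b) in enumerate(zip(new_chunks, stale_chunks)) if a != b]
print(drifted)  # [1, 3]
```

Input-side planes avoid the problem because they are already folded into the hidden state before any rewrite happens. A chunk-pair bias has to carry its token-to-chunk map through every rewrite and rebuild the (T, T) bias afterwards.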
Deployment checklist
The minimum public-safe config surface looks like this:
structure_features:
  enabled: true
  active_components: core # structure + dep_level
  bottleneck_dim: 64
  relation_bias: false
  tree_graph: false
- Build this through validated configuration entry points rather than ad hoc objects.
- The dataloader should emit all five aligned token-level planes (structure_ids, dep_levels, ast_depth, sibling_index, ast_node_type) even when only the core components are active.
- Node-type bucketing must stay consistent across preprocessing and training.
- Keep the structure embedding zero-initialised when attaching it to new or converted checkpoints. Accidentally non-zero values at step zero shift the loss curve and make ablation results non-comparable.
- Relation-bias and tree-graph paths should default to false in production configs.
- The chunk-level bias path, if ever re-enabled, should not be combined with token compaction in the same layer without re-validation.
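A validated entry point for the config surface above could look like the following sketch. The class name, the "all" component-set label, and the specific checks are hypothetical; only the field names and defaults come from the config shown in this checklist.

```python
from dataclasses import dataclass, fields

# Hypothetical named component sets; "core" matches the documented default.
COMPONENT_SETS = {
    "core": ("structure_ids", "dep_levels"),
    "all": ("structure_ids", "dep_levels", "ast_depth", "sibling_index", "ast_node_type"),
}

@dataclass(frozen=True)
class StructureFeatureConfig:
    enabled: bool = True
    active_components: str = "core"
    bottleneck_dim: int = 64
    relation_bias: bool = False
    tree_graph: bool = False

    def __post_init__(self):
        # Reject values the embedding path could not interpret.
        if self.active_components not in COMPONENT_SETS:
            raise ValueError(f"unknown component set: {self.active_components!r}")
        if self.bottleneck_dim <= 0:
            raise ValueError("bottleneck_dim must be positive")

    @classmethod
    def from_dict(cls, raw):
        """Build from a parsed config mapping, rejecting unknown keys."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"unknown config keys: {sorted(unknown)}")
        return cls(**raw)

cfg = StructureFeatureConfig.from_dict({"active_components": "core", "bottleneck_dim": 64})
```

Rejecting unknown keys at load time means a misspelled flag fails loudly before training starts, which matters when ablation results are compared across runs.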
Frequently asked questions
Why did structure embeddings survive while relation bias did not?
Is relation bias just ALiBi for code chunks?
Why materialize structure before token packing?
Why not feed the whole AST into the model?