MegaCpp EngineeringApplied C++ model systems

</>

Article

Grounded engineering note from the MegaCpp stack

Published April 18, 202611 min readDavid Gornshtein

Lora

QLoRA

Adapters

PEFT

Inference

Specialists

The adapter stack: how LoRA, QLoRA, and hot-swap compose MegaCpp specialists

The LoRA, QLoRA, DoRA, VeRA, and DyLoRA family behind MegaCpp specialists, the registry and lifecycle that turn adapters into versioned releases, the hot-swap runtime, and the inference-facing API they power.

By David Gornshtein

MegaCpp

Focused on applied C++ model engineering

Article Preview

The adapter stack: how LoRA, QLoRA, and hot-swap compose MegaCpp specialists

Published April 18, 2026•11 min read•David Gornshtein

MegaCpp ships as one base model plus a fleet of small specialists - a project-specific adapter for the codebase you're editing, a language-specific one for the dialect of C++, an online adapter that learns from compiler feedback. The thing that makes that fleet tractable to build, version, ship, and serve is the adapter stack behind the product. This post is a tour of how that stack actually works: what compose differs from merge, what the registry guarantees, how hot-swap works at serving time, and what crosses into the serving API.

That adapter lifecycle sits directly under the specialist routing view of the product and the inference serving stack that activates one adapter or blend per request.

For first touch, LoRA means the low-rank delta pair added around an existing linear layer, QLoRA means training those deltas against a quantized base model, and adapter stack means the runtime and artifact contract around those deltas rather than only the math in one paper. compose means live arithmetic over adapter deltas without committing a new base checkpoint, merge means folding those deltas into one committed artifact, and hot-swap means activating one adapter or blend on a live model without rebuilding the base. The quickest checked-in anchor is the adapter runtime sample, with the distributed LoRA sync sample for the late-injection seam and the hybrid example catalog as the public starter map.

Why MegaCpp cares about this

Economics. A full fine-tune of the C++ specialist for one customer codebase costs as much as the original training run. A LoRA adapter at rank 4-8 over MLP and Mamba projections is a few tens of MB of trainable weights, fits on a single modern GPU, trains in coffee-break time, and at serving costs zero inference overhead because it merges into the base. QLoRA pushes the base into NF4 4-bit so the same GPU can adapt a much larger checkpoint. DyLoRA gives us one adapter that works across multiple ranks. VeRA shares random projection bases across layers and only learns diagonal scalings, crashing per-adapter parameter count by another order of magnitude. We need every variant because the answer to "which recipe is right" is a matrix of (specialist size, base size, target hardware), not a single column. The serving-side continuation is modal image and cold start when the main question becomes packaging and activation latency rather than adapter math.

How the stack is structured

The load-bearing adapter module defines the core LoRA variants and the injection walker. It defines LoRALinear, DoRALinear, VeRALinear, QLoRALinear, DyLoRALinear, and the Megatron-tensor-parallel-aware variant MegatronLoRALinear, plus the inject_lora() walker that swaps nn.Linear modules in place by name suffix. The default target set (DEFAULT_TARGETS = MLP_TARGETS | MAMBA_TARGETS) covers c_gate_fc, c_fc, c_proj, in_proj, out_proj - MLP and Mamba projections only. Q/K/V are deliberately excluded by default because adapting them invalidates the shared KV cache during online adaptation. ATTENTION_TARGETS = {"c_qkv"} exists for full-rank fine-tuning where cache-validity isn't a concern. inject_lora() detects whether a layer is already wrapped in a Megatron column- or row-parallel TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding wrapper and constructs the right adapter variant: column-parallel adapters shard B along the output dim and replicate A; row-parallel adapters shard A along the input dim and replicate B. There are two TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding backends: XLA/TPU via apply_lora_spmd_sharding(), and CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200 DTensorQuick term guideDTensorPyTorch's mesh-backed distributed-tensor abstraction: one logical tensor with explicit shard or replica metadata across ranks.GroundingAbout: EP / PP / TP / CP / SP / DP overview Example: 3D parallelism sample Reference: FSDP2 on XLA TPU via apply_lora_dtensor_tp(). merge_lora() folds adapter deltas into the base weights for zero-overhead inference; unmerge_lora() reverses the operation by snapshotting the delta at merge time so subsequent A/B mutations don't corrupt the base.

The 4-bit quantization layer is the canonical home of the key primitives: NF4_LEVELS and FP4_LEVELS codebooks, the NF4Tensor dataclass, quantize_to_nf4 / dequantize_nf4, double-quantization of the absmax scales themselves down to FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper E4M3, and QuantizedLinear for inference-only 4-bit layers. The QLoRALinear symbol is re-exported from the core adapter module for backward compat but the primitives live here. NF4 places its 16 levels at quantiles of N(0,1), which minimizes expected quantization error for normally-distributed neural-network weights; FP4 places them uniformly in [-1, 1]. We use NF4 by default because the empirical accuracy gap is real and the implementation cost is the same.

The compose and merge layers are the two halves of the arithmetic. Composition (compose_adapters, add_adapters, subtract_adapters, interpolate_adapters) is task-vector arithmetic in the spirit of [Editing Models with Task Arithmetic - Ilharco et al., ICLR 2023]. The non-obvious part is that for paired lora_A/lora_B keys the composition has to happen in delta space - (w1*B1 + w2*B2) @ (w1*A1 + w2*A2) != w1*(B1@A1) + w2*(B2@A2) - so we form the weighted sum of DeltaW = B @ A, then do a truncated SVD back to the target rank to recover new A, B factors. Standalone keys such as DoRA magnitude get a simple weighted average. There are explicit guards against mixing variants (use_dora, use_vera, use_qlora, use_dylora) because the forward passes are not interchangeable, and against mixing different alpha/rank scalings because the scaling is baked into the forward, not the stored weights. Merging is the higher-level operation that turns multiple adapters into one: merge_adapters_average() does exact delta-space averaging, merge_adapters_ties() is [TIES-Merging - Yadav et al., NeurIPS 2023], and merge_adapters_dare() is [DARE - Yu et al., ICML 2024].

The adapter lifecycle layer handles create_adapter, save_adapter, load_adapter, merge_adapter, plus the canonical artifact format. The artifact carries state_dict, lora_meta (rank, alpha, variant, targets, QLoRA quantization metadata, DyLoRA rank-growth metadata), and a schema version. A fail-closed gate rejects PEFT-style "LoRA over a separately quantized base" artifacts masquerading as native QLoRA: it checks use_qlora against the presence of qlora_quant_metadata and refuses any artifact that claims native QLoRA without carrying it. This prevents the most common cross-toolchain bug - a generic LoRA artifact that loads cleanly and produces silently wrong outputs.

The adapter registry is the global index. Each version gets an auto-incremented per-project tag, a parent ID for lineage, a status (draft -> active -> archived -> deprecated), a metrics blob, tags, training config, and a compatibility checker that verifies recorded targets, rank, and variant match the loaded model. The registry is a single JSON index alongside per-version adapter directories, with a report exporter and an A/B comparison helper. That versioned-surface mindset lines up with checkpoint format and resume: the artifact is only useful if the reader can tell exactly which contract produced it.

The runtime hot-swap surface has three layers: AdapterBank (low-level, holds named state dicts and copies weights into LoRA params), MultiAdapterManager (wraps a base model, handles a CPU cache, lazy-injects LoRA on first register, exposes activate, activate_merged([names], weights=[...]), deactivate), and AdapterRouter (per request: parse a routing prefix, activate, generate, cache last-used). The bank validates rank compatibility on every activation, validates compose support against runtime metadata for blended activations, and zeroes LoRA params on deactivate so the base model returns clean. BatchAdapterRouter handles the case where batch items want different adapters. In practice that routing layer has to coexist with the same per-request serving envelope described in inference serving stack and specialist routing, not with ad hoc prompt parsing.

The TIES and DARE merge paths are intentionally not delta-space exact - they sparsify on the raw A/B factors, which is what their original formulation does. We documented this distinction explicitly because merging (B1, A1) and (B2, A2) after sparsifying is mathematically a different operation from sparsifying (B@A) and re-factoring. For exact delta-space merging the user must pick merge_adapters_average().

The online adaptation layer is the inference-time adaptation loop. OnlineAdapter keeps a ReplayBuffer of (input_ids, reward) pairs from the live serving stream (reward = compiler exit code or unit-test pass rate) and runs periodic LoRA-only steps every N samples. The "MLP and Mamba only" target set is enforced because the KV cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack is shared across requests and adapting Q/K/V invalidates it. DCDAdapter is the no-compiler fallback: it distills the model's behavior on a long-context "teacher" view into a shorter "student" view, giving project-specific adaptation when the only signal is the project's own source tree. We run both for the C++ specialist because they answer different questions.

How it lands in MegaCpp

The MegaCpp serving stack consumes adapters through the canonical artifact contract from the adapter-management layer. save_adapter_artifact and load_adapter_artifact are the only entry points production touches. The format is stable across schema versions (currently ADAPTER_META_SCHEMA_VERSION = 2, ADAPTER_ARTIFACT_SCHEMA_VERSION = 1), and the production loader runs the same fail-closed validation. This lets us train an adapter in one environment and serve it in MegaCpp without a re-export step.

Production hot-swap is conceptually MultiAdapterManager minus the experiment-tracking: a per-replica CPU-pinned adapter cache with LRU eviction and one CUDAQuick term guideCUDANVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.GroundingAbout: XLA vs CUDA stack decisions History: GB10 tensor-path proof summary Reference: training on 8x H200-side activation slot that copies weights into LoRA params on swap. Per-request routing carries an explicit adapter id in the envelope (the gateway sets it; we don't parse from the prompt in production). Blended activations are supported but rare - the typical request hits one adapter, served at zero inference overhead because hot adapters are merged into the base and unmerged-and-restored on swap.

QLoRA in production is what serves the "large base on small hardware" lane: the base weights are NF4-packed in CPU memory, materialized to bf16 GPU on demand, with double-quantized FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper absmax scales. The quantization metadata travels with the adapter artifact so the loader can verify base-model compatibility before activation. The Megatron-TPQuick term guideTPTensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.GroundingAbout: parallelism map overview Example: TP partition-shape sample Reference: tensor parallel and sharding variants (MegatronLoRALinear, apply_lora_dtensor_tp) are what handle the sharded-base serving case where the model is split across multiple GPUs.

Deliberately not lifted: VeRA's shared-buffer dedup is a training-stack implementation detail; MegaCpp only sees the canonical artifact. DyLoRA rank-growing is training-time. The compose and merge utilities live in the training stack and produce artifacts; production only consumes them.

Ablations and what we kept

LoRA family ablations were less about loss and more about maintenance cost. The canonical recipe - rank 4-8 on MLP_TARGETS | MAMBA_TARGETS, alpha = 2 * rank, NF4 base for QLoRA, AdamW on LoRA params only - survived everything. DoRA's magnitude-vector parameterization gives small literature wins; we kept the implementation but it is not the default because the divergent forward pass and the unmerge logic (which had a P0 inversion bug we fixed) outweighed the consistent-but-small win at our shape.

VeRA is in the tree for the case where adapter footprint matters more than per-task accuracy - shared A_shared, B_shared mean every per-layer adapter is two diagonals of size rank rather than two rank x dim matrices. The shared-buffer plumbing settled after a sequence of P1 fixes (dedup on save, rebind on resume, EPQuick term guideEPExpert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.GroundingAbout: parallelism map overview Example: expert-parallel routing sample Reference: expert parallel and MoE sharding-aware atomic preflight); the engineering notes around _rebind_vera_shared_buffers() and the assign=True load path document the painful version.

DyLoRA serves multiple ranks at inference (rank-truncating A/B at request time). It composes badly with TIES/DARE, which is why those merge functions reject DyLoRA inputs. The unmerge bug was instructive: original unmerge() recomputed the delta from current A/B instead of snapshotting at merge time, so any A/B mutation between merge and unmerge corrupted the base weights. Snapshot in merge(), use the snapshot in unmerge() - now the canonical pattern across variants.

QLoRA is the most-used variant in serving by request count. The compression ratio (roughly 4x for base weights, including the FP8Quick term guideFP8Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.GroundingAbout: precision recipe: FP16, BF16, FP8, NVFP4 History: FP8 rollout notes Reference: Megatron FLCE on Hopper-quantized scales) is what makes single-GPU serving of the large base model possible. We did not see meaningful accuracy degradation at rank 8 or above; below rank 8 the quantization noise starts to bite. On the broader hardware-policy side this is the same trade studied in precision recipe: FP16, BF16, FP8, NVFP4.

Compose-vs-merge took longest to get right. Composition is the live in-RAM op: 0.8 * A + 0.3 * B - 0.2 * C to make a new specialist on the fly. Merging is offline batch: take a list of adapters and produce one committed artifact. Variant-compatibility check, alpha/rank scaling check, and delta-space SVD are the three things we got wrong before we got them right. Online adaptation went through one simplification: the original NTP/DCD/HYBRID split collapsed HYBRID to "run both" rather than alternating, because the alternation schedule was a hyperparameter we never picked correctly.

Production checklist

The default LoRA target set must remain MLP_TARGETS | MAMBA_TARGETS. Any expansion to ATTENTION_TARGETS invalidates the KV cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack and breaks online adaptation; require an explicit opt-in.
The adapter artifact contract is the single boundary between research and serving. Both sides must validate lora_meta against the runtime model contract before activation.
The native-QLoRA gate that rejects PEFT-style artifacts without qlora_quant_metadata is fail-closed and must stay that way; a silent acceptance produces wrong outputs that no eval will catch.
Compose validates (use_dora, use_vera, use_qlora, use_dylora) signatures and alpha/rank scaling. Both validations must run before any tensor arithmetic; do not relax them for a "convenience" path.
DyLoRA, VeRA, and the merged-with-base case must not be passed to TIES/DARE merging - those algorithms operate on raw factors and the compositions are mathematically incorrect.
VeRA shared buffers are stored at the model root (_vera_shared_A, _vera_shared_B) and re-bound on every full-checkpoint load with assign=True. The rebinding hook must remain in the load path.
Hot-swap activation must zero the LoRA params on deactivate so the base model is bit-exact recoverable. Do not skip this on the "deactivate to base" fast path.
Online adaptation only updates LoRA params on the MLP and Mamba targets. Adapting Q/K/V invalidates the shared KV cacheQuick term guideKV CacheThe stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.GroundingAbout: KV cache and paged attention Example: Dense FA4 KV-cache decode sample Reference: inference serving stack and silently corrupts cross-request inference.
Per-replica adapter cache is bounded; the eviction policy is LRU; the on-swap path snapshots the merged delta so unmerge is bit-exact.
The compose path uses delta-space SVD to recover A/B factors at the original rank. Do not "optimize" by averaging the factors directly.

Adapter family snapshot

Variant	Where it wins	Trade-off
LoRA	default adapter rank 8-32	cheapest, baseline quality
QLoRA	memory-bound fine-tune	extra dequant on forward
DoRA	small-rank tasks needing expressiveness	~10-15% slower
VeRA	many adapters per host	shared random basis, per-task vectors
DyLoRA	rank unknown at train time	samples rank per step

# hot-swap at serve time, adapter IDs resolved by registry
from adapters import registry, hot_swap
adapter = registry.resolve(project="llvm-project", dialect="cpp20")
hot_swap(model, adapter, merge=False)

FAQ

Frequently asked questions

When do we merge an adapter instead of hot-swapping it?+

Merge when one specialist will stay active for a long serving window. Hot-swap is the better fit for request-by-request routing in the inference serving stack and the product-facing specialist routing surface. The local runtime contract is kept small on purpose in adapter runtime sample, which is the fastest checked-in place to see which metadata hot-swap and composition depend on.

What does compose mean that merge does not?+

Compose is the live arithmetic path: keep adapters separate, validate their metadata, and build a temporary delta-space result for one request or one short experiment window. Merge is the offline commit path: produce one artifact that can be versioned and served like any other adapter release. The compact checked-in decoder is the composition_rule in adapter runtime sample, with the hybrid example catalog as the map for the rest of the runtime surface.

Which public example should a first-touch reader open first?+

Start with the hybrid example catalog, then open the adapter runtime sample for the artifact and runtime-metadata contract. If the distributed wrap order is the question, continue to the distributed LoRA sync sample. That order keeps the definitions small before the serving and distributed details arrive.

Glossary

Terms used in this article

Start here for quick definitions, then follow the linked posts for deeper context.

DTensor

PyTorch's mesh-backed distributed-tensor abstraction: one logical tensor with explicit shard or replica metadata across ranks.

Grounding

Tensor parallelism splits each linear's weights (QKV, O, MLP gate/up/down) across GPUs. On 8× H200 with TP=8 each GPU owns 1/8 of every matmul's columns or rows, so one big matmul becomes 8 smaller ones that all-reduce at the layer boundary. Cost: one all-reduce per attention and per MLP — heavy bandwidth, so TP is usually bound to a single NVLink/NVSwitch island (1 node of up to 8 GPUs). Embeddings, layernorms, and optimizer state stay replicated across the TP GPUs. Use TP when a single layer's weights don't fit on one GPU, not to scale past one node.

Grounding

Expert parallelism partitions MoE experts across GPUs — 64 experts on 8× H200 with EP=8 means each GPU owns the full weights of 8 experts. Each token routes to its chosen expert via all-to-all (to the GPU holding that expert), the FFN runs there, then all-to-all sends outputs back. Cost: two all-to-alls per MoE layer plus load imbalance when hot experts overload their owner. Attention, embeddings, and shared dense weights stay replicated across the EP dimension. Use EP when expert weights dominate total model size.

Grounding

KV Cache

The stored attention keys and values from earlier tokens so decode can reuse prior context instead of recomputing the full prefix every step.

Grounding

NVFP4

NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.

Grounding

FP8

Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.

Grounding

CUDA

NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.

Grounding

Topic hubs

Topic Hub

MLA Integration, Dispatch, and Weight Absorption

A curated MLA reading path: the weight-absorption contract, Megatron-safe integration boundaries, dispatch and FP8 edges, and the adapter surfaces that keep MLA connected to the rest of the stack.

David Gornshtein • MegaCppMore posts →

The adapter stack: how LoRA, QLoRA, and hot-swap compose MegaCpp specialists

Why MegaCpp cares about this

How the stack is structured

How it lands in MegaCpp

Ablations and what we kept

Production checklist

Adapter family snapshot

Read next

References

Frequently asked questions

Terms used in this article

Continue with a curated reading path

MLA Integration, Dispatch, and Weight Absorption