MLA weight absorption: what we kept and what we dropped for the C++ specialists
Multi-Head Latent Attention in production: why DeepSeek's absorbed decode path is the right choice for KV cache, why it is the wrong choice for training, and how the C++ specialist ensemble uses both.
The core architectural readback of what MLA changes: projection layout, KV handling, and the weight-absorption contract.
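As a minimal sketch of the weight-absorption contract named above: for a single head, scoring a query against up-projected keys gives the same result as folding the key up-projection into the query and scoring against the cached latents directly. The sizes and the names `W_UK`, `latents`, and `q_absorbed` are hypothetical and chosen for illustration. The sketch leaves out the decoupled RoPE component and the value-side absorption, so it covers only part of the decode path.

```python
import random

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
d_head, d_latent, n_tokens = 4, 3, 5  # hypothetical toy sizes

# Key up-projection for one head, shape (d_head, d_latent).
W_UK = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(d_head)]
# One decode-step query for that head, shape (d_head,).
q = [random.gauss(0, 1) for _ in range(d_head)]
# Cached compressed latents, one per past token, shape (n_tokens, d_latent).
latents = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(n_tokens)]

# Naive path: expand every cached latent into a full key, then score.
naive_scores = [dot(q, matvec(W_UK, c)) for c in latents]

# Absorbed path: fold W_UK into the query once, then score against latents.
q_absorbed = matvec(transpose(W_UK), q)  # shape (d_latent,)
absorbed_scores = [dot(q_absorbed, c) for c in latents]

# The two paths agree up to floating-point rounding.
max_abs_diff = max(abs(a - b) for a, b in zip(naive_scores, absorbed_scores))
```

In the absorbed path the cache holds only the latents, and each decode step pays for one extra query transform instead of one key expansion per cached token.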