Engineering Blog

Technical deep dives from the Datasunrise OÜ team. No marketing fluff—just honest takes on training infrastructure, vector databases, code search, and what it actually takes to build AI for C++ engineers.

SLM Ensemble Architecture: 8 Specialists, One Coherent System
SLM Ensemble
Architecture

How our ensemble of 8 specialist C++ models (4B-8B params, 0.8B-1.6B active) is organized. Hybrid Mamba 3 + Transformer stack, expert routing, and why this beats a 70B generalist for real C++ work.

David Gornshtein · 14 min read
Read Article
Training the SLM Ensemble: Muon, FP16, and 100-200B Tokens per Specialist
Training
Muon

The training recipe grounded in our nanochat POC and cppmega stack: Muon optimizer, FP16 training, curriculum, and what survived contact with real GPUs.

David Gornshtein · 14 min read
Read Article
The C++ Corpus: Data Curation for 8 Specialists
Data
C++

Corpus curriculum, tokenization, document masking. Why 'just train on GitHub' is not an answer for C++ and what the data pipeline actually looks like.

David Gornshtein · 15 min read
Read Article
Mamba 3 + Transformers: The Hybrid Stack We Actually Ship
Architecture
Mamba 3

PSIV cache, MIMO, register split, and the layer-interleaving choices behind the MegaCpp models. Why the hybrid beats pure attention on long C++ contexts.

David Gornshtein · 15 min read
Read Article
NVFP4 Inference on Blackwell and GB10
Inference
NVFP4

How FP16/BF16 training maps to NVFP4 at inference: kernel selection, CUTLASS block-scaled MMA, the GB10 vs B200 capability split, and the pins that make it reproducible.

David Gornshtein · 15 min read
Read Article
Meet the 8 Specialists
SLM Ensemble
Specialists

Profiles of the 8 models in the MegaCpp ensemble: algorithms, templates, memory, concurrency, systems, build, debug, STL. What each is good at and what each is not for.

David Gornshtein · 9 min read
Read Article
How We Evaluate: Four-Axis Eval Beats Vibes
Evaluation
Benchmarks

Compile rate, context adherence, hallucination, and correctness. The harness, the benchmarks, and the cost-per-quality argument for the ensemble vs 70B generalists.

David Gornshtein · 14 min read
Read Article
GB10 War Stories: sm_121a, NaNs, and Honest Numbers
Infrastructure
GB10

What it actually took to get clean training on GB10: ptxas, Liger graph breaks, NVFP4 crashes, a NaN bisect that exonerated the code, and the three pinning rules we would keep.

David Gornshtein · 15 min read
Read Article
The Latent Bridge: How Our 8 SLMs Talk Without Words
Architecture
Multi-Agent
Research

Why we ditched token-based inter-model communication for direct semantic vector channels. Inspired by recent research on cross-model latent bridges, our 8 specialist SLMs now share meaning at latent speed.

David Gornshtein · 12 min read
Read Article
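The core mechanism behind the teaser above, sketched at toy scale: a learned linear map carries one specialist's hidden state directly into another model's latent space, skipping the detokenize/retokenize round-trip. Everything here (dimensions, random initialization, function names) is illustrative, not our production bridge:

```python
import math
import random

def make_bridge(d_src: int, d_dst: int, seed: int = 0):
    """A linear map from one model's hidden space to another's.

    In production this would be a trained projection; here it is a
    randomly initialized placeholder just to show the data flow.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_src)
    return [[rng.uniform(-scale, scale) for _ in range(d_src)]
            for _ in range(d_dst)]

def project(bridge, hidden):
    """Send a hidden-state vector across the bridge (matrix-vector product)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in bridge]

# Specialist A emits a hidden state; specialist B consumes it directly,
# with no token round-trip in between.
d_a, d_b = 8, 6          # toy widths; real hidden sizes are in the thousands
bridge = make_bridge(d_a, d_b)
state_from_a = [0.1 * i for i in range(d_a)]
state_for_b = project(bridge, state_from_a)
```

The point of the sketch is the interface, not the math: the receiving model ingests a vector in its own latent space instead of re-reading generated text.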
Nemotron Nano 3: The Holy Trinity of Efficiency (Mamba + MoE + GQA)
Architecture
Hybrid Models
NVIDIA

How NVIDIA combined Mamba-2 state spaces, sparse MoE, and GQA attention to create a 31.6B model that activates only 3.2B parameters. The architecture that inspired our SLM ensemble.

David Gornshtein · 15 min read
Read Article
Implementing Mamba 3 in Production: A Practitioner's Guide
Implementation
Mamba 3
Production

Everything we learned deploying Mamba 3 state-space models for C++ code generation: hardware requirements, Tensor Engine integration, memory efficiency, long-context handling, and real performance benchmarks.

David Gornshtein · 18 min read
Read Article
Mamba 3: The State Space Revolution (And Why We Use It)
State Space Models
Mamba 3
Architecture

A first-principles journey through Mamba 3's innovations: trapezoidal discretization, complex dynamics via RoPE, and MIMO formulation. From mathematical intuition to real-world performance.

David Gornshtein · 16 min read
Read Article
The 4-Bit Miracle: How NVFP4 Squeezes 16-Bit Intelligence into 4-Bit Memory
Quantization
NVFP4
Blackwell

NVFP4 achieves the impossible: 3.5x memory reduction with less than 1% accuracy loss. Learn how dual-level scaling enables running 7 specialist SLMs on a single GPU.

David Gornshtein · 12 min read
Read Article
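The dual-level scaling the teaser refers to can be emulated in a few lines: a tensor-wide scale plus a per-block scale that maps each 16-value block onto the FP4 (E2M1) grid. This is a simplified sketch; real NVFP4 stores the block scale in FP8 and packs the weights as 4-bit codes:

```python
import math

# E2M1 (FP4) representable magnitudes; real NVFP4 packs these in 4 bits.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, tensor_scale):
    """Two-level scaling: a per-block scale on top of a tensor-wide scale.

    Simplified emulation: the per-block scale here is a plain float,
    whereas real NVFP4 stores it in FP8 (E4M3).
    """
    amax = max(abs(x) for x in block) or 1.0
    block_scale = amax / (6.0 * tensor_scale)   # map the block max to FP4 max
    q = []
    for x in block:
        v = abs(x) / (block_scale * tensor_scale)
        nearest = min(FP4_GRID, key=lambda g: abs(g - v))
        q.append(math.copysign(nearest, x))
    return q, block_scale

def dequantize_block(q, block_scale, tensor_scale):
    return [v * block_scale * tensor_scale for v in q]

# One 16-value block of illustrative weights.
data = [0.02, -0.31, 0.50, 0.12, -0.44, 0.07, 0.29, -0.18,
        0.33, -0.05, 0.41, 0.26, -0.37, 0.09, 0.15, -0.48]
tensor_scale = 1.0
q, bscale = quantize_block(data, tensor_scale)
restored = dequantize_block(q, bscale, tensor_scale)
```

Because each 16-value block gets its own scale, the coarse 8-level grid tracks local magnitudes far better than a single tensor-wide scale could.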
Training at 4-Bit: The Research That Broke the Rules
Training
NVFP4
Mixed Precision

NVIDIA demonstrated the impossible: training a 12B model on 10 trillion tokens at 4-bit precision. How mixed-precision strategies and the Muon optimizer enable FP4 training for our SLM ensemble.

David Gornshtein · 15 min read
Read Article
Building Production SLMs with NVFP4: A Practical Guide
Production
NVFP4
Deployment

Hands-on guide to quantizing and deploying specialized language models with NVFP4. PTQ workflows, QAT strategies, and how we fit 7 models (50B params) in 28GB.

David Gornshtein · 18 min read
Read Article
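The "7 models (50B params) in 28GB" figure follows from simple arithmetic, assuming the commonly described NVFP4 layout of 4 bits per weight plus one FP8 scale byte per 16-weight block:

```python
def nvfp4_footprint_gb(n_params: float, block_size: int = 16) -> float:
    """Approximate NVFP4 storage: 4 bits per weight plus one FP8
    scale byte per block of `block_size` weights (the tensor-level
    scales are negligible)."""
    weight_bytes = n_params * 0.5        # 4 bits = 0.5 bytes per weight
    scale_bytes = n_params / block_size  # 1 byte per 16-weight block
    return (weight_bytes + scale_bytes) / 1e9

# 7 specialists, ~50B parameters combined.
total = nvfp4_footprint_gb(50e9)
```

50e9 weights at 0.5 bytes is 25 GB; the block scales add roughly 3.1 GB more, landing at about 28 GB.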
The Magnificent Eight: Why We Built 8 Tiny Models Instead of 1 Big One
SLM Ensemble
Architecture
MoE

Specialist vs generalist models: why 8 models with 4B-8B parameters each (0.8B-1.6B active) outperform single 70B+ models for C++ engineering. MoE architecture deep dive.

David Gornshtein · 12 min read
Read Article
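A rough sense of the compute side of the specialist-vs-generalist argument, using the standard ~2 FLOPs per active parameter per generated token rule of thumb (routing and attention overheads ignored; parameter counts taken from the teaser above):

```python
def flops_per_token(active_params: float) -> float:
    """Rule of thumb: a decoder forward pass costs roughly
    2 FLOPs per active parameter per generated token."""
    return 2.0 * active_params

specialist = flops_per_token(1.6e9)   # largest specialist, active params
generalist = flops_per_token(70e9)    # dense 70B baseline
ratio = generalist / specialist
```

By this estimate, a 1.6B-active specialist runs its forward pass roughly 44x cheaper per token than a dense 70B model, which is the cost side of the ensemble argument.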
Mamba Meets Transformers: Our Hybrid Architecture That Shouldn't Work But Does
Architecture
Mamba 3
Transformers

Combining Mamba 3 state-space models with Transformer Engine layers and regular attention. A technical deep dive into a hybrid architecture that defies conventional wisdom.

David Gornshtein · 14 min read
Read Article
The Honest Truth About Training AI on GB10: Our Grace Blackwell Journey
Training Infrastructure
GB10 / DGX Spark
Grace Blackwell

Why we use NVIDIA GB10 (DGX Spark) clusters: 128GB of unified LPDDR5X memory, 1 PFLOP of FP4 performance, and the economics of desktop AI supercomputers vs. cloud.

David Gornshtein · 12 min read
Read Article
A Million Steps Without Falling: What We Learned About AI Agent Orchestration
AI Agents
Orchestration

An agent with a 0.1% error rate sounds great until step 1,000, when you realize you're debugging garbage.

David Gornshtein · 14 min read
Read Article
Why We Train Small Models That Actually Understand C++
SLM Ensemble
C++

GPT-4 can write hello world in any language. Our models know why your template metaprogramming is broken. Learn how 4B-8B specialist models outperform 70B+ generalists.

David Gornshtein · 15 min read
Read Article
The Algorithm Whisperer: Our Newest SLM Specialist
SLM Ensemble
Algorithms
Pseudocode

Meet Algorithm SLM (Algo-7B): our 8th specialist model trained on algorithm design, pseudocode interpretation, and complexity analysis. Why pseudocode understanding matters for C++ development.

David Gornshtein · 12 min read
Read Article
Attention Validity and Structure-Aware Attention
Attention Validity
Packed Rows

A packed-row validity regression, the clustered-sparse follow-up it forced, and the structure-aware attention plan we are integrating into the MegaCpp training stack.

David Gornshtein · 10 min read
Read Article
Building the C++ Training Data Pipeline: What Worked, What Broke
Data
Pipeline

An honest walkthrough of how the MegaCpp training data pipeline was built: source selection, filtering, dedup, tokenization, document masking, and the quality gates that catch our own mistakes.

David Gornshtein · 9 min read
Read Article
Distributed Optimizer Stress: Drift, All-Gather vs Reduce-Scatter, and Muon Gotchas
Optimizer
Muon

David Gornshtein · 9 min read
Read Article
Document Masking and the Curriculum: What to Feed Each Specialist First
Curriculum
Document Masking

Why MegaCpp masks documents inside packed sequences, how the four-phase curriculum is shaped from 4K syntax to 64K repository graphs, and what our ablations told us about the right starting diet for each specialist.

David Gornshtein · 10 min read
Read Article
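The packed-sequence masking described above has a simple shape: within one packed row, attention is causal and confined to each document's own tokens. A minimal boolean-mask sketch (real kernels consume this structure implicitly rather than materializing an n×n mask):

```python
def packed_document_mask(doc_ids):
    """Boolean causal mask for one packed sequence: position i may
    attend to position j only if j <= i and both tokens come from
    the same document. This is the block-diagonal structure that
    document masking enforces."""
    n = len(doc_ids)
    return [[doc_ids[i] == doc_ids[j] and j <= i for j in range(n)]
            for i in range(n)]

# Two documents packed into one 6-token row.
mask = packed_document_mask([0, 0, 0, 1, 1, 1])
```

Without the document check, token 4 of the second document would attend into the first document, leaking context across unrelated files in the same packed row.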
Flash Attention 4 in Practice: What We Shipped, What We Cut
Flash Attention
FA4

Our hybrid stack's applicability matrix for Flash Attention 4, the Build4 validation profiles, the dense-full rollout gates, and the regressions that killed the first FA4 variants before they reached production.

David Gornshtein · 8 min read
Read Article
FP8 in the Training Stack: What Shipped, What We Rolled Back
FP8
Training

An engineer's account of rolling FP8 through the MegaCpp training stack: DeepGEMM block-scaled GEMMs, torchao Float8Linear, TransformerEngine's FP8-aware activation checkpointing, and the parts that looked good on paper and lost the benchmark.

David Gornshtein · 9 min read
Read Article
FSDP2 Pain and Payoff: What Actually Cut Memory on GPU and TPU
FSDP2
PyTorch

David Gornshtein · 8 min read
Read Article
Graph Recompilation Hell on torch_xla: Shape Polymorphism, Dynamic Fallbacks, and Stabilizing the Cache
torch_xla
TPU

David Gornshtein · 10 min read
Read Article
H200 Bring-up and Naming: How We Stopped Confusing Our Own Receipts
H200
H100

The MegaCpp H200 software stack, the bring-up of TP+SP+EP+FSDP2 on a fresh us-west1-c host, and the naming glossary that prevents two engineers from quoting different runs as the same number.

David Gornshtein · 9 min read
Read Article
Landing the Mamba 3 + Transformer Interleave Ratio: What the Ablations Told Us to Throw Away
Mamba 3
Transformer

How the hybrid layer pattern for our C++ specialist converged: AEMEAEDE versus dense versus GDN, what the NAM52 and NAM56R ablations settled, and the features we cut based on the data.

David Gornshtein · 10 min read
Read Article
libtpu, JAX and torch_xla in One Container: Device Init Races and Env-Var Landmines
libtpu
JAX

What it looks like when a single TPU host runs JAX and torch_xla side by side: libtpu initialization races, VFIO contention, PJRT platform resolution, and the env-var surface you only learn about when it breaks.

David Gornshtein · 9 min read
Read Article
Long Context and Attention Sinks: What Actually Held Up Past 16K
Long Context
YaRN

YaRN, RNoPE, packed-document masking, attention sinks, massive activations, and query-dependent output gating: a field report on which long-context techniques survived contact with the MegaCpp C++ corpus.

David Gornshtein · 10 min read
Read Article
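One of the techniques named above, attention sinks, reduces to a simple KV-cache retention policy: always keep the first few tokens plus a window of the most recent ones. A sketch of which positions survive (the parameter values are illustrative, not our production settings):

```python
def kept_positions(seq_len: int, n_sink: int, window: int):
    """KV-cache positions that survive under an attention-sink policy:
    the first `n_sink` tokens are always kept, plus a sliding window
    of the most recent `window` tokens."""
    recent_start = max(seq_len - window, 0)
    return sorted(set(range(min(n_sink, seq_len))) |
                  set(range(recent_start, seq_len)))

# At 16K tokens, keep 4 sink tokens plus a 4K recent window.
keep = kept_positions(16384, n_sink=4, window=4096)
```

The observation motivating the policy is that attention mass concentrates on the earliest positions; evicting them degrades quality far more than evicting the middle of the sequence.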
The Mamba 3 Kernel Journey: CUDA, Pallas, TileLang, and an Honest Look at CuTe DSL
Mamba 3
CUDA

How the Mamba 3 kernel stack actually shipped in our nanochat POC: TileLang on H200, Pallas on TPU v6e, a CuTe DSL port that never made it, and the verdicts that came out of each attempt.

David Gornshtein · 10 min read
Read Article
Mamba 3 Parallel Performance: What Beat Attention in the nanochat POC, and Where It Lost
Mamba 3
State Space Models

MIMO scaling, block sizes, the PsiV cache trade-off, and an honest tally of where a Mamba 3 hybrid outran pure attention on H200 and where it did not.

David Gornshtein · 10 min read
Read Article
MLA Weight Absorption: What We Kept, What We Dropped for the C++ Specialists
MLA
Attention

Multi-Head Latent Attention in production: why DeepSeek's absorbed decode path is the right choice for KV-cache, why it is the wrong choice for training, and how the C++ specialist ensemble uses both.

David Gornshtein · 9 min read
Read Article
The MoE Routing We Actually Shipped
MoE
Token Choice

Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM ensemble.

David Gornshtein · 10 min read
Read Article
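The Token Choice vs Expert Choice distinction in the teaser fits in a few lines: under Token Choice each token picks its experts (so experts can overload), while under Expert Choice each expert picks its tokens (so tokens can be dropped). A toy sketch with made-up router scores:

```python
def token_choice(scores, k):
    """Token Choice: each token independently picks its top-k experts.
    Capacity is unbounded, so an expert can be overloaded."""
    return [sorted(range(len(row)), key=lambda e: -row[e])[:k]
            for row in scores]

def expert_choice(scores, capacity):
    """Expert Choice: each expert picks its top-`capacity` tokens.
    Load is balanced by construction, but a token may get no expert."""
    n_tokens = len(scores)
    n_experts = len(scores[0])
    return [sorted(range(n_tokens), key=lambda t: -scores[t][e])[:capacity]
            for e in range(n_experts)]

# 4 tokens x 3 experts of router scores (illustrative values).
scores = [[0.9, 0.1, 0.0],
          [0.8, 0.7, 0.2],
          [0.1, 0.9, 0.3],
          [0.2, 0.1, 0.9]]
tc = token_choice(scores, k=1)          # chosen experts, per token
ec = expert_choice(scores, capacity=1)  # chosen tokens, per expert
```

With these scores, Token Choice sends two tokens to expert 0 while Expert Choice balances load but leaves token 1 unrouted; that asymmetry is the crux of the routing decision.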
Benchmarking the MegaCpp Stack on Modal: Multi-GPU Lessons From Rented Boxes
Modal
Benchmarks

What we learned running our training stack on rented H100, H200, and B200 boxes through Modal: three benchmark lanes, an 8-GPU FSDP2 hang, and the bookkeeping that lets the numbers survive a week.

David Gornshtein · 8 min read
Read Article
OOM Hunting on TPU v6e: HBM Fragmentation, the XLA Allocator, and What Actually Moved Memory
TPU
v6e

David Gornshtein · 9 min read
Read Article
SOTA Ablation and Comparison: How MegaCpp Decides What to Keep
Ablation
SOTA

The ablation plan, the comparison methodology, and the honest numbers behind the MegaCpp SLM stack: what stacked, what didn't, and what we threw out even though the paper said it would help.

David Gornshtein · 10 min read
Read Article
Tensor, Sequence, Expert, FSDP: The Hybrid Sharding Stack and the v6e-8 Bugs
Tensor-Parallel
Sequence-Parallel

David Gornshtein · 9 min read
Read Article
Tokenizer Evolution for C++ Code: From v2 Proposal to v3 Shipped
Tokenizer
BPE

How the MegaCpp C++ tokenizer evolved from a 32K v1, through a 48K v2 proposal, to the 65K v3 shipped artifact: what we proposed, what corpus frequency analysis told us, and what the tokenizer ended up doing for downstream eval.

David Gornshtein · 9 min read
Read Article
Living on PyTorch 2.12 Nightly: The nanochat ABI and Wheel-Matrix Tax
PyTorch
CUDA

What it actually cost to sit on PyTorch 2.12 nightlies across CUDA, ROCm, and custom TPU builds during the nanochat POC: ABI breaks, API churn, and the backward-compat patches that kept the fleet training.

David Gornshtein · 8 min read
Read Article
torch_xla and PJRT on TPU v6e: What Actually Worked
torch_xla
PJRT

An honest engineering account of running torch_xla / PJRT on Google TPU v6e during the nanochat POC: version pinning, lazy-tensor tracing traps, persistent cache, SPMD, and the pybind we had to ship ourselves.

David Gornshtein · 9 min read
Read Article
TPU v6e Performance Deep Dive: Real MFU, Sharding Topology, and the Things That Pretended to Help
TPU
v6e

David Gornshtein · 10 min read
Read Article

Want to learn more about our work?

Check out our product documentation or get in touch to discuss how we can help with your C++ AI needs.