Training the MegaCpp SLM Ensemble on GB10: a Grace Blackwell war story
Field notes from bringing the MegaCpp SLM Ensemble up on NVIDIA GB10 and DGX Spark: silicon surprises, NaN bisects that ate days, regressions caused by our own patches, and the software-stack choices that held.
This is the broad GB10 war story: what we attempted, what held up, and what turned out to be wishful thinking.