Series · ongoing

Building a Training Stack from Scratch

From a single GPU to a sharded cluster: the kernels, parallelism, and reproductions behind a modern pretraining run — built up one load-bearing piece at a time.

3 of 5 parts published

01
Explainers · 2026-05-15
Sharding the Model: FSDP, ZeRO, and Tensor/Pipeline Parallelism
Past one GPU, you stop training a model and start operating a distributed system. Here is what each parallelism axis actually shards, what it costs on the wire, and how practitioners stack them into 3D/4D layouts.
02
Recreations · 2026-05-28
Recreating FlashAttention: A Tiled, IO-Aware Attention Kernel from Scratch
FlashAttention is exact attention restructured for the memory hierarchy, not an approximation. We implement the tiled forward pass and the recomputation-based backward pass in Triton, validate exactness against a reference, and separate what a tutorial actually reproduces from what needs CUTLASS-grade engineering.
03
Reproductions · 2026-06-15
Reproducing the nanoGPT Speedrun: What Actually Moves the Loss Curve
The nanoGPT speedrun is a rare, fully open optimization target: hit 3.28 FineWeb validation loss on a GPT-2 (124M)-class model in minimum wall-clock on 8×H100. We reproduce the pipeline, isolate what the Muon optimizer and the architecture changes actually buy, and flag what will not transfer off the bench.
04
Planned · coming soon
05
Planned · coming soon
