Series · ongoing
Building a Training Stack from Scratch
From a single GPU to a sharded cluster: the kernels, parallelism, and reproductions behind a modern pretraining run — built up one load-bearing piece at a time.
- 01
Sharding the Model: FSDP, ZeRO, and Tensor/Pipeline Parallelism
Past one GPU, you stop training a model and start operating a distributed system. Here is what each parallelism axis actually shards, what it costs on the wire, and how practitioners stack them into 3D/4D layouts.
- 02
Recreating FlashAttention: A Tiled, IO-Aware Attention Kernel from Scratch
FlashAttention is exact attention restructured for the memory hierarchy, not an approximation. We implement the tiled forward and recompute backward in Triton, validate exactness against a reference, and separate what a tutorial actually reproduces from what needs CUTLASS-grade engineering.
- 03
Reproducing the nanoGPT Speedrun: What Actually Moves the Loss Curve
The nanoGPT speedrun is a rare, fully open optimization target: hit 3.28 FineWeb validation loss on a GPT-2 (124M)-class model in minimum wall-clock time on 8×H100. We reproduce the pipeline, isolate what the Muon optimizer and the architecture changes actually buy, and flag what will not transfer off the bench.
- 04
- 05