Series · ongoing

Building a Training Stack from Scratch

From a single GPU to a sharded cluster: the kernels, parallelism, and reproductions behind a modern pretraining run — built up one load-bearing piece at a time.

3 of 5 parts published

  1. 01

    Explainers · 2026-05-15

    Sharding the Model: FSDP, ZeRO, and Tensor/Pipeline Parallelism

    Past one GPU you stop training a model and start operating a distributed system. Here is what each parallelism axis actually shards, what it costs on the wire, and how practitioners stack them into 3D/4D layouts.

  2. 02

    Recreations · 2026-05-28

    Recreating FlashAttention: A Tiled, IO-Aware Attention Kernel from Scratch

    FlashAttention is exact attention restructured for the memory hierarchy, not an approximation. We implement the tiled forward pass and the recomputation-based backward pass in Triton, validate exactness against a reference implementation, and separate what a tutorial actually reproduces from what needs CUTLASS-grade engineering.

  3. 03

    Reproductions · 2026-06-15

    Reproducing the nanoGPT Speedrun: What Actually Moves the Loss Curve

    The nanoGPT speedrun is a rare, fully open optimization target: reach 3.28 FineWeb validation loss on a GPT-2 (124M)-class model in the minimum wall-clock time on 8×H100. We reproduce the pipeline, isolate what the Muon optimizer and the architecture changes actually buy, and flag what will not transfer off the bench.

  4. 04

    Planned · coming soon

  5. 05

    Planned · coming soon