Topic

Distributed Training

Sharding models and data across many accelerators.

2 checkpoints

CHECKPOINT 00102026-06-15REPRODUCTIONSIn progress

Reproducing the nanoGPT Speedrun: What Actually Moves the Loss Curve

The nanoGPT speedrun is a rare, fully open optimization target: hit 3.28 FineWeb validation loss on a GPT-2 (124M)-class model in minimum wall-clock on 8×H100. We reproduce the pipeline, isolate what the Muon optimizer and the architecture changes actually buy, and flag what will not transfer off the bench.

CHECKPOINT 00022026-05-15EXPLAINERSadvanced

Sharding the Model: FSDP, ZeRO, and Tensor/Pipeline Parallelism

Past one GPU you stop training a model and start operating a distributed system. Here is what each parallelism axis actually shards, what it costs on the wire, and how practitioners stack them into 3D/4D layouts.

distributed-training fsdp tensor-parallelism gpu-memory