Reproducing the nanoGPT Speedrun: What Actually Moves the Loss Curve

Most papers hand you a result and ask for trust. The nanoGPT speedrun hands you a Git history. The task is brutally specific — train a GPT-2 (124M)-class model to 3.28 cross-entropy on a fixed FineWeb validation split, in the least wall-clock time on an 8×H100 node — and every record is a commit with a reproducible log. That specificity is what makes it both a good teacher and a dangerous one: the same optimization pressure that surfaces genuinely better training recipes also rewards overfitting to one model size, one dataset, and one loss threshold. Sorting those two apart is the whole exercise.

The target: a loss threshold, not a paper

The lineage is short and worth keeping straight. Andrej Karpathy’s nanoGPT is the readable reference trainer; his llm.c work reproduced GPT-2 (124M) on FineWeb to roughly 3.28 validation loss. Keller Jordan’s modded-nanogpt took that exact endpoint and turned it into a competitive benchmark: same parameter budget, same tokenizer (GPT-2 BPE), same FineWeb validation set, same 3.28 target. The deliverable for each record is a commit plus a training log, which is why this is a reproduction and not a vibe.

The metric is the interesting design choice. The speedrun ranks on wall-clock time to reach the threshold on a fixed 8×H100 node, not steps-to-threshold. That distinction matters more than it looks. Steps-to-threshold rewards sample efficiency in isolation and is blind to throughput, so it will happily crown a change that needs fewer optimizer steps while making each step slower — a wider attention window, a heavier optimizer inner loop, anything that does not torch.compile cleanly. Wall-clock folds algorithmic efficiency and systems efficiency into one number, which is the number you actually pay for. The cost is that the leaderboard is hardware-locked: a record is a statement about 8×H100 with a specific software stack, and it does not transpose to your A100s or your 4090 by simple scaling.
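
A quick arithmetic sketch of why the two metrics can disagree. The step counts and per-step times are hypothetical, chosen only to illustrate the trade-off, and are not taken from any speedrun log.

```python
def wall_clock_seconds(steps: int, seconds_per_step: float) -> float:
    """Total training time if every step costs the same."""
    return steps * seconds_per_step

# Hypothetical numbers, for illustration only.
baseline_steps, baseline_step_time = 5000, 0.100
candidate_steps, candidate_step_time = 4500, 0.115   # fewer steps, slower steps

baseline = wall_clock_seconds(baseline_steps, baseline_step_time)
candidate = wall_clock_seconds(candidate_steps, candidate_step_time)

steps_saved = 1 - candidate_steps / baseline_steps   # candidate wins on steps-to-threshold
time_ratio = candidate / baseline                    # candidate loses on wall-clock if > 1

print(f"steps saved: {steps_saved:.0%}, wall-clock ratio: {time_ratio:.3f}")
```

With these made-up inputs the candidate needs a tenth fewer steps yet finishes later, which is the case a steps-only ranking would get wrong.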

Because the validation set and token count are fixed and evaluation is deterministic, the only stochasticity in the final loss is training-side: initialization, data order, and the optimizer trajectory. That is what makes seed variance the central methodological problem later — the records are won by margins that sit close to that noise floor.

Muon: orthogonalizing the update

The headline algorithmic ingredient is the Muon optimizer. For each 2D weight matrix it keeps an SGD-style momentum buffer, then replaces the raw momentum update with the nearest (semi-)orthogonal matrix before applying it. Concretely, if the momentum buffer has SVD $G = U \Sigma V^\top$, Muon steps along $U V^\top$, which keeps the singular vectors and flattens every singular value to one:

$\Delta W = -\eta \, U V^\top, \quad \text{where} \ G = U \Sigma V^\top$

You never compute the SVD. The orthogonal factor is approximated by a fixed quintic Newton–Schulz iteration run in bf16, which is just a handful of matmuls:

import torch

def zeropower_via_newtonschulz5(G, steps=5, eps=1e-7):
    # Approximate the orthogonal factor U V^T of G = U S V^T in bf16.
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients
    X = G.bfloat16()
    X = X / (X.norm() + eps)            # bring spectral norm <= 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X               # quintic step, preserves singular vectors
    return X.T if transposed else X

The update step pairs that with momentum and a shape-aware scale so the effective step size is comparable across matrices of different aspect ratios:

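@torch.no_grad()
def muon_update_(W, grad, buf, lr, momentum, nesterov):
    # One in-place Muon step for a single 2D weight W with gradient grad.
    buf.mul_(momentum).add_(grad)                                  # update momentum buffer
    update = grad.add(buf, alpha=momentum) if nesterov else buf   # Nesterov-style
    O = zeropower_via_newtonschulz5(update, steps=5)
    O *= max(1.0, W.size(0) / W.size(1)) ** 0.5                    # RMS match across shapes
    W.add_(O, alpha=-lr)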

Two practical points get lost in the math. First, Muon only governs the 2D hidden matrices: the attention QKV and output projections and the MLP up/down weights. Embeddings, the unembedding/LM head, and every 1D parameter (norm gains, biases) stay on AdamW. A real Muon training loop maintains two parameter groups; treating the head or the embedding table as “just another matrix” is a known way to make it diverge. Second, the interpretation: orthogonalizing equalizes the update’s singular values so no single direction dominates the step, which is steepest descent under the spectral norm rather than under the elementwise max norm that Adam’s sign-like, per-coordinate update approximates. That is why it tends to need fewer steps to a given loss on this task: it is a structurally different geometry, not a tuned variant of Adam.
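
Because each Newton–Schulz step leaves the singular vectors alone, its effect can be read off one singular value at a time: with X = U S V^T, the step in the function above maps each singular value s to a·s + b·s³ + c·s⁵. The sketch below iterates that scalar map in plain Python with the same coefficients; the starting values are arbitrary illustrations. It suggests the iteration pushes singular values into a band around one rather than exactly onto one, which is the sense in which the orthogonal factor is only approximated.

```python
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315  # same quintic coefficients as the function above

def quintic_step(s: float) -> float:
    """Effect of one Newton-Schulz step on a single singular value s."""
    return A_COEF * s + B_COEF * s**3 + C_COEF * s**5

def iterate(s: float, steps: int = 5) -> float:
    """Apply the scalar map `steps` times."""
    for _ in range(steps):
        s = quintic_step(s)
    return s

# Arbitrary starting singular values in (0, 1], as after the norm rescaling.
starts = [0.01, 0.1, 0.5, 1.0]
finals = [iterate(s) for s in starts]
for s0, s5 in zip(starts, finals):
    print(f"s = {s0:<5} -> after 5 steps: {s5:.3f}")
```

Starting values that differ by a factor of one hundred end up within a narrow band of each other. Much smaller singular values may need more than five steps to get there.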

In our small-config runs the direction matches the public logs: holding the architecture fixed, Muon reaches the loss target in meaningfully fewer optimizer steps than AdamW, and the per-step orthogonalization overhead is small enough that the step-count win survives into wall-clock. The exact magnitude is hardware- and shape-dependent and we have not reproduced it on the reference 8×H100 node, so treat the entries in the ablation table below as direction-of-travel, not measured records.

The tweaks, one at a time

Muon is the loudest ingredient but not the only one. The speedrun’s architecture has drifted steadily away from 2019-GPT-2 toward a modern recipe, and the useful exercise is ablating each change individually rather than swallowing the stack whole. The recurring ones:

Rotary position embeddings replace learned absolute position embeddings (the wpe table is gone entirely). This is the relative-position story from RoFormer; it removes parameters and generally improves the loss at fixed steps.
QK normalization — an RMSNorm on the query and key projections (per head) before attention — which stabilizes attention logits and lets you push the learning rate harder without the logit blow-ups that otherwise cap your LR.
ReLU² (squared ReLU) in the MLP in place of GELU: F.relu(x).square(). Cheaper than GELU and, here, neutral-to-positive on loss.
A trapezoidal LR schedule — warmup, a stable plateau, then a linear cooldown to (near) zero over the final fraction of training. At a fixed short horizon the cooldown shape is one of the highest-leverage single knobs, because where and how fast you decay strongly determines the final loss you land on.
Later, heavier additions: U-net-style skip connections between blocks, value/embedding residual learning, logit soft-capping, an FP8 head, and FlexAttention with sliding-window plus document-boundary masking. These are increasingly co-tuned to this exact scale and node.
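
A minimal sketch of the trapezoidal schedule described in the list above, written as a multiplier on the peak learning rate. The function name, argument names, and the step counts in the usage lines are our own placeholders, not values from the speedrun repo.

```python
def trapezoid_lr_scale(step: int, total_steps: int,
                       warmup_steps: int, cooldown_steps: int) -> float:
    """Multiplier on the peak LR: linear warmup, flat plateau, linear cooldown to zero."""
    assert 0 <= step <= total_steps
    assert warmup_steps + cooldown_steps <= total_steps
    if step < warmup_steps:
        return (step + 1) / warmup_steps              # ramp up to 1.0
    if cooldown_steps == 0 or step < total_steps - cooldown_steps:
        return 1.0                                    # stable plateau
    return (total_steps - step) / cooldown_steps      # linear decay, 0.0 at the last step

# Placeholder horizon: 100 steps, 10 warmup, 20 cooldown.
schedule = [trapezoid_lr_scale(s, total_steps=100, warmup_steps=10, cooldown_steps=20)
            for s in range(101)]
print(schedule[0], schedule[50], schedule[90], schedule[100])
```

Since the function returns a multiplier, not an absolute rate, it can be handed to any scheduler that accepts a step-to-multiplier callable, such as `torch.optim.lr_scheduler.LambdaLR`. Varying `cooldown_steps` while holding `total_steps` fixed is the ablation the cooldown item refers to.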

Below is the qualitative shape we see when ablating each ingredient individually on a smaller config. These are directional reads from runs smaller than the reference setup, each across several seeds, not 8×H100 records; read them as signs and rough magnitudes, not decimals.

Ingredient	What it changes	Steps-to-target	Wall-clock	Notes
Muon (vs AdamW)	optimizer geometry	large reduction	net win	needs AdamW for embed/head
RoPE (vs learned APE)	position encoding	modest reduction	neutral	also drops params
QK-norm	attention stability	small reduction	neutral	mainly enables higher LR
ReLU² (vs GELU)	MLP activation	negligible	slight win	cheaper kernel
LR cooldown shape	schedule tail	large at fixed horizon	free	most sensitive single knob
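
To make the QK-norm row concrete, here is a small pure-Python sketch of why normalizing queries and keys bounds the attention logit. It assumes a gain-free RMSNorm and the usual 1/sqrt(d) attention scaling, and the vectors are made-up illustrations. After normalization each vector has length at most sqrt(d), so the scaled logit cannot exceed sqrt(d) in magnitude however large the raw activations grow.

```python
import math

def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Gain-free RMSNorm: divide by the root-mean-square of the vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def attn_logit(q: list[float], k: list[float]) -> float:
    """Scaled dot-product attention logit, q.k / sqrt(d)."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Made-up query/key with large, nearly aligned activations (bad case for logit growth).
q = [50.0, -80.0, 120.0, 30.0]
k = [60.0, -90.0, 100.0, 40.0]

raw_logit = attn_logit(q, k)
normed_logit = attn_logit(rms_norm(q), rms_norm(k))
bound = math.sqrt(len(q))   # |logit| <= sqrt(d) once both vectors are RMS-normalized

print(f"raw: {raw_logit:.1f}  normalized: {normed_logit:.3f}  bound: {bound:.3f}")
```

A learned gain on the norm, or a different attention scale, changes the bound but not the basic point: the logit no longer scales with the magnitude of the activations.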

Measuring honestly

The records sit near the seed-noise floor, so single runs prove nothing. At a fixed step count, final FineWeb validation loss wobbles run-to-run by an amount that is comparable to the margin between adjacent leaderboard entries; declaring a winner from one seed each is how you fool yourself. We run each ablation across multiple seeds and report mean and spread, and we treat any “improvement” smaller than the seed spread as unproven rather than real.
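
A sketch of the decision rule in the paragraph above, using only the standard library. The loss values are invented for illustration, and taking the larger of the two sample standard deviations as the "seed spread" is one simple convention among several, not an official speedrun procedure.

```python
from statistics import mean, stdev

def compare_arms(baseline: list[float], candidate: list[float]) -> dict:
    """Compare final validation losses of two arms, each run over several seeds."""
    delta = mean(baseline) - mean(candidate)          # > 0 means candidate is better
    spread = max(stdev(baseline), stdev(candidate))   # crude seed-noise estimate
    if delta > spread:
        verdict = "improvement"
    elif delta < -spread:
        verdict = "regression"
    else:
        verdict = "unproven"                          # inside the noise floor
    return {"delta": delta, "seed_spread": spread, "verdict": verdict}

# Invented final validation losses, four seeds per arm.
baseline_losses = [3.292, 3.285, 3.298, 3.289]
candidate_losses = [3.288, 3.294, 3.283, 3.290]

result = compare_arms(baseline_losses, candidate_losses)
print(result["verdict"], round(result["delta"], 4), round(result["seed_spread"], 4))
```

Here the candidate's mean is lower, but by less than the run-to-run spread, so the rule declines to call it a win. With more seeds, a proper significance test would be a reasonable replacement for the threshold.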

Wall-clock accounting needs the same discipline. torch.compile warmup, the FlexAttention block-mask construction, and dataloader spin-up are real seconds that either belong in the number or do not — pick one convention and apply it everywhere. The public speedrun convention times the training loop on the target node; comparing your cold-start total against that is an apples-to-oranges error that will flatter or damn you by a minute. We also pin the obvious determinism levers (seed, data shard order, cuDNN settings) so that what is left is genuine optimizer stochasticity and not a leaky harness.
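
One way to keep the convention explicit is to time warmup and the measured loop separately and always say which number is being reported. The sketch below uses a stand-in `train_step` callable and hypothetical step counts. On a GPU you would also need to synchronize the device before reading the clock (for example with `torch.cuda.synchronize()`), since kernel launches are asynchronous.

```python
import time
from typing import Callable

def timed_run(train_step: Callable[[int], None],
              warmup_steps: int, measured_steps: int) -> dict:
    """Run warmup steps, then time the measured loop separately."""
    t0 = time.perf_counter()
    for step in range(warmup_steps):              # compile / cache warmup lives here
        train_step(step)
    t1 = time.perf_counter()
    for step in range(warmup_steps, warmup_steps + measured_steps):
        train_step(step)
    t2 = time.perf_counter()
    return {
        "warmup_s": t1 - t0,                      # reported, but kept out of the headline
        "train_loop_s": t2 - t1,                  # the number comparable across arms
        "cold_start_total_s": t2 - t0,            # not comparable to a loop-only time
    }

# Stand-in step that only records its index; a real step would run forward/backward.
calls = []
report = timed_run(lambda step: calls.append(step), warmup_steps=3, measured_steps=5)
print(report)
```

Returning all three numbers makes it harder to quietly compare a cold-start total against someone else's loop-only time.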

Our status is honest about its limits: the full reproduction on 8×H100 to 3.28 is in progress, and what we report so far is from smaller configurations chosen to fit available hardware. That is enough to confirm directions — Muon helps, the cooldown matters, QK-norm earns its keep through the LR ceiling — but not to certify an absolute record time, which is exactly why the record column below points at the repo rather than at a frozen figure.

What transfers vs what is overfit to 3.28

Some of this stack is real engineering that generalizes; some is a finely sharpened key for one lock.

Travels well. Muon is the standout: it has since been scaled well beyond toy GPT-2, with public reports of Muon-style optimizers used in large LLM pretraining, which is strong evidence it is not a speedrun artifact. RoPE, QK-norm, and ReLU² are already standard issue in modern architectures for reasons that have nothing to do with this benchmark. If you take one thing back to your own pretraining, it is the optimizer and the QK-norm-enables-higher-LR interaction.

Overfit to the target. Anything tuned to the exact horizon, scale, and node is suspect off-bench. The trapezoidal cooldown is calibrated to land precisely at 3.28 in minimal steps — stretch the training run and that schedule is no longer optimal. The U-net skips, value-residual tricks, FP8 head, and sliding-window sizes are co-tuned to 124M parameters on H100 at this sequence length; there is no guarantee they survive a 10× scale-up or a different data mix. And the deepest trap is the metric itself: optimizing fastest to a fixed loss is not the same objective as best loss at a fixed compute budget, and a recipe that wins the former can lose the latter.
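
The gap between the two objectives is easy to see with toy curves. The sketch below uses two invented exponential loss curves in arbitrary time units. They are not fits to any real run and exist only to show that the ranking can flip between "first to 3.28" and "lowest loss at a fixed budget".

```python
import math

def loss_at(t: float, floor: float, amp: float, decay: float) -> float:
    """Toy loss curve: floor + amp * decay**t, with 0 < decay < 1."""
    return floor + amp * decay ** t

def time_to_target(target: float, floor: float, amp: float, decay: float) -> float:
    """Solve floor + amp * decay**t = target for t (requires target > floor)."""
    return math.log((target - floor) / amp) / math.log(decay)

# Two invented recipes: A drops fast but plateaus high, B is slower with a lower floor.
recipe_a = dict(floor=3.20, amp=0.8, decay=0.5)
recipe_b = dict(floor=3.00, amp=1.0, decay=0.7)

target, budget = 3.28, 10.0
t_a = time_to_target(target, **recipe_a)
t_b = time_to_target(target, **recipe_b)
final_a = loss_at(budget, **recipe_a)
final_b = loss_at(budget, **recipe_b)

print(f"time to {target}: A={t_a:.2f}, B={t_b:.2f}")
print(f"loss at t={budget}: A={final_a:.3f}, B={final_b:.3f}")
```

Under these assumed curves, recipe A wins the speedrun metric while recipe B wins at the longer budget, which is the sense in which a record-setting recipe can be the wrong choice for a longer run.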

A harness you can actually rerun

The point of a reproduction is that you can poke it. We structure the runner so every headline ingredient is a single toggle and the optimizer split is explicit, so you can flip one variable, fix the rest, and read the delta on your own GPUs:

cfg = dict(
    optimizer="muon",      # {"muon", "adamw"} for the 2D matrices
    pos_emb="rope",        # {"rope", "learned"}
    qk_norm=True,
    mlp_act="relu2",       # {"relu2", "gelu"}
    lr_schedule="trapezoid",
    target_val_loss=3.28,
    seeds=[0, 1, 2, 3],    # report mean +/- spread, never a single run
)

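# Optimizer split is the part people get wrong:
# (the name filters assume embedding/head parameter names contain "embed" / "lm_head")
matrix_params = [p for n, p in model.named_parameters()
                 if p.ndim == 2 and "embed" not in n and "lm_head" not in n]
matrix_ids    = {id(p) for p in matrix_params}   # build once; compare by identity, not tensor ==
other_params  = [p for p in model.parameters() if id(p) not in matrix_ids]
opt = [Muon(matrix_params, lr=...), torch.optim.AdamW(other_params, lr=...)]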

On smaller hardware, shrink the model and the token budget together and pick a correspondingly easier loss target — the goal is to recover the ordering of ingredients, not the 3.28 record, which is reference-node-specific. Hold seeds, data order, and compile settings fixed across the arm you are testing, and only change the one toggle. If a change moves the loss by less than your seed spread, you have learned that it does not matter at your scale, which is itself a result.
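
One way to enforce the one-toggle rule is to generate the ablation arms mechanically from a base config, so that no arm can differ from the baseline in more than one key. The sketch below reuses the toggle names from the config above. The helper name is ours, and `"constant"` is a hypothetical alternative schedule name added only for illustration.

```python
from copy import deepcopy

BASE_CFG = dict(optimizer="muon", pos_emb="rope", qk_norm=True,
                mlp_act="relu2", lr_schedule="trapezoid")

# Alternative value for each toggle; "constant" is a hypothetical schedule name.
ALTERNATIVES = dict(optimizer="adamw", pos_emb="learned", qk_norm=False,
                    mlp_act="gelu", lr_schedule="constant")

def single_toggle_arms(base: dict, alternatives: dict) -> dict:
    """Return {arm_name: cfg}, where each arm differs from `base` in exactly one key."""
    arms = {"baseline": deepcopy(base)}
    for key, alt in alternatives.items():
        cfg = deepcopy(base)
        cfg[key] = alt                      # flip one toggle, leave the rest fixed
        arms[f"{key}={alt}"] = cfg
    return arms

arms = single_toggle_arms(BASE_CFG, ALTERNATIVES)
for name, cfg in arms.items():
    changed = [k for k in BASE_CFG if cfg[k] != BASE_CFG[k]]
    print(f"{name:24s} changed: {changed}")
```

Each arm would then be run over the same seed list, with the per-arm mean and spread compared against the baseline arm.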

The bottom line

The nanoGPT speedrun is the closest thing the field has to a controlled experiment on modern training recipes, and its lesson is unglamorous: one genuinely new optimizer (Muon), a handful of by-now-standard architecture choices, a schedule tail tuned to the horizon, and a lot of systems hygiene. Reproduced individually, Muon and the QK-norm/LR interaction are the parts worth carrying off the bench; the skip-connection-and-FP8 garnish is increasingly a key cut for this one lock. We are reporting the public records as logged and our own runs as in-progress and qualitative — the direction is solid, the absolute 8×H100 number is not ours to claim yet. If you only do one thing with this: clone the repo, wire up the toggle harness, and run the optimizer ablation against your own seed noise before you believe any single curve, including ours.

Metric	Reported	Reproduced
Target validation loss (GPT-2 124M, FineWeb)	3.28 val CE loss	in progress; full 8×H100 run pending
Wall-clock to target on 8×H100	single-digit minutes (current record at time of writing; see repo)	not yet run on reference node
Optimizer steps to target, Muon vs AdamW (arch fixed)	Muon reaches target in markedly fewer steps (Jordan)	direction reproduced on small config; magnitude hardware-dependent
Baseline trainer (Karpathy nanoGPT/llm.c)	order of dozens of minutes on 8×H100 (repo baseline)	context only

Table: Reported vs. reproduced. We report the publicly logged records; our own runs are in progress and qualitative so far.
