Trust, but reproduce.

The verification layer for frontier machine learning.

A practitioner-only publication for working ML and agent engineers. We run the paper, score the reproduction, and ship the minimal code, so you know what holds up before you build on it.

Essays

The Harness Is the Product: Why Agent Evals Are the Real Moat

Swapping the frontier model rarely moves your agent's success rate as much as fixing retries and context management — and the one thing competitors can't clone is your evaluation environment. A thesis on why agent evals, not weights, are where reproducible capability accrues.

CHECKPOINT 0013 · 2026-06-27

Signals — the pulse

2026-06-26 · SIGNALS

vLLM and the new default shape of LLM serving

If you are still serving with naive static batching, the gap is not marginal — paged KV-cache and continuous batching change the throughput-per-GPU math, and most other stacks have copied the idea.

serving paged-attention kv-cache

2026-06-24 · SIGNALS

FlashAttention-3: async, low-precision, Hopper-native

The headline is hardware-specific: FA3 is a Hopper story (async copy/MMA overlap, FP8 paths). The portable lesson from the FA line is still the one that matters — attention is bandwidth-bound, and the win is in HBM traffic, not FLOPs.

flash-attention kernels gpu-memory

2026-06-20 · SIGNALS

Mamba and the selective-state-space line

Worth understanding even if you ship transformers: SSMs change the asymptotics (linear in sequence length, constant state at inference) and the failure modes. The interesting deployments are hybrids, not pure-SSM.

state-space-models attention long-context

2026-06-17 · SIGNALS

SGLang and RadixAttention for prefix reuse

If your workload has heavy shared prefixes — system prompts, few-shot exemplars, agent scaffolds — automatic prefix caching is close to free latency. This is where serving for agents diverges from serving for chat.

serving kv-cache agents
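
The prefix reuse described in the SGLang entry above can be sketched with a plain token trie. This is a minimal illustration under stated assumptions, not SGLang's implementation: `PrefixCache`, `insert`, and `match_length` are hypothetical names, and a production radix tree would also compress edges, attach KV blocks to nodes, and evict under memory pressure.

```python
class PrefixCache:
    """Hypothetical token trie: records seen sequences, reports reusable prefix length."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        # Walk down the trie, creating a child node for each unseen token.
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})

    def match_length(self, tokens):
        # Count how many leading tokens already exist in the trie.
        node, matched = self.root, 0
        for token in tokens:
            if token not in node:
                break
            node = node[token]
            matched += 1
        return matched


cache = PrefixCache()
system_prompt = ["sys", "you", "are", "helpful"]
cache.insert(system_prompt + ["question-1"])

# A second request with the same system prompt only needs to prefill its new suffix.
request = system_prompt + ["question-2"]
reused = cache.match_length(request)
to_prefill = len(request) - reused
```

In this toy case the second request reuses the four system-prompt tokens and prefills one, which is the shape of the saving for agent scaffolds with long shared prefixes.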

2026-06-13 · SIGNALS

DeepSeek-R1: RL-trained reasoning with open weights

The reproducible part is the method, not a leaderboard cell: group-relative RL on verifiable rewards, with open weights to probe. It is the cleanest public artifact for understanding the reasoning-model training loop.

reasoning grpo rlhf

2026-06-11 · SIGNALS

The modded-nanogpt speedrun and the Muon optimizer

A rare fully-public optimization target with a reproducible harness — exactly the kind of artifact we like. The Muon optimizer it popularized is the most interesting practical idea to come out of it.

pretraining optimization reproducibility

Reproduction Watch

CHECKPOINT 0010 · 2026-06-15 · In progress

Reproducing the nanoGPT Speedrun: What Actually Moves the Loss Curve

The nanoGPT speedrun is a rare, fully open optimization target: hit 3.28 FineWeb validation loss on a GPT-2 (124M)-class model in minimum wall-clock on 8×H100. We reproduce the pipeline, isolate what the Muon optimizer and the architecture changes actually buy, and flag what will not transfer off the bench.

reproducibility pretraining optimization distributed-training

Explainers — teardowns

CHECKPOINT 0011 · 2026-06-19 · intro

Reading a Model Release Like an Engineer: Weights, Licenses, System Cards, and Evals

The headline benchmark is the least durable thing in a model release. Here is how to read access, licenses, cards, eval protocols, and serving facts before you commit engineering to a number you cannot reproduce.

evaluation reproducibility llm industry

CHECKPOINT 0008 · 2026-06-08 · intermediate

Post-Training Quantization in Practice: GPTQ, AWQ, and FP8

Post-training quantization is the cheapest inference lever and the easiest to pull wrong. The right method is set by your serving regime — bandwidth-bound decode wants weight-only INT4, compute-bound prefill wants FP8 — and the win is real only if a fast kernel accelerates your exact config.

quantization serving kv-cache
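
To make "weight-only INT4" in the entry above concrete, here is a minimal sketch of symmetric round-to-nearest quantization for one group of weights. It is deliberately the naive baseline, not GPTQ or AWQ, both of which choose the rounding or scaling more carefully; `quantize_int4_group` and `dequantize` are hypothetical helper names, and the one-scale-per-group layout is an assumption.

```python
def quantize_int4_group(weights):
    """Symmetric round-to-nearest into the signed 4-bit range [-8, 7], one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to code 7; fall back to 1.0 for an all-zero group.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate weights from 4-bit codes and the group scale."""
    return [c * scale for c in codes]


group = [0.42, -0.17, 0.05, -0.63, 0.0, 0.31]
codes, scale = quantize_int4_group(group)
restored = dequantize(codes, scale)

# Worst-case rounding error for this group; bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(group, restored))
```

The storage saving is what helps bandwidth-bound decode; whether it turns into speed still depends on a kernel that consumes this exact format.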

CHECKPOINT 0007 · 2026-06-04 · advanced

GRPO, Demystified: Group-Relative Policy Optimization for Reasoning Models

GRPO swaps PPO's learned critic for a Monte-Carlo baseline — the mean reward over a group of sampled completions — trading rollout compute and per-token credit assignment for a simpler, more stable RL loop on verifiable-reward tasks.

grpo rlhf ppo reasoning
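
The group-mean baseline in the GRPO entry above can be sketched in a few lines. This is an illustration of the advantage computation only, not a training loop; `group_relative_advantages` is a hypothetical name, and dividing by the population standard deviation plus a small `eps` is an assumption about the normalization.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each completion relative to its own sampled group."""
    baseline = mean(rewards)   # Monte-Carlo baseline: mean reward over the group
    spread = pstdev(rewards)   # assumed normalizer: population std of the group
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four completions for one prompt, scored by a verifiable 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Every token in a completion shares that completion's single advantage, which is the per-token credit assignment the entry says is traded away.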

CHECKPOINT 0006 · 2026-06-01 · advanced

Routing Is the Hard Part: A Practitioner's Guide to Mixture-of-Experts

MoE decouples parameter count from per-token FLOPs, but every hard problem — instability, dropped tokens, load imbalance, all-to-all traffic, a footprint set by total not active params — lives in the router. A structural tour from Switch/GShard to fine-grained and aux-loss-free designs, and the systems bill you actually pay.

mixture-of-experts transformers tensor-parallelism
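
Dropped tokens and load imbalance, two of the router problems named in the entry above, show up even in a toy top-1 router with a fixed per-expert capacity. The sketch below assumes greedy first-come assignment and a hypothetical `route_top1` helper; it does not model auxiliary losses, top-k routing, or all-to-all communication.

```python
def route_top1(router_scores, capacity):
    """Send each token to its highest-scoring expert; drop it if that expert is full."""
    num_experts = len(router_scores[0])
    load = [0] * num_experts
    assignment, dropped = [], []
    for token_id, scores in enumerate(router_scores):
        expert = max(range(num_experts), key=lambda e: scores[e])
        if load[expert] < capacity:
            load[expert] += 1
            assignment.append((token_id, expert))
        else:
            # Expert is at capacity: this token skips the expert layer entirely.
            dropped.append(token_id)
    return assignment, dropped, load


# Four tokens, two experts; the router favours expert 0 for the first three tokens.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
assignment, dropped, load = route_top1(scores, capacity=2)
```

With capacity 2, the third token is dropped while expert 1 sits half empty, which is the imbalance that load-balancing losses and aux-loss-free schemes try to remove.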

Recreations — from scratch

CHECKPOINT 0005 · 2026-05-28 · scope: minimal

Recreating FlashAttention: A Tiled, IO-Aware Attention Kernel from Scratch

FlashAttention is exact attention restructured for the memory hierarchy, not an approximation. We implement the tiled forward and recompute backward in Triton, validate exactness against a reference, and separate what a tutorial actually reproduces from what needs CUTLASS-grade engineering.

flash-attention kernels attention gpu-memory
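
The claim above that tiling leaves attention exact rests on the online-softmax rescaling trick. Here is a minimal pure-Python sketch for a single query row, forward pass only. It is not the Triton kernel from the article; `attention_reference` and `attention_tiled` are hypothetical names, and the block size is an arbitrary choice.

```python
import math


def _scores(q, keys):
    # Scaled dot products between one query and a list of keys.
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def attention_reference(q, keys, values):
    """Standard softmax attention for one query, materializing all scores."""
    scores = _scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block_size):
    """Same result, visiting keys and values one block at a time."""
    dim = len(values[0])
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        v_blk = values[start:start + block_size]
        scores = _scores(q, keys[start:start + block_size])
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.3, -1.2, 0.8]
keys = [[0.1, 0.4, -0.5], [1.0, -0.3, 0.2], [-0.7, 0.9, 0.6], [0.5, 0.5, -1.1], [2.0, 0.0, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0], [3.0, -1.0]]
full = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
```

The two outputs agree up to floating-point rounding, and the tiled version never holds more than one block of scores at a time, which is where the HBM traffic saving comes from on a GPU.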

Libraries — the toolshed

CHECKPOINT 0009 · 2026-06-11 · rl · stable

TRL in Anger: SFT, DPO, and GRPO Without Rewriting Your Training Loop

TRL turns SFT, DPO, and GRPO into Trainer subclasses that inherit the entire Hugging Face stack — accelerate, peft, DeepSpeed. The convenience is real; the cost is that you're debugging someone else's training loop the moment your problem stops looking like the quickstart.

dpo grpo peft fine-tuning

CHECKPOINT 0003 · 2026-05-20 · serving · production

vLLM, Explained: PagedAttention, Continuous Batching, and the Serving Stack

vLLM treats the KV cache like OS virtual memory, storing it in non-contiguous paged blocks, and schedules work at token granularity rather than request granularity. You get high aggregate throughput; the cost is that per-request latency becomes something you tune rather than something you get for free.

paged-attention kv-cache serving llm
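
The virtual-memory analogy in the vLLM entry above comes down to a per-request block table that maps logical token positions to physical blocks. The sketch below is a toy allocator under stated assumptions, not vLLM's API: `PagedKVCache`, `append_token`, and `free` are hypothetical names, and it tracks only block bookkeeping, not the key and value tensors.

```python
class PagedKVCache:
    """Toy block allocator: each request owns a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request_id -> physical block ids, in logical order
        self.lengths = {}       # request_id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:
            # No block yet, or the last block is full: grab any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
# Interleave two requests, as a continuous-batching scheduler would.
slots = [cache.append_token(r) for r in ["a", "b", "a", "a"]]
table_a = list(cache.block_tables["a"])
cache.free("a")
free_after = sorted(cache.free_blocks)
```

Request `a` ends up on two blocks that are not adjacent in physical memory, and freeing it returns both to the pool immediately, so memory is wasted only inside a request's last partially filled block.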

Essays — the horizon

CHECKPOINT 0013 · 2026-06-27

The Harness Is the Product: Why Agent Evals Are the Real Moat

agents agent-harness tool-use evaluation

CHECKPOINT 0012 · 2026-06-23

The Economics of Thinking: Test-Time Compute as a Scaling Axis

Reasoning models turned inference into a per-request dial. This is an economic read on when spending FLOPs at test time actually buys accuracy, why it only pays where answers are cheap to verify, and what variable-cost inference does to latency budgets and capacity planning.

test-time-compute reasoning serving industry

CHECKPOINT 0001 · 2026-05-12

How We Separate Signal From Noise: Frontier Checkpoint's Verification Rubric

The standard behind everything we publish: the filters that decide what earns your attention, the reproduced-to-unverified ladder we grade claims on, how we handle benchmarks and weightless releases, and why every correction is dated and logged rather than silently edited.

methodology reproducibility evaluation industry

Get the week's signal, not the noise.