vLLM and the new default shape of LLM serving
If you are still serving with naive static batching, the gap is not marginal — paged KV-cache and continuous batching change the throughput-per-GPU math, and most other stacks have copied the idea.
Trust, but reproduce.
A practitioner-only publication for working ML and agent engineers. We run the paper, score the reproduction, and ship the minimal code — so you know what holds up before you load it into your work.
Essays
Swapping the frontier model rarely moves your agent's success rate as much as fixing retries and context management — and the one thing competitors can't clone is your evaluation environment. A thesis on why agent evals, not weights, are where reproducible capability accrues.
If you are still serving with naive static batching, the gap is not marginal — paged KV-cache and continuous batching change the throughput-per-GPU math, and most other stacks have copied the idea.
The headline is hardware-specific: FA3 is a Hopper story (async copy/MMA overlap, FP8 paths). The portable lesson from the FA line is still the one that matters — attention is bandwidth-bound, and the win is in HBM traffic, not FLOPs.
Worth understanding even if you ship transformers: SSMs change the asymptotics (linear in sequence length, constant state at inference) and the failure modes. The interesting deployments are hybrids, not pure-SSM.
If your workload has heavy shared prefixes — system prompts, few-shot exemplars, agent scaffolds — automatic prefix caching is close to free latency. This is where serving for agents diverges from serving for chat.
The reproducible part is the method, not a leaderboard cell: group-relative RL on verifiable rewards, with open weights to probe. It is the cleanest public artifact for understanding the reasoning-model training loop.
A rare fully-public optimization target with a reproducible harness — exactly the kind of artifact we like. The Muon optimizer it popularized is the most interesting practical idea to come out of it.
The nanoGPT speedrun is a rare, fully open optimization target: hit 3.28 FineWeb validation loss on a GPT-2 (124M)-class model in minimum wall-clock on 8×H100. We reproduce the pipeline, isolate what the Muon optimizer and the architecture changes actually buy, and flag what will not transfer off the bench.
The headline benchmark is the least durable thing in a model release. Here is how to read access, licenses, cards, eval protocols, and serving facts before you commit engineering to a number you cannot reproduce.
Post-training quantization is the cheapest inference lever and the easiest to pull wrong. The right method is set by your serving regime — bandwidth-bound decode wants weight-only INT4, compute-bound prefill wants FP8 — and the win is real only if a fast kernel accelerates your exact config.
GRPO swaps PPO's learned critic for a Monte-Carlo baseline — the mean reward over a group of sampled completions — trading rollout compute and per-token credit assignment for a simpler, more stable RL loop on verifiable-reward tasks.
MoE decouples parameter count from per-token FLOPs, but every hard problem — instability, dropped tokens, load imbalance, all-to-all traffic, a footprint set by total not active params — lives in the router. A structural tour from Switch/GShard to fine-grained and aux-loss-free designs, and the systems bill you actually pay.
FlashAttention is exact attention restructured for the memory hierarchy, not an approximation. We implement the tiled forward and recompute backward in Triton, validate exactness against a reference, and separate what a tutorial actually reproduces from what needs CUTLASS-grade engineering.
TRL turns SFT, DPO, and GRPO into Trainer subclasses that inherit the entire Hugging Face stack — accelerate, peft, DeepSpeed. The convenience is real; the cost is that you're debugging someone else's training loop the moment your problem stops looking like the quickstart.
vLLM treats the KV cache like OS virtual memory — non-contiguous paged blocks — and schedules work at the token, not the request. You get high aggregate throughput; the cost is that per-request latency becomes something you tune rather than something you get for free.
Swapping the frontier model rarely moves your agent's success rate as much as fixing retries and context management — and the one thing competitors can't clone is your evaluation environment. A thesis on why agent evals, not weights, are where reproducible capability accrues.
Reasoning models turned inference into a per-request dial. This is an economic read on when spending FLOPs at test time actually buys accuracy, why it only pays where answers are cheap to verify, and what variable-cost inference does to latency budgets and capacity planning.
The standard behind everything we publish: the filters that decide what earns your attention, the reproduced-to-unverified ladder we grade claims on, how we handle benchmarks and weightless releases, and why every correction is dated and logged rather than silently edited.
The Checkpoint · weekly
One issue a week: the papers, reproductions, and releases that held up — with the take on why they matter for the code you write.