Frontier Checkpoint

A practitioner-only technical publication for working ML and agent engineers. We verify, reproduce, and recreate the work, exploring the open frontier of what actually holds up.
https://frontiercheckpoint.com/ · en-us · © 2026 Frontier Checkpoint · editors@frontiercheckpoint.com

What a 4B Model Can Actually Do: Field Notes from 155 Experiments
https://frontiercheckpoint.com/essays/what-a-4b-model-can-do/
Across 155 small-model experiments centered on Qwen 3.5 4B, the same thing kept working: give the model something executable it can check against the evidence it has, and it punches far above its benchmark weight. Here is the field guide: the levers that worked, how I know they're real, and the frontier they opened up.
Sun, 28 Jun 2026 00:00:00 GMT · Essays · Tags: reproducibility, evaluation, fine-tuning, methodology, llm, agents

The Harness Is the Product: Why Agent Evals Are the Real Moat
https://frontiercheckpoint.com/essays/agent-harness-evals-moat/
Swapping the frontier model rarely moves your agent's success rate as much as fixing retries and context management, and the one thing competitors can't clone is your evaluation environment.
A thesis on why agent evals, not weights, are where reproducible capability accrues.
Sat, 27 Jun 2026 00:00:00 GMT · Essays · Tags: agents, agent-harness, tool-use, evaluation

vLLM and the new default shape of LLM serving
https://frontiercheckpoint.com/signals/vllm-paged-attention-serving-standard/
If you are still serving with naive static batching, the gap is not marginal: paged KV-cache and continuous batching change the throughput-per-GPU math, and most other stacks have copied the idea.
Fri, 26 Jun 2026 00:00:00 GMT · Signals · Tags: serving, paged-attention, kv-cache

FlashAttention-3: async, low-precision, Hopper-native
https://frontiercheckpoint.com/signals/flashattention-3-hopper/
The headline is hardware-specific: FA3 is a Hopper story (async copy/MMA overlap, FP8 paths). The portable lesson from the FA line is still the one that matters. Attention is bandwidth-bound, and the win is in HBM traffic, not FLOPs.
Wed, 24 Jun 2026 00:00:00 GMT · Signals · Tags: flash-attention, kernels, gpu-memory

The Economics of Thinking: Test-Time Compute as a Scaling Axis
https://frontiercheckpoint.com/essays/economics-of-test-time-compute/
Reasoning models turned inference into a per-request dial.
This is an economic read on when spending FLOPs at test time actually buys accuracy, why it only pays where answers are cheap to verify, and what variable-cost inference does to latency budgets and capacity planning.
Tue, 23 Jun 2026 00:00:00 GMT · Essays · Tags: test-time-compute, reasoning, serving, industry

Mamba and the selective-state-space line
https://frontiercheckpoint.com/signals/mamba-ssm-attention-alternative/
Worth understanding even if you ship transformers: SSMs change the asymptotics (linear in sequence length, constant state at inference) and the failure modes. The interesting deployments are hybrids, not pure SSMs.
Sat, 20 Jun 2026 00:00:00 GMT · Signals · Tags: state-space-models, attention, long-context

Reading a Model Release Like an Engineer: Weights, Licenses, System Cards, and Evals
https://frontiercheckpoint.com/explainers/reading-a-model-release/
The headline benchmark is the least durable thing in a model release. Here is how to read access, licenses, cards, eval protocols, and serving facts before you commit engineering to a number you cannot reproduce.
Fri, 19 Jun 2026 00:00:00 GMT · Explainers · Tags: evaluation, reproducibility, llm, industry

SGLang and RadixAttention for prefix reuse
https://frontiercheckpoint.com/signals/sglang-radixattention-prefix-cache/
If your workload has heavy shared prefixes (system prompts, few-shot exemplars, agent scaffolds), automatic prefix caching is a nearly free latency win.
This is where serving for agents diverges from serving for chat.
Wed, 17 Jun 2026 00:00:00 GMT · Signals · Tags: serving, kv-cache, agents

Reproducing the nanoGPT Speedrun: What Actually Moves the Loss Curve
https://frontiercheckpoint.com/reproductions/reproducing-nanogpt-speedrun/
The nanoGPT speedrun is a rare, fully open optimization target: hit 3.28 FineWeb validation loss on a GPT-2 (124M)-class model in minimum wall-clock time on 8×H100. We reproduce the pipeline, isolate what the Muon optimizer and the architecture changes actually buy, and flag what will not transfer off the bench.
Mon, 15 Jun 2026 00:00:00 GMT · Reproductions · Tags: reproducibility, pretraining, optimization, distributed-training

DeepSeek-R1: RL-trained reasoning with open weights
https://frontiercheckpoint.com/signals/deepseek-r1-rl-reasoning-open-weights/
The reproducible part is the method, not a leaderboard cell: group-relative RL on verifiable rewards, with open weights to probe. It is the cleanest public artifact for understanding the reasoning-model training loop.
Sat, 13 Jun 2026 00:00:00 GMT · Signals · Tags: reasoning, grpo, rlhf

The modded-nanogpt speedrun and the Muon optimizer
https://frontiercheckpoint.com/signals/modded-nanogpt-muon-speedrun/
A rare, fully public optimization target with a reproducible harness, exactly the kind of artifact we like.
The Muon optimizer it popularized is the most interesting practical idea to come out of it.
Thu, 11 Jun 2026 00:00:00 GMT · Signals · Tags: pretraining, optimization, reproducibility

TRL in Anger: SFT, DPO, and GRPO Without Rewriting Your Training Loop
https://frontiercheckpoint.com/libraries/trl-sft-dpo-grpo-library/
TRL turns SFT, DPO, and GRPO into Trainer subclasses that inherit the entire Hugging Face stack (accelerate, peft, DeepSpeed). The convenience is real; the cost is that you're debugging someone else's training loop the moment your problem stops looking like the quickstart.
Thu, 11 Jun 2026 00:00:00 GMT · Libraries · Tags: dpo, grpo, peft, fine-tuning, rlhf

Post-Training Quantization in Practice: GPTQ, AWQ, and FP8
https://frontiercheckpoint.com/explainers/post-training-quantization-gptq-awq-fp8/
Post-training quantization is the cheapest inference lever and the easiest to pull wrong.
The right method is set by your serving regime (bandwidth-bound decode wants weight-only INT4; compute-bound prefill wants FP8), and the win is real only if a fast kernel accelerates your exact config.
Mon, 08 Jun 2026 00:00:00 GMT · Explainers · Tags: quantization, serving, kv-cache

GRPO, Demystified: Group-Relative Policy Optimization for Reasoning Models
https://frontiercheckpoint.com/explainers/grpo-group-relative-policy-optimization/
GRPO swaps PPO's learned critic for a Monte-Carlo baseline (the mean reward over a group of sampled completions). It spends extra rollout compute and gives up per-token credit assignment in exchange for a simpler, more stable RL loop on verifiable-reward tasks.
Thu, 04 Jun 2026 00:00:00 GMT · Explainers · Tags: grpo, rlhf, ppo, reasoning, fine-tuning

Routing Is the Hard Part: A Practitioner's Guide to Mixture-of-Experts
https://frontiercheckpoint.com/explainers/moe-routing-practitioners-guide/
MoE decouples parameter count from per-token FLOPs, but every hard problem (instability, dropped tokens, load imbalance, all-to-all traffic, a memory footprint set by total rather than active params) lives in the router. A structural tour from Switch/GShard to fine-grained and aux-loss-free designs, and the systems bill you actually pay.
Mon, 01 Jun 2026 00:00:00 GMT · Explainers · Tags: mixture-of-experts, transformers, tensor-parallelism

Recreating FlashAttention: A Tiled, IO-Aware Attention Kernel from Scratch
https://frontiercheckpoint.com/recreations/recreating-flashattention-tiled-kernel/
FlashAttention is exact attention restructured for the memory hierarchy, not an approximation.
We implement the tiled forward pass and the recompute-based backward pass in Triton, validate exactness against a reference, and separate what a tutorial actually reproduces from what needs CUTLASS-grade engineering.
Thu, 28 May 2026 00:00:00 GMT · Recreations · Tags: flash-attention, kernels, attention, gpu-memory

RoPE and the Long-Context Stack: Rotation, Interpolation, and What Breaks at 128k
https://frontiercheckpoint.com/explainers/rope-long-context-stack/
RoPE turns position into a rotation of each dimension pair at its own frequency. That same rotation is why PI, NTK-aware scaling, and YaRN exist, and why a 128k window rarely means 128k of usable context. The math, the methods, and the serving bill.
Sun, 24 May 2026 00:00:00 GMT · Explainers · Tags: rope, long-context, attention, transformers

vLLM, Explained: PagedAttention, Continuous Batching, and the Serving Stack
https://frontiercheckpoint.com/libraries/vllm-paged-attention-serving/
vLLM treats the KV cache like OS virtual memory (non-contiguous, paged blocks) and schedules work at the token level, not the request level. You get high aggregate throughput; the cost is that per-request latency becomes something you tune rather than something you get for free.
Wed, 20 May 2026 00:00:00 GMT · Libraries · Tags: paged-attention, kv-cache, serving, llm

Sharding the Model: FSDP, ZeRO, and Tensor/Pipeline Parallelism
https://frontiercheckpoint.com/explainers/distributed-training-fsdp-zero-parallelism/
Past one GPU, you stop training a model and start operating a distributed system.
Here is what each parallelism axis actually shards, what it costs on the wire, and how practitioners stack them into 3D/4D layouts.
Fri, 15 May 2026 00:00:00 GMT · Explainers · Tags: distributed-training, fsdp, tensor-parallelism, gpu-memory

How We Separate Signal From Noise: Frontier Checkpoint's Verification Rubric
https://frontiercheckpoint.com/essays/the-checkpoint-signal-vs-noise/
The standard behind everything we publish: the filters that decide what earns your attention, the reproduced-to-unverified ladder we grade claims on, how we handle benchmarks and weightless releases, and why every correction is dated and logged rather than silently edited.
Tue, 12 May 2026 00:00:00 GMT · Essays · Tags: methodology, reproducibility, evaluation, industry