<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>Essays — Frontier Checkpoint</title><description>Forward-looking, argument-driven essays. Each leads with a clear thesis and a bet you could actually check — generous thinking-out-loud about where the frontier moves next, written to sharpen how you see your own road ahead, not to chase a trend.</description><link>https://frontiercheckpoint.com/</link><item><title>What a 4B Model Can Actually Do: Field Notes from 155 Experiments</title><link>https://frontiercheckpoint.com/essays/what-a-4b-model-can-do/</link><guid isPermaLink="true">https://frontiercheckpoint.com/essays/what-a-4b-model-can-do/</guid><description>Across 155 small-model experiments centered on Qwen 3.5 4B, the same thing kept working: give the model something executable it can check against the evidence it has, and it punches far above its benchmark weight. Here is the field guide — the levers that worked, how I know they&apos;re real, and the frontier they opened up.</description><pubDate>Sun, 28 Jun 2026 00:00:00 GMT</pubDate><category>Essays</category><category>reproducibility</category><category>evaluation</category><category>fine-tuning</category><category>methodology</category><category>llm</category><category>agents</category><author>editors@frontiercheckpoint.com</author></item><item><title>The Harness Is the Product: Why Agent Evals Are the Real Moat</title><link>https://frontiercheckpoint.com/essays/agent-harness-evals-moat/</link><guid isPermaLink="true">https://frontiercheckpoint.com/essays/agent-harness-evals-moat/</guid><description>Swapping the frontier model rarely moves your agent&apos;s success rate as much as fixing retries and context management — and the one thing competitors can&apos;t clone is your evaluation environment. 
A thesis on why agent evals, not weights, are where reproducible capability accrues.</description><pubDate>Sat, 27 Jun 2026 00:00:00 GMT</pubDate><category>Essays</category><category>agents</category><category>agent-harness</category><category>tool-use</category><category>evaluation</category><author>editors@frontiercheckpoint.com</author></item><item><title>The Economics of Thinking: Test-Time Compute as a Scaling Axis</title><link>https://frontiercheckpoint.com/essays/economics-of-test-time-compute/</link><guid isPermaLink="true">https://frontiercheckpoint.com/essays/economics-of-test-time-compute/</guid><description>Reasoning models turned inference into a per-request dial. This is an economic read on when spending FLOPs at test time actually buys accuracy, why it only pays where answers are cheap to verify, and what variable-cost inference does to latency budgets and capacity planning.</description><pubDate>Tue, 23 Jun 2026 00:00:00 GMT</pubDate><category>Essays</category><category>test-time-compute</category><category>reasoning</category><category>serving</category><category>industry</category><author>editors@frontiercheckpoint.com</author></item><item><title>How to Read an ML Result Without Fooling Yourself</title><link>https://frontiercheckpoint.com/essays/the-checkpoint-signal-vs-noise/</link><guid isPermaLink="true">https://frontiercheckpoint.com/essays/the-checkpoint-signal-vs-noise/</guid><description>A practical, friendly guide to reading ML results: why the experimental setup behind a benchmark number matters, how data contamination fools everyone, how to pin your eval harness so scores hold still, and why we build and run things ourselves — so you can learn from them and trust them.</description><pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate><category>Essays</category><category>methodology</category><category>reproducibility</category><category>evaluation</category><category>industry</category><author>editors@frontiercheckpoint.com</author></item></channel></rss>