How We Separate Signal From Noise: Frontier Checkpoint's Verification Rubric

AI media has a supply glut that masquerades as a shortage. Every frontier release ships with a launch thread, a benchmark table, and a dozen recaps before lunch; what’s actually scarce is the one fact a working engineer needs before spending a sprint on a technique — did it reproduce, and under exactly what conditions did it stop. Closing that gap is the whole reason this publication exists, and it is deliberately the slow, unglamorous job.

Summarization is free now — a model rewrites any paper in seconds — and access (the early-tester list, the embargoed briefing) buys a few hours of lead time and nothing durable. The defensible layer is the one nobody can automate cheaply: actually running the thing, on disclosed hardware, with a pinned config, and reporting where the claim held and where it cracked. Our house position, stated plainly: a result you cannot reproduce is a rumor with a PDF. The PDF can be beautifully typeset, peer-reviewed, and from a lab you respect. It is still a rumor until someone independent re-runs it.

What this buys you is specific: when we state a number, you can act on it without re-deriving it yourself; when we can’t state it, we tell you exactly what evidence is missing and what would change our mind.

The four filters

Before a release, paper, or benchmark earns space here, it runs a gauntlet of four questions. Failing one doesn’t kill coverage — it sets the ceiling on how strongly we get to speak, and it shapes the hedging.

Reproducibility

Is there enough to re-run it? Code, configs, seeds, data provenance, the actual hyperparameters — not “we used Adam.” A method described in prose with no artifact is an idea, and we cover it as an idea. The gold standard is a repo that builds and a command that recovers the headline figure. Most papers don’t clear it; that’s fine, but it caps the verdict.

Weight and code availability

Open weights, open code, API-only, or a screenshot in a keynote — each forecloses a different kind of check. With weights we can probe, quantize, and re-eval. Without them we are auditing a description, not a system.

Eval quality

Is the evaluation disclosed well enough to mean anything? Prompt format, few-shot count, decoding parameters, scoring code, contamination controls. “SOTA on MMLU” with none of that attached is not a result; it’s a vibe occupying a leaderboard cell.

Serving readiness

Does it run in a stack a practitioner actually uses? Tokenizer quirks, nominal versus effective context length, quantization support, whether vLLM or TGI or TensorRT-LLM picked it up on day one. A model nobody can serve at sane cost is a research curiosity — interesting, but covered as one, not as something you’ll ship.
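One way to picture the four questions is as a per-release checklist whose failed entries become the "what evidence is missing" note. The sketch below is illustrative only: the class name `FilterCheck` and its field names are hypothetical, not a description of any internal tooling.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FilterCheck:
    """Hypothetical record: one flag per filter question."""
    reproducible: bool       # code, configs, seeds, data provenance, hyperparameters
    weights_available: bool  # open weights and code, as opposed to API-only
    eval_disclosed: bool     # prompt format, few-shot count, decoding, scoring code
    servable: bool           # runs in a stack practitioners actually use

    def failed(self) -> list[str]:
        """Names of the filters this release did not clear, in rubric order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# An API-only launch with a disclosed eval setup (illustrative values).
api_only_launch = FilterCheck(
    reproducible=False,
    weights_available=False,
    eval_disclosed=True,
    servable=True,
)
print(api_only_launch.failed())  # ['reproducible', 'weights_available']
```

A failed filter here does not drop the item from coverage. It only records which checks were impossible, which is what limits how strongly a verdict can be worded.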

The verdict ladder

Every reproduction we publish gets exactly one of five labels, and each label has an evidence bar. You should never have to guess how much weight a sentence here is carrying.

| Verdict | What it means | Evidence required |
| --- | --- | --- |
| Reproduced | We, or a credible third party using open artifacts, re-ran it and got the reported direction within a stated tolerance on disclosed hardware. | A runnable config, a pinned harness, and results matching the claim’s trend. Exact parity is not required. |
| Partial | Some sub-claims held; others didn’t, or the trend matched but the magnitude didn’t. | A scoped run showing which parts survived and which didn’t, with the boundary stated. |
| Failed | A good-faith effort with a disclosed config could not recover the result. | The full config and command, published so authors or readers can show us our error. |
| Contested | Credible parties disagree, conflicting reproductions exist, or a live methodological critique is unresolved. | Links to the conflicting evidence and a plain statement of what is in dispute. |
| Unverified | We have not independently checked it. The default for anything fresh or API-only. | None, but we name what evidence would change the label. |

A few things the ladder encodes. “Unverified” is the default, not an accusation; most of the world is unverified at any given moment, including things that are almost certainly true. “Failed” ships with the full config precisely so you can tell us we held it wrong — a failed reproduction is a claim about our run, not a verdict on the authors’ competence. And “partial” is where most honest reproductions actually land: our from-scratch FlashAttention recreation recovers the kernel’s memory-scaling behavior without matching the reference’s absolute speedups, because that last mile is months of kernel engineering we didn’t do. Our nanoGPT-speedrun reproduction sits at the lower rungs too while the runs are still going — we report the public records and label our own numbers in-progress.
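The distinction between the top three rungs can be sketched as a small decision rule. This is a minimal illustration under stated assumptions, not the publication's actual procedure: `label_run` is a hypothetical helper, a "gain" is taken to be a signed delta over a shared baseline, and `rel_tol` is the stated tolerance expressed as a fraction of the claimed gain. "Contested" and "Unverified" depend on outside evidence, so the function never returns them.

```python
from enum import Enum


class Verdict(Enum):
    REPRODUCED = "Reproduced"
    PARTIAL = "Partial"
    FAILED = "Failed"
    CONTESTED = "Contested"
    UNVERIFIED = "Unverified"


def label_run(claimed_gain: float, observed_gain: float, rel_tol: float) -> Verdict:
    """Label one good-faith run against one claimed improvement.

    Assumption: both gains are signed deltas over the same baseline.
    """
    if claimed_gain == 0:
        raise ValueError("claimed_gain must be non-zero")
    # Wrong direction (or no effect at all): the result was not recovered.
    if claimed_gain * observed_gain <= 0:
        return Verdict.FAILED
    # Right direction and magnitude inside the stated tolerance.
    if abs(observed_gain - claimed_gain) <= rel_tol * abs(claimed_gain):
        return Verdict.REPRODUCED
    # Right direction, but the magnitude did not match.
    return Verdict.PARTIAL


# Illustrative numbers only, not measurements.
print(label_run(4.0, 3.7, 0.10).value)   # Reproduced
print(label_run(4.0, 1.5, 0.10).value)   # Partial
print(label_run(4.0, -0.2, 0.10).value)  # Failed
```

Note that exact parity is never tested: the rule only asks for the right direction and a magnitude inside the tolerance that was stated up front.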

Benchmarks are guilty until reproduced

Benchmark numbers are the most over-trusted artifact in the field, so they get the most adversarial treatment. Two “same-benchmark” scores are routinely incomparable: change the prompt template, the few-shot count, the answer-extraction regex, or the decoding temperature, and a model’s MMLU number moves by more than the gap most launches are bragging about. We don’t reprint a benchmark table as fact. We reprint it as a claim, name the eval setup if it’s disclosed, and flag it unverified until we or a credible third party re-run it with a pinned harness.
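To make "incomparable" concrete, two scores can be treated as comparable only when every disclosed eval knob matches. The following is a rough sketch: `EvalSetup` and `setup_diff` are hypothetical names, the field list mirrors the knobs named above plus a harness version, and the example values are placeholders.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalSetup:
    """Hypothetical record of the knobs that move a benchmark score."""
    prompt_template: str
    num_fewshot: int
    answer_regex: str
    temperature: float
    harness_version: str


def setup_diff(a: EvalSetup, b: EvalSetup) -> dict[str, tuple]:
    """Return {field: (a_value, b_value)} for every field that differs."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


# Placeholder values for illustration only.
ours = EvalSetup(
    prompt_template="Question: {q}\nAnswer:",
    num_fewshot=5,
    answer_regex=r"(-?\d+)",
    temperature=0.0,
    harness_version="<exact version>",
)
theirs = replace(ours, num_fewshot=0, temperature=0.7)

print(setup_diff(ours, theirs))
# {'num_fewshot': (5, 0), 'temperature': (0.0, 0.7)}
```

An empty diff is the precondition for putting two numbers in the same table. A non-empty one is the list of caveats that has to travel with the comparison.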

Contamination is the failure mode we assume rather than hope against. Test sets leak into pretraining corpora, and a model that has seen the answers will quote them back. Scale’s GSM1k study — a held-out clone of GSM8K — is the canonical demonstration: several model families scored meaningfully lower on the fresh problems than on the original set, which is what overfitting to a public benchmark looks like from the outside.
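The held-out comparison described above reduces to simple arithmetic: score the same model on the public set and on fresh problems, then ask whether the drop is larger than sampling noise. The sketch below is a generic two-proportion calculation, not the method of any particular study. `overfit_gap` is a hypothetical helper and the counts are invented for illustration.

```python
import math


def overfit_gap(public_correct: int, public_total: int,
                fresh_correct: int, fresh_total: int) -> tuple[float, float]:
    """Return (accuracy gap, standard error of the gap).

    gap > 0 means the model scored higher on the public set than on the
    fresh one. The standard error treats the two sets as independent
    binomial samples, which is an approximation.
    """
    if public_total <= 0 or fresh_total <= 0:
        raise ValueError("totals must be positive")
    p_pub = public_correct / public_total
    p_fresh = fresh_correct / fresh_total
    gap = p_pub - p_fresh
    se = math.sqrt(
        p_pub * (1 - p_pub) / public_total
        + p_fresh * (1 - p_fresh) / fresh_total
    )
    return gap, se


# Invented counts, for illustration only.
gap, se = overfit_gap(800, 1000, 700, 1000)
print(f"gap={gap:.3f} se={se:.3f} ratio={gap / se:.1f}")
```

A gap several standard errors wide is a reason to suspect overfitting to the public benchmark. It is not proof of leakage on its own, since the fresh set may simply be harder.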

When we reproduce, we publish the literal command, because the command is part of the result:

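# A reproduction log here is pinned, seeded, and committed: the command IS the result.
# "~=0.4" only bounds the release series; record the exact installed version, since scores drift across versions.
pip install "lm-eval[vllm]~=0.4"    # the vllm extra is needed for --model vllm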

lm_eval \
  --model vllm \
  --model_args "pretrained=org/model,dtype=bfloat16,seed=1234" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path "runs/gsm8k_$(git rev-parse --short HEAD).json"

If a number can’t survive being re-run with a pinned harness and a disclosed seed, it was never a number. It was marketing with a decimal point.
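One way to make "the command is part of the result" checkable is to store the run's settings as a manifest next to the scores and fingerprint it, so that two logs can be compared at a glance. This is a sketch under assumptions: `manifest_digest` is a hypothetical helper, and the manifest keys simply mirror the flags in the command above, with angle-bracket placeholders where a real log would record actual values.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Values mirror the command above; bracketed entries are placeholders.
manifest = {
    "harness": "lm-eval",
    "harness_version": "<exact installed version>",
    "model": "org/model",
    "dtype": "bfloat16",
    "seed": 1234,
    "tasks": ["gsm8k"],
    "num_fewshot": 5,
    "git_commit": "<short HEAD hash>",
}

print(manifest_digest(manifest))
```

Two runs with the same digest were configured identically. Any change to a seed, a task list, or a harness version yields a different digest, which makes silent drift visible.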

What we can say without weights

Most frontier releases are API-only at launch, and pretending otherwise would shrink coverage to nothing. So we have a clear contract for what is sayable without weights or a real system card. We can report the claim and attribute it. We can audit the system card for what it discloses and, more tellingly, what it omits — training-data description, eval methodology, known limitations, safety evaluations. We can note where the eval setup is too thin to compare against anyone else. We can contextualize the claim against open results we have actually verified.

What we cannot do is confirm it. We can’t probe the model, can’t check effective context against the nominal number, can’t test for contamination, can’t quantize it and measure the degradation. So API-only releases carry the “unverified” tag and, more importantly, we state explicitly what evidence would move it: “Open the weights, or publish the eval harness, and we’ll re-run it.” That sentence is a standing offer, not a rhetorical flourish.

Corrections are the product

We will be wrong. Reproductions are run by humans on finite budgets, and the field moves faster than any review process. The question is not whether a publication errs but what it does next, and our answer is structural: every correction is dated, logged in a visible changelog, and appended — never a silent edit. The original claim stays legible with a strike and a note, so you can see what we believed, when, and why we changed our minds.
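The "appended, never a silent edit" rule maps naturally onto an append-only log of immutable entries. The sketch below is one possible shape, not a description of the real changelog: `Correction` and `Changelog` are hypothetical names, the `[struck]` marker stands in for whatever strike-through the page actually renders, and the sample entry is a placeholder.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Correction:
    """Immutable entry: what was claimed, what replaced it, when, and why."""
    logged_on: date
    original: str
    corrected: str
    reason: str


class Changelog:
    """Append-only: entries can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._entries: list[Correction] = []

    def append(self, entry: Correction) -> None:
        self._entries.append(entry)

    @property
    def entries(self) -> tuple[Correction, ...]:
        # A tuple copy keeps callers from mutating the underlying list.
        return tuple(self._entries)

    def render(self) -> str:
        return "\n".join(
            f"{e.logged_on.isoformat()}: [struck] {e.original} -> "
            f"{e.corrected} ({e.reason})"
            for e in self._entries
        )


log = Changelog()
log.append(Correction(
    logged_on=date(2000, 1, 1),  # placeholder date
    original="Verdict: Reproduced",
    corrected="Verdict: Partial",
    reason="magnitude fell outside the stated tolerance on re-run",
))
print(log.render())
```

Because entries are frozen and the log exposes no update or delete method, the original claim stays legible alongside its replacement.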

This is also why we date evergreen methodology and specific claims separately. The rubric you are reading is meant to last; a verdict about a particular model is stamped “as of” a date, because it has a shelf life and we would rather you knew it.

What we skip on purpose

A publication is defined as much by its omissions as its coverage, and ours are deliberate. We don’t cover funding rounds — a Series C tells you about a cap table, not a capability. We skip org-chart drama and founder feuds; who left for which lab is gossip, and gossip behind a paywall is still gossip. We don’t run transformer 101, because you have implemented attention and don’t need the millionth walkthrough of softmax — when we explain something, it’s the non-obvious part, the thing that bit us at 2 a.m. And we don’t rewrite press releases. If the only source is a marketing blog and an embargoed benchmark, the honest move is to wait for weights or an eval harness, tag it unverified, and say so.

The opportunity cost of chasing all of that is the time it takes to re-run one claim properly. We would rather spend it on the claim.

The bottom line

The bet underneath all of this is simple: as base models commoditize and summarization goes to zero, the scarce and defensible thing in AI media is the work of verification — the re-runs, the pinned configs, the honest “failed” with the command attached. We would rather publish one reproduction you can build on than ten recaps you have already seen. If you have reproduced something we got wrong, send the config; a correction with evidence is the highest-status contribution you can make here, and we will date it, log it, and thank you in the changelog.
