The Harness Is the Product: Why Agent Evals Are the Real Moat

Spend a week wiring up an agent and the uncomfortable lesson arrives fast: swapping the frontier model underneath rarely moves your task success rate as much as fixing the retry logic, tightening the tool schemas, or changing how you truncate context. The model is the engine; the harness is the car. As base capabilities converge — every lab now ships something that can plan, call tools, and write a patch — differentiation migrates to the scaffolding, and specifically to the one part of the scaffolding nobody can copy off your GitHub: the evaluation environments that tell you whether a change actually helped. That asset is slow to build, doesn’t transfer, and compounds. Which is exactly what a moat is.

What a harness actually is

A harness is the deterministic software that turns a next-token predictor into something that acts in a loop. Strip the marketing and it is four subsystems, each with its own failure surface.

Tool and function interfaces. The schemas, argument validation, and serialization that expose actions to the model. This is where function calling lives, where the Model Context Protocol (Anthropic, 2024) is trying to standardize the transport, and where computer-use gives the model a screen-and-keyboard surface instead of a typed API.
Planning and control loop. The logic that decides when to call a tool, when to reflect, and when to stop. ReAct (Yao et al., 2022) interleaves reasoning and acting; Reflexion (Shinn et al., 2023) bolts on self-critique after failures. Algorithmically these are thin — a while-loop with a prompt.
Memory and context management. What survives across steps: scratchpads, retrieval, summarization and compaction, and subagents that quarantine context so a 30-step task doesn’t drown in its own tool output. Voyager (Wang et al., 2023) made the skill library itself the memory.
Error recovery. Retries, timeouts, output validation, rollback, verification before a destructive commit. The unglamorous majority of production agent code.

def run_agent(task, tools, model, max_steps=50):
    messages = build_prompt(task, tools)
    for step in range(max_steps):
        action = model.decide(messages, tools)       # plan: tool call or final answer
        if action.is_final:
            return action.answer
        try:
            observation = tools[action.name](**action.args)
        except (KeyError, TypeError, ToolError) as err:  # unknown tool, bad args, tool failure
            observation = render_error(err)           # error recovery lives here
        messages += [action, observation]
        messages = compact(messages)                  # memory / context management
    return give_up()                                  # long-horizon failure is a state too
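The first subsystem is easy to underestimate, so here is a minimal sketch of argument validation before dispatch. It assumes a hypothetical spec format (argument name mapped to a Python type) rather than any real protocol's schema; `validate_args` and `search_spec` are illustrative names, not part of an existing library. The point is that a malformed call becomes a readable observation the model can act on instead of a crash.

```python
# Hypothetical spec format: argument name -> expected Python type.
# Production systems typically use JSON Schema; this is only an illustration.
def validate_args(spec, args):
    """Return a list of human-readable problems; an empty list means the call is valid."""
    problems = []
    for name, expected in spec.items():
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected):
            got = type(args[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {got}")
    for name in args:
        if name not in spec:
            problems.append(f"unexpected argument: {name}")
    return problems


search_spec = {"query": str, "limit": int}  # illustrative tool, not a real API
```

Feeding the returned strings back as the observation gives the model a chance to repair its own call on the next step.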

The model is invariant across all four subsystems; the harness is where the engineering goes. And here is the irony that sets up the whole argument: the loop is not the hard part. The hard part is knowing which loop, with which tools, fails in which way on your tasks. That is an empirical question, which makes it an eval question.

Why eval is the bottleneck

Agentic tasks are long-horizon and stateful, which quietly breaks the assumptions single-number leaderboards are built on.

Start with credit assignment. A 30-step trajectory that fails gives you one bit — pass or fail — for an entire chain of decisions. You learn that something went wrong, not where, and the surface area for “something” grows linearly with horizon.
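One way to recover more than one bit per trajectory is to score ordered intermediate checks. The sketch below assumes you can snapshot environment state after each step and can write predicates over it; the function name and the milestone format are hypothetical, and the approach only localizes failure as finely as the milestones you author.

```python
def milestone_progress(states, milestones):
    """states: environment snapshot after each step.
    milestones: ordered list of (name, predicate) pairs.
    Returns (number of milestones reached in order, name of the first one not reached)."""
    reached = 0
    for state in states:
        # Advance through every milestone the current snapshot satisfies, in order.
        while reached < len(milestones) and milestones[reached][1](state):
            reached += 1
    stuck_on = milestones[reached][0] if reached < len(milestones) else None
    return reached, stuck_on


# Illustrative milestones for a hypothetical coding task.
milestones = [
    ("clone repo", lambda s: bool(s.get("cloned"))),
    ("run tests", lambda s: bool(s.get("tests_run"))),
    ("open PR", lambda s: bool(s.get("pr_open"))),
]
states = [{"cloned": True}, {"cloned": True, "tests_run": True}]
progress = milestone_progress(states, milestones)
```

Here `progress` is `(2, "open PR")`: the trajectory failed, but it failed after running the tests and before opening the pull request.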

Then variance. Sampling temperature, network and tool nondeterminism, and mutable environment state mean the same agent on the same task can pass or fail run to run. A single rollout is a sample, not a measurement, yet most demos and a depressing number of blog posts report exactly one.
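To see how little one rollout tells you, it helps to put an interval around the observed rate. The sketch below uses the Wilson score interval for a binomial proportion, which is one reasonable choice among several, and it assumes rollouts are independent draws with a fixed success probability (an idealization).

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion; z=1.96 is roughly 95%."""
    if n <= 0:
        raise ValueError("need at least one rollout")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


one_run = wilson_interval(1, 1)      # a single passing demo
fifty_runs = wilson_interval(35, 50)  # 35 passes out of 50 seeds
```

A single passing run is consistent with a true success rate anywhere from about 0.21 to 1.0. Fifty runs with 35 passes narrows that to roughly 0.56 to 0.81, which is still wide enough to hide many "improvements".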

Then contamination. Web and GitHub data is in pretraining. When a benchmark is built from public issues that predate the model cutoff, the “right” fix may already be in the weights — you are measuring recall, not capability, and you usually can’t tell which.
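A partial mitigation is to split tasks by creation date relative to the model's stated cutoff and report the two groups separately. This is a sketch under assumptions: the `(task_id, created_on)` format is hypothetical, published cutoff dates are often approximate, and a post-cutoff date does not prove the fix is absent from training data.

```python
from datetime import date


def split_by_cutoff(tasks, model_cutoff):
    """tasks: iterable of (task_id, created_on) pairs.
    Returns (possibly_seen, post_cutoff) lists of task ids."""
    possibly_seen, post_cutoff = [], []
    for task_id, created_on in tasks:
        bucket = post_cutoff if created_on > model_cutoff else possibly_seen
        bucket.append(task_id)
    return possibly_seen, post_cutoff


# Illustrative data; the ids and dates are made up.
tasks = [
    ("issue-101", date(2023, 3, 1)),
    ("issue-202", date(2024, 9, 15)),
    ("issue-303", date(2024, 1, 1)),
]
seen, fresh = split_by_cutoff(tasks, model_cutoff=date(2024, 1, 1))
```

A large gap between the success rates on `seen` and `fresh` is a hint, not a proof, that recall is inflating the headline number.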

And finally side effects. Real tasks mutate the world. You cannot cleanly re-run an agent that sent the email, charged the card, or merged the branch, which makes faithful, resettable environments a prerequisite for measuring anything at all.
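For the simplest case, state that lives on a filesystem, resettability can be sketched with a throwaway copy of a template directory. `fresh_workspace` is a hypothetical helper built on the standard library; it does nothing for side effects that leave the machine (email, payments, remote branches), which need fakes or sandboxed services instead.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def fresh_workspace(template_dir):
    """Yield a private copy of template_dir; the copy is deleted on exit,
    so every rollout starts from the same initial state."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "workspace"
        shutil.copytree(template_dir, workdir)
        yield workdir
```

Each rollout runs inside its own `with fresh_workspace(...)` block, so whatever the agent writes or deletes never reaches the template or the next rollout.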

The consequence: building the agent is not the bottleneck. Knowing whether a change helped is. Without a high-fidelity eval, every “improvement” is an anecdote with a green checkmark.

The benchmark landscape, and where it stops

Public suites are proxies, and useful ones. But it helps to see them by category rather than by leaderboard position — the positions are stale by the time you read them and rarely comparable across setups anyway.

Category	Representative suites	What it checks	Where it diverges from production
Software engineering	SWE-bench, SWE-bench Verified	A patch resolves a real GitHub issue, scored by hidden unit tests	Curated repos, single-shot, training-data contamination
Web & computer use	WebArena, VisualWebArena, OSWorld, WebVoyager, GAIA	Navigate self-hosted or live sites, or a real OS, to reach a goal state or answer	Frozen snapshots or drifting live pages, brittle selectors, no real-world consequences
Tool use & multi-turn	tau-bench, Berkeley Function-Calling Leaderboard, AgentBench	Call APIs and satisfy a simulated user under a stated policy	Simulated users are more patient and more legible than real ones

Each measures a genuine slice — SWE-bench (Jimenez et al., 2023) with execution-based grading is far better than LLM-as-judge handwaving — but each does so under frozen conditions, usually single-shot, and the headline number is the easy part to report and the hard part to trust. Two labs quoting “the same” SWE-bench result almost certainly used different retrieval, prompt scaffolds, and step budgets; the benchmark names the task, not the protocol, and the protocol is half the score.
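One lightweight habit that follows is to record the protocol next to every score. The sketch below is an assumption about what such a record might contain, not a standard: the field names are hypothetical and any real setup will need more of them.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalProtocol:
    benchmark: str
    model: str
    scaffold: str           # e.g. the name and version of your agent loop
    retrieval: str          # how context was gathered for each task
    max_steps: int
    temperature: float
    rollouts_per_task: int

    def fingerprint(self):
        """Short stable hash of the full protocol, to store alongside each result."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]


# Illustrative values only.
base = EvalProtocol("SWE-bench Verified", "model-a", "loop-v3", "bm25", 50, 0.0, 5)
bigger_budget = replace(base, max_steps=200)
```

Two results are comparable only when their fingerprints match; `base` and `bigger_budget` name the same benchmark but are different experiments.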

So treat public benchmarks as smoke tests for a class of capability, not as a verdict on whether an agent will survive your production distribution. They tell you the model can write a patch. They do not tell you it will reconcile your ledger.

Environments as data

Here is why the environment, not the glue code, is the asset. A good environment is simultaneously three things: an eval, an RL substrate, and a data generator. The verifiable reward — the test passed, the user’s goal state was reached — is the same signal whether you are scoring a candidate model or training one. That is the entire premise behind agentic RL with verifiable rewards.

The flywheel is concrete:

1. Collect a real task distribution from your domain.
2. Build a faithful, resettable sandbox with programmatic success checks.
3. Use those checks to evaluate candidate models and harness changes.
4. Use the same checks as reward for RL or rejection-sampling fine-tuning.
5. Harvest the resulting failures to author harder tasks, and repeat.

Each turn deepens the environment, and the environment is the thing competitors must rebuild from scratch. Fidelity to a specific domain requires the domain: proprietary workflows, real user traces, and integration with the messy internal systems that never appear in a paper. It does not transfer — a world-class coding environment tells you nothing about your CRM-automation agent — and it is maintenance-heavy, so it is a moat rivals must continuously re-pay to cross, not a wall they vault once.
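The reuse of one check across evaluation, training-data selection, and task authoring can be sketched in a few lines. Everything here is illustrative: `Task`, the rollout format (a dict of final states per task), and the 0.5 threshold are assumptions, and a real pipeline would store full trajectories rather than final states.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    task_id: str
    check: Callable[[dict], bool]  # programmatic success check on the final state


def evaluate(tasks, rollouts):
    """rollouts: dict of task_id -> list of final states. Returns per-task success rate."""
    scores = {}
    for task in tasks:
        states = rollouts[task.task_id]
        scores[task.task_id] = sum(task.check(s) for s in states) / len(states)
    return scores


def keep_for_training(tasks, rollouts):
    """Rejection sampling: keep only rollouts that pass the same check used for scoring."""
    return [(t.task_id, s) for t in tasks for s in rollouts[t.task_id] if t.check(s)]


def harder_task_candidates(scores, threshold=0.5):
    """Tasks the agent mostly fails are where the next round of tasks should come from."""
    return sorted(tid for tid, rate in scores.items() if rate < threshold)
```

The design choice worth noticing is that `check` is written once per task and called from all three places, so improving a check improves the eval, the training filter, and the task backlog together.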

Reliability is a distribution, not a number

Production cares about the tail, and “it solved the task” is the wrong statistic. The right ones are distributional.

pass@k — the probability at least one of k rollouts succeeds. This is the optimistic ceiling, useful for search and best-of-N, and dangerously flattering as a headline.
pass^k — the probability that all k rollouts succeed. tau-bench (Yao et al., 2024) popularized reporting this, and it falls off a cliff as k grows. That cliff is the honest picture of an agent you would actually put in front of users.
Cost per success, not per attempt. An agent that retries until it stumbles into a passing trajectory can post a beautiful pass@k and ruinous economics.

from math import comb

# n independent rollouts per task, s of which succeeded.
def pass_at_k(n, s, k):     # P(at least one of k succeeds) -- the ceiling
    if n - s < k:
        return 1.0
    return 1.0 - comb(n - s, k) / comb(n, k)

def pass_pow_k(n, s, k):    # P(all k succeed) -- the reliability story (tau-bench)
    if s < k:
        return 0.0
    return comb(s, k) / comb(n, k)
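The size of the cliff is easiest to see in the idealized case. Assuming rollouts are independent with a single per-run success probability p (real rollouts on the same task are correlated, so treat this as intuition rather than a forecast), the two quantities reduce to closed forms. The function names below are illustrative.

```python
def all_of_k(p, k):
    """P(all k independent rollouts succeed) = p**k."""
    return p ** k


def any_of_k(p, k):
    """P(at least one of k independent rollouts succeeds) = 1 - (1 - p)**k."""
    return 1 - (1 - p) ** k


reliability = all_of_k(0.9, 8)
ceiling = any_of_k(0.9, 8)
```

An agent that passes 90 percent of individual runs passes eight in a row only about 43 percent of the time, while its chance of at least one pass in eight is above 0.9999. Same agent, same data, two very different headlines.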

The discipline is to report success rate with a confidence interval across many seeds, alongside cost and latency, as a single object.

An agent that is 70 percent reliable at ten cents a task and one that is 70 percent at four dollars a task are not the same product, and a leaderboard that collapses both to one column is hiding the only numbers a buyer cares about.
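A sketch of that single object, under the assumption that each rollout logs success, dollar cost, and wall-clock latency; the `Rollout` fields and the choice of median latency are illustrative, and a real report would add a confidence interval on the success rate.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Rollout:
    success: bool
    cost_usd: float
    latency_s: float


def summarize(rollouts):
    """Collapse a batch of rollouts into the numbers a buyer would ask about."""
    n = len(rollouts)
    wins = sum(r.success for r in rollouts)
    total_cost = sum(r.cost_usd for r in rollouts)
    return {
        "success_rate": wins / n,
        "cost_per_attempt": total_cost / n,
        # Failed attempts still cost money, so divide total spend by successes.
        "cost_per_success": total_cost / wins if wins else float("inf"),
        "median_latency_s": median(r.latency_s for r in rollouts),
    }


# The two agents from the paragraph above: both 70 percent reliable.
cheap = summarize([Rollout(i < 7, 0.10, 30.0) for i in range(10)])
pricey = summarize([Rollout(i < 7, 4.00, 30.0) for i in range(10)])
```

Both summaries report a success rate of 0.7, but cost per success comes out near 14 cents for one and near 5.71 dollars for the other, because every failure is paid for too.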

The counterarguments

The thesis, stated plainly: the durable moat in agents is not the base model but the harness — the scaffolding, tools, and task-specific evals — because that is where reproducible capability accrues. It deserves its strongest objections.

The model eats the harness. Reasoning models internalize planning; native tool use and longer context absorb memory management; labs increasingly RL the scaffolding directly into the weights. If the model learns to plan, reflect, and recover on its own, your elaborate control loop collapses back into a while-loop. This trend is real, and the right concession is blunt: the glue code commoditizes.

Interfaces standardize. MCP and converging tool-call schemas are turning tool plumbing into a protocol. Connectors become commodities you install, not assets you own. Concede this too — tool interfaces are not where the moat is.

Providers own the data. Frontier labs see agentic usage at enormous scale, can build broad evals from it, and can harvest trajectories the application layer never touches. Distribution favors incumbents.

Now the rebuttal, and notice its shape: every concession is about generic scaffolding, none about the task-specific environment. A model can internalize planning, but not your bank’s reconciliation rules or your domain’s definition of a correct outcome. MCP standardizes the wire format, not the question of whether the agent did the right thing on your tasks. Providers own the head of the usage distribution; the long tail of vertical, proprietary workflows is exactly what they do not see. The moat was never the loop or the schema — those were always going to commoditize. The moat is the environment that tells you whether the loop works on the problem you actually have.

The bottom line

If you are building agents and capabilities keep converging, the strategic move follows directly from the thesis. Invest in the eval before the agent: you can rewrite the harness in a week, but the environment is the compounding asset, so build it first and treat every production failure as a new test case. Measure distributions, cost, and latency together, and never ship a demo as evidence. Design your environment as an RL substrate from day one, because the same programmatic check that scores a model can also serve as a training reward with no extra authoring work. And assume the glue commoditizes; do not try to build a moat out of tool plumbing that a protocol will eat next quarter.

The question that decides the next few years is not “which model.” It is “can you tell, reliably and cheaply, whether your agent is getting better at your task.” The teams that can answer that will out-iterate the teams that cannot, regardless of whose weights they rent. The harness is the product. The eval is the moat.
