Durable Agents Need an Adversary
Durable agents need an adversary.
Models produce. Something must judge. When the producer and the judge share one process, the judge calls it done.
Every durable harness component exists to prevent that collapse.
The Scaffolding Frame Misses the Point
The standard story treats harness engineering as temporary scaffolding — patches for a weak model that dissolve as the model gets stronger. LangChain frames it this way: harnesses exist because models can't plan, can't verify, can't stay coherent across long horizons. Make the model better, shrink the harness.
This reads the surface and misses the structure.
The components that carry the most weight are not compensating for model weakness. They encode something a single process cannot hold:
- Evaluators that grade what the generator just produced
- Tests and linters that reject output before it ships
- Sprint contracts that pin down "done" before the work starts
- Git history that compares what a session claimed to what exists on disk
Each is an adversary. Not a helper. An opposing force whose job is refusal. That opposition cannot be absorbed into the model because it cannot exist inside a single process. No process is an honest critic of its own output.
Generators Cannot Grade Themselves
The sharpest observation in this space comes from Anthropic's harness team: when agents evaluate their own work, they praise it confidently, even when a human can see the quality is mediocre. This is not a model bug. It is the shape of the incentive when generation and judgment share a substrate.
Anthropic's response was architectural. They split the work across three agents — planner, generator, evaluator — and tuned the evaluator so that skepticism is its primary trait, rather than trying to bend the generator into self-criticism.
- Tuning a skeptical evaluator is tractable. Making a generator self-critical is not
- Weight evaluator criteria toward model weak spots — design judgment, completeness, edge cases
- Force the evaluator to interact with the running system, not just read the code, or it rubber-stamps what looks done
The GAN analogy is exact. A generator and a discriminator locked in opposition produce work neither would produce alone. Collapse them and the output becomes confident slop.
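The split above can be sketched structurally. This is a minimal sketch, not Anthropic's implementation: `generate` and `evaluate` are hypothetical stand-ins for model calls, and the briefs are illustrative. The point is that the evaluator is a separate call with an opposing brief, not a self-check inside the generator.

```python
# Sketch of a generator/evaluator split. `generate` and `evaluate` are
# hypothetical model-call stubs; only the adversarial structure is the point.

GENERATOR_BRIEF = "Implement the task. Optimize for completeness."
EVALUATOR_BRIEF = (
    "You did not write this code. Be skeptical. "
    "Grade design judgment, completeness, and edge cases. "
    "Reject unless every acceptance criterion is demonstrably met."
)

def run_until_accepted(task, generate, evaluate, max_rounds=5):
    """Loop the generator against an adversarial evaluator until it accepts."""
    feedback = None
    for _ in range(max_rounds):
        artifact = generate(GENERATOR_BRIEF, task, feedback)
        verdict = evaluate(EVALUATOR_BRIEF, task, artifact)
        if verdict["accepted"]:
            return artifact
        feedback = verdict["objections"]  # refusal drives the next round
    raise RuntimeError("evaluator never accepted; escalate to a human")
```

Collapsing the two briefs into one call is exactly the failure mode described above: the same process that produced the artifact declares it acceptable.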
Verification Is Institutional Skepticism
Practitioners call verification loops the pattern that buys the most reliability per hour of engineering invested, and they're right. The framing usually stays technical: run the tests, check the schema, gate the irreversible action.
The technical framing undersells it. A verification loop is codified refusal — your engineering org's skepticism, rendered executable:
- HumanLayer uses hooks that stay silent on success and surface only errors on failure, with exit code 2 forcing the agent to fix what it broke before finishing
- Anthropic and OpenAI's consolidated guidance treats Puppeteer and Playwright as back-pressure against agents that mark features done without opening the app
- Type checkers, build steps, coverage thresholds — none can be flattered into passing
This is not plumbing. It is posture. The harness takes the position that work is not done until something independent says so.
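The silent-on-success convention can be sketched in a few lines. This is an illustrative gate in the style described above, not HumanLayer's actual hook; the check command is an assumption.

```python
# Sketch of a "silent on success" verification gate following the
# exit-code-2 convention described above. The command run is illustrative.
import subprocess
import sys

def verification_gate(check_cmd):
    """Run a verifier; say nothing on success, surface only errors on failure.

    Returns an exit code: 0 lets the agent finish, 2 forces it to fix the
    failure before it can claim the work is done.
    """
    result = subprocess.run(check_cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return 0  # silence: success produces nothing for the agent to read
    # Only the failure reaches the agent — back-pressure it cannot flatter away.
    sys.stderr.write(result.stdout + result.stderr)
    return 2
```

Wired into a post-edit hook, the gate might run the test suite, a type checker, or a linter; whatever it runs, it cannot be talked into passing.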
Contracts Outlive Prompts
The most durable part of a harness is the contracts it encodes. Prompts rot as models change. Tools get rewritten. A clear definition of "done" survives every iteration.
The consolidated guidance from Anthropic and OpenAI is specific about why JSON beats Markdown for feature lists: JSON resists model-induced corruption better than freeform text. The rule becomes "never remove or reorder items, only flip status from incomplete to complete." That is a contract, binding the generator before it writes a line.
Contracts worth encoding:
- Feature lists with immutable ordering and binary status flags
- Sprint proposals where the evaluator signs off on acceptance criteria before the generator builds
- Hard constraints like "it is unacceptable to remove or edit tests"
- Repository as the single source of truth — if it is not in the repo, it does not exist for the agent
These survive model upgrades. A new model still has to meet the contract.
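The feature-list contract can be enforced mechanically rather than trusted. A minimal sketch, assuming a JSON list of objects with `id` and `status` fields (the field names are assumptions, not from any specific tool):

```python
# Sketch of the feature-list contract: items may never be removed or
# reordered, and status may only flip from "incomplete" to "complete".
# Field names ("id", "status") are illustrative assumptions.

def validate_update(old, new):
    """Reject any edit to the feature list that violates the contract."""
    if len(new) != len(old):
        raise ValueError("contract violation: items added or removed")
    for before, after in zip(old, new):
        if before["id"] != after["id"]:
            raise ValueError("contract violation: items reordered or renamed")
        if after["status"] not in ("incomplete", "complete"):
            raise ValueError("contract violation: unknown status")
        if before["status"] == "complete" and after["status"] != "complete":
            raise ValueError("contract violation: completed item reopened")
    return new
```

Run before every write to the list, this makes the contract a hard gate instead of a polite instruction in the prompt.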
Context Resets Are Role Resets
Context management looks like a token-budget problem. Windows fill, models get confused, so you compact or reset. The Anthropic team found that for Claude Sonnet 4.5, compaction alone wasn't enough — they needed full context resets because the model developed "context anxiety," wrapping work up prematurely as its window filled.
The deeper read: a context reset is a role reset. When a fresh session reads only the durable artifacts, it enters the work without defensiveness about what the previous session built. It judges the code as a stranger would.
What survives the reset carries the real weight:
- A progress file stating what was done and what failed
- Git history showing what actually shipped versus what was claimed
- The feature list with its immutable status flags
- A short entry-point file pointing to deeper documentation
When Opus 4.6 reduced context anxiety on its own, Anthropic dropped the context reset construct. They kept the planner and the evaluator. The scaffolding shrank. The adversary stayed.
The Harness Encodes Your Engineering Org
What looks like AI infrastructure is software engineering institutional memory, re-pointed at a non-human worker. Every durable harness component maps to something engineering orgs already built for humans:
- Evaluators map to code review
- Test suites and CI map to quality gates
- Sprint contracts map to product requirements
- Progress files and git map to shift handoffs
- Approval checkpoints map to change control
The Anthropic team was direct about the inspiration: it came from watching what effective software engineers do every day. The harness is not a discipline invented for agents. It is the operating model of a functioning engineering organization, compiled into runtime constraints.
This is why harness engineering is durable as a discipline. The underlying truth — that you need separation between makers and judges, between work and its verification — is older than LLMs. It predates every model it will outlive.
The Adversary Doesn't Dissolve
As models absorb more capability, the harness boundary shifts. Planning moves inward. Self-verification improves. Compaction gets smarter. LangChain's observation that the space of harness combinations does not shrink but moves is almost right.
What it misses: one piece refuses to move. The adversarial structure between maker and judge cannot be internalized into a single process without collapsing back into confident mediocrity. You cannot build an honest critic that is also the thing being criticized.
That is the durable core. Every other harness component is negotiable. Prompts, tools, memory formats, orchestration patterns — all of them will look different in two years. The split between generator and evaluator will not.
Build your harness around the split. Treat the rest as disposable. The model will keep changing. The adversary stays.