Agentic Loops and Harness Engineering: The 2026 Field Guide

Execution loops, externalized state, and verification gates now matter more than raw model IQ. Here's how the agents that actually ship are built.

Pillar | June 11, 2026 | 16 min read
Tags: agent harness, agentic loops, ReAct

In May 2024, a Princeton and Stanford team made GPT-4 a 64% better software engineer without changing a single model weight. The SWE-agent paper got that relative gain purely by redesigning the interface the model worked through: a custom file viewer, throttled search, tighter feedback.

Two years later, OpenAI put it bluntly in its March 2026 harness engineering essay: "the agent harness is the most important thing about Codex." Not the model. The harness.

That's the thesis of this guide. An agent harness is the engineered layer between a model and its environment: the loop structure, the tool surface, the permission rules, the memory files, and the verification gates that decide whether an autonomous coding agent ships working software or burns tokens in circles.

TL;DR: Raw ReAct loops fail through compounding errors and overconfident termination, so production agents in 2026 wrap them in Plan-Execute-Verify structures, multi-agent orchestration, or stateless outer loops like the Ralph Wiggum pattern. Context rot makes long-lived in-context state a losing bet, which forces state out to files and git. The harness, configured through AGENTS.md files, tiered permissions, and executable test gates, is now where the engineering leverage lives.

Key takeaways:

  • ReAct is the runtime, not the design. Every serious agent wraps it in something else.
  • Context is a cache, not a memory. Externalize state to prd.json, progress logs, and git.
  • The Ralph Wiggum loop (fresh context every iteration, state on disk) is the floor for long-running autonomous coding agents.
  • Three executable gates (init.sh, test-all.sh, review.sh) are the de facto verification standard.
  • Don't trust public leaderboards. SWE-bench Verified was retired in February 2026 after 59.4% of audited hard problems turned out to have flawed tests.
  • Multi-agent orchestration fails 41% to 86.7% of the time across frameworks. It's a tool, not a default.

What is an agent harness, exactly?

An agent harness is everything around the model: the system prompt and memory files loaded at startup, the curated set of tools the model can call, the permission policy governing those calls, the scripts that verify its output, and the outer loop that decides when to continue, retry, or stop. The model supplies reasoning per step; the harness supplies structure, state, and safety across steps.

This used to be an afterthought. In 2026 it's the product. Anthropic's essay on long-running agent harnesses and OpenAI's harness engineering post both landed in March 2026, and they converge on nearly identical designs from independent directions.

The reason they converge is that the failure modes of naked agentic loops are now thoroughly documented. So start there.

The evolution of agentic loops: ReAct and its failure modes

The ReAct pattern, introduced by Yao et al. in October 2022, interleaves "Thought:" and "Action:" steps in a single trace. The model reasons, acts, observes the result, and reasons again.

It spread fast because it needed almost no infrastructure: a loop, a tool spec, a prompt. By 2023 it was the default in LangChain, AutoGPT, and most early coding agents.

It fails in four well-documented ways.

Compounding error cascades. Each step conditions on the prior step's text, so an early hallucination is preserved in the trace and poisons every subsequent action. There is no sanity-check mechanism inside a pure ReAct loop.

Repetition loops. When the agent loses track of its plan, it can repeat the same action indefinitely. Both the AutoGen paper and the Reflexion paper describe agents spinning on the same observation without progress.

Goal drift. Without a separate planner, the agent chases local optima that never converge on what the user asked for. Anthropic's Building Effective Agents essay notes that agents looping through many turns can lose coherence.

No verifiable termination. A ReAct loop ends when the model declares "Final Answer," and models are empirically overconfident about success. The PatchDiff study found 7.8% of "passing" SWE-bench patches failed the developer's full test suite.

The right read on all this: ReAct isn't dead, it's demoted. It remains the inner loop of nearly every production agent. The mistake is shipping it raw and calling it an agent.

Plan-Execute-Verify: the reasoning sandwich

Plan-Execute-Verify (PEV) is the dominant 2025-2026 pattern for non-trivial coding tasks. Reasoning brackets action in three explicit phases.

First, a planning step decomposes the task into verifiable steps, typically written to a structured file like prd.json or feature_list.json. Then a ReAct-style loop executes one step at a time. Finally, a separate check (a different model call, a deterministic script, or a test run) validates each step before the next one starts.

Anthropic's reference implementation makes the roles concrete: an orchestrator reads a feature_list.json, an Initializer expands it into PRD and architecture notes, a Coding agent implements one feature, and an Evaluator runs tests and either approves the work or returns a structured diff. The lifecycle is implement, test, fix, repeat.

OpenAI's Codex guidance lands in the same place: plan, then implement, then verify, and treat the harness itself as version-controlled production code.

PEV has real costs. It burns more tokens, takes more wall-clock time, and the planner itself can hallucinate. Anthropic concedes it isn't right for every problem. But for multi-step coding work, the verify phase is what separates agents that ship from agents that announce success and leave broken code behind.
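
The execute-and-verify part of that cycle can be sketched in a few lines of shell, assuming each planned step can be expressed as a pair of commands. The function name `run_step` and its three arguments are hypothetical and not taken from Anthropic's or OpenAI's implementations; in a real harness the execute command would invoke the agent, and the verify command would be a test run or a separate model call.

```shell
#!/usr/bin/env bash
# Hypothetical helper: execute one plan step, verify it, retry a bounded number of times.
# $1: execute command, $2: verify command, $3: maximum attempts
run_step() {
  attempt=1
  while [ "$attempt" -le "$3" ]; do
    sh -c "$1"                # execute phase (the agent call in a real harness)
    if sh -c "$2"; then       # verify phase: only the exit code counts
      echo "verified after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "gave up after $3 attempt(s)"
  return 1
}
```

A harness would call this once per planned step, for example with ./test-all.sh as the verify command, and stop the run when it returns non-zero. The bounded retry count is the design choice that matters here: the step never certifies itself, and it never loops forever.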

What is the Ralph Wiggum loop?

The Ralph Wiggum pattern, named by security researcher Geoffrey Huntley in 2025, is the most consequential agent design of the past year. The name comes from the Simpsons character who doesn't understand the world while the world keeps moving around him. That's the agent: zero persistent memory, total system progress.

The canonical implementation is a shell wrapper, roughly:

bash
# ralph.sh: the outer loop owns all state.
# prd_complete, agent, append_progress, and current_feature are placeholders
# for your own helpers and agent CLI; the flags shown are illustrative.
while ! prd_complete prd.json; do
  # 1. Fresh context: the model remembers nothing
  # 2. Agent reads prd.json + progress.txt, picks next undone item
  agent --clean-context --prompt "Read prd.json and progress.txt.
    Implement the next incomplete feature. One feature only."
  # 3. Gates: non-zero exit = feature not done, re-queue
  ./test-all.sh || continue
  ./review.sh   || continue
  # 4. Log and commit: the filesystem is the memory
  append_progress progress.txt
  # git add -A also stages newly created files, which `git commit -a` would skip
  git add -A
  git commit -m "feat: $(current_feature)"
done

Every iteration starts with a wiped context. The agent reads the requirements file and progress log from disk, does one unit of work, passes the gates, logs what it did, and commits. Then the loop resets it.

The architectural insight: context is a cache, not a memory. The model holds no state across iterations, but the system holds complete state in files and git history.

Hallucination loops become physically impossible to sustain, because the prior hallucination isn't in context anymore. Any iteration can be debugged or replayed by checking out the relevant commit.

Huntley frames it as "a CI/CD pipeline that happens to use a language model as its build step." OpenAI's harness essay endorses the pattern by name: if a task outlasts one context window, you need an outer loop, and the outer loop should own the progress file, the test results, and the git state.

If you're building a long-running autonomous coding agent in 2026, this pattern is the floor.

Why does context rot force these designs?

The stateless loop isn't a stylistic choice. It's a response to two empirical findings about how models handle long context.

The first is Lost in the Middle (Liu et al., July 2023), which showed frontier models attend to the beginning and end of a long context while effectively ignoring the middle, with drops of more than 20 percentage points on retrieval tasks when the relevant document sat mid-context.

The second is Chroma Research's 2025 "context rot" study, which tested 18 frontier models and found performance degrades monotonically as input grows, even on trivial tasks, and the degradation accelerates well before the advertised context limit.

Practitioners have converged on a working rule: there's a "dumb zone" somewhere past 20-40% of the advertised window where instruction-following falls apart. The exact boundary is task- and model-dependent, but the engineering response is consistent. Aider's repo map defaults to a 1K-token map of the repository rather than dumping files, because going bigger degrades instruction-following.

Cursor's engineering guidance cites a 15-40K token sweet spot for practical coding tasks.
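
To make the working rule concrete, here is a rough budget check. It estimates tokens at about four characters each, which is a common approximation rather than a real tokenizer, and the function names are illustrative. The window size and percentage are inputs, since the boundary is task- and model-dependent.

```shell
#!/usr/bin/env bash
# Rough token estimate: about 4 characters per token (a heuristic, not a tokenizer).
estimate_tokens() {
  # $1: file holding the prompt or context to be sent
  echo $(( $(wc -c < "$1") / 4 ))
}

# Token budget as a percentage of the advertised window.
budget_tokens() {
  # $1: advertised window in tokens, $2: percentage to allow
  echo $(( $1 * $2 / 100 ))
}

# Succeeds (exit 0) when the estimated prompt size is within the budget.
fits_budget() {
  # $1: prompt file, $2: advertised window, $3: percentage to allow
  [ "$(estimate_tokens "$1")" -le "$(budget_tokens "$2" "$3")" ]
}
```

For a hypothetical 200,000-token window, `budget_tokens 200000 20` gives 40,000 tokens, and `fits_budget prompt.txt 200000 20` lets the outer loop refuse to send a prompt that has grown past it.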

This is why every serious agent separates an inner loop from an outer loop. The inner loop is the model's actual working context, typically 4K to 32K tokens, kept small and high-signal.

The outer loop is the harness: filesystem, git, test runner, review queue. Anthropic's context engineering essay states the goal plainly: not to fill the context window, but to keep the model focused on the next decision.

The externalized state kit has standardized fast. A feature_list.json or prd.json holds verifiable acceptance criteria with pass flags. An append-only progress.txt records what each iteration did. Git is the durable source of truth; if the progress log lies, git log doesn't.
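
One possible shape for the append-only log, as a sketch. The tab-separated layout (timestamp, feature id, status) and the helper names `log_progress` and `done_features` are assumptions for illustration; the pattern only requires that each iteration appends and never rewrites.

```shell
#!/usr/bin/env bash
# Append one line per iteration; earlier lines are never edited.
log_progress() {
  # $1: log file, $2: feature id, $3: status ("pass" or "fail")
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" >> "$1"
}

# Print the ids of features that have at least one "pass" entry, de-duplicated.
done_features() {
  # $1: log file
  awk -F '\t' '$3 == "pass" { print $2 }' "$1" | sort -u
}
```

A fresh-context iteration can read `done_features progress.txt` to see what is already finished, and a human can compare that list against git log when the two disagree.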

Harness anatomy: AGENTS.md, permissions, and gates

The configuration layer

AGENTS.md is the open standard here: a markdown file at the repo root telling any agent how the project builds, tests, and what's off limits. Incubated in mid-2024, it was adopted by Cursor, GitHub Copilot's coding agent, Aider, Devin, and OpenAI's Codex CLI, and in December 2025 the Linux Foundation's Agentic AI Foundation took stewardship.

By mid-2026, 23+ tools support it.

The vendor-specific variants follow the same shape. Anthropic's CLAUDE.md adds a dual-memory design: a human-edited project file plus an auto-written memory file, scoped at enterprise, project, and user levels. Cursor's .cursor/rules adds glob scoping, so a monorepo can carry different rules for services/api/** and web/**.

The convergence matters more than the differences. Every major coding agent now treats a project-level markdown file as the unit of harness configuration, and write-once-use-everywhere is becoming realistic.

The permission layer

The standard model is three tiers, and the major vendors have independently arrived at the same shape:

| Tier | Claude Code | Codex CLI | Devin CLI | Typical contents |
|---|---|---|---|---|
| Always | Allow | read-only / workspace-write | Auto / Allow | File reads, in-project writes, search |
| Ask First | Default mode | workspace-write prompts | Ask | Shell commands, network, external systems |
| Never | Deny | (blocked) | Deny | Destructive ops, out-of-project edits, exfiltration paths |

Sources: the Claude Code permissions docs and the Devin CLI permission reference.

There's an unofficial fourth tier: YOLO mode (--dangerously-skip-permissions in Claude Code; --yolo or the "danger-full-access" sandbox mode in Codex). Both vendors document it as a footgun. The consensus hardening for sandboxed autonomous runs is to combine it with a PreToolUse hook that checks every shell command against a denylist. Trust the model, verify every action.
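
The denylist check itself can be very small. The sketch below assumes the hook has already extracted the command line as a plain string; the function names and the patterns are illustrative, not a vendor-supplied list. Wiring it into a real hook (parsing the tool-call payload, returning the exit code the agent expects) depends on the agent's own documentation and is left out.

```shell
#!/usr/bin/env bash
# Hypothetical denylist check for agent-issued shell commands.
# Substring matching is crude and easy to evade (for example `rm -r -f`),
# so treat this as one layer on top of a sandbox, not a replacement for it.
is_denied() {
  # $1: the full command line the agent wants to run
  case "$1" in
    *"rm -rf /"*|*"git push --force"*|*"curl "*"| sh"*|*"mkfs"*)
      return 0 ;;   # matches a denylisted pattern
    *)
      return 1 ;;   # no match
  esac
}

# Print a verdict the calling hook can act on.
check_command() {
  if is_denied "$1"; then
    echo "deny"
  else
    echo "allow"
  fi
}
```

Note that the first pattern blocks any `rm -rf` on an absolute path, not only the filesystem root. For a denylist in a disposable sandbox, over-blocking is the cheaper mistake.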

The verification layer

Three executable gates have become the de facto standard, not by spec but by survival:

  • init.sh runs once: install dependencies, build, confirm the project is green before the agent touches anything.
  • test-all.sh runs after every iteration. Non-zero exit means the feature isn't done and gets re-queued. The agent never gets to declare success on its own.
  • review.sh runs after tests: linters, type checkers, and a security scanner (Semgrep, Bandit, or CodeQL) on every diff.
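
The gate contract is small enough to sketch: run each gate in order and treat the first non-zero exit as "not done". The `run_gates` name is hypothetical. Gate output is sent to stderr here so that the one-line verdict on stdout stays easy for the outer loop to parse.

```shell
#!/usr/bin/env bash
# Run each gate command in order and stop at the first failure.
# Arguments: one command string per gate, e.g. ./test-all.sh ./review.sh
run_gates() {
  for gate in "$@"; do
    if ! sh -c "$gate" >&2; then
      echo "FAIL: $gate"    # name the gate that rejected the work
      return 1
    fi
  done
  echo "PASS"
  return 0
}
```

In the outer loop this would look like `run_gates ./test-all.sh ./review.sh || continue`, so a feature is only logged and committed after every gate has exited zero.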

That last gate isn't optional decoration. Veracode's 2025 GenAI Code Security Report found 45% of AI-generated code contains OWASP Top 10 vulnerabilities, a rate unchanged over two years of model improvement. Benchmark scores went up 10x; security capability didn't move.

Self-improving harnesses

The 2026 frontier is letting the agent propose edits to its own harness: a reviewer agent reads recent traces, spots a recurring failure ("always forgets to run tests after editing"), and drafts a patch to the system prompt or a hook. Anthropic calls it self-improvement; OpenAI calls it harness engineering.

The guardrails both vendors insist on: a human reviews every patch, the harness lives in version control like any production artifact, and the agent never edits its own permission tier. The Berkeley multi-agent failure study warns why: self-improvement loops without strong external verification converge on satisfying the verifier, not the user.

How do the leading autonomous coding agents compare?

| System | Launch | Harness differentiator | Memory model | License |
|---|---|---|---|---|
| Claude Code | May 2025 | Hooks + Skills + sub-agents + MCP | Dual: CLAUDE.md + auto memory | Proprietary |
| Cursor Composer 2.5 | May 2026 | Background agents, team rules, RL pipeline | .cursor/rules + auto-rules | Proprietary |
| OpenHands | Mar 2024 | Docker sandbox + event-sourced log + CodeAct | Filesystem state | MIT |
| SWE-agent / mini | May 2024 | ACI, now unwound to bash-only | Linear history | MIT |
| Codex CLI | 2025 | workspace-write permissions, AGENTS.md native | AGENTS.md + config | Proprietary |

A few details deserve expansion.

Claude Code's most distinctive primitive is the sub-agent: an isolated context window with its own system prompt and optionally its own model. It's the cleanest way to decompose a long task without polluting the parent's context, a direct answer to context rot.

The commercial trajectory backs the design: $1B ARR within six months of GA, per Anthropic.

Cursor's Composer 2.5 is interesting for what it disclosed: the base model is Moonshot's open-source Kimi K2.5, with 85% of compute spent on post-training. It scored 79.8% on SWE-Bench Multilingual against Opus 4.7's 80.5%, at a claimed (vendor-stated, unverified) 10-60x cost advantage.

Cursor treats the model as the swappable component and the harness as the product.

OpenHands is the open-source reference architecture: a Docker sandboxed runtime, an event stream as the single source of truth for replay and debugging, and CodeAct, where the agent emits executable Python instead of discrete tool calls.

And SWE-agent tells the most instructive story in the field. Its custom Agent-Computer Interface delivered that 64% relative gain in 2024. By 2026, the same team recommends mini-SWE-agent, which uses nothing beyond bash and a linear history, because improved models made the bespoke interface roughly performance-neutral.

The lesson: harness advantages at the tool layer erode as models improve. Harness advantages at the loop, state, and verification layers don't.

The benchmark collapse: why leaderboard scores stopped meaning anything

If you needed proof that verification belongs in the harness rather than the leaderboard, 2026 delivered it.

SWE-bench Verified launched in August 2024 with 500 human-validated tasks. By late 2025, frontier models clustered above 70%. On February 23, 2026, OpenAI retired it: an audit of 138 failure cases found 59.4% had flawed test suites, and all three frontier model families showed training-data contamination, reproducing gold patches verbatim.

The rot was visible earlier. SWE-Bench+ (October 2024) showed 32.67% of "successful" patches had solution leakage, with answers embedded in the issue text itself.

The cleanest contamination fingerprint came from Scale AI's SWE-bench Pro, built partly on proprietary repositories that can't leak into training corpora. Top frontier score: 23.3%, against roughly 75% on Verified. The capability didn't halve. The measurement got honest.

Then METR added the human baseline. In a March 2026 study, active maintainers of scikit-learn, Sphinx, and pytest reviewed 296 AI pull requests that had passed the benchmark's tests. About half would not be merged. The automated grader overstated merge-worthiness by 24 percentage points.

The practitioner conclusion is blunt: treat public leaderboard scores above 50% as marketing. Your harness gates, run against your codebase, are the only benchmark that transfers.

Multi-agent orchestration: powerful, brittle, not free

The canonical multi-agent design is four roles in separate context windows: an Initializer that plans, a Coder that implements, an Evaluator that verifies, and a Supervisor that routes work. The lineage runs through ChatDev and MetaGPT to today's production harnesses.

The honest evidence cuts both ways. Anthropic's own multi-agent research system post reports that the architecture outperformed a single agent by about 90% on its internal research eval, but that multi-agent systems use roughly 15x more tokens than a chat interaction, alongside "new challenges in agent coordination, evaluation, and reliability."

The sharpest critique is Cemri et al. from UC Berkeley, which audited 1,600+ traces across seven frameworks and found failure rates between 41% and 86.7%. The top failure modes are mundane: disobeying the task spec (15.2%), role confusion (11.5%), and premature termination (8.5%), where the system declares victory without running the tests.

My read: multi-agent orchestration is an amplifier for teams that already have strong verification gates, and a failure multiplier for teams that don't. Earn it. Start with a single Ralph-style loop, add an Evaluator role when the single loop demonstrably can't self-check, and add a Supervisor only when you're juggling parallel work streams.

What this means for you

If you're building or buying autonomous coding agents in 2026, the evidence supports a short list of defaults.

Pick the harness before the model. Models swap; loop structure, state files, and gates persist. SWE-agent's ACI unwinding shows tool-layer cleverness depreciates, while outer-loop architecture doesn't.

Default to a stateless outer loop for anything longer than one context window. Fresh context per iteration, state in prd.json and progress.txt, truth in git. Keep the inner loop under half the advertised context window.

Make the gates executable and non-negotiable. Run test-all.sh and review.sh after every iteration, with a security scanner in the review gate. The agent never self-certifies.

Run Ask-First permissions plus a command denylist hook. Reserve YOLO mode for disposable sandboxes.

Discount the marketing. METR's July 2025 RCT found experienced developers were 19% slower with AI while believing they were 20% faster, a 39-point perception gap. DORA 2025 found high performers gain while struggling teams see delivery stability drop 7.2%. And Gartner projects that over 40% of agentic AI projects will be canceled by the end of 2027. AI agents amplify whatever engineering discipline you already have. The harness is where that discipline gets encoded.

The loop is the product now. Build it like one.

Frequently asked questions

What is an agent harness?

An agent harness is the engineered layer between a language model and its environment: the system prompt, tool surface, permission rules, memory files, verification scripts, and outer control loop. It decides what the model sees, what it can do, and how its work gets checked. In 2026, both Anthropic and OpenAI have published essays arguing the harness, not the model, is the main competitive surface for coding agents.

What is the Ralph Wiggum loop?

It's a stateless agent pattern named by Geoffrey Huntley in 2025. A shell script wipes the model's context every iteration, has it read a requirements file and progress log from disk, implement one item, pass test and review gates, then commit and log. State lives in the filesystem and git, not the context window, which makes hallucination loops physically impossible to sustain.

Is ReAct still used in production agents?

Yes, but never alone. The 2022 ReAct pattern remains the inner loop of nearly every coding agent, but production systems wrap it in planners, verifiers, orchestrators, or stateless resets. Shipping raw ReAct fails predictably through compounding errors, repetition loops, and overconfident termination.

Why was SWE-bench Verified retired?

OpenAI retired it on February 23, 2026 after an audit found 59.4% of a sampled set of hard problems had flawed test cases, and all major frontier model families showed training-data contamination. On Scale AI's contamination-resistant SWE-bench Pro, top scores fell from roughly 75% to 23.3%.

Do multi-agent coding systems work?

Sometimes, at a cost. The four-role pattern (initializer, coder, evaluator, supervisor) is the canonical 2026 design, but a UC Berkeley study of 1,600+ traces found failure rates of 41% to 86.7% across seven frameworks. Use multi-agent orchestration only when single-loop designs demonstrably fall short, and verify everything externally.