The Wrong Reflex
The agent writes an email validator. It looks plausible — regex, edge case handling, error messages. It calls the task done and exits. The regex silently rejects valid addresses containing a plus sign.
Nothing in the loop ran the tests.
Not because the model couldn't write a test runner. Not because it lacked reasoning capacity. Because running the tests was a tool it could call — and on this pass, it didn't. The harness let it decide. It decided wrong.
The reflex when this happens is to reach for a better model. That reflex is almost always wrong.
The agent didn't fail because it was too dumb to verify. It failed because the architecture let it skip verification. A smarter model in the same architecture will skip verification slightly less often — and still skip it. The capability ceiling isn't the problem. The enforcement boundary is.
Most agent failures are context failures, enforcement failures, or blast-radius failures. Not one of them is fixed by a model upgrade.
This post is about what actually fixes them. I ran a controlled experiment across three projects — harness_ladder, multi_agent_coding, and deep_research_agent — holding the model constant and adding one piece of infrastructure at a time. What follows is what moved the needle, why, and what it cost.
The Experiment: One Variable Per Rung
harness_ladder is built around one discipline: each rung adds exactly one component. The model (qwen3:8b via Ollama), the task set, and the evaluation criteria stay fixed. The eval runner emits a structured table of (model, rung, task, pass/fail) rows, so the contribution of each rung is measurable in isolation.
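A minimal sketch of what such a results table could look like in code. The EvalRow type, the pass_rate_by_rung helper, and the task names are hypothetical, and the rows are made-up placeholders, not results from the experiment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRow:
    """One row of the (model, rung, task, pass/fail) table."""
    model: str
    rung: int
    task: str
    passed: bool


def pass_rate_by_rung(rows):
    """Return {rung: fraction of tasks passed}, so each rung can be compared in isolation."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for row in rows:
        totals[row.rung] += 1
        passes[row.rung] += int(row.passed)
    return {rung: passes[rung] / totals[rung] for rung in sorted(totals)}


# Placeholder rows for illustration only; these are NOT measured results.
rows = [
    EvalRow("qwen3:8b", 0, "email_validator", False),
    EvalRow("qwen3:8b", 0, "slugify", True),
    EvalRow("qwen3:8b", 3, "email_validator", True),
    EvalRow("qwen3:8b", 3, "slugify", True),
]
print(pass_rate_by_rung(rows))
```

Because the model and task set are the same in every row, any difference between two rungs' pass rates can be attributed to the one component that rung added.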
The reason to build each rung from primitives rather than starting from a harness library is precisely this: a library bundles planning, memory, and sandboxing together. You can't measure the contribution of each if you can't isolate it. The discipline of building from scratch is the experiment.
| Rung | What's added | Core lesson |
|---|---|---|
| 0 | Bare ReAct loop: model + tools only | Baseline. The ceiling of raw capability. |
| 1 | Planning tools (create_plan, mark_step_done) | Structural decomposition moves weaker models more than expected |
| 2 | Scratch memory (scratch_write / scratch_read) | Many failures are context failures, not capability failures |
| 3 | Auto-verify node (harness runs tests, re-injects failures) | The harness enforces the feedback loop; the model doesn't choose it |
| 4 | Sandboxed write (only solution.py writable) | Blocking bad moves is cheaper than asking the model to avoid them |
| 5 | Sub-agent delegation (delegate_subtask) | Isolated context per sub-problem beats one long accumulating window |

Rungs 0–2 improve the model's working conditions without changing who controls the outcome. They're worth having — scratch memory at Rung 2 alone recovers failures that look like reasoning problems but are actually context problems. The model produced correct reasoning earlier in the session and lost track of it. That's a context failure, not a capability failure, and it's fixed by a scratchpad, not a smarter model.
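To make the scratchpad concrete, here is a minimal sketch of a file-backed note store. The tool names scratch_write and scratch_read come from the ladder; the Scratchpad class, its JSON layout, and the return strings are assumptions for illustration, not the project's actual implementation.

```python
import json
from pathlib import Path


class Scratchpad:
    """Key-value notes persisted to a JSON file, so they outlive the context window."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        # A missing file simply means no notes have been written yet.
        if self.path.exists():
            return json.loads(self.path.read_text(encoding="utf-8"))
        return {}

    def scratch_write(self, key, note):
        """Tool the model calls to save a finding it will need later."""
        notes = self._load()
        notes[key] = note
        self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")
        return f"saved note '{key}'"

    def scratch_read(self, key):
        """Tool the model calls to recover a finding it no longer has in context."""
        return self._load().get(key, f"no note named '{key}'")
```

The point of the design is that the note lives outside the message history: even if the earlier reasoning scrolls out of the window, a later turn can read it back.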
Rung 3 is different in kind.
Rung 3: The Hinge
Before Rung 3, verification is a tool the model can call. The harness offers it. The model chooses when and whether to use it.
After Rung 3, verification is a LangGraph node the harness forces. Every time the model signals task completion, the node fires — not as a suggestion, not as a tool call in the prompt, but as a structural edge in the graph. The model cannot route around it. It cannot declare success without passing through the gate.
The verification doesn't live in the prompt. It lives in the graph. Many agent tutorials miss this: they add a "please verify your work" instruction to the system prompt and call it a feedback loop. It isn't. An instruction is advisory. A graph edge is structural. A model that writes "# tests pass" in a comment and exits cleanly can do exactly the same thing on a smarter base model with a politely worded system prompt asking it to self-check.
The gate is what changes the outcome. Not the ask.
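The difference between an offered tool and a forced gate can be sketched in plain Python. This is not the LangGraph code from harness_ladder; it is a framework-free approximation in which run_with_forced_verify, the stub model, the stub verifier, and the retry limit of three are all assumptions.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: the model step maps a message history to a candidate,
# the verifier maps a candidate to (passed, report).
ModelStep = Callable[[List[str]], str]
Verifier = Callable[[str], Tuple[bool, str]]


def run_with_forced_verify(model_step: ModelStep, verify: Verifier,
                           task: str, max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Every completion signal passes through verify(); the model cannot skip it."""
    messages = [task]
    for attempt in range(1, max_attempts + 1):
        candidate = model_step(messages)    # the model signals "done" by returning
        passed, report = verify(candidate)  # harness-owned check, always runs
        if passed:
            return candidate, attempt
        # Re-inject the failure so the next turn is a retry, not an answer.
        messages.append(f"Verification failed:\n{report}")
    return None, max_attempts


def stub_model(messages):
    # Stand-in for an LLM call: it only fixes the bug after seeing a failure report.
    return "accepts_plus" if len(messages) > 1 else "rejects_plus"


def stub_verify(candidate):
    # Stand-in for a real test run.
    ok = candidate == "accepts_plus"
    return ok, ("" if ok else "test_plus_address failed")


result, attempts = run_with_forced_verify(stub_model, stub_verify, "write an email validator")
print(result, attempts)
```

Note that verify is not in the model's tool list at all. The model never gets to decide whether it runs, which is the whole point of the rung.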
Rungs 4 and 5: Blast Radius and Isolation
Rung 4 applies Ashby's Law of Requisite Variety directly. If the agent can overwrite your entire codebase, keeping the codebase safe requires it to make the right decision on every file, every turn — a high-variety problem for the controller. Sandboxing solves it at a lower layer: restrict writes to solution.py only. The agent can't make a bad decision about your other files because the harness removed the option entirely.
Blocking bad moves is cheaper than asking the model to avoid them.
Ashby's Law is usually taught as a systems principle. Applied to coding agents, it's a concrete architecture decision: what is the minimum writable surface the agent needs to complete this task? Everything outside that surface should be structurally inaccessible, not just instructed-away.
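As a rough sketch of what "structurally inaccessible" can mean, the write tool itself can refuse any path outside the allowed surface. The sandboxed_write function and SandboxError class are hypothetical names; only the solution.py allowlist comes from the ladder.

```python
from pathlib import Path


class SandboxError(PermissionError):
    """Raised when the agent tries to write outside its allowed surface."""


def sandboxed_write(workdir, requested_path, content, allowed=("solution.py",)):
    """Write content only if requested_path resolves to an allowed file in workdir."""
    root = Path(workdir).resolve()
    # Resolve first so "../" tricks and absolute paths are judged by where they land.
    target = (root / requested_path).resolve()
    allowed_targets = {(root / name).resolve() for name in allowed}
    if target not in allowed_targets:
        raise SandboxError(f"write to {requested_path!r} is blocked by the harness")
    target.write_text(content, encoding="utf-8")
    return str(target)
```

A path check like this only guards the write tool. If the agent also has a shell tool, the same boundary has to be enforced there, for example with filesystem permissions or a container.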
Rung 5 introduces sub-agent delegation via delegate_subtask. The point to hold onto: isolated context per sub-problem isn't just a memory optimization. Each sub-agent runs in its own context window and returns only its distilled finding to the orchestrator. The orchestrator never sees the raw reasoning from the sub-problems; it receives only what each sub-agent decided to surface.
This is the first appearance of a design instinct that recurs across every layer of this stack: compress by default, retrieve on demand. At the control-flow level, the harness compresses sub-problem context automatically — you get the signal, not the noise. This same instinct reappears at the token layer in Parts 2 and 3. Pay attention to where it shows up, because it's always solving the same problem at a different altitude.
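Here is a minimal sketch of that compression at the control-flow level. The name delegate_subtask comes from the ladder, but this signature, the SubAgent type, and the stub sub-agent are assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical: a sub-agent is any callable that takes its own message list
# and returns the full transcript it produced.
SubAgent = Callable[[List[str]], List[str]]


def delegate_subtask(sub_agent: SubAgent, subtask: str) -> str:
    """Run a sub-agent in a fresh context and return only its final message."""
    private_context = [subtask]   # fresh list: nothing inherited from the orchestrator
    transcript = sub_agent(private_context)
    return transcript[-1]         # the distilled finding; the rest is discarded


def stub_sub_agent(messages):
    # Stand-in for an LLM loop: noisy intermediate steps, then a finding.
    return messages + [
        "thought: check the address grammar",
        "tool: ran several searches",
        "finding: '+' is valid in the local part",
    ]


orchestrator_context = ["task: fix email validator"]
orchestrator_context.append(delegate_subtask(stub_sub_agent, "is '+' valid in emails?"))
print(orchestrator_context)
```

The orchestrator's context grows by one message per delegation, no matter how long the sub-agent's own transcript was.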
Enforcement in a Real Pipeline
The ladder is a controlled experiment. multi_agent_coding applies the same ideas under conditions closer to production: a four-agent sequence in LangGraph, running fully locally via Ollama.
The PLANNER decomposes the task into steps and edge cases — no code produced. The CODER writes solution.py against the plan with strict output rules: one code block, all imports present, no placeholders. The EVALUATOR runs pytest as a subprocess and returns structured pass/fail. The DEBUGGER reads the failing assertion lines, does root-cause analysis, patches solution.py, and hands back to the EVALUATOR. Max three iterations.
The load-bearing detail: the EVALUATOR has zero LLM involvement. It doesn't reason about whether the code is probably correct. It runs the tests and returns a result. This separation of reasoning and enforcement is deliberate. The LLM is allowed to reason. It is not allowed to judge its own output. Those are different jobs.
This is the second recurring design instinct in this stack: deterministic gate by default, LLM as opt-in escalation. The gate that controls pipeline routing runs no inference — it's a subprocess. The LLM's role in the feedback loop is to respond to what the gate finds, not to decide whether the gate needed to run.
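A deterministic gate of this kind fits in a few lines. The sketch below is an approximation, not the project's EVALUATOR: the evaluate and route_after_eval names, the result dictionary, and the route labels are assumptions. The default command runs pytest through the current interpreter, and the cmd parameter exists only so the gate can be exercised without a test suite.

```python
import subprocess
import sys


def evaluate(cmd=None, cwd=None, timeout=120):
    """Run the test command and return a structured result. No LLM involved."""
    if cmd is None:
        cmd = [sys.executable, "-m", "pytest", "-q"]
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout)
    return {
        "passed": proc.returncode == 0,  # pytest exits 0 only when all tests pass
        "returncode": proc.returncode,
        "output": proc.stdout + proc.stderr,  # handed to the DEBUGGER on failure
    }


def route_after_eval(result, iteration, max_iterations=3):
    """Pure routing function: the exit code decides, not the model."""
    if result["passed"]:
        return "done"
    if iteration >= max_iterations:
        return "give_up"
    return "debugger"
```

In a graph framework, a function like route_after_eval would sit on the conditional edge leaving the evaluator node. The LLM only ever sees the gate's output; it has no say in whether the gate runs or how the result is interpreted.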
Every run produces a proof.md — a Markdown document capturing what each agent reasoned, timing, and the final pass/fail per step. Small detail now. It becomes significant when the compliance thread resurfaces in Part 3.
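For a sense of how little machinery such a record needs, here is a sketch of a renderer. The StepRecord fields mirror what the text says proof.md captures (reasoning, timing, pass/fail per step); the class name, the render_proof function, and the Markdown layout are assumptions, not the project's actual format.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    agent: str       # e.g. "PLANNER", "CODER", "EVALUATOR", "DEBUGGER"
    reasoning: str   # what the agent reported about its step
    seconds: float   # wall-clock time for the step
    passed: bool     # pass/fail outcome recorded for the step


def render_proof(steps):
    """Render step records as a Markdown document (hypothetical layout)."""
    lines = ["# Proof of run", ""]
    for i, step in enumerate(steps, start=1):
        status = "PASS" if step.passed else "FAIL"
        lines += [f"## Step {i}: {step.agent} ({status}, {step.seconds:.1f}s)", "",
                  step.reasoning, ""]
    return "\n".join(lines)


# Illustrative record only, not output from a real run.
print(render_proof([StepRecord("EVALUATOR", "pytest reported a failing assertion.", 2.5, False)]))
```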
What a Library Gives You — And What It Can't
deep_research_agent runs the same harness ideas through LangChain's deepagents library. create_deep_agent() supplies planning middleware (write_todos), a virtual filesystem (read_file / write_file / edit_file), and sub-agent delegation via the task tool. Two subagents, one domain tool (DuckDuckGo — no API key), a critic review pass, and a sourced report.md out the other end.
Here's what the library provides for free, and what it doesn't:
| Capability | Hand-built (harness_ladder / multi_agent_coding) | deepagents |
|---|---|---|
| Planning / task tracking | create_plan / mark_step_done wired at Rung 1 | write_todos middleware: free |
| Scratch + deliverable memory | scratch_write/read, manual JSON files | virtual filesystem: free |
| Sub-agent delegation | delegate_subtask built at Rung 5 | task tool + subagents=[...]: free |
| Isolated context per sub-problem | manual sub-graph invocation | automatic per task call |
| Deterministic enforcement gate | EVALUATOR = pure pytest subprocess, no LLM | NOT provided: critic-subagent advises; does not enforce |
The last row is the lesson. deepagents collapses the other four rows, which took Rungs 1, 2, and 5 of the ladder to build by hand, into a few lines of configuration. That's real leverage: for research pipelines that don't need a hard enforcement boundary, it's the right call. But it won't give you the no-LLM gate. The critic-subagent reads the output and offers an opinion. It doesn't run a deterministic check that passes or fails the pipeline.
A library hands you the harness structure. The enforcement boundary is still a design decision you own.
This is also why the ladder experiment builds from primitives: if you'd started from deepagents, you couldn't have isolated the contribution of each rung. The library is the right production choice. The primitive is the right learning instrument. The error is using them interchangeably.
The Cost Nobody Tells You About
There's a finding from the deep_research_agent hands-on that the benchmark tables don't show.
Initial run: 5 sub-questions, qwen3:8b via Ollama on an Apple M3. Result: 619 seconds of run time, llama-server peaking at 7.77 GB RAM — versus the ~5 GB you'd expect from a single 8B model call. Sustained fan noise and heat throughout.
The spike isn't a configuration error. qwen3:8b is a reasoning model. When the orchestrator faces a complex delegation task, it activates extended chain-of-thought, and every extra token it generates is held in the KV cache, which grows well beyond what a plain instruct model would use. More inference passes multiplied by longer reasoning traces means memory growth that outpaces the number of sub-questions. The model name says "8B." The resource profile says something else.
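A back-of-envelope estimate shows why long reasoning traces matter for memory. The formula is the standard one for a transformer KV cache (keys plus values, per layer, per token). The default architecture numbers below (36 layers, 8 KV heads, head dimension 128, 2-byte cache values) are my assumptions for an 8B-class model with grouped-query attention; check them against the model's config before relying on the output. This is not a reconstruction of the measured 7.77 GB.

```python
def kv_cache_bytes(tokens, layers=36, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per layer per token.

    All default architecture values are assumptions, not read from the model.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Compare a short instruct-style exchange with progressively longer reasoning traces.
for tokens in (2_000, 8_000, 16_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>6} tokens in context: about {gib:.2f} GiB of KV cache")
```

Under these assumptions each token in context costs roughly 144 KiB, so a session that holds tens of thousands of reasoning and tool-output tokens can add gigabytes on top of the model weights.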
Reducing to 3 sub-questions brought run time down to 223 seconds with stable RAM — still producing a complete sourced report with critique. For production-quality output, cloud inference (tested with claude-opus-4-8) removes the local compute constraint entirely and runs 3–5× faster — at the cost of per-token spend on every call.
That per-token spend is exactly where the next two parts live.
Here's the tension the harness creates: it makes agents reliable by forcing feedback loops. Forced loops are good for correctness. They're expensive for context. Every verify cycle injects pytest output back into the window. Every sub-agent delegation accumulates tool-output tokens across the session. At Rung 5, a multi-agent session doesn't just reason more carefully; it accumulates tokens fast, across multiple sub-problems and multiple verify retries.
The harness enforces correctness, and in exactly the same move, it makes the context window expensive. That's not a reason to skip the harness. It's the design problem the next two parts address.
The Program Is the Harness
The model is the reasoning engine. The harness is the program.
The model decides what to try. The harness decides what counts as done, what can be written, what context survives across steps, and whether the next turn is an answer or a retry. Upgrading the model improves the quality of what the agent tries. Upgrading the harness improves the reliability of what it produces.
The failure modes that show up in production — verification skipped, context lost, blast radius uncontrolled, feedback loops optional — are harness failures. They don't surface in standard benchmark evals because standard evals don't measure enforcement. They measure capability under ideal conditions. Agents in production don't operate under ideal conditions.
The six-rung ladder is one way to make enforcement failures visible. Each rung adds one component, and each component cuts a class of failures that a smarter model would still occasionally make. By Rung 5, the agent isn't just more capable; it's operating in an environment where the failure modes the harness gates, such as skipped verification and writes outside solution.py, are structurally impossible.
The code for all three experiments — harness_ladder, multi_agent_coding, and deep_research_agent — is on GitHub.
The harness enforces correctness. The LLM provides reasoning. Keep those two jobs separated and you'll spend a lot less time debugging agents that did exactly what you asked.
What's Next
The harness made the agent reliable. It did that by forcing feedback loops — and every forced loop writes tool output, test results, and git status back into the context window.
In a 30-minute coding session, a single git log --stat call consumes 8,388 tokens of raw output. The verify loop calls it multiple times. The context window fills. The inference cost compounds. At cloud rates, a well-harnessed agent that runs 20 verify cycles isn't just correct — it's expensive.
Part 2 cuts that cost by 97%. Without touching the harness logic. Without touching the model. Without touching the prompt.
The fix is at the harness boundary, invisible to everything above it.