97% Fewer Tokens, Zero Model Changes: RTK at the Harness Boundary

The Bill the Harness Runs Up

Part 1 closed on a tension worth revisiting concretely before moving on.

The multi_agent_coding pipeline from Part 1 has four nodes: PLANNER → CODER → EVALUATOR → DEBUGGER, with a retry loop back through EVALUATOR on failure. The EVALUATOR is a pure pytest subprocess — no LLM, deterministic, the enforcement gate that makes the harness reliable. It runs, produces output, and hands that output to the DEBUGGER.

On a real test suite, raw pytest output from a failed run is between 8,000 and 13,000 tokens of tracebacks, assertion diffs, file paths, and boilerplate headers. The DEBUGGER needs the assertion lines and the failed test names — maybe 100–200 tokens of actual signal. It receives 8,000–13,000.

Then the loop runs again. And again. The harness allows up to three DEBUGGER iterations. Each iteration reloads the full context: task, code, and all prior tool output. Without compression, the running context grows by 8,000+ tokens per retry. By iteration three, the context gap between "compressed" and "raw" is 24,000 tokens — from a single node, on a single pipeline, on a modest task.
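The accumulation is simple enough to check by hand. The sketch below assumes each retry appends one more test run to the context and nothing is evicted; the function name is illustrative, and the per-retry figures are the low end of the raw range and the compressed size reported later in this post.

```python
def cumulative_context(tokens_per_retry: int, retries: int) -> list[int]:
    """Tool-output tokens held in context after each retry, assuming every
    retry appends one more test run and nothing is dropped."""
    return [tokens_per_retry * i for i in range(1, retries + 1)]


RAW_PER_RETRY = 8_000  # low end of the raw pytest range quoted above
RTK_PER_RETRY = 231    # compressed size measured on the real suite

raw = cumulative_context(RAW_PER_RETRY, 3)
rtk = cumulative_context(RTK_PER_RETRY, 3)
gap = [r - c for r, c in zip(raw, rtk)]

print(raw)  # [8000, 16000, 24000]
print(gap)  # [7769, 15538, 23307]
```

Under these assumptions the gap at iteration three is about 23,300 tokens, which the text rounds to 24,000. At the 13,000-token end of the range it would be larger.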

The harness enforced correctness. In exactly the same move, it created an expensive context problem. That's where RTK enters.

What RTK Is

RTK is a transparent Rust shell proxy — a single static binary that sits between your agent and the shell. It intercepts commands, runs the real command, rewrites the output, and returns the compact version. The agent never sees the raw output. The agent never knows RTK exists.

The integration point is a PreToolUse hook in Claude Code, or a native plugin API in Hermes and OpenCode. Either way, when the agent issues a Bash tool call, the hook rewrites it to the rtk equivalent before execution. No prompt changes. No tool schema changes. No model changes.
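To make the rewrite step concrete, here is a minimal Python sketch of the idea, not RTK's actual hook (RTK is a Rust binary and its real catalog is far larger). The `SUPPORTED` set and the `rewrite_command` name are hypothetical, and the `rtk <command>` form is assumed from the `rtk read` and `rtk grep` commands named later in this post.

```python
import shlex

# Hypothetical allow-list for illustration only.
SUPPORTED = {"git", "ls", "pytest", "docker", "kubectl"}


def rewrite_command(command: str) -> str:
    """Prefix a supported shell command with `rtk`; pass everything else through."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return command  # unbalanced quotes: leave the command untouched
    if not argv or argv[0] == "rtk" or argv[0] not in SUPPORTED:
        return command
    return "rtk " + command


print(rewrite_command("git log --stat -2"))  # rtk git log --stat -2
print(rewrite_command("echo hello"))         # echo hello
```

The sketch only looks at the first word, so pipelines and `&&` chains would need more careful handling. The point is that the rewrite is a string transformation performed before execution, which is why the model never has to know about it.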

Optimization at the harness boundary, invisible to the reasoning layer above.

That sentence is the Part 1 harness principle applied one layer lower: the harness shapes the environment the model operates in, and RTK shapes the I/O surface the harness produces. Same idea, different altitude.

The Mechanism and the Two Instincts

RTK applies four compression strategies per command: smart filtering (strip comments, whitespace, boilerplate), grouping (aggregate similar items — files by directory, errors by rule), truncation (keep the signal, cut redundancy), and deduplication (collapse repeated log lines with counts).

The more important design decision is what RTK doesn't do: it doesn't run an LLM to summarize. It doesn't use ML. Each command family gets a hand-written, per-type filter — git log --stat compression works differently from AWS CloudFormation stack event filtering, which works differently from TypeScript error grouping. That specificity is what buys <10ms overhead and fully deterministic output.

This is the second recurring design instinct in this series appearing at a new layer: deterministic gate by default, LLM as opt-in escalation. In Part 1, the EVALUATOR was the canonical example — pure subprocess, no inference, routes the pipeline. RTK is the same instinct applied to shell output. The hand-written filter is the gate. There's no ML escalation tier because the content type is known and structured; determinism is achievable, so it's the default.

The tee-on-failure behavior is the first instinct at the same layer: compress by default, retrieve on demand. RTK compresses aggressively and writes the full unfiltered output to a local tee log on failure — so the agent can retrieve complete detail only when it actually needs it. In Part 1 this was Rung 5: sub-agent delegation, where only the distilled finding returns to the orchestrator. In Part 3 it will be Headroom's CCR cache, where originals are stored locally and retrieved on demand. Same principle, three altitudes. RTK is the middle one.
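The tee-on-failure pattern can be sketched in a few lines of Python. This illustrates the idea under stated assumptions rather than RTK's implementation: `run_compressed`, the `compress` callback, and the `last_failure.log` file name are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_compressed(argv, compress, tee_dir):
    """Run argv and return (exit_code, compact_output).

    On a non-zero exit, the full unfiltered output is written to a log file
    and the compact output ends with a pointer to it.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    full = proc.stdout + proc.stderr
    compact = compress(full)
    if proc.returncode != 0:
        tee_dir.mkdir(parents=True, exist_ok=True)
        log = tee_dir / "last_failure.log"
        log.write_text(full)
        compact += f"\n[full output: {log}]"
    return proc.returncode, compact


def first_line(text):
    """Toy compressor: keep only the first line of output."""
    return text.splitlines()[0] if text.strip() else ""


tee = Path(tempfile.mkdtemp())
failing = [sys.executable, "-c", "import sys; print('boom'); print('detail'); sys.exit(1)"]
code, out = run_compressed(failing, first_line, tee)
print(code)
print(out)
```

The agent sees one line plus a path. The second line (`detail`) stays on disk until something asks for it.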

What RTK Actually Covers

The post could stop at "git commands" and most readers would walk away satisfied but wrong. Git is where the demo benchmarks live — it's not where the token savings live in a real agentic session.

RTK's own 30-minute session breakdown tells the real story:

Operation	Frequency	Raw tokens	RTK tokens	Savings
`cat` / `read`	20×	40,000	12,000	−70%
`cargo test` / `npm test`	5×	25,000	2,500	−90%
`grep` / `rg`	8×	16,000	3,200	−80%
`git diff`	5×	10,000	2,500	−75%
`go test`	3×	6,000	600	−90%
`pytest`	4×	8,000	800	−90%
`ruff check`	3×	3,000	600	−80%
`git log`	5×	2,500	500	−80%
`git status`	10×	3,000	600	−80%
`ls` / `tree`	10×	2,000	400	−80%
`docker ps`	3×	900	180	−80%
`git add/commit/push`	8×	1,600	120	−92%
Session total		~118,000	~23,900	−80%

The largest single category is cat / read — file reads — at 40,000 raw tokens per session. An agent reading source files to understand architecture, do code review, or trace a bug across a codebase is the top use case by raw token volume, not git. Test runners (90% savings on pytest, cargo test, go test) are next. Git operations are significant but they're not the headline.

Beyond the dev loop, RTK's catalog extends into territory that matters for enterprise-grade multi-agent deployments: AWS CLI (ec2, lambda, logs, cloudformation, dynamodb, iam, s3), containers (docker, kubectl, OpenShift oc), build and lint tools (tsc, ESLint, ruff, golangci-lint, rubocop, cargo clippy), and GitHub CLI (pr list, issue list, run list). That's 100+ command families, each with a dedicated deterministic filter.

The Benchmark

These numbers come from a LangGraph agent session running qwen2.5:7b via Ollama, with raw mode and RTK mode run against the same task, same model, and same toolset:

Command	Raw tokens	RTK tokens	Savings
`git log --stat -2`	8,388	108	98.7%
`git branch -v`	88	6	93.2%
`git status`	66	22	66.7%
`ls -la`	376	113	69.9%
`git log --oneline -20`	138	101	26.8%
Session total	8,592	231	97.3%

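RTK's published claim is 60–90% on common dev commands. These measurements range from 26.8% to 98.7% depending on the command: two of the five fall inside the published range, two exceed it, and one falls well below. The 97.3% session figure is a token-weighted total, not an average of the per-command percentages, and it is dominated by git log --stat. Taken together, the results are broadly consistent with the published range on a different workload and different hardware, which makes them an independent check rather than a repeated marketing stat.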

The variance by command is the honest technical read: git log --stat is highly compressible because its output has enormous structural repetition and a known schema. git log --oneline -20 at 26.8% is already dense — the signal-to-noise ratio in the raw output is high. RTK's compression ratio is a function of how much structural noise the original contains.

For the DEBUGGER node specifically: a mock suite produced 687 tokens raw → 88 tokens RTK (87.2%). A real suite with 8 test cases ran 8,000–13,000 tokens raw → ~231 tokens RTK (97–98%). And critically, across three DEBUGGER iterations, RTK context stays flat while raw context accumulates. By iteration three: a 24,000-token gap from a single node.

What RTK Actually Changes: Model Selection

Here's the claim the token table doesn't surface directly.

Before the RTK experiment, running the multi_agent_coding DEBUGGER on qwen2.5:7b with raw pytest output produced unreliable results. The model lost track of specific assertion lines buried in 8,000 tokens of traceback noise, hallucinated fixes addressing the wrong failure, and burned debug iterations on wrong patches. The apparent conclusion: the DEBUGGER needs a 14B+ local model or a cloud API call.

That conclusion is wrong. The DEBUGGER doesn't need a stronger model. It needs a smaller, cleaner input.

With RTK running, the DEBUGGER receives a few hundred tokens at most of clean assertion failures: 88 on the mock suite, about 231 on the real one. qwen2.5:7b handles that without difficulty: it reads the specific failure, identifies the fix, patches solution.py, and returns to EVALUATOR. The entire pipeline runs locally at zero API cost.

DEBUGGER node — model requirement by configuration

Configuration	DEBUGGER input	Model needed	API cost/task
RTK off	8,000–13,000 tokens (raw)	14B+ local or cloud	~$0.11
RTK on	~231 tokens (compressed)	qwen2.5:7b local	$0.00

At 100 tasks/day: $11/day → $0.00/day on the DEBUGGER node alone. Harness structure and pass rate unchanged.

The broader claim: context quality determines model requirements as much as context size. qwen2.5:7b wasn't failing at the DEBUGGER because it lacked reasoning capacity. It was failing because it was reading 8,000 tokens of mixed signal and noise and losing the thread. RTK doesn't improve the model — it improves the input, and a dramatically cleaner input is often what separates viable from not viable at a given model size.

This is "harness engineering > model engineering" with a cost dimension attached. The harness makes the verify loop happen regardless of model quality. RTK makes the verify loop cheap enough that a small local model can handle it.

Industry Multi-Agent Use Cases

The dev-loop framing — git, pytest, one developer, one coding session — is the marketed use case. The more interesting question for enterprise architects is what RTK does inside multi-agent systems where agents are issuing infrastructure commands, not just version-control commands.

Four scenarios that map directly onto RTK's catalog:

SRE / Incident Response Agents

An incident response agent investigating a failing deployment issues kubectl logs <pod> repeatedly across multiple pods during a 30-minute incident window. Raw Kubernetes log output is verbose by design — timestamps, metadata headers, repeated heartbeat lines, duplicated stack frames across restarts. A single kubectl logs call on a busy pod can exceed 65,000 tokens.

RTK's deduplication strategy collapses repeated log lines into counts: Connection refused (×47) instead of 47 identical lines. docker logs <container> gets the same treatment. For an agent that's calling these commands 10–20× per incident session, the compression doesn't just reduce cost — it makes the context window tractable for a model that needs to correlate signals across multiple pod outputs simultaneously.
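A minimal Python sketch of that deduplication idea follows. It is not RTK's filter: it assumes exact-match lines, so real log lines would need their timestamps stripped first. The `dedupe_lines` name is illustrative.

```python
from collections import Counter


def dedupe_lines(lines):
    """Collapse repeated lines into 'line (×N)', keeping first-seen order."""
    counts = Counter(lines)  # dict subclass, so insertion order is preserved
    return [f"{line} (×{n})" if n > 1 else line for line, n in counts.items()]


log = ["Starting worker"] + ["Connection refused"] * 47 + ["Shutting down"]
print(dedupe_lines(log))
# ['Starting worker', 'Connection refused (×47)', 'Shutting down']
```

Forty-nine lines become three, and the count itself is signal the agent can reason about.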

The same applies to aws logs get-log-events, which RTK reduces to timestamped messages only, stripping CloudWatch metadata headers. And aws cloudformation describe-stack-events is filtered failures-first — an agent triaging a failed stack deployment sees the error events immediately, not after wading through 200 CREATE_IN_PROGRESS entries.

Cloud Infrastructure Provisioning Agents

A cloud provisioning agent that audits or manages IAM has a specific problem: aws iam list-roles raw output includes the full policy document for every role. In an account with 200 roles, that's enormous — policy documents can be 2,000–5,000 tokens each. RTK's AWS IAM filter strips policy documents entirely, returning name/role/ARN only. aws lambda list-functions strips secrets and policy attachments. The agent gets a navigable summary; if it needs a specific policy, it retrieves that role individually.

This is the same compress-by-default, retrieve-on-demand instinct applied to cloud resource enumeration. An infra provisioning agent that scans an entire AWS account before deciding what to change — the kind of agent that's structurally expensive to run — becomes feasible at a small local model with RTK in the pipeline.

CI/CD Pipeline Agents

A CI/CD agent monitoring multiple pull requests issues gh run list to check workflow status, tsc to catch TypeScript errors across a large project, and golangci-lint run to enforce style. Without compression: raw tsc output lists every error with full file path, line number, column, code, and message — individually legible, but at 20,000+ tokens across a large codebase, the agent's context window is immediately dominated by compiler output.

RTK groups TypeScript errors by file. golangci-lint output is filtered to JSON and grouped by rule. ESLint groups by rule and file. The agent reads a structured summary of what's failing and where, not a raw stream of compiler messages. For an agent that's triaging failures across 10 open PRs simultaneously — a realistic multi-agent architecture where each PR gets a sub-agent — the difference between "agent can hold all 10 error summaries in context" and "agent sees only 2 PRs before the window fills" is a product-level difference, not just an efficiency one.
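Grouping by file can be sketched as follows. This assumes tsc's plain (non-pretty) diagnostic format, `file(line,col): error TSxxxx: message`, and is an illustration of the grouping strategy rather than RTK's actual filter. The names `TSC_LINE` and `group_by_file` are hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as: src/app.ts(12,5): error TS2322: Type 'string' is not ...
TSC_LINE = re.compile(
    r"^(?P<file>.+?)\((?P<line>\d+),(?P<col>\d+)\): error (?P<code>TS\d+): (?P<msg>.*)$"
)


def group_by_file(tsc_output: str) -> dict[str, list[str]]:
    """Group tsc diagnostics under their file path, dropping unmatched lines."""
    grouped = defaultdict(list)
    for raw in tsc_output.splitlines():
        m = TSC_LINE.match(raw.strip())
        if m:
            grouped[m["file"]].append(f"{m['line']}: {m['code']} {m['msg']}")
    return dict(grouped)


sample = "\n".join([
    "src/app.ts(12,5): error TS2322: Type 'string' is not assignable to type 'number'.",
    "src/app.ts(40,9): error TS2304: Cannot find name 'foo'.",
    "src/db.ts(7,1): error TS2304: Cannot find name 'bar'.",
])
for path, errors in group_by_file(sample).items():
    print(path, len(errors))
```

Each file path is emitted once instead of once per error, which is where most of the saving comes from on a large project.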

Data Platform Monitoring Agents

aws dynamodb scan output wraps every value in type annotations: {"pk": {"S": "user-123"}, "balance": {"N": "4500"}}. For a data agent scanning a table to check data quality or identify schema drift, those type annotations are noise — the agent already knows the schema. RTK unwraps them: {"pk": "user-123", "balance": "4500"}. At scale across a large table, this is 40–60% reduction on pure structural ceremony.
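The unwrapping step is easy to sketch in Python. This is an illustration of the transformation described above, not RTK's code; `unwrap` and `unwrap_item` are hypothetical names, and numbers are left as strings to match the example output in the text.

```python
SCALAR_TAGS = ("S", "N", "BOOL", "B", "SS", "NS", "BS")


def unwrap(value):
    """Strip a DynamoDB type descriptor ({"S": ...}, {"N": ...}, ...) recursively."""
    if isinstance(value, dict) and len(value) == 1:
        (tag, inner), = value.items()
        if tag in SCALAR_TAGS:
            return inner
        if tag == "NULL":
            return None
        if tag == "L":
            return [unwrap(v) for v in inner]
        if tag == "M":
            return {k: unwrap(v) for k, v in inner.items()}
    return value


def unwrap_item(item: dict) -> dict:
    """Unwrap every attribute of one scanned item."""
    return {name: unwrap(attr) for name, attr in item.items()}


item = {"pk": {"S": "user-123"}, "balance": {"N": "4500"}}
print(unwrap_item(item))  # {'pk': 'user-123', 'balance': '4500'}
```

The type information is lost, which is acceptable only because the agent is assumed to know the schema already.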

aws s3 ls with deep prefixes produces long recursive listings that agents read to understand bucket structure. RTK truncates with tee recovery — the full listing is saved locally if the agent needs to drill into a specific prefix. The data platform agent gets the structure summary; the full inventory is available on demand.

In each scenario, the underlying pattern is the same: RTK doesn't change what the agent can do. It changes how much of the context window gets consumed by structural noise before the agent gets to reason. In multi-agent systems where context pressure is the binding constraint on what fits in a session, that's a system design property, not a UX optimization.

Where It Matters Most: Unattended Agents

Interactive coding sessions have a human in the loop. You notice when the context window bloats, restart the session, intervene. Unattended agents don't have that oversight.

A Hermes instance running on a VPS, issuing git/docker/kubectl commands as part of background workflows, has no one watching the token accumulation. git log --stat -2 at 8,388 raw tokens per call, run hourly: 201,312 tokens/day from one command. With RTK: 2,592 tokens/day. At current Sonnet 4.6 input pricing, that delta compounds across weeks. The right time to install RTK in an ambient agent is before you deploy — not after the first bill arrives.
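The daily figures follow directly from the benchmark row for git log --stat -2. A short Python check, assuming exactly one call per hour (the function name is illustrative):

```python
def tokens_per_day(tokens_per_call: int, calls_per_day: int = 24) -> int:
    """Daily token volume for one command issued at a fixed rate."""
    return tokens_per_call * calls_per_day


raw_daily = tokens_per_day(8_388)  # raw git log --stat -2
rtk_daily = tokens_per_day(108)    # same command through RTK
saved_daily = raw_daily - rtk_daily

print(raw_daily, rtk_daily, saved_daily)  # 201312 2592 198720
print(saved_daily * 7)                    # 1391040
```

That is roughly 1.4 million input tokens saved per week from one command on one agent. Multiplying by a per-token price gives the cost delta for whichever model the agent calls.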

Hermes Integration

This setup was run specifically with Hermes via rtk init --agent hermes. The integration differs from Claude Code's PreToolUse hook: Hermes exposes a native plugin API, so RTK's rtk-rewrite plugin intercepts terminal commands at the plugin layer rather than the hook layer.

Hermes plugin flow: Hermes agent → rtk-rewrite plugin → rtk rewrite → 231 tokens (was 8,592)

As Hermes evolves its plugin API, RTK's integration evolves with it. The Claude Code hook path is static — it rewrites a command string. For an ambient agent running long-term, the plugin coupling is the more durable integration point.

Where RTK Stops

RTK fires on Bash tool calls. Anything that goes through a shell subprocess — from git log to kubectl logs to aws dynamodb scan to ruff check — RTK covers. For file operations specifically, rtk read, rtk grep, and rtk find are explicitly callable as shell commands when you want RTK compression on file reads rather than using the agent's native Read/Grep/Glob tools (which bypass the hook).

The real boundary is shell vs. non-shell content. What RTK cannot touch:

RAG retrieval output returned directly as tool results
JSON API responses that don't go through a subprocess
Chat history and message accumulation across a long session
Prose documents the agent reads through native file tools

In the multi_agent_coding pipeline, the EVALUATOR routes through subprocess — RTK covers it. In the deep_research_agent, DuckDuckGo results are prose returned directly as a tool output — RTK never sees them. This is the seam between Part 2 and Part 3.

One property worth naming explicitly at this boundary: RTK's hand-written filters are deterministic. The same command, the same content, the same compressed result every time. Tee-logged originals on failure. Deterministic output is auditable output — this becomes the central tension in Part 3, where compression moves into ML inference on natural language and determinism breaks.

Install It, Forget It

RTK's correct relationship with your stack is infrastructure, not tooling.

brew install rtk
rtk init -g               # Claude Code / Copilot
rtk init -g --gemini      # Gemini CLI
rtk init --agent hermes   # Hermes plugin

Restart your agent. Every git status, pytest --tb=short, kubectl logs, aws lambda list-functions, and docker ps the agent issues is now compressed at the shell level. The _parse_failures() function in multi_agent_coding/system.py — the hand-written regex that surgically extracts assertion lines from raw pytest output — is doing what RTK does for one command type, written and maintained by hand. RTK replaces that bespoke filter with a general mechanism covering 100+ command families, transparently, without a line of application code.
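For readers who have not seen the Part 1 code, a bespoke filter of that kind might look roughly like the sketch below. This is a hypothetical stand-in, not the actual `_parse_failures()` from multi_agent_coding/system.py. It assumes pytest's usual conventions: `E`-prefixed assertion lines in tracebacks and `FAILED <test id>` lines in the short summary.

```python
import re

FAILED_LINE = re.compile(r"^FAILED (?P<test>\S+)")


def parse_failures_sketch(pytest_output: str) -> list[str]:
    """Keep `E`-prefixed assertion lines and failed test IDs; drop everything else."""
    kept = []
    for line in pytest_output.splitlines():
        if line.startswith("E "):
            kept.append(line[1:].strip())
            continue
        m = FAILED_LINE.match(line)
        if m:
            kept.append(m["test"])
    return kept


sample = "\n".join([
    "=================== FAILURES ===================",
    "___________________ test_add ___________________",
    "    def test_add():",
    ">       assert add(2, 2) == 5",
    "E       assert 4 == 5",
    "E        +  where 4 = add(2, 2)",
    "test_solution.py:8: AssertionError",
    "=========== short test summary info ============",
    "FAILED test_solution.py::test_add - assert 4 == 5",
    "========== 1 failed, 1 passed in 0.02s =========",
])
print(parse_failures_sketch(sample))
```

A filter like this works, but it covers one command and breaks whenever the output format shifts. That maintenance burden is the trade the paragraph above describes.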

That's the layer RTK operates at: below your application logic, below your prompt, below your model choice. Install it once, stop thinking about it.

The hands-on for this post — the multi_agent_coding pipeline, the RTK benchmark runs, and the Hermes integration — is on GitHub.

What's Next

RTK handles every shell command the agent issues — file reads, test runners, build tools, git ops, AWS CLI, kubectl. In a shell-heavy session, that covers a large fraction of the token budget. But in a RAG-heavy agent — one retrieving document chunks, stuffing JSON tool results into context, accumulating chat history across a long session — shell output is a minority of what fills the window.

What compresses the non-shell half? And what happens to the auditability property when compression moves from deterministic structural filters to ML inference on natural language?

Part 3 runs the experiment: a second compression layer, same pipeline, and the moment where "more compression" and "fully auditable" stop pointing in the same direction.