60–95% vs. 9.2%: What I Actually Measured With Headroom

The Layer RTK Can't Touch

Part 2 ended on a boundary. RTK handles every shell command the agent issues: git log, pytest, kubectl logs, aws lambda list-functions. In a shell-heavy coding session, that's 80% of the token budget, reduced by 97%. But when the RAG retriever stuffed 3,496 tokens into context for a 120-token query, RTK couldn't touch a byte of it. That content never went through a shell. It arrived as a tool result: a JSON blob, returned directly.

That's the gap this post fills.

Most agentic workloads split into two token categories: the outputs of shell commands, and everything else. Everything else includes RAG retrieval chunks, structured JSON tool results, prose documents the agent reads, and the accumulating chat history across a long multi-agent session. RTK is surgical — deliberately scoped to the shell output layer. The second category needs a different layer entirely.

Headroom is that layer. The upstream project claims 60–95% token reduction on agent context. This post is about what that number actually means when you test it — what it requires from your message structure, which compressor produces which result, and why the version that compresses most aggressively is not always the one you should ship.

What Headroom Is

Headroom is a content compression pipeline that sits between your agent and the LLM. Before each prompt reaches the model, Headroom routes message content through type-specific compressors, sends the model a denser version, and keeps the originals in a local cache. The design principle is the same one you've seen at every layer of this series: compress by default, keep the original available if the model needs more detail.

The architecture has three moving parts.

ContentRouter — the classifier at the top of the pipeline. It inspects each message block, identifies its content type (JSON structure, source code, prose, image), and dispatches to the appropriate compressor. The routing decision happens at the block level, which matters more than it sounds: a tool_result block that contains list-format content routes differently from one with string content. Get the message format wrong and the router passes everything through silently — no error, no compression.
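To make the block-level routing decision concrete, here is a minimal sketch of the idea. It is not Headroom's ContentRouter: the function names, the route labels, and the crude content heuristics are all assumptions for illustration. The only behaviour taken from the description above is that list-content tool results pass through untouched while string content gets classified.

```python
import json
from typing import Any


def classify_text(text: str) -> str:
    """Rough content-type guess: 'json', 'code', or 'prose' (illustrative heuristics)."""
    stripped = text.strip()
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    code_markers = ("def ", "class ", "import ", "func ", "fn ", "#include")
    if any(line.lstrip().startswith(code_markers) for line in stripped.splitlines()):
        return "code"
    return "prose"


def route_block(block: dict[str, Any]) -> str:
    """Return a hypothetical route label for one message block."""
    if block.get("type") != "tool_result":
        return "passthrough"
    content = block.get("content")
    if isinstance(content, list):
        # Mirrors the behaviour described above: list content is protected,
        # so nothing is compressed and nothing is reported as an error.
        return "protected"
    if isinstance(content, str):
        return classify_text(content)
    return "passthrough"
```

The point of the sketch is the silent branch: a protected block and a compressed block both return normally, so the only way to tell them apart is to inspect the route label.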

Three compressors — each one owns a content type and a risk profile. SmartCrusher handles structured JSON. CodeCompressor handles source code. Kompress handles prose and chat history via an ML model. These are not interchangeable — enabling one does nothing for the content types the others own.

CCR (Content Compression Repository) — the local cache that keeps originals with a configurable TTL. The design intent: the model reads the compressed version by default and calls headroom_retrieve to pull the original when it needs more detail. In the local_first_compression_layer repo, headroom_retrieve is listed under "Extending" — the cache is present; the retrieval wire-up is future work.
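The retain-with-TTL idea can be sketched in a few lines. This is an assumption-laden toy, not the CCR implementation: the class name OriginalStore, the hash-derived key, and the injectable clock are all hypothetical, and nothing here reflects how headroom_retrieve is actually wired.

```python
import hashlib
import time
from typing import Callable, Optional


class OriginalStore:
    """Toy in-memory store that keeps originals until a TTL expires."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: dict[str, tuple[float, str]] = {}

    def put(self, original: str) -> str:
        """Store the original and return a short content-derived key."""
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._items[key] = (self._clock() + self._ttl, original)
        return key

    def retrieve(self, key: str) -> Optional[str]:
        """Return the original, or None if it is unknown or has expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, original = entry
        if self._clock() >= expires_at:
            del self._items[key]
            return None
        return original
```

A compressed block would carry the key, and a retrieval tool would call retrieve(key) when the model asks for more detail. Once the TTL has passed, the original is gone, which matters for the audit discussion later in this post.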

Position relative to Part 2: RTK owns the shell boundary. Headroom owns the content boundary — everything that arrives as message content rather than subprocess output. Headroom bundles RTK for shell delegation, so the full two-layer stack runs from a single entry point.

The two-layer compression stack

Agent
  ↓ all content
HEADROOM (content layer): SmartCrusher (JSON) · CodeCompressor (AST) · Kompress (prose) · CCR (local cache)
  ↓ delegates shell output
RTK (shell output layer): git · pytest · kubectl · docker · aws · 100+ command families
  ↓ compressed prompt
LLM Provider

Three Compressors, Three Risk Profiles

Before running the experiment, it's worth understanding the design of each compressor — because the configuration decision you make before running Headroom is exactly the architectural decision this post is really about.

Compressor	Content type	Mechanism	Deterministic	Semantic risk
SmartCrusher	JSON arrays, repeated-schema structures	Shared header + compact per-record diffs	✅ Yes	Near-zero
CodeCompressor	Source code (Python, JS, Go, Rust, Java, C++)	AST-aware — signatures preserved, bodies reducible	✅ Yes	Conditional — `protect_analysis_context` controls scope
Kompress	Prose, chat history, documentation	Extractive sentence scoring (HF model, ~180 MB, Apache 2.0)	⚠️ No guarantee	Real — qualifying clauses can be scored as low-signal and dropped

SmartCrusher and CodeCompressor are structural compressors. They operate on schema and AST, not on meaning. Given the same input, they produce the same output — the transformation is rule-governed, repeatable, and explainable. In the same way a compiler given the same source will always emit the same bytecode, these compressors given the same JSON will always produce the same compressed representation.
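A structural compressor of the "shared header plus per-record diffs" kind can be sketched as follows. This is a toy under stated assumptions, not SmartCrusher's actual encoding: the crush and expand names, the output layout, and the requirement that every record has the same fields are all illustrative choices.

```python
import json
from collections import Counter


def _enc(value):
    """Canonical JSON encoding, used to compare values exactly."""
    return json.dumps(value, sort_keys=True)


def crush(records):
    """Hoist each field's repeated value into a shared header; keep only the diffs."""
    if not records:
        return {"defaults": {}, "diffs": []}
    keys = list(records[0])
    if any(set(r) != set(keys) for r in records):
        raise ValueError("this sketch assumes every record has the same fields")
    defaults = {}
    for key in keys:
        encoded, count = Counter(_enc(r[key]) for r in records).most_common(1)[0]
        if count > 1:  # only hoist values that actually repeat
            defaults[key] = json.loads(encoded)
    diffs = [
        {k: v for k, v in r.items() if k not in defaults or _enc(v) != _enc(defaults[k])}
        for r in records
    ]
    return {"defaults": defaults, "diffs": diffs}


def expand(crushed):
    """Inverse of crush: overlay each diff on the shared defaults."""
    return [{**crushed["defaults"], **diff} for diff in crushed["diffs"]]


records = [
    {"id": 1, "name": "Ada", "role": "member", "plan": "free", "region": "eu-west-1"},
    {"id": 2, "name": "Lin", "role": "member", "plan": "pro", "region": "eu-west-1"},
    {"id": 3, "name": "Sam", "role": "member", "plan": "free", "region": "eu-west-1"},
    {"id": 4, "name": "Kai", "role": "admin", "plan": "free", "region": "eu-west-1"},
]
crushed = crush(records)
```

Every step is rule-governed: the same records always yield the same header and the same diffs, and expand recovers the input exactly. That is the property the table above labels "Deterministic".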

Kompress is different in kind. It applies extractive sentence scoring using a fine-tuned Hugging Face model. It reads prose and decides which sentences carry information and which don't, which means it is making semantic judgments. Those judgments are consistent on well-structured inputs. They are not architecturally guaranteed to be identical across runtimes, model updates, or edge-case content. The difference between "consistent in practice" and "guaranteed by design" is small in a demo environment and enormous in a regulated one.
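The risk named in the table, a qualifying clause scored as low-signal and dropped, is easy to reproduce with a toy extractive scorer. The sketch below swaps in plain keyword overlap for the ML model, so it says nothing about how Kompress actually scores sentences; the function names and the example text are hypothetical.

```python
import re


def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokens(text):
    """Lowercased word set, used for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract(text, query, keep):
    """Keep the `keep` sentences that share the most words with the query."""
    sentences = split_sentences(text)
    query_tokens = tokens(query)
    scored = [(len(tokens(s) & query_tokens), i) for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:keep]
    kept = sorted(i for _, i in top)  # restore original order
    return " ".join(sentences[i] for i in kept)


doc = (
    "Refunds are processed within five days. "
    "Refund status is visible in the refunds dashboard. "
    "This does not apply to accounts under review."
)
summary = extract(doc, "How long do refunds take?", keep=2)
```

The third sentence shares no words with the query, so it scores lowest and is cut, even though it is the one that changes the answer for some readers. A learned scorer is far better than keyword overlap, but the failure shape is the same.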

One more piece of the architecture worth naming: the CacheAligner. Naive compression changes the prompt prefix. A changed prefix breaks the provider's KV cache: you save tokens and lose your cache discount at the same time, which can leave you worse off on cloud deployments. Headroom's CacheAligner stabilizes compressed prefixes so prompt caching still hits after compression. The insight is non-obvious and correct. It's out of scope for a local Ollama experiment, where there is no KV-cache billing, but it's the right thing to know before you point this at a cloud endpoint.
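Why prefix stability matters can be shown with a toy. This is not how CacheAligner works internally (the source does not say); it only contrasts a compressor whose output for early blocks depends on the whole prompt with one whose output for each block is fixed. All names and budgets here are hypothetical.

```python
import os


def compress_naive(blocks, total_budget=60):
    """Split one budget across all blocks, so early blocks change as turns are added."""
    per_block = total_budget // len(blocks)
    return "\n".join(b[:per_block] for b in blocks)


def compress_stable(blocks, per_block=20):
    """Give every block a fixed budget, so earlier output never changes."""
    return "\n".join(b[:per_block] for b in blocks)


def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


system = "You are a careful coding agent. Answer from the tool output only."
tool = "user_id,name,plan,region,last_login,status,quota,notes"
turn_1 = [system, tool]
turn_2 = [system, tool, "Which users are on the pro plan?"]

naive_1, naive_2 = compress_naive(turn_1), compress_naive(turn_2)
stable_1, stable_2 = compress_stable(turn_1), compress_stable(turn_2)
```

With the stable variant, the whole first-turn prompt is a prefix of the second-turn prompt, so a prefix cache can reuse it. With the naive variant, the two prompts diverge inside the first block, and everything after that point has to be recomputed.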

The Experiment: From Theory to What Actually Ran

With the architecture understood, here's what I built to test it: a dual-graph LangGraph setup — baseline_graph and compressed_graph — running the same payload against the same model (Qwen2.5:7b via Ollama), in isolated subprocesses per configuration so each run sees a clean pipeline state.

The payload was built to exercise the content layer specifically: 10 structured JSON user records × 12 fields each (SmartCrusher target), a code search result in grep signature format (CodeCompressor target), and a prose document paragraph (Kompress target when enabled). The task: answer a question that required reasoning across all three content types.

The Part That Doesn't Make the README

Before a single token was compressed, three configurations produced nothing.

Run 1 — router:protected:user_message. Zero compression. The ContentRouter inspects at block level — specifically whether the tool_result block carries string content or list content. My initial messages used the default multi-block list format. Headroom marks list-content tool results as protected. The router fires, classifies correctly, protects the block, passes it through. No error. No warning. No compression.

Run 2 — router:noop. Switched to string content in tool_result blocks. Router fires and routes — but protect_recent defaults cover the most recent conversation window, which included the exact payload I wanted compressed. noop means the router considered it and declined. Zero compression.

Run 3 — router:tool_result:smart_crusher fires. String content, protect_recent=0, correct block structure. 343 tokens removed. 9.2% reduction. That's the rule-based baseline.
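The difference between Run 1 and Run 3 comes down to the shape of the tool_result block. The sketch below assumes Anthropic-style message blocks, where content may be either a string or a list of typed parts; flatten_tool_result is a hypothetical helper, not part of Headroom, and the tool_use_id value is made up.

```python
from typing import Any


def flatten_tool_result(block: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a tool_result block whose text-only list content is one string."""
    content = block.get("content")
    if isinstance(content, str):
        return block  # already in the shape that gets routed for compression
    if isinstance(content, list):
        texts = []
        for part in content:
            if part.get("type") != "text":
                return block  # leave non-text parts (for example images) untouched
            texts.append(part["text"])
        return {**block, "content": "\n".join(texts)}
    return block


# Run 1 shape: list content, which the router treated as protected.
list_form = {
    "type": "tool_result",
    "tool_use_id": "toolu_example",  # hypothetical id
    "content": [{"type": "text", "text": '[{"id": 1}, {"id": 2}]'}],
}

# Run 3 shape: the same payload as a plain string.
string_form = flatten_tool_result(list_form)
```

Flattening alone reproduces Run 2, not Run 3. The payload also has to fall outside the protected recent window, which in this experiment meant protect_recent=0.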

Run 4 — Kompress configuration. Two more gotchas at this layer.

The first is a singleton latching bug: Headroom caches a module-level pipeline that latches the first kompress_model value it sees in-process. Running rule-based (kompress_model="disabled") then Kompress in the same Python process silently keeps Kompress off — the pipeline is already initialized with the first config, the second is ignored, and both runs report identical numbers. The fix is a separate subprocess per configuration. Naive in-process before/after comparisons are quietly wrong, with no warning.

The second: target_ratio is required. Enabling kompress_model alone isn't enough. Without an explicit target_ratio, the savings gate rejects prose compression silently — same token count as rule-based, no indication the model ran and its output was rejected.
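The latching behaviour and the subprocess fix can be reproduced in miniature. Nothing below calls Headroom: get_pipeline is a stand-in for any module-level singleton, the child script is a placeholder, and "example-model" is a made-up value for kompress_model.

```python
import json
import subprocess
import sys

_PIPELINE = None  # module-level singleton, standing in for the cached pipeline


def get_pipeline(kompress_model: str) -> dict:
    """Build the pipeline once; later calls ignore their argument (the latch)."""
    global _PIPELINE
    if _PIPELINE is None:
        _PIPELINE = {"kompress_model": kompress_model}
    return _PIPELINE


# Placeholder child script: a real one would build the pipeline and run the graph.
CHILD = """
import json, sys
config = json.loads(sys.argv[1])
print(json.dumps({"kompress_model": config["kompress_model"]}))
"""


def run_isolated(config: dict) -> dict:
    """Run one configuration in a fresh interpreter so no state carries over."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, json.dumps(config)],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)
```

In-process, a second call such as get_pipeline("example-model") after get_pipeline("disabled") still returns the disabled pipeline. With run_isolated, each configuration gets its own interpreter, so a before/after comparison measures two different pipelines.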

Both gotchas carry the same lesson as Runs 1–3: Headroom's surface API is clean. The routing logic, singleton initialization, and savings gate contract are not. Reading one layer below the API is what separates a working configuration from a quiet no-op. This is the same pattern as every piece of meaningful infrastructure: the interface hides complexity that, once understood, becomes the most useful part of working with the tool.

The hands-on for this post — the local_first_compression_layer — is on GitHub.

What the Numbers Show

transforms_applied in results.json tells the precise story of what fired. Rule-based run: ["router:tool_result:smart_crusher"]. CodeCompressor did not fire — the grep signature output has no reducible function bodies. All 343 token savings in the rule-based run are attributable to SmartCrusher alone, on the JSON user records. Kompress run: ["router:tool_result:smart_crusher", "mixed", "text"] — SmartCrusher on JSON, Kompress on prose.

Configuration	Tokens	Reduction	Tokens saved	Inference time
Baseline	3,496	—	—	—
Rule-based (SmartCrusher)	3,174	−9.2%	343	−4.4%
+ Kompress (ML)	2,600	−25.6%	1,029	−8.4%
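The Reduction column can be recomputed from the Tokens column with one line of arithmetic. The sketch below does only that; the Tokens saved column is left as reported and is not re-derived here.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction in prompt tokens, rounded to one decimal place."""
    return round(100 * (before - after) / before, 1)


baseline = 3496
rule_based = reduction_pct(baseline, 3174)     # SmartCrusher only
with_kompress = reduction_pct(baseline, 2600)  # SmartCrusher + Kompress
gap_points = round(with_kompress - rule_based, 1)
```

Note that gap_points is a difference in percentage points between the two configurations, not a relative improvement of one over the other.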

A note on inference time: these deltas are directional, not precise. Completion-token counts varied across baseline runs (595, 697, 761 tokens across separate passes), so run-to-run noise is large at this payload size, and the wall-clock deltas reflect prompt-length differences more than Kompress overhead. The latency cost Headroom's own documentation projects (break-even at 5,000–15,000 tokens) applies to larger prose volumes that this experiment can't surface. Treat inference time here as a directional signal, not a benchmark.

Semantic fidelity. Both configurations correctly answered the test query. No semantic loss from either compression path.

Kompress tripled the token savings: 343 → 1,029 tokens removed. The model performed equally well on both. By every metric this experiment could measure, the ML layer strictly outperformed the rule-based layer.

The Tension: Why Working Isn't Enough

Here's what Path A produced that a theoretical argument couldn't.

Three isolated subprocess runs with Kompress enabled. Byte-identical output — same SHA, 1,029 tokens each time. No semantic loss. 25.6% reduction vs. 9.2% on the rule-based config. No failure. No observed non-determinism.

So why does kompress_model="disabled" remain the right default for a regulated workload?

Because the audit standard for a regulated system isn't "observed reproducibility." It's provable reproducibility — the ability to reconstruct, on demand, exactly what the model read before producing a specific output on a specific production request.

Three clean test runs establish that Kompress is consistent on this payload. They don't establish that Kompress is deterministic in the architectural sense: that the same input will always produce the same compressed output, guaranteed, across model updates, runtime versions, or edge-case content. The extractive sentence scorer doesn't carry that guarantee. The gap between "consistent when I tested it" and "guaranteed to produce the identical input that generated this production output" is invisible in a test harness and material in a compliance audit.
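The "same SHA" check amounts to hashing each run's compressed prompt and confirming there is a single digest. A minimal sketch, with hypothetical function names:

```python
import hashlib


def digest(text: str) -> str:
    """SHA-256 of the exact bytes sent to the model."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def observed_consistent(outputs: list[str]) -> bool:
    """True if every run produced a byte-identical compressed prompt."""
    return len({digest(o) for o in outputs}) == 1
```

A True result is evidence about the runs that were hashed and nothing more. It cannot say what a different model version, runtime, or input would produce, which is exactly the distinction between observed and provable reproducibility.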

The compressed input Kompress produced in the test was reproducible. The compressed input it will produce in production on an edge-case prose document with a qualifying condition is a probability statement, not a contract.

The 16.4 percentage-point gap between 9.2% and 25.6% is real. The cost of that gap is the auditability property. For compliance-adjacent workloads, that's a trade worth naming explicitly before making it — not one to slide past because the test numbers looked clean.

This is the moment the second recurring design instinct in this series reaches its clearest form: deterministic gate by default, lossy ML as opt-in escalation. In Part 1, the EVALUATOR is a pure pytest subprocess — no LLM at the enforcement boundary. In Part 2, RTK's hand-written per-command filters are deterministic because structured shell output doesn't need inference. Here, SmartCrusher is the deterministic gate. Kompress is the opt-in escalation you enable when context pressure forces the trade and you've made a deliberate decision to accept it.

The default is the safe path. The escalation is the design decision that requires justification.
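One way to encode "default is safe, escalation needs justification" is a small policy object that refuses to enable the ML tier implicitly. The parameter names kompress_model and target_ratio come from the experiment above; the CompressionPolicy wrapper, the justification field, and "example-model" are hypothetical and not part of Headroom.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompressionPolicy:
    """Deterministic by default; the lossy tier must be requested explicitly."""

    kompress_model: str = "disabled"
    target_ratio: Optional[float] = None
    justification: Optional[str] = None

    def __post_init__(self):
        if self.kompress_model == "disabled":
            return  # deterministic path, nothing further to check
        if self.target_ratio is None:
            # Without it, prose compression was silently rejected in the runs above.
            raise ValueError("target_ratio must be set when Kompress is enabled")
        if not self.justification:
            raise ValueError("enabling Kompress requires a recorded justification")


default_policy = CompressionPolicy()
```

Failing loudly at construction time turns the silent no-op from Run 4 into an error, and it leaves a written reason attached to every configuration that opts in.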

What's Already Built — And the One Gap

The audit gap is narrower and more specific than it first appears.

The transforms_applied field exists and is populated in every result: ["router:tool_result:smart_crusher"] for the rule-based run; ["router:tool_result:smart_crusher", "mixed", "text"] for Kompress. Every compression decision is logged at the metrics level. The data is already there.

What's missing is the integration between that field and a downstream audit pipeline. transforms_applied lives in a local metrics dictionary. It isn't surfaced as a retrievable artifact tied to a specific request ID, timestamped, and queryable alongside the output it produced. The gap is one integration layer — not the absence of data, but the absence of a pipeline that elevates that data from debugging output to audit evidence.

The CCR cache closes the loop further in theory: originals are retained with configurable TTL, and headroom_retrieve is designed to pull the exact original that produced a given compressed input. In this repo that wire-up is future work. But the architecture to close it fully is already sketched — retained originals, logged transforms, a retrieval tool. What's missing is surfacing the compression log as a first-class artifact that audit tooling can consume alongside inference results.
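The missing integration layer could be as small as one record per request. The sketch below is an assumption about what such a record might contain, not something that exists in the repo: the field names other than transforms_applied, and the JSON Lines output, are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_record(request_id: str, original: str, compressed: str, transforms_applied: list[str]) -> dict:
    """Tie one request to the transforms that fired and to both prompt versions."""
    return {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "transforms_applied": list(transforms_applied),
        "original_sha256": _sha256(original),
        "compressed_sha256": _sha256(compressed),
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSON Lines entry."""
    return json.dumps(record, sort_keys=True)


record = audit_record(
    request_id="req-0001",  # hypothetical id
    original="full prompt text",
    compressed="compressed prompt text",
    transforms_applied=["router:tool_result:smart_crusher"],
)
```

The two hashes are what make the record useful as evidence: an auditor can check a retained original and a replayed compression against them instead of trusting the log entry on its own.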

That's the specific, one-layer design decision between "Headroom as a cost optimization" and "Headroom as a cost optimization plus a reproducibility mechanism." Worth naming precisely.

The Two Instincts, Named

Three posts. Three layers. The same two design ideas appeared at every level — introduced in Part 1 without being named as a pair, reinforced in Part 2 at the shell boundary, and reaching their most explicit form here at the content layer.

Compress by default, retrieve on demand. Sub-agent delegation in the harness returns only the distilled finding to the orchestrator — raw sub-problem reasoning stays isolated. RTK compresses shell output aggressively and tee-logs the full original on failure. Headroom compresses every message block and retains the original in CCR for on-demand retrieval. Same principle at control-flow, shell, and content altitude: the default path is compact; the full original is available when the system actually needs it.

Deterministic gate by default, lossy escalation as opt-in. The EVALUATOR in the harness is a pure pytest subprocess — no LLM at the enforcement boundary; the critic-subagent is the advisory escalation. RTK's hand-written per-command filters are deterministic; no ML tier is needed for structured output. SmartCrusher is the deterministic gate on content; Kompress is the opt-in escalation you enable only when context pressure forces a deliberate trade.

These aren't patterns specific to these three tools. They're patterns for systems where correctness and auditability are load-bearing. The model's reasoning layer should operate on the cleanest, most compact signal achievable — with the full original locally available when it needs more. The expensive paths (longer context, ML inference, cloud model) should require an explicit choice.

The default path should be deterministic, compact, and recoverable. Everything else is a deliberate escalation.

What This Series Leaves Unbuilt

The harness_ladder eval runner produces a (model, rung, task, pass/fail) table. Adding tokens_in / tokens_out / cost_usd and running each rung with and without the full RTK + Headroom stack turns the complementarity argument from asserted to measured.

Two open questions that no post in this series answers: does compression affect pass rate at any rung (expected: no, unverified), and what is the per-rung cost delta when both layers run in the pipeline (expected: large at rungs 3–5 where the verify loop issues repeated pytest and git commands, unverified)?

The code lives in harness_ladder/eval/runner.py. The three-axis sweep — model × rung × compression stack — is the natural next chapter.
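The shape of that sweep can be sketched without touching the runner. Everything below is hypothetical: the row fields beyond (model, rung, task, pass/fail), the rung range of 1 to 5, the stack labels, and the single model entry are placeholders, and none of it reflects the actual contents of runner.py.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass
class EvalRow:
    """One cell of the sweep; measurement fields stay None until a run fills them."""

    model: str
    rung: int
    stack: str
    passed: Optional[bool] = None
    tokens_in: Optional[int] = None
    tokens_out: Optional[int] = None
    cost_usd: Optional[float] = None


MODELS = ["qwen2.5:7b"]           # the model used in this post
RUNGS = range(1, 6)               # assumed: rungs 1 through 5
STACKS = ["none", "rtk+headroom"]  # without and with the full stack

grid = [EvalRow(model, rung, stack) for model, rung, stack in product(MODELS, RUNGS, STACKS)]
```

Pairing rows that differ only in stack gives the per-rung cost delta directly, and comparing their passed values answers the pass-rate question.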

All five projects — harness_ladder, multi_agent_coding, deep_research_agent, rust_token_killer, and local_first_compression_layer — are on GitHub.