The Number Part 4 Promised

530,702 tokens sitting in a vector index across two real PDFs. At top_k=1, after reranking, 560 tokens reached the model. All 12 queries answered correctly.

That's a 948× reduction from corpus to context — not compression after the fact, not a Headroom pass on what arrived, not RTK stripping shell output. This happened upstream, before the model paid attention to a single token. The wrong 530,142 tokens never arrived. There was nothing to compress.

Part 4 made the argument structurally: fixing retrieval quality upstream makes every downstream fix less load-bearing. This post makes it a number.


What Part 4 Left Open

Part 4 was the map, deliberately. Before running either reranking or PageIndex, it was worth knowing they answer the same question — why did the wrong tokens arrive? — from opposite architectural starting points. Reranking lets noise in and filters it out; PageIndex reasons over structure so noise is never retrieved.

Part 4 also left a specific promise: turn the Part 2 and Part 3 token discipline loose on whatever retrieval hands the model. First, you need retrieval handing the model the right tokens. That's what this post measures.

The experiment is local-first: a local model, and no cloud API calls at retrieval time. Two real PDFs, one BGE reranker, one k-sweep. The code and full results are on GitHub.


Why Reranking Works: The One Mechanism Worth Holding

Part 4 covered the bi-encoder vs cross-encoder distinction. One paragraph here, because the experiment numbers only make sense with the mechanism:

A bi-encoder encodes the query and each document chunk separately into fixed vectors, then compares them by cosine similarity. It's fast, scalable, and wrong in a specific way: the query and document never interact. "Sharpe ratio" in the query never attends to "risk-adjusted return" in the chunk. The vectors are pre-baked approximations.
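To make the "never interact" point concrete, here is a minimal sketch of bi-encoder scoring. The three-dimensional vectors are made-up stand-ins for real embeddings, and the function names are illustrative rather than taken from the repo. The thing to notice is that every chunk vector exists before the query does.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed chunk embeddings (toy values, not real model output).
# They are built once at index time, without ever seeing a query.
chunk_vectors = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.2, 0.8, 0.1],
    "chunk_c": [0.1, 0.1, 0.9],
}

def bi_encoder_top_n(query_vector, vectors, n):
    """Rank chunk ids by cosine similarity to the query vector; keep the top n."""
    ranked = sorted(
        vectors,
        key=lambda chunk_id: cosine(query_vector, vectors[chunk_id]),
        reverse=True,
    )
    return ranked[:n]

query_vector = [0.8, 0.3, 0.0]  # toy query embedding
candidates = bi_encoder_top_n(query_vector, chunk_vectors, n=2)
```

The only contact between query and chunk is one dot product over two finished vectors, which is why this stage scales and why it misses paraphrases the vectors did not happen to encode.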

A cross-encoder reads query and document together in a single forward pass, with full token-level attention between them. It's expensive — you can't pre-compute it, you run it per query-document pair — and accurate in exactly the places the bi-encoder wasn't. The joint attention is the whole mechanism, and the whole cost.

The pipeline: the bi-encoder casts a wide net (top_n=30 candidates), the cross-encoder scores each query-candidate pair properly, and the top_k highest-scoring survivors reach the model. The first stage is cheap and approximate: recall over precision. The second stage is expensive and precise: precision over recall.
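The two-stage shape can be sketched without any model at all. In the sketch below both scorers are toy stand-ins: `overlap_score` plays the bi-encoder, and `toy_joint_score` plays the cross-encoder by consulting a hypothetical paraphrase table. None of these names come from the repo; only the control flow is the point.

```python
def retrieve_then_rerank(query, chunks, cheap_score, joint_score, top_n=30, top_k=1):
    """Stage 1 keeps top_n by the cheap score; stage 2 reorders them by the joint score."""
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)[:top_n]
    reranked = sorted(candidates, key=lambda c: joint_score(query, c), reverse=True)
    return reranked[:top_k]

def overlap_score(query, chunk):
    """Toy stand-in for bi-encoder similarity: count of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))

# Hypothetical paraphrase table standing in for what joint attention can pick up.
PARAPHRASES = {"sharpe ratio": "risk-adjusted return"}

def toy_joint_score(query, chunk):
    """Toy stand-in for a cross-encoder: sees query and chunk together."""
    score = overlap_score(query, chunk)
    q, c = query.lower(), chunk.lower()
    for phrase, paraphrase in PARAPHRASES.items():
        if phrase in q and paraphrase in c:
            score += 5.0
    return score

chunks = [
    "The fund reports its ratio of fees to assets annually.",
    "Risk-adjusted return is excess return divided by volatility.",
    "Quarterly weather summary for the region.",
]
query = "what is the fund sharpe ratio"

first_stage_only = retrieve_then_rerank(query, chunks, overlap_score, overlap_score, top_n=3, top_k=1)
with_rerank = retrieve_then_rerank(query, chunks, overlap_score, toy_joint_score, top_n=3, top_k=1)
```

On this toy input the cheap stage alone puts the fees chunk first because it shares the most surface words, while the joint scorer promotes the risk-adjusted-return chunk. A real cross-encoder learns that relationship rather than looking it up, but the pipeline around it has the same two sorts.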

Two rerankers sit on this spectrum for the experiment:

  • FlashRank — ONNX-quantized cross-encoder. Same architecture as a full model, numerics shrunk for CPU inference. Sub-millisecond per candidate on commodity hardware. No GPU required.
  • BGE reranker (BAAI/bge-reranker-v2-m3) — full PyTorch cross-encoder. Slower than FlashRank, marginally higher accuracy ceiling. Multilingual. This is what produced the k-sweep results.

The Experiment: Index, Payload, and Design

Index: 1,038 chunks across two real PDFs — 530,702 tokens total. Not a toy corpus. These are documents with the kind of density and vocabulary variance that stress-tests retrieval.

Reranker config: BGE, top_n=30 candidates from the bi-encoder, then cross-encoder scoring on all 30 before top_k selection.

k-sweep design: Hold everything constant — same 12 queries, same index, same reranker — and sweep top_k from 20 down to 1. The question isn't whether reranking works. The question is: where does accuracy break as the model receives fewer and fewer chunks?
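A sweep of this kind reduces to one loop. The sketch below is an illustration, not the repo's `run.py`: the callables `ranked_chunks_for`, `is_correct`, and `count_tokens` are hypothetical hooks you would wire to your own reranker, grader, and tokenizer, and it assumes "tokens sent" means the average per query.

```python
def k_sweep(queries, ranked_chunks_for, is_correct, count_tokens,
            corpus_tokens, ks=(20, 10, 5, 3, 1)):
    """For each k, send the top-k reranked chunks and record accuracy and token cost."""
    rows = []
    for k in ks:
        correct = 0
        tokens_sent = 0
        for query in queries:
            context = ranked_chunks_for(query)[:k]  # reranked order, truncated to k
            tokens_sent += sum(count_tokens(chunk) for chunk in context)
            correct += int(is_correct(query, context))
        avg_tokens = tokens_sent / len(queries)
        rows.append({
            "top_k": k,
            "avg_tokens": avg_tokens,
            "correct": correct,
            "total": len(queries),
            "reduction": corpus_tokens / avg_tokens,
        })
    return rows

def min_k_above(rows, threshold):
    """Smallest k whose accuracy is at or above the threshold, or None."""
    passing = [r["top_k"] for r in rows if r["correct"] / r["total"] >= threshold]
    return min(passing) if passing else None

# Toy data: unlike the real run, accuracy here breaks at k=1 for the second query.
toy_ranked = {"q1": ["gold-1", "noise", "noise"], "q2": ["noise", "gold-2", "noise"]}
rows = k_sweep(
    queries=list(toy_ranked),
    ranked_chunks_for=lambda q: toy_ranked[q],
    is_correct=lambda q, ctx: any(c.startswith("gold") for c in ctx),
    count_tokens=lambda chunk: len(chunk.split()),
    corpus_tokens=100,
    ks=(3, 2, 1),
)
floor_k = min_k_above(rows, threshold=1.0)
```

Holding the reranked list fixed and only changing the slice is what keeps the sweep a single-variable experiment.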


The K-Sweep: Where the Accuracy Floor Is

K-sweep results (BGE reranker, top_n=30, 12 queries)
top_k    correct    reduction
20       12/12      48×
10       12/12      96×
5        12/12      191×
3        12/12      318×
1        12/12      948×
Full corpus: 530,702 tokens. top_k=1 sends 560 tokens.
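Only the top_k=1 token count (560) is stated outright. The other rows can be approximated by dividing the corpus size by each reduction factor. This is a back-of-envelope sketch: the factors are rounded, so the results are estimates, not measured values from the run.

```python
CORPUS_TOKENS = 530_702

# Reduction factors as reported in the k-sweep results, keyed by top_k.
reductions = {20: 48, 10: 96, 5: 191, 3: 318, 1: 948}

# Approximate tokens sent per row: corpus size divided by the rounded factor.
approx_tokens = {k: round(CORPUS_TOKENS / factor) for k, factor in reductions.items()}

for k, tokens in approx_tokens.items():
    print(f"top_k={k:>2}: about {tokens:,} tokens")
```

The top_k=1 row comes back as 560, matching the stated figure, which is a useful sanity check that the factors and the corpus size belong to the same run.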

The story in this table is not the 948×. The story is the accuracy floor: 12/12 held all the way to top_k=1. The sweep never found a k where accuracy broke, because the reranker's top-1 candidate was correct on every query in the set.

That is a stronger result than "reranking improves retrieval." It says the reranker's confidence ordering was so reliable on this corpus and query set that a single retrieved chunk was sufficient — the second and third candidates were redundant before they reached the model.

Two things earn this result: the cross-encoder's joint attention producing a reliable relevance score, and a corpus with enough structural clarity that the reranker's training distribution applies cleanly. The first is the mechanism; the second is the condition. Both matter, and the second one is the limit — more on that below.


FlashRank vs BGE vs No Reranker: The A/B/C Comparison

The k-sweep used BGE throughout. A separate A/B/C run at fixed top_k=5 compared three configurations head-to-head: no reranker (raw bi-encoder top-5), FlashRank, and BGE.

The full results are documented in RESULTS.md in the repo. The abc block was not persisted to results_giant.json — to reproduce the comparison:

python run.py --mode abc --save

The finding documented in RESULTS.md: at top_k=5, both FlashRank and BGE improved over raw retrieval, with BGE at a marginal accuracy advantage and FlashRank significantly faster. The correct framing for the FlashRank vs BGE decision isn't accuracy versus accuracy — it's accuracy ceiling versus latency budget. FlashRank is the right default for latency-critical paths; BGE earns its overhead when the corpus is large and query diversity is high.
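One way to turn "accuracy ceiling versus latency budget" into a decision is to time the scorer per candidate and check whether top_n of those fit the budget. The sketch below is an assumption-laden illustration: `score_pair` is a hypothetical callable wrapping whichever reranker you are testing, and the budget figures in the example are invented.

```python
import time

def time_per_candidate(score_pair, query, candidates, repeats=5):
    """Median wall-clock seconds spent scoring one (query, candidate) pair."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for candidate in candidates:
            score_pair(query, candidate)
        samples.append((time.perf_counter() - start) / len(candidates))
    samples.sort()
    return samples[len(samples) // 2]

def fits_budget(per_candidate_s, top_n, budget_ms):
    """True if scoring top_n candidates sequentially stays within the latency budget."""
    return per_candidate_s * top_n * 1000.0 <= budget_ms

# Example with invented numbers: 1 ms per pair fits a 50 ms budget at top_n=30,
# while 10 ms per pair does not.
fast_ok = fits_budget(0.001, top_n=30, budget_ms=50)
slow_ok = fits_budget(0.010, top_n=30, budget_ms=50)
```

Real rerankers usually batch pairs, so a sequential estimate like this is pessimistic; treat it as an upper bound when comparing configurations on the same machine.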


The Upstream Arithmetic

In Part 3, Headroom's SmartCrusher removed 343 tokens from a 3,496-token payload: a reduction of just under 10%, fully deterministic, with zero semantic loss. That result was clean and correctly reported as such.

530,702 → 560 tokens is a different category of intervention entirely. This is 99.9% reduction before compression had anything to work on. The upstream fix operates at a different order of magnitude than the downstream fix.
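The headline figures all follow from two numbers in the text. A short check, using only the corpus size and the top_k=1 payload:

```python
corpus_tokens = 530_702
tokens_sent = 560

reduction_factor = corpus_tokens / tokens_sent            # corpus-to-context ratio
tokens_never_sent = corpus_tokens - tokens_sent           # tokens filtered upstream
percent_removed = 100 * tokens_never_sent / corpus_tokens

print(f"{reduction_factor:.0f}x, {tokens_never_sent:,} tokens withheld, {percent_removed:.1f}% removed")
# prints: 948x, 530,142 tokens withheld, 99.9% removed
```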

But the framing that matters: these are not competing fixes. They don't overlap and they're not alternatives.

Where each layer intervenes:

  • Reranking: controls which tokens arrive; filters before the model sees anything
  • Headroom: reduces the size of what arrived; compresses after retrieval
  • RTK: reduces shell output tokens; compresses subprocess results

Fix retrieval quality and the compression layers become less load-bearing. When top_k=1 gives 560 tokens of the right content, Headroom's job on that payload is almost decorative — there's not much structural redundancy in a single, well-matched chunk. The upstream fix changes the nature of the downstream problem. That's the series lesson in miniature: fix it upstream and everything downstream gets easier.

The stack that runs all three together: reranking selects the right 560 tokens, RTK has already compressed the shell output the agent produced while building the query, Headroom compresses whatever JSON scaffolding wrapped the retrieved result. Different layers, different content types, additive savings.


What the Index Tells You About Your Corpus

530,702 tokens in the index. 560 tokens sufficient for all 12 correct answers. That's roughly 0.1% of the corpus by token count.

This is not unusual. It's the normal state of a large professional document corpus without a retrieval quality layer. The relevant content is concentrated — the answer to any given query lives in a small, specific section of a specific chunk. Everything else is coverage that the retriever pulls because embedding similarity is an approximation, not a precision instrument.

What reranking does is close the gap between "similar" and "relevant." The bi-encoder retrieves 30 chunks that share vocabulary and semantic neighborhood with the query. The cross-encoder identifies the 1 of those 30 where the query and the document actually answer each other. Without the cross-encoder pass, the model has to be handed enough of those 30 to be sure the relevant one is among them, and then find it itself, paying attention cost across the noise. With the cross-encoder, the noise is filtered before it arrives.


Where the 948× Doesn't Generalize

This result is reproducible on this corpus and these 12 queries. It doesn't automatically generalize, and three conditions constrain it.

Training distribution alignment. The BGE reranker was trained on MS MARCO and similar web-search datasets. On corpora where the semantic style differs significantly from web search — dense technical specifications, cross-domain scientific literature, multi-hop reasoning questions — cross-encoder reranking can produce marginal gains or even degrade retrieval. Published research on technical corpora has found BGE reranking adding latency with flat or slightly negative NDCG delta. The professional-document corpus here aligned well with the reranker's training distribution; don't assume that transfers to every corpus type.

Single-chunk answers. 12/12 at top_k=1 means every query had a single chunk that fully answered it. Queries requiring synthesis across multiple sections — "compare the disclosure requirements in Chapter 3 with the exceptions in Appendix B" — can't be answered by one chunk no matter how well-ranked. The k-sweep tests single-chunk answerability; it doesn't test multi-hop coverage.

Query set size. 12 queries is enough to demonstrate the mechanism on a real corpus. It's not enough to make statistical claims about the accuracy floor. The relevant question for any production deployment is: what's the minimum top_k at which accuracy stays above threshold on your query distribution, not on 12 curated examples?
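How little 12 queries can say statistically is easy to quantify. If all n trials succeed, the one-sided lower confidence bound on true accuracy is the p that solves p^n = 1 - confidence. The sketch below assumes the queries are independent draws from the production distribution, which curated examples usually are not, so the real uncertainty is wider still.

```python
def lower_bound_all_correct(n, confidence=0.95):
    """One-sided lower bound on accuracy when all n trials succeeded.

    Solves p**n = 1 - confidence for p.
    """
    alpha = 1.0 - confidence
    return alpha ** (1.0 / n)

bound_12 = lower_bound_all_correct(12)    # roughly 0.78
bound_100 = lower_bound_all_correct(100)  # roughly 0.97
```

Read that as: a 12/12 result is still compatible, at 95% confidence, with a true accuracy of about 78%. Getting the bound near 97% takes on the order of 100 clean queries.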


The Design Instincts, One Layer Up

Two design ideas have appeared at every altitude in this series. They show up at the retrieval layer too.

Deterministic gate by default, probabilistic escalation as opt-in. The first-stage bi-encoder is the deterministic gate: same index, same query, same candidates, always. The cross-encoder scoring is the probabilistic layer: it introduces a learned relevance judgment, with all the consistency questions and distribution-dependence that implies. In Part 1 this was the EVALUATOR (deterministic pytest subprocess) vs the critic-subagent (advisory LLM). At the retrieval layer: vector search is the gate, cross-encoder is the escalation. Same shape.

Compress by default, retrieve on demand. At the retrieval layer, reranking is a form of compression at the source: instead of handing the model 30 chunks and compressing them downstream, hand it 1 chunk and skip the downstream compression step entirely. The corpus stays in the index; only the relevant signal surfaces. Compress by default means only the signal arrives; the full corpus is always there on demand for a broader query.


What This Series Leaves Unaddressed

The full stack as it stands: harness (reliability), RTK (shell tokens), Headroom (content tokens), reranking (retrieval quality). Every layer operates within a session. When the context window closes, everything resets.

The agent that correctly retrieved a relevant compliance section in Session 1 starts blind in Session 12. It doesn't remember what it retrieved, what it decided, or what it verified. If the same query appears in a new session, it retrieves again from scratch — paying the full retrieval cost — and produces an output with no connection to the reasoning trace from Session 1.

That's the session boundary problem: not "how do I compress this session's tokens" but "what survives when the session ends and how do I connect what the agent learns across sessions to what it produces in future ones?"

Part 6 addresses what crosses the session boundary — and why a compliance agent's cross-session memory is the hardest unsolved problem in this stack.