The Question Compression Can't Answer

Part 2 handed RTK every token that came out of a shell. Part 3 handed Headroom almost everything else — JSON tool results, prose documents, the accumulating chat history. Between them, the stack compresses the two big token categories an agent produces. Both layers do the same fundamental thing: they take the tokens that already arrived and make them smaller.

Neither one asks the prior question.

In Part 3 the retriever stuffed 3,496 tokens into context to answer a 120-token query. Headroom compressed that blob — shaved it down, kept the original in cache, did its job well. But step back one move. Why were there 3,496 tokens to compress in the first place? For a question that small, most of what the retriever pulled was never going to be read, reasoned over, or cited. It was noise that arrived dressed as signal — and then we paid to compress the noise.

Compression is a treatment for the symptom. Retrieval quality is a treatment for the cause.

This is the retrieval-quality layer, and it sits upstream of everything Parts 2 and 3 do. Fix it, and there's simply less for the compression layers to compress. The wrong tokens never show up, so nobody has to shrink them.

There are two serious schools of thought on how to fix it. They answer the same question — why did the wrong tokens arrive? — and they disagree at the level of architecture, not tuning. One lets the noise in and filters it out afterward. The other reasons about structure so the noise is never retrieved at all.

Over the next couple of posts I'm going to build both, then turn the same token discipline from Parts 2 and 3 loose on whatever they retrieve. But before any of that is worth doing, you need the map. This post is the map.


Two Answers, One Question

Reranking: let the noise in, then filter it out
  • retrieve ~30 noisy candidates
  • cross-encoder scores each one properly
  • keep the best 5 → send a lean context

PageIndex: never retrieve the noise in the first place
  • reason over a table-of-contents tree
  • navigate directly to the right section
  • return only those pages → send a lean context

Both pipelines hand the model the same kind of thing at the end: a small, relevant context. They get there along opposite routes — and that route is the entire decision.


What Reranking Actually Is

Reranking is a second-pass quality filter. Your first-stage retriever casts a wide, cheap net and returns a list of plausibly relevant chunks. The reranker then re-examines each candidate against the query in far greater depth and reorders them, so the genuinely relevant passages rise to the top before they reach the LLM.

The distinction that matters is how deeply a model looks at the query–document pair.

| Stage | Model type | How it scores | Cost | Role |
| --- | --- | --- | --- | --- |
| First-stage retrieval | Bi-encoder | Encodes query and document separately into vectors, compares by cosine similarity | Cheap, pre-computable, sub-ms | Recall: cast a wide net |
| Reranking | Cross-encoder | Encodes query and document together in one forward pass; full token-level attention | Expensive, per-pair, can't be cached | Precision: sort the net's catch |

A bi-encoder is fast precisely because it never lets the query and document interact — it just compares two pre-baked vectors. A cross-encoder is accurate precisely because it does the opposite: it reads the query and document jointly, so "Sharpe ratio" in the query can attend directly to "risk-adjusted return" in the passage. That joint attention is the whole point — and the whole cost.
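
To make the difference in call shape concrete, here is a toy sketch. It is not a real model: `embed` is a bag-of-words stand-in for an encoder, and the `RELATED` table is a hypothetical stand-in for associations a trained cross-encoder would learn. A real dense bi-encoder would capture some of this similarity on its own, so the toy exaggerates the gap. What it does show accurately is where the interaction happens: the bi-encoder path compares two vectors built independently, while the cross-encoder path needs the query and the document in the same call.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an encoder: bag-of-words counts instead of a dense vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical term pairs a trained cross-encoder might learn to associate.
RELATED = {("sharpe", "risk-adjusted"), ("ratio", "return")}

def bi_encoder_score(query_vec: Counter, doc_vec: Counter) -> float:
    # Query and document never interact: only their separately built vectors meet.
    return cosine(query_vec, doc_vec)

def cross_encoder_score(query: str, doc: str) -> float:
    # Query and document are read together: every query term is checked
    # against every document term.
    q_terms = query.lower().split()
    d_terms = doc.lower().split()
    hits = sum(1 for q in q_terms for d in d_terms if q == d or (q, d) in RELATED)
    return hits / len(q_terms) if q_terms else 0.0

docs = [
    "the fund posted a strong risk-adjusted return",
    "the office moved to a new building",
]
doc_vecs = [embed(d) for d in docs]  # computed once, offline, before any query exists

query = "sharpe ratio"
bi_scores = [bi_encoder_score(embed(query), v) for v in doc_vecs]
cross_scores = [cross_encoder_score(query, d) for d in docs]
```

Note what can and cannot be prepared ahead of time: `doc_vecs` exists before the query does, while `cross_encoder_score` cannot run until both strings are in hand.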

When I build this, two rerankers will sit on exactly this spectrum:

  • FlashRank: an ONNX-quantized cross-encoder. Same architecture as a full-precision model, but with the weights quantized to lower precision so it runs locally on a CPU in milliseconds instead of needing a GPU.
  • BGE reranker — a full PyTorch cross-encoder. Slower, marginally higher accuracy ceiling.

Both do the same job: take a pile of noisy first-stage candidates, score them properly, keep the best few, and hand a lean context to the LLM. That's reranking in one sentence — and it's the general-purpose answer, because it works on any text whether or not the text has any structure to exploit.
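
The shape of that pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not FlashRank's or BGE's API: `retrieve_then_rerank`, `word_overlap`, and `phrase_match` are hypothetical names, and the two scorers are cheap stand-ins for a vector search and a cross-encoder respectively.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    n_candidates: int = 30,
    keep: int = 5,
) -> List[Tuple[float, str]]:
    # Stage 1: wide, cheap net. In a real system this is a vector index lookup.
    candidates = sorted(corpus, key=lambda d: first_stage(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: score every surviving (query, candidate) pair with the slower model.
    rescored = [(reranker(query, d), d) for d in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:keep]

def word_overlap(query: str, doc: str) -> float:
    # Stand-in for first-stage similarity: count of shared words.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_match(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder: rewards the exact phrase, discounts loose overlap.
    if query.lower() in doc.lower():
        return 1.0
    n_terms = len(query.split())
    return 0.5 * word_overlap(query, doc) / n_terms if n_terms else 0.0

corpus = [
    "sharpe ratio measures return per unit of risk",
    "the ratio of staff to managers changed",
    "sharpe objects were banned from the office",
    "quarterly revenue rose four percent",
]
top = retrieve_then_rerank(
    "sharpe ratio", corpus, word_overlap, phrase_match, n_candidates=3, keep=1
)
```

Swapping a real reranker in means replacing `phrase_match` with a function that calls the model; the two-stage structure stays the same.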


What PageIndex Actually Does

PageIndex (VectifyAI/PageIndex) throws the whole pipeline above out and replaces it with something that looks more like how a human uses a table of contents.

Standard RAG (what reranking lives inside):

PDF → chunk into 512-token windows → embed → vector search → rerank → LLM

PageIndex — "vectorless, reasoning-based RAG":

PDF → build a semantic tree (titles + one-line summaries + page refs, hierarchical)
   → at query time, an LLM agent traverses the tree by reasoning
   → navigates directly to the right section, like reading a contents page
   → returns only those pages to the LLM

There are no vector embeddings, no similarity search, and no chunking. The tree itself is tiny — just node titles, one-line summaries, and page pointers — so the LLM can hold it in context, reason about which branch answers this query, and pull only the pages it decided are relevant.
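
A rough sketch of that idea, under loud assumptions: this is not PageIndex's data format or API. `Node`, `navigate`, and `keyword_chooser` are hypothetical names, the traversal follows a single path where a real system may select several nodes, and `keyword_chooser` is a keyword-overlap stand-in for the LLM reasoning step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Node:
    title: str
    summary: str
    pages: Tuple[int, int]  # inclusive page range
    children: List["Node"] = field(default_factory=list)

# Given the query and the children of the current node, return the index to descend into.
Chooser = Callable[[str, List[Node]], int]

def navigate(root: Node, query: str, choose: Chooser) -> Tuple[Node, List[str]]:
    # Walk from the root to a leaf, recording each section visited.
    node, trace = root, [root.title]
    while node.children:
        node = node.children[choose(query, node.children)]
        trace.append(node.title)
    return node, trace

def keyword_chooser(query: str, children: List[Node]) -> int:
    # Stand-in for the LLM: pick the child whose title and summary share the most query words.
    q = set(query.lower().split())
    scores = [len(q & set((c.title + " " + c.summary).lower().split())) for c in children]
    return scores.index(max(scores))

report = Node("Annual Report", "full filing", (1, 120), [
    Node("Business Overview", "products markets and strategy", (1, 40)),
    Node("Financial Statements", "audited income balance sheet and cash flow figures", (41, 120), [
        Node("Income Statement", "revenue expenses and net income for the year", (41, 60)),
        Node("Balance Sheet", "assets liabilities and equity at year end", (61, 80)),
    ]),
])

leaf, trace = navigate(report, "what was net income last year", keyword_chooser)
```

The returned `trace` is the path of section titles visited, and only the pages in `leaf.pages` would be sent on to the answering model.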

The framing in their own words is sharp and worth keeping:

💡 The core claim
Similarity ≠ relevance. Vector RAG retrieves what is similar; what you actually want is what is relevant, and deciding relevance requires reasoning, not nearest-neighbour geometry.

Inspired by AlphaGo-style tree search, PageIndex does retrieval in two steps: generate a "table-of-contents" tree index of the document, then perform agentic, reasoning-based retrieval by searching that tree. The project reports a state-of-the-art 98.7% accuracy on FinanceBench (via the Mafin 2.5 system built on it). FinanceBench is a financial-document QA benchmark where vector RAG has historically struggled, which is exactly the structured, professional-document regime PageIndex is designed for. (MIT-licensed, ~33k GitHub stars.)

PageIndex is the specialist answer: it buys higher accuracy and genuine auditability on documents with structure worth reading — and it pays with an LLM call at retrieval time.


The Fork, Side by Side

The two architectures solve exactly the same problem — why did the wrong tokens arrive — from opposite ends. Reranking lets the noise in and then filters it. PageIndex reasons about structure so the noise is never retrieved.

| | Reranking | PageIndex |
| --- | --- | --- |
| Works on | Any text, structured or not | Well-structured documents (books, legal, financial) |
| Retrieval mechanism | Embedding + cross-encoder inference | An LLM call that traverses a tree |
| Retrieval cost | Milliseconds, no LLM call | One or more LLM calls per query |
| Accuracy ceiling | Capped by first-stage recall | No first-stage recall cap; reasons over the whole structure |
| Auditability | "chunk from page 145" | "navigated to Section 4.2 because the query asked about X" |
| Fails when | The first stage misses the relevant chunk | Document has no clear hierarchy, or the query spans many sections |

The accuracy-ceiling row is the crux. A reranker can only ever reorder what the first stage already retrieved — if the bi-encoder's wide net missed the one page that actually answers the query, no cross-encoder can recover it. PageIndex has no first-stage recall ceiling; it reasons over the whole document structure every time. The price is an LLM call per retrieval instead of a cached vector lookup.
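
The ceiling is easy to see with made-up numbers. In this sketch the page IDs and scores are invented for illustration, and `rerank` is a hypothetical helper that does what any reranker does structurally: reorder and truncate a fixed candidate list.

```python
from typing import Dict, List, Set

def recall(retrieved: List[str], relevant: Set[str]) -> float:
    # Fraction of the relevant pages that appear in the retrieved list.
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 1.0

def rerank(candidates: List[str], scores: Dict[str, float], keep: int) -> List[str]:
    # Reorder candidates by score and truncate. It can drop pages, never add one.
    ranked = sorted(candidates, key=lambda page: scores.get(page, 0.0), reverse=True)
    return ranked[:keep]

relevant = {"p145", "p212"}                      # the two pages that answer the query
first_stage = ["p012", "p145", "p300", "p077"]   # the wide net missed p212
scores = {"p145": 0.97, "p077": 0.40, "p012": 0.15, "p300": 0.05}

final = rerank(first_stage, scores, keep=2)
ceiling = recall(first_stage, relevant)   # best any reranker could achieve here
achieved = recall(final, relevant)
```

Even with a perfect score for `p145`, recall stays at one half, because `p212` was never a candidate.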

There's a second row worth lingering on: auditability. A reranker can tell you what it sent — "chunk from page 145." PageIndex tells you what and why — "I navigated to Section 4.2 because the query asked about data-breach notification timelines, and the summary of Section 4.2 referenced GDPR Article 33." That's a reasoning trace attached to retrieval, and for a regulated workload it's exactly the artifact an audit wants.


Why This Belongs in the Stack

This series has been walking down a ladder, one altitude at a time.

  • Part 1 (Harness): make the agent reliable, enforce the loop
  • Part 2 (RTK): compress the shell-output tokens
  • Part 3 (Headroom): compress everything else that arrived
  • Part 4 (Retrieval quality): stop the wrong tokens from arriving at all

Parts 2 and 3 are downstream fixes — they make the tokens that arrive cheaper. Part 4 is the upstream fix: it changes which tokens arrive. And this is the general lesson of the whole series in miniature — fixing a problem upstream makes every downstream fix less load-bearing. Better retrieval doesn't just save tokens. It relieves pressure on compression, on context limits, and on the model's ability to ignore distractors. If retrieval lands on the exact two pages that answer a query, Headroom's compression pass over them becomes almost decorative.

There's also a quieter echo of the recurring instinct this series keeps returning to: deterministic gate by default, expensive escalation as opt-in. Reranking is the cheap, local, no-LLM gate. It runs in milliseconds and makes no reasoning call, even though each query still pays for fresh cross-encoder passes that can't be precomputed. PageIndex is the reasoning escalation you reach for when the documents have structure worth paying an LLM call to navigate. Same shape, one layer up: the default path is cheap and mechanical; the powerful path is a deliberate choice you make when the workload justifies it.

So the honest positioning of the two answers:

  • Reach for reranking when the corpus is unstructured, mixed, or latency-critical — heterogeneous text like web pages, chat logs, support tickets, or code. No LLM call at retrieval time; runs in milliseconds locally.
  • Reach for PageIndex when the corpus is professional, structured documents — legal, financial, compliance, technical manuals. You buy higher accuracy and genuine auditability, and you pay with an LLM call per retrieval.
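
Written as a gate, that positioning might look like the sketch below. `CorpusProfile` and `choose_retriever` are hypothetical names, and the two boolean fields are a deliberate simplification of the criteria in the list above; a real decision would weigh more than two flags.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CorpusProfile:
    has_clear_hierarchy: bool   # books, filings, manuals with real sections
    latency_critical: bool      # no room for an LLM call at retrieval time

def choose_retriever(profile: CorpusProfile) -> str:
    # Default to the cheap mechanical path; escalate to reasoning only when
    # the documents have structure to navigate and latency allows an LLM call.
    if profile.has_clear_hierarchy and not profile.latency_critical:
        return "pageindex"
    return "rerank"

support_tickets = CorpusProfile(has_clear_hierarchy=False, latency_critical=True)
annual_reports = CorpusProfile(has_clear_hierarchy=True, latency_critical=False)

ticket_choice = choose_retriever(support_tickets)
report_choice = choose_retriever(annual_reports)
```

The default branch is the cheap one on purpose: anything that fails the structure or latency check falls through to reranking.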

Two architectures, one question. The reranker filters noise after the fact; PageIndex refuses to retrieve it. Which one is "right" is entirely a function of whether your documents have a structure worth reading like a human would.


What's Next

This post was the map, on purpose. Before building anything, it's worth knowing that the retrieval-quality layer isn't one technique — it's a fork, and the branch you pick is dictated by the shape of your documents.

From here, the series goes hands-on with both branches: a local reranking pipeline (FlashRank and BGE on real PDFs) and a PageIndex tree built over the same documents — so the comparison stops being a table and starts being a measurement. Then comes the part that ties back to where we started: once retrieval is handing the model the right tokens, we turn the Part 2 and Part 3 token discipline loose on what's left. Fewer tokens, and the right ones.

Same question every layer of this series has asked. This time we answer it before the tokens ever arrive.


References