The Question Compression Can't Answer
Part 2 handed RTK every token that came out of a shell. Part 3 handed Headroom almost everything else — JSON tool results, prose documents, the accumulating chat history. Between them, the stack compresses the two big token categories an agent produces. Both layers do the same fundamental thing: they take the tokens that already arrived and make them smaller.
Neither one asks the prior question.
In Part 3 the retriever stuffed 3,496 tokens into context to answer a 120-token query. Headroom compressed that blob — shaved it down, kept the original in cache, did its job well. But step back one move. Why were there 3,496 tokens to compress in the first place? For a question that small, most of what the retriever pulled was never going to be read, reasoned over, or cited. It was noise that arrived dressed as signal — and then we paid to compress the noise.
Compression is a treatment for the symptom. Retrieval quality is a treatment for the cause.
This is the retrieval-quality layer, and it sits upstream of everything Parts 2 and 3 do. Fix it, and there's simply less for the compression layers to compress. The wrong tokens never show up, so nobody has to shrink them.
There are two serious schools of thought on how to fix it. They answer the same question — why did the wrong tokens arrive? — and they disagree at the level of architecture, not tuning. One lets the noise in and filters it out afterward. The other reasons about structure so the noise is never retrieved at all.
Over the next couple of posts I'm going to build both, then turn the same token discipline from Parts 2 and 3 loose on whatever they retrieve. But before any of that is worth doing, you need the map. This post is the map.
Two Answers, One Question
Both pipelines hand the model the same kind of thing at the end: a small, relevant context. They get there along opposite routes — and that route is the entire decision.
What Reranking Actually Is
Reranking is a second-pass quality filter. Your first-stage retriever casts a wide, cheap net and returns a list of plausibly relevant chunks. The reranker then re-examines each candidate against the query in far greater depth and reorders them, so the genuinely relevant passages rise to the top before they reach the LLM.
The distinction that matters is how deeply a model looks at the query–document pair.
| Stage | Model type | How it scores | Cost | Role |
|---|---|---|---|---|
| First-stage retrieval | Bi-encoder | Encodes query and document separately into vectors, compares by cosine similarity | Cheap; document vectors are pre-computable; sub-ms comparison | Recall: cast a wide net |
| Reranking | Cross-encoder | Encodes query and document together in one forward pass; full token-level attention | Expensive, per-pair; can't be pre-computed because every score depends on the query | Precision: sort the net's catch |
A bi-encoder is fast precisely because it never lets the query and document interact — it just compares two pre-baked vectors. A cross-encoder is accurate precisely because it does the opposite: it reads the query and document jointly, so "Sharpe ratio" in the query can attend directly to "risk-adjusted return" in the passage. That joint attention is the whole point — and the whole cost.
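To make the "two pre-baked vectors" idea concrete, here is a minimal sketch of the bi-encoder side only. The three-dimensional vectors and document IDs are hand-written assumptions standing in for real embeddings; no actual model is involved. What it shows is the shape of the computation: document vectors exist before any query arrives, and scoring is nothing more than a cosine comparison.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written "embeddings". A real bi-encoder would produce
# these once, offline, long before any query shows up.
DOC_VECTORS = {
    "doc_sharpe": [0.9, 0.1, 0.0],
    "doc_weather": [0.0, 0.2, 0.9],
    "doc_returns": [0.7, 0.6, 0.1],
}

def first_stage(query_vector, k=2):
    """Rank stored documents by cosine similarity to an encoded query."""
    scored = [(cosine(query_vector, vec), doc_id)
              for doc_id, vec in DOC_VECTORS.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

query_vector = [1.0, 0.2, 0.0]  # stand-in for the encoded query
print(first_stage(query_vector))
```

The query and the documents never see each other's tokens here, only each other's vectors. A cross-encoder has no equivalent of `DOC_VECTORS`: its score only exists once a specific query is paired with a specific document.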
When I build this, two rerankers will sit on exactly this spectrum:
- FlashRank: an ONNX-quantized cross-encoder. Same architecture as a full model, but with its weights quantized to lower precision so it runs locally on a CPU in milliseconds instead of needing a GPU.
- BGE reranker: a full PyTorch cross-encoder. Slower, with a marginally higher accuracy ceiling.
Both do the same job: take a pile of noisy first-stage candidates, score them properly, keep the best few, and hand a lean context to the LLM. That's reranking in one sentence — and it's the general-purpose answer, because it works on any text whether or not the text has any structure to exploit.
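The two-stage shape can be sketched without committing to either library. In the sketch below, `retrieve_then_rerank`, `word_overlap`, and `phrase_match` are hypothetical names, and the two scorers are toy stand-ins (word overlap for the bi-encoder, exact-phrase matching for the cross-encoder), not FlashRank's or BGE's API. Only the control flow is the point: a wide cheap pass, a deeper per-pair pass, then a cut.

```python
from typing import Callable, List, Sequence

def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    cheap_score: Callable[[str, str], float],
    deep_score: Callable[[str, str], float],
    recall_k: int = 20,
    keep_n: int = 3,
) -> List[str]:
    """Wide cheap pass for recall, then a per-pair pass for precision."""
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:recall_k]
    reranked = sorted(
        candidates, key=lambda doc: deep_score(query, doc), reverse=True
    )
    return reranked[:keep_n]

def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_match(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: rewards the exact query phrase."""
    return float(query.lower() in doc.lower()) + 0.1 * word_overlap(query, doc)

corpus = [
    "ratio of cats to dogs and a sharpe knife",
    "quarterly weather report",
    "the sharpe ratio measures risk adjusted return",
]
top = retrieve_then_rerank(
    "sharpe ratio", corpus, word_overlap, phrase_match, recall_k=2, keep_n=1
)
print(top)
```

The first document shares both query words, so the cheap pass lets it through; the deeper pass notices the words never appear together and drops it below the passage that actually discusses the Sharpe ratio.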
What PageIndex Actually Does
PageIndex (VectifyAI/PageIndex) throws out the whole pipeline above and replaces it with something closer to how a human uses a table of contents.
Standard RAG (what reranking lives inside):
PDF → chunk into 512-token windows → embed → vector search → rerank → LLM
PageIndex — "vectorless, reasoning-based RAG":
PDF → build a semantic tree (titles + one-line summaries + page refs, hierarchical)
→ at query time, an LLM agent traverses the tree by reasoning
→ navigates directly to the right section, like reading a contents page
→ returns only those pages to the LLM
There are no vector embeddings, no similarity search, and no chunking. The tree itself is tiny — just node titles, one-line summaries, and page pointers — so the LLM can hold it in context, reason about which branch answers this query, and pull only the pages it decided are relevant.
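As a rough illustration of that idea, and explicitly not PageIndex's actual data model or API, the tree can be pictured as nodes holding a title, a one-line summary, and a page range. In the sketch below, `Node`, `traverse`, and `keyword_chooser` are hypothetical names, and the chooser is a keyword-overlap stand-in for what would really be an LLM call that reasons over the child summaries.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Node:
    title: str
    summary: str
    pages: Tuple[int, int]  # inclusive (first_page, last_page)
    children: List["Node"] = field(default_factory=list)

# Stand-in for the LLM: given the query and a node's children, return the
# children worth descending into.
Chooser = Callable[[str, List[Node]], List[Node]]

def traverse(query: str, node: Node, choose: Chooser) -> List[Tuple[str, Tuple[int, int]]]:
    """Walk top-down; return (title, page range) for every chosen leaf."""
    if not node.children:
        return [(node.title, node.pages)]
    hits = []
    for child in choose(query, node.children):
        hits.extend(traverse(query, child, choose))
    return hits

def keyword_chooser(query: str, children: List[Node]) -> List[Node]:
    """Toy chooser: keep children whose title or summary shares a query word."""
    words = set(query.lower().split())
    return [c for c in children
            if words & set((c.title + " " + c.summary).lower().split())]

root = Node("Annual Report", "Full filing", (1, 200), [
    Node("Risk Factors", "Market risk and counterparty credit risk", (10, 40), [
        Node("Market Risk", "Interest rate exposure", (10, 25)),
        Node("Credit Risk", "Counterparty default exposure", (26, 40)),
    ]),
    Node("Financial Statements", "Balance sheet and income", (41, 120)),
])

print(traverse("counterparty default", root, keyword_chooser))
```

Only the chosen leaf's page range comes back, so only those pages would be fetched and handed to the model. Swapping `keyword_chooser` for a function that prompts an LLM with the child titles and summaries is where the per-query reasoning cost enters.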
The framing in their own words is sharp and worth keeping:
Similarity ≠ relevance. Vector RAG retrieves what is similar; what you actually want is what is relevant, and deciding relevance requires reasoning, not nearest-neighbour geometry.
Inspired by AlphaGo-style tree search, PageIndex does retrieval in two steps: generate a "table-of-contents" tree index of the document, then perform agentic, reasoning-based retrieval by searching that tree. It reports a state-of-the-art 98.7% on FinanceBench (via the Mafin 2.5 system built on it) — a financial-document QA benchmark where vector RAG historically struggles, which is exactly the structured, professional-document regime PageIndex is designed for. (MIT-licensed, ~33k GitHub stars.)
PageIndex is the specialist answer: it buys higher accuracy and genuine auditability on documents with structure worth reading — and it pays with an LLM call at retrieval time.
The Fork, Side by Side
The two architectures solve exactly the same problem — why did the wrong tokens arrive — from opposite ends. Reranking lets the noise in and then filters it. PageIndex reasons about structure so the noise is never retrieved.
| | Reranking | PageIndex |
|---|---|---|
| Works on | Any text, structured or not | Well-structured documents (books, legal, financial) |
| Retrieval mechanism | Embedding + cross-encoder inference | An LLM call that traverses a tree |
| Retrieval cost | Milliseconds, no LLM call | One or more LLM calls per query |
| Accuracy ceiling | Capped by first-stage recall | No first-stage recall cap; reasons over the whole structure |
| Auditability | "chunk from page 145" | "navigated to Section 4.2 because the query asked about X" |
| Fails when | First-stage retrieval misses the chunk that answers the query | Document has no clear hierarchy, or the query spans many sections |
The accuracy-ceiling row is the crux. A reranker can only ever reorder what the first stage already retrieved — if the bi-encoder's wide net missed the one page that actually answers the query, no cross-encoder can recover it. PageIndex has no first-stage recall ceiling; it reasons over the whole document structure every time. The price is an LLM call per retrieval instead of a cached vector lookup.
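The ceiling is easy to see in a toy example. The page IDs and relevance scores below are made up for illustration, and `rerank` is a hypothetical helper that assumes an idealized reranker with perfect scores. Even then, it can only reorder the list it is given.

```python
def rerank(candidates, relevance):
    """Reorder candidates by relevance score; it cannot add new items."""
    return sorted(candidates,
                  key=lambda doc_id: relevance.get(doc_id, 0.0),
                  reverse=True)

# Hypothetical scores from an idealized reranker: p145 is the page that
# actually answers the query.
relevance = {"p12": 0.2, "p87": 0.4, "p145": 0.99}

missed = ["p12", "p87"]          # first stage never retrieved p145
found = ["p12", "p87", "p145"]   # first stage retrieved it, ranked last

print(rerank(missed, relevance))  # ['p87', 'p12']
print(rerank(found, relevance))   # ['p145', 'p87', 'p12']
```

In the second case the reranker rescues a badly ranked page. In the first it has nothing to rescue, which is why first-stage recall caps the accuracy of the whole pipeline.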
There's a second row worth lingering on: auditability. A reranker can tell you what it sent — "chunk from page 145." PageIndex tells you what and why — "I navigated to Section 4.2 because the query asked about data-breach notification timelines, and the summary of Section 4.2 referenced GDPR Article 33." That's a reasoning trace attached to retrieval, and for a regulated workload it's exactly the artifact an audit wants.
Why This Belongs in the Stack
This series has been walking down a ladder, one altitude at a time.
Parts 2 and 3 are downstream fixes — they make the tokens that arrive cheaper. Part 4 is the upstream fix: it changes which tokens arrive. And this is the general lesson of the whole series in miniature — fixing a problem upstream makes every downstream fix less load-bearing. Better retrieval doesn't just save tokens. It relieves pressure on compression, on context limits, and on the model's ability to ignore distractors. If retrieval lands on the exact two pages that answer a query, Headroom's compression pass over them becomes almost decorative.
There's also a quieter echo of the instinct this series keeps returning to: deterministic gate by default, expensive escalation as opt-in. Reranking is the cheap, no-LLM gate: it runs locally in milliseconds and makes no reasoning call. PageIndex is the reasoning escalation you reach for when the documents have structure worth paying an LLM call to navigate. Same shape, one layer up: the default path is cheap and mechanical; the powerful path is a deliberate choice you make when the workload justifies it.
So the honest positioning of the two answers:
- Reach for reranking when the corpus is unstructured, mixed, or latency-critical — heterogeneous text like web pages, chat logs, support tickets, or code. No LLM call at retrieval time; runs in milliseconds locally.
- Reach for PageIndex when the corpus is professional, structured documents — legal, financial, compliance, technical manuals. You buy higher accuracy and genuine auditability, and you pay with an LLM call per retrieval.
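Written down as a rule, the positioning above is a two-input decision. The function and parameter names below are hypothetical, and a real system would need its own definition of "clear hierarchy" and "latency-critical"; this is only a sketch of the default-cheap, escalate-deliberately shape.

```python
def choose_retriever(has_clear_hierarchy: bool, latency_critical: bool) -> str:
    """Default to the cheap gate; escalate only when structure justifies it."""
    if latency_critical or not has_clear_hierarchy:
        return "rerank"
    return "pageindex"

# Support tickets: no hierarchy worth navigating.
print(choose_retriever(has_clear_hierarchy=False, latency_critical=False))
# A structured compliance manual with no tight latency budget.
print(choose_retriever(has_clear_hierarchy=True, latency_critical=False))
```

Note the asymmetry: either condition alone sends a workload to reranking, while PageIndex needs both structure and latency headroom.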
Two architectures, one question. The reranker filters noise after the fact; PageIndex refuses to retrieve it. Which one is "right" is entirely a function of whether your documents have a structure worth reading like a human would.
What's Next
This post was the map, on purpose. Before building anything, it's worth knowing that the retrieval-quality layer isn't one technique — it's a fork, and the branch you pick is dictated by the shape of your documents.
From here, the series goes hands-on with both branches: a local reranking pipeline (FlashRank and BGE on real PDFs) and a PageIndex tree built over the same documents — so the comparison stops being a table and starts being a measurement. Then comes the part that ties back to where we started: once retrieval is handing the model the right tokens, we turn the Part 2 and Part 3 token discipline loose on what's left. Fewer tokens, and the right ones.
Same question every layer of this series has asked. This time we answer it before the tokens ever arrive.
References
- PageIndex — Vectorless, Reasoning-based RAG — https://github.com/VectifyAI/PageIndex
- Mafin 2.5 on FinanceBench (98.7%) — https://github.com/VectifyAI/Mafin2.5-FinanceBench
- FinanceBench dataset — https://arxiv.org/abs/2311.11944
- FlashRank (ONNX cross-encoder reranking) — https://github.com/PrithivirajDamodaran/FlashRank
- BGE reranker — https://github.com/FlagOpen/FlagEmbedding