The Session That Never Happened
Session 1: the compliance agent retrieves the correct OSFI capital buffer guideline — the one updated in March 2026, the one that raised the buffer requirement for Category 2 institutions from 2.0% to 2.5%. The reranker surfaces it at top_k=1. The harness runs verification. The output is correct and traceable. Session ends.
Session 12: same agent, same institution, different query phrasing. The context window is fresh. The agent has no record of Session 1 — no knowledge that the guideline was already retrieved and verified, no memory of which version was applied, no connection to the reasoning trace that produced the Session 1 output.
The reranker runs again from scratch. Maybe it surfaces the right chunk. Maybe the slightly different phrasing pulls an adjacent chunk from an older guidance document that was never marked superseded in the index. The agent produces output. It might be consistent with Session 1. It might not be. Neither the agent nor the harness has any way to know.
A regulator asks: "Did your agent apply the March 2026 capital buffer update consistently across all client assessments?"
The answer is unknowable. Not because the agent made a mistake. Because the architecture has no memory of what it did.
Longer Context Is Not Memory
The instinct when you understand this problem is to give the agent its history. Feed it the last 50 session transcripts. Let it see everything that came before. This feels like a solution. It is a different version of the same problem.
Raw session history fed into context grows as sessions accumulate. By session 50, you have compressed nothing and retrieved nothing — you have concatenated everything. The token budget that Parts 2, 3, and 5 spent reducing is now consumed by transcripts the model scans on every turn. The bloat problem returns, across sessions instead of within one.
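A back-of-the-envelope sketch makes the growth concrete. Every constant here is an assumption chosen for illustration, not a measurement: 30,000 tokens per transcript, 40 tokens per extracted fact, and 8 facts retrieved per query. The function names are hypothetical.

```python
# Illustrative arithmetic only: all constants are assumptions, not measurements.
TOKENS_PER_TRANSCRIPT = 30_000  # assumed size of one raw session transcript
TOKENS_PER_FACT = 40            # assumed size of one extracted fact
FACTS_RETRIEVED = 8             # assumed number of facts injected per query


def raw_history_tokens(prior_sessions: int) -> int:
    """Tokens injected when every prior transcript is concatenated into context."""
    return prior_sessions * TOKENS_PER_TRANSCRIPT


def retrieved_fact_tokens(prior_sessions: int) -> int:
    """Tokens injected when only a fixed number of extracted facts is retrieved."""
    if prior_sessions == 0:
        return 0  # nothing to retrieve before the first session has closed
    return FACTS_RETRIEVED * TOKENS_PER_FACT


for n in (1, 10, 50):
    print(n, raw_history_tokens(n), retrieved_fact_tokens(n))
```

Under these assumptions the concatenated history grows linearly with session count, while the retrieved-facts cost stays flat once there is anything to retrieve.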
The more important error is conceptual. Context pressure and state continuity are different problems. RTK, Headroom, and reranking solve context pressure — they control what the current session's tokens cost. State continuity is about what the agent carries from one session to the next. Confusing them produces an architecture that tries to solve continuity by extending context, which doesn't work and creates a new cost problem in the same move.
Memory isn't longer context. Memory is the right facts, extracted from session history and retrievable on demand. That distinction is the whole design decision.
Two Kinds of Memory, Two Architectures
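Agents need to remember two fundamentally different things across sessions: declarative facts (what was retrieved, verified, and concluded in earlier sessions) and procedural knowledge (how to carry out the task). Conflating them produces the same category of error as using a cache where you need a database.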
The architectural divide is load-bearing, not stylistic. Declarative facts are session-specific and query-dependent — you don't know which facts a Session 12 query will need until you see the query. Procedural knowledge is stable and universal — it's needed every session, doesn't change based on query content, and must be reliably present rather than probabilistically retrieved.
Treat them as the same type and you end up injecting everything into context (too much) or retrieving everything by similarity (unreliable). Treat them as distinct types with distinct access patterns and you get a system that is both efficient and consistent.
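One way to keep the two types apart is to give them separate data types and separate paths into the prompt. The sketch below is a minimal illustration under stated assumptions: `DeclarativeFact`, `ProceduralSkill`, and `build_context` are hypothetical names, not part of Mem0 or any harness discussed in this series.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DeclarativeFact:
    """A session-specific assertion, selected per query by similarity search."""
    text: str
    session_id: str
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class ProceduralSkill:
    """Stable operating knowledge, injected verbatim on every session."""
    name: str
    body: str


def build_context(skills, retrieved_facts, query):
    """Assemble a prompt: every skill always, facts only if retrieved for this query."""
    parts = [skill.body for skill in skills]                        # deterministic path
    parts += [f"[memory] {fact.text}" for fact in retrieved_facts]  # probabilistic path
    parts.append(f"[query] {query}")
    return "\n\n".join(parts)
```

The design point is that `skills` never passes through a retrieval step, so nothing can drop a skill from the prompt, while `retrieved_facts` is whatever the similarity search returned for this query and may legitimately be empty.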
Declarative Memory: Compress the Session, Don't Extend It
At the end of a session, the agent has accumulated context — retrieved documents, reasoning traces, verified outputs. The wrong approach stores the full session. The right approach extracts structured facts.
Mem0 converts session content into discrete memory entries. A session that processed 30,000 tokens of compliance document retrieval, harness verification, and multi-agent reasoning compresses to 5–10 structured facts:
"OSFI Guideline B-20 (March 2026): capital buffer Category 2 = 2.5% — Session 1"
"Institution XYZ assessment: verified correct against B-20 v3.2 — Session 1"
"Client threshold exemption: applicable under clause 4.1(b) — Session 3"
These facts are indexed in pgvector. When Session 12 opens, the incoming query is embedded and used to retrieve semantically relevant facts from all previous sessions — without the model scanning full transcripts.
The compress-by-default instinct appears at the cross-session level: a session's reasoning compresses to a retrievable set of assertions. The full session context stays in Langfuse traces and CCR cache — available if needed, not injected by default.
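The retrieval step can be illustrated with a toy in-memory index. This is a stand-in for pgvector, not its API: the three-dimensional vectors are hand-made rather than produced by an embedding model, and `cosine` and `top_k_facts` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_facts(query_vec, fact_index, k=2):
    """Return the k fact texts whose vectors are most similar to the query."""
    ranked = sorted(fact_index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hand-made vectors purely for illustration; real embeddings have hundreds of dimensions.
FACT_INDEX = [
    ("capital buffer Category 2 = 2.5% (Session 1)", [0.9, 0.1, 0.0]),
    ("Institution XYZ verified against B-20 v3.2 (Session 1)", [0.6, 0.6, 0.1]),
    ("threshold exemption under clause 4.1(b) (Session 3)", [0.0, 0.2, 0.9]),
]

# Stands in for the embedding of a Session 12 query about the Category 2 buffer.
query_vec = [1.0, 0.2, 0.0]
print(top_k_facts(query_vec, FACT_INDEX, k=1))
```

Only the selected fact texts reach the prompt. The transcripts that produced them are never scanned at query time.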
One implementation note that matters for the audit argument later in this post: Mem0 facts carry an optional metadata dict. The write step at session close should populate it:
```python
mem0.add(
    messages=[{"role": "user", "content": fact_text}],
    user_id=agent_id,
    metadata={
        "session_id": session_id,
        "langfuse_trace_id": trace_id,
        "ccr_content_id": content_id,
        "source_document": "OSFI-B20-v3.2",
        "retrieved_at": iso_timestamp,
    }
)
```
This is optional in Mem0's default configuration. For a regulated workload, it is not optional. The reason becomes clear in the audit section below.
Procedural Memory: The Deterministic Path
Operational knowledge is different in kind from session facts. How to structure a capital adequacy report. How to handle an OSFI portal 403 error. Which section headers the compliance output format requires. These don't change based on query content and don't accumulate across sessions — they describe the environment and the task structure, not what happened in any specific session.
This knowledge belongs in SKILL.md files, loaded by the harness and injected verbatim into the system prompt on every session:
```python
# Harness prompt construction: procedural memory injection
with open("skills/osfi_report_format.md", "r") as f:
    procedural_context = f.read()
system_prompt = f"{procedural_context}\n\n{base_system_prompt}"
```
No vector search. No similarity scoring. No ML inference at injection time. The harness reads the file and prepends it to the context — deterministic, reproducible, identical on session 3 and session 300.
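If the harness loads more than one skill file, the same property holds as long as the load order is fixed. The sketch below is one possible shape, not the harness's actual code: `build_system_prompt` is a hypothetical name, and hashing the prompt is an added assumption, useful only as a cheap way to check that two sessions received identical procedural context.

```python
import hashlib
from pathlib import Path


def build_system_prompt(skill_dir, base_system_prompt):
    """Concatenate every skill file in sorted order and fingerprint the result."""
    # Sorting by path gives a stable order regardless of filesystem listing order.
    skill_paths = sorted(Path(skill_dir).glob("*.md"))
    procedural_context = "\n\n".join(
        path.read_text(encoding="utf-8") for path in skill_paths
    )
    system_prompt = f"{procedural_context}\n\n{base_system_prompt}"
    # SHA-256 of the final prompt: equal digests mean byte-identical injection.
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return system_prompt, digest
```

Recording the digest alongside each session's trace would let an auditor confirm that session 3 and session 300 ran under the same skill text, or pinpoint the session where a skill file changed.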
The Hermes memory layer operates on the same principle: operational skills are loaded at agent initialization, not retrieved per turn. The agent doesn't search for how to format a report — it always knows, because the harness always told it.
Procedural memory is also where session-boundary enforcement rules live. "Always check for a more recent guideline version before using a cached declarative fact." "Never produce a compliance output without a verified source reference." These must apply on every session and must be injected on every session. They are not candidates for declarative retrieval — if a compliance constraint might be missed by a similarity search, it belongs in the deterministic path.
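A rule such as "check for a more recent guideline version before using a cached declarative fact" can also be backed by a deterministic check in the harness, so it does not depend on the model following the instruction. The sketch below assumes the `source_document` metadata value follows the `OSFI-B20-v3.2` pattern used earlier, and that the harness keeps a hypothetical `latest_versions` registry mapping each guideline to its current version. `usable_cached_fact` is an illustrative name.

```python
def usable_cached_fact(fact_metadata, latest_versions):
    """Return True only if the fact's source is the latest known guideline version."""
    source = fact_metadata.get("source_document")
    if source is None:
        return False  # no verified source reference: never use the fact
    # Split "OSFI-B20-v3.2" into guideline "OSFI-B20" and version "v3.2".
    guideline, _, version = source.rpartition("-")
    return latest_versions.get(guideline) == version


latest_versions = {"OSFI-B20": "v3.2"}  # hypothetical registry kept by the harness
print(usable_cached_fact({"source_document": "OSFI-B20-v3.2"}, latest_versions))
```

A fact that fails the check is not necessarily wrong. It only means the session must retrieve and verify the current guideline before relying on it.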
The Determinism Divide at Cross-Session Altitude
Two design instincts have run through every layer of this series. At the cross-session layer, their purpose is clearest and their stakes highest.
Deterministic gate by default. Procedural memory is injected verbatim from a version-controlled file, every session. The same skill on Session 1 and Session 100. No model involved in the injection decision, no similarity threshold that might cause a miss, no distribution shift that might change what gets injected. The harness reads the file. The content goes in.
Probabilistic escalation as opt-in. Declarative memory uses vector similarity to decide which facts from previous sessions are relevant to the current query. The similarity score is an ML output — the right tool for session-specific knowledge that varies by query, but carrying the same non-determinism trade-off as every other ML layer in this stack.
Where this pattern has appeared across the series:
| Layer | Deterministic gate | Probabilistic escalation |
|---|---|---|
| Harness (Part 1) | EVALUATOR — pure pytest subprocess | critic-subagent — advisory LLM |
| RTK (Part 2) | Hand-written per-command filters | No escalation tier needed |
| Headroom (Part 3) | SmartCrusher — structural JSON compression | Kompress — ML sentence scoring (opt-in) |
| Reranking (Part 5) | Bi-encoder ANN search — deterministic given index | Cross-encoder scoring — ML per query-pair |
| Session memory (Part 6) | Procedural injection — file read, verbatim | Declarative retrieval — vector similarity |
Same shape at five altitudes. The compliance-critical path is always the deterministic gate. The query-dependent, context-sensitive path is always the probabilistic layer, enabled deliberately when the workload justifies it.
The Audit Problem: Five Links, One Missing
A compliance agent's cross-session provenance chain needs to be fully traversable. When a regulator asks "what was the basis for this output?", the chain must connect five links:
The chain is only as strong as its weakest link. By default, the Mem0 fact is a leaf node — it carries the extracted text and a timestamp, but no reference to the session that produced it, the Langfuse trace that recorded it, or the CCR content ID of the source it was derived from. The chain breaks at link two.
| Component | What it provides today | What's missing |
|---|---|---|
| CCR (Headroom) | Retains original source document with TTL, content_id | No link to the Mem0 fact derived from it |
| Langfuse | Full Session 1 inference trace with trace_id | No link to the Mem0 fact produced from that trace |
| Mem0 | Extracted fact, queryable by similarity | No session_id, langfuse_trace_id, or ccr_content_id on the fact |
| pgvector | Indexes facts for retrieval | Retrieves the fact — can't retrieve its provenance |
The fix is the metadata write shown in the Declarative Memory section above. When the fact is written to Mem0 at session close, the session_id, langfuse_trace_id, and ccr_content_id travel with it. When that fact is retrieved in Session 12 and injected as context, the metadata is available to the harness. The Session 12 Langfuse trace references the injected fact and its provenance metadata. The chain closes.
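On the read side, the harness can refuse to inject a fact whose chain cannot be closed. The sketch below is a minimal illustration that reuses the metadata keys from the write step. `provenance_record` and the field names in the returned dict are hypothetical, and how the record is attached to the Session 12 Langfuse trace is left out because it depends on the tracing setup.

```python
REQUIRED_PROVENANCE_KEYS = ("session_id", "langfuse_trace_id", "ccr_content_id")


def provenance_record(fact_text, metadata, current_session_id):
    """Build the record the current session's trace should carry for an injected fact.

    Raises ValueError when a required reference is absent, so a fact with a
    broken chain is caught before it reaches the prompt.
    """
    missing = [key for key in REQUIRED_PROVENANCE_KEYS if not metadata.get(key)]
    if missing:
        raise ValueError(f"fact lacks provenance fields: {', '.join(missing)}")
    return {
        "injected_in_session": current_session_id,
        "fact": fact_text,
        "origin_session": metadata["session_id"],
        "origin_trace": metadata["langfuse_trace_id"],
        "source_content_id": metadata["ccr_content_id"],
    }
```

Failing loudly is a deliberate choice here: for a regulated workload, a fact without provenance is treated as unusable rather than injected with a warning.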
The architecture for full cross-session provenance already exists across these four tools. What's missing is the write-time decision to store the references — one metadata dict at session close is the difference between an audit trail and an audit approximation.
The CCR's reversibility property pays an unexpected dividend here. CCR retains the full source document — not the compressed version, the original — with its content ID. If a fact derived from a compressed chunk is ever questioned, the CCR can produce the exact bytes that were retrieved and summarized. The provenance chain traces not just to the session, but to the uncompressed source.
The Full Stack, Complete
Five layers. Six posts. The first time they appear together.
A well-architected agent stack running all five doesn't just produce correct answers in isolation. It produces correct, token-efficient, retrieval-accurate, cross-session-consistent answers — traceable to their source documents through a provenance chain that connects every session's output back to the knowledge that grounded it.
What Part 7 Builds
This post was the architectural map, deliberately. Before building the Mem0 + pgvector integration, the SKILL.md injection harness, and the provenance metadata chain, it's worth knowing that cross-session memory isn't one technique — it's a fork between declarative and procedural, and the branch you pick for each type of knowledge determines whether the audit chain closes.
Part 7 builds both branches and measures them against real sessions. The questions it answers: does declarative retrieval in Session 12 actually surface the right fact from Session 1? Does the provenance chain close when the metadata is wired at write time? And what does the failure mode look like when similarity search surfaces the wrong fact from a semantically adjacent but factually different session?
The architecture described here works on paper. Part 7 finds out where it doesn't.