The Agent That Forgets Everything: Cross-Session Memory Without Breaking the Audit Trail

The Session That Never Happened

Session 1: the compliance agent retrieves the correct OSFI capital buffer guideline — the one updated in March 2026, the one that raised the buffer requirement for Category 2 institutions from 2.0% to 2.5%. The reranker surfaces it at top_k=1. The harness runs verification. The output is correct and traceable. Session ends.

Session 12: same agent, same institution, different query phrasing. The context window is fresh. The agent has no record of Session 1 — no knowledge that the guideline was already retrieved and verified, no memory of which version was applied, no connection to the reasoning trace that produced the Session 1 output.

The reranker runs again from scratch. Maybe it surfaces the right chunk. Maybe slightly different query phrasing pulls an adjacent chunk from an older guidance document that was never marked superseded in the index. The agent produces output. It might be consistent with Session 1. It might not be. Neither the agent nor the harness has any way to know.

A regulator asks: "Did your agent apply the March 2026 capital buffer update consistently across all client assessments?"

The answer is unknowable. Not because the agent made a mistake. Because the architecture has no memory of what it did.

Longer Context Is Not Memory

The instinct when you understand this problem is to give the agent its history. Feed it the last 50 session transcripts. Let it see everything that came before. This feels like a solution. It is a different version of the same problem.

Raw session history fed into context grows as sessions accumulate. By session 50, you have compressed nothing and retrieved nothing — you have concatenated everything. The token budget that Parts 2, 3, and 5 spent reducing is now consumed by transcripts the model scans on every turn. The bloat problem returns, across sessions instead of within one.

The more important error is conceptual. Context pressure and state continuity are different problems. RTK, Headroom, and reranking solve context pressure — they control what the current session's tokens cost. State continuity is about what the agent carries from one session to the next. Confusing them produces an architecture that tries to solve continuity by extending context, which doesn't work and creates a new cost problem in the same move.

Memory isn't longer context. Memory is the right facts, extracted from session history and retrievable on demand. That distinction is the whole design decision.
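A rough back-of-the-envelope sketch makes the difference concrete. Every number here is an illustrative assumption (average transcript size, facts per session, fact size, top-k), not a measurement from this series:

```python
# Illustrative arithmetic only: every constant below is an assumption.
TOKENS_PER_SESSION = 30_000  # assumed average transcript size
FACTS_PER_SESSION = 5        # assumed facts extracted per session
TOKENS_PER_FACT = 40         # assumed size of one extracted fact
TOP_K_FACTS = 8              # assumed facts retrieved per query


def transcript_context_tokens(sessions_so_far: int) -> int:
    """Tokens injected if every prior transcript is concatenated into context."""
    return sessions_so_far * TOKENS_PER_SESSION


def memory_context_tokens(sessions_so_far: int) -> int:
    """Tokens injected if only the top-k extracted facts are retrieved."""
    available_facts = sessions_so_far * FACTS_PER_SESSION
    # Retrieval cost is capped by top-k, not by how many sessions exist.
    return min(TOP_K_FACTS, available_facts) * TOKENS_PER_FACT


for n in (1, 10, 50):
    print(n, transcript_context_tokens(n), memory_context_tokens(n))
```

Under these assumptions the concatenated history grows linearly with session count, while the retrieved-facts cost flattens out as soon as the store holds more than top-k facts.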

Two Kinds of Memory, Two Architectures

Agents need to remember two fundamentally different things across sessions. Conflating them produces the same category of error as using a cache where you need a database.

Two memory types — two architectures

Attribute	Declarative memory	Procedural memory
What	What was retrieved, decided, verified	How to do things in this environment
Tool	Mem0 + pgvector	SKILL.md / Hermes memory
Access	Vector retrieval: relevant facts surface per query	File injection: verbatim into system prompt, every session
ML	Yes (similarity search)	No (deterministic file read)
Changes	Accumulates across sessions	Rarely, when the environment changes

The architectural divide is load-bearing, not stylistic. Declarative facts are session-specific and query-dependent — you don't know which facts a Session 12 query will need until you see the query. Procedural knowledge is stable and universal — it's needed every session, doesn't change based on query content, and must be reliably present rather than probabilistically retrieved.

Treat them as the same type and you end up injecting everything into context (too much) or retrieving everything by similarity (unreliable). Treat them as distinct types with distinct access patterns and you get a system that is both efficient and consistent.
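The split can be sketched as two types with two access paths. The class names, the `build_context` helper, and the `retrieve` callable are all hypothetical, introduced only to illustrate the shape; they are not part of Mem0 or any harness API:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProceduralSkill:
    """Stable operational knowledge: injected verbatim on every session."""
    name: str
    text: str


@dataclass
class DeclarativeFact:
    """A fact extracted from a past session: retrieved only when relevant."""
    text: str
    metadata: dict = field(default_factory=dict)


def build_context(
    query: str,
    skills: list[ProceduralSkill],
    retrieve: Callable[[str], list[DeclarativeFact]],
) -> str:
    # Deterministic path: every skill, in a fixed order, regardless of the query.
    procedural = "\n\n".join(s.text for s in sorted(skills, key=lambda s: s.name))
    # Probabilistic path: only the facts the retriever judges relevant to this query.
    declarative = "\n".join(f"- {f.text}" for f in retrieve(query))
    return f"{procedural}\n\nRelevant facts from earlier sessions:\n{declarative}"
```

The query never influences which skills are injected, and the skills never pass through the retriever. That separation is the design choice the rest of the post builds on.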

Declarative Memory: Compress the Session, Don't Extend It

At the end of a session, the agent has accumulated context — retrieved documents, reasoning traces, verified outputs. The wrong approach stores the full session. The right approach extracts structured facts.

Mem0 converts session content into discrete memory entries. A session that processed 30,000 tokens of compliance document retrieval, harness verification, and multi-agent reasoning compresses to 5–10 structured facts:

"OSFI Guideline B-20 (March 2026): capital buffer Category 2 = 2.5% — Session 1"
"Institution XYZ assessment: verified correct against B-20 v3.2 — Session 1"
"Client threshold exemption: applicable under clause 4.1(b) — Session 3"

These facts are indexed in pgvector. When Session 12 opens, the incoming query is embedded and used to retrieve semantically relevant facts from all previous sessions — without the model scanning full transcripts.
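The retrieval step can be illustrated without any infrastructure. This is a toy sketch: a word-count "embedding" and an in-memory list stand in for a real embedding model and pgvector, and `embed`, `cosine`, and `retrieve_facts` are hypothetical names, not Mem0 or pgvector calls:

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9.]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_facts(query: str, facts: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k stored facts most similar to the query."""
    q = embed(query)
    ranked = sorted(facts, key=lambda f: cosine(q, embed(f)), reverse=True)
    return ranked[:top_k]


facts = [
    "OSFI Guideline B-20 (March 2026): capital buffer Category 2 = 2.5% (Session 1)",
    "Institution XYZ assessment: verified correct against B-20 v3.2 (Session 1)",
    "Client threshold exemption: applicable under clause 4.1(b) (Session 3)",
]
top = retrieve_facts("What capital buffer applies to Category 2 institutions?", facts, top_k=1)
print(top[0])
```

Only the selected facts reach the context window; the transcripts that produced them are never scanned. A real embedding model would also match paraphrases that share no words with the stored fact, which this toy version cannot do.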

The compress-by-default instinct appears at the cross-session level: a session's reasoning compresses to a retrievable set of assertions. The full session context stays in Langfuse traces and CCR cache — available if needed, not injected by default.

One implementation note that matters for the audit argument later in this post: Mem0 facts carry an optional metadata dict. The write step at session close should populate it:

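# Assumes `mem0` is an already-configured Mem0 Memory instance backed by pgvector,
# and that fact_text, agent_id, session_id, trace_id, content_id, and
# iso_timestamp were set by the harness at session close.
mem0.add(
    messages=[{"role": "user", "content": fact_text}],
    user_id=agent_id,
    metadata={
        "session_id": session_id,            # session that produced the fact
        "langfuse_trace_id": trace_id,       # reasoning trace for that session
        "ccr_content_id": content_id,        # retained original source content
        "source_document": "OSFI-B20-v3.2",  # document and version relied on
        "retrieved_at": iso_timestamp,
    },
)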

This is optional in Mem0's default configuration. For a regulated workload, it is not optional. The reason becomes clear in the audit section below.
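One way to make the metadata mandatory in practice is a small guard that the harness calls before any write. This is a sketch under assumptions: `require_provenance` and `MissingProvenanceError` are hypothetical helpers, not part of Mem0, and the required keys simply mirror the metadata dict above:

```python
REQUIRED_PROVENANCE_KEYS = (
    "session_id",
    "langfuse_trace_id",
    "ccr_content_id",
    "source_document",
    "retrieved_at",
)


class MissingProvenanceError(ValueError):
    """Raised when a fact is about to be stored without its audit references."""


def require_provenance(metadata: dict) -> dict:
    """Return metadata unchanged if every required key is present and non-empty."""
    missing = [key for key in REQUIRED_PROVENANCE_KEYS if not metadata.get(key)]
    if missing:
        raise MissingProvenanceError(f"fact metadata missing: {', '.join(missing)}")
    return metadata
```

Passing `metadata=require_provenance({...})` at the write site turns a forgotten reference into a failure at session close rather than a gap discovered months later.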

Procedural Memory: The Deterministic Path

Operational knowledge is different in kind from session facts. How to structure a capital adequacy report. How to handle an OSFI portal 403 error. Which section headers the compliance output format requires. These don't change based on query content and don't accumulate across sessions — they describe the environment and the task structure, not what happened in any specific session.

This knowledge belongs in SKILL.md files, loaded by the harness and injected verbatim into the system prompt on every session:

# Harness prompt construction — procedural memory injection
with open("skills/osfi_report_format.md", "r", encoding="utf-8") as f:
    procedural_context = f.read()

system_prompt = f"{procedural_context}\n\n{base_system_prompt}"

No vector search. No similarity scoring. No ML inference at injection time. The harness reads the file and prepends it to the context — deterministic, reproducible, identical on session 3 and session 300.
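Because the injection is a plain file read, it is cheap to record exactly which version of a skill a session received. The following sketch assumes the harness wants a content hash for its session log; `load_skill` and `build_system_prompt` are hypothetical helper names:

```python
import hashlib
from pathlib import Path


def load_skill(path: str) -> tuple[str, str]:
    """Read a skill file verbatim and return (text, SHA-256 hex digest)."""
    raw = Path(path).read_bytes()
    return raw.decode("utf-8"), hashlib.sha256(raw).hexdigest()


def build_system_prompt(skill_path: str, base_system_prompt: str) -> tuple[str, str]:
    """Prepend the skill text to the base prompt and return (prompt, skill digest)."""
    procedural_context, digest = load_skill(skill_path)
    # Same concatenation as the harness snippet; the digest is for the session log.
    return f"{procedural_context}\n\n{base_system_prompt}", digest
```

Two sessions that log the same digest received byte-identical procedural context, which is a claim no similarity-based retrieval path can make.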

The Hermes memory layer operates on the same principle: operational skills are loaded at agent initialization, not retrieved per turn. The agent doesn't search for how to format a report — it always knows, because the harness always told it.

Procedural memory is also where session-boundary enforcement rules live. "Always check for a more recent guideline version before using a cached declarative fact." "Never produce a compliance output without a verified source reference." These must apply on every session and must be injected on every session. They are not candidates for declarative retrieval — if a compliance constraint might be missed by a similarity search, it belongs in the deterministic path.
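A rule like "check for a more recent guideline version" can also be backed by a deterministic check in the harness rather than left to the model. This is a minimal sketch with two assumptions: source references follow the `<document>-v<version>` pattern seen in the metadata example, and `latest_versions` is a hypothetical registry the harness maintains of the newest known version per document:

```python
def is_fact_current(fact_metadata: dict, latest_versions: dict[str, str]) -> bool:
    """True only if the fact cites the latest known version of its source."""
    source = fact_metadata.get("source_document")
    if not source or "-v" not in source:
        # No verifiable source reference: treat the fact as unusable.
        return False
    doc_id, _, version = source.rpartition("-v")
    return latest_versions.get(doc_id) == version


cached_fact = {"source_document": "OSFI-B20-v3.2"}
print(is_fact_current(cached_fact, {"OSFI-B20": "3.2"}))
print(is_fact_current(cached_fact, {"OSFI-B20": "3.3"}))
```

A fact that fails the check is not necessarily wrong, but it should trigger a fresh retrieval instead of being injected as settled knowledge.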

The Determinism Divide at Cross-Session Altitude

Two design instincts have run through every layer of this series. At the cross-session layer, their purpose is clearest and their stakes highest.

Deterministic gate by default. Procedural memory is injected verbatim from a version-controlled file, every session. The same skill on Session 1 and Session 100. No model involved in the injection decision, no similarity threshold that might cause a miss, no distribution shift that might change what gets injected. The harness reads the file. The content goes in.

Probabilistic escalation as opt-in. Declarative memory uses vector similarity to decide which facts from previous sessions are relevant to the current query. The similarity score is an ML output — the right tool for session-specific knowledge that varies by query, but carrying the same non-determinism trade-off as every other ML layer in this stack.

Where this pattern has appeared across the series:

Layer	Deterministic gate	Probabilistic escalation
Harness (Part 1)	EVALUATOR — pure pytest subprocess	critic-subagent — advisory LLM
RTK (Part 2)	Hand-written per-command filters	No escalation tier needed
Headroom (Part 3)	SmartCrusher — structural JSON compression	Kompress — ML sentence scoring (opt-in)
Reranking (Part 5)	Bi-encoder ANN search — deterministic given index	Cross-encoder scoring — ML per query-pair
Session memory (Part 6)	Procedural injection — file read, verbatim	Declarative retrieval — vector similarity

Same shape at five altitudes. The compliance-critical path is always the deterministic gate. The query-dependent, context-sensitive path is always the probabilistic layer, enabled deliberately when the workload justifies it.

The Audit Problem: Five Links, One Missing

A compliance agent's cross-session provenance chain needs to be fully traversable. When a regulator asks "what was the basis for this output?", the chain must connect five links:

Cross-session provenance chain

Session 12 output
  ↓ injected as context from
Mem0 fact  ← chain breaks here by default
  ↓ produced by
Session 1 reasoning (Langfuse trace_id)
  ↓ grounded in
Session 1 retrieval (CCR content_id)
  ↓ retrieved from
Source document (OSFI B-20 v3.2, March 2026)

The chain is only as strong as its weakest link. By default, the Mem0 fact is a leaf node — it carries the extracted text and a timestamp, but no reference to the session that produced it, the Langfuse trace that recorded it, or the CCR content ID of the source it was derived from. The chain breaks at link two.

Component	What it provides today	What's missing
CCR (Headroom)	Retains original source document with TTL, `content_id`	No link to the Mem0 fact derived from it
Langfuse	Full Session 1 inference trace with `trace_id`	No link to the Mem0 fact produced from that trace
Mem0	Extracted fact, queryable by similarity	No `session_id`, `langfuse_trace_id`, or `ccr_content_id` on the fact
pgvector	Indexes facts for retrieval	Retrieves the fact — can't retrieve its provenance

The fix is the metadata write shown in the Declarative Memory section above. When the fact is written to Mem0 at session close, the session_id, langfuse_trace_id, and ccr_content_id travel with it. When that fact is retrieved in Session 12 and injected as context, the metadata is available to the harness. The Session 12 Langfuse trace references the injected fact and its provenance metadata. The chain closes.
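With the references stored, walking the chain becomes a lookup rather than a reconstruction. The sketch below assumes a hypothetical record shape: the Session 12 output lists the IDs of the facts injected into it, and `facts` maps each ID to a stored fact with its metadata. Neither shape is prescribed by Mem0 or Langfuse:

```python
PROVENANCE_KEYS = ("session_id", "langfuse_trace_id", "ccr_content_id", "source_document")


def trace_provenance(output_record: dict, facts: dict[str, dict]) -> list[str]:
    """Walk from a session output back to its source documents.

    Returns one line per hop; raises LookupError at the point the chain breaks.
    """
    hops = [f"output of session {output_record['session_id']}"]
    for fact_id in output_record["injected_fact_ids"]:
        meta = facts[fact_id].get("metadata", {})
        for key in PROVENANCE_KEYS:
            if not meta.get(key):
                raise LookupError(f"chain breaks at fact {fact_id}: no {key}")
        hops += [
            f"fact {fact_id} written in session {meta['session_id']}",
            f"reasoning trace {meta['langfuse_trace_id']}",
            f"retrieved content {meta['ccr_content_id']}",
            f"source document {meta['source_document']}",
        ]
    return hops
```

A fact written without metadata fails on the first missing key, which is exactly the default-configuration break described above.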

The architecture for full cross-session provenance already exists across these four tools. What's missing is the write-time decision to store the references — one metadata dict at session close is the difference between an audit trail and an audit approximation.

The CCR's reversibility property pays an unexpected dividend here. CCR retains the full source document — not the compressed version, the original — with its content ID. If a fact derived from a compressed chunk is ever questioned, the CCR can produce the exact bytes that were retrieved and summarized. The provenance chain traces not just to the session, but to the uncompressed source.

The Full Stack, Complete

Five layers. Six posts. The first time they appear together.

The agent stack — all five layers

Harness

Reliability within a session — enforces the verify loop, deterministic gate, blast radius control

RTK

Shell tokens within a session — compresses subprocess output at the Bash boundary

Headroom

Content tokens within a session — compresses RAG chunks, JSON results, prose

Reranking

Retrieval quality within a session — controls which tokens arrive before the model sees them

Session Memory

State continuity across sessions — what survives when the context window closes

The first four layers operate within a session. Only the fifth crosses the session boundary.

A well-architected agent stack running all five doesn't just produce correct answers in isolation. It produces correct, token-efficient, retrieval-accurate, cross-session-consistent answers — traceable to their source documents through a provenance chain that connects every session's output back to the knowledge that grounded it.

What Part 7 Builds

This post was the architectural map, deliberately. Before building the Mem0 + pgvector integration, the SKILL.md injection harness, and the provenance metadata chain, it's worth knowing that cross-session memory isn't one technique — it's a fork between declarative and procedural, and the branch you pick for each type of knowledge determines whether the audit chain closes.

Part 7 builds both branches and measures them against real sessions. The questions it answers: does declarative retrieval in Session 12 actually surface the right fact from Session 1? Does the provenance chain close when the metadata is wired at write time? And what does the failure mode look like when similarity search surfaces the wrong fact from a semantically adjacent but factually different session?

The architecture described here works on paper. Part 7 finds out where it doesn't.