"You cannot improve what you cannot see. In AI systems, you cannot even debug what you cannot see."
Why Classic Observability Breaks for LLMs
Traditional software observability was built on a foundational contract: the same input in, the same deterministic output out. A request hits your API, you trace it through a known call graph, and if something breaks, the stack trace points you to the exact line. Every debugging session is reproducible by construction.
LLM-powered systems shatter this contract entirely.
A single user query today can spawn a non-deterministic, multi-model, multi-hop workflow spanning guardrails, classifiers, agents, tool calls, MCP servers, re-rankers, reasoning models, formatters, and output validators — often with different LLMs at each stage, and no two runs producing identical intermediate states. The system is no longer a function. It is a living pipeline.
In 2026, as enterprises across financial services, healthcare, and technology run AI assistants in production at scale, observability isn't a nice-to-have. It is the engineering discipline that separates teams shipping reliable AI from teams shipping regrets.
The first mistake engineers make when approaching this: treating LLM observability like microservice observability with a "model" label added. It isn't. The architecture is fundamentally different, the failure modes are fundamentally different, and the instrumentation requirements are fundamentally different.
This post is about what you actually need to see — and why most teams are blind to the most important signals until something breaks in production.
Part I — The Five Observability Lenses
LLM observability needs five distinct lenses, not three. The classic pillars — Logs, Metrics, and Traces — are necessary but not sufficient. Two new pillars emerge specifically from AI workloads, and the teams that skip them are the ones that get burned in production.
- Traces: OTel GenAI span attributes per model call; W3C traceparent across every agent hop; agent iteration spans, not log lines
- Metrics: p50/p95/p99 latency histograms per model; token economics and USD cost counters; llm_finish_reason_total as a canary metric
- Logs: sanitized prompts with PII stripped at ingest; agent reasoning journal with timestamps; ClickHouse hot/warm/cold tiering
- Events: gen_ai.content.prompt (the exact payload sent); gen_ai.content.completion (the exact response); causally attached to spans, no correlation guesswork
- Behavioral signals: reasoning confidence histograms; semantic drift vs golden baseline; guardrail false-positive rate tracking
Lens 1 — Distributed Traces (The Spine)
Every user query is a trace. Every meaningful operation within that query is a span. This is the same conceptual foundation as microservice observability — but with three critical extensions that most teams miss when they first instrument LLM systems.
Extension A — LLM Call Spans with OTel GenAI Attributes
The OpenTelemetry GenAI Semantic Conventions (stable, 2025) define a standardized attribute namespace that works across providers. You attach attributes like gen_ai.system (anthropic, openai, google), gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and critically, gen_ai.response.finish_reason to every LLM call span. This is your contract with every model provider. Model-agnostic dashboards become possible. You stop writing one-off instrumentation per provider.
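A minimal sketch of what that attribute set looks like on one call span. This is plain Python for illustration; in real instrumentation you would set these via your OTel SDK's span.set_attribute, and the helper name here is hypothetical:

```python
def genai_span_attributes(provider, model, usage, finish_reason):
    """Build the GenAI attribute set for one LLM call span.

    Library-agnostic sketch; attribute names follow the OTel GenAI
    semantic conventions as used in this post.
    """
    return {
        "gen_ai.system": provider,                       # e.g. "anthropic"
        "gen_ai.request.model": model,                   # e.g. "claude-sonnet-4"
        "gen_ai.usage.input_tokens": usage["input"],
        "gen_ai.usage.output_tokens": usage["output"],
        "gen_ai.response.finish_reason": finish_reason,  # "stop", "max_tokens", ...
    }

attrs = genai_span_attributes(
    "anthropic", "claude-sonnet-4",
    {"input": 1200, "output": 450}, "stop",
)
```

Because the attribute names are provider-neutral, the same dashboard query works whether the span came from an Anthropic, OpenAI, or Google call.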
The full attribute set per call span: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.reasoning_tokens, gen_ai.response.finish_reason.
Extension B — W3C Context Propagation Across Agent Hops
When Agent A calls Tool B, which calls MCP Server C, which calls Database D — the traceparent header must flow through every boundary. Without it, you get four orphaned traces instead of one connected trace. You lose causality entirely. You cannot answer "why did this response take 3.2 seconds?" because you can't see across the hop boundaries. For MCP specifically, the convention is to inject the trace context via the _meta field in the MCP protocol payload.
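A sketch of that injection, assuming the MCP payload is a plain JSON-RPC-style dict. The header layout follows the W3C Trace Context format (version-traceid-parentid-flags); the IDs shown are illustrative:

```python
def inject_trace_context(mcp_request: dict, trace_id: str, span_id: str) -> dict:
    """Inject W3C trace context into an MCP request's _meta field.

    Sketch of the convention described above: the traceparent header
    value rides inside the protocol payload so the next hop can
    continue the same trace.
    """
    meta = mcp_request.setdefault("params", {}).setdefault("_meta", {})
    meta["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return mcp_request

req = inject_trace_context(
    {"method": "tools/call", "params": {"name": "query_db"}},
    "4bf92f3577b34da6a0b2d5e8c9f1a342",  # 16-byte trace id (illustrative)
    "00f067aa0ba902b7",                  # 8-byte parent span id (illustrative)
)
```

The receiving MCP server extracts the value, starts its spans as children of that context, and the four hops collapse into one connected trace.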
Example header: traceparent: 00-4bf92f3577b34da6a0b2d5e8c9f1a342-&lt;parent-span-id&gt;-01 (version, trace ID, parent span ID, flags). The same trace ID rides inside the _meta field of every MCP payload.
Extension C — Agent Iteration Spans
Agents are loops. LangGraph, CrewAI, AutoGen — they all iterate, retry, and self-correct. Each iteration must be a child span, not a note in a log line. When an agent retries six times because a tool is failing silently, you need to see all six spans with their intermediate states. The final failure isn't the only data point that matters — the pattern of retries tells you why it failed.
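A library-agnostic sketch of the pattern, recording each iteration as its own span-like record so the retry pattern survives. The run_agent helper and its field names are illustrative, not any framework's API:

```python
def run_agent(task, call_tool, max_iterations=6):
    """Run an agent loop, emitting one child-span record per iteration.

    `spans` stands in for a real tracer; in production each dict would
    be a child span of the enclosing agent_run span.
    """
    spans = []
    for i in range(max_iterations):
        span = {"name": "agent_iteration", "iteration": i, "status": "ok"}
        try:
            result = call_tool(task)
            span["output_preview"] = str(result)[:200]  # intermediate state
            spans.append(span)
            return result, spans
        except Exception as exc:
            span["status"] = "error"
            span["error.type"] = type(exc).__name__
            spans.append(span)  # the retry *pattern* is the diagnostic data
    return None, spans

def flaky_tool(task):          # a tool failing silently on every call
    raise TimeoutError("tool unavailable")

result, spans = run_agent("lookup user", flaky_tool)
```

After the silent tool failure above, the trace contains six error spans, one per attempt. Seeing all six is what distinguishes "the tool is down" from "the agent is confused."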
trace: user_query · span: agent_run [tool silently failing]
Lens 2 — Metrics (The Dashboard)
Metrics answer the question: "Is the system healthy right now?" In LLM systems, the metric families you need are significantly different from what you'd define for a REST API.
Latency metrics should be histograms, not averages — and they should be broken down by model and by operation. Request duration at p50/p95/p99 per model, agent orchestration duration per agent name, and tool call duration per MCP server. The p99 is where your regressions will hide.
Token economics are a new class of metric with no equivalent in traditional systems. You need total token counters broken out by model and by direction (input vs output). You need a cost counter in USD per model per environment. This is how you get PnL accountability at the infrastructure level — not "we spent $X on AI last month," but "this specific agent workflow costs $0.003 per user session and we have 50,000 daily actives."
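A minimal sketch of that counter pair, with placeholder per-million-token rates. The model name and rates are illustrative, not real pricing; in production these would be Prometheus counters labeled by model and direction:

```python
from collections import defaultdict

# Illustrative USD rates per million tokens; NOT real provider pricing.
RATES_PER_MTOK = {
    ("claude-sonnet-4", "input"): 3.00,
    ("claude-sonnet-4", "output"): 15.00,
}

tokens_total = defaultdict(int)      # (model, direction) -> token count
cost_usd_total = defaultdict(float)  # model -> accumulated USD

def record_usage(model, input_tokens, output_tokens):
    """Increment token and cost counters after each LLM call."""
    for direction, n in (("input", input_tokens), ("output", output_tokens)):
        tokens_total[(model, direction)] += n
        cost_usd_total[model] += n / 1_000_000 * RATES_PER_MTOK[(model, direction)]

record_usage("claude-sonnet-4", 1200, 450)   # one call's usage
```

Divide cost_usd_total by session count and you have the per-session cost figure the paragraph above describes.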
Reliability metrics are where the most valuable signals live. Track guardrail violation counters by policy and action, agent retry counters by reason, and — this one is the canary in the coal mine — llm_finish_reason_total broken out by reason.
If more than 5% of your LLM calls are finishing with reason max_tokens, your context window budget is wrong. Your prompts are growing unbounded, or your chain-of-thought instructions are too verbose. The model never tells you this with an error. It just stops mid-sentence. The metric is the only thing that surfaces it.
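A sketch of the canary check against that 5% threshold; the counter shape is illustrative (in production this would be a PromQL ratio over llm_finish_reason_total):

```python
def max_tokens_ratio(finish_reason_counts: dict) -> float:
    """Share of LLM calls truncated by the token budget."""
    total = sum(finish_reason_counts.values())
    return finish_reason_counts.get("max_tokens", 0) / total if total else 0.0

counts = {"stop": 930, "max_tokens": 60, "tool_use": 10}
alert = max_tokens_ratio(counts) > 0.05   # the 5% threshold above
```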
Lens 3 — Logs (The Evidence Trail)
Logs answer: "What exactly happened in this specific request?"
LLM logs are architecturally different from traditional logs in one critical way: they contain reasoning — the intermediate steps an agent took, why it chose a tool, what it considered and rejected. This is enormously valuable for debugging, but it creates two new problems teams aren't prepared for.
What to log per request: the sanitized prompt (PII stripped before storage — critical for regulatory compliance), tool call inputs and outputs at appropriate truncation, the agent's reasoning journal with timestamps, the guardrail verdict, and correlation IDs linking trace_id, session_id, and hashed customer identifier. These five things together let you reconstruct any request fully after the fact.
The volume problem: LLM log volumes are 10 to 100 times higher than traditional API logs. A single complex query with chain-of-thought reasoning can generate 50KB of structured log data. Storing this in Elasticsearch at full ingest is a cost trap — you will pay orders of magnitude more than the model inference itself. The right architecture is ClickHouse or OpenSearch with hot/warm/cold tiering, sized against your actual query patterns.
Lens 4 — OTel Events (The Timeline of What Was Actually Sent)
OTel Events are the pillar most teams discover only after a production incident. In the OTel GenAI specification, events are span-attached records that capture what was actually sent to the model and what it actually returned.
The key event types are gen_ai.content.prompt — the exact payload sent to the model, gen_ai.content.completion — the exact response, gen_ai.tool_call — each tool invocation with full input and output, and mcp.notification — server-sent events from MCP servers.
What makes events different from logs is that they are causal — they are attached to a specific span in a specific trace. When you see a bad LLM output in your quality dashboard, you click through to the Tempo trace, expand the span, and see the exact prompt that produced it. No hunting through Kibana with regex patterns. No trying to correlate a log line timestamp with a trace ID. The prompt and the span are the same object.
This is the foundational data model that Langfuse and Arize Phoenix are both built around. It's also the raw material for building prompt evaluation datasets, regression test suites, and model upgrade validation pipelines.
Lens 5 — Behavioral Signals (The AI-Specific Layer)
This lens has no equivalent in traditional observability. It covers quality and safety signals that emerge from the AI behavior itself — not from infrastructure metrics or log entries.
Confidence and reasoning quality signals: reasoning confidence histograms extracted from structured model output, intent classification scores per intent type, and the spread between your re-ranker's top score and bottom score. A collapsing spread is a signal that your retrieval quality is degrading.
Drift and degradation signals: semantic similarity scores comparing current responses against a golden baseline, prompt token growth rate over time (growing prompts are often a sign of context pollution accumulating across sessions), and hallucination detection rate evaluated by a judge model running in a separate evaluation pipeline.
Guardrail signals: PII entity counts by entity type (this tells you what your users are actually submitting — a compliance and product insight simultaneously), toxicity score distributions over time, and per-policy compliance check pass/fail rates.
These signals feed directly into your LLM evaluation pipeline — separate from your production metrics stack, but correlated to it. The behavioral signal layer is what transforms your monitoring from "is the system up?" to "is the system good?"
Curiosity question: How would you detect that your model started degrading in output quality three weeks into a production deployment — without user complaints? What signal would surface it first?
Part II — The Production Challenges Nobody Warns You About
Traditional observability assumes your system behaves the same way every time. LLMs assume it won't.
Challenge 1 — Non-Determinism Breaks the Concept of Replay
In traditional debugging, you replay a request with the same inputs to reproduce the bug. The test is deterministic. If it happened once, you can make it happen again.
With LLMs, temperature > 0 means the same input produces different outputs on every call. The exact bug you saw in production may not reproduce on demand. You cannot unit test your way to confidence without addressing this architectural reality.
The engineering response: You must log the full context window snapshot at the moment of inference — system prompt version, full conversation history, all tool results, and the user message — not just the user query string. This snapshot is what allows you to replay the exact model call in an offline evaluation harness, even if the live system would produce a different answer.
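A minimal sketch of such a snapshot, assuming the context pieces are already in hand at call time; the field names are illustrative:

```python
import json
import time

def snapshot_context(system_prompt_version, messages, tool_results, user_message):
    """Serialize the full context window at the moment of inference.

    This is the replay artifact: everything the model saw, not just the
    user query string.
    """
    return json.dumps({
        "captured_at": time.time(),
        "system_prompt_version": system_prompt_version,
        "conversation_history": messages,
        "tool_results": tool_results,
        "user_message": user_message,
    })

snap = snapshot_context(
    "a3f891c",
    [{"role": "user", "content": "What is my order status?"}],
    [{"tool": "orders_lookup", "result": {"status": "shipped"}}],
    "Where is it now?",
)
```

Feeding this snapshot back into an offline evaluation harness reproduces the exact model call, even though the live system would answer differently.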
This approach is expensive. At 10,000 requests per day with an average context of 2,000 tokens, selective sampling is not optional — it is the architecture. Sample 100% of failure cases, 100% of guardrail triggers, and a statistically meaningful portion of normal traffic. Everything else gets dropped at ingest.
Challenge 2 — The Context Window Is a Black Box by Default
The model's working memory — the context window — is entirely opaque unless you instrument it explicitly. From the outside, you cannot tell whether you are at 30% or 90% capacity, what sections are consuming the most tokens, or whether the context is growing across conversation turns.
An agent running for 50 iterations on a long-lived task can exhaust its context budget by iteration 30. When it does, the model begins silently forgetting earlier context. There is no error in your logs. There is no exception thrown. You simply see degraded, inconsistent output with no obvious cause.
The engineering response: Build context window monitoring that wraps every LLM call and emits per-section token consumption before the request is made. Track fill percentage by section (system prompt, tool results, conversation history, new input), total token counts per request, and the growth rate of context across turns in a session. Alert when any session crosses 80% utilization — that is your early warning threshold.
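A sketch of that per-section check, assuming a 200K-token window; the limit, threshold, and section names are illustrative:

```python
CONTEXT_LIMIT = 200_000   # model's context window, illustrative
ALERT_THRESHOLD = 0.80    # the 80% early-warning line described above

def context_fill(sections: dict) -> dict:
    """Per-section token consumption plus overall fill, emitted before each call."""
    total = sum(sections.values())
    fill = total / CONTEXT_LIMIT
    return {
        "total_tokens": total,
        "fill_pct": fill,
        "by_section": sections,
        "alert": fill >= ALERT_THRESHOLD,
    }

report = context_fill({
    "system_prompt": 4_000,
    "tool_results": 110_000,    # long-running agents accumulate these
    "history": 52_000,
    "new_input": 1_000,
})
```

Here tool results alone consume over half the window, which is exactly the kind of silent growth the paragraph above describes.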
Curiosity question: Do you know right now whether any of your running agents are at risk of context exhaustion mid-session? How would you find out?
Challenge 3 — Prompts as Unversioned Code
Your system prompts are code. They encode behavior, constraints, tone, tool usage rules, and output format. They can be changed by any engineer with repository access, deployed without a pipeline, and rolled back only if someone remembers what the previous version was.
This creates a class of production incidents that are invisible to your observability stack. A developer updates a system prompt on Friday afternoon. By Monday morning, your reasoning confidence metrics have dropped 12%, agent retries are up 40%, and two compliance checks are failing intermittently. You spend three hours looking at infrastructure metrics before someone checks Git and finds the Friday prompt change.
The engineering response: Hash every system prompt and include that hash as an attribute on every LLM span. This gives you the ability to query "what were my quality metrics at prompt version a3f891c vs b4d902e?" in any trace analytics system. Treat prompt deployments as first-class deployments — they belong in feature flags with canary rollout capabilities and documented rollback procedures. Langfuse's prompt management module is the dedicated tool for this specific problem. For teams in production, it is incident prevention infrastructure, not a developer experience improvement.
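A sketch of the hashing step; the attribute name and the seven-character truncation are illustrative choices, not a standard:

```python
import hashlib

def prompt_version(system_prompt: str) -> str:
    """Short, stable identifier for this exact prompt text.

    Attach it to every LLM span, e.g. via
    span.set_attribute("prompt.version", prompt_version(p)).
    """
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:7]

v = prompt_version("You are a helpful assistant. Always cite sources.")
```

Any one-character edit to the prompt produces a new version string, so "quality metrics at version X vs version Y" becomes a plain group-by in your trace backend.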
Challenge 4 — Multi-Model Pipelines Hide Their Own Cost
A production LLM pipeline typically looks like: fast classifier model (Haiku class) → orchestrator model (Sonnet class) → reasoning model with extended thinking → synthesis model → output formatter. Each stage has its own cost per call, its own latency profile, and its own quality contribution.
Without explicit cost attribution at the span level, you get one number at the end: total inference cost per request. That number is not actionable. You cannot tell whether you should optimize the classifier, reduce the reasoning model's context, or replace the formatter with a deterministic template.
The engineering response: Every LLM call span should carry an estimated cost attribute (calculated from token counts multiplied by the model's per-token rate) and a pipeline stage attribute identifying its role in the workflow. When you aggregate cost by stage across thousands of requests, you will often find one stage consuming 60% of budget while contributing 10% of output quality. That is the optimization target — and you cannot see it without per-span cost tagging.
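A sketch of per-span cost tagging; the model names, stage labels, and per-million-token rates are placeholders:

```python
# Illustrative (input_rate, output_rate) in USD per million tokens.
RATES = {"haiku": (0.80, 4.00), "sonnet": (3.00, 15.00)}

def span_cost_attrs(stage: str, model: str, in_tok: int, out_tok: int) -> dict:
    """Cost and stage attributes to attach to one LLM call span."""
    in_rate, out_rate = RATES[model]
    estimated = in_tok / 1e6 * in_rate + out_tok / 1e6 * out_rate
    return {
        "pipeline.stage": stage,               # role in the workflow
        "llm.cost.estimated_usd": round(estimated, 6),
    }

attrs = span_cost_attrs("classifier", "haiku", 500, 20)
```

Aggregating llm.cost.estimated_usd grouped by pipeline.stage across thousands of traces is what surfaces the 60%-of-budget stage.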
Challenge 5 — Tool Calls Are Distributed Systems Problems in Disguise
MCP servers, external APIs, database tools — tool calls look like function calls from the agent's perspective, but they are synchronous blocking network calls inside an asynchronous agent loop. They inherit every distributed systems failure mode: timeouts, partial failures, retry storms, cascading degradation, and back-pressure.
The difference from a normal microservice dependency is the failure behavior. When a tool call fails inside an agent iteration, it does not return an HTTP 503 to a client. It returns an error payload to the language model, which makes a decision about what to do next. It might retry. It might hallucinate a plausible-looking response. It might enter a recovery loop that looks functionally correct from the outside until it isn't.
The engineering response: Instrument every tool call the same way you instrument a service dependency — with duration histograms, success rate counters, retry counters by reason, and error type categorization. And add circuit breaker logic at the tool call boundary. A tool with a 10% error rate should trigger degraded-mode behavior from the agent, not an unbounded retry loop that burns tokens while the tool remains unavailable.
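A minimal rolling-window circuit breaker sketch at the tool boundary; the window size and 10% threshold are illustrative:

```python
class ToolCircuitBreaker:
    """Open the circuit when a tool's recent error rate crosses a threshold,
    so the agent falls back to degraded-mode behavior instead of an
    unbounded, token-burning retry loop."""

    def __init__(self, window=50, error_threshold=0.10):
        self.window = window
        self.error_threshold = error_threshold
        self.results = []   # rolling window: True = success, False = error

    def record(self, ok: bool):
        self.results.append(ok)
        self.results = self.results[-self.window:]

    def allow(self) -> bool:
        if len(self.results) < self.window:
            return True     # not enough signal yet; stay closed
        error_rate = self.results.count(False) / len(self.results)
        return error_rate < self.error_threshold

breaker = ToolCircuitBreaker()
for ok in [True] * 44 + [False] * 6:   # 12% recent error rate
    breaker.record(ok)
```

With a 12% error rate in the window, allow() returns False and the agent should surface a degraded answer rather than call the tool again.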
Challenge 6 — Guardrails Are Themselves Observable Systems
Teams implement guardrails — PII detection, toxicity scoring, compliance checks — and then measure whether the AI system passes them. What they rarely measure is whether the guardrails themselves are performing correctly.
What is your PII detector's false positive rate on your specific user population? Is your toxicity classifier blocking valid technical responses because they contain sensitive-domain terminology? Is your compliance check accepting outputs it should be rejecting?
Guardrail failures in either direction are invisible without instrumentation on the guardrail layer itself. A PII detector blocking legitimate responses at a 2% rate is a product reliability problem masquerading as a safety feature. You only discover it if you are measuring it.
Curiosity question: If your input sanitizer started over-redacting at a 5% rate today, how long would it take your team to detect it?
Part III — How AI Coding Tools Observe Themselves
When the AI is operating on your codebase, the codebase is the context. And codebases are large, structured, versioned, and deeply relational — none of which traditional LLM observability was designed for.
The Core Architectural Difference
When you ask Claude Code "add pagination to the user list endpoint," you are not sending a text query into a chat interface. You are initiating a multi-phase agentic workflow that operates on a structured artifact — your codebase — with type dependencies, test suites, build system contracts, and version history. The observability requirements are architecturally distinct from conversational AI.
The Four-Phase Pipeline
Claude Code, GitHub Copilot Workspace, and Windsurf Cascade all share this common architecture with minor variations:
Phase 1 — Context Harvesting (typically 200-400ms)
Before the model receives any input, a pre-processing pipeline runs. It performs a file tree scan to detect structure, language, and framework. It runs an embedding similarity search against the user's query to select the 10-20 most relevant files. It extracts Abstract Syntax Trees to resolve symbol references — answering questions like "what other functions depend on getUserList?" It pulls recent commit history for files it identified as relevant.
This phase is heavily observable as infrastructure — cache hit rates on embeddings, retrieval latency, the ratio of files scanned to files selected, and the embedding similarity threshold that determines relevance cutoffs. These are the metrics that determine whether the model starts with the right context window contents.
Phase 2 — Reasoning and Planning
The model receives the structured context and generates a plan — a sequence of edits — before making any of them. This is chain-of-thought reasoning applied to code. Observable signals here include the number of file edits planned, the estimated scope of change, and the model's self-assessed confidence extracted from its structured output.
Phase 3 — Execution with Verification
Each planned edit is a tool call against the filesystem: replace a string in a file, create a new file, execute a bash command to run the test suite or type checker. The critical observable signal in this phase is the verification loop — how many edit attempts were required before all tests passed? A task completing on the first attempt is qualitatively different from a task that required seven iterations with test failures at each step, even if both end with green tests.
Phase 4 — Outcome Capture
Every accepted edit, every reverted edit, and every follow-up edit the developer makes immediately after the AI's edit is recorded as outcome telemetry. This is not observability for debugging — it is observability as training data. The acceptance rate on AI suggestions, broken down by file type, framework, and edit type, is the ground truth quality signal that drives model improvements.
Windsurf Cascade added real-time diff streaming in late 2025 — each edit streams token-by-token with immediate syntax validation, requiring streaming span events to track partial edit quality. GitHub Copilot Workspace added explicit "speculation spans" — when a model proposes a change below a confidence threshold, it emits a signal that triggers mandatory human review before application.
The Tiered Context Problem
The single hardest engineering problem in AI coding tools: a production codebase is 100,000+ lines and a context window is 200,000 tokens. They do not fit. The solution is tiered context retrieval, and it is directly observable through retrieval spans.
Tier 1 — open files and cursor position — always enters the context. Tier 2 — the top 20 files by embedding similarity to the user's query — fills the bulk of the remaining budget. Tier 3 — AST-resolved symbol dependencies for referenced but not selected files — provides structural scaffolding without full file content. Tier 4 — docstrings and function signatures only — covers the rest of the codebase at minimal token cost.
The metric context_window_tokens_by_tier is the single most diagnostic number for understanding why a particular AI coding session produced poor output. If tier 1 alone consumes 60% of the context budget — because you're working in a very large file — the model is working with half the repo coverage it would normally have. Retrieval quality degrades measurably.
Curiosity question: When an AI coding tool generates an edit that's wrong because it didn't see a relevant file, is that a model quality problem or a retrieval quality problem? How would you instrument for the distinction?
Part IV — Inside an Enterprise AI Pipeline at Scale
When you ask for "a Blade Runner-style image of a jazz musician in neon rain," you don't know you just triggered a 7-model, 4-service, 3-second pipeline. The engineering team sees every millisecond of it.
A Real Multi-Model Pipeline, Step by Step
Enterprise AI products are not single-model systems. They are orchestrated pipelines of specialized models, and every stage is independently observable.
Step 1 — Intent Classification (under 50ms)
A fast, lightweight classifier model determines what kind of request this is — image generation, text response, code execution, data analysis — and routes to the appropriate downstream pipeline. Observable signals: detected intent, classification confidence, and routing target. A confidence score below 0.7 here is a signal that the query is ambiguous or falls outside the classifier's training distribution.
Step 2 — Prompt Enrichment (100-200ms)
A text model silently rewrites the user's prompt into an optimized version for the downstream model. For image generation, this means expanding style references, adding technical quality directives, and checking the enriched result against content policy before it moves forward. The user never sees this transformation. Observable signals: input token count, enriched token count (a proxy for how much the model extended the prompt), enrichment model used, and the safety pre-check result on the enriched version.
Step 3 — Content Policy Evaluation (parallel, 70-90ms)
A dedicated safety classifier evaluates the enriched prompt across multiple policy dimensions in parallel with Step 2. In mature implementations, this runs concurrently — you do not pay the latency twice. Observable signals: per-dimension toxicity scores, policy verdict, and evaluation latency. The policy verdict is one of the most important events in the entire pipeline because a false positive block means the user resubmits, increasing both user friction and downstream compute cost.
Step 4 — Primary Model Inference (1,200-2,800ms)
The dominant latency component of the pipeline. Observable signals: model version, resolution requested, inference duration, GPU memory consumed, and the VRAM utilization at peak. GPU memory as a span attribute is underutilized in most observability stacks — but it is the signal that tells you when you are approaching model serving capacity limits before the latency spikes happen.
Step 5 — AI Quality Evaluation (300-500ms)
A secondary model evaluates the primary model's output against the original prompt. An AI judging an AI. Observable signals: prompt adherence score, aesthetic quality score, artifact detection flag, and the resulting action — accept and deliver, or regenerate. The regenerate rate is a first-order signal for both model quality and prompt enrichment quality. A rising regenerate rate is a quality regression.
Step 6 — Post-Processing and Delivery
Upscaling, C2PA watermark injection (mandatory under EU AI Act 2025 for AI-generated content identification), CDN upload, and signed URL generation. Observable signals: upscaling applied, watermark status, CDN upload latency, and total end-to-end pipeline duration.
Sampling at 100 Million Requests per Day
OpenAI in 2026 processes over 100 million requests per day. A complex multimodal pipeline with 12 spans per trace generates 1.2 billion spans daily. At 500 bytes per span, that is 600GB of trace data per day before any sampling is applied. Storing all of it is not an engineering decision — it is an accounting decision with the answer no.
The sampling strategies that work at this scale:
Head-based sampling makes the keep/drop decision at ingestion time. Use a deterministic hash of the user ID to consistently track a fixed percentage of users at full fidelity — so any individual user's session can be debugged end-to-end when needed. Always keep 100% of error traces.
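A sketch of deterministic, user-keyed head sampling; the 2% keep rate is illustrative:

```python
import hashlib

KEEP_PCT = 2   # percentage of users traced at full fidelity, illustrative

def keep_trace(user_id: str, is_error: bool) -> bool:
    """Head-based sampling decision at ingestion time.

    Hashing the user ID (rather than the trace ID) means the same users
    always fall in the sampled cohort, so any sampled user's session can
    be debugged end-to-end.
    """
    if is_error:
        return True   # 100% of error traces are always kept
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < KEEP_PCT
```

The same user ID always maps to the same bucket, so the decision is stable across requests, services, and days.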
Tail-based sampling evaluates the complete trace after it finishes and makes the keep/drop decision on richer signal. Keep every trace where a safety policy triggered a block. Keep every trace in the p99+ latency tail. Keep a baseline percentage of everything else. The OTel Tail Sampling Processor implements this with configurable policies.
Canary population tracing is the third tier — a small, consent-based cohort of users for whom every span, every event, and every token is captured. This is the population used for model A/B testing and quality regression measurement.
Time-To-Good-Output: The Metric Worth Tracking
Every AI product team tracks latency. Very few track the metric that actually correlates with product quality: Time-To-Good-Output (TTGO).
TTGO measures the elapsed time from when a user submits a request to when they see a result they don't immediately modify, retry, or discard. It is not pipeline latency. It is pipeline latency scaled by the expected number of generation attempts.
The compounding effect is significant. A model that is 30% faster but causes 40% more regenerations has a worse TTGO than the slower model. A guardrail that blocks 3% of valid requests forces 3% of users to resubmit, adding a full additional pipeline latency to those sessions. Prompt enrichment that produces high-quality results on the first pass reduces TTGO by eliminating the regeneration loop entirely.
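One simple way to compute it, assuming each regeneration triggers an independent full pipeline pass (a geometric retry model; the specific latencies and rates below are illustrative):

```python
def ttgo(pipeline_latency_s: float, regen_rate: float) -> float:
    """Expected time to an accepted output under a geometric retry model:
    each regeneration is an independent full pipeline pass."""
    return pipeline_latency_s / (1.0 - regen_rate)

# Illustrative numbers: a 30% faster pipeline whose regeneration rate is
# 40% higher (0.50 -> 0.70) ends up with a *worse* expected TTGO.
baseline = ttgo(1.0, 0.50)   # 1.0s pipeline, 50% regenerate
faster   = ttgo(0.7, 0.70)   # 0.7s pipeline, 70% regenerate
```

At these rates the faster model loses: raw latency improved, but the regeneration loop more than eats the gain.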
Computing TTGO requires frontend and backend trace correlation — linking the client-side user action (accepted, regenerated, abandoned) back to the corresponding server-side trace. The mechanism is W3C trace context propagation into the frontend: the same traceparent header that flows through the backend pipeline is passed to the client, which emits the user's outcome as a span event back to the same trace ID. This is why AI companies instrument their web clients with the same distributed tracing primitives as their backend services.
Curiosity question: What would it take for your team to compute TTGO today? What data would be missing, and at which layer?
How Claude Observes Claude
Anthropic's pipeline for Claude.ai follows the same observability foundations but adds a layer unique to a safety-first architecture: Constitutional AI evaluation spans.
For every response Claude generates, a background evaluation pipeline runs three checks in parallel without blocking delivery: a reward model scores helpfulness, a Constitutional AI classifier evaluates harmlessness against Anthropic's published principles, and a factual consistency checker cross-references the response against any retrieved context that informed it.
These evaluations do not make the response wait. They complete asynchronously and write their verdicts to an evaluation store. Over millions of conversations, these evaluation results become the RLHF training signal that shapes the next model version.
The architectural insight here is profound: the evaluation pipeline is the observability system for model quality, and the observability system is simultaneously the training data pipeline for model improvement. The feedback loop is not an add-on to the architecture — it is the architecture.
The Three Things That Actually Matter
The teams building reliable AI in 2026 are not the ones with the most dashboards. They are the ones who internalized three things early:
You cannot set SLOs for systems you cannot measure. Instrument with specific intent — start with what you are defending before you decide what to instrument.
Production traces, behavioral signals, and evaluation events are not debugging artifacts. Teams that wire observability into their fine-tuning pipelines compound in quality faster than teams treating monitoring as a separate concern.
In LLM systems, the context window — its contents, its fill level, its growth across turns — is the unit of analysis. Observing it is the foundational act of LLM observability that everything else builds on.
Reference — The 2026 LLM Observability Stack
| Layer | Tool | Purpose |
|---|---|---|
| Trace backend | Grafana Tempo | Distributed traces, span storage, trace search |
| Metrics backend | Prometheus + Grafana | Time-series metrics, alerting, dashboards |
| Log backend | ClickHouse | High-cardinality LLM log storage, SQL querying |
| LLM evaluation | Langfuse | Prompt management, evaluation datasets, cost tracking |
| LLM quality | Arize Phoenix | Drift detection, hallucination scoring, embedding monitoring |
| OTel collector | OpenTelemetry Collector | Neutral aggregation layer, pipeline routing |
| Sampling | OTel Tail Sampling Processor | Intelligent trace sampling at scale |
| Context propagation | W3C Trace Context | Cross-service and cross-model trace continuity |
| Safety signals | Microsoft Presidio | PII detection and redaction |
| Semantic conventions | OTel GenAI spec (stable 2025) | Standardized gen_ai.* attribute namespace |