AI Agent Observability in 2026: Tracing Every Step with OpenTelemetry

When an agent fails in production, you need to replay its reasoning. Here is how OpenTelemetry tracing makes agents debuggable.

Sam Carter · Jun 28, 2026 · 10 min read

A single-shot chatbot is easy to debug: you have one prompt and one response. An agent is not. It plans, calls tools, reads results, revises, and loops, sometimes for dozens of steps across multiple model calls. When it produces a wrong answer or burns through your budget, "the model was bad" is not a diagnosis you can act on. You need to see which step went sideways, and in 2026 the way you do that has standardized around OpenTelemetry.

Quick answer

Agent observability captures every model call, tool execution, and reasoning step as a structured span, then stitches them into one hierarchical trace you can replay after the fact. The 2026 standard is the OpenTelemetry GenAI semantic conventions, which Datadog, New Relic, and Dynatrace ingest natively and open-source tools like Phoenix and Langfuse build on. Instrument your model calls first, nest tool spans underneath, then layer evaluation on top so you know not just what happened but whether it was good.

Key takeaways

Agent observability captures every model call, tool execution, and reasoning step as structured spans you can replay.
A complete trace records the reasoning, the tools considered and invoked, the arguments, the responses, the tokens, and the latency of each hop.
The OpenTelemetry GenAI semantic conventions are the emerging standard, covering LLM spans, agent spans, events, and metrics.
Major vendors (Datadog, New Relic, and Dynatrace) now support those conventions natively.
Pick an OpenTelemetry-first stack so you are not locked into one vendor's proprietary tracing.

Why agents need a different kind of monitoring

Traditional application monitoring assumes mostly deterministic code paths. Agents break that assumption. The same prompt can take different routes on different runs, call different tools, and reason differently. So you cannot debug an agent by reading logs of fixed function calls; you have to reconstruct a trajectory. Observability for agents means capturing that trajectory as a hierarchical trace, a tree of spans where each span is one meaningful step, so you can replay exactly what happened.

Concretely, an effective agent trace captures the reasoning trace, the tools considered, the tools actually invoked, the arguments passed, the responses returned, the tokens spent at each step, and the latency of each hop, all stitched into one trace you can step through after the fact. When the agent does something dumb, you scroll to the span where it went wrong and see the inputs that led there.
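As a rough illustration of that shape, here is a minimal stdlib-only sketch of a trace as a tree of spans. The `Span` class, its field names, and the example run are assumptions made for this sketch; they are not the OpenTelemetry data model, only a way to see what "one span per meaningful step" looks like.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in an agent run. Field names are illustrative, not an OpenTelemetry schema."""
    name: str                                      # e.g. "plan", "tool:lookup_order"
    inputs: dict = field(default_factory=dict)     # prompt or tool arguments for this step
    output: str = ""                               # response returned by the model or tool
    tokens: int = 0                                # tokens spent on this step
    latency_ms: float = 0.0                        # wall-clock time of this hop
    children: list["Span"] = field(default_factory=list)


def replay(span: Span, depth: int = 0) -> list[str]:
    """Flatten the tree depth-first into indented lines you can read top to bottom."""
    lines = [f"{'  ' * depth}{span.name} tokens={span.tokens} latency={span.latency_ms:.0f}ms"]
    for child in span.children:
        lines.extend(replay(child, depth + 1))
    return lines


# A hypothetical three-step run: plan, call a tool, write the answer.
trace = Span("agent_run", children=[
    Span("plan", inputs={"goal": "refund status"}, tokens=120, latency_ms=410),
    Span("tool:lookup_order", inputs={"order_id": "A-17"}, output="shipped", latency_ms=85),
    Span("llm:answer", tokens=240, latency_ms=730),
])

for line in replay(trace):
    print(line)
```

Because every span keeps its own inputs and output, scrolling to the step that went wrong also shows what it was given.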

Here is how that differs from the monitoring you already run on conventional services:

Concern	Traditional app monitoring	Agent observability
Unit of work	Fixed function call	Variable trajectory (plan, call, revise, loop)
Failure question	Which line threw?	Which step reasoned wrong?
Cost driver	CPU and memory	Tokens per span and runaway loops
Replay	Re-run the same code path	Step through the exact reasoning tree
Key signal	Error rate and latency	Trajectory correctness and token spend

The practical upshot is that an agent can "succeed" (no exception, HTTP 200) and still be completely wrong, having called the right tool with the wrong arguments three steps back. Only a trace surfaces that.

Figure: A hierarchical trace tree showing an agent's reasoning, tool calls, and responses as nested spans. (Photo: Bob Mical / flickr, BY-NC 2.0)

OpenTelemetry becomes the standard

The important consolidation of 2026 is that observability stopped being a pile of proprietary SDKs and converged on OpenTelemetry. As of early 2026, the OpenTelemetry GenAI semantic conventions cover four primary areas:

LLM client spans: individual model calls with their prompts, parameters, and token usage.
Agent spans: the higher-level reasoning and orchestration steps.
Events: prompt and completion content attached to spans.
Metrics: aggregate measures such as token counts, latency, and error rates.

Because these are open conventions rather than a vendor format, Datadog, New Relic, and Dynatrace now ingest GenAI semantic-convention data natively, and open-source tools like Phoenix are OpenTelemetry-first by design. The payoff is portability: instrument once, and switch backends without rewriting your tracing.
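To make the four areas above more concrete, the sketch below builds plain dictionaries that name spans and attributes the way the GenAI conventions do. The attribute keys and metric names are taken from the conventions as published at the time of writing; they are still evolving, so treat them as something to verify against the current spec. The helper functions themselves are hypothetical and exist only for this example.

```python
# Keys follow the OpenTelemetry GenAI semantic conventions at the time of writing.
# Verify them against the current spec before relying on them.

def llm_span(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Describe one model call: span name plus gen_ai.* attributes."""
    return {
        "name": f"chat {model}",
        "attributes": {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
        },
    }


def tool_span(tool_name: str) -> dict:
    """Describe one tool execution."""
    return {
        "name": f"execute_tool {tool_name}",
        "attributes": {
            "gen_ai.operation.name": "execute_tool",
            "gen_ai.tool.name": tool_name,
        },
    }


def agent_span(agent_name: str) -> dict:
    """Describe the parent span for one agent invocation."""
    return {
        "name": f"invoke_agent {agent_name}",
        "attributes": {
            "gen_ai.operation.name": "invoke_agent",
            "gen_ai.agent.name": agent_name,
        },
    }


# Metric names defined by the same conventions (aggregate token use and call duration).
GENAI_METRICS = ("gen_ai.client.token.usage", "gen_ai.client.operation.duration")

print(llm_span("example-model", 128, 42)["name"])
```

Any backend that understands these keys can group by model, tool, or agent without knowing which framework produced the span, which is where the portability comes from.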

Tip

Instrument to the OpenTelemetry GenAI conventions, not to a single vendor's SDK. It keeps your traces portable across Datadog, New Relic, Phoenix, and managed platforms, and it future-proofs you as the conventions evolve.

What a good observability stack provides

The guidance from teams running agents in production converges on a checklist. Effective tooling should give you:

Distributed tracing across model calls and tool executions, stitched into one trace.
Multi-turn session replay, so you can step through an entire conversation, not just one call.
Online evaluation, scoring outputs in production against quality criteria.
Alerting and anomaly detection for cost spikes, latency, and failure patterns.
Data curation, turning real traces into evaluation and fine-tuning datasets.
OpenTelemetry compatibility for portability.

That online evaluation piece connects directly to running LLM-as-a-judge evaluations in production: traces give you the raw record, and automated scoring turns that record into a quality signal you can alert on. Observability without evaluation tells you what happened; pairing them tells you whether what happened was good.
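One way to picture that pairing is a rolling quality score computed over recent traces. The sketch below is an assumption-heavy illustration: `OnlineEvaluator`, the window size, the threshold, and the keyword-based judge are all hypothetical, and in practice the judge would be a call to a grading model rather than a string check.

```python
from collections import deque
from statistics import mean
from typing import Callable


class OnlineEvaluator:
    """Keep a rolling quality score over recent traces and flag regressions."""

    def __init__(self, judge: Callable[[str, str], float],
                 window: int = 50, threshold: float = 0.8):
        self.judge = judge                    # returns a score in [0, 1] for (question, answer)
        self.scores = deque(maxlen=window)    # only the most recent `window` scores are kept
        self.threshold = threshold

    def record(self, question: str, answer: str) -> float:
        """Score one traced interaction and add it to the rolling window."""
        score = self.judge(question, answer)
        self.scores.append(score)
        return score

    def regressed(self) -> bool:
        """True when the rolling mean drops below the threshold."""
        return bool(self.scores) and mean(self.scores) < self.threshold


def keyword_judge(question: str, answer: str) -> float:
    """Stand-in judge for this sketch; a real one would call a grading model."""
    return 1.0 if "order" in answer.lower() else 0.0


evaluator = OnlineEvaluator(keyword_judge, window=4, threshold=0.75)
evaluator.record("Where is my order?", "Your order shipped yesterday.")
evaluator.record("Where is my order?", "I cannot help with that.")
print(evaluator.regressed())
```

The question and answer passed to `record` would come straight from the trace, so the same record feeds both debugging and the quality signal.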

Choosing a backend

You do not have to write a tracer from scratch. The decision is mostly about whether you want a managed platform, an open-source tool you host, or to bolt agent traces onto an APM vendor you already pay for. Because these backends ingest the GenAI conventions, you can change your mind later without re-instrumenting.

Backend	Best for	Hosting	Notes
Phoenix (Arize)	Open-source, OTel-first traces and evals	Self-host or cloud	Free, strong eval tooling, popular default
Langfuse	Teams wanting traces plus prompt management	Self-host or cloud	Generous free tier, good session replay
Datadog / New Relic	Shops already on that APM	Managed	Native GenAI span ingestion, one pane of glass
LangSmith	LangChain and LangGraph stacks	Managed	Tight framework integration, paid
Braintrust	Eval-heavy workflows	Managed	Strong dataset curation from real traces

If you have no existing APM contract, start with Phoenix or Langfuse: both are free to begin with, both are OpenTelemetry-native, and both let you graduate to a paid tier without ripping out instrumentation.

Getting started

Instrument the model calls first. Wrap every LLM call to emit an OpenTelemetry span with prompt, parameters, tokens, and latency.
Add agent and tool spans. Nest tool executions and reasoning steps under a parent span so the trace forms a tree.
Attach content as events. Capture prompts and completions where policy allows, so traces are replayable, not just timing data.
Choose a backend. Use an OpenTelemetry-first open-source option (e.g., Phoenix) or a managed platform that ingests the GenAI conventions.
Layer evaluation on top. Score outputs online and alert on regressions, cost, and latency, not just errors.

Use framework adapters where they exist (they normalize spans into the common schema automatically), and fall back to the OpenTelemetry SDK for anything unsupported. This is part of the same operational maturity that makes AI agent memory and context engineering tractable: you cannot manage what you cannot see.

One detail that trips teams up is deciding how much prompt and completion content to capture. Full content makes traces fully replayable, but it raises privacy risk and storage cost, and it may pull personal data into your observability backend. The OpenTelemetry conventions treat message content as opt-in, so a common pattern is to record full content in staging and sample or redact it in production, while keeping the structural span data (tokens, latency, tool names, arguments) everywhere.
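A policy like that can be expressed as a small filter applied before spans leave the process. The sketch below is illustrative only: the key names, the `"[REDACTED]"` marker, the environment labels, and the 10% default sample rate are assumptions, not settings defined by OpenTelemetry.

```python
import random

CONTENT_KEYS = {"prompt", "completion"}   # free-text fields that may hold personal data


def apply_capture_policy(span: dict, environment: str,
                         sample_rate: float = 0.1, rng=random.random) -> dict:
    """Return a copy of the span with content kept, sampled, or redacted by environment.

    Staging keeps everything. Elsewhere, content is kept for roughly `sample_rate`
    of spans and redacted for the rest. Structural fields are always kept.
    """
    keep_content = environment == "staging" or rng() < sample_rate
    cleaned = {}
    for key, value in span.items():
        if key in CONTENT_KEYS and not keep_content:
            cleaned[key] = "[REDACTED]"
        else:
            cleaned[key] = value
    return cleaned


span = {
    "tool_name": "lookup_order",
    "prompt": "Where is order A-17 for jane@example.com?",
    "completion": "It shipped yesterday.",
    "input_tokens": 14,
    "latency_ms": 420,
}

print(apply_capture_policy(span, "staging")["prompt"])
print(apply_capture_policy(span, "production", sample_rate=0.0)["prompt"])
```

Passing `rng` as a parameter keeps the sampling decision testable; a production version would more likely sample per trace than per span so a replayed run is either fully captured or fully redacted.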

What to do right now

If you are running an agent in production without tracing, work through this in order:

Pick a backend today: Phoenix or Langfuse if you want free and OpenTelemetry-native, your existing APM if you already pay for one.
Wrap every LLM call to emit a span with prompt, model, parameters, token usage, and latency.
Nest tool executions and reasoning steps under a parent agent span so each run forms a tree.
Decide your content-capture policy: full in staging, redacted or sampled in production.
Add alerts on three things: cost spikes (tokens per session), latency regressions, and error rate.
Once traces are flowing, layer online evaluation on top so you catch quality drift, not just crashes.
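The three alerts in that list reduce to comparisons between current numbers and a baseline. Here is a minimal sketch under assumed thresholds: `SessionStats`, the multipliers, and the 5% error-rate ceiling are hypothetical defaults to tune against your own traffic, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    tokens: int              # total tokens across all spans in the session
    p95_latency_ms: float    # 95th-percentile span latency
    errors: int              # spans that ended in an error
    spans: int               # total spans in the session


def check_alerts(current: SessionStats, baseline: SessionStats,
                 cost_factor: float = 2.0, latency_factor: float = 1.5,
                 max_error_rate: float = 0.05) -> list[str]:
    """Compare a session against a baseline and return the names of alerts that fire."""
    alerts = []
    if current.tokens > baseline.tokens * cost_factor:
        alerts.append("cost_spike")
    if current.p95_latency_ms > baseline.p95_latency_ms * latency_factor:
        alerts.append("latency_regression")
    if current.spans and current.errors / current.spans > max_error_rate:
        alerts.append("error_rate")
    return alerts


baseline = SessionStats(tokens=4000, p95_latency_ms=900, errors=0, spans=20)
current = SessionStats(tokens=9500, p95_latency_ms=1000, errors=2, spans=20)
print(check_alerts(current, baseline))
```

In this example the token total is more than double the baseline and 2 of 20 spans failed, so the cost and error-rate alerts fire while latency stays within its limit.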

Frequently asked questions

What is the difference between logging and agent observability?

Logging records discrete events. Agent observability reconstructs the full trajectory (reasoning, tool calls, arguments, responses, tokens, and latency) as a hierarchical trace you can replay step by step. Logs tell you something happened; traces tell you the path that led there.

Why use OpenTelemetry instead of a vendor SDK?

Portability. The GenAI semantic conventions are an open standard now supported natively by Datadog, New Relic, Dynatrace, and open-source tools, so you can switch backends without re-instrumenting.

Do I need observability for a simple chatbot?

For a single-shot chatbot, basic logging may suffice. The moment you add tools, multi-step reasoning, or loops, you need tracing; otherwise, failures become nearly impossible to diagnose.

Can observability help control costs?

Yes. Per-span token and latency data shows exactly where an agent spends its budget, so you can spot runaway loops and expensive tool calls and set alerts on cost spikes.
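As a small illustration, the sketch below aggregates tokens by step name and flags steps that repeat suspiciously often within one run. The span dictionaries, the step names, and the repeat limit of 5 are assumptions chosen for the example.

```python
from collections import Counter


def token_hotspots(spans: list[dict], top: int = 3) -> list[tuple[str, int]]:
    """Sum tokens per span name and return the biggest spenders first."""
    totals = Counter()
    for span in spans:
        totals[span["name"]] += span.get("tokens", 0)
    return totals.most_common(top)


def runaway_steps(spans: list[dict], max_repeats: int = 5) -> list[str]:
    """Names of steps that occur more than max_repeats times in one run."""
    counts = Counter(span["name"] for span in spans)
    return sorted(name for name, n in counts.items() if n > max_repeats)


# A hypothetical run where the agent called the same search tool seven times.
run = (
    [{"name": "plan", "tokens": 100}]
    + [{"name": "tool:search", "tokens": 50} for _ in range(7)]
    + [{"name": "llm:answer", "tokens": 300}]
)

print(token_hotspots(run))
print(runaway_steps(run))
```

Here the repeated search step costs more in total than the final answer, even though each individual call looks cheap, which is the pattern per-span data makes visible.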
