WhySoGeek.

Tau-Bench and Agent Reliability: pass^k in 2026

A single passing run is not reliability. Tau-bench measures policy adherence and consistency across trials, and the numbers are humbling.

Sam Carter 8 min read

Your agent passed the test. Run it eight more times and see how many still pass. That gap, between "worked once" and "works every time," is the whole story of production agent reliability, and it is exactly what tau-bench was built to expose.

Quick answer

Tau-bench (and its 2026 successor tau2-bench) evaluates agents on tool-agent-user interaction in enterprise domains like retail and airline support, with a simulated user, domain-specific tool APIs, and policy documents the agent must obey. It grades policy adherence, not just task completion: an agent that books the right flight but violates the change-fee policy fails. Its pass^k metric is the chance that an agent succeeds on all k repeated trials of the same task, and it shows that even strong models are inconsistent: pass^8 falls under 25% in retail.

Key takeaways

  • Passing once is not reliability. Tau-bench's pass^k metric grades consistency across repeated runs.
  • Policy adherence is the bar, not task completion. Breaking a stated rule fails the task.
  • Even top models are inconsistent, with pass^8 falling below 25% in the retail domain.
  • A simulated user makes tasks multi-turn and realistic instead of one-shot.
  • The 2026 update, tau2-bench, adds voice and knowledge-retrieval domains.

Why single-run scores lie

Most benchmarks report whether an agent completed a task. Run it once, check the result, record pass or fail. That number flatters the agent, because production does not run a task once. It runs it thousands of times a day, and a 70% success rate means three in ten customers get a wrong answer.
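To see how fast a decent single-run number erodes, here is a back-of-the-envelope sketch. It assumes every run succeeds independently with the same probability, which real agents do not: failures tend to cluster on hard tasks, so treat the output as an illustration, not a prediction. The function names are invented for this example.

```python
def all_pass_probability(p: float, k: int) -> float:
    """Chance that k runs all succeed, ASSUMING independent runs with success rate p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return p ** k


def expected_failures(p: float, runs: int) -> float:
    """Expected number of failed runs out of `runs` at per-run success rate p."""
    return (1.0 - p) * runs


if __name__ == "__main__":
    # A 70% agent asked to succeed eight times out of eight.
    print(f"{all_pass_probability(0.7, 8):.1%}")  # prints 5.8%
    # The same agent handling 1,000 conversations.
    print(round(expected_failures(0.7, 1000)))    # prints 300
```

Under that independence assumption, a 70% agent clears eight out of eight attempts less than 6% of the time.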

Reliability is about the distribution, not the best case. An agent that succeeds on average but fails unpredictably is worse for a business than a slightly less capable agent whose failures are consistent, because you can at least route around the predictable one.

What tau-bench measures differently

Tau-bench, from Sierra, sets up enterprise-style domains, originally retail and airline customer service, and evaluates the full interaction loop.

The simulated user

Instead of a static prompt, a language model plays the user. It has a goal and responds to the agent's questions over multiple turns. This forces the agent to gather information, ask clarifying questions, and handle a conversation, which is far closer to real support than a one-shot instruction.
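A minimal sketch of that loop, for your own evals, might look like the following. Tau-bench uses a language model as the user; here a hypothetical `ScriptedUser` stands in so the example runs without any API calls. Every name in the block (`ScriptedUser`, `run_conversation`, `toy_agent`, the `DONE` convention) is an assumption made for illustration, not part of tau-bench.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (speaker, text)


@dataclass
class ScriptedUser:
    """Stand-in for an LLM-driven user: has a goal, reveals facts only when asked."""
    goal: str
    facts: Dict[str, str]  # keyword the agent must mention -> user's answer

    def opening(self) -> str:
        return self.goal

    def reply(self, agent_text: str) -> str:
        lowered = agent_text.lower()
        for keyword, answer in self.facts.items():
            if keyword in lowered:
                return answer
        return "I'm not sure what you need from me."


def run_conversation(
    agent: Callable[[List[Message]], str],
    user: ScriptedUser,
    max_turns: int = 6,
) -> List[Message]:
    """Alternate agent and user turns until the agent says DONE or turns run out."""
    transcript: List[Message] = [("user", user.opening())]
    for _ in range(max_turns):
        agent_text = agent(transcript)
        transcript.append(("agent", agent_text))
        if agent_text.startswith("DONE"):
            break
        transcript.append(("user", user.reply(agent_text)))
    return transcript


def toy_agent(transcript: List[Message]) -> str:
    """Toy agent: asks for the order number until the user has supplied one."""
    user_text = " ".join(text for speaker, text in transcript if speaker == "user")
    if "order #" not in user_text.lower():
        return "Could you share your order number?"
    return "DONE: return started."
```

The point of the structure is that the agent cannot finish from the opening message alone. It has to ask, wait for the answer, and only then act.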

Policy adherence, not just success

Each domain ships policy documents the agent must follow: refund limits, change-fee rules, eligibility criteria. The grading is strict. An agent that achieves the user's goal but breaks a policy fails the task. This maps directly onto what enterprises actually need, since a support agent that ignores the refund policy is a liability even when the customer is happy.
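In your own evals, that strictness can be sketched as a grader that only passes a run when the goal is met and no policy check fires. This is an illustration of the idea, not tau-bench's actual grader: the `RunResult` shape, the `issue_refund` tool name, and the 100.0 refund limit are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RunResult:
    goal_met: bool       # did the user get what they asked for?
    actions: List[Dict]  # tool calls the agent made during the run


# A policy check returns True when the run complies with the rule.
PolicyCheck = Callable[[RunResult], bool]

MAX_REFUND = 100.0  # hypothetical limit taken from an imagined policy document


def refund_within_limit(run: RunResult) -> bool:
    """Every refund issued in the run must stay at or under MAX_REFUND."""
    return all(
        action.get("amount", 0.0) <= MAX_REFUND
        for action in run.actions
        if action.get("tool") == "issue_refund"
    )


def grade(run: RunResult, policies: Dict[str, PolicyCheck]) -> Tuple[bool, List[str]]:
    """Pass only if the goal was met AND no policy was violated."""
    violations = [name for name, check in policies.items() if not check(run)]
    return run.goal_met and not violations, violations
```

Returning the list of violated rules alongside the verdict keeps the failure explainable, which matters once you start counting failures across many runs.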


The pass^k metric

This is the headline contribution. Pass^k is the probability that an agent succeeds on all k independent attempts at the same task, averaged across tasks. Pass^1 is the familiar single-run number. Pass^8 asks whether it succeeds eight times out of eight. The results are sobering: even capable function-calling agents drop below 25% pass^8 in retail. High single-run scores hide deep inconsistency.
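Given n recorded trials of a task with c successes, the tau-bench paper estimates pass^k for that task as C(c, k) / C(n, k), then averages over tasks. A small sketch of that estimator follows; the function names and the shape of the `results` dictionary are assumptions for this example.

```python
from math import comb
from typing import Dict, List


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn from the recorded ones all passed."""
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # comb(successes, k) is 0 when successes < k, so the estimate drops to 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks.

    `results` maps a task id to the pass/fail outcome of each trial.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    per_task = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(per_task) / len(per_task)
```

With k equal to the number of trials, a single failure sends a task's score to zero, which is the "all eight" reading of pass^8.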

Metric | Question it answers | Typical result
Task success (pass^1) | Did it work once? | Often over 50%
pass^k | Did it work k times running? | Falls sharply as k grows
Policy adherence | Did it follow the rules? | Lower than raw success

Tau2-bench and the 2026 leaderboard

The 2026 update, tau2-bench, expanded the domains to include voice and knowledge retrieval, and its leaderboard grew to dozens of model entries. It sits among a handful of benchmarks that carry most of the signal for agents, alongside SWE-bench Verified for coding, OSWorld for computer use, and WebArena for browser tasks.

The broader trend for late-2026 leaderboards is a shift toward three metrics that separate production-ready from demo-ready: N-run consistency (pass^k style), policy adherence, and cost-adjusted accuracy. A cheap agent that is consistent and rule-abiding beats an expensive one that dazzles once.

Benchmark | Focus
Tau2-bench | Tool-agent-user, policy adherence
SWE-bench Verified | Real software engineering fixes
OSWorld | Computer use / desktop control
WebArena | Browser navigation tasks
GAIA | General assistant tasks

What to do right now

  • Stop reporting single-run success internally. Run each eval task at least eight times and report pass^k.
  • Write your policies as gradable rules and fail any run that violates them, mirroring tau-bench's strictness.
  • Simulate the user in your evals so tasks are multi-turn, not one-shot instructions.
  • Track cost per successful task, not just accuracy. See AI coding agent costs.
  • Read the broader benchmark landscape in our AI agent benchmarks guide and coding model benchmarks.
  • Add observability so you can see which runs failed and why. Read agent observability with OpenTelemetry.
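For the cost bullet above, one simple way to track cost per successful task is to divide total spend across all runs by the number of runs that passed, so failed runs still count against the budget. This is a sketch of one reasonable definition, not a standard metric; the function name and inputs are assumptions.

```python
from typing import List


def cost_per_success(run_costs: List[float], run_passed: List[bool]) -> float:
    """Total spend across all runs divided by the number of passing runs.

    Failed runs still cost money, so they inflate the price of each success.
    Returns infinity when nothing passed.
    """
    if len(run_costs) != len(run_passed):
        raise ValueError("run_costs and run_passed must have the same length")
    successes = sum(run_passed)
    if successes == 0:
        return float("inf")
    return sum(run_costs) / successes
```

An agent that costs 0.10 per run but passes only half the time effectively costs 0.20 per successful task, which is the number to compare against a pricier but steadier alternative.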

Frequently asked questions

What is the difference between pass^1 and pass^8?

Pass^1 is the classic single-run success rate. Pass^8 requires the agent to succeed on all eight independent attempts at the same task. It falls off fast, which is exactly why it exposes the unreliability that pass^1 hides.

Why grade policy adherence separately from success?

Because in an enterprise a rule violation is a failure even when the customer's goal is met. An agent that refunds beyond the allowed limit "succeeded" for the user but created a business problem. Tau-bench fails that run.

Is a high tau-bench score enough to deploy?

No single benchmark is. Tau-bench measures policy-heavy support-style tasks. Pair it with domain-specific evals on your own tasks, and read why most agent pilots still fail.

How do I run something like tau-bench on my own agent?

Build a small graded task set with a simulated user, encode your policies as pass/fail checks, and run each task k times to compute pass^k. Start from your real support transcripts for realistic scenarios.
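A bare-bones harness for that recipe could look like this. It assumes you supply a `run_once(task_id, trial)` callable that runs one simulated conversation, applies your policy checks, and returns pass or fail. That callable and every name below are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def run_trials(
    task_ids: Iterable[str],
    run_once: Callable[[str, int], bool],
    k: int = 8,
) -> Dict[str, List[bool]]:
    """Run every task k times and record each pass/fail outcome."""
    return {task: [run_once(task, trial) for trial in range(k)] for task in task_ids}


def strict_pass_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that passed on every recorded trial."""
    if not outcomes:
        raise ValueError("outcomes must contain at least one task")
    return sum(all(runs) for runs in outcomes.values()) / len(outcomes)


def flaky_tasks(outcomes: Dict[str, List[bool]]) -> List[str]:
    """Tasks that passed at least once and failed at least once."""
    return sorted(task for task, runs in outcomes.items() if any(runs) and not all(runs))
```

The flaky list is usually the most useful output: those are the tasks where a single-run eval would have told you everything was fine.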
