
Demystifying Evals for AI Agents — Key Takeaways


Anthropic recently published a thorough guide on evaluating AI agents: Demystifying evals for AI agents. If you're building anything with AI agents, this is essential reading. Here's my summary of the key ideas.

Why evals matter

Without systematic testing, teams only discover problems after users hit them. The breaking point usually comes when users report the agent "feels worse" — but you have no data to confirm or deny it.

Good evals let you:

  • Distinguish real regressions from noise
  • Test changes against hundreds of scenarios before deploying
  • Baseline latency, token usage, and cost
  • Quickly assess new model upgrades
  • Build shared understanding of what "good" means across teams

The structure of an evaluation

The article defines the key building blocks:

  • Tasks — individual tests with defined inputs and success criteria
  • Trials — multiple attempts at each task (to account for model variability)
  • Graders — logic that scores agent performance
  • Transcripts — complete records of all interactions
  • Outcomes — the final state of the environment after the agent runs

Three types of graders

Code-based graders

String matching, binary tests, static analysis, outcome verification. Fast, cheap, reproducible — but brittle to valid output variations.
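Two of these checks sketched in Python (function names are my own). Note how the exact-match version is brittle in exactly the way described: normalizing whitespace helps, but any valid rephrasing still fails.

```python
import re

def exact_match_grader(output: str, expected: str) -> bool:
    # Normalize whitespace before comparing; still fails on valid rewordings.
    return " ".join(output.split()) == " ".join(expected.split())

def contains_grader(output: str, pattern: str) -> bool:
    # Looser string matching: pass if a required pattern appears anywhere.
    return re.search(pattern, output) is not None
```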

Model-based graders

Rubric scoring, natural language assertions, pairwise comparison. Flexible and nuanced — but non-deterministic and expensive.

Human graders

Expert review, crowdsourcing, spot-checking. Gold standard for quality — but slow and costly to scale.

The best approach is layering all three.
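One way that layering might look in code, as a sketch of my own (the article doesn't prescribe an implementation): try the cheap deterministic check first, fall back to a model-based rubric, and flag anything still unresolved for human spot-checking. Here `model_grader` stands in for an LLM rubric call supplied by the caller.

```python
def layered_grade(output: str, expected: str, model_grader=None):
    """Return (score, layer). Cheap checks first; escalate only when needed."""
    # Layer 1: code-based grader -- fast, cheap, reproducible.
    if output.strip() == expected.strip():
        return 1.0, "code"
    # Layer 2: model-based grader -- flexible, but non-deterministic and costly.
    if model_grader is not None:
        return model_grader(output, expected), "model"
    # Layer 3: no automated verdict -- queue for human review.
    return None, "needs_human_review"
```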

Agent-specific strategies

Different agent types need different eval approaches:

Coding agents — Unit tests, static analysis (ruff, mypy), code quality rubrics. Benchmarks like SWE-bench verify agents fix failing tests without breaking existing ones.

Conversational agents — State verification (was the issue resolved?), tool-call validation, tone/empathy rubrics, simulated users. Turn-count constraints prevent runaway conversations.

Research agents — Groundedness checks against sources, coverage of required facts, source quality validation. Subjective quality is the hardest to measure.

Computer use agents — URL/page state verification, backend database checks, file system inspection. Balance DOM extraction vs. screenshots for token efficiency.

Consistency metrics: pass@k vs. pass^k

Two metrics that capture very different things:

  • pass@k — probability of success in at least one of k attempts. Goes up as k increases.
  • pass^k — probability that all k trials succeed. Goes down as k increases.

Example: An agent with 75% single-trial success rate has pass@3 ≈ 98% but pass^3 ≈ 42%. Which metric matters depends on your product — a coding assistant where users can retry is different from a customer service bot that must get it right every time.
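Assuming independent trials with single-trial success probability p, both metrics follow directly, and the article's numbers check out:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k

# Reproducing the example: 75% single-trial success rate, k = 3
print(f"pass@3 = {pass_at_k(0.75, 3):.3f}")   # 0.984
print(f"pass^3 = {pass_hat_k(0.75, 3):.3f}")  # 0.422
```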

The practical roadmap

The article lays out a step-by-step path from zero to strong evals:

  1. Start with 20-50 real failure cases — don't wait for a perfect suite. Convert bug reports and manual testing into test cases.
  2. Ensure task quality — tasks should be passable by domain experts. If you get 0% pass rate across many trials, the task is probably broken, not the agent.
  3. Balance positive and negative cases — test both when behavior should and shouldn't trigger.
  4. Build stable environments — isolate trials with clean starting states. Shared state causes correlated failures.
  5. Grade outcomes, not process — avoid over-rigid path checking. There are usually multiple valid approaches.
  6. Read transcripts — manual review catches grader bugs and builds intuition.
  7. Monitor saturation — as evals approach 100%, they transition from improvement signals to regression monitors.
  8. Treat evals as living code — they need ownership and maintenance, not set-and-forget.
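Steps 4 and 5 in particular can be sketched as a tiny harness (my own sketch; the task-dict keys and callables are assumptions, not the article's API): each trial gets a fresh environment, and only the final outcome is graded.

```python
import statistics

def run_eval(tasks, agent, grader, trials=3):
    """Run each task `trials` times with a clean starting state per trial,
    grading the outcome rather than the path the agent took to it."""
    results = {}
    for task in tasks:
        scores = []
        for _ in range(trials):
            env = task["make_env"]()              # fresh, isolated environment
            outcome = agent(task["prompt"], env)  # agent acts on env, returns outcome
            scores.append(1.0 if grader(outcome, task["expected"]) else 0.0)
        results[task["id"]] = statistics.mean(scores)
    return results
```

Per-task mean scores over multiple trials are also exactly what you need to estimate the pass@k and pass^k figures discussed earlier.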

The Swiss Cheese Model

No single evaluation method catches everything. The article advocates layering multiple approaches — automated evals, production monitoring, A/B testing, user feedback, manual review, and systematic human studies. Like layers of Swiss cheese, each has holes, but stacked together they catch most issues.

My take

The most actionable insight: start now, start small. Twenty real failure cases turned into test tasks is infinitely more useful than planning a perfect 500-task eval suite you'll never finish. If you're building AI agents, this is the playbook.

Read the full article on Anthropic's blog →