What is evaluation?

An evaluation ("eval") is a test for an AI system: give the system an input, then apply grading logic to its output to measure success.
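As a minimal sketch of that definition, the snippet below pairs an input with grading logic and runs it against a system. The names `EvalCase`, `toy_system`, and `run_eval` are hypothetical stand-ins for illustration, not part of any framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One eval: an input plus the grading logic applied to the output."""
    prompt: str
    grade: Callable[[str], bool]


def toy_system(prompt: str) -> str:
    # Hypothetical stand-in for the AI system under test.
    return "4" if prompt == "What is 2 + 2?" else "I don't know"


def run_eval(system: Callable[[str], str], case: EvalCase) -> bool:
    """Give the system the input, then grade whatever it returns."""
    output = system(case.prompt)
    return case.grade(output)


case = EvalCase(prompt="What is 2 + 2?", grade=lambda out: out.strip() == "4")
passed = run_eval(toy_system, case)
```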

Evaluating traditional software is straightforward — given input X, assert output Y. Evaluating an agent is fundamentally different, because the interesting behavior is not just the final answer. It's the path the agent took to get there — the tools, skills, and memory it chose, the order it called them in, and the context it preserved across turns.

Chef's note: Single-turn evals — a prompt, a response, grading logic — were the main story for earlier LLMs. As capabilities advanced, multi-turn evals have become standard. And multi-turn agent evals are the hardest of all: agents use tools across many turns, modifying state in the environment and adapting as they go — which means mistakes can propagate and compound.

The harness, in layers

The harness isn't a single wrapper. It's multi-layered — concentric responsibilities sitting around the model, each catching a different class of failure. Eval coverage only becomes meaningful once you know which layer a test is poking at.


The LLM

The model itself. Stateless, no memory, no tools — just tokens in, tokens out. Everything else is scaffold.

The engine room

Builds prompts, runs the agent loop, parses structured output, and recovers from malformed responses. Most “did the agent break?” failures live here.

What it can do

Tools the agent can call, memory it can persist, state it can mutate, context it threads across turns. Capability tests — “did it pick the right tool?” — live here.

The outer shell

Catches unsafe input, limits tool reach, verifies critical outputs, spawns sub-agents for sub-tasks, and preempts runaway loops before they bill you for them.

Transcript or outcome?

Every grader you write attaches to one of two things. The transcript is how the agent got there: tools called, in what order, with what arguments, against what state. The outcome is whether the dish landed: did the final answer satisfy the user, and did the side effect actually happen? Pick which one a metric is measuring before you write it, or you'll end up with numbers that average a path and an outcome into mush.
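One way to keep the two apart is to give each grader its own function and report the results separately. The sketch below assumes a hypothetical `AgentRun` record and made-up tool names (`search_flights`, `book_flight`). A real harness would populate an equivalent structure from its own logs.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class AgentRun:
    """Hypothetical record of one agent run."""
    tool_calls: list[ToolCall] = field(default_factory=list)  # the transcript
    final_answer: str = ""                                    # part of the outcome
    side_effects: dict = field(default_factory=dict)          # state after the run


def grade_transcript(run: AgentRun, expected_tools: list[str]) -> bool:
    """Transcript grader: were the expected tools called, in order?"""
    return [call.name for call in run.tool_calls] == expected_tools


def grade_outcome(run: AgentRun, required_state: dict) -> bool:
    """Outcome grader: did the side effect actually happen?"""
    return all(run.side_effects.get(k) == v for k, v in required_state.items())


run = AgentRun(
    tool_calls=[
        ToolCall("search_flights", {"to": "LIS"}),
        ToolCall("book_flight", {"id": "F1"}),
    ],
    final_answer="Booked flight F1 to Lisbon.",
    side_effects={"booking_status": "confirmed"},
)

# Two separate numbers, never averaged together.
transcript_ok = grade_transcript(run, ["search_flights", "book_flight"])
outcome_ok = grade_outcome(run, {"booking_status": "confirmed"})
```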

Different agent, different metric

You can't apply one metric to every agent. Coding agents, research agents, voice agents, and conversational agents all have different definitions of "good" — and the right evaluation has to mirror that.

Coding agents

Write, test, and debug code. Navigate codebases, run commands, behave like a human developer.

Research agents

Gather, synthesize, and analyze information; produce answers or reports. Quality is judged relative to the task, not against a unit test.

Voice agents

Realtime, streaming interactions. You grade two outputs at once: what the assistant does and how it sounds.

Conversational agents

Support, sales, coaching. Maintain state, use tools, take actions mid-conversation.

The two arenas: lab & wild

To build a reliable agent, you need to test it in a controlled lab and out in the real world. The two regimes have different inputs, different metrics, and different performance budgets.

Offline

The lab

Runs before traffic arrives. Design test cases, run the agent against them in a controlled in-memory environment, grade all metrics at once. If the agent fails, it fails safe — no real user is harmed.

Predefined datasets, known ground truth
Stable, in-memory sandbox
As many heavy metrics as you can afford

Control inputs so runs are reproducible
Compare actual trajectory to expected
Block CI on regressions
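Blocking CI on regressions can be as simple as comparing the current scores to a stored baseline. The sketch below is one possible gate. The metric names, the scores, and the `tolerance` value are illustrative assumptions, not values from a real run.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return metric names whose score dropped more than `tolerance` below baseline."""
    return sorted(
        name
        for name, base in baseline.items()
        if current.get(name, 0.0) < base - tolerance
    )


baseline = {"tool_trajectory": 0.90, "answer_relevancy": 0.85}
current = {"tool_trajectory": 0.91, "answer_relevancy": 0.70}

regressions = find_regressions(baseline, current)
# In a CI job, a non-empty list would fail the build (for example via sys.exit(1)).
```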

Online

The wild

Runs while real users interact in production. Metrics fire after each live turn and stream to an observability platform. There is no ground truth: a user's unique question has no hardcoded right answer.

Real, unpredictable user sessions
Callbacks after each completed turn
Fast LLM-as-a-judge — no fixtures

Score async; never block the user
Run fewer, lighter metrics per turn
Lean on outcome signals: thumbs, abandonment, side-effects

Offline tells you the agent is built correctly. Online tells you the agent is doing the right thing.

Types of graders for agents

Agent evaluations typically combine three kinds of graders. Each grades some portion of either the transcript or the outcome. Choosing the right grader for the right job is half the work of evaluation design.

Deterministic

Exact-match assertions, regex, string similarity (ROUGE), tool-trajectory comparisons. Cheap, fast, reproducible. Good for structural checks you can fully specify.
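The three string-level checks can be sketched in a few lines of standard-library Python. Note that `unigram_f1` is a simplified ROUGE-1-style overlap over whitespace tokens, written here for illustration. It is not the reference ROUGE implementation and skips stemming and proper tokenization.

```python
import re
from collections import Counter


def exact_match(actual: str, expected: str) -> bool:
    """Exact-match assertion, ignoring surrounding whitespace."""
    return actual.strip() == expected.strip()


def matches_pattern(actual: str, pattern: str) -> bool:
    """Regex check: does the output contain something shaped like `pattern`?"""
    return re.search(pattern, actual) is not None


def unigram_f1(actual: str, expected: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercase whitespace tokens."""
    a = Counter(actual.lower().split())
    e = Counter(expected.lower().split())
    overlap = sum((a & e).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(e.values())
    return 2 * precision * recall / (precision + recall)
```

All three are pure functions of their inputs, which is what makes them cheap and reproducible.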

LLM-as-a-judge

A model scores another model's output against a rubric. Handles fuzzy judgments — comprehensiveness, tone, groundedness — at scale. Needs prompt-engineered rubrics and calibration.
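A judge is mostly a rubric prompt plus careful parsing of the reply. The sketch below keeps the model call behind a `call_model` parameter so that any client can be plugged in. The prompt wording, the `SCORE:` reply format, and the `fake_model` stub are assumptions made for illustration.

```python
import re
from typing import Callable, Optional

RUBRIC_PROMPT = """You are grading an assistant's answer.
Rubric: {rubric}
Question: {question}
Answer: {answer}
Reply with a single line: SCORE: <number between 0 and 1>"""


def parse_score(judge_reply: str) -> Optional[float]:
    """Extract the score; return None if the judge ignored the format."""
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", judge_reply)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group(1))))


def judge(
    question: str,
    answer: str,
    rubric: str,
    call_model: Callable[[str], str],
) -> Optional[float]:
    """Build the rubric prompt, call the judge model, and parse its reply."""
    prompt = RUBRIC_PROMPT.format(rubric=rubric, question=question, answer=answer)
    return parse_score(call_model(prompt))


def fake_model(prompt: str) -> str:
    # Hypothetical stub standing in for a real judge-model call.
    return "SCORE: 0.8"


score = judge(
    "What is the refund policy?",
    "Refunds are available within 30 days.",
    "Is the answer grounded and complete?",
    fake_model,
)
```

Returning `None` for an unparseable reply keeps a misbehaving judge from being silently counted as a zero.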

Expert review

The ground truth for everything else. Expensive, slow, irreplaceable for ambiguous or high-stakes domains. Use sparingly to calibrate the other two.

Components of an offline eval

Offline evaluation has a fixed cast of components. In ADK (and most agent frameworks) the same nouns recur: an eval case, an eval set, a trajectory, the final response, evaluation criteria, and a runner that glues them together.

Eval case. A single test representing one agent-model session. Contains one or more turns: a user query, the expected tool use, expected intermediate responses, and the expected final response.

Eval set. A dataset file containing many eval cases. Where test files are unit tests, an eval set is an integration test: complex, multi-turn conversations that exercise whole workflows.

Trajectory. The sequence of steps the agent takes before returning to the user: tool choices, sub-agent invocations, reasoning strategy. You score performance by comparing the actual trajectory to the expected trajectory.

Evaluation criteria. The graders. ADK ships tool_trajectory_avg_score (exact tool-call match), response_match_score (ROUGE-1), and judge-based criteria like final_response_match_v2 and rubric_based_tool_use_quality_v1.

Final response. The outcome metric. Trajectory tells you how the agent got there; final response tells you whether the answer is correct, relevant, and grounded.

User simulation. For tests where fixed inputs aren't enough. The "user" prompts are generated by another model, letting you probe how the agent behaves when conversation goes off-script.

Runner. The harness for the harness. Runs end-to-end evals three ways: via the web UI (adk web), programmatically in CI/CD (pytest), or via CLI (adk eval).
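To make the vocabulary concrete, here is a framework-free sketch of how these pieces could fit together. The `Turn`, `EvalCase`, and `EvalSet` classes and the `run_eval_set` function are hypothetical and deliberately simpler than ADK's own schema. The agent is modelled as any callable that returns the tools it called plus its final response.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    user_query: str
    expected_tools: list[str]
    expected_final_response: str


@dataclass
class EvalCase:
    case_id: str
    turns: list[Turn]


@dataclass
class EvalSet:
    name: str
    cases: list[EvalCase]


# Assumption: an agent is a callable returning (tools_called, final_response).
Agent = Callable[[str], tuple[list[str], str]]


def run_eval_set(agent: Agent, eval_set: EvalSet) -> dict[str, bool]:
    """Run every case; a case passes only if every turn matches expectations."""
    results = {}
    for case in eval_set.cases:
        passed = True
        for turn in case.turns:
            tools, response = agent(turn.user_query)
            if tools != turn.expected_tools or response != turn.expected_final_response:
                passed = False
        results[case.case_id] = passed
    return results


def stub_agent(query: str) -> tuple[list[str], str]:
    # Hypothetical agent used only to exercise the runner.
    if "weather" in query:
        return ["get_weather"], "It is sunny."
    return [], "Sorry, I can't help with that."


eval_set = EvalSet("smoke", [
    EvalCase("weather", [Turn("What's the weather?", ["get_weather"], "It is sunny.")]),
    EvalCase("booking", [Turn("Book a table", ["book_table"], "Done.")]),
])
results = run_eval_set(stub_agent, eval_set)
```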

Flow · Offline pipeline

Metrics from the offline pipeline

Subsequence match between the expected and actual tool calls. Rewards the agent for calling the right tools in the right order; tolerates extra calls in between.
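A subsequence check of this kind can be sketched as follows. This version is binary (1.0 or 0.0), which is an assumption made here for simplicity. A real metric might award partial credit instead.

```python
def trajectory_subsequence_score(expected: list[str], actual: list[str]) -> float:
    """1.0 if `expected` appears in `actual` in order, with extra calls allowed in between."""
    remaining = iter(actual)
    # `tool in remaining` advances the iterator, so order is enforced.
    in_order = all(tool in remaining for tool in expected)
    return 1.0 if in_order else 0.0
```

For example, an expected path of `["search", "summarize"]` matches an actual path of `["search", "fetch", "summarize"]`, but not `["summarize", "search"]`.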

Set-based score that penalises both missing tools and extra calls. Use alongside trajectory: trajectory says "right order," F1 says "right set."
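A set-based F1 over tool names might look like the sketch below. Because it compares sets, it penalises extra distinct tools but not repeated calls to the same tool. That is a property of this simplified version, not necessarily of the metric described above.

```python
def tool_f1(expected: list[str], actual: list[str]) -> float:
    """F1 between the set of expected tool names and the set actually called."""
    exp, act = set(expected), set(actual)
    if not exp and not act:
        return 1.0  # nothing expected, nothing called
    true_pos = len(exp & act)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(act)  # drops when extra tools are called
    recall = true_pos / len(exp)     # drops when expected tools are missing
    return 2 * precision * recall / (precision + recall)
```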

A judge model scores how well the final response matches the user's intent and includes target keywords. Float 0.0–1.0.

Did the deck end up with the right slide count? Are required terms present? Are citation URLs preserved? Each constraint contributes a fraction.
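A fractional constraint score can be computed by evaluating each check as a boolean and averaging. In the sketch below the `deck` dictionary, its contents, and the specific checks are invented for illustration.

```python
def constraint_score(checks: dict[str, bool]) -> float:
    """Fraction of constraints satisfied; each check contributes equally."""
    if not checks:
        return 1.0
    return sum(checks.values()) / len(checks)


# Hypothetical output produced by a presentation agent.
deck = {
    "slides": ["Intro", "Market size", "Roadmap"],
    "text": "Our roadmap targets the EU market. Source: https://example.com/report",
}

checks = {
    "slide_count": len(deck["slides"]) == 3,
    "required_terms": all(t in deck["text"].lower() for t in ["roadmap", "market"]),
    "citations_preserved": "https://example.com/report" in deck["text"],
    "mentions_pricing": "pricing" in deck["text"].lower(),
}
score = constraint_score(checks)  # three of four checks pass
```

Keeping the per-check booleans alongside the score makes it clear which constraint failed.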

Does the final response answer the question that was actually asked?

If the agent grounded in retrieved research, does that research actually relate to the prompt? Skipped when there's no research context in state.

How wasteful was the trajectory? Penalises redundant tool calls and meandering paths.
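One simple proxy for waste is the share of tool calls that were not exact repeats. The sketch below assumes each call is a `(name, args)` pair with JSON-serializable arguments. It catches redundant calls but not meandering paths, which would need a comparison against an expected trajectory.

```python
import json


def efficiency_score(tool_calls: list[tuple[str, dict]]) -> float:
    """Unique (tool, args) calls divided by total calls; 1.0 means no repeats."""
    if not tool_calls:
        return 1.0
    unique = {(name, json.dumps(args, sort_keys=True)) for name, args in tool_calls}
    return len(unique) / len(tool_calls)


calls = [
    ("search", {"q": "a"}),
    ("search", {"q": "a"}),  # exact repeat, counted as waste
    ("fetch", {"url": "u"}),
    ("search", {"q": "b"}),
]
score = efficiency_score(calls)
```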

Did the agent stay on the assigned topic and within its product role?

Did the agent politely refuse out-of-scope requests (for example, declining to write a poem about spring when asked) instead of complying?

The pipeline, end-to-end

The block below is a working offline evaluator: define a list of scenarios with expected tools and constraints, run each through the real PresentationExpertApp in an in-memory runner, score every metric, then push the run to Confident AI.

Chef's note: The metrics module (eval.metrics) is shared between offline and online evals. That sharing is the whole trick: the same _answer_relevancy function that grades a fixture in CI also grades a live user's turn in production. One source of truth, two delivery modes.
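The sharing pattern can be sketched as one pure metric function with two thin callers. Everything below is hypothetical: `keyword_relevancy` is a crude keyword-overlap placeholder and not the `_answer_relevancy` function from eval.metrics, and `grade_fixture` and `grade_live_turn` are invented names.

```python
from typing import Callable


def _words(text: str) -> set[str]:
    return {w.strip("?.,!").lower() for w in text.split()}


def keyword_relevancy(question: str, answer: str) -> float:
    """Placeholder metric: share of longer question words echoed in the answer."""
    keywords = {w for w in _words(question) if len(w) > 3}
    if not keywords:
        return 1.0
    return len(keywords & _words(answer)) / len(keywords)


def grade_fixture(fixture: dict, threshold: float = 0.5) -> bool:
    """Offline mode: score a stored fixture and turn it into pass/fail."""
    return keyword_relevancy(fixture["question"], fixture["answer"]) >= threshold


def grade_live_turn(question: str, answer: str, emit: Callable[[dict], None]) -> None:
    """Online mode: score a live turn and emit the number; never assert."""
    emit({"metric": "keyword_relevancy", "score": keyword_relevancy(question, answer)})


fixture = {"question": "Summarize quarterly revenue", "answer": "Quarterly revenue grew."}
offline_pass = grade_fixture(fixture)

emitted: list[dict] = []
grade_live_turn(fixture["question"], fixture["answer"], emitted.append)
```

The metric itself knows nothing about delivery mode. Only the callers differ: one asserts against a threshold, the other reports a number.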

Components of an online eval

Online evaluation reuses most of the offline vocabulary, but every component has a different shape. The shift is from fixtures and assertions to traces and judgments.

Flow · Online callbacks

The callback wiring

Three ADK callbacks do the work. before_agent_cb initialises a per-turn scratchpad. after_model_cb accumulates tool calls and text as the model streams. after_agent_cb snapshots state and fires the judges in a background task.

Chef's note: If the eval can fail the user's request, it doesn't belong in the hot path. Snapshot the state, hand it to asyncio.create_task, and let the dashboard wait. The user shouldn't.
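The snapshot-then-schedule pattern can be sketched with plain asyncio. The `after_turn`, `run_judges`, and `handle_turn` functions are hypothetical and do not use ADK's callback signatures. The sleep stands in for slow judge calls.

```python
import asyncio
import copy

background_tasks: set[asyncio.Task] = set()
scores: list[dict] = []


async def run_judges(snapshot: dict) -> None:
    """Stand-in for slow judge calls; failures are swallowed, never raised to the user."""
    try:
        await asyncio.sleep(0.01)  # simulates judge latency
        scores.append({"turn": snapshot["turn"], "tools": len(snapshot["tool_calls"])})
    except Exception:
        pass  # a real system would log the error and move on


def after_turn(state: dict) -> None:
    """Snapshot state and schedule scoring without awaiting it."""
    snapshot = copy.deepcopy(state)  # later mutations must not leak into the eval
    task = asyncio.create_task(run_judges(snapshot))
    background_tasks.add(task)  # hold a reference so the task is not garbage-collected
    task.add_done_callback(background_tasks.discard)


async def handle_turn(state: dict) -> str:
    state["tool_calls"].append("search")
    after_turn(state)
    return "response sent"  # returns before the judges finish


async def main() -> tuple[str, int]:
    state = {"turn": 1, "tool_calls": []}
    reply = await handle_turn(state)
    scored_at_reply = len(scores)
    await asyncio.gather(*background_tasks)  # demo only: wait so results can be inspected
    return reply, scored_at_reply


reply, scored_before_reply = asyncio.run(main())
```

The deep copy matters: without it, the judges would grade whatever the state has become by the time they run, not the turn that just completed.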

When to write evals

Writing evals is useful at every stage of the agent lifecycle:

Early. Evals force product teams to specify what success means. The exercise of writing the first test case is often more valuable than the test itself.
Middle. They become the regression net. As you ship new tools, new prompts, and new sub-agents, the offline suite catches the things you broke without noticing.
Late. They uphold a consistent quality bar across teams, model upgrades, and harness rewrites. Online evals tell you when the bar starts slipping in production.

Defense in depth

No single eval layer catches every issue. The reliable agents are the ones with a stack of imperfect filters — each catching what the others miss. Picture the Swiss cheese model: every slice has holes, but the holes line up only rarely. Together, they prevent most bad agent behavior from reaching the user.

Every layer has holes. The art is making sure the holes never line up.

Automated evals

Faster iteration, reproducible, runs pre-launch in CI on every commit. Catches: regressions, basic logic errors.

Manual transcript review

Builds intuition, handles qualitative issues a metric won't catch. Catches: subtle inaccuracies, off-topic responses.

Systematic human studies

Gold-standard quality for subjective tasks and complex domains. Catches: ambiguity, nuanced errors, bias.

Production monitoring

Live user behavior, real-world failures, post-launch observability. Catches: drift, live system issues, unexpected edge cases.

A/B testing

Measures user outcomes, validates significant changes, scales. Catches: performance variations, product regressions.

User feedback

Explicit signals from real users: unanticipated problems, real examples. Catches: explicit bugs, dissatisfaction, feature requests.

The harness is the model's clothes. You wouldn't ship a model without evaluating the model. Don't ship the harness without evaluating the harness.