# Demystifying Agent Harness Evaluation

An agent is a model plus a scaffold — tools, memory, skills, control flow. To evaluate one well, you have to evaluate the whole thing, not just the model's fina

Author: Shrestha Pawar — AI Architect
Published: May 17, 2026
Read time: 8 min
Tags: AI Agents, Agent Evaluation, LLM Evaluation, AI Eval Harness, Multi Turn Evals, Agentic AI, AI Infrastructure, LLM Ops, AI Testing, AI Benchmarking, AI Safety, AI Tool Use, Evaluation Metrics, Offline Evaluation, Online Evaluation, LLM as a Judge, DeepEval, LangSmith, Ragas, Promptfoo, Research Agents, Voice Agents, Agent Trajectory, Tool Trajectory, AI Observability, AI Reliability, Production Monitoring, AI Workflows, Autonomous Agents, Generative AI
URL: https://syntropylabs.ai/blog/demystifying-agent-harness-evaluation

---

# What is evaluation?

An **evaluation** ("eval") is a test for an AI system: give the system an input, then apply grading logic to its output to measure success.

Evaluating traditional software is straightforward — given input `X`, assert output `Y`. Evaluating an agent is fundamentally different, because the interesting behavior is not just the final answer. It's the **path the agent took to get there** — the tools, skills, and memory it chose, the order it called them in, and the context it preserved across turns.

> **Chef's note:** Single-turn evals — a prompt, a response, grading logic — were the main story for earlier LLMs. As capabilities advanced, **multi-turn evals** have become standard. And multi-turn *agent* evals are the hardest of all: agents use tools across many turns, modifying state in the environment and adapting as they go — which means mistakes can propagate and compound.

# The harness, in layers

The harness isn't a single wrapper. It's **multi-layered** — concentric responsibilities sitting around the model, each catching a different class of failure. Eval coverage only becomes meaningful once you know *which layer* a test is poking at.


## The *LLM*

The model itself. Stateless, no memory, no tools — just tokens in, tokens out. Everything else is scaffold.

## The *engine room*

Builds prompts, runs the agent loop, parses structured output, and recovers from malformed responses. Most “did the agent break?” failures live here.
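As a minimal sketch of the "recovers from malformed responses" behavior, the helper below parses a model reply as JSON and re-prompts when parsing fails. `call_model` is a hypothetical stand-in for whatever client the harness uses; the retry count and the wording of the repair message are assumptions as well.

```python
import json
from typing import Callable


def parse_with_recovery(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Ask the model for a JSON object and re-prompt when the reply does not parse."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as err:
            # Feed the parse error back so the next attempt can correct itself.
            current_prompt = (
                f"{prompt}\n\nYour last reply was not valid JSON ({err.msg}). "
                "Reply with JSON only."
            )
            continue
        if isinstance(parsed, dict):
            return parsed
        # Valid JSON, but not the object shape the harness expects.
        current_prompt = f"{prompt}\n\nReply with a single JSON object."
    raise ValueError(f"no valid JSON object after {max_attempts} attempts")
```

An eval aimed at this layer can feed the helper a scripted fake model (bad reply first, good reply second) and assert that the loop recovers instead of crashing.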

## What it can *do*

Tools the agent can call, memory it can persist, state it can mutate, context it threads across turns. Capability tests — “did it pick the right tool?” — live here.

## The *outer shell*

Catches unsafe input, limits tool reach, verifies critical outputs, spawns sub-agents for sub-tasks, and preempts runaway loops before they bill you for them.

## Transcript or outcome?

Every grader you write attaches to one of two things. The **transcript** is *how* the agent got there — tools called, in what order, with what arguments, against what state. The **outcome** is whether the dish landed: did the final answer satisfy the user; did the side-effect actually happen. Pick which one a metric is measuring before you write it, or you'll end up with numbers that average a path and an outcome into mush.
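To make the split concrete, here is a small sketch with one grader of each kind. The record shape (`RunRecord`, `ToolCall`) and the refund scenario are hypothetical, chosen only to show that the two graders read different parts of the same run.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class RunRecord:
    tool_calls: list                                # the transcript: what the agent did
    final_answer: str                               # part of the outcome
    end_state: dict = field(default_factory=dict)   # side effects left behind


def grade_transcript(run: RunRecord, expected_tools: list) -> bool:
    """Transcript grader: were the expected tools called, in exactly this order?"""
    return [call.name for call in run.tool_calls] == expected_tools


def grade_outcome(run: RunRecord, expected_state: dict) -> bool:
    """Outcome grader: did the expected side effects actually land?"""
    return all(run.end_state.get(key) == value for key, value in expected_state.items())


run = RunRecord(
    tool_calls=[ToolCall("lookup_order", {"id": 7}), ToolCall("refund", {"id": 7})],
    final_answer="Your refund is on its way.",
    end_state={"order_7_status": "refunded"},
)
```

Reporting the two results as separate numbers keeps a run that reached the right state by an odd path distinguishable from one that followed the script and still failed.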

# Different agent, different metric

You can't apply one metric to every agent. Coding agents, research agents, voice agents, and conversational agents all have different definitions of "good" — and the right evaluation has to mirror that.

## Coding *agents*

Write, test, and debug code. Navigate codebases, run commands, behave like a human developer.

## Research *agents*

Gather, synthesize, and analyze information; produce answers or reports. Quality is judged relative to the task, not against a unit test.

## Voice *agents*

Realtime, streaming interactions. You grade two outputs at once: *what* the assistant does and *how* it sounds.

## Conversational *agents*

Support, sales, coaching. Maintain state, use tools, take actions mid-conversation.

# The two arenas: lab & wild

To build a reliable agent, you need to test it in a controlled lab *and* out in the real world. The two regimes have different inputs, different metrics, and different performance budgets.

## Offline

The lab

Runs before traffic arrives. Design test cases, run the agent against them in a controlled in-memory environment, grade all metrics at once. If the agent fails, it fails safe — no real user is harmed.

- Predefined datasets, known ground truth
- Stable, in-memory sandbox
- As many heavy metrics as you can afford

- Control inputs so runs are reproducible
- Compare actual trajectory to expected
- Block CI on regressions
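One way to make "block CI on regressions" concrete is to compare each run's metric averages against a stored baseline and fail the build when any metric drops by more than a tolerance. The sketch below assumes every metric is a float where higher is better; `find_regressions`, the metric names, and the 0.02 tolerance are illustrative choices, not part of any framework.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return {metric: (baseline, current)} for metrics that dropped beyond tolerance."""
    dropped = {}
    for name, base in baseline.items():
        now = current.get(name, 0.0)  # a metric that vanished counts as a drop to zero
        if now < base - tolerance:
            dropped[name] = (base, now)
    return dropped


def test_no_regressions():
    # In a real suite these would be loaded from a stored baseline and a fresh run.
    baseline = {"tool_trajectory": 0.90, "answer_relevancy": 0.80}
    current = {"tool_trajectory": 0.91, "answer_relevancy": 0.79}
    assert not find_regressions(baseline, current)
```

Run under `pytest`, the test fails the pipeline as soon as any metric slips past the tolerance, and the returned dictionary names the metric that moved.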

## Online

The wild

Runs while real users interact in production. Metrics fire after each live turn and stream to an observability platform. There is no ground truth, because no one can hardcode the right answer to a user's unique question.

- Real, unpredictable user sessions
- Callbacks after each completed turn
- Fast LLM-as-a-judge — no fixtures

- Score async; never block the user
- Run fewer, lighter metrics per turn
- Lean on outcome signals: thumbs, abandonment, side-effects

Offline tells you the agent is built correctly. Online tells you the agent is doing the right thing.

# Types of graders for agents

Agent evaluations typically combine three kinds of graders. Each grades some portion of either the **transcript** or the **outcome**. Choosing the right grader for the right job is half the work of evaluation design.

## Deterministic

Exact-match assertions, regex, string similarity (ROUGE), tool-trajectory comparisons. Cheap, fast, reproducible. Good for structural checks you can fully specify.
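Two of these checks fit in a few lines. The sketch below shows a regex grader and a simplified ROUGE-1 F1 (unigram overlap on lowercased whitespace tokens). Production ROUGE implementations typically add proper tokenization and optional stemming, so treat this as an illustration of the idea and not a drop-in replacement.

```python
import re
from collections import Counter


def regex_grader(output: str, pattern: str) -> bool:
    """Pass when the pattern occurs anywhere in the output."""
    return re.search(pattern, output) is not None


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("the cat", "the cat sat down")` has precision 1.0 and recall 0.5, which gives an F1 of 2/3.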

## LLM-as-a-judge

A model scores another model's output against a rubric. Handles fuzzy judgments — comprehensiveness, tone, groundedness — at scale. Needs prompt-engineered rubrics and calibration.
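A rubric-driven judge usually comes down to three pieces: a prompt that states the rubric, a call to the judge model, and a strict parser for the reply. The sketch below assumes a 1 to 5 groundedness rubric and a `SCORE: n` reply format; both are illustrative, and `call_judge` is a hypothetical stand-in for the actual model client.

```python
import re
from typing import Callable, Optional

RUBRIC = (
    "You are grading an assistant's RESPONSE for groundedness.\n"
    "5 = every claim is supported by the CONTEXT; 1 = most claims are unsupported.\n"
    "Reply with a single line in the form: SCORE: <integer 1-5>"
)


def build_judge_prompt(context: str, response: str) -> str:
    return f"{RUBRIC}\n\nCONTEXT:\n{context}\n\nRESPONSE:\n{response}"


def parse_score(reply: str) -> Optional[float]:
    """Map 'SCORE: n' (n in 1..5) onto 0.0..1.0; return None if the reply is off-format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return (int(match.group(1)) - 1) / 4 if match else None


def judge_groundedness(
    call_judge: Callable[[str], str], context: str, response: str
) -> Optional[float]:
    """Run the judge once and return a normalized score, or None on a malformed reply."""
    return parse_score(call_judge(build_judge_prompt(context, response)))
```

Returning `None` for off-format replies, instead of guessing a score, keeps judge failures visible. Calibration then means comparing these scores against a small set of expert-labeled examples.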

## Expert review

The ground truth for everything else. Expensive, slow, irreplaceable for ambiguous or high-stakes domains. Use sparingly to calibrate the other two.

# Components of an offline eval

Offline evaluation has a fixed cast of components. In Google's Agent Development Kit (ADK), and in most agent frameworks, the same nouns recur: an **eval case**, an **eval set**, a **trajectory**, the **final response**, evaluation **criteria**, and a **runner** that glues them together.

**Eval case.** A single test representing one agent-model session. It contains one or more *turns*, each with a user query, the expected tool use, expected intermediate responses, and the expected final response.

**Eval set.** A dataset file containing many eval cases. Where test files play the role of unit tests, an eval set plays the role of integration tests: complex, multi-turn conversations that exercise whole workflows.

**Trajectory.** The sequence of steps the agent takes before returning to the user: tool choices, sub-agent invocations, reasoning strategy. You score performance by comparing the *actual trajectory* to the *expected trajectory*.

**Criteria.** The graders. ADK ships `tool_trajectory_avg_score` (exact tool-call match), `response_match_score` (ROUGE-1), and judge-based criteria like `final_response_match_v2` and `rubric_based_tool_use_quality_v1`.

**Final response.** The outcome metric. Trajectory tells you *how* the agent got there; final response tells you whether the answer is correct, relevant, and grounded.

**User simulation.** For tests where fixed inputs aren't enough. The "user" prompts are generated by another model, letting you probe how the agent behaves when the conversation goes off-script.

**Runner.** The harness for the harness. Runs end-to-end evals three ways: via the web UI (`adk web`), programmatically in CI/CD (`pytest`), or via the CLI (`adk eval`).
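As a rough sketch of how an eval case and an exact-match trajectory criterion fit together, the code below models a case as a list of turns and averages a per-turn score of 1.0 or 0.0. The class and field names are hypothetical and do not mirror ADK's eval set schema; the scoring follows the "exact tool-call match" idea but compares tool names only, not arguments.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    user_query: str
    expected_tools: list      # tool names, in the order they should be called
    expected_response: str


@dataclass
class EvalCase:
    case_id: str
    turns: list = field(default_factory=list)


def exact_trajectory_score(case: EvalCase, actual_tools_per_turn: list) -> float:
    """Average over turns of 1.0 (tool names match exactly, in order) or 0.0."""
    if not case.turns:
        return 0.0
    hits = [
        1.0 if turn.expected_tools == actual else 0.0
        for turn, actual in zip(case.turns, actual_tools_per_turn)
    ]
    # Dividing by the number of expected turns makes a missing turn count as a miss.
    return sum(hits) / len(case.turns)


case = EvalCase(
    case_id="two_turn_research",
    turns=[
        Turn("Find sources on tides", ["search"], "Here are three sources."),
        Turn("Summarize them", ["summarize"], "Tides are driven by gravity."),
    ],
)
```

With this case, an agent that calls `["search"]` on the first turn and `["search", "summarize"]` on the second scores 0.5: the second turn contains the right tool, but the extra call breaks the exact match.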

*Figure: Flow · Offline pipeline*

## Metrics from the offline pipeline

Each item below describes one metric the pipeline reports:

- Subsequence match between the expected and actual tool calls. Rewards the agent for calling the right tools in the right order and tolerates extra calls in between.
- Set-based score that penalises both *missing* tools and *extra* calls. Use it alongside trajectory: trajectory says "right order," F1 says "right set."
- A judge model scores how well the final response matches the user's intent and includes target keywords. Returns a float from 0.0 to 1.0.
- Did the deck end up with the right slide count? Are required terms present? Are citation URLs preserved? Each constraint contributes a fraction of the score.
- Does the final response answer the question that was actually asked?
- If the agent grounded its answer in retrieved research, does that research actually relate to the prompt? Skipped when there's no research context in state.
- How wasteful was the trajectory? Penalises redundant tool calls and meandering paths.
- Did the agent stay on the assigned topic and within its product role?
- Did the agent politely refuse out-of-scope requests (for example, declining to write a poem about spring when asked) instead of complying?
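The first two metrics above, the in-order trajectory match and the set-based F1, can be sketched as follows. Using the longest common subsequence as the "subsequence match" is an assumption about how the score is computed, and the function names are illustrative.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def trajectory_score(expected: list, actual: list) -> float:
    """Share of expected tool calls found in order; extra calls in between are tolerated."""
    if not expected:
        return 1.0
    return lcs_length(expected, actual) / len(expected)


def tool_f1(expected: list, actual: list) -> float:
    """Set-based F1: recall drops for missing tools, precision drops for extra ones."""
    exp, act = set(expected), set(actual)
    if not exp and not act:
        return 1.0
    true_pos = len(exp & act)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(act)
    recall = true_pos / len(exp)
    return 2 * precision * recall / (precision + recall)
```

With expected tools `["search", "outline", "render"]` and actual calls `["search", "lookup", "outline", "render"]`, the trajectory score is 1.0 because the order is preserved, while the F1 is 6/7 because of the extra `lookup`. That gap is why the two are reported side by side.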

## The pipeline, end-to-end

The offline evaluator ties these pieces together: define a list of scenarios with expected tools and constraints, run each through the real `PresentationExpertApp` in an in-memory runner, score every metric, then push the run to Confident AI.
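A minimal sketch of that sequence, under stated assumptions: `run_agent` is a hypothetical callable standing in for the real app and its in-memory runner, only two toy metrics are scored, and the upload step is omitted because it depends on the reporting platform's client. `Scenario` and `AgentResult` are illustrative types, not the author's.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Scenario:
    prompt: str
    expected_tools: list
    required_terms: list


@dataclass
class AgentResult:
    tools_called: list
    final_response: str


def tools_match(result: AgentResult, scenario: Scenario) -> float:
    """1.0 when the tool sequence equals the expected one, else 0.0."""
    return 1.0 if result.tools_called == scenario.expected_tools else 0.0


def constraint_score(result: AgentResult, scenario: Scenario) -> float:
    """Fraction of required terms that appear in the final response."""
    if not scenario.required_terms:
        return 1.0
    text = result.final_response.lower()
    found = sum(term.lower() in text for term in scenario.required_terms)
    return found / len(scenario.required_terms)


def run_offline(scenarios: list, run_agent: Callable[[str], AgentResult]) -> dict:
    """Run every scenario through the agent and return the mean of each metric."""
    rows = []
    for scenario in scenarios:
        result = run_agent(scenario.prompt)
        rows.append({
            "tools_match": tools_match(result, scenario),
            "constraints": constraint_score(result, scenario),
        })
    if not rows:
        return {}
    return {name: mean(row[name] for row in rows) for name in rows[0]}
```

Keeping the agent behind a plain callable makes the loop itself testable with a stub before it is pointed at the real app.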

> **Chef's note:** The metrics module (`eval.metrics`) is shared between offline and online evals. That sharing is the whole trick: the same `_answer_relevancy` function that grades a fixture in CI also grades a live user's turn in production. One source of truth, two delivery modes.

# Components of an online eval

Online evaluation reuses most of the offline vocabulary, but every component has a different shape. The shift is from *fixtures and assertions* to *traces and judgments*.

*Figure: Flow · Online callbacks*

## The callback wiring

Three ADK callbacks do the work. `before_agent_cb` initialises a per-turn scratchpad. `after_model_cb` accumulates tool calls and text as the model streams. `after_agent_cb` snapshots state and fires the judges in a background task.

> **Chef's note:** If the eval can fail the user's request, it doesn't belong in the hot path. Snapshot the state, hand it to `asyncio.create_task`, and let the dashboard wait. The user shouldn't.
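The pattern can be sketched without any framework. The three functions below borrow the callback names from the text but use simplified, hypothetical signatures (a plain `state` dict), not ADK's real callback signatures. `scores_sink` stands in for an observability platform, and `run_judges` is a placeholder for the actual judge calls.

```python
import asyncio
import copy

scores_sink = []    # stand-in for an observability platform
background = set()  # hold task references so pending tasks are not garbage-collected


def before_agent_cb(state: dict) -> None:
    """Start each turn with a fresh scratchpad."""
    state["scratch"] = {"tool_calls": [], "text": []}


def after_model_cb(state: dict, tool_call: str = None, text: str = None) -> None:
    """Accumulate tool calls and text as the model produces them."""
    if tool_call is not None:
        state["scratch"]["tool_calls"].append(tool_call)
    if text is not None:
        state["scratch"]["text"].append(text)


async def run_judges(snapshot: dict) -> None:
    """Placeholder for slow judge calls; records two trivial measurements."""
    await asyncio.sleep(0)
    scores_sink.append({
        "n_tools": len(snapshot["tool_calls"]),
        "chars": len("".join(snapshot["text"])),
    })


def after_agent_cb(state: dict) -> None:
    """Snapshot the turn and score it in the background; never await it here."""
    snapshot = copy.deepcopy(state["scratch"])  # later turns cannot mutate the copy
    task = asyncio.create_task(run_judges(snapshot))
    background.add(task)
    task.add_done_callback(background.discard)
```

`after_agent_cb` must be called from inside a running event loop, since `asyncio.create_task` requires one. The deep copy matters because the next turn resets the scratchpad while the judges may still be reading the previous one.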

# When to write evals

Writing evals is useful at *every* stage of the agent lifecycle:

- **Early.** Evals force product teams to specify what success means. The exercise of writing the first test case is often more valuable than the test itself.
- **Middle.** They become the regression net. As you ship new tools, new prompts, and new sub-agents, the offline suite catches the things you broke without noticing.
- **Late.** They uphold a consistent quality bar across teams, model upgrades, and harness rewrites. Online evals tell you when the bar starts slipping in production.

## Defense in depth

No single eval layer catches every issue. The reliable agents are the ones with a *stack* of imperfect filters — each catching what the others miss. Picture the Swiss cheese model: every slice has holes, but the holes line up only rarely. Together, they prevent most bad agent behavior from reaching the user.

Every layer has holes. The art is making sure the holes never line up.

## Automated evals

Faster iteration, reproducible, runs pre-launch in CI on every commit.

Catches: regressions, basic logic errors

## Manual transcript review

Builds intuition, handles qualitative issues a metric won't catch.

Catches: subtle inaccuracies, off-topic responses

## Systematic human studies

Gold-standard quality for subjective tasks and complex domains.

Catches: ambiguity, nuanced errors, bias

## Production monitoring

Live user behavior, real-world failures, post-launch observability.

Catches: drift, live system issues, unexpected edge cases

## A/B testing

Measures user outcomes, validates significant changes, scales.

Catches: performance variations, product regressions

## User feedback

Explicit signals from real users: unanticipated problems, real examples.

Catches: explicit bugs, dissatisfaction, feature requests

The harness is the model's clothes. You wouldn't ship a model without evaluating the model. Don't ship the harness without evaluating *the harness*.

# References

1. [Google ADK — Evaluation Guide](https://raw.githubusercontent.com/GoogleCloudPlatform/agent-starter-pack/refs/heads/main/agent_starter_pack/resources/docs/adk-eval-guide.md)
2. [DeepEval — Confident AI documentation](https://www.confident-ai.com/docs/llm-evaluation/introduction)
3. [LangSmith — Evaluation concepts](https://docs.langchain.com/langsmith/evaluation-concepts)
4. [Ragas — Framework concepts](https://docs.ragas.io/en/stable/concepts/)
5. [Promptfoo](https://github.com/promptfoo/promptfoo)
6. [Braintrust — Documentation](https://www.braintrust.dev/docs)
7. [OpenAI — Realtime agent evaluation cookbook](https://developers.openai.com/cookbook/examples/realtime_eval_gu)