    May 17, 2026 · 8 min read · AI Agents

    Demystifying Agent Harness Evaluation

    An agent is a model plus a scaffold — tools, memory, skills, control flow. To evaluate one well, you have to evaluate the whole thing, not just the model's fina

    Shrestha Pawar
    AI Architect

    What is evaluation?

    An evaluation ("eval") is a test for an AI system: give the system an input, then apply grading logic to its output to measure success.

    Evaluating traditional software is straightforward — given input X, assert output Y. Evaluating an agent is fundamentally different, because the interesting behavior is not just the final answer. It's the path the agent took to get there — the tools, skills, and memory it chose, the order it called them in, and the context it preserved across turns.

    Chef's note: Single-turn evals — a prompt, a response, grading logic — were the main story for earlier LLMs. As capabilities advanced, multi-turn evals have become standard. And multi-turn agent evals are the hardest of all: agents use tools across many turns, modifying state in the environment and adapting as they go — which means mistakes can propagate and compound.

    The harness, in layers

    The harness isn't a single wrapper. It's multi-layered — concentric responsibilities sitting around the model, each catching a different class of failure. Eval coverage only becomes meaningful once you know which layer a test is poking at.


    The harness is multi-layered, not a single wrapper.

    The LLM

    The model itself. Stateless, no memory, no tools — just tokens in, tokens out. Everything else is scaffold.

    The engine room

    Builds prompts, runs the agent loop, parses structured output, and recovers from malformed responses. Most “did the agent break?” failures live here.

    What it can do

    Tools the agent can call, memory it can persist, state it can mutate, context it threads across turns. Capability tests — “did it pick the right tool?” — live here.

    The outer shell

    Catches unsafe input, limits tool reach, verifies critical outputs, spawns sub-agents for sub-tasks, and preempts runaway loops before they bill you for them.

    Transcript or outcome?

    Every grader you write attaches to one of two things. The transcript is how the agent got there — tools called, in what order, with what arguments, against what state. The outcome is whether the dish landed: did the final answer satisfy the user; did the side-effect actually happen. Pick which one a metric is measuring before you write it, or you'll end up with numbers that average a path and an outcome into mush.
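A minimal sketch of keeping the two apart, assuming a hypothetical `Run` record and a made-up flight-booking scenario (none of these names come from a real framework). One grader reads only the transcript, the other reads only the outcome, and the report keeps both numbers instead of averaging them.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class Run:
    """One recorded agent run: the path taken plus the end result (hypothetical shape)."""
    tool_calls: list[ToolCall]
    final_answer: str
    side_effects: dict  # snapshot of environment state after the run


def grade_transcript(run: Run, expected_tools: list[str]) -> float:
    """Transcript grader: 1.0 only if the tools were called in the expected order."""
    actual = [call.name for call in run.tool_calls]
    return 1.0 if actual == expected_tools else 0.0


def grade_outcome(run: Run, expected_state: dict) -> float:
    """Outcome grader: 1.0 only if every expected side effect actually happened."""
    ok = all(run.side_effects.get(key) == value for key, value in expected_state.items())
    return 1.0 if ok else 0.0


run = Run(
    tool_calls=[ToolCall("search_flights"), ToolCall("book_flight", {"id": "F1"})],
    final_answer="Booked flight F1.",
    side_effects={"booking_status": "confirmed"},
)

# Keep the two scores as separate fields; averaging them would hide which one failed.
report = {
    "transcript": grade_transcript(run, ["search_flights", "book_flight"]),
    "outcome": grade_outcome(run, {"booking_status": "confirmed"}),
}
print(report)
```

A run can score 0.0 on transcript and 1.0 on outcome (right result, unexpected path) or the reverse, and those two failures usually call for different fixes.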

    Different agent, different metric

    You can't apply one metric to every agent. Coding agents, research agents, voice agents, and conversational agents all have different definitions of "good" — and the right evaluation has to mirror that.

    Coding agents

    Write, test, and debug code. Navigate codebases, run commands, behave like a human developer.

    Research agents

    Gather, synthesize, and analyze information; produce answers or reports. Quality is judged relative to the task, not against a unit test.

    Voice agents

    Realtime, streaming interactions. You grade two outputs at once: what the assistant does and how it sounds.

    Conversational agents

    Support, sales, coaching. Maintain state, use tools, take actions mid-conversation.

    The two arenas: lab & wild

    To build a reliable agent, you need to test it in a controlled lab and out in the real world. The two regimes have different inputs, different metrics, and different performance budgets.

    Offline

    The lab

    Runs before traffic arrives. Design test cases, run the agent against them in a controlled in-memory environment, grade all metrics at once. If the agent fails, it fails safe — no real user is harmed.

    • Predefined datasets, known ground truth
    • Stable, in-memory sandbox
    • As many heavy metrics as you can afford
    • Control inputs so runs are reproducible
    • Compare actual trajectory to expected
    • Block CI on regressions

    Online

    The wild

    Runs while real users interact in production. Metrics fire after each live turn and stream to an observability platform. There is no ground truth — there's no hardcoded right answer to a user's unique question.

    • Real, unpredictable user sessions
    • Callbacks after each completed turn
    • Fast LLM-as-a-judge — no fixtures
    • Score async; never block the user
    • Run fewer, lighter metrics per turn
    • Lean on outcome signals: thumbs, abandonment, side-effects

    Offline tells you the agent is built correctly. Online tells you the agent is doing the right thing.

    Types of graders for agents

    Agent evaluations typically combine three kinds of graders. Each grades some portion of either the transcript or the outcome. Choosing the right grader for the right job is half the work of evaluation design.

    Deterministic

    Exact-match assertions, regex, string similarity (ROUGE), tool-trajectory comparisons. Cheap, fast, reproducible. Good for structural checks you can fully specify.
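As a rough illustration, the checks above fit in a few lines of standard-library Python. The function names are made up for this sketch, and `rouge1_f1` is a simplified unigram-overlap score (whitespace tokens, no stemming), not a full ROUGE implementation.

```python
import re
from collections import Counter


def exact_match(actual: str, expected: str) -> bool:
    """Exact-match assertion, ignoring leading and trailing whitespace."""
    return actual.strip() == expected.strip()


def matches_pattern(actual: str, pattern: str) -> bool:
    """Regex check: does the output contain something shaped like `pattern`?"""
    return re.search(pattern, actual) is not None


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def trajectory_exact(actual: list[str], expected: list[str]) -> float:
    """Tool-trajectory comparison: same tools, same order, nothing extra."""
    return 1.0 if actual == expected else 0.0


print(exact_match(" 42 ", "42"))
print(matches_pattern("Order #1234 confirmed", r"#\d{4}"))
print(round(rouge1_f1("the cat", "the cat sat down"), 3))
print(trajectory_exact(["search", "summarize"], ["search", "summarize"]))
```

All four are pure functions of their inputs, which is what makes them cheap to run on every commit.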

    LLM-as-a-judge

    A model scores another model's output against a rubric. Handles fuzzy judgments — comprehensiveness, tone, groundedness — at scale. Needs prompt-engineered rubrics and calibration.
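One way such a judge could be wired up, sketched under assumptions: `call_model` is any function that takes a prompt and returns the judge model's raw text, the rubric wording is invented for illustration, and `fake_model` stands in for a real model call so the example runs offline.

```python
import json
from typing import Callable

# Hypothetical rubric; real rubrics need iteration and calibration against expert labels.
RUBRIC = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Score each criterion from 0 to 1 and reply with JSON only:
{{"groundedness": <float>, "tone": <float>, "comprehensiveness": <float>}}"""


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> dict[str, float]:
    """Ask a judge model for rubric scores and return them clamped to [0, 1]."""
    raw = call_model(RUBRIC.format(question=question, answer=answer))
    try:
        scores = json.loads(raw)
        return {name: min(1.0, max(0.0, float(value))) for name, value in scores.items()}
    except (ValueError, TypeError, AttributeError):
        # Unparseable judge output means "no score", which is different from a score of zero.
        return {}


def fake_model(prompt: str) -> str:
    """Stand-in for a real judge model; note the out-of-range tone score."""
    return '{"groundedness": 0.9, "tone": 1.2, "comprehensiveness": 0.5}'


print(judge("What is an eval?", "A test for an AI system.", fake_model))
```

Returning an empty result on malformed output keeps judge failures visible as missing data instead of silently dragging averages down.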

    Expert review

    The ground truth for everything else. Expensive, slow, irreplaceable for ambiguous or high-stakes domains. Use sparingly to calibrate the other two.

    Components of an offline eval

    Offline evaluation has a fixed cast of components. In ADK (and most agent frameworks) the same nouns recur: an eval case, an eval set, a trajectory, the final response, evaluation criteria, and a runner that glues them together.

    Eval case

    A single test representing one agent-model session. It contains one or more turns, each with a user query, the expected tool use, any expected intermediate responses, and the expected final response.

    Eval set

    A dataset file containing many eval cases. Where test files play the role of unit tests, an eval set plays the role of integration tests: complex, multi-turn conversations that exercise whole workflows.

    Trajectory

    The sequence of steps the agent takes before returning to the user: tool choices, sub-agent invocations, reasoning strategy. You score performance by comparing the actual trajectory to the expected trajectory.

    Evaluation criteria

    The graders. ADK ships tool_trajectory_avg_score (exact tool-call match), response_match_score (ROUGE-1), and judge-based criteria like final_response_match_v2 and rubric_based_tool_use_quality_v1.

    Final response

    The outcome metric. Trajectory tells you how the agent got there; final response tells you whether the answer is correct, relevant, and grounded.

    User simulation

    For tests where fixed inputs aren't enough. The "user" prompts are generated by another model, letting you probe how the agent behaves when the conversation goes off-script.

    Runner

    The harness for the harness. Runs end-to-end evals three ways: via the web UI (adk web), programmatically in CI/CD (pytest), or via CLI (adk eval).
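To make the vocabulary concrete, here is a hypothetical in-memory model of the first few nouns. The class and field names are illustrative only and do not reproduce ADK's eval set file format.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One user query plus what we expect the agent to do with it."""
    user_query: str
    expected_tools: list[str] = field(default_factory=list)
    expected_final_response: str = ""


@dataclass
class EvalCase:
    """A single test: one session made of one or more turns."""
    case_id: str
    turns: list[Turn]

    def expected_trajectory(self) -> list[str]:
        """Flatten the expected tool use across all turns, in order."""
        return [tool for turn in self.turns for tool in turn.expected_tools]


@dataclass
class EvalSet:
    """A named collection of eval cases that exercise whole workflows."""
    name: str
    cases: list[EvalCase]


case = EvalCase(
    case_id="deck-basic",
    turns=[
        Turn("Make a deck about solar power", ["search", "build_deck"], "Here is your deck."),
        Turn("Add a slide on costs", ["add_slide"], "Added a costs slide."),
    ],
)
suite = EvalSet(name="presentation-smoke", cases=[case])
print(case.expected_trajectory())
```

A runner then iterates over `suite.cases`, drives the agent through each turn, and hands the actual trajectory and final response to the graders.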

    Flow · Offline pipeline


    Metrics from the offline pipeline

    Subsequence match between the expected and actual tool calls. Rewards the agent for calling the right tools in the right order; tolerates extra calls in between.

    Set-based score that penalises both missing tools and extra calls. Use alongside trajectory: trajectory says "right order," F1 says "right set."

    A judge model scores how well the final response matches the user's intent and includes target keywords. Float 0.0–1.0.

    Did the deck end up with the right slide count? Are required terms present? Are citation URLs preserved? Each constraint contributes a fraction.

    Does the final response answer the question that was actually asked?

    If the agent grounded in retrieved research, does that research actually relate to the prompt? Skipped when there's no research context in state.

    How wasteful was the trajectory? Penalises redundant tool calls and meandering paths.

    Did the agent stay on the assigned topic and within its product role?

    Did the agent politely refuse out-of-scope requests? For example, when asked to write a poem about spring, did it decline instead of complying?
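The first few metrics above are deterministic and can be sketched directly. This is one possible reading, not the article's implementation: the subsequence match is scored all-or-nothing, the F1 works on sets of tool names (so repeated calls to the same tool are not penalised), and the function names are invented for the example.

```python
from typing import Callable


def subsequence_match(expected: list[str], actual: list[str]) -> float:
    """1.0 if `expected` appears in order within `actual`; extra calls in between are tolerated."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the match, which enforces ordering.
    return 1.0 if all(tool in remaining for tool in expected) else 0.0


def tool_f1(expected: list[str], actual: list[str]) -> float:
    """Set-based F1: precision drops on extra tools, recall drops on missing tools."""
    exp, act = set(expected), set(actual)
    if not exp and not act:
        return 1.0
    hits = len(exp & act)
    if hits == 0:
        return 0.0
    precision = hits / len(act)
    recall = hits / len(exp)
    return 2 * precision * recall / (precision + recall)


def constraint_score(output: dict, checks: list[Callable[[dict], bool]]) -> float:
    """Fraction of constraints satisfied; each check contributes equally."""
    if not checks:
        return 1.0
    return sum(bool(check(output)) for check in checks) / len(checks)


expected = ["search", "outline", "render"]
actual = ["search", "fetch", "outline", "render"]
deck = {"slides": 10, "text": "Q3 revenue grew", "urls": ["https://example.com/a"]}
checks = [
    lambda d: d["slides"] == 10,
    lambda d: "revenue" in d["text"],
    lambda d: "forecast" in d["text"],
]

print(subsequence_match(expected, actual))
print(round(tool_f1(expected, actual), 3))
print(round(constraint_score(deck, checks), 3))
```

In this example the extra `fetch` call leaves the subsequence score at 1.0 but pulls F1 below 1.0, which is the complementary behaviour the two metrics are paired for.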

    The pipeline, end-to-end

    The offline evaluator runs in four steps: define a list of scenarios with expected tools and constraints, run each one through the real PresentationExpertApp in an in-memory runner, score every metric, then push the run to Confident AI.

    Chef's note: The metrics module (eval.metrics) is shared between offline and online evals. That sharing is the whole trick: the same _answer_relevancy function that grades a fixture in CI also grades a live user's turn in production. One source of truth, two delivery modes.
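A minimal sketch of that pipeline's shape, under heavy assumptions. `StubApp` is a canned stand-in for PresentationExpertApp, `Scenario`, `score`, and `run_suite` are hypothetical names, only two toy metrics are scored, and the upload to Confident AI is left out because it depends on that platform's client.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Scenario:
    prompt: str
    expected_tools: list[str]
    required_terms: list[str]


class StubApp:
    """Stand-in for the real agent app: returns a canned trajectory and answer."""

    def run(self, prompt: str) -> tuple[list[str], str]:
        return ["search", "build_deck"], f"Deck about {prompt} with citations."


def score(scenario: Scenario, tools: list[str], answer: str) -> dict[str, float]:
    """Score one run: exact tool match plus the fraction of required terms present."""
    tool_hit = 1.0 if tools == scenario.expected_tools else 0.0
    terms = scenario.required_terms
    found = sum(term.lower() in answer.lower() for term in terms)
    term_hit = found / len(terms) if terms else 1.0
    return {"tools": tool_hit, "terms": term_hit}


def run_suite(app, scenarios: list[Scenario], threshold: float = 0.8):
    """Run every scenario, average each metric, and decide whether the suite passes."""
    rows = [score(s, *app.run(s.prompt)) for s in scenarios]
    summary = {metric: mean(row[metric] for row in rows) for metric in rows[0]}
    passed = all(value >= threshold for value in summary.values())
    return summary, passed


scenarios = [
    Scenario("solar power", ["search", "build_deck"], ["solar", "citations"]),
    Scenario("wind energy", ["search", "build_deck"], ["wind", "pricing"]),
]
summary, passed = run_suite(StubApp(), scenarios)
print(summary, passed)
```

In CI, the `passed` flag is what would block a merge; the per-metric `summary` is what you would ship to a dashboard.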

    Components of an online eval

    Online evaluation reuses most of the offline vocabulary, but every component has a different shape. The shift is from fixtures and assertions to traces and judgments.

    Flow · Online callbacks


    The callback wiring

    Three ADK callbacks do the work. before_agent_cb initialises a per-turn scratchpad. after_model_cb accumulates tool calls and text as the model streams. after_agent_cb snapshots state and fires the judges in a background task.

    Chef's note: If the eval can fail the user's request, it doesn't belong in the hot path. Snapshot the state, hand it to asyncio.create_task, and let the dashboard wait. The user shouldn't.

    When to write evals

    Writing evals is useful at every stage of the agent lifecycle:

    • Early. Evals force product teams to specify what success means. The exercise of writing the first test case is often more valuable than the test itself.
    • Middle. They become the regression net. As you ship new tools, new prompts, and new sub-agents, the offline suite catches the things you broke without noticing.
    • Late. They uphold a consistent quality bar across teams, model upgrades, and harness rewrites. Online evals tell you when the bar starts slipping in production.

    Defense in depth


    No single eval layer catches every issue. The reliable agents are the ones with a stack of imperfect filters — each catching what the others miss. Picture the Swiss cheese model: every slice has holes, but the holes line up only rarely. Together, they prevent most bad agent behavior from reaching the user.

    Every layer has holes. The art is making sure the holes never line up.
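A back-of-the-envelope way to see why stacking helps, using made-up miss rates and assuming the layers fail independently. That assumption rarely holds exactly in practice, so treat the result as an optimistic bound, not a measurement.

```python
from math import prod

# Hypothetical probability that each layer misses a given bad behavior.
miss_rates = {
    "automated evals": 0.5,
    "manual transcript review": 0.5,
    "production monitoring": 0.25,
}

# If the layers were independent, a failure reaches the user only when every layer misses it.
escape_probability = prod(miss_rates.values())
print(escape_probability)
```

Three mediocre filters that each miss a quarter to half of the failures would, under independence, let through about 6%. When layers share a blind spot their holes line up, and the real figure is higher.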

    Automated evals

    Faster iteration, reproducible, runs pre-launch in CI on every commit.
    Catches: regressions, basic logic errors.

    Manual transcript review

    Builds intuition, handles qualitative issues a metric won't catch.
    Catches: subtle inaccuracies, off-topic responses.

    Systematic human studies

    Gold-standard quality for subjective tasks and complex domains.
    Catches: ambiguity, nuanced errors, bias.

    Production monitoring

    Live user behavior, real-world failures, post-launch observability.
    Catches: drift, live system issues, unexpected edge cases.

    A/B testing

    Measures user outcomes, validates significant changes, scales.
    Catches: performance variations, product regressions.

    User feedback

    Explicit signals from real users: unanticipated problems, real examples.
    Catches: explicit bugs, dissatisfaction, feature requests.

    The harness is the model's clothes. You wouldn't ship a model without evaluating the model. Don't ship the harness without evaluating the harness.
