AI Agent Evaluation Steers the Harness

Agent evaluation keeps getting conflated with scoring agent performance. The scoring is useful, but what a team actually gets out of it is edits to the harness.

If an evaluation never changes the harness, the team is left with a dashboard written up in better language. LangChain puts it more sharply in its Deep Agents writeup: every evaluation is a vector that shifts the behavior of the agentic system.

That sentence carries the argument.

An eval suite is part of the system. It drives the agent to behave in one way and not another. A sloppy eval is cheap to run, but it still steers. A broad benchmark is a good thing to include in a review, but it does not push the system to perform well on the task distribution actual users encounter.

I see people treat agent evaluation as if it were only a way to score agent performance. Yes, the score matters. But what does it buy? Edits to the harness. One failing trace gives the team something more valuable than a number: insight into the structure of the harness.

The harness is where the lesson lands

An agent harness is the stuff around the model: tools, tool descriptions, prompts, routing rules, memory, retrieval, runtime policy, state shape, and all the weird connective tissue that decides what the model can actually do. We have written about this in the context of developing AI agency, because model swaps get too much credit and harness changes get too little ownership.

Agent evaluation belongs there.

LangChain's Better-Harness writeup makes the loop explicit: evals create the learning signal for iteratively improving prompts, tools, tool descriptions, instructions, and runtime scaffolding. To recap the useful part: design evals, run evals, get learning signal from evals, then use that learning signal to improve the components around the model. The evals are training data for the harness.

An eval failure might trace back to a weakly worded instruction, a tool set up with incorrect parameters, a retrieval component that sends the agent on a wild goose chase through a swamp of useless information, a routing rule that gives the cheap model the first attempt even though that model needs many tries to reach the right answer, or agent state that disappears at exactly the wrong moment, just as the agent is trying to recover from something.

The score does not fix these problems. The score only earns the right to edit the harness.

Figure: circular flow showing production traces turning into eval datasets, scores, harness edits, holdout gates, and release. The eval loop is only useful when it changes the harness and then protects the next release.

Production traces are the raw material

The best eval cases come from the system embarrassing itself.

Take a refund agent: the customer asks for a refund and the agent never checks eligibility. Take a research agent: it reads the first file in a series of linked files, never opens the rest, and then confidently and incorrectly summarizes the material for the user. Take a coding agent: it changes the implementation, fails to update the tests, and then reports success because the patch compiled in the agent's head.

LangChain's readiness checklist says the first move is to manually review 20 to 50 real agent traces before building eval infrastructure. That advice is refreshingly boring. More important, reading 50 real traces saves months, because the team avoids building an eval suite from vibes, guesses, and the latest loud failure.

Capability evals and regression evals do different jobs. Capability evals measure whether the system can do new things, so they naturally start with a low pass rate while the team is hill climbing. Once the system can do something, it should keep doing it. Regression evals catch the system falling back from behavior that the product relies on.
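A minimal sketch of keeping the two kinds apart when reporting results. The `EvalResult` shape, the `kind` labels, and the case names are assumptions made up for illustration, not part of any LangChain API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    kind: str      # assumed labels: "capability" or "regression"
    passed: bool


def pass_rate(results, kind):
    """Fraction of passing cases of one kind, or None if there are no such cases."""
    subset = [r for r in results if r.kind == kind]
    if not subset:
        return None
    return sum(r.passed for r in subset) / len(subset)


def regression_failures(results):
    """Case ids of regression evals that failed. Any entry here is a red flag."""
    return [r.case_id for r in results if r.kind == "regression" and not r.passed]


# Hypothetical results: capability cases may fail while the team is hill climbing,
# regression cases are expected to keep passing.
results = [
    EvalResult("refund-eligibility", "regression", True),
    EvalResult("update-tests-with-patch", "regression", False),
    EvalResult("multi-file-research", "capability", False),
    EvalResult("parallel-subtasks", "capability", True),
]

print("capability pass rate:", pass_rate(results, "capability"))
print("regression failures:", regression_failures(results))
```

The design choice is that a capability pass rate is read as a trend, while a single regression failure is read as a defect.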

Evaluate the path when the path is the product

Another simplistic assumption for agent evaluations: just grade the final answer.

Grading only the final answer does not work for task classes where the path is part of the product surface. The refund agent decides a refund request after skipping the check against company policy. The data agent produces the chart for a meeting by issuing a full table scan in production. The support agent solves the ticket by leaking internal notes to the customer in the conversation.

Google's Agent Development Kit docs make the split cleanly: agent evaluation should assess both final output quality and trajectory, meaning the sequence of steps, tools, and reasoning the agent used. Their ADK codelab turns that into a testing workflow with golden datasets that preserve user query, trajectory, and final response.

The release gate has to know which behaviors depend on an acceptable path and which depend only on the agent's output for a given input. Otherwise the team either blocks good changes with brittle tests or ships dangerous changes because the final answer looked right.
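One way to sketch a path check, assuming each golden case lists the tool calls that must appear in order and the ones that must never appear. The `GoldenCase` fields and tool names are hypothetical. This is not the ADK API, just the idea of grading trajectory next to the final response.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    query: str
    required_steps: list                      # tool names that must appear, in this order
    forbidden_steps: set = field(default_factory=set)
    path_sensitive: bool = True               # False means only the final output is graded


def follows_path(case, actual_steps):
    """True if the observed tool calls satisfy the case's path constraints."""
    if not case.path_sensitive:
        return True
    if case.forbidden_steps & set(actual_steps):
        return False
    # Ordered-subsequence check: each required step must be found after the previous one.
    remaining = iter(actual_steps)
    return all(step in remaining for step in case.required_steps)


refund_case = GoldenCase(
    query="I want a refund for my order",
    required_steps=["check_eligibility", "issue_refund"],
    forbidden_steps={"read_internal_notes"},
)

good = ["lookup_order", "check_eligibility", "issue_refund"]
skipped_check = ["lookup_order", "issue_refund"]

print(follows_path(refund_case, good))
print(follows_path(refund_case, skipped_check))
```

Extra steps between the required ones are allowed on purpose, which keeps the check from becoming one of the brittle tests that block good changes.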

System, trace, and node evals answer different questions

Agentic CLEAR names the different levels at which a team can assess an agent's ability to complete complex tasks. The paper describes an evaluation framework that produces insights at system, trace, and node levels of granularity. IBM's project page expands that into an open-source package that evaluates traces across system-wide issues, node or component analysis, and trace-level inspection.

Those levels map to different harness changes.

A system-level eval asks whether the workflow produced the outcome the product cares about. Did the claims agent resolve the case? Did the analyst agent produce a grounded answer? Did the coding agent land a patch that passed the intended checks? System-level failures point to architecture: routing, ownership, data access, memory, deployment boundary, or business process fit.

Trace-level evaluations assess whether an agent follows a task through to a coherent end. Did the agent search for the appropriate information before writing? Did it use the correct tool to send a package? Did it run in parallel the work that could be completed in parallel? Did it handle a task that has no clear end point? Trace-level failures point to planning, tool use, interrupt handling, retry policy, and multi-agent orchestration. This is where multi-agent orchestration stops being a diagram and starts being an eval surface.

Node-level evaluations examine the individual steps an agent takes as it works through a task. Did a retrieval node return the correct documents? Did the summarizer preserve constraints? Did a tool call include the tenant ID? Did the model pick the correct function for the job? Node-level failures can be addressed by changing local parts of the harness: a tool schema, prompt wording, retrieval filters, model choice, or guardrail placement.

One pass rate does not cut it for this kind of evaluation. A single number does not show the agent developer where the repair surface is.
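A small sketch of the difference, assuming each eval result is tagged with its level. The `REPAIR_SURFACE` mapping is an illustrative shorthand for the harness areas named above, not a fixed taxonomy.

```python
from collections import Counter

# Assumed mapping from eval level to the harness surfaces it tends to implicate.
REPAIR_SURFACE = {
    "system": ["routing", "data access", "memory", "deployment boundary"],
    "trace": ["planning", "tool use", "retry policy", "orchestration"],
    "node": ["tool schema", "prompt wording", "retrieval filters", "model choice"],
}


def failures_by_level(results):
    """Count failed evals per level. `results` is an iterable of (level, passed)."""
    return Counter(level for level, passed in results if not passed)


def repair_candidates(results):
    """Harness surfaces to inspect, ordered by how many evals failed at each level."""
    counts = failures_by_level(results)
    return [(level, n, REPAIR_SURFACE[level]) for level, n in counts.most_common()]


# Hypothetical run: the single pass rate hides that most failures are node-level.
results = [
    ("system", True),
    ("trace", False),
    ("node", False),
    ("node", False),
    ("node", True),
]

overall = sum(passed for _, passed in results) / len(results)
print("overall pass rate:", overall)
for level, count, surfaces in repair_candidates(results):
    print(level, count, surfaces)
```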

Figure: layered stack showing system, trace, and node levels of AI agent evaluation feeding harness changes. System, trace, and node evals answer different questions, so they should change different parts of the harness.

Observability feeds evaluation, then evaluation changes behavior

Agent evaluation without traces becomes example-driven theater: teams argue about a few examples, write up synthetic test cases, and then qualitatively evaluate behavior that nobody has actually seen triggered in real life.

Observability without evaluation is storage. I like traces. I like spans. But the operational data is more than something pleasant to look at: it is a receipt for how the system got to a given point, and that receipt is what can become evaluation data.

Honeycomb's AI-era observability piece spells out the dependency. In their words, agentic workflows depend on fast, queryable, high-cardinality operational data because agents ask iterative questions against raw production context, not just dashboards. The easiest way to compromise an evaluation dataset is for production traces to stop including tenant ID, tool arguments, retrieval sources, policy decisions, model versions, prompt versions, and release versions.
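A hedged sketch of guarding against that: before a production trace is promoted into the eval dataset, check that it still carries the fields listed above. The field names and the dict-shaped trace record are assumptions, since real tracing backends name these differently.

```python
# Assumed field names for the context a trace needs to be useful as an eval case.
REQUIRED_TRACE_FIELDS = (
    "tenant_id",
    "tool_arguments",
    "retrieval_sources",
    "policy_decisions",
    "model_version",
    "prompt_version",
    "release_version",
)


def missing_fields(trace):
    """Return the required fields that are absent or None in a trace record."""
    return [name for name in REQUIRED_TRACE_FIELDS if trace.get(name) is None]


def usable_for_evals(traces):
    """Keep only the traces that carry every required field."""
    return [t for t in traces if not missing_fields(t)]


# Hypothetical records: the second one lost its prompt and release versions.
complete = {
    "tenant_id": "t-42",
    "tool_arguments": {"order_id": "A1"},
    "retrieval_sources": ["policy.md"],
    "policy_decisions": ["refund_allowed"],
    "model_version": "m-1",
    "prompt_version": "p-7",
    "release_version": "r-3",
}
degraded = {k: v for k, v in complete.items()
            if k not in ("prompt_version", "release_version")}

print(missing_fields(degraded))
print(len(usable_for_evals([complete, degraded])))
```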

The eval dataset should be downstream of observability and upstream of the harness.

This is why Agent Monitoring Is an Infrastructure Workload. Monitoring, log collection, metrics collection, and tracing must be treated as workloads and run as services. Otherwise they are screenshots in a vendor console. The trace proves the agent failed. Then it dies in storage.

The release gate is the boring power move

A good release shape is a pull request that contains the modified harness, with every change visible in the diff, plus the evaluations that motivated the change. The pull request states that trace 481 failed and that the failed trace and its evaluation drove a change to the tool description. Or that a retrieval filter changed to close a tenant leak found by a node eval. Or that a route now uses a stronger model because a holdout set found cheap-model failures. Or the release is blocked because the regression suite found a path violation in the payment-approval case.

That is boring in the correct way.

LangChain's readiness checklist describes a CI/CD flow where code or prompt changes trigger offline evals, preview deployments, online evals, and promotion only after quality gates pass. Better-Harness then covers how optimization examples can guide improvement, while holdout evals and human review protect against overfitting the harness to visible cases.
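As a rough sketch of the final quality gate in such a flow: block on any regression failure, and block when the holdout pass rate drops below a threshold. The function name, the input shapes, and the 0.9 threshold are all assumptions for illustration, not LangChain's implementation.

```python
def release_gate(regression_results, holdout_results, min_holdout_pass_rate=0.9):
    """Return (allowed, reasons).

    regression_results: dict of case id -> passed (bool)
    holdout_results: list of bools from evals kept out of the optimization loop
    """
    reasons = []

    failed = sorted(case for case, passed in regression_results.items() if not passed)
    if failed:
        reasons.append("regression failures: " + ", ".join(failed))

    if holdout_results:
        rate = sum(holdout_results) / len(holdout_results)
        if rate < min_holdout_pass_rate:
            reasons.append(
                f"holdout pass rate {rate:.2f} below {min_holdout_pass_rate:.2f}"
            )
    else:
        # No holdout evidence means nothing protects against overfitting.
        reasons.append("no holdout results")

    return (not reasons, reasons)


# Hypothetical run: the payment-approval path check regressed, so the release is blocked.
allowed, reasons = release_gate(
    {"payment-approval-path": False, "refund-eligibility": True},
    [True] * 10,
)
print(allowed, reasons)
```

Returning the reasons along with the decision keeps the gate readable in a pull request, where a person still reviews what the numbers cannot see.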

Without someone who owns that gate, AI agent evaluation becomes a pile of numbers. With an owner, evals become a steering wheel.

The first useful eval stack is small

The practical stack does not have to start fancy.

Start by recording real usage traces. Pull out the failures and give each one success criteria that any reasonable human could check. Distinguish between the capability hills the evaluation is trying to climb and the regressions it is trying to prevent. Tag evaluations by behavior. Keep holdouts out of the agent's optimization loop. For each failed eval, log the part of the harness that changed. Run regression evals in CI before the agent gets another production release.
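One simple way to keep holdouts out of the optimization loop is to assign each eval case to a split by hashing its id, so the assignment never changes between runs. This is a sketch under assumptions: the split names, the 20 percent default, and the case ids are invented for the example.

```python
import hashlib


def assign_split(case_id, holdout_percent=20):
    """Deterministically map a case id to 'holdout' or 'optimization'."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket in 0..99
    return "holdout" if bucket < holdout_percent else "optimization"


def split_cases(case_ids, holdout_percent=20):
    """Partition case ids so holdout cases never feed harness optimization."""
    splits = {"optimization": [], "holdout": []}
    for case_id in case_ids:
        splits[assign_split(case_id, holdout_percent)].append(case_id)
    return splits


# Hypothetical case ids derived from production traces.
case_ids = [f"trace-{n}" for n in range(480, 490)]
splits = split_cases(case_ids)
print(len(splits["optimization"]), len(splits["holdout"]))
```

Hashing the id instead of sampling at random means a case cannot drift from holdout into the optimization set when the dataset is rebuilt.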

Keep the granularity straight. System-level evaluations verify the system workflow as a whole. Trace-level evaluations verify that the path an agent took to arrive at a conclusion was acceptable. Node-level evaluations verify that the local step an agent took at a given node was correct.

Even good AI will sometimes fail to reach its desired outcome, and that is where the dataset comes from.

Every meaningful failure of the AI system should become evidence that a human can reuse. A trace of a failure in real user interactions becomes an eval. The eval becomes a harness edit. The harness edit goes through the holdout gate. The next production run produces more evidence.

In the end, AI agent evaluation is engineering.