If you are in a management or leadership role right now, AI is probably coming at you from every direction. LinkedIn thought leaders are telling you it will 10x your productivity. Trade magazines are running cover stories about companies that "went all-in on AI" and are now reaping the benefits. Your board wants a strategy. Your team wants direction. And somewhere underneath all of that noise is a very reasonable, very legitimate question that almost nobody is asking out loud:
“How do I actually know if this thing is working?”
This post is for the people asking that question. Not the engineers building the evaluation pipelines, not the compliance officers writing the audit trails, but the managers and decision-makers who need to understand what is happening inside their AI systems, why it costs what it costs, and whether they are spending their testing budget on the right things.
There is a real and measurable gap between how AI is marketed and how it behaves in production. The marketing talks about transformation; the reality is that AI agents are powerful but probabilistic tools that require ongoing oversight, measurement, and governance to deliver consistent value.
This is not a reason to avoid AI. It is a call to take AI evaluation seriously.
The organizations that will get the most durable value from AI are not necessarily the ones that move fastest — they are the ones that build the discipline to measure what their AI systems are actually doing and make decisions based on that data. Evaluation is not a tax you pay for deploying AI. It is the mechanism by which you stay in control of something that, left unmonitored, can quietly degrade, drift out of scope, or start behaving in ways that contradict your business intent.
Most AI initiatives start with a pilot. The pilot works — impressively, even. It gets signed off, it gets scaled, and then six months later someone notices that the agent is not performing as well as it used to, or that users have stopped trusting it, or that a subtle prompt change degraded accuracy in a way nobody caught.
The question that should be asked at the pilot stage, before budget is committed to scale, is: "What does ongoing evaluation of this system look like, who owns it, and what does it cost?"
That question is almost never asked. Evaluation continues to be treated as a go-live checklist item rather than a permanent operational function. Getting that conversation into the room early is one of the highest-value things a QA-aware manager can do.
Here’s one practical management habit that pays dividends: before any AI agent goes to production, define in writing what "good enough" looks like. And not just at release, but over time. What accuracy threshold would trigger a review? What reliability drop would warrant pulling the system offline? What compliance failure rate is acceptable?
Do these sound familiar? Yes, they are acceptance criteria, something we in QA have always defined and measured. The same rules apply, but because AI is non-deterministic, measuring them only at the end of UAT is not enough. These thresholds feel bureaucratic to write down in advance, but they feel like lifesavers the first time the numbers start moving in the wrong direction.
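For the team that has to implement this, "in writing" can be as lightweight as a small, version-controlled set of rules. The following is a minimal sketch, not a recommended standard: the metric names, the `Threshold` class, the `check` function and every number in it are hypothetical placeholders for whatever your own risk appetite dictates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """A written-down 'good enough' rule for one metric (hypothetical)."""
    metric: str
    review_below: float   # dropping below this triggers a human review
    offline_below: float  # dropping below this warrants pulling the agent offline


# Illustrative numbers only; real thresholds are a business decision.
THRESHOLDS = [
    Threshold("accuracy", review_below=0.92, offline_below=0.85),
    Threshold("compliance_pass_rate", review_below=0.995, offline_below=0.98),
]


def check(measured: dict[str, float]) -> dict[str, str]:
    """Return the action each metric calls for: 'ok', 'review' or 'offline'."""
    actions = {}
    for t in THRESHOLDS:
        value = measured[t.metric]
        if value < t.offline_below:
            actions[t.metric] = "offline"
        elif value < t.review_below:
            actions[t.metric] = "review"
        else:
            actions[t.metric] = "ok"
    return actions


print(check({"accuracy": 0.90, "compliance_pass_rate": 0.999}))
# {'accuracy': 'review', 'compliance_pass_rate': 'ok'}
```

The value is less in the code than in the fact that the numbers were agreed before anyone had a reason to argue about them.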
There is no shortage of things you can measure when evaluating an AI agent. Accuracy, BLEU score, ROUGE score, semantic similarity, latency, hallucination rate, refusal rate, user satisfaction, task completion rate, confidence calibration, token usage, cost per query… the list goes on, and there are entire evaluation frameworks such as RAGAS, LangSmith and Evals by OpenAI that will generate dashboards full of numbers for you.
The problem is not a lack of metrics. The problem is that most teams measure what is easy to quantify rather than what is important to the business. And then leadership gets a report full of numbers that nobody fully understands or knows how to act on.
A useful suggestion for cutting through this is to separate your metrics into two tiers: operational metrics and strategic metrics.
Operational metrics are the ones your developers and QA team need to monitor day-to-day. They are granular, technical, and fast-changing, and they serve as the early warning system for problems. Examples include:
Your team(s) should own these. They should be visible in a live dashboard. Alerting thresholds should be set so that a significant drop in any of these triggers a human review before it becomes a user-facing problem.
Strategic metrics are what management should actually be looking at. These are slower-moving, business-outcome-oriented indicators.
Examples include:
When it comes to metrics, less is often more. Before adding any metric to your management dashboard, run it through the "So What" test: "If this number goes up or down by 10%, what decision would I make differently?" If you cannot answer that question concretely, the metric should not be in the management view. It belongs in the operational layer, or it should not be tracked at all. Get comfortable pruning metrics as the AI project progresses.
Tokens are cheap. Testing without a plan is not.
Running 10,000 automated evaluation checks overnight is operationally cheap. Tokens are relatively inexpensive, at least for now. Having someone review the results, interpret the failures, identify false positives, and decide what to fix is expensive. And if nobody is doing that work, the automated checks are largely theatrical. Passed tests that nobody investigates are not evidence of quality. They are an empty comfort blanket.
This links back to a cost that often goes unacknowledged: the cost of insufficient evaluation. A single high-profile agent failure, such as a decision made on the basis of a hallucinated output, can cost orders of magnitude more than a year of thorough evaluation. Risk-adjusted thinking is essential here. Cheap, poorly designed evaluation that misses a critical failure mode is not actually cheap.
The same principle I have for test automation also applies to agent evaluation: the right number of tests to run is the number of tests whose results you will actually act on.
Start with risk analysis. Not every agent interaction deserves the same evaluation depth. Map your agent's tasks by business risk and weight your evaluation investment accordingly. A low-risk summarization task and a high-risk financial recommendation should not receive the same number of test runs or the same depth of human review.
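One simple way to turn that mapping into a plan is to split a fixed evaluation budget in proportion to a risk score per task. The sketch below assumes a hypothetical 1-to-10 risk scale and invented task names; proportional allocation is only one possible weighting, and your team may prefer something stricter for the highest-risk tasks.

```python
def allocate_runs(risk_scores: dict[str, int], total_runs: int) -> dict[str, int]:
    """Split a fixed number of evaluation runs across tasks by business risk.

    Uses floor division, so a few runs may be left unallocated.
    """
    total_risk = sum(risk_scores.values())
    return {
        task: total_runs * score // total_risk
        for task, score in risk_scores.items()
    }


# Hypothetical tasks and risk scores, for illustration only.
risk_scores = {
    "summarization": 1,
    "customer_reply": 3,
    "financial_recommendation": 6,
}

print(allocate_runs(risk_scores, total_runs=1000))
# {'summarization': 100, 'customer_reply': 300, 'financial_recommendation': 600}
```

The same scores can drive how much human review each task receives, not only how many automated runs it gets.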
Use tiered execution frequency. A common mistake is running the full evaluation suite constantly. A more cost-effective pattern is:
This should be familiar to anyone who has designed a test automation strategy, as essentially the same rules apply here.
Invest in your test library, not your test volume. A test suite of 200 carefully curated, high-signal test cases will tell you more, and cost less to run and review, than a suite of 2,000 cases with significant overlap and low discriminative power. Spend the time to audit and prune your test library regularly. Evaluate and remove tests that consistently pass and teach you nothing. Replace them with tests that probe real failure modes you have observed in production.
Make production data your evaluation engine. Some of the most valuable evaluation signals are generated in production; you just need to instrument them. Sampling, labelling and analyzing a percentage of real production interactions is often more cost-effective, and more representative of the AI agent's performance, than expanding your synthetic test suite.
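Sampling a percentage of production traffic does not need heavy infrastructure. A minimal sketch, assuming each interaction has a stable identifier (the `interaction_id` format and the 5% rate here are made up): hashing the identifier gives a repeatable decision, so the same interaction is always either in or out of the review sample.

```python
import hashlib


def in_review_sample(interaction_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of interactions for labelling."""
    digest = hashlib.sha256(interaction_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


# Hypothetical interaction IDs; in practice these come from your logs.
ids = [f"interaction-{n}" for n in range(10_000)]
sampled = [i for i in ids if in_review_sample(i, sample_rate=0.05)]
print(f"{len(sampled)} of {len(ids)} interactions selected for human review")
```

Whatever the mechanism, the sampled interactions only become an evaluation signal once someone labels them and the labels feed back into the metrics.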
At some point, someone in finance or senior leadership will ask: *"What are we getting for our evaluation spend?"* This is a fair question, and one we have long had to answer for test automation as well. No single metric tells the whole story. Use a simple framing that resonates with non-technical stakeholders: evaluation spend buys insurance and predictability. You can quantify it by estimating the probability and cost of the failure modes your evaluation catches and comparing that to the cost of running the evaluation program. If your evaluation suite has caught three compliance failures in the past quarter, and each of those failures would have cost X in fines or reputational damage, that is your ROI story.
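The arithmetic behind that story is a plain expected-value calculation. The sketch below is an illustration under stated assumptions: every probability and cost figure is invented, and "probability" here means your own estimate of how likely the caught failure was to reach users and incur the cost.

```python
def evaluation_roi(caught_failures: list[tuple[float, float]], program_cost: float) -> float:
    """Return (expected loss avoided - program cost) / program cost.

    Each caught failure is a (probability_of_impact, cost_if_it_happened) pair.
    """
    expected_loss_avoided = sum(p * cost for p, cost in caught_failures)
    return (expected_loss_avoided - program_cost) / program_cost


# Hypothetical quarter: three caught failures, estimates are placeholders.
failures = [
    (0.5, 200_000.0),   # likely fine, would probably have reached customers
    (0.25, 400_000.0),  # less likely, but more damaging
    (1.0, 50_000.0),    # certain rework cost
]

print(evaluation_roi(failures, program_cost=100_000.0))
# 1.5
```

A result of 1.5 would mean the program avoided, in expectation, its own cost plus another one and a half times that amount. The estimates will be rough, and it is worth saying so when presenting them.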
If you take one thing from each section, let it be these:
AI agents are not magic. They are probabilistic systems that require the same disciplined measurement and feedback loops as any other business-critical processes.
If you would like to discuss any of this further, we can provide an AI Agent assessment.