Stop guessing, start testing: A testing framework for AI Agents

This is the first post in a three-part blog series in which I look at AI from a QA (Quality Assurance) point of view. This first post is about how to test AI: what to measure, and how.

As someone who has spent years making sure software does exactly what it promises, stepping into the world of AI agent evaluation felt both familiar and somewhat uncomfortable. The basic principles for testers are the same: create regression test suites, write tests that expect output X, measure system changes over time. What is different this time is that an AI agent doesn't fail the way a traditional test fails. With AI agents built on LLMs (Large Language Models), the answer is a bit different every time. The agent drifts and hallucinates in unexpected ways, which makes it difficult to pin down why it answered the way it did.

This raises a dilemma. Previously, it was much easier to decide whether a test passed or failed (provided the specifications were ready, but that's another discussion). Testing AI is far less binary. Instead of yes or no, the result is often somewhere between “maybe”, “somewhat” and “it depends”.

So how do you test something that is, by design, non-deterministic and therefore always behaves a bit differently?

The core problem: You can't assert an exact output

In traditional QA, a test result is binary. You call a function or test a feature, you get a value, you compare it to an expected value, done. AI agents don't work like this. The same prompt or input, given twice to the same model, may produce two different responses, both arguably correct or both subtly wrong in different ways.

This means that we need to change our mental model in QA from output matching to output scoring. You are no longer asking: “is this correct?” You are asking instead: “how correct is this, and why?”

I think this kind of mindset is crucial for the foreseeable future.

Building an evaluation setup for AI agents

A solid evaluation setup for an AI agent should cover at least the following three testing layers:

Accuracy is your first layer. Does the agent produce the right answer, or at least a semantically equivalent one? For factual tests, this can be measured by comparing the agent's response against a ground-truth dataset using embedding similarity scores or an LLM-as-judge pattern. A ground-truth dataset is a set of test inputs paired with verified correct answers. In the LLM-as-judge pattern, a second AI model reads the verified answer and uses it to rate the quality of the first model's response. The LLM-as-judge approach is surprisingly robust when you give the evaluator model a well-structured instruction rather than asking it to decide freely without guardrails.
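As a rough illustration of the two accuracy approaches, here is a minimal Python sketch. Everything in it is an assumption made for the example: toy_embed is a bag-of-words stand-in for a real embedding model, and JUDGE_TEMPLATE is one possible structured instruction for a judge model, not a recommended standard. The point is the shape of the harness: the test returns a score, not a pass or fail.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def accuracy_score(response: str, ground_truth: str) -> float:
    """Score a response against the verified answer, from 0.0 to 1.0."""
    return cosine_similarity(toy_embed(response), toy_embed(ground_truth))

# Hypothetical structured instruction for the judge model.
JUDGE_TEMPLATE = (
    "You are grading an AI agent's answer.\n"
    "Question: {question}\n"
    "Reference answer (verified): {reference}\n"
    "Agent answer: {answer}\n"
    "Rate factual agreement with the reference from 1 (contradicts) "
    "to 5 (equivalent). Reply with the number only."
)

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Fill the rubric; sending it to a judge model is left to the caller."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, answer=answer
    )
```

Note that the toy similarity only rewards word overlap, so a correct paraphrase with different wording would score low. That is exactly the gap a real embedding model or a judge model is meant to close.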

Reliability is your second layer. Does the agent produce consistently good answers, not just occasionally great ones? Run the same prompt 20 to 50 times and measure the variance in the resulting quality scores. If the standard deviation is high, your AI agent is brittle and inconsistent. Reliability testing should also stress the edges: long inputs, ambiguous phrasing, adversarial prompts, content switches and deliberate typos. A reliable AI agent degrades gracefully; a brittle one fails unexpectedly.
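The repeated-run measurement can be sketched in a few lines of Python. The agent and scorer arguments are hypothetical callables you would supply yourself (the agent under test and a scoring function), and the max_stdev threshold of 0.15 is an arbitrary assumption that you would tune for your own scoring scale.

```python
import statistics
from typing import Callable, Dict

def reliability_report(
    agent: Callable[[str], str],
    scorer: Callable[[str], float],
    prompt: str,
    runs: int = 20,
    max_stdev: float = 0.15,  # assumed threshold, tune for your scale
) -> Dict[str, object]:
    """Run the same prompt several times and summarize score spread."""
    if runs < 2:
        raise ValueError("need at least 2 runs to compute a standard deviation")
    scores = [scorer(agent(prompt)) for _ in range(runs)]
    stdev = statistics.stdev(scores)
    return {
        "mean": statistics.mean(scores),
        "stdev": stdev,
        "worst": min(scores),
        "brittle": stdev > max_stdev,
    }
```

Reporting the worst score next to the mean is a deliberate choice: an agent that averages well but occasionally produces a very bad answer is the one users will remember.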

Compliance is your third layer. This is about ensuring the agent stays within defined behavioral guardrails. Does it refuse to answer out-of-scope questions? Does it avoid generating harmful content? Does it correctly adhere to given limitations? Compliance testing requires a library of “should fail gracefully” prompts: inputs the agent should decline, redirect, or rescope, together with assertions that it actually does so.
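A “should fail gracefully” library can start out very small. The sketch below is an assumption-heavy illustration: ComplianceCase, REFUSAL_MARKERS and the keyword check in declined are all hypothetical, and a keyword match is a crude placeholder for a proper check, such as a judge model deciding whether the agent declined.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical phrases that signal a refusal; a real suite would more
# likely use a judge model than keyword matching.
REFUSAL_MARKERS = (
    "i can't help",
    "i cannot help",
    "outside my scope",
    "i'm not able to",
)

@dataclass
class ComplianceCase:
    prompt: str  # input the agent should decline, redirect, or rescope
    reason: str  # why it is out of bounds, for the test report

def declined(response: str) -> bool:
    """Crude check: does the response contain a refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def run_compliance_suite(
    agent: Callable[[str], str], cases: List[ComplianceCase]
) -> List[ComplianceCase]:
    """Return the cases where the agent answered instead of declining."""
    return [case for case in cases if not declined(agent(case.prompt))]
```

An empty list from run_compliance_suite means every guardrail held. Anything in the list is a prompt the agent should have declined but did not.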

Evaluating and regression testing AI, where the only constant is change

One of the most common mistakes I see in AI agent deployments is treating evaluation as a one-time gate or activity before launch. This won't work in the long run. The AI model changes. The prompts change. The data the agent retrieves changes. Every single one of those changes can silently degrade performance without triggering a single alert. It's the “boiling frog” issue, where the situation gradually changes for the worse.

Regression testing for AI agents should be automated, run on every deployment, and tracked over time as a metric, not as a pass/fail gate. You want to see a trend line over time, not just the snapshot for this release.
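To make the trend-line idea concrete, here is a minimal Python sketch that compares the latest deployment's score with a rolling baseline of earlier deployments. The function names, the window of 5 and the tolerance of 0.05 are all assumptions chosen for the example. The flag is meant to prompt a closer look at the trend, not to act as a release gate.

```python
import statistics
from typing import List, Optional

def drift_from_baseline(history: List[float], window: int = 5) -> Optional[float]:
    """Latest score minus the mean of up to `window` preceding scores.

    `history` holds one aggregate quality score per deployment, oldest first.
    Returns None when there is no earlier score to compare against.
    """
    if len(history) < 2:
        return None
    baseline = history[-(window + 1):-1]
    return history[-1] - statistics.mean(baseline)

def needs_review(
    history: List[float], window: int = 5, tolerance: float = 0.05
) -> bool:
    """Flag a downward drift larger than `tolerance` (assumed value)."""
    drift = drift_from_baseline(history, window)
    return drift is not None and drift < -tolerance
```

Comparing against a rolling baseline instead of only the previous release is what catches the slow slide: a score that drops by 0.02 per deployment never looks alarming release to release, but it stands out against the average of the last five.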

Equally important is to keep reviewing and verifying your regression set. As said before, the models, prompts and data change all the time. This means that your regression set must also change and adapt to the changing environment.

AI agent testing takeaways

Evaluating an AI agent demands a new mindset that embraces probabilistic testing. This doesn't mean you need to forget all the tried-and-true QA methods; they do still apply. But they alone are not enough anymore in LLM testing. You need to build on top of them. Build a scoring test harness. Automate your regression suite. Track deviations over time. Treat reliability variance as a top-priority bug.

The models will keep improving. Your job in QA is to make sure you can measure that improvement honestly.

In the next post, I'll cover the risks, compliance and GDPR questions in AI testing.