Blog

When the agent gets it wrong: AI evaluation as a governance imperative

Written by Mikko Ruuhamo | 2.6.2026

AI is now the biggest trend in town. There's a lot of hype, and lots of individuals use it for generating text, images and code. But companies have been more wary on its adoption, especially on the regulated industries side, such as finance, healthcare insurance and public sector. And understandably so. In these industries companies are more interested in; "what happens when AI agent gives wrong answer, and how many did it affect?", rather than "how good is this agent for our core business?".

This is a governance question. And QA needs to be a part of it.

The compliance failure

AI agents are being deployed today in customer-facing roles that carry real legal and regulatory weight. A chatbot that misquotes a loan term, a medical assistant that omits a drug interaction warning, a legal research tool that cites a case that does not exist, these are not just edge cases. They are foreseeable failure modes that your evaluation process needs to account for before the agent touches a real user.

Regulatory frameworks like the EU AI Act, GDPR, and sector-specific guidance from bodies like the European Medicines Agency(EMA) are increasingly explicit: if you deploy an AI system that makes decisions or recommendations affecting users, you are responsible for demonstrating that it behaves as intended. Simply saying: "We ran some tests before launch" is not a defensible audit trail.

A strong governance and compliance for AI is required.

What a compliance-oriented evaluation looks like

First, a risk classification for every task the agent performs is needed. Not all agent outputs carry equal weight or are high-risk from legals perspective. An agent summarizing internal meeting notes is low risk. An agent answering questions about medication dosages is high-risk. From QA's perspective, the evaluation coverage and regression frequency should be proportional to that risk level.

Second, you need adversarial red-teaming as a structured activity, not an afterthought. Red-teaming in this context means deliberately attempting to elicit non-compliant behavior, i.e. prompting the agent to bypass its own guidelines to provide advice it is not authorized to give, or to disclose information it should not give out. This should definitely be done after any significant model or agent prompt change, to evaluate whether current tests are sufficient or need modifications or additions.

Third, you need audit-ready logging and traceability. Every agent interaction that falls into a high-risk category should be logged in a way that allows you to reconstruct the context, the input, the output, and the evaluation result. If a regulator asks "did your agent ever give a user incorrect financial advice in the past six months?", you need to be able to answer that question and back it up with evidence.

Why we still need Humans in the AI loop

One pattern I advocate strongly for in high-risk cases is building structured human review into the evaluation workflow, not just as a fallback, but as an active data-generation step. When a human reviewer flags an agent response as non-compliant, that report should feed directly back into your test library and your model fine-tuning pipeline. Human review is expensive, but it is the only ground truth you have for edge cases that automated scoring systems are not suitable for or will miss. Otherwise, you can possibly run into a problem where the model slowly drifts away from compliance, and your automatic AI scoring doesn't notice the subtle boiling-frog effect.

And in any case, AI can't be held legally responsible for mistakes. At least not yet. So, it's up to you to take ownership that AI agent behaves according to given rules.

The takeaway

If your organization is deploying AI agents in any context where accuracy and compliance carry legal or reputational consequences, your evaluation program is not an optional hobby project or a nice-to-have metric. It is a governance obligation. Diligently test & document everything and treat red-teaming as a standard part of your release cycle, not a one-time exercise.

One more shout-out to excessive logging and documenting. Eventually mistakes do happen, and your Agent answers something it shouldn't. It’s inevitable. Having excessive logging allows you to:

1. Detect the issue early

2. Understand the issue fast

3. Start forming a fix promptly.

I've always been on the opinion that when the brown stuff hits the fan, how you respond and mitigate the issue is equal or more important than preventing it in the first place. Mistakes happen, and with AI coding, it's bound to happen more often. How you learn from it and stop it from happening again, that is most important metric.

In the next article, I'll cover test management and visibility for AI agents.