Evaluating Interactive AI Systems

Proof, not vibes

June 9, 2026

Every AI tool looks impressive in a demo. The challenge isn't impressing someone once; it's making the right call the ten-thousandth time, when no one is watching.


At Talio, we have invested in evaluation from day one. Enough people have asked about it in recent months that we're sharing how we approach it.

The dangerous failure is the quiet one

When normal software breaks, it breaks loudly. It crashes, it throws an error, a health check goes red. You notice, and you fix it.


An AI that's confidently wrong does something worse. It answers fluently, plausibly, in a way that's indistinguishable from a correct answer. No crash. No alert. Just a quietly worse outcome that nobody catches.


In recruitment, that quiet wrong answer has a face. It's a candidate shown the wrong role who never sees the right one. It's an agency handed a slightly off shortlist that never knows it could have been better. Across thousands of conversations a month, a small rate of silent misses isn't a rounding error. It's real impact in lost placements and lost careers, and often invisible on any dashboard.


That's the failure we refuse to ship. And it's the one traditional testing can't catch.

Why normal tests don't reach it

We run the usual things. Unit tests check that individual components behave. Integration tests check that the pieces connect. Both run on every change, and both matter.


But neither answers the question that actually decides whether the product is good:

In this conversation, did the system make the judgment a good recruiter would have made?

That judgment depends on the whole situation at once: what the candidate said, what the system knew in that moment, what it decided to do about it. No unit test sees all of that together. And because the failures look like reasonable answers rather than errors, nothing in standard monitoring flags them either. A "200 OK" tells you the request succeeded. It tells you nothing about whether the answer was right.


So we needed a different kind of test, with a different shape.

What we built: evals — judgment, written down

In AI, this kind of test has a name: an eval. A unit test checks that a function returns the right value. An eval checks that the system makes the right judgment: given a real situation, does it do what a good recruiter would?


We capture each judgment our system has to get right as an eval scenario. Each one fixes a situation and the answer a good recruiter would give: this is what the candidate said, this is what should happen next.
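As a minimal sketch, a scenario like this could be written down as a small record that pairs the situation with the expected judgment. The field names, the action strings, and the example scenario are all hypothetical, chosen for illustration rather than taken from Talio's actual schema.

```python
from dataclasses import dataclass


@dataclass
class EvalScenario:
    """One judgment, written down. Field names are illustrative assumptions."""
    name: str
    candidate_said: list[str]      # what the candidate said, turn by turn
    known_context: dict[str, str]  # what the system knew at that moment
    expected_action: str           # what a good recruiter would do next


def passes(scenario: EvalScenario, actual_action: str) -> bool:
    """Exact check: the system's action either matches the expected one or not."""
    return actual_action == scenario.expected_action


# Hypothetical scenario, invented for illustration only.
scenario = EvalScenario(
    name="night-shift-only",
    candidate_said=["I can only work nights."],
    known_context={"open_roles": "day-shift nurse; night-shift nurse"},
    expected_action="suggest:night-shift nurse",
)

print(passes(scenario, "suggest:night-shift nurse"))  # True
print(passes(scenario, "suggest:day-shift nurse"))    # False
```

The point of the shape is that the situation and the right answer live side by side, so the record can be read by a recruiter as easily as it can be run by a machine.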


Then we don't just check the answer on paper — we put the system through the actual conversation.


Every eval is run by one of our eval agents: AI agents that play the candidate and genuinely call and text our Talio agent, the way a real person would. They hold the conversation end to end, on the same voice and chat surfaces a candidate uses, and we check that the judgment at the end is the right one.


A transcript with the audio stripped out isn't the product — the live interaction is.


So that's what we test.


Where the right answer is unambiguous, the check is exact: it matches or it doesn't. Only for the genuinely qualitative calls — the ones a human would also have to weigh — do we let AI judge AI. We take the certainty of an exact check every time we can get it.
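One way to picture that preference for exact checks is a grader that only falls back to a judge when no exact answer exists. This is a sketch under assumptions: the `grade` function, the `Judge` interface, and the stub judge are hypothetical names, and a real judge would call a model rather than return a fixed value.

```python
from typing import Callable, Optional

# Assumed interface: a judge takes (transcript, rubric) and returns pass/fail.
Judge = Callable[[str, str], bool]


def grade(actual: str, transcript: str, expected: Optional[str],
          rubric: Optional[str], judge: Judge) -> tuple[bool, str]:
    """Return (passed, method). Prefer the exact check whenever one exists."""
    if expected is not None:
        return actual == expected, "exact"
    if rubric is None:
        raise ValueError("a scenario needs an expected answer or a rubric")
    return judge(transcript, rubric), "judge"


def stub_judge(transcript: str, rubric: str) -> bool:
    """Stand-in for a model-based judge; always passes, for illustration only."""
    return True


# Unambiguous case: the exact check decides and the judge is never consulted.
print(grade("book_interview", "...", "book_interview", None, stub_judge))
# Qualitative case: no exact answer exists, so the judge weighs a rubric.
print(grade("reply", "...", None, "Acknowledges the concern", stub_judge))
```

Recording which method produced each verdict keeps the two kinds of result separable, so a judged pass is never mistaken for an exact one.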

Compounding confidence

The core value here is not just the test suite. It's what the suite becomes over time: a written record, with an eval for every judgment this product needs to get right.


Models change. Data changes. The kinds of conversations we handle keep growing. But the record should hold. Every eval we've ever decided the system must pass is re-run on every release. So when something starts to drift, we find out before a candidate or a client does.
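In miniature, re-running the full record on every release could look like the sketch below. Everything here is an assumption for illustration: the suite contents are invented, and the two "releases" are plain functions standing in for a system that would really be exercised through a live conversation.

```python
from typing import Callable

# Each recorded eval maps a name to (situation, expected action).
Suite = dict[str, tuple[str, str]]


def failing_evals(suite: Suite, system: Callable[[str], str]) -> list[str]:
    """Re-run every recorded eval against a release; return the names that fail."""
    return sorted(
        name for name, (situation, expected) in suite.items()
        if system(situation) != expected
    )


# Hypothetical record of judgments the system must keep getting right.
suite: Suite = {
    "night-shift-only": ("I can only work nights.", "suggest:night-shift role"),
    "not-looking": ("I'm not looking right now.", "close:politely"),
}


def release_a(situation: str) -> str:
    """Stand-in for an earlier release that passes the whole record."""
    return suite_answers[situation]


def release_b(situation: str) -> str:
    """Stand-in for a later release that has drifted on one judgment."""
    if "nights" in situation:
        return "suggest:day-shift role"
    return suite_answers[situation]


suite_answers = {situation: expected for situation, expected in suite.values()}

print(failing_evals(suite, release_a))  # []
print(failing_evals(suite, release_b))  # ['night-shift-only']
```

A non-empty result is the signal to stop the release: the drift is caught by name, before it reaches a candidate or a client.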

Global and agency-specific evals

Every agency has its own way of working — rules, judgment calls, and standards specific to them and their sector. Much of it is hard-won IP.


That creates two needs at once. An agency wants confidence that its own preferences are followed, rather than a generic playbook. And it needs to know that its way of working stays its own and never leaks to a competitor that also uses Talio.


So we split evals into two layers:

  • Global evals — the baseline every agency benefits from: the judgment quality and conversation standards we hold the whole product to.
  • Agency evals — the calls specific to one agency: their rules, their preferences, the way they want candidates handled. These encode an agency's own way of working.

Global evals are shared by design. Agency evals are walled off by design. An agency's scenarios, data, and preferences are its own — we never use one customer's way of working to benefit another.
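The two layers can be sketched as a simple ownership rule: an eval with no owner is global, and an eval with an owner is visible only to that agency. The `Eval` record, the `evals_for` function, and the agency ids below are hypothetical, and a real system would enforce this isolation at the storage and access-control level rather than in a list filter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Eval:
    name: str
    owner: Optional[str] = None  # None means global; otherwise the owning agency's id


def evals_for(agency_id: str, all_evals: list[Eval]) -> list[Eval]:
    """Return the global evals plus only the evals owned by this agency."""
    return [e for e in all_evals if e.owner is None or e.owner == agency_id]


# Hypothetical evals, invented for illustration only.
all_evals = [
    Eval("respects-candidate-availability"),           # global
    Eval("agency-a-screening-rule", owner="agency_a"),
    Eval("agency-b-shortlist-rule", owner="agency_b"),
]

print([e.name for e in evals_for("agency_a", all_evals)])
# ['respects-candidate-availability', 'agency-a-screening-rule']
```

Agency B's eval never appears in agency A's run, which mirrors the rule that one customer's way of working is never used to benefit another.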

The bar

It comes down to something simple: every judgment the system needs to get right is written down, and we know whether it still holds on every release.


That's the bar we think AI in recruitment should be held to, and the one we hold ourselves to.

Want to see how Talio holds the bar?

Reach out at jarne@usetalio.com