AI Agents Need a Different Kind of Testing: skUnit

One of the biggest mindset shifts when building AI applications is realizing that traditional unit testing isn't enough.

When we test regular software, we usually verify outputs.

Assert.Equal(4, calculator.Add(2, 2));

The function is deterministic. The output is always the same.

AI agents are different.

The same prompt can produce multiple valid responses, and that's completely fine.

So how do we know whether an AI agent is behaving correctly?

Meet Moody Chef

While building skUnit, I created a tiny demo called Moody Chef.

The full source code is in the skUnit repository (mehrandvd/skunit, main branch) under demos/Demo.MoodyChef.

It's a simple AI chef that recommends food based on the user's mood.

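// Business rule: each mood maps to a fixed menu.
return mood switch
{
    UserMood.Happy => "Pizza, Pasta, Salad",
    UserMood.Sad => "Ice Cream, Chocolate, Cake",
    UserMood.Angry => "Nothing, you're on a diet",
    // Fail with a clear error if a mood has no menu mapped to it.
    _ => throw new ArgumentOutOfRangeException(nameof(mood), mood, "No menu for this mood."),
};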

Now imagine a user writes:

Fuck you bastard! What food do you have?

As humans, we immediately understand that the user is angry.

The agent shouldn't recommend pizza just because the user asked for food.

The Wrong Test

A traditional test might look like this:

Assert.Equal("Nothing, you're on a diet", response.Text);

That feels reasonable until the model replies:

Sorry, I can't recommend anything today.

Or:

No food for you today.

Both responses are perfectly acceptable.

Your test still fails.

The problem isn't the AI.

The problem is the test.

Testing the Behavior

Instead of checking exact text, I wanted to verify the behavior.

With skUnit, the test becomes a conversation written in Markdown.

# [USER]
Fuck you bastard! What food do you have?

## ASSERT SemanticCondition
It doesn't suggest any food from the menu.

Notice that we're no longer telling the model what it should say.

We're describing what it should do.

Whether the response is:

"You're on a diet."
"No food today."
"Sorry, I can't recommend anything."

doesn't matter.

What matters is that the response doesn't suggest any food from the menu, such as Pizza, Pasta or Salad.
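To make the difference between "asserting a string" and "asserting a behavior" concrete, here is a minimal sketch in plain C#. It is not how skUnit evaluates a SemanticCondition (that relies on a model judging the response); the `MenuBehavior` class and its keyword matching are hypothetical stand-ins, used only so the example is deterministic and runs without any AI service.

```csharp
using System;
using System.Linq;

// Hypothetical helper, NOT part of skUnit: a keyword-based stand-in
// for the semantic check that a model would normally perform.
public static class MenuBehavior
{
    // Items from the Happy and Sad menus shown earlier in the post.
    private static readonly string[] MenuItems =
    {
        "Pizza", "Pasta", "Salad", "Ice Cream", "Chocolate", "Cake",
    };

    // True when the response mentions none of the menu items
    // (case-insensitive substring match).
    public static bool SuggestsNoMenuFood(string response) =>
        !MenuItems.Any(item =>
            response.Contains(item, StringComparison.OrdinalIgnoreCase));
}
```

All three acceptable replies listed above pass `MenuBehavior.SuggestsNoMenuFood`, even though none of them equals "Nothing, you're on a diet". A substring check like this is still brittle (it would flag "pancake" because it contains "cake"), which is why a semantic assertion is a better fit than keyword rules.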

A Better Agent Design

The Moody Chef sample also demonstrates another lesson.

My first implementation put everything inside the prompt.

The model had to:

detect the user's mood
decide which menu to use
generate the response

It worked, but it wasn't very reliable.

The better implementation moved the business rules into code.

The model only determines the mood.

User Message
      │
      ▼
Detect Mood
      │
      ▼
GetFoodMenu(UserMood)

The application owns the business rules.

The LLM only interprets language.

In my experience, this architecture is much easier to test and much more predictable.
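The split described above could be sketched roughly as follows. This is an illustration under assumptions, not the actual Moody Chef code: `IMoodDetector`, `MoodyChef`, and `FixedMoodDetector` are hypothetical names, the model call is hidden behind an interface, and everything is synchronous for brevity. Only `UserMood`, `GetFoodMenu`, and the menu strings come from the post.

```csharp
using System;

public enum UserMood { Happy, Sad, Angry }

// Hypothetical abstraction over the LLM call: its only job is to
// classify the mood of a user message.
public interface IMoodDetector
{
    UserMood Detect(string userMessage);
}

public sealed class MoodyChef
{
    private readonly IMoodDetector _detector;

    public MoodyChef(IMoodDetector detector) => _detector = detector;

    // Business rule owned by the application, not by the prompt.
    public static string GetFoodMenu(UserMood mood) => mood switch
    {
        UserMood.Happy => "Pizza, Pasta, Salad",
        UserMood.Sad => "Ice Cream, Chocolate, Cake",
        UserMood.Angry => "Nothing, you're on a diet",
        _ => throw new ArgumentOutOfRangeException(nameof(mood)),
    };

    // The model interprets the language; the code picks the menu.
    public string Recommend(string userMessage) =>
        GetFoodMenu(_detector.Detect(userMessage));
}

// Test double that always returns a fixed mood, so the menu rule can
// be unit tested without calling a model.
public sealed class FixedMoodDetector : IMoodDetector
{
    private readonly UserMood _mood;

    public FixedMoodDetector(UserMood mood) => _mood = mood;

    public UserMood Detect(string userMessage) => _mood;
}
```

With this shape, the menu rules can be covered by ordinary deterministic unit tests (for example, `new MoodyChef(new FixedMoodDetector(UserMood.Angry))` always yields the diet message), and the behavior-style tests only need to cover the part the model actually owns: reading the mood from the message.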

Why skUnit Exists

skUnit isn't trying to replace xUnit or NUnit.

It's designed to test something those frameworks were never built for.

Instead of asserting strings, it asserts behaviors.

Instead of testing methods, it tests conversations.

As AI applications become part of more production systems, I think this style of testing will become just as important as unit testing is today.

If you'd like to see the complete Moody Chef sample or try skUnit yourself, you can find everything on GitHub:

https://github.com/mehrandvd/skunit

I'm curious to see how others approach AI testing. I have a feeling we're only at the beginning of defining what "good tests" look like for AI systems.

About the author

Mehran Davoudi
