At some point in their journey, every software developer encounters more experienced colleagues whose wisdom isn't always immediately recognized. But software development has been evolving for decades now, and we've learned a lot in the process.
Whether object-oriented programming was a good idea still gets debated regularly; put me in the camp that says it was. Anything that makes a technical system friendlier, more human, and more tangible strikes me as undeniably beneficial.
But the real entry point into this conversation is software testing—the concept of unit tests and integration tests that every developer eventually has to confront.
The Testing Awakening
I've seen it countless times: developers early in their careers are already struggling to write enough code to keep up with business requirements. The idea of adding tests to validate that code feels impossible. Where would they find the time?
Fast forward to year four or five of a developer's career. If they've been introduced to testing and embraced it, it becomes indispensable. You can't keep evolving an application and expect to test it manually every time you release a new feature. You can't build new features without touching parts of the system in ways you didn't anticipate. And you can't refactor code and expect it to behave the way your users expect without validation.
At a certain point in one's career (ideally as early as possible), embracing testing becomes absolutely necessary. You need to experience the confidence that comes from having a verified system—verified features deploying into a verified system.
In short, any software developer who has worked on real systems in production with real users arrives at the conclusion that pragmatic testing is absolutely necessary.
The AI Agent Parallel
So, how does that relate to AI agents?
We start with relatively simple interactions: questions, answers, tool use. Then we begin to push the boundaries of what these systems can do. The early days of agent development are fun, iterative, and somewhat magical because the capabilities of state-of-the-art large language models are pretty impressive at this point in history.
You get in a groove. You keep modifying your system, adding features, and at one point you remember: "Wait, what was the first task this was successful at? I should go back and test to see if that still works."
And here's the problem: with the evolution of your prompts, your tools, a fuller context window, and instructions that overlap with those earlier use cases, the repeatability you originally observed is not quite so repeatable anymore. The system doesn't use the language you expected based on your original implementation. The way your system worked before is not how it works now.
The Inevitable Conclusion
Now we see the parallel with traditional unit testing: we're modifying systems with established behaviors, and we cannot guarantee that we can cognitively contain all edge cases and pathways the system might interpret as we continue our modifications.
In this sense, evaluations are absolutely necessary. Developing evaluations alongside your application, to verify that early use cases keep performing, becomes critical as you modify parts that may not seem related to the original behavior but often interact with it in ways you didn't anticipate.
Quantify behavior as it changes, and write tests (evaluations, in the context of AI agents) in step with your application as it evolves.
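The habit can start small. Here is a minimal sketch of a regression-eval suite in Python; `run_agent` and the cases are hypothetical stand-ins for your real agent entry point and your real early use cases:

```python
# A minimal regression-eval sketch. `run_agent` is a placeholder for the
# real agent call (an assumption, not a specific framework's API); each
# case pins down behavior the system handled correctly earlier on.

def run_agent(prompt: str) -> str:
    # Placeholder: in a real suite this would invoke your agent.
    canned = {
        "What is 2 + 2?": "2 + 2 equals 4.",
        "List the files in /tmp": "Called tool: list_files(path='/tmp')",
    }
    return canned.get(prompt, "")

# Each eval is a prompt plus a predicate over the response. Checking
# substance (does the answer contain the fact or the expected tool call?)
# rather than exact strings keeps minor wording drift from failing runs.
EVAL_CASES = [
    ("What is 2 + 2?", lambda r: "4" in r),
    ("List the files in /tmp", lambda r: "list_files" in r),
]

def run_evals() -> list[str]:
    """Run every case and return the prompts that failed."""
    return [p for p, check in EVAL_CASES if not check(run_agent(p))]

if __name__ == "__main__":
    failed = run_evals()
    print(f"{len(EVAL_CASES) - len(failed)}/{len(EVAL_CASES)} evals passed")
```

Run it on every change, the same way you would a unit-test suite; when a prompt or tool modification breaks an earlier use case, the failing prompt tells you exactly which behavior drifted.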
The Bottom Line
Anyone who tells you that evaluations aren't essential has very little experience in software development, or more specifically AI agent development.
The same naivety that leads junior developers to skip writing tests is exactly what's happening in the AI agent space today. Some influencer told you that you didn't need evals? They're selling you the same false confidence that leads to production disasters.
The question isn't whether you need evaluations. The question is whether you want to learn this lesson the easy way or the hard way.
Travis Dent is CEO of Agent CI, bringing systematic software development practices to AI agent development.