Everybody Tests

Everybody tests. Every developer. Every team. The debate isn't about whether you test. It's about whether you automate it.

I hear the pushback all the time. "We don't do TDD." "We don't have time for tests." "Tests slow us down." Fine. But watch what happens when that same developer finishes a feature. They open the browser. They click around. They fill out a form. They check the button. They verify the data.

That's a test.

When QA maintains a spreadsheet of scenarios to run before each release... that's a test suite. When you demo to stakeholders and walk through the happy path... that's acceptance testing. When you "just quickly check" that your refactor didn't break anything... that's regression testing.

The tests exist. They always have. They're just manual. Trapped in someone's head.

You already think test-first

Most developers already think test-first. They just don't write it down.

Before you write a function, what do you do? You think about what it should do. You imagine calling it with some input. You picture what comes out. You might even sketch a few edge cases in your head, on a post-it, on a whiteboard. What if the list is empty? What if the user isn't authenticated?

That's the test. You've written it. It exists. It's just in your brain instead of in code.

Then you implement. Then you verify it does what you imagined.

Congratulations, you already did the hardest part of test-first development.

The cost nobody talks about

Manual testing isn't free. It's a subscription.

Every time you click through your app to verify a change, you're paying. Every time someone else on your team does the same verification, that's another payment. Before every release. After every refactor. When something breaks in production. When a new hire needs to understand what the system does.

That 30-second click-through? You'll do it a hundred times. So will everyone else. Multiply by every feature, every edge case, every browser.

The math is brutal. An automated test takes minutes to write and milliseconds to run. A manual test takes seconds to run, but you'll run it thousands of times.

The teams that say they don't have time

I've seen teams who are adamant they "don't do TDD" spend entire days before a release running through test scripts. They hire QA teams whose job, their actual full-time job, is clicking the same buttons over and over.

And then they tell me they don't have time to write tests.

What they mean is this: they don't have time to write the test once. But they have infinite time to run it manually forever.

The same satisfying satisfying, all over again

We're about to make the exact same mistake with AI.

Watch what happens when someone builds an agent. They tweak a prompt. They run it. They look at the output. They squint. They decide if it's good enough. They tweak again. They run it again. They squint harder.

That's an eval. That's literally an eval. You're evaluating the output of your system against some criteria in your head.

Everybody evals. The question is whether you automate it.

Eval Driven Development

The parallel to TDD is exact. Before you write an agent, you think about what it should do. You imagine the inputs. You picture what a good output looks like. You might even think through some edge cases, what if the context is empty? What if the user asks something adversarial?

That's the eval. You've already written it. It's in your brain. Write it down.

Define what good looks like before you start prompting. Capture the examples. Capture the edge cases. Automate the judgment, whether that's exact match, semantic similarity, or LLM-as-judge.

Then iterate. Change the prompt, run the evals, see what breaks. Change the model, run the evals, see what improves. Swap in a new retriever, run the evals, know immediately if you're moving forward or backward.

This is Eval Driven Development. It's just TDD for the age of AI.

Vibes don't ship

With traditional software, at least the output is deterministic. You click the button, you get the same result.

With agents? The output is stochastic. You run the same prompt twice, you get different results. Manual review is useful for exploration. It’s useless as a measurement system. You can't trust your own squinting because you're sampling a distribution. That one good output you saw could happen 90% of the time. Or maybe 20% of the time. You have no idea.

This is why the teams building AI systems have eval suites, they have regression sets, they run benchmarks on every commit. They know their success rates because they measure them.

The teams that don't? They're shipping on vibes. And I've seen where that ends. It ends with an agent in production that works great in the demo and fails catastrophically on real traffic. It ends with prompt changes that "seemed fine" but tanked accuracy by 15%. It ends with executives asking why the AI feature is so unreliable, and engineers shrugging because they never actually knew how reliable it was in the first place.

The choice is the same as it always was

Twenty years ago, the industry had to learn that manual QA doesn't scale. That you can't hire enough people to click through every scenario before every release. That automated tests aren't a luxury.

Some teams learned that lesson. Some are still running through spreadsheets the night before launch.

Now we get to make the same choice again. You can build agents by squinting at outputs and hoping for the best. Or you can write down what good looks like, automate the evaluation, and actually know whether your system works.

Everybody tests. Everybody evals. The only question is whether you're going to keep doing it by hand, or whether you're finally going to let the computer do its job.

‍