At Focused, we’re big believers in building AI systems that actually work in production. That starts with understanding how to test them. In this conversation, Lead Software Engineer Sarah Kainec joins Mike Steichen to unpack what AI evaluations are, how tools like LangSmith and LangGraph support eval-driven development, and why you can’t just trust the vibes when you’re shipping with LLMs.
Michael Steichen: Hey everyone, I’m Mike Steichen, senior software engineer and engineering manager at Focused. Today I’m talking with my colleague Sarah Kainec, a lead software engineer here who’s been working deeply in AI and evaluations. She’s been at Focused for about three years and has led some of our most interesting agentic development projects. Sarah, thanks for joining me.
Sarah Kainec: Glad to be here!
What Are “Evals” Anyway?
Michael: So let’s dive right in. For people who are newer to this space, what’s a quick and dirty definition of AI “evaluations,” or “evals”?
Sarah: I like to compare them to test-driven development. In traditional software, you write tests to confirm your code works as expected. Evals serve a similar purpose, but with AI, it’s trickier because LLMs are non-deterministic. You might give the same input twice and get different, but still valid, responses.
Michael: Like a human.
Sarah: Exactly. If you say "hi" to ChatGPT, it might say "hi," "hello," or "how are you?" All valid greetings, just different. Evals help you define and check that variability. Sometimes you do want an exact match, like when categorizing topics or extracting data into a structured format. So you write evals to ensure that, despite non-determinism, you're still getting the outputs you need.
Michael: So we’re still writing tests, just with more flexibility?
Sarah: Yep. And sometimes you even use another LLM to evaluate the first model’s output. For example, you might ask, “Was this response a greeting?” That’s a super simple eval, but it’s surprisingly effective.
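To make those two styles concrete, here’s a minimal sketch of an exact-match eval next to an LLM-as-judge eval like the greeting check Sarah describes. The prompts, model name, and helper names are illustrative assumptions, not code from a Focused project.

```python
# Minimal eval sketch: exact match vs. LLM-as-judge. Illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def exact_match_eval(model_output: str, expected: str) -> bool:
    """Strict check, e.g. for category labels or extracted fields."""
    return model_output.strip().lower() == expected.strip().lower()


def is_greeting_eval(model_output: str) -> bool:
    """LLM-as-judge: ask a second model a yes/no question about the output."""
    judge = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model works
        messages=[{
            "role": "user",
            "content": f'Answer only "yes" or "no": is this text a greeting?\n\n"{model_output}"',
        }],
    )
    return judge.choices[0].message.content.strip().lower().startswith("yes")


# "hi", "hello", and "how are you?" all pass the judge eval,
# while the exact-match eval stays strict for structured outputs.
assert exact_match_eval("Residential", "residential")
print(is_greeting_eval("Hey there! How are you today?"))
```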
The Role of LangSmith (and the Kitchen Metaphor)
Michael: I know you’ve been using a tool called LangSmith to build and run these evals. What is LangSmith?
Sarah: LangSmith is an evaluation and observability platform built by the creators of LangChain. It lets you write and manage tests for AI models, and it gives you insight into what’s happening in your agentic pipelines. You can trace requests, inspect model behavior, debug errors, and monitor performance.
Michael: Okay, let’s get silly. If LangSmith were a kitchen tool, what would it be?
Sarah: I’d say it’s a combination of a thermometer and a taste tester. You want to know if the cake is cooked all the way through, and also whether it actually tastes good. With AI, that means verifying the structure and logic of the response and also checking the tone or user experience.
Michael: I love that. The output might be well-formed, but if it sounds like a robot, it’s still not right.
Sarah: Exactly.
LangSmith vs the Ecosystem
Michael: Is LangSmith the only way to do evals?
Sarah: No, not at all. LangSmith has a great UI and strong integration with LangChain and LangGraph, but there are other tools in the ecosystem too. Ragas and DeepEval are more code-based. LangFuse is gaining traction and feels a bit like LangSmith. There’s definitely a race underway to see which toolset becomes the standard for agentic AI development.
Michael: But day to day, you’re using LangSmith pretty heavily?
Sarah: Yes. The integration with LangGraph is a big win for us. LangGraph helps us build stateful agent workflows. LangSmith’s tracing and evaluation features make it easier to test, debug, and iterate.
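For readers who haven’t seen it, here’s a rough sketch of what an evaluation run can look like with the LangSmith Python SDK: a traced target function, a named dataset, and a custom evaluator. The dataset name and the classification logic are placeholders, and exact signatures may differ across SDK versions.

```python
# Hedged sketch of a LangSmith evaluation run; details vary by SDK version.
from langsmith import traceable
from langsmith.evaluation import evaluate


@traceable  # records a trace for each call so runs show up in LangSmith
def classify_property(inputs: dict) -> dict:
    # Placeholder target: a real project would call an LLM chain here.
    text = inputs["description"].lower()
    label = "commercial" if "office" in text else "residential"
    return {"category": label}


def exact_match(run, example) -> dict:
    # Custom evaluator: compare the run's output to the dataset's reference label.
    predicted = run.outputs["category"]
    expected = example.outputs["category"]
    return {"key": "exact_match", "score": int(predicted == expected)}


# "property-classification" is a hypothetical dataset created in LangSmith beforehand.
evaluate(
    classify_property,
    data="property-classification",
    evaluators=[exact_match],
)
```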
Why Evals Actually Matter
Michael: So let’s get into the “why.” LLMs are powerful. You can ask them almost anything. Why can’t we just trust our instincts or run a few spot checks?
Sarah: You can go off vibes if it’s a personal side project. But if you’re integrating AI into a real product, especially as a step in a larger pipeline, you have to validate what it’s doing. A single malformed output can break downstream logic or, worse, create bad user experiences.
Michael: Give me a concrete example.
Sarah: Sure. We work on a project that extracts structured real estate data from government PDFs. If one field is wrong or the structure doesn’t match, it can throw off the whole system. And sometimes the model is fine 99 times out of 100, but that 1% failure is the one that breaks everything.
Michael: Or causes actual harm.
Sarah: Exactly. Take a therapy chatbot. If someone mentions self-harm, you have to be confident the response is safe and appropriate. That’s not something you want to “hope” works. You need evals that push the model through edge cases again and again.
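Here’s a hedged sketch of what a structural eval for the extraction example might look like, using a Pydantic schema to catch the malformed 1-in-100 output before it reaches downstream logic. The field names and checks are hypothetical.

```python
# Structural eval for extracted real estate fields; field names are hypothetical.
import json

from pydantic import BaseModel, ValidationError


class ParcelRecord(BaseModel):
    parcel_id: str
    zoning: str          # e.g. "commercial", "residential", "industrial"
    assessed_value: int  # whole dollars


def structure_eval(raw_model_output: str) -> dict:
    """Score 1 if the output parses into the expected schema, 0 otherwise."""
    try:
        record = ParcelRecord.model_validate(json.loads(raw_model_output))
    except (json.JSONDecodeError, ValidationError) as err:
        return {"score": 0, "comment": str(err)}
    # Extra domain check beyond "it parses": catch the 1-in-100 bad value.
    if record.assessed_value < 0:
        return {"score": 0, "comment": "negative assessed value"}
    return {"score": 1, "comment": "ok"}


print(structure_eval('{"parcel_id": "A-102", "zoning": "residential", "assessed_value": 250000}'))
```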
Prompt Engineering and Fragile Logic
Michael: You mentioned earlier that prompt changes can cause regressions. How bad can it get?
Sarah: Pretty bad. We had a project identifying speakers in transcripts where people are labeled like Speaker 0, Speaker 1, etc. We were trying to match those labels to actual names. It took a month to get to ~90% accuracy. Then one small change, a single sentence in the prompt, broke everything. No one expected it to have such a dramatic effect.
Michael: I’ve definitely been there.
Sarah: Without evals, we wouldn’t have known anything was wrong until much later. And by then we might have already shipped bad results.
Michael: So evals let you iterate with confidence?
Sarah: Absolutely. If you're building with LLMs, evals are your safety net.
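A regression-style eval for the speaker-matching example can be as simple as a labeled dataset plus an accuracy threshold that fails loudly when a prompt tweak tanks results. This sketch stubs out the actual LLM call; match_speakers and the 90% threshold are assumptions for illustration.

```python
# Regression-style eval for speaker matching; match_speakers is a stand-in
# for the real prompt-backed function under test.

def match_speakers(transcript: str) -> dict:
    """Placeholder: in practice this calls the LLM with the speaker-matching prompt."""
    return {"Speaker 0": "Alice", "Speaker 1": "Bob"}


# A small labeled dataset of transcripts with known speaker names.
LABELED_CASES = [
    {
        "transcript": "Speaker 0: Hi, I'm Alice. Speaker 1: Bob here.",
        "expected": {"Speaker 0": "Alice", "Speaker 1": "Bob"},
    },
    # ...more cases: three speakers, no introductions, names mentioned late, etc.
]


def speaker_accuracy(cases: list[dict]) -> float:
    correct = sum(match_speakers(c["transcript"]) == c["expected"] for c in cases)
    return correct / len(cases)


# Run on every prompt change so a one-sentence tweak that hurts accuracy fails the check.
accuracy = speaker_accuracy(LABELED_CASES)
assert accuracy >= 0.9, f"speaker matching regressed: {accuracy:.0%}"
```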
Swapping Models Isn’t Always Easy
Michael: Okay, but aren’t new models always better? Can’t I just wait for the next release?
Sarah: I wish it worked like that. But different models have different quirks. Claude, for example, prefers XML for structured data. GPT-4 models tend to be better with JSON. They also interpret instructions differently. Swapping models can totally throw off your carefully tuned prompt logic.
Michael: You had an issue with that recently, right?
Sarah: Yes. We were using Gemini 2.5 Pro while it was still in beta. It worked great until one day... it didn’t. Google pushed an update, and suddenly our results tanked. Our evals and traces picked up on it immediately. We downgraded to an earlier, stable model and recovered.
Michael: Without evals?
Sarah: We’d have shipped garbage. Or worse, we wouldn’t have realized it was garbage.
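One way to guard against a model swap or a silent provider update is to run the same eval dataset against both configurations and compare pass rates before promoting the new one. The sketch below stubs the pipeline call; run_pipeline and the model names are placeholders.

```python
# Compare the same eval dataset across two model configurations before switching.
# run_pipeline and the model names are hypothetical placeholders.

CASES = [
    {"description": "Two-story office building downtown.", "expected": "commercial"},
    {"description": "Three-bedroom single-family home.", "expected": "residential"},
]


def run_pipeline(model_name: str, case: dict) -> str:
    """Placeholder for the real extraction pipeline, parameterized by model."""
    return case["expected"]  # stubbed so the sketch runs end to end


def pass_rate(model_name: str, cases: list[dict]) -> float:
    passed = sum(run_pipeline(model_name, c) == c["expected"] for c in cases)
    return passed / len(cases)


baseline = pass_rate("stable-model", CASES)      # the known-good configuration
candidate = pass_rate("candidate-model", CASES)  # the shiny new release
print(f"baseline {baseline:.0%} vs. candidate {candidate:.0%}")
# Only promote the candidate if it at least matches the baseline on your own data.
```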
Evals Aren’t Just for Engineers
Michael: Is this just an engineering job? Or do you involve others?
Sarah: It’s crucial to involve stakeholders, especially when you're modeling human processes. In our real estate project, the editor reviewing AI outputs had specific formatting preferences. So we built those into our evals. Domain experts help define what “good” looks like.
Michael: Especially when the evaluation itself is subjective.
Sarah: Exactly. A developer can write a test, but only an expert can tell you whether the AI’s answer is helpful, appropriate, or complete.
Building an Evaluation Pipeline
Michael: Let’s say I’m new to this. What does a typical eval pipeline look like?
Sarah: Sometimes it’s simple. If you're classifying something like commercial, residential, or industrial, you can check for exact matches. But if you’re evaluating tone, sentiment, or conversational quality, things get fuzzier.
Michael: Like in the therapy chatbot example?
Sarah: Right. You can use another LLM to judge the quality or emotional tone of a response. Does it sound empathetic? Is the advice appropriate? It’s like having a second expert look over your work.
Michael: But doesn’t that feel like the model is judging itself?
Sarah: It can. That’s why you need to be careful and test your evals too. Ideally, you combine LLM-as-judge with human review, especially for subjective or high-risk tasks.
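For fuzzier qualities like empathy, an LLM-as-judge eval can return a graded score and route low-scoring responses to a human reviewer instead of being trusted blindly. This sketch assumes an OpenAI-style chat API; the prompt wording, 1-to-5 scale, and review threshold are illustrative choices, not a standard.

```python
# Graded LLM-as-judge eval for tone, with low scores routed to human review.
# The prompt wording, 1-5 scale, and threshold are assumptions, not a standard.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are reviewing a support chatbot reply. Rate its empathy from 1 (cold) "
    "to 5 (warm and appropriate). Reply with only the number.\n\nReply:\n{reply}"
)


def empathy_eval(reply: str) -> dict:
    judge = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply)}],
    )
    score = int(judge.choices[0].message.content.strip()[0])
    # Anything below 4 gets flagged for a human reviewer rather than trusted blindly.
    return {"score": score, "needs_human_review": score < 4}


print(empathy_eval("I'm really sorry you're going through this. You're not alone."))
```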
Final Thoughts: Do the Work, Reap the Reward
Michael: Okay, so I’m sold. But I’m also intimidated. Evals sound complicated. What do you say to someone on the fence?
Sarah: It’s definitely work. But it’s necessary. You wouldn’t ship code without tests, so why would you ship AI without evals? They give you reliability, reduce downstream costs, and help you iterate faster.
Michael: What about the “just cowboy it and patch it later” argument?
Sarah: That’s a fast path to pain. With AI, debugging after deployment is even worse. You don’t just have bugs, you have unpredictable behavior that’s hard to trace. Evals catch that early. They’re not optional if you care about quality.
Michael: Totally agree. Any last takeaways?
Sarah: Just that building with LLMs means embracing a new kind of software development. One where tests aren't just binary. They’re semantic, behavioral, and probabilistic. But if you want to build responsibly and confidently, evals are how you do it.
Michael: Love it. Sarah, thanks so much for your time and insights.
Sarah: Thanks for having me!
At Focused, we use eval-driven development to test complex, real-world behavior, not just static outputs. It’s how we catch fragile logic before it ships and ensure our AI systems actually do what teams expect. If you're building with LLMs, evals aren't optional. They are how you build systems that work.