LangChain

Debugging LLM Pipelines with LangSmith: Why Prompting Alone Isn’t Enough

Learn how Focused uses LangSmith to debug, test, and scale LLM pipelines. Go beyond prompt engineering and design systems built for production.

Jul 25, 2025

By Matias Burgio


Shipping production-ready LLM systems isn’t just about prompt quality. It’s about cost, speed, and reliability. Behind every “successful” prompt lies a web of trade-offs, and every change you make becomes harder to debug and track over time. At Focused, we’ve learned that prompt engineering alone doesn’t scale. Here’s how I use LangSmith to debug and design LLM systems that hold up in production.

How to build a good prompt

What makes a good prompt always depends on the specific use case and its needs. When we talk about creating a good prompt, the focus is usually on providing a role, offering better context, setting constraints, using examples, or defining a template. All of those techniques are valuable, but they address only one side of prompting: the quality of the output.

Every modification we make to a prompt also affects other dimensions, such as the token count. The more complex the request, the more tokens are required to process it. And as token usage increases, latency and operational cost go up as well.

On one hand, users expect fast, high quality responses, which often pushes us to larger models, more context, or richer prompts. On the other hand, every additional token we send or receive not only increases latency but also has a direct impact on the operational cost. 
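To make the cost side of that trade-off concrete, here is a back-of-the-envelope sketch. The per-1K-token prices are hypothetical placeholders, not any provider’s real rates:

```python
# Back-of-the-envelope cost model for a single LLM call.
# Prices are HYPOTHETICAL placeholders -- substitute your provider's rates.

PRICE_PER_1K_INPUT = 0.0005   # USD per 1K input tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.0015  # USD per 1K output tokens (placeholder)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one call from its token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Doubling the prompt with extra role/context/examples doubles the input cost,
# before any effect on the output:
lean = call_cost(400, 200)  # short prompt
rich = call_cost(800, 200)  # same task, richer prompt
```

Multiply by requests per day and the "richer prompt" column becomes a real line item, which is why these dimensions deserve the same scrutiny as output quality.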

These trade-offs are much harder to evaluate and optimize, so we shouldn’t only rely on a human in the loop. That’s why tools like LangSmith can be a game-changer for debugging and improving your system.

Graph Structure and Design

Before talking about testing and debugging, we should start thinking about the structure, because any good testing strategy starts with the architecture of your system.

LLMs are black boxes. As developers, we don’t have visibility into the inner workings of neural connections, so we can’t directly explain why a model gives different answers to slightly different prompts. However, what we are able to control is how we chain and modularize the steps of our pipeline.

As an example, let’s use a simple translator I made; the lessons extrapolate easily to more complex use cases. You send an input and a target language, and the system returns a translated message. In its most basic form, this system could be implemented as a single node: “translate.”
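A minimal sketch of that single-node version, with a `fake_llm` stub standing in for the real model call:

```python
def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns a canned response.
    return f"[translated] {prompt.split(':', 1)[1].strip()}"

def translate(text: str, target_language: str) -> str:
    """The entire pipeline as one 'translate' node."""
    prompt = f"Translate the following text to {target_language}: {text}"
    return fake_llm(prompt)
```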

Let’s say you want to add another feature, like applying different tones or styles to the output. You could modify the prompt and keep it within the same node, hoping it still works as expected. And it might. But this is where traceability gets more complicated and the problems begin.

As we start combining multiple goals (translation, tone, and language detection) into a single prompt, it becomes harder to understand which change affects which behavior. We lose the ability to measure the impact of prompt modifications in a meaningful way.

Like in traditional software development, we should aim to separate concerns. Instead of having a single overburdened node, imagine breaking the graph down into three distinct steps:

  1. Language Detection
  2. Translation
  3. Tone Adjustment

Suddenly, your black box becomes three smaller boxes. This graph design lets you test your system more deeply, and ultimately makes it easier to iterate safely. 
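The three-step split can be sketched as plain functions over a shared state, a stand-in for graph nodes (all model calls are stubbed and names are illustrative):

```python
from typing import TypedDict

class State(TypedDict, total=False):
    text: str
    target_language: str
    tone: str
    detected_language: str
    translation: str
    output: str

def detect_language(state: State) -> State:
    # Node 1: stubbed language detection.
    state["detected_language"] = "en"
    return state

def translate(state: State) -> State:
    # Node 2: stubbed translation call.
    state["translation"] = f"[{state['target_language']}] {state['text']}"
    return state

def adjust_tone(state: State) -> State:
    # Node 3: stubbed tone adjustment.
    state["output"] = f"({state['tone']}) {state['translation']}"
    return state

def run(state: State) -> State:
    # Wire the nodes in sequence -- each one can now be traced
    # and tested in isolation.
    for node in (detect_language, translate, adjust_tone):
        state = node(state)
    return state
```

Because each node reads and writes named fields of the state, a bad output can be attributed to one step instead of one giant prompt.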

Now you can not only build datasets and regression tests for the system as a whole, but also set up requirements for each step, evaluate them independently, and easily identify points of failure.
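Per-node checks can then be as simple as unit tests against a tiny dataset. The detection logic here is a toy stub, not a real model:

```python
def detect_language(text: str) -> str:
    # Stub for the language-detection node; a real node would call a model.
    return "es" if any(w in text.lower() for w in ("hola", "gracias")) else "en"

# Each node gets its own small dataset and its own requirement:
cases = [
    ("Hola, ¿cómo estás?", "es"),
    ("Hello there", "en"),
]
for text, expected in cases:
    assert detect_language(text) == expected
```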

Tracing, Testing, and Tuning

To identify and fix these kinds of issues, we rely on monitoring and evaluation tools. Let’s go back to my translator example. One of the tone options that I made was poetic, and I received some reports mentioning that “Sometimes the poetic tone gets too large and loses consistency with the original meaning.”


A simple input like “I miss you. I hope we can meet again soon.” was producing:

“Beneath moonlit skies, where shadows whisper secrets of yearning, my essence wanders through memories of you, a fleeting ghost in the tapestry of time. May the stars realign our destinies in the cosmic waltz of reunion.”

This output is not only excessively verbose; it also changes the meaning of the original sentence.

How would you debug this system manually? Without any tool, you’d probably start rerunning user inputs (or creating your own) in dev, trying to replicate the issue, tweaking prompts or the context provided, reading outputs, and checking for a better result, without even knowing whether it was a one-off bug or a real failure. This is time-consuming and doesn’t scale.

That is why it is important to have support tools that speed up the work: tools that let you easily look up every use of the poetic tone, review the input and output of each step in the chain, compare those results with other real inputs, and, if necessary, generate new datasets from those inputs to add coverage for this scenario.

To fix the bug, I looked for similar use cases in the traces, chose some inputs I thought could help, and ran them against different evals a few times to get an average result, focusing on the similarity score. Initially, the score was 0.42; after refining the context and adjusting the prompt, I improved it to 0.91. Along the way, I also noticed that these extremely verbose responses inflated token consumption. Adjusting the output not only produced a more accurate translation but also reduced consumption by 15% for the poetic tone, an optimization that translated directly into lower latency and cost. As a bonus, we added the real user examples to the regression dataset, improving both the robustness and the test coverage of our system.
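A similarity check along these lines can be sketched with a simple evaluator. Here `difflib`’s lexical ratio stands in for the embedding-based semantic similarity a real eval would compute:

```python
from difflib import SequenceMatcher

def similarity_score(reference: str, candidate: str) -> float:
    """Crude lexical similarity in [0, 1]; a real eval would use embeddings."""
    return SequenceMatcher(None, reference.lower(), candidate.lower()).ratio()

reference = "I miss you. I hope we can meet again soon."
verbose = ("Beneath moonlit skies, where shadows whisper secrets of yearning, "
           "my essence wanders through memories of you, a fleeting ghost...")
concise = "I miss you; I hope we meet again soon."

# The concise rewrite scores much closer to the source than the verbose one:
assert similarity_score(reference, concise) > similarity_score(reference, verbose)
```

Running a scorer like this over a batch of traced inputs and averaging the result gives exactly the kind of before/after number (0.42 vs. 0.91) described above.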

Fixing the bug is only one part of the problem. The next challenge is making sure not to introduce new failures. How can you be confident that your fix doesn’t break something else in the system?

This is where LangSmith’s eval system and a good test harness come in. Let’s say you craft a new version of the tone prompt that solves poetic responses. To be sure this doesn’t affect our preexisting features, LangSmith provides a dashboard to review the output differences, compare the results with your defined metrics, and measure regression or improvement in a controlled way. Using it, I ensure that modifications keep the model within acceptable parameters for the system.

This platform is not only useful for debugging and troubleshooting. At Focused, LangSmith has become a core tool in our development workflow. We use it daily for feature development and debugging, run our tests against each change, and integrate them into our CI/CD pipelines to get regression coverage quickly and detect failures before deployment, without relying on manual executions.
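A CI-style regression gate can be sketched as a test that fails the build when the average eval score drops below a threshold. The pipeline, scorer, dataset, and threshold here are all illustrative stubs:

```python
# Sketch of a CI regression gate: run the eval suite over a fixed dataset
# and fail the build if the average score drops below a threshold.

REGRESSION_THRESHOLD = 0.8  # hypothetical acceptance bar

def run_pipeline(text: str, tone: str) -> str:
    # Stand-in for the real translate + tone pipeline.
    return text

def score(reference: str, output: str) -> float:
    # Stand-in for a similarity evaluator.
    return 1.0 if reference == output else 0.0

DATASET = [
    ("I miss you. I hope we can meet again soon.",
     "I miss you. I hope we can meet again soon."),
]

def test_tone_regression():
    scores = [score(ref, run_pipeline(inp, "poetic")) for inp, ref in DATASET]
    avg = sum(scores) / len(scores)
    assert avg >= REGRESSION_THRESHOLD, f"regression: avg score {avg:.2f}"
```

Wired into a test runner, a gate like this turns “did the new prompt break anything?” into a pass/fail signal on every commit.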


Conclusion

Prompting isn’t just about getting the “best” answer. It’s about getting a good enough answer at the right cost, right speed, and with predictable behavior. As your system grows in complexity, it becomes harder to maintain that balance manually.

While using tools in the very first stages of a project is not mandatory, as the system starts to scale it helps to have a centralized point of control to measure its performance. That is why I consider that LangSmith not only streamlines the process but also improves the results.
