What We Learned at O11yDay NYC
Our key takeaways from O11yDay NYC on observability, evals, and understanding your systems in production.
Mar 17, 2026
We spent last week at O11yDay NYC with the Honeycomb team, talking with engineers and platform leaders building integrated agents and distributed services in production.
The energy in the room reflected a shift: teams aren't debating whether agents and AI systems need observability. They're deep in the weeds of how to actually do it.
Agents break the traditional debugging model
In a traditional distributed system, a request travels a known path. When something breaks, you follow the trace.
Agents don't work that way. An agent decides. It retrieves context, constructs prompts dynamically, calls tools, and hands off to other agents. A single request can move through multiple layers of reasoning before producing an output. When something goes wrong, it's rarely a thrown exception. It's a subtle misbehavior, an output that looks plausible but isn't right.
Austin Parker put it well in his keynote: "Observability monitors the gaps between what AI knows and the real system." Those gaps live in context: what was retrieved, how prompts were shaped, what assumptions were baked in. Without visibility there, you're left inferring behavior instead of understanding it.
Be the judgment layer
The idea that stuck most from Austin's keynote: as AI takes on more of the work, your job isn't to get out of the way. It's to be the judgment layer.
AI is good at working with context. People are good at knowing when something that looks right isn't. The goal is a loop where both keep making each other better, and that loop has a practical shape.
Prompt the agent. Question its assumptions. When it explains something, ask it to show you the evidence. Use what you find to add better context and go again. Evals are the automated version of that judgment, giving you a consistent way to measure whether changes to your models, prompts, or retrieval actually improved things.
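That outer loop of automated judgment can be sketched as a minimal harness. Everything here is a hypothetical stand-in (the keyword scorer, the case format, the stub agent), not any particular eval framework; the point is the shape: fixed cases, a scorer, and comparable scores across runs.

```python
# Minimal eval harness: score agent outputs against expectations so that
# changes to prompts, models, or retrieval can be compared run-over-run.
# The scorer and cases below are illustrative stand-ins.

def keyword_score(output: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the output."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in output.lower())
    return hits / len(expected_keywords)

def run_evals(agent, cases):
    """Run every case through the agent and return per-case scores."""
    return {
        case["id"]: keyword_score(agent(case["input"]), case["expect"])
        for case in cases
    }

# A stub "agent" standing in for a real model call.
cases = [
    {"id": "refund", "input": "How do I get a refund?",
     "expect": ["refund", "30 days"]},
]
stub_agent = lambda q: "Refunds are available within 30 days of purchase."
scores = run_evals(stub_agent, cases)  # {"refund": 1.0}
```

Rerunning the same cases after a prompt or model change gives you a before/after signal instead of a gut feeling.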
One underrated point Austin made: this knowledge doesn't spread through documentation. It spreads through pairing. Watching someone prompt an agent, question it, and adjust its behavior in real time teaches both people something. That's how it actually moves across a team.
Tracing agents with Honeycomb
Honeycomb's model (structured events and high-cardinality queries) gives teams the ability to ask new questions of their systems without deciding in advance what to instrument. For agents, that's important because you often don't know what you'll need to ask until something goes wrong.
The traces that matter in agent systems are different from what most teams are used to capturing:
- Retrieval spans: what was fetched, from where, how it ranked
- Prompt construction spans: what context was injected, what was trimmed
- Decision spans: which tool was selected, which branch was taken
- Handoff spans: what state passed between agents, and whether it arrived intact
When these are first-class spans in a Honeycomb trace, teams can explore system behavior rather than just monitor it. Several teams we talked to had logging in place and still felt blind. The gap wasn't instrumentation. It was queryability. Being able to slice across thousands of traces to find where context degraded or where a model version started behaving differently is what turns data into understanding.
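The shape of those four span types is easy to see even without an SDK. This stdlib sketch stands in for an OpenTelemetry tracer: it records a tree of named spans carrying the kinds of high-cardinality attributes (retrieval source, model version, handoff target) that make slicing possible later. All span names and attribute keys are illustrative, not a prescribed schema.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    attributes: dict
    children: list = field(default_factory=list)

class Tracer:
    """Tiny stand-in for a real tracing SDK: records a tree of spans."""
    def __init__(self):
        self.root = Span("agent_request", {})
        self._stack = [self.root]

    @contextmanager
    def span(self, name: str, **attributes):
        s = Span(name, attributes)
        self._stack[-1].children.append(s)  # attach to current parent
        self._stack.append(s)
        try:
            yield s
        finally:
            self._stack.pop()

tracer = Tracer()
# One agent turn, with each decision recorded as a first-class span.
with tracer.span("retrieval", source="docs_index", top_k=5):
    pass  # fetch and rank context
with tracer.span("prompt_construction", context_tokens=1800, trimmed=True):
    pass  # assemble the prompt, noting what was cut
with tracer.span("decision", tool="search_api", model_version="v2.3"):
    pass  # which tool, which branch, which model
with tracer.span("handoff", target_agent="summarizer", state_intact=True):
    pass  # what state crossed the agent boundary

span_names = [c.name for c in tracer.root.children]
```

With real OTel instrumentation, those attributes become the fields you group and filter by when hunting for the trace where context degraded or a model version drifted.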
What we're taking away
The teams making real progress are taking what already works in distributed systems (trace propagation, structured events, span modeling) and applying it to a new class of decisions. The mental model transfers. The implementation patterns are still being defined, and that's where the real work is.
They're also running evals and observability together. Evals as the outer loop to measure whether changes made things better. Observability as the inner loop to understand what's actually happening. Neither one is enough on its own.
And the teams winning are the ones who've built Austin's loop: asking new questions, learning from what they find, feeding that back into better context and better prompts. Observability is what makes that loop possible over time.
Work with us
Focused is an official Honeycomb partner. We work directly with engineering teams on OTel instrumentation, agent observability, and eval pipelines: the hands-on work of making these systems understandable in production.
That's the work we do with Honeycomb. And based on last week, it's exactly where teams need the most help.
If any of this resonates, we'd love to connect.
