Agent Prompt Caches Are a Runtime Boundary

Prompt caching turns context order into runtime architecture.

Moving a timestamp, session id, memory write, or even tool schema to the front of an agent’s prompt can single-handedly destroy the cache economics of every long-running task. This still causes trouble even though the model does answer the prompt, the trace even looks normal, and it still works more or less as before. But the increased bill and first token latency can add up quickly, only becoming apparent later for expensive agents where the difference is barely noticeable at first.

The fresh work from LangChain named the long-buried architecture decision hidden in plain sight for everyone using Deep Agents. The numbers are blunt: properly implemented prompt caching, as opposed to a simple cache of last N tokens, can reduce inference token cost by 41 to 80%. The more interesting number is from Deep Agents eval trajectories in LangChain’s harness: when provider-aware caching is turned on, average token-cost reduction lands between 49 and 80%.

A prompt cache is a test for stable context.

The cache boundary lives inside the prompt

Provider prompt caches work off of prefixes. The reusable part of the prompt must occur in the same place, in the same order, with the exact same text, above the provider’s minimum token threshold. Cache hits for OpenAI occur on exact prefix matches, starting at 1,024 tokens, with stable prompt content first and the request-specific part last. OpenAI also exposes prompt_cache_key to help with routing locality for requests with reusable prefixes.

That turns prompt assembly into architecture.

Here the stable part of the prompt is the beginning of the prompt and consists of ‘boring’ parts that used to be referred to as prompt boilerplate, e.g. system prompt, developer prompt, tool prompt, output contract, examples, skills, static policy, project memory, and other material reused over multiple turns. The dynamic suffix consists of current input, plus relevant context, retrieved snippets, current memory writes, tool output, timestamps, request metadata, auth-scoped resource hints, and whatever was generated in the current turn.

While there is a graph, an eval set and a model router it does not automatically mean there is agentic AI architecture. It all leaks money as soon as stable parts of context are mixed with volatile parts, and those parts of context are repeated.

Small mistakes made to the prompt individually don’t seem to do much harm, but all together ruin cache hits. For example, a new middleware is introduced to prepend a request_id to the system prompt. A user profile summary is moved to the top of the prompt (what a “personalized” AI must look like). A retrieval component appends fresh citations to the top-level instruction. A tool registry is sorted by last-use time. A skill loader inlines the complete selected skill before the stable part of the prompt. Each modification is defensible by itself, but in total they turn the prompt cache into a rumor.

This is why context as product architecture is coming back time and time again. People’s understanding of context in terms of architecture is that agents don’t perceive the context only as text. They perceive the context as ordered state with associated costs and latencies. Agents also have understanding of context in terms of permission and failure.

Architecture diagram showing stable prompt prefix and dynamic suffix boundaries for agent prompt caching. — Cache hits survive when the runtime keeps stable context ahead of volatile work.

The harness owns the shape

The provider can offer the cache. The agent harness owns the runtime boundary because it decides what the provider sees.

In LangChain’s Deep Agents stack, the relevant line is create_deep_agent. The docs say caching is enabled by default for Anthropic models and Bedrock Claude or Nova models, and is applied to static parts of the system prompt that repeat on every turn.

The interesting part of the default stack is that provider-specific prompt caching happens after the patch/custom middleware and before the memory middleware. The Deep Agents customization docs note that memory is placed later so memory updates do not coincidentally invalidate the cache prefix. This is the crux of the argument against caching as a setting in the SDK buried near the model call. Instead, cache policy is a runtime boundary just like the rest of the system.

This is another point where lazy-loading tool schemas into context for running agent scenarios makes sense. Tool selection in an MCP-powered runtime is critical for performance and control, and dumping all tool schemas into a prompt pays nothing and churns the cache when tool lists change. Static tool definitions should be explicitly included in the prefix for cache hits, not accidentally created by adding the entire MCP registry to the prompt.

Skills are runtime-governed artifacts. They can be durable operating knowledge, cacheable prefix content, or raw unreviewed payloads. When skills as governed runtime artifacts affect cache hit rates or provider bills, they fall outside a narrow security view and become a runtime concern.

Provider differences are runtime inputs

The biggest problem with this interface is that the caching model exposed by each provider is different.

As mentioned earlier, Anthropic supports explicit cache breakpoints for tools, system and messages in that order. The cache for these objects has a default TTL of 5 minutes but can also be set to 1 hour for objects that live longer. Anthropic’s documentation for caching also details the pricing for cache writes and reads as well as the hard limit of 4 breakpoints per model. This becomes important when an agent has definitions for tools, the system, examples and then the conversation state.

OpenAI supports caching for a number of models, automatically for supported models and explicitly for others. The caching is rewarded by the agent runtime when stable parts of the agent input come first, followed by variable parts. This can be controlled by the layout of the input and by the routing of the cache key. For example, in cases where there are shared prefixes, the cache key should be sized to the point at which those prefixes fan out from having had the same prefix, to maximize hits. The OpenAI docs for caching also carry the important detail that cache-key traffic that is too broad can overflow locality and thus decrease hit rate.

Gemini splits the caching behavior between implicit caching for newer Gemini models such as Gemini 2.5 and larger, explicit cached-content objects for repeated large context. In these architectures the runtime determines that a large document, a long recorded audio or video transcript, or a similarly long collection of written instructions should be cached as a named artifact for repeated use by the agent, rather than the agent relying on implicit caching for prefix matches. Google's docs show cachedContent objects with TTL, similar to checkpointed caching model features.

Bedrock’s caching is also done via cache checkpoints, and there are model-specific minimums for number of tokens, supported fields, TTLs, and checkpoint limits. Amazon's docs are a reminder that enterprise deployment details eventually surface: model family, endpoint type, region, and checkpoint limits all affect caching behavior.

A generic model interface is enough to hide the invocation details of a model. However, it cannot hide the important cache details, such as the fact that a provider rewards exact prefix layout, as opposed to breakpoints that must be explicitly invoked for tools, system, or messages, cached content where the cache stores a large piece of context instead of a prefix match, or cache checkpoints with field support, TTL, and checkpoint-count rules. Those details become routing keys, object IDs, and provider policy in the runtime.

Model routing is an architecture boundary because routing a task from OpenAI to Anthropic or from Gemini to Bedrock changes the cache controls available to the agent runtime. Treat the providers as strings and the cache boundary has no owner.

Provider matrix comparing prompt-cache controls across OpenAI, Anthropic, Gemini, and Bedrock. — Provider APIs do not expose the same cache knobs, so a generic wrapper is not enough.

Cache hits belong in traces

Prompt caching has an observability shape.

The cache information is not complicated. Every serious runtime should record the cache policy that was applied to a model call. This includes the stable-prefix hash, the list of prompt sections included in the cacheable prefix, the number of cache breakpoints, the provider’s cache key or cached-content id, the write and read tokens for the cache, the TTL, and the reason why the cache was skipped for any given model call. This information should be observable in the same trace as model output, tool usage, cost, latency, retry attempts, and other evaluative data. In particular, this information should be recorded by the runtime, as opposed to being obscured inside a provider or runtime component. LangChain's prompt-caching model docs frame caching across implicit provider caching, provider-level controls, and middleware. Information about caching within a runtime should follow that model rather than being flattened into input tokens and output tokens.

Cost is similarly observable. Agent cost is a runtime signal when the runtime can connect a bill spike to a changed tool schema, memory injector, model route, or prompt template. This is finance cleanup, instead of a monthly bill where the team can only guess which agent got expensive.

Missing prompt-cache hits without reason is a runtime bug that behaves nicely and quietly continues to make the agent slower and more expensive while the dashboards stay green.

The runtime rule is simple

Place the stable context first, the volatile context last. Make the cache controls of a provider explicit in the corresponding middleware. Record cache behavior in the traces as well. Change cache policy the way runtime parameters change, because cache policy changes runtime parameters.

The hard part is ownership.

Everyone has a reasonable request: Product wants personalization on top of security’s policy coverage, platform abstracts away provider-specific details, finance toggles the cache on and off, and engineering wants the prompt to work even after memory, tools, skills, and retrieval have all changed. The runtime has to satisfy all of these requests and produce a context layout that is able to survive live-system changes.

That is agentic AI architecture now. The graph matters. The tools matter. The evals matter. The cache boundary also matters because it is where cost, latency, context, and provider behavior meet.

Those teams will make long-running agents cheaper without making them stupid. The others will learn the hard way that discounts don’t last when architecture is accidental and they treated prompt caching like a provider toggle.

The cache boundary lives inside the prompt

The harness owns the shape

Provider differences are runtime inputs

Cache hits belong in traces

The runtime rule is simple

Let's Build better Agents Together

Modernize your legacy with Focused