Stop Eager-Loading MCP Tools Into the Context Window
MCP servers should not eagerly load every tool schema into an agent's context window. Lazy-load tools by intent, then govern and audit execution.
May 5, 2026

Austin Vance, CEO of Focused
I think the problem with the current state of MCP is way deeper than just resizing the context window.
The protocol itself is decent. Tool discovery and schema negotiation work well, and the JSON-RPC architecture feels solid and well engineered. But the default behavior of populating the agent's context at session start with every tool definition from every connected server makes running production agents virtually impossible.
One developer measured 67,300 tokens consumed before typing a single question. Seven MCP servers. Tool schemas alone ate up a third of the available context. Another measured 81,986 tokens.
The Eager-Loading Tax
When an agent starts a session with MCP servers connected, it downloads the full library of tools from every server, every session, and never filters down to just the tools needed for the job at hand.
My browser automation server loads 21 tool definitions. A GitHub server loads 27. My web search server bundles 8 providers behind 20 tools. I haven't sent a single message yet, and I'm already consuming significant context.
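You can estimate where your own budget goes before any conversation starts. Here's a minimal sketch of the idea, assuming the common rough heuristic of ~4 characters per token rather than a real tokenizer; the tool definitions below are hypothetical stand-ins shaped like MCP tool schemas.

```python
import json

def estimate_schema_tokens(tool_schemas: list[dict]) -> int:
    """Rough token estimate for a list of tool definitions,
    using the ~4 characters-per-token heuristic."""
    serialized = json.dumps(tool_schemas)
    return len(serialized) // 4

# Hypothetical tool definitions, shaped like MCP tool schemas.
github_tools = [
    {
        "name": f"github_tool_{i}",
        "description": "Performs a GitHub operation on the target repository. " * 5,
        "inputSchema": {"type": "object", "properties": {"repo": {"type": "string"}}},
    }
    for i in range(27)
]

# Cost paid before the first message, for one server's schemas alone.
print(estimate_schema_tokens(github_tools))
```

Run this against the real `tools/list` responses from your servers and the per-server numbers above stop being abstract.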
The numbers from a study of 856 tools across 103 MCP servers make this worse than it sounds. Fully augmented MCP tool descriptions add 67% more execution steps for a 5.85-percentage-point accuracy gain. The tool definitions don't just eat context; they make agents take more steps to get the same work done.
We wrote about evaluation pipelines for production agents. One failure mode of context pollution from tool definitions that I never see anyone mention: the agent becomes less effective over time. It doesn't die or crash or throw an error. The real conversation history that fits in the working window just gets pushed out by the tool schemas.
Even with child agents, the context budget gets severely curtailed. Each child agent inherits the MCP configuration. That's a fresh context window, sure, but immediately losing tens of thousands of tokens to tool schemas for subagents that may never use them is completely antithetical to the point of using subagents in the first place: focused context. We covered the architecture patterns for multi-agent orchestration in LangGraph, but even great orchestration can't fix a context budget that's already half spent before the first tool call.

Cloudflare Just Admitted This Is Broken
Cloudflare launched Agents Week on April 12, and buried in their enterprise MCP reference architecture is an admission that the tool-definition model doesn't scale.
Their solution is called Code Mode. It condenses all of the individual MCP tools down into two meta-tools: portal_codemode_search and portal_codemode_execute. Rather than loading every tool definition into context, the agent writes JavaScript to search for and invoke tools on demand.
This means that 4 internal MCP servers exposing 52 tools would normally consume 9,400 tokens just for definitions. Code Mode drops that to 600 tokens. A 94% reduction. For Cloudflare's own API, which would consume over 2 million tokens as a traditional MCP server (twice the largest context window available right now), the reduction hits 99.9%.
That last number deserves to sit for a second. Cloudflare, one of the companies most aggressively adopting MCP across their entire enterprise, had to build a system that essentially replaces MCP's tool discovery mechanism because the original approach would literally overflow the context window. With one server.
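The meta-tool pattern behind Code Mode can be sketched with two plain functions: a search surface and an execute surface. This is an illustrative reconstruction, not Cloudflare's implementation; the catalog, tool names, and handlers below are all hypothetical.

```python
# Hypothetical catalog: full schemas and handlers live here,
# never in the agent's context window.
TOOL_CATALOG = {
    "github_create_issue": {
        "description": "Create an issue in a GitHub repository",
        "handler": lambda args: {
            "issue_url": f"https://github.com/{args['repo']}/issues/1"
        },
    },
    "browser_screenshot": {
        "description": "Capture a screenshot of the current page",
        "handler": lambda args: {"image": "<png bytes>"},
    },
}

def search_tools(query: str) -> list[dict]:
    """Meta-tool 1: the agent searches for tools by keyword, on demand."""
    q = query.lower()
    return [
        {"name": name, "description": meta["description"]}
        for name, meta in TOOL_CATALOG.items()
        if q in name or q in meta["description"].lower()
    ]

def execute_tool(name: str, args: dict) -> dict:
    """Meta-tool 2: the agent invokes a tool it discovered via search."""
    return TOOL_CATALOG[name]["handler"](args)
```

Only these two function signatures need to sit in context. The catalog behind them can grow to thousands of tools without adding a single token to the agent's budget.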
The MCP spec team acknowledged context overload as the most frequent community concern in their tool filtering proposal. Quality decreases rapidly after around 10 tools, a threshold most production setups far exceed.
Lazy-Loading Is the Fix
This isn't just a theoretical issue. I'm seeing lazy-loading work in multiple production environments, each implementing it slightly differently.
Cloudflare's Code Mode turns the agent into its own tool browser. Give it a search function, give it an execute function, and let it figure out which tools matter for the job at hand. The context cost for exploring MCP servers stays the same regardless of how many servers are connected.
There's also the Skills pattern. Instead of representing all of the tool schemas in detail upfront, agents encode the knowledge needed for a given task in lightweight skill files (typically 200 to 1,500 tokens each) that can be loaded as needed based on intent matching. A skill for browser automation might cost around 2,000 tokens to activate, as opposed to 13,600 tokens to load the full MCP server at startup. GitHub operations drop from 18,000 tokens to maybe 500 or so. Web search goes from 14,100 down to 550.
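A minimal version of the skills pattern is just intent-matched file loading: a few hundred tokens of triggers stay resident, and the full skill text loads only when a task matches. This sketch uses hypothetical skill names and trigger words; real skill files are much richer than the stand-ins here.

```python
from pathlib import Path

# Hypothetical skill index. Only this small structure stays in context;
# the skill bodies live on disk until a task needs them.
SKILL_INDEX = {
    "browser-automation": {"triggers": ["browse", "click", "screenshot"]},
    "github-ops": {"triggers": ["pr", "issue", "commit"]},
}

def load_skills_for(task: str, skills_dir: Path) -> list[str]:
    """Load the full text of each skill whose triggers match the task."""
    task_lower = task.lower()
    loaded = []
    for name, meta in SKILL_INDEX.items():
        if any(trigger in task_lower for trigger in meta["triggers"]):
            loaded.append((skills_dir / f"{name}.md").read_text())
    return loaded
```

Naive substring matching like this will misfire on words that contain a trigger, so a production version would want word-boundary matching or an embedding lookup, but the shape of the savings is the same.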
That's not marginal. That's an order of magnitude.
Arcade's MCP Gateway in LangSmith Fleet takes a third approach by centralizing 7,500+ tools and optimizing the tool descriptions for language models. These tools are not simply API wrappers. They are mapped to actions that agents can perform, with descriptions written specifically for how language models select and call them.
Harrison Chase wrote about this from the other direction. His continual learning framework identifies three realms where agents improve: model weights, harness code, and context. The context layer is "the most common and most exciting area right now." But optimizing for context only works if there is room left in the context budget to do so. An agent can't learn from its interactions if the space for learning is already filled by tool schemas it loaded at boot time.

What This Looks Like in Practice
Here's what this looks like with the current LangChain infrastructure. The eager version registers all tools when the agent is built:
```python
from langchain.agents import create_agent
from langchain_mcp_adapters.client import MultiServerMCPClient

MCP_SERVERS = {
    "github": {"transport": "http", "url": "http://localhost:3001/mcp"},
    "browser": {"transport": "http", "url": "http://localhost:3002/mcp"},
    "search": {"transport": "http", "url": "http://localhost:3003/mcp"},
    "database": {"transport": "http", "url": "http://localhost:3004/mcp"},
}


async def build_eager_agent():
    client = MultiServerMCPClient(MCP_SERVERS)
    tools = await client.get_tools()  # all tools, all servers, every session
    return create_agent("claude-sonnet-4-6", tools=tools)
```
The lazy approach is not a magic discovery tool that mutates the running agent's tool set. The boring version is a router: decide which MCP servers matter for this task, load only those tools, then build the agent for that run.
```python
from langchain.agents import create_agent
from langchain_mcp_adapters.client import MultiServerMCPClient

TOOL_REGISTRY = {
    "github": {
        "transport": "http",
        "url": "http://localhost:3001/mcp",
        "triggers": ["pr", "issue", "repo", "commit", "branch"],
    },
    "browser": {
        "transport": "http",
        "url": "http://localhost:3002/mcp",
        "triggers": ["browse", "click", "navigate", "screenshot", "page"],
    },
    "search": {
        "transport": "http",
        "url": "http://localhost:3003/mcp",
        "triggers": ["search", "find", "look up", "query"],
    },
    "database": {
        "transport": "http",
        "url": "http://localhost:3004/mcp",
        "triggers": ["sql", "query", "table", "database", "records"],
    },
}


def select_servers(task_description: str) -> dict[str, dict]:
    """Route: pick only the MCP servers whose triggers match the task."""
    selected = {}
    task = task_description.lower()
    for name, config in TOOL_REGISTRY.items():
        if any(trigger in task for trigger in config["triggers"]):
            selected[name] = {
                "transport": config["transport"],
                "url": config["url"],
            }
    return selected


async def run_with_lazy_tools(task_description: str):
    selected_servers = select_servers(task_description)
    if not selected_servers:
        available = ", ".join(TOOL_REGISTRY)
        raise ValueError(f"No matching MCP servers. Available: {available}")
    client = MultiServerMCPClient(selected_servers)
    tools = await client.get_tools()  # only tools from the routed servers
    agent = create_agent("claude-sonnet-4-6", tools=tools)
    return await agent.ainvoke(
        {"messages": [{"role": "user", "content": task_description}]}
    )
```
The first version of this feature I wrote had a terrible context profile because it loaded definitions for every tool on every server. The next version routed first, then loaded only the relevant tools. In a production system with 5 to 10 MCP servers, the gain is tens of thousands of fewer tokens processed every session.
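The arithmetic behind that claim is simple. Here's a back-of-the-envelope comparison, using per-server schema costs in line with the figures cited earlier (the database figure is assumed) and assuming a typical task routes to two servers:

```python
# Illustrative per-server schema costs, in tokens.
# The github/browser/search figures echo numbers cited above;
# the database figure is an assumption.
EAGER_COSTS = {"github": 18_000, "browser": 13_600, "search": 14_100, "database": 12_000}

# Eager: every server's schemas load, every session.
eager_per_session = sum(EAGER_COSTS.values())

# Routed: assume a typical task touches two servers.
routed_per_session = EAGER_COSTS["github"] + EAGER_COSTS["search"]

savings_per_session = eager_per_session - routed_per_session
print(savings_per_session)  # tens of thousands of tokens, per session
```

And that's with only four servers; the gap widens with every server you connect, because the eager cost grows linearly while the routed cost stays pinned to what the task actually needs.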
Holding all of that tool schema in context is expensive. But more importantly, every token of tool schema that sits in context is a token that could be spent on reasoning, conversation history, or user-specific memory. We wrote about why [persistent agent memory](https://focused.io/lab/persistent-agent-memory-in-langgraph) is critical for production agents. Memory is useless if there isn't room for it.
Shadow MCP Is the Enterprise Problem Nobody Expected
Cloudflare's reference architecture introduces another concept worth paying attention to: Shadow MCP detection. They scan for unauthorized MCP server connections across the organization, monitoring hostnames, URI paths, and even DLP-based body inspection for JSON-RPC method calls like tools/call and initialize.
MCP has its own shadow IT problem. Developers will sometimes set up their own MCP server, integrate that into their existing agents, and security will never even be aware. This code can execute locally on developer machines, reach out to internal APIs, and bypass security controls. No audit trail, no credential governance, no DLP.
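A first-pass detector in that spirit only needs to inspect request bodies for JSON-RPC calls to MCP methods. This is a simplified sketch of the idea, not Cloudflare's DLP rules; the function name and the exact method list are my own choices.

```python
import json

# JSON-RPC methods that strongly suggest MCP traffic.
MCP_METHODS = {"initialize", "tools/list", "tools/call"}

def looks_like_mcp_traffic(body: bytes) -> bool:
    """Flag a request body that appears to be an MCP JSON-RPC call."""
    try:
        payload = json.loads(body)
    except (ValueError, UnicodeDecodeError):
        return False  # not JSON at all, so not JSON-RPC
    return (
        isinstance(payload, dict)
        and payload.get("jsonrpc") == "2.0"
        and payload.get("method") in MCP_METHODS
    )
```

Wire something like this into an egress proxy and unauthorized MCP servers stop being invisible: every `tools/call` leaving the network becomes a log line someone in security can see.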
Cloudflare's answer is a monorepo governance model: centralized MCP team, AI governance approval, templates that inherit default-deny write controls and audit logging out of the box. New governed MCP servers deploy in minutes because the governance is baked into the platform, not bolted on after the fact.
I see this pattern constantly with clients. The MCP gold rush has teams spinning up servers faster than security can evaluate them. We wrote about why agent-operable interfaces are the product. The same principle applies to the tools agents use. If an employee can't access a system without approval, the agent shouldn't be able to either.
The Fix Is Architecture, Not Bigger Windows
"Context windows keep getting bigger." They do. And the waste doesn't get smaller.
A million-token window doesn't help if 67,000 tokens of tool schemas the agent will never use still get loaded into it. The underlying issue is architectural: eager-loading is the wrong pattern for tool discovery in production agents.
Lazy-load tools based on task intent. Gate discovery behind a search mechanism. Keep tool definitions out of the context until the agent actually needs them.
Honeycomb published a set of principles for the AI era that apply here: cost is a system attribute, not an afterthought, and pre-production testing can't anticipate the load that real systems generate in a real environment. Tool context overhead is exactly the kind of emergent cost that only shows up in production, when real agents connect to real MCP servers and the token bills start making people uncomfortable.
The protocol isn't the problem. The eager-loading default is the problem. Own the architecture decision. Lazy-load.
