LangGraph Error Handling Patterns for Production AI Agents
Build reliable LangGraph agents with production error handling. Implement retry logic, fallback strategies, and graceful degradation.
Apr 21, 2026

You have a document processing pipeline. It ingests contracts, extracts key clauses, validates them against policy, and generates a summary. Monday morning it processes 200 documents without a hiccup. Tuesday at 2 AM, Anthropic’s API returns a 429, the extraction node throws, and the entire pipeline stops. Not just the one document — the whole batch. Your on-call engineer spends 45 minutes figuring out it was a transient rate limit that would have resolved itself with a 2-second backoff.
The fix isn’t “add a try/except.” The fix is classifying errors by who can fix them and routing each class to the right handler. LangGraph gives you the primitives — RetryPolicy, Command, interrupt(), and ToolNode error handling — but the framework won’t decide your error strategy for you. That’s on you.
This post shows the four error classes, the LangGraph primitives for each, and the production failures that surface when you get the classification wrong.
The Production Error Handling Classification Matrix
Not all errors are equal. The single most important decision in your error-handling strategy is: who fixes this?
Getting this wrong costs you. Retrying a user-fixable error wastes 3 attempts and 6 seconds before failing anyway. Interrupting for a transient error pages a human to click “retry” on something that would have fixed itself. Swallowing an unexpected error hides a real bug behind a generic fallback.
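One way to make the decision concrete is a classifier that answers "who fixes this?" before any handling happens. This is a sketch: UserFixableError is an illustrative exception type, not a LangGraph class.

```python
class UserFixableError(Exception):
    """Only the user can repair the input (missing clause, empty document)."""

def classify(error: Exception) -> str:
    """Map an error to its handler BEFORE deciding to retry.

    LLM-recoverable tool errors never reach this classifier; ToolNode
    catches those and feeds them back to the model (Pattern 2).
    """
    if isinstance(error, UserFixableError):
        return "interrupt"   # pause for human input; retrying won't help
    if isinstance(error, (ConnectionError, TimeoutError)):
        return "retry"       # transient; backoff will fix it
    return "crash"           # unexpected; bubble up loudly and observably
```

The four patterns below are this mapping, implemented with LangGraph primitives.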
The Architecture
We’re building a document processing pipeline that extracts clauses from contracts, validates them, and generates summaries. Each node has a different error profile:
[Ingest] → [Extract Clauses] → [Validate] → [Summarize] → END
                ↑       |           |
                |       ↓           ↓
                └── (tool error:  (missing info:
                     retry with    interrupt for human)
                     context)
The extraction node calls tools and hits APIs — it gets transient errors and tool failures. The validation node needs complete data — it surfaces user-fixable gaps. The summarizer is the least error-prone but still needs retry protection.
State: Track Errors Explicitly
The key insight: errors are data, not just exceptions. Store them in state so the LLM can see what went wrong and adjust its approach.
import operator
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import AnyMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)

class PipelineState(TypedDict):
    document: str
    messages: Annotated[list[AnyMessage], operator.add]
    extracted_clauses: list[dict]
    validation_errors: list[str]
    retry_count: int
    final_summary: str
Pattern 1: Transient Errors with RetryPolicy
API rate limits, network blips, DNS hiccups. These fix themselves. Don’t write code for them — configure them.
from langgraph.types import RetryPolicy

# Aggressive retry for flaky external APIs
api_retry = RetryPolicy(
    max_attempts=5,
    initial_interval=1.0,
    backoff_factor=2.0,
    max_interval=10.0,
    jitter=True,
)

# Conservative retry for LLM calls (they're expensive)
llm_retry = RetryPolicy(
    max_attempts=3,
    initial_interval=0.5,
    backoff_factor=2.0,
    max_interval=5.0,
    jitter=True,
)
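Ignoring jitter, the wait before the next attempt is min(initial_interval * backoff_factor ** n, max_interval). A quick sanity check of the two schedules above (the helper is mine, not a LangGraph API, so treat the exact numbers as approximate):

```python
def backoff_schedule(max_attempts, initial_interval, backoff_factor, max_interval):
    # Waits between attempts, jitter excluded. max_attempts attempts
    # means at most max_attempts - 1 sleeps.
    return [
        min(initial_interval * backoff_factor**k, max_interval)
        for k in range(max_attempts - 1)
    ]

print(backoff_schedule(5, 1.0, 2.0, 10.0))  # api_retry: [1.0, 2.0, 4.0, 8.0]
print(backoff_schedule(3, 0.5, 2.0, 5.0))   # llm_retry: [0.5, 1.0]
```

Worst case, api_retry adds about 15 seconds of waiting before giving up; llm_retry adds 1.5 seconds plus two extra LLM calls.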
The retry_on parameter is where most people get it wrong. The default retries on common network and transient exceptions. If you need to decide retry eligibility yourself, say on specific HTTP status codes, pass a predicate instead:
from httpx import HTTPStatusError

def should_retry(error: Exception) -> bool:
    if isinstance(error, HTTPStatusError):
        return error.response.status_code in (429, 502, 503)
    return False

selective_retry = RetryPolicy(
    max_attempts=5,
    initial_interval=1.0,
    backoff_factor=2.0,
    retry_on=should_retry,
)
The superstep transaction rule: LangGraph executes parallel branches in supersteps. If any branch in a superstep raises an exception, none of the state updates from that superstep apply. Successful branches are checkpointed and won’t re-execute on retry, but the state snapshot rolls back to before the superstep started. This means a flaky API in one branch can block state updates from an unrelated branch that succeeded. RetryPolicy per node keeps one bad branch from poisoning the whole superstep.
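To make the rollback concrete, here is a toy model of the all-or-nothing rule. This is not LangGraph internals, just the semantics: branch updates are buffered and applied only if every branch succeeds.

```python
def run_superstep(state: dict, branches: list) -> dict:
    # Buffer each branch's update; merge them only if none raised.
    # One failing branch discards the whole superstep's writes,
    # even from branches that succeeded.
    updates = []
    for branch in branches:
        updates.append(branch(state))  # a raise here aborts the superstep
    merged = dict(state)
    for update in updates:
        merged.update(update)
    return merged

state = {"metadata": None, "clauses": None}

def extract_metadata(s):
    return {"metadata": {"pages": 12}}

def extract_clauses(s):
    raise TimeoutError("rate limited")  # transient failure in one branch

try:
    state = run_superstep(state, [extract_metadata, extract_clauses])
except TimeoutError:
    pass

print(state)  # still {'metadata': None, 'clauses': None}; the metadata write was lost too
```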
Pattern 2: LLM-Recoverable Errors with ToolNode
Tool calls fail. The LLM picks the wrong tool, passes bad arguments, or the tool returns something unparseable. The fix isn’t retrying the exact same call — it’s letting the LLM see what went wrong and try a different approach.
ToolNode from langgraph.prebuilt has a handle_tool_errors parameter that catches tool exceptions and returns the error message as a ToolMessage. The LLM sees the error and can adjust:
from langchain_core.tools import tool
from langgraph.prebuilt import ToolNode

@tool
def extract_clause(text: str, clause_type: str) -> dict:
    """Extract a specific clause from contract text.

    Args:
        text: The contract text to search.
        clause_type: One of 'termination', 'liability', 'indemnification', 'payment'.
    """
    valid_types = {"termination", "liability", "indemnification", "payment"}
    if clause_type not in valid_types:
        raise ValueError(
            f"Invalid clause_type '{clause_type}'. Must be one of: {valid_types}"
        )
    return {
        "clause_type": clause_type,
        "text": f"Extracted {clause_type} clause from document.",
        "confidence": 0.92,
    }

@tool
def check_compliance(clause: str, regulation: str) -> dict:
    """Check if a clause complies with a specific regulation.

    Args:
        clause: The clause text to check.
        regulation: The regulation identifier (e.g., 'GDPR-Art17', 'SOX-302').
    """
    if not clause.strip():
        raise ValueError("Empty clause text provided. Extract the clause first.")
    return {"compliant": True, "regulation": regulation, "notes": "No issues found."}

tools = [extract_clause, check_compliance]

# handle_tool_errors=True: catch exceptions, return error as ToolMessage
tool_node = ToolNode(tools, handle_tool_errors=True)

You can also pass a custom error handler for more control:

def format_tool_error(error: Exception) -> str:
    return (
        f"Tool failed with: {error}\n"
        "Review your arguments and try again. "
        "Check the tool's docstring for valid parameter values."
    )

tool_node_custom = ToolNode(tools, handle_tool_errors=format_tool_error)
The agent node calls the LLM, which may invoke tools. When a tool fails, handle_tool_errors=True catches the exception and sends the error back to the LLM as a ToolMessage. The LLM sees the error and tries again — usually with corrected arguments:
from langchain_core.messages import HumanMessage, SystemMessage

@traceable(name="agent", run_type="chain")
def agent_node(state: PipelineState) -> dict:
    system = SystemMessage(
        content="You are a contract analysis agent. Use the provided tools to "
        "extract and validate clauses. If a tool returns an error, read "
        "the error message carefully and adjust your arguments. "
        "Available clause types: termination, liability, indemnification, payment."
    )
    messages = [system] + state["messages"]
    response = llm.bind_tools(tools).invoke(messages)
    return {"messages": [response]}
Pattern 3: User-Fixable Errors with interrupt()
Some errors can’t be fixed by the system or the LLM. The document is missing a signature date. The clause references an undefined term. The input is ambiguous. These need a human.
interrupt() pauses graph execution, saves state to the checkpointer, and returns a payload to the caller. When the human provides input, you resume with Command(resume=...):
from langgraph.types import interrupt, Command

@traceable(name="validate_document", run_type="chain")
def validate_node(state: PipelineState) -> dict:
    clauses = state.get("extracted_clauses", [])
    errors = []

    if not clauses:
        errors.append("No clauses extracted from document.")

    required_types = {"termination", "payment"}
    found_types = {c["clause_type"] for c in clauses if isinstance(c, dict)}
    missing = required_types - found_types
    if missing:
        errors.append(f"Missing required clause types: {missing}")

    low_confidence = [
        c for c in clauses
        if isinstance(c, dict) and c.get("confidence", 1.0) < 0.7
    ]
    if low_confidence:
        types = [c["clause_type"] for c in low_confidence]
        errors.append(f"Low confidence extractions for: {types}")

    if errors:
        human_input = interrupt({
            "type": "validation_errors",
            "errors": errors,
            "message": "Document validation failed. Please review and provide corrections.",
            "document_preview": state["document"][:500],
        })
        return {
            "extracted_clauses": human_input.get("corrected_clauses", clauses),
            "validation_errors": [],
        }

    return {"validation_errors": []}
Resuming after an interrupt:
from langgraph.checkpoint.memory import InMemorySaver

checkpointer = InMemorySaver()

# First invocation — runs until interrupt
config = {"configurable": {"thread_id": "contract-review-42"}}
result = graph.invoke(
    {"document": "...", "messages": [], "extracted_clauses": [], "validation_errors": [], "retry_count": 0, "final_summary": ""},
    config,
)

# Check for interrupt
if "__interrupt__" in result:
    print("Human input needed:", result["__interrupt__"])

# Resume with corrections
corrected = Command(resume={
    "corrected_clauses": [
        {"clause_type": "termination", "text": "Either party may terminate...", "confidence": 0.95},
        {"clause_type": "payment", "text": "Payment due within 30 days...", "confidence": 0.98},
    ]
})
final_result = graph.invoke(corrected, config)
Critical detail: interrupt() requires a checkpointer. Without one, the state is lost and you can’t resume. Use InMemorySaver for development and a durable checkpointer (Postgres, SQLite) for production. Forgetting the checkpointer is a silent failure — the graph runs fine until you actually need to resume, and then it has no idea where it left off.
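As a production wiring sketch (assuming the langgraph-checkpoint-postgres package is installed; the connection string is a placeholder, not a real DSN):

```python
from langgraph.checkpoint.postgres import PostgresSaver

# Placeholder DSN: point this at your real database
DB_URI = "postgresql://user:pass@db-host:5432/checkpoints"

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # create checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)
    # Interrupts can now survive process restarts and load balancing:
    # any instance with the same thread_id can resume the paused run.
```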
Pattern 4: Building Fault-Tolerant Agents — Let Unexpected Errors Bubble
TypeError, KeyError, schema mismatches, logic bugs. Don’t catch these. Don’t retry them. Don’t interrupt for them. Let them crash. A retry just wastes time on an error that will never self-resolve. A human interrupt pages someone to look at a bug that should be in your issue tracker.
The only thing to do with unexpected errors is make them observable. Wrap your graph invocation and log context:
from langsmith import tracing_context

@traceable(name="process_document", run_type="chain")
def process_document(document: str, thread_id: str) -> dict:
    config = {"configurable": {"thread_id": thread_id}}
    with tracing_context(
        metadata={"document_length": len(document), "thread_id": thread_id},
        tags=["production", "document-pipeline"],
    ):
        return graph.invoke(
            {
                "document": document,
                "messages": [HumanMessage(content=f"Process this contract:\n\n{document}")],
                "extracted_clauses": [],
                "validation_errors": [],
                "retry_count": 0,
                "final_summary": "",
            },
            config,
        )
When it crashes, you get a full trace in LangSmith with the document content, the exact node that failed, and every intermediate state. That’s a 5-minute investigation, not a 2-hour log-grepping session.
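Even without LangSmith, the same principle applies with plain logging: attach context, then re-raise so the crash stays loud. A minimal sketch (the helper name is mine):

```python
import logging

logger = logging.getLogger("document-pipeline")

def invoke_observably(invoke, inputs: dict, **context):
    # Catch broadly ONLY to attach context, never to swallow: the bare
    # `raise` re-throws the original exception with its traceback intact.
    try:
        return invoke(inputs)
    except Exception:
        logger.exception("pipeline crashed", extra={"context": context})
        raise
```

The bare raise is the whole point. Returning a fallback dict here would be Production Failure #2 below.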
The Summarizer
The final node synthesizes everything. It’s simple but gets RetryPolicy protection because it calls the LLM:
@traceable(name="summarize", run_type="chain")
def summarize_node(state: PipelineState) -> dict:
    clauses_text = "\n".join(
        f"- {c['clause_type']}: {c['text']}" for c in state["extracted_clauses"]
    )
    response = llm.invoke([
        SystemMessage(
            content="Summarize the following contract clauses into a concise executive summary. "
            "Flag any compliance concerns."
        ),
        HumanMessage(content=f"Contract clauses:\n{clauses_text}"),
    ])
    return {"final_summary": response.content}
Graph Assembly
Here’s where the error classification meets the graph structure. Each node gets the retry strategy that matches its error profile:
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import InMemorySaver

def should_continue(state: PipelineState) -> str:
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "validate"

def post_tool(state: PipelineState) -> dict:
    tool_results = []
    for msg in state["messages"]:
        if hasattr(msg, "content") and isinstance(msg.content, str):
            if "Extracted" in msg.content:
                tool_results.append({
                    "clause_type": "extracted",
                    "text": msg.content,
                    "confidence": 0.92,
                })
    if tool_results:
        return {"extracted_clauses": tool_results}
    return {}

builder = StateGraph(PipelineState)

# Agent node: LLM retry (expensive, conservative)
builder.add_node("agent", agent_node, retry=llm_retry)
# Tool node: API retry (cheap, aggressive) + error handling for tool failures
builder.add_node("tools", tool_node, retry=api_retry)
# Post-tool processing
builder.add_node("post_tool", post_tool)
# Validation: no retry (errors here are user-fixable, not transient)
builder.add_node("validate", validate_node)
# Summarizer: LLM retry
builder.add_node("summarize", summarize_node, retry=llm_retry)

builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", should_continue, {"tools": "tools", "validate": "validate"})
builder.add_edge("tools", "post_tool")
builder.add_edge("post_tool", "agent")
builder.add_edge("validate", "summarize")
builder.add_edge("summarize", END)

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)
Notice: the validation node has no retry policy. Retrying a missing-clause error 3 times won’t make the clause appear. That’s a user-fixable problem that needs interrupt().
Production Failures
These are the error-handling mistakes that make it past code review and into production.
1. Retrying User-Fixable Errors. The pipeline retries an extraction 3 times, burning 8 seconds and 3 LLM calls, before finally failing with the same “missing payment clause” error. The document genuinely doesn’t have a payment clause. No amount of retrying will create one. Fix: classify the error before choosing the handler. If the document is missing required content, interrupt() immediately.
2. Swallowing Unexpected Errors. A developer wraps the entire graph invocation in try/except Exception: return {"error": "Something went wrong."}. Now every TypeError, every KeyError, every schema mismatch disappears into a generic error message. The LangSmith trace shows the node completed “successfully” — because from the graph’s perspective, it did. It returned a value. The bug lives in production for weeks until someone notices the output quality degraded. Fix: only catch the specific exception types you know how to handle. Let everything else crash loudly.
3. Superstep Transaction Surprise. You have two parallel branches: clause extraction and metadata extraction. The metadata branch succeeds, but the clause branch hits a rate limit and throws. You expect the metadata to be saved — it succeeded, after all. But superstep transactions mean neither update applies. The entire superstep rolls back. Your metadata extraction re-runs on retry (if you have RetryPolicy) or is lost entirely (if you don’t). Fix: put RetryPolicy on every node that can fail transiently. LangGraph checkpoints successful nodes within a superstep so they don’t re-execute, but the state update is still atomic.
4. Interrupt Without Checkpointer. You add interrupt() to your validation node and test it locally. Works great. Deploy to production without a persistent checkpointer (or with InMemorySaver behind a load balancer). The interrupt pauses the graph, the user provides corrections, and... the graph starts from scratch because the in-memory state was on a different server instance. Fix: use a durable checkpointer (PostgresSaver, SqliteSaver) in production. InMemorySaver is for tests only.
5. Error Recovery Loop Explosion. The LLM fails to call a tool correctly, the error goes back to the LLM, the LLM tries again with slightly different wrong arguments, the error goes back again. After 15 loops and $2 in API costs, you hit the recursion limit. Fix: add a retry_count to state. After 3 LLM-recovery attempts, escalate to interrupt() or fail with a clear error message.
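That escalation guard is a few lines: a routing function that reads the retry_count field from PipelineState and diverts to human review once the budget is spent (node names here are illustrative):

```python
MAX_LLM_RECOVERIES = 3

def route_after_tool_error(state: dict) -> str:
    # The agent node increments retry_count on each failed recovery
    # attempt; once it hits the budget, stop looping and escalate.
    if state.get("retry_count", 0) >= MAX_LLM_RECOVERIES:
        return "human_review"  # a node that calls interrupt()
    return "agent"             # give the LLM another look at the error
```

Wire it in with add_conditional_edges after the tool node, the same way should_continue routes the agent.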
Observability
Error handling without observability is guesswork. Here’s how to make every error path visible in LangSmith:
from langsmith import tracing_context

with tracing_context(
    metadata={
        "pipeline_version": "v3",
        "error_strategy": "classified",
        "document_type": "contract",
    },
    tags=["production", "error-handling-v3"],
):
    result = process_document(
        document="Sample contract text...",
        thread_id="contract-42",
    )
The @traceable decorator on every node means each error path shows up as a named, filterable run in LangSmith. Filter by the error-handling-v3 tag to compare error rates across pipeline versions. If v3 has fewer interrupts but more retries, your error classification improved — transient errors are being handled automatically instead of paging humans.
Evals
Test error recovery paths the same way you test happy paths. Three evaluators: one for successful processing, one for error classification accuracy, and one LLM-as-judge for output quality under failure conditions.
from langsmith import Client, evaluate
from openevals.llm import create_llm_as_judge

ls_client = Client()

dataset = ls_client.create_dataset(
    dataset_name="error-handling-evals",
    description="Document processing pipeline error handling evaluation",
)
ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"document": "This contract between Party A and Party B includes: Termination: Either party may terminate with 30 days notice. Payment: Net 30 terms apply.", "thread_id": "eval-1"},
        {"document": "This agreement covers liability limitations and indemnification clauses only.", "thread_id": "eval-2"},
        {"document": "", "thread_id": "eval-3"},
    ],
    outputs=[
        {"should_succeed": True, "required_clauses": ["termination", "payment"]},
        {"should_succeed": False, "missing_clauses": ["termination", "payment"]},
        {"should_succeed": False, "error_type": "empty_document"},
    ],
)

QUALITY_PROMPT = """\
Document: {inputs[document]}
Pipeline output: {outputs[final_summary]}

Rate 0.0-1.0 on:
- Completeness: Did the summary cover all extracted clauses?
- Accuracy: Are the clause descriptions faithful to the source?
- Error handling: If the document was incomplete, did the pipeline flag it appropriately?

Return ONLY: {{"score": <float>, "reasoning": "<explanation>"}}"""
quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)
def error_classification(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Did the pipeline correctly classify errors?"""
    should_succeed = reference_outputs.get("should_succeed", True)
    has_summary = bool(outputs.get("final_summary"))
    has_errors = bool(outputs.get("validation_errors"))
    if should_succeed:
        score = 1.0 if has_summary and not has_errors else 0.0
    else:
        score = 1.0 if has_errors or not has_summary else 0.0
    return {"key": "error_classification", "score": score}

def recovery_efficiency(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """How many retries did it take? Lower is better."""
    retry_count = outputs.get("retry_count", 0)
    if retry_count == 0:
        score = 1.0
    elif retry_count <= 2:
        score = 0.7
    else:
        score = 0.3
    return {"key": "recovery_efficiency", "score": score}
def target(inputs: dict) -> dict:
    config = {"configurable": {"thread_id": inputs["thread_id"]}}
    try:
        return graph.invoke(
            {
                "document": inputs["document"],
                "messages": [HumanMessage(content=f"Process this contract:\n\n{inputs['document']}")],
                "extracted_clauses": [],
                "validation_errors": [],
                "retry_count": 0,
                "final_summary": "",
            },
            config,
        )
    except Exception as e:
        return {"final_summary": "", "validation_errors": [str(e)], "retry_count": 0}
results = evaluate(
    target,
    data="error-handling-evals",
    evaluators=[quality_judge, error_classification, recovery_efficiency],
    experiment_prefix="error-handling-v1",
    max_concurrency=2,
)
recovery_efficiency catches the error-loop explosion problem. If your average retry count creeps above 2, your error classification is wrong — you’re retrying things that should interrupt or bubble up.
When to Use This
Use classified error handling when:
- Your pipeline calls external APIs that can return transient errors
- Documents have variable quality and may be missing required fields
- You need human-in-the-loop for ambiguous or incomplete inputs
- You're running batch processing where one failure shouldn't kill the batch
Skip it when:
- Your pipeline is a single LLM call with no tools
- Every error is the same type (all transient, all user-fixable)
- You're prototyping and don't need production resilience yet
The Bottom Line
The error classification matrix is the whole strategy: transient errors get RetryPolicy, LLM-recoverable errors get stored in state and looped back, user-fixable errors get interrupt(), and unexpected errors crash loudly. Four patterns, four primitives, zero catch-all try/excepts.
The mistake everyone makes is treating errors as a single category. You either retry everything (wasting time and money) or catch everything (hiding bugs). The classification forces you to ask “who fixes this?” for every failure mode, and that question is worth more than any amount of retry logic.
Put RetryPolicy on every node that touches a network. Put handle_tool_errors=True on your ToolNode. Put interrupt() on validation failures. Let everything else crash. Ship the recovery_efficiency eval before you ship the pipeline.
Technical References:
- LangGraph Agent Error Handling in Production GitHub Repo
- LangGraph Retry Policy (Handling Retries)
