
Retry Logic

Intermediate

This post covers retry logic in LangGraph: how and why retries are used to handle failures in workflows. It walks through automatic and conditional retries, retry loops, exponential backoff, and retry limits, along with state-based decisions, tool and LLM retry patterns, and safety mechanisms. It also highlights common mistakes and best practices for building robust, failure-resilient LangGraph systems.

What Is Retry Logic?

Retry Logic is the mechanism to automatically or conditionally re-execute a node or subgraph when it fails, returns low-quality output, or encounters transient errors. In LangGraph, retries are essential because LLM calls, tool executions, and external API calls are inherently unreliable (rate limits, network issues, hallucinations, etc.). Retry logic helps make your agents more robust, resilient, and production-ready.

Why Retries Matter

LLMs can produce inconsistent or low-quality outputs
Tools and APIs can fail temporarily
Network issues are common
Complex agents benefit from self-correction
Improves overall success rate and user experience

Without proper retry logic, even well-designed agents can fail frequently.
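To put a rough number on the success-rate point: if each attempt fails independently with probability p, all n attempts fail with probability p^n. The sketch below assumes independent failures, which real outages often violate (a service that is down tends to stay down for a while), so read the result as an optimistic estimate. The function name is illustrative.

```python
def success_probability(failure_rate: float, max_attempts: int) -> float:
    """Chance that at least one of max_attempts independent tries succeeds."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # All attempts fail with probability failure_rate ** max_attempts.
    return 1.0 - failure_rate ** max_attempts


if __name__ == "__main__":
    for attempts in (1, 2, 3):
        print(attempts, round(success_probability(0.2, attempts), 3))
```

With a 20% per-attempt failure rate, three attempts raise the chance of success from 0.8 to 0.992 under this assumption.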

Automatic Retries

LangGraph supports automatic retries at the node level: pass a RetryPolicy through the retry parameter of add_node (named retry_policy in newer releases).

from langgraph.graph import StateGraph, START
from langgraph.types import RetryPolicy

# AgentState and agent_node are defined elsewhere in your project
graph = StateGraph(AgentState)

# Retry a specific node up to 3 times with exponential backoff
# (RetryPolicy defaults: initial_interval=0.5, backoff_factor=2.0)
graph.add_node(
    "agent",
    agent_node,
    # Newer LangGraph releases name this parameter retry_policy
    retry=RetryPolicy(
        max_attempts=3,
        retry_on=(ConnectionError, TimeoutError),
    ),
)
graph.add_edge(START, "agent")

# compile() does not accept a retry policy; attach one to each node that needs it
app = graph.compile()
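A list of exception types is a blunt filter. To my understanding, RetryPolicy's retry_on also accepts a callable that receives the exception and returns a bool, which allows finer decisions. The predicate below is a sketch of that idea; the status_code attribute and the set of retryable codes are assumptions, since HTTP client libraries name and expose these differently.

```python
TRANSIENT_STATUS_CODES = {429, 500, 502, 503, 504}


def is_transient(exc: Exception) -> bool:
    """Return True only for failures that are likely to succeed on a retry."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    # Many HTTP client exceptions expose a status code; the attribute name
    # status_code is an assumption and varies between libraries.
    status = getattr(exc, "status_code", None)
    return status in TRANSIENT_STATUS_CODES


# Hypothetical usage, not executed here:
# RetryPolicy(max_attempts=3, retry_on=is_transient)
```

A ValueError raised for bad input returns False here, so the node fails immediately instead of repeating a call that cannot succeed.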

Conditional Retries

Conditional retries are more flexible: a routing function inspects the state and decides whether to retry.

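from langgraph.graph import END

def route_with_retry(state: AgentState):
    last_message = state["messages"][-1]
    attempts = state.get("retry_attempts", 0)

    if attempts >= 3:
        return "fallback"        # Give up and hand off to the fallback node

    # confidence_low() is a user-defined check on the model output
    if "error" in last_message.content or confidence_low(last_message):
        return "agent"           # Retry the agent node
    return END                   # The END constant, not the string "END"


graph.add_conditional_edges("validator", route_with_retry)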

Don't forget to increment the counter in the node being retried, or the limit is never reached:

def agent_node(state: AgentState):
    response = llm.invoke(state["messages"])
    return {
        "messages": [response],
        "retry_attempts": state.get("retry_attempts", 0) + 1
    }

Retry Loops in LangGraph

The most common pattern uses a cycle in the graph: the validator routes back to the agent until the output passes or the attempt limit is reached.

from langgraph.graph import START, END

def retry_router(state: AgentState):
    attempts = state.get("attempts", 0)

    if attempts >= 5:
        return "fallback"        # Must match the name passed to add_node
    if is_failure(state):
        return "agent"           # Retry
    return END


graph.add_node("agent", agent_node)
graph.add_node("validator", validator_node)
graph.add_node("fallback", fallback_node)

graph.add_edge(START, "agent")
graph.add_edge("agent", "validator")
graph.add_conditional_edges("validator", retry_router)
graph.add_edge("fallback", END)
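To see how such a cycle behaves without calling a model, the loop can be traced with plain functions and a dict for state. This is a simulation, not LangGraph itself: the END string, the run driver, and the flaky agent that only succeeds on its third attempt are all stand-ins invented for illustration.

```python
END = "__end__"  # stand-in for langgraph.graph.END
MAX_ATTEMPTS = 5


def agent_node(state: dict) -> dict:
    attempts = state.get("attempts", 0) + 1
    # Hypothetical flaky agent: only the third attempt produces a usable answer.
    answer = "42" if attempts >= 3 else ""
    return {"answer": answer, "attempts": attempts}


def validator_node(state: dict) -> dict:
    return {"valid": bool(state.get("answer"))}


def fallback_node(state: dict) -> dict:
    return {"answer": "fallback answer"}


def retry_router(state: dict) -> str:
    if state.get("valid"):
        return END
    if state.get("attempts", 0) >= MAX_ATTEMPTS:
        return "fallback"
    return "agent"


def run(state: dict) -> tuple[dict, list[str]]:
    """Walk agent -> validator -> router until the router stops the cycle."""
    nodes = {"agent": agent_node, "validator": validator_node, "fallback": fallback_node}
    trace, current = [], "agent"
    while current != END:
        trace.append(current)
        state = {**state, **nodes[current](state)}  # merge the partial update
        if current == "agent":
            current = "validator"
        elif current == "validator":
            current = retry_router(state)
        else:  # fallback always ends the run
            current = END
    return state, trace


if __name__ == "__main__":
    final_state, trace = run({})
    print(trace, final_state["attempts"])
```

This router checks for success before checking the limit, so an answer produced on the final allowed attempt is kept rather than sent to the fallback. Which order is right depends on how the counter is incremented in your graph.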

Exponential Backoff

Add delays between retries to avoid overwhelming services.

import time

def agent_node_with_backoff(state: AgentState):
    attempts = state.get("attempts", 0)
    
    if attempts > 0:
        delay = 0.5 * 2 ** (attempts - 1)   # Exponential backoff: 0.5s, 1s, 2s, 4s...
        print(f"Retrying in {delay} seconds...")
        time.sleep(delay)
    
    response = llm.invoke(state["messages"])
    return {
        "messages": [response],
        "attempts": attempts + 1
    }
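Two common refinements to the schedule above are an upper bound on the delay and random jitter, so that many clients recovering from the same outage do not retry in lockstep. The helper below is a minimal sketch; its name, the 30-second cap, and the choice of full jitter (a uniform draw between zero and the computed delay) are assumptions, not LangGraph behavior.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (1-based), capped, with optional jitter."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    ceiling = min(cap, base * 2 ** (attempt - 1))
    if rng is None:
        return ceiling  # deterministic schedule: 0.5, 1, 2, 4, ... up to cap
    return rng.uniform(0.0, ceiling)  # full jitter: anywhere in [0, ceiling]


if __name__ == "__main__":
    print([backoff_delay(n) for n in range(1, 5)])
    print(backoff_delay(3, rng=random.Random()))
```

Passing the random generator in makes the function easy to test with a seeded instance and keeps the no-jitter schedule available for debugging.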

Retry Limits

Always define hard limits to prevent infinite loops.

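from langgraph.graph import END

MAX_RETRIES = 5

def safe_retry_router(state: AgentState):
    if state.get("attempts", 0) >= MAX_RETRIES:
        print("Max retries reached. Moving to fallback.")
        return "fallback"

    if should_retry(state):      # should_retry() is a user-defined check
        return "agent"
    return END                   # The END constant, not the string "END"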
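LangGraph also enforces a per-run recursion_limit (25 by default) and raises GraphRecursionError when a run exceeds it, which acts as a backstop if a router bug lets the loop continue. The arithmetic below is one way to size that limit so it sits just above the retry loop's own limit. It assumes each pass through the loop costs two steps (agent, then validator); steps_per_cycle and extra_steps are illustrative parameters to adjust for your graph.

```python
MAX_RETRIES = 5


def recursion_limit_for(max_retries: int, steps_per_cycle: int = 2,
                        extra_steps: int = 3) -> int:
    """Step budget that lets the retry loop reach its own limit first."""
    # max_retries is the number of agent runs the router allows;
    # extra_steps leaves room for nodes outside the loop, such as the fallback.
    return max_retries * steps_per_cycle + extra_steps


config = {"recursion_limit": recursion_limit_for(MAX_RETRIES)}

# Hypothetical usage, not executed here:
# app.invoke(inputs, config=config)
```

A limit far above what the loop needs hides bugs, while a limit below it turns the recursion error into the effective retry cap and bypasses the fallback node.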

State-Based Retry Decisions

Use rich state to make intelligent retry decisions.

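from langgraph.graph import END, MessagesState

class AgentState(MessagesState):
    # MessagesState is a TypedDict, so fields cannot carry default values;
    # read them with state.get(key, default) instead.
    attempts: int
    last_error: str | None
    error_count: int
    confidence: float

def retry_decision(state: AgentState):
    if state.get("error_count", 0) >= 3:
        return "fallback"
    if state.get("confidence", 0.0) < 0.6:
        return "agent"           # Retry
    return END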
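A router can only read these fields if some node writes them. The sketch below shows the kind of partial update a validator node might return. The scoring function is a deliberately crude placeholder, and both function names are hypothetical; a real scorer would use schema validation, a grader model, or similar.

```python
def score_confidence(text: str) -> float:
    """Hypothetical scorer: replace with a real check of the model output."""
    if not text.strip():
        return 0.0
    return 0.3 if "i'm not sure" in text.lower() else 0.9


def validator_update(state: dict, output_text: str) -> dict:
    """Build the partial state update a validator node would return."""
    confidence = score_confidence(output_text)
    failed = confidence < 0.6
    return {
        "confidence": confidence,
        # Count only failed checks so the router can cap them.
        "error_count": state.get("error_count", 0) + (1 if failed else 0),
        "last_error": "low confidence" if failed else None,
    }


if __name__ == "__main__":
    print(validator_update({"error_count": 2}, "I'm not sure about that."))
```

Keeping the scoring in the validator and the branching in the router means each can be tested separately with plain dicts.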

Tool Retry Patterns

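def tool_retry_router(state: AgentState):
    last = state["messages"][-1]
    # Only AI messages carry tool_calls; tool and human messages do not
    tool_calls = getattr(last, "tool_calls", None)

    if tool_calls and tool_failed(state):   # tool_failed() is user-defined
        return "tools"           # Retry tool execution
    elif tool_calls:
        return "agent"           # Ask agent for new plan
    return END                   # END imported from langgraph.graph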
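Routing at the graph level can be paired with retries inside the tool itself, so that the graph only sees a failure once cheap local retries are exhausted. The wrapper below is a sketch under that assumption; its name and parameters are hypothetical, and the injectable sleep function exists only to keep it testable.

```python
import time
from typing import Callable, Tuple, Type


def call_tool_with_retry(tool: Callable[[], str], max_attempts: int = 3,
                         retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
                         base_delay: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Call tool(); retry transient errors with backoff, re-raise everything else."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the graph route to a fallback
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Exceptions outside retry_on, such as a ValueError for malformed arguments, propagate on the first attempt, since repeating the same call would fail the same way.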

LLM Retry Patterns

from langchain_core.messages import AIMessage

def llm_with_retry(state: AgentState):
    attempts = state.get("llm_attempts", 0)
    
    try:
        response = llm.invoke(state["messages"])
        return {
            "messages": [response],
            "llm_attempts": 0   # Reset on success
        }
    except Exception as e:
        if attempts < 3:
            return {
                "messages": [AIMessage(content=f"LLM failed: {str(e)}")],
                "llm_attempts": attempts + 1
            }
        raise
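Exceptions are only one kind of LLM failure; a call can succeed and still return unusable output. In that case a retry helps most when the input changes, for example by feeding the validation error back to the model. The sketch below shows this with a plain callable standing in for the model. The function name, the list-of-strings message format, and the feedback wording are all assumptions.

```python
import json
from typing import Callable, List


def ask_for_json(llm: Callable[[List[str]], str], prompt: str,
                 max_attempts: int = 3) -> dict:
    """Ask llm for a JSON object; on a bad reply, retry with the problem appended."""
    messages = [prompt]
    for _ in range(max_attempts):
        reply = llm(messages)
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError as exc:
            # Changing the input is what makes the retry worthwhile.
            messages = messages + [
                reply,
                f"That was not valid JSON ({exc.msg}). Reply with JSON only.",
            ]
            continue
        if isinstance(parsed, dict):
            return parsed
        messages = messages + [reply, "Reply with a JSON object, not a bare value."]
    raise ValueError(f"no valid JSON object after {max_attempts} attempts")
```

In a graph, the same idea maps onto a validator node that appends the feedback message to state before routing back to the agent.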

Retry Safety Mechanisms

def safe_agent_node(state: AgentState):
    attempts = state.get("attempts", 0)
    
    if attempts >= 8:
        return {
            "messages": [AIMessage(content="Max retries reached. Unable to complete task.")],
            "status": "failed"   # Needs a matching status field declared on AgentState
        }
    
    try:
        return agent_node(state)
    except Exception as e:
        return {
            "messages": [AIMessage(content=f"Error: {str(e)}")],
            "attempts": attempts + 1,
            "last_error": str(e)
        }

Common Retry Mistakes

No retry limit → infinite loops
Retrying immediately or at a fixed interval, so many clients hit a recovering service at once (thundering herd)
Not distinguishing between transient and permanent errors
Retrying without updating state (same failure repeats)
Missing fallback strategy

Best Practices for Retry Logic

Always set a maximum retry limit
Use exponential backoff for external calls
Differentiate error types (retry on network errors, not on bad input)
Track retry attempts in state
Provide clear fallback paths
Log retry events for observability
Combine with reflection for self-correction
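For the observability point, one option is to emit a single log line per retry with stable key=value fields, which makes retries easy to count and filter. The sketch below uses the standard logging module; the logger name, the field names, and the helper itself are illustrative choices.

```python
import logging

logger = logging.getLogger("agent.retry")


def log_retry(node: str, attempt: int, max_attempts: int, error: Exception) -> None:
    """Emit one warning per retry with the node, attempt count, and error type."""
    logger.warning(
        "retrying node=%s attempt=%d/%d error_type=%s error=%s",
        node, attempt, max_attempts, type(error).__name__, error,
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    log_retry("agent", 2, 4, TimeoutError("upstream timed out"))
```

Calling this from the except branch of a node, just before returning the incremented counter, records every retry without changing control flow.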

Recommended Production Pattern:

from langchain_core.messages import AIMessage
from langgraph.graph import MessagesState

class AgentState(MessagesState):
    # TypedDict fields cannot have defaults; use state.get() when reading
    attempts: int
    last_error: str | None

def robust_agent_node(state: AgentState):
    attempts = state.get("attempts", 0)
    if attempts >= 4:
        return {"messages": [AIMessage(content="Sorry, I couldn't complete this task after multiple attempts.")]}

    try:
        response = llm_with_tools.invoke(state["messages"])
        return {
            "messages": [response],
            "attempts": 0          # Reset on success
        }
    except Exception as e:
        return {
            "messages": [AIMessage(content=f"Temporary error: {str(e)}")],
            "attempts": attempts + 1,
            "last_error": str(e)
        }
