Retry Logic
This post covers retry logic in LangGraph, explaining how and why retries are used to handle failures in workflows. It includes automatic and conditional retries, retry loops, exponential backoff, and limits, along with state-based decisions, tool and LLM retry patterns, and safety mechanisms. The post also highlights common mistakes and best practices for building robust, failure-resilient LangGraph systems.
What Is Retry Logic?
Retry Logic is the mechanism to automatically or conditionally re-execute a node or subgraph when it fails, returns low-quality output, or encounters transient errors.
In LangGraph, retries are essential because LLM calls, tool executions, and external API calls are inherently unreliable (rate limits, network issues, hallucinations, etc.).
Retry logic helps make your agents more robust, resilient, and production-ready.
Why Retries Matter
- LLMs can produce inconsistent or low-quality outputs
- Tools and APIs can fail temporarily
- Network issues are common
- Complex agents benefit from self-correction
- Retries improve the overall success rate and user experience
Automatic Retries
LangGraph supports automatic retries at the node level: pass a RetryPolicy through the retry parameter of add_node (newer releases name this parameter retry_policy).
from langgraph.graph import StateGraph
from langgraph.prebuilt import ToolNode
from langgraph.types import RetryPolicy

graph = StateGraph(AgentState)

# Allow up to 3 attempts in total; RetryPolicy backs off exponentially by default
graph.add_node(
    "agent",
    agent_node,
    retry=RetryPolicy(
        max_attempts=3,
        retry_on=[ConnectionError, TimeoutError],  # Retry only these exception types
    ),
)
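# compile() does not appear to accept a retry argument, so to cover several
# nodes, define one policy and pass it to each add_node call
shared_policy = RetryPolicy(max_attempts=4, initial_interval=1.0)
graph.add_node("tools", tool_node, retry=shared_policy)  # tool_node: your own node function
app = graph.compile()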
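Besides a list of exception types, the retry_on field of RetryPolicy is documented as also accepting a callable that receives the exception and returns a bool. The sketch below shows one such predicate. The names is_transient and TRANSIENT_STATUS are hypothetical, and the status_code attribute is an assumption about how an HTTP client might label its errors; none of them come from LangGraph.

```python
# Hypothetical predicate intended for RetryPolicy(retry_on=is_transient).
# TRANSIENT_STATUS and the status_code attribute are illustrative assumptions.
TRANSIENT_STATUS = {408, 429, 500, 502, 503, 504}


def is_transient(exc: Exception) -> bool:
    """Return True for failures that are worth retrying."""
    # Network-level failures are usually temporary.
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    # Some HTTP clients attach a status code to their exceptions (assumption).
    status = getattr(exc, "status_code", None)
    if status in TRANSIENT_STATUS:
        return True
    # Everything else (bad input, auth errors, bugs) should fail fast.
    return False
```

Passing retry_on=is_transient to the policy would then retry rate limits and network errors while letting validation errors surface immediately.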
Conditional Retries
Conditional retries are more powerful: a routing function decides whether to retry based on the current state.
from langgraph.graph import END

def route_with_retry(state: AgentState):
    last_message = state["messages"][-1]
    attempts = state.get("retry_attempts", 0)
    if attempts >= 3:
        return "fallback"  # Give up and hand off to the fallback node
    # confidence_low is a user-supplied quality check
    if "error" in last_message.content or confidence_low(last_message):
        return "agent"  # Retry the agent node
    return END  # The END constant, not the string "END"

graph.add_conditional_edges("validator", route_with_retry)
Don't forget to increment the counter:
def agent_node(state: AgentState):
    response = llm.invoke(state["messages"])  # llm: your chat model
    return {
        "messages": [response],
        "retry_attempts": state.get("retry_attempts", 0) + 1,
    }
Retry Loops in LangGraph
The most common pattern uses a cycle in the graph: the agent feeds a validator, and a router sends control back to the agent on failure.
from langgraph.graph import END, START

def retry_router(state: AgentState):
    attempts = state.get("attempts", 0)
    if attempts >= 5:
        return "fallback"  # Must match the name passed to add_node
    if is_failure(state):
        return "agent"  # Retry
    return END

graph.add_node("agent", agent_node)  # agent_node must increment "attempts"
graph.add_node("validator", validator_node)
graph.add_node("fallback", fallback_node)
graph.add_edge(START, "agent")
graph.add_edge("agent", "validator")
graph.add_conditional_edges("validator", retry_router)
graph.add_edge("fallback", END)
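To see why such a cycle terminates, the loop can be traced in plain Python without LangGraph. The sketch below is a simulation under stated assumptions: state is an ordinary dict, END is a stand-in string for langgraph.graph.END, and flaky_agent, retry_router, and run_cycle are hypothetical helpers. This router checks for success before the limit, so a success on the final attempt is not sent to the fallback.

```python
END = "__end__"  # Stand-in for langgraph.graph.END
MAX_ATTEMPTS = 5


def flaky_agent(state: dict, succeed_on: int) -> dict:
    """Hypothetical agent that only succeeds on attempt number succeed_on."""
    attempts = state.get("attempts", 0) + 1
    return {**state, "attempts": attempts, "ok": attempts >= succeed_on}


def retry_router(state: dict) -> str:
    """Success first, then the hard limit, otherwise loop back to the agent."""
    if state["ok"]:
        return END
    if state["attempts"] >= MAX_ATTEMPTS:
        return "fallback"
    return "agent"


def run_cycle(succeed_on: int) -> tuple:
    """Drive agent -> router until the router leaves the loop."""
    state: dict = {}
    while True:
        state = flaky_agent(state, succeed_on)
        next_node = retry_router(state)
        if next_node != "agent":
            return next_node, state["attempts"]
```

Calling run_cycle(3) leaves the loop through END after three attempts, while an agent that never succeeds is routed to "fallback" once the fifth attempt fails.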
Exponential Backoff
Add delays between retries to avoid overwhelming services.
import time

def agent_node_with_backoff(state: AgentState):
    attempts = state.get("attempts", 0)
    if attempts > 0:
        # Exponential backoff before each retry: 0.5s, 1s, 2s, 4s...
        delay = 0.5 * 2 ** (attempts - 1)
        print(f"Retrying in {delay} seconds...")
        time.sleep(delay)  # Blocks the worker; use asyncio.sleep in async nodes
    response = llm.invoke(state["messages"])
    return {
        "messages": [response],
        "attempts": attempts + 1,
    }
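A fixed exponential schedule grows without bound and makes every client retry at the same moments. A common refinement is to cap the delay and add random jitter. The sketch below is one way to compute such a delay; backoff_delay and its base, cap, and rng parameters are hypothetical names, not part of LangGraph.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay in seconds before retry number `attempt` (1-based), with full jitter."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if rng is None:
        rng = random.Random()
    # Exponential growth (base, 2*base, 4*base, ...) capped at `cap`.
    ceiling = min(cap, base * 2 ** (attempt - 1))
    # Full jitter: pick uniformly between 0 and the ceiling.
    return rng.uniform(0.0, ceiling)
```

Inside a node, time.sleep(backoff_delay(attempts)) could replace a fixed delay computation. Passing a seeded random.Random makes the delays reproducible in tests.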
Retry Limits
Always define hard limits to prevent infinite loops.
MAX_RETRIES = 5

def safe_retry_router(state: AgentState):
    if state.get("attempts", 0) >= MAX_RETRIES:
        print("Max retries reached. Moving to fallback.")
        return "fallback"
    if should_retry(state):
        return "agent"
    return END  # The END constant from langgraph.graph
State-Based Retry Decisions
Track more than a counter in state (the last error, an error count, a confidence score) so the router can make intelligent retry decisions.
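from langgraph.graph import END, MessagesState

class AgentState(MessagesState):
    # MessagesState is a TypedDict, so class-level defaults are not applied;
    # read optional fields with .get() instead
    attempts: int
    last_error: str | None
    error_count: int
    confidence: float

def retry_decision(state: AgentState):
    # Check both limits first so low confidence cannot loop forever
    if state.get("error_count", 0) >= 3 or state.get("attempts", 0) >= 5:
        return "fallback"
    if state.get("confidence", 0.0) < 0.6:
        return "agent"  # Retry
    return END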
Tool Retry Patterns
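def tool_retry_router(state: AgentState):
    last = state["messages"][-1]
    # Not every message type has tool_calls (a ToolMessage does not)
    has_tool_calls = bool(getattr(last, "tool_calls", None))
    if has_tool_calls and tool_failed(state):
        return "tools"  # Retry tool execution
    elif has_tool_calls:
        return "agent"  # Ask agent for new plan
    return END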
LLM Retry Patterns
from langchain_core.messages import AIMessage

def llm_with_retry(state: AgentState):
    attempts = state.get("llm_attempts", 0)
    try:
        response = llm.invoke(state["messages"])
        return {
            "messages": [response],
            "llm_attempts": 0,  # Reset on success
        }
    except Exception as e:
        if attempts < 3:
            # Record the failure in state so a router can send control back here
            return {
                "messages": [AIMessage(content=f"LLM failed: {e}")],
                "llm_attempts": attempts + 1,
            }
        raise  # Out of attempts: surface the original exception
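The pattern above retries through the graph, one node execution per attempt. For short transient failures it can be simpler to retry inside the node itself. The following is a minimal sketch of that alternative, assuming the hypothetical names call_with_retry, call, and sleep; the sleep function is injected so the helper can be tested without waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    max_attempts: int = 3,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying the listed exception types with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    # Only reached when the loop never ran.
    raise ValueError("max_attempts must be at least 1")
```

In a node this might look like response = call_with_retry(lambda: llm.invoke(state["messages"])). Exceptions outside retry_on propagate immediately, which keeps permanent errors from being retried.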
Retry Safety Mechanisms
def safe_agent_node(state: AgentState):
    attempts = state.get("attempts", 0)
    if attempts >= 8:
        # Hard stop: report failure instead of calling the agent again
        return {
            "messages": [AIMessage(content="Max retries reached. Unable to complete task.")],
            "status": "failed",  # status must be declared in AgentState
        }
    try:
        return agent_node(state)
    except Exception as e:
        # Convert the exception into state so a router can decide what to do
        return {
            "messages": [AIMessage(content=f"Error: {e}")],
            "attempts": attempts + 1,
            "last_error": str(e),
        }
Common Retry Mistakes
- No retry limit → infinite loops
- Retrying immediately, with no backoff or jitter (many clients retrying at once creates a thundering herd)
- Not distinguishing between transient and permanent errors
- Retrying without updating state (same failure repeats)
- Missing fallback strategy
Best Practices for Retry Logic
- Always set a maximum retry limit
- Use exponential backoff for external calls
- Differentiate error types (retry on network errors, not on bad input)
- Track retry attempts in state
- Provide clear fallback paths
- Log retry events for observability
- Combine with reflection for self-correction
A node that combines these practices:
from langchain_core.messages import AIMessage
from langgraph.graph import MessagesState

class AgentState(MessagesState):
    # TypedDict fields carry no defaults, so read them with .get()
    attempts: int
    last_error: str | None

def robust_agent_node(state: AgentState):
    attempts = state.get("attempts", 0)
    if attempts >= 4:
        return {"messages": [AIMessage(content="Sorry, I couldn't complete this task after multiple attempts.")]}
    try:
        response = llm_with_tools.invoke(state["messages"])
        return {
            "messages": [response],
            "attempts": 0,  # Reset on success
        }
    except Exception as e:
        return {
            "messages": [AIMessage(content=f"Temporary error: {e}")],
            "attempts": attempts + 1,
            "last_error": str(e),
        }
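The best practices above recommend logging retry events. One minimal way to do that, assuming a hypothetical log_retry helper and a hypothetical logger name, is a log call made each time a node records a failure:

```python
import logging

logger = logging.getLogger("agent.retry")  # Hypothetical logger name


def log_retry(node: str, attempt: int, max_attempts: int, error: Exception) -> dict:
    """Log one retry event and return the state update that records it."""
    logger.warning(
        "retrying node=%s attempt=%d/%d error=%s: %s",
        node, attempt, max_attempts, type(error).__name__, error,
    )
    # The returned dict can be merged into the node's return value.
    return {"attempts": attempt, "last_error": str(error)}
```

A node's except branch could return {"messages": [...], **log_retry("agent", attempts + 1, 4, e)}, so the log line and the state update stay in sync.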