LangGraph: Build Stateful AI Agents with Graph-Based Orchestration

Building production-grade AI agents requires more than chaining LLM calls together. Agents need persistent state, error recovery, human oversight, and the ability to run for extended periods without losing context. LangGraph is a stateful AI agent orchestration framework that addresses these challenges head-on by modeling agent workflows as directed graphs with built-in checkpointing, streaming, and human-in-the-loop capabilities. Created by LangChain, the same team behind the LangChain ecosystem, LangGraph provides the low-level infrastructure that makes durable, long-running agent workflows practical for real-world deployment.

Trusted by companies like Klarna, Replit, and Elastic, LangGraph takes a fundamentally different approach from simple chain-based orchestration. Instead of linear pipelines, it represents agent logic as stateful graphs where nodes are computation steps and edges define control flow. This graph-based model, inspired by Google’s Pregel and Apache Beam, enables complex multi-step reasoning, branching decision paths, and parallel execution patterns that are essential for sophisticated AI agents.

Architecture Overview

LangGraph Architecture

Understanding the LangGraph Architecture

The architecture diagram above illustrates the core components of LangGraph and how they interact to form a complete agent runtime. Let us break down each component in detail.

StateGraph Builder

The StateGraph class is the primary entry point for constructing agent workflows. It is a builder class that accepts a typed state schema – typically a TypedDict or Pydantic model – and allows developers to add nodes and edges incrementally. Each node in the graph represents a computation step that reads from and writes to the shared state. The state schema defines the shape of the data that flows through the graph, and each key can optionally be annotated with a reducer function that determines how multiple writes to the same key are merged. For example, a messages key annotated with Annotated[list, operator.add] will append new messages rather than overwriting the previous ones. This declarative approach to state management eliminates the need for manual state tracking and ensures type safety throughout the execution pipeline.

The compile() Process

Once the graph structure is defined with all nodes and edges in place, calling graph.compile() transforms the builder into an executable CompiledStateGraph, which is an instance of the Pregel runtime engine. This compilation step validates the graph topology, resolves conditional edges, sets up input and output channels, and prepares the channel infrastructure that will manage state during execution. The compiled graph implements LangChain’s Runnable interface, which means it supports invoke(), stream(), ainvoke(), and astream() methods out of the box. The compilation step also integrates optional components like checkpointers for persistence and memory stores for cross-session data.

CompiledStateGraph (Pregel Runtime)

The compiled graph is powered by the Pregel execution engine, which orchestrates the actual runtime behavior. Pregel manages the lifecycle of each execution step, coordinating which nodes run, in what order, and how their outputs are merged back into the shared state. The runtime handles concurrency control, error propagation, and interrupt management. When a checkpointer is configured, Pregel automatically saves state snapshots after each superstep, enabling the graph to resume from any point after a failure or interruption.

Nodes and Channels

Nodes are the computational units of the graph. Each node is a Python function that receives the current state and returns a partial state update. Channels are the communication medium between nodes. Each state key maps to a channel that manages how values are stored and updated. The channel system supports different update semantics: LastValue channels store only the most recent value, BinaryOperatorAggregate channels accumulate values using a binary operator like addition, and Topic channels implement publish-subscribe patterns for fan-out communication.

Checkpointer, Store, and Cache

The checkpointer provides durable state persistence, saving the full graph state after each superstep. This enables fault tolerance and the ability to replay or resume execution. The Store provides cross-session persistent key-value storage for long-term memory, and the Cache optimizes repeated computations by storing and reusing results.

Key Insight: LangGraph’s Pregel execution engine processes graph steps in bulk-synchronous-parallel phases, enabling deterministic state updates while supporting concurrent node execution within each superstep.

Pregel Execution Engine

LangGraph Pregel Execution

Understanding Pregel Execution

The Pregel execution engine is the heart of LangGraph’s runtime, implementing a bulk-synchronous-parallel (BSP) computation model inspired by Google’s Pregel paper. This model provides deterministic execution guarantees while allowing concurrent processing within each step. Let us examine how it works in detail.

The Plan-Execute-Update Cycle

Every Pregel superstep follows a three-phase cycle that ensures deterministic state management. In the Plan phase, the engine determines which actors (nodes) need to execute in the current step. For the first superstep, it selects nodes that subscribe to the special input channels. In subsequent supersteps, it selects nodes that subscribe to channels that were updated in the previous step. This reactive approach means that only nodes whose inputs have changed will execute, avoiding unnecessary computation.

In the Execute phase, all selected nodes run concurrently. During this phase, channel updates are invisible to other actors – each node sees a consistent snapshot of the state from the beginning of the step. This isolation is critical for deterministic behavior, because it prevents race conditions where one node’s partial output could affect another node’s execution within the same step. Nodes can execute in parallel because their inputs are frozen, and their outputs are buffered until the update phase.

In the Update phase, all buffered channel writes from the executing nodes are applied simultaneously. This is where reducer functions come into play: if multiple nodes write to the same channel, the reducer determines how those writes are merged. For a LastValue channel, the last write wins. For a BinaryOperatorAggregate channel with operator.add, all values are accumulated. The cycle then repeats, with the updated channels triggering new node subscriptions in the next superstep.

Checkpoint Persistence

After each superstep completes and channels are updated, the checkpointer (if configured) saves a complete snapshot of the graph state. This snapshot includes the channel values, the execution step number, and metadata about which nodes have completed. The checkpoint enables several critical capabilities: fault recovery (resuming from the last checkpoint after a crash), time-travel debugging (inspecting state at any previous step), and human-in-the-loop workflows (pausing execution and resuming later with new input).

Superstep Execution Guarantees

The BSP model provides important guarantees. First, execution is deterministic: given the same input and the same graph topology, the output will always be the same regardless of node execution order within a superstep. Second, the model naturally handles cycles: a node can write to a channel that triggers itself or another node in a subsequent superstep, enabling iterative refinement patterns. Third, the model supports graceful termination: execution stops when no more nodes are triggered for the next superstep, or when a maximum step count is reached.

Takeaway: With just StateGraph(State).compile(), you get a production-ready agent runtime with built-in checkpointing, streaming, and error recovery – no manual state management required.

State Channels

LangGraph State Channels

Understanding State Channels

Channels are the communication backbone of LangGraph. They define how data flows between nodes and how state updates are managed. Understanding channel types is essential for designing effective agent workflows, because the choice of channel type determines the semantics of state updates. Let us explore each channel type in detail.

LastValue Channel

The LastValue channel is the simplest and most commonly used channel type. It stores exactly one value at a time and can receive at most one update per superstep. If multiple nodes attempt to write to a LastValue channel in the same step, the system raises an error, because there is no defined way to resolve conflicting writes. This strict behavior makes LastValue ideal for state keys that represent a single current value, such as the latest user input, the current agent decision, or a configuration parameter. The LastValue channel ensures that state remains unambiguous and deterministic.

BinaryOperatorAggregate Channel

The BinaryOperatorAggregate channel solves the problem of accumulating multiple writes within a single superstep. It maintains a persistent value that is updated by applying a binary operator to the current value and each new write. The most common example is Annotated[list, operator.add], which appends each new list to the existing one. This channel type is essential for the messages key in a chatbot state, where multiple nodes might each contribute new messages in the same step. Other use cases include counters (Annotated[int, operator.add]), maximum trackers (Annotated[int, max]), and custom aggregation logic. The binary operator is applied in order, so the final result is deterministic.

Topic Channel

The Topic channel implements a publish-subscribe pattern that is useful for fan-out communication. It can receive multiple values per superstep and makes all values available to subscribing nodes. The Topic channel can be configured to deduplicate values and to accumulate values over multiple steps. This makes it ideal for scenarios where a node needs to send messages to multiple recipients, or where values from different supersteps need to be collected before processing. Internally, LangGraph uses a Topic channel for the Send primitive, which enables dynamic fan-out patterns where the number of parallel executions is determined at runtime.

EphemeralValue Channel

The EphemeralValue channel stores a value for only the duration of a single superstep. After the step completes, the value is cleared. This channel type is useful for passing data between nodes that should not persist in the long-term state, such as intermediate computation results, temporary flags, or one-time signals. Ephemeral values are particularly valuable in the context of input channels, where the initial user input should be available to the first set of nodes but should not clutter the state in subsequent steps.

NamedBarrierValue Channel

The NamedBarrierValue channel provides a synchronization mechanism. It waits until all named values have been received before making the aggregated result available to subscribing nodes. This is useful in patterns where multiple parallel computations must all complete before the next step can proceed, such as gathering results from multiple sub-agents before synthesizing a final answer. The barrier ensures that downstream nodes do not execute prematurely with incomplete data.

How Nodes Communicate Through Channels

The channel system creates a clean separation between computation and communication. Nodes never call other nodes directly. Instead, they read from channels and write to channels. The Pregel runtime manages the flow of data between channels and nodes, ensuring that each node receives the correct inputs and that writes are properly merged. This decoupled architecture makes it easy to add, remove, or reorder nodes without modifying existing node logic, because the channel definitions handle all the wiring.

Human-in-the-Loop

LangGraph Human in the Loop

Understanding Human-in-the-Loop Workflows

Human-in-the-loop (HITL) is one of LangGraph’s most powerful features, enabling AI agents to pause execution and request human input before continuing. This capability is essential for production deployments where human oversight is required for safety, compliance, or quality assurance. Let us examine the three core components that make HITL possible.

The interrupt() Function

The interrupt() function is the primary mechanism for pausing graph execution. When called from within a node, it raises a GraphInterrupt exception that halts the entire graph at that exact point. The value passed to interrupt() is surfaced to the client, providing context about why the graph was paused and what input is needed. For example, a node might call interrupt({"question": "Should I proceed with this action?", "options": ["yes", "no"]}) to present a decision point to a human reviewer.

A critical detail is that when a graph resumes after an interrupt, the node containing the interrupt() call is re-executed from the beginning. This means that any logic before the interrupt() call runs again, and the resume value is returned from the interrupt() function call. This design ensures that the node has access to both the original state and the human’s input when it resumes. If a node contains multiple interrupt() calls, LangGraph matches resume values to interrupts based on their order, so each interrupt receives the correct corresponding response.

GraphInterrupt and Checkpointing

The GraphInterrupt exception is not a failure condition – it is a controlled pause mechanism. When this exception is raised, the Pregel engine saves the current state to the checkpointer before surfacing the interrupt to the client. This checkpoint is what enables resumption: the graph can be re-invoked later, and it will restore from the saved checkpoint, re-enter the interrupted node, and continue execution with the human’s input. Without a checkpointer configured, the interrupt() function cannot work, because there is no way to persist the paused state.

Command(resume=) for Resuming

The Command class provides the mechanism for resuming a graph after an interrupt. When the client is ready to provide input, it invokes the graph again with a Command(resume=value) object, where value is the human’s response. The graph restores from the checkpoint, re-executes the interrupted node, and the interrupt() function returns the resume value instead of raising an exception. The Command class also supports other control flow operations: Command(update=...) applies a state update without executing a node, Command(goto=...) navigates to a specific node, and Command(graph=Command.PARENT) sends the command to the parent graph in a nested subgraph scenario.

The Re-execution Pattern

The re-execution pattern is fundamental to understanding how HITL works in LangGraph. When a node calls interrupt(), the graph saves state and stops. When the client calls graph.invoke(Command(resume=value)), the graph restores from the checkpoint and re-runs the same node. On re-execution, the interrupt() function returns the resume value instead of raising an exception, allowing the node to proceed with the human’s input. This pattern is clean and predictable: the node always has access to the full state, and there is no special “resume mode” logic needed. The node simply calls interrupt() and uses the return value.

Important: The interrupt() function enables true human-in-the-loop workflows by pausing graph execution mid-step, surfacing values to the client, and resuming exactly where it left off when the human provides input via Command(resume=value).

Installation

Getting started with LangGraph is straightforward. The core package provides everything needed to build and run stateful agent graphs:

pip install langgraph

For persistence and checkpointing, LangGraph offers dedicated packages for different storage backends. The SQLite checkpointer is ideal for development and testing:

pip install langgraph-checkpoint-sqlite

For production deployments requiring robust database support, the PostgreSQL checkpointer provides full ACID guarantees:

pip install langgraph-checkpoint-postgres

LangGraph requires Python 3.10 or later and depends on langchain-core for the Runnable interface, langgraph-checkpoint for state persistence, and langgraph-sdk for deployment integration. The package is MIT-licensed and can be used independently of the broader LangChain ecosystem.

Quick Start: Building Your First StateGraph

Let us build a minimal but functional LangGraph application to demonstrate the core concepts. This example creates a simple chatbot graph with typed state:

from langgraph.graph import StateGraph, START, END
from typing import Annotated
from typing_extensions import TypedDict
import operator

# Define the state schema with a reducer
class State(TypedDict):
    messages: Annotated[list, operator.add]

# Define a node function
def chatbot(state: State):
    return {"messages": ["Hello from LangGraph!"]}

# Build the graph
graph = StateGraph(State)
graph.add_node("chatbot", chatbot)
graph.add_edge(START, "chatbot")
graph.add_edge("chatbot", END)

# Compile and run
app = graph.compile()
result = app.invoke({"messages": []})
print(result)  # {'messages': ['Hello from LangGraph!']}

This example demonstrates several key concepts. The State TypedDict defines the shape of the graph’s shared state. The Annotated[list, operator.add] annotation specifies that the messages key uses the operator.add reducer, meaning new messages are appended to the existing list rather than overwriting it. The START and END constants are special nodes that mark the entry and exit points of the graph. The add_node() method registers a computation step, and add_edge() defines the control flow between nodes.

For a more practical chatbot that integrates with an LLM, you would add a checkpointer for persistence and use the interrupt() function for human approval:

from langgraph.checkpoint.memory import InMemorySaver
from langgraph.types import interrupt, Command

# Add checkpointing
memory = InMemorySaver()
app = graph.compile(checkpointer=memory)

# Node with human approval
def agent_with_approval(state: State):
    # Generate a proposed action
    proposed_action = "Delete 10 old files"
    # Pause for human review
    approval = interrupt({"action": proposed_action})
    if approval == "yes":
        return {"messages": [f"Executed: {proposed_action}"]}
    return {"messages": ["Action cancelled by user"]}

# Resume after human input
result = app.invoke(
    Command(resume="yes"),
    config={"configurable": {"thread_id": "1"}}
)

Key Features

LangGraph provides a comprehensive set of features for building production-grade AI agents. The following table summarizes the core capabilities:

Feature Description
StateGraph Build agents as directed graphs with typed state schemas and reducer functions for state updates
Checkpointing Persist execution state across steps for durability, fault recovery, and time-travel debugging
Human-in-the-Loop Pause execution for human review via interrupt(), resume with Command(resume=)
Streaming 7 stream modes including values, updates, messages, custom, checkpoints, tasks, and debug
Subgraphs Nest graphs within graphs for composition and modularity
Send Dynamic fan-out for map-reduce patterns where the number of parallel tasks is determined at runtime
Command Multi-purpose control flow: update state, resume from interrupt, navigate to specific nodes
Memory Store Cross-session persistent key-value storage for long-term agent memory

Amazing: LangGraph supports seven different streaming modes (values, updates, messages, custom, checkpoints, tasks, debug), giving developers granular control over how agent outputs reach end users in real time.

Each of these features is designed to work together seamlessly. For example, checkpointing enables human-in-the-loop by persisting state at interrupt points. Streaming works with checkpointing to emit state snapshots in real time. Subgraphs can have their own checkpointers and interrupt handlers, enabling complex multi-agent hierarchies. The Send primitive works with Topic channels to enable dynamic parallelism patterns that are difficult to express in traditional chain-based frameworks.

Streaming Modes

LangGraph’s streaming system is one of its most sophisticated features. The seven stream modes provide different views into the graph’s execution:

  • values: Emits the full state after each superstep, giving a complete snapshot of the graph’s progress
  • updates: Emits only the state changes from each node, reducing bandwidth while still showing progress
  • messages: Emits LLM message chunks in real time, ideal for chatbot interfaces that stream token-by-token
  • custom: Allows nodes to emit arbitrary data via a StreamWriter, enabling application-specific streaming patterns
  • checkpoints: Emits checkpoint data after each superstep, useful for monitoring persistence and debugging state
  • tasks: Emits task-level information about which nodes are executing, providing visibility into parallelism
  • debug: Emits detailed debug information including task payloads and results, essential for development and troubleshooting

Multiple stream modes can be combined by passing a list to the stream_mode parameter. When combined, the output includes a mode identifier alongside each chunk, making it easy to route different types of events to different handlers.

  • GitHub: https://github.com/langchain-ai/langgraph
  • Documentation: https://langchain-ai.github.io/langgraph/
  • PyPI: https://pypi.org/project/langgraph/

Conclusion

LangGraph represents a significant advancement in AI agent infrastructure. By modeling agent workflows as stateful directed graphs with a Pregel-based execution engine, it provides the durability, observability, and control that production deployments demand. The channel system offers flexible state management semantics, from simple last-value storage to complex aggregation and publish-subscribe patterns. The human-in-the-loop capability, built on checkpointing and the interrupt() function, enables the kind of human oversight that responsible AI deployment requires. And with seven streaming modes, developers have granular control over how agent outputs reach end users.

Whether you are building a simple chatbot or a complex multi-agent system, LangGraph’s graph-based approach provides a solid foundation. The framework integrates seamlessly with the LangChain ecosystem but can also be used standalone, making it accessible to any Python developer working with LLMs. With its combination of low-level control and high-level abstractions, LangGraph is well-positioned as the go-to framework for building stateful, production-ready AI agents.

Watch PyShine on YouTube

Contents