Picking an agent framework in 2026 is genuinely hard. LangGraph, CrewAI, AutoGen, and GraphBus all describe themselves as "multi-agent orchestration" tools — but they make fundamentally different architectural bets. This post breaks down exactly what those bets are and when each one makes sense.
Disclaimer: we built GraphBus. We'll be honest about its weaknesses. Skip to the TL;DR if you just want the summary.
The fundamental question: when do LLMs run?
Every meaningful comparison between these four frameworks comes down to one axis: when does the LLM get called?
- LangGraph, CrewAI, AutoGen — LLMs run at runtime, on every user request (or task execution)
- GraphBus — LLMs run at build time, once, to improve the code. Runtime is pure Python
This single difference cascades into wildly different cost structures, latency profiles, and use case fits. Let's trace through each framework.
LangGraph
LangGraph (part of the LangChain ecosystem) models your agent system as a state machine — a graph of nodes, where each node is a function that can call an LLM. Edges are conditional: the system routes to different nodes based on state.
The mental model is: LLM as a router and reasoner, called at every decision point.
# LangGraph: LLMs called at runtime on every request
from typing import TypedDict
from langgraph.graph import StateGraph
from langchain_anthropic import ChatAnthropic

class OrderState(TypedDict):
    order: str
    classification: str
    payment_status: str

llm = ChatAnthropic(model="claude-3-haiku")  # called per request

def classify_order(state: OrderState):
    # LLM call: ~200-500 tokens, every time
    result = llm.invoke(f"Classify this order: {state['order']}")
    return {"classification": result.content}

def process_payment(state: OrderState):
    # Another LLM call: ~300-800 tokens, every time
    result = llm.invoke(f"Process payment for: {state['order']}")
    return {"payment_status": result.content}

graph = StateGraph(OrderState)  # state schema needs annotations, hence the TypedDict
graph.add_node("classify", classify_order)
graph.add_node("payment", process_payment)
graph.add_edge("classify", "payment")
graph.set_entry_point("classify")  # compile() requires an entry point
app = graph.compile()
Strengths: Excellent for workflows requiring runtime adaptability — the LLM can make different routing decisions based on each specific request. Mature ecosystem, excellent streaming support, strong tooling for complex multi-step reasoning.
Weaknesses: Every request costs tokens. A simple order processing pipeline might use 2,000 tokens per request — $0.002 at current Claude pricing. At 1M orders/month that's $2,000/month in AI costs just for this one pipeline. Plus latency: each LLM call adds 200–2,000ms.
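To make that arithmetic explicit, here is the per-request math as a quick sketch. The $1.00 per million tokens is an assumed blended rate (between Haiku's $0.25/M input and $1.25/M output prices), chosen to match the $0.002/request figure above; it is not a published price.

```python
# Rough cost model for the pipeline above. The blended per-token rate is
# an assumption sitting between Haiku's input and output prices.
TOKENS_PER_REQUEST = 2_000
BLENDED_USD_PER_M_TOKENS = 1.00   # assumed input/output blend
REQUESTS_PER_MONTH = 1_000_000

cost_per_request = TOKENS_PER_REQUEST * BLENDED_USD_PER_M_TOKENS / 1_000_000
monthly_cost = cost_per_request * REQUESTS_PER_MONTH
print(f"${cost_per_request:.4f}/request -> ${monthly_cost:,.0f}/month")
# → $0.0020/request -> $2,000/month
```

Swap in your own token counts and blend to model a different pipeline.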
CrewAI
CrewAI organizes agents into "crews" — teams with roles, goals, and tool access. Agents collaborate using natural language. The framework is optimized for task-oriented workloads where you'd describe what you want in English and let the LLM figure out the execution plan.
# CrewAI: LLMs run for every task execution
from crewai import Agent, Task, Crew

analyst = Agent(
    role="Order Analyst",
    goal="Analyze e-commerce orders for fraud",
    backstory="Expert in payment fraud detection",
    llm="anthropic/claude-3-haiku"  # called at runtime per task
)

task = Task(
    description="Analyze order {order_id} for fraud signals",
    agent=analyst,
    expected_output="Fraud assessment with confidence score"
)

crew = Crew(agents=[analyst], tasks=[task])
result = crew.kickoff(inputs={"order_id": "ord_123"})  # LLM called here
Strengths: Extremely fast to get started. Role-based abstractions feel natural. Good for internal tools, prototyping, and workloads where the flexibility of natural-language task definition is worth the cost.
Weaknesses: Agents reason through natural language, which means prompt engineering is load-bearing. Output reliability is lower than structured approaches. LLM cost accrues per task, and tasks can be expensive (complex reasoning = many tokens).
AutoGen (Microsoft)
AutoGen's core abstraction is conversational agents that message each other to solve problems. You define agents with capabilities and let them negotiate solutions through dialogue — the LLM drives both the reasoning and the inter-agent communication.
# AutoGen: LLMs drive runtime conversation between agents
import autogen

config = {"model": "claude-3-haiku", "api_type": "anthropic", "api_key": "sk-ant-..."}

validator = autogen.AssistantAgent(
    name="OrderValidator",
    system_message="You validate e-commerce orders.",
    llm_config={"config_list": [config]}
)

processor = autogen.AssistantAgent(
    name="OrderProcessor",
    system_message="You process validated orders.",
    llm_config={"config_list": [config]}
)

# At runtime: agents converse until they agree
user_proxy = autogen.UserProxyAgent(name="User", human_input_mode="NEVER")
user_proxy.initiate_chat(validator, message="Process order ord_123")
Strengths: The conversational model can produce surprisingly good results on complex, open-ended tasks. Self-correcting: agents can catch each other's mistakes through dialogue. Works well for code generation and research tasks.
Weaknesses: The conversational loop is expensive — a multi-agent conversation might use 5,000–20,000 tokens. Highly non-deterministic: the same input can produce different outputs (and different costs). Hard to deploy to production when you need SLAs.
GraphBus
GraphBus separates concerns differently. Agents run a build phase — they read their own source code, propose improvements, negotiate consensus, and commit changes. The improved code then runs via a typed message bus at runtime. You control when LLMs are invoked: at build time, during runtime agent logic, or both.
# GraphBus: LLMs run at BUILD TIME to improve this code.
# At runtime: agents communicate via typed pub/sub; add LLM calls to agent logic as needed.
from graphbus_core import GraphBusNode, schema_method, subscribe

class OrderProcessor(GraphBusNode):
    SYSTEM_PROMPT = """
    I process e-commerce orders. During build cycles, I negotiate with
    FraudDetector and PaymentService to ensure our schemas are consistent
    and my validation logic is robust.
    """

    @schema_method(
        input_schema={"order_id": str, "amount": float, "items": list},
        output_schema={"status": str, "total": float}
    )
    def process_order(self, order_id: str, amount: float, items: list) -> dict:
        # This code was IMPROVED by agents at build time.
        # Now run it on the bus — call LLMs when your agents need them.
        if amount <= 0:
            raise ValueError("Amount must be positive")
        total = sum(item["price"] * item["qty"] for item in items)
        return {"status": "confirmed", "total": total}

    @subscribe("/Fraud/Cleared")
    def on_fraud_cleared(self, event):
        self.log(f"Fraud check passed for {event.payload['order_id']}")
# Build once — agents negotiate improvements (~10K tokens total)
export ANTHROPIC_API_KEY=sk-ant-...
graphbus build agents/ --enable-agents
# [AGENT] OrderProcessor: "I propose adding amount validation (line 14)"
# [AGENT] FraudDetector: "Accepted — prevents downstream null errors"
# [ARBITER] Consensus. Committing to agents/order_processor.py
# Deploy — agents run on the bus, LLMs on your terms
graphbus run .graphbus/
# [RUNTIME] 3 agents active. Zero LLM calls will be made.
Strengths: Build-time LLM intelligence that improves your code structure and contracts. Structured runtime communication via a typed message bus. Full flexibility to call LLMs inside agent methods at runtime. Built-in K8s/Docker deploy tooling. Fully inspectable negotiation history. Works with existing Python codebases — just subclass GraphBusNode.
Weaknesses: Younger ecosystem than LangGraph/CrewAI. Negotiation-based builds add build-time cost. Best for structured, graph-shaped workloads — if your agents need open-ended conversational loops, LangGraph may be a better fit.
The cost comparison: running 1M orders/month
Let's make this concrete. An order processing pipeline that handles 1 million orders per month:
| Framework | Tokens / request | Cost at 1M orders/mo | Latency added | Schema-validated routing? |
|---|---|---|---|---|
| LangGraph | ~1,500 / req | ~$2,250 / mo | +500–2,000ms | No |
| CrewAI | ~2,000 / req | ~$3,000 / mo | +800–3,000ms | No |
| AutoGen | ~8,000 / req | ~$12,000 / mo | +2,000–8,000ms | No |
| GraphBus | 0 / req (for this pipeline†) | ~$0.03 / mo (build phase) | +0ms bus overhead | Yes |
† Zero runtime tokens for this pipeline because no LLM calls are added to the agent logic; runtime LLM calls in GraphBus are opt-in.
Token costs estimated at Claude Haiku pricing ($0.25/M input, $1.25/M output). Actual costs vary by model and usage pattern.
Of course, these numbers only matter if GraphBus can do what you need. If your pipeline genuinely requires per-request LLM reasoning, the alternatives aren't a choice — they're a requirement.
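The table's runtime-cost column can be reproduced with a one-line estimator. The $1.50/M blended rate below is an assumption that happens to match the figures above; real blends depend on your input/output token mix and model choice.

```python
# Minimal estimator reproducing the table's runtime-cost column.
# usd_per_m is an assumed blended input/output rate, not a published price.
def monthly_cost(tokens_per_request, requests_per_month=1_000_000, usd_per_m=1.50):
    return tokens_per_request * requests_per_month / 1_000_000 * usd_per_m

estimates = {"LangGraph": 1_500, "CrewAI": 2_000, "AutoGen": 8_000}
for framework, tokens in estimates.items():
    print(f"{framework}: ${monthly_cost(tokens):,.0f}/mo")
# → LangGraph: $2,250/mo
# → CrewAI: $3,000/mo
# → AutoGen: $12,000/mo
```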
When to use each
Use LangGraph when:
- You need runtime adaptability — the LLM's response genuinely changes based on each request's unique context
- You're building research or analysis tools where open-ended reasoning is the product
- You're already in the LangChain ecosystem and want tight integration
- You need streaming — LangGraph's streaming support is excellent
- Your volume is low enough that per-request LLM cost is acceptable
Use CrewAI when:
- You're prototyping fast and care more about shipping than cost
- Your task can be described in natural language and the LLM can figure out the steps
- You're building internal tools where cost isn't a primary concern
- You want role-based agent abstractions without writing much infrastructure
Use AutoGen when:
- You're solving complex, open-ended problems where agent conversation produces emergent solutions
- You're doing code generation or research and need agents to check each other's work
- Cost and latency are secondary to capability — you'd rather have the best answer than the fastest one
- You're building developer tooling or agentic IDEs
Use GraphBus when:
- You have a pipeline with stable semantics — the logic doesn't change with each request, only the data does
- You want structured inter-agent communication — typed messages, pub/sub topics, and a graph that knows who depends on whom
- You're running at scale where per-request LLM cost is material (thousands+ requests/hour)
- You want LLM-improved code without LLM runtime dependency — deploy to air-gapped environments, edge, embedded
- You need full K8s/Docker deployment tooling built into the framework
- Your team wants to use LLMs to improve a Python codebase but keep runtime pure
The hybrid approach
These frameworks aren't mutually exclusive. A real production system might use:
- GraphBus for the high-volume data processing pipeline (orders, events, notifications)
- LangGraph for the customer support agent that needs per-message reasoning
- AutoGen for internal developer tooling where cost doesn't matter
The mistake is picking one framework and using it for everything. Think about which part of your system needs runtime intelligence versus which parts can have their intelligence baked in.
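One way to picture that split is a dispatcher that routes high-volume, stable-semantics traffic to pure-Python handlers and reserves an LLM-backed handler for paths that genuinely need per-request reasoning. This is framework-agnostic pseudo-architecture; every name in it is hypothetical.

```python
# Illustrative hybrid dispatch: deterministic handlers for stable pipelines,
# an LLM-backed handler only where per-request reasoning is required.
def process_order(req):
    # Stable semantics: pure Python, zero runtime tokens (GraphBus-style path)
    total = sum(i["price"] * i["qty"] for i in req["items"])
    return {"status": "confirmed", "total": total}

def support_reply(req):
    # Per-message reasoning: this is where a runtime LLM call would live
    # (LangGraph-style path); stubbed out here.
    raise NotImplementedError("call your LLM-backed agent here")

ROUTES = {"order": process_order, "support": support_reply}

def handle(req):
    return ROUTES[req["kind"]](req)

print(handle({"kind": "order", "items": [{"price": 10.0, "qty": 2}]}))
# → {'status': 'confirmed', 'total': 20.0}
```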
TL;DR
LangGraph — runtime LLM, graph-based state machines, most mature ecosystem. Best for adaptive workflows, streaming, and complex multi-step reasoning.
CrewAI — runtime LLM, role-based agents, fastest to prototype. Best for internal tools and low-volume tasks.
AutoGen — runtime LLM, conversational agents, most capable. Best for complex code-gen and research. Most expensive.
GraphBus — build-time LLM negotiation + graph-based runtime messaging bus. Call LLMs at runtime when your agents need them. Best for structured, graph-shaped agent systems that need intelligent code evolution and observable inter-agent communication.
If you're still unsure: run a quick test. Take your current pipeline. Count how many LLM calls it makes per request. Multiply by your expected volume. If the monthly cost is uncomfortable, GraphBus is worth evaluating.
GraphBus is MIT licensed and in alpha. The build pipeline, runtime engine, CLI, and negotiation protocol are working and tested. We're looking for teams with real production pipelines to work through the tradeoffs with us.
If you're running LangGraph or CrewAI in production and hitting the cost wall, reach out — we'd genuinely like to understand your use case.
Evaluate GraphBus for your pipeline
Join the alpha waitlist. We'll reach out to help you assess whether the build/runtime model fits your use case.
Join the waitlist · Read the docs