Picking an agent framework in 2026 is genuinely hard. LangGraph, CrewAI, AutoGen, and GraphBus all describe themselves as "multi-agent orchestration" tools — but they make fundamentally different architectural bets. This post breaks down exactly what those bets are and when each one makes sense.

Disclaimer: we built GraphBus. We'll be honest about its weaknesses. Skip to the TL;DR table if you just want the summary.

The fundamental question: when do LLMs run?

Every framework comparison between these four comes down to one axis: when does the LLM get called?

This single difference cascades into wildly different cost structures, latency profiles, and use case fits. Let's trace through each framework.

LangGraph

LangGraph (part of the LangChain ecosystem) models your agent system as a state machine — a graph of nodes, where each node is a function that can call an LLM. Edges are conditional: the system routes to different nodes based on state.

The mental model is: LLM as a router and reasoner, called at every decision point.

# LangGraph: LLMs called at runtime on every request
from typing import TypedDict
from langgraph.graph import END, StateGraph
from langchain_anthropic import ChatAnthropic

class OrderState(TypedDict, total=False):
    order: str
    classification: str
    payment_status: str

llm = ChatAnthropic(model="claude-3-haiku-20240307")  # called per request

def classify_order(state: OrderState) -> dict:
    # LLM call: ~200-500 tokens, every time
    result = llm.invoke(f"Classify this order: {state['order']}")
    return {"classification": result.content}

def process_payment(state: OrderState) -> dict:
    # Another LLM call: ~300-800 tokens, every time
    result = llm.invoke(f"Process payment for: {state['order']}")
    return {"payment_status": result.content}

graph = StateGraph(OrderState)
graph.add_node("classify", classify_order)
graph.add_node("payment", process_payment)
graph.set_entry_point("classify")
graph.add_edge("classify", "payment")
graph.add_edge("payment", END)
app = graph.compile()

Strengths: Excellent for workflows requiring runtime adaptability: the LLM can make different routing decisions for each specific request. Mature ecosystem, first-class streaming support, and strong tooling for complex multi-step reasoning.

Weaknesses: Every request costs tokens. A simple order-processing pipeline might use 2,000 tokens per request, roughly $0.0015 at current Claude Haiku pricing (assuming an even input/output split). At 1M orders/month that's about $1,500/month in AI costs for this one pipeline. Plus latency: each LLM call adds 200–2,000ms.

CrewAI

CrewAI organizes agents into "crews" — teams with roles, goals, and tool access. Agents collaborate using natural language. The framework is optimized for task-oriented workloads where you describe what you want in English and let the LLM figure out the execution plan.

# CrewAI: LLMs run for every task execution
from crewai import Agent, Task, Crew

analyst = Agent(
    role="Order Analyst",
    goal="Analyze e-commerce orders for fraud",
    backstory="Expert in payment fraud detection",
    llm="anthropic/claude-3-haiku-20240307"   # called at runtime per task
)

task = Task(
    description="Analyze order {order_id} for fraud signals",
    agent=analyst,
    expected_output="Fraud assessment with confidence score"
)

crew = Crew(agents=[analyst], tasks=[task])
result = crew.kickoff(inputs={"order_id": "ord_123"})  # LLM called here

Strengths: Extremely fast to get started. Role-based abstractions feel natural. Good for internal tools, prototyping, and workloads where the flexibility of natural-language task definition is worth the cost.

Weaknesses: Agents reason through natural language, which means prompt engineering is load-bearing. Output reliability is lower than structured approaches. LLM cost accrues per task, and tasks can be expensive (complex reasoning = many tokens).

AutoGen (Microsoft)

AutoGen's core abstraction is conversational agents that message each other to solve problems. You define agents with capabilities and let them negotiate solutions through dialogue — the LLM drives both the reasoning and the inter-agent communication.

# AutoGen: LLMs drive runtime conversation between agents
import autogen

config = {"model": "claude-3-haiku-20240307", "api_type": "anthropic", "api_key": "sk-ant-..."}

validator = autogen.AssistantAgent(
    name="OrderValidator",
    system_message="You validate e-commerce orders.",
    llm_config={"config_list": [config]}
)

processor = autogen.AssistantAgent(
    name="OrderProcessor",
    system_message="You process validated orders.",
    llm_config={"config_list": [config]}
)

# At runtime: agents converse until they agree
user_proxy = autogen.UserProxyAgent(name="User", human_input_mode="NEVER")
user_proxy.initiate_chat(validator, message="Process order ord_123")

Strengths: The conversational model can produce surprisingly good results on complex, open-ended tasks. Self-correcting: agents can catch each other's mistakes through dialogue. Works well for code generation and research tasks.

Weaknesses: The conversational loop is expensive — a multi-agent conversation might use 5,000–20,000 tokens. Highly non-deterministic: the same input can produce different outputs (and different costs). Hard to deploy to production when you need SLAs.

GraphBus

GraphBus separates concerns differently. Agents run a build phase — they read their own source code, propose improvements, negotiate consensus, and commit changes. The improved code then runs via a typed message bus at runtime. You control when LLMs are invoked: at build time, during runtime agent logic, or both.

# GraphBus: LLMs run at BUILD TIME to improve this code.
# At runtime: agents communicate via typed pub/sub; add LLM calls to agent logic as needed.
from graphbus_core import GraphBusNode, schema_method, subscribe

class OrderProcessor(GraphBusNode):
    SYSTEM_PROMPT = """
    I process e-commerce orders. During build cycles, I negotiate with
    FraudDetector and PaymentService to ensure our schemas are consistent
    and my validation logic is robust.
    """

    @schema_method(
        input_schema={"order_id": str, "amount": float, "items": list},
        output_schema={"status": str, "total": float}
    )
    def process_order(self, order_id: str, amount: float, items: list) -> dict:
        # This code was IMPROVED by agents at build time.
        # Now run it on the bus — call LLMs when your agents need them.
        if amount <= 0:
            raise ValueError("Amount must be positive")
        total = sum(item["price"] * item["qty"] for item in items)
        return {"status": "confirmed", "total": total}

    @subscribe("/Fraud/Cleared")
    def on_fraud_cleared(self, event):
        self.log(f"Fraud check passed for {event.payload['order_id']}")

# Build once — agents negotiate improvements (~10K tokens total)
export ANTHROPIC_API_KEY=sk-ant-...
graphbus build agents/ --enable-agents

# [AGENT] OrderProcessor: "I propose adding amount validation (line 14)"
# [AGENT] FraudDetector: "Accepted — prevents downstream null errors"
# [ARBITER] Consensus. Committing to agents/order_processor.py

# Deploy — agents run on the bus, LLMs on your terms
graphbus run .graphbus/
# [RUNTIME] 3 agents active. Zero LLM calls will be made.

Strengths: Build-time LLM intelligence that improves your code structure and contracts. Structured runtime communication via a typed message bus. Full flexibility to call LLMs inside agent methods at runtime. Built-in K8s/Docker deploy tooling. Fully inspectable negotiation history. Works with existing Python codebases — just subclass GraphBusNode.

Weaknesses: Younger ecosystem than LangGraph/CrewAI. Negotiation-based builds add build-time cost. Best for structured, graph-shaped workloads — if your agents need open-ended conversational loops, AutoGen may be a better fit.
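To make typed, schema-validated messaging concrete, here's a minimal, framework-agnostic sketch of the pattern: a bus that checks each payload against a declared schema before any handler runs. This is an illustration of the idea only, not GraphBus internals.

```python
# Minimal typed pub/sub sketch: payloads are validated against a declared
# schema before handlers run. Conceptual illustration, not GraphBus code.
from typing import Any, Callable

class SchemaBus:
    def __init__(self) -> None:
        self._handlers: dict[str, list[tuple[dict, Callable]]] = {}

    def subscribe(self, topic: str, schema: dict, handler: Callable) -> None:
        self._handlers.setdefault(topic, []).append((schema, handler))

    def publish(self, topic: str, payload: dict) -> None:
        for schema, handler in self._handlers.get(topic, []):
            for field, expected in schema.items():
                if not isinstance(payload.get(field), expected):
                    raise TypeError(f"{topic}: {field!r} must be {expected.__name__}")
            handler(payload)

bus = SchemaBus()
cleared = []
bus.subscribe("/Fraud/Cleared", {"order_id": str}, cleared.append)
bus.publish("/Fraud/Cleared", {"order_id": "ord_123"})   # handler runs
# bus.publish("/Fraud/Cleared", {"order_id": 42})        # would raise TypeError
```

The point of the pattern: a malformed message fails loudly at the bus boundary instead of producing a plausible-looking wrong answer downstream.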

The cost comparison: running 1M orders/month

Let's make this concrete. An order processing pipeline that handles 1 million orders per month:

| Framework | Tokens / request | Cost at 1M orders/mo | Latency added | Schema-validated routing? |
|-----------|------------------|----------------------|---------------|---------------------------|
| LangGraph | ~1,500 | ~$1,125 | +500–2,000ms | No |
| CrewAI | ~2,000 | ~$1,500 | +800–3,000ms | No |
| AutoGen | ~8,000 | ~$6,000 | +2,000–8,000ms | No |
| GraphBus | 0 (for this pipeline†) | ~$0.03 (build phase) | ~0ms (bus overhead only) | Yes |

† Assumes this pipeline needs no runtime LLM calls; if you add them, costs scale with usage. Token costs estimated at Claude 3 Haiku pricing ($0.25/M input, $1.25/M output, even split assumed). Actual costs vary by model and usage pattern.

At 1M orders/month, the gap between GraphBus and AutoGen is over $70,000/year. That's not a marginal optimization; it's a different category of system.
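The runtime-LLM rows are plain arithmetic over a token blend. A minimal sketch, assuming an even split between input and output tokens at the Haiku rates quoted above:

```python
# Back-of-envelope model for per-request LLM cost.
# Assumption: an even input/output token split at Claude 3 Haiku rates.
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 1.25
BLENDED_USD_PER_TOKEN = (INPUT_USD_PER_M + OUTPUT_USD_PER_M) / 2 / 1_000_000

def monthly_cost(tokens_per_request: int, requests_per_month: int) -> float:
    """Estimated monthly LLM spend for one pipeline."""
    return tokens_per_request * requests_per_month * BLENDED_USD_PER_TOKEN

for name, tokens in [("LangGraph", 1_500), ("CrewAI", 2_000), ("AutoGen", 8_000)]:
    print(f"{name}: ~${monthly_cost(tokens, 1_000_000):,.0f}/mo")
```

Shift the blend toward output-heavy workloads and the numbers climb; the model is a floor, not a ceiling.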

Of course, these numbers only matter if GraphBus can do what you need. If your pipeline genuinely requires per-request LLM reasoning, a runtime-LLM framework isn't a cost to optimize away; it's a requirement.

When to use each

Use LangGraph when:

- Routing decisions genuinely depend on each request's content
- You need streaming and mature tooling for complex multi-step reasoning
- Per-request token cost and latency are acceptable trade-offs

Use CrewAI when:

- You're prototyping or building internal tools
- Defining tasks in natural language beats writing explicit control flow
- Volume is low enough that per-task LLM cost doesn't hurt

Use AutoGen when:

- The task is open-ended (code generation, research) and benefits from agents critiquing each other
- You can tolerate non-deterministic outputs, costs, and latency
- You don't need production SLAs

Use GraphBus when:

- Your workload is structured and graph-shaped with stable contracts
- Per-request LLM cost or latency at volume is the bottleneck
- You want build-time code improvement plus typed, observable runtime messaging

The hybrid approach

These frameworks aren't mutually exclusive. A real production system might use:

- GraphBus for the high-volume, structured order pipeline
- LangGraph for an adaptive customer-support flow that needs per-request reasoning
- AutoGen offline for code generation and research tasks, where cost per run matters less

The mistake is picking one framework and using it for everything. Think about which parts of your system need runtime intelligence and which can have their intelligence baked in at build time.

TL;DR

LangGraph — runtime LLM, stateful workflows, great streaming. Best for adaptive pipelines and research tools.

CrewAI — runtime LLM, role-based agents, fastest to prototype. Best for internal tools and low-volume tasks.

AutoGen — runtime LLM, conversational agents, most capable. Best for complex code-gen and research. Most expensive.

GraphBus — build-time LLM negotiation + graph-based runtime messaging bus. Call LLMs at runtime when your agents need them. Best for structured, graph-shaped agent systems that need intelligent code evolution and observable inter-agent communication.

If you're still unsure: run a quick test. Take your current pipeline. Count how many LLM calls it makes per request. Multiply by your expected volume. If the monthly cost is uncomfortable, GraphBus is worth evaluating.
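That quick test fits in a few lines. The default rate below is a blended Haiku figure and the pipeline shape is hypothetical; substitute your own model's pricing and measured call counts:

```python
# Quick check: LLM calls per request x tokens per call x monthly volume.
def monthly_llm_cost(calls_per_request: int, avg_tokens_per_call: int,
                     requests_per_month: int,
                     usd_per_m_tokens: float = 0.75) -> float:
    total_tokens = calls_per_request * avg_tokens_per_call * requests_per_month
    return total_tokens / 1_000_000 * usd_per_m_tokens

# Hypothetical pipeline: 3 LLM calls/request, ~600 tokens each, 1M requests/month
print(f"~${monthly_llm_cost(3, 600, 1_000_000):,.0f}/mo")
```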


GraphBus is MIT licensed and in alpha. The build pipeline, runtime engine, CLI, and negotiation protocol are working and tested. We're looking for teams with real production pipelines to work through the tradeoffs with us.

If you're running LangGraph or CrewAI in production and hitting the cost wall, reach out — we'd genuinely like to understand your use case.

Evaluate GraphBus for your pipeline

Join the alpha waitlist. We'll reach out to help you assess whether the build/runtime model fits your use case.

Join the waitlist · Read the docs