Microsoft AutoGen

Donavan Jones Published May 1, 2026 ai-engineering

Microsoft AutoGen

The previous two articles built up a picture of what agent orchestration and state management look like when implemented from scratch. That custom approach is what this platform uses in production — explicit task graphs, typed pipeline state, deliberate model selection per stage. But it is not the only approach, and understanding the alternatives clarifies why specific design decisions were made.

Microsoft AutoGen is one of the most widely used frameworks for multi-agent AI systems. It takes a different philosophy: instead of defining explicit pipelines, agents communicate by conversing. They send messages to each other, respond to requests, and the collective dialogue produces the result. Understanding AutoGen — what it is, what it is good at, and where its model breaks down — is useful context for anyone building AI systems, even those who end up building custom rather than using the framework.

What AutoGen Is

AutoGen is an open-source Python framework from Microsoft Research that models multi-agent workflows as conversations between agents. Each agent is an entity with a name, a system prompt, and optionally a set of callable functions (tools). Agents communicate by passing messages in a shared conversation thread.

The two primary agent types:

AssistantAgent — backed by a language model. It receives messages, reasons about them, and either responds with text or calls a registered function. This is the "thinking" component.

UserProxyAgent — acts as the human-in-the-loop or as an automated executor. It can execute code generated by an AssistantAgent, relay user input, or automatically respond according to a configured policy. This is the "acting" component.

The simplest AutoGen workflow pairs these two:

import autogen

config_list = [{"model": "claude-opus-4-8", "api_key": os.environ["ANTHROPIC_API_KEY"]}]

assistant = autogen.AssistantAgent(
    name="StudyAssistant",
    system_message="""You are a knowledgeable Bible study assistant.
    Help the user understand passages with clear explanations and relevant cross-references.
    When you need to retrieve passage text, use the get_verse function.""",
    llm_config={"config_list": config_list},
)

user_proxy = autogen.UserProxyAgent(
    name="User",
    human_input_mode="NEVER",   # automated, no human in the loop
    code_execution_config=False,
    function_map={"get_verse": bible_service.get_verse},
)

# Kick off the conversation
user_proxy.initiate_chat(
    assistant,
    message="Explain the meaning of Romans 8:28 in context.",
)

The agents exchange messages until the conversation reaches a termination condition. AutoGen handles the loop: the assistant responds, the proxy either executes a function call or replies, the assistant incorporates the result, and so on.

The Conversation Model

AutoGen's defining characteristic is that it models everything as conversation. Multi-agent coordination is agents talking to each other rather than an orchestrator routing structured data between stages.

A three-agent pipeline in AutoGen for study guide generation:

researcher = autogen.AssistantAgent(
    name="Researcher",
    system_message="You retrieve relevant Bible passages and commentary for a given topic.",
    llm_config={"config_list": config_list},
)

writer = autogen.AssistantAgent(
    name="Writer",
    system_message="You write clear, accessible study guide content from provided research.",
    llm_config={"config_list": config_list},
)

critic = autogen.AssistantAgent(
    name="Critic",
    system_message="You review study content for theological accuracy and flag any issues.",
    llm_config={"config_list": config_list},
)

group_chat = autogen.GroupChat(
    agents=[researcher, writer, critic],
    messages=[],
    max_round=12,
)

manager = autogen.GroupChatManager(groupchat=group_chat, llm_config={"config_list": config_list})

The GroupChatManager is itself backed by a model — it decides which agent should speak next based on the conversation so far. The agents converse, the manager facilitates, and the output emerges from the collective dialogue.

This is elegant and flexible. The researcher gathers information, passes it to the writer through the shared conversation, the writer produces a draft, the critic responds with issues, the writer revises. The pipeline structure is implicit in the conversation rather than explicit in code.

Where AutoGen Excels

Rapid prototyping. Spinning up a multi-agent workflow in AutoGen is genuinely fast. The framework handles the conversation loop, message routing, function execution, and termination conditions. A working two-agent prototype with tool use can be running in 30 lines of Python. For exploring whether a multi-agent approach will work for a problem, AutoGen is an excellent environment.

Conversational workflows with natural structure. Some tasks are genuinely conversational in nature — an analyst and a researcher debating an interpretation, a writer and editor going back and forth on revisions, a planner and executor working through an open-ended task. AutoGen's conversation model maps naturally onto these. The agents can ask each other clarifying questions, push back on conclusions, and change direction mid-task in ways that explicit pipelines cannot.

Code execution. AutoGen has first-class support for agents that generate and execute code. The UserProxyAgent can execute Python in a sandboxed environment and return results. For data analysis tasks, scripting tasks, or anything where the model needs to run code to solve a problem, AutoGen's code execution capabilities are well-developed and handle a lot of the surrounding complexity (retry on error, output capture, sandbox isolation).

Research and experimentation. AutoGen originated in a research context (Microsoft Research) and it shows — it has rich tooling for logging conversations, replaying them, introspecting agent behavior, and experimenting with different configurations. For studying multi-agent behavior or evaluating different orchestration strategies, AutoGen provides a good experimental platform.

Where the Conversation Model Creates Friction

For production AI features with predictable behavior requirements, the conversation model introduces friction that custom pipelines avoid.

Non-deterministic routing. The GroupChatManager decides which agent speaks next using a model call. The decision is probabilistic. Running the same workflow twice may produce different agent sequences, and debugging why the critic spoke before the writer — when the writer should have gone first — requires reading the manager's reasoning, which is embedded in opaque model calls.

In custom pipelines, the sequence is explicit in code. The order is deterministic. Debugging is straightforward.

Conversation overhead. Every inter-agent communication is a message in the shared conversation thread. By round 8 of a 12-round GroupChat, every agent's context window contains the full conversation history — including the turns where agents were talking to other agents about topics the current agent does not need to know. Context bloat compounds quickly in multi-agent conversations.

Custom pipelines pass only relevant data between stages. The writing agent receives the retrieved context and the outline — not the full conversation between the researcher and the manager about which passages to retrieve.

Termination conditions require care. AutoGen conversations terminate when an agent says TERMINATE (by convention) or when max_round is reached. Both are blunt instruments. A workflow that terminates early because an agent said something that pattern-matched the termination string — or that runs to max_round when it should have finished in round 4 — requires careful tuning of system prompts and termination logic.

Custom pipelines terminate explicitly: the stage completes, returns its output, and the orchestrator moves to the next stage.

Structured output is harder to guarantee. When the final output of a pipeline needs to be a specific JSON structure (a StudyGuide object with defined sections and citations), eliciting that structure from a conversational workflow requires careful system prompt engineering across all participating agents. Any agent in the conversation can produce output that breaks the final structure expectation.

Custom pipelines validate output at each stage boundary with typed schemas. A stage that produces malformed output fails immediately at that stage, not at the consumer.

AutoGen v0.4 and the AgentChat API

AutoGen's v0.4 release (late 2024) introduced significant architectural changes. The new autogen-agentchat package adopts an async-first, component-based design that addresses some of the v0.2 friction points:

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.anthropic import AnthropicChatCompletionClient

model_client = AnthropicChatCompletionClient(model="claude-sonnet-4-6")

researcher = AssistantAgent(
    name="Researcher",
    model_client=model_client,
    tools=[get_verse, search_passages],
    system_message="Retrieve relevant passages for the given study topic.",
)

writer = AssistantAgent(
    name="Writer",
    model_client=model_client,
    system_message="Write study content from the researcher's findings.",
)

team = RoundRobinGroupChat([researcher, writer], max_turns=6)

async def run():
    result = await team.run(task="Generate a study on Romans 8:28")
    print(result.messages[-1].content)

The v0.4 architecture is cleaner, more testable, and better suited to async production environments. RoundRobinGroupChat enforces a fixed agent order rather than relying on a manager model — removing one source of non-determinism. The component model makes individual agents more composable and independently testable.

The shift between v0.2 and v0.4 is informative: the framework moved toward more explicit control flow as it matured from research tool toward production framework. The patterns that work in production tend toward explicitness.

When to Reach for AutoGen

AutoGen is a good fit when:

Prototyping a multi-agent approach before committing to a custom implementation. AutoGen lets you validate that agents collaborating produces better results than a single agent, before investing in custom orchestration code.
The workflow is genuinely conversational and the agent sequence is not fully known in advance. Open-ended research tasks where agents need to ask each other questions fit AutoGen's model well.
Code generation and execution is central to the task. AutoGen's code execution infrastructure is mature and handles cases that are tedious to implement from scratch.
Team familiarity with Python — AutoGen is Python-only. For a TypeScript-first stack, the operational complexity of a Python sidecar may outweigh AutoGen's benefits.

AutoGen is a poor fit when:

Predictable, structured output is required. If the pipeline must produce a specific JSON shape every time, explicit pipelines with schema validation at each stage are more reliable.
Per-stage model selection matters for cost efficiency. AutoGen supports different model configurations per agent, but the framework overhead makes fine-grained cost control harder than in a custom pipeline.
Latency is tightly constrained. The conversation overhead — every inter-agent message extending all agents' context windows — accumulates. A custom pipeline with parallel stage execution and minimal context propagation is generally faster.
The system is already in TypeScript/Node.js. Introducing Python for AutoGen in a Node.js stack adds deployment complexity, a new language runtime, and a process-boundary for inter-service communication.

What AutoGen Taught Me About Custom Design

Even without using AutoGen in production, studying it clarified some principles that shaped the custom pipeline design:

The GroupChatManager's non-determinism — using a model to decide agent order — made explicit how important deterministic routing is in production. The custom orchestrator's task graph with defined dependencies is a direct response to wanting routing to be code, not model output.

AutoGen's conversation model — where all context accumulates in a shared thread — made explicit the value of targeted context propagation. Custom pipelines pass only the relevant subset of state to each stage. This came from seeing what happens when you pass everything.

AutoGen's termination condition complexity — agents needing to signal completion through message content — made explicit the value of explicit completion semantics. In custom pipelines, a stage is complete when its function returns. No ambiguity.

Frameworks are most valuable when their assumptions match your problem. AutoGen's assumptions match conversational, open-ended, research-oriented tasks. This platform's tasks — structured output, predictable pipelines, tight cost control, TypeScript stack — match a custom orchestration approach more closely. The framework still informed the design by illustrating which problems it was solving and which ones it was introducing.

Keep Reading

Case Study

In Progress

Bible Verse — Case Study

Production SaaS Platform · Full-Stack · Founder & Sole Engineer

A domain-driven SaaS platform with five independently scalable system boundaries: scripture content delivery, RAG-backed AI study, real-time community interaction, async media processing, and infrastructure services — built and operated end-to-end.

Our Results

37K+

Verses Indexed

AI Models

Bounded Domains

Job Queues

How We Built It

RAG pipeline grounding AI responses in actual scripture rather than model memory
Hybrid Llama / OpenAI routing — local inference for cost, API fallback for quality at the edge
Non-blocking media processing — FFmpeg jobs enqueued via BullMQ, API never waits on transcoding
Cross-instance real-time consistency via Redis pub/sub behind WebSocket and WebRTC layers

Lessons Learned

Domain boundaries enforced at the service layer prevent coupling long before scale demands microservices.
RAG retrieval quality matters more than model size — better embeddings outperform a larger model on poor context.
Async queue design should be first-class, not bolted on; BullMQ worker isolation saved the request path repeatedly.

Stack

Nuxt 3TypeScriptNitroPostgreSQLPrismaRedisBullMQWeaviateMinIOFFmpegWebRTCWebSocketsLlama 3.2OpenAI APIKubernetes

View Full Case Study

Written by

Donavan Jones Full-Stack Engineer & Systems Architect

5+ years building production systems · AI, Backend & Infrastructure · Founder of Bible Logic

Full-stack engineer with 5+ years of hands-on experience designing and shipping production systems — from Nuxt 3 frontends and Nitro APIs to self-hosted Kubernetes clusters, RAG pipelines, and real-time AI applications. Everything I write comes from systems I've designed, deployed, and operated in production.

5+ Years Experience AI Systems Specialist Kubernetes & Infrastructure

Nuxt 3TypeScriptPostgreSQLKubernetesRAG / LLMWebRTCAWS IVSRedis

Full Author Bio GitHub LinkedIn Resume Systems

Menu