Microsoft’s Agent Framework abstracts agent creation into declarative configurations and tool registrations. You define an agent, register tools, and the framework handles execution. The abstraction is convenient until something breaks or performance degrades.
Understanding the runtime architecture reveals where latency accumulates, how state management creates race conditions, and why certain tool invocation patterns cause memory leaks. The framework makes simple cases trivial. Complex cases require understanding what’s happening beneath the abstraction.
This is the internal execution model. Not the documentation version. The version you need when debugging why your agent is slow, why state isn’t persisting correctly, or why tool invocations are failing intermittently.
Agent Instantiation and Lifecycle
Agents are not long-lived objects. Each request instantiates a new agent instance. The framework maintains agent definitions as templates, not instances. When a request arrives, the framework:
- Deserializes the agent definition from configuration
- Resolves tool dependencies from the tool registry
- Instantiates a new agent instance with bound tools
- Loads conversation state from persistent storage
- Executes the turn
- Persists updated state
- Discards the agent instance
This instantiation-per-request model means agent constructors execute on every turn. Expensive initialization—loading models, establishing connections, building caches—happens repeatedly unless explicitly moved to tool initialization or state management.
The framework doesn’t pool agent instances. Each turn is a fresh instantiation. This ensures isolation between requests but prevents optimizations that require warm caches or persistent connections. If your agent initialization takes 200ms, every turn starts with 200ms latency regardless of the actual work being done.
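The shape of that per-request lifecycle is easier to see as code. A minimal sketch, using hypothetical `StateStore`, `Agent`, and `handle_request` names rather than the framework's real classes:

```python
import time

# Hypothetical stand-ins for the framework's internals; names are illustrative only.
class StateStore:
    def __init__(self):
        self._data = {}

    def load(self, conversation_id):
        return dict(self._data.get(conversation_id, {"history": []}))

    def save(self, conversation_id, state):
        self._data[conversation_id] = state


class Agent:
    def __init__(self, definition, tools):
        # This constructor runs on EVERY turn. Anything expensive here
        # (model warm-up, connection setup, cache building) repeats per request.
        time.sleep(0.2)  # simulated 200ms of initialization cost
        self.definition = definition
        self.tools = tools

    def run_turn(self, state, user_message):
        state["history"].append({"user": user_message, "assistant": "..."})
        return state


def handle_request(definition, registry, store, conversation_id, user_message):
    tools = [registry[name] for name in definition["tools"]]  # resolve tool dependencies
    agent = Agent(definition, tools)                          # fresh instance, pays init cost
    state = store.load(conversation_id)                       # load persisted state
    state = agent.run_turn(state, user_message)               # execute the turn
    store.save(conversation_id, state)                        # persist updated state
    # `agent` goes out of scope here: no pooling, no warm caches.
```

Anything that should survive across turns, such as connection pools or caches, belongs outside the constructor in module-level or process-level objects that the per-turn instance only references.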
The Turn Execution Pipeline
A turn executes through a pipeline with discrete phases:
Input Processing: The user message gets validated, normalized, and enriched with context from the conversation state. This phase applies input transformers, content filters, and context injection. Latency here comes from state retrieval and transformer execution.
LLM Invocation: The framework constructs a prompt from the conversation history, system message, tool definitions, and user input. This gets sent to the configured LLM endpoint. Latency is dominated by LLM inference time plus network round-trip. For GPT-4 class models, expect 2-8 seconds for typical completions.
Response Parsing: The LLM response gets parsed for tool calls, structured outputs, or direct responses. The framework expects specific JSON structures for tool invocations. Malformed responses trigger retry logic with clarification prompts. Parsing failures add an additional LLM round-trip.
Tool Execution: Identified tool calls execute sequentially unless explicitly parallelized. Each tool invocation is a synchronous function call with timeout protection. Long-running tools block the turn pipeline. Tools that fail trigger error recovery flows that may involve additional LLM calls.
Response Assembly: Tool outputs get formatted and appended to conversation history. The framework may make another LLM call to synthesize tool results into a natural language response. This synthesis call adds latency when tools return structured data that needs interpretation.
State Persistence: Updated conversation state gets written to persistent storage. This is synchronous—the turn doesn’t complete until state is persisted. Storage latency directly adds to user-perceived latency.
Each phase is sequential. Total turn latency is the sum of all phases. A turn with two tool invocations might involve: state load (100ms) + LLM call (3s) + tool 1 (500ms) + tool 2 (800ms) + synthesis LLM call (2s) + state save (100ms) = 6.5 seconds minimum.
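Because nothing overlaps, the floor on turn latency is a plain sum. A toy calculation using the illustrative numbers from the example above:

```python
# Illustrative phase durations in seconds; real values vary per deployment.
phases = {
    "state_load": 0.1,
    "planning_llm_call": 3.0,
    "tool_1": 0.5,
    "tool_2": 0.8,
    "synthesis_llm_call": 2.0,
    "state_save": 0.1,
}

# No phase overlaps with another, so the minimum turn latency is the plain sum.
total = sum(phases.values())
print(f"minimum turn latency: {total:.1f}s")  # -> 6.5s
```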
State Management and Persistence
The conversation state lives in external storage, not agent memory. The framework loads state at turn start and persists it at turn end. Between these points, state exists only in agent instance memory.
State is keyed by conversation ID. Each conversation maintains a separate state. Multiple concurrent turns on the same conversation create race conditions. The framework doesn’t implement optimistic locking or conflict resolution. Last write wins.
If two turns execute concurrently on the same conversation:
- Both load state at T0
- Both modify state independently
- Turn A completes at T1, writes state
- Turn B completes at T2, writes state
- Turn A’s state changes are lost
This is not theoretical. It happens in production when users send messages rapidly or when background processes update conversation state. The framework provides no transaction isolation. Your application must implement this if concurrent state access matters.
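The usual application-level mitigation is optimistic concurrency: load a version number with the state and refuse the write if someone else saved first. A minimal sketch, assuming a hypothetical versioned store rather than anything the framework ships:

```python
import json

class VersionConflict(Exception):
    """Raised when another turn saved state after we loaded it."""


class VersionedStateStore:
    # Hypothetical store; a real backend would use an ETag (Cosmos DB),
    # WATCH/MULTI (Redis), or a rowversion column (SQL) for the same effect.
    def __init__(self):
        self._docs = {}  # conversation_id -> (version, serialized state)

    def load(self, conversation_id):
        version, doc = self._docs.get(conversation_id, (0, "{}"))
        return version, json.loads(doc)

    def save(self, conversation_id, state, expected_version):
        current_version, _ = self._docs.get(conversation_id, (0, "{}"))
        if current_version != expected_version:
            raise VersionConflict(f"expected v{expected_version}, found v{current_version}")
        self._docs[conversation_id] = (expected_version + 1, json.dumps(state))


def run_turn_with_retry(store, conversation_id, apply_turn, max_attempts=3):
    # Load, modify, and conditionally save; reload and retry on conflict.
    for _ in range(max_attempts):
        version, state = store.load(conversation_id)
        new_state = apply_turn(state)
        try:
            store.save(conversation_id, new_state, expected_version=version)
            return new_state
        except VersionConflict:
            continue
    raise RuntimeError("could not persist state without conflict")
```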
State serialization format is JSON by default. Large conversation histories serialize to large JSON documents. State load and save time increases linearly with conversation length. After 50+ turns, state operations can exceed 500ms per turn. The framework doesn’t implement state pruning or compression. You must do this manually.
Tool Registration and Invocation
Tools register with the framework as callable functions with schemas. The schema defines parameters, types, and descriptions. The LLM receives these schemas in the prompt and indicates tool usage through structured responses.
Tool invocation flow:
- LLM response indicates tool call with function name and arguments
- Framework validates arguments against tool schema
- Framework resolves function name to registered callable
- Framework invokes function synchronously with timeout
- Function returns result or throws exception
- Framework captures result/error and continues execution
The synchronous invocation model means tool execution blocks the turn pipeline completely. A tool that takes 5 seconds to execute adds 5 seconds to turn latency. There is no built-in async execution model. Tools that need async behavior must implement their own async-to-sync bridging.
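Tools that wrap async I/O therefore have to do the bridging themselves. A sketch of one common approach, using a hypothetical weather lookup:

```python
import asyncio

async def _fetch_weather_async(city: str) -> dict:
    # Placeholder for real async I/O (aiohttp, an async SDK, etc.).
    await asyncio.sleep(0.1)
    return {"city": city, "temperature_c": 21}

def fetch_weather(city: str) -> dict:
    """Synchronous tool entry point that the framework can invoke directly."""
    # asyncio.run creates an event loop for this one call and tears it down.
    # In a process that already runs an event loop, dispatch to that loop
    # (e.g. run_coroutine_threadsafe) instead of calling run() here.
    return asyncio.run(_fetch_weather_async(city))
```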
Tool timeout defaults to 30 seconds. When timeout triggers, the framework catches the timeout exception and presents it to the LLM as an error. The LLM may retry the tool call with different parameters or abandon the tool usage. Timeout handling adds another LLM round-trip.
Tool argument validation is strict. Type mismatches between the LLM’s provided arguments and the schema cause immediate failures. The framework doesn’t attempt type coercion. A schema expecting an integer that receives the string “42” fails validation. The error returns to the LLM for correction, adding another round-trip.
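If trivially fixable mismatches keep costing you round-trips, a thin coercion layer in front of your tools can absorb them before strict validation sees the arguments. A sketch, assuming you control the wrapper that actually gets registered:

```python
def coerce_arguments(raw_args: dict, schema: dict) -> dict:
    """Best-effort coercion of LLM-provided arguments to the schema's types.

    `schema` here is a plain {name: python_type} mapping, not the framework's
    schema object; adapt the lookup to whatever schema format you register.
    """
    coerced = {}
    for name, expected_type in schema.items():
        value = raw_args.get(name)
        if value is None or isinstance(value, expected_type):
            coerced[name] = value
        else:
            try:
                coerced[name] = expected_type(value)   # e.g. int("42") -> 42
            except (TypeError, ValueError):
                coerced[name] = value                  # let strict validation reject it
    return coerced


# Example: the LLM sent a string where the schema wants an integer.
args = coerce_arguments({"count": "42"}, {"count": int})
assert args == {"count": 42}
```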
The Tool Context Problem
Tools receive no direct context about the conversation or agent state unless explicitly passed as parameters. A tool doesn’t know:
- Which conversation it’s executing in
- What previous tools were called this turn
- What the user’s original message was
- What the agent’s goal or persona is
This isolation is intentional for functional purity but creates practical problems. Tools that need context must receive it through parameters, which requires the LLM to extract and pass relevant context in every tool call.
If context isn’t in the tool schema, the LLM can’t provide it. If the schema includes extensive context parameters, the schema becomes verbose and the LLM is more likely to invoke tools incorrectly.
The workaround is dependency injection through tool initialization. Tools can receive conversation ID or agent configuration at registration time. This creates stateful tools that violate the functional interface but have access to needed context. The framework doesn’t prevent this. It just doesn’t facilitate it.
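In practice this looks like a factory that closes over the context when the tool is registered. A sketch with hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class ConversationContext:
    conversation_id: str
    user_locale: str

def make_lookup_orders_tool(context: ConversationContext):
    """Return a tool function that already knows its conversation context."""
    def lookup_orders(customer_id: str) -> list[dict]:
        # The LLM only has to supply customer_id; conversation_id and locale
        # come from the closure instead of the tool schema.
        return query_orders(customer_id, context.conversation_id, context.user_locale)
    return lookup_orders

def query_orders(customer_id, conversation_id, locale):
    # Placeholder for a real data access call.
    return [{"customer": customer_id, "conversation": conversation_id, "locale": locale}]

# At agent setup time (per conversation), build and register the bound tool:
tool = make_lookup_orders_tool(ConversationContext("conv-123", "en-US"))
print(tool("cust-42"))
```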
Where Latency Actually Comes From
Typical turn latency breakdown for an agent with tool usage:
- State load from storage: 50-200ms
- Prompt construction: 10-50ms
- First LLM call (planning): 2-5 seconds
- Tool argument validation: 1-5ms per tool
- Tool execution: 100ms-10s per tool depending on tool
- Tool result formatting: 5-20ms per tool
- Second LLM call (synthesis): 2-5 seconds
- Response formatting: 10-30ms
- State save to storage: 50-200ms
Single tool usage, two LLM calls: 4-10 seconds typical. Three tool calls requiring sequential execution: 6-15 seconds typical. Tool failure requiring retry: add 2-5 seconds per retry.
The LLM calls dominate latency in simple cases. Tool execution dominates when tools perform I/O—API calls, database queries, file operations. State operations become significant after conversation history exceeds 30-40 turns.
Network latency to the LLM endpoint isn’t something you can configure away; you can only reduce it through deployment choices. Using Azure OpenAI in the same region as your agent shaves 50-100ms compared to calling OpenAI’s API. Either way, each call is still measured in seconds.
Streaming and Incremental Response
The framework supports streaming LLM responses. Tokens arrive incrementally and can be sent to the client as they’re generated. This improves perceived latency—the user sees response start immediately instead of waiting for complete generation.
Streaming loses its value as soon as tools are involved. When the LLM decides to invoke a tool, it must complete the structured JSON describing the tool call before tool execution can begin. The framework buffers streaming responses until it can determine whether the response is a direct answer or a tool invocation.
For responses involving tool calls, streaming provides no latency benefit. The framework must wait for:
- Complete tool call JSON
- Tool execution completion
- Synthesis LLM call completion
The user sees nothing until the final synthesis is complete. Streaming only helps for turns where the LLM responds directly without tool usage.
Memory Management and Conversation History
Conversation history accumulates in state. Each turn appends user message, assistant response, and tool invocations to the history. The framework includes full history in every LLM prompt to maintain context.
As conversations lengthen, prompts grow. GPT-4 can handle 128k tokens but costs increase linearly with prompt length. A 50-turn conversation might generate 20k-30k token prompts. At $0.03/1k tokens, that’s $0.60-$0.90 per turn.
The framework doesn’t implement automatic history pruning, summarization, or windowing. After some number of turns—depending on verbosity—you’ll hit context length limits and turns will start failing with context overflow errors.
Manual history management strategies:
- Summarize old turns into compressed context
- Maintain a sliding window of recent turns (sketched below)
- Prune tool invocations from history after results are processed
- Store full history externally and include only relevant portions in prompts
None of these are automatic. You implement them in application logic or custom middleware. The framework treats history as an append-only sequence that grows unbounded.
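A sliding window is the simplest of these to implement. A minimal sketch that keeps system messages and only the most recent turns, assuming a plain role/content message list:

```python
def prune_history(history: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep system messages and the last `max_turns` exchanges.

    `history` is assumed to be a list of {"role": ..., "content": ...} dicts,
    oldest first; adjust the shape to whatever your state actually stores.
    """
    system = [m for m in history if m["role"] == "system"]
    turns = [m for m in history if m["role"] != "system"]
    recent = turns[-max_turns * 2:]   # each turn is roughly a user + assistant pair
    dropped = turns[:-max_turns * 2]  # empty while the history is still short
    if dropped:
        # Optionally replace the dropped span with a one-message summary here,
        # produced by a cheap summarization call outside the hot path.
        recent = [{"role": "system", "content": f"[{len(dropped)} earlier messages omitted]"}] + recent
    return system + recent
```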
Error Handling and Recovery
Tool execution errors get caught by the framework and presented to the LLM as error messages. The LLM can attempt recovery—retry with different arguments, try alternate tools, or inform the user of the failure.
This recovery loop can create cascading latency. Consider:
- LLM calls tool A with bad arguments
- Tool A fails validation (add 1ms)
- LLM receives error, generates retry (add 3s)
- LLM calls tool A with corrected arguments
- Tool A executes but throws exception (add 500ms)
- LLM receives exception, tries tool B instead (add 3s)
- Tool B succeeds
Total: roughly 6.5 seconds of recovery overhead across two failures, with three LLM calls in the turn.
The framework has no circuit breaker. If a tool fails repeatedly, the LLM will keep retrying until max turns or timeout is reached. A buggy tool can consume the entire conversation budget attempting recovery.
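A small circuit breaker around the tool callable keeps a buggy tool from consuming that budget. A sketch of one you could build yourself, since the framework doesn’t provide it:

```python
import time

class CircuitOpen(Exception):
    pass

class CircuitBreaker:
    """Trip after `max_failures` consecutive errors; stay open for `reset_after` seconds."""
    def __init__(self, max_failures: int = 3, reset_after: float = 60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast: the LLM gets an immediate, explicit error instead
                # of waiting on another doomed execution.
                raise CircuitOpen("tool temporarily disabled after repeated failures")
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```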
Timeout configuration exists at tool level and turn level:
- Tool timeout: how long a single tool execution can run
- Turn timeout: how long the entire turn can run including all LLM calls and tool executions
Turn timeout is the hard limit. When it triggers, the turn fails regardless of progress, the user gets an error, and conversation state may be left inconsistent depending on when the timeout occurred.
Concurrency and Thread Safety
The framework is not thread-safe within a single agent instance. Agent instances are single-threaded by design. Concurrent tool executions don’t happen unless explicitly implemented in application code.
However, multiple agent instances can run concurrently serving different conversations. This is safe because each conversation maintains an isolated state. The race condition on concurrent turns in the same conversation (discussed earlier) remains.
Tool implementations must be thread-safe if they’re shared across agent instances. A tool registered once gets invoked from multiple threads serving different conversations. Tool state must be either immutable or protected with appropriate synchronization.
The framework provides no synchronization primitives. Tools that need coordination across invocations must implement this themselves—external locks, atomic operations, or coordination services.
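A shared tool that keeps mutable state therefore needs its own lock. A minimal sketch:

```python
import threading

class RateCounterTool:
    """Example of a shared tool whose mutable state is protected by a lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls_per_conversation: dict[str, int] = {}

    def __call__(self, conversation_id: str) -> int:
        # Multiple agent instances on different threads may invoke this tool
        # concurrently; the lock keeps the counter update atomic.
        with self._lock:
            count = self._calls_per_conversation.get(conversation_id, 0) + 1
            self._calls_per_conversation[conversation_id] = count
            return count
```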
The Observability Problem
The framework generates logs but not structured traces. Understanding what happened during a turn requires correlating log entries by timestamp and conversation ID. There’s no trace ID that flows through the entire turn pipeline.
When debugging latency, you must:
- Parse logs for timestamps
- Identify phase boundaries manually
- Calculate durations between phases
- Attribute latency to specific components
The framework doesn’t expose metrics for:
- Per-phase latency distribution
- Tool execution time breakdown
- LLM call time vs network time
- State operation performance
- Failure rates by failure type
Custom instrumentation provides this. Wrap tools in timing decorators. Add middleware to instrument LLM calls. Log state operation durations. Aggregate metrics externally.
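Tool timing is the easiest place to start. A sketch of a decorator you could wrap registered tools with; the logging backend is whatever you already run:

```python
import functools
import logging
import time

logger = logging.getLogger("tool_timing")

def timed_tool(fn):
    """Log wall-clock duration and outcome for every tool invocation."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            outcome = "ok"
            return result
        except Exception:
            outcome = "error"
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # Ship these records to whatever metrics backend you already use.
            logger.info("tool=%s outcome=%s duration_ms=%.1f", fn.__name__, outcome, elapsed_ms)
    return wrapper

@timed_tool
def search_catalog(query: str) -> list[str]:
    return [f"result for {query}"]
```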
Production observability requires building instrumentation the framework doesn’t provide.
Scalability Characteristics
The framework scales horizontally—add more instances to handle more concurrent conversations. Each instance is stateless except for an in-memory cache of agent definitions and tool registrations.
State storage is the bottleneck. All instances contend for state storage I/O. As load increases:
- State load latency increases due to storage contention
- State save latency increases due to write contention
- Concurrent updates to same conversation increase conflict likelihood
State storage must scale with conversation volume. Redis works well for low-latency requirements. Cosmos DB provides global distribution but higher latency. SQL databases work but require careful indexing and connection pooling.
Tool execution doesn’t scale automatically. Tools that call external APIs are limited by those APIs’ rate limits and latency. Tools that perform computation are limited by CPU. Tools that need GPU must run on GPU instances or call external GPU services.
The framework provides no built-in rate limiting, request queuing, or backpressure mechanisms. If tool executions overwhelm downstream services, tools fail and LLMs attempt recovery, creating retry storms.
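Protection for a downstream service therefore has to live inside the tool. A sketch that caps concurrency with a semaphore and spaces calls out with a minimum interval; the names and limits are illustrative:

```python
import threading
import time

class ToolRateLimiter:
    """Cap concurrency and enforce a minimum interval between downstream calls."""
    def __init__(self, max_concurrent: int = 4, min_interval_s: float = 0.25):
        self._semaphore = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._min_interval_s = min_interval_s
        self._last_call = 0.0

    def run(self, fn, *args, **kwargs):
        with self._semaphore:               # backpressure: at most N calls in flight
            with self._lock:                # spacing: at most one call per interval
                wait = self._min_interval_s - (time.monotonic() - self._last_call)
                if wait > 0:
                    time.sleep(wait)
                self._last_call = time.monotonic()
            return fn(*args, **kwargs)

limiter = ToolRateLimiter()

def call_inventory_api(sku: str) -> dict:
    # Hypothetical downstream call wrapped by the limiter.
    return limiter.run(lambda: {"sku": sku, "in_stock": True})
```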
Control Flow Edge Cases
The framework’s turn-based execution model creates edge cases:
Abandoned Turns: User closes connection before turn completes. The turn continues executing. Tool invocations complete. The state gets saved. Resources are consumed. There’s no cancellation signal.
Overlapping Turns: User sends a second message before the first turn completes. Both turns execute independently. Second turn loads state before first turn saves. The first turn’s state changes are lost to the second turn.
Tool Timeouts During Synthesis: Tool executes successfully but synthesis LLM call times out. Tool side effects happen but the user never sees results. Turn fails. Retry triggers duplicate tool execution with duplicate side effects.
Partial State Corruption: Turn fails during state save. State is partially written. The next turn loads the corrupted state, the framework doesn’t detect the corruption, and execution continues in an inconsistent state until something visibly breaks.
These edge cases aren’t documented. You discover them in production. The framework provides no built-in mitigation. Your application must handle:
- Idempotent tool design (see the sketch after this list)
- State validation and recovery
- Turn cancellation on disconnect
- Concurrency control for overlapping turns
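The first of these, idempotent tool design, usually means deriving a stable key from the invocation and refusing to repeat the side effect for it. A sketch with an in-memory record standing in for the durable store a real deployment needs:

```python
import hashlib
import json

_processed: dict[str, dict] = {}  # in production this must live in durable storage

def _idempotency_key(tool_name: str, conversation_id: str, turn_id: str, args: dict) -> str:
    payload = json.dumps([tool_name, conversation_id, turn_id, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def create_ticket(conversation_id: str, turn_id: str, summary: str) -> dict:
    """Side-effecting tool that is safe to re-run after a failed synthesis call."""
    key = _idempotency_key("create_ticket", conversation_id, turn_id, {"summary": summary})
    if key in _processed:
        return _processed[key]                        # duplicate invocation: return the prior result
    ticket = {"id": key[:8], "summary": summary}      # placeholder for the real side effect
    _processed[key] = ticket
    return ticket
```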
The Extension Points
The framework provides extension points for custom behavior:
Middleware: Hooks that execute before/after turn phases. Use for instrumentation, authentication, rate limiting, or custom error handling. Middleware runs in pipeline order and can short-circuit execution.
Custom State Storage: Implement the state storage interface to use alternative backends. The interface is simple—get, set, delete by conversation ID. Transactions and consistency are your problem.
Tool Adapters: Wrap existing functions in tool interfaces. The adapter handles argument mapping, error translation, and timeout enforcement. Adapters let you integrate existing code as tools without modification.
Prompt Modifiers: Inject content into prompts before LLM calls. Use for dynamic context, user preferences, or runtime configuration. Modifiers execute per-call and can see full prompt before submission.
Extensions execute within the turn pipeline. They block execution while running. Expensive extensions add latency directly to turn time.
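The exact middleware signature depends on the framework version you’re on, so treat this as the shape of the pattern rather than its API: a generic before/after hook that threads a trace ID through the turn, addressing the observability gap described earlier. `request` and `call_next` are illustrative names:

```python
import logging
import time
import uuid

logger = logging.getLogger("turn_trace")

def tracing_middleware(request, call_next):
    """Generic before/after hook; `request` and `call_next` are illustrative,
    not the framework's real middleware signature."""
    trace_id = str(uuid.uuid4())
    request["trace_id"] = trace_id          # make the ID visible to later stages and tools
    start = time.perf_counter()
    logger.info("trace=%s turn_start conversation=%s", trace_id, request.get("conversation_id"))
    try:
        return call_next(request)           # blocks for the rest of the pipeline
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("trace=%s turn_end duration_ms=%.1f", trace_id, elapsed_ms)
```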
When the Abstraction Breaks
The framework abstracts agent execution into declarative configuration and tool registration. This works for simple request-response patterns with minimal state and fast tools.
It breaks when:
- Conversation history grows beyond manageable size
- Tools have complex dependency relationships
- State updates need transactional semantics
- Latency requirements are sub-second
- Tool execution requires streaming or async patterns
- Multiple agents need coordination
- Conversation spans multiple channels or sessions
The framework doesn’t provide solutions for these cases. You build solutions on top using the extension points and accepting the performance characteristics of the execution model.
Alternatively, you don’t use the framework. The abstraction is helpful until it’s constraining. Complex agentic applications often outgrow frameworks quickly as requirements evolve beyond the framework’s execution model.
Performance Optimization Strategies
Given the execution model, optimization strategies:
State Management:
- Keep conversation history small through aggressive pruning
- Store large context externally, include only recent turns in state
- Use compressed serialization formats
- Implement lazy loading for conversation metadata
Tool Design:
- Make tools fast—every millisecond matters
- Batch tool operations when possible
- Cache tool results when appropriate
- Implement tool-level async execution for I/O-bound operations
LLM Optimization:
- Use smaller, faster models when task permits
- Implement prompt caching for repeated context
- Use streaming for direct responses
- Minimize tool calls through better prompting
Architecture:
- Deploy close to LLM endpoints to reduce network latency
- Use fast state storage with local replicas
- Implement turn-level caching for duplicate requests
- Add reverse proxy with response caching
These optimizations work within the framework’s constraints. They don’t change the fundamental execution model. If the model doesn’t fit your requirements, the framework isn’t the right tool.
The Documentation Gap
The official documentation explains agent creation, tool registration, and basic usage. It doesn’t explain:
- The turn execution pipeline in detail
- State management race conditions
- Tool invocation performance characteristics
- Memory management requirements
- Concurrency constraints
- Error recovery behavior
- Latency sources and optimization strategies
You learn these by reading framework source code, debugging production issues, or measuring carefully. The abstractions look simple in documentation. The implementation has complexity that emerges under load or in edge cases.
Understanding the runtime architecture matters when the framework doesn’t behave as expected. Which happens in every non-trivial application eventually.
The Framework Is Not Agentic
Despite the name, the framework implements request-response with tool calling, not autonomous agents. Agents built on it don’t:
- Execute continuously toward goals
- Make independent decisions about when to act
- Maintain long-term memory across conversations
- Coordinate with other agents
- Learn or adapt behavior over time
The framework is a structured wrapper around LLM inference with tool integration. Useful for chatbots, assistants, and interactive applications. Not suitable for autonomous agents, multi-agent systems, or applications requiring complex control flow.
The execution model is turn-based request-response. One user input, one assistant output, repeat. This is a chatbot model, not an agent model. Framework capabilities are constrained by this model.
If you need actual agentic behavior—goal-driven execution, continuous operation, multi-agent coordination—you’re building on top of the framework or building something else entirely. The framework provides primitives, not a complete agent architecture.
Making It Work
The framework works well for:
- Conversational interfaces with bounded history
- Tool-augmented responses with latency tolerance
- Simple state management requirements
- Standard request-response patterns
It requires careful engineering for:
- Low-latency applications
- Complex state management
- Long-running conversations
- High-concurrency scenarios
- Tool execution coordination
Understanding the runtime architecture and execution model makes the difference between using the framework successfully and fighting against its constraints. The abstractions are helpful until you need to debug performance, handle edge cases, or optimize for production load.
Then you need to know what’s actually happening beneath the convenience layer. The agent isn’t magic. It’s an instantiation pipeline, a turn execution loop, tool invocations with timeouts, LLM calls with retry logic, and state management with race conditions.
The framework makes the simple case trivial. The complex case requires understanding where the abstraction ends and the implementation begins.
That boundary is closer than the documentation suggests.