Agent systems often present themselves as stateless request-response flows. An agent receives a prompt, executes a turn (observation, reasoning, action), and returns a result. The framing is linear and bounded.
In production, this breaks. An agent turn is not atomic. It is a sequence of operations—state reads, external API calls, model invocations, database writes—that can fail at any point. The agent’s perception of the world is stale by the time it acts. Its actions trigger side effects that propagate through other systems. Multiple agents may be executing concurrently against shared state.
An agent turn is a transaction. It has boundaries, it can fail, it can leave side effects. Treating it as anything less results in systems that are unpredictable, unrecoverable, and unsafe to scale.
The Problem: Agent Turns Are Not Atomic
An agent turn executes a sequence of steps:
- Observe the current state (read from databases, APIs, or memory)
- Reason about the observation (invoke the model, run planning logic)
- Select an action (decide what to do next based on reasoning)
- Execute the action (call an API, modify state, trigger side effects)
- Verify the outcome (confirm the action succeeded or failed)
Each step is a point of failure. The observation is stale by step 2. The model invocation can time out. The API call can fail mid-request. The database write can be interrupted. The verification can reveal that the action only partially succeeded.
Consider an agent handling customer orders:
def execute_agent_turn(order_id: str, agent_state: AgentMemory):
    # Step 1: Observe
    order = fetch_order(order_id)  # API call: can fail, can time out
    inventory = check_inventory(order.items)  # Another API call
    # Step 2: Reason
    decision = model.think(f"Should we fulfill {order}?")  # Model call: can fail, can hang
    # Step 3: Action
    if decision == "approve":
        # Step 4: Execute side effects
        new_shipment = create_shipment(order)  # API call: can fail after creation
        agent_state.update(f"Shipment {new_shipment.id} created")
        send_notification(order.customer)  # External system: may never receive it
    # Step 5: Verify
    manifest = check_manifest(order_id)  # API call: may show partial state
What happens when create_shipment() succeeds but send_notification() fails? The shipment exists. The notification does not. The agent’s memory records the shipment and carries no trace of the failed notification. The customer sees no confirmation. Any inventory the shipment consumed stays decremented. The order is now in an inconsistent state across systems.
Multiply this across hundreds of agents executing concurrently. The inconsistencies compound.
Transaction Boundaries Must Be Explicit
A transaction is a unit of work that either completes entirely or fails entirely. It has clear boundaries: a start point and a commit or rollback point. The transaction protects invariants—if the transaction succeeds, the system state is consistent; if it fails, state is unchanged.
Agent systems require explicit transaction boundaries because agent turns are inherently multi-step. The boundary answers the question: what constitutes success or failure of this turn?
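As a minimal sketch of what an explicit boundary means, the following uses an in-memory dictionary as the only state. The names `turn_transaction` and `approve_order` are hypothetical, and the snapshot-and-restore trick only works because nothing here touches an external system; real side effects need the rollback strategies discussed later.

```python
import copy
from contextlib import contextmanager


@contextmanager
def turn_transaction(state: dict):
    """Snapshot `state` on entry; restore the snapshot if the body raises."""
    snapshot = copy.deepcopy(state)
    try:
        yield state
    except Exception:
        # Roll back: discard every change made inside the boundary.
        state.clear()
        state.update(snapshot)
        raise


def approve_order(state: dict, order_id: str, fail_notification: bool) -> None:
    """Hypothetical turn with two writes that must land together."""
    with turn_transaction(state) as tx:
        tx["shipments"].append(order_id)
        if fail_notification:
            raise RuntimeError("notification service unavailable")
        tx["notified"].append(order_id)


state = {"shipments": [], "notified": []}
approve_order(state, "order-1", fail_notification=False)
try:
    approve_order(state, "order-2", fail_notification=True)
except RuntimeError:
    pass  # the failed turn left no trace in `state`
print(state)
```

The second turn fails after recording its shipment, and the boundary removes that shipment again, so the state only ever reflects whole turns.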
Consider three possible boundaries:
Boundary 1: Observation-only. The transaction includes only the observation phase. The agent reads state and returns it to the caller. The caller decides what to do.
def observe_only_transaction(order_id: str) -> tuple[Order, Inventory]:
    transaction_start()
    try:
        order = fetch_order(order_id)
        inventory = check_inventory(order.items)
        transaction_commit()
        return (order, inventory)
    except Exception:
        transaction_rollback()
        raise
This is safe but limits the agent to pure observation. The agent cannot execute actions.
Boundary 2: Full turn. The transaction includes observation, reasoning, and action. Either the turn completes entirely or it fails entirely.
def full_turn_transaction(order_id: str, agent_state: AgentMemory) -> TurnResult:
    transaction_start()
    try:
        # Observe current state
        order = fetch_order(order_id)
        inventory = check_inventory(order.items)
        # Reason
        decision = model.think(f"Should we fulfill {order}?")
        # Action: all side effects must succeed or all must be rolled back
        if decision == "approve":
            shipment = create_shipment(order)
            send_notification(order.customer)
            update_inventory(order.items, decrement=True)
        # Update agent state
        agent_state.update(f"Turn complete: {decision}")
        transaction_commit()
        return TurnResult(success=True, decision=decision)
    except Exception as e:
        transaction_rollback()
        agent_state.rollback()  # Revert agent's memory to pre-turn state
        return TurnResult(success=False, error=str(e))
This gives the strongest guarantee but requires all side effects to be transactional or reversible. If send_notification() fails, the shipment must be reverted.
Boundary 3: Action-only. The transaction includes only the action phase. Observation and reasoning are outside the transaction; action execution is inside.
def action_only_transaction(order_id: str, decision: str, agent_state: AgentMemory):
    # Observation and reasoning are pre-committed; we can't roll them back
    observation_snapshot = agent_state.last_observation
    if decision != "approve":
        # Nothing to execute, so there is no shipment to report
        return ActionResult(success=True, shipment_id=None)
    # Only the action is transactional
    transaction_start()
    try:
        shipment = create_shipment(order_id)
        send_notification(observation_snapshot.customer)
        update_inventory(observation_snapshot.items, decrement=True)
        transaction_commit()
        return ActionResult(success=True, shipment_id=shipment.id)
    except Exception as e:
        # Observation and reasoning are unchanged; only action is rolled back
        # But we have no way to undo the notification if it was sent
        transaction_rollback()
        return ActionResult(success=False, error=str(e))
This is pragmatic but creates asymmetry: the agent’s reasoning may be based on a stale observation, and because that reasoning is never rolled back, the agent can go on believing it acted after the action itself was undone.
Each boundary has trade-offs. The choice depends on what can be rolled back, what consistency matters, and how the agent will respond to failure.
Rollback Semantics Are Not Obvious
Rollback is not free. It requires either:
- Reversible operations. Every action has an inverse. Create → Delete, Increment → Decrement, Insert → Remove.
- Compensating transactions. An operation cannot be reversed, but a compensating operation can restore the system to a consistent state. Book flight → Offer refund. This does not restore the original state; it just moves to a different consistent state.
- Distributed transactions. Coordinate all systems to commit or abort together. This requires every system to support transactional semantics and two-phase commit or a similar protocol. This is slow and fragile.
- Accept inconsistency. Roll back what can be rolled back; leave the rest. This requires explicit knowledge of which systems are transactional and which are not.
Most agent systems choose the last option, accepting inconsistency, by default.
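The compensation strategy from the list above can be sketched as follows. This is an illustration under stated assumptions, not a full saga implementation: `Step` and `run_with_compensation` are hypothetical names, the steps are plain callables, and a step with no compensation is reported, not silently skipped.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensate: Optional[Callable[[], None]] = None  # None = cannot be undone


def run_with_compensation(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns (success, names of completed steps that could not be compensated).
    """
    completed: list[Step] = []
    try:
        for step in steps:
            step.action()
            completed.append(step)
        return True, []
    except Exception:
        leftovers: list[str] = []
        for step in reversed(completed):
            if step.compensate is None:
                leftovers.append(step.name)  # accepted inconsistency, made explicit
            else:
                step.compensate()
        return False, leftovers


log: list[str] = []


def failing_inventory_update() -> None:
    raise RuntimeError("inventory service down")


steps = [
    Step("create_shipment",
         lambda: log.append("shipment created"),
         lambda: log.append("shipment cancelled")),
    Step("send_notification", lambda: log.append("notification sent")),
    Step("update_inventory", failing_inventory_update, lambda: None),
]
ok, leftovers = run_with_compensation(steps)
print(ok, leftovers, log)
```

The shipment is cancelled, but the notification has no inverse, so it comes back in `leftovers` for a human or a follow-up turn to handle.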
An agent calls an external payment API to charge a customer. The API processes the charge and generates a transaction ID, but the connection drops before the response reaches the agent. The agent never receives the success confirmation. From the agent’s perspective, the operation failed. The agent rolls back the order. But the payment went through. The customer was charged and the order was cancelled.
This is not a rare edge case. Network failures, timeouts, and partial responses are common. Idempotency helps—if the payment API is idempotent, the agent can retry without creating duplicate charges. But not all systems are idempotent. Not all APIs document their idempotency guarantees.
Consider an agent that:
- Reads balance: current_balance = 1000
- Decides to transfer 500 to another account
- Executes the transfer
- Transfer succeeds but external confirmation is lost
- Agent rolls back (assumes transfer failed)
- Agent tries again
- Transfer already executed; balance is now 500, not the 1000 the agent believes
- If the system allows the second transfer, 1000 has moved instead of 500 and the account goes to 0
Rollback semantics require idempotency, which requires explicit design. Most agent systems do not have it.
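The transfer scenario above can be made retry-safe with an idempotency key. The sketch below assumes the external system honours such keys; `FakePaymentGateway` is a hypothetical in-memory stand-in, not a real payment API, and it loses its first response on purpose to reproduce the lost-confirmation case.

```python
import uuid


class FakePaymentGateway:
    """Hypothetical external API that deduplicates by idempotency key."""

    def __init__(self, balance: int, drop_next_response: bool = False) -> None:
        self.balance = balance
        self.drop_next_response = drop_next_response
        self._results: dict[str, int] = {}  # idempotency key -> balance after transfer

    def transfer(self, amount: int, idempotency_key: str) -> int:
        if idempotency_key not in self._results:
            # First time this key is seen: actually move the money.
            self.balance -= amount
            self._results[idempotency_key] = self.balance
        if self.drop_next_response:
            # The transfer happened, but the caller never hears about it.
            self.drop_next_response = False
            raise TimeoutError("response lost in transit")
        return self._results[idempotency_key]


def transfer_with_retry(gateway: FakePaymentGateway, amount: int, attempts: int = 3) -> int:
    key = str(uuid.uuid4())  # one key per logical transfer, reused on every retry
    for _ in range(attempts):
        try:
            return gateway.transfer(amount, key)
        except TimeoutError:
            continue  # outcome unknown; retrying with the same key is safe
    raise RuntimeError("transfer outcome still unknown after retries")


gateway = FakePaymentGateway(balance=1000, drop_next_response=True)
remaining = transfer_with_retry(gateway, 500)
print(remaining, gateway.balance)
```

The key is generated once per logical operation. Generating a fresh key per attempt would defeat the purpose and reintroduce the double transfer.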
Side Effects Cascade Uncontrollably Without Containment
An agent’s action produces a side effect. The side effect triggers another system. That system produces another side effect. The cascade propagates through the infrastructure.
An agent approves a refund. The refund is processed. A notification is sent to the customer. The notification triggers a third-party CRM system to update the customer record. The CRM update triggers an automation that flags the account for review. The review automation sends a Slack message to a human. The human misinterprets the message and escalates the issue.
The agent executed one action. Five downstream systems were affected. If any of those systems fails, the cascade breaks. If the agent rolls back, the cascade is partially undone but not fully (the Slack message was sent; the human already read it).
Without explicit side effect containment, side effects are unobservable and uncontrollable.
Containment 1: Declare side effects. The agent explicitly declares what side effects it will produce before executing them.
from dataclasses import dataclass

@dataclass
class SideEffect:
    system: str       # which system this affects
    operation: str    # what operation
    reversible: bool  # can this be rolled back?
    priority: int     # if a choice must be made, execute high-priority first

@dataclass
class AgentTurn:
    observation: dict
    decision: str
    declared_side_effects: list[SideEffect]

def execute_turn_with_declared_effects(turn: AgentTurn):
    # Declare first
    for effect in turn.declared_side_effects:
        log_side_effect(effect)       # Record what will happen
        validate_side_effect(effect)  # Reject if unsafe
    # Then execute, tracking which effects actually ran
    executed = []
    transaction_start()
    try:
        for effect in sort_by_priority(turn.declared_side_effects):
            execute_side_effect(effect)
            executed.append(effect)
        transaction_commit()
    except Exception:
        transaction_rollback()
        # Roll back only the reversible effects that actually executed
        for effect in reversed(executed):
            if effect.reversible:
                rollback_side_effect(effect)
        raise
Declaring side effects makes them observable and allows the system to make informed choices about order of execution and rollback strategy.
Containment 2: Rate-limit side effect propagation. Do not allow side effects to cascade infinitely. Require explicit acknowledgment before propagating to the next system.
def execute_side_effect_with_ack(effect: SideEffect) -> bool:
    try:
        result = invoke_system(effect.system, effect.operation)
        # Wait for explicit ack before cascading
        ack = wait_for_ack(timeout=30)  # 30s to ack or nack
        if not ack:
            # No ack; do not propagate further regardless of result
            log_warning(f"No ack for {effect}; stopping cascade")
            return False
        return True
    except Exception:
        # Explicit failure; stop cascade
        return False
This adds latency but prevents cascading failures.
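Containment 3: Isolate side effects by impact radius. Group side effects by how far they can propagate and by what their failure means for the turn. Critical side effects (charge payment) stay inside the agent’s transaction. High-impact side effects (update inventory) can propagate to directly affected systems but not beyond. Medium- and low-impact side effects (send notification) can propagate further, and their failure does not fail the turn.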
class SideEffectImpactRadius:
    CRITICAL = 1  # Stays in the agent's transaction
    HIGH = 2      # Can propagate to directly affected systems
    MEDIUM = 3    # Can propagate one level further
    LOW = 4       # Can propagate freely

def execute_effect_by_radius(effect: SideEffect):
    # Assumes SideEffect carries a `radius` field set to one of the levels above
    if effect.radius <= SideEffectImpactRadius.CRITICAL:
        # Must succeed or entire turn fails; let any exception propagate
        return execute_side_effect(effect)
    elif effect.radius <= SideEffectImpactRadius.HIGH:
        # Must succeed but other turns can proceed if this fails
        try:
            return execute_side_effect(effect)
        except Exception:
            log_critical(f"High-impact effect failed: {effect}")
            # Alert human; do not retry automatically
            raise
    else:
        # Can fail without affecting the turn
        try:
            return execute_side_effect(effect)
        except Exception:
            log_warning(f"Low-impact effect failed: {effect}")
            # Retry in background; do not block turn
            queue_for_retry(effect)
            return False
This stratified approach prevents low-impact failures from propagating to critical systems.
Concurrency Requires Transaction Isolation
If multiple agents execute concurrently against shared state, transaction boundaries become essential for correctness.
Two agents execute simultaneously:
Agent A Turn 1:
1. Read balance = 1000
2. Decide to transfer 500
3. Execute transfer, balance = 500
Agent B Turn 1 (overlapping):
1. Read balance = 1000 (read is stale; Agent A hasn't committed yet)
2. Decide to transfer 300
3. Execute transfer, balance = 700
Result: Both agents assume their transfer succeeded
Balance is now 700, but both transfers (500 + 300) were applied
Expected balance after both: 200
Actual balance: 700
This is the classic “Lost Update” problem. It occurs because:
- Both agents read the same stale balance
- Both execute their transfer
- One update overwrites the other
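The schedule above can be reproduced deterministically. In this sketch, `interleaved_transfers` forces every read to happen before any write, and `serialized_transfer` holds one lock across read, decide, and write, which stands in for serializable execution. The `Account` class and both function names are hypothetical.

```python
import threading


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.lock = threading.Lock()


def interleaved_transfers(account: Account, amounts: list[int]) -> int:
    """Every agent reads before any agent writes: the lost-update schedule."""
    observed = [account.balance for _ in amounts]  # all reads happen first
    for seen, amount in zip(observed, amounts):    # then each agent writes
        account.balance = seen - amount            # overwrites the previous write
    return account.balance


def serialized_transfer(account: Account, amount: int) -> None:
    """Hold one lock across read, decide, and write."""
    with account.lock:
        seen = account.balance
        account.balance = seen - amount


lost = interleaved_transfers(Account(1000), [500, 300])

safe_account = Account(1000)
threads = [
    threading.Thread(target=serialized_transfer, args=(safe_account, amount))
    for amount in (500, 300)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(lost, safe_account.balance)
```

The interleaved run ends at 700 because the second write is computed from the original 1000. The locked run ends at 200 whichever thread goes first.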
Transaction isolation prevents this. The system uses one of the classic isolation levels:
Read Uncommitted. Agents can read data written by other agents that haven’t committed yet. Highest concurrency. Lowest safety.
Read Committed. Agents can only read data that other agents have committed. Prevents dirty reads. Allows non-repeatable reads (Agent A reads balance, Agent B updates balance and commits, Agent A reads again and gets a different value).
Repeatable Read. Agents cannot see changes to data they already read, even if other agents commit updates. Allows phantom reads (Agent A queries orders WHERE status=‘pending’, Agent B creates a new pending order and commits, Agent A queries again and gets a different result).
Serializable. Agents are isolated as if they execute in sequence, not concurrently. Prevents dirty reads, non-repeatable reads, and phantom reads. Lowest concurrency. Highest safety.
Most agent systems default to Read Committed or lower because higher isolation is slow. But Read Committed allows non-repeatable reads. An agent reads the current state, reasons about it, and by the time it acts, the state has changed due to another agent. The action is based on stale reasoning.
# Isolation level: Read Committed
transaction_start(isolation=READ_COMMITTED)
try:
    account_a = fetch_account(id=1)  # balance = 100
    account_b = fetch_account(id=2)  # balance = 50
    # At this point, another agent updates account_b from 50 to 60 and commits.
    # Our local copy still says 50, but Read Committed exposes the new value
    # on the next read.
    account_b = fetch_account(id=2)  # balance = 60 (non-repeatable read)
    transfer_amount = account_a.balance - account_b.balance  # 100 - 60 = 40
    transfer(account_a, account_b, amount=transfer_amount)
    transaction_commit()
except Exception:
    # Do not commit a half-finished transfer
    transaction_rollback()
    raise
Higher isolation levels prevent this but require explicit locking or multi-version concurrency control (MVCC). The cost is latency and reduced throughput.
State Consistency Requires Explicit Contracts
An agent’s internal state (memory, context, reasoning history) must be consistent with the world’s state (what the agent observes). Inconsistency means the agent’s actions are based on false assumptions.
An agent holds in memory that inventory has 50 units of a product. It decides to approve two orders, each for 30 units. Both orders are approved. Now the system has 50 - 30 - 30 = -10 units. The agent has committed an impossible action based on stale memory.
Explicit state contracts prevent this:
Contract 1: State versioning. Every agent operation is tagged with the state version it observed.
@dataclass
class AgentObservation:
    version: int  # This is state version 42
    data: dict    # Current state according to version 42

def execute_decision_with_version_check(observation: AgentObservation, decision: str):
    # Check if state is still at the observed version
    current_version = get_current_state_version()
    if current_version != observation.version:
        # State changed; observation is stale
        log_warning(f"Stale observation: expected v{observation.version}, got v{current_version}")
        # Reject the decision or force re-observation
        return False
    # State is still consistent; execute the decision
    return execute_decision(decision)
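This is optimistic locking. If state changes between observation and action, the action is rejected, and the agent must re-observe and re-decide. The version check and the write have to happen as one atomic step (for example, a conditional update); otherwise another writer can slip in between them.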
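One way to make the version check atomic with the write is a conditional UPDATE. The sketch below uses Python's built-in sqlite3 module with an in-memory database and replays the 50-unit inventory example; the table layout and the `observe` and `reserve` helpers are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, units INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO inventory VALUES ('widget', 50, 1)")
conn.commit()


def observe(conn: sqlite3.Connection, sku: str) -> tuple[int, int]:
    """Return (units, version) as seen at observation time."""
    row = conn.execute(
        "SELECT units, version FROM inventory WHERE sku = ?", (sku,)
    ).fetchone()
    return row[0], row[1]


def reserve(conn: sqlite3.Connection, sku: str, quantity: int, observed_version: int) -> bool:
    """Decrement stock only if nothing was written since `observed_version`."""
    cursor = conn.execute(
        "UPDATE inventory SET units = units - ?, version = version + 1 "
        "WHERE sku = ? AND version = ? AND units >= ?",
        (quantity, sku, observed_version, quantity),
    )
    conn.commit()
    # Zero rows updated means the observation was stale or stock was insufficient.
    return cursor.rowcount == 1


units, version = observe(conn, "widget")
first = reserve(conn, "widget", 30, version)   # based on the fresh observation
second = reserve(conn, "widget", 30, version)  # reuses the now-stale version
remaining, _ = observe(conn, "widget")
print(first, second, remaining)
```

The second reservation is rejected because the version moved, so stock ends at 20, not -10. The agent would need to observe again before deciding whether a second order can still be approved.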
Contract 2: Invariant checking. Before executing side effects, assert that key invariants still hold.
def execute_transfer(from_account: Account, to_account: Account, amount: int):
    # Check invariant: from_account has sufficient balance
    if from_account.balance < amount:
        raise InsufficientFundsError()
    # Check invariant: accounts are not the same
    if from_account.id == to_account.id:
        raise SelfTransferError()
    # Check invariant: from_account is not frozen
    if from_account.frozen:
        raise FrozenAccountError()
    # All invariants hold; execute
    from_account.balance -= amount
    to_account.balance += amount
Invariant checking stops impossible actions from executing. If the action violates an invariant, it fails before side effects propagate.
Contract 3: Causality ordering. Define dependencies between agent turns. Side effects must not execute until their dependencies are committed.
@dataclass
class AgentTurn:
    id: str
    decision: str
    depends_on: list[str]  # Turn IDs that must complete first

def execute_turn_with_dependencies(turn: AgentTurn):
    # Wait for dependencies to complete
    for dep_id in turn.depends_on:
        dependency_turn = fetch_turn(dep_id)
        wait_until_committed(dependency_turn)
    # Dependencies committed; this turn can execute
    transaction_start()
    try:
        execute_decision(turn.decision)
        transaction_commit()
    except Exception:
        transaction_rollback()
        raise
This ensures that side effects propagate in a causal order, preventing decisions based on uncommitted state.
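When the dependencies between turns are known up front, a valid execution order can be computed before anything runs. A minimal sketch using the standard library's graphlib module (Python 3.9+) follows; the turn IDs and the `execution_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return turn IDs ordered so every turn comes after its dependencies.

    `depends_on` maps a turn ID to the IDs that must commit first.
    Raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


turns = {
    "charge-payment": [],
    "create-shipment": ["charge-payment"],
    "notify-customer": ["create-shipment"],
}
order = execution_order(turns)
print(order)

try:
    execution_order({"a": ["b"], "b": ["a"]})
    cycle_detected = False
except CycleError:
    cycle_detected = True  # two turns waiting on each other would deadlock
print(cycle_detected)
```

Detecting the cycle before execution matters: with the wait-based approach above, two turns that depend on each other would block forever.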
Detection and Recovery: Acknowledging Failures
Even with explicit transaction boundaries, something will fail. A network partition will isolate an agent. A database will crash mid-transaction. An external API will return an ambiguous response.
The agent must be able to detect these failures and recover without leaving the system in an inconsistent state.
Detection 1: Timeout. If an operation does not complete within a time window, assume it failed.
import asyncio

async def execute_with_timeout(operation, timeout_ms=5000):
    try:
        # wait_for returns a coroutine, so it must be awaited
        result = await asyncio.wait_for(operation(), timeout=timeout_ms / 1000)
        return (True, result)
    except asyncio.TimeoutError:
        # Operation did not complete in time; assume it failed
        return (False, "Timeout")
But timeouts are ambiguous. Did the operation fail? Or is it taking longer than expected? Timeouts should trigger investigation, not immediate rollback.
async def execute_with_timeout_and_investigation(operation, timeout_ms=5000):
    try:
        result = await asyncio.wait_for(operation(), timeout=timeout_ms / 1000)
        return (True, result)
    except asyncio.TimeoutError:
        # Operation timed out; check if it actually executed
        actual_result = verify_operation_executed()
        if actual_result:
            # Operation executed despite timeout; use the result
            return (True, actual_result)
        else:
            # Operation did not execute; safe to retry or fail
            return (False, "Timeout and verification failed")
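Because a timeout is neither a success nor a failure, it can help to model it as a third outcome that callers must handle explicitly. The following runnable sketch does that with asyncio; the `Outcome` enum, `call_with_deadline`, and the three demo coroutines are hypothetical names.

```python
import asyncio
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"  # timed out; the remote side may or may not have acted


async def call_with_deadline(make_call, timeout_s: float):
    """Classify a call as succeeded, failed, or unknown instead of a bare bool."""
    try:
        result = await asyncio.wait_for(make_call(), timeout=timeout_s)
        return Outcome.SUCCEEDED, result
    except asyncio.TimeoutError:
        # Must be caught before the generic handler: this is not a known failure.
        return Outcome.UNKNOWN, None
    except Exception as exc:
        return Outcome.FAILED, exc


async def fast():
    return "ok"


async def slow():
    await asyncio.sleep(0.5)
    return "late"


async def broken():
    raise ConnectionError("refused")


async def demo():
    results = []
    for call in (fast, slow, broken):
        outcome, _ = await call_with_deadline(call, timeout_s=0.05)
        results.append(outcome)
    return results


outcomes = asyncio.run(demo())
print(outcomes)
```

Only the UNKNOWN outcome needs the verification step; a FAILED call with a clear error can be retried or reported without investigating the remote system first.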
Detection 2: Idempotency checking. Make all operations idempotent so retries are safe.
def execute_idempotent_operation(operation_id: str, operation: Callable):
    # Check if this operation already executed
    # (assumes lookup_operation_result returns None when no record exists)
    existing_result = lookup_operation_result(operation_id)
    if existing_result is not None:
        # Already executed; return result without executing again.
        # Comparing against None keeps falsy results such as 0 or False valid.
        return existing_result
    # Not executed yet; execute and cache result
    result = operation()
    cache_operation_result(operation_id, result)
    return result
Idempotency is not free. It requires a place to store execution records (operation ID → result). But it is the most reliable way to make retries safe.
Detection 3: External verification. Periodically check the external world to verify that side effects actually occurred.
def execute_side_effect_with_verification(effect: SideEffect, verify_after_seconds=60):
    # Execute the side effect
    success = execute_side_effect(effect)

    # Schedule verification; the scheduler's delay decides when this runs
    def verify_later():
        verified = verify_side_effect_occurred(effect)
        if not verified and success:
            # Side effect appeared to succeed but verification failed
            # This is a partial failure; escalate to human
            alert_human(f"Unverified side effect: {effect}")
        elif verified and not success:
            # Side effect appeared to fail but verification succeeded
            # This is a false negative; acknowledge success
            patch_result(effect, success=True)

    schedule_callback(verify_later, delay=verify_after_seconds)
    return success
This catches false negatives (operation succeeded but appeared to fail) and false positives (operation failed but appeared to succeed).
Practical Implementation: Transaction-Aware Agent Loop
An agent loop that respects transaction boundaries looks different from a stateless request-response:
import time

class TransactionalAgent:
    def __init__(self, config: AgentConfig):
        self.config = config
        self.db = TransactionalDB(config.db_connection)
        self.state_store = TransactionalStateStore(config.state_backend)

    def run_turn(self, agent_id: str, prompt: str) -> TurnResult:
        """Execute a single agent turn as a transaction."""
        turn_id = generate_turn_id()
        declared_effects = []
        try:
            # The `with` block sits inside the `try` so an exception leaves the
            # transaction (rolling it back) before any handler below runs.
            with self.db.transaction(isolation_level=SERIALIZABLE):
                # Phase 1: Observe (read-only; nothing to roll back)
                observation = self._observe(agent_id)
                observation.version = self.db.get_state_version()
                # Phase 2: Reason (depends on observation; model output may vary)
                context = self._build_context(agent_id, observation)
                reasoning = self._invoke_model(context, prompt)
                decision, declared_effects = self._extract_decision(reasoning)
                # Phase 3: Validate (check invariants)
                self._validate_decision(decision, observation)
                # Phase 4: Declare side effects (make observable to system)
                for effect in declared_effects:
                    self._log_side_effect(turn_id, effect)
                # Phase 5: Execute side effects (all or nothing)
                for effect in declared_effects:
                    self._execute_side_effect(turn_id, effect)
                # Phase 6: Update agent state
                self.state_store.append_to_history(agent_id, {
                    'turn_id': turn_id,
                    'observation': observation,
                    'decision': decision,
                    'effects': declared_effects,
                    'timestamp': time.time(),
                })
            # Transaction committed on leaving the `with` block
            return TurnResult(
                success=True,
                turn_id=turn_id,
                decision=decision,
                effects=declared_effects,
            )
        except ValidationError as e:
            # Validation failed; the transaction has already rolled back
            return TurnResult(
                success=False,
                turn_id=turn_id,
                error=f"Validation failed: {e}",
            )
        except PartialExecutionError as e:
            # Some side effects executed, some failed
            # Try to roll back what can be rolled back
            reversible = [ef for ef in declared_effects if ef.reversible]
            for effect in reversed(reversible):
                try:
                    self._rollback_side_effect(turn_id, effect)
                except Exception as rb_error:
                    # Rollback itself failed; escalate
                    log_critical(f"Rollback failed for {effect}: {rb_error}")
            # Transaction has rolled back; agent state reverts
            return TurnResult(
                success=False,
                turn_id=turn_id,
                error=f"Partial execution: {e}",
                partially_executed_effects=e.executed_effects,
            )
        except Exception as e:
            # Unexpected error; the transaction has already rolled back
            log_error(f"Turn {turn_id} failed: {e}")
            return TurnResult(
                success=False,
                turn_id=turn_id,
                error=str(e),
            )

    def _observe(self, agent_id: str) -> Observation:
        """Fetch current state (read-only)."""
        return Observation(
            environment=self._fetch_environment_state(),
            agent_memory=self.state_store.get_agent_context(agent_id),
            timestamp=time.time(),
        )

    def _invoke_model(self, context: dict, prompt: str) -> str:
        """Call the model with timeout and retry."""
        return execute_with_timeout_and_retries(
            lambda: self.config.model.complete(
                context=context,
                prompt=prompt,
                max_tokens=2000,
            ),
            timeout_ms=30000,
            max_retries=3,
        )

    def _execute_side_effect(self, turn_id: str, effect: SideEffect):
        """Execute a single side effect with idempotency."""
        effect_id = f"{turn_id}:{effect.id}"
        # Make the effect execution idempotent
        return execute_idempotent_operation(
            effect_id,
            lambda: self.config.side_effect_executor.execute(effect),
        )
This implementation:
- Wraps the entire turn in a transaction
- Makes observation explicit and read-only
- Validates the decision before executing side effects
- Declares side effects before executing them
- Makes side effect execution idempotent
- Handles partial failures and rollback
- Stores the full history for debugging
The Cost: Complexity for Correctness
Treating agent turns as transactions adds complexity. Transaction management, rollback logic, idempotency keys, isolation level tuning, and failure recovery are not free.
But the alternative is worse. An agent system without explicit transaction boundaries will eventually execute actions based on stale observations, leave side effects halfway executed, cascade failures through dependent systems, and lose consistency under concurrency. These failures are not edge cases in production at scale.
The choice is not “transactions or simplicity.” It is “explicit transaction semantics or implicit inconsistency.”
Correct agent systems require choosing:
- Boundary definition. What counts as success or failure of a turn?
- Rollback strategy. Which side effects are reversible? Which require compensation?
- Side effect containment. How do side effects propagate? What limits them?
- Isolation level. Can agents execute concurrently? At what consistency cost?
- Failure detection. How does the system know a turn failed? How does it recover?
These are not optional. They are prerequisites for building agent systems that are predictable, debuggable, and safe to operate.