Agent Turns as Transactions: Boundaries, Rollbacks, and Side Effects

Agent systems often present themselves as stateless request-response flows. An agent receives a prompt, executes a turn (observation, reasoning, action), and returns a result. The framing is linear and bounded.

In production, this breaks. An agent turn is not atomic. It is a sequence of operations—state reads, external API calls, model invocations, database writes—that can fail at any point. The agent’s perception of the world is stale by the time it acts. Its actions trigger side effects that propagate through other systems. Multiple agents may be executing concurrently against shared state.

An agent turn is a transaction. It has boundaries, it can fail, it can leave side effects. Treating it as anything less results in systems that are unpredictable, unrecoverable, and unsafe to scale.

The Problem: Agent Turns Are Not Atomic

An agent turn executes a sequence of steps:

1. Observe the current state (read from databases, APIs, or memory)
2. Reason about the observation (invoke the model, run planning logic)
3. Select an action (decide what to do next based on reasoning)
4. Execute the action (call an API, modify state, trigger side effects)
5. Verify the outcome (confirm the action succeeded or failed)

Each step is a point of failure. The observation may already be stale by step 2. The model invocation can time out. The API call can fail mid-request. The database write can be interrupted. The verification can reveal that the action only partially succeeded.

Consider an agent handling customer orders:

def execute_agent_turn(order_id: str, agent_state: AgentMemory):
    # Step 1: Observe
    order = fetch_order(order_id)  # API call: can fail, can time out
    inventory = check_inventory(order.items)  # Another API call
    
    # Step 2: Reason
    decision = model.think(f"Should we fulfill {order}?")  # Model call — can fail, can hang
    
    # Step 3: Action
    if decision == "approve":
        # Step 4: Execute side effects
        new_shipment = create_shipment(order)  # API call — can fail after creation
        agent_state.update(f"Shipment {new_shipment.id} created")
        send_notification(order.customer)  # External system — may never receive it
        
    # Step 5: Verify
    manifest = check_manifest(order_id)  # API call — may show partial state

What happens when create_shipment() succeeds but send_notification() fails? The shipment exists. The notification does not. The agent’s memory records the shipment and says nothing about the missing notification. The customer sees no confirmation. Any inventory the shipment reserved stays decremented. The order is now in an inconsistent state across systems.

Multiply this across hundreds of agents executing concurrently. The inconsistencies compound.
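
To make the failure concrete, here is a minimal simulation of the partial failure described above. It assumes in-memory stand-ins for the external services: FakeWorld and its methods are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class FakeWorld:
    """In-memory stand-in for the external systems (hypothetical)."""
    shipments: list[str] = field(default_factory=list)
    notifications: list[str] = field(default_factory=list)
    notifications_down: bool = True  # simulate an outage

    def create_shipment(self, order_id: str) -> str:
        self.shipments.append(order_id)
        return f"ship-{order_id}"

    def send_notification(self, customer: str) -> None:
        if self.notifications_down:
            raise ConnectionError("notification service unavailable")
        self.notifications.append(customer)


def naive_turn(world: FakeWorld, memory: list[str], order_id: str, customer: str) -> None:
    """Mirrors the unprotected turn: no boundary, no compensation."""
    shipment_id = world.create_shipment(order_id)
    memory.append(f"Shipment {shipment_id} created")
    world.send_notification(customer)  # raises; nothing above is undone


world = FakeWorld()
memory: list[str] = []
try:
    naive_turn(world, memory, "order-1", "alice")
except ConnectionError:
    pass

# The three views of the order now disagree.
print(world.shipments)      # ['order-1']
print(world.notifications)  # []
print(memory)               # ['Shipment ship-order-1 created']
```

Nothing in the turn itself records that it stopped halfway; the inconsistency is only visible by comparing the systems afterwards.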

Transaction Boundaries Must Be Explicit

A transaction is a unit of work that either completes entirely or fails entirely. It has clear boundaries: a start point and a commit or rollback point. The transaction protects invariants—if the transaction succeeds, the system state is consistent; if it fails, state is unchanged.

Agent systems require explicit transaction boundaries because agent turns are inherently multi-step. The boundary answers the question: what constitutes success or failure of this turn?

Consider three possible boundaries:

Boundary 1: Observation-only. The transaction includes only the observation phase. The agent reads state and returns it to the caller. The caller decides what to do.

def observe_only_transaction(order_id: str) -> tuple[Order, Inventory]:
    transaction_start()
    try:
        order = fetch_order(order_id)
        inventory = check_inventory(order.items)
        transaction_commit()
        return (order, inventory)
    except Exception as e:
        transaction_rollback()
        raise

This is safe but limits the agent to pure observation. The agent cannot execute actions.

Boundary 2: Full turn. The transaction includes observation, reasoning, and action. Either the turn completes entirely or it fails entirely.

def full_turn_transaction(order_id: str, agent_state: AgentMemory) -> TurnResult:
    transaction_start()
    try:
        # Observe current state
        order = fetch_order(order_id)
        inventory = check_inventory(order.items)
        
        # Reason
        decision = model.think(f"Should we fulfill {order}?")
        
        # Action: all side effects must succeed or all must be rolled back
        if decision == "approve":
            shipment = create_shipment(order)
            send_notification(order.customer)
            update_inventory(order.items, decrement=True)
        
        # Update agent state
        agent_state.update(f"Turn complete: {decision}")
        
        transaction_commit()
        return TurnResult(success=True, decision=decision)
    except Exception as e:
        transaction_rollback()
        agent_state.rollback()  # Revert agent's memory to pre-turn state
        return TurnResult(success=False, error=str(e))

This gives the strongest guarantee but requires every side effect to be transactional or reversible. If send_notification() fails, the shipment must be reverted.
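
One way to get that all-or-nothing behavior across services that share no transaction is to register an undo step for each action as it succeeds, then unwind on failure. The sketch below assumes in-memory state and a notification service that is down; CompensationStack, adjust_stock, and fulfill are hypothetical names for illustration, not part of any library.

```python
from typing import Callable


class CompensationStack:
    """Records an undo callable for each step that has succeeded."""

    def __init__(self) -> None:
        self._undo: list[Callable[[], None]] = []

    def run(self, action: Callable[[], object], undo: Callable[[], None]) -> object:
        result = action()        # if this raises, no undo is registered
        self._undo.append(undo)
        return result

    def unwind(self) -> None:
        # Undo in reverse order so later steps are reverted first.
        while self._undo:
            self._undo.pop()()


# Hypothetical in-memory state standing in for real services.
shipments: set[str] = set()
inventory: dict[str, int] = {"widget": 5}


def adjust_stock(item: str, delta: int) -> None:
    inventory[item] += delta


def send_notification(customer: str) -> None:
    raise ConnectionError("notification service unavailable")


def fulfill(order_id: str, item: str, customer: str) -> bool:
    steps = CompensationStack()
    try:
        steps.run(lambda: shipments.add(order_id),
                  lambda: shipments.discard(order_id))
        steps.run(lambda: adjust_stock(item, -1),
                  lambda: adjust_stock(item, +1))
        # The step that cannot be undone goes last.
        send_notification(customer)
        return True
    except Exception:
        steps.unwind()
        return False


ok = fulfill("order-1", "widget", "alice")
print(ok, shipments, inventory)  # False set() {'widget': 5}
```

Ordering the irreversible step last is a design choice: if it fails, everything before it can still be undone, and if it succeeds, nothing remains that could force a rollback.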

Boundary 3: Action-only. The transaction includes only the action phase. Observation and reasoning are outside the transaction; action execution is inside.

def action_only_transaction(order_id: str, decision: str, agent_state: AgentMemory):
    # Observation and reasoning are pre-committed; we can't roll them back
    observation_snapshot = agent_state.last_observation
    
    # Only the action is transactional
    shipment = None  # stays None when the decision is not "approve"
    transaction_start()
    try:
        if decision == "approve":
            shipment = create_shipment(order_id)
            send_notification(observation_snapshot.customer)
            update_inventory(observation_snapshot.items, decrement=True)
        
        transaction_commit()
        return ActionResult(success=True, shipment_id=shipment.id if shipment else None)
    except Exception as e:
        # Observation and reasoning are unchanged; only action is rolled back
        # But we have no way to undo the notification if it was sent
        transaction_rollback()
        return ActionResult(success=False, error=str(e))

This is pragmatic but creates asymmetry: the agent’s reasoning may be based on a stale observation, and unless the failure is fed back into the agent’s state, the agent’s recorded decision says the action happened when it did not.

Each boundary has trade-offs. The choice depends on what can be rolled back, what consistency matters, and how the agent will respond to failure.

Rollback Semantics Are Not Obvious

Rollback is not free. It requires either:

1. Reversible operations. Every action has an inverse. Create → Delete, Increment → Decrement, Insert → Remove.
2. Compensating transactions. An operation cannot be reversed, but a compensating operation can restore the system to a consistent state. Book flight → Offer refund. This does not restore the original state; it just moves to a different consistent state.
3. Distributed transactions. Coordinate all systems to commit or abort together. This requires all systems to support transactional semantics and two-phase commit or similar. This is slow and fragile.
4. Accept inconsistency. Roll back what can be rolled back; leave the rest. This requires explicit knowledge of which systems are transactional and which are not.

Most agent systems choose option 4 by default.

An agent calls an external payment API to charge a customer. The API processes the charge and generates a transaction ID, but the connection drops before the response reaches the agent. The agent never receives the success confirmation. From the agent’s perspective, the operation failed, so it rolls back the order. But the payment went through: the customer was charged for an order that was cancelled.

This is not a rare edge case. Network failures, timeouts, and partial responses are common. Idempotency helps—if the payment API is idempotent, the agent can retry without creating duplicate charges. But not all systems are idempotent. Not all APIs document their idempotency guarantees.

Consider an agent that:

1. Reads balance: current_balance = 1000
2. Decides to transfer 500 to another account
3. Executes the transfer
4. Transfer succeeds but external confirmation is lost
5. Agent rolls back (assumes transfer failed)
6. Agent tries again, still believing the balance is 1000
7. The first transfer already executed, so the balance is actually 500
8. If the system accepts the retry, 500 is sent twice and the account goes to 0

Rollback semantics require idempotency, which requires explicit design. Most agent systems do not have it.
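
As an illustration of what that design can look like, the sketch below replays the transfer scenario against a toy ledger that deduplicates by operation ID. The Ledger class and the ID format are assumptions made for this example; a real system would need the deduplication record to be durable and shared.

```python
class Ledger:
    """Toy account ledger that deduplicates transfers by operation ID."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._applied: dict[str, int] = {}  # operation ID -> balance after it

    def transfer_out(self, operation_id: str, amount: int) -> int:
        if operation_id in self._applied:
            # Retry of a transfer that already ran: return the recorded outcome.
            return self._applied[operation_id]
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
        self._applied[operation_id] = self.balance
        return self.balance


ledger = Ledger(balance=1000)
op_id = "turn-17:transfer-1"  # chosen once by the agent, reused on every retry

ledger.transfer_out(op_id, 500)  # succeeds, but suppose the reply is lost
ledger.transfer_out(op_id, 500)  # retry with the same ID changes nothing
print(ledger.balance)  # 500
```

The key point is that the agent picks the operation ID before the first attempt. A retry that generates a fresh ID is indistinguishable from a second, intentional transfer.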

Side Effects Cascade Uncontrollably Without Containment

An agent’s action produces a side effect. The side effect triggers another system. That system produces another side effect. The cascade propagates through the infrastructure.

An agent approves a refund. The refund is processed. A notification is sent to the customer. The notification triggers a third-party CRM system to update the customer record. The CRM update triggers an automation that flags the account for review. The review automation sends a Slack message to a human. The human misinterprets the message and escalates the issue.

The agent executed one action. Five downstream systems were affected. If any of those systems fails, the cascade breaks. If the agent rolls back, the cascade is partially undone but not fully (the Slack message was sent; the human already read it).

Without explicit side effect containment, side effects are unobservable and uncontrollable.

Containment 1: Declare side effects. The agent explicitly declares what side effects it will produce before executing them.

from dataclasses import dataclass

@dataclass
class SideEffect:
    system: str  # which system this affects
    operation: str  # what operation
    reversible: bool  # can this be rolled back?
    priority: int  # if a choice must be made, execute high-priority first

@dataclass
class AgentTurn:
    observation: dict
    decision: str
    declared_side_effects: list[SideEffect]

def execute_turn_with_declared_effects(turn: AgentTurn):
    # Declare first
    for effect in turn.declared_side_effects:
        log_side_effect(effect)  # Record what will happen
        validate_side_effect(effect)  # Reject if unsafe
    
    # Then execute, tracking which effects actually ran
    executed = []
    transaction_start()
    try:
        for effect in sort_by_priority(turn.declared_side_effects):
            execute_side_effect(effect)
            executed.append(effect)
        transaction_commit()
    except Exception:
        transaction_rollback()
        # Undo only effects that ran and are reversible, newest first
        for effect in reversed(executed):
            if effect.reversible:
                rollback_side_effect(effect)
        raise

Declaring side effects makes them observable and allows the system to make informed choices about order of execution and rollback strategy.

Containment 2: Rate-limit side effect propagation. Do not allow side effects to cascade infinitely. Require explicit acknowledgment before propagating to the next system.

def execute_side_effect_with_ack(effect: SideEffect) -> bool:
    try:
        result = invoke_system(effect.system, effect.operation)
        # Wait for explicit ack before cascading
        ack = wait_for_ack(timeout=30)  # 30s to ack or nack
        if not ack:
            # No ack; do not propagate further regardless of result
            log_warning(f"No ack for {effect}; stopping cascade")
            return False
        return True
    except Exception:
        # Explicit failure; stop cascade
        return False

This adds latency but prevents cascading failures.

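Containment 3: Isolate side effects by impact radius. Group side effects by how far they are allowed to propagate and by what their failure means for the turn. Critical side effects (charge payment) stay inside the agent’s transaction and fail the turn if they fail. High-impact side effects (update inventory) can reach the directly affected systems but not beyond. Low-impact side effects (send notification) can fail and be retried without failing the turn.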

from enum import IntEnum

class SideEffectImpactRadius(IntEnum):
    CRITICAL = 1  # Stays in the agent's transaction
    HIGH = 2      # Can propagate to directly affected systems
    MEDIUM = 3    # Can propagate one level further
    LOW = 4       # Can propagate freely
    
def execute_effect_by_radius(effect: SideEffect):
    # Assumes SideEffect carries a `radius` field holding one of the values above
    if effect.radius <= SideEffectImpactRadius.CRITICAL:
        # Must succeed or entire turn fails; let any exception propagate
        return execute_side_effect(effect)
    elif effect.radius <= SideEffectImpactRadius.HIGH:
        # Must succeed but other turns can proceed if this fails
        try:
            return execute_side_effect(effect)
        except Exception:
            log_critical(f"High-impact effect failed: {effect}")
            # Alert human; do not retry automatically
            raise
    else:
        # Medium and low impact: can fail without affecting the turn
        try:
            return execute_side_effect(effect)
        except Exception:
            log_warning(f"Low-impact effect failed: {effect}")
            # Retry in background; do not block turn
            queue_for_retry(effect)
            return False

This stratified approach prevents low-impact failures from propagating to critical systems.

Concurrency Requires Transaction Isolation

If multiple agents execute concurrently against shared state, transaction boundaries become essential for correctness.

Two agents execute simultaneously:

Agent A Turn 1:
  1. Read balance = 1000
  2. Decide to transfer 500
  3. Execute transfer, balance = 500

Agent B Turn 1 (overlapping):
  1. Read balance = 1000 (read is stale; Agent A hasn't committed yet)
  2. Decide to transfer 300
  3. Execute transfer, balance = 700
  
Result: Both agents assume their transfer succeeded
Balance is now 700, but both transfers (500 + 300) were applied
Expected balance after both: 200
Actual balance: 700

This is the classic “Lost Update” problem. It occurs because:

Both agents read the same stale balance
Both execute their transfer
One update overwrites the other
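
The interleaving can be reproduced deterministically in a few lines. This is a single-threaded sketch with a plain dictionary standing in for the shared store; the two "agents" are just the order in which the reads and writes are issued.

```python
shared = {"balance": 1000}


def read_balance() -> int:
    return shared["balance"]


def write_balance(value: int) -> None:
    shared["balance"] = value


# Interleaved schedule: both agents read before either writes.
a_seen = read_balance()      # Agent A observes 1000
b_seen = read_balance()      # Agent B observes 1000
write_balance(a_seen - 500)  # Agent A writes 500
write_balance(b_seen - 300)  # Agent B writes 700, overwriting A's update

print(shared["balance"])     # 700
print(1000 - 500 - 300)      # 200, what a serial schedule would produce
```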

Transaction isolation prevents this. The system uses one of the classic isolation levels:

Read Uncommitted. Agents can read data written by other agents that haven’t committed yet. Highest concurrency. Lowest safety.

Read Committed. Agents can only read data that other agents have committed. Prevents dirty reads. Allows non-repeatable reads (Agent A reads balance, Agent B updates balance and commits, Agent A reads again and gets a different value).

Repeatable Read. Agents cannot see changes to data they already read, even if other agents commit updates. Allows phantom reads (Agent A queries orders WHERE status = 'pending', Agent B creates a new pending order and commits, Agent A queries again and gets a different result).

Serializable. Agents are isolated as if they execute in sequence, not concurrently. Prevents dirty reads, non-repeatable reads, and phantom reads. Lowest concurrency. Highest safety.

Most agent systems default to Read Committed or lower because higher isolation is slow. But Read Committed allows non-repeatable reads. An agent reads the current state, reasons about it, and by the time it acts, the state has changed due to another agent. The action is based on stale reasoning.

# Isolation level: Read Committed
transaction_start(isolation=READ_COMMITTED)
try:
    account_a = fetch_account(id=1)  # balance = 100
    account_b = fetch_account(id=2)  # balance = 50
    
    # At this point, another agent updates account_b from 50 to 60 and commits.
    # Our local copy still says 50.
    
    # But if we read account_b again, we see the update
    account_b = fetch_account(id=2)  # balance = 60 (non-repeatable read)
    
    transfer_amount = account_a.balance - account_b.balance  # 100 - 60 = 40
    transfer(account_a, account_b, amount=transfer_amount)
    transaction_commit()
except Exception:
    transaction_rollback()  # never commit a half-finished transfer
    raise

Higher isolation levels prevent this but require explicit locking or multi-version concurrency control (MVCC). The cost is latency and reduced throughput.
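
The explicit-locking route can be sketched with an in-process lock. This is only an analogy for what a database does with row-level locks; the Account class here is hypothetical, and the lock protects a single process, not a distributed system.

```python
import threading


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.lock = threading.Lock()

    def transfer_out(self, amount: int) -> bool:
        # The lock covers the read, the decision, and the write, so no other
        # thread can act on a balance that is about to change.
        with self.lock:
            if self.balance < amount:
                return False
            self.balance -= amount
            return True


account = Account(balance=1000)
threads = [
    threading.Thread(target=account.transfer_out, args=(500,)),
    threading.Thread(target=account.transfer_out, args=(300,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(account.balance)  # 200
```

Whichever thread runs first, the second one sees the committed result of the first. The price is that the two transfers no longer overlap at all, which is the throughput cost the text describes.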

State Consistency Requires Explicit Contracts

An agent’s internal state (memory, context, reasoning history) must be consistent with the world’s state (what the agent observes). Inconsistency means the agent’s actions are based on false assumptions.

An agent has memorized that inventory has 50 units of a product. It decides to approve two orders, each for 30 units. Both orders are approved. Now the system has 50 - 30 - 30 = -10 units. The agent has committed an impossible action based on stale memory.

Explicit state contracts prevent this:

Contract 1: State versioning. Every agent operation is tagged with the state version it observed.

@dataclass
class AgentObservation:
    version: int  # This is state version 42
    data: dict    # Current state according to version 42

def execute_decision_with_version_check(observation: AgentObservation, decision: str):
    # Check if state is still at the observed version
    current_version = get_current_state_version()
    if current_version != observation.version:
        # State changed; observation is stale
        log_warning(f"Stale observation: expected v{observation.version}, got v{current_version}")
        # Reject the decision or force re-observation
        return False
    
    # State is still consistent; execute the decision
    return execute_decision(decision)

This is optimistic locking: if state changes between observation and action, the action is rejected, and the agent must re-observe and re-decide.
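
A version check only helps if the check and the write are a single atomic step; otherwise another agent can slip in between them. The sketch below shows one way to do that with a compare-and-set operation and a retry loop. VersionedStore and withdraw are hypothetical names, and the in-process lock stands in for whatever conditional-update primitive the real store offers.

```python
import threading


class VersionedStore:
    """Toy store where a write succeeds only against the version it was read at."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._version = 0
        self._lock = threading.Lock()

    def read(self) -> tuple[int, int]:
        with self._lock:
            return self._version, self._value

    def compare_and_set(self, expected_version: int, new_value: int) -> bool:
        # Check and write happen under one lock, so nothing can slip in between.
        with self._lock:
            if self._version != expected_version:
                return False
            self._value = new_value
            self._version += 1
            return True


def withdraw(store: VersionedStore, amount: int, max_attempts: int = 5) -> bool:
    for _ in range(max_attempts):
        version, balance = store.read()                        # observe
        if balance < amount:                                   # decide
            return False
        if store.compare_and_set(version, balance - amount):   # act
            return True
        # Another writer got in first: re-observe and re-decide.
    return False


store = VersionedStore(1000)
stale_version, stale_balance = store.read()
withdraw(store, 500)  # bumps the version to 1
print(store.compare_and_set(stale_version, stale_balance - 300))  # False
print(withdraw(store, 300), store.read())  # True (2, 200)
```

The stale write is refused rather than overwriting the earlier withdrawal, and the retry loop produces the balance a serial schedule would.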

Contract 2: Invariant checking. Before executing side effects, assert that key invariants still hold.

def execute_transfer(from_account: Account, to_account: Account, amount: int):
    # Check invariant: from_account has sufficient balance
    if from_account.balance < amount:
        raise InsufficientFundsError()
    
    # Check invariant: accounts are not the same
    if from_account.id == to_account.id:
        raise SelfTransferError()
    
    # Check invariant: from_account is not frozen
    if from_account.frozen:
        raise FrozenAccountError()
    
    # All invariants hold; execute
    from_account.balance -= amount
    to_account.balance += amount

Invariant checking stops impossible actions from executing. If the action violates an invariant, it fails before side effects propagate.
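
Applied to the inventory scenario above, the same idea looks like this. The Inventory class and InsufficientStockError are hypothetical; the point is only that the check runs against the live count at execution time, not against what the agent remembers.

```python
class InsufficientStockError(Exception):
    pass


class Inventory:
    def __init__(self, units: int) -> None:
        self.units = units

    def reserve(self, quantity: int) -> None:
        # Invariant: stock never goes negative.
        if quantity > self.units:
            raise InsufficientStockError(
                f"requested {quantity}, only {self.units} available"
            )
        self.units -= quantity


stock = Inventory(units=50)
stock.reserve(30)          # first order: 20 units remain
try:
    stock.reserve(30)      # second order would leave -10
except InsufficientStockError as err:
    print(err)             # requested 30, only 20 available
print(stock.units)         # 20
```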

Contract 3: Causality ordering. Define dependencies between agent turns. Side effects must not execute until their dependencies are committed.

@dataclass
class AgentTurn:
    id: str
    decision: str
    depends_on: list[str]  # Turn IDs that must complete first

def execute_turn_with_dependencies(turn: AgentTurn):
    # Wait for dependencies to complete
    for dep_id in turn.depends_on:
        dependency_turn = fetch_turn(dep_id)
        wait_until_committed(dependency_turn)
    
    # Dependencies committed; this turn can execute
    transaction_start()
    try:
        execute_decision(turn.decision)
        transaction_commit()
    except Exception:
        transaction_rollback()
        raise

This ensures that side effects propagate in a causal order, preventing decisions based on uncommitted state.
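
Waiting on dependencies can deadlock if the dependency graph contains a cycle, so it may be worth computing an execution order up front. A minimal sketch using the standard library's graphlib (Python 3.9+); the turn IDs are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical turn IDs mapped to the turns they depend on.
depends_on = {
    "charge-card": {"reserve-stock"},
    "ship-order": {"charge-card"},
    "reserve-stock": set(),
}

order = list(TopologicalSorter(depends_on).static_order())
print(order)  # ['reserve-stock', 'charge-card', 'ship-order']

# A cycle means no turn could ever start; detect it before waiting.
cycle_found = False
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
except CycleError:
    cycle_found = True
print(cycle_found)  # True
```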

Detection and Recovery: Acknowledging Failures

Even with explicit transaction boundaries, something will fail. A network partition will isolate an agent. A database will crash mid-transaction. An external API will return an ambiguous response.

The agent must be able to detect these failures and recover without leaving the system in an inconsistent state.

Detection 1: Timeout. If an operation does not complete within a time window, assume it failed.

import asyncio

async def execute_with_timeout(operation, timeout_ms=5000):
    try:
        # wait_for must be awaited; without `await` the timeout never fires
        result = await asyncio.wait_for(operation(), timeout=timeout_ms / 1000)
        return (True, result)
    except asyncio.TimeoutError:
        # Operation did not complete in time; assume it failed
        return (False, "Timeout")

But timeouts are ambiguous. Did the operation fail? Or is it taking longer than expected? Timeouts should trigger investigation, not immediate rollback.

async def execute_with_timeout_and_investigation(operation, timeout_ms=5000):
    try:
        # wait_for must be awaited; without `await` the timeout never fires
        result = await asyncio.wait_for(operation(), timeout=timeout_ms / 1000)
        return (True, result)
    except asyncio.TimeoutError:
        # Operation timed out; check if it actually executed
        actual_result = verify_operation_executed()
        if actual_result:
            # Operation executed despite timeout; use the result
            return (True, actual_result)
        else:
            # Operation did not execute; safe to retry or fail
            return (False, "Timed out; verification found no effect")
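
A runnable illustration of why the verification step matters: in the sketch below the simulated remote side applies the charge and then responds slowly, so the caller times out even though the effect happened. The `applied` set, slow_charge, and charge_with_check are all hypothetical stand-ins.

```python
import asyncio

applied: set[str] = set()  # stands in for the remote system's state


async def slow_charge(charge_id: str) -> str:
    applied.add(charge_id)     # the remote side does the work...
    await asyncio.sleep(1.0)   # ...but the reply is slow to arrive
    return "ok"


async def charge_with_check(charge_id: str, timeout_s: float) -> tuple[bool, str]:
    try:
        result = await asyncio.wait_for(slow_charge(charge_id), timeout=timeout_s)
        return True, result
    except asyncio.TimeoutError:
        # The timeout says nothing about whether the charge happened,
        # so ask the system of record before deciding.
        if charge_id in applied:
            return True, "verified after timeout"
        return False, "not applied"


outcome = asyncio.run(charge_with_check("charge-1", timeout_s=0.05))
print(outcome)  # (True, 'verified after timeout')
```

Treating this timeout as a failure and rolling back would reproduce the charged-but-cancelled scenario from earlier.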

Detection 2: Idempotency checking. Make all operations idempotent so retries are safe.

def execute_idempotent_operation(operation_id: str, operation: Callable):
    # Check if this operation already executed
    existing_result = lookup_operation_result(operation_id)
    if existing_result is not None:  # a falsy result (0, "", False) still counts as executed
        # Already executed; return result without executing again
        return existing_result
    
    # Not executed yet; execute and cache result
    result = operation()
    cache_operation_result(operation_id, result)
    return result

Idempotency is not free. It requires a place to store execution records (operation ID → result). But it is the most reliable way to make retries safe.

Detection 3: External verification. Periodically check the external world to verify that side effects actually occurred.

def execute_side_effect_with_verification(effect: SideEffect, verify_after_seconds=60):
    # Execute the side effect
    start_time = time.time()
    success = execute_side_effect(effect)
    
    # Schedule verification
    def verify_later():
        elapsed = time.time() - start_time
        if elapsed >= verify_after_seconds:
            verified = verify_side_effect_occurred(effect)
            if not verified and success:
                # Side effect appeared to succeed but verification failed
                # This is a partial failure; escalate to human
                alert_human(f"Unverified side effect: {effect}")
            elif verified and not success:
                # Side effect appeared to fail but verification succeeded
                # This is a false negative; acknowledge success
                patch_result(effect, success=True)
    
    schedule_callback(verify_later, delay=verify_after_seconds)
    return success

This catches false negatives (operation succeeded but appeared to fail) and false positives (operation failed but appeared to succeed).

Practical Implementation: Transaction-Aware Agent Loop

An agent loop that respects transaction boundaries looks different from a stateless request-response:

class TransactionalAgent:
    def __init__(self, config: AgentConfig):
        self.config = config
        self.db = TransactionalDB(config.db_connection)
        self.state_store = TransactionalStateStore(config.state_backend)
    
    def run_turn(self, agent_id: str, prompt: str) -> TurnResult:
        """Execute a single agent turn as a transaction."""
        turn_id = generate_turn_id()
        
        declared_effects = []
        try:
            # Start explicit transaction. The try wraps the `with` so that a
            # failure escapes the block and the transaction actually rolls back.
            with self.db.transaction(isolation_level=SERIALIZABLE):
                # Phase 1: Observe (read-only; can be rolled back)
                observation = self._observe(agent_id)
                observation.version = self.db.get_state_version()
                
                # Phase 2: Reason (no side effects; depends only on observation)
                context = self._build_context(agent_id, observation)
                reasoning = self._invoke_model(context, prompt)
                decision, declared_effects = self._extract_decision(reasoning)
                
                # Phase 3: Validate (check invariants)
                self._validate_decision(decision, observation)
                
                # Phase 4: Declare side effects (make observable to system)
                for effect in declared_effects:
                    self._log_side_effect(turn_id, effect)
                
                # Phase 5: Execute side effects (all or nothing)
                for effect in declared_effects:
                    self._execute_side_effect(turn_id, effect)
                
                # Phase 6: Update agent state
                self.state_store.append_to_history(agent_id, {
                    'turn_id': turn_id,
                    'observation': observation,
                    'decision': decision,
                    'effects': declared_effects,
                    'timestamp': time.time(),
                })
            
            # Transaction has committed here
            return TurnResult(
                success=True,
                turn_id=turn_id,
                decision=decision,
                effects=declared_effects,
            )
        
        except ValidationError as e:
            # Validation failed; the transaction has already rolled back
            return TurnResult(
                success=False,
                turn_id=turn_id,
                error=f"Validation failed: {e}",
            )
        
        except PartialExecutionError as e:
            # Some side effects executed, some failed.
            # Undo the ones that actually ran and can be undone, newest first.
            reversible = [ef for ef in e.executed_effects if ef.reversible]
            for effect in reversed(reversible):
                try:
                    self._rollback_side_effect(turn_id, effect)
                except Exception as rb_error:
                    # Rollback itself failed; escalate
                    log_critical(f"Rollback failed for {effect}: {rb_error}")
            
            # Transaction has rolled back; agent state is reverted
            return TurnResult(
                success=False,
                turn_id=turn_id,
                error=f"Partial execution: {e}",
                partially_executed_effects=e.executed_effects,
            )
        
        except Exception as e:
            # Unexpected error; the transaction has already rolled back
            log_error(f"Turn {turn_id} failed: {e}")
            return TurnResult(
                success=False,
                turn_id=turn_id,
                error=str(e),
            )
    
    def _observe(self, agent_id: str) -> Observation:
        """Fetch current state (read-only)."""
        return Observation(
            environment=self._fetch_environment_state(),
            agent_memory=self.state_store.get_agent_context(agent_id),
            timestamp=time.time(),
        )
    
    def _invoke_model(self, context: dict, prompt: str) -> str:
        """Call the model with timeout and retry."""
        return execute_with_timeout_and_retries(
            lambda: self.config.model.complete(
                context=context,
                prompt=prompt,
                max_tokens=2000,
            ),
            timeout_ms=30000,
            max_retries=3,
        )
    
    def _execute_side_effect(self, turn_id: str, effect: SideEffect):
        """Execute a single side effect with idempotency."""
        effect_id = f"{turn_id}:{effect.id}"
        
        # Make the effect execution idempotent
        return execute_idempotent_operation(
            effect_id,
            lambda: self.config.side_effect_executor.execute(effect),
        )

This implementation:

Wraps the entire turn in a transaction
Makes observation explicit and read-only
Validates the decision before executing side effects
Declares side effects before executing them
Makes side effect execution idempotent
Handles partial failures and rollback
Stores the full history for debugging
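
One detail such a loop depends on is how the transaction context manager decides between commit and rollback. Context managers of this kind (Python's sqlite3 connection is one example) typically commit on a normal exit and roll back only when an exception leaves the with block. The toy transaction below is an assumption-labeled sketch of that behavior, not the API of any particular database library.

```python
from contextlib import contextmanager

log: list[str] = []


@contextmanager
def transaction():
    """Toy transaction: commit on normal exit, roll back if an exception escapes."""
    log.append("begin")
    try:
        yield
    except Exception:
        log.append("rollback")
        raise
    else:
        log.append("commit")


def swallow_inside() -> None:
    with transaction():
        try:
            raise RuntimeError("side effect failed")
        except RuntimeError:
            pass  # handled inside: the transaction never sees the error


def handle_outside() -> None:
    try:
        with transaction():
            raise RuntimeError("side effect failed")
    except RuntimeError:
        pass  # handled outside: the transaction saw it and rolled back


swallow_inside()
handle_outside()
print(log)  # ['begin', 'commit', 'begin', 'rollback']
```

An error caught inside the block results in a commit, so error handling that is meant to trigger a rollback has to sit outside the with statement or call rollback explicitly.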

The Cost: Complexity for Correctness

Treating agent turns as transactions adds complexity. Transaction management, rollback logic, idempotency keys, isolation level tuning, and failure recovery are not free.

But the alternative is worse. An agent system without explicit transaction boundaries will eventually execute actions based on stale observations, leave side effects halfway executed, cascade failures through dependent systems, and lose consistency under concurrency. These failures are not edge cases in production at scale.

The choice is not “transactions or simplicity.” It is “explicit transaction semantics or implicit inconsistency.”

Correct agent systems require choosing:

Boundary definition. What counts as success or failure of a turn?
Rollback strategy. Which side effects are reversible? Which require compensation?
Side effect containment. How do side effects propagate? What limits them?
Isolation level. Can agents execute concurrently? At what consistency cost?
Failure detection. How does the system know a turn failed? How does it recover?

These are not optional. They are prerequisites for building agent systems that are predictable, debuggable, and safe to operate.
