Technical Systems

Why Timeouts Don't Cancel Work: The Resource Leak Nobody Notices

The client moved on. The server is still burning.

Timeouts tell clients to stop waiting, not servers to stop working. Queries keep running, goroutines keep leaking, and resources keep burning -- creating cascading failures that nobody traces back to their source.

A timeout fires. The HTTP request returns an error. The client moves on. Everything appears fine. The server is still running your query. The database is still processing. The goroutine is still alive. Resources are still consumed. Nothing was cancelled.

Timeouts give the illusion of control. They tell the client to stop waiting. They don’t tell the server to stop working. The work continues until it completes, fails, or the process dies. Every timeout that fires without actual cancellation leaks resources.

This isn’t about how timeouts work in general. It’s about the specific failure modes -- timeouts that leak goroutines, exhaust database connections, and trigger cascading failures -- and why most timeout implementations make things worse.

What Timeouts Actually Do

A timeout is a deadline for waiting. It is not a signal to cancel work.

The Client-Side Timeout

import requests

# Set 5-second timeout
try:
    response = requests.get('https://api.example.com/slow-query', timeout=5)
except requests.Timeout:
    # Timeout fired after 5 seconds
    # Client stopped waiting
    print("Request timed out")

What happened:

# Client side:
# - Started HTTP request at t=0
# - Waited for 5 seconds
# - No response received
# - Raised Timeout exception
# - Client-side socket closed
# - Client moved on

# Server side:
# - Received HTTP request at t=0
# - Started processing slow query
# - Query still running at t=5
# - Client disconnected (socket closed)
# - Server might not notice client disconnected
# - Query continues running
# - Database resources still consumed
# - Server resources still consumed
# - Query completes at t=30 (25 seconds after client gave up)
# - Response written to closed socket (error logged)

The timeout stopped the client. It didn’t stop the server. The work continued for 25 seconds after the client abandoned it.
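The whole failure mode fits in a few lines of plain sockets -- no HTTP required. A minimal, self-contained sketch (timings shrunk to fractions of a second): the client's timeout fires and it moves on, while the server finishes the work anyway.

```python
import socket
import threading
import time

server_finished = threading.Event()

def slow_server(listener):
    conn, _ = listener.accept()
    time.sleep(0.5)                 # the "work": ignores the client entirely
    server_finished.set()           # work completed anyway
    try:
        conn.sendall(b"too late")   # client is long gone by now
    except OSError:
        pass                        # broken pipe: logged and ignored
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=slow_server, args=(listener,)).start()

client = socket.create_connection(listener.getsockname())
client.settimeout(0.1)              # client gives up after 100 ms
try:
    client.recv(1024)
except socket.timeout:
    pass                            # timeout fired; client moves on
client.close()

server_finished.wait(2)             # ...but the server still did all the work
```

The client's timeout changed nothing on the server side; it only changed how long the client waited.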

The Server Never Knew

from flask import Flask
import time

app = Flask(__name__)

@app.route('/slow-query')
def slow_query():
    # Start expensive operation
    result = database.execute("SELECT * FROM large_table WHERE complex_condition")

    # Process results (takes 30 seconds)
    time.sleep(30)

    # Return response
    return {"data": result}

# Client times out after 5 seconds
# Server continues processing for 30 seconds
# Server tries to send response to closed connection
# Error: "Broken pipe" or "Connection reset by peer"

The server logs show an error. The database query completed successfully. The CPU was consumed. The memory was allocated. The database connection was held. All resources were used for work that was abandoned.

The Goroutine Leak Pattern

Go makes it easy to leak goroutines with timeouts.

The Obvious Leak

package main

import (
    "fmt"
    "net/http"
    "time"
)

func fetchWithTimeout(url string) error {
    // Create channel for result
    done := make(chan error)

    // Start goroutine
    go func() {
        resp, err := http.Get(url)
        if err != nil {
            done <- err
            return
        }
        resp.Body.Close()
        done <- nil
    }()

    // Wait with timeout
    select {
    case err := <-done:
        return err
    case <-time.After(5 * time.Second):
        return fmt.Errorf("timeout")
    }
}

What happens on timeout:

// Main goroutine:
// - Starts at t=0
// - Launches worker goroutine
// - Waits on select
// - Timeout fires at t=5
// - Returns error
// - Function exits

// Worker goroutine:
// - Started at t=0
// - Making HTTP request
// - Request takes 30 seconds
// - Still alive at t=5 (after timeout)
// - Still making HTTP request
// - Nobody listening on 'done' channel
// - Completes at t=30
// - Tries to send to 'done' channel
// - Blocks forever (nobody receiving)
// - Goroutine never exits
// - Memory never freed
// - LEAK

Every timeout creates an abandoned goroutine. The goroutine completes its work and then blocks forever trying to send the result. Memory leaks accumulate.

The Channel Must Be Buffered

func fetchWithTimeout(url string) error {
    // Buffered channel (size 1)
    done := make(chan error, 1)

    go func() {
        resp, err := http.Get(url)
        if err != nil {
            done <- err  // Can send even if nobody receiving
            return
        }
        resp.Body.Close()
        done <- nil  // Can send even if nobody receiving
    }()

    select {
    case err := <-done:
        return err
    case <-time.After(5 * time.Second):
        return fmt.Errorf("timeout")
    }
}

Now the goroutine can exit:

// On timeout:
// - Main goroutine returns error
// - Worker goroutine completes
// - Worker sends to buffered channel (doesn't block)
// - Worker exits
// - Channel garbage collected
// - No leak

But the HTTP request still ran to completion. The timeout didn’t cancel the request. It just prevented the goroutine from leaking.

The HTTP Request Is Still Alive

// The http.Get() call has no timeout
resp, err := http.Get(url)

// This request has no context, no cancellation signal
// Even though the goroutine timeout fired, the HTTP client doesn't know
// The request continues until:
// - Server responds
// - Network error occurs
// - Default HTTP client timeout (none by default)

The goroutine timeout is independent of the HTTP request timeout. Setting a goroutine timeout doesn’t cancel the HTTP request.

The Correct Implementation

func fetchWithTimeout(url string, timeout time.Duration) error {
    // Create context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    // Create request with context
    req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
    if err != nil {
        return err
    }

    // Execute request
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    return nil
}

Now timeout actually cancels:

// Context timeout fires at t=5
// Context marked as cancelled
// HTTP client checks context
// HTTP client cancels underlying TCP connection
// Request actually stops
// No goroutine leak
// No HTTP request leak

This requires the HTTP client to support context cancellation. Not all clients do.

Database Query Timeouts: The Illusion of Control

Database query timeouts are commonly misunderstood. Setting a timeout doesn’t cancel the query.

The Application Timeout

import psycopg2

conn = psycopg2.connect("postgresql://localhost/mydb")
cursor = conn.cursor()

# Set statement timeout on connection
cursor.execute("SET statement_timeout = '5s'")

# Execute slow query
try:
    cursor.execute("SELECT * FROM large_table WHERE expensive_computation(column)")
    results = cursor.fetchall()
except psycopg2.OperationalError as e:
    # Query timed out
    print(f"Query timeout: {e}")

What actually happened:

# Application side:
# - Sent query to database
# - Set timeout: 5 seconds
# - Waited for response
# - After 5 seconds: no response
# - Raised OperationalError
# - Application moved on

# Database side:
# - Received query
# - Started executing query
# - Query still running at t=5
# - Database sent error response to client
# - Database CANCELLED query execution
# - Query stopped
# - Resources freed

PostgreSQL’s statement_timeout actually cancels the query. This is the exception, not the rule.

The Network Timeout Doesn’t Cancel

import psycopg2

# Set network timeout (connect_timeout)
conn = psycopg2.connect(
    "postgresql://localhost/mydb",
    connect_timeout=5  # Network timeout, not query timeout
)

cursor = conn.cursor()

# Execute slow query
cursor.execute("SELECT * FROM large_table WHERE expensive_computation(column)")
# Query takes 30 seconds
# No timeout fires
# connect_timeout only applies to connection establishment

connect_timeout is not a query timeout. It’s a connection establishment timeout. The query runs forever.

MySQL’s Timeout Doesn’t Cancel Either

import mysql.connector

# Set connection timeout
conn = mysql.connector.connect(
    host='localhost',
    user='root',
    password='password',
    database='mydb',
    connection_timeout=5  # Only for connecting
)

cursor = conn.cursor()

# Execute slow query
cursor.execute("SELECT SLEEP(30)")  # Runs for 30 seconds
# No timeout

connection_timeout only covers connecting. MySQL does have a statement timeout -- max_execution_time, added in MySQL 5.7.8 -- but it applies only to SELECT statements and is off by default. Unless it’s set, the query runs until completion or connection close.

The Connection Pool Exhaustion

import psycopg2.pool

# Connection pool with 10 connections
pool = psycopg2.pool.SimpleConnectionPool(1, 10, "postgresql://localhost/mydb")

def execute_query(query):
    # Get connection from pool
    conn = pool.getconn()
    cursor = conn.cursor()

    try:
        # Execute query with no timeout
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        # Return connection to pool
        pool.putconn(conn)

# 10 requests arrive
# Each executes slow query (30 seconds)
# All 10 connections busy
# 11th request arrives
# No connections available
# SimpleConnectionPool raises PoolError immediately;
# blocking pools (SQLAlchemy's, for example) make the request wait instead
# All queries still running
# Pool exhausted

Slow queries without timeouts exhaust the connection pool. Every connection is busy running slow queries. New requests fail or block waiting for a connection, depending on the pool.

The Real Timeout Implementation

import psycopg2
import signal

# Avoid shadowing the builtin TimeoutError
class QueryTimeout(Exception):
    pass

def timeout_handler(signum, frame):
    raise QueryTimeout("Query timeout")

def execute_with_timeout(conn, query, timeout_seconds):
    cursor = conn.cursor()

    # Set PostgreSQL statement timeout (server-side cancellation)
    cursor.execute(f"SET statement_timeout = '{timeout_seconds}s'")

    # Set Python-level timeout as backup
    # (signal.alarm works only in the main thread, and only on Unix)
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout_seconds)

    try:
        cursor.execute(query)
        result = cursor.fetchall()
        signal.alarm(0)  # Cancel alarm
        return result
    except psycopg2.OperationalError:
        # Database cancelled query
        signal.alarm(0)
        raise QueryTimeout("Query timed out (database)")
    except QueryTimeout:
        # Python timeout fired
        # Force close connection to abandon the query
        conn.close()
        raise

This implementation:

  • Uses PostgreSQL’s statement_timeout (actually cancels query)
  • Uses Python signal timeout as backup
  • Closes connection if Python timeout fires (forces cancellation)

Closing the connection is the only client-side recourse when the database doesn’t support statement timeouts -- and even then, the server may not notice the closed socket until it next tries to read from or write to it.

HTTP Client Timeouts: Three Different Timeouts

HTTP clients have multiple timeout settings. Each controls a different phase. None cancel the entire operation.

The Three Timeouts

import requests

response = requests.get(
    'https://api.example.com/data',
    timeout=(3.05, 27)  # (connect timeout, read timeout)
)

What these mean:

# connect timeout = 3.05 seconds
# - Time allowed to establish TCP connection
# - Includes DNS resolution, TCP handshake, SSL handshake
# - If connection not established in 3.05s: timeout
# - Does NOT apply to request/response

# read timeout = 27 seconds
# - Time allowed between bytes received
# - If server stops sending data for 27s: timeout
# - Does NOT apply to total request time
# - Only applies to gaps in data

# Total request time: UNLIMITED
# A request that receives 1 byte every 26 seconds will never time out

The Misleading Read Timeout

# Server sends 1 byte every 10 seconds
# Read timeout: 15 seconds
# Total time: infinite

# t=0: Request sent
# t=5: Received byte 1
# t=15: Received byte 2 (10s since last byte, under 15s timeout)
# t=25: Received byte 3 (10s since last byte, under 15s timeout)
# t=35: Received byte 4 (10s since last byte, under 15s timeout)
# ... continues forever
# Never times out (bytes received every 10s, timeout is 15s)

Read timeout is not total timeout. It’s inter-byte timeout.

The Total Timeout

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import signal

class TimeoutHTTPAdapter(HTTPAdapter):
    def __init__(self, *args, timeout=None, **kwargs):
        self.timeout = timeout
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        kwargs['timeout'] = self.timeout
        return super().send(request, **kwargs)

# Create session with total timeout
session = requests.Session()
adapter = TimeoutHTTPAdapter(timeout=(3, 27))  # Still not total timeout
session.mount("https://", adapter)

# For an actual total timeout, use signal (Unix, main thread only)
def timeout_handler(signum, frame):
    raise TimeoutError("Request timeout")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(30)  # 30-second total timeout

try:
    response = session.get('https://api.example.com/data')
    signal.alarm(0)
except TimeoutError:
    # Total timeout fired
    # Connection still open
    # Server still processing
    signal.alarm(0)

Even with signal-based timeout, the HTTP connection isn’t cancelled. The client stops waiting. The server keeps processing.
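Because signal.alarm works only in the main thread on Unix, a thread-based wrapper is a more portable way to bound total waiting time -- with the same caveat. A sketch (`with_total_timeout` is a name invented here): it stops the waiting, not the work.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as WaitTimeout

def with_total_timeout(fn, timeout, *args, **kwargs):
    # Bound how long we WAIT for fn -- not how long fn runs.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)  # raises WaitTimeout on expiry
    finally:
        pool.shutdown(wait=False)              # don't block on the worker
```

On timeout the caller gets control back immediately, but the worker thread runs fn to completion in the background -- exactly the abandoned-work pattern described above.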

The Server-Side Continuation

# Client side
import requests

try:
    response = requests.get('https://api.example.com/expensive-operation', timeout=5)
except requests.Timeout:
    # Client timed out
    # Client thinks request failed
    print("Request failed")

Server side:

from flask import Flask
import time

app = Flask(__name__)

@app.route('/expensive-operation')
def expensive_operation():
    # Start expensive operation
    perform_expensive_computation()  # Takes 30 seconds

    # Charge customer
    charge_payment(customer_id)

    # Send email
    send_confirmation_email(customer_id)

    # Return response
    return {"status": "success"}

# Client timed out at t=5
# Server continues running
# Computation completes at t=30
# Payment charged
# Email sent
# Response returned to closed connection

The client timeout caused the operation to appear to fail. The server completed the operation successfully. The customer was charged. The confirmation email was sent. The client never saw the response.

Async/Await Timeouts: Cooperative Cancellation

Async frameworks have timeout mechanisms. They work differently from synchronous timeouts.

Python asyncio Timeout

import asyncio

async def slow_operation():
    await asyncio.sleep(30)
    return "done"

async def main():
    try:
        result = await asyncio.wait_for(slow_operation(), timeout=5)
    except asyncio.TimeoutError:
        # Timeout fired after 5 seconds
        print("Timeout")

asyncio.run(main())

What happened:

# asyncio.wait_for():
# - Schedules the slow_operation() coroutine as a task
# - Schedules a 5-second timeout
# - Timeout completes first
# - Cancels the task: CancelledError is raised inside
#   slow_operation() at its current await point
# - slow_operation() must let CancelledError propagate
#   (or clean up and re-raise)

The Cancellation Must Be Handled

async def slow_operation():
    try:
        await asyncio.sleep(30)
    except asyncio.CancelledError:
        # Cancellation requested
        # Must clean up and re-raise
        cleanup()
        raise
    return "done"

If the code doesn’t handle CancelledError:

async def slow_operation():
    try:
        await asyncio.sleep(30)
    except asyncio.CancelledError:
        # Catch and ignore cancellation
        pass

    # Continue working despite cancellation
    perform_database_operation()
    return "done"

# asyncio.wait_for() times out
# Sends CancelledError to slow_operation()
# slow_operation() ignores cancellation
# slow_operation() continues running
# Database operation executes
# Timeout didn't actually cancel anything

Cancellation is cooperative. The code being cancelled must cooperate. If it catches and ignores CancelledError, cancellation fails.

The Blocking Call Problem

import asyncio
import time

async def slow_operation():
    # Blocking call (not async)
    time.sleep(30)  # Blocks entire event loop
    return "done"

async def main():
    try:
        # This timeout will not fire until time.sleep() completes
        result = await asyncio.wait_for(slow_operation(), timeout=5)
    except asyncio.TimeoutError:
        print("Timeout")

# time.sleep(30) blocks for 30 seconds
# No async operations can run during blocking sleep
# Timeout cannot fire (event loop blocked)
# Timeout fires after time.sleep() completes (30 seconds later)

Blocking calls prevent timeout from firing. The event loop is blocked. No other tasks can run. The timeout task cannot execute.
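One way out is to push the blocking call onto a thread so the event loop stays free and the timeout can actually fire -- with the usual caveat that the thread itself keeps running. A sketch (requires Python 3.9+ for `asyncio.to_thread`; the 1-second sleep stands in for a blocking library call):

```python
import asyncio
import time

def blocking_work():
    time.sleep(1.0)          # stands in for any blocking library call
    return "done"

async def main():
    try:
        # The event loop stays free, so the 0.1 s timeout fires on time...
        return await asyncio.wait_for(asyncio.to_thread(blocking_work),
                                      timeout=0.1)
    except asyncio.TimeoutError:
        # ...but the worker thread still runs blocking_work to completion
        return "timeout"

result = asyncio.run(main())
```

The timeout now fires promptly -- yet the blocking call is still not cancelled; it finishes in its thread.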

The Correct Async Cancellation

import asyncio
import aiohttp

async def fetch_with_timeout(url, timeout):
    async with aiohttp.ClientSession() as session:
        try:
            async with asyncio.timeout(timeout):  # Python 3.11+
                async with session.get(url) as response:
                    return await response.text()
        except asyncio.TimeoutError:
            # Timeout fired
            # aiohttp connection cancelled
            # TCP connection closed
            # Server may or may not notice
            raise

This actually cancels the HTTP request:

  • asyncio.timeout() fires after timeout
  • Context manager cancels all operations in scope
  • aiohttp session closes TCP connection
  • Server receives TCP RST or FIN

The server might still continue processing if it doesn’t check for client disconnect.

JavaScript Promise Timeouts: No Native Cancellation

JavaScript Promises cannot be cancelled. Timeouts don’t stop promise execution.

The Promise Timeout Pattern

function fetchWithTimeout(url, timeout) {
    return Promise.race([
        fetch(url),
        new Promise((_, reject) =>
            setTimeout(() => reject(new Error('Timeout')), timeout)
        )
    ]);
}

// Use it
fetchWithTimeout('https://api.example.com/data', 5000)
    .then(response => console.log(response))
    .catch(error => console.error(error));

What happens on timeout:

// Promise.race():
// - Starts fetch() promise
// - Starts timeout promise
// - Waits for first to complete
// - Timeout completes at 5s
// - Returns rejected promise
// - Caller sees timeout error

// fetch() promise:
// - Still running
// - Cannot be cancelled
// - Continues making HTTP request
// - Completes at 30s
// - Response ignored (race already resolved)
// - HTTP request completed unnecessarily

The timeout won’t cancel the fetch. The HTTP request completes. The response is discarded.

The AbortController Solution

function fetchWithTimeout(url, timeout) {
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), timeout);

    return fetch(url, { signal: controller.signal })
        .then(response => {
            clearTimeout(timeoutId);
            return response;
        })
        .catch(error => {
            clearTimeout(timeoutId);
            throw error;
        });
}

// Use it
fetchWithTimeout('https://api.example.com/data', 5000)
    .then(response => console.log(response))
    .catch(error => {
        if (error.name === 'AbortError') {
            console.error('Request timed out');
        } else {
            console.error('Request failed', error);
        }
    });

Now timeout actually cancels:

// AbortController:
// - Created at start
// - Passed to fetch() as signal
// - setTimeout() scheduled
// - Timeout fires at 5s
// - controller.abort() called
// - AbortSignal set to aborted
// - fetch() checks signal
// - fetch() cancels HTTP request
// - TCP connection closed
// - Promise rejected with AbortError

This requires the API to support AbortSignal. Native fetch() does. Custom promise-based APIs might not.

The Fetch API Still Doesn’t Guarantee Server Cancellation

// Client aborts request
controller.abort();

// Browser closes TCP connection
// Sends TCP RST packet

// Server might:
// - Notice connection closed immediately
// - Notice on next write attempt
// - Never notice (if only reading)
// - Continue processing for minutes

The server-side framework must check for client disconnect. Not all do.

The Reverse Proxy Timeout Cascade

Multiple layers of timeouts interact poorly. Each layer has its own timeout. None coordinate.

The Timeout Stack

  • Browser timeout: 30s
  • Load balancer timeout: 60s
  • Nginx timeout: 120s
  • Application timeout: 180s
  • Database timeout: 300s

What happens:

# Request takes 45 seconds

# Browser side:
# - Request sent at t=0
# - Browser timeout: 30s
# - Browser aborts at t=30
# - TCP connection closed by browser
# - User sees error

# Load balancer side:
# - Request received at t=0
# - Forwarded to Nginx
# - Client (browser) disconnected at t=30
# - Load balancer might not notice
# - Load balancer timeout: 60s (hasn't fired)
# - Load balancer keeps connection to Nginx open

# Nginx side:
# - Request received at t=0
# - Forwarded to application
# - Client (load balancer) still connected
# - Nginx timeout: 120s (hasn't fired)
# - Nginx keeps connection to application open

# Application side:
# - Request received at t=0
# - Processing query
# - Browser gave up at t=30; application unaware
# - Continues processing
# - Completes at t=45
# - Returns response

# Response propagates back:
# - Application → Nginx: success
# - Nginx → Load balancer: may fail, depending on connection state
# - Load balancer → Browser: fails (connection closed)

# Resources consumed:
# - Application CPU: 45 seconds
# - Database connection: 45 seconds
# - Nginx connection: 45 seconds
# - Load balancer connection: 30 seconds
# - Browser connection: 30 seconds

Every layer consumed resources for work that was abandoned by the browser after 30 seconds.
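The fix for uncoordinated stacks is to propagate one deadline instead of layering independent timeouts: each hop computes its remaining budget from a shared absolute deadline and refuses work the caller has already abandoned. A minimal sketch (the `X-Request-Deadline` header name is an invention for illustration, not a standard):

```python
import time

def remaining_budget(deadline):
    """Seconds left before an absolute deadline (epoch seconds)."""
    return deadline - time.time()

def handle(deadline):
    budget = remaining_budget(deadline)
    if budget <= 0:
        # The caller has already given up; doing the work only burns resources
        raise TimeoutError("deadline exceeded before work started")
    # Forward the SAME absolute deadline downstream, e.g.:
    #   requests.get(url, timeout=budget,
    #                headers={"X-Request-Deadline": str(deadline)})
    return budget
```

Every layer then agrees on when the request is dead, instead of each layer discovering it at a different time.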

The Nginx Timeout Configuration

http {
    # Client timeouts
    client_header_timeout 60s;  # Waiting for client to send headers
    client_body_timeout 60s;    # Waiting for client to send body
    send_timeout 60s;           # Waiting to send response to client

    # Proxy timeouts
    proxy_connect_timeout 60s;  # Connecting to upstream
    proxy_send_timeout 60s;     # Sending request to upstream
    proxy_read_timeout 60s;     # Reading response from upstream

    location /api {
        proxy_pass http://backend;

        # Override timeouts
        proxy_read_timeout 30s;
    }
}

Each timeout controls a different phase:

# Request flow:
# 1. Client sends headers → client_header_timeout
# 2. Client sends body → client_body_timeout
# 3. Nginx connects to backend → proxy_connect_timeout
# 4. Nginx sends request to backend → proxy_send_timeout
# 5. Nginx waits for backend response → proxy_read_timeout
# 6. Nginx sends response to client → send_timeout

# If proxy_read_timeout fires (30s):
# - Nginx stops waiting for response
# - Nginx closes connection to backend
# - Backend receives TCP RST
# - Backend might not notice immediately
# - Backend continues processing
# - Backend completes work
# - Backend tries to send response
# - Connection closed (error logged)

Nginx closing the connection doesn’t guarantee backend stops processing.

The Connection Close Detection

from flask import Flask, request
import time

app = Flask(__name__)

@app.route('/long-operation')
def long_operation():
    # Check if client still connected
    for i in range(30):
        # Check connection state (illustrative: only servers that expose
        # the raw socket, like the Werkzeug dev server, support this)
        sock = request.environ.get('werkzeug.socket')
        if sock is None or sock.fileno() == -1:
            # Connection closed
            return "Client disconnected", 499

        # Do work
        time.sleep(1)

    return "Complete"

This checks connection state periodically. If client disconnected, stop processing. But:

  • Only works if framework exposes socket
  • Requires manual checking (not automatic)
  • Adds overhead to check every iteration
  • Most applications don’t do this
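Where the framework does hand over the raw socket, a portable disconnect check looks roughly like this (a sketch; buffered request data or TLS wrapping can defeat the peek, and `client_disconnected` is a name invented here):

```python
import select
import socket

def client_disconnected(sock):
    # A socket that selects readable but peeks zero bytes has
    # been closed by the peer (EOF).
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False          # no data, no EOF: still connected
    try:
        return sock.recv(1, socket.MSG_PEEK) == b""
    except OSError:
        return True           # reset connections also count as gone
```

The zero-timeout select makes the check non-blocking, so it can run between units of work without stalling them.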

The Thread Pool Exhaustion

Timeouts don’t free threads. Timed-out requests still consume thread pool resources.

The Thread Pool Pattern

from concurrent.futures import ThreadPoolExecutor, TimeoutError  # an alias of the builtin only in 3.11+
import time

# Thread pool with 10 threads
executor = ThreadPoolExecutor(max_workers=10)

def slow_operation(request_id):
    time.sleep(30)  # Simulate slow work
    return f"Done: {request_id}"

# Submit 10 requests
futures = []
for i in range(10):
    future = executor.submit(slow_operation, i)
    futures.append(future)

# Wait with timeout
for future in futures:
    try:
        result = future.result(timeout=5)
    except TimeoutError:
        # Timeout after 5 seconds
        # Thread still running slow_operation()
        # Thread not freed
        print("Timeout")

After timeout:

# Main thread:
# - Submitted 10 tasks
# - Waited 5 seconds for each
# - All timed out
# - Moved on

# Thread pool:
# - 10 threads running slow_operation()
# - Each running for 30 seconds
# - All threads busy
# - Thread pool exhausted

# 11th request arrives:
# - Submitted to executor
# - No threads available
# - Blocks waiting for thread
# - Waits 25 more seconds (until first thread completes)

Timeouts don’t free threads. The thread pool remains exhausted.

The Thread Cannot Be Cancelled

import threading
import time

def slow_operation():
    time.sleep(30)
    return "done"

# Start thread
thread = threading.Thread(target=slow_operation)
thread.start()

# Wait with timeout
thread.join(timeout=5)

if thread.is_alive():
    # Thread still running
    # Cannot cancel thread
    # Cannot free resources
    # Can only wait
    print("Thread timed out but still running")

Python threads cannot be cancelled. thread.join(timeout) doesn’t stop the thread. It just stops waiting for it.
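The workaround is the same cooperative pattern asyncio uses: the worker checks a shared flag between small units of work, so a timed-out caller can at least request a stop. A sketch with `threading.Event` (timings shrunk for illustration):

```python
import threading
import time

def cancellable_operation(stop):
    # Break the work into small units and check the flag between them
    for _ in range(300):
        if stop.is_set():
            return "cancelled"
        time.sleep(0.001)     # one unit of "work"
    return "done"

stop = threading.Event()
result = []
worker = threading.Thread(
    target=lambda: result.append(cancellable_operation(stop)))
worker.start()

worker.join(timeout=0.05)     # stop waiting after 50 ms
if worker.is_alive():
    stop.set()                # request cancellation...
    worker.join(timeout=1)    # ...and now the join actually returns
```

This only works because the worker cooperates; a worker stuck inside one long blocking call never sees the flag.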

The Process Pool Alternative

from multiprocessing import Pool, TimeoutError  # multiprocessing's own TimeoutError, not the builtin
import time

def slow_operation(request_id):
    time.sleep(30)
    return f"Done: {request_id}"

# Process pool with 10 workers
pool = Pool(processes=10)

# Submit request
result = pool.apply_async(slow_operation, (1,))

# Wait with timeout
try:
    output = result.get(timeout=5)
except TimeoutError:
    # Timeout fired
    # Process still running
    # Can terminate process
    pool.terminate()  # Force kill all workers
    pool.join()

Process pools can be terminated. But:

  • Terminates ALL workers, not just the timed-out one
  • Loses all in-progress work
  • Requires recreating process pool
  • High overhead (process creation is expensive)

Why Timeouts Make Things Worse

Timeouts without cancellation cause cascading failures.

The Retry Amplification

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Configure retries
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry_strategy)

session = requests.Session()
session.mount("https://", adapter)

# Make request with timeout
try:
    response = session.get('https://api.example.com/slow', timeout=5)
except requests.Timeout:
    print("Request timed out")

What happens:

# Attempt 1:
# - Request sent at t=0
# - Timeout at t=5
# - Server still processing
# - Retry scheduled

# Attempt 2:
# - Request sent at t=6 (1s backoff)
# - Timeout at t=11
# - Server now processing 2 requests
# - Retry scheduled

# Attempt 3:
# - Request sent at t=13 (2s backoff)
# - Timeout at t=18
# - Server now processing 3 requests
# - All retries exhausted

# Server side:
# - 3 requests received
# - All 3 still processing
# - Each takes 30 seconds
# - All 3 complete successfully
# - All 3 responses sent to closed connections
# - 3x resource consumption
# - 0 successful responses to client

Retries on timeout amplify server load. Each retry adds another long-running request. The server load increases while the client sees only failures.

The Cascading Failure

# Service A calls Service B with 5s timeout
def call_service_b():
    try:
        response = requests.get('http://service-b/data', timeout=5)
        return response.json()
    except requests.Timeout:
        # Timeout, retry
        return call_service_b()  # Immediate retry

Under load:

# Service B response time: 6 seconds (just above timeout)
# Service A receives 100 requests/second

# t=0-5:
# - 500 requests sent to Service B (100/sec * 5sec)
# - All timeout at t=5
# - All retry immediately
# - 500 more requests sent to Service B

# t=5-10:
# - Original 500 requests still processing on Service B
# - 500 retry requests sent
# - 500 new requests sent (100/sec * 5sec)
# - Total: 1500 requests on Service B

# t=10-15:
# - 1500 previous requests still processing
# - 1000 retry requests (500+500 from previous timeouts)
# - 500 new requests
# - Total: 3000 requests on Service B

# Service B load: exponential growth
# Service B crashes

Aggressive timeout + retry causes cascading overload. Each timeout triggers a retry. Each retry adds load. The service gets slower. More timeouts occur. More retries sent. Positive feedback loop to failure.

The Circuit Breaker Solution

Circuit breakers prevent timeout retry storms.

Circuit Breaker Pattern

import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = 1   # Normal operation
    OPEN = 2     # Failing, reject requests
    HALF_OPEN = 3  # Testing if recovered

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.success_threshold = success_threshold
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                # Timeout elapsed, try half-open
                self.state = CircuitState.HALF_OPEN
            else:
                # Circuit open, reject immediately
                raise Exception("Circuit breaker open")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise

    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                # Recovered, close circuit
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                self.success_count = 0
        else:
            self.failure_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.failure_count >= self.failure_threshold:
            # Too many failures, open circuit
            self.state = CircuitState.OPEN
            self.success_count = 0

# Use it
breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def call_service():
    try:
        return breaker.call(lambda: requests.get('http://service-b/data', timeout=5))
    except Exception as e:
        return {"error": str(e)}

Circuit breaker behavior:

# Normal operation (CLOSED):
# - Requests pass through
# - Failures counted
# - After 5 failures: OPEN

# Circuit open (OPEN):
# - Requests rejected immediately
# - No calls to downstream service
# - After 60 seconds: HALF_OPEN

# Testing recovery (HALF_OPEN):
# - Allow some requests through
# - If 2 succeed: CLOSED (recovered)
# - If any fail: OPEN (still failing)

This prevents retry storms:

  • Failed requests don’t retry
  • Downstream service gets breathing room to recover
  • Load decreases instead of increases

But timeouts still don’t cancel work. The circuit breaker just stops sending more.

The Only Real Solutions

Preventing timeout resource leaks requires actual cancellation mechanisms.

Cancellation Tokens

// Go context propagation
func fetchWithCancellation(ctx context.Context, url string) error {
    req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
    if err != nil {
        return err
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    return nil
}

// Use with timeout
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

err := fetchWithCancellation(ctx, "https://api.example.com/data")

The context travels with the request. When the timeout fires, the context is marked cancelled; the HTTP transport notices and aborts the underlying connection, and any code that checks the context can stop early.

Graceful Shutdown Signals

import signal
import sys

# Global flag
shutdown = False

def signal_handler(sig, frame):
    global shutdown
    shutdown = True
    print("Shutdown signal received")

signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)

# Long-running operation
def process_items():
    for item in get_items():
        if shutdown:
            # Stop processing
            cleanup()
            sys.exit(0)

        process_item(item)

This checks the shutdown flag between items. When a signal arrives, the current item finishes, cleanup runs, and processing stops.

Database Query Cancellation

import psycopg2
import threading

def execute_with_cancellation(conn, query):
    # Execute query in thread
    result = [None]
    exception = [None]

    def execute():
        try:
            cursor = conn.cursor()
            cursor.execute(query)
            result[0] = cursor.fetchall()
        except Exception as e:
            exception[0] = e

    thread = threading.Thread(target=execute)
    thread.start()
    thread.join(timeout=5)

    if thread.is_alive():
        # Timeout, cancel query
        conn.cancel()  # Send cancel request to PostgreSQL
        thread.join()  # Wait for thread to finish
        raise TimeoutError("Query cancelled")

    if exception[0]:
        raise exception[0]

    return result[0]

PostgreSQL’s conn.cancel() sends a cancellation request to the database. The query actually stops executing.

Timeouts Are Not Cancellation

Setting a timeout tells the caller to stop waiting. It does not tell the worker to stop working. Every timeout that fires without cancellation leaves abandoned work consuming resources.

Building reliable systems requires:

  • Understanding that timeouts don’t cancel
  • Using cancellation mechanisms (contexts, signals, abort controllers)
  • Detecting client disconnects and stopping work
  • Avoiding retry storms with circuit breakers
  • Setting timeouts at every layer, not just the outermost
  • Monitoring resource consumption from abandoned work

Timeouts create the illusion of control. Actual control requires cancellation. Most systems have timeouts. Few have cancellation. The gap between them is where resources leak.