Your SLA Is Only as Good as Your Dependencies

A service cannot be more reliable than its dependencies. If you depend on 10 services each with 99.9% uptime, your maximum uptime is not 99.9%. It is lower. Much lower: roughly 99.0%, or ten times the downtime.

The math is simple. The implications are not. Most SLAs are signed before anyone calculates whether the dependency chain can actually deliver them. By the time the first incident happens, the contract is already impossible.

Availability compounds through the call chain

Each dependency is a probabilistic gate. If a dependency is available 99% of the time, 1% of requests fail. If every request needs all 5 of your dependencies, each is available 99% of the time, and they fail independently, the combined availability is:

$$ 0.99^5 \approx 0.951 = 95.1\% $$

Five dependencies at two nines each leave you with barely one nine. If you promised 99% uptime, you are already in breach.
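The multiplication is easy to script. The following is a minimal sketch, assuming every dependency is required and failures are independent (real outages are often correlated); the function name is illustrative.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request that needs every dependency to succeed.

    Assumes independent failures; correlated outages change the result.
    """
    return prod(availabilities)


if __name__ == "__main__":
    for count in (1, 5, 10):
        combined = serial_availability([0.99] * count)
        print(f"{count} dependencies at 99% each: {combined:.1%}")
```

Running it for a growing number of dependencies shows how quickly the ceiling drops as the chain gets longer.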

The serial dependency problem

import requests

TIMEOUT_S = 2  # Illustrative value; without a timeout, requests can wait indefinitely


def get_json(url, **kwargs):
    # Raise on HTTP error statuses so a 5xx counts as a failed dependency
    resp = requests.get(url, timeout=TIMEOUT_S, **kwargs)
    resp.raise_for_status()
    return resp.json()


def handle_request(user_id):
    # Dependency 1: User service (99.9% uptime)
    user = get_json(f'http://users/api/{user_id}')

    # Dependency 2: Auth service (99.9% uptime)
    auth = get_json(f'http://auth/validate/{user["token"]}')

    # Dependency 3: Products service (99.9% uptime)
    products = get_json('http://products/catalog')

    # Dependency 4: Pricing service (99.9% uptime)
    prices = get_json('http://pricing/calculate', json={'items': products})

    # Dependency 5: Inventory service (99.9% uptime)
    available = get_json('http://inventory/check', json={'items': products})

    # build_response is assumed to be defined elsewhere
    return build_response(user, products, prices, available)

Each service has 99.9% uptime. The combined availability is:

$$ 0.999^5 \approx 0.995 = 99.5\% $$

You promised 99.9%. You can deliver 99.5%. The gap is 0.4 percentage points, which is about 35 hours per year of additional downtime. Your SLA is mathematically impossible.
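To turn an availability figure into something a contract reviewer can feel, convert it to hours. This is a small sketch, assuming a 365-day year; the helper name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    promised = downtime_hours_per_year(0.999)
    achievable = downtime_hours_per_year(0.999 ** 5)
    print(f"promised budget: {promised:.1f} h/year")
    print(f"achievable:      {achievable:.1f} h/year")
    print(f"extra downtime:  {achievable - promised:.1f} h/year")
```

The difference between the two budgets is the roughly 35 hours quoted above.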

Parallel dependencies do not help much

If dependencies are called in parallel and any one can fail without blocking the request, availability improves. But only if you explicitly handle failures.

The parallel dependency illusion

async function loadDashboard(userId) {
  const [user, notifications, activity, recommendations] = await Promise.all([
    fetch(`/api/users/${userId}`),        // 99.9% uptime
    fetch(`/api/notifications/${userId}`), // 99.9% uptime
    fetch(`/api/activity/${userId}`),      // 99.9% uptime
    fetch(`/api/recommendations/${userId}`) // 99.9% uptime
  ]);

  return {
    user: await user.json(),
    notifications: await notifications.json(),
    activity: await activity.json(),
    recommendations: await recommendations.json()
  };
}

This is parallel, but it is not fault-tolerant. If any dependency fails, the entire request fails. The availability is still multiplicative:

$$ 0.999^4 \approx 0.996 = 99.6\% $$

The actual parallel pattern

// fetch() only rejects on network errors, so treat HTTP error statuses as failures too
const fetchJson = url =>
  fetch(url).then(r => {
    if (!r.ok) throw new Error(`${url} returned ${r.status}`);
    return r.json();
  });

async function loadDashboard(userId) {
  const [user, notifications, activity, recommendations] = await Promise.allSettled([
    fetchJson(`/api/users/${userId}`),
    fetchJson(`/api/notifications/${userId}`),
    fetchJson(`/api/activity/${userId}`),
    fetchJson(`/api/recommendations/${userId}`)
  ]);

  return {
    user: user.status === 'fulfilled' ? user.value : null,
    notifications: notifications.status === 'fulfilled' ? notifications.value : [],
    activity: activity.status === 'fulfilled' ? activity.value : [],
    recommendations: recommendations.status === 'fulfilled' ? recommendations.value : []
  };
}

Now if notifications fail, the dashboard still renders. If the user service fails, the dashboard is useless, so availability still depends on it. If notifications, activity, and recommendations are truly optional:

$$ \text{availability} = 0.999 \text{ (user service only)} $$

But in practice no service is truly optional. Product always wants everything. The actual requirement is “show everything unless something fails, then show what you can.” That requires explicit degradation logic, not just parallel calls.
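One way to make "show what you can" explicit is to separate critical from optional calls and record what was degraded. The sketch below uses Python's asyncio; `fetch_user` and `fetch_notifications` are hypothetical stand-ins for real service calls, and the 0.5-second timeout is an assumption.

```python
import asyncio


async def fetch_user(user_id):
    # Hypothetical stand-in for the user service call.
    return {"id": user_id, "name": "example"}


async def fetch_notifications(user_id):
    # Hypothetical stand-in that simulates an outage.
    raise ConnectionError("notifications service unavailable")


async def load_dashboard(user_id, timeout_s=0.5):
    """Fail only if the critical call fails; degrade the optional one."""
    user_result, notif_result = await asyncio.gather(
        asyncio.wait_for(fetch_user(user_id), timeout_s),
        asyncio.wait_for(fetch_notifications(user_id), timeout_s),
        return_exceptions=True,
    )

    # Critical dependency: without the user there is nothing to render.
    if isinstance(user_result, BaseException):
        raise user_result

    degraded = []
    if isinstance(notif_result, BaseException):
        # Optional dependency: substitute a default and record the gap.
        degraded.append("notifications")
        notif_result = []

    return {"user": user_result, "notifications": notif_result, "degraded": degraded}


if __name__ == "__main__":
    print(asyncio.run(load_dashboard(42)))
```

Returning the `degraded` list lets the caller tell the user that part of the page is missing instead of silently showing an empty section.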

Third-party SLAs are not guarantees

Cloud providers advertise 99.95% or 99.99% uptime. That is an average, not a promise. The SLA defines financial penalties, not technical guarantees.

If AWS S3 has a 4-hour outage, you get a service credit. Your users do not get their data back. The SLA does not prevent downtime. It compensates for it.

The SLA credit calculation

AWS S3 SLA: 99.9% monthly uptime

Actual uptime: 99.44% (4 hours down in a 30-day, 720-hour month)
SLA shortfall: 0.46 percentage points

Service credit: 10% of monthly bill

If you spend $1000/month on S3:
Credit = $100

If the outage cost you $50,000 in lost revenue:
Net loss = $49,900

The SLA credit does not cover your losses. It does not restore availability. It is a token refund for breach of contract. Your SLA to customers is still broken.
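The arithmetic behind that example can be sketched as follows. The 10% credit rate and dollar amounts are the ones used above; the function names are illustrative, and a real provider's credit tiers should be taken from its published SLA.

```python
def monthly_uptime(downtime_hours, days_in_month=30):
    """Fraction of the month the service was up."""
    total_hours = days_in_month * 24
    return 1.0 - downtime_hours / total_hours


def net_loss(lost_revenue, monthly_bill, credit_rate):
    """Loss remaining after the provider's service credit is applied."""
    credit = monthly_bill * credit_rate
    return lost_revenue - credit


if __name__ == "__main__":
    print(f"uptime after a 4-hour outage: {monthly_uptime(4):.2%}")
    print(f"net loss: ${net_loss(50_000, 1_000, 0.10):,.0f}")
```

The credit scales with your bill, while the loss scales with your revenue, which is why the two numbers rarely come close.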

Dependencies you did not count

Most dependency calculations miss the implicit dependencies: DNS, load balancers, API gateways, observability systems, deployment pipelines, certificate authorities.

Each is a dependency with its own availability. Each multiplies into your SLA.

The hidden dependency chain

User Request
  → CDN (99.99%)
    → Load Balancer (99.95%)
      → API Gateway (99.95%)
        → Your Service (99.9%)
          → Database (99.95%)
          → Cache (99.9%)
        → Auth Service (99.9% via third-party)

Combined availability:

$$ 0.9999 \times 0.9995 \times 0.9995 \times 0.999 \times 0.9995 \times 0.999 \times 0.999 \approx 0.9954 = 99.54\% $$

Seven components with excellent individual availability. Combined availability is 99.54%. If you promised 99.9%, you cannot deliver it.
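The calculation can also be run in reverse: given a target, how available must each link be? This is a rough sketch that assumes all components are serial, independent, and equally reliable, which no real chain is; the function name is illustrative.

```python
def required_per_component(target, components):
    """Availability each of `components` equal, independent, serial parts
    needs for the chain as a whole to reach `target`."""
    return target ** (1.0 / components)


if __name__ == "__main__":
    needed = required_per_component(0.999, 7)
    print(f"each of 7 components needs about {needed:.4%}")
    print(f"allowed downtime per component: {(1 - needed) * 8760:.2f} h/year")
```

For a 99.9% target across seven components, each one needs to be well above 99.98%, which is stricter than most of the figures in the chain above.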

Retries do not improve availability

Retries increase the chance that a request survives a transient failure. They do not increase dependency availability. If a service is down, retrying does not bring it back up.

Retries can make availability worse. If every client retries 3 times, a brief outage can become up to a 4x load spike (the original attempt plus three retries) just as the service tries to recover. The service stays down longer.

The retry amplification failure

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Aggressive retry strategy
retry_strategy = Retry(
    total=5,
    backoff_factor=0.1,
    status_forcelist=[429, 500, 502, 503, 504]
)

adapter = HTTPAdapter(max_retries=retry_strategy)
session = requests.Session()
session.mount("http://", adapter)

def call_dependency():
    # Service is down
    # First call fails
    # Retry 1 fails (100ms later)
    # Retry 2 fails (200ms later)
    # Retry 3 fails (400ms later)
    # Retry 4 fails (800ms later)
    # Retry 5 fails (1600ms later)
    # Total time: 3.1 seconds to fail
    return session.get('http://dependency/api')

The retry strategy turned a 100ms failure into a 3.1-second failure. If 100 clients do this simultaneously, the dependency sees 600 requests instead of 100 when it recovers. Recovery is delayed.

Retries convert downtime into extended downtime.
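The two numbers in that example, time to fail and request amplification, follow from simple arithmetic. The sketch below models plain exponential backoff as described in the comments above; a real client library may compute its delays differently, and the function names are illustrative.

```python
def backoff_delays(retries, base_delay_s):
    """Delay before each retry: base, 2*base, 4*base, ..."""
    return [base_delay_s * (2 ** attempt) for attempt in range(retries)]


def attempts_during_outage(clients, retries):
    """Requests sent if every client exhausts its retries."""
    return clients * (1 + retries)


if __name__ == "__main__":
    delays = backoff_delays(retries=5, base_delay_s=0.1)
    print(f"time spent waiting before giving up: {sum(delays):.1f}s")
    print(f"requests from 100 clients: {attempts_during_outage(100, 5)}")
```

Lowering the retry count shrinks both numbers at once, which is usually the cheapest fix.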

Timeouts are your real SLA

If a dependency has 99.9% uptime but p99 latency is 5 seconds, your SLA is controlled by your timeout, not their uptime.

Set timeout to 1 second? Requests faster than 1s succeed. Requests slower than 1s are treated as failures. If 5% of requests exceed 1s, your effective dependency availability is roughly 95%, not 99.9%.

The latency-availability gap

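import (
    "net/http"
    "time"
)

// Package-level variables need var; := is only valid inside a function.
var client = &http.Client{
    Timeout: 500 * time.Millisecond,
}

// Response and parseResponse are assumed to be defined elsewhere.
func callService(url string) (*Response, error) {
    resp, err := client.Get(url)
    if err != nil {
        // Timeout or connection error
        return nil, err
    }
    defer resp.Body.Close()

    return parseResponse(resp)
}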

// Dependency has:
// - 99.9% uptime
// - p50 latency: 50ms
// - p95 latency: 200ms
// - p99 latency: 800ms

// With 500ms timeout (which falls between p95 and p99):
// between 1% and 5% of requests exceed the timeout
// Effective availability: at most 99.9% × 99% ≈ 98.9%

The dependency reports 99.9% uptime. Your service sees at most 98.9% availability because of latency. The SLA is defined by the timeout, not the uptime metric.
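With real latency measurements, the same estimate can be computed directly rather than read off percentiles. This is a sketch under the assumption that slow responses and outages are independent; the sample data is synthetic and the function name is illustrative.

```python
def effective_availability(uptime, latencies_ms, timeout_ms):
    """Uptime multiplied by the share of responses that beat the timeout."""
    within = sum(1 for latency in latencies_ms if latency <= timeout_ms)
    return uptime * within / len(latencies_ms)


if __name__ == "__main__":
    # Synthetic sample: 95 fast responses, 4 moderate, 1 slow.
    sample = [50] * 95 + [200] * 4 + [800]
    for timeout in (100, 500, 1000):
        result = effective_availability(0.999, sample, timeout)
        print(f"timeout {timeout}ms: {result:.2%}")
```

Sweeping the timeout this way shows the trade-off: a tighter timeout protects your own latency but lowers the availability you observe.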

Circuit breakers isolate but do not improve availability

A circuit breaker prevents cascading failures by stopping calls to broken dependencies. That protects your service. It does not make the dependency more available.

When the circuit is open, requests fail fast instead of timing out. The user still sees an error. Availability is unchanged. The failure is just cheaper.

The circuit breaker math

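import requests
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=30)
def call_payment_service(order_id):
    # timeout=2 matches the 2s timeouts assumed below; raise on HTTP errors
    # so the breaker counts them as failures
    resp = requests.post('http://payments/charge',
                         json={'order_id': order_id}, timeout=2)
    resp.raise_for_status()
    return resp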

# Payment service goes down
# First 5 requests fail (timeout after 2s each)
# Circuit opens - all subsequent requests fail immediately

# Availability before circuit breaker:
# - 100 req/s
# - Service down for 60s
# - 6000 requests timeout after 2s each
# Impact: 6000 failures, users wait 2s each

# Availability with circuit breaker:
# - First 5 requests timeout (10s total wait time)
# - Next 5995 requests fail immediately (0s wait time)
# Impact: 6000 failures, but 5995 fail instantly

# Failure rate: same
# User experience: better (fast failure)
# Availability: unchanged

Circuit breakers reduce blast radius. They do not improve the SLA. The dependency is still unavailable.
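The comparison above reduces to counting how many failures pay the full timeout. The sketch below is a simplified model: it ignores concurrent in-flight requests and the probe calls a breaker makes while testing for recovery, and the function name is illustrative.

```python
def total_wait_seconds(failed_requests, timeout_s, breaker_threshold=None):
    """Total time users spend waiting on requests that end in failure.

    Without a breaker every failure waits for the timeout. With one, only
    the failures that trip the breaker do; the rest fail immediately.
    """
    if breaker_threshold is None:
        slow_failures = failed_requests
    else:
        slow_failures = min(failed_requests, breaker_threshold)
    return slow_failures * timeout_s


if __name__ == "__main__":
    failures = 100 * 60  # 100 req/s during a 60s outage
    print(f"without breaker: {total_wait_seconds(failures, 2)}s of waiting")
    print(f"with breaker:    {total_wait_seconds(failures, 2, breaker_threshold=5)}s of waiting")
```

The failure count is identical in both cases; only the cost of each failure changes.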

You cannot SLA your way out of bad dependencies

If a critical dependency has 95% uptime, no contract will make it 99.9%. You have three options:

Accept the lower SLA. Your SLA is 95%. Customers know what they get.
Make the dependency non-critical. Add fallbacks so failures degrade instead of fail.
Replace the dependency. Find one that can meet your requirements.

What you cannot do is promise 99.9% and hope the dependency improves. That is not engineering. That is wishful thinking.

The honest SLA

# Your service SLA
availability: 99.5%

# Dependencies
database: 99.95%
cache: 99.9%
payment_provider: 99.0%  # Third-party, out of your control
email_service: 95.0%     # Third-party, best effort

# Calculated maximum availability
# Assuming payment and email are critical:
max_availability: 0.9995 × 0.999 × 0.990 × 0.950 = 0.939 = 93.9%

# Your promised SLA: 99.5%
# Your achievable SLA: 93.9%
# Gap: 5.6%

# Options:
# 1. Lower SLA to 93.9%
# 2. Make email non-critical (fallback to queue)
# 3. Make payment non-critical (accept orders, charge later)
# 4. Replace payment provider with higher SLA option

The math determines the possible. The architecture determines the actual. The contract is only valid if reality supports it.
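The options can be compared numerically before anyone redesigns anything. This sketch reuses the dependency figures from the example above and assumes independent failures; the function and dictionary names are illustrative.

```python
from math import prod

DEPENDENCIES = {
    "database": 0.9995,
    "cache": 0.999,
    "payment_provider": 0.990,
    "email_service": 0.950,
}


def max_availability(dependencies, non_critical=()):
    """Product of the availabilities of dependencies that remain critical."""
    return prod(a for name, a in dependencies.items() if name not in non_critical)


if __name__ == "__main__":
    scenarios = {
        "everything critical": (),
        "email non-critical": ("email_service",),
        "email and payment non-critical": ("email_service", "payment_provider"),
    }
    for label, optional in scenarios.items():
        print(f"{label}: {max_availability(DEPENDENCIES, optional):.2%}")
```

With these figures, queueing email alone still leaves the ceiling below the promised 99.5%; the payment provider has to become non-critical or be replaced as well.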

Redundancy costs more than you think

The fallback strategy is redundancy. Use two payment providers. If one is down, fail over to the other. That improves availability but adds complexity, cost, and new failure modes.

Two providers with 99% uptime each, assuming their outages are independent:

$$ \text{combined availability} = 1 - (0.01 \times 0.01) = 0.9999 = 99.99\% $$

That is the theory. The practice includes:

Failover detection delay. You do not know instantly that provider A is down.
Failover logic bugs. The fallback path is rarely tested in production.
Partial failures. Provider A is slow, not down. When do you fail over?
State synchronization. If provider A accepted a payment and then failed, does provider B know?

Redundancy improves availability. It does not eliminate dependency risk. It replaces one dependency with two dependencies and a failover mechanism. The failover mechanism is now a dependency.
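One simple way to account for the failover mechanism is to give it its own success probability. The model below is an assumption for illustration, not a standard formula from any provider: the request succeeds if the primary is up, or if the primary is down, failover works, and the secondary is up.

```python
def redundant_availability(primary, secondary, failover_success=1.0):
    """Availability of a primary/secondary pair with imperfect failover.

    Assumes independent outages and that failover either works or does not.
    """
    return primary + (1.0 - primary) * failover_success * secondary


if __name__ == "__main__":
    for success in (1.0, 0.9, 0.5):
        combined = redundant_availability(0.99, 0.99, success)
        print(f"failover works {success:.0%} of the time: {combined:.3%}")
```

With a perfect failover path the result matches the 99.99% above; every point of failover unreliability pulls it back toward the availability of a single provider.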

The only honest SLA strategy

Calculate the combined availability of all dependencies. Subtract a margin for unknowns and failures in your own code. That is your maximum achievable SLA.

If that number is lower than what customers demand, fix the architecture before signing the contract. More reliable dependencies, fewer dependencies, or better fallbacks. Not promises you cannot keep.

Your SLA is a contract. It should be a contract you can fulfill. Dependency math determines what is possible. Everything else is a bet that failure will not happen.

It always does.