A traditional API fails. Your client retries with exponential backoff. The service recovers. Requests succeed. This pattern is standard practice.
An AI API fails. Your client retries with exponential backoff. The AI service is overloaded by retries. Recovery is delayed. Requests continue failing. The outage extends from minutes to hours.
Retry policies designed for stateless HTTP services do not work for AI services. AI services have different failure modes, different cost structures, and different recovery patterns.
Applying standard retry logic to AI failures amplifies outages instead of mitigating them.
AI Failures Are Not Transient
A database connection fails. You retry. The network blip is over. The connection succeeds. The failure was transient.
An AI model returns degraded outputs. You retry. The model still returns degraded outputs. The failure is not transient. It is structural.
AI models fail for reasons that do not resolve on retry:
- The model was trained on data that does not cover this input distribution
- The prompt triggers adversarial behavior or hallucination patterns
- The input context exceeds what the model can coherently process
- The model’s confidence threshold is miscalibrated for this domain
Retrying does not fix any of these. The same input produces the same bad output. Retries burn tokens and add latency without improving quality.
Worse, retries obscure the root cause. Logs show 10 failed requests for what is actually 1 request retried 10 times. Operators believe the issue is more widespread than it is.
Rate Limits Are Asymmetric
A traditional API rate-limits at 1000 requests per minute. You send 1000 requests. The API serves all of them. You send 1001 requests. The 1001st is rejected.
An AI API rate-limits at 1000 tokens per minute. You send 10 requests with 100 tokens each. The API serves all of them. You send an 11th request with 100 tokens. It is rejected because it pushes you to 1100 tokens, over the limit.
Token-based rate limiting means retries consume quota even when they fail. A rejected request still counts against your limit.
If you retry immediately, the retry is also rejected. You have now consumed 200 tokens and received zero successful responses. Each retry makes quota exhaustion worse.
With exponential backoff, retries spread over time. But if the original requests were already near the rate limit, the retries ensure you stay at the limit continuously. Your effective throughput drops to zero.
Traditional rate limits reject excess requests but serve the allowed quota. AI rate limits reject requests and consume quota on rejection. Retries accelerate quota exhaustion.
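The quota arithmetic above can be sketched as a toy limiter. This assumes, as described here, that a rejected request still consumes quota; `TokenRateLimiter` is a hypothetical model for illustration, not any provider's actual accounting:

```python
class TokenRateLimiter:
    """Toy per-minute token quota where a rejected request
    still consumes quota (the asymmetry described above)."""

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.used = 0  # tokens consumed in the current window

    def request(self, tokens: int) -> bool:
        # Quota is charged whether or not the request is served.
        self.used += tokens
        return self.used <= self.limit


limiter = TokenRateLimiter(tokens_per_minute=1000)
served = [limiter.request(100) for _ in range(10)]  # all 10 served
rejected = limiter.request(100)   # 11th rejected: quota hits 1100
retry = limiter.request(100)      # immediate retry also rejected, quota now 1200
```

After one rejection and one retry, 1200 tokens are charged against the window with zero additional successful responses: each retry moves the quota reset further out of reach.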
Model Capacity Does Not Scale with Retries
A traditional service is load balanced across 10 instances. One instance fails. The load balancer routes retries to the other 9 instances. Capacity decreases by 10%, not 100%.
An AI service runs a single large model. The model degrades under load. Requests start timing out. Clients retry. The retries add more load. The model degrades further.
AI model capacity is not horizontally scalable like stateless services. You cannot just add more instances of GPT-4. The model runs on specific hardware with fixed capacity.
When the model is saturated, retries do not distribute to available capacity. They add to the queue of requests waiting for the same saturated resource.
If 1000 requests are already queued and timing out, retrying them creates 2000 queued requests. The timeout rate increases. More requests retry. The queue grows exponentially.
The model never recovers because retries continuously add load faster than the model can drain the queue.
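A toy simulation of this feedback loop, with illustrative numbers (`arrivals`, `capacity`, and the timeout proxy are assumptions, not measurements from a real system):

```python
def simulate_queue(steps, arrivals, capacity, retry=True):
    """Toy model: each step the model drains `capacity` requests.
    Requests queued beyond 2x capacity are assumed to time out at
    their clients; with retry=True they re-enter the queue on top
    of new arrivals instead of replacing them."""
    queue, history = 0, []
    for _ in range(steps):
        queue += arrivals
        timed_out = max(0, queue - 2 * capacity)  # crude timeout proxy
        if retry:
            queue += timed_out  # retries add load rather than replace it
        queue = max(0, queue - capacity)
        history.append(queue)
    return history


with_retries = simulate_queue(10, arrivals=120, capacity=100)
no_retries = simulate_queue(10, arrivals=120, capacity=100, retry=False)
```

Without retries the backlog grows linearly with the modest overload; with retries, once timeouts begin, the queue compounds each step and the model never gets ahead of it.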
Timeouts Are Longer Than Clients Expect
A traditional API responds in 100ms on average. You set a 500ms timeout. 99% of requests succeed. 1% time out and retry.
An AI API responds in 2 seconds on average. You set a 5-second timeout. During load spikes, requests take 10 seconds. 50% of requests time out.
AI inference is slow and variable. A simple completion takes 1 second. A complex reasoning task takes 30 seconds. The same model, the same endpoint, wildly different latencies.
If your timeout is shorter than the model’s actual processing time under load, every request times out even though the model eventually processes each one successfully.
Your client retries. The model finishes processing the original request and returns a result that the client has already abandoned. The retry arrives and is processed. The model does double work.
Every timed-out request results in wasted computation on the original request plus additional computation on the retry. The model’s effective capacity drops by 50% or more.
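The capacity loss can be sketched with simple accounting. `effective_capacity` is a hypothetical helper: it ignores queueing effects and assumes every abandoned attempt is still fully processed, as described above:

```python
def effective_capacity(server_latency_s: float,
                       client_timeout_s: float,
                       max_retries: int) -> float:
    """Fraction of the model's compute that produces output the
    client actually consumes. If the client's timeout is shorter
    than real processing time, every attempt is abandoned and
    reprocessed; at most one attempt's result is used."""
    if client_timeout_s >= server_latency_s:
        return 1.0  # no abandoned work
    attempts = 1 + max_retries  # original + retries, all fully processed
    return 1 / attempts


effective_capacity(server_latency_s=10, client_timeout_s=5, max_retries=1)  # 0.5
```

With a single retry, half the model's work is wasted; with three retries, three quarters of it is.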
Partial Failures Are Not Detectable
A traditional API returns HTTP 500. You know the request failed. You retry.
An AI API returns HTTP 200 with a hallucinated response. You do not know the request failed. You do not retry.
Or worse: the AI API returns HTTP 200 with a response that is subtly wrong. Dates are plausible but incorrect. Code compiles but has logic bugs. Summaries omit critical details.
If you implement retry logic based on HTTP status codes, partial failures are invisible. The response looks successful. The client consumes bad data.
If you implement retry logic based on output validation, you must parse and validate every response. Validation is slow, complex, and often impossible without domain knowledge the client does not have.
Retrying on all responses creates infinite retry loops. Retrying on no responses means accepting bad outputs. Retrying on detected bad outputs requires validation logic that itself may be unreliable.
Traditional services have clear success/failure signals. AI services have ambiguous outputs that require human judgment to evaluate.
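If you do choose the validation-gated option, a hard attempt cap at least prevents the infinite-loop case. A minimal sketch, where `call` and `validate` are placeholders for your client and your (possibly unreliable) domain checks:

```python
def retry_with_validation(call, validate, max_attempts=2):
    """Retry only when the validator flags the output, and cap
    attempts so an unreliable validator cannot loop forever.
    Returns the last output either way; the caller decides how
    to degrade when validation never passes."""
    last = None
    for _ in range(max_attempts):
        last = call()
        if validate(last):
            return last
    return last
```

The cap means a consistently bad output is returned after `max_attempts` rather than retried indefinitely, which bounds both cost and added load.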
Cost Multiplies with Retries
A traditional API call costs fractions of a cent. Retrying 10 times costs fractions of a cent. The cost is negligible.
An AI API call costs $0.10 per request. Retrying 10 times costs $1.00. The cost is not negligible.
If your retry policy is max_retries=10 with exponential backoff, a single failed request can cost 11x the intended amount: one original attempt plus ten retries. If 10% of requests fail and exhaust their retries, your costs double.
During an outage where all requests fail, every request retries to the maximum. Your costs increase by 10x while your success rate drops to zero. You pay more and receive less.
AI APIs charge per token, per second of compute, or per request. All of these accumulate on retry. There is no free retry budget.
If your application retries automatically without cost awareness, an outage can exhaust your budget in minutes. The AI provider bills you for the failed requests. Your application stops working due to budget limits, extending the outage.
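One mitigation is to make the retry path cost-aware. A minimal sketch of a retry spend cap; the class name and the numbers are illustrative:

```python
class RetryBudget:
    """Cap total spend on retries. Once the budget is exhausted,
    fail fast instead of paying for attempts that are unlikely to
    succeed during an outage."""

    def __init__(self, max_retry_spend: float):
        self.max_retry_spend = max_retry_spend
        self.spent = 0.0

    def allow(self, cost_per_request: float) -> bool:
        if self.spent + cost_per_request > self.max_retry_spend:
            return False  # budget exhausted: do not retry
        self.spent += cost_per_request
        return True


budget = RetryBudget(max_retry_spend=0.50)
allowed = [budget.allow(0.10) for _ in range(10)]  # only the first 5 pass
```

During an outage, this converts "10x costs at zero success rate" into a bounded, predictable spend before the application stops retrying.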
Exponential Backoff Assumes Independent Failures
Exponential backoff works when failures are independent. One request fails due to a transient issue. The next request probably succeeds.
AI service failures are not independent. They are systemic.
If the model is overloaded, all requests are slow. If the inference cluster is down, all requests fail. If the model was updated and introduced a regression, all requests to the affected endpoint degrade.
Backing off does not help. Waiting 1 second, 2 seconds, 4 seconds, 8 seconds does not matter if the model will be overloaded for the next 20 minutes.
Worse, exponential backoff assumes your client is the only one retrying. In reality, thousands of clients are all retrying with exponential backoff.
The backoff intervals are not synchronized. Clients retry at different times, creating a continuous stream of retries. The service never experiences a lull where it could recover.
Exponential backoff spreads retries over time for a single client but does not coordinate across clients. At scale, exponential backoff becomes a sustained retry storm.
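The aggregate effect is easy to reproduce. A toy simulation, assuming 1,000 clients whose first requests fail at different moments over a ten-second window and who each back off exponentially with jitter:

```python
import random


def retry_times(start, base=1.0, retries=4, jitter=0.1, rng=random):
    """Retry timestamps for one client whose original request fails
    at time `start`: exponential backoff (1s, 2s, 4s, 8s) with
    +/-10% jitter on each interval."""
    t, out = start, []
    for i in range(retries):
        t += base * (2 ** i) * rng.uniform(1 - jitter, 1 + jitter)
        out.append(t)
    return out


rng = random.Random(0)
events = sorted(
    t for _ in range(1000)
    for t in retry_times(start=rng.uniform(0, 10), rng=rng)
)
# How many retries the service receives in each 1-second bucket:
per_second = [sum(1 for t in events if s <= t < s + 1) for s in range(25)]
```

Per-client backoff spreads one client's attempts out, but across clients every one-second bucket after the initial ramp-up contains retries: the service sees a sustained stream, never a lull in which to recover.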
Circuit Breakers Trip Too Late
A circuit breaker trips after N consecutive failures. Once tripped, requests fail immediately without calling the downstream service. This prevents retry storms.
For traditional services, N=5 is reasonable. Five failures indicate a problem. Fail fast and stop sending traffic.
For AI services, N=5 is too small. AI services have high variance. Five failures might be normal variance, not an outage.
If N is too small, the circuit breaker trips during normal operation. Legitimate requests are rejected. If N is too large, the circuit breaker does not trip until hundreds of requests have failed and retried.
The circuit breaker is either too sensitive or too late. Tuning it requires understanding the baseline failure rate of the AI service, which is higher and more variable than traditional services.
By the time the circuit breaker trips, the damage is done. Hundreds of retries have amplified the outage. The circuit breaker stops future requests but does not undo the load already sent.
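For reference, the consecutive-failure breaker described above takes only a few lines; the hard part is not the mechanism but choosing `threshold` for a high-variance backend:

```python
class CircuitBreaker:
    """Minimal consecutive-failure breaker. Too small a threshold
    trips on normal AI-service variance; too large trips only after
    the retry storm has already amplified the outage."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failures = 0
        self.open = False  # open = reject without calling downstream

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True
            raise
        self.failures = 0  # any success resets the count
        return result
```

Note that opening the circuit only stops future traffic; the requests and retries already queued downstream still have to be processed or shed.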
AI Services Degrade Gradually, Not Abruptly
A traditional service is either up or down. Healthy or unhealthy. Response times are consistent until failure, then spike to timeout.
AI services degrade gradually. As load increases:
- Response times increase from 2s to 5s to 10s to 30s
- Output quality decreases as the model uses faster approximations
- Error rates increase from 1% to 5% to 20%
This gradual degradation confuses retry logic. Is a 10-second response a success or a failure? Is a low-quality output acceptable or should it be retried?
If you retry on slow responses, you add more load during the period when the service is already struggling. The retries push the service from degraded to failed.
If you do not retry on slow responses, you accept degraded performance. But the degraded performance may be unacceptable for your use case.
There is no clear threshold where retries become appropriate. Gradual degradation creates a gray zone where retries amplify the problem without clearly failing.
Recovery Requires Draining Load, Not Adding It
A traditional service recovers when the fault is fixed. A crashed instance restarts. A network partition heals. Traffic resumes.
An AI service recovers when the load drops below capacity. The model needs idle cycles to drain the queue, cool down GPUs, and stabilize inference latency.
Retries prevent recovery. Even after the root cause is fixed, retries keep the model at maximum load. The queue never drains. Response times never improve. The service remains in a degraded state.
This is a stable failure mode. The service is functional but overloaded. Retries maintain the overload. Operators see requests succeeding (eventually), so they do not realize the service is in a degraded state.
The only way to recover is to stop sending traffic. But clients are configured to retry. They never stop sending traffic. The service never recovers.
Recovery requires cooperation: clients must stop retrying, wait for the queue to drain, then resume at a controlled rate. Standard retry policies do not implement this.
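That cooperative behavior can be sketched as an adaptive send rate, AIMD-style. `AdaptiveRate` and its 10% steps are illustrative, and something external (an operator or a health signal) still has to decide when the queue has drained:

```python
class AdaptiveRate:
    """Client-side send rate that stops on failure, waits for the
    backend to drain, then resumes low and ramps up on success."""

    def __init__(self, max_rps: float):
        self.max_rps = max_rps
        self.rps = max_rps

    def on_failure(self):
        self.rps = 0.0  # stop entirely; let the queue drain

    def on_drained(self):
        self.rps = self.max_rps * 0.1  # resume at 10% of normal

    def on_success(self):
        # Additive increase back toward full rate.
        self.rps = min(self.max_rps, self.rps + self.max_rps * 0.1)
```

The key difference from backoff is the hard stop: the client sends nothing while the backend drains, instead of continuing to probe it at decaying intervals.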
Jitter Is Not Enough
Adding jitter to exponential backoff prevents synchronized retry storms. Instead of all clients retrying at exactly 1s, 2s, 4s, they retry at 1s±10%, 2s±10%, 4s±10%.
This helps for small numbers of clients. For thousands of clients, jitter does not matter.
If 10,000 clients are retrying with 10% jitter, the retries spread over a 200ms window instead of happening simultaneously. But 10,000 retries in 200ms is still a storm.
The AI service does not care whether retries arrive in a 1ms window or a 200ms window. Both are far faster than the model can process them. The queue grows regardless.
Jitter reduces synchronized load spikes but does not reduce total retry volume. AI services fail under sustained load, not just spikes. Jitter does not prevent this.
Idempotency Does Not Apply
Traditional APIs are designed to be idempotent. Retrying a GET request is safe. Retrying a POST request with an idempotency key is safe. The server deduplicates and returns the same result.
AI APIs are not idempotent. The same prompt can return different completions due to temperature settings, sampling, and model nondeterminism.
Retrying an AI request does not return the same result. It generates a new result, consumes new tokens, and incurs new cost.
If the first response was wrong and the retry succeeds, you have consumed 2x tokens. If both responses are wrong, you have consumed 2x tokens for 0x value.
AI services cannot deduplicate retries because every invocation is unique. Even if the input is identical, the output is not. Idempotency tokens do not help.
Model Updates Invalidate Retry Assumptions
A traditional service updates code. The new code is backward compatible. Retry logic continues to work.
An AI model is updated. The new model has different failure modes. Retry logic that worked for the old model fails for the new model.
The old model timed out under load but succeeded on retry after backoff. The new model hallucinates under load and returns bad data immediately. Retrying makes the problem worse, not better.
The old model had a rate limit of 1000 requests per minute. The new model has a rate limit of 500 tokens per minute. Retry logic designed for request-based limits does not account for token-based limits.
Model updates are frequent. Providers roll out new versions without coordinating with clients. Retry logic that worked yesterday may fail today.
Providers rarely expose an API that lets clients detect model changes and adjust retry behavior. Clients typically discover the change when retries start failing.
Observability Loses Signal
In traditional systems, retry metrics are useful. High retry rates indicate service instability. Operators investigate and fix the root cause.
In AI systems, retry metrics are misleading. High retry rates might indicate:
- Service instability
- Bad prompts that consistently fail
- Rate limit misconfiguration
- Timeout settings that are too aggressive
- Cost budget exhaustion
- Model degradation after an update
Operators cannot distinguish between these causes from retry metrics alone. They see high retry rates but do not know whether to scale the service, fix prompts, adjust rate limits, or roll back the model.
Retries obscure causation. A spike in failed requests might be one user with a bad prompt retrying 1000 times, or 1000 users each failing once. The metric is the same. The remediation is different.
Traditional observability assumes retries are signal. In AI systems, retries are noise.
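One way to recover causation is to tag every attempt with the originating request's ID, so a dashboard can distinguish one request retried N times from N requests each failing once. A sketch, with `call` and `log` as placeholders for your client and logging pipeline:

```python
import uuid


def call_with_retries(call, max_retries=3, log=print):
    """Retry wrapper that emits one log record per attempt, all
    carrying the same request_id, so retry volume can be grouped
    back to the requests that caused it."""
    request_id = str(uuid.uuid4())
    for attempt in range(1 + max_retries):
        try:
            result = call()
            log({"request_id": request_id, "attempt": attempt, "ok": True})
            return result
        except Exception as exc:
            log({"request_id": request_id, "attempt": attempt, "ok": False,
                 "error": type(exc).__name__})
    raise RuntimeError(f"request {request_id} failed after {1 + max_retries} attempts")
```

Aggregating by `request_id` turns "10 failures" back into "1 request, retried 10 times", which points at a very different remediation than 10 independent failures would.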
Fallback Strategies Do Not Exist
A traditional service fails. You fall back to a cache, a secondary service, or a degraded mode.
An AI service fails. What is the fallback?
You cannot fall back to a cache because every request is unique. You cannot fall back to a secondary AI service because other AI services have the same capacity constraints and likely fail simultaneously during widespread outages.
You cannot fall back to degraded mode because the AI service already is the degraded mode. Before AI, you had a rules-based system or manual process. The AI service replaced it. There is nothing to fall back to.
Retry policies assume a fallback exists. When retries are exhausted, the client can degrade gracefully. AI services often have no graceful degradation path.
Retrying to exhaustion means spending maximum cost and time before finally failing. The failure is both expensive and slow.
Why This Matters
Retry policies are designed for transient, independent failures in stateless services with horizontal scalability and clear success/failure signals.
AI services have persistent failures, correlated across clients, fixed capacity constraints, ambiguous outputs, and high per-request costs.
Standard retry policies applied to AI services:
- Amplify load during outages instead of reducing it
- Multiply costs without improving success rates
- Prevent recovery by maintaining continuous load
- Obscure root causes in observability data
- Exhaust budgets faster than the service can recover
The result is longer outages, higher costs, and worse outcomes for everyone.
AI services require retry policies that account for token-based rate limits, gradual degradation, non-idempotent operations, and capacity constraints. Copying retry logic from traditional services is not sufficient.
Retries are not always the correct response to AI failures. Sometimes the correct response is to fail immediately, log the input for analysis, and avoid amplifying an outage that retry logic cannot fix.
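A minimal sketch of that fail-fast alternative; `call` and `log` are placeholders for your client and your logging pipeline:

```python
def fail_fast(call, prompt, log):
    """One attempt, no retries. On failure, record the input for
    offline analysis and surface the error immediately instead of
    adding load to an already-failing service."""
    try:
        return call(prompt)
    except Exception as exc:
        log({"prompt": prompt, "error": type(exc).__name__})
        raise
```

The failure is cheap and immediate, the input is preserved for diagnosis, and the client adds no retry load to the outage.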