The 3 a.m. Call That Changed The Way I Design APIs

At 3:17 a.m. on a Tuesday, my phone buzzed with the alert that would reshape the way I think about API design.
Our customer-facing API had stopped responding. Not slowly degrading; it was completely dead. Three downstream services went with it. By the time I got to my laptop, customer support tickets were flooding in.
The root cause? A single database replica had gone down, and our API had no fallback. One failure cascaded into total unavailability. I spent the next four hours manually rerouting traffic while our customers waited.
That night cost us $14,000 in service-level agreement (SLA) credits and a lot of trust. But it taught me something I now apply to every API I build: Every design decision should pass what I call "The 3 a.m. Test."
The 3 a.m. Test
The test is simple: When this system breaks at 3 a.m., will the on-call engineer be able to diagnose and fix it quickly?
This single question has eliminated a surprising number of "clever" design choices from my architectures:
- Clever error codes that require documentation lookup? Fail.
- Implicit state that depends on previous requests? Fail.
- Cascading failures that take down unrelated features? Fail.
After that incident, I rebuilt our API infrastructure from the ground up. Over the next three years, handling 50 million daily requests, I developed five principles that transformed our reliability from 99.2% to 99.95% and let me sleep through the night.
Principle 1: Design for Partial Failure
Six months after the initial incident, we had another outage. This time, a downstream payment processor went unresponsive. Our API dutifully waited for responses that never came, and request threads piled up until we crashed.
I realized we'd solved one problem but created another. We needed systems that degraded gracefully instead of failing catastrophically.
Hereâs what we built:
```python
class ResilientServiceClient:
    def __init__(self, primary_url, fallback_url):
        self.primary = primary_url
        self.fallback = fallback_url
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=30
        )

    async def fetch(self, request):
        # Try primary with circuit breaker protection
        if self.circuit_breaker.is_closed():
            try:
                response = await self.call_with_timeout(
                    self.primary, request, timeout_ms=500
                )
                self.circuit_breaker.record_success()
                return response
            except (TimeoutError, ConnectionError):
                self.circuit_breaker.record_failure()

        # Fall back to secondary
        try:
            return await self.call_with_timeout(
                self.fallback, request, timeout_ms=1000
            )
        except Exception:
            # Return degraded response rather than error
            return self.degraded_response(request)
```
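The CircuitBreaker used above is a small helper rather than a library class. Here is a minimal sketch of the behavior it needs, assuming a simple consecutive-failure counter and a time-based recovery window (the details are illustrative, not our production code):

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; lets traffic through again
    once the recovery timeout has elapsed (a simple half-open probe)."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at = None   # monotonic timestamp when the breaker tripped

    def is_closed(self):
        if self.opened_at is None:
            return True   # healthy: no recent trip
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            return True   # recovery window passed: allow a probe to the primary
        return False      # open: skip the primary entirely

    def record_success(self):
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.monotonic()
```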
The key insight: A degraded response is almost always better than an error. Users can work with stale data or reduced functionality. They canât work with a 500 error.
After implementing this pattern across our services, we stopped having cascading failures. When the payment processor went down again (it did, three more times that year), our API returned cached pricing and queued transactions for later processing. Customers barely noticed.
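What degraded_response returns depends on the endpoint. For the payment-processor case it boiled down to two moves: serve the last known-good data, and defer the write. A rough sketch under those assumptions (the price_cache and retry_queue objects, and the request fields, are hypothetical names for illustration):

```python
class PaymentServiceClient(ResilientServiceClient):
    def __init__(self, primary_url, fallback_url, price_cache, retry_queue):
        super().__init__(primary_url, fallback_url)
        self.price_cache = price_cache   # last known-good prices (hypothetical cache)
        self.retry_queue = retry_queue   # durable queue drained when the processor recovers

    def degraded_response(self, request):
        if request.kind == 'quote':
            # Stale pricing beats a 500: serve the cached quote and say so
            cached = self.price_cache.get(request.product_id)
            return Response(data=cached, headers={'X-Degraded': 'stale-price'})
        if request.kind == 'charge':
            # Never drop a payment silently; park it for later processing
            self.retry_queue.enqueue(request)
            return Response(status=202, data={'state': 'queued'})
        return Response(status=503, data={'error': 'temporarily unavailable'})
```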
Principle 2: Make Idempotency Non-Negotiable
This lesson came from a $27,000 mistake.
A mobile client had a bug that caused it to retry failed requests aggressively. One of those requests was a payment. The retry logic didn't include idempotency keys. You can guess what happened next.
A single customer got charged 23 times for the same order. By the time we noticed, we'd processed duplicate charges across hundreds of accounts. The refunds, the customer service hours, and the engineering time to fix it added up to $27,000.
Now, every mutating endpoint requires an idempotency key. No exceptions.
```python
class IdempotentEndpoint:
    def __init__(self):
        self.idempotency_store = RedisStore(ttl_hours=24)

    async def handle_request(self, request, idempotency_key):
        # Check if we've already processed this request
        existing = await self.idempotency_store.get(idempotency_key)
        if existing:
            # Return cached response; don't re-execute
            return Response(
                data=existing['response'],
                headers={'X-Idempotent-Replay': 'true'}
            )

        # Process the request
        result = await self.execute_operation(request)

        # Cache for future retries
        await self.idempotency_store.set(
            idempotency_key,
            {'response': result, 'timestamp': now()}
        )
        return Response(data=result)
```
We also started rejecting requests without idempotency keys for any POST, PUT or DELETE operation. Some client developers complained initially. Then they thanked us when their retry bugs didn't cause data corruption.
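The enforcement itself is a small gate in front of the mutating routes. A sketch, assuming middleware-style request handling like the snippets above (the Idempotency-Key header name is the common convention; the wiring is illustrative):

```python
MUTATING_METHODS = {'POST', 'PUT', 'DELETE'}

async def require_idempotency_key(request, call_next):
    # Reject any mutating request that arrives without an idempotency key
    if request.method in MUTATING_METHODS:
        key = request.headers.get('Idempotency-Key')
        if not key:
            return Response(
                status=400,
                data={'error': 'Idempotency-Key header is required for this operation'}
            )
        request.idempotency_key = key   # pass the key along to the endpoint handler
    return await call_next(request)
```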
Principle 3: Version in the URL, Not the Header
I learned this one by watching a junior engineer debug an issue for six hours.
We'd been versioning our API through a custom header: X-API-Version: 2. It seemed clean. Kept the URLs tidy.
But when something went wrong, our logs showed the URL and response code, not the headers. The engineer was looking at logs for /users/123 and couldn't figure out why the behavior was different between two clients. Six hours later, he finally thought to check the version header.
We moved versioning to the URL path that week:
```
/v1/users/123
/v2/users/123
```
Now version information shows up in:
- Every log entry
- Every trace
- Every error report
- Every monitoring dashboard
The debugging time savings alone justified the migration. But we also established versioning rules that prevented future pain:
- Breaking changes require a new version
- Additive changes (new optional fields) don't require a new version
- We support at least two versions simultaneously
- 12-month deprecation notice before sunsetting any version
When we do deprecate a version, clients get a Deprecation header warning them for months before we actually turn it off.
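Because the version lives in the path, that warning can be one piece of middleware instead of per-endpoint logic. A sketch, assuming the same request/response objects as the earlier snippets (the sunset date and Link target are placeholders):

```python
DEPRECATED_PREFIXES = ('/v1/',)

async def warn_on_deprecated_versions(request, call_next):
    response = await call_next(request)
    # Every response from a deprecated version carries the warning headers
    if request.path.startswith(DEPRECATED_PREFIXES):
        response.headers['Deprecation'] = 'true'
        response.headers['Sunset'] = 'Wed, 01 Jul 2026 00:00:00 GMT'   # placeholder date
        response.headers['Link'] = '</v2/>; rel="successor-version"'
    return response
```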
Principle 4: Rate Limit Before You Need To
We almost learned this lesson the hard way.
A partner company integrated with our API. Their team's implementation had a bug: When they got a timeout, they'd retry immediately. Infinitely. With exponential parallelism.
At 2 p.m. on a Thursday, their system started sending 50,000 requests per second. We didn't have rate limiting. We'd always planned to add it "when we needed it."
We needed it.
Fortunately, our load balancer had basic protection that kicked in and started dropping requests. But legitimate traffic got dropped too. For 47 minutes, our API was essentially a lottery: maybe your request would get through, maybe it wouldn't.
The next week, we implemented tiered rate limiting:
```python
class TieredRateLimiter:
    def __init__(self):
        self.limiters = {
            'per_client': TokenBucket(rate=100, burst=200),
            'per_endpoint': TokenBucket(rate=1000, burst=2000),
            'global': TokenBucket(rate=10000, burst=15000)
        }

    async def check_limit(self, client_id, endpoint):
        # Check all tiers, return first failure
        for tier_name, limiter in self.limiters.items():
            if tier_name == 'per_client':
                key = client_id
            elif tier_name == 'per_endpoint':
                key = endpoint
            else:
                key = 'global'   # single shared bucket across all traffic
            result = await limiter.check(key)
            if not result.allowed:
                return RateLimitResponse(
                    allowed=False,
                    retry_after=result.retry_after,
                    limit_type=tier_name
                )
        return RateLimitResponse(allowed=True)
```
The key details that made this actually useful:
- Always return Retry-After headers so clients know when to try again (see the sketch after this list).
- Include X-RateLimit-Remaining so clients can see their budget.
- Use different limits for different client tiers (partners get more than anonymous users).
- Separate limits per endpoint (the search endpoint can handle more than the payment endpoint).
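Wiring those headers in is the same check, surfaced to the client. A sketch that sits in front of the TieredRateLimiter above, assuming the verdict also carries a remaining count (that field and the middleware wiring are assumptions for illustration):

```python
async def enforce_rate_limit(request, call_next, limiter):
    verdict = await limiter.check_limit(request.client_id, request.path)
    if not verdict.allowed:
        # Tell the client exactly when to retry instead of letting it guess
        return Response(
            status=429,
            data={'error': 'rate limit exceeded', 'limit_type': verdict.limit_type},
            headers={'Retry-After': str(verdict.retry_after)}
        )
    response = await call_next(request)
    # Surface the remaining budget so well-behaved clients can pace themselves
    response.headers['X-RateLimit-Remaining'] = str(verdict.remaining)
    return response
```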
That partner's bug happened again six months later. This time, their requests got rate-limited, our other clients were unaffected, and I didn't even find out until I checked the metrics the next morning.
Principle 5: If You Canât See It, You Canât Fix It
The scariest outages aren't the ones where everything breaks. They're the ones where something is subtly wrong and you don't notice for days.
We had an issue where 3% of requests were failing with a specific error code. Not enough to trigger our availability alerts (we'd set those at 5%). Not enough for customers to flood support. But enough that hundreds of users per day were having a bad experience.
It took us two weeks to notice. Two weeks of a broken experience for real users.
After that, we built observability into every endpoint:
```python
import time

class ObservableEndpoint:
    async def handle(self, request):
        trace_id = self.tracer.start_trace()
        start_time = time.time()
        try:
            response = await self.process(request)

            # Record success metrics
            duration_ms = (time.time() - start_time) * 1000
            self.metrics.histogram('request_duration_ms', duration_ms, {
                'endpoint': request.path,
                'status': response.status
            })
            self.metrics.increment('requests_total', {
                'endpoint': request.path,
                'status': response.status
            })
            return response
        except Exception as e:
            # Record failure with context
            self.metrics.increment('requests_errors', {
                'endpoint': request.path,
                'error_type': type(e).__name__
            })
            self.logger.error('request_failed', {
                'trace_id': trace_id,
                'error': str(e)
            })
            raise
```
Our minimum observability requirements now:
- Request count by endpoint, status code and client
- Latency percentiles (p50, p95, p99) by endpoint
- Error rate by endpoint and error type
- Distributed tracing across service boundaries
- Alerts at 1% error rate, not 5%
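The 1% threshold is plain arithmetic over the counters emitted above. A minimal sketch of the per-endpoint check our alerting evaluates, assuming a helper that can query a counter over a time window (query_counter is a hypothetical name):

```python
ERROR_RATE_THRESHOLD = 0.01   # alert at 1%, not 5%
MIN_REQUESTS = 100            # don't page on a statistically meaningless trickle

def should_alert(endpoint, window_minutes=5):
    """Fire an error-rate alert for this endpoint over the last few minutes?"""
    total = query_counter('requests_total', endpoint=endpoint, minutes=window_minutes)
    errors = query_counter('requests_errors', endpoint=endpoint, minutes=window_minutes)
    if total < MIN_REQUESTS:
        return False
    return errors / total >= ERROR_RATE_THRESHOLD
```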
The 3% failure issue? With our new observability, we would have caught it in minutes, not weeks.
The Results
After three years of applying these principles across our API infrastructure:
| Metric | Before | After |
|---|---|---|
| Monthly availability | 99.2% | 99.95% |
| Mean time to detection | 45 minutes | 3 minutes |
| Mean time to recovery | 2 hours | 18 minutes |
| Client-reported errors | 340/week | 23/week |
| 3 a.m. pages (per month) | 8 | 0.5 |
That last metric is the one I care about most. I went from being woken up twice a week to once every two months.
What I'd Tell My Past Self
If I could go back to before that first 3 a.m. call, I'd tell myself:
- Build for failure from day one. Every external call will eventually fail. Every database will eventually go down. Design for it before it happens, not after.
- Make the safe thing the easy thing. Requiring idempotency keys feels like friction until it saves you from a $27,000 mistake. Rate limiting feels unnecessary until a partner's bug tries to take you down.
- Invest in observability early. You canât fix what you canât see. The cost of good monitoring is nothing compared to the cost of not knowing your system is broken.
- Boring is good. The clever solution that's hard to debug at 3 a.m. isn't clever. Version in the URL. Return clear error messages. Make the obvious choice.
APIs don't survive by accident. They survive by design: specifically, by designing for the moment when everything goes wrong.
Now, when my phone buzzes at 3 a.m., it's usually just spam. And that's exactly how I like it.
