The Pulse: Cloudflare’s latest outage proves dangers of global configuration changes (again)

Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of four topics from last week’s The Pulse issue. Full subscribers received the below article seven days ago. If you’ve been forwarded this email, you can subscribe here.

A mere two weeks after Cloudflare suffered a major outage and took down half the internet, the same thing has happened again. Last Friday, 5th December, thousands of sites went down or partially down once more, in a global Cloudflare outage lasting 25 minutes.

As with last time, Cloudflare was quick to share a full postmortem on the same day. It estimated that 28% of Cloudflare’s HTTP traffic was impacted. This latest outage was caused by a seemingly innocent – but global – configuration change that took out a good portion of Cloudflare’s network until it was reverted. Here’s what happened:

  • Cloudflare was rolling out a fix for a nasty React security vulnerability
  • The fix caused an error in an internal testing tool
  • The Cloudflare team disabled the testing tool with a global killswitch
  • As this global configuration change was made, the killswitch unexpectedly triggered a bug that resulted in HTTP 500 errors across Cloudflare’s network (sketched below)
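
To make the failure mode concrete, here is a minimal, hypothetical Go sketch – not Cloudflare’s actual code or architecture – of why a global killswitch is risky: a single flag is pushed to every machine at once, so if the code path behind the “disabled” branch has a bug, the whole network starts returning 500s at the same moment.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// globalConfig stands in for a configuration blob that is pushed to every
// machine in the fleet at the same time. Flipping a single field changes
// behavior everywhere, simultaneously.
type globalConfig struct {
	TestingToolEnabled bool // the "killswitch" for the internal testing tool
}

var cfg = globalConfig{TestingToolEnabled: true}

func handler(w http.ResponseWriter, r *http.Request) {
	if !cfg.TestingToolEnabled {
		// Hypothetical bug: the "disabled" branch was never exercised in
		// production, and it errors out instead of passing traffic through.
		// Because the flag is global, every machine fails at once.
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	fmt.Fprintln(w, "request served")
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Had the flag been rolled out gradually, the same bug would have surfaced on a handful of machines first, instead of everywhere at once.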

In this latest outage, Cloudflare was burnt by yet another global configuration change. The previous outage in November happened thanks to a global database permissions change. In the postmortem of that incident, the Cloudflare team closed with this action item:

“Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input”

This change would mean that Cloudflare’s configuration files no longer propagate immediately to the full network, as they still do now. But giving all global configuration files staged rollouts is a large implementation effort that could take months. Evidently, there wasn’t time to complete it yet, and that has come back to bite Cloudflare.
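
For contrast, here is a rough sketch of what a staged configuration rollout looks like, in Go. The stage sizes, bake times, and the applyToPercent / fleetHealthy hooks are invented for illustration; they stand in for whatever deployment and monitoring systems a real network would use.

```go
package main

import (
	"fmt"
	"time"
)

// stages describes how widely a config version gets applied and how long
// to wait before widening further. All numbers are invented.
var stages = []struct {
	percent int
	bake    time.Duration
}{
	{1, 5 * time.Minute},   // canary machines
	{10, 15 * time.Minute}, // small slice of the fleet
	{50, 30 * time.Minute},
	{100, 0},
}

// applyToPercent and fleetHealthy are placeholders for real deployment
// and monitoring hooks.
func applyToPercent(version string, percent int) {
	fmt.Printf("applied %s to %d%% of the fleet\n", version, percent)
}

func fleetHealthy() bool { return true }

// rollout widens a config change stage by stage, halting and rolling back
// as soon as a health check fails, so a bad change never reaches the
// whole network at once.
func rollout(version string) error {
	for _, s := range stages {
		applyToPercent(version, s.percent)
		time.Sleep(s.bake) // bake time before checking health and widening
		if !fleetHealthy() {
			applyToPercent("last-known-good", s.percent)
			return fmt.Errorf("rollout halted: health check failed at %d%%", s.percent)
		}
	}
	return nil
}

func main() {
	if err := rollout("config-v42"); err != nil {
		fmt.Println(err)
	}
}
```

The point is the gating: a bad config gets caught and rolled back while it only affects a small slice of the fleet, rather than a large share of global HTTP traffic.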

Unfortunately for Cloudflare, customers are likely to find a second outage with causes similar to one only weeks earlier unacceptable. If Cloudflare proves unreliable, customers should plan to onboard backup CDNs at the very least, and a backup CDN vendor will do its best to convince new customers to make it their primary CDN.

Cloudflare’s value-add rests on rock-solid reliability, without customers needing to budget for a backup CDN. Yes, publishing a postmortem on the same day an outage occurs helps restore trust, but that trust will crumble anyway with repeated large outages.

To be fair, the company is doubling down on implementing staged configuration rollouts. In its postmortem, Cloudflare is its own biggest critic. CTO Dane Knecht reflected:

“[Global configuration changes rolling out globally] remains our first priority across the organization. In particular, the projects outlined below should help contain the impact of these kinds of changes:

  • Enhanced Rollouts & Versioning: Similar to how we slowly deploy software with strict health validation, data used for rapid threat response and general configuration needs to have the same safety and blast mitigation features. This includes health validation and quick rollback capabilities among other things.
  • Streamlined break glass capabilities: Ensure that critical operations can still be achieved in the face of additional types of failures. This applies to internal services as well as all standard methods of interaction with the Cloudflare control plane used by all Cloudflare customers.
  • “Fail-Open” Error Handling: As part of the resilience effort, we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios. This will include drift-prevention capabilities to ensure this is enforced continuously.

These kinds of incidents, and how closely they are clustered together, are not acceptable for a network like ours.”
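
The “fail-open” behavior described above can be sketched roughly as follows – a hypothetical Go example with an invented scoringConfig type and cap, not Cloudflare’s code: if a new configuration file is corrupt or out of range, the loader logs the problem and keeps serving with the last known-good version, rather than dropping requests.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// scoringConfig is a stand-in for a data-plane configuration file; the
// field name and its cap below are invented for this example.
type scoringConfig struct {
	MaxFeatures int `json:"max_features"`
}

// lastKnownGood is the configuration the data plane is currently serving with.
var lastKnownGood = scoringConfig{MaxFeatures: 200}

// loadConfig applies a new config only if it parses and passes sanity checks.
// On any failure it logs the problem and keeps the last known-good config
// ("fail open"), instead of hard-failing and dropping requests.
func loadConfig(raw []byte) scoringConfig {
	var next scoringConfig
	if err := json.Unmarshal(raw, &next); err != nil {
		log.Printf("config corrupt, keeping last known-good: %v", err)
		return lastKnownGood
	}
	if next.MaxFeatures <= 0 || next.MaxFeatures > 10000 {
		log.Printf("config out of range (%d), keeping last known-good", next.MaxFeatures)
		return lastKnownGood
	}
	lastKnownGood = next
	return next
}

func main() {
	fmt.Println(loadConfig([]byte(`{"max_features": 300}`))) // valid: applied
	fmt.Println(loadConfig([]byte(`{"max_features": -1}`)))  // invalid: old config kept
}
```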

Global configuration errors often trigger large outages

There’s a pattern of implicit or explicit global configuration errors causing large outages, and some of the biggest ones in recent years were caused by a single change being rolled out to a whole network of machines:

  • DNS and DNS-related systems like BGP: DNS changes are global by default, so it’s no wonder they can cause global outages. Meta’s 7-hour outage in 2021 was related to DNS changes (more specifically, Border Gateway Protocol changes). Meanwhile, the AWS outage in October started with the internal DNS system.
  • OS updates happening at the same time, globally: Datadog’s 2023 outage cost the company $5M and was caused by Datadog’s Ubuntu machines executing an OS update within the same time window, globally. It caused issues with networking, and it didn’t help that Datadog ran its infra on 3 different cloud providers across 3 networks. The same kind of Ubuntu update also caused a global outage for Heroku in 2024.
  • Globally replicating configs: in 2024, a configuration policy change was rolled out globally and crashed every Spanner database node straight away. As Google concluded in its postmortem: “Given the global nature of quota management, this metadata was replicated globally within seconds”.

Step 2 – replicating a configuration file globally across GCP – caused a global outage in 2024

Implementing gradual rollouts for all configuration files is a lot of work. It’s also invisible labor: when done well, its benefits are undetectable and show up only as an absence of incidents, thanks to better infrastructure.

The largest systems in the world will likely have to implement safer ways to roll out configs – but not everybody needs to. Staged configuration rollout doesn’t make much sense for smaller companies and products because this infra work slows down product development.

Staged rollouts don’t just slow down building: they add friction to every deployment, by design. As such, they are only worth it when the stability of a mature system matters more than fast iteration.

Software engineering is a field where tradeoffs are a fact of life and universal solutions don’t exist. An approach that worked for a system with 1/100th of today’s load and users a year ago may not make sense anymore.

This was one out of the four topics covered in this week’s The Pulse. The full edition additionally covers:

  1. Industry Pulse. Poor capacity planning at AWS, Meta moves to a “closed AI” approach, a looming RAM shortage, early-stage startups hiring slower than before, how long it takes to earn $600K at Amazon and Meta, Apple loses execs to Meta, and more
  2. How the engineering team at Oxide uses LLMs. They find LLMs great for reading documents and lightweight research, mixed for coding and code review, and a poor choice for writing documents – or any kind of writing, really!
  3. Linux officially supports Rust in the kernel. Rust is now a first-class language inside the Linux kernel, eight months after a Linux Foundation Fellow predicted more support for Rust. A summary of the pros and cons of Rust support for Linux

Read the full The Pulse issue.