Thursday, June 4, 2026

On November 18, 2025, a huge slice of the internet fell over.
If you opened ChatGPT, X (Twitter), League of Legends, Shopify, Coinbase, or countless smaller sites, you were greeted with a Cloudflare-branded 5xx error page—or the sites just wouldn’t load at all. What looked at first like yet another big “the internet is broken” moment turned out to be something more subtle and, in some ways, more worrying: a self-inflicted bug deep inside Cloudflare’s own infrastructure.

Below is a detailed walkthrough of what happened in yesterday’s Cloudflare outage (18 November 2025), why it happened, who it affected, and what lessons infrastructure teams should take away from it.


 


What actually happened yesterday?

On Tuesday, November 18, 2025, in the late morning UTC, Cloudflare began returning large volumes of HTTP 5xx server errors for traffic that passed through its network. For end users, that meant “Internal Server Error” or “Gateway Error” pages when trying to access many popular websites and apps.

According to Cloudflare’s own post-incident blog, the outage:

  • Started impacting customer HTTP traffic at 11:28 UTC

  • Saw widespread 5xx errors across core CDN and security services

  • Had major mitigation steps around 13:05–14:30 UTC

  • Returned 5xx error volume to baseline by 17:06 UTC (The Cloudflare Blog)

Cloudflare itself described it as its worst outage since 2019, because it didn’t just affect one feature or dashboard – it disrupted the core proxy layer that routes the majority of customer traffic through its network (The Cloudflare Blog).

Third-party monitoring backed this up. Cisco ThousandEyes saw a global outage affecting Cloudflare, with timeouts and 5xx errors on services like X, OpenAI (ChatGPT), and Anthropic, while network paths themselves looked healthy. That pointed strongly to a backend service failure, not an ISP-level or routing issue (ThousandEyes).

 


Who was affected?

Because Cloudflare sits in front of a massive portion of the internet (around 20% of the web’s sites rely on Cloudflare for performance and security), the blast radius was enormous (AP News).

Among the services reported as impacted:

  • ChatGPT / OpenAI

  • X (formerly Twitter)

  • Canva, Shopify, Dropbox, Coinbase

  • League of Legends and other gaming platforms

  • Various public transit and government sites, including New Jersey Transit and France’s SNCF railway digital systems (AP News)

Outage trackers like Downdetector recorded thousands of concurrent issue reports at the peak. Reuters reported about 5,000 affected users for X alone at one point, before counts declined as fixes rolled out (Reuters).

From a user’s perspective, this manifested as:

  • Sites not loading at all

  • Login flows hanging or failing (especially where Cloudflare Access or Turnstile were involved)

  • APIs responding intermittently or with 5xx errors

  • Dashboards and admin panels timing out

In other words: huge parts of the internet “felt down”, even though the root cause was concentrated in a single provider’s internal systems.

 


How Cloudflare normally works (in simple terms)

To understand why this outage was so severe, it helps to know the rough path of a request through Cloudflare’s network.

Cloudflare acts as a reverse proxy CDN and security layer:

  1. Your browser or app connects to Cloudflare instead of directly to the origin site.

  2. Cloudflare terminates TLS and HTTP at its edge.

  3. Requests flow into Cloudflare’s core proxy system, called FL (“Frontline”) and its newer generation FL2.

  4. That core proxy:

    • Applies WAF (web application firewall) rules

    • Runs Bot Management models

    • Handles DDoS protection, caching, egress to origin

    • Routes traffic to other internal products like Workers, R2, Access, etc. (The Cloudflare Blog)

In normal operation this architecture is highly resilient: if one data center has a problem, traffic is routed through others; configuration changes are rolled out carefully; individual features should fail in contained ways.

Yesterday’s outage was so bad precisely because the failure was inside the common proxy path itself, and it was tightly coupled with a configuration file that gets pushed worldwide frequently and automatically.

 

 


The root cause: a bot-management feature file gone rogue

Cloudflare’s official explanation points to one key culprit: a feature configuration file used by their Bot Management system (The Cloudflare Blog).

Here’s the chain of events in plain language:

  1. Bot Management uses a “feature file”

    • Cloudflare’s bot-detection model relies on a set of “features” – signals about each request used to decide if it’s human or a bot.

    • These features are bundled into a configuration file that is regenerated every few minutes and rolled out globally, so Cloudflare can adapt quickly to new attack patterns (The Cloudflare Blog).

  2. A change in ClickHouse query behavior

    • The feature file is generated by queries against a ClickHouse database.

    • Cloudflare made a change around 11:05 UTC to improve security and permissions for distributed queries, allowing users to see metadata not just from the default database but also from the underlying tables in the r0 database (The Cloudflare Blog).

    • The query that builds the feature list didn’t filter by database name; suddenly it started getting duplicate columns from both default and r0, effectively doubling the number of feature rows.

  3. The feature file exploded in size

    • The Bot Management module has a hard limit on how many features it will accept (set to 200, well above the ~60 normally in use).

    • When the newly generated file exceeded that limit, the module hit the cap and panicked, due to an unhandled error in Rust code that used Result::unwrap() on an error value (The Cloudflare Blog).

  4. Core proxy services started returning 5xx errors

    • Because Bot Management is integrated into the core proxy path, the panic surfaced as HTTP 5xx responses for any traffic that depended on that module.

    • On the new FL2 engine, customers saw explicit 5xx errors.

    • On the older FL engine, bot scores silently went to zero, which could cause false positives in bot-blocking rules (The Cloudflare Blog).

  5. The really nasty part: the file kept flipping between “good” and “bad”

    • The ClickHouse cluster was being gradually updated, and the feature file was regenerated every five minutes.

    • Sometimes the query ran on updated nodes (producing a bad file), sometimes on non-updated nodes (producing a good file).

    • That meant for a while, Cloudflare’s network oscillated between normal operation and failure as different versions of the file were propagated (The Cloudflare Blog).
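
The chain above can be reduced to a small model. The sketch below is illustrative only and is not Cloudflare's code: the real module is written in Rust, and here a raised Python exception stands in for the panic. The names ColumnRow, metadata_rows, build_feature_file, load_strict, and simulate are hypothetical, and the feature count of 120 is made up so that doubling crosses the 200 cap mentioned in the article.

```python
from dataclasses import dataclass

FEATURE_LIMIT = 200  # hard cap quoted in the article


@dataclass(frozen=True)
class ColumnRow:
    database: str  # "default" or "r0"
    name: str      # feature (column) name


def metadata_rows(names, node_updated):
    """Rows a metadata query sees on one node (hypothetical model)."""
    rows = [ColumnRow("default", n) for n in names]
    if node_updated:
        # After the permissions change, r0 metadata becomes visible as well.
        rows += [ColumnRow("r0", n) for n in names]
    return rows


def build_feature_file(rows, database=None):
    """Collect feature names; filter by database only if one is given."""
    if database is not None:
        rows = [r for r in rows if r.database == database]
    return [r.name for r in rows]


def load_strict(features, limit=FEATURE_LIMIT):
    """Reject an oversized file outright, like the unhandled error path."""
    if len(features) > limit:
        raise RuntimeError(f"{len(features)} features exceeds limit {limit}")
    return features


def simulate(names, nodes_updated, database=None):
    """One regeneration cycle per entry; returns 'ok' or 'fail' per cycle."""
    outcomes = []
    for updated in nodes_updated:
        features = build_feature_file(metadata_rows(names, updated), database)
        try:
            load_strict(features)
            outcomes.append("ok")
        except RuntimeError:
            outcomes.append("fail")
    return outcomes


# Made-up feature count, chosen so that doubling crosses the 200 cap.
NAMES = [f"feature_{i}" for i in range(120)]
# Each cycle happens to hit an updated (True) or non-updated (False) node.
CYCLES = [False, True, False, True, True]

if __name__ == "__main__":
    print(simulate(NAMES, CYCLES))                      # unfiltered query
    print(simulate(NAMES, CYCLES, database="default"))  # filtered query
```

With the unfiltered query, the outcome flips between success and failure depending on which node served the cycle, which mirrors the oscillation described above. Passing database="default" removes the duplicates, so every cycle loads cleanly.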

This oscillation made the situation extremely confusing internally. At first, Cloudflare’s teams suspected a massive DDoS attack because the error pattern didn’t look like a simple software crash. Even the Cloudflare status page, which is hosted outside their own infrastructure, briefly showed errors – a coincidence that further fueled the suspicion of an external attack (The Cloudflare Blog).

Only once they realized the common factor was the bot feature file did the picture become clear.

 

 


Timeline of the incident

Based on Cloudflare’s postmortem and third-party reports, we can piece together a rough timeline for November 18, 2025 (The Cloudflare Blog, ThousandEyes):

  • 11:05 UTC – A database access control change is deployed in ClickHouse.

  • 11:20–11:30 UTC – Bad versions of the Bot Management feature file begin being generated and propagated.

  • 11:28 UTC – First customer impact: elevated HTTP 5xx errors seen on customer traffic.

  • 11:30–11:32 UTC – External monitoring tools and automated tests start detecting intermittent failures.

  • 11:35 UTC – Cloudflare opens an internal incident call; investigation begins.

  • ~11:48 UTC – Cloudflare publishes a status update confirming an incident (Resend).

  • 11:30–13:05 UTC – Teams focus on what appears to be degraded Workers KV behavior and investigate multiple possible causes (including attack scenarios).

  • 13:05 UTC – Key mitigation: Workers KV and Cloudflare Access are shifted to bypass the core proxy; impact is reduced (The Cloudflare Blog).

  • 14:30 UTC – Root cause identified; generation and propagation of bad feature files are stopped. A known-good configuration file is manually inserted and the core proxy is restarted. Most core traffic returns to normal (The Cloudflare Blog).

  • 14:40–15:30 UTC – Dashboard and login issues linger as Turnstile and a backlog of authentication attempts create secondary load spikes (The Cloudflare Blog).

  • 17:06 UTC – Error rates return to baseline; Cloudflare declares systems fully normal (The Cloudflare Blog).

From a user’s point of view, the outage felt worst in the late morning to early afternoon UTC, though exact impact windows varied by region and by which Cloudflare products each service depended on.


Why this outage matters so much

Centralization risk

Cloudflare is part of a small set of central internet infrastructure providers, alongside the major cloud platforms (AWS, Azure, GCP) and other large CDNs. When one of these players fails, the impact is wide and often non-obvious.

This outage:

  • Didn’t come from a BGP routing mishap or an ISP cable cut.

  • Didn’t come from a malicious attack (despite initial suspicions).

  • Came from a single configuration and limits bug in an internal component.

That’s important because it shows how complex, tightly coupled systems can fail catastrophically even without external interference. When many organizations build on the same provider, that provider becomes a de facto systemically important piece of the internet.

“Soft” dependencies hurt too

Some of the affected services weren’t just using Cloudflare as a dumb CDN. They were:

  • Using Cloudflare Access for authentication and zero-trust access.

  • Using Workers KV as part of internal control planes.

  • Relying on Turnstile for bot-resistant logins (The Cloudflare Blog).

When those products failed, it wasn’t just website content that went down – logins, admin functions, and internal APIs broke as well. That makes recovery more complex: your status page, incident tooling, or admin UI might also rely on the very provider that just failed.

 

 


What Cloudflare says it will change

Cloudflare’s blog outlines several remediation steps the company is already taking to reduce the risk of anything similar recurring (The Cloudflare Blog):

  1. Harden ingestion of auto-generated configuration files
    Treat internally generated configs with the same skepticism and validation as user-supplied input, including strict schema and size checking before rollout.

  2. More global kill switches
    Make it easier to quickly disable problematic internal modules (like Bot Management) across the network, so they fail open instead of panicking the entire proxy path.

  3. Protect system resources from error storms
    Ensure that core dumps, debug metadata, and observability tooling cannot overwhelm CPU and memory when errors start to spike.

  4. Review failure modes across core proxy modules
    Systematically audit how each internal module behaves under unexpected input or configuration, and ensure graceful degradation instead of global failure.

  5. Refine rollouts and isolation
    While not spelled out in huge detail, the incident suggests Cloudflare will likely further segment how new configs and DB behaviors propagate, to reduce the chance that a single bad change affects the entire fleet.

They also framed the incident as an absolute failure of their resiliency expectations, calling it “unacceptable” and explicitly acknowledging the pain it caused both customers and ordinary internet users (The Cloudflare Blog).


Lessons for infrastructure & SRE teams

Even if you’re not running something as huge as Cloudflare, there are some very practical design and operational lessons in this outage:

Treat internal config like untrusted input

It’s easy to assume that “our own” generated configuration is always correct. Yesterday’s outage shows why that’s dangerous:

  • Always validate size, shape, and limits of configuration files before applying them.

  • Consider canary application of config to a small subset of traffic or nodes first, with automated rollback on anomalies.

  • Keep strict upper bounds and circuit breakers around feature counts, memory preallocation, and CPU usage.
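
A minimal sketch of that kind of pre-apply validation follows. It assumes, purely for illustration, that the feature file is a JSON object with a "features" list; the article does not describe the real file format. The names validate_feature_file, apply_if_valid, ConfigRejected, and the 64 KiB size budget are hypothetical.

```python
import json

MAX_BYTES = 64 * 1024   # hypothetical size budget
MAX_FEATURES = 200      # cap quoted in the article


class ConfigRejected(ValueError):
    """Raised when a generated file fails validation."""


def validate_feature_file(raw, max_bytes=MAX_BYTES, max_features=MAX_FEATURES):
    """Return the feature list, or raise ConfigRejected."""
    if len(raw) > max_bytes:
        raise ConfigRejected(f"file is {len(raw)} bytes, budget is {max_bytes}")
    try:
        doc = json.loads(raw)
    except ValueError as exc:  # covers JSONDecodeError and bad encodings
        raise ConfigRejected(f"not valid JSON: {exc}") from exc
    features = doc.get("features") if isinstance(doc, dict) else None
    if not isinstance(features, list):
        raise ConfigRejected("expected an object with a 'features' list")
    if not all(isinstance(f, str) and f for f in features):
        raise ConfigRejected("feature names must be non-empty strings")
    if len(features) > max_features:
        raise ConfigRejected(f"{len(features)} features, limit is {max_features}")
    if len(set(features)) != len(features):
        raise ConfigRejected("duplicate feature names")
    return features


def apply_if_valid(raw, current):
    """Return (features, accepted); keep `current` if the new file is rejected."""
    try:
        return validate_feature_file(raw), True
    except ConfigRejected:
        return current, False
```

The design choice worth noting is in apply_if_valid: a rejected file leaves the last known-good configuration in place instead of taking the consumer down with it.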

Design for graceful partial failure

One bug in the Bot Management module should not be able to panic the entire proxy path:

  • Prefer failing open over failing closed in some security layers when the alternative is a complete outage.

  • Build clear, tested kill switches for non-core features.

  • Ensure critical sub-systems (auth, status page, incident tooling) can operate in degraded mode or via alternate routes.
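
One way to picture a kill switch combined with fail-open behavior is the sketch below. It is a toy model, not Cloudflare's design: GuardedModule, handle, and the block_below threshold are hypothetical names and values. The fallback is deliberately None ("no score available") instead of zero, since the article notes that zeroed bot scores caused false positives.

```python
class GuardedModule:
    """Wraps a non-core module with a kill switch and a fallback value."""

    def __init__(self, name, func, fallback):
        self.name = name
        self.func = func
        self.fallback = fallback
        self.enabled = True   # flip to False to disable the module everywhere
        self.failures = 0     # count of swallowed errors, for alerting

    def __call__(self, request):
        if not self.enabled:
            return self.fallback
        try:
            return self.func(request)
        except Exception:
            # Degrade instead of failing the whole request.
            self.failures += 1
            return self.fallback


def handle(request, bot_module, block_below=30):
    """Hypothetical request handler: only blocks when a real score exists."""
    score = bot_module(request)
    if score is not None and score < block_below:
        return 403
    return 200
```

If the wrapped module raises, the request is still served and the failure counter gives operators something to alert on. Setting enabled to False acts as the manual kill switch.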

Observe the right signals

The oscillation between “good config” and “bad config” every five minutes made the signal look like attack traffic or noisy external behavior:

  • Make sure you have per-version or per-config correlation in your observability pipeline.

  • Build dashboards that make configuration changes visually obvious on top of error graphs.

  • Include strong synthetic tests from an external vantage point, so you can quickly distinguish internal failure from network/path issues.
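
Per-config correlation can be as simple as tagging every response with the config version that served it and grouping error rates by that tag. The sketch below assumes such a tag exists in your logs; error_rate_by_version, suspect_versions, and the event shape (version, status) are hypothetical.

```python
from collections import defaultdict


def error_rate_by_version(events):
    """events: iterable of (config_version, http_status) pairs."""
    counts = defaultdict(lambda: [0, 0])  # version -> [total, errors]
    for version, status in events:
        counts[version][0] += 1
        if 500 <= status <= 599:
            counts[version][1] += 1
    return {v: errors / total for v, (total, errors) in counts.items()}


def suspect_versions(rates, threshold):
    """Versions whose 5xx rate is at or above the threshold, sorted by name."""
    return sorted(v for v, rate in rates.items() if rate >= threshold)
```

When errors cluster on one config version while another version stays clean, an oscillating error graph stops looking like attack traffic and starts looking like a bad rollout.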

Don’t put all your eggs in one infra basket

For organizations using Cloudflare:

  • Consider multi-CDN setups for truly mission-critical properties.

  • Avoid making your status page entirely dependent on the same provider as your primary stack. Cloudflare already hosts its status page outside its own infrastructure, but coincidental trouble with that status page host yesterday confused things further (The Cloudflare Blog).

  • Think twice before tightly coupling your authentication, API control planes, and frontend delivery to the same vendor without fallback paths.
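
The decision logic behind a fallback path can be sketched in a few lines. In practice the switch between providers is often made at the DNS or load-balancer layer, so treat this only as an illustration of the rule "try the next provider on a connection error or a 5xx". fetch_with_fallback and the (name, fetch) provider pairs are hypothetical, and each fetch callable is assumed to return a (status, body) tuple.

```python
def fetch_with_fallback(providers, path):
    """Try each (name, fetch) pair in order; return the first non-5xx answer."""
    failures = []
    for name, fetch in providers:
        try:
            status, body = fetch(path)
        except OSError as exc:  # connection errors and timeouts
            failures.append(f"{name}: {exc}")
            continue
        if status < 500:
            return name, status, body
        failures.append(f"{name}: HTTP {status}")
    raise RuntimeError("all providers failed: " + "; ".join(failures))
```

Returning the provider name alongside the response makes it easy to log how often the fallback path is actually exercised.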


The bigger picture

In the last few months alone, we’ve seen major outages at Microsoft Azure, Amazon Web Services, and now Cloudflare, all of which have temporarily knocked large chunks of consumer and enterprise services offline (AP News, The Washington Post).

The pattern is clear:

  • The internet is increasingly dependent on a handful of giant infrastructure providers.

  • Outages are often self-inflicted, coming from complex internal changes rather than external attacks.

  • Even providers with world-class SRE practices can still be tripped up by unexpected interactions between configuration, database behavior, and hard-coded limits.

Yesterday’s Cloudflare incident is a stark reminder that “the cloud” isn’t magic. At the bottom, it’s still software written by humans, subject to the same classes of bugs as any other application—just with orders of magnitude more people depending on it.

For users, the incident will mostly be remembered as “that morning when X and ChatGPT wouldn’t load.”
For engineers, it will likely be studied as a textbook example of how subtle configuration bugs in a core distributed system can ripple out into a global internet event.
