Post-Mortem: Anatomy of the Cloudflare Outage (November 18, 2025)


Executive Summary

On November 18, 2025, a significant portion of the internet went dark. Major platforms like ChatGPT, Spotify, Canva, and X (formerly Twitter) became inaccessible as Cloudflare, a central pillar of internet infrastructure, suffered a critical failure.

Contrary to initial speculation, this was not a massive DDoS attack or a zero-day exploit. It was a latent bug triggered by a routine database maintenance task. This article deconstructs the technical chain of events, the specific engineering failure in the Bot Management module, and the lessons developers can learn about fault isolation and safe deployment practices.


The Incident: What Happened?

At 11:20 UTC, Cloudflare’s core network traffic delivery began to fail. End-users across the globe started seeing HTTP 5xx errors and “Cloudflare Connection Failure” screens.

  • Affected Services: Cloudflare Dashboard, API, Workers KV, Access, Turnstile, and the core CDN/Proxy service.
  • Impact Scope: Global, with intermittent availability (flapping) observed during the initial hours.
  • Root Cause: A panic in the proxy service caused by a malformed (oversized) configuration file generated for the Bot Management system.

Technical Deep Dive: The Root Cause

The failure was a classic “perfect storm” involving three distinct layers: a database permission change, a configuration generation script, and a lack of graceful error handling in the core proxy.

1. The Trigger: Database Permission Change

Cloudflare engineers were performing a routine update on a ClickHouse database cluster used to generate “feature files” for the Bot Management system. These files contain rules and scores used to detect automated traffic.

The update involved changing permissions on the database. Crucially, this change caused a specific SQL query, the one responsible for fetching the bot detection rules, to return duplicate rows: in effect, the broader permissions made an additional copy of the underlying table metadata visible to the query, so each expected row came back twice.

2. The Propagation: The Bloated Feature File

The results of that SQL query feed a generation script that builds a binary configuration file, the "feature file" itself. Because of the duplicate rows returned by the database, the generated file doubled in size.

Normally, this file is propagated to thousands of servers at Cloudflare’s edge. The distribution mechanism worked as intended, pushing this new, larger file to the edge nodes.
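
To make this concrete, below is a minimal Rust sketch of a generator that simply serializes whatever rows the query returns. Every name and the file format here are hypothetical, not Cloudflare's actual pipeline; the point is that with no deduplication and no size sanity check before publishing, duplicate rows translate directly into an oversized file.

```rust
// Hypothetical sketch of a generator that trusts its input. None of these
// names come from Cloudflare's codebase; they only illustrate how duplicate
// rows translate directly into a larger output file.

#[derive(Clone)]
struct FeatureRow {
    name: String,
    score_expression: String,
}

/// Serialize query results into an on-disk feature file (here: plain lines).
/// Note what is missing: no deduplication and no sanity check on row count
/// or output size before the file is published to the edge.
fn build_feature_file(rows: &[FeatureRow]) -> Vec<u8> {
    let mut out = Vec::new();
    for row in rows {
        out.extend_from_slice(format!("{}={}\n", row.name, row.score_expression).as_bytes());
    }
    out
}

fn main() {
    let row = FeatureRow {
        name: "bot_score_heuristic_1".into(),
        score_expression: "ua_entropy * 0.7".into(),
    };

    // Normal case: one copy of each row.
    let good = build_feature_file(&[row.clone()]);

    // After the permission change: the same rows come back twice,
    // so the generated file silently doubles in size.
    let bad = build_feature_file(&[row.clone(), row]);

    println!("good: {} bytes, bad: {} bytes", good.len(), bad.len());
}
```

A sanity check at this stage (row count, output size, or a diff against the previous file) would have stopped the bad artifact before it ever left the control plane.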

3. The Crash: unwrap() on Error

This is where the “latent bug” triggered. Cloudflare’s proxy service (likely written in Rust, given their stack and the error signature) loads this feature file into the Bot Management module.

The code responsible for parsing this file had a hard-coded limit on the file size or buffer. When the file exceeded this limit, the parser returned an Err (Error) result.

Critically, the calling code handled this result using an equivalent of .unwrap(). In Rust, calling unwrap() on an error result causes the thread to panic (crash) immediately. Because this module runs within the critical path of the main proxy process, the entire proxy service crashed and restarted.
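
Reconstructed as a minimal Rust sketch, with invented names and an invented limit, the pattern looks like this; running it panics, which is exactly the failure mode described above.

```rust
// Hypothetical reconstruction of the failure mode. The constant, struct and
// function names are invented; only the pattern (a parser that can fail, and
// a caller that unwraps the result) reflects the incident.

const MAX_FEATURES: usize = 200; // illustrative stand-in for the hard-coded limit

struct BotConfig {
    features: Vec<String>,
}

fn parse_feature_file(data: &str) -> Result<BotConfig, String> {
    let features: Vec<String> = data.lines().map(str::to_owned).collect();
    if features.len() > MAX_FEATURES {
        // The parser itself behaves correctly: it reports the problem.
        return Err(format!("too many features: {}", features.len()));
    }
    Ok(BotConfig { features })
}

fn main() {
    // Simulate the oversized file produced after the duplicate rows appeared.
    let oversized = "some_feature\n".repeat(MAX_FEATURES * 2);

    // The equivalent of the latent bug: unwrap() turns a recoverable Err
    // into a panic, taking down the thread and, here, the whole process.
    let config = parse_feature_file(&oversized).unwrap();
    println!("loaded {} features", config.features.len());
}
```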

The Developer Takeaway: A crash in a non-critical subsystem (Bot Management configuration loading) should never bring down the entire critical path (Core Traffic Proxy). This highlights the importance of fault isolation.
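
A sketch of what that isolation could look like, again with hypothetical names rather than Cloudflare's actual code: a failed configuration load changes only the Bot Management module's state, and the proxy keeps serving traffic.

```rust
// Minimal fault-isolation sketch (hypothetical names): a rejected config
// leaves the module in its previous state instead of panicking the process.

enum BotModule {
    Active { features: Vec<String> },
    Disabled, // traffic still flows, just without bot scores
}

fn parse_feature_file(data: &str) -> Result<Vec<String>, String> {
    if data.is_empty() {
        return Err("empty feature file".into());
    }
    Ok(data.lines().map(str::to_owned).collect())
}

fn reload(current: BotModule, new_file: &str) -> BotModule {
    match parse_feature_file(new_file) {
        Ok(features) => BotModule::Active { features },
        Err(e) => {
            // Contain the failure: log it, keep the previous state, never panic.
            eprintln!("bot config rejected ({e}); keeping previous state");
            current
        }
    }
}

fn main() {
    let module = reload(BotModule::Disabled, "feature_a\nfeature_b");
    let module = reload(module, ""); // a bad file leaves the module untouched
    if let BotModule::Active { features } = &module {
        println!("bot scoring active with {} features", features.len());
    }
}
```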


The “Saw-Tooth” Pattern: Why it Flapped

One of the most confusing aspects of this outage for external observers was the “saw-tooth” pattern of availability. Services would go down, come back up for a few minutes, and then crash again.

Why this happened:

  1. Gradual Rollout: The ClickHouse database update was being applied gradually. Some database nodes had the new permissions (returning bad data), while others had the old permissions (returning good data).
  2. Periodic Generation: The configuration file was regenerated every 5 minutes.
    • Minute 0: The generator hits an updated DB node -> Creates Bad File -> Edge Proxies Crash.
    • Minute 5: The generator hits a non-updated DB node -> Creates Good File -> Proxies Recover.
    • Minute 10: Generator hits updated node -> Crash.

This created a loop of destruction that made diagnosing the issue incredibly difficult, as the system appeared to “fix itself” repeatedly.
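
A toy simulation makes the rhythm easy to see; the node layout and five-minute cadence below are illustrative only.

```rust
// Toy simulation of the saw-tooth pattern. Each generation cycle pulls from
// one database node, and only some nodes have the new permissions that
// produce a bad file.

fn main() {
    // true = node already updated (returns duplicate rows -> bad file)
    let nodes = [true, false, true, false, true, false];

    for (cycle, node_is_updated) in nodes.iter().enumerate() {
        let minute = cycle * 5; // the feature file was regenerated every 5 minutes
        if *node_is_updated {
            println!("minute {minute:>2}: bad file generated  -> edge proxies crash");
        } else {
            println!("minute {minute:>2}: good file generated -> edge proxies recover");
        }
    }
}
```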


Remediation and Fix

The incident response followed a high-pressure trajectory:

  1. Initial Misdiagnosis: Due to the global scale and sudden onset, the team initially suspected a massive DDoS attack (specifically referencing the “Aisuru” botnet).
  2. Identification: Engineers correlated the crashes with the specific Bot Management configuration updates.
  3. The Fix:
    • Step 1: Stop the propagation of the corrupted feature file.
    • Step 2: Manually inject a known “good” version of the file into the distribution queue.
    • Step 3: Force a restart of the core proxy services to clear the crash loops.

Timeline to Recovery:

  • 11:20 UTC: Incident begins.
  • 14:30 UTC: Core traffic stabilizes (primary fix applied).
  • 17:06 UTC: Full resolution of all downstream services (Dashboard, API, etc.).

Key Lessons for Developers

  1. Validate Inputs at the Edge: Never assume configuration files pushed from a central control plane are valid. Edge nodes should validate file size, checksums, and structure before attempting to load them into memory.
  2. Avoid unwrap() in Production: In languages like Rust, use match or if let to handle errors gracefully. If a configuration file fails to load, the system should fall back to the last known good configuration or disable that specific module rather than crash the application (see the sketch after this list, which combines this with the input validation from lesson 1).
  3. The “Blast Radius” of Config Changes: Database changes are code changes. A permission change in a reporting database (ClickHouse) ended up taking down the global edge network. Treat infrastructure-as-code changes with the same CI/CD rigor as application code.
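
Putting lessons 1 and 2 together, here is a minimal Rust sketch, with assumed limits and a stand-in checksum rather than any real Cloudflare code, of an edge loader that validates a pushed file before parsing it and falls back to the last known good configuration on any failure.

```rust
// Sketch of an edge-side config loader: validate first, parse second,
// and never let a bad update replace a working configuration.

const MAX_FILE_BYTES: usize = 1_000_000; // illustrative limit, not Cloudflare's

struct Config {
    features: Vec<String>,
}

fn checksum(data: &[u8]) -> u64 {
    // Stand-in for a real digest (e.g. SHA-256 via an external crate).
    data.iter().fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u64))
}

fn load_config(data: &[u8], expected_checksum: u64) -> Result<Config, String> {
    // 1. Validate before parsing: size and integrity.
    if data.len() > MAX_FILE_BYTES {
        return Err(format!("file too large: {} bytes", data.len()));
    }
    if checksum(data) != expected_checksum {
        return Err("checksum mismatch".into());
    }
    // 2. Parse only after validation has passed.
    let text = std::str::from_utf8(data).map_err(|e| e.to_string())?;
    Ok(Config { features: text.lines().map(str::to_owned).collect() })
}

fn apply_update(last_known_good: Config, data: &[u8], expected_checksum: u64) -> Config {
    // 3. Never unwrap: on any failure, keep serving with the previous config.
    match load_config(data, expected_checksum) {
        Ok(new_config) => new_config,
        Err(e) => {
            eprintln!("rejected config update ({e}); keeping last known good");
            last_known_good
        }
    }
}

fn main() {
    let good_bytes = b"feature_a\nfeature_b".to_vec();
    let good = apply_update(Config { features: vec![] }, &good_bytes, checksum(&good_bytes));

    // An oversized push is rejected and the previous configuration survives.
    let oversized = vec![b'x'; MAX_FILE_BYTES + 1];
    let after_bad_push = apply_update(good, &oversized, checksum(&oversized));

    println!("serving with {} features", after_bad_push.features.len());
}
```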