

Cloudflare Bot Management Outage: Technical Root Cause Analysis and Impact of the November 18, 2025 Service Disruption

  • Rescana

Executive Summary

Publication Date: November 18, 2025

On November 18, 2025, Cloudflare experienced a significant global service disruption beginning at 11:20 UTC, resulting in widespread HTTP 5xx errors and failures across core network services. The incident was not caused by a cyber attack or malicious activity, but rather by an internal change to database permissions that led to the propagation of a malformed configuration file used by the Bot Management system. This report provides a detailed account of the incident, its technical root cause, the impact on services and customers, the response and recovery process, business implications, and key lessons learned.

Incident Timeline

At 11:05 UTC, a database access control change was deployed within Cloudflare’s infrastructure. By 11:20 UTC, the network began experiencing significant failures, with users encountering error pages indicating an internal network issue. The change reached customer environments at 11:28 UTC, when the first errors were observed on customer HTTP traffic.

Between 11:32 and 13:05 UTC, the Cloudflare team investigated elevated error rates, initially suspecting a hyper-scale DDoS attack due to fluctuating system behavior and the coincidental unavailability of the status page. At 13:05 UTC, mitigations were implemented by bypassing the core proxy for Workers KV and Cloudflare Access, reducing the impact. By 13:37 UTC, efforts focused on rolling back the Bot Management configuration file to a last-known-good version.

At 14:24 UTC, the creation and propagation of new Bot Management configuration files were halted, and a successful test of the restored file was completed. By 14:30 UTC, the correct configuration file was deployed globally, and most services began to recover. Full restoration of all services was achieved by 17:06 UTC.

Technical Root Cause

The root cause of the outage was a change to the permissions of one of Cloudflare’s database systems, specifically a ClickHouse cluster. The change was intended to improve distributed query security and reliability by making access to underlying tables explicit. However, it also made those underlying tables visible to the query that generates a “feature file” used by the Bot Management system, so the query returned duplicate rows, one set from the “default” database and one from the underlying “r0” database, effectively doubling the file’s size.
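The duplication can be sketched with a pair of metadata queries. The snippet below is an assumption-laden illustration built around ClickHouse's standard system.columns table; the table name and exact query shape are illustrative, not the statements Cloudflare ran.

```rust
// Illustrative only: shows why an unqualified metadata query can start
// returning duplicate feature rows once a second database ("r0") becomes
// visible to the querying account. Identifiers here are assumptions.

// Before the permissions change, the account effectively saw only the
// "default" database, so filtering by table name alone returned one row
// per column:
const FEATURE_QUERY_OLD: &str = r#"
    SELECT name, type
    FROM system.columns
    WHERE table = 'http_requests_features'
    ORDER BY name
"#;

// After the change, the same table is visible under both "default" and "r0",
// so every column comes back twice and the generated feature file roughly
// doubles in size. Constraining the query to one database removes the
// duplicates:
const FEATURE_QUERY_FIXED: &str = r#"
    SELECT name, type
    FROM system.columns
    WHERE database = 'default'
      AND table = 'http_requests_features'
    ORDER BY name
"#;

fn main() {
    // In the real pipeline the query result feeds a feature-file generator;
    // printing the two variants is enough to make the difference explicit.
    println!("unfiltered query:\n{}", FEATURE_QUERY_OLD);
    println!("database-scoped query:\n{}", FEATURE_QUERY_FIXED);
}
```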

The Bot Management system relies on this feature file to update its machine learning model, which generates bot scores for every request. The software running on Cloudflare’s network machines had a hardcoded limit on the size of the feature file, set to 200 features, while normal operation used approximately 60 features. The malformed file exceeded this limit, causing the software to panic and fail, resulting in HTTP 5xx errors.
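The failure mode can be sketched as follows. This is a hedged illustration of a fixed-capacity loader that panics on an oversized file, not Cloudflare's actual proxy code; only the 200-feature limit and the roughly 60-feature baseline come from the incident report.

```rust
// Sketch of the failure mode: a fixed-capacity feature table whose loader
// panics when the input exceeds the hardcoded limit. The limit (200) and the
// typical size (~60) mirror the published numbers; the code is illustrative.

const MAX_FEATURES: usize = 200; // hardcoded limit, sized well above normal use

#[derive(Debug)]
struct FeatureFileTooLarge {
    got: usize,
    max: usize,
}

fn load_features(lines: &[String]) -> Result<Vec<String>, FeatureFileTooLarge> {
    if lines.len() > MAX_FEATURES {
        return Err(FeatureFileTooLarge { got: lines.len(), max: MAX_FEATURES });
    }
    Ok(lines.to_vec())
}

fn main() {
    // A normal feature file carries roughly 60 entries and loads cleanly...
    let normal: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();
    assert!(load_features(&normal).is_ok());

    // ...but the malformed file contained duplicate rows for every column,
    // pushing the count past the limit (240 here is illustrative). Treating
    // that error as unrecoverable is what turns a bad config file into a
    // process-wide panic and, downstream, HTTP 5xx errors.
    let malformed: Vec<String> = (0..240).map(|i| format!("feature_{i}")).collect();
    let _features = load_features(&malformed).unwrap(); // panics
}
```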

The issue was exacerbated by the feature file being regenerated every five minutes. Depending on which part of the ClickHouse cluster the query ran against, either a good or a bad configuration file was generated and rapidly propagated, causing the network to cycle between recovery and failure. Eventually, all nodes produced the bad file, stabilizing the system in a failing state until the underlying issue was identified and resolved.
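A rough simulation of that flapping behavior is sketched below, under the assumption that each five-minute regeneration is served by whichever cluster node handles the query and that nodes received the permissions change gradually; the node counts and feature counts are illustrative.

```rust
// Illustrative simulation of the good/bad flapping: each cycle the feature
// file is regenerated by whichever node serves the query, and only nodes that
// have already received the permissions change emit the oversized file.

const MAX_FEATURES: usize = 200;

fn regenerate(node_updated: bool) -> usize {
    // Updated nodes see duplicate rows and emit an oversized file;
    // not-yet-updated nodes still emit a normal-sized one.
    if node_updated { 260 } else { 60 }
}

fn main() {
    // Suppose 3 of 8 nodes have the permissions change at first.
    let mut updated_nodes = 3;
    let total_nodes = 8;

    for cycle in 0..6 {
        // Crude stand-in for "which node happened to run the query this cycle".
        let node_updated = (cycle * 5) % total_nodes < updated_nodes;
        let features = regenerate(node_updated);
        let status = if features > MAX_FEATURES { "FAIL (panic, 5xx)" } else { "OK" };
        println!("cycle {cycle}: {features:3} features -> {status}");

        // The rollout continues in the background; once every node is updated,
        // every cycle produces the bad file and the system stays down.
        updated_nodes = (updated_nodes + 1).min(total_nodes);
    }
}
```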

Service Impact Analysis

The outage affected multiple Cloudflare services and products. The Core CDN and security services returned HTTP 5xx status codes, preventing end users from accessing customer sites. Turnstile failed to load, impacting authentication and login flows. Workers KV experienced significantly elevated HTTP 5xx errors as requests to its gateway failed. The Cloudflare Dashboard was mostly operational, but most users were unable to log in due to Turnstile unavailability. Email Security saw a temporary loss of access to an IP reputation source, reducing spam-detection accuracy, though no critical customer impact was observed. Access experienced widespread authentication failures, with all failed attempts resulting in error pages and configuration updates either failing or propagating slowly.

Additionally, the Cloudflare CDN experienced increased response latency due to high CPU consumption by debugging and observability systems. The impact was observed globally, affecting the majority of core traffic and resulting in the most severe outage since 2019.

Customer Impact

Customers experienced widespread service disruptions, including inability to access websites protected by Cloudflare, failed authentication and login attempts, and degraded performance of key services such as Workers KV and Turnstile. For customers using Bot Management rules, all traffic received a bot score of zero, leading to large numbers of false positives and potential blocking of legitimate users. Customers not using bot scores in their rules were less affected. The outage also impacted downstream services and applications relying on Cloudflare’s infrastructure, with HTTP 5xx errors and increased latency observed throughout the incident window.
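To make the false-positive mechanism concrete, the sketch below models a typical "block requests with a low bot score" rule. The threshold, field names, and score range are illustrative assumptions rather than a specific customer configuration or Cloudflare's rule syntax.

```rust
// Illustrative model of a "block low bot scores" rule. During the incident
// the Bot Management module emitted a score of 0 for every request, so any
// rule of this shape matched (and blocked) legitimate traffic as well.

struct Request {
    bot_score: u8, // low scores indicate likely automation; 0 stands in for the failed module
}

fn should_block(req: &Request, threshold: u8) -> bool {
    // Equivalent in spirit to a rule like "bot score below 30 => block".
    req.bot_score < threshold
}

fn main() {
    let threshold = 30;

    let human = Request { bot_score: 85 };        // normal day: allowed
    let scripted = Request { bot_score: 2 };      // normal day: blocked
    let during_outage = Request { bot_score: 0 }; // every request during the incident

    assert!(!should_block(&human, threshold));
    assert!(should_block(&scripted, threshold));
    assert!(should_block(&during_outage, threshold)); // false positive for everyone
}
```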

Response and Recovery

Upon detection of the incident at 11:31 UTC, the Cloudflare team initiated manual investigation and incident response procedures. Initial efforts focused on mitigating what was believed to be a DDoS attack, including traffic manipulation and account limiting. As the investigation progressed, internal system bypasses were implemented for Workers KV and Cloudflare Access at 13:05 UTC, reducing the error rate.

By 13:37 UTC, the team concentrated on rolling back the Bot Management configuration file. At 14:24 UTC, the propagation of new configuration files was stopped, and a known good file was manually inserted into the distribution queue. The core proxy was restarted, and by 14:30 UTC, most services began to recover. The team continued to monitor and restart affected services, with full restoration achieved by 17:06 UTC.

Business Impact

The outage had a significant business impact on Cloudflare and its customers. The disruption of core CDN and security services affected a substantial portion of global Internet traffic, undermining customer trust and reliability perceptions. The incident also highlighted the critical dependency of many organizations on Cloudflare’s infrastructure for web performance, security, and authentication. While no data loss or security breach occurred, the operational and reputational impact was considerable, prompting a public apology from the Cloudflare team and a commitment to further harden systems against similar failures.

Lessons Learned

Cloudflare identified several key lessons from the incident. The ingestion of internally generated configuration files must be hardened to the same standards as user-generated input. More global kill switches for features are needed to quickly mitigate the impact of faulty configurations. System resource management must be improved to prevent debugging and error reporting from overwhelming critical services. Failure modes for error conditions across all core proxy modules require comprehensive review. The incident underscored the importance of robust testing, monitoring, and rollback mechanisms for configuration changes, especially in distributed environments.
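As one way to picture the kill-switch and hardened-ingestion lessons, the sketch below shows a fail-open path in which an invalid feature file disables bot scoring for a refresh cycle instead of crashing the proxy. This is a hypothetical design, not Cloudflare's stated remediation.

```rust
// Hypothetical fail-open handling of a bad feature file, combining two of the
// lessons: validate internally generated config like untrusted input, and keep
// a kill switch that degrades the feature instead of the whole proxy.

const MAX_FEATURES: usize = 200;

enum BotScoring {
    Enabled(Vec<String>), // validated feature set
    Disabled,             // kill switch / fail-open: serve traffic without scores
}

fn ingest_feature_file(lines: Vec<String>, kill_switch: bool) -> BotScoring {
    if kill_switch {
        return BotScoring::Disabled;
    }
    // Treat the internally generated file with the same suspicion as user input.
    if lines.is_empty() || lines.len() > MAX_FEATURES {
        eprintln!("feature file rejected ({} rows); keeping traffic flowing", lines.len());
        return BotScoring::Disabled;
    }
    BotScoring::Enabled(lines)
}

fn main() {
    let oversized: Vec<String> = (0..260).map(|i| format!("feature_{i}")).collect();

    // Instead of a panic that takes down request handling, the module degrades:
    match ingest_feature_file(oversized, false) {
        BotScoring::Enabled(f) => println!("scoring enabled with {} features", f.len()),
        BotScoring::Disabled => println!("scoring disabled; requests still served"),
    }
}
```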

References

Official incident report: https://blog.cloudflare.com/18-november-2025-outage/

About Rescana

Rescana is a leading Third-Party Risk Management (TPRM) platform, empowering organizations to proactively identify, assess, and mitigate risks across their digital supply chain. Our platform delivers continuous monitoring, actionable insights, and automated workflows to help businesses strengthen their security posture and ensure compliance. For more information or to discuss how Rescana can support your risk management needs, please contact us at ops@rescana.com.
