Cloudflare CEO Matthew Prince has admitted that the cause of the company’s massive Tuesday outage was a change to database permissions, and that Cloudflare initially thought the symptoms of that adjustment indicated it was the target of a “hyper-scale DDoS attack,” before figuring out the real problem.
Prince has penned a late Tuesday post that explains the incident was “triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a ‘feature file’ used by our Bot Management system.”
The file describes malicious bot activity and Cloudflare distributes it so the software that runs its routing infrastructure is aware of emerging threats.
Changing database permissions caused the feature file to double in size and grow beyond the file size limit Cloudflare imposes in its software. When that software encountered the oversized feature file, it failed.
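Prince’s post doesn’t include the proxy code, but the failure mode is easy to picture. Here is a minimal sketch in Rust – with an assumed entry limit, path, and one-entry-per-line file format, none of which come from Cloudflare – of a loader that fails hard when a feature file exceeds its cap instead of falling back to a last-known-good copy:

```rust
// Minimal sketch, not Cloudflare's code: a feature-file loader with a hard
// cap on the number of entries. The limit, path, and one-entry-per-line
// format are all assumptions for illustration.
use std::fs;
use std::io;

const MAX_FEATURES: usize = 200; // assumed limit for illustration

fn load_feature_file(path: &str) -> io::Result<Vec<String>> {
    let contents = fs::read_to_string(path)?;
    let features: Vec<String> = contents
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| line.to_string())
        .collect();

    if features.len() > MAX_FEATURES {
        // Failing hard here is the behaviour the post describes once the
        // oversized file propagated; a hardened loader would keep serving
        // the last known-good feature set instead.
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!(
                "feature file has {} entries, limit is {}",
                features.len(),
                MAX_FEATURES
            ),
        ));
    }
    Ok(features)
}

fn main() {
    match load_feature_file("features.conf") {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(err) => eprintln!("refusing to load feature file: {err}"),
    }
}
```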
And then it recovered – for a while – because when the incident started Cloudflare was only partway through updating permissions management on the ClickHouse database cluster it uses to generate new versions of the feature file. The permission change aimed to give users access to underlying data and metadata, but Cloudflare made mistakes in the query it used to retrieve that data, so it returned extra rows that more than doubled the size of the feature file.
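The post doesn’t reproduce the faulty query, but the bug class is a familiar one: a metadata query that assumes each row appears once, then silently returns duplicates when a permissions change makes more of the cluster visible. A hedged illustration, with invented database and table names, written as ClickHouse-style SQL held in Rust string constants:

```rust
// Illustration only: the database and table names are invented, not
// Cloudflare's schema. The point is that the unscoped query returns one row
// per column *per visible database*, so a permissions change that exposes a
// second database silently duplicates every row; the scoped variant stays
// pinned to a single database.

/// Unscoped metadata query: vulnerable to duplication.
const UNSCOPED_QUERY: &str = "\
SELECT name, type
FROM system.columns
WHERE table = 'bot_request_features'
ORDER BY name";

/// Scoped variant: result set is stable regardless of what else the
/// account is allowed to see.
const SCOPED_QUERY: &str = "\
SELECT name, type
FROM system.columns
WHERE database = 'default'
  AND table = 'bot_request_features'
ORDER BY name";

fn main() {
    println!("before:\n{UNSCOPED_QUERY}\n\nafter:\n{SCOPED_QUERY}");
}
```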
At the time of the incident, the cluster generated a new version of the file every five minutes.
“Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network,” Prince wrote.
For a couple of hours starting at around 11:20 UTC on Tuesday, Cloudflare’s services therefore experienced intermittent outages.
“This fluctuation made it unclear what was happening as the entire system would recover and then fail again as sometimes good, sometimes bad configuration files were distributed to our network,” Prince wrote. “Initially, this led us to believe this might be caused by an attack. Eventually, every ClickHouse node was generating the bad configuration file and the fluctuation stabilized in the failing state.”
That stabilization in the failing state happened a few minutes before 13:00 UTC, which is when the fun really started: Cloudflare customers began to experience persistent outages.
Cloudflare eventually figured out the source of the problem, stopped the generation and propagation of bad feature files, and manually inserted a known-good file into the feature file distribution queue. The company then forced a restart of its core proxy so its systems would read only good files.
That all took time, and caused downstream problems for other systems that depend on the proxy.
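As a rough sketch of that recovery sequence – assuming the feature files travel through an ordinary directory-style distribution queue; the paths and helper below are invented, not Cloudflare tooling – it amounts to draining the queue, pinning a known-good file, and restarting the proxy:

```rust
// Hypothetical recovery helper, not Cloudflare tooling: empty the feature
// file distribution queue, publish a known-good file as the only entry,
// and leave the proxy restart to the operator. All paths are invented.
use std::fs;
use std::io;
use std::path::Path;

fn pin_known_good(known_good: &Path, queue_dir: &Path) -> io::Result<()> {
    // 1. Drain anything already queued so no bad file can be picked up.
    for entry in fs::read_dir(queue_dir)? {
        fs::remove_file(entry?.path())?;
    }
    // 2. Publish the known-good file as the only item in the queue.
    fs::copy(known_good, queue_dir.join("features-pinned.conf"))?;
    // 3. In the incident, Cloudflare then force-restarted its core proxy so
    //    it would reload only this file; that step is left to the operator.
    Ok(())
}

fn main() -> io::Result<()> {
    pin_known_good(
        Path::new("/var/lib/botmgmt/features-known-good.conf"),
        Path::new("/var/lib/botmgmt/queue"),
    )
}
```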
Prince has apologized for the incident.
“An outage like today is unacceptable,” he said. “We’ve architected our systems to be highly resilient to failure to ensure traffic will always continue to flow. When we’ve had outages in the past it’s always led to us building new, more resilient systems.”
This time around the company plans to do four things:
- Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
- Enabling more global kill switches for features
- Eliminating the ability for core dumps or other error reports to overwhelm system resources
- Reviewing failure modes for error conditions across all core proxy modules
Prince ended his post with an apology “for the pain we caused the Internet today.” ®
