Cloudflare CEO Matthew Prince has admitted that the cause of the company’s massive Tuesday outage was a change to database permissions, and that Cloudflare initially thought the symptoms of that adjustment indicated it was the target of a “hyper-scale DDoS attack,” before figuring out the real problem.
Prince has penned a late Tuesday post that explains the incident was “triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a ‘feature file’ used by our Bot Management system.”
The file describes malicious bot activity and Cloudflare distributes it so the software that runs its routing infrastructure is aware of emerging threats.
Changing database permissions caused the feature file to double in size and grow beyond the file size limit Cloudflare imposes in its software. When that software encountered the oversized feature file, it failed.
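Prince’s post doesn’t include the proxy code, but the failure mode is easy to picture. Here is a minimal sketch in Rust – with an assumed entry limit, path, and one-entry-per-line file format, none of which come from Cloudflare – of a loader that fails hard when a feature file exceeds its cap instead of falling back to a last-known-good copy:

```rust
// Minimal sketch, not Cloudflare's code: a feature-file loader with a hard
// cap on the number of entries. The limit, path, and one-entry-per-line
// format are all assumptions for illustration.
use std::fs;
use std::io;

const MAX_FEATURES: usize = 200; // assumed limit for illustration

fn load_feature_file(path: &str) -> io::Result<Vec<String>> {
    let contents = fs::read_to_string(path)?;
    let features: Vec<String> = contents
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| line.to_string())
        .collect();

    if features.len() > MAX_FEATURES {
        // Failing hard here is the behaviour the post describes once the
        // oversized file propagated; a hardened loader would keep serving
        // the last known-good feature set instead.
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!(
                "feature file has {} entries, limit is {}",
                features.len(),
                MAX_FEATURES
            ),
        ));
    }
    Ok(features)
}

fn main() {
    match load_feature_file("features.conf") {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(err) => eprintln!("refusing to load feature file: {err}"),
    }
}
```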
And then it recovered – for a while – because when the incident started Cloudflare was only partway through updating permissions management on the ClickHouse database cluster it uses to generate new versions of the feature file. The permission change aimed to give users access to underlying data and metadata, but Cloudflare made mistakes in the query it used to retrieve that data, so it returned extra rows that more than doubled the size of the feature file.
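The post doesn’t reproduce the faulty query, but the bug class is a familiar one: a metadata query that assumes each row appears once, then silently returns duplicates when a permissions change makes more of the cluster visible. A hedged illustration, with invented database and table names, written as ClickHouse-style SQL held in Rust string constants:

```rust
// Illustration only: the database and table names are invented, not
// Cloudflare's schema. The point is that the unscoped query returns one row
// per column *per visible database*, so a permissions change that exposes a
// second database silently duplicates every row; the scoped variant stays
// pinned to a single database.

/// Unscoped metadata query: vulnerable to duplication.
const UNSCOPED_QUERY: &str = "\
SELECT name, type
FROM system.columns
WHERE table = 'bot_request_features'
ORDER BY name";

/// Scoped variant: result set is stable regardless of what else the
/// account is allowed to see.
const SCOPED_QUERY: &str = "\
SELECT name, type
FROM system.columns
WHERE database = 'default'
  AND table = 'bot_request_features'
ORDER BY name";

fn main() {
    println!("before:\n{UNSCOPED_QUERY}\n\nafter:\n{SCOPED_QUERY}");
}
```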
At the time of the incident, the cluster generated a new version of the file every five minutes.
“Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network,” Prince wrote.
For a couple of hours starting at around 11:20 UTC on Tuesday, Cloudflare’s services therefore experienced intermittent outages.
“This fluctuation made it unclear what was happening as the entire system would recover and then fail again as sometimes good, sometimes bad configuration files were distributed to our network,” Prince wrote. “Initially, this led us to believe this might be caused by an attack. Eventually, every ClickHouse node was generating the bad configuration file and the fluctuation stabilized in the failing state.”
That stabilization in the failing state happened a few minutes before 13:00 UTC, which is when the fun really started: Cloudflare customers began to experience persistent outages.
Cloudflare eventually figured out the source of the problem, stopped the generation and propagation of bad feature files, and manually inserted a known-good file into the feature file distribution queue. The company then forced a restart of its core proxy so its systems would read only good files.
That all took time, and caused downstream problems for other systems that depend on the proxy.
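As a rough sketch of that recovery sequence – assuming the feature files travel through an ordinary directory-style distribution queue; the paths and helper below are invented, not Cloudflare tooling – it amounts to draining the queue, pinning a known-good file, and restarting the proxy:

```rust
// Hypothetical recovery helper, not Cloudflare tooling: empty the feature
// file distribution queue, publish a known-good file as the only entry,
// and leave the proxy restart to the operator. All paths are invented.
use std::fs;
use std::io;
use std::path::Path;

fn pin_known_good(known_good: &Path, queue_dir: &Path) -> io::Result<()> {
    // 1. Drain anything already queued so no bad file can be picked up.
    for entry in fs::read_dir(queue_dir)? {
        fs::remove_file(entry?.path())?;
    }
    // 2. Publish the known-good file as the only item in the queue.
    fs::copy(known_good, queue_dir.join("features-pinned.conf"))?;
    // 3. In the incident, Cloudflare then force-restarted its core proxy so
    //    it would reload only this file; that step is left to the operator.
    Ok(())
}

fn main() -> io::Result<()> {
    pin_known_good(
        Path::new("/var/lib/botmgmt/features-known-good.conf"),
        Path::new("/var/lib/botmgmt/queue"),
    )
}
```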
Prince has apologized for the incident.
“An outage like today is unacceptable,” he said. “We’ve architected our systems to be highly resilient to failure to ensure traffic will always continue to flow. When we’ve had outages in the past it’s always led to us building new, more resilient systems.”
This time around the company plans to do four things:
- Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
- Enabling more global kill switches for features
- Eliminating the ability for core dumps or other error reports to overwhelm system resources
- Reviewing failure modes for error conditions across all core proxy modules
Prince ended his post with an apology “for the pain we caused the Internet today.” ®
