Opinion When your cabbie asks you what you do for a living, and you answer “tech journalist,” you never get asked about cloud infrastructure in return. Bitcoin, mobile phones, AI, yes. Until last week: “What’s this AWS thing, then?” You already knew a lot of people were having a very bad day in Bezosville, but if the news had reached an Edinburgh black cab driver, new adjectives were needed.
As the world reluctantly touched grass, the AWS outage of October 20 made the top of the mainstream news. It beautifully illustrated the success of the cloud concept as it took out banking services, gaming platforms, messaging apps, and cat litter trays. Things got better after a few hours, and the nature of the collapse gradually revealed itself. A DNS failure knocked a core database service offline, which in turn triggered a control plane malfunction that broke load balancing.
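To get a feel for why one DNS wobble takes down so much, it helps to model the dependency chain. The sketch below is a toy, not AWS's actual architecture: the service names and topology are invented purely to show how a single root failure propagates downstream.

```python
# Toy model of a failure cascade: a service is healthy only if it did not
# fail directly and every one of its dependencies is healthy.
# Names and topology are illustrative, not AWS's real architecture.

DEPENDENCIES = {
    "dns": [],
    "database": ["dns"],                  # endpoint unresolvable -> database unreachable
    "control_plane": ["database"],        # control plane keeps its state in the database
    "load_balancer": ["control_plane"],   # health checks driven by the control plane
    "customer_app": ["load_balancer", "database"],
}

def is_healthy(service, failed, cache=None):
    """Walk the dependency graph; any failed ancestor takes this service down."""
    if cache is None:
        cache = {}
    if service not in cache:
        cache[service] = service not in failed and all(
            is_healthy(dep, failed, cache) for dep in DEPENDENCIES[service]
        )
    return cache[service]

if __name__ == "__main__":
    failed = {"dns"}  # one root cause...
    for svc in DEPENDENCIES:
        print(f"{svc:14} {'UP' if is_healthy(svc, failed) else 'DOWN'}")
    # ...and everything downstream reports DOWN, though only DNS actually broke.
```

The arithmetic is brutal: every service that transitively depends on the broken one goes dark, however healthy its own code may be.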
Why this cascade was both possible and unexpected, and why it took so long to find and fix, is even more interesting. Here’s a clue: this kind of event had been predicted by an ex-Amazonian who had watched key engineering talent flee the company for years, taking with it irreplaceable wisdom built from knowledge and experience. Such a prediction, backed by the observation that AWS techs had to grope their way to the big picture, is compelling.
No similarly compelling answer exists for the final and best question of all: how do you stop this happening again? Building safeguards against this specific chain of events, or even the whole class of such events, is the obvious move, as is re-engineering the chains of dependency and contagion.
None of this answers the basic criticism that AWS itself is too complex to analyze for such modes of failure, at least with the resources and tools it has on hand right now. Exactly the same can be said of the systemic cybersecurity failings that power the rolling thunder of ransomware and other eviscerating attacks.
Infrastructure expands to an event horizon where utility can no longer escape the gravitational pull of complexity. It’s much cheaper to add more and more functionality than it is to add more and more stability. Eventually things will break. Most things will break in a small way, filling a sysadmin’s day with entertainment and mild hypertension. Now and again, big things will break bigly, and you’re in the news.
There are so many ways to add resilience to this picture. Edge services can keep cloudy IoT devices going during a central outage. Even better, enough local compute built into the devices brings resilience even when whatever godforsaken subscription revenue model underpins the original offering can’t keep the parent company alive. The same sort of tiered failover can work for all manner of apps, although it gets steadily more expensive the more functionality is maintained.
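What that tiered failover might look like in miniature: try the cloud endpoint first, fall back to an edge gateway, and only then to whatever the device can decide for itself. A minimal sketch, assuming hypothetical URLs and a made-up local_decision() fallback; it shows the shape, not any vendor's product.

```python
# Minimal sketch of tiered failover for a connected device: prefer the cloud
# API, fall back to an edge node, and finally to purely local logic so the
# device stays useful during a central outage.
# The URLs and local_decision() are hypothetical, not any real API.

import urllib.request

TIERS = [
    "https://api.cloud.example.com/v1/status",   # full-featured cloud service
    "http://edge-gateway.local:8080/v1/status",  # reduced-function edge node
]

def fetch_status(url: str, timeout: float = 2.0) -> str:
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode()

def local_decision() -> str:
    # Whatever the device can decide on its own: run the scheduled cycle,
    # keep the last known configuration, open the flap anyway.
    return "local-default"

def get_status() -> str:
    for url in TIERS:
        try:
            return fetch_status(url)
        except OSError:        # covers DNS failures, timeouts, refused connections
            continue           # this tier is down, try the cheaper one below
    return local_decision()    # every tier failed: degrade gracefully, don't die

if __name__ == "__main__":
    print(get_status())
```

Each tier down the list costs more engineering to keep honest, which is exactly the trade-off described above.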
The same realization is dawning, crudely, in ransomware defense, expressed as making sure your organization can carry on working with pencil and paper. Which it can’t, of course, but there is some minimum level of independent tech that will keep life support at an acceptable level.
The consistent failure that prevents designing for resilience is that while it will save billions during a rare but inevitable event, it eats away at the bottom line day by day, week by week, quarter by quarter. In that way, it’s exactly like insurance. The difference is, capitalism and its symbiont governance have long recognized the enabling safety net that insurance provides. They don’t feel that way about infrastructure resilience, certainly not enough to apply the legal and regulatory pressure to ensure its adoption.
It would be easier if, as in aviation, failure of resilience murdered hundreds of people in eye-catching explosions of fiery death, instead of silencing Snapchat for an afternoon. Lack of resilience certainly kills people, but in an invisible, slow, and ambiguous way as it saps resources from critical systems and their supply chains. In an industrial and political environment where anti-regulatory hollering is the primary discourse, even fiery explosions of death wouldn’t make much difference.
Which means that when the correction does come, it’s going to be a biggie. If our systems break as often and obviously as they do in today’s climate, what would they do if things got stickier and our interconnected financial and commercial systems were given a proper nudge by those who do not have our best interests at heart?
Fortunately, resilience can be improved from the bottom up rather than waiting for top-down to happen. As responsible individuals, within departments, at board level, or as industry groups, what-ifs can be wargamed. You know what a power outage means, and what level of power pack, UPS, or backup generator is worth having. If AWS were to go away for weeks instead of hours, what would that look like? What would redundancy look like? Can you afford an experiment or two? Can you afford not to?
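That wargaming doesn’t need a procurement cycle; it can start as a spreadsheet row per dependency or a few lines of script. The sketch below is a deliberately crude what-if calculator, with invented dependencies, fallbacks, and cost figures, meant only to show how small the first experiment can be.

```python
# A deliberately crude what-if calculator: list core dependencies, note
# whether a fallback exists, and put a rough number on a given outage length.
# Every dependency, fallback, and cost figure here is invented; the point is
# to write down your own and argue about them.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Dependency:
    name: str
    fallback: Optional[str]   # None means "no plan at all"
    cost_per_day: float       # rough estimate of loss per day of outage

DEPENDENCIES = [
    Dependency("AWS us-east-1", "second region or second provider", 250_000),
    Dependency("Office power", "UPS plus generator", 40_000),
    Dependency("SaaS ticketing", None, 15_000),
]

def wargame(outage_days: float) -> None:
    for dep in DEPENDENCIES:
        exposure = dep.cost_per_day * outage_days
        plan = dep.fallback or "NO FALLBACK - this is the conversation to have"
        print(f"{dep.name:16} {outage_days:.0f}-day outage ~ ${exposure:,.0f} -> {plan}")

if __name__ == "__main__":
    wargame(outage_days=14)  # weeks instead of hours
```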
If you’ve never had this sort of conversation about any or all of your core technologies and services, then you’re part of the problem. Taking them seriously is the start of the solution, at any level. The alternative is waking up to your world on fire, realizing that when your cabbie asked you about AWS, you should have smelled the smoke. ®
