It’s human to err, but real catastrophes require computers


InfoWorld published a story last week titled the Top 10 worst cloud outages. The article certainly makes for good reading, although it would be nice if people stopped acting so surprised about cloud failures. It is, after all, just software and server hardware, and, however clever, all technology fails at some point, despite the recent hype. In fact, the more of it you have, the more likely you are to experience failures. A Cloud vendor actually needs to work harder just to match a ‘simpler’, traditional data centre in terms of high availability.
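The point that more components mean more failures can be made concrete with a little arithmetic. The sketch below assumes every component fails independently with the same probability over some period; both the independence assumption and the function name `prob_any_failure` are illustrative, not taken from the InfoWorld article.

```python
def prob_any_failure(p_component: float, n_components: int) -> float:
    """Probability that at least one of n components fails.

    Assumes (hypothetically) that components fail independently and
    that each fails with the same probability p_component.
    """
    if not 0.0 <= p_component <= 1.0:
        raise ValueError("p_component must be between 0 and 1")
    if n_components < 0:
        raise ValueError("n_components must be non-negative")
    # P(no failures) = (1 - p)^n, so P(at least one) is the complement.
    return 1.0 - (1.0 - p_component) ** n_components


if __name__ == "__main__":
    # The per-component probability is held fixed; only the fleet size grows.
    for n in (1, 100, 10_000):
        print(n, prob_any_failure(0.001, n))
```

Even with a very reliable individual server, the chance that something, somewhere, is broken approaches certainty as the fleet grows.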

The Butterfly Effect

The most important lesson taught at a first aid course is to ‘stop the accident’ – the same is starting to apply to highly interconnected software systems.

The recent Gmail failure was caused by a software bug that was discovered during the deployment process, yet it still managed to affect 0.02 percent of Gmail users. Skype has experienced two outages due to a combination of localised high load and a (replicated) software bug (discussed here and here). Amazon’s recent failure was a network misconfiguration that escalated into a data replication storm.

High availability built through infrastructure replication typically still shares the same software infrastructure, e.g., multiple deployments of the same code, so a bug in one equals a bug in all. The space shuttle had two separate flight systems to avoid this and achieve high reliability, which is not the same as high availability. In a cloud computing context, that equals the ability to use two (or more) alternative cloud vendors for the same service.
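A rough way to see why replicating the same code only helps so far is to compare outage probabilities. This is a toy model under stated assumptions: hardware failures are independent, a software bug takes down every replica running that code at once, and separately developed stacks have independent bugs. The function names and the numbers are hypothetical.

```python
def outage_shared_code(p_bug: float, p_hw: float, replicas: int) -> float:
    """Outage probability when every replica runs the same code.

    Either the shared bug fires (all replicas go down together), or
    every replica suffers its own independent hardware failure.
    """
    return p_bug + (1.0 - p_bug) * p_hw ** replicas


def outage_diverse_stacks(p_bug: float, p_hw: float, replicas: int,
                          stacks: int = 2) -> float:
    """Outage probability with several independently built stacks.

    Assumes each stack has its own replicas and its own, independent
    bug; the service is down only if every stack is down.
    """
    return outage_shared_code(p_bug, p_hw, replicas) ** stacks


if __name__ == "__main__":
    # Illustrative numbers only: 1% bug risk, 5% hardware risk per replica.
    for n in (1, 3, 10):
        print(n, outage_shared_code(0.01, 0.05, n),
              outage_diverse_stacks(0.01, 0.05, n))
```

In this model, adding replicas of the same code can never push the outage probability below the bug probability, whereas a second, independent stack can.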

The case of ‘localised failure bringing down an entire network’ isn’t new, of course. Duncan Watts describes a similar incident in the American power network in August 1996, where a single power line failure brought down power to all of San Francisco, causing an estimated $2 billion in damages. The single line was (of course) only the trigger, not the sole cause: several factors contributed to the failure, such as poorly maintained trees near power lines, high heat causing power lines to stretch further, high winds, and high load. None of these by itself would have caused a problem or been considered a fault, but the network failed because the trigger and all the factors were present. Many of the public cloud data centres are probably at a complexity level where they are starting to exhibit similar properties.

A Reliable Cloud Requires More Than a Credit Card

If you need reliability/availability greater than what your Cloud vendor provides, or if a significant outage would (nearly) kill your business, then you really need two separate clouds. But that might just kill the business case for moving to the cloud. Or at the very least it makes the calculations in the business case significantly more complex, as the estimate needs to consider different service plans, different pricing models, and how to avoid over-provisioning. Not to mention that the responsibility for developing this cross-cloud high reliability/availability architecture has moved back in-house.
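To give a feel for what "back in-house" means, here is a minimal sketch of the kind of failover logic a customer would end up owning when using two vendors. The function `call_with_failover` and the idea of passing each vendor as a zero-argument callable are assumptions made for illustration; no real cloud SDK is involved, and a production version would also need health checks, data synchronisation and timeouts.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result.

    Each provider is a hypothetical zero-argument callable wrapping a
    request to one cloud vendor. If all of them raise, a RuntimeError
    summarising the collected errors is raised instead.
    """
    errors: List[Exception] = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors!r}")


if __name__ == "__main__":
    def vendor_a() -> str:
        # Stand-in for a vendor that is currently having an outage.
        raise ConnectionError("vendor A is down")

    def vendor_b() -> str:
        # Stand-in for a healthy second vendor.
        return "served by vendor B"

    print(call_with_failover([vendor_a, vendor_b]))
```

Even this toy version hints at the extra cost: every request path now has to know about both vendors, and someone has to keep the second one provisioned and in sync.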

Starting to sound like a normal data centre situation – only more complex? Not surprisingly, a private cloud is starting to sound compelling.

But maybe the real outcome of all this is a revised Cloud architecture, where customers can purchase high reliability as well as high availability. There isn’t anything that stops Amazon (or anyone else) from building the ‘back-up’ cloud using different infrastructure. After all, one of the probable reasons people want to move their infrastructure to the Cloud is to avoid having to think about all this technical stuff.

The basic motivation for Cloud vendors is to design their cloud with a focus on maximising server utilisation, and in doing so they may inadvertently sacrifice some reliability. Maximising server utilisation creates many software inter-dependencies in the underlying infrastructure, as the sharing of infrastructure goes up. This makes it harder for a cloud vendor to isolate and resolve problems before they propagate to an entire data centre.

Moving to the Cloud really takes the debate from high availability to high reliability.
