It’s human to err, but real catastrophes require computers

[Image: a typical server "rack". Via Wikipedia]

InfoWorld published a story last week titled "Top 10 worst cloud outages". The article certainly makes for good reading, although it would be nice if people would stop acting so surprised about cloud failures. A cloud is, after all, just software and server hardware, and, recent hype notwithstanding, all technology fails at some point, however clever it is. In fact, the more of it you have, the more likely you are to experience failures. A cloud vendor actually needs to work harder just to match a ‘simpler’, traditional data centre in terms of high availability.
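
To put a rough number on that intuition, here is a back-of-the-envelope sketch; the per-server failure probability and the fleet size are illustrative assumptions, not measured vendor figures:

```python
# Back-of-the-envelope: probability that at least one component fails,
# assuming components fail independently. Numbers are illustrative only.

def p_any_failure(n_components: int, p_single: float) -> float:
    """P(at least one of n independent components fails)."""
    return 1 - (1 - p_single) ** n_components

# One 'simple' server vs. a fleet of 10,000 cloud servers, each assumed
# to fail on a given day with probability 0.001:
print(p_any_failure(1, 0.001))       # 0.001    -> failure is rare
print(p_any_failure(10_000, 0.001))  # ~0.99995 -> some failure is near-certain
```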

The Butterfly Effect

The most important lesson taught at a first aid course is to ‘stop the accident’ – the same is starting to apply to highly interconnected software systems.

The recent Gmail failure was caused by a software bug discovered during the deployment process, yet it still managed to affect 0.02 per cent of Gmail users. Skype has experienced two outages due to a combination of localised high load and a (replicated) software bug (discussed here and here). Amazon’s recent failure began as a network misconfiguration that escalated into a data replication storm.

High availability built through infrastructure replication typically still shares the same software infrastructure: multiple deployments of the same code, so a bug in one equals a bug in all. The space shuttle avoided this by carrying two separately developed flight systems, achieving high reliability (which is not the same as high availability). In a cloud computing context, the equivalent is the ability to use two (or more) alternative cloud vendors for the same service.
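
As an illustration of what "two alternative vendors for the same service" could look like at the application level, here is a minimal sketch; the store_on_vendor_a and store_on_vendor_b functions are hypothetical stand-ins for two genuinely independent provider SDKs, not real APIs:

```python
# Minimal sketch of vendor-level redundancy. The two vendor functions are
# hypothetical stand-ins, not real SDK calls; one simulates an outage.

class AllVendorsFailed(Exception):
    """Raised when every independent vendor rejects the operation."""

def store_on_vendor_a(key: str, data: bytes) -> None:
    raise ConnectionError("vendor A is down")  # simulated outage

def store_on_vendor_b(key: str, data: bytes) -> None:
    print(f"stored {len(data)} bytes under {key!r} with vendor B")

def store_blob(key: str, data: bytes) -> None:
    # Try each vendor in turn. This only helps if the two implementations
    # are genuinely separate: a bug replicated into both code paths would
    # defeat the redundancy.
    for attempt in (store_on_vendor_a, store_on_vendor_b):
        try:
            attempt(key, data)
            return
        except Exception:
            continue
    raise AllVendorsFailed(key)

store_blob("invoice-42", b"example payload")  # falls over to vendor B
```

The failover loop is the easy part; the hard part, as the shuttle example suggests, is keeping the two implementations independent enough that they don't share the same bug.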

The case of a ‘localised failure bringing down an entire network’ isn’t new, of course.