Jan 16 2013

Is It Time to Stop Making Excuses for Cloud Outages?

When the cloud goes down, the finger-pointing begins.

The enthusiasm behind the cloud computing push means that more IT departments are relying on external partners and vendors for their uptime. This isn’t entirely new; after all, companies have relied on external web-hosting companies for years. But it does present businesses with new challenges.

Amazon Web Services (AWS), considered by many to be the premier public cloud, has had several public outages that took down the services of major companies, including Netflix and Reddit.

Some cloud pundits say, “Hey, no technology is up all the time. Deal with it and plan for outages.” But Andi Mann, vice president of strategic solutions at CA Technologies, says enough is enough when it comes to excusing cloud outages.

In a bold post on his blog, Mann strikes back at the notion that downtime is inevitable in IT.

“No enterprise IT shop could allow mission-critical services [to] go down for almost a whole day,” Mann writes.

In his experience, many enterprises can do better than Amazon’s 23-hour downtime on Christmas Eve, largely because they have more IT experience than most cloud providers:

While poorly run in-house IT abounds, I do know many enterprise IT shops that maintain better uptime than cloud service providers, and for very good reason. Cloud is immature as an IT discipline, and almost by definition embodies greater risk.

Most providers lack the experience and risk-aversion of large enterprise IT, and even of more established IT providers. Most large enterprises have been running larger and more complex IT systems than AWS for 20, 30, 40 years or more. It is almost perverse to assume cloud providers can run more stable IT systems without this long-term experience.

As for the design-for-failure mantra, Mann asks why cloud providers, rather than cloud consumers, shouldn’t be the ones to adopt that mindset:

Here’s a thought – how about we all demand cloud providers ‘design for failure’ instead, and demand they supply higher quality – dare I say, “enterprise quality” – cloud services?

After all, if your in-house IT infrastructure failed, your ops team could not get away with blaming your developers because your applications were not ‘designed for failure’. Similarly, there is no reason cloud providers (and their various apologists) should be allowed to get away with it either. I can only assume cloud providers already do ‘design for failure’ to some degree, but given the number, severity, and duration of cloud failures, it certainly seems like many of them are doing a pretty awful job of it.

Concerns about outages and uptime are among the reasons many companies are adopting a hybrid private/public cloud approach. When your business is on the line, you have to be able to ensure uptime. Otherwise, you might be left in the dark with nothing but excuses.