When a state government shuts down, the proximate root cause is a lack of funding. The next-level root cause is that the two sides weren’t able to find an acceptable compromise. The deep root cause is that THE OTHER SIDE IS EVIL, STUPID, AND ENTIRELY LACKING IN THE SLIGHTEST SENSE OF HOW THINGS WORK.
Please trust me on this; I live in Minnesota, where we just shut down, and this is pretty much how everyone looks at the situation … except for those who figure my description applies equally to both sides.
But this column isn’t about government shutdowns. It’s about Cloud shutdowns.
The latest meme floating around the meme-o-sphere is that if you shift your infrastructure into The Cloud … which all right-thinking CIOs who haven’t already done so must be planning … and you experience an outage, it’s your fault for failing to engineer your Cloud infrastructure to be sufficiently fault-tolerant.
As my Dad used to say (and still does say if the occasion calls for it), if they sell you that argument and you buy it you have something in common — you’re both schmucks.
What we have here is a collision between the principle of caveat emptor (“buyer beware,” also known as if-you-believed-that-you-deserve-whatever-happens-to-you) and the competing principle of don’t-tell-outright-lies-about-your-product.
Start with what’s been promoted as What Makes the Cloud So Incredible: It’s cheap. You can get it up and running faster. It scales up instantly when demand increases. It scales down instantly when the demand goes away. Most important of all, they know how to design computing infrastructure better than you do, so it will be more reliable … so much more reliable that you don’t need any systems engineers on staff anymore, because they won’t have anything to do.
That’s what’s been promoted. The way it’s turned out is this:
- Scalability: Cloud providers can add and shed capacity to meet demand pretty much as advertised.
- Speed of deployment: When you’re looking at Software as a Service solutions, faster deployment is what you get (assuming integration isn’t part of your implementation plan; if it is, the picture is more complicated). Platform as a Service? No: the platforms you already have are already installed. Infrastructure as a Service? Assuming you’ve virtualized your data center, no: your staff can bring up new virtual servers just as fast as any Cloud vendor.
- Cost: Whether Cloud-based solutions are more or less expensive is a multivariate question that depends on such factors as the number of users and your cost of capital. The only certainty is that when you go to the Cloud you’ll spend less CapEx and more OpEx than if you put everything inside your firewall.
- Reliability: Engineering reliability into your computing environment is still the responsibility of your system engineers. Not only that, but (I guess) you have to buy and manage the technology that allocates your computing among multiple Cloud vendors … not to mention the technology that detects outages and performance problems in your Cloud infrastructure (without having access to the actual infrastructure).
- Reliability, Part 2: Troubleshooting is harder, not easier. That’s because if you owned your own servers you’d know where to look in the case of an outage. Many outages in The Cloud aren’t massive. Possibly, only you and a handful of other clients are down. That means the Cloud vendor, which has zillions of physical servers and gagillions of virtual ones, is searching for a needle in a haystack to even find your problem.
Not that I’m a suspicious sort, but the new meme sure sounds like something that was designed and built in a vendor’s meme lab, as a way to deflect responsibility, not for the outages themselves, but for the business harm resulting from them. There’s nothing fundamentally wrong with the concept of The Cloud (or so you’re supposed to believe). What’s wrong is the failure of people to Take Responsibility (you can always hear the capital letters when people repeat this old meme).
Whether or not this is a vendor-engineered meme, the Cloud vendor community would do well to disavow both the thought process and the engineering reality that makes it important.
The thought process is obviously disastrous for those promoting Cloud computing. It doesn’t just eliminate most of the claimed benefits — it adds a new bunch of headaches you don’t need.
As for the engineering reality, shouldn’t every Cloud vendor have multiple data centers and spread every customer across them as a matter of course?
After all, who needs a meme when you can have smart engineering instead?
As a provider, there’s no way I’m telling my customer, “This one was Amazon’s fault; there was nothing I could do about it.” Huh? As if I really couldn’t think of a way to mitigate this problem next time?
I did tell them there was an outage with Amazon. That helped them understand. But the reality is, I chose the cloud provider that hosted their site. That choice ended up biting them. I never thought Amazon’s EBS would crash and burn, and pull down Heroku for days, and I had made a choice not to spend much time preparing for that eventuality.
Until after it happened, that is. Then I brought their site up a different way, albeit over the course of hours instead of seconds. Then I started planning how I could make that into minutes the next time.
Why should my customer care which cloud provider is at fault? That’s what they pay me to think about. Heroku is just a tool that I use. I’m responsible for choosing that and/or other tools to balance functionality, cost, and risk according to the need. Even with a “cloud provider,” I know there have to be single points of failure, and I should consider the best response to unexpected, worst-case scenarios.
If a server goes down, you can’t tell your boss “but this server had so much redundancy, it was impossible for it to fail.” Yeah, right. Nothing is impossible, just some degree of unlikely.
That said, I agree with what I believe is your point: this is not the meme that cloud providers should want to spread.
Perhaps because I’m cynical (per G. B. Shaw, so called by those who lack the power of accurate observation), I think you listed the causes of government shutdowns in the wrong order. The primary cause is arrogant and self-serving politicians of all stripes who are both unwilling and unable to see any viewpoint beyond their own, and who more than stubbornly defend their delusions from anything resembling facts. Everything else stems from that.
I realize I just reworded your description of the “deep root cause”; I just think it should have been listed first, as it’s the most important.
As usual, H. L. Mencken nailed it: “Under democracy one party always devotes its chief energies to trying to prove the other party is unfit to rule – and both commonly succeed, and are right.”
The worst part? We are a nation of masochists. Why else would we keep re-electing these clowns?
Bob,
On the one hand, DR is not an inherent benefit of the cloud, just as it is not an inherent benefit of server virtualization. Certainly, virtualization does make it easier to provide DR functionality, and cloud computing does make certain aspects of DR and BCP easier. But it also adds complexity.
OTOH, I do agree that vendors cannot simply absolve themselves of any responsibility for providing a robust, resilient infrastructure. While customers shouldn’t take it for granted that DR is built into everything that they’re purchasing, or that the redundancy of infrastructure will guarantee redundancy of their applications riding on that infrastructure, vendors shouldn’t get a pass to put the full burden on the customers, either.
Each party has a role to play, and each party needs to ensure that they are handling their own role appropriately.
I recently presented a webinar on Cloud Misconceptions that tackled some of these points, especially the “inherent DR” aspect of cloud computing that so many customers rely on.
At the end of the day, the cloud provides another avenue for provisioning and service delivery that needs to be considered for the proper fit with an organization, but it is not a panacea…
Totally agree. I just pulled my public web site back in-house, because of my large “totally redundant, look at our SLAs” public cloud vendor’s fourth outage this year. I realized I already have a redundant, virtualized infrastructure with 99.999% uptime over the last 12 months. Why roll the dice on someone outside just because being in the public cloud is the current fashion?
I look at the Cloud the same way I look at any network. How do you feel when you come in one morning, spin up your computer (old timer talk for booting) and you can’t connect because the network is down? Doesn’t matter whether your inside network is down or the Cloud connection is down; the result is no connectivity and less work gets done. At least if the inside network is down, you should be able to get hold of someone to help find a fix. If the Cloud goes down, who do you call? Also, how do you file an outage report on the provider’s website if the connection to the website is down? Is est quis is est!