Today is Friday, March 3rd, my “office” day where I focus on Watchtower instead of working service requests for clients. Writing a blog article was not on my ‘to-do’ list for today. But recent events this week convinced me I should do some writing this morning instead.
On Tuesday, part of Amazon’s cloud services went down for about 6 hours. At one point we heard there were around 142,000 websites and services impacted by this outage. Major companies ground to a halt because they couldn’t access their “stuff”. Watchtower was impacted by this outage as well. One of our vendors, which hosts our ticketing, time-tracking, and invoicing application, went down because it uses Amazon to power its service. So for 6 hours or so on Tuesday, we found ourselves unable to update existing tickets, enter time spent on client work, receive new requests, or even look up client documentation. It put us even further behind in what has been an exceptionally busy 3-month period.
Yesterday, Amazon disclosed the cause of the outage. A technician entered a wrong command and accidentally shut down many more storage servers than intended. Essentially, Amazon accidentally “rebooted” its system in one datacenter in Virginia. That’s it. There was no major hardware failure, no malicious hackers involved, no natural disaster. Just simple human error.
Of course we’ve heard some gentle teasing about how “the cloud” isn’t supposed to fail. Others are up in arms over how wrong it is that just a few very large enterprises control such a large portion of the internet. And while there is a little bit of truth to this, the reality is that Tuesday’s outage didn’t need to be anywhere near as bad as it was.
Only one of Amazon’s datacenters was impacted by the accidental reboot. It just so happens to be one of Amazon’s largest. The fact that so many websites and services went down because of a single datacenter outage tells us a lot about each of the businesses that rely on Amazon. The sites that went down completely were not taking advantage of the redundancies Amazon offers. A business can choose to replicate its services with Amazon across multiple datacenters to protect against just this kind of outage. Doing so would have kept those services operating during the outage. The trade-off is additional cost. Replicating data to multiple datacenters means paying more for the storage you use.
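To put rough numbers on that trade-off, here is a minimal back-of-the-envelope sketch in Python. It is only an illustration: the 99.9% availability figure is a made-up assumption (not a number published by Amazon), the function name is our own, and the math assumes datacenters fail independently of one another, which is never perfectly true in practice.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def expected_downtime_hours(availability: float, copies: int = 1) -> float:
    """Estimate yearly downtime for a service replicated to `copies` datacenters.

    Assumes the service is only down when every copy is down at the same
    time, and that each datacenter fails independently of the others.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    chance_all_down = (1.0 - availability) ** copies
    return chance_all_down * HOURS_PER_YEAR


# Hypothetical availability of a single datacenter (an assumption for illustration).
ASSUMED_AVAILABILITY = 0.999

single = expected_downtime_hours(ASSUMED_AVAILABILITY, copies=1)
double = expected_downtime_hours(ASSUMED_AVAILABILITY, copies=2)

print(f"One datacenter:  about {single:.2f} hours of downtime per year")
print(f"Two datacenters: about {double * 3600:.0f} seconds of downtime per year")
```

Under those assumptions, a single datacenter works out to a little under 9 hours of expected downtime per year, while a second copy cuts that to well under a minute. The price is storing (and paying for) everything twice. Real-world numbers will differ, but the shape of the trade-off is the point.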
It’s a story we see time and again with many of the small businesses we support. So many are concerned with getting the absolute minimum tech they need to run their business and are not willing to spend anything more. Of course, they’re also the first to complain when going the cheap route fails. The bottom line is you need to balance business continuity with your budget. It’s not always possible to afford redundant systems. Sometimes, bare-bones is the best you can do at this point in time. But that doesn’t mean you shouldn’t have a Plan B. It’s important to ask yourself what the consequences are WHEN a specific system fails. There is no such thing as 100% uptime. It’s just not possible. Internet connections go down, phone calls get dropped, computer hard drives fail, and applications crash.
Plan for the Unexpected
If your office internet connection goes down during business hours, what are your options? Well, you could always pay for a second, independent connection and a managed firewall that can automatically switch between them. The trade-off is paying for two internet connections and the firewall service to control them. But you already have a tight budget and really can’t afford the extra expense. This is where we need to get a little creative. We start by answering a few questions about why you need internet access. Do you just need to be able to look up pricing or info from your vendors? Do you use a cloud service? If so, then ANY internet connection will work. There are free Wi-Fi hotspots in many public locations now. Just pick up your work laptop and go sit in a coffee shop, café, or even your living room (assuming your office isn’t already in your home). If you currently use a desktop PC, then thinking through this scenario may help you decide you need to budget for a laptop with a docking station. This gives you the same experience as a desktop PC with the flexibility of a laptop when the need arises.
So what about our scenario where our ticketing system went down? We had no other options for accessing the data in that system. What we did have was our Outlook calendars, phones, and pen & paper. Because our ticketing system syncs our scheduled tickets to our Outlook calendars, we still had access to what we were supposed to be working on. In some cases, we couldn’t do the work that was scheduled because we needed additional info that was only in our ticketing system. All we could do there was call or email the client directly and reschedule the work. Pen & paper let us track our time so that we could enter it correctly once the system came back up. It certainly wasn’t ideal and some work had to be postponed or rescheduled, but we weren’t completely dead in the water.
Don't Become Complacent
A lot of companies are taking a much closer look at how their cloud services are configured this week. Some will make improvements that protect them from the next outage. But many will forget, or get distracted with other things, or find some other excuse... until the next outage. We believe businesses should be thinking about these scenarios frequently and always looking for ways to improve. The lessons learned this week don’t apply just to the cloud, but to IT systems in general. Please give us a call if you’d like to talk more about what your business can do to better protect itself from the unexpected.