Everyone enjoys a good horror story. Horror stories about one’s worst fears coming true are particularly compelling when they happen to someone else, and even more so in the professional realm. That’s not to say that blood and guts are to everyone’s tastes, but hearing that some OTHER company’s systems crashed or that someone ELSE’s mistake cost someone ELSE’s company a major deal often gives us a morose sense of self-satisfaction and professional pride, coupled with a heightened air of “let-me-just-double-check-everything-one-more-time.” IT and facility managers especially like hearing the ones about how someone ELSE’s data center suffered a problem that snowballed into a total catastrophe because SOMEONE ELSE failed to put a completely airtight seal on every single part of their data center’s contingency plans. Anyone who works in a data center undoubtedly knows about Delta Airlines’ troubles in August 2016 and is currently making sure that they can avoid a similar fate.

Delta Airlines suffered a substantial blow when its data center lost power. The outage led to the cancellation of over 2,000 flights, affected hundreds of thousands of passengers, and triggered a swift drop in the stock price.[1] While initial reports pointed to a failure at the state utilities provider, it eventually came to light that the problem was caused by a surge when a power control module malfunctioned. Some of the backups that were designed to kick in failed to do so, which prevented Delta from continuing operations under its recovery plan.[2] This came on the heels of another major airline’s outage: Southwest Airlines suffered a data center snafu that could still cost it up to an estimated $82 million in damages.[3] While the investigation is still ongoing, that one has been attributed to a faulty network router.[4]

While the wake of these IT-related disasters continues to energize the pundits and experts offering a variety of opinions on how to interpret and better prepare for such events, there are a few inescapable truths we must always remember.

Truth #1: The world and its interdependent systems are becoming increasingly complex. While this means that the number and nature of the problems we are able to handle constantly grow, so do vulnerability and exposure.

Truth #2: Things go wrong. Things don’t quite go according to plan, especially in the IT world. This will never change, plain and simple. It’s not a matter of “will it break?” but “WHEN will it break?”

Truth #3: The longer the downtime, the worse the problems. More heavily layered systems take longer to bring back, so they require better disaster recovery plans, better tools to manage the environment and control the variables, and better overall planning.

Truth #4: Never underestimate the power of human error. The ever-growing strata of systems designed to remove human error can only reduce it, not eliminate it. After all, as they say, we’re only human.

Delta and Southwest Airlines’ latest troubles are certainly provoking the anxieties of data center managers and operators everywhere. The Washington Post reports that aviation experts can tie some of these companies’ problems back to mergers that tether outdated or incompatible computer systems to one another.[5] In the world of colocations, likewise, many data centers grow almost exponentially by acquisition, which makes them vulnerable to exactly these kinds of disasters. Everyone’s seen the caricatures of data centers as colossal, dungeon-like facilities with endless rows of high-tech cabinets and other equipment. Managing that inventory can be hard enough without also having to integrate it with someone else’s equally large but different systems because the companies merged.

This is why IT managers, facility managers, and other data center operators have such specific and evolving needs in their data center infrastructure management (DCIM) solutions. It often comes down to simply being able to keep track of all the assets and handle each expert system through what’s become known as a “single pane of glass.” At its core, the DCIM needs the robustness to keep up with the changing landscape of a company’s mission-critical systems. With the proper tools in place, a small snafu or an unpredictable obstacle can stay a meaningless blip in a much larger and well-managed machine. IT departments can stay ahead of the curve rather than become the next horror story for their competitors.

[1] http://www.latimes.com/business/la-fi-delta-outage-q-and-a-20160811-snap-story.html

[2] https://www.washingtonpost.com/local/trafficandcommuting/delta-identifies-cause-of-computer-crash-that-crippled-flights-monday/2016/08/09/65876f92-5e66-11e6-8e45-477372e89d78_story.html

[3] http://www.wfaa.com/news/local/southwest-airlines-computer-outage-costs-could-reach-82m/296158194

[4] http://www.wfaa.com/news/local/southwest-airlines-computer-outage-costs-could-reach-82m/296158194

[5] https://www.washingtonpost.com/local/trafficandcommuting/delta-identifies-cause-of-computer-crash-that-crippled-flights-monday/2016/08/09/65876f92-5e66-11e6-8e45-477372e89d78_story.html