Complexity and Outages


The recent surge in high-profile outages is not an illusion. It's all too real, and one major contributing factor often goes unmentioned. Reposted with permission from CircleID, this piece by Russ White highlights the role complexity plays in the increasingly fragile digital customer experience:

It's a familiar story by now: on the 8th of August, 2016, Delta lost power to its Atlanta data center, causing the entire data center to fail. Thousands of flights were cancelled, many more delayed, and tens of thousands of travelers stranded. What's so unusual about this event is that, in the larger scheme of network engineering, it's not that unusual. If I think back to my time on the Escalation Team at a large vendor, I can think of hundreds of situations like this. And among all those events, there is one point in common: it takes longer to boot the system than it does to fix the initial problem.

There was the massive Frame Relay hub and spoke network, built up over years, servicing every site for a particular retailer. A single interface flap caused hours of downtime. Then there was the theme park, and the routing protocol convergence failure that cost eight hours of downtime. There was the bank with the misconfigured application that took several days to recover from. And now there is Delta. Perhaps it took an hour to put the fire in the generator room out, but it took another six and a half hours to bring the systems back online once the fire was out. I still think about the network administrator whose backup plan was to shift a cable from one 7500 to another in case of failure. Time to swap the cable? A few minutes. Time to reconverge the network and get all the applications running again? Several hours.

The closest analogy I can find is one of those plate spinning acts. Once all the plates are spinning, it's impressive. But watch carefully how long it takes to get all the plates spinning, and how long it takes to bring them all down. Spinning each plate takes a minute or two. Bringing them all down at once takes only a minute or two as well. It doesn't matter if the plate spinner has another set of plates to replace the ones broken in a crash; by the time he's done respinning the plates, the audience is gone.

The problem, you see, isn’t disaster recovery. It’s not even having a backup, or a plan to switch to another data center.

The problem is complexity.

We're just too quick to add "another system, another protocol, another …" to an existing system in order to support some requirement handed to us by someone who believes networks and data systems should be able to "do anything." This is not only an application problem; it's a networking problem, too. It's just so much cheaper to build another layer 2 overlay on our networks than it is to insist that applications be written to support layer 3 networking directly. It's so much easier to buy that new application for all the bells and whistles, and then say "make it run on our network" to the network engineers, than it is to scale back expectations a little in order to build a supportable network.

Maybe it's time for all of us to take a lesson from Delta, and from the many network failures before Delta over the last twenty years. That next overlay you deploy to "solve" an application-level problem isn't a solution. It's just another hidden bit of technical debt waiting to blow the network, and your business, up in a few months.

We need to start calculating the ROI across not just disaster recovery, but complexity, as well.
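To make that concrete, here is a minimal back-of-the-envelope sketch of what folding recovery time into the downtime math looks like. All of the numbers (failure rates, hours, cost per hour) and the function name are illustrative assumptions, not figures from the article: the point is only that when reconvergence dominates recovery, a "more reliable" complex design can still cost more than a simpler one that fails more often but comes back quickly.

```python
# Hypothetical sketch: expected annual downtime cost, counting not just the
# time to fix the initial fault but the time to bring the system back up.
# All figures below are made-up assumptions for illustration.

def expected_annual_downtime_cost(failures_per_year, fix_hours,
                                  recovery_hours, cost_per_hour):
    """Expected cost = failure rate x (time to fix + time to recover) x cost/hour."""
    return failures_per_year * (fix_hours + recovery_hours) * cost_per_hour

# A simpler design: fails a bit more often, but reconverges quickly.
simple = expected_annual_downtime_cost(
    failures_per_year=4, fix_hours=1.0, recovery_hours=0.5,
    cost_per_hour=100_000)

# A complex, layered design: fails half as often, but recovery dominates.
layered = expected_annual_downtime_cost(
    failures_per_year=2, fix_hours=1.0, recovery_hours=6.5,
    cost_per_hour=100_000)

print(simple)   # 600000.0
print(layered)  # 1500000.0 -- the "reliable" design costs more per year
```

Under these toy assumptions, the complex design loses despite failing half as often, because every failure drags the full reconvergence time behind it.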

By Russ White