I Should Write a Blog

I've been posting to Facebook with some frequency the past few weeks. My latest post drew a casual "thanks for the blog," and then my daughter chimed in with a "you should write a blog."

This is my first attempt, playing with the Google Blogger tools. This is very much a work in progress, really more of an experiment, to see what I shall see.

Saturday, August 30, 2014

1996: Taking America Off Line


This is the first of what I expect will be many trips in the way-back machine to some of my significant career experiences, things I did that worked out and likely some that didn't.  I'll try to address both what happened and what I was able to learn as a result.


This is based on my recollection of events.  I have almost certainly, unintentionally, edited events in my memory; I'm trying to be accurate, but memories can be shifty at times.  Ultimately, it's my blog and my story, so I'll apologize ahead of time for any errors, but I'm not going to worry about it (much).




AOL's 19 Hour Outage

I think it was August 7th, 1996, a day we ended up calling Black Wednesday, when AOL suffered its longest service outage, usually referred to as AOL's 19 hour outage.  Arguably, it wasn't an outage for that entire period, as it started with a normal service window; the outage began when AOL failed to come back online at the end of the window.  The details are a bit fuzzy at this point, but the actual unscheduled outage was 14 or 15 hours. Not that it matters much; it was a huge amount of time for the service to be down.

I'll be coming back to this outage in the future. For now, I'll point out that it was a network event, driven by a failure in the IP networking of the service.  This CNET piece was posted the week of the outage and has some of the feeling of the time.

It was shortly after this event that I assumed responsibility for the AOL IP networking group that had been at the heart of the outage. I stepped into the role desperately short of actual networking knowledge but fully aware that AOL could ill afford another outage mess and that my employment very reasonably depended on such a disaster not happening.

Something is Desperately Wrong

I had been in my new position for about a month when, in the early evening, the start of our prime time, a time when any hiccup was likely to make national news, we started having unexplained problems in various subsystems that spanned data centers.  The operations team quickly gathered in our Network Operations Center (NOC); as usual, most of the team was still working at about 8 PM when things started going sideways. We immediately began cataloging the problems in search of a cause.

No software or configuration changes had been made in the past few days, eliminating the most common cause of this type of problem. The system was clearly in major distress: our user count, the number of people actively using the service, was below where it should be and was bouncing up and down to the tune of multiple thousands of users per minute.  The issues were sporadic but so widespread that nearly every AOL session was being affected.

The widespread and transient nature of the problems made it feel like a network problem; the network is the component of our system that all other elements depend upon, and it could vanish out from under portions of the system and recover just as quickly.

Steve Case Joins Us in the NOC

After what seemed like hours of focus, but was in reality closer to 30 minutes, Steve Case, the CEO of AOL, quietly walked into the NOC, presumably to see what was happening for himself.  I think he spoke with Matt Korn, my manager, and quietly watched as we hurriedly reviewed screens of data, ran various non-destructive tests in our attempts to identify the specific problem cause(s), and theorized about various causes and possible solutions.

The pressure we felt was escalating as the clock raced on toward prime time, our customers were suffering through a poor online experience, and our CEO was watching, keenly aware of our recent network SNAFU which had earned us so much unwanted press attention. This situation was unusual: America was clearly Online, as we had most of our modems in operation, but the system was degrading, heading toward what appeared to be a disastrous prime time crash. I could almost see the headlines as we grasped at straws.

Maybe A Memory Leak in the ATM Switches

One of the theories for a root cause was a memory leak in the ATM switches that stitched together the Vienna and Reston facilities. This was the element my team was least comfortable with and a piece that could affect, or is that afflict, any component of the system that depended on cross-site communication.

We had worked with a network vendor for many hours during the 19 hour debacle as they walked through various troubleshooting procedures, looking to identify the specific bug causing our outage, with no success.  The technical support people for the ATM switches were proposing a similar approach to find the problem while we maintained our rather miserable ability to provide service to our customers.

Pretty much any memory leak can be worked around rather quickly by rebooting the device with the problem.  This doesn't fix the leak, but it keeps the symptoms from reappearing for some time.  We were using (mostly) redundant pairs of ATM switches, so in theory we could and should reboot one pair, let it come back, and then reboot the other.  This approach would take longer but allow traffic to flow the whole time, though somewhat restricted.  It also assumed that things were relatively normal, which did not seem to be the case.
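To make the trade-off concrete, here's a minimal sketch of the two options, written in Python purely for illustration; the switch names and the reboot_switch() / wait_until_healthy() helpers are hypothetical stand-ins, not the switches' real management interface or anything we actually ran that night.

```python
# A minimal sketch of the two reboot strategies we weighed, assuming two
# redundant pairs of cross-site ATM switches. All names and helpers are
# hypothetical stand-ins, not AOL's actual tooling.
import time

SWITCH_PAIRS = [("vienna-atm-1", "reston-atm-1"),
                ("vienna-atm-2", "reston-atm-2")]

def reboot_switch(name):
    # Stand-in for the management command that power-cycles one switch.
    print(f"rebooting {name} ...")

def wait_until_healthy(names):
    # Stand-in for polling link and health status until the switches recover.
    time.sleep(1)
    print(f"{', '.join(names)} back in service")

def rolling_reboot():
    # Reboot one redundant pair at a time: traffic keeps flowing on the
    # surviving pair, but it is slower, and if the surviving pair carries
    # the same fault it can hand the problem right back to the fresh pair.
    for pair in SWITCH_PAIRS:
        for name in pair:
            reboot_switch(name)
        wait_until_healthy(pair)

def simultaneous_reboot():
    # Reboot everything at once: a brief 100% outage, but every switch
    # comes back from a clean state at the same time.
    all_switches = [name for pair in SWITCH_PAIRS for name in pair]
    for name in all_switches:
        reboot_switch(name)
    wait_until_healthy(all_switches)

if __name__ == "__main__":
    simultaneous_reboot()   # the option we ended up choosing that evening
```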

Reboot the ATM Switches  

It was with this information and background that I turned to our network engineer who was hands-on with the switches and asked him to reboot all four of them, simultaneously, now. Someone, Matt Korn, I think, asked the question, likely for Steve's benefit, "Isn't that going to take us offline?"  I answered yes, we were going to knock everyone offline by doing this, but the system should be able to recover quickly and have us in good shape for our maximum load.  I spoke with more hope than confidence, but I think I saw a slight nod from Matt.

The commands were entered and the ATM switches blinked off and then on as they began their startup process.  As the power indicators flickered off, our user count plummeted to zero as a massive outage hit the AOL system.  At my command, with my management chain looking on, in the early evening, I had taken America Online offline, perhaps intentionally starting another back-breaking outage.

We quietly watched the messages scroll past as the switches fired back up and communication started flowing.  Our routers picked up next; seeing the cross-site links available, they rapidly reestablished communication between Reston and Vienna.  Some components of the service resumed quickly, on autopilot once data started flowing; other components required more attention, but in all cases recovery started happening.  The network was once again passing bits as it was supposed to.

As I looked up from our screens, with cold sweat clinging to the back of my neck, I saw Steve and Matt quietly walking out of the NOC.  Disaster had been averted.

Why Did I Make That Decision?

The experts at our vendor were advising us down a troubleshooting path that held no promise of a quick fix. It could let them find and fix the problem, benefiting all of their customers at the expense of ours.  This option held little appeal in my mind.  My objective was restoring our service; to heck with their other customers.

The problem arose after an unusually long period of system stability, with our devices running for literally a couple of days without a reboot, a painfully short period, but still longer than we were used to.  This suggested a memory leak as a possible cause, the type of thing that is fine for a period of time and then causes services to fall over.
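As a generic illustration of that failure mode (hypothetical, and not a model of the actual switch firmware, which I never saw), a leak like the sketch below behaves perfectly on every individual operation; only the accumulated footprint eventually brings the device down.

```python
# Generic illustration of a slow memory leak: per-event bookkeeping that is
# never released. Hypothetical, not the ATM switches' actual firmware.
leaked_state = []

def process(event):
    # Stand-in for the normal forwarding work; always succeeds.
    return event

def handle_event(event):
    leaked_state.append(event)   # state that should be freed but never is
    return process(event)        # every individual call still works fine

# Each call looks healthy; only days of accumulation exhaust memory and
# bring the whole device down at once.
```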

The system problems were far-reaching but intermittent, suggesting something that could crash or block quickly and clear just as quickly on a selective basis. The ATM switches were the only single component I could think of that could cause this type of pain on such a wide scale.

Rebooting all four simultaneously would cause a 100% outage to AOL, a choice with an obvious downside versus rebooting each of the two pairs in sequence, which I knew was what we were supposed to do in this situation. My concern was that if something was wrong in the ATM world, the two switches kept up while the other two rebooted could share the problem with the newly started switches, causing a short recovery followed by a re-entry into failure, something we had experienced many times on Black Wednesday, the 19 hour outage.

Bouncing all the switches gave us a guaranteed, if temporary, fix for a potential memory leak, with a confident recovery path afterwards, assuming a leak was the cause.  Fortunately, our diagnosis was correct: bouncing the switches took America offline, but only for a few minutes, allowing a recovery fast enough to keep us out of the late night news across the country.

What Did I Learn?

This was my biggest crisis management moment at AOL. I ended up very satisfied with my ability to gather data and make quick, decisive choices, taking risks appropriate to the possible rewards.  I can't say that I didn't know this before the event, but that had been in theory; after this event, I felt confident making tough decisions quickly.  That quickness is key in a crisis like this one; waiting for more data would just have dug our hole deeper.

As a network novice, I heard advice from people with years of experience.  Remembering the vested interests of my advisors was key, as I ended up ignoring the most knowledgeable and depending on those whose biases were most closely aligned with my need to solve the problem NOW!


Postscript

The next morning, my team came to me with a plan to remove the ATM switches. They were out of the network at our next maintenance window.  Everyone on the network team was keenly aware of how close we had come while dodging that bullet.  Evicting those switches couldn't happen fast enough to make us happy.

In hindsight, this was a decisive moment in my career.  Never again did management above my immediate manager ask me to justify a choice related to the network.  I had a massive amount to learn, but I now had the confidence to do it.
