A Tale of Two Outages

Posted June 11th, 2008 by Joe Pendry

Two downtime incidents crossed our paths recently that we thought deserved comment. You have probably read about the first (Amazon), but may have missed the second (Southern Company Nuclear Power Plant).

Amazon

Let’s look at the Amazon downtime first. I haven’t seen a confirmed cause, but the WSJ business technology blog sees the top two contenders for a likely explanation as either a physical mishap (fire or explosion) at a data center or a faulty software upgrade. The Hyperic blog focuses on complexity, an issue we have touched on in our previous posts, as a leading potential cause:

Javier and others have touched on a likely cause of the outage: Complexity. As systems get more moving parts, they become harder to monitor and maintain. Many hope that the move to cloud computing will make things better; as you use infrastructure in the cloud, the thinking goes, you’ll be able to rely on the cloud service provider to keep it running.

We couldn’t agree more. The interplay between the need for uptime, the need to make dynamic changes and the reality of complexity can cause real problems. If you have a complex environment AND you make lots of changes AND you need 100% availability (all Amazon requirements)…well, let’s just say you probably aren’t getting a lot of sleep at night.

And there is one additional point to consider. While cloud computing moves responsibility for infrastructure management to the service provider, it doesn’t necessarily reduce overall complexity of the system process. In fact, it may increase total processing complexity as the process must navigate through the cloud services.

In any event, according to the business technology blog, the two-hour Amazon outage resulted in a 4.59% stock drop, lost sales, and damage to its reputation. What is interesting in this case is how noticeable an Amazon downtime incident is to all of us. A long time ago, downtime was a cost of doing business online. Now it is front-page news and can impact the stock price.

Southern Company Nuclear Power Plant

Now let’s have a look at the nuclear power plant outage. According to the Washington Post, a nuclear power plant in Georgia was forced into an emergency shutdown in March for 48 hours after a software update was installed on a single computer. The details are less clear with regard to impact – thankfully no damage to the plant occurred. But, once again, the work of the complexity monster can be spotted. According to the Post story:

Southern Company spokeswoman Carrie Phillips explained that company technicians were aware that there was full two-way communication between certain computers on the plant’s corporate and control networks. But she said the engineer who installed the update was not aware that the software was designed to synchronize data between machines on both networks, or that a reboot in the business system computer would force a similar reset in the control system machine.

“We were investigating cyber vulnerabilities and discovered that the systems were communicating, we just had not implemented corrective action prior to the automatic [shutdown],” Phillips said. She said plant engineers have since physically removed all network connections between the affected servers.

And these are nuclear engineers…they are known for understanding complexity.

Even though this incident was less widely reported, you can bet that Southern Company has been dealing with significant regulatory fallout from this event. And that translates into cost (although different from the reputation or revenue considerations for Amazon).

In that light, a look at the costs of downtime is important. As Jonah Paransky has discussed in a previous post, it helps to look at the following four factors as a guide:

  • IT staff hours lost
  • Revenue lost
  • Productivity lost
  • Quantifying reputation damage
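To make the four factors concrete, here is a minimal sketch of how they might be added up for a single incident. Everything in it is an assumption for illustration: the class and function names are hypothetical, the figures in the usage example are invented, and reputation damage is treated as a dollar estimate you supply yourself, since it cannot be derived from the outage duration alone.

```python
from dataclasses import dataclass


@dataclass
class DowntimeIncident:
    """Inputs for a rough downtime cost estimate (all fields hypothetical)."""
    hours_down: float
    it_staff_count: int            # IT staff pulled onto the incident
    it_hourly_rate: float          # loaded cost per IT staff hour
    revenue_per_hour: float        # revenue normally earned per hour
    affected_employees: int        # employees whose work was disrupted
    employee_hourly_rate: float    # loaded cost per employee hour
    productivity_loss: float       # fraction of work lost, 0.0 to 1.0
    reputation_estimate: float = 0.0  # supplied separately, not computed


def estimate_downtime_cost(incident: DowntimeIncident) -> dict:
    """Return a per-factor breakdown plus a total, in the same currency unit."""
    if not 0.0 <= incident.productivity_loss <= 1.0:
        raise ValueError("productivity_loss must be between 0.0 and 1.0")

    it_staff = (incident.hours_down * incident.it_staff_count
                * incident.it_hourly_rate)
    revenue = incident.hours_down * incident.revenue_per_hour
    productivity = (incident.hours_down * incident.affected_employees
                    * incident.employee_hourly_rate
                    * incident.productivity_loss)
    reputation = incident.reputation_estimate

    return {
        "it_staff": it_staff,
        "revenue": revenue,
        "productivity": productivity,
        "reputation": reputation,
        "total": it_staff + revenue + productivity + reputation,
    }


# Usage with made-up numbers for a two-hour outage.
example = DowntimeIncident(
    hours_down=2.0,
    it_staff_count=5,
    it_hourly_rate=80.0,
    revenue_per_hour=10_000.0,
    affected_employees=50,
    employee_hourly_rate=40.0,
    productivity_loss=0.5,
    reputation_estimate=5_000.0,
)
breakdown = estimate_downtime_cost(example)
print(breakdown)
```

Keeping the per-factor breakdown, rather than returning only a total, makes it easier to see which factor dominates for a given business.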

Conclusions

The Amazon and nuclear plant incidents are particularly interesting because they illustrate how different companies in different industries have unique factors that must be calculated when determining the impact of downtime.

For Amazon, the very nature of their business means that downtime directly affects their customers’ ability to move product online. For Southern Company, regulatory costs must be included when calculating the cost of downtime.


Filed Under: Cloud Computing, Cost of Downtime Case Studies, Downtime, IT Operations



4 Responses to “A Tale of Two Outages”

  1. Dan Skwire Says:

    Well said. The ability to rapidly solve a problem when it first occurs has reached significant importance. Too bad the skill and technologies required to solve problems have regressed from the excellence established in mainframe systems.

    There is hope, but it will take significant effort. Sometimes do-overs and other crude problem resolution approaches are insufficient for the needs of some data centers.

  2. links for 2008-06-12 — dougmcclure.net Says:

    [...] A Tale of Two Outages | IT’s About Uptime - The StackSafe Blog cloud computing moves resp for infra mgmnt to the service provider, it doesn’t necessarily reduce overall complexity of the system process. In fact, it may increase total processing complexity as the process must navigate through the cloud services. (tags: outages impact bsm business-service-management itsm changemanagement releasemgmt cloudcomputing amazon) [...]

  3. Links List 6.13.08 | IT's About Uptime - The StackSafe Blog Says:

    [...] light of Amazon’s latest downtime issues, Gigaom explains why Amazon went down and why it matters. In a thorough explanation, Gigaom bets [...]

  4. IT’s About Uptime - The StackSafe Blog » Blog Archive » Case Study on the Cost of Downtime: Gmail Outage Says:

    [...] Amazon and Southern Company Nuclear Power Plant [...]
