Case Study on the Cost of Downtime: Amazon S3 Outage
Posted July 31st, 2008 by Jonah Paransky
What Happened
On July 20th, Amazon’s S3 service suffered a wide-scale outage. Because S3 is a primary cloud-based infrastructure for many services, the outage disrupted a wide variety of websites, users and providers. It was heavily publicized in the mainstream media and in the blogosphere. Unlike during the February S3 outage, Amazon provided significant detail about the status of S3 service availability.
Reactions
In general, the reaction to the outage coalesced around a common theme: that cloud computing is not yet ready for primetime. Prominent coverage from this perspective included:
- Read Write Web asking the question, “When is downtime too much?”
- GigaOm taking the position that the “S3 Outage Highlights Fragility of Web Services.”
- ZDNet’s Between the Lines blog asking, “Amazon’s S3 outage: Is the cloud too complicated?”
- Sean Percival chalking it up to the growing pains of the prototype-like landscape we all live in.
Alternative voices were also heard. Michael Krigsman at IT Project Failures took the position that:
Customers hate outages, but accurate and responsive status reporting does help ease the pain. Kudos to Amazon for learning from past mistakes.
Web Worker Daily took a nuanced view, pointing out that if you require a high level of uptime, you may need a backup to S3. Mike also points out, however, that S3’s rates are hard to beat and that the service will continue to attract many providers on the Internet.
Our Perspective
Communication After a Downtime Incident:
Amazon deserves significant credit for the communication approach it took during and after the outage. After its downtime incident back in February, Amazon began providing detailed, transparent information about service delivery. It also published a thorough public post-mortem, rightly praised by Michael Krigsman and others as a sign of real maturity. This was a good example of applying the Seven Key Lessons to Keep in Mind When Communicating an IT Failure.
Cost of Downtime:
As with any outage of widely shared infrastructure, the cost was significant, both to Amazon and to the many vendors who depend on the service. SLA payouts are likely due, and organizations concerned about downtime are probably looking at backup options to reduce their dependence on S3.
Lessons Learned:
The Amazon S3 outage provides a number of good lessons for IT operations professionals.
- Outsourcing critical infrastructure comes with risk. Make sure to architect for availability, and don’t assume that your provider will achieve 100% uptime.
- Good communication buys significant goodwill from the user population. Be ready with a communication plan for when failures happen.
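The first lesson, architecting for availability instead of assuming 100% uptime, can be illustrated with a small sketch of client-side failover. It assumes the application keeps a copy of its data with a second provider; the backend fetch functions and the `StoreUnavailable` exception are hypothetical placeholders, not part of any Amazon or S3 API. Reads try each backend in order and fall back when one is unreachable:

```python
from typing import Callable, Sequence


class StoreUnavailable(Exception):
    """Hypothetical error a storage backend raises when it cannot serve a request."""


def fetch_with_fallback(key: str, backends: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each backend in order and return the first successful result.

    `backends` is an assumed list of fetch functions, e.g. a primary
    S3-backed fetcher followed by a copy hosted with another provider.
    """
    errors = []
    for fetch in backends:
        try:
            return fetch(key)
        except StoreUnavailable as exc:
            # Record the failure and move on to the next backend.
            errors.append(exc)
    # Every backend failed: surface all the collected errors to the caller.
    raise StoreUnavailable(f"all {len(backends)} backends failed for {key!r}: {errors}")


# Usage: simulate an outage of the primary store.
def primary(key: str) -> bytes:
    raise StoreUnavailable("primary is down")


backup_copy = {"logo.png": b"image-bytes"}


def backup(key: str) -> bytes:
    return backup_copy[key]


print(fetch_with_fallback("logo.png", [primary, backup]))
```

Keeping the backends as plain callables keeps the failover logic independent of any particular storage SDK; in a real deployment each function would wrap whichever client library the provider offers.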
Filed Under: Business Continuity, Cloud Computing, Downtime