Amazon cloud outage was triggered by configuration error

Company's postmortem and apology win praise for transparency

Amazon has released a detailed postmortem and mea culpa about the partial outage of its cloud services platform last week and identified the culprit: A configuration error made during a network upgrade.

During this configuration change, a traffic shift "was executed incorrectly," Amazon said, noting that traffic that should have gone to a primary network was routed to a lower capacity one instead. The error occurred at 12:47 p.m. on April 21 and led to a partial outage that lingered through last weekend.

The outage sent a number of prominent Web sites offline, including Quora, Foursquare and Reddit, and renewed an industry-wide debate over the maturity of cloud services.

Amazon posted updates, short and bulletin-like, throughout the outage, but what it offered in its postmortem is entirely different. This nearly 5,700-word document includes a detailed look at what happened, an apology, a credit to affected customers, as well as a commitment to improve its customer communications.

Amazon didn't say explicitly whether human error touched off the event, but it hinted at that possibility when it wrote that "we will audit our change process and increase the automation to prevent this mistake from happening in the future."

The initial mistake, followed by the subsequent increase in network load, exposed a cascading series of issues, including a "re-mirroring storm" in which systems continuously searched for storage space.

Amazon also said in its explanation of the outage that it will work to ensure that it builds software and services that can survive failures.

Matt Stevens, the CTO of AppNeta, a cloud network performance management company and an Amazon cloud user, praised Amazon's postmortem for its transparency. "As a technical architect, I thought it was actually amazing how deep they went into it," said Stevens, adding that he wished the company had offered more detail about the initial network change that started the problem.

In terms of the overall issue, Stevens said: "How does anybody who runs their own private data center know how it's going to hold up until you have a massive issue?"

Jim Damoulakis, CTO of GlassHouse Technologies, an enterprise storage services provider, called it "a pretty thorough postmortem and I think for the most part they are being transparent about it."

Damoulakis said that while Amazon will take steps to keep the problem from happening again -- and to make its availability zones more robust -- customers will ultimately be responsible for having a good disaster recovery plan.

"I think there is blame on both sides," said Justin Alexander, who heads strategic research and development at Hyland Software, an enterprise content management software firm, referring to both Amazon and its customers.

"Clearly, Amazon needs to take accountability for their services. But at the same time there were a variety of customers who were using the EC2 platform that did not suffer any period of unavailability," said Alexander, citing their disaster recovery plans.

Patrick Thibodeau covers SaaS and enterprise applications, outsourcing, government IT policies, data centers and IT workforce issues for Computerworld. Follow Patrick on Twitter at @DCgov or subscribe to Patrick's RSS feed. His e-mail address is pthibodeau@computerworld.com.

Patrick Thibodeau

Computerworld (US)