Amazon's data center outage reads like a thriller

The outage shows why performance monitoring services are gaining ground

When an Amazon Web Services data center lost power early Wednesday, the company wrote about the unfolding event with the brevity and tension of one of its bestselling potboilers.

Our anonymous author, whom we'll call Sysadmin, begins his story simply, without emotional complications or love interests.

"We are investigating connectivity issues for instances in the US-EAST-1 region," Sysadmin writes on Amazon's operations status board at 1:08 a.m. PT.

With one sentence, we're intrigued. Something's up with Amazon's data center in Northern Virginia, just a short drive from Washington: Tom Clancy country.

You can almost feel what's going on. Cloud-based services are crashing and there's a scramble for answers. Elsewhere, PC screens are refreshed as readers wait for an update from Sysadmin (Kindle edition not yet available). Some 18 minutes pass. Tension builds.

Sysadmin offers an update, referring to isolated "power issues."

Inside the data center a real, red-light-flashing drama unfolds.

At first, a "single component of the redundant power distribution system failed in this zone," Sysadmin would later write in a postscript for his audience. But while the data center staff worked on that component, there was a twist: "A second component, used to assure redundant power paths, failed as well."

Customers are losing connectivity.

Whether data center staff cheered when the problem was fixed remains a mystery. But as soon as the "defective power distribution units were bypassed, servers restarted and instances began to come online shortly thereafter," wrote Sysadmin.

Readers wouldn't get those details until later, when Sysadmin had more information and time. In those early minutes of the outage, only essential information gets to anxious readers. At 1:51 a.m., Sysadmin wrote: "The underlying power issue has been addressed. Instances have begun to recover."

At 2:11 a.m., he writes again: a recovery is well under way.

All that's left are the reviews. That's where companies like Wellesley Mills, Mass.-based Apparent Networks Inc. come in.

In November, Apparent Networks launched its Cloud Performance Center, an online service that allows anyone to review -- in real time -- the performance of 16 cloud providers, including Amazon and Google. It covers such things as bandwidth capacity, latency and data loss, then scores them overall.

Jim Melvin, president of the privately held Apparent Networks, said his firm can continuously monitor network performance over WANs using technology it has extended to the cloud. The monitoring is done with a "very lightweight stream of packets" that continuously travels the network to monitor activity and cloud performance.

With the available free version of its PathView Cloud tool, users can detect performance issues with the network or cloud provider, and see whether service level performance agreements are being met, Melvin said.
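To make the idea concrete, here is a minimal sketch of how a stream of probe results might be boiled down to loss and latency figures and compared with a service-level target. It is not Apparent Networks' method or the PathView Cloud API; the names `ProbeSummary`, `summarize_probes` and `meets_sla`, and the thresholds used, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class ProbeSummary:
    sent: int                        # number of probes sent
    lost: int                        # probes that never got a reply
    loss_pct: float                  # lost / sent, as a percentage
    avg_latency_ms: Optional[float]  # None if every probe was lost


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> ProbeSummary:
    """Summarize round-trip times; None marks a probe with no reply."""
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]
    lost = sent - len(replies)
    loss_pct = 100.0 * lost / sent if sent else 0.0
    avg = mean(replies) if replies else None
    return ProbeSummary(sent, lost, loss_pct, avg)


def meets_sla(summary: ProbeSummary,
              max_loss_pct: float,
              max_latency_ms: float) -> bool:
    """Hypothetical SLA check: both loss and average latency within limits."""
    if summary.avg_latency_ms is None:
        return False  # nothing came back, so the target cannot be met
    return (summary.loss_pct <= max_loss_pct
            and summary.avg_latency_ms <= max_latency_ms)
```

A run such as `summarize_probes([20.0, 30.0, None, 40.0])` reports one lost probe out of four, which would fail a hypothetical 1% loss target even though the average latency of the replies looks healthy.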

Apparent Networks, which has begun issuing performance advisories on cloud providers, offered up just such an advisory on Amazon, detailing when connectivity was lost and restored from various locations in the U.S. It characterized the nearly 45-minute interruption in services as "severe." (Amazon officials weren't available for additional comment.)

There are now a number of performance monitoring firms that continuously examine cloud providers and Web sites. Not surprisingly, the use of, and interest in, such services is growing.

Even in the current recession, one performance measurement company, Keynote Systems in San Mateo, Calif., reported $80.1 million in revenue in the quarter ending Sept. 30, up 4% from a year ago. And with an eye on bolstering its performance monitoring abilities, management software vendor Compuware Corp. last month acquired Gomez Inc., a Lexington, Mass.-based monitoring service, for $295 million.

According to Mike Gualtieri, an analyst at Forrester Research Inc., such companies provide a service that periodically runs scripts against the functionality of Web sites. An online retailer, for instance, might have Gomez run a product search and shopping-cart script every three minutes to make sure the retailer's site is up and performing at an acceptable level. And that monitoring is done from different locations around the world.

That way, it's possible to find out whether performance hiccups are local or global.
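The local-versus-global distinction can be sketched in a few lines. This is an illustrative example only, not how Gomez or any other vendor implements it; the function name `classify_outage` and the location labels are hypothetical, and the input is assumed to be the pass/fail result of the same scripted check run from each monitoring location.

```python
from typing import Dict


def classify_outage(results: Dict[str, bool]) -> str:
    """Classify one round of checks.

    `results` maps a monitoring location to True if the scripted
    check passed there and False if it failed.
    """
    if not results:
        raise ValueError("need at least one location result")
    failed = [loc for loc, ok in results.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) == len(results):
        return "global"  # every vantage point sees the failure
    return "local"       # only some vantage points are affected


# Example round: the check fails from one location only.
round_results = {"virginia": False, "london": True, "sydney": True}
status = classify_outage(round_results)
```

With the example round above, only one of three locations fails, so the round is classified as local; if all three failed, it would be classified as global.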

Compuware's acquisition of Gomez "is very interesting because now they can provide end-to-end monitoring, from the user in a browser all the way back into the applications behind the scenes," said Gualtieri.

"Continuous monitoring of cloud services is a prerequisite to any firm considering deploying mission-critical Web applications in the cloud," said Gualtieri.

And it means that companies don't have to rely exclusively on first-person Sysadmin accounts of trouble to find out how the story ends.

Patrick Thibodeau

Computerworld (US)