Amazon promises to improve redundancy after Dublin outage

Affected users will receive service credits for either 10 or 30 days

Amazon Web Services (AWS) learned a lot of lessons from the outage that affected its Dublin data center, and will now work to improve power redundancy, load balancing and the way it communicates when something goes wrong with its cloud, the company said in a summary of the incident.

The post-mortem delved deeper into what caused the outage, which affected the availability of Amazon's EC2 (Elastic Compute Cloud), EBS (Elastic Block Store), the RDS database and Amazon's network. The service disruption began Aug. 7, at 10:41 a.m., when Amazon's utility provider suffered a transformer failure. At first, a lightning strike was blamed, but the provider now believes that was not the cause and is continuing to investigate, according to Amazon.

Normally, when primary power is lost, the electrical load is seamlessly picked up by backup generators. Programmable Logic Controllers (PLCs) ensure that the electrical phase is synchronized between generators before their power is brought online. But in this case one of the PLCs did not complete its task, likely because of a large ground fault, which led to the failure of some of the generators as well, according to Amazon.

To prevent this from recurring, Amazon will add redundancy and more isolation for its PLCs so they are insulated from other failures, it said.

Amazon's cloud infrastructure is divided into regions and Availability Zones. A region -- for example, the data center in Dublin, which is also called EU West Region -- consists of one or more Availability Zones, which are engineered to be insulated from failures in other zones in the same region. The thinking is that customers can use multiple zones to improve reliability, something Amazon is working on simplifying.

At the time of the disruption, customers who had EC2 instances and EBS volumes independently operating in multiple EU West Region Availability Zones did not experience service interruption, according to Amazon. However, management servers became overloaded as a result of the outage, which had an impact on performance in the whole region.
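The multi-zone idea can be illustrated with a minimal Python sketch. It is not based on Amazon's tooling or API: the zone labels, instance names and both helper functions are hypothetical, and the placement is a simple round-robin chosen only to show why spreading instances across zones leaves some of them running when one zone loses power.

```python
from itertools import cycle

# Hypothetical zone labels for illustration; not taken from Amazon's summary.
ZONES = ["zone-a", "zone-b", "zone-c"]

def place_round_robin(instance_ids, zones):
    """Assign each instance to the next zone in turn so no single zone holds them all."""
    placement = {}
    for instance_id, zone in zip(instance_ids, cycle(zones)):
        placement[instance_id] = zone
    return placement

def survivors(placement, failed_zone):
    """Return the instances (sorted by name) that are not in the failed zone."""
    return sorted(i for i, z in placement.items() if z != failed_zone)

# Four hypothetical web servers spread over three zones.
placement = place_round_robin(["web-1", "web-2", "web-3", "web-4"], ZONES)
print(survivors(placement, "zone-a"))
```

In this sketch, losing one zone takes out only the instances placed there; the rest keep serving traffic, which matches the experience Amazon describes for customers running in multiple zones.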

To prevent this from recurring, Amazon will implement better load balancing, it said. Also, over the last few months, Amazon has been "developing further isolation of EC2 control plane components to eliminate possible latency or failure in one Availability Zone from impacting our ability to process calls to other Availability Zones," it wrote. The work is still ongoing, and will take several months to complete, according to Amazon.

The service that caused Amazon the biggest problem was EBS, which is used to store data for EC2 instances. The service replicates volume data across a set of nodes for durability and availability. Following the outage, the nodes started talking to each other to replicate changes. Amazon has spare capacity to allow for this, but the sheer volume of traffic proved too much this time.

When all nodes related to one volume lost power, Amazon in some cases had to re-create the data by putting together a recovery snapshot. The process of producing these snapshots was time-consuming, because Amazon had to move all of the data to Amazon Simple Storage Service (S3), process it, turn it into the snapshot storage format and then make the data accessible from a user's account.

By 8:25 p.m. PDT on Aug. 10, 98 percent of the recovery snapshots had been delivered, with the remaining few requiring manual attention, Amazon said.

For EBS, Amazon's goal will be to drastically reduce the recovery time after a significant outage. It will, for example, create the capability to recover volumes directly on the EBS servers upon restoration of power, without having to move the data elsewhere.

The availability of the storage service was affected not just by the power outage but also by separate software and human errors, which started when the hardware failure wasn't correctly handled.

As a result, some data blocks were incorrectly marked for deletion. The error was subsequently discovered and the data tagged for further analysis, but human checks in the process failed and the deletion process was executed, according to Amazon. To prevent that from happening again, it is putting in place a new alarm feature that will alert Amazon if any unusual situations are discovered.

How users experience an outage of this magnitude also depends on how well the affected company keeps them up to date.

"Customers are understandably anxious about the timing for recovery and what they should do in the interim," Amazon wrote. While the company did its best to keep users informed, there are several ways it can improve, it acknowledged. For example, it can accelerate the pace at which it increases the staff on the support team to be even more responsive early on, and make it easier for users to tell if their resources have been impacted, Amazon said.

The company is working on tools to do the latter, and hopes to have them ready in the next few months.

Amazon also apologized for the outage, and will give affected users service credits. Users of EC2, EBS and the RDS database will receive a credit that equals 10 days of usage. Also, companies that were affected by the EBS software bug will be awarded a 30-day credit covering their EBS usage.

The credits will be automatically subtracted from the next AWS bill, so users won't have to do anything to receive them.
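For a rough sense of what such credits could amount to, here is a small Python sketch. Amazon's summary, as reported here, does not say how "10 days of usage" is calculated, so the method below (average daily charge over the billing period, multiplied by the number of credit days) is an assumption, and the function name and dollar figures are hypothetical.

```python
def usage_credit(period_charge, days_in_period, credit_days):
    """Credit equal to `credit_days` of average daily usage (assumed method)."""
    if days_in_period <= 0:
        raise ValueError("days_in_period must be positive")
    return round(period_charge / days_in_period * credit_days, 2)

# Hypothetical customer: $300 of EC2 usage and $90 of EBS usage in a 30-day month.
ec2_credit = usage_credit(300.0, 30, 10)     # 10-day credit for affected services
ebs_bug_credit = usage_credit(90.0, 30, 30)  # 30-day credit tied to the EBS software bug
print(ec2_credit, ebs_bug_credit)
```

Under these assumptions, the 10-day credit comes to a third of a 30-day month's charge for the service, while the 30-day EBS credit covers the full month's EBS charge.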

Send news tips and comments to mikael_ricknas@idg.com

Mikael Ricknäs

IDG News Service