VMware causes second outage while recovering from first

VMware's new Cloud Foundry service, which is still in beta, suffered downtime over the course of two days last week

VMware's attempt to recover from an outage in its brand-new cloud computing service inadvertently caused a second outage the next day, the company said.

VMware's new Cloud Foundry service -- which is still in beta -- suffered downtime over the course of two days last week, not long after the more highly publicized outage that hit Amazon's Elastic Compute Cloud.

Cloud Foundry, a platform-as-a-service offering for developers to build and host Web applications, was announced April 12 and suffered "service interruptions" on April 25 and April 26.

The first downtime incident was caused by an outage in the power supply for a storage cabinet. Applications remained online, but developers weren't able to perform basic tasks such as logging in or creating new applications. The outage lasted nearly 10 hours and was fixed by the afternoon.

But the next day, VMware officials accidentally caused a second outage while developing an early detection plan to prevent the kind of problem that hit the service the previous day.

VMware official Dekel Tankel explained that the April 25 power outage is "something that can and will happen from time to time," and that VMware has to ensure that its software, monitoring systems and operational practices are robust enough to prevent power outages from taking customer systems offline.

With that in mind, VMware began developing "a full operational playbook for early detection, prevention and restoration" the very next day.

"At 8am [April 26] this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon," Tankel wrote. "This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

The second-day outage was the more serious of the two.

"This was our first total outage, which is an event where we need to put up a maintenance page," Tankel continued. "During this outage, all applications and system components continued to run. However, with the front-end network down, we were the only ones that knew that the system was up. By 11:30 a.m. PDT the front end network was fully operational."

VMware's second-day problem illustrated the element of human error in cloud networks, just as the root-cause analysis of Amazon's cloud outage did. In the case of Amazon, a mistake made during a system upgrade led to trouble that took several days to fully correct. (See also: "Amazon: Bad execution during planned upgrade caused outage")

VMware, which is best known for its server virtualization technology, is new to offering a publicly available cloud service. Previously, VMware sold technology to help customers and service providers build their own clouds.

Because Cloud Foundry is so new, the customer impact was not as severe as that of the Amazon outage, which forced numerous websites that rely on Amazon infrastructure offline. But VMware is getting a taste of what it's like to be a service provider when things go wrong.


Jon Brodkin

Network World