Azure updates and human error caused Visual Studio Online outage

Microsoft's account of the latest problems for its developer cloud service reveals excellent transparency - and room for improvement

The second of two lengthy outages that hit Visual Studio Online in November was caused by the same kind of issues as the first, Microsoft has just disclosed.

Visual Studio Online, which allows developers to plan and track software development projects and share code, was unavailable for just over seven and a half hours on Monday of last week. That outage followed another one the previous week.

In both cases, updates to Azure were to blame, and procedures that hadn't been followed delayed the process of finding a fix.

That's according to Brian Harry, a Microsoft Technical Fellow who this week explained what went wrong in a candid, detailed blog post and in an interview.

His postmortem account is a model of transparency for cloud services, with details of what went wrong and why last week's outage was "probably 3-4 hours longer than it had to be due to inefficiencies." The detail is key to running cloud services, according to Harry, who works as the Product Unit Manager for Team Foundation Server.

"With service outages, in a modern devops world it's all about root cause analysis. It's important we get the service back up, but root cause comes first, because just getting the service back up with the threat it will happen again is not a victory," he said on Thursday.

It may also become a model for other Microsoft services, including Azure, where some customers were disappointed with communications during the last outage. "When we have a major issue with a high-profile service everybody cares; you get all kinds of people involved in trying to help communicate. The end result was less transparent and less empathetic communications than I think we would want. As a result you're going to see changes in the way we communicate about Azure outages. You really have to have someone willing to stand out there and say 'I own this and this is what I'm doing about it,'" he said.

What happened

In the latest outage case, an update to the SQL Azure cloud database service included a new feature designed to automatically find and repair databases with unusually high numbers of errors. Using the gradual rollout system that Microsoft calls "flighting," this was loaded in one SQL Azure region, where it caused problems for a Visual Studio Online procedure that generates a lot of duplicate record errors -- but is designed to ignore them. Trying to handle the 170,000 exceptions per minute being generated -- and successfully ignored -- took so many resources that it made a key database lock up in a few hours.

Some things in Microsoft's procedure for handling cloud problems went the way they were supposed to. Monitoring systems spotted the problem late Sunday night, over an hour before customers in Europe started tweeting about not being able to access their accounts. The update was being rolled out one region at a time, unlike the previous Azure update mistakenly deployed in multiple locations. Once the Visual Studio Online problem was identified, the update was stopped before it deployed in the next region, and Harry believes no other Azure customers were affected. The new feature also came with the option to turn it off.
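The region-by-region rollout described above can be illustrated with a minimal sketch. This is not Microsoft's flighting system; the function, the region names and the health check are all hypothetical, and the sketch only shows the general idea of halting a staged deployment as soon as one region reports trouble.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    regions: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to one region at a time; stop at the first unhealthy region.

    Returns the regions that received the update, including the one
    where the problem was detected.
    """
    updated: List[str] = []
    for region in regions:
        deploy(region)
        updated.append(region)
        if not is_healthy(region):
            # Halt here so later regions never receive the update.
            break
    return updated


# Hypothetical region names and health results, for illustration only.
health = {"region-a": False, "region-b": True, "region-c": True}
deployed = staged_rollout(health, deploy=lambda r: None, is_healthy=health.get)
print(deployed)  # ['region-a']
```

In this sketch the first region fails its health check, so the other two never receive the update, which mirrors the article's account of the update being stopped before it reached the next region.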

But it took more than three hours to identify that the problem was caused by the SQL Azure change, because the Azure operations team didn't have documentation covering the new feature or even showing it was part of the update. Even when they knew the update was the problem and the feature could be turned off, it took another 45 minutes to work out how, and then another hour to find and fix an unrelated bug in the roaming feature that lets developers have their Visual Studio settings automatically synced to another machine.

Coping with the always-changing cloud

The details about the outage are a window into how Microsoft is building cloud services.

"In any large-scale system there will be failure, and you have got to be resilient to those failures," Harry said. "There are things we are doing to become more resilient, but the level of investment to do this well is quite high, and it takes time."

All but one of the main Visual Studio Online systems have been converted into smaller redundant services; the last Shared Platform Service was affected here. A "circuit breaker" system, for turning off specific features or groups of users to keep the rest of the system available, won't cover all features for another two to three months and isn't yet mature enough to trip automatically.
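As a rough illustration of the "circuit breaker" idea, the sketch below shows manually operated switches that disable one feature while the rest of a service keeps running. The class, method and feature names are assumptions made for this example, not the Visual Studio Online implementation, and the breaker here is tripped by hand, as the article says the real one still is.

```python
class FeatureBreakers:
    """Manually operated switches for turning individual features off."""

    def __init__(self, features):
        # Every feature starts enabled (breaker closed).
        self._enabled = {name: True for name in features}

    def trip(self, feature):
        """Disable one feature, e.g. by an operator during an incident."""
        self._enabled[feature] = False

    def reset(self, feature):
        """Re-enable a feature once the underlying problem is fixed."""
        self._enabled[feature] = True

    def call(self, feature, handler, fallback):
        """Run handler if the feature is on, otherwise return the fallback."""
        if self._enabled.get(feature, False):
            return handler()
        return fallback


# Hypothetical feature names, for illustration only.
breakers = FeatureBreakers(["roaming_settings", "work_item_tracking"])
breakers.trip("roaming_settings")
roaming = breakers.call("roaming_settings", lambda: "synced", fallback="skipped")
tracking = breakers.call("work_item_tracking", lambda: "loaded", fallback="skipped")
print(roaming, tracking)  # skipped loaded
```

The design choice worth noting is that a tripped feature returns a harmless fallback instead of raising an error, so callers that depend on other features are unaffected.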

As Azure becomes a critical underpinning for other Microsoft services, there are also questions about coordinating changes, which are staggered across different regions. Harry is keen to see an Azure-wide "canary" region, similar to the fast ring in the Windows 10 technical preview and the Office 365 First Release program. "Imagine if any customer could sign up to have resources in that region, so that not only do we get to test our services all together as we roll out, but our customers who are building on Azure could choose to have some fraction of testing or production in the Azure canary region and get an early peek at changes that are coming."

The role of developers in devops

Most cloud outages are caused either by changes or by combinations of problems outside the service that expose hidden problems. The trick is spotting something going wrong before the situation becomes critical, and responding. "We get many terabytes of telemetry a day," Harry said. "You need tools to search it, mine it and understand what it's telling you."

Getting this right involves developers as well as operations, which helps explain the pattern of layoffs at Microsoft earlier in the year: they were aimed at bringing those teams closer together. "We're on a journey of transferring more responsibility for things you would traditionally call ops to the development team," he said. The engineering team on Visual Studio Online is now in charge of deploying the code they write.

Another change came after an outage where an alert hours in advance indicated a problem, "but it was buried in noisy alerts and it looked like alerts that are traditionally ignored -- so they ignored it." Now developers are responsible for closing alerts, and if an alert is too easily triggered and gets ignored, that includes fixing it to be more useful.
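One way a team might find alerts that are "too easily triggered" is to measure how often each alert rule fires without anyone acting on it. The sketch below is an assumption about how that could be done, not a description of Microsoft's tooling; the rule names, the log format and the thresholds are all hypothetical.

```python
from collections import Counter


def noisy_rules(alert_log, max_ignored_ratio=0.8, min_fired=5):
    """Return alert rules that fire often and are mostly ignored.

    alert_log is an iterable of (rule_name, was_acted_on) pairs.
    """
    fired = Counter()
    ignored = Counter()
    for rule, was_acted_on in alert_log:
        fired[rule] += 1
        if not was_acted_on:
            ignored[rule] += 1
    return sorted(
        rule
        for rule, count in fired.items()
        if count >= min_fired and ignored[rule] / count > max_ignored_ratio
    )


# Hypothetical alert history: one rule is ignored nine times out of ten.
log = (
    [("disk_latency", False)] * 9
    + [("disk_latency", True)]
    + [("db_lockup", True)] * 5
)
flagged = noisy_rules(log)
print(flagged)  # ['disk_latency']
```

Rules flagged this way are candidates for tightening or rewriting, so that a genuine warning is less likely to be buried among alerts that people have learned to ignore.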

This is all part of the way Microsoft is doing devops at scale. It's not just the operations team that gets paged in the middle of the night when things go wrong. Even senior executives take turns carrying pagers overnight for major incidents. As well as the 24-7 monitoring team, Harry has developers around the world who can assess the problem, with an engineer on call for each service available in 15 minutes.

That 15-minute window is the Visual Studio Online team policy. "Each team is finding their way to how they manage this," he said. In this case, reaching someone who understood the SQL Azure change took over an hour.

Making that work comes down not just to rotating who is on call, but to leaders focusing on what went wrong rather than on who was to blame.

"When I say we, I often mean we, Microsoft," Harry explains. "It's not my purpose to point fingers and say that team needs to improve, but to really think as one company and to think about accountability in a slightly bigger way. One of my first rules is, everybody is allowed to make a mistake; nobody is allowed to repeat a mistake."


Mary Branscombe

IDG News Service