Google offers tips on reducing web system latency

In the latest ACM magazine, Google fellows offer a few secrets to keeping systems responding to users as quickly as possible

Running the world's most popular website, Google's engineers know a thing or two about keeping a site responsive under very high demand. In the latest issue of the ACM (Association for Computing Machinery) monthly magazine, Google reveals a few secrets to maintaining speedy operations on large-scale systems.

Systems as large as Google's can suffer from even a few sluggish individual nodes, write the article's authors, Jeffrey Dean, a Google fellow in the company's systems infrastructure group, and Luiz André Barroso, a Google fellow who is technical lead of Google's core computing infrastructure. The good news is that while slow nodes can never be eliminated entirely, a system can be designed to still offer speedy service to the user, the authors wrote.

"It's an important topic. When you have a [user] request that needs to gather information from many machines, inherently some of the machines will be slow," said Ion Stoica, an ACM reviewer who is a computer science professor at the University of California, Berkeley, as well as a co-founder of video stream optimization software provider Conviva.

"As [Internet services] try to reduce the response times more and more, the problem will become more difficult because [the systems] will have less time to decide what to do when something goes wrong. So it will be an area of research and development that will get attention over the next few years," he said.

Looking at performance variability is particularly important with large distributed systems such as Google's, because performance troubles on even a single node can result in delays that affect many users. "Variability in the latency distribution of individual components is magnified at the service level," the authors wrote.

For instance, consider a server that typically responds to a request within 10 milliseconds but takes an entire second to fulfill a request every 100th time. In a single server environment, this means that only every 100th user would get a slow response. But if each user request is handled by 100 servers -- each with the same latency characteristics -- then 63 out of every 100 users would get a slow response, the authors calculated.
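The arithmetic behind that figure can be checked with a short sketch. It assumes, as the example implies, that each server is slow independently of the others and that a user's request must wait for all of the servers it touches; the function name `fraction_slow` is illustrative and not from the article.

```python
def fraction_slow(p_slow: float, fanout: int) -> float:
    """Chance that at least one of `fanout` independent servers is slow.

    Assumes the user's response is only as fast as the slowest server.
    """
    return 1.0 - (1.0 - p_slow) ** fanout


# One request in 100 is slow on any single server.
single = fraction_slow(0.01, 1)
fanned = fraction_slow(0.01, 100)

print(f"1 server:    {single:.1%}")   # 1.0%
print(f"100 servers: {fanned:.1%}")   # 63.4%
```

The chance that all 100 servers answer quickly is 0.99 multiplied by itself 100 times, or about 0.37, which leaves roughly 63 percent of requests waiting on at least one slow server.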

Performance variability can take place for a number of reasons, the authors note. Sharing resources, such as running multiple applications on a single server, can affect the response time of each application. The length of a component's work queue may also be a factor, as would routine maintenance jobs that can take up resources.

The Google engineers offered a number of techniques for mitigating slow performance from individual nodes, such as breaking jobs into smaller components and better managing routine maintenance tasks.

But they admitted that only so much can be done to reduce individual component latency. So the heart of the article focuses on describing some techniques that minimize the effects of such variability. In much the same way that reliable fault-tolerant systems can be made from somewhat less reliable components, so too can a consistently performing cloud system be made using somewhat less consistently performing end-nodes.

One technique they describe is "hedged requests," in which duplicate requests are sent to multiple servers, and the first response that is returned is the one that is used. Another technique is to set up micro-partitions, or multiple partitions on each machine, which allows the company to do more fine-tuned load-balancing. A third technique involves putting into practice "latency-induced probation," in which slow servers are quickly spotted and not assigned any additional work.
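As a rough illustration of the hedged-request idea, the sketch below sends the same request to every replica at once and keeps whichever answer arrives first. It is a simplified stand-in and not Google's implementation: `query_replica`, `hedged_request`, the replica names and the simulated delays are all hypothetical, and a production system would likely limit how many duplicates it sends.

```python
import asyncio
from typing import Mapping


async def query_replica(name: str, delay: float) -> str:
    """Hypothetical replica call; `delay` simulates its response time."""
    await asyncio.sleep(delay)
    return name


async def hedged_request(delays: Mapping[str, float]) -> str:
    """Send the same request to every replica and use the first reply."""
    tasks = [
        asyncio.create_task(query_replica(name, delay))
        for name, delay in delays.items()
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the slower duplicates are no longer needed
    # Wait for the cancellations to finish so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()


# replica-a is having a slow moment; replica-b answers quickly.
winner = asyncio.run(hedged_request({"replica-a": 0.2, "replica-b": 0.01}))
print(winner)  # replica-b
```

The user sees the fast replica's latency even though one server was slow, at the cost of extra load from the duplicate request.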

These techniques should "allow designers to continue to optimize for the common case while providing resilience against uncommon cases," the Google engineers wrote.

Stoica noted that many of the techniques that Google described would be applicable to smaller IT operations as well, though "the effect would not be as pronounced as in a large deployment [such] as Google's."

Joab Jackson

IDG News Service