MapReduce could be the server's newest friend

Placing MapReduce on the Web server could minimize the need for separate analysis clusters, a Usenix researcher argues

In the future, when an administrator does a server build, he or she may be adding a MapReduce library to the usual stack of server, database and middleware software.

This copy of MapReduce could be used to analyze log data directly on the server, minimizing the need for a separate cluster to analyze log data, and shortening the length of time until results are generated, noted University of California San Diego researcher Dionysios Logothetis at the Usenix Annual Technical Conference, being held this week in Portland, Oregon.

With this approach, "the analysis [moves] from the dedicated clusters to the log servers themselves, to avoid expensive data migration," said Logothetis, who, along with other researchers at UCSD and Salesforce.com, described this technique in a Usenix paper, "In-situ MapReduce for Log Processing."

First introduced by Google, MapReduce is increasingly being used to analyze large amounts of data that can be located across multiple servers, or nodes. It is most commonly used as part of the Hadoop data processing platform.

While most uses of MapReduce take place on dedicated clusters, the researchers argued that a version of the analytical software framework could also be an integral part of Web servers themselves, where they could be used to do early analysis of log data.

Today's commercial Internet sites log detailed information on user visits, information that can be used for targeting ads, security monitoring and debugging.

A single server working for a busy e-commerce site can generate 1MB to 10 MBs worth of log data every second. In the course of the day, this can result in tens of terabytes' worth of data. On average, 1,000 servers each producing 1MB per second can generate 86TB of data per day. As an example, Facebook generates about 100TBs a day, Logothetis said.

Typically, large organizations such as Facebook will collect data from all servers, load it into a Hadoop cluster and analyze the results using MapReduce. MapReduce is a framework and associated library for splitting a job into multiple parts so that it can be run simultaneously across many servers, or nodes.

This "store-first, query-later" approach to log analysis has a number of disadvantages, Logothetis argued. Shipping all the data from the servers consumes a great deal of bandwidth. "This puts a lot of stress on the network," he said.

Logothetis noted Facebook discards about 80 percent of its log data before it is analyzed. With this new technology, that data would not need to be transferred.

The traditional approach also introduces a lag between when the data is produced and when it can be analyzed, which can be a critical shortcoming if the data is time-sensitive in nature.

As an alternative, the researchers proposed placing MapReduce functionality on each server itself, which could analyze the data and ship a much smaller result set back to the centralized data collection point. They dubbed this approach "in-situ MapReduce" (iMR).

"iMR is designed to complement, not replace traditional cluster-based architectures. It is meant for jobs that filter or transform log data either for immediate use or before loading it into a distributed storage system ... for follow-on analysis," the researchers' paper notes.

As a program, iMR would replicate all of the MapReduce APIs (application programming interfaces), and it would carry out the similar functionality of MapReduce, namely filtering data and aggregating the results. It would differ in that it could continually produce analysis, based on the latest data.

The researchers built a prototype of iMR, using the Mortar distributed stream processing software. With iMR, the user would specify the range of data being processed in a given analysis, such as all the data gathered in the past 60 seconds. The user would also specify how often results would be produced and transmitted, such as every 15 seconds.

Logothetis admitted that the Web servers may spend most of their resources doing their intended jobs, namely delivering services to users. But iMR could use the spare cycles left over to crunch the log data.

In their paper, the researchers devised a plan where the organization could establish a trade-off between the speed at which they get the analyzed data with the completeness of the resulting analysis. If an organization wants the analysis quickly, then each server may drop individual log entries during times of heavy usage, leading to less conclusive, but still relevant, results. Whereas a thorough analysis of the data would require a longer period of time, and more server resources to complete.

While in-situ MapReduce may not be beneficial for organizations that run only a handful of servers, it could be valuable to larger operations, such as search engines, social networks and e-commerce sites, Logothetis said.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com

Join the Good Gear Guide newsletter!

Error: Please check your email address.

Tags diagnosticsUSENIXapplicationsUtilitiesSalesforce.comOnline analytical processingsoftwaredata miningsystem management

Our Back to Business guide highlights the best products for you to boost your productivity at home, on the road, at the office, or in the classroom.

Keep up with the latest tech news, reviews and previews by subscribing to the Good Gear Guide newsletter.

Joab Jackson

IDG News Service
Show Comments

Essentials

Microsoft L5V-00027 Sculpt Ergonomic Keyboard Desktop

Learn more >

Lexar® JumpDrive® S57 USB 3.0 flash drive

Learn more >

Mobile

Lexar® JumpDrive® S45 USB 3.0 flash drive 

Learn more >

Exec

HD Pan/Tilt Wi-Fi Camera with Night Vision NC450

Learn more >

Lexar® JumpDrive® C20c USB Type-C flash drive 

Learn more >

Audio-Technica ATH-ANC70 Noise Cancelling Headphones

Learn more >

Lexar® Professional 1800x microSDHC™/microSDXC™ UHS-II cards 

Learn more >

Budget

Back To Business Guide

Click for more ›

Most Popular Reviews

Latest News Articles

Resources

PCW Evaluation Team

Azadeh Williams

HP OfficeJet Pro 8730

A smarter way to print for busy small business owners, combining speedy printing with scanning and copying, making it easier to produce high quality documents and images at a touch of a button.

Andrew Grant

HP OfficeJet Pro 8730

I've had a multifunction printer in the office going on 10 years now. It was a neat bit of kit back in the day -- print, copy, scan, fax -- when printing over WiFi felt a bit like magic. It’s seen better days though and an upgrade’s well overdue. This HP OfficeJet Pro 8730 looks like it ticks all the same boxes: print, copy, scan, and fax. (Really? Does anyone fax anything any more? I guess it's good to know the facility’s there, just in case.) Printing over WiFi is more-or- less standard these days.

Ed Dawson

HP OfficeJet Pro 8730

As a freelance writer who is always on the go, I like my technology to be both efficient and effective so I can do my job well. The HP OfficeJet Pro 8730 Inkjet Printer ticks all the boxes in terms of form factor, performance and user interface.

Michael Hargreaves

Windows 10 for Business / Dell XPS 13

I’d happily recommend this touchscreen laptop and Windows 10 as a great way to get serious work done at a desk or on the road.

Aysha Strobbe

Windows 10 / HP Spectre x360

Ultimately, I think the Windows 10 environment is excellent for me as it caters for so many different uses. The inclusion of the Xbox app is also great for when you need some downtime too!

Mark Escubio

Windows 10 / Lenovo Yoga 910

For me, the Xbox Play Anywhere is a great new feature as it allows you to play your current Xbox games with higher resolutions and better graphics without forking out extra cash for another copy. Although available titles are still scarce, but I’m sure it will grow in time.

Featured Content

Latest Jobs

Don’t have an account? Sign up here

Don't have an account? Sign up now

Forgot password?