Five things you need to know about Hadoop vs Apache Spark

They're sometimes viewed as competitors in the big-data space, but the growing consensus is that they're better together

Listen in on any conversation about big data, and you'll probably hear mention of Hadoop or Apache Spark. Here's a brief look at what they do and how they compare.

1: They do different things. Hadoop and Apache Spark are both big-data frameworks, but they don't really serve the same purposes. Hadoop is essentially a distributed data infrastructure: It distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don't need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was possible previously. Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn't do distributed storage.

2: You can use one without the other. Hadoop includes not just a storage component, known as the Hadoop Distributed File System (HDFS), but also a processing component called MapReduce, so you don't need Spark to get your processing done. Conversely, you can also use Spark without Hadoop. Spark does not come with its own file management system, though, so it needs to be integrated with one -- if not HDFS, then another storage system, such as a cloud-based data platform. Spark was designed for Hadoop, however, so many agree they're better together.

3: Spark is speedier. Spark is generally a lot faster than MapReduce because of the way it processes data. While MapReduce operates in steps, Spark operates on the whole data set in one fell swoop. "The MapReduce workflow looks like this: read data from the cluster, perform an operation, write results to the cluster, read updated data from the cluster, perform next operation, write next results to the cluster, etc.," explained Kirk Borne, principal data scientist at Booz Allen Hamilton. Spark, on the other hand, completes the full set of data analytics operations in memory and in near real time: "Read data from the cluster, perform all of the requisite analytic operations, write results to the cluster, done," Borne said. Spark can be as much as 10 times faster than MapReduce for batch processing and up to 100 times faster for in-memory analytics, he said.
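The two workflows Borne describes can be sketched in a few lines of plain Python. This is a toy model, not real Hadoop or Spark code: the ToyCluster class, the run_stepwise and run_in_memory functions, and the three sample operations are all hypothetical names invented for illustration. The sketch only counts how often each style touches cluster storage.

```python
# Toy model only: a dict stands in for cluster storage.
# No Hadoop or Spark APIs are used; every name here is hypothetical.

class ToyCluster:
    """Stand-in for cluster storage that counts reads and writes."""

    def __init__(self, data):
        self.store = {"input": list(data)}
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return list(self.store[key])

    def write(self, key, values):
        self.writes += 1
        self.store[key] = list(values)


def run_stepwise(cluster, operations):
    """MapReduce-style: read, perform one operation, write, then repeat."""
    key = "input"
    for i, op in enumerate(operations):
        data = cluster.read(key)
        key = f"step_{i}"
        cluster.write(key, [op(x) for x in data])
    return cluster.store[key]


def run_in_memory(cluster, operations):
    """Spark-style: read once, perform every operation in memory, write once."""
    data = cluster.read("input")
    for op in operations:
        data = [op(x) for x in data]
    cluster.write("result", data)
    return cluster.store["result"]


# Three arbitrary sample operations applied in sequence.
operations = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

stepwise = ToyCluster([1, 2, 3])
stepwise_result = run_stepwise(stepwise, operations)

in_memory = ToyCluster([1, 2, 3])
in_memory_result = run_in_memory(in_memory, operations)

print("stepwise: ", stepwise_result, stepwise.reads, stepwise.writes)
print("in-memory:", in_memory_result, in_memory.reads, in_memory.writes)
```

Both runs produce the same answer, but the stepwise version reads and writes storage once per operation, while the in-memory version reads once and writes once. In this toy the difference is only a counter; on a real cluster, each of those round trips involves disk and network traffic.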

4: You may not need Spark's speed. MapReduce's processing style can be just fine if your data operations and reporting requirements are mostly static and you can wait for batch-mode processing. But if you need to do analytics on streaming data, such as from sensors on a factory floor, or have applications that require multiple operations, you probably want to go with Spark. Most machine-learning algorithms, for example, require multiple operations. Common applications for Spark include real-time marketing campaigns, online product recommendations, cybersecurity analytics and machine log monitoring.

5: Failure recovery: different, but still good. Hadoop is naturally resilient to system faults or failures since data are written to disk after every operation, but Spark has similar built-in resiliency because its data objects are stored in resilient distributed datasets (RDDs) spread across the data cluster. "These data objects can be stored in memory or on disks, and RDD provides full recovery from faults or failures," Borne pointed out.
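One way an RDD can recover without writing to disk after every step is by remembering how each piece of data was derived, so that a lost piece can be recomputed from its source. The plain-Python sketch below illustrates that idea only. It is not Spark's implementation, and the ToyDataset class and its methods are hypothetical names invented for this example.

```python
# Toy illustration of recovery by recomputation; not Spark's actual RDD code.
# ToyDataset and its methods are hypothetical names used for illustration.

class ToyDataset:
    """Dataset split into partitions that remembers how it was derived."""

    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # dataset this one was derived from
        self.transform = transform    # function applied to each parent element

    def map(self, fn):
        """Derive a new dataset and record the parent and the function used."""
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyDataset(new_parts, parent=self, transform=fn)

    def lose_partition(self, index):
        """Simulate a node failure that wipes one partition from memory."""
        self.partitions[index] = None

    def get_partition(self, index):
        """Return a partition, rebuilding it from the parent if it was lost."""
        part = self.partitions[index]
        if part is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; reload it from storage")
            source = self.parent.get_partition(index)
            part = [self.transform(x) for x in source]
            self.partitions[index] = part
        return part


source = ToyDataset([[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)

doubled.lose_partition(1)             # simulate a failure
recovered = doubled.get_partition(1)  # rebuilt from the surviving source data
print(recovered)
```

Only the lost partition is recomputed; the surviving one is left alone. The sketch assumes the source data is still available, which in practice is the job of the underlying storage layer.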

Katherine Noyes

IDG News Service