Cloudera launches in-memory analyzer for Hadoop

Apache Spark has been used by Intel and Yahoo to analyze data as it comes off the wire

Hadoop distributor Cloudera has released a commercial edition of the Apache Spark program, which analyzes data in real time from within Cloudera's Hadoop environments.

The release has the potential to expand Hadoop's use for stream processing and faster machine learning.

"Data scientists love Spark," said Matt Brandwein, Cloudera director of product marketing.

Spark does a good job with machine learning, which requires multiple iterations over the same data set, Brandwein said.

"Historically, you'd do that stuff with MapReduce, if you're using Hadoop. But MapReduce is really slow," Brandwein said, referring to how the MapReduce framework requires repeated reads and writes to disk to carry out machine learning tasks. Spark can do this work while the data is still in working memory. Maintainers of the software claim that Spark can run programs up to 100 times faster than Hadoop MapReduce, thanks to its in-memory design.

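The difference between re-reading data on every pass and keeping it in memory can be sketched in plain Python. This is only an illustration of the idea, not Spark code and not how Spark or MapReduce is implemented: the function names (`load_points`, `fit_mean`, `compare`) and the toy gradient-descent task are assumptions made for the example.

```python
import os
import tempfile


def load_points(path, counter):
    """Read one number per line from disk and record that a disk read happened."""
    counter["disk_reads"] += 1
    with open(path) as f:
        return [float(line) for line in f]


def fit_mean(get_points, iterations=20, rate=0.5):
    """Toy iterative job: gradient descent toward the mean of the data.

    Every iteration scans the whole data set, as iterative machine
    learning algorithms typically do.
    """
    estimate = 0.0
    for _ in range(iterations):
        points = get_points()
        gradient = sum(estimate - p for p in points) / len(points)
        estimate -= rate * gradient
    return estimate


def compare(values, iterations=20):
    """Run the same job disk-bound and in-memory; return results and read counts."""
    counter = {"disk_reads": 0}
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(str(v) for v in values))

        # Disk-bound style: go back to the file on every iteration.
        disk_result = fit_mean(lambda: load_points(path, counter), iterations)
        disk_reads = counter["disk_reads"]

        # In-memory style: read once, then iterate over the cached list.
        counter["disk_reads"] = 0
        cached = load_points(path, counter)
        memory_result = fit_mean(lambda: cached, iterations)
        memory_reads = counter["disk_reads"]
    finally:
        os.remove(path)
    return disk_result, disk_reads, memory_result, memory_reads
```

Both runs arrive at the same estimate, but the first touches the disk once per iteration while the second touches it once in total.
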
Spark is also good at stream processing, in which it can monitor a constant flow of data and carry out certain functions if certain conditions are met.

Stream processing, for instance, could be applied to fraud management and security event management. "In those cases, you're analyzing real-time data off the wire to detect any anomalies and take action," Brandwein said. The data can also be off-loaded to the Hadoop file system for further interactive and deeper batch-processing analysis.
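
As a rough sketch of the "act when a condition is met" pattern, the plain-Python generator below flags any value that is far above the average of the values just before it. It is not Spark's streaming API; the function name `detect_anomalies`, the window size and the threshold factor are all hypothetical choices for illustration.

```python
from collections import deque


def detect_anomalies(events, window=5, factor=3.0):
    """Yield (index, value) for events that exceed `factor` times the
    mean of the previous `window` events.

    `events` can be any iterable, including an unbounded stream, because
    only the last `window` values are kept in memory.
    """
    recent = deque(maxlen=window)
    for index, value in enumerate(events):
        # Only judge an event once a full window of history exists.
        if len(recent) == window:
            baseline = sum(recent) / window
            if value > factor * baseline:
                yield index, value
        recent.append(value)


# Example: a run of ordinary transaction amounts with one outlier.
amounts = [20, 25, 22, 19, 24, 500, 21, 23]
flagged = list(detect_anomalies(amounts))
```

Here `flagged` holds only the outlier at position 5. A real deployment would replace the list with a live feed and the `yield` with whatever action the business requires, such as blocking a transaction or raising an alert.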

First developed at the University of California, Berkeley, Apache Spark provides a way to load streaming data into the working memory of a cluster of servers, where it can be queried in real time. It has no upper limit on how many servers, or how much memory, it can use.

It relies on the latest version of the Hadoop data-processing framework, which uses YARN (Yet Another Resource Negotiator). Spark does not require the MapReduce framework, though, which operates in batch mode. It has APIs (application programming interfaces) for Java, Scala and Python. It can natively read data from HDFS (the Hadoop Distributed File System), the HBase Hadoop database and the Cassandra data store.

The Apache Spark project has more than 120 contributing developers, and the technology has been used by Yahoo and Intel, as well as a number of smaller companies. Databricks, which offers its own commercial version of Spark, provides support for Spark on behalf of Cloudera users.

The idea of applying Hadoop-style analysis to streaming data is not a new one. Twitter maintains Storm, a set of open source software it uses for analyzing messages.

In addition to the Spark release, Cloudera announced that it has repackaged its commercial Hadoop offering into three separate packages: the Basic edition, the Flex edition and the Enterprise Hub edition. The Enterprise Hub edition bundles all of the additional tools that Cloudera has integrated with Hadoop, including HBase, Spark, backup capabilities and the Impala SQL analytic engine. The Flex edition allows the user to pick one additional tool alongside core Hadoop.

Cloudera has also renamed its Cloudera Standard edition to Cloudera Express.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
