In Pictures: 9 open source Big Data technologies to watch
With more and more companies storing more and more data and hoping to leverage it for actionable insights, Big Data is making a big splash these days. Open source technology is at the core of most Big Data initiatives. Here are nine key open source Big Data technologies to keep an eye on.
Apache Hadoop is an open source software framework for data-intensive distributed applications, originally created by Doug Cutting to support his work on Nutch, an open source Web search engine. To meet Nutch's multimachine processing requirements, Cutting implemented a MapReduce facility and a distributed file system that together became Hadoop. He named it after his son's toy elephant. Hadoop's file system spreads Big Data in pieces over a series of nodes running on commodity hardware, and MapReduce processes those pieces in parallel on the nodes where they are stored. Hadoop is now among the most popular technologies for storing the structured, semi-structured and unstructured data that make up Big Data. Hadoop is available under the Apache License 2.0.
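The MapReduce idea behind Hadoop can be illustrated with a small word count. What follows is only a single-process sketch in plain Python, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for illustration, and each string in the list stands in for a piece of data that would live on a separate node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle and reduce in sequence over all chunks."""
    pairs = []
    for chunk in chunks:  # each chunk stands in for data held on one node
        pairs.extend(map_phase(chunk))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two pieces of a larger data set.
chunks = ["big data big splash", "open source big data"]
counts = word_count(chunks)
print(counts)
```

In a real cluster the map step would run on many machines at once and the framework would handle the grouping and any node failures; the sketch only shows the shape of the computation.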
R is an open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is rapidly becoming the go-to tool for statistical analysis of very large data sets. It has been commercialized by a company called Revolution Analytics, which is pursuing a services and support model inspired by Red Hat's support for Linux. R is available under the GNU General Public License.
An open source software abstraction layer for Hadoop, Cascading allows users to create and execute data processing workflows on Hadoop clusters using any JVM-based language. It is intended to hide the underlying complexity of MapReduce jobs. Cascading was designed by Chris Wensel as an alternative API to MapReduce. It is often used for ad targeting, log file analysis, bioinformatics, machine learning, predictive analytics, Web content mining and ETL applications. Commercial support for Cascading is offered by Concurrent, a company founded by Wensel after he developed Cascading. Enterprises that use Cascading include Twitter and Etsy. Cascading is available under the GNU General Public License.
Scribe is a server developed by Facebook and released in 2008. It is intended for aggregating log data streamed in real time from a large number of servers. Facebook designed it to meet its own scaling challenges, and it now uses Scribe to handle tens of billions of messages a day. It is available under the Apache License 2.0.
Developed by Shay Banon and based upon Apache Lucene, ElasticSearch is a distributed, RESTful open source search server. It's a scalable solution that supports near real-time search and multitenancy without a special configuration. It has been adopted by a number of companies, including StumbleUpon and Mozilla. ElasticSearch is available under the Apache License 2.0.
Written in Java and modeled after Google's BigTable, Apache HBase is an open source, non-relational, column-oriented distributed database designed to run on top of the Hadoop Distributed File System (HDFS). It provides fault-tolerant storage and quick access to large quantities of sparse data. HBase is one of a multitude of NoSQL data stores that have become available in the past several years. In 2010, Facebook adopted HBase to serve its messaging platform. It is available under the Apache License 2.0.
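"Sparse data" here means rows in which most possible columns are empty. A rough way to picture how a store can hold such data cheaply is to keep only the cells that actually have values. The sketch below is plain Python and not the HBase API; the SparseTable class, its method names and the sample rows and columns are all hypothetical.

```python
class SparseTable:
    """Toy in-memory table that stores only the cells that have values."""

    def __init__(self):
        # row key -> {column name -> value}; absent cells take up no space
        self._rows = {}

    def put(self, row_key, column, value):
        """Store one cell, creating the row on first use."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column, default=None):
        """Return one cell, or the default if the cell was never written."""
        return self._rows.get(row_key, {}).get(column, default)

    def cell_count(self):
        """Count the cells that are physically stored."""
        return sum(len(columns) for columns in self._rows.values())


# Hypothetical rows: each user fills in a different handful of columns.
table = SparseTable()
table.put("user1", "info:email", "a@example.com")
table.put("user2", "info:phone", "555-0100")
table.put("user2", "prefs:theme", "dark")

print(table.get("user1", "info:phone"))  # never written, so the default
print(table.cell_count())
```

Two rows across three distinct columns would make six cells in a fixed grid; only the three that were written are stored.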
Another NoSQL data store, Apache Cassandra is an open source distributed database management system developed by Facebook to power its Inbox Search feature. Facebook abandoned Cassandra in favor of HBase in 2010, but Cassandra is still used by a number of companies, including Netflix, which uses Cassandra as the back-end database for its streaming services. Cassandra is available under the Apache License 2.0.
Created by the founders of DoubleClick, MongoDB is another popular open source NoSQL data store. It stores structured data as documents in BSON (Binary JSON), a JSON-like format with dynamic schemas. MongoDB has been adopted by a number of large enterprises, including MTV Networks, craigslist, Disney Interactive Media Group, The New York Times and Etsy. It is available under the GNU Affero General Public License, with language drivers available under an Apache License. The company 10gen offers commercial MongoDB licenses.
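A "dynamic schema" means documents in the same collection need not share the same fields. The sketch below illustrates that idea with plain Python dictionaries and the standard json module standing in for BSON. It does not use a MongoDB driver, and the find helper, the field names and the sample documents are assumptions made for illustration.

```python
import json

# Hypothetical collection: the second document carries a field the first lacks.
collection = [
    {"_id": 1, "title": "Hadoop", "license": "Apache 2.0"},
    {"_id": 2, "title": "R", "license": "GPL",
     "designers": ["Ross Ihaka", "Robert Gentleman"]},
]


def find(docs, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


gpl_docs = find(collection, license="GPL")

# Text JSON is used here only as a readable stand-in for the binary encoding.
encoded = json.dumps(collection[1])
decoded = json.loads(encoded)
print(gpl_docs)
print(encoded)
```

No table definition had to change to accommodate the extra "designers" field; each document simply carries whatever fields it needs.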