EMC tackles big data with Greenplum appliance
- 14 October, 2010 04:07
Taking aim at the growing problem of big data management, EMC has released a data warehouse appliance tuned to ingest large volumes of data quickly.
The Greenplum Data Computing Appliance takes advantage of an MPP (massively parallel processing) technology developed by Greenplum, a firm acquired by EMC in July.
EMC claims the appliance can ingest data twice as quickly as competing products, which EMC identified as Oracle Exadata, IBM Netezza and Teradata's enterprise data warehouse offering. A single rack can ingest 10 terabytes per hour, the company claims.
The appliance will be marketed to organizations trying to derive intelligence from large amounts of incoming data, said Scott Yara, who was the president of Greenplum and is now vice president of products at EMC.
"Machines sitting on the network or on the Web are generating much more data than humans ever could. All the mobile phones, sensor networks and routers are pouring off millions of events each day," he said. In order to make sense of this input, "businesses are forced to create all this data analysis infrastructure that they never had to before."
To ingest data more quickly, Greenplum adopted a parallel processing architecture long used by the high performance computing community.
Most data warehouse appliances have a single master node through which all data must enter, Yara explained. This approach can be a bottleneck when trying to import large amounts of data quickly. In the MPP approach, each server on a rack gets a dedicated Ethernet connection.
"Instead of loading the data into one system and trying to distribute it, [the Greenplum architecture] loads the data in parallel to all the servers in a cluster," Yara said. In a peer-to-peer fashion, the servers coordinate amongst themselves to balance the data across all the nodes.
The MPP architecture also allows the data analysis to be executed in parallel across the servers. "You can break a single query up across all the machines," Yara said.
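As a rough illustration of the two ideas Yara describes, loading through many servers at once and splitting one query across them, the following Python sketch simulates a small cluster in a single process. It is not Greenplum's implementation: the hash-based placement, the thread pool, and every name in it (`segment_for`, `parallel_load`, `parallel_average`, `NUM_SEGMENTS`) are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SEGMENTS = 4  # hypothetical number of servers in the toy cluster


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a server for a row by hashing its key (an assumed placement policy)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def route_chunk(chunk, num_segments=NUM_SEGMENTS):
    """One loader: split its share of the input into per-server buckets."""
    buckets = [[] for _ in range(num_segments)]
    for key, value in chunk:
        buckets[segment_for(key, num_segments)].append((key, value))
    return buckets


def parallel_load(rows, num_loaders=4, num_segments=NUM_SEGMENTS):
    """Load rows through several loaders at once instead of one master node."""
    chunks = [rows[i::num_loaders] for i in range(num_loaders)]
    segments = [[] for _ in range(num_segments)]
    with ThreadPoolExecutor(max_workers=num_loaders) as pool:
        results = pool.map(lambda c: route_chunk(c, num_segments), chunks)
        for buckets in results:
            for seg_id, bucket in enumerate(buckets):
                segments[seg_id].extend(bucket)
    return segments


def partial_sum_count(segment):
    """Work done on one server: a partial sum and row count for its own rows."""
    return sum(value for _, value in segment), len(segment)


def parallel_average(segments):
    """Break one query (an average) into per-server pieces, then combine them."""
    if not segments:
        return None
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(partial_sum_count, segments))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    events = [(f"event-{i}", i) for i in range(1000)]
    cluster = parallel_load(events)
    print([len(s) for s in cluster], parallel_average(cluster))
```

Each server only ever touches its own rows, and only the small partial results travel to the combining step, which is the reason the approach avoids a single-node bottleneck.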
The Greenplum Data Computing Appliance, available now, offers database software (Greenplum Database 4.0) preloaded on an integrated set of servers, along with storage and networking. A single rack would have 16 servers, each running two Intel E5670 hexacore processors. The appliance can be purchased as a half-rack, a single rack, or in a multiple rack configuration. Each rack could hold up to 36 terabytes of uncompressed storage, or up to 5 petabytes compressed across 24 racks. A 24-rack system could run a total of 4,608 database cores.
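The core count follows directly from the configuration figures quoted above, as a short Python check shows. The constant and function names are made up for the example, and the total for uncompressed storage is simply the per-rack figure multiplied out, not a number supplied by EMC.

```python
# Figures quoted in the article for the Greenplum Data Computing Appliance.
SERVERS_PER_RACK = 16
CPUS_PER_SERVER = 2
CORES_PER_CPU = 6  # "hexacore"
MAX_RACKS = 24
UNCOMPRESSED_TB_PER_RACK = 36


def cores_per_rack():
    """Database cores in one full rack."""
    return SERVERS_PER_RACK * CPUS_PER_SERVER * CORES_PER_CPU


def total_cores(racks=MAX_RACKS):
    """Database cores across a multi-rack system."""
    return racks * cores_per_rack()


def total_uncompressed_tb(racks=MAX_RACKS):
    """Uncompressed capacity in terabytes, assuming every rack is full."""
    return racks * UNCOMPRESSED_TB_PER_RACK


if __name__ == "__main__":
    print(cores_per_rack(), total_cores(), total_uncompressed_tb())
```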
The appliance form factor offers a number of advantages, the company claims. For one, it can be coupled with EMC's Data Domain backup and recovery software, which would allow the data warehouse material to be backed up to EMC SANs (storage area networks) and would let the appliance use the SAN for additional storage.
Also, EMC's RecoverPoint software could, if needed, populate a second data warehouse with the data from the SAN.
"That next step will be a huge differentiator in my book," said Steve Hirsch, chief data officer and senior vice president of Global Data Services at New York Stock Exchange Euronext, at a launch event held in New York on Wednesday. He noted that today most organizations have to make a full second working instance of the data warehouse for backup purposes, the maintenance of which can require a lot of personnel and hardware resources.
Euronext has used the Greenplum software since 2007. The organization's internal operations generate about four terabytes of data each day, and the Greenplum database is used to derive performance metrics from some of this data.
"For us it is very expensive to move data around to analyze. We need to load it once, analyze it there and make that data available," Hirsch said.
With this release, EMC also announced that it has created a new division, called the Data Computing Products Division, which will concentrate on data management software, such as that of Greenplum.