Twitter solves its data formatting challenge

At HadoopWorld, Twitter's analytics chief discussed the new technologies the company uses to handle its daily data deluge

Eschewing popular choices such as XML, CSV and JSON, Twitter has opted to store its back-end user and systems data in a little-known format pioneered by Google, called Protocol Buffers.

With the company storing 12TB of this data each day for later use, the decision of which format to use was a crucial one.

"Getting your data formats right is everything," said Twitter analytics lead Kevin Weil, during a talk at the HadoopWorld conference in New York on Tuesday.

The company is planning for the time when it will have to house "a trillion Tweets," Weil said, and it wants tools in place to analyze this information. The combination of Protocol Buffers, along with Hadoop and other associated technologies, should streamline this job, Weil said.

When stored, each short message, or "tweet," consists of 17 fields, six of which have at least one subfield, he explained. And the company will probably add more fields to this schema in the years to come.

In addition to the tweets the company's users supply, Twitter keeps internal log data on more than 80 different types of operations that occur within its systems, Weil said. Much of this log data is aggregated by Facebook's open-source technology Scribe.

The choice of a format to store all this data was a difficult one. One obvious choice is XML (Extensible Markup Language), but that format is "very wordy," Weil said, referring to how the name of the tag accompanies each data element.

Under XML, "one petabyte for a trillion Tweets might become 10 petabytes for a trillion Tweets," he said.
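Weil's figures imply roughly a kilobyte per stored tweet in a compact format. A quick back-of-the-envelope check, assuming decimal units (1 petabyte = 10^15 bytes), which the talk did not specify:

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumption: decimal units, 1 PB = 10**15 bytes (not stated in the talk).
PETABYTE = 10**15
TWEETS = 10**12  # "a trillion Tweets"

compact_bytes_per_tweet = 1 * PETABYTE // TWEETS
xml_bytes_per_tweet = 10 * PETABYTE // TWEETS

print(compact_bytes_per_tweet)  # average bytes per tweet at 1 PB total
print(xml_bytes_per_tweet)      # average bytes per tweet at 10 PB total
```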

JSON (JavaScript Object Notation), though it was designed to simplify XML, is also wordy, in that it also stores the name of the key with every entry.
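The overhead Weil describes can be seen by serializing one small record both ways and comparing the result with the size of the values alone. The sketch below uses only Python's standard library; the record and its field names are hypothetical and are not Twitter's actual schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record; field names are illustrative, not Twitter's schema.
record = {"id": "1234567890", "user_id": "42", "text": "hello world"}

def to_xml(rec):
    """Serialize the record as XML: each value is wrapped in a named tag pair."""
    root = ET.Element("tweet")
    for name, value in rec.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root)

def to_json(rec):
    """Serialize the record as compact JSON: each value carries its key."""
    return json.dumps(rec, separators=(",", ":")).encode("utf-8")

# Size of the data itself, with no field names or delimiters at all.
payload_size = sum(len(value.encode("utf-8")) for value in record.values())
json_size = len(to_json(record))
xml_size = len(to_xml(record))

print({"payload": payload_size, "json": json_size, "xml": xml_size})
```

For a record this small, most of the bytes in both encodings are names and punctuation rather than data, and the XML form is the larger of the two.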

At the other end of the spectrum is CSV (Comma Separated Values). As the name suggests, CSV separates each data element only with a comma. While simple, it is not good for nesting data elements in subfields, Weil explained. Also, if the schema is changed, the resulting programming it would take to accommodate data in the old schema would be considerable.
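The schema problem with CSV comes from the fact that a value's meaning depends only on its position in the row. A minimal sketch, using hypothetical column layouts, of what happens when a column is added and an old reader is not updated:

```python
import csv
import io

# Hypothetical column layouts; not Twitter's actual schema.
OLD_COLUMNS = ["id", "user_id", "text"]
NEW_COLUMNS = ["id", "user_id", "source", "text"]  # "source" added later

new_row = "2,42,web,second tweet\n"

def parse(line, columns):
    """Pair each value with a column name purely by position."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(columns, values))

# A reader that still assumes the old layout mislabels the new data
# without raising any error.
misread = parse(new_row, OLD_COLUMNS)
correct = parse(new_row, NEW_COLUMNS)

print(misread)
print(correct)
```

Here the old reader reports the new "source" value as the tweet text and drops the real text, which is the kind of silent breakage that every consumer of the data would need extra code to guard against.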

A downside to all of these formats is that, in order to get the data in and out of applications, developers have to repeatedly create data structures to encode and parse the data, work Weil considers "rote."

Protocol Buffers, used widely within Google, is an extensible protocol for serializing data, one Google claims is simpler than XML. And it can automate the process of recreating the data structures within applications.

"You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages," a Google tutorial on Protocol Buffers states. "You can even update your data structure without breaking deployed programs that are compiled against the 'old' format."
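How a format can stay compact and still tolerate schema changes is easier to see in code. The following is a simplified, hand-written sketch of the tagging idea behind the Protocol Buffers wire format, in which each value is prefixed by a small numeric field tag instead of a name. It is not Google's library or its generated code, it handles only two wire types (varints and length-delimited strings), and the field numbers and names are hypothetical.

```python
VARINT, LENGTH_DELIMITED = 0, 2  # the two wire types this sketch supports

def encode_varint(n):
    """Encode a non-negative int as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data, pos):
    """Decode a varint starting at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_field(number, value):
    """Prefix a value with a tag built from its field number and wire type."""
    if isinstance(value, int):
        return encode_varint((number << 3) | VARINT) + encode_varint(value)
    raw = value.encode("utf-8")
    tag = encode_varint((number << 3) | LENGTH_DELIMITED)
    return tag + encode_varint(len(raw)) + raw

def decode(data, known):
    """Decode a message; `known` maps field numbers to names.

    Fields whose numbers are not in `known` are skipped, not rejected.
    """
    fields, pos = {}, 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        number, wire_type = tag >> 3, tag & 0x7
        if wire_type == VARINT:
            value, pos = decode_varint(data, pos)
        elif wire_type == LENGTH_DELIMITED:
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if number in known:
            fields[known[number]] = value
    return fields

# Hypothetical numbering: 1 = id, 2 = text, 3 = source (added later).
message = (encode_field(1, 1234567890)
           + encode_field(2, "hello world")
           + encode_field(3, "web"))

old_reader = decode(message, {1: "id", 2: "text"})
new_reader = decode(message, {1: "id", 2: "text", 3: "source"})
print(old_reader)
print(new_reader)
```

Because the old reader skips the field number it does not recognize, a message written under the newer layout still decodes cleanly, which is the property the Google tutorial describes.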

For Twitter, this automation would allow the company to spin up new features more quickly.

"Protocol Buffers will generate code in a number of different languages, so you don't have to write code beyond IDL," Weil said, referring to the Interface Description Language in which the schema is defined. The format also ensures that if the schema is changed, the older information will remain accessible.

While primary copies of user Tweets are kept in MySQL and Cassandra databases, the company is also building a second data repository, running on Hadoop, that can be used for analytics and applications.

The information in this system can be queried using Java MapReduce or Pig, a higher-level data-flow language that runs on Hadoop. Already one feature, Twitter's name search, runs on this system, and more are expected to be built.
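For readers unfamiliar with the MapReduce model, the idea can be sketched in a few lines. This is an in-memory illustration in Python only; real Hadoop jobs of the kind Weil described are written in Java or Pig and run across a cluster, and the records below are hypothetical.

```python
from collections import defaultdict

# Hypothetical records standing in for stored tweets.
tweets = [
    {"user": "alice", "text": "first"},
    {"user": "bob", "text": "hi"},
    {"user": "alice", "text": "second"},
]

def map_phase(record):
    """Emit one (key, value) pair per tweet: the author and a count of 1."""
    yield record["user"], 1

def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)

# Shuffle step: group the mapped values by key before reducing.
grouped = defaultdict(list)
for record in tweets:
    for key, value in map_phase(record):
        grouped[key].append(value)

counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```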

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
