IBM: Puzzles provide clues to better analysis

An IBM fellow explains why data analysis is much like assembling a picture puzzle

Today's large-scale data analysis may be a high-tech undertaking, but smart data scientists can improve their craft by observing how simple low-tech picture puzzles are solved, said an IBM scientist at the GigaOm conference.

Watching how people put together picture puzzles can reveal "a lot of profound effects that we could bring to big data" analysis, said Jeff Jonas, IBM's chief scientist for entity analytics, speaking Wednesday at one of the more whimsical presentations at the data structure conference in New York.

Data analysis is becoming a more important component of many businesses. IDC estimates enterprises will spend more than US$120 billion by 2015 on analysis systems. IBM estimates that it will reap $16 billion in business analytics revenue by 2015.

But getting useful results from such systems requires careful planning.

In a series of informal experiments, Jonas observed how small groups of friends and family worked together to assemble picture puzzles, the kind in which thousands of separate pieces fit together to form a single image.

"My girlfriend sees her son and three cousins, I see four parallel processor pipelines," he said. To make the challenge a bit harder, he removed some of the puzzle pieces, and, obtaining a second copy of some puzzles, added duplicate pieces.

Puzzles are about assembling small bits of discrete data into larger pictures. In many ways, this is the goal of data analysis as well, namely finding ways of assembling data such that it reveals a bigger pattern.

A lot of organizations make the mistake of practicing "pixel analytics," Jonas said, in which they try to gather too much information from a single data point. The problem is that if too much analysis is done too soon, "you don't have enough context" to make sense of the data, he said.

Context, Jonas explained, means looking at what is around the bit of data, in addition to the data itself. By doing too much stripping and filtering of seemingly useless data, one can lose valuable context. When you see the word "bat," you look at the surrounding data to see what kind of bat it is, be it a baseball bat, a bat of the eyelids or a nocturnal creature, he said.
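
The "bat" example can be sketched in a few lines of Python. This is only an illustration of the idea of using surrounding data, not anything Jonas presented: the cue-word lists and the `guess_sense` function are hypothetical, and a real system would use far richer context.

```python
# Hypothetical cue words for each sense of "bat"; illustrative only.
SENSE_CUES = {
    "baseball bat": {"baseball", "swing", "pitcher", "hit", "ball"},
    "bat of the eyelids": {"eyelids", "eyelashes", "eye", "flirt"},
    "nocturnal animal": {"cave", "wings", "nocturnal", "fly", "night"},
}


def guess_sense(sentence, target="bat"):
    """Guess which sense of `target` is meant from the words around it."""
    # Normalize: lowercase each word and strip common punctuation.
    words = {w.strip(".,!?\"'").lower() for w in sentence.split()}
    if target not in words:
        return None
    # The context is everything in the sentence except the word itself.
    context = words - {target}
    # Score each sense by how many of its cue words appear in the context.
    scores = {sense: len(cues & context) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no supporting context, the word alone cannot be resolved.
    return best if scores[best] > 0 else "unknown"


print(guess_sense("He swung the bat and hit the ball"))
print(guess_sense("A bat flew out of the cave at night"))
print(guess_sense("bat"))
```

Stripped of its neighbors, the lone word "bat" comes back as "unknown," which is the point Jonas makes about filtering away context too early.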

"Low-quality data can be your friend. You'll be glad you didn't over-clean it," Jonas said. Google, for instance, reaps the benefits of this approach. Sloppy typists who enter a misspelled word into the search engine will often get a "did you mean this?" suggestion, along with results for what Google surmises is the correct word. Google guesses the correct word by drawing on a backlog of incorrectly typed queries.
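
A toy version of the idea, assuming a small hypothetical query log, might look like the following. It is not a description of how Google's system works; it simply keeps the misspellings in the log and uses how often each query was typed to pick a likely correction.

```python
from collections import Counter
import difflib

# Hypothetical backlog of raw queries, typos included (nothing cleaned out).
QUERY_LOG = [
    "weather", "weather", "weather", "wether",
    "weathr", "whether", "whether", "weather",
]


def suggest(query, log=QUERY_LOG, cutoff=0.8):
    """Return a 'did you mean' suggestion for `query`, or None."""
    counts = Counter(log)
    # Find previously seen queries that are spelled similarly.
    candidates = difflib.get_close_matches(query, list(counts), n=5, cutoff=cutoff)
    if not candidates:
        return None
    # Assume the most frequently typed look-alike is the intended word.
    best = max(candidates, key=lambda q: counts[q])
    # Only suggest when a different, more common spelling exists.
    if best != query and counts[best] > counts[query]:
        return best
    return None


print(suggest("weathr"))   # a rarer spelling gets a suggestion
print(suggest("weather"))  # the dominant spelling gets none
```

Had the log been cleaned of its misspelled entries, the frequency signal that separates "weather" from "weathr" would be weaker, which mirrors Jonas' warning about over-cleaning.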

With puzzles, users first concentrate on assembling one piece with another. Over time, they create small clumps of data, which they can then figure out how to connect to finish the puzzle. The edges and the corners are assembled fairly quickly. What in effect happens is that, as progress on the puzzle proceeds, "you are making faster quality decisions than before," Jonas said. "The computational costs to figure out where a piece goes declines."

Watching his teams put together the faulty puzzles, he noticed a number of interesting traits. The most obvious is that the larger the puzzle, the more time it takes to complete. "As the working space expands the computational effort increases," he said. Ambiguity also increases computational complexity. Puzzle pieces that shared the same colors and shapes were harder to fit together than those with distinct details.

"Excessive ambiguity really drives up the computational cost," Jonas said.

Jonas was also impressed with how little information someone needed to get an idea of the image that the puzzle held. After assembling only four pieces, one of his teams was able to guess that its puzzle depicted a Las Vegas vista. "That is not a lot of fidelity to figure that out," he said. Having only about 50 percent of the puzzle pieces fitted together provided enough detail to show the outline of the entire puzzle image. This is good news for organizations unable to capture all the data they are studying -- even a statistical sampling might be enough to provide the big picture, so to speak.

"When you have less than half the observation space, you can make a fairly good claim about what you are seeing," Jonas said.

Studying how his teams finished the puzzles also gave Jonas a new appreciation for batch processing, he said.

The key to analysis is a mixture of streaming and batch processing. The Apache Hadoop data framework is designed for batch processing, in which a lot of data in a static file is analyzed. This is different from stream processing, in which a continually updated string of data is observed. "Until this project, I didn't know the importance of the little batch jobs," he said.

Batch processing is a bit like "deep reflection," Jonas said. "This is no different than staying at home on the couch mulling what you already know," he said. Instead of just staring at each puzzle piece, participants would try to understand what the puzzle depicted, or how larger chunks of assembled pieces could possibly fit together.

For organizations, the lesson should be clear, Jonas explained. They should analyze data as it comes across the wire, but such analysis should be informed by the results generated by deeper batch processes, he said.
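
One way to picture that arrangement is the short Python sketch below. It is a minimal illustration under stated assumptions, not a method from the talk: the function names, the per-key average used as the batch summary and the 50 percent tolerance are all hypothetical choices.

```python
from statistics import mean


def batch_summarize(history):
    """Batch step: compute the average value per key from stored records."""
    by_key = {}
    for key, value in history:
        by_key.setdefault(key, []).append(value)
    return {key: mean(values) for key, values in by_key.items()}


def process_stream(events, baselines, tolerance=0.5):
    """Streaming step: judge each event as it arrives, using batch context."""
    for key, value in events:
        baseline = baselines.get(key)
        if baseline is None:
            # The batch job has never seen this key, so there is no context.
            yield key, value, "no context yet"
        elif abs(value - baseline) > tolerance * baseline:
            yield key, value, "unusual"
        else:
            yield key, value, "expected"


# Hypothetical stored data and incoming events.
history = [("a", 10), ("a", 12), ("b", 100)]
baselines = batch_summarize(history)

for result in process_stream([("a", 11), ("a", 30), ("c", 5)], baselines):
    print(result)
```

In practice the batch step would be rerun periodically so that the baselines the stream relies on reflect newly accumulated data.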

Jonas' talk, while seemingly irreverent, actually illustrated many important lessons of data analysis, said Seth Grimes, an industry analyst focusing on text and content analytics who attended the talk. Among the lessons: Data is important, context accumulates, and real-time streams of data should be augmented with deeper analysis.

"These are great lessons, communicated really effectively," Grimes said.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
