Diffbot organizing Web data for enterprise use

The company claims to have created a structured representation of much of the data on the Web

Google's KnowledgeGraph organizes information on the Web so it can be programmatically queried

Google's KnowledgeGraph organizes information on the Web so it can be programmatically queried

Diffbot is trying to reorganize all the data on the Web so it can be put to better use.

The service "converts the existing Web into a structured database-like representation that can essentially be used for all sorts of intelligent applications," said Mike Tung, Diffbot CEO.

On Thursday, Diffbot said it had received $500,000 in funding from Bloomberg Beta, the investment arm of the Bloomberg media company. Andy Bechtolsheim, a founder of Sun MIcrosystems and the first major investor in Google, is also a backer. Diffbot says it already has paying customers for the service, which is being used by Microsoft's Bing, Adobe, Salesforce.com, and eBay.

The service creates an object for each Web page it finds. An object provides structure to a set of related data so that it can be programmatically reused, along with other similar objects, by a query engine or an external application. The software has been copying all the pages it finds on the Web and reorganizing them into objects.

Perhaps the most well-known example of this object-based approach is Google's Knowledge Graph, a Semantic Web project. If a search is done on a particular keyword, such as the name "Johnny Depp," Google will return, along with a standard list of Web pages, a box containing basic information on the actor, such as birth date and height. That box of information is a rendering of the "Johnny Depp" Knowledge Graph object built by Google.

Diffbot, which is based in Palo Alto, California, and was founded in 2008, claims its own collection of objects is superior to Google's.

The 14-person company says it has created an entirely automated system for accurately creating objects. Google's approach is at least partly manual, requiring individuals to edit objects after they have been created, confirmed a Google spokesman.

Google's Knowledge Graph is larger than Diffbot's, containing roughly a billion objects, while Diffbot's global index of the Web now includes 600 million objects. But Google doesn't yet offer a Knowledge Graph API for third-party commercial use, though it is working on one.

Diffbot is based on the idea that businesses could use such a collection of organized information for their own purposes. Nike, for instance, could deploy the service to build a profile of other shoe companies and their offerings, Tung suggested. DiffBot offers a set of APIs (application programming interfaces) that third-party applications can use to query the massive object set.

The company has developed a set of AI algorithms that can identify the context and subject of Web pages, some of which the company is in the process of patenting. One novel AI algorithm relies computer vision, which is not a widely used technique for indexing Web pages, Tung acknowledged. The layout and design of Web pages can provide important clues to help better define objects. "The layout is the signal that helps us determine what kind of page it is," Tung said. An e-commerce site has an entirely different structure than a news site, for instance.

Diffbot is one of a number of companies building such "knowledge graphs," through various sets of technologies, said Dave Schubmehl, an IDC research director who covers content analytics, discovery and cognitive systems. Such technology could be of potential value to any business that relies on understanding large amounts of external data, he said via email.

Another company working in this field is IBM, Schubmehl wrote. Last year, IBM purchased two companies to install similar capabilities in its Watson cognitive computing service. One was AlchemyAPI, which builds taxonomies of data assets, and the other is Blekko, which developed software for indexing Web sites.

Some organizations use other technologies to organize and synthesize large sets of otherwise unstructured information, according to Schubmehl. Neo4J and Oracle both offer graph databases, which are well-suited for identifying the connections across large collections of data. Others rely on semantic Web standards, such as the Sesame Java Framework, which is used for converting data into the structured RDF (Rich Description Framework) format.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com

Join the Good Gear Guide newsletter!

Error: Please check your email address.

Tags DiffBotsoftware

Our Back to Business guide highlights the best products for you to boost your productivity at home, on the road, at the office, or in the classroom.

Keep up with the latest tech news, reviews and previews by subscribing to the Good Gear Guide newsletter.

Joab Jackson

IDG News Service
Show Comments

Cool Tech

Crucial Ballistix Elite 32GB Kit (4 x 8GB) DDR4-3000 UDIMM

Learn more >

Gadgets & Things

Lexar® Professional 1000x microSDHC™/microSDXC™ UHS-II cards

Learn more >

Family Friendly

Lexar® JumpDrive® S57 USB 3.0 flash drive 

Learn more >

Stocking Stuffer

Plox Star Wars Death Star Levitating Bluetooth Speaker

Learn more >

Christmas Gift Guide

Click for more ›

Most Popular Reviews

Latest News Articles

Resources

GGG Evaluation Team

Kathy Cassidy

STYLISTIC Q702

First impression on unpacking the Q702 test unit was the solid feel and clean, minimalist styling.

Anthony Grifoni

STYLISTIC Q572

For work use, Microsoft Word and Excel programs pre-installed on the device are adequate for preparing short documents.

Steph Mundell

LIFEBOOK UH574

The Fujitsu LifeBook UH574 allowed for great mobility without being obnoxiously heavy or clunky. Its twelve hours of battery life did not disappoint.

Andrew Mitsi

STYLISTIC Q702

The screen was particularly good. It is bright and visible from most angles, however heat is an issue, particularly around the Windows button on the front, and on the back where the battery housing is located.

Simon Harriott

STYLISTIC Q702

My first impression after unboxing the Q702 is that it is a nice looking unit. Styling is somewhat minimalist but very effective. The tablet part, once detached, has a nice weight, and no buttons or switches are located in awkward or intrusive positions.

Featured Content

Latest Jobs

Don’t have an account? Sign up here

Don't have an account? Sign up now

Forgot password?