Diffbot organizing Web data for enterprise use

The company claims to have created a structured representation of much of the data on the Web

Google's Knowledge Graph organizes information on the Web so it can be programmatically queried

Diffbot is trying to reorganize all the data on the Web so it can be put to better use.

The service "converts the existing Web into a structured database-like representation that can essentially be used for all sorts of intelligent applications," said Mike Tung, Diffbot CEO.

On Thursday, Diffbot said it had received $500,000 in funding from Bloomberg Beta, the investment arm of the Bloomberg media company. Andy Bechtolsheim, a founder of Sun Microsystems and the first major investor in Google, is also a backer. Diffbot says it already has paying customers for the service, which is being used by Microsoft's Bing, Adobe, Salesforce.com, and eBay.

The service creates an object for each Web page it finds. An object provides structure to a set of related data so that it can be programmatically reused, along with other similar objects, by a query engine or an external application. The software has been copying all the pages it finds on the Web and reorganizing them into objects.
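A rough sketch can make the idea of a page "object" concrete. The Python below is illustrative only and is not Diffbot's data model or API: the `PageObject` class, its field names, the `query` helper, and the example.com URLs are all assumptions invented for this example. It shows how records with a common structure can be filtered by a simple query function.

```python
from dataclasses import dataclass, field


@dataclass
class PageObject:
    """Hypothetical structured record extracted from one Web page."""
    url: str
    page_type: str                      # e.g. "product" or "article"
    fields: dict = field(default_factory=dict)


def query(objects, page_type, **criteria):
    """Return objects of the given type whose fields match every criterion."""
    return [
        obj for obj in objects
        if obj.page_type == page_type
        and all(obj.fields.get(key) == value for key, value in criteria.items())
    ]


# Made-up sample data; none of these pages or brands are real.
objects = [
    PageObject("https://example.com/shoe-1", "product", {"brand": "Acme", "price": 79.0}),
    PageObject("https://example.com/shoe-2", "product", {"brand": "Zenith", "price": 95.0}),
    PageObject("https://example.com/news-1", "article", {"author": "J. Doe"}),
]

acme_products = query(objects, "product", brand="Acme")
for obj in acme_products:
    print(obj.url, obj.fields)
```

Because every object exposes the same shape, the same `query` call works across pages from unrelated sites, which is the property the article describes.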

Perhaps the best-known example of this object-based approach is Google's Knowledge Graph, a Semantic Web project. When a user searches for a particular keyword, such as the name "Johnny Depp," Google returns, along with a standard list of Web pages, a box containing basic information on the actor, such as birth date and height. That box of information is a rendering of the "Johnny Depp" Knowledge Graph object built by Google.

Diffbot, which is based in Palo Alto, California, and was founded in 2008, claims its own collection of objects is superior to Google's.

The 14-person company says it has created an entirely automated system for accurately creating objects. Google's approach is at least partly manual, requiring individuals to edit objects after they have been created, confirmed a Google spokesman.

Google's Knowledge Graph is larger than Diffbot's, containing roughly a billion objects, while Diffbot's global index of the Web now includes 600 million objects. But Google doesn't yet offer a Knowledge Graph API for third-party commercial use, though it is working on one.

Diffbot is based on the idea that businesses could use such a collection of organized information for their own purposes. Nike, for instance, could deploy the service to build a profile of other shoe companies and their offerings, Tung suggested. Diffbot offers a set of APIs (application programming interfaces) that third-party applications can use to query the massive object set.

The company has developed a set of AI algorithms, some of which it is in the process of patenting, that can identify the context and subject of Web pages. One novel AI algorithm relies on computer vision, which is not a widely used technique for indexing Web pages, Tung acknowledged. The layout and design of Web pages can provide important clues to help better define objects. "The layout is the signal that helps us determine what kind of page it is," Tung said. An e-commerce site has an entirely different structure than a news site, for instance.
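To illustrate how layout can act as a signal for page type, here is a toy sketch. It is not Diffbot's algorithm and does not use computer vision: it swaps in a few hand-written rules over hypothetical layout measurements. The `LayoutFeatures` fields, the scoring scheme, and the 0.5 threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class LayoutFeatures:
    """Hypothetical visual measurements taken from a rendered page."""
    has_price_near_image: bool   # a price displayed next to a large image
    has_buy_button: bool         # a prominent purchase button
    text_column_ratio: float     # share of page width used by the main text column
    has_byline: bool             # an author line near the headline


def classify_page(features: LayoutFeatures) -> str:
    """Guess the page type by counting layout cues for each candidate type."""
    product_score = int(features.has_price_near_image) + int(features.has_buy_button)
    article_score = int(features.has_byline) + int(features.text_column_ratio > 0.5)
    if product_score > article_score:
        return "product"
    if article_score > product_score:
        return "article"
    return "unknown"   # tie or no cues at all


shop_page = LayoutFeatures(True, True, 0.3, False)
news_page = LayoutFeatures(False, False, 0.7, True)
print(classify_page(shop_page), classify_page(news_page))
```

A production system would presumably learn such cues from rendered pages instead of hard-coding them; the point here is only that structure, not just text, can separate an e-commerce page from a news page.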

Diffbot is one of a number of companies building such "knowledge graphs," through various sets of technologies, said Dave Schubmehl, an IDC research director who covers content analytics, discovery and cognitive systems. Such technology could be of potential value to any business that relies on understanding large amounts of external data, he said via email.

Another company working in this field is IBM, Schubmehl wrote. Last year, IBM purchased two companies to add similar capabilities to its Watson cognitive computing service. One was AlchemyAPI, which builds taxonomies of data assets, and the other was Blekko, which developed software for indexing Web sites.

Some organizations use other technologies to organize and synthesize large sets of otherwise unstructured information, according to Schubmehl. Neo4j and Oracle both offer graph databases, which are well-suited for identifying the connections across large collections of data. Others rely on Semantic Web technologies, such as the Sesame Java framework, which is used for converting data into the structured RDF (Resource Description Framework) format.
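RDF expresses data as subject-predicate-object statements, often called triples, which is also the shape a graph database traverses. The sketch below models that idea with plain Python tuples instead of a real RDF library or Sesame itself. The `ex:` identifiers and the `match` helper are hypothetical names chosen for this example; the facts encoded are taken from this article.

```python
# Each triple is (subject, predicate, object). The "ex:" prefix is a
# made-up namespace used only for this illustration.
triples = {
    ("ex:Diffbot", "ex:basedIn", "ex:PaloAlto"),
    ("ex:Diffbot", "ex:foundedIn", "2008"),
    ("ex:JohnnyDepp", "ex:occupation", "ex:Actor"),
}


def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything known about one subject.
for triple in match(triples, s="ex:Diffbot"):
    print(triple)

# Every subject that has a given property.
print(match(triples, p="ex:occupation"))
```

Pattern matching with wildcards is the basic operation that query languages for RDF build on, so even this small store can answer both "what do we know about X" and "which things have property Y".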

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com.
