Apache Lucene/Solr expands across many servers

Lucene/Solr 4.0 comes with a new distributed indexing architecture and can offer real-time results

The Apache Software Foundation's widely used open source Lucene/Solr search engine package has been upgraded to accommodate its users' seemingly insatiable need to collect and use ever-larger amounts of data.

"The biggest improvement that has happened to Lucene/Solr is scalability," said Sarath Jarugula, vice president of product management at LucidWorks, which offers a commercially supported version of Lucene/Solr. "Lucene/Solr has been re-architected to index data across hundreds of servers," he said.

The keepers of the project plan to release Lucene/Solr 4.0 within the next day or so. Version 4.0 has been three years in the making.

While IT professionals may not have heard of Lucene or Solr, many probably have used these technologies at some point, as the software is embedded in a number of enterprise search products. Many e-commerce and social media sites, such as Facebook and Twitter, also use Lucene/Solr to power their search services.

Doug Cutting, who also created the Apache Hadoop data processing platform, built Lucene as a full-text search engine based on Java. While Lucene is a Java library of search functions, Solr provides an API (application programming interface) so other applications can interface with Lucene. Although Lucene and Solr started as separate projects, the two were merged into a single entity in 2010, now called Apache Lucene/Solr.

This new update reflects how organizations are ingesting and reusing more and more data.

Ten years ago, Jarugula noted, larger organizations might have stored a few million electronic documents, which collectively took up several hundred gigabytes. These days, however, such repositories have ballooned in size: It is not uncommon for Jarugula to encounter organizations that generate a terabyte of data a day.

Lucene/Solr has been updated to handle such larger workloads.

Most significantly, the Solr component includes a new technique called distributed indexing, which divides document indexing duties across multiple servers to speed response time even as the data sets grow larger. To further speed operations, Solr now can spawn multiple threads to index material, with each thread being able to write to disk concurrently.
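To make the idea of dividing indexing duties across servers concrete, here is a minimal sketch of hash-based document routing. It is an illustration only, not Solr's actual code: the function names `shard_for` and `route`, the use of CRC32, and the `email-N` document IDs are all assumptions made for this example.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard with a stable hash (CRC32, chosen for illustration)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route(doc_ids, num_shards):
    """Group document IDs by the shard that would index them."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(shards)


if __name__ == "__main__":
    # Hypothetical document IDs spread over three hypothetical shards.
    docs = [f"email-{i}" for i in range(10)]
    for shard, ids in sorted(route(docs, 3).items()):
        print(shard, ids)
```

Because the hash depends only on the document ID, the same document always lands on the same shard, so each server can index its share of the documents independently of the others.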

The software can now also recognize when it operates in a clustered server environment and adjust its actions to the new setup. This set of technologies comes under the name SolrCloud. "If you have a cluster, Solr will know when any server goes down and will watch for when it comes back up," Jarugula said. To help with these duties, Lucene/Solr uses the Apache ZooKeeper cluster configuration management software.
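The behavior Jarugula describes, noticing when a server goes down and when it returns, can be sketched as a toy heartbeat registry. This is a simplified illustration under stated assumptions, not how SolrCloud or ZooKeeper is implemented: the `ClusterState` class, the five-second timeout, and the node names are all hypothetical.

```python
class ClusterState:
    """Toy registry of live nodes based on heartbeat timestamps (illustrative only)."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self.last_seen = {}  # node name -> time of its most recent heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a node reported in at time `now`."""
        self.last_seen[node] = now

    def live_nodes(self, now: float) -> set:
        """Return the nodes whose last heartbeat is within the timeout."""
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}


state = ClusterState(timeout_seconds=5.0)
state.heartbeat("node-a", now=0.0)
state.heartbeat("node-b", now=0.0)
state.heartbeat("node-a", now=4.0)        # node-b stays silent
print(sorted(state.live_nodes(now=6.0)))  # node-b has timed out
state.heartbeat("node-b", now=7.0)        # node-b comes back
print(sorted(state.live_nodes(now=8.0)))  # both nodes are live again
```

Time is passed in explicitly rather than read from the clock so the example behaves the same on every run.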

The distributed indexing also shortens the time it takes for indexed material to become available to users, which paves the way for real-time search. Typically, enterprise search engines only update their indices once a day, or once every few hours. Lucene can now update continuously, even with a data set of billions of documents. "You can now index on a per-second basis," Jarugula said.

As a result, as soon as a document has been entered into a repository, it can be indexed and will start appearing in search results. This feature also reflects the changing needs of the enterprise. Thanks to the influence of Twitter and Facebook, "as I send an email or update a document, I want it to be immediately available to my colleagues," Jarugula said.
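The difference between batch and continuous indexing can be illustrated with a toy in-memory inverted index in which a document becomes searchable the moment it is added. This sketch is an assumption-laden illustration, not Lucene's design: the `TinyIndex` class, its whitespace tokenization, and the sample documents are invented for this example.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: documents are searchable immediately after add()."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> IDs of documents containing it

    def add(self, doc_id: str, text: str) -> None:
        """Index a document by splitting its text on whitespace."""
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term: str) -> list:
        """Return the sorted IDs of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))


index = TinyIndex()
index.add("doc-1", "Quarterly sales report")
print(index.search("report"))  # only doc-1 is indexed so far
index.add("doc-2", "Updated sales report")
print(index.search("report"))  # doc-2 appears with no batch rebuild
```

A batch-oriented engine would instead collect new documents and rebuild or merge the index on a schedule, which is why results could lag by hours.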

Lucene/Solr 4.0 will also offer a number of other features, such as versioning -- in which older versions of data are retained -- and a new Web-based administrative interface.

One organization looking forward to the new edition is deal-of-the-day Internet service Groupon. Groupon uses the open source version of Lucene/Solr and contracts with LucidWorks for engineering support. "Lucene/Solr is highly competitive against other commercial offerings," said Jeff Ayars, who is a Groupon vice president of engineering.

Groupon uses Lucene/Solr to index all the emails it sends to its users, Ayars said. Emails are customized for each user, so "tens of millions of new documents are indexed daily," Ayars said. When a user calls the company, a representative can search for the specific email that the caller has a question about. The company also uses Lucene/Solr's geospatial indexing capabilities to provide each user information about nearby deals.

Perhaps not surprisingly, Ayars is most looking forward to the new clustering features of Lucene/Solr 4.0. "There's been recipes for clustering with Solr for a very long time. But it's helpful for us to have baked-in support," Ayars said.

The Apache Lucene/Solr project has 37 core committers, nine of whom work for LucidWorks (which was previously called Lucid Imagination). Users of LucidWorks' Lucene/Solr commercial package include AT&T, Ford, Verizon, Cisco, Raytheon, Salesforce.com, Qualcomm and eHarmony.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com.
