Apache Lucene/Solr expands across many servers

Lucene/Solr 4.0 comes with a new distributed indexing architecture and can offer real-time results

The Apache Software Foundation's widely used open source Lucene/Solr search engine package has been upgraded to accommodate its users' seemingly insatiable need to collect and use ever-larger amounts of data.

"The biggest improvement that has happened to Lucene/Solr is scalability," said Sarath Jarugula, vice president of product management at LucidWorks, which offers a commercially supported version of Lucene/Solr. "Lucene/Solr has been re-architected to index data across hundreds of servers," he said.

The keepers of the project plan to release Lucene/Solr 4.0 within the next day or so. Version 4.0 has been three years in the making.

While IT professionals may not have heard of Lucene or Solr, many probably have used these technologies at some point, as the software is embedded in a number of enterprise search products. Many e-commerce and social media sites, such as Facebook and Twitter, also use Lucene/Solr to power their search services.

Doug Cutting, who also created the Apache Hadoop data processing platform, built Lucene as a full-text search engine based on Java. While Lucene is a Java library of search functions, Solr provides an API (application programming interface) so other applications can interface with Lucene. Although Lucene and Solr started as separate projects, the two were merged into a single entity in 2010, now called Apache Lucene/Solr.

This new update reflects how organizations are ingesting and reusing more and more data.

Ten years ago, Jarugula noted, larger organizations might have stored a few million electronic documents, which collectively took up several hundred gigabytes. These days, however, such repositories have ballooned in size: It is not uncommon for Jarugula to encounter organizations that generate a terabyte of data a day.

Lucene/Solr has been updated to handle such larger workloads.

Most significantly, the Solr component includes a new technique called distributed indexing, which divides document indexing duties across multiple servers to speed response time even as the data sets grow larger. To further speed operations, Solr now can spawn multiple threads to index material, with each thread being able to write to disk concurrently.
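The idea of dividing indexing duties across servers can be illustrated with a minimal sketch. It assumes documents are assigned to servers (shards) by hashing a document ID, which is one common approach; the function names `shard_for` and `distribute` and the use of MD5 are illustrative choices, not Solr's actual routing code.

```python
import hashlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number in [0, num_shards).

    Illustrative only: a stable hash of the ID is reduced modulo the
    number of shards, so the same ID always lands on the same shard.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(doc_ids, num_shards):
    """Group document IDs by the shard that would index them."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(shards)


if __name__ == "__main__":
    ids = [f"email-{n}" for n in range(10)]
    for shard, members in sorted(distribute(ids, 3).items()):
        print(shard, members)
```

Because each server only indexes its own share of the documents, adding servers spreads the work, and a query can be sent to every shard in parallel and the results merged.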

The software can now also recognize when it operates in a clustered server environment and adjust its actions to the new setup. This set of technologies comes under the name SolrCloud. "If you have a cluster, Solr will know when any server goes down and will watch for when it comes back up," Jarugula said. To help with these duties, Lucene/Solr uses the Apache ZooKeeper cluster configuration management software.

The distributed indexing also shortens the time it takes for newly indexed material to become available to users, which paves the way for real-time search. Typically, enterprise search engines only update their indices once a day, or once every few hours. Lucene can now update continuously, even with a data set of billions of documents. "You can now index on a per-second basis," Jarugula said.

As a result, as soon as a document has been entered into a repository, it can be indexed and will start appearing in search results. This feature also reflects the changing needs of the enterprise. Thanks to the influence of Twitter and Facebook, "as I send an email or update a document, I want it to be immediately available to my colleagues," Jarugula said.
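The difference between batch and continuous indexing can be shown with a toy in-memory inverted index, in which a document becomes searchable the moment it is added. This is a simplified sketch: the class `IncrementalIndex` and its methods are hypothetical, and Lucene's own near-real-time machinery is considerably more involved.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index: each term maps to the set of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Update the postings immediately, rather than waiting for a
        # nightly or hourly batch rebuild.
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, term):
        # Return matching document IDs in sorted order.
        return sorted(self._postings.get(term.lower(), set()))


if __name__ == "__main__":
    index = IncrementalIndex()
    index.add("d1", "Quarterly report")
    print(index.search("report"))
    index.add("d2", "report draft")
    print(index.search("report"))
```

In this sketch the second search already includes the document added a moment earlier, which is the behavior the article describes for newly entered documents.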

Lucene/Solr 4.0 will also offer a number of other features, such as versioning -- in which older versions of data are retained -- and a new Web-based administrative interface.

One organization looking forward to the new edition is deal-of-the-day Internet service Groupon. Groupon uses the open source version of Lucene/Solr and contracts with LucidWorks for engineering support. "Lucene/Solr is highly competitive against other commercial offerings," said Jeff Ayars, a Groupon vice president of engineering.

Groupon uses Lucene/Solr to index all the emails it sends to its users, Ayars said. Emails are customized for each user, so as a result, "tens of millions of new documents are indexed daily," Ayars said. When a user calls the company, a representative can search for the specific email that the caller has a question about. The company also uses Lucene/Solr's geospatial indexing capabilities to provide each user information about nearby deals.

Perhaps not surprisingly, Ayars is most looking forward to the new clustering features of Lucene/Solr 4.0. "There's been recipes for clustering with Solr for a very long time. But it's helpful for us to have baked-in support," Ayars said.

The Apache Lucene/Solr project has 37 core committers, nine of whom work for LucidWorks (which was previously called Lucid Imagination). Users of LucidWorks' Lucene/Solr commercial package include AT&T, Ford, Verizon, Cisco, Raytheon, Salesforce.com, Qualcomm and eHarmony.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
