We recently caught up with Otis Gospodnetić, founder of Sematext, a Lucene & Solr consultancy in Brooklyn, NY. Otis is a coauthor of Lucene in Action (1st and 2nd editions). He has been involved with Lucene since 2000 and is also a member of the Apache Solr, Nutch, and Mahout development teams, as well as the Lucene Project Management Committee.
When did you first get interested in working with search technologies?
I think my interest in search appeared with my first exposure to the web in the mid-nineties. The first search engines, from WAIS and Archie to Infoseek and Webcrawler, were fascinating. During college I spent a lot of time building my own little search engines, crawlers, and search-related applications, including something called WebPh, which was my first open-source project. In 2000 I started to get involved with the Lucene project and in 2004 I co-authored the first edition of Lucene in Action. I started working with Solr in 2006 and founded Sematext a year later.
How did you start working with the Apache Software Foundation?
I discovered Lucene in 2000, when it was still hosted on SourceForge. It seemed like this was the software I had always needed in my search projects. The Lucene project later moved to the Apache Software Foundation and I came along with it. Now, 10 years later, I’m still a happy ASF member.
What types of problems are you helping your customers solve with Solr/Lucene?
At Sematext we do everything from performance troubleshooting and Solr setup and configuration reviews to complete search backend architecture and implementation. Everyone wants their search to be good and fast and we help them get there. We also design full search solutions for clients that are looking to add a search capability.
What features of Solr are you seeing your clients need the most?
Everyone seems to love facets and if they have data that lends itself to faceting, they make use of it. Multilingual support via various Lucene/Solr analyzers and Sematext’s Multilingual Indexer is also something lots of Solr users need. As datasets grow, more and more clients need to break their indices into smaller shards, which means they make use of Solr’s distributed search abilities. Finally, most people still keep “the truth” in relational databases, so Solr’s DataImportHandler is something we get to use a lot.
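As a rough illustration of two of the features mentioned above, the following sketch builds a Solr select URL that requests facet counts and, optionally, fans the query out across several shards. The parameters `facet`, `facet.field`, and `shards` are standard Solr request parameters; the host names, field names, and the helper function `build_faceted_query` are hypothetical and only serve the example.

```python
from urllib.parse import urlencode


def build_faceted_query(base_url, query, facet_fields, shards=None):
    """Build a Solr /select URL with faceting and optional distributed search.

    base_url, the field names, and the shard addresses are illustrative
    assumptions; substitute the values of your own deployment.
    """
    params = [("q", query), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field we want counts for.
    for field in facet_fields:
        params.append(("facet.field", field))
    # The shards parameter takes a comma-separated list of shard addresses.
    if shards:
        params.append(("shards", ",".join(shards)))
    return base_url + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(build_faceted_query(
        "http://localhost:8983/solr",
        "laptop",
        ["brand", "category"],
        shards=["host1:8983/solr", "host2:8983/solr"],
    ))
```

The URL is only constructed here, not sent, so the sketch can be tried without a running Solr instance.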
What are some common problems you see with a typical search implementation?
Performance is a common problem, or at least something people often want to improve. We frequently use New Relic RPM to monitor Solr performance, to see what our starting point is, and go from there.
What kind of factors can affect the performance of a search implementation?
A number of factors affect search performance, starting with the choice of hardware, index size and query rate, tokenization and analysis, and the use of filters and caching. We systematically approach the search system as a whole and address each of these factors.
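One common way filters and caching interact in Solr is the `fq` (filter query) parameter: restrictions passed as `fq` are cached separately from the main query, so a filter that recurs across many requests can be reused. The sketch below is a minimal example of keeping the user's query in `q` and the recurring restrictions in `fq`; the field names and the helper `build_filtered_query` are hypothetical, and whether this helps in a given system depends on its cache configuration and query mix.

```python
from urllib.parse import urlencode


def build_filtered_query(base_url, user_query, filters):
    """Build a Solr /select URL with the restrictions sent as fq parameters.

    Each entry in `filters` becomes its own fq parameter, so Solr can
    cache and reuse it independently of the free-text query in q.
    """
    params = [("q", user_query)]
    params.extend(("fq", f) for f in filters)
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Field names (type, lang) are assumptions for illustration only.
    print(build_filtered_query(
        "http://localhost:8983/solr",
        "solr performance",
        ["type:article", "lang:en"],
    ))
```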
How do you use New Relic to measure the performance and behavior of a Solr environment?
We use RPM ourselves to monitor our two Solr-powered search applications: search-lucene.com and search-hadoop.com. We also use it during client engagements to help us get a better feeling for how the search application behaves, both during load testing and in production.
What resources do you recommend for getting up to speed with Solr?
The Solr wiki is quite good, and you can search the wiki, the source code, javadocs, mailing lists, and website content. Of course my book Lucene in Action is a great place to start, too!
What’s on the horizon for Lucene/Solr in the near future?
Lucene is being improved in so many areas, including at its very core, that it’s hard to keep up. Lucene has always been very fast, but amazingly its developers are still finding ways to make it go faster. Solr clearly benefits from that directly. In addition, SolrCloud will make Solr much better in truly distributed systems, mainly thanks to the wonderful Apache ZooKeeper project.
Thanks Otis! It was great talking with you!
Find out more about Solr Services and Solr Performance.
Visit Sematext to learn more about their implementation and consulting services.
To learn more about RPM for Solr, please visit our Solr page.