Our site's Lucene search is unusably slow: searching for the term "dog" across ~6,000 records takes 30 seconds or more. I am completely new to Lucene search and indexing.
I realize there are many possible things to optimize, so I ran a profiler and am pasting the results here.
Here is the index controller code that gobbles most of the time:
JO_Search_Lucene_Search_QueryParser::setDefaultEncoding('UTF-8');
$hits = $index->find($query);
foreach ($hits as $hit) {
    $ids[] = $hit->item_id;
}
$index->find($query) dispatches to four implementations. I can paste these somewhat longer functions if useful, but for a start I have only pasted their doc comments:
first:
/**
* Performs a query against the index and returns an array
* of JO_Search_Lucene_Search_QueryHit objects.
* Input is a string or JO_Search_Lucene_Search_Query.
*
* @param mixed $query
* @return array JO_Search_Lucene_Search_QueryHit
* @throws JO_Search_Lucene_Exception
*/
second, third and fourth: identical doc comment to the first.
Below is a snapshot of the profiler's "time", "own time" and "calls" columns. Any guidance on where to look is appreciated.
In Zend Search Lucene, the offending method _less from your trace is doing a strcmp 187,158 times. This particular call chain reads from disk, and 271,837 _fread calls seems excessive for a healthy index with ~6,000 items!
Most likely this index has never been optimized. What that means for you is that each time a commit is issued, all pending writes are persisted to disk as a new, immutable Lucene index segment, and from then on Lucene must search one more set of index files. If you have issued 1,000 commits since launch, Lucene is now searching 1,000 sets of index files on disk.
When index optimization is performed, all of these segments are consolidated into a single segment. There's a section about this in the Zend documentation, and more on indexing performance in the best-practices section.
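For reference, Zend_Search_Lucene also exposes tuning knobs that control how quickly segments pile up. Assuming your JO_ classes are a renamed port that mirrors the Zend_Search_Lucene API (an assumption, since I only see the JO_ prefix in your trace), the relevant calls look roughly like this:

```php
<?php
// Sketch only: assumes JO_Search_Lucene mirrors Zend_Search_Lucene's API
// and that $indexPath is the directory holding your existing index.
$index = JO_Search_Lucene::open($indexPath);

// MaxBufferedDocs: how many added documents are buffered in memory
// before being flushed to disk as a new segment. Raising it means
// fewer, larger segments are created during indexing.
$index->setMaxBufferedDocs(100);

// MergeFactor: how many segments of a given size may accumulate
// before being merged into one larger segment. Lower values keep
// the segment count down (better for search speed) at the cost of
// more merge work during indexing.
$index->setMergeFactor(5);
```

These only affect future writes; they won't clean up the segments you already have, which is what optimize is for.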
Calling optimize can be an expensive operation, but it's important to run it periodically (a daily batch job, perhaps?) to keep the number of index segments manageable.
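A nightly cron job along these lines would do it. Again, this assumes the JO_ classes mirror the Zend_Search_Lucene API; the script name and index path are placeholders:

```php
<?php
// optimize_index.php -- run off-peak from cron, e.g. daily at 3am:
//   0 3 * * * php /path/to/optimize_index.php
// '/path/to/search/index' is a placeholder for your real index directory.
$index = JO_Search_Lucene::open('/path/to/search/index');

// Merges all on-disk segments into a single segment. Expensive while
// it runs, but subsequent searches only have to consult one segment.
$index->optimize();
```

After the first run you should see the file count in the index directory drop dramatically, and the _fread count in your profile should follow.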