lucene performance grant ingersoll november 16, 2007 atlanta, ga


Lucene Performance

Grant Ingersoll

November 16, 2007

Atlanta, GA

Overview

• Defining Performance
• Basics
• Indexing
  – Parameters
  – Threading
• Search
• Document Retrieval
• Search Quality

Defining Performance

• Many factors in assessing Lucene (and search) performance
• Speed
• Quality of results (subjective)
  – Precision
    • # relevant retrieved out of # retrieved
  – Recall
    • # relevant retrieved out of total # relevant
• Size of index
  – Compression rate
• Other Factors:
  – Local vs. distributed
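The precision and recall definitions above can be sketched in a few lines of plain Java. This is a minimal illustration, not part of Lucene or contrib/benchmark: the class name `RetrievalMetrics` and its methods are hypothetical, and returning 0 for an empty set is an assumption made here to avoid dividing by zero.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a Lucene class.
final class RetrievalMetrics {
    // Number of retrieved documents that are also judged relevant.
    static int relevantRetrieved(Set<String> retrieved, Set<String> relevant) {
        Set<String> overlap = new HashSet<String>(retrieved);
        overlap.retainAll(relevant);
        return overlap.size();
    }

    // Precision: # relevant retrieved out of # retrieved.
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0; // assumption: define precision as 0 when nothing was retrieved
        }
        return (double) relevantRetrieved(retrieved, relevant) / retrieved.size();
    }

    // Recall: # relevant retrieved out of total # relevant.
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // assumption: define recall as 0 when nothing is relevant
        }
        return (double) relevantRetrieved(retrieved, relevant) / relevant.size();
    }
}
```

For example, if a query retrieves 4 documents, 2 of them are relevant, and the collection holds 8 relevant documents in total, precision is 2/4 = 0.5 and recall is 2/8 = 0.25.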

Basics

• Consider latest version of Lucene
  – Lucene 2.3/Trunk has many performance improvements over prior versions
• Consider Solr
  – Solr employs many Lucene best practices
• contrib/benchmark can help assess many aspects of performance, including speed, precision and recall
  – Task based approach makes for easy extension

• Sanity check your needs

• Profile to identify bottlenecks

Indexing Factors

• Lucene indexes Documents into memory

• On certain occasions, memory is flushed to the index representation (called a segment)

• Segments are periodically merged

• Internal Lucene models are changing and (drastically) improving performance

IndexWriter factors

• setMaxBufferedDocs controls the minimum # of docs buffered in memory before they are flushed to a new segment (which can trigger a merge)
  – Larger == faster, but uses more RAM
• setMergeFactor controls how often segments are merged
  – Smaller == less RAM, better for large # of updates
  – Larger == faster, better for batch
• setMaxFieldLength controls the # of terms indexed from a document
• setUseCompoundFile controls the file format Lucene uses. Turning off compound file format is faster, but you could run out of file descriptors
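To see why the merge factor trades indexing speed against segment count, here is a toy simulation in plain Java. It is a deliberately simplified model and not Lucene's actual merge policy: it assumes a segment is flushed every `maxBufferedDocs` documents and that whenever the last `mergeFactor` segments have equal size they are merged into one. The class `MergeSimulation` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of flush-and-merge behaviour; NOT Lucene's real merge policy.
final class MergeSimulation {
    // Returns {segments left on disk, number of merges performed}.
    static int[] simulate(int totalDocs, int maxBufferedDocs, int mergeFactor) {
        List<Integer> segments = new ArrayList<Integer>(); // segment sizes in docs
        int merges = 0;
        int buffered = 0;
        for (int i = 0; i < totalDocs; i++) {
            buffered++;
            if (buffered == maxBufferedDocs) {
                segments.add(buffered); // flush the in-memory buffer to a new segment
                buffered = 0;
                // Cascade: keep merging while the newest mergeFactor segments match in size.
                while (tailIsUniform(segments, mergeFactor)) {
                    int merged = 0;
                    for (int k = 0; k < mergeFactor; k++) {
                        merged += segments.remove(segments.size() - 1);
                    }
                    segments.add(merged);
                    merges++;
                }
            }
        }
        if (buffered > 0) {
            segments.add(buffered); // final partial flush
        }
        return new int[] { segments.size(), merges };
    }

    // True when the last mergeFactor segments all have the same size.
    static boolean tailIsUniform(List<Integer> segments, int mergeFactor) {
        if (segments.size() < mergeFactor) {
            return false;
        }
        int size = segments.get(segments.size() - 1);
        for (int k = 2; k <= mergeFactor; k++) {
            if (segments.get(segments.size() - k) != size) {
                return false;
            }
        }
        return true;
    }
}
```

In this model, indexing 990 documents with a buffer of 10 and a merge factor of 10 performs 9 merges and leaves 18 segments, while a merge factor of 2 performs 95 merges and leaves 4 segments. That matches the rule of thumb on the slide: a larger merge factor does less merge work during indexing but leaves more segments (and open files) behind.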

Lucene 2.3 IndexWriter Changes

• setRAMBufferSizeMB
  – New model for automagically controlling indexing factors based on the amount of memory in use
  – Obsoletes setMaxBufferedDocs and setMergeFactor

• Takes storage and term vectors out of the merge process

• Turn off auto-commit if there are stored fields and term vectors

• Provides significant performance increase

Analysis

• An Analyzer is a Tokenizer and zero or more TokenFilters
• More complicated analysis, slower indexing
  – Many applications could use simpler Analyzers than the StandardAnalyzer
  – StandardTokenizer is now faster in 2.3 (thus making StandardAnalyzer faster)
• Reuse in 2.3:
  – Re-use Token, Document and Field instances
  – Use the char[] API with Token instead of String API

Thread Safety

• Use a single IndexWriter for the duration of indexing

• Share IndexWriter between threads

• Parallel Indexing
  – Index to separate Directory instances
  – Merge when done with IndexWriter.addIndexes()
  – Distribute and collect
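The "single writer shared between threads" pattern can be sketched without any Lucene dependency. In the sketch below, `DocumentSink` is a hypothetical stand-in for a thread-safe writer such as IndexWriter, and `SharedWriterIndexing` and `CountingSink` are illustrative names only. A real application would call the writer's own add method inside the worker.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a thread-safe writer (for example, an IndexWriter).
interface DocumentSink {
    void addDocument(String doc);
}

final class SharedWriterIndexing {
    // Splits the documents across numThreads workers that all feed the SAME sink.
    static void indexAll(final List<String> docs, final DocumentSink sink, final int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (int t = 0; t < numThreads; t++) {
            final int offset = t;
            pool.submit(new Runnable() {
                public void run() {
                    // Worker t handles docs t, t + numThreads, t + 2 * numThreads, ...
                    for (int i = offset; i < docs.size(); i += numThreads) {
                        sink.addDocument(docs.get(i));
                    }
                }
            });
        }
        pool.shutdown(); // no new tasks; already submitted work still runs
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
    }
}

// Demo sink that only counts documents; safe to call from many threads.
final class CountingSink implements DocumentSink {
    final AtomicInteger count = new AtomicInteger();

    public void addDocument(String doc) {
        count.incrementAndGet();
    }
}
```

The alternative on the slide, one writer per Directory followed by IndexWriter.addIndexes(), follows the same shape, except that each worker owns its own sink and a final step combines the results.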

Other Indexing Factors

• NFS
  – Have been some improvements lately, but…
  – “proceed with caution”
  – Not as good as local filesystem
• Replication
  – Index locally and then use rsync to replicate copies of index to other servers
  – Have I mentioned Solr?

Benchmarking Indexing

• contrib/benchmark
• Try out different algorithms between Lucene 2.2 and trunk (2.3)
  – contrib/benchmark/conf:
    • indexing.alg
    • indexing-multithreaded.alg
• Info:
  – Mac Pro 2 x 2GHz Dual-Core Xeon
  – 4 GB RAM
  – ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M

Benchmarking Results

Version         Records/Sec   Avg. T Mem
2.2             421           39M
Trunk           2,122         52M
Trunk-mt (4)    3,680         57M

Search Performance

• Many factors influence search speed
  – Query Type, size, analysis, # of occurrences, index size, index optimization, index type
  – Known Enemies
• Search Quality also has many factors
  – Query formulation, synonyms, analysis, etc.
  – How to judge quality?

Query Types

• Some queries in Lucene get rewritten into simpler queries:
  – WildcardQuery rewrites to a BooleanQuery of all the terms that satisfy the wildcards
    • a* -> abe, apple, an, and, array…
  – Likewise with RangeQuery, especially with date ranges
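The cost of such a rewrite comes from the number of terms the pattern expands to. The plain-Java sketch below imitates the expansion of a trailing-wildcard pattern such as a* against a sorted term dictionary. It is an illustration of the idea only, not Lucene's implementation, and `PrefixExpansion` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Illustration only: mimics expanding "prefix*" into the list of matching terms.
final class PrefixExpansion {
    static List<String> expand(SortedSet<String> termDictionary, String prefix) {
        List<String> matches = new ArrayList<String>();
        // In sorted order, all terms sharing the prefix are contiguous and start at tailSet(prefix).
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // walked past the last term with this prefix
            }
            matches.add(term); // each match would become one clause in the rewritten query
        }
        return matches;
    }
}
```

Each matching term becomes one clause of the rewritten query, so a short prefix over a large index can expand into thousands of clauses. BooleanQuery caps the clause count (1024 by default) and fails the query when the cap is exceeded, which is one more reason to check whether wildcards are really needed.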

Query Size

• Stopword removal can help reduce size

• Choose expansions carefully

• Consider using fewer fields to search over

• When doing relevance feedback, don’t use whole document, instead focus on most important terms

Index Factors for Search

• Size:
  – More unique terms, more to search
  – Stopword removal and stemming can help reduce the number of terms
  – Not a linear factor due to index compression
• Type
  – RAMDirectory if the index is small enough to fit in memory
  – MMapDirectory may perform better

Search Speed Tips

• IndexSearcher
  – Thread-safe, so share
  – Open once and use as long as possible
• Cache Filters when appropriate
• Optimize if you have the time
• Warm up your Searcher first by sending it some preliminary queries before making it live

Known Enemies

• CPU, Memory, I/O are all known enemies of performance
  – Can’t live without them, either!
• Profile, run benchmarks, look at garbage collection policies, etc.
• Check your needs
  – Do you need wildcards?
  – Do you need so many Fields?

Document Retrieval

• Common Search Scenario:
  – Many small Fields containing info about the Document
  – One or two big Fields storing content
  – Run search, display small Fields to user
  – User picks one result to view content

FieldSelector

• Gives developer greater control over how the Document is loaded
  – Load, Lazy, No Load, Load and Break, Size, etc.

• In previous scenario, lazy load the large Fields

• Easier to store original content without performance penalty

Quality Queries

• Evaluating search quality is difficult and subjective

• Lucene provides good out of the box quality by most accounts

• Can evaluate using TREC or other experiments, but these risk overtuning

• Unfortunately, judging quality is a labor-intensive task

Quality Experiments

• Needs:
  – Standard collection of docs (easy)
  – Set of queries
    • Query logs
    • Develop in-house
    • TREC, other conferences
  – Set of judgments
    • Labor intensive
    • Can use log analysis to estimate which results are relevant to a query, based on clicks, etc.

Query Formulation

• Invest the time in determining the proper analysis of the fields you are searching
  – Case sensitive search
  – Punctuation analysis
  – Strict matching
• Stopword policy
  – Stopwords can be useful
• Operator choice
• Synonym choices

Effective Scoring

• Similarity class provides callback mechanism for controlling how some Lucene scoring factors count towards the score
  – tf(), idf(), coord()
• Experiment with different length normalization factors
  – You may find Lucene is overemphasizing shorter or longer documents
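To make the length normalization point concrete, the plain-Java sketch below reproduces the formulas that, to my understanding, DefaultSimilarity uses in Lucene 2.x for tf, idf and length normalization, next to one flatter alternative. `ScoringFactors` is a hypothetical class that does not extend Similarity, and `flatterLengthNorm` with its fourth-root exponent is an arbitrary illustration rather than a recommendation.

```java
// Stand-alone sketch of scoring factors; does not depend on or extend Lucene's Similarity.
final class ScoringFactors {
    // Term frequency factor: square root of the raw frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms score higher.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Default-style length normalization: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Hypothetical flatter alternative: 1 / numTerms^(1/4), which penalizes long fields less.
    static float flatterLengthNorm(int numTerms) {
        return (float) (1.0 / Math.pow(numTerms, 0.25));
    }
}
```

With the default-style formula a 100-term field gets 10 times the normalization factor of a 10,000-term field. With the flatter variant the gap shrinks to roughly 3.2 times. In a real application the same change would go into a Similarity subclass that overrides the length normalization callback, and the index would need to be rebuilt because norms are computed at indexing time.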

Effective Scoring

• Can also implement your own Query class
  – Ask if anyone else has done it first on java-user mailing list
• Go beyond the obvious:
  – org.apache.lucene.search.function package provides means for using values of Fields to change the scores
    • Geographic scoring, user ratings, others

• Payloads (stay tuned for next presentation)
