Lucene Performance

Grant Ingersoll

November 16, 2007

Atlanta, GA


Overview

• Defining Performance

• Basics

• Indexing

– Parameters

– Threading

• Search

• Document Retrieval

• Search Quality


Defining Performance

• Many factors in assessing Lucene (and search) performance

• Speed

• Quality of results (subjective)

– Precision: # relevant retrieved out of # retrieved

– Recall: # relevant retrieved out of total # relevant

• Size of index

– Compression rate

• Other factors:

– Local vs. distributed
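The precision and recall definitions above can be sketched in a few lines of plain Java. This is a minimal illustration, not part of Lucene or contrib/benchmark: the class name, method names and document IDs are all hypothetical, and relevance judgments are assumed to be available as a set of IDs.

```java
import java.util.HashSet;
import java.util.Set;

public class PrecisionRecall {

    /** Number of retrieved documents that are also judged relevant. */
    public static int relevantRetrieved(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return hits.size();
    }

    /** Precision: # relevant retrieved / # retrieved (0 if nothing was retrieved). */
    public static double precision(Set<String> retrieved, Set<String> relevant) {
        return retrieved.isEmpty() ? 0.0
                : (double) relevantRetrieved(retrieved, relevant) / retrieved.size();
    }

    /** Recall: # relevant retrieved / total # relevant (0 if nothing is relevant). */
    public static double recall(Set<String> retrieved, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0
                : (double) relevantRetrieved(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        // Hypothetical result set and judgments: 2 of 4 retrieved are relevant,
        // and 2 of the 3 relevant documents were found.
        Set<String> retrieved = Set.of("d1", "d2", "d3", "d4");
        Set<String> relevant = Set.of("d2", "d4", "d7");
        System.out.println("precision = " + precision(retrieved, relevant));
        System.out.println("recall    = " + recall(retrieved, relevant));
    }
}
```

The two measures pull against each other: returning more documents tends to raise recall and lower precision.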


Basics

• Consider the latest version of Lucene

– Lucene 2.3/Trunk has many performance improvements over prior versions

• Consider Solr

– Solr employs many Lucene best practices

• contrib/benchmark can help assess many aspects of performance, including speed, precision and recall

– Task-based approach makes for easy extension

• Sanity check your needs

• Profile to identify bottlenecks


Indexing Factors

• Lucene indexes Documents into memory

• On certain occasions, memory is flushed to the index representation (called a segment)

• Segments are periodically merged

• Internal Lucene models are changing and (drastically) improving performance


IndexWriter Factors

• setMaxBufferedDocs controls the minimum # of docs buffered in memory before they are flushed to a new segment

– Larger == faster, but uses more RAM

• setMergeFactor controls how often segments are merged

– Smaller == less RAM, better for large # of updates

– Larger == faster, better for batch indexing

• setMaxFieldLength controls the # of terms indexed from a document

• setUseCompoundFile controls the file format Lucene uses. Turning off compound file format is faster, but you could run out of file descriptors


Lucene 2.3 IndexWriter Changes

• setRAMBufferSizeMB

– New model for automagically controlling indexing factors based on the amount of memory in use

– Obsoletes setMaxBufferedDocs and setMergeFactor

• Takes storage and term vectors out of the merge process

• Turn off auto-commit if there are stored fields and term vectors

• Provides significant performance increase


Analysis

• An Analyzer is a Tokenizer and zero or more TokenFilters

• More complicated analysis means slower indexing

– Many applications could use simpler Analyzers than the StandardAnalyzer

– StandardTokenizer is now faster in 2.3 (thus making StandardAnalyzer faster)

• Reuse in 2.3:

– Re-use Token, Document and Field instances

– Use the char[] API with Token instead of the String API


Thread Safety

• Use a single IndexWriter for the duration of indexing

• Share IndexWriter between threads

• Parallel Indexing

– Index to separate Directory instances

– Merge when done with IndexWriter.addIndexes()

– Distribute and collect


Other Indexing Factors

• NFS

– There have been some improvements lately, but…

– “proceed with caution”

– Not as good as a local filesystem

• Replication

– Index locally and then use rsync to replicate copies of the index to other servers

– Have I mentioned Solr?


Benchmarking Indexing

• contrib/benchmark

• Try out different algorithms between Lucene 2.2 and trunk (2.3)

– contrib/benchmark/conf:

• indexing.alg

• indexing-multithreaded.alg

• Info:

– Mac Pro 2 x 2GHz Dual-Core Xeon

– 4 GB RAM

– ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M


Benchmarking Results

Version         Records/Sec    Avg. T Mem
2.2             421            39M
Trunk           2,122          52M
Trunk-mt (4)    3,680          57M
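To put the throughput figures in perspective, the relative speedups can be worked out directly. The sketch below only does arithmetic on the records/sec numbers from the results above; the class and method names are made up for illustration.

```java
public class IndexingSpeedup {

    /** Ratio of a candidate throughput to a baseline throughput (both in records/sec). */
    public static double speedup(double baseline, double candidate) {
        return candidate / baseline;
    }

    public static void main(String[] args) {
        // Records/sec as reported in the benchmark results.
        double v22 = 421;
        double trunk = 2122;
        double trunkMt = 3680;

        System.out.printf("trunk vs 2.2:      %.2fx%n", speedup(v22, trunk));
        System.out.printf("trunk-mt vs 2.2:   %.2fx%n", speedup(v22, trunkMt));
        System.out.printf("trunk-mt vs trunk: %.2fx%n", speedup(trunk, trunkMt));
    }
}
```

On these numbers, trunk indexes roughly 5 times faster than 2.2 single-threaded and roughly 8.7 times faster multithreaded, while the multithreaded run adds about 1.7 times over single-threaded trunk.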


Search Performance

• Many factors influence search speed

– Query type, size, analysis, # of occurrences, index size, index optimization, index type

– Known Enemies

• Search quality also has many factors

– Query formulation, synonyms, analysis, etc.

– How to judge quality?


Query Types

• Some queries in Lucene get rewritten into simpler queries:

– WildcardQuery rewrites to a BooleanQuery of all the terms that satisfy the wildcards

• a* -> abe, apple, an, and, array…

– Likewise with RangeQuery, especially with date ranges
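The cost of this rewriting is easier to see with a toy model. The sketch below is not Lucene code: it stands in a sorted set of strings for the index's term dictionary and a hypothetical expand() method for the rewrite step, to show that a short prefix turns into one clause per matching term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixExpansion {

    /** Returns every term in the sorted dictionary that starts with the given prefix. */
    public static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet jumps to the first term >= prefix. Matching terms are contiguous
        // in sorted order, so we can stop at the first term that does not match.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Toy term dictionary; a real index may hold millions of terms.
        SortedSet<String> terms = new TreeSet<>(
                List.of("abe", "an", "and", "apple", "array", "bee", "cat"));
        List<String> clauses = expand(terms, "a");
        System.out.println(clauses.size() + " clauses: " + clauses);
    }
}
```

Each expanded term becomes a clause that has to be looked up and scored, so the work grows with the number of matching terms rather than with the length of the query string. A fine-grained date range behaves the same way: every distinct date value in the range is its own term.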


Query Size

• Stopword removal can help reduce size

• Choose expansions carefully

• Consider using fewer fields to search over

• When doing relevance feedback, don’t use the whole document; instead, focus on the most important terms


Index Factors for Search

• Size:

– More unique terms, more to search

– Stopword removal and stemming can help reduce the number of terms

– Not a linear factor due to index compression

• Type:

– RAMDirectory if the index is small

– MMapDirectory may perform better


Search Speed Tips

• IndexSearcher

– Thread-safe, so share

– Open once and use as long as possible

• Cache Filters when appropriate

• Optimize if you have the time

• Warm up your Searcher first by sending it some preliminary queries before making it live


Known Enemies

• CPU, memory and I/O are all known enemies of performance

– Can’t live without them, either!

• Profile, run benchmarks, look at garbage collection policies, etc.

• Check your needs

– Do you need wildcards?

– Do you need so many Fields?


Document Retrieval

• Common search scenario:

– Many small Fields containing info about the Document

– One or two big Fields storing content

– Run search, display small Fields to user

– User picks one result to view content


FieldSelector

• Gives the developer greater control over how the Document is loaded

– Load, Lazy, No Load, Load and Break, Size, etc.

• In previous scenario, lazy load the large Fields

• Easier to store original content without performance penalty


Quality Queries

• Evaluating search quality is difficult and subjective

• Lucene provides good out of the box quality by most accounts

• Can evaluate using TREC or other experiments, but these risk overtuning

• Unfortunately, judging quality is a labor-intensive task


Quality Experiments

• Needs:

– Standard collection of docs (easy)

– Set of queries

• Query logs

• Develop in-house

• TREC, other conferences

– Set of judgments

• Labor intensive

• Can use log analysis to estimate which results are relevant based on clicks, etc.


Query Formulation

• Invest the time in determining the proper analysis of the fields you are searching

– Case-sensitive search

– Punctuation analysis

– Strict matching

• Stopword policy

– Stopwords can be useful

• Operator choice

• Synonym choices


Effective Scoring

• Similarity class provides a callback mechanism for controlling how some Lucene scoring factors count towards the score

– tf(), idf(), coord()

• Experiment with different length normalization factors

– You may find Lucene is overemphasizing shorter or longer documents
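To get a feel for these factors before overriding anything, it can help to compute them by hand. The sketch below is standalone Java, not a Similarity subclass. The formulas follow what Lucene's default Similarity of this era is understood to use (square-root tf, log-based idf, overlap-ratio coord, 1/sqrt length norm), which is worth checking against the version in use. The exponent parameter on lengthNorm is a hypothetical tuning knob, not a Lucene API.

```java
public class ScoringFactorsSketch {

    /** Term frequency factor: grows with the square root of the raw count. */
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: terms found in fewer documents score higher. */
    public static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Coordination factor: fraction of the query terms that the document matched. */
    public static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    /**
     * Length normalization with an adjustable exponent (assumption for illustration).
     * An exponent of 0.5 gives 1/sqrt(numTerms); smaller exponents penalize long fields less.
     */
    public static double lengthNorm(int numTerms, double exponent) {
        return 1.0 / Math.pow(numTerms, exponent);
    }

    public static void main(String[] args) {
        int shortField = 100;    // number of terms in a short field
        int longField = 10_000;  // number of terms in a long field
        for (double exponent : new double[] {0.5, 0.25}) {
            double advantage = lengthNorm(shortField, exponent) / lengthNorm(longField, exponent);
            System.out.println("exponent " + exponent
                    + ": short field is boosted " + advantage + "x relative to long field");
        }
    }
}
```

With an exponent of 0.5, a 100-term field gets about 10 times the normalization weight of a 10,000-term field; with 0.25 the gap shrinks to about 3.2 times. This is the kind of experiment that shows whether short documents are being overemphasized for a given collection.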


Effective Scoring

• Can also implement your own Query class

– Ask on the java-user mailing list first whether anyone else has already done it

• Go beyond the obvious:

– The org.apache.lucene.search.function package provides a means of using the values of Fields to change the scores

• Geographic scoring, user ratings, others

• Payloads (stay tuned for next presentation)