lucene bootcamp - 2

Lucene Boot Camp

Grant Ingersoll, Lucid Imagination

Nov. 4, 2008 New Orleans, LA

Schedule

• In-depth Indexing/Searching

– Performance, Internals

– Filters, Sorting

• Terms and Term Vectors

• Class Project

• Q & A

Day I Recap

• Indexing

– IndexWriter

– Document/Field

– Analyzer

• Searching

– IndexSearcher

– IndexReader

– QueryParser

• Analysis

• Contrib

Indexing In-Depth

• Deletions and Updates

• Optimize

• Important Internals

– File Formats

– Segments, Commits, Merging

– Compound File System

• Performance

Lucene File Formats and Structures

• http://lucene.apache.org/java/2_4_0/fileformats.html

• A Lucene index is made up of one or more Segments

• Lucene tracks Documents internally by an int “id”

• This id may change across index operations

– You should not rely on it unless you know your index isn’t changing

• You can ask for a Document by this id on the IndexReader

Segments

• Each Segment is an independent index containing:

– Field Names

– Stored Field values

– Term Dictionary, proximity info and normalization factors

– Term Vectors (optional)

– Deleted Docs

• Compound File System (CFS) stores all of these logical pieces in a single file

How Lucene Indexes

• Lucene indexes Documents into memory

– At certain trigger points, the in-memory segments are flushed/committed to the Directory (this can be forced by calling commit())

– Segments are periodically merged (more in a moment)

Segments and Merging

• May be created when new documents are added

• Are merged from time to time based on segment size in relation to:

– MergePolicy

– MergeScheduler

– Optimization

Merge Policy

• Identifies Segments to be merged

• Two Current Implementations

– LogDocMergePolicy

– LogByteSizeMergePolicy

• mergeFactor - Max # of segments allowed before merging
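To get a feel for how mergeFactor shapes the number of segments, here is a toy model in plain Java with no Lucene dependency. The class `ToyLogMerge` and its methods are hypothetical, and the rule it applies (merge whenever mergeFactor equal-sized segments sit at the tail of the list) is a deliberate simplification of a log-style merge policy, not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a log-style merge policy; not Lucene code.
class ToyLogMerge {

    // Flushes `flushes` segments of `docsPerFlush` docs each and returns the
    // resulting segment sizes (in docs), oldest segment first.
    static List<Integer> simulate(int flushes, int docsPerFlush, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            segments.add(docsPerFlush); // a flush creates a new small segment
            // Cascade: as long as the newest mergeFactor segments are the
            // same size, replace them with one merged segment.
            while (segments.size() >= mergeFactor && tailIsUniform(segments, mergeFactor)) {
                int merged = 0;
                for (int j = 0; j < mergeFactor; j++) {
                    merged += segments.remove(segments.size() - 1);
                }
                segments.add(merged);
            }
        }
        return segments;
    }

    // True if the last n segments all have the same size.
    static boolean tailIsUniform(List<Integer> segments, int n) {
        int last = segments.get(segments.size() - 1);
        for (int j = segments.size() - n; j < segments.size(); j++) {
            if (segments.get(j) != last) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("mergeFactor=10: " + simulate(25, 10, 10));
        System.out.println("mergeFactor=2:  " + simulate(25, 10, 2));
    }
}
```

Running it with a small and a large mergeFactor shows the trade-off in this model: a low value keeps the segment count down at the cost of merging more often, while a high value leaves more segments lying around between merges.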

MergeScheduler

• Responsible for performing the merge

• Two Implementations:

– Serial - blocking

– Concurrent - new, background

Optimize

• Optimize is the process of merging segments down into a single segment

• This process can yield significant speedups in search

• Can be slow

• Can also do partial optimizes

Final Thoughts On Merging

• Usually don’t have to think about it, except when to optimize

• In high-update, performance-critical environments, you may need to dig into it more, as merging can sometimes cause long pauses

• It is good to optimize when you can; otherwise, keep a low mergeFactor

Deletion

• A deletion only marks the Document as deleted

– Doesn’t get physically removed until a merge

• Deletions can be a bit confusing

– Both IndexReader and IndexWriter have delete methods

– IndexReader deletes by id or term; IndexWriter deletes by term(s) or Query(s)

Task

– Build your index from yesterday and then try some deletes (by id, term, Query)

– Also try out an optimize on an FSDirectory against the full Reuters sample

– 15-20 minutes

Updates

• Updates are always a delete and an add

• Updates are always a delete and an add

– Yes, that is a repeat!

– Nature of data structures used in search

• See IndexWriter.updateDocument()
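The "mark now, remove at merge" behavior and the delete-plus-add nature of updates can be sketched with a toy index in plain Java. Everything here (`ToyIndex`, `deleteByKey`, `update`, `merge`, `idOf`) is hypothetical and does not use Lucene; only the names `maxDoc` and `numDocs` are borrowed from IndexReader to make the analogy easier to follow.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical toy index: a model of mark-then-merge deletion, not Lucene's API.
class ToyIndex {
    private final List<String[]> docs = new ArrayList<>(); // each entry: {key, body}
    private final BitSet deleted = new BitSet();           // bit i set => doc i is marked deleted

    // Adds a document and returns its internal id (its slot in the list).
    int add(String key, String body) {
        docs.add(new String[] {key, body});
        return docs.size() - 1;
    }

    // Marks every live document with this key as deleted; nothing is removed yet.
    int deleteByKey(String key) {
        int count = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i)[0].equals(key)) {
                deleted.set(i);
                count++;
            }
        }
        return count;
    }

    // An update is a delete followed by an add, so the document gets a new id.
    void update(String key, String body) {
        deleteByKey(key);
        add(key, body);
    }

    int maxDoc() { return docs.size(); }                        // includes deleted slots
    int numDocs() { return docs.size() - deleted.cardinality(); } // live documents only

    // Internal id of the first live document with this key, or -1.
    int idOf(String key) {
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i)[0].equals(key)) {
                return i;
            }
        }
        return -1;
    }

    // Physically drops deleted documents; surviving documents are renumbered.
    void merge() {
        List<String[]> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        docs.clear();
        docs.addAll(live);
        deleted.clear();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a", "first version");
        index.add("b", "another doc");
        index.update("a", "second version");
        System.out.println("before merge: maxDoc=" + index.maxDoc() + " numDocs=" + index.numDocs());
        index.merge();
        System.out.println("after merge:  maxDoc=" + index.maxDoc() + " id of b=" + index.idOf("b"));
    }
}
```

In this model the id of "b" changes after the merge even though "b" itself was never touched, which is the reason internal ids should not be relied on while the index is changing.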

Performance Factors

• setRAMBufferSizeMB

– New model for automagically controlling indexing factors based on the amount of memory in use

– Obsoletes setMaxBufferedDocs

• maxBufferedDocs

– Minimum # of docs buffered in memory before they are flushed as a new segment

– Usually, Larger == faster, but more RAM

More Factors

• mergeFactor

– How often segments are merged

– Smaller == less RAM, better for incremental updates

– Larger == faster, better for batch indexing

• maxFieldLength

– Limit the number of terms in a Document

• Analysis

• Reuse

– Document, TokenStream, Token

Index Threading

• IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization

• One open IndexWriter per Directory

• Parallel Indexing

– Index to separate Directory instances

– Merge using IndexWriter.addIndexes

– Could also distribute and collect

Benchmarking Indexing

• contrib/benchmark

• Try out different algorithms between Lucene 2.2 and 2.3

– contrib/benchmark/conf: indexing.alg, indexing-multithreaded.alg

• Info:

– Mac Pro 2 x 2GHz Dual-Core Xeon

– 4 GB RAM

– ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M

Benchmarking Results

               Records/Sec    Avg. T Mem
2.2            421            39M
Trunk          2,122          52M
Trunk-mt (4)   3,680          57M

Your results will depend on analysis, etc.

Searching

• Earlier we touched on basics of search using the QueryParser

• Now look at:

– Searcher/IndexReader Lifecycle

– Query classes

– More details on the QueryParser

– Filters

– Sorting

Lifecycle

• Recall that the IndexReader loads a snapshot of the index into memory

– This means updates made since loading the index will not be seen

• Business rules are needed to define how often to reload the index, if at all

– IndexReader.isCurrent() can help

• Loading an index is an expensive operation

– Do not open a Searcher/IndexReader for every search

Reopen

• It is possible to have an IndexReader reopen just the new or changed segments

– Saves some of the cost of loading a new index

• Does not close the old reader, so the application must

• See DeletionsUpdatesTest.testReopen()

Query Classes

• TermQuery is the basis for all non-span queries

• BooleanQuery combines multiple Query instances as clauses

– should

– required

• PhraseQuery finds terms occurring near each other, position-wise

– “slop” is the maximum edit distance (number of position moves) allowed between the query phrase and the matching terms

• Take 2-3 minutes to explore Query implementations
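Phrase slop is easier to reason about with numbers. The sketch below is a simplified model for two-term phrases only, written in plain Java without Lucene; `ToySlop`, `requiredSlop`, and `matches` are hypothetical names. It assumes the usual reading of slop as the number of single-position moves needed to line the terms up in phrase order.

```java
// Hypothetical toy model of phrase slop for two-term phrases; not Lucene code.
class ToySlop {

    // Smallest slop that lets the phrase "first second" match when `first`
    // occurs at position firstPos and `second` at position secondPos.
    // Adjacent and in order needs 0; each gap position adds 1; a reversed
    // pair costs extra because one term has to move past the other.
    static int requiredSlop(int firstPos, int secondPos) {
        return Math.abs(secondPos - firstPos - 1);
    }

    // True if any pair of occurrences satisfies the given slop.
    static boolean matches(String[] tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            for (int j = 0; j < tokens.length; j++) {
                if (i != j && tokens[j].equals(second) && requiredSlop(i, j) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens = "the quick brown fox".split(" ");
        for (int slop = 0; slop <= 3; slop++) {
            System.out.println("slop=" + slop
                + " \"quick fox\": " + matches(tokens, "quick", "fox", slop)
                + " \"fox quick\": " + matches(tokens, "fox", "quick", slop));
        }
    }
}
```

In this model "quick fox" needs a slop of 1 against "the quick brown fox", while the reversed phrase "fox quick" needs 3, so a slop large enough for out-of-order matches is noticeably more permissive than it first looks.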

Spans

• Spans provide information about where matches took place

• Not supported by the QueryParser

• Can be used in BooleanQuery clauses

• Take 2-3 minutes to explore SpanQuery classes

– SpanNearQuery useful for doing phrase matching

QueryParser

• MultiFieldQueryParser

• Boolean operators cause confusion

– Better to think in terms of required (+ operator) and not allowed (- operator)

• Check JIRA for QueryParser issues

• http://www.gossamer-threads.com/lists/lucene/java-user/40945

• Most applications either modify QP, create their own, or restrict to a subset of the syntax

• Your users may not need all the “flexibility” of the QP

Sorting

• Lucene's default sort is by score

• Searcher has several methods that take in a Sort object

• Sorting should be addressed during indexing

• Sorting is done on Fields containing a single term that can be used for comparison

• The SortField defines the different sort types available

– AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC

Sorting II

• Look at Searcher, Sort and SortField

• Custom sorting is done with a SortComparatorSource

• Sorting can be very expensive

– Terms are cached in the FieldCache

Filters

• Filters restrict the search space to a subset of Documents

• Use Cases

– Search within a Search

– Restrict by date

– Rating

– Security

– Author

Filter Classes

• QueryWrapperFilter (QueryFilter)

– Restrict to subset of Documents that match a Query

• RangeFilter

– Restrict to Documents that fall within a range

– Better alternative to RangeQuery

• CachingWrapperFilter

– Wrap another Filter and provide caching
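Conceptually, a filter is a set of permitted document ids that gets intersected with the query's matches. The following plain-Java sketch models that with `java.util.BitSet`. It does not use Lucene: `ToyFilters`, `DocFilter`, `yearRange`, and `CachingFilter` are hypothetical names, and keying the cache on the identity of the `years` array is an assumption standing in for caching per open index.

```java
import java.util.BitSet;

// Hypothetical toy model of filters as bit sets; not Lucene's Filter API.
class ToyFilters {

    // A filter maps the index (here just one int field per doc) to allowed doc ids.
    interface DocFilter {
        BitSet bits(int[] years);
    }

    // Range filter: allows docs whose year lies in [lo, hi].
    static DocFilter yearRange(final int lo, final int hi) {
        return years -> {
            BitSet allowed = new BitSet(years.length);
            for (int doc = 0; doc < years.length; doc++) {
                if (years[doc] >= lo && years[doc] <= hi) {
                    allowed.set(doc);
                }
            }
            return allowed;
        };
    }

    // Wraps another filter and reuses its result for the same index instance.
    static final class CachingFilter implements DocFilter {
        private final DocFilter inner;
        private int[] cachedFor;
        private BitSet cached;
        int computations = 0; // how many times the inner filter actually ran

        CachingFilter(DocFilter inner) {
            this.inner = inner;
        }

        @Override
        public BitSet bits(int[] years) {
            if (cachedFor != years) { // identity check: a new index means recompute
                cached = inner.bits(years);
                cachedFor = years;
                computations++;
            }
            return (BitSet) cached.clone(); // callers may modify their copy
        }
    }

    // Restricts the query's matches to the docs the filter allows.
    static BitSet search(BitSet queryMatches, DocFilter filter, int[] years) {
        BitSet result = (BitSet) queryMatches.clone();
        result.and(filter.bits(years));
        return result;
    }

    public static void main(String[] args) {
        int[] years = {2001, 2005, 2008, 2008};
        BitSet matches = new BitSet();
        matches.set(0);
        matches.set(2);
        matches.set(3);
        CachingFilter recent = new CachingFilter(yearRange(2005, 2008));
        System.out.println("filtered hits: " + search(matches, recent, years));
        System.out.println("filtered hits: " + search(matches, recent, years));
        System.out.println("filter computed " + recent.computations + " time(s)");
    }
}
```

The second search reuses the cached bit set, which is the payoff of wrapping a filter that is applied to many queries against the same index.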

Task

• Modify your program to sort by a field and to filter by a query or some other criteria

– ~15 minutes

Searchers

• MultiSearcher

– Search over multiple Searchables, including remote

• MultiReader

– Not a Searcher, but can be used with IndexSearcher to achieve same results for local indexes

• ParallelMultiSearcher

– Like MultiSearcher, but threaded

• RemoteSearchable

– RMI based remote searching

• Look at MultiSearcherTest in example code

Expert Results

• Searcher has several “expert” methods

• HitCollector allows low-level access to all Documents as they are scored

Search Performance

• Search speed is based on a number of factors:

– Query Type(s)

– Query Size

– Analysis

– Occurrences of Query Terms

– Optimize

– Index Size

– Index type (RAMDirectory, other)

– Usual Suspects: CPU, Memory, I/O, Business Needs

Query Types

• Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards

• Avoid starting a WildcardQuery with a wildcard

• Use ConstantScoreRangeQuery instead of RangeQuery

• Be careful with range queries and dates

– User mailing list and Wiki have useful tips for optimizing date handling
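Why wildcards get expensive, and why a leading wildcard is the worst case, can be sketched against a sorted term dictionary. This is a toy model in plain Java, not Lucene: `ToyWildcard`, `Expansion`, and `expand` are hypothetical, and the assumption is that a literal prefix lets the enumeration seek directly to the matching stretch of the dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical toy model of wildcard expansion over a sorted term dictionary.
class ToyWildcard {

    static final class Expansion {
        final List<String> terms = new ArrayList<>(); // terms the query would expand to
        int examined;                                 // dictionary terms tested against the pattern
    }

    // Translates * and ? into a regular expression, quoting everything else.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // The literal text before the first wildcard character (may be empty).
    static String literalPrefix(String wildcard) {
        int i = 0;
        while (i < wildcard.length() && wildcard.charAt(i) != '*' && wildcard.charAt(i) != '?') {
            i++;
        }
        return wildcard.substring(0, i);
    }

    static Expansion expand(SortedSet<String> dictionary, String wildcard) {
        Expansion out = new Expansion();
        Pattern pattern = toRegex(wildcard);
        String prefix = literalPrefix(wildcard);
        // Seek to the prefix; with an empty prefix this walks the whole dictionary.
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            out.examined++;
            if (pattern.matcher(term).matches()) {
                out.terms.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SortedSet<String> dict =
            new TreeSet<>(Arrays.asList("apple", "apply", "apricot", "banana", "grape"));
        for (String q : new String[] {"app*", "*ple"}) {
            Expansion e = expand(dict, q);
            System.out.println(q + " -> " + e.terms + " (examined " + e.examined + " terms)");
        }
    }
}
```

Each expanded term becomes one more clause to score, so a pattern that matches thousands of terms produces a very large query, and a leading wildcard additionally forces a scan of every term in the field.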

Query Size

• Stopword removal

• Search an “all” field instead of many fields with the same terms

• Disambiguation

– May be useful when doing synonym expansion

– Difficult to automate and may be slower

– Some applications may allow the user to disambiguate

• Relevance Feedback/More Like This

– Use most important words

– “Important” can be defined in a number of ways

Usual Suspects

• CPU

– Profile your application

• Memory

– Examine your heap size, garbage collection approach

• I/O

– Cache your Searcher, and define business logic for refreshing it based on indexing needs

– Warm your Searcher before going live (see Solr)

• Business Needs

– Do you really need to support Wildcards?

– What about date range queries down to the millisecond?

FieldSelector

• Prior to version 2.1, Lucene always loaded all Fields in a Document

• FieldSelector API addition allows Lucene to skip large Fields

– Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break

• Makes storage of original content more viable without large cost of loading it when not used

• FieldSelectorTest in example code

Relevance

• At some point along your journey, you will get results that you think are “bad”

• Is it a big deal?

– Content, Content, Content!

– Relevance Judgments

– Don’t break other queries just to “fix” one

• Hardcode it!

– A query doesn’t always have to result in a “search”

Scoring and Similarity

• Lucene has a sophisticated scoring mechanism designed to meet most needs

• Has hooks for modifying scores

• Scoring is handled by the Query, Weight and Scorer classes

Explanations

• explain(Query, int) method is useful for understanding why a Document scored the way it did

• Shows all the pieces that went into scoring the result:

– TF, DF, boosts, etc.
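To make the pieces of an explanation concrete, here is a toy scorer in plain Java. It is a sketch under stated assumptions, not Lucene's scoring code: `ToySimilarity` is a hypothetical class, the tf, idf, and length-norm formulas are modeled on the default Similarity as commonly documented (square root of frequency, log-based idf, inverse square root of field length), and the product it computes leaves out other factors that a full Lucene score includes.

```java
import java.util.Locale;

// Hypothetical toy scorer; a simplified model, not Lucene's full formula.
class ToySimilarity {

    // Term frequency factor: grows with repeated occurrences, but sub-linearly.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms (low docFreq) weigh more.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Length normalization: matches in short fields count for more.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Simplified single-term score: the product of the factors above and a boost.
    static double score(int freq, int docFreq, int numDocs, int numTerms, double boost) {
        return tf(freq) * idf(docFreq, numDocs) * boost * lengthNorm(numTerms);
    }

    // An explain-style breakdown of the same product.
    static String explain(int freq, int docFreq, int numDocs, int numTerms, double boost) {
        return String.format(Locale.ROOT,
            "%.4f = tf(freq=%d)=%.4f * idf(docFreq=%d, numDocs=%d)=%.4f * boost=%.4f * lengthNorm(numTerms=%d)=%.4f",
            score(freq, docFreq, numDocs, numTerms, boost),
            freq, tf(freq),
            docFreq, numDocs, idf(docFreq, numDocs),
            boost,
            numTerms, lengthNorm(numTerms));
    }

    public static void main(String[] args) {
        // Same term frequency and field length, but a rare versus a common term.
        System.out.println(explain(4, 9, 1000, 16, 1.0));
        System.out.println(explain(4, 899, 1000, 16, 1.0));
    }
}
```

Reading a real explanation works the same way: find the factor that dominates the product, then decide whether it is the content, the analysis, or a boost that needs to change.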

Tuning Relevance

• FunctionQuery from Solr (variation in Lucene)

• Override Similarity

• Implement own Query and related classes

• Payloads

• Boosts

Task

• Open Luke and try some queries and then use the “explain” button

• Or, write some code to do explains on a query and some documents

• See how Query type, boosting, other factors play a role in the score

Terms and Term Vectors

• Sometimes you need access to the Term Dictionary:

– Auto suggest

– Frequency information

• Sometimes you need a Document-centric view of terms, frequencies, positions and offsets

– Term Vectors

Term Information

• TermEnum gives access to terms and how many Documents they occur in

– IndexReader.terms()

• TermDocs gives access to the frequency of a term in a Document

– IndexReader.termDocs()

– TermPositions extends TermDocs and provides access to position and payload info (IndexReader.termPositions())

Term Vectors

• Term Vectors give access to term frequency information in a given Document

– IndexReader.getTermFreqVector

• TermVectorMapper provides callbacks for working with Term Vectors
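The difference between the term-centric view (term dictionary and postings) and the Document-centric view (term vectors) can be sketched with a small inverted index in plain Java. Nothing here is Lucene API: `ToyTermIndex` and its methods are hypothetical, and tokenization is a bare lowercase whitespace split chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy inverted index; a model of the two views, not Lucene code.
class ToyTermIndex {
    // term -> (doc id -> positions of the term in that doc), terms kept sorted
    private final TreeMap<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();
    private int nextDoc = 0;

    int addDocument(String text) {
        int doc = nextDoc++;
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(doc, d -> new ArrayList<>())
                    .add(pos);
        }
        return doc;
    }

    // Term-centric: in how many documents does the term occur?
    int docFreq(String term) {
        TreeMap<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Term-centric: positions of the term within one document.
    List<Integer> positions(String term, int doc) {
        TreeMap<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(doc)) {
            return Collections.emptyList();
        }
        return docs.get(doc);
    }

    int termFreq(String term, int doc) {
        return positions(term, doc).size();
    }

    // Term dictionary lookup by prefix, e.g. for auto suggest.
    List<String> suggest(String prefix) {
        return new ArrayList<>(postings.subMap(prefix, prefix + Character.MAX_VALUE).keySet());
    }

    // Document-centric: every term in the document with its frequency.
    Map<String, Integer> termVector(int doc) {
        Map<String, Integer> vector = new TreeMap<>();
        for (Map.Entry<String, TreeMap<Integer, List<Integer>>> e : postings.entrySet()) {
            List<Integer> pos = e.getValue().get(doc);
            if (pos != null) {
                vector.put(e.getKey(), pos.size());
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.addDocument("the quick brown fox");
        int doc = index.addDocument("the lazy dog the end");
        System.out.println("docFreq(the) = " + index.docFreq("the"));
        System.out.println("positions(the, " + doc + ") = " + index.positions("the", doc));
        System.out.println("suggest(qu) = " + index.suggest("qu"));
        System.out.println("termVector(" + doc + ") = " + index.termVector(doc));
    }
}
```

Note that this toy has to scan the whole dictionary to build a term vector. Avoiding that kind of scan is a reasonable way to think about why a Document-centric structure is worth storing separately when an application needs it.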

TermsTest

• Provides samples of working with terms and term vectors

Lunch ?

1-2:30

Recap

• Indexing

• Searching

• Performance

• Odds and Ends

– Explains

– FieldSelector

– Relevance

– Terms and Term Vectors

Class Project

• Your chance to really dig in and get your hands dirty

• Ask Questions

• Options…

Option I

• Start building out your Lucene Application!

– Index your Data (or any data): Threading/Updates/Deletions, Analysis

– Search: Caching/Warming, Dealing with Updates, Multi-threaded

– Display

Option II

• Dig deeper into an area of interest

– Performance: How fast can you index? Search? Queries per Second?

– Analysis

– Query Parsing

– Scoring

– Contrib

Option III

• Dig into JIRA issues and find something to fix in Lucene

• https://issues.apache.org/jira/secure/Dashboard.jspa

• http://wiki.apache.org/lucene-java/HowToContribute

Option IV

• Try out Solr

• http://lucene.apache.org/solr

Option V

• Other?

– Architecture Review/Discussion

– Use Case Discussion

Project Post-Mortem

• Volunteers to share?

Open Discussion

• Multilingual Best Practices

– UNICODE

– One Index versus many

• Advanced Analysis

• Distributed Lucene

• Crawling

• Hadoop

• Nutch

• Solr

Resources

• [email protected]

• Lucid Imagination

– Support

– Training

– Value Add

– [email protected]

Finally…

• Please take the time to fill out a survey to help me improve this training

– Located in base directory of source

– Email it to me at [email protected]

• There are several Lucene related talks on Wednesday
