Xebia Knowledge Exchange (March 2010) - Lucene : from theory to real world

Post on 11-May-2015

Lucene : from theory to real world

Keywords : Information retrieval, Apache, Performance tuning, Probabilistic, Relevance, Vector, Dictionary, Inverted index, Analysis, Model, Doug Cutting, Java, Fields, IndexReader, Document, Open Source, Query, Library, Parser, Indexing, Production, Architecture, Design, Troubleshooting, Real world, Cluster, Search application, Solr, Server

www.xebia.fr / blog.xebia.fr

Agenda

Introduction to Information Retrieval

Lucene overview

Lucene in details

Search applications design

Performance tuning

Information Retrieval

"Information Retrieval (IR) is the science of searching for documents"

Inverted Index

Boolean Model

Queries and documents are treated as sets of terms

Q = (T1 OR T2) AND (T3 OR T4)

D1 = {T1, T3}

D2 = {T2, T3, T4}

The result set of a query is a composition of unions and intersections : union for the OR operator, intersection for the AND operator

R = {D1, D2}
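
The evaluation above can be sketched in plain Java, without Lucene. This is only an illustration: it assumes a tiny in-memory inverted index that maps each term to the set of documents containing it, and the names `BooleanModelSketch`, `postings`, `or`, `and` and `evaluateQ` are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class BooleanModelSketch {
    // Inverted index for D1 = {T1, T3} and D2 = {T2, T3, T4}:
    // each term points to the documents that contain it.
    static final Map<String, Set<String>> INDEX = Map.of(
            "T1", Set.of("D1"),
            "T2", Set.of("D2"),
            "T3", Set.of("D1", "D2"),
            "T4", Set.of("D2"));

    // Documents containing the term (empty set for an unknown term).
    static Set<String> postings(String term) {
        return INDEX.getOrDefault(term, Set.of());
    }

    // OR is the union of two result sets.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // AND is the intersection of two result sets.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Q = (T1 OR T2) AND (T3 OR T4)
    static Set<String> evaluateQ() {
        return and(or(postings("T1"), postings("T2")),
                   or(postings("T3"), postings("T4")));
    }
}
```

Both sub-expressions of Q resolve to {D1, D2}, so their intersection is R = {D1, D2}, as on the slide.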

Vector Space Model

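Documents and queries are represented as vectors of term weights

$d_j = (w_{1,j}, w_{2,j}, \dots, w_{t,j})$

$q = (w_{1,q}, w_{2,q}, \dots, w_{t,q})$

Similarity between a document and a query is typically computed as the cosine of the angle between the two vectors

$\cos\theta = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}$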
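
The similarity formula itself did not survive in the transcript. As a hedged illustration, the sketch below uses cosine similarity, the usual choice for the vector space model. The class and method names are hypothetical, and the weights are plain numbers rather than anything Lucene computes.

```java
final class CosineSimilaritySketch {
    // Cosine of the angle between a document vector d and a query vector q.
    // Both arrays hold one weight per term, in the same term order.
    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normD = 0.0;
        double normQ = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        // A vector with no weights has no direction: treat it as not similar.
        if (normD == 0.0 || normQ == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }
}
```

Vectors pointing in the same direction score close to 1 whatever their length, and vectors with no term in common score 0.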

Lucene

Lucene : where do we come from ?

Version          Release date        Description
0.01             March 2000          First open source release (SourceForge)
1.0              October 2000
1.01b            July 2001           Last SourceForge release
1.2              June 2002           First Apache Jakarta release
1.3              December 2003       Compound index format, QueryParser enhancements, remote searching, extensible scoring API
1.4              July 2004           Sorting, span queries, term vectors
1.4.1            August 2004         Bug fix for sorting performance
1.4.2            October 2004        IndexSearcher optimization and misc. fixes
1.4.3            29 November 2004    Misc. fixes
1.9.0            27 February 2006    Binary stored fields, DateTools, NumberTools, RangeFilter, RegexQuery, requires Java 1.4
1.9.1            2 March 2006        Bug fix in BufferedIndexOutput
2.0              26 May 2006         Removed deprecated methods
2.1              17 February 2007    Delete/update document in IndexWriter, QueryParser improvements, contrib/benchmark
2.2              19 June 2007        Performance improvements, function queries, payloads, pre-analyzed fields, custom deletion policies
2.3.0            24 January 2008     Performance improvements, custom merge policies and merge schedulers, IndexReader.reopen
2.3.1            23 February 2008    Bug fixes from 2.3.0
2.3.2            6 May 2008          Bug fixes from 2.3.1
2.4.0            8 October 2008      Further performance improvements, transactional semantics, expungeDeletes method
2.4.1            9 March 2009        Bug fixes from 2.4.0
2.9              25 September 2009   New per-segment Collector API, faster search performance, near real-time search, attribute-based analysis
2.9.1            6 November 2009     Bug fixes from 2.9
3.0.0            25 November 2009    Removed deprecated methods, fixed some bugs
3.0.1 and 2.9.2  26 February 2010    Bug fixes from previous minor versions. Both have the same bugfix level

Lucene documentation

Lucene : Simple indexing example

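import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Directory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
// "company" is stored and indexed as a single term, so it can be searched
doc.add(new Field("company", "Xebia", Field.Store.YES, Field.Index.NOT_ANALYZED));
// "country" is stored only : returned with hits, but not searchable
doc.add(new Field("country", "France", Field.Store.YES, Field.Index.NO));

writer.addDocument(doc);
writer.close();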

Lucene : Simple search example

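import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Read-only searcher on the Directory filled by the indexing example
IndexSearcher searcher = new IndexSearcher(directory, true);

// Query an indexed field : "country" uses Field.Index.NO and would never match
Term t = new Term("company", "Xebia");
Query query = new TermQuery(t);
TopDocs docs = searcher.search(query, 10);

assertEquals(1, docs.totalHits);

searcher.close();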

Lucene - indexing

Lucene - analyzers

Lucene – Field types

Store : YES / NO

Index : NO / ANALYZED / NOT_ANALYZED / ANALYZED_NO_NORMS / NOT_ANALYZED_NO_NORMS

TermVector : NO / WITH_POSITIONS / WITH_OFFSETS / WITH_POSITIONS_OFFSETS / YES

Lucene storage - segments

A new segment is created each time IndexWriter is flushed

When a document is deleted, it is only marked as deleted in its segment; the space is reclaimed when segments are merged

Lucene storage – segments merge

Segments are merged manually with IndexWriter.optimize()

Or merged automatically, depending on the level of each segment :

level = (int) log(max(minMergeMB, size)) / log(mergeFactor)
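
A minimal sketch of the level computation, in plain Java. It assumes the cast applies to the whole quotient and that sizes are expressed in megabytes. The class name, the method name and the sample values (a floor of 1.6 MB, a merge factor of 10) are illustrative choices, not values read from a Lucene configuration.

```java
final class MergeLevelSketch {
    // Level of a segment: segments smaller than minMergeMB are all
    // treated as if they had that minimum size, so they share level 0
    // as long as minMergeMB is below mergeFactor.
    static int level(double sizeMB, double minMergeMB, int mergeFactor) {
        double effectiveSize = Math.max(minMergeMB, sizeMB);
        return (int) (Math.log(effectiveSize) / Math.log(mergeFactor));
    }
}
```

With a merge factor of 10, a 0.2 MB segment and a 5 MB segment both land on level 0, a 50 MB segment on level 1 and a 500 MB segment on level 2. Segments of the same level are the candidates for being merged together.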

Lucene - search

Programmatic API

TermQuery

PhraseQuery

WildcardQuery

RangeQuery (TermRangeQuery / NumericRangeQuery in Lucene 3.0)

FuzzyQuery

BooleanQuery

Lucene - QueryParser

QueryParser builds a Query object from a user query string

+JUNIT +ANT -MOCK +xebya~0.8 +title:"Junit in action"

Most of the time, the default syntax won't fit application requirements

Lucene – contrib/QueryParser

Framework that simplifies the creation of a query parser that fits your needs

3 layers :

QueryParser : transforms a query string into an Abstract Syntax Tree representation

QueryNodeProcessor : processes the nodes of the tree to move, remove or modify them

QueryBuilder : builds a Lucene BooleanQuery tree from the abstract syntax tree

Lucene – boolean queries

Lucene – PhraseQuery & SpanQuery

SpanQuery : matches documents that contain terms separated by at most n other terms (n is the 'slop')

PhraseQuery : by default, behaves like a SpanQuery with a slop value of 0 (the terms must be adjacent)

Both use term position information
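
The role of the slop can be sketched on a plain token array, without Lucene. This is a deliberately simplified, ordered match: the second term must follow the first with at most `slop` tokens in between. It is not how Lucene computes slop internally, and the names `SlopSketch` and `matchesWithin` are hypothetical.

```java
final class SlopSketch {
    // True if `second` occurs after `first` with at most `slop`
    // other tokens between them. Token positions play the role of
    // the position information stored in the index.
    static boolean matchesWithin(String[] tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            // Position i + 1 means adjacent; each extra position allows one more gap token.
            int last = Math.min(tokens.length - 1, i + 1 + slop);
            for (int j = i + 1; j <= last; j++) {
                if (tokens[j].equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

On the tokens of "junit in action", the pair (junit, action) does not match with a slop of 0, because "in" sits between the two terms, but it matches with a slop of 1.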

Lucene – approximate queries

Approximate queries (Prefix, Regex, Wildcard, Fuzzy) are rewritten into a set of TermQueries, one per matching term of the dictionary

Dictionary = { court, cours, courir }

FuzzyQuery = cour

TransformedQuery = court OR cours
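
The rewriting step can be sketched for the simplest case, a prefix query, in plain Java. The sketch assumes the dictionary is a sorted set of terms, so that all terms sharing a prefix are contiguous. The names `PrefixExpansionSketch`, `expandPrefix` and `toBooleanQuery` are hypothetical, and the result is a display string rather than a Lucene Query object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

final class PrefixExpansionSketch {
    // Enumerates the dictionary terms starting with the prefix.
    // In a sorted dictionary they form one contiguous run, so the
    // scan can stop at the first term that no longer matches.
    static List<String> expandPrefix(SortedSet<String> dictionary, String prefix) {
        List<String> terms = new ArrayList<>();
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            terms.add(term);
        }
        return terms;
    }

    // Renders the expanded terms as a disjunction of term queries.
    static String toBooleanQuery(List<String> terms) {
        return String.join(" OR ", terms);
    }
}
```

With the dictionary of the slide, the prefix "cour" expands to "courir OR cours OR court". A fuzzy query applies a stricter test to each candidate term, which is why the slide keeps only court and cours.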

Inverted Index

Lucene – Levenshtein distance

FuzzyQuery uses the Levenshtein distance : the minimum number of single-character edits (insertions, deletions, substitutions) required to turn one word into another
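
A minimal sketch of the distance computation, using the classic dynamic-programming formulation with two rows. It is not Lucene's own implementation, and the names `LevenshteinSketch` and `distance` are hypothetical.

```java
final class LevenshteinSketch {
    // Minimum number of insertions, deletions and substitutions
    // needed to turn string a into string b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + substitution);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }
}
```

Applied to the earlier dictionary example, "cour" is at distance 1 from both "court" and "cours", and at distance 2 from "courir". Under an assumed threshold of one edit, only the first two terms would be kept.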

Lucene - FuzzyQuery

Current implementation is not optimal

LUCENE-2089 will use a Levenshtein automaton

Prefix length   PQ size   Avg ms (old)   Avg ms (new)
0               1024      3286.0         7.8
0               64        3320.4         7.6
1               1024      316.8          5.6
1               64        314.3          5.6
2               1024      31.8           3.8
2               64        31.9           3.7

Lucene – Highlighter

Produces ready-to-use HTML snippets in which the words of the query are highlighted

Can be fully customized

By default, only the first 50 × 1024 characters of a text are analyzed

Use FastVectorHighlighter for faster results (~2.5 times faster); it requires term vectors with positions and offsets

Lucene – FieldCache

Lucene cache that holds the values of a single field in memory

Used internally by Sort objects

Can be used to manually load the values of a single field :

float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");

Lucene – MoreLikeThis

Finds similar documents

Produces a query to be searched

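import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] {"title", "author"});
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);

// Build a query from the terms of an existing document, then run it
Query query = mlt.like(docId);
indexSearcher.search(query, 10);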

Lucene – Function Queries

Allows score customization

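Consider using FieldCache to reduce the cost of fetching field values

import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

// q is the main query; the "score" field supplies a per-document value
FieldScoreQuery scoreQuery = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
CustomScoreQuery customQ = new CustomScoreQuery(q, scoreQuery) {
    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScore) {
        return (float) (Math.sqrt(subQueryScore) * valSrcScore);
    }
};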

Lucene – Luke

Lucene – Global performance tuning

Consider using SSD for low latency

Consider using RAMDirectory / InstantiatedIndex

Use the latest version of Lucene

Use NIOFSDirectory on Unix and MMapDirectory on Windows

Try turning off the compound file format (setUseCompoundFile(false))

Lucene – Indexing performance tuning

Set RAMBufferSizeMB according to your needs

Tune your merge policy with care

Lucene – Search performance tuning

Open IndexReader in read-only mode (default in Lucene 2.9+)

Warm up FieldCache to ensure immediate access when sorting

Limit use of TermVector

Ensure index is optimized

Architecture with Hibernate Search

Architecture with Solr

Architecture with Infinispan

Lucene – Distributed : Katta

Shards and distributes Lucene index over instances

Uses Hadoop for distribution

Lucene galaxy

Apache Nutch : Lucene + crawling and parsing

Compass : search engine framework

Apache Solr : Lucene standalone search server

Apache Mahout : distributed machine learning

Hibernate Search : Hibernate + Lucene

Katta : Distributed Lucene with Hadoop

Lucene - Futures

Flex Branch : making Lucene even more customizable

Apache Mahout : distributed machine learning for clustering, classification and recommendation algorithms

Questions ?
