Xebia Knowledge Exchange (March 2010) - Lucene: from theory to real world

Lucene: from theory to real world

Keywords: information retrieval, Apache, performance tuning, probabilistic, relevance, vector, dictionary, inverted index, analysis, Doug Cutting, fields, IndexReader, document, open source, library, parser, indexing, production, architecture, design, troubleshooting, real world, cluster, search application, server

www.xebia.fr / blog.xebia.fr

Agenda

Introduction to Information Retrieval

Lucene overview

Lucene in detail

Search applications design

Performance tuning

Information Retrieval

“Information Retrieval (IR) is the science of searching for documents”

Inverted Index

Boolean Model

Queries and documents are conceived as sets of terms

Q = (T1 OR T2) AND (T3 OR T4)

D1 = {T1, T3}

D2 = {T2, T3, T4}

The result set of a query is a composition of unions and intersections, with union for the OR operator and intersection for the AND operator

R = {D1, D2}
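
The Boolean model example above can be sketched in plain Java, without Lucene. This is a minimal illustration, not how Lucene evaluates queries: the class name BooleanModelSketch and its helper methods are hypothetical, and the inverted index is a simple in-memory map from each term to the set of documents containing it.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class BooleanModelSketch {
    // Inverted index: term -> set of documents containing that term.
    static final Map<String, Set<String>> INDEX = new LinkedHashMap<>();

    static {
        index("D1", "T1", "T3");        // D1 = {T1, T3}
        index("D2", "T2", "T3", "T4");  // D2 = {T2, T3, T4}
    }

    static void index(String doc, String... terms) {
        for (String term : terms) {
            INDEX.computeIfAbsent(term, k -> new TreeSet<String>()).add(doc);
        }
    }

    // Documents containing the term (empty set for an unknown term).
    static Set<String> postings(String term) {
        return new TreeSet<>(INDEX.getOrDefault(term, Collections.<String>emptySet()));
    }

    // OR is a set union.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // AND is a set intersection.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Q = (T1 OR T2) AND (T3 OR T4)
    static Set<String> evaluate() {
        return and(or(postings("T1"), postings("T2")),
                   or(postings("T3"), postings("T4")));
    }

    public static void main(String[] args) {
        System.out.println(evaluate()); // prints [D1, D2]
    }
}
```

Both documents match because each contains at least one term from each OR group.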

Vector Space Model

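Documents and queries are represented as vectors of term weights:

$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$

$q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$

Similarity between a document and the query can then be computed from these two vectors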
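
A common similarity measure in the vector space model is the cosine of the angle between the document vector and the query vector. The sketch below assumes that measure; the class name CosineSimilaritySketch and the sample weights are hypothetical, and Lucene's actual scoring formula contains additional factors.

```java
final class CosineSimilaritySketch {
    /** Cosine of the angle between a document vector d and a query vector q. */
    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double dNorm = 0.0;
        double qNorm = 0.0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * q[i];    // sum of w_{i,j} * w_{i,q}
            dNorm += d[i] * d[i];  // squared length of d
            qNorm += q[i] * q[i];  // squared length of q
        }
        if (dNorm == 0.0 || qNorm == 0.0) {
            return 0.0; // an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(dNorm) * Math.sqrt(qNorm));
    }

    public static void main(String[] args) {
        double[] query = {1.0, 0.0, 1.0};
        double[] doc1 = {1.0, 0.0, 1.0}; // same direction as the query
        double[] doc2 = {0.0, 1.0, 0.0}; // shares no term with the query
        System.out.println(cosine(doc1, query));
        System.out.println(cosine(doc2, query));
    }
}
```

With non-negative weights the result lies between 0 (no term in common) and 1 (same direction), so documents can be ranked by decreasing similarity.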

Lucene

Lucene : where do we come from ?

Version | Release date | Description
0.01 | March 2000 | First open source release (SourceForge)
1.0 | October 2000 |
1.01b | July 2001 | Last SourceForge release
1.2 | June 2002 | First Apache Jakarta release
1.3 | December 2003 | Compound index format, QueryParser enhancements, remote searching, extensible scoring API
1.4 | July 2004 | Sorting, span queries, term vectors
1.4.1 | August 2004 | Bug fix for sorting performance
1.4.2 | October 2004 | IndexSearcher optimization and misc. fixes
1.4.3 | 29 November 2004 | Misc. fixes
1.9.0 | 27 February 2006 | Binary stored fields, DateTools, NumberTools, RangeFilter, RegexQuery, requires Java 1.4
1.9.1 | 2 March 2006 | Bug fix in BufferedIndexOutput
2.0 | 26 May 2006 | Removed deprecated methods
2.1 | 17 February 2007 | Delete/update document in IndexWriter, QueryParser improvements, contrib/benchmark
2.2 | 19 June 2007 | Performance improvements, function queries, payloads, pre-analyzed fields, custom deletion policies
2.3.0 | 24 January 2008 | Performance improvements, custom merge policies and merge schedulers, IndexReader.reopen
2.3.1 | 23 February 2008 | Bug fixes from 2.3.0
2.3.2 | 6 May 2008 | Bug fixes from 2.3.1
2.4.0 | 8 October 2008 | Further performance improvements, transactional semantics, expungeDeletes method
2.4.1 | 9 March 2009 | Bug fixes from 2.4.0
2.9 | 25 September 2009 | New per-segment Collector API, faster search performance, near real-time search, attribute-based analysis
2.9.1 | 6 November 2009 | Bug fixes from 2.9
3.0.0 | 25 November 2009 | Removed deprecated methods, fixed some bugs
3.0.1 and 2.9.2 | 26 February 2010 | Bug fixes from previous minor versions. Both have the same bugfix level

Lucene documentation

Lucene : Simple indexing example

// In-memory index; the analyzer splits text on whitespace only
Directory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
// "company" is stored and indexed as a single token
doc.add(new Field("company", "Xebia", Field.Store.YES, Field.Index.NOT_ANALYZED));
// "country" is stored but not indexed, so it cannot be searched
doc.add(new Field("country", "France", Field.Store.YES, Field.Index.NO));

writer.addDocument(doc);
writer.close();

Lucene : Simple search example

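// Open the directory filled by the indexing example, in read-only mode
IndexSearcher searcher = new IndexSearcher(directory, true);

// Search an indexed field: "country" uses Field.Index.NO and is not searchable
Term t = new Term("company", "Xebia");
Query query = new TermQuery(t);
TopDocs docs = searcher.search(query, 10);

assertEquals(1, docs.totalHits);

searcher.close();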

Lucene - indexing

Lucene - analyzers

Lucene – Field types

Store : YES / NO

Index : NO / ANALYZED / NOT_ANALYZED / ANALYZED_NO_NORMS / NOT_ANALYZED_NO_NORMS

TermVector : NO / WITH_POSITIONS / WITH_OFFSETS / WITH_POSITIONS_OFFSETS / YES

Lucene storage - segments

A new segment is created each time IndexWriter is flushed

When documents are deleted, a marker is added in the current segment

Lucene storage – segments merge

Segments are merged manually with IndexWriter.optimize()

Or merged automatically, depending on each segment's level:

level = (int) (log(max(minMergeMB, size)) / log(mergeFactor))
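
The level formula can be tried out in isolation. This is a minimal sketch of the arithmetic on the slide, not Lucene's merge policy code: the class name MergeLevelSketch is hypothetical, the cast is assumed to apply to the whole ratio, and the sample values (minMergeMB = 1.6, mergeFactor = 10) are assumptions chosen for illustration.

```java
final class MergeLevelSketch {
    /**
     * Level of a segment of the given size in MB.
     * Segments smaller than minMergeMB are all treated as having size minMergeMB.
     */
    static int level(double sizeMB, double minMergeMB, int mergeFactor) {
        double clamped = Math.max(minMergeMB, sizeMB);
        return (int) (Math.log(clamped) / Math.log(mergeFactor));
    }

    public static void main(String[] args) {
        double minMergeMB = 1.6; // assumed floor for small segments
        int mergeFactor = 10;    // assumed number of segments merged at once
        double[] sizes = {0.5, 50.0, 500.0};
        for (double size : sizes) {
            System.out.println(size + " MB -> level " + level(size, minMergeMB, mergeFactor));
        }
    }
}
```

With these values, segments of 0.5 MB, 50 MB and 500 MB fall into levels 0, 1 and 2: each level covers sizes roughly mergeFactor times larger than the previous one.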

Lucene - search

Programmatic API

TermQuery

PhraseQuery

WildcardQuery

RangeQuery

FuzzyQuery

BooleanQuery

Lucene - QueryParser

QueryParser builds a Query object from a user query string

+JUNIT +ANT -MOCK +xebya~0.8 +title:"Junit in action"

Most of the time, it will not fit the application's requirements

Lucene – contrib/QueryParser

Framework that simplifies the creation of a query parser that fits your needs

3 layers :

QueryParser : transforms a query string into an abstract syntax tree representation

QueryNodeProcessor : processes nodes of the tree to move, remove or modify them

QueryBuilder : builds a Lucene BooleanQuery tree from the abstract syntax tree

Lucene – boolean queries

Lucene – PhraseQuery & SpanQuery

SpanQuery : matches documents that contain terms separated by n other terms (n is the ‘slop’)

PhraseQuery : SpanQuery with a slop value of 0

Uses position information
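
The role of position information can be illustrated without Lucene. The sketch below is a simplification under stated assumptions: the class name SlopMatchSketch is hypothetical, text is tokenized on whitespace, only two ordered terms are handled, and the slop is read as a maximum number of intervening terms. Lucene's own slop computation is more general.

```java
import java.util.Locale;

final class SlopMatchSketch {
    /** True if `second` occurs after `first` with at most `slop` other terms in between. */
    static boolean matches(String text, String first, String second, int slop) {
        // Whitespace tokenization; the array index of a token is its position.
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.length; j++) {
                // j - i - 1 is the number of terms between the two positions.
                if (tokens[j].equals(second) && j - i - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String title = "Junit in action";
        System.out.println(matches(title, "junit", "action", 0)); // prints false
        System.out.println(matches(title, "junit", "action", 1)); // prints true
    }
}
```

With a slop of 0 the two terms must be adjacent, which is the phrase case; a slop of 1 tolerates the word "in" between them.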

Lucene storage – approximate queries

Approximate queries (Prefix, Regex, Wildcard, Fuzzy) are rewritten into a set of TermQueries

Dictionary = { court, cours, courir }

FuzzyQuery = cour

TransformedQuery = court OR cours
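
The same rewriting idea applies to prefix and wildcard queries: the pattern is matched against the term dictionary, and the matching terms are OR-ed together. The sketch below is an illustration in plain Java, not Lucene's rewrite code; the class name TermExpansionSketch and its methods are hypothetical, and the dictionary is the three-word example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TermExpansionSketch {
    // Term dictionary of the index.
    static final List<String> DICTIONARY = Arrays.asList("court", "cours", "courir");

    /** Dictionary terms starting with the given prefix. */
    static List<String> expandPrefix(String prefix) {
        List<String> terms = new ArrayList<>();
        for (String term : DICTIONARY) {
            if (term.startsWith(prefix)) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** Dictionary terms matching a wildcard pattern: '*' is any sequence, '?' is one character. */
    static List<String> expandWildcard(String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        Pattern compiled = Pattern.compile(regex.toString());
        List<String> terms = new ArrayList<>();
        for (String term : DICTIONARY) {
            if (compiled.matcher(term).matches()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** Textual form of the rewritten query: one term query per match, joined with OR. */
    static String toQueryString(List<String> terms) {
        return String.join(" OR ", terms);
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(expandPrefix("cour")));    // prints court OR cours OR courir
        System.out.println(toQueryString(expandWildcard("cour?"))); // prints court OR cours
    }
}
```

The cost of such a query grows with the number of dictionary terms that have to be visited and matched, which is why these queries are more expensive than a plain TermQuery.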

Inverted Index

Lucene – Levenshtein distance

FuzzyQuery uses the Levenshtein distance : the minimum number of single-character edits (insertions, deletions, substitutions) required to turn one word into another
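
The Levenshtein distance can be computed with the classic dynamic programming algorithm. The sketch below is a textbook version in plain Java, not the code used inside FuzzyQuery; the class name LevenshteinSketch is hypothetical.

```java
final class LevenshteinSketch {
    /** Minimum number of insertions, deletions and substitutions turning a into b. */
    static int distance(String a, String b) {
        // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building b's prefix from an empty string takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // erasing the first i chars of a takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String[] dictionary = {"court", "cours", "courir"};
        for (String term : dictionary) {
            System.out.println("cour -> " + term + " : " + distance("cour", term));
        }
    }
}
```

For the query "cour", the terms "court" and "cours" are one edit away while "courir" is two edits away, which is consistent with a fuzzy query keeping only the first two.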

Lucene - FuzzyQuery

Current implementation is not optimal

LUCENE-2089 will use a Levenshtein automaton

Prefix length | PQ size | Avg ms (old) | Avg ms (new)
0 | 1024 | 3286.0 | 7.8
0 | 64 | 3320.4 | 7.6
1 | 1024 | 316.8 | 5.6
1 | 64 | 314.3 | 5.6
2 | 1024 | 31.8 | 3.8
2 | 64 | 31.9 | 3.7

Lucene – Highlighter

Produces ready-to-use HTML snippets with the query words highlighted

Can be fully customized

By default, limited to the first 50 * 1024 characters of text

Use FastVectorHighlighter for faster results (~2.5 times faster)

Lucene – FieldCache

Lucene cache that keeps the values of a single field in memory

Used internally by Sort objects

Can be used to manually load the values of a single field :

float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");

Lucene – MoreLikeThis

Finds similar documents

Produces a query to be searched

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] {"title", "author"});
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);

Query query = mlt.like(docId);
indexSearcher.search(query, 10);

Lucene – Function Queries

Allows score customization

Consider using FieldCache to reduce fetching cost

// Reads the "score" field of each document as a byte value
FieldScoreQuery scoreQuery = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
CustomScoreQuery customQ = new CustomScoreQuery(q, scoreQuery) {
    // Combines the score of the sub-query q with the field value
    public float customScore(int doc, float subQueryScore, float valSrcScore) {
        return (float) (Math.sqrt(subQueryScore) * valSrcScore);
    }
};
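
The effect of the custom score on ranking can be checked outside Lucene. The sketch below reuses the formula from the slide on plain arrays indexed by document id; the class name CustomScoreSketch, the method bestDoc and the sample values are hypothetical.

```java
final class CustomScoreSketch {
    /** Same combination as on the slide: sqrt(sub-query score) * field value. */
    static float customScore(float subQueryScore, float valSrcScore) {
        return (float) (Math.sqrt(subQueryScore) * valSrcScore);
    }

    /** Id of the document with the highest custom score (arrays are indexed by document id). */
    static int bestDoc(float[] subQueryScores, float[] fieldValues) {
        int best = -1;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (int doc = 0; doc < subQueryScores.length; doc++) {
            float score = customScore(subQueryScores[doc], fieldValues[doc]);
            if (score > bestScore) {
                bestScore = score;
                best = doc;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[] subQueryScores = {9.0f, 4.0f}; // text relevance of documents 0 and 1
        float[] fieldValues = {1.0f, 2.0f};    // per-document value, e.g. loaded once into memory
        System.out.println(bestDoc(subQueryScores, fieldValues)); // prints 1
    }
}
```

Document 0 has the better text score, but document 1 wins after customization (2 * 2 = 4 against 3 * 1 = 3): the square root dampens the text score so that the field value weighs more.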

Lucene – Luke

Lucene – Global performance tuning

Consider using SSD for low latency

Consider using RAMDirectory / InstantiatedIndex

Use the latest version of Lucene

Use NIOFSDirectory on Unix and MMapDirectory on Windows

Try turning off the compound file format with setUseCompoundFile(false)

Lucene – Indexing performance tuning

Set RAMBufferSizeMB according to your needs

Tune your merge policy with care

Lucene – Search performance tuning

Open IndexReader in read-only mode (default in Lucene 2.9+)

Warm up the FieldCache to ensure immediate access when sorting

Limit use of TermVector

Ensure index is optimized

Architecture with Hibernate Search

Architecture with Solr

Architecture with Infinispan

Lucene – Distributed : Katta

Shards and distributes Lucene index over instances

Uses Hadoop for distribution

Lucene galaxy

Apache Nutch : Lucene + crawling and parsing

Compass : search engine framework

Apache Solr : Lucene standalone search server

Apache Mahout : distributed machine learning

Hibernate Search : Hibernate + Lucene

Katta : Distributed Lucene with Hadoop

Lucene - Futures

Flex Branch : making Lucene even more customizable

Apache Mahout : distributed machine learning for clustering, classification and recommendation algorithms

Questions ?
