Recent Additions to Lucene's Arsenal


Upload: lucenerevolution

Post on 10-May-2015


Category:

Technology



DESCRIPTION

Presented by Shai Erera, Researcher, IBM. Lucene's arsenal has recently expanded to include two new modules: Index Sorting and Replication. Index sorting lets you keep an index consistently sorted by some criterion (e.g. modification date). This allows efficient early termination of searches as well as better index compression. Index replication lets you replicate a search index to achieve high availability and fault tolerance, and to take hot index backups. In this talk we introduce these modules and discuss implementation and design details as well as best practices.

TRANSCRIPT

Page 1: Recent Additions to Lucene Arsenal
Page 2: Recent Additions to Lucene Arsenal

Recent Additions to Lucene’s Arsenal

Shai Erera, Researcher, IBM

Adrien Grand, ElasticSearch

Page 3: Recent Additions to Lucene Arsenal

• Shai Erera
– Working at IBM – Information Retrieval Research
– Lucene/Solr committer and PMC member
– http://shaierera.blogspot.com
– [email protected]

• Adrien Grand
– @jpountz
– Lucene/Solr committer and PMC member
– Software engineer at Elasticsearch

Who We Are

Page 4: Recent Additions to Lucene Arsenal

The Replicator

Page 5: Recent Additions to Lucene Arsenal

Load Balancing

Load Balancer

Page 6: Recent Additions to Lucene Arsenal

Failover

Page 7: Recent Additions to Lucene Arsenal

Index Backup

Page 8: Recent Additions to Lucene Arsenal

The Replicator

Primary

Backup

Backup

http://shaierera.blogspot.com/2013/05/the-replicator.html

[Diagram labels: Replicator, ReplicationClient, ReplicationClient]

Page 9: Recent Additions to Lucene Arsenal

• Replicator
– Mediates between the client and server
– Manages the published Revisions
– Implementation for replication over HTTP

• Revision
– Describes a list of files and metadata
– Responsible for keeping the files available for as long as clients are replicating it

• ReplicationClient
– Performs the replication operation on the replica side
– Copies delta files and invokes ReplicationHandler upon successful copy
– Always replicates the latest revision

• ReplicationHandler
– Acts on the copied files

Replication Components
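The "copies delta files" step can be pictured as a set difference between the files a revision lists and the files the replica already holds. The sketch below is plain Java, not the replicator module's code: the class `DeltaFiles`, the method `filesToCopy` and the file names are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: decides which files of a
// revision a replica still has to fetch.
class DeltaFiles {

    // Returns the revision files that are missing on the replica, in revision order.
    static List<String> filesToCopy(List<String> revisionFiles, Set<String> localFiles) {
        List<String> delta = new ArrayList<>();
        for (String file : revisionFiles) {
            if (!localFiles.contains(file)) {
                delta.add(file);
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        List<String> revision = Arrays.asList("_0.cfs", "_1.cfs", "segments_2");
        Set<String> local = new HashSet<>(Arrays.asList("_0.cfs", "segments_1"));
        System.out.println(filesToCopy(revision, local)); // [_1.cfs, segments_2]
    }
}
```

Because segment files are immutable, comparing by file name is enough in this toy model; a real client would presumably also need to guard against partially copied files.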

Page 10: Recent Additions to Lucene Arsenal

• IndexRevision
– Obtains a snapshot on the last commit through SnapshotDeletionPolicy
– Released when revision is released by Replicator

• IndexReplicationHandler
– Copies the files to the index directory and fsyncs them
– Aborts (rollback) on any error
– Upon successful completion, invokes a callback (e.g. SearcherManager.maybeRefresh())

• Similar extensions for faceted index replication
– IndexAndTaxonomyRevision: obtains snapshots on both the search and taxonomy indexes
– IndexAndTaxonomyReplicationHandler: copies the files to the respective directories, keeping both in sync

Index Replication
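The copy / fsync / rollback / callback sequence described for the handler can be sketched with nothing but `java.nio.file`. This is an illustration of the flow, not the code of IndexReplicationHandler: `CopyThenNotify` and `apply` are hypothetical names, and the rollback here simply deletes whatever was copied in the failed attempt.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical stand-in for a replication handler; not Lucene's implementation.
class CopyThenNotify {

    // Copies the named files from sourceDir to indexDir and fsyncs each copy.
    // On any I/O error the files copied so far are deleted and the error is
    // rethrown; the callback runs only if every file was copied.
    static void apply(Path sourceDir, Path indexDir, List<String> files,
                      Callable<Boolean> callback) throws Exception {
        List<Path> copied = new ArrayList<>();
        try {
            for (String name : files) {
                Path target = indexDir.resolve(name);
                Files.copy(sourceDir.resolve(name), target);
                copied.add(target);
                try (FileChannel channel = FileChannel.open(target, StandardOpenOption.WRITE)) {
                    channel.force(true); // fsync the copied file
                }
            }
        } catch (IOException e) {
            for (Path path : copied) {
                Files.deleteIfExists(path); // roll back the partial copy
            }
            throw e;
        }
        callback.call(); // e.g. refresh the searchers
    }
}
```

Running the callback only after every file is in place keeps searchers from ever being refreshed against a half-copied revision.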

Page 11: Recent Additions to Lucene Arsenal

Sample Code

// Server-side: publish a new Revision
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));

// Client-side: replicate a Revision
Replicator replicator; // either LocalReplicator or HttpReplicator

// refresh SearcherManager after index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
  public Boolean call() throws Exception {
    // index was updated, refresh manager
    return searcherManager.maybeRefresh();
  }
};

ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);
ReplicationClient client = new ReplicationClient(replicator, handler, factory);

client.updateNow(); // invoke client manually
// -- OR --
// check for updates every 30 seconds; the second argument names the update thread
client.startUpdateThread(30000, "replication-client");

Page 12: Recent Additions to Lucene Arsenal

• Resume
– Session level: don’t copy files that were already successfully copied
– File level: don’t copy file parts that were already successfully copied

• Parallel Replication
– Copy revision files in parallel

• Other replication strategies
– Peer-to-peer

Future Work

Page 13: Recent Additions to Lucene Arsenal

Index Sorting
How to trade indexing speed for search speed

Page 14: Recent Additions to Lucene Arsenal

Index = collection of immutable segments

Segments store documents sequentially on disk

Add data = create a new segment

Segments get eventually merged together

Order of segments / documents in segments doesn’t matter
– the following segments are equivalent

Anatomy of a Lucene index

Price: 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13
Id:    1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12

Price: 13 10 10 9 8 8 7 6 4 4 3 2 2 1 1 1 0
Id:    12 9 31 1 4 11 10 30 15 18 8 7 20 42 99 5 3

Page 15: Recent Additions to Lucene Arsenal

ordinal of a doc in a segment = doc id

used in the inverted index to refer to docs

Anatomy of a Lucene index

Price:  9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13
Id:     1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12
doc id: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

shoe → 1, 3, 5, 8, 11, 13, 15
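A postings list such as the "shoe" entry is just the ascending list of doc ids (ordinals within the segment) of the documents that contain the term. A toy construction, under the assumption that each document is an array of already-tokenized terms; `ToyPostings` is an illustrative class, not Lucene's on-disk structure:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index over one segment; a sketch, not Lucene's data structure.
class ToyPostings {

    // Maps each term to the ascending list of doc ids of the docs containing it.
    static Map<String, List<Integer>> build(String[][] docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId]) {
                List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // skip the duplicate when a term repeats within the same doc
                if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                    list.add(docId);
                }
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        String[][] docs = {{"red", "shoe"}, {"hat"}, {"shoe", "shoe"}};
        System.out.println(build(docs)); // {hat=[1], red=[0], shoe=[0, 2]}
    }
}
```

Since postings refer to documents by position, reordering the documents of a segment means rewriting every postings list, which is why sorting has to happen when a segment is written.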

Page 16: Recent Additions to Lucene Arsenal

Get top N=2 results:
– Create a priority queue of size N
– Accumulate matching docs

Price: 9 0 7 8 2 2 1 8 10 3 4 4 6 10 1 1 13
Id:    1 3 10 4 7 20 42 11 9 8 15 18 30 31 99 5 12

Priority queue (ids) after each match: () (3) (3,4) (4,20) (4,9) (4,9) (9,31) (9,31)

– Start with an empty priority queue
– When the queue overflows, the least entry is removed automatically

Top hits
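The queue states on this slide can be reproduced with a small `java.util.PriorityQueue` sketch. The arrays are the slide's Price and Id rows plus the "shoe" postings from the previous slide; `TopHits` and `topN` are illustrative names, and breaking price ties in favor of the lower doc id is an assumption made here so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of top-N collection on an unsorted segment (slide data).
class TopHits {
    static final int[] PRICE = {9, 0, 7, 8, 2, 2, 1, 8, 10, 3, 4, 4, 6, 10, 1, 1, 13};
    static final int[] ID = {1, 3, 10, 4, 7, 20, 42, 11, 9, 8, 15, 18, 30, 31, 99, 5, 12};
    static final int[] SHOE = {1, 3, 5, 8, 11, 13, 15}; // doc ids matching "shoe"

    // Returns the ids of the n highest-priced matches, best first.
    static List<Integer> topN(int[] matches, int n) {
        // Min-heap on price: the head is the weakest hit currently kept.
        // On equal prices the higher doc id counts as weaker (assumption).
        PriorityQueue<Integer> pq = new PriorityQueue<>((a, b) ->
                PRICE[a] != PRICE[b] ? Integer.compare(PRICE[a], PRICE[b])
                                     : Integer.compare(b, a));
        for (int doc : matches) {
            pq.add(doc);
            if (pq.size() > n) {
                pq.poll(); // overflow: drop the least entry
            }
        }
        List<Integer> ids = new ArrayList<>();
        while (!pq.isEmpty()) {
            ids.add(ID[pq.poll()]); // weakest first
        }
        Collections.reverse(ids);
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(topN(SHOE, 2)); // [9, 31]
    }
}
```

Every one of the seven matches has to be visited, because on an unsorted segment any later document might still beat the current top hits.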

Page 17: Recent Additions to Lucene Arsenal

Let’s do the same on a sorted index

Early termination

Price: 13 10 10 9 8 8 7 6 4 4 3 2 2 1 1 1 0
Id:    12 9 31 1 4 11 10 30 15 18 8 7 20 42 99 5 3

Priority queue (ids) after each match: () (9) (9,31) (9,31) (9,31) (9,31) (9,31) (9,31)

The priority queue never changes after the second matching document
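On the sorted segment the first N matches are already the top N, so collection can stop right there. A minimal sketch on the slide's sorted Id row; the postings in `SHOE` are not on the slide but were derived here by locating the ids of the "shoe" matches in the sorted row, and `EarlyTermination` / `firstN` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of early termination on a segment sorted by descending price.
class EarlyTermination {
    static final int[] ID = {12, 9, 31, 1, 4, 11, 10, 30, 15, 18, 8, 7, 20, 42, 99, 5, 3};
    // doc ids of the "shoe" matches in the sorted segment (derived, see lead-in)
    static final int[] SHOE = {1, 2, 4, 9, 12, 15, 16};

    // Returns the ids of the first n matches; no priority queue is needed.
    static List<Integer> firstN(int[] matches, int n) {
        List<Integer> hits = new ArrayList<>();
        if (n <= 0) {
            return hits;
        }
        for (int doc : matches) {
            hits.add(ID[doc]);
            if (hits.size() == n) {
                break; // later docs are sorted after these and cannot rank higher
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(firstN(SHOE, 2)); // [9, 31]
    }
}
```

Only two of the seven matches are visited, and the result is the same as the full priority-queue pass over the unsorted segment.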

Page 18: Recent Additions to Lucene Arsenal

Pros
– makes finding the top hits much faster
– file-system cache-friendly

Cons
– only works for static ranks
  – not if the sort order depends on the query
– requires the index to be sorted
– doesn’t work for tasks that require visiting every doc:
  – total number of matches
  – faceting

Early termination

Page 19: Recent Additions to Lucene Arsenal

Not uncommon!

Graph-based ranks
– Google’s PageRank

Facebook social search / Unicorn
– https://www.facebook.com/publications/219621248185635

Many more...

Doesn’t need to be the exact sort order
– heuristics when score is only a function of the static rank

Static ranks

Page 20: Recent Additions to Lucene Arsenal

A live index can’t be kept sorted
– would require inserting docs between existing docs!
– segments are immutable

Offline sorting to the rescue:
– index as usual
– sort into a new index
– search!

Pros/cons
– super fast to search, the whole index is fully sorted
– but only works for static content

Offline sorting
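Conceptually, "sort into a new index" boils down to computing a permutation of doc ids and rewriting everything that refers to documents by doc id. The sketch below shows that idea in plain Java on the price values from the earlier slides; `SortPermutation`, `newToOld` and `remapPostings` are hypothetical names, not the sorter API.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: the doc id permutation behind sorting a segment by descending value.
class SortPermutation {

    // newToOld[i] is the old doc id of the doc that ends up at position i.
    static int[] newToOld(int[] values) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Arrays.sort on objects is stable, so equal values keep their original order.
        Arrays.sort(order, Comparator.comparingInt((Integer d) -> values[d]).reversed());
        int[] result = new int[order.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }

    // Rewrites a postings list (old doc ids) into ascending new doc ids.
    static int[] remapPostings(int[] postings, int[] newToOld) {
        int[] oldToNew = new int[newToOld.length];
        for (int newId = 0; newId < newToOld.length; newId++) {
            oldToNew[newToOld[newId]] = newId;
        }
        int[] remapped = new int[postings.length];
        for (int i = 0; i < postings.length; i++) {
            remapped[i] = oldToNew[postings[i]];
        }
        Arrays.sort(remapped);
        return remapped;
    }

    public static void main(String[] args) {
        int[] price = {9, 0, 7, 8, 2, 2, 1, 8, 10, 3, 4, 4, 6, 10, 1, 1, 13};
        int[] perm = newToOld(price);
        int[] shoe = {1, 3, 5, 8, 11, 13, 15};
        System.out.println(Arrays.toString(remapPostings(shoe, perm)));
    }
}
```

Applying `newToOld` to the Id row of the unsorted segment yields the Id row of the sorted segment shown on the "Anatomy of a Lucene index" slide.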

Page 21: Recent Additions to Lucene Arsenal

Offline Sorting

// open a reader on the unsorted index and create a sorted (but slow) view
DirectoryReader reader = DirectoryReader.open(in);
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
AtomicReader sortedReader = SortingAtomicReader.wrap(
    SlowCompositeReaderWrapper.wrap(reader), sorter);

// copy the content of the sorted reader to the new dir
IndexWriter writer = new IndexWriter(out, iwConf);
writer.addIndexes(sortedReader);
writer.close();
reader.close();

Page 22: Recent Additions to Lucene Arsenal

Sort segments independently
– wouldn’t require inserting data into existing segments
– collection could still be early-terminated on a per-segment basis

But segments are immutable
– must be sorted before they are written

Online sorting?

Page 23: Recent Additions to Lucene Arsenal

2 sources of segments
– flush
– merge

flushed segments can’t be sorted
– Lucene writes stored fields to disk on the fly
– could be buffered but this would require a lot of memory

merged segments can be sorted
– create a sorted view over the segments to merge
– pass this view to SegmentMerger instead of the original segments

not a bad trade-off
– flushed segments are usually small & fast to collect

Online sorting?
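The flush/merge split can be modeled with plain arrays: flushed segments keep their arrival order, while a merge reads its inputs through a sorted view and therefore writes a sorted segment. A toy sketch, with each segment reduced to an array of sort values; `SortedMerge` and its methods are illustrative names and this is not how SegmentMerger works internally.

```java
import java.util.Arrays;
import java.util.Comparator;

// Toy model: merging unsorted (flushed) segments into one sorted segment.
class SortedMerge {

    // Merges the given segments into a single segment sorted by descending value.
    static int[] merge(int[]... segments) {
        return Arrays.stream(segments)
                .flatMapToInt(segment -> Arrays.stream(segment))
                .boxed()
                .sorted(Comparator.reverseOrder())
                .mapToInt(Integer::intValue)
                .toArray();
    }

    // True if the segment is in descending order, i.e. can be early-terminated.
    static boolean isSortedDescending(int[] segment) {
        for (int i = 1; i < segment.length; i++) {
            if (segment[i - 1] < segment[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] flushedA = {9, 0, 7};
        int[] flushedB = {8, 2};
        int[] merged = merge(flushedA, flushedB);
        System.out.println(Arrays.toString(merged));      // [9, 8, 7, 2, 0]
        System.out.println(isSortedDescending(flushedA)); // false
        System.out.println(isSortedDescending(merged));   // true
    }
}
```

The immutability constraint is respected because the sorted order is decided before the merged segment is written; the flushed inputs are never modified.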

Page 24: Recent Additions to Lucene Arsenal

Online sorting?

Flushed segments
– NRT reopens
– RAM buffer size limit hit

Merged segments

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Merged segments can easily take 99+% of the size of the index

Page 25: Recent Additions to Lucene Arsenal

Online Sorting

IndexWriterConfig iwConf = new IndexWriterConfig(...);

// original MergePolicy finds the segments to merge
MergePolicy origMP = iwConf.getMergePolicy();

// SortingMergePolicy wraps the segments with a sorted view
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter);

// setup IndexWriter to use SortingMergePolicy
iwConf.setMergePolicy(sortingMP);
IndexWriter writer = new IndexWriter(dir, iwConf);

// index as usual

Page 26: Recent Additions to Lucene Arsenal

Collect top N matches

Offline sorting
– index sorted globally
– early terminate after N matches have been collected
– no priority queue needed!

Online sorting
– no early termination on flushed segments
– early termination on merged segments
  – if N matches have been collected
  – or if current match is less than the top of the PQ

Early termination
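The two termination conditions for online sorting can be tried out on a toy model in which a segment is an array of sort values plus a flag saying whether it came out of a sorting merge. `MixedSegmentsCollector` and `Segment` are hypothetical names for this sketch and do not use Lucene's Collector API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: top-N collection over a mix of unsorted and sorted segments.
class MixedSegmentsCollector {

    static final class Segment {
        final int[] values;   // sort value of each matching doc, in doc id order
        final boolean sorted; // true if values are in descending order
        Segment(int[] values, boolean sorted) {
            this.values = values;
            this.sorted = sorted;
        }
    }

    private final PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap
    private final int n;
    int visited; // number of docs actually looked at

    MixedSegmentsCollector(int n) {
        this.n = n;
    }

    void collect(Segment segment) {
        int keptFromSegment = 0;
        for (int value : segment.values) {
            if (segment.sorted && keptFromSegment >= n) {
                return; // N hits taken from this sorted segment: the rest cannot compete
            }
            visited++;
            if (pq.size() < n) {
                pq.add(value);
                keptFromSegment++;
            } else if (value > pq.peek()) {
                pq.poll();
                pq.add(value);
                keptFromSegment++;
            } else if (segment.sorted) {
                return; // current hit does not beat the weakest kept hit, nor will later ones
            }
        }
    }

    // Returns the collected values, best first.
    List<Integer> top() {
        List<Integer> result = new ArrayList<>(pq);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

For example, with N=2, an unsorted segment {2, 9, 4} followed by a sorted segment {13, 10, 10, 8, 7, 1} is collected by visiting 5 of the 9 documents: all three docs of the flushed segment, then only the first two of the merged one.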

Page 27: Recent Additions to Lucene Arsenal

Early Termination

class MyCollector extends Collector {

  // fields, constructor and the remaining Collector methods are omitted

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter);
    collected = 0;
  }

  @Override
  public void collect(int doc) throws IOException {
    if (readerIsSorted && (++collected >= maxDocsToCollect || curVal <= pq.top())) {
      // Special exception that tells IndexSearcher to terminate
      // collection of the current segment
      throw new CollectionTerminatedException();
    } else {
      // collect hit
    }
  }
}

Page 28: Recent Additions to Lucene Arsenal

Questions?