Lucene and Bloom-Filtered Segments

Performance improvements to be gained from “knowing what we don’t know”

Mark Harwood


Posted on 19-Dec-2014


DESCRIPTION

A 2x performance improvement to low-frequency term searches, e.g. primary-key lookups

TRANSCRIPT

Page 1: Lucene with Bloom filtered segments

Lucene and Bloom-Filtered Segments

Performance improvements to be gained from “knowing what we don’t know”

Mark Harwood

Page 2: Lucene with Bloom filtered segments

Benefits

2x speed-up on primary-key lookups

Small speed-up on general text searches (1.06x)

Optimised memory overhead

Minimal impact on indexing speeds

Minimal extra disk space

Page 3: Lucene with Bloom filtered segments

Approach

One appropriately sized bitset is held per segment, per Bloom-filtered field

e.g. 4 segments x 2 filtered fields = 8 bitsets

Segment 1

Segment 2

Segment 3

Segment 4

URL 000010001000000101000001

PKey 0010000001001001000001

URL 000010001000001

PKey 001000000100001

URL 000000001

PKey 001000001

URL 000000001

PKey 001000001
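The per-segment, per-field layout above can be sketched with standard Java collections. This is an illustrative sketch only; the class and method names are hypothetical, not the classes used in the actual patch:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: one bitset per segment, per Bloom-filtered field.
public class SegmentBloomFilters {
    // segment name -> (field name -> bitset)
    private final Map<String, Map<String, BitSet>> filters = new HashMap<>();

    public void addFilter(String segment, String field, int bitsetSize) {
        filters.computeIfAbsent(segment, s -> new HashMap<>())
               .put(field, new BitSet(bitsetSize));
    }

    public int totalBitsets() {
        return filters.values().stream().mapToInt(Map::size).sum();
    }

    public static void main(String[] args) {
        SegmentBloomFilters f = new SegmentBloomFilters();
        for (int seg = 1; seg <= 4; seg++) {            // 4 segments
            f.addFilter("segment" + seg, "URL", 1024);  // x 2 filtered fields
            f.addFilter("segment" + seg, "PKey", 1024);
        }
        System.out.println(f.totalBitsets());           // prints 8
    }
}
```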

Page 4: Lucene with Bloom filtered segments

Fail-fast searches: modified TermInfosReader

int hash = searchTerm.hashCode();

int bitIndex = (hash & 0x7FFFFFFF) % bitsetSize; // mask the sign bit so the index is never negative

if (!bitset.get(bitIndex)) return null; // term is definitely absent – skip the dictionary lookup

// term might be in index – continue as normal search

Segment 1

URL 000010001000000101000001

PKey 0010000001001001000001

The filter is most effective on fields with many low doc-frequency terms, or in scenarios where query terms often don’t exist in the index.

An unset bit guarantees the term is missing from the segment and a search can be avoided.
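The no-false-negative guarantee can be demonstrated with a minimal single-hash filter in plain Java. This is a sketch of the idea using java.util.BitSet, with illustrative names, not the patch's actual code:

```java
import java.util.BitSet;

// Minimal single-hash Bloom filter sketch: an unset bit proves absence,
// a set bit only means the term *might* be present (hash collisions).
public class FailFastFilter {
    private final BitSet bits;
    private final int size;

    public FailFastFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int bitIndex(String term) {
        // Mask off the sign bit so the modulo is never negative.
        return (term.hashCode() & 0x7FFFFFFF) % size;
    }

    public void add(String term) {
        bits.set(bitIndex(term));
    }

    // false => term is definitely NOT in the segment: the search can be avoided.
    // true  => term may be present: fall through to the normal lookup.
    public boolean mightContain(String term) {
        return bits.get(bitIndex(term));
    }
}
```

Every indexed term sets its bit, so mightContain can never return false for a term that was added; a spurious true only costs a wasted dictionary lookup.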

Page 5: Lucene with Bloom filtered segments

Memory efficiency

Bitset sizes are automatically tuned according to:

1. the volume of terms in the segment

2. desired saturation settings (more sparse = more accurate)

Segment 1

Segment 2

Segment 3

Segment 4

URL 000010001000000101000001

PKey 0010000001001001000001

URL 000010001000001

PKey 001000000100001

URL 000000001

PKey 001000001

URL 000010000

PKey 001000001
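A plausible sizing rule for a single-hash filter follows from the two inputs above (this is an assumption for illustration; the patch's actual heuristic may differ): with numTerms distinct terms each setting at most one bit, the saturation is at most numTerms / size, so keeping saturation below a target requires roughly numTerms / targetSaturation bits.

```java
// Illustrative sizing heuristic, NOT the patch's actual tuning code:
// saturation (fraction of set bits) <= numTerms / size for a single-hash
// filter, so size >= numTerms / targetSaturation keeps the bitset sparse.
public class BitsetSizer {
    public static int sizeFor(int numTerms, double targetSaturation) {
        if (targetSaturation <= 0 || targetSaturation >= 1) {
            throw new IllegalArgumentException("saturation must be in (0,1)");
        }
        return (int) Math.ceil(numTerms / targetSaturation);
    }
}
```

For example, 1,000,000 terms at a 10% target saturation gives a 10,000,000-bit bitset, about 1.2 MB per segment per field.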

Page 6: Lucene with Bloom filtered segments

Indexing: a modified TermInfosWriter

00000000000000000000000000000000001000000000000000000000000000010000000

000000000000001000000000010000000

Term writes are gathered in a large bitset

The final flush operation consolidates the information in the big bitset into a suitably compact bitset for storage on disk, based on how many set bits were accumulated. This re-mapping saves disk space and the RAM required when servicing queries.
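The downsizing step can be sketched as re-mapping each set bit of the large in-memory bitset into a smaller one by taking its index modulo the new size (an illustration of the idea, not the patch's exact code). Folding can only turn clear bits into set bits, so the no-false-negative guarantee survives; the new size should evenly divide the original size so that query-time hashing modulo the new size lands on the folded bit.

```java
import java.util.BitSet;

public class BitsetFolder {
    // Fold a large accumulation bitset down to newSize bits.
    // Each set bit i maps to bit (i % newSize); collisions merge, which may
    // add false positives but never false negatives. newSize should divide
    // the original size so (hash % bigSize) % newSize == hash % newSize.
    public static BitSet fold(BitSet big, int newSize) {
        BitSet small = new BitSet(newSize);
        for (int i = big.nextSetBit(0); i >= 0; i = big.nextSetBit(i + 1)) {
            small.set(i % newSize);
        }
        return small;
    }
}
```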

Page 7: Lucene with Bloom filtered segments

Notes

See JIRA LUCENE-4069 for the patch against Lucene 3.6.

Core modifications pass the existing 3.6 JUnit tests (but without exercising any Bloom-filtering logic).

Benchmarks contrasting Bloom-filtered indexes with non-filtered are here: http://goo.gl/X7QqU

TODOs

Currently relies on a field naming convention to introduce a Bloom filter to the index (append “_blm” to the end of the indexed field name when writing).

How to properly declare the need for a Bloom filter? Changes to IndexWriterConfig? A new Fieldable/FieldInfo setting? Dare I invoke the “schema” word?

Where to expose tuning settings, e.g. saturation preferences?

Can we give some up-front hints to TermInfosWriter about the size of the segment being written, so the initial choice of BitSet size can be reduced?

Formal JUnit tests are required to exercise Bloom-filtered indexes – no false negatives. Can this be covered as part of the existing random testing frameworks which exercise various index config options?