Post on 03-Jan-2016

1

Numeric Range Queries with Lucene TrieRange

Uwe Schindler

Lucene Java Contrib Committer

uschindler@apache.org

PANGAEA® - Publishing Network for Geoscientific & Environmental Data

MARUM, Center for Marine Environmental Sciences, Bremen, Germany

2

Problems with current RangeQueries/-Filters

• Classical RangeQuery hits the BooleanQuery.TooManyClauses exception on large ranges and is very slow.

• ConstantScoreRangeQuery is faster, cacheable, but still has to visit a large number of terms.

• Both need to enumerate a large number of terms from TermEnum and then retrieve TermDocs for each term.

• The number of terms to visit grows with the number of documents and unique values in the index (especially for float/double values).

3

TrieRange: How it works

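[Figure: trie of indexed terms. Full-precision terms: 421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644. Lower-precision terms: 42, 44, 52, 63, 64 and 4, 5, 6. Label: "range".]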
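The splitting idea behind the trie can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the contrib implementation: the class and method names (`TrieRangeSplitSketch`, `splitRange`, `countTerms`) are hypothetical, values are assumed to be non-negative and to fit in `valSize` bits (with `valSize` at most 62), and the demo range 0x423..0x642 is an invented example that uses hex digits (precisionStep 4) as a stand-in for the decimal digits of the figure.

```java
import java.util.ArrayList;
import java.util.List;

class TrieRangeSplitSketch {

    /**
     * Splits [min, max] into sub-ranges over prefixes of decreasing precision.
     * Each result is {lowPrefix, highPrefix, shift} and matches every value v
     * with lowPrefix <= (v >>> shift) <= highPrefix.
     * Assumption: 0 <= min, max < 2^valSize and valSize <= 62.
     */
    static List<long[]> splitRange(long min, long max, int valSize, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<long[]> out = new ArrayList<>();
        if (min > max) {
            return out; // empty range
        }
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            // Does the lower/upper edge stick out of a full block at this precision?
            boolean hasLower = (min & mask) != 0L;
            boolean hasUpper = (max & mask) != mask;
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            if (shift + precisionStep >= valSize || nextMin > nextMax) {
                // Lowest precision reached, or nothing left in the middle:
                // emit what remains at the current precision and stop.
                out.add(new long[] {min >>> shift, max >>> shift, shift});
                break;
            }
            if (hasLower) {
                out.add(new long[] {min >>> shift, (min | mask) >>> shift, shift});
            }
            if (hasUpper) {
                out.add(new long[] {(max & ~mask) >>> shift, max >>> shift, shift});
            }
            // Continue with the inner part at the next lower precision.
            min = nextMin;
            max = nextMax;
        }
        return out;
    }

    /** Upper bound on the number of distinct terms the sub-ranges can touch. */
    static long countTerms(List<long[]> ranges) {
        long n = 0;
        for (long[] r : ranges) {
            n += r[1] - r[0] + 1;
        }
        return n;
    }

    public static void main(String[] args) {
        List<long[]> ranges = splitRange(0x423L, 0x642L, 12, 4);
        for (long[] r : ranges) {
            System.out.printf("shift=%d prefixes 0x%X..0x%X%n", r[2], r[0], r[1]);
        }
        System.out.println("candidate terms: " + countTerms(ranges));
    }
}
```

In this sketch the edges of the range are matched at full precision, while the middle is covered by a few short prefixes, so the number of candidate terms stays small even when the range spans many distinct values.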

4

Supported Data Types

• Native data types: long, int (standard Java signed). All “tricks” like padding are not needed! These types are internally made unsigned; each trie precision is generated by stripping off least significant bits (using the precisionStep parameter). Each value is then converted to a sequence of 7-bit ASCII chars; the result is prefixed with the number of bits stripped and indexed as a term. Only 7 bits/char are used because this gives the most efficient bit layout in the index (8 or more bits would split into two or more bytes when UTF-8 encoded).

• double, float: Converter to/from IEEE-754 bit layout that sorts like a signed long/int

• Date/Calendar: Convert to UNIX time stamp with e.g. Date.getTime()

• Money/prices: Do not use float/double (rounding errors); use a long/int representation in cents.
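The conversions in this list can be illustrated with a small, self-contained sketch. It shows one common way to map IEEE-754 bits to integers that sort like the original numbers, and how to make a signed long comparable as unsigned. The method names `doubleToSortableLong` and `floatToSortableInt` mirror those used in the slides, but the bodies here are an assumption about the technique, not a copy of TrieUtils; `SortableBitsSketch`, `toUnsignedOrder` and `toCents` are hypothetical names.

```java
import java.math.BigDecimal;
import java.util.Date;

class SortableBitsSketch {

    /** Maps a double to a long whose signed order matches the numeric order. */
    static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles sort in reverse by magnitude bits: flip all but the sign bit.
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return bits;
    }

    /** Inverse of doubleToSortableLong. */
    static double sortableLongToDouble(long sortable) {
        if (sortable < 0) {
            sortable ^= 0x7fffffffffffffffL;
        }
        return Double.longBitsToDouble(sortable);
    }

    /** Same idea for float and int. */
    static int floatToSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        if (bits < 0) {
            bits ^= 0x7fffffff;
        }
        return bits;
    }

    /** Flips the sign bit so that signed order becomes unsigned order. */
    static long toUnsignedOrder(long value) {
        return value ^ Long.MIN_VALUE;
    }

    /** Prices as an exact count of cents; throws if there are sub-cent digits. */
    static long toCents(String price) {
        return new BigDecimal(price).movePointRight(2).longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(doubleToSortableLong(-1.5) < doubleToSortableLong(2.0));
        System.out.println(toCents("19.99"));
        System.out.println(new Date().getTime()); // dates: milliseconds since the epoch
    }
}
```

The resulting long or int can then be trie-coded like any native value.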

5

Speed

• Upper limit on the number of terms, independent of index size. This value depends only on precisionStep.

• Number of terms by precisionStep: approx. 400 terms with 8 bit, approx. 100 terms with 4 bit, approx. 40 terms with 2 bit.

• Query time: in most cases <100 ms on an index with 500,000 docs and 13 trie fields, precisionStep 8 bit.

6

How to use (indexing)

// add some numerical fields:
long lvalue = 121345L;
TrieUtils.addIndexedFields(doc, "exampleLong",
    TrieUtils.trieCodeLong(lvalue, precisionStep));

double dvalue = 1.057E17;
TrieUtils.addIndexedFields(doc, "exampleDouble",
    TrieUtils.trieCodeLong(TrieUtils.doubleToSortableLong(dvalue), precisionStep));

int ivalue = 121345;
TrieUtils.addIndexedFields(doc, "exampleInt",
    TrieUtils.trieCodeInt(ivalue, precisionStep));

float fvalue = 1.057E17f;
TrieUtils.addIndexedFields(doc, "exampleFloat",
    TrieUtils.trieCodeInt(TrieUtils.floatToSortableInt(fvalue), precisionStep));

Date datevalue = new Date(); // current time
TrieUtils.addIndexedFields(doc, "exampleDate",
    TrieUtils.trieCodeLong(datevalue.getTime(), precisionStep));
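To make concrete what gets indexed for a single value, here is a hedged sketch of generating one term per precision. It follows the description on the data types slide (strip least significant bits, emit 7-bit chars, prefix with the number of bits stripped), but the exact term layout is an assumption: the marker character `0x20 + shift` and the names `TrieTermsSketch`, `prefixCode` and `trieTerms` are hypothetical and may differ from what TrieUtils produces.

```java
import java.util.ArrayList;
import java.util.List;

class TrieTermsSketch {

    /**
     * Encodes value with `shift` low bits stripped as a string of 7-bit chars.
     * Assumption: the first char encodes the shift as (0x20 + shift).
     */
    static String prefixCode(long value, int shift) {
        // Flip the sign bit so signed order equals unsigned (lexicographic) order.
        long bits = (value ^ Long.MIN_VALUE) >>> shift;
        int nChars = (63 - shift) / 7 + 1;   // 7-bit groups needed for the remaining bits
        char[] buf = new char[nChars + 1];
        buf[0] = (char) (0x20 + shift);      // marker: number of bits stripped
        for (int i = nChars; i >= 1; i--) {  // fill from least to most significant group
            buf[i] = (char) (bits & 0x7f);
            bits >>>= 7;
        }
        return new String(buf);
    }

    /** One term per precision: shift = 0, precisionStep, 2 * precisionStep, ... */
    static List<String> trieTerms(long value, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(prefixCode(value, shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms = trieTerms(121345L, 8);
        System.out.println(terms.size() + " terms for precisionStep 8");
        for (String t : terms) {
            System.out.println("shift=" + (t.charAt(0) - 0x20) + " length=" + t.length());
        }
    }
}
```

Because every char stays below 0x80, each one occupies a single byte in UTF-8, and terms with the same shift compare lexicographically in the same order as the underlying numbers.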

7

How to use (searching)

// Java 1.4, because Long.valueOf(long) is not available:
Query q = new LongTrieRangeFilter("exampleLong",
    precisionStep, new Long(123L), new Long(999999L), true, true).asQuery();

// OR, Java 1.5, using autoboxing:
Query q = new LongTrieRangeFilter("exampleLong",
    precisionStep, 123L, 999999L, true, true).asQuery();

// execute the search, as usual:
TopDocs docs = searcher.search(q, 10);
for (int i = 0; i < docs.scoreDocs.length; i++) {
    Document doc = searcher.doc(docs.scoreDocs[i].doc);
    System.out.println(doc.get("exampleString"));
    // decode a prefix coded, stored field:
    System.out.println(TrieUtils.prefixCodedToLong(
        doc.get("exampleLong2")));
}

8

Future Developments

• Current state: A helper field is needed for the lower-precision values (because of sorting). There are some ideas for fixing this (see recent discussions on java-dev).

• Planned: A nicer, more GC-friendly API with more flexibility on indexing: trieCodeLong() and trieCodeInt() return a TokenStream that can be indexed into one field with custom options (Solr implements this with a wrapper at the moment).

• Move to core, with a more user-friendly name (NumberRangeQuery, NumberUtils)?

9

Demonstration

• www.pangaea.de (main site)

• www.wdc-mare.org (displays query time)

10

Thank You!