Numeric Range Queries with Lucene TrieRange

Uwe Schindler
Lucene Java Contrib Committer
uschindler@apache.org

PANGAEA® - Publishing Network for Geoscientific & Environmental Data

MARUM, Center for Marine Environmental Sciences, Bremen, Germany

Problems with Current RangeQueries/-Filters

• Classical RangeQuery throws BooleanQuery.TooManyClauses on large ranges and is very slow.

• ConstantScoreRangeQuery is faster and cacheable, but still has to visit a large number of terms.

• Both need to enumerate a large number of terms from TermEnum and then retrieve TermDocs for each term.

• The number of terms to visit grows with the number of documents and unique values in the index (especially for float/double values).

TrieRange: How it works

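[Figure: trie of indexed terms at three precisions. Full-precision terms: 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644. Lower-precision prefix terms include 44, 52, 63 and 4, 5, 6. A marked "range" spans part of the tree.]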
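A rough sketch of how a range can be covered by a few prefix terms of different precision instead of every full-precision term. It is not the TrieRange source: the class and method names are hypothetical, it uses hexadecimal digits (4 bits per level) rather than the decimal digits of the slide, and it assumes non-negative bounds below 2^valSize with valSize of at most 62 bits.

```java
import java.util.ArrayList;
import java.util.List;

final class TrieRangeSplitSketch {

    /** All prefix terms from (min >>> shift) to (max >>> shift) at one precision. */
    static final class SubRange {
        final long min;
        final long max;
        final int shift;

        SubRange(long min, long max, int shift) {
            this.min = min;
            this.max = max;
            this.shift = shift;
        }

        /** Upper limit on the number of distinct terms this sub-range touches. */
        long termCount() {
            return (max >>> shift) - (min >>> shift) + 1;
        }
    }

    /** Splits [min, max] into edge sub-ranges at high precision and a centre at low precision. */
    static List<SubRange> split(int valSize, int precisionStep, long min, long max) {
        List<SubRange> out = new ArrayList<>();
        if (min > max) {
            return out;
        }
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            // An edge is needed when the bound does not sit on a block boundary.
            boolean hasLower = (min & mask) != 0L;
            boolean hasUpper = (max & mask) != mask;
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            if (shift + precisionStep >= valSize || nextMin > nextMax) {
                // Lowest precision reached, or nothing left in the centre.
                out.add(new SubRange(min, max, shift));
                break;
            }
            if (hasLower) {
                out.add(new SubRange(min, min | mask, shift));
            }
            if (hasUpper) {
                out.add(new SubRange(max & ~mask, max, shift));
            }
            min = nextMin;
            max = nextMax;
        }
        return out;
    }

    public static void main(String[] args) {
        for (SubRange r : split(12, 4, 0x123L, 0x642L)) {
            System.out.println("shift=" + r.shift
                + " prefixes " + Long.toHexString(r.min >>> r.shift)
                + ".." + Long.toHexString(r.max >>> r.shift));
        }
    }
}
```

For the hypothetical range 0x123 to 0x642 this produces five sub-ranges that together touch at most 37 terms, while the range contains 1312 full-precision values.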

Supported Data Types

• Native data types: long, int (standard Java signed). “Tricks” like padding are not needed! These types are internally made unsigned, and each trie precision is generated by stripping off least significant bits (controlled by the precisionStep parameter). Each value is then converted to a sequence of 7-bit ASCII chars, prefixed with the number of bits stripped, and indexed as a term. Only 7 bits per char are used because this gives the most efficient bit layout in the index (chars with 8 or more bits would be split into two or more bytes when UTF-8 encoded).

• double, float: Converter to/from the IEEE-754 bit layout that sorts like a signed long/int

• Date/Calendar: Convert to a timestamp (milliseconds since the UNIX epoch) with e.g. Date.getTime()

• Money/prices: Do not use float/double (rounding); use a long/int representation of cents
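The conversions listed above can be sketched in plain Java. This is a minimal illustration, not the TrieUtils implementation: all names are hypothetical, the 7-bit char packing and the shift prefix are left out, and the double mapping is one common way to get a sortable bit layout, assumed here rather than taken from the slides.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Date;

final class TrieEncodingSketch {

    /** Maps a double to a long whose signed order matches the order of the doubles. */
    static long sortableLongFromDouble(double value) {
        long bits = Double.doubleToLongBits(value);
        // Raw bits of negative doubles sort in reverse; flipping the lower
        // 63 bits restores ascending order.
        return bits < 0 ? bits ^ 0x7fffffffffffffffL : bits;
    }

    /** Flips the sign bit so that unsigned comparison matches signed order. */
    static long toUnsignedOrder(long value) {
        return value ^ 0x8000000000000000L;
    }

    /** One lower-precision version of the value per shift: 0, step, 2*step, ... */
    static long[] precisionLevels(long value, int precisionStep) {
        long unsigned = toUnsignedOrder(value);
        long[] out = new long[63 / precisionStep + 1];
        for (int i = 0, shift = 0; shift < 64; i++, shift += precisionStep) {
            out[i] = unsigned >>> shift; // strip `shift` least significant bits
        }
        return out;
    }

    /** Prices as whole cents, avoiding float/double rounding. */
    static long priceToCents(BigDecimal price) {
        return price.setScale(2, RoundingMode.HALF_UP).unscaledValue().longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(sortableLongFromDouble(-1.5) < sortableLongFromDouble(2.0));
        System.out.println(precisionLevels(new Date().getTime(), 8).length);
        System.out.println(priceToCents(new BigDecimal("19.99")));
    }
}
```

Each element of precisionLevels would become one indexed term once it is prefixed with its shift, which is the step the slide describes and this sketch omits.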

Speed

• There is an upper limit on the number of terms to visit, independent of index size. This value depends only on precisionStep.

• Number of terms: 8 bit approx. 400 terms, 4 bit approx. 100 terms, 2 bit approx. 40 terms

• Query time: in most cases < 100 ms on an index with 500,000 docs and 13 trie fields, at a precisionStep of 8 bits
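Why the limit depends only on precisionStep can be made concrete with a worst-case count. The sketch below is an assumption-laden estimate, not a figure from the slides: it assumes at most 2^precisionStep - 1 terms on each of the two range edges per precision level, the same number once in the centre at the lowest precision, and a precisionStep that divides the value width evenly. The class and method names are hypothetical.

```java
final class TrieTermBoundSketch {

    /** Worst-case number of terms one range has to visit under the stated assumptions. */
    static long maxTerms(int bitsPerValue, int precisionStep) {
        if (precisionStep < 1 || bitsPerValue % precisionStep != 0) {
            throw new IllegalArgumentException("precisionStep must divide bitsPerValue");
        }
        long perEdge = (1L << precisionStep) - 1;     // terms per edge and level
        long levels = bitsPerValue / precisionStep;   // number of precisions
        // Two edges on every level except the lowest precision, plus the centre.
        return (levels - 1) * perEdge * 2 + perEdge;
    }

    public static void main(String[] args) {
        for (int step : new int[] {8, 4, 2}) {
            System.out.println("precisionStep " + step + ": at most "
                + maxTerms(64, step) + " terms for a 64-bit value");
        }
    }
}
```

The approximate figures on the slide are much lower than these worst-case values, which suggests they are counts seen in practice; that reading is an assumption. Neither number involves the document count, which is the point of the slide.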

How to use (indexing)

    // doc: the Document being indexed; precisionStep: the step chosen for the field (e.g. 8)
    // add some numerical fields:
    long lvalue = 121345L;
    TrieUtils.addIndexedFields(doc, "exampleLong",
        TrieUtils.trieCodeLong(lvalue, precisionStep));

    double dvalue = 1.057E17;
    TrieUtils.addIndexedFields(doc, "exampleDouble",
        TrieUtils.trieCodeLong(TrieUtils.doubleToSortableLong(dvalue), precisionStep));

    int ivalue = 121345;
    TrieUtils.addIndexedFields(doc, "exampleInt",
        TrieUtils.trieCodeInt(ivalue, precisionStep));

    float fvalue = 1.057E17f;
    TrieUtils.addIndexedFields(doc, "exampleFloat",
        TrieUtils.trieCodeInt(TrieUtils.floatToSortableInt(fvalue), precisionStep));

    Date datevalue = new Date(); // current time
    TrieUtils.addIndexedFields(doc, "exampleDate",
        TrieUtils.trieCodeLong(datevalue.getTime(), precisionStep));

How to use (searching)

    // Java 1.4, because Long.valueOf(long) is not available:
    Query q = new LongTrieRangeFilter("exampleLong",
        precisionStep, new Long(123L), new Long(999999L), true, true).asQuery();

    // OR, Java 1.5, using autoboxing:
    Query q = new LongTrieRangeFilter("exampleLong",
        precisionStep, 123L, 999999L, true, true).asQuery();

    // execute the search, as usual:
    TopDocs docs = searcher.search(q, 10);
    for (int i = 0; i < docs.scoreDocs.length; i++) {
        Document doc = searcher.doc(docs.scoreDocs[i].doc);
        System.out.println(doc.get("exampleString"));
        // decode a prefix coded, stored field:
        System.out.println(TrieUtils.prefixCodedToLong(
            doc.get("exampleLong2")));
    }

Future Developments

• Current state: A helper field for the lower-precision values is needed (because of sorting). There are some ideas for fixing this (see recent discussions on java-dev).

• Planned: A nicer, more GC-friendly API with more flexibility on indexing: trieCodeLong() and trieCodeInt() return a TokenStream that can be indexed into one field with custom options (Solr implements this with a wrapper at the moment).

• Move to core, with a more user-friendly name (NumberRangeQuery, NumberUtils)?

Demonstration

• www.pangaea.de (main site)

• www.wdc-mare.org (displays query time)

Thank You!