Advanced Indexing Techniques with Apache Lucene - Payloads
Advanced Indexing Techniques with Michael Busch
Agenda
• Part 1: Inverted Index 101
  – Posting Lists
  – Stored Fields vs. Payloads
• Part 2: Use cases for Payloads
  – BoostingTermQuery
  – Simple facet counting
Lucene’s data structures
[Figure: a search runs against the inverted index and produces results (Hits); the stored fields of each hit are then retrieved from the store.]
c:\docs\shakespeare.txt:
To be or not to be.
c:\docs\einstein.txt:
The important thing is not to stop questioning.
Query: not
String comparison slow!
Solution: Inverted index
c:\docs\shakespeare.txt:
To be or not to be.
c:\docs\einstein.txt:
The important thing is not to stop questioning.
Query: not
Inverted index (document 0 = shakespeare.txt, document 1 = einstein.txt)
Term          Document IDs
be            0
important     1
is            1
not           0 1
or            0
questioning   1
stop          1
to            0 1
the           1
thing         1
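The lookup on this slide can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Lucene's implementation: the class name `InvertedIndexSketch`, the tokenization (lowercase, split on non-letters) and the assignment of document IDs 0 and 1 are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    // term -> IDs of the documents containing it, in increasing order
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Lowercases the text, splits it on non-letters and records docId once per distinct term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Documents must be added in increasing ID order, so only the tail needs checking.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    /** One map lookup instead of a string comparison against every document. */
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "To be or not to be.");
        index.add(1, "The important thing is not to stop questioning.");
        System.out.println(index.search("not"));  // [0, 1]
        System.out.println(index.search("stop")); // [1]
    }
}
```

The query cost depends on the length of one posting list rather than on the total amount of text, which is the point the slide makes against string comparison.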
c:\docs\shakespeare.txt:
To  be  or  not  to  be.
0   1   2   3    4   5
c:\docs\einstein.txt:
The  important  thing  is  not  to  stop  questioning.
0    1          2      3   4    5   6     7
Query: "not to"
Inverted index (document 0 = shakespeare.txt, document 1 = einstein.txt)
Term          Document IDs
be            0
important     1
is            1
not           0 1
or            0
questioning   1
stop          1
to            0 1
the           1
thing         1
c:\docs\shakespeare.txt:
To  be  or  not  to  be.
0   1   2   3    4   5
c:\docs\einstein.txt:
The  important  thing  is  not  to  stop  questioning.
0    1          2      3   4    5   6     7
Query: "not to"
Inverted index with positions (document 0 = shakespeare.txt, document 1 = einstein.txt)
Term          Document ID: Positions
be            0: 1 5
important     1: 1
is            1: 3
not           0: 3      1: 4
or            0: 2
questioning   1: 7
stop          1: 6
to            0: 0 4    1: 5
the           1: 0
thing         1: 2
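A phrase query such as "not to" can be answered from these positions: a document matches when the second term occurs at the position directly after the first. The following plain-Java sketch shows that check under simplifying assumptions (two-term phrases only, lowercase query terms, a hypothetical class name `PositionalIndexSketch`); it does not reproduce Lucene's phrase matching.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PositionalIndexSketch {
    // term -> (document ID -> positions of the term in that document)
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new HashMap<>();

    void add(int docId, String text) {
        int position = 0;
        for (String term : text.toLowerCase().split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    /** IDs of the documents in which `second` occurs directly after `first`. */
    List<Integer> phrase(String first, String second) {
        Map<Integer, List<Integer>> a = postings.getOrDefault(first, new TreeMap<>());
        Map<Integer, List<Integer>> b = postings.getOrDefault(second, new TreeMap<>());
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> positionsOfSecond = b.get(entry.getKey());
            if (positionsOfSecond == null) continue; // second term not in this document
            for (int p : entry.getValue()) {
                if (positionsOfSecond.contains(p + 1)) {
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.add(0, "To be or not to be.");
        index.add(1, "The important thing is not to stop questioning.");
        System.out.println(index.phrase("not", "to"));  // [0, 1]
        System.out.println(index.phrase("to", "stop")); // [1]
    }
}
```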
c:\docs\shakespeare.txt:
To be or not to be.
c:\docs\einstein.txt:
The important thing is not to stop questioning.
Inverted index with Payloads
[Figure: the posting lists for the ten terms, each entry holding a document ID and the term's positions in that document, with a payload attached to each position.]
So far…
• String comparison is slow
• An inverted index accelerates search
• Positions stored in posting lists allow phrase searches
• Payloads stored in posting lists attach arbitrary data to each position
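The last point can be pictured as a posting entry that carries a few opaque bytes next to its position. The sketch below is plain Java, not Lucene code, and the payload it stores (one byte that records whether the token was capitalized in the source text) is a made-up example of "arbitrary data"; `PayloadPostingsSketch` and `Posting` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PayloadPostingsSketch {
    /** One occurrence of a term: where it is, plus opaque bytes attached at index time. */
    static final class Posting {
        final int docId;
        final int position;
        final byte[] payload;

        Posting(int docId, int position, byte[] payload) {
            this.docId = docId;
            this.position = position;
            this.payload = payload;
        }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();

    void add(int docId, String text) {
        int position = 0;
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) continue;
            // Hypothetical payload: 1 if the token was capitalized in the source text, else 0.
            byte capitalized = (byte) (Character.isUpperCase(token.charAt(0)) ? 1 : 0);
            postings.computeIfAbsent(token.toLowerCase(), t -> new ArrayList<>())
                    .add(new Posting(docId, position++, new byte[] {capitalized}));
        }
    }

    List<Posting> postingsFor(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }

    public static void main(String[] args) {
        PayloadPostingsSketch index = new PayloadPostingsSketch();
        index.add(0, "To be or not to be.");
        for (Posting p : index.postingsFor("to")) {
            System.out.println("doc " + p.docId + ", position " + p.position
                    + ", payload " + p.payload[0]);
        }
    }
}
```

The two occurrences of "to" in the Shakespeare sentence share a term and a document but end up with different payloads, which is information a per-document stored field could not express.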
Store
Field 1: title
Field 2: content
Field 3: hashvalue
Documents:
D0: F1 F2 F3 | D1: F1 F2 | D2: F1 F2 F3
Store
D0: F1 F2 F3 | D1: F1 F2 | D2: F1 F2 F3
• Optimized for random access
• Document-locality
Store
D0: F1 F2 F3 | D1: F1 F2 | D2: F1 F2 F3
Posting list with Payloads
[Figure: a posting list whose entries hold a document ID, a position (0) and the F3 value as the payload.]
• Optimized for scanning and skipping
• Value-locality
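The contrast between the two layouts can be sketched with two small in-memory structures. Everything here is illustrative: the field values are made up, `LocalitySketch` is a hypothetical name, and real on-disk formats are far more involved. The only detail taken from the slides is that D1 has no F3.

```java
import java.util.ArrayList;
import java.util.List;

class LocalitySketch {
    // Store layout (document-locality): one row per document holding F1, F2, F3.
    static final String[][] STORE = {
        {"title-0", "content-0", "hashA"},
        {"title-1", "content-1", null},
        {"title-2", "content-2", "hashB"},
    };

    // Posting-list layout (value-locality): only the F3 values, in document order.
    static final int[] F3_DOC_IDS = {0, 2};
    static final String[] F3_PAYLOADS = {"hashA", "hashB"};

    /** Random access: all fields of one document come back from a single lookup. */
    static String[] fieldsOf(int docId) {
        return STORE[docId];
    }

    /** Sequential scan: visits every F3 value without touching F1 or F2. */
    static List<Integer> docsWithF3(String value) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < F3_PAYLOADS.length; i++) {
            if (F3_PAYLOADS[i].equals(value)) hits.add(F3_DOC_IDS[i]);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", fieldsOf(0))); // title-0, content-0, hashA
        System.out.println(docsWithF3("hashB"));            // [2]
    }
}
```

Reading F3 for every hit from the first layout means jumping from document to document, while the second layout keeps the values next to each other so they can be read in one pass.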
Agenda
• Part 1: Inverted Index 101
  – Posting Lists
  – Stored Fields vs. Payloads
• Part 2: Use cases for Payloads
  – BoostingTermQuery
  – Simple facet counting
Payloads - API
org.apache.lucene.analysis.Token
  void setPayload(Payload payload)
org.apache.lucene.index.TermPositions
  int getPayloadLength();
  byte[] getPayload(byte[] data, int offset)
Example: BoostingTermQuery
Analyzer:
  final byte BoldBoost = 5;
  …
  Token token = new Token(…);
  …
  // Tokens that were bold in the source document carry the boost as a one-byte payload.
  if (isBold) {
    token.setPayload(new Payload(new byte[] {BoldBoost}));
  }
  …
  return token;
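Example: BoostingTermQuery
Similarity:
  Similarity boostingSimilarity = new DefaultSimilarity() {
    // @Override
    public float scorePayload(byte[] payload, int offset, int length) {
      // A one-byte payload carries the boost written by the analyzer.
      if (length == 1) return payload[offset];
      // Any other payload length leaves the score unchanged.
      return 1;
    }
  };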
Example: BoostingTermQuery
BoostingTermQuery:
  Query btq = new BoostingTermQuery(new Term("field", "searchterm"));
Searching:
  Searcher searcher = new IndexSearcher(…);
  searcher.setSimilarity(boostingSimilarity);
  …
  Hits hits = searcher.search(btq);
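How the payload ends up in the score can be sketched without any Lucene classes. The `scorePayload` method below mirrors the one on the Similarity slide; the combination rule in `score` (base term score multiplied by the average payload score over the matching positions, with a missing payload counted as neutral) is an assumption made for this example, as is the class name `PayloadBoostSketch`.

```java
class PayloadBoostSketch {
    /** As on the slide: a one-byte payload is the boost, anything else is neutral. */
    static float scorePayload(byte[] payload, int offset, int length) {
        return length == 1 ? payload[offset] : 1f;
    }

    /**
     * Assumed combination rule: multiply the base term score by the average
     * payload score over all positions where the term matched.
     */
    static float score(float baseScore, byte[][] payloadsAtMatches) {
        if (payloadsAtMatches.length == 0) return baseScore;
        float sum = 0f;
        for (byte[] payload : payloadsAtMatches) {
            // A position without a payload is treated as neutral (assumption).
            sum += (payload == null) ? 1f : scorePayload(payload, 0, payload.length);
        }
        return baseScore * (sum / payloadsAtMatches.length);
    }

    public static void main(String[] args) {
        byte boldBoost = 5;
        // The term matched twice: once in bold text, once in plain text.
        byte[][] payloads = {{boldBoost}, null};
        System.out.println(score(2f, payloads)); // 6.0
    }
}
```

With this rule a document in which every occurrence of the search term is bold scores five times higher than one in which none is, all else being equal.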
Example: Simple facet counting
Analyzer:
  public TokenStream tokenStream(String fieldName, Reader reader) {
    if (fieldName.equals("_facet")) {
      // Emit exactly one token whose payload is the hash of the document's URL.
      return new TokenStream() {
        boolean done = false;
        public Token next() {
          if (done) return null;
          Token token = new Token(…);
          token.setPayload(new Payload(computeHash(url)));
          done = true;
          return token;
        }
      };
    }
    … // handling of all other fields is not shown on the slide
  }
Example: Simple facet counting
HitCollector:
• Use different PriorityQueues for different sites
• Instead of returning the top-n results of the whole data set, return the top-n results per site
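The per-site collection described above might look roughly like this in plain Java. It is a sketch, not a Lucene HitCollector: `PerSiteTopNSketch`, `Hit` and `collect(siteHash, doc, score)` are hypothetical names, and `siteHash` stands in for the value that would be read from the `_facet` payload of each hit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class PerSiteTopNSketch {
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private static final Comparator<Hit> BY_SCORE = (a, b) -> Float.compare(a.score, b.score);

    private final int n;
    // site hash -> min-heap holding the best n hits seen so far for that site
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopNSketch(int n) {
        this.n = n;
    }

    /** Called once per matching document. */
    void collect(int siteHash, int doc, float score) {
        PriorityQueue<Hit> queue =
                queues.computeIfAbsent(siteHash, s -> new PriorityQueue<Hit>(BY_SCORE));
        queue.add(new Hit(doc, score));
        if (queue.size() > n) queue.poll(); // evict the lowest-scoring hit of this site
    }

    /** Document IDs of the best hits for one site, highest score first. */
    List<Integer> top(int siteHash) {
        List<Integer> docs = new ArrayList<>();
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue == null) return docs;
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort(BY_SCORE.reversed());
        for (Hit hit : hits) docs.add(hit.doc);
        return docs;
    }

    public static void main(String[] args) {
        PerSiteTopNSketch collector = new PerSiteTopNSketch(2);
        collector.collect(1, 10, 0.5f);
        collector.collect(1, 11, 0.9f);
        collector.collect(1, 12, 0.7f);
        collector.collect(2, 20, 0.1f);
        System.out.println(collector.top(1)); // [11, 12]
        System.out.println(collector.top(2)); // [20]
    }
}
```

Each queue stays bounded at n entries, so memory grows with the number of sites rather than with the number of hits.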
Example: Simple facet counting
Summary
• In this example the facet (site) is used for scoring, but the approach can be extended to facet counting
• Good performance due to locality of facet values
Conclusion
• Payloads offer great flexibility
• Payloads are stored very space-efficiently
• Sophisticated data structures enable efficient skipping over payloads
• Payloads should be used whenever special data is required for finding hits and scoring
Outlook
• Finalize API (currently Beta)
• Add more out-of-the-box query types
• Per-document Payloads
Questions?