Using Fingerprints in n-Gram Indices
Digital Libraries: Advanced Methods and Technologies, Digital Collections
Stefan [email protected]
17.09.2009
Thursday, September 17, 2009
Overview
• Introduction
  – Inverted Index
  – n-Gram Index
  – Bitmaps
  – Signature Files
• n-Gram Fingerprints
• n-Gram Fingerprints in Combination with Posting Lists
• Fingerprint Compression
• Conclusion and Future Work
INTRODUCTION
Inverted Index
• Very common index structure
• Term-oriented
• Every term is linked to its postings
n-Gram Index
• Uses n-Grams as indexing terms
• Any kind of subsequence can be searched
• An n-Gram is a subsequence w of a text with $|w| = n$
• Postings for longer subsequences can be calculated:
  $pos(w_1 w_2) = pos(w_1) \cap dec(pos(w_2))$
  $dec(P) = \{ x - 1 \mid x \in P \}$, for $|w_1| = 1$
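The composition of postings for longer subsequences can be illustrated with a minimal Python sketch. It assumes that a posting is a `(file_id, offset)` pair and that `dec` shifts the offset; the function names `build_1gram_index`, `dec`, and `pos` are hypothetical and chosen only to mirror the slide's notation.

```python
from collections import defaultdict

def build_1gram_index(docs):
    """Map each character (1-gram) to its set of (file_id, offset) postings."""
    index = defaultdict(set)
    for file_id, text in enumerate(docs):
        for offset, ch in enumerate(text):
            index[ch].add((file_id, offset))
    return index

def dec(postings, k=1):
    """Shift every posting k positions towards the start of its file."""
    return {(f, p - k) for (f, p) in postings if p - k >= 0}

def pos(index, query):
    """Postings of an arbitrary substring, composed from 1-gram postings."""
    if not query:
        return set()
    result = set(index.get(query[0], set()))
    for k, ch in enumerate(query[1:], start=1):
        # The k-th character must occur k positions after the first one.
        result &= dec(index.get(ch, set()), k)
    return result

docs = ["rhinology", "otorhinolaryngology"]
index = build_1gram_index(docs)
hits = sorted(pos(index, "rhino"))
```

Here `hits` holds the start positions of "rhino" in both texts. Every additional character costs one more intersection, which is why searching is more expensive than in a term-oriented index.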
n-Gram Index
• Index structure is very similar to an inverted index
• Searching is more complex
Bitmaps
• Bitmaps are occurrence maps
• Each bit signals an occurrence of a specific term in a specific document
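As a small illustration of such occurrence maps, the following Python sketch stores one integer per term and uses bit d for document d. The whitespace tokenization and the names `build_bitmaps` and `docs_containing_all` are assumptions made for this example only.

```python
def build_bitmaps(docs):
    """One integer per term; bit d is set if the term occurs in document d."""
    bitmaps = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps

def docs_containing_all(bitmaps, terms):
    """AND the bitmaps of all terms and list the documents whose bit survives."""
    if not terms:
        return []
    result = bitmaps.get(terms[0], 0)
    for term in terms[1:]:
        result &= bitmaps.get(term, 0)
    return [d for d in range(result.bit_length()) if (result >> d) & 1]

docs = ["skin disease index", "skin index", "disease"]
bitmaps = build_bitmaps(docs)
both = docs_containing_all(bitmaps, ["skin", "disease"])
```

A conjunctive query reduces to bitwise AND operations, but the bitmap only says which document matches, not where in the document.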
Signature Files
N-GRAM FINGERPRINT
N-Gram Fingerprint
The idea:
Create fingerprints that:
• Have a fixed size
• Contain information about the postings
N-Gram Fingerprint
A 2D-Fingerprint is a bit-matrix
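$B_w = \begin{pmatrix} b_{0,0} & \cdots & b_{0,o-1} \\ \vdots & \ddots & \vdots \\ b_{f-1,0} & \cdots & b_{f-1,o-1} \end{pmatrix}$

$b_{i,j} = \begin{cases} 1 & \text{if } \exists p \in pos(w): fileid(p) \bmod f = i \wedge offset(p) \bmod o = j \\ 0 & \text{otherwise} \end{cases}$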
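The definition of the bit-matrix can be sketched in a few lines of Python. The sketch assumes postings are `(file_id, offset)` pairs and represents the matrix as a list of rows; the function name `fingerprint` and the small values of `f` and `o` are illustrative, not taken from the talk.

```python
def fingerprint(postings, f, o):
    """Build an f x o bit matrix: row = file_id mod f, column = offset mod o."""
    B = [[0] * o for _ in range(f)]
    for file_id, offset in postings:
        B[file_id % f][offset % o] = 1
    return B

postings = [(0, 3), (5, 9), (6, 1)]
B = fingerprint(postings, f=4, o=8)
```

The matrix always has f × o bits, no matter how long the posting list is, which is the fixed-size property named in the idea slide.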
N-Gram Fingerprint
• Given two 1-grams and their fingerprints $B_{w_1}$ and $B_{w_2}$, the fingerprint $B_{w_1 w_2}$ can be approximated:
  $B_{w_1 w_2} \approx B'_{w_1 w_2} = B_{w_1} \wedge B'_{w_2}$
• $B'_{w_2}$ is constructed by cyclically shifting each column of $B_{w_2}$ by one position to the left.
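The shift-and-AND approximation can be sketched as follows. The sketch assumes that rows correspond to fileID residue classes and columns to offset residue classes, so that shifting the columns left by one maps an occurrence of the second 1-gram back to the start position of the 2-gram; all function names are hypothetical.

```python
def fingerprint(postings, f, o):
    """f x o bit matrix: row = file_id mod f, column = offset mod o."""
    B = [[0] * o for _ in range(f)]
    for file_id, offset in postings:
        B[file_id % f][offset % o] = 1
    return B

def shift_left(B, k=1):
    """Cyclically shift the columns k positions to the left."""
    o = len(B[0])
    return [[row[(j + k) % o] for j in range(o)] for row in B]

def bit_and(A, B):
    """Element-wise AND of two bit matrices of equal shape."""
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def postings_of(docs, s):
    """Exact postings of substring s (used here only to check the result)."""
    return {(d, p) for d, t in enumerate(docs)
            for p in range(len(t) - len(s) + 1) if t[p:p + len(s)] == s}

docs = ["abab", "ba"]
f, o = 2, 4
B_a = fingerprint(postings_of(docs, "a"), f, o)
B_b = fingerprint(postings_of(docs, "b"), f, o)
approx = bit_and(B_a, shift_left(B_b))
exact = fingerprint(postings_of(docs, "ab"), f, o)
```

Every bit set in the exact fingerprint is also set in the approximation, but not necessarily the other way round: residue classes can collide, so a set bit only marks a candidate that still has to be verified.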
N-Gram Fingerprint
Search Speed

Query      Bit-matrix time  Verification time  Hits
rhinolo    219 ms           94 ms              18
sanfilipo  290 ms           0 ms               0
itracon    266 ms           336 ms             64
oxyuria    197 ms           48 ms              6

Results from the “Online Encyclopedia of Dermatology from P. Altmeyer”
N-GRAM FINGERPRINTS IN COMBINATION WITH POSTING LISTS
Combining Fingerprints and Posting Lists
By combining fingerprints and posting lists:
• No verification step is needed
• Posting lists are partitioned into smaller subsets. Each bit of the fingerprint corresponds to a separate posting list
• Costs for intersecting posting lists are reduced
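One way to picture the partitioning is the Python sketch below. It assumes postings are `(file_id, offset)` pairs and that a posting subset exists exactly when the corresponding fingerprint bit is set; `partition_postings` and `search_2gram` are hypothetical names, and the sketch handles only two consecutive 1-grams.

```python
from collections import defaultdict

def partition_postings(postings, f, o):
    """Split a posting list into one subset per fingerprint bit (i, j)."""
    buckets = defaultdict(set)
    for file_id, offset in postings:
        buckets[(file_id % f, offset % o)].add((file_id, offset))
    return buckets

def search_2gram(buckets_w1, buckets_w2, o):
    """Exact postings of w1 w2, intersecting only subsets whose bits match."""
    hits = set()
    for (i, j), part1 in buckets_w1.items():
        part2 = buckets_w2.get((i, (j + 1) % o))
        if part2 is None:
            continue  # bit not set in the shifted fingerprint: nothing to intersect
        hits |= {(fid, p) for (fid, p) in part1 if (fid, p + 1) in part2}
    return hits

docs = ["abab", "ba"]
f, o = 2, 4
postings_a = {(d, p) for d, t in enumerate(docs) for p, c in enumerate(t) if c == "a"}
postings_b = {(d, p) for d, t in enumerate(docs) for p, c in enumerate(t) if c == "b"}
hits = search_2gram(partition_postings(postings_a, f, o),
                    partition_postings(postings_b, f, o), o)
```

Because the surviving subsets are intersected on real postings, the result is exact and needs no separate verification pass, while each intersection touches only a small fraction of the full posting lists.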
Managing n-Gram Posting Lists
• A very large number of posting subsets has to be managed. For example:
  – 1024 residue classes for the fileID
  – 128 residue classes for the offset
  – 14,000 different n-grams
• Subsets are stored in a hash
• The hash value is a function of the residue classes
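To make the scale of the example concrete, the sketch below multiplies out the three figures and shows one possible key function. The key layout, which also folds in an n-gram identifier, is an assumption for illustration; the talk only states that the hash value is a function of the residue classes.

```python
F_CLASSES = 1024   # residue classes for the fileID (from the example)
O_CLASSES = 128    # residue classes for the offset (from the example)
N_GRAMS = 14_000   # different n-grams (from the example)

# Upper bound on the number of posting subsets if every bit were set.
max_subsets = F_CLASSES * O_CLASSES * N_GRAMS

def subset_key(ngram_id, file_class, offset_class):
    """Hypothetical key: one distinct integer per (n-gram, fileID class, offset class)."""
    return (ngram_id * F_CLASSES + file_class) * O_CLASSES + offset_class

# Only non-empty subsets are stored, so a hash table (here a dict) stays sparse.
subsets = {}
file_id, offset = 5, 300
key = subset_key(42, file_id % F_CLASSES, offset % O_CLASSES)
subsets.setdefault(key, []).append((file_id, offset))
```

With these numbers there are up to 1,835,008,000 possible subsets, which is why they are kept in a hash rather than in a dense array.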
Managing n-Gram Posting Lists
[Figure: hash collisions and collision resolving. Vertical axis: frequency (0 to 40000). Horizontal axis: number of collisions, comparisons, and comparisons after sorting (0 to 140).]
Results
• Performance improved by 40% compared to the setup without posting lists
Query      Bit-matrix time  Verification time  Hits
rhinolo    230 ms           10 ms              18
sanfilipo  271 ms           0 ms               0
itracon    245 ms           15 ms              64
oxyuria    210 ms           12 ms              6
FINGERPRINT COMPRESSION
Fingerprint Compression
• Fingerprints with high or low densities do not contain much information
• Fingerprints can be compressed by reducing the resolution
• Dictionary based compression
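Reducing the resolution of very sparse or very dense fingerprints can be sketched as below. This is only one plausible reading: it assumes the resolution is lowered by OR-ing rows that fall into the same residue class of a smaller modulus, and that the density limits work as cut-offs on the fraction of set bits. The names `density`, `fold_rows`, and `compress` are hypothetical, and the talk's own convolution step may differ.

```python
def fingerprint(postings, f, o):
    """f x o bit matrix: row = file_id mod f, column = offset mod o."""
    B = [[0] * o for _ in range(f)]
    for file_id, offset in postings:
        B[file_id % f][offset % o] = 1
    return B

def density(B):
    """Fraction of bits that are set."""
    bits = [b for row in B for b in row]
    return sum(bits) / len(bits)

def fold_rows(B, new_f):
    """Lower the fileID resolution to new_f rows by OR-ing rows with equal i mod new_f."""
    if len(B) % new_f != 0:
        raise ValueError("new_f must divide the current number of rows")
    folded = [[0] * len(B[0]) for _ in range(new_f)]
    for i, row in enumerate(B):
        target = folded[i % new_f]
        for j, bit in enumerate(row):
            target[j] |= bit
    return folded

def compress(B, low, high):
    """Halve the row count if the density lies in [0, low] or [high, 1]."""
    d = density(B)
    if (d <= low or d >= high) and len(B) % 2 == 0:
        return fold_rows(B, len(B) // 2)
    return B

postings = [(0, 3), (5, 9), (6, 1), (7, 1)]
B = fingerprint(postings, f=8, o=4)
small = compress(B, low=0.2, high=0.8)
```

Since (x mod 8) mod 4 equals x mod 4, the folded matrix is identical to a fingerprint built directly with four rows, so the fold loses resolution but never drops a set bit.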
Fingerprint Compression
• Results: Fingerprint convolution

Density threshold for convolution  Performance loss  Fingerprint index reduction
no convolution                     0 %               0 %
0-0.025 and 0.975-1                3.1 %             23 %
0-0.05 and 0.95-1                  3.2 %             27 %
0-0.1 and 0.9-1                    10 %              29 %
0-0.2 and 0.8-1                    25 %              31 %
• In combination with dictionary-based compression, the index size is reduced by an additional 30%
CONCLUSION AND FUTURE WORK
Conclusion
• Fingerprints improve the scalability of n-gram indices
• Fingerprints improve the performance of n-gram indices
• The index structure can be adjusted to user behavior, so that common queries can be processed more efficiently
• The fingerprints can be stored in a compressed index while losing only a minimal amount of performance
Future Work
• Combination of term based inverted index and n-Gram fingerprint index
• Profit from the advantages of using both terms and n-Grams as indexing terms
  – Substring search
  – Ranking
  – Thesaurus information
Thank You!