Post on 22-Dec-2015
Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently
Xiaochun Yang, Bin Wang, Chen Li
Northeastern University, China
Approximate selection queries
Keanu Reeves
Samuel Jackson
Schwarzenegger
Samuel Jackson
…
Schwarrzenger
Query errors: limited knowledge about data; typos; limited input devices (cell phone)
Data errors: typos; web data; OCR
Applications: spellchecking; query relaxation; …
Similarity functions: edit distance; Jaccard; cosine; …
Performance is a big issue
Answer queries interactively
Many queries on a server
5 ms/query: 200 queries/second
20 ms/query: 50 queries/second
Outline
Motivation
Tightening lower bound of common strings
Effects of adding a gram on index and queries
Cost-based construction of gram dictionary
Experiments
q-gram inverted lists
2-grams
id  string
1   bingo
2   bioinng
3   bitingin
4   biting
5   boing
6   going
D0
gram  string ids
bi    1,2,3,4
bo    5
gi    3
go    1,6
in    1,2,3,3,4,5,6
io    2
it    3,4
ng    1,2,3,4,5,6
nn    2
oi    2,5,6
ti    3,4
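A minimal sketch of how inverted lists like D0 can be built from the six strings on this slide. The function names `qgrams` and `build_inverted_lists` are illustrative choices, not names from the talk, and the sketch assumes plain overlapping 2-grams with no padding characters at the string ends.

```python
from collections import defaultdict

def qgrams(s, q=2):
    """Return the overlapping q-grams of s, left to right (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_lists(strings, q=2):
    """Map each gram to the sorted ids of the strings that contain it.

    An id is listed once per occurrence, so a string containing a gram
    twice appears twice in that gram's list (as string 3 does for "in").
    """
    lists = defaultdict(list)
    for sid, s in strings.items():
        for gram in qgrams(s, q):
            lists[gram].append(sid)
    return {gram: sorted(ids) for gram, ids in lists.items()}

strings = {1: "bingo", 2: "bioinng", 3: "bitingin",
           4: "biting", 5: "boing", 6: "going"}
index = build_inverted_lists(strings)

for gram in sorted(index):
    print(gram, ",".join(map(str, index[gram])))
```

Printing the lists in gram order gives the same eleven rows as the D0 table on the slide.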
Query processing
2-grams
Query: ED(bingon, ?) ≤ 1
Strings (id: string): 1: bingo; 2: bioinng; 3: bitingin; 4: biting; 5: boing; 6: going
D0 (gram: string ids): bi: 1,2,3,4; bo: 5; gi: 3; go: 1,6; in: 1,2,3,3,4,5,6; io: 2; it: 3,4; ng: 1,2,3,4,5,6; nn: 2; oi: 2,5,6; ti: 3,4
# of common grams >= 3
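The filtering step on this slide can be sketched as follows: count, per string, how many of the query's grams it shares, keep the strings that reach the threshold, then verify the survivors with a real edit-distance computation. This is a sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, common grams are counted as a bag intersection, and the threshold is assumed to be positive (otherwise the count filter prunes nothing and every string would have to be verified).

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_lists(strings, q=2):
    lists = defaultdict(list)
    for sid, s in strings.items():
        for gram in qgrams(s, q):
            lists[gram].append(sid)
    return lists

def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def candidates(index, query, k, q=2):
    """Ids sharing at least (# of query grams - k*q) grams with the query."""
    query_grams = Counter(qgrams(query, q))
    threshold = sum(query_grams.values()) - k * q  # assumed > 0 here
    common = Counter()
    for gram, in_query in query_grams.items():
        for sid, in_string in Counter(index.get(gram, [])).items():
            common[sid] += min(in_query, in_string)
    return sorted(sid for sid, n in common.items() if n >= threshold), threshold

strings = {1: "bingo", 2: "bioinng", 3: "bitingin",
           4: "biting", 5: "boing", 6: "going"}
index = build_inverted_lists(strings)
cands, threshold = candidates(index, "bingon", k=1)
answers = [sid for sid in cands if edit_distance("bingon", strings[sid]) <= 1]
print(threshold, cands, answers)
```

On the slide's data the threshold is 3, string 5 (boing) is pruned because it shares only two grams with the query, and verification leaves string 1 (bingo) as the only answer.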
VGRAM: variable-length grams [VLDB07]
[2,3]-gram dictionary
gram: bi, bin, bo, gi, go, in, ing, io, it, ng, nn, oi, ti
string: b i n g o n
[Figure: the gram dictionary drawn as a trie with nodes n1 to n33; each of the 13 grams ends in a leaf labeled #]
Adopting VGRAM in algorithms
string (b i n g o n) → VGRAM (gram dictionary) → grams, lower bound
[Figure: the gram dictionary trie with nodes n1 to n33, as on the VGRAM slide]
# of common grams >= 3
Contributions of this study
Tightening lower bounds using dynamic programming
Cost-based quantitative approach
  Analyze and estimate query performance when adding each gram
  Automatically find high-quality grams
[Figure: string collection → high-quality grams → gram dictionary]
Calculating lower bound
Fixed length (q)
If ed(s1, s2) <= k, then
# of common grams >= (# of grams of s1) – k * q
b i i n d i n g
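A worked instance of the fixed-length bound, using the query bingon from the query-processing slide. It assumes grams are taken without padding characters, so a string of length |s1| has |s1| − q + 1 grams, and it relies on the fact that one edit operation can touch at most q of them.

```latex
\[
\#\,\text{common grams}(s_1, s_2) \;\ge\; \bigl(|s_1| - q + 1\bigr) - k \cdot q
\]
\[
s_1 = \texttt{bingon},\quad q = 2,\quad k = 1:\qquad
(6 - 2 + 1) - 1 \cdot 2 \;=\; 5 - 2 \;=\; 3
\]
```

This agrees with the threshold "# of common grams >= 3" shown for ED(bingon, ?) ≤ 1.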
Calculating lower bound
Variable lengths
lower bound = (# of grams of s1) – NAG(s1,k)
b i i n d i n g
1 2 3 2 3 2 1 1
Too pessimistic?
k-Max: summation of the k largest values
b i i n d i n g
1 2 3 2 3 2 1 1
NAG(s,2) = 3 + 3 = 6
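The k-Max estimate is simple enough to sketch directly. The row of numbers under the string is read here as per-position values (how many grams an edit at that position can affect); that reading and the names `position_values` and `nag_kmax` are assumptions, since the slide does not label the row.

```python
import heapq

def nag_kmax(position_values, k):
    """k-Max estimate of NAG(s, k): the sum of the k largest per-position values."""
    return sum(heapq.nlargest(k, position_values))

# Per-position values shown on the slide under "b i i n d i n g".
position_values = [1, 2, 3, 2, 3, 2, 1, 1]
print(nag_kmax(position_values, 2))  # 3 + 3 = 6
```

The estimate treats the k edits as independent. The dynamic-programming table later in the talk lists 5 for k=2 on the same string, which is why the slide asks whether 6 is too pessimistic.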
Tightening lower bound
Dynamic programming: tightening NAG(s,k)
Subproblems: NAG(s[1,j], i)
[Figure: string s with positions 1 to j marked and the i-th edit operation (op i)]
Dynamic programming
          b  i  i  n  d  i  n  g
          1  2  3  2  3  2  1  1
k=0    0  0  0  0  0  0  0  0  0
k=1    0  1  2  3  3  3  3  3  3
k=2    0  1  2  3  4  5  5  5  5
Label in figure: NAG vector
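The transcript does not include the recurrence behind this table, so the following is only a sketch of one plausible shape for it. It assumes that for each position j either no edit is placed at j, or the i-th edit is placed at j and the remaining i-1 edits fall at or before an earlier position `prev[j]` whose affected grams do not overlap those of j. The `prev` array is a hypothetical input: its values below are chosen by hand so that the output lines up with the table, and they are not taken from the talk.

```python
def tighten_nag(position_values, prev, k):
    """Fill table[i][j], a bound on NAG(s[1..j], i), for 0 <= i <= k.

    Assumed recurrence (not spelled out in the transcript):
        table[i][j] = max(table[i][j-1],                               # no edit at j
                          table[i-1][prev[j]] + position_values[j])    # i-th edit at j
    prev[j] is hypothetical: the last position whose affected grams do
    not overlap the grams affected by an edit at position j (0 = none).
    Positions are 1-based in the table, 0-based in the input lists.
    """
    n = len(position_values)
    table = [[0] * (n + 1) for _ in range(k + 1)]
    for i in range(1, k + 1):
        for j in range(1, n + 1):
            skip = table[i][j - 1]
            use = table[i - 1][prev[j - 1]] + position_values[j - 1]
            table[i][j] = max(skip, use)
    return table

position_values = [1, 2, 3, 2, 3, 2, 1, 1]  # row shown under "b i i n d i n g"
prev = [0, 0, 0, 2, 2, 3, 5, 6]             # made-up values, for illustration only
table = tighten_nag(position_values, prev, 2)
for i, row in enumerate(table):
    print(f"k={i}", row)
```

The last entry of each row is the bound for the whole string, so this run gives 3 for k=1 and 5 for k=2, compared with 6 from the k-Max estimate.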
Effects on inverted lists
Gram dictionary: ab, bc
add gram abc
Gram dictionary: ab, bc, abc
string: --abc----ab----bc--
Effects on query performance
Decrease query's inverted lists
Change lower bound
Change # of candidates
Effects on query’s inverted lists
Gram dictionary: ab, bc
add gram abc
Gram dictionary: ab, bc, abc
Adding a new gram abc either leaves the query's inverted lists unchanged or shortens them.
Query Q, three cases:
- - - - - - - - - - - - -
- - - - - a b - - - - - -
- - - - - a b c - - - - -
Effects on lower bound
Query: Q, ED(Q, ?) ≤ 1
Query Q: - - - - a b c d - - - - -
[Figure: query Q drawn twice; the gram boundaries are not recoverable]
Effects on # of candidates
Change in lower bound → change in # of candidates
Gram dictionary: ab, bc
add gram abc
Gram dictionary: ab, bc, abc
Query Q: - - - - a b c d - - - -
[Figure: query Q drawn twice, once for each gram dictionary]
Data sets
Environment: GNU C++; Dell GX620 PC with an Intel Pentium 2.40 GHz dual-core CPU, 2 GB memory, 250 GB disk; Ubuntu (Linux) OS. Index structures were assumed to be in memory.
Data set         # of strings   Min length   Max length   Avg length   Range of # of injected edit operations
Article Titles   277,000        6            207          66           [1,6]
Movie Titles     855,000        8            249          35           [1,3]
Actor Names      1,200,000      4            74           17           [1,2]
Effect of Tightening Lower Bound
1M actor names. Gram dictionary construction: 100,000 sample strings, 5,000 queries, qmin = 4
Comparison with algorithm Prune [VLDB07]
Dataset: 1M article titles
Prune: qmin=5, qmax=7, T=2000, LargeFirst policy
GramGen: 1% sampling ratio, 2000 queries (qmin=5 automatically determined)
Conclusions
Tightening lower bound
  Dynamic programming
Analysis of how adding a gram affects
  Index structure
  Performance of queries
Efficient algorithm
  Automatically generating a high-quality gram dictionary