INCREMENTAL MAINTENANCE OF LENGTH NORMALIZED INDEXES FOR APPROXIMATE STRING MATCHING
- Ashwin Joshi
PROBLEM
Consider a real system
- Tens of millions of strings
- Updated on an hourly basis
- Practical scenario
1. Updates are buffered
2. Index is rebuilt weekly
- Re-computation time = a few hours
- Limitations of online systems
INTRODUCTION
- Inverse Document Frequency
- Partial Score Contribution
LENGTH NORMALIZATION
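Types: L0, L1 and L2. Why is L2 preferred?
Similarity: S(q, s) = Σ_{t ∈ q ∩ s} idf(t)² / (L(q) · L(s)),
where L0(s) = |s|, L1(s) = Σ_{t ∈ s} idf(t) and L2(s) = √(Σ_{t ∈ s} idf(t)²)
e.g. Query q = {t1, t2, t3}, String s1 = {t1}, String s2 = {t1, t2, t3}
and idf(t1) = 10, idf(t2) = 8, idf(t3) = 2.
For L0, S0(q,s1) = 100/3 > S0(q,s2) = 168/9
For L1, S1(q,s1) = 100/200 > S1(q,s2) = 168/400
For L2, S2(q,s1) = 100/(10·√168) ≈ 0.77 < S2(q,s2) = 168/168 = 1
Only L2 ranks the exact match s2 above the partial match s1.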
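The example can be recomputed with a few lines of Python. This is a minimal sketch, assuming the similarity is the sum of squared idfs of the shared tokens divided by the product of the two lengths, with L0 as token count, L1 as sum of idfs and L2 as the Euclidean norm of the idfs. These definitions match the L0 and L1 fractions in the example. The function names `length` and `similarity` are illustrative.

```python
import math

# idf weights and token sets taken from the example on the slide.
idf = {"t1": 10.0, "t2": 8.0, "t3": 2.0}
q = {"t1", "t2", "t3"}
s1 = {"t1"}
s2 = {"t1", "t2", "t3"}


def length(tokens, norm):
    """Length of a token set under L0, L1 or L2 normalization."""
    if norm == 0:
        return float(len(tokens))                            # token count
    if norm == 1:
        return sum(idf[t] for t in tokens)                   # sum of idfs
    if norm == 2:
        return math.sqrt(sum(idf[t] ** 2 for t in tokens))   # Euclidean norm
    raise ValueError("norm must be 0, 1 or 2")


def similarity(query, string, norm):
    """Squared idfs of the shared tokens over the product of both lengths."""
    shared = sum(idf[t] ** 2 for t in query & string)
    return shared / (length(query, norm) * length(string, norm))


for norm in (0, 1, 2):
    print(f"L{norm}: S(q,s1)={similarity(q, s1, norm):.3f} "
          f"S(q,s2)={similarity(q, s2, norm):.3f}")
```

Under these assumptions, L0 and L1 score the one-token string s1 higher than the exact match s2, while L2 gives s2 a score of 1 and s1 a score below 1.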
APPROXIMATE STRING MATCHING
Theorem: Length Boundedness
- Determines the strings that are either too short or too long to match the query
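The theorem's formula did not survive in this transcript, so the following is a sketch under stated assumptions rather than the slide's own statement. With L2 normalization, the sum of squared idfs of the shared tokens can be no larger than L2(q)² or L2(s)², so S2(q, s) ≤ min(L2(q)/L2(s), L2(s)/L2(q)). A string can therefore reach a threshold τ only if τ·L2(q) ≤ L2(s) ≤ L2(q)/τ. The names `length_bounds` and `prune_by_length` are hypothetical.

```python
def length_bounds(query_len, tau):
    """Return the (shortest, longest) L2 length a matching string can have."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    return tau * query_len, query_len / tau


def prune_by_length(query_len, string_lens, tau):
    """Keep only the ids whose precomputed L2 length falls inside the bounds."""
    lo, hi = length_bounds(query_len, tau)
    return [sid for sid, slen in string_lens.items() if lo <= slen <= hi]


# Hypothetical precomputed L2 lengths, keyed by string id.
lengths = {"a": 4.0, "b": 5.0, "c": 12.0, "d": 20.0, "e": 25.0}
candidates = prune_by_length(query_len=10.0, string_lens=lengths, tau=0.5)
print(candidates)
```

Because the filter uses only the stored lengths, strings outside the bounds can be skipped without reading their tokens.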
MAINTENANCE OPERATIONS
Propagating Updates
1. Insert
2. Delete
3. Modify: effectively a 'Delete' followed by an 'Insert'
INSERT
Insert S7
- Generate new tokens
- Add new strings
- N changes -> idf changes -> L changes
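The chain "N changes -> idf changes -> L changes" can be illustrated on a toy collection. This is a minimal sketch, assuming idf(t) = log2(1 + N/df(t)), since the idf formula itself is missing from this transcript. The strings, tokens and the helper `build_index` are hypothetical.

```python
import math
from collections import Counter


def build_index(strings):
    """Compute N, df, idf and L2 lengths from scratch for id -> token set."""
    n = len(strings)
    df = Counter(t for tokens in strings.values() for t in tokens)
    # Assumed idf definition; the slide's own formula is not available.
    idf = {t: math.log2(1 + n / d) for t, d in df.items()}
    lengths = {
        sid: math.sqrt(sum(idf[t] ** 2 for t in tokens))
        for sid, tokens in strings.items()
    }
    return n, df, idf, lengths


strings = {"s1": {"ab", "bc"}, "s2": {"bc", "cd"}, "s3": {"cd", "de"}}
_, _, idf_before, len_before = build_index(strings)

# Insert one new string that introduces the new token "ef".
strings["s4"] = {"ab", "ef"}
_, _, idf_after, len_after = build_index(strings)

changed = sorted(
    sid for sid in len_before
    if not math.isclose(len_before[sid], len_after[sid])
)
print(changed)
```

In this toy run every pre-existing length changes, including that of s3, which shares no token with the inserted string: N alone shifts every idf. This is the cost that exact propagation has to pay for a single insert.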
RELAXED PROPAGATION
Relaxation of N
- What is Nb?
- Divergence between N and Nb
Relaxation of df
- Definition of dfp(ti)
- Range of dfp(ti)
Relaxed similarity S2~
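The definitions of Nb and dfp(ti) are not preserved in this transcript, so the sketch below rests on assumptions: Nb and dfp(ti) are taken to be the values of N and df(ti) at the time the idfs were last computed, a token's idf is refreshed only when its df leaves the band [dfp/ρ, ρ·dfp], and idf(t) = log2(1 + Nb/df(t)). The class `RelaxedIdf` and its methods are hypothetical and are not the presented algorithm.

```python
import math


class RelaxedIdf:
    """Keeps idfs computed from stale counts until the counts drift too far."""

    def __init__(self, n, df, rho=1.1):
        self.rho = rho
        self.n = n                # current number of strings, N
        self.n_b = n              # assumed Nb: N at the last recomputation
        self.df = dict(df)        # current document frequencies
        self.df_p = dict(df)      # assumed dfp: df at the last recomputation
        self.idf = {t: self._idf(self.n_b, d) for t, d in df.items()}

    @staticmethod
    def _idf(n, d):
        return math.log2(1 + n / d)   # assumed idf definition

    def insert(self, tokens):
        """Register a new string; return the tokens whose idf was refreshed."""
        self.n += 1
        refreshed = []
        for t in set(tokens):
            self.df[t] = self.df.get(t, 0) + 1
            d, d_p = self.df[t], self.df_p.get(t)
            in_band = d_p is not None and d_p / self.rho <= d <= d_p * self.rho
            if not in_band:
                self.df_p[t] = d
                self.idf[t] = self._idf(self.n_b, d)
                refreshed.append(t)
        return sorted(refreshed)

    def needs_rebuild(self):
        """True once N has drifted from Nb by more than a factor of rho."""
        return not (self.n_b / self.rho <= self.n <= self.n_b * self.rho)


index = RelaxedIdf(n=100, df={"a": 50, "b": 1}, rho=1.1)
print(index.insert({"a"}))   # frequent token: df stays inside the band
print(index.insert({"b"}))   # rare token: df doubles and leaves the band
print(index.insert({"c"}))   # unseen token: always needs an idf
```

Under these assumptions, updates to frequent tokens rarely trigger any recomputation, so most inserts leave the stored idfs and string lengths untouched.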
LOSS IN PRECISION
- Assume total possible divergence in idf
- Relaxed similarity
- For ρ = 1.1 and query threshold
- Equation 1
- Equation 2
UPDATE PROPAGATION ALGORITHM
EXPERIMENT (DBLP)
- Period = 30 days
- 2460433 author/id pairs
- 5712041 total words
- 269281 distinct words
- 33461 total updates
- 32121 insertions, 1340 deletions
EXPERIMENT (BUSINESS LISTING)
QUERY ACCURACY
THANK YOU.