INCREMENTAL MAINTENANCE OF LENGTH NORMALIZED INDEXES FOR APPROXIMATE STRING MATCHING - Ashwin Joshi

Post on 18-Jan-2018

PROBLEM
Consider a real system:
- Tens of millions of strings
- Updated on an hourly basis
- Practical scenario:
  1. Updates buffered
  2. Index rebuilt weekly
- Re-computation time = a few hours
- Limitations of online systems

INTRODUCTION
Inverse Document Frequency

Partial Score Contribution

LENGTH NORMALIZATION

Types: L0, L1 & L2. Why is L2 preferred?

Similarity (form implied by the worked example): S(q,s) = sum over t in q∩s of idf(t)^2 / (L(q) · L(s))

e.g. Query q = {t1, t2, t3}, string s1 = {t1}, string s2 = {t1, t2, t3},

and idf(t1) = 10, idf(t2) = 8, idf(t3) = 2.

For L0, S0(q,s1) = 100/3 > S0(q,s2) = 168/9

For L1, S1(q,s1) = 100/200 > S1(q,s2) = 168/400

For L2, S2(q,s1) = 100/(10·√168) ≈ 0.77 < S2(q,s2) = 168/168 = 1
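
The comparison above can be checked with a short script. This is a minimal sketch, assuming (as the L0 and L1 figures imply) that similarity is the sum of squared idf values over the shared tokens, divided by the product of the two normalized lengths; the function names `length` and `similarity` are illustrative, not from the slides.

```python
import math

def length(tokens, idf, norm):
    """Normalized length of a token set under L0, L1 or L2."""
    if norm == 0:
        return len(tokens)                                   # token count
    if norm == 1:
        return sum(idf[t] for t in tokens)                   # sum of idf
    if norm == 2:
        return math.sqrt(sum(idf[t] ** 2 for t in tokens))   # Euclidean length
    raise ValueError("norm must be 0, 1 or 2")

def similarity(q, s, idf, norm):
    """Assumed form: squared idf of shared tokens over the product of lengths."""
    shared = sum(idf[t] ** 2 for t in q & s)
    return shared / (length(q, idf, norm) * length(s, idf, norm))

idf = {"t1": 10, "t2": 8, "t3": 2}
q = {"t1", "t2", "t3"}
s1 = {"t1"}
s2 = {"t1", "t2", "t3"}

for norm in (0, 1, 2):
    print(f"L{norm}:", similarity(q, s1, idf, norm), similarity(q, s2, idf, norm))
```

Under L0 and L1 the one-token string s1 scores higher than the exact match s2; only L2 gives the exact match the top score of 1.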

APPROXIMATE STRING MATCHING
Theorem: Length Boundedness
- Determine strings that are either too short or too long to match the query
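
The theorem statement itself did not survive in this transcript. The sketch below uses a bound that follows from the L2 similarity if it has the form "squared idf of shared tokens over the product of the two lengths": the numerator can never exceed the smaller squared length, so a string s can reach similarity τ only if τ·L2(q) ≤ L2(s) ≤ L2(q)/τ. The helper names and the sample strings are hypothetical.

```python
import math

def l2_length(tokens, idf):
    """Euclidean (L2) length of a token set."""
    return math.sqrt(sum(idf[t] ** 2 for t in tokens))

def length_bounds(q, idf, tau):
    """Range of L2 lengths a string must fall in to reach similarity tau."""
    lq = l2_length(q, idf)
    return tau * lq, lq / tau

def candidates(q, strings, idf, tau):
    """Keep only strings whose length lies inside the bounds (hypothetical helper)."""
    lo, hi = length_bounds(q, idf, tau)
    return [name for name, toks in strings.items()
            if lo <= l2_length(toks, idf) <= hi]

idf = {"t1": 10, "t2": 8, "t3": 2}
q = {"t1", "t2", "t3"}
strings = {"s1": {"t1"}, "s2": {"t1", "t2", "t3"}, "s3": {"t3"}}

print(length_bounds(q, idf, 0.8))
print(candidates(q, strings, idf, 0.8))
```

With τ = 0.8, strings s1 and s3 are too short to qualify and can be skipped without computing their similarity.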

MAINTENANCE OPERATIONS
Propagating Updates
1. Insert
2. Delete
3. Modify: effectively a ‘Delete’ followed by an ‘Insert’

INSERT
Insert S7
- Generate new tokens
- Add new strings
- N changes -> idf changes -> L changes
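
The chain "N changes -> idf changes -> L changes" can be illustrated with a small experiment. This is a sketch under assumptions: the idf form `log2(1 + N/df)` is a common choice and is not shown on the slides, and the token sets are made up for the example.

```python
import math
from collections import Counter

def build_stats(strings):
    """Return (N, df): number of strings and per-token document frequency."""
    df = Counter(t for toks in strings.values() for t in toks)
    return len(strings), df

def idf(n, df_t):
    # Assumed idf form; the slides do not give the exact formula.
    return math.log2(1 + n / df_t)

def l2_lengths(strings):
    """L2 length of every string, computed from scratch."""
    n, df = build_stats(strings)
    return {
        name: math.sqrt(sum(idf(n, df[t]) ** 2 for t in toks))
        for name, toks in strings.items()
    }

strings = {"s1": {"main", "st"}, "s2": {"main", "ave"}, "s3": {"park", "ave"}}
before = l2_lengths(strings)

strings["s7"] = {"elm", "st"}   # the insert
after = l2_lengths(strings)

# Strings whose stored length is now stale
changed = [name for name in before
           if not math.isclose(before[name], after[name])]
print(changed)
```

Even s3, which shares no token with the inserted string, ends up with a different length, because N appears in every idf. This is why exact propagation of a single insert can touch the whole index.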

RELAXED PROPAGATION
Relaxation of N
- What is Nb?
- Divergence between N & Nb
Relaxation of df
- Definition of dfp(ti)
- Range of dfp(ti)
Relaxed similarity S2~
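
The definitions of Nb and dfp(ti) are not captured in this transcript, so the following is only one possible reading of "relaxation of df": the index keeps the document frequency it last propagated for each token and refreshes it only when the current frequency has drifted past a tolerance factor. The class name, the tolerance `rho`, and the refresh rule are all assumptions for illustration, and the sketch handles insertions only.

```python
class RelaxedStats:
    """Tracks current df per token and the df last propagated to the index."""

    def __init__(self, rho):
        self.rho = rho    # tolerated drift factor (assumed parameter)
        self.df = {}      # current document frequency
        self.df_p = {}    # frequency at the last propagation (hypothetical)

    def add_occurrence(self, token):
        """Record one more string containing token.

        Returns True when the change has to be propagated to the index.
        """
        self.df[token] = self.df.get(token, 0) + 1
        if token not in self.df_p:
            self.df_p[token] = self.df[token]   # new token: index it now
            return True
        if self.df[token] > self.rho * self.df_p[token]:
            self.df_p[token] = self.df[token]   # drift exceeded rho: refresh
            return True
        return False                            # drift tolerated, index untouched

stats = RelaxedStats(rho=1.5)
events = [stats.add_occurrence("main") for _ in range(6)]
print(events)
```

As a token becomes more frequent, refreshes become rarer, which is the intended saving: most updates leave the stored values alone at the cost of slightly stale idf values.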

LOSS IN PRECISION
Assume total possible divergence in idf

Relaxed Similarity,

For ρ=1.1 & query threshold,

Equation1 : ,

Equation2 : ,
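
The equations on this slide were not captured. As a rough illustration of how idf divergence limits precision, the sketch below assumes every relaxed idf stays within a factor ρ of its exact value. Under that assumption the squared-idf numerator and the product of the two L2 lengths each move by at most ρ², so the relaxed similarity stays within a factor ρ⁴ of the exact one. This bound is derived here from that assumption and may differ from the bounds shown in the original slides.

```python
import math
import random

def s2(q, s, idf):
    """L2-normalized similarity (assumed form)."""
    def l2(tokens):
        return math.sqrt(sum(idf[t] ** 2 for t in tokens))
    shared = sum(idf[t] ** 2 for t in q & s)
    return shared / (l2(q) * l2(s))

def relax(idf, rho, rng):
    """Scale every idf by a random factor in [1/rho, rho] (assumed divergence model)."""
    return {t: v * rho ** rng.uniform(-1, 1) for t, v in idf.items()}

rho = 1.1
idf = {"t1": 10, "t2": 8, "t3": 2, "t4": 5}
q = {"t1", "t2", "t3"}
s = {"t1", "t3", "t4"}

exact = s2(q, s, idf)
rng = random.Random(0)

# Largest ratio between relaxed and exact similarity over many random relaxations
worst = max(
    max(r / exact, exact / r)
    for r in (s2(q, s, relax(idf, rho, rng)) for _ in range(10_000))
)
print(exact, worst, rho ** 4)
```

For ρ = 1.1 the factor ρ⁴ is about 1.46, so under this model a query with threshold τ on exact similarities would need to be run with threshold τ/ρ⁴ on the relaxed index to avoid missing results, at the price of extra false positives to filter out.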

UPDATE PROPAGATION ALGORITHM

…continued

EXPERIMENT (DBLP)
- Period = 30 days
- 2460433 author/id pairs
- 5712041 total words
- 269281 distinct words
- 33461 total updates
- 32121 insertions, 1340 deletions

EXPERIMENT (BUSINESS LISTING)

QUERY ACCURACY

THANK YOU.
