a combination of trie-trees and inverted files for the indexing of set-valued attributes

Post on 01-Jan-2016

A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes

Manolis Terrovitis (NTUA), Spyros Passas (NTUA), Panos Vassiliadis (UoI), Timos Sellis (NTUA)

Terrovitis et al., CIKM '06

Problem

We are interested in low cardinality set-values
– Retail store transaction logs
– Web logs
– Biomedical databases etc.

We address the efficient evaluation of containment queries
– In which transactions were products ‘a’ and ‘b’ sold together?
– Which users visited only the main page or the download page of our site?

We propose the Hybrid Trie-Inverted file (HTI) index

Outline

Problem definition
The HTI index
Query evaluation
Experiments
Conclusions

Data and queries

tid  products         tid  products
1    {f,a}            9    {a,e}
2    {a,d,c}          10   {g,c,a}
3    {c,b,a}          11   {b,a,e}
4    {f,a,c}          12   {b,d,c}
5    {c,g}            13   {c,f,a,d,b}
6    {a,b,g,c,d,e}    14   {b,d}
7    {a,d,b}          15   {e}
8    {a,e,b}          16   {b,f,a}


Find all transactions that contain ‘a’, ‘b’ and ‘d’ (subset)

Find all transactions that contain exactly ‘a’, ‘b’ and ‘d’ (equality)

Find all transactions that contain only items from ‘a’, ‘b’ and ‘d’ (superset)
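As a point of reference, the three query types can be written directly against the example table with Python's built-in set operators. This is a brute-force sketch that scans every transaction, not an index; the names `DB`, `subset_query`, `equality_query` and `superset_query` are made up for illustration.

```python
# Example database from the slide: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def subset_query(db, q):
    """Transactions that contain every item of q."""
    return sorted(tid for tid, items in db.items() if q <= items)


def equality_query(db, q):
    """Transactions whose set-value is exactly q."""
    return sorted(tid for tid, items in db.items() if items == q)


def superset_query(db, q):
    """Transactions that contain only items drawn from q."""
    return sorted(tid for tid, items in db.items() if items <= q)


q = {"a", "b", "d"}
print(subset_query(DB, q))    # [6, 7, 13]
print(equality_query(DB, q))  # [7]
print(superset_query(DB, q))  # [7, 14]
```

Every query here touches all 16 records, which is the cost an index is meant to avoid.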


Data and queries

Traditional methods
– Signature files
– Inverted files

Differences from text databases:
– Low cardinality
– Large number of records in comparison with vocabulary size
– New types of queries (equality, superset)


The HTI index: Background – The inverted file

Vocabulary (I)   Inverted (postings) lists
a                1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16
b                3, 6, 7, 8, 11, 12, 13, 14, 16
c                2, 3, 4, 5, 6, 10, 12, 13
d                2, 6, 7, 12, 13, 14
e                6, 8, 9, 11, 15
f                1, 4, 13, 16
g                5, 6, 10

Each list holds the tids of the database transactions (D) that contain the item, e.g. 14 {b,d} and 16 {b,f,a}.
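A minimal sketch of how such an inverted file could be built from the example transactions. The function name `build_inverted_file` and the dictionary layout are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

# Example database from the earlier slide: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_inverted_file(db):
    """Map every vocabulary item to the ascending list of tids that contain it."""
    postings = defaultdict(list)
    for tid in sorted(db):            # visiting tids in order keeps each list sorted
        for item in db[tid]:
            postings[item].append(tid)
    return dict(postings)


inverted = build_inverted_file(DB)
for item in sorted(inverted):
    print(item, inverted[item])
```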

HTI index: Inverted files - problems

The evaluation of containment queries relies on merge-joining the inverted lists

The inverted lists become very long
– when the database size is very big compared to the vocabulary
– when the items’ distribution is skewed

This is often the case in the real world!
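To make the cost concrete, here is a hedged sketch of a subset query answered by merge-joining postings lists. The helper names `merge_join` and `subset_query` are assumptions; the three lists are the ones for ‘a’, ‘b’ and ‘d’ in the example database.

```python
def merge_join(left, right):
    """Intersect two ascending tid lists in one linear pass over both."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


def subset_query(inverted, query):
    """Merge-join the postings lists of all query items, shortest list first."""
    lists = sorted((inverted.get(item, []) for item in query), key=len)
    if not lists:
        return []                     # empty query: nothing to join in this sketch
    result = lists[0]
    for postings in lists[1:]:
        result = merge_join(result, postings)
    return result


INVERTED = {
    "a": [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16],
    "b": [3, 6, 7, 8, 11, 12, 13, 14, 16],
    "d": [2, 6, 7, 12, 13, 14],
}
print(subset_query(INVERTED, ["a", "b", "d"]))  # [6, 7, 13]
```

Each join reads both input lists in full, so the work grows with the length of the lists rather than with the size of the answer.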

HTI index: Solution?

We need to break up the lists!

But how?
– Let's make a list for every combination of items!


We assume a total order based on the frequency of appearance for the items of the database

We order the items in each set-value and we transform it to a sequence

We create a path in the access tree for each sequence
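The three steps can be sketched as follows. The `Node` class, the function names, and the alphabetical tie-break between equally frequent items are assumptions made for this illustration; each node keeps the tids of every transaction whose ordered sequence passes through it.

```python
from collections import Counter

# Example database from the earlier slide: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def frequency_order(db):
    """Rank items by descending frequency (ties broken alphabetically, an assumption)."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: rank for rank, item in enumerate(ranked)}


class Node:
    def __init__(self):
        self.children = {}   # item -> Node
        self.tids = []       # tids of all transactions passing through this node


def build_trie(db):
    rank = frequency_order(db)
    root = Node()
    for tid in sorted(db):
        node = root
        # Turn the set-value into a sequence, then walk or extend its path.
        for item in sorted(db[tid], key=rank.__getitem__):
            node = node.children.setdefault(item, Node())
            node.tids.append(tid)
    return root, rank


root, rank = build_trie(DB)
print(sorted(rank, key=rank.get))              # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(root.children["a"].children["b"].tids)   # [3, 6, 7, 8, 11, 13, 16]
```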

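HTI index: All combinations?

Ordered Transactions
1 {a,f}            9  {a,e}
2 {a,c,d}          10 {a,c,g}
3 {a,b,c}          11 {a,b,e}
4 {a,c,f}          12 {b,c,d}
5 {c,g}            13 {a,b,c,d,f}
6 {a,b,c,d,e,g}    14 {b,d}
7 {a,b,d}          15 {e}
8 {a,b,e}          16 {a,b,f}

[Figure: trie with root Null, built by inserting the ordered transactions one at a time; each transaction adds its tid to every node on its path, e.g. transaction 1 {a,f} adds tid 1 to nodes a and a→f.]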


HTI index: All combinations? Maybe not…

[Figure: the complete trie for the 16 ordered transactions, with a tid list on every node. First-level nodes: a (tid’s: 1,2,3,4,6,7,8,9,10,11,13,16), b (tid’s: 12,14), c (tid’s: 5), e (tid’s: 15).]

HTI index: An access tree for the frequent items

Null
  a          tid’s: 1,2,3,4,6,7,8,9,10,11,13,16
    b        tid’s: 3,6,7,8,11,13,16
      c      tid’s: 3,6,13
    c        tid’s: 2,4,10
  b          tid’s: 12,14
    c        tid’s: 12
  c          tid’s: 5


The HTI index

Vocabulary: a, b, c, d, e, f, g

Access tree (frequent items a, b, c), each node with its inverted sublist
Null
  a          1,2,3,4,6,7,8,9,10,11,13,16
    b        3,6,7,8,11,13,16
      c      3,6,13
    c        2,4,10
  b          12,14
    c        12
  c          5

Inverted lists (remaining items)
d            2, 6, 7, 12, 13, 14
e            6, 8, 9, 11, 15
f            1, 4, 13, 16
g            5, 6, 10

HTI index: The basic points

The access tree is used only for the most frequent items

The inverted lists are restructured so that each node of the access tree points to a different inverted sublist

We keep the access tree in main memory
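A hedged sketch of how these points could fit together. The access tree is flattened into a dictionary keyed by the path of frequent items, and `min_support` is a hypothetical name for the frequency threshold; neither detail comes from the slides, and the disk layout of the sublists is ignored.

```python
from collections import Counter, defaultdict

# Example database from the earlier slide: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_hti(db, min_support):
    counts = Counter(item for items in db.values() for item in items)
    # Frequent items, most frequent first (alphabetical tie-break is an assumption).
    frequent = sorted(
        (item for item in counts if counts[item] / len(db) >= min_support),
        key=lambda item: (-counts[item], item),
    )
    access_tree = defaultdict(list)   # path of frequent items -> inverted sublist
    inverted = defaultdict(list)      # non-frequent item -> ordinary inverted list
    for tid in sorted(db):
        path = ()
        for item in frequent:         # follow the transaction's frequent items in order
            if item in db[tid]:
                path += (item,)
                access_tree[path].append(tid)
        for item in db[tid]:
            if item not in frequent:
                inverted[item].append(tid)
    return frequent, dict(access_tree), dict(inverted)


frequent, access_tree, inverted = build_hti(DB, min_support=0.4)
print(frequent)                # ['a', 'b', 'c']
print(access_tree[("a", "c")])  # [2, 4, 10]
print(inverted["d"])           # [2, 6, 7, 12, 13, 14]
```

With a threshold of 0.4 the frequent items are a, b and c, which reproduces the access tree and the four inverted lists of the running example.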


Query Evaluation: Basic Steps

1. Find the frequent items of the query set

2. Use the access tree to detect the sublists which might participate in the answer

3. Merge-join these sublists with the inverted lists of the non-frequent items
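The three steps can be sketched for a subset query as below. The access tree and inverted lists are hard-coded from the running example, with the tree flattened into a dictionary keyed by path; that layout and all function names are assumptions for illustration, not the authors' implementation.

```python
ORDER = ["a", "b", "c"]   # frequent items, most frequent first
ACCESS_TREE = {           # path of frequent items -> inverted sublist
    ("a",): [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16],
    ("a", "b"): [3, 6, 7, 8, 11, 13, 16],
    ("a", "b", "c"): [3, 6, 13],
    ("a", "c"): [2, 4, 10],
    ("b",): [12, 14],
    ("b", "c"): [12],
    ("c",): [5],
}
INVERTED = {              # inverted lists of the non-frequent items
    "d": [2, 6, 7, 12, 13, 14],
    "e": [6, 8, 9, 11, 15],
    "f": [1, 4, 13, 16],
    "g": [5, 6, 10],
}


def merge_join(left, right):
    """Intersect two ascending tid lists in one linear pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


def subset_query(query):
    # Step 1: split the query into frequent and non-frequent items.
    freq = [item for item in ORDER if item in query]
    rest = [item for item in query if item not in ORDER]
    lists = []
    if freq:
        # Step 2: nodes labelled with the last frequent query item whose
        # path also covers the other frequent query items.
        lists.append(sorted(
            tid
            for path, tids in ACCESS_TREE.items()
            if path[-1] == freq[-1] and set(freq) <= set(path)
            for tid in tids
        ))
    lists += [INVERTED.get(item, []) for item in rest]
    if not lists:
        return []
    # Step 3: merge-join the sublists with the remaining inverted lists.
    result = lists[0]
    for postings in lists[1:]:
        result = merge_join(result, postings)
    return result


print(subset_query({"b", "c", "d"}))  # [6, 12, 13]
```

For the query on ‘b’, ‘c’ and ‘d’, only the sublists under a→b→c and b→c are read (four tids in total) before joining with the list of ‘d’.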

Subset - (‘b’, ‘c’, ‘d’)

[Figure: the HTI index (vocabulary, access tree and inverted lists) for the example database, shown over several slides for the subset query on ‘b’, ‘c’ and ‘d’.]


Equality - (‘b’, ‘c’, ‘d’)

[Figure: the HTI index (vocabulary, access tree and inverted lists) for the example database, shown over several slides for the equality query on ‘b’, ‘c’ and ‘d’.]
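One possible way to evaluate an equality query on this structure is sketched below. The slides do not spell out how transactions with extra items are ruled out, so the approach here (subtract the children's sublists, then subtract the lists of non-query items) is an assumption, as are the dictionary layout and the function names. Plain Python sets stand in for merge-joins to keep the sketch short.

```python
ORDER = ["a", "b", "c"]   # frequent items, most frequent first
ACCESS_TREE = {           # path of frequent items -> inverted sublist
    ("a",): [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16],
    ("a", "b"): [3, 6, 7, 8, 11, 13, 16],
    ("a", "b", "c"): [3, 6, 13],
    ("a", "c"): [2, 4, 10],
    ("b",): [12, 14],
    ("b", "c"): [12],
    ("c",): [5],
}
INVERTED = {              # inverted lists of the non-frequent items
    "d": [2, 6, 7, 12, 13, 14],
    "e": [6, 8, 9, 11, 15],
    "f": [1, 4, 13, 16],
    "g": [5, 6, 10],
}


def exact_tids(path):
    """Tids whose frequent items are exactly `path`: the node's sublist minus its children's."""
    in_children = {
        tid
        for child, tids in ACCESS_TREE.items()
        if len(child) == len(path) + 1 and child[:len(path)] == path
        for tid in tids
    }
    return [tid for tid in ACCESS_TREE.get(path, []) if tid not in in_children]


def equality_query(query):
    path = tuple(item for item in ORDER if item in query)
    rest = [item for item in query if item not in ORDER]
    if path:
        candidates = set(exact_tids(path))
    elif rest:
        # No frequent item in the query: start from one inverted list and
        # drop every tid that appears anywhere in the access tree.
        in_tree = {tid for p, tids in ACCESS_TREE.items() if len(p) == 1 for tid in tids}
        candidates = set(INVERTED.get(rest[0], [])) - in_tree
    else:
        return []
    for item in rest:                       # must contain every non-frequent query item
        candidates &= set(INVERTED.get(item, []))
    for item, tids in INVERTED.items():     # must not contain any other non-frequent item
        if item not in rest:
            candidates -= set(tids)
    return sorted(candidates)


print(equality_query({"b", "c", "d"}))  # [12]
```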


Superset - (‘b’, ‘c’, ‘d’)

[Figure: the HTI index (vocabulary, access tree and inverted lists) for the example database, shown over several slides for the superset query on ‘b’, ‘c’ and ‘d’.]
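A superset query could be sketched along the same lines. Again this is only one plausible strategy under stated assumptions (dictionary-based access tree, set operations instead of merge-joins, hypothetical function names), not the procedure from the paper: collect the transactions whose frequent items all belong to the query, then discard those containing a non-frequent item outside the query.

```python
ORDER = ["a", "b", "c"]   # frequent items, most frequent first
ACCESS_TREE = {           # path of frequent items -> inverted sublist
    ("a",): [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16],
    ("a", "b"): [3, 6, 7, 8, 11, 13, 16],
    ("a", "b", "c"): [3, 6, 13],
    ("a", "c"): [2, 4, 10],
    ("b",): [12, 14],
    ("b", "c"): [12],
    ("c",): [5],
}
INVERTED = {              # inverted lists of the non-frequent items
    "d": [2, 6, 7, 12, 13, 14],
    "e": [6, 8, 9, 11, 15],
    "f": [1, 4, 13, 16],
    "g": [5, 6, 10],
}


def exact_tids(path):
    """Tids whose frequent items are exactly `path`: the node's sublist minus its children's."""
    in_children = {
        tid
        for child, tids in ACCESS_TREE.items()
        if len(child) == len(path) + 1 and child[:len(path)] == path
        for tid in tids
    }
    return [tid for tid in ACCESS_TREE.get(path, []) if tid not in in_children]


def superset_query(query):
    q = set(query)
    candidates = set()
    # Nodes whose whole path lies inside the query.
    for path in ACCESS_TREE:
        if set(path) <= q:
            candidates.update(exact_tids(path))
    # Transactions without any frequent item are not in the access tree;
    # pick them up from the inverted lists of the query's non-frequent items.
    in_tree = {tid for p, tids in ACCESS_TREE.items() if len(p) == 1 for tid in tids}
    for item in q:
        if item in INVERTED:
            candidates.update(set(INVERTED[item]) - in_tree)
    # Discard transactions that contain a non-frequent item outside the query.
    for item, tids in INVERTED.items():
        if item not in q:
            candidates.difference_update(tids)
    return sorted(candidates)


print(superset_query({"b", "c", "d"}))  # [12, 14]
```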


Experiments: Setup

Real Data from UCI
– web log from microsoft.com [320k records, 294 items]
– web log from msnbc.com [1M records, 17 items]

Synthetic data
– Zipfian distribution of order 1
– 100k-1M records
– 1k-10k items
– Queries with 2-22 items
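For readers who want to reproduce a comparable workload, a rough sketch of a Zipf-distributed generator follows. The slides give only the distribution, record counts and vocabulary sizes; the record-length rule (`max_len`, uniform lengths), the seed and the function name are assumptions made for this example.

```python
import random


def zipf_transactions(num_records, vocab_size, max_len, seed=0):
    """Generate set-values whose items follow a Zipf distribution with exponent 1."""
    rng = random.Random(seed)
    items = list(range(vocab_size))
    # The item at rank r (1-based) is drawn with weight 1/r.
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    records = []
    for _ in range(num_records):
        length = rng.randint(1, max_len)
        # Draw with replacement, then deduplicate, so a record may hold
        # fewer than `length` distinct items.
        records.append(set(rng.choices(items, weights=weights, k=length)))
    return records


records = zipf_transactions(num_records=1000, vocab_size=100, max_len=8)
print(len(records))
```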

Experiments: Query performance – DB size

[Figure: synthetic data - DB size. x-axis: 1000's of records (0 to 1000); y-axis: disk page accesses (0 to 3000); series: IF, HTI - 0.5%, HTI - 1%, HTI - 3%.]

Experiments: Query performance – query length

[Figure: synthetic data - query length. x-axis: query length (2 to 22); y-axis: disk page accesses (0 to 2500); series: IF, HTI - 0.5%, HTI - 1%, HTI - 3%.]

Experiments: Query performance – query length

[Figure: real data - subset. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 400); series: IF, HTI - 5%, HTI - 20%, HTI - 40%.]

Experiments: Query performance – query length

[Figure: real data - equality. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 400); series: IF, HTI - 5%, HTI - 20%, HTI - 40%.]

Experiments: Query performance – query length

[Figure: real data - superset. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 1000); series: IF, HTI - 5%, HTI - 20%, HTI - 40%.]

Experiments: Access tree size – DB size

[Figure: Effect of the DB size. x-axis: 1000's of records (0 to 1000); y-axis: 1000's of tree nodes (0 to 2500); series: HTI - 0.5%, HTI - 1%, HTI - 3%.]

Experiments: Access tree size – DB size

[Figure: Effect of the DB size. x-axis: millions of records (0 to 30); y-axis: 1000's of tree nodes (0 to 1800).]


Experiments

The HTI scales a lot better than the inverted file as the query and the database size grow

A small threshold is enough for a performance gain over an order of magnitude

The main memory requirements do not exceed 0.5M for the real data.


Conclusions

The HTI index relies on breaking up the larger inverted lists into smaller lists that contain known combinations of items

The HTI index significantly outperforms the inverted file for small domains and skewed item distributions

It has moderate memory requirements that can be adjusted by using the right threshold


The End

Thank You!

Experiments: Vocabulary size

[Figure: Effect of the vocabulary size. x-axis: vocabulary size in 1000's of items (1 to 9); y-axis: 1000's of tree nodes (0 to 1600); series: HTI - 0.5%, HTI - 1%, HTI - 3%.]

Experiments: Threshold choice

[Figure: Effect of the threshold. x-axis: threshold (0% to 10%); y-axis: 1000's of tree nodes (0 to 1400).]

Experiments: Threshold choice

[Figure: Effect of the threshold. x-axis: threshold (0% to 10%); y-axis: average number of disk page accesses (0 to 300).]