
Using Trees to Depict a Forest

Bin Liu, H. V. Jagadish
EECS, University of Michigan, Ann Arbor


Motivation – Too Many Results

• In interactive database querying, we often get more results than we can comprehend immediately

• Try searching for a popular keyword

• When do you actually click past 2-3 pages of results?
– 85% of users never go to the second page [1,2]


Why IR Solutions Do NOT Apply

• Sorting and ranking are standard IR techniques
– Search engines show most relevant hits in the first page
• However, for a database query, all tuples in the query result set are equally relevant
– For example, Select * from Cars where price < 13,000
– All matching results should be available to the user
– What to do when there are millions of results?


Make the First Page Count

• If no user preference information is available, how to best arrange results?
– Sort by attribute?
– Random selection?
– Others?
• Show the most “representative” results
– Best help users learn what is in the result set
– User can decide further actions based on representatives


Our Proposal – MusiqLens Experience


Suppose a user wants a 2005 Civic but there are too many of them…


MusiqLens on the Car Data

ID   MODEL  PRICE    YEAR  MILEAGE  CONDITION
872  Civic  $12,000  2005  50,000   Good       122 more like this
901  Civic  $16,000  2005  40,000   Excellent  345 more like this
725  Civic  $18,500  2005  30,000   Excellent  86 more like this
423  Civic  $17,000  2005  42,000   Good       201 more like this
132  Civic  $9,500   2005  86,000   Fair       185 more like this
322  Civic  $14,000  2005  73,000   Good       55 more like this


Zooming in: 2005 Honda Civics ~ ID 132

ID   MODEL  PRICE    YEAR  MILEAGE  CONDITION
342  Civic  $9,800   2005  72,000   Good       25 more like this
768  Civic  $10,000  2005  60,000   Good       10 more like this
132  Civic  $9,500   2005  86,000   Fair       63 more like this
122  Civic  $9,500   2005  76,000   Good       5 more like this
123  Civic  $9,100   2005  81,000   Fair       40 more like this
898  Civic  $9,000   2005  69,000   Fair       42 more like this


Now Suppose User Filters by “Price < 9,500”

ID   MODEL  PRICE    YEAR  MILEAGE  CONDITION
342  Civic  $9,800   2005  72,000   Good       25 more like this
768  Civic  $10,000  2005  60,000   Good       10 more like this
132  Civic  $9,500   2005  86,000   Fair       63 more like this
122  Civic  $9,500   2005  76,000   Good       5 more like this
123  Civic  $9,100   2005  81,000   Fair       40 more like this
898  Civic  $9,000   2005  69,000   Fair       42 more like this


After Filtering by “Price < 9,500”

ID   MODEL  PRICE   YEAR  MILEAGE  CONDITION
123  Civic  $9,100  2005  81,000   Fair       40 more like this
898  Civic  $9,000  2005  69,000   Fair       42 more like this
133  Civic  $9,300  2005  87,000   Fair       33 more like this
126  Civic  $9,200  2005  89,000   Good       3 more like this
129  Civic  $8,900  2005  81,000   Fair       20 more like this
999  Civic  $9,000  2005  87,000   Fair       12 more like this


Challenges

• Metric challenge
– What is the best set of representatives?
• Representative finding challenge
– How to find them efficiently?
• Query challenge
– How to efficiently adapt to user’s query operations?


Finding a Suitable Metric

• Users should be the ultimate judge
– Which metric generates the representatives that I can learn the most from?
• User study
– Use a set of candidate metrics
– Users observe the representatives
– Users estimate more data points in the data
– The candidate whose representatives lead to the best estimation wins


Metric Candidates

• Sort by attributes
• Uniform random sampling
• Density-biased sampling [3]
• Sort by typicality [4]
• K-medoids
– Average
– Maximum


Density-biased Sampling

• Proposed by C. R. Palmer and C. Faloutsos [3]
• Sample more from sparse regions, less from dense regions
• To counter the weakness of uniform sampling where small clusters are missed
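The idea of sampling more from sparse regions can be sketched as follows. This is only an illustration under stated assumptions, not the procedure from [3] as published: it assumes 2D points normalized to 0-1, estimates density with a uniform grid, and keeps a point from a cell holding n points with probability proportional to n^(-e). The function name, the grid size, and the exponent `e` are all hypothetical choices.

```python
import random
from collections import defaultdict

def density_biased_sample(points, target_size, cells_per_axis=10, e=1.0, seed=0):
    """Sample 2D points in [0, 1]^2, favoring points in sparse grid cells."""
    rng = random.Random(seed)
    # Bin points into a uniform grid (an assumed stand-in for a density estimate).
    cells = defaultdict(list)
    for x, y in points:
        cx = min(int(x * cells_per_axis), cells_per_axis - 1)
        cy = min(int(y * cells_per_axis), cells_per_axis - 1)
        cells[(cx, cy)].append((x, y))
    # A point in a cell holding n points is kept with probability alpha / n**e.
    # alpha is chosen so the expected sample size is target_size (before clipping).
    alpha = target_size / sum(len(c) ** (1.0 - e) for c in cells.values())
    sample = []
    for members in cells.values():
        keep_prob = min(1.0, alpha / len(members) ** e)
        sample.extend(p for p in members if rng.random() < keep_prob)
    return sample
```

With `e = 0` every point is kept with the same probability, which reduces to uniform sampling; with `e = 1` each non-empty cell contributes roughly the same expected number of points, so small clusters are much less likely to be missed.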


Sort by Typicality


Proposed by Ming Hua, Jian Pei, et al [4]

Figure source: slides from Ming Hua


Metric Candidates - K-medoids

• A medoid of a cluster is the object whose average or maximum dissimilarity to others is smallest
– Average medoid and max medoid
• K-medoids are k objects, each from a cluster where the object is the medoid
• Why not K-means
– K-means cluster centers do not exist in database
– We must present real objects to users
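A minimal sketch of the two medoid definitions, assuming Euclidean distance as the dissimilarity and a cluster given as a list of coordinate tuples; the function name and the `aggregate` parameter are illustrative only.

```python
import math

def medoid(cluster, aggregate="avg"):
    """Return the member of `cluster` whose average (or maximum) Euclidean
    distance to the other members is smallest."""
    def cost(candidate):
        dists = [math.dist(candidate, other)
                 for other in cluster if other is not candidate]
        if not dists:  # single-element cluster
            return 0.0
        return max(dists) if aggregate == "max" else sum(dists) / len(dists)
    return min(cluster, key=cost)

cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
avg_medoid = medoid(cluster, "avg")  # (2, 0)
max_medoid = medoid(cluster, "max")  # (3, 0)
```

In this toy cluster the mean of the points is (3.2, 0), which is not one of the objects, while both medoids are real objects that could be shown to a user.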


Plotting the Candidates

[Figure: two scatter plots of the selected representatives, panels "Random" and "Density Biased"; both axes range from 0 to 1]

Data: Yahoo! Autos, 3922 data points. Normalized price and mileage to 0-1.

Plotting the Candidates - Typicality

[Figure: scatter plot of the selected representatives, panel "Typical"; both axes range from 0 to 1]

Plotting the Candidates – k-medoids

[Figure: two scatter plots of the selected representatives, panels "Max-Medoids" and "Avg-Medoids"; both axes range from 0 to 1]


User Study Procedure

• Users are given
– 7 sets of data, generated using the 7 candidate methods
– Each set consists of 8 representative points
• Users predict 4 more data points
– That are most likely in the data set
– Should not pick those already given
• Measure the prediction error


Prediction Quality Measurement

[Figure: data point So and points P1 and P2, at distances D1 and D2 from So]

For data point So:
– MinDist: D1
– MaxDist: D2
– AvgDist: (D1+D2)/2
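The three measures can be sketched as below. The sketch assumes that P1 and P2 are the user's predicted points, that distances are Euclidean, and that AvgDist generalizes to the mean distance when there are more than two predictions; the slide itself only shows the two-point case.

```python
import math

def prediction_errors(data_point, predictions):
    """Return (MinDist, MaxDist, AvgDist) from one data point to the predictions."""
    dists = [math.dist(data_point, p) for p in predictions]
    return min(dists), max(dists), sum(dists) / len(dists)

# One data point at the origin and two predicted points.
min_d, max_d, avg_d = prediction_errors((0, 0), [(3, 4), (6, 8)])
# D1 = 5 and D2 = 10, so MinDist = 5, MaxDist = 10, AvgDist = 7.5
```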


Performance – AvgDist and MaxDist

[Figure: bar chart of AvgDist and MaxDist for each candidate method; y-axis from 0 to 1.2]

For AvgDist: Avg-Medoid is the winner.

For MaxDist: Max-Medoid is the winner.


Performance – MinDist

[Figure: bar chart of MinDist for Random, Avg-Med, Sort-Mile, Density, Sort-Price, Max-Med, and Typical; y-axis from 0 to 0.016]

Avg-Medoid seems to be the winner


Verdict

• Although the result is not significant in MinDist, overall Avg-Medoid is better than Density
• Based on AvgDist and MinDist: Avg-Medoid
• Based on MaxDist: Max-Medoid
• In this paper, we choose average k-medoids
– Our algorithm can extend to max-medoids with small changes
• Statistical Significance of Result


Cover Tree Based Algorithm

• Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006 [5]
• Briefly discuss Cover Tree properties
• Cover Tree based algorithms for computing k-medoids


Cover Tree Properties (1)

Figure modified from slides of Cover Tree authors

[Figure: levels $C_i$ and $C_{i+1}$ drawn over the points in the data (one dimension)]

Nesting: for all $i$, $C_i \subset C_{i+1}$

Assume all pairwise distances are <= 1.


Cover Tree Properties (2)

[Figure: levels $C_i$ and $C_{i+1}$]

Covering: a node in $C_i$ is within distance $1/2^i$ of its children in $C_{i+1}$

Distance from a node to any descendant is less than $1/2^{i-1}$. This value is called the “span” of the node.


Cover Tree Properties (3)

Figure modified from slides of Cover Tree authors

[Figure: levels $C_i$ and $C_{i+1}$ drawn over the points in the data]

Separation: nodes in $C_i$ are separated by at least $1/2^i$
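The three cover tree properties (nesting, covering within $1/2^i$, and separation by at least $1/2^i$) can be checked on a small example. The sketch below is illustrative only: it assumes the levels are given explicitly as a list of sets, with `levels[i]` playing the role of $C_i$, and the function name and representation are not from the slides.

```python
def check_cover_tree_levels(levels, dist):
    """levels[i] is the set C_i; return True if nesting, covering, and
    separation all hold for the given distance function."""
    for i, c_i in enumerate(levels):
        radius = 1.0 / 2 ** i
        # Separation: distinct points in C_i are at least 1/2^i apart.
        if any(dist(p, q) < radius for p in c_i for q in c_i if p != q):
            return False
        if i + 1 < len(levels):
            c_next = levels[i + 1]
            # Nesting: C_i is a subset of C_{i+1}.
            if not set(c_i) <= set(c_next):
                return False
            # Covering: every point of C_{i+1} has a parent in C_i within 1/2^i.
            if any(min(dist(p, q) for p in c_i) > radius for q in c_next):
                return False
    return True

# One-dimensional example with all pairwise distances <= 1.
levels = [{0.0}, {0.0, 1.0}, {0.0, 0.5, 1.0}]
ok = check_cover_tree_levels(levels, lambda a, b: abs(a - b))
```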


Additional Stats for Cover Tree (2D Example)

[Figure: example cover tree over points s1 to s10, with subtrees annotated DS = 10 and DS = 3]

Density (DS): number of points in the subtree

Centroid (CT): geometric center of points in the subtree


k-medoid Algorithm Outline

• We descend the cover tree to a level with more than k nodes
• Choose an initial k points as first set of medoids (seeds)
– Bad seeds can lead to local minimums with a high distance cost
• Assign nodes to medoids and update repeatedly until medoids converge


Cover Tree Based Seeding


• Descend the cover tree to a level with more than k nodes (denote as level m)
• Use the parent level (m-1) as starting point for seeds
– Each node has a weight, calculated as product of span and density (the contribution of the subtree to the distance cost)
– Expand nodes using a priority queue
– Fetch the first k nodes from the queue as seeds
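One possible reading of the seeding step is sketched below. The slides do not spell out the expansion rule, so this sketch assumes that the heaviest node is repeatedly replaced by its children until at least k candidates exist, and that ties are broken by insertion order. The `Node` class and `pick_seeds` are hypothetical names, and the densities in the usage example are illustrative values chosen for a k = 4 case.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Node:
    name: str
    span: float     # upper bound on the distance from the node to any descendant
    density: int    # number of points in the subtree
    children: list = field(default_factory=list)

    @property
    def weight(self):
        return self.span * self.density

def pick_seeds(start_level, k):
    """Expand the heaviest node until at least k candidates exist,
    then return the k heaviest candidates."""
    tie = count()  # insertion order breaks ties between equal weights
    heap = [(-n.weight, next(tie), n) for n in start_level]
    heapq.heapify(heap)
    leaves = []  # popped nodes that cannot be expanded further
    while heap and len(heap) + len(leaves) < k:
        _, _, node = heapq.heappop(heap)
        if node.children:
            for child in node.children:
                heapq.heappush(heap, (-child.weight, next(tie), child))
        else:
            leaves.append(node)
    candidates = leaves + [entry[2] for entry in sorted(heap)]
    # sorted() is stable, so equal weights keep their current order.
    return sorted(candidates, key=lambda n: -n.weight)[:k]

# Level m-1 with three nodes (span 1); S3 has three children at span 1/2.
s3 = Node("S3", 1.0, 5, [Node("S3", 0.5, 2), Node("S7", 0.5, 2), Node("S2", 0.5, 1)])
s8 = Node("S8", 1.0, 3)
s5 = Node("S5", 1.0, 2)
seeds = pick_seeds([s3, s8, s5], k=4)
```

Here the queue starts with three entries, fewer than k = 4, so the heaviest node (S3, weight 5) is expanded, and the four heaviest candidates are then S8, S5, S3, and S7.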


A Simple Example: k = 4

[Figure: example cover tree over points s1 to s10; the levels from top to bottom are labeled Span = 2, Span = 1, Span = 1/2, Span = 1/4]

Priority Queue on node weight (density * span):

S3 (5), S8 (3), S5 (2)

S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2)

Final set of seeds: the first k = 4 nodes in the queue


Update Process

1. Initially, assign all nodes to closest seed to form k clusters

2. For each cluster, calculate the geometric center

• Use centroid and density information to approximate subtree

3. Find the node that is closest to the geometric center, designate as a new medoid

4. Repeat from step 1 until medoids converge
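The four steps can be sketched as a loop. This is a simplified illustration, not the authors' implementation: it assumes each tree node is given as a `(point, centroid, density)` tuple, that seeds are indices of nodes at distinct locations, and that distances are Euclidean. The function name and the `max_iter` guard are additions for the sketch.

```python
import math

def update_medoids(nodes, seeds, max_iter=100):
    """nodes: list of (point, centroid, density); seeds: indices into nodes.
    Assumes the seed nodes sit at distinct locations."""
    medoids = list(seeds)
    for _ in range(max_iter):
        # Step 1: assign every node to its closest medoid.
        clusters = {m: [] for m in medoids}
        for idx, (point, _, _) in enumerate(nodes):
            closest = min(medoids, key=lambda m: math.dist(point, nodes[m][0]))
            clusters[closest].append(idx)
        new_medoids = []
        for members in clusters.values():
            # Step 2: the density-weighted mean of the subtree centroids
            # approximates the geometric center of all points in the cluster.
            total = sum(nodes[i][2] for i in members)
            dims = len(nodes[members[0]][1])
            center = tuple(
                sum(nodes[i][1][d] * nodes[i][2] for i in members) / total
                for d in range(dims)
            )
            # Step 3: the node closest to the center becomes the new medoid.
            new_medoids.append(
                min(members, key=lambda i: math.dist(nodes[i][0], center))
            )
        # Step 4: stop once the medoids no longer change.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids

# Two groups of leaf nodes on a line; each centroid equals its point, density 1.
pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
nodes = [(p, p, 1) for p in pts]
result = update_medoids(nodes, seeds=[0, 3])  # indices of (1, 0) and (11, 0)
```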


Query Adaptation

• Handle user actions
– Zooming
– Selection (filtering)
• Zooming
– Expand all nodes assigned to the medoid
– Run k-medoid algorithm on the new set of nodes


Selection

• Effect of selection on a node
– Completely invalid
– Fully valid
– Partially valid
• Estimate the validity percentage (VG) of each node
• Multiply the VG with weight of each node

[Figure: nodes S1 to S7 plotted by Mileage and Price, with a selection region]
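The slides do not say how the validity percentage is estimated. One simple possibility, shown purely as an assumption, is to give each node a bounding box and take the fraction of the box that overlaps the selection region, treating points as uniformly spread inside the box. The function names and the box representation below are hypothetical.

```python
def validity_fraction(node_box, query_box):
    """Fraction of node_box's area inside query_box.
    Boxes are ((x_min, y_min), (x_max, y_max))."""
    (nx0, ny0), (nx1, ny1) = node_box
    (qx0, qy0), (qx1, qy1) = query_box
    area = (nx1 - nx0) * (ny1 - ny0)
    if area == 0:
        # Degenerate box: fall back to testing whether its corner is selected.
        inside = qx0 <= nx0 <= qx1 and qy0 <= ny0 <= qy1
        return 1.0 if inside else 0.0
    overlap_w = max(0.0, min(nx1, qx1) - max(nx0, qx0))
    overlap_h = max(0.0, min(ny1, qy1) - max(ny0, qy0))
    return (overlap_w * overlap_h) / area

def adjusted_weight(weight, node_box, query_box):
    """Scale a node's weight by its estimated validity under the selection."""
    return validity_fraction(node_box, query_box) * weight

node_box = ((0, 0), (2, 2))
fully = validity_fraction(node_box, ((-1, -1), (5, 5)))   # fully valid
partly = validity_fraction(node_box, ((1, 0), (3, 2)))    # partially valid
invalid = validity_fraction(node_box, ((3, 3), (4, 4)))   # completely invalid
```

Under this assumption a fully valid node keeps its weight, a completely invalid node drops to weight zero, and a partially valid node is scaled in between.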


Experiments – Initial Medoid Quality

• Compare with R-tree based method [6]
• Data sets
– Synthetic dataset: 2D points with Zipf distribution
– Real dataset: LA data set from R-tree Portal, 130k points
• Measurement
– Time to compute the medoids
– Average distance from a data point to its medoid


Results on Synthetic Data

[Figure: two panels, "Time" and "Distance", comparing R-tree and Cover Tree; x-axis: Cardinality (256K, 512K, 1024K, 2048K, 4096K); y-axes: Time (seconds) and Distance]

For various sizes of data, Cover-tree based method outperforms R-tree based method


Result on Real Data

[Figure: two panels comparing R-tree and Cover Tree; x-axis: k (2, 8, 32, 128, 512); y-axes: Distance and Time (seconds)]

For various k values, Cover-tree based method outperforms R-tree based method on real data


Query Adaptation

[Figure: two panels, "Synthetic Data" and "Real Data", comparing Re-Compute and Incremental; x-axis: Selectivity (0.8, 0.6, 0.4, 0.2); y-axis labeled Distance in both panels]

Compare with re-building the cover tree and running the k-medoid algorithm from scratch.

Time cost of re-building is orders-of-magnitude higher than incremental computation.


Related Work

• Classic/textbook k-medoid methods
– Partitioning Around Medoids (PAM) and Clustering LARge Applications (CLARA), L. Kaufman and P. Rousseeuw, 1990
– CLARANS, R. T. Ng and J. Han, TKDE 2002
• Tree-based methods
– Focusing on Representatives (FOR), M. Ester, H. Kriegel, and X. Xu, KDD 1996
– Tree-based Partition Querying (TPAQ), K. Mouratidis, D. Papadias, and S. Papadimitriou, VLDBJ 2008


Related Work (2)

• Clustering methods
– For example, BIRCH, T. Zhang, R. Ramakrishnan, and M. Livny, SIGMOD 1996
• Result presentation methods
– Automatic result categorization, K. Chakrabarti, S. Chaudhuri, and S.-w. Hwang, SIGMOD 2004
– DataScope, T. Wu, et al., VLDB 2007
• Other recent work
– Finding representative set from massive data, ICDM 2005
– Generalized group by, C. Li, et al., SIGMOD 2007
– Query result diversification, E. Vee et al., ICDE 2008


Conclusion

• We proposed the MusiqLens framework for solving the many-answer problem
• We conducted a user study to select a metric for choosing representatives
• We proposed an efficient method for computing and maintaining the representatives under user actions
• Part of the database usability project at Univ. of Michigan
– Led by Prof. H.V. Jagadish
– http://www.eecs.umich.edu/db/usable/


Thank you.


Bin Liu,[email protected]

Questions?


References

[1] E. Agichtein, E. Brill, S. T. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. In SIGIR, 2006.
[2] B. J. Jansen and A. Spink. How are we searching the world wide web? A comparison of nine search engine transaction logs. Inf. Process. Manage., 42(1), 2006.
[3] C. R. Palmer and C. Faloutsos. Density biased sampling: An improved method for data mining and clustering. In SIGMOD Conference, 2000.
[4] M. Hua, J. Pei, A. W.-C. Fu, X. Lin, and H. Fung Leung. Efficiently answering top-k typicality queries on large databases. In VLDB, pages 890-901, 2007.
[5] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, 2006.
[6] K. Mouratidis, D. Papadias, and S. Papadimitriou. Tree-based partition querying: a methodology for computing medoids in large spatial datasets. VLDB J., 17(4):923-945, 2008.
