Using Trees to Depict a Forest
Bin Liu, H. V. Jagadish
EECS, University of Michigan, Ann Arbor
Motivation – Too Many Results
• In interactive database querying, we often get more results than we can comprehend immediately
• Try searching a popular keyword
• When do you actually click past 2-3 pages of results?
– 85% of users never go to the second page [1,2]
Why IR Solutions Do NOT Apply
• Sorting and ranking are standard IR techniques
– Search engines show the most relevant hits on the first page
• However, for a database query, all tuples in the result set are equally relevant
– For example, Select * from Cars where price < 13,000
– All matching results should be available to the user
– What to do when there are millions of results?
Make the First Page Count
• If no user preference information is available, how do we best arrange results?
– Sort by an attribute?
– Random selection?
– Something else?
• Show the most “representative” results
– Best help users learn what is in the result set
– Users can decide further actions based on the representatives
Our Proposal – MusiqLens Experience
Suppose a user wants a 2005 Civic but there are too many of them…
MusiqLens on the Car Data

ID   MODEL  PRICE    YEAR  MILEAGE  CONDITION
872  Civic  $12,000  2005  50,000   Good       (122 more like this)
901  Civic  $16,000  2005  40,000   Excellent  (345 more like this)
725  Civic  $18,500  2005  30,000   Excellent  (86 more like this)
423  Civic  $17,000  2005  42,000   Good       (201 more like this)
132  Civic  $9,500   2005  86,000   Fair       (185 more like this)
322  Civic  $14,000  2005  73,000   Good       (55 more like this)
Zooming in: 2005 Honda Civics ~ ID 132

ID   MODEL  PRICE    YEAR  MILEAGE  CONDITION
342  Civic  $9,800   2005  72,000   Good  (25 more like this)
768  Civic  $10,000  2005  60,000   Good  (10 more like this)
132  Civic  $9,500   2005  86,000   Fair  (63 more like this)
122  Civic  $9,500   2005  76,000   Good  (5 more like this)
123  Civic  $9,100   2005  81,000   Fair  (40 more like this)
898  Civic  $9,000   2005  69,000   Fair  (42 more like this)
Now Suppose User Filters by “Price < 9,500”
After Filtering by “Price < 9,500”

ID   MODEL  PRICE   YEAR  MILEAGE  CONDITION
123  Civic  $9,100  2005  81,000   Fair  (40 more like this)
898  Civic  $9,000  2005  69,000   Fair  (42 more like this)
133  Civic  $9,300  2005  87,000   Fair  (33 more like this)
126  Civic  $9,200  2005  89,000   Good  (3 more like this)
129  Civic  $8,900  2005  81,000   Fair  (20 more like this)
999  Civic  $9,000  2005  87,000   Fair  (12 more like this)
Challenges
• Metric challenge
– What is the best set of representatives?
• Representative-finding challenge
– How to find them efficiently?
• Query challenge
– How to efficiently adapt to the user's query operations?
Finding a Suitable Metric
• Users should be the ultimate judge
– Which metric generates the representatives that users can learn the most from?
• User study
– Use a set of candidate metrics
– Users observe the representatives
– Users estimate more data points in the data set
– The metric whose representatives lead to the best estimates wins
Metric Candidates
• Sort by attributes
• Uniform random sampling
• Density-biased sampling [3]
• Sort by typicality [4]
• K-medoids
– Average
– Maximum
Density-biased Sampling
• Proposed by C. R. Palmer and C. Faloutsos [3]
• Sample more from sparse regions, less from dense regions
• Counters the weakness of uniform sampling, where small clusters are missed
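The idea can be sketched as follows. This is an illustrative simplification, not Palmer and Faloutsos's exact scheme: each point is weighted inversely to the population of its grid cell, so sparse regions are over-sampled. The function name, grid size, and helper are our assumptions.

```python
import random
from collections import Counter

def density_biased_sample(points, k, cell=0.1, seed=0):
    """Draw k points, favoring sparse regions.

    Illustrative simplification of density-biased sampling [3]:
    each point is weighted by 1 / (population of its grid cell),
    so small clusters are unlikely to be missed entirely.
    """
    rng = random.Random(seed)
    bucket = lambda p: tuple(int(c // cell) for c in p)  # grid cell of a point
    counts = Counter(bucket(p) for p in points)
    pool = list(points)
    weights = [1.0 / counts[bucket(p)] for p in pool]
    sample = []
    for _ in range(min(k, len(pool))):
        # Weighted draw without replacement.
        i = rng.choices(range(len(pool)), weights=weights)[0]
        sample.append(pool.pop(i))
        weights.pop(i)
    return sample
```

A point sitting alone in its cell gets weight 1, while each point in a 100-point cell gets weight 0.01, which is what lets small clusters survive into the sample.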
Sort by Typicality
• Proposed by M. Hua, J. Pei, et al. [4]
[Figure omitted; source: slides from Ming Hua]
Metric Candidates - K-medoids
• A medoid of a cluster is the object whose average or maximum dissimilarity to the others is smallest
– Average medoid and max medoid, respectively
• The k-medoids are k objects, each the medoid of its own cluster
• Why not k-means?
– K-means cluster centers do not exist in the database
– We must present real objects to users
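As a concrete illustration (function names are ours), the two medoid variants differ only in how dissimilarities are aggregated:

```python
def avg_medoid(cluster, dist):
    """Object with the smallest average distance to the others."""
    return min(cluster, key=lambda o: sum(dist(o, p) for p in cluster))

def max_medoid(cluster, dist):
    """Object with the smallest maximum distance to the others."""
    return min(cluster, key=lambda o: max(dist(o, p) for p in cluster))
```

Either way the result is a real object from the cluster, unlike a k-means centroid, so it can be shown to the user directly.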
Plotting the Candidates

[Scatter plots: Random and Density-Biased samples. Data: Yahoo! Autos, 3,922 data points; price and mileage normalized to 0-1.]
Plotting the Candidates – Typicality

[Scatter plot: representatives chosen by typicality, on the same normalized data.]
Plotting the Candidates – k-medoids

[Scatter plots: Max-Medoids and Avg-Medoids representatives, on the same normalized data.]
User Study Procedure
• Users are given
– 7 sets of data, generated using the 7 candidate methods
– Each set consists of 8 representative points
• Users predict 4 more data points
– That are most likely in the data set
– They should not pick points already given
• Measure the prediction error
Prediction Quality Measurement

[Figure: a predicted point So and two actual data points P1 and P2, at distances D1 and D2 from So.]

For a predicted data point So:
MinDist: D1
MaxDist: D2
AvgDist: (D1 + D2) / 2
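The three error measures for a single predicted point can be computed directly; a minimal sketch (the function name is ours):

```python
import math

def prediction_errors(predicted, data):
    """Distances from one predicted point to the actual data set:
    nearest (MinDist), farthest (MaxDist), and average (AvgDist)."""
    dists = [math.dist(predicted, p) for p in data]
    return min(dists), max(dists), sum(dists) / len(dists)
```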
Performance – AvgDist and MaxDist

[Bar chart: AvgDist and MaxDist for each candidate method.]

For AvgDist: Avg-Medoid is the winner.
For MaxDist: Max-Medoid is the winner.
Performance – MinDist

[Bar chart: MinDist for Random, Avg-Med, Sort-Mileage, Density, Sort-Price, Max-Med, Typical.]

Avg-Medoid seems to be the winner
Verdict
• Statistical significance of the result: although the MinDist result is not significant, overall Avg-Medoid is better than Density-biased sampling
• Based on AvgDist and MinDist: Avg-Medoid
• Based on MaxDist: Max-Medoid
• In this paper, we choose average k-medoids
– Our algorithm extends to max-medoids with small changes
Challenges
• Metric challenge
– What is the best set of representatives?
• Representative-finding challenge
– How to find them efficiently?
• Query challenge
– How to efficiently adapt to the user's query operations?
Cover Tree Based Algorithm
• The Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006 [5]
• We briefly discuss Cover Tree properties
• Cover-Tree-based algorithms for computing k-medoids
Cover Tree Properties (1)
[Figure modified from slides of the Cover Tree authors: levels Ci and Ci+1 over points in the data (one dimension).]

Nesting: for all i, Ci ⊆ Ci+1. Assume all pairwise distances ≤ 1.
Cover Tree Properties (2)
[Figure: levels Ci and Ci+1.]

Covering: each node in Ci is within distance 1/2^i of its children in Ci+1.
The distance from a node to any descendant is less than 1/2^(i-1); this value is called the “span” of the node.
Cover Tree Properties (3)
[Figure modified from slides of the Cover Tree authors: points in the data.]

Separation: nodes in Ci are separated by at least 1/2^i.
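Taken together, the three invariants can be checked on an explicit level-by-level representation. A sketch, under the slides' assumption that all pairwise distances are ≤ 1 (the function name and representation are ours, not the paper's):

```python
import math

def check_cover_tree_levels(levels, dist=math.dist):
    """Check the three cover-tree invariants [5]: levels[i] is
    the list of points at level i (coarse to fine).

    - Nesting:    C_i is a subset of C_{i+1}
    - Covering:   every point in C_{i+1} is within 1/2**i of
                  some point in C_i
    - Separation: points in C_i are at least 1/2**i apart
    """
    for i in range(len(levels) - 1):
        ci, cn = levels[i], levels[i + 1]
        assert set(ci) <= set(cn), "nesting violated"
        r = 1.0 / 2 ** i
        for q in cn:
            assert min(dist(q, p) for p in ci) <= r, "covering violated"
        for a in ci:
            for b in ci:
                if a != b:
                    assert dist(a, b) >= r, "separation violated"
    return True
```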
Additional Stats for Cover Tree (2D Example)

[Figure: an example cover tree over points s1–s10.]
Density (DS): number of points in the subtree (e.g., DS = 10 and DS = 3 in the figure).
Centroid (CT): geometric center of the points in the subtree.
k-medoid Algorithm Outline
• We descend the cover tree to a level with more than k nodes
• Choose an initial k points as the first set of medoids (seeds)
– Bad seeds can lead to local minima with a high distance cost
• Assign nodes and repeatedly update until the medoids converge
Cover Tree Based Seeding
• Descend the cover tree to a level with more than k nodes (denote it level m)
• Use the parent level (m-1) as the starting point for seeds
– Each node has a weight, calculated as the product of its span and density (the contribution of the subtree to the distance cost)
– Expand nodes using a priority queue
– Fetch the first k nodes from the queue as seeds
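The queue-based seeding can be sketched as follows. The node fields and expansion policy here (expand the heaviest node until at least k candidates are queued) are our assumptions, not the paper's exact structures:

```python
import heapq

class Node:
    """Minimal stand-in for a cover-tree node: its span, the
    number of points in its subtree (density), and children."""
    def __init__(self, name, span, density, children=()):
        self.name, self.children = name, list(children)
        self.weight = span * density   # contribution to distance cost

def choose_seeds(level_nodes, k):
    """Expand the heaviest node (by span * density) until the
    queue holds at least k candidates, then take the best k."""
    heap = [(-n.weight, i, n) for i, n in enumerate(level_nodes)]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) < k:
        neg_w, t, top = heapq.heappop(heap)
        if not top.children:            # cannot expand any further
            heapq.heappush(heap, (neg_w, t, top))
            break
        for c in top.children:
            heapq.heappush(heap, (-c.weight, tiebreak, c))
            tiebreak += 1
    return [heapq.heappop(heap)[2] for _ in range(min(k, len(heap)))]
```

Heavy nodes are expanded first because their subtrees contribute the most to the distance cost, so refining them yields the biggest improvement in seed quality.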
A Simple Example: k = 4

[Figure: the example cover tree over s1–s10, with levels of span 2, 1, 1/2, and 1/4.]

Priority queue on node weight (density × span):
Initially: S3 (5), S8 (3), S5 (2)
After expansion: S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2)
The first k = 4 nodes in the queue are the final set of seeds.
Update Process
1. Initially, assign all nodes to the closest seed to form k clusters
2. For each cluster, calculate the geometric center
– Use centroid and density information to approximate each subtree
3. Find the node closest to the geometric center and designate it a new medoid
4. Repeat from step 1 until the medoids converge
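The four steps above can be sketched as a small loop; each node is approximated by its subtree's centroid and point count, with field names ("ct", "ds") that are our assumptions:

```python
import math

def update_medoids(nodes, medoids, max_iters=50):
    """Sketch of the update loop: nodes carry "ct" (centroid of
    the subtree) and "ds" (point count), which together stand in
    for the whole subtree."""
    for _ in range(max_iters):
        # 1. Assign every node to its closest medoid.
        clusters = {id(m): [] for m in medoids}
        for n in nodes:
            nearest = min(medoids, key=lambda m: math.dist(n["ct"], m["ct"]))
            clusters[id(nearest)].append(n)
        # 2-3. Weighted geometric center per cluster; the node
        #      closest to that center becomes the new medoid.
        new_medoids = []
        for m in medoids:
            group = clusters[id(m)] or [m]
            total = sum(n["ds"] for n in group)
            center = [sum(n["ct"][d] * n["ds"] for n in group) / total
                      for d in range(len(m["ct"]))]
            new_medoids.append(min(group, key=lambda n: math.dist(n["ct"], center)))
        # 4. Stop once the medoids no longer change.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids
```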
Challenges
• Metric challenge
– What is the best set of representatives?
• Representative-finding challenge
– How to find them efficiently?
• Query challenge
– How to efficiently adapt to the user's query operations?
Query Adaptation
• Handle user actions
– Zooming
– Selection (filtering)
• Zooming
– Expand all nodes assigned to the medoid
– Run the k-medoids algorithm on the new set of nodes
Selection
• Effect of a selection on a node
– Completely invalid
– Fully valid
– Partially valid
• Estimate the validity percentage (VG) of each node
• Multiply the VG with the weight of each node
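Under a uniformity assumption, the validity percentage of a partially valid node can be estimated from how much of its bounding box survives a range selection. This is a sketch; the paper's actual estimator may differ, and the function name is ours:

```python
def validity_pct(box_min, box_max, sel_min, sel_max):
    """Estimated fraction of a node's points satisfying a range
    selection, assuming points are uniform in the node's bounding
    box: the per-dimension overlap fractions, multiplied."""
    vg = 1.0
    for lo, hi, slo, shi in zip(box_min, box_max, sel_min, sel_max):
        if hi == lo:                       # degenerate dimension
            vg *= 1.0 if slo <= lo <= shi else 0.0
        else:
            overlap = max(0.0, min(hi, shi) - max(lo, slo))
            vg *= overlap / (hi - lo)
    return vg
```

A node with VG = 0 is completely invalid and VG = 1 is fully valid; otherwise its weight is scaled by VG.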
[Figure: selection example in the price-mileage plane; a selection region leaves nodes S1-S7 fully valid, partially valid, or invalid.]
Experiments – Initial Medoid Quality
• Compare with an R-tree-based method [6]
• Data sets
– Synthetic data set: 2D points with a Zipf distribution
– Real data set: LA data set from the R-tree Portal, 130k points
• Measurements
– Time to compute the medoids
– Average distance from a data point to its medoid
Results on Synthetic Data

[Charts: time (seconds) and average distance vs. cardinality (256K-4096K) for the R-tree and Cover Tree methods.]

For various sizes of data, the Cover-Tree-based method outperforms the R-tree-based method.
Results on Real Data

[Charts: distance and time (seconds) vs. k (2-512) for the R-tree and Cover Tree methods.]

For various k values, the Cover-Tree-based method outperforms the R-tree-based method on real data.
Query Adaptation

[Charts: distance vs. selectivity (0.8-0.2) for Re-Compute and Incremental, on synthetic and real data.]

We compare with re-building the cover tree and running the k-medoids algorithm from scratch. The time cost of re-building is orders of magnitude higher than incremental computation.
Related Work
• Classic/textbook k-medoids methods
– Partitioning Around Medoids (PAM) and Clustering LARge Applications (CLARA), L. Kaufman and P. Rousseeuw, 1990
– CLARANS, R. T. Ng and J. Han, TKDE 2002
• Tree-based methods
– Focusing on Representatives (FOR), M. Ester, H. Kriegel, and X. Xu, KDD 1996
– Tree-based Partitioning Querying (TPAQ), K. Mouratidis, D. Papadias, and S. Papadimitriou, VLDBJ 2008
Related Work (2)
• Clustering methods
– For example, BIRCH, T. Zhang, R. Ramakrishnan, and M. Livny, SIGMOD 1996
• Result presentation methods
– Automatic result categorization, K. Chakrabarti, S. Chaudhuri, and S.-w. Hwang, SIGMOD 2004
– DataScope, T. Wu, et al., VLDB 2007
• Other recent work
– Finding a representative set from massive data, ICDM 2005
– Generalized group-by, C. Li, et al., SIGMOD 2007
– Query result diversification, E. Vee, et al., ICDE 2008
Conclusion
• We proposed the MusiqLens framework for solving the many-answers problem
• We conducted a user study to select a metric for choosing representatives
• We proposed efficient methods for computing and maintaining the representatives under user actions
• Part of the database usability project at the University of Michigan
– Led by Prof. H. V. Jagadish
– http://www.eecs.umich.edu/db/usable/
References
[1] E. Agichtein, E. Brill, S. T. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. SIGIR, 2006.
[2] B. J. Jansen and A. Spink. How are we searching the world wide web? A comparison of nine search engine transaction logs. Inf. Process. Manage., 42(1), 2006.
[3] C. R. Palmer and C. Faloutsos. Density biased sampling: An improved method for data mining and clustering. SIGMOD, 2000.
[4] M. Hua, J. Pei, A. W.-C. Fu, X. Lin, and H.-F. Leung. Efficiently answering top-k typicality queries on large databases. VLDB, pages 890-901, 2007.
[5] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. ICML, 2006.
[6] K. Mouratidis, D. Papadias, and S. Papadimitriou. Tree-based partition querying: a methodology for computing medoids in large spatial datasets. VLDB J., 17(4):923-945, 2008.