Maximal Vector Computation in Large Data Sets The 31st International Conference on Very Large Data Bases VLDB 2005 / VLDB Journal 2006, August Parke Godfrey, Jarek Gryz York University Ryan Shipley The College of William and Mary Speaker: ZHANG Shiming (Simon) Supervisor: Prof. David Cheung Dr. Nikos Mamoulis


Maximal Vector Computation in Large Data Sets

The 31st International Conference on Very Large Data Bases
VLDB 2005 / VLDB Journal 2006, August

Parke Godfrey, Jarek Gryz (York University)
Ryan Shipley (The College of William and Mary)

Speaker: ZHANG Shiming (Simon)
Supervisor: Prof. David Cheung, Dr. Nikos Mamoulis


Department of Computer Sciences, The University of Hong Kong

The 31st International Conference on Very Large Data Bases (VLDB05/VLDBJ06 August), 23/4/21

Outline
Introduction
Skyline vs. Maximal Vector Problem
Goals & Accomplishments
Design & Analysis Considerations
Generic Algorithms & Analyses
LESS Algorithm & Performance
Conclusions

This presentation is based on this paper but is not limited to it.


What is skyline? Skyline Query

Given a set of d-dimensional data points, a skyline query finds the set of data points that are not dominated by any other point.

Adversarial skyline query: finds the set of data points that do not dominate any other point (not covered in any paper)

Dominate Relationship
A data point p dominates another data point q if and only if p is better than or as good as q (by preference) on all dimensions and p is strictly better than q on at least one dimension.
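The dominance definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming that larger values are preferred on every dimension; the names `dominates` and `naive_skyline` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Return True if p dominates q.

    Assumption: larger values are better on every dimension.
    """
    at_least_as_good = all(a >= b for a, b in zip(p, q))
    strictly_better = any(a > b for a, b in zip(p, q))
    return at_least_as_good and strictly_better


def naive_skyline(points: List[Point]) -> List[Point]:
    """Quadratic baseline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 9), (5, 5), (9, 1), (4, 4)]
print(naive_skyline(points))  # (4, 4) is dominated by (5, 5)
```

Note that two identical points do not dominate each other under this definition, so duplicates of a skyline point are all kept.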

Monotone Preference Function


What is skyline? SQL Extensions

Find the maximals over tuples in the database context w.r.t. skyline criteria

SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ...
SKYLINE OF [DISTINCT] d1 [MIN | MAX | DIFF], ..., dm [MIN | MAX | DIFF]
ORDER BY ...


What is skyline? Skyline Examples

Hotel Information (price, # of rooms)

Name      # of rooms   Price
Hotel 1   20           70
Hotel 2   40           40
Hotel 3   40           100
Hotel 4   50           70
Hotel 5   60           100
Hotel 6   70           10
Hotel 7   80           40

[Figure: hotels plotted by price against # of rooms; the interesting hotels form the skyline of hotels]

Interesting hotel: not too crowded and cheap
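The hotel table can be worked through by hand or with a short script. The sketch below assumes that "not too crowded and cheap" means minimizing both the number of rooms and the price, which is an interpretation rather than something the slide states; `skyline_of` and its `criteria` argument are hypothetical names that loosely mirror the MIN/MAX directives of the SKYLINE OF clause (DIFF is not handled).

```python
from typing import Dict, List

Row = Dict[str, object]

hotels: List[Row] = [
    {"name": "Hotel 1", "rooms": 20, "price": 70},
    {"name": "Hotel 2", "rooms": 40, "price": 40},
    {"name": "Hotel 3", "rooms": 40, "price": 100},
    {"name": "Hotel 4", "rooms": 50, "price": 70},
    {"name": "Hotel 5", "rooms": 60, "price": 100},
    {"name": "Hotel 6", "rooms": 70, "price": 10},
    {"name": "Hotel 7", "rooms": 80, "price": 40},
]


def dominates(a: Row, b: Row, criteria: Dict[str, str]) -> bool:
    """True if row a dominates row b; criteria maps column -> "MIN" or "MAX"."""
    strictly_better = False
    for column, direction in criteria.items():
        # Orient the pair so that "x > y" always means "a is better than b".
        x, y = (a[column], b[column]) if direction == "MAX" else (b[column], a[column])
        if x < y:
            return False  # a is worse on this column, so it cannot dominate
        if x > y:
            strictly_better = True
    return strictly_better


def skyline_of(rows: List[Row], criteria: Dict[str, str]) -> List[Row]:
    """Keep the rows that no other row dominates (quadratic, for illustration)."""
    return [r for r in rows if not any(dominates(o, r, criteria) for o in rows)]


result = skyline_of(hotels, {"rooms": "MIN", "price": "MIN"})
print([h["name"] for h in result])  # ['Hotel 1', 'Hotel 2', 'Hotel 6']
```

Under this reading, Hotel 2 eliminates Hotels 3, 4, 5 and 7, since it is no worse on either attribute and strictly better on at least one.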


What is skyline? Skyline Examples

Consider a Hotel table with columns name, address, dist (distance to the beach), stars (quality ranking), and price.


Maximal Vector Problem

A classical, interesting problem since the 1960's
To identify the maximals over a collection of vectors
Tuples ≈ vectors (or points) in k-dim. space
Related to: nearest neighbors, convex hull


Challenges of skyline query processing (not in this paper)

Search efficiency
Update efficiency
Scalability to skyline query variants and various types of data
High dimensionality and large data sets


Related Work (not in this paper)

General Skyline Algorithms
  BNL and D&C, Börzsönyi et al., ICDE'01
  Bitmap and Index, Tan et al., VLDB'01
  NN, Kossmann et al., VLDB'02
  SFS, Chomicki et al., ICDE'03
  BBS, Papadias et al., SIGMOD'03
  LESS, Godfrey et al., VLDB'05
  Static attributes vs. dynamic spatial attributes in SSQ
    SSQ is a dynamic skyline query, M. Sharifzadeh et al., VLDB'06
  Z Order Skyline, Lee et al., VLDB'07
  BBRS (Reverse Skyline), Dellis et al., VLDB'07
  ...

Nearest Neighbor Search
  K-NN
  ...

Computational Geometry
  Voronoi Diagram
  Delaunay Graph
  Convex Hull
  High-dimensional computational geometry

Maximal Vector Problem
  FLET (Fast Linear Expected-Time), J. L. Bentley et al., SODA 1990

Index on Skyline
  Bitmap, B-tree, R-tree, aR-tree
  ...

Spatial Skyline Query (SSQ): find the data points pi that are not spatially dominated by any other point pj with respect to the given query points {q}.


Variations of Skyline Queries (not in this paper)

Constrained skyline (spatial skyline)
Ranked Skyline
Group-by Skyline
Dynamic Skyline or Multi-source Skyline
Enumerating Skyline / Top-K / K-Dominating Skyline
K-Skyband Skyline
Approximate Skyline
Reverse Skyline
Subspace Skyline
SkyCube in subspace
Probabilistic Skylines on Uncertain Data
Privacy Skyline
...


Goals & Accomplishments


Design & Analysis Considerations

Relational Performance Criteria
External
  I/O conscious (too much data for main memory)
  well behaved
  compatible with a query optimizer
  CPU computational load (asymptotic runtime analyses)
Generic (focus on a generic maximal-vector algorithm)
  no indexes, no pre-computed information
Good properties
  progressive, pipelineable, universal, etc.
  at worst, linear run-time (O(n))


Design Choices

divide-and-conquer (D&C) or scan-based
  Can D&C be I/O conscious?
  Can scan-based be efficient?
to sort or not to sort
  Is sorting useful?
  Is sorting too inefficient? (Not linear...)
comparison policy
  Which vectors to compare next?
  How to reduce the number of comparisons?


A Model for Average-Case Analysis

Component Independence (CI)

Uniform Independence (UI)


Expected Number of Maximals
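As a rough guide to the sizes involved, a classical result for points whose coordinates are distinct and statistically independent (the CI setting) gives the expected number of maximals among n vectors in k dimensions as the generalized harmonic number H(k-1, n), where H(0, n) = 1 and H(j, n) is the sum over i = 1..n of H(j-1, i) / i. The sketch below evaluates that recurrence; the function name `expected_maximals` is hypothetical, and the formula may be presented differently in the original slide.

```python
def expected_maximals(n: int, k: int) -> float:
    """Expected number of maximal vectors among n points in k dimensions.

    Assumption: distinct, statistically independent coordinates (CI).
    Evaluates H(k-1, n) with H(0, i) = 1 and H(j, i) = H(j, i-1) + H(j-1, i) / i.
    """
    previous = [1.0] * (n + 1)  # H(0, i) = 1 for every i
    for _ in range(k - 1):
        current = [0.0] * (n + 1)
        for i in range(1, n + 1):
            current[i] = current[i - 1] + previous[i] / i
        previous = current
    return previous[n]


for k in (2, 3, 5):
    print(k, round(expected_maximals(100_000, k), 1))
```

For k = 2 this reduces to the ordinary harmonic number, so the skyline grows only logarithmically with n; the growth with k is much faster, which is why dimensionality matters more than data size in the later analyses.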


Algorithms & Analyses Generic Algorithms


Algorithms & Analyses Generic Algorithms’ Performance


Algorithms & Analyses

Divide-and-Conquer algorithms
  No evidence that an efficient external version can be made
  Although they have good asymptotic complexity in n, the curse of dimensionality is a problem for k
Scan-based algorithms
  Find global maximals early and eliminate non-maximals more quickly.


DD&C: D&C + Sort


LD&C: D&C - Sort


Block Nested Loops (BNL) Algorithm

O(kn) average case under CI
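The core loop of BNL can be sketched as follows. This is a simplified in-memory illustration, not the external algorithm: it assumes the window never overflows, so the temporary-file passes are omitted, and it assumes larger values are preferred on every dimension. The name `bnl_skyline` is hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p dominates q (assumption: larger is better everywhere)."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def bnl_skyline(stream: Iterable[Point]) -> List[Point]:
    """Single-pass BNL with an unbounded window (simplifying assumption)."""
    window: List[Point] = []
    for p in stream:
        # Discard p as soon as some window vector dominates it.
        if any(dominates(w, p) for w in window):
            continue
        # Otherwise p evicts every window vector it dominates, then joins.
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


print(bnl_skyline([(4, 4), (5, 5), (1, 9), (9, 1), (0, 0)]))
# [(5, 5), (1, 9), (9, 1)]
```

Because the input order is arbitrary, a window vector is only a candidate until the scan ends: (4, 4) enters the window first and is evicted when (5, 5) arrives.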


Sort Filter Skyline (SFS) Algorithm

Have a window (W) and stream (S), as with BNL.
Sort S first (via an external sort routine): e.g.,
Then, call improved BNL
Any w in the window is guaranteed to be maximal (skyline).
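A compact way to see why presorting helps is the sketch below. It is an in-memory illustration under two assumptions that are not taken from the slide: larger values are preferred on every dimension, and the sort key is the sum of the coordinates (any monotone scoring function would serve). The name `sfs_skyline` is hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p dominates q (assumption: larger is better everywhere)."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def sfs_skyline(points: List[Point]) -> List[Point]:
    """Sort by a monotone score, then filter with an append-only window."""
    # If p dominates q then sum(p) > sum(q), so every dominator of a vector
    # appears before it in this order (sort key is an assumption).
    ordered = sorted(points, key=sum, reverse=True)
    window: List[Point] = []
    for p in ordered:
        if not any(dominates(w, p) for w in window):
            window.append(p)  # p is maximal and can be output immediately
    return window


print(sfs_skyline([(4, 4), (5, 5), (1, 9), (9, 1), (0, 0)]))
```

Compared with plain BNL, no vector is ever evicted from the window, so each accepted vector can be emitted at once, which is what makes the filter phase progressive.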


BNL vs SFS


BNL & SFS


The LESS Algorithm

Combine best aspects of the algorithms, mainly BNL & SFS.
EF Win (Elimination-Filter window): keep records with the best entropy scores
SF Win (Skyline-Filter window): keep current skyline for further filtering
block-sort pass
last merge pass
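How the two windows cooperate can be sketched in memory as follows. This is only an illustration under stated assumptions, not the authors' implementation: values are taken to lie in [0, 1] with larger preferred, the entropy score is assumed to be the sum of ln(x + 1) over the coordinates, the external sort is replaced by one in-memory sort, and `less_sketch` and `ef_size` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p dominates q (assumption: larger is better everywhere)."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def entropy(p: Sequence[float]) -> float:
    """Assumed monotone score: sum of ln(x + 1), for values in [0, 1]."""
    return sum(math.log(x + 1.0) for x in p)


def less_sketch(points: List[Point], ef_size: int = 2) -> List[Point]:
    # Pass 0 (stands in for the block-sort pass): drop vectors dominated by
    # the elimination-filter window, which keeps the best entropy scores.
    ef_window: List[Point] = []
    survivors: List[Point] = []
    for p in points:
        if any(dominates(w, p) for w in ef_window):
            continue
        survivors.append(p)
        ef_window.append(p)
        ef_window.sort(key=entropy, reverse=True)
        del ef_window[ef_size:]
    # Stands in for the last merge pass: scan in descending entropy order and
    # filter against the skyline-filter window, as in SFS.
    survivors.sort(key=entropy, reverse=True)
    sf_window: List[Point] = []
    for p in survivors:
        if not any(dominates(w, p) for w in sf_window):
            sf_window.append(p)
    return sf_window


data = [(0.05, 0.05), (0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4), (0.8, 0.05)]
print(sorted(less_sketch(data)))
```

The elimination filter only ever discards vectors that are dominated by something already seen, so no maximal vector is lost; it simply shrinks the input that reaches the sort and the final filter.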


LESS: Linear Average-Case Issues & Improvement


LESS: Performance

n = 500,000
EF window: 200 vectors
SF window: 76 pages, 3,000 vectors
Pentium III, 733 MHz
RedHat Linux 7.3


Conclusions

Future Work for Optimization of LESS
