TRANSCRIPT
Maximal Vector Computation in Large Data Sets
The 31st International Conference on Very Large Data Bases (VLDB 2005) / VLDB Journal 2006, August
Parke Godfrey, Jarek Gryz (York University)
Ryan Shipley (The College of William and Mary)
Speaker: ZHANG Shiming (Simon)
Supervisors: Prof. David Cheung, Dr. Nikos Mamoulis
Department of Computer Sciences, The University of Hong Kong
The 31st International Conference on Very Large Data Bases(VLDB05/VLDBJ06 August)223/4/21
Outline
Introduction
Skyline vs. Maximal Vector Problem
Goals & Accomplishments
Design & Analysis Considerations
Generic Algorithms & Analyses
LESS Algorithm & Performance
Conclusions
This presentation is based on the paper but is not limited to it.
What is skyline?
Skyline Query
Given a set of d-dimensional data points, a skyline query finds the set of data points that are not dominated by any other point.
Adversarial skyline query: finds the set of data points that do not dominate any other point (not covered in any paper).
Dominance Relationship
A data point p dominates another data point q if and only if p is better than or as good as q (with respect to the preference) on all dimensions and p is strictly better than q on at least one dimension.
Monotone Preference Function
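The dominance test above can be sketched in a few lines of Python. This is only an illustration: it assumes that smaller values are preferred on every dimension, and the helper names `dominates` and `weighted_sum` are hypothetical. The weighted sum with non-negative weights stands in for one possible monotone preference function; whenever p dominates q, such a function never scores p worse than q.

```python
def dominates(p, q):
    """True if p dominates q, assuming smaller is better on every dimension."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def weighted_sum(p, weights):
    """One example of a monotone preference function (weights must be >= 0)."""
    return sum(w * v for w, v in zip(weights, p))


p, q = (40, 40), (80, 40)
print(dominates(p, q))   # p is no worse everywhere and better on dimension 0
print(dominates(q, p))   # q is worse on dimension 0, so it cannot dominate p
print(weighted_sum(p, (0.5, 0.5)) <= weighted_sum(q, (0.5, 0.5)))
```

Two identical points do not dominate each other, because neither is strictly better on any dimension.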
What is skyline?
SQL Extensions
Find the maximals over tuples in the database context w.r.t. the skyline criteria:
SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ...
SKYLINE OF [DISTINCT] d1 [MIN | MAX | DIFF],
...,
dm [MIN | MAX | DIFF]
ORDER BY ...
What is skyline?
Skyline Examples
Hotel information (price, # of rooms)
Name     # of rooms  Price
Hotel 1  20          70
Hotel 2  40          40
Hotel 3  40          100
Hotel 4  50          70
Hotel 5  60          100
Hotel 6  70          10
Hotel 7  80          40
Skyline of hotels: the interesting hotels are those that are cheap and not too crowded.
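As a worked example, the skyline of the hotel table can be computed with a naive pairwise comparison. The sketch assumes that both the number of rooms and the price are to be minimized (a cheap hotel that is not too crowded); the function names are hypothetical and the quadratic loop is for illustration only, not one of the algorithms discussed in the talk.

```python
HOTELS = [
    ("Hotel 1", 20, 70),
    ("Hotel 2", 40, 40),
    ("Hotel 3", 40, 100),
    ("Hotel 4", 50, 70),
    ("Hotel 5", 60, 100),
    ("Hotel 6", 70, 10),
    ("Hotel 7", 80, 40),
]


def dominates(p, q):
    """True if p dominates q, assuming smaller is better on every dimension."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def naive_skyline(rows):
    """Keep each row whose (rooms, price) vector no other row dominates."""
    return [
        name
        for name, *vec in rows
        if not any(dominates(other, vec) for _, *other in rows)
    ]


print(naive_skyline(HOTELS))  # ['Hotel 1', 'Hotel 2', 'Hotel 6']
```

Hotel 7, for instance, is excluded because Hotel 2 has the same price and fewer rooms.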
What is skyline?
Skyline Examples
Consider a Hotel table with columns name, address, dist (distance to the beach), stars (quality ranking), and price.
Maximal Vector Problem
A classical, interesting problem since the 1960s
To identify the maximals over a collection of vectors
Tuples ≈ vectors (or points) in k-dimensional space
Related to nearest neighbors and convex hull
Challenges of skyline query processing (not in this paper)
Search efficiency
Update efficiency
Scalability to skyline query variants and various types of data
High dimensionality and large data sets
Related Work (not in this paper)
General Skyline Algorithms
  BNL and D&C, Börzsönyi et al., ICDE’01
  Bitmap and Index, Tan et al., VLDB’01
  NN, Kossmann et al., VLDB’02
  SFS, Chomicki et al., ICDE’03
  BBS, Papadias et al., SIGMOD’03
  LESS, Godfrey et al., VLDB’05
  Static attributes vs. dynamic spatial attributes in SSQ
  SSQ is a dynamic skyline query, M. Sharifzadeh et al., VLDB’06
  Z Order Skyline, Ken et al., VLDB’07
  BBRS (Reverse Skyline), Evangelos et al., VLDB’07
  ……
Nearest Neighbor Search
  K-NN
  …
Computational Geometry
  Voronoi Diagram
  Delaunay Graph
  Convex Hull
  High-dimensional computational geometry
Maximal Vector Problem
  FLET (Fast Linear Expected-Time), J.L. Bentley et al., SODA 1990
Index on Skyline
  Bitmap, B-tree, R-tree, aR-tree
….
Spatial Skyline Query (SSQ): find the data points pi that are not spatially dominated by any other point pj with respect to the given query points {q}.
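The SSQ definition can be illustrated with a small Python sketch. It assumes Euclidean distance and the usual reading of spatial dominance: p spatially dominates r if p is at least as close as r to every query point and strictly closer to at least one. The function names are hypothetical, and the quadratic scan is for illustration only.

```python
import math


def spatially_dominates(p, r, queries):
    """True if p is no farther than r from every query point and closer to one."""
    dp = [math.dist(p, q) for q in queries]
    dr = [math.dist(r, q) for q in queries]
    return all(a <= b for a, b in zip(dp, dr)) and any(a < b for a, b in zip(dp, dr))


def spatial_skyline(points, queries):
    """Points that no other point spatially dominates w.r.t. the query points."""
    return [
        p for p in points
        if not any(spatially_dominates(o, p, queries) for o in points)
    ]


queries = [(0, 0), (4, 0)]
points = [(2, 0), (2, 5), (0, 1)]
print(spatial_skyline(points, queries))  # [(2, 0), (0, 1)]
```

Here (2, 5) is farther than (2, 0) from both query points, so it is dropped, while (2, 0) and (0, 1) are each closer to one of the query points.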
Variations of Skyline Queries (not in this paper)
Constrained skyline (spatial skyline)
Ranked Skyline
Group-by Skyline
Dynamic Skyline or Multi-source Skyline
Enumerating Skyline / Top-K / K-Dominating Skyline
K-Skyband Skyline
Approximate Skyline
Reverse Skyline
Subspace Skyline
SkyCube in subspace
Probabilistic Skylines on Uncertain Data
Privacy Skyline
……
Goals & Accomplishments
Design & Analysis Considerations
Relational Performance Criteria
  External: I/O conscious (too much data for main memory)
  Well behaved: compatible with a query optimizer
  CPU: computational load (asymptotic runtime analyses)
  Generic (focus on generic maximal-vector algorithms): no indexes, no pre-computed information
  Good properties: progressive, pipelineable, universal, etc.
  At worst, linear run-time ( O(n) )
Design Choices
divide-and-conquer (D&C) or scan-based
  Can D&C be I/O conscious?
  Can scan-based be efficient?
to sort or not to sort
  Is sorting useful?
  Is sorting too inefficient? (Not linear...)
comparison policy
  Which vectors to compare next?
  How to reduce the number of comparisons?
…
A Model for Average-Case Analysis
Component Independence (CI)
Uniform Independence (UI)
Expected Number of Maximals
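The formula on this slide did not survive in the transcript. As a hedged illustration, a standard result from the literature for n vectors in k dimensions with independent dimensions and no duplicate values is the recurrence m(k, n) = m(k, n-1) + m(k-1, n) / n with m(1, n) = 1, which behaves like (ln n)^(k-1) / (k-1)! for large n. Whether the slide showed exactly this form is an assumption, and the function names below are hypothetical.

```python
import math


def expected_maximals(k, n):
    """Expected number of maximals via m(k, n) = m(k, n-1) + m(k-1, n) / n."""
    prev = [0.0] + [1.0] * n          # m(1, j) = 1 for every j >= 1
    for _ in range(2, k + 1):
        cur = [0.0] * (n + 1)
        for j in range(1, n + 1):
            cur[j] = cur[j - 1] + prev[j] / j
        prev = cur
    return prev[n]


def asymptotic_maximals(k, n):
    """Large-n approximation (ln n)^(k-1) / (k-1)!."""
    return math.log(n) ** (k - 1) / math.factorial(k - 1)


for k in (2, 3, 5):
    print(k, expected_maximals(k, 500_000), asymptotic_maximals(k, 500_000))
```

For k = 2 the recurrence reduces to the harmonic number H_n. For two vectors in three dimensions it gives 1.75, which matches a direct count: each vector dominates the other with probability 1/8, so on average 2 - 1/4 vectors are maximal.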
Algorithms & Analyses Generic Algorithms
Algorithms & Analyses Generic Algorithms’ Performance
Algorithms & Analyses
Divide-and-Conquer algorithms
  No evidence that an efficient external version can be made
  Although their asymptotic complexity in n is good, the curse of dimensionality is a problem for k
Scan-based algorithms
  Find global maximals early and eliminate non-maximals more quickly.
DD&C: D&C + Sort
LD&C: D&C - Sort
Block Nested Loops (BNL) Algorithm
O(kn) average case under CI
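The slide's diagram of BNL is not in the transcript, so the following is a simplified in-memory sketch of the multi-pass idea rather than the published algorithm. It assumes smaller values are preferred on every dimension and `window_size >= 1`. Instead of the timestamp bookkeeping used with a temporary file, window entries that were inserted after the first overflow are simply re-queued for the next pass; that substitution is this sketch's own simplification.

```python
def dominates(p, q):
    """True if p dominates q, assuming smaller is better on every dimension."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def bnl_skyline(points, window_size=2):
    """Multi-pass block-nested-loops sketch with a bounded window."""
    skyline = []
    stream = list(points)
    while stream:
        window = []    # (point, inserted_before_first_overflow)
        overflow = []  # stands in for the temporary file
        for p in stream:
            if any(dominates(w, p) for w, _ in window):
                continue  # p is dominated: discard it
            # p survives: evict window entries that p dominates
            window = [(w, early) for w, early in window if not dominates(p, w)]
            if len(window) < window_size:
                window.append((p, not overflow))
            else:
                overflow.append(p)
        # Entries inserted before any overflow were compared with every
        # surviving point of this pass, so they are confirmed maximals.
        skyline.extend(w for w, early in window if early)
        stream = [w for w, early in window if not early] + overflow
    return skyline


hotels = [(20, 70), (40, 40), (40, 100), (50, 70), (60, 100), (70, 10), (80, 40)]
print(bnl_skyline(hotels, window_size=2))  # [(20, 70), (40, 40), (70, 10)]
```

With a window of two entries, the first pass confirms (20, 70) and (40, 40) and pushes (70, 10) to the overflow list; a second pass confirms it.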
Sort Filter Skyline (SFS) Algorithm
Have a window (W) and stream (S), as with BNL.
Sort S first (via an external sort routine): e.g.,
Then, call improved BNL.
Any w in the window is guaranteed to be maximal (skyline).
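A minimal in-memory sketch of the sort-then-filter idea follows. It assumes smaller values are preferred on every dimension and uses the sum of the coordinates as the sort key; that key is this sketch's choice of a monotone ordering, not necessarily the one on the slide, and an in-memory `sorted` call stands in for the external sort. Because a dominating vector always has a strictly smaller sum, no later vector can dominate an earlier one, so every vector that enters the window stays there.

```python
def dominates(p, q):
    """True if p dominates q, assuming smaller is better on every dimension."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def sfs_skyline(points):
    """Sort by a monotone score, then filter against the window."""
    ordered = sorted(points, key=sum)   # stand-in for the external sort
    window = []                         # holds confirmed maximals only
    for p in ordered:
        if not any(dominates(w, p) for w in window):
            window.append(p)            # never evicted afterwards
    return window


hotels = [(20, 70), (40, 40), (40, 100), (50, 70), (60, 100), (70, 10), (80, 40)]
print(sfs_skyline(hotels))  # [(40, 40), (70, 10), (20, 70)]
```

Unlike BNL, the filter pass needs no eviction step, which is why window entries can be emitted as soon as they are inserted.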
BNL vs SFS
BNL & SFS
The LESS Algorithm
Combines the best aspects of the algorithms, mainly BNL & SFS.
EF window (Elimination-Filter): keeps records with the best entropy scores
SF window (Skyline-Filter): keeps the current skyline for further filtering
block-sort pass
last merge pass
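The two windows and two passes can be sketched in memory as follows. This is an illustration under several assumptions, not the authors' implementation: smaller non-negative values are preferred on every dimension, the entropy score is taken as the sum of ln(1 + v) over the coordinates, the EF window holds copies of the best-scoring records seen so far, sorted in-memory blocks stand in for external sort runs, and the SF window is unbounded. The names `less_skyline`, `block_size` and `ef_size` are hypothetical, and `ef_size` must be at least 1.

```python
import heapq
import math


def dominates(p, q):
    """True if p dominates q, assuming smaller is better on every dimension."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def entropy(p):
    """Monotone score; assumes non-negative coordinates."""
    return sum(math.log(1.0 + v) for v in p)


def less_skyline(points, block_size=4, ef_size=2):
    ef, runs, block = [], [], []          # ef holds (score, point) copies
    # Block-sort pass: filter against the EF window while building sorted runs.
    for p in points:
        if any(dominates(e, p) for _, e in ef):
            continue                      # eliminated early
        s = entropy(p)
        ef = [(es, e) for es, e in ef if not dominates(p, e)]
        if len(ef) < ef_size:
            ef.append((s, p))
        else:
            worst = max(range(len(ef)), key=lambda i: ef[i][0])
            if s < ef[worst][0]:
                ef[worst] = (s, p)        # keep only the best scores
        block.append((s, p))
        if len(block) == block_size:
            runs.append(sorted(block))
            block = []
    if block:
        runs.append(sorted(block))
    # Last merge pass: merge runs in score order and apply the skyline filter.
    sf = []
    for _, p in heapq.merge(*runs):
        if not any(dominates(w, p) for w in sf):
            sf.append(p)
    return sf


hotels = [(20, 70), (40, 40), (40, 100), (50, 70), (60, 100), (70, 10), (80, 40)]
print(sorted(less_skyline(hotels)))  # [(20, 70), (40, 40), (70, 10)]
```

On the hotel data, four of the seven records are already dropped by the EF window during the first pass, so only three reach the merge.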
LESS: Linear Average-Case Issues & Improvement
LESS: Performance
n = 500,000
EF window: 200 vectors
SF window: 76 pages, 3,000 vectors
Pentium III, 733 MHz
RedHat Linux 7.3
Conclusions
Future Work for Optimization of LESS