preference queries from olap and data mining perspective jian pei 1, yufei tao 2, jiawei han 3 1...
Post on 19-Dec-2015
215 views
TRANSCRIPT
![Page 1: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/1.jpg)
Preference Queries from OLAP and Data Mining Perspective
Jian Pei1, Yufei Tao2, Jiawei Han3
1Simon Fraser University, Canada, [email protected] Chinese University of Hong Kong, China, [email protected]
3University of Illinois at Urbana Champaign, USA, [email protected]
![Page 2: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/2.jpg)
Outline
Preference queries from the traditional perspective Ranking queries and the TA algorithm Skyline queries and algorithms Variations of preference queries
Preference queries from the OLAP perspective Ranking with multidimensional selections Ranking aggregate queries in data cubes Multidimensional skyline analysis
Preference queries and preference mining Online skyline analysis with dynamic preferences Learning user preferences from superior and interior examples
Conclusions
2
![Page 3: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/3.jpg)
3
Top-k search
Given a set of d-dimensional points, find the k points that minimize a preference function.
Example 1: FLAT(price, size). Find the 10 flats that minimize price + 1000 / size.
Example 2: MOVIE(YahooRating, AmazonRating, GoogleRating). Find the 10 movies that maximize
YahooRating + AmazonRating + GoogleRating.
![Page 4: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/4.jpg)
4
Geometric interpretation
Find the point that minimizes x + y.
x
y
O
p1
p2
p3
p4
p5
p6
p7
p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 5: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/5.jpg)
5
Algorithms
Too many.
We will cover a few representatives: Threshold algorithms.
[Fagin et al. PODS 01]
Multi-dimensional indexes. [Tsaparas et al. ICDE 03, Tao et al. IS 07]
Layered indexes. [Chang et al. SIGMOD 00, Xin et al. VLDB 06]
![Page 6: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/6.jpg)
6
No random access (NRA) [Fagin et al. PODS 01]
Find the point minimizing x + y.
x y
ascending
At this time, there is a chance we are able to tell that the
blue point is definitely better than the yellow point.
![Page 7: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/7.jpg)
7
Optimality
Worst case: Need to access everything.
But NRA is instance optimal.
If the optimal algorithm performs s sequentialaccesses, then NRA performs O(s).
The hidden constant is a function of d and k. [Fagin et al. PODS 01]
Computation time per access? The state of the art approach: O(logk + 2d).
[Mamoulis et al. TODS 07]
x y
![Page 8: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/8.jpg)
8
Threshold algorithm (TA) [Fagin et al. PODS 01]
Similar to NRA but use random accesses to calculate an object score as early as possible.
x y
ascending
For any object we haven’t seen, we know a lower
bound of its score.
![Page 9: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/9.jpg)
9
Optimality
TA is also instance optimal.
If the optimal algorithm performs s sequential accesses and r random accesses, then NRA accesses O(s) sequential accesses and O(r) random accesses.
The hidden constants are functions of d and k. [Fagin et al. PODS 01]
![Page 10: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/10.jpg)
10
Top-1 = Nearest neighbor
Find the point that minimizes x + y. Equivalently, find the nearest neighbor of the origin under the L1
norm.
x
y
O
p1
p2
p3
p4
p5
p6
p7
p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 11: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/11.jpg)
11
R-tree
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 12: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/12.jpg)
12
R-tree
Find the point that minimizes the score x + y.
x
y
O 0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2defines the score
lower bound
![Page 13: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/13.jpg)
13
R-tree [Tsaparas et al. ICDE 03, Tao et al. IS 07]
Always go for the node with the smallest lower bound.
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 14: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/14.jpg)
14
R-tree
Always go for the node with the smallest lower bound.
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 15: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/15.jpg)
15
R-tree
Always go for the node with the smallest lower bound.
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 16: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/16.jpg)
16
Optimality
Worst case: Need to access all nodes.
But the algorithm we used is “R-tree optimal”. No algorithm can visit fewer nodes of the same tree.
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 17: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/17.jpg)
17
Layered index 1: Onion [Chang et al. SIGMOD 02]
The top-1 object of any linear preference function c1 x + c2 y must be on the convex hull, regardless of c1 and c2.
Due to symmetry, next we will focus on positive c1 and c2.
x
y
O
p1
p2
p3
p4
p5
p6
p7
p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 18: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/18.jpg)
18
Onion
Similarly, the top-k objects must exist in the first k layers of convex hulls.
x
y
O
p1
p2
p3
p4
p5
p6
p7
p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 19: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/19.jpg)
19
x
y
O
p1
p2
p3
p4
p5
p6
p7
p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
Onion
Each layer in Onion may contain unnecessary points. In fact, p6 cannot be the top-2 object of any linear preference
function.
x
y
O
p1
p2
p3
p4
p5
p6
p7
p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 20: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/20.jpg)
20
Optimal layering [Xin et al. VLDB 06]
What is the smallest k such that p6 is in the top-k result of some linear preference function?
The question can be answered in O(nlogn) time.
x
y
O
p1
p2
p3
p4
p5
p6
p7
p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
The answer is 3.
It suffices to put p6 in the 3rd layer.
![Page 21: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/21.jpg)
21
Other works
Many great works, including the following and many others.
PREFER [Hristidis et al. SIGMOD 2001]
Ad-hoc preference functions [Xin et al. SIGMOD 2007]
Top-k join[Ilyas et al. VLDB 2003]
Probabilistic top-k [Soliman et al. ICDE 2007]
![Page 22: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/22.jpg)
Outline
Preference queries from the traditional perspective Ranking queries and the TA algorithm Skyline queries and algorithms Variations of preference queries
Preference queries from the OLAP perspective Ranking with multidimensional selections Ranking aggregate queries in data cubes Multidimensional skyline analysis
Preference queries and preference mining Online skyline analysis with dynamic preferences Learning user preferences from superior and interior examples
Conclusions
22
![Page 23: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/23.jpg)
23
Drawback of top-k
Top-k search requires a concrete preference function.
Example 1 (revisited): FLAT(price, size). Find the flat that minimizes price + 1000 / size.
Why not price + 2000 / size? Why does it even have to be linear?
The skyline is useful in scenarios like this where a good preference function is difficult to set.
![Page 24: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/24.jpg)
24
Dominance
p1 dominates p2.
Hence, p1 has a smaller score under any monotone preference function f(x, y). f(x, y) is monotone if it increases with both x and y.
x
y
p1
p2
![Page 25: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/25.jpg)
25
Skyline
x
y
p1
p2
p3
p4
p5
p6
p7
p8
The skyline contains points that are not dominated by others.
![Page 26: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/26.jpg)
26
Skyline vs. convex hull
x
y
p1
p2
p3
p4
p5
p6
p7
p8
x
y
p1
p2
p3
p4
p5
p6 p7
p8
Contains the top-1 object of any monotone function.
Contains the top-1 object of any linear function.
![Page 27: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/27.jpg)
27
Algorithms
Easy to do O(n2).
Many attempts to make it faster.
We will cover a few representatives: Optimal algorithms in 2D and 3D.
[Kung et al. JACM 75] Scan-based.
[Chomicki et al. ICDE 03, Godfrey et al. VLDB 05] Multi-dimensional indexes
[Kossmann et al. VLDB 02, Papadias et al. SIGMOD 04]
Subspace skylines[Tao et al. ICDE 06]
![Page 28: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/28.jpg)
28
Lower bound [Kung et al. JACM 75]
(nlogn)
x
y
![Page 29: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/29.jpg)
29
2D
If not dominated, add to the skyline. Dominance check in O(1) time.
x
yplane sweep
![Page 30: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/30.jpg)
30
3D
If not dominated, add to the skyline. Dominance check in O(logn) time using a binary tree.
z
y
p
![Page 31: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/31.jpg)
31
Dimensionality over 3
O(nlogd-2n)
Kung et al. JACM 75
![Page 32: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/32.jpg)
32
Scan-based algorithms
Sort-first skyline (SFS)[Chomicki et al. ICDE 03]
Linear elimination sort for skyline (LESS).[Godfrey et al. VLDB 05]
d
i
ip1
1ln
![Page 33: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/33.jpg)
33
Skyline retrieval by NN search [Kossmann et al. VLDB 02]
z
y
p1
I
II
III
IV
z
y
p2
p3
![Page 34: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/34.jpg)
34
Branch-and-bound skyline (BBS) [Papadias et al. SIGMOD 04]
Always visits the next MBR closest to the origin, unless the MBR is dominated.
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 35: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/35.jpg)
35
Branch-and-bound skyline (BBS)
Always visits the next MBR closest to the origin, unless the MBR is dominated.
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 36: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/36.jpg)
36
Branch-and-bound skyline (BBS)
Always visits the next MBR closest to the origin, unless the MBR is dominated.
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 37: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/37.jpg)
37
Branch-and-bound skyline (BBS)
Always visits the next MBR closest to the origin, unless the MBR is dominated.
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 38: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/38.jpg)
38
Branch-and-bound skyline (BBS)
Always visits the next MBR closest to the origin, unless the MBR is dominated.
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
dominated
![Page 39: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/39.jpg)
39
Branch-and-bound skyline (BBS)
Always visits the next MBR closest to the origin, unless the MBR is dominated.
x
y
O
p1
p2p3
p4
p5
p6
p7
p8
p1 p2 p3 p4 p5 p6 p7 p8
0.2 0.4 0.6 0.8 1
1
0.8
0.6
0.4
0.2
![Page 40: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/40.jpg)
40
Optimality
BBS is R-tree optimal.
x
y
p1
p2
p3
p4
p5
p6
p7
p8
![Page 41: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/41.jpg)
Outline
Preference queries from the traditional perspective Ranking queries and the TA algorithm Skyline queries and algorithms Variations of preference queries
Preference queries from the OLAP perspective Ranking with multidimensional selections Ranking aggregate queries in data cubes Multidimensional skyline analysis
Preference queries and preference mining Online skyline analysis with dynamic preferences Learning user preferences from superior and interior examples
Conclusions
41
![Page 42: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/42.jpg)
42
Skyline in subspaces [Tao et al. ICDE 06]
PROPERTY price size distance to the nearest super market distance to the nearest railway station air quality noise level security
Need to be able to efficiently find the skyline in any subspace.
![Page 43: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/43.jpg)
43
Skyline in subspaces
Non-indexed methods Still work but need to access the entire database.
R-tree Dimensionality curse.
![Page 44: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/44.jpg)
44
SUBSKY [Tao et al. ICDE 06]
Say all dimensions have domain [0, 1].
Maximal corner: The point having coordinate 1 on all dimensions.
Sort all data points in descending order of their L distances to the maximal corner.
To find the skyline of any subspace: Scan the sorted order and stop when a condition holds.
![Page 45: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/45.jpg)
45
Stopping condition
![Page 46: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/46.jpg)
46
Skylines have risen everywhere
Many great works, including the following and many others.
Spatial skyline[Sharifzaden and Shahabi VLDB 06]
k-dominant skyline[Chan et al. SIGMOD 06]
Reverse skyline[Dellis and Seeger VLDB 07]
Probabilistic skyline[Jian et al. VLDB 07]
![Page 47: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/47.jpg)
Outline
Preference queries from the traditional perspective Ranking queries and the TA algorithm Skyline queries and algorithms Variations of preference queries
Preference queries from the OLAP perspective Ranking with multidimensional selections Ranking aggregate queries in data cubes Multidimensional skyline analysis
Preference queries and preference mining Online skyline analysis with dynamic preferences Learning user preferences from superior and interior examples
Conclusions
47
![Page 48: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/48.jpg)
Review: Ranking Queries
Consider an online accommodation database Number of bedrooms Size City Year built Furnished or not
select top 10 * from R where city = “Shanghai” and Furnished = “Yes” order by price / size asc
select top 5 * from R where city = “Vancouver” and num_bedrooms >= 2 order by (size + (year – 1960) * 15)2 – price2 desc
48
![Page 49: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/49.jpg)
Multidimensional Selections and Ranking
Different users may ask different ranking queries Different selection criteria Different ranking functions Selection criteria and ranking functions may be dynamic –
available when queries arrive Optimizing for only one ranking function or the whole table is not
good enough Challenge: how to efficiently process ranking queries with dynamic
selection criteria and ranking functions? Selection first approaches: select data satisfying the selection
criteria, then sort them according to the ranking function Ranking first approaches: progressively search data by the ranking
function, then verify the selection criteria on each top-k candidate
49
![Page 50: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/50.jpg)
Traditional Approaches
50
tid City BR Price Sq feett1 SEA 1 500 600t2 CLE 2 700 800t3 SEA 1 800 900t4 CLE 3 1000 1000t5 LA 1 1100 200
t6 LA 2 1200 500t7 LA 2 1200 560t8 CLE 3 1350 1120
tid City BR Price Sq feett7 LA 2 1200 560
t5 LA 1 1100 200
t6 LA 2 1200 500
Selection first approaches
tid City BR Price Sq feet f (10^4)t1 SEA 1 500 600 29t2 CLE 2 700 800 9t3 SEA 1 800 900 5t4 CLE 3 1000 1000 4t5 LA 1 1100 200 37
t6 LA 2 1200 500 13t7 LA 2 1200 560 9.76t8 CLE 3 1350 1120 22.49
Ranking first approaches
![Page 51: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/51.jpg)
Ranking Cubes – Principle
Selection criteria and ranking functions Selection dimensions: the attributes used to select data Ranking dimensions: the attributes used to define ranking
functions General principle
Build a ranking cube on the selection dimensions – multidimensional selection can be handled by the cube structure
The measure in each cell should have rank-aware properties – top-k queries with ad hoc ranking functions can be answered efficiently
Challenges Creating a data partition for each selection condition is not
scalable We cannot know every ranking function beforehand
51
![Page 52: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/52.jpg)
Data Cube
52
SELECT Model, Year, Color, SUM(sales) AS Sales FROM Sales
WHERE Model in {'Ford', 'Chevy'} AND Year BETWEEN 1990 AND 1992 GROUP BY CUBE(Model, Year, Color);
![Page 53: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/53.jpg)
Ranking-Cube: the Framework
Step 1: Partition data by Ranking Dimensions Step 2: Assign each data object a Block ID Step 3: Group data by Selection Dimensions Step 4: Compute a measure for each group
High-level: which blocks contain data Low-level: which data entries are in those blocks
53
![Page 54: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/54.jpg)
54
Materialize Ranking-Cube
tid City BR Price Sq feet Block IDt1 SEA 1 500 600 5t2 CLE 2 700 800 5t3 SEA 1 800 900 2t4 CLE 3 1000 1000 6t5 LA 1 1100 200 15
t6 LA 2 1200 500 11t7 LA 2 1200 560 11t8 CLE 3 1350 1120 4
Step 1: Partition Data on Ranking Dimensions
Step 3: Group data bySelection Dimensions
City
BR
City & BR
3 421
CLE
LA
SEA
Step 2: Assign Block ID
Step 4: Compute Measures for each group
For the cell (LA)
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
High-level: {11, 15}Low-level: {11: t6, t7; 15: t5}
![Page 55: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/55.jpg)
55
Query Processing
Select top 10 from Apartmentwhere city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
800
1000
Point with the best ranking score
Without ranking-cube: start search from here
800
1000
Point with the best ranking score
With ranking-cube: start search from here
Measure for LA: {11, 15}
{11: t6,t7; 15:t5}
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
![Page 56: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/56.jpg)
Variations of Ranking-Cube
Different partition methods Grid Partition Hierarchical Partition
Various coding scheme for measures ID lists Bit-map encoding
56
![Page 57: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/57.jpg)
57
Hierarchical Partition
R-tree Partition [Guttman 1984]
Partition data into hierarchically nested
blocks
Each block corresponds to a node in R-tree
tid Price Sq feett1 500 600t2 700 800t3 800 900t4 1000 1000t5 1100 200
t6 1200 500t7 1200 560t8 1350 1120
![Page 58: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/58.jpg)
58
Materialize Ranking-Cube
tid City BR Price Sq feet BIDt1 SEA 1 500 600 N3, N1t2 CLE 2 700 800 N3, N1t3 SEA 1 800 900 N3,N1t4 CLE 3 1000 1000 N4,N1t5 LA 1 1100 200 N5,N2
t6 LA 2 1200 500 N5,N2t7 LA 2 1200 560 N6,N2t8 CLE 3 1350 1120 N6,N2
Step 1: Partition Data on Ranking Dimensions
Step 3: Group data bySelection Dimensions
City
BR
City & BR
3 421
CLE
LA
SEA
Step 2: Assign Block ID
Step 4: Compute Measure
For the cell (LA):Binary description1: data residence
0: no data
![Page 59: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/59.jpg)
59
Prune Search Space
Select top 10 from Apartmentwhere city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
Measure for (LA)
Pruned by Ranking-Cube
W/O Ranking-Cube: Search over the whole R-tree
W/ Ranking-Cube: Search over the right sub-tree
![Page 60: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/60.jpg)
60
Branch-and-Bound Search
Select top 1 from Apartmentwhere city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
F=[price-1000]^2+[sq feet – 800]^2 F(ROOT)=0
1100
1200
500
560
F(N2)=10,000
F(N5)=100,000
F(N6)=97,600
F(t7)=97,600, done!
Pruned by Boolean Selections
Pruned by Ranking Criterion
![Page 61: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/61.jpg)
Outline
Preference queries from the traditional perspective Ranking queries and the TA algorithm Skyline queries and algorithms Variations of preference queries
Preference queries from the OLAP perspective Ranking with multidimensional selections Ranking aggregate queries in data cubes Multidimensional skyline analysis
Preference queries and preference mining Online skyline analysis with dynamic preferences Learning user preferences from superior and interior examples
Conclusions
61
![Page 62: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/62.jpg)
Ranking on Multi-dimensional Aggregation
62
Car Sales Database (S)
ID Time Location
Type Sales
1 2007 Chicago Sedan 13
2 2007 Vancouver
Pickup 10
3 2008 Vancouver
SUV 37
4 2008 Vancouver
Sedan 20
5 2007 Chicago SUV 12
… … … …
Example Top-k Query
SELECT Time, Location, SUM(Sales)
FROM S
GROUP BY Time, Location
ORDER BY SUM(Sales) desc
LIMIT 2
Query Results
Cell (2008, Vancouver): 57
Cell (2007, Chicago): 25
![Page 63: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/63.jpg)
A Naïve Solution and Challenges
Materializing a data cube A ranking aggregate query finds the top-k group-bys
Challenge: the number of group-bys is exponential with respect to the number of attributes In a table of many attributes, it may be infeasible to materialize a
data cube
![Page 64: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/64.jpg)
Finding the top-1 US City in PopulationHeuristically, the states with large population should be searched first
![Page 65: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/65.jpg)
Pruning
Once New York City in New York state is seen which has 8 million people, the cities in 39 states whose population in the whole state is less than 8 million can be pruned
65
California 36M
Virginia 7M
Texas 23M
Washington
6M
New York 19M
Massachusetts
6M
Florida 18M
Indiana 6M
Illinois 12M
Arizona 6M
Pennsylvania
12M
Tennessee 6M
Ohio 11M
Missouri 5M
Michigan 10M
Maryland 5M
Georgia 9M Wisconsin 5M
N. Carolina
9M Minnesota 5M
New Jersey
8M 29 more… <5M
P
R
U
N
E
D
![Page 66: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/66.jpg)
Aggregate Ranking Cube (ARC)
A partially materialization approach Guiding cuboids store high-level statistics to guide the ranking query
processing Example: storing state population to help searching for city
population Supporting cuboids store inverted index to support efficient online
aggregation Aggregate functions
Monotonic: SUM, COUNT, MAX, … Non-monotonic: AVG, RANGE, …
66
![Page 67: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/67.jpg)
ARC Example
67
Base table
Guiding cuboids
Supporting cuboids
![Page 68: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/68.jpg)
Query Answering Example
QueryTop-1Group-by (A,B)Measure: SUM
68
![Page 69: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/69.jpg)
Step-0
Idea: use two guiding cuboids (A) and (B) to answer query in cuboid (A,B)
Sorted lists are generated by scanning and sorting the materialized guiding cuboids
69
A SUM
a1 123
a3 120
a2 68
B SUM
b2 157
b1 154
Sorted List A Sorted List B
A guiding cell157: aggregate-bound
![Page 70: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/70.jpg)
70
Step-1
Generate the first candidate on group-by (A,B): (a1, b2) Intuition: likely to have large SUM
A SUM
a1 123
a3 120
a2 68
B SUM
b2 157
b1 154
Sorted List A
Sorted List B
![Page 71: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/71.jpg)
71
Step-2
Verify candidate (a1, b2)
Using supporting cuboids TID-list intersection
SUM (a1, b2) = t2:10+t3:50 = 60
![Page 72: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/72.jpg)
72
Step-3
Update sorted lists We’ve already known SUM(a1, b2)=60 Thus we can infer SUM(a1, bj)=123-60 for j<>2
• And SUM(ai, b2)=157-60 for i<>1
A SUM
a1 123-60=63
a3 120
a2 68
B SUM
b2 157-60=97
b1 154
Sorted List A Sorted List B
![Page 73: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/73.jpg)
73
Aggregate Bound
A guiding cell’s aggregate-bound in a sorted list is the largest aggregate a combined candidate cell could achieve (i.e., upper-bound) Example: (a3,*)<=120, (*, b2)<=97
A SUM
a3 120
a2 68
a1 63
B SUM
b1 154
b2 97Sorted List A Sorted List B
![Page 74: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/74.jpg)
74
Step-4
Repeat candidate generation and verification Another candidate: SUM(a3, b1) = 75
A SUM
a3 120
a2 68
a1 63
B SUM
b1 154
b2 97
Sorted List A Sorted List B
![Page 75: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/75.jpg)
75
Step-5
Update SUM(a3, b1) = 75
A SUM
a3 120-75=45
a2 68
a1 63
B SUM
b1 154-75=79
b2 97
Sorted List A Sorted List B
![Page 76: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/76.jpg)
76
Done!
Candidates seen so far (a1, b2):60, (a3, b1):75 Unseen ones: <75. No more candidate! So, (a3, b1):75 is the final top-1 answer
A SUM
a2 68 (pruned)
a1 63 (pruned)
a3 45 (pruned)
B SUM
b2 97
b1 79
Sorted List A Sorted List B
![Page 77: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/77.jpg)
Outline
Preference queries from the traditional perspective Ranking queries and the TA algorithm Skyline queries and algorithms Variations of preference queries
Preference queries from the OLAP perspective Ranking with multidimensional selections Ranking aggregate queries in data cubes Multidimensional skyline analysis
Preference queries and preference mining Online skyline analysis with dynamic preferences Learning user preferences from superior and interior examples
Conclusions
77
![Page 78: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/78.jpg)
Domination and Skyline
A set of objects S in an n-dimensional space D=(D1, …, Dn)
For u, vS, u dominates v if u is better than v in one dimension, and u is not worse than v in any other dimensions For illustration in this talk, the smaller the better
u S is a skyline object if u is not dominated by any other objects in S
78
![Page 79: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/79.jpg)
Full Space Skyline Is Not Enough!
Skylines in subspaces If one does not care about the number of stops, how can we
derive the superior trade-offs between price and travel-time from the full space skyline?
Sky cube – computing skylines in all non-empty subspaces (Yuan et al., VLDB’05) Any subspace skyline queries can be answered (efficiently)
79
![Page 80: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/80.jpg)
Sky Cube
80
![Page 81: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/81.jpg)
Understanding Skylines
Understanding skyline objects Both Wilt Chamberlain and Michael Jordan are in the full space
skyline of the Great NBA Players, which merits, respectively, really make them outstanding?
How are they different? Finding the decisive subspaces – the minimal combinations of
factors that determine the (subspace) skyline membership of an object? Total rebounds for Chamberlain, (total points, total rebounds,
total assists) and (games played, total points, total assists) for Jordan
81
![Page 82: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/82.jpg)
Redundancy in Sky Cube
82
Does it just happen that skylines in multiple subspaces are identical?
![Page 83: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/83.jpg)
Are Subspace Skylines Monotonic?
Is subspace skyline membership monotonic? x is in the skylines in spaces ABCD and A, but it is not in the
skyline in ABD – it is dominated by y in ABD x and y collapse in AD, x and y are in the skylines of both A and D
83
![Page 84: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/84.jpg)
Coincident Objects
Coincidence: two objects taking the same value on one attribute Suppose there are no coincident objects, if an object is in the
skyline of space B, then it is in the skyline of every superspace of B Then, why do we care coincident objects?
Coincident objects exist in large data sets (Subspace) skyline band: find all objects which are at most of
distance from a skyline point
84
![Page 85: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/85.jpg)
Coincident Groups
(G, B) is a coincident group (c-group) if all objects in G share the same values on all dimensions in B GB is the projection
A c-group (G, B) is maximal if no any further objects or dimensions can be added into the group Example: (xy, AD)
85
![Page 86: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/86.jpg)
Skyline Groups
A maximal c-group (G, B) is a skyline group if GB is in the subspace skyline of B
How to characterize the subspaces where GB is in the skyline? (x, ABCD) is a skyline group If the set of subspaces are convex, we can use
bounds
86
![Page 87: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/87.jpg)
Decisive Subspaces
A space CB is decisive ifGC is in the subspace skyline of C
No any other objects share the same values with objects in G on C
C is minimal – no C’C has the above two properties
(x, ABCD) is a skyline group, AC, CD are decisive
87
![Page 88: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/88.jpg)
Semantics
In which subspaces an object or a group of objects are in the skyline?
For skyline group (G, B), if C is decisive, then G is in the skyline of any subspace C’ where CC’B
Signature of skyline group Sig(G, B)=(GB, C1, …, Ck) where C1, …, Ck are all decisive subspaces
88
![Page 89: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/89.jpg)
OLAP Analysis on Skylines
Subspace skylines Relationships between skylines in subspaces Closure information
89
![Page 90: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/90.jpg)
Full Space vs. Subspace Skylines
For any skyline group (G, B), there exists at least one object uG such that u is in the full space skylineCan use u as the representative of the group
An object not in the full skyline can be in some subspace skyline only if it collapses to some full space skyline objects in the subspaceAll objects not in the full space skyline and not
collapsing to any full space skyline object can be removed from skyline analysis
If only the projections are concerned, only the full space skyline objects are sufficient for skyline analysis
90
![Page 91: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/91.jpg)
Subspace Skyline Computation
Compute the set of skyline groups and their signatures NP-hard: reduction from the frequent closed itemset problem
Find skyline groups and their decisive subspaces in the full space The seed lattice
Extend the seed lattice to compute all skyline groups in all subspaces Seeds: skyline points in the full space
91
![Page 92: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/92.jpg)
Seed Lattice
92
Seed lattice
![Page 93: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/93.jpg)
Outline
Preference queries from the traditional perspective Ranking queries and the TA algorithm Skyline queries and algorithms Variations of preference queries
Preference queries from the OLAP perspective Ranking with multidimensional selections Ranking aggregate queries in data cubes Multidimensional skyline analysis
Preference queries and preference mining Online skyline analysis with dynamic preferences Learning user preferences from superior and interior examples
Conclusions
93
![Page 94: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/94.jpg)
Preferences, Skylines, and Recommendations
94
![Page 95: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/95.jpg)
Favorable Facet Mining
A set of points in a multidimensional space Attributes
Fully ordered attributes: the preference orders are fixed, e.g., price, star-level, and quality
(Categorical) Partially ordered attributes: the preference orders are not fully determined,
• Examples: airlines, hotel groups, and property types• Some templates may apply, e.g., single houses > semi-
detached houses When a user preference presents, what are the skyline points? Favorable facets of a point p: the partial orders that make p in the
skyline A point p is in the skyline with respect to a user preference if the
preference is a favorable facet of the p
95
![Page 96: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/96.jpg)
Monotonicity of Partial Orders
If p is not in the skyline with respect to partial R, p is not in the skyline with any partial order stronger than R
96
![Page 97: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/97.jpg)
Minimal Disqualifying Conditions
For a point p, a most general partial order that disqualifies p in the skyline is a minimal disqualifying condition (MDC)
Any partial orders stronger than an MDC cannot make p in the skyline
How to compute MDC’s efficiently? MDC-O: Computing MDC On-the-fly, not storing MDCs of points MDC-M: A Materialization Method, storing MDCs of all points
97
![Page 98: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/98.jpg)
98
Algorithm Framework
Given data point p
Variable MDC(p): minimal disqualifying condition
Algorithm MDC(p) For each data point q which quasi-dominates p
• if MDC(p) does not contain Rqp
– insert Rqp to MDC(p) Return MDC(p)
Point q is said to quasi-dominate point p ifall attributes of point q are NOT worse than
those of point p.
![Page 99: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/99.jpg)
Skyline Warehouse on Preferences
Materializing all MCDs and pre-compute skylines Using an Implicit Preference Order tree (IPO-tree) index
Can online answer skyline queries with respect to any user preferences
99
![Page 100: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/100.jpg)
Outline
Preference queries from the traditional perspective Ranking queries and the TA algorithm Skyline queries and algorithms Variations of preference queries
Preference queries from the OLAP perspective Ranking with multidimensional selections Ranking aggregate queries in data cubes Multidimensional skyline analysis
Preference queries and preference mining Online skyline analysis with dynamic preferences Learning user preferences from superior and interior examples
Conclusions
100
![Page 101: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/101.jpg)
Mining Preferences from Examples
How would a realtor recommend realties to customers? A customer’s preference depends on many factors: price,
location, style, lot size, # bedrooms, year, developer, … It is hard for a customer to specify preferences on every factor
What does a smart realtor do? Presenting to a customer a small number of examples – some
realties available on the market A customer may selectively label some superior and inferior
examples Superior examples: not dominated by any other examples in the
given set – skyline points Inferior examples: dominated by some other examples in the
given set – non-skyline points
101
![Page 102: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/102.jpg)
Satisfying Preference Sets
Preference mining problem: given a set O of points in a multidimensional space (D1, …, Dn), a set S O of superior examples and a set Q O of inferior examples (S Q = ), find partial orders R on attributes D1, …, Dn such that every point in S is a skyline point and every point in Q is not a skyline point R is called a satisfying preference set (SPS)
In general, given a set of superior and inferior examples, there may be no SPS, one SPS, or multiple SPSs the SPS existence problem
The SPS existence problem is NP-complete, even when there is only one undetermined attribute
Any polynomial time approximation algorithm cannot guarantee to find a SPS when a SPS exists
102
![Page 103: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/103.jpg)
Minimal Satisfying Preference Sets
If multiple SPSs exist, the simplest one – the weakest partial order – is preferred Occam’s razor (aka the principle of parsimony): “One should not
increase, beyond what is necessary, the number of entities required to explain anything”
R is minimal if there does not exist another SPS weaker than R The minimal SPS problem is NP-hard Any polynomial time approximation algorithm cannot guarantee the
minimality of the SPSs found
103
![Page 104: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/104.jpg)
A Greedy Method
A term-based method Iteratively adding a term (x < y on a dimension Di) until all inferior
examples are satisfied An inferior example may need multiple terms – greedily adding the
term that helps to satisfy as many unsolved inferior examples as possible
A condition-based method Iteratively adding a condition which at least satisfies one inferior
example Greedily adding the condition that satisfies as many unsolved
inferior examples as possible with the least complexity increase Protecting superior examples
A term/condition is violating if it makes a superior example inferior Such terms and conditions cannot be added
104
![Page 105: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/105.jpg)
Conclusions
Preference queries are essential in database and data analysis Ranking queries Skyline queries
There are many traditional studies on preference queries The TA algorithm for ranking queries Efficient and scalable algorithms for skyline queries Variations of skyline queries
OLAP and data mining can take advantage of preference queries Multidimensional selections and ranking Ranking aggregates Multidimensional skyline analysis Skyline on dynamic user preferences Mining preferences using superior and inferior examples
105
![Page 106: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/106.jpg)
What Is Next?
Preference queries on broader applications Preference queries in information retrieval applications Preference queries in recommendation systems …
Preference mining Pushing ideas and techniques to Web scale applications
Representative answers to preference queries …
106
![Page 107: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/107.jpg)
References (Preference Queries & OLAP)
J. Pei et al. Computing Compressed Skyline Cubes Efficiently. In ICDE’07.
J. Pei et al. Towards Multidimensional Subspace Skyline Analysis. TODS 2006.
J. Pei et al. Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces. In VLDB’05.
T. Xia and D. Zhang, Refreshing the Sky: The Compressed Skycube with Efficient Support for Frequent Updates. In SIGMOD’06
D. Xin and J. Han. P-Cube: answering preference queries in multi-dimensional space. In ICDE’08.
D. Xin et al. Answering top-k queries with multi-dimensional selections: the ranking cube approach. In VLDB’06.
T. Wu et al. ARCube: supporting ranking aggregate queries in partially materialized data cubes. In SIGMOD’08.
Y. Yuan et al. Efficient Computation of the Skyline Cube. In VLDB’05
107
![Page 108: Preference Queries from OLAP and Data Mining Perspective Jian Pei 1, Yufei Tao 2, Jiawei Han 3 1 Simon Fraser University, Canada, jpei@cs.sfu.ca 2 The](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d265503460f949fd6a4/html5/thumbnails/108.jpg)
References (Preferences and Mining)
R. Aggarwal and E. Wimmers. A framework for expressing and combining preferences. In SIGMOD’00.
S. Holland et al. Preference mining: a novel approach on mining user preferences for personalized applications. In PKDD’03.
B. Jiang et al. Mining Preferences from Superior and Inferior Examples. In KDD’08.
W. Kiebling. Foundations of preferences in database systems. In VLDB’02.
R. E. S. William et al. Learning to order things. JAIR 1999. R. C-W Wong et al. Efficient Skyline Querying with Variable User
Preferences on Nominal Attributes. In VLDB’08. R. C-W Wong et al. Mining Favorable Facets. In KDD’07. R. C-W Wong et al. Online Skyline Analysis with Dynamic Preferences
on Nominal Attributes. TKDE.
108