computer science and engineering
Computer Science and Engineering
Efficiently Monitoring Top-k Pairs over Sliding Windows
Presented by: Zhitao Shen¹
Joint work with Muhammad Aamir Cheema¹, Xuemin Lin²˒¹, Wenjie Zhang¹, Haixun Wang³
¹ The University of New South Wales, Australia
² East China Normal University
³ Microsoft Research Asia
Introduction
Top-k Pairs Query:
• Given a scoring function score() that computes the score of a pair of objects, return the k pairs of objects with the smallest scores.
Examples:
• k closest pairs queries
• k furthest pairs queries
Top-k Pairs against sliding windows
• Given a data stream, return the top-k pairs among the most recent N objects.
Applications
• Wireless sensor networks, stock markets, traffic monitoring and transaction monitoring
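The query definition above can be made concrete with a brute-force baseline. The sketch below is only an illustration, not the method of the talk: it recomputes all pairs in the window after every arrival, which is exactly the cost the later slides aim to avoid. The function names and the toy stream are hypothetical.

```python
from collections import deque
from heapq import nsmallest
from itertools import combinations

def top_k_pairs(window, score, k):
    """Return the k pairs with the smallest scores among all pairs in the window."""
    return nsmallest(k, combinations(window, 2), key=lambda pair: score(*pair))

def monitor(stream, score, k, N):
    """Yield the brute-force top-k pairs over the most recent N objects after each arrival."""
    window = deque(maxlen=N)  # the oldest object is dropped automatically
    for obj in stream:
        window.append(obj)
        yield top_k_pairs(window, score, k)

# k closest pairs on a toy one-dimensional stream (hypothetical data).
closest = lambda a, b: abs(a - b)
results = list(monitor([1, 10, 4, 5, 20], closest, k=2, N=3))
```

A k furthest pairs query fits the same interface by negating the distance, for example `lambda a, b: -abs(a - b)`, because the query always returns the smallest scores.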
Motivation
No existing work for general pairs queries over sliding windows
Support arbitrary scoring functions.
Example: Fraud detection over transaction streams
– Query the transaction pairs that are close in time but far apart in location.
    Select a.id, b.id
    from trans a, trans b
    where a.id <> b.id and a.account = b.account
    order by |a.time - b.time| - dist(a.loc, b.loc)
    limit k
    window [24 hours]
203-13845 10:15:20 New York $1000
203-13845 10:18:10 L.A. $1000
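As a rough illustration of the scoring function in the query above, the sketch below evaluates |a.time - b.time| - dist(a.loc, b.loc) for the two example transactions. Several things here are assumptions that the slide does not specify: time is measured in seconds, dist is the great-circle distance in kilometres, and the city coordinates are approximate values chosen for illustration.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

def haversine_km(loc_a, loc_b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*loc_a, *loc_b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def fraud_score(a, b):
    """|a.time - b.time| - dist(a.loc, b.loc); units (seconds, km) are an assumption."""
    dt = abs((a["time"] - b["time"]).total_seconds())
    return dt - haversine_km(a["loc"], b["loc"])

def parse(t):
    return datetime.strptime(t, "%H:%M:%S")

# The two transactions from the slide; coordinates are approximate and illustrative.
t1 = {"account": "203-13845", "time": parse("10:15:20"), "loc": (40.71, -74.01)}   # New York
t2 = {"account": "203-13845", "time": parse("10:18:10"), "loc": (34.05, -118.24)}  # L.A.

score = fraud_score(t1, t2)  # strongly negative: 170 seconds apart, thousands of km apart
```

Because the query returns the pairs with the smallest scores, a pair such as this one, close in time and far apart in space, ranks near the top.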
Problem Definitions (Preliminaries)
Sliding Windows
– A sliding window contains the most recent N objects of the data stream.
– The number of pairs is N(N – 1) / 2
– Lower bound runtime cost: O(N) for each new object
– Lower bound storage cost: O(N)
– The age of a pair depends on the older object.
[Figure: a sliding window of size 5 over the stream o0 to o7, ordered from newer to older. Age of an object: 5 4 3 2 1 0]
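The definitions above can be checked on a tiny window. This is a minimal sketch under one stated assumption: the newest object has age 0 and each older object is one step older, which matches the age labels in the figure. The helper names are hypothetical.

```python
from itertools import combinations

def object_ages(window):
    """Map each object to its age; `window` is ordered oldest to newest, newest has age 0."""
    newest = len(window) - 1
    return {obj: newest - i for i, obj in enumerate(window)}

def pair_age(ages, a, b):
    """A pair is as old as its older object, since it expires when that object leaves."""
    return max(ages[a], ages[b])

window = ["o4", "o3", "o2", "o1", "o0"]       # o0 is the most recent arrival
ages = object_ages(window)
num_pairs = len(list(combinations(window, 2)))  # N(N - 1) / 2 with N = 5
```

With five objects there are 10 pairs, and the pair (o0, o1) has age 1 because o1 is its older object.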
Contributions
Unified framework
• First to study top-k pairs queries over sliding windows.
• Support arbitrarily complex scoring functions
• Support efficient queries for any window size n ≤ N and any k ≤ K

Cost | Lower bound | Expected cost for our algorithms
Storage requirement | O(N) | O(N) + O(K log(N/K)) for each scoring function
Skyband maintenance cost for each object | O(N) | O(N (log(log N) + log K))
Answering top-k pairs | O(k) | O(log(log n) + log K + k)
Preliminaries
Map all the pairs to an age–score space.
• Example: the pair p1 = (o0, o1) maps to (p1.age, p1.score) = (1, 3).
The K-skyband [Papadias et al., TODS05] keeps the minimum set of candidate results: the pairs dominated by fewer than K other pairs.
• p2 dominates p5 because p2.score < p5.score and p2 expires no earlier than p5 (p2.age ≤ p5.age).
Task 1: how do we efficiently maintain the K-skyband?
• Naive: O(N |SKB|) for checking all N-1 pairs
• Expected size of the skyband is O(K log(N/K))
• Ours: O(N log |SKB|)
Task 2: how do we use the K-skyband to efficiently obtain the top-k pairs against any sliding window n ≤ N?
[Figure: pairs p1 to p10 over the objects o0 to o4, plotted in the age–score space (axes: Age, Score), with the top-2 pairs highlighted]
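The dominance relation and the K-skyband can be written down directly. The sketch below is a quadratic-time illustration of the definition only, not the maintenance algorithm of the talk. It assumes a pair is represented as an (age, score) tuple, that a smaller age means the pair expires later, and that ties in score do not count as dominance; the sample points are hypothetical.

```python
def dominates(p, q):
    """p dominates q: strictly smaller score, and p expires no earlier (age <=)."""
    return p[1] < q[1] and p[0] <= q[0]

def k_skyband(pairs, K):
    """Keep the pairs dominated by fewer than K other pairs. `pairs` holds (age, score)."""
    return [q for q in pairs if sum(dominates(p, q) for p in pairs) < K]

# Hypothetical (age, score) points.
pairs = [(1, 3), (2, 1), (2, 5), (3, 2), (4, 4), (4, 0.5)]
skyline = k_skyband(pairs, K=1)
skyband2 = k_skyband(pairs, K=2)
```

A pair outside the K-skyband can never appear in a top-k answer for k ≤ K: at least K better-scoring pairs stay in every window that still contains it.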
Efficient Skyband Maintenance
Can we find a boundary between the skyband points and non-skyband points?
• K-staircase
How can we efficiently compute the K-staircase and K-skyband?
• Update the K-staircase and K-skyband in O(|SKB| log K).
• Check if a pair is dominated by the K-skyband in O(log |SKB|) time for each new pair by doing binary search.
[Figure: a 2-skyband in the age–score space (axes: Age, Score) with pairs p1 to p7 and the K-staircase points s1 and s2]
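One way to picture the boundary idea is a step function over age. The sketch below is a simplified reading, not necessarily the exact structure used in the paper: for each age it stores the K-th smallest score among skyband points that are at least as recent, so a new pair is dominated by K skyband points exactly when its score exceeds the step it falls under. Pairs are assumed to be (age, score) tuples and all names are hypothetical.

```python
import heapq
from bisect import bisect_right

def build_staircase(skyband, K):
    """Return (ages, thresholds): thresholds[i] is the K-th smallest score among
    skyband points with age <= ages[i], or infinity if fewer than K such points exist."""
    ages, thresholds = [], []
    best = []  # max-heap (negated scores) of the K smallest scores seen so far
    for age, score in sorted(skyband):
        if len(best) < K:
            heapq.heappush(best, -score)
        elif score < -best[0]:
            heapq.heapreplace(best, -score)
        threshold = -best[0] if len(best) == K else float("inf")
        if ages and ages[-1] == age:
            thresholds[-1] = threshold  # same age: keep the final threshold
        else:
            ages.append(age)
            thresholds.append(threshold)
    return ages, thresholds

def is_dominated(ages, thresholds, age, score):
    """Binary search for the last step with step age <= age, then compare scores."""
    i = bisect_right(ages, age) - 1
    return i >= 0 and thresholds[i] < score

skyband = [(1, 3), (2, 1), (3, 2), (4, 0.5)]  # hypothetical 2-skyband
ages, thresholds = build_staircase(skyband, K=2)
```

Building the steps takes O(|SKB| log K) with a heap of size K, and each check is a single binary search, which is in line with the costs quoted on the slide.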
Efficient Query Answering
Can we do better for any sliding window size n < N?
Use a Priority Search Tree to index the skyband points
• Self-balancing tree
• Efficient 3-sided range query
[Figure: a 2-skyband with pairs p1 to p8 in the age–score space (axes: Age, Score), shown for window size = N and for any window size n < N, alongside the Priority Search Tree built over the same points]
Efficient Query Answering
Our contribution: retrieve the top-k pairs in the 1-sided range.
• An algorithm similar to post-order traversal costs O(log |SKB| + k)
[Figure: the 2-skyband with pairs p1 to p8 in the age–score space (axes: Age, Score) for any window size n < N, alongside the Priority Search Tree]
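The idea of answering a smaller-window query from a priority search tree can be sketched as follows. This is not the post-order-like algorithm from the talk: it swaps in a simpler best-first search with a heap, which costs an extra logarithmic factor compared with the O(log |SKB| + k) bound quoted above. The tree is a static build (heap-ordered on score, split on age), and the names `build`, `top_k`, `max_age` and the sample points are all hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[int, float]  # (age, score)

@dataclass
class Node:
    point: Point                     # smallest score in this subtree (heap order)
    min_age: int                     # smallest age stored anywhere in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(points_by_age: List[Point]) -> Optional[Node]:
    """Build from points sorted by age: root holds the min score, the rest split by age."""
    if not points_by_age:
        return None
    best = min(range(len(points_by_age)), key=lambda i: points_by_age[i][1])
    rest = points_by_age[:best] + points_by_age[best + 1:]
    mid = len(rest) // 2
    return Node(points_by_age[best], points_by_age[0][0], build(rest[:mid]), build(rest[mid:]))

def top_k(root: Optional[Node], max_age: int, k: int) -> List[Point]:
    """Return the k smallest-score points with age <= max_age, in ascending score order."""
    result, frontier, counter = [], [], 0
    if root is not None and root.min_age <= max_age:
        frontier.append((root.point[1], counter, root))
    while frontier and len(result) < k:
        _, _, node = heapq.heappop(frontier)
        if node.point[0] <= max_age:
            result.append(node.point)
        for child in (node.left, node.right):
            # Skip subtrees that contain no point inside the age range.
            if child is not None and child.min_age <= max_age:
                counter += 1  # tie-breaker so nodes themselves are never compared
                heapq.heappush(frontier, (child.point[1], counter, child))
    return result

points = [(1, 6.0), (2, 9.0), (3, 8.0), (4, 5.0), (5, 3.0), (6, 4.0), (7, 1.0), (8, 2.0)]
root = build(sorted(points))
answer = top_k(root, max_age=3, k=2)
```

The heap order on score guarantees that every unvisited point scores at least as much as some node on the frontier, so points are reported in ascending score order and the search can stop after k results.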
What else in the paper?
Efficient continuous queries on the skyband.
• Continuously monitoring the top-k results for any fixed k (k ≤ K) and n (n ≤ N).
• Amortized O(k/n (log |SKB| + k)) time per update.
Optimization on monotonic scoring functions.
• Handling the k-closest pairs, k-furthest pairs queries.
• Applying the Threshold Algorithm on sorted lists
• Reducing the number of considered pairs for each new object from N to (d+1) N^(d/(d+1)) K^(1/(d+1))
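To get a feel for the last expression, it can be evaluated for sample values. This is plain arithmetic on the formula as written, ignoring any hidden constants. The slide does not define d; it is presumably the number of attributes, and the values of N, K and d below are hypothetical.

```python
def considered_pairs(N, K, d):
    """Evaluate (d+1) * N^(d/(d+1)) * K^(1/(d+1)) as written on the slide."""
    return (d + 1) * N ** (d / (d + 1)) * K ** (1 / (d + 1))

# Hypothetical setting: window of 10,000 objects, K = 10, d = 2.
estimate = considered_pairs(N=10_000, K=10, d=2)  # about 3,000, versus N = 10,000
```

The saving shrinks as d grows, because the exponent d/(d+1) approaches 1.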
Experimental Settings
Real dataset.
– Sensor data from the Intel research lab
– 2.3 million records.
– Scoring function score(o_x, o_y) defined over |o_x.humidity − o_y.humidity|, |o_x.temp − o_y.temp| and |o_x.time − o_y.time|
Synthetic data.
– Uniform, correlated and anti-correlated distributions.
– 2 million objects
– Closest and furthest pairs in Manhattan distance
Experiments (Overall Cost on real data)
• SCase: our algorithm using the K-staircase to maintain the skyband.
• Naïve: maintains kN pairs and sorts them by score.
• LB: the lower bound cost.
[Figure: two panels, varying K and varying N (in thousands)]
Experiments (Query Answering)
• Linear: scan the skyband points to find the top-k pairs.
• Snapshot: our snapshot query algorithm.
• Continuous: our continuous query algorithm.
• LB: an algorithm to obtain top-k results in O(k) time.
[Figure: two panels, varying K and varying |Q| (in thousands)]
Conclusion:
• First to study a broad class of top-k pairs queries over sliding windows.
• We present efficient algorithms and show that their performance is reasonably close to the lower bound cost.
• We provide extensive experimental results on both real and synthetic data sets to show the efficiency and scalability of the proposed algorithms.
Question and Answer
Thank You! Any Questions?
Related Work
Top-k Query Processing
• Fagin’s Algorithm (FA), Threshold Algorithm (TA), No Random Access (NRA)
Top-k Pairs Query Processing
• k-closest pairs queries
• k-furthest pairs queries
• Top-k pairs queries [Cheema et al., ICDE’11]
Data Stream Processing
• Top-k query processing over data streams [Mouratidis et al., SIGMOD’06]
• k-nearest neighbour queries [Böhm et al., ICDE’07]
Experiments (Skyband Maintenance algorithm)
• Basic: the maintenance algorithm without the K-staircase.
• SCase: our algorithm using the K-staircase to maintain the skyband.
• TA: the optimized algorithm for monotonic scoring functions.
• LB: the lower bound cost.
[Figure: two panels, varying the number of attributes and varying K]