Continuous Processing of Preference Queries in Data Streams: a Survey

Post on 23-Feb-2016



Continuous Processing of Preference Queries in Data Streams: a Survey

M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos

Data Engineering Lab, Department of Informatics
Aristotle University of Thessaloniki

Presentation Layout
• Preliminaries
• Continuous skyline queries
• Continuous top-k queries
• Continuous top-k dominating queries
• Summary


Data Streams
• A data stream is an infinite sequence of objects.
• Each object can be one-dimensional or multi-dimensional.
• Streaming time series are finite sequences of objects.
• Streaming time series change over time.
• The arrival rate of objects usually varies.

Sliding Window Model (1)
Count-based window: the sliding window contains the W most recent tuples (“active”). Older tuples expire.
[Figure: timeline of tuples t1 to t8 with a window of W=5; tuples are marked as expired or active.]

Sliding Window Model (2)
Time-based window: the sliding window contains the tuples (“active”) of the W most recent timestamps. Older records expire.
[Figure: timeline of tuples t1 to t8 with a window of W=5; tuples are marked as expired or active.]
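
The two window models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, timestamps are assumed to be integers that arrive in non-decreasing order, and "the W most recent timestamps" is read as all timestamps greater than now - W.

```python
from collections import deque

def count_based_window(stream, W):
    """Yield the active tuples after each arrival: the W most recent tuples."""
    window = deque(maxlen=W)  # a bounded deque drops the oldest tuple itself
    for item in stream:
        window.append(item)
        yield list(window)

def time_based_window(stream, W):
    """stream yields (timestamp, tuple) pairs in non-decreasing timestamp order.
    A tuple is active while its timestamp is greater than now - W."""
    window = deque()
    for ts, item in stream:
        window.append((ts, item))
        while window and window[0][0] <= ts - W:
            window.popleft()  # expire tuples that fell out of the window
        yield [x for _, x in window]

count_snapshots = list(count_based_window(["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"], 5))
print(count_snapshots[-1])  # active tuples after t8 arrives
```

In the count-based case the window size is fixed, while in the time-based case the number of active tuples depends on the arrival rate.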

[Figure: Database System. Labels: User / Application, Input, Query, Result.]

Continuous Evaluation in a Data Stream System
[Figure: Labels: User / Application, Query, Query processor, Result.]

Motivation (1)
Numerous data stream contexts
• Financial data analysis
• Network management
• Astronomical data analysis
• Sensor networks
• Telecommunication data management

Motivation (2)
Preference queries
• Useful decision support tool
• Many applications in data streams

Example 1 (telecommunication data): Report the clients with the maximum call time and the maximum number of calls. (Continuous skyline query)

Example 2 (stock-market data): Report the products with the maximum price, the minimum sales and the minimum number of buyers. (Continuous top-k dominating query)


Skyline Query

Hotels  price  distance
T1      4      1
T2      3      2
T3      0.5    3
T4      2.5    4.5
T5      1.5    4
T6      3.5    5

[Figure: the hotels T1 to T6 plotted with axes price and distance.]

Dominant tuple: A tuple t dominates another tuple t’ if
• t is not worse than t’ in all dimensions, and
• t is better than t’ in at least one dimension.

Skyline: contains all the tuples not dominated by any other tuple.
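
As a small illustration of the two definitions, the following Python sketch computes the skyline of the hotel table by brute force. It assumes that smaller values are better in both dimensions (cheaper and closer), which the slide does not state explicitly; the names HOTELS, dominates and skyline are hypothetical.

```python
# Hotel data from the slide: name -> (price, distance)
HOTELS = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}

def dominates(a, b):
    """a dominates b: not worse in all dimensions, better in at least one.
    Assumption: smaller is better in every dimension."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Brute-force skyline: keep the tuples not dominated by any other tuple."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )

print(skyline(HOTELS))  # ['T1', 'T2', 'T3']
```

The quadratic scan is only meant to make the definition concrete; the streaming methods discussed next avoid recomputing the skyline from scratch.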

Continuous Skyline Query
• Problem definition: continuously evaluate a skyline query over multidimensional streaming time series.
• Application example (network data): find computers with suspicious behavior using network traffic, number of connections and number of destinations.

Basic Idea
The skyline changes due to
• the insertion of a new skyline tuple, or
• the expiration of a skyline tuple.

LookOut [Morse, ICDE06] and Lazy [Tao, TKDE06]
• Use of a spatial index
• Advantage: simple implementation
• Disadvantage: the expiration of a skyline tuple is not handled efficiently

Event Approach (1)
• When an existing skyline tuple expires, how can we find the new skyline tuples? This is a very costly operation.
• Skyline influence time (SIT): the minimum time at which a tuple may become a skyline tuple.
• Generate events based on SIT.

Event Approach (2)
[Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12) with W=10; K.SIT=19.]
Tuple K can be discarded due to tuple L (younger and better).

Eager [Tao, TKDE06]
• Advantage: handles skyline expiration
• Disadvantage: processing time per tuple

n-of-N Skyline Queries (1)
[Figure: n-of-N definition, with S6 = {a,c} and S4 = {c,g}. Source: icde05]

n-of-N Skyline Queries (2)
[Figure: n-of-N definition, with S6 = {c,h} and S4 = {e,h}. Source: icde05]

Method cnN (1)
Method cnN [Lin, ICDE05] is also based on events.
[Figure: tuples A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9), J(10), K(11), L(12) with W=10.]
• Tuple K is redundant because tuple L is better and younger than K.
• Tuple L is dominated by D and E.
• The dominance relation between L and E is critical because E is the youngest tuple which dominates L.

Method cnN (2)
• Generate intervals
  • for the skyline tuples, e.g. C = (0,3]
  • for the critical dominance relations, e.g. C -> G = (3,7]
• Use an interval tree to store them.
• The dominance graph contains all the critical dominance relations.
[Figure: dominance graph over A(1), B(2), C(3), D(4), E(5), F(6), G(7), marking the redundant tuples and the critical dominance relations.]

Method cnN (3)
• A tuple t is in the answer of an n-of-N skyline query iff there exists an interval containing the value M–n+1, where M is the total number of elements seen so far.
• To answer an n-of-N query, apply a stabbing query at M–n+1.

Example with M = 7
Intervals: C = (0,3], D = (0,4], D -> E = (4,5], D -> F = (4,6], C -> G = (3,7]
• For n = 6, M–n+1 = 2 and S6 = {C, D}
• For n = 4, M–n+1 = 4 and S4 = {D, G}
[Figure: dominance graph over A(1), B(2), C(3), D(4), E(5), F(6), G(7).]
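
The stabbing-query step of the example can be reproduced with a short Python sketch. Two assumptions are made here: each interval of a critical dominance relation X -> Y is attributed to the dominated tuple Y (this reading is consistent with S4 = {D, G} above), and a linear scan stands in for the interval tree that cnN uses. The names INTERVALS and n_of_N_skyline are hypothetical.

```python
# (tuple, low, high) encodes the half-open interval (low, high] from the slide.
# Assumption: the interval of a critical dominance relation X -> Y belongs to Y.
INTERVALS = [
    ("C", 0, 3),  # C = (0,3]
    ("D", 0, 4),  # D = (0,4]
    ("E", 4, 5),  # D -> E = (4,5]
    ("F", 4, 6),  # D -> F = (4,6]
    ("G", 3, 7),  # C -> G = (3,7]
]

def n_of_N_skyline(intervals, M, n):
    """Answer an n-of-N skyline query with a stabbing query at M - n + 1.
    A linear scan replaces the interval tree for brevity."""
    q = M - n + 1
    return sorted(t for t, low, high in intervals if low < q <= high)

print(n_of_N_skyline(INTERVALS, M=7, n=6))  # ['C', 'D']
print(n_of_N_skyline(INTERVALS, M=7, n=4))  # ['D', 'G']
```

Because all queries with different n are answered from the same set of intervals, this structure is what allows cnN to serve multiple queries at once.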

Method cnN (4)
Advantages
• Good use of skyline properties
• Multiple query processing
Disadvantages
• Processing time per tuple
• Increased memory requirements

Frequent Skyline - Motivation
• Highly dynamic environment
• The skyline results are meaningful only if the skyline tuples appear consistently
• Frequent skyline: tuples that stay on the skyline for a minimum user-defined interval [Zhang, SIGMOD09]

Streaming Model
• Client/server architecture
• The server receives object updates from the clients.
• Each object can be represented as a d-dimensional point.
• Object update (point movement in the d-dimensional space): the value in at least one dimension changes
• Object insertion or deletion: point movement from/to a nonexistent position
• Objective: minimization of the communication cost

Filter
• Safe region technique
• The skyline remains unchanged if each object stays in its safe region
• Communication happens only when the safe region is violated
• The safe region approach leads to communication optimization
[Figure: an object as a point and its filter (safe region). Source: sigmod09]

Sampling
• All clients report their skyline at the same sampled time
• The clients are synchronized with the same random seed
• Guaranteed quality if the sampling rate is high enough

Hybrid
• Hybrid solution: combines Filter and Sampling
  • Small changes: apply Filter
  • Larger changes: apply Sampling
• Disadvantage of all three methods: energy consumption is not uniform (critical in sensor networks)

k-dominant Skyline Query - Motivation
Skyline: contains tuples not dominated by any other tuple.
Disadvantage: High dimensionality problem.
Solution: Relax the notion of dominance.
k-dominant tuple: A tuple t k-dominates another tuple t’ if
• t is not worse than t’ in at least k dimensions, and
• t is better than t’ in at least one of them.
k-dominant skyline: contains all tuples not k-dominated by any other tuple [Kontaki, SAC08]

k-dominant Skyline Query - Example

     D1  D2  D3  D4  D5  D6
T1   6   5   4   3   2   1
T2   5   4   3   5   4   3
T3   6   6   2   2   6   5
T4   6   6   6   1   6   6
T5   6   6   6   5   5   5

• T1 dominates T5
• T1 5-dominates T4
• T1 4-dominates T3

Conventional skyline: {T1, T2, T3, T4}
5-dominant skyline: {T1, T2, T3}
4-dominant skyline: {T1, T2}
Smaller k, fewer tuples in the k-dominant skyline.
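
The example can be checked with a brute-force Python sketch of k-dominance. It assumes that smaller values are better in every dimension, which is the reading under which the three dominance statements above hold; the names DATA, k_dominates and k_dominant_skyline are hypothetical.

```python
# Example data from the slide: tuple -> values in dimensions D1..D6
DATA = {
    "T1": (6, 5, 4, 3, 2, 1),
    "T2": (5, 4, 3, 5, 4, 3),
    "T3": (6, 6, 2, 2, 6, 5),
    "T4": (6, 6, 6, 1, 6, 6),
    "T5": (6, 6, 6, 5, 5, 5),
}

def k_dominates(a, b, k):
    """a k-dominates b: not worse in at least k dimensions and better in at
    least one of them. Assumption: smaller is better."""
    not_worse = [i for i in range(len(a)) if a[i] <= b[i]]
    better = any(a[i] < b[i] for i in not_worse)
    return len(not_worse) >= k and better

def k_dominant_skyline(data, k):
    """Tuples that are not k-dominated by any other tuple."""
    return sorted(
        t for t, p in data.items()
        if not any(k_dominates(q, p, k) for u, q in data.items() if u != t)
    )

print(k_dominant_skyline(DATA, 6))  # conventional skyline: ['T1', 'T2', 'T3', 'T4']
print(k_dominant_skyline(DATA, 5))  # ['T1', 'T2', 'T3']
print(k_dominant_skyline(DATA, 4))  # ['T1', 'T2']
```

With k equal to the number of dimensions, k-dominance reduces to conventional dominance, so the first call returns the ordinary skyline.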

Observations
• Traditional or streaming skyline methods are inappropriate
• Skyline properties do not hold
  • E.g. the transitive property
  • k-dominance can be cyclic
• Existence of multiple users and multiple queries

Method CoSMuQ (1)
• A query on D dimensions arrives.
• Given a parameter value k, split the query into subqueries of d=k dimensions.
• Compute the conventional skyline of each subquery.
• The k-dominant skyline is the intersection of the skylines of the subqueries of a query.
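
The decomposition idea can be illustrated with the following static Python sketch. It is not the CoSMuQ algorithm itself (which maintains the subquery skylines continuously and shares them between queries); it only shows, on the five example tuples of the k-dominant skyline example, that intersecting the conventional skylines of all k-dimensional subqueries yields the k-dominant skyline. Smaller values are assumed to be better, and all function names are hypothetical.

```python
from itertools import combinations

# Example data of the k-dominant skyline example: tuple -> values in D1..D6
DATA = {
    "T1": (6, 5, 4, 3, 2, 1),
    "T2": (5, 4, 3, 5, 4, 3),
    "T3": (6, 6, 2, 2, 6, 5),
    "T4": (6, 6, 6, 1, 6, 6),
    "T5": (6, 6, 6, 5, 5, 5),
}

def dominates(a, b):
    """Conventional dominance (assumption: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def subspace_skyline(data, dims):
    """Conventional skyline of the data projected onto the dimensions in dims."""
    proj = {t: tuple(p[i] for i in dims) for t, p in data.items()}
    return {
        t for t, p in proj.items()
        if not any(dominates(q, p) for u, q in proj.items() if u != t)
    }

def k_dominant_via_subqueries(data, k):
    """Intersect the skylines of all subqueries with k dimensions."""
    d = len(next(iter(data.values())))
    result = set(data)
    for dims in combinations(range(d), k):
        result &= subspace_skyline(data, dims)
    return sorted(result)

print(k_dominant_via_subqueries(DATA, 5))  # ['T1', 'T2', 'T3']
print(k_dominant_via_subqueries(DATA, 4))  # ['T1', 'T2']
```

The number of subqueries grows as D choose k, which is in line with the memory cost that the method incurs in high dimensionality.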

Method CoSMuQ (2)
Advantages
• Based on conventional skyline (simple domination checks)
• Properties of conventional skylines can be used
• Exploits the overlap between different queries
Disadvantages
• Memory requirements increase in high dimensionality

Continuous Skyline Methods - Summary

Method                Query Type           Window Type  Multiple Queries
LookOut               skyline              time         no
Lazy and Eager        skyline              both         no
n-of-N                skyline              count        yes
Filter and Sampling   frequent skyline     time         no
CoSMuQ                k-dominant skyline   both         yes


Top-k Query - Example

Hotels  price  distance
T1      4      1
T2      3      2
T3      0.5    3
T4      2.5    4.5
T5      1.5    4
T6      3.5    5

[Figure: the hotels T1 to T6 plotted with axes price and distance, showing the answers for k=1 and k=2 with F = price + distance.]

Given a preference function, a top-k query returns the k tuples with the best scores.
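
A minimal Python sketch of the top-k query over the hotel table follows. It assumes that a lower value of F = price + distance is a better score, which the slide does not state explicitly; ties are broken by tuple name, which is also an assumption. The names HOTELS and top_k are hypothetical.

```python
# Hotel data from the slide: name -> (price, distance)
HOTELS = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}

def top_k(points, k, score=sum):
    """Return the k tuples with the best (here: lowest) score.
    The default preference function is F = price + distance."""
    ranked = sorted(points, key=lambda name: (score(points[name]), name))
    return ranked[:k]

print(top_k(HOTELS, 1))  # ['T3'], since F(T3) = 3.5 is the lowest score
print(top_k(HOTELS, 2))  # T1 and T2 tie at F = 5; the name breaks the tie
```

Unlike the skyline, the answer depends on the preference function: a different weighting of price and distance can produce a different ranking.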

Continuous Top-k Query
• Problem definition: continuous evaluation of a top-k query over multidimensional streaming time series.
• Application example (network data)
  • top-100 flows with the largest individual throughput
  • Common destination: DDoS attack

Basic Idea
• A new tuple changes the top-k: it should belong to the influence region of the query
• Top-k tuple expiration: query computation from scratch
• TMA (Top-k Monitoring Algorithm) [Mouratidis, SIGMOD06]
  • Advantage: simple implementation
  • Disadvantage: no efficient handling of an expired top-k tuple
[Figure: influence region in the (x1, x2) space, bounded by the line defined by F = score(tk) = x1 + x2. Source: sigmod06]

Skyband - Example
k-skyband: contains all the tuples which are dominated by at most k–1 other tuples.
• 1-skyband: tuples not dominated by other tuples (the 1-skyband is the skyline)
• 2-skyband: tuples dominated by at most 1 other tuple
• A tuple dominated by 2 other tuples belongs to the 3-skyband
[Figure: tuples A, B, C, D, E.]

Skyband Approach (1)
• Transform tuples into the (score, expiration_time) space
• Dominance counter (DC): number of tuples that are younger and better
• Rule: keep tuples with DC < k
• Observation: tuples appearing in some top-k result belong to the k-skyband in the (score, exp_time) space.

Tuple  score (F = price + distance)
T1     5
T2     5
T3     3.5
T4     7
T5     5.5
T6     8.5

[Figure: the tuples T1 to T6 in the original space (price, distance) and in the transformed space (score, exp_time); each tuple is labeled with DC=0 or DC=1 and the top-1 tuple is marked.]
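
The dominance counter rule can be sketched in Python as follows. The arrival order is not recoverable from the transcript, so the order T1, ..., T6 used here is purely hypothetical; a lower score is assumed to be better, and a younger tuple is one that arrives (and therefore expires) later. The names SCORES, ARRIVAL, dominance_counters and k_skyband are hypothetical as well.

```python
# Scores from the slide (F = price + distance)
SCORES = {"T1": 5, "T2": 5, "T3": 3.5, "T4": 7, "T5": 5.5, "T6": 8.5}

# Hypothetical arrival order (oldest first); not given in the source.
ARRIVAL = ["T1", "T2", "T3", "T4", "T5", "T6"]

def dominance_counters(scores, arrival):
    """DC of a tuple: number of tuples that are younger and better.
    Assumption: 'better' means a strictly lower score."""
    dc = {}
    for i, t in enumerate(arrival):
        dc[t] = sum(1 for u in arrival[i + 1:] if scores[u] < scores[t])
    return dc

def k_skyband(scores, arrival, k):
    """Rule from the slide: keep only the tuples with DC < k."""
    dc = dominance_counters(scores, arrival)
    return [t for t in arrival if dc[t] < k]

print(dominance_counters(SCORES, ARRIVAL))
print(k_skyband(SCORES, ARRIVAL, 1))  # tuples that can ever be the top-1
```

A tuple with DC >= k always has at least k better tuples that outlive it, so it can never enter the top-k result and can be dropped safely.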

Skyband Approach (2)
• SMA (Skyband Monitoring Algorithm), proposed in [Mouratidis, SIGMOD06]
• Advantage: independent of the dimensionality, since it works in the 2-dimensional (score, exp_time) space
• Disadvantage: the k-skyband may contain fewer than k tuples; in this case, a top-k tuple expiration causes query computation from scratch

Distributed Top-k
• Continuously report the k largest values obtained from distributed data streams.
• The objective is to minimize the communication cost.
• Proposed by [Babcock, SIGMOD03]

Streaming Model
• Nodes: N1, N2, …, Nm; coordinator node: N0
• Set of n data objects O1, O2, …, On associated with real values V1, V2, …, Vn
• Value updates are represented as tuples $\langle O_i, N_j, \Delta \rangle$: Nj detects a change $\Delta$ in the value Vi of Oi.
• The change is not seen by the other nodes Nk ($k \neq j$).
• The value Vi of an object Oi is $V_i = \sum_{j} V_{i,j}$, where $V_{i,j}$ is the value of the i-th object at the j-th node.

Method (1)
• Initialize a top-k set at the coordinator node
• Set arithmetic constraints at the monitor nodes; they depend on the current top-k set
• Constraints valid: no communication
• Constraints invalidated
  • Client communicates with server
  • Possibly new top-k set
  • Recomputation of constraints

Method (2) - Adjustment Factors

Values per node (rows: Node 1, Node 2; columns: Object 1, Object 2)
V1,1 = 1   V2,1 = 9
V1,2 = 3   V2,2 = 1

Top-1 = {O1}
Node 1, Local Top-1 = {O1}
Node 2, Local Top-1 = {O2}
Local top-ks differ from the global top-k => unnecessary constraint violations => increased communication cost

Adjustment Factors (AF) (rows: Node 1, Node 2; columns: Object 1, Object 2)
AF1,1 = 0   AF2,1 = -3
AF1,2 = 0   AF2,2 = 3

Node 2: V1,2 = 3+0 = 3
Node 2: V2,2 = 1+3 = 4
Local top-k similar to global => low communication cost
To keep the results valid, the AFs of each object sum to zero.
Disadvantage: energy consumption is not uniform

Uncertain Data

Score  Prob
6      0.8
5      0.5
2      0.4
8      0.4

4 tuples, 16 possible worlds:

Tuples       Pr.
2, 5, 6, 8   .064
2, 5, 6      .096
2, 6, 8      .064
2, 6         .096
2, 5, 8      .016
2, 5         .024
2, 8         .016
2            .024
5, 6, 8      .096
5, 6         .144
5, 8         .024
5            .036
6, 8         .096
6            .144
8            .024
Empty        .036

Pk-topk query: returns the k tuples with the highest probability of being in the top-k.
Top-2: {6, 5} with prob. {0.64, 0.5}
To compute the probability of 6, sum the probabilities of the worlds in which 6 is among the top-2.
Source: pvldb08
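
The possible-worlds computation can be replayed with a brute-force Python sketch. The tuple probabilities used below (0.8, 0.5, 0.4, 0.4 for the scores 6, 5, 2, 8) are the values implied by the 16 world probabilities listed above. The ranking direction is an assumption: the reported probability 0.64 for tuple 6 is reproduced when smaller scores rank first, so that convention is used here. Enumerating all worlds is exponential and only serves to illustrate the semantics; it is not the compact-set method of [Jin, PVLDB08]. All names are hypothetical.

```python
from itertools import product

# score -> existence probability (implied by the world probabilities above)
TUPLES = {6: 0.8, 5: 0.5, 2: 0.4, 8: 0.4}

def topk_probabilities(tuples, k):
    """For each tuple, sum the probabilities of the possible worlds in which
    it is among the top-k. Assumption: smaller scores rank first."""
    items = sorted(tuples)  # ascending order, i.e. best rank first
    prob = {t: 0.0 for t in items}
    for mask in product([False, True], repeat=len(items)):
        p = 1.0
        for t, present in zip(items, mask):
            p *= tuples[t] if present else 1 - tuples[t]
        world = [t for t, present in zip(items, mask) if present]
        for t in world[:k]:  # the top-k tuples of this world
            prob[t] += p
    return prob

def pk_topk(tuples, k):
    """Return the k tuples most likely to be in the top-k."""
    prob = topk_probabilities(tuples, k)
    return sorted(prob, key=lambda t: -prob[t])[:k]

probs = topk_probabilities(TUPLES, 2)
print({t: round(p, 2) for t, p in probs.items()})
print(pk_topk(TUPLES, 2))  # [6, 5]
```

The independence assumption of the model is visible in the inner loop, where the probability of a world is the product of the per-tuple probabilities.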

Pk-topk Query
• Solution proposed by [Jin, PVLDB08]
• Compact set based
• Space-efficient solution
  • Discard unnecessary tuples
  • Apply several compression schemes to compress data
• Disadvantage
  • Model assumption: the probability of each tuple is assumed to be random and independent of the other tuples.

Continuous Top-k Methods - Summary

Method              Query Type         Window Type  Multiple Queries
TMA and SMA         top-k              both         yes
Distributed top-k   Distributed top-k  time         no
Compact set based   Pk-topk            both         no


Top-k Dominating Query - Example

Hotels  price  distance
T1      4      1
T2      3      2
T3      0.5    3
T4      2.5    4.5
T5      1.5    4
T6      3.5    5

[Figure: the hotels T1 to T6 plotted with axes price and distance, showing the top-k answers for k=1 and k=2 with F = price + distance.]

Skyline: contains all the tuples not dominated by any other tuple.
• Disadvantage: High dimensionality problem.

Top-k: Given a preference function, a top-k query returns the k tuples with the best scores.
• Disadvantage: user-defined preference function.

Top-k dominating: the answer contains the k tuples with the highest domination power.
• Combines the advantages of skyline and top-k queries and avoids their disadvantages.
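
A brute-force Python sketch of the top-k dominating query over the hotel table is given below. It takes the domination power of a tuple to be the number of tuples it dominates and assumes that smaller values are better in both dimensions; ties are broken by tuple name. The names HOTELS, domination_scores and top_k_dominating are hypothetical.

```python
# Hotel data from the slide: name -> (price, distance)
HOTELS = {
    "T1": (4, 1), "T2": (3, 2), "T3": (0.5, 3),
    "T4": (2.5, 4.5), "T5": (1.5, 4), "T6": (3.5, 5),
}

def dominates(a, b):
    """Conventional dominance (assumption: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def domination_scores(points):
    """Domination power: how many other tuples each tuple dominates."""
    return {
        t: sum(1 for u, q in points.items() if u != t and dominates(p, q))
        for t, p in points.items()
    }

def top_k_dominating(points, k):
    """The k tuples with the highest domination power."""
    scores = domination_scores(points)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

print(domination_scores(HOTELS))
print(top_k_dominating(HOTELS, 2))  # ['T3', 'T5']
```

No preference function is needed, and the size of the answer is fixed at k regardless of the dimensionality, which is how the query avoids the drawbacks of both top-k and skyline queries.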

Continuous Top-k Dominating Query
• Problem definition: continuous evaluation of a top-k dominating query over multidimensional streaming time series.
• Application example (sensor network): areas with high probability of fire outbreak, based on temperature, humidity and wind speed

EVA
• Objective: reduce domination checks
• Safe interval of a tuple
  • Ignore the tuple for this interval
  • It depends on the tuple's score and the k-th score
• End of safe interval -> event
• Event: try to compute a new safe interval, else compute the score from scratch
• New tuple
  • Find another tuple that dominates the new one
  • Estimate a lower bound of the safe interval

ADA
• Advanced computation of the safe interval
  • Depends on the number of tuples that dominate this tuple and expire later
• Candidate tuples
  • Tuples with scores close to the k-th score are updated at each time instance
• EVA and ADA were proposed by [Kontaki 2009]


Summary
• Preference queries are very useful in data streams
• Presented state-of-the-art methods
  • for continuous skyline queries
  • for continuous top-k queries
  • for continuous top-k dominating queries
• Examined advantages and disadvantages of the proposed methods

Research Directions
• Continuous subspace skyline queries
• Solutions appropriate for distributed environments
  • uniform energy consumption
• Approximate algorithms
• Existence of multiple queries

Thank you
