adaptivity in continuous query systems
DESCRIPTION
Adaptivity in continuous query systems. Luis A. Sotomayor & Zhiguo Xu Professor Carlo Zaniolo CS240B - Spring 2003. Outline. Introduction Adapting to the “burstiness” of data streams by using a smart operator scheduling strategy - PowerPoint PPT PresentationTRANSCRIPT
Adaptivity in continuous query systems
Luis A. Sotomayor & Zhiguo Xu
Professor Carlo ZanioloCS240B - Spring 2003
Sotomayor - Xu 2
Outline Introduction Adapting to the “burstiness” of data
streams by using a smart operator scheduling strategy
Adapting to high volumes of data streamed by multiple data sources through the use of “adaptive filters”
Conclusion
Sotomayor - Xu 3
Introduction Two distinguishing characteristics of
data streams: Volume of data is extremely high Decisions are made in close to real time
Traditional solutions are impractical Data cannot be stored in static databases
for offline querying Importance of data streams is due to
variety of applications
Sotomayor - Xu 4
Applications of data streams Network monitoring Intrusion detection systems Fraud detection Financial monitoring E-commerce Sensor networks
Sotomayor - Xu 5
Research efforts Large number of applications has led to
many efforts seeking to construct full-fledged DSMS
Efforts have concentrated on issues of System architectures Query languages Algorithm efficiency
Issues such as efficient resource allocation, and communication overhead have received less attention
Sotomayor - Xu 6
Importance of adaptivity DSMS deal with multiple long-running continuous
queries Data streams do not usually arrive at a regular rate
Considerable “burstiness” and variation over time Environment conditions in which queries are
executed are frequently different from the conditions for which the query plans were generated
DSMS may face an increasing number of data sources and therefore an increased volume of traffic
The “Chain” operator scheduling strategy
Sotomayor - Xu 8
The classic solution Buffer the backlog of unprocessed
tuples Work through them during periods of
light load Problem:
Heavy load could exceed physical memory (causing page switches)
The memory used for these backlogs has to be minimized
Sotomayor - Xu 9
Finding a better solution Claim: the operator scheduling
strategy can have a significant impact on run-time resource consumption
Use an operator scheduling strategy that will minimize the amount of memory used during query execution I.e. reduce the size of the backlogs
Sotomayor - Xu 10
Chain scheduling A near optimal operator scheduling
strategy Outperforms competing operator
scheduling strategies Strategy concentrates on
Single stream queries involving Selection Projection Foreign-key joins with stored relations
Sliding window queries over multiple streams
Sotomayor - Xu 11
The model Query execution is conceptualized as a data
flow diagram (a directed acyclic graph) Nodes correspond to pipelined operators Edges represent compositions of operators
An edge from A to B indicates the output of operator A is the input to operator B
Another interpretation: an edge represents an input queue that buffers the output from A before it is input to B
Sotomayor - Xu 12
An example Suppose the query is
SELECT Name FROM EmployeeStream WHERE ID = ‘12345’;
Operators are Projection (SELECT …) Selection (WHERE …)
Input stream
Select ProjectOutput stream
Operator path
Sotomayor - Xu 13
Main ideas Operators are thought of as filters
Operate on a set of tuples Produce s tuples in return
s selectivity of an operator If s = 0.2 we can interpret the value in
two ways Out of every 10 tuples, the operator outputs
2 tuples If the input requires 1 unit of memory, the
output will require 0.2 units of memory
Sotomayor - Xu 14
Example Consider an operator path with two
operators O1 and O2 Assume that O1 takes one unit of time
to process a tuple and that its selectivity is 0.2
Assume that O2 takes one unit of time to process 0.2 tuples and that its selectivity is 0
I.e. O2 outputs tuples out of the system
Sotomayor - Xu 15
Example (cont) Now consider two strategies
FIFO A tuple is passed through both operators in
two consecutive time units No other tuples are processed during that
time Greedy strategy
If there is a tuple buffered before O1 then it is operated on using one time unit
Otherwise if there are tuples buffered before O2, 0.2 tuples are processed using 1 time unit
Sotomayor - Xu 16
Example (cont)
Time Greedy scheduling FIFO scheduling
0 1 11 1.2 1.22 1.4 2.03 1.6 2.24 1.8 3.05 2.0 3.26 2.2 4.0
Memory usage
Need to consider the growth or reduction of data as it travels along the operator path
Sotomayor - Xu 17
Progress charts Behavior of data
is captured by progress charts Points represent
an operator The ith operator
takes (ti – ti-1) units of time to process a tuple of size si-1
Result is a tuple of size si
Sotomayor - Xu 18
Progress charts (cont) We can define
selectivity as the drop in tuple size from operator i to operator i+1. In other words
selectivity is equal to si/si-1
selectivity
Sotomayor - Xu 19
The lower envelope Consider some point (s,
t) on the progress chart Imagine there is a line
from this point to every operator point (ti, si) to its right
The operator that corresponds to the line with the steepest slope is called the “steepest descent operator point”
Sotomayor - Xu 20
The lower envelope (cont) By starting at the first
point (t0, s0) and repeatedly calculating the steepest descent operator point we find the lower envelope P’ for a progress chart P
Notice that the slopes of the segments are non-increasing
Sotomayor - Xu 21
The lower envelope (cont) So what is it?
A way to find which segments of the operator path yield the biggest drops in tuple size
It allows us to consider changes in selectivity across groups of operators We call these groups “chains”
Sotomayor - Xu 22
“Chain” scheduling Chain assigns priorities to operators
equaling the slope of the lower envelope segment to which the operator belongs
At any time Out of all the operators with tuples in their
input queues the one with the highest priority is chosen
When there are “ties,” the operator with the oldest tuples is chosen (based on arrival time)
Sotomayor - Xu 23
The Chain strategy along the progress chart Tuples don’t actually
move along lower envelope
They instead move along the operator path
When the Chain strategy moves along the actual progress chart P, the memory requirements are not that much greater than before
Sotomayor - Xu 24
Multiple stream queries Queries that have at least one
tuple-based sliding window join between two streams
Sotomayor - Xu 25
Multiple stream query execution
Query is first broken up into parallel operator paths
R
S
R
S
Shared
Sotomayor - Xu 26
Experimental results Compared the performance of
Chain, FIFO, Greedy, and Round-Robin
2 data sets (network data) Synthetic data set Real data set
Queries used IP addresses and packet sizes in selection and projection predicates
Sotomayor - Xu 27
Experiment: single stream queries (4 operators)
Query: 4 operators Third operator is
very selective In between two
less selective operators
Sotomayor - Xu 28
Experiment results
Sotomayor - Xu 29
Multiple stream experiment Three simultaneous queries
A sliding window join Two single stream queries with
selectivities less than one Results show Chain outperforms other
strategies by a large margin
Sotomayor - Xu 30
Multiple stream experiment results
Sotomayor - Xu 31
Summary Proved that the choice of operator
scheduling strategy has a significant impact on resource consumption
Proved that the Chain scheduling strategy outperforms competing strategies
Future work Latency and starvation issues Consider query plans that change over time Consider the sharing of computation and
memory in query plans
Sotomayor - Xu 32
“Adaptive filters” for continuous queries over distributed data streams
Sotomayor - Xu 33
What’s the problem? Distributed data sources continuously stream
updates to a centralized processor where continuous queries are evaluated
Because of the high volume of data updates, the communication overhead jeopardizes system performance E.g. path latency computed by monitoring
queuing latency at routers: the volume of monitoring traffic from routers may exceed that of normal traffic
Can we reduce the communication overhead to make continuous queries based on multiple data streams feasible and efficient?
Sotomayor - Xu 34
Important observations Exact precision for continuous queries is not
always needed E.g. path latency application: <= 5 ms of accuracy
Approximate answers of sufficient precision can usually be computed from a small fraction of the input stream. E.g. average network traffic volume received by all
hosts within the organization The precision constraint for queries may
change over time. E.g. more precise traffic volume needed in face of
attack
Sotomayor - Xu 35
Overview of Approach Reduce communication overhead at
the cost of query precision. Quantitative precision constraints specified
with the continuous queries Bounded approximate answer [L, H] Precision constraint δ. 0 ≤ H – L ≤ δ
Filters installed at the remote data sources by the stream processor
Filter at data object O’s source: [Lo, Ho] of width Wo centered around most recent numeric update V.
Sotomayor - Xu 36
Naive filtering policy Uniform allocation
E.g a single CQ: AVG(O1, O2, …, On) Precision constraint δ Filters with a bound of width δ
The wider a bound, the more restrictive a filter and consequently the more imprecise the query answers.
Cons Multiple CQs are issued on one object. If the
smallest bound width is chosen for the filter, the higher update stream rate may be wasted on a few CQs.
Data updates rate and magnitudes not counted.
Sotomayor - Xu 37
System structure Data source Filters Stream coordinator Precision manager Bound cache CQ evaluator
Sotomayor - Xu 38
System structure
Sotomayor - Xu 39
Adaptive filter setting algorithm Goal: set bound widths for steam filters adaptively to
reduce communication costs while guaranteeing the precision constraints of CQs AVG queries analyzed only
Q1, Q2, …, Qm with sets S1, S2, …, Sm. Sj is a subset of a set of n data objects O1, O2, …, On
Query result Qj :
Precision constraint: Basic idea:
Implicit bound width shrinking Explicit bound width growing
ji SOnii
j
VS ,1
1
jjSOnii SWji
,1
Sotomayor - Xu 40
Bound shrinking Filtering bound width Wi for object
Oi Maintained both at the central stream
coordinator and at the source filter Wi Wi · (1 – S) for every Γ time units
Γ: adjustment period S: shrink percentage
Sotomayor - Xu 41
Bound growing
Burden score: the degree to which an object is contributing to the overall communication cost due to streamed updates
where Ci is communication cost for Oi, Wi is the current bound width, and
ii
ii WP
CB
ii NP
Burden target: the lowest overall burden required of the objects in the query in order to meet the precision constraint at all times.
Where Ni is the number of updates of Oi received by the stream coordinator in the last Γ time units
Sotomayor - Xu 42
Bound growing (Cont) Burden deviation: the
degree to which an object is “over-burdened” with respect to the burden targets of the queries that access it.
Queried objects are considered in order of decreasing deviation, and it is assigned the maximum possible bound growth when it is considered.
ji SOmjjii TBD
,1
0,max
jkji SOnk
kjjSOmj
i WSW,1
,1min
Sotomayor - Xu 43
Bound growing (Summary) Each object is assigned a burden score Each query is assigned a burden target by either
averaging burden scores or invoking an iterative linear solver
Each object is assigned a deviation value based on the difference between its burden score and the burden targets of the queries that access it
The objects are considered in order of decreasing deviation, and each object is assigned the maximum possible bound growth when it is considered
Sotomayor - Xu 44
Burden Target Computation Single AVG query Qk over every object O1, …, On.
B1 = B2 = … = Bn = Tk
Or
Intuitive explanation behind this formula Objects having higher than average burden scores will
be given a higher priority for bound width growth to lower their burden scores;
Objects having lower than average burden scores will shrink by default, thereby raising their burden scores.
ki SOnii
kk B
ST
,1
1
Sotomayor - Xu 45
Burden Target Computation (Cont) Multiple queries over different set of objects
θi,j : the portion of object Oi’s burden score corresponding to query Qj and
Goal for adjusting burden scores in presence of overlapping queries is to have the burden score Bi of each object Oi equal the sum of the burden targets of the queries over Oi.
Burden target:
iSOmi ji Bji
,1 ,
ji kiSOni SOjkmkki
j
j TBS
T,1 ,,1
1
Sotomayor - Xu 46
Validation against optimized strategy The adaptive bound width setting algorithm converges on
bounds that are on par with those selected by an optimizer.
Sotomayor - Xu 47
Implementation and experimental validation Single query
Sotomayor - Xu 48
Implementation and experimental validation Multiple queries
Sotomayor - Xu 49
Summary Trade the precision of query results for
lower communication costs. The specification of precision for continuous
queries Adaptive filters
Future work How imprecision propagates through more
complex query plans Develop appropriate optimization
techniques for adapting remote filter predicates in more complex environments
Sotomayor - Xu 50
Conclusion The problem
DSMS must consider the high volume as well as the “burstiness” of data streams
Effectiveness of systems depends on being able to gracefully adapt to environmental conditions (I.e. resource availability)
Two different approaches for adaptivity Minimizing the amount of memory at all
times Controlling the amount of data sent from
multiple data sources
Sotomayor - Xu 51
Conclusion (cont) Chain operator scheduling minimizes the
amount of memory used during execution making the system more adaptable to variation in arrival rates
Adaptive filters reduce the volume of data so that a system can perform efficiently while providing a certain level of precision
Overall, the need for adaptivity in DSMS is necessary due to the unpredictability of data streams
Sotomayor - Xu 52
References J. M. Hellerstein et al. Adaptive Query
Processing: Technology in Evolution. IEEE 2000 B. Babcock, S. Babu, M. Datar, R. Motwani, and
J. Widom. Models and Issues in Data Stream Systems. ACM SIGMOD/PODS 2002 Conference.
B. Babcock, S. Babu, M. Datar, R. Motwani. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. SIGMOD 2003
Chris Olston, Jing Jiang, Jennifer Widom. Adaptive Filters for Continuous Queries Over Distributed Data Streams. SIGMOD 2003.