stream data management system prototypes ying sheng, richard sia june 1, 2004 professor carlo...

48
Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Post on 22-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Stream Data Management System Prototypes

Ying Sheng, Richard SiaJune 1, 2004

Professor Carlo Zaniolo CS 240B

Spring 2004

Page 2: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Outline Motivation of DSMS Aurora (Brown, Brandeis, MIT)

Model Operator Scheduling Storage/Memory Management QoS issue

STREAM (Stanford) System Architecture Query Language Query Plans and Execution Performance Issues Approximation Techniques STREAM Interface

Conclusion

Page 3: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Motivation

HADP DAHP Continuous data and static queries

Monitoring using sensor Military Traffic Environment

Financial analysis Object tracking

Page 4: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora

Page 5: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Model General Purpose DSMS Continuous stream data comes Flow through a set of operators Output to application or materialized

Page 6: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Model Components

Storage manager Scheduler Load Shedder Router QoS Monitor GUI

Page 7: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Model 3 kinds of query supported

Continuous View Ad-Hoc Query

Page 8: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Model 8 primitive operators (Box)

Windowed Slide Tumble Latch Resample

Non-windowed Filter Map GroupBy Join

Page 9: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Operator Optimization Each operator associated with

Selectivity: s(b), sel(b) Computation time: c(b), cost(b)

General Optimization Techniques Pushing projection upstream Combining boxes Reordering boxes

Page 10: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Operator Optimization Case 1 : cost of ab

c(a) + s(a)c(b)

Case 2: cost of ba c(b) + s(b)c(a)

Criteria for switching box position c(a)+s(a)c(b) > c(b)+s(b)c(a)

a b

b a

Page 11: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Operator Scheduling Scheduling by OS

One thread per box, shift the job to OS Easier to program

Aurora Scheduler Single thread for the scheduler The scheduler pick a box with highest priority and

call the box to consume tuples from queue Allow finer control of resource

Scalable !

Page 12: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Operator Scheduling

Page 13: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Operator Scheduling Problem: which box to execute next?

Min-Cost (MC) Reduce computation cost

Min-Latency (ML) Return result as soon as possible

Min-Memory (MM) Reduce memory usage of queue

Page 14: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Operator Scheduling

Example

b4

b5

b6

b2

b3 b1

streams application

Downstream

Page 15: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Operator Scheduling Min-Cost

Objective: avoid overhead of calling boxes

Min-Latency Prefer box which can produce tuples in the output at a

shorter period of time

Min-Memory Give preference to box which will consume more tuples

with less computation time Similar to “Chain Operator Scheduling”

More at:Operator Scheduling in a Data Stream Manager, VLDB 2003

Page 16: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Storage/Memory Management Manage the queue in front of each box

2 boxes sharing the same queue windowed operator

The initial queue size is 128 KB Queues are managed as a circular queue

If overflow, double the queue size, or vice versa

Page 17: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Storage/Memory Management Swap in/out between memory / disk based

on priority of boxes using it

Work with Operator Scheduler to exchange box priority and buffer-state information

Connection Point Management A B-tree indexed on timestamp is built to support

random access of tuples by ad-hoc query

Page 18: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Storage/Memory Management

Page 19: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – QoS Issue Different queries/applications have

different QoS requirement Stock market monitoring Average temperature of a set of sensor

QoS Graph

Page 20: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Latency-based QoS Graph

b

time

QoS

0

D(b)

eol(b)

est(b)

latency(b)

cost(D(b))

Critical Point

Page 21: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – QoS-driven Scheduling Assign priority to each box based on

priority (b) = [utility (b), est (b)] utility (b) = gradient (eol (b))

How is the QoS degrading by the time the tuple leave the system when we process it now.

est (b) How soon it will exhibit another performance

degradation if we don’t process it now.

Performance 200 queries/application, each with 5 boxes Round robin - 0.43 QoS driven scheduling – 0.85

Page 22: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Aurora – Current Status Main components of a DSMS are introduced

Operator scheduler Memory/storage management QoS concept in stress environment Load shedding

Implemented in C++, with Java-based GUI Dependent on a few software/library

More? Distributed architecture – Aurora* Fault tolerance or disaster recovery ?

Page 23: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM

Page 24: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Introduction General-purpose prototype DSMS Supports data streams and stored

relations Declarative language for registering

continuous queries Flexible query plans and execution

strategies Aggressive sharing of state and

computation among queries

Page 25: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Introduction Designed to cope with

Stream rates that may be high, variable, bursty

Continuous query loads that may be high, volatile

Primary coping techniques Graceful approximation as necessary Careful resource allocation and use Continuous self-monitoring and

reoptimization

Page 26: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

DSMS

Scratch Store

STREAM – System Architecture

Input streams

RegisterQuery

StreamedResult

StoredResult

Archive

StoredRelations

Page 27: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Query Language

Continuous Query Language – CQL Extends SQL with

Streams as new data type Stream: Unbounded bag of pairs <tuple,

timestamp> Relation: time-varying bags of tuples

Continuous instead of one-time semantics Three classes of operators

Relation-to-relation Stream-to-relation Relation-to-stream

Page 28: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – CQL Operators Relation-to-relation

SQL constructs Stream-to-relation

Tuple-based sliding window: [Rows N], [Rows Unbounded]

Time-based sliding window: [Range ω], [Now] Partitioned sliding window: [Partition By A1,…Ak

Rows N] Relation-to-stream

Istream: insert stream Dstream: delete stream Rstream: relation stream

Page 29: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Example Query 1

Two example streams:Orders (orderID, customer, cost)Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”:Select Sum(O.cost) From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk =

“Sue” And O.customer = “Joe”

Page 30: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Example Query 2

Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost: Select F.clerk, Max(O.cost) From Orders O, Fulfillments F [Partition By

clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerk

Page 31: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Simplified Query 2

Result is a relation, updated as stream elements arrive:Select F.clerk, Max(O.cost)From O, F [Rows 100]Where O.orderID = F.orderIDGroup By F.clerk

Page 32: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Simplified Query 2

Result is streamed: Emits <clerk, max> stream element whenever max changes for a clerk (or new clerk):Select Istream(F.clerk, Max(O.cost))From O, F [Rows 100]Where O.orderID = F.orderIDGroup By F.clerk

Page 33: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Example Query 3 Relation: CurPrice(stock, price) Average price over last day for

each stock: Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day]Group By stock

Istream provides history of CurPrice Window on history (back to relation),

group and aggregate

Page 34: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Query plans and Execution When a continuous query is registered, generate a

query plan New plan merged with existing plans Users can also create & manipulate plans directly

Plans composed of three main components: Operators

Flag: insertion(+), deletion (-) Elements: tuple-timestamp-flag tuples Streams: only + elements Relations: both + and - elements

Queues Enforce nondecreasing timestamps (“heartbeats”) Mechanisms for buffering tuples

States (Synopses) Global scheduler for plan execution

Page 35: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – States

States (Synopses) Summarize elements seen so far

(exact or approximate) for operators requiring history

To implement windows Example: synopsis join

Sliding-window join Approximation of full join

State1 State2⋈

Page 36: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Simple Query Plan

Select * From S1 [Rows 1000], S2 [Range 2 Minutes]Where S1.A = S2.A

And S1.A > 10

Page 37: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Performance Issues

Synopsis Sharing Eliminate data redundancy

Exploiting Constraints Selectively discard data to reduce

state Operator Scheduling

Reduce queue sizes

Page 38: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Synopsis Sharing

Eliminate redundancy by replacing the nearly identical synopses with

light weight stubs a single store to hold the actual tuples

Store tracks the progress of each stub, presents the appropriate view to each stub.

The store contains the union of its corresponding stubs

Page 39: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Synopsis Sharing

Select * From S1 [Rows 1000], S2 [Range 2

Minutes]Where S1.A = S2.A

And S1.A > 10

Select A, Max(B) From S1 [Rows 200]Group By A

Page 40: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Exploiting Constraints

Specify an adherence parameter k to capture how closely a given stream or sets of streams adheres to a constraint of that type Referential integrity k-constraint Ordered-arrival k-constraint Clustered-arrival k-constraint

Query execution plans reduce or eliminate sate based on k-constraints

If constraint violated, get approximate result

Page 41: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Operator Scheduling Goal:Goal: minimize total queue size for unpredictable,

bursty stream arrival patterns Chain Scheduling Algorithm:Chain Scheduling Algorithm:

1. Mark the first operator in the plan as the “current” operator

2. Find the block of consecutive operators starting at the “current” operator that maximizes the reduction in total queue size per unit time.

3. Mark the first operator following this block as the “current” operator and repeat Step 2 until all operators have been assigned to chains.

4. Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order.

Proven:Proven: within constant factor of any “clairvoyant” strategy, i.e., the optimal strategy based on knowledge of future input, for some queries

Empirical results:Empirical results: large savings over naive strategies for many queries

But minimizing queue sizes is at odds with minimizing latency

Page 42: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Approximation CPU-Limited Approximation

Insufficient CPU time to process each stream element due to the high data arrival rate.

load-shedding sampling operators Approximate by probabilistically dropping

elements before they are processed Memory-Limited Approximation

The total state required for all registered queries exceeds available memory.

The system selectively shrinks or discards synopses.

Page 43: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Query Interface View the structure of query plans the

their component entities. View the detailed properties of each

entity. Dynamically adjust entity properties. View monitoring graphs that display

time-varying entity properties plotted dynamically against time. Queue sizes, throughput, overall memory

usage, and join selectivity.

Page 44: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Query Plan Monitoring

Page 45: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

STREAM – Current Status Version 1.0 up and running Includes a new monitoring and adaptive query

processing infrastructure – StreaMon Executor runs query plans to produce results. Profiler collects and maintains statistics about stream and

plan characteristics. Reoptimizer ensures that the plans and memory structures

are the most efficient for current characteristics. Web demo available at http://shark.stanford.edu:8080/ Future Directions:

Distributed Stream Processing Crash Recovery Improved Approximation Classification of Applications

Page 46: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

Conclusion Ideal DSMS

Well defined and flexible query language User-friendly interface Scalable

Operator scheduling Storage management Synopsis sharing Approximation

Quality assurance Fault tolerant

Page 47: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

References R. Motwani et al., “Query Processing, Approximation, and

Resource Management in a Data Stream Management System”, in proceedings of the 1st CIDR Conference, 2003.

S. Madden et al., “Continuously Adaptive Continuous Queries over Streams”, in proceedings of SIGMOD Conference, 2002

D. Carney et al., “Monitoring Streams - A New Class of Data Management Applications”, in Proceedings of VLDB conference, 2002.

D. Carney et al., “Operator Scheduling in a Data Stream Manager”, in Proceedings of VLDB conference, 2003

Stanford STREAM Project Website: http://www-db.stanford.edu/stream/index.html

Aurora Project Website: http://www.cs.brown.edu/research/aurora

Page 48: Stream Data Management System Prototypes Ying Sheng, Richard Sia June 1, 2004 Professor Carlo Zaniolo CS 240B Spring 2004

End