Beth Plale Computer Science Dept.
Indiana University
Calder: OGSA-DAI access to data in streams
Contributors
• PhD students:
– Nithya Vijayakumar
– Ying Liu
• Funding agencies:
– Department of Energy: Early Career grant
– National Science Foundation: LEAD project
Problem Statement
• Data streams (e.g., from sensor networks, instruments) are growing in pervasiveness and importance
• Bringing streaming sources to a grid by wrapping each sensor and instrument as a grid or web service is a naïve solution.
• It is the data in the streams that is of value to growing groups of users, not the instruments.
– A select few will ‘steer’ instruments
• Need a fresh solution that provides access to collections of data streams.
Outline
• Characterization of data stream systems
Data streams:
• Indefinite sequence of events (or messages, tuples)
• Often time marked
– Generation time, that is, timestamp, and
– Logical time
• Events continuously generated
– pushed or pulled from providers to remote consumers
Types of Data Stream Systems
• Data manipulation systems: process large amounts of data
• Stream routing systems: delivery of events
• Stream detection systems: detect unusual behavior
[Figure: the three system types drawn as overlapping regions, positioned by size and number of stream “chunks” analyzed and by timeliness demands on response]
Stream Routing Systems
• Known by various names
– Publish/subscribe, selective data dissemination, document filtering, message oriented middleware (MOM)
• Decisions made event-by-event
– Set of queries (usually very large number) managed over long time duration, arriving event matched against set of queries.
• Stream routing projects
– Xfilter (UMaryland), Xyleme (INRIA), XPushMachine (UWashington), NaradaBroker (IndianaU), Bayou (XeroxParc), Echo (GeorgiaTech)
Stream Routing Example
[Figure: a stock quote stream matched against long-standing queries, which are added through a user interface]
• Each arriving event matched against set of queries
• Results multicast to owners of satisfied queries
• Timeliness requirement necessitates focus on efficient matching
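The matching step on this slide can be sketched in a few lines of Python. This is only an illustration: the `Router` and `Query` names, the dictionary-shaped events, and the linear scan over queries are assumptions made for the sketch and do not come from any of the systems listed. Real routing systems index the query set so they do not have to test every query per event.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, object]  # hypothetical event shape, e.g. a stock quote


@dataclass
class Query:
    owner: str                          # who receives matching events
    predicate: Callable[[Event], bool]  # long-standing query condition


class Router:
    """Hypothetical event-by-event matcher (linear scan, for illustration)."""

    def __init__(self) -> None:
        self._queries: List[Query] = []

    def subscribe(self, owner: str, predicate: Callable[[Event], bool]) -> None:
        # Queries are added once and stay registered for a long time.
        self._queries.append(Query(owner, predicate))

    def route(self, event: Event) -> List[str]:
        # Match the arriving event against every registered query and
        # return the owners whose queries are satisfied (no duplicates).
        owners: List[str] = []
        for query in self._queries:
            if query.predicate(event) and query.owner not in owners:
                owners.append(query.owner)
        return owners
```

A caller would register predicates such as `lambda e: e["symbol"] == "IBM"` and then call `route` once per arriving quote, multicasting the event to the returned owners.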
Data Manipulation Systems
• Event streams subject to transformation, filtering, aggregation.
• Looser timeliness requirements on results
• Long running queries, often periodic (based on assumption of synchronous streams)
• Results in generation of new streams
• Projects:
– Antarctic Monitoring (UNottingham), sensor network query layer (Cornell), dQUOB (IndianaU), STREAM (Stanford), Fjords (Berkeley), NiagaraCQ (UWisconsin)
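As a rough illustration of filtering, transformation, and aggregation producing a new stream, the following Python sketch chains three generators. The function names, the `(timestamp, value)` event shape, and the tumbling-window mean are assumptions chosen for the example; none of the listed projects is claimed to work this way. The sketch also assumes events arrive in timestamp order.

```python
from typing import Iterable, Iterator, Tuple

Reading = Tuple[float, float]  # hypothetical event: (timestamp, value)


def drop_out_of_range(stream: Iterable[Reading], low: float, high: float) -> Iterator[Reading]:
    # Filtering: keep only readings whose value lies in [low, high].
    for ts, value in stream:
        if low <= value <= high:
            yield (ts, value)


def scale(stream: Iterable[Reading], factor: float) -> Iterator[Reading]:
    # Transformation: rescale every value.
    for ts, value in stream:
        yield (ts, value * factor)


def tumbling_mean(stream: Iterable[Reading], width: float) -> Iterator[Reading]:
    # Aggregation: emit (window_start, mean) for each fixed-width window.
    # Assumes timestamps are non-decreasing.
    current = None
    total, count = 0.0, 0
    for ts, value in stream:
        start = (ts // width) * width
        if current is not None and start != current:
            yield (current, total / count)
            total, count = 0.0, 0
        current = start
        total += value
        count += 1
    if count:
        yield (current, total / count)
```

Composing the stages, for example `tumbling_mean(scale(drop_out_of_range(readings, 0, 100), 2.0), 5)`, yields a new, smaller stream of periodic results.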
Stream Detection Systems
• Event-oriented (versus periodic)
• Less predictable, asynchronous streams
• Intent is to detect anomalous behavior
• Timeliness is critical, time markers key to decision making
• Result is notification message
• Examples:
– R-GMA (EU DataGrid), dQUOB (IndianaU), Conquer (GeorgiaTech), Gigascope (AT&T), Fjords (Berkeley)
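A minimal sketch of event-oriented detection, assuming two simple anomaly rules that are invented for the example: a value above a threshold, and a gap between consecutive timestamps longer than `max_gap`. The `Notification` type and the `detect` function are hypothetical names; the slide does not specify how any of the listed systems define an anomaly.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Notification:
    kind: str         # "gap" or "threshold" (hypothetical categories)
    timestamp: float  # time marker of the event that triggered it
    detail: str


def detect(events: Iterable[Tuple[float, float]],
           threshold: float,
           max_gap: float) -> Iterator[Notification]:
    # Each event is inspected as it arrives; the result is a notification
    # message rather than a new data stream.
    previous_ts: Optional[float] = None
    for ts, value in events:
        if previous_ts is not None and ts - previous_ts > max_gap:
            yield Notification("gap", ts, f"no events for {ts - previous_ts} time units")
        if value > threshold:
            yield Notification("threshold", ts, f"value {value} exceeds {threshold}")
        previous_ts = ts
```

The gap rule shows why time markers matter for this class of system: the decision depends on when events arrived, not only on what they contain.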
Claim: the only stream systems that qualify as a Data Stream Resource are those circled
[Figure: overlapping regions for data manipulation systems, stream routing systems, and stream detection systems]
Justification:
• A data resource has coherence and meaning
• Both systems qualify as a ‘data resource’ because:
-- distributed global snapshot on stream behavior alone
-- distributed global snapshot has meaning and coherence
OGSA-DAI OGSI access to data sources
[Figure: grid data service registries, Grid Data Services (GDS), and rowset grid data services in front of R-DBMS back ends]
• Data description - database description
• Data access - query/update access
• Data management - monitor service
• Data factory - create rowset instance
Database Access using OGSA-DAI OGSI
[Figure: grid data service registry, grid data service factory, Grid Data Service (GDS), rowset grid data services, and an R-DBMS, with the following labeled steps]
• Grid service handle of factory
• createGDS
• handle of GDS
• query
• service create
• response (row-by-row)
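The sequence of labeled steps can be mimicked with a small in-process Python sketch. It is not the OGSA-DAI API: `Registry`, `Factory`, `GridDataService`, and `Rowset` are hypothetical classes that only mirror the roles named on the slide, and an in-memory SQLite database stands in for the R-DBMS.

```python
import sqlite3
from typing import Dict, Iterator, List, Tuple


class Rowset:
    """Hypothetical rowset service: hands results back row by row."""

    def __init__(self, cursor: sqlite3.Cursor) -> None:
        self._cursor = cursor

    def rows(self) -> Iterator[Tuple]:
        for row in self._cursor:
            yield row


class GridDataService:
    """Hypothetical GDS: runs a query and creates a rowset for the result."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def query(self, sql: str) -> Rowset:
        return Rowset(self._conn.execute(sql))


class Factory:
    """Hypothetical factory: creates a GDS bound to one database."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def create_gds(self) -> GridDataService:
        return GridDataService(self._conn)


class Registry:
    """Hypothetical registry: maps a name to a factory handle."""

    def __init__(self) -> None:
        self._factories: Dict[str, Factory] = {}

    def register(self, name: str, factory: Factory) -> None:
        self._factories[name] = factory

    def lookup(self, name: str) -> Factory:
        return self._factories[name]


def run_example() -> List[Tuple]:
    conn = sqlite3.connect(":memory:")  # stand-in for the R-DBMS
    conn.execute("CREATE TABLE obs (id INTEGER, label TEXT)")
    conn.executemany("INSERT INTO obs VALUES (?, ?)", [(1, "a"), (2, "b")])
    registry = Registry()
    registry.register("obs-db", Factory(conn))
    factory = registry.lookup("obs-db")   # handle of factory
    gds = factory.create_gds()            # createGDS, handle of GDS
    rowset = gds.query("SELECT id, label FROM obs ORDER BY id")  # query
    return list(rowset.rows())            # response (row-by-row)
```

The point of the sketch is the indirection: the client never touches the database directly, only the handles it obtains from the registry and the factory.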
Calder: presents a DB interface to the application and supports multiple stream systems (like OGSA-DAI)
Specifying Long-Running Queries: An Example
Select from 3 data types: CAPS radar, ACARS data, and NEXRAD Doppler level II.
Filter out events not falling within 80 mile radius around New Orleans.
Execute beginning immediately and terminating execution in 1 hour.
Set sliding window size (RANGE) as window over which joins are carried out.
SELECT (caps, acars, nexradII) WHERE REGION(90W, 30N, 62.5KM) START now EXPIRE 1hr RANGE 6 min
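To make the WHERE REGION clause concrete, here is a minimal Python sketch of a spatial filter. It assumes that REGION(90W, 30N, 62.5KM) denotes a circle of the given radius centered on that longitude and latitude, and it uses the haversine great-circle distance on a spherical Earth. Both are assumptions for illustration, and `distance_km` and `in_region` are hypothetical helper names rather than part of Calder.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    # Haversine great-circle distance between two points given in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_region(lat: float, lon: float,
              center_lat: float = 30.0,   # 30N from the example query
              center_lon: float = -90.0,  # 90W from the example query
              radius_km: float = 62.5) -> bool:
    # Assumed semantics: keep events located within radius_km of the center.
    return distance_km(lat, lon, center_lat, center_lon) <= radius_km
```

An event half a degree north of the center (about 56 km away) would pass the filter, while one a full degree north (about 111 km away) would be dropped.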
Retrieving results from rowset service: Rowset Request API
Results obtained through request to rowset service of form:
-- single event based on timestamp,
-- range of events bounded by time range,
-- most recent n events, or
-- stream of events.
getTuple(timestamp, ringBufferID)
getRangeTuple(startTS, endTS, ringBufferID)
getMostRecent(lastRecent, num_events, ringBufferID)
getStream(ringBufferID)
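The four request forms can be illustrated with an in-memory ring buffer. The sketch below is an assumption-laden stand-in, not the Calder rowset service: the `RowsetService` class, its `append` method, and the `(timestamp, payload)` event shape are hypothetical. The `lastRecent` argument of getMostRecent is omitted because its meaning is not stated on the slide.

```python
from collections import deque
from typing import Deque, Dict, Iterator, List, Optional, Tuple

Event = Tuple[float, object]  # hypothetical event: (timestamp, payload)


class RowsetService:
    """Hypothetical in-memory rowset service with one ring buffer per ID."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._buffers: Dict[str, Deque[Event]] = {}

    def append(self, ring_buffer_id: str, event: Event) -> None:
        # deque(maxlen=...) discards the oldest event once the buffer is full.
        buf = self._buffers.setdefault(ring_buffer_id, deque(maxlen=self._capacity))
        buf.append(event)

    def get_tuple(self, timestamp: float, ring_buffer_id: str) -> Optional[Event]:
        # Single event based on timestamp, or None if it is no longer buffered.
        for event in self._buffers.get(ring_buffer_id, ()):
            if event[0] == timestamp:
                return event
        return None

    def get_range_tuple(self, start_ts: float, end_ts: float, ring_buffer_id: str) -> List[Event]:
        # Range of events bounded by an inclusive time range.
        return [e for e in self._buffers.get(ring_buffer_id, ()) if start_ts <= e[0] <= end_ts]

    def get_most_recent(self, num_events: int, ring_buffer_id: str) -> List[Event]:
        # Most recent n events, oldest first.
        if num_events <= 0:
            return []
        return list(self._buffers.get(ring_buffer_id, ()))[-num_events:]

    def get_stream(self, ring_buffer_id: str) -> Iterator[Event]:
        # Stream of events: here, a snapshot of the buffer contents in order.
        yield from list(self._buffers.get(ring_buffer_id, ()))
```

Because the buffer is bounded, a request for an old timestamp can legitimately come back empty; a client should treat that as "no longer available" rather than as an error.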
Experimental Environment
• GDS, rowset server, stream processing server
– Dual Dell 2.8 GHz Precision workstation, 2 GB memory, RHEL
• Planner
– Solaris UltraSPARC 502MHz, 1GB memory, SunOS 5.8
• Computational mesh (1 node)
– Xeon Intel 2.8 GHz, 2GB RAM, RedHat 8.0
• 1 Gbps switched Ethernet network
• OGSA-DAI OGSI v4.0
• dQUOB v1.0
Evaluation
• Benchmark following steps
– GDS setup - OGSA-DAI factory call
– Query plan time - plan how query is to be distributed
– Query compile - compile into portable script form
– Distribute query - deploy script into computational mesh (1 node in test)
– Ring buffer setup - allocate space in rowset service, return handle to GDS.
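A per-step timing harness of the kind these steps call for might look like the sketch below. It is a generic illustration, not the harness used for the reported measurements: the `benchmark` function is a hypothetical name, and the callables passed in are placeholders for the real setup, planning, compile, distribution, and ring buffer operations.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(steps: Dict[str, Callable[[], None]], runs: int = 100) -> Dict[str, float]:
    # Time each named step separately on every run, then average per step.
    samples: Dict[str, List[float]] = {name: [] for name in steps}
    for _ in range(runs):
        for name, step in steps.items():
            started = time.perf_counter()
            step()
            samples[name].append(time.perf_counter() - started)
    return {name: mean(values) for name, values in samples.items()}
```

A caller would pass a dictionary such as `{"GDS setup": setup_gds, "Query plan": plan_query}`, where each value is a zero-argument function wrapping the corresponding step, and read back the mean time in seconds for each.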
Benchmark results: average taken over 100 different runs
Time interval (a) obtained by getRangeTuple(startTS, endTS, ringBufferID) drops as the number of queries at the node increases; turnaround increases
Related Research
• Common Instrument Middleware Architecture (CIMA), McMullen, Bramley et al.
• GATES
– Aggarwal, HPDC04
• DataCutter, Saltz
• Grid Stream Database Manager (GSDM)
– Koparanova and Risch, AxGrids 2003
– Stores into O-O DBMS
• DQP
– Manchester and Newcastle
• DB community: Widom, Borealis, NiagaraCQ
Summary
• Model of data stream store provides conceptual framework for retrieving data from streams by means of rich and meaningful queries
• GGF DAIS framework of OGSA-DAI is intuitive abstraction for accessing data stream store
• It is not about the instruments and sensors … it is about the streams!
[Figure: Real-time WRF executed on Grid when environment primed and storms present; on-demand resource scheduling]
Parting Views
• Our work brings data streams to the grid in a way that matches how users intuitively think about accessing data resources
• Querying heterogeneous data management tools is a difficult problem. Heterogeneous query languages will exist
– I.e., continuous query languages will always be different from database query languages.
• Scientists need a lot of coaching to understand why their monolithic “pile of perl” is not a grid service, and further why they should invest in breaking up that “pile of perl”.
http://www.cs.indiana.edu/dde/projects/Calder.html
Calder extends dQUOB (http://www.cs.indiana.edu/dde/projects/dquob.html).
dQUOB v1.0, which is available for release as open source, includes the stream processing system components of Calder.