
Page 1: Real-Time Querying of Live and Historical Stream Data

Joe Hellerstein, UC Berkeley

Page 2: Joint Work

• Fred Reiss, UC Berkeley (IBM Almaden)

• Kurt Stockinger, Kesheng Wu, Arie Shoshani, Lawrence Berkeley National Lab

Page 3: Outline

• A challenging stream query problem
– Real-world example: US DOE network monitoring

• Open-Source Components
– Stream Query Engine: TelegraphCQ
– Data Warehousing store: FastBit

• Performance Study
– Stream Analysis, Load, Lookup

• Handling Bursts: Data Triage


Page 5: Agenda

• Study a practical application of stream queries
– High data rates
– Data-rich: needs to consult “history”

• Obvious settings
– Financial
– System Monitoring

• Keep it real

Page 6: DOE Network Monitoring

• U.S. Department of Energy (DOE) runs a nationwide network of laboratories
– Including our colleagues at LBL

• Labs send data over a number of long-haul networks

• DOE is building a nationwide network operations center
– Need software to help operators monitor network security and reliability

Page 7: Monitoring infrastructure [diagram not reproduced in this transcript]

Page 8: Challenges

• Live Streams
– Continuous queries over unpredictable streams

• Archival Streams
– Load/index all data
– Access on demand as part of continuous queries


Page 9: Outline

• A challenging stream query problem
– Real-world example: US DOE network monitoring

• Open-Source Components
– Stream Query Engine: TelegraphCQ
– Data Warehousing store: FastBit

• Performance Study
– Stream Analysis, Load, Lookup

• Handling Bursts: Data Triage

Page 10: Telegraph Project

• 1999-2006, joint with Mike Franklin
– An “adaptive dataflow” system

• v1 in 2000: Java-based
– Deep-Web Bush/Gore demo presages live web mashups

• v2: TelegraphCQ (TCQ), a rewrite of PostgreSQL
– External data & streams
– Open source with active users, mostly in network monitoring
– Commercialization at Truviso, Inc.

• “Big” academic software
– 2 faculty, 9 PhDs, 3 MS

Page 11: Some Key Telegraph Features

• Eddies: Continuous Query Optimization
– Reoptimize queries at any point in execution

• FLuX: Fault-Tolerant Load-balanced eXchange
– Cluster parallelism with high availability

• Shared query processing
– Query processing is a join of queries and data

• Data Triage
– Robust statistical approximation under stress

• See http://telegraph.cs.berkeley.edu for more

Page 12: FastBit

Background

• Vertically-partitioned relational store with bitmap indexes
– Word-Aligned Hybrid (WAH) compression, tuned for CPU efficiency as well as disk bandwidth (a toy sketch follows below)

• LBL internal project since 2004

• Open source (LGPL) in 2008
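To make the WAH idea concrete, here is a toy Python sketch of word-aligned run-length coding. It is illustrative only, not FastBit's encoder: real WAH packs 31-bit payloads into 32-bit machine words and is implemented in carefully tuned C++.

# Toy word-aligned run-length encoding in the spirit of WAH.
GROUP = 31  # payload bits per code word

def wah_encode(bits):
    """Compress a list of 0/1 bits into code words:
    ('lit', group) -- one 31-bit group stored verbatim
    ('fill', b, n) -- n consecutive groups that are all bit b
    """
    words = []
    for i in range(0, len(bits), GROUP):
        group = bits[i:i + GROUP]
        if len(group) == GROUP and len(set(group)) == 1:  # uniform group
            b = group[0]
            if words and words[-1][0] == 'fill' and words[-1][1] == b:
                words[-1] = ('fill', b, words[-1][2] + 1)  # extend the run
            else:
                words.append(('fill', b, 1))
        else:
            words.append(('lit', group))
    return words

def wah_decode(words):
    """Inverse of wah_encode, for checking round trips."""
    bits = []
    for w in words:
        if w[0] == 'lit':
            bits.extend(w[1])
        else:
            bits.extend([w[1]] * (GROUP * w[2]))
    return bits

# Long runs of identical bits (common in bitmap indexes) collapse to one word.
bitmap = [0] * 310 + [1, 0, 1] + [1] * 59
assert wah_decode(wah_encode(bitmap)) == bitmap

Because runs stay word-aligned, bitwise AND/OR can operate directly on the compressed words, which is the source of WAH's CPU efficiency.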

Page 13: Introduction to Bitmap Indexing

Relational Table (first six rows of the example):

Row   Column 1   Column 2
 1       3          3
 2       2          4
 3       1          2
 4       5          5
 5       4          1
 6       3          2
 …       …          …

Bitmap Index (one bitmap per distinct value 1–5; bit i is set when row i holds that value):

Column 1:  value 1: 001000…  value 2: 010000…  value 3: 100001…  value 4: 000010…  value 5: 000100…
Column 2:  value 1: 000010…  value 2: 001001…  value 3: 100000…  value 4: 010000…  value 5: 000100…

Page 14: Why Bitmap Indexes?

• Fast incremental appending
– One index per stream
– No sorting or hashing of the input data

• Efficient multidimensional range lookups
– Example: Find the number of sessions from prefix 192.168/16 with size between 100 and 200 bytes

• Efficient batch lookups
– Can retrieve entries for multiple keys in a single pass over the bitmap (see the sketch below)
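These operations are easy to see in a minimal sketch. The Python below builds an uncompressed, equality-encoded bitmap index (one bitmap per distinct value) over Column 1 of the Page 13 example. The class and method names are hypothetical; FastBit itself adds WAH compression, binning, and a C++ API.

# Minimal equality-encoded bitmap index for one column.
from collections import defaultdict

class BitmapIndex:
    def __init__(self):
        self.bitmaps = defaultdict(list)  # value -> list of 0/1 bits
        self.n_rows = 0

    def append(self, value):
        """Appending needs no sorting or hashing of prior data: pad the
        existing bitmaps with 0 and set the new row's bit in one bitmap."""
        for bm in self.bitmaps.values():
            bm.append(0)
        bm = self.bitmaps[value]
        bm.extend([0] * (self.n_rows + 1 - len(bm)))  # first sighting of value
        bm[self.n_rows] = 1
        self.n_rows += 1

    def range_lookup(self, lo, hi):
        """OR together the bitmaps for values in [lo, hi]; the result marks
        matching rows without touching the base data."""
        result = [0] * self.n_rows
        for value, bm in self.bitmaps.items():
            if lo <= value <= hi:
                result = [r | b for r, b in zip(result, bm)]
        return result

idx = BitmapIndex()
for v in [3, 2, 1, 5, 4, 3]:       # Column 1 from the Page 13 example
    idx.append(v)
print(idx.range_lookup(3, 5))      # -> [1, 0, 0, 1, 1, 1]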

Page 15: Outline

• A challenging stream query problem
– Real-world example: US DOE network monitoring

• Open-Source Components
– Stream Query Engine: TelegraphCQ
– Data Warehousing store: FastBit

• Performance Study
– Stream Analysis, Load, Lookup

• Handling Bursts: Data Triage

Page 16: The DOE dataset

• 42 weeks, 08/2004 – 06/2005
– Est. 1/30 of DOE traffic

• Projection:
– 15K records/sec typical
– 1.7M records/sec peak

• TPC-C today
– 4,092,799 tpmC (2/27/07), i.e. 68.2K transactions per sec

• 1.5 orders of magnitude needed! (peak 1.7M / 68.2K ≈ 25x)
– And please touch 2 orders of magnitude more random data, too
– But: append-only updates + streaming queries (temporal locality)

Page 17: Our Focus: Flagging abnormal traffic

• When a network is behaving abnormally…
– Notify network operators
– Trigger in-depth analysis or countermeasures

• How to detect abnormal behavior?
– Analyze multiple aspects of live monitoring data
– Compute relevant information about “normal” behavior from historical monitoring data
– Compare current behavior against this baseline

Page 18: Example: “Elephants”

• The query:
– Find the k most significant sources of traffic on the network over the past t seconds.
– Alert the network operator if any of these sources is sending an unusually large amount of traffic for this time of the day/week, compared with its usual traffic patterns. (A sketch of this pattern follows below.)
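As a rough Python sketch of the pattern (all names are hypothetical stand-ins: in the actual system the windowed aggregate runs as a TelegraphCQ continuous query, and baseline_for() stands in for a FastBit lookup over the historical archive):

# Sketch of the "Elephants" query pattern.
import heapq
from collections import Counter

def top_talkers(window_records, k):
    """Total bytes per source over the past t seconds; return the k heaviest."""
    bytes_by_src = Counter()
    for rec in window_records:
        bytes_by_src[rec['src_ip']] += rec['bytes']
    return heapq.nlargest(k, bytes_by_src.items(), key=lambda kv: kv[1])

def flag_elephants(window_records, history, k=10, n_sigma=3.0):
    """Alert when a current top-k source exceeds its usual traffic for this
    time of day/week by n_sigma standard deviations."""
    alerts = []
    for src, cur_bytes in top_talkers(window_records, k):
        mean, stdev = history.baseline_for(src)  # historical (FastBit) lookup
        if cur_bytes > mean + n_sigma * stdev:
            alerts.append((src, cur_bytes, mean))
    return alerts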

Page 19: System Architecture [diagram not reproduced in this transcript]

Page 20: Query Workload

• Five monitoring queries, based on discussions with network researchers and practitioners

• Each query has three parts:
– Analyze the flow record stream (TelegraphCQ query)
– Retrieve and analyze relevant historical monitoring data (FastBit query)
– Compare current behavior against the baseline

Page 21: Query Workload Summary

• Elephants
– Find heavy sources of network traffic that are not normally heavy network users

• Mice
– Examine the current behavior of hosts that normally send very little traffic

• Portscans
– Find hosts that appear to be probing for vulnerable network ports
– Filter out “suspicious” behavior that is actually normal

• Anomaly Detection
– Compare the current traffic matrix (all source, destination pairs) against past traffic patterns

• Dispersion
– Retrieve historical traffic data for sub-networks that exhibit suspicious timing patterns

• Full queries are in the paper

Page 22: Best-Case Numbers

• Single PC, dual 2.8GHz single-core Pentium 4, 2GB RAM, IDE RAID (60 MB/sec throughput)

• TCQ performance up to 25K records/sec
– Depends heavily on the query, esp. window size

• FastBit can load 213K tuples/sec
– NW packet trace schema
– Depends on batch size: 10M tuples per batch

• FastBit can “fetch” 5M records/sec
– 8 bytes of output per record only! 40MB/sec, near RAID I/O throughput

• Best end-to-end: 20K tuples/sec
– Recall the desired 15K tuples/sec steady state, 1.7M tuples/sec burst

Pages 23–24: Streaming Query Processing [performance charts not reproduced]

Pages 25–26: Index Insertion [performance charts not reproduced]

Pages 27–28: Index Lookup [performance charts not reproduced]

Pages 29–30: End-to-End Throughput [performance charts not reproduced]

Page 31: Summary of DOE Results

• With a sufficiently large load window, FastBit can handle expected peak data rates

• Streaming query processing becomes the bottleneck
– Next step: Data Triage

Page 32: Outline

• A challenging stream query problem
– Real-world example: US DOE network monitoring

• Open-Source Components
– Stream Query Engine: TelegraphCQ
– Data Warehousing store: FastBit

• Performance Study
– Stream Analysis, Load, Lookup

• Handling Bursts: Data Triage

Page 33: Data Triage

• Provision for the typical data rate

• Fall back on approximation during bursts
– But always do as much “exact” work as you can!

• Benefits:
– Monitor fast links with cheap hardware
– Focus on query processing features, not speed
– Graceful degradation during bursts
– 100% result accuracy most of the time

Page 34: Data Triage

• Bursty data goes to the triage process first

• Place a triage queue in front of each data source

• Summarize excess tuples to prevent missing deadlines (a sketch of this admission logic follows below)

[Diagram of the triage process: packets undergo initial parsing and filtering into relational tuples, which enter a triage queue; tuples flow from the queue to the query engine, while triaged tuples go to a summarizer that sends summaries of triaged tuples to the query engine]
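A minimal Python sketch of the admission logic, with hypothetical names and a simple Counter as the summary (the actual design supports pluggable summary types such as histograms, sketches, and samples):

# Sketch of the Data Triage admission logic.
from collections import Counter, deque

class TriageQueue:
    def __init__(self, capacity):
        self.exact = deque()       # tuples the engine will process exactly
        self.capacity = capacity
        self.summary = Counter()   # cheap summary of triaged tuples

    def offer(self, tup, key):
        """Keep the tuple exactly while there is room; once the queue is
        full, fold it into the summary rather than dropping it."""
        if len(self.exact) < self.capacity:
            self.exact.append(tup)
        else:
            self.summary[key] += 1   # triaged, to be approximated later

    def drain(self):
        """Hand the engine the exact tuples (main query) plus the summary
        of triaged tuples (shadow query), then reset."""
        exact, self.exact = list(self.exact), deque()
        summary, self.summary = self.summary, Counter()
        return exact, summary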

Page 35: Data Triage

• Query engine receives tuples and summaries

• Use a shadow query to compute an approximation of the missing results

[Diagram of the query engine: relational tuples feed the main query, and summaries of triaged tuples feed the shadow query, which produces summaries of missing results; a merge step combines both and delivers results to the user]

Page 36: Read the paper for…

• Provisioning
– Where are the performance bottlenecks in this pipeline?
– How do we mitigate those bottlenecks?

• Implementation
– How do we “plug in” different approximation schemes without modifying the query engine?
– How do we build shadow queries?

• Interface
– How do we present the merged query results to the user?

Page 37: Delay Constraints

• A delay constraint bounds result latency: all results from a window must be delivered within the constraint after that window closes (a sketch of the deadline check follows below)

[Timeline diagram: Window 1 and Window 2 on a time axis; “All results from this window… must be delivered by this time”, the two points separated by the delay constraint]
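A small Python sketch of the deadline check, under the illustrative assumption (not spelled out on this slide) that triage is triggered when processing the backlog exactly would overrun the window's deadline:

# Sketch of a delay-constraint check; names and costs are illustrative.
import time

DELAY_CONSTRAINT = 10.0   # seconds, as in "limit delay to '10 seconds'"

def deadline(window_end):
    """All results for a window are due this long after the window closes."""
    return window_end + DELAY_CONSTRAINT

def must_triage(window_end, backlog_len, per_tuple_cost):
    """Start summarizing when processing the backlog exactly would miss
    the window's deadline."""
    projected_finish = time.time() + backlog_len * per_tuple_cost
    return projected_finish > deadline(window_end)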

Page 38: Experiments

• System
– Data Triage implemented on TelegraphCQ
– Pentium 3 server, 1.5 GB of memory

• Data stream
– Timing-accurate playback of real network traffic from the www.lbl.gov web server
– Trace sped up 10x to simulate an embedded CPU

• Test query:

select W.adminContact,
       avg(P.length) as avgLength,
       stdev(P.length) as stdevLength,
       wtime(*) as windowTime
from Packet P [range '1 min' slide '1 min'],
     WHOIS W
where P.srcIP > WHOIS.minIP
  and P.srcIP < WHOIS.maxIP
group by W.adminContact
limit delay to '10 seconds';

Page 39: Experimental Results: Latency [chart not reproduced]

Page 40: Experimental Results: Accuracy

• Compare accuracy of Data Triage with previous work

• Comparison 1: Drop excess tuples
– Both methods using a 5-second delay constraint
– No summarization

Page 41: Experimental Results: Accuracy

• Comparison 2: Summarize all tuples
– Reservoir sample
– 5-second delay constraint
– Size of reservoir = number of tuples the query engine can process in 5 sec (a sketch of reservoir sampling follows below)
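For reference, the summarizer in this comparison is classic reservoir sampling; a minimal Python sketch of Vitter's Algorithm R, where k would be set to the number of tuples the engine can process within the delay constraint:

# Reservoir sampling (Vitter's Algorithm R).
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items from a stream of unknown length,
    in one pass and O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)  # each item survives with prob k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir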

Page 42: Conclusions & Issues

• Stream Query + Append-Only Warehouse
– Good match, can scale a long way at modest $$

• Data Triage combats the Stream Query bottleneck
– Provision for the common case
– Approximate on excess load during bursts
– Keep approximation limited, extensible

• Parallelism needed
– See FLuX work for High Availability (Shah et al, ICDE ’03 and SIGMOD ’04)
– vs. Google’s MapReduce

• Query Optimization for streams? Adaptivity!
– Eddies: Avnur & Hellerstein, SIGMOD ’00
– SteMs: Raman, Deshpande, Hellerstein, ICDE ’03
– STAIRs: Deshpande & Hellerstein, VLDB ’04
– Deshpande/Ives/Raman survey, F&T-DB ’07

Page 43: More?

• http://telegraph.cs.berkeley.edu

• Frederick Reiss and Joseph M. Hellerstein. “Declarative Network Monitoring with an Underprovisioned Query Processor”. ICDE 2006.

• F. Reiss, K. Stockinger, K. Wu, A. Shoshani, J. M. Hellerstein. “Enabling Real-Time Querying of Live and Historical Stream Data”. SSDBM 2007.

• Frederick Reiss. “Data Triage”. Ph.D. thesis, UC Berkeley, 2007.