
Computer Science

Characterizing and Exploiting Reference Locality in Data Stream Applications

Feifei Li, Ching Chang, George Kollios, Azer Bestavros
Computer Science Department
Boston University

Data Stream Management System

[Figure: Data Stream Management System (DSMS). Applications issue queries (e.g., joins over two streams) to the query processor, which returns results. Memory holds the selected tuples, chosen to maximize the query metrics; the remaining tuples are labeled as unselected.]

Observations

Storage / computation limitation: the full contents of the tuples of interest cannot be stored in memory.

Cast as a “caching” problem: query processing with a memory constraint.

“Caching” Problem in DSMS

[Figure: sliding window joins, annotated with the window size and the memory size.]

Which tuples should be stored to maximize the size of the join result?

Locality of reference properties (Denning & Schwartz)

Locality-Aware Algorithms

[Figure: our locality-aware algorithms compared with previous algorithms.]

Our Contributions

Cast query processing with memory constraints in a DSMS as a “caching” problem, and analyze the two causes of reference locality.

Provide a mathematical model that characterizes the reference locality in data streams, together with a simple method to infer it.

Show how to improve the performance of data stream applications with locality-aware algorithms.

Reference Locality - Definition

In a data stream, recently appearing tuples have a high probability of appearing again in the near future.

Inter-Arrival Distance (IAD): a random variable that corresponds to the number of tuples separating consecutive appearances of the same tuple.

Example stream: 2 2 4 10 4 10 7 7 4 ……

IAD values: 0, 1, 1, 0, 3 (for the repeats of 2, 4, 10, 7 and 4, respectively)

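Calculate distribution of IAD

$$d_i(k) = P(x_{n+k} = i,\; x_{n+j} \neq i \text{ for } j = 1, \ldots, k-1 \mid x_n = i)$$

$$d(k) = \sum_i p_i \, d_i(k)$$

where $p_i$ is the frequency of value $i$ in this data stream.

[Figure: stream … i a c e b a i ……, where $x_n = i$ and $x_{n+k} = i$ are consecutive appearances of $i$; the distance is $k$.]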
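The IAD distribution can be estimated directly from a finite trace. The following is a minimal sketch, not the authors' code; the function name `iad_distribution` is hypothetical. It assumes the convention of the formula for d_i(k), where the distance is the index difference between consecutive appearances, so an immediate repeat has distance 1. Under the "number of separating tuples" wording, each distance would be one smaller.

```python
from collections import Counter

def iad_distribution(stream):
    """Return {k: fraction of repeat arrivals whose index gap to the
    previous appearance of the same value is k}."""
    last_seen = {}     # value -> index of its most recent appearance
    gaps = Counter()   # k -> number of observed gaps of length k
    for n, value in enumerate(stream):
        if value in last_seen:
            gaps[n - last_seen[value]] += 1
        last_seen[value] = n
    total = sum(gaps.values())
    if total == 0:     # no value ever repeats: nothing to estimate
        return {}
    return {k: count / total for k, count in sorted(gaps.items())}

# Short example stream (assumed reading of the slide's example).
example = [2, 2, 4, 10, 4, 10, 7, 7, 4]
print(iad_distribution(example))
```

Because every repeat arrival contributes one sample, popular values contribute more samples, which approximates the frequency weighting by p_i in d(k).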

Sources of Reference Locality

Long-term popularity vs. Short-term correlation (web traces, Bestavros and Crovella)

For example, stock traces:

MS MS IBM IBM MS GG IBM MS MS ……
Reference locality due to long-term popularity

A MS MS A A GG GG MS IBM ……
Reference locality due to short-term correlation (George’s Company A listed today!)

Independent Reference Model

With the independent, identically-distributed (IID) assumption:

$$d_i(k) = p_i (1 - p_i)^{k-1}$$

$$d(k) = \sum_{i=1}^{N} p_i \, d_i(k) = \sum_{i=1}^{N} p_i^2 (1 - p_i)^{k-1}$$

Problem: this only captures reference locality due to a skewed popularity profile.
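To see how the IID formula reacts to skew, the sketch below evaluates d(k) = Σ p_i² (1 − p_i)^(k−1) for a uniform profile and for a Zipf-like one. The helper names `irm_iad` and `zipf_profile`, the domain size of 100 and the Zipf exponent are illustrative assumptions, not values from the slides.

```python
def irm_iad(p, k):
    """d(k) under the independent reference model for popularity profile p."""
    return sum(pi * pi * (1.0 - pi) ** (k - 1) for pi in p)

def zipf_profile(n, s=1.0):
    """Hypothetical Zipf-like profile over n values: p_i proportional to 1 / i**s."""
    weights = [1.0 / (i ** s) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

uniform = [1.0 / 100] * 100
skewed = zipf_profile(100)

# Short distances are more likely under the skewed profile,
# even though both streams are IID.
for k in (1, 2, 5, 10):
    print(k, irm_iad(uniform, k), irm_iad(skewed, k))
```

The skewed profile puts more mass on small distances, which is the locality due to popularity alone; short-term correlation is invisible to this model.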

Metrics of Reference Locality

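How to distinguish the two causes of reference locality?

Compare the IAD distribution of the original stream with that of a random permutation of it! The permutation keeps the popularity profile but breaks up short-term correlation.

A MS MS A A GG GG MS IBM ……
Original Data Stream S

MS MS MS A GG IBM IBM MS IBM ……
Random Permutation of S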
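The comparison can be tried on a toy stream. This is a sketch under stated assumptions: the bursty stream is synthetic, the helper names `iad_cdf` and `permuted` are hypothetical, and distances are index differences (an immediate repeat has distance 1).

```python
import random
from collections import Counter

def iad_cdf(stream, k_max):
    """CDF of the inter-arrival distance for k = 1..k_max.
    Assumes at least one value repeats in the stream."""
    last_seen, gaps = {}, Counter()
    for n, value in enumerate(stream):
        if value in last_seen:
            gaps[n - last_seen[value]] += 1
        last_seen[value] = n
    total = sum(gaps.values())
    cdf, running = [], 0
    for k in range(1, k_max + 1):
        running += gaps[k]
        cdf.append(running / total)
    return cdf

def permuted(stream, seed=0):
    """Random permutation of the stream: same popularity, no temporal order."""
    shuffled = list(stream)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Synthetic stream with bursts of 5 identical symbols (short-term correlation).
bursty = [sym for sym in ["MS", "IBM", "GG", "A"] * 25 for _ in range(5)]

print("original:", iad_cdf(bursty, 3))
print("permuted:", iad_cdf(permuted(bursty), 3))
```

The gap between the two CDFs at small k is the part of the locality that popularity alone does not explain.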

Stock Transaction Traces

Daily stock transaction data from INET ATS, Inc.

Zipf-like Popularity Profile (log-log scale)

Stock Transaction Traces

CDF of IAD for Original and Randomly Permuted Traces

The permuted trace still has strong reference locality, due to the skewed popularity distribution.

Network OD Flow Traces

Network traces of Origin-Destination (OD) flows in two major networks: US Abilene and Sprint-Europe

Zipf-like Popularity Profile (log-log scale)

Network OD Flow Traces

CDF of IAD for Original and Randomly Permuted Traces

Outline

Motivation
Reference Locality: source and metrics
A Locality-Aware Data Stream Model
Application of Locality-Aware Model
  Max-subset Join
  Approximate count estimation
  Data summarization
Performance Study
Conclusion

Locality-Aware Stream Model

[Figure: stream S (2 2 5 10 4 10 7 7 …), with the recent h tuples $x_{n-h}, \ldots, x_{n-1}$ marked. The next tuple $x_n$ either copies one of the recent h tuples of S, e.g. $P(x_n = x_{n-4}) = a_4$, or is drawn from the popularity distribution P of S, e.g. $P(x_n = 2 \text{ from popularity profile}) = b \cdot p(2)$.]


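$$X_n = \begin{cases} X_{n-i} & \text{with probability } a_i \\ Y & \text{with probability } b \end{cases}$$

where $1 \le i \le h$, $Y$ is an IID random variable w.r.t. $P$, and $b + \sum_{i=1}^{h} a_i = 1$.

$$P(x_n = c \mid x_{n-1}, \ldots, x_{n-h}) = b\,P(c) + \sum_{j=1}^{h} a_j \, \delta(x_{n-j}, c)$$

where $\delta(x_k, c) = 1$ if $x_k = c$, and 0 otherwise.

A similar model appears for caching of web traces, e.g., Konstantinos Psounis et al.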
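The model is generative, so a synthetic stream can be sampled from it. The sketch below is an illustration under assumptions: the function name `generate_stream`, the bootstrap of the first h tuples from the popularity profile and the example parameters are not from the slides.

```python
import random

def generate_stream(a, b, popularity, length, seed=0):
    """Sample a stream where x_n copies x_{n-i} with probability a[i-1],
    or is drawn from the popularity profile with probability b.
    Expects b + sum(a) == 1; popularity maps value -> probability."""
    rng = random.Random(seed)
    h = len(a)
    values = list(popularity.keys())
    weights = list(popularity.values())
    # Assumption: bootstrap the first h tuples from the popularity profile.
    stream = rng.choices(values, weights=weights, k=h)
    while len(stream) < length:
        # Option i < h means "copy the tuple i+1 positions back";
        # option h means "draw Y from the popularity profile".
        option = rng.choices(range(h + 1), weights=list(a) + [b], k=1)[0]
        if option < h:
            stream.append(stream[-(option + 1)])
        else:
            stream.append(rng.choices(values, weights=weights, k=1)[0])
    return stream[:length]

profile = {"MS": 0.5, "IBM": 0.3, "GG": 0.2}
print(generate_stream(a=[0.4, 0.2], b=0.4, popularity=profile, length=20))
```

Setting b = 1 (and all a_i = 0) reduces this to the independent reference model; smaller b gives stronger short-term correlation.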

Infer the Model

Expected value for $x_n$:

$$x_n^* = \sum_{j=1}^{h} a_j \, x_{n-j} + b \sum_{i=1}^{D} i \, P(i)$$

Least squares method: minimize over $a_1, \ldots, a_h, b$:

$$\sum_{i=h+1}^{N} \left[ x_i - x_i^* \right]^2$$

Make N observations and infer the h+1 parameters $a_i$ and $b$.
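The least squares fit can be sketched in plain Python by solving the normal equations. This is an unconstrained illustration only: it does not enforce that the parameters are non-negative or sum to one, it takes P to be the empirical popularity of the observed stream (so the sum of i·P(i) is the sample mean), and the names `solve_linear` and `infer_model` are hypothetical.

```python
def solve_linear(m, v):
    """Solve m * t = v by Gaussian elimination with partial pivoting.
    Raises ZeroDivisionError if the system is singular."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - rest) / aug[r][r]
    return sol

def infer_model(stream, h):
    """Fit x_n ~ sum_j a_j * x_{n-j} + b * mu by least squares.
    Returns (a, b), with a = [a_1, ..., a_h]."""
    mu = sum(stream) / len(stream)   # mean of the empirical popularity profile
    rows = [[stream[n - j] for j in range(1, h + 1)] + [mu]
            for n in range(h, len(stream))]
    targets = stream[h:]
    dim = h + 1
    # Normal equations: (A^T A) theta = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(dim)]
           for i in range(dim)]
    aty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(dim)]
    theta = solve_linear(ata, aty)
    return theta[:h], theta[h]
```

A usage example: for a numeric stream `s`, `a, b = infer_model(s, h=3)` returns the three copy weights and the popularity weight. In practice one may want to clip and renormalize the result so that it is a valid probability vector.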

Model on Real Traces- Stock

b: degree of reference locality due to long-term popularity
1-b: degree of reference locality due to short-term correlation

Model on Real Traces- OD Flow

Utilizing Model for Prediction

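[Figure: timeline of S: …… $x_{n-h}$ … $x_{n-1}$ $x_n$ $x_{n+1}$ $x_{n+2}$ … $x_{n+T}$ ……, with the future period of length T marked.]

$E_T(e)$: the expected number of occurrences of a tuple with value e in a future period of length T.

It can be computed using only T+1 constants, calculated from the locality model of S.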
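One way to obtain E_T(e) from the locality model is by linearity of expectation: writing q_t for the probability that x_{n+t} = e given the history up to x_n, the model gives q_t = Σ_j a_j q_{t−j} + b·P(e), with q_t = δ(x_{n+t}, e) for positions already observed. The sketch below simply iterates this recurrence; it is a derivation-based illustration, not the precomputed T+1 constants mentioned on the slide (which are not spelled out here), and the name `expected_occurrences` is hypothetical.

```python
def expected_occurrences(recent, a, b, p_e, e, T):
    """Expected number of occurrences of value e among x_{n+1}..x_{n+T}.

    recent: observed tuples ending with x_n (at least len(a) of them)
    a, b:   locality model parameters (a[j-1] is a_j)
    p_e:    popularity P(e) of the value e
    """
    h = len(a)
    # For observed positions, the probability of equalling e is 0 or 1.
    q = [1.0 if x == e else 0.0 for x in recent[-h:]]
    total = 0.0
    for _ in range(T):
        # q_t = sum_j a_j * q_{t-j} + b * P(e)
        nxt = sum(a[j] * q[-(j + 1)] for j in range(h)) + b * p_e
        q.append(nxt)
        total += nxt
    return total

# Example with h = 1: the last tuple was "MS", which has popularity 0.2.
print(expected_occurrences(["MS"], a=[0.5], b=0.5, p_e=0.2, e="MS", T=2))
```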


Approximate Sliding Window Join

[Figure: sliding window joins, annotated with the window size and the memory size.]

Which tuples should be stored to maximize the size of the join result?

Existing Approach

Metric: max-subset

Previous approaches:

Random load shedding: poor performance (J. Kang et al., A. Das et al.)

Frequency model: IID assumption (A. Das et al.)

Age-based model: too strict an assumption (U. Srivastava et al.)

Stochastic model: not universal (J. Xie et al.)

Marginal Utility

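[Figure: stream S (6 5 10 8 10 12 10 …) and stream R (8 10 …), with positions n-1 and n marked and T = 5. The marginal utility of a tuple X in R at time n is annotated as 3 for T = 5.]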

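Calculate Marginal Utility

[Figure: stream S (10 x 8 x 13 x x 8 9 ……) up to tuple index n, followed by unknown future tuples; stream R (… 9 7 …) holds the tuple x. $P_1$, $P_2$, … denote the probabilities that the 1st, 2nd, … future tuple of S equals x.]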

Based on the locality model, we can show that, for a tuple X in R with value x:

$$U_X^n = \sum_{i=1}^{T} P_i = T\,b\,P(x) + F(a_1, \ldots, a_h, b, P(x))$$

where F depends on the characteristic equation of $P_i$, which is a linear recursive sequence!

ELBA

Exact Locality-Based Algorithm (ELBA): based on the previous analysis, calculate the marginal utility of the tuples in the buffer and evict the victim with the smallest value.

Expensive.

LBA

Locality-Based Algorithm (LBA): assume T is fixed, and approximate the marginal utility based on the predictive power of the locality model. It depends on only T+1 constants, which can be pre-computed.
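Both ELBA and LBA follow the same eviction rule: score each buffered tuple by its (exact or approximate) marginal utility and drop the one with the smallest score. The sketch below shows only that rule. The function name `admit` and the caller-supplied `utility` function are assumptions; `utility` stands in for whatever estimate of the marginal utility is used and is not the authors' formula.

```python
def admit(buffer, capacity, new_tuple, utility):
    """Add new_tuple to the buffer; if the buffer then exceeds capacity,
    evict the tuple with the smallest utility (possibly the new one).
    Returns the evicted tuple, or None if nothing was evicted."""
    buffer.append(new_tuple)
    if len(buffer) <= capacity:
        return None
    victim = min(buffer, key=utility)   # smallest estimated marginal utility
    buffer.remove(victim)
    return victim

# Hypothetical utility estimates (e.g. expected matches over the next T tuples).
estimates = {10: 3.0, 8: 1.0, 12: 2.0}
buf = [10, 8]
evicted = admit(buf, capacity=2, new_tuple=12, utility=estimates.get)
print(evicted, buf)
```

With a pre-computed table of scores, as LBA assumes, each eviction costs one pass over the buffer; a priority queue keyed by utility would lower that cost.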

Space Complexity

A histogram stores both P over a domain of size D and the T+1 constants. Histogram space usage is polylogarithmic: O(poly[log N]) space for N values (A. Gilbert et al.).

Sliding window join: varying buffer size - OD Flow

Sliding window join: varying buffer size - Stock

Sliding window join: varying window size - Stock

Conclusion

The reference locality property is important for query processing under memory constraints in data stream applications.

Most real data streams have strong temporal locality, i.e. short term correlations.

How about spatial locality, i.e. correlation among different attributes of the tuple?

Thanks!

Approximate Count Estimation

Derive a much tighter space bound for the Lossy Counting algorithm (G. Manku et al.) using locality-aware techniques.

A tight space bound is important, as it tells us how much memory to allocate.

Data Summarization

Define entropy over a window in a data stream using locality-aware techniques, instead of the normal definition of entropy.

1 1 2 2 1 2 3 3 3 ……

1 2 1 2 3 2 3 1 3 ……

Important for data summarization, change detection, etc.
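The two example windows show why the normal definition falls short. Assuming they read as 1 1 2 2 1 2 3 3 3 and 1 2 1 2 3 2 3 1 3, both contain each value three times, so the ordinary Shannon entropy cannot tell them apart even though their ordering differs. The sketch below computes only that ordinary entropy; the locality-aware definition is not given on the slide and is not reproduced here.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Ordinary Shannon entropy (in bits) of the value frequencies in a window.
    It ignores the order in which the values arrive."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

clustered = [1, 1, 2, 2, 1, 2, 3, 3, 3]
interleaved = [1, 2, 1, 2, 3, 2, 3, 1, 3]

# Both windows have identical value counts, hence identical entropy.
print(shannon_entropy(clustered), shannon_entropy(interleaved))
```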

For example:

Data Stream Entropy

Data Stream              Locality-Aware Entropy
Uniform IID              6.19
Permuted Stock Stream    5.48
Original Stock Stream    3.32

A higher degree of reference locality implies lower entropy.
