![Page 1: Computer Science Characterizing and Exploiting Reference Locality in Data Stream Applications Feifei Li, Ching Chang, George Kollios, Azer Bestavros Computer](https://reader035.vdocuments.us/reader035/viewer/2022062715/56649d7e5503460f94a60666/html5/thumbnails/1.jpg)
Computer Science
Characterizing and Exploiting Reference Locality in Data Stream Applications
Feifei Li, Ching Chang, George Kollios, Azer Bestavros
Computer Science Department
Boston University
Data Stream Management System (DSMS)
[Figure: applications submit a query (e.g., joins over two streams) to the query processor, which returns the result. Within its limited memory, the DSMS selects the tuples that maximize the query metrics; the remaining tuples are unselected.]
Observations
Storage / computation limitation: the full contents of the tuples of interest cannot be stored in memory.
Cast as a “caching” problem: query processing with a memory constraint.
“Caching” Problem in DSMS
Sliding window joins: which tuples should be stored to maximize the size of the join result?
[Figure: sliding window joins; the annotations relate the window size to the memory size]
Locality of reference properties (Denning & Schwartz)
Locality-Aware Algorithms
[Figure: our locality-aware algorithms compared with previous algorithms]
Our Contributions
Cast query processing with a memory constraint in a DSMS as a “caching” problem and analyze the two causes of reference locality.
Provide a mathematical model, and a simple method to infer it, to characterize the reference locality in data streams.
Show how to improve the performance of data stream applications with locality-aware algorithms.
Reference Locality - Definition
In a data stream, tuples that appeared recently have a high probability of appearing again in the near future.
Inter Arrival Distance (IAD)
A random variable that corresponds to the number of tuples separating consecutive appearances of the same tuple.
Stream: 2 2 4 10 4 10 7 7 4 ……
IAD values: 0 (2 to 2), 1 (4 to 4), 1 (10 to 10), 0 (7 to 7), 3 (4 to 4)
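A minimal sketch of how IAD values could be collected from a finite stream, using the convention stated above (the number of tuples separating consecutive appearances, so adjacent repeats give 0). The function names and the small sample stream are illustrative assumptions, not part of the original talk.

```python
from collections import Counter


def inter_arrival_distances(stream):
    """Return the IAD of every repeated appearance, in stream order."""
    last_seen = {}   # value -> index of its most recent appearance
    distances = []
    for index, value in enumerate(stream):
        if value in last_seen:
            # Count the tuples strictly between the two appearances.
            distances.append(index - last_seen[value] - 1)
        last_seen[value] = index
    return distances


def iad_distribution(stream):
    """Empirical probability of each observed IAD value."""
    distances = inter_arrival_distances(stream)
    counts = Counter(distances)
    total = len(distances)
    return {d: c / total for d, c in sorted(counts.items())}


# Illustrative stream (an assumption for this example).
sample = [2, 2, 4, 10, 4, 10, 7, 7, 4]
sample_distances = inter_arrival_distances(sample)
sample_distribution = iad_distribution(sample)
```

A single pass with a dictionary of last positions keeps the cost linear in the stream length.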
Calculate distribution of IAD
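$$d_i(k) = P(x_{n+k} = i,\; x_{n+j} \neq i \text{ for } j = 1, \ldots, k-1 \mid x_n = i)$$
$$d(k) = \sum_i p_i \, d_i(k)$$
where $p_i$ is the frequency of value $i$ in this data stream.
[Figure: a stream … i a c e b a i … in which value i appears at x_n and next at x_{n+k}; the distance is k]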
Sources of Reference Locality
Long-term popularity vs. short-term correlation (web traces, Bestavros and Crovella)
For example: stock traces
Reference locality due to long-term popularity: MS MS IBM IBM MS GG IBM MS MS ……
Reference locality due to short-term correlation: A MS MS A A GG GG MS IBM …… (George’s Company A listed today!)
Independent Reference Model
With the independent, identically-distributed (IID) assumption:
$$d_i(k) = p_i (1 - p_i)^{k-1}$$
$$d(k) = \sum_{i=1}^{N} p_i \, d_i(k) = \sum_{i=1}^{N} p_i^2 (1 - p_i)^{k-1}$$
Problem: this only captures reference locality due to a skewed popularity profile.
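As a small numeric illustration of the IID formula, the sketch below evaluates d(k) = Σ p_i² (1 − p_i)^(k−1) for two hypothetical popularity profiles. The function names and the profiles are assumptions made for this example.

```python
def irm_iad_pmf(popularity, k):
    """d(k) under the independent reference model, for distance k >= 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return sum(p * p * (1.0 - p) ** (k - 1) for p in popularity)


def irm_iad_cdf(popularity, k):
    """P(IAD <= k) under the independent reference model."""
    return sum(irm_iad_pmf(popularity, j) for j in range(1, k + 1))


# Hypothetical popularity profiles over four values.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

uniform_d1 = irm_iad_pmf(uniform, 1)   # 4 * 0.25^2
skewed_d1 = irm_iad_pmf(skewed, 1)     # 0.7^2 + 3 * 0.1^2
```

The skewed profile puts more probability mass on short distances than the uniform one, which is the only kind of locality this model can express.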
Metrics of Reference Locality
How to distinguish the two causes of reference locality?
Compare IAD distribution of the two!
Original data stream S: A MS MS A A GG GG MS IBM ……
Random permutation of S: MS MS MS A GG IBM IBM MS IBM ……
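The comparison can be sketched as follows: shuffling a trace keeps its popularity profile but destroys short-term correlation, so any gap between the two IAD CDFs is attributable to correlation. The helper names, the seed, and the synthetic bursty stream are assumptions for illustration only.

```python
import random


def iad_values(stream):
    """IAD of each repeated appearance (tuples strictly in between)."""
    last_seen, gaps = {}, []
    for index, value in enumerate(stream):
        if value in last_seen:
            gaps.append(index - last_seen[value] - 1)
        last_seen[value] = index
    return gaps


def iad_cdf(stream, max_gap):
    """Fraction of IAD values that are at most max_gap."""
    gaps = iad_values(stream)
    if not gaps:
        return 0.0
    return sum(1 for g in gaps if g <= max_gap) / len(gaps)


def compare_with_permutation(stream, max_gap, seed=0):
    """Return (CDF of the original, CDF of a random permutation)."""
    permuted = list(stream)
    random.Random(seed).shuffle(permuted)
    return iad_cdf(stream, max_gap), iad_cdf(permuted, max_gap)


# Synthetic stream: 50 values, each arriving in a burst of 4 (uniform popularity).
bursty = [value for value in range(50) for _ in range(4)]
original_cdf, permuted_cdf = compare_with_permutation(bursty, max_gap=0)
```

Because every value here is equally popular, the locality of the synthetic stream comes entirely from short-term correlation, and the permutation removes most of it.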
Stock Transaction Traces
Daily stock transaction data from INET ATS, Inc.
Zipf-like Popularity Profile (log-log scale)
Stock Transaction Traces
CDF of IAD for Original and Randomly Permuted Traces
The randomly permuted trace still has strong reference locality, due to the skewed popularity distribution.
Network OD Flow Traces
Network traces of Origin-Destination (OD) flows in two major networks: US Abilene and Sprint-Europe
Zipf-like Popularity Profile (log-log scale)
Network OD Flow Traces
CDF of IAD for Original and Randomly Permuted Traces
Outline
Motivation
Reference Locality: source and metrics
A Locality-Aware Data Stream Model
Application of Locality-Aware Model
  Max-subset Join
  Approximate count estimation
  Data summarization
Performance Study
Conclusion
Locality-Aware Stream Model
[Figure: stream S with its popularity distribution P and its h most recent tuples x_{n-h}, …, x_{n-1}. The next tuple x_n = 5 repeats x_{n-4}: P(x_n = x_{n-4}) = a_4.]
Locality-Aware Stream Model
[Figure: stream S with its popularity distribution P and its h most recent tuples x_{n-h}, …, x_{n-1}. The next tuple x_n = 2 is drawn from the popularity profile: P(x_n = 2 from popularity profile) = b · p(2).]
Locality-Aware Stream Model
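$$X_n = \begin{cases} X_{n-i} & \text{with probability } a_i \\ Y & \text{with probability } b \end{cases}$$
where $1 \le i \le h$, $Y$ is an IID random variable w.r.t. $P$, and $\sum_{i=1}^{h} a_i + b = 1$.
$$P(x_n = c \mid x_{n-1}, \ldots, x_{n-h}) = b\,P(c) + \sum_{j=1}^{h} a_j \, \delta(x_{n-j}, c)$$
where $\delta(x_k, c) = 1$ if $x_k = c$, and 0 otherwise.
A similar model has been used for caching of web traces, e.g., by Konstantinos Psounis et al.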
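To make the generative reading of the model concrete, here is a minimal simulation sketch: each new tuple copies the tuple i positions back with probability a_i, and otherwise (with probability b = 1 − Σ a_i) is drawn from the popularity distribution P. The function name, the dictionary representation of P, the seed, and the fallback to P while fewer than i tuples exist are all assumptions of this example.

```python
import random


def generate_stream(length, a, popularity, seed=0):
    """Simulate a stream from the locality-aware model.

    a          -- [a_1, ..., a_h], copy probabilities for lags 1..h
    popularity -- dict mapping value -> probability (the distribution P)
    """
    b = 1.0 - sum(a)
    if b < -1e-12:
        raise ValueError("the a_i must sum to at most 1")
    rng = random.Random(seed)
    values = list(popularity.keys())
    weights = list(popularity.values())
    stream = []
    for _ in range(length):
        u = rng.random()
        lag, cumulative = None, 0.0
        for i, a_i in enumerate(a, start=1):
            cumulative += a_i
            if u < cumulative:
                lag = i
                break
        if lag is not None and lag <= len(stream):
            stream.append(stream[-lag])          # copy x_{n-lag}
        else:
            # Draw Y from P (also used while the history is too short).
            stream.append(rng.choices(values, weights=weights, k=1)[0])
    return stream


# With a_1 = 1 (so b = 0) every tuple copies its predecessor.
constant = generate_stream(20, [1.0], {1: 0.5, 2: 0.5})
# With h = 0 (so b = 1) the stream is plain IID sampling from P.
iid = generate_stream(100, [], {1: 0.5, 2: 0.5})
```

Setting all a_i to zero recovers the independent reference model, which makes the two sources of locality easy to toggle in experiments.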
Infer the Model
Expected value for $x_n$:
$$x_n^* = \sum_{j=1}^{h} a_j \, x_{n-j} + b \sum_{i=1}^{D} i \, P(i)$$
Least-squares method: minimize over $a_1, \ldots, a_h, b$:
$$\sum_{i=h+1}^{N} [x_i - x_i^*]^2$$
Make N observations and infer the h+1 parameters $a_1, \ldots, a_h$ and $b$.
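The fit can be sketched as an ordinary least-squares problem: regress each x_n on the h previous tuples plus a constant column equal to the mean of P, then read off a_1, …, a_h and b. This is a plain unconstrained fit solved through the normal equations in pure Python; whether the original work adds constraints (for example a_i ≥ 0 or Σ a_i + b = 1) is not stated here, and the function names are assumptions.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[r]] for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def infer_locality_model(stream, h, popularity_mean):
    """Least-squares estimate of (a_1..a_h, b) from numeric observations.

    popularity_mean is the mean of P, i.e. the sum over i of i * P(i).
    """
    rows, targets = [], []
    for n in range(h, len(stream)):
        rows.append([stream[n - j] for j in range(1, h + 1)] + [popularity_mean])
        targets.append(stream[n])
    m = h + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(m)]
    params = solve_linear_system(ata, atb)
    return params[:h], params[h]


# Synthetic check: a sequence that follows x_n = 0.5 * x_{n-1} + 0.5 * 10 exactly.
observations = [2.0]
for _ in range(8):
    observations.append(0.5 * observations[-1] + 0.5 * 10.0)
a_hat, b_hat = infer_locality_model(observations, h=1, popularity_mean=10.0)
```

On real traces the residuals are not zero, so the estimates describe the best linear predictor rather than an exact recurrence.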
Model on Real Traces - Stock
b: degree of reference locality due to long-term popularity
1-b: degree of reference locality due to short-term correlation
Model on Real Traces - OD Flow
Utilizing Model for Prediction
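[Figure: stream S with the history x_{n-h}, …, x_{n-1}, x_n followed by a future period of T tuples x_{n+1}, x_{n+2}, …, x_{n+T}]
E_T(e): the expected number of occurrences of a tuple with value e in a future period of length T.
It can be computed using only T+1 constants calculated from the locality model of S.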
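One direct way to evaluate E_T(e), shown here only as a sketch, is to apply the model's conditional probability recursively: P(x_{n+t} = e | history) = b·P(e) + Σ_j a_j·P(x_{n+t−j} = e | history), where positions at or before n contribute a 0/1 indicator. This is a straightforward recursion and not the T+1 precomputed constants mentioned on the slide; the function and parameter names are assumptions.

```python
def expected_occurrences(history, a, popularity, value, horizon):
    """Expected number of tuples equal to `value` among the next `horizon` tuples.

    history    -- observed tuples, most recent last (needs at least h of them)
    a          -- [a_1, ..., a_h] from the locality model
    popularity -- dict mapping value -> probability (the distribution P)
    """
    h = len(a)
    if len(history) < h:
        raise ValueError("need at least h observed tuples")
    b = 1.0 - sum(a)
    p_value = popularity.get(value, 0.0)
    # Match probabilities for known positions are indicators (0 or 1).
    probs = [1.0 if x == value else 0.0 for x in history[len(history) - h:]]
    expected = 0.0
    for _ in range(horizon):
        # a[j] is a_{j+1}, which weights the position j+1 steps back.
        p_next = b * p_value + sum(a[j] * probs[-(j + 1)] for j in range(h))
        expected += p_next
        probs.append(p_next)
    return expected


# No short-term correlation: the expectation is simply T * P(e).
iid_expectation = expected_occurrences([], [], {5: 0.2}, 5, 10)
# Pure copying of the previous tuple: the value repeats at every step.
copy_expectation = expected_occurrences([3], [1.0], {3: 0.1}, 3, 4)
# Mixed case, one step ahead: 0.5 * 0.1 + 0.5 * 1.
mixed_expectation = expected_occurrences([3], [0.5], {3: 0.1}, 3, 1)
```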
Outline
Motivation
Reference Locality: source and metrics
A Locality-Aware Data Stream Model
Application of Locality-Aware Model
  Max-subset Join
  Approximate count estimation
  Data summarization
Performance Study
Conclusion
Approximate Sliding Window Join
Sliding window joins: which tuples should be stored to maximize the size of the join result?
[Figure: sliding window joins; the annotations relate the window size to the memory size]
Existing Approach
Metric: Max-subset
Previous approaches:
Random load shedding: poor performance (J. Kang et al., A. Das et al.)
Frequency model: IID assumption (A. Das et al.)
Age-based model: too strict an assumption (U. Srivastava et al.)
Stochastic model: not universal (J. Xie et al.)
Marginal Utility
[Figure: streams S and R with tuple index n and a look-ahead of T = 5 tuples; the marginal utility shown for the buffered tuple is U = 3]
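Calculate Marginal Utility
[Figure: streams S and R around tuple index n; P_1, P_2, … denote the probabilities that the upcoming tuples match the value x]
Based on the locality model, we can show that the marginal utility $U$ of a tuple with value $x$ at index $n$ is:
$$U = \sum_{i=1}^{T} P_i = T\,b\,P(x) + F(a_1, \ldots, a_h, b, P(x))$$
where F depends on the characteristic equation of $P_i$, which is a linear recursive sequence.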
ELBA
Exact Locality-Based Algorithm (ELBA)
Based on the previous analysis, calculate the marginal utility of the tuples in the buffer and evict the victim with the smallest value.
Expensive
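The eviction rule itself can be sketched independently of how the marginal utility is computed. In the sketch below the utility is passed in as a function; the utility table, the function name, and the choice to let the newly arrived tuple compete for a slot are hypothetical and only illustrate the "evict the smallest" step.

```python
def admit_tuple(buffer, capacity, new_tuple, utility):
    """Add new_tuple to the buffer; if it overflows, evict the lowest-utility tuple.

    Returns the evicted tuple, or None if nothing had to be evicted.
    """
    buffer.append(new_tuple)
    if len(buffer) <= capacity:
        return None
    victim_index = min(range(len(buffer)), key=lambda i: utility(buffer[i]))
    return buffer.pop(victim_index)


# Hypothetical marginal utilities for four join-attribute values.
hypothetical_utility = {8: 1.2, 10: 3.0, 5: 0.4, 12: 0.9}
join_buffer = [8, 10, 5]
evicted = admit_tuple(join_buffer, 3, 12, hypothetical_utility.get)
```

Each admission scans the whole buffer, which hints at why recomputing exact utilities for every buffered tuple is costly.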
LBA
Locality-Based Algorithm (LBA)
Assume T is fixed and approximate the marginal utility based on the prediction power of the locality model.
Depends on only T+1 constants that can be pre-computed.
Space Complexity
A histogram stores both P, over a domain of size D, and the T+1 constants.
Histogram space usage is polylogarithmic: O(poly[log N]) space for N values (A. Gilbert et al.).
Sliding window join: varying buffer size – OD Flow
Sliding window join: varying buffer size - Stock
Sliding window join: varying window size - stock
Conclusion
The reference locality property is important for query processing with a memory constraint in data stream applications.
Most real data streams have strong temporal locality, i.e., short-term correlations.
How about spatial locality, i.e. correlation among different attributes of the tuple?
Thanks!
Approximate Count Estimation
Derive a much tighter space bound for the Lossy Counting algorithm (G. Manku et al.) using locality-aware techniques.
A tight space bound is important, as it tells us how much memory to allocate.
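For context, here is a minimal sketch of the textbook Lossy Counting procedure, instrumented to report the peak number of entries it holds, which is the quantity a space bound describes. It does not implement the locality-aware bound discussed in the talk; the function name and the returned pair are assumptions of this example.

```python
import math


def lossy_counting(stream, epsilon):
    """Textbook Lossy Counting.

    Returns (counts, peak_entries): the surviving value -> count estimates
    and the largest number of entries held at any time.
    """
    width = math.ceil(1 / epsilon)        # bucket width
    entries = {}                          # value -> [count, max undercount]
    peak_entries = 0
    for n, value in enumerate(stream, start=1):
        bucket = math.ceil(n / width)     # id of the current bucket
        if value in entries:
            entries[value][0] += 1
        else:
            entries[value] = [1, bucket - 1]
        peak_entries = max(peak_entries, len(entries))
        if n % width == 0:
            # At a bucket boundary, drop entries that cannot be frequent.
            entries = {v: cd for v, cd in entries.items() if cd[0] + cd[1] > bucket}
    return {v: cd[0] for v, cd in entries.items()}, peak_entries


# Small example: value 1 is frequent, the others appear once.
counts, peak = lossy_counting([1, 2, 1, 3, 1, 4, 1, 5], epsilon=0.25)
```

Streams whose repeats cluster closely together tend to keep fewer distinct entries alive between pruning steps, which is the intuition behind studying the space usage under a locality model.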
Data Summarization
Define entropy over a window in a data stream using locality-aware techniques, instead of the usual definition of entropy.
For example:
1 1 2 2 1 2 3 3 3 ……
1 2 1 2 3 2 3 1 3 ……
Important for data summarization, change detection, etc.
Data Stream Entropy
| Data Stream | Locality-Aware Entropy |
| --- | --- |
| Uniform IID | 6.19 |
| Permuted Stock Stream | 5.48 |
| Original Stock Stream | 3.32 |

A higher degree of reference locality implies lower entropy.