Data Streams
Topics in Data Mining, Fall 2015
Bruno Ribeiro
© 2015 Bruno Ribeiro
Data Streams Applications
◦ Stream item counting
◦ Stream statistics
◦ Stream classification
◦ Stream matching
What are Data Streams?
Data Streams
◦ Data streams: continuous, ordered, changing, fast, huge amount
◦ Traditional DBMS: data stored in finite, persistent data sets
Characteristics
◦ Huge volumes of continuous data, possibly infinite
◦ Fast changing; requires fast, real-time response
◦ Random access is expensive: single-scan algorithms (only a single pass)
◦ Store only a summary of the data seen thus far
◦ Most stream data are low-level or multi-dimensional in nature and need multi-level, multi-dimensional processing
Ack. From Jiawei Han
Examples
◦ Telecommunication calling records
◦ Business: credit card transaction flows
◦ Network monitoring and traffic engineering
◦ Financial market: stock exchange
◦ Engineering & industrial processes: power supply & manufacturing
◦ Sensor, monitoring & surveillance: video streams, RFIDs
◦ Security monitoring
◦ Web logs and Web page click streams
◦ Massive data sets (even if saved, random access is too expensive)
Ack. From Jiawei Han
DBMS versus DSMS
DBMS | DSMS
Persistent relations | Transient streams
One-time queries | Continuous queries
Random access | Sequential access
“Unbounded” disk store | Bounded main memory
Only current state matters | Historical data is important
No real-time services | Real-time requirements
Relatively low update rate | Possibly multi-GB arrival rate
Data at any granularity | Data at fine granularity
Assume precise data | Data stale/imprecise
Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics
Ack. From Motwani’s PODS tutorial slides
In General: Streaming algorithm
[Figure: a continuous data stream X1, X2, …, Xn (terabytes) enters a stream processing engine, which keeps a summary in memory (gigabytes), an “indirect” observation of the stream, and answers query Q with an estimate of θ, where θ = g(X1,...,Xn)]
Hashing
Querying
Query types
◦ One-time query vs. continuous query (evaluated continuously as the stream arrives)
◦ Predefined query vs. ad-hoc query (issued on-line)
Unbounded memory requirements
◦ For real-time response, main-memory algorithms should be used
◦ Memory requirement is unbounded if one will join future tuples
Approximate query answering
◦ With bounded memory, it is not always possible to produce exact answers
◦ High-quality approximate answers are desired
◦ Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.
Ack. From Jiawei Han
Synopses/Approximate Answers
Major challenges
◦ Keep track of a large universe, e.g., pairs of IP addresses, not ages
Methodology
◦ Synopses (trade-off between accuracy and storage): a summary given in brief terms that covers the major points of a subject matter
◦ Use synopsis data structures, much smaller (O(log^k N) space) than their base data set (O(N) space)
◦ Compute an approximate answer within a small error range (factor ε of the actual answer)
Major methods
◦ Random sampling
◦ Histograms
◦ Sliding windows
◦ Multi-resolution model
◦ Sketches
◦ Randomized algorithms
Ack. From Jiawei Han
Types of Streaming Algorithms
Sliding windows
◦ Only over sliding windows of recent stream data
◦ Approximation, but often more desirable in applications
Batched processing, sampling and synopses
◦ Batched if update is fast but computing is slow: compute periodically, not very timely
◦ Sampling if update is slow but computing is fast: compute using sample data
◦ Synopsis data structures: maintain a small synopsis or sketch of data; good for querying historical data
Blocking operators, e.g., sorting, avg, min, etc.
◦ Blocking if unable to produce the first output until seeing the entire input
Ack. From Jiawei Han
Stream Processing
Random sampling (but without knowing the total length in advance)
Sliding windows
◦ Make decisions based only on recent data of sliding window size w
◦ An element arriving at time t expires at time t + w
Histograms
◦ Approximate the frequency distribution of element values in a stream
◦ Partition data into a set of contiguous buckets
◦ Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)
Multi-resolution models
◦ Popular models: balanced binary trees, micro-clusters, and wavelets
Ack. From Jiawei Han
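Random sampling without knowing the stream length in advance is commonly done with reservoir sampling. The sketch below is one standard way to do it (Algorithm R), not something taken from the slides; the function name and the toy stream are illustrative assumptions.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # the first k items fill the reservoir
        else:
            j = rng.randint(0, i)    # uniform over 0..i, inclusive
            if j < k:
                reservoir[j] = item  # new item is kept with probability k/(i+1)
    return reservoir

# Toy stream (assumption): the integers 0..999,999, read once.
sample = reservoir_sample(iter(range(1_000_000)), k=5, rng=random.Random(0))
print(sample)
```

Memory stays at k items regardless of how long the stream runs, which is the single-pass, bounded-memory setting described above.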
Random Sampling: A Simple Approach to Item Counts
Random Sampling: Packet sampling
[Figure: a router between Internet links applies Bernoulli sampling to the packets passing through]
◦ Widely used: processing overhead controlled by sampling rate (1/200)
◦ Traffic summary: find % traffic from Netflix @ Purdue
◦ Estimate packet-level statistics
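A minimal sketch of Bernoulli packet sampling at rate 1/200 and the resulting packet-level estimates. The synthetic trace, the 30% share, and all names are assumptions made for illustration only.

```python
import random

def bernoulli_sample(packets, p, rng):
    """Keep each packet independently with probability p (single pass)."""
    return [pkt for pkt in packets if rng.random() < p]

rng = random.Random(1)
# Hypothetical trace: each packet is labeled by source; about 30% are "netflix".
packets = ["netflix" if rng.random() < 0.3 else "other" for _ in range(200_000)]

p = 1 / 200
sampled = bernoulli_sample(packets, p, rng)

# Scale the sample count by 1/p to estimate the total number of packets.
est_total = len(sampled) / p
# The share among sampled packets estimates the share among all packets.
est_share = sum(pkt == "netflix" for pkt in sampled) / len(sampled)
true_share = packets.count("netflix") / len(packets)

print(f"estimated total packets: {est_total:.0f}")
print(f"estimated share: {est_share:.3f}, true share: {true_share:.3f}")
```

Only about 1 in 200 packets is ever stored, so the processing cost scales with the sampling rate.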
A Fair Measure: Flow-level Statistics
◦ Find % connections from Netflix @ Purdue
◦ Estimate flow-level statistics
◦ Estimate flow size distribution
Flow-level Statistics from Sampled Packets?
Reverse problem (inference problem)
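One way to see why this is an inference problem: under Bernoulli packet sampling with rate p, a flow of i packets leaves a Binomial(i, p) number of sampled packets. The sketch below (function name and flow sizes are illustrative assumptions) computes that forward model and the chance that a flow is missed entirely.

```python
from math import comb

def sampled_size_pmf(i, p):
    """P(j packets sampled | flow has i packets), for j = 0..i: Binomial(i, p)."""
    return [comb(i, j) * p**j * (1 - p) ** (i - j) for j in range(i + 1)]

p = 1 / 200
for size in (1, 10, 100):
    pmf = sampled_size_pmf(size, p)
    print(f"flow size {size}: P(no packet sampled) = {pmf[0]:.3f}")
```

Recovering the original flow sizes means inverting this model from the sampled counts alone, and at p = 1/200 most small flows contribute no samples at all.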
Finding estimates – schematic view
[Figure: schematic view, sampling followed by an estimator]
Flow size distribution: maximum likelihood estimation
◦ Sampling rate = 1/200
◦ 128,000 sampled flows
◦ EM algorithm, 2 initializations
[Figure: cumulative % of flows (30% to 100%) vs. flow size (1 to 19); curves: Estimate 1, Estimate 2, Original]
Estimates highly sensitive to initialization
MLE: more samples
◦ Packet sampling rate = 1/200, 1 trillion sampled flows
[Figure: cumulative % of flows (30% to 100%) vs. flow size (1 to 20); curve: Estimate]
Problem: Uniform sampling
Surface: 71% is water
(Image: Wikipedia)
Importance Sampling
Dedicates precious memory only to “important” observations
Sample flows, rather than packets
◦ Problem?
◦ Will likely miss large flows
Sample flows ∝ flow size
◦ Problem?
◦ Streaming setting: we don’t yet know the size
Example of compromise: Sample and Hold
◦ Sample packets, keep all remaining packets of same flow
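A minimal sketch of Sample and Hold as described above: a packet of an untracked flow is sampled with probability q, and once a flow is tracked every later packet of that flow is counted. The synthetic stream, the value of q, and all names are assumptions for illustration.

```python
import random
from collections import Counter

def sample_and_hold(packets, q, rng):
    """packets: iterable of flow ids, one per packet. Returns held packet counts per flow."""
    held = Counter()
    for flow in packets:
        if flow in held:
            held[flow] += 1      # flow already tracked: hold every later packet
        elif rng.random() < q:
            held[flow] = 1       # first sampled packet creates the flow entry
    return held

rng = random.Random(7)
# Hypothetical stream: one flow of 5,000 packets mixed with 2,000 single-packet flows.
packets = ["big"] * 5000 + [f"small-{i}" for i in range(2000)]
rng.shuffle(packets)

held = sample_and_hold(packets, q=0.01, rng=rng)
print("tracked flows:", len(held), "packets held for the large flow:", held["big"])
```

Large flows are tracked with high probability because they get many chances to be sampled, while most single-packet flows never occupy memory.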
Different Sampling Designs
[Figure: flows seen as a stream of elements]
◦ Packet Sampling = sample elements with probability p
◦ Flow Sampling = sample sets with probability q
◦ Sample & Hold = randomly sample elements with probability q’ from the stream, but collect all future elements with the same color
◦ Dual Sampling = sample the first element with high probability; sample the following elements with low probability and use “sequence numbers” to obtain elements lost “in the middle”
Results: Different Sampling Designs
◦ FS = Flow sampling
◦ SH = Sample and hold
◦ DS = Dual sampling
◦ PS = Packet sampling
Tune & Veitch, 2014
Sketches
Sketches
Note that not every problem can be solved well with sampling
◦ Example: flow size estimation
“Sketch”: a linear transformation of the input
◦ Model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix
[Figure: stream X1, X2, …, Xn → linear projection → sketch]
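A small sketch of the linear-transformation view, under illustrative assumptions: a fixed random item-to-bucket map stands in for the hash function, and the sketch is an array of counters. Updating counters one arrival at a time gives the same result as multiplying the frequency vector by a 0/1 matrix, and the sketch of two concatenated streams is the sum of their sketches.

```python
import random

def make_bucket_map(universe, width, seed=0):
    """Hypothetical stand-in for a hash function: a fixed random item -> bucket map."""
    rng = random.Random(seed)
    return {item: rng.randrange(width) for item in universe}

def sketch_stream(stream, bucket, width):
    """One pass: each arrival adds 1 to the counter its item maps to."""
    counters = [0] * width
    for item in stream:
        counters[bucket[item]] += 1
    return counters

def sketch_vector(freq, bucket, width):
    """Same sketch written as S x, where S[r][item] = 1 iff bucket[item] == r."""
    return [sum(count for item, count in freq.items() if bucket[item] == r)
            for r in range(width)]

universe = list("abcdefgh")
width = 3
bucket = make_bucket_map(universe, width)

stream1 = list("aabcadde")
stream2 = list("hhgfa")
freq1 = {u: stream1.count(u) for u in universe}   # the stream vector x

s1 = sketch_stream(stream1, bucket, width)
s2 = sketch_stream(stream2, bucket, width)
s12 = sketch_stream(stream1 + stream2, bucket, width)
print(s1, s2, s12)
```

The matrix is never stored; it is implicit in the bucket map, which is why the sketch can be maintained in one pass.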
Counting Sketch: Abhishek Kumar et al. 2004
Definitions
◦ N → number of flows
◦ W → maximum flow size
◦ M → memory size
Space Complexity
◦ Available memory M = k N log W, k < 1
Flow Size Sketch: Kumar et al. 2004
Motivation:
◦ Estimate flow size distribution
Hash function f: uniformly at random associates a newly arrived flow to a counter
[Figure: elements of flows blue, red, and green are mapped by hash function f to an array of counters; two flows mapped to the same counter collide; the counter values feed the flow size distribution estimate]
Uses precious memory with counters > 0
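A minimal sketch of the counting phase only: each packet increments the counter that its flow hashes to. The hash construction, the synthetic trace, and the names are assumptions; the estimation step that disambiguates collisions is not shown.

```python
import hashlib
from collections import Counter

def counter_index(flow_id, num_counters):
    """Hypothetical hash f: maps a flow id to one of num_counters counters."""
    digest = hashlib.blake2b(flow_id.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_counters

def sketch_phase(packets, num_counters):
    """One pass over the packets; one increment per packet."""
    counters = [0] * num_counters
    for flow_id in packets:
        counters[counter_index(flow_id, num_counters)] += 1
    return counters

# Hypothetical trace: 1,000 flows, flow i has (i % 5) + 1 packets.
flows = {f"flow-{i}": (i % 5) + 1 for i in range(1000)}
packets = [fid for fid, size in flows.items() for _ in range(size)]

counters = sketch_phase(packets, num_counters=2000)
true_hist = Counter(flows.values())               # true flow size distribution
raw_hist = Counter(c for c in counters if c > 0)  # counter values, collisions included

print("true:", sorted(true_hist.items()))
print("raw :", sorted(raw_hist.items()))
```

Counter values above the largest true flow size can only come from collisions, which is why the raw histogram needs a separate estimation step.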
Data Streaming on Flow Size Estimation
[Figure: sketch phase (router): a universal hash function maps packets to an array of counters, where flows may collide; estimation phase (powerful back-end server): the counter summary is disambiguated into a flow size distribution estimate]
Issues with Kumar et al.
◦ Effectively only works if counter load < 2
◦ In practice reduces required memory by 1/2
◦ Very resource-intensive estimation procedure
Ribeiro et al. 2008
Eviction Sketch
Eviction Sketch: Probabilistic collision avoidance
Maximum hash value = M; M/2 counters
◦ If hash(packet) < M/2 → red
◦ Otherwise (hash(packet) mod M/2) → blue
[Figure: flows 7, 8, and 9 hashed into the counter array; counters: 2 0 1 6 0 0 6 1 2]
◦ Undetectable collision
◦ Detectable blue – red collision: 1 bit required
Eviction
Number of eviction classes ∞
◦ Policy: evicts random flow
◦ Flow sampling
Folding: interesting fact
Collision policy:
◦ “red flow cannot increment blue counter”
◦ “blue flow overwrites red counter”
◦ counters = 0 are red
Result: e.g., if 1 counter / flow
◦ All red counters are also blue counters = 0
◦ Virtually expands hash table by ≈ 50% (virtual 2 counters / flow)
◦ Blue counters evict red counters
◦ Flow sampling effect: discards 15% of flows at random
[Figure: counters: 2 0 1 3 0 0 6 1 2; counter colors (extra bit): 0 0 0 1 0 0 1 1 1]
Group large flow sizes & Probabilistic counting [Morris 78]
Reduce counter size: probabilistic counter increments
◦ With m_a = 2^a, a 6-bit counter bins flows up to average size 10^14
[Figure: hash counter value vs. arrived packets; the counter counts exactly up to k, then moves to the next value with probability p = 1/m_1 per packet, then p = 1/m_2, and so on, so each counter value stands for an average flow size]
◦ Counter value k → average flow sizes = [k, k+m_1-1]
◦ Counter value k+1 → average flow sizes = [k+m_1, k+m_1+m_2-1]
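A minimal sketch of the probabilistic increments described above, assuming exact counting below k and an increment probability of 1/m_a with m_a = 2^a afterwards. The value k = 16, the function names, and the packet count are illustrative assumptions.

```python
import random

def increment(counter, k, rng):
    """Advance the counter for one arriving packet.

    Below k the count is exact. At value k + a - 1 (a >= 1) the counter
    moves up with probability 1/m_a, with m_a = 2**a.
    """
    if counter < k:
        return counter + 1
    a = counter - k + 1
    return counter + 1 if rng.random() < 1 / 2**a else counter

def bin_start(counter, k):
    """Smallest average flow size represented by a counter value."""
    if counter <= k:
        return counter
    steps = counter - k   # probabilistic increments taken so far
    return k + sum(2**a for a in range(1, steps + 1))

rng = random.Random(3)
counter = 0
for _ in range(100_000):  # one flow of 100,000 packets (assumption)
    counter = increment(counter, k=16, rng=rng)

print("counter value:", counter, "bin starts at size", bin_start(counter, 16))
print("largest bin of a 6-bit counter starts at", bin_start(63, 16))
```

With these assumptions the largest value of a 6-bit counter (63) covers sizes beyond 10^14, in line with the figure quoted above.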
Experiment
Evaluated with simulations
Our worst result with Internet core traces
◦ 9.5 million flows
◦ 8MB of memory
◦ k=16
◦ W=10^14
Same accuracy without counter folding requires 13MB of memory
Final estimation result (over Internet traffic)
Input: 10^6 flows with 250KB memory
Approximate Search
Bloom filters
Good Tutorial: Andrei Broder and Michael Mitzenmacher, Network Applications of Bloom Filters: A Survey, Internet Mathematics Vol. 1, No. 4: 485-509, 2003
How Bloom Filters Work
Hash functions f1, f2, f3
Why Bloom Filters Work
S = set of items; n = |S|; k = number of hash functions; m = number of bits in the filter
Assume kn < m
To check membership of y in S: check whether bits f_i(y), 1 ≤ i ≤ k, are all set to 1
◦ If not, y ∉ S
◦ Else, we conclude that y ∈ S, but sometimes y ∉ S (false positive)
In many applications, false positives are OK as long as they happen with small probability
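A minimal Bloom filter sketch, assuming m bits, k hash functions, and n inserted items. Deriving the k hash functions by salting a single hash with an index is an illustrative choice, as are the class name, the item names, and the parameters.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                 # number of bits in the filter
        self.k = k                 # number of hash functions
        self.bits = bytearray(m)   # one byte per bit, for readability

    def _positions(self, item):
        # Hypothetical choice: derive f_1..f_k by salting one hash with the index i.
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True if all k bits are set: either a member or a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

n = 1000
bf = BloomFilter(m=8 * n, k=5)     # 8 bits per stored item
for i in range(n):
    bf.add(f"member-{i}")

false_positives = sum(f"other-{i}" in bf for i in range(10_000))
print("false positive rate:", false_positives / 10_000)
```

Inserted items are always reported as present; only non-members can be misclassified, and the rate depends on m, n, and k.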
Bloom Filter Errors
Assumption: hash functions look random
Given m bits for filter and n elements, choose number k of hash functions to minimize false positives:
◦ Let p = Pr[a given bit is 0] = (1 - 1/m)^{kn} ≈ e^{-kn/m}
◦ Then f = Pr[false positive] = (1 - p)^k ≈ (1 - e^{-kn/m})^k
As k increases, there are more chances to find at least one 0, but we also insert more 1’s in the bit vector
Optimal at k = (ln 2)m/n (derivative = 0, 2nd deriv > 0)
Example
m/n = 8
Opt k = 8 ln 2 = 5.545...
[Figure: false positive rate (0 to 0.1) vs. number of hash functions (0 to 11.3)]
Ack Mitzenmacher
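The example can be checked numerically with the standard approximation f ≈ (1 - e^{-kn/m})^k for the false positive rate. The sketch below evaluates it for m/n = 8; the function and variable names are illustrative.

```python
import math

def false_positive_rate(k, bits_per_item):
    """Approximate rate f = (1 - e^{-k n / m})^k, with bits_per_item = m / n."""
    return (1 - math.exp(-k / bits_per_item)) ** k

bits_per_item = 8
k_opt = bits_per_item * math.log(2)   # (ln 2) m / n

for k in range(1, 11):
    print(k, round(false_positive_rate(k, bits_per_item), 4))

# k must be an integer in practice, so compare the integers around k_opt.
best_k = min(range(1, 11), key=lambda k: false_positive_rate(k, bits_per_item))
print("optimal real-valued k:", round(k_opt, 3), "best integer k:", best_k)
```

At the real-valued optimum each bit is 0 with probability 1/2, so the rate equals (1/2)^k.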