Randomization for Massive and Streaming Data Sets
Rajeev Motwani
CS Forum Annual Meeting, May 21, 2003


Page 1: Randomization for Massive and Streaming Data Sets
Rajeev Motwani

Page 2: Data Stream Management Systems

Traditional DBMS – data stored in finite, persistent data sets

Data Streams – distributed, continuous, unbounded, rapid, time-varying, noisy, …

Emerging DSMS – a variety of modern applications:
Network monitoring and traffic engineering
Telecom call records
Network security
Financial applications
Sensor networks
Manufacturing processes
Web logs and clickstreams
Massive data sets

Page 3: DSMS – Big Picture

[Architecture diagram: input streams enter the DSMS, which maintains a scratch store, an archive, and stored relations; users register queries and receive streamed or stored results.]

Page 4: Algorithmic Issues

Computational Model: streaming data (or secondary memory); bounded main memory

Techniques: new paradigms; negative results and approximation; randomization

Complexity Measures: memory; time per item (online, real-time); number of passes (linear scans in secondary memory)

Page 5: Stream Model of Computation

[Figure: a stream of bits arrives over increasing time and is processed against synopsis data structures held in main memory.]

Memory: poly(1/ε, log N)

Query/Update Time: poly(1/ε, log N)

N: # items so far, or window size

ε: error parameter

Page 6: "Toy" Example – Network Monitoring

[Diagram: network measurements and packet traces stream into the DSMS (scratch store, archive, lookup tables); registered monitoring queries produce intrusion warnings and online performance metrics.]

Page 7: Frequency Related Problems

[Histogram of element frequencies over domain values 1–20.]

Find all elements with frequency > 0.1%

Top-k most frequent elements

What is the frequency of element 3?
What is the total frequency of elements between 8 and 14?

Find elements that occupy 0.1% of the tail.

Mean + Variance?

Median?

How many elements have non-zero frequency?

Analytics on Packet Headers – IP Addresses

Page 8: Example 1 – Distinct Values

Input: sequence X = x1, x2, …, xn, … over domain U = {0, 1, 2, …, u-1}
Compute: D(X), the number of distinct values

Remarks:
Assume stream size n is finite/known (generally, n is the window size)
Domain could be arbitrary (e.g., text, tuples)

Page 9: Naïve Approach

Keep a counter C(i) for each domain value i
Initialize counters C(i) ← 0
Scan X, incrementing the appropriate counters

Problem:
Memory size M << n
Space O(u) – and possibly u >> n (e.g., when counting distinct words in a web crawl)

Page 10: Negative Result

Theorem: Deterministic algorithms need M = Ω(n log u) bits.

Proof: information-theoretic arguments.

Note: this leaves open randomization/approximation.

Page 11: Randomized Algorithm

[Figure: the input stream is hashed via h: U → [1..t] into a hash table.]

Analysis:
A random h gives few collisions, hence average list size O(n/t)
Thus:
Space: O(n) – since we need t = Ω(n)
Time: O(1) per item [expected]
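The hash-table scheme above can be sketched in Python. This is a simulation for illustration only: the random hash h is emulated by caching random bucket assignments, which is an assumption of the sketch, not how a constant-space hash function would actually be represented.

```python
import random

def count_distinct_exact(stream, t, seed=0):
    """Exact distinct count via a chained hash table with t buckets.

    For a random h, the expected chain length is O(n/t), so per-item
    time is O(1) expected once t = Omega(n); total space is O(n).
    """
    rnd = random.Random(seed)
    bucket_of = {}                      # stand-in for a random h: U -> [0, t)

    def h(x):
        if x not in bucket_of:
            bucket_of[x] = rnd.randrange(t)
        return bucket_of[x]

    buckets = [[] for _ in range(t)]
    distinct = 0
    for x in stream:
        chain = buckets[h(x)]
        if x not in chain:              # scan the (short) chain for x
            chain.append(x)
            distinct += 1
    return distinct
```

This makes the space cost concrete: every distinct value is stored, which is exactly why the O(n) bound cannot be beaten without approximation.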

Page 12: Improvement via Sampling?

Sample-based estimation:
Take a random sample R (of size r) of the n values in X
Compute D(R)
Estimator: E = D(R) × n/r

Benefit – sublinear space

Cost – estimation error is high
Why? – low-frequency values are underrepresented

Page 13: Negative Result for Sampling

Consider an estimator E of D(X) examining r items of X, possibly in an adaptive/randomized fashion.

Theorem: For any δ, E has ratio error at least sqrt( ((n-r)/(2r)) · ln(1/δ) ) with probability at least δ.

Remarks:
For r = n/10: error ≥ 75% with probability ½
Leaves open randomization/approximation on full scans

Page 14: Randomized Approximation

Simplified problem – for a fixed t, is D(X) >> t?

Choose a hash function h: U → [1..t]
Initialize the answer to NO
For each xi, if h(xi) = t, set the answer to YES

Observe – only 1 bit of memory is needed!

Theorem:
If D(X) < t, then P[output NO] > 0.25
If D(X) > 2t, then P[output NO] < 0.14

[Figure: the input stream is hashed via h: U → [1..t] into a Boolean flag, yielding the YES/NO output.]

Page 15: Analysis

Let Y be the set of distinct elements of X.
Output is NO ⇔ no element of Y hashes to t
P[an element hashes to t] = 1/t
Thus – P[output NO] = (1 - 1/t)^|Y|

Since |Y| = D(X):
D(X) < t ⇒ P[output NO] > (1 - 1/t)^t > 0.25
D(X) > 2t ⇒ P[output NO] < (1 - 1/t)^(2t) < 1/e²
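A minimal sketch of this one-bit tester (the random hash is again simulated with a seeded RNG and a cache, an assumption of the sketch):

```python
import random

def one_bit_test(stream, t, seed=0):
    """Output YES (True) iff some stream element hashes to bucket t.

    Per the analysis: if D(X) < t then P[NO] > (1 - 1/t)^t > 0.25,
    while if D(X) > 2t then P[NO] < (1 - 1/t)^(2t) < 1/e^2 ~ 0.14.
    """
    rnd = random.Random(seed)
    bucket_of = {}                 # stand-in for a random h: U -> [1..t]
    flag = False                   # the single bit of state
    for x in stream:
        if x not in bucket_of:
            bucket_of[x] = rnd.randint(1, t)
        if bucket_of[x] == t:
            flag = True
    return flag
```

Repeating the test with independent hashes and taking a majority vote separates the two cases with high probability, which is exactly the boosting step that follows.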

Page 16: Boosting Accuracy

With 1 bit we can distinguish D(X) < t from D(X) > 2t

Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0

Running O(log n) instances in parallel, for t = 1, 2, 4, 8, …, n, estimates D(X) within a factor of 2

The choice of multiplier 2 is arbitrary – using a factor (1+ε) reduces the error to ε

Theorem: D(X) can be estimated within a factor (1±ε), with probability (1-δ), using space O( (log n / ε²) · log(1/δ) ).
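Combining doubling thresholds with majority voting gives a rough factor-2 estimator. The sketch below reuses the simulated hashes and keeps the largest t at which a majority of independent testers still answers YES; the trial count and seeding scheme are illustrative assumptions, not prescribed by the slides.

```python
import random

def estimate_distinct(stream, n, trials=64):
    """Estimate D(X) within roughly a factor of 2.

    For each t = 1, 2, 4, ..., n, run `trials` independent one-bit
    testers; keep the largest t at which a majority answered YES.
    """
    items = list(stream)
    estimate, t = 1, 1
    while t <= n:
        yes = 0
        for i in range(trials):
            rnd = random.Random(1_000_003 * t + i)
            bucket_of = {}           # simulated random hash h: U -> [1..t]
            hit = False
            for x in items:
                if x not in bucket_of:
                    bucket_of[x] = rnd.randint(1, t)
                if bucket_of[x] == t:
                    hit = True
            yes += hit
        if yes > trials // 2:
            estimate = t             # majority evidence that D(X) is above ~t
        t *= 2
    return estimate
```

Using multipliers (1+ε) instead of 2, and more trials per threshold, tightens this to the (1±ε) guarantee in the theorem.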

Page 17: Example 2 – Elephants-and-Ants

Identify items whose current frequency exceeds support threshold s = 0.1%.

[Jacobson 2000, Estan-Verghese 2001]


Page 18: Algorithm 1 – Lossy Counting

Step 1: Divide the stream into windows.

Window size W is a function of the support s – specified later.

[Figure: stream divided into Window 1, Window 2, Window 3, …]

Page 19: Lossy Counting in Action

[Figure: frequency counts start empty; items of the first window are added to the counts; at the window boundary, all counters are decremented by 1.]

Page 20: Lossy Counting continued

[Figure: the next window's items are added to the surviving counts; again, at the window boundary all counters are decremented by 1.]

Page 21: Error Analysis

If the current size of the stream is N and the window size is W = 1/ε, then #windows = εN.

Frequency error – how much do we undercount? Each counter loses 1 per window boundary, so any frequency is underestimated by at most εN.

Rule of thumb: set ε to 10% of the support s.
Example: given support frequency s = 1%, set error frequency ε = 0.1%.

Page 22: Putting It All Together

Output: elements with counter values exceeding (s-ε)N

Approximation guarantees:
Frequencies are underestimated by at most εN
No false negatives
False positives have true frequency at least (s-ε)N

How many counters do we need?
Worst-case bound: (1/ε) log(εN) counters

Implementation details…
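The windowed scheme can be sketched as follows. This is a simplified illustration of the idea on the slides; the full Manku-Motwani algorithm also tracks per-entry error terms, which this sketch omits.

```python
def lossy_counting(stream, s, eps):
    """Windowed Lossy Counting sketch (decrement at window boundaries).

    Window size W = 1/eps; at each boundary every counter drops by 1
    and zeroed counters are discarded. Any element is undercounted by
    at most eps*N, so reporting counts > (s - eps)*N gives no false
    negatives, and false positives have true frequency >= (s - eps)*N.
    """
    W = int(1 / eps)
    counts = {}
    n = 0
    for x in stream:
        counts[x] = counts.get(x, 0) + 1
        n += 1
        if n % W == 0:                  # window boundary: decrement and prune
            for key in list(counts):
                counts[key] -= 1
                if counts[key] <= 0:
                    del counts[key]
    return {x: c for x, c in counts.items() if c > (s - eps) * n}
```

Low-frequency "ants" are repeatedly decremented away, so the live counter set stays small while the "elephants" survive with counts within εN of the truth.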

Page 23: Algorithm 2 – Sticky Sampling

Create counters by sampling; maintain exact counts thereafter.

What is the sampling rate?

[Figure: a stream of items, with counters created for the sampled ones.]

Page 24: Sticky Sampling continued

For a finite stream of length N:
Sampling rate = (2/εN) log(1/(sδ)), where δ = probability of failure

Same rule of thumb: set ε to 10% of the support s.
Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%.

Output: elements with counter values exceeding (s-ε)N

Same error guarantees as Lossy Counting, but probabilistic:
Frequencies are underestimated by at most εN
No false negatives
False positives have true frequency at least (s-ε)N

Page 25: Number of Counters?

Finite stream of length N – sampling rate (2/εN) log(1/(sδ))
Infinite stream with unknown N – gradually adjust the sampling rate

In either case, the expected number of counters is (2/ε) log(1/(sδ)) – independent of N
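A sketch of Sticky Sampling for a finite stream of known length (a simplified illustration: the real algorithm adjusts the sampling rate in stages to handle unknown N, which this sketch skips):

```python
import math
import random

def sticky_sampling(stream, s, eps, delta, seed=0):
    """Sticky Sampling sketch for a finite stream of known length N.

    An untracked element becomes tracked with probability p; once
    tracked, its count is exact from then on. p is chosen so that the
    expected number of counters is (2/eps) * log(1/(s*delta)).
    """
    items = list(stream)
    N = len(items)
    p = 2 * math.log(1 / (s * delta)) / (eps * N)   # per-item sampling prob.
    rnd = random.Random(seed)
    counts = {}
    for x in items:
        if x in counts:
            counts[x] += 1                           # exact from here on
        elif rnd.random() < p:
            counts[x] = 1                            # "stick" to this element
    return {x: c for x, c in counts.items() if c > (s - eps) * N}
```

A frequent element is sampled early with high probability, so its counter misses only a small prefix of its occurrences; rare elements are mostly never tracked at all.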

Page 26: Example 3 – Correlated Attributes

     C1 C2 C3 C4 C5
R1    1  1  1  1  0
R2    1  1  0  1  0
R3    1  0  0  1  0
R4    0  0  1  0  1
R5    1  1  1  0  1
R6    1  1  1  1  1
R7    0  1  1  1  1
R8    0  1  1  1  0
…

Input stream – items with boolean attributes
Matrix – M(r,c) = 1 ⇔ row r has attribute c
Identify – highly correlated column pairs

Page 27: Correlation ⇒ Similarity

View a column as the set of row indexes in which it has 1's.

Set similarity (Jaccard measure):
sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|

Example:
Ci Cj
 0  1
 1  0
 1  1     sim(Ci, Cj) = 2/5 = 0.40
 0  0
 1  1
 0  1
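The set view of columns makes the measure easy to compute directly; a small sketch, reproducing the 2/5 example above:

```python
def jaccard(ci, cj):
    """Jaccard similarity of two boolean columns, viewed as row-index sets."""
    a = {r for r, bit in enumerate(ci) if bit}
    b = {r for r, bit in enumerate(cj) if bit}
    if not (a | b):
        return 0.0          # two all-zero columns: define similarity as 0
    return len(a & b) / len(a | b)
```

Computing this exactly for every column pair needs the full columns in memory, which is exactly the cost the signature idea on the next slide avoids.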

Page 28: Identifying Similar Columns?

Goal – find candidate pairs in small memory

Signature idea:
Hash each column Ci to a small signature sig(Ci)
The set of signatures fits in memory
sim(Ci, Cj) is approximated by sim(sig(Ci), sig(Cj))

Naïve approach:
Sample P rows uniformly at random
Define sig(Ci) as the P bits of Ci in the sample
Problem: sparsity – the sample would get only 0's in most columns and miss their interesting part

Page 29: Key Observation

For columns Ci, Cj there are four types of rows:

    Ci Cj
A:   1  1
B:   1  0
C:   0  1
D:   0  0

Overloading notation, let A = # rows of type A (similarly B, C, D).

Observation: sim(Ci, Cj) = A / (A + B + C)

Page 30: Min Hashing

Randomly permute the rows.
Hash h(Ci) = index of the first row with a 1 in column Ci.

Surprising property: P[h(Ci) = h(Cj)] = sim(Ci, Cj)

Why? Both equal A/(A+B+C):
Look down columns Ci, Cj until the first non-type-D row
h(Ci) = h(Cj) exactly when that row is of type A

Page 31: Min-Hash Signatures

Pick k random row permutations.
Min-hash signature: sig(C) = the k indexes of the first rows with a 1 in column C.

Similarity of signatures:
Define sim(sig(Ci), sig(Cj)) = fraction of permutations on which the min-hash values agree.

Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)

Page 32: Example

     C1 C2 C3
R1    1  0  1
R2    0  1  1
R3    1  0  0
R4    1  0  1
R5    0  1  0

Signatures:
                   S1 S2 S3
Perm 1 = (12345):   1  2  1
Perm 2 = (54321):   4  5  4
Perm 3 = (34512):   3  5  4

Similarities:
          1-2   1-3   2-3
Col-Col  0.00  0.50  0.25
Sig-Sig  0.00  0.67  0.00

Page 33: Implementation Trick

Permuting the rows even once is prohibitive.

Row hashing:
Pick k hash functions hk: {1,…,n} → {1,…,O(n)}
The ordering under hk gives a random row permutation
This yields a one-pass implementation
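Under the row-hashing trick, a one-pass signature computation can be sketched as follows. The hash functions are simulated here with seeded RNGs drawing ranks from a large range so ties are negligible; both are assumptions of the sketch, not part of the slides.

```python
import random

def minhash_signatures(rows, num_cols, k, seed=0):
    """One-pass Min-Hash signatures via row hashing.

    Each of the k simulated hash functions assigns every row a
    pseudo-random rank; sig(C)[i] is the smallest i-th rank over
    rows where column C has a 1.
    """
    rnds = [random.Random(seed + i) for i in range(k)]
    INF = float('inf')
    sig = [[INF] * k for _ in range(num_cols)]
    for row in rows:                        # single pass over the rows
        ranks = [rnd.randrange(1 << 30) for rnd in rnds]
        for c in range(num_cols):
            if row[c]:
                for i in range(k):
                    if ranks[i] < sig[c][i]:
                        sig[c][i] = ranks[i]
    return sig

def sig_similarity(si, sj):
    """Fraction of hash functions on which the Min-Hash values agree."""
    return sum(a == b for a, b in zip(si, sj)) / len(si)
```

With k in the hundreds, sig_similarity concentrates around the true Jaccard similarity, while each column costs only k small integers of memory.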

Page 34: Comparing Signatures

Signature matrix S:
Rows = hash functions
Columns = columns
Entries = signatures

Need – pairwise similarity of the signature columns
Problem – Min-Hash fits the column signatures in memory, but comparing all signature pairs takes too much time
Limiting the candidate pairs – Locality-Sensitive Hashing

Page 35: Summary

New algorithmic paradigms are needed for streams and massive data sets.

Negative results abound – we need approximation and the power of randomization.

Page 36: Thank You!

Page 37: References

Rajeev Motwani (http://theory.stanford.edu/~rajeev)

STREAM Project (http://www-db.stanford.edu/stream)

STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering 2003.

Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003.

Babcock-Babu-Datar-Motwani-Widom. Models and Issues in Data Stream Systems. PODS 2002.

Manku-Motwani. Approximate Frequency Counts over Streaming Data. VLDB 2003.

Babcock-Datar-Motwani-O'Callaghan. Maintaining Variance and K-Medians over Data Stream Windows. PODS 2003.

Guha-Meyerson-Mishra-Motwani-O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.

Page 38: References (contd.)

Datar-Gionis-Indyk-Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing 2002.

Babcock-Datar-Motwani. Sampling From a Moving Window Over Streaming Data. SODA 2002.

O'Callaghan-Guha-Mishra-Meyerson-Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2003.

Guha-Mishra-Motwani-O'Callaghan. Clustering Data Streams. FOCS 2000.

Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000.

Charikar-Chaudhuri-Motwani-Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000.

Gionis-Indyk-Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.

Indyk-Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.