Randomization for Massive and Streaming Data Sets
Rajeev Motwani
CS Forum Annual Meeting, May 21, 2003


Page 1: Randomization for Massive and Streaming Data Sets
Rajeev Motwani

Page 2: Data Stream Management Systems

Traditional DBMS – data stored in finite, persistent data sets

Data Streams – distributed, continuous, unbounded, rapid, time-varying, noisy, …

Emerging DSMS – a variety of modern applications:
Network monitoring and traffic engineering
Telecom call records
Network security
Financial applications
Sensor networks
Manufacturing processes
Web logs and clickstreams
Massive data sets

Page 3: DSMS – Big Picture

[Architecture diagram: input streams enter the DSMS, which maintains a scratch store, an archive, and stored relations; users register queries and receive streamed or stored results.]

Page 4: Algorithmic Issues

Computational Model: streaming data (or secondary memory); bounded main memory

Techniques: new paradigms; negative results and approximation; randomization

Complexity Measures: memory; time per item (online, real-time); number of passes (linear scans in secondary memory)

Page 5: Stream Model of Computation

[Figure: a stream of bits arrives over increasing time and is processed against synopsis data structures held in main memory.]

Memory: poly(1/ε, log N)

Query/Update Time: poly(1/ε, log N)

N: # items so far, or window size

ε: error parameter

Page 6: "Toy" Example – Network Monitoring

[Diagram: network measurements and packet traces stream into the DSMS (scratch store, archive, lookup tables); registered monitoring queries produce intrusion warnings and online performance metrics.]

Page 7: Frequency Related Problems

[Histogram of element frequencies over domain values 1–20.]

Find all elements with frequency > 0.1%

Top-k most frequent elements

What is the frequency of element 3?
What is the total frequency of elements between 8 and 14?

Find elements that occupy 0.1% of the tail.

Mean + Variance?

Median?

How many elements have non-zero frequency?

Analytics on Packet Headers – IP Addresses

Page 8: Example 1 – Distinct Values

Input: sequence X = x1, x2, …, xn, … over domain U = {0, 1, 2, …, u-1}
Compute: D(X), the number of distinct values

Remarks:
Assume stream size n is finite/known (generally, n is the window size)
Domain could be arbitrary (e.g., text, tuples)

Page 9: Naïve Approach

Keep a counter C(i) for each domain value i
Initialize counters C(i) ← 0
Scan X, incrementing the appropriate counters

Problem:
Memory size M << n
Space O(u) – and possibly u >> n (e.g., when counting distinct words in a web crawl)

Page 10: Negative Result

Theorem: Deterministic algorithms need M = Ω(n log u) bits.

Proof: information-theoretic arguments.

Note: this leaves open randomization/approximation.

Page 11: Randomized Algorithm

[Figure: the input stream is hashed via h: U → [1..t] into a hash table.]

Analysis:
A random h gives few collisions, hence average list size O(n/t)
Thus:
Space: O(n) – since we need t = Ω(n)
Time: O(1) per item [expected]
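The hash-table scheme above can be sketched in Python. This is a simulation for illustration only: the random hash h is emulated by caching random bucket assignments, which is an assumption of the sketch, not how a constant-space hash function would actually be represented.

```python
import random

def count_distinct_exact(stream, t, seed=0):
    """Exact distinct count via a chained hash table with t buckets.

    For a random h, the expected chain length is O(n/t), so per-item
    time is O(1) expected once t = Omega(n); total space is O(n).
    """
    rnd = random.Random(seed)
    bucket_of = {}                      # stand-in for a random h: U -> [0, t)

    def h(x):
        if x not in bucket_of:
            bucket_of[x] = rnd.randrange(t)
        return bucket_of[x]

    buckets = [[] for _ in range(t)]
    distinct = 0
    for x in stream:
        chain = buckets[h(x)]
        if x not in chain:              # scan the (short) chain for x
            chain.append(x)
            distinct += 1
    return distinct
```

This makes the space cost concrete: every distinct value is stored, which is exactly why the O(n) bound cannot be beaten without approximation.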

Page 12: Improvement via Sampling?

Sample-based estimation:
Take a random sample R (of size r) of the n values in X
Compute D(R)
Estimator: E = D(R) × n/r

Benefit – sublinear space

Cost – estimation error is high
Why? – low-frequency values are underrepresented

Page 13: Negative Result for Sampling

Consider an estimator E of D(X) examining r items of X, possibly in an adaptive/randomized fashion.

Theorem: For any δ, E has ratio error at least sqrt( ((n-r)/(2r)) · ln(1/δ) ) with probability at least δ.

Remarks:
For r = n/10: error ≥ 75% with probability ½
Leaves open randomization/approximation on full scans

Page 14: Randomized Approximation

Simplified problem – for a fixed t, is D(X) >> t?

Choose a hash function h: U → [1..t]
Initialize the answer to NO
For each xi, if h(xi) = t, set the answer to YES

Observe – only 1 bit of memory is needed!

Theorem:
If D(X) < t, then P[output NO] > 0.25
If D(X) > 2t, then P[output NO] < 0.14

[Figure: the input stream is hashed via h: U → [1..t] into a Boolean flag, yielding the YES/NO output.]

Page 15: Analysis

Let Y be the set of distinct elements of X.
Output is NO ⇔ no element of Y hashes to t
P[an element hashes to t] = 1/t
Thus – P[output NO] = (1 - 1/t)^|Y|

Since |Y| = D(X):
D(X) < t ⇒ P[output NO] > (1 - 1/t)^t > 0.25
D(X) > 2t ⇒ P[output NO] < (1 - 1/t)^(2t) < 1/e²
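A minimal sketch of this one-bit tester (the random hash is again simulated with a seeded RNG and a cache, an assumption of the sketch):

```python
import random

def one_bit_test(stream, t, seed=0):
    """Output YES (True) iff some stream element hashes to bucket t.

    Per the analysis: if D(X) < t then P[NO] > (1 - 1/t)^t > 0.25,
    while if D(X) > 2t then P[NO] < (1 - 1/t)^(2t) < 1/e^2 ~ 0.14.
    """
    rnd = random.Random(seed)
    bucket_of = {}                 # stand-in for a random h: U -> [1..t]
    flag = False                   # the single bit of state
    for x in stream:
        if x not in bucket_of:
            bucket_of[x] = rnd.randint(1, t)
        if bucket_of[x] == t:
            flag = True
    return flag
```

Repeating the test with independent hashes and taking a majority vote separates the two cases with high probability, which is exactly the boosting step that follows.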

Page 16: Boosting Accuracy

With 1 bit we can distinguish D(X) < t from D(X) > 2t

Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0

Running O(log n) instances in parallel, for t = 1, 2, 4, 8, …, n, estimates D(X) within a factor of 2

The choice of multiplier 2 is arbitrary – using a factor (1+ε) reduces the error to ε

Theorem: D(X) can be estimated within a factor (1±ε), with probability (1-δ), using space O( (log n / ε²) · log(1/δ) ).
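Combining doubling thresholds with majority voting gives a rough factor-2 estimator. The sketch below reuses the simulated hashes and keeps the largest t at which a majority of independent testers still answers YES; the trial count and seeding scheme are illustrative assumptions, not prescribed by the slides.

```python
import random

def estimate_distinct(stream, n, trials=64):
    """Estimate D(X) within roughly a factor of 2.

    For each t = 1, 2, 4, ..., n, run `trials` independent one-bit
    testers; keep the largest t at which a majority answered YES.
    """
    items = list(stream)
    estimate, t = 1, 1
    while t <= n:
        yes = 0
        for i in range(trials):
            rnd = random.Random(1_000_003 * t + i)
            bucket_of = {}           # simulated random hash h: U -> [1..t]
            hit = False
            for x in items:
                if x not in bucket_of:
                    bucket_of[x] = rnd.randint(1, t)
                if bucket_of[x] == t:
                    hit = True
            yes += hit
        if yes > trials // 2:
            estimate = t             # majority evidence that D(X) is above ~t
        t *= 2
    return estimate
```

Using multipliers (1+ε) instead of 2, and more trials per threshold, tightens this to the (1±ε) guarantee in the theorem.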

Page 17: Example 2 – Elephants-and-Ants

Identify items whose current frequency exceeds support threshold s = 0.1%.

[Jacobson 2000, Estan-Verghese 2001]


Page 18: Algorithm 1 – Lossy Counting

Step 1: Divide the stream into windows.

Window size W is a function of the support s – specified later.

[Figure: stream divided into Window 1, Window 2, Window 3, …]

Page 19: Lossy Counting in Action

[Figure: frequency counts start empty; items of the first window are added to the counts; at the window boundary, all counters are decremented by 1.]

Page 20: Lossy Counting continued

[Figure: the next window's items are added to the surviving counts; again, at the window boundary all counters are decremented by 1.]

Page 21: Error Analysis

If the current size of the stream is N and the window size is W = 1/ε, then #windows = εN.

Frequency error – how much do we undercount? Each counter loses 1 per window boundary, so any frequency is underestimated by at most εN.

Rule of thumb: set ε to 10% of the support s.
Example: given support frequency s = 1%, set error frequency ε = 0.1%.

Page 22: Putting It All Together

Output: elements with counter values exceeding (s-ε)N

Approximation guarantees:
Frequencies are underestimated by at most εN
No false negatives
False positives have true frequency at least (s-ε)N

How many counters do we need?
Worst-case bound: (1/ε) log(εN) counters

Implementation details…
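The windowed scheme can be sketched as follows. This is a simplified illustration of the idea on the slides; the full Manku-Motwani algorithm also tracks per-entry error terms, which this sketch omits.

```python
def lossy_counting(stream, s, eps):
    """Windowed Lossy Counting sketch (decrement at window boundaries).

    Window size W = 1/eps; at each boundary every counter drops by 1
    and zeroed counters are discarded. Any element is undercounted by
    at most eps*N, so reporting counts > (s - eps)*N gives no false
    negatives, and false positives have true frequency >= (s - eps)*N.
    """
    W = int(1 / eps)
    counts = {}
    n = 0
    for x in stream:
        counts[x] = counts.get(x, 0) + 1
        n += 1
        if n % W == 0:                  # window boundary: decrement and prune
            for key in list(counts):
                counts[key] -= 1
                if counts[key] <= 0:
                    del counts[key]
    return {x: c for x, c in counts.items() if c > (s - eps) * n}
```

Low-frequency "ants" are repeatedly decremented away, so the live counter set stays small while the "elephants" survive with counts within εN of the truth.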

Page 23: Algorithm 2 – Sticky Sampling

Create counters by sampling; maintain exact counts thereafter.

What is the sampling rate?

[Figure: a stream of items, with counters created for the sampled ones.]

Page 24: Sticky Sampling continued

For a finite stream of length N:
Sampling rate = (2/εN) log(1/(sδ)), where δ = probability of failure

Same rule of thumb: set ε to 10% of the support s.
Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%.

Output: elements with counter values exceeding (s-ε)N

Same error guarantees as Lossy Counting, but probabilistic:
Frequencies are underestimated by at most εN
No false negatives
False positives have true frequency at least (s-ε)N

Page 25: Number of Counters?

Finite stream of length N – sampling rate (2/εN) log(1/(sδ))
Infinite stream with unknown N – gradually adjust the sampling rate

In either case, the expected number of counters is (2/ε) log(1/(sδ)) – independent of N
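A sketch of Sticky Sampling for a finite stream of known length (a simplified illustration: the real algorithm adjusts the sampling rate in stages to handle unknown N, which this sketch skips):

```python
import math
import random

def sticky_sampling(stream, s, eps, delta, seed=0):
    """Sticky Sampling sketch for a finite stream of known length N.

    An untracked element becomes tracked with probability p; once
    tracked, its count is exact from then on. p is chosen so that the
    expected number of counters is (2/eps) * log(1/(s*delta)).
    """
    items = list(stream)
    N = len(items)
    p = 2 * math.log(1 / (s * delta)) / (eps * N)   # per-item sampling prob.
    rnd = random.Random(seed)
    counts = {}
    for x in items:
        if x in counts:
            counts[x] += 1                           # exact from here on
        elif rnd.random() < p:
            counts[x] = 1                            # "stick" to this element
    return {x: c for x, c in counts.items() if c > (s - eps) * N}
```

A frequent element is sampled early with high probability, so its counter misses only a small prefix of its occurrences; rare elements are mostly never tracked at all.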

Page 26: Example 3 – Correlated Attributes

     C1 C2 C3 C4 C5
R1    1  1  1  1  0
R2    1  1  0  1  0
R3    1  0  0  1  0
R4    0  0  1  0  1
R5    1  1  1  0  1
R6    1  1  1  1  1
R7    0  1  1  1  1
R8    0  1  1  1  0
…

Input stream – items with boolean attributes
Matrix – M(r,c) = 1 ⇔ row r has attribute c
Identify – highly correlated column pairs

Page 27: Correlation ⇒ Similarity

View a column as the set of row indexes in which it has 1's.

Set similarity (Jaccard measure):
sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|

Example:
Ci Cj
 0  1
 1  0
 1  1     sim(Ci, Cj) = 2/5 = 0.40
 0  0
 1  1
 0  1
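The set view of columns makes the measure easy to compute directly; a small sketch, reproducing the 2/5 example above:

```python
def jaccard(ci, cj):
    """Jaccard similarity of two boolean columns, viewed as row-index sets."""
    a = {r for r, bit in enumerate(ci) if bit}
    b = {r for r, bit in enumerate(cj) if bit}
    if not (a | b):
        return 0.0          # two all-zero columns: define similarity as 0
    return len(a & b) / len(a | b)
```

Computing this exactly for every column pair needs the full columns in memory, which is exactly the cost the signature idea on the next slide avoids.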

Page 28: Identifying Similar Columns?

Goal – find candidate pairs in small memory

Signature idea:
Hash each column Ci to a small signature sig(Ci)
The set of signatures fits in memory
sim(Ci, Cj) is approximated by sim(sig(Ci), sig(Cj))

Naïve approach:
Sample P rows uniformly at random
Define sig(Ci) as the P bits of Ci in the sample
Problem: sparsity – the sample would get only 0's in most columns and miss their interesting part

Page 29: Key Observation

For columns Ci, Cj there are four types of rows:

    Ci Cj
A:   1  1
B:   1  0
C:   0  1
D:   0  0

Overloading notation, let A = # rows of type A (similarly B, C, D).

Observation: sim(Ci, Cj) = A / (A + B + C)

Page 30: Min Hashing

Randomly permute the rows.
Hash h(Ci) = index of the first row with a 1 in column Ci.

Surprising property: P[h(Ci) = h(Cj)] = sim(Ci, Cj)

Why? Both equal A/(A+B+C):
Look down columns Ci, Cj until the first non-type-D row
h(Ci) = h(Cj) exactly when that row is of type A

Page 31: Min-Hash Signatures

Pick k random row permutations.
Min-hash signature: sig(C) = the k indexes of the first rows with a 1 in column C.

Similarity of signatures:
Define sim(sig(Ci), sig(Cj)) = fraction of permutations on which the min-hash values agree.

Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)

Page 32: Example

     C1 C2 C3
R1    1  0  1
R2    0  1  1
R3    1  0  0
R4    1  0  1
R5    0  1  0

Signatures:
                   S1 S2 S3
Perm 1 = (12345):   1  2  1
Perm 2 = (54321):   4  5  4
Perm 3 = (34512):   3  5  4

Similarities:
          1-2   1-3   2-3
Col-Col  0.00  0.50  0.25
Sig-Sig  0.00  0.67  0.00

Page 33: Implementation Trick

Permuting the rows even once is prohibitive.

Row hashing:
Pick k hash functions hk: {1,…,n} → {1,…,O(n)}
The ordering under hk gives a random row permutation
This yields a one-pass implementation
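Under the row-hashing trick, a one-pass signature computation can be sketched as follows. The hash functions are simulated here with seeded RNGs drawing ranks from a large range so ties are negligible; both are assumptions of the sketch, not part of the slides.

```python
import random

def minhash_signatures(rows, num_cols, k, seed=0):
    """One-pass Min-Hash signatures via row hashing.

    Each of the k simulated hash functions assigns every row a
    pseudo-random rank; sig(C)[i] is the smallest i-th rank over
    rows where column C has a 1.
    """
    rnds = [random.Random(seed + i) for i in range(k)]
    INF = float('inf')
    sig = [[INF] * k for _ in range(num_cols)]
    for row in rows:                        # single pass over the rows
        ranks = [rnd.randrange(1 << 30) for rnd in rnds]
        for c in range(num_cols):
            if row[c]:
                for i in range(k):
                    if ranks[i] < sig[c][i]:
                        sig[c][i] = ranks[i]
    return sig

def sig_similarity(si, sj):
    """Fraction of hash functions on which the Min-Hash values agree."""
    return sum(a == b for a, b in zip(si, sj)) / len(si)
```

With k in the hundreds, sig_similarity concentrates around the true Jaccard similarity, while each column costs only k small integers of memory.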

Page 34: Comparing Signatures

Signature matrix S:
Rows = hash functions
Columns = columns
Entries = signatures

Need – pairwise similarity of the signature columns
Problem – Min-Hash fits the column signatures in memory, but comparing all signature pairs takes too much time
Limiting the candidate pairs – Locality-Sensitive Hashing

Page 35: Summary

New algorithmic paradigms are needed for streams and massive data sets.

Negative results abound – we need approximation and the power of randomization.

Page 36: Thank You!

Page 37: References

Rajeev Motwani (http://theory.stanford.edu/~rajeev)

STREAM Project (http://www-db.stanford.edu/stream)

STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering 2003.

Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003.

Babcock-Babu-Datar-Motwani-Widom. Models and Issues in Data Stream Systems. PODS 2002.

Manku-Motwani. Approximate Frequency Counts over Streaming Data. VLDB 2003.

Babcock-Datar-Motwani-O'Callaghan. Maintaining Variance and K-Medians over Data Stream Windows. PODS 2003.

Guha-Meyerson-Mishra-Motwani-O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.

Page 38: References (contd.)

Datar-Gionis-Indyk-Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing 2002.

Babcock-Datar-Motwani. Sampling From a Moving Window Over Streaming Data. SODA 2002.

O'Callaghan-Guha-Mishra-Meyerson-Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2003.

Guha-Mishra-Motwani-O'Callaghan. Clustering Data Streams. FOCS 2000.

Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000.

Charikar-Chaudhuri-Motwani-Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000.

Gionis-Indyk-Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.

Indyk-Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.