sublinear algorihms for big data

31
Sublinear Algorihms for Big Data Grigory Yaroslavtsev http://grigory.us Lectur e 2

Upload: zubin

Post on 20-Jan-2016

32 views

Category:

Documents


1 download

DESCRIPTION

Sublinear Algorihms for Big Data. Lecture 2. Grigory Yaroslavtsev http://grigory.us. Recap. (Markov) For every ( Chebyshev ) For every ( Chernoff ) Let be independent and identically distributed r.vs with range [0, c] and expectation . Then if and. Today. Approximate Median - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Sublinear Algorihms  for Big Data

Sublinear Algorihms for Big Data

Grigory Yaroslavtsevhttp://grigory.us

Lecture 2

Page 2: Sublinear Algorihms  for Big Data

Recap

• (Markov) For every

• (Chebyshev) For every

• (Chernoff) Let be independent and identically distributed r.vs with range [0, c] and expectation . Then if and

Page 3: Sublinear Algorihms  for Big Data

Today

• Approximate Median• Alon-Mathias-Szegedy Sampling• Frequency Moments• Distinct Elements• Count-Min

Page 4: Sublinear Algorihms  for Big Data

Data Streams

• Stream: elements from universe , e.g.

• = frequency of in the stream = # of occurrences of value

Page 5: Sublinear Algorihms  for Big Data

Approximate Median

• (all distinct) and let

• Problem: Find -approximate median, i.e. such that

• Exercise: Can we approximate the value of the median with additive error in sublinear time?

• Algorithm: Return the median of a sample of size taken from (with replacement).

Page 6: Sublinear Algorihms  for Big Data

Approximate Median

• Problem: Find -approximate median, i.e. such that

• Algorithm: Return the median of a sample of size taken from (with replacement).

• Claim: If then this algorithm gives -median with probability

Page 7: Sublinear Algorihms  for Big Data

Approximate Median• Partition into 3 groups

• Key fact: If less than elements from each of and are in sample then its median is in

• Let if -th sample is in and 0 otherwise.• Let . By Chernoff, if

• Same for + union bound error probability

Page 8: Sublinear Algorihms  for Big Data

AMS Sampling

• Problem: Estimate , for an arbitrary function with

• Estimator: Sample , where is sampled uniformly at random from and compute:

Output: • Expectation:

Page 9: Sublinear Algorihms  for Big Data

Frequency Moments

• Define for – # number of distinct elements– # elements– = “Gini index”, “surprise index”

Page 10: Sublinear Algorihms  for Big Data

Frequency Moments

• Define for • Use AMS estimator with

• Exercise: , where • Repeat times and take average . By Chernoff:

• Taking gives

Page 11: Sublinear Algorihms  for Big Data

Frequency Moments

• Lemma:

• Result: memory suffices for -approximation of • Question: What if we don’t know • Then we can use probabilistic guessing (similar to

Morris’s algorithm), replacing with .

Page 12: Sublinear Algorihms  for Big Data

Frequency Moments

• Lemma:

• Exercise: Hint: worst-case when . Use convexity of • Case 1:

Page 13: Sublinear Algorihms  for Big Data

Frequency Moments

• Lemma:

• Case 2:

Page 14: Sublinear Algorihms  for Big Data

Hash Functions

• Definition: A family of functions from is -wise independent if for any distinct and

• Example: If for prime

is a -wise independent family of hash functions.

Page 15: Sublinear Algorihms  for Big Data

Linear Sketches

• Sketching algorithm: picks a random matrix where and computes

• Can be incrementally updated:– We have a sketch – When arrives, new frequencies are – Updating the sketch:

• Need to choose random matrices carefully

Page 16: Sublinear Algorihms  for Big Data

𝐹 2• Problem: -approximation for • Algorithm:– Let , where entries of each row are -wise independent

and rows are independent– Don’t store the matrix: 4-wise independent hash

functions – Compute average squared entries “appropriately”

• Analysis: – Let be any entry of .– Lemma:– Lemma: Var

Page 17: Sublinear Algorihms  for Big Data

𝐹 2:Expectaton• Let be a row of with entries

• We used 2-wise independence for .

Page 18: Sublinear Algorihms  for Big Data

𝐹 2:Variance

• by 4-wise independence

Page 19: Sublinear Algorihms  for Big Data

: Distinct Elements

• Problem: -approximation for • Simplified: For fixed with prob. distinguish:

vs. • Original problem reduces by trying values of

T:

Page 20: Sublinear Algorihms  for Big Data

: Distinct Elements

• Simplified: For fixed with prob. distinguish: vs.

• Algorithm: – Choose random sets where – Compute – If at least of the values are zero, output

Page 21: Sublinear Algorihms  for Big Data

vs.

• Algorithm: – Choose random sets where – Compute – If at least of the values are zero, output

• Analysis: – If , then – If , then – Chernoff: gives correctness w.p.

Page 22: Sublinear Algorihms  for Big Data

vs.

• Analysis: – If , then – If , then

• If is large and is small then:

• If

• If

Page 23: Sublinear Algorihms  for Big Data

Count-Min Sketch

• https://sites.google.com/site/countminsketch/• Stream: elements from universe , e.g.

• = frequency of in the stream = # of occurrences of value

• Problems:– Point Query: For estimate – Range Query: For estimate – Quantile Query: For find with – Heavy Hitters: For find all with

Page 24: Sublinear Algorihms  for Big Data

Count-Min Sketch: Construction

• Let be -wise independent hash functions• We maintain counters with values:

elements in the stream with • For every the value and so:

• If and then:.

Page 25: Sublinear Algorihms  for Big Data

Count-Min Sketch: Analysis

• Define random variables such that

• Define if and 0 otherwise:

• By -wise independence:

• By Markov inequality,

Page 26: Sublinear Algorihms  for Big Data

Count-Min Sketch: Analysis

• All are independent

• The w.p. there exists such that

• CountMin estimates values up to with total memory

Page 27: Sublinear Algorihms  for Big Data

Dyadic Intervals

• Define partitions of :

• Exercise: Any interval can be written as a disjoint union of at most such intervals.

• Example: For

Page 28: Sublinear Algorihms  for Big Data

Count-Min: Range Queries and Quantiles

• Range Query: For estimate • Approximate median: Find such that:

and

Page 29: Sublinear Algorihms  for Big Data

Count-Min: Range Queries and Quantiles

• Algorithm: Construct Count-Min sketches, one for each such that for any we have an estimate for such that:

• To estimate let be decomposition:

• Hence,

Page 30: Sublinear Algorihms  for Big Data

Count-Min: Heavy Hitters

• Heavy Hitters: For find all with but no elements with • Algorithm:– Consider binary tree whose leaves are [n] and

associate internal nodes with intervals corresponding to descendant leaves

– Compute Count-Min sketches for each – Level-by-level from root, mark children of marked

nodes if – Return all marked leaves

• Finds heavy-hitters in steps

Page 31: Sublinear Algorihms  for Big Data

Thank you!

• Questions?