sublinear algorihms for big data

Sublinear Algorihms for Big Data

Grigory Yaroslavtsevhttp://grigory.us

Lecture 2

http://grigory.us/

Recap

• (Markov) For every

• (Chebyshev) For every

• (Chernoff) Let be independent and identically distributed r.vs with range [0, c] and expectation . Then if and

Today

• Approximate Median• Alon-Mathias-Szegedy Sampling• Frequency Moments• Distinct Elements• Count-Min

Data Streams

• Stream: elements from universe , e.g.

• = frequency of in the stream = # of occurrences of value

Approximate Median

• (all distinct) and let

• Problem: Find -approximate median, i.e. such that

• Exercise: Can we approximate the value of the median with additive error in sublinear time?

• Algorithm: Return the median of a sample of size taken from (with replacement).

Approximate Median

• Problem: Find -approximate median, i.e. such that

• Algorithm: Return the median of a sample of size taken from (with replacement).

• Claim: If then this algorithm gives -median with probability

Approximate Median• Partition into 3 groups

• Key fact: If less than elements from each of and are in sample then its median is in

• Let if -th sample is in and 0 otherwise.• Let . By Chernoff, if

• Same for + union bound error probability

AMS Sampling

• Problem: Estimate , for an arbitrary function with

• Estimator: Sample , where is sampled uniformly at random from and compute:

Output: • Expectation:

Frequency Moments

• Define for – # number of distinct elements– # elements– = “Gini index”, “surprise index”

Frequency Moments

• Define for • Use AMS estimator with

• Exercise: , where • Repeat times and take average . By Chernoff:

• Taking gives

Frequency Moments

• Lemma:

• Result: memory suffices for -approximation of • Question: What if we don’t know • Then we can use probabilistic guessing (similar to

Morris’s algorithm), replacing with .

Frequency Moments

• Lemma:

• Exercise: Hint: worst-case when . Use convexity of • Case 1:

Frequency Moments

• Lemma:

• Case 2:

Hash Functions

• Definition: A family of functions from is -wise independent if for any distinct and

• Example: If for prime

is a -wise independent family of hash functions.

Linear Sketches

• Sketching algorithm: picks a random matrix where and computes

• Can be incrementally updated:– We have a sketch – When arrives, new frequencies are – Updating the sketch:

• Need to choose random matrices carefully

𝐹 2• Problem: -approximation for • Algorithm:– Let , where entries of each row are -wise independent

and rows are independent– Don’t store the matrix: 4-wise independent hash

functions – Compute average squared entries “appropriately”

• Analysis: – Let be any entry of .– Lemma:– Lemma: Var

𝐹 2:Expectaton• Let be a row of with entries

• We used 2-wise independence for .

𝐹 2:Variance

• by 4-wise independence

: Distinct Elements

• Problem: -approximation for • Simplified: For fixed with prob. distinguish:

vs. • Original problem reduces by trying values of

T:

: Distinct Elements

• Simplified: For fixed with prob. distinguish: vs.

• Algorithm: – Choose random sets where – Compute – If at least of the values are zero, output

vs.

• Algorithm: – Choose random sets where – Compute – If at least of the values are zero, output

• Analysis: – If , then – If , then – Chernoff: gives correctness w.p.

vs.

• Analysis: – If , then – If , then

• If is large and is small then:

• If

• If

Count-Min Sketch

• https://sites.google.com/site/countminsketch/• Stream: elements from universe , e.g.

• = frequency of in the stream = # of occurrences of value

• Problems:– Point Query: For estimate – Range Query: For estimate – Quantile Query: For find with – Heavy Hitters: For find all with

https://sites.google.com/site/countminsketch/

https://sites.google.com/site/countminsketch/

Count-Min Sketch: Construction

• Let be -wise independent hash functions• We maintain counters with values:

elements in the stream with • For every the value and so:

• If and then:.

Count-Min Sketch: Analysis

• Define random variables such that

• Define if and 0 otherwise:

• By -wise independence:

• By Markov inequality,

Count-Min Sketch: Analysis

• All are independent

• The w.p. there exists such that

• CountMin estimates values up to with total memory

Dyadic Intervals

• Define partitions of :

…

• Exercise: Any interval can be written as a disjoint union of at most such intervals.

• Example: For

Count-Min: Range Queries and Quantiles

• Range Query: For estimate • Approximate median: Find such that:

and

Count-Min: Range Queries and Quantiles

• Algorithm: Construct Count-Min sketches, one for each such that for any we have an estimate for such that:

• To estimate let be decomposition:

• Hence,

Count-Min: Heavy Hitters

• Heavy Hitters: For find all with but no elements with • Algorithm:– Consider binary tree whose leaves are [n] and

associate internal nodes with intervals corresponding to descendant leaves

– Compute Count-Min sketches for each – Level-by-level from root, mark children of marked

nodes if – Return all marked leaves

• Finds heavy-hitters in steps

Thank you!

• Questions?

sublinear algorihms for big data

Documents

range queries

big data sublinear algorihms

analysisdyadic intervalscount

heavy hittersthank