sublinear algorihms for big data
DESCRIPTION
Sublinear Algorihms for Big Data. Lecture 2. Grigory Yaroslavtsev http://grigory.us. Recap. (Markov) For every ( Chebyshev ) For every ( Chernoff ) Let be independent and identically distributed r.vs with range [0, c] and expectation . Then if and. Today. Approximate Median - PowerPoint PPT PresentationTRANSCRIPT
Recap
• (Markov) For every
• (Chebyshev) For every
• (Chernoff) Let be independent and identically distributed r.vs with range [0, c] and expectation . Then if and
Today
• Approximate Median• Alon-Mathias-Szegedy Sampling• Frequency Moments• Distinct Elements• Count-Min
Data Streams
• Stream: elements from universe , e.g.
• = frequency of in the stream = # of occurrences of value
Approximate Median
• (all distinct) and let
• Problem: Find -approximate median, i.e. such that
• Exercise: Can we approximate the value of the median with additive error in sublinear time?
• Algorithm: Return the median of a sample of size taken from (with replacement).
Approximate Median
• Problem: Find -approximate median, i.e. such that
• Algorithm: Return the median of a sample of size taken from (with replacement).
• Claim: If then this algorithm gives -median with probability
Approximate Median• Partition into 3 groups
• Key fact: If less than elements from each of and are in sample then its median is in
• Let if -th sample is in and 0 otherwise.• Let . By Chernoff, if
• Same for + union bound error probability
AMS Sampling
• Problem: Estimate , for an arbitrary function with
• Estimator: Sample , where is sampled uniformly at random from and compute:
Output: • Expectation:
Frequency Moments
• Define for – # number of distinct elements– # elements– = “Gini index”, “surprise index”
Frequency Moments
• Define for • Use AMS estimator with
• Exercise: , where • Repeat times and take average . By Chernoff:
• Taking gives
Frequency Moments
• Lemma:
• Result: memory suffices for -approximation of • Question: What if we don’t know • Then we can use probabilistic guessing (similar to
Morris’s algorithm), replacing with .
Frequency Moments
• Lemma:
• Exercise: Hint: worst-case when . Use convexity of • Case 1:
Frequency Moments
• Lemma:
• Case 2:
Hash Functions
• Definition: A family of functions from is -wise independent if for any distinct and
• Example: If for prime
is a -wise independent family of hash functions.
Linear Sketches
• Sketching algorithm: picks a random matrix where and computes
• Can be incrementally updated:– We have a sketch – When arrives, new frequencies are – Updating the sketch:
• Need to choose random matrices carefully
𝐹 2• Problem: -approximation for • Algorithm:– Let , where entries of each row are -wise independent
and rows are independent– Don’t store the matrix: 4-wise independent hash
functions – Compute average squared entries “appropriately”
• Analysis: – Let be any entry of .– Lemma:– Lemma: Var
𝐹 2:Expectaton• Let be a row of with entries
• We used 2-wise independence for .
𝐹 2:Variance
• by 4-wise independence
: Distinct Elements
• Problem: -approximation for • Simplified: For fixed with prob. distinguish:
vs. • Original problem reduces by trying values of
T:
: Distinct Elements
• Simplified: For fixed with prob. distinguish: vs.
• Algorithm: – Choose random sets where – Compute – If at least of the values are zero, output
vs.
• Algorithm: – Choose random sets where – Compute – If at least of the values are zero, output
• Analysis: – If , then – If , then – Chernoff: gives correctness w.p.
vs.
• Analysis: – If , then – If , then
• If is large and is small then:
• If
• If
Count-Min Sketch
• https://sites.google.com/site/countminsketch/• Stream: elements from universe , e.g.
• = frequency of in the stream = # of occurrences of value
• Problems:– Point Query: For estimate – Range Query: For estimate – Quantile Query: For find with – Heavy Hitters: For find all with
Count-Min Sketch: Construction
• Let be -wise independent hash functions• We maintain counters with values:
elements in the stream with • For every the value and so:
• If and then:.
Count-Min Sketch: Analysis
• Define random variables such that
• Define if and 0 otherwise:
• By -wise independence:
• By Markov inequality,
Count-Min Sketch: Analysis
• All are independent
• The w.p. there exists such that
• CountMin estimates values up to with total memory
Dyadic Intervals
• Define partitions of :
…
• Exercise: Any interval can be written as a disjoint union of at most such intervals.
• Example: For
Count-Min: Range Queries and Quantiles
• Range Query: For estimate • Approximate median: Find such that:
and
Count-Min: Range Queries and Quantiles
• Algorithm: Construct Count-Min sketches, one for each such that for any we have an estimate for such that:
• To estimate let be decomposition:
• Hence,
Count-Min: Heavy Hitters
• Heavy Hitters: For find all with but no elements with • Algorithm:– Consider binary tree whose leaves are [n] and
associate internal nodes with intervals corresponding to descendant leaves
– Compute Count-Min sketches for each – Level-by-level from root, mark children of marked
nodes if – Return all marked leaves
• Finds heavy-hitters in steps
Thank you!
• Questions?