Algorithms for Approximate Histogram Maintenance (on a Stream). A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss

Post on 21-Feb-2016

Page 1: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Fast, Small-Space Algorithms for Approximate Histogram Maintenance (on a Stream)

A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss

Page 2: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

A data stream

Data items/updates arrive one at a time

Small storage, no random access to data unless stored

Page 3: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Dimensionality reduction

Johnson-Lindenstrauss Lemma:

x is an n-dimensional vector

A is a random k × n matrix, each entry drawn independently from, e.g., a suitably scaled Gaussian distribution, with $k = O(\log N / \epsilon^2)$

Then with probability 1-1/N

$$\|x\|_2 \le \|Ax\|_2 \le (1+\epsilon)\,\|x\|_2$$

A can be pseudo-random
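A small numerical check of the lemma can make the statement concrete. The sketch below is only an illustration: the scaling of the Gaussian entries by 1/sqrt(k), the sizes `n` and `k`, and all function names are assumptions, and it checks the two-sided form (the ratio ||Ax|| / ||x|| is close to 1) rather than the exact constants on the slide.

```python
import math
import random

def gaussian_matrix(k, n, rng):
    """Return a k x n matrix with i.i.d. N(0, 1/k) entries (assumed scaling)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]

def apply_matrix(A, x):
    """Compute the sketch Ax as a list of k numbers."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def norm2(v):
    """Euclidean (2-)norm of a vector."""
    return math.sqrt(sum(t * t for t in v))

rng = random.Random(0)
n, k = 1000, 200                      # illustrative sizes, not from the slides
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
A = gaussian_matrix(k, n, rng)
ratio = norm2(apply_matrix(A, x)) / norm2(x)
print(f"||Ax|| / ||x|| = {ratio:.3f}")
```

Larger k pushes the ratio closer to 1, at the cost of a longer sketch.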

Page 4: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

What it means

Can maintain the sketch Ax of x when the coordinates are incremented:

A(x+b)=Ax+Ab

Can maintain approximate 2-norm of x
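The linearity A(x+b)=Ax+Ab is what makes streaming updates cheap: an increment of coordinate i by delta adds delta times column i of A to the sketch. The following is a minimal sketch of that idea; the class name `StreamingSketch` and its methods are hypothetical, and it stores A explicitly for clarity, which a real small-space implementation would avoid by generating A pseudo-randomly.

```python
import math
import random

class StreamingSketch:
    """Maintains Ax under updates x[i] += delta (illustrative, stores A explicitly)."""

    def __init__(self, n, k, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(k)    # assumed scaling so ||Ax|| is close to ||x||
        # columns[i] is A applied to the i-th unit vector.
        self.columns = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)]
                        for _ in range(n)]
        self.sketch = [0.0] * k       # sketch of the all-zero vector

    def update(self, i, delta):
        """Apply x[i] += delta using A(x + delta*e_i) = Ax + delta*A e_i."""
        col = self.columns[i]
        for j in range(len(self.sketch)):
            self.sketch[j] += delta * col[j]

    def norm_estimate(self):
        """Approximate ||x||_2 by ||Ax||_2."""
        return math.sqrt(sum(v * v for v in self.sketch))

sk = StreamingSketch(n=16, k=400, seed=1)
x = [0.0] * 16                        # kept here only to compare against the sketch
for i, delta in [(3, 2.0), (7, -1.0), (3, 1.5)]:
    sk.update(i, delta)
    x[i] += delta
true_norm = math.sqrt(sum(v * v for v in x))
print(true_norm, sk.norm_estimate())
```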

Page 5: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Histograms

View x as a function x:[1…n] -> [1…M]

Approximate it using piecewise constant function h, with B pieces (buckets)
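To make the object concrete, here is a small illustration of a B-bucket histogram and its squared 2-norm error. The helper names and the example data are made up, indices are 0-based, and each bucket simply takes the mean of its range (the value that minimizes squared error), so bucket values are real numbers rather than integers in [1…M].

```python
def histogram_from_boundaries(x, boundaries):
    """Piecewise-constant approximation of x.

    `boundaries` lists the exclusive end index of each bucket; the last
    entry must equal len(x). Each bucket is filled with its mean.
    """
    h = []
    start = 0
    for end in boundaries:
        bucket = x[start:end]
        mean = sum(bucket) / len(bucket)
        h.extend([mean] * len(bucket))
        start = end
    return h

def squared_error(x, h):
    """Squared 2-norm ||x - h||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, h))

x = [1, 1, 2, 8, 9, 9, 3, 3]
h = histogram_from_boundaries(x, [3, 6, 8])   # B = 3 buckets
err = squared_error(x, h)
print(h, err)
```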

Page 6: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Example app in DB

Find all Indians worth $200K - $300K

Either:

1. Select on country

2. Select on worth

Or:

1. Select on worth

2. Select on country

Page 7: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Example app continued

Page 8: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Our goal

Want to maintain the best B-bucket representation of x, under changes of x

Measure the error using 2-norm (1-norm also OK)

Page 9: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Our Approach

Maintain sketches Ax of x

Using Ax, construct B-histogram h which approximately minimizes ||x-h||

Page 10: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Our result

Can maintain a B-histogram h which minimizes ||x-h|| up to a factor of (1+ε), using poly(log n, B, 1/ε) time/space, with probability 1-1/poly(n)

Page 11: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Proof: by iterated improvement

B buckets, $> n^B$ construction time

$B \log n$ buckets, $n^3$ construction time

$B \log^2 n$ buckets, $n^2$ construction time

$B \log^2 n$ buckets, $n \cdot \mathrm{poly}(B + \log n)$ time

$B \log^{O(1)} n$ buckets, $\mathrm{poly}(B + \log n)$ time

B buckets, $\mathrm{poly}(B + \log n)$ time

Page 12: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Exponential time approach

There are at most $(Mn^2)^B$ functions h

By JL lemma, can reduce dimension to O(B log n), and approximately preserve ||x-h|| for all h

To reconstruct h, minimize ||Ax-Ah||

Can be trivially done by enumerating all h’s
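The enumeration can be sketched for a toy instance. Everything below is illustrative: the sizes are tiny, the function names are hypothetical, candidates use integer bucket values in 1..M with 0-based indices, and the running time is exponential in B, which is exactly why the later slides improve on it.

```python
import itertools
import math
import random

def sketch(A, v):
    """Compute Av for a matrix A given as a list of rows."""
    return [sum(a * t for a, t in zip(row, v)) for row in A]

def all_histograms(n, B, M):
    """Yield every vector on n points made of B pieces with values in 1..M."""
    for cuts in itertools.combinations(range(1, n), B - 1):
        edges = (0,) + cuts + (n,)
        for values in itertools.product(range(1, M + 1), repeat=B):
            h = []
            for b in range(B):
                h.extend([values[b]] * (edges[b + 1] - edges[b]))
            yield h

def exhaustive_reconstruct(A, sketch_x, n, B, M):
    """Return the candidate h minimizing ||Ax - Ah||, using only the sketch of x."""
    return min(all_histograms(n, B, M),
               key=lambda h: math.dist(sketch(A, h), sketch_x))

rng = random.Random(1)
n, B, M, k = 6, 2, 4, 3               # toy sizes, not from the slides
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
x = [2, 2, 2, 4, 4, 4]                # x happens to be an exact 2-bucket histogram
best = exhaustive_reconstruct(A, sketch(A, x), n, B, M)
print(best)
```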

Page 13: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Greedy approach

Start from h=0

Let $\chi_I$ be the characteristic function over interval I

Find c and I minimizing

$$\|Ax - A(h + c\,\chi_I)\|_2$$

$h = h + c\,\chi_I$ & repeat
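A minimal sketch of the greedy loop, working only with the sketch Ax, might look as follows. It tries all O(n^2) intervals and, for each, uses the least-squares value of c for the residual r = Ax - Ah (that is, c = <r, u> / <u, u> with u the sketch of the interval's characteristic function). Function names, 0-based half-open intervals, and the toy sizes are assumptions made for the example.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def greedy_histogram(A, sketch_x, n, steps):
    """Greedily pick (interval, c) pairs that most reduce ||Ax - Ah||."""
    k = len(A)
    # prefix[j][i] = A[j][0] + ... + A[j][i-1], so interval sketches are differences.
    prefix = [[0.0] * (n + 1) for _ in range(k)]
    for j in range(k):
        for i in range(n):
            prefix[j][i + 1] = prefix[j][i] + A[j][i]
    residual = list(sketch_x)          # equals Ax - Ah for h = 0
    pieces = []
    for _ in range(steps):
        best = None
        for lo in range(n):
            for hi in range(lo + 1, n + 1):
                u = [prefix[j][hi] - prefix[j][lo] for j in range(k)]
                uu = dot(u, u)
                if uu == 0.0:
                    continue
                ru = dot(residual, u)
                gain = ru * ru / uu    # drop in squared residual for the best c
                if best is None or gain > best[0]:
                    best = (gain, lo, hi, ru / uu, u)
        _, lo, hi, c, u = best
        pieces.append((lo, hi, c))     # h = h + c * chi_[lo, hi)
        residual = [r - c * t for r, t in zip(residual, u)]
    return pieces, residual

rng = random.Random(2)
n, k = 8, 4
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
x = [0, 0, 5, 5, 5, 0, 0, 0]
sketch_x = [dot(row, x) for row in A]
pieces, residual = greedy_histogram(A, sketch_x, n, steps=1)
print(pieces)
```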

Page 14: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Details

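The square of

$$\|Ax - A(h + c\,\chi_I)\|_2$$

is a quadratic function of c

Once we compute the parameters of this function, e.g. $E(c) = Ac^2 + Bc + D$ (here A, B, D denote scalar coefficients, not the sketch matrix or the bucket count),

the minimum is achieved for $c = -B/(2A)$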
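The coefficients can be written out explicitly. As a worked derivation (the symbols $r$, $u$, $\alpha$, $\beta$, $\delta$ are introduced here for illustration, with $\alpha, \beta, \delta$ standing for the scalar coefficients of the quadratic), let $r = Ax - Ah$ and $u = A\chi_I$. Then

$$E(c) = \|r - c\,u\|_2^2 = \underbrace{\|u\|_2^2}_{\alpha}\,c^2 \;+\; \underbrace{(-2\langle r, u\rangle)}_{\beta}\,c \;+\; \underbrace{\|r\|_2^2}_{\delta},$$

$$E'(c) = 2\alpha c + \beta = 0 \quad\Longrightarrow\quad c^{*} = -\frac{\beta}{2\alpha} = \frac{\langle r, u\rangle}{\|u\|_2^2}, \qquad E(c^{*}) = \|r\|_2^2 - \frac{\langle r, u\rangle^2}{\|u\|_2^2}.$$

Since $\alpha = \|u\|_2^2 > 0$ whenever $u \neq 0$, the stationary point is a minimum.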

Page 15: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Example

Page 16: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

How does it help

$O(n^2)$ intervals

O(n) time to find best c minimizing

$$\|Ax - A(h + c\,\chi_I)\|_2$$

Overall: $O(n^3)$ time, O(k log (nM)) intervals

Page 17: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Approximation factor

Assume for simplicity

Let h* be the optimal k-histogram

If we replaced the current histogram h by all k intervals of h* (with proper values c), we would reduce the squared error from $\|x-h\|^2$ to $\|x-h^*\|^2$

Thus, there is an interval I of h* (and c) such that

$$\|x-h\|^2 - \|x-(h + c\,\chi_I)\|^2 \ge \frac{1}{k}\left(\|x-h\|^2 - \|x-h^*\|^2\right)$$

$O(k \log(nM^2))$ intervals enough to reduce the error to about $\|x-h^*\|^2$
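The interval count follows from a geometric-decay argument. As a sketch (the symbols $h_t$, $\Delta_t$ and the target gap $\delta$ are introduced here for illustration), let $h_t$ be the histogram after $t$ greedy steps and $\Delta_t = \|x-h_t\|^2 - \|x-h^*\|^2$. Each greedy step gains at least as much as the interval guaranteed above, so

$$\Delta_{t+1} \le \left(1 - \frac{1}{k}\right)\Delta_t \le e^{-1/k}\,\Delta_t \quad\Longrightarrow\quad \Delta_t \le e^{-t/k}\,\Delta_0 .$$

Starting from $h_0 = 0$ with entries of $x$ at most $M$, we have $\Delta_0 \le \|x\|^2 \le nM^2$, hence

$$t \ge k \ln\frac{nM^2}{\delta} \quad\Longrightarrow\quad \Delta_t \le \delta ,$$

which is $O(k \log(nM^2))$ steps for constant $\delta$.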

Page 18: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Dyadic intervals

Each interval can be decomposed into log n dyadic intervals [1,1],[2,2]…[1,2]...[1,4]

We can assume opt h is defined by B log n dyadic intervals

The number of dyadic intervals is n log n

Reduces the time to $n^2 \log n$
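The decomposition into dyadic pieces can be sketched with a standard top-down recursion. The function name is hypothetical, and the example assumes 0-based half-open intervals [lo, hi) on a domain whose size n is a power of two, whereas the slide uses 1-based closed intervals.

```python
def dyadic_decomposition(lo, hi, n):
    """Split [lo, hi) inside [0, n) into maximal dyadic intervals.

    A dyadic interval has the form [j * 2**s, (j + 1) * 2**s).
    Assumes n is a power of two and 0 <= lo <= hi <= n.
    """
    pieces = []

    def visit(node_lo, size):
        node_hi = node_lo + size
        if hi <= node_lo or node_hi <= lo:
            return                      # node is disjoint from the query
        if lo <= node_lo and node_hi <= hi:
            pieces.append((node_lo, node_hi))   # node fully inside: keep it whole
            return
        half = size // 2                # partial overlap: recurse into both halves
        visit(node_lo, half)
        visit(node_lo + half, half)

    visit(0, n)
    return pieces

print(dyadic_decomposition(1, 8, 8))
print(dyadic_decomposition(3, 6, 8))
```

At most two pieces are produced per level of the recursion, so an interval splits into at most 2 log2(n) dyadic intervals.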

Page 19: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Range summability

Recall

$$\|Ax - A(h + c\,\chi_I)\|_2$$

Need to compute $A\chi_I$, i.e., range sum of random variables

Goal: time polylog n

Page 20: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Naor & Reingold construction

Method:

Generate sum of a1,a2,…,an

Generate sum of left half, conditioned on the total sum

Recurse

Conditional distributions are explicit

The generation can be simulated by Nisan’s PRG

Result: reduces the time to n polylog n
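The top-down generation can be sketched for Gaussian variables, where the conditional distribution is explicit: if a node of size m has sum S, its left half is distributed as N(S/2, m/4). The example below is a sketch under assumptions, not the construction from the slides: the class name is hypothetical, and a seeded `random.Random` keyed by tree node stands in for Nisan's PRG, with no claim about matching its guarantees. Any prefix or range sum is obtained by walking one root-to-leaf path.

```python
import math
import random

class RangeSummableGaussians:
    """Implicit a_0..a_{n-1} ~ N(0, 1) with O(log n)-time range sums (n a power of two)."""

    def __init__(self, n, seed):
        assert n > 0 and n & (n - 1) == 0
        self.n = n
        self.seed = seed

    def _noise(self, key):
        # Deterministic standard normal attached to a tree node (stand-in for a PRG).
        return random.Random(f"{self.seed}:{key}").gauss(0.0, 1.0)

    def prefix_sum(self, t):
        """Return a_0 + ... + a_{t-1} for 0 <= t <= n."""
        lo, size = 0, self.n
        node_sum = math.sqrt(self.n) * self._noise("root")   # total sum ~ N(0, n)
        total = 0.0
        while t > lo:
            if t >= lo + size:
                total += node_sum       # node lies entirely inside the prefix
                break
            half = size // 2
            # Left-half sum conditioned on the node's sum: N(node_sum / 2, size / 4).
            left = node_sum / 2 + (math.sqrt(size) / 2) * self._noise(f"{lo}:{size}")
            if t <= lo + half:
                size, node_sum = half, left
            else:
                total += left
                lo, size, node_sum = lo + half, half, node_sum - left
        return total

    def range_sum(self, lo, hi):
        """Return a_lo + ... + a_{hi-1}."""
        return self.prefix_sum(hi) - self.prefix_sum(lo)

rs = RangeSummableGaussians(16, seed=42)
entries = [rs.range_sum(i, i + 1) for i in range(16)]
print(rs.range_sum(3, 11), sum(entries[3:11]))
```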

Page 21: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Fast selection of good intervals

Find which (dyadic) intervals to add in polylog n time

Consider interval of length 1

Need to find a “spike” in h-x (if exists)

Assume only one spike

Page 22: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Chasing Bits

Non-adaptive binary search

Essentially, we compose the signal with a filter
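One way to picture the non-adaptive binary search: take a fixed set of linear measurements, one per bit of the index, each summing the coordinates whose index has that bit set, and compare each against the total. The sketch below operates directly on a small vector for readability; the function names are hypothetical, and in the streaming setting the same linear measurements would be maintained under updates instead of being recomputed from the vector.

```python
def bit_measurements(v):
    """Return (total, masked) where masked[j] sums v[i] over indices i with bit j set."""
    n = len(v)
    bits = max(1, (n - 1).bit_length())
    total = sum(v)
    masked = [sum(val for i, val in enumerate(v) if (i >> j) & 1)
              for j in range(bits)]
    return total, masked

def locate_spike(total, masked):
    """Recover the spike position bit by bit, with no adaptivity.

    Bit j of the position is 1 when most of the mass sits on the
    coordinates whose index has bit j set.
    """
    position = 0
    for j, m in enumerate(masked):
        if abs(m) > abs(total - m):
            position |= 1 << j
    return position

clean = [0.0] * 16
clean[11] = -7.0                        # a single spike, no noise
noisy = [0.1] * 16
noisy[11] = 9.0                         # a spike on top of small background values
print(locate_spike(*bit_measurements(clean)), locate_spike(*bit_measurements(noisy)))
```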

Page 23: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

More spikes

There are few large spikes

Permute coordinates using pair-wise independent permutation

Likely that each interval contains only one spike

Caveat: how does it work with the range summability?

Result: reduces the time to polylog n

Page 24: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Where are we

We managed to reduce the time to polylog n

However, the number of buckets is B polylog n

Need to reduce the number of buckets to B

Page 25: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Getting rid of the buckets

B buckets, but O(1)-approximation:

Compute h with B polylog n buckets

Find h’ with B buckets closest to h

An off-line problem

Can be done approximately using dynamic programming

Factor O(1) by triangle inequality

Factor (1+ε) is a mess (esp. for 1-norm)
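The off-line step can be illustrated with the textbook dynamic program for the best B-bucket histogram of an explicitly given vector. This is a plain exact O(n^2 B) version for a small input, shown only as a sketch: the function name is hypothetical, and it is not the approximate variant the slides refer to.

```python
def optimal_histogram(v, B):
    """Return (error, edges): the minimum squared error of a B-bucket histogram
    of v, and the bucket edges as indices 0 = e_0 < e_1 < ... < e_B = len(v).
    Assumes 1 <= B <= len(v)."""
    n = len(v)
    s = [0.0] * (n + 1)                 # prefix sums of v
    q = [0.0] * (n + 1)                 # prefix sums of v squared
    for i, val in enumerate(v):
        s[i + 1] = s[i] + val
        q[i + 1] = q[i] + val * val

    def cost(i, j):
        """Squared error of the best constant (the mean) on v[i:j]."""
        total = s[j] - s[i]
        return (q[j] - q[i]) - total * total / (j - i)

    inf = float("inf")
    # err[b][j] = least error covering v[0:j] with exactly b buckets.
    err = [[inf] * (n + 1) for _ in range(B + 1)]
    cut = [[0] * (n + 1) for _ in range(B + 1)]
    err[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(1, n + 1):
            for i in range(b - 1, j):
                cand = err[b - 1][i] + cost(i, j)
                if cand < err[b][j]:
                    err[b][j] = cand
                    cut[b][j] = i
    edges = [n]                          # backtrack the chosen cut points
    j = n
    for b in range(B, 0, -1):
        j = cut[b][j]
        edges.append(j)
    edges.reverse()
    return err[B][n], edges

error, edges = optimal_histogram([1, 1, 2, 8, 9, 9, 3, 3], 3)
print(error, edges)
```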

Page 26: Fast, Small-Space Algorithms for Approximate Histogram     Maintenance (on a Stream)

Conclusions

Can efficiently maintain compact representation of an array of numbers under additive changes

Works well in practice [TGIK’02]