April 26, 2006
Pass-Efficient Algorithms for Clustering
Dissertation Defense
Kevin Chang
Adviser: Ravi Kannan
Committee: Dana Angluin
Joan Feigenbaum
Petros Drineas (RPI)
Overview
• Massive Data Sets in Theoretical Computer Science
– Algorithms for input that is too large to fit in the memory of a computer.
• Clustering Problems
– Learning generative models
– Clustering via combinatorial optimization
– Both massive data set and traditional algorithms
Theoretical Abstractions for Massive Data Set Computation
• The input on disk/storage is modeled as a read-only array. Elements may only be accessed through a sequential pass.
– Input elements may be arbitrarily ordered.
• Main memory is modeled as extra space used for intermediate calculations.
• Algorithm is allowed extra time before and after each pass to perform calculations.
• Goal: Minimize memory usage, number of passes.
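Concretely, the model can be pictured in code. Below is a minimal Python sketch (not from the talk; the file name and update interface are illustrative): the input may only be consumed through sequential passes, and "memory" is the small state carried between elements.

```python
# Illustrative sketch of the pass-efficient model: input on disk is
# read-only and consumed sequentially; "memory" is the state we carry.

def sequential_pass(path, update, state):
    """One sequential pass over the input, threading `state` through."""
    with open(path) as f:          # assumes one number per line
        for line in f:
            state = update(state, float(line))
    return state

# One pass, O(1) memory: a running sum and count give the mean.
def mean_update(state, x):
    total, count = state
    return (total + x, count + 1)

total, count = sequential_pass("data.txt", mean_update, (0.0, 0))
print("mean =", total / count)
```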
Models of Computation
• Streaming Model: Algorithm may make a single pass over the data. Space must be o(n).
• Pass-Efficient Model: Algorithm may make a small, constant number of passes. Ideally, space is O(1).
• Other models: sublinear algorithms.
• The Pass-Efficient model is more flexible than streaming, but not suitable for "streaming" data that arrives, is processed immediately, and is then "forgotten."
Main Question: What will multiple passes buy you?
• Is it better, in terms of other resources, to make 3 passes instead of 1 pass?
• Example: Find the mean of an array of integers.
– 1 pass requires O(1) space.
– More passes don't help.
• Example: Find the median. [MP '80]
– A 1-pass algorithm requires Ω(n) space.
– 2 passes need only O(n^(1/2)) space. (A randomized sketch follows this list.)
• We will study the Trade-Off between passes and memory.
• Other work: for graphs [FKMSZ ‘05]
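To make the median example concrete, here is a hedged randomized two-pass selection sketch. It is not the deterministic [MP '80] algorithm (which achieves the O(n^(1/2)) two-pass bound); this simple variant keeps roughly 6·n^(2/3) elements, but it shows how a second pass buys a large drop in memory.

```python
# Pass 1 brackets the median with a sample; pass 2 selects inside the
# bracket. A randomized illustration of the pass/space trade-off.
import math
import random

def two_pass_median(make_stream, n):
    s = max(2, int(n ** (2 / 3)))        # target sample size for pass 1
    # Pass 1: uniform sample; bracket the median between two sample ranks.
    sample = sorted(x for x in make_stream() if random.random() < s / n)
    if not sample:
        return None                      # vanishing probability for large n
    mid = len(sample) // 2
    slack = int(3 * math.sqrt(len(sample))) + 1   # ~6 std devs of rank noise
    lo = sample[max(0, mid - slack)]
    hi = sample[min(len(sample) - 1, mid + slack)]
    # Pass 2: count elements below the bracket; keep those inside it.
    below, inside = 0, []
    for x in make_stream():
        if x < lo:
            below += 1
        elif x <= hi:
            inside.append(x)
    inside.sort()
    rank = n // 2 - below                # median's rank inside the bracket
    return inside[rank] if 0 <= rank < len(inside) else None  # rare miss

data = [random.randrange(10 ** 6) for _ in range(100001)]
print(two_pass_median(lambda: iter(data), len(data)))
```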
Overview of Results
• General framework for pass-efficient clustering algorithms. Specific problems:
– A learning problem: a generative model of clustering.
• Sharp trade-off between passes and space.
• Lower bounds show this is nearly tight.
– A combinatorial optimization problem: Facility Location.
• Same sharp trade-off.
• Algorithm for a graph partitioning problem.
– Can be considered a clustering problem.
Graph Partitioning: Sum-of-squares partition
• Joint work with Jim Aspnes and Alex Yampolskiy.
• Given a graph G = (V, E), remove m nodes or edges to disconnect the graph into components {C_i}, minimizing the objective function ∑_i |C_i|^2.
• The objective function is natural: in the optimum partition, large components will have roughly equal size.
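As a concrete reading of the objective, here is a small illustrative helper (not part of the thesis algorithms) that evaluates ∑_i |C_i|^2 after a set of nodes is removed:

```python
# Compute the sum-of-squares objective for a node removal on an
# adjacency-list graph: BFS over the residual graph, square each
# component size, and sum.
from collections import deque

def sos_objective(adj, removed):
    removed = set(removed)
    seen, total = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        # BFS to collect one component of the residual graph.
        size, queue = 0, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        total += size * size
    return total

# Example: a path 1-2-3-4-5; removing node 3 leaves {1,2} and {4,5}.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(sos_objective(adj, {3}))   # 2^2 + 2^2 = 8
```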
Graph Partitioning: Our Results
• Definition: (α, β)-bicriterion approximation:
– remove αm nodes or edges to get a partition {C_i}
– such that ∑_i |C_i|^2 < β·OPT,
– where OPT is the best sum of squares achievable by removing m nodes or edges.
• Approximation Algorithm for SSP
– Thm: There exists an (O(log^1.5 n), O(1))-approximation algorithm.
• Hardness of Approximation
– Thm: It is NP-hard to compute a (1.18, 1)-approximation.
– By reduction from minimum Vertex Cover.
Outline of Algorithm
• Recursively partition the graph by removing sparse node or edge cuts.
– Sparsest cut is an important problem in theory.
– Informally: find a cut that removes a small number of edges to yield two relatively large components.
– Recent activity in this area [LR99, ARV04].
– The node cut problem reduces to the edge cut problem.
• A similar approach was taken by [KVV '00] for a different clustering problem.
Rough Idea of Algorithm
• In the first iteration, partition the graph G into C_1, C_2 by removing a sparse node cut.
[Figure: the input graph G, before the cut.]
[Figure: G is split into components C_1 and C_2 by the removed cut.]
• Subsequent iterations: calculate cuts in all components, and remove the cut that “most effectively” reduces the objective function.
• Continue the procedure until Θ(log^1.5 n)·m nodes have been removed.
[Figure: after further cuts, components C_1, C_2, C_3.]
Generative Models of Clustering
• Assume the input is generated by k different random processes.
• k distributions F_1, F_2, …, F_k, each with weight w_i > 0 such that ∑_i w_i = 1.
• A sample is drawn from the mixture by picking F_i with probability w_i, and then drawing a point according to F_i.
• Alternatively, if each F_i is a density function, then the mixture has density F = ∑_i w_i F_i.
• Much TCS work on learning mixtures of Gaussian distributions in high dimension. [D '99], [AK '01], [VW '02], [KSV '05]
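For concreteness, a minimal sketch of drawing from such a mixture (the component intervals and weights here are made up for illustration):

```python
# Pick component i with probability w_i, then draw a point from F_i.
import random

def sample_mixture(weights, samplers, n):
    return [random.choices(samplers, weights=weights)[0]() for _ in range(n)]

# A mixture of three uniform components on subintervals of [0, 1).
components = [lambda: random.uniform(0.0, 0.4),
              lambda: random.uniform(0.2, 0.5),
              lambda: random.uniform(0.7, 1.0)]
data = sample_mixture([0.5, 0.3, 0.2], components, 100000)
```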
Learning Mixtures
• Q: Given samples from the mixture, can we learn the parameters of the k processes?
• A: Not always; the question can be ill-posed.
– Two different mixtures can create the same density.
• Our problem: Can we learn the density F of the mixture in the pass-efficient model?
– What are the trade-offs between space and passes?
• If F is the density of the mixture, the accuracy of our estimate G is measured by the L1 distance ∫ |F − G|.
Mixtures of k uniform distributions in R
• Assume each F_i is a uniform distribution over some interval (a_i, b_i) in R.
• Then the mixture density looks like a step distribution:
[Figure: a step-shaped density over breakpoints x1 < x2 < x3 < x4 < x5. Caption: a mixture of three distributions: F_1 on (x1, x3), F_2 on (x2, x3), and F_3 on (x4, x5).]
The problem reduces to learning a density function that is known to be piecewise constant with at most 2k-1 “steps”.
Algorithmic Results
• Pick any integer P > 0 and any 1 > ε > 0. We give a 2P-pass randomized algorithm to learn a mixture of k uniform distributions.
• Error at most ε^P: memory required is O*(k^(3/2) + P·k/ε).
– The error drops exponentially in the number of passes while the memory used grows very slowly.
• Error at most ε: memory required is O*(k^(3/2)/ε^(1/P)).
– With 2 passes you need ~1/ε; with 4 passes, ~(1/ε)^(1/2); with 8 passes, ~(1/ε)^(1/4).
– The memory drops very sharply as P increases, while the error remains constant.
• The algorithm succeeds with probability 1 − δ.
Step distribution: First attempt
• Let X be the data stream of samples from F.
• Break the domain into bins and count the number of points that fall in each bin. Estimate F in each bin.
• In order to get accuracy ε^P, you will need to store at least ≈ 1/ε^P counters.
• Too much! We can do much better using a few more passes.
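For reference, the naive one-pass estimator that the multi-pass algorithm improves on might look like the following sketch (illustrative: the domain is taken to be [0, 1), and `data` is the mixture sample drawn above). Its memory usage is exactly the number of bin counters.

```python
# One pass: bin counts over [0, 1), then normalize to a density
# estimate that is constant on each bin.
def histogram_density(stream, num_bins):
    counts = [0] * num_bins
    n = 0
    for x in stream:
        counts[min(int(x * num_bins), num_bins - 1)] += 1
        n += 1
    return [c * num_bins / n for c in counts]   # integrates to 1

estimate = histogram_density(data, 50)   # `data`: the mixture sample above
```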
General Framework
• In one pass, take a small uniform subsample of size s from the data stream.
• Compute a coarse solution on the subsample.
• In a second pass, for places where the coarse solution is reasonable, sharpen it.
• Also in the second pass, find places where the coarse solution is very far off.
• Recurse on these spots.
– Examine these trouble spots more carefully… zoom in!
Our algorithm
1. In one pass, draw a sample of size s = Θ(k^2/ε^2) from X.
2. Based on the sample, partition the domain into intervals I such that ∫_I F ≈ ε/2k for each I.
– The number of intervals I will be O(k/ε).
3. In one pass, for each I, determine whether F is very close to constant on I (call subroutine Constant). Also count the number of points of X that lie in I.
4. If F is constant on I, then |X ∩ I| / (|X| · length(I)) is very close to F on I.
5. If F is not constant on I, recurse on I (zoom in on the trouble spot).
• Requires |X| = Ω(k^6/ε^(6P)). Space usage: O*(k^(3/2) + P·k/ε).
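An in-memory stand-in for this recursion is sketched below. It is illustrative only: it captures the zoom-in structure, not the pass accounting or the O*(…) bounds, and its crude half-vs-half mass comparison is a hypothetical substitute for the subroutine Constant.

```python
# Zoom-in sketch: output a constant estimate on [lo, hi), or recurse
# where the density appears to jump.
def learn_step_density(points, lo, hi, depth, tol=0.1):
    n = len(points)
    inside = [x for x in points if lo <= x < hi]
    density = len(inside) / (n * (hi - lo))
    mid = (lo + hi) / 2
    left = sum(1 for x in inside if x < mid)
    right = len(inside) - left
    # Crude stand-in for Constant: do the two halves have similar mass?
    if depth == 0 or abs(left - right) <= tol * len(inside) + 1:
        return [((lo, hi), density)]
    return (learn_step_density(points, lo, mid, depth - 1, tol) +
            learn_step_density(points, mid, hi, depth - 1, tol))

pieces = learn_step_density(data, 0.0, 1.0, depth=6)   # `data` as above
```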
The Action
[Figure: animation of the framework: draw a sample from the data stream, mark the intervals containing a jump, and recurse on the sub-data stream; repeat this P − 2 more times.]
Why it works: bounding the error
• Easy to learn when F is constant. Our estimate is very accurate on these intervals.
• The weight of bins decreases exponentially at each iteration.
• At the P-th iteration, bins have weight at most ε^P/4k.
• Thus, we can estimate F as 0 on the at most 2k bins that contain a jump, and incur a total error of at most 2k · ε^P/4k = ε^P/2.
Generalizations
We can generalize our algorithm to learn the following types of distributions (with roughly the same trade-off of space and passes):
1. Mixtures of uniform distributions over axis-aligned rectangles in R^2.
– F_i is uniform over a rectangle (a_i, b_i) × (c_i, d_i) ⊂ R^2.
2. Mixtures of linear distributions in R.
– The density of F_i is linear over some interval (a_i, b_i) ⊂ R.
• Heuristic: for a mixture of Gaussians, treat it as a mixture of m linear distributions for some m >> k (the theoretical error will be large unless m is very big).
Lower bounds
• We define a Generalized Learning Problem (GLP), a slight generalization of the problem of learning mixtures of distributions.
• Thm: Any P-pass randomized algorithm that solves the GLP must use at least Ω(1/ε^(1/(2P−1))) bits of memory.
– The proof uses r-round communication complexity.
• Thm: There exists a P-pass algorithm that solves the GLP using at most O*(1/ε^(4/P)) bits of memory.
– A slight generalization of the algorithm given above.
• Conclusion: the trade-off is nearly tight for the GLP.
General framework for clustering: adaptive sampling in 2P passes
• Pass 1: draw a small sample S from the data.
– Compute a solution C on the subproblem S. (If S is small, this can be done in memory.)
• Pass 2: determine the points R that are not clustered well by C, and recurse on R.
– If S is representative, then there won't be many points in R.
• If R is small, then we can sample it at a higher rate in subsequent iterations and get a better solution for these "outliers".
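A schematic of this framework in code (the helpers `solve` and `badly_served` are hypothetical problem-specific callbacks, and each loop iteration stands for two passes over the current point set):

```python
# Adaptive sampling skeleton: sample, solve in memory, keep only the
# badly served points, and repeat at an effectively higher rate.
import random

def adaptive_sampling(points, s, rounds, solve, badly_served):
    remaining, solution = list(points), []
    for _ in range(rounds):
        if not remaining:
            break
        # "Pass 1": uniform sample; solve the small subproblem in memory.
        sample = random.sample(remaining, min(s, len(remaining)))
        solution.extend(solve(sample))
        # "Pass 2": keep only the points the current solution serves badly.
        remaining = [p for p in remaining if badly_served(p, solution)]
    return solution
```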
Facility Location
Input: a set X of points, with distances given by a function d and a facility cost f_i for each point i.
[Figure: an example with four points: f_1 = 2, f_2 = 40, f_3 = 8, f_4 = 25; d(1,2) = 5, d(2,3) = 6, d(1,3) = 2, d(1,4) = 5, d(3,4) = 3.]
Problem: find a set F of facilities that minimizes ∑_{i∈F} f_i + ∑_{i∈X} d(i, F), where d(i, F) = min_{j∈F} d(i, j).
The cost of the solution that builds facilities at 1 and 3 is 2 + 8 + 5 + 3 = 18.
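The objective can be checked directly on this instance with a small helper (illustrative code, not from the thesis):

```python
# Facility location objective: opening costs plus each point's
# distance to its nearest open facility.
D = {(1, 2): 5, (2, 3): 6, (1, 3): 2, (1, 4): 5, (3, 4): 3}

def dist(i, j):
    """Symmetric lookup of the example's distances (0 on the diagonal)."""
    if i == j:
        return 0
    return D[(i, j)] if (i, j) in D else D[(j, i)]

f = {1: 2, 2: 40, 3: 8, 4: 25}

def fl_cost(points, f, dist, F):
    return sum(f[j] for j in F) + sum(min(dist(i, j) for j in F) for i in points)

print(fl_cost([1, 2, 3, 4], f, dist, {1, 3}))   # 2 + 8 + 5 + 3 = 18
```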
Quick facts about facility location
• Natural operations research problem.
• NP-Hard to solve optimally.
• Many approximation algorithms (not streaming or massive data set).
– The best approximation ratio is 1.52.
– Achieving a factor of 1.463 is hard.
Other NP-Hard clustering problems in Combinatorial Optimization
• k-center: Find a set of centers C, |C| = k, that minimizes max_{i∈X} d(i, C).
– One-pass approximation algorithm requiring O(k) space [CCFM '97].
• k-median: Find a set of centers C, |C| = k, that minimizes ∑_{i∈X} d(i, C).
– One-pass approximation algorithm requiring O(k log^2 n) space [COP '03].
• No pass-efficient algorithm for general FL!
– One pass for very restricted inputs [Indyk '04].
Memory issues for facility location
• Note that some instances of FL require Ω(n) facilities to achieve any reasonable approximation.
• Thm: Any P-pass randomized algorithm for approximately computing the optimum FL cost requires Ω(n/P) bits of memory.
• How can we give reasonable guarantees that the space usage is "small"?
• Parameterize the memory bound by the number of facilities opened.
Algorithmic Results
• A 3P-pass algorithm for solving FL, using at most O*(k*·n^(2/P)) bits of extra memory, where k* is the number of facilities output by the algorithm.
• The approximation ratio is O(P): if OPT is the cost of the optimum solution, then the algorithm outputs a solution of cost at most 36·α·P·OPT.
• Requires a black-box approximation algorithm for FL with approximation ratio α.
– Best so far: α = 1.52.
• Surprising fact: the same trade-off, for a very different problem!
Simplifying the Problem
• A large number of facilities complicates matters; the algorithmic cure involves technical details that obscure the main point we would like to make.
• An easier problem that demonstrates our framework in action is k-Facility Location.
• k-FL: Find a set of at most k facilities F (|F| ≤ k) that minimizes ∑_{i∈F} f_i + ∑_{i∈X} d(i, F).
• We present a 3P-pass algorithm that uses O*(k·n^(1/P)) bits of memory, with approximation ratio O(P).
Idea of our algorithm
• Our algorithm makes P triples of passes, for a total of 3P passes.
• In the first two passes of each triple, we take a uniform sample of X and compute an FL solution on the sample.
• Intuitively, this will be a good solution for most of the input.
• In a third pass, identify the bad points: those that are far away from our facilities.
• In subsequent passes, we recursively compute a solution for the bad points by restricting our algorithms to those points.
Algorithm, Part 1: Clustering a Sample
1. In one pass, draw a sample S_1 of s = O*(k^((P−1)/P) · n^(1/P)) nodes from the data stream.
2. In a second pass, compute an O(1)-approximation to FL on S_1, but with facilities drawn from all of X and with distances in S_1 scaled up by n/s. Call the resulting set of facilities F_1.
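In code, these two passes might look like the following sketch. The black box `approx_fl` is hypothetical, standing in for the O(1)-approximation run on the sample, and as a simplification the candidate facilities here come from the sample only (the talk allows facilities from all of X).

```python
# Pass 1: reservoir sampling keeps a uniform sample S1 of size s.
# Pass 2 (conceptually): solve FL on S1 with distances scaled by n/s,
# so the sample stands in for all of X.
import random

def passes_one_and_two(make_stream, n, s, fac_cost, dist, approx_fl):
    sample = []
    for t, x in enumerate(make_stream()):
        if len(sample) < s:
            sample.append(x)
        else:
            j = random.randrange(t + 1)   # classic Algorithm R
            if j < s:
                sample[j] = x
    scaled = lambda i, j: (n / s) * dist(i, j)
    return approx_fl(sample, fac_cost, scaled)
```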
How good is the solution F1?
• Want to show that the cost of F_1 on X is small.
• Easy to show that ∑_{i∈F_1} f_i ≤ O(1)·OPT.
• Best scenario: facilities that service the sample S_1 well would service all points of X well: ∑_{j∈F_1} f_j + ∑_{j∈X} d(j, F_1) = O(1)·OPT.
• Unfortunately, this is not true in general.
Bounding the cost, similar to [Indyk ’99]
• Fix an optimum solution on X: a set of facilities F*, |F*| = k, that partitions X into k clusters C_1, …, C_k.
• If a cluster C_i is "large" (i.e., |C_i| > n/s), then a sample S_1 of X of size s will contain many points of C_i.
• We can prove that a good clustering of S_1 will be good for the points in the large clusters C_i of X.
– It costs at most O(1)·OPT to service the large clusters.
• We can prove that the number of "bad" points is at most O(k^(1/P)·n^(1−1/P)).
Algorithm, part 2: Recursing on the outliers
3. In a third pass, identify the O(k^(1/P)·n^(1−1/P)) points that are farthest from the facilities in F_1. This can be done in ~O*(n^(1/P)) space using random sampling.
4. Assign all other points to facilities in F_1.
• The cost of this is at most O(1)·OPT.
5. Recurse on the outliers.
– In subsequent iterations, we will draw a sample of size s from the outliers; thus, points will be sampled at a much higher rate.
The General Recursive Step
1. In the m-th iteration, let R_m be the points remaining to be clustered.
2. Take a sample S of size s from R_m, and compute an FL solution F_m.
3. Identify the O*(k^((m+1)/P)·n^(1−(m+1)/P)) points that are farthest from F_m; call this set R_{m+1}.
– We can prove that F_m services R_m ∖ R_{m+1} with cost O(1)·OPT.
4. Recurse on the points in R_{m+1}.
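To see why P iterations suffice, here is a tiny numeric illustration (sizes only, with the hidden constants in the O*(…) bound set to 1) of how the outlier sets shrink geometrically:

```python
# |R_m| ~ k^(m/P) * n^(1 - m/P): after P rounds the remainder is ~k,
# which the final sample covers trivially.
def outlier_schedule(n, k, P):
    return [round(k ** (m / P) * n ** (1 - m / P)) for m in range(P + 1)]

print(outlier_schedule(10 ** 6, 10, 3))   # [1000000, 21544, 464, 10]
```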
Putting it all together
• Let the final set of facilities be F = ∪_m F_m.
• After P iterations, all points will have been assigned to a facility.
• The total cost of the solution will be O(P)·OPT.
• Note that possibly |F| > k. We can boil this down to k facilities that still give O(P)·OPT.
– Hint: use a known k-median algorithm.
• The algorithm generalizes to k-median, but requires more passes and space than [COP '03].
References
• J. Aspnes, K. Chang, A. Yampolskiy. Inoculation strategies for victims of viruses and the sum-of-squares partition problem. To appear in Journal of Computer and System Sciences. Preliminary version appeared in SODA 2005.
• K. Chang, R. Kannan. Pass-Efficient Algorithms for Clustering. SODA 2006.
• K. Chang. Pass-Efficient Algorithms for Facility Location. Yale TR 1337.
• S. Arora, K. Chang. Approximation Schemes for low degree MST and Red-blue Separation problem. Algorithmica 40(3):189-210, 2004. Preliminary version appeared in ICALP 2003. (Not in thesis)
Future Directions
• Consider problems from data mining, abstract (and possibly simplify) them, and see what algorithmic insights TCS can provide.
• Possible extensions of this thesis work:
– Design a pass-efficient algorithm for learning mixtures of Gaussian distributions in high dimension, with rigorous guarantees about accuracy.
– Design pass-efficient algorithms for other combinatorial optimization problems. One possibility: find a set of centers C, |C| ≤ k, to minimize ∑_i d(i, C)^2. (A generalization of the k-means objective; the Euclidean case is already solved.)
• Many problems out there. Much more to be done!
Thanks for listening!