parallel 2d kolmogorov-smirnov statistic ian chan 5/12/02 6.338j/18.337j

Parallel 2D Kolmogorov-Smirnov Statistic

Ian Chan

5/12/02 6.338J/18.337J

http://web.mit.edu/ianchan/www/KS2D

Motivation: my friend’s research

A colossal X-ray flare, likely sparked by a central Milky Way black hole, produced the bright spot in this Chandra image. [Source CNN]

1D Kolmogorov-Smirnov Statistic

Tests nonparametrically for a difference between two empirical distributions, F and G

D statistic: maximum difference between the two CDFs
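As an illustration of the D statistic, here is a minimal Python sketch of the two-sample case. The function name `ks_1d` is hypothetical and the code is not taken from the talk; it evaluates both empirical CDFs at every observed value and keeps the largest gap.

```python
import bisect

def ks_1d(sample_f, sample_g):
    """Two-sample KS statistic: max |F(x) - G(x)| over all observed x."""
    f = sorted(sample_f)
    g = sorted(sample_g)
    d = 0.0
    for x in f + g:
        # Empirical CDF at x = fraction of the sample that is <= x.
        cdf_f = bisect.bisect_right(f, x) / len(f)
        cdf_g = bisect.bisect_right(g, x) / len(g)
        d = max(d, abs(cdf_f - cdf_g))
    return d
```

Two samples that do not overlap at all give D = 1, and identical samples give D = 0.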

1D KS Test Bound

Kolmogorov (1933) asymptotic bound:
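The formula shown on this slide did not survive. For reference, the standard asymptotic result from the general literature (not recovered from the slide) is P(sqrt(n) D > z) -> 2 * sum over k >= 1 of (-1)^(k-1) exp(-2 k^2 z^2); in the two-sample case n is commonly replaced by n1*n2/(n1+n2). A small Python sketch of that series follows; the name `kolmogorov_tail` and the truncation at 100 terms are assumptions made for illustration.

```python
import math

def kolmogorov_tail(z, terms=100):
    """Asymptotic P(sqrt(n) * D > z) from the alternating Kolmogorov series."""
    if z <= 0:
        return 1.0  # the series does not converge at z = 0; the tail probability is 1
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
    # Clamp to [0, 1]; the truncated series is only reliable for z of roughly 0.1 or more.
    return min(1.0, max(0.0, 2.0 * total))
```

With this form, z = 1.36 corresponds to a tail probability of about 5%.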

2D Analog of KS Test

Peacock, J. (1983). Two-Dimensional Goodness-of-Fit Testing in Astronomy. Monthly Notices of the Royal Astronomical Society, vol. 202, p. 615.

D statistic: the largest difference in CDFs over all possible quadrant divisions

2D KS Test Bound

Monte Carlo simulated bounds

Z = D n^{1/2}

KS2D Test Brute Force Algorithm

O(n^2), not exhaustive: quadrants centered at each data point

O(n^3), exhaustive: quadrants centered at each possible combination of a data x and a data y
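Both brute-force variants can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the talk: the names `max_quadrant_diff` and `ks2d_brute` are hypothetical, and the strict inequalities (points on a dividing line are not counted) are an assumed convention.

```python
def max_quadrant_diff(cx, cy, sample_a, sample_b):
    """Largest |fraction of A - fraction of B| over the four quadrants at (cx, cy)."""
    best = 0.0
    for sx in (1, -1):
        for sy in (1, -1):
            # Strict inequalities: points on the dividing lines are not counted.
            in_a = sum(1 for x, y in sample_a
                       if sx * (x - cx) > 0 and sy * (y - cy) > 0)
            in_b = sum(1 for x, y in sample_b
                       if sx * (x - cx) > 0 and sy * (y - cy) > 0)
            best = max(best, abs(in_a / len(sample_a) - in_b / len(sample_b)))
    return best

def ks2d_brute(sample_a, sample_b, exhaustive=False):
    """O(n^2) when centers are the data points, O(n^3) when exhaustive."""
    points = list(sample_a) + list(sample_b)
    if exhaustive:
        # Every combination of a data x with a data y.
        centers = [(x, y) for x, _ in points for _, y in points]
    else:
        centers = points
    return max(max_quadrant_diff(cx, cy, sample_a, sample_b)
               for cx, cy in centers)
```

Because the exhaustive set of centers contains every data point, its result can never be smaller than the non-exhaustive one.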

O(n log n) KS2D Algorithm

Author: A. Cooke (1999)

Constructs a binary tree data structure in O(n log n); requires the sample data to be pre-sorted by y

How it works: (1) Tree construction

For quadrants centered at (x,y), the upper left quadrant must contain all samples (a,b) where a < x AND b < y

For a childless node, Dmin = Dmax = 1/Nsquare or -1/Ncircle, depending on the sample's class

How it works: (2) Upward Propagation

At node (2,3), we find the MIN and MAX from the 3 choices:

1. Inherit Dmin/max from its left child (1,2), which implies that Q excludes (2,3), where Q is the quadrant that contains the largest |D|

2. D = delta(left child) + (0/Ns - 1/Nc), which implies that Q contains (2,3) and has (2,3) on its border. Here delta(x) is the difference in CDFs if the quadrant contains all samples in the subtree rooted at x

3. D = delta(left child) + (0/Ns - 1/Nc) + Dmin/max(right child), which implies that Q contains (2,3) and (2,3) is not on its border
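The three choices above can be sketched in Python as follows. This is an illustration under stated assumptions, not A. Cooke's implementation: it uses a plain unbalanced binary search tree keyed on x (the original tree layout may differ), assumes all x and y values are distinct, counts the border sample as inside the quadrant, and all names (`Node`, `pull_up`, `insert`, `ks2d_one_quadrant`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    x: float
    w: float                        # +1/Ns for a square, -1/Nc for a circle
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    delta: float = 0.0              # CDF difference if Q holds the whole subtree
    dmin: float = 0.0
    dmax: float = 0.0

def pull_up(node):
    """Recompute delta, dmin and dmax of `node` from its children."""
    left_delta = node.left.delta if node.left else 0.0
    on_border = left_delta + node.w             # choice 2: node is on the border of Q
    lo = hi = on_border
    if node.left:                               # choice 1: Q excludes node
        lo, hi = min(lo, node.left.dmin), max(hi, node.left.dmax)
    if node.right:                              # choice 3: Q extends past node
        lo = min(lo, on_border + node.right.dmin)
        hi = max(hi, on_border + node.right.dmax)
    node.dmin, node.dmax = lo, hi
    node.delta = on_border + (node.right.delta if node.right else 0.0)

def insert(root, x, w):
    """Insert (x, w) into the tree, then propagate upward along the search path."""
    new = Node(x, w)
    path, cur = [], root
    while cur is not None:
        path.append(cur)
        cur = cur.left if x < cur.x else cur.right
    if path:
        if x < path[-1].x:
            path[-1].left = new
        else:
            path[-1].right = new
    pull_up(new)                                # childless node: dmin = dmax = w
    for node in reversed(path):
        pull_up(node)
    return path[0] if path else new

def ks2d_one_quadrant(squares, circles):
    """Largest |D| over quadrants {a <= x, b <= y}, assuming distinct coordinates."""
    pts = [(y, x, 1.0 / len(squares)) for x, y in squares]
    pts += [(y, x, -1.0 / len(circles)) for x, y in circles]
    pts.sort()                                  # pre-sort the samples by y
    root, best = None, 0.0
    for _, x, w in pts:
        root = insert(root, x, w)
        best = max(best, abs(root.dmin), abs(root.dmax))
    return best
```

Each insertion only touches the nodes on one root-to-leaf path, so the cost per sample is proportional to the tree depth. With an unbalanced tree that depth is logarithmic only on average, for example when the x values arrive in random order.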

The other 3 quadrants

We have considered the Top Left Quadrant, but the Top Right quadrant can be obtained from the same tree structure if we modify the upward propagation rule by swapping left & right, i.e.

At node (2,3), we find the MIN and MAX from the 3 choices:

1. Inherit Dmin/max from its right child (1,2), which implies that Q excludes (2,3), where Q is the quadrant that contains the largest |D|

2. D = delta(right child) + (0/Ns - 1/Nc), which implies that Q contains (2,3) and has (2,3) on its border. Here delta(x) is the difference in CDFs if the quadrant contains all samples in the subtree rooted at x

3. D = delta(right child) + (0/Ns - 1/Nc) + Dmin/max(left child), which implies that Q contains (2,3) and (2,3) is not on its border

• The Bottom Left/Right Quadrants can be obtained if the tree is built with the samples sorted in reverse order of y.
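An equivalent way to cover all four quadrants is to mirror the coordinates and reuse a single one-quadrant routine: negating x plays the role of swapping left and right, and negating y plays the role of reversing the sort order. The Python sketch below uses this reflection technique rather than the rule swap described above, and it uses a brute-force stand-in for the one-quadrant routine; the names `one_quadrant` and `all_quadrants` are hypothetical.

```python
def one_quadrant(squares, circles):
    """Brute-force stand-in: largest |D| over quadrants {a <= x, b <= y}."""
    points = squares + circles
    best = 0.0
    for cx, _ in points:
        for _, cy in points:
            fs = sum(1 for x, y in squares if x <= cx and y <= cy) / len(squares)
            fc = sum(1 for x, y in circles if x <= cx and y <= cy) / len(circles)
            best = max(best, abs(fs - fc))
    return best

def all_quadrants(squares, circles, quadrant_fn=one_quadrant):
    """Apply the one-quadrant routine to the four mirror images of the data."""
    best = 0.0
    for sx in (1, -1):          # -1 mirrors x: same effect as swapping left and right
        for sy in (1, -1):      # -1 mirrors y: same effect as reversing the y order
            sq = [(sx * x, sy * y) for x, y in squares]
            ci = [(sx * x, sy * y) for x, y in circles]
            best = max(best, quadrant_fn(sq, ci))
    return best
```

For squares at (1,1), (2,2) and circles at (0,0), (3,3), a single quadrant orientation finds at most 0.5, while the full search over all four orientations finds 1.0.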

Parallel KS2D Algorithm

Speed may scale linearly with the number of processors during the upward propagation step; the tree construction step cannot be parallelized

Problem size scales linearly with the number of processors because the sample nodes are distributed across the processors

Challenges

1. Load Balancing: Dividing the tree nodes equally among processors

2. Minimize communications: Try to store an entire subtree into a single processor so that less inter-processor communication is necessary.

Load Balancing and Minimum Communications

•Ideally…

Load Balancing Strategy (1) Pre-processing

Randomly sample 1000 data points and sort them by x. Take the (1/numproc)*1000th, (2/numproc)*1000th, …, ((numproc-1)/numproc)*1000th positions and use them to define the intervals for load balancing

Drawback: assumes x and y to be more or less independent
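The pre-processing step might look like the following Python sketch. The function names, the fixed random seed, and the use of `bisect` to map an x value to a processor are assumptions made for illustration.

```python
import bisect
import random

def balance_boundaries(points, numproc, sample_size=1000, seed=0):
    """Pick numproc-1 x boundaries from the quantiles of a random sample."""
    rng = random.Random(seed)
    k = min(sample_size, len(points))
    xs = sorted(x for x, _ in rng.sample(points, k))
    # The (i/numproc)*k-th sorted x value separates processor i-1 from processor i.
    return [xs[(i * k) // numproc] for i in range(1, numproc)]

def owner(x, boundaries):
    """Index of the processor whose x interval contains x."""
    return bisect.bisect_right(boundaries, x)
```

For x values spread uniformly over [0, 1] and four processors, the boundaries come out near 0.25, 0.5, and 0.75.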

Load Balancing Strategy (2) Adaptive

Keep a running average of the x values of the nodes stored on each processor. After every CHECKPOINT (= 2000) samples, check whether the load is skewed, i.e. whether the difference in load between the heaviest and the lightest processor exceeds 30% of the lightest processor's load. If so, change the load-balancing intervals to the midpoints of the running averages.
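A minimal Python sketch of this adaptive rule is given below. Several details are assumptions not stated on the slide: the class name `AdaptiveBalancer`, that only newly arriving samples follow the updated intervals (stored nodes are not migrated), that the midpoints are taken between the averages of neighbouring processors, and that rebalancing is skipped while any processor is still empty.

```python
import bisect

CHECKPOINT = 2000   # samples between skew checks
SKEW = 0.30         # tolerated load difference, relative to the lightest load

class AdaptiveBalancer:
    def __init__(self, boundaries):
        self.boundaries = list(boundaries)
        nproc = len(self.boundaries) + 1
        self.count = [0] * nproc      # load (number of samples) per processor
        self.sum_x = [0.0] * nproc    # running sum of x, for the running average
        self.seen = 0

    def assign(self, x):
        """Return the processor index for x and update the running statistics."""
        p = bisect.bisect_right(self.boundaries, x)
        self.count[p] += 1
        self.sum_x[p] += x
        self.seen += 1
        if self.seen % CHECKPOINT == 0:
            self._maybe_rebalance()
        return p

    def _maybe_rebalance(self):
        lightest, heaviest = min(self.count), max(self.count)
        # Assumption: an empty processor has no average, so skip the check.
        if lightest == 0 or heaviest - lightest <= SKEW * lightest:
            return
        avg = [s / c for s, c in zip(self.sum_x, self.count)]
        # New boundaries: midpoints between neighbouring running averages
        # (sorted as a guard in case the averages are not monotonic).
        self.boundaries = sorted((a + b) / 2 for a, b in zip(avg, avg[1:]))
```

For example, with one initial boundary at 0.5 and x values spread evenly over [0, 0.6), the first checkpoint moves the boundary to about 0.39, shifting load toward the lighter processor.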

Performance (1)

• 20,000 and 200,000 samples from uniform [0,1] distribution

Performance (2)

Effects of adaptive load balancing on performance for samples from standard normal distributions centered at (-0.7, 0.7) and (0.7, -0.7)

Conclusion for Parallel KS2D

Speedup is modest, especially as more processors are used, because of communication overhead.

Load balancing strategies are noticeably effective for certain data distributions; whether they are needed depends on the samples

Distributed memory: gains the ability to solve larger problems
