parallel 2d kolmogorov-smirnov statistic ian chan 5/12/02 6.338j/18.337j
TRANSCRIPT
Parallel 2D Kolmogorov-Smirnov Statistic
Ian Chan
5/12/02 6.338J/18.337J
http://web.mit.edu/ianchan/www/KS2D
Motivation: my friend’s research
A colossal X-ray flare, likely sparked by a central Milky Way black hole, produced the bright spot in this Chandra image. [Source CNN]
1D Kolmogorov Smirnov Statistic
test difference in two empirical distributions
F G nonparametrically D statistic: maximum difference between 2 CDF’s
1D KS Test Bound
Kolmogorov(1933) asymptotic bound:
2D Analog of KS Test
Peacock J, Monthly Notices of the Royal Astronomical Society, 1983, vol 202 p615: Two-Dimensional Goodness-of-Fit Testing in Astronomy
D statistic: considering all possible quadrant divisions, the largest possible difference in CDFs
2D KS Test Bound
Monte Carlo simulated bounds
Z = D n1/2
KS2D Test Brute Force Algorithm
O(n2), not exhaustive, quadrants centered at each data ponts O(n3), exhaustive, quadrants centered at each possible data x and data y
combination
O(nlogn) KS2D algorithm
Author: A. Cooke (1999) construction of binary tree data structure ( O(nlogn) ), require pre-sorted
sample data by y
How it works: (1) Tree construction
quadrants centered at (x,y) must have upper left quadrant contains all samples (a,b) where a < x AND b < y
If childless node, Dmin = Dmax = 1/Nsquare or –1/Ncircle, depending on class
How it works: (2) Upward Propagation
At node (2,3), we find the MIN and MAX from the 3 choices:
1 inherit Dmin/max from its left child (1,2), which implies that Q excludes (2,3) where Q is the quadrant that contains the largest |D|
2 D = delta(left child) + (0/Ns-1/Nc), which implies Q contains (2,3) and has (2,3) on its border. Delta(x) = diff in CDF if quadrant contains all samples in subtree at x
3 D = delta(left child) + (0/Ns-1/Nc) + Dmin/max (right child), which implies Q contains (2,3) and (2,3) is not on its border
The other 3 quadrants
We have considered the Top Left Quadrant, but the Top Right quadrant can be obtained from the same tree structure if we modify the upward propagation rule by swapping left & right, i.e.
At node (2,3), we find the MIN and MAX from the 3 choices:
1 inherit Dmin/max from its right child (1,2), which implies that Q excludes (2,3) where Q is the quadrant that contains the largest |D|
2 D = delta(right child) + (0/Ns-1/Nc), which implies Q contains (2,3) and has (2,3) on its border. Delta(x) = diff in CDF if quadrant contains all samples in subtree at x
3 D = delta(right child) + (0/Ns-1/Nc) + Dmin/max (left child), which implies Q contains (2,3) and (2,3) is not on its border
• The Bottom Left/Right Quadrants can be obtained if the tree is built with samples sorted by reverse order of y.
Parallel KS2D Algorithm
Speed possibly scales linearly with number of processors during the upward propagation step, cannot parallelize the tree construction step
Problem size scales linearly with number of processors because sample nodes are stored in processors distributively
Challenges
1. Load Balancing: Dividing the tree nodes equally among processors
2. Minimize communications: Try to store an entire subtree into a single processor so that less inter-processor communication is necessary.
Load Balancing and Minimum Communications
•Ideally…
Load Balancing Strategy (1) Pre-processing
Randomly sample 1000 data points. Sort them by x. Consider the 1/numproc*1000th, 2/numproc*1000th…,
(numproc-1)/n*1000th positions and use them to define intervals for load balancingDrawback: assumes x and y to be more or less independent
Load Balancing Strategy (2): adaptive
Keep a running average of the x values of nodes stored in each processor. For every CHECKPOINT(=2000) number of samples, if the load is skewed (if difference of load between the heaviest load processor and the lightest load processor > 30% of load of lightest processor) change the load balancing intervals to midpoints of the running averages.
Performance(1)
• 20,000 and 200,000 samples from uniform [0,1] distribution
Performance (2)
Effects of adaptive load-balancing on performance for samples from standard normal distributions centered at
(-0.7,0.7)
and (0.7, -0.7)
Conclusion for Parallel KS2D
Speedup is not great, especially when more processors are used because of communication overhead.
Load balancing strategies is noticeably effective for certain data distributions, need dependent on samples
Distributive Memory: Gains the ability to solve larger problems