Title: Parallel 2D Kolmogorov-Smirnov Statistic
1Parallel 2D Kolmogorov-Smirnov Statistic
- Ian Chan
- 5/12/02 6.338J/18.337J
- http//web.mit.edu/ianchan/www/KS2D
2Motivation my friends research
A colossal X-ray flare, likely sparked by a
central Milky Way black hole, produced the bright
spot in this Chandra image. Source CNN
31D Kolmogorov Smirnov Statistic
- test difference in two empirical distributions
- F ¹ G nonparametrically
- D statistic maximum difference between 2 CDFs
41D KS Test Bound
- Kolmogorov(1933) asymptotic bound
52D Analog of KS Test
- Peacock J, Monthly Notices of the Royal
Astronomical Society, 1983, vol 202 p615
Two-Dimensional Goodness-of-Fit Testing in
Astronomy - D statistic considering all possible quadrant
divisions, the largest possible difference in
CDFs
62D KS Test Bound
- Monte Carlo simulated bounds
Z D n1/2
7KS2D Test Brute Force Algorithm
- O(n2), not exhaustive, quadrants centered at each
data ponts - O(n3), exhaustive, quadrants centered at each
possible data x and data y combination
8O(nlogn) KS2D algorithm
- Author A. Cooke (1999)
- construction of binary tree data structure (
O(nlogn) ), require pre-sorted sample data by y
9How it works (1) Tree construction
- quadrants centered at (x,y) must have upper left
quadrant contains all samples (a,b) where a lt x
AND b lt y - If childless node, Dmin Dmax 1/Nsquare or
1/Ncircle, depending on class
10How it works (2) Upward Propagation
At node (2,3), we find the MIN and MAX from the 3
choices 1 inherit Dmin/max from its left child
(1,2), which implies that Q excludes (2,3) where
Q is the quadrant that contains the largest D 2
D delta(left child) (0/Ns-1/Nc), which
implies Q contains (2,3) and has (2,3) on its
border. Delta(x) diff in CDF if quadrant
contains all samples in subtree at x 3 D
delta(left child) (0/Ns-1/Nc) Dmin/max (right
child), which implies Q contains (2,3) and (2,3)
is not on its border
11The other 3 quadrants
- We have considered the Top Left Quadrant, but the
Top Right quadrant can be obtained from the same
tree structure if we modify the upward
propagation rule by swapping left right, i.e. - At node (2,3), we find the MIN and MAX from the 3
choices - 1 inherit Dmin/max from its right child (1,2),
which implies that Q excludes (2,3) where Q is
the quadrant that contains the largest D - 2 D delta(right child) (0/Ns-1/Nc), which
implies Q contains (2,3) and has (2,3) on its
border. Delta(x) diff in CDF if quadrant
contains all samples in subtree at x - 3 D delta(right child) (0/Ns-1/Nc) Dmin/max
(left child), which implies Q contains (2,3) and
(2,3) is not on its border - The Bottom Left/Right Quadrants can be obtained
if the tree is built with samples sorted by
reverse order of y.
12Parallel KS2D Algorithm
- Speed possibly scales linearly with number of
processors during the upward propagation step,
cannot parallelize the tree construction step - Problem size scales linearly with number of
processors because sample nodes are stored in
processors distributively - Challenges
- Load Balancing Dividing the tree nodes equally
among processors - Minimize communications Try to store an entire
subtree into a single processor so that less
inter-processor communication is necessary.
13Load Balancing and Minimum Communications
14Load Balancing Strategy (1) Pre-processing
- Randomly sample 1000 data points. Sort
them by x. Consider the 1/numproc1000th,
2/numproc1000th, - (numproc-1)/n1000th positions and use
them to define intervals for load balancing - Drawback assumes x and y to be more or less
independent
15Load Balancing Strategy (2) adaptive
- Keep a running average of the x values of
nodes stored in each processor. For every
CHECKPOINT(2000) number of samples, if the load
is skewed (if difference of load between the
heaviest load processor and the lightest load
processor gt 30 of load of lightest processor)
change the load balancing intervals to midpoints
of the running averages.
16Performance(1)
- 20,000 and 200,000 samples from uniform 0,1
distribution
17Performance (2)
- Effects of adaptive load-balancing on performance
for samples from standard normal distributions
centered at - (-0.7,0.7)
- and (0.7, -0.7)
18Conclusion for Parallel KS2D
- Speedup is not great, especially when more
processors are used because of communication
overhead. - Load balancing strategies is noticeably effective
for certain data distributions, need dependent on
samples - Distributive Memory Gains the ability to solve
larger problems