Parallel 2D Kolmogorov-Smirnov Statistic - PowerPoint PPT Presentation



Slides: 19

Provided by: meanie


Transcript and Presenter's Notes


1
Parallel 2D Kolmogorov-Smirnov Statistic

Ian Chan
5/12/02 6.338J/18.337J
http://web.mit.edu/ianchan/www/KS2D

2
Motivation: my friend's research
A colossal X-ray flare, likely sparked by a
central Milky Way black hole, produced the bright
spot in this Chandra image. Source: CNN
3
1D Kolmogorov-Smirnov Statistic

Tests nonparametrically whether two empirical
distributions differ (F ≠ G)
D statistic: the maximum difference between the 2 CDFs
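
The D statistic above can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's code; the function name `ks_1d` and the choice to evaluate the CDFs at every observed value are assumptions made for the example.

```python
from bisect import bisect_right


def ks_1d(sample_f, sample_g):
    """Two-sample KS statistic: max |F_n(t) - G_m(t)| over all t.

    Hypothetical helper for illustration; not taken from the slides.
    """
    f_sorted = sorted(sample_f)
    g_sorted = sorted(sample_g)
    d = 0.0
    # Empirical CDFs only change at observed values, so it is enough
    # to compare them at every data point of either sample.
    for t in f_sorted + g_sorted:
        cdf_f = bisect_right(f_sorted, t) / len(f_sorted)
        cdf_g = bisect_right(g_sorted, t) / len(g_sorted)
        d = max(d, abs(cdf_f - cdf_g))
    return d
```

Two samples that do not overlap at all give D = 1, and identical samples give D = 0.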

4
1D KS Test Bound

Kolmogorov (1933) asymptotic bound
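
The formula on this slide did not survive in the transcript. As a hedged reminder, the standard limiting result is that P(sqrt(n) D > lam) tends to Q(lam) = 2 * sum over j >= 1 of (-1)^(j-1) * exp(-2 j^2 lam^2). The sketch below evaluates that series; the function name `kolmogorov_sf` and the truncation at 100 terms are assumptions for illustration, and very small values of `lam` would need more terms.

```python
import math


def kolmogorov_sf(lam, terms=100):
    """Asymptotic tail probability Q(lam) of sqrt(n) * D.

    Hypothetical helper; truncates the alternating series at `terms`.
    """
    if lam <= 0:
        return 1.0
    total = 0.0
    for j in range(1, terms + 1):
        total += (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
    # Clamp to [0, 1] to absorb truncation and rounding error.
    return min(1.0, max(0.0, 2.0 * total))
```

For a two-sample test, n is usually taken as the effective size n1*n2/(n1+n2).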

5
2D Analog of KS Test

Peacock J, "Two-Dimensional Goodness-of-Fit Testing in
Astronomy", Monthly Notices of the Royal Astronomical
Society, 1983, vol 202, p615
D statistic: the largest difference in CDFs over all
possible quadrant divisions

6
2D KS Test Bound

Monte Carlo simulated bounds

Z = D n^(1/2)
7
KS2D Test Brute Force Algorithm

O(n^2), not exhaustive: quadrants centered at each
data point
O(n^3), exhaustive: quadrants centered at each
possible combination of a data x and a data y
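
Both brute-force variants can be sketched as follows. This is an illustrative sketch, not the presenter's implementation: the names `quadrant_fractions` and `ks_2d_brute`, and the convention that points lying on a dividing line count toward the lower/left side, are assumptions.

```python
def quadrant_fractions(points, x0, y0):
    """Fraction of points in each of the 4 quadrants around (x0, y0)."""
    counts = [0, 0, 0, 0]
    for x, y in points:
        # Assumption: points on a dividing line go to the lower/left side.
        counts[2 * (x > x0) + (y > y0)] += 1
    return [c / len(points) for c in counts]


def ks_2d_brute(sample_a, sample_b, exhaustive=False):
    """Largest quadrant-wise difference between two 2D samples.

    exhaustive=False: origins at the data points, O(n^2) overall.
    exhaustive=True:  origins at every (data x, data y) pair, O(n^3).
    """
    combined = list(sample_a) + list(sample_b)
    if exhaustive:
        origins = [(px, qy) for px, _ in combined for _, qy in combined]
    else:
        origins = combined
    d = 0.0
    for x0, y0 in origins:
        frac_a = quadrant_fractions(sample_a, x0, y0)
        frac_b = quadrant_fractions(sample_b, x0, y0)
        d = max(d, max(abs(p - q) for p, q in zip(frac_a, frac_b)))
    return d
```

Because the exhaustive origins include every data point, the exhaustive D is never smaller than the non-exhaustive one.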

8
O(n log n) KS2D Algorithm

Author: A. Cooke (1999)
Construction of a binary tree data structure
(O(n log n)); requires sample data pre-sorted by y

9
How it works (1): Tree construction

For a quadrant centered at (x,y), the upper left
quadrant must contain all samples (a,b) where a < x
AND b < y
If childless node, Dmin = Dmax = 1/Nsquare or
1/Ncircle, depending on class

10
How it works (2): Upward Propagation
At node (2,3), we find the MIN and MAX from the 3
choices:
1. Inherit Dmin/max from its left child (1,2), which
implies that Q excludes (2,3), where Q is the
quadrant that contains the largest D
2. D = delta(left child) + (0/Ns - 1/Nc), which
implies Q contains (2,3) and has (2,3) on its
border. delta(x) = difference in CDF if the quadrant
contains all samples in the subtree at x
3. D = delta(left child) + (0/Ns - 1/Nc) +
Dmin/max(right child), which implies Q contains
(2,3) and (2,3) is not on its border
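
The three-way choice at a node can be written as a small combine step. The sketch below is one reading of the slide, not Cooke's or the presenter's code: the `Summary` record, the function name `propagate`, and the sign convention (a square contributes +1/Ns, a circle contributes -1/Nc) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Summary:
    delta: float  # CDF difference if the quadrant holds the whole subtree
    dmin: float   # smallest D reachable inside this subtree
    dmax: float   # largest D reachable inside this subtree


def propagate(left: Optional[Summary], own: float,
              right: Optional[Summary]) -> Summary:
    """Combine child summaries with the node's own contribution `own`.

    Assumption: own = +1/Ns for a square node, -1/Nc for a circle node.
    """
    left_delta = left.delta if left is not None else 0.0
    # Choice 2: Q contains the node and has it on its border.
    on_border = left_delta + own
    lows, highs = [on_border], [on_border]
    if left is not None:
        # Choice 1: Q excludes the node; inherit from the left child.
        lows.append(left.dmin)
        highs.append(left.dmax)
    if right is not None:
        # Choice 3: Q extends past the node into the right subtree.
        lows.append(on_border + right.dmin)
        highs.append(on_border + right.dmax)
    delta = on_border + (right.delta if right is not None else 0.0)
    return Summary(delta, min(lows), max(highs))
```

A childless node falls out as the special case with no children, giving Dmin = Dmax = its own contribution, which matches the leaf rule on the previous slide.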
11
The other 3 quadrants

We have considered the Top Left Quadrant, but the
Top Right quadrant can be obtained from the same
tree structure if we modify the upward
propagation rule by swapping left and right, i.e.
at node (2,3), we find the MIN and MAX from the 3
choices:
1. Inherit Dmin/max from its right child (1,2),
which implies that Q excludes (2,3), where Q is
the quadrant that contains the largest D
2. D = delta(right child) + (0/Ns - 1/Nc), which
implies Q contains (2,3) and has (2,3) on its
border. delta(x) = difference in CDF if the quadrant
contains all samples in the subtree at x
3. D = delta(right child) + (0/Ns - 1/Nc) +
Dmin/max(left child), which implies Q contains
(2,3) and (2,3) is not on its border
The Bottom Left/Right Quadrants can be obtained
if the tree is built with samples sorted by
reverse order of y.
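
The left/right swap can be sketched as a mirrored combine step. This is an illustrative reading of the slide, not the presenter's code: the function name `propagate_mirrored` and the plain `(delta, dmin, dmax)` tuples are assumptions, with the node's own contribution taken as +1/Ns for a square and -1/Nc for a circle.

```python
def propagate_mirrored(left, own, right):
    """Mirrored rule: `left` and `right` are (delta, dmin, dmax) or None."""
    right_delta = right[0] if right is not None else 0.0
    # Choice 2: Q contains the node and has it on its border.
    on_border = right_delta + own
    lows, highs = [on_border], [on_border]
    if right is not None:
        # Choice 1: Q excludes the node; inherit from the right child.
        lows.append(right[1])
        highs.append(right[2])
    if left is not None:
        # Choice 3: Q extends past the node into the left subtree.
        lows.append(on_border + left[1])
        highs.append(on_border + left[2])
    delta = on_border + (left[0] if left is not None else 0.0)
    return (delta, min(lows), max(highs))
```

Only the roles of the two children change; the tree itself is reused as built.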

12
Parallel KS2D Algorithm

Speed possibly scales linearly with the number of
processors during the upward propagation step;
the tree construction step cannot be parallelized
Problem size scales linearly with the number of
processors because sample nodes are distributed
across the processors
Challenges:
Load balancing: dividing the tree nodes equally
among processors
Minimizing communication: try to store an entire
subtree on a single processor so that less
inter-processor communication is necessary

13
Load Balancing and Minimum Communications

Ideally

14
Load Balancing Strategy (1): Pre-processing

Randomly sample 1000 data points. Sort
them by x. Take the values at the
(1/numproc × 1000)th, (2/numproc × 1000)th, ...,
((numproc-1)/numproc × 1000)th positions and use
them to define intervals for load balancing
Drawback: assumes x and y to be more or less
independent
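
The pre-processing step amounts to picking evenly spaced quantiles of x from a random subsample. The sketch below assumes points are `(x, y)` tuples; the names `balance_boundaries` and `owner`, and the fixed seed, are hypothetical and only serve the illustration.

```python
import random
from bisect import bisect_right


def balance_boundaries(points, numproc, sample_size=1000, seed=0):
    """Pick numproc-1 x-values that split a random subsample evenly."""
    rng = random.Random(seed)
    k = min(sample_size, len(points))
    xs = sorted(p[0] for p in rng.sample(points, k))
    # The (i/numproc * k)-th sorted x value, for i = 1 .. numproc-1.
    return [xs[i * k // numproc] for i in range(1, numproc)]


def owner(x, boundaries):
    """Index of the processor whose x-interval contains x."""
    return bisect_right(boundaries, x)
```

With 4 processors and x spread evenly over [0, 1), the boundaries land near 0.25, 0.5 and 0.75, so each processor receives about a quarter of the nodes.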

15
Load Balancing Strategy (2): Adaptive

Keep a running average of the x values of the
nodes stored in each processor. After every
CHECKPOINT (2000) samples, if the load
is skewed (the difference in load between the
most heavily and most lightly loaded processors
is > 30% of the load of the lightest processor),
change the load balancing intervals to the midpoints
of the running averages.
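
The adaptive rule can be sketched as a small bookkeeping class. This is an illustration under stated assumptions, not the presenter's code: the class name `AdaptiveBalancer` is hypothetical, nodes already assigned are assumed to stay where they are, and the check is skipped while any processor is still empty because it has no running average yet.

```python
from bisect import bisect_right

CHECKPOINT = 2000   # samples between skew checks (value from the slide)
SKEW = 0.30         # 30% threshold (value from the slide)


class AdaptiveBalancer:
    def __init__(self, boundaries):
        self.boundaries = list(boundaries)
        nproc = len(self.boundaries) + 1
        self.load = [0] * nproc      # nodes stored per processor
        self.x_sum = [0.0] * nproc   # sum of x, for the running average
        self.seen = 0

    def assign(self, x):
        """Return the processor for x and update the statistics."""
        proc = bisect_right(self.boundaries, x)
        self.load[proc] += 1
        self.x_sum[proc] += x
        self.seen += 1
        if self.seen % CHECKPOINT == 0:
            self._rebalance_if_skewed()
        return proc

    def _rebalance_if_skewed(self):
        lightest, heaviest = min(self.load), max(self.load)
        if lightest == 0:
            return  # assumption: no average for an empty processor
        if heaviest - lightest <= SKEW * lightest:
            return  # load is not skewed enough
        means = [s / c for s, c in zip(self.x_sum, self.load)]
        # New interval edges: midpoints of neighbouring running averages.
        # Sorted defensively so that bisect_right stays valid.
        self.boundaries = sorted(
            (a + b) / 2 for a, b in zip(means, means[1:]))
```

Only samples arriving after a checkpoint are routed by the new intervals in this sketch.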

16
Performance(1)

20,000 and 200,000 samples from the uniform [0,1]
distribution

17
Performance (2)

Effects of adaptive load balancing on performance
for samples from standard normal distributions
centered at (-0.7, 0.7) and (0.7, -0.7)

18
Conclusion for Parallel KS2D

Speedup is not great, especially when more
processors are used, because of communication
overhead.
Load balancing strategies are noticeably effective
for certain data distributions; the need for them
depends on the samples
Distributed memory: gains the ability to solve
larger problems
