Parallel 2D Kolmogorov-Smirnov Statistic

1
Parallel 2D Kolmogorov-Smirnov Statistic
  • Ian Chan
  • 5/12/02 6.338J/18.337J
  • http://web.mit.edu/ianchan/www/KS2D

2
Motivation: my friend's research
A colossal X-ray flare, likely sparked by a
central Milky Way black hole, produced the bright
spot in this Chandra image. Source: CNN
3
1D Kolmogorov-Smirnov Statistic
  • test difference in two empirical distributions
  • F ≠ G nonparametrically
  • D statistic: maximum difference between the 2 CDFs
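
As an illustration of the D statistic described above, here is a minimal Python sketch that walks both sorted samples and tracks the largest gap between the two empirical CDFs. The function name `ks_1d` is hypothetical and not part of the presentation.

```python
def ks_1d(a, b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all x."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties are handled together.
        while i < na and a[i] <= x:
            i += 1
        while j < nb and b[j] <= x:
            j += 1
        # i/na and j/nb are the empirical CDF values at x.
        d = max(d, abs(i / na - j / nb))
    return d
```

Once one sample is exhausted its CDF is 1, so the gap can only shrink afterwards and the loop can stop.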

4
1D KS Test Bound
  • Kolmogorov (1933) asymptotic bound
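
The formula on this slide did not survive in the transcript. The sketch below assumes the standard asymptotic form of the Kolmogorov distribution, Q(z) = 2 Σ (-1)^(k-1) exp(-2 k² z²) with z = D·sqrt(n), and for two samples the usual effective size n = n1·n2/(n1+n2). The function names are hypothetical.

```python
import math

def kolmogorov_sf(z, terms=100):
    """Asymptotic P(sqrt(n) * D > z) from the Kolmogorov series (assumed form)."""
    if z <= 0:
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
    # Clamp to [0, 1] to absorb truncation error for very small z.
    return max(0.0, min(1.0, 2.0 * total))

def effective_n(n1, n2):
    """Effective sample size for the two-sample test."""
    return n1 * n2 / (n1 + n2)
```

For example, `kolmogorov_sf(math.sqrt(effective_n(n1, n2)) * d)` would give an approximate significance level for an observed `d`.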

5
2D Analog of KS Test
  • Peacock J., "Two-Dimensional Goodness-of-Fit
    Testing in Astronomy", Monthly Notices of the
    Royal Astronomical Society, 1983, vol. 202,
    p. 615
  • D statistic: considering all possible quadrant
    divisions, the largest possible difference in
    CDFs

6
2D KS Test Bound
  • Monte Carlo simulated bounds

Z = D n^(1/2)
7
KS2D Test Brute Force Algorithm
  • O(n^2), not exhaustive: quadrants centered at each
    data point
  • O(n^3), exhaustive: quadrants centered at each
    possible data x and data y combination
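
Both brute-force variants above can be sketched in a few lines of Python. This is an illustrative sketch, not the presenter's code: the names `quadrant_fractions` and `ks2d_brute` are hypothetical, and the tie rule (points on a dividing line go to the "greater or equal" side) is an assumption.

```python
from itertools import product

def quadrant_fractions(sample, cx, cy):
    """Fraction of the sample in each of the 4 quadrants around (cx, cy)."""
    counts = [0, 0, 0, 0]
    for x, y in sample:
        # Bit 0: x >= cx, bit 1: y >= cy (assumed tie rule).
        idx = (0 if x < cx else 1) + (0 if y < cy else 2)
        counts[idx] += 1
    n = len(sample)
    return [c / n for c in counts]

def ks2d_brute(a, b, exhaustive=False):
    """Largest quadrant-fraction difference between samples a and b."""
    points = list(a) + list(b)
    if exhaustive:
        # O(n^3): every (data x, data y) combination is a center.
        xs = {p[0] for p in points}
        ys = {p[1] for p in points}
        centers = product(xs, ys)
    else:
        # O(n^2): only the data points themselves are centers.
        centers = points
    d = 0.0
    for cx, cy in centers:
        fa = quadrant_fractions(a, cx, cy)
        fb = quadrant_fractions(b, cx, cy)
        d = max(d, max(abs(p - q) for p, q in zip(fa, fb)))
    return d
```

The exhaustive centers are a superset of the data points, so `ks2d_brute(a, b, exhaustive=True)` is never smaller than the non-exhaustive result.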

8
O(nlogn) KS2D algorithm
  • Author: A. Cooke (1999)
  • construction of a binary tree data structure
    (O(n log n)); requires sample data pre-sorted by y

9
How it works (1): Tree construction
  • for quadrants centered at (x,y), the upper left
    quadrant must contain all samples (a,b) where
    a < x AND b < y
  • If childless node, Dmin = Dmax = 1/Nsquare or
    1/Ncircle, depending on class

10
How it works (2): Upward Propagation
At node (2,3), we find the MIN and MAX from the 3
choices:
  • 1. Inherit Dmin/max from its left child (1,2),
    which implies that Q excludes (2,3), where Q is
    the quadrant that contains the largest D
  • 2. D = delta(left child) + (0/Ns - 1/Nc), which
    implies Q contains (2,3) and has (2,3) on its
    border. Delta(x) = difference in CDF if the
    quadrant contains all samples in the subtree at x
  • 3. D = delta(left child) + (0/Ns - 1/Nc) +
    Dmin/max(right child), which implies Q contains
    (2,3) and (2,3) is not on its border
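
The three-choice rule above can be read as a running-sum recurrence over the tree. The sketch below assumes that reading: the terms in choices 2 and 3 are summed, and each sample contributes +1/Ns (square) or -1/Nc (circle). The names `Summary`, `combine` and `summarize` are hypothetical, and `summarize` builds a balanced tree over an already ordered list only for demonstration; it is not Cooke's tree construction.

```python
from dataclasses import dataclass

@dataclass
class Summary:
    delta: float  # total contribution of every sample in the subtree
    dmin: float   # smallest D reachable inside the subtree
    dmax: float   # largest D reachable inside the subtree

def combine(left, c, right):
    """Apply the 3 choices at a node whose own contribution is c."""
    base = (left.delta if left else 0.0) + c      # choice 2: node on the border
    lows, highs = [base], [base]
    if left:                                       # choice 1: inherit from left child
        lows.append(left.dmin)
        highs.append(left.dmax)
    if right:                                      # choice 3: extend into right child
        lows.append(base + right.dmin)
        highs.append(base + right.dmax)
    delta = base + (right.delta if right else 0.0)
    return Summary(delta, min(lows), max(highs))

def summarize(contribs):
    """Demo helper: balanced tree over an ordered list of contributions."""
    if not contribs:
        return None
    mid = len(contribs) // 2
    return combine(summarize(contribs[:mid]), contribs[mid],
                   summarize(contribs[mid + 1:]))
```

A childless node gets `dmin = dmax = c`, which corresponds to the leaf rule from the tree-construction slide up to the sign convention assumed here.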
11
The other 3 quadrants
  • We have considered the Top Left quadrant, but the
    Top Right quadrant can be obtained from the same
    tree structure if we modify the upward
    propagation rule by swapping left and right, i.e.
  • At node (2,3), we find the MIN and MAX from the 3
    choices:
  • 1. Inherit Dmin/max from its right child (1,2),
    which implies that Q excludes (2,3), where Q is
    the quadrant that contains the largest D
  • 2. D = delta(right child) + (0/Ns - 1/Nc), which
    implies Q contains (2,3) and has (2,3) on its
    border. Delta(x) = difference in CDF if the
    quadrant contains all samples in the subtree at x
  • 3. D = delta(right child) + (0/Ns - 1/Nc) +
    Dmin/max(left child), which implies Q contains
    (2,3) and (2,3) is not on its border
  • The Bottom Left/Right quadrants can be obtained
    if the tree is built with samples sorted in
    reverse order of y.

12
Parallel KS2D Algorithm
  • Speed possibly scales linearly with the number of
    processors during the upward propagation step;
    the tree construction step cannot be parallelized
  • Problem size scales linearly with the number of
    processors because sample nodes are distributed
    across processors
  • Challenges:
  • Load balancing: dividing the tree nodes equally
    among processors
  • Minimizing communication: try to store an entire
    subtree on a single processor so that less
    inter-processor communication is necessary.

13
Load Balancing and Minimum Communications
  • Ideally

14
Load Balancing Strategy (1): Pre-processing
  • Randomly sample 1000 data points. Sort
    them by x. Consider the (1/numproc × 1000)th,
    (2/numproc × 1000)th, ...,
    ((numproc-1)/numproc × 1000)th positions and use
    them to define intervals for load balancing
  • Drawback: assumes x and y to be more or less
    independent
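
The pre-processing step above amounts to picking quantiles of a random subsample as interval boundaries. A minimal sketch, assuming a fixed random seed for reproducibility; the names `balance_boundaries` and `owner` are hypothetical.

```python
import bisect
import random

def balance_boundaries(xs, numproc, sample_size=1000, rng=None):
    """Pick numproc-1 x-boundaries from a sorted random subsample."""
    rng = rng or random.Random(0)  # fixed seed is an assumption of this sketch
    sample = sorted(rng.sample(xs, min(sample_size, len(xs))))
    m = len(sample)
    # Position k/numproc of the sorted subsample, for k = 1 .. numproc-1.
    return [sample[(k * m) // numproc] for k in range(1, numproc)]

def owner(x, boundaries):
    """Index of the processor whose interval contains x."""
    return bisect.bisect_right(boundaries, x)
```

With `numproc = 4` this yields three boundaries, and `owner` maps any x value to a processor index from 0 to 3.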

15
Load Balancing Strategy (2): Adaptive
  • Keep a running average of the x values of
    nodes stored in each processor. After every
    CHECKPOINT (2000) samples, if the load
    is skewed (if the difference in load between the
    most heavily loaded processor and the most
    lightly loaded processor is > 30% of the load of
    the lightest processor), change the load
    balancing intervals to the midpoints of the
    running averages.
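
The checkpoint test above can be sketched as follows. This assumes that "midpoints of the running averages" means the midpoint between the running averages of adjacent processors; the function name `rebalance` and its parameters are hypothetical.

```python
def rebalance(loads, running_avg_x, boundaries, threshold=0.30):
    """Return new interval boundaries if the load is skewed, else the old ones.

    loads          -- number of nodes currently stored on each processor
    running_avg_x  -- running average of x on each processor, in interval order
    boundaries     -- current numproc-1 interval boundaries
    """
    lightest, heaviest = min(loads), max(loads)
    if heaviest - lightest > threshold * lightest:
        # Assumed reading: boundary = midpoint of adjacent running averages.
        return [(lo + hi) / 2.0
                for lo, hi in zip(running_avg_x, running_avg_x[1:])]
    return boundaries
```

In a driver loop this would be called once per CHECKPOINT (2000) inserted samples.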

16
Performance (1)
  • 20,000 and 200,000 samples from the uniform [0,1]
    distribution

17
Performance (2)
  • Effects of adaptive load balancing on performance
    for samples from standard normal distributions
    centered at (-0.7, 0.7) and (0.7, -0.7)

18
Conclusion for Parallel KS2D
  • Speedup is not great, especially when more
    processors are used, because of communication
    overhead.
  • Load balancing strategies are noticeably effective
    for certain data distributions; the need for
    them depends on the samples
  • Distributed memory: gains the ability to solve
    larger problems