Title: A computational tool for depth-based statistical analysis
1. A computational tool for depth-based statistical analysis
- Eynat Rafalin, Tufts University
- Computer Science Department
2. The tool
- Easy-to-use, efficient and expandable interface for statistical research, based on the notion of data depth.
- For scientists with no computer science background.
3. Our goal
- Present the tool to the community
- Code/software available on request
- Run on real data
- Get feedback
- Is such a tool needed?
- Additions/improvements?
4. General
- C-based software (no additional tools/software needed)
- Simple interface that should allow the user to:
- enter data files, sort the data points and filter out unwanted data
- perform calculations
- present the results in an easy-to-understand graphical interface
- save and output data for future use
- Fast
- Portable code
5. General description
Data filter
txt, Excel files
output
Statistical modules
Geomview
Contours display and selection
6. Data filter
- Graphical user interface developed in C
- Used to crop/manipulate a data set before it is fed into the statistical modules
- Fast and light
- Convenient and easy to use user interface
- Portable code (UNIX, Solaris, Linux, Win)
7. Data filter
8. Statistical modules
- Depth contours (2D)
- Half-space (location) depth contours
- optimal O(n^2) time
- Supports two approaches for defining contours
- Including Tukey median and the bagplot
- Including contour parameters (size, etc.)
- Convex hull peeling depth contours
- Simplicial depth contours
- Tukey median computation (O(n log^3 n))
- Locating a new point in a set of depth contours
(O(log n) query time)
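To make the half-space (location) depth behind these modules concrete, here is a minimal brute-force sketch in C. It is not the tool's code: it assumes the usual definition (the smallest number of sample points in any closed half-plane whose boundary passes through the query point) and costs O(n^2) per query point, so it is much slower than the contour algorithm quoted above. The names Point and halfspace_depth are hypothetical.

```c
#include <stddef.h>

typedef struct { double x, y; } Point;   /* hypothetical 2D point type */

/* Cross product of (a - o) and (b - o): > 0 if b is left of the ray o->a. */
static double cross(Point o, Point a, Point b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

/* Dot product of (a - o) and (b - o): > 0 if b lies "ahead" of o along o->a. */
static double dot(Point o, Point a, Point b) {
    return (a.x - o.x) * (b.x - o.x) + (a.y - o.y) * (b.y - o.y);
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Brute-force half-space depth of q with respect to pts[0..n-1].
 * The count of points in a closed half-plane through q only changes when
 * the boundary line passes through a sample point, so it is enough to try
 * each line q->pts[i], rotated infinitesimally in both directions. */
size_t halfspace_depth(Point q, const Point *pts, size_t n) {
    size_t coincident = 0;               /* sample points equal to q */
    for (size_t i = 0; i < n; ++i)
        if (pts[i].x == q.x && pts[i].y == q.y) ++coincident;

    size_t best = n;                     /* a half-plane never holds more than n */
    for (size_t i = 0; i < n; ++i) {
        if (pts[i].x == q.x && pts[i].y == q.y) continue;
        size_t left = 0, right = 0, fwd = 0, back = 0;
        for (size_t j = 0; j < n; ++j) {
            if (pts[j].x == q.x && pts[j].y == q.y) continue;
            double c = cross(q, pts[i], pts[j]);
            if (c > 0) ++left;
            else if (c < 0) ++right;
            else if (dot(q, pts[i], pts[j]) > 0) ++fwd;   /* on the ray q->pts[i] */
            else ++back;                                  /* on the opposite ray */
        }
        /* A tiny rotation sends the forward-ray points to one side and the
         * backward-ray points to the other; points equal to q always count. */
        best = min_sz(best, coincident + left + back);
        best = min_sz(best, coincident + left + fwd);
        best = min_sz(best, coincident + right + back);
        best = min_sz(best, coincident + right + fwd);
    }
    return best;
}
```

For the four corners of a square plus its center, this gives depth 3 for the center, depth 1 for each corner, and depth 0 for any point outside the square. Coordinates are compared exactly, so the sketch is only reliable for inputs without rounding noise.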
9. Approaches for defining depth contours
- P. Rousseeuw et al.
- The k-th depth contour is the boundary of the set of points in the plane with depth ≥ k
- R. Liu et al. (based on order statistics)
- The sample p-th central hull is the convex hull containing the most central fraction p of the sample points.
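The two definitions can be written side by side. The notation below is an assumption made for illustration and does not come from the slides: D_n(x) is the depth of a point x with respect to the n sample points, and X_[1], ..., X_[n] is the sample ordered from deepest to shallowest. Reading "fraction p" as the ceiling of np points is also an assumption.

```latex
% Rousseeuw et al.: the k-th depth region and its contour
\[
  R_k = \{\, x \in \mathbb{R}^2 : D_n(x) \ge k \,\},
  \qquad
  \text{$k$-th depth contour} = \partial R_k
\]

% Liu et al.: the sample p-th central hull, from the depth-ordered sample
\[
  C_{n,p} = \operatorname{conv}\{\, X_{[1]}, \dots, X_{[\lceil np \rceil]} \,\},
  \qquad 0 < p \le 1
\]
```

The first definition ranges over every point of the plane, while the second is built only from sample points, so the two families of contours need not coincide.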
10. Half-space (location) depth contours module
Depth contours for a sample set with 8 data points
Depth contours for a data set describing diabetic
patients
11. Statistical modules (cont.)
- Plots
- DD (Depth vs. Depth) plots
- O(n^2) time
- Shrinkage plots
- Fan plots
12. DD (Depth vs. Depth) plots module
Depth according to set A
Depth according to set B
Two 2D data sets of 50 points each, drawn from normal distributions centered at (0,0) with different covariance matrices (the identity and 4 times the identity).
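The construction of a DD plot can be sketched as follows: every point of the combined sample gets one coordinate from its depth with respect to set A and one from its depth with respect to set B. This is an illustrative sketch in C, not the tool's module. It assumes half-space depth as the depth function and uses a brute-force routine, so it runs in cubic time overall rather than the O(n^2) quoted for the module. The names Point, DDPair, halfspace_depth and dd_plot are hypothetical.

```c
#include <stddef.h>

typedef struct { double x, y; } Point;              /* hypothetical point type */
typedef struct { size_t depth_a, depth_b; } DDPair; /* one DD-plot coordinate */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Brute-force half-space depth of q w.r.t. pts[0..n-1] (O(n^2) per query):
 * try every line q->pts[i], perturbed slightly in both directions. */
static size_t halfspace_depth(Point q, const Point *pts, size_t n) {
    size_t same = 0, best = n;
    for (size_t i = 0; i < n; ++i)
        if (pts[i].x == q.x && pts[i].y == q.y) ++same;
    for (size_t i = 0; i < n; ++i) {
        double dx = pts[i].x - q.x, dy = pts[i].y - q.y;
        if (dx == 0 && dy == 0) continue;
        size_t left = 0, right = 0, fwd = 0, back = 0;
        for (size_t j = 0; j < n; ++j) {
            double ex = pts[j].x - q.x, ey = pts[j].y - q.y;
            if (ex == 0 && ey == 0) continue;
            double c = dx * ey - dy * ex;            /* side of the line */
            if (c > 0) ++left;
            else if (c < 0) ++right;
            else if (dx * ex + dy * ey > 0) ++fwd;   /* on the ray itself */
            else ++back;                             /* on the opposite ray */
        }
        best = min_sz(best, same + min_sz(left, right) + min_sz(fwd, back));
    }
    return best;
}

/* Fill out[0..na+nb-1] with (depth in A, depth in B) for every point of A
 * followed by every point of B. The caller provides room for na + nb pairs. */
void dd_plot(const Point *a, size_t na, const Point *b, size_t nb, DDPair *out) {
    for (size_t i = 0; i < na + nb; ++i) {
        Point q = (i < na) ? a[i] : b[i - na];
        out[i].depth_a = halfspace_depth(q, a, na);
        out[i].depth_b = halfspace_depth(q, b, nb);
    }
}
```

Plotting depth_a against depth_b gives the DD plot. The raw counts are only directly comparable when both sets have the same size, as in the 50-point example above; for unequal sizes, dividing each count by the size of its set is a common normalization.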
13. Fan plots
Relative area (CH of p/CH, where CH is the convex hull)
Percentile of points
50 data points, created from a random distribution with covariance matrix 4 times the identity. The fans are created for data sets containing the 1/6, 2/6, ... central regions. For each region the area of the CH of 2, 4, 6, ... of the points is computed.
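The quantity on the vertical axis, the area of the convex hull of a subset of the points divided by the area of the convex hull of the whole set, can be computed as in the sketch below. It is a minimal illustration in C and not the tool's code: it uses Andrew's monotone chain for the hull and the shoelace formula for the area, and it leaves the choice of the subset (for example, the deepest points of a central region) to the caller. The names Point, hull_area and relative_hull_area are hypothetical.

```c
#include <stdlib.h>

typedef struct { double x, y; } Point;   /* hypothetical 2D point type */

/* Lexicographic order (x, then y) for qsort. */
static int cmp_point(const void *pa, const void *pb) {
    const Point *a = pa, *b = pb;
    if (a->x != b->x) return (a->x < b->x) ? -1 : 1;
    if (a->y != b->y) return (a->y < b->y) ? -1 : 1;
    return 0;
}

static double cross(Point o, Point a, Point b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

/* Area of the convex hull of pts[0..n-1]. Sorts pts in place.
 * Returns 0 for degenerate input and -1.0 if memory allocation fails. */
double hull_area(Point *pts, size_t n) {
    if (n < 3) return 0.0;
    qsort(pts, n, sizeof *pts, cmp_point);
    Point *h = malloc(2 * n * sizeof *h);
    if (!h) return -1.0;
    size_t k = 0;
    for (size_t i = 0; i < n; ++i) {                 /* lower hull */
        while (k >= 2 && cross(h[k - 2], h[k - 1], pts[i]) <= 0) --k;
        h[k++] = pts[i];
    }
    size_t lower = k + 1;
    for (size_t i = n - 1; i-- > 0; ) {              /* upper hull, i = n-2..0 */
        while (k >= lower && cross(h[k - 2], h[k - 1], pts[i]) <= 0) --k;
        h[k++] = pts[i];
    }
    /* h is counter-clockwise and h[k-1] repeats h[0], closing the polygon. */
    double twice_area = 0.0;
    for (size_t i = 0; i + 1 < k; ++i)
        twice_area += h[i].x * h[i + 1].y - h[i + 1].x * h[i].y;
    free(h);
    return twice_area / 2.0;
}

/* Relative area: hull area of the subset over hull area of the full set.
 * Returns -1.0 when the ratio is undefined (zero total area or failed malloc). */
double relative_hull_area(Point *subset, size_t m, Point *all, size_t n) {
    double total = hull_area(all, n);
    double part = hull_area(subset, m);
    if (total <= 0.0 || part < 0.0) return -1.0;
    return part / total;
}
```

Repeating this for growing subsets and plotting the ratios against the fraction of points used gives one curve of the fan. Both arrays are reordered by the sort, so callers that depend on the original order should pass copies.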
14. Graphical contour selection tool
- Plots depth contours and selects data ranges.
- Actions
- Import/export
- Select points
- Depth slider
- Filter
15. Future work
- Run the tool on existing data sets
- Distribute preliminary versions and get user feedback
- Data filter
- Group by row/column
- Filter by row/column
- Interactions between rows/columns (addition, substitution, logical operations)
- Statistical modules
- Implement additional modules
- Improve running times
16. Contributors
- Prof. Diane Souvaine
- Prof. Alva Couch
- Eynat Rafalin
- Michael Burr
- Joe Handelman
- James Hayes
- Ori Taka
- Alok Lal
- Janet Luan
- Kim Miller
- Tim Mitchell
- Nikolai Shvertner