Title: Geometric and combinatorial issues in data depth
1 Geometric and combinatorial issues in data depth
- Greg Aloupis
- Université Libre de Bruxelles
2 What is data depth?
- A quantitative measurement of how central a point is with respect to a data set.
- Goals: to be able to rank data points, and to find the center of the data cloud.
3 Some geometric bivariate medians
- Convex hull peeling (Tukey 70s)
- 85: Chazelle, Θ(n log n)
- Halfspace median (Tukey 74)
- 01: Langerman-Steiger, O(n log^3 n); 03: Chan, O(n log n), randomized
- Oja median (Oja 83)
- 01: G.A.-Langerman-Soss-Toussaint, O(n log^3 n)
- Simplicial median (Liu 88)
- 01: ALST, O(n^4)
4 Convex hull peeling
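A minimal Python sketch of peeling depth: repeatedly strip the convex hull and record the layer on which each point is removed. It assumes distinct points, peels only hull vertices in each round, and rebuilds the hull from scratch every time (roughly O(n^2 log n)); it is not Chazelle's algorithm. The names `convex_hull` and `peeling_depth` are illustrative.

```python
def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices only (no collinear edge points)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def peeling_depth(points):
    """Map each point to the index (1, 2, ...) of the hull layer that removes it."""
    remaining = set(points)
    depth = {}
    layer = 0
    while remaining:
        layer += 1
        hull = convex_hull(remaining)
        for p in hull:
            depth[p] = layer
        remaining -= set(hull)
    return depth
```

The points on the last non-empty layer are the deepest; any of them (or their centroid) can serve as the peel median in this sketch.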
8 Halfspace, simplicial and Oja depths of a point θ in bivariate data set S
- Each median is a point with max depth (halfspace, simplicial) or min depth (Oja)
9 Halfspace, simplicial and Oja depths of a point θ in bivariate data set S
- (Tukey) halfspace depth
- For every line through θ, count points above/below.
- Return minimum number counted over all lines.
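The definition above can be sketched as a brute-force O(n^2) computation in Python. The count only changes when the line through θ passes over a data point, so it suffices to look at the lines through θ and each data point, rotated slightly either way. This sketch assumes exact (e.g. integer) coordinates and counts closed halfplanes; it is not one of the O(n log n) algorithms. The function name is illustrative.

```python
def halfspace_depth(theta, points):
    """Minimum number of points in a closed halfplane whose boundary passes through theta."""
    tx, ty = theta
    vecs = [(x - tx, y - ty) for x, y in points]
    coincident = sum(1 for v in vecs if v == (0, 0))  # lie in every closed halfplane
    dirs = [v for v in vecs if v != (0, 0)]
    if not dirs:
        return coincident
    best = len(dirs)
    for dx, dy in dirs:
        left = right = same = opposite = 0
        for ex, ey in dirs:
            c = dx * ey - dy * ex
            if c > 0:
                left += 1
            elif c < 0:
                right += 1
            elif dx * ex + dy * ey > 0:
                same += 1      # on the ray from theta through the current point
            else:
                opposite += 1  # on the opposite ray
        # A tiny rotation of the line sends the two rays to opposite sides.
        best = min(best, left + same, left + opposite,
                   right + same, right + opposite)
    return coincident + best
```

For the four corners of a square, the center gets depth 2, a corner gets depth 1, and any point outside the square gets depth 0.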
13 Halfspace, simplicial and Oja depths of a point θ in bivariate data set S
- (Liu) simplicial depth
- Count the closed triangles with vertices in S that contain θ.
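A brute-force O(n^3) Python sketch of this count, assuming exact coordinates and skipping degenerate (collinear) triples. It is meant only to pin down the definition, not to match the O(n log n) algorithms. The function names are illustrative.

```python
from itertools import combinations


def orient(a, b, c):
    """Twice the signed area of triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def simplicial_depth(theta, points):
    """Number of closed, non-degenerate triangles on `points` that contain theta."""
    count = 0
    for a, b, c in combinations(points, 3):
        if orient(a, b, c) == 0:
            continue  # collinear triple: not a proper triangle
        signs = (orient(a, b, theta), orient(b, c, theta), orient(c, a, theta))
        # theta is inside or on the boundary unless it sees edges with both signs
        if not (min(signs) < 0 < max(signs)):
            count += 1
    return count
```

For the four corners of a square, the center lies on both diagonals and is therefore contained in all four closed triangles.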
19 Halfspace, simplicial and Oja depths of a point θ in bivariate data set S
- Oja depth
- Sum areas of all triangles with vertices (θ, s_i, s_j)
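A direct O(n^2) Python sketch of this sum. Note that a smaller value means a more central point, so the Oja median minimizes it. The function name is illustrative.

```python
from itertools import combinations


def oja_depth(theta, points):
    """Sum of the areas of all triangles (theta, s_i, s_j) with s_i, s_j in points."""
    tx, ty = theta
    total = 0.0
    for (ax, ay), (bx, by) in combinations(points, 2):
        # Half the absolute cross product of (s_i - theta) and (s_j - theta)
        total += abs((ax - tx) * (by - ty) - (ay - ty) * (bx - tx)) / 2
    return total
```

For the four corners of a 4-by-4 square, the center scores 16 while a corner scores 24.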
23 Halfspace, simplicial and Oja depths of a point θ in bivariate data set S
- O(n log n) upper bounds: Khuller-Mitchell 89, Gil-Steiger-Wigderson 92, Rousseeuw-Ruts 96
- Ω(n log n) lower bounds: G.A.-Cortes-Gomez-Soss-Toussaint 01, Langerman-Steiger 01, G.A.-McLeish 05
24 Issue 1: What is the complexity of computing the depth k of a point if k is known to be small/large?
- If the peel median has depth k > 1 then can we compute it faster? (GSW92)
- !!! This just in: simplicial depth in O(n + n log(1 + k/n))
- Elmasry-Elbassioni, CCCG last week
- Is there a lower bound, sensitive to parameter k?
- Something similar for halfspace depth?
- Current attempts for O(n log k)
25 Issue 2: (Improve) simplicial median computation
- Remember that horrible n^4 result a few slides back
26 Easy observation
- I: the set of line segments between pairs in S.
- The simplicial median is on an intersection of two segments in I.
27 Outline of a method
- Preprocessing: O(n^3) brute-force, actually O(n^2)
- Count number of points above/below each segment.
- Compute depth of all points.
- For each segment,
- sort all intersections with other segments.
- O(n^2 log n) per segment.
- Calculate depth of each intersection in O(1) time
- O(n^2) per segment.
- Overall O(n^4 log n)
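The candidate set used by this method (intersections of pairs of segments in I) can be illustrated with a deliberately naive Python sketch. It recomputes the depth of every candidate from scratch, so it runs in roughly O(n^7) and is only usable on tiny inputs; it does not implement the sorted walk with O(1) updates. It assumes integer coordinates and no three collinear data points, and uses exact rationals because the candidates lie on triangle boundaries. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def orient(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def simplicial_depth(theta, points):
    """Closed non-degenerate triangles containing theta (brute force)."""
    count = 0
    for a, b, c in combinations(points, 3):
        if orient(a, b, c) == 0:
            continue
        signs = (orient(a, b, theta), orient(b, c, theta), orient(c, a, theta))
        if not (min(signs) < 0 < max(signs)):
            count += 1
    return count


def intersection(p, q, r, s):
    """Exact intersection of segments pq and rs, or None (parallel pairs are skipped)."""
    d1 = (q[0] - p[0], q[1] - p[1])
    d2 = (s[0] - r[0], s[1] - r[1])
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if denom == 0:
        return None
    t = Fraction((r[0] - p[0]) * d2[1] - (r[1] - p[1]) * d2[0], denom)
    u = Fraction((r[0] - p[0]) * d1[1] - (r[1] - p[1]) * d1[0], denom)
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (p[0] + t * d1[0], p[1] + t * d1[1])
    return None


def brute_force_simplicial_median(points):
    """Return (depth, point) for the first deepest segment intersection found."""
    segments = list(combinations(points, 2))
    best_depth, best_point = -1, None
    for (p, q), (r, s) in combinations(segments, 2):
        x = intersection(p, q, r, s)
        if x is None:
            continue
        d = simplicial_depth(x, points)
        if d > best_depth:
            best_depth, best_point = d, x
    return best_depth, best_point
```

Segments that share an endpoint intersect at that data point, so the data points themselves are among the candidates.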
37 - Constant time to update depth as we walk
38 - Instead of sorting intersection points and processing each segment alone, we can use topological sweep.
- The time complexity becomes O(n^4) and the space used is O(n^2).
- Can we improve this?
- i.e. find some structure in this depth function
39 Conjecture
- A point of maximum simplicial depth can always be found on the intersection of two halving segments
- (weak) experiments have not contradicted this
40 Desirable properties of data depth functions
- Affine invariance (at the very least)
- Robustness
- Outliers should not influence the center.
- Monotonicity
- Center should move in same direction as perturbations
41 Monotonicity
43 Robustness to outliers
- Breakdown point: fraction of data that must be moved/added so that the median is placed at infinity.
- Oja median: was considered to be robust, but finally it was shown that the breakdown point can be near zero for certain configurations (planar case) (Niinimaa, Oja, Tableman 90)
- Simplicial median: don't know. But the data point of maximum depth can be moved away with few corrupting points (GSW 92) (planar case)
- Halfspace median: great! 1/(d+1) (Donoho, Gasko 92)
44 Robustness to outliers
- Breakdown point: fraction of data that must be moved/added so that the median is placed at infinity.
- Max breakdown: ½
- In 1D, only the median is affine invariant, monotonic and has max breakdown
- Is there such an estimator in higher dimensions?
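A small 1D illustration in Python of the breakdown point, under a crude outlier model in which the m largest observations are moved to a very large value. The helper `corrupt` is hypothetical and exists only for this sketch.

```python
from statistics import mean, median


def corrupt(data, m, value):
    """Replace the m largest observations with `value` (a crude outlier model)."""
    s = sorted(data)
    return s[:len(s) - m] + [value] * m
```

With the nine observations 1..9, moving four of them to 10^9 leaves the median at 5, and it takes five (more than half) to carry the median away. A single moved observation already drags the mean past 10^8.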
45 Issue 3: How does the breakdown point depend on the depth of the median?
- Convex peeling: breakdown is zero, unless depth is linear (GSW92)
- Halfspace: breakdown is higher (1/3) for centrosymmetric data distributions, where depth is roughly 1/2
- Instead of 1/(d+1)
- So what can we say about other estimators?
- For deepest point in plane
- For deepest data point
46 Issue 4: Non-strategic breakdown
- All work so far involved carefully placing outliers (erroneous or corrupt data), to move an estimator far away.
- (Is corrupt data really placed carefully in practice?)
- What about:
- average outliers (random or evenly spaced placement)
- strong breakdown (should work regardless of direction at infinity)
- special-case outliers (axis-parallel, or radial extension, or ?)
47 Issue 5: Computing/analyzing other estimators
- Projection outlyingness of q (Donoho-Gasko 92)
- Take max of the following, over all projections:
- |q - Median| / (median deviation from median)
- Find an algorithm for the least outlying point.
- Gil-Steiger-Wigderson
- superposition of unit vectors to data points v(ai)
- median is a (data) point with v(ai) R < 1 ???
- computation in o(n^2)? Properties?
- Zonoid depth, Delaunay depth
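For the projection outlyingness mentioned at the top of this slide, a rough Python sketch in the plane: the maximum over all projections is approximated by sampling a finite number of directions, so the result is only a lower bound on the true value. The parameter `num_directions` and the function name are assumptions of this sketch.

```python
import math
from statistics import median


def projection_outlyingness(q, points, num_directions=360):
    """Approximate max over directions of |q.u - median| / (median deviation from median)."""
    worst = 0.0
    for k in range(num_directions):
        angle = math.pi * k / num_directions  # u and -u give the same ratio
        ux, uy = math.cos(angle), math.sin(angle)
        proj = [x * ux + y * uy for x, y in points]
        med = median(proj)
        mad = median(abs(p - med) for p in proj)
        if mad == 0:
            continue  # degenerate direction, skipped in this sketch
        worst = max(worst, abs(q[0] * ux + q[1] * uy - med) / mad)
    return worst
```

Finding the least outlying point would then mean minimizing this function over q, which this sketch does not attempt.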
48 Issue 6: Points of high depth
- A point with Tukey depth ≥ n/(d+1) is a centerpoint.
- Guaranteed to exist, by Helly's theorem.
- O(n) time computation (Jadhav-Mukhopadhyay 94)
- Can be considered to be a median generalization.
- ¼ (n choose 3) ≥ simplicial depth ≥ 2/9 (n choose 3) (Boros-Furedi 82)
- (in R^2, ignoring quadratic terms)
- Can we compute a high depth point quickly?
- Tverberg points in R^2 have depth ≥ 1/27 (n choose 3) and can be computed in O(n) time. Anything better?
- Is there a point with high Oja depth? (normalized)
49 Things I may have mentioned in the abstract but forgot to include here
- Is it faster to locate a deep point without computing its depth?
- How many points have depth > k?
- When do simplicial depth levels become disconnected?
50 Thank you