BINF 733 Spring 2005 Statistical Methods of Outlier Detection - PowerPoint PPT Presentation

1 / 37
About This Presentation
Title:

BINF 733 Spring 2005 Statistical Methods of Outlier Detection

Description:

For he that knows the ways of nature will more easily observe her deviations; ... to brachiosaurus, diplodocus, triceratops, Asian elephant, and Africa elephant. ... – PowerPoint PPT presentation

Number of Views:50
Avg rating:3.0/5.0
Slides: 38
Provided by: binf
Category:

less

Transcript and Presenter's Notes

Title: BINF 733 Spring 2005 Statistical Methods of Outlier Detection


1
BINF 733 Spring 2005 Statistical Methods of
Outlier Detection
  • Jeff Solka Ph.D.
  • Jennifer Weller Ph.D.

2
Sir Francis Bacon Novum Organum 1620
  • For he that knows the ways of nature will more
    easily observe her deviations and on the other
    hand he that knows her deviations will more
    accurately describe her ways.

3
Sir Francis Bacon Revisited
  • To identify outliers we need some sort of model
    to start with.
  • We can do a better job at identifying our model
    if we first remove the outliers.
  • The process of outlier identification/model
    building is an iterative process.

4
What is an Outlier?
  • Given a set of observations X an outlier is an
    observations that is an element of this set but
    which is inconsistent with the majority of the
    data.

http//www.ncl.ac.uk/cpact/demo_outlier.jpg
5
Manifestation of Outliers in Gene Expression Data
  • Given a set of replicate arrays the replicates
    can be used to identify an aberrant spot.
  • Xgi transformed and normalized spot intensity
    measurements for the gth gene on the ith array
  • An outlier is an observation Xgi that is markedly
    different from his fellow observations

6
Nonresistent Rules for Outlier Identification

7
The z-score Rule Grubbs Test
  • The z-score rule (Grubbs test). Calculate a
    z-score zgi for every observation
  • Where and sg are the mean and standard
    deviation of the gth gene. Call Xgj an outlier is
    zgj is larger say greater than five

8
The CV Rule
  • The CV Rule Call the furthest observation Xgi
    from the mean, , and outlier if the coefficient
    of variation CVg exceeds some prespecified
    cutoff.

9
Problems With the z-score and CV Methods of
Outlier Detection
  • They are both based on measures that are heavily
    influenced by outliers, the mean and the standard
    deviation.
  • Masking An outlier remains undetected because
    it is hidden by its own influence on the
    methodologies parameters or else by another
    adjacent outlier.
  • Swamping A normal observation is classified as
    an outlier due to the presence of an unrelated
    outlier or outliers.

10
Resistant Rules for Outlier Detection

11
One Approach to Crafting Resistant Rules for
Outlier Detection
  • Based on outlier resistant statistical measures
  • Median
  • Median absolute deviation from the median

12
The Resistant z-score Rule
  • The resistant z-score rule. Calculate a resistant
    z-score, zgi for every observation using
  • and are the median and MAD of the gth gene.
    Call Xgi and outlier if zgi is large, say,
    greater than five.

13
Problem of Too Few Replicates
  • Microarray experiments usually have little
    replication
  • Median and MAD are not dependable estimates of
    the location and scale of the data

14
A Strategy for the Problem of Too Few Replicates
- I
  • With microarray data there is a relationship
    between the median and MAD across all of the
    genes
  • Assume this relationship is a true relationship
  • s2g f(mg)
  • Use this to compute a smoothed version of MAD,
    , that will be more stable as it boorows
    strength from similarly expressing genes

15
A Strategy for the Problem of Too Few Replicates
- II
Run a smoother such as a smoothing spline through
the relationship of ADgi versus Use the
fitted value, , as an estimator for the
gth gene.
16
A Strategy for the Problem of Too Few Replicates
- III
  • The revised z-score rule
  • Call Xgi an outlier if the computed score is
    large say greater than five

17
Mahalanobis Distance for Outlier Detection
18
Advantages of the Mahalanobis Distance Approach
  • Mahalanobis' distance identifies observations
    which lie far away from the centre of the data
    cloud, giving less weight to variables with large
    variances or to groups of highly correlated
    variables (Joliffe, 1986).
  • This distance is often preferred to the Euclidean
    distance which ignores the covariance structure
    and thus treats all variables equally.

19
A Circle Becomes an Ellipse Based on the
Mahalanobis Distance
http//www.famsi.org/reports/98061/images/fig18.gi
f
20
A Test Statistic for the Mahalanobis Distance
21
Principal Components
  • Huber (1985) cites two main reasons why principal
    components are interesting projections
  • first, in the case of clustered data, the leading
    principal axes pick projections with good
    separations
  • secondly, the leading principal components
    collect the systematic structure of the data.
  • Thus, the first principal component reflects the
    first major linear trend, the second principal
    component, the second major linear trend, etc.
  • So, if an observation is located far away from
    any of the major linear trends it can be
    considered an outlier.

22
Clustering and Outlier Detection
  • Cluster Analysis can be used for outlier
    detection. 
  • Outliers may emerge as singletons or as small
    clusters far removed from the others. 
  • To do  outlier detection at the same time as
    clustering the main body of the data, use enough
    clusters to represent both the main body of the
    data and the outliers. 

23
Fisher Iris Data
  • 150 Cases
  • 5 variables
  • Sepal length
  • Sepal width
  • Petal length
  • Petal width
  • Species (3 types)

24
Iris data
Classic Dendrogram
Classic Data Image
25
Line Example
Which of these are outliers?
26
Data Image of the Interpoint Distance Matrix of
the Line Example
Both outliers
Euclidean Distance
Mahalanobis Distance
Triangle outlier
Outliers manifest themselves as vs or plus sign
structures in the data image
27
Body Weight Brain Weight Data
Data Image shows outliers and subclusters of the
outliers
The outliers are number sequentially and
correspond to brachiosaurus, diplodocus,
triceratops, Asian elephant, and Africa elephant.
28
Stackloss Dataset
Rousseeuw and Leroy 1987 report 1, 3, 4, 21 and
maybe 2 as outliers.
4, 21
1, 2, 3
Outliers have been labeled as triangles.
29
Data Image for the Mahalanobis Distance
Presence of outliers is not clearly discernible.
30
Data Image for the Mahalanobis Distance Where the
Covariance in the Mahalanobis Distance
Calculation is Constructed Using Observations 4 -
21
21
1, 2, 3
31
An Artificial Dataset from Rousseeuw and Leroy
1987
Cluster structures of the outliers revealed in
the data image.
32
A Particularly Onerous Elliptical Dataset
First suggested by Dan Carr.
33
Euclidean and Mahalanobis Data Images of the
Ellipse Data
34
Pairs Plot and Data Image for 5 Dimensional
Sphere Case
35
Artificial Nose Dataset
  • Fiber optic artificial olfactory system
  • 19 fibers x 2 wavelengths 60 times/inhalation
    2280
  • Each data point resides in R2280

36
Artificial Nose Data Image of TCE Present
Chloroform Observations
37
References
  • Afifi, A.A., and Azen, S.P. (1972), Statistical
    analysis a computer oriented approach, Academic
    Press, New York.
  • Barnett, V. and T. Lewis (1994) Outliers in
    Statistical Data. New Your Wiley
  • Huber, P.J. (1985), Projection pursuit, The
    Annals of Statistics, 13(2), 435-475.
  • David J. Marchette and Jeffrey L. Solka Using
    data images for outlier detection  Computational
    Statistics Data Analysis, Volume 43, Issue 4,
    28 August 2003, Pages 541-552
  • Joliffe, I.T. (1986) Principal Component
    Analysis, Springer-Verlag, New York.
  • Robust Regression and Outlier Detection (Wiley
    Series in Probability and Statistics) by Peter J.
    Rousseeuw, Annick M. Leroy , Wiley-Interscience
    (September 19, 2003)
Write a Comment
User Comments (0)
About PowerShow.com