BINF 733 Spring 2005 Statistical Methods of Outlier Detection presentation

1
BINF 733 Spring 2005 Statistical Methods of
Outlier Detection

Jeff Solka Ph.D.
Jennifer Weller Ph.D.

2
Sir Francis Bacon Novum Organum 1620

For he that knows the ways of nature will more
easily observe her deviations and on the other
hand he that knows her deviations will more
accurately describe her ways.

3
Sir Francis Bacon Revisited

To identify outliers we need some sort of model
to start with.
We can do a better job at identifying our model
if we first remove the outliers.
The process of outlier identification/model
building is an iterative process.

4
What is an Outlier?

Given a set of observations X, an outlier is an
observation that is an element of this set but
is inconsistent with the majority of the data.

http://www.ncl.ac.uk/cpact/demo_outlier.jpg
5
Manifestation of Outliers in Gene Expression Data

Given a set of replicate arrays, the replicates
can be used to identify an aberrant spot.
Xgi: the transformed and normalized spot intensity
measurement for the gth gene on the ith array.
An outlier is an observation Xgi that is markedly
different from its fellow observations.

6
Nonresistant Rules for Outlier Identification

7
The z-score Rule (Grubbs' Test)

The z-score rule (Grubbs' test). Calculate a
z-score $z_{gi}$ for every observation:
$$z_{gi} = \frac{|X_{gi} - \bar{X}_g|}{s_g}$$
where $\bar{X}_g$ and $s_g$ are the mean and standard
deviation of the gth gene. Call $X_{gi}$ an outlier if
$z_{gi}$ is large, say greater than five.
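
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the function names, the use of the sample standard deviation, and the example intensities are all assumptions made for the sketch.

```python
from statistics import mean, stdev


def z_scores(replicates):
    """Return |x - mean| / sd for each replicate of one gene."""
    center = mean(replicates)
    spread = stdev(replicates)  # sample standard deviation (n - 1 denominator)
    return [abs(x - center) / spread for x in replicates]


def z_score_outliers(replicates, cutoff=5.0):
    """Indices of replicates whose z-score exceeds the cutoff."""
    return [i for i, z in enumerate(z_scores(replicates)) if z > cutoff]


# Hypothetical intensities for one gene on 40 arrays; the last one is aberrant.
gene = [8.0 + 0.1 * (i % 5) for i in range(39)] + [14.0]
print(z_score_outliers(gene))
```

A large number of arrays is used here on purpose, because the mean and standard deviation are themselves pulled toward the aberrant value.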

8
The CV Rule

The CV rule. Call the observation Xgi furthest
from the mean, $\bar{X}_g$, an outlier if the coefficient
of variation $CV_g = s_g / \bar{X}_g$ (standard deviation
over mean) exceeds some prespecified cutoff.
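
A minimal sketch of the CV rule follows. The function name, the cutoff of 0.3, and the example values are hypothetical, and the sketch assumes a positive mean so that the coefficient of variation is meaningful.

```python
from statistics import mean, stdev


def cv_rule(replicates, cutoff):
    """Return the index of the observation furthest from the mean if the
    coefficient of variation exceeds the cutoff, otherwise None."""
    center = mean(replicates)
    cv = stdev(replicates) / center  # assumes center > 0
    if cv <= cutoff:
        return None
    return max(range(len(replicates)),
               key=lambda i: abs(replicates[i] - center))


# Hypothetical replicate intensities; the cutoff of 0.3 is an arbitrary choice.
print(cv_rule([100.0, 102.0, 98.0, 101.0, 250.0], cutoff=0.3))
print(cv_rule([100.0, 102.0, 98.0, 101.0, 99.0], cutoff=0.3))
```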

9
Problems With the z-score and CV Methods of
Outlier Detection

Both are based on measures that are heavily
influenced by outliers: the mean and the standard
deviation.
Masking: an outlier remains undetected because
it is hidden by its own influence on the
method's parameters or by another adjacent
outlier.
Swamping: a normal observation is classified as
an outlier because of the presence of an unrelated
outlier or outliers.
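
Masking can be illustrated with a small hypothetical example. With the sample standard deviation, no observation among n can have a z-score above (n - 1) / sqrt(n), so with five replicates even a gross outlier stays far below a cutoff of five, and a second outlier lowers the score further. The data values and the helper name are assumptions made for this sketch.

```python
import math
from statistics import mean, stdev


def max_z(values):
    """Largest classical z-score |x - mean| / sd in the sample."""
    center, spread = mean(values), stdev(values)
    return max(abs(x - center) / spread for x in values)


one_outlier = [8.0, 8.1, 8.2, 8.3, 50.0]
two_outliers = [8.0, 8.1, 8.2, 50.0, 50.0]
bound = (len(one_outlier) - 1) / math.sqrt(len(one_outlier))

print(round(max_z(one_outlier), 3), "upper bound:", round(bound, 3))
print(round(max_z(two_outliers), 3))  # the second outlier masks the first
```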

10
Resistant Rules for Outlier Detection

11
One Approach to Crafting Resistant Rules for
Outlier Detection

Based on outlier-resistant statistical measures:
Median
Median absolute deviation from the median (MAD)

12
The Resistant z-score Rule

The resistant z-score rule. Calculate a resistant
z-score $z_{gi}$ for every observation using
$$z_{gi} = \frac{|X_{gi} - m_g|}{\tilde{s}_g}$$
where $m_g$ and $\tilde{s}_g$ are the median and MAD of
the gth gene. Call $X_{gi}$ an outlier if $z_{gi}$ is
large, say, greater than five.
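
A minimal sketch of the resistant rule, using the same hypothetical five replicates that defeat the classical z-score. The raw MAD is used without any consistency constant, which is an assumption of this sketch, and the function names are illustrative.

```python
from statistics import median


def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    m = median(values)
    return median([abs(x - m) for x in values])


def resistant_outliers(replicates, cutoff=5.0):
    """Indices whose |x - median| / MAD exceeds the cutoff."""
    m = median(replicates)
    scale = mad(replicates)
    if scale == 0:
        raise ValueError("MAD is zero; the resistant z-score is undefined")
    return [i for i, x in enumerate(replicates)
            if abs(x - m) / scale > cutoff]


gene = [8.0, 8.1, 8.2, 8.3, 50.0]  # hypothetical replicate intensities
print(resistant_outliers(gene))
```

Because the median and MAD barely move when one value is extreme, the aberrant replicate is flagged even with only five arrays.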

13
Problem of Too Few Replicates

Microarray experiments usually have little
replication
Median and MAD are not dependable estimates of
the location and scale of the data

14
A Strategy for the Problem of Too Few Replicates
- I

With microarray data there is a relationship
between the median and MAD across all of the
genes.
Assume this relationship is a true relationship:
$$\tilde{s}_g^2 = f(m_g)$$
where $m_g$ and $\tilde{s}_g$ are the median and MAD of
the gth gene.
Use this to compute a smoothed version of MAD,
$\hat{s}_g$, that will be more stable because it borrows
strength from similarly expressing genes.

15
A Strategy for the Problem of Too Few Replicates
- II
Run a smoother, such as a smoothing spline, through
the relationship of the absolute deviations
$AD_{gi} = |X_{gi} - m_g|$ versus the gene medians $m_g$.
Use the fitted value, $\hat{s}_g$, as an estimator of
scale for the gth gene.
16
A Strategy for the Problem of Too Few Replicates
- III

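The revised z-score rule:
$$z_{gi} = \frac{|X_{gi} - m_g|}{\hat{s}_g}$$
where $m_g$ is the median and $\hat{s}_g$ the smoothed
scale estimate for the gth gene.
Call $X_{gi}$ an outlier if the computed score is
large, say greater than five.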
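
The three-step strategy can be sketched as follows. This is only an illustration under stated assumptions: a running mean over genes ordered by median stands in for the smoothing spline, the per-gene MAD is smoothed rather than the individual absolute deviations, and the expression values, window size, and function names are hypothetical.

```python
from statistics import mean, median


def smoothed_scale(medians, mads, window=1):
    """Running mean of MAD over neighbouring genes in median order.
    A crude stand-in for a smoothing spline."""
    order = sorted(range(len(medians)), key=lambda g: medians[g])
    smoothed = [0.0] * len(medians)
    for rank, g in enumerate(order):
        lo = max(0, rank - window)
        hi = min(len(order), rank + window + 1)
        smoothed[g] = mean([mads[order[k]] for k in range(lo, hi)])
    return smoothed


def revised_outliers(expr, window=1, cutoff=5.0):
    """Return the smoothed scales and the (gene, array) pairs flagged."""
    medians = [median(row) for row in expr]
    mads = [median([abs(x - m) for x in row])
            for row, m in zip(expr, medians)]
    smooth = smoothed_scale(medians, mads, window)
    flagged = [(g, i)
               for g, row in enumerate(expr)
               for i, x in enumerate(row)
               if abs(x - medians[g]) / smooth[g] > cutoff]
    return smooth, flagged


# Hypothetical data: 5 genes x 3 replicate arrays.
expr = [
    [5.0, 5.1, 5.2],
    [5.5, 5.6, 5.8],
    [6.0, 6.0, 6.1],  # raw MAD is zero here
    [6.4, 6.5, 9.5],  # third replicate looks aberrant
    [7.0, 7.2, 7.3],
]
smooth, flagged = revised_outliers(expr)
print(flagged)
```

The third gene has a raw MAD of zero, so the unsmoothed resistant z-score would be undefined for it; borrowing from neighbouring genes gives it a usable scale.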

17
Mahalanobis Distance for Outlier Detection
18
Advantages of the Mahalanobis Distance Approach

Mahalanobis' distance identifies observations
which lie far away from the centre of the data
cloud, giving less weight to variables with large
variances or to groups of highly correlated
variables (Jolliffe, 1986).
This distance is often preferred to the Euclidean
distance which ignores the covariance structure
and thus treats all variables equally.
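
The contrast with Euclidean distance can be sketched for two variables, where the covariance matrix can be inverted by hand. The data, the two probe points, and the function names are hypothetical; both probes are equally far from the center in Euclidean terms, but one lies along the correlation trend and the other across it.

```python
import math
from statistics import mean


def center_and_cov(points):
    """Sample mean and covariance terms (sxx, sxy, syy) of 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(point, center, cov):
    """sqrt(d^T S^-1 d), using the closed-form inverse of a 2x2 matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


# Hypothetical, strongly correlated data scattered around the line y = x.
points = [(float(i), i + (0.5 if i % 2 else -0.5)) for i in range(10)]
center, cov = center_and_cov(points)

along = (center[0] + 3, center[1] + 3)   # follows the trend
across = (center[0] + 3, center[1] - 3)  # cuts across the trend
d_along = mahalanobis_2d(along, center, cov)
d_across = mahalanobis_2d(across, center, cov)
print(round(d_along, 2), round(d_across, 2))
```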

19
A Circle Becomes an Ellipse Based on the
Mahalanobis Distance
http://www.famsi.org/reports/98061/images/fig18.gif
20
A Test Statistic for the Mahalanobis Distance
21
Principal Components

Huber (1985) cites two main reasons why principal
components are interesting projections:
First, in the case of clustered data, the leading
principal axes pick projections with good
separations.
Second, the leading principal components
collect the systematic structure of the data.
Thus, the first principal component reflects the
first major linear trend, the second principal
component the second major linear trend, etc.
So, if an observation is located far away from
any of the major linear trends, it can be
considered an outlier.
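
For two variables, the idea of measuring distance from the first major linear trend can be sketched without a linear algebra library, since the leading eigenvector of a 2x2 covariance matrix has a closed-form angle. The data and function names are hypothetical, and the outlier is deliberately left in the fit.

```python
import math
from statistics import mean


def first_pc(points):
    """Center and unit direction of the first principal axis of 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Angle of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def distance_from_axis(point, center, direction):
    """Length of the component of (point - center) orthogonal to the axis."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return abs(dx * direction[1] - dy * direction[0])


# Hypothetical data on the line y = 2x, plus one point off the trend.
points = [(float(x), 2.0 * x) for x in range(10)] + [(8.0, 2.0)]
center, direction = first_pc(points)
residuals = [distance_from_axis(p, center, direction) for p in points]
furthest = max(range(len(points)), key=lambda i: residuals[i])
print(furthest, round(residuals[furthest], 2))
```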

22
Clustering and Outlier Detection

Cluster Analysis can be used for outlier
detection.
Outliers may emerge as singletons or as small
clusters far removed from the others.
To do outlier detection at the same time as
clustering the main body of the data, use enough
clusters to represent both the main body of the
data and the outliers.
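
One way to see singletons emerge is to link every pair of points closer than a chosen distance and inspect the resulting group sizes, which is equivalent to cutting a single-linkage dendrogram at that height. The points, the link distance of 1.5, and the function names are assumptions made for this sketch; the slides do not prescribe a particular clustering method.

```python
import math
from collections import Counter


def single_linkage_labels(points, link_distance):
    """Label points by connected component under the link distance."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= link_distance:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


def small_cluster_members(labels, max_size=1):
    """Indices of points that sit in clusters of at most max_size points."""
    sizes = Counter(labels)
    return [i for i, lab in enumerate(labels) if sizes[lab] <= max_size]


# Hypothetical data: a tight 5 x 2 grid plus two isolated points.
points = [(float(x), float(y)) for x in range(5) for y in range(2)]
points += [(10.0, 10.0), (20.0, -5.0)]
labels = single_linkage_labels(points, link_distance=1.5)
print(small_cluster_members(labels))
```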

23
Fisher Iris Data

150 Cases
5 variables
Sepal length
Sepal width
Petal length
Petal width
Species (3 types)

24
Iris data
Classic Dendrogram
Classic Data Image
25
Line Example
Which of these are outliers?
26
Data Image of the Interpoint Distance Matrix of
the Line Example
Figure panels: Euclidean distance; Mahalanobis
distance. Annotations: both outliers; triangle
outlier.
Outliers manifest themselves as "V" or plus-sign
structures in the data image.
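
A rough text-mode impression of a data image can be produced by shading each entry of the interpoint distance matrix. This sketch omits any reordering of observations, uses an arbitrary five-level character ramp, and works on hypothetical points on a line with one distant observation placed in the middle of the ordering; none of these choices come from the slides.

```python
import math


def distance_matrix(points):
    """Euclidean interpoint distance matrix as a list of rows."""
    return [[math.dist(p, q) for q in points] for p in points]


def render(matrix, shades=" .:*#"):
    """Map each distance to a character; darker means further apart."""
    top = max(max(row) for row in matrix) or 1.0
    levels = len(shades)
    lines = []
    for row in matrix:
        lines.append("".join(
            shades[min(int(v / top * levels), levels - 1)] for v in row))
    return "\n".join(lines)


# Hypothetical line of points with one far-away observation at index 4.
points = [(float(i), float(i)) for i in range(4)]
points += [(3.5, 40.0)]
points += [(float(i), float(i)) for i in range(4, 8)]
image = render(distance_matrix(points))
print(image)
```

The distant observation produces a dark row and a dark column that cross at its own position, giving the plus-sign pattern described above.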
27
Body Weight Brain Weight Data
Data Image shows outliers and subclusters of the
outliers
The outliers are numbered sequentially and
correspond to brachiosaurus, diplodocus,
triceratops, Asian elephant, and African elephant.
28
Stackloss Dataset
Rousseeuw and Leroy (1987) report observations 1,
3, 4, 21, and possibly 2 as outliers.
Figure annotations: 4, 21; 1, 2, 3.
Outliers are labeled as triangles.
29
Data Image for the Mahalanobis Distance
Presence of outliers is not clearly discernible.
30
Data Image for the Mahalanobis Distance Where the
Covariance in the Mahalanobis Distance
Calculation is Constructed Using Observations 4-21
Figure annotations: 21; 1, 2, 3.
31
An Artificial Dataset from Rousseeuw and Leroy
(1987)
The cluster structure of the outliers is revealed
in the data image.
32
A Particularly Onerous Elliptical Dataset
First suggested by Dan Carr.
33
Euclidean and Mahalanobis Data Images of the
Ellipse Data
34
Pairs Plot and Data Image for 5 Dimensional
Sphere Case
35
Artificial Nose Dataset

Fiber optic artificial olfactory system
19 fibers x 2 wavelengths x 60 times/inhalation
= 2280
Each data point resides in $\mathbb{R}^{2280}$

36
Artificial Nose Data Image of TCE Present
Chloroform Observations
37
References

Afifi, A.A., and Azen, S.P. (1972), Statistical
Analysis: A Computer Oriented Approach, Academic
Press, New York.
Barnett, V., and Lewis, T. (1994), Outliers in
Statistical Data, Wiley, New York.
Huber, P.J. (1985), Projection pursuit, The
Annals of Statistics, 13(2), 435-475.
Marchette, D.J., and Solka, J.L. (2003), Using
data images for outlier detection, Computational
Statistics & Data Analysis, 43(4), 541-552.
Jolliffe, I.T. (1986), Principal Component
Analysis, Springer-Verlag, New York.
Rousseeuw, P.J., and Leroy, A.M. (1987), Robust
Regression and Outlier Detection, Wiley Series in
Probability and Statistics, Wiley-Interscience
(reprinted September 19, 2003).
