Title: BINF 733 Spring 2005 Statistical Methods of Outlier Detection
1BINF 733 Spring 2005 Statistical Methods of
Outlier Detection
- Jeff Solka Ph.D.
- Jennifer Weller Ph.D.
2Sir Francis Bacon Novum Organum 1620
- For he that knows the ways of nature will more
easily observe her deviations and on the other
hand he that knows her deviations will more
accurately describe her ways.
3Sir Francis Bacon Revisited
- To identify outliers we need some sort of model
to start with. - We can do a better job at identifying our model
if we first remove the outliers. - The process of outlier identification/model
building is an iterative process.
4What is an Outlier?
- Given a set of observations X an outlier is an
observations that is an element of this set but
which is inconsistent with the majority of the
data.
http//www.ncl.ac.uk/cpact/demo_outlier.jpg
5Manifestation of Outliers in Gene Expression Data
- Given a set of replicate arrays the replicates
can be used to identify an aberrant spot. - Xgi transformed and normalized spot intensity
measurements for the gth gene on the ith array - An outlier is an observation Xgi that is markedly
different from his fellow observations
6Nonresistent Rules for Outlier Identification
7The z-score Rule Grubbs Test
- The z-score rule (Grubbs test). Calculate a
z-score zgi for every observation - Where and sg are the mean and standard
deviation of the gth gene. Call Xgj an outlier is
zgj is larger say greater than five
8The CV Rule
- The CV Rule Call the furthest observation Xgi
from the mean, , and outlier if the coefficient
of variation CVg exceeds some prespecified
cutoff.
9Problems With the z-score and CV Methods of
Outlier Detection
- They are both based on measures that are heavily
influenced by outliers, the mean and the standard
deviation. - Masking An outlier remains undetected because
it is hidden by its own influence on the
methodologies parameters or else by another
adjacent outlier. - Swamping A normal observation is classified as
an outlier due to the presence of an unrelated
outlier or outliers.
10Resistant Rules for Outlier Detection
11One Approach to Crafting Resistant Rules for
Outlier Detection
- Based on outlier resistant statistical measures
- Median
- Median absolute deviation from the median
12The Resistant z-score Rule
- The resistant z-score rule. Calculate a resistant
z-score, zgi for every observation using -
- and are the median and MAD of the gth gene.
Call Xgi and outlier if zgi is large, say,
greater than five.
13Problem of Too Few Replicates
- Microarray experiments usually have little
replication - Median and MAD are not dependable estimates of
the location and scale of the data
14A Strategy for the Problem of Too Few Replicates
- I
- With microarray data there is a relationship
between the median and MAD across all of the
genes - Assume this relationship is a true relationship
- s2g f(mg)
- Use this to compute a smoothed version of MAD,
, that will be more stable as it boorows
strength from similarly expressing genes
15A Strategy for the Problem of Too Few Replicates
- II
Run a smoother such as a smoothing spline through
the relationship of ADgi versus Use the
fitted value, , as an estimator for the
gth gene.
16A Strategy for the Problem of Too Few Replicates
- III
- The revised z-score rule
- Call Xgi an outlier if the computed score is
large say greater than five
17Mahalanobis Distance for Outlier Detection
18Advantages of the Mahalanobis Distance Approach
- Mahalanobis' distance identifies observations
which lie far away from the centre of the data
cloud, giving less weight to variables with large
variances or to groups of highly correlated
variables (Joliffe, 1986). - This distance is often preferred to the Euclidean
distance which ignores the covariance structure
and thus treats all variables equally.
19A Circle Becomes an Ellipse Based on the
Mahalanobis Distance
http//www.famsi.org/reports/98061/images/fig18.gi
f
20A Test Statistic for the Mahalanobis Distance
21Principal Components
- Huber (1985) cites two main reasons why principal
components are interesting projections - first, in the case of clustered data, the leading
principal axes pick projections with good
separations - secondly, the leading principal components
collect the systematic structure of the data. - Thus, the first principal component reflects the
first major linear trend, the second principal
component, the second major linear trend, etc. - So, if an observation is located far away from
any of the major linear trends it can be
considered an outlier.
22Clustering and Outlier Detection
- Cluster Analysis can be used for outlier
detection. -
- Outliers may emerge as singletons or as small
clusters far removed from the others. - To do outlier detection at the same time as
clustering the main body of the data, use enough
clusters to represent both the main body of the
data and the outliers.
23Fisher Iris Data
- 150 Cases
- 5 variables
- Sepal length
- Sepal width
- Petal length
- Petal width
- Species (3 types)
24Iris data
Classic Dendrogram
Classic Data Image
25Line Example
Which of these are outliers?
26Data Image of the Interpoint Distance Matrix of
the Line Example
Both outliers
Euclidean Distance
Mahalanobis Distance
Triangle outlier
Outliers manifest themselves as vs or plus sign
structures in the data image
27Body Weight Brain Weight Data
Data Image shows outliers and subclusters of the
outliers
The outliers are number sequentially and
correspond to brachiosaurus, diplodocus,
triceratops, Asian elephant, and Africa elephant.
28Stackloss Dataset
Rousseeuw and Leroy 1987 report 1, 3, 4, 21 and
maybe 2 as outliers.
4, 21
1, 2, 3
Outliers have been labeled as triangles.
29Data Image for the Mahalanobis Distance
Presence of outliers is not clearly discernible.
30Data Image for the Mahalanobis Distance Where the
Covariance in the Mahalanobis Distance
Calculation is Constructed Using Observations 4 -
21
21
1, 2, 3
31An Artificial Dataset from Rousseeuw and Leroy
1987
Cluster structures of the outliers revealed in
the data image.
32A Particularly Onerous Elliptical Dataset
First suggested by Dan Carr.
33Euclidean and Mahalanobis Data Images of the
Ellipse Data
34Pairs Plot and Data Image for 5 Dimensional
Sphere Case
35Artificial Nose Dataset
- Fiber optic artificial olfactory system
- 19 fibers x 2 wavelengths 60 times/inhalation
2280 - Each data point resides in R2280
36Artificial Nose Data Image of TCE Present
Chloroform Observations
37References
- Afifi, A.A., and Azen, S.P. (1972), Statistical
analysis a computer oriented approach, Academic
Press, New York. - Barnett, V. and T. Lewis (1994) Outliers in
Statistical Data. New Your Wiley - Huber, P.J. (1985), Projection pursuit, The
Annals of Statistics, 13(2), 435-475. - David J. Marchette and Jeffrey L. Solka Using
data images for outlier detection Computational
Statistics Data Analysis, Volume 43, Issue 4,
28 August 2003, Pages 541-552 - Joliffe, I.T. (1986) Principal Component
Analysis, Springer-Verlag, New York. - Robust Regression and Outlier Detection (Wiley
Series in Probability and Statistics) by Peter J.
Rousseeuw, Annick M. Leroy , Wiley-Interscience
(September 19, 2003)