Title: BINF636 Clustering and Classification
1 BINF636 Clustering and Classification
- Jeff Solka, Ph.D.
- Fall 2008
2Gene Expression Data
- The data form a genes-by-samples matrix: rows are genes, columns are samples.
- x_gi is the expression measure for gene g in sample i.
3The Pervasive Notion of Distance
- We have to be able to measure similarity or dissimilarity in order to perform clustering, dimensionality reduction, visualization, and discriminant analysis.
- How we measure distance can have a profound effect on the performance of these algorithms.
4Distance Measures and Clustering
- Most of the common clustering methods, such as k-means, partitioning around medoids (PAM), and hierarchical clustering, depend on the calculation of a distance or an interpoint distance matrix.
- Some clustering methods, such as those based on spectral decomposition, have a less clear dependence on the distance measure.
5Distance Measures and Discriminant Analysis
- Many supervised learning procedures (a.k.a. discriminant analysis procedures) also depend on the concept of a distance:
- nearest neighbors
- k-nearest neighbors
- mixture models
9Two Main Classes of Distances
- Consider two gene expression profiles measured across I samples. Each of these can be considered as a point in R^I, and we can calculate the distance between these two points.
- Alternatively, we can view the gene expression profiles as manifestations of samples from two different probability distributions.
11A General Framework for Distances Between Points
- Consider two m-vectors x = (x1, ..., xm) and y = (y1, ..., ym). Define a generalized distance of the form given below.
- We call this a pairwise distance function because the pairing of features within observations is preserved.
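- The formula itself was an image and did not survive transcription; a hedged reconstruction of the usual pairwise-distance framework is
  $d(x, y) = F\big(d_1(x_1, y_1), \ldots, d_m(x_m, y_m)\big)$,
  where each componentwise dissimilarity d_i compares the paired features x_i and y_i, and F combines these contributions (for example, a sum).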
12Minkowski Metric
- Special case of our generalized metric
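- The metric itself (not transcribed from the slide) is, for a power $\lambda \ge 1$,
  $d(x, y) = \Big( \sum_{i=1}^{m} |x_i - y_i|^{\lambda} \Big)^{1/\lambda}$.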
13Euclidean and Manhattan Metric
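- The two standard special cases, reconstructed here since the slide formulas were images:
  Euclidean ($\lambda = 2$): $d_E(x, y) = \sqrt{ \sum_{i=1}^{m} (x_i - y_i)^2 }$
  Manhattan ($\lambda = 1$): $d_M(x, y) = \sum_{i=1}^{m} |x_i - y_i|$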
14Correlation-based Distance Measures
- Championed for use within the microarray literature by Eisen.
- Types:
- Pearson's sample correlation distance
- Eisen's cosine correlation distance
- Spearman sample correlation distance
- Kendall's τ sample correlation distance
15Pearson Sample Correlation Distance (COR)
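- Reconstructed formula (the slide equation was an image):
  $d_{COR}(x, y) = 1 - r(x, y)$, with $r(x, y) = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$.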
16Eisen Cosine Correlation Distance (EISEN)
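- A hedged reconstruction: Eisen's distance is the Pearson distance with the sample means replaced by zero (an uncentered, cosine-type correlation),
  $d_{EISEN}(x, y) = 1 - \dfrac{\big|\sum_i x_i y_i\big|}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$.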
17Spearman Sample Correlation Distance (SPEAR)
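- Reconstructed: the Spearman distance is the Pearson correlation distance computed on the ranks of the expression values,
  $d_{SPEAR}(x, y) = 1 - r\big(\mathrm{rank}(x), \mathrm{rank}(y)\big)$.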
18Kendall's τ Sample Correlation (TAU)
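- Reconstructed: with C_xy the number of concordant and D_xy the number of discordant feature pairs among the m(m-1)/2 possible pairs,
  $d_{TAU}(x, y) = 1 - \tau(x, y)$, where $\tau(x, y) = \dfrac{C_{xy} - D_{xy}}{m(m-1)/2}$.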
19Some Observations - I
- Since we are subtracting the correlation measures from 1, profiles that are perfectly positively correlated (correlation of 1) will have a distance close to 0, and profiles that are perfectly negatively correlated (correlation of -1) will have a distance close to 2.
- Correlation measures in general are invariant to location and scale transformations and tend to group together genes whose expression values are linearly related.
20Some Observations - II
- The parametric methods (COR and EISEN) tend to be more negatively affected by the presence of outliers than the non-parametric methods (SPEAR and TAU).
- Under the assumption that we have standardized the data so that x and y are m-vectors with zero mean and unit length, there is a simple relationship between the Pearson correlation coefficient r(x,y) and the Euclidean distance, given below.
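- The relationship (the slide equation was not transcribed): under this standardization,
  $d_E(x, y)^2 = \sum_i (x_i - y_i)^2 = 2\,\big(1 - r(x, y)\big)$,
  so ranking pairs by Euclidean distance or by correlation distance gives the same ordering.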
21Mahalanobis Distance
- This allows the directional variability of the data to come into play when calculating distances.
- How do we estimate the covariance matrix S?
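- The distance itself, reconstructed: with S an estimate of the covariance matrix of the data,
  $d_{Mah}(x, y) = \sqrt{ (x - y)^{T} S^{-1} (x - y) }$.
  A common choice for S is the (pooled) sample covariance matrix, possibly regularized when the dimension is large relative to the number of observations.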
22Distances and Transformations
- Assume that g is an invertible possible
non-linear transformation g x ? x - This transformation induces a new metric d via
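- The induced metric, reconstructed from the definition:
  $d_g(x, y) = d\big(g(x), g(y)\big)$.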
23Distances and Scales
- Original scanned fluorescence intensities
- Logarithmically transformed data
- Data transformed by the generalized logarithm (glog)
24Experiment-specific Distances Between Genes
- One might like to use additional experimental design information in determining how one calculates distances between the genes.
- One might wish to use smoothed estimates or other statistical fits and measure distances between these.
- In time course data, distances that honor the time order of the data are appropriate.
25Standardizing Genes
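- A hedged sketch of the usual standardization (the slide content was an image): for gene g across samples i = 1, ..., I,
  $\tilde{x}_{gi} = \dfrac{x_{gi} - \bar{x}_{g\cdot}}{s_{g\cdot}}$,
  where $\bar{x}_{g\cdot}$ and $s_{g\cdot}$ are the mean and standard deviation of gene g across the samples; the analogous operation standardizes each array (sample) across the genes.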
26Standardizing Arrays (Samples)
27Scaling and Its Implication to Data Analysis - I
- Types of gene expression data:
- Relative (cDNA)
- Absolute (Affymetrix)
- x_gi is the expression of gene g in sample i, measured on a log scale.
- Let y_gi = x_gi - x_gA, where patient A is our reference.
- The distance between patient samples is then as given below.
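- The missing formula can be reconstructed: for two patient samples j and k,
  $\sum_g (y_{gj} - y_{gk})^2 = \sum_g \big( (x_{gj} - x_{gA}) - (x_{gk} - x_{gA}) \big)^2 = \sum_g (x_{gj} - x_{gk})^2$,
  so the reference sample A cancels and Minkowski-type distances between samples are the same on the relative and absolute scales.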
28Scaling and Its Implication to Data Analysis - II
29Scaling and Its Implication to Data Analysis - III
The distance between two genes is given by the expression below.
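- A hedged reconstruction of the missing expression: for two genes g and h on the relative scale,
  $\sum_i (y_{gi} - y_{hi})^2 = \sum_i \big( (x_{gi} - x_{hi}) - (x_{gA} - x_{hA}) \big)^2$,
  which does not reduce to $\sum_i (x_{gi} - x_{hi})^2$ in general; the reference sample does not cancel, so gene-gene Minkowski distances differ between relative and absolute measures.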
30Summary of Effects of Scaling on Distance Measures
- Minkowski distances:
- The distance between samples is the same for relative and absolute measures.
- The distance between genes is not the same for relative and absolute measures.
- Pearson correlation-based distance:
- The distance between genes is the same for relative and absolute measures.
- The distance between samples is not the same for relative and absolute measures.
31What is Cluster Analysis?
- Given a collection of n objects, each of which is described by a set of p characteristics or variables, derive a useful division into a number of classes.
- Both the number of classes and the properties of the classes are to be determined.
- (Everitt 1993)
32Why Do This?
- Organize
- Prediction
- Etiology (Causes)
33How Do We Measure Quality?
- The same objects admit multiple clusterings:
- Male, Female
- Low, Middle, Upper Income
- A clustering is neither true nor false
- It is measured by its utility
34Difficulties In Clustering
- Cluster structure may be manifest in a multitude of ways.
- Large data sets and high dimensionality complicate matters.
35Clustering Prerequisites
- A method to measure the distance between observations and clusters
- Similarity
- Dissimilarity
- This was discussed previously
- A method of normalizing the data
- We discussed this previously
- A method of reducing the dimensionality of the data
- We discussed this previously
36The Number of Groups Problem
- How Do We Decide on the Appropriate Number of Clusters?
- Duda, Hart and Stork (2001):
- Form Je(2)/Je(1), where Je(M) is the sum-of-squared-error criterion for the M-cluster model. The distribution of this ratio is usually not known. (See the R sketch below.)
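- A minimal R sketch of this ratio; the data matrix x is simulated for illustration and is not the course dataset.
  set.seed(1)
  x <- matrix(rnorm(200), ncol = 2)
  Je1 <- sum(scale(x, scale = FALSE)^2)         # Je(1): total sum of squares about the grand mean
  Je2 <- sum(kmeans(x, centers = 2)$withinss)   # Je(2): within-cluster sum of squares for 2 groups
  Je2 / Je1                                     # small values suggest genuine 2-group structure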
37Optimization Methods
- Minimizing or Maximizing Some Criteria
- Does Not Necessarily Form Hierarchical Clusters
38Clustering Criteria
The Sum of Squared Error Criterion
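- Reconstructed criterion (the slide equation was an image): for clusters D_1, ..., D_c with means m_1, ..., m_c,
  $J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2$.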
39Spoofing of the Sum of Squares Error Criterion
40Related Criteria
- With a little manipulation we obtain the form below.
- Instead of using the average squared distance between points in a cluster, as indicated above, we could use the median or maximum distance.
- Each of these will produce its own variant criterion.
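- A hedged reconstruction of the manipulated form (following Duda, Hart and Stork):
  $J_e = \tfrac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i$, with $\bar{s}_i = \dfrac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} \lVert x - x' \rVert^2$
  the average squared distance between points in cluster i; replacing this average by a median or maximum gives the variant criteria mentioned above.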
41Scatter Criteria
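- Reconstructed definitions (the slide content was an image): with cluster means m_i, overall mean m, and cluster sizes n_i,
  $S_W = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^{T}$ (within-cluster scatter),
  $S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^{T}$ (between-cluster scatter),
  $S_T = \sum_{x} (x - m)(x - m)^{T}$ (total scatter).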
42Relationship of the Scattering Criteria
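- The relationship, reconstructed: $S_T = S_W + S_B$. Since $S_T$ does not depend on the partition, making $S_W$ small is equivalent to making $S_B$ large.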
43Measuring the Size of Matrices
- So we wish to minimize SW while maximizing SB.
- We will measure the size of a matrix by using its trace or determinant.
- These are equivalent in the case of univariate data.
44Interpreting the Trace Criteria
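- Reconstructed interpretation: $\mathrm{tr}\, S_W = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2 = J_e$, so minimizing the trace criterion is exactly the sum-of-squared-error criterion; and because $\mathrm{tr}\, S_T = \mathrm{tr}\, S_W + \mathrm{tr}\, S_B$ is fixed, this simultaneously maximizes $\mathrm{tr}\, S_B$.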
45The Determinant Criteria
- SB will be singular if the number of clusters is less than or equal to the dimensionality.
- Partitions based on Je may change under linear transformations of the data.
- This is not the case with Jd.
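- The criterion itself, reconstructed: $J_d = \lvert S_W \rvert$, to be minimized over partitions (S_W is used rather than S_B precisely because S_B can be singular).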
46Other Invariant Criteria
- It can be shown that the eigenvalues of SW^-1 SB are invariant under nonsingular linear transformations.
- We might choose to maximize the criterion below.
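- A hedged reconstruction of one such criterion: with $\lambda_1, \ldots, \lambda_d$ the eigenvalues of $S_W^{-1} S_B$, maximize
  $\mathrm{tr}\big(S_W^{-1} S_B\big) = \sum_{i=1}^{d} \lambda_i$.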
47k-means Clustering
- Begin: initialize n, k, m1, m2, ..., mk
- Do: classify the n samples according to the nearest mi
- Recompute each mi
- Until: no change in any mi
- Return m1, m2, ..., mk
- End
- The complexity of the algorithm is O(ndkT)
- T is the number of iterations
- T is typically << n
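- A minimal R sketch of this algorithm using the built-in kmeans() function; the two-group data are simulated for illustration and are not the course dataset.
  set.seed(2)
  x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2))
  km <- kmeans(x, centers = 2, nstart = 10)  # nstart restarts guard against poor initial means
  km$centers                                 # the final means m1, ..., mk
  table(km$cluster)                          # cluster sizes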
48Example Mean Trajectories
49Optimizing the Clustering Criterion
- N(n,g): the number of partitions of n individuals into g groups
- N(15,3) = 2,375,101
- N(20,4) = 45,232,115,901
- N(25,8) = 690,223,721,118,368,580
- N(100,5) ≈ 10^68
- Note that 3.15 x 10^17 seconds is the estimated age of the universe.
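- For reference, N(n,g) is a Stirling number of the second kind, which can be written
  $N(n, g) = \dfrac{1}{g!} \sum_{k=0}^{g} (-1)^{k} \binom{g}{k} (g - k)^{n}$,
  which is why exhaustive search over all partitions is hopeless even for modest n.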
50Hill Climbing Algorithms
- 1 - Form an initial partition into the required number of groups.
- 2 - Calculate the change in the clustering criterion produced by moving each individual from its own cluster to another cluster.
- 3 - Make the change that leads to the greatest improvement in the value of the clustering criterion.
- 4 - Repeat steps (2) and (3) until no move of a single individual causes the clustering criterion to improve.
- This guarantees a local, not global, optimum.
51How Do We Choose c
- Randomly classify points to generate the mi's
- Randomly generate the mi's
- Base the location of the c-cluster solution on the (c-1)-cluster solution
- Base the location of the c-cluster solution on a hierarchical solution
52Alternative Methods
- Simulated Annealing
- Genetic Algorithms
- Quantum Computing
53Hierarchical Cluster Analysis
- 1 Cluster to n Clusters
- Agglomerative Methods
- Fusion of n Data Points into Groups
- Divisive Methods
- Separate the n Data Points Into Finer Groupings
54Dendrograms
- [Figure: a dendrogram for five objects. Read left to right (agglomerative, stages 0-4), the singletons (1)-(5) merge as (1,2), (4,5), (3,4,5), and finally (1,2,3,4,5); read right to left (stages 4-0) for the divisive direction.]
55Agglomerative Algorithm(Bottom Up or Clumping)
- Start: clusters C1, C2, ..., Cn, each containing a single data point
- 1 - Find the nearest pair of clusters Ci, Cj; merge Ci and Cj, delete Cj, and decrement the cluster count by 1
- 2 - If the number of clusters is greater than 1, go back to step 1
56Inter-cluster Dissimilarity Choices
- Furthest Neighbor (Complete Linkage)
- Nearest Neighbor (Single Linkage)
- Group Average
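- A minimal R sketch of these three choices via hclust(); the built-in USArrests data are used only for illustration.
  d <- dist(USArrests)                           # Euclidean interpoint distance matrix
  hc.single   <- hclust(d, method = "single")    # nearest neighbor
  hc.complete <- hclust(d, method = "complete")  # furthest neighbor
  hc.average  <- hclust(d, method = "average")   # group average
  plot(hc.complete, main = "Complete linkage")   # draw the dendrogram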
57Single Linkage (Nearest Neighbor) Clustering
- Distance Between Groups is Defined as That of the Closest Pair of Individuals, Where We Consider 1 Individual From Each Group
- This method may be adequate when the clusters are fairly well separated Gaussians, but it is subject to problems with chaining.
58Example of Single Linkage Clustering
- 1 2 3 4 5
- 1 0.0
- 2 2.0 0.0
- 3 6.0 5.0 0.0
- 4 10.0 9.0 4.0 0.0
- 5 9.0 8.0 5.0 3.0 0.0
- (1 2) 3 4 5
- (1 2) 0.0
- 3 5.0 0.0
- 4 9.0 4.0 0.0
- 5 8.0 5.0 3.0 0.0
59Complete Linkage Clustering (Furthest Neighbor)
- Distance Between Groups is Defined as That of the Most Distant Pair of Individuals
60Complete Linkage Example
- 1 2 3 4 5
- 1 0.0
- 2 2.0 0.0
- 3 6.0 5.0 0.0
- 4 10.0 9.0 4.0 0.0
- 5 9.0 8.0 5.0 3.0 0.0
- (1,2) is the first cluster formed.
- d(12)3 = max{d13, d23} = d13 = 6.0
- d(12)4 = max{d14, d24} = d14 = 10.0
- d(12)5 = max{d15, d25} = d15 = 9.0
- The smallest entry in the updated distance matrix is d45 = 3.0, so the next merge joins the clusters (4) and (5).
61Group Average Clustering
- The distance between clusters is the average of the distances between all pairs of individuals, one taken from each of the 2 groups.
- A compromise between single linkage and complete linkage.
62Centroid Clusters
- We use the centroid of a group once it is formed.
63Problems With Hierarchical Clustering
- Well, it really gives us a continuum of different clusterings of the data.
- As stated previously, there are specific artifacts associated with the various methods.
64Dendrogram
65Data Color Histogram or Data Image
Orderings of the data matrix were first discussed by Bertin. Wegman coined the term "data color histogram" in 1990. Mike Minnotte and Webster West subsequently coined the phrase "data image" in 1998.
66Data Image Reveals Obfuscated Cluster Structure
[Figure panels: a subset of the pairs plot; the data image sorted on observations; the data image sorted on observations and features.]
90 observations in R^100 drawn from a standard normal distribution. The first and second groups of 30 rows were shifted by 20 in their first and second dimensions, respectively. This data matrix was then multiplied by a 100 x 100 matrix of Gaussian noise.
67The Data Image in the Gene Expression Community
68Example Dataset
69Complete Linkage Clustering
70Single Linkage Clustering
71Average Linkage Clustering
72Pruning Our Tree
- cutree(tree, k = NULL, h = NULL)
- tree: a tree as produced by hclust. cutree() only expects a list with components merge, height, and labels, of appropriate content each.
- k: an integer scalar or vector with the desired number of groups.
- h: a numeric scalar or vector with the heights where the tree should be cut.
- At least one of k or h must be specified; k overrides h if both are given.
- Values returned:
- cutree returns a vector with group memberships if k or h are scalar; otherwise a matrix with group memberships is returned, where each column corresponds to the elements of k or h, respectively (which are also used as column names).
73Example Pruning
- > x.cl2 <- cutree(hclust(x.dist), k = 2)
- > x.cl2[1:10]
- [1] 1 1 1 1 1 1 1 1 1 1
- > x.cl2[190:200]
- [1] 2 2 2 2 2 2 2 2 2 2 2
74Identifying the Number of Clusters
- As indicated previously, we really have no way of identifying the true cluster structure unless we have divine intervention.
- In the next several slides we present some well-known methods.
75Method of Mojena
- Select the number of groups based on the first stage of the dendrogram that satisfies the criterion below.
- The a0, a1, a2, ..., an-1 are the fusion levels corresponding to stages with n, n-1, ..., 1 clusters. The mean and unbiased standard deviation of these fusion levels enter the criterion, and k is a constant.
- Mojena (1977): 2.75 < k < 3.5
- Milligan and Cooper (1985): k = 1.25
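- A hedged reconstruction of the criterion: cut at the first stage j whose fusion level satisfies
  $a_{j+1} > \bar{a} + k\, s_a$,
  where $\bar{a}$ and $s_a$ are the mean and unbiased standard deviation of the fusion levels and k is the constant above.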
76Hartigans k-means theory
- When deciding on the number of clusters, Hartigan (1975, pp. 90-91) suggests the following rough rule of thumb. If k is the result of kmeans with k groups and kplus1 is the result with k+1 groups, then it is justifiable to add the extra group when
- (sum(k$withinss)/sum(kplus1$withinss) - 1) * (nrow(x) - k - 1)
- is greater than 10.
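- A minimal R sketch of this rule of thumb; x is a simulated numeric data matrix, not the course dataset.
  hartigan <- function(x, k) {
    fit.k     <- kmeans(x, centers = k,     nstart = 10)
    fit.kplus <- kmeans(x, centers = k + 1, nstart = 10)
    (sum(fit.k$withinss) / sum(fit.kplus$withinss) - 1) * (nrow(x) - k - 1)
  }
  set.seed(3)
  x <- matrix(rnorm(200), ncol = 2)
  hartigan(x, 2)   # values greater than 10 would justify moving from 2 to 3 groups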
77kmeans Applied to our Data Set
78The 3 term kmeans solution
79The 4 term kmeans Solution
80Determination of the Number of Clusters Using the Hartigan Criterion
81MIXTURE-BASED CLUSTERING
82HOW DO WE CHOOSE g?
- Human Intervention
- Divine Intervention
- Likelihood Ratio Test Statistic
- Wolfe's Method
- Bootstrap
- AIC, BIC, MDL
- Adaptive Mixtures Based Methods
- Pruning
- SHIP (AKMM)
83Akaike's Information criteria (AIC)
- AIC(g) = -2 log L(ĝ) + 2N(g), where N(g) is the number of free parameters in the model of size g.
- We Choose g In Order to Minimize the AIC Criterion
- This Criterion is Subject to the Same Regularity Conditions as the -2 log λ Statistic
84MIXTURE VISUALIZATION 2-d
85MODEL-BASED CLUSTERING
- This technique takes a density function approach.
- It uses finite mixture densities as models for cluster analysis.
- Each component density characterizes a cluster.
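- The model, sketched: each observation is assumed to arise from a finite mixture density
  $f(x) = \sum_{i=1}^{g} \pi_i\, f_i(x; \theta_i)$, with $\pi_i \ge 0$ and $\sum_i \pi_i = 1$,
  typically with Gaussian components; an observation is assigned to the component with the largest posterior probability.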
86Minimal Spanning Tree-Based Clustering
Diansheng Guo, Donna Peuquet, and Mark Gahegan (2002), "Opening the black box: interactive hierarchical clustering for multivariate spatial patterns," Proceedings of the Tenth ACM International Symposium on Advances in Geographic Information Systems, McLean, Virginia, USA.
87What is Pattern Recognition?
- From Devroye, Györfi and Lugosi:
- Pattern recognition or discrimination is about guessing or predicting the unknown nature of an observation, a discrete quantity such as black or white, one or zero, sick or healthy, real or fake.
- From Duda, Hart and Stork:
- The act of taking in raw data and taking an action based on the category of the pattern.
88Isn't This Just Statistics?
- Short answer: yes.
- Breiman (Statistical Science, 2001) suggests there are two cultures within statistical modeling: stochastic modelers and algorithmic modelers.
89Algorithmic Modeling
- Pattern recognition (classification) is concerned with predicting the class membership of an observation.
- This can be done from the perspective of (traditional statistical) data models.
- Often the data are high dimensional, complex, and of unknown distributional origin.
- Thus, pattern recognition often falls into the algorithmic modeling camp.
- The measure of performance is whether the classifier accurately predicts the class, not how well it models the distribution.
- Empirical evaluations are often more compelling than asymptotic theorems.
90Pattern Recognition Flowchart
91Pattern Recognition Concerns
- Feature extraction and distance calculation.
- Development of automated algorithms for classification.
- Classifier performance evaluation.
- Latent or hidden class discovery based on extracted feature analysis.
- Theoretical considerations.
92Linear and Quadratic Discriminant Analysis in Action
93Nearest Neighbor Classifier
94SVM Training Cartoon
95CART Analysis of the Fisher Iris Data
96Random Forests
- Create a large number of trees based on random samples of our dataset.
- Use a bootstrap sample for each random sample.
- The variables used to create the splits are a random sub-sample of all of the features.
- All trees are grown fully.
- Majority vote over the trees determines the membership of a new observation.
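- A minimal R sketch, assuming the randomForest package is installed; the built-in iris data are used only for illustration.
  library(randomForest)
  set.seed(4)
  rf <- randomForest(Species ~ ., data = iris, ntree = 500)  # bootstrap samples, random feature subsets at each split
  rf$confusion                # out-of-bag confusion matrix
  predict(rf, iris[1:5, ])    # class membership by majority vote over the trees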
97Boosting and Bagging
98Boosting
99Evaluating Classifiers
100Resubstitution
101Cross Validation
102Leave-k-Out
103Cross-Validation Notes
104Test Set
105Some Classifier Results on the Golub ALL vs AML Dataset
106References - I
- Richard O. Duda, Peter E. Hart, and David G. Stork (2001), Pattern Classification, 2nd Edition.
- Eisen MB, Spellman PT, Brown PO, and Botstein D (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.
- Brian S. Everitt, Sabine Landau, and Morven Leese (2001), Cluster Analysis, 4th Edition, Arnold.
- Gasch AP and Eisen MB (2002). Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology 3(11), 1-22.
- Gad Getz, Erel Levine, and Eytan Domany (2000). Coupled two-way clustering analysis of gene microarray data. PNAS, vol. 97, no. 22, pp. 12079-12084.
- Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, and Brown P (2000). 'Gene Shaving' as a Method for Identifying Distinct Sets of Genes with Similar Expression Patterns. GenomeBiology.com 1,
107References - II
- A. K. Jain, M. N. Murty, and P. J. Flynn (1999). Data clustering: a review. ACM Computing Surveys (CSUR), Volume 31, Issue 3.
- John Quackenbush (2001). Computational analysis of microarray data. Nature Reviews Genetics, Volume 2, pp. 418-427.
- Ying Xu, Victor Olman, and Dong Xu (2002). Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18, 536-545.
108References - III
- Hastie, Tibshirani, and Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
- Devroye, Györfi, and Lugosi (1996). A Probabilistic Theory of Pattern Recognition.
- Ripley (1996). Pattern Recognition and Neural Networks.
- Fukunaga (1990). Introduction to Statistical Pattern Recognition.