Title: BINF636 Clustering and Classification
1 BINF636 Clustering and Classification
- Jeff Solka, Ph.D.
- Fall 2008
2Gene Expression Data
- The data form a genes-by-samples matrix: rows are genes, columns are samples.
- x_gi is the expression measure for gene g in sample i.
3The Pervasive Notion of Distance
- We have to be able to measure similarity or dissimilarity in order to perform clustering, dimensionality reduction, visualization, and discriminant analysis.
- How we measure distance can have a profound effect on the performance of these algorithms.
4Distance Measures and Clustering
- Most of the common clustering methods, such as k-means, partitioning around medoids (PAM), and hierarchical clustering, depend on the calculation of a distance or an interpoint distance matrix.
- Some clustering methods, such as those based on spectral decomposition, have a less clear dependence on the distance measure.
5Distance Measures and Discriminant Analysis
- Many supervised learning procedures (a.k.a. discriminant analysis procedures) also depend on the concept of a distance:
- nearest neighbors
- k-nearest neighbors
- mixture models
9Two Main Classes of Distances
- Consider two gene expression profiles measured across I samples. Each of these can be considered as a point in R^I, and we can calculate the distance between these two points.
- Alternatively, we can view the gene expression profiles as manifestations of samples from two different probability distributions.
11A General Framework for Distances Between Points
- Consider two m-vectors x = (x1, ..., xm) and y = (y1, ..., ym). Define a generalized distance of the form given below.
- We call this a pairwise distance function because the pairing of features within observations is preserved.
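- The formula itself was an image and did not survive transcription; a hedged reconstruction of the usual pairwise-distance framework is
  $d(x, y) = F\big(d_1(x_1, y_1), \ldots, d_m(x_m, y_m)\big)$,
  where each componentwise dissimilarity d_i compares the paired features x_i and y_i, and F combines these contributions (for example, a sum).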
12Minkowski Metric
- Special case of our generalized metric
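- The metric itself (not transcribed from the slide) is, for a power $\lambda \ge 1$,
  $d(x, y) = \Big( \sum_{i=1}^{m} |x_i - y_i|^{\lambda} \Big)^{1/\lambda}$.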
13Euclidean and Manhattan Metric
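- The two standard special cases, reconstructed here since the slide formulas were images:
  Euclidean ($\lambda = 2$): $d_E(x, y) = \sqrt{ \sum_{i=1}^{m} (x_i - y_i)^2 }$
  Manhattan ($\lambda = 1$): $d_M(x, y) = \sum_{i=1}^{m} |x_i - y_i|$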
14Correlation-based Distance Measures
- Championed for use within the microarray literature by Eisen.
- Types:
- Pearson's sample correlation distance
- Eisen's cosine correlation distance
- Spearman sample correlation distance
- Kendall's τ sample correlation distance
15Pearson Sample Correlation Distance (COR)
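- Reconstructed formula (the slide equation was an image):
  $d_{COR}(x, y) = 1 - r(x, y)$, with $r(x, y) = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$.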
16Eisen Cosine Correlation Distance (EISEN)
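- A hedged reconstruction: Eisen's distance is the Pearson distance with the sample means replaced by zero (an uncentered, cosine-type correlation),
  $d_{EISEN}(x, y) = 1 - \dfrac{\big|\sum_i x_i y_i\big|}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$.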
17Spearman Sample Correlation Distance (SPEAR)
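- Reconstructed: the Spearman distance is the Pearson correlation distance computed on the ranks of the expression values,
  $d_{SPEAR}(x, y) = 1 - r\big(\mathrm{rank}(x), \mathrm{rank}(y)\big)$.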
18Kendall's τ Sample Correlation (TAU)
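- Reconstructed: with C_xy the number of concordant and D_xy the number of discordant feature pairs among the m(m-1)/2 possible pairs,
  $d_{TAU}(x, y) = 1 - \tau(x, y)$, where $\tau(x, y) = \dfrac{C_{xy} - D_{xy}}{m(m-1)/2}$.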
19Some Observations - I
- Since we are subtracting the correlation measures from 1, profiles that are perfectly positively correlated (correlation of 1) will have a distance close to 0, and profiles that are perfectly negatively correlated (correlation of -1) will have a distance close to 2.
- Correlation measures in general are invariant to location and scale transformations and tend to group together genes whose expression values are linearly related.
20Some Observations - II
- The parametric methods (COR and EISEN) tend to be more negatively affected by the presence of outliers than the non-parametric methods (SPEAR and TAU).
- Under the assumption that we have standardized the data so that x and y are m-vectors with zero mean and unit length, there is a simple relationship between the Pearson correlation coefficient r(x,y) and the Euclidean distance, given below.
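- The relationship (the slide equation was not transcribed): under this standardization,
  $d_E(x, y)^2 = \sum_i (x_i - y_i)^2 = 2\,\big(1 - r(x, y)\big)$,
  so ranking pairs by Euclidean distance or by correlation distance gives the same ordering.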
21Mahalanobis Distance
- This allows the directional variability of the data to come into play when calculating distances.
- How do we estimate the covariance matrix S?
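- The distance itself, reconstructed: with S an estimate of the covariance matrix of the data,
  $d_{Mah}(x, y) = \sqrt{ (x - y)^{T} S^{-1} (x - y) }$.
  A common choice for S is the (pooled) sample covariance matrix, possibly regularized when the dimension is large relative to the number of observations.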
22Distances and Transformations
- Assume that g is an invertible possible
non-linear transformation g x ? x - This transformation induces a new metric d via
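- The induced metric, reconstructed from the definition:
  $d_g(x, y) = d\big(g(x), g(y)\big)$.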
23Distances and Scales
- Original scanned fluorescence intensities
- Logarithmically transformed data
- Data transformed by the generalized logarithm (glog)
24Experiment-specific Distances Between Genes
- One might like to use additional experimental design information in determining how one calculates distances between the genes.
- One might wish to use smoothed estimates or other statistical fits and measure distances between these.
- In time course data, distances that honor the time order of the data are appropriate.
25Standardizing Genes
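- A hedged sketch of the usual standardization (the slide content was an image): for gene g across samples i = 1, ..., I,
  $\tilde{x}_{gi} = \dfrac{x_{gi} - \bar{x}_{g\cdot}}{s_{g\cdot}}$,
  where $\bar{x}_{g\cdot}$ and $s_{g\cdot}$ are the mean and standard deviation of gene g across the samples; the analogous operation standardizes each array (sample) across the genes.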
26Standardizing Arrays (Samples)
27Scaling and Its Implication to Data Analysis - I
- Types of gene expression data:
- Relative (cDNA)
- Absolute (Affymetrix)
- x_gi is the expression of gene g in sample i, measured on a log scale.
- Let y_gi = x_gi - x_gA, where patient A is our reference.
- The distance between patient samples is then as given below.
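- The missing formula can be reconstructed: for two patient samples j and k,
  $\sum_g (y_{gj} - y_{gk})^2 = \sum_g \big( (x_{gj} - x_{gA}) - (x_{gk} - x_{gA}) \big)^2 = \sum_g (x_{gj} - x_{gk})^2$,
  so the reference sample A cancels and Minkowski-type distances between samples are the same on the relative and absolute scales.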
28Scaling and Its Implication to Data Analysis - II
29Scaling and Its Implication to Data Analysis - III
The distance between two genes is given by the expression below.
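- A hedged reconstruction of the missing expression: for two genes g and h on the relative scale,
  $\sum_i (y_{gi} - y_{hi})^2 = \sum_i \big( (x_{gi} - x_{hi}) - (x_{gA} - x_{hA}) \big)^2$,
  which does not reduce to $\sum_i (x_{gi} - x_{hi})^2$ in general; the reference sample does not cancel, so gene-gene Minkowski distances differ between relative and absolute measures.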
30Summary of Effects of Scaling on Distance Measures
- Minkowski distances:
- The distance between samples is the same for relative and absolute measures.
- The distance between genes is not the same for relative and absolute measures.
- Pearson correlation-based distance:
- The distance between genes is the same for relative and absolute measures.
- The distance between samples is not the same for relative and absolute measures.
31What is Cluster Analysis?
- Given a collection of n objects, each of which is described by a set of p characteristics or variables, derive a useful division into a number of classes.
- Both the number of classes and the properties of the classes are to be determined.
- (Everitt 1993)
32Why Do This?
- Organize
- Prediction
- Etiology (Causes)
33How Do We Measure Quality?
- The same objects admit multiple clusterings:
- Male, Female
- Low, Middle, Upper Income
- A clustering is neither true nor false
- It is measured by its utility
34Difficulties In Clustering
- Cluster structure may be manifest in a multitude of ways.
- Large data sets and high dimensionality complicate matters.
35Clustering Prerequisites
- A method to measure the distance between observations and clusters
- Similarity
- Dissimilarity
- This was discussed previously
- A method of normalizing the data
- We discussed this previously
- A method of reducing the dimensionality of the data
- We discussed this previously
36The Number of Groups Problem
- How Do We Decide on the Appropriate Number of Clusters?
- Duda, Hart and Stork (2001):
- Form Je(2)/Je(1), where Je(M) is the sum-of-squared-error criterion for the M-cluster model. The distribution of this ratio is usually not known. (See the R sketch below.)
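- A minimal R sketch of this ratio; the data matrix x is simulated for illustration and is not the course dataset.
  set.seed(1)
  x <- matrix(rnorm(200), ncol = 2)
  Je1 <- sum(scale(x, scale = FALSE)^2)         # Je(1): total sum of squares about the grand mean
  Je2 <- sum(kmeans(x, centers = 2)$withinss)   # Je(2): within-cluster sum of squares for 2 groups
  Je2 / Je1                                     # small values suggest genuine 2-group structure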
37Optimization Methods
- Minimizing or Maximizing Some Criteria
- Does Not Necessarily Form Hierarchical Clusters
38Clustering Criteria
The Sum of Squared Error Criterion
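- Reconstructed criterion (the slide equation was an image): for clusters D_1, ..., D_c with means m_1, ..., m_c,
  $J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2$.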
39Spoofing of the Sum of Squares Error Criterion
40Related Criteria
- With a little manipulation we obtain the form below.
- Instead of using the average squared distance between points in a cluster, as indicated above, we could use the median or maximum distance.
- Each of these will produce its own variant criterion.
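- A hedged reconstruction of the manipulated form (following Duda, Hart and Stork):
  $J_e = \tfrac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i$, with $\bar{s}_i = \dfrac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} \lVert x - x' \rVert^2$
  the average squared distance between points in cluster i; replacing this average by a median or maximum gives the variant criteria mentioned above.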
41Scatter Criteria
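- Reconstructed definitions (the slide content was an image): with cluster means m_i, overall mean m, and cluster sizes n_i,
  $S_W = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^{T}$ (within-cluster scatter),
  $S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^{T}$ (between-cluster scatter),
  $S_T = \sum_{x} (x - m)(x - m)^{T}$ (total scatter).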
42Relationship of the Scattering Criteria
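- The relationship, reconstructed: $S_T = S_W + S_B$. Since $S_T$ does not depend on the partition, making $S_W$ small is equivalent to making $S_B$ large.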
43Measuring the Size of Matrices
- So we wish to minimize SW while maximizing SB.
- We will measure the size of a matrix by using its trace or determinant.
- These are equivalent in the case of univariate data.
44Interpreting the Trace Criteria
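- Reconstructed interpretation: $\mathrm{tr}\, S_W = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2 = J_e$, so minimizing the trace criterion is exactly the sum-of-squared-error criterion; and because $\mathrm{tr}\, S_T = \mathrm{tr}\, S_W + \mathrm{tr}\, S_B$ is fixed, this simultaneously maximizes $\mathrm{tr}\, S_B$.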
45The Determinant Criteria
- SB will be singular if the number of clusters is less than or equal to the dimensionality.
- Partitions based on Je may change under linear transformations of the data.
- This is not the case with Jd.
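- The criterion itself, reconstructed: $J_d = \lvert S_W \rvert$, to be minimized over partitions (S_W is used rather than S_B precisely because S_B can be singular).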
46Other Invariant Criteria
- It can be shown that the eigenvalues of SW^-1 SB are invariant under nonsingular linear transformations.
- We might choose to maximize the criterion below.
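- A hedged reconstruction of one such criterion: with $\lambda_1, \ldots, \lambda_d$ the eigenvalues of $S_W^{-1} S_B$, maximize
  $\mathrm{tr}\big(S_W^{-1} S_B\big) = \sum_{i=1}^{d} \lambda_i$.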
47k-means Clustering
- Begin: initialize n, k, m1, m2, ..., mk
- Do: classify the n samples according to the nearest mi
- Recompute each mi
- Until: no change in any mi
- Return m1, m2, ..., mk
- End
- The complexity of the algorithm is O(ndkT)
- T is the number of iterations
- T is typically << n
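- A minimal R sketch of this algorithm using the built-in kmeans() function; the two-group data are simulated for illustration and are not the course dataset.
  set.seed(2)
  x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2))
  km <- kmeans(x, centers = 2, nstart = 10)  # nstart restarts guard against poor initial means
  km$centers                                 # the final means m1, ..., mk
  table(km$cluster)                          # cluster sizes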
48Example Mean Trajectories
49Optimizing the Clustering Criterion
- N(n,g): the number of partitions of n individuals into g groups
- N(15,3) = 2,375,101
- N(20,4) = 45,232,115,901
- N(25,8) = 690,223,721,118,368,580
- N(100,5) ≈ 10^68
- Note that 3.15 x 10^17 seconds is the estimated age of the universe.
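- For reference, N(n,g) is a Stirling number of the second kind, which can be written
  $N(n, g) = \dfrac{1}{g!} \sum_{k=0}^{g} (-1)^{k} \binom{g}{k} (g - k)^{n}$,
  which is why exhaustive search over all partitions is hopeless even for modest n.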
50Hill Climbing Algorithms
- 1 - Form an initial partition into the required number of groups.
- 2 - Calculate the change in the clustering criterion produced by moving each individual from its own cluster to another cluster.
- 3 - Make the change that leads to the greatest improvement in the value of the clustering criterion.
- 4 - Repeat steps (2) and (3) until no move of a single individual causes the clustering criterion to improve.
- This guarantees a local, not global, optimum.
51How Do We Choose c
- Randomly classify points to generate the mi's
- Randomly generate the mi's
- Base the location of the c-cluster solution on the (c-1)-cluster solution
- Base the location of the c-cluster solution on a hierarchical solution
52Alternative Methods
- Simulated Annealing
- Genetic Algorithms
- Quantum Computing
53Hierarchical Cluster Analysis
- 1 Cluster to n Clusters
- Agglomerative Methods
- Fusion of n Data Points into Groups
- Divisive Methods
- Separate the n Data Points Into Finer Groupings
54Dendrograms
- [Figure: a dendrogram for five objects. Read left to right (agglomerative, stages 0-4), the singletons (1)-(5) merge as (1,2), (4,5), (3,4,5), and finally (1,2,3,4,5); read right to left (stages 4-0) for the divisive direction.]
55Agglomerative Algorithm(Bottom Up or Clumping)
- Start: clusters C1, C2, ..., Cn, each containing a single data point
- 1 - Find the nearest pair of clusters Ci, Cj; merge Ci and Cj, delete Cj, and decrement the cluster count by 1
- 2 - If the number of clusters is greater than 1, go back to step 1
56Inter-cluster Dissimilarity Choices
- Furthest Neighbor (Complete Linkage)
- Nearest Neighbor (Single Linkage)
- Group Average
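- A minimal R sketch of these three choices via hclust(); the built-in USArrests data are used only for illustration.
  d <- dist(USArrests)                           # Euclidean interpoint distance matrix
  hc.single   <- hclust(d, method = "single")    # nearest neighbor
  hc.complete <- hclust(d, method = "complete")  # furthest neighbor
  hc.average  <- hclust(d, method = "average")   # group average
  plot(hc.complete, main = "Complete linkage")   # draw the dendrogram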
57Single Linkage (Nearest Neighbor) Clustering
- Distance Between Groups is Defined as That of the Closest Pair of Individuals, Where We Consider 1 Individual From Each Group
- This method may be adequate when the clusters are fairly well separated Gaussians, but it is subject to problems with chaining.
58Example of Single Linkage Clustering
- 1 2 3 4 5
- 1 0.0
- 2 2.0 0.0
- 3 6.0 5.0 0.0
- 4 10.0 9.0 4.0 0.0
- 5 9.0 8.0 5.0 3.0 0.0
- (1 2) 3 4 5
- (1 2) 0.0
- 3 5.0 0.0
- 4 9.0 4.0 0.0
- 5 8.0 5.0 3.0 0.0
59Complete Linkage Clustering (Furthest Neighbor)
- Distance Between Groups is Defined as That of the Most Distant Pair of Individuals
60Complete Linkage Example
- 1 2 3 4 5
- 1 0.0
- 2 2.0 0.0
- 3 6.0 5.0 0.0
- 4 10.0 9.0 4.0 0.0
- 5 9.0 8.0 5.0 3.0 0.0
- (1,2) is the first cluster formed.
- d(12)3 = max{d13, d23} = d13 = 6.0
- d(12)4 = max{d14, d24} = d14 = 10.0
- d(12)5 = max{d15, d25} = d15 = 9.0
- The smallest entry in the updated distance matrix is d45 = 3.0, so the next merge joins the clusters (4) and (5).
61Group Average Clustering
- The distance between clusters is the average of the distances between all pairs of individuals, one taken from each of the 2 groups.
- A compromise between single linkage and complete linkage.
62Centroid Clusters
- We use the centroid of a group once it is formed.
63Problems With Hierarchical Clustering
- Well, it really gives us a continuum of different clusterings of the data.
- As stated previously, there are specific artifacts associated with the various methods.
64Dendrogram
65Data Color Histogram or Data Image
Orderings of the data matrix were first discussed by Bertin. Wegman coined the term "data color histogram" in 1990. Mike Minnotte and Webster West subsequently coined the phrase "data image" in 1998.
66Data Image Reveals Obfuscated Cluster Structure
[Figure panels: a subset of the pairs plot; the data image sorted on observations; the data image sorted on observations and features.]
90 observations in R^100 drawn from a standard normal distribution. The first and second groups of 30 rows were shifted by 20 in their first and second dimensions, respectively. This data matrix was then multiplied by a 100 x 100 matrix of Gaussian noise.
67The Data Image in the Gene Expression Community
68Example Dataset
69Complete Linkage Clustering
70Single Linkage Clustering
71Average Linkage Clustering
72Pruning Our Tree
- cutree(tree, k = NULL, h = NULL)
- tree: a tree as produced by hclust. cutree() only expects a list with components merge, height, and labels, of appropriate content each.
- k: an integer scalar or vector with the desired number of groups.
- h: a numeric scalar or vector with the heights where the tree should be cut.
- At least one of k or h must be specified; k overrides h if both are given.
- Values returned:
- cutree returns a vector with group memberships if k or h are scalar; otherwise a matrix with group memberships is returned, where each column corresponds to the elements of k or h, respectively (which are also used as column names).
73Example Pruning
- > x.cl2 <- cutree(hclust(x.dist), k = 2)
- > x.cl2[1:10]
- [1] 1 1 1 1 1 1 1 1 1 1
- > x.cl2[190:200]
- [1] 2 2 2 2 2 2 2 2 2 2 2
74Identifying the Number of Clusters
- As indicated previously, we really have no way of identifying the true cluster structure unless we have divine intervention.
- In the next several slides we present some well-known methods.
75Method of Mojena
- Select the number of groups based on the first stage of the dendrogram that satisfies the criterion below.
- The a0, a1, a2, ..., an-1 are the fusion levels corresponding to stages with n, n-1, ..., 1 clusters. The mean and unbiased standard deviation of these fusion levels enter the criterion, and k is a constant.
- Mojena (1977): 2.75 < k < 3.5
- Milligan and Cooper (1985): k = 1.25
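- A hedged reconstruction of the criterion: cut at the first stage j whose fusion level satisfies
  $a_{j+1} > \bar{a} + k\, s_a$,
  where $\bar{a}$ and $s_a$ are the mean and unbiased standard deviation of the fusion levels and k is the constant above.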
76Hartigans k-means theory
- When deciding on the number of clusters, Hartigan (1975, pp. 90-91) suggests the following rough rule of thumb. If k is the result of kmeans with k groups and kplus1 is the result with k+1 groups, then it is justifiable to add the extra group when
- (sum(k$withinss)/sum(kplus1$withinss) - 1) * (nrow(x) - k - 1)
- is greater than 10.
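- A minimal R sketch of this rule of thumb; x is a simulated numeric data matrix, not the course dataset.
  hartigan <- function(x, k) {
    fit.k     <- kmeans(x, centers = k,     nstart = 10)
    fit.kplus <- kmeans(x, centers = k + 1, nstart = 10)
    (sum(fit.k$withinss) / sum(fit.kplus$withinss) - 1) * (nrow(x) - k - 1)
  }
  set.seed(3)
  x <- matrix(rnorm(200), ncol = 2)
  hartigan(x, 2)   # values greater than 10 would justify moving from 2 to 3 groups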
77kmeans Applied to our Data Set
78The 3 term kmeans solution
79The 4 term kmeans Solution
80Determination of the Number of Clusters Using the Hartigan Criterion
81MIXTURE-BASED CLUSTERING
82HOW DO WE CHOOSE g?
- Human Intervention
- Divine Intervention
- Likelihood Ratio Test Statistic
- Wolfe's Method
- Bootstrap
- AIC, BIC, MDL
- Adaptive Mixtures Based Methods
- Pruning
- SHIP (AKMM)
83Akaike's Information criteria (AIC)
- AIC(g) = -2 log L(ĝ) + 2N(g), where N(g) is the number of free parameters in the model of size g.
- We Choose g In Order to Minimize the AIC Criterion
- This Criterion is Subject to the Same Regularity Conditions as the -2 log λ Statistic
84MIXTURE VISUALIZATION 2-d
85MODEL-BASED CLUSTERING
- This technique takes a density function approach.
- It uses finite mixture densities as models for cluster analysis.
- Each component density characterizes a cluster.
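- The model, sketched: each observation is assumed to arise from a finite mixture density
  $f(x) = \sum_{i=1}^{g} \pi_i\, f_i(x; \theta_i)$, with $\pi_i \ge 0$ and $\sum_i \pi_i = 1$,
  typically with Gaussian components; an observation is assigned to the component with the largest posterior probability.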
86Minimal Spanning Tree-Based Clustering
Diansheng Guo, Donna Peuquet, and Mark Gahegan (2002), "Opening the black box: interactive hierarchical clustering for multivariate spatial patterns," Proceedings of the Tenth ACM International Symposium on Advances in Geographic Information Systems, McLean, Virginia, USA.
87What is Pattern Recognition?
- From Devroye, Györfi and Lugosi:
- Pattern recognition or discrimination is about guessing or predicting the unknown nature of an observation, a discrete quantity such as black or white, one or zero, sick or healthy, real or fake.
- From Duda, Hart and Stork:
- The act of taking in raw data and taking an action based on the category of the pattern.
88Isn't This Just Statistics?
- Short answer: yes.
- Breiman (Statistical Science, 2001) suggests there are two cultures within statistical modeling: stochastic modelers and algorithmic modelers.
89Algorithmic Modeling
- Pattern recognition (classification) is concerned with predicting the class membership of an observation.
- This can be done from the perspective of (traditional statistical) data models.
- Often the data are high dimensional, complex, and of unknown distributional origin.
- Thus, pattern recognition often falls into the algorithmic modeling camp.
- The measure of performance is whether the classifier accurately predicts the class, not how well it models the distribution.
- Empirical evaluations are often more compelling than asymptotic theorems.
90Pattern Recognition Flowchart
91Pattern Recognition Concerns
- Feature extraction and distance calculation.
- Development of automated algorithms for classification.
- Classifier performance evaluation.
- Latent or hidden class discovery based on extracted feature analysis.
- Theoretical considerations.
92Linear and Quadratic Discriminant Analysis in Action
93Nearest Neighbor Classifier
94SVM Training Cartoon
95CART Analysis of the Fisher Iris Data
96Random Forests
- Create a large number of trees based on random samples of our dataset.
- Use a bootstrap sample for each random sample.
- The variables used to create the splits are a random sub-sample of all of the features.
- All trees are grown fully.
- Majority vote over the trees determines the membership of a new observation.
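- A minimal R sketch, assuming the randomForest package is installed; the built-in iris data are used only for illustration.
  library(randomForest)
  set.seed(4)
  rf <- randomForest(Species ~ ., data = iris, ntree = 500)  # bootstrap samples, random feature subsets at each split
  rf$confusion                # out-of-bag confusion matrix
  predict(rf, iris[1:5, ])    # class membership by majority vote over the trees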
97Boosting and Bagging
98Boosting
99Evaluating Classifiers
100Resubstitution
101Cross Validation
102Leave-k-Out
103Cross-Validation Notes
104Test Set
105Some Classifier Results on the Golub ALL vs AML Dataset
106References - I
- Richard O. Duda, Peter E. Hart, and David G. Stork (2001), Pattern Classification, 2nd Edition.
- Eisen MB, Spellman PT, Brown PO, and Botstein D (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.
- Brian S. Everitt, Sabine Landau, and Morven Leese (2001), Cluster Analysis, 4th Edition, Arnold.
- Gasch AP and Eisen MB (2002). Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology 3(11), 1-22.
- Gad Getz, Erel Levine, and Eytan Domany (2000). Coupled two-way clustering analysis of gene microarray data. PNAS, vol. 97, no. 22, pp. 12079-12084.
- Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, and Brown P (2000). 'Gene Shaving' as a Method for Identifying Distinct Sets of Genes with Similar Expression Patterns. GenomeBiology.com 1,
107References - II
- A. K. Jain, M. N. Murty, and P. J. Flynn (1999). Data clustering: a review. ACM Computing Surveys (CSUR), Volume 31, Issue 3.
- John Quackenbush (2001). Computational analysis of microarray data. Nature Reviews Genetics, Volume 2, pp. 418-427.
- Ying Xu, Victor Olman, and Dong Xu (2002). Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18, 536-545.
108References - III
- Hastie, Tibshirani, and Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
- Devroye, Györfi, and Lugosi (1996). A Probabilistic Theory of Pattern Recognition.
- Ripley (1996). Pattern Recognition and Neural Networks.
- Fukunaga (1990). Introduction to Statistical Pattern Recognition.