Title: Clustering microarray data
1 Clustering microarray data
2 Sub-classes of lung cancer have signature genes
(Bhattacharjee et al. 2001)
3 Promoter analysis of commonly regulated genes
(Lockhart & Winzeler, Nature 405, 15 June 2000, p. 827)
4 Discovery of new cancer subtypes
These subtypes are unknown at the time of the study.
5 Overview
- Clustering is an unsupervised learning method used to build groups of genes with related expression patterns.
- The classes are not known in advance; the aim is to discover new patterns from microarray data.
- In contrast, supervised learning refers to the learning process where the classes are known, and the aim is to define classification rules that separate them. Supervised learning will be discussed in the next lecture.
6 Dissimilarity functions
- To identify clusters, we first need to define what "close" means. There are many choices of distance (a short sketch follows this list):
- Euclidean distance
- 1 − Pearson correlation
- Manhattan distance
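A minimal sketch of these three dissimilarities for a pair of expression profiles, assuming NumPy; the function names are mine, not from the lecture:

```python
import numpy as np

def euclidean(x, y):
    # Straight-line distance between two expression profiles.
    return np.sqrt(np.sum((x - y) ** 2))

def correlation_distance(x, y):
    # 1 - Pearson correlation: 0 for perfectly correlated profiles,
    # 2 for perfectly anti-correlated ones.
    return 1.0 - np.corrcoef(x, y)[0, 1]

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return np.sum(np.abs(x - y))
```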
8 Where is the truth?
- "In the context of unsupervised learning, there is no such direct measure of success. It is difficult to ascertain the validity of inference drawn from the output of most unsupervised learning algorithms. One must often resort to heuristic arguments, not only for motivating the algorithm, but also for judgments as to the quality of results. This uncomfortable situation has led to heavy proliferation of proposed methods, since effectiveness is a matter of opinion and cannot be verified directly."
(Hastie et al. 2001, The Elements of Statistical Learning)
9 Clustering Methods
- Partitioning methods: seek to optimally divide objects into a fixed number of clusters.
- Hierarchical methods: produce a nested sequence of clusters.
(Speed, Chapter 4)
10 Methods
- k-means
- Hierarchical clustering
- Self-organizing maps (SOM)
11 k-means
- Divide objects into k clusters.
- Goal is to minimize total intra-cluster variance
- Global minimum is difficult to obtain.
12 Algorithm for k-means clustering
- Step 1 (initialization): randomly select k centroids.
- Step 2: for each object, find its closest centroid and assign the object to the corresponding cluster.
- Step 3: for each cluster, update its centroid to the mean position of all objects in that cluster.
- Repeat Steps 2 and 3 until convergence (a minimal implementation sketch follows).
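A minimal NumPy sketch of these steps, assuming Euclidean distance; the empty-cluster corner case is ignored for brevity:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm for k-means; X is (n_objects, n_features)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k randomly chosen objects.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```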
13 Shows the initial randomized centers and a number of points.
14 Centers have been associated with the points and have been moved to the respective centroids.
15 Now, the association is shown in more detail, once the centroids have been moved.
16 Again, the centers are moved to the centroids of the corresponding associated points.
17 Properties of k-means
- Achieves a local minimum of the within-cluster sum of squares W = Σ_j Σ_{x in C_j} ||x − m_j||², the quantity being minimized; the global minimum is not guaranteed.
- Very fast.
18 Practical issues with k-means
- k must be known in advance.
- Results depend on the initial assignment of centroids.
19 How to choose k?
Milligan & Cooper (1985) compared 30 published rules. Two examples (a code sketch follows):
1. Calinski & Harabasz (1974): choose the k that maximizes CH(k) = [B(k)/(k − 1)] / [W(k)/(n − k)].
2. Hartigan (1975): H(k) = [W(k)/W(k+1) − 1](n − k − 1); stop at the smallest k with H(k) < 10.
W(k): total sum of squares within clusters. B(k): sum of squares between cluster means.
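A sketch of W(k), B(k), and the Calinski-Harabasz index, assuming NumPy and integer cluster labels; the function names are mine:

```python
import numpy as np

def within_between_ss(X, labels):
    """Total within-cluster (W) and between-cluster (B) sums of squares."""
    grand_mean = X.mean(axis=0)
    W = B = 0.0
    for j in np.unique(labels):
        cluster = X[labels == j]
        mu = cluster.mean(axis=0)
        W += np.sum((cluster - mu) ** 2)
        B += len(cluster) * np.sum((mu - grand_mean) ** 2)
    return W, B

def calinski_harabasz(X, labels, k):
    # CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)]; requires k >= 2.
    n = len(X)
    W, B = within_between_ss(X, labels)
    return (B / (k - 1)) / (W / (n - k))
```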
20 How to choose k (continued)?
Gap statistic (Tibshirani et al. 2001): estimate E[log W(k)] for reference data drawn uniformly from a rectangle bounding the observations, and take the gap between this reference curve and the observed log W(k). Choose the k for which the gap is largest.
(Figure: observed and reference log W(k) curves plotted against k; the gap is the vertical distance between them.)
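A rough sketch of this computation, assuming a kmeans_fn(data, k) helper that returns cluster labels; the published one-standard-error selection rule is omitted:

```python
import numpy as np

def gap_statistic(X, kmeans_fn, k_max=10, n_ref=20, seed=0):
    """Gap(k) = mean log W_ref(k) - log W_obs(k); reference data are drawn
    uniformly from the bounding rectangle of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_w(data, k):
        # Total within-cluster sum of squares, on the log scale.
        labels = kmeans_fn(data, k)
        w = sum(np.sum((data[labels == j] - data[labels == j].mean(axis=0)) ** 2)
                for j in np.unique(labels))
        return np.log(w)

    gaps = []
    for k in range(1, k_max + 1):
        ref = np.mean([log_w(rng.uniform(lo, hi, size=X.shape), k)
                       for _ in range(n_ref)])
        gaps.append(ref - log_w(X, k))
    return np.array(gaps)  # slide's rule: pick k = argmax(gaps) + 1
```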
21 How to select initial centroids
- Repeat the procedure many times with randomly chosen initial centroids, keeping the best solution (see the example below).
- Alternatively, initialize centroids smartly, e.g. by hierarchical clustering.
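In practice, libraries expose the restart strategy directly; for example, with scikit-learn (assuming X is an (n_samples, n_features) expression matrix):

```python
from sklearn.cluster import KMeans

# Run k-means 50 times from random initial centroids and keep the
# solution with the lowest within-cluster sum of squares (inertia).
km = KMeans(n_clusters=4, n_init=50, random_state=0).fit(X)
labels, wss = km.labels_, km.inertia_
```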
22 k-means requires good initial values. Hierarchical clustering could be used to supply them, but sometimes performs poorly.
(Figure: two clustering solutions; within-cluster sums of squares: X = 965.32, O = 305.09.)
23 Hierarchical clustering
Hierarchical clustering builds a hierarchy of
clusters, represented by a tree (called a
dendrogram). Close clusters are joined together.
Height of a branch represents the dissimilarity
between the two clusters joined by it.
24 How to construct a dendrogram
- Bottom-up (agglomerative) approach:
  - Initialization: each cluster contains a single object.
  - Iteration: merge the closest clusters.
  - Stop when all objects are included in a single cluster.
- Top-down (divisive) approach:
  - Starting from a single cluster containing all objects, iteratively partition into smaller clusters.
- To extract clusters, truncate the dendrogram at a similarity threshold (e.g. correlation > 0.6) or require each cluster to contain at least a minimum number of objects (a code sketch follows).
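A sketch of the bottom-up construction and truncation using SciPy, assuming X is a genes-by-arrays matrix; note that correlation > 0.6 corresponds to a correlation distance below 0.4:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Pairwise dissimilarities: 1 - Pearson correlation between gene profiles.
D = pdist(X, metric="correlation")     # X: (n_genes, n_arrays)

# Bottom-up merging; 'average' linkage is one common choice.
Z = linkage(D, method="average")

# Truncate the tree at dissimilarity 0.4 (i.e., correlation > 0.6).
labels = fcluster(Z, t=0.4, criterion="distance")
```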
25 Hierarchical Clustering
(Figure: example dendrogram over objects 1-6.)
26 Dendrograms can be reordered
27 Ordered dendrograms
- There are 2^(n−1) linear orderings of the n elements (n genes or conditions).
- Maximizing adjacent similarity is impractical, so order instead by:
  - average expression level,
  - time of maximal induction, or
  - chromosome position.
(Eisen et al. 1998)
28 Properties of Hierarchical Clustering
- The top-down approach is more favorable when only a few clusters are desired.
- Single linkage tends to produce long chains of clusters.
- Complete linkage tends to produce compact clusters.
30 Partitioning clustering vs hierarchical clustering
(Figure: dendrogram over objects 1-6, cut to give k = 4 clusters.)
31 Partitioning clustering vs hierarchical clustering
(Figure: dendrogram over objects 1-6, cut to give k = 3 clusters.)
32 Partitioning clustering vs hierarchical clustering
(Figure: dendrogram over objects 1-6, cut to give k = 2 clusters.)
33 Self-organizing map
- Imposes partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the strong prior hypotheses used in Bayesian clustering, and the non-structure of k-means clustering).
- Allows easy visualization and interpretation.
34 SOM Algorithm
- Initialization: place prototypes m_j on a lattice of p × q nodes. Each prototype is a weight vector whose dimension is the same as the input data.
- Iteration: for each observation x_i, find the closest prototype m_j; then move m_j and all of its lattice neighbors m_k toward x_i: m_k ← m_k + α(x_i − m_k).
- During the iterations, gradually reduce the learning rate α and the neighborhood size r.
- Many iterations may be needed before convergence (a sketch follows).
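A minimal sketch of this loop, assuming NumPy; the linear decay schedules for α and the neighborhood radius are my choice, not from the lecture:

```python
import numpy as np

def som(X, p, q, n_epochs=50, seed=0):
    """Minimal SOM on a p x q grid; returns prototypes of shape (p*q, d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    grid = np.array([(i, j) for i in range(p) for j in range(q)], float)
    m = rng.normal(size=(p * q, d))            # prototype weight vectors
    for epoch in range(n_epochs):
        frac = epoch / n_epochs
        alpha = 0.5 * (1 - frac)               # learning rate decays
        radius = max(p, q) / 2 * (1 - frac)    # neighborhood shrinks too
        for x in X[rng.permutation(len(X))]:
            # Closest prototype in feature space.
            best = np.argmin(np.sum((m - x) ** 2, axis=1))
            # All prototypes within the current radius on the lattice.
            neighbors = np.linalg.norm(grid - grid[best], axis=1) <= radius
            # Move the winner and its neighbors toward the observation.
            m[neighbors] += alpha * (x - m[neighbors])
    return m
```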
36 (Figure from Hastie et al. 2001)
37 (Figure from Hastie et al. 2001)
38 (Figure from Hastie et al. 2001)
39 SOM clustering of periodic genes
40 Applications to microarray data
41
- With only a few nodes, one tends not to see distinct patterns, and there is large within-cluster scatter. As nodes are added, distinctive and tight clusters emerge.
- SOM is an incremental learning algorithm involving case-by-case presentation rather than batch presentation.
- As with all exploratory data analysis tools, the use of SOMs involves inspection of the data to extract insights.
42 Other Clustering Methods
- Gene Shaving
- MDS
- Affinity Propagation
- Spectral Clustering
- Two-way clustering
43
- "Algorithms for unsupervised classification or cluster analysis abound. Unfortunately, however, algorithm development seems to be a preferred activity to algorithm evaluation among methodologists."
- "No consensus or clear guidelines exist to guide these decisions. Cluster analysis always produces clustering, but whether a pattern observed in the sample data characterizes a pattern present in the population remains an open question. Resampling-based methods can address this last point, but results indicate that most clusterings in microarray data sets are unlikely to reflect reproducible patterns or patterns in the overall population."
(Allison et al. 2006)
44 Stability of a cluster
- Motivation: real clusters should be reproducible under perturbation (adding noise, omission of data, etc.).
- Procedure:
  - Perturb the observed data by adding noise.
  - Apply the clustering procedure to the perturbed data.
  - Repeat the above steps to generate a sample of clusterings.
- Global test.
- Cluster-specific tests: R-index, D-index.
(McShane et al. 2002)
45 (Figure: example dendrogram over objects 1-6.)
46 Global test
- Null hypothesis: the data come from a multivariate Gaussian distribution.
- Procedure:
  - Consider the subspace spanned by the top principal components.
  - Estimate the distribution of nearest-neighbor distances.
  - Compare the observed data with simulated data.
47 R-index
- If cluster i contains n_i objects, then it contains m_i = n_i(n_i − 1)/2 pairs.
- Let c_i be the number of those pairs that fall in the same cluster when the perturbed data are re-clustered.
- r_i = c_i / m_i measures the robustness of cluster i.
- R-index: R = (Σ_i c_i) / (Σ_i m_i) measures the overall stability of a clustering algorithm (a sketch follows).
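A minimal sketch of the pair-counting computation, assuming labels_orig and labels_pert are cluster labels for the original and perturbed data; producing labels_pert (adding noise and re-clustering, per slide 44) is left to the caller:

```python
import numpy as np
from itertools import combinations

def r_index(labels_orig, labels_pert):
    """Per-cluster robustness r_i and the overall R-index."""
    cs, ms = [], []
    for cl in np.unique(labels_orig):
        members = np.where(labels_orig == cl)[0]
        m_i = len(members) * (len(members) - 1) // 2
        # Pairs from the original cluster that stay together after perturbation.
        c_i = sum(labels_pert[a] == labels_pert[b]
                  for a, b in combinations(members, 2))
        cs.append(c_i)
        ms.append(m_i)
    r = [c / m if m else np.nan for c, m in zip(cs, ms)]
    return r, sum(cs) / sum(ms)
```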
48 D-index
- For each original cluster, determine the closest cluster in the perturbed data.
- Calculate the average discrepancy between the original and perturbed clusters (omissions vs. additions).
- The D-index is the sum of the cluster-specific discrepancies.
49 Applications
- 16 prostate cancer samples, 9 benign tumor samples.
- 6,500 genes.
- Use hierarchical clustering to obtain 2, 3, and 4 clusters.
- Question: are these clusters reliable?
52 Issues with calculating the R- and D-indices
- How large should the perturbation be?
- How should the significance level be quantified?
- What about nested consistency?
53 Acknowledgment
- Slides sourced from Cheng Li.