1
Clustering microarray data
  • 09/26/07

2
Sub-classes of lung cancer types have signature
genes
(Bhattacharjee 2001)
3
Promoter analysis of commonly regulated genes
David J. Lockhart and Elizabeth A. Winzeler, Nature
405, 15 June 2000, p. 827
4
Discovery of new cancer subtype
These classes are unknown at the time of study.
5
Overview
  • Clustering is an unsupervised learning method
    used to build groups of genes with related
    expression patterns.
  • The classes are not known in advance.
  • Aim is to discover new patterns from microarray
    data.
  • In contrast, supervised learning refers to the
    learning process where classes are known. The aim
    is to define classification rules to separate the
    classes. Supervised learning will be discussed in
    the next lecture.

6
Dissimilarity function
  • To identify clusters, we first need to define
    what "close" means. There are many choices of
    distance (a minimal sketch follows below)
  • Euclidean distance
  • 1 - Pearson correlation
  • Manhattan distance
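A minimal NumPy sketch of these three dissimilarity measures (the function names are illustrative, not from the slides):

```python
import numpy as np

def euclidean(x, y):
    # Straight-line distance between two expression profiles
    return np.sqrt(np.sum((x - y) ** 2))

def one_minus_pearson(x, y):
    # 1 - Pearson correlation: 0 for perfectly correlated profiles,
    # 2 for perfectly anti-correlated profiles
    return 1.0 - np.corrcoef(x, y)[0, 1]

def manhattan(x, y):
    # Sum of absolute coordinate-wise differences
    return np.sum(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.8, 3.2, 4.1])
print(euclidean(x, y), one_minus_pearson(x, y), manhattan(x, y))
```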

7
(No Transcript)
8
Where is the truth?
  • In the context of unsupervised learning, there
    is no such direct measure of success. It is
    difficult to ascertain the validity of inference
    drawn from the output of most unsupervised
    learning algorithms. One must often resort to
    heuristic arguments not only for motivating the
    algorithm, but also for judgments as to the
    quality of results. This uncomfortable situation
    has led to heavy proliferation of proposed
    methods, since effectiveness is a matter of
    opinion and cannot be verified directly.

Hastie et al. 2001 ESL
9
Clustering Methods
  • Partitioning methods
  • Seek to optimally divide objects into a fixed
    number of clusters.
  • Hierarchical methods
  • Produce a nested sequence of clusters

(Speed, Chapter 4)
10
Methods
  • k-means
  • Hierarchical clustering
  • Self-organizing maps (SOM)

11
k-means
  • Divide objects into k clusters.
  • Goal is to minimize total intra-cluster variance
  • Global minimum is difficult to obtain.

12
Algorithm for k-means clustering
  • Step 1 (initialization): randomly select k
    centroids.
  • Step 2: for each object, find its closest centroid
    and assign the object to the corresponding
    cluster.
  • Step 3: for each cluster, update its centroid to
    the mean position of all objects in that cluster.
  • Repeat Steps 2 and 3 until convergence (a sketch
    follows below).
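A minimal NumPy sketch of these steps (the function name kmeans and the stopping rule are illustrative assumptions; it also assumes no cluster becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # X: (n_objects, n_features) matrix; returns (labels, centroids)
    rng = np.random.default_rng(seed)
    # Step 1: initialization - pick k distinct objects at random as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each object to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned objects
        # (assumes every cluster keeps at least one member)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids
```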

13
Shows the initial randomized centers and a number
of points.
14
Centers have been associated with the points and
have been moved to the respective centroids.
15
Now, the association is shown in more detail,
once the centroids have been moved.
16
Again, the centers are moved to the centroids of
the corresponding associated points.
17
Properties of k-means
  • Achieves a local minimum (not necessarily the
    global minimum) of the total intra-cluster
    variance, i.e. the within-cluster sum of squares
    W = Sum_j Sum_(xi in cluster j) ||xi - mj||^2.
  • Very fast.

18
Practical issues with k-means
  • k must be known in advance
  • Results are dependent on initial assignment of
    centroids.

19
How to choose k?
Milligan and Cooper (1985) compared 30 published
rules. Two common ones:
1. Calinski and Harabasz (1974): choose the k that
   maximizes CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)].
2. Hartigan (1975): H(k) = [W(k)/W(k+1) - 1](n-k-1);
   stop at the smallest k with H(k) < 10.
W(k): total sum of squares within clusters; B(k):
sum of squares between cluster means.
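A NumPy sketch of both rules, under the assumption that W(k) and B(k) are the usual within- and between-cluster sums of squares (helper names are illustrative):

```python
import numpy as np

def within_between_ss(X, labels):
    # W: total within-cluster sum of squares; B: between-cluster sum of squares
    overall_mean = X.mean(axis=0)
    W = B = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        center = members.mean(axis=0)
        W += np.sum((members - center) ** 2)
        B += len(members) * np.sum((center - overall_mean) ** 2)
    return W, B

def calinski_harabasz(X, labels):
    # Choose the k that maximizes CH(k)
    n, k = len(X), len(np.unique(labels))
    W, B = within_between_ss(X, labels)
    return (B / (k - 1)) / (W / (n - k))

def hartigan(X, labels_k, labels_k_plus_1):
    # Stop at the smallest k with H(k) < 10
    n, k = len(X), len(np.unique(labels_k))
    W_k, _ = within_between_ss(X, labels_k)
    W_k1, _ = within_between_ss(X, labels_k_plus_1)
    return (W_k / W_k1 - 1) * (n - k - 1)
```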
20
How to choose k (continued)?
Gap statistic (Tibshirani et al. 2001): estimate
E[log W(k)] for reference data drawn uniformly at
random over a rectangle bounding the observed data,
and define Gap(k) as the expected random log W(k)
minus the observed log W(k). Choose the k for which
the Gap is largest (a sketch follows below).
[Figure: observed and expected random log W(k)
curves versus k, with the Gap between them.]
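A sketch of the gap statistic for a single k, reusing the kmeans sketch from slide 12 (the function names and the number of reference data sets are assumptions):

```python
import numpy as np

def log_wk(X, k, kmeans_fn):
    labels, centroids = kmeans_fn(X, k)
    W = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(k))
    return np.log(W)

def gap(X, k, kmeans_fn, n_ref=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Reference data: uniform over the rectangle bounding the observed data
    ref = [log_wk(rng.uniform(lo, hi, size=X.shape), k, kmeans_fn)
           for _ in range(n_ref)]
    # Gap(k) = E[log W(k) under the reference] - observed log W(k)
    return np.mean(ref) - log_wk(X, k, kmeans_fn)
```

Computing gap(X, k, kmeans) over a range of k and keeping the k with the largest value implements the rule above.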
21
How to select initial centroids
  • Repeat the procedure many times with randomly
    chosen initial centroids.
  • Alternatively, initialize centroids smartly, e.g.
    from a hierarchical clustering (a sketch follows
    below).
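One way to sketch the hierarchical initialization, assuming SciPy is available (hclust_centroids is an illustrative helper, not a named method from the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hclust_centroids(X, k):
    # Cut an average-linkage dendrogram into k clusters and use the
    # cluster means as initial centroids for k-means
    Z = linkage(X, method="average")
    labels = fcluster(Z, t=k, criterion="maxclust")
    return np.array([X[labels == j].mean(axis=0) for j in np.unique(labels)])
```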

22
K-means requires good initial values. Hierarchical
clustering can be used to provide them, but
sometimes performs poorly.
[Figure: two k-means solutions from different
initializations; within-cluster sum of squares:
X = 965.32, O = 305.09.]
23
Hierarchical clustering
Hierarchical clustering builds a hierarchy of
clusters, represented by a tree (called a
dendrogram). Close clusters are joined together.
Height of a branch represents the dissimilarity
between the two clusters joined by it.
24
How to construct a dendrogram
  • Bottom-up (agglomerative) approach
  • Initialization: each cluster contains a single
    object.
  • Iteration: merge the two closest clusters.
  • Stop when all objects are included in a single
    cluster.
  • Top-down (divisive) approach
  • Starting from a single cluster containing all
    objects, iteratively partition into smaller
    clusters.
  • Truncate the dendrogram at a similarity threshold,
    e.g. correlation > 0.6, or require each cluster to
    contain at least a minimum number of objects (a
    sketch follows below).
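A small SciPy sketch of the bottom-up construction and the truncation by correlation (the 0.6 threshold is the example from the slide; the metric and linkage choices are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# X: genes in rows, conditions in columns (random data just for illustration)
X = np.random.default_rng(0).normal(size=(50, 12))

# Bottom-up: start from singletons and iteratively merge the closest clusters,
# using 1 - Pearson correlation as the dissimilarity
d = pdist(X, metric="correlation")
Z = linkage(d, method="average")

# Truncate the dendrogram where correlation > 0.6, i.e. dissimilarity < 0.4
labels = fcluster(Z, t=0.4, criterion="distance")
```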

25
Hierarchical Clustering
[Dendrogram over objects 1-6.]
26
Dendrogram can be reordered
27
Ordered dendrograms
  • A dendrogram over n elements (genes or conditions)
    is consistent with 2^(n-1) linear orderings of its
    leaves.
  • Maximizing adjacent similarity over all orderings
    is impractical, so order instead by
  • average expression level,
  • time of maximal induction, or
  • chromosomal position.

Eisen98
28
Properties of Hierarchical Clustering
  • Top-down approach is more favorable when only a
    few clusters are desired.
  • Single linkage tends to produce long chains of
    clusters.
  • Complete linkage tends to produce compact
    clusters.

29
(No Transcript)
30
Partitioning clustering vs hierarchical clustering
[Dendrogram over objects 1-6, cut to give k = 4
clusters.]
31
Partitioning clustering vs hierarchical clustering
[Dendrogram over objects 1-6, cut to give k = 3
clusters.]
32
Partitioning clustering vs hierarchical clustering
[Dendrogram over objects 1-6, cut to give k = 2
clusters.]
33
Self-organizing Map
  • Imposes partial structure on the clusters (in
    contrast to the rigid structure of hierarchical
    clustering, the strong prior hypotheses used in
    Bayesian clustering, and the lack of structure in
    k-means clustering).
  • Allows easy visualization and interpretation.

34
SOM Algorithm
  • Initialize prototypes mj on a lattice of p x q
    nodes. Each prototype is a weight vector with the
    same dimension as the input data.
  • Iteration: for each observation xi, find the
    closest prototype mj, and move every neighbor mk
    of mj toward xi by mk <- mk + a (xi - mk).
  • During the iterations, gradually reduce the
    learning rate a and the neighborhood size r (a
    sketch follows below).
  • May take many iterations before convergence.
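A minimal NumPy sketch of this update rule (the grid size, decay schedules, and the function name som are illustrative assumptions):

```python
import numpy as np

def som(X, p=5, q=5, n_iter=2000, alpha0=0.5, r0=2.0, seed=0):
    # X: (n_samples, n_features); returns a (p*q, n_features) array of prototypes
    rng = np.random.default_rng(seed)
    grid = np.array([(i, j) for i in range(p) for j in range(q)], dtype=float)
    prototypes = X[rng.choice(len(X), size=p * q, replace=True)].astype(float)
    for t in range(n_iter):
        alpha = alpha0 * (1 - t / n_iter)        # learning rate decays over time
        r = max(r0 * (1 - t / n_iter), 0.5)      # neighborhood radius shrinks
        x = X[rng.integers(len(X))]              # present one observation (case by case)
        best = np.argmin(np.linalg.norm(prototypes - x, axis=1))  # closest prototype
        neighbors = np.linalg.norm(grid - grid[best], axis=1) <= r
        # Move the winning prototype and its lattice neighbors toward x
        prototypes[neighbors] += alpha * (x - prototypes[neighbors])
    return prototypes
```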

35
(No Transcript)
36
(Hastie 2001)
37
(Hastie 2001)
38
(Hastie 2001)
39
SOM clustering of periodic genes
40
Applications to microarray data
41
  • With only a few nodes, one tends not to see
    distinct patterns and there is large
    within-cluster scatter. As nodes are added,
    distinctive and tight clusters emerge.
  • SOM is an incremental learning algorithm involving
    case-by-case presentation rather than batch
    presentation.
  • As with all exploratory data analysis tools, the
    use of SOMs involves inspection of the data to
    extract insights.

42
Other Clustering Methods
  • Gene Shaving
  • MDS
  • Affinity Propagation
  • Spectral Clustering
  • Two-way clustering

43
  • Algorithms for unsupervised classification or
    cluster analysis abound. Unfortunately however,
    algorithm development seems to be a preferred
    activity to algorithm evaluation among
    methodologists.
  • No consensus or clear guidelines exist to guide
    these decisions. Cluster analysis always produces
    clustering, but whether a pattern observed in the
    sample data characterizes a pattern present in
    the population remains an open question.
    Resampling-based methods can address this last
    point, but results indicate that most clusterings
    in microarray data sets are unlikely to reflect
    reproducible patterns or patterns in the overall
    population.
  • -Allison et al. (2006)

44
Stability of a cluster
  • Motivation: real clusters should be reproducible
    under perturbation (adding noise, omission of
    data, etc.).
  • Procedure
  • Perturb observed data by adding noise.
  • Apply clustering procedure to cluster the
    perturbed data.
  • Repeat the above steps to generate a sample of
    clusterings.
  • Global test
  • Cluster-specific tests R-index, D-index.

(McShane et al. 2002)
45
[Dendrogram over objects 1-6.]
46
Global test
  • Null hypothesis Data come from a multivariate
    Gaussian distribution.
  • Procedure
  • Consider the subspace spanned by the top principal
    components.
  • Estimate the distribution of nearest-neighbor
    distances.
  • Compare observed with simulated data (a loose
    sketch follows below).
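A loose NumPy illustration of the idea (project onto the top principal components, then compare nearest-neighbor distances against Gaussian simulations); this is a sketch of the outline above, not the exact McShane et al. (2002) procedure:

```python
import numpy as np

def nn_distances(Y):
    # Distance from each point to its nearest neighbor
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    return D.min(axis=1)

def global_test(X, n_pc=3, n_sim=100, seed=0):
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:n_pc].T                      # subspace of top principal components
    observed = nn_distances(Y)
    # Null hypothesis: data come from a multivariate Gaussian
    cov = np.cov(Y, rowvar=False)
    simulated = np.array([nn_distances(rng.multivariate_normal(Y.mean(axis=0), cov,
                                                               size=len(Y)))
                          for _ in range(n_sim)])
    return observed, simulated  # compare the observed distribution with the simulated ones
```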

47
R-index
  • If cluster i contains ni objects, then it contains
    mi = ni(ni - 1)/2 pairs of objects.
  • Let ci be the number of those pairs that fall in
    the same cluster when the perturbed data are
    re-clustered.
  • ri = ci / mi measures the robustness of cluster i.
  • R-index = (Sum_i ci) / (Sum_i mi) measures the
    overall stability of a clustering (a sketch
    follows below).
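A sketch of the r_i and R-index computations (the function name is illustrative); the perturbed labels would come from re-clustering the data after adding noise, as in the procedure on slide 44:

```python
import numpy as np

def r_index(labels_orig, labels_pert):
    # Per-cluster robustness r_i = c_i / m_i and the overall R-index
    total_c = total_m = 0
    r = {}
    for clu in np.unique(labels_orig):
        idx = np.flatnonzero(labels_orig == clu)
        m = c = 0
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                m += 1                                   # one of the m_i pairs
                if labels_pert[idx[a]] == labels_pert[idx[b]]:
                    c += 1                               # pair stayed together
        r[clu] = c / m if m else float("nan")
        total_c += c
        total_m += m
    return r, total_c / total_m
```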

48
D-index
  • For each cluster, determine the closest cluster in
    the perturbed data.
  • Calculate the average discrepancy between the
    original and perturbed clusters, counting both
    omissions and additions.
  • The D-index is the sum of the cluster-specific
    discrepancies.

49
Applications
  • 16 prostate cancer samples and 9 benign tumor
    samples
  • 6500 genes
  • Use hierarchical clustering to obtain 2, 3, and 4
    clusters.
  • Question: are these clusters reliable?

50
(No Transcript)
51
(No Transcript)
52
Issues with calculating R and D indices
  • How large should the perturbation be?
  • How to quantify the significance level?
  • What about nested consistency?

53
Acknowledgment
  • Slide sources from Cheng Li