1
Important clustering methods used in microarray
data analysis
  • Steve Horvath
  • Human Genetics and Biostatistics
  • UCLA

2
Contents
  • Multidimensional scaling plots
  • Relation to principal component analysis
  • k-means clustering
  • Hierarchical clustering

3
Introduction to clustering
4
MDS plot of clusters
5
MDS plot of clusters
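Slides 4 and 5 show MDS plots of clusters; the figures themselves are not reproduced in this transcript. As a rough illustration, the sketch below computes classical metric MDS coordinates by double-centering squared Euclidean distances on a small simulated data set; with Euclidean distances this construction is closely related to principal component analysis. The data and variable names are illustrative, not taken from the presentation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Simulated samples from 2 clusters (made-up data for the example).
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, size=(20, 50)),
               rng.normal(2, 1, size=(20, 50))])

# Classical (metric) MDS: double-center the squared distance matrix ...
D2 = squareform(pdist(X)) ** 2
n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J

# ... and take the top 2 eigenvectors, scaled by sqrt(eigenvalue).
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1][:2]
coords = vecs[:, order] * np.sqrt(vals[order])   # 2-D MDS coordinates

print(coords.shape)   # (40, 2): one point per sample, ready to scatter-plot
```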
6
Two references for clustering
  • T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer Series in Statistics
  • L. Kaufman, P. Rousseeuw (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics

7
Introduction to clustering
Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that objects within each cluster are more closely related to one another than objects assigned to different clusters.

An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects.

Sometimes the goal is to arrange the clusters into a natural hierarchy, which involves successively grouping or merging the clusters themselves so that at each level of the hierarchy clusters within the same group are more similar to each other than those in different groups.

8
Proximity matrices are the input to most clustering algorithms
Proximity between pairs of objects is measured by a similarity or a dissimilarity. If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities. Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does not have to hold.
Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j) for all objects i, j, k.
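As a rough illustration of the two options above, the sketch below builds a symmetric dissimilarity matrix for a small simulated expression matrix, either directly from Euclidean distances or by converting a correlation similarity with the monotone-decreasing map s → 1 − s; the data are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Simulated expression matrix: 6 samples (rows) x 100 genes (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 100))

# Option 1: Euclidean distances between samples (a true metric).
D_euclidean = squareform(pdist(X, metric="euclidean"))

# Option 2: start from a similarity (Pearson correlation) and convert it
# to a dissimilarity with the monotone-decreasing map s -> 1 - s.
S = np.corrcoef(X)    # similarity matrix, entries in [-1, 1]
D_corr = 1.0 - S      # symmetric dissimilarity; need not satisfy the triangle inequality

print(D_euclidean.shape, D_corr.shape)  # both (6, 6)
```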
9
Different intergroup dissimilarities
Let G and H represent 2 groups (clusters) of objects, and let d(i, j) denote the dissimilarity between objects i and j.
  • Single linkage (nearest neighbor): d_SL(G, H) = the minimum of d(i, j) over i in G, j in H.
  • Complete linkage (furthest neighbor): d_CL(G, H) = the maximum of d(i, j) over i in G, j in H.
  • Group average: d_GA(G, H) = the average of d(i, j) over all pairs i in G, j in H.
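The helper below (an illustrative name, not from the presentation) computes these three intergroup dissimilarities for two small simulated groups using plain NumPy.

```python
import numpy as np

def intergroup_dissimilarities(G, H):
    """Single, complete, and group-average dissimilarity between two groups.

    G, H: arrays of shape (n_G, p) and (n_H, p) holding the objects' features.
    """
    # Pairwise Euclidean distances d(i, j) for i in G, j in H.
    diff = G[:, None, :] - H[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return {
        "single (min)": d.min(),
        "complete (max)": d.max(),
        "group average": d.mean(),
    }

rng = np.random.default_rng(1)
G = rng.normal(loc=0.0, size=(5, 2))   # simulated group G
H = rng.normal(loc=3.0, size=(7, 2))   # simulated group H
print(intergroup_dissimilarities(G, H))
```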
10
Agglomerative clustering, hierarchical clustering
and dendrograms
11
Hierarchical clustering plot
12
Agglomerative clustering
  • Agglomerative clustering algorithms begin with every observation representing a singleton cluster.
  • At each of the N-1 subsequent steps, the 2 closest (least dissimilar) clusters are merged into a single cluster.
  • Therefore a measure of dissimilarity between 2 clusters must be defined (see the sketch below).
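A minimal sketch of agglomerative clustering with SciPy on simulated data, where the between-cluster dissimilarity is selected via the linkage method; all inputs are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))      # 10 observations, 4 features

d = pdist(X, metric="euclidean")  # condensed dissimilarity vector
Z = linkage(d, method="average")  # "single", "complete", or "average"

# Z has N-1 rows; each row records one merge:
# (cluster index 1, cluster index 2, intergroup dissimilarity, new cluster size)
print(Z.shape)                    # (9, 4)
```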

13
Comparing different linkage methods
  • If there is a strong clustering tendency, all 3 methods produce similar results.
  • Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"). Good for elongated clusters.
  • Drawback: complete linkage may lead to clusters where observations assigned to a cluster can be much closer to members of other clusters than they are to some members of their own cluster. Use for very compact clusters (like pearls on a string).
  • Group average clustering represents a compromise between the extremes of single and complete linkage. Use for ball-shaped clusters (see the comparison sketched below).
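The comparison sketched below runs single, complete, and average linkage on the same simulated two-group data and cuts each tree into two clusters; with such a strong clustering tendency the three partitions typically agree. The data are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Two well-separated simulated groups of samples.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(15, 5)),
               rng.normal(4, 1, size=(15, 5))])
d = pdist(X)

for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```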

14
Dendrogram
  • Recursive binary splitting/agglomeration can be represented by a rooted binary tree.
  • The root node represents the entire data set.
  • The N terminal nodes of the tree represent individual observations.
  • Each nonterminal node ("parent") has two daughter nodes.
  • Thus the binary tree can be plotted so that the height of each node is proportional to the intergroup dissimilarity between its 2 daughters.
  • A dendrogram provides a complete description of the hierarchical clustering in graphical format (a plotting sketch follows below).
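A minimal plotting sketch, assuming Matplotlib is available: it draws the dendrogram of an average-linkage clustering, with node heights equal to the merge dissimilarities; the data are simulated.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 3))

Z = linkage(pdist(X), method="average")
dendrogram(Z)                      # node heights = merge dissimilarities
plt.ylabel("intergroup dissimilarity")
plt.show()
```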

15
Comments on dendrograms
  • Caution: different hierarchical methods, as well as small changes in the data, can lead to different dendrograms.
  • Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.
  • In general, dendrograms are a description of the results of the algorithm and not a graphical summary of the data.
  • They are a valid summary only to the extent that the pairwise observation dissimilarities obey the ultrametric inequality
    d(i, i') ≤ max{ d(i, k), d(i', k) } for all i, i', k (a numerical check is sketched below).
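As an illustrative check (simulated data; the helper name is made up for the example): the cophenetic dissimilarities read off a dendrogram satisfy the ultrametric inequality by construction, whereas the raw pairwise distances generally need not.

```python
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def is_ultrametric(D, tol=1e-10):
    """Check d(i, i') <= max(d(i, k), d(i', k)) for all triples i, i', k."""
    n = D.shape[0]
    for i, j in combinations(range(n), 2):
        for k in range(n):
            if D[i, j] > max(D[i, k], D[j, k]) + tol:
                return False
    return True

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 3))
d = pdist(X)
Z = linkage(d, method="average")

coph = squareform(cophenet(Z))        # dissimilarities read off the dendrogram
print(is_ultrametric(squareform(d)))  # usually False for raw Euclidean distances
print(is_ultrametric(coph))           # True: dendrogram heights are ultrametric
```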
16
Figure 1: dendrograms produced by average, complete, and single linkage (panels not reproduced in this transcript).