Title: Clustering methods used in microarray data analysis
1 Clustering methods used in microarray data analysis
- Steve Horvath
- Human Genetics and Biostatistics
- UCLA
- Acknowledgement: based in part on lecture notes from Darlene Goldstein's web site http://ludwig-sun2.unil.ch/darlene/
2 Contents
- Background: clustering
- k-means clustering
- hierarchical clustering
3 References for clustering
- Gentleman, Carey, et al. Bioinformatics and Computational Biology Solutions Using R. Springer. Chapters 11, 12, 13
- T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer Series in Statistics
- L. Kaufman, P. Rousseeuw (1990) Finding Groups in Data. Wiley Series in Probability
4 Clustering
- Historically, objects are clustered into groups
- periodic table of the elements (chemistry)
- taxonomy (zoology, botany)
- Why cluster?
- Understand the global structure of the data: see the forest instead of the trees
- Detect heterogeneity in the data, e.g. different tumor classes
- Find biological pathways (cluster gene expression profiles)
- Find data outliers (cluster microarray samples)
5 Classification, Clustering and Prediction
- WARNING
- Many people talk about classification when they mean clustering (unsupervised learning)
- Other people talk about classification when they mean prediction (supervised learning)
- Usually, the meaning is context specific. I prefer to avoid the term classification and to talk about clustering or prediction or another more specific term.
- Common denominator: classification divides objects into groups based on a set of values
- Unlike a theory, clustering is neither true nor false, and should be judged largely on the usefulness of its results.
- CLUSTERING IS AND ALWAYS WILL BE SOMEWHAT OF AN ART FORM
- However, a classification (clustering) may be useful for suggesting a theory, which could then be tested
6 Cluster analysis
- Addresses the problem: given n objects, each described by p variables (or features), derive a useful division into a number of classes
- Usually want a partition of objects
- But also fuzzy clustering
- Could also take an exploratory perspective
- Unsupervised learning
7 Difficulties in defining a cluster
8 Wordy Definition
Cluster analysis aims to group or segment a
collection of objects into subsets or "clusters",
such that those within each cluster are more
closely related to one another than objects
assigned to different clusters. An object can
be described by a set of measurements (e.g.
covariates, features, attributes) or by its
relation to other objects. Sometimes the goal
is to arrange the clusters into a natural
hierarchy, which involves successively grouping
or merging the clusters themselves so that at
each level of the hierarchy clusters within the
same group are more similar to each other than
those in different groups.
9 Clustering Gene Expression Data
- Can cluster genes (rows), e.g. to (attempt to) identify groups of co-regulated genes
- Can cluster samples (columns), e.g. to identify tumors based on profiles
- Can cluster both rows and columns at the same time (to my knowledge, not in R)
10 Clustering Gene Expression Data
- Leads to readily interpretable figures
- Can be helpful for identifying patterns in time or space
- Useful (essential?) when seeking new subclasses of samples
- Can be used for exploratory purposes
11 Similarity/Proximity
- Similarity s_ij indicates the strength of the relationship between two objects i and j
- Usually 0 <= s_ij <= 1
- Ex 1: absolute value of the Pearson correlation coefficient (sketched in R below)
- Use of correlation-based similarity is quite common in gene expression studies but is in general contentious...
- Ex 2: co-expression network methods (topological overlap matrix)
- Ex 3: random forest similarity
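For Ex 1, a minimal R sketch of a correlation-based similarity between gene expression profiles; the matrix datExpr (samples in rows, genes in columns) is made up for illustration:

  # hypothetical expression data: 20 samples (rows) x 50 genes (columns)
  datExpr <- matrix(rnorm(20 * 50), nrow = 20)
  # similarity = absolute value of the Pearson correlation between gene profiles
  sim <- abs(cor(datExpr))
  range(sim)   # values lie between 0 and 1; the diagonal equals 1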
12 Proximity matrices are the input to most clustering algorithms
Proximity between pairs of objects: similarity or dissimilarity. If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities. Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does not have to hold.
Triangle inequality: d(i,j) \le d(i,k) + d(k,j) for all i, j, k
13 Dissimilarity and Distance
- Associated with similarity measures s_ij bounded by 0 and 1 is a dissimilarity d_ij = 1 - s_ij
- Distance measures have the metric property (d_ij \le d_ik + d_kj)
- Many examples: Euclidean (as the crow flies), Manhattan (city block), etc.; see the R example below
- The distance measure has a large effect on performance
- The behavior of a distance measure is related to the scale of measurement
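A brief R illustration of these distances and of converting a similarity into a dissimilarity (the small matrix x is made up):

  x <- matrix(c(1, 2, 4, 6, 0, 3), nrow = 3)   # 3 objects described by 2 variables
  dist(x, method = "euclidean")                # "as the crow flies"
  dist(x, method = "manhattan")                # city-block distance
  # for a similarity matrix sim (e.g. from the previous sketch): as.dist(1 - sim)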
14 Partitioning Methods
- Partition the objects into a prespecified number of groups K
- Iteratively reallocate objects to clusters until some criterion is met (e.g. minimize within-cluster sums of squares)
- Examples: k-means, self-organizing maps (SOM), partitioning around medoids (PAM), model-based clustering
15 K-means clustering
- Prespecify the number of clusters K, and the cluster centers
- Minimize the within-cluster sum of squares from the centers
- Iterate (until cluster assignments do not change):
- For a given cluster assignment, find the cluster means
- For a given set of means, minimize the within-cluster sum of squares by allocating each object to the closest cluster mean
- Intended for situations where all variables are quantitative, with (squared) Euclidean distance (so scale variables suitably before use); a short R call follows below
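A minimal sketch with the base-R kmeans function; the simulated matrix x and the choice K = 3 are illustrative:

  set.seed(1)
  x <- scale(matrix(rnorm(100 * 5), nrow = 100))   # 100 objects, 5 scaled quantitative variables
  km <- kmeans(x, centers = 3)                     # K = 3 clusters
  table(km$cluster)                                # cluster sizes
  km$tot.withinss                                  # within-cluster sum of squares (the objective)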
16 PAM clustering
- Also need to prespecify the number of clusters K
- Unlike K-means, the cluster centers (medoids) are objects, not averages of objects
- Can use a general dissimilarity
- Minimizes (unsquared) distances from objects to cluster centers, so more robust than K-means (see the sketch below)
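A corresponding sketch with pam from the cluster package, reusing the illustrative matrix x from above; pam also accepts a precomputed dissimilarity:

  library(cluster)
  pm <- pam(x, k = 3)        # medoids are actual observations
  pm$medoids
  table(pm$clustering)
  # pam(as.dist(1 - sim), k = 3, diss = TRUE) would cluster a general dissimilarity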
17 Combinatorial clustering algorithms. Example: K-means clustering
18 Clustering algorithms
- Goal: partition the observations into groups ("clusters") so that the pairwise dissimilarities between those assigned to the same cluster tend to be smaller than those in different clusters.
- 3 types of clustering algorithms: mixture modeling, mode seekers (e.g. the PRIM algorithm), and combinatorial algorithms.
- We focus on the most popular combinatorial algorithms.
19 Combinatorial clustering algorithms
- The most popular clustering algorithms directly assign each observation to a group or cluster without regard to a probability model describing the data.
- Notation: label observations by an integer i in 1,...,N and clusters by an integer k in 1,...,K.
- The cluster assignments can be characterized by a many-to-one mapping C(i) that assigns the i-th observation to the k-th cluster: C(i) = k (also known as an encoder).
- One seeks a particular encoder C(i) that minimizes a particular loss function (also known as an energy function).
20 Loss functions for judging clusterings
- One seeks a particular encoder C(i) that minimizes a particular loss function (also known as an energy function).
- Example: the within-cluster point scatter
W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})
21 Cluster analysis by combinatorial optimization
- Straightforward in principle: simply minimize W(C) over all possible assignments of the N data points to K clusters.
- Unfortunately, such optimization by complete enumeration is feasible only for small data sets.
- For this reason, practical clustering algorithms are able to examine only a fraction of all possible encoders C.
- The goal is to identify a small subset that is likely to contain the optimal one, or at least a good sub-optimal partition.
- Feasible strategies are based on iterative greedy descent.
22 K-means clustering is a very popular iterative descent clustering method
- Setting: all variables are of the quantitative type and one uses the squared Euclidean distance d(x_i, x_{i'}) = \|x_i - x_{i'}\|^2.
- In this case the within-cluster point scatter becomes
W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \|x_i - x_{i'}\|^2
- Note that this can be re-expressed as
W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2
where \bar{x}_k is the mean vector of cluster k and N_k is the number of observations in cluster k.
23 Thus one can obtain the optimal C by solving the enlarged optimization problem
\min_{C,\ \{m_k\}_{k=1}^{K}} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - m_k\|^2
This can be minimized by the alternating optimization procedure given on the next slide.
24 The K-means clustering algorithm leads to a local minimum
- 1. For a given cluster assignment C, the total cluster variance
TotVar(C, m) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - m_k\|^2
is minimized with respect to m_1,...,m_K, yielding the means of the currently assigned clusters, i.e. find the cluster means.
- 2. Given the current set of means, TotVar is minimized by assigning each observation to the closest (current) cluster mean. That is,
C(i) = \arg\min_k \|x_i - m_k\|^2
- 3. Steps 1 and 2 are iterated until the assignments do not change (a bare-bones R sketch follows below).
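A bare-bones R sketch of these alternating steps, for illustration only (the function name lloyd_sketch is made up, empty clusters are not handled, and in practice one would simply call kmeans):

  lloyd_sketch <- function(x, K, max_iter = 100) {
    C <- sample(1:K, nrow(x), replace = TRUE)          # random initial assignment
    for (iter in 1:max_iter) {
      # Step 1: for the given assignment, the means minimize the total cluster variance
      m <- t(sapply(1:K, function(k) colMeans(x[C == k, , drop = FALSE])))
      # Step 2: assign each observation to the closest current cluster mean
      d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, m[k, ])^2))
      Cnew <- max.col(-d2)
      # Step 3: stop when the assignments do not change
      if (all(Cnew == C)) break
      C <- Cnew
    }
    list(cluster = C, centers = m)
  }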
25 Recommendations for k-means clustering
- Either: start with many different random choices of starting means, and choose the solution having the smallest value of the objective function (see the R sketch below).
- Or: use another clustering method (e.g. hierarchical clustering) to determine an initial set of cluster centers.
26 Agglomerative clustering, hierarchical clustering and dendrograms
27 Hierarchical clustering plot
28 Hierarchical Clustering
- Produces a dendrogram
- Avoids prespecification of the number of clusters K
- The tree can be built in two distinct ways:
- Bottom-up: agglomerative clustering
- Top-down: divisive clustering
29 Agglomerative Methods
- Start with n mRNA sample (or G gene) clusters
- At each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters
- Examples of between-cluster dissimilarities (see the R sketch below):
- Unweighted Pair Group Method with Arithmetic Mean (UPGMA): average of pairwise dissimilarities
- Single-link (NN, nearest neighbor): minimum of pairwise dissimilarities
- Complete-link (FN, furthest neighbor): maximum of pairwise dissimilarities
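These between-cluster dissimilarities correspond to the method argument of hclust in R (illustrative calls on a pairwise distance matrix d, with x as in the earlier sketches):

  d <- dist(x)                                # pairwise dissimilarities
  hc_avg <- hclust(d, method = "average")     # UPGMA / group average
  hc_sgl <- hclust(d, method = "single")      # single linkage (nearest neighbor)
  hc_cpl <- hclust(d, method = "complete")    # complete linkage (furthest neighbor)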
30 Agglomerative clustering
- Agglomerative clustering algorithms begin with every observation representing a singleton cluster.
- At each of the N-1 steps, the closest two (least dissimilar) clusters are merged into a single cluster.
- Therefore a measure of dissimilarity between two clusters must be defined.
31 Between-cluster distances (also known as linkage methods)
32 Different intergroup dissimilarities
Let G and H represent two groups. The intergroup dissimilarities used by the three linkage methods are
d_{SL}(G,H) = \min_{i \in G,\, i' \in H} d_{ii'} (single linkage)
d_{CL}(G,H) = \max_{i \in G,\, i' \in H} d_{ii'} (complete linkage)
d_{GA}(G,H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'} (group average)
33 Comparing different linkage methods
- If there is a strong clustering tendency, all 3 methods produce similar results.
- Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"). Good for elongated clusters.
- Bad: complete linkage may lead to clusters where observations assigned to a cluster can be much closer to members of other clusters than they are to some members of their own cluster. Use for very compact clusters (like pearls on a string).
- Group average clustering represents a compromise between the extremes of single and complete linkage. Use for ball-shaped clusters.
34 Dendrogram
- Recursive binary splitting/agglomeration can be represented by a rooted binary tree.
- The root node represents the entire data set.
- The N terminal nodes of the tree represent individual observations.
- Each nonterminal node ("parent") has two daughter nodes.
- Thus the binary tree can be plotted so that the height of each node is proportional to the value of the intergroup dissimilarity between its two daughters.
- A dendrogram provides a complete description of the hierarchical clustering in graphical format (a plotting sketch follows below).
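In R the dendrogram can be plotted and cut at a chosen number of groups (hc_avg is the average-linkage tree from the earlier sketch; k = 3 is illustrative):

  plot(hc_avg, labels = FALSE, main = "Average linkage dendrogram")
  groups <- cutree(hc_avg, k = 3)    # cut the tree into 3 clusters
  table(groups)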
35 Comments on dendrograms
- Caution: different hierarchical methods, as well as small changes in the data, can lead to different dendrograms.
- Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.
- In general, dendrograms are a description of the results of the algorithm and not a graphical summary of the data.
- A dendrogram is only a valid summary to the extent that the pairwise observation dissimilarities obey the ultrametric inequality
d_{ii'} \le \max(d_{ik}, d_{i'k}) for all i, i', k
36 Figure 1: dendrograms under average, complete, and single linkage
37 Divisive Methods
- Start with only one cluster
- At each step, split clusters into two parts
- Advantage: obtain the main structure of the data (i.e. focus on the upper levels of the dendrogram)
- Disadvantage: computational difficulties when considering all possible divisions into two groups
38 Discussion
39 Partitioning vs. Hierarchical
- Partitioning
- Advantage: provides clusters that satisfy some optimality criterion (approximately)
- Disadvantages: need initial K, long computation time
- Hierarchical
- Advantage: fast computation (agglomerative)
- Disadvantages: rigid, cannot correct later for erroneous decisions made earlier
- Word on the street: most data analysts prefer hierarchical clustering over partitioning methods when it comes to gene expression data
40 Generic Clustering Tasks
- Estimating the number of clusters
- Assigning each object to a cluster
- Assessing the strength/confidence of cluster assignments for individual objects
- Assessing cluster homogeneity
41 How many clusters K?
- Many suggestions for how to decide this!
- Milligan and Cooper (Psychometrika 50:159-179, 1985) studied 30 methods
- A number of newer methods, including GAP (Tibshirani; a clusGap sketch follows below) and clest (Fridlyand and Dudoit; uses bootstrapping); see also the prediction strength methods at
- http://www.genetics.ucla.edu/labs/horvath/GeneralPredictionStrength/
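The GAP statistic is implemented as clusGap in the cluster package; a sketch with illustrative settings (K.max = 10, B = 50) on the matrix x from the earlier sketches:

  library(cluster)
  gap <- clusGap(x, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
  plot(gap)                                        # gap curve versus number of clusters K
  maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])     # suggested K from gap values and standard errors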
42 R clustering
- A number of R packages (libraries) contain functions to carry out clustering, including
- mva: kmeans, hclust
- cluster: pam (among others)
- cclust: convex clustering, also methods to estimate K
- mclust: model-based clustering (sketched below)
- GeneSOM
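For completeness, a minimal model-based clustering sketch with mclust (kmeans and hclust now ship with base R's stats package; the range G = 1:5 is an illustrative choice):

  library(mclust)
  mc <- Mclust(x, G = 1:5)      # Gaussian mixture models with 1 to 5 components, chosen by BIC
  summary(mc)
  table(mc$classification)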