Cluster analysis for microarray data

Provided by: adm25

1
Cluster analysis for microarray data

Anja von Heydebreck

2
Aim of clustering: group objects according to
their similarity.
Cluster: a set of objects that are similar to
each other and separated from the
other objects.
Example: green/red data points were generated
from two different normal distributions.
3
Clustering microarray data
Genes and experiments/samples are given as the
row and column vectors of a gene expression data
matrix.
Clustering may be applied either to genes
(vectors in R^n) or to experiments (vectors in R^p).

[Figure: gene expression data matrix, p genes (rows) x n experiments (columns)]
4
Why cluster genes?

Identify groups of possibly co-regulated genes
(e.g. in conjunction with sequence data).
Identify typical temporal or spatial gene
expression patterns (e.g. cell cycle data).
Arrange a set of genes in a linear order that is
at least not totally meaningless.

5
Why cluster experiments/samples?

Quality control: detect experimental
artifacts/bad hybridizations.
Check whether samples are grouped according to
known categories (though this might be better
addressed using a supervised approach:
statistical tests, classification).
Identify new classes of biological samples (e.g.
tumor subtypes).

6
Alizadeh et al., Nature 403:503-11, 2000
7
Cluster analysis

Generally, cluster analysis is based on two
ingredients:
Distance measure: quantification of
(dis-)similarity of objects.
Cluster algorithm: a procedure to group objects.
Aim: small within-cluster distances, large
between-cluster distances.

8
Some distance measures

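Given vectors $x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n)$:
Euclidean distance: $d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Manhattan distance: $d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
Correlation distance: $d_c(x, y) = 1 - r(x, y)$, where $r$ is the Pearson correlation coefficient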

9
Which distance measure to use?

The choice of distance measure should be based
on the application area. What sort of
similarities would you like to detect?
Correlation distance $d_c$ measures trends/relative
differences:
$d_c(x, y) = d_c(ax + b, y)$ if $a > 0$.

Example: $x = (1, 1, 1.5, 1.5)$, $y = (2.5, 2.5, 3.5, 3.5) = 2x + 0.5$, $z = (1.5, 1.5, 1, 1)$.
$d_c(x, y) = 0$, $d_c(x, z) = 2$; $d_E(x, z) = 1$, $d_E(x, y) = 3.54$.
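
The numbers in this example can be checked with a few lines of Python. This is a minimal sketch that takes correlation distance to be 1 minus the Pearson correlation (which matches the values 0 and 2 quoted on the slide); the function names are illustrative and not from the presentation.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    """Assumed definition: 1 - Pearson correlation (ranges from 0 to 2)."""
    return 1.0 - pearson(x, y)

x = [1, 1, 1.5, 1.5]
y = [2.5, 2.5, 3.5, 3.5]  # y = 2x + 0.5: same trend as x
z = [1.5, 1.5, 1, 1]      # opposite trend, but close to x in absolute terms

print(correlation_distance(x, y), correlation_distance(x, z))
print(euclidean(x, z), round(euclidean(x, y), 2))
```

Correlation distance treats x and y as identical and x and z as maximally different, while Euclidean distance ranks them the other way round.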
10
Which distance measure to use?

Euclidean and Manhattan distance both measure
absolute differences between vectors. Manhattan
distance is more robust against outliers.
May apply standardization to the observations:
subtract the mean and divide by the standard
deviation.
After standardization, Euclidean and correlation
distance are equivalent (they rank pairs of vectors in the same order).
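
The link between the two distances after standardization can be made concrete. For vectors standardized with the 1/n standard deviation, the squared Euclidean distance equals 2n times (1 minus the Pearson correlation). The sketch below checks this numerically; the example vectors are arbitrary and the helper names are illustrative.

```python
import math

def standardize(v):
    """Subtract the mean and divide by the standard deviation (1/n version)."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / sd for a in v]

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1.0, 3.0, 2.0, 5.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

lhs = squared_euclidean(standardize(x), standardize(y))
rhs = 2 * len(x) * (1.0 - pearson(x, y))
print(lhs, rhs)  # the two values agree up to rounding error
```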

11
K-means clustering

Input: N objects given as data points in R^p.
Specify the number k of clusters.
Initialize k cluster centers. Iterate until
convergence:
- Assign each object to the cluster with the
closest center (w.r.t. Euclidean distance).
- The centroids/mean vectors of the obtained
clusters are taken as new cluster centers.
K-means can be seen as an optimization problem:
minimize the sum of squared within-cluster
distances, W(C).
Results depend on the initialization. Use several
starting points and choose the best solution
(with minimal W(C)).
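
The procedure above can be sketched in plain Python. This is an illustrative implementation, not the presenter's code. It assumes that W(C) is the sum of squared Euclidean distances from each object to its cluster center, that centers are initialized with k randomly chosen data points, and that an empty cluster simply keeps its previous center. The names `kmeans`, `kmeans_once` and `within_ss` are made up for this example.

```python
import random

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def mean_vector(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def within_ss(points, labels, centers):
    """Assumed W(C): sum of squared distances of objects to their cluster center."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

def kmeans_once(points, k, rng, max_iter=100):
    centers = rng.sample(points, k)  # initialize with k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: closest center w.r.t. Euclidean distance
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        # Update step: centroids of the obtained clusters become the new centers
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = mean_vector(members)
    return labels, centers

def kmeans(points, k, n_starts=10, seed=0):
    """Run several random starts and keep the solution with minimal W(C)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        labels, centers = kmeans_once(points, k, rng)
        w = within_ss(points, labels, centers)
        if best is None or w < best[0]:
            best = (w, labels, centers)
    return best

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
w, labels, centers = kmeans(points, k=2)
print(w, labels)
```

On this toy data set the two well-separated groups of three points are recovered; on real expression data different starts can end in different local optima, which is why the restarts matter.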

12
K-means/PAM: How to choose K (the number of
clusters)?

There is no easy answer.
Many heuristic approaches try to compare the
quality of clustering results for different
values of K (for an overview see Dudoit/Fridlyand
2002).
The problem can be better addressed in
model-based clustering, where each cluster
represents a probability distribution, and a
likelihood-based framework can be used.

13
Hierarchical clustering

Similarity of objects is represented in a tree
structure (dendrogram).
Advantage: no need to specify the number of
clusters in advance. Nested clusters can be
represented.

Golub data: different types of leukemia.
Clustering based on the 150 genes with highest
variance across all samples.
14
Agglomerative hierarchical clustering

Bottom-up algorithm (top-down (divisive) methods
are less common).
Start with each object as its own cluster.
In each iteration, merge the two clusters with
the minimal distance from each other - until you
are left with a single cluster comprising all
objects.
But what is the distance between two clusters?

15
Distances between clusters used for hierarchical
clustering

Calculation of the distance between two
clusters is based on the pairwise distances
between members of the clusters.
Complete linkage: largest distance
Average linkage: average distance
Single linkage: smallest distance

Complete linkage gives preference to
compact/spherical clusters. Single linkage can
produce long stretched clusters.
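
A naive version of agglomerative clustering with the three linkage rules can be written as follows. This is a sketch for illustration only: it recomputes all cluster distances in every iteration (fine for a handful of objects, far too slow for thousands of genes), uses Euclidean distance as an assumed pairwise measure, and the names `agglomerate` and `LINKAGES` are made up here.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

# Each linkage rule reduces the list of pairwise member distances to one number.
LINKAGES = {
    "complete": max,                          # largest distance
    "single": min,                            # smallest distance
    "average": lambda ds: sum(ds) / len(ds),  # average distance
}

def cluster_distance(points, c1, c2, linkage):
    ds = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    return LINKAGES[linkage](ds)

def agglomerate(points, linkage="average"):
    """Return the merge history as (members_a, members_b, height) tuples."""
    clusters = [[i] for i in range(len(points))]  # every object is a cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(points, clusters[a], clusters[b], linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

points = [[0.0], [1.0], [3.0], [7.0]]
for name in ("single", "average", "complete"):
    heights = [h for _, _, h in agglomerate(points, name)]
    print(name, heights)
```

The merge heights are the node heights of the resulting dendrogram; for the same data they are smallest under single linkage and largest under complete linkage.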
16
Hierarchical clustering

The height of a node in the dendrogram represents
the distance of the two children clusters.
Loss of information: n objects have n(n-1)/2
pairwise distances, tree has n-1 inner nodes.
The ordering of the leaves is not uniquely
defined by the dendrogram: 2^(n-2) possible choices.

Golub data: different types of leukemia.
Clustering based on the 150 genes with highest
variance across all samples.
17
Alternative: direct visualization of
similarity/distance matrices
Useful if one wants to investigate a specific
factor (advantage: no loss of information). Sort
experiments according to that factor.
18
Clustering of time course data

Suppose we have expression data from different
time points $t_1, \dots, t_n$, and want to identify
typical temporal expression profiles by
clustering the genes.
Usual clustering methods/distance measures don't
take the ordering of the time points into account:
the result would be the same if the time points
were permuted.
Simple modification: consider the difference
$y_{ij} = x_{i(j+1)} - x_{ij}$ between consecutive time points as
an additional observation. Then apply a
clustering algorithm such as K-means to the
augmented data matrix.
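
The augmentation step is small enough to show directly. The sketch below appends the consecutive differences to each gene's row and illustrates the point about permutations: distances on the raw data are unchanged when time points are reordered, distances on the augmented data are not. The function name and the toy profiles are assumptions made for this example.

```python
def augment_with_differences(matrix):
    """Append y_ij = x_i(j+1) - x_ij to every row (one row per gene)."""
    augmented = []
    for row in matrix:
        diffs = [row[j + 1] - row[j] for j in range(len(row) - 1)]
        augmented.append(list(row) + diffs)
    return augmented

def permute_columns(matrix, order):
    return [[row[j] for j in order] for row in matrix]

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

genes = [[1, 2, 3, 4],   # steadily rising profile
         [4, 3, 2, 1]]   # steadily falling profile
shuffled = permute_columns(genes, [0, 2, 1, 3])  # swap two time points

raw = sq_dist(*genes)
raw_shuffled = sq_dist(*shuffled)
aug = sq_dist(*augment_with_differences(genes))
aug_shuffled = sq_dist(*augment_with_differences(shuffled))
print(raw, raw_shuffled)  # identical: ordering is ignored
print(aug, aug_shuffled)  # different: ordering now matters
```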

19
Biclustering

Usual clustering algorithms are based on global
similarities of rows or columns of an expression
data matrix.
But the similarity of the expression profiles of
a group of genes may be restricted to certain
experimental conditions.
Goal of biclustering: identify homogeneous
submatrices.
Difficulties: computational complexity, assessing
the statistical significance of results.
Example: Tanay et al. 2002.

20
The role of feature selection

Sometimes, people first select genes that appear
to be differentially expressed between groups of
samples. Then they cluster the samples based on
the expression levels of these genes. Is it
remarkable if the samples then cluster into the
two groups?
No, this doesn't prove anything, because the
genes were selected with respect to the two
groups! Such effects can even be obtained with a
matrix of i.i.d. random numbers.
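
The effect is easy to reproduce in a small simulation. The sketch below is an illustration under assumed settings (2000 genes, two arbitrary groups of 10 samples, 50 selected genes, selection by absolute difference of group means); none of these numbers come from the presentation. It measures how well the groups separate as the ratio of the mean between-group to the mean within-group sample distance.

```python
import math
import random

def separation(rows, groups):
    """Mean between-group sample distance divided by mean within-group distance."""
    n = len(groups)
    between, within = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((r[i] - r[j]) ** 2 for r in rows))
            (within if groups[i] == groups[j] else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))

def simulate(n_genes=2000, n_per_group=10, n_selected=50, seed=1):
    rng = random.Random(seed)
    n = 2 * n_per_group
    # i.i.d. standard normal "expression" matrix: rows = genes, columns = samples
    data = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_genes)]
    groups = [0] * n_per_group + [1] * n_per_group  # labels carry no signal

    def mean_diff(row):
        a = sum(row[:n_per_group]) / n_per_group
        b = sum(row[n_per_group:]) / n_per_group
        return abs(a - b)

    # Feature selection that uses the group labels
    selected = sorted(data, key=mean_diff, reverse=True)[:n_selected]
    return separation(data, groups), separation(selected, groups)

ratio_all, ratio_selected = simulate()
print(ratio_all, ratio_selected)
```

With all genes the ratio stays close to 1 (no group structure), while on the selected genes it is clearly larger, although the data contain no real signal.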

21
Classification: additional class information
given
[Figure: object-space]
22
Classification methods

Linear Discriminant Analysis (LDA, Fisher)
Nearest neighbor procedures
Neural nets
Support vector machines

23
Embedding methods

Attempt to represent high-dimensional objects as
points in a low-dimensional space (2 or 3
dimensions)
Principal Component Analysis
Correspondence Analysis
(Multidimensional Scaling represents objects for
which distances are given)

24
Principal Component Analysis

Finds a low-dimensional projection such that the
sum-of-squares of the residuals is minimized
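
For a one-dimensional projection this can be illustrated without any library. The sketch below finds the first principal component by power iteration on the covariance matrix and compares the residual sum of squares with that of projections onto the coordinate axes. Power iteration is one possible way to obtain the component, chosen here for brevity; it assumes the start vector is not orthogonal to the leading direction. The data and function names are made up for the example.

```python
import math

def center(points):
    n = len(points)
    means = [sum(col) / n for col in zip(*points)]
    return [[v - m for v, m in zip(row, means)] for row in points]

def first_pc(centered, n_iter=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    n, p = len(centered), len(centered[0])
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(p)]
           for a in range(p)]
    w = [1.0] * p  # start vector; assumed not orthogonal to the first PC
    for _ in range(n_iter):
        w = [sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w

def residual_ss(centered, w):
    """Sum of squared residuals after projecting onto the unit vector w."""
    total = 0.0
    for r in centered:
        score = sum(a * b for a, b in zip(r, w))
        total += sum((a - score * b) ** 2 for a, b in zip(r, w))
    return total

points = [[2.0, 1.9], [1.0, 1.1], [0.0, 0.1], [-1.0, -1.2], [-2.0, -1.9]]
centered = center(points)
w = first_pc(centered)
print(w)
print(residual_ss(centered, w),
      residual_ss(centered, [1.0, 0.0]),
      residual_ss(centered, [0.0, 1.0]))
```

The points lie close to a diagonal line, so the projection onto the first principal component leaves much smaller residuals than the projection onto either axis.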

25
Contingency tables: results from a poll among
people of different nationality

                    Nationality
Favorite dish       Ge   Au   It
Pasta               35   12   11
Wiener Schnitzel    23   11   18
Sauerbraten          7    9   19
26
Microarray experiment

                          Genes
Experimental condition    Cdc14   Gle2   Gal1
Wt yeast                     35     12     11
wt yeast galactose           23     11     18
Yeast transgene               7      9     19
27
Correspondence Analysis and Principal Component
Analysis

Objects are depicted as points in a plane or in
three-dimensional space, trying to maintain their
proximity
CA is a variant of PCA, although based on another
distance measure: the χ²-distance.
Embedding tries to preserve the χ²-distance.

28
Correspondence analysis: interpretation

Shows both gene-vectors and condition-vectors as
dots in the plane
Genes that are nearby are similar
Conditions that are nearby are similar
When genes and conditions point in the same
direction then the gene is up-regulated in that
condition.

29
χ²-distance
$$d_{kl} = \sum_{j=1}^{n} \frac{\left(f_{kj}/f_{k} - f_{lj}/f_{l}\right)^2}{f_{j}/n}$$
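
As a worked illustration, the χ²-distance between two rows of a contingency table can be computed as below. The formula on the slide is only partly legible, so this sketch assumes the usual correspondence-analysis convention: row entries are divided by their row total (row profiles), and each squared profile difference is divided by the column total over the grand total. The function name is made up, and the table holds the poll counts from the contingency-table slide.

```python
def chi2_distance(table, k, l):
    """Chi-squared distance between the profiles of rows k and l.

    Assumed convention: profile difference squared, weighted by the inverse
    of the column mass (column total / grand total).
    """
    total = sum(sum(row) for row in table)
    row_k, row_l = sum(table[k]), sum(table[l])
    col_totals = [sum(col) for col in zip(*table)]
    return sum(
        (table[k][j] / row_k - table[l][j] / row_l) ** 2 / (col_totals[j] / total)
        for j in range(len(col_totals))
    )

# Rows: Pasta, Wiener Schnitzel, Sauerbraten; columns: Ge, Au, It
poll = [[35, 12, 11],
        [23, 11, 18],
        [7, 9, 19]]

print(chi2_distance(poll, 0, 1))  # Pasta vs. Wiener Schnitzel
print(chi2_distance(poll, 0, 2))  # Pasta vs. Sauerbraten

# Rows with proportional counts have identical profiles, hence distance 0
proportional = [[1, 2, 3], [2, 4, 6], [3, 1, 1]]
print(chi2_distance(proportional, 0, 1))
```

Because only profiles enter the distance, a row with twice the counts in every column is indistinguishable from the original row.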