CLUSTER ANALYSIS - PowerPoint PPT Presentation

About This Presentation

Title:

CLUSTER ANALYSIS

Slides: 41

Provided by: YossiS7

Transcript and Presenter's Notes

1
CLUSTER ANALYSIS
2
Clustering - Goal

Partition of the genes in the dataset into
distinct sets (clusters), according to similarity
in their expression profiles across the probed
conditions

3
Clustering yeast cell cycle dataset
4
Clustering - why?

Reduce the dimensionality of the problem:
identify the major patterns in the dataset
Co-expression → Co-function
Functional annotation of ESTs
Links among pathways
Co-expression → Co-regulation
Dissection of regulatory networks

5
Similarity measures

Clustering identifies groups of genes with
similar expression profiles
How is similarity/distance between gene expression
profiles measured?
Euclidean distance
Correlation coefficient
Others

m conditions: X = (x1, x2, x3, ..., xm), Y = (y1,
y2, y3, ..., ym)
6
Similarity measure - Euclidean distance
In general, for m experiments: X = (x1, x2, x3, ..., xm), Y = (y1, y2, y3, ..., ym)
$D(X,Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$
7
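Similarity measure - Correlation coefficient

X = (x1, x2, x3, ..., xm)
Y = (y1, y2, y3, ..., ym)

$S(X,Y) = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}}$
-1 ≤ S(X,Y) ≤ 1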
8
Euclidean vs Correlation

Euclidean distance takes into account the
magnitude of the expression
Correlation coefficient - insensitive to the amplitude
of expression, takes into account the trends of
the change.
Common trends are considered very biologically
relevant, the magnitude is considered less
important → correlation
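
The contrast between the two measures can be sketched in a few lines of Python. This is only an illustration: the function names and the two toy profiles are made up for the example, and the correlation is assumed to be the Pearson coefficient.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient (assumed measure); lies between -1 and 1."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles: gene_b follows the same trend as gene_a
# at ten times the amplitude.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]

print(euclidean_distance(gene_a, gene_b))   # large, because the magnitudes differ
print(pearson_correlation(gene_a, gene_b))  # close to 1, because the trends match
```

The two genes look far apart under Euclidean distance yet almost identical under correlation, which is the behavior the slide describes.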

9
Standardization of expression levels
X = (x1, x2, x3, ..., xm); xj → (xj - mean(X)) / std(X)
(doesn't change corr(X,Y))
[Figure: expression profiles before standardization and after standardization]
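
A minimal sketch of this standardization, assuming the population standard deviation (the slide does not say which variant is meant). The helper names and the toy profiles are illustrative only.

```python
import statistics

def standardize(profile):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(profile)
    std = statistics.pstdev(profile)  # assumption: population std, not sample std
    return [(v - mean) / std for v in profile]

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical expression profiles over 5 conditions.
x = [2.0, 4.0, 8.0, 4.0, 2.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
zx, zy = standardize(x), standardize(y)

# The correlation is the same before and after standardization.
print(correlation(x, y), correlation(zx, zy))
```

After standardization every profile has mean 0 and standard deviation 1, so Euclidean distance between standardized profiles reflects differences in trend rather than in amplitude.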
10
Clustering Algorithms

K-means
SOMs
CLICK
Hierarchical clustering

11
K-MEANS

The user sets the number of clusters, k
Initialization: each gene is randomly assigned to
one of the k clusters
The average expression vector is calculated for each
cluster (the cluster's profile)
Iterate over the genes:
For each gene, compute its similarity to the
cluster profiles.
Move the gene to the cluster it is most similar
to.
Recalculate the cluster profiles.
Score the current partition: sum of distances between
genes and the profile of the cluster they are
assigned to (homogeneity of the solution).
Stop criterion: further shuffling of genes results
in only minor improvement in the clustering score
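
The procedure above can be sketched as follows. This is one possible reading, not a reference implementation: it reassigns all genes in a batch before recalculating profiles, uses Euclidean distance as the (dis)similarity, and re-seeds an empty cluster with a random gene. The names `kmeans`, `tol`, and `max_iter` are assumptions made for the example.

```python
import math
import random

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_profiles(genes, labels, k, rng):
    """Average expression vector of each cluster.
    Assumption: an empty cluster is re-seeded with a randomly chosen gene."""
    profiles = []
    for c in range(k):
        members = [g for g, lab in zip(genes, labels) if lab == c]
        if not members:
            members = [rng.choice(genes)]
        profiles.append([sum(col) / len(members) for col in zip(*members)])
    return profiles

def kmeans(genes, k, tol=1e-6, max_iter=100, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in genes]  # random initial assignment
    prev_score = float("inf")
    for _ in range(max_iter):
        profiles = cluster_profiles(genes, labels, k, rng)
        # Move each gene to the cluster whose profile it is closest to.
        labels = [min(range(k), key=lambda c: distance(g, profiles[c]))
                  for g in genes]
        # Score: sum of distances between genes and their cluster's profile.
        score = sum(distance(g, profiles[lab]) for g, lab in zip(genes, labels))
        if prev_score - score < tol:  # only minor improvement: stop
            break
        prev_score = score
    # The returned profiles are the ones the final assignment was made against.
    return labels, profiles, score
```

Because the starting partition is random, different seeds can end in different solutions; running the function with several seeds and keeping the lowest score is a common precaution.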

12
How Many Clusters?

Try several parameter values and compare the clustering
solutions
Criteria for comparison: later in the
presentation
PCA (Principal Component Analysis)
A technique for projecting the gene expression
data set onto a reduced (2- or 3-dimensional),
easily visualized space

13
PCA - Example

Dataset: thousands of genes probed in 5
conditions (time points relative to treatment)
The expression profile of each gene is represented
by the vector of its expression levels X = (X1,
X2, X3, X4, X5)
Imagine each gene X as a point in a 5-dimensional
space.
Each direction/axis corresponds to a specific
condition
Genes with similar profiles are close to each
other in this space
PCA: project this dataset to 2 dimensions,
preserving as much information as possible
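
A small sketch of such a projection, using only the Python standard library. It finds the leading principal components by power iteration with deflation on the covariance matrix; this choice of method, the function name `pca_project`, and the toy data are assumptions made for the example (a numerical library would normally do this with an eigen- or singular value decomposition).

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]

def pca_project(data, n_components=2, iterations=200):
    """Project rows of `data` onto the leading principal components.
    Assumes the data vary in at least `n_components` independent directions."""
    n, dims = len(data), len(data[0])
    means = [sum(col) / n for col in zip(*data)]
    centered = [[v - m for v, m in zip(row, means)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    components = []
    for _ in range(n_components):
        vec = [1.0 / (i + 1) for i in range(dims)]  # arbitrary non-zero start
        for _step in range(iterations):             # power iteration
            nxt = matvec(cov, vec)
            norm = math.sqrt(dot(nxt, nxt))
            vec = [v / norm for v in nxt]
        eigenvalue = dot(vec, matvec(cov, vec))
        components.append(vec)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(dims)]
               for i in range(dims)]
    projected = [[dot(row, comp) for comp in components] for row in centered]
    return projected, components

# Hypothetical genes measured in 5 conditions.
genes = [
    [1.0, 2.0, 3.0, 2.0, 1.0],
    [2.0, 4.0, 6.0, 4.0, 2.0],
    [3.0, 2.0, 1.0, 2.0, 3.0],
    [6.0, 4.0, 2.0, 4.0, 6.0],
    [2.0, 2.0, 2.0, 2.0, 2.5],
]
points_2d, axes = pca_project(genes)
for point in points_2d:
    print(point)  # one (x, y) pair per gene, ready for a scatter plot
```

Each gene becomes a point in the plane, and groups of nearby points suggest how many clusters the data contain.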

14
PCA - Example
Visual estimation of the number of clusters in
the data
15
K-means example: 4 clusters
16
[Figure: K-means solution showing Cluster 1, Cluster 2, Cluster 3, and Cluster 4; some genes are marked as mis-classified]
17
K-means example: 3 clusters
18
Too few clusters: K = 2
19
SOMs (Self-Organizing Maps)

The user sets the number of clusters in the form of a
rectangular grid (e.g., 3x2) of map nodes
Imagine genes as points in (M-dimensional) space
Initialization: map nodes are randomly placed in
the data space

20
Genes: data points
Clusters: map nodes
21
SOM - Scheme

Randomly choose a data point (gene).
Find its closest map node
Move this map node towards the data point
Move the neighboring map nodes towards this point,
but to a lesser extent
Iterate over data points

The extent of node displacements decreases with
the iteration number
After thousands of iterations:
Assign each gene to the map node (cluster) it is
most similar to
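
The scheme above can be sketched as follows. Several details are assumptions made for the example, since the slides do not fix them: nodes start at random positions inside the bounding box of the data, neighbors are the grid nodes directly adjacent to the winning node and move half as far, and the step size shrinks linearly with the iteration number. The names `train_som`, `assign_genes`, and `rate` are illustrative.

```python
import math
import random

def train_som(genes, rows, cols, iterations=2000, rate=0.5, seed=0):
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*genes)]
    highs = [max(col) for col in zip(*genes)]
    # Map nodes keyed by grid position, placed randomly in the data space.
    nodes = {(r, c): [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
             for r in range(rows) for c in range(cols)}
    for step in range(iterations):
        gene = rng.choice(genes)  # randomly chosen data point
        best = min(nodes, key=lambda pos: math.dist(nodes[pos], gene))
        decay = 1.0 - step / iterations  # displacements shrink over time
        for pos, weights in nodes.items():
            grid_dist = abs(pos[0] - best[0]) + abs(pos[1] - best[1])
            if grid_dist > 1:
                continue  # only the winner and its direct grid neighbors move
            pull = rate * decay * (1.0 if grid_dist == 0 else 0.5)
            nodes[pos] = [w + pull * (g - w) for w, g in zip(weights, gene)]
    return nodes

def assign_genes(genes, nodes):
    """Assign each gene to the map node (cluster) it is closest to."""
    return [min(nodes, key=lambda pos: math.dist(nodes[pos], g)) for g in genes]
```

Calling `train_som(genes, rows=3, cols=2)` yields six map nodes, matching the 3x2 grid mentioned earlier; `assign_genes` then gives the grid position of the cluster for each gene.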

24
CLICK (CLuster Identification via Connectivity
Kernels)

Compute the similarity between all pairs of genes
Construct a weighted similarity graph
Genes are represented by nodes
The weight of an edge connecting 2 genes reflects
their expression similarity
Find the minimum-weight cut that separates the
graph into 2 unconnected subgraphs
Iterate on cutting the subgraphs
Stop criterion for cutting

25
CLICK

Estimates the optimal number of clusters in the
dataset
Identifies outlier genes and leaves them
unclustered (singletons)

26
Hierarchical Clustering

Organize the genes in a hierarchical tree
structure
Initial step: each gene is regarded as a cluster
with one item
Find the 2 most similar clusters and merge them
into a common node
The length of the branch is proportional to the
distance
Iterate on merging nodes until all genes are
contained in one cluster, the root of the tree.

27
Hierarchical Clustering - distance between
clusters
Single-linkage
Average-linkage
Complete-linkage
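
A compact sketch of agglomerative clustering with the three linkage rules, using their standard definitions: single linkage takes the smallest pairwise distance between two clusters, complete linkage the largest, and average linkage the mean. Euclidean distance between genes, the function names, and the merge-record format are assumptions made for the example.

```python
import math

def cluster_distance(a, b, linkage):
    """Distance between two clusters (lists of profiles) under a linkage rule."""
    pair_dists = [math.dist(x, y) for x in a for y in b]
    if linkage == "single":
        return min(pair_dists)
    if linkage == "complete":
        return max(pair_dists)
    if linkage == "average":
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(genes, linkage="average"):
    """Return the merges as (cluster_id_a, cluster_id_b, distance) tuples.
    Leaves get ids 0..n-1; each merge creates the next unused id."""
    clusters = {i: [g] for i, g in enumerate(genes)}
    merges = []
    next_id = len(genes)
    while len(clusters) > 1:
        ids = list(clusters)
        # Find the two closest clusters under the chosen linkage.
        dist, a, b = min(
            (cluster_distance(clusters[i], clusters[j], linkage), i, j)
            for n, i in enumerate(ids) for j in ids[n + 1:]
        )
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, dist))
        next_id += 1
    return merges

# Hypothetical one-condition profiles, chosen so the result is easy to check.
print(agglomerate([[0.0], [1.0], [5.0]], linkage="single"))
# [(0, 1, 1.0), (2, 3, 4.0)]
```

The recorded distances are what a dendrogram would draw as branch lengths. This brute-force version recomputes all pairwise distances at every step, so it is only suitable for small examples.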
28
Mathematical evaluation of clustering solution

Merits of a good clustering solution:
Homogeneity
Genes inside a cluster are highly similar to each
other.
Measured by the average similarity between a gene and the center
(average profile) of its cluster.
Separation
Genes from different clusters have low similarity
to each other.
Measured by the weighted average similarity between the centers of
clusters.
These are conflicting features: increasing the
number of clusters tends to improve within-cluster
homogeneity at the expense of
between-cluster separation
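
Both scores can be sketched in a few lines. Two choices here are assumptions made for the example rather than definitions taken from the slides: similarity is the Pearson correlation, and in the separation score each pair of cluster centers is weighted by the product of the two cluster sizes.

```python
import math

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a flat profile."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def center(members):
    """Average profile of a cluster."""
    return [sum(col) / len(members) for col in zip(*members)]

def homogeneity(clusters):
    """Average similarity between each gene and the center of its cluster."""
    sims = [pearson(g, center(members)) for members in clusters for g in members]
    return sum(sims) / len(sims)

def separation(clusters):
    """Size-weighted average similarity between cluster centers
    (lower values mean better-separated clusters)."""
    centers = [center(members) for members in clusters]
    total = weight_sum = 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            weight = len(clusters[i]) * len(clusters[j])
            total += weight * pearson(centers[i], centers[j])
            weight_sum += weight
    return total / weight_sum

# Hypothetical solution: one cluster of rising profiles, one of falling profiles.
solution = [
    [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],
    [[3.0, 2.0, 1.0], [6.0, 4.0, 2.0]],
]
print(homogeneity(solution), separation(solution))
```

A good solution scores high on homogeneity and low on the similarity between centers, so the two numbers are best read together when comparing solutions with different numbers of clusters.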

29
Performance on Yeast Cell Cycle Data
698 genes, 72 conditions (Spellman et al. 1998).
Each algorithm was run by its authors in a
blind test.
Ben-Dor, Shamir, Yakhini 1999
30
Which genes to cluster?

Apply filtering prior to clustering: focus the
analysis on the responding genes
Applying controlled statistical tests to identify
responding genes usually ends up with too few
genes, which doesn't allow global characterization
of the response
Fold change: choose genes that changed by at
least M-fold in at least L conditions
Variance: choose the top P genes with the highest
variance over the dataset
Try various filtering schemes to find the setting
that gives the best results (biologically)
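
The two filters can be sketched as below. The sketch assumes each gene is stored as a list of positive expression ratios relative to a reference condition, and that an M-fold change counts in either direction (up or down); the function and parameter names are illustrative.

```python
import statistics

def fold_change_filter(genes, min_fold, min_conditions):
    """Keep genes that changed at least `min_fold`-fold in at least
    `min_conditions` conditions.
    `genes` maps a gene name to its expression ratios (assumed format)."""
    kept = {}
    for name, ratios in genes.items():
        hits = sum(1 for r in ratios if r >= min_fold or r <= 1.0 / min_fold)
        if hits >= min_conditions:
            kept[name] = ratios
    return kept

def top_variance_filter(genes, top_p):
    """Keep the `top_p` genes with the highest variance across conditions."""
    ranked = sorted(genes, key=lambda name: statistics.pvariance(genes[name]),
                    reverse=True)
    return {name: genes[name] for name in ranked[:top_p]}

# Hypothetical ratios for three genes over four conditions.
genes = {
    "g1": [1.0, 2.5, 3.0, 1.1],
    "g2": [1.0, 1.1, 0.9, 1.0],
    "g3": [0.4, 0.3, 1.0, 1.0],
}
print(sorted(fold_change_filter(genes, min_fold=2.0, min_conditions=2)))
print(sorted(top_variance_filter(genes, top_p=1)))
```

With these toy values the fold-change filter keeps the induced gene `g1` and the repressed gene `g3`, while the flat gene `g2` is dropped.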

31
Clustering Tools

Cluster (Eisen): hierarchical
GeneCluster (Tamayo): SOM
TIGR MeV: K-means, SOM, hierarchical, QTC, CAST
Expander: CLICK, SOM, K-means, hierarchical
Many others

32
Ascribe Biological Meaning to Clusters

Identify over-represented functional categories
in the clusters (i.e., a cluster contains many more
genes of a specific biological process than
expected by chance)
Requirements for systematic analysis:
Controlled vocabulary for describing biological
processes (protein biosynthesis/translation,
apoptosis/programmed cell death)
Standard assignment of genes into functional
categories
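
One common way to quantify "more than expected by chance" is a hypergeometric tail probability. The slides do not name a specific test, so this choice, the function name, and the numbers in the usage lines are assumptions made for illustration.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, cluster_annotated):
    """Probability of seeing at least `cluster_annotated` genes of a category
    in a cluster of `cluster_size` genes drawn at random from `population`
    genes, of which `annotated` belong to the category."""
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(cluster_annotated, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 6000 genes in total, 100 annotated to a category,
# and a cluster of 50 genes containing 10 of the annotated ones.
print(enrichment_p_value(6000, 100, 50, 10))
```

A small p-value suggests the category is over-represented in the cluster. When many categories and clusters are tested, the p-values need a multiple-testing correction before they are interpreted.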

33
Gene Ontology (GO) project

Defines controlled terms (ontologies) for the
description of gene products from 3 aspects:
Biological process (DNA repair, mitosis)
Molecular function (protein serine/threonine
kinase activity, transcription factor activity)
Cellular component (nucleus, ribosome)
Unified framework for gene annotation:
species-independent vocabularies
A gene can have multiple associations in each
ontology
GO terms are organized in hierarchical structures
called directed acyclic graphs (DAGs)
Very general terms at the top levels of the graph
Terms get more specialized at lower levels
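
Because the terms form a DAG rather than a tree, a term can have several parents, and a gene annotated to a specific term is implicitly associated with every more general term above it. The sketch below collects those ancestors. The term names and edges are a simplified toy example written for illustration, not an excerpt from the actual GO files.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links upward.
    `parents` maps a term to the list of its direct parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Toy DAG (hypothetical, simplified edges): one term with two parents.
toy_parents = {
    "DNA repair": ["DNA metabolic process", "response to DNA damage"],
    "DNA metabolic process": ["metabolic process"],
    "response to DNA damage": ["response to stress"],
    "metabolic process": ["biological process"],
    "response to stress": ["biological process"],
}
print(sorted(ancestors("DNA repair", toy_parents)))
```

Propagating annotations upward in this way is what lets a cluster be tested for enrichment at both specific and general levels of the graph.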