Title: CLUSTER ANALYSIS
1CLUSTER ANALYSIS
2Clustering - Goal
- Partition of the genes in the dataset into
distinct sets (clusters), according to similarity
in their expression profiles across the probed
conditions
3Clustering yeast cell cycle dataset
4Clustering why?
- Reduce the dimensionality of the problem: identify the major patterns in the dataset
- Co-expression → co-function
- Functional annotation of ESTs
- Links among pathways
- Co-expression → co-regulation
- Dissection of regulatory networks
5Similarity measures
- Clustering identifies groups of genes with similar expression profiles
- How is the similarity/distance between gene expression profiles measured?
- Euclidean distance
- Correlation coefficient
- Others
For M conditions: X = (x1, x2, x3, ..., xm), Y = (y1, y2, y3, ..., ym)
6Similarity measure - Euclidean distance
In general, for m experiments: X = (x1, x2, x3, ..., xm), Y = (y1, y2, y3, ..., ym)
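For reference, the standard definition of the Euclidean distance between two profiles X and Y over m experiments can be written as follows (the symbol d is a name chosen here, not taken from the slides):

```latex
d(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
```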
7Similarity measure Correlation Coefficient
- X = (x1, x2, x3, ..., xm)
- Y = (y1, y2, y3, ..., ym)
-1 ≤ S(X,Y) ≤ 1
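For reference, assuming S(X,Y) denotes the Pearson correlation coefficient (the usual choice for expression profiles, but an assumption here), it can be written with the profile means as:

```latex
S(X, Y) = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}
               {\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}},
\qquad \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i
```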
8Euclidean vs Correlation
- Euclidean distance takes into account the magnitude of the expression
- Correlation coefficient: insensitive to the amplitude of expression; takes into account the trends of the change
- Common trends are considered very biologically relevant, while the magnitude is considered less important → correlation
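The contrast between the two measures can be sketched in a few lines of Python. This is a minimal illustration, assuming the correlation coefficient is Pearson's; the three profiles are made-up values, not data from the presentation.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient; undefined for a constant profile."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles over 5 conditions.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same trend as gene_a, 10x amplitude
gene_c = [3.0, 2.0, 1.0, 2.0, 3.0]       # opposite trend, similar magnitude

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name,
          "distance:", round(euclidean_distance(gene_a, other), 3),
          "correlation:", round(pearson_correlation(gene_a, other), 3))
```

By Euclidean distance gene_a is closer to gene_c, although their trends are opposite; by correlation it is most similar to gene_b, which shares its trend at a different amplitude.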
9Standardization of expression levels
X = (x1, x2, x3, ..., xm); xj → (xj - mean(X)) / std(X) (does not change corr(X,Y))
Before standardization
After standardization
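A minimal sketch of this standardization, assuming the population standard deviation (dividing by m) and Pearson correlation; the profiles x and y are made-up values.

```python
import math

def standardize(profile):
    """Shift a profile to mean 0 and scale it to standard deviation 1.

    Uses the population standard deviation (an assumption); a constant
    profile has std 0 and cannot be standardized this way.
    """
    m = len(profile)
    mean = sum(profile) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in profile) / m)
    return [(v - mean) / std for v in profile]

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles over 5 conditions.
x = [2.0, 4.0, 8.0, 4.0, 2.0]
y = [10.0, 30.0, 35.0, 20.0, 5.0]
zx, zy = standardize(x), standardize(y)

print("corr before:", correlation(x, y))
print("corr after: ", correlation(zx, zy))
```

With this choice of std, the squared Euclidean distance between two standardized profiles equals 2m(1 - corr), so after standardization the two similarity measures rank gene pairs consistently.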
10Clustering Algorithms
- K-means
- SOMs
- CLICK
- Hierarchical clustering
11K-MEANS
- The user sets the number of clusters, k
- Initialization: each gene is randomly assigned to one of the k clusters
- The average expression vector is calculated for each cluster (the cluster's profile)
- Iterate over the genes
- For each gene, compute its similarity to the cluster profiles
- Move the gene to the cluster it is most similar to
- Recalculate the cluster profiles
- Score the current partition: sum of distances between genes and the profile of the cluster they are assigned to (homogeneity of the solution)
- Stop criterion: further shuffling of genes results in only minor improvement in the clustering score
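The steps above can be sketched as follows. This is an illustrative batch variant (all genes are reassigned before the profiles are recalculated), using Euclidean distance; the re-seeding of empty clusters, the parameter names `tol` and `max_iter`, and the example data are assumptions, not part of the presentation.

```python
import math
import random

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_profile(profiles):
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def kmeans(genes, k, tol=1e-6, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialization: each gene is randomly assigned to one of the k clusters.
    labels = [rng.randrange(k) for _ in genes]
    prev_score = float("inf")
    for _ in range(max_iter):
        # Cluster profile = average expression vector of the cluster's genes.
        centers = []
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random gene.
            centers.append(mean_profile(members) if members
                           else list(rng.choice(genes)))
        # Move every gene to the cluster whose profile it is closest to.
        labels = [min(range(k), key=lambda c: distance(g, centers[c]))
                  for g in genes]
        # Score: sum of distances between genes and their cluster profile.
        score = sum(distance(g, centers[lab]) for g, lab in zip(genes, labels))
        # Stop when the score no longer improves by more than tol.
        if prev_score - score < tol:
            break
        prev_score = score
    return labels, centers, score

# Hypothetical data: two well-separated groups of 3 and 4 genes.
genes = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.1],
         [5.0, 5.1, 5.0], [5.1, 5.0, 5.0], [5.0, 5.0, 5.1], [5.1, 5.1, 5.0]]
labels, centers, score = kmeans(genes, k=2)
print(labels, round(score, 3))
```

Cluster numbering is arbitrary: which group ends up as cluster 0 depends on the random initialization.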
12How Many Clusters?
- Try several parameters and compare the clustering solutions
- Criteria for comparison: later in the presentation
- PCA (Principal Component Analysis)
- A technique for projecting the gene expression dataset onto a reduced (2- or 3-dimensional), easily visualized space
13PCA - Example
- Dataset: thousands of genes probed in 5 conditions (time points relative to treatment)
- The expression profile of each gene is represented by the vector of its expression levels, X = (X1, X2, X3, X4, X5)
- Imagine each gene X as a point in a 5-dimensional space
- Each direction/axis corresponds to a specific condition
- Genes with similar profiles are close to each other in this space
- PCA: project this dataset to 2 dimensions, preserving as much information as possible
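The projection described above can be sketched without external libraries. This is a minimal illustration that finds the two leading principal components by power iteration with deflation (a choice made here for self-containment, not necessarily what any PCA tool does); the six 5-condition profiles are made-up values.

```python
import math

def center_columns(data):
    means = [sum(col) / len(data) for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]

def covariance_matrix(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def dominant_eigenpair(matrix, iterations=500):
    """Power iteration from a fixed start vector (an arbitrary choice)."""
    vec = [float(i + 1) for i in range(len(matrix))]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # no variance left in this matrix
            break
        vec = [v / norm for v in nxt]
    eigenvalue = sum(a * b for a, b in zip(vec, mat_vec(matrix, vec)))
    return vec, eigenvalue

def project_2d(data):
    """Project each row onto the first two principal components."""
    centered = center_columns(data)
    cov = covariance_matrix(centered)
    pc1, lam1 = dominant_eigenpair(cov)
    d = len(cov)
    # Deflation: subtract the first component so the next run finds the second.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2, _ = dominant_eigenpair(deflated)
    return [[sum(a * b for a, b in zip(row, pc1)),
             sum(a * b for a, b in zip(row, pc2))] for row in centered]

# Hypothetical dataset: 6 genes in 5 conditions, forming two groups.
profiles = [[1.0, 1.2, 0.9, 1.1, 1.0], [0.9, 1.0, 1.1, 1.0, 1.2],
            [1.1, 0.9, 1.0, 1.2, 0.9], [5.0, 5.2, 4.9, 5.1, 5.0],
            [4.9, 5.0, 5.1, 5.0, 5.2], [5.1, 4.9, 5.0, 5.2, 4.9]]
points = project_2d(profiles)
for point in points:
    print([round(v, 3) for v in point])
```

Plotting the two columns of `points` against each other gives the kind of 2-dimensional view in which groups of genes can be counted by eye.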
14PCA Example
Visual estimation of the number of clusters in
the data
15K-MEANS example 4 clusters
16Figure labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Mis-classified
17K-means example 3 clusters
18Too few clusters: K = 2
19SOMs (Self-Organizing Maps)
- The user sets the number of clusters in the form of a rectangular grid (e.g., 3x2) of map nodes
- Imagine genes as points in (M-dimensional) space
- Initialization: map nodes are randomly placed in the data space
20Genes: data points
Clusters: map nodes
21SOM - Scheme
- Randomly choose a data point (gene)
- Find its closest map node
- Move this map node towards the data point
- Move the neighbor map nodes towards this point, but to a lesser extent
- Iterate over data points
22
- The extent of node displacement is reduced as the iteration number grows
- After thousands of iterations, assign each gene to the map node (cluster) it is most similar to
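The SOM scheme can be sketched as below. The linear decay of the learning rate and neighborhood radius, the Gaussian neighborhood function, and the parameter names (`lr0`, `radius0`) are assumptions chosen for illustration; the example data are made up.

```python
import math
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def train_som(genes, rows, cols, iterations=2000, lr0=0.5, radius0=1.5, seed=0):
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialization: map nodes are randomly placed inside the data range.
    lo = [min(col) for col in zip(*genes)]
    hi = [max(col) for col in zip(*genes)]
    nodes = [[rng.uniform(a, b) for a, b in zip(lo, hi)] for _ in grid]
    for t in range(iterations):
        frac = 1.0 - t / iterations           # decays from 1 towards 0
        lr = lr0 * frac                       # displacement shrinks over time
        radius = max(radius0 * frac, 1e-3)    # neighborhood shrinks over time
        gene = rng.choice(genes)              # randomly choose a data point
        best = min(range(len(nodes)), key=lambda i: sq_dist(gene, nodes[i]))
        for i, (r, c) in enumerate(grid):
            grid_d2 = (r - grid[best][0]) ** 2 + (c - grid[best][1]) ** 2
            # The closest node gets influence 1; grid neighbors get less.
            influence = math.exp(-grid_d2 / (2 * radius ** 2))
            nodes[i] = [n + lr * influence * (g - n)
                        for n, g in zip(nodes[i], gene)]
    return nodes

def assign_genes(genes, nodes):
    """Assign each gene to the map node (cluster) it is closest to."""
    return [min(range(len(nodes)), key=lambda i: sq_dist(g, nodes[i]))
            for g in genes]

# Hypothetical data: two well-separated groups, mapped onto a 1x2 grid.
demo_genes = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.1],
              [5.0, 5.1, 5.0], [5.1, 5.0, 5.0], [5.0, 5.0, 5.1]]
nodes = train_som(demo_genes, rows=1, cols=2)
labels = assign_genes(demo_genes, nodes)
print(labels)
```

A larger grid such as 3x2 is requested the same way, with `rows=3, cols=2`.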
24CLICK (CLuster Identification via Connectivity Kernels)
- Compute similarity between all pairs of genes
- Construct a weighted similarity graph
- Genes are represented by nodes
- The weight of an edge connecting 2 genes reflects their expression similarity
- Find the minimum-weight cut that separates the graph into 2 unconnected subgraphs
- Iterate on cutting subgraphs
- Stop criteria for cutting
25CLICK
- Estimates the optimal number of clusters in the dataset
- Identifies outlier genes and leaves them unclustered (singletons)
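The graph-cutting step can be illustrated on a toy example. The sketch below is not the CLICK implementation: it uses a hypothetical similarity, 1 / (1 + Euclidean distance), as the edge weight, finds the minimum-weight cut by brute force (feasible only for a handful of genes), and performs a single cut without the recursion or the stop criteria.

```python
import math
from itertools import combinations

def similarity(x, y):
    """Hypothetical edge weight in (0, 1]: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))

def build_similarity_graph(genes):
    """Complete weighted graph: weights[(i, j)] for every gene pair i < j."""
    return {(i, j): similarity(genes[i], genes[j])
            for i, j in combinations(range(len(genes)), 2)}

def minimum_weight_cut(nodes, weights):
    """Brute-force minimum-weight cut of the subgraph induced by `nodes`."""
    nodes = sorted(nodes)
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to cut")
    node_set = set(nodes)
    anchor, rest = nodes[0], nodes[1:]
    best_weight, best_side = float("inf"), None
    # Keep the first node on side A so each bipartition is tried once;
    # side B always keeps at least one node.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            side_a = {anchor, *extra}
            cut = sum(w for (i, j), w in weights.items()
                      if i in node_set and j in node_set
                      and (i in side_a) != (j in side_a))
            if cut < best_weight:
                best_weight, best_side = cut, side_a
    return best_side, node_set - best_side, best_weight

# Hypothetical data: two tight groups of three genes each.
genes = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.1],
         [5.0, 5.1, 5.0], [5.1, 5.0, 5.0], [5.0, 5.0, 5.1]]
weights = build_similarity_graph(genes)
side_a, side_b, cut_weight = minimum_weight_cut(range(len(genes)), weights)
print(sorted(side_a), sorted(side_b), round(cut_weight, 3))
```

Applying `minimum_weight_cut` again to each side would correspond to the "iterate on cutting subgraphs" step, with a stop rule deciding when a subgraph is left intact.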
26Hierarchical Clustering
- Organize the genes in a hierarchical tree structure
- Initial step: each gene is regarded as a cluster with one item
- Find the 2 most similar clusters and merge them into a common node
- The length of the branch is proportional to the distance
- Iterate on merging nodes until all genes are contained in one cluster, the root of the tree
27Hierarchical Clustering: distance between clusters
- Single-linkage
- Average-linkage
- Complete-linkage
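The merging procedure and the three linkage rules can be sketched as follows, using their standard definitions: single-linkage takes the distance of the closest pair of members, complete-linkage the farthest pair, and average-linkage the mean over all pairs. Euclidean distance and the four example genes are assumptions for illustration.

```python
import math

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage_distance(cluster_a, cluster_b, genes, method):
    """Distance between two clusters, given as tuples of gene indices."""
    pairs = [distance(genes[i], genes[j]) for i in cluster_a for j in cluster_b]
    if method == "single":
        return min(pairs)
    if method == "complete":
        return max(pairs)
    if method == "average":
        return sum(pairs) / len(pairs)
    raise ValueError("unknown linkage method: " + method)

def hierarchical_clustering(genes, method="average"):
    """Return the list of merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(genes))]  # each gene starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], genes, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return merges

# Hypothetical genes placed along one axis to make the distances easy to read.
genes = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [10.0, 0.0]]
for method in ("single", "average", "complete"):
    heights = [round(d, 3) for _, _, d in hierarchical_clustering(genes, method)]
    print(method, heights)
```

The merge distances returned for each step are what a dendrogram draws as branch lengths; the last merge is the root of the tree.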
28Mathematical evaluation of clustering solution
- Merits of a good clustering solution
- Homogeneity
- Genes inside a cluster are highly similar to each other
- Average similarity between a gene and the center (average profile) of its cluster
- Separation
- Genes from different clusters have low similarity to each other
- Weighted average similarity between centers of clusters
- These are conflicting features: increasing the number of clusters tends to improve within-cluster homogeneity at the expense of between-cluster separation
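Both scores can be sketched directly from these definitions. The sketch assumes Pearson correlation as the similarity and weights each pair of cluster centers by the product of the cluster sizes; the weighting scheme is not stated in the slides and is an assumption here, as are the example genes.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; undefined for a constant profile."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def cluster_centers(genes, labels):
    centers = {}
    for lab in set(labels):
        members = [g for g, l in zip(genes, labels) if l == lab]
        centers[lab] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def homogeneity(genes, labels):
    """Average similarity between each gene and the center of its cluster."""
    centers = cluster_centers(genes, labels)
    return sum(pearson(g, centers[l]) for g, l in zip(genes, labels)) / len(genes)

def separation(genes, labels):
    """Size-weighted average similarity between cluster centers (lower is better)."""
    centers = cluster_centers(genes, labels)
    sizes = {lab: labels.count(lab) for lab in centers}
    num = den = 0.0
    for a, b in combinations(sorted(centers), 2):
        weight = sizes[a] * sizes[b]  # assumed weighting
        num += weight * pearson(centers[a], centers[b])
        den += weight
    return num / den

# Hypothetical genes: two rising profiles and two falling profiles.
genes = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0], [6.0, 4.0, 2.0]]
good = [0, 0, 1, 1]   # rising vs falling
bad = [0, 0, 0, 1]    # one falling gene placed with the rising ones
print("good:", homogeneity(genes, good), separation(genes, good))
print("bad: ", homogeneity(genes, bad), separation(genes, bad))
```

Because the scores are similarities, a good solution has high homogeneity and low separation values.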
29Performance on Yeast Cell Cycle Data
698 genes, 72 conditions (Spellman et al. 1998).
Each algorithm was run by its authors in a
blind test.
Ben-Dor, Shamir, Yakhini 1999
30Which genes to cluster?
- Apply filtering prior to clustering: focus the analysis on the responding genes
- Applying controlled statistical tests to identify responding genes usually ends up with too few genes, which does not allow global characterization of the response
- Fold change: choose genes that changed by at least M-fold in at least L conditions
- Variance: choose the top P genes with the highest variance over the dataset
- Try various filtering schemes to find the setting that gives the best results (biologically)
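The two filters can be sketched as follows, assuming each value is an expression ratio relative to a reference (so an M-fold change means a ratio of at least M or at most 1/M) and using the population variance; the gene names and values are made up.

```python
import statistics

def passes_fold_change(ratios, m_fold, min_conditions):
    """True if the gene changed by at least m_fold in at least min_conditions."""
    changed = sum(1 for r in ratios if r >= m_fold or r <= 1.0 / m_fold)
    return changed >= min_conditions

def filter_by_fold_change(dataset, m_fold, min_conditions):
    return [name for name, ratios in dataset.items()
            if passes_fold_change(ratios, m_fold, min_conditions)]

def top_variance_genes(dataset, top_p):
    """Names of the top_p genes with the highest variance over the dataset."""
    ranked = sorted(dataset,
                    key=lambda name: statistics.pvariance(dataset[name]),
                    reverse=True)
    return ranked[:top_p]

# Hypothetical expression ratios (treated / reference) in 4 conditions.
dataset = {
    "g1": [1.0, 1.1, 0.9, 1.0],    # essentially unchanged
    "g2": [1.0, 2.5, 3.0, 0.4],    # induced, then repressed
    "g3": [1.0, 0.3, 0.45, 1.2],   # repressed in two conditions
}
print(filter_by_fold_change(dataset, m_fold=2.0, min_conditions=2))
print(top_variance_genes(dataset, top_p=1))
```

For log-transformed data the fold-change test would compare absolute log ratios against log(M) instead.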
31Clustering Tools
- Cluster (Eisen): hierarchical
- GeneCluster (Tamayo): SOM
- TIGR MeV: K-means, SOM, hierarchical, QTC, CAST
- Expander: CLICK, SOM, K-means, hierarchical
- Many others
32Ascribe Biological Meaning to Clusters
- Identify over-represented functional categories in the clusters (i.e., a cluster contains many more genes of a specific biological process than expected by chance)
- Requirements for systematic analysis
- Controlled vocabulary for describing biological processes (protein biosynthesis / translation, apoptosis / programmed cell death)
- Standard assignment of genes to functional categories
33Gene Ontology (GO) project
- Defines controlled terms (ontologies) for the description of gene products from 3 aspects
- Biological process (DNA repair, mitosis)
- Molecular function (protein serine/threonine kinase activity, transcription factor activity)
- Cellular component (nucleus, ribosome)
- Unified framework for gene annotation: species-independent vocabularies
- A gene can have multiple associations in each ontology
- GO terms are organized in hierarchical structures called directed acyclic graphs (DAGs)
- Very general terms at the top levels of the graph
- Terms get more specialized at lower levels
35Gene annotations using GO
- Human: LocusLink (NCBI), GOA (EBI); 15K genes with biological process annotation
- Mouse: MGI, GOA; 10K annotated genes
- Rat: RGD; 2.5K annotated genes
- Fly: FlyBase; 4.5K annotated genes
- Arabidopsis: TAIR; 12K annotated genes
- Yeast: SGD
- Affymetrix chips: NetAffx
36Ascribe Biological Meaning to Clusters
- This analysis is NOT INFORMATIVE!
- Some of the abundances can be explained just by chance
- Statistical tests are essential to detect significant phenomena
37Identifying enriched GO categories in clusters
- In the previous example
- Total number of genes on the chip with annotation: 5,000
- Total number of genes on the chip associated with the metabolism GO category: 3,600
- Number of annotated genes in cluster 3: 73
- Number of metabolic genes in cluster 3: 50
- Is this a statistically significant phenomenon?
- Hypergeometric probability score
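The hypergeometric score for the numbers above can be sketched as an upper-tail probability: the chance of seeing at least 50 metabolic genes when 73 genes are drawn without replacement from 5,000 annotated genes, 3,600 of which are metabolic. The function and variable names are illustrative.

```python
from math import comb

def hypergeometric_tail(population, successes, draws, observed):
    """P(X >= observed) for a hypergeometric random variable X.

    population: annotated genes on the chip
    successes:  genes on the chip in the GO category
    draws:      annotated genes in the cluster
    observed:   genes in the cluster that belong to the category
    """
    total = comb(population, draws)
    upper = min(draws, successes)
    tail = sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(observed, upper + 1))
    return tail / total  # exact integers divided once, at the end

expected = 73 * 3600 / 5000
p_value = hypergeometric_tail(5000, 3600, 73, 50)
print("expected by chance:", expected)
print("P(X >= 50):", p_value)
```

With these numbers about 52.6 metabolic genes are expected in the cluster by chance alone, so observing 50 is not an enrichment: the category looks abundant in the cluster only because it is abundant on the whole chip.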
39Functional GO enrichment - Tools
- FatiGO
- GoMiner
- Expander
- DAVID
- EPGO (EBI)
40Acknowledgements
- SOM figures in this presentation were taken from a presentation by Benedikt Brors