CLUSTER ANALYSIS - PowerPoint PPT Presentation

Provided by: YossiS7
Transcript and Presenter's Notes

1
CLUSTER ANALYSIS
2
Clustering - Goal
  • Partition of the genes in the dataset into
    distinct sets (clusters), according to similarity
    in their expression profiles across the probed
    conditions

3
Clustering yeast cell cycle dataset
4
Clustering: why?
  • Reduce the dimensionality of the problem:
    identify the major patterns in the dataset
  • Co-expression → Co-function
  • Functional annotation of ESTs
  • Links among pathways
  • Co-expression → Co-regulation
  • Dissection of regulatory networks

5
Similarity measures
  • Clustering identifies groups of genes with
    similar expression profiles
  • How is similarity/distance between gene expression
    profiles measured?
  • Euclidean distance
  • Correlation coefficient
  • Others

For m conditions: X = (x1, x2, x3, ..., xm), Y = (y1, y2, y3, ..., ym)
6
Similarity measure - Euclidean distance
In general, for m experiments: X = (x1, x2, x3, ..., xm), Y = (y1, y2, y3, ..., ym)
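
The formula shown on this slide did not survive in the transcript. Assuming it was the standard definition of Euclidean distance between two profiles measured over m experiments, it reads:

```latex
d(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
```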
7
Similarity measure - Correlation coefficient
  • X = (x1, x2, x3, ..., xm)
  • Y = (y1, y2, y3, ..., ym)

-1 ≤ S(X,Y) ≤ 1
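
The definition of S(X,Y) is missing from the transcript. Assuming the slide used the Pearson correlation coefficient (the usual choice, and consistent with the stated range), with \bar{x} and \bar{y} denoting the means of X and Y, it would be:

```latex
S(X, Y) = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}
               {\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}}
```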
8
Euclidean vs Correlation
  • Euclidean distance takes into account the
    magnitude of the expression
  • Correlation coefficient: insensitive to the amplitude
    of expression; takes into account the trends of
    the change
  • Common trends are considered very biologically
    relevant, the magnitude is considered less
    important → correlation
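
A minimal sketch of the contrast above, assuming Pearson correlation as the correlation measure. The three profiles are invented for illustration: gene_b follows the same trend as gene_a at ten times the amplitude, while gene_c is close to gene_a in magnitude but follows a different trend.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient; undefined for constant profiles."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical profiles over 5 conditions (not taken from the presentation).
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same trend, 10x amplitude
gene_c = [1.5, 1.5, 2.5, 2.5, 1.0]       # similar magnitude, different trend

print("a vs b:", euclidean(gene_a, gene_b), pearson(gene_a, gene_b))
print("a vs c:", euclidean(gene_a, gene_c), pearson(gene_a, gene_c))
```

By Euclidean distance gene_c is the nearer neighbour of gene_a, whereas by correlation gene_b is a perfect match.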

9
Standardization of expression levels
X = (x1, x2, x3, ..., xm); xj → (xj - mean(X)) / std(X) (doesn't change corr(X,Y))
Before standardization
After standardization
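
The standardization step can be sketched as follows. The slide does not say whether std(X) is the population or the sample standard deviation; the population form is assumed here, and the two example profiles are invented.

```python
import math


def standardize(x):
    """Return (x_j - mean(X)) / std(X) for every x_j.

    Assumes the population standard deviation; a constant profile
    (std = 0) cannot be standardized and raises ZeroDivisionError.
    """
    m = len(x)
    mean = sum(x) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / m)
    return [(v - mean) / std for v in x]


def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical profiles on very different scales.
gene_a = [2.0, 4.0, 8.0, 4.0, 2.0]
gene_b = [110.0, 95.0, 130.0, 120.0, 100.0]

z_a, z_b = standardize(gene_a), standardize(gene_b)
print(pearson(gene_a, gene_b), pearson(z_a, z_b))  # the two values agree
```

After standardization every profile has mean 0 and standard deviation 1, so Euclidean distances between standardized profiles reflect differences in trend rather than in amplitude.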
10
Clustering Algorithms
  • K-means
  • SOMs
  • CLICK
  • Hierarchical clustering

11
K-MEANS
  • The user sets the number of clusters, k
  • Initialization: each gene is randomly assigned to
    one of the k clusters
  • The average expression vector is calculated for each
    cluster (the cluster's profile)
  • Iterate over the genes:
  • For each gene, compute its similarity to the
    cluster profiles
  • Move the gene to the cluster it is most similar
    to
  • Recalculate the cluster profiles
  • Score the current partition: sum of distances between
    genes and the profile of the cluster they are
    assigned to (homogeneity of the solution)
  • Stopping criterion: further shuffling of genes results
    in only minor improvement in the clustering score
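
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: Euclidean distance is used as the (dis)similarity measure, all genes are reassigned in one pass before the profiles are recalculated, and the names kmeans, tol and max_iter are made up for this example.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_profile(profiles):
    """Average expression vector (cluster profile) of a non-empty list."""
    n = len(profiles)
    return [sum(p[j] for p in profiles) / n for j in range(len(profiles[0]))]


def kmeans(genes, k, tol=1e-6, max_iter=100, seed=0):
    """Cluster `genes` (equal-length vectors, at least k of them) into k groups."""
    rng = random.Random(seed)
    # Initialization: random assignment, balanced so no cluster starts empty.
    labels = [i % k for i in range(len(genes))]
    rng.shuffle(labels)
    profiles = [mean_profile([g for g, lab in zip(genes, labels) if lab == c])
                for c in range(k)]
    prev_score = float("inf")
    for _ in range(max_iter):
        # Move every gene to the cluster whose profile is closest.
        labels = [min(range(k), key=lambda c: euclidean(g, profiles[c]))
                  for g in genes]
        # Recalculate the cluster profiles.
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous profile
                profiles[c] = mean_profile(members)
        # Score: sum of distances between genes and their cluster profile.
        score = sum(euclidean(g, profiles[lab]) for g, lab in zip(genes, labels))
        if prev_score - score < tol:  # only minor improvement: stop
            break
        prev_score = score
    return labels, profiles, score


# Hypothetical data: three low-expression and three high-expression genes.
genes = [[0.0, 0.0, 0.1], [0.1, 0.0, 0.0], [0.0, 0.2, 0.0],
         [5.0, 5.0, 5.1], [5.1, 5.0, 5.0], [5.0, 5.2, 5.0]]
labels, profiles, score = kmeans(genes, k=2)
print(labels, score)
```

Because the result depends on the random initialization, it is common to run the procedure from several seeds and keep the partition with the best score.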

12
How Many Clusters?
  • Try several parameter values and compare the clustering
    solutions
  • Criteria for comparison: later in the
    presentation
  • PCA (Principal Component Analysis)
  • A technique for projecting the gene expression
    data set onto a reduced (2- or 3-dimensional),
    easily visualized space

13
PCA - Example
  • Dataset: thousands of genes probed in 5
    conditions (time points relative to treatment)
  • The expression profile of each gene is represented
    by the vector of its expression levels X = (X1,
    X2, X3, X4, X5)
  • Imagine each gene X as a point in a 5-dimensional
    space
  • Each direction/axis corresponds to a specific
    condition
  • Genes with similar profiles are close to each
    other in this space
  • PCA: project this dataset onto 2 dimensions,
    preserving as much information as possible
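
A rough sketch of such a projection, using only the standard library. It finds the leading principal components by power iteration with deflation, which is one of several ways to compute them (production code would normally call a linear algebra library). The function name pca_project and the synthetic two-group dataset are assumptions made for this example.

```python
import math
import random


def pca_project(data, n_components=2, n_iter=200, seed=0):
    """Project the rows of `data` onto their top principal components."""
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in data]
    # m x m covariance matrix of the conditions.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    rng = random.Random(seed)
    components = []
    for _comp in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for _step in range(n_iter):  # power iteration
            w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * cov[a][b] * v[b]
                         for a in range(m) for b in range(m))
        components.append(v)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(m)]
               for a in range(m)]
    projected = [[sum(r[j] * c[j] for j in range(m)) for c in components]
                 for r in centered]
    return projected, components


# Synthetic data: 20 "rising" and 20 "falling" genes over 5 conditions.
rng = random.Random(1)
rising = [0.0, 1.0, 2.0, 3.0, 4.0]
falling = [4.0, 3.0, 2.0, 1.0, 0.0]
genes = [[v + rng.gauss(0.0, 0.1) for v in base]
         for base in [rising] * 20 + [falling] * 20]

points_2d, components = pca_project(genes)
print(points_2d[0], points_2d[-1])
```

Plotting points_2d as a scatter plot gives the kind of picture used for a visual estimate of the number of clusters.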

14
PCA Example
Visual estimation of the number of clusters in
the data
15
K-means example: 4 clusters
16
(Figure labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Mis-classified)
17
K-means example: 3 clusters
18
Too few clusters: K = 2
19
SOMs (Self-Organizing Maps)
  • The user sets the number of clusters in the form of a
    rectangular grid (e.g., 3x2) of map nodes
  • Imagine genes as points in (M-dimensional) space
  • Initialization: map nodes are randomly placed in
    the data space

20
Genes: data points
Clusters: map nodes
21
SOM - Scheme
  • Randomly choose a data point (gene).
  • Find its closest map node
  • Move this map node towards the data point
  • Move the neighbor map nodes towards this point,
    but to a lesser extent
  • Iterate over data points

22
  • The extent of node displacement decreases as the
    iteration number grows
  • After thousands of iterations:
  • Assign each gene to the map node (cluster) it is
    most similar to
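
The SOM scheme can be sketched as below. Several details are assumptions rather than anything stated on the slides: map nodes are initialized at randomly chosen data points, neighbours are weighted with a Gaussian of their grid distance to the winning node, and both the learning rate and the neighbourhood radius shrink linearly with the iteration number. All function and parameter names are made up for this example.

```python
import math
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def train_som(genes, rows=3, cols=2, n_iter=2000, rate0=0.5, radius0=1.5, seed=0):
    """Train a rows x cols map; returns one weight vector per map node."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialization (assumed): each node starts at a random data point.
    nodes = [list(rng.choice(genes)) for _ in grid]
    for t in range(n_iter):
        decay = 1.0 - t / n_iter       # shrinks from 1 towards 0
        rate = rate0 * decay           # step size for the winning node
        radius = radius0 * decay       # neighbourhood width on the grid
        x = rng.choice(genes)          # randomly chosen data point
        best = min(range(len(nodes)), key=lambda i: sq_dist(x, nodes[i]))
        for i, (r, c) in enumerate(grid):
            grid_d2 = (r - grid[best][0]) ** 2 + (c - grid[best][1]) ** 2
            # Winner gets the full step; neighbours get a smaller one.
            pull = rate * math.exp(-grid_d2 / (2.0 * radius ** 2))
            nodes[i] = [w + pull * (xi - w) for w, xi in zip(nodes[i], x)]
    return nodes


def assign_to_nodes(genes, nodes):
    """Assign each gene to the index of its most similar map node."""
    return [min(range(len(nodes)), key=lambda i: sq_dist(g, nodes[i]))
            for g in genes]


# Hypothetical data: two loose groups of genes measured in 3 conditions.
data_rng = random.Random(2)
genes = ([[data_rng.gauss(0.0, 0.3) for _ in range(3)] for _ in range(15)]
         + [[data_rng.gauss(4.0, 0.3) for _ in range(3)] for _ in range(15)])
nodes = train_som(genes)
labels = assign_to_nodes(genes, nodes)
print(labels)
```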

24
CLICK (CLuster Identification via Connectivity Kernels)
  • Compute the similarity between all pairs of genes
  • Construct a weighted similarity graph
  • Genes are represented by nodes
  • The weight of an edge connecting 2 genes reflects
    their expression similarity
  • Find the minimum-weight cut that separates the
    graph into 2 disconnected sub-graphs
  • Iterate on cutting the sub-graphs
  • Stop criteria for cutting

25
CLICK
  • Estimates the optimal number of clusters in the
    dataset
  • Identifies outlier genes and leaves them
    un-clustered (singletons)

26
Hierarchical Clustering
  • Organize the genes in a hierarchical tree
    structure
  • Initial step: each gene is regarded as a cluster
    with one item
  • Find the 2 most similar clusters and merge them
    into a common node
  • The length of the branch is proportional to the
    distance
  • Iterate on merging nodes until all genes are
    contained in one cluster: the root of the tree

27
Hierarchical Clustering: distance between clusters
Single-linkage
Average-linkage
Complete-linkage
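
A small sketch of agglomerative clustering with the three linkage rules, using their standard definitions: single linkage takes the smallest distance between any two members of the two clusters, complete linkage the largest, and average linkage the mean over all pairs. Euclidean distance and the function names are assumptions made for this example; the naive search for the closest pair is only suitable for small datasets.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(a, b, genes, linkage):
    """Distance between clusters a and b (lists of gene indices)."""
    dists = [euclidean(genes[i], genes[j]) for i in a for j in b]
    if linkage == "single":
        return min(dists)
    if linkage == "complete":
        return max(dists)
    if linkage == "average":
        return sum(dists) / len(dists)
    raise ValueError("unknown linkage: " + linkage)


def agglomerate(genes, linkage="average"):
    """Merge clusters bottom-up; returns a list of (left, right, distance)."""
    clusters = [[i] for i in range(len(genes))]  # each gene starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q], genes, linkage)
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return merges


# Hypothetical 2-condition profiles forming two obvious pairs.
genes = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 2.0]]
for linkage in ("single", "average", "complete"):
    print(linkage, agglomerate(genes, linkage)[-1][2])
```

The distance recorded at each merge is what determines the branch length when the tree is drawn.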
28
Mathematical evaluation of clustering solution
  • Merits of a good clustering solution
  • Homogeneity
  • Genes inside a cluster are highly similar to each
    other.
  • Average similarity between a gene and the center
    (average profile) of its cluster.
  • Separation
  • Genes from different clusters have low similarity
    to each other.
  • Weighted average similarity between centers of
    clusters.
  • These are conflicting features: increasing the
    number of clusters tends to improve within-cluster
    homogeneity at the expense of between-cluster
    separation
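
The two scores can be sketched as follows. The slides do not give the similarity measure or the weights, so this example assumes Pearson correlation as the similarity and weights each pair of cluster centers by the product of the two cluster sizes. With these definitions a good solution has high homogeneity and a low separation score (centers that are dissimilar). The data and labelings are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; undefined for constant profiles."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cluster_profiles(genes, labels):
    """Return ({label: average profile}, {label: cluster size})."""
    groups = {}
    for g, lab in zip(genes, labels):
        groups.setdefault(lab, []).append(g)
    profiles = {lab: [sum(col) / len(members) for col in zip(*members)]
                for lab, members in groups.items()}
    sizes = {lab: len(members) for lab, members in groups.items()}
    return profiles, sizes


def homogeneity(genes, labels):
    """Average similarity between each gene and its cluster's center."""
    profiles, _ = cluster_profiles(genes, labels)
    return sum(pearson(g, profiles[lab])
               for g, lab in zip(genes, labels)) / len(genes)


def separation(genes, labels):
    """Size-weighted average similarity between cluster centers (needs >= 2 clusters)."""
    profiles, sizes = cluster_profiles(genes, labels)
    ids = sorted(profiles)
    num = den = 0.0
    for idx, a in enumerate(ids):
        for b in ids[idx + 1:]:
            weight = sizes[a] * sizes[b]  # assumed weighting
            num += weight * pearson(profiles[a], profiles[b])
            den += weight
    return num / den


# Hypothetical data: three rising and three falling profiles.
genes = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.5], [1.0, 2.5, 3.0, 4.2],
         [4.0, 3.0, 2.0, 1.0], [5.0, 4.0, 2.5, 2.0], [4.1, 3.0, 2.2, 1.0]]
good = [0, 0, 0, 1, 1, 1]   # rising vs falling
mixed = [0, 1, 0, 1, 0, 1]  # trends mixed within each cluster

print(homogeneity(genes, good), separation(genes, good))
print(homogeneity(genes, mixed), separation(genes, mixed))
```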

29
Performance on Yeast Cell Cycle Data
698 genes, 72 conditions (Spellman et al. 1998).
Each algorithm was run by its authors in a
blind test.
Ben-Dor, Shamir, Yakhini 1999
30
Which genes to cluster?
  • Apply filtering prior to clustering: focus the
    analysis on the responding genes
  • Applying controlled statistical tests to identify
    responding genes usually ends up with too few
    genes, which doesn't allow global characterization
    of the response
  • Fold change: choose genes that changed by at
    least M-fold in at least L conditions
  • Variance: choose the top P genes with the highest
    variance over the dataset
  • Try various filtering schemes to find the setting
    that gives the best results (biologically)
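
The two filters might look like this in code. The sketch assumes each value is a positive expression ratio relative to a reference (so a ratio of 0.5 counts as a 2-fold change) and uses the population variance; the gene names and numbers are invented.

```python
import statistics


def fold_change_filter(ratios, min_fold, min_conditions):
    """Keep genes that changed at least `min_fold`-fold (up or down)
    in at least `min_conditions` conditions. Ratios must be positive."""
    kept = []
    for gene, values in ratios.items():
        changed = sum(1 for r in values if max(r, 1.0 / r) >= min_fold)
        if changed >= min_conditions:
            kept.append(gene)
    return kept


def top_variance_filter(ratios, top_p):
    """Return the `top_p` genes with the highest variance over the dataset."""
    ranked = sorted(ratios,
                    key=lambda gene: statistics.pvariance(ratios[gene]),
                    reverse=True)
    return ranked[:top_p]


# Hypothetical expression ratios (treated / reference) in 4 conditions.
ratios = {
    "geneA": [1.0, 1.1, 0.9, 1.0],
    "geneB": [2.5, 3.0, 1.0, 0.4],
    "geneC": [1.0, 2.2, 1.1, 1.0],
    "geneD": [0.3, 0.4, 0.5, 1.0],
}

print(fold_change_filter(ratios, min_fold=2.0, min_conditions=2))
print(top_variance_filter(ratios, top_p=2))
```

Note that the two filters need not agree: on raw ratios, variance favours up-regulated genes, since ratios below 1 are compressed into the interval (0, 1). Working on log ratios is a common way to treat both directions symmetrically.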

31
Clustering Tools
  • Cluster (Eisen): hierarchical
  • GeneCluster (Tamayo): SOM
  • TIGR MeV: K-means, SOM, hierarchical, QTC, CAST
  • Expander: CLICK, SOM, K-means, hierarchical
  • Many others

32
Ascribe Biological Meaning to Clusters
  • Identify over-represented functional categories
    in the clusters (i.e., a cluster contains many more
    genes of a specific biological process than
    expected by chance)
  • Requirements for systematic analysis:
  • Controlled vocabulary for describing biological
    processes (protein biosynthesis/translation,
    apoptosis/programmed cell death)
  • Standard assignment of genes into functional
    categories

33
Gene Ontology (GO) project
  • Defines controlled terms (ontologies) for the
    description of gene products from 3 aspects:
  • Biological process (DNA repair, mitosis)
  • Molecular function (protein serine/threonine
    kinase activity, transcription factor activity)
  • Cellular component (nucleus, ribosome)
  • Unified framework for gene annotation:
    species-independent vocabularies
  • A gene can have multiple associations in each
    ontology
  • GO terms are organized in hierarchical structures
    called directed acyclic graphs (DAGs)
  • Very general terms at top levels of the graph
  • Terms get more specialized at lower levels

35
Gene annotations using GO
  • Human: LocusLink (NCBI), GOA (EBI); 15K genes
    with biological process annotation
  • Mouse: MGI, GOA; 10K annotated genes
  • Rat: RGD; 2.5k annotated genes
  • Fly: FlyBase; 4.5k annotated genes
  • Arabidopsis: TAIR; 12k annotated genes
  • Yeast: SGD
  • Affymetrix chips: Netaffx

36
Ascribe Biological Meaning to Clusters
  • This analysis is NOT INFORMATIVE!
  • Some of the abundances can be explained just by
    chance
  • Statistical tests are essential to detect
    significant phenomena

37
Identifying enriched GO categories in clusters
  • In the previous example:
  • Total number of genes on the chip with annotation:
    5,000
  • Total number of genes on the chip associated with
    the metabolism GO category: 3,600
  • Number of annotated genes in cluster 3: 73
  • Number of metabolic genes in cluster 3: 50
  • Is this a statistically significant phenomenon?
  • Hypergeometric probability score
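
A sketch of the hypergeometric score for the numbers above, taking the score to be the upper-tail probability of seeing at least the observed number of category genes in a cluster drawn at random. The function name is made up; math.comb requires Python 3.8 or later.

```python
from math import comb


def hypergeom_enrichment_p(n_total, n_category, n_cluster, n_hits):
    """P(X >= n_hits) when n_cluster genes are drawn without replacement
    from n_total annotated genes, n_category of which are in the category."""
    total = comb(n_total, n_cluster)
    upper = min(n_cluster, n_category)
    tail = sum(comb(n_category, i) * comb(n_total - n_category, n_cluster - i)
               for i in range(n_hits, upper + 1))
    return tail / total


# Numbers from the slide: 5,000 annotated genes, 3,600 metabolic,
# 73 annotated genes in cluster 3, 50 of them metabolic.
expected = 73 * 3600 / 5000  # about 52.6 metabolic genes expected by chance
p_value = hypergeom_enrichment_p(5000, 3600, 73, 50)
print(expected, p_value)
```

Since 72% of all annotated genes are metabolic, about 52.6 of the 73 cluster genes would be metabolic by chance alone, so observing 50 is not an enrichment, however large the raw count looks.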

39
Functional GO enrichment - Tools
  • FatiGO
  • GoMiner
  • Expander
  • DAVID
  • EPGO (EBI)

40
Acknowledgements
  • SOM figures in this presentation were taken from
    a presentation by Benedikt Brors