Cluster Analysis of Microarray Expression Data Matrices

Provided by: brya84
1
Cluster Analysis of Microarray Expression Data
Matrices
  • Application of cluster analysis techniques in the
    elucidation of gene expression data

2
Function of Genes
The features of a living organism are governed
principally by its genes. If we want to fully
understand living systems, we must know the
function of each gene. Once we know a gene's
sequence, we can design experiments to find its
function.
The Classical Approach to Assigning a Function to
a Gene
Delete Gene X. Conclusion: Gene X is the left eye gene.
However, this approach is too slow to handle all
the gene sequence information we have today
(HGSP). A recently developed high-throughput gene
analysis technique is microarray analysis.
3
Microarray Analysis
Microarray analysis allows the activities of many
genes to be monitored over many different
conditions. Experiments are carried out on a
physical matrix.
[Figure: physical microarray matrix]
To facilitate computational analysis, the physical
matrix, which may contain thousands of genes, is
converted into a numerical matrix using image
analysis equipment.
Possible inference: if Gene X's activity
(expression) is affected by Condition Y (extreme
heat), then Gene X may be involved in protecting
the cellular components from extreme heat. Each
gene has its corresponding expression profile for
a set of conditions. This expression profile may
be thought of as a feature profile for that gene
over that set of conditions (a condition feature
profile).
4
Cluster Analysis
  • Cluster Analysis is an unsupervised procedure
    which involves grouping of objects based on their
    similarity in feature space.
  • In the Gene Expression context Genes are grouped
    based on the similarity of their Condition
    feature profile.
  • Cluster analysis was first applied to gene
    expression data from brewer's yeast
    (Saccharomyces cerevisiae) by Eisen et al. (1998).
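
The grouping of genes by the similarity of their condition feature profiles can be sketched in a few lines of Python. This is only a minimal illustration, assuming Pearson correlation as the similarity measure and a simple average-linkage merge rule; the gene names, the expression values and the `threshold` parameter are hypothetical and do not come from the slides.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles, threshold):
    """Merge the two most similar clusters until no pair reaches `threshold`."""
    clusters = [[name] for name in profiles]

    def similarity(c1, c2):
        # Average pairwise correlation between the members of two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        best_sim, (i, j) = max(
            (similarity(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_sim < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression profiles over four conditions.
profiles = {
    "gene1": [1, 2, 3, 4],
    "gene2": [2, 4, 6, 8],
    "gene3": [4, 3, 2, 1],
    "gene4": [8, 6, 4, 2],
}
clusters = average_linkage(profiles, threshold=0.9)
print(clusters)
```

With these made-up values, the genes that rise across the conditions end up in one cluster and the genes that fall end up in another.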

Clustering
  • Two general conclusions can be drawn from these
    clusters:
  • Genes clustered together may be related within a
    biological module/system.
  • If there are genes of known function within a
    cluster, these may help to classify this
    biological module/system.

5
From Data to Biological Hypothesis
[Figure: Gene Expression Microarray (Genes 1-7 over Conditions A-Z), the resulting Cluster Set, and System C (External Stimulus (Condition X), Cell Membrane, Regulator Protein)]
Cluster C, with four genes, may represent System
C. Relating these genes aids in the elucidation of
System C.
6
Some Drawbacks of Clustering Biological Data
  • Clustering works well over small numbers of
    conditions but a typical Microarray may have
    hundreds of experimental conditions. A global
    clustering may not offer sufficient resolution
    with so many features.
  • As with other clustering applications, it may be
    difficult to cluster noisy expression data.
  • Biological systems tend to be inter-related and
    may share numerous factors (genes). Clustering
    enforces partitions which may not accurately
    represent these relationships.
  • Clustering Genes over all Conditions only finds
    the strongest signals in the dataset as a whole.
    More local signals within the data matrix may
    be missed.

7
How do we better model more complex systems?
  • One technique that allows detection of all
    signals in the data is biclustering.
  • Instead of clustering genes over all conditions,
    biclustering clusters genes with respect to
    subsets of conditions.

This enables better representation of
8
Biclustering
[Figure: expression matrix of Genes 1-9 (rows) over Conditions A-H (columns)]
Clustering misses the local signal present over a
subset of conditions: conditions (B, E, F), genes
(1, 4, 6, 7, 9).
Biclustering discovers local coherences over a
subset of conditions.
  • Technique first described by J.A. Hartigan in
    1972 and termed Direct Clustering.
  • First introduced to microarray expression data by
    Cheng and Church (2000).
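
To make the idea of a local signal concrete, the short Python sketch below compares the correlation of two genes over all conditions with their correlation over the subset (B, E, F). The expression values are invented purely for illustration, and Pearson correlation is assumed as the measure of coherence; neither comes from the slides.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

conditions = "ABCDEFGH"
# Hypothetical profiles: the two genes agree only on conditions B, E and F.
gene1 = [5, 1, 9, 2, 5, 9, 7, 3]
gene4 = [2, 2, 1, 9, 6, 10, 3, 8]

subset = [conditions.index(c) for c in "BEF"]
r_all = pearson(gene1, gene4)
r_local = pearson([gene1[k] for k in subset], [gene4[k] for k in subset])

print("correlation over all conditions:", round(r_all, 3))
print("correlation over B, E, F only:  ", round(r_local, 3))
```

Over all eight conditions the two profiles look unrelated, while over the three selected conditions they move together, which is the kind of signal a global clustering would tend to miss.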

9
Approaches to Biclustering Microarray Gene
Expression
  • Cheng and Church (2000)
  • First applied biclustering to gene expression
    data. Used a sub-matrix scoring technique to
    locate biclusters.
  • Tanay et al. (2002)
  • Modelled the expression data as bipartite graphs
    and used graph techniques to find complete
    bipartite sub-graphs (biclusters).
  • Lazzeroni and Owen
  • Used matrix reordering and Plaid Models to
    represent different layers of signals
    (biclusters) within the data.

10
Bipartite Graph Modelling
  • First proposed in 'Discovering statistically
    significant biclusters in gene expression data',
    Tanay et al., Bioinformatics 2002.

Within the graph modelling paradigm, biclusters
are equivalent to complete bipartite
sub-graphs. Tanay and colleagues used
probabilistic models to determine the least
probable sub-graphs (those showing most order) in
order to identify biclusters.
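
The equivalence between biclusters and complete bipartite sub-graphs can be illustrated with a small Python sketch. It assumes an unweighted graph in which an edge (gene, condition) means that the gene responds under that condition; the edge list and the function names are hypothetical, and the probabilistic scoring used by Tanay and colleagues is not modelled here.

```python
def is_complete_bipartite(edges, genes, conditions):
    """True if every gene in `genes` is linked to every condition in `conditions`."""
    return all((g, c) in edges for g in genes for c in conditions)

def shared_conditions(edges, genes):
    """Conditions linked to every gene in `genes`."""
    per_gene = [{c for (g, c) in edges if g == gene} for gene in genes]
    return set.intersection(*per_gene) if per_gene else set()

# Hypothetical graph: genes 1, 4, 6, 7 and 9 all respond under B, E and F.
edges = {(g, c) for g in (1, 4, 6, 7, 9) for c in "BEF"}
edges |= {(2, "A"), (4, "A"), (5, "H")}  # a few unrelated edges

print(is_complete_bipartite(edges, {1, 4, 6, 7, 9}, {"B", "E", "F"}))
print(is_complete_bipartite(edges, {1, 2}, {"B"}))
print(sorted(shared_conditions(edges, [1, 4])))
```

A gene set together with the conditions returned by `shared_conditions` always forms a complete bipartite sub-graph, which gives one simple way to propose candidate biclusters under these assumptions.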
11
The Cheng and Church Approach
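The core element in this approach is the
development of a score to prioritise
sub-matrices. This score is based on the
concept of the residue of an entry in a
matrix. In the matrix (I,J), the residue of
element $a_{ij}$ is given by
$R(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$
where $a_{iJ}$ is the mean of row i, $a_{Ij}$ is
the mean of column j and $a_{IJ}$ is the mean of
the whole matrix.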
12
The Cheng and Church Approach (2)
The mean squared residue score (H) for a matrix
(I,J) is then calculated:
$H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} R(a_{ij})^2$
where $R(a_{ij})$ is the residue of element $a_{ij}$.
This global H score gives an indication of how
the data fit together within that matrix:
whether they have some coherence or are random.
A high H value signifies that the data are
uncorrelated. A matrix of equally spread
random values over the range [a,b] has an
expected H score of $(b-a)^2/12$. For the range
[0,800], H(I,J) ≈ 53,333.
A low H score means that there is a correlation
in the matrix. A score of H(I,J) = 0 would
mean that the data in the matrix fluctuate in
unison, i.e. the sub-matrix is a bicluster.
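
As a rough illustration of the H score, the following Python sketch computes the mean squared residue of a small matrix from its row, column and overall means. The function name `mean_squared_residue` and both example matrices are hypothetical and are not taken from the slides.

```python
def mean_squared_residue(matrix):
    """Mean of the squared residues a_ij - rowmean_i - colmean_j + overall mean."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = matrix[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)

# Hypothetical coherent matrix: every row is the first row plus a constant.
coherent = [[10, 20, 15],
            [12, 22, 17],
            [30, 40, 35]]
# The same values rearranged so that the rows no longer move in unison.
scrambled = [[10, 40, 15],
             [22, 12, 35],
             [30, 20, 17]]

h_coherent = mean_squared_residue(coherent)
h_scrambled = mean_squared_residue(scrambled)
print(h_coherent, h_scrambled)
```

In this made-up case the rows of the coherent matrix fluctuate in unison, so its score is zero up to rounding, while the scrambled matrix scores much higher.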
13
Worked example of H score
Matrix (M): Avg. = 6.5
Row Avg.: 2, 5, 8, 11
Col Avg.: 5.4, 6.4, 7.4
R(1) = 1 - 2 - 5.4 + 6.5 = 0.1
R(2) = 2 - 2 - 6.4 + 6.5 = 0.1
...
R(12) = 12 - 11 - 7.4 + 6.5 = 0.1
H(M) = (0.01 × 12)/12 = 0.01
If 5 were replaced with 3, the score would
change to H(M2) = 2.06. If the matrix were
reshuffled randomly, the score would be
around H(M3) = (12-1)^2/12 = 10.08.
14
The Cheng and Church Approach Node Deletion
Biclustering Algorithm
In order to find all possible biclusters in an
expression matrix, all sub-matrices must be tested
using the H score.
Node Deletion
In a node deletion algorithm, all columns and rows
are tested for deletion. If removing a row or
column decreases the H score of the matrix, then
it is removed.
This continues until it is not possible to
decrease the H score further. This low H score,
coherent sub-matrix (bicluster) is then returned.
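
The deletion loop described above might be sketched in Python as follows. This is a simplified greedy version and not necessarily the exact procedure of Cheng and Church: at each step it removes the single row or column whose removal lowers the H score most. The stopping threshold `delta`, the minimum size limits and the example data are assumptions of this sketch, added so that the loop does not shrink the matrix down to a trivial single row or column.

```python
def h_score(data, rows, cols):
    """Mean squared residue of the sub-matrix given by `rows` and `cols`."""
    row_means = {r: sum(data[r][c] for c in cols) / len(cols) for r in rows}
    col_means = {c: sum(data[r][c] for r in rows) / len(rows) for c in cols}
    overall = sum(row_means.values()) / len(rows)
    total = sum((data[r][c] - row_means[r] - col_means[c] + overall) ** 2
                for r in rows for c in cols)
    return total / (len(rows) * len(cols))

def node_deletion(data, delta, min_rows=2, min_cols=2):
    """Greedily delete rows/columns until the H score is at most `delta`."""
    rows = list(range(len(data)))
    cols = list(range(len(data[0])))
    current = h_score(data, rows, cols)
    while current > delta:
        best = None  # (score after deletion, "row" or "col", index)
        if len(rows) > min_rows:
            for r in rows:
                s = h_score(data, [x for x in rows if x != r], cols)
                if best is None or s < best[0]:
                    best = (s, "row", r)
        if len(cols) > min_cols:
            for c in cols:
                s = h_score(data, rows, [x for x in cols if x != c])
                if best is None or s < best[0]:
                    best = (s, "col", c)
        if best is None or best[0] >= current:
            break  # no single deletion lowers the score any further
        current, kind, index = best
        (rows if kind == "row" else cols).remove(index)
    return rows, cols, current

# Hypothetical data: four coherent rows plus one row that does not fit.
data = [[1, 2, 3, 4],
        [2, 3, 4, 5],
        [4, 5, 6, 7],
        [6, 7, 8, 9],
        [9, 1, 7, 2]]
rows, cols, score = node_deletion(data, delta=0.01)
print(rows, cols, score)
```

On this made-up matrix the sketch discards the row that does not fit and returns the remaining coherent block. Recomputing the score for every candidate deletion is slow on real expression matrices, so this is meant only to show the logic.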
15
The Cheng and Church Approach
Some results on lymphoma data (4026 × 96)
16
Conclusions
  • High-throughput functional genomics (microarrays)
    requires data mining applications.
  • Biclustering resolves expression data more
    effectively than single-dimensional cluster
    analysis.
  • The Cheng and Church approach offers a good base
    for future work.
Future Research/Questions
  • Implement a simple H score program to facilitate
    study of the H score concept.
  • Are there alternative scorings which would
    better apply to gene expression data?
  • Do unbiclustered genes have any significance?
    Horizontally transferred genes?
  • Implement a full-scale biclustering program and
    look at better adaptation to expression data sets
    and the biological context.