Title: Cluster Analysis of Microarray Expression Data Matrices
1Cluster Analysis of Microarray Expression Data
Matrices
- Application of cluster analysis techniques in the
elucidation of gene expression data
2Function of Genes
The features of a living organism are governed
principally by its genes. If we want to fully
understand living systems we must know the
function of each gene. Once we know a gene's
sequence we can design experiments to find its
function.
The Classical Approach to Assigning a Function to
a Gene
Delete Gene X
Conclusion: Gene X is a left eye gene.
However, this approach is too slow to handle all
the gene sequence information we have today
(HGSP). A recently developed high-throughput Gene
Analysis Technique is Microarray Analysis.
3Microarray Analysis
Microarray analysis allows the monitoring of the
activities of many genes over many different
conditions. Experiments are carried out on a
Physical Matrix like the one below.
To facilitate computational analysis, the physical
matrix, which may contain thousands of genes, is
converted into a numerical matrix using image
analysis equipment.
Possible inference: if Gene X's activity
(expression) is affected by Condition Y (Extreme
Heat), then Gene X may be involved in protecting
the cellular components from extreme heat. Each
Gene has a corresponding Expression Profile for
a set of conditions. This Expression Profile may
be thought of as a feature profile for that gene
over that set of conditions (a condition feature
profile).
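As a rough illustration of the numerical matrix and the condition feature profile, here is a minimal Python sketch. The gene names, condition names, values and the two-fold cut-off are all hypothetical, chosen only to show the idea; they are not real measurements.

```python
# Hypothetical numerical expression matrix: one row per gene,
# one column per condition (values are illustrative only).
conditions = ["Control", "Heat", "Cold", "Starvation"]
expression = {
    "Gene X": [1.0, 4.2, 0.9, 1.1],
    "Gene Y": [1.0, 1.1, 0.8, 3.5],
    "Gene Z": [1.0, 3.9, 1.0, 1.2],
}

def expression_profile(gene):
    """Return the condition feature profile of a gene as {condition: value}."""
    return dict(zip(conditions, expression[gene]))

def responsive_genes(condition, baseline="Control", fold=2.0):
    """List genes whose value under `condition` is at least `fold` times baseline.

    The fold-change rule is an assumed, simplified criterion for "affected by".
    """
    c = conditions.index(condition)
    b = conditions.index(baseline)
    return [g for g, row in expression.items() if row[c] >= fold * row[b]]

print(expression_profile("Gene X"))
print(responsive_genes("Heat"))
```

Under this assumed rule, the genes listed for "Heat" would be the candidates for a role in the heat response.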
4Cluster Analysis
- Cluster Analysis is an unsupervised procedure
which involves grouping objects based on their
similarity in feature space.
- In the Gene Expression context, Genes are grouped
based on the similarity of their Condition
feature profiles.
- Cluster analysis was first applied to Gene
Expression data from Brewer's Yeast
(Saccharomyces cerevisiae) by Eisen et al. (1998).
[Figure: Clustering]
Two general conclusions can be drawn from these
clusters:
- Genes clustered together may be related within a
biological module/system.
- If there are genes of known function within a
cluster, these may help to classify this
biological module/system.
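Grouping genes by the similarity of their profiles can be sketched as follows. This is a minimal average-linkage example using Pearson correlation as the similarity measure; the profiles, the 0.8 threshold and the function names are assumptions for illustration, not a reproduction of the published analysis.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_genes(profiles, threshold=0.8):
    """Merge the two most similar clusters until no pair reaches `threshold`."""
    clusters = [[g] for g in profiles]

    def similarity(c1, c2):
        # Average pairwise correlation between members of two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]))
        if similarity(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Hypothetical profiles over four conditions.
profiles = {
    "Gene 1": [1.0, 2.0, 3.0, 4.0],
    "Gene 2": [2.1, 3.9, 6.2, 8.0],
    "Gene 3": [4.0, 3.0, 2.0, 1.0],
    "Gene 4": [8.0, 6.1, 4.2, 1.9],
}
print(cluster_genes(profiles))
```

Genes 1 and 2 rise together and Genes 3 and 4 fall together, so the sketch separates them into two clusters.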
5From Data to Biological Hypothesis
Gene Expression Microarray -> Cluster Set
Cluster C, with four Genes, may represent System
C. Relating these genes aids in the elucidation of
System C.
[Figure: expression matrix of Genes 1-7 over Conditions (A-Z); diagram of System C showing External Stimulus (Condition X), Cell Membrane and Regulator Protein]
6Some Drawbacks of Clustering Biological Data
- Clustering works well over small numbers of
conditions, but a typical Microarray may have
hundreds of experimental conditions. A global
clustering may not offer sufficient resolution
with so many features.
- As with other clustering applications, it may be
difficult to cluster noisy expression data.
- Biological Systems tend to be inter-related and
may share numerous factors (Genes). Clustering
enforces partitions which may not accurately
represent these relationships.
- Clustering Genes over all Conditions only finds
the strongest signals in the dataset as a whole.
More local signals within the data matrix may
be missed.
7How do we better model more complex systems?
- One technique that allows detection of all
signals in the data is biclustering.
- Instead of clustering genes over all conditions,
biclustering clusters genes with respect to
subsets of conditions.
This enables better representation of
8Biclustering
[Figure: expression matrix of Genes 1-9 (rows) over Conditions A-H (columns)]
Clustering misses the local signal (conditions B, E, F;
genes 1, 4, 6, 7, 9) present over a subset of conditions.
Biclustering discovers local coherences over a
subset of conditions.
- Technique first described by J.A. Hartigan in
1972 and termed Direct Clustering.
- First introduced to Microarray expression data by
Cheng and Church (2000).
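A bicluster can be pictured as a sub-matrix picked out by a subset of genes and a subset of conditions. The sketch below plants a signal at conditions B, E, F for genes 1, 4, 6, 7, 9 in an otherwise arbitrary matrix and then extracts it; all values are hypothetical and the helper name `submatrix` is an assumption.

```python
conditions = list("ABCDEFGH")
genes = [f"Gene {k}" for k in range(1, 10)]

signal_genes = ["Gene 1", "Gene 4", "Gene 6", "Gene 7", "Gene 9"]
signal_conditions = ["B", "E", "F"]

# Hypothetical data: arbitrary deterministic background values,
# with a common raised value planted in the local signal.
matrix = {
    g: {c: float((3 * gi + 5 * ci) % 7) for ci, c in enumerate(conditions)}
    for gi, g in enumerate(genes)
}
for g in signal_genes:
    for c in signal_conditions:
        matrix[g][c] = 10.0

def submatrix(matrix, gene_subset, condition_subset):
    """Rows for the chosen genes, restricted to the chosen conditions."""
    return [[matrix[g][c] for c in condition_subset] for g in gene_subset]

local = submatrix(matrix, signal_genes, signal_conditions)
for g, row in zip(signal_genes, local):
    print(g, row)
```

Over all eight conditions the five genes differ in their background values; restricted to B, E and F their rows are identical, which is the kind of local coherence biclustering looks for.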
9Approaches to Biclustering Microarray Gene
Expression
- Cheng and Church (2000): first applied
biclustering to Gene Expression Data; used a
sub-matrix scoring technique to locate biclusters.
- Tanay et al. (2002): modelled the expression data
as Bipartite graphs and used graph techniques to
find complete bipartite sub-graphs (biclusters).
- Lazzeroni and Owen: used matrix reordering to
represent different layers of signals
(biclusters); Plaid Models represent multiple
signals within the data.
10Bipartite Graph Modelling
- First proposed in "Discovering statistically
significant biclusters in gene expression data",
Tanay et al., Bioinformatics, 2002.
Within the graph modelling paradigm biclusters
are equivalent to complete bipartite
sub-graphs. Tanay and colleagues used
probabilistic models to determine the least
probable sub-graphs (those showing most order) to
identify biclusters.
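The graph view can be sketched in a few lines: genes and conditions form the two node sets, and an edge means the gene responds under that condition. The edges below are hypothetical, and the sketch only checks completeness of a sub-graph; it does not attempt the probabilistic scoring used by Tanay and colleagues.

```python
# Hypothetical bipartite graph: gene -> set of conditions it is linked to.
responds_in = {
    "Gene 1": {"B", "E", "F"},
    "Gene 2": {"A"},
    "Gene 4": {"B", "C", "E", "F"},
    "Gene 6": {"B", "E", "F", "H"},
}

def shared_conditions(gene_set):
    """Conditions linked to every gene in a non-empty gene_set."""
    return set.intersection(*(responds_in[g] for g in gene_set))

def is_complete_bipartite(gene_set, condition_set):
    """True if every gene in gene_set has an edge to every condition in condition_set."""
    return all(set(condition_set) <= responds_in[g] for g in gene_set)

candidate_genes = ["Gene 1", "Gene 4", "Gene 6"]
common = shared_conditions(candidate_genes)
print(sorted(common))
print(is_complete_bipartite(candidate_genes, common))
print(is_complete_bipartite(candidate_genes + ["Gene 2"], common))
```

Here the three candidate genes together with their shared conditions form a complete bipartite sub-graph, while adding Gene 2 breaks completeness.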
11The Cheng and Church Approach
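The core element in this approach is the
development of a scoring scheme to prioritise
sub-matrices. This scoring is based on the
concept of the residue of an entry in a
matrix. In the matrix (I,J), the residue score of
element a_{ij} is given by
$$R(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$$
where a_{iJ} is the mean of row i, a_{Ij} is the
mean of column j, and a_{IJ} is the mean of the
whole matrix.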
12The Cheng and Church Approach(2)
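The mean squared residue score (H) for a matrix
(I,J) is then calculated as
$$H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} R(a_{ij})^2$$
where R(a_{ij}) is the residue of element a_{ij},
and |I| and |J| are the numbers of rows and columns.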
This global H score gives an indication of how
the data fits together within that matrix:
whether it has some coherence or is random.
A high H value signifies that the data is
uncorrelated. A matrix of equally spread
random values over the range [a,b] has an
expected H score of (b-a)^2/12. For the range
[0,800], H(I,J) = 53,333.
A low H score means that there is a correlation
in the matrix. A score of H(I,J) = 0 would
mean that the data in the matrix fluctuates in
unison, i.e. the sub-matrix is a bicluster.
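The two extremes can be checked numerically. The sketch below computes the mean squared residue for a matrix of uniformly random values in [0, 800] and for a matrix whose rows move in unison; the matrix size (200 x 50), the seed and the function name are arbitrary assumptions.

```python
import random

def mean_squared_residue(m):
    """H score: mean over all entries of (a_ij - row mean - column mean + matrix mean)^2."""
    rows, cols = len(m), len(m[0])
    row_mean = [sum(r) / cols for r in m]
    col_mean = [sum(r[j] for r in m) / rows for j in range(cols)]
    total_mean = sum(row_mean) / rows
    return sum((m[i][j] - row_mean[i] - col_mean[j] + total_mean) ** 2
               for i in range(rows) for j in range(cols)) / (rows * cols)

random.seed(0)  # fixed seed so the run is repeatable
a, b = 0, 800
random_matrix = [[random.uniform(a, b) for _ in range(50)] for _ in range(200)]

# Each row is the same pattern shifted by a constant: the rows fluctuate in unison.
coherent_matrix = [[10 * r + c for c in range(50)] for r in range(200)]

print("random   H:", mean_squared_residue(random_matrix))
print("expected  :", (b - a) ** 2 / 12)
print("coherent H:", mean_squared_residue(coherent_matrix))
```

For a matrix this large the random score should land close to (b-a)^2/12, while the coherent matrix scores essentially zero (up to floating-point rounding).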
13Worked example of H score
Matrix (M), Avg. = 6.5

            Col 1   Col 2   Col 3   Row Avg.
Row 1         1       2       3        2
Row 2         4       5       6        5
Row 3         7       8       9        8
Row 4        10      11      12       11
Col Avg.     5.5     6.5     7.5

R(1) = 1 - 2 - 5.5 + 6.5 = 0
R(2) = 2 - 2 - 6.5 + 6.5 = 0
...
R(12) = 12 - 11 - 7.5 + 6.5 = 0

H(M) = (0 x 12)/12 = 0
If 5 were replaced with 3, the score would
change to H(M2) = 0.17. If the matrix were
reshuffled randomly, the score would be
around H(M3) = (12-1)^2/12 = 10.08.
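The worked example can be checked with a short script. It assumes M is the 4 x 3 matrix holding the values 1 to 12 in row order, which is consistent with the row averages 2, 5, 8, 11 and the overall average 6.5; the function names are illustrative.

```python
def residues(m):
    """Residue of every entry: a_ij - row mean - column mean + matrix mean."""
    rows, cols = len(m), len(m[0])
    row_mean = [sum(r) / cols for r in m]
    col_mean = [sum(r[j] for r in m) / rows for j in range(cols)]
    total_mean = sum(row_mean) / rows
    return [[m[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(cols)] for i in range(rows)]

def h_score(m):
    """Mean of the squared residues over the whole matrix."""
    res = residues(m)
    return sum(x * x for row in res for x in row) / (len(m) * len(m[0]))

# Assumed layout of the example matrix M (values 1..12 in row order).
M = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9],
     [10, 11, 12]]

# M2: the same matrix with the entry 5 replaced by 3.
M2 = [row[:] for row in M]
M2[1][1] = 3

print("residues of M:", residues(M))
print("H(M)  =", h_score(M))
print("H(M2) =", round(h_score(M2), 2))
```

Changing a single entry is enough to move the score away from zero, which is what makes H usable as a coherence measure.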
14The Cheng and Church Approach Node Deletion
Biclustering Algorithm
In order to find all possible biclusters in an
Expression Matrix all sub-matrices must be tested
using the H score.
Node Deletion
In a node deletion algorithm, all columns and rows
are tested for deletion. If removing a row or
column decreases the H score of the matrix, then
it is removed.
This continues until it is not possible to
decrease the H score further. This low H score
coherent sub-matrix (bicluster) is then returned.
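The deletion loop described above can be sketched as a simple greedy procedure. This is a simplified illustration rather than the published algorithm: the stopping threshold `delta`, the minimum sizes (added because a single row or column trivially scores zero) and the example data are assumptions.

```python
def h_score(matrix, rows, cols):
    """Mean squared residue of the sub-matrix given by row and column index lists."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    total = sum(row_mean.values()) / len(rows)
    return sum((matrix[i][j] - row_mean[i] - col_mean[j] + total) ** 2
               for i in rows for j in cols) / (len(rows) * len(cols))

def node_deletion(matrix, delta=0.01, min_rows=2, min_cols=2):
    """Greedily delete the row or column whose removal lowers H the most."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    score = h_score(matrix, rows, cols)
    while score > delta:
        candidates = []
        if len(rows) > min_rows:
            candidates += [(h_score(matrix, [r for r in rows if r != i], cols), "row", i)
                           for i in rows]
        if len(cols) > min_cols:
            candidates += [(h_score(matrix, rows, [c for c in cols if c != j]), "col", j)
                           for j in cols]
        if not candidates:
            break
        best, kind, index = min(candidates)
        if best >= score:  # no deletion decreases the score any further
            break
        (rows if kind == "row" else cols).remove(index)
        score = best
    return rows, cols, score

# Hypothetical data: rows 0-2 move in unison, row 3 does not.
data = [[1, 2, 3, 4],
        [2, 3, 4, 5],
        [4, 5, 6, 7],
        [9, 1, 7, 2]]
print(node_deletion(data))

# The same data transposed: now column 3 is the incoherent one.
transposed = [list(col) for col in zip(*data)]
print(node_deletion(transposed))
```

Each pass rescans every remaining row and column, so this sketch is only practical for small matrices.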
15The Cheng and Church Approach
Some results on lymphoma data (4026 × 96)
16Conclusions
- High throughput Functional Genomics (Microarrays)
requires Data Mining Applications.
- Biclustering resolves Expression Data more
effectively than single dimensional Cluster
Analysis.
- The Cheng and Church Approach offers a good base
for future work.
Future Research/Questions
- Implement a simple H score program to facilitate
study of the H score concept.
- Are there alternative scorings which would
better apply to gene expression data?
- Have unbiclustered genes any significance?
Horizontally transferred genes?
- Implement a full scale biclustering program and
look at better adaptation to expression data sets
and the biological context.