Title: On the Theory and Practice of Co-clustering
1On the Theory and Practice of Co-clustering
- Srujana Merugu, Gunjan Gupta, Joydeep Ghosh
2Overview
- What is co-clustering?
- Why co-clustering on Gene Expression data?
- Defining quality measures for co-clusters.
- Structural types of co-clusters.
- Algorithmic strategies.
3The dual problem in Clustering
- Often, we can cluster points based on features,
or features based on points. For example:
  - Market basket data: customers vs. products.
  - Web: documents vs. words.
  - Genes: gene expressions vs. experimental
conditions.
- Good clusters of documents depend on relevant
words, but the choice of relevant words depends on
the documents being clustered.
4Can we simultaneously cluster/optimize for both?
- A brief history:
  - 1965: Problem first described formally by I. J. Good
  - 1972: First solution, Direct Clustering (J. A. Hartigan, JASA)
  - 2000: First use in Bioinformatics, Bi-Clustering (Cheng & Church, AAAI)
5Gene Expression Data from DNA Micro Arrays
          Condition 1   ...   Condition J   ...   Condition M
Gene 1    A11           ...   A1J           ...   A1M
...       ...                 ...                 ...
Gene I    AI1           ...   AIJ           ...   AIM
...       ...                 ...                 ...
Gene N    AN1           ...   ANJ           ...   ANM
6An Example: Genes vs. individual cancer samples
23,100x67 matrix from 67 tumors, 17,108
genes. Source: PNAS, David Botstein, 2001
7Objectives when analyzing Gene Expression Data
Identifying new metabolic processes/regulation,
or identifying genes involved in a process, by, say:
- Grouping genes by similar expression under some
conditions
- Classification of a new gene G4 showing similar
expression
[Figure: genes G1, G2, G3 and the new gene G4 vs. conditions C1, C2, C5, C6; the block formed by G1-G3 is process X, labeled by a biologist]
- Similarly, we can solve the dual problem:
finding that a new condition C7 also involves
the identified biological process X.
8Important Point: Many-to-many and simultaneous
mapping of both genes and conditions to Metabolic
Processes (clusters)
A subset of Genes under a specific subset of
Conditions forms one metabolic process.
In traditional clustering and partitioning, clusters
are mutually exclusive, so both Process 1 and
Process 2 cannot be found simultaneously.
9Summary: Why Gene Expression data and
Co-clustering?
- Overlapping clusters are natural to many
co-clustering techniques
- Only subsets of conditions/genes may be catered
to
- Good for sparse data
- Noisy data
- Hard to control all experimental conditions
(origin of cells, temperature, osmotic stress,
centrifuge process etc.)
- Smoothing
10Co-clustering publications in Bioinformatics
At least 14 papers since Cheng & Church's paper
in 2000, ALL on gene expression data, including:
- 2000, 4 papers
  - Cheng & Church (yeast and human microarray)
  - Califano, Stolovitzky & Tu (phenotype
classification, gene microarray data)
  - Getz, Levine & Domany (gene expression data on
cancer)
  - Lazzeroni & Owen (yeast gene expression data)
- 2001, 2 papers
- 2002, 3 papers, including
  - Ben-Dor et al. (breast tumor set, gene expression
data)
- 2003, 5 papers, all on cancer or yeast gene
expression data
- 2004, 1 paper, yeast and cancer gene expression data
  - Cho, Dhillon, Guan & Sra, Minimum Sum-Squared
Residue Co-clustering of Gene Expression Data
11Types of co-clusters in Bioinformatics (1)
12Types of co-clusters in Bioinformatics (2)
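13A model for various types of co-clusters
Models for discovering a co-cluster set $a(I,J)$, with
$I \subseteq \{1..N\}$ and $J \subseteq \{1..M\}$, for a set of N genes (rows)
and M conditions (columns), of types A, B, C, D and E:
Additive model (1):
$$a_{ij} = \mu + \alpha_i + \beta_j \tag{1}$$
Multiplicative model (2):
$$a_{ij} = \mu \cdot \alpha_i \cdot \beta_j \tag{2}$$
where $\mu$ is the global mean, $\alpha_i$ is the adjustment for row $i \in I$,
and $\beta_j$ is the adjustment for column $j \in J$.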
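The additive and multiplicative models can be illustrated with a minimal NumPy sketch. The function names and the symbols `mu`, `row_adj` and `col_adj` (global mean, row adjustments, column adjustments) are illustrative assumptions, not notation taken from the slides.

```python
import numpy as np

def additive_cocluster(mu, row_adj, col_adj):
    """Build a co-cluster where entry (i, j) = mu + row_adj[i] + col_adj[j]."""
    row_adj = np.asarray(row_adj, dtype=float)
    col_adj = np.asarray(col_adj, dtype=float)
    # Broadcasting: column vector of row terms + row vector of column terms.
    return mu + row_adj[:, None] + col_adj[None, :]

def multiplicative_cocluster(mu, row_adj, col_adj):
    """Build a co-cluster where entry (i, j) = mu * row_adj[i] * col_adj[j]."""
    row_adj = np.asarray(row_adj, dtype=float)
    col_adj = np.asarray(col_adj, dtype=float)
    return mu * row_adj[:, None] * col_adj[None, :]

# Hypothetical adjustments, chosen only for illustration.
A = additive_cocluster(1.0, [0.0, 1.0, 2.0], [0.0, 3.0])
M = multiplicative_cocluster(2.0, [1.0, 2.0], [1.0, 0.5, 3.0])
```

For strictly positive values, taking logarithms turns a multiplicative co-cluster into an additive one, so a method built for one model can often be reused for the other.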
14Types of Co-clusters in Bioinformatics (1)
15Types of Co-clusters in Bioinformatics (2)
Green and red are coherently regulated, even if out
of phase
F. Coherent Evolutionary (row, column, or both
axes)
[Figure F example: coherent co-evolution on columns. Matrix values as extracted: 10 19 13 70; 35 49 40 29; 15 27 20 40; 12 20 15 90]
16Co-clustering with Noise
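A noise term can be introduced. Cheng & Church
introduced one such concept, called the Mean
Squared Residue. The residue of element (i, j)
is defined as
$$r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$$
where $a_{ij}$ is the noisy value, $a_{iJ}$ is the row i mean,
$a_{Ij}$ is the column j mean, and $a_{IJ}$ is the co-cluster mean:
$$a_{iJ} = \frac{1}{|J|}\sum_{j \in J} a_{ij}, \quad a_{Ij} = \frac{1}{|I|}\sum_{i \in I} a_{ij}, \quad a_{IJ} = \frac{1}{|I||J|}\sum_{i \in I,\, j \in J} a_{ij}$$
Then the Mean Squared Residue of co-cluster (I,J)
is given by
$$H(I,J) = \frac{1}{|I||J|}\sum_{i \in I,\, j \in J} r_{ij}^2$$
Minimize this for a good co-cluster.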
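The Mean Squared Residue can be computed directly from the row, column and co-cluster means. The sketch below assumes the usual Cheng & Church form (value minus row mean minus column mean plus co-cluster mean). The function name and the tiny test matrices are illustrative, not data from the slides.

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Mean Squared Residue of the co-cluster given by row and column indices."""
    sub = np.asarray(A, dtype=float)[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)   # mean of each row over the columns
    col_means = sub.mean(axis=0, keepdims=True)   # mean of each column over the rows
    overall = sub.mean()                          # co-cluster mean
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())

# A perfectly additive block (each row is a shifted copy of the others) has zero residue.
additive = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
score_additive = mean_squared_residue(additive, [0, 1, 2], [0, 1, 2])

# A block that breaks the additive pattern has a positive residue.
noisy = [[1, 2], [3, 5]]
score_noisy = mean_squared_residue(noisy, [0, 1], [0, 1])
```

A score of zero means the block fits the additive pattern exactly; larger scores mean more noise relative to that pattern.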
17Flavors of co-clustering structures and their
relevance to Gene Array data
Various algorithms, irrespective of the
similarity type, discover one or more of the
following co-cluster structures, in increasing
order of generality:
- Single
- Exclusive row and column
- Checkerboard
- Exclusive row, or exclusive column
- Non-overlapping, non-exclusive
- Non-overlapping, tree structure
- Overlapping with hierarchical structure
- Arbitrarily positioned overlapping
18Various Algorithmic approaches for searching
optimal co-clusters
- Iterative row and column clustering combination
  - CTWC (Coupled Two-Way Clustering), ITWC
(Interrelated Two-Way Clustering), Double
Conjugated Clustering (DCC) and more
  - One example, CTWC:
    1. Partition rows into R clusters and columns into C
clusters separately, using T as an annealing parameter to
control cluster size. Lower T results in smaller clusters.
    2. Co-clusters are defined as (R,C) tuples. Prune
unstable (R,C) tuples, defined as clusters that broke up
fast within a short T interval. Stop if large
stable clusters are no longer in the pruned (R,C) set.
    3. Decrease T and repeat Step 1 on each (R,C) tuple,
keeping track of the parent tuple.
- Divide-and-conquer
- Greedy iterative search: add/remove points based
on local gain, for example Cheng & Church using
Mean Squared Residue
- Exhaustive enumeration: made non-exponential by
limiting the number of columns in which a gene's activity is
non-zero; relevant to gene expression.
- Distribution parameter identification: assume a
parametric distribution of clusters and search
for the parameters for the K clusters.
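The greedy iterative search strategy can be sketched as repeated single-node deletion driven by the Mean Squared Residue. This is a simplified illustration in the spirit of Cheng & Church, not their full algorithm. The threshold `delta`, the size limits `min_rows` and `min_cols`, and the example matrix are assumptions made here.

```python
import numpy as np

def residue_scores(sub):
    """Return (overall MSR, per-row mean squared residue, per-column mean squared residue)."""
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    sq = (sub - row_means - col_means + sub.mean()) ** 2
    return sq.mean(), sq.mean(axis=1), sq.mean(axis=0)

def greedy_delete(A, delta, min_rows=2, min_cols=2):
    """Drop the worst-fitting row or column until the MSR is at most delta."""
    A = np.asarray(A, dtype=float)
    rows = list(range(A.shape[0]))
    cols = list(range(A.shape[1]))
    while True:
        score, row_scores, col_scores = residue_scores(A[np.ix_(rows, cols)])
        if score <= delta:
            break
        can_drop_row = len(rows) > min_rows
        can_drop_col = len(cols) > min_cols
        if not (can_drop_row or can_drop_col):
            break  # size limits reached; return the best block found so far
        r = int(np.argmax(row_scores))  # position of the worst row in `rows`
        c = int(np.argmax(col_scores))  # position of the worst column in `cols`
        if can_drop_row and (not can_drop_col or row_scores[r] >= col_scores[c]):
            rows.pop(r)
        else:
            cols.pop(c)
    return rows, cols, float(score)

# Hypothetical data: an additive 4x4 block with row 2 replaced by unrelated values.
data = [[0, 10, 20, 30],
        [1, 11, 21, 31],
        [5, -7, 40, 1],
        [3, 13, 23, 33]]
kept_rows, kept_cols, final_score = greedy_delete(data, delta=1e-9)
```

Each step makes a purely local choice, so the result depends on the starting block and is not guaranteed to be the largest co-cluster under the threshold.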
19Another way of looking at Co-clustering: a
Bipartite graph
Example: SAMBA, using exhaustive enumeration for a
limited degree of a gene node.
Minimize weighted/binary edge cut to produce
heaviest bipartite sub-graphs to get co-clusters.
Can find overlapping sub-graphs.
20To Summarize existing work, and some comments ..
- Coherent additive or coherent co-evolutionary
approaches are quite popular and adaptive enough
for biology.
- All co-clustering papers in bioinformatics have
looked at 2-D gene array data.
- Quality measures are diverse and vary substantially
by algorithm; only some of them are relevant to
gene-expression data. Choose carefully.
- Clustering structures are diverse and vary
substantially by algorithm; only some of them are
relevant to gene-expression data. Choose
carefully.
- But: more flexible and overlapping clustering
topologies have higher computational complexity.
Can we put lower bounds on these?
21Discovering coherent co-evolution co-clusters
- Co-clusters that form gene groups that have
coherent behavior, even if out of phase: if genes
A and B show opposite expression trends (if one
increases, the other decreases), then they
show opposite trends for all columns. See Figure
F.
- Various kinds of co-evolutionary similarity
methods for co-clusters:
  - Ben-Dor et al., Order Preserving Sub-Matrix:
same ordering across all columns (see example F)
  - Similarly, Liu & Wang, Order Preserving Cluster
  - Murali & Kasif, state-preserving xMOTIFs: all
genes in same state in a row.
  - SAMBA: two-state expression of genes across all
conditions in the co-cluster.
  - Cho, Dhillon et al.: sum-squared residue
coherent clusters.
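The order-preserving idea (same ordering of values across the selected columns for every selected row) can be checked with a short sketch. It only tests a given candidate block; it is not the search procedure of Ben-Dor et al. The function name and example matrix are illustrative assumptions, and ties between values are assumed not to occur.

```python
import numpy as np

def is_order_preserving(A, rows, cols):
    """True if every selected row ranks the selected columns in the same order."""
    sub = np.asarray(A, dtype=float)[np.ix_(rows, cols)]
    # argsort gives, per row, the column positions from smallest to largest value.
    order = np.argsort(sub, axis=1, kind="stable")
    return bool((order == order[0]).all())

# Hypothetical data: rows 0-2 all rank the columns as col0 < col2 < col1; row 3 does not.
expr = [[1, 5, 3],
        [10, 50, 30],
        [2, 9, 4],
        [7, 1, 3]]
same_order = is_order_preserving(expr, [0, 1, 2], [0, 1, 2])
with_outlier = is_order_preserving(expr, [0, 1, 2, 3], [0, 1, 2])
```

Rows 0 to 2 differ widely in magnitude yet share one ordering, which is the kind of coherence that a residue-based measure on raw values would miss.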
22Results Sparsity (Dhillon, 2003)
Before Co-clustering
(light regions are zeros)
23Results Sparsity (Dhillon, 2003)
After Co-clustering
(light regions are zeros)
Large empty regions not relevant to co-clustering
get separated out