On the Theory and Practice of Co-clustering

1
On the Theory and Practice of Co-clustering

Srujana Merugu, Gunjan Gupta, Joydeep Ghosh

2
Overview

What is co-clustering?
Why co-clustering on gene expression data?
Defining quality measures for co-clusters.
Structural types of co-clusters.
Algorithmic strategies.

3
The dual problem in Clustering

Often, we can cluster points based on features,
or features based on points. For example:
Market basket data: customers vs. products.
Web: documents vs. words.
Genes: gene expression vs. experimental
conditions.

Good clusters of documents depend on relevant
words, but the choice of relevant words depends on
the documents being clustered.
4
Can we simultaneously cluster/optimize for both?

A brief history:
1965: Problem first described formally by I. J. Good
1972: First solution: Direct Clustering (J. A. Hartigan, JASA)
2000: First use in Bioinformatics: Bi-Clustering (Cheng & Church, AAAI)

5
Gene Expression Data from DNA Micro Arrays
          Condition 1   ...   Condition J   ...   Condition M
Gene 1    A_11          ...   A_1J          ...   A_1M
...
Gene I    A_I1          ...   A_IJ          ...   A_IM
...
Gene N    A_N1          ...   A_NJ          ...   A_NM
6
An Example: Genes vs. individual cancer samples
23,100 x 67 matrix from 67 tumors, 17,108
genes. Source: PNAS, David Botstein, 2001
7
Objectives when analyzing Gene Expression Data
Identifying a new metabolic process or regulation,
or identifying genes involved in a process, by, say:
Grouping genes by similar expression under some
conditions
Classification of a new gene G4 showing similar
expression
[Figure: genes G1, G2 and G3 under conditions C1, C2, C5 and C6 form process X, labeled by a biologist; a new gene G4 is shown alongside them.]

Similarly, we can solve the dual problem:
finding that a new condition C7 also involves
the identified biological process X.

8
Important Point: Many-to-many and simultaneous
mapping of both genes and conditions to Metabolic
Processes (clusters)
A subset of Genes under a specific subset of
Conditions forms one metabolic process.
Traditional clustering and partitioning yield
mutually exclusive groups, so they cannot find
both Process 1 and Process 2 simultaneously.
9
Summary: Why Gene Expression data and
Co-clustering?

Overlapping clusters are natural to many
co-clustering techniques
Only subsets of conditions/genes may be catered
to
Good for:
Sparse data
Noisy data: hard to control all experimental conditions
(origin of cells, temperature, osmotic stress,
centrifuge process etc.)
Smoothing

10
Co-clustering publications in Bioinformatics
At least 14 papers since Cheng & Church's paper
in 2000, all on gene expression data, including:

2000, 4 papers:
Cheng & Church (yeast and human microarray)
Califano, Stolovitzky & Tu (phenotype
classification, gene microarray data)
Getz, Levine & Domany (gene expression data on
cancer)
Lazzeroni & Owen (yeast gene expression data)
2001, 2 papers
2002, 3 papers, including:
Ben-Dor et al. (breast tumor set, gene expression
data)
2003, 5 papers, all on cancer or yeast gene
expression data
2004, 1 paper, on yeast and cancer gene expression data:
Cho, Dhillon, Guan & Sra, Minimum Sum-Squared
Residue Co-clustering of Gene Expression Data

11
Types of co-clusters in Bioinformatics (1)
12
Types of co-clusters in Bioinformatics (2)
13
A model for various types of co-clusters
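Models for discovering a co-cluster set $a(I,J)$, $I \subseteq \{1, \dots, N\}$, $J \subseteq \{1, \dots, M\}$, for a set of N genes (rows)
and M conditions (columns), of types A, B, C, D and E:

Additive model (1): $a_{ij} = \mu + \alpha_i + \beta_j$, or
Multiplicative model (2): $a_{ij} = \mu' \times \alpha'_i \times \beta'_j$

$\mu$: global mean
$\beta_j$: adjustment for column $j \in J$
$\alpha_i$: adjustment for row $i \in I$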
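
The additive and multiplicative models can be illustrated with a small sketch. It assumes the usual forms a_ij = mu + alpha_i + beta_j and a_ij = mu * alpha_i * beta_j; the function names and the sample numbers are hypothetical, chosen only for illustration.

```python
def additive_cocluster(mu, row_adj, col_adj):
    """Build a perfect additive co-cluster: a_ij = mu + alpha_i + beta_j."""
    return [[mu + a + b for b in col_adj] for a in row_adj]


def multiplicative_cocluster(mu, row_adj, col_adj):
    """Build a perfect multiplicative co-cluster: a_ij = mu * alpha_i * beta_j."""
    return [[mu * a * b for b in col_adj] for a in row_adj]


# Illustrative values only (not taken from the slides).
add = additive_cocluster(10.0, [0.0, 2.0, -1.0], [0.0, 5.0, 1.0])
mul = multiplicative_cocluster(2.0, [1.0, 2.0, 0.5], [1.0, 3.0, 2.0])

# In the additive case any two rows differ by a constant offset;
# in the multiplicative case any two rows differ by a constant factor.
row_offsets = [add[1][j] - add[0][j] for j in range(3)]
row_ratios = [mul[1][j] / mul[0][j] for j in range(3)]
```

Under these assumptions, the constant row offsets and constant row ratios are what distinguish the two kinds of coherent co-cluster.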
14
Types of Co-clusters in Bioinformatics (1)
15
Types of Co-clusters in Bioinformatics (2)
Green and red coherently regulated, even if out
of phase
F. Coherent Evolutionary (row, column, or both
axes)
Example, coherent co-evolution on columns
[Figure: numeric example matrix; extracted cell values: 10, 19, 13, 70, 35, 49, 40, 29, 15, 27, 20, 40, 12, 20, 15, 90]
16
Co-clustering with Noise
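A noise term can be introduced. Cheng & Church
introduced one such concept, called the Mean
Squared Residue. The residue of element (i, j)
is defined as
$r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$
where $a_{ij}$ is the noisy value, $a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}$ is the row i mean, $a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$ is the column j mean, and $a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij}$ is the co-cluster mean.
Then the Mean Squared Residue of co-cluster (I,J)
is given by
$H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} r_{ij}^2$
Minimize this for a good co-cluster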
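
As a minimal sketch, the Mean Squared Residue can be computed as below, assuming the standard Cheng & Church definition in which the residue of an element is its value minus its row mean and column mean plus the co-cluster mean. The function name and the two toy matrices are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the co-cluster given by row and column indices."""
    n_rows, n_cols = len(rows), len(cols)
    row_mean = {i: sum(matrix[i][j] for j in cols) / n_cols for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / n_rows for j in cols}
    all_mean = sum(row_mean.values()) / n_rows  # co-cluster mean
    total = 0.0
    for i in rows:
        for j in cols:
            residue = matrix[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue * residue
    return total / (n_rows * n_cols)


# A perfectly additive block: every row is the previous row plus a constant.
clean = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
msr_clean = mean_squared_residue(clean, [0, 1, 2], [0, 1, 2])

# A block that deviates from the additive pattern.
noisy = [[1, 2], [2, 5]]
msr_noisy = mean_squared_residue(noisy, [0, 1], [0, 1])
```

The additive block scores zero, while the second block scores 0.25 because each of its four residues is plus or minus 0.5.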
17
Flavors of co-clustering structures and their
relevance to Gene Array data
Various algorithms, irrespective of the
similarity type, discover one or more of the
following co-cluster structures, in increasing
order of generality:
Single
Exclusive row and column
Checkerboard
Exclusive row, or exclusive column
Non-overlapping, non-exclusive
Non-overlapping, tree structure
Overlapping with hierarchical structure
Arbitrarily positioned overlapping
18
Various Algorithmic approaches for searching
optimal co-clusters

Iterative row and column clustering combination:
CTWC (Coupled Two-Way Clustering), ITWC
(Interrelated Two-Way Clustering), Double
Conjugate Clustering (DCC) and more.
One example, CTWC:
1. Partition rows into R clusters and columns into C
clusters separately, using T as an annealing parameter to control size.
Lower T results in smaller clusters.
2. Co-clusters are defined as (R,C) tuples. Prune
unstable (R,C) tuples, defined as clusters that broke up
fast within a short T interval. Stop when large
stable clusters are no longer found in the pruned (R,C) set.
3. Decrease T and repeat Step 1 on each (R,C) tuple,
keeping track of the parent tuple.
Divide-and-conquer
Greedy iterative search: add or remove points based
on local gain, for example Cheng & Church using
Mean Squared Residue
Exhaustive enumeration: kept from being exponential by
limiting the number of columns in which a gene's activity is
non-zero, which is relevant to gene expression.
Distribution parameter identification: assume a
parametric distribution of clusters and search
for the parameters of the K clusters.
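
The greedy iterative search strategy can be sketched as follows. This is a simplified single-node deletion loop in the spirit of Cheng & Church, not their full algorithm: it repeatedly drops the row or column with the largest mean squared residue until the co-cluster's Mean Squared Residue falls to a threshold. The names `squared_residues`, `greedy_delete` and `delta`, the minimum size of 2 x 2, and the toy matrix are all assumptions made for illustration.

```python
def squared_residues(matrix, rows, cols):
    """Squared residue of every cell in the co-cluster (rows x cols)."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    return {(i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
            for i in rows for j in cols}


def greedy_delete(matrix, delta):
    """Drop the worst row or column until the mean squared residue <= delta."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    while True:
        sq = squared_residues(matrix, rows, cols)
        msr = sum(sq.values()) / (len(rows) * len(cols))
        # Stop when coherent enough, or when nothing more can be removed.
        if msr <= delta or (len(rows) <= 2 and len(cols) <= 2):
            return rows, cols, msr
        row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
        col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        can_drop_row = len(rows) > 2
        can_drop_col = len(cols) > 2
        if can_drop_row and (not can_drop_col
                             or row_score[worst_row] >= col_score[worst_col]):
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)


# Three additive rows plus one row that does not follow the pattern.
data = [[1, 2, 3, 4],
        [2, 3, 4, 5],
        [3, 4, 5, 6],
        [9, 0, 7, 1]]
kept_rows, kept_cols, final_msr = greedy_delete(data, delta=0.01)
```

On this toy input the incoherent fourth row is removed first, leaving the three additive rows with all four columns.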

19
Another way of looking at Co-clustering: a
Bipartite graph
Example: SAMBA, using exhaustive enumeration for a
limited degree of a gene node.
Minimize weighted/binary edge cut to produce
heaviest bipartite sub-graphs to get co-clusters.
Can find overlapping sub-graphs.
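
The bipartite view can be made concrete with a small sketch. It is not SAMBA itself: it only shows how an expression matrix might be turned into a gene-to-condition graph and how a candidate sub-graph could be scored by counting its edges. The threshold, the function names and the sample matrix are hypothetical.

```python
def to_bipartite(matrix, threshold):
    """Map each gene (row index) to the set of conditions (column indices)
    whose absolute expression value reaches the threshold (binary edges)."""
    return {gene: {cond for cond, value in enumerate(row)
                   if abs(value) >= threshold}
            for gene, row in enumerate(matrix)}


def subgraph_weight(graph, genes, conditions):
    """Number of edges between the chosen genes and the chosen conditions."""
    chosen = set(conditions)
    return sum(len(graph[gene] & chosen) for gene in genes)


# Illustrative expression values only.
expr = [[2.1, 0.1, 1.8, 0.0],
        [1.9, 0.2, 2.2, 0.1],
        [0.0, 1.7, 0.1, 2.0]]
graph = to_bipartite(expr, threshold=1.0)
weight = subgraph_weight(graph, genes=[0, 1], conditions=[0, 2])
```

Here genes 0 and 1 are both connected to conditions 0 and 2, so the candidate sub-graph is complete with 4 edges. A search procedure would look for the heaviest such sub-graphs, and different sub-graphs may share genes or conditions.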
20
To summarize existing work, and some comments:

Coherent additive or coherent co-evolutionary
approaches are quite popular and adaptive enough
for biology.
All co-clustering papers in bioinformatics have
looked at 2-D gene array data.
Quality measures are diverse and vary substantially
by algorithm; only some of them are relevant to
gene-expression data. Choose carefully.
Clustering structures are diverse and vary
substantially by algorithm; only some of them are
relevant to gene-expression data. Choose
carefully.
But more flexible and overlapping clustering
topologies have higher computational complexity.
Can we put lower bounds on these?

21
Discovering coherent co-evolution co-clusters

Co-clusters that form gene groups that have
coherent behavior, even if out of phase: if genes
A and B show opposite expression trends (if one
increases the other one decreases), then they
show opposite trends for all columns. See Figure
F.
Various kinds of co-evolutionary similarity
methods for co-clusters:
Ben-Dor et al.: Order Preserving Sub-Matrix,
same ordering across all columns (see example F)
Similarly, Liu & Wang: Order Preserving Cluster
Murali & Kasif: State preserving xMOTIFs, all
genes in same state in a row.
SAMBA: two-state expression of genes across all
conditions in the co-cluster.
Cho, Dhillon et al.: sum-squared residue
coherent clusters.
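
The order-preserving idea can be sketched as a simple check: a set of rows is order preserving over a set of columns if sorting those columns by value gives the same column ordering in every row. This only tests a given candidate and is not the search procedure of Ben-Dor et al.; the function names and sample values are hypothetical, and values within a row are assumed to be distinct.

```python
def column_order(row, cols):
    """Columns sorted by increasing value in this row."""
    return tuple(sorted(cols, key=lambda c: row[c]))


def is_order_preserving(matrix, rows, cols):
    """True if every selected row ranks the selected columns identically."""
    orders = {column_order(matrix[r], cols) for r in rows}
    return len(orders) == 1


# Illustrative values: rows 0-2 share one column ranking, row 3 does not.
m = [[1, 5, 3],
     [10, 50, 30],
     [2, 9, 4],
     [7, 1, 3]]
first_three = is_order_preserving(m, [0, 1, 2], [0, 1, 2])
all_four = is_order_preserving(m, [0, 1, 2, 3], [0, 1, 2])
```

The first three rows differ widely in magnitude yet rank the columns the same way, which is exactly the kind of coherence that additive or multiplicative measures would miss.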

22
Results: Sparsity (Dhillon, 2003)
Before Co-clustering
(light regions are zeros)
23
Results: Sparsity (Dhillon, 2003)
After Co-clustering
(light regions are zeros)
Large empty regions not relevant to co-clustering
get separated out
