Title: Iclust: information based clustering
1Iclust information based clustering
Noam Slonim, The Lewis-Sigler Institute for Integrative Genomics, Princeton University
Joint work with Gurinder Atwal, Gasper Tkacik, Bill Bialek
2Running example
Gene expression data
[Figure: gene expression matrix, K genes by N conditions, with numeric entries and some entries shown as ??]
Each entry is the (log) ratio of the mRNA expression level of a gene in a specific condition.
Relations between genes? Relations between
experimental conditions?
3Information as a correlation/similarity measure
Some nice features of the information measure:
Model independent
Responsive to any type of dependency
Captures more than just pairwise relations
Suitable for both continuous and discrete data
Independent of the measurement scale
Axiomatic
4Mutual information - definition
We have some uncertainty about the state of gene-A, but now someone has told us the state of gene-B.
The mutual information is the resulting reduction in our uncertainty about gene-A.
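A minimal sketch in code, assuming the standard form I(A;B) = H(A) + H(B) - H(A,B), measured in bits. The function names and the two joint tables are hypothetical toy choices, not data from the talk.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(A;B) = H(A) + H(B) - H(A,B) in bits, from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()          # normalise in case counts were passed
    p_a = joint.sum(axis=1)              # marginal of gene-A (rows)
    p_b = joint.sum(axis=0)              # marginal of gene-B (columns)
    return entropy(p_a) + entropy(p_b) - entropy(joint.ravel())

# Toy joint distributions over (state of gene-A, state of gene-B).
independent = np.array([[0.25, 0.25], [0.25, 0.25]])  # B tells us nothing about A
identical = np.array([[0.5, 0.0], [0.0, 0.5]])        # B determines A completely

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```

In the first table knowing gene-B removes no uncertainty about gene-A; in the second it removes all of it (one bit for a binary state).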
5Model independence and responsiveness to complicated relations
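One way to illustrate this point, as a hedged toy example rather than anything taken from the slides: let X be uniform on {-1, 0, 1} and Y = X^2. The covariance (and hence the Pearson correlation numerator) is exactly zero, yet the mutual information is clearly positive.

```python
import numpy as np

x_values = np.array([-1.0, 0.0, 1.0])
y_values = np.array([0.0, 1.0])

# joint[i, j] = P(X = x_values[i], Y = y_values[j]) for Y = X**2, X uniform.
joint = np.array([
    [0.0, 1 / 3],   # X = -1 -> Y = 1
    [1 / 3, 0.0],   # X =  0 -> Y = 0
    [0.0, 1 / 3],   # X = +1 -> Y = 1
])

p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Covariance E[XY] - E[X]E[Y]: a linear (model-dependent) measure.
covariance = x_values @ joint @ y_values - (p_x @ x_values) * (p_y @ y_values)

# Mutual information in bits: responds to any statistical dependency.
nonzero = joint > 0
mi_bits = float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / np.outer(p_x, p_y)[nonzero])))
```

Here the linear measure reports no relation, while the information measure reports log2(3) - 2/3 bits.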
6Capturing more than just pairwise relations
Using a model-dependent correlation measure might
result in missing significant dependencies in
our data.
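A small sketch of a dependency that no pairwise measure can see, assuming "more than pairwise" is quantified by the multi-information (sum of marginal entropies minus the joint entropy). The XOR construction is a textbook toy case chosen for illustration, not an example from the talk.

```python
import itertools
import numpy as np

def entropy(p):
    """Shannon entropy in bits (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# joint[a, b, c]: A and B are fair independent bits, C = A XOR B.
joint = np.zeros((2, 2, 2))
for a, b in itertools.product((0, 1), repeat=2):
    joint[a, b, a ^ b] = 0.25

def pairwise_mi(joint, keep):
    """Mutual information between the two variables on the axes in `keep`."""
    drop = tuple(axis for axis in range(3) if axis not in keep)
    pair = joint.sum(axis=drop)
    return entropy(pair.sum(axis=1)) + entropy(pair.sum(axis=0)) - entropy(pair)

pairwise = [pairwise_mi(joint, keep) for keep in [(0, 1), (0, 2), (1, 2)]]

# Multi-information: H(A) + H(B) + H(C) - H(A,B,C).
marginals = [joint.sum(axis=tuple(a for a in range(3) if a != k)) for k in range(3)]
multi_information = sum(entropy(m) for m in marginals) - entropy(joint)
```

Every pair of variables is independent, yet the triplet shares one full bit of information.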
7Mutual-information vs. Pearson-Correlation
results in bacteria gene-expression data
Mycobacterium tuberculosis, 81 experiments
[Figure panels: Mutual information; Pearson correlation]
8Information relations between gene expression
profiles
Given the expression of gene-A, how much information do we have about the expression of gene-B (when averaging over all conditions)?
(Sample size = number of conditions; 173 in the Gasch data.)
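A rough sketch of how such an information relation could be estimated from two expression profiles. It assumes a naive plug-in estimator over equally populated bins, which is biased for small samples and is not necessarily the estimator used in this work. The profiles are synthetic; only the sample size of 173 conditions is taken from the text above, and the names `rank_bins` and `binned_mi` are hypothetical.

```python
import numpy as np

def rank_bins(x, n_bins):
    """Assign each value to one of n_bins equally populated bins, by rank."""
    ranks = np.argsort(np.argsort(x))
    return (ranks * n_bins) // len(x)

def binned_mi(x, y, n_bins=4):
    """Naive plug-in estimate of I(x;y) in bits from a 2-D histogram."""
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (bx, by), 1)            # fill the joint histogram
    joint = counts / counts.sum()
    p_x = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / (p_x @ p_y)[nonzero])))

rng = np.random.default_rng(0)
n_conditions = 173                             # sample size = number of conditions
gene_a = rng.normal(size=n_conditions)
gene_b = gene_a**2 + 0.1 * rng.normal(size=n_conditions)  # nonlinear dependence on gene_a
gene_c = rng.normal(size=n_conditions)                    # unrelated to gene_a

mi_ab = binned_mi(gene_a, gene_b)
mi_ac = binned_mi(gene_a, gene_c)
```

Applying `binned_mi` to every pair of genes would give the matrix of pairwise information relations; because it works on ranks, the estimate does not change under monotonic rescaling of the measurements.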
Once we find these information relations, we
often want to apply cluster analysis.
Numerous clustering methods are available but
typically they assume a particular model.
For example, K-means corresponds to the modeling
assumption that each cluster can be described by
a spherical Gaussian.
Back to square one?
9Iclust information based clustering
What is a good cluster?
A simple proposal: given a cluster, we pick two items at random, and we want them to be as similar to each other as possible.
Namely, we wish to maximize the average information relations in our clusters, or to find clusters such that in each cluster all items are highly informative about each other.
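The proposal above can be sketched as an average over clusters of the expected similarity between two items drawn independently from the same cluster: <s> = sum_C P(C) sum_{i,j} P(i|C) P(j|C) s(i,j). The code below assumes uniform item weights and a possibly soft assignment matrix P(C|i); the 4-item similarity matrix is a made-up toy example.

```python
import numpy as np

def average_similarity(s, p_c_given_i):
    """<s> = sum_C P(C) sum_{i,j} P(i|C) P(j|C) s[i, j], with uniform P(i).

    s            -- (n_items, n_items) matrix of pairwise information relations
    p_c_given_i  -- (n_items, n_clusters) assignments, rows sum to 1,
                    every cluster assumed non-empty
    """
    n_items = s.shape[0]
    p_c = p_c_given_i.mean(axis=0)                      # P(C)
    p_i_given_c = p_c_given_i / (n_items * p_c)         # Bayes' rule, column C is P(i|C)
    s_c = np.einsum("ic,ij,jc->c", p_i_given_c, s, p_i_given_c)  # similarity inside each cluster
    return float(p_c @ s_c)

# Toy similarities: items {0, 1} are informative about each other, as are {2, 3}.
s = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
good = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # clusters {0,1}, {2,3}
bad = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)   # clusters {0,2}, {1,3}

s_good = average_similarity(s, good)
s_bad = average_similarity(s, bad)
```

The partition that groups mutually informative items scores 1.0, while the partition that mixes them scores 0.5.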
10Iclust information based clustering (cont.)
A penalty term that we wish to minimize, as in
rate-distortion theory
11Iclust information based clustering (cont.)
The intuitive clustering problem can be turned into a general mathematical optimization problem.
Clustering is formulated as trading bits of similarity against bits of descriptive power, without any further assumptions.
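A hedged sketch of this trade-off, assuming the objective takes the form F = <s> - T * I(C;i), where <s> is the average within-cluster similarity, I(C;i) is the information the cluster assignment carries about item identity (the penalty term), and T is the trade-off parameter. Function names and the toy similarity matrix are illustrative assumptions.

```python
import numpy as np

def average_similarity(s, q):
    """<s> for assignments q[i, c] = P(C=c | i), with uniform P(i)."""
    n_items = s.shape[0]
    p_c = q.mean(axis=0)
    p_i_given_c = q / (n_items * p_c)
    return float(p_c @ np.einsum("ic,ij,jc->c", p_i_given_c, s, p_i_given_c))

def compression_bits(q):
    """I(C;i) in bits for uniform P(i): the cost of describing items by clusters."""
    p_c = q.mean(axis=0)
    ratio = np.where(q > 0, q / p_c, 1.0)      # log2(1) = 0 where q is zero
    return float(np.mean(np.sum(q * np.log2(ratio), axis=1)))

def objective(s, q, temperature):
    """F = <s> - T * I(C;i); larger is better."""
    return average_similarity(s, q) - temperature * compression_bits(q)

s = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
two_clusters = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
one_cluster = np.ones((4, 1))

f_two_low_t = objective(s, two_clusters, temperature=0.25)
f_one_low_t = objective(s, one_cluster, temperature=0.25)
f_two_high_t = objective(s, two_clusters, temperature=1.0)
f_one_high_t = objective(s, one_cluster, temperature=1.0)
```

At low T the extra bit spent on two clusters is worth the gain in similarity; at high T the single-cluster description wins.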
12Relations with classical rate distortion clustering
If the distortion/similarity matrix is a kernel matrix, the two formulations are equivalent.
13And yet some important differences
Iclust is applicable when the raw data is given
directly as pairwise relations
Iclust does not require a definition of a
prototype (or centroid)
Iclust can handle more than just pairwise
correlations
14Iclust vs. classical rate-distortion decoding
Original figure: 220 gray levels
15Iclust algorithm - freely available Web
implementation
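For orientation only, a sketch of one possible iterative scheme, not the Web implementation itself. It assumes pairwise similarities, uniform item weights, and the fixed-point condition obtained by maximizing F = <s> - T * I(C;i), namely P(C|i) proportional to P(C) * exp([2 s(C;i) - s(C)] / T), where s(C;i) is the expected similarity between item i and a random member of C. The function name `iclust_sketch`, the synchronous update order, and the toy data are assumptions.

```python
import numpy as np

def iclust_sketch(s, n_clusters, temperature, n_iter=200, seed=0):
    """Iterate the assumed self-consistent update; returns q[i, c] = P(C=c | i)."""
    rng = np.random.default_rng(seed)
    n_items = s.shape[0]
    q = rng.dirichlet(np.ones(n_clusters), size=n_items)    # random soft initialisation
    for _ in range(n_iter):
        p_c = np.maximum(q.mean(axis=0), 1e-12)             # P(C), floored to avoid log(0)
        p_i_given_c = q / (n_items * p_c)                   # P(i|C)
        s_ci = s @ p_i_given_c                               # s(C;i), shape (n_items, n_clusters)
        s_c = np.sum(p_i_given_c * s_ci, axis=0)            # s(C)
        logits = np.log(p_c) + (2.0 * s_ci - s_c) / temperature
        logits -= logits.max(axis=1, keepdims=True)          # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)                    # normalise over clusters
    return q

# Toy similarity matrix with two blocks of three mutually informative items.
s = np.zeros((6, 6))
s[:3, :3] = 1.0
s[3:, 3:] = 1.0

q = iclust_sketch(s, n_clusters=2, temperature=0.1)
labels = q.argmax(axis=1)
```

Like other schemes of this kind, the iteration can only be expected to reach a local optimum, so in practice one would likely run it from several random initialisations and over a range of T values.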
16Iclust clusters examples
17Coherence results comparison to alternative
algorithms
[Figure: coherence results on three datasets (ESR, S&P 500, EachMovie), each compared against K-means, K-medians, and Hierarchical clustering]
18Quick Summary
Information as the core measure of data analysis
with many appealing features
Iclust - a novel information-theoretic
formulation of clustering, with some intriguing
relations with classical rate distortion
clustering.
Validations: finding coherent gene clusters based on information relations in gene-expression data, and finding coherent stock clusters and coherent movie clusters;
and genotype-phenotype association in bacteria, based on phylogenetic data - Slonim, Elemento & Tavazoie (2005), Mol. Systems Biol., in press.
and more?