Iclust: information based clustering

1
Iclust: information based clustering
Noam Slonim
The Lewis-Sigler Institute for Integrative Genomics, Princeton University
Joint work with Gurinder Atwal, Gasper Tkacik, Bill Bialek
2
Running example: gene expression data
[Figure: expression matrix of K genes by N conditions. Each entry is the (log) ratio of the mRNA expression level of a gene in a specific condition; some entries are shown as ??.]
Relations between genes? Relations between experimental conditions?
3
Information as a correlation/similarity measure
Some nice features of the information measure:
Model independent
Responsive to any type of dependency
Captures more than just pairwise relations
Suitable for both continuous and discrete data
Independent of the measurement scale
Axiomatic
4
Mutual information - definition
We have some uncertainty about the state of gene-A, but now someone has told us the state of gene-B.
Mutual information measures how much this reduces our uncertainty about gene-A.
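The definition this slide describes can be written out as follows. This is the standard textbook form rather than a transcription of the slide; the symbols A and B (the states of gene-A and gene-B) and the base-2 logarithm (information in bits) are notational assumptions.

```latex
I(A;B) \;=\; H(A) - H(A \mid B)
       \;=\; \sum_{a,b} p(a,b)\,\log_2 \frac{p(a,b)}{p(a)\,p(b)}
```

Here H(A) is the uncertainty about gene-A before we are told anything, and H(A | B) is what remains once the state of gene-B is known.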
5
Model independence and responsiveness to complicated relations
6
Capturing more than just pairwise relations
Using a model-dependent correlation measure might
result in missing significant dependencies in
our data.
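A small sketch of how a dependency can be missed, on made-up data rather than anything from the talk: y is fully determined by x, yet their linear correlation vanishes. The helper `binned_mutual_information` and the choice of 8 histogram bins are assumptions made for illustration; a plug-in histogram estimate is only one of several ways to estimate information.

```python
import numpy as np

def binned_mutual_information(x, y, bins=8):
    """Plug-in estimate of I(X;Y) in bits from a 2-D histogram (illustrative helper)."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()                 # joint distribution over bin pairs
    p_x = p_xy.sum(axis=1, keepdims=True)        # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)        # marginal of y, shape (1, bins)
    nonzero = p_xy > 0                           # skip empty cells (0 * log 0 = 0)
    independent = p_x @ p_y                      # what the joint would be if independent
    return float(np.sum(p_xy[nonzero] * np.log2(p_xy[nonzero] / independent[nonzero])))

# Toy data: a symmetric, purely nonlinear relation.
x = np.linspace(-1.0, 1.0, 2001)
y = x ** 2

pearson = float(np.corrcoef(x, y)[0, 1])
mi = binned_mutual_information(x, y)
print(f"Pearson correlation: {pearson:.3f}")
print(f"Mutual information:  {mi:.3f} bits")
```

The Pearson coefficient comes out at essentially zero because the relation is symmetric around x = 0, while the information estimate is clearly positive.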
7
Mutual information vs. Pearson correlation: results in bacteria gene-expression data
Mycobacterium tuberculosis, 81 experiments
[Figure panels: Mutual information; Pearson correlation]
8
Information relations between gene expression profiles
Given the expression of gene-A, how much information do we have about the expression of gene-B (when averaging over all conditions)?
(Sample size = number of conditions: 173 in the Gasch data.)
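As a rough sketch of what computing such information relations could look like, the code below builds a gene-by-gene information matrix from an expression matrix. It is not the estimator used by the authors: the functions `quantile_bins` and `pairwise_information`, the three equal-population bins, and the random toy data are all assumptions made for illustration. Plug-in estimates like this one are biased upward when the number of conditions is small.

```python
import numpy as np

def quantile_bins(profile, n_bins):
    """Map each value to an equal-population bin index (0 .. n_bins-1) by rank."""
    ranks = np.argsort(np.argsort(profile))
    return (ranks * n_bins) // len(profile)

def pairwise_information(expression, n_bins=3):
    """Return a (genes x genes) matrix of plug-in I(gene_i; gene_j) in bits."""
    n_genes, n_conditions = expression.shape
    binned = np.array([quantile_bins(row, n_bins) for row in expression])
    info = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        for j in range(i, n_genes):
            # Joint distribution of the two binned profiles over all conditions.
            joint = np.zeros((n_bins, n_bins))
            np.add.at(joint, (binned[i], binned[j]), 1.0)
            joint /= n_conditions
            outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
            nz = joint > 0
            info[i, j] = info[j, i] = np.sum(joint[nz] * np.log2(joint[nz] / outer[nz]))
    return info

# Toy data only: 3 "genes" over 173 conditions (the sample size quoted above).
rng = np.random.default_rng(0)
base = rng.normal(size=173)
expression = np.vstack([base, np.exp(base), rng.normal(size=173)])
info = pairwise_information(expression)
print(np.round(info, 2))
```

Because the bins are defined by rank, gene 1 (a monotone transform of gene 0) is exactly as informative about gene 0 as gene 0 is about itself, which illustrates independence of the measurement scale. The unrelated gene 2 scores far lower.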
Once we find these information relations, we
often want to apply cluster analysis.
Numerous clustering methods are available but
typically they assume a particular model.
For example, K-means corresponds to the modeling
assumption that each cluster can be described by
a spherical Gaussian.
Back to square one?
9
Iclust: information based clustering
What is a good cluster?
A simple proposal: given a cluster, we pick two items at random, and we want them to be as similar to each other as possible.
Namely, we wish to maximize the average information relations within our clusters, or to find clusters such that in each cluster all items are highly informative about each other.
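The "pick two items at random" proposal can be sketched as an average over clusters of the expected similarity between two items drawn independently from the same cluster. The function name `average_within_cluster_similarity`, the uniform prior over items, and the toy similarity matrix are assumptions for illustration; the slides do not show the formula itself.

```python
import numpy as np

def average_within_cluster_similarity(similarity, membership):
    """
    similarity: (n, n) matrix s(i, j), e.g. pairwise mutual information.
    membership: (n, k) matrix of P(C | i); each row sums to 1 (hard or soft).
    Returns <s> = sum_C P(C) * sum_{i,j} P(i|C) P(j|C) s(i, j),
    assuming items are a priori equally likely and no cluster is empty.
    """
    n = similarity.shape[0]
    p_c = membership.sum(axis=0) / n                  # P(C) with P(i) = 1/n
    p_i_given_c = membership / (n * p_c)              # Bayes: P(i|C) = P(C|i) P(i) / P(C)
    per_cluster = np.einsum("ic,ij,jc->c", p_i_given_c, similarity, p_i_given_c)
    return float(p_c @ per_cluster)

# Toy similarity matrix with two obvious groups: items {0, 1} and {2, 3}.
S = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
good = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)   # {0,1} vs {2,3}
bad = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)    # {0,2} vs {1,3}

good_score = average_within_cluster_similarity(S, good)
bad_score = average_within_cluster_similarity(S, bad)
print(good_score, bad_score)
```

The partition that matches the block structure scores higher, since any two items drawn from one of its clusters are similar. Note that drawing independently means an item can be paired with itself, so the diagonal of the similarity matrix contributes.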
10
Iclust: information based clustering (cont.)
A penalty term that we wish to minimize, as in
rate-distortion theory
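In rate-distortion theory the quantity being minimized is the mutual information between the data and its compressed representation. Assuming the penalty here is the analogous information between cluster identity C and item identity i (the transcript does not show the expression), a minimal sketch of that term is given below. The name `assignment_information` and the uniform prior over items are illustrative assumptions.

```python
import numpy as np

def assignment_information(membership):
    """
    I(C; i) in bits, where membership[i, c] = P(C=c | i) and P(i) = 1/n.
    Assumes every cluster has nonzero total weight.
    """
    n = membership.shape[0]
    p_c = membership.mean(axis=0)                               # P(C)
    # Cells with P(C|i) = 0 contribute nothing; use ratio 1 so log2 gives 0 there.
    ratio = np.where(membership > 0, membership / p_c, 1.0)
    return float(np.sum(membership * np.log2(ratio)) / n)

one_cluster = np.ones((4, 1))                                   # everything lumped together
hard_split = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
fully_soft = np.full((4, 2), 0.5)                               # assignments carry no information

costs = [assignment_information(m) for m in (one_cluster, hard_split, fully_soft)]
print(costs)
```

Lumping all items into one cluster, or assigning every item to all clusters equally, costs zero bits. A hard split into two equal halves costs one bit per item.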
11
Iclust: information based clustering (cont.)
The intuitive clustering problem can be turned into a general mathematical optimization problem.
Clustering is formulated as trading bits of similarity against bits of descriptive power, without any further assumptions.
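One way to write such a trade-off is sketched below. The notation is assumed rather than taken from the transcript: s(i,j) is the similarity between items i and j, P(C | i) is the (possibly soft) assignment of item i to cluster C, I(C; i) is the information the cluster label carries about item identity, and T is a trade-off parameter.

```latex
\max_{P(C \mid i)} \; F \;=\; \langle s \rangle \;-\; T \, I(C; i),
\qquad
\langle s \rangle \;=\; \sum_{C} P(C) \sum_{i,j} P(i \mid C)\, P(j \mid C)\, s(i,j)
```

When s(i,j) is itself measured in bits, both terms are in bits. Small T favors detailed clusterings with high similarity, while large T favors coarse ones that take fewer bits to describe.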
12
Relations with classical rate distortion
If the distortion/similarity matrix is a kernel matrix, the formulations are equivalent.
13
And yet, some important differences
Iclust is applicable when the raw data is given directly as pairwise relations.
Iclust does not require a definition of a prototype (or centroid).
Iclust can handle more than just pairwise correlations.
14
Iclust vs. classical rate-distortion decoding
[Figure: original figure, 220 gray levels]
15
Iclust algorithm - freely available Web
implementation
16
Iclust clusters: examples
17
Coherence results: comparison to alternative algorithms
[Figure: three panels (ESR, S&P 500, EachMovie), each comparing against K-means, K-medians, and Hierarchical clustering]
18
Quick Summary
Information as the core measure of data analysis, with many appealing features.
Iclust: a novel information-theoretic formulation of clustering, with some intriguing relations to classical rate-distortion clustering.
Validations: finding coherent gene clusters based on information relations in gene-expression data; finding coherent stock clusters and coherent movie clusters; and genotype-phenotype association in bacteria, based on phylogenetic data (Slonim, Elemento and Tavazoie (2005), Mol. Systems Biol., in press).
And more?