Title: Clustering gene expression profiles following the Chinese restaurant process
1. Clustering gene expression profiles following the Chinese restaurant process
- Steve Qin
- Bioinformatics workshop
2. Background
3. Goal
- Partition the data such that:
- Data within a class are similar.
- Classes differ from one another.
4. Popular clustering methods
5. Challenges
- Simple and model-free.
- Depend on the choice of distance measure.
- Less flexible in handling noise and missing data.
- Expensive to compute for large datasets.
- No probabilistic foundation, so it is hard to do inference on the result.
6. Model-based clustering
- Finite mixture model:
- McLachlan and Basford, 1988.
- Banfield and Raftery, 1993.
- K is determined using BIC (see the criterion below), then clustering is performed conditional on K using the EM algorithm.
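For concreteness, the usual BIC criterion referred to above (the standard formula, not reproduced on the slide):

```latex
\mathrm{BIC}(K) = -2 \log \hat{L}_K + \nu_K \log N
```

where \hat{L}_K is the maximized likelihood of the K-component mixture, \nu_K is its number of free parameters, and N is the number of observations; K is chosen to minimize BIC.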
7. Discussion
- Sound probabilistic foundation, able to handle large datasets.
- Separates clustering from estimating the number of clusters.
8. Dirichlet process mixture model
- Infinite mixture model.
- Does not require K a priori.
- Chinese restaurant process:
- Due to Dubins and Pitman.
- Aldous (1985), Pitman (1996).
9. Chinese restaurant process
10. Chinese restaurant process
The probability of joining these tables: customer n + 1 joins occupied table k with probability n_k / (n + α), where n_k is the number of customers already seated there, and starts a new table with probability α / (n + α).
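To make the seating rule concrete, here is a minimal simulation of the Chinese restaurant process; the values of alpha and the number of customers are arbitrary illustration choices, not from the slides.

```python
import random

def crp(n_customers, alpha, seed=0):
    """Simulate Chinese restaurant process seating; returns table sizes."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for n in range(n_customers):
        # Customer n + 1 joins table k with probability n_k / (n + alpha),
        # or opens a new table with probability alpha / (n + alpha).
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)   # start a new table
        else:
            tables[k] += 1     # join an existing table
    return tables

# The number of occupied tables grows roughly like alpha * log(n).
print(crp(1000, alpha=1.0))
```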
11. About DP
- Let (Θ, B) be a measurable space, G0 a probability measure on the space, and α a positive real number.
- A Dirichlet process is any distribution of a random probability measure G over (Θ, B) such that, for all finite partitions (A1, ..., Ar) of Θ, the vector (G(A1), ..., G(Ar)) follows a Dirichlet distribution (written out below).
- Draws from G are generally not distinct.
- The number of distinct values grows as O(log n).
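Written out in full, the partition condition is the standard definition (Ferguson 1973), which the slide paraphrases:

```latex
G \sim \mathrm{DP}(\alpha, G_0)
\quad\Longleftrightarrow\quad
\bigl(G(A_1), \ldots, G(A_r)\bigr) \sim
\mathrm{Dirichlet}\bigl(\alpha G_0(A_1), \ldots, \alpha G_0(A_r)\bigr)
```

for every finite measurable partition (A_1, ..., A_r) of Θ.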
12. Exchangeability
- In general, an infinite set of random variables is said to be infinitely exchangeable if every finite subset x1, ..., xn is exchangeable.
- Using de Finetti's theorem (stated below), it is possible to show that our draws θ are infinitely exchangeable.
- Thus the mixture components may be sampled in any order.
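De Finetti's theorem, in the form used here: the draws are infinitely exchangeable if and only if they are i.i.d. given some random measure G, so that

```latex
p(\theta_1, \ldots, \theta_n)
  = \int \Biggl( \prod_{i=1}^{n} G(\theta_i) \Biggr) dP(G)
  \qquad \text{for every } n,
```

where P is the prior over G; in this setting, P is the Dirichlet process.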
13. General scheme
14. Dirichlet process
- G ~ DP(α, G0).
- G0 is continuous, so the probability that any two samples from G0 are equal is precisely zero.
- However, G is a discrete distribution, made up of a countably infinite number of point masses (Blackwell).
- Therefore, there is always a non-zero probability of two samples colliding (see the stick-breaking sketch below).
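The stick-breaking construction (listed on the History slide) makes the discreteness explicit: G is a countable sum of point masses whose weights sum to one. A minimal sketch, truncated at a finite number of sticks; the truncation level and base measure are illustration assumptions, not from the slides.

```python
import numpy as np

def stick_breaking(alpha, n_sticks=1000, seed=0):
    """Truncated stick-breaking draw of G ~ DP(alpha, G0), with G0 = N(0, 1).

    Returns atoms and weights; the exact process has infinitely many
    sticks, so this is an approximation.
    """
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_sticks)           # v_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                           # w_k = v_k * prod_{j<k} (1 - v_j)
    atoms = rng.normal(size=n_sticks)                     # atoms drawn i.i.d. from G0
    return atoms, weights

# Two samples from G can collide because G concentrates its mass on
# countably many atoms, even though G0 itself is continuous.
atoms, weights = stick_breaking(alpha=1.0)
rng = np.random.default_rng(1)
print(rng.choice(atoms, size=10, p=weights / weights.sum()))
```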
15. Dirichlet process
16. History
- Pólya urn process.
- Stick breaking.
- Infinite mixture model.
- Bayesian nonparametric model.
- Historical references:
- Ferguson 1973.
- Blackwell and MacQueen 1973.
- Antoniak 1974.
17. References
- FMM:
- McLachlan et al. 2002; Yeung et al. 2001.
- IMM:
- Medvedovic and Sivaganesan 2002.
- Yeung et al. 2003; Medvedovic 2004.
- Tadesse et al. 2005; Kim et al. 2006.
18. Notation
- N: number of genes.
- M: number of experiments.
- K: number of clusters (unknown).
- X = {x_ij}: expression profiles.
- E = {E(i)}: indicators of cluster membership.
19. Model
- Gene profiles within a cluster follow the same set of Gaussian distributions.
- Likelihood (a plausible form is sketched below).
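The likelihood formula itself did not survive extraction; under the stated model (independent experiments, one Gaussian per cluster and experiment), a plausible form is

```latex
p(X \mid E, \mu, \sigma^2)
  = \prod_{i=1}^{N} \prod_{j=1}^{M}
    \mathcal{N}\bigl(x_{ij} \mid \mu_{E(i),j},\ \sigma^2_{E(i),j}\bigr),
```

where \mu_{k,j} and \sigma^2_{k,j} are the mean and variance of cluster k in experiment j.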
20. Marginal likelihood
Integrate out the nuisance parameters, then use predictive updating (Liu 1994; Chen and Liu 1995).
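The slide does not reproduce the formula. Assuming a conjugate Normal-Inverse-Gamma prior on each cluster's (\mu, \sigma^2) per experiment, a common choice for this model, integrating out the nuisance parameters gives a Student-t predictive density that predictive updating evaluates in closed form:

```latex
x_{ij} \mid \{x_{i'j} : E(i') = k\}
  \sim t_{2a_n}\!\left(m_n,\; \frac{b_n(\kappa_n + 1)}{a_n \kappa_n}\right),
\qquad
\kappa_n = \kappa_0 + n, \quad
m_n = \frac{\kappa_0 m_0 + n\bar{x}}{\kappa_n},
\quad
a_n = a_0 + \frac{n}{2}, \quad
b_n = b_0 + \frac{1}{2}\sum_i (x_i - \bar{x})^2
          + \frac{\kappa_0 n (\bar{x} - m_0)^2}{2\kappa_n}
```

where the sums run over the n current members of cluster k in experiment j.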
21. Posterior inference
Weighted Chinese restaurant process (Lo 2005).
22. Algorithm
- Initialization:
- Randomly assign the genes into an arbitrary number K0 of clusters, 1 ≤ K0 ≤ N.
- For each gene i, perform the following reassignment:
- Remove gene i from its current cluster. Given the current assignment of all the other genes, calculate the probability of this gene joining each of the existing clusters, as well as of sitting alone.
- Assign gene i to one of the K + 1 possible clusters according to these probabilities. Update the indicator variable E(i) based on the assignment.
- Repeat the above two steps for every gene, and repeat for a large number of rounds until convergence. (A minimal sketch of one sweep follows.)
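A minimal sketch of one reassignment sweep. The helper `log_marginal` stands in for the cluster marginal likelihood of the previous slide; its name and interface are hypothetical, used only to show the control flow.

```python
import math
import random

def gibbs_sweep(X, E, alpha, log_marginal, rng=None):
    """One pass of the collapsed-Gibbs / weighted CRP reassignment.

    X: list of gene profiles; E: mutable list of cluster labels;
    log_marginal(profiles) -> log marginal likelihood of one cluster.
    """
    rng = rng or random.Random(0)
    for i in range(len(X)):
        E[i] = None                               # remove gene i from its cluster
        labels = sorted(set(e for e in E if e is not None))
        log_w = []
        for k in labels:                          # weight for joining cluster k
            members = [X[j] for j in range(len(X)) if E[j] == k]
            log_w.append(math.log(len(members))
                         + log_marginal(members + [X[i]])
                         - log_marginal(members))
        # ... or the weight for sitting alone, proportional to alpha
        log_w.append(math.log(alpha) + log_marginal([X[i]]))
        m = max(log_w)                            # normalize in log space
        w = [math.exp(v - m) for v in log_w]
        c = rng.choices(range(len(w)), weights=w)[0]
        E[i] = labels[c] if c < len(labels) else max(labels, default=-1) + 1
    return E
```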
23. Correlations
24. Add a model selection step
Try to fit different versions of this vector to
all clusters.
25. Remarks (I)
- For each gene, provide the posterior probability of joining its current cluster.
- For each cluster, provide a likelihood ratio to measure its tightness.
- For each pair of clusters, provide a log-likelihood ratio as a distance measure; a dendrogram can then be drawn over all clusters.
26. Remarks (II)
- Assume all experiments are independent.
- Pathetic, but it still works.
- Can easily add covariance structure if needed, e.g., for time-course data.
- Tolerates sporadic missing data:
- If a data point is missing from an experiment, give equal probability of joining each existing cluster in that experiment.
27. Remarks (III)
- Choice of prior:
- Data dependent: a = 0.5, b = 2·sd(x).
- Fix the tuning parameter α = 1; a higher α will produce more clusters.
- Start from an arbitrary number of initial clusters.
- Run 20 parallel chains, each going through 100 cycles.
28. Simulation study
- 400 genes, 20 experiments, 5 clusters.
- K = 1, 2, 3.
29. Trace plots
30. Adjusted Rand Index
- Hubert and Arabie 1985.
- Ranges over (0, 1): 0 for a random partition, 1 for a perfect match.
- Yeung and Ruzzo 2001 (see the example below).
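As an example of computing the index, scikit-learn provides adjusted_rand_score; the label vectors below are made-up toy data.

```python
from sklearn.metrics import adjusted_rand_score

true_labels  = [0, 0, 1, 1, 2, 2]   # known simulation clusters
found_labels = [1, 1, 0, 0, 2, 2]   # clusters recovered by the sampler
print(adjusted_rand_score(true_labels, found_labels))  # 1.0 = perfect match
```

Label permutations do not matter; only the induced partition does.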
31. Results
Hierarchical clustering on the complex dataset scores 57.4.
32. Galactose dataset
- Microarrays were used to measure the mRNA expression profiles of yeast growing under 20 different perturbations to the GAL pathway (Ideker et al. 2001).
- 205 genes whose expression patterns reflect 4 GO functional categories.
- 4 replicates.
- 8% missing data, imputed by the KNN (k = 12) approach (Troyanskaya et al. 2002).
33. Trace plot
34. Results
35. Results
GIMM with replicates: 84.69 (4), 95.29 (5), 95.01 (6); without replicates: 56-67.
36. Discussion
- GIMM:
- Hierarchical model,
- Models covariance,
- Models replicates,
- Measures frequency of co-occurrences, then performs hierarchical clustering.
- This algorithm:
- Independent model,
- Predictive updating,
- Allows missing data,
- Allows complex relationships,
- No distance defined.
37. Discussion
- Distance-based clustering relies on gene-gene comparisons, O(n^2); model-based clustering performs gene-cluster comparisons, O(n log(n) m), which is more efficient for large datasets.
- Combining the estimation of the number of clusters and the actual clustering in one unified process seems to be advantageous.
38. Robustness of cluster size: correlation between different K
39. Discussion
- Distance-based clustering is more vulnerable to adverse complications, such as missing data and substantial noise, e.g., if the data from one experiment are corrupted.
40. Limitations
- Currently, each gene belongs to exactly one cluster. In reality, one gene can participate in multiple pathways.
- The magnitude of the data dominates the clustering decision; the trend should also be an important factor for consideration.
41. Acknowledgements
- Michael Elliott
- Debashis Ghosh
- Mario Medvedovic
42. Thank You