Title: Clustering gene expression profiles following the Chinese restaurant process
1. Clustering gene expression profiles following the Chinese restaurant process
- Steve Qin
- Bioinformatics workshop
2. Background
3. Goal
- Partition the data such that:
- Data within a class are similar.
- Classes differ from one another.
4. Popular clustering methods
5. Challenges
- Simple and model-free.
- Depend on the choice of distance measure.
- Less flexible in handling noise and missing data.
- Expensive to compute for large datasets.
- No probabilistic foundation, so it is hard to do inference on the result.
6. Model-based clustering
- Finite mixture model:
- McLachlan and Basford, 1988.
- Banfield and Raftery, 1993.
- K is determined using BIC (see the criterion below), then clustering is performed conditional on K using the EM algorithm.
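For concreteness, the usual BIC criterion referred to above (the standard formula, not reproduced on the slide):

```latex
\mathrm{BIC}(K) = -2 \log \hat{L}_K + \nu_K \log N
```

where \hat{L}_K is the maximized likelihood of the K-component mixture, \nu_K is its number of free parameters, and N is the number of observations; K is chosen to minimize BIC.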
7. Discussion
- Sound probabilistic foundation, able to handle large datasets.
- Separates clustering from estimating the number of clusters.
8. Dirichlet process mixture model
- Infinite mixture model.
- Does not require K a priori.
- Chinese restaurant process:
- Due to Dubins and Pitman.
- Aldous (1985), Pitman (1996).
9. Chinese restaurant process
10. Chinese restaurant process
The probability of joining these tables: customer n + 1 joins occupied table k with probability n_k / (n + α), where n_k is the number of customers already seated there, and starts a new table with probability α / (n + α).
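To make the seating rule concrete, here is a minimal simulation of the Chinese restaurant process; the values of alpha and the number of customers are arbitrary illustration choices, not from the slides.

```python
import random

def crp(n_customers, alpha, seed=0):
    """Simulate Chinese restaurant process seating; returns table sizes."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for n in range(n_customers):
        # Customer n + 1 joins table k with probability n_k / (n + alpha),
        # or opens a new table with probability alpha / (n + alpha).
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)   # start a new table
        else:
            tables[k] += 1     # join an existing table
    return tables

# The number of occupied tables grows roughly like alpha * log(n).
print(crp(1000, alpha=1.0))
```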
11. About DP
- Let (Θ, B) be a measurable space, G0 a probability measure on the space, and α a positive real number.
- A Dirichlet process is any distribution of a random probability measure G over (Θ, B) such that, for all finite partitions (A1, ..., Ar) of Θ, the vector (G(A1), ..., G(Ar)) follows a Dirichlet distribution (written out below).
- Draws from G are generally not distinct.
- The number of distinct values grows as O(log n).
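Written out in full, the partition condition is the standard definition (Ferguson 1973), which the slide paraphrases:

```latex
G \sim \mathrm{DP}(\alpha, G_0)
\quad\Longleftrightarrow\quad
\bigl(G(A_1), \ldots, G(A_r)\bigr) \sim
\mathrm{Dirichlet}\bigl(\alpha G_0(A_1), \ldots, \alpha G_0(A_r)\bigr)
```

for every finite measurable partition (A_1, ..., A_r) of Θ.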
12. Exchangeability
- In general, an infinite set of random variables is said to be infinitely exchangeable if every finite subset x1, ..., xn is exchangeable.
- Using de Finetti's theorem (stated below), it is possible to show that our draws θ are infinitely exchangeable.
- Thus the mixture components may be sampled in any order.
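De Finetti's theorem, in the form used here: the draws are infinitely exchangeable if and only if they are i.i.d. given some random measure G, so that

```latex
p(\theta_1, \ldots, \theta_n)
  = \int \Biggl( \prod_{i=1}^{n} G(\theta_i) \Biggr) dP(G)
  \qquad \text{for every } n,
```

where P is the prior over G; in this setting, P is the Dirichlet process.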
13. General scheme
14. Dirichlet process
- G ~ DP(α, G0).
- G0 is continuous, so the probability that any two samples from G0 are equal is precisely zero.
- However, G is a discrete distribution, made up of a countably infinite number of point masses (Blackwell).
- Therefore, there is always a non-zero probability of two samples colliding (see the stick-breaking sketch below).
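The stick-breaking construction (listed on the History slide) makes the discreteness explicit: G is a countable sum of point masses whose weights sum to one. A minimal sketch, truncated at a finite number of sticks; the truncation level and base measure are illustration assumptions, not from the slides.

```python
import numpy as np

def stick_breaking(alpha, n_sticks=1000, seed=0):
    """Truncated stick-breaking draw of G ~ DP(alpha, G0), with G0 = N(0, 1).

    Returns atoms and weights; the exact process has infinitely many
    sticks, so this is an approximation.
    """
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_sticks)           # v_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                           # w_k = v_k * prod_{j<k} (1 - v_j)
    atoms = rng.normal(size=n_sticks)                     # atoms drawn i.i.d. from G0
    return atoms, weights

# Two samples from G can collide because G concentrates its mass on
# countably many atoms, even though G0 itself is continuous.
atoms, weights = stick_breaking(alpha=1.0)
rng = np.random.default_rng(1)
print(rng.choice(atoms, size=10, p=weights / weights.sum()))
```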
15. Dirichlet process
16. History
- Pólya urn process.
- Stick breaking.
- Infinite mixture model.
- Bayesian nonparametric model.
- Historical references:
- Ferguson 1973.
- Blackwell and MacQueen 1973.
- Antoniak 1974.
17. References
- FMM:
- McLachlan et al. 2002; Yeung et al. 2001.
- IMM:
- Medvedovic and Sivaganesan 2002.
- Yeung et al. 2003; Medvedovic 2004.
- Tadesse et al. 2005; Kim et al. 2006.
18. Notation
- N: number of genes.
- M: number of experiments.
- K: number of clusters (unknown).
- X = {x_ij}: expression profiles.
- E = {E(i)}: indicators of cluster membership.
19. Model
- Gene profiles within a cluster follow the same set of Gaussian distributions.
- Likelihood (a plausible form is sketched below).
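The likelihood formula itself did not survive extraction; under the stated model (independent experiments, one Gaussian per cluster and experiment), a plausible form is

```latex
p(X \mid E, \mu, \sigma^2)
  = \prod_{i=1}^{N} \prod_{j=1}^{M}
    \mathcal{N}\bigl(x_{ij} \mid \mu_{E(i),j},\ \sigma^2_{E(i),j}\bigr),
```

where \mu_{k,j} and \sigma^2_{k,j} are the mean and variance of cluster k in experiment j.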
20. Marginal likelihood
Integrate out the nuisance parameters, then use predictive updating (Liu 1994; Chen and Liu 1995).
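The slide does not reproduce the formula. Assuming a conjugate Normal-Inverse-Gamma prior on each cluster's (\mu, \sigma^2) per experiment, a common choice for this model, integrating out the nuisance parameters gives a Student-t predictive density that predictive updating evaluates in closed form:

```latex
x_{ij} \mid \{x_{i'j} : E(i') = k\}
  \sim t_{2a_n}\!\left(m_n,\; \frac{b_n(\kappa_n + 1)}{a_n \kappa_n}\right),
\qquad
\kappa_n = \kappa_0 + n, \quad
m_n = \frac{\kappa_0 m_0 + n\bar{x}}{\kappa_n},
\quad
a_n = a_0 + \frac{n}{2}, \quad
b_n = b_0 + \frac{1}{2}\sum_i (x_i - \bar{x})^2
          + \frac{\kappa_0 n (\bar{x} - m_0)^2}{2\kappa_n}
```

where the sums run over the n current members of cluster k in experiment j.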
21. Posterior inference
Weighted Chinese restaurant process (Lo 2005).
22. Algorithm
- Initialization:
- Randomly assign the genes into an arbitrary number K0 of clusters, 1 ≤ K0 ≤ N.
- For each gene i, perform the following reassignment:
- Remove gene i from its current cluster. Given the current assignment of all the other genes, calculate the probability of this gene joining each of the existing clusters, as well as of sitting alone.
- Assign gene i to one of the K + 1 possible clusters according to these probabilities. Update the indicator variable E(i) based on the assignment.
- Repeat the above two steps for every gene, and repeat for a large number of rounds until convergence. (A minimal sketch of one sweep follows.)
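A minimal sketch of one reassignment sweep. The helper `log_marginal` stands in for the cluster marginal likelihood of the previous slide; its name and interface are hypothetical, used only to show the control flow.

```python
import math
import random

def gibbs_sweep(X, E, alpha, log_marginal, rng=None):
    """One pass of the collapsed-Gibbs / weighted CRP reassignment.

    X: list of gene profiles; E: mutable list of cluster labels;
    log_marginal(profiles) -> log marginal likelihood of one cluster.
    """
    rng = rng or random.Random(0)
    for i in range(len(X)):
        E[i] = None                               # remove gene i from its cluster
        labels = sorted(set(e for e in E if e is not None))
        log_w = []
        for k in labels:                          # weight for joining cluster k
            members = [X[j] for j in range(len(X)) if E[j] == k]
            log_w.append(math.log(len(members))
                         + log_marginal(members + [X[i]])
                         - log_marginal(members))
        # ... or the weight for sitting alone, proportional to alpha
        log_w.append(math.log(alpha) + log_marginal([X[i]]))
        m = max(log_w)                            # normalize in log space
        w = [math.exp(v - m) for v in log_w]
        c = rng.choices(range(len(w)), weights=w)[0]
        E[i] = labels[c] if c < len(labels) else max(labels, default=-1) + 1
    return E
```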
23. Correlations
24. Add a model selection step
Try to fit different versions of this vector to
all clusters.
25. Remarks (I)
- For each gene, provide the posterior probability of joining its current cluster.
- For each cluster, provide a likelihood ratio to measure its tightness.
- For each pair of clusters, provide a log-likelihood ratio as a distance measure; a dendrogram can then be drawn over all clusters.
26. Remarks (II)
- Assume all experiments are independent.
- Pathetic, but it still works.
- Can easily add covariance structure if needed, e.g., for time-course data.
- Tolerates sporadic missing data:
- If a data point is missing from an experiment, give equal probability of joining each existing cluster in that experiment.
27. Remarks (III)
- Choice of prior:
- Data dependent: a = 0.5, b = 2·sd(x).
- Fix the tuning parameter α = 1; a higher α will produce more clusters.
- Start from an arbitrary number of initial clusters.
- Run 20 parallel chains, each going through 100 cycles.
28. Simulation study
- 400 genes, 20 experiments, 5 clusters.
- K = 1, 2, 3.
29. Trace plots
30. Adjusted Rand Index
- Hubert and Arabie 1985.
- Ranges over (0, 1): 0 for a random partition, 1 for a perfect match.
- Yeung and Ruzzo 2001 (see the example below).
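As an example of computing the index, scikit-learn provides adjusted_rand_score; the label vectors below are made-up toy data.

```python
from sklearn.metrics import adjusted_rand_score

true_labels  = [0, 0, 1, 1, 2, 2]   # known simulation clusters
found_labels = [1, 1, 0, 0, 2, 2]   # clusters recovered by the sampler
print(adjusted_rand_score(true_labels, found_labels))  # 1.0 = perfect match
```

Label permutations do not matter; only the induced partition does.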
31. Results
Hierarchical clustering on the complex dataset scores 57.4.
32. Galactose dataset
- Microarrays were used to measure the mRNA expression profiles of yeast growing under 20 different perturbations to the GAL pathway (Ideker et al. 2001).
- 205 genes whose expression patterns reflect 4 GO functional categories.
- 4 replicates.
- 8% missing data, imputed by the KNN (k = 12) approach (Troyanskaya et al. 2002).
33. Trace plot
34. Results
35. Results
GIMM with replicates: 84.69 (4), 95.29 (5), 95.01 (6); without replicates: 56-67.
36. Discussion
- GIMM:
- Hierarchical model,
- Models covariance,
- Models replicates,
- Measures frequency of co-occurrences, then performs hierarchical clustering.
- This algorithm:
- Independent model,
- Predictive updating,
- Allows missing data,
- Allows complex relationships,
- No distance defined.
37. Discussion
- Distance-based clustering relies on gene-gene comparisons, O(n^2); model-based clustering performs gene-cluster comparisons, O(n log(n) m), which is more efficient for large datasets.
- Combining the estimation of the number of clusters and the actual clustering in one unified process seems to be advantageous.
38. Robustness of cluster size: correlation between different K
39. Discussion
- Distance-based clustering is more vulnerable to adverse complications, such as missing data and substantial noise, e.g., if the data from one experiment are corrupted.
40. Limitations
- Currently, each gene belongs to exactly one cluster. In reality, one gene can participate in multiple pathways.
- The magnitude of the data dominates the clustering decision; the trend should also be an important factor for consideration.
41. Acknowledgements
- Michael Elliott
- Debashis Ghosh
- Mario Medvedovic
42. Thank You