Title: Multiway clustering of microarray data using Probabilistic Sparse Matrix Factorization
1 Multi-way clustering of microarray data
using Probabilistic Sparse Matrix Factorization
- Delbert Dueck, Quaid Morris, Brendan Frey
- Probabilistic Statistical Inference Group
- Department of Electrical and Computer Engineering
- University of Toronto
2 Outline
- Introduction
- Biological Background
- Mouse genome data set
- Probabilistic Sparse Matrix Factorization
- Generative Model
- Approximate Inference
- Results
- Visualizations
- Statistical Significance
- Summary
3 Introduction
- Genes encode basic information about an organism
- Gene expression is influenced by the presence of transcription factors
- The activity of each gene can be explained by the activities of a small number of transcription factors
- Expression data is from Zhang et al. (2004)
- Contains clear expression profiles for over 20,000 known and predicted genes across 55 mouse tissue types
4 Introduction: Dataset
- Entire data set: X, a G × T matrix (G = 22709 genes, T = 55 tissues)
[Figure: the full data matrix, with a zoomed view of 100 genes across the T = 55 tissues]
5 Generative Model
- We introduce an unsupervised technique that renders a multi-way clustering of the data
- Each gene's expression profile (x_g) is
- a linear combination (weights y_gc, c ∈ s_g)
- of a small number (r_g < N)
- of C possible factor profiles (z_c, c ∈ s_g)
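The description above can be sketched as a small sampler. This is a toy illustration, not the authors' code: the function name `sample_psmf`, the toy sizes, the standard-normal draws for weights and profiles, the single shared noise level, and the uniform choice of the factor count between 1 and N are all assumptions made here for illustration.

```python
import numpy as np

def sample_psmf(G=6, T=5, C=4, N=3, noise_sd=0.1, seed=0):
    """Draw toy data X (G x T) from a sparse factor model; all sizes are illustrative."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((C, T))            # factor profiles z_c, one per row
    Y = np.zeros((G, C))                       # sparse weights y_gc
    S = []                                     # factor assignments s_g, one array per gene
    for g in range(G):
        r_g = rng.integers(1, N + 1)           # factor count in 1..N (assumed uniform here)
        s_g = rng.choice(C, size=r_g, replace=False)   # which factors gene g uses
        Y[g, s_g] = rng.standard_normal(r_g)   # weights only on the selected factors
        S.append(s_g)
    # Each row of X is a weighted sum of the selected factor profiles plus Gaussian noise.
    X = Y @ Z + noise_sd * rng.standard_normal((G, T))
    return X, Y, Z, S

X, Y, Z, S = sample_psmf()
print(X.shape)                  # shape of the toy data matrix
print((Y != 0).sum(axis=1))     # number of non-zero weights per gene
```

Each row of Y has non-zero entries only at the positions listed in the corresponding s_g, which is the structural constraint the model places on the weights.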
6 Generative Model (entire matrix view)
- Y is constrained structurally
- rows must have < N non-zero elements
(marked cells indicate non-zero entries)
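In matrix form, the per-gene description of the generative model can be written as below. This is a sketch under assumptions: the matrix shapes follow from G genes, C factors, and T tissues, and the explicit "noise" term is added here for illustration rather than copied from the slide.

```latex
X \approx Y Z, \qquad
X \in \mathbb{R}^{G \times T},\quad
Y \in \mathbb{R}^{G \times C},\quad
Z \in \mathbb{R}^{C \times T}

x_g = \sum_{c \in s_g} y_{gc}\, z_c + \text{noise}, \qquad
\lvert s_g \rvert = r_g
```

Row g of Y is zero outside the r_g columns listed in s_g, so the sparsity constraint applies to each gene separately.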
7 Generative Model (likelihoods)
- Form a joint distribution, assuming
- varying levels of Gaussian noise in the data
- nothing about factor weights
- normally-distributed factor profiles
- uniformly-distributed factor assignments
- multinomially-distributed factor counts
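One way to write a joint distribution consistent with the five assumptions listed above. The slide's own formula is not reproduced in the text, so the symbols ψ_g (noise standard deviation for gene g), ν_n (probability that a gene uses n factors), and the unit variance on factor profiles are assumptions introduced here for illustration.

```latex
P(X, Y, Z, S, r) \;\propto\;
\prod_{g=1}^{G} \left[
  \nu_{r_g}
  \prod_{n=1}^{r_g} \frac{1}{C}
  \prod_{t=1}^{T}
    \mathcal{N}\!\Big( x_{gt};\; \sum_{c \in s_g} y_{gc}\, z_{ct},\; \psi_g^2 \Big)
\right]
\prod_{c=1}^{C} \prod_{t=1}^{T} \mathcal{N}\big( z_{ct};\; 0,\; 1 \big)
```

The proportionality sign reflects the "nothing about factor weights" assumption: a flat prior on y_gc contributes only a constant.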
8 Approximate Inference Techniques
- Exact inference of the hidden variables of P is intractable
- We examined two approximate techniques
- Sparse Matrix Factorization (SMF; Srebro & Jaakkola, 2001)
- Search for a configuration of the hidden variables that maximizes P (iterated conditional modes)
- Probabilistic Sparse Matrix Factorization (PSMF)
- Search for a distribution over configurations of the hidden variables that accurately approximates P (variational EM)
9 Factorized Variational Inference
- Parameterize Q
- Accounts for noise in factor profiles and uncertainty in factor selection
- Minimize KL-divergence between P and Q
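A minimal numeric sketch of the quantity being minimized, using a made-up discrete P over two binary hidden variables and a fully factorized Q. The function name `kl_divergence` and all numbers are illustrative assumptions; the actual Q used for PSMF is not shown in the slide text.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as arrays of the same shape."""
    q = np.asarray(q, dtype=float).ravel()
    p = np.asarray(p, dtype=float).ravel()
    mask = q > 0                      # terms with q = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Toy joint P over two binary hidden variables (rows: a, columns: b); values are made up.
P = np.array([[0.40, 0.10],
              [0.10, 0.40]])

# A fully factorized Q(a, b) = Q(a) Q(b); here both factors are uniform.
Q = np.outer([0.5, 0.5], [0.5, 0.5])

print(kl_divergence(Q, P))   # positive: a factorized Q cannot capture the correlation in P
print(kl_divergence(P, P))   # 0.0
```

The divergence is zero only when Q matches P exactly, so a factorized Q trades some accuracy for tractability.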
10 Variational EM algorithm for PSMF
- Use coordinate descent on free energy, F
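For reference, the generic variational free energy for a model with hidden variables h and observed data X. This is the standard textbook definition, not the authors' exact expression, since the slide's specific form of Q is not reproduced in the text; the integral stands for a sum over any discrete hidden variables.

```latex
F(Q, P) = \int_h Q(h)\, \log \frac{Q(h)}{P(h, X)}
        = \mathrm{KL}\big( Q(h) \,\|\, P(h \mid X) \big) - \log P(X)
```

Because log P(X) does not depend on Q, lowering F with respect to Q lowers the KL-divergence to the posterior. Coordinate descent updates one block of parameters at a time while holding the others fixed.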
11 PSMF visualization (1 of 3)
- Probabilistic Sparse Matrix Factorization
- C = 50 possible factors
- maximum of N = 3 factors per gene
- Expression profiles (rows) are grouped by primary factor (s_g1), then secondary factor (s_g2), etc.
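The row ordering described above (primary factor first, then secondary, and so on) can be sketched with a lexicographic sort. The label array and the use of -1 for an empty slot are hypothetical, chosen only to illustrate the grouping.

```python
import numpy as np

# Hypothetical factor labels for 6 genes: columns are primary, secondary, tertiary factor.
# A value of -1 marks "no factor in this slot" (an assumed encoding).
labels = np.array([
    [3, 33, -1],
    [1,  7,  2],
    [3,  5, -1],
    [1, -1, -1],
    [3, 33,  8],
    [1,  7, -1],
])

# np.lexsort sorts by the LAST key first, so pass keys from least to most significant.
order = np.lexsort((labels[:, 2], labels[:, 1], labels[:, 0]))
print(order)            # row indices in display order
print(labels[order])    # labels grouped by primary, then secondary, then tertiary factor
```

Applying `order` to the rows of the expression matrix places genes with the same primary factor in one contiguous block, with sub-blocks for each secondary factor.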
12 PSMF visualization (2 of 3)
- Zoom in where primary factor is 3
- High expression in colon, small intestine, large intestine tissues
- enriched for GO-BP category lipid metabolism GO:0006629 (p-value < 10^-10)
13 PSMF visualization (3 of 3)
- Zoom in where primary factor is 3 and secondary
factor is 33
14 Results: p-value histograms
- Genes are clustered by primary factor, secondary factor, etc.
- Compare these clusters with annotated GO categories by computing hypergeometric p-values of all possible cluster-GO category pairings
- Statistical significance of each cluster is the p-value of its most enriched-for GO category
[Figure: histograms of cluster p-values, divided into significant and insignificant regions at α = 0.05 plus Bonferroni correction]
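A minimal sketch of the enrichment test described above, using only the Python standard library. The function names and all counts are made up for illustration; the slides do not state how many pairings were tested or which implementation was used.

```python
from math import comb

def hypergeom_pvalue(population, annotated, cluster_size, overlap):
    """Probability of seeing `overlap` or more annotated genes in a cluster
    of `cluster_size` genes drawn without replacement from `population` genes,
    of which `annotated` carry the GO category."""
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / num_tests

# Made-up example: 1000 genes, 50 in the GO category, a cluster of 40 genes with 10 hits.
p = hypergeom_pvalue(population=1000, annotated=50, cluster_size=40, overlap=10)
threshold = bonferroni_threshold(0.05, num_tests=500)
print(p, threshold, p < threshold)
```

A cluster's significance is then the smallest such p-value over all GO categories, compared against the corrected threshold.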
15 Results: significant factors (N = 1)
16 Results: significant factors (N = 1)
17 Results: significant factors (N = 2)
18 Results: significant factors (N = 3)
19 Results: complete
20 PSMF vs. SMF: maximizing likelihoods
- SMF makes hard decisions during maximization
- Immediately converges to poor local maximum
- PSMF makes soft decisions (better approximation
of posterior)
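The contrast between hard and soft decisions can be illustrated with a toy single-factor example. This is neither SMF nor PSMF: it assumes each profile uses exactly one factor, plugs in a least-squares weight per factor, and assumes Gaussian noise with a uniform prior over factors. All function names and numbers are hypothetical.

```python
import numpy as np

def factor_scores(x, Z):
    """Residual sum of squares when x is fit by a scaled copy of each row of Z."""
    weights = (Z @ x) / np.sum(Z * Z, axis=1)       # least-squares weight per factor
    residuals = x[None, :] - weights[:, None] * Z
    return np.sum(residuals ** 2, axis=1)

def hard_assignment(rss):
    """Hard decision: commit to the single best-fitting factor."""
    return int(np.argmin(rss))

def soft_assignment(rss, noise_var):
    """Soft decision: normalized Gaussian likelihoods under a uniform prior over factors."""
    log_w = -rss / (2.0 * noise_var)
    log_w -= log_w.max()                            # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 8))                     # 4 toy factor profiles over 8 tissues
x = 0.9 * Z[2] + 0.3 * rng.standard_normal(8)       # noisy profile built from factor 2
rss = factor_scores(x, Z)
print(hard_assignment(rss))                         # one index, no uncertainty retained
print(soft_assignment(rss, noise_var=0.3 ** 2))     # a distribution over the 4 factors
```

The soft version keeps probability mass on competing factors, so later updates can still move a gene to a different factor; the hard version discards that information at every step.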
21 Summary
- We introduce probabilistic sparse matrix factorization
- Useful tool for data vectors that are most naturally explained as a linear combination of a selection of prototype vectors
- Computes probability distributions instead of making hard decisions during factor selection
- PSMF finds more highly-enriched functional clusters than SMF and standard techniques
- Also outputs secondary and higher-order labels for each expression vector
- Useful for more refined visualization and functional prediction
22 Questions?
- For more details and software, see www.psi.toronto.edu/delbert
23 Future Directions: different Q?
Iterated conditional modes (point estimates)