Multiway clustering of microarray data using Probabilistic Sparse Matrix Factorization



Transcript and Presenter's Notes

1
Multi-way clustering of microarray data
using Probabilistic Sparse Matrix Factorization

Delbert Dueck, Quaid Morris, Brendan Frey
Probabilistic and Statistical Inference Group
Department of Electrical and Computer Engineering
University of Toronto

2
Outline

Introduction
Biological Background
Mouse genome data set
Probabilistic Sparse Matrix Factorization
Generative Model
Approximate Inference
Results
Visualizations
Statistical Significance
Summary

3
Introduction

Genes encode basic information about an organism
Gene expression is influenced by the presence of
transcription factors
The activity of each gene can be explained by the
activities of a small number of transcription
factors
Expression data is from Zhang et al. (2004)
Contains clear expression profiles for over
20,000 known and predicted genes across 55 mouse
tissue types

4
Introduction: Dataset
[Figure: entire data set X, a G × T matrix (G = 22709 genes, T = 55 tissues), with a zoomed view of 100 genes across the T = 55 tissues]
5
Generative Model

We introduce an unsupervised technique that
renders a multi-way clustering of the data
Each gene's expression profile (x_g) is
a linear combination (weights y_gc, c ∈ s_g)
of a small number (r_g < N)
of C possible factor profiles (z_c, c ∈ s_g)
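
The generative idea on this slide can be sketched in a few lines of Python. This is an illustration, not the authors' code: the function name, the uniform choice of the factor count, the standard-normal weights, and the noise level are all assumptions made so that the sketch runs. It lets a gene use up to max_factors factors, following the "maximum of N factors per gene" wording on the visualization slide.

```python
import random

def sample_profile(factors, max_factors, noise_sd, rng):
    """Draw one synthetic expression profile as a sparse mix of factor profiles."""
    num_factors = len(factors)
    num_tissues = len(factors[0])
    # r_g: number of factors used by this gene (assumed uniform on 1..max_factors)
    r_g = rng.randint(1, max_factors)
    # s_g: which factors are used, chosen without replacement
    s_g = rng.sample(range(num_factors), r_g)
    # y_gc: one weight per selected factor (standard normal, for illustration only)
    weights = {c: rng.gauss(0.0, 1.0) for c in s_g}
    # x_g: weighted sum of the selected factor profiles plus Gaussian noise
    x_g = [
        sum(weights[c] * factors[c][t] for c in s_g) + rng.gauss(0.0, noise_sd)
        for t in range(num_tissues)
    ]
    return x_g, s_g, weights

rng = random.Random(0)
C, T, N = 50, 55, 3  # sizes taken from the slides
Z = [[rng.gauss(0.0, 1.0) for _ in range(T)] for _ in range(C)]
x_g, s_g, y_g = sample_profile(Z, N, 0.1, rng)
```

Every other entry of the gene's row of Y is implicitly zero, which is the sparsity the model relies on.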

6
Generative Model (entire matrix view)

Y is constrained structurally
rows must have < N non-zero elements

(a marker in the figure indicates a non-zero entry)
7
Generative Model (likelihoods)

Form a joint distribution, assuming
varying levels of Gaussian noise in the data
nothing about factor weights
normally-distributed factor profiles
uniformly-distributed factor assignments
multinomially-distributed factor counts
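
One way to read these assumptions is as additive terms in a log-probability. The sketch below scores a single gene under that reading. The function names, the per-gene noise variance psi_g, the table nu of factor-count probabilities, and the choice of "uniform over subsets of size r_g" are assumptions for illustration; the slides do not give the exact parameterization.

```python
import math

def log_normal(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_joint_gene(x_g, s_g, weights, Z, psi_g, nu):
    """Log-probability terms for one gene (factor-profile prior handled separately)."""
    total = 0.0
    # Gaussian noise with gene-specific variance psi_g around the sparse combination
    for t, x in enumerate(x_g):
        mean = sum(weights[c] * Z[c][t] for c in s_g)
        total += log_normal(x, mean, psi_g)
    # Nothing is assumed about the weights: a flat prior adds a constant, taken as 0
    # Factor assignments: uniform over subsets of size r_g (an assumption)
    total -= math.log(math.comb(len(Z), len(s_g)))
    # Factor count: categorical with probabilities nu[r_g]
    total += math.log(nu[len(s_g)])
    return total

def log_prior_factors(Z):
    """Standard-normal prior on every factor-profile entry."""
    return sum(log_normal(z, 0.0, 1.0) for row in Z for z in row)

Z = [[1.0, 0.0], [0.0, 1.0]]
value = log_joint_gene([2.0, 0.0], [0], {0: 2.0}, Z, psi_g=1.0, nu={1: 0.5, 2: 0.5})
```

In the toy call the profile is reproduced exactly by factor 0, so only the Gaussian normalizers and the two discrete terms contribute.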

8
Approximate Inference Techniques

Exact inference of P's hidden variables is
intractable
We examined two approximate techniques
Sparse Matrix Factorization (Srebro & Jaakkola,
2001)
Search for a configuration of the hidden
variables that maximizes P (iterated conditional
modes)
Probabilistic Sparse Matrix Factorization (PSMF)
Search for a distribution over configurations of
the hidden variables that accurately approximates
P (variational EM)

9
Factorized Variational Inference

Parameterize Q
Accounts for noise in factor profiles and
uncertainty in factor selection
Minimize KL-divergence between P and Q

10
Variational EM algorithm for PSMF

Use coordinate descent on free energy, F
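
The PSMF update equations are not reproduced in this transcript, so the following is only a toy illustration of the technique named here: coordinate descent on a free energy with a factorized Q. It uses a hypothetical 2 x 2 joint table P over two binary variables and Q(a, b) = q_a(a) q_b(b). Each update minimizes F with respect to one factor while the other is held fixed, so F never increases.

```python
import math

def free_energy(q_a, q_b, P):
    """F = E_Q[log Q] - E_Q[log P]; equals KL(Q || P) when P is normalized."""
    F = 0.0
    for a in range(2):
        for b in range(2):
            w = q_a[a] * q_b[b]
            if w > 0.0:
                F += w * (math.log(q_a[a]) + math.log(q_b[b]) - math.log(P[a][b]))
    return F

def normalize_exp(log_vals):
    """Exponentiate and normalize, subtracting the max for numerical stability."""
    m = max(log_vals)
    e = [math.exp(v - m) for v in log_vals]
    s = sum(e)
    return [v / s for v in e]

def update_q_a(q_b, P):
    # q_a(a) is proportional to exp(E_{q_b}[log P(a, b)])
    return normalize_exp([sum(q_b[b] * math.log(P[a][b]) for b in range(2)) for a in range(2)])

def update_q_b(q_a, P):
    # q_b(b) is proportional to exp(E_{q_a}[log P(a, b)])
    return normalize_exp([sum(q_a[a] * math.log(P[a][b]) for a in range(2)) for b in range(2)])

P = [[0.40, 0.10], [0.15, 0.35]]  # hypothetical joint distribution
q_a, q_b = [0.5, 0.5], [0.6, 0.4]
history = [free_energy(q_a, q_b, P)]
for _ in range(20):
    q_a = update_q_a(q_b, P)
    history.append(free_energy(q_a, q_b, P))
    q_b = update_q_b(q_a, P)
    history.append(free_energy(q_a, q_b, P))
```

The list history records F after every coordinate update and can be inspected to confirm the monotone decrease.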

11
PSMF visualization (1 of 3)

Probabilistic Sparse Matrix Factorization
C = 50 possible factors
maximum of N = 3 factors per gene
Expression profiles (rows) are grouped by primary
factor (s_g1), then secondary factor (s_g2), etc.
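
The row ordering used in these visualizations amounts to a lexicographic sort on each gene's factor labels. A minimal sketch, assuming each gene's labels are stored as a tuple (primary, secondary, ...); the function name and the example labels are made up for illustration.

```python
def order_by_factors(assignments):
    """Return gene indices sorted by primary factor, then secondary, and so on."""
    return sorted(range(len(assignments)), key=lambda g: tuple(assignments[g]))

# Hypothetical labels for five genes; tuples may have different lengths
assignments = [(3, 33), (1,), (3, 7), (3,), (1, 20, 5)]
order = order_by_factors(assignments)
```

Python compares tuples element by element, so genes sharing a primary factor stay adjacent, and a gene with no secondary factor sorts ahead of those that have one.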

12
PSMF visualization (2 of 3)

Zoom in where primary factor is 3
High expression in colon, small intestine, large
intestine tissues
enriched for GO-BP category lipid metabolism
(GO:0006629; p-value < 10^-10)

13
PSMF visualization (3 of 3)

Zoom in where primary factor is 3 and secondary
factor is 33

14
Results: p-value histograms

Genes are clustered by primary factor,
secondary factor, etc.
Compare these clusters with annotated GO
categories by computing hypergeometric p-values
of all possible cluster-GO category pairings
Statistical significance of each cluster is the
p-value of most enriched-for GO category

[Figure: histograms of cluster p-values, split into significant and insignificant at α = 0.05 plus Bonferroni correction]
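
The scoring procedure on this slide can be sketched as follows. The hypergeometric tail is the standard enrichment test. The function names are hypothetical, and dividing α by the number of categories tested is one assumed reading of the Bonferroni correction, since the slides do not say how many tests were corrected for.

```python
from math import comb

def hypergeom_pvalue(population, category_size, cluster_size, overlap):
    """P(at least `overlap` category genes in a random cluster of `cluster_size`)."""
    upper = min(category_size, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(category_size, k) * comb(population - category_size, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def cluster_significance(cluster, categories, population, alpha=0.05):
    """Best p-value over all categories, and whether it passes the corrected threshold."""
    best = min(
        hypergeom_pvalue(population, len(genes), len(cluster), len(cluster & genes))
        for genes in categories.values()
    )
    # Bonferroni: divide alpha by the number of categories tested (an assumption)
    return best, best < alpha / len(categories)

# Toy example: 20 genes, two made-up categories, one cluster of four genes
categories = {"A": set(range(0, 5)), "B": set(range(10, 15))}
best, significant = cluster_significance({0, 1, 2, 3}, categories, population=20)
```

Here all four cluster genes fall in category A, so the best p-value is C(5,4) / C(20,4) = 5/4845, which passes the corrected threshold of 0.025.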
15
Results: significant factors (N = 1)
16
Results: significant factors (N = 1)
It would be nice to get rid of random clustering
for the next three slides
17
Results: significant factors (N = 2)
18
Results: significant factors (N = 3)
19
Results: complete
20
PSMF vs. SMF: maximizing likelihoods

SMF makes hard decisions during maximization
Immediately converges to poor local maximum
PSMF makes soft decisions (better approximation
of posterior)
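
The difference between hard and soft decisions can be illustrated on a single factor-selection step. This is a toy sketch, not either algorithm in full: the log-scores are invented, and the softmax stands in for whatever distribution the variational updates would actually produce.

```python
import math

def hard_assignment(log_scores):
    """Put all mass on the best-scoring factor (an ICM-style point estimate)."""
    best = max(range(len(log_scores)), key=lambda c: log_scores[c])
    return [1.0 if c == best else 0.0 for c in range(len(log_scores))]

def soft_assignment(log_scores):
    """Normalize exponentiated scores into a distribution over factors."""
    m = max(log_scores)
    e = [math.exp(s - m) for s in log_scores]
    total = sum(e)
    return [v / total for v in e]

log_scores = [-10.2, -10.0, -14.0]  # hypothetical log-scores for three candidate factors
hard = hard_assignment(log_scores)
soft = soft_assignment(log_scores)
```

The hard rule commits fully to factor 1 even though factor 0 scores almost as well, while the soft rule keeps substantial weight on both, which leaves room for later updates to change the selection.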

21
Summary

We introduce probabilistic sparse matrix
factorization
Useful tool for data vectors that are most
naturally explained as a linear combination of a
selection of prototype vectors
Computes probability distributions instead of
making hard decisions during factor selection
PSMF finds more highly-enriched functional
clusters than SMF and standard techniques
Also outputs secondary and higher-order labels
for each expression vector
Useful for more refined visualization and
functional prediction

22
Questions?

For more details and software, see
www.psi.toronto.edu/delbert

23
Future Directions: different Q?
Iterated conditional modes (point estimates)
