Title: Probabilistic Sparse Matrix Factorization
1 Probabilistic Sparse Matrix Factorization
- Delbert Dueck, Quaid Morris, Brendan Frey (Probabilistic and Statistical Inference Group)
- Tim Hughes (Banting and Best Department of Medical Research)
2 Objective
- Patterns in gene expression array data can be used to help understand gene regulation and predict the function of yet-uncharacterized genes
- Objective: To develop a method of probabilistic sparse matrix factorization (PSMF) and apply it to gene expression data to learn the hidden structure underlying the data.
3 Biological Background
- Genes encode basic information about an organism
- They tend to be highly expressed in tissues related to their functional role
- Mouse gene expression data is from Zhang, Morris, et al. (2004)
- Gene expression is influenced by the presence of transcription factors (TFs)
- Co-expressed genes are likely activated by the same TFs
- The activity of each gene can be explained by the activities of a small number of transcription factors
4 Gene Expression Array Dataset
Entire data set X: G × T matrix (G = 22709, T = 55)
[Figure: data matrix X, G = 22709 genes by T = 55 tissues, with an excerpt of 100 genes by T = 55 tissues]
5 Sparse Matrix Factorization
- Gene expression data model
- Each gene's expression profile (x_g) is
- a linear combination (weighted by y_gc, c ∈ s_g)
- of a small number (r_g < N)
- of C possible transcription factor profiles (z_c, c ∈ s_g)
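The model on this slide can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: the random factor profiles, the chosen indices and weights, and the names `Z`, `s_g`, `y_g`, and `gene_profile` are all assumptions made for the example.

```python
import numpy as np

def gene_profile(Z, s_g, y_g):
    """Return x_g = sum over c in s_g of y_gc * z_c.

    Z   : (C, T) array, one transcription factor profile z_c per row
    s_g : indices of the factors selected for this gene
    y_g : weights y_gc, one per selected factor
    """
    s_g = np.asarray(s_g)
    y_g = np.asarray(y_g, dtype=float)
    return y_g @ Z[s_g]

rng = np.random.default_rng(0)
C, T = 50, 55                  # number of factors and tissues, as in the slides
Z = rng.normal(size=(C, T))    # hypothetical factor profiles
s_g = [4, 17]                  # r_g = 2 selected factors (illustrative choice)
y_g = [0.8, -1.5]              # illustrative weights
x_g = gene_profile(Z, s_g, y_g)
```

Only the rows of `Z` listed in `s_g` contribute, which is what makes the combination sparse.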
6 Sparse Matrix Factorization
Matrix format (entire dataset)
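In matrix form the model reads X ≈ Y·Z, where Y is a G × C weight matrix with only a few non-zero entries per row and Z is the C × T matrix of factor profiles. A minimal sketch follows, with small toy sizes and random values standing in for real data. Treating N as a cap on the non-zero weights per gene is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(1)
G, C, T, N = 6, 5, 4, 2        # toy sizes (assumed); the slides use G=22709, T=55

Y = np.zeros((G, C))           # sparse weight matrix, one gene per row
for g in range(G):
    r_g = rng.integers(1, N + 1)                     # number of factors for gene g
    s_g = rng.choice(C, size=r_g, replace=False)     # which factors gene g uses
    Y[g, s_g] = rng.normal(size=r_g)                 # their weights

Z = rng.normal(size=(C, T))    # factor profiles, one per row
X = Y @ Z                      # noiseless expression matrix, one gene per row

nonzeros_per_gene = (Y != 0).sum(axis=1)
```

Each row of `X` is then a combination of at most `N` rows of `Z`.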
8 Probabilistic Sparse Matrix Factorization
- To express as a distribution, assume
- varying levels of Gaussian noise in the data
- nothing about transcription factor weights
- normally-distributed transcription factor profiles
- uniformly-distributed factor assignments
- multinomially-distributed factor counts
- Multiply together to get joint distribution
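One way to read "multiply together" is as a sum of log-terms, one per assumption. The sketch below is an interpretation, not the authors' implementation. It assumes per-gene noise variances `psi2`, zero-mean unit-variance factor profiles, a flat (constant) prior on the weights, assignments drawn uniformly from the C factors, and count probabilities `nu`. All function and variable names are hypothetical.

```python
import numpy as np

def log_joint(X, Y, Z, r, psi2, nu):
    """Unnormalized log P(X, Y, Z, S, r) under the assumptions listed above.

    X    : (G, T) data            Y  : (G, C) weights, zero where unused
    Z    : (C, T) profiles        r  : (G,) factor counts, values in 1..N
    psi2 : (G,) noise variances   nu : (N,) probabilities of r_g = 1..N
    """
    C = Z.shape[0]
    resid = X - Y @ Z
    # Gaussian noise with a separate variance per gene
    log_lik = -0.5 * np.sum(resid**2 / psi2[:, None]
                            + np.log(2 * np.pi * psi2[:, None]))
    # Flat prior on the weights: contributes a constant, taken as 0
    log_p_y = 0.0
    # Standard normal prior on every entry of the factor profiles
    log_p_z = -0.5 * np.sum(Z**2 + np.log(2 * np.pi))
    # Uniform assignments: each of the r_g slots picks one of C factors
    log_p_s = -np.sum(r) * np.log(C)
    # Multinomial (categorical) prior on the factor counts
    log_p_r = np.sum(np.log(nu[r - 1]))
    return log_lik + log_p_y + log_p_z + log_p_s + log_p_r

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 4))
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.5]])
X = Y @ Z + 0.1 * rng.normal(size=(2, 4))
r = np.array([1, 2])
psi2 = np.array([0.01, 0.01])
nu = np.array([0.6, 0.4])
score = log_joint(X, Y, Z, r, psi2, nu)
```

A worse fit (larger residuals) lowers the score through the Gaussian term, while the prior terms stay the same.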
10 Factorized Variational Inference
- Exact inference is intractable with P()
- Approximate it by a simpler distribution, Q(), and perform inference on that
12 Factorized Variational Inference
- Parameterize Q()
- Accounts for noise in transcription factor profiles and uncertainty in transcription factor selection
- Minimize KL-divergence between P(), Q()
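To make the KL step concrete, here is a toy sketch with two binary variables rather than the PSMF variables. It assumes the usual variational direction, KL(Q || P), which the slides do not spell out. `kl_divergence` and the example tables are illustrative.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(Q || P) = sum Q * log(Q / P) for discrete tables of the same shape."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0                      # terms with Q = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# A correlated joint P(a, b) that no product of marginals can match exactly
P = np.array([[0.40, 0.10],
              [0.10, 0.40]])

# Factorized Q(a, b) = Q(a) Q(b), built here from the marginals of P
Q = np.outer(P.sum(axis=1), P.sum(axis=0))

kl_correlated = kl_divergence(Q, P)   # strictly positive: Q cannot capture the correlation
kl_exact = kl_divergence(P, P)        # zero when Q equals P
```

The gap `kl_correlated` is the price of restricting Q to a factorized form. Variational inference picks the parameters of Q that make this gap as small as possible.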
14 Variational EM algorithm
- Use coordinate descent on free energy
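Coordinate descent on a free energy can be illustrated on a toy problem. The sketch below is not the PSMF update set. It fits a factorized Q(a)Q(b) to a small fixed joint P(a, b), updating one factor at a time and recording the free energy after each sweep. Because P is normalized here, the free energy equals KL(Q || P). All names and numbers are assumptions.

```python
import numpy as np

def free_energy(qa, qb, P):
    """F = sum_ab Q(a)Q(b) * [log Q(a)Q(b) - log P(a, b)]."""
    Q = np.outer(qa, qb)
    return float(np.sum(Q * (np.log(Q) - np.log(P))))

def normalize_exp(log_w):
    """Exponentiate and normalize, subtracting the max for stability."""
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

P = np.array([[0.50, 0.10],
              [0.05, 0.35]])          # toy joint, all entries > 0
log_P = np.log(P)

qa = np.array([0.5, 0.5])             # initial Q(a)
qb = np.array([0.3, 0.7])             # initial Q(b)

trace = [free_energy(qa, qb, P)]
for sweep in range(20):
    qa = normalize_exp(log_P @ qb)    # minimize F over Q(a) with Q(b) fixed
    qb = normalize_exp(qa @ log_P)    # minimize F over Q(b) with Q(a) fixed
    trace.append(free_energy(qa, qb, P))
```

Each update is the exact minimizer for one factor with the other held fixed, so the values in `trace` never increase from one sweep to the next.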
15 Variational EM: Free Energy
[Plot: free energy vs. iteration]
16 Visualization
PROBABILISTIC SPARSE MATRIX FACTORIZATION: C = 50 possible factors, N = 3 factors per gene (max)
P(r_g) = .55, .27, .18
Sorted by primary transcription factor (s_g1)
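The settings on this slide can be mimicked with synthetic assignments to show what sorting by the primary factor involves. The assignments below are random placeholders, not the fitted values from the talk. Reading the three probabilities as P(r_g = 1), P(r_g = 2), P(r_g = 3) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
G, C, N = 1000, 50, 3                  # G is a toy size; C and N follow the slide
p_r = np.array([0.55, 0.27, 0.18])     # read as P(r_g = 1), P(r_g = 2), P(r_g = 3)

r = rng.choice(np.arange(1, N + 1), size=G, p=p_r)   # factor count per gene

# S[g, n] is the (n+1)-th factor of gene g, or -1 when gene g uses fewer factors
S = np.full((G, N), -1)
for g in range(G):
    S[g, :r[g]] = rng.choice(C, size=r[g], replace=False)

order = np.argsort(S[:, 0], kind="stable")   # sort genes by primary factor s_g1
S_sorted = S[order]
```

Applying `order` to the rows of the data matrix would group genes that share a primary factor into contiguous blocks.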
17 Results: p-value histograms
- Genes can be partitioned into primary categories (i.e. same s_g1 value), secondary classes, etc.
- Compare classes with annotated gene ontology (GO-BP) categories for statistical significance
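The slides do not say which test produced the p-values. A common choice for comparing a gene class against an annotated category is the hypergeometric tail probability, sketched below with the standard library only. The function name and the counts are hypothetical.

```python
from math import comb

def enrichment_p_value(total_genes, category_size, class_size, overlap):
    """P(overlap or more annotated genes in the class) under random sampling.

    total_genes   : genes in the whole data set
    category_size : genes annotated with the GO category
    class_size    : genes in the class being tested
    overlap       : genes in both the class and the category
    """
    upper = min(category_size, class_size)
    denom = comb(total_genes, class_size)
    tail = sum(
        comb(category_size, k) * comb(total_genes - category_size, class_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denom

# Made-up counts: 8 of 50 class members carry an annotation held by 40 of 1000 genes
p = enrichment_p_value(total_genes=1000, category_size=40, class_size=50, overlap=8)
```

A small `p` means the class contains more annotated genes than random sampling would readily explain. With many classes and categories tested, some correction for multiple comparisons would normally be applied as well.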
18 Results: mean log10 p-values
19 Results: count of significant p-values
20 Future Directions: different Q()
Iterated conditional modes (point estimates)
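As a rough illustration of what point estimates could look like, the sketch below runs an iterated-conditional-modes style loop on a simplified version of the model. Each gene uses exactly one factor, noise levels are equal, and every variable is set to its best value given the others. This is a guess at the flavor of the approach named on the slide, not the authors' algorithm. All names are hypothetical.

```python
import numpy as np

def icm_single_factor(X, C, n_iters=10, seed=0):
    """Alternate exact updates of (s_g, y_g) and z_c; return fit and error trace."""
    rng = np.random.default_rng(seed)
    G, T = X.shape
    Z = X[rng.choice(G, size=C, replace=False)].copy()   # init profiles from data rows
    errors = []
    for _ in range(n_iters):
        # Best factor and weight per gene, given Z
        norms = np.sum(Z**2, axis=1) + 1e-12              # (C,), guarded against zero
        proj = X @ Z.T                                    # (G, C) inner products
        s = np.argmax(proj**2 / norms, axis=1)            # factor giving least error
        y = proj[np.arange(G), s] / norms[s]              # least-squares weight
        # Best profile per factor, given assignments and weights
        for c in range(C):
            members = s == c
            if members.any():
                w = y[members]
                Z[c] = (w @ X[members]) / (w @ w + 1e-12)
        errors.append(float(np.sum((X - y[:, None] * Z[s])**2)))
    return s, y, Z, errors

# Synthetic data with a planted single-factor structure plus small noise
rng = np.random.default_rng(1)
Z_true = rng.normal(size=(3, 8))
s_true = rng.integers(0, 3, size=40)
X = rng.normal(size=40)[:, None] * Z_true[s_true] + 0.01 * rng.normal(size=(40, 8))
s, y, Z, errors = icm_single_factor(X, C=3)
```

Unlike the variational approach, this keeps a single value for each variable, so it carries no measure of uncertainty in the factor profiles or in the factor selection.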
21 Summary
- Introduced probabilistic sparse matrix factorization (PSMF), in which each row is a linear combination of a small number of hidden factors selected from a larger set.
- Described a variational inference algorithm for fitting the PSMF model.
- Evaluated model on a gene functional prediction task.