Multiway clustering of microarray data using Probabilistic Sparse Matrix Factorization

1
Multi-way clustering of microarray data
using Probabilistic Sparse Matrix Factorization
  • Delbert Dueck, Quaid Morris, Brendan Frey
  • Probabilistic Statistical Inference Group
  • Department of Electrical and Computer Engineering
  • University of Toronto

2
Outline
  • Introduction
  • Biological Background
  • Mouse genome data set
  • Probabilistic Sparse Matrix Factorization
  • Generative Model
  • Approximate Inference
  • Results
  • Visualizations
  • Statistical Significance
  • Summary

3
Introduction
  • Genes encode basic information about an organism
  • Gene expression is influenced by the presence of
    transcription factors
  • The activity of each gene can be explained by the
    activities of a small number of transcription
    factors
  • Expression data is from Zhang et al. (2004)
  • Contains clear expression profiles for over
    20,000 known and predicted genes across 55 mouse
    tissue types

4
Introduction: Dataset
[Figure: the entire data set X, a G × T matrix (G = 22709 genes, T = 55 tissues), with a 100-gene excerpt shown across the T = 55 tissues]
5
Generative Model
  • We introduce an unsupervised technique that
    renders a multi-way clustering of the data
  • Each gene's expression profile (x_g) is
  • a linear combination (weights y_gc, c ∈ s_g)
  • of a small number (r_g < N)
  • of C possible factor profiles (z_c, c ∈ s_g)
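
The generative story above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_psmf_row`, the toy sizes, the unit-variance Gaussian draws for weights and factor profiles, and the inclusive cap on the number of factors per gene are all choices made here for simulation, not details taken from the slides.

```python
import random

def sample_psmf_row(factor_profiles, max_factors, noise_std, rng):
    """Draw one expression profile x_g as a sparse mix of factor profiles."""
    num_factors = len(factor_profiles)       # C
    num_tissues = len(factor_profiles[0])    # T
    # Assumption: the cap on factors per gene is treated as inclusive.
    r_g = rng.randint(1, max_factors)
    # s_g: which factors this gene uses (distinct indices).
    s_g = rng.sample(range(num_factors), r_g)
    # y_g: weights for the chosen factors only; all other weights are zero.
    # Assumption: Gaussian weights, used purely to have something to simulate.
    y_g = {c: rng.gauss(0.0, 1.0) for c in s_g}
    x_g = [
        sum(y_g[c] * factor_profiles[c][t] for c in s_g)
        + rng.gauss(0.0, noise_std)
        for t in range(num_tissues)
    ]
    return x_g, s_g, y_g

rng = random.Random(0)
C, T, N = 5, 4, 3  # toy sizes, far smaller than the data set in the talk
Z = [[rng.gauss(0.0, 1.0) for _ in range(T)] for _ in range(C)]
x_g, s_g, y_g = sample_psmf_row(Z, N, noise_std=0.1, rng=rng)
```

Because only the entries of y_g indexed by s_g are non-zero, stacking many such rows gives a weight matrix Y with few non-zero entries per row.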

6
Generative Model (entire matrix view)
  • Y is constrained structurally
  • rows must have < N non-zero elements

(marked entries in the figure are non-zero)
7
Generative Model (likelihoods)
  • Form a joint distribution, assuming
  • varying levels of Gaussian noise in the data
  • nothing about factor weights
  • normally-distributed factor profiles
  • uniformly-distributed factor assignments
  • multinomially-distributed factor counts
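
As a rough sketch of how these assumptions combine, the log of the joint distribution for a single gene can be written as a sum of terms. The function names, the argument layout, the single noise variance per gene, and the unit-variance prior on factor profiles are assumptions made for this illustration. No prior term appears for the weights, matching "nothing about factor weights".

```python
import math

def log_gaussian(value, mean, variance):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * variance)
                   + (value - mean) ** 2 / variance)

def log_joint_row(x_g, s_g, y_g, Z, noise_var, count_probs):
    """Log joint terms for one gene, excluding the factor-profile prior.

    x_g: expression profile; s_g: chosen factor indices;
    y_g: {factor index: weight}; Z: factor profiles (C rows of length T);
    noise_var: this gene's noise variance (assumed one value per gene);
    count_probs: probabilities of using 1, 2, ..., N factors.
    """
    num_factors = len(Z)
    recon = [sum(y_g[c] * Z[c][t] for c in s_g) for t in range(len(x_g))]
    # Gaussian noise around the sparse linear combination.
    log_lik = sum(log_gaussian(x, m, noise_var) for x, m in zip(x_g, recon))
    # Uniform factor assignments: each chosen slot has probability 1/C.
    log_assign = -len(s_g) * math.log(num_factors)
    # Multinomial (categorical) prior over the number of factors used.
    log_count = math.log(count_probs[len(s_g) - 1])
    return log_lik + log_assign + log_count

def log_prior_profiles(Z):
    """Assumed standard-normal prior on every factor-profile entry."""
    return sum(log_gaussian(z, 0.0, 1.0) for row in Z for z in row)

Z_toy = [[1.0, 0.0], [0.0, 1.0]]
value = log_joint_row([2.0, 0.0], [0], {0: 2.0}, Z_toy, 1.0, [0.5, 0.5])
```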

8
Approximate Inference Techniques
  • Exact inference of P's hidden variables is
    intractable
  • We examined two approximate techniques
  • Sparse Matrix Factorization (Srebro & Jaakkola,
    2001)
  • Search for a configuration of the hidden
    variables that maximizes P (iterated conditional
    modes)
  • Probabilistic Sparse Matrix Factorization (PSMF)
  • Search for a distribution over configurations of
    the hidden variables that accurately approximates
    P (variational EM)
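
To make "iterated conditional modes" concrete, here is a deliberately simplified sketch. It is not the SMF algorithm from the cited work: it allows only one factor per gene, ignores all priors, and the name `icm_single_factor` is made up for this example. Each sweep sets every hidden variable to its single best value given the others.

```python
def icm_single_factor(X, Z_init, num_sweeps=10):
    """Hard coordinate updates with exactly one factor per row of X."""
    num_factors, num_tissues = len(Z_init), len(Z_init[0])
    Z = [row[:] for row in Z_init]
    assign = [0] * len(X)
    weight = [0.0] * len(X)
    for _ in range(num_sweeps):
        # Step 1: for each gene, pick the factor and weight with least error.
        for g, x in enumerate(X):
            best_err = None
            for c in range(num_factors):
                norm_sq = sum(z * z for z in Z[c])
                if norm_sq == 0.0:
                    continue  # an all-zero profile cannot explain anything
                y = sum(a * b for a, b in zip(x, Z[c])) / norm_sq
                err = sum((a - y * b) ** 2 for a, b in zip(x, Z[c]))
                if best_err is None or err < best_err:
                    best_err, assign[g], weight[g] = err, c, y
        # Step 2: refit each factor profile by least squares on its members.
        for c in range(num_factors):
            members = [g for g in range(len(X)) if assign[g] == c]
            denom = sum(weight[g] ** 2 for g in members)
            if denom > 0.0:
                Z[c] = [sum(weight[g] * X[g][t] for g in members) / denom
                        for t in range(num_tissues)]
    return assign, weight, Z

X_toy = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
assign, weight, Z_fit = icm_single_factor(X_toy, [[1.0, 0.1], [0.1, 1.0]])
```

A variational method would replace the hard choice in step 1 with a distribution over factors, which is the distinction the two bullets above draw.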

9
Factorized Variational Inference
  • Parameterize Q
  • Accounts for noise in factor profiles and
    uncertainty in factor selection
  • Minimize KL-divergence between P and Q
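
The link between the KL-divergence and the free energy F can be checked numerically on a toy model with one discrete hidden variable. The numbers below are invented, and the direction KL(Q || P) is the usual variational convention, assumed here because the slide does not state it.

```python
import math

def free_energy(q, joint):
    """F(Q) = sum_h Q(h) * log(Q(h) / P(x, h)) for the observed x."""
    return sum(qh * math.log(qh / ph) for qh, ph in zip(q, joint) if qh > 0.0)

def kl_divergence(q, p):
    """KL(Q || P) for two discrete distributions on the same support."""
    return sum(qh * math.log(qh / ph) for qh, ph in zip(q, p) if qh > 0.0)

joint = [0.10, 0.25, 0.05]            # toy P(x, h) for h = 0, 1, 2
evidence = sum(joint)                 # P(x)
posterior = [p / evidence for p in joint]
q = [0.2, 0.6, 0.2]                   # an arbitrary approximating Q(h)

# Identity: F(Q) = KL(Q || P(h | x)) - log P(x)
gap = free_energy(q, joint) - (kl_divergence(q, posterior) - math.log(evidence))
```

Since log P(x) does not depend on Q, lowering F lowers the KL-divergence to the true posterior, and F reaches its minimum of -log P(x) when Q equals that posterior.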

10
Variational EM algorithm for PSMF
  • Use coordinate descent on free energy, F

11
PSMF visualization (1 of 3)
  • Probabilistic Sparse Matrix Factorization
  • C = 50 possible factors
  • maximum of N = 3 factors per gene
  • Expression profiles (rows) are grouped by primary
    factor (s_g1), then secondary factor (s_g2), etc.
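
The row ordering used for the visualization amounts to a lexicographic sort on each gene's factor labels. A minimal sketch, with hypothetical gene names and labels, and assuming each tuple lists the primary factor first:

```python
# Hypothetical factor labels per gene: (primary, secondary, tertiary, ...).
assignments = {
    "geneA": (3, 33),
    "geneB": (1,),
    "geneC": (3, 7, 12),
    "geneD": (3,),
}

# Tuples compare element by element, so this groups by primary factor,
# then by secondary factor within each primary group, and so on.
ordered = sorted(assignments, key=assignments.get)
```

Genes with no secondary factor sort ahead of those that have one within the same primary group, because a shorter tuple compares as smaller when it is a prefix of a longer one.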

12
PSMF visualization (2 of 3)
  • Zoom in where primary factor is 3
  • High expression in colon, small intestine, large
    intestine tissues
  • enriched for GO-BP category lipid metabolism
    (GO:0006629), p-value < 10^-10

13
PSMF visualization (3 of 3)
  • Zoom in where primary factor is 3 and secondary
    factor is 33

14
Results: p-value histograms
  • Genes are clustered by primary factor,
    secondary factor, etc.
  • Compare these clusters with annotated GO
    categories by computing hypergeometric p-values
    of all possible cluster-GO category pairings
  • Statistical significance of each cluster is the
    p-value of most enriched-for GO category

[Figure: histograms of cluster p-values, split into significant and insignificant at α = 0.05 plus Bonferroni correction]
15
Results: significant factors (N = 1)
16
Results: significant factors (N = 1)
It would be nice to get rid of random clustering
for the next three slides
17
Results: significant factors (N = 2)
18
Results: significant factors (N = 3)
19
Results: complete
20
PSMF vs. SMF: maximizing likelihoods
  • SMF makes hard decisions during maximization
  • Immediately converges to poor local maximum
  • PSMF makes soft decisions (better approximation
    of posterior)
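
The difference between hard and soft decisions can be shown on a toy example in which two candidate factors explain a gene almost equally well. This is a generic illustration, not the actual PSMF update: the squared-error inputs and the Gaussian-style weighting are assumptions made here.

```python
import math

def soft_assignment(errors, noise_var):
    """Probabilities proportional to exp(-error / (2 * noise_var))."""
    logits = [-e / (2.0 * noise_var) for e in errors]
    top = max(logits)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def hard_assignment(errors):
    """All probability mass on the single best-fitting factor."""
    best = min(range(len(errors)), key=errors.__getitem__)
    return [1.0 if c == best else 0.0 for c in range(len(errors))]

# Squared reconstruction errors of one gene under three candidate factors.
errors = [1.00, 1.05, 4.00]
soft = soft_assignment(errors, noise_var=1.0)
hard = hard_assignment(errors)
```

The hard rule discards the second factor entirely even though it fits nearly as well, while the soft rule keeps substantial mass on it, so later updates can still move toward it.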

21
Summary
  • We introduce probabilistic sparse matrix
    factorization
  • Useful tool for data vectors that are most
    naturally explained as a linear combination of a
    selection of prototype vectors
  • Computes probability distributions instead of
    making hard decisions during factor selection
  • PSMF finds more highly-enriched functional
    clusters than SMF and standard techniques
  • Also outputs secondary and higher-order labels
    for each expression vector
  • Useful for more refined visualization and
    functional prediction

22
Questions?
  • For more details and software, see
    www.psi.toronto.edu/delbert

23
Future Directions: different Q?
Iterated conditional modes (point estimates)