Bayesian mixture models for analysing gene expression data
1
Bayesian mixture models for analysing gene
expression data
  • Natalia Bochkina
  • In collaboration with
  • Alex Lewin , Sylvia Richardson,
  • BAIR Consortium
  • Imperial College London, UK

2
Introduction
  • We use a fully Bayesian approach to model the
    data, with MCMC for parameter estimation.
  • All parameters are modelled simultaneously.
  • Prior information can be included in the model.
  • Variances are automatically adjusted to avoid
    unstable estimates when the number of
    observations is small.
  • Inference is based on the posterior distribution
    of all parameters.
  • The mean of the posterior distribution is used
    as the estimate for each parameter.

3
Differential expression
[Figure: distributions of the expression index for
gene g under conditions 1 and 2, and the resulting
distribution of the differential expression
parameter.]
4
Bayesian Model
Two conditions, with R_k replicates in condition k:

y_g1r ~ N(α_g − ½ d_g, σ²_g1), r = 1, ..., R_1
y_g2r ~ N(α_g + ½ d_g, σ²_g2), r = 1, ..., R_2

α_g: mean; d_g: difference (log fold change).

Prior model: σ²_gk ~ IG(a_k, b_k), k = 1, 2, so that

E(σ²_gk | s²_gk) = ((R_k − 1) s²_gk + 2 b_k) / (R_k − 1 + 2 a_k)

Non-informative priors on α_g, a_k, b_k.

Prior distribution on d_g?
(Assume the data are background-corrected,
log-transformed and normalised.)
5
Modelling differential expression
Prior information / assumption: genes are either
differentially expressed or not (of interest or
not). We can include this in the model by
modelling the difference as a mixture:

d_g ~ (1 − p) δ_0(d_g) + p H(d_g | θ_g)

where the point mass δ_0 models H0 and the
alternative distribution H models H1. How to
choose H?

Advantages
  • Automatically selects a threshold, as opposed
    to specifying constants as in the
    non-informative prior model for differences
  • Interpretable: can use Bayesian classification
    to select differentially expressed genes via
    P(g ∈ H1 | data) and P(g ∈ H0 | data)
  • Can estimate false discovery and non-discovery
    rates (Newton et al 2004).

6
Considered mixture models
We consider several distributions for the non-zero
part H of the mixture distribution for d_g: the
double gamma, the Student t distribution, the
conjugate model (Lonnstedt and Speed, 2002) and
the uniform distribution, all in a fully Bayesian
context.

LS model: H is normal with variance proportional
to the variance of the data:
d_g ~ (1 − p) δ_0 + p N(0, c σ_g²),
where σ_g² = σ²_g1/R_1 + σ²_g2/R_2.

Gamma model: H is a double gamma distribution.

T model: H is a Student t distribution:
d_g ~ (1 − p) δ_0 + p T(ν, µ, τ).

Uniform model: H is a uniform distribution:
d_g ~ (1 − p) δ_0 + p U(−m_1, m_2),
where (−m_1, m_2) is a slightly widened range of
the observed differences.

Priors on hyperparameters are either
non-informative or weakly informative (Γ(1,1) for
parameters supported on the positive half-line).
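Under the LS model the marginal distribution of the observed log fold change has a closed form: if the estimate of d_g has sampling variance σ_g², it is N(0, σ_g²) under H0 and N(0, (1 + c) σ_g²) under H1, so the posterior classification probability follows from Bayes' rule. A minimal NumPy sketch (the function name and the illustrative values of p and c are ours, not from the slides):

```python
import numpy as np

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2.0 * np.pi * var)

def posterior_prob_h1(d_hat, sigma2, p=0.1, c=10.0):
    """Posterior probability that a gene is differentially expressed
    under an LS-style mixture d_g ~ (1-p) delta_0 + p N(0, c*sigma2).

    d_hat  : observed difference (log fold change estimate)
    sigma2 : sampling variance of d_hat (sigma_g1^2/R1 + sigma_g2^2/R2)
    p, c   : mixture weight and variance scale (illustrative values)
    """
    # Marginal of d_hat under H0 (d_g = 0): N(0, sigma2)
    f0 = normal_pdf(d_hat, sigma2)
    # Marginal under H1: N(0, c*sigma2) slab convolved with the
    # sampling noise gives N(0, (1 + c) * sigma2)
    f1 = normal_pdf(d_hat, (1.0 + c) * sigma2)
    return p * f1 / (p * f1 + (1.0 - p) * f0)
```

Small observed differences give a low posterior probability of H1 and large ones a probability near 1, which is the "automatic threshold" behaviour described above.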
7
Simulated data
We compare the performance of the four models on
simulated data. For simplicity we consider a
one-group model (or a paired two-group model). We
simulate a data set with 1000 variables and 8
replicates.

[Plot of the simulated data set: difference vs
variance. Hyperparameters of the variance,
a = 1.5 and b = 0.05, are chosen close to the
Bayesian estimates of those in a real data set.]
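A data set of this shape can be generated directly from the model: inverse-gamma variances with a = 1.5, b = 0.05, and spike-and-slab differences. A sketch under stated assumptions (the mixture weight and the normal slab for the non-zero differences are illustrative choices, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

G, R = 1000, 8      # genes and replicates, as on the slide
a, b = 1.5, 0.05    # inverse-gamma hyperparameters of the variance
p = 0.2             # assumed fraction of differentially expressed genes

# Gene-level variances: sigma_g^2 ~ IG(a, b), drawn as 1 / Gamma(a, scale=1/b)
sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=G)

# Spike-and-slab differences: zero with probability 1-p,
# otherwise drawn from an illustrative N(0, 1) slab
is_de = rng.random(G) < p
d = np.where(is_de, rng.normal(0.0, 1.0, size=G), 0.0)

# Replicate measurements for the one-group (paired) model:
# y_gr ~ N(d_g, sigma_g^2)
y = rng.normal(loc=d[:, None], scale=np.sqrt(sigma2)[:, None], size=(G, R))
```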
8
Differences
Mixture estimates vs true values
  • Gamma, T and LS models estimate differences well
  • Uniform model shrinks values to zero
  • Compared to empirical Bayes, posterior estimates
    in the fully Bayesian approach do not shrink
    large values of the differences

[Four panels, one per model: posterior mean vs
true value of the difference.]
9
Bayesian estimates of variance
  • T and Gamma models have very similar variance
    estimates
  • Uniform model produces similar estimates for
    small values and higher estimates for larger
    values compared with T and Gamma models
  • LS model shows more perturbation at both higher
    and lower values compared to T and Gamma models

Blue: variance estimate based on the Bayesian
model with a non-informative prior on differences.

[Four panels (Uniform, Gamma, LS, T models):
E(σ² | y) vs sample variance.]

Note that the mixture estimate of the variance can
be larger than the sample variance.
10
Classification
[Panels: differentially expressed genes (200);
non-differentially expressed genes (800).]

  • T, LS and Gamma models perform similarly
  • The uniform model has a smaller number of false
    positives but also a smaller number of true
    positives

The uniform prior is more conservative.
11
Genes wrongly classified by the mixture: truly
differentially expressed vs truly not
differentially expressed.

Classification errors lie on the borderline: there
is confusion between the size of the fold change
and the biological variability.
12
Another simulation
2628 data points; many points added on the
borderline; classification errors shown in red.

Can we improve the estimation of within-condition
biological variability?
13
DAG for the mixture model
The variance estimates are influenced by the
mixture parameters.

Should we use only partial information from the
replicates to estimate σ²_gs and feed it forward
into the mixture?

[DAG plate: g = 1, ..., G]
14
Estimation
  • Estimation of all parameters combines
    information from biological replicates and
    between-condition contrasts:
  • s²_gs = (1/R_s) Σ_r (y_gsr − ȳ_gs.)², s = 1, 2
    (within-condition biological variability)
  • ȳ_gs. = (1/R_s) Σ_r y_gsr
    (average expression over replicates)
  • ½(ȳ_g1. + ȳ_g2.)
    (average expression over conditions)
  • ½(ȳ_g1. − ȳ_g2.)
    (between-conditions contrast)

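The four summaries above are straightforward to compute from the replicate matrices; a minimal sketch (the function name is ours):

```python
import numpy as np

def summary_statistics(y1, y2):
    """Gene-level summaries for the two-condition model.

    y1, y2: (genes x replicates) arrays for conditions s = 1, 2.
    Returns the within-condition variabilities, the average over
    conditions, and the between-conditions contrast.
    """
    ybar1 = y1.mean(axis=1)   # average over replicates, condition 1
    ybar2 = y2.mean(axis=1)   # average over replicates, condition 2
    # Within-condition biological variability, divided by R_s as on the slide
    s2_1 = ((y1 - ybar1[:, None]) ** 2).sum(axis=1) / y1.shape[1]
    s2_2 = ((y2 - ybar2[:, None]) ** 2).sum(axis=1) / y2.shape[1]
    mean = 0.5 * (ybar1 + ybar2)       # average expression over conditions
    contrast = 0.5 * (ybar1 - ybar2)   # between-conditions contrast
    return s2_1, s2_2, mean, contrast
```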
15
Mixture, full vs partial
Classification altered for 57 points
Work in progress
16
Difference: cut vs no cut

Classification differs between the two models.

[Panels (Cut, Full): variance, posterior
probability, and sample st. dev. vs difference;
truly differentially expressed and truly not
differentially expressed genes marked.]
17
Microarray data
[Panels (Cut, Full): variance, posterior
probability, and pooled sample st. dev. vs sample
difference.]
Genes classified differently by the full model
and the model with feedback cut follow a curve.
18
Since the variance is overestimated in the full
mixture model compared to the mixture model with
the cut, the number of false negatives is lower
for the model with the cut than for the full
model.
19
LS model: empirical vs fully Bayesian
Compare the Lonnstedt and Speed (LS) model
  • in the fully Bayesian (FB) framework and
  • in the empirical Bayes (EB) framework.

[Panels: estimated parameters; classification.]
  • If the parameter p is specified correctly, the
    empirical and fully Bayesian models do not
    differ
  • If p is misspecified, the estimate of the
    parameter c changes, which leads to
    misclassification

20
Small p (p = 0.01)

[Panels: Cut vs No Cut]
21
Bayesian Estimate of FDR
  • Step 1: Choose a gene-specific parameter (e.g.
    d_g) or a gene statistic
  • Step 2: Model its prior distribution using a
    mixture model
  • -- with one component to model the unaffected
    genes (null hypothesis), e.g. a point mass at 0
    for d_g
  • -- other components to model (flexibly) the
    alternative
  • Step 3: Calculate the posterior probability for
    any gene to belong to the unmodified component,
    p_g0 = P(g ∈ H0 | data)
  • Step 4: Evaluate FDR (and FNR) for any list,
    assuming that all the gene classifications are
    independent
  • (Broët et al. 2004)

Bayes FDR(list | data) = (1/card(list)) Σ_{g ∈ list} p_g0
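The FDR formula above is just the average posterior null probability over the listed genes; the analogous FNR estimate averages 1 − p_g0 over the genes left out. A short sketch (function names are ours, and the FNR formula is our reading of the companion estimate):

```python
import numpy as np

def bayes_fdr(p0, gene_list):
    """Bayesian FDR estimate for a gene list: the average posterior
    probability of the null, p_g0 = P(g in H0 | data), over the list."""
    return float(np.mean(p0[gene_list]))

def bayes_fnr(p0, gene_list):
    """Bayesian FNR estimate: the average of 1 - p_g0 over the genes
    NOT included in the list."""
    mask = np.ones(len(p0), dtype=bool)
    mask[gene_list] = False   # drop the listed genes
    return float(np.mean(1.0 - p0[mask]))
```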
22
Multiple Testing Problem
  • Gene lists can be built by computing a
    criterion separately for each gene and ranking
  • Thousands of genes are considered simultaneously
  • How to assess the performance of such lists ?

Statistical Challenge Select interesting genes
without including too many false positives in a
gene list
A gene is a false positive if it is included in
the list when it is truly unmodified under the
experimental set up
Want an evaluation of the expected false
discovery rate (FDR)
23
Bayes rule
FDR (black) and FNR (blue) as a function of the
cut-off on Post Prob(g ∈ H1) = 1 − p_g0.

Observed and estimated FDR/FNR correspond well.
24
Summary
  • Mixture models estimate differences and
    hyperparameters well on simulated data.
  • Variance is overestimated for some genes.
  • Mixture model with uniform alternative
    distribution is more conservative in classifying
    genes than structured models.
  • The Lonnstedt and Speed model performs better
    in the fully Bayesian framework because the
    parameter p is estimated from the data.
  • Estimates of the false discovery and
    non-discovery rates are close to the true
    values.