Title: Bayesian mixture models for analysing gene expression data
1. Bayesian mixture models for analysing gene expression data
- Natalia Bochkina
- In collaboration with Alex Lewin, Sylvia Richardson, and the BAIR Consortium
- Imperial College London, UK
2. Introduction
- We use a fully Bayesian approach to model the data and MCMC for parameter estimation.
- All parameters are modelled simultaneously.
- Prior information can be included in the model.
- Variances are automatically adjusted to avoid unstable estimates for small numbers of observations.
- Inference is based on the posterior distribution of all parameters.
- The mean of the posterior distribution is used as the estimate for all parameters.
3. Differential expression
[Figure: distributions of the expression index for gene g under conditions 1 and 2, and the distribution of the differential expression parameter.]
4. Bayesian Model
Two conditions, with $R_1$ and $R_2$ replicates:
$y_{g1r} \sim N(\alpha_g - \tfrac{1}{2}\delta_g,\ \sigma_{g1}^2), \quad r = 1, \ldots, R_1$
$y_{g2r} \sim N(\alpha_g + \tfrac{1}{2}\delta_g,\ \sigma_{g2}^2), \quad r = 1, \ldots, R_2$
where $\alpha_g$ is the mean and $\delta_g$ is the difference (log fold change).
$\sigma_{gk}^2 \sim IG(a_k, b_k), \quad k = 1, 2; \qquad E(\sigma_{gk}^2 \mid s_{gk}^2) = \frac{(R_k - 1)\, s_{gk}^2 + 2 b_k}{R_k - 1 + 2 a_k}$
Non-informative priors are placed on $\alpha_g$, $a_k$ and $b_k$.
Prior model: which prior distribution should be placed on $\delta_g$?
(Assume the data are background corrected, log-transformed and normalised.)
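As a worked illustration, the following minimal numpy sketch draws data from this sampling model and applies the variance adjustment above; the hyperparameter values and the distributions used for $\alpha_g$ and $\delta_g$ are illustrative assumptions, not those of the talk.

import numpy as np

rng = np.random.default_rng(0)
G, R1, R2 = 1000, 4, 4                 # genes and replicates (assumed sizes)
a, b = 1.5, 0.05                       # IG hyperparameters (assumed values)

alpha = rng.normal(7.0, 1.0, G)        # gene means alpha_g (assumed prior)
delta = rng.normal(0.0, 0.5, G)        # log fold changes delta_g (assumed)
sigma2_1 = 1.0 / rng.gamma(a, 1.0 / b, G)   # sigma_g1^2 ~ IG(a, b)
sigma2_2 = 1.0 / rng.gamma(a, 1.0 / b, G)   # sigma_g2^2 ~ IG(a, b)

# y_gkr ~ N(alpha_g -/+ delta_g / 2, sigma_gk^2), r = 1..R_k
y1 = rng.normal((alpha - delta / 2)[:, None], np.sqrt(sigma2_1)[:, None], (G, R1))
y2 = rng.normal((alpha + delta / 2)[:, None], np.sqrt(sigma2_2)[:, None], (G, R2))

# Variance adjustment from the slide: ((R-1) s^2 + 2b) / (R - 1 + 2a),
# which stabilises estimates based on few replicates.
s2_1 = y1.var(axis=1, ddof=1)
sigma2_1_adj = ((R1 - 1) * s2_1 + 2 * b) / (R1 - 1 + 2 * a)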
5. Modelling differential expression
Prior information / assumption: genes are either differentially expressed or not (of interest or not).
We can include this in the model by modelling the difference as a mixture:
$\delta_g \sim (1 - p)\, \delta_0(\delta_g) + p\, H(\delta_g \mid \theta)$
How should $H$ be chosen?
Advantages:
- Automatically selects the threshold, as opposed to specifying constants as in the non-informative prior model for the differences.
- Interpretable: Bayesian classification can be used to select differentially expressed genes, declaring gene $g$ differentially expressed when $P(g \in H_1 \mid \text{data}) \geq P(g \in H_0 \mid \text{data})$.
- False discovery and non-discovery rates can be estimated (Newton et al. 2004).
6. Considered mixture models
We consider several distributions as the non-zero part $H$ of the mixture distribution for $\delta_g$, all in a fully Bayesian context: the double gamma, the Student t distribution, the conjugate model of Lonnstedt and Speed (2002), and the uniform distribution.
LS model: $H$ is normal, with variance proportional to the variance of the data:
$\delta_g \sim (1 - p)\, \delta_0 + p\, N(0,\ c\, \sigma_g^2), \qquad \sigma_g^2 = \sigma_{g1}^2 / R_1 + \sigma_{g2}^2 / R_2$
Gamma model: $H$ is a double gamma distribution.
T model: $H$ is a Student t distribution:
$\delta_g \sim (1 - p)\, \delta_0 + p\, T(\nu, \mu, \tau)$
Uniform model: $H$ is a uniform distribution:
$\delta_g \sim (1 - p)\, \delta_0 + p\, U(-m_1, m_2)$
where $(-m_1, m_2)$ is a slightly widened range of the observed differences.
Priors on the hyperparameters are either non-informative or weakly informative (G(1, 1) for parameters supported on the positive half-line).
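The four alternatives can be compared by simulation; below is a minimal sketch of drawing $\delta_g$ under each choice of $H$, with the spike handled by the allocation indicator. All shape, scale, degrees-of-freedom and mixing values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
G, p = 1000, 0.2                 # number of genes and mixing weight (assumed)
sigma2 = 0.1                     # plug-in gene variance for the LS model (assumed)
c, nu, mu, tau = 2.0, 4.0, 0.0, 0.5   # assumed hyperparameter values
m1, m2 = 2.0, 2.0                # widened range of observed differences (assumed)

def draw_H(model, n):
    """Draw n values from the non-zero mixture component H."""
    if model == "gamma":         # double gamma: random sign times a gamma draw
        return rng.choice([-1.0, 1.0], n) * rng.gamma(2.0, 0.5, n)
    if model == "t":             # Student t with df nu, location mu, scale tau
        return mu + tau * rng.standard_t(nu, n)
    if model == "ls":            # LS / conjugate: N(0, c * sigma_g^2)
        return rng.normal(0.0, np.sqrt(c * sigma2), n)
    if model == "uniform":       # uniform on (-m1, m2)
        return rng.uniform(-m1, m2, n)

is_de = rng.random(G) < p        # H1 allocation with probability p
delta = np.zeros(G)              # point mass at zero for the H0 genes
delta[is_de] = draw_H("t", int(is_de.sum()))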
7. Simulated data
We compare the performance of the four models on simulated data. For simplicity we consider a one-group model (equivalently, a paired two-group model), and simulate a data set with 1000 variables and 8 replicates.
[Figure: plot of the simulated data set, with variance and difference panels.]
The variance hyperparameters a = 1.5, b = 0.05 are chosen close to the Bayesian estimates obtained for a real data set.
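A short sketch of generating a data set like the one described above (one group, 1000 variables, 8 replicates, variances from IG(1.5, 0.05)); the proportion of differentially expressed genes and the double-gamma alternative are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(2)
G, R, a, b = 1000, 8, 1.5, 0.05        # sizes and variance hyperparameters
p = 0.2                                # assumed proportion of DE genes

sigma2 = 1.0 / rng.gamma(a, 1.0 / b, G)            # sigma_g^2 ~ IG(a, b)
slab = rng.choice([-1.0, 1.0], G) * rng.gamma(2.0, 0.5, G)
delta = np.where(rng.random(G) < p, slab, 0.0)     # spike-and-slab differences
y = rng.normal(delta[:, None], np.sqrt(sigma2)[:, None], (G, R))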
8. Differences
Mixture estimates vs true values:
- The Gamma, T and LS models estimate the differences well.
- The Uniform model shrinks values towards zero.
- Compared with empirical Bayes, the posterior estimates in the fully Bayesian approach do not shrink large values of the differences.
[Figure: posterior mean vs true difference for each of the four models.]
9. Bayesian estimates of variance
- The T and Gamma models have very similar variance estimates.
- The Uniform model produces similar estimates for small values and higher estimates for larger values, compared with the T and Gamma models.
- The LS model shows more perturbation at both higher and lower values than the T and Gamma models.
[Figure: $E(\sigma^2 \mid y)$ vs sample variance for the Uniform, Gamma, LS and T models. Blue: variance estimate based on the Bayesian model with a non-informative prior on the differences.]
Note that the mixture estimate of the variance can be larger than the sample variance.
10. Classification
Differentially expressed genes (200) vs non-differentially expressed genes (800):
- The T, LS and Gamma models perform similarly.
- The Uniform model has a smaller number of false positives, but also a smaller number of true positives: the uniform prior is more conservative.
11. Wrongly classified by the mixture
[Figure: misclassified genes, marked as truly differentially expressed or truly not differentially expressed.]
The classification errors lie on the borderline: there is confusion between the size of the fold change and the biological variability.
12. Another simulation
2628 data points, with many points added on the borderline; classification errors are shown in red.
Can we improve the estimation of the within-condition biological variability?
13. DAG for the mixture model
[Figure: directed acyclic graph of the mixture model, with a plate over genes g = 1, ..., G.]
The variance estimates are influenced by the mixture parameters. Could we instead use only partial information from the replicates to estimate $\sigma_{gs}^2$, and feed it forward into the mixture?
14. Estimation
Estimation of all parameters combines information from the biological replicates and the between-condition contrasts (see the sketch below):
- $s_{gs}^2 = \frac{1}{R_s} \sum_r (y_{gsr} - \bar{y}_{gs\cdot})^2$, $s = 1, 2$: within-condition biological variability.
- $\bar{y}_{gs\cdot} = \frac{1}{R_s} \sum_r y_{gsr}$: average expression over replicates.
- $\frac{1}{2}(\bar{y}_{g1\cdot} + \bar{y}_{g2\cdot})$: average expression over conditions.
- $\frac{1}{2}(\bar{y}_{g1\cdot} - \bar{y}_{g2\cdot})$: between-condition contrast.
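A small numpy sketch of these quantities, for replicate matrices y1 and y2 of shape (G, R_s); the function name and variable names are hypothetical.

import numpy as np

def condition_stats(y1, y2):
    """Per-gene summaries for two conditions, replicates along axis 1."""
    ybar1, ybar2 = y1.mean(axis=1), y2.mean(axis=1)    # averages over replicates
    s2_1 = ((y1 - ybar1[:, None]) ** 2).mean(axis=1)   # within-condition
    s2_2 = ((y2 - ybar2[:, None]) ** 2).mean(axis=1)   # variability (1/R_s sums)
    overall = 0.5 * (ybar1 + ybar2)      # average expression over conditions
    contrast = 0.5 * (ybar1 - ybar2)     # between-condition contrast
    return s2_1, s2_2, overall, contrast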
15. Mixture, full vs partial
The classification is altered for 57 points. (Work in progress.)
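To make the full-vs-partial comparison concrete, here is a minimal sketch of the "cut" idea in a one-group setting with an LS-style normal alternative: the gene variance is adjusted using the replicate information only and then plugged into the classification, instead of being updated jointly with the mixture by MCMC. The function and all parameter values are assumptions, and the plug-in stands in for the full posterior computation.

import numpy as np
from scipy.stats import norm

def classify_with_cut(y, p=0.2, c=2.0, a=1.5, b=0.05):
    """P(g in H1 | data) with the variance feedback cut (plug-in sketch)."""
    G, R = y.shape
    ybar = y.mean(axis=1)
    s2 = y.var(axis=1, ddof=1)
    # Cut: variance estimated from replicates alone, no mixture feedback.
    sig2 = ((R - 1) * s2 + 2 * b) / (R - 1 + 2 * a)
    f0 = norm.pdf(ybar, 0.0, np.sqrt(sig2 / R))              # H0: delta_g = 0
    f1 = norm.pdf(ybar, 0.0, np.sqrt(sig2 * (c + 1.0 / R)))  # H1: N(0, c sig2)
    return p * f1 / (p * f1 + (1.0 - p) * f0)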
16. Difference: cut and no cut
[Figure: genes classified differently under the cut and full models, marked as truly differentially expressed or truly not; panels show the variance, the posterior probability, and the sample standard deviation vs the difference.]
17. Microarray data
[Figure: variance, posterior probability, and pooled sample standard deviation vs sample difference, under the cut and full models.]
Genes classified differently by the full model and by the model with the feedback cut follow a curve.
18. Since the variance is overestimated in the full mixture model compared with the mixture model with the cut, the number of false negatives is lower for the model with the cut than for the full model.
19. LS model: empirical vs fully Bayesian
We compare the Lonnstedt and Speed (LS) model in the fully Bayesian (FB) and empirical Bayes (EB) frameworks, in terms of both the estimated parameters and the classification.
- If the parameter p is specified correctly, the empirical and fully Bayesian models do not differ.
- If the parameter p is misspecified, the estimate of the parameter c changes, which leads to misclassification.
20. Small p (p = 0.01)
[Figure: results with and without the cut.]
21. Bayesian Estimate of FDR
- Step 1: Choose a gene-specific parameter (e.g. $\delta_g$) or a gene statistic.
- Step 2: Model its prior distribution using a mixture model, with one component modelling the unaffected genes (the null hypothesis), e.g. a point mass at 0 for $\delta_g$, and other components modelling the alternative (flexibly).
- Step 3: Calculate the posterior probability of any gene belonging to the unmodified component, $p_{g0} = P(g \in H_0 \mid \text{data})$.
- Step 4: Evaluate the FDR (and FNR) for any gene list, assuming that all the gene classifications are independent (Broët et al. 2004):
$\text{Bayes FDR}(\text{list} \mid \text{data}) = \frac{1}{\text{card}(\text{list})} \sum_{g \in \text{list}} p_{g0}$
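A sketch of Steps 3-4: given a vector of posterior null probabilities $p_{g0}$ (e.g. averaged MCMC allocation indicators), the Bayes FDR of a list, and the FNR of its complement, follow directly; the thresholding rule shown is an assumed example.

import numpy as np

def bayes_fdr_fnr(pg0, cutoff=0.5):
    """Bayes FDR/FNR estimates for the list {g : 1 - p_g0 > cutoff}."""
    in_list = (1.0 - pg0) > cutoff       # declared differentially expressed
    fdr = pg0[in_list].mean() if in_list.any() else 0.0
    fnr = (1.0 - pg0[~in_list]).mean() if (~in_list).any() else 0.0
    return fdr, fnr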
22. Multiple Testing Problem
- Gene lists can be built by separately computing a criterion for each gene and ranking.
- Thousands of genes are considered simultaneously.
- How can the performance of such lists be assessed?
Statistical challenge: select interesting genes without including too many false positives in the gene list. A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set-up. We therefore want an evaluation of the expected false discovery rate (FDR).
23. Bayes rule
[Figure: FDR (black) and FNR (blue) as functions of the posterior probability $P(g \in H_1 \mid \text{data}) = 1 - p_{g0}$.]
The observed and estimated FDR/FNR correspond well.
24. Summary
- Mixture models estimate the differences and hyperparameters well on simulated data.
- The variance is overestimated for some genes.
- The mixture model with the uniform alternative distribution is more conservative in classifying genes than the structured models.
- The Lonnstedt and Speed model performs better in the fully Bayesian framework, because the parameter p is estimated from the data.
- Estimates of the false discovery and non-discovery rates are close to the true values.