Title: Bayesian mixture models for analysing gene expression data
1. Bayesian mixture models for analysing gene expression data
- Natalia Bochkina
- In collaboration with Alex Lewin, Sylvia Richardson, and the BAIR Consortium
- Imperial College London, UK
2. Introduction
- We use a fully Bayesian approach to model the data and MCMC for parameter estimation.
- All parameters are modelled simultaneously.
- Prior information can be included in the model.
- Variances are automatically adjusted to avoid unstable estimates for small numbers of observations.
- Inference is based on the posterior distribution of all parameters.
- The mean of the posterior distribution is used as the estimate for all parameters.
3. Differential expression
[Figure: distributions of the expression index for gene g under conditions 1 and 2, and the distribution of the differential expression parameter.]
4. Bayesian Model
Two conditions, with $R_1$ and $R_2$ replicates:
$y_{g1r} \sim N(\alpha_g - \tfrac{1}{2}\delta_g,\ \sigma_{g1}^2), \quad r = 1, \ldots, R_1$
$y_{g2r} \sim N(\alpha_g + \tfrac{1}{2}\delta_g,\ \sigma_{g2}^2), \quad r = 1, \ldots, R_2$
where $\alpha_g$ is the mean and $\delta_g$ is the difference (log fold change).
$\sigma_{gk}^2 \sim IG(a_k, b_k), \quad k = 1, 2; \qquad E(\sigma_{gk}^2 \mid s_{gk}^2) = \frac{(R_k - 1)\, s_{gk}^2 + 2 b_k}{R_k - 1 + 2 a_k}$
Non-informative priors are placed on $\alpha_g$, $a_k$ and $b_k$.
Prior model: which prior distribution should be placed on $\delta_g$?
(Assume the data are background corrected, log-transformed and normalised.)
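As a worked illustration, the following minimal numpy sketch draws data from this sampling model and applies the variance adjustment above; the hyperparameter values and the distributions used for $\alpha_g$ and $\delta_g$ are illustrative assumptions, not those of the talk.

import numpy as np

rng = np.random.default_rng(0)
G, R1, R2 = 1000, 4, 4                 # genes and replicates (assumed sizes)
a, b = 1.5, 0.05                       # IG hyperparameters (assumed values)

alpha = rng.normal(7.0, 1.0, G)        # gene means alpha_g (assumed prior)
delta = rng.normal(0.0, 0.5, G)        # log fold changes delta_g (assumed)
sigma2_1 = 1.0 / rng.gamma(a, 1.0 / b, G)   # sigma_g1^2 ~ IG(a, b)
sigma2_2 = 1.0 / rng.gamma(a, 1.0 / b, G)   # sigma_g2^2 ~ IG(a, b)

# y_gkr ~ N(alpha_g -/+ delta_g / 2, sigma_gk^2), r = 1..R_k
y1 = rng.normal((alpha - delta / 2)[:, None], np.sqrt(sigma2_1)[:, None], (G, R1))
y2 = rng.normal((alpha + delta / 2)[:, None], np.sqrt(sigma2_2)[:, None], (G, R2))

# Variance adjustment from the slide: ((R-1) s^2 + 2b) / (R - 1 + 2a),
# which stabilises estimates based on few replicates.
s2_1 = y1.var(axis=1, ddof=1)
sigma2_1_adj = ((R1 - 1) * s2_1 + 2 * b) / (R1 - 1 + 2 * a)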
5. Modelling differential expression
Prior information / assumption: genes are either differentially expressed or not (of interest or not).
We can include this in the model by modelling the difference as a mixture:
$\delta_g \sim (1 - p)\, \delta_0(\delta_g) + p\, H(\delta_g \mid \theta)$
How should $H$ be chosen?
Advantages:
- Automatically selects the threshold, as opposed to specifying constants as in the non-informative prior model for the differences.
- Interpretable: Bayesian classification can be used to select differentially expressed genes, declaring gene $g$ differentially expressed when $P(g \in H_1 \mid \text{data}) \geq P(g \in H_0 \mid \text{data})$.
- False discovery and non-discovery rates can be estimated (Newton et al. 2004).
6. Considered mixture models
We consider several distributions as the non-zero part $H$ of the mixture distribution for $\delta_g$, all in a fully Bayesian context: the double gamma, the Student t distribution, the conjugate model of Lonnstedt and Speed (2002), and the uniform distribution.
LS model: $H$ is normal, with variance proportional to the variance of the data:
$\delta_g \sim (1 - p)\, \delta_0 + p\, N(0,\ c\, \sigma_g^2), \qquad \sigma_g^2 = \sigma_{g1}^2 / R_1 + \sigma_{g2}^2 / R_2$
Gamma model: $H$ is a double gamma distribution.
T model: $H$ is a Student t distribution:
$\delta_g \sim (1 - p)\, \delta_0 + p\, T(\nu, \mu, \tau)$
Uniform model: $H$ is a uniform distribution:
$\delta_g \sim (1 - p)\, \delta_0 + p\, U(-m_1, m_2)$
where $(-m_1, m_2)$ is a slightly widened range of the observed differences.
Priors on the hyperparameters are either non-informative or weakly informative (G(1, 1) for parameters supported on the positive half-line).
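The four alternatives can be compared by simulation; below is a minimal sketch of drawing $\delta_g$ under each choice of $H$, with the spike handled by the allocation indicator. All shape, scale, degrees-of-freedom and mixing values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
G, p = 1000, 0.2                 # number of genes and mixing weight (assumed)
sigma2 = 0.1                     # plug-in gene variance for the LS model (assumed)
c, nu, mu, tau = 2.0, 4.0, 0.0, 0.5   # assumed hyperparameter values
m1, m2 = 2.0, 2.0                # widened range of observed differences (assumed)

def draw_H(model, n):
    """Draw n values from the non-zero mixture component H."""
    if model == "gamma":         # double gamma: random sign times a gamma draw
        return rng.choice([-1.0, 1.0], n) * rng.gamma(2.0, 0.5, n)
    if model == "t":             # Student t with df nu, location mu, scale tau
        return mu + tau * rng.standard_t(nu, n)
    if model == "ls":            # LS / conjugate: N(0, c * sigma_g^2)
        return rng.normal(0.0, np.sqrt(c * sigma2), n)
    if model == "uniform":       # uniform on (-m1, m2)
        return rng.uniform(-m1, m2, n)

is_de = rng.random(G) < p        # H1 allocation with probability p
delta = np.zeros(G)              # point mass at zero for the H0 genes
delta[is_de] = draw_H("t", int(is_de.sum()))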
7. Simulated data
We compare the performance of the four models on simulated data. For simplicity we consider a one-group model (equivalently, a paired two-group model), and simulate a data set with 1000 variables and 8 replicates.
[Figure: plot of the simulated data set, with variance and difference panels.]
The variance hyperparameters a = 1.5, b = 0.05 are chosen close to the Bayesian estimates obtained for a real data set.
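A short sketch of generating a data set like the one described above (one group, 1000 variables, 8 replicates, variances from IG(1.5, 0.05)); the proportion of differentially expressed genes and the double-gamma alternative are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(2)
G, R, a, b = 1000, 8, 1.5, 0.05        # sizes and variance hyperparameters
p = 0.2                                # assumed proportion of DE genes

sigma2 = 1.0 / rng.gamma(a, 1.0 / b, G)            # sigma_g^2 ~ IG(a, b)
slab = rng.choice([-1.0, 1.0], G) * rng.gamma(2.0, 0.5, G)
delta = np.where(rng.random(G) < p, slab, 0.0)     # spike-and-slab differences
y = rng.normal(delta[:, None], np.sqrt(sigma2)[:, None], (G, R))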
8. Differences
Mixture estimates vs true values:
- The Gamma, T and LS models estimate the differences well.
- The Uniform model shrinks values towards zero.
- Compared with empirical Bayes, the posterior estimates in the fully Bayesian approach do not shrink large values of the differences.
[Figure: posterior mean vs true difference for each of the four models.]
9. Bayesian estimates of variance
- The T and Gamma models have very similar variance estimates.
- The Uniform model produces similar estimates for small values and higher estimates for larger values, compared with the T and Gamma models.
- The LS model shows more perturbation at both higher and lower values than the T and Gamma models.
[Figure: $E(\sigma^2 \mid y)$ vs sample variance for the Uniform, Gamma, LS and T models. Blue: variance estimate based on the Bayesian model with a non-informative prior on the differences.]
Note that the mixture estimate of the variance can be larger than the sample variance.
10. Classification
Differentially expressed genes (200) vs non-differentially expressed genes (800):
- The T, LS and Gamma models perform similarly.
- The Uniform model has a smaller number of false positives, but also a smaller number of true positives: the uniform prior is more conservative.
11. Wrongly classified by the mixture
[Figure: misclassified genes, marked as truly differentially expressed or truly not differentially expressed.]
The classification errors lie on the borderline: there is confusion between the size of the fold change and the biological variability.
12. Another simulation
2628 data points, with many points added on the borderline; classification errors are shown in red.
Can we improve the estimation of the within-condition biological variability?
13. DAG for the mixture model
[Figure: directed acyclic graph of the mixture model, with a plate over genes g = 1, ..., G.]
The variance estimates are influenced by the mixture parameters. Could we instead use only partial information from the replicates to estimate $\sigma_{gs}^2$, and feed it forward into the mixture?
14. Estimation
Estimation of all parameters combines information from the biological replicates and the between-condition contrasts (see the sketch below):
- $s_{gs}^2 = \frac{1}{R_s} \sum_r (y_{gsr} - \bar{y}_{gs\cdot})^2$, $s = 1, 2$: within-condition biological variability.
- $\bar{y}_{gs\cdot} = \frac{1}{R_s} \sum_r y_{gsr}$: average expression over replicates.
- $\frac{1}{2}(\bar{y}_{g1\cdot} + \bar{y}_{g2\cdot})$: average expression over conditions.
- $\frac{1}{2}(\bar{y}_{g1\cdot} - \bar{y}_{g2\cdot})$: between-condition contrast.
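A small numpy sketch of these quantities, for replicate matrices y1 and y2 of shape (G, R_s); the function name and variable names are hypothetical.

import numpy as np

def condition_stats(y1, y2):
    """Per-gene summaries for two conditions, replicates along axis 1."""
    ybar1, ybar2 = y1.mean(axis=1), y2.mean(axis=1)    # averages over replicates
    s2_1 = ((y1 - ybar1[:, None]) ** 2).mean(axis=1)   # within-condition
    s2_2 = ((y2 - ybar2[:, None]) ** 2).mean(axis=1)   # variability (1/R_s sums)
    overall = 0.5 * (ybar1 + ybar2)      # average expression over conditions
    contrast = 0.5 * (ybar1 - ybar2)     # between-condition contrast
    return s2_1, s2_2, overall, contrast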
15. Mixture, full vs partial
The classification is altered for 57 points. (Work in progress.)
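To make the full-vs-partial comparison concrete, here is a minimal sketch of the "cut" idea in a one-group setting with an LS-style normal alternative: the gene variance is adjusted using the replicate information only and then plugged into the classification, instead of being updated jointly with the mixture by MCMC. The function and all parameter values are assumptions, and the plug-in stands in for the full posterior computation.

import numpy as np
from scipy.stats import norm

def classify_with_cut(y, p=0.2, c=2.0, a=1.5, b=0.05):
    """P(g in H1 | data) with the variance feedback cut (plug-in sketch)."""
    G, R = y.shape
    ybar = y.mean(axis=1)
    s2 = y.var(axis=1, ddof=1)
    # Cut: variance estimated from replicates alone, no mixture feedback.
    sig2 = ((R - 1) * s2 + 2 * b) / (R - 1 + 2 * a)
    f0 = norm.pdf(ybar, 0.0, np.sqrt(sig2 / R))              # H0: delta_g = 0
    f1 = norm.pdf(ybar, 0.0, np.sqrt(sig2 * (c + 1.0 / R)))  # H1: N(0, c sig2)
    return p * f1 / (p * f1 + (1.0 - p) * f0)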
16. Difference: cut and no cut
[Figure: genes classified differently under the cut and full models, marked as truly differentially expressed or truly not; panels show the variance, the posterior probability, and the sample standard deviation vs the difference.]
17. Microarray data
[Figure: variance, posterior probability, and pooled sample standard deviation vs sample difference, under the cut and full models.]
Genes classified differently by the full model and by the model with the feedback cut follow a curve.
18. Since the variance is overestimated in the full mixture model compared with the mixture model with the cut, the number of false negatives is lower for the model with the cut than for the full model.
19. LS model: empirical vs fully Bayesian
We compare the Lonnstedt and Speed (LS) model in the fully Bayesian (FB) and empirical Bayes (EB) frameworks, in terms of both the estimated parameters and the classification.
- If the parameter p is specified correctly, the empirical and fully Bayesian models do not differ.
- If the parameter p is misspecified, the estimate of the parameter c changes, which leads to misclassification.
20. Small p (p = 0.01)
[Figure: results with and without the cut.]
21. Bayesian Estimate of FDR
- Step 1: Choose a gene-specific parameter (e.g. $\delta_g$) or a gene statistic.
- Step 2: Model its prior distribution using a mixture model, with one component modelling the unaffected genes (the null hypothesis), e.g. a point mass at 0 for $\delta_g$, and other components modelling the alternative (flexibly).
- Step 3: Calculate the posterior probability of any gene belonging to the unmodified component, $p_{g0} = P(g \in H_0 \mid \text{data})$.
- Step 4: Evaluate the FDR (and FNR) for any gene list, assuming that all the gene classifications are independent (Broët et al. 2004):
$\text{Bayes FDR}(\text{list} \mid \text{data}) = \frac{1}{\text{card}(\text{list})} \sum_{g \in \text{list}} p_{g0}$
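A sketch of Steps 3-4: given a vector of posterior null probabilities $p_{g0}$ (e.g. averaged MCMC allocation indicators), the Bayes FDR of a list, and the FNR of its complement, follow directly; the thresholding rule shown is an assumed example.

import numpy as np

def bayes_fdr_fnr(pg0, cutoff=0.5):
    """Bayes FDR/FNR estimates for the list {g : 1 - p_g0 > cutoff}."""
    in_list = (1.0 - pg0) > cutoff       # declared differentially expressed
    fdr = pg0[in_list].mean() if in_list.any() else 0.0
    fnr = (1.0 - pg0[~in_list]).mean() if (~in_list).any() else 0.0
    return fdr, fnr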
22. Multiple Testing Problem
- Gene lists can be built by separately computing a criterion for each gene and ranking.
- Thousands of genes are considered simultaneously.
- How can the performance of such lists be assessed?
Statistical challenge: select interesting genes without including too many false positives in the gene list. A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set-up. We therefore want an evaluation of the expected false discovery rate (FDR).
23. Bayes rule
[Figure: FDR (black) and FNR (blue) as functions of the posterior probability $P(g \in H_1 \mid \text{data}) = 1 - p_{g0}$.]
The observed and estimated FDR/FNR correspond well.
24. Summary
- Mixture models estimate the differences and hyperparameters well on simulated data.
- The variance is overestimated for some genes.
- The mixture model with the uniform alternative distribution is more conservative in classifying genes than the structured models.
- The Lonnstedt and Speed model performs better in the fully Bayesian framework, because the parameter p is estimated from the data.
- Estimates of the false discovery and non-discovery rates are close to the true values.