Title: b
1Statistical Analysis of Gene Expression Data
Sylvia Richardson Centre for Biostatistics Imperia
l College, London
In collaboration with Natalia Bochkina, Anne
Mette Hein, Alex Lewin (St Marys) Tim Aitman
(Hammersmith) Peter Green (Bristol)
Biological Atlas of Insulin Resistance
www.bgx.org.uk
BBSRC
2Statistical modelling and biology
- Extracting the message from microarray data
needs statistical as well as biological
understanding - Statistical modelling in contrast to data
analysis gives a framework for formally
organising assumptions about signal and noise - Our models are structured, reflecting data
generation process - Bayesian hierarchical modelling approach
- Inference based on posterior distribution of
quantities of interest
3What are gene expression data ?
- DNA Microarrays are used to measure the relative
abundance of mRNA, providing information on gene
expression in a particular cell type, under
specific conditions - Gene expression data (e.g. Affymetrix?) results
from the scanning of arrays where hybridisation
between a sample and a large number of probes has
taken place - gene expression measure for each gene
- The expression level of ten of thousands of
probes are measured on a single
microarray - gene expression profile
- Typically, gene expression profiles are obtained
for several samples, in a single
or related experiments - gene expression data matrix
4Common characteristics of data sets in
transcriptomic
- High dimensional data (ten of thousands of genes)
and few samples - Many sources of variability (low signal/noise
ratio)
- within/between array variation
- gene specific variability of the probes for a
gene (e.g. for Affymetrix?)
- condition/treatment
- biological
- array manufacture
- imaging
- technical
5Analysing gene expression data
Gene expression data matrix
- Gene expression data can be used in several types
of analysis - -- Comparison of gene expression under different
experimental conditions, or in different tissues - -- Building a predictive model for classification
or prognosis based on gene expression
measurements - -- Exploration of patterns in gene expression
matrices
Samples
Genes (20000)
Gene expression level
6Common statistical issues
- Pre-processing and data reduction
- account for the uncertainty of the signal?
- making arrays comparable normalisation
- Realistic assessment of uncertainty
- Multiplicity control of error rates
- Need to borrow information
- Importance to include prior biological knowledge
Illustrate how structured statistical modelling
can help to tease out signal from noise and
strengthen inference in the context of
differential expression studies
7Outline
- Background
- Modelling uncertainty in the signal
- Bayesian hierarchical models for differential
expression experiments - posterior predictive checks
- use of posterior distribution of parameters of
interest to select genes of interest - Further structure mixture models
8I Modelling uncertainty in the signalA fully
Bayesian Gene expression index for Affymetrix
Gene Chip arrays (Anne Mette Hein)
Data Affymetrix chip - Each gene g
is represented by a probe set, consisting of a
number of probe pairs (reporters) j Perfect
match (PM) and Mismatch (MM) Aim Formulate a
model to combine PM and MM values into a new
expression value for the gene -- BGX -
Base the model on biological assumptions
- Combine good features of Li and Wong (dChip)
and RMA (Robust Multichip Analysis,
Irrizarry et al)
- Use a flexible Bayesian framework that will allow
- to get a measure of uncertainty of the
expression - to integrate further components of the
experimental design
9Single array model Motivation
Key observations
Conclusions
- PMs and MMs both increase with spike-in
concentration (MMs slower than PMs)
MMs bind fraction of signal
Multiplicative (and additive) error
transformation needed
- Spread of PMs increase with level
- Considerable variability in PM (and MM) response
within a probe set
Varying reliability in gene expression estimation
for different genes
Estimate gene expression measure from PMs and MMs
on log scale
- Probe effects approximately additive on
log-scale
10BGX single array model
PMgj ? N( Sgj Hgj , t2) MMgj ? N(F Sgj Hgj
, t2)
Background noise, additive
Gene and probe specific S and H (g1,,1000s,
j1,,tens)
Non-specific hybridisation array wide
distribution j1,,J (20), g1,,G
Expression measure for gene g is built from
j1,,J (20)
log(Sgj1) ? TN(µg,sg2)
Shrinkage exchangeability
log(Hgj1) ? TN(?, ?2)
log(sg2)?N(a, b2)
BGX expression measure
Remaining priors vague
Emp. Bayes
11BGX model inference Hein et al, Biostatistics,
2005
For each gene g obtain a distribution for signal
(log scale)
PM MM
- Implemented in WinBugs and C (MCMC)
- All parameters estimated jointly in full
Bayesian framework - Posterior distributions of parameters (and
functions) obtained
The single array model can be extended to
estimate signal from several biological
replicates, as well as differential signal
between conditions
12Single array modelexamples of posterior
distributions of BGX indices
Each curve represents a gene
Examples with data o log(PMgj-MMgj)
j1,,J (at 0 if not defined)
Mean ? 1SD
13Comparison with other expression measures
11 genes spiked in at 13 (increasing)
concentrations
BGX index µg increases with concentration ..
except for gene 7 (incorrectly spiked-in??)
Indication of smooth sustained increase over
a wider range of concentrations
1495 credibility intervals for Bayesian gene
expression index
11 spike-in genes at 13 different concentrations
Each colour corresponds to a different spike-in
gene Gene 7 broken red line
Note how the variability is substantially larger
for low expression level
15II Modelling differential expression
Condition 2
Condition 1
Start with given point estimates of expression
Hierarchical model of replicate variability and
array effect
Hierarchical model of replicate variability and
array effect
Posterior distribution (flat prior)
Differential expression parameter
Mixture modelling for classification
16Data Sets and Biological question
- Biological Question
- Understand the mechanisms of insulin resistance
- Using animal models where key genes are knockout
- A) Cd36 Knock out Data set (MAS 5) 3 wildtype
(normal) mice compared with 3 mice with Cd36
knocked out - (? 12000 genes on each array )
- B) IRS2 Knock out Data set (RMA) 8 wildtype
(normal) mice compared with 8 mice with IRS2
gene knocked out - (? 22700 genes on each array)
17Exploratory analysis showing array effect
Mouse data set A
Condition 1 (3 replicates)
Needs normalisation Spline curves shown
Condition 2 (3 replicates)
18Bayesian hierarchical model for differential
expression (Lewin et al, Biometrics, 2005)
- Data ygcr log gene expression gene g,
replicate r, condition c - ?g gene effect
- dg differential effect for gene g between
2 conditions - ?r(g)c array effect modelled as a smooth
(spline) function of ?g - ?gc2 gene specific variance
- 1st level yg1r ? N(?g ½ dg ?r(g)1 ,
?g12) - yg2r ? N(?g ½ dg ?r(g)2 , ?g22)
- Sr?r(g)c 0, ?r(g)c function of ?g ,
parameters c,d -
- 2nd level Flat priors for ?g , dg, c,d
- ?gc2 ? lognormal (ac, bc)
Exchangeable variances
19Directed Acyclic Graph for the differential
expression model (no array effect represented)
a1, b1
½(yg1.- yg2.)
dg
?2g1
s2g1
?2g2
s2g2
½(yg1. yg2.)
?g
a2, b2
20Differential expression model
- Joint modelling of array effects and differential
expression - Performs normalisation simultaneously with
estimation - Gives fewer false positives
How to check some of the modelling
assumptions? Posterior predictive checks How to
use the posterior distribution of dg to select
genes of interest ? Decision rules
21Bayesian Model Checking
- Check assumptions on gene variances, e.g.
exchangeable variances, what distribution ? - Predict sample variance sg2 new (a chosen
checking function) from the model specification
(not using the data for this) - Compare predicted sg2 new with observed sg2 obs
- Bayesian p-value Prob( sg2 new gt sg2 obs )
- Distribution of p-values approx Uniform if model
is true (Marshall and Spiegelhalter, 2003) - Easily implemented in MCMC algorithm
22?2g1
new
Bayesian model checking
a1, b1
s2g1
new
½(yg1.- yg2.)
dg
obs
?2g1
s2g1
?2g2
s2g2
½(yg1. yg2.)
?g
a2, b2
23Mouse Data set A
24Use of tail probabilities for selecting gene lists
- dg log fold change
- tg dg / (s2 g1 / n1 s2 g2 / n2 )½
standardised difference - (n1 and n2 replicates in each condition)
- -- Obtain the posterior distribution of dg
and/or tg - -- Compute directly posterior probability of
genes satisfying criterion X of interest, e.g.
dg gt threshold or tg gt percentile - pg,X Prob( g of interest Criterion X,
data) - -- Compute the distributions of ranks, .
Interesting statistical issues on relative merits
and properties of different selection rules based
on tail probabilities
25Using the posterior distribution of tg
(standardised difference) (Natalia Bochkina)
- Compute
- Probability ( tg gt 2 data)
- Bayesian T test
- Order genes
- Select genes such that
Data set B
Probability ( tg gt 2 data) gt cut-off ( in
blue) By comparison, additional genes selected by
a standard T test with p value lt 5 are in red)
26Credibility intervals for ranks
100 genes with lowest rank (most under/ over
expressed)
Low rank, high uncertainty
Low rank, low uncertainty
27III Mixture and Bayesian estimation of False
Discovery Rates (FDR)
- Mixture models can be used to perform a model
based classification - Mixture models can be considered at the level of
the data (e.g. clustering time profiles) or for
the underlying parameters - Mixture models can be used to detect
differentially expressed genes if a model of the
alternative is specified - One benefit is that an estimate of the
uncertainty of the classification the False
Discovery Rate is simultaneously obtained
28Mixture framework for differential expression
- yg1r ?g - ½ dg ?g1r , r 1, R1
- yg2r ?g ½ dg ?g2r , r 1, R2
- (We assume that the data has been pre normalised)
- Var(?gcr ) s2gc IG(ac, bc)
- dg p0d0 p1G (-x1.5, ?1) p2G (x1.5, ?2)
- H0 H1
- Dirichlet distribution for (p0, p1, p2)
- Exp(1) hyper prior for ?1 and ?2
Explicit modelling of the alternative
29Mixture for classification of DE genes
- Calculate the posterior probability for any gene
of belonging to the unmodified component pg0
data - Classify using a cut-off on pg0
- i.e. declare gene is DE if 1- pg0 gt pcut
- Bayes rule corresponds to pcut 0.5
- Bayesian estimate of FDR (and FNR) for any list
- (Newton et al 2003, Broët et al 2004)
Bayes FDR (list) data 1/card(list) Sg ? list
pg0
30Performance of the mixture prior
- Joint estimation of all the mixture parameters
(including p0) using MCMC algorithms avoids
plugging-in of values that are influential on
the classification - Estimation of all parameters combines
information from biological replicates and
between condition contrasts - Performance has been tested on simulated data
sets
31p0 0.9, 250 DE
p0 0.8, 500 DE
Plot of true difference in each case
p0 0.99, 25 DE
p0 0.95, 125 DE
p0 0.80, 500 DE
32Examples of simulated data for each case
33Results averaged over 50 replications
Good estimates of p0 Prob(null) for each case
Av. p0 0.99
34Comparison of estimated (dotted lines) and
observed (full) FDR (black) and FNR (red) rates
as cut-off for declaring DE is varied
- Bayesian mixture
- good estimates of
- FDR and FNR
- easy way to
- choose efficient
- classification rule
35In summary
- Integrated gene expression analysis
- Uses the natural hierarchical structure of the
data e.g. probes within genes within replicate
arrays within condition to synthesize, borrow
information and provide realistic quantification
of uncertainty - Posterior distributions can be exploited for
inference with few replicates choice of decision
rules - Framework where biological prior information,
e.g. on the structure of the probes or on
chromosomic location, can be incorporated - Model based classification, e.g. through
mixtures, provides interpretable output and a
structure to deal with multiplicity
General framework for investigating other
questions
36Many interesting questions in the analysis of
gene expression data
-- Comparison of gene expression under different
experimental conditions, or in different tissues
-- Integrated gene expression analysis
-- Investigate high dimensional classification
rules (prediction with large number of variables)
and large p small n regression problems
(shrinkage or variable selection)
-- Building a predictive model for classification
or prognosis based on gene expression
measurements, finding signatures
37Association of gene expression with prognosis
Expression plot of 115 prognostic genes
comprising The Ovarian Cancer Prognostic Profile
Investigate properties of high dimensional
classification rules (prediction with large
number of variables) and large p small n
regression problems (shrinkage or variable
selection)
38Other questions .
-- Comparison of gene expression under different
experimental conditions, or in different
tissues -- Building a predictive model for
classification or prognosis based on gene
expression measurements, finding signatures
-- Integrated gene expression analysis
-- Investigate high dimensional classification
rules (prediction with large number of variables)
and large p small n regression problems
(shrinkage or variable selection)
-- Perform unsupervised model based clustering --
Estimate graphical models
-- Exploration of patterns and association
networks in gene expression matrices
39Exploration of patterns in gene expression
matrices
-- Comparison of gene expression under different
experimental conditions, or in different
tissues -- Classification of gene expression
profiles and association of gene expression with
other factors, e.g. prognosis (prediction
problem)
Perform unsupervised model based clustering (e.g.
semi-parametric using basis functions, mixtures
or DP processes)
Development of central nervous systems in rats (9
time points)
40Thanks
BBSRC Exploiting Genomics grant Colleagues
Natalia Bochkina, Anne Mette Hein, Alex Lewin
(Imperial College) Peter Green (Bristol
University) Philippe Broët (INSERM,
Paris) Papers and technical reports
www.bgx.org.uk/