Title: Structured statistical modelling of gene expression data
1Structured statistical modelling of gene
expression data
Windsor, October 2004
- Peter Green (Bristol)
- Sylvia Richardson, Alex Lewin, Anne-Mette Hein
(Imperial) - with Clare Marshall, Natalia Bochkina (Imperial)
- Graeme Ambler (Bristol)
- Tim Aitman and Helen Causton (Hammersmith)
BGX
2Statistical modelling and biology
- Extracting the message from microarray data needs
statistical as well as biological understanding - Statistical modelling in contrast to data
analysis gives a framework for formally
organising assumptions about signal and noise - Our models are structured, reflecting data
generation process highly structured stochastic
systems
3Background and 3 studies
- Hierarchical modelling
- A fully Bayesian gene expression index (BGX)
- Differential expression and array effects
- Two-way clustering
4Part 1
- Hierarchical modelling
- A fully Bayesian gene expression index (BGX)
- Differential expression and array effects
- Two-way clustering
5Gene expression using Affymetrix chips
Zoom Image of Hybridised Array
Hybridised Spot
Single stranded, labeled RNA sample
Oligonucleotide element
20µm
Millions of copies of a specific oligonucleotide
sequence element
Expressed genes
Approx. ½ million different complementary
oligonucleotides
Non-expressed genes
Slide courtesy of Affymetrix
1.28cm
Image of Hybridised Array
6Variation and uncertainty
Gene expression data (e.g. Affymetrix?) is the
result of multiple sources of variability
- condition/treatment
- biological
- array manufacture
- imaging
- technical
- within/between array variation
- gene-specific variability
Structured statistical modelling allows
considering all uncertainty at once
7Costs and benefits of this approach
- Advantages of avoiding plug-in approach
- Uncertainties propagated throughout model
- Realistic estimates of variability
- Avoid bias
- The price you pay computational costs
- Intricate implementation
- Longer run times (but far less than experimental
protocol!)
8Part 2
- Hierarchical modelling
- A fully Bayesian gene expression index (BGX)
- Differential expression and array effects
- Two-way clustering
9A fully Bayesian Gene eXpression indexfor
Affymetrix GeneChip arraysAnne-Mette
HeinSylvia Richardson, Helen Causton, Graeme
Ambler, Peter Green
Gene specific variability (probe)
PM MM
PM MM
PM MM
PM MM
BGX Gene index
10Single array model motivation
Key observations
Conclusions
PMs and MMs both increase with spike-in
concentration (MMs slower than PMs)
MMs bind fraction of signal
Multiplicative (and additive) error
transformation needed
Spread of PMs increase with level
Considerable variability in PM (and MM) response
within a probe set
Varying reliability in gene expression estimation
for different genes
Probe effects approximately additive on
log-scale
Estimate gene expression measure from PMs and MMs
on log scale
11Model assumptions and key biological parameters
- The intensity for the PM measurement for probe
(reporter) j and gene g is due to binding - of labelled fragments that perfectly match the
oligos in the spot (the true signal Sgj) - of labelled fragments that do not perfectly match
these oligos (the non-specific hybridisation Hgj) - The intensity of the corresponding MM measurement
is caused - by a binding fraction F of the true signal Sgj
- by non-specific hybridisation Hgj
12BGX single array model
g1,,G (thousands), j1,,J (11-20)
13Markov chain Monte Carlo (MCMC) computation
- Fitting of Bayesian models hugely facilitated by
advent of these simulation methods - Produce a large sample of values of all unknowns,
? from posterior given data - Easy to set up for hierarchical models
- BUT can be slow to run (for many variables!)
- and can fail to converge reliably
14Sample in place of a distribution - 1D
15Sample in place of a distribution - 2D
16Single array model performance
- Data set varying concentrations (geneLogic)
- 14 samples of cRNA from acute myeloid leukemia
(AML) tumor cell line - In sample k each of 11 genes spiked in at
concentration ck - sample k 1 2 3
4 5 6 7 8 9 10 11
12 13 14 - conc. (pM) 0.0 0.5 0.75 1.0
1.5 2.0 3.0 5.0 12.5 25 50 75
100 150 - Each sample hybridised to an array
- Consider subset consisting of 500 normal genes
11 spike-ins
17Signal expression indices
10 arrays gene 1 spiked-in at increasing
concentrations
Lines 95 credibility intervals for
log(Sgj1) Curves posterior for signal
true signal/ expression index BGX increases
with concentration
18Non-specific hybridisation
10 arrays gene 1 spiked-in at increasing
concentrations
Lines 95 credibility intervals for
log(Hgj1) Curves posterior for signal
Non-specific hybridisation does not increase
with concentration
19Comparison with other expression measures
11 genes spiked in at 13 (increasing)
concentrations
BGX index qg increases with concentration ..
except for gene 7 (incorrectly spiked-in??)
Indication of smooth sustained increase over
a wider range of concentrations
20Single array modelexamples of posterior
distributions of BGX indices
Each curve represents a gene
Examples with data o log(PMgj-MMgj)
j1,,Jg (at 0 if not defined)
Mean ? 1SD
2195 credibility intervals for Bayesian gene
expression index
11 spike-in genes at 13 different concentrations
(data set A)
Each colour corresponds to a different spike-in
gene Gene 7 broken red line
Note how the variability is substantially larger
for low expression level
22Part 3
- Hierarchical modelling
- A fully Bayesian gene expression index (BGX)
- Differential expression and array effects
- Two-way clustering
23Bayesian modelling of differential gene
expression, adjusting for array effectsAlex
LewinSylvia Richardson, Natalia Bochkina,Clare
Marshall, Anne Glazier, Tim Aitman
- The spontaneously hypertensive rat (SHR) A model
of human insulin resistance syndromes. - Deficiency in gene Cd36 found to be associated
with insulin resistance in SHR - Following this, several animal models were
developed where other relevant genes are knocked
out comparison between knocked out and wildtype
(normal) mice or rats.
See poster!
24Data set biological question
Microarray Data Data set A (MAS 5) (? 12000
genes on each array) 3 SHR compared with 3
transgenic rats Data set B (RMA) (? 22700 genes
on each array) 8 wildtype (normal) mice compared
with 8 knocked out mice Biological
Question Find genes which are expressed
differently in wildtype and knockout / transgenic
mice
25Exploratory analysis showing array effect
26Differential expression model
The quantity of interest is the difference
between conditions for each gene dg , g 1,
,N Joint model for the 2 conditions yg1r ?g
- ½ dg ?1r(?g) ?g1r , r 1, R1 yg2r
?g ½ dg ?2r(?g) ?g2r , r 1, R2
where ygcr is log gene expression for gene g,
condition c, replicate r ?g is overall gene
effect ?cr(?) is array effect - a smooth function
of ? ?gcr is normally distributed error, with
gene- and condition- specific variance
27Differential expression model
- Joint modelling of array effects and differential
expression - Performs normalisation simultaneously with
estimation - Gives fewer false positives
- Can work with any desired composite criterion for
identifying interesting genes, e.g. fold change
and overall expression level
28Data set A 3 wildtype mice compared to 3
knockout mice (U74A chip) MAS5
Criterion
Gene is of interest if log fold change gt
log(2) and log (overall expression) gt 4
Plot of log fold change versus overall expression
level
Genes with pg,X gt 0.5 (green) 280 pg,X gt 0.8
(red) 46
The majority of the genes have very small pg,X
90 of genes have pg,X lt 0.2
pg,X 0.49
Genes with low overall expression have a greater
range of fold change than those with higher
expression
29Data set B 8 wildtype mice compared to 8 knockout
mice RMA
Gene is of interest if log fold change gt log
(1.5)
Criterion
Plot of log fold change versus overall expression
level
Genes with pg,X gt 0.5 (green) 292 pg,X gt 0.8
(red) 139
The majority of the genes have very small pg,X
97 of genes have pg,X lt 0.2
30Integrated modelling of Affymetrix data
Condition 2
Condition 1
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe) Gene index BGX
Gene specific variability (probe) Gene index BGX
Hierarchical model of replicate (biological)
variability and array effect
Hierarchical model of replicate (biological)
variability and array effect
Distribution of expression index for gene g ,
condition 1
Distribution of expression index for gene g ,
condition 2
Distribution of differential expression parameter
31Part 4
- Hierarchical modelling
- A fully Bayesian gene expression index (BGX)
- Differential expression and array effects
- Two-way (gene by sample) clustering
32Hierarchical clustering of samples
The gene expression profiles cluster according
to tissue of origin of the samples
A subset of 1161 gene expression profiles,
obtained in 60 different samples
Red more mRNA Green less mRNA in the sample
compared to a reference
Ross et al, Nature Genetics, 2000
33Non-model-based clustering
- Many clustering algorithms have been developed
and used for exploratory purposes - They rely on a measure of distance
(dissimilarity) between gene or sample profiles,
e.g. Euclidean - Hierarchical clustering proceeds in an
agglomerative manner single profiles are joined
to form groups using the distance metric,
recursively - Good visual tool, but many arbitrary choices
care in interpretation!
34Model-based clustering
- Build the cluster structure into the model,
rather than estimating gene effects (say) first,
and post-processing to seek clusters - Bayesian setting allows use of real prior
information where it is exists (biological
understanding of pathways, etc, previous
experiments, )
35Additive ANOVA models for (log-) gene
expression
The simplest model gene sample
ggene
ssample/condition
Under standard conditions, the (least-squares)
estimates of gene effects are
The model generates the method, and in this case
performs a simple form of normalisation
36... bring in mixture modelling
(single sample first!)
ggene
Tg unknown cluster to which gene g belongs This
is a mixture model
37 finally allow clusters to overlap Plaid
model
h denotes a cluster, block or layer
pathway? ?gh 0 or 1 and ?sh 0 or 1
38Plaid model
samples
genes
39An early experiment artificial raw data
Artificial data from a very special case of the
Plaid model single sample s
True H3, b(h)2.2, 3.4 and 4.7, ??N(0,?2) 500
genes, some in each of 238 configurations of ?gh
? 8 overlapping normal clusters
40true H was 3
true b(h) were 2.2, 3.4, 4.7
41Human fibroblast data Lemon et al (2002)
- 18 samples split into 3 categories serum
starved, serum stimulated and a 5050 mix of
starved/stimulated. - We used the natural logarithm of Lemon et al.s
calculated LWF values as our measure of
expression and subtracted gene and sample mean
levels. - We then selected the 100 most variable genes
across all 18 samples and used this 18100 array
as the input to our analysis.
42Bayesian clustering
- Hierarchical model allows us to learn about all
unknowns simultaneously - In particular, this includes complete 2-way
classification, gene by sample, with numerical
uncertainties - We then construct visualisations of interesting
aspects (marginal distributions) of this posterior
43Bayesian clustering samples
44Bayesian clustering genes
45More details, papers and code
- www.stats.bris.ac.uk/BGX/
- www.bgx.org.uk