Structured statistical modelling of gene expression data - PowerPoint PPT Presentation

About This Presentation

Title:

Structured statistical modelling of gene expression data

Description:

Fitting of Bayesian models hugely facilitated by advent of these simulation methods ... standard conditions, the (least-squares) estimates of gene effects are ... – PowerPoint PPT presentation

Number of Views:54

Avg rating:3.0/5.0

Slides: 43

Provided by: gre6174

Category:

more less

Transcript and Presenter's Notes

Title: Structured statistical modelling of gene expression data

1
Structured statistical modelling of gene
expression data
Windsor, October 2004

Peter Green (Bristol)
Sylvia Richardson, Alex Lewin, Anne-Mette Hein
(Imperial)
with Clare Marshall, Natalia Bochkina (Imperial)
Graeme Ambler (Bristol)
Tim Aitman and Helen Causton (Hammersmith)

BGX
2
Statistical modelling and biology

Extracting the message from microarray data needs
statistical as well as biological understanding
Statistical modelling in contrast to data
analysis gives a framework for formally
organising assumptions about signal and noise
Our models are structured, reflecting data
generation process highly structured stochastic
systems

3
Background and 3 studies

Hierarchical modelling
A fully Bayesian gene expression index (BGX)
Differential expression and array effects
Two-way clustering

4
Part 1

Hierarchical modelling
A fully Bayesian gene expression index (BGX)
Differential expression and array effects
Two-way clustering

5
Gene expression using Affymetrix chips
Zoom Image of Hybridised Array
Hybridised Spot
Single stranded, labeled RNA sample
Oligonucleotide element
20µm
Millions of copies of a specific oligonucleotide
sequence element
Expressed genes
Approx. ½ million different complementary
oligonucleotides
Non-expressed genes
Slide courtesy of Affymetrix
1.28cm
Image of Hybridised Array
6
Variation and uncertainty
Gene expression data (e.g. Affymetrix?) is the
result of multiple sources of variability

condition/treatment
biological
array manufacture
imaging
technical

within/between array variation
gene-specific variability

Structured statistical modelling allows
considering all uncertainty at once
7
Costs and benefits of this approach

Advantages of avoiding plug-in approach
Uncertainties propagated throughout model
Realistic estimates of variability
Avoid bias

The price you pay computational costs
Intricate implementation
Longer run times (but far less than experimental
protocol!)

8
Part 2

Hierarchical modelling
A fully Bayesian gene expression index (BGX)
Differential expression and array effects
Two-way clustering

9
A fully Bayesian Gene eXpression indexfor
Affymetrix GeneChip arraysAnne-Mette
HeinSylvia Richardson, Helen Causton, Graeme
Ambler, Peter Green
Gene specific variability (probe)
PM MM
PM MM
PM MM
PM MM
BGX Gene index
10
Single array model motivation
Key observations
Conclusions
PMs and MMs both increase with spike-in
concentration (MMs slower than PMs)
MMs bind fraction of signal
Multiplicative (and additive) error
transformation needed
Spread of PMs increase with level
Considerable variability in PM (and MM) response
within a probe set
Varying reliability in gene expression estimation
for different genes
Probe effects approximately additive on
log-scale
Estimate gene expression measure from PMs and MMs
on log scale
11
Model assumptions and key biological parameters

The intensity for the PM measurement for probe
(reporter) j and gene g is due to binding
of labelled fragments that perfectly match the
oligos in the spot (the true signal Sgj)
of labelled fragments that do not perfectly match
these oligos (the non-specific hybridisation Hgj)
The intensity of the corresponding MM measurement
is caused
by a binding fraction F of the true signal Sgj
by non-specific hybridisation Hgj

12
BGX single array model
g1,,G (thousands), j1,,J (11-20)
13
Markov chain Monte Carlo (MCMC) computation

Fitting of Bayesian models hugely facilitated by
advent of these simulation methods
Produce a large sample of values of all unknowns,
? from posterior given data
Easy to set up for hierarchical models
BUT can be slow to run (for many variables!)
and can fail to converge reliably

14
Sample in place of a distribution - 1D
15
Sample in place of a distribution - 2D
16
Single array model performance

Data set varying concentrations (geneLogic)
14 samples of cRNA from acute myeloid leukemia
(AML) tumor cell line
In sample k each of 11 genes spiked in at
concentration ck
sample k 1 2 3
4 5 6 7 8 9 10 11
12 13 14
conc. (pM) 0.0 0.5 0.75 1.0
1.5 2.0 3.0 5.0 12.5 25 50 75
100 150
Each sample hybridised to an array
Consider subset consisting of 500 normal genes
11 spike-ins

17
Signal expression indices
10 arrays gene 1 spiked-in at increasing
concentrations
Lines 95 credibility intervals for
log(Sgj1) Curves posterior for signal
true signal/ expression index BGX increases
with concentration
18
Non-specific hybridisation
10 arrays gene 1 spiked-in at increasing
concentrations
Lines 95 credibility intervals for
log(Hgj1) Curves posterior for signal
Non-specific hybridisation does not increase
with concentration
19
Comparison with other expression measures
11 genes spiked in at 13 (increasing)
concentrations
BGX index qg increases with concentration ..
except for gene 7 (incorrectly spiked-in??)
Indication of smooth sustained increase over
a wider range of concentrations
20
Single array modelexamples of posterior
distributions of BGX indices
Each curve represents a gene
Examples with data o log(PMgj-MMgj)
j1,,Jg (at 0 if not defined)
Mean ? 1SD
21
95 credibility intervals for Bayesian gene
expression index
11 spike-in genes at 13 different concentrations
(data set A)
Each colour corresponds to a different spike-in
gene Gene 7 broken red line
Note how the variability is substantially larger
for low expression level
22
Part 3

Hierarchical modelling
A fully Bayesian gene expression index (BGX)
Differential expression and array effects
Two-way clustering

23
Bayesian modelling of differential gene
expression, adjusting for array effectsAlex
LewinSylvia Richardson, Natalia Bochkina,Clare
Marshall, Anne Glazier, Tim Aitman

The spontaneously hypertensive rat (SHR) A model
of human insulin resistance syndromes.
Deficiency in gene Cd36 found to be associated
with insulin resistance in SHR
Following this, several animal models were
developed where other relevant genes are knocked
out comparison between knocked out and wildtype
(normal) mice or rats.

See poster!
24
Data set biological question
Microarray Data Data set A (MAS 5) (? 12000
genes on each array) 3 SHR compared with 3
transgenic rats Data set B (RMA) (? 22700 genes
on each array) 8 wildtype (normal) mice compared
with 8 knocked out mice Biological
Question Find genes which are expressed
differently in wildtype and knockout / transgenic
mice
25
Exploratory analysis showing array effect
26
Differential expression model
The quantity of interest is the difference
between conditions for each gene dg , g 1,
,N Joint model for the 2 conditions yg1r ?g
- ½ dg ?1r(?g) ?g1r , r 1, R1 yg2r
?g ½ dg ?2r(?g) ?g2r , r 1, R2
where ygcr is log gene expression for gene g,
condition c, replicate r ?g is overall gene
effect ?cr(?) is array effect - a smooth function
of ? ?gcr is normally distributed error, with
gene- and condition- specific variance
27
Differential expression model

Joint modelling of array effects and differential
expression
Performs normalisation simultaneously with
estimation
Gives fewer false positives
Can work with any desired composite criterion for
identifying interesting genes, e.g. fold change
and overall expression level

28
Data set A 3 wildtype mice compared to 3
knockout mice (U74A chip) MAS5
Criterion
Gene is of interest if log fold change gt
log(2) and log (overall expression) gt 4
Plot of log fold change versus overall expression
level
Genes with pg,X gt 0.5 (green) 280 pg,X gt 0.8
(red) 46
The majority of the genes have very small pg,X
90 of genes have pg,X lt 0.2
pg,X 0.49
Genes with low overall expression have a greater
range of fold change than those with higher
expression
29
Data set B 8 wildtype mice compared to 8 knockout
mice RMA
Gene is of interest if log fold change gt log
(1.5)
Criterion
Plot of log fold change versus overall expression
level
Genes with pg,X gt 0.5 (green) 292 pg,X gt 0.8
(red) 139
The majority of the genes have very small pg,X
97 of genes have pg,X lt 0.2
30
Integrated modelling of Affymetrix data
Condition 2
Condition 1
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
PM MM
Gene specific variability (probe) Gene index BGX
Gene specific variability (probe) Gene index BGX
Hierarchical model of replicate (biological)
variability and array effect
Hierarchical model of replicate (biological)
variability and array effect
Distribution of expression index for gene g ,
condition 1
Distribution of expression index for gene g ,
condition 2
Distribution of differential expression parameter
31
Part 4

Hierarchical modelling
A fully Bayesian gene expression index (BGX)
Differential expression and array effects
Two-way (gene by sample) clustering

32
Hierarchical clustering of samples
The gene expression profiles cluster according
to tissue of origin of the samples
A subset of 1161 gene expression profiles,
obtained in 60 different samples

Red more mRNA Green less mRNA in the sample
compared to a reference
Ross et al, Nature Genetics, 2000
33
Non-model-based clustering

Many clustering algorithms have been developed
and used for exploratory purposes
They rely on a measure of distance
(dissimilarity) between gene or sample profiles,
e.g. Euclidean
Hierarchical clustering proceeds in an
agglomerative manner single profiles are joined
to form groups using the distance metric,
recursively
Good visual tool, but many arbitrary choices
care in interpretation!

34
Model-based clustering

Build the cluster structure into the model,
rather than estimating gene effects (say) first,
and post-processing to seek clusters
Bayesian setting allows use of real prior
information where it is exists (biological
understanding of pathways, etc, previous
experiments, )

35
Additive ANOVA models for (log-) gene
expression
The simplest model gene sample
ggene
ssample/condition
Under standard conditions, the (least-squares)
estimates of gene effects are
The model generates the method, and in this case
performs a simple form of normalisation
36
... bring in mixture modelling
(single sample first!)
ggene
Tg unknown cluster to which gene g belongs This
is a mixture model
37
finally allow clusters to overlap Plaid
model
h denotes a cluster, block or layer
pathway? ?gh 0 or 1 and ?sh 0 or 1
38
Plaid model
samples
genes
39
An early experiment artificial raw data
Artificial data from a very special case of the
Plaid model single sample s
True H3, b(h)2.2, 3.4 and 4.7, ??N(0,?2) 500
genes, some in each of 238 configurations of ?gh
? 8 overlapping normal clusters
40
true H was 3
true b(h) were 2.2, 3.4, 4.7
41
Human fibroblast data Lemon et al (2002)

18 samples split into 3 categories serum
starved, serum stimulated and a 5050 mix of
starved/stimulated.
We used the natural logarithm of Lemon et al.s
calculated LWF values as our measure of
expression and subtracted gene and sample mean
levels.
We then selected the 100 most variable genes
across all 18 samples and used this 18100 array
as the input to our analysis.

42
Bayesian clustering

Hierarchical model allows us to learn about all
unknowns simultaneously
In particular, this includes complete 2-way
classification, gene by sample, with numerical
uncertainties
We then construct visualisations of interesting
aspects (marginal distributions) of this posterior

43
Bayesian clustering samples
44
Bayesian clustering genes
45
More details, papers and code