Structured statistical modelling of gene expression data

1
Structured statistical modelling of gene
expression data
Windsor, October 2004
  • Peter Green (Bristol)
  • Sylvia Richardson, Alex Lewin, Anne-Mette Hein
    (Imperial)
  • with Clare Marshall, Natalia Bochkina (Imperial)
  • Graeme Ambler (Bristol)
  • Tim Aitman and Helen Causton (Hammersmith)

BGX
2
Statistical modelling and biology
  • Extracting the message from microarray data needs
    statistical as well as biological understanding
  • Statistical modelling, in contrast to data
    analysis, gives a framework for formally
    organising assumptions about signal and noise
  • Our models are structured, reflecting the data
    generation process: highly structured stochastic
    systems

3
Background and 3 studies
  • Hierarchical modelling
  • A fully Bayesian gene expression index (BGX)
  • Differential expression and array effects
  • Two-way clustering

4
Part 1
  • Hierarchical modelling
  • A fully Bayesian gene expression index (BGX)
  • Differential expression and array effects
  • Two-way clustering

5
Gene expression using Affymetrix chips
[Figure, courtesy of Affymetrix: image of a hybridised array (1.28 cm),
zoomed in to a single hybridised spot (20 µm). Each oligonucleotide
element holds millions of copies of a specific oligonucleotide sequence;
the array carries approx. ½ million different complementary
oligonucleotides. A single-stranded, labelled RNA sample is hybridised
to it, distinguishing expressed from non-expressed genes.]
6
Variation and uncertainty
Gene expression data (e.g. Affymetrix) are the
result of multiple sources of variability
  • condition/treatment
  • biological
  • array manufacture
  • imaging
  • technical
  • within/between array variation
  • gene-specific variability

Structured statistical modelling allows all sources
of uncertainty to be considered at once
7
Costs and benefits of this approach
  • Advantages of avoiding plug-in approach
      • Uncertainties propagated throughout model
      • Realistic estimates of variability
      • Avoid bias
  • The price you pay: computational costs
      • Intricate implementation
      • Longer run times (but far less than experimental
        protocol!)

8
Part 2
  • Hierarchical modelling
  • A fully Bayesian gene expression index (BGX)
  • Differential expression and array effects
  • Two-way clustering

9
A fully Bayesian Gene eXpression index for
Affymetrix GeneChip arrays
Anne-Mette Hein
Sylvia Richardson, Helen Causton, Graeme Ambler, Peter Green
[Diagram: PM/MM probe pairs, with gene-specific variability (probe),
feeding into the BGX gene index]
10
Single array model: motivation
Key observations, each followed by the conclusion drawn from it
  • PMs and MMs both increase with spike-in
    concentration (MMs slower than PMs)
    → MMs bind a fraction of the signal
  • Spread of PMs increases with level
    → Multiplicative (and additive) error;
    transformation needed
  • Considerable variability in PM (and MM) response
    within a probe set
    → Varying reliability in gene expression estimation
    for different genes
  • Probe effects approximately additive on
    log-scale
    → Estimate gene expression measure from PMs and MMs
    on log scale
11
Model assumptions and key biological parameters
  • The intensity for the PM measurement for probe
    (reporter) j and gene g is due to binding
      • of labelled fragments that perfectly match the
        oligos in the spot (the true signal Sgj)
      • of labelled fragments that do not perfectly match
        these oligos (the non-specific hybridisation Hgj)
  • The intensity of the corresponding MM measurement
    is caused
      • by a binding fraction F of the true signal Sgj
      • by non-specific hybridisation Hgj
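
The two assumptions above can be illustrated with a small simulation. This is only a sketch: the additive form PM = S + H + noise and MM = F·S + H + noise follows the bullets, but the log-normal distributions for S and H, the noise level, and all numerical values are assumptions made for illustration, as is the function name `simulate_probe_set`.

```python
import math
import random

def simulate_probe_set(n_probes, mu_signal, frac, seed=0):
    """Simulate (PM, MM) pairs for one gene under the stated assumptions."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_probes):
        s = math.exp(rng.gauss(mu_signal, 0.5))  # true signal S_gj (assumed log-normal)
        h = math.exp(rng.gauss(3.0, 0.3))        # non-specific hybridisation H_gj (assumed log-normal)
        pm = s + h + rng.gauss(0.0, 1.0)         # PM: full signal plus non-specific binding
        mm = frac * s + h + rng.gauss(0.0, 1.0)  # MM: fraction F of signal plus non-specific binding
        pairs.append((pm, mm))
    return pairs

pairs = simulate_probe_set(n_probes=16, mu_signal=6.0, frac=0.1, seed=1)
mean_diff = sum(pm - mm for pm, mm in pairs) / len(pairs)
```

With a strongly expressed gene, as here, PM exceeds MM for every probe, and PM − MM is roughly (1 − F) times the signal.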

12
BGX single array model
g = 1, …, G (thousands), j = 1, …, J (11-20)
13
Markov chain Monte Carlo (MCMC) computation
  • Fitting of Bayesian models hugely facilitated by
    advent of these simulation methods
  • Produce a large sample of values of all unknowns,
    drawn from their posterior distribution given the data
  • Easy to set up for hierarchical models
  • BUT can be slow to run (for many variables!)
  • and can fail to converge reliably
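
As a minimal sketch of the idea, here is a random-walk Metropolis sampler, one of the simplest MCMC methods. It is not the sampler used for BGX; the toy target (normal data with known standard deviation 1 and a N(0, 10²) prior on the mean), the data values, and the tuning settings are all assumptions chosen for illustration.

```python
import math
import random

def metropolis(log_post, start, n_iter, step, seed=0):
    """Random-walk Metropolis: returns a list of n_iter draws."""
    rng = random.Random(seed)
    x, lp = start, log_post(start)
    draws = []
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)
        lp_prop = log_post(proposal)
        # Accept with probability min(1, posterior ratio); 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            x, lp = proposal, lp_prop
        draws.append(x)
    return draws

# Toy data (assumed), modelled as normal with known sd 1.
data = [4.8, 5.1, 5.3, 4.9, 5.4]

def log_post(m):
    log_prior = -0.5 * (m / 10.0) ** 2                 # N(0, 10^2) prior on the mean
    log_lik = -0.5 * sum((y - m) ** 2 for y in data)   # normal likelihood, sd 1
    return log_prior + log_lik

draws = metropolis(log_post, start=0.0, n_iter=20000, step=1.0, seed=1)
kept = draws[2000:]                  # discard burn-in
post_mean = sum(kept) / len(kept)    # Monte Carlo estimate of the posterior mean
```

The retained draws stand in for the posterior distribution: any summary (mean, interval, probability of an event) is computed from the sample.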

14
Sample in place of a distribution - 1D
15
Sample in place of a distribution - 2D
16
Single array model: performance
  • Data set: varying concentrations (GeneLogic)
  • 14 samples of cRNA from acute myeloid leukemia
    (AML) tumor cell line
  • In sample k, each of 11 genes spiked in at
    concentration ck:

    sample k     1     2     3     4     5     6     7     8     9     10    11    12    13    14
    conc. (pM)   0.0   0.5   0.75  1.0   1.5   2.0   3.0   5.0   12.5  25    50    75    100   150

  • Each sample hybridised to an array
  • Consider subset consisting of 500 normal genes
    + 11 spike-ins

17
Signal expression indices
10 arrays: gene 1 spiked in at increasing
concentrations
Lines: 95% credibility intervals for
log(Sgj + 1). Curves: posterior for signal
True signal / expression index BGX increases
with concentration
18
Non-specific hybridisation
10 arrays: gene 1 spiked in at increasing
concentrations
Lines: 95% credibility intervals for
log(Hgj + 1). Curves: posterior for signal
Non-specific hybridisation does not increase
with concentration
19
Comparison with other expression measures
11 genes spiked in at 13 (increasing)
concentrations
BGX index θg increases with concentration,
except for gene 7 (incorrectly spiked-in??)
Indication of a smooth, sustained increase over
a wider range of concentrations
20
Single array model: examples of posterior
distributions of BGX indices
Each curve represents a gene
Examples with data: o = log(PMgj - MMgj),
j = 1, …, Jg (plotted at 0 if not defined)
Mean ± 1 SD
21
95% credibility intervals for Bayesian gene
expression index
11 spike-in genes at 13 different concentrations
(data set A)
Each colour corresponds to a different spike-in
gene. Gene 7: broken red line
Note how the variability is substantially larger
at low expression levels
22
Part 3
  • Hierarchical modelling
  • A fully Bayesian gene expression index (BGX)
  • Differential expression and array effects
  • Two-way clustering

23
Bayesian modelling of differential gene
expression, adjusting for array effects
Alex Lewin
Sylvia Richardson, Natalia Bochkina, Clare
Marshall, Anne Glazier, Tim Aitman
  • The spontaneously hypertensive rat (SHR): a model
    of human insulin resistance syndromes.
  • Deficiency in gene Cd36 found to be associated
    with insulin resistance in SHR
  • Following this, several animal models were
    developed where other relevant genes are knocked
    out: comparison between knocked out and wildtype
    (normal) mice or rats.

See poster!
24
Data set and biological question
Microarray data
  • Data set A (MAS 5) (approx. 12000 genes on each
    array): 3 SHR compared with 3 transgenic rats
  • Data set B (RMA) (approx. 22700 genes on each
    array): 8 wildtype (normal) mice compared with
    8 knocked out mice
Biological question
  • Find genes which are expressed differently in
    wildtype and knockout / transgenic mice
25
Exploratory analysis showing array effect
26
Differential expression model
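The quantity of interest is the difference
between conditions for each gene, $\delta_g$, $g = 1, \dots, N$
Joint model for the 2 conditions:
$$y_{g1r} = \alpha_g - \tfrac{1}{2}\delta_g + \beta_{1r}(\alpha_g) + \varepsilon_{g1r}, \quad r = 1, \dots, R_1$$
$$y_{g2r} = \alpha_g + \tfrac{1}{2}\delta_g + \beta_{2r}(\alpha_g) + \varepsilon_{g2r}, \quad r = 1, \dots, R_2$$
where $y_{gcr}$ is log gene expression for gene g,
condition c, replicate r; $\alpha_g$ is the overall gene
effect; $\beta_{cr}(\alpha)$ is the array effect, a smooth function
of $\alpha$; $\varepsilon_{gcr}$ is normally distributed error, with
gene- and condition-specific variance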
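
To see why array effects matter, the sketch below simulates one gene from a model of this form and compares a naive difference of condition means with a difference taken after removing the array effects. Everything numerical here is assumed for illustration: the names `alpha` (overall gene effect) and `delta` (difference between conditions), the linear array effect `drift`, the noise level, and the use of the true `alpha` when adjusting (in the real model it is unknown and estimated jointly).

```python
import random
import statistics

def simulate_gene(alpha, delta, effects_1, effects_2, sd, rng):
    """Log expression for one gene: one value per replicate array in each condition."""
    y1 = [alpha - 0.5 * delta + b(alpha) + rng.gauss(0.0, sd) for b in effects_1]
    y2 = [alpha + 0.5 * delta + b(alpha) + rng.gauss(0.0, sd) for b in effects_2]
    return y1, y2

def no_effect(a):
    return 0.0              # arrays with no expression-dependent bias (assumed)

def drift(a):
    return 0.1 * (a - 6.0)  # hypothetical smooth array effect

rng = random.Random(0)
alpha, delta = 9.0, 0.5
effects_1 = [no_effect] * 3  # R1 = 3 replicates
effects_2 = [drift] * 3      # R2 = 3 replicates
y1, y2 = simulate_gene(alpha, delta, effects_1, effects_2, sd=0.05, rng=rng)

# Naive estimate: ignores array effects, so it absorbs their difference.
naive = statistics.mean(y2) - statistics.mean(y1)

# Adjusted estimate: subtract each array's effect before averaging.
adj_1 = [y - b(alpha) for y, b in zip(y1, effects_1)]
adj_2 = [y - b(alpha) for y, b in zip(y2, effects_2)]
adjusted = statistics.mean(adj_2) - statistics.mean(adj_1)
```

In this set-up the naive estimate is shifted by the difference in array effects (0.3 on the log scale), while the adjusted estimate centres on the true value 0.5.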
27
Differential expression model
  • Joint modelling of array effects and differential
    expression
  • Performs normalisation simultaneously with
    estimation
  • Gives fewer false positives
  • Can work with any desired composite criterion for
    identifying interesting genes, e.g. fold change
    and overall expression level

28
Data set A: 3 wildtype mice compared to 3
knockout mice (U74A chip), MAS5
Criterion
Gene is of interest if log fold change >
log(2) and log(overall expression) > 4
Plot of log fold change versus overall expression
level
Genes with pg,X > 0.5 (green): 280; pg,X > 0.8
(red): 46
The majority of the genes have very small pg,X:
90% of genes have pg,X < 0.2
Plot label: pg,X = 0.49
Genes with low overall expression have a greater
range of fold change than those with higher
expression
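
Given posterior draws from an MCMC run, a probability such as pg,X can be estimated as the fraction of draws in which gene g satisfies criterion X. The sketch below assumes this reading of pg,X, and assumes that the log fold change and log overall expression are available as paired draws per gene; the function name and the five example draws are made up for illustration.

```python
import math

def criterion_probability(log_fc_draws, log_expr_draws, fold=2.0, min_log_expr=4.0):
    """Fraction of posterior draws meeting the composite criterion."""
    hits = sum(
        1
        for fc, ex in zip(log_fc_draws, log_expr_draws)
        if fc > math.log(fold) and ex > min_log_expr
    )
    return hits / len(log_fc_draws)

# Hypothetical posterior draws for a single gene.
log_fc_draws = [0.9, 0.5, 0.8, 0.75, 0.6]
log_expr_draws = [5.0, 5.2, 3.9, 4.5, 4.8]
p_gx = criterion_probability(log_fc_draws, log_expr_draws)
```

Because the criterion is evaluated draw by draw, any composite rule can be swapped in without refitting the model.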
29
Data set B: 8 wildtype mice compared to 8 knockout
mice, RMA
Criterion
Gene is of interest if log fold change > log(1.5)
Plot of log fold change versus overall expression
level
Genes with pg,X > 0.5 (green): 292; pg,X > 0.8
(red): 139
The majority of the genes have very small pg,X:
97% of genes have pg,X < 0.2
30
Integrated modelling of Affymetrix data
[Diagram: for each of condition 1 and condition 2, PM/MM probe pairs
with gene-specific variability (probe) feed the gene index BGX; a
hierarchical model of replicate (biological) variability and array
effect gives the distribution of the expression index for gene g in
that condition; the two conditions combine into the distribution of
the differential expression parameter]
31
Part 4
  • Hierarchical modelling
  • A fully Bayesian gene expression index (BGX)
  • Differential expression and array effects
  • Two-way (gene by sample) clustering

32
Hierarchical clustering of samples
The gene expression profiles cluster according
to tissue of origin of the samples
A subset of 1161 gene expression profiles,
obtained in 60 different samples

Red: more mRNA; green: less mRNA in the sample
compared to a reference
Ross et al, Nature Genetics, 2000
33
Non-model-based clustering
  • Many clustering algorithms have been developed
    and used for exploratory purposes
  • They rely on a measure of distance
    (dissimilarity) between gene or sample profiles,
    e.g. Euclidean
  • Hierarchical clustering proceeds in an
    agglomerative manner: single profiles are joined
    to form groups using the distance metric,
    recursively
  • Good visual tool, but many arbitrary choices:
    take care in interpretation!
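
The agglomerative procedure described above can be sketched in a few lines. Single linkage and Euclidean distance are used here as one of the "arbitrary choices"; the function names and the toy profiles are assumptions for illustration, and real analyses would normally use a library implementation.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(profiles, n_clusters):
    """Single-linkage agglomerative clustering; returns clusters as lists of indices."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest minimum pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(profiles[p], profiles[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Toy profiles: three near the origin, two near (5, 5).
profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.3]]
result = agglomerate(profiles, n_clusters=2)
```

Changing the linkage rule or the distance measure can change the groups found, which is the reason for caution in interpretation.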

34
Model-based clustering
  • Build the cluster structure into the model,
    rather than estimating gene effects (say) first,
    and post-processing to seek clusters
  • Bayesian setting allows use of real prior
    information where it exists (biological
    understanding of pathways, etc., previous
    experiments, ...)

35
Additive ANOVA models for (log-) gene
expression
The simplest model: gene + sample
g = gene
s = sample/condition
Under standard conditions, the (least-squares)
estimates of gene effects are
[equation shown on slide]
The model generates the method, and in this case
performs a simple form of normalisation
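
For a complete, balanced genes-by-samples table, the least-squares fit of an additive gene + sample model reduces to row and column means. The sketch below assumes one common parametrisation (an overall mean plus gene and sample effects that each sum to zero); the slide's own parametrisation is not preserved in this transcript, and the function name and toy matrix are illustrative.

```python
def additive_fit(y):
    """y[g][s] is log expression. Returns (overall mean, gene effects, sample effects)."""
    n_genes, n_samples = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (n_genes * n_samples)
    # Gene effect: gene mean minus overall mean.
    gene = [sum(row) / n_samples - grand for row in y]
    # Sample effect: sample mean minus overall mean.
    sample = [
        sum(y[g][s] for g in range(n_genes)) / n_genes - grand
        for s in range(n_samples)
    ]
    return grand, gene, sample

y = [[1.0, 2.0, 3.0],
     [3.0, 4.0, 5.0]]
grand, gene, sample = additive_fit(y)
```

Subtracting the fitted sample effects from each column is the simple normalisation referred to above: every sample is shifted to a common average level.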
36
... bring in mixture modelling
(single sample first!)
g = gene
Tg = unknown cluster to which gene g belongs
This is a mixture model
37
... finally allow clusters to overlap: the Plaid
model
h denotes a cluster, block or layer
(pathway?); ρgh = 0 or 1 and κsh = 0 or 1
38
Plaid model
[Figure: plaid model layout, with axes genes and samples]
39
An early experiment: artificial raw data
Artificial data from a very special case of the
Plaid model: single sample s
True H = 3; b(h) = 2.2, 3.4 and 4.7; ε ~ N(0, σ²)
500 genes, some in each of 2³ = 8 configurations
of ρgh ⇒ 8 overlapping normal clusters
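
A sketch of how such artificial data could be generated: each gene carries a 0/1 membership indicator for each of the H = 3 layers, and its mean is the sum of b(h) over the layers it belongs to. The noise standard deviation (0.5) and the way genes are spread across the 8 configurations (cycling through them) are assumptions, since the transcript does not give them.

```python
import itertools
import random

def simulate_single_sample_plaid(n_genes, b, sigma, seed=0):
    """Return (configurations, data), where data holds (membership, value) per gene."""
    rng = random.Random(seed)
    # All 2^H patterns of layer membership, e.g. (0, 1, 1).
    configs = list(itertools.product([0, 1], repeat=len(b)))
    data = []
    for g in range(n_genes):
        membership = configs[g % len(configs)]  # assumed allocation: cycle through patterns
        mean = sum(m * bh for m, bh in zip(membership, b))
        data.append((membership, mean + rng.gauss(0.0, sigma)))
    return configs, data

b = [2.2, 3.4, 4.7]
configs, data = simulate_single_sample_plaid(n_genes=500, b=b, sigma=0.5, seed=1)
# Distinct cluster means implied by the 8 membership patterns.
cluster_means = sorted({round(sum(m * bh for m, bh in zip(c, b)), 1) for c in configs})
```

The eight cluster means are all distinct, but with normal noise the clusters overlap, which is what makes recovering H and b(h) a non-trivial test.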
40
true H was 3
true b(h) were 2.2, 3.4, 4.7
41
Human fibroblast data: Lemon et al. (2002)
  • 18 samples split into 3 categories: serum
    starved, serum stimulated and a 50:50 mix of
    starved/stimulated.
  • We used the natural logarithm of Lemon et al.'s
    calculated LWF values as our measure of
    expression and subtracted gene and sample mean
    levels.
  • We then selected the 100 most variable genes
    across all 18 samples and used this 18 × 100 array
    as the input to our analysis.
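
The preprocessing steps above can be sketched as follows. This is one reading of the description, not the authors' code: it takes logs, subtracts gene means and then sample means, and ranks genes by their variance across samples after centring. The function name, the genes-by-samples orientation, and the toy input are assumptions.

```python
import math

def preprocess(values, n_keep):
    """values[g][s] > 0. Returns (kept gene indices, centred log rows for those genes)."""
    logged = [[math.log(v) for v in row] for row in values]
    n_samples = len(logged[0])
    # Subtract each gene's mean across samples.
    rows = [[x - sum(row) / n_samples for x in row] for row in logged]
    # Subtract each sample's mean across genes.
    col_means = [sum(row[s] for row in rows) / len(rows) for s in range(n_samples)]
    rows = [[x - m for x, m in zip(row, col_means)] for row in rows]
    # Rank genes by variance across samples (row means are zero after centring).
    variance = [sum(x * x for x in row) / n_samples for row in rows]
    order = sorted(range(len(rows)), key=lambda g: variance[g], reverse=True)
    kept = sorted(order[:n_keep])
    return kept, [rows[g] for g in kept]

# Toy input: 3 genes by 4 samples, built from chosen log values.
logs = [[0.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 2.0], [0.0, 0.2, 0.0, 0.2]]
values = [[math.exp(x) for x in row] for row in logs]
kept, centred = preprocess(values, n_keep=1)
```

For the fibroblast data the call would use n_keep=100 on the full table, and the result would be transposed to give the 18 × 100 samples-by-genes array.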

42
Bayesian clustering
  • Hierarchical model allows us to learn about all
    unknowns simultaneously
  • In particular, this includes complete 2-way
    classification, gene by sample, with numerical
    uncertainties
  • We then construct visualisations of interesting
    aspects (marginal distributions) of this posterior

43
Bayesian clustering: samples
44
Bayesian clustering: genes
45
More details, papers and code
  • www.stats.bris.ac.uk/BGX/
  • www.bgx.org.uk