b presentation | free to download

About This Presentation

Transcript and Presenter's Notes

Title: b

1
Statistical Analysis of Gene Expression Data
Sylvia Richardson Centre for Biostatistics Imperia
l College, London
In collaboration with Natalia Bochkina, Anne
Mette Hein, Alex Lewin (St Marys) Tim Aitman
(Hammersmith) Peter Green (Bristol)
Biological Atlas of Insulin Resistance
www.bgx.org.uk
BBSRC
2
Statistical modelling and biology

Extracting the message from microarray data
needs statistical as well as biological
understanding
Statistical modelling in contrast to data
analysis gives a framework for formally
organising assumptions about signal and noise
Our models are structured, reflecting data
generation process
Bayesian hierarchical modelling approach
Inference based on posterior distribution of
quantities of interest

3
What are gene expression data ?

DNA Microarrays are used to measure the relative
abundance of mRNA, providing information on gene
expression in a particular cell type, under
specific conditions
Gene expression data (e.g. Affymetrix?) results
from the scanning of arrays where hybridisation
between a sample and a large number of probes has
taken place
gene expression measure for each gene
The expression level of ten of thousands of
probes are measured on a single
microarray
gene expression profile
Typically, gene expression profiles are obtained
for several samples, in a single
or related experiments
gene expression data matrix

4
Common characteristics of data sets in
transcriptomic

High dimensional data (ten of thousands of genes)
and few samples
Many sources of variability (low signal/noise
ratio)

within/between array variation
gene specific variability of the probes for a
gene (e.g. for Affymetrix?)

condition/treatment
biological
array manufacture
imaging
technical

5
Analysing gene expression data
Gene expression data matrix

Gene expression data can be used in several types
of analysis
-- Comparison of gene expression under different
experimental conditions, or in different tissues
-- Building a predictive model for classification
or prognosis based on gene expression
measurements
-- Exploration of patterns in gene expression
matrices

Samples
Genes (20000)
Gene expression level
6
Common statistical issues

Pre-processing and data reduction
account for the uncertainty of the signal?
making arrays comparable normalisation
Realistic assessment of uncertainty
Multiplicity control of error rates
Need to borrow information
Importance to include prior biological knowledge

Illustrate how structured statistical modelling
can help to tease out signal from noise and
strengthen inference in the context of
differential expression studies
7
Outline

Background
Modelling uncertainty in the signal
Bayesian hierarchical models for differential
expression experiments
posterior predictive checks
use of posterior distribution of parameters of
interest to select genes of interest
Further structure mixture models

8
I Modelling uncertainty in the signalA fully
Bayesian Gene expression index for Affymetrix
Gene Chip arrays (Anne Mette Hein)
Data Affymetrix chip - Each gene g
is represented by a probe set, consisting of a
number of probe pairs (reporters) j Perfect
match (PM) and Mismatch (MM) Aim Formulate a
model to combine PM and MM values into a new
expression value for the gene -- BGX -
Base the model on biological assumptions
- Combine good features of Li and Wong (dChip)
and RMA (Robust Multichip Analysis,
Irrizarry et al)

Use a flexible Bayesian framework that will allow
to get a measure of uncertainty of the
expression
to integrate further components of the
experimental design

9
Single array model Motivation
Key observations
Conclusions

PMs and MMs both increase with spike-in
concentration (MMs slower than PMs)

MMs bind fraction of signal
Multiplicative (and additive) error
transformation needed

Spread of PMs increase with level

Considerable variability in PM (and MM) response
within a probe set

Varying reliability in gene expression estimation
for different genes
Estimate gene expression measure from PMs and MMs
on log scale

Probe effects approximately additive on
log-scale

10
BGX single array model
PMgj ? N( Sgj Hgj , t2) MMgj ? N(F Sgj Hgj
, t2)
Background noise, additive
Gene and probe specific S and H (g1,,1000s,
j1,,tens)
Non-specific hybridisation array wide
distribution j1,,J (20), g1,,G
Expression measure for gene g is built from
j1,,J (20)
log(Sgj1) ? TN(µg,sg2)
Shrinkage exchangeability
log(Hgj1) ? TN(?, ?2)
log(sg2)?N(a, b2)
BGX expression measure
Remaining priors vague
Emp. Bayes
11
BGX model inference Hein et al, Biostatistics,
2005
For each gene g obtain a distribution for signal
(log scale)
PM MM

Implemented in WinBugs and C (MCMC)
All parameters estimated jointly in full
Bayesian framework
Posterior distributions of parameters (and
functions) obtained

The single array model can be extended to
estimate signal from several biological
replicates, as well as differential signal
between conditions
12
Single array modelexamples of posterior
distributions of BGX indices
Each curve represents a gene
Examples with data o log(PMgj-MMgj)
j1,,J (at 0 if not defined)
Mean ? 1SD
13
Comparison with other expression measures
11 genes spiked in at 13 (increasing)
concentrations
BGX index µg increases with concentration ..
except for gene 7 (incorrectly spiked-in??)
Indication of smooth sustained increase over
a wider range of concentrations
14
95 credibility intervals for Bayesian gene
expression index
11 spike-in genes at 13 different concentrations
Each colour corresponds to a different spike-in
gene Gene 7 broken red line
Note how the variability is substantially larger
for low expression level
15
II Modelling differential expression
Condition 2
Condition 1
Start with given point estimates of expression

Hierarchical model of replicate variability and
array effect
Hierarchical model of replicate variability and
array effect
Posterior distribution (flat prior)
Differential expression parameter
Mixture modelling for classification
16
Data Sets and Biological question

Biological Question
Understand the mechanisms of insulin resistance
Using animal models where key genes are knockout
A) Cd36 Knock out Data set (MAS 5) 3 wildtype
(normal) mice compared with 3 mice with Cd36
knocked out
(? 12000 genes on each array )
B) IRS2 Knock out Data set (RMA) 8 wildtype
(normal) mice compared with 8 mice with IRS2
gene knocked out
(? 22700 genes on each array)

17
Exploratory analysis showing array effect
Mouse data set A
Condition 1 (3 replicates)
Needs normalisation Spline curves shown
Condition 2 (3 replicates)

18
Bayesian hierarchical model for differential
expression (Lewin et al, Biometrics, 2005)

Data ygcr log gene expression gene g,
replicate r, condition c
?g gene effect
dg differential effect for gene g between
2 conditions
?r(g)c array effect modelled as a smooth
(spline) function of ?g
?gc2 gene specific variance
1st level yg1r ? N(?g ½ dg ?r(g)1 ,
?g12)
yg2r ? N(?g ½ dg ?r(g)2 , ?g22)
Sr?r(g)c 0, ?r(g)c function of ?g ,
parameters c,d
2nd level Flat priors for ?g , dg, c,d
?gc2 ? lognormal (ac, bc)

Exchangeable variances
19
Directed Acyclic Graph for the differential
expression model (no array effect represented)
a1, b1
½(yg1.- yg2.)
dg
?2g1
s2g1

?2g2
s2g2
½(yg1. yg2.)
?g
a2, b2
20
Differential expression model

Joint modelling of array effects and differential
expression
Performs normalisation simultaneously with
estimation
Gives fewer false positives

How to check some of the modelling
assumptions? Posterior predictive checks How to
use the posterior distribution of dg to select
genes of interest ? Decision rules
21
Bayesian Model Checking

Check assumptions on gene variances, e.g.
exchangeable variances, what distribution ?
Predict sample variance sg2 new (a chosen
checking function) from the model specification
(not using the data for this)
Compare predicted sg2 new with observed sg2 obs
Bayesian p-value Prob( sg2 new gt sg2 obs )
Distribution of p-values approx Uniform if model
is true (Marshall and Spiegelhalter, 2003)
Easily implemented in MCMC algorithm

22
?2g1
new
Bayesian model checking
a1, b1
s2g1
new
½(yg1.- yg2.)
dg
obs
?2g1
s2g1

?2g2
s2g2
½(yg1. yg2.)
?g
a2, b2
23
Mouse Data set A
24
Use of tail probabilities for selecting gene lists

dg log fold change
tg dg / (s2 g1 / n1 s2 g2 / n2 )½
standardised difference
(n1 and n2 replicates in each condition)
-- Obtain the posterior distribution of dg
and/or tg
-- Compute directly posterior probability of
genes satisfying criterion X of interest, e.g.
dg gt threshold or tg gt percentile
pg,X Prob( g of interest Criterion X,
data)
-- Compute the distributions of ranks, .

Interesting statistical issues on relative merits
and properties of different selection rules based
on tail probabilities
25
Using the posterior distribution of tg
(standardised difference) (Natalia Bochkina)

Compute
Probability ( tg gt 2 data)
Bayesian T test
Order genes
Select genes such that

Data set B
Probability ( tg gt 2 data) gt cut-off ( in
blue) By comparison, additional genes selected by
a standard T test with p value lt 5 are in red)
26
Credibility intervals for ranks
100 genes with lowest rank (most under/ over
expressed)
Low rank, high uncertainty
Low rank, low uncertainty
27
III Mixture and Bayesian estimation of False
Discovery Rates (FDR)

Mixture models can be used to perform a model
based classification
Mixture models can be considered at the level of
the data (e.g. clustering time profiles) or for
the underlying parameters
Mixture models can be used to detect
differentially expressed genes if a model of the
alternative is specified
One benefit is that an estimate of the
uncertainty of the classification the False
Discovery Rate is simultaneously obtained

28
Mixture framework for differential expression

yg1r ?g - ½ dg ?g1r , r 1, R1
yg2r ?g ½ dg ?g2r , r 1, R2
(We assume that the data has been pre normalised)
Var(?gcr ) s2gc IG(ac, bc)
dg p0d0 p1G (-x1.5, ?1) p2G (x1.5, ?2)
H0 H1
Dirichlet distribution for (p0, p1, p2)
Exp(1) hyper prior for ?1 and ?2

Explicit modelling of the alternative
29
Mixture for classification of DE genes

Calculate the posterior probability for any gene
of belonging to the unmodified component pg0
data
Classify using a cut-off on pg0
i.e. declare gene is DE if 1- pg0 gt pcut
Bayes rule corresponds to pcut 0.5
Bayesian estimate of FDR (and FNR) for any list
(Newton et al 2003, Broët et al 2004)

Bayes FDR (list) data 1/card(list) Sg ? list
pg0
30
Performance of the mixture prior

Joint estimation of all the mixture parameters
(including p0) using MCMC algorithms avoids
plugging-in of values that are influential on
the classification
Estimation of all parameters combines
information from biological replicates and
between condition contrasts
Performance has been tested on simulated data
sets

31
p0 0.9, 250 DE
p0 0.8, 500 DE
Plot of true difference in each case
p0 0.99, 25 DE
p0 0.95, 125 DE
p0 0.80, 500 DE
32
Examples of simulated data for each case
33
Results averaged over 50 replications
Good estimates of p0 Prob(null) for each case

Av. p0 0.99
34
Comparison of estimated (dotted lines) and
observed (full) FDR (black) and FNR (red) rates
as cut-off for declaring DE is varied

Bayesian mixture
good estimates of
FDR and FNR
easy way to
choose efficient
classification rule

35
In summary

Integrated gene expression analysis
Uses the natural hierarchical structure of the
data e.g. probes within genes within replicate
arrays within condition to synthesize, borrow
information and provide realistic quantification
of uncertainty
Posterior distributions can be exploited for
inference with few replicates choice of decision
rules
Framework where biological prior information,
e.g. on the structure of the probes or on
chromosomic location, can be incorporated
Model based classification, e.g. through
mixtures, provides interpretable output and a
structure to deal with multiplicity

General framework for investigating other
questions
36
Many interesting questions in the analysis of
gene expression data
-- Comparison of gene expression under different
experimental conditions, or in different tissues
-- Integrated gene expression analysis
-- Investigate high dimensional classification
rules (prediction with large number of variables)
and large p small n regression problems
(shrinkage or variable selection)
-- Building a predictive model for classification
or prognosis based on gene expression
measurements, finding signatures
37
Association of gene expression with prognosis
Expression plot of 115 prognostic genes
comprising The Ovarian Cancer Prognostic Profile
Investigate properties of high dimensional
classification rules (prediction with large
number of variables) and large p small n
regression problems (shrinkage or variable
selection)
38
Other questions .
-- Comparison of gene expression under different
experimental conditions, or in different
tissues -- Building a predictive model for
classification or prognosis based on gene
expression measurements, finding signatures
-- Integrated gene expression analysis
-- Investigate high dimensional classification
rules (prediction with large number of variables)
and large p small n regression problems
(shrinkage or variable selection)
-- Perform unsupervised model based clustering --
Estimate graphical models
-- Exploration of patterns and association
networks in gene expression matrices
39
Exploration of patterns in gene expression
matrices
-- Comparison of gene expression under different
experimental conditions, or in different
tissues -- Classification of gene expression
profiles and association of gene expression with
other factors, e.g. prognosis (prediction
problem)
Perform unsupervised model based clustering (e.g.
semi-parametric using basis functions, mixtures
or DP processes)
Development of central nervous systems in rats (9
time points)
40
Thanks
BBSRC Exploiting Genomics grant Colleagues
Natalia Bochkina, Anne Mette Hein, Alex Lewin
(Imperial College) Peter Green (Bristol
University) Philippe Broët (INSERM,
Paris) Papers and technical reports
www.bgx.org.uk/

Write a Comment

User Comments (0)

About PowerShow.com

b PowerPoint PPT Presentation