1
Statistical Analysis of Gene Expression Data
Sylvia Richardson, Centre for Biostatistics, Imperial College, London
In collaboration with Natalia Bochkina, Anne
Mette Hein, Alex Lewin (St Mary's), Tim Aitman
(Hammersmith), Peter Green (Bristol)
Biological Atlas of Insulin Resistance
www.bgx.org.uk
BBSRC
2
Statistical modelling and biology
  • Extracting the message from microarray data
    needs statistical as well as biological
    understanding
  • Statistical modelling in contrast to data
    analysis gives a framework for formally
    organising assumptions about signal and noise
  • Our models are structured, reflecting data
    generation process
  • Bayesian hierarchical modelling approach
  • Inference based on posterior distribution of
    quantities of interest

3
What are gene expression data?
  • DNA microarrays are used to measure the relative
    abundance of mRNA, providing information on gene
    expression in a particular cell type, under
    specific conditions
  • Gene expression data (e.g. Affymetrix) result
    from the scanning of arrays where hybridisation
    between a sample and a large number of probes has
    taken place
  • gene expression measure for each gene
  • The expression levels of tens of thousands of
    probes are measured on a single
    microarray
  • gene expression profile
  • Typically, gene expression profiles are obtained
    for several samples, in a single experiment
    or in related experiments
  • gene expression data matrix

4
Common characteristics of data sets in
transcriptomics
  • High dimensional data (tens of thousands of genes)
    and few samples
  • Many sources of variability (low signal/noise
    ratio)
  • within/between array variation
  • gene specific variability of the probes for a
    gene (e.g. for Affymetrix)
  • condition/treatment
  • biological
  • array manufacture
  • imaging
  • technical


5
Analysing gene expression data
Gene expression data matrix
  • Gene expression data can be used in several types
    of analysis
  • -- Comparison of gene expression under different
    experimental conditions, or in different tissues
  • -- Building a predictive model for classification
    or prognosis based on gene expression
    measurements
  • -- Exploration of patterns in gene expression
    matrices

[Figure: gene expression data matrix, with axes labelled Samples and Genes (20000); entries are gene expression levels]
6
Common statistical issues
  • Pre-processing and data reduction
  • accounting for the uncertainty of the signal?
  • making arrays comparable: normalisation
  • Realistic assessment of uncertainty
  • Multiplicity: control of error rates
  • Need to borrow information
  • Importance of including prior biological knowledge

Illustrate how structured statistical modelling
can help to tease out signal from noise and
strengthen inference in the context of
differential expression studies
7
Outline
  • Background
  • Modelling uncertainty in the signal
  • Bayesian hierarchical models for differential
    expression experiments
  • posterior predictive checks
  • use of posterior distribution of parameters of
    interest to select genes of interest
  • Further structure: mixture models

8
I Modelling uncertainty in the signal: a fully
Bayesian gene expression index for Affymetrix
GeneChip arrays (Anne Mette Hein)
Data: Affymetrix chip. Each gene g is represented
by a probe set, consisting of a number of probe
pairs (reporters) j: Perfect match (PM) and
Mismatch (MM)
Aim: Formulate a model to combine PM and MM values
into a new expression value for the gene (BGX)
  • Base the model on biological assumptions
  • Combine good features of Li and Wong (dChip)
    and RMA (Robust Multi-array Average,
    Irizarry et al)
  • Use a flexible Bayesian framework that will allow us
  • to get a measure of uncertainty of the
    expression
  • to integrate further components of the
    experimental design

9
Single array model: Motivation
Key observations and conclusions
  • PMs and MMs both increase with spike-in
    concentration (MMs slower than PMs)
    Conclusion: MMs bind fraction of signal
  • Spread of PMs increases with level
    Conclusion: Multiplicative (and additive) error;
    transformation needed
  • Considerable variability in PM (and MM) response
    within a probe set
    Conclusion: Varying reliability in gene expression
    estimation for different genes
  • Probe effects approximately additive on
    log-scale
    Conclusion: Estimate gene expression measure from
    PMs and MMs on log scale

10
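BGX single array model
$PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$, $MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$
  • Background noise, additive: $\tau^2$
  • Gene and probe specific S and H
    ($g = 1, \ldots,$ 1000s; $j = 1, \ldots,$ tens)
  • Non-specific hybridisation, array wide
    distribution, $j = 1, \ldots, J$ (20), $g = 1, \ldots, G$:
    $\log(H_{gj} + 1) \sim TN(\lambda, \eta^2)$
  • Expression measure for gene g is built from
    $S_{gj}$, $j = 1, \ldots, J$ (20):
    $\log(S_{gj} + 1) \sim TN(\mu_g, \sigma_g^2)$
  • BGX expression measure: $\mu_g$
  • Shrinkage (exchangeability):
    $\log(\sigma_g^2) \sim N(a, b^2)$, with a, b set by Emp. Bayes
  • Remaining priors vague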
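To make the structure of the single array model concrete, here is a minimal simulation sketch. It assumes the form PM ~ N(S + H, τ²), MM ~ N(Φ S + H, τ²) with truncated normal priors on log(S + 1) and log(H + 1). All numerical values (`phi`, `tau`, `lam`, `eta`, `a`, `b`, the range of `mu`) are hypothetical choices for illustration, and the function names are not from the BGX software.

```python
import numpy as np

def truncated_normal(rng, mean, sd, size):
    """Draw from N(mean, sd^2) truncated to [0, inf) by simple rejection."""
    out = rng.normal(mean, sd, size)
    bad = out < 0
    while bad.any():
        redraw = rng.normal(mean, sd, size)
        out[bad] = redraw[bad]
        bad = out < 0
    return out

def simulate_bgx_array(n_genes=200, n_probes=20, phi=0.1, tau=10.0,
                       lam=4.0, eta=0.5, a=-1.0, b=0.5, seed=0):
    """Simulate PM and MM intensities for one array (illustrative values only)."""
    rng = np.random.default_rng(seed)
    shape = (n_genes, n_probes)
    # Gene-level log signal mu_g (hypothetical range) and gene-specific spread.
    mu = rng.uniform(2.0, 8.0, size=(n_genes, 1))
    sigma = np.sqrt(np.exp(rng.normal(a, b, size=(n_genes, 1))))  # log(sigma_g^2) ~ N(a, b^2)
    # Specific signal S and non-specific hybridisation H, both on the log(x + 1) scale.
    S = np.exp(truncated_normal(rng, mu, sigma, shape)) - 1.0
    H = np.exp(truncated_normal(rng, lam, eta, shape)) - 1.0
    # Additive background noise with standard deviation tau.
    PM = rng.normal(S + H, tau)
    MM = rng.normal(phi * S + H, tau)
    return mu.ravel(), PM, MM

mu, PM, MM = simulate_bgx_array()
print(PM.shape, MM.shape)
```

Simulating forward like this only illustrates the assumed data generation process; the inference itself (joint MCMC over all parameters) is not reproduced here.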
11
BGX model inference (Hein et al, Biostatistics,
2005)
For each gene g, obtain a distribution for the signal
(log scale)
PM MM
  • Implemented in WinBUGS and C (MCMC)
  • All parameters estimated jointly in full
    Bayesian framework
  • Posterior distributions of parameters (and
    functions) obtained

The single array model can be extended to
estimate signal from several biological
replicates, as well as differential signal
between conditions
12
Single array model: examples of posterior
distributions of BGX indices
Each curve represents a gene
Examples with data: o = $\log(PM_{gj} - MM_{gj})$,
$j = 1, \ldots, J$ (at 0 if not defined)
Mean ± 1 SD
13
Comparison with other expression measures
11 genes spiked in at 13 (increasing)
concentrations
BGX index $\mu_g$ increases with concentration ...
except for gene 7 (incorrectly spiked-in?)
Indication of smooth sustained increase over
a wider range of concentrations
14
95% credibility intervals for Bayesian gene
expression index
11 spike-in genes at 13 different concentrations
Each colour corresponds to a different spike-in
gene. Gene 7: broken red line
Note how the variability is substantially larger
for low expression levels
15
II Modelling differential expression
[Diagram: for each of Condition 1 and Condition 2, start with given point estimates of expression and apply a hierarchical model of replicate variability and array effect; the conditions are linked through the differential expression parameter, handled either by its posterior distribution (flat prior) or by mixture modelling for classification]
16
Data Sets and Biological question
  • Biological Question
  • Understand the mechanisms of insulin resistance
  • Using animal models where key genes are knocked out
  • A) Cd36 knock-out data set (MAS 5): 3 wildtype
    (normal) mice compared with 3 mice with Cd36
    knocked out
  • (≈ 12000 genes on each array)
  • B) IRS2 knock-out data set (RMA): 8 wildtype
    (normal) mice compared with 8 mice with IRS2
    gene knocked out
  • (≈ 22700 genes on each array)

17
Exploratory analysis showing array effect
Mouse data set A
[Figure panels: Condition 1 (3 replicates); Condition 2 (3 replicates)]
Needs normalisation. Spline curves shown.

18
Bayesian hierarchical model for differential
expression (Lewin et al, Biometrics, 2005)
  • Data: $y_{gcr}$ = log gene expression for gene g,
    replicate r, condition c
  • $\alpha_g$ = gene effect
  • $\delta_g$ = differential effect for gene g between
    2 conditions
  • $\beta_{r(g)c}$ = array effect, modelled as a smooth
    (spline) function of $\alpha_g$
  • $\sigma_{gc}^2$ = gene specific variance
  • 1st level: $y_{g1r} \sim N(\alpha_g - \tfrac{1}{2}\delta_g + \beta_{r(g)1}, \sigma_{g1}^2)$
  • $y_{g2r} \sim N(\alpha_g + \tfrac{1}{2}\delta_g + \beta_{r(g)2}, \sigma_{g2}^2)$
  • $\sum_r \beta_{r(g)c} = 0$, $\beta_{r(g)c}$ = function of $\alpha_g$,
    parameters {c, d}
  • 2nd level: Flat priors for $\alpha_g$, $\delta_g$, {c, d}
  • $\sigma_{gc}^2 \sim$ lognormal$(a_c, b_c)$

Exchangeable variances
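As a rough illustration of the first level of this model, the sketch below simulates replicate data without any array effect and computes the posterior of the differential effect conditional on the variances. With flat priors on the gene effect and the differential effect, and the variances treated as known, that conditional posterior is normal with mean ȳ_g2. − ȳ_g1. and variance σ²_g1/R1 + σ²_g2/R2. The function names and numerical values are assumptions for illustration; the full model estimates variances and array effects jointly by MCMC, which is not shown.

```python
import numpy as np

def simulate_de_data(alpha, delta, sigma1, sigma2, n_rep=3, seed=0):
    """Simulate y_g1r ~ N(alpha_g - delta_g/2, sigma1_g^2) and
    y_g2r ~ N(alpha_g + delta_g/2, sigma2_g^2), ignoring array effects."""
    rng = np.random.default_rng(seed)
    size = (alpha.size, n_rep)
    y1 = rng.normal((alpha - 0.5 * delta)[:, None], sigma1[:, None], size=size)
    y2 = rng.normal((alpha + 0.5 * delta)[:, None], sigma2[:, None], size=size)
    return y1, y2

def conditional_posterior_delta(y1, y2, sigma1, sigma2):
    """Posterior mean and variance of delta_g given the variances (flat priors)."""
    mean = y2.mean(axis=1) - y1.mean(axis=1)
    var = sigma1**2 / y1.shape[1] + sigma2**2 / y2.shape[1]
    return mean, var

n_genes = 2000
alpha = np.full(n_genes, 7.0)    # hypothetical gene effects
delta = np.full(n_genes, 1.0)    # hypothetical log fold change
sigma = np.full(n_genes, 0.3)    # hypothetical standard deviations
y1, y2 = simulate_de_data(alpha, delta, sigma, sigma)
post_mean, post_var = conditional_posterior_delta(y1, y2, sigma, sigma)
print(post_mean.mean(), post_var[0])
```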
19
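Directed Acyclic Graph for the differential
expression model (no array effect represented)
[Figure: DAG with nodes $(a_1, b_1)$, $\sigma_{g1}^2$, $s_{g1}^2$, $(a_2, b_2)$, $\sigma_{g2}^2$, $s_{g2}^2$, $\delta_g$, $\tfrac{1}{2}(\bar{y}_{g1.} - \bar{y}_{g2.})$, $\alpha_g$, $\tfrac{1}{2}(\bar{y}_{g1.} + \bar{y}_{g2.})$]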
20
Differential expression model
  • Joint modelling of array effects and differential
    expression
  • Performs normalisation simultaneously with
    estimation
  • Gives fewer false positives

How to check some of the modelling assumptions?
Posterior predictive checks
How to use the posterior distribution of $\delta_g$ to
select genes of interest? Decision rules
21
Bayesian Model Checking
  • Check assumptions on gene variances, e.g.
    exchangeable variances, what distribution?
  • Predict sample variance $s_g^{2,new}$ (a chosen
    checking function) from the model specification
    (not using the data for this)
  • Compare predicted $s_g^{2,new}$ with observed $s_g^{2,obs}$
  • Bayesian p-value: Prob($s_g^{2,new} > s_g^{2,obs}$)
  • Distribution of p-values approx Uniform if model
    is true (Marshall and Spiegelhalter, 2003)
  • Easily implemented in MCMC algorithm
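The check described above can be sketched as follows. For each posterior draw of the variance hyperparameters, a new variance is drawn from the exchangeable prior, a new sample variance is predicted from it, and the result is compared with the observed sample variance. This sketch assumes a lognormal prior parametrised by log-mean `a` and log-standard-deviation `b`, and normal data with `n_rep` replicates; both the parametrisation and the function name are assumptions, not the authors' implementation.

```python
import numpy as np

def predictive_pvalues(s2_obs, a_draws, b_draws, n_rep, seed=0):
    """Bayesian p-values Prob(s2_new > s2_obs), one per gene.

    s2_obs  : observed sample variances, shape (n_genes,)
    a_draws : posterior draws of the lognormal log-mean (assumed parametrisation)
    b_draws : posterior draws of the lognormal log-sd (assumed parametrisation)
    """
    rng = np.random.default_rng(seed)
    s2_obs = np.asarray(s2_obs, dtype=float)
    n_genes = s2_obs.size
    exceed = np.zeros(n_genes)
    for a, b in zip(a_draws, b_draws):
        # New variance from the exchangeable prior, not from the gene's own data.
        sigma2_new = rng.lognormal(mean=a, sigma=b, size=n_genes)
        # Sample variance of n_rep normal replicates: sigma^2 * chi2(n_rep - 1) / (n_rep - 1).
        s2_new = sigma2_new * rng.chisquare(n_rep - 1, size=n_genes) / (n_rep - 1)
        exceed += s2_new > s2_obs
    return exceed / len(a_draws)

# Example: data generated from the assumed model give roughly uniform p-values.
rng = np.random.default_rng(1)
sigma2 = rng.lognormal(-2.0, 0.5, 2000)
s2_obs = sigma2 * rng.chisquare(2, 2000) / 2
p = predictive_pvalues(s2_obs, np.full(200, -2.0), np.full(200, 0.5), n_rep=3)
print(np.histogram(p, bins=5, range=(0, 1))[0])
```

A histogram of these p-values that departs clearly from uniform would suggest that the assumed variance distribution is inadequate.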

22
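Bayesian model checking
[Figure: DAG of the differential expression model, extended with predicted nodes $\sigma_{g1}^{2,new}$ and $s_{g1}^{2,new}$ alongside the observed $s_{g1}^{2,obs}$; other nodes: $(a_1, b_1)$, $\sigma_{g1}^2$, $(a_2, b_2)$, $\sigma_{g2}^2$, $s_{g2}^2$, $\delta_g$, $\tfrac{1}{2}(\bar{y}_{g1.} - \bar{y}_{g2.})$, $\alpha_g$, $\tfrac{1}{2}(\bar{y}_{g1.} + \bar{y}_{g2.})$]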
23
Mouse Data set A
24
Use of tail probabilities for selecting gene lists
  • $\delta_g$: log fold change
  • $t_g = \delta_g / (\sigma_{g1}^2 / n_1 + \sigma_{g2}^2 / n_2)^{1/2}$:
    standardised difference
  • ($n_1$ and $n_2$ replicates in each condition)
  • -- Obtain the posterior distribution of $\delta_g$
    and/or $t_g$
  • -- Compute directly posterior probability of
    genes satisfying criterion X of interest, e.g.
    $\delta_g >$ threshold or $t_g >$ percentile
  • $p_{g,X}$ = Prob(g of interest | Criterion X,
    data)
  • -- Compute the distributions of ranks, ...
Interesting statistical issues on relative merits
and properties of different selection rules based
on tail probabilities
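Given MCMC output, such tail probabilities are simple Monte Carlo averages. The sketch below assumes posterior draws stored as arrays of shape (number of draws, number of genes) and computes the posterior probability that the standardised difference exceeds a threshold. The array layout and function name are assumptions for illustration.

```python
import numpy as np

def tail_probability(delta_draws, sigma2_1_draws, sigma2_2_draws, n1, n2, threshold=2.0):
    """Posterior probability that t_g exceeds `threshold`, one value per gene.

    Each *_draws array has shape (n_draws, n_genes): one row per MCMC draw.
    """
    t_draws = delta_draws / np.sqrt(sigma2_1_draws / n1 + sigma2_2_draws / n2)
    # The fraction of draws above the threshold estimates Prob(t_g > threshold | data).
    return (t_draws > threshold).mean(axis=0)

# Toy example with 4 draws and 2 genes (made-up numbers).
delta_draws = np.array([[3.0, 0.0], [3.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
ones = np.ones_like(delta_draws)
print(tail_probability(delta_draws, ones, ones, n1=2, n2=2))
```

The same pattern applies to any criterion X: evaluate it on each draw and average.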
25
Using the posterior distribution of $t_g$
(standardised difference) (Natalia Bochkina)
  • Compute
  • Probability($t_g > 2$ | data)
  • Bayesian T test
  • Order genes
  • Select genes such that

Data set B
Probability($t_g > 2$ | data) > cut-off (in
blue). By comparison, additional genes selected by
a standard T test with p value < 5% are in red.
26
Credibility intervals for ranks
100 genes with lowest rank (most under/over
expressed)
Low rank, high uncertainty
Low rank, low uncertainty
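Credibility intervals for ranks can be obtained by ranking the genes within each MCMC draw and summarising the resulting rank distribution per gene. The sketch below ranks genes by their differential effect, with rank 1 for the smallest value. The choice of ranking statistic, the array layout and the function name are assumptions, since the transcript does not state how ranks were defined.

```python
import numpy as np

def rank_intervals(delta_draws, level=0.95):
    """Posterior median rank and credibility interval for each gene.

    delta_draws has shape (n_draws, n_genes); rank 1 is the smallest value in a draw.
    """
    # Applying argsort twice converts values to ranks within each row (draw).
    ranks = delta_draws.argsort(axis=1).argsort(axis=1) + 1
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(ranks, [tail, 1.0 - tail], axis=0)
    return np.median(ranks, axis=0), lower, upper

# Toy example: gene 0 is clearly lowest, genes 1 and 2 are hard to separate.
rng = np.random.default_rng(0)
draws = rng.normal(loc=[-10.0, 0.0, 0.1, 10.0], scale=1.0, size=(1000, 4))
median, lower, upper = rank_intervals(draws)
print(median, lower, upper)
```

In the toy example the first gene has a degenerate interval (low rank, low uncertainty), while the two middle genes have intervals spanning both of their possible ranks.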
27
III Mixture and Bayesian estimation of False
Discovery Rates (FDR)
  • Mixture models can be used to perform a model
    based classification
  • Mixture models can be considered at the level of
    the data (e.g. clustering time profiles) or for
    the underlying parameters
  • Mixture models can be used to detect
    differentially expressed genes if a model of the
    alternative is specified
  • One benefit is that an estimate of the
    uncertainty of the classification (the False
    Discovery Rate) is simultaneously obtained

28
Mixture framework for differential expression
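  • $y_{g1r} = \alpha_g - \tfrac{1}{2}\delta_g + \varepsilon_{g1r}$, $r = 1, \ldots, R_1$
  • $y_{g2r} = \alpha_g + \tfrac{1}{2}\delta_g + \varepsilon_{g2r}$, $r = 1, \ldots, R_2$
  • (We assume that the data have been pre-normalised)
  • $\mathrm{Var}(\varepsilon_{gcr}) = \sigma_{gc}^2 \sim IG(a_c, b_c)$
  • $\delta_g \sim p_0 \delta_0 + p_1 G(-x \mid 1.5, \eta_1) + p_2 G(x \mid 1.5, \eta_2)$
  • $H_0$: the point mass $\delta_0$; $H_1$: the two Gamma components
  • Dirichlet distribution for $(p_0, p_1, p_2)$
  • Exp(1) hyper prior for $\eta_1$ and $\eta_2$

Explicit modelling of the alternative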
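To see what this three-component prior implies for the differential effects, the sketch below draws them from a point mass at zero plus a negative and a positive Gamma component with shape 1.5. It treats the second Gamma parameter as a rate and uses fixed mixture weights rather than a Dirichlet draw; both are assumptions made for illustration, as is the function name.

```python
import numpy as np

def draw_mixture_delta(n_genes, weights, eta1, eta2, shape=1.5, seed=0):
    """Draw differential effects from the mixture: zero, -Gamma, or +Gamma.

    weights = (p0, p1, p2); eta1 and eta2 are treated as Gamma rates (assumption).
    """
    rng = np.random.default_rng(seed)
    comp = rng.choice(3, size=n_genes, p=weights)  # 0 = null, 1 = under, 2 = over
    delta = np.zeros(n_genes)
    under, over = comp == 1, comp == 2
    # NumPy's gamma takes a scale parameter, so scale = 1 / rate.
    delta[under] = -rng.gamma(shape, 1.0 / eta1, size=under.sum())
    delta[over] = rng.gamma(shape, 1.0 / eta2, size=over.sum())
    return comp, delta

comp, delta = draw_mixture_delta(10000, (0.9, 0.05, 0.05), eta1=1.0, eta2=1.0)
print((comp == 0).mean(), delta.min(), delta.max())
```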
29
Mixture for classification of DE genes
  • Calculate the posterior probability for any gene
    of belonging to the unmodified component: $p_{g0}$ |
    data
  • Classify using a cut-off on $p_{g0}$
  • i.e. declare gene is DE if $1 - p_{g0} > p_{cut}$
  • Bayes rule corresponds to $p_{cut} = 0.5$
  • Bayesian estimate of FDR (and FNR) for any list
  • (Newton et al 2003, Broët et al 2004)

Bayes FDR(list) | data = $\frac{1}{\mathrm{card(list)}} \sum_{g \in \mathrm{list}} p_{g0}$
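The classification rule and the FDR estimate can be sketched in a few lines. Given posterior null probabilities for each gene, a gene is declared DE when one minus its null probability exceeds the cut-off, and the Bayes FDR of the resulting list is the average null probability over the list. The FNR estimate here is the analogous average over the genes not selected, which is an assumption since the transcript gives only the FDR formula. The function name is illustrative.

```python
import numpy as np

def bayes_fdr(p_null, p_cut=0.5):
    """Classify genes and estimate FDR and FNR from posterior null probabilities.

    p_null[g] is the posterior probability that gene g belongs to the null component.
    Returns (selected, fdr, fnr); a rate is NaN when its list is empty.
    """
    p_null = np.asarray(p_null, dtype=float)
    selected = (1.0 - p_null) > p_cut
    # Bayes FDR: average null probability over the declared-DE list.
    fdr = p_null[selected].mean() if selected.any() else float("nan")
    # Assumed FNR analogue: average non-null probability over the rest.
    fnr = (1.0 - p_null[~selected]).mean() if (~selected).any() else float("nan")
    return selected, fdr, fnr

selected, fdr, fnr = bayes_fdr([0.1, 0.2, 0.9, 0.6], p_cut=0.5)
print(selected, fdr, fnr)
```

Varying `p_cut` and recomputing both rates gives the kind of trade-off curve used to choose a classification rule.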
30
Performance of the mixture prior
  • Joint estimation of all the mixture parameters
    (including p0) using MCMC algorithms avoids
    plugging-in of values that are influential on
    the classification
  • Estimation of all parameters combines
    information from biological replicates and
    between condition contrasts
  • Performance has been tested on simulated data
    sets

31
[Figure: plot of true difference in each case; panels: p0 = 0.99, 25 DE; p0 = 0.95, 125 DE; p0 = 0.9, 250 DE; p0 = 0.8, 500 DE]
32
Examples of simulated data for each case
33
Results averaged over 50 replications
Good estimates of p0 = Prob(null) for each case

Av. p0 0.99
34
Comparison of estimated (dotted lines) and
observed (full) FDR (black) and FNR (red) rates
as cut-off for declaring DE is varied
  • Bayesian mixture
  • good estimates of FDR and FNR
  • easy way to choose efficient classification rule

35
In summary
  • Integrated gene expression analysis
  • Uses the natural hierarchical structure of the
    data (e.g. probes within genes within replicate
    arrays within condition) to synthesize, borrow
    information and provide realistic quantification
    of uncertainty
  • Posterior distributions can be exploited for
    inference with few replicates: choice of decision
    rules
  • Framework where biological prior information,
    e.g. on the structure of the probes or on
    chromosomal location, can be incorporated
  • Model based classification, e.g. through
    mixtures, provides interpretable output and a
    structure to deal with multiplicity

General framework for investigating other
questions
36
Many interesting questions in the analysis of
gene expression data
-- Comparison of gene expression under different
experimental conditions, or in different tissues
-- Integrated gene expression analysis
-- Investigate high dimensional classification
rules (prediction with large number of variables)
and large p small n regression problems
(shrinkage or variable selection)
-- Building a predictive model for classification
or prognosis based on gene expression
measurements, finding signatures
37
Association of gene expression with prognosis
Expression plot of 115 prognostic genes
comprising The Ovarian Cancer Prognostic Profile
Investigate properties of high dimensional
classification rules (prediction with large
number of variables) and large p small n
regression problems (shrinkage or variable
selection)
38
Other questions ...
-- Comparison of gene expression under different
experimental conditions, or in different tissues
-- Building a predictive model for
classification or prognosis based on gene
expression measurements, finding signatures
-- Integrated gene expression analysis
-- Investigate high dimensional classification
rules (prediction with large number of variables)
and large p small n regression problems
(shrinkage or variable selection)
-- Perform unsupervised model based clustering
-- Estimate graphical models
-- Exploration of patterns and association
networks in gene expression matrices
39
Exploration of patterns in gene expression
matrices
-- Comparison of gene expression under different
experimental conditions, or in different tissues
-- Classification of gene expression
profiles and association of gene expression with
other factors, e.g. prognosis (prediction
problem)
-- Perform unsupervised model based clustering (e.g.
semi-parametric using basis functions, mixtures
or DP processes)
Example: development of central nervous systems in rats (9
time points)
40
Thanks
BBSRC Exploiting Genomics grant
Colleagues: Natalia Bochkina, Anne Mette Hein, Alex Lewin
(Imperial College), Peter Green (Bristol
University), Philippe Broët (INSERM, Paris)
Papers and technical reports:
www.bgx.org.uk/