Title: An Adaptive Empirical Bayesian Thresholding Procedure for Analysing Microarray Experiments
1An Adaptive Empirical Bayesian Thresholding Procedure for Analysing Microarray Experiments
- Rebecca E. Walls, Stuart Barber, Mark S. Gilthorpe, John T. Kent
- rebecca_at_maths.leeds.ac.uk
- University of Leeds
- PG seminar series, 6th December 2006
2Outline
- An introduction to gene expression and microarrays (this will be very brief!)
- Empirical Bayesian methodology
- Results
3Gene expression
- Gene expression: the process in which a cell transfers the coded information stored in its DNA into proteins
- Gene expression is regulated: genes only express at the right time and in the right cell, and respond to environmental stimuli
- Interesting questions to try and answer...
- Which genes are expressed in which tissues?
- How is the expression of a gene influenced by external stimuli?
- What patterns of gene expression cause a disease or lead to disease progression?
- What patterns of gene expression influence response to treatment?
4Philosophy behind microarrays
- Microarrays allow the comparison of gene expression between multiple samples for many thousands of genes simultaneously
- Previously we described gene expression as a process: how can we measure a process?
- Proteins are notoriously hard to measure accurately
- Instead, we measure the abundance of the intermediary molecule, mRNA
5Notation
- Suppose we observe intensity measurements $T_{ijk}$ (treatment) and $C_{ijk}$ (control), where
- $i = 1, \ldots, n$ genes
- $j = 1, \ldots, m$ chips
- $k = 1, \ldots, r$ replicate spots
- For which genes $i$ are $T_{ijk}$ and $C_{ijk}$ significantly different in intensity level?
- For those significant genes, can we estimate the level of differential expression?
6Statistical challenges
- Data generation
- Noise
- Background
- Experimental variability (chip, samples, lab)
- Intensities not well distributed
- t-test becomes invalid!
- Data structure
- Many observations (from 100 to 20k genes per chip, replicated)
- Multiple testing (if 10,000 genes are tested, we could incur as many as 500 false positives at the 5% level!)
- Lack of independence (20k genes -> 1M proteins!)
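The multiple-testing figure is simple arithmetic: if none of the 10,000 genes were differentially expressed and each were tested at the 5% level, the expected number of false positives would be 10,000 x 0.05 = 500. A minimal sketch of that calculation follows; the function name is hypothetical, and the all-null setting is an assumption made here for illustration.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of false positives when every null hypothesis is true
    and each test is carried out at significance level alpha.
    (Assumption: all genes are null; the name is illustrative only.)"""
    return n_tests * alpha


# The slide's example: 10,000 genes tested at the 5% level.
n_false = expected_false_positives(10_000, 0.05)
print(f"Expected false positives: {n_false:.0f}")
```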
7Contrast variables
- We assume that $T_{ijk}$ and $C_{ijk}$ have been suitably transformed and normalised so that they have constant variance (Huber et al.); from here on, $T_{ijk}$ and $C_{ijk}$ denote the adjusted intensities (on a logarithmic scale)
- Define the contrast variable
- $X_{ijk} = T_{ijk} - C_{ijk}$
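- We analyse the sequence of gene-level mean contrasts $\bar{X}_{i}$, where $\bar{X}_{i} = \frac{1}{mr} \sum_{j=1}^{m} \sum_{k=1}^{r} X_{ijk}$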
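As a small illustration of the contrast step, the sketch below forms $X_{ijk} = T_{ijk} - C_{ijk}$ from nested lists indexed by gene, chip and spot, and then reduces each gene to a single value. Averaging over all chips and spots is an assumption made here (the summary used on the slide did not survive extraction), and the function names and toy numbers are purely illustrative.

```python
from statistics import fmean


def contrasts(T, C):
    """X[i][j][k] = T[i][j][k] - C[i][j][k] for gene i, chip j, spot k."""
    return [
        [[t - c for t, c in zip(t_chip, c_chip)]
         for t_chip, c_chip in zip(t_gene, c_gene)]
        for t_gene, c_gene in zip(T, C)
    ]


def gene_means(X):
    """Average each gene's contrasts over all chips and replicate spots.
    (Assumption: this is the per-gene summary that gets thresholded.)"""
    return [fmean(x for chip in gene for x in chip) for gene in X]


# Toy data: 2 genes, 2 chips, 2 replicate spots (log-scale intensities).
T = [[[8.0, 8.2], [7.9, 8.1]], [[6.0, 6.1], [5.9, 6.0]]]
C = [[[6.0, 6.2], [5.9, 6.1]], [[6.0, 6.1], [5.9, 6.0]]]
means = gene_means(contrasts(T, C))
print(means)
```

In this toy example the first gene is shifted between treatment and control while the second is not, so only the first gene's mean contrast is far from zero.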
8Sparse sequences
- Assume most genes are not differentially expressed, so the sequence of true effects $\mu_i$ will be sparse:
- (0, 0, 0, -3.1, 0, 0, 0, 3.7, -2.6, 0, 0, 2.1)
- Add normally distributed noise with mean 0 and some variance $\sigma^2$:
- (2.2, -0.1, -1.4, -5.4, -2.6, 2.2, -1.1, 1.0, -1.8, 1.2, 1.0, -0.1)
- We adapt the EBayesThresh methodology of Johnstone and Silverman (2002), originally designed for thresholding wavelet coefficients
9Empirical Bayesian methodology
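- Suppose we have an observation $Z$ which can be written in the form $Z = \mu + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$
- The prior on $\mu$ is a mixture of $\delta_0(\mu)$, a point mass at zero, and $\gamma(\mu; a)$, a heavy-tailed Laplace distribution, in proportions according to the mixing weight $\omega$, where $0 \le \omega \le 1$.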
10Empirical Bayesian methodology
- Same model and prior as on the previous slide; the sequence of µ values is sparse for small $\omega$.
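To make the mixture prior concrete, the sketch below draws values of µ that are exactly zero with probability $1 - \omega$ and Laplace-distributed otherwise, then adds normal noise. It assumes the convention of Johnstone and Silverman, in which $\omega$ is the weight on the non-zero component and the Laplace density is $\frac{a}{2} e^{-a|\mu|}$; the function names, seed and parameter values are illustrative only.

```python
import random


def sample_prior(n, w, a, rng):
    """Draw n values of mu: 0 with probability 1 - w, otherwise Laplace
    with rate a (assumed density (a / 2) * exp(-a * |mu|))."""
    draws = []
    for _ in range(n):
        if rng.random() < w:
            # Laplace draw: exponential magnitude with a random sign.
            magnitude = rng.expovariate(a)
            draws.append(magnitude if rng.random() < 0.5 else -magnitude)
        else:
            draws.append(0.0)
    return draws


def add_noise(mu, sigma, rng):
    """Z_i = mu_i + eps_i with eps_i ~ N(0, sigma^2)."""
    return [m + rng.gauss(0.0, sigma) for m in mu]


rng = random.Random(2006)
mu = sample_prior(12, w=0.3, a=0.5, rng=rng)
z = add_noise(mu, sigma=1.0, rng=rng)
print([round(m, 1) for m in mu])
print([round(v, 1) for v in z])
```

With a small weight most entries of `mu` are exactly zero, which is the sparsity the method is designed to exploit.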
11Posterior distribution for µ
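- The posterior distribution for $\mu$ given $Z = z$ is a mixture distribution with a point mass at zero and is given by
- $f_{\mathrm{post}}(\mu \mid Z = z) = \{1 - \omega_{\mathrm{post}}(z)\}\, \delta_0(\mu) + \omega_{\mathrm{post}}(z)\, f_1(\mu \mid z)$
- where the posterior weight $\omega_{\mathrm{post}}(z)$ and the density $f_1(\mu \mid z)$ of the non-zero part can be calculated explicitly.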
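For reference, the explicit forms can be written down under two assumptions made here: unit noise variance ($\sigma = 1$) and a Laplace density $\gamma(\mu; a) = \frac{a}{2} e^{-a|\mu|}$ carrying weight $\omega$ in the prior. The labels $g$, $\omega_{\mathrm{post}}$, $f_1$ and $D$ follow the notation of Johnstone and Silverman rather than anything recoverable from the slide; $\phi$ and $\Phi$ are the standard normal density and distribution function.

```latex
\begin{align*}
g(z) &= \int \gamma(\mu; a)\,\phi(z - \mu)\,d\mu
      = \frac{a}{2}\, e^{a^2/2} \left\{ e^{-az}\,\Phi(z - a) + e^{az}\,\bigl(1 - \Phi(z + a)\bigr) \right\}, \\[4pt]
\omega_{\mathrm{post}}(z) &= \frac{\omega\, g(z)}{\omega\, g(z) + (1 - \omega)\,\phi(z)}, \\[4pt]
f_1(\mu \mid z) &=
  \begin{cases}
    e^{-az}\,\phi(\mu - z + a) / D(z), & \mu > 0, \\
    e^{az}\,\phi(\mu - z - a) / D(z),  & \mu \le 0,
  \end{cases}
\qquad
D(z) = e^{-az}\,\Phi(z - a) + e^{az}\,\bigl(1 - \Phi(z + a)\bigr).
\end{align*}
```

The marginal $g$ comes from completing the square in the exponent separately for $\mu > 0$ and $\mu \le 0$, and $D(z)$ is the constant that makes $f_1$ integrate to one.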
12Estimating µ
- Estimate $\mu$ by the posterior median $\hat{\mu}(z; \omega)$
- For a fixed $\omega$, $\hat{\mu}(z; \omega)$ is a monotonic function of $z$ with a thresholding property: an observation $Z$ yields a non-zero estimate of $\mu$ only if $|Z|$ exceeds some threshold $t(\omega)$
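The thresholding behaviour of the posterior median can be checked numerically. The sketch below assumes unit noise variance, a prior of the form (1 - w) x (point mass at zero) + w x Laplace(a), and the closed-form posterior median for that prior given by Johnstone and Silverman; `w` stands for the mixing weight, and all function names are illustrative. The threshold is located by bisection, which relies on the monotonicity noted above.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: pdf, cdf and inv_cdf


def laplace_marginal(z, a):
    """Density g(z) of Z = mu + eps, mu ~ Laplace(rate a), eps ~ N(0, 1)."""
    x = abs(z)  # g is symmetric about zero
    return 0.5 * a * math.exp(0.5 * a * a) * (
        math.exp(-a * x) * STD_NORMAL.cdf(x - a)
        + math.exp(a * x) * STD_NORMAL.cdf(-(x + a))
    )


def posterior_median(z, w, a):
    """Posterior median of mu given Z = z (assumes sigma = 1, 0 < w <= 1)."""
    x = abs(z)
    mixture = w * laplace_marginal(x, a) + (1.0 - w) * STD_NORMAL.pdf(x)
    # For x >= 0 the median m > 0, if it exists, solves
    # 1 - Phi(m - x + a) = q, i.e. m = x - a - Phi^{-1}(q).
    q = math.exp(a * x - 0.5 * a * a) * mixture / (a * w)
    if q >= 1.0:
        # Phi(x - a) < 1 <= q: less than half the posterior mass lies
        # above zero, so the median is exactly zero.
        return 0.0
    estimate = max(0.0, x - a - STD_NORMAL.inv_cdf(q))
    return math.copysign(estimate, z)  # the rule is antisymmetric in z


def threshold(w, a, upper=50.0, tol=1e-8):
    """Smallest |z| giving a non-zero posterior median, by bisection."""
    lower = 0.0
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if posterior_median(mid, w, a) > 0.0:
            upper = mid
        else:
            lower = mid
    return upper


for z in (0.5, 2.0, 3.5, 5.0):
    print(z, round(posterior_median(z, w=0.05, a=0.5), 3))
print("threshold:", round(threshold(0.05, 0.5), 3))
```

Observations below the threshold are set exactly to zero, while large observations are shrunk towards zero by roughly `a`.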
13Parameter estimation
- Suppose now we have a sequence of observations of the form
- $Z_i = \mu_i + \varepsilon_i$, with $\varepsilon_i \sim N(0, \sigma^2)$,
- for $i = 1, \ldots, n$
- We need to estimate the mixing weight $\omega$, the scaling parameter $a$, and the variance $\sigma^2$.
- We use a maximum likelihood approach to find estimates for $\omega$ and $a$.
- To estimate $\sigma^2$, we employ a sum-of-squares approach from fitting a linear additive model which accounts for both variation between chip replicates and variation between spot replicates nested within the chips
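One way to picture the maximum likelihood step is as a search over the marginal likelihood of the observations, in which each $Z_i$ has density $(1 - w)\,\phi(z) + w\,g(z)$ with $g$ the Laplace-normal convolution. The sketch below uses a plain grid search and assumes the observations have already been scaled to unit variance; it is an illustration of the idea, not the optimiser used in the talk or in EBayesThresh, and the function names, grids and toy data are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def laplace_marginal(z, a):
    """Density g(z) of Z = mu + eps, mu ~ Laplace(rate a), eps ~ N(0, 1)."""
    x = abs(z)
    return 0.5 * a * math.exp(0.5 * a * a) * (
        math.exp(-a * x) * STD_NORMAL.cdf(x - a)
        + math.exp(a * x) * STD_NORMAL.cdf(-(x + a))
    )


def fit_by_grid(zs, w_grid, a_grid):
    """Return the (w, a) pair on the grid maximising the marginal
    log-likelihood sum_i log{(1 - w) * phi(z_i) + w * g(z_i; a)}."""
    phi = [STD_NORMAL.pdf(z) for z in zs]
    best = None
    for a in a_grid:
        g = [laplace_marginal(z, a) for z in zs]  # reuse across all w
        for w in w_grid:
            loglik = sum(math.log((1.0 - w) * p + w * gi)
                         for p, gi in zip(phi, g))
            if best is None or loglik > best[0]:
                best = (loglik, w, a)
    return best[1], best[2]


w_grid = [i / 100 for i in range(1, 100)]
a_grid = [i / 10 for i in range(1, 31)]

# Toy data: 80 "null" values on a standard normal grid, plus 20 clear signals.
null = [STD_NORMAL.inv_cdf((i + 0.5) / 80) for i in range(80)]
signal = null + [5.0] * 10 + [-5.0] * 10

w_null, a_null = fit_by_grid(null, w_grid, a_grid)
w_sig, a_sig = fit_by_grid(signal, w_grid, a_grid)
print("null data:   w =", w_null)
print("signal data: w =", w_sig, " a =", a_sig)
```

The estimated weight adapts to the data: it stays near the bottom of the grid when nothing is going on and rises when a fraction of the observations are clearly non-zero, which in turn lowers the threshold.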
14Linear additive model
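- Model each observed intensity by
- $X_{ijk} = \mu_i + b_{ij} + e_{ijk}$
- with $\mu_i$: gene-specific mean
- $b_{ij}$: between-chip variation, with variance $\sigma_B^2$
- $e_{ijk}$: within-chip variation, with variance $\sigma_W^2$
- Given estimates for $\sigma_B^2$ and $\sigma_W^2$, the variance of the gene-level mean $\bar{X}_{i}$ (the average of $X_{ijk}$ over chips and spots) is given by $\sigma_B^2 / m + \sigma_W^2 / (mr)$
- Simulations show the sum-of-squares approach is more reliable as $\omega$ grows large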
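A sum-of-squares estimate of the two variance components can be sketched with the usual balanced nested ANOVA identities, pooling over genes. This is one standard method-of-moments version, written under the assumption of a balanced design with m >= 2 chips and r >= 2 spots per chip; the talk does not spell out its exact estimator, and the function names are illustrative.

```python
from statistics import fmean


def variance_components(X):
    """Estimate (sigma_B^2, sigma_W^2) from X[i][j][k] = value for gene i,
    chip j, spot k, assuming a balanced design (same m and r everywhere)."""
    n, m, r = len(X), len(X[0]), len(X[0][0])
    ss_within = 0.0
    ss_between = 0.0
    for gene in X:
        chip_means = [fmean(chip) for chip in gene]
        gene_mean = fmean(chip_means)  # valid because the design is balanced
        ss_within += sum((x - cm) ** 2
                         for chip, cm in zip(gene, chip_means) for x in chip)
        ss_between += r * sum((cm - gene_mean) ** 2 for cm in chip_means)
    ms_within = ss_within / (n * m * (r - 1))   # estimates sigma_W^2
    ms_between = ss_between / (n * (m - 1))     # estimates sigma_W^2 + r * sigma_B^2
    sigma2_w = ms_within
    sigma2_b = max(0.0, (ms_between - ms_within) / r)  # truncate at zero
    return sigma2_b, sigma2_w


def variance_of_gene_mean(sigma2_b, sigma2_w, m, r):
    """Variance of a gene's mean over m chips with r spots each."""
    return sigma2_b / m + sigma2_w / (m * r)


# Toy data: 1 gene, 2 chips, 2 replicate spots per chip.
X = [[[1.0, 3.0], [5.0, 7.0]]]
s2_b, s2_w = variance_components(X)
print(s2_b, s2_w, variance_of_gene_mean(s2_b, s2_w, m=2, r=2))
```

When the between-chip estimate is not truncated, the resulting variance of the gene mean equals the between-chip mean square divided by mr, which gives a quick sanity check on the arithmetic.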
15Error distribution for spike-in experiment: HIV data
16Results for HIV spike-in experiment
17Data from homemade spotted array: E. coli
18Results
Table 1: Numbers of differentially expressed genes identified by different common methods
19Conclusions
- The empirical Bayesian approach is a natural way to incorporate our prior belief that not many genes are differentially expressed.
- Making an adjustment to the variance is not sufficient compensation for using the incorrect prior distribution!
- Future work includes using a Laplace distribution for the errors.
20References
- Hedenfalk, I. et al. (2001). Gene expression profiles in hereditary breast cancer. The New England Journal of Medicine, 344 (8), 539-548.
- Huber, W. et al. (2003). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18, S96-S104.
- Johnstone, I. and Silverman, B. (2004). Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics, 32, 1594-1649.
- McLachlan, G., Bean, R. and Ben-Tovim Jones, L. (2006). A simple implementation of a normal mixture approach to differential expression in multiclass microarrays. Bioinformatics, 22 (13), 1608-1615.
- Smyth, G., Michaud, J. and Scott, H. (2005). Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics, 21 (9), 2067-2075.
- Tusher, V., Tibshirani, R. and Chu, C. (2001). Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. Proc. Natn. Acad. Sci. USA, 98, 5116-5121.