Title: Bayesian process-based modeling of gene expression data: estimating absolute mRNA concentrations
Slide 1: Bayesian process-based modeling of gene expression data: estimating absolute mRNA concentrations
- Arnoldo Frigessi, University of Oslo
- Mark van de Wiel, Technische Universiteit Eindhoven
- Marit Holden, Norwegian Computing Center
- Ingrid K. Glad, University of Oslo
- Heidi Lyng, The Norwegian Radium Hospital
Bricks dag 29-11-2005
Slide 2: cDNA microarray
[Diagram: DNA -> (transcription) -> mRNA -> (translation) -> amino acid -> protein -> cell phenotype -> organism phenotype]
- Microarrays measure gene expression at the transcription level.
- Microarray technology: a 40,000-spot cDNA microarray, hybridized with a sample labelled with a green fluorescent colour (Cy3) and a reference sample labelled in red (Cy5).
Slide 3
Adapted from Christina Kendziorski, http://www.biostat.wisc.edu/kendzior
Data: log2(rj/gj)
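
A minimal sketch of how the log-ratio data on this slide could be computed, assuming rj and gj denote the red (Cy5) and green (Cy3) intensities of spot j. The function name and the intensity values are hypothetical and serve only as illustration.

```python
import math


def log_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


# Hypothetical (r_j, g_j) intensity pairs for three spots (assumed values).
spots = [(1200.0, 600.0), (500.0, 500.0), (150.0, 600.0)]
ratios = [log_ratio(r, g) for r, g in spots]
print(ratios)  # [1.0, 0.0, -2.0]
```

A positive value means the red channel is brighter than the green one for that spot, zero means equal intensity, and a negative value means the green channel is brighter.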
Slide 4
- Efficient production of spotted glass-slide arrays has made microarray technology a widespread technique.
- Spotted microarrays provide valuable information on relative transcript levels in tissues, but differences in experimental protocols make direct comparison of results between microarray studies very difficult.
Slide 5
- Can we get information about absolute transcript levels from standard spotted microarray data?
- Extraction of absolute transcript levels is complicated by experimental variation and noise originating in the production and hybridisation processes.
- Main difficulty: probes (to which the transcripts bind) have different properties.
Slide 6
- Is information about absolute transcript levels useful?
- Absolute concentrations of mRNA are universal and can be included in further analysis with similar estimates obtained with different techniques in other labs.
- A first step towards building an annotated database of transcript levels of cells.
- It is possible to detect significant concentration differences between two different genes within the same tumor, comparisons that are not possible with standard intensity ratios.
Slide 7
TransCount: counting the number of transcripts of each gene in each sample.
[Diagram: Sample 1, Sample 2, Sample 3, ..., Sample t, with the corresponding estimates K1g, K2g, K3g, ..., Ktg]
Slide 8: Propagating uncertainty
- Current practice
  - Divide the experiment into separate steps:
    - microarray production
    - transcription, labelling, hybridisation
    - image analysis
    - normalisation
    - imputation
    - estimation of intensities
    - testing / clustering
  - Do inference inside each task and plug the results into the next step.
- We do a coherent statistical analysis and propagate uncertainties.
Slide 9
- We use available covariates describing the various steps of the experiment, from target preparation to laser scanning of the images.
- We try to keep the model as close as possible to the biology, physics, and bio-chemistry of the experiment.
- MCMC converges (slowly, as usual in complex models).
Slide 10
- Some genes must be spotted at least twice on some arrays; the number of such genes does not depend on the total number of genes in the study but on the design. Currently 50 genes in duplicate.
- Our method succeeds in obtaining absolute concentrations because it makes explicit use of probe- and spot-related covariates, like probe length and quantity, to describe probe-dependent hybridisation efficiency. By means of duplicate spotting, we have many transcripts with more than one probe, and the effect of probe-dependent covariates can be estimated.
Slide 11
- We follow the mRNA molecules through the whole experiment.
- At each step, some molecules survive, according to a Binomial process with a success probability depending on appropriate covariates.
- At the end, some molecules are scanned, and produce our data, i.e. the raw measured intensities.
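
The survival mechanism on this slide can be sketched as repeated binomial thinning. This is a minimal illustration and not the authors' model: the function names, the three step probabilities, and the starting count are hypothetical, and in the model itself the success probabilities depend on covariates rather than being fixed constants.

```python
import random


def binomial(n, p, rng):
    """Draw from Binomial(n, p) by summing n Bernoulli trials (fine for modest n)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def follow_molecules(k_start, step_probs, rng):
    """Thin k_start molecules through successive steps.

    Returns the list of molecule counts, starting with k_start and followed
    by the number of survivors after each step.
    """
    counts = [k_start]
    for p in step_probs:
        counts.append(binomial(counts[-1], p, rng))
    return counts


rng = random.Random(0)  # fixed seed so the run is reproducible
# Hypothetical survival probabilities for three illustrative steps (assumed values).
step_probs = [0.5, 0.2, 0.1]
counts = follow_molecules(10_000, step_probs, rng)
print(counts)
```

Because successive binomial thinning is again binomial with the product of the step probabilities, the final count in this sketch has mean 10,000 × 0.5 × 0.2 × 0.1 = 100.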
Slide 12
- Two off-line experiments are needed to determine two covariates which are technology dependent:
- Hybridisation factor c is used to scale the estimated values to the true number of transcripts. Estimated using two control samples (spikes) with known concentrations.
- Amplification factor f is a measure of the increase in intensity per unit of increase in PMT voltage during laser scanning. Estimated once for each dye and scanner.
- Under ordinary stable experimental settings it is sufficient to estimate these factors once.
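
One simple way a factor such as f could be estimated from an off-line scan series is a least-squares slope of intensity against PMT voltage. This is a sketch under stated assumptions and not necessarily the authors' procedure: the function name, the voltages, the intensity values, and the scale on which intensity is recorded are all hypothetical.

```python
def least_squares_slope(xs, ys):
    """Return the ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical calibration scans of one slide at four PMT voltages (assumed values).
pmt_voltages = [500.0, 550.0, 600.0, 650.0]
intensities = [8.0, 9.0, 10.0, 11.0]  # arbitrary units; the scale is an assumption

f_hat = least_squares_slope(pmt_voltages, intensities)
print(round(f_hat, 4))  # 0.02
```

Following the slide, such a fit would be repeated once for each dye and scanner.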
Slide 13: MODEL
(1) scaling and selection of target molecules
(2) inclusion of covariate information
(3) scanning
(4) imaging
Slide 14
- Reparametrisation principle
- Approximate Binomial with Poisson,
  Hgt,a ~ Poisson(c · nas · qta · Ktg · pst,a),
  and find the parameters that are identifiable.
- Reparametrise Ktg to include all other remaining parameters.
- Next, approximate Binomial with Normal.
- Parameters that were not Poisson identifiable ( ) do not occur in the mean, but only in the variance.
- (a is the ratio of the two dye parameters)
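
The two approximations named on this slide can be written out for a generic binomial count, which shows why some parameters are identifiable only through the variance. The symbols H, K and p below are generic placeholders (a count, a number of molecules, and a single survival probability) and are assumptions for illustration, not the full parameterisation of the model.

```latex
\begin{align*}
H \mid K \sim \mathrm{Binomial}(K, p)
  &\;\approx\; \mathrm{Poisson}(Kp)
  && \text{(small } p \text{, large } K \text{)},\\
\mathrm{E}[H] = \mathrm{Var}[H] &= Kp
  && \text{(Poisson: only the product } Kp \text{ enters)},\\
H \mid K \sim \mathrm{Binomial}(K, p)
  &\;\approx\; \mathcal{N}\bigl(Kp,\; Kp(1-p)\bigr)
  && \text{(large } Kp(1-p) \text{)}.
\end{align*}
```

Under the Poisson approximation, K and p cannot be separated because they appear only as a product. Under the Normal approximation the mean is still Kp, but the factor (1 - p) appears in the variance, so a parameter that was absorbed into the product can still leave a trace there.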
Slide 15
- Validate estimated concentrations in a dye-swap experiment with control samples at known concentrations.
- 17 genes, each spotted 6 times on 2 arrays.
- 2 control samples (spikes), each with 17 different mRNA sequences at specific concentrations.
- Hyb. factor 0.001
Low concentrations are overestimated.
Slide 16
- Validate estimated concentrations with results from quantitative real-time PCR.
- 12 cervix tumor samples and a pool of ten cancer cell lines
- 24 arrays, each with one tumor and the pool, dye-swapped
- 10000 genes
Slide 17
[Figure panels: TransCount; Log-ratios]
Slide 18
- Clear linear relation between the PCR data and estimated concentrations.
- The best agreement is for intermediate and high concentrations, reflecting the increased uncertainties of both methods in quantification of low-abundance transcripts.
- Good positive correlation between estimated concentration and PCR data for some individual genes, despite limited within-gene variability and few data points.
- Standard log-ratio expressions also showed a significant correlation to PCR data, BUT much lower.
- Many genes had approximately the same log-ratio although their absolute transcript concentrations differed considerably. This shows the additional information of absolute measures.
Slide 19: Cervix tumor samples and pooled cell lines
- Estimated absolute transcript concentrations
- 12 cervix tumor samples and a pool of ten cancer cell lines
- 24 arrays, each with one tumor and the pool, dye-swapped
- 10000 genes
Slide 22
- Estimate both parameters in a binomial (which allows for estimating gene-dependent effects).
- Use scaling to focus on the hybridization process.
- Do not use single observations to estimate its variance.
- Use conditional independence in hierarchical models to model complex dependencies in a flexible way.
- Start MCMC runs with central initial values.
Slide 23
- Four main ideas:
  - we use covariates explicitly, incl. some describing hybridisation efficiency of each spot
  - we treat unequal number of replicates per gene
  - we use the binomial process, which better describes the experimental dynamics and allows estimation of gene and dye effects
  - we build a bottom-to-top coherent stochastic model, avoiding plug-ins and fully propagating uncertainty.
Slide 24: Publications
- Arnoldo Frigessi, M.A. van de Wiel, M. Holden, D.H. Svendsrud, I.K. Glad and H. Lyng (2005), Genome-wide estimation of transcript concentrations from spotted cDNA microarray data, Nucleic Acids Research - Methods Online, 33, e143.
- M.A. van de Wiel, M. Holden, I.K. Glad, H. Lyng and Arnoldo Frigessi (2006), Bayesian process-based modeling of two-channel microarray experiments: estimating absolute mRNA concentrations. In Bayesian Inference for Gene Expression and Proteomics (Mueller, Do, eds). Tech report available here: http://www.nr.no/files/samba/smbi/Transcount/report999.pdf
TransCount: a prototype and quite-user-friendly version of the MCMC sampler is available here: http://www.nr.no/pages/samba/area_emr_smbi_transcount