Title: Adsorption models of oligonucleotide microarrays Conrad Burden, Centre for Bioinformation Science, A
1Adsorption models of oligonucleotide microarrays
Conrad Burden, Centre for Bioinformation Science, ANU
2Oligonucleotide microarray chips
Affymetrix make these little beasties for
testing for the presence of genes in prepared
cRNA samples
Image courtesy of Affymetrix
3- Single-stranded DNA oligo probes, 25 bases in length, deposited onto a glass substrate using a photolithographic process
Image courtesy of Affymetrix
4- The chip surface is divided up into 500,000 features tens of microns across; the probes within each feature all have one specific sequence
- Each gene is represented by between 11 and 16 pairs of such regions: one perfect match (PM) sequence and one mismatch (MM) sequence
- Tens of thousands of genes are measured by a single chip
9- Data from an experiment showing the expression of
thousands of genes on a single GeneChip probe
array.
Image courtesy of Affymetrix
10- Given a set of typically 16 PM and MM intensity values (times the number of replicate chips in the experiment), how can we obtain a measure of mRNA expression for a given gene?
  - Either as an absolute mRNA concentration in, say, picomolar
  - Or as a relative change in mRNA concentration between treatments
11- Absolute concentration: we'll come to this later
- Relative expression between treatments: existing expression measures such as
  - MAS5
  - RMA
  - Li-Wong
  attempt to do this.
- (MAS5 is provided with Affymetrix chips. The Bioconductor software provides built-in functions for all three measures.)
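12MAS5 (MicroArray Suite v5)
1. Background-adjusted value for each probe pair $j$:
$V_j = PM_j - IM_j$
where the ideal mismatch is $IM_j = MM_j$ if $MM_j < PM_j$, and something $< PM_j$ otherwise
2. Tukey biweight average of logged Vs within probeset (summarisation)
$\mathrm{SignalLogValue} = \mathrm{TukeyBiweight}(\log_2 V_j)$
13- 3. Optional scaling factor $sf$
4. Final output is
$\mathrm{ReportedValue}_i = sf \times 2^{\mathrm{SignalLogValue}_i}$
Reported value of the $i$th probeset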
14RMA (Robust Microarray Average)
Irizarry et al. Biostatistics, 4 (2003) 249-264
1. Background correction
Subtract from the PMs a probe-specific background correction, using a model in which the observed intensity is the sum of an (exponential) signal and (normal) noise.
2. Quantile normalisation
Assuming multiple replicates of each experiment, this adjusts intensities so that the distribution of intensities is the same for all chips within a set of replicates.
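A minimal sketch of quantile normalisation, assuming each chip is given as a plain list of probe intensities of equal length. The function name and the tie-breaking rule (ties broken by position) are assumptions made for illustration; Bioconductor's own implementation may handle ties differently.

```python
def quantile_normalise(chips):
    """Force every chip to share the same intensity distribution.

    `chips` is a list of equal-length lists, one list of probe intensities per chip.
    """
    n_probes = len(chips[0])
    # Sort each chip, then average across chips at each rank
    # to build the common reference distribution.
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [sum(s[r] for s in sorted_chips) / len(chips)
                 for r in range(n_probes)]
    normalised = []
    for chip in chips:
        # Order of the probes within this chip (ties broken by position).
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            # Replace each intensity by the reference value of the same rank.
            out[i] = reference[rank]
        normalised.append(out)
    return normalised
```

After normalisation every chip contains exactly the same set of values; only the assignment of values to probes differs from chip to chip.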
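154. Average across the 16 probes in the probeset using median polish summarisation, i.e., fit the background-corrected, normalised intensities to the model
$\log_2 PM_{ij} = \mu_i + \alpha_j + \epsilon_{ij}$ (chip $i$, probe $j$)
$\mu_i$ is the required measure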
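Median polish can be sketched as follows, for a table whose rows are chips and whose columns are the probes of one probeset. This is a generic version of Tukey's procedure with a fixed number of sweeps (an assumption; real implementations usually stop on convergence), and the names are illustrative.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Fit table[i][j] ~ overall + row[i] + col[j] by alternating median sweeps."""
    resid = [list(r) for r in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row = [0.0] * n_rows
    col = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the row medians out of the residuals into the row effects.
        for i in range(n_rows):
            m = median(resid[i])
            row[i] += m
            resid[i] = [v - m for v in resid[i]]
        # Move the median of the column effects into the overall term.
        m = median(col)
        overall += m
        col = [c - m for c in col]
        # Sweep the column medians out of the residuals into the column effects.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        # Move the median of the row effects into the overall term.
        m = median(row)
        overall += m
        row = [r - m for r in row]
    return overall, row, col, resid
```

In the notation of the model, the expression measure for chip i is `overall + row[i]`, and `col[j]` plays the role of the probe effect.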
16Affymetrix Latin Square experiment
- 14 genes spiked at cyclic permutations of the 14 concentrations (0, 0.25, 0.5, 1, ..., 1024) pM into a background of human pancreas cRNA
- Hybridised onto 14 arrays
- 3 replicates of experiment
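The cyclic design can be written down in a few lines. The sketch assumes the elided concentrations form a doubling series from 0.25 pM to 1024 pM (which, together with 0 pM, gives exactly 14 levels); the direction of the cyclic shift is an arbitrary choice.

```python
def latin_square_design():
    """Spike-in concentration (pM) for each gene (row) on each array (column)."""
    # 0 pM plus the assumed doubling series 0.25, 0.5, ..., 1024 pM: 14 levels.
    levels = [0.0] + [0.25 * 2 ** k for k in range(13)]
    n = len(levels)
    # Each row is a cyclic shift of the levels, so every gene sees every
    # concentration once and every array carries every concentration once.
    return [[levels[(g + a) % n] for a in range(n)] for g in range(n)]
```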
17[Figure: Latin square design, genes against chips]
18[Figure: Gene 37777_at, with labels: Background, Saturation, 1 pM, 64 pM]
20- Existing expression measures
  - wrongly assume a linear relationship between target concentration and measured fluorescent dye intensity
  - fail to account for saturation effects
  - fail to account properly for probe-specific differences in probe-target binding affinities
- An alternative approach is to use adsorption models from physical chemistry to infer absolute concentration estimates.
21Langmuir Adsorption Model
[Diagram: adsorption of target onto probe forms a probe-target duplex; desorption reverses it]
Image courtesy of Affymetrix
22Langmuir Adsorption Model
- Let $x$ be the concentration of mRNA target and $\theta(t)$ be the fraction of sites occupied by probe-target duplexes.
- Assume
  - (Adsorption) Target mRNA attaches to probes at a rate $k_f x (1 - \theta(t))$, proportional to the concentration of specific target mRNA and the fraction of unoccupied probes
  - (Desorption) Target mRNA detaches from probes at a rate $k_b \theta(t)$, proportional to the fraction of occupied probes
- Then $\frac{d\theta}{dt} = k_f x (1 - \theta) - k_b \theta$
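23- Solution with initial condition $\theta(0) = 0$ is
$\theta(t) = \frac{x}{x + K}\left[1 - e^{-(x + K) k_f t}\right]$
where $K = k_b/k_f$. Let $y(x,t)$ be the measured fluorescence intensity and $y_0$ be the background intensity at zero concentration. Also assume intensity above background is proportional to $\theta(t)$, with proportionality constant $b$. Then
$y(x,t) = y_0 + b\,\frac{x}{x + K}\left[1 - e^{-(x + K) k_f t}\right]$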
24Equilibrium limit, $t \to \infty$, gives the Langmuir isotherm
$y(x) = y_0 + b\,\frac{x}{x + K}$
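The time-dependent solution and its equilibrium limit are easy to check numerically. The sketch below writes the intensity as background plus b times the occupied fraction, with K = k_b/k_f, and includes a crude Euler integration of the rate equation as a cross-check; the function names are illustrative and no parameter values are taken from the slides.

```python
import math

def theta(x, t, k_f, K):
    """Occupied fraction of probe sites at time t for target concentration x."""
    return x / (x + K) * (1.0 - math.exp(-(x + K) * k_f * t))

def intensity(x, t, y0, b, k_f, K):
    """Fluorescence intensity: background y0 plus b times the occupied fraction."""
    return y0 + b * theta(x, t, k_f, K)

def isotherm(x, y0, b, K):
    """Equilibrium (t -> infinity) limit: the hyperbolic Langmuir isotherm."""
    return y0 + b * x / (x + K)

def theta_euler(x, t, k_f, K, steps=100000):
    """Euler integration of d(theta)/dt = k_f x (1 - theta) - k_b theta, theta(0) = 0."""
    k_b = K * k_f
    dt = t / steps
    th = 0.0
    for _ in range(steps):
        th += dt * (k_f * x * (1.0 - th) - k_b * th)
    return th
```

Note that the approach to equilibrium has rate (x + K) k_f, so low concentrations take longer to equilibrate than high ones.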
25Time-dependent solutions
26[Figure: Latin square design, genes against chips]
27Raw data from .cel files
Affy spike-in experiment, gene 37777_at. Red: PM; black: MM
28Raw data from .cel files
Affy spike-in experiment, gene 37777_at. Red: PM; black: MM
29Raw data from .cel files
Affy spike-in experiment, gene 1024_at. Red: PM; black: MM
30Raw data from .cel files
Affy spike-in experiment, gene 1024_at. Red: PM; black: MM
31Statistical Model
- Use a Generalized Linear Model to fit fluorescence intensity values $y$ to a Gamma distribution, i.e. assume the random variable $Y$ has a Gamma distribution
  $Y \sim \Gamma(\mu, \nu)$
- with mean given by the Langmuir adsorption solution
  $\mu = y_{\mathrm{Langmuir}}(x, t)$
- and constant shape parameter $\nu$, i.e. constant coefficient of variation.
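A quick simulation shows what a constant shape parameter means in practice: for a Gamma distribution with mean mu and shape nu the variance is mu^2/nu, so the coefficient of variation is 1/sqrt(nu) whatever the mean. The sketch uses only the Python standard library; the function names, means and sample sizes are illustrative assumptions.

```python
import random
from statistics import mean, stdev

def sample_intensity(mu, shape, rng):
    """Draw one value from a Gamma distribution with mean mu and the given shape."""
    # random.gammavariate takes (shape, scale); the mean is shape * scale.
    return rng.gammavariate(shape, mu / shape)

def coefficient_of_variation(samples):
    """Sample standard deviation divided by sample mean."""
    return stdev(samples) / mean(samples)

def simulated_cv(mu, shape, n=20000, seed=0):
    """Coefficient of variation of n simulated intensities with mean mu."""
    rng = random.Random(seed)
    return coefficient_of_variation(
        [sample_intensity(mu, shape, rng) for _ in range(n)])
```

For example, `simulated_cv(100.0, 25.0)` and `simulated_cv(10000.0, 25.0)` should both come out close to 1/sqrt(25) = 0.2, even though the means differ by a factor of 100.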
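32Justification for Gamma distribution
- Add to the Langmuir equation a stochastic noise term
  $\frac{d\theta}{dt} = k_f x (1 - \theta) - k_b \theta + h(x, \theta)\, z(t)$
  where $z(t)$ is a Gaussian noise; then, under reasonable assumptions on $h(x, \theta)$, $\theta$ follows an approximate Gamma distribution.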
33Test of Gamma assumption using Q-Q plot
- $Y \sim \Gamma(\mu, \nu) \Rightarrow Y/\mu \sim \Gamma(1, \nu)$
- coefficient of variation = std. dev./mean = 0.192
- (> 8,000 data points)
34- We tested many versions of the model
35- ... and determined the best-supported model (parsimonious, i.e. no unnecessary parameters, but accurate over all data):
- Equilibrium Langmuir isotherm
- Parameters y0, b, K all probe dependent
- Overall wafer-dependent scaling effect
36Inverse problem
- Given the measured fluorescence intensities from
16 probes, what is the concentration of mRNA?
37- First try a simple algorithm (following D. Hekstra et al., Nucl. Acids Res. 31 (2003) 1962)
1) Fit the parameters to a linear model in the nucleotide counts, where nA, nC and nG are the numbers of each nucleotide in the probe
2) Given a new set of 16 probe sequences, estimate their parameters y0, b and K from the model
38- 3) Invert the Langmuir isotherm to get 16 estimates of the gene concentration
4) Median of these 16 values gives a robust estimate of mRNA concentration for this gene
39Why the median and not the mean?
So that we can account for data outside the range $y_0 < y < y_0 + b$
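The inversion step and the case for the median can be sketched together. Solving y = y0 + b x/(x + K) for x gives x = K (y - y0)/(y0 + b - y), which is negative when y falls below the background y0 and has no finite solution when y reaches the saturation level y0 + b. Returning infinity in that case is a choice made for this sketch, as are the function names; the median of the per-probe estimates is unaffected by a few such values, whereas the mean is not.

```python
from statistics import median

def invert_isotherm(y, y0, b, K):
    """Solve y = y0 + b x / (x + K) for the concentration x."""
    if y >= y0 + b:
        # At or above saturation there is no finite solution.
        return float("inf")
    # Below background (y < y0) this is negative, i.e. unphysical but finite.
    return K * (y - y0) / (y0 + b - y)

def estimate_concentration(intensities, params):
    """Median of the per-probe estimates; params holds one (y0, b, K) per probe."""
    estimates = [invert_isotherm(y, y0, b, K)
                 for y, (y0, b, K) in zip(intensities, params)]
    return median(estimates)
```

With five probes that share (y0, b, K) = (100, 1000, 50) and intensities 600, 600, 600, 1200 and 90, the per-probe estimates are 50, 50, 50, infinity and a small negative number; the median is 50 while the mean is infinite.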
42Calculated mRNA concentration vs. true values
43... compare with MAS5 and RMA
44- Even this mindlessly simple algorithm is an improvement on the currently available expression measures!
45The challenge is to find an algorithm that will predict y0, b and K for any given probe sequence
- Parameters y0, b and K are probe dependent: is there an explanation from physical chemistry?
- Work in progress
46Improvements to naïve Langmuir model
- Include cross-hybridization: competition from mRNA other than the intended target sequence
Rate of uptake of specific target
Rate of uptake of non-specific target
47- Include dynamics of probe-target binding
Even with these two improvements, the hyperbolic form of the isotherm is preserved!
48- The Langmuir isotherm $y = y_0 + b\,\frac{x}{x + K}$ is still appropriate (but the three parameters have less simple meanings).
- This model enables a comparison between PM and MM probe parameters in terms of binding free energies
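Why competition leaves the hyperbolic form intact can be seen in a standard competitive Langmuir sketch. This is not necessarily the model on the slides, whose rate equations are not preserved here: it assumes equilibrium, a non-specific target at fixed concentration x_ns with its own constant K_ns, and equal fluorescence from specific and non-specific duplexes. All names are hypothetical.

```python
def occupied_fraction(x, x_ns, K_s, K_ns):
    """Equilibrium fraction of sites holding specific or non-specific target."""
    s = x / K_s        # specific binding term
    c = x_ns / K_ns    # non-specific binding term
    return (s + c) / (1.0 + s + c)

def effective_parameters(y0, b, x_ns, K_s, K_ns):
    """Parameters (y0', b', K') of the equivalent single hyperbolic isotherm."""
    c = x_ns / K_ns
    # Cross-hybridization raises the apparent background, lowers the apparent
    # saturation range, and raises the apparent K.
    return y0 + b * c / (1.0 + c), b / (1.0 + c), K_s * (1.0 + c)

def isotherm(x, y0, b, K):
    """Hyperbolic Langmuir isotherm."""
    return y0 + b * x / (x + K)
```

Under these assumptions `y0 + b * occupied_fraction(x, x_ns, K_s, K_ns)` equals `isotherm(x, *effective_parameters(y0, b, x_ns, K_s, K_ns))` for every x, which is one concrete sense in which the three parameters acquire less simple meanings.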
49Langmuir isotherms for PM and MM
Affy spike-in experiment, gene 37777_at. Red: PM; black: MM
51- ... which is where we are up to.
- Where is it going?
- Final aim is to combine our adsorption model with existing models (e.g. the Position Dependent Nearest Neighbour model) to find an algorithm for determining y0, b and K for any probe sequence.
- This will provide a practical way of measuring absolute concentration of mRNA in biological samples
52References
- Statistical Analysis of Adsorption Models for Oligonucleotide Microarrays, Statistical Applications in Genetics and Molecular Biology, 2004 (to appear)
- An Adsorption Model of Hybridization Behaviour on Oligonucleotide Microarrays, ePrint arXiv q-bio.BM/0411005
53Acknowledgements
- Susan Wilson (CBiS/CMA, ANU)
- Yvonne Pittelkow (CBiS, ANU)
- C.B. (CBiS/JCSMR, ANU)