Title: Low-level Analysis of Microarray Data
1Low-level Analysis of Microarray Data
Mathematical Statistics Centre for
Mathematical Sciences Lund University,
Sweden 2004-10-01
2? Why are microarrays important?
Genome projects have identified lots of new
genes, but for many we do not know what they are
doing. "Functional Genomics" Application
areas Cancer, Inflammation, Infectious
diseases Toxicology - drug (side)
effects Developmental biology
3Papers
- H. Bengtsson, B. Calder, I. S. Mian, M. Callow,
E. Rubin, and T. P. Speed. Identifying
Differentially Expressed Genes in cDNA Microarray
Experiments making aging visible. 2001. - H. Bengtsson. Identification and normalization of
plate effect in cDNA microarray data. 2002. - H. Bengtsson. The R.oo package - object-oriented
programming with references using standard R
code. 2003. - H. Bengtsson and O. Hössjer. Methodological study
of affine transformations of gene expression data
with proposed normalization method. 2003. - H. Bengtsson, G. Jönsson, and J.
Vallon-Christersson. Calibration and assessment
of channel-specific biases in microarray data
with extended dynamical range. 2003. - H. Bengtsson. aroma - An R Object-oriented
Microarray Analysis environment. 2004. - H. Bengtsson and O. Hössjer. Affine calibration
for microarrays with dilution series or
spike-ins. 2004.
4Central Dogma of Molecular Biology
Idea of microarrays (and other gene expression
techniques) Measure the amount of mRNA to find
genes that are expressed (active). (measuring
protein might be better, but is currently much
harder)
5Gene-expression analysis
- xc,i gene expression for gene i1,...,I in
sample c1,...C. - I5...-30,000 and Cgt2.
- Consider two samples from cancer and healthy
tissue. - We are interested in genes that have different
expression - Non-differentially expr. genes
- Ix1x2 i?I x1,ix2,i
- Differentially expressed genes
- Ix1?x2 I \ Ix1x2 i?I x1,i ? x2,i
non-diff. expr. genes
6Hypothesis tests
- For each gene i1,...,I we test the hypotheses
null hypothesis H0 x1,i x2,i Non-differentially expressed genes
alternative hypothesis H1 x1,i ? x2,i Differentially expressed genes
7? Which genes are differentially transcribed?
same-same
tumor-normal
8Statistics 101
?bias accuracy?
? precision variance?
9Central dogma of data analysis
Can always increase sensitivity on the cost of
specificity, or vice versa, the art is to find
the best trade-off!
X
X
X
X
X
X
X
X
X
10Printing / spotting
11Microarray slide preparation
J. Vallon-Christersson, Dept Oncology, Lund Univ.
12RNA extraction hybridization
(targets)
Extract mRNA from samples Reverse transcription
of mRNA to cDNA Label with Cy3 or Cy5 fluorescent
dye Hybridize labeled cDNA to slide Wash slide
(probes)
Figure Hybridization chamber.
13Two-channel scanning
? higher frequency, more energy
? lower frequency, less energy
14Combined color image for visualization
15Signal quantification
- AddressingLocate spot centers.
- SegmentationClassification of pixels either as
signal or background (using circles, seeded
region growing or other). - Signal quantificationa) foreground estimates
Rfg, Gfgb) background estimates Rbg, Gbgc)
... (shape, size etc)
Terry Speed et al.
16Summary
scanning
Production
data (Rfg,Gfg,Rbg,Gbg, ...)
17? Sources of variation
amount of RNA in the biopsy efficiencies of -RNA
extraction -reverse transcription
-labeling -fluorescent detection
probe purity and length distribution spotting
efficiency, spot size cross-/unspecific
hybridization stray signal
18Exploratory data analysis (R,G) ? (log2R,log2G)
? (M,A)
log(R) vs log(G)
M vs A
R vs G
R red channel signalG green channel signal
M log2(R/G) (log-ratio), A ½log2(RG)
(log-intensity)
19Exploratory data analysis the need for
normalization
In self-self comparisons, find stochastic as well
as systematic deviations from M0. Idea Do
various exploratory plots to understand the
nature of these deviations for example, M vs A,
spatial plots, density boxplots plots,
print-order plots etc.
20(No Transcript)
21(No Transcript)
22(No Transcript)
23(No Transcript)
24Within-slide normalization
- Classical methods
- No normalization
- Median normalizationShift of log-ratios.
- Curve-fit normalization methods.A.k.a. lo(w)ess
normalization. - etc.
25Global median normalization
Assumption Changes roughly symmetric.
M M median(M)
aroma normalizeWithinSlide(ma, methodm)
26Global curve-fit normalization cont.
curve-fit normalization done for all spots
together
Mi Mi c(Ai)
(Biased towards the green channel intensity
dependent artifacts)
aroma plot(ma) normalizeWithinSlide(ma,
methodl) plot(ma)
27Print-tip curve-fit normalization
- Print-tip curve-fit normalized data
(A,M)n1..N - Mi,p Mi,p - cp(Ai) p print tip 1-16
- where cp is an intensity dependent function for
print tip p.
aroma plot(ma) lowessCurve(ma,
groupByprinttip) boxplot(ma,
groupByprinttip)
28Print-tip curve-fit normalization cont.
curve-fit normalization done for each print-tip
group separately
aroma normalizeWithinSlide(ma, p)
boxplot(ma, groupByprinttip) plotSpatial(ma)
29Across-slides normalization
The within-slide normalization makes the red and
the green signals comparable. What about signals
from different arrays? Across-slide normalization
attacks this problem. Idea Log-ratios (M
values) from different slides should have similar
variance. Example If the standard deviation of
the log-ratios from one array is 0.50 and from
another array 0.78, the log-ratios have to be
rescaled to get the same standard deviation.
30Paper A
- H. Bengtsson, B. Calder, I. S. Mian, M. Callow,
E. Rubin, and T. P. Speed. - Identifying Differentially Expressed Genes in
cDNA Microarray Experiments making aging
visible. - Online report accompanying the discussion forum
with the same name. Science SAGE KE, 2001 (12),
vp8. - Illustrates typical normalization of two-color
microarray data - Short introduction on how to identify
differentially-expressed genes. - Basic study on how many replicated arrays are
needed. - Comparison between two image-analysis methods.
31Paper B
- H. Bengtsson.
- Identification and normalization of plate effect
in cDNA microarray data. - Preprints in Mathematical Sciences 200228,
Mathematical Statistics, - Centre for Mathematical Sciences, Lund
University, 2002.
32Printing of spots
1
Hmm... why the horizontal stripes? ?
16
6384 spots printed onto 9 slides in total 399
print turns using 4x4 print-tips...
33Print-order plot of log-ratios
The spots are order according to when they were
spotted/dipped onto the glass slide(s). Note that
it takes hours/days to print all spots on all
arrays.
plotPrintorder(ma)
34Remove (constant) plate biases?
Will remove some of the intensity dependent
effects...
...and some of the spatial artifacts.
35Intensity normalization plate by plate?
...and most of the spatial artifacts.
36Comparison of normalization methods
Ex two different genes da lt db
Median absolute deviation (MAD) for gene i
with replicates j1,2,...,J di 1.4826
median rij where rij Mij median Mij is
residual j for gene i.
The measure of reproducibility (small is good) is
a scalar defined as themean of all genewise
MADs M.O.R. ?di / N where N is the
number of genes. Remember that minimizing
variance alone is not useful - we also need
to consider bias
37Results
- Doing platewise intensity dependent
normalization lowers the gene variability by
another 10 from print-tip norm. - Background correction increases variability (but
might reduce bias) - Using measure of reproducibility is helpful in
deciding what to do.
Pl Constant platewise norm., Pl(A) Intensity
dep. platewise norm., Sl(A) Intensity dep.
slidewise norm., Pr(A) Intensity dep.
print-tip-wise norm., sPr(A) Scaled intensity
dep. print-tip-wise norm., bg background
corrected data.
38Visual comparison
Scaled print-tip intensity normalization (M.O.R.
0.123 46)
Scaled print-tip follow by plate intensity
normalization (M.O.R.0.110 41)
No normalization (M.O.R.0.270 100)
39Major reason for print-order effects
intensity dependent effects
different intensities for different plates
?
40Paper D
- H. Bengtsson and O. Hössjer.
- Methodological study of affine transformations of
gene expression data with proposed normalization
method. - Preprints in Mathematical Sciences 200338,
Mathematical Statistics, Centre for Mathematical
Sciences, Lund University, 2003.
41A general model
- yc,i fc(xc,i) ?c,i
- genes i 1, 2, . . . , I
- RNA extracts c 1, 2, . . . , C
- xc,i the true gene expression level
- yc,i the observed gene expression level
- ?c,i measurement noise, E?c,i 0
- fc() measurement function is (unknown)
42Measurement functions model the biochemical and
physical measurement procedure
43Recall the M vs A transform
- Especially in two-channel microarray analysis the
M vs A transform is useful - where xR,i and xG,i are the true gene expression
levels. Since these are unknown, it is a common
practice to plug in the observed expression
levels
44Linear models and M vs A
- In the literature, it common to assume a linear
measurement function, either explicitly or
implicitly - where bc is the scale factor for channel c
R,G. The observed log-ratios and log-intensities
become - where ? bR /bG is the relative scale factor.
- Systematic effects The overall shift in the
log-ratios is log2 ?, which is constant.
45Examples of bias with linear measurement functions
- Green too strong
- ? bR/bG 1/4
- ) log2? -2
- Balance channels
- ? bR/bG 1
- ) log2? 0
- Red too strong
- ? bR/bG 4/1
- ) log2? 2
46Affine measurement functions
- Assume that the measurement function is affine
- where ac and bc are channel-specific bias and
scale parameters. - The observed log-ratios become
- where ?i aR - ? aG (xR,i/ xG,i), ?bR/bG, and
Ai is the observed log intensity for gene i. - Systematic effects i) Constant shift of log2 ?.
ii) intensity-dependent bias. - Note that the second bias term is fold-change
dependent! Moreover, if aR aG 0 (linear), the
second bias term is zero as before.
47Examples of bias with affine measurement functions
Assume different channel biases, i.e. aR?aG and
?bR /bG1.
Small bias in red, large bias in
green (aR,aG)(20,200)
No bias (aR,aG)(0,0)
Small bias in red (aR,aG)(20,0)
48Theoretical examples of affine transformations
Assume different channel biases, i.e. aR?aG, but
same ?bR /bG0.57.
(aR,aG)(80,140) ?0.57
(aR,aG)(0,0) ? 0.57
(aR,aG)(20,200) ?0.57
The blue dash-dot curves are non-differentially
expressed genes. Colored curves above and below
are genes with fold-change log2(xR,i /xG,i)
1,2, ...
49? and with data
50Robust affine normalization
Consider a two-channel (CR,G) microarray with
genes i1,...,I for which we observe With ?
aR-?aG and ? bR/bG, we have for
non-differentially expressed genes (no noise)
that The red and green signals, yi
(yR,i,yG,i), then scatter around the line with
intercept ? and slope ?. Next, robust estimates
of ? and ?.
51IWPCA iterative reweighted PCA
- Define the objective function
- wi is a weight, where ri(?,?yi) gt 0 is the
Euclidean distance between yi and the line
L(?,?). The estimate of ? and ? is then - With weights wi 1 we obtain standard PCA
(minimization in L2). With weights - we minimize in L1, if we let ? ! 0.
- Algorithm Estimate ? and ?. Update weights.
Reestimate ? and ? and so on.
52Additional constraints
- With estimates of ? aR-?aG and ? bR/bG, we
impose the additional constraint that after
normalization no negative signal may exists (less
conservative constraint may also be used). In
addition, let bR1. The robust affine
normalization is then - Example Estimates
- (aR,aG,?)
- (21.5,29.3,0.163).
53Generalization to several channels/arrays
- The robust affine normalization generalizes
directly - to microarrays with multilple channels, i.e.
C3,4,... . - to multiple arrays, if the experimental design
allows, i.e. K2,3,... . - In addition
- No between-array normalization is needed (its
included). - Normalization to a common reference is
straightforward. That is, if all arrays share the
same reference, first all reference channels is
normalized toward each other to form a baseline
channel, then test channels are normalized
toward the baseline channel. - Usage (in aroma)
- normalizeAffine(rg)
54Paper E
- H. Bengtsson, G. Jönsson, and J.
Vallon-Christersson. - Calibration and assessment of channel-specific
biases in microarray data with extended dynamical
range. - Preprints in Mathematical Sciences 200337,
Mathematical Statistics, Centre for Mathematical
Sciences, Lund University, 2003.
55The microarray technology (again)
56Scan protocol
By scanning the same array at various PMT gains
(and constant laser power) we can learn about hc
. When we scan at gains v and w, we obtain two
different measurements of the unknown, but fixed
amount of hybridized DNA (xc,i) for both
channels c R,G and all genes i 1,...,I.
57Observed signals at various gains
zoom ?
All PMT pairs converge at the same point (ec,ec)
and not at the origin (0,0). ? Affine measurement
function
Channel c G. PMT voltage pairs
(v,w) (800,500) (600,500) (700,500) (700,600) (80
0,600) (800,700)
58General model and estimation
The K observed signals in channel c for each gene
i scatter around the line with The line
L(ac,bc) can be estimate robustly using IWPCA as
before. Next, define Bias parameters ac are
uniquely identified by the additional constraint
that dc ¼ 0, i.e.
59Multiscan calibration
Backtransform
60Results
- By scanning the same microarray at various
sensitivity levels one obtains - calibrated signals with scanner biases removed
- an extended dynamical range
- improved signal-to-noise levels
- better understanding of the noise structure (of
the scanner)
61Paper G
- H. Bengtsson and O. Hössjer.
- Affine calibration for microarrays with dilution
series or spike-ins. - Preprints in Mathematical Sciences 200419,
Mathematical Statistics, Centre for Mathematical
Sciences, Lund University, 2004.
62Model
gene i1,...,I replicated spots j1,...,J at
various concentrations (dilution series). channel
c. First, assume that the concentration of
fluorophores bi,j of all replicated spots of a
gene i is the same zc,i,j zc,i Model the
total amount of light xc,i,j from spot (i,j)
as xc,i,j bi,j zc,i
63Model cont.
We observe yc,i,j ac bc xc,i,j ?c,i,j
ac bc bi,j zc,i ?c,i,j ?c,i,j independent
zero-mean noise with variance ?c,i,j2. ac and bc
offset and scale parameters for channel c In
the paper, the variance function ?c,i,j2 ?c,i2
kc ?i2 is used. The offset parameter ac can
now be uniquely estimated using maximum
likelihood techniques utilizing numerical
optimization and PCA. bc are estimated as before.
64Probe calibration
C2, I432, J4 ten dilution series are
highlighted
65In addition the method gives...
Estimates of the relative expression level of the
dilution series follow directly from the
maximum likelihood procedure.
66Result
Arrays should contain multiple spots for the same
gene that have different dilutions (instead of
identical replicates) Systematic treatment of
bias (optical background), expression ratios for
the dilution series ("genes"), and noise.
67Good scientific software is like a good
scientific publication
? reproducible ? peer-reviewed ? easily
accessible by other researchers, society ?
builds on the work of others ? other will build
their work on top of it ? commercialize those
developments that are successful
68Why Open Source?
- so that you can find out what algorithm is being
used, and how it is being used - so that you can modify these algorithms to try
out new ideas or to accommodate local conditions
or needs - so that they can be used as components
- ?Transparency
- ?Pursuit of reproducibilty
- ?Efficiency of development
- ?Training
69Paper C
- H. Bengtsson.
- The R.oo package - object-oriented programming
with references using standard R code. - In Kurt Hornik, Friedrich Leisch, and Achim
Zeileis, editors, Proceedings of the 3rd
International Workshop on Distributed Statistical
Computing (DSC 2003), Vienna, Austria, March 2003
70R.oo package
- Objectives
- Better and more robust object-oriented design and
implementation - More memory efficient and faster implementations
- Easy to understand and to extend
- User- and developer-friendly interface
- Core of the microarray package (aroma)
- Methods
- Single root class Object all classes inherits
from. - Provide support for reference variables.
- R Coding Conventions (RCC), e.g. naming
conventions (MyClassName,myObject, myFunction()) - Free and open-source (5500 code lines).
- Proof of concept, rich documentation with many
examples (100 pages).
71Paper F
- H. Bengtsson.
- aroma - An R Object-oriented Microarray Analysis
environment. - Preprints in Mathematical Sciences 200418,
Mathematical Statistics, Centre for Mathematical
Sciences, Lund University, 2004.
72aroma package
- Objectives
- Provide simple methods for low-level analysis of
microarray data - Wide support for file formats
- User- and developer-friendly interface
- Easy to understand and easy to extend
- Easy to maintain
- Methods
- Make use of R.oo
- Make use of R.classes pkgs (20000 lines of code
gt 400 pages of docs) - Follow R Coding Conventions
- Free and open-source (18000 code lines)
- Easy to install anywhere.
- Rich documentation with examples (gt280 pages)
73A short example
gt gpr lt- MicroarrayDataread("Slide1.gpr") gt
gpr 1 "GenePixData Number of slides 3. Number
of fields 43. Layout Grids 4x4 (16), spots in
grids 18x18 (324), total numberof spots 5184.
Spot names are specified. Spot ids are
specified." gt plotSpatial(gpr) gt raw lt-
getRawData(gpr) gt ma lt- getSignal(raw) gt idx lt-
(abs(maM) gt 1.5) gt plot(ma) gt highlight(ma, idx,
col"blue") gt normalizeAffine(ma) gt ...