Low-level Analysis of Microarray Data

About This Presentation

Title:

Low-level Analysis of Microarray Data

Description:

the art is to find the best trade-off! X. X. X. X. X. X. X. X. X. 10. Printing / spotting. 11 ... of the unknown, but fixed amount of hybridized DNA (xc,i) ... – PowerPoint PPT presentation

Number of Views:38

Avg rating:3.0/5.0

Slides: 70

Provided by: wolfganghu

Category:

more less

Transcript and Presenter's Notes

Title: Low-level Analysis of Microarray Data

1
Low-level Analysis of Microarray Data
Mathematical Statistics Centre for
Mathematical Sciences Lund University,
Sweden 2004-10-01
2
? Why are microarrays important?
Genome projects have identified lots of new
genes, but for many we do not know what they are
doing. "Functional Genomics" Application
areas Cancer, Inflammation, Infectious
diseases Toxicology - drug (side)
effects Developmental biology
3
Papers

H. Bengtsson, B. Calder, I. S. Mian, M. Callow,
E. Rubin, and T. P. Speed. Identifying
Differentially Expressed Genes in cDNA Microarray
Experiments making aging visible. 2001.
H. Bengtsson. Identification and normalization of
plate effect in cDNA microarray data. 2002.
H. Bengtsson. The R.oo package - object-oriented
programming with references using standard R
code. 2003.
H. Bengtsson and O. Hössjer. Methodological study
of affine transformations of gene expression data
with proposed normalization method. 2003.
H. Bengtsson, G. Jönsson, and J.
Vallon-Christersson. Calibration and assessment
of channel-specific biases in microarray data
with extended dynamical range. 2003.
H. Bengtsson. aroma - An R Object-oriented
Microarray Analysis environment. 2004.
H. Bengtsson and O. Hössjer. Affine calibration
for microarrays with dilution series or
spike-ins. 2004.

4
Central Dogma of Molecular Biology
Idea of microarrays (and other gene expression
techniques) Measure the amount of mRNA to find
genes that are expressed (active). (measuring
protein might be better, but is currently much
harder)
5
Gene-expression analysis

xc,i gene expression for gene i1,...,I in
sample c1,...C.
I5...-30,000 and Cgt2.
Consider two samples from cancer and healthy
tissue.
We are interested in genes that have different
expression
Non-differentially expr. genes
Ix1x2 i?I x1,ix2,i
Differentially expressed genes
Ix1?x2 I \ Ix1x2 i?I x1,i ? x2,i

non-diff. expr. genes
6
Hypothesis tests

For each gene i1,...,I we test the hypotheses

null hypothesis H0 x1,i x2,i Non-differentially expressed genes
alternative hypothesis H1 x1,i ? x2,i Differentially expressed genes
7
? Which genes are differentially transcribed?
same-same
tumor-normal
8
Statistics 101
?bias accuracy?
? precision variance?
9
Central dogma of data analysis
Can always increase sensitivity on the cost of
specificity, or vice versa, the art is to find
the best trade-off!
X
X
X
X
X
X
X
X
X
10
Printing / spotting
11
Microarray slide preparation
J. Vallon-Christersson, Dept Oncology, Lund Univ.
12
RNA extraction hybridization
(targets)
Extract mRNA from samples Reverse transcription
of mRNA to cDNA Label with Cy3 or Cy5 fluorescent
dye Hybridize labeled cDNA to slide Wash slide
(probes)
Figure Hybridization chamber.
13
Two-channel scanning
? higher frequency, more energy
? lower frequency, less energy
14
Combined color image for visualization
15
Signal quantification

AddressingLocate spot centers.
SegmentationClassification of pixels either as
signal or background (using circles, seeded
region growing or other).
Signal quantificationa) foreground estimates
Rfg, Gfgb) background estimates Rbg, Gbgc)
... (shape, size etc)

Terry Speed et al.
16
Summary
scanning
Production
data (Rfg,Gfg,Rbg,Gbg, ...)
17
? Sources of variation
amount of RNA in the biopsy efficiencies of -RNA
extraction -reverse transcription
-labeling -fluorescent detection
probe purity and length distribution spotting
efficiency, spot size cross-/unspecific
hybridization stray signal
18
Exploratory data analysis (R,G) ? (log2R,log2G)
? (M,A)
log(R) vs log(G)
M vs A
R vs G
R red channel signalG green channel signal
M log2(R/G) (log-ratio), A ½log2(RG)
(log-intensity)
19
Exploratory data analysis the need for
normalization
In self-self comparisons, find stochastic as well
as systematic deviations from M0. Idea Do
various exploratory plots to understand the
nature of these deviations for example, M vs A,
spatial plots, density boxplots plots,
print-order plots etc.
20
(No Transcript)
21
(No Transcript)
22
(No Transcript)
23
(No Transcript)
24
Within-slide normalization

Classical methods
No normalization
Median normalizationShift of log-ratios.
Curve-fit normalization methods.A.k.a. lo(w)ess
normalization.
etc.

25
Global median normalization
Assumption Changes roughly symmetric.
M M median(M)
aroma normalizeWithinSlide(ma, methodm)
26
Global curve-fit normalization cont.
curve-fit normalization done for all spots
together
Mi Mi c(Ai)
(Biased towards the green channel intensity
dependent artifacts)
aroma plot(ma) normalizeWithinSlide(ma,
methodl) plot(ma)
27
Print-tip curve-fit normalization

Print-tip curve-fit normalized data
(A,M)n1..N
Mi,p Mi,p - cp(Ai) p print tip 1-16
where cp is an intensity dependent function for
print tip p.

aroma plot(ma) lowessCurve(ma,
groupByprinttip) boxplot(ma,
groupByprinttip)
28
Print-tip curve-fit normalization cont.
curve-fit normalization done for each print-tip
group separately
aroma normalizeWithinSlide(ma, p)
boxplot(ma, groupByprinttip) plotSpatial(ma)
29
Across-slides normalization
The within-slide normalization makes the red and
the green signals comparable. What about signals
from different arrays? Across-slide normalization
attacks this problem. Idea Log-ratios (M
values) from different slides should have similar
variance. Example If the standard deviation of
the log-ratios from one array is 0.50 and from
another array 0.78, the log-ratios have to be
rescaled to get the same standard deviation.
30
Paper A

H. Bengtsson, B. Calder, I. S. Mian, M. Callow,
E. Rubin, and T. P. Speed.
Identifying Differentially Expressed Genes in
cDNA Microarray Experiments making aging
visible.
Online report accompanying the discussion forum
with the same name. Science SAGE KE, 2001 (12),
vp8.
Illustrates typical normalization of two-color
microarray data
Short introduction on how to identify
differentially-expressed genes.
Basic study on how many replicated arrays are
needed.
Comparison between two image-analysis methods.

31
Paper B

H. Bengtsson.
Identification and normalization of plate effect
in cDNA microarray data.
Preprints in Mathematical Sciences 200228,
Mathematical Statistics,
Centre for Mathematical Sciences, Lund
University, 2002.

32
Printing of spots
1
Hmm... why the horizontal stripes? ?
16
6384 spots printed onto 9 slides in total 399
print turns using 4x4 print-tips...
33
Print-order plot of log-ratios
The spots are order according to when they were
spotted/dipped onto the glass slide(s). Note that
it takes hours/days to print all spots on all
arrays.
plotPrintorder(ma)
34
Remove (constant) plate biases?
Will remove some of the intensity dependent
effects...
...and some of the spatial artifacts.
35
Intensity normalization plate by plate?
...and most of the spatial artifacts.
36
Comparison of normalization methods
Ex two different genes da lt db
Median absolute deviation (MAD) for gene i
with replicates j1,2,...,J di 1.4826
median rij where rij Mij median Mij is
residual j for gene i.
The measure of reproducibility (small is good) is
a scalar defined as themean of all genewise
MADs M.O.R. ?di / N where N is the
number of genes. Remember that minimizing
variance alone is not useful - we also need
to consider bias
37
Results

Doing platewise intensity dependent
normalization lowers the gene variability by
another 10 from print-tip norm.
Background correction increases variability (but
might reduce bias)
Using measure of reproducibility is helpful in
deciding what to do.

Pl Constant platewise norm., Pl(A) Intensity
dep. platewise norm., Sl(A) Intensity dep.
slidewise norm., Pr(A) Intensity dep.
print-tip-wise norm., sPr(A) Scaled intensity
dep. print-tip-wise norm., bg background
corrected data.
38
Visual comparison
Scaled print-tip intensity normalization (M.O.R.
0.123 46)
Scaled print-tip follow by plate intensity
normalization (M.O.R.0.110 41)
No normalization (M.O.R.0.270 100)
39
Major reason for print-order effects

intensity dependent effects
different intensities for different plates
?
40
Paper D

H. Bengtsson and O. Hössjer.
Methodological study of affine transformations of
gene expression data with proposed normalization
method.
Preprints in Mathematical Sciences 200338,
Mathematical Statistics, Centre for Mathematical
Sciences, Lund University, 2003.

41
A general model

yc,i fc(xc,i) ?c,i
genes i 1, 2, . . . , I
RNA extracts c 1, 2, . . . , C
xc,i the true gene expression level
yc,i the observed gene expression level
?c,i measurement noise, E?c,i 0
fc() measurement function is (unknown)

42
Measurement functions model the biochemical and
physical measurement procedure
43
Recall the M vs A transform

Especially in two-channel microarray analysis the
M vs A transform is useful
where xR,i and xG,i are the true gene expression
levels. Since these are unknown, it is a common
practice to plug in the observed expression
levels

44
Linear models and M vs A

In the literature, it common to assume a linear
measurement function, either explicitly or
implicitly
where bc is the scale factor for channel c
R,G. The observed log-ratios and log-intensities
become
where ? bR /bG is the relative scale factor.
Systematic effects The overall shift in the
log-ratios is log2 ?, which is constant.

45
Examples of bias with linear measurement functions

Green too strong
? bR/bG 1/4
) log2? -2

Balance channels
? bR/bG 1
) log2? 0

Red too strong
? bR/bG 4/1
) log2? 2

46
Affine measurement functions

Assume that the measurement function is affine
where ac and bc are channel-specific bias and
scale parameters.
The observed log-ratios become
where ?i aR - ? aG (xR,i/ xG,i), ?bR/bG, and
Ai is the observed log intensity for gene i.
Systematic effects i) Constant shift of log2 ?.
ii) intensity-dependent bias.
Note that the second bias term is fold-change
dependent! Moreover, if aR aG 0 (linear), the
second bias term is zero as before.

47
Examples of bias with affine measurement functions
Assume different channel biases, i.e. aR?aG and
?bR /bG1.
Small bias in red, large bias in
green (aR,aG)(20,200)
No bias (aR,aG)(0,0)
Small bias in red (aR,aG)(20,0)
48
Theoretical examples of affine transformations
Assume different channel biases, i.e. aR?aG, but
same ?bR /bG0.57.
(aR,aG)(80,140) ?0.57
(aR,aG)(0,0) ? 0.57
(aR,aG)(20,200) ?0.57
The blue dash-dot curves are non-differentially
expressed genes. Colored curves above and below
are genes with fold-change log2(xR,i /xG,i)
1,2, ...
49
? and with data
50
Robust affine normalization
Consider a two-channel (CR,G) microarray with
genes i1,...,I for which we observe With ?
aR-?aG and ? bR/bG, we have for
non-differentially expressed genes (no noise)
that The red and green signals, yi
(yR,i,yG,i), then scatter around the line with
intercept ? and slope ?. Next, robust estimates
of ? and ?.
51
IWPCA iterative reweighted PCA

Define the objective function
wi is a weight, where ri(?,?yi) gt 0 is the
Euclidean distance between yi and the line
L(?,?). The estimate of ? and ? is then
With weights wi 1 we obtain standard PCA
(minimization in L2). With weights
we minimize in L1, if we let ? ! 0.
Algorithm Estimate ? and ?. Update weights.
Reestimate ? and ? and so on.

52
Additional constraints

With estimates of ? aR-?aG and ? bR/bG, we
impose the additional constraint that after
normalization no negative signal may exists (less
conservative constraint may also be used). In
addition, let bR1. The robust affine
normalization is then
Example Estimates
(aR,aG,?)
(21.5,29.3,0.163).

53
Generalization to several channels/arrays

The robust affine normalization generalizes
directly
to microarrays with multilple channels, i.e.
C3,4,... .
to multiple arrays, if the experimental design
allows, i.e. K2,3,... .
In addition
No between-array normalization is needed (its
included).
Normalization to a common reference is
straightforward. That is, if all arrays share the
same reference, first all reference channels is
normalized toward each other to form a baseline
channel, then test channels are normalized
toward the baseline channel.
Usage (in aroma)
normalizeAffine(rg)

54
Paper E

H. Bengtsson, G. Jönsson, and J.
Vallon-Christersson.
Calibration and assessment of channel-specific
biases in microarray data with extended dynamical
range.
Preprints in Mathematical Sciences 200337,
Mathematical Statistics, Centre for Mathematical
Sciences, Lund University, 2003.

55
The microarray technology (again)
56
Scan protocol
By scanning the same array at various PMT gains
(and constant laser power) we can learn about hc
. When we scan at gains v and w, we obtain two
different measurements of the unknown, but fixed
amount of hybridized DNA (xc,i) for both
channels c R,G and all genes i 1,...,I.
57
Observed signals at various gains
zoom ?
All PMT pairs converge at the same point (ec,ec)
and not at the origin (0,0). ? Affine measurement
function
Channel c G. PMT voltage pairs
(v,w) (800,500) (600,500) (700,500) (700,600) (80
0,600) (800,700)
58
General model and estimation
The K observed signals in channel c for each gene
i scatter around the line with The line
L(ac,bc) can be estimate robustly using IWPCA as
before. Next, define Bias parameters ac are
uniquely identified by the additional constraint
that dc ¼ 0, i.e.
59
Multiscan calibration
Backtransform
60
Results

By scanning the same microarray at various
sensitivity levels one obtains
calibrated signals with scanner biases removed
an extended dynamical range
improved signal-to-noise levels
better understanding of the noise structure (of
the scanner)

61
Paper G

H. Bengtsson and O. Hössjer.
Affine calibration for microarrays with dilution
series or spike-ins.
Preprints in Mathematical Sciences 200419,
Mathematical Statistics, Centre for Mathematical
Sciences, Lund University, 2004.

62
Model
gene i1,...,I replicated spots j1,...,J at
various concentrations (dilution series). channel
c. First, assume that the concentration of
fluorophores bi,j of all replicated spots of a
gene i is the same zc,i,j zc,i Model the
total amount of light xc,i,j from spot (i,j)
as xc,i,j bi,j zc,i
63
Model cont.
We observe yc,i,j ac bc xc,i,j ?c,i,j
ac bc bi,j zc,i ?c,i,j ?c,i,j independent
zero-mean noise with variance ?c,i,j2. ac and bc
offset and scale parameters for channel c In
the paper, the variance function ?c,i,j2 ?c,i2
kc ?i2 is used. The offset parameter ac can
now be uniquely estimated using maximum
likelihood techniques utilizing numerical
optimization and PCA. bc are estimated as before.
64
Probe calibration
C2, I432, J4 ten dilution series are
highlighted
65
In addition the method gives...
Estimates of the relative expression level of the
dilution series follow directly from the
maximum likelihood procedure.
66
Result
Arrays should contain multiple spots for the same
gene that have different dilutions (instead of
identical replicates) Systematic treatment of
bias (optical background), expression ratios for
the dilution series ("genes"), and noise.
67
Good scientific software is like a good
scientific publication
? reproducible ? peer-reviewed ? easily
accessible by other researchers, society ?
builds on the work of others ? other will build
their work on top of it ? commercialize those
developments that are successful
68
Why Open Source?

so that you can find out what algorithm is being
used, and how it is being used
so that you can modify these algorithms to try
out new ideas or to accommodate local conditions
or needs
so that they can be used as components
?Transparency
?Pursuit of reproducibilty
?Efficiency of development
?Training

69
Paper C

H. Bengtsson.
The R.oo package - object-oriented programming
with references using standard R code.
In Kurt Hornik, Friedrich Leisch, and Achim
Zeileis, editors, Proceedings of the 3rd
International Workshop on Distributed Statistical
Computing (DSC 2003), Vienna, Austria, March 2003

70
R.oo package

Objectives
Better and more robust object-oriented design and
implementation
More memory efficient and faster implementations
Easy to understand and to extend
User- and developer-friendly interface
Core of the microarray package (aroma)
Methods
Single root class Object all classes inherits
from.
Provide support for reference variables.
R Coding Conventions (RCC), e.g. naming
conventions (MyClassName,myObject, myFunction())
Free and open-source (5500 code lines).
Proof of concept, rich documentation with many
examples (100 pages).

71
Paper F

H. Bengtsson.
aroma - An R Object-oriented Microarray Analysis
environment.
Preprints in Mathematical Sciences 200418,
Mathematical Statistics, Centre for Mathematical
Sciences, Lund University, 2004.

72
aroma package

Objectives
Provide simple methods for low-level analysis of
microarray data
Wide support for file formats
User- and developer-friendly interface
Easy to understand and easy to extend
Easy to maintain
Methods
Make use of R.oo
Make use of R.classes pkgs (20000 lines of code
gt 400 pages of docs)
Follow R Coding Conventions
Free and open-source (18000 code lines)
Easy to install anywhere.
Rich documentation with examples (gt280 pages)

73
A short example
gt gpr lt- MicroarrayDataread("Slide1.gpr") gt
gpr 1 "GenePixData Number of slides 3. Number
of fields 43. Layout Grids 4x4 (16), spots in
grids 18x18 (324), total numberof spots 5184.
Spot names are specified. Spot ids are
specified." gt plotSpatial(gpr) gt raw lt-
getRawData(gpr) gt ma lt- getSignal(raw) gt idx lt-
(abs(maM) gt 1.5) gt plot(ma) gt highlight(ma, idx,
col"blue") gt normalizeAffine(ma) gt ...

Write a Comment

User Comments (0)