Low-level Analysis of Microarray Data

1
Low-level Analysis of Microarray Data
Mathematical Statistics, Centre for
Mathematical Sciences, Lund University,
Sweden, 2004-10-01
2
Why are microarrays important?
Genome projects have identified lots of new
genes, but for many we do not know what they are
doing: "Functional Genomics".
Application areas:
  • Cancer, inflammation, infectious diseases
  • Toxicology: drug (side) effects
  • Developmental biology
3
Papers
  1. H. Bengtsson, B. Calder, I. S. Mian, M. Callow,
    E. Rubin, and T. P. Speed. Identifying
    Differentially Expressed Genes in cDNA Microarray
    Experiments making aging visible. 2001.
  2. H. Bengtsson. Identification and normalization of
    plate effect in cDNA microarray data. 2002.
  3. H. Bengtsson. The R.oo package - object-oriented
    programming with references using standard R
    code. 2003.
  4. H. Bengtsson and O. Hössjer. Methodological study
    of affine transformations of gene expression data
    with proposed normalization method. 2003.
  5. H. Bengtsson, G. Jönsson, and J.
    Vallon-Christersson. Calibration and assessment
    of channel-specific biases in microarray data
    with extended dynamical range. 2003.
  6. H. Bengtsson. aroma - An R Object-oriented
    Microarray Analysis environment. 2004.
  7. H. Bengtsson and O. Hössjer. Affine calibration
    for microarrays with dilution series or
    spike-ins. 2004.

4
Central Dogma of Molecular Biology
Idea of microarrays (and other gene expression
techniques): Measure the amount of mRNA to find
genes that are expressed (active). (Measuring
protein might be better, but is currently much
harder.)
5
Gene-expression analysis
  • $x_{c,i}$: gene expression for gene $i = 1, \ldots, I$ in
    sample $c = 1, \ldots, C$.
  • $I$ = 5...-30,000 and $C \ge 2$.
  • Consider two samples, from cancer and healthy
    tissue.
  • We are interested in genes that have different
    expression.
  • Non-differentially expressed genes:
    $I_{x_1 = x_2} = \{ i \in I : x_{1,i} = x_{2,i} \}$
  • Differentially expressed genes:
    $I_{x_1 \ne x_2} = I \setminus I_{x_1 = x_2} = \{ i \in I : x_{1,i} \ne x_{2,i} \}$

6
Hypothesis tests
  • For each gene $i = 1, \ldots, I$ we test the hypotheses:

Null hypothesis $H_0$: $x_{1,i} = x_{2,i}$ (non-differentially expressed genes)
Alternative hypothesis $H_1$: $x_{1,i} \ne x_{2,i}$ (differentially expressed genes)
7
Which genes are differentially transcribed?
same-same
tumor-normal
8
Statistics 101
bias, accuracy
precision, variance
9
Central dogma of data analysis
One can always increase sensitivity at the cost of
specificity, or vice versa; the art is to find
the best trade-off!
10
Printing / spotting
11
Microarray slide preparation
J. Vallon-Christersson, Dept Oncology, Lund Univ.
12
RNA extraction and hybridization
  • Extract mRNA from samples
  • Reverse transcription of mRNA to cDNA
  • Label with Cy3 or Cy5 fluorescent dye (targets)
  • Hybridize labeled cDNA to slide (probes)
  • Wash slide
Figure: Hybridization chamber.
13
Two-channel scanning
higher frequency, more energy
lower frequency, less energy
14
Combined color image for visualization
15
Signal quantification
  1. Addressing: Locate spot centers.
  2. Segmentation: Classification of pixels either as
    signal or background (using circles, seeded
    region growing or other).
  3. Signal quantification: a) foreground estimates
    Rfg, Gfg; b) background estimates Rbg, Gbg; c)
    ... (shape, size etc.)

Terry Speed et al.
16
Summary
scanning
Production
data (Rfg,Gfg,Rbg,Gbg, ...)
17
Sources of variation
  • amount of RNA in the biopsy
  • efficiencies of: RNA extraction, reverse transcription,
    labeling, fluorescent detection
  • probe purity and length distribution
  • spotting efficiency, spot size
  • cross-/unspecific hybridization
  • stray signal
18
Exploratory data analysis: $(R, G) \to (\log_2 R, \log_2 G) \to (M, A)$
Plots: R vs G, log(R) vs log(G), M vs A
R: red channel signal; G: green channel signal
$M = \log_2(R/G)$ (log-ratio), $A = \tfrac{1}{2}\log_2(RG)$
(log-intensity)
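
The (R,G) to (M,A) transform and its inverse can be sketched in base R as follows. The helper names `rg_to_ma` and `ma_to_rg` are hypothetical and are not part of aroma; the sketch assumes strictly positive signals.

```r
# Hypothetical helpers (not from aroma): convert channel signals to
# log-ratios M and log-intensities A, and back again.
rg_to_ma <- function(R, G) {
  stopifnot(length(R) == length(G), all(R > 0), all(G > 0))
  data.frame(M = log2(R / G), A = 0.5 * log2(R * G))
}

ma_to_rg <- function(M, A) {
  # log2(R) = A + M/2 and log2(G) = A - M/2
  data.frame(R = 2^(A + M / 2), G = 2^(A - M / 2))
}

# Three spots: red 4x stronger, equal channels, green 4x stronger.
ma <- rg_to_ma(R = c(400, 1000, 250), G = c(100, 1000, 1000))
print(ma$M)   # log-ratios: 2, 0, -2
back <- ma_to_rg(ma$M, ma$A)
```

Because the transform is a rotation and rescaling of the log-log plot, no information is lost; `ma_to_rg()` recovers the original signals.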
19
Exploratory data analysis: the need for
normalization
In self-self comparisons, find stochastic as well
as systematic deviations from $M = 0$. Idea: Do
various exploratory plots to understand the
nature of these deviations, for example M vs A,
spatial plots, density plots, boxplots,
print-order plots etc.
24
Within-slide normalization
  • Classical methods
  • No normalization
  • Median normalization: Shift of log-ratios.
  • Curve-fit normalization methods, a.k.a. lo(w)ess
    normalization.
  • etc.

25
Global median normalization
Assumption: Changes roughly symmetric.
$M \leftarrow M - \mathrm{median}(M)$
aroma: normalizeWithinSlide(ma, method="m")
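
As a minimal base-R sketch of the same shift (independent of aroma; `normalize_median` is a hypothetical name), the median log-ratio is subtracted from every spot:

```r
# Hypothetical helper: global median normalization of log-ratios.
normalize_median <- function(M) {
  M - median(M, na.rm = TRUE)
}

M  <- c(-0.4, 0.1, 0.3, 0.5, 2.1)   # toy log-ratios, median 0.3
Mn <- normalize_median(M)
print(median(Mn))                    # 0 after normalization
```

The median rather than the mean is used so that a few strongly differentially expressed genes (such as the 2.1 above) do not drive the shift.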
26
Global curve-fit normalization cont.
Curve-fit normalization done for all spots
together:
$M_i \leftarrow M_i - c(A_i)$
(Biased towards the green channel; intensity-dependent
artifacts)
aroma: plot(ma); normalizeWithinSlide(ma,
method="l"); plot(ma)
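
A rough base-R sketch of global curve-fit normalization, using `stats::lowess()` to estimate the intensity-dependent curve. This is not aroma's implementation; the function name `normalize_lowess`, the smoother span `f = 0.3`, and the simulated data are all assumptions made for illustration.

```r
# Hypothetical helper: subtract a lowess curve c(A) from the log-ratios.
normalize_lowess <- function(M, A, f = 0.3) {
  ok  <- is.finite(M) & is.finite(A)
  fit <- lowess(A[ok], M[ok], f = f)          # fitted curve, sorted by A
  # Evaluate the fitted curve at every spot's own log-intensity.
  cA  <- approx(fit$x, fit$y, xout = A, ties = mean, rule = 2)$y
  M - cA
}

# Simulated self-self data with an intensity-dependent bias (assumed shape).
set.seed(42)
A  <- runif(2000, 6, 14)
M  <- 1.5 * exp(-0.4 * (A - 6)) + rnorm(2000, sd = 0.2)
Mn <- normalize_lowess(M, A)
```

After normalization the log-ratios should scatter around zero with no remaining trend in A, which can be checked with a plot of `Mn` against `A`.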
27
Print-tip curve-fit normalization
  • Print-tip curve-fit normalized data
    $\{(A_n, M_n)\}$, $n = 1, \ldots, N$:
  • $M_{i,p} \leftarrow M_{i,p} - c_p(A_i)$, $p$ = print tip 1-16,
  • where $c_p$ is an intensity-dependent function for
    print tip $p$.

aroma: plot(ma); lowessCurve(ma,
groupBy="printtip"); boxplot(ma,
groupBy="printtip")
28
Print-tip curve-fit normalization cont.
Curve-fit normalization done for each print-tip
group separately.
aroma: normalizeWithinSlide(ma, "p");
boxplot(ma, groupBy="printtip"); plotSpatial(ma)
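
The print-tip variant fits one curve per print-tip group. The following base-R sketch is an illustration only, not aroma's code; `normalize_printtip`, the span `f = 0.4`, and the simulated 16-tip layout are assumptions.

```r
# Hypothetical helper: lowess normalization within each print-tip group.
normalize_printtip <- function(M, A, tip, f = 0.4) {
  Mn <- M
  for (p in unique(tip)) {
    idx <- which(tip == p)
    fit <- lowess(A[idx], M[idx], f = f)       # curve c_p for print tip p
    cp  <- approx(fit$x, fit$y, xout = A[idx], ties = mean, rule = 2)$y
    Mn[idx] <- M[idx] - cp
  }
  Mn
}

# Simulated slide: 16 print tips, each with its own offset (assumed values).
set.seed(1)
tip    <- rep(1:16, each = 200)
A      <- runif(length(tip), 6, 14)
offset <- seq(-1, 1, length.out = 16)[tip]
M      <- offset + 0.1 * (A - 10) + rnorm(length(tip), sd = 0.2)
Mn     <- normalize_printtip(M, A, tip)

before <- tapply(M,  tip, median)   # per-tip medians differ widely
after  <- tapply(Mn, tip, median)   # per-tip medians close to zero
```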
29
Across-slides normalization
The within-slide normalization makes the red and
the green signals comparable. What about signals
from different arrays? Across-slide normalization
attacks this problem. Idea: Log-ratios (M
values) from different slides should have similar
variance. Example: If the standard deviation of
the log-ratios from one array is 0.50 and from
another array 0.78, the log-ratios have to be
rescaled to get the same standard deviation.
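
The rescaling in the example can be sketched in base R as below. The choice of the geometric mean of the per-slide spreads as the common target is an assumption, as is the helper name `scale_across_slides`; a robust spread such as `mad` can be passed instead of `sd`.

```r
# Hypothetical helper: rescale each slide (column) of log-ratios so that
# all slides share the same spread. Target = geometric mean (assumed).
scale_across_slides <- function(Mmat, spread = sd) {
  s      <- apply(Mmat, 2, spread, na.rm = TRUE)
  target <- exp(mean(log(s)))
  sweep(Mmat, 2, s / target, "/")
}

# Two slides with standard deviations 0.50 and 0.78, as in the example.
set.seed(7)
z    <- as.numeric(scale(rnorm(1000)))            # mean 0, sd 1
Mmat <- cbind(slide1 = 0.50 * z, slide2 = 0.78 * rev(z))
Ms   <- scale_across_slides(Mmat)
sds  <- apply(Ms, 2, sd)                          # now equal
```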
30
Paper A
  • H. Bengtsson, B. Calder, I. S. Mian, M. Callow,
    E. Rubin, and T. P. Speed.
  • Identifying Differentially Expressed Genes in
    cDNA Microarray Experiments making aging
    visible.
  • Online report accompanying the discussion forum
    with the same name. Science SAGE KE, 2001 (12),
    vp8.
  • Illustrates typical normalization of two-color
    microarray data
  • Short introduction on how to identify
    differentially-expressed genes.
  • Basic study on how many replicated arrays are
    needed.
  • Comparison between two image-analysis methods.

31
Paper B
  • H. Bengtsson.
  • Identification and normalization of plate effect
    in cDNA microarray data.
  • Preprints in Mathematical Sciences 2002:28,
    Mathematical Statistics,
  • Centre for Mathematical Sciences, Lund
    University, 2002.

32
Printing of spots
Hmm... why the horizontal stripes?
6384 spots printed onto 9 slides, in total 399
print turns using 4x4 print-tips...
33
Print-order plot of log-ratios
The spots are ordered according to when they were
spotted/dipped onto the glass slide(s). Note that
it takes hours/days to print all spots on all
arrays.
aroma: plotPrintorder(ma)
34
Remove (constant) plate biases?
Will remove some of the intensity dependent
effects...
...and some of the spatial artifacts.
35
Intensity normalization plate by plate?
...and most of the spatial artifacts.
36
Comparison of normalization methods
Example: two different genes with $d_a < d_b$.
Median absolute deviation (MAD) for gene $i$
with replicates $j = 1, 2, \ldots, J$:
$d_i = 1.4826 \cdot \mathrm{median}_j |r_{ij}|$, where
$r_{ij} = M_{ij} - \mathrm{median}_j M_{ij}$ is
residual $j$ for gene $i$.
The measure of reproducibility (small is good) is
a scalar defined as the mean of all genewise
MADs: M.O.R. $= \sum_i d_i / N$, where $N$ is the
number of genes. Remember that minimizing
variance alone is not useful; we also need
to consider bias.
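
The measure of reproducibility is straightforward to compute. A minimal base-R sketch, assuming a genes-by-replicates matrix of log-ratios (the function name is hypothetical):

```r
# Hypothetical helper: mean of genewise MADs.
# Mmat: rows = genes, columns = replicated log-ratios M_ij.
measure_of_reproducibility <- function(Mmat) {
  d <- apply(Mmat, 1, function(m) {
    r <- m - median(m)              # residuals r_ij for this gene
    1.4826 * median(abs(r))         # genewise MAD d_i
  })
  mean(d)
}

Mmat <- rbind(c(0, 1, 2),           # d = 1.4826
              c(5, 5, 5),           # d = 0
              c(-1, 0, 3))          # d = 1.4826
mor <- measure_of_reproducibility(Mmat)
```

The inner function is equivalent to base R's `mad()` with its default constant, so `mean(apply(Mmat, 1, mad))` gives the same value.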
37
Results
  • Doing platewise intensity-dependent
    normalization lowers the gene variability by
    another 10% from print-tip norm.
  • Background correction increases variability (but
    might reduce bias).
  • Using the measure of reproducibility is helpful in
    deciding what to do.

Legend: Pl = constant platewise norm.; Pl(A) = intensity-dependent
platewise norm.; Sl(A) = intensity-dependent
slidewise norm.; Pr(A) = intensity-dependent
print-tip-wise norm.; sPr(A) = scaled intensity-dependent
print-tip-wise norm.; bg = background-corrected data.
38
Visual comparison
Scaled print-tip intensity normalization (M.O.R. =
0.123, 46%)
Scaled print-tip followed by plate intensity
normalization (M.O.R. = 0.110, 41%)
No normalization (M.O.R. = 0.270, 100%)
39
Major reason for print-order effects

intensity-dependent effects
different intensities for different plates
40
Paper D
  • H. Bengtsson and O. Hössjer.
  • Methodological study of affine transformations of
    gene expression data with proposed normalization
    method.
  • Preprints in Mathematical Sciences 2003:38,
    Mathematical Statistics, Centre for Mathematical
    Sciences, Lund University, 2003.

41
A general model
  • $y_{c,i} = f_c(x_{c,i}) + \varepsilon_{c,i}$
  • genes $i = 1, 2, \ldots, I$
  • RNA extracts $c = 1, 2, \ldots, C$
  • $x_{c,i}$: the true gene expression level
  • $y_{c,i}$: the observed gene expression level
  • $\varepsilon_{c,i}$: measurement noise, $E[\varepsilon_{c,i}] = 0$
  • $f_c(\cdot)$: the (unknown) measurement function

42
Measurement functions model the biochemical and
physical measurement procedure
43
Recall the M vs A transform
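  • Especially in two-channel microarray analysis the
    M vs A transform is useful:
    $M_i = \log_2(x_{R,i}/x_{G,i})$, $A_i = \tfrac{1}{2}\log_2(x_{R,i}\,x_{G,i})$,
  • where $x_{R,i}$ and $x_{G,i}$ are the true gene expression
    levels. Since these are unknown, it is common
    practice to plug in the observed expression
    levels $y_{R,i}$ and $y_{G,i}$ instead.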

44
Linear models and M vs A
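  • In the literature, it is common to assume a linear
    measurement function, either explicitly or
    implicitly:
    $y_{c,i} = b_c x_{c,i}$,
  • where $b_c$ is the scale factor for channel $c =
    R, G$. The observed log-ratios and log-intensities
    become
    $\log_2(y_{R,i}/y_{G,i}) = \log_2\beta + M_i$ and
    $\tfrac{1}{2}\log_2(y_{R,i}\,y_{G,i}) = \tfrac{1}{2}\log_2(b_R b_G) + A_i$,
  • where $\beta = b_R/b_G$ is the relative scale factor.
  • Systematic effects: The overall shift in the
    log-ratios is $\log_2\beta$, which is constant.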

45
Examples of bias with linear measurement functions
  • Green too strong:
    $\beta = b_R/b_G = 1/4 \Rightarrow \log_2\beta = -2$
  • Balanced channels:
    $\beta = b_R/b_G = 1 \Rightarrow \log_2\beta = 0$
  • Red too strong:
    $\beta = b_R/b_G = 4/1 \Rightarrow \log_2\beta = 2$

46
Affine measurement functions
  • Assume that the measurement function is affine:
    $y_{c,i} = a_c + b_c x_{c,i}$,
  • where $a_c$ and $b_c$ are channel-specific bias and
    scale parameters.
  • The observed log-ratios then depend on
    $\alpha_i = a_R - \beta a_G (x_{R,i}/x_{G,i})$, $\beta = b_R/b_G$, and
    $A_i$, the observed log-intensity for gene $i$.
  • Systematic effects: i) Constant shift of $\log_2\beta$.
    ii) Intensity-dependent bias.
  • Note that the second bias term is fold-change
    dependent! Moreover, if $a_R = a_G = 0$ (linear), the
    second bias term is zero, as before.
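
The two bias terms can be made explicit with a short derivation. This is a sketch that expresses the observed log-ratio through the observed green signal $y_{G,i}$ rather than through $A_i$, so its form may differ from the equation shown on the slide; $r_i$ is shorthand introduced here for the true fold change, and noise is ignored.

```latex
\begin{align*}
y_{R,i} &= a_R + b_R x_{R,i}, \qquad
y_{G,i} = a_G + b_G x_{G,i}, \qquad
\beta = b_R / b_G, \qquad
r_i = x_{R,i} / x_{G,i} \\
y_{R,i} &= a_R + \beta r_i \,(y_{G,i} - a_G)
         = \alpha_i + \beta r_i \, y_{G,i},
  \qquad \alpha_i = a_R - \beta a_G r_i \\
\log_2 \frac{y_{R,i}}{y_{G,i}}
        &= \underbrace{\log_2 \beta}_{\text{constant shift}}
         + \log_2 r_i
         + \underbrace{\log_2\!\left(1 + \frac{\alpha_i}{\beta r_i \, y_{G,i}}\right)}_{\text{intensity-dependent bias}}
\end{align*}
```

The last term shrinks as the signal $y_{G,i}$ grows, depends on the fold change through $\alpha_i$ and $r_i$, and vanishes when $a_R = a_G = 0$.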

47
Examples of bias with affine measurement functions
Assume different channel biases, i.e. $a_R \ne a_G$, and
$\beta = b_R/b_G = 1$.
  • Small bias in red, large bias in green: $(a_R, a_G) = (20, 200)$
  • No bias: $(a_R, a_G) = (0, 0)$
  • Small bias in red: $(a_R, a_G) = (20, 0)$
48
Theoretical examples of affine transformations
Assume different channel biases, i.e. $a_R \ne a_G$, but
the same $\beta = b_R/b_G = 0.57$.
  • $(a_R, a_G) = (80, 140)$, $\beta = 0.57$
  • $(a_R, a_G) = (0, 0)$, $\beta = 0.57$
  • $(a_R, a_G) = (20, 200)$, $\beta = 0.57$
The blue dash-dot curves are non-differentially
expressed genes. Colored curves above and below
are genes with fold-change $\log_2(x_{R,i}/x_{G,i}) =
1, 2, \ldots$
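
Such theoretical curves can be reproduced numerically. The base-R sketch below applies noise-free affine measurement functions to a grid of true expression levels and returns the observed (A, M) values. The function name `affine_ma`, the intensity grid, and the choice $b_G = 1$ are assumptions; the bias and scale values are taken from the examples above.

```r
# Hypothetical helper: observed (A, M) under noise-free affine measurement
# functions y_c = a_c + b_c * x_c, for genes with a given true log2 fold change.
affine_ma <- function(x, log2fc = 0, aR = 0, aG = 0, bR = 1, bG = 1) {
  xG <- x
  xR <- x * 2^log2fc
  yR <- aR + bR * xR
  yG <- aG + bG * xG
  data.frame(A = 0.5 * log2(yR * yG), M = log2(yR / yG))
}

x      <- 2^seq(4, 16, by = 0.5)                        # true levels (assumed grid)
noBias <- affine_ma(x, bR = 0.57)                       # (aR, aG) = (0, 0)
biased <- affine_ma(x, aR = 20, aG = 200, bR = 0.57)    # (aR, aG) = (20, 200)

# plot(biased$A, biased$M, type = "l"); abline(h = log2(0.57), lty = 4)
```

Without bias the non-differentially expressed genes lie on the horizontal line $M = \log_2 0.57$; with $(a_R, a_G) = (20, 200)$ the curve starts well below that line at low intensities and approaches it as the intensity grows.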
49
... and with data
50
Robust affine normalization
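Consider a two-channel ($c = R, G$) microarray with
genes $i = 1, \ldots, I$ for which we observe
$y_{c,i} = a_c + b_c x_{c,i} + \varepsilon_{c,i}$. With $\alpha =
a_R - \beta a_G$ and $\beta = b_R/b_G$, we have for
non-differentially expressed genes (no noise)
that $y_{R,i} = \alpha + \beta y_{G,i}$. The red and green signals, $y_i =
(y_{R,i}, y_{G,i})$, then scatter around the line with
intercept $\alpha$ and slope $\beta$. Next: robust estimates
of $\alpha$ and $\beta$.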
51
IWPCA iterative reweighted PCA
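  • Define an objective function in which
  • $w_i$ is a weight and $r_i(\alpha, \beta; y_i) \ge 0$ is the
    Euclidean distance between $y_i$ and the line
    $L(\alpha, \beta)$. The estimate of $\alpha$ and $\beta$ is then
    the minimizer of this objective function.
  • With weights $w_i = 1$ we obtain standard PCA
    (minimization in $L_2$). With suitably chosen weights
  • we minimize in $L_1$, if we let the small constant in
    the weights tend to 0.
  • Algorithm: Estimate $\alpha$ and $\beta$. Update weights.
    Re-estimate $\alpha$ and $\beta$, and so on.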
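
The iteration can be sketched in base R for the two-channel case. The slide's own formulas are not reproduced here, so this sketch makes two explicit assumptions: the objective is $\sum_i w_i r_i^2$, and the weights are $w_i = 1/(r_i + \epsilon)$, which approaches an $L_1$ fit as $\epsilon \to 0$. The function name `iwpca_line` is hypothetical and this is not aroma's implementation.

```r
# Sketch of iteratively reweighted PCA for a line in the (yG, yR) plane.
# ASSUMPTIONS (not from the slides): objective sum_i w_i * r_i^2 and
# weights w_i = 1 / (r_i + eps).
iwpca_line <- function(yG, yR, eps = 1e-4, maxIter = 50, tol = 1e-8) {
  Y   <- cbind(yG, yR)
  w   <- rep(1, nrow(Y))             # w_i = 1: first pass is standard PCA
  est <- c(alpha = NA_real_, beta = NA_real_)
  for (iter in seq_len(maxIter)) {
    mu <- colSums(w * Y) / sum(w)                 # weighted centre
    Yc <- sweep(Y, 2, mu)                         # centred data
    S  <- crossprod(sqrt(w) * Yc) / sum(w)        # weighted covariance
    v  <- eigen(S, symmetric = TRUE)$vectors[, 1] # direction of the line
    beta  <- v[2] / v[1]                          # slope
    alpha <- unname(mu[2] - beta * mu[1])         # intercept
    new   <- c(alpha = alpha, beta = beta)
    converged <- !anyNA(est) && max(abs(new - est)) < tol
    est <- new
    if (converged) break
    r <- abs(drop(Yc %*% c(-v[2], v[1])))         # orthogonal distances r_i
    w <- 1 / (r + eps)                            # update weights
  }
  est
}

# Noise-free check: points exactly on the line yR = 30 + 0.6 * yG.
yG  <- seq(100, 5000, length.out = 50)
yR  <- 30 + 0.6 * yG
fit <- iwpca_line(yG, yR)
```

On noise-free data the first (unweighted) pass already recovers the line; the reweighting matters when differentially expressed genes act as outliers, because distant points are down-weighted in later passes.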

52
Additional constraints
  • With estimates of $\alpha = a_R - \beta a_G$ and $\beta = b_R/b_G$, we
    impose the additional constraint that after
    normalization no negative signal may exist (a less
    conservative constraint may also be used). In
    addition, let $b_R = 1$. The robust affine
    normalization then follows.
  • Example: Estimates
    $(a_R, a_G, \beta) = (21.5, 29.3, 0.163)$.

53
Generalization to several channels/arrays
  • The robust affine normalization generalizes
    directly
  • to microarrays with multiple channels, i.e.
    $C = 3, 4, \ldots$
  • to multiple arrays, if the experimental design
    allows, i.e. $K = 2, 3, \ldots$
  • In addition:
  • No between-array normalization is needed (it's
    included).
  • Normalization to a common reference is
    straightforward. That is, if all arrays share the
    same reference, first all reference channels are
    normalized toward each other to form a baseline
    channel, then test channels are normalized
    toward the baseline channel.
  • Usage (in aroma)
  • normalizeAffine(rg)

54
Paper E
  • H. Bengtsson, G. Jönsson, and J.
    Vallon-Christersson.
  • Calibration and assessment of channel-specific
    biases in microarray data with extended dynamical
    range.
  • Preprints in Mathematical Sciences 2003:37,
    Mathematical Statistics, Centre for Mathematical
    Sciences, Lund University, 2003.

55
The microarray technology (again)
56
Scan protocol
By scanning the same array at various PMT gains
(and constant laser power) we can learn about $h_c$.
When we scan at gains $v$ and $w$, we obtain two
different measurements of the unknown, but fixed,
amount of hybridized DNA ($x_{c,i}$) for both
channels $c = R, G$ and all genes $i = 1, \ldots, I$.
57
Observed signals at various gains
All PMT pairs converge at the same point $(e_c, e_c)$
and not at the origin (0,0). $\Rightarrow$ Affine measurement
function
Channel $c = G$. PMT voltage pairs
$(v, w)$: (800,500), (600,500), (700,500), (700,600), (800,600), (800,700)
58
General model and estimation
The K observed signals in channel c for each gene
i scatter around a line $L(a_c, b_c)$, which
can be estimated robustly using IWPCA as
before. Next, a quantity $d_c$ is defined; the bias
parameters $a_c$ are uniquely identified by the
additional constraint that $d_c = 0$.
59
Multiscan calibration
Backtransform
60
Results
  • By scanning the same microarray at various
    sensitivity levels one obtains
  • calibrated signals with scanner biases removed
  • an extended dynamical range
  • improved signal-to-noise levels
  • better understanding of the noise structure (of
    the scanner)

61
Paper G
  • H. Bengtsson and O. Hössjer.
  • Affine calibration for microarrays with dilution
    series or spike-ins.
  • Preprints in Mathematical Sciences 2004:19,
    Mathematical Statistics, Centre for Mathematical
    Sciences, Lund University, 2004.

62
Model
Gene $i = 1, \ldots, I$; replicated spots $j = 1, \ldots, J$ at
various concentrations (dilution series); channel
$c$. First, assume that the concentration of
fluorophores is the same for all replicated spots of a
gene $i$: $z_{c,i,j} = z_{c,i}$. Model the
total amount of light $x_{c,i,j}$ from spot $(i,j)$
as $x_{c,i,j} = b_{i,j} z_{c,i}$, where $b_{i,j}$ is specific to spot $(i,j)$.
63
Model cont.
We observe $y_{c,i,j} = a_c + b_c x_{c,i,j} + \varepsilon_{c,i,j}
= a_c + b_c b_{i,j} z_{c,i} + \varepsilon_{c,i,j}$, where $\varepsilon_{c,i,j}$ is independent
zero-mean noise with variance $\sigma_{c,i,j}^2$, and $a_c$ and $b_c$ are
offset and scale parameters for channel $c$. In
the paper, the variance function $\sigma_{c,i,j}^2 = \sigma_{c,i}^2
= k_c \sigma_i^2$ is used. The offset parameter $a_c$ can
now be uniquely estimated using maximum
likelihood techniques utilizing numerical
optimization and PCA. The $b_c$ are estimated as before.
64
Probe calibration
$C = 2$, $I = 432$, $J = 4$; ten dilution series are
highlighted
65
In addition the method gives...
Estimates of the relative expression level of the
dilution series follow directly from the
maximum likelihood procedure.
66
Result
  • Arrays should contain multiple spots for the same
    gene that have different dilutions (instead of
    identical replicates).
  • Systematic treatment of
    bias (optical background), expression ratios for
    the dilution series ("genes"), and noise.
67
Good scientific software is like a good
scientific publication
  • reproducible
  • peer-reviewed
  • easily accessible by other researchers, society
  • builds on the work of others
  • others will build their work on top of it
  • commercialize those developments that are successful
68
Why Open Source?
  • so that you can find out what algorithm is being
    used, and how it is being used
  • so that you can modify these algorithms to try
    out new ideas or to accommodate local conditions
    or needs
  • so that they can be used as components
  • Transparency
  • Pursuit of reproducibility
  • Efficiency of development
  • Training

69
Paper C
  • H. Bengtsson.
  • The R.oo package - object-oriented programming
    with references using standard R code.
  • In Kurt Hornik, Friedrich Leisch, and Achim
    Zeileis, editors, Proceedings of the 3rd
    International Workshop on Distributed Statistical
    Computing (DSC 2003), Vienna, Austria, March 2003

70
R.oo package
  • Objectives
  • Better and more robust object-oriented design and
    implementation
  • More memory efficient and faster implementations
  • Easy to understand and to extend
  • User- and developer-friendly interface
  • Core of the microarray package (aroma)
  • Methods
  • Single root class, Object, that all classes inherit
    from.
  • Provide support for reference variables.
  • R Coding Conventions (RCC), e.g. naming
    conventions (MyClassName,myObject, myFunction())
  • Free and open-source (5500 code lines).
  • Proof of concept, rich documentation with many
    examples (100 pages).

71
Paper F
  • H. Bengtsson.
  • aroma - An R Object-oriented Microarray Analysis
    environment.
  • Preprints in Mathematical Sciences 2004:18,
    Mathematical Statistics, Centre for Mathematical
    Sciences, Lund University, 2004.

72
aroma package
  • Objectives
  • Provide simple methods for low-level analysis of
    microarray data
  • Wide support for file formats
  • User- and developer-friendly interface
  • Easy to understand and easy to extend
  • Easy to maintain
  • Methods
  • Make use of R.oo
  • Make use of R.classes pkgs (20000 lines of code,
    > 400 pages of docs)
  • Follow R Coding Conventions
  • Free and open-source (18000 code lines)
  • Easy to install anywhere.
  • Rich documentation with examples (> 280 pages)

73
A short example
> gpr <- MicroarrayData$read("Slide1.gpr")
> gpr
[1] "GenePixData: Number of slides: 3. Number of fields: 43. Layout: Grids: 4x4 (16), spots in grids: 18x18 (324), total number of spots: 5184. Spot names are specified. Spot ids are specified."
> plotSpatial(gpr)
> raw <- getRawData(gpr)
> ma <- getSignal(raw)
> idx <- (abs(ma$M) > 1.5)
> plot(ma)
> highlight(ma, idx, col="blue")
> normalizeAffine(ma)
> ...