Title: Preprocessing of cDNA microarray data
1Preprocessing of cDNA microarray data
- Lecture 19, Statistics 246,
- April 1, 2004
2Begin by looking at the data
- Was the experiment a success?
- What analysis tools should be used?
- Are there any specific problems?
3Red/Green overlay images
Co-registration and overlay offer a quick
visualization, revealing information on color
balance, uniformity of hybridization, spot
uniformity, background, and artifacts such as
dust or scratches.
Good: low background, lots of differential expression (d.e.).
Bad: high background, ghost spots, little d.e.
4Always log, always rotate
$\log_2 R$ vs $\log_2 G$
$M = \log_2 (R/G)$ vs $A = \log_2 \sqrt{RG}$
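A minimal sketch of the "log and rotate" step, assuming background-corrected, strictly positive red and green intensities. The function name and the toy intensities are illustrative only, not taken from the lecture.

```python
import math

def ma_transform(red, green):
    """Return (M, A) lists from paired red/green spot intensities."""
    m_values, a_values = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m_values.append(log_r - log_g)          # M = log2(R/G)
        a_values.append((log_r + log_g) / 2.0)  # A = log2(sqrt(R*G))
    return m_values, a_values

# Hypothetical intensities for three spots.
red = [1024.0, 512.0, 4096.0]
green = [256.0, 512.0, 1024.0]
M, A = ma_transform(red, green)
```

Plotting M against A is, up to scaling, a 45-degree rotation of the log2 R vs log2 G scatterplot, which puts intensity-dependent dye bias along the horizontal axis where it is easier to see.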
5Histograms
Signal/Noise = $\log_2$(spot intensity / background
intensity)
6Boxplots of log2R/G
Liver samples from 16 mice: 8 WT, 8 ApoAI KO.
7Spatial plots: background from the two slides
8Highlighting extreme log ratios
Top (black) and bottom (green) 5% of log ratios
9Boxplots and highlighting
[Figure: log-ratios by print-tip (pin) group]
- Clear example of spatial bias (here high
is red, low green)
10Pin group (sub-array) effects
Boxplots of log ratios by pin group
Lowess lines through points from pin groups
11Plate effects
12KO 8
Probes: 6,000 cDNAs, including 200 related to
lipid metabolism. Arranged in a 4x4 array of
19x21 sub-arrays.
13Time of printing effects
spot number
Green channel intensities (log2G). Printing over
4.5 days. The previous slide depicts a slide from
this print run.
14Normalization
- Why?
- To correct for systematic differences between
samples on the same slide, or between slides,
which do not represent true biological variation
between samples.
- How do we know it is necessary?
- By examining self-self hybridizations, where
no true differential expression is occurring.
- We find dye biases which vary with overall spot
intensity, location on the array, plate origin,
pins, scanning parameters, etc.
15Self-self hybridizations
False color overlay
Boxplots within pin-groups
Scatter (MA-)plots
16A series of non self-self hybridizations
From the NCI60 data set (Stanford web site)
17Early Ngai lab, UC Berkeley
18Early Goodman lab, UC Berkeley
19From the Ernest Gallo Clinic Research Center
20Early PMCRI, Melbourne Australia
21Normalization methods
- a) Normalization based on a global
adjustment
- $\log_2 R/G \rightarrow \log_2 R/G - c = \log_2 R/(kG)$
- Choices for $k$ or $c = \log_2 k$ are: $c$ = median or
mean of log ratios for a particular gene set
(e.g. housekeeping genes). Or, total intensity
normalization, where $k = \sum R_i / \sum G_i$.
- b) Intensity-dependent normalization.
- Here we run a line through the middle of the
MA plot, shifting the M value of the pair (A,M)
by $c = c(A)$, i.e.
- $\log_2 R/G \rightarrow \log_2 R/G - c(A) = \log_2 R/(k(A)G)$.
- One estimate of $c(A)$ is made using the
LOWESS function of Cleveland (1979): LOcally
WEighted Scatterplot Smoothing.
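The two approaches on this slide can be sketched as follows. This is a simplified illustration, not Cleveland's full LOWESS: it fits a tricube-weighted straight line around each point and omits the robustness iterations. The function names, the span of 0.4 and the toy data are assumptions made for the example.

```python
import math
from statistics import median

def global_median_shift(M):
    """Global adjustment: subtract one constant c = median(M) from every M."""
    c = median(M)
    return [m - c for m in M]

def total_intensity_k(red, green):
    """Total intensity normalization: k = sum(R_i) / sum(G_i), so c = log2(k)."""
    return sum(red) / sum(green)

def local_fit(A, M, a0, span):
    """Tricube-weighted straight-line fit of M on A, evaluated at a0."""
    q = max(2, math.ceil(span * len(A)))        # number of neighbours used
    h = sorted(abs(a - a0) for a in A)[q - 1]   # distance to the q-th neighbour
    h = h if h > 0 else 1e-12
    w = [(1.0 - min(abs(a - a0) / h, 1.0) ** 3) ** 3 for a in A]
    sw = sum(w)
    if sw == 0.0:                               # degenerate neighbourhood
        return median(M)
    mean_a = sum(wi * a for wi, a in zip(w, A)) / sw
    mean_m = sum(wi * m for wi, m in zip(w, M)) / sw
    sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, A))
    sxy = sum(wi * (a - mean_a) * (m - mean_m) for wi, a, m in zip(w, A, M))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_m + slope * (a0 - mean_a)

def intensity_dependent_normalize(M, A, span=0.4):
    """Subtract the fitted curve c(A) from each M (quadratic time, fine for a sketch)."""
    return [m - local_fit(A, M, a, span) for m, a in zip(M, A)]

# Toy data with a purely intensity-dependent dye bias M = 0.5*A - 3 (hypothetical).
A = [6.0 + 0.5 * i for i in range(20)]
M = [0.5 * a - 3.0 for a in A]
M_norm = intensity_dependent_normalize(M, A)
```

On the toy data the fitted curve follows the bias exactly, so the normalized log ratios are zero up to rounding. A global shift would leave the slope in place, which is the reason for letting c depend on A.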
22Normalization methods
- c) Within print-tip group normalization.
- In addition to intensity-dependent variation
in log ratios, spatial bias can also be a
significant source of systematic error.
- Most normalization methods do not correct
for spatial effects produced by hybridization
artifacts or print-tip or plate effects
during the construction of the microarrays.
- It is possible to correct for both print-tip
and intensity-dependent bias by performing LOWESS
fits to the data within print-tip groups, i.e.
- $\log_2 R/G \rightarrow \log_2 R/G - c_i(A) = \log_2 R/(k_i(A)G)$,
- where $c_i(A)$ is the LOWESS fit to the
MA-plot for the $i$th grid only.
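The grouping logic of within-print-tip normalization can be sketched as below. To keep the example short, the per-group curve is estimated with a running median over neighbouring spots in A-rank order, a crude stand-in for LOWESS; any smoother with the same signature could be passed in instead. All names and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import median

def running_median_curve(A, M, k=5):
    """Crude smoother: median of M over a window of k spots adjacent in A-rank order."""
    order = sorted(range(len(A)), key=lambda i: A[i])
    fitted = [0.0] * len(A)
    half = k // 2
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        fitted[i] = median([M[j] for j in order[lo:hi]])
    return fitted

def print_tip_normalize(M, A, tip, smoother=running_median_curve):
    """Subtract a separately fitted curve c_i(A) within each print-tip group i."""
    groups = defaultdict(list)
    for idx, t in enumerate(tip):
        groups[t].append(idx)
    normalized = list(M)
    for idxs in groups.values():
        a_g = [A[i] for i in idxs]
        m_g = [M[i] for i in idxs]
        c_g = smoother(a_g, m_g)        # curve fitted to this grid only
        for i, c in zip(idxs, c_g):
            normalized[i] = M[i] - c
    return normalized

# Two hypothetical print-tip groups with different constant offsets in M.
tip = [1] * 6 + [2] * 6
A = [7.0 + 0.5 * i for i in range(12)]
M = [1.0] * 6 + [-0.5] * 6
M_norm = print_tip_normalize(M, A, tip)
```

Because each group gets its own curve, a bias that differs between print-tip groups is removed even when a single curve fitted to all spots would average it away.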
23Which spots to use for normalization?
- The LOWESS lines can be run through many
different sets of points, and each strategy has
its own implicit set of assumptions justifying
its applicability.
- For example, we can justify the use of a
global LOWESS approach by supposing that, when
stratified by mRNA abundance, a) only a minority
of genes are expected to be differentially
expressed, or b) any differential expression is
as likely to be up-regulation as down-regulation.
- Pin-group LOWESS requires stronger
assumptions: that one of the above applies within
each pin-group.
- The use of other sets of genes, e.g. control
or housekeeping genes, involves similar
assumptions.
24Use of control spots
Lowess curve
blanks
Positive controls (spotted in varying concentrations)
Negative controls
$M = \log R/G = \log R - \log G$
$A = (\log R + \log G)/2$
25Global scale, global lowess, pin-group lowess
spatial plot after, smooth histograms of M after
26MSP titration series (Microarray Sample Pool)
Pool the whole library
Control set to aid intensity-dependent normalization
Different concentrations
Spotted evenly spread across the slide
27MSP normalization compared to other methods
Orange: Schadt-Wong rank invariant set
Red line: lowess smooth
Yellow: GAPDH, tubulin
Light blue: MSP pool / titration
28Composite normalization
$c_i(A) = \alpha_A\, g(A) + (1 - \alpha_A)\, f_i(A)$
Curves: MSP lowess curve, global lowess curve, composite lowess curve
(other colours: control spots)
Before and after composite normalization
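A small sketch of how the composite curve could be assembled once the two component curves have been evaluated at each spot. The slide does not define the weight, so the code assumes, following Yang et al. (2002), that g is the MSP titration curve, f_i is the other lowess curve, and alpha_A is the fraction of spots with average log intensity below A. The function names and numbers are hypothetical.

```python
from bisect import bisect_left

def alpha_weights(A):
    """Assumed weight: fraction of spots whose A value lies below each spot's A."""
    ranked = sorted(A)
    n = len(ranked)
    return [bisect_left(ranked, a) / n for a in A]

def composite_curve(A, g_fit, f_fit):
    """c_i(A) = alpha_A * g(A) + (1 - alpha_A) * f_i(A), evaluated spot by spot."""
    alphas = alpha_weights(A)
    return [al * g + (1.0 - al) * f for al, g, f in zip(alphas, g_fit, f_fit)]

# Hypothetical fitted values of the two curves at four spots.
A = [8.0, 10.0, 12.0, 14.0]
g_fit = [1.0, 1.0, 1.0, 1.0]   # MSP titration curve
f_fit = [0.0, 0.0, 0.0, 0.0]   # lowess curve through the remaining spots
c = composite_curve(A, g_fit, f_fit)
```

Under this assumed weighting the composite leans on f_i at low intensities and shifts toward the MSP curve g as intensity increases.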
29Comparison of Normalization Schemes (courtesy of
Jason Goncalves)
- No consensus on best segmentation or
normalization method
- Scheme was applied to assess the common
normalization methods
- Based on reciprocal labeling experiment data
for a series of 140 replicate experiments on two
different arrays, each with 19,200 spots
30DESIGN OF RECIPROCAL LABELING EXPERIMENT
Replicate experiment in which we assess the
same mRNA pools but invert the fluors used.
The replicates are independent experiments
and are scanned, quantified and normalized as
usual
31The following relationship would be observed
for reciprocal microarray experiments in which
the slides are free of defects and the
normalization scheme performed ideally
We can measure using real data sets how well each
microarray normalization scheme approaches this
ideal
32Deviation metric to assess normalization schemes
We now use the mean array average deviation to
compare the normalization methods. Note that this
comparison addresses only variance (precision)
and not bias (accuracy) aspects of normalization.
33 34Scale normalization between slides
Boxplots of log ratios from 3 replicate self-self
hybridizations. Left panel: before normalization.
Middle panel: after within print-tip group
normalization. Right panel: after a further
between-slide scale normalization.
35The NCI 60 experiments (no bg)
Some scale normalization seems desirable
36Scale normalization another data set
Log-ratios
- Only small differences in spread apparent. No
action required.
37One way of taking scale into account
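- Assumption: all slides have the same spread
in M.
- True log ratio is $m_{ij}$, where $i$ represents
different slides and $j$ represents different
spots.
- Observed is $M_{ij}$, where
- $M_{ij} = a_i m_{ij}$
- Robust estimate of $a_i$, over the $I$ slides, is
- $\hat{a}_i = \mathrm{MAD}_i / \left( \prod_{i=1}^{I} \mathrm{MAD}_i \right)^{1/I}$
- $\mathrm{MAD}_i = \mathrm{median}_j \{ |y_{ij} - \mathrm{median}_j(y_{ij})| \}$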
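A minimal sketch of MAD-based scale normalization. It assumes the scale factor for each slide is its MAD divided by the geometric mean of the MADs of all slides, as in Yang et al. (2002), and that no slide has a MAD of zero. Function names and the toy log ratios are hypothetical.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])

def scale_factors(slides):
    """a_i = MAD_i divided by the geometric mean of all MADs (assumed form)."""
    mads = [mad(s) for s in slides]
    geo_mean = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [m / geo_mean for m in mads]

def scale_normalize(slides):
    """Divide each slide's log ratios by its estimated scale factor."""
    factors = scale_factors(slides)
    return [[m / a for m in s] for s, a in zip(slides, factors)]

# Two hypothetical slides whose log ratios differ only in spread.
slides = [[-1.0, 0.0, 1.0], [-4.0, 0.0, 4.0]]
factors = scale_factors(slides)
rescaled = scale_normalize(slides)
```

Dividing by the geometric mean keeps the product of the scale factors equal to one, so the slides are brought to a common spread without changing the overall scale of the log ratios. The same functions apply unchanged if each inner list holds the spots of one print-tip group rather than one slide.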
38A slightly harder normalization problem
- Global lowess doesn't do the trick here.
39Print-tip-group normalization helps
40But not completely
There is still a lot of scatter in the middle in
a WT vs KO comparison.
41Effects of previous normalisation
Before normalisation
After print-tip-group normalization
42Within print-tip-group box plots of M
after print-tip-group normalization
43Taking scale into account, cont.
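- Assumption:
- All print-tip-groups have the same spread
in M.
- True log ratio is $m_{ij}$, where $i$ represents
different print-tip-groups and $j$
represents different spots.
- Observed is $M_{ij}$, where
- $M_{ij} = a_i m_{ij}$
- Robust estimate of $a_i$, over the $I$ print-tip-groups, is
- $\hat{a}_i = \mathrm{MAD}_i / \left( \prod_{i=1}^{I} \mathrm{MAD}_i \right)^{1/I}$
- $\mathrm{MAD}_i = \mathrm{median}_j \{ |y_{ij} - \mathrm{median}_j(y_{ij})| \}$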
44Effect of location scale normalization
Clearly care is needed in making decisions like
this one.
45A comparison of three MA-plots
Print-tip normalization
Print-tip scale normalization
Unnormalized
46The same idea on another data set
[Figure: log-ratios by print-tip group]
- After print-tip location and scale
normalization.
47Follow-up experiment
On each slide, half the spots (?8) are
differentially expressed, the other half are not.
48Paired-slides dye-swap
- Slide 1, $M = \log_2 (R/G) - c$
- Slide 2, $M' = \log_2 (R'/G') - c'$
- Combine by subtracting the normalized
log-ratios:
- $[(\log_2 (R/G) - c) - (\log_2 (R'/G') - c')] / 2$
- $\approx [\log_2 (R/G) + \log_2 (G'/R')] / 2$
- $\approx [\log_2 (RG'/GR')] / 2$
- provided $c = c'$.
- Assumption: the normalization functions are the
same for the two slides.
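A one-spot illustration of dye-swap self-normalization, assuming the dye bias is the same on both slides. The function name and the intensities are hypothetical: the true ratio is taken to be 4 and the dye bias doubles the red signal on each slide.

```python
import math

def self_normalized_log_ratio(r, g, r_swap, g_swap):
    """(M - M')/2 for one spot, where M' is measured on the dye-swapped slide."""
    m = math.log2(r / g)
    m_swap = math.log2(r_swap / g_swap)
    return (m - m_swap) / 2.0

# Slide 1: R/G = 4 * 2 = 8, so M = 3.
# Slide 2 (dyes swapped): R'/G' = (1/4) * 2 = 0.5, so M' = -1.
estimate = self_normalized_log_ratio(8.0, 1.0, 1.0, 2.0)
```

The common bias cancels in the difference, so the estimate recovers the true log ratio of 2 without fitting any normalization curve. If the bias differs between the two slides, half of the difference remains in the estimate.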
49Checking the assumption
MA plot for slides 1 and 2: it isn't always like
this.
50Result of self-normalization
$(M - M')/2$ vs. $(A + A')/2$
51Summary of normalization
- Reduces systematic (not random) effects
- Makes it possible to compare several arrays
- Use logratios (MA-plots)
- Lowess normalization (dye bias)
- MSP titration series composite normalization
- Pin-group location normalization
- Pin-group scale normalization
- Between slide scale normalization
- More? Use controls!
- Normalization introduces more variability
- Outliers (bad spots) are handled with replication
52What is missing?
- Principally, a discussion of data quality
issues. Most image analysis programs collect a
wide range of measurements associated with each
spot: morphological measures such as area and
perimeter (in pixels); uniformity measures such
as the SD of foreground and background
intensities in each channel, and of ratios of
intensities (with and without background) across
the pixels in a spot; and spot brightness
indicators such as the ratio of spot foreground
to spot background, and the fraction of pixels in
the foreground with intensity greater than
background intensity (or a given multiple
thereof). From these, further derived measures
can be calculated, such as coefficients of
variation, and so on.
- How should we make use of the various
quality indicators? Most programs include
procedures for flagging spots on the basis of one
or more indicators, and users typically omit
flagged spots from their primary analyses. Data
filtering of this kind clearly improves the
appearance of the data, but... can we do more? That
is a longer story, for another time.
53Acknowledgments
- Jean Yee Hwa Yang (UCB)
- Sandrine Dudoit (UCB)
- Natalie Thorne (WEHI)
- Ingrid Lönnstedt (Uppsala)
- Henrik Bengtsson (Lund)
- Jason Goncalves (Iobion)
- Matt Callow (LLNL)
- Percy Luu (UCB)
- John Ngai (UCB)
- Vivian Peng (UCB)
- Dave Lin (Cornell)
54- Reference
- Yang et al. (2002) Nucleic Acids Research 30,
e15.
- Some web sites
- Technical reports, talks, software etc.
- http://www.stat.berkeley.edu/users/terry/zarray/Html/
- Statistical software R (GNU's S)
- http://www.R-project.org/
- Packages within R environment
- SMA (statistics for microarray analysis):
http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
- Spot: http://www.cmis.csiro.au/iap/spot.htm