Title: 2. Data quality assessment and normalization
12. Data quality assessment and normalization
- Alex Sánchez, Dept. Estadística
- Universitat de Barcelona
2Outline
- Microarray data quality diagnostic plots
- Pre-processing
- Image analysis
- Normalization
3Microarray studies life cycle
Here we are
4Looking at microarray data
5Diagnostic plots
- Plots can be used to check microarray quality and
to select the appropriate normalization
- Many types available
- Image (Fg, Bg) plots, Histograms, Spatial plots,
Box plots, Scatterplots, MA plots
- Qualitative approach
- No threshold
- Needs some experience of seeing bad and good plots
6Red / Green overlay images
- Start by looking at the slides
Bad: high background (bg)
Good: low background (bg)
7Signal/Noise histograms
Images with high background tend to have lower
log2(signal/noise) ratios
8Spatial plots for slide backgrounds
Log-ratios (M)
9Spatial plot of high intensity log ratios
If there are no spatial effects, high intensity
spots should be uniformly distributed
Top (black) and bottom (green) 5% of log ratios
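The selection behind such a plot can be sketched as follows. This is a minimal illustration, not the procedure used for the slide: the function name `extreme_spots`, the 5 x 8 layout and the log-ratio values are all hypothetical.

```python
def extreme_spots(log_ratios, fraction=0.05):
    """Return (bottom, top) index lists for the lowest and highest log ratios."""
    n = len(log_ratios)
    k = max(1, int(round(n * fraction)))  # number of spots in each tail
    order = sorted(range(n), key=lambda i: log_ratios[i])
    return order[:k], order[-k:]

# Hypothetical 5 x 8 array, spots stored row by row (illustrative values only).
n_rows, n_cols = 5, 8
log_ratios = [0.1 * i - 2.0 for i in range(n_rows * n_cols)]

bottom, top = extreme_spots(log_ratios)
# Convert spot indices to (row, column) positions for a spatial plot.
bottom_pos = [divmod(i, n_cols) for i in bottom]
top_pos = [divmod(i, n_cols) for i in top]
```

In this toy example the log ratios increase along the array, so the two tails end up in opposite corners, which is the kind of clustering that would suggest a spatial effect.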
10Scatterplots: always log, always rotate
$\log_2 R$ vs $\log_2 G$
$M = \log_2(R/G)$ vs $A = \log_2\sqrt{RG}$
Instead of plotting $\log_2 R$ vs $\log_2 G$, an M-A plot is better
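As a small sketch of the rotation, M and A can be computed from the two channel intensities as below. The function name `ma_values` and the intensity values are hypothetical, and intensities are assumed to be strictly positive (for example after background correction).

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red/green intensities (all > 0)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    # A = log2(sqrt(R*G)) = (log2 R + log2 G) / 2
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

# Hypothetical intensities for four spots (illustrative values only).
red = [1024.0, 512.0, 4096.0, 256.0]
green = [512.0, 512.0, 1024.0, 1024.0]
m, a = ma_values(red, green)
```

Plotting `m` against `a` gives the MA plot; the same points plotted as log2 R against log2 G lie along the diagonal instead of the horizontal.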
11MA-plot for spotted arrays (2 colors)
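Diagram: mutant (MT) and wild type (WT) samples, labelled with Cy3/Cy5 (cDNA or aRNA), are hybridized to the same spot, giving MT and WT intensity for each probe
$M = \log_2(\mathrm{MT}/\mathrm{WT})$
$A = \log_2(\mathrm{MT} \cdot \mathrm{WT}) / 2$ (signal strength)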
12MA-plot for GeneChip arrays (1 color)
Diagram: MT and WT aRNA samples are each hybridized to their own chip; RMA gives the MT and WT intensity for each probe set
$M = \log_2(\mathrm{MT}/\mathrm{WT})$
$A = \log_2(\mathrm{MT} \cdot \mathrm{WT}) / 2$ (signal strength)
13Pin-group effects
MA-plot: Lowess lines through points from pin groups
Boxplot: Boxplots of log ratios by pin group
14Highlighting pin group effects
Log-ratios
Print-tip groups
Scatterplot and boxplots show a clear spatial
bias which may be associated with sample
preparation because spatially defined groups
are of different colours
15Slide effects
16Normalization
- Addressing systematic bias
17Preprocessing normalization
- The word normalization describes techniques used
to suitably transform the data before they are
analysed.
- The goal is to correct for systematic differences
- between samples on the same slide, or
- between slides,
- which do not represent true biological variation
between samples.
18The origin of systematic differences
- Systematic differences may be due to
- Dye biases which vary with spot intensity,
- Location on the array,
- Plate origin,
- Printing quality which may vary between
- Pins
- Time of printing
- Scanning parameters.
19Dye bias
- Cy3 and Cy5 are relatively unstable; they may
have different incorporation efficiencies
during labeling and different quantum efficiencies,
and they are detected by the scanner with different
efficiencies.
- Normalization is performed to balance the
fluorescence intensities of the two dyes, as well
as to allow the comparison of expression levels
across experiments (slides).
20How to know if it's necessary?
- Look at diagnostic plots for dye, slide or
spatial effects
- Perform a self-self hybridization
- If we hybridize a sample with itself instead of
sample vs control, intensities should be the same
in both channels
- Any deviation from this equality means there is
systematic bias that needs correction
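One way to put a number on such deviations is sketched below. The summary chosen here (median of M and median absolute deviation around it) is an assumption made for illustration, not something prescribed by the slides, and the function name and intensities are hypothetical.

```python
import math
from statistics import median

def self_self_summary(red, green):
    """Return (centre, spread) of M = log2(R/G) for a self-self hybridization."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    centre = median(m)                         # 0 if the channels agree overall
    spread = median(abs(x - centre) for x in m)  # median absolute deviation
    return centre, spread

# Hypothetical self-self intensities where the red channel reads slightly high.
red = [220.0, 440.0, 900.0, 1800.0]
green = [200.0, 400.0, 800.0, 1600.0]
centre, spread = self_self_summary(red, green)
```

A centre clearly away from zero points to an overall dye imbalance; the diagnostic plots are still needed to see whether the bias depends on intensity or location.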
21R vs G plot
DIRECT REPRESENTATION OF INTENSITY VALUES
22log R vs log G
WELL NORMALIZED DATA SHOULD FOLLOW THE DIAGONAL
Y = X
23M vs A
WELL NORMALIZED DATA SHOULD FOLLOW THE HORIZONTAL
Y = 0
24Self-self hybridizations
False color overlay
Boxplots within pin-groups
Scatter (MA-)plots
25Some non self-self hybridizations
From the NCI60 data set
Early Ngai lab, UC Berkeley
Early PMCRI, Melbourne Australia
Early Goodman lab, UC Berkeley
26Normalization methods and issues
- Methods
- Global adjustment
- Median normalization
- Regression based normalization
- Intensity dependent normalization
- Within print-tip group normalization
- And many others
- Selection of spots for normalization
27Global normalization
- Based on a global adjustment
- $\log_2 R/G \rightarrow \log_2 R/G - c = \log_2 R/(kG)$
- Choices for $k$ or $c = \log_2 k$ are
- $c$ = median or mean of log ratios for a particular
gene set
- All genes or control or housekeeping genes.
- Total intensity normalization, where
- $k = \sum_i R_i / \sum_i G_i$.
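The two choices of constant can be sketched as follows, assuming all spots are used as the gene set. The function names and the intensity values are hypothetical and only illustrate the arithmetic.

```python
import math
from statistics import median

def global_median_normalize(m_values):
    """Subtract c = median(M) from every log ratio; return (normalized, c)."""
    c = median(m_values)
    return [m - c for m in m_values], c

def total_intensity_factor(red, green):
    """Return k = sum(R_i) / sum(G_i) for total intensity normalization."""
    return sum(red) / sum(green)

# Hypothetical intensities for five spots (illustrative values only).
red = [200.0, 400.0, 800.0, 1600.0, 3200.0]
green = [100.0, 100.0, 400.0, 400.0, 1600.0]

m_values = [math.log2(r / g) for r, g in zip(red, green)]
normalized, c = global_median_normalize(m_values)
k = total_intensity_factor(red, green)
c_total = math.log2(k)  # the same shift expressed on the log-ratio scale
```

Both variants shift every log ratio by the same amount, so neither can remove a bias that changes with intensity or position on the slide.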
28Example (Callow et al. 2002): Global median normalization
29Regression normalization
- Linear regression (calibration)
- $\log(\mathrm{Cy3}) = a + b \log(\mathrm{Cy5})$
- Use estimates $a$ and $b$ to normalize the data
- Alternative: regression through the origin
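A minimal sketch of the calibration idea follows. How the estimates are applied is an assumption here: the Cy5 log intensities are replaced by their fitted values so that the calibrated channels follow the diagonal. The function name and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; return (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical log intensities with a purely linear dye bias.
log_cy5 = [8.0, 9.0, 10.0, 11.0, 12.0]
log_cy3 = [0.5 + 0.9 * x for x in log_cy5]

a, b = fit_line(log_cy5, log_cy3)
# Assumed normalization step: calibrate Cy5 onto the Cy3 scale.
calibrated_cy5 = [a + b * x for x in log_cy5]
residuals = [y - c for y, c in zip(log_cy3, calibrated_cy5)]
```

With real data the residuals carry the remaining (biological plus noise) differences; here they vanish because the toy bias is exactly linear.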
30Regression normalization
Before normalization
After normalization
31Intensity-dependent normalization
- Dye bias is not linear, as can easily be seen in
an MA plot
- Run a line through the middle of the MA plot,
shifting the M value of the pair (A,M) by $c = c(A)$,
i.e. $\log_2 R/G \rightarrow \log_2 R/G - c(A) = \log_2 R/(k(A)G)$.
- One estimate of $c(A)$ is made using the LOWESS
function of Cleveland (1979): LOcally WEighted
Scatterplot Smoothing.
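The idea of estimating c(A) by local fitting can be sketched in a few lines. This is a simplified local linear smoother with tricube weights in the spirit of LOWESS; it omits the robustness iterations of Cleveland's method, and the function name, the `span` value and the data are assumptions made for illustration.

```python
def local_linear_fit(a, m, span=0.5):
    """Tricube-weighted local linear fit of m on a, evaluated at each a."""
    n = len(a)
    k = max(2, int(round(span * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, a)) / sw
        my = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical MA data with a curved, intensity-dependent dye bias.
a_values = [6.0 + 0.5 * i for i in range(17)]
m_values = [0.1 * (x - 10.0) ** 2 for x in a_values]

c_of_a = local_linear_fit(a_values, m_values)
normalized = [m - c for m, c in zip(m_values, c_of_a)]
```

The normalized values are M minus the fitted curve, so they scatter around the horizontal line M = 0. On real data a tested implementation (for example the `lowess` function in R) would normally be preferred.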
32Intensity-dependent normalization
33Example (Callow et al. 2002): loess vs median normalization
34Example (Callow et al. 2002): Global median normalization
- Global normalization performs a global correction
but it cannot account for spatial effects
- See the next slide: boxplots for the same situation
in only one mouse, showing all sectors
35Global normalisation does not correct spatial
bias (print-tip-sectors)
36Within print-tip group normalization
- To correct for spatial bias produced by
hybridization artefacts or print-tip or plate
effects during the construction of arrays.
- To correct for both print-tip and
intensity-dependent bias, perform LOWESS fits to
the data within print-tip groups, i.e.
- $\log_2 R/G \rightarrow \log_2 R/G - c_i(A) = \log_2 R/(k_i(A)G)$,
where $c_i(A)$ is the LOWESS fit to the MA-plot for
the $i$th grid only.
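A sketch of the per-group version is given below: spots are split by print-tip group and a separate curve is fitted and subtracted within each group. The smoother is a simplified tricube-weighted local linear fit standing in for LOWESS (no robustness iterations); the function names, the `span` value and the data are hypothetical.

```python
from collections import defaultdict

def local_linear_fit(a, m, span=0.7):
    """Tricube-weighted local linear fit of m on a, evaluated at each a."""
    k = max(2, int(round(span * len(a))))
    fitted = []
    for x0 in a:
        h = sorted(abs(x - x0) for x in a)[k - 1] or 1e-12
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, a)) / sw
        my = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, a, m))
        fitted.append(my + (sxy / sxx if sxx > 0 else 0.0) * (x0 - mx))
    return fitted

def print_tip_normalize(a, m, tip, span=0.7):
    """Subtract a separately fitted curve c_i(A) within each print-tip group."""
    groups = defaultdict(list)
    for idx, t in enumerate(tip):
        groups[t].append(idx)
    normalized = [0.0] * len(m)
    for idxs in groups.values():
        fit = local_linear_fit([a[i] for i in idxs], [m[i] for i in idxs], span)
        for i, c in zip(idxs, fit):
            normalized[i] = m[i] - c
    return normalized

# Hypothetical data: two print-tip groups with different dye-bias trends.
a_values = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0] * 2
tip = [1] * 7 + [2] * 7
m_values = ([0.5 + 0.1 * (x - 10.0) for x in a_values[:7]]
            + [-0.4 - 0.05 * (x - 10.0) for x in a_values[7:]])

normalized = print_tip_normalize(a_values, m_values, tip)
```

A single global fit would average the two opposite trends; fitting within groups removes each one, at the cost of relying on fewer spots per fit.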
37Local print-tip normalisation corrects spatial
bias (print-tip-sectors)
38Normalization, which spots to use?
- LOWESS can be run through many different sets of
points:
- All genes on the array.
- Constantly expressed genes (housekeeping).
- Controls.
- Spiked controls (genes from distant species).
- Genomic DNA titration series.
- Rank invariant set.
39Strategies for selecting a set of spots for
normalization
- Use of a global LOWESS approach can be justified
by supposing that, when stratified by mRNA
abundance,
- Only a minority of genes are expected to be
differentially expressed,
- Any differential expression is as likely to be
up-regulation as down-regulation.
- Pin-group LOWESS requires stronger assumptions:
that one of the above applies within each
pin-group.
40Summary
- Microarray experiments have many hot spots
where errors or systematic biases can appear
- Visual and numerical quality control should be
performed
- Usually intensities will require normalisation
- At least global or intensity dependent
normalisation should be performed
- More sophisticated procedures rely on stronger
assumptions, so one must look for a balance
41Acknowledgments
- Special thanks to Yee Hwa Yang (UCSF) for
allowing me to use some of her materials
- Sandrine Dudoit and Terry Speed, U.C. Berkeley
- M. Carme Ruíz de Villa, U. Barcelona
- Sara Marsal, U. Reumatología, HVH Barcelona