Preprocessing of cDNA microarray data

1
Preprocessing of cDNA microarray data
  • Lecture 19, Statistics 246,
  • April 1, 2004

2

Begin by looking at the data
  • Was the experiment a success?
  • What analysis tools should be used?
  • Are there any specific problems?

3
Red/Green overlay images
Co-registration and overlay offer a quick
visualization, revealing information on color
balance, uniformity of hybridization, spot
uniformity, background, and artifacts such as
dust or scratches.
Good: low background (bg), lots of differential expression (d.e.)
Bad: high bg, ghost spots, little d.e.
4
Always log, always rotate
$\log_2 R$ vs $\log_2 G$
$M = \log_2 (R/G)$ vs $A = \log_2 \sqrt{RG}$
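The "log and rotate" step can be sketched in a few lines. This is a minimal illustration in plain Python, assuming background-corrected, strictly positive red and green intensities; the function name `ma_values` and the sample numbers are hypothetical, not part of the lecture.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red/green spot intensities.

    M = log2(R/G) is the log-ratio; A = (log2 R + log2 G) / 2 is the
    average log-intensity, i.e. log2 of sqrt(R*G).
    """
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

# Hypothetical intensities for three spots.
red = [1024.0, 256.0, 4096.0]
green = [512.0, 256.0, 1024.0]
M, A = ma_values(red, green)
```

Plotting M against A is the 45-degree rotation (plus rescaling) of the log2 R versus log2 G scatterplot, which makes intensity-dependent dye bias easier to see.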
5
Histograms
Signal/Noise = $\log_2$(spot intensity / background intensity)
6
Boxplots of $\log_2 (R/G)$
Liver samples from 16 mice: 8 WT, 8 ApoAI KO.
7
Spatial plots: background from the two slides
8
Highlighting extreme log ratios
Top (black) and bottom (green) 5% of log ratios
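Selecting the spots to highlight is a simple ranking exercise. The sketch below assumes the "5" on this slide means the top and bottom 5% of log-ratios; the function name `extreme_spots` and the data are hypothetical.

```python
def extreme_spots(m_values, fraction=0.05):
    """Return (bottom, top) lists of spot indices with the most extreme log-ratios.

    `fraction` is the share of spots taken from each tail (assumed 5% here).
    """
    n = len(m_values)
    k = max(1, int(round(fraction * n)))      # number of spots per tail
    order = sorted(range(n), key=lambda i: m_values[i])
    return order[:k], order[-k:]

# Hypothetical log-ratios for 20 spots.
m_values = [0.1, -2.3, 0.0, 1.9, -0.2, 0.3, 0.05, -0.1, 0.2, 3.1,
            -0.3, 0.15, -1.8, 0.4, -0.05, 0.25, -0.15, 0.35, 2.2, -0.4]
bottom, top = extreme_spots(m_values)
```

The returned indices can then be mapped back to spot coordinates on the slide, so that spatial clustering of extreme values becomes visible.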
9
Boxplots and highlighting
[Figure labels: Log-ratios, Print-tip groups, pin group]
  • Clear example of spatial bias (here high
    is red, low green)

10
Pin group (sub-array) effects
Boxplots of log ratios by pin group
Lowess lines through points from pin groups
11
Plate effects
12
KO 8
Probes: 6,000 cDNAs, including 200 related to
lipid metabolism. Arranged in a 4x4
array of 19x21 sub-arrays.
13
Time of printing effects
spot number
Green channel intensities (log2G). Printing over
4.5 days. The previous slide depicts a slide from
this print run.
14
Normalization
  • Why?
  • To correct for systematic differences between
    samples on the same slide, or between slides,
    which do not represent true biological variation
    between samples.
  • How do we know it is necessary?
  • By examining self-self hybridizations, where
    no true differential expression is occurring.
  • We find dye biases which vary with overall spot
    intensity, location on the array, plate origin,
    pins, scanning parameters, etc.

15
Self-self hybridizations
False color overlay
Boxplots within pin-groups
Scatter (MA-)plots
16
A series of non self-self hybridizations
From the NCI60 data set (Stanford web site)
17
Early Ngai lab, UC Berkeley
18
Early Goodman lab, UC Berkeley
19
From the Ernest Gallo Clinic Research Center
20
Early PMCRI, Melbourne, Australia
21
Normalization methods
  • a) Normalization based on a global
    adjustment:
  • $\log_2 (R/G) \rightarrow \log_2 (R/G) - c = \log_2 \big(R/(kG)\big)$
  • Choices for $k$ or $c = \log_2 k$ are: $c$ = median or
    mean of the log ratios for a particular gene set
    (e.g. housekeeping genes); or total intensity
    normalization, where $k = \sum_i R_i / \sum_i G_i$.
  • b) Intensity-dependent normalization.
  • Here we run a line through the middle of the
    MA plot, shifting the M value of the pair (A, M)
    by $c = c(A)$, i.e.
  • $\log_2 (R/G) \rightarrow \log_2 (R/G) - c(A) = \log_2 \big(R/(k(A)G)\big)$.
  • One estimate of $c(A)$ is made using the
    LOWESS function of Cleveland (1979): LOcally
    WEighted Scatterplot Smoothing.
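
The two adjustments on this slide can be sketched as follows. This is an illustration only: `local_linear_fit` is a simplified tricube-weighted local linear smoother written for this example, without the robustness iterations of Cleveland's LOWESS, and all function names and the `frac` parameter are assumptions rather than the lecture's code.

```python
import math
import statistics

def global_shift(m):
    """Global normalization: subtract c = median of the log-ratios."""
    c = statistics.median(m)
    return [x - c for x in m]

def total_intensity_c(red, green):
    """c = log2 k with k = sum(R_i) / sum(G_i) (total intensity normalization)."""
    return math.log2(sum(red) / sum(green))

def local_linear_fit(a, m, frac=0.5):
    """Estimate c(A) at every spot by tricube-weighted local linear regression.

    Simplified stand-in for LOWESS: one pass, no robustness weights.
    """
    n = len(a)
    k = max(2, int(round(frac * n)))           # neighbours used per local fit
    fitted = []
    for a0 in a:
        dists = sorted(abs(x - a0) for x in a)
        h = dists[k - 1] or 1e-12              # bandwidth: k-th smallest distance
        w = [(1 - min(abs(x - a0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        a_bar = sum(wi * x for wi, x in zip(w, a)) / sw
        m_bar = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - a_bar) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - a_bar) * (y - m_bar) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(m_bar + slope * (a0 - a_bar))
    return fitted

def intensity_dependent_normalize(a, m, frac=0.5):
    """Return M - c(A) for every spot."""
    c = local_linear_fit(a, m, frac)
    return [y - ci for y, ci in zip(m, c)]
```

In practice an established LOWESS implementation with robustness iterations would be preferable; the point here is only that c is a constant in (a) and a smooth function of A in (b).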

22
Normalization methods
  • c) Within print-tip group normalization.
  • In addition to intensity-dependent variation
    in log ratios, spatial bias can also be a
    significant source of systematic error.
  • Most normalization methods do not correct
    for spatial effects produced by hybridization
    artifacts or print-tip or plate effects
    during the construction of the microarrays.
  • It is possible to correct for both print-tip
    and intensity-dependent bias by performing LOWESS
    fits to the data within print-tip groups, i.e.
  • $\log_2 (R/G) \rightarrow \log_2 (R/G) - c_i(A) = \log_2 \big(R/(k_i(A)G)\big)$,
  • where $c_i(A)$ is the LOWESS fit to the
    MA-plot for the $i$th grid only.
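
Within print-tip-group normalization amounts to fitting a separate curve for each grid and subtracting it. The sketch below assumes each spot carries a print-tip label; the smoother is the same simplified local linear fit (no robustness iterations) used purely as a stand-in for LOWESS, and every name in it is hypothetical.

```python
from collections import defaultdict

def local_linear_fit(a, m, frac=0.5):
    """Tricube-weighted local linear fit of M on A (simplified LOWESS stand-in)."""
    k = max(2, int(round(frac * len(a))))
    fitted = []
    for a0 in a:
        h = sorted(abs(x - a0) for x in a)[k - 1] or 1e-12
        w = [(1 - min(abs(x - a0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        a_bar = sum(wi * x for wi, x in zip(w, a)) / sw
        m_bar = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - a_bar) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - a_bar) * (y - m_bar) for wi, x, y in zip(w, a, m))
        fitted.append(m_bar + (sxy / sxx if sxx > 0 else 0.0) * (a0 - a_bar))
    return fitted

def print_tip_normalize(a, m, tip, frac=0.5):
    """Subtract a separately fitted curve c_i(A) within each print-tip group i."""
    groups = defaultdict(list)
    for idx, t in enumerate(tip):
        groups[t].append(idx)                  # spot indices per print-tip group
    normalized = [0.0] * len(m)
    for idx_list in groups.values():
        a_g = [a[i] for i in idx_list]
        m_g = [m[i] for i in idx_list]
        c_g = local_linear_fit(a_g, m_g, frac)
        for i, c in zip(idx_list, c_g):
            normalized[i] = m[i] - c
    return normalized
```

Because each group is fitted on its own spots only, groups with few spots give noisier curves, which is one reason the assumptions behind this method need checking.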

23
Which spots to use for normalization?
  • The LOWESS lines can be run through many
    different sets of points, and each strategy has
    its own implicit set of assumptions justifying
    its applicability.
  • For example, we can justify the use of a
    global LOWESS approach by supposing that, when
    stratified by mRNA abundance, a) only a minority
    of genes are expected to be differentially
    expressed, or b) any differential expression is
    as likely to be up-regulation as down-regulation.
  • Pin-group LOWESS requires stronger
    assumptions: that one of the above applies within
    each pin-group.
  • The use of other sets of genes, e.g. control
    or housekeeping genes, involves similar
    assumptions.

24
Use of control spots
Lowess curve
blanks
Positive controls (spotted in varying concentrations)
Negative controls
$M = \log (R/G) = \log R - \log G$
$A = (\log R + \log G)/2$
25
Global scale, global lowess, pin-group lowess:
spatial plot after, smooth histograms of M after
26
MSP titration series (Microarray Sample Pool)
Pool the whole library
Control set to aid intensity-dependent normalization
Different concentrations
Spotted evenly spread across the slide
27
MSP normalization compared to other methods
Orange: Schadt-Wong rank invariant set
Red line: lowess smooth
Yellow: GAPDH, tubulin
Light blue: MSP pool / titration
28
Composite normalization
$c_i(A) = \alpha_A\, g(A) + (1 - \alpha_A)\, f_i(A)$
- MSP lowess curve
- Global lowess curve
- Composite lowess curve
(Other colours: control spots)
Before and after composite normalization
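The composite curve is a weighted blend of two fitted curves, with the weight depending on intensity. The slide does not define its symbols, so this sketch makes explicit assumptions based on Yang et al. (2002): g(A) is taken to be the MSP-based fit, f_i(A) the fit through all spots of group i, and alpha_A the proportion of spots with intensity below A. All function names are hypothetical.

```python
from bisect import bisect_left

def make_alpha(all_a):
    """Return alpha(A): fraction of spots whose A value is below the argument.

    Assumption: alpha_A is the empirical proportion of spots with intensity < A.
    """
    sorted_a = sorted(all_a)
    n = len(sorted_a)
    return lambda a0: bisect_left(sorted_a, a0) / n

def composite_curve(g, f_i, alpha):
    """Return c_i(A) = alpha(A) * g(A) + (1 - alpha(A)) * f_i(A)."""
    return lambda a0: alpha(a0) * g(a0) + (1.0 - alpha(a0)) * f_i(a0)

# Hypothetical constant curves, just to show the blending.
g = lambda a0: 0.2        # assumed MSP-based curve
f_i = lambda a0: -0.4     # assumed all-spot curve for group i
alpha = make_alpha([8.0, 9.0, 10.0, 11.0])
c_i = composite_curve(g, f_i, alpha)
```

Under these assumptions the all-spot curve dominates at low intensities, where most spots lie, and the control-based curve takes over at high intensities, where spots are sparse.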
29
Comparison of Normalization Schemes (courtesy of
Jason Goncalves)
  • No consensus on best segmentation or
    normalization method
  • Scheme was applied to assess the common
    normalization methods
  • Based on reciprocal labeling experiment data
    for a series of 140 replicate experiments on two
    different arrays each with 19,200 spots

30
DESIGN OF RECIPROCAL LABELING EXPERIMENT
Replicate experiment in which we assess the
same mRNA pools but invert the fluors used.
The replicates are independent experiments
and are scanned, quantified and normalized as
usual
31
The following relationship would be observed
for reciprocal microarray experiments in which
the slides are free of defects and the
normalization scheme performed ideally

We can measure using real data sets how well each
microarray normalization scheme approaches this
ideal
32
Deviation metric to assess normalization schemes
We now use the mean array average deviation to
compare the normalization methods. Note that this
comparison addresses only variance (precision)
and not bias (accuracy) aspects of normalization.
33

34
Scale normalization between slides
Boxplots of log ratios from 3 replicate self-self
hybridizations. Left panel: before normalization.
Middle panel: after within print-tip group
normalization. Right panel: after a further
between-slide scale normalization.
35
The NCI 60 experiments (no bg)
Some scale normalization seems desirable
36
Scale normalization: another data set
Log-ratios
  • Only small differences in spread apparent. No
    action required.

37
One way of taking scale into account
  • Assumption: all slides have the same spread
    in M.
  • True log ratio is $m_{ij}$, where $i$ represents
    different slides and $j$ represents different
    spots.
  • Observed is $M_{ij}$, where
  • $M_{ij} = a_i m_{ij}$
  • Robust estimate of $a_i$ is
  • $\mathrm{MAD}_i = \mathrm{median}_j \{ | y_{ij} - \mathrm{median}_j(y_{ij}) | \}$
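
A MAD-based scale adjustment can be sketched as below. The slide gives only the MAD itself; dividing each MAD by the geometric mean of all MADs to obtain the scale factor is an assumption taken from Yang et al. (2002), and the function names are hypothetical.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation: median_j |y_j - median_j(y_j)|."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def scale_normalize(slides):
    """Rescale log-ratios so that every slide has the same MAD.

    `slides` is a list of lists of log-ratios, one inner list per slide i.
    Assumption: a_i is estimated as MAD_i divided by the geometric mean of
    all MADs. Requires every MAD to be strictly positive.
    """
    mads = [mad(s) for s in slides]
    geo_mean = math.exp(sum(math.log(x) for x in mads) / len(mads))
    factors = [x / geo_mean for x in mads]           # estimated a_i
    scaled = [[v / a for v in s] for s, a in zip(slides, factors)]
    return scaled, factors

# Two hypothetical slides; the second has four times the spread of the first.
slides = [[-1.0, 0.0, 1.0, 2.0, -2.0], [-4.0, 0.0, 4.0, 8.0, -8.0]]
scaled, factors = scale_normalize(slides)
```

The same function applies unchanged when the inner lists are print-tip groups within one slide rather than separate slides.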

38
A slightly harder normalization problem
  • Global lowess doesn't do the trick here.

39
Print-tip-group normalization helps
40
But not completely
There is still a lot of scatter in the middle in
a WT vs KO comparison.
41
Effects of previous normalisation
Before normalisation
After print-tip-group normalization
42
Within print-tip-group box plots of M
after print-tip-group normalization
43
Taking scale into account, cont.
  • Assumption:
  • All print-tip-groups have the same spread
    in M.
  • True log ratio is $m_{ij}$, where $i$ represents
    different print-tip-groups and $j$
    represents different spots.
  • Observed is $M_{ij}$, where
  • $M_{ij} = a_i m_{ij}$
  • Robust estimate of $a_i$ is
  • $\mathrm{MAD}_i = \mathrm{median}_j \{ | y_{ij} - \mathrm{median}_j(y_{ij}) | \}$

44
Effect of location and scale normalization
Clearly care is needed in making decisions like
this one.
45
A comparison of three MA-plots
Print-tip normalization
Print-tip scale normalization
Unnormalized
46
The same idea on another data set
Log-ratios
Print-tip groups
  • After print-tip location and scale
    normalization.

47
Follow-up experiment
On each slide, half the spots (?8) are
differentially expressed, the other half are not.
48
Paired-slides dye-swap
  • Slide 1: $M = \log_2 (R/G) - c$
  • Slide 2: $M' = \log_2 (R'/G') - c'$
  • Combine by subtracting the normalized
    log-ratios:
  • $[(\log_2 (R/G) - c) - (\log_2 (R'/G') - c')] / 2$
  • $\approx [\log_2 (R/G) + \log_2 (G'/R')] / 2$
  • $= [\log_2 (RG'/(GR'))] / 2$
  • provided $c \approx c'$.
  • Assumption: the normalization functions are the
    same for the two slides.
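
The cancellation behind dye-swap self-normalization can be checked numerically. In this sketch the inputs are the raw log-ratios from the two slides of a dye-swap pair; the function name and the numbers are hypothetical, and the example assumes the same additive dye bias on both slides.

```python
def self_normalize(m_slide1, m_slide2):
    """Combine a dye-swap pair spot by spot as (M - M') / 2.

    If both slides carry the same additive bias c, it cancels in the
    difference, leaving an estimate of the true log-ratio.
    """
    return [0.5 * (m - m_prime) for m, m_prime in zip(m_slide1, m_slide2)]

# Hypothetical pair: true log-ratios 2.0 and 0.0, common dye bias c = 0.5.
m1 = [2.0 + 0.5, 0.0 + 0.5]      # slide 1: true value plus bias
m2 = [-2.0 + 0.5, -0.0 + 0.5]    # slide 2 (dyes swapped): sign flips, bias stays
combined = self_normalize(m1, m2)
```

If the two slides had different biases, half of their difference would remain in the combined values, which is why the assumption needs checking on the MA-plots.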

49
Checking the assumption
MA plot for slides 1 and 2: it isn't always like
this.

50
Result of self-normalization
$(M - M')/2$ vs. $(A + A')/2$
51
Summary of normalization
  • Reduces systematic (not random) effects
  • Makes it possible to compare several arrays
  • Use log-ratios (MA-plots)
  • Lowess normalization (dye bias)
  • MSP titration series composite normalization
  • Pin-group location normalization
  • Pin-group scale normalization
  • Between slide scale normalization
  • More? Use controls!
  • Normalization introduces more variability
  • Outliers (bad spots) are handled with replication

52
What is missing?
  • Principally, a discussion of data quality
    issues. Most image analysis programs collect a
    wide range of measurements associated with each
    spot: morphological measures such as area and
    perimeter (in pixels); uniformity measures such
    as the SD of foreground and background
    intensities in each channel, and of ratios of
    intensities (with and without background) across
    the pixels in a spot; and spot brightness
    indicators such as the ratio of spot foreground
    to spot background, and the fraction of pixels in
    the foreground with intensity greater than
    background intensity (or a given multiple
    thereof). From these, further derived measures
    can be calculated, such as coefficients of
    variation, and so on.
  • How should we make use of the various
    quality indicators? Most programs include
    procedures for flagging spots on the basis of one
    or more indicators, and users typically omit
    flagged spots from their primary analyses. Data
    filtering of this kind clearly improves the
    appearance of the data, but... can we do more? That
    is a longer story, for another time.

53
Acknowledgments
  • Jean Yee Hwa Yang (UCB)
  • Sandrine Dudoit (UCB)
  • Natalie Thorne (WEHI)
  • Ingrid Lönnstedt (Uppsala)
  • Henrik Bengtsson (Lund)
  • Jason Goncalves (Iobion)
  • Matt Callow (LLNL)
  • Percy Luu (UCB)
  • John Ngai (UCB)
  • Vivian Peng (UCB)
  • Dave Lin (Cornell)

54
  • Reference
  • Yang et al. (2002) Nucleic Acids Research 30,
    e15.
  • Some web sites
  • Technical reports, talks, software etc.
  • http://www.stat.berkeley.edu/users/terry/zarray/Html/
  • Statistical software R (GNU's S)
  • http://www.R-project.org/
  • Packages within R environment
  • -- SMA (statistics for microarray analysis)
    http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
  • -- Spot http://www.cmis.csiro.au/iap/spot.htm