Corrections and Normalization in microarray data analysis



1
Corrections and Normalization in
microarray data analysis
  • Mauro Delorenzi

2
Acknowledgments
  • Uni. Cal. Statistics Berkeley / WEHI
    Bioinformatics
  • Terry Speed (Berkeley / WEHI)
  • Yee Hwa Yang (Berkeley)
  • Sandrine Dudoit (Stanford)
  • Ingrid Lönnstedt (Uppsala)
  • Yongchao Ge (Berkeley)
  • Natalie Thorne (WEHI)
  • Mauro Delorenzi (WEHI)

Most slides were taken from our collection
  • Collaborations with
  • Peter Mac CI, Melb.
  • Brown-Botstein lab, Stanford
  • Matt Callow (LBNL)
  • CSIRO Image Analysis Group

3
Workflow, from question to interpretation:
  • Biological question (gene regulation, class
    prediction)
  • Experimental design
  • Microarray experiment, giving 16-bit TIFF files
  • Image analysis, giving (Rfg, Rbg), (Gfg, Gbg)
  • Normalization, giving R, G
  • Estimation, testing, clustering, discrimination
  • Biological verification and interpretation
4
[Diagram: the cDNA microarray experiment]
  • cDNA clones (probes): PCR product amplification,
    purification, then printing onto the microarray
    (0.1 nl/spot)
  • mRNA (target): hybridise target to microarray
  • Scanning: excitation by laser 1 and laser 2,
    emission
  • Overlay images and normalise
  • Analysis
5
Scanner's Spots
Part of the image of one channel, false-coloured
on a scale from white (very high) through red (high),
yellow and green (medium) and blue (low) to black.
6
Gene Expression Data
  • Gene expression data on p genes for n samples

Genes (rows) by slides (columns):

Gene   slide 1  slide 2  slide 3  slide 4  slide 5  ...
1         0.46     0.30     0.80     1.51     0.90  ...
2        -0.10     0.49     0.24     0.06     0.46  ...
3         0.15     0.74     0.04     0.10     0.20  ...
4        -0.45    -1.03    -0.79    -0.56    -0.32  ...
5        -0.06     1.06     1.35     1.09    -1.09  ...

Gene expression level of gene 5 in slide 4 (1.09 in this table)
= Log2(Red intensity / Green intensity)
These values are conventionally displayed on a
red (> 0), yellow (0), green (< 0) scale.
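
As a small illustration of the data layout on this slide, the matrix can be held as one row per gene and one column per slide. The function names `log_ratio` and `display_colour` are hypothetical helpers introduced here, not part of any package mentioned in the talk.

```python
# Log-ratios M = log2(R/G) copied from the slide: rows are genes, columns are slides.
expression = [
    [0.46, 0.30, 0.80, 1.51, 0.90],       # gene 1
    [-0.10, 0.49, 0.24, 0.06, 0.46],      # gene 2
    [0.15, 0.74, 0.04, 0.10, 0.20],       # gene 3
    [-0.45, -1.03, -0.79, -0.56, -0.32],  # gene 4
    [-0.06, 1.06, 1.35, 1.09, -1.09],     # gene 5
]


def log_ratio(gene, slide):
    """Return M for a 1-based gene and slide number, as numbered on the slide."""
    return expression[gene - 1][slide - 1]


def display_colour(m):
    """Conventional display colour: red for M > 0, green for M < 0, yellow for 0."""
    if m > 0:
        return "red"
    if m < 0:
        return "green"
    return "yellow"


print(log_ratio(5, 4), display_colour(log_ratio(5, 4)))  # 1.09 red
```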
7
Some statistical questions
  • Image analysis: addressing, segmenting,
    quantifying
  • Normalisation: within and between slides
  • Quality: of images, of spots, of (log) ratios
  • Which genes are (relatively) up/down regulated?
  • Assigning p-values to tests / confidence to
    results
  • Planning of experiments: design, sample size
  • Discrimination and allocation of samples
  • Clustering, classification: of samples, of genes
  • Selection of genes relevant to any given analysis
  • Analysis of time course, factorial and other
    special experiments
  • ... and more

8
I. The simplest problem is identifying
differentially expressed genes using one slide
  • This is a common enough hope
  • Efforts are frequently successful
  • It is not hard to do by eye
  • The problem is probably beyond formal statistical
    inference (valid p-values, etc) for the
    foreseeable future.

9
Objectives
  • Important aspects of a statistical analysis
    include
  • Tentatively separating systematic sources of
    variation ("artefacts"), that bias the results,
    from random sources of variation ("noise"), that
    hide the truth.
  • Removing the former and quantifying the latter
  • Identifying and dealing with the most relevant
    source of variation in subsequent analyses
  • Only if this is done can we hope to make more or
    less valid probability statements about the
    confidence in the results
  • Every correction is a new source of variability.
    There is a trade-off between gains and losses.
    The best method depends on the characteristics
    of the data, and these can vary.

10
Typical Statistical Approach
  • Measured value =
    real value + systematic errors + noise
  • Corrected value =
    real value + noise
  • Analysis of corrected value =>
    (unbiased) CONCLUSIONS
  • Estimation of noise =>
    quality of CONCLUSIONS, statistical
    significance (level of confidence) of the
    conclusions

11
Step 1 Background Correction
  • Image analysis => Rfg, Rbg, Gfg, Gbg (fg =
    foreground, bg = background). For each spot on
    the slide we calculate:
  • Red intensity R = Rfg - Rbg
  • Green intensity G = Gfg - Gbg
  • M = Log2(Red intensity / Green intensity)
  • Subtraction of background values (additive
    background model, with the background assumed
    to be locally constant)
  • Sources of background: probe sticking
    unspecifically to the slide, irregular / dirty
    slide surface, dust, noise in the scanner
    measurement
  • Not included: real cross-hybridisation and
    unspecific hybridisation to the probe
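
A minimal sketch of Step 1 for a single spot, assuming the four numbers come from image analysis. The intensities in the example are made up, and returning `None` for non-positive corrected intensities is a choice made here for illustration, not something the slides prescribe.

```python
import math


def background_correct(fg, bg):
    """Additive model: subtract the local background estimate from the foreground."""
    return fg - bg


def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Return M = log2(R / G), or None when a corrected intensity is not positive."""
    r = background_correct(r_fg, r_bg)
    g = background_correct(g_fg, g_bg)
    if r <= 0 or g <= 0:
        return None  # the log-ratio is undefined; handling such spots is a separate decision
    return math.log2(r / g)


# Hypothetical spot, given as (Rfg, Rbg, Gfg, Gbg).
print(log_ratio(1100, 100, 600, 100))  # R = 1000, G = 500, so M = 1.0
```

Spots whose background exceeds the foreground in either channel have no defined log-ratio, which is one way the background estimate can have a large impact on the final data.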

12
  • The intensity pairs (R, G) are highly processed
    data, and the methods of image processing and
    background correction of the laser scan images
    can have a large impact. Before applying
    normalisation, inference, cluster analysis and
    the like, it is important to identify and remove
    systematic sources of variation, such as those
    due to different labeling efficiencies and
    scanning properties of the two dyes, or spatial
    inhomogeneities.
  • With many different users and protocols, the
    portion of the variation due to systematic
    effects can vary substantially.
  • There are many sources of systematic variation
    which affect the measured gene expression levels.
    Normalisation is the term used to describe the
    process of removing such variation.
  • Until the variation is properly accounted for or
    modelled, there is no question of the system
    being in statistical control and hence no basis
    for a statistical model to describe chance
    variation.

13
Step 2: An M vs A (MVA) Plot
M = log(R/G) = log R - log G
A = (log R + log G) / 2
[Plot legend: lowess curve; blanks; positive controls
(spotted in varying concentrations); negative controls]
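
A short sketch of the two plot coordinates, with M the difference of the log intensities and A their average. Base-2 logarithms are assumed here, matching the Log2 used elsewhere in the talk; the intensities are made up.

```python
import math


def ma_values(r, g):
    """Return (M, A) for background-corrected intensities R and G (both > 0)."""
    log_r = math.log2(r)
    log_g = math.log2(g)
    m = log_r - log_g         # M = log R - log G
    a = (log_r + log_g) / 2   # A = (log R + log G) / 2
    return m, a


# Equal intensities give M = 0; a four-fold change gives M = +2 or -2.
for r, g in [(1024, 1024), (4096, 1024), (256, 1024)]:
    print(r, g, ma_values(r, g))
```

On this scale a four-fold increase and a four-fold decrease sit at the same distance from zero, which is part of why the plot is easier to read than a plot of R against G.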
14
A reminder on logarithms
15
A numerical example
16
Why use an M vs A plot?
  1. Logs stretch out the region we are most
    interested in.
  2. Can more clearly see features of the data such as
    intensity-dependent variation and dye bias.
  3. Differentially expressed genes are more easily
    identified.
  4. Intuitive interpretation

17
MVA plot: looking at data 1
[Plot labels: spot identifier; lowess curve]
S1.n. Control slide: dye effect, spread.
18
MVA plot: looking at data 2
S1.p. Normalised data: spread.
19
MVA plot: looking at data 3
S4. A-dependent variability.
20
MVA plot: analysing data 4
S17. Saturation
21
MVA plot: looking at data 5. Unique effects of
different scanners
22
Step 3: Normalisation - median
  • Assumption: changes roughly symmetric
  • First panel: smooth density of log2G and log2R.
  • Second panel: M vs A plot with median put to zero
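
Median normalisation can be sketched as a single shift of all log-ratios on a slide. The M values below are made up, with an overall shift of the kind a dye bias would produce.

```python
from statistics import median


def median_normalise(m_values):
    """Shift all log-ratios on one slide so that their median is zero."""
    centre = median(m_values)
    return [m - centre for m in m_values]


m_raw = [0.9, 0.4, 0.6, 1.5, 0.5]   # hypothetical M values from one slide
m_norm = median_normalise(m_raw)
print(median(m_raw), median(m_norm))  # 0.6 0.0
```

The shift is the same at every intensity, so this step cannot remove a bias that changes with A.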

23
Step 4: Normalisation - lowess
  • Assumption: changes roughly symmetric at all
    intensities.
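
The idea of intensity-dependent normalisation is to estimate the trend of M as a function of A and subtract it. The sketch below is a simplified stand-in for lowess: a tricube-weighted local straight-line fit with no robustness iterations, written with the standard library only. The function names, the `frac` default and the data are assumptions made for illustration; a real analysis would use an established lowess implementation.

```python
def local_linear_fit(a_values, m_values, a0, frac=0.5):
    """Fit a tricube-weighted straight line around a0 and return its value at a0.

    a0 is expected to be one of the observed A values.
    """
    n = len(a_values)
    k = max(2, int(round(frac * n)))            # number of neighbours in the window
    dists = sorted(abs(a - a0) for a in a_values)
    h = dists[k - 1] or 1e-12                   # window half-width
    sw = swa = swm = swaa = swam = 0.0
    for a, m in zip(a_values, m_values):
        u = abs(a - a0) / h
        w = (1 - u ** 3) ** 3 if u < 1 else 0.0  # tricube weight
        sw += w
        swa += w * a
        swm += w * m
        swaa += w * a * a
        swam += w * a * m
    denom = sw * swaa - swa * swa
    if abs(denom) < 1e-12:                      # degenerate window: use the weighted mean
        return swm / sw
    slope = (sw * swam - swa * swm) / denom
    intercept = (swm - slope * swa) / sw
    return intercept + slope * a0


def lowess_normalise(a_values, m_values, frac=0.5):
    """Subtract the fitted intensity-dependent trend from each M value."""
    return [m - local_linear_fit(a_values, m_values, a, frac)
            for a, m in zip(a_values, m_values)]


# Hypothetical slide where the bias grows with intensity: M = 0.2 * A - 1.
a_values = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
m_values = [0.2 * a - 1.0 for a in a_values]
m_norm = lowess_normalise(a_values, m_values)
print(max(abs(m) for m in m_norm) < 1e-9)
```

Because the trend is estimated from all spots, the correction relies on the stated assumption that changes are roughly symmetric at every intensity.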

24
A hypothetical quantitative model
a. linear response
25
A realistic hypothetical quantitative model
b. power function-response
26
Step 5: Normalisation - between groups
[Boxplots: log-ratios by print-tip group]
  • After within slide global lowess normalization.
  • Likely to be a spatial effect.

27
Normalization between groups (ctd)
[Boxplots: log-ratios by print-tip group]
  • After print-tip location and scale
    normalization.

28
Effects of Location Normalisation (example)
[Figure panels: before; after]
29
Taking varying scale into account
Step 6: Rescaling (spread normalisation)
  • Assumption:
  • All (print-tip) groups should have the same
    spread in M
  • True ratio is μij, where i indexes the
    (print-tip) groups and j indexes the
    spots. Observed is Mij, where Mij = ai
    log(μij)
  • Robust estimate of ai is
  • Corrected values are calculated as
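
The two formulas on this slide did not survive in the transcript. The sketch below assumes one MAD-based choice, in the spirit of the Yang et al. normalization paper listed on the last slide: ai is estimated as the median absolute deviation (MAD) of group i divided by the geometric mean of the MADs of all groups, and the corrected values are Mij / ai. Whether this matches the slide exactly is an assumption, and it presumes location normalisation has already centred each group and that every MAD is positive.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    values = list(values)
    centre = median(values)
    return median(abs(v - centre) for v in values)


def scale_factors(groups):
    """Estimate a_i as MAD_i divided by the geometric mean of all group MADs."""
    mads = [mad(m_values) for m_values in groups]
    geo_mean = math.exp(sum(math.log(x) for x in mads) / len(mads))
    return [x / geo_mean for x in mads]


def rescale(groups):
    """Divide each group's M values by its estimated scale factor a_i."""
    return [[m / a for m in m_values]
            for m_values, a in zip(groups, scale_factors(groups))]


# Two hypothetical print-tip groups with the same centre but different spread.
groups = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-4.0, -2.0, 0.0, 2.0, 4.0],
]
print(scale_factors(groups))
print([mad(g) for g in rescale(groups)])
```

Dividing by the geometric mean keeps the overall scale of the log-ratios unchanged, so only the relative spread between groups is adjusted.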

30
Illustration: print-tip-group normalisation
Assumption: for every print-tip group, changes are
roughly symmetric at all intensities.
Glass slide: array of bound cDNA probes, 4x4
blocks = 16 pin groups
31
Step 7: Assessing Significance
MVA plot and critical curves: Newton's, Sapir and
Churchill's, and Chen's single-slide methods
32
Other Approaches
  • These normalisation procedures are based on the
    assumption that spots are as likely to be higher
    in the first or the second dye. They work well
    with a high number of independent spots.
  • If only selected genes (possibly just a few) are
    spotted, another approach might be needed.
  • For the correction of dye effects we recommend
    using either
  • paired dye-swapped slides and/or
  • internal controls such as spikes or a dilution
    series
  • In the second case, instead of all genes only the
    control spots are used to compute the
    corrections.
  • In the first case, the data from the two slides
    can be combined. Assuming identical dye-intensity
    interactions in the two slides, the effect is
    corrected by taking
  • A = 0.5 (A1 + A2)
  • M = 0.5 (M1 - M2)
  • This procedure is called self-normalisation, as
    it is done spot-by-spot. A number of controls
    give an indication of whether it is working well.
    It also deals with some artifacts that cause some
    genes to be always higher in one dye than in the
    other.
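
A sketch of self-normalisation for one spot measured on a dye-swap pair. It assumes the second slide has the dyes swapped, so the true log-ratio enters M2 with the opposite sign while the dye bias keeps its sign; the numbers are made up.

```python
def self_normalise(m1, a1, m2, a2):
    """Combine one spot from a dye-swap pair: M = (M1 - M2) / 2, A = (A1 + A2) / 2."""
    return 0.5 * (m1 - m2), 0.5 * (a1 + a2)


# Hypothetical spot: true log-ratio 1.0 with a dye bias of 0.5 on both slides.
true_m, dye_bias = 1.0, 0.5
m1 = true_m + dye_bias     # slide 1
m2 = -true_m + dye_bias    # dye-swapped slide: the true log-ratio changes sign
print(self_normalise(m1, 10.0, m2, 11.0))  # (1.0, 10.5)
```

The bias cancels only to the extent that it is the same on both slides, which is the assumption of identical dye-intensity interactions stated above.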

33
II. The second simplest problem is identifying
differentially expressed genes using replicated
slides
  • There are a number of different aspects
  • First, between-slide normalization; then:
  • What should we look at: averages, SDs,
    t-statistics, other summaries?
  • How should we look at them?
  • Can we make valid probability statements?

34
Selecting genes up/down regulated 1
  • M
  • t
  • t ∩ M

Results from the Apo AI ko experiment
35
Which genes are (relatively) up/down regulated?
Selecting genes up/down regulated
  • Two samples.
  • e.g. KO vs. WT or mutant vs. WT

Two samples with a reference (e.g. pooled control)
For each gene form the t statistic
t = (average of n trt Ms) /
    sqrt(1/n (SD of n trt Ms)^2)
  • For each gene form the t statistic
  • t = (average of n trt Ms - average of n ctl Ms) /
    sqrt(1/n ((SD of n trt Ms)^2 + (SD of n ctl Ms)^2))
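
The two statistics on this slide can be sketched for a single gene as follows. The sample standard deviation (divisor n - 1) and equal group sizes are assumptions made here, and the M values are made up.

```python
import math
from statistics import mean, stdev


def one_sample_t(m_trt):
    """t = mean(M) / sqrt(SD^2 / n) for n log-ratios of one gene."""
    n = len(m_trt)
    return mean(m_trt) / math.sqrt(stdev(m_trt) ** 2 / n)


def two_sample_t(m_trt, m_ctl):
    """t = (mean trt - mean ctl) / sqrt((SD_trt^2 + SD_ctl^2) / n), equal n per group."""
    n = len(m_trt)
    if len(m_ctl) != n:
        raise ValueError("this sketch assumes the same number of slides in each group")
    pooled = (stdev(m_trt) ** 2 + stdev(m_ctl) ** 2) / n
    return (mean(m_trt) - mean(m_ctl)) / math.sqrt(pooled)


m_trt = [1.0, 2.0, 3.0]   # hypothetical treatment log-ratios for one gene
m_ctl = [0.0, 1.0, 2.0]   # hypothetical control log-ratios for the same gene
print(one_sample_t(m_trt), two_sample_t(m_trt, m_ctl))
```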

36
Which genes have changed? When permutation
testing is possible
  • 1. For each gene and each hybridisation (8 ko + 8
    ctl), use M = log2(R/G).
  • 2. For each gene form the t statistic
  • t = (average of 8 ko Ms - average of 8 ctl Ms) /
    sqrt(1/8 ((SD of 8 ko Ms)^2 + (SD of 8 ctl Ms)^2))
  • 3. Form a histogram of 6,000 t values.
  • 4. Do a normal Q-Q plot; look for values off the
    line.
  • 5. Permutation testing.
  • 6. Adjust for multiple testing.
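
Steps 5 and 6 can be sketched by enumerating every relabelling of the slides. The adjustment shown is a single-step max-|t| adjustment, which is an assumption made for illustration since the transcript does not say which procedure was used. The data are made up, with 3 + 3 slides instead of 8 + 8, and equal group sizes and non-constant values within each group are assumed.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def t_stat(x, y):
    """Two-sample t statistic for equal group sizes, as defined in step 2."""
    n = len(x)
    return (mean(x) - mean(y)) / math.sqrt((stdev(x) ** 2 + stdev(y) ** 2) / n)


def permutation_p_values(genes, n_trt):
    """Return (raw, adjusted) two-sided permutation p-values, one per gene.

    genes: one row of M values per gene, the first n_trt values from the treated group.
    """
    n_total = len(genes[0])
    observed = [abs(t_stat(row[:n_trt], row[n_trt:])) for row in genes]
    exceed_raw = [0] * len(genes)
    exceed_max = [0] * len(genes)
    n_perm = 0
    for trt_idx in combinations(range(n_total), n_trt):
        chosen = set(trt_idx)
        ctl_idx = [i for i in range(n_total) if i not in chosen]
        perm = [abs(t_stat([row[i] for i in trt_idx], [row[i] for i in ctl_idx]))
                for row in genes]
        largest = max(perm)                     # largest |t| over all genes
        for g, obs in enumerate(observed):
            exceed_raw[g] += perm[g] >= obs
            exceed_max[g] += largest >= obs
        n_perm += 1
    raw = [count / n_perm for count in exceed_raw]
    adjusted = [count / n_perm for count in exceed_max]
    return raw, adjusted


genes = [
    [2.1, 2.5, 1.9, 0.1, -0.2, 0.3],   # hypothetical gene with a clear group difference
    [0.2, -0.1, 0.4, 0.3, 0.0, -0.3],  # hypothetical gene with no clear difference
]
raw, adjusted = permutation_p_values(genes, n_trt=3)
print(raw, adjusted)
```

With 3 + 3 slides there are only 20 relabellings, so the smallest attainable two-sided p-value is 2/20 = 0.1. With 8 + 8 slides there are 12,870, which is what makes permutation testing workable in that design.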

37
Histogram and Q-Q plot
ApoA1
38
Adjusted and Unadjusted p-values for the 50 genes
with the largest absolute t-statistics.
39
Which genes have changed? When Permutation
testing is not possible
  • Our current approach is to use M-averages, SDs,
    t-statistics and a new statistic we call B,
    inspired by empirical Bayes.
  • We hope in due course to calibrate B and use that
    as our main tool.

Empirical Bayes log posterior odds ratio
40
  • T
  • B
  • t ∩ M ∩ B
  • t ∩ B

41
Remarks for multi-array experiments
  • Microarray experiments typically have thousands
    of genes, but only a few (1-10) replicates for
    each gene.
  • Averages can be driven by outliers.
  • Ts can be driven by tiny variances.
  • B = LOR will, we hope:
  • use information from all the genes
  • combine the best of M. and T
  • avoid the problems of M. and T

42
  • Some web sites
  • Technical reports, talks, software etc.
  • http://www.stat.berkeley.edu/users/terry/zarray/Html/
  • Especially
  • Dudoit et al Statistical methods for
  • Yee Hwa Yang et al. Normalization for cDNA
    Microarray Data
  • Statistical software R ("GNU's S")
    http://lib.stat.cmu.edu/R/CRAN/
  • Packages within R environment
  • -- Spot http://www.cmis.csiro.au/iap/spot.htm
  • -- SMA (statistics for microarray analysis)
    http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html