Title: Data normalisation for two-colour microarrays
1Data normalisation for two-colour microarrays
- References for microarray analysis
- 1. Leung and Cavalieri (2003) Fundamentals of cDNA microarray data analysis. Trends in Genetics 19: 649-659
- 2. Hegde et al. (2000) A concise guide to cDNA microarray analysis. Biotechniques 29(3): 548-554
- 3. Quackenbush (2002) Microarray data normalization and transformation. Nature Genetics 32: 496-501
- Acknowledgements to John Quackenbush for slide material
2Outline
- Why Normalise?
- Normalisation methods
- Available Normalisation packages
- DNMAD (GEPAS)
- MIDAS (TIGR)
- R
3A two-colour normalisation
4Why normalise?
- During probe preparation, technical variations can be generated, including:
- Unequal amounts of cDNAs
- Differences in dye properties
- Differences in dye incorporation
- Differences in scanning
- Normalisation aims to correct for these variations
5Assumptions
- Most global normalisation methods assume the two dyes are related by a constant factor:
- R = kG
- Most normalisation methods assume that the mean ratio is 1, or that the log of this ratio is 0, i.e. a typical gene does not change its expression under the condition being studied.
6Remember ratio vs log ratio
[Figure: ratio R (y-axis, 0 to 4) for Gene1 and Gene2 across A and B, where A = Cy3 and B = Cy5]
Advantage of log transformation: up-regulated and down-regulated genes are treated symmetrically
7Log2(ratio) measures treat up- and
down-regulated genes equally
log2(1) = 0, log2(2) = 1, log2(1/2) = -1
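The symmetry can be checked numerically. The following is a minimal sketch; the function name `log2_ratio` and the intensity values are illustrative and not taken from the slides.

```python
import math

def log2_ratio(red, green):
    """Return log2(red / green) for two positive intensities."""
    return math.log2(red / green)

# Two-fold up- and down-regulation are asymmetric as raw ratios (2.0 vs 0.5)
# but sit at equal distances from zero after the log2 transformation.
up = log2_ratio(2000.0, 1000.0)         # raw ratio 2.0
down = log2_ratio(1000.0, 2000.0)       # raw ratio 0.5
unchanged = log2_ratio(1500.0, 1500.0)  # raw ratio 1.0
print(up, down, unchanged)
```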
8Normalisation methods
- Within array normalisation
- Small targeted arrays
- housekeeping genes
- internal spikes
- Larger randomly printed arrays
- total signal intensity
- linear regression
- non-linear regression and Lowess
- Between array normalisation
9Normalisation to housekeeping genes
- Idea: some genes shouldn't be differentially expressed
- But what are these genes?
- Perhaps actin, ubiquitin, ribosomal RNAs, etc.
- Normalisation constant:
- Cy3/Cy5 for those genes, or the median value of Cy3/Cy5 for several housekeeping genes
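As a rough sketch of this calculation, the constant can be taken as the median Cy3/Cy5 value over the housekeeping genes and then applied to every gene's Cy5/Cy3 ratio. Applying the constant to the Cy5/Cy3 ratio is an assumption made here for illustration, and the gene names, intensities and function names are hypothetical.

```python
from statistics import median

def housekeeping_constant(cy3, cy5, housekeeping_ids):
    """Median Cy3/Cy5 value over the chosen housekeeping genes."""
    return median(cy3[g] / cy5[g] for g in housekeeping_ids)

def normalise_ratios(cy3, cy5, constant):
    """Scale each gene's Cy5/Cy3 ratio by the normalisation constant."""
    return {g: (cy5[g] / cy3[g]) * constant for g in cy3}

# Hypothetical background-corrected intensities (illustrative only).
cy3 = {"actin": 1000.0, "ubiquitin": 2000.0, "rRNA": 4000.0, "geneX": 500.0}
cy5 = {"actin": 2000.0, "ubiquitin": 4000.0, "rRNA": 8000.0, "geneX": 4000.0}

k = housekeeping_constant(cy3, cy5, ["actin", "ubiquitin", "rRNA"])
normalised = normalise_ratios(cy3, cy5, k)
```

With these made-up numbers the red channel is uniformly twice as bright for the housekeeping genes, so their normalised ratios come back to 1 while geneX remains clearly up-regulated.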
10Normalisation to spikes or exogenous RNAs
- On the array, place a number of sequences from a different organism (genes that have low homology to any gene in the organism of interest)
- Synthesise RNA for each of these genes by in vitro transcription (IVT)
- Spike known quantities of these RNAs into known quantities of sample (NB: must start with the same amount of sample RNA)
- Set the normalisation constant so that the ratios for the exogenously added genes take their expected values
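One possible way to set such a constant, sketched under the assumption that a single multiplicative factor is enough, is to take the median of expected/observed ratio over the spikes. The spike names, ratios and the function `spike_constant` are hypothetical.

```python
from statistics import median

def spike_constant(observed, expected):
    """Median of expected/observed ratio over the spike-in controls."""
    return median(expected[s] / observed[s] for s in observed)

# Hypothetical Cy5/Cy3 ratios measured for three spikes, and the ratios
# expected from the known quantities that were added to each sample.
observed = {"spike1": 0.5, "spike2": 1.0, "spike3": 2.0}
expected = {"spike1": 1.0, "spike2": 2.0, "spike3": 4.0}

k = spike_constant(observed, expected)
corrected = {s: r * k for s, r in observed.items()}
```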
11Normalisation to Total Signal
- Assume equal gene expression/signal in each sample
- Global / non-selective microarray
- Normalisation constant:
- Σ(Cy3)/Σ(Cy5), or
- Σ(Cy5)/Σ(Cy3) for the dye-swap experiment
- For each gene, multiply the ratio by the normalisation constant.
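A minimal sketch of total-signal normalisation, assuming the per-gene ratio is Cy5/Cy3 and the constant is the sum of Cy3 over the sum of Cy5. The function names and intensities are illustrative.

```python
def total_intensity_constant(cy3, cy5):
    """Return sum(Cy3) / sum(Cy5), the total-signal normalisation constant."""
    return sum(cy3) / sum(cy5)

def normalise_by_total(cy3, cy5):
    """Multiply every Cy5/Cy3 ratio by the total-signal constant."""
    k = total_intensity_constant(cy3, cy5)
    return [(r / g) * k for g, r in zip(cy3, cy5)]

cy3 = [100.0, 200.0, 300.0, 400.0]  # hypothetical green intensities
cy5 = [200.0, 400.0, 600.0, 800.0]  # red channel uniformly twice as bright
ratios = normalise_by_total(cy3, cy5)
```

Because the only difference between the channels in this toy data is a constant factor, every normalised ratio returns to 1.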
12Linear regression of cy3 vs cy5
- Assume the expression of the majority of genes doesn't change between samples
- Scatter plot of raw data (green vs red)
- Median background subtraction from mean foreground? You can decide
- Data is bunched up in the left-hand corner
- Solution: log transformation
13Linear regression of log(cy3) vs log(cy5)
- draw scatter plot of log(green) vs log(red)
- draw linear best fit line
- y = ax + b, where a = 0.878 and b = 1.419 (red line)
- x normalised = ax + b
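The fit and the normalisation step can be sketched as follows, assuming x is log2(Cy3) and y is log2(Cy5) (the slide does not say which channel is on which axis). The helper `fit_line` and the intensities are illustrative.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

green = [100.0, 400.0, 1600.0, 6400.0]   # hypothetical Cy3 intensities
red = [200.0, 800.0, 3200.0, 12800.0]    # hypothetical Cy5 intensities
x = [math.log2(g) for g in green]
y = [math.log2(r) for r in red]

a, b = fit_line(x, y)
x_norm = [a * xi + b for xi in x]                    # x normalised = ax + b
log_ratios = [yi - xn for yi, xn in zip(y, x_norm)]  # normalised log ratios
```

After normalisation, a plot of y against the normalised x values should follow the line of slope 1 through the origin, so the log ratios centre on 0.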
14Before and after normalisation
Red: raw data; blue: normalised data
15The RI or MA plot
- Checks if data exhibits an intensity-dependent structure
- Uncertainty in ratio measurements generally greater at lower intensities
- For the RI plot: log2(R/G) vs log2(RG)
- For the MA plot: log2(R/G) vs ½ log2(RG)
- Remember:
- log(R/G) = log(R) - log(G)
- log(RG) = log(R) + log(G)
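The two MA-plot coordinates follow directly from these identities. A minimal sketch, with an illustrative function name and intensities:

```python
import math

def ma_values(red, green):
    """Return (M, A) where M = log2(R/G) and A = 0.5 * log2(R*G)."""
    log_r = math.log2(red)
    log_g = math.log2(green)
    m = log_r - log_g          # log2(R/G) = log2(R) - log2(G)
    a = 0.5 * (log_r + log_g)  # 0.5 * log2(RG) = 0.5 * (log2(R) + log2(G))
    return m, a

m, a = ma_values(red=4096.0, green=1024.0)
```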
16MA plots pre- and post normalisation
After normalisation, the fit of the data to the
horizontal line through 0 is much better
17Example of good data RI plot
18Example of bad data RI plot
Each print tip is coloured differently. NB! The data is curved, so a straight-line normalisation is not a good idea!
19 Non-linear regression
- Global lowess (locally weighted scatterplot smoothing, or locally weighted linear regression)
- Print-tip lowess
- 2D lowess (spatial bias)
- DNA spots on array must be arranged randomly
20Applying a Global Lowess to a single sample
- Produce an MA plot (calculating the average log intensity and log ratio for each feature)
- A curve fits the data far better than a straight line, as the data is curved
- Apply a lowess regression to the data
- Calculate the normalised log ratio for each feature according to the equation of the curve (subtract the fitted value at that feature's average log intensity from its raw log ratio)
21How does Lowess work?
- Lowess performs a large number of local regressions in overlapping windows.
- The regressions are then combined to form a smooth curve: the Lowess data set.
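The idea of many overlapping local regressions can be sketched in plain Python. This is a simplified illustration rather than a full lowess implementation: it uses tricube weights over a nearest-neighbour window and omits the robustness iterations that full lowess routines perform. The function name, the `frac` parameter and the data are assumptions made for the example.

```python
def lowess(x, y, frac=0.5):
    """Fit a locally weighted straight line at every x; return the fitted y values.

    Each point gets its own window of nearest neighbours (a fraction `frac`
    of all points), weighted by the tricube kernel. No robustness iterations.
    """
    n = len(x)
    k = max(2, int(round(frac * n)))           # points per local window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12               # window half-width
        w = [max(0.0, 1.0 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0   # weighted least-squares slope
        fitted.append(my + slope * (x0 - mx))   # local line evaluated at x0
    return fitted

# Hypothetical MA data with an intensity-dependent trend in M.
a_vals = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
m_vals = [0.5 * a - 3.0 for a in a_vals]
curve = lowess(a_vals, m_vals, frac=0.5)
m_norm = [m - c for m, c in zip(m_vals, curve)]  # normalised log ratios
```

Subtracting the fitted curve from the raw M values removes the intensity-dependent trend, which is the normalisation step applied after fitting a lowess curve to an MA plot.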
22Print-tip lowess
- Application of lowess to a subset of the data (i.e. print group or subgrid), provided sufficient genes are printed per tip
- Advantage: can correct for systematic spatial variation (inconsistencies between pins, variability in slide surface, slight local differences in hybridisation conditions)
Print tips normalised for mean around zero
Print-tip layout
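The per-group structure of print-tip normalisation can be illustrated with a simpler stand-in: subtracting each print tip's median M value instead of a fitted lowess curve. This is a deliberate simplification that only centres each tip group around zero; the function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import median

def centre_by_print_tip(m_values, tip_ids):
    """Subtract each print-tip group's median M from the spots printed by that tip."""
    groups = defaultdict(list)
    for m, tip in zip(m_values, tip_ids):
        groups[tip].append(m)
    medians = {tip: median(vals) for tip, vals in groups.items()}
    return [m - medians[tip] for m, tip in zip(m_values, tip_ids)]

# Hypothetical log ratios: tip 1 reads high, tip 2 reads low.
m_values = [0.5, 0.75, 1.0, -1.0, -0.75, -0.5]
tip_ids = [1, 1, 1, 2, 2, 2]
centred = centre_by_print_tip(m_values, tip_ids)
```

In a print-tip lowess, the median subtraction would be replaced by a lowess fit of M against A computed separately within each tip group.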
23Print-tip lowess
24A potential problem: spatial effects
- In some experiments there is a spatial difference between the two channels, resulting in parts of the array being brighter in the Cy3 and other parts in the Cy5.
- A typical cause is slides with an uneven surface, resulting in the different lasers being out of focus if they do not adjust as they scan.
25Correcting for Spatial Effects
- 2D lowess regression: fits a two-dimensional polynomial surface to the data
- Block-by-block lowess regression
- MA plots do not show gradients across slides; therefore a pseudo-colour or false-colour overlay is required for each slide
26Normalisation methods
- Within array normalisation
- Small targeted arrays
- housekeeping genes
- internal spikes
- Larger random arrays
- total signal intensity
- linear regression
- non-linear regression and Lowess
- Between array normalisation
27Between array normalisations
- Usually you will want to do more than one array, either as replicates or as additional samples.
- You will need to normalise between arrays if you want to compare the results.
- You can even normalise across different platforms (e.g. spotted and Affymetrix) and laboratories!
28Main assumption
- The variation is due to experimental artefacts, NOT biology!!
- If you expect large differences between samples, you should not try to normalise!!
- Data needs to be centred: mean of log ratios = 0
- Data needs to be scaled: s.d. = 1
- Data needs to be normally distributed: distributions are identical
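Centring and scaling can be sketched per array as below. The use of the population standard deviation is a choice made for this example, and the function name and values are illustrative.

```python
from statistics import mean, pstdev

def centre_and_scale(log_ratios):
    """Return log ratios shifted to mean 0 and scaled to standard deviation 1."""
    mu = mean(log_ratios)
    sd = pstdev(log_ratios)
    if sd == 0:
        raise ValueError("cannot scale an array with zero spread")
    return [(v - mu) / sd for v in log_ratios]

# Two hypothetical arrays with different centres and spreads.
array1 = [0.0, 1.0, 2.0, 3.0, 4.0]
array2 = [-4.0, -2.0, 0.0, 2.0, 4.0]
scaled1 = centre_and_scale(array1)
scaled2 = centre_and_scale(array2)
```

After this step both toy arrays have the same centre and spread, so their log ratios can be compared directly.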
29Scaling across slides
Equal spread of variation between slides
30Normalization Approaches
The Solution(?)
- Can minimise normalisation by adjusting PMTs during scanning so that the mean/median of the ratios = 1
- The best technique is experiment dependent
- Check diagnostic plots carefully
- All analysis methods depend on good experimental design
31Normalisation packages
- Genepix
- Excel
- DNMAD (GEPAS)
- MIDAS (TIGR)
- R
32What you can do in Genepix
- Array quality control
- Scatter plots (cy3 vs cy5)
- MA plots
- Normalisation using housekeeping genes or internal spikes
- Global normalisation
- total intensity (mean/median of ratios = 1)
- linear regression (regression ratio = 1)
By default, flagged spots are excluded from normalisation
33Excel
- What you can do in Excel
- Scatter plots (cy3 vs cy5)
- MA plots
- Normalisation using housekeeping genes or internal spikes
- Normalisation using total intensity
- Linear regression normalisation
34DNMAD
- http://dnmad.bioinfo.cnio.es/
- Input GPR files
- Input array layout (e.g. Arabidopsis Arizona slides: main grid 12 rows x 4 columns, sub-grid 26 rows x 25 columns)
- Normalisation options, e.g. use flags, return flags as NA, use background subtraction, use global lowess
- Work through tutorial: http://bioinfo.cnio.es/docus/courses/dnmad/index.html
35Output
Can download normalised M values as a txt file
(remember to multiply dye swaps by -1)
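The sign flip for dye swaps can be sketched as below. Averaging the forward and sign-flipped values afterwards is an assumption added for illustration, as are the function name and the M values.

```python
def combine_dye_swap(m_forward, m_swapped):
    """Multiply dye-swap M values by -1, then average them with the forward array."""
    flipped = [-s for s in m_swapped]
    return [(f + s) / 2.0 for f, s in zip(m_forward, flipped)]

m_forward = [1.0, -0.5, 0.25]   # hypothetical normalised M values
m_swapped = [-1.0, 0.5, -0.75]  # same features on the dye-swapped array
combined = combine_dye_swap(m_forward, m_swapped)
```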
36Box plot of M values by print-tip
37MA plots pre- and post LOWESS
38Slide1 diagnostic plots
39Slide1 Image plots
Background and M plots also shown
40Slide scale normalisation
Next: send to Preprocessor. Download normalised M values (txt file).
41Microarray Data Flow at TIGR
Image Analysis
.tiff Image File
Raw Gene Expression Data
Gene Annotation
Normalization / Filtering
Normalized Data with Gene Annotation
Expression Analysis
Data Entry / Management
Interpretation of Analysis Results
42MIDAS is a Normalization and Filtering tool
for microarray data analysis!
Serves as a data pre-processor for clustering
analysis (MeV).
Requires Java Runtime environment and Windows XP
43MIDAS data analysis methods
- 7 normalization/transformation methods
Total Intensity normalization
LOWESS (Locfit) normalization
Iterative linear regression normalization
Iterative log mean centering normalization
Ratio Statistics normalization
Standard deviation regularization
In-slide replicates analysis
- 8 filtering (quality control) methods
Low intensity filter
Slice analysis
Flip-dye consistency checking
Ratio Statistics Confidence Interval checking
Invalid-intensity checking
Signal/Noise checking
Spot QC flag checking
Cross-file-trim
44Graphical scripting language
45Graphical scripting language
- Read input files
- Define analysis pipeline and set parameters for each analysis module
- Write output files
- NB:
- Input MEV files only (convert GPR files using Expression Converter)
- Click "create PDF report"
46Using R
statistical microarray analysis (sma) module
- sma will normalise, compare slides, and do statistical tests on data
- Allows simultaneous multiple slide analysis
- To process the data
- load experiments into R
- describe slide printing configuration
- load experiments into a working data set
- Analyse data
47Normalisation exercises
- Two exercises using DNMAD
- http://bioinfo.cnio.es/docus/courses/dnmad/index.html
- Using your own data
- Try linear regression normalisation (in Excel) vs print-tip LOWESS (DNMAD and/or MIDAS)