Title: Introduction to Bioinformatics Microarrays 2: Microarray Data Normalisation
1 Introduction to Bioinformatics Microarrays 2
Microarray Data Normalisation
- Course 341
- Department of Computing
- Imperial College, London
- Moustafa Ghanem
2 Lecture Overview
- Background and Motivation
- Introduction
- Microarray experiments and microarray data analysis
- Sources of variability
- Experimental design
- Normalisation Examples
- Probe intensity values
- Two colour arrays
- Positive controls
- Spatial normalisation within array
- Between array normalisation
- Normalisation Methods
- Total intensity normalisation
- Scaling and centring
- Linear regression
- MA plots and Lowess
3 Background: Microarrays
A microarray works by exploiting the ability of an mRNA molecule to hybridise to its complementary DNA probe. The mRNA molecules in a target biological sample are labelled using a fluorescent dye and applied to the array. The fluorescent label enables the detection of which probes have hybridised (presence) via the light emitted from the probe.
- A microarray is a device that detects the presence and abundance of labelled nucleic acids in a biological sample.
- The microarray consists of a solid surface onto which known DNA molecules have been chemically bonded at specific locations.
- Each array location is typically known as a probe and contains many replicates of the same molecule.
- The molecules in each array location are carefully chosen so as to hybridise only with mRNA molecules corresponding to a single gene.
4 Background: Microarray Data Analysis
[Figure: workflow of a microarray study, from biological question to interpretation]
- Biological question
- Sample attributes
- Experimental design and platform choice
- Microarray experiment (output: 16-bit TIFF files)
- Image analysis (output: (Rspot, Rbkg), (Gspot, Gbkg))
- Normalization
- Clustering, statistical analysis, data mining, pattern discovery, classification
- Biological verification and interpretation
5 Motivation
- Data generated from microarray experiments are inherently highly variable.
- First, there is the Law of Large Numbers.
- Any measurement of thousands of values will find some large differences due to chance (normal distribution).
- However, the average gene does not change its expression across experiments.
- Must have replication (e.g. different patients, different experiments) and statistics to show that measured differences are real.
- Second, there are also systematic sources of variability.
- e.g. errors in scanning microarray images, differences between the properties of the Cy3 and Cy5 channels, etc.
- Must have systematic methods for addressing such errors.
6 Motivation
- Normalisation is a general term for a collection of methods that are directed at reasoning about and resolving the systematic errors and bias introduced by microarray experimental platforms.
- Normalisation methods stand in contrast with the data analysis methods described in other lectures (e.g. differential gene expression analysis, classification and clustering).
- Our overall aim is to be able to quantify measured/calculated variability, differentials and similarity.
- Are they biologically significant or just side effects of the experimental platforms and conditions?
7 Introduction: Sources of Microarray Data Variability
The measured gene expression in any experiment includes true gene expression, together with contributions from many sources of variability.
- There are several levels of variability in measured gene expression of a feature.
- At the highest level, there is biological variability in the population from which the sample derives.
- At an experimental level, there is
- variability between preparations and labelling of the sample,
- variability between hybridisations of the same sample to different arrays, and
- variability between the signal on replicate features on the same array.
8 Introduction: Sources of Microarray Data Variability
There are many standard experimental protocols that biologists need to follow when conducting their experiments to minimize variability.
- Population Variation
- Whose mRNA are we using? May need to test different samples in parallel.
- May need many replicates to study biological variation
- Sample Treatment
- Experimental conditions
- Tissue preparation
- Target Preparation
- RNA isolation: need to use identical amounts of tissue and identical extraction methods; use the minimum number of steps; measure the amount of RNA and normalize the concentration
- Labelling: need to account for and measure incorporation of label, and normalize samples to the same concentration
- Amount: need to add the same amount of label to each hybridization
9 Introduction: Sources of Microarray Data Variability
Oligos reduce variability of probes compared to PCR products. In-situ synthesis standardises probe production, produces better spot quality and reduces errors in image acquisition.
- Arrays
- Same sample may be hybridized to different arrays in different labs
- PCR products: probes prepared through amplification directly from cells; must add the same amount of product to each spot on the filter
- Uniformity of spotting: must use an arraying tool for filter arrays or a robot for microarrays
- Treatment and handling of filters or slides
- Hybridization and washing
- Time: long hybridization ensures that hybridization goes to completion
- Temperature: most hybridisations are performed between 45 and 65 °C
- Data acquisition
- Image acquisition
- Spot and background detection
10 Introduction: Biological and Technical Variability
- Biological variability is variation between individuals in the population and is independent of the microarray process itself.
- Population variability can be measured with pilot studies.
- Technical variability is dependent on the microarray process itself.
- Technical variability is measured in calibration experiments.
- In good experiments, technical variation should be much less than biological variation.
11 Introduction: Experimental Design
- Tree representation of replicate experiments
- The first level is at the level of biological replicates
- This is followed by two independent mRNA extractions, and reciprocal Cy3 and Cy5 labelling
- Finally, on each array, each probe is printed in triplicate.
- In this example, each data point in the experiment is replicated a total of 24 times.
- Furthermore, in each microarray experiment, each gene (each probe or probe set) is really a separate experiment in its own right
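The total of 24 can be checked by multiplying the branching factor at each level of the tree. The number of biological replicates is not stated in the text; the sketch below assumes two, which is the value that makes the product come out at 24.

```latex
\underbrace{2}_{\text{biological replicates (assumed)}}
\times \underbrace{2}_{\text{mRNA extractions}}
\times \underbrace{2}_{\text{dye labellings (Cy3, Cy5)}}
\times \underbrace{3}_{\text{printed spots per probe}}
= 24
```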
12 Introduction: Conducting Good Experiments
13 Introduction: Gene Expression Matrices
14 Normalisation Examples: Probe Intensity Value
Typical problem: usually more variability at low intensity
- The raw intensities of signal from each spot on the array are not directly comparable. Depending on the types of experiments done, a number of different approaches to normalization may be needed. Not all types of normalization are appropriate in all experiments. Some experiments may use more than one type of normalization.
- Reasonable assumption: intensities of fluorescent molecules reflect the abundance of the mRNA molecules. This is generally true but could be problematic.
- Example:
- intensity of gene A spot is 100 units in normal-tissue array
- intensity of gene A spot is 50 units in cancer-tissue array
- Conclusion: gene A's expression level in normal tissue is significantly higher than in cancer tissue
15 Normalisation Examples: Probe Intensity Value
[Images showing examples of how background intensity can be calculated]
- Problem: What if the overall background intensity of the normal-tissue array is 95 units while the background intensity of the cancer-tissue array is 10 units?
- Solutions:
- Subtract background intensity value
- Take ratio of spot intensity to background intensity (preferable)
- In both cases, have to decide where to measure background intensity (e.g. local to spot or globally per chip)
- In general, there could be many factors contributing to the background intensity of a microarray chip
- To compare microarray data across different chips, data (intensity levels) need to be normalized to the same level
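The two corrections can be tried on the gene A numbers from the example. This is a minimal Python sketch; the function names are illustrative, and flooring the subtracted value at zero is an added assumption rather than something stated on the slide.

```python
def subtract_background(spot, background):
    """Background-subtracted intensity (floored at zero: an assumption)."""
    return max(spot - background, 0.0)


def ratio_to_background(spot, background):
    """Spot intensity expressed as a multiple of the background."""
    return spot / background


# Gene A in the example: spot intensity and array background intensity.
normal_spot, normal_bkg = 100.0, 95.0
cancer_spot, cancer_bkg = 50.0, 10.0

normal_sub = subtract_background(normal_spot, normal_bkg)    # 5.0
cancer_sub = subtract_background(cancer_spot, cancer_bkg)    # 40.0
normal_ratio = ratio_to_background(normal_spot, normal_bkg)  # about 1.05
cancer_ratio = ratio_to_background(cancer_spot, cancer_bkg)  # 5.0
```

With either correction, gene A now looks more strongly expressed on the cancer-tissue array, the opposite of what the raw intensities suggested.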
16 Normalisation Examples: Two Colour Arrays
- Reasonable assumption: for two colour arrays, in a self-self hybridization, we expect Red = Green for each spot
- Problem: this is not necessarily true due to labelling effects, chemistry (dye properties), scanner properties, etc.
- Dye bias in two-channel microarrays: intensity in one channel may be higher than the other
- Solutions:
- Dye swapping experiments: in the first replicate, label control with red and experiment with green; in the second replicate, swap colours
- Calibration experiments (self vs. self hybridisation): label the same extract with both colours and calculate variation
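17 Normalisation Examples: Two Colour Arrays
[Figure: scatter plot of the two channels, with the fitted line y = ax and the expected line y = x]
- Possible ways of correction:
- 1. dividing all y by a
- 2. multiplying all x by a
- Can easily be extended when the regression line is y = ax + b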
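If the self-self scatter follows a line y = ax instead of y = x, estimating a and rescaling one channel brings the points back onto the diagonal. The sketch below is one way to do that in Python, assuming a least-squares fit through the origin; the intensities are made-up values with an exact slope of 1.5, chosen only to make the effect visible.

```python
def fit_slope_through_origin(x, y):
    """Least-squares slope a for the model y = a * x (no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / sxx


# Hypothetical self-self hybridisation: red reads 1.5 times green.
green = [100.0, 200.0, 400.0, 800.0]    # x channel
red = [150.0, 300.0, 600.0, 1200.0]     # y channel

a = fit_slope_through_origin(green, red)

# Correction: divide every y by a (multiplying every x by a is equivalent).
red_corrected = [r / a for r in red]
```

For a fitted line with an intercept, y = ax + b, the analogous correction would be (y - b) / a.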
18 Normalisation Examples: Ratio of Signal to Positive Control
How does this approach compare to the Affymetrix PM/MM probes?
- Problem: Is there any cross-hybridisation?
- Solution: It is often useful to spike the labelling reaction with some foreign RNA or DNA that is not normally in the RNA population.
- The signal s_i for gene i would therefore be the raw counts g_i divided by the median of the counts for the vector spots.
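The ratio to a positive control is a one-line calculation once the control spots are identified. A small Python sketch follows; the gene names and counts are hypothetical, and the control spots stand for the spiked foreign sequence.

```python
from statistics import median


def normalise_to_control(gene_counts, control_counts):
    """Return s_i = g_i / median(control spot counts) for every gene i."""
    control_level = median(control_counts)
    return {gene: count / control_level for gene, count in gene_counts.items()}


# Hypothetical raw counts, for illustration only.
control_spots = [480.0, 500.0, 520.0, 510.0, 490.0]   # median is 500
raw_counts = {"geneA": 1000.0, "geneB": 250.0, "geneC": 500.0}

signals = normalise_to_control(raw_counts, control_spots)
# geneA is at twice the control level, geneB at half, geneC equal to it.
```

Using the median rather than the mean keeps a single damaged control spot from shifting every signal on the array.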
19 Normalisation Examples: Ratio of Signal to Positive Control
- Normalization of signal for each gene to a ratio makes it possible to compare ratios between experiments, provided that the spiked controls are the same in all experiments.
- Normalization to a positive control is typically used in single-label experiments. Comparison of one experiment to another can either be done by plotting signal s_i directly on a graph, or signals from two experiments can be converted into a ratio, usually by choosing one treatment as a control.
- For example, in a time course, a 0 hour time point might be chosen, and the signal from all other time points divided by the signal for the 0 hour time point, to give a ratio.
20 Normalisation Examples: Spatial variation within array
- Problem: signal varies according to spot location
- Particularly at the corners: less hybridization solution
- Also because of the print-tip group of the robot
- Solutions:
- Calculate ratio to mean or total intensity value
- Use Locally Weighted Regression (Lowess)
- Use block-by-block Lowess
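As a rough illustration of the first solution, each spot can be divided by the mean intensity of the spots around it. The Python sketch below assumes the "region" is a print-tip block and that each spot carries a block identifier; both the grouping and the data are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def normalise_by_block(spots):
    """spots: list of (block_id, intensity). Returns intensity / block mean."""
    by_block = defaultdict(list)
    for block_id, intensity in spots:
        by_block[block_id].append(intensity)
    block_mean = {block: mean(values) for block, values in by_block.items()}
    return [(block, value / block_mean[block]) for block, value in spots]


# Hypothetical array: block 1 (say, a corner) is half as bright as block 0.
spots = [(0, 100.0), (0, 300.0), (1, 50.0), (1, 150.0)]
normalised = normalise_by_block(spots)
# Both blocks now read 0.5 and 1.5, so the spatial effect is removed.
```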
21 Normalisation Examples: Between Array Normalisation
- Assumption: the overall intensities across two arrays should be similar
- Problem: not always the case
- Solution 1: Ensure that data points in the two-intensity coordinate system are roughly centred around the diagonal
- Solution 2: Use total intensity normalization for large number
22 Normalisation Methods: Between Array Normalisation
- Mean/median centering: mean/median intensity of every chip brought to the same level
- Total intensity normalization: scaling factor determined by summing intensities
- Spiked-control, housekeeping normalization (positive controls)
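The first two methods both reduce to finding one scaling factor per array. A minimal Python sketch, assuming a multiplicative rescaling on raw intensities and using made-up arrays in which the second is uniformly half as bright as the first:

```python
from statistics import median


def total_intensity_factor(reference, array):
    """Scaling factor that makes sum(array) equal sum(reference)."""
    return sum(reference) / sum(array)


def median_centre(array, target_median):
    """Rescale an array so that its median equals target_median."""
    factor = target_median / median(array)
    return [value * factor for value in array]


# Hypothetical intensities for the same four genes on two arrays.
array_1 = [100.0, 200.0, 300.0, 400.0]   # total 1000, median 250
array_2 = [50.0, 100.0, 150.0, 200.0]    # total 500, median 125

k = total_intensity_factor(array_1, array_2)
array_2_scaled = [value * k for value in array_2]
array_2_centred = median_centre(array_2, median(array_1))
```

Here the two methods agree because the arrays differ by a constant factor; on real data they can differ, since the total is more sensitive to a few very bright spots than the median is.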
23 Normalisation Methods: Centring and Scaling
- Data is scaled to ensure that the means and the standard deviations of all of the distributions are equal. For each measurement on the array, subtract the mean measurement of the array and divide by the standard deviation. Following centring and scaling, the mean measurement on each array will be zero, and the standard deviation will be 1.
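Centring and scaling is the familiar z-score transformation applied per array. A short Python sketch with illustrative values; using the population standard deviation (`pstdev`) rather than the sample standard deviation is a choice made here, not something the slide specifies.

```python
from statistics import mean, pstdev


def centre_and_scale(values):
    """Subtract the array mean, then divide by the array standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(value - mu) / sigma for value in values]


# Hypothetical measurements from one array (mean 5, standard deviation 2).
array = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = centre_and_scale(array)
# z == [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```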
24 Normalisation Methods: Normalized ratios usually expressed as logs
- To facilitate easier mathematical handling of the data, as well as comparisons over a wide range of expression levels, ratios are usually expressed as logs.
- For example, if a gene is expressed in the control at 4 times its level in the mutant, the mutant-to-control ratio is 1/4 and log2(1/4) = -2. A log ratio of 0 is therefore indicative of a gene whose expression is the same in both conditions or treatments.
Ratio: Tg = Rg / Gg
Log ratio: log2(Tg) = log2(Rg / Gg)
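The benefit of the log scale is symmetry: a four-fold increase and a four-fold decrease sit at equal distances from zero. A small Python sketch, with red and green standing for the two channel intensities of one gene (the values are illustrative):

```python
import math


def log_ratio(red, green):
    """log2 of the red/green intensity ratio for one gene."""
    return math.log2(red / green)


up = log_ratio(400.0, 100.0)         # ratio 4    gives  2.0
down = log_ratio(100.0, 400.0)       # ratio 1/4  gives -2.0
unchanged = log_ratio(250.0, 250.0)  # ratio 1    gives  0.0
```

On the raw ratio scale the same three genes would read 4, 0.25 and 1, which hides the fact that the first two changed by the same factor.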
25 Normalisation Methods: Regression Normalisation
- Regression normalization
- Fit the linear regression model y = ax + b
- Test the significance of the intercept b. Fit a linear regression without b if it is insignificant.
- Transform the data
- Problem: the linearity assumption may not hold due to a nonlinear trend
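The three steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a definitive procedure: the intercept is tested with its t-statistic against a fixed cut-off (`t_critical=2.0`, roughly the 5% two-sided level for large samples), and the transformation used is (y - b) / a, which maps the fitted line onto y = x. The function names and the data are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a*x + b. Returns (a, b, t_b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) * (1.0 / n + mean_x ** 2 / sxx))
    if se_b > 0:
        t_b = b / se_b
    else:
        t_b = math.inf if b != 0 else 0.0
    return a, b, t_b


def regression_normalise(x, y, t_critical=2.0):
    """Fit, drop an insignificant intercept, then map y onto the x scale."""
    a, b, t_b = fit_line(x, y)
    if abs(t_b) < t_critical:
        # Intercept not significant: refit the model y = a*x.
        a = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
        b = 0.0
    return [(yi - b) / a for yi in y], a, b


# Hypothetical channels: y = 1.5*x + 200 plus small alternating noise.
x = [100.0 * i for i in range(1, 11)]
noise = [5.0 if i % 2 == 0 else -5.0 for i in range(10)]
y = [1.5 * xi + 200.0 + e for xi, e in zip(x, noise)]

y_norm, a, b = regression_normalise(x, y)
```

After the transformation, refitting y_norm against x gives a slope close to 1 and an intercept close to 0, which is the sense in which the two channels have been brought to the same level.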
26 Normalisation Methods: From Scatter Plot to MA Plot
- Instead of plotting the two intensity values against one another (scatter plot), it is common to use an MA plot
- M = log2(R/G): log ratio of the two intensities
- A = log2(sqrt(R·G)) = ½ log2(R·G): mean log intensity of the two values
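Computing M and A from the two channels takes one line each. A short Python sketch with made-up intensities chosen as powers of two so the values are easy to read:

```python
import math


def ma_transform(red, green):
    """Return (M, A) lists from paired red and green intensities."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


# Hypothetical intensities for three spots.
red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 256.0]

m, a = ma_transform(red, green)
# m == [2.0, 0.0, -2.0] and a == [9.0, 8.0, 7.0]
```

The MA plot is the scatter plot rotated by 45 degrees on the log scale, so the diagonal y = x becomes the horizontal line M = 0 and any intensity-dependent bias shows up as a trend in M against A.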
27 Normalisation Methods: Lowess Normalization
- Locally Weighted Least Squares Regression
- Assumption: variation in the data is intensity dependent
- Smoothes the intensity function
- Lowess is typically applied to M-A plots
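To make the idea concrete, the sketch below fits a weighted straight line around each point of an M-A plot and subtracts the fitted trend from M. It is a simplified Python illustration: it uses tricube weights and a nearest-neighbour window set by an assumed `frac` parameter, and it omits the robustness iterations that full Lowess implementations normally include.

```python
import math


def lowess_fit(a_values, m_values, frac=0.5):
    """Locally weighted linear fit of M on A, evaluated at every A."""
    n = len(a_values)
    k = min(n, max(3, math.ceil(frac * n)))  # points per local window
    fitted = []
    for a0 in a_values:
        distances = [abs(a - a0) for a in a_values]
        h = sorted(distances)[k - 1]  # window half-width
        # Tricube weights; if h is 0 every point gets weight 1 (global fit).
        weights = []
        for d in distances:
            u = d / h if h > 0 else 0.0
            weights.append((1.0 - u ** 3) ** 3 if u < 1.0 else 0.0)
        sw = sum(weights)
        mean_a = sum(w * a for w, a in zip(weights, a_values)) / sw
        mean_m = sum(w * m for w, m in zip(weights, m_values)) / sw
        sxx = sum(w * (a - mean_a) ** 2 for w, a in zip(weights, a_values))
        sxy = sum(w * (a - mean_a) * (m - mean_m)
                  for w, a, m in zip(weights, a_values, m_values))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted


def lowess_normalise(a_values, m_values, frac=0.5):
    """Subtract the intensity-dependent trend from the log ratios M."""
    trend = lowess_fit(a_values, m_values, frac)
    return [m - t for m, t in zip(m_values, trend)]


# Hypothetical M-A data with a purely intensity-dependent dye bias.
a_values = [6.0 + i for i in range(10)]
m_values = [0.8 - 0.1 * a for a in a_values]
m_normalised = lowess_normalise(a_values, m_values)
```

Because the bias in this toy data is exactly linear in A, the local fits recover it and the normalised M values are all essentially zero; on real data the local windows let the fitted curve bend where the bias is nonlinear.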
28 Summary
- Normalisation is used to identify whether variation is due to experimental conditions.
- Typical sources of variation are
- Population, Sample, Target, Array (Probe), Hybridisation, Data Acquisition
- Different Normalisation Examples
- Probe intensity values
- Two colour arrays
- Positive controls
- Spatial normalisation within array
- Between array normalisation
- Common Normalisation Methods
- Mainly scaling factors and regression