Title: Microarray Basics
1Microarray Basics
- Part 2 Data normalization, data filtering,
measuring variability
2Log transformation of data
Most data bunched in lower left
corner Variability increases with intensity
Data are spread more evenly Variability is more
even
3Simple global normalization to try to fit the data
Slope does not equal 1 means one channel responds
more at higher intensity
Non zero intercept means one channel is
consistently brighter
Non straight line means non linearity in
intensity responses of two channels
4Linear regression of Cy3 against Cy5
5MA plots
Regressing one channel against the other has the
disadvantage of treating the two sets of signals
separately
Also suggested that the human eye has a harder
time seeing deviations from a diagonal line than
a horizontal line
MA plots get around both these issues
Basically a rotation and rescaling of the data
A (log2R log2G)/2
X axis
M log2R-log2G
Y axis
6Scatter plot of intensities
MA plot of same data
7Non linear normalization
Normalization that takes into account intensity
effects
Lowess or loess is the locally weighted
polynomial regression
User defines the size of bins used to calculate
the best fit line
Taken from Stekal (2003) Microarray Bioinformatics
8Adjusted values for the x axis (average intensity
for each feature) calculated using the loess
regression
Should now see the data centred around 0 and
straight across the horizontal axis
9Spatial defects over the slide
- In some cases, you may notice a spatial bias of
the two channels - May be a result of the slide not lying completely
flat in the scanner - This will not be corrected by the methods
discussed before
10Spatial Bias
11Regressions for spatial bias
- Carry out normal loess regression but treat each
subgrid as an entire array (block by block loess) - Corrects best for artifacts introduced by the
pins, as opposed to artifacts of regions of the
slide - Because each subgrid has relatively few spots,
risk having a subgrid where a substantial
proportion of spots are really differentially
expressed- you will lose data if you apply a
loess regression to that block - May also perform a 3-D loess- plot log ratio for
each feature against its x and y coordinates and
perform regression
12Between array normalization
- Previous manipulations help to correct for
non-biological differences between channels on
one array - In order to compare across arrays, also need to
take into account technical variation between
slides - Can start by visualizing the overall data as box
plots - Looking at the distributions of the log ratios or
the log intensities across arrays
13Extremes of distribution
Std Dev of distribution with mean
Extremes of distribution
14Data Scaling
- Makes mean of
- distributions equal
- Subtract mean
- log ratio
- from each log ratio
- Mean of measurements
- will be zero
15Data Centering
- Makes means and
- standard deviations equal
- Do as for scaling,
- but also divide by the
- mean standard deviation
- Will have means intensity
- measurements of zero,
- standard deviations of 1
16Distribution normalization
- Makes overall distributions identical between
arrays - Centre arrays
- For each array, order centered intensities from
highest to lowest - Compute new distribution whose lowest value is
average of all lowest values, and so on - Replace original data with new values for
distribution
17Some key points
- Design the experiment based on the questions you
want to ask - Look at your TIFF images
- Look at the raw data with scatter plots and MA
plots - Normalize within arrays to remove systematic
variability between channels - Scale between arrays prior to comparing results
in a data set
18Data Filtering (flagging of data)
- Can use data filtering to remove or flag features
that one might consider to be unreliable - May base the filter on parameters such as
individual intensity, average feature intensity,
signal to noise ratio, standard deviation across
a feature
19Using intensity filters
- Object is to remove features that have
measurements close to background levels- may see
large ratios that reflect small changes in very
small numbers - May want to set the filter as anything less than
2 times the standard deviation of the background
20If using signal to noise ratio, keep in mind that
the numbers calculated by QuantArray are spot
intensity/std dev of background
Should see that the S/N ratio increase at higher
intensity
Taken from DNA Microarray Data Analysis (CSC)
http//www.csc.fi/oppaat/siru/
21Removing outliers
- May want to simply remove outliers- some
estimates are that the extreme ends of the
distribution should be considered outliers and
removed (0.3 at either end) - Also want to remove saturated values (in either
channel)
22Filtering based on replicates
- Consider two replicates with dyes swapped
- A1 and B2 B1 A2
- We can calculate ? and eliminate spots with the
greatest uncertainty ? gt2
23Replicate Filtering
- Plot of the log
- ratios of 2 replicates
- Remove the data
- in red based on
- deviation of 2
- st dev
Taken from Quakenbush (2002) Nat Genet Supp 32
24Z-scores
- The uncertainty in measurements increases as
intensity decreases - Measurements close to the detection limit are the
most uncertain - Can calculate an intensity-dependent Z-score that
measures the ratio relative to the standard
deviation in the data
Z log2(R/G)-?/?
25Intensity-dependent Z-score
Z gt 2 is at the 95.5 confidence level
26Approaches to using filtering algorithms
qsize
qsignal to noise
qlocal background
qbackground variability
qsaturated
qcom composite quality score based on the
continuous and discrete functions listed above
Taken from Wang et al (2001) NAR 29 e75
27qcom in relation to log ratio plot
Taken from Wang et al (2001) NAR 29 e75
28Measuring and Quantifying Variability
- Variability may be measured
- Between replicate features on an array
- Between two replicates of a sample on an array
- Between two replicates of a sample on different
arrays - Between different samples in a population
29Quantifying variables in microarray data
- Measured value for each feature is a combination
of the true gene expression, and the sources of
variation listed - Each component of variation will have its own
distribution with a standard deviation which can
be measured
30Variability between replicate features
- Requires that features are printed multiple times
on a chip - Optimal if the features are not printed side by
side - Need to calculate this variability separately for
the 2 channels
31Calculate mean of each replicate
Calculate the deviation from the mean for each
replicate
Produce MA plots
If needed, can normalize
0
Diff (Rep1)
Calculate std dev of errors
Ch1 ave log intensity
If the error distribution is normal, you
can calculate v
Frequency
Ch1 Difference
32Variability between channels
- Perform a self to self hybridization
- Perform all the normalization procedures
discussed earlier - The variation that is left is going to be due to
random variability in measurement between the 2
channels
33Variability between arrays
- Same samples on different arrays (or just use the
common reference sample in a larger experiment) - Now are calculating both the variability due to
the manufacturing of different arrays, and the
variability of different hybridizations- these
are confounded variables
34Why calculate these values?
- Gives an estimate of comparability in quality
between experiments - Gives an estimate of noise in the data relative
to population variation - Can be used to track optimization of experiment
35Variability between individuals
- This is the population variability number that is
used in the power calculation - Generally will find that this is the largest
source of variation and this is the one that will
not be decreased by improving the experimental
system
36How to calculate population variability
- Calculate log ratio of each gene relative to the
reference sample - Calculate the average log ratio for each gene
across all samples - For each gene in each sample, subtract the log
ratio from the average log ratio - Plot the distribution of deviations and calculate
the standard deviation (and v)
37http//genome-www5.stanford.edu/mged/normalization
.html
38Part 3-Data Analysis
- How to choose the interesting genes in your
experiment - How to study relationships between groups of
genes identified as interesting - Classification of samples