Microarray Basics - PowerPoint PPT Presentation

1
Microarray Basics
  • Part 2: Data normalization, data filtering,
    measuring variability

2
Log transformation of data
Raw data: most data bunched in lower left corner; variability increases with intensity
Log-transformed data: data are spread more evenly; variability is more even
3
Simple global normalization to try to fit the data
  • A slope not equal to 1 means one channel responds
    more at higher intensity
  • A non-zero intercept means one channel is
    consistently brighter
  • A non-straight line means non-linearity in the
    intensity responses of the two channels
4
Linear regression of Cy3 against Cy5
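The global fit described on the previous slide can be sketched as an ordinary least-squares line through the log intensities, followed by a rescaling of one channel so that the fitted line becomes y = x. This is only a minimal illustration: the function name `fit_line` and the example values are assumptions, not taken from the presentation.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical log2 intensities; Cy3 = 0.5 + 1.1 * Cy5 by construction
log_cy5 = [6.0, 8.0, 10.0, 12.0]
log_cy3 = [0.5 + 1.1 * v for v in log_cy5]

slope, intercept = fit_line(log_cy5, log_cy3)

# Remove the fitted intercept and slope so the two channels line up on y = x
log_cy3_norm = [(v - intercept) / slope for v in log_cy3]
```

A slope away from 1 or an intercept away from 0 in the fit corresponds to the channel differences listed on slide 3; a curved relationship would not be removed by this step.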
5
MA plots
  • Regressing one channel against the other has the
    disadvantage of treating the two sets of signals
    separately
  • It has also been suggested that the human eye has
    a harder time seeing deviations from a diagonal
    line than from a horizontal line
  • MA plots get around both these issues
  • Basically a rotation and rescaling of the data
X axis: $A = (\log_2 R + \log_2 G)/2$
Y axis: $M = \log_2 R - \log_2 G$
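As a small sketch of the rotation, A is the average of the two log2 intensities and M is their difference. The function name `ma_transform` and the intensities below are made up for illustration.

```python
import math

def ma_transform(red, green):
    """Return (A, M) lists for paired red/green intensities (all > 0)."""
    a_vals, m_vals = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        a_vals.append((log_r + log_g) / 2)  # average log intensity (x axis)
        m_vals.append(log_r - log_g)        # log ratio (y axis)
    return a_vals, m_vals

# Hypothetical intensities for three features
red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 256.0]
a_vals, m_vals = ma_transform(red, green)
```

A feature with equal signal in both channels lands on M = 0, so deviations show up against a horizontal line.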
6
Scatter plot of intensities
MA plot of same data
7
Non linear normalization
Normalization that takes into account intensity
effects
Lowess or loess is the locally weighted
polynomial regression
User defines the size of bins used to calculate
the best fit line
Taken from Stekel (2003) Microarray Bioinformatics
8
Adjusted log ratios are calculated from the loess
regression at each x axis value (average intensity
for each feature)
Should now see the data centred around 0 and
lying straight along the horizontal axis
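The idea can be sketched with a simplified local regression: for each feature, fit a weighted straight line to its nearest neighbours along A and subtract the fitted value from M. This is an illustrative stand-in, not a full loess implementation (degree 1 only, tricube weights, no robustness iterations); `loess_fit`, `span` and the example data are assumptions.

```python
def loess_fit(x, y, span=0.5):
    """Local linear fit with tricube weights; returns fitted y at each x."""
    n = len(x)
    k = min(n, max(3, int(round(span * n))))  # neighbours per window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # window half-width (guard against 0)
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical MA data with an intensity-dependent bias in M
a_vals = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
m_vals = [0.2 * a - 2.0 for a in a_vals]

fitted = loess_fit(a_vals, m_vals, span=0.7)
m_norm = [m - f for m, f in zip(m_vals, fitted)]  # centred around 0
```

The `span` argument plays the role of the user-defined bin size: a larger span gives a smoother curve, a smaller one follows local structure more closely.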
9
Spatial defects over the slide
  • In some cases, you may notice a spatial bias of
    the two channels
  • May be a result of the slide not lying completely
    flat in the scanner
  • This will not be corrected by the methods
    discussed before

10
Spatial Bias
11
Regressions for spatial bias
  • Carry out normal loess regression but treat each
    subgrid as an entire array (block by block loess)
  • Corrects best for artifacts introduced by the
    pins, as opposed to artifacts of regions of the
    slide
  • Because each subgrid has relatively few spots,
    there is a risk of having a subgrid where a
    substantial proportion of spots are really
    differentially expressed; you will lose data if
    you apply a loess regression to that block
  • May also perform a 3-D loess: plot the log ratio
    for each feature against its x and y coordinates
    and perform the regression

12
Between array normalization
  • Previous manipulations help to correct for
    non-biological differences between channels on
    one array
  • In order to compare across arrays, also need to
    take into account technical variation between
    slides
  • Can start by visualizing the overall data as box
    plots, looking at the distributions of the log
    ratios or the log intensities across arrays

13
[Box plot annotations: extremes of distribution (both ends); std dev of distribution with mean]
14
Data Scaling
  • Makes means of distributions equal
  • Subtract mean log ratio from each log ratio
  • Mean of measurements will be zero

15
Data Centering
  • Makes means and standard deviations equal
  • Do as for scaling, but also divide by the
    standard deviation
  • Will have mean intensity measurements of zero,
    standard deviations of 1
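
Scaling and centering, as the two slides above define them, might be sketched as follows for a single array. The function names and values are illustrative only, and the sample standard deviation is assumed.

```python
import statistics

def scale(log_ratios):
    """Subtract the array mean so the measurements average zero."""
    mean = statistics.fmean(log_ratios)
    return [v - mean for v in log_ratios]

def centre(log_ratios):
    """Scale, then divide by the array's standard deviation."""
    sd = statistics.stdev(log_ratios)  # sample standard deviation (assumed)
    return [v / sd for v in scale(log_ratios)]

# Hypothetical log ratios from one array
array_1 = [0.5, 1.5, 2.5, 3.5]
scaled = scale(array_1)
centred = centre(array_1)
```

Applying the same function to every array gives them a common mean (scaling) or a common mean and spread (centering).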

16
Distribution normalization
  • Makes overall distributions identical between
    arrays
  • Centre arrays
  • For each array, order centered intensities from
    highest to lowest
  • Compute new distribution whose lowest value is
    average of all lowest values, and so on
  • Replace original data with new values for
    distribution
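
The steps above can be sketched as a rank-based replacement. This assumes every array has the same number of features and has already been centred; ties are broken arbitrarily here, and the names are illustrative.

```python
def distribution_normalize(arrays):
    """arrays: list of equal-length lists, one per (already centred) array."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # New distribution: average of the k-th smallest values across arrays
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices, lowest first
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace value by the reference at its rank
        normalized.append(out)
    return normalized

# Two hypothetical centred arrays with three features each
arrays = [[-1.0, 0.0, 1.0], [2.0, -2.0, 0.0]]
result = distribution_normalize(arrays)
```

After this step every array contains exactly the same set of values; only the assignment of values to features differs.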

17
Some key points
  • Design the experiment based on the questions you
    want to ask
  • Look at your TIFF images
  • Look at the raw data with scatter plots and MA
    plots
  • Normalize within arrays to remove systematic
    variability between channels
  • Scale between arrays prior to comparing results
    in a data set

18
Data Filtering (flagging of data)
  • Can use data filtering to remove or flag features
    that one might consider to be unreliable
  • May base the filter on parameters such as
    individual intensity, average feature intensity,
    signal to noise ratio, standard deviation across
    a feature

19
Using intensity filters
  • Object is to remove features that have
    measurements close to background levels; these
    may show large ratios that reflect small changes
    in very small numbers
  • May want to set the filter to remove anything
    less than 2 times the standard deviation of the
    background
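
A minimal sketch of such a filter, assuming it is applied to a background-subtracted signal and that the background standard deviation is available per feature. The field names `signal` and `bg_sd` are hypothetical and do not come from any particular image analysis package.

```python
def passes_intensity_filter(signal, background_sd, k=2.0):
    """Keep a feature only if its signal is at least k background SDs."""
    return signal >= k * background_sd

# Hypothetical features: f2 sits close to background
features = [
    {"id": "f1", "signal": 500.0, "bg_sd": 40.0},
    {"id": "f2", "signal": 60.0, "bg_sd": 40.0},
]
kept = [f["id"] for f in features
        if passes_intensity_filter(f["signal"], f["bg_sd"])]
```

In a two-channel experiment the same check would typically be applied to each channel, flagging the feature if either fails.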

20
If using signal to noise ratio, keep in mind that
the numbers calculated by QuantArray are spot
intensity/std dev of background
Should see that the S/N ratio increases at higher
intensity
Taken from DNA Microarray Data Analysis (CSC)
http://www.csc.fi/oppaat/siru/
21
Removing outliers
  • May want to simply remove outliers; some
    estimates are that the extreme ends of the
    distribution should be considered outliers and
    removed (0.3% at either end)
  • Also want to remove saturated values (in either
    channel)
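
One way to sketch this, reading the slide's figure as 0.3% at each end of the log ratio distribution and assuming a 16-bit scanner whose saturated pixels read 65535. Both the tail fraction and the saturation value are assumptions to adjust for the actual data.

```python
def remove_outliers(features, tail=0.003, saturation=65535):
    """features: list of (log_ratio, red, green) tuples."""
    # Drop features saturated in either channel
    unsaturated = [f for f in features
                   if f[1] < saturation and f[2] < saturation]
    # Trim the extreme ends of the log ratio distribution
    ranked = sorted(unsaturated, key=lambda f: f[0])
    cut = round(len(ranked) * tail)
    return ranked[cut:len(ranked) - cut]

# 1000 hypothetical features plus one saturated in the red channel
features = [(i / 100.0, 1000, 1000) for i in range(1000)]
features.append((0.0, 65535, 1000))
kept = remove_outliers(features)
```

Note that the returned list is sorted by log ratio, which is convenient for inspection but loses the original feature order.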

22
Filtering based on replicates
  • Consider two replicates with dyes swapped
  • Ratios A1/B1 and B2/A2
  • We can calculate σ and eliminate spots with the
    greatest uncertainty (> 2σ)
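
A sketch of one reading of this slide: for a dye-swap pair the product of the two ratios, (A1/B1)(B2/A2), should be about 1, so its log2 should be about 0. Spots whose value lies more than 2 standard deviations from the mean are dropped. The function name and example ratios are illustrative assumptions.

```python
import math
import statistics

def dye_swap_filter(t1, t2, n_sd=2.0):
    """t1 = A1/B1 and t2 = B2/A2 per spot; returns indices of spots to keep."""
    d = [math.log2(a * b) for a, b in zip(t1, t2)]  # expected to be ~0
    mu = statistics.fmean(d)
    sigma = statistics.stdev(d)
    return [i for i, v in enumerate(d) if abs(v - mu) <= n_sd * sigma]

# Ten consistent hypothetical spots and one inconsistent spot (the last)
t1 = [2.0, 0.5, 4.0, 1.0, 2.0, 0.25, 1.0, 8.0, 0.5, 1.0, 4.0]
t2 = [0.5, 2.0, 0.25, 1.0, 0.5, 4.0, 1.0, 0.125, 2.0, 1.0, 4.0]
kept = dye_swap_filter(t1, t2)
```

The last spot reports a fourfold change in the same direction in both hybridizations despite the dye swap, so it is the one removed.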

23
Replicate Filtering
  • Plot of the log ratios of 2 replicates
  • Remove the data in red based on a deviation of
    more than 2 standard deviations

Taken from Quackenbush (2002) Nat Genet Suppl 32
24
Z-scores
  • The uncertainty in measurements increases as
    intensity decreases
  • Measurements close to the detection limit are the
    most uncertain
  • Can calculate an intensity-dependent Z-score that
    measures the ratio relative to the standard
    deviation in the data

$Z = \dfrac{\log_2(R/G) - \mu}{\sigma}$
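An intensity-dependent Z-score can be sketched by taking the mean and standard deviation of the log ratios in a sliding window of features with similar average intensity A, then standardizing each log ratio M against its own window. The window size and names are assumptions; real data would use a much larger window than this toy example.

```python
import statistics

def local_z_scores(a_vals, m_vals, window=5):
    """Z-score of each log ratio relative to features of similar intensity."""
    order = sorted(range(len(a_vals)), key=lambda i: a_vals[i])
    z = [0.0] * len(a_vals)
    half = window // 2
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        local = [m_vals[j] for j in order[lo:hi]]  # neighbours along A
        mu = statistics.fmean(local)
        sigma = statistics.stdev(local)
        z[i] = (m_vals[i] - mu) / sigma if sigma > 0 else 0.0
    return z

# Hypothetical MA data; the feature at A = 10 has a large log ratio
a_vals = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
m_vals = [0.1, -0.1, 0.1, -0.1, 3.0, 0.1, -0.1]
z = local_z_scores(a_vals, m_vals)
most_extreme = max(range(len(z)), key=lambda i: abs(z[i]))
```

Because the local standard deviation is larger at low intensity, the same log ratio yields a smaller Z there than it would at high intensity.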
25
Intensity-dependent Z-score
Z > 2 is at the 95.5% confidence level
26
Approaches to using filtering algorithms
  • q_size
  • q_signal-to-noise
  • q_local background
  • q_background variability
  • q_saturated
  • q_com: composite quality score based on the
    continuous and discrete functions listed above
Taken from Wang et al (2001) NAR 29 e75
27
q_com in relation to log ratio plot
Taken from Wang et al (2001) NAR 29 e75
28
Measuring and Quantifying Variability
  • Variability may be measured:
  • Between replicate features on an array
  • Between two replicates of a sample on an array
  • Between two replicates of a sample on different
    arrays
  • Between different samples in a population

29
Quantifying variables in microarray data
  • Measured value for each feature is a combination
    of the true gene expression, and the sources of
    variation listed
  • Each component of variation will have its own
    distribution with a standard deviation which can
    be measured

30
Variability between replicate features
  • Requires that features are printed multiple times
    on a chip
  • Optimal if the features are not printed side by
    side
  • Need to calculate this variability separately for
    the 2 channels

31
Calculate mean of each replicate
Calculate the deviation from the mean for each
replicate
Produce MA plots; if needed, can normalize
Calculate std dev of errors
If the error distribution is normal, you
can calculate v
[Figure panels: Diff (Rep1) against Ch1 ave log intensity; Frequency against Ch1 Difference]
32
Variability between channels
  • Perform a self to self hybridization
  • Perform all the normalization procedures
    discussed earlier
  • The variation that is left is going to be due to
    random variability in measurement between the 2
    channels

33
Variability between arrays
  • Same samples on different arrays (or just use the
    common reference sample in a larger experiment)
  • Now calculating both the variability due to the
    manufacturing of different arrays and the
    variability of different hybridizations; these
    are confounded variables

34
Why calculate these values?
  • Gives an estimate of comparability in quality
    between experiments
  • Gives an estimate of noise in the data relative
    to population variation
  • Can be used to track optimization of experiment

35
Variability between individuals
  • This is the population variability number that is
    used in the power calculation
  • Generally will find that this is the largest
    source of variation and this is the one that will
    not be decreased by improving the experimental
    system

36
How to calculate population variability
  • Calculate log ratio of each gene relative to the
    reference sample
  • Calculate the average log ratio for each gene
    across all samples
  • For each gene in each sample, subtract the log
    ratio from the average log ratio
  • Plot the distribution of deviations and calculate
    the standard deviation (and v)
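
The steps above can be sketched as follows. The gene names and values are made up, and the population standard deviation of the pooled deviations is used as one reasonable choice; the quantity v is not computed because the slide does not define it.

```python
import statistics

def population_deviations(log_ratios):
    """log_ratios: dict of gene -> list of log ratios, one per sample."""
    deviations = []
    for gene, values in log_ratios.items():
        mean = statistics.fmean(values)             # average across samples
        deviations.extend(mean - v for v in values)  # deviation per sample
    return deviations

# Hypothetical log ratios (vs the reference sample) for three samples
log_ratios = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [-0.5, 0.0, 0.5],
}
dev = population_deviations(log_ratios)
sd = statistics.pstdev(dev)  # spread of the pooled deviations
```

Plotting a histogram of `dev` gives the distribution of deviations referred to in the last step.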

37
http://genome-www5.stanford.edu/mged/normalization.html
38
Part 3-Data Analysis
  • How to choose the interesting genes in your
    experiment
  • How to study relationships between groups of
    genes identified as interesting
  • Classification of samples