Title: Microarray Basics
1Microarray Basics
- Part 1: Choosing a platform, setting up, data preprocessing
2Experimental design
- What type of microarray
- What overall design strategy
- How many replicates
3Type of Microarray
One colour
Two colour
cDNA
Long oligo
Short oligo
Genome wide
Custom
availability, cost, represented genes, need,
perceived accuracy/reproducibility
4Experimental Design Strategies
5How many replicates?
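True situation versus your call (NS = not significant, S = significant):
- Call NS, truly not diff expressed: correct
- Call NS, truly diff expressed: Type 2 error (power)
- Call S, truly not diff expressed: Type 1 error (confidence)
- Call S, truly diff expressed: correct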
Technical replicates do NOT count as different
samples in the power calculation
6Power analysis requires decisions about
Difference in mean that you are trying to detect
The std dev of the population variability
Power you are trying to achieve
Significance level that you are trying to achieve
Experimental design
Example: you have a 10,000 gene chip, and want to identify 95% of the genes that are 2 fold up or down regulated in samples following treatment. You will tolerate 1 false positive call out of the 10,000 genes tested. The coefficient of variability in your population is 50%. You are doing a paired analysis.
One can conclude that you will need 22 patients.
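The decisions listed above can be turned into a rough sample size estimate. The sketch below uses a normal approximation for a two-sided paired test. The function name, the log-normal conversion of the coefficient of variability, and the use of that value as the standard deviation of the paired differences are all assumptions made for illustration, so it should not be expected to reproduce the 22 patients quoted above.

```python
import math
from statistics import NormalDist

def paired_sample_size(delta, sd_diff, alpha, power):
    """Approximate number of pairs for a two-sided paired test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_power = z.inv_cdf(power)           # quantile for the power you want to achieve
    return math.ceil(((z_alpha + z_power) * sd_diff / delta) ** 2)

# Hypothetical inputs loosely based on the example above.
alpha_per_gene = 1 / 10_000              # about 1 false positive in 10,000 genes tested
delta = math.log2(2)                     # a 2 fold change on the log2 scale
cv = 0.5                                 # coefficient of variability of 50%
# Assumption: intensities are log-normal, so the CV maps to a log2-scale std dev,
# which is used here as the std dev of the paired differences.
sd_log2 = math.sqrt(math.log(1 + cv ** 2)) / math.log(2)

n_pairs = paired_sample_size(delta, sd_log2, alpha_per_gene, power=0.95)
print(n_pairs)
```

A t-distribution based calculation, or a different assumption about the variability of the paired differences, will give a larger number than this approximation.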
7Technical replicates
- Most publications recommend at least 3 if that is possible
- These are considered to be replicates at the level of the experimental platform
- Beware of doing 2 now and hoping to add one more later
- In downstream analysis, it is generally suggested to use the average of technical replicates; these are not different samples for analysis
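Averaging technical replicates before analysis can be sketched as follows. The sample names and values are hypothetical; the point is only that each biological sample contributes a single value downstream.

```python
from collections import defaultdict
from statistics import mean

def average_technical_replicates(measurements):
    """Collapse (sample_id, value) pairs to one mean value per biological sample."""
    by_sample = defaultdict(list)
    for sample_id, value in measurements:
        by_sample[sample_id].append(value)
    return {sample_id: mean(values) for sample_id, values in by_sample.items()}

# Hypothetical log2 ratios for one gene: three technical replicates of two patients.
log_ratios = [
    ("patient_1", 1.0), ("patient_1", 1.2), ("patient_1", 0.8),
    ("patient_2", -0.5), ("patient_2", -0.3), ("patient_2", -0.4),
]
averaged = average_technical_replicates(log_ratios)
print(len(averaged))  # 2 biological samples remain for the analysis
```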
8RNA required to get started
- Source of both experimental and reference RNA
- Will need about 10-20 µg of total RNA from each source for each experiment or chip
- This RNA needs to be of high quality
- How do you check quality?
9Common sources of RNA
- Cultured animal cells: generally easy to disrupt and yield large amounts of high quality RNA
- Animal tissues: some require harsh disruption treatments (such as soft tissues like kidney or liver) and some may require additional treatments (such as fatty or fibrous tissues, which may require more stringent lysis)
- Blood: may be influenced by the anticoagulant in the collection system, and also seems to contain enzyme inhibitors
- Plant material: some metabolites make purification difficult; extractions may also be highly viscous
- Bacteria: may want to consider stabilization
10Checking RNA quality
- Conventional methods include agarose gel electrophoresis to look for evidence of degradation
- Spectrophotometric readings to give an idea of purity
- Bioanalyzer to provide a scan, with integrity and quantity measurements
11Provides an RIN
Provides a
Requires 1 µl of 50ng/µl stock
12RNA amplification
- When quantity of RNA is limited, may have to consider amplification
- Several strategies, but need to decide up front if you want sense or antisense amplified material
15What do you get back after an experiment?
- TIFF images: one image for each fluor used in the experiment, from the same chip scanned twice (or more times if multiple scans were done to compensate for intensity)
- Spreadsheet of quantitated data
16TIFF images
- Generally named as bar code_fluor_PMT setting_laser setting
- These settings will not necessarily be the same for your two scans from the same chip; they are manipulated to try to produce scans of even intensity from the two fluors
- The final image should have only a few white spots over the whole array; these represent saturated spots
17How can you tell anything about the quality of
your data?
- Easiest way to start is to look at your TIFF images
- Look for blank areas on the slide
- Look for areas where one fluor is consistently brighter than the other
- Look for gradients of intensity
- Differentiate between artifacts introduced by slide quality, those by RNA quality and those by experimental procedure
18Slide issues- printing
- Presence of donuts
- Smeared spots
- Scratches on surface of slide
- Non circular spots
- Spots off the grid
- No signals in areas
- Consistent problems with the same area of each
subarray
19RNA quality issues
- General low intensity
- Consistent problems with one sample, regardless of fluor used
- High level of background, grainy over entire slide
20Experimental issues
- One fluor consistently not giving good signal regardless of RNA sample labelled
- High areas of local background, not covering entire slide
- Obvious intensity gradients
- Bubbles over surface of chip
21- After looking at your images you should have a sense of whether or not these data are likely to be clean and of high enough quality to warrant proceeding
- If not, you need to try to determine where the problem originates
22Image processing
- Choice of methods for quantitating image
- Fixed circle
- Good for arrays with regular sizes of spots
- Variable circle
- Better for arrays with irregular sizes
- Histogram
- Best for arrays with irregular sizes and shapes
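As an illustration of the simplest of these methods, the sketch below quantitates one spot with a fixed circle on a small synthetic image. The function name, the `margin` parameter for the local background window, and the image itself are assumptions for illustration; this is not how QuantArray or any other package implements it.

```python
from statistics import mean

def fixed_circle_quantitate(image, centre_row, centre_col, radius, margin=2):
    """Mean pixel value inside a fixed circle, and mean of the surrounding local background.

    image is a list of rows of pixel values; margin (an assumed parameter) sets how far
    outside the circle the local background window extends.
    """
    signal, background = [], []
    reach = radius + margin
    for r in range(centre_row - reach, centre_row + reach + 1):
        for c in range(centre_col - reach, centre_col + reach + 1):
            if not (0 <= r < len(image) and 0 <= c < len(image[0])):
                continue  # skip pixels that fall off the edge of the image
            inside = (r - centre_row) ** 2 + (c - centre_col) ** 2 <= radius ** 2
            (signal if inside else background).append(image[r][c])
    return mean(signal), mean(background)

# Hypothetical 9 x 9 image: background of 100 with a bright round spot at the centre.
image = [[1000 if (r - 4) ** 2 + (c - 4) ** 2 <= 4 else 100 for c in range(9)]
         for r in range(9)]
spot_signal, local_background = fixed_circle_quantitate(image, 4, 4, radius=2)
print(spot_signal, local_background)
```

If the real spot were smaller or larger than the fixed radius, background pixels would leak into the signal or signal pixels into the background, which is why variable circle and histogram methods suit irregular spots better.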
23Data quantitation
- The images are quantitated, generating a lengthy spreadsheet
- This is done in the facility using QuantArray, but can be done using other freeware (ScanAlyze) or commercial software
- The output can generally be opened in Excel for first pass manipulation of data
24QuantArray output
- QA generates a series of columns that many people find confusing
- In general, it provides the data in two ways on a single sheet: the first shows one channel as a proportion of the other, the second provides absolute pixel counts for each channel
25Information about the experiment
Data presented as ratios
Raw quantitated data
28Locator and identifier columns
- A unique number assigned to that spot
- B Row of subgrid
- C Column of subgrid
- D Row of spot within subgrid
- E Column of spot within subgrid
- F Gene identification
- G x coordinate of each spot
- H y coordinate of each spot
29Spot Values
- I/U intensity of signal in ch1/ch2
- J/V intensity of background in ch1/ch2
- K/W std dev of intensity of signal in ch1/ch2
- L/X std dev of intensity of background in ch1/ch2
30Quality control measurements
- M/Y spot diameter
- N/Z spot area
- O/AA spot footprint
- P/AB spot circularity
- Q/AC spot uniformity
- R/AD background uniformity
- S/AE signal to noise ratio
31Data Cleaning
Are there flagged spots?
- May see flags in the last column; these are added by the user during quantitation
Are there areas of the images that you just wouldn't trust?
Are there saturated spots?
You have the option of removing, recalculating, ignoring, flagging or resetting the results of these spots so that they don't interfere with downstream analysis
At this stage, you may also want to background subtract the raw intensities
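A minimal sketch of one cleaning policy is shown below: flagged or saturated spots are removed, and the remaining intensities are background subtracted. The saturation value assumes 16-bit TIFF scans, and dropping spots at or below background is one possible choice among the options listed above, not the only one.

```python
SATURATION = 65535  # assumption: maximum pixel value of a 16-bit TIFF scan

def clean_spot(signal, background, flagged=False):
    """Return the background-subtracted intensity, or None if the spot is excluded."""
    if flagged or signal >= SATURATION:
        return None                      # remove user-flagged and saturated spots
    corrected = signal - background
    return corrected if corrected > 0 else None  # drop spots at or below background

# Hypothetical (signal, background, flagged) values for four spots in one channel.
spots = [(1500, 200, False), (65535, 300, False), (900, 250, True), (180, 220, False)]
cleaned = [clean_spot(s, b, f) for s, b, f in spots]
print(cleaned)  # [1300, None, None, None]
```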
32On chip controls and how they behave
- Blank spots: generally 3XSSC (print buffer)
- Expect no signal; can use the average or median intensity of these spots as the lower cutoff for what represents a real signal
- However, not all empty spots are the same on some chips
- Possibility that there is carryover from non-empty spots printed with the same pin
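Using the blank spots as a lower cutoff can be sketched as follows. The intensities and gene names are made up; the median is used here because it is less affected than the mean by an occasional blank spot with carryover.

```python
from statistics import median

def detection_cutoff(blank_intensities):
    """Lower cutoff for a real signal, taken as the median of the blank spots."""
    return median(blank_intensities)

# Hypothetical blank spot intensities; the last one mimics carryover from the pin.
blank_spots = [110, 95, 130, 105, 400]
cutoff = detection_cutoff(blank_spots)

# Hypothetical gene spots: keep only those above the cutoff.
gene_spots = {"geneA": 90, "geneB": 850, "geneC": 120}
detected = {gene for gene, value in gene_spots.items() if value > cutoff}
print(cutoff, sorted(detected))  # 110 ['geneB', 'geneC']
```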
33On chip controls
- Multiple spots of the same gene
- In general, if it is exactly the same sequence, can assess the variability of these spots to assess artifacts of geography on the chip
- If it is not the same sequence, less straightforward
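One simple way to assess the variability of identical replicate spots is the coefficient of variation (std dev divided by mean). The sketch below uses hypothetical intensities; what counts as an acceptable value is left to the user.

```python
from statistics import mean, stdev

def replicate_cv(intensities):
    """Coefficient of variation (std dev / mean) of replicate spot intensities."""
    return stdev(intensities) / mean(intensities)

# Hypothetical intensities of the same sequence printed at four places on the chip.
replicates = [980, 1010, 1005, 1020]
print(round(replicate_cv(replicates), 3))
```

A high value for spots of exactly the same sequence points to an effect of position on the chip rather than of biology.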
34On chip controls
- Housekeeping genes: if you can identify a set of genes that should remain at constant expression, can use these to standardize the two channels
- Correctly identifying such genes is difficult
- May also have exogenous controls that can be added, but must identify these prior to hybridizing the slides
35Log transformation of data
Before log transformation: most data bunched in lower left corner; variability increases with intensity
After log transformation: data are spread more evenly; variability is more even
36Within array normalization
In two colour arrays, you are measuring two different samples, labelled in two different reactions with two different fluors and measured using two different lasers at two different wavelengths. In addition, you are dealing with the distribution of spots across a relatively large surface.
Need to try to eliminate some of these potential sources of variation so that the variation that is left is more likely to be due to biological effects.
37Dye Bias
- The two dyes incorporate differently into DNA of different abundance
- The two dyes may have different emission responses to the laser at different abundances
- The two dye emissions may be measured by the PMT differently at different intensities
- The intensities of the dyes may vary over the surface of the slide, but not in synch, as the focus of each laser is separate
38Correcting for dye bias
- Global normalization using median or mean
- Linear regression of Cy3 against Cy5
- Linear regression of the log ratio against the average intensity (MA plots)
- Non linear regression of the log ratio against the average intensity (loess)
- Assumption that most genes are not differentially expressed
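The first of these methods, global normalization using the median, can be sketched in a few lines. The function name and intensities are hypothetical. It relies on the assumption stated above: if most genes are not differentially expressed, the median log ratio should be zero.

```python
import math
from statistics import median

def median_normalize(red, green):
    """Shift log2(R/G) so that the median log ratio across the array is zero."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)           # global offset between the two channels
    return [m - shift for m in log_ratios]

# Hypothetical intensities: red is twice as bright overall, one gene is truly changed.
red = [200, 400, 800, 1600, 3200]
green = [100, 200, 400, 800, 400]
normalized = median_normalize(red, green)
print(normalized)  # [0.0, 0.0, 0.0, 0.0, 2.0]
```

A single global shift cannot correct a bias that changes with intensity, which is what the regression methods address.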
39Simple global normalization to try to fit the data
Slope does not equal 1 means one channel responds
more at higher intensity
Non zero intercept means one channel is
consistently brighter
Non straight line means non linearity in
intensity responses of two channels
40Linear regression of Cy3 against Cy5
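A least squares fit of Cy3 against Cy5 gives the slope and intercept described above. The sketch below fits the line and then rescales Cy3 onto the Cy5 scale. The function name and data are hypothetical, and the data were constructed so that Cy3 = 1.2 x Cy5 + 50.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical spot intensities in the two channels.
cy5 = [100, 200, 400, 800]
cy3 = [170, 290, 530, 1010]

slope, intercept = fit_line(cy5, cy3)          # regression of Cy3 against Cy5
cy3_adjusted = [(v - intercept) / slope for v in cy3]
print(round(slope, 3), round(intercept, 3))
```

Here the slope above 1 says Cy3 responds more at higher intensity and the positive intercept says Cy3 is consistently brighter; after adjustment the two channels agree.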
41MA plots
Regressing one channel against the other has the
disadvantage of treating the two sets of signals
separately
Also suggested that the human eye has a harder
time seeing deviations from a diagonal line than
a horizontal line
MA plots get around both these issues
Basically a rotation and rescaling of the data
A = (log2 R + log2 G)/2, plotted on the x axis
M = log2 R - log2 G, plotted on the y axis
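Computing the two MA plot coordinates for one spot takes only a few lines. The function name and the intensities are hypothetical; R and G are the red and green channel intensities.

```python
import math

def ma_values(red, green):
    """Return (M, A): the log ratio and the average log intensity of one spot."""
    m = math.log2(red) - math.log2(green)        # y axis: log ratio
    a = (math.log2(red) + math.log2(green)) / 2  # x axis: average log intensity
    return m, a

print(ma_values(1024, 256))  # (2.0, 9.0)
```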
42Scatterplot of intensities
MA plot of same data
43Non linear normalization
Normalization that takes into account intensity
effects
Lowess or loess is the locally weighted
polynomial regression
User defines the size of bins used to calculate
the best fit line
Taken from Stekel (2003) Microarray Bioinformatics
44Adjusted log ratios are calculated for each feature by subtracting the loess fit at that feature's average intensity (x axis value)
Should now see the data centred around 0 and straight across the horizontal axis
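A simplified loess normalization can be sketched in plain Python. This version fits a local straight line with tricube weights at each point and has no robustness iterations, so it is a teaching sketch rather than the loess used by any particular package. The function names and the `span` parameter (the fraction of points used in each local fit) are assumptions.

```python
def loess_fit(a, m, span=0.5):
    """Locally weighted linear fit of m on a, evaluated at every value of a."""
    n = len(a)
    k = min(n, max(3, round(span * n)))          # points used in each local fit
    fitted = []
    for x0 in a:
        distances = sorted(abs(x - x0) for x in a)
        h = distances[k - 1] or 1e-12            # bandwidth: k-th nearest distance
        w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in a]  # tricube weights
        sw = sum(w)
        mean_x = sum(wi * x for wi, x in zip(w, a)) / sw
        mean_y = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted

def loess_normalize(a, m, span=0.5):
    """Subtract the loess fit from each log ratio so the MA plot centres on zero."""
    return [mi - fi for mi, fi in zip(m, loess_fit(a, m, span))]

# Hypothetical MA data with an intensity-dependent dye bias of M = 0.3 * A - 2.
a_values = [float(x) for x in range(6, 16)]
m_values = [0.3 * x - 2 for x in a_values]
adjusted = loess_normalize(a_values, m_values)
print(max(abs(v) for v in adjusted) < 1e-9)
```

With real data the trend is curved and noisy, and a smaller span follows it more closely at the risk of removing real differential expression.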
45Spatial defects over the slide
- In some cases, you may notice a spatial bias of the two channels
- May be a result of the slide not lying completely flat in the scanner
- This will not be corrected by the methods discussed before
46Regressions for spatial bias
- Carry out normal loess regression but treat each subgrid as an entire array (block by block loess)
- Corrects best for artifacts introduced by the pins, as opposed to artifacts of regions of the slide
- Because each subgrid has relatively few spots, you risk having a subgrid where a substantial proportion of spots are really differentially expressed; you will lose data if you apply a loess regression to that block
- May also perform a 2-D loess: plot the log ratio for each feature against its x and y coordinates and perform regression
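The block by block idea, treating each subgrid as its own array, can be sketched as follows. To keep the example short, each block is centred on its own median log ratio instead of being fitted with a loess curve; this is a deliberately simpler stand-in and not the block by block loess itself. The function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import median

def normalize_by_block(spots):
    """Centre the log ratios of each subgrid on that subgrid's own median.

    spots is a list of (block_id, log_ratio) pairs, where block_id could be the
    (row of subgrid, column of subgrid) pair from the locator columns.
    """
    by_block = defaultdict(list)
    for block_id, log_ratio in spots:
        by_block[block_id].append(log_ratio)
    block_median = {block: median(values) for block, values in by_block.items()}
    return [(block, m - block_median[block]) for block, m in spots]

# Hypothetical log ratios from two subgrids with different pin-related offsets.
spots = [((1, 1), 0.5), ((1, 1), 0.7), ((1, 1), 0.6),
         ((1, 2), -0.2), ((1, 2), 0.0), ((1, 2), -0.1)]
normalized_spots = normalize_by_block(spots)
print(normalized_spots[2], normalized_spots[5])
```

The same caution applies as for the block loess: with few spots per subgrid, a block dominated by truly changed genes will be over-corrected.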
47Acknowledgements
- Perseus Missirlis
- Natasha Gallo
- Jim Gore
- Jennifer Kreiger
- Scott Davey