Title: The second-simplest cDNA microarray data analysis problem
1The second-simplest cDNA microarray data
analysis problem
- Terry Speed, UC Berkeley
- Bioinformatic Strategies For Application of Genomic Tools to Environmental Health Research, March 5, 2001
- NIEHS National Center for Toxicogenomics, NCSU Bioinformatics Research Center
2Biological question: differentially expressed genes, sample class prediction, etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
3Some motherhood statements
- Important aspects of a statistical analysis include:
  - Tentatively separating systematic from random sources of variation
  - Removing the former and quantifying the latter, when the system is in control
  - Identifying and dealing with the most relevant source of variation in subsequent analyses
- Only if this is done can we hope to make more or less valid probability statements
4The simplest cDNA microarray data analysis
problem is identifying differentially expressed
genes using one slide
- This is a common enough hope
- Efforts are frequently successful
- It is not hard to do by eye
- The problem is probably beyond formal statistical inference (valid p-values, etc.) for the foreseeable future, and here's why
5An M vs. A plot
$M = \log_2(R/G)$
$A = \log_2(RG)/2$
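The two definitions can be sketched in a few lines of Python. This is a minimal illustration, assuming `red` and `green` are positive intensities for a single spot; the function name `ma_values` and the example intensities are made up for this sketch and are not from the talk.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from positive red and green intensities."""
    m = math.log2(red / green)        # log-ratio: M = log2(R/G)
    a = 0.5 * math.log2(red * green)  # average log-intensity: A = log2(RG)/2
    return m, a


# Hypothetical (R, G) pairs, for illustration only.
spots = [(1200.0, 300.0), (500.0, 500.0), (64.0, 256.0)]
ma = [ma_values(r, g) for r, g in spots]
```

Plotting the M values against the A values for every spot on a slide gives the M vs. A plot.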
6Background matters
From Spot
From GenePix
7No background correction
With background correction
From the NCI60 data set (Stanford web site)
8An experiment having within-slide replicates
9Background makes a difference
Background method   Segmentation method   Exp1   Exp2
No background       S.nbg                    6      6
                    Gp.nbg                   7      6
                    SA.nbg                   6      6
                    QA.fix.nbg               7      6
                    QA.hist.nbg              7      6
                    QA.adp.nbg              14     14
Local surrounding   S.valley                17     21
                    GP                      11     11
                    SA                      12     14
                    QA.fix                  18     23
                    QA.hist                  9      8
                    QA.adp                  27     26
Others              S.morph                  9      9
                    S.const                 14     14
Medians of the SD of log2(R/G) for 8 replicated spots, multiplied by 100 and rounded to the nearest integer.
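The summary reported in the table can be sketched as follows. This is only an illustration: it assumes the SD is the usual sample SD (divisor n - 1), which the slide does not state, and the function name `replicate_sd_summary` and its input layout (one list of replicate log-ratios per gene) are hypothetical.

```python
import statistics


def replicate_sd_summary(log_ratios_by_gene):
    """Median over genes of the SD of replicated log2(R/G), times 100, rounded.

    log_ratios_by_gene: one list of replicate log-ratios per gene
    (8 replicated spots per gene in the experiment described above).
    Assumption: sample SD with divisor n - 1.
    """
    sds = [statistics.stdev(reps) for reps in log_ratios_by_gene]
    return round(100 * statistics.median(sds))
```

A smaller value means the replicated spots agree more closely under that combination of segmentation and background method.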
10Normalisation - lowess
- Global lowess (Matt Callow's data, LBNL)
- Assumption: changes roughly symmetric at all intensities.
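The idea of global intensity-dependent normalisation can be sketched as: fit a smooth curve c(A) to the M vs. A scatter and subtract it from every M. The code below is a simplified stand-in for lowess, not the implementation used in the talk: it does a single pass of tricube-weighted local straight-line fitting and omits the robustness iterations of full lowess. The function names and the span `frac=0.4` are assumptions of this sketch.

```python
def tricube(u):
    """Tricube kernel weight for a scaled distance u >= 0."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_fit(a_vals, m_vals, a0, frac=0.4):
    """Fitted M at intensity a0 from a tricube-weighted straight-line fit."""
    k = max(2, round(frac * len(a_vals)))
    # Bandwidth: distance from a0 to its k-th nearest spot.
    bandwidth = sorted(abs(a - a0) for a in a_vals)[k - 1]
    if bandwidth == 0.0:
        bandwidth = 1e-12
    w = [tricube(abs(a - a0) / bandwidth) for a in a_vals]
    sw = sum(w)
    if sw == 0.0:  # degenerate neighbourhood: fall back to the plain mean
        return sum(m_vals) / len(m_vals)
    a_bar = sum(wi * a for wi, a in zip(w, a_vals)) / sw
    m_bar = sum(wi * m for wi, m in zip(w, m_vals)) / sw
    sxx = sum(wi * (a - a_bar) ** 2 for wi, a in zip(w, a_vals))
    sxy = sum(wi * (a - a_bar) * (m - m_bar)
              for wi, a, m in zip(w, a_vals, m_vals))
    slope = sxy / sxx if sxx > 1e-12 else 0.0
    return m_bar + slope * (a0 - a_bar)


def normalize_global(a_vals, m_vals, frac=0.4):
    """Subtract the fitted curve c(A) from each M (quadratic cost: sketch only)."""
    return [m - local_fit(a_vals, m_vals, a, frac)
            for a, m in zip(a_vals, m_vals)]
```

Print-tip normalisation applies the same kind of fit separately to the spots from each print-tip group instead of to the whole slide.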
11From the NCI60 data set (Stanford web site)
12Ngai lab, UCB
13Tiago's data from the Goodman lab, UCB
14From the Ernest Gallo Clinic Research Center
15From Peter MacCallum Cancer Research Institute, Australia
16Normalisation - print tip
Assumption: for every print-tip group, changes roughly symmetric at all intensities.
17M vs A after print-tip normalisation
18Normalization (ctd): another data set
[Plot: log-ratios by print-tip group]
- After within-slide global lowess normalization.
- Likely to be a spatial effect.
19Taking scale into account
- Assumption: all print-tip groups have the same spread in M
- True log ratio is $\mu_{ij}$, where $i$ indexes print-tip groups and $j$ indexes spots
- Observed is $M_{ij}$, where $M_{ij} = a_i \mu_{ij}$
- Robust estimate of $a_i$: [equation missing in source], with
- $\mathrm{MAD}_i = \mathrm{median}_j \{ |y_{ij} - \mathrm{median}_j(y_{ij})| \}$
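A minimal sketch of scale normalisation based on the MAD follows. The estimate of the scale factor used here is an assumption: each group's MAD divided by the geometric mean of all the groups' MADs, which is the form given in the published account of print-tip scale normalisation (Yang et al.); the slide's own equation did not survive in this transcript. The function names and the dictionary layout are hypothetical, and every group is assumed to have a non-zero MAD.

```python
import math
import statistics


def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def scale_factors(groups):
    """groups: dict mapping a group label to its location-normalised M values.

    Assumed estimate: a_i = MAD_i / (geometric mean of all MAD_i).
    """
    mads = {g: mad(vals) for g, vals in groups.items()}
    log_gm = sum(math.log(x) for x in mads.values()) / len(mads)
    gm = math.exp(log_gm)
    return {g: x / gm for g, x in mads.items()}


def scale_normalize(groups):
    """Divide each group's M values by its estimated scale factor."""
    a = scale_factors(groups)
    return {g: [m / a[g] for m in vals] for g, vals in groups.items()}
```

The same functions apply to between-slide scale normalisation if the groups are slides rather than print-tip groups.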
20Normalization (ctd): that same data set
[Plot: log-ratios by print-tip group]
- After print-tip location and scale normalization.
- Incorporate quality measures.
21Matt Callow's Srb1 dataset (5). Newton's and Chen's single slide methods
22Matt Callow's Srb1 dataset (8). Newton's, Sapir & Churchill's and Chen's single slide methods
23The approach of Roberts et al (Rosetta)
Genomic DNA vs. Genomic DNA
Data from Bing Ren
24The second simplest cDNA microarray data analysis
problem is identifying differentially expressed
genes using replicated slides
- There are a number of different aspects
- First, between-slide normalization; then
- What should we look at: averages, SDs, t-statistics, other summaries?
- How should we look at them?
- Can we make valid probability statements?
- A report on work in progress
25Normalization (ctd): yet another data set
- Between slides this time (10 here)
- Only small differences in spread apparent
- We often see much greater differences
[Plot: log-ratios by slide]
26The NCI 60 experiments (no bg)
27Taking scale into account
- Assumption: all slides have the same spread in M
- True log ratio is $\mu_{ij}$, where $i$ indexes slides and $j$ indexes spots
- Observed is $M_{ij}$, where $M_{ij} = a_i \mu_{ij}$
- Robust estimate of $a_i$: [equation missing in source], with
- $\mathrm{MAD}_i = \mathrm{median}_j \{ |y_{ij} - \mathrm{median}_j(y_{ij})| \}$
28Which genes are (relatively) up/down regulated?
- Two samples.
- e.g. KO vs. WT or mutant vs. WT
[Design diagram: T and C, × n]
For each gene form the t statistic:
$t = \dfrac{\text{average of } n \text{ trt } M\text{s}}{\sqrt{\tfrac{1}{n}\,(\text{SD of } n \text{ trt } M\text{s})^2}}$
29Which genes are (relatively) up/down regulated?
- Two samples with a reference (e.g. pooled control)
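[Design diagram: T and C each hybridized against a reference C, × n each]
- For each gene form the t statistic:
$t = \dfrac{\text{average of } n \text{ trt } M\text{s} - \text{average of } n \text{ ctl } M\text{s}}{\sqrt{\tfrac{1}{n}\left[(\text{SD of } n \text{ trt } M\text{s})^2 + (\text{SD of } n \text{ ctl } M\text{s})^2\right]}}$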
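The two gene-wise t statistics (direct comparison, and comparison through a common reference) can be sketched as below. This is an illustration under stated assumptions: the SD is taken to be the sample SD (divisor n - 1), both groups have the same number of slides n, and the function names are invented for this sketch.

```python
import math
import statistics


def one_sample_t(ms):
    """t for a direct T vs. C comparison: mean(M) / sqrt(SD^2 / n)."""
    n = len(ms)
    return statistics.mean(ms) / math.sqrt(statistics.variance(ms) / n)


def two_sample_t(trt, ctl):
    """t for two samples with a common reference.

    Assumes len(trt) == len(ctl) == n, as in the formula on the slide.
    """
    n = len(trt)
    diff = statistics.mean(trt) - statistics.mean(ctl)
    se = math.sqrt((statistics.variance(trt) + statistics.variance(ctl)) / n)
    return diff / se
```

Each function is applied once per gene, to that gene's log-ratios M across the replicate slides.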
30One factor: more than 2 samples
[Design diagram: treatments T1, T2, T3 and T4, each hybridized against control C, × 2]
- Samples: liver tissue from mice treated by cholesterol-modifying drugs.
- Question 1: find genes that respond differently between the treatment and the control.
- Question 2: find genes that respond similarly across two or more treatments relative to control.
31One factor: more than 2 samples
[Design diagram: samples T1, T2, T3, T4, T5 and T6]
- Samples: tissues from different regions of the mouse olfactory bulb.
- Question 1: differences between different regions.
- Question 2: identify genes with pre-specified patterns across regions.
32Two or more factors
- 6 different experiments at each time point.
- Dyeswaps.
- 4 time points (30 minutes, 1 hour, 4 hours, 24 hours)
- 2 x 2 x 4 factorial experiment.
[Design diagram: ctl, OSM, EGF and OSM + EGF, × 4 times]
33Which genes have changed? When permutation testing possible
- 1. For each gene and each hybridisation (8 ko + 8 ctl), use $M = \log_2(R/G)$.
- 2. For each gene form the t statistic:
  $t = \dfrac{\text{average of 8 ko } M\text{s} - \text{average of 8 ctl } M\text{s}}{\sqrt{\tfrac{1}{8}\left[(\text{SD of 8 ko } M\text{s})^2 + (\text{SD of 8 ctl } M\text{s})^2\right]}}$
- 3. Form a histogram of 6,000 t values.
- 4. Do a normal Q-Q plot; look for values off the line.
- 5. Permutation testing.
- 6. Adjust for multiple testing.
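Steps 2, 5 and 6 can be sketched together. The slide does not say which multiple-testing adjustment is used, so this sketch substitutes one simple choice, a single-step max-|t| adjustment over random relabellings of the slides; that choice, the function names, the number of permutations and the data layout (one row of M values per gene, treatment slides first) are all assumptions made for illustration.

```python
import math
import random
import statistics


def t_stat(trt, ctl):
    """Two-sample t with equal group sizes, as in step 2."""
    n = len(trt)
    se = math.sqrt((statistics.variance(trt) + statistics.variance(ctl)) / n)
    return (statistics.mean(trt) - statistics.mean(ctl)) / se


def permutation_pvalues(data, n_trt, n_perm=1000, seed=0):
    """Return (|t|, unadjusted p, single-step max-|t| adjusted p) per gene.

    data: one list of M values per gene, treatment slides first.
    """
    rng = random.Random(seed)
    n_slides = len(data[0])
    observed = [abs(t_stat(row[:n_trt], row[n_trt:])) for row in data]
    raw_hits = [0] * len(data)
    adj_hits = [0] * len(data)
    for _ in range(n_perm):
        idx = list(range(n_slides))
        rng.shuffle(idx)  # one relabelling of slides, shared by all genes
        trt_idx, ctl_idx = idx[:n_trt], idx[n_trt:]
        perm = [abs(t_stat([row[i] for i in trt_idx],
                           [row[i] for i in ctl_idx])) for row in data]
        max_perm = max(perm)
        for g, obs in enumerate(observed):
            raw_hits[g] += perm[g] >= obs    # this gene alone
            adj_hits[g] += max_perm >= obs   # largest |t| over all genes
    raw_p = [h / n_perm for h in raw_hits]
    adj_p = [h / n_perm for h in adj_hits]
    return observed, raw_p, adj_p
```

Because the same relabelling is applied to every gene within a permutation, the adjusted p-value of a gene is never smaller than its unadjusted one. With 8 + 8 slides there are 12,870 distinct relabellings, so full enumeration is also feasible in place of random sampling.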
34Histogram and qq plot
ApoA1
35Apo A1: adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics.
36Which genes have changed? Permutation testing not possible
- Our current approach is to use averages, SDs, t-statistics and a new statistic we call B, inspired by empirical Bayes.
- We hope in due course to calibrate B and use that as our main tool.
- We begin with the motivation, using data from a study in which each slide was replicated four times.
37Results from 4 replicates
38B = LOR compared
39Results from the Apo AI ko experiment
40Results from the Apo AI ko experiment
41Empirical Bayes log posterior odds ratio
42Results from SR-BI transgenic experiment
- M
- B
- t
- M ∩ B
- t ∩ B
- t ∩ M ∩ B
43Results from SR-BI transgenic experiment
- M
- B
- t
- M ∩ B
- t ∩ B
- t ∩ M ∩ B
44Extensions include dealing with
- Replicates within and between slides
- Several effects: use a linear model
- ANOVA: are the effects equal?
- Time series: selecting genes for trends
45Rosetta once more: In vivo Binding Sites of Gal4p in Galactose
P < 0.001
Un-enriched DNA (Cy3)
antibody-enriched DNA (Cy5)
46Summary (for the second simplest problem)
- Microarray experiments typically have thousands of genes, but only a few (1-10) replicates for each gene.
- Averages can be driven by outliers.
- Ts can be driven by tiny variances.
- B = LOR will, we hope
- use information from all the genes
- combine the best of M. and T
- avoid the problems of M. and T
47Acknowledgments
- UCB/WEHI
- Yee Hwa Yang
- Sandrine Dudoit
- Ingrid Lönnstedt
- Natalie Thorne
- David Freedman
- CSIRO Image Analysis Group
- Michael Buckley
- Ryan Lagerstorm
- Ngai lab, UCB
- Goodman lab, UCB
- Peter Mac CI, Melb.
- Ernest Gallo CRC
- Brown-Botstein lab
- Matt Callow (LBNL)
- Bing Ren (WI)
48Some web sites
- Technical reports, talks, software etc.: http://www.stat.berkeley.edu/users/terry/zarray/Html/
- Statistical software R ("GNU's S"): http://lib.stat.cmu.edu/R/CRAN/
- Packages within R environment:
- -- Spot: http://www.cmis.csiro.au/iap/spot.htm
- -- SMA (statistics for microarray analysis): http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
49Factorial Design
[Design diagram: samples P01, A1, P04 and A4 joined by hybridizations 1 to 5, with the age effect and the zone effect marked]
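50Factorial design
[Design diagram: samples P01 ($m$), A1 ($m + a$), P04 ($m + z$) and A4 ($m + z + a + za$), joined by hybridizations 1 to 5]
Different ways of estimating parameters, e.g. the Z effect:
- 1: $(m + z) - (m) = z$
- 2 - 5: $((m + a) - (m)) - ((m + a) - (m + z)) = (a) - (a - z) = z$
- 4 + 3 - 5: $= z$
How do we combine the information?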
51Regression analysis
Define a matrix $X$ so that $E(M) = X\beta$, where $\beta$ holds the parameters $z$, $a$, $za$. Use least squares estimates for $z$, $a$, $za$.
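A minimal sketch of the regression step for one gene, solving the normal equations directly. The rows for slides 1, 2 and 5 follow the contrasts written out on the factorial design slide (z, a and a - z); the rows for slides 3 and 4 are assumptions made only so that the example has five hybridizations, since the transcript does not say which samples they compare. The function names are likewise illustrative.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def least_squares(x_rows, m):
    """Least squares estimate: solve (X'X) beta = X'M."""
    p = len(x_rows[0])
    xtx = [[sum(row[i] * row[j] for row in x_rows) for j in range(p)]
           for i in range(p)]
    xtm = [sum(row[i] * y for row, y in zip(x_rows, m)) for i in range(p)]
    return solve_linear(xtx, xtm)


# Columns of the design matrix: z, a, za.
design = [
    [1, 0, 0],   # slide 1: P04 vs. P01 -> z
    [0, 1, 0],   # slide 2: A1 vs. P01  -> a
    [1, 0, 1],   # slide 3 (assumed): A4 vs. A1  -> z + za
    [0, 1, 1],   # slide 4 (assumed): A4 vs. P04 -> a + za
    [-1, 1, 0],  # slide 5: A1 vs. P04  -> a - z
]

# Hypothetical log-ratios for one gene on the five slides.
m_obs = [1.0, 2.0, 1.5, 2.5, 1.0]
z_hat, a_hat, za_hat = least_squares(design, m_obs)
```

Fitting all slides at once is one answer to the question of how to combine the different ways of estimating the same effect: each slide contributes according to its row of the design matrix.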
52Looking at effect of Z: log(zone 4 / zone 1)
gene A
gene B
53[Plot labels: estimate of Z effect, log2(SE)]
54Zone, Age, Zone × Age
55Top 50 genes from each effect
[Venn diagram of the Zone, Age and Zone × Age interaction gene lists; region counts as extracted: 19, 0, 48, 29, 2, 0, 19]
60Some statistical questions
- Image analysis: addressing, segmenting, quantifying
- Normalisation: within and between slides
- Quality: of images, of spots, of (log) ratios
- Which genes are (relatively) up/down regulated?
- Assigning p-values to tests/confidence to results.
61Some statistical questions, ctd
- Planning of experiments: design, sample size
- Discrimination and allocation of samples
- Clustering, classification: of samples, of genes
- Selection of genes relevant to any given analysis
- Analysis of time course, factorial and other special experiments
- ... and much more
62The NCI 60 experiments (bg)