Title: Statistical tests for differential expression in cDNA microarray experiments (2): ANOVA
1. Statistical tests for differential expression in cDNA microarray experiments (2): ANOVA
- Xiangqin Cui and Gary A. Churchill, Genome Biology 2003, 4:210
Presented by M. Carme Ruíz de Villa and Alex Sánchez, Departament d'Estadística, U.B.
2. Introduction
3. Remember
- We want to measure how gene expression changes under different conditions.
- Only two conditions and an adequate number of replicates → t-tests and their extensions
- More than two conditions / more than one factor → several approaches
  - Analysis of Variance (ANOVA) (Churchill et al.)
  - Linear Models (Smyth, Speed, ...)
4. Sources of variation (1)
- We want to determine when the variation due to gene expression is significant, but
- There are multiple sources of variation in measurements besides just gene expression.
- We want to know when the variation in measurements is caused by
  - varying levels of gene expression
  - versus other factors.
5. Sources of variation (2)
- Some sources of variation in the measurements in microarray experiments are
  - Array effects
  - Dye effects
  - Variety effects
  - Gene effects
  - Combinations
6. Relative expression values
- If there are more than two conditions → we cannot simply compute ratios
- ANOVA modelling yields estimates of the relative expression for each gene in each sample
- The ANOVA model is not based on log ratios. Rather, it is applied directly to intensity data. However, the difference between two relative expression values can be interpreted as the mean log ratio for comparing two samples.
7. Technical and biological replicates
- If inference is being made on the basis of biological replicates and there is also technical replication → technical replicates should be averaged to yield a single value for each independent biological unit.
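The averaging step can be sketched in base R. This is a minimal illustration only: the data frame, its column names (`bio_unit`, `tech_rep`, `log_intensity`) and the values are made up for the example and do not come from the slides.

```r
# Hypothetical measurements for one gene: 3 biological units,
# each measured twice (technical replicates). Values are illustrative.
measurements <- data.frame(
  bio_unit      = rep(c("mouse1", "mouse2", "mouse3"), each = 2),
  tech_rep      = rep(1:2, times = 3),
  log_intensity = c(8.1, 8.3, 7.6, 7.8, 9.0, 9.4)
)

# Average the technical replicates so that each independent
# biological unit contributes a single value to the analysis.
per_unit <- aggregate(log_intensity ~ bio_unit, data = measurements, FUN = mean)
print(per_unit)
```

After this step the number of rows equals the number of biological units, which is the sample size that should drive the degrees of freedom of any test.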
8. Derived data sets
- The set of estimated relative expression values, one for each gene in each RNA sample, is a derived data set that may be subject to a second level of analysis.
- The derived data can be analyzed on a gene-by-gene basis using standard ANOVA methods to test for differences among conditions. (Oleksiak et al. 28)
9. Review of ANOVA models
10. One-way ANOVA
- Suppose you have a model for each measurement in your experiment: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
  - $y_{ij}$: jth measurement for the ith group.
  - $\mu$: overall mean effect (constant)
  - $\alpha_i$: ith group effect (constant)
  - $\varepsilon_{ij}$: experimental error term, $N(0, \sigma^2)$
- Therefore, observations from group i are distributed with mean $\mu + \alpha_i$ and variance $\sigma^2$.
11. Hypothesis Testing
- Overall variability
- Within-group variability
- Between-group variability
Intuition: if between-group variability is large compared to within-group variability, then the differences between the means are significant.
12. Sum of Squares
- Total Sum of Squares
- Within Sum of Squares
- Between Sum of Squares
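For reference, the three sums of squares can be written out with the standard one-way ANOVA definitions. The notation is an assumption chosen for this sketch: $k$ groups, $n_i$ observations in group $i$, $\bar{y}_{i.}$ the mean of group $i$ and $\bar{y}_{..}$ the overall mean.

```latex
\begin{aligned}
SS_{\text{total}}   &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y}_{..} \right)^2 \\
SS_{\text{within}}  &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y}_{i.} \right)^2 \\
SS_{\text{between}} &= \sum_{i=1}^{k} n_i \left( \bar{y}_{i.} - \bar{y}_{..} \right)^2 \\
SS_{\text{total}}   &= SS_{\text{within}} + SS_{\text{between}}
\end{aligned}
```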
13. Mean Sum of Squares
- Between MS = Between SS / (k-1)
- Within MS = Within SS / (n-k)
- F = Between MS / Within MS
- It is summarized in the ANOVA table
- Example 1
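The mean squares and the F ratio can be checked numerically. The following is a minimal sketch in base R on simulated data: the group means, group sizes and seed are illustrative assumptions, not the data of Example 1. It computes the quantities by hand and then prints the ANOVA table from `anova()` for comparison.

```r
# Simulated data: k = 3 groups, 5 observations each (illustrative values only)
set.seed(1)
k <- 3
n_per_group <- 5
group <- factor(rep(c("g1", "g2", "g3"), each = n_per_group))
y <- rnorm(k * n_per_group, mean = rep(c(0, 0.5, 2), each = n_per_group), sd = 1)
n <- length(y)

grand_mean  <- mean(y)
group_means <- tapply(y, group, mean)    # one mean per group
group_sizes <- tapply(y, group, length)  # one size per group

# Sums of squares
ss_total   <- sum((y - grand_mean)^2)
ss_within  <- sum((y - group_means[as.character(group)])^2)
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Mean squares and F ratio
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n - k)
f_stat     <- ms_between / ms_within
p_value    <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# The same quantities from the built-in ANOVA table
tab <- anova(lm(y ~ group))
print(tab)
```

The hand-computed `f_stat` and `p_value` should agree with the `F value` and `Pr(>F)` columns of the printed table.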
14. Multiple Factor ANOVA
- The model can be extended by adding more
  - Factors ($\alpha$, $\beta$, ...)
  - Interactions between them ($\alpha\beta$, ...)
  - Other
- This is used to model the different sources of variation appearing in microarray experiments
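As an illustration of the extension, a two-factor model with interaction is usually written as below. The symbols are an assumption made for this sketch ($\alpha_i$ and $\beta_j$ for the two factor effects, $(\alpha\beta)_{ij}$ for their interaction), chosen to extend the one-way notation.

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim N(0, \sigma^2)
```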
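15. Experiment 1: Latin Square
16. Random effects models
- If the k factor levels can be considered a random sample from a population of factor levels, we have a random effect
- ANOVA model: $Y_{ij} = \mu + A_i + \varepsilon_{ij}$,
  - $\mu$: overall mean,
  - $A_i$ is a random variable instead of a constant,
  - $\varepsilon_{ij}$: experimental error.
- $E(A_i) = 0$, $E(\varepsilon_{ij}) = 0$, $\mathrm{var}(A_i) = \sigma_A^2$, $\mathrm{var}(\varepsilon_{ij}) = \sigma^2$, $A_i$ and $\varepsilon_{ij}$ independent ⇒ $\mathrm{var}(Y_{ij}) = \sigma_A^2 + \sigma^2$.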
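The variance decomposition can be illustrated by simulation. This is a sketch under stated assumptions: the parameter values, the balanced layout and the seed are arbitrary, and the variance components are estimated with the usual method-of-moments formulas for a balanced one-way layout, E(Between MS) = σ² + n·σ_A² and E(Within MS) = σ².

```r
# Simulate Y_ij = mu + A_i + e_ij with random level effects A_i
set.seed(42)
k        <- 2000   # number of randomly sampled factor levels
n_rep    <- 5      # observations per level (balanced)
mu       <- 10
sigma2_A <- 4      # var(A_i), assumed for the simulation
sigma2   <- 1      # var(e_ij), assumed for the simulation

A     <- rnorm(k, mean = 0, sd = sqrt(sigma2_A))
level <- rep(seq_len(k), each = n_rep)
y     <- mu + A[level] + rnorm(k * n_rep, mean = 0, sd = sqrt(sigma2))

# Total variance should be close to sigma2_A + sigma2
total_var <- var(y)

# Mean squares computed by hand (avoids building a large design matrix)
level_means <- as.vector(tapply(y, level, mean))
ms_between  <- n_rep * sum((level_means - mean(y))^2) / (k - 1)
ms_within   <- sum((y - level_means[level])^2) / (k * (n_rep - 1))

# Method-of-moments estimates of the variance components
sigma2_hat   <- ms_within
sigma2_A_hat <- (ms_between - ms_within) / n_rep

print(c(total_var = total_var, sigma2_hat = sigma2_hat, sigma2_A_hat = sigma2_A_hat))
```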
17. Where to find more
- Draghici, S. (2003). ANOVA chapter (7). Data analysis tools for microarrays. Wiley
- Pavlidis, P. (2003). Using ANOVA for gene selection from microarray studies of the nervous system. http://microarray.cpmc.columbia.edu/pavlidis/doc/reprints/anova-methods.pdf
18. ANOVA Models for Microarray Data
19. Kerr & Churchill's model
- $y_{ijkg}$ = expression measurement from the ith array, jth dye, kth variety, and gth gene.
- $\mu$ = average expression over all spots.
- $A_i$ = effect of the ith array.
- $D_j$ = effect of the jth dye.
- $V_k$ = effect of the kth variety (treatment, sample, ...)
- $G_g$ = effect of the gth gene.
- $(AG)_{ig}$ = effect of the ith array and gth gene.
- $(VG)_{kg}$ = effect of the kth variety and gth gene.
- $\varepsilon_{ijkg}$: independent and identically distributed error terms.
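Putting the listed terms together, the model can be written in one line. This is a reconstruction from the term list above, assuming the terms enter additively:

```latex
y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \varepsilon_{ijkg}
```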
20. Interpreting main effects
- A: differences in fluorescent signal from array to array (e.g. if arrays are probed under inconsistent conditions that increase or reduce hybridization of labeled cDNA)
- D: differences between the two dye fluorescent labels (one dye may consistently be brighter than the other)
- G: differences in fluorescence for equally expressed genes.
- V: differences in expression level between different varieties (samples, tumour types, ...).
21. Interpreting interactions
- DV: if, for a particular variety, labelling is carried out in separate runs of the process → differences between the runs can produce pools of cDNA of varying concentration or quality.
- AG (spot effect): spots for a given gene on the different arrays vary in the amount of cDNA available for hybridization.
- DG: if there are differences in the dyes that are gene-specific
- VG: reflects differences in expression for particular variety and gene combinations that are not explained by the average effects of these varieties and genes. THIS IS THE QUANTITY OF INTEREST!
22. Normalization
- The A, D and V terms effectively normalize the data, so the normalization process is integrated with the data analysis.
- This approach has several benefits (?)
  - The normalization is based on a clearly stated set of assumptions
  - It systematically estimates normalization parameters based on all the data
  - The model can be generalized to the situation where genes are spotted multiple times on each array rather
23. Statistically Significant Effects
- Array, Dye, Variety and Gene effects
  - Goal: to estimate their value.
  - No need to assess their significance
  - Sometimes they don't appear (gene-level model)
- Array x Gene, Variety x Gene effects
  - May or may not be present
  - Goal: to assess their significance
    - Mean effect = 0 if fixed
    - Effect variance = 0 if random
24. Test statistics: The 3 F's
- Hypothesis testing involves the comparison of two models.
- In this setting we consider
  - a null model of no differential expression (all VG = 0) and
  - an alternative model with differential expression among the conditions (some VG are not equal to zero).
- F statistics are computed on a gene-by-gene basis, based on the residual sums of squares from fitting each of these models.
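The model comparison can be sketched in base R for the plain gene-specific F statistic only; the other variants alluded to in the slide title are not covered here. The expression matrix, the number of genes and conditions, and the helper name `f_for_gene` are assumptions made for the illustration. For each gene, the null model (`y ~ 1`) and the alternative model (`y ~ condition`) are fitted, and F is built from their residual sums of squares.

```r
# Simulated derived data: 4 genes x 12 samples (3 conditions, 4 replicates each)
set.seed(7)
n_genes   <- 4
condition <- factor(rep(c("c1", "c2", "c3"), each = 4))
expr <- matrix(rnorm(n_genes * length(condition)), nrow = n_genes)
# Make gene 1 differentially expressed in condition c3 (illustrative shift)
expr[1, condition == "c3"] <- expr[1, condition == "c3"] + 3

# Hypothetical helper: F statistic from the residual sums of squares
# of the null and the alternative model for a single gene
f_for_gene <- function(y, condition) {
  fit_null <- lm(y ~ 1)
  fit_alt  <- lm(y ~ condition)
  rss_null <- sum(residuals(fit_null)^2)
  rss_alt  <- sum(residuals(fit_alt)^2)
  df_null  <- df.residual(fit_null)
  df_alt   <- df.residual(fit_alt)
  f <- ((rss_null - rss_alt) / (df_null - df_alt)) / (rss_alt / df_alt)
  p <- pf(f, df_null - df_alt, df_alt, lower.tail = FALSE)
  c(F = f, p = p)
}

# Gene-by-gene application: one row per gene, columns F and p
results <- t(apply(expr, 1, f_for_gene, condition = condition))
print(results)
```

The same F value is returned by `anova(fit_null, fit_alt)` for a single gene; the explicit version is shown to make the role of the two residual sums of squares visible.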
25. Example 1
- A gene believed to be related to ovarian cancer is investigated
- The cancer is sub-classified into 3 categories (stages): I, II, III-IV
- 15 samples, 3 per stage are available
- They are labelled with 3 colors and hybridized on a 4-channel cDNA array (1 channel empty) (a seemingly more reasonable procedure: double dye-swap reference design)
26. Example 1: Normalized Data
27. Example 1: ANOVA table (1)
If arrays are homogeneous → the appropriate model is 1-factor ANOVA
28. Example 1: Blocking
If arrays are not homogeneous → the appropriate model is 2-factor ANOVA (1 new block factor for arrays)
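The two models can be compared in base R. This is a sketch on simulated values, not the normalized data of Example 1: the layout (5 arrays, each carrying one sample per stage), the effect sizes and the seed are assumptions made for the illustration.

```r
# Simulated single-gene data: 5 arrays, each with one sample per stage
set.seed(3)
array_id <- factor(rep(paste0("array", 1:5), each = 3))
stage    <- factor(rep(c("I", "II", "III-IV"), times = 5))

array_shift  <- rep(rnorm(5, sd = 1), each = 3)  # array-to-array differences
stage_effect <- unname(c(I = 0, II = 0.5, `III-IV` = 1.5)[as.character(stage)])
y <- 8 + array_shift + stage_effect + rnorm(15, sd = 0.3)

# 1-factor ANOVA: ignores the arrays
fit_oneway <- aov(y ~ stage)
# 2-factor ANOVA: array enters as a block factor
fit_blocked <- aov(y ~ array_id + stage)

print(summary(fit_oneway))
print(summary(fit_blocked))
```

When arrays differ, the block factor absorbs the array-to-array variation, so the residual mean square used to test the stage effect is smaller, at the cost of 4 residual degrees of freedom.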
29. Example 2: CAMDA kidney data
ftp://ftp.camda.duke.edu/CAMDA02_DATASETS/papers/README_normal.html
- 6 mouse kidney samples
  - (suppose 6 different treatments)
- Compared to a common reference in a double reference design
  - Dye swap
  - Replicate arrays
2
30. 2.1. The ANOVA model
- Work only at the gene level: no main effects (A, D, V, G) as defined
- $Y_{ijk} = DG_i + AG_j + VG_k + \varepsilon_{ijk}$
  - i = 1, 2 (dyes)
  - j = 1, 2 (array)
  - k = 1, ..., 6 (sample)
31. Example 3: A 2-factor design, Diet x Strain
32. 3.2. Design
33. 3.3. The ANOVA model
- $Y_{ijklm} = DG_i + AG_j + Strain_l + Diet_m + (Strain{:}Diet)_{lm} + VG_k + \varepsilon_{ijklm}$
  - i = 1, 2 (dyes)
  - j = 1, 2 (array)
  - k = 1, ..., 12 (sample)
  - l = 1, ..., 3 (strain)
  - m = 1, 2 (diet)
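The Strain, Diet and Strain:Diet part of this model can be illustrated for a single gene with base R. This sketch deliberately leaves out the dye and array terms and uses simulated values: the factor level names, effect sizes and seed are assumptions, with 3 strains x 2 diets x 2 replicates = 12 samples.

```r
# Simulated single-gene derived data for a 3 x 2 design with 2 replicates
set.seed(11)
design <- expand.grid(
  replicate = 1:2,
  diet      = c("low", "high"),
  strain    = c("s1", "s2", "s3")
)

strain_effect      <- unname(c(s1 = 0, s2 = 1, s3 = -1)[as.character(design$strain)])
diet_effect        <- ifelse(design$diet == "high", 0.8, 0)
interaction_effect <- ifelse(design$strain == "s3" & design$diet == "high", 1.5, 0)
design$y <- strain_effect + diet_effect + interaction_effect +
  rnorm(nrow(design), sd = 0.3)

# Full model (with interaction) and reduced model (main effects only)
fit_full  <- lm(y ~ strain * diet, data = design)
fit_noint <- lm(y ~ strain + diet, data = design)

# F test for the interaction: compares the two residual sums of squares
comparison <- anova(fit_noint, fit_full)
print(comparison)
```

The interaction test has (3-1) x (2-1) = 2 numerator degrees of freedom and 6 residual degrees of freedom under the full model.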
34. 3.4. Sample R code (1)
library(maanova)
data(paigen)
paigen <- createData(rawdata, 2)
# full model, including the Strain:Diet interaction
model.full.fix <- makeModel(data = paigen, formula = ~DG+AG+SG+Strain+Diet+Strain:Diet)
anova.full.fix <- fitmaanova(paigen, model.full.fix)
# reduced model, without the interaction
model.noint.fix <- makeModel(data = paigen, formula = ~DG+AG+SG+Strain+Diet)
anova.noint.fix <- fitmaanova(paigen, model.noint.fix)
35. 3.4. Sample R code (2)
# permutation tests
# test for interaction effect: full model against the model without interaction
test.int.fix <- ftest(paigen, model.full.fix, model.noint.fix, n.perm = 500)
# volcano plot of the interaction test
idx.int.fix <- volcano(anova.full.fix, test.int.fix, title = "Int. test")