Title: Multiple Hypothesis Testing in Microarray Data Analysis
1. Multiple Hypothesis Testing in Microarray Data Analysis
2. Outline
- Introduction
- Multiple hypothesis testing framework
- Analyses of Dataset GSE5245
3. INTRODUCTION
- Special problems that arise from the multiplicity aspect include defining an appropriate Type I error rate and devising powerful multiple testing procedures that control this error rate and account for the joint distribution of the test statistics.
- The approach: develop resampling-based single-step and stepwise multiple testing procedures (MTPs) for controlling a broad class of Type I error rates.
4. Multiple hypothesis testing framework
5. Provide data set
- Let $X_1, \ldots, X_n$ be a random sample of n independent and identically distributed (i.i.d.) random variables, $X \sim P \in \mathcal{M}$, where $\mathcal{M}$ is a set of possibly non-parametric distributions and P is the data generating distribution.
- E.g., pairs $(X_i, Y_i)$, $i = 1, \ldots, n$, formed by the expression profiles $X_i$ (X is a vector of gene expression measurements) and responses or covariates $Y_i$.
6. Define parameters of interest
- Parameters are defined as arbitrary functions of the unknown data generating distribution P: $\Psi(P) = \psi = (\psi(m) : m = 1, \ldots, M)$, where M is the number of genes.
- Parameters of interest include (functions of) means, differences in means, correlations, and regression parameters.
8. Define null and alternative hypotheses, $H_0(m)$ and $H_1(m)$
- M null hypotheses: $H_0(m) = I(P \in \mathcal{M}(m))$
- Alternative hypotheses: $H_1(m) = I(P \notin \mathcal{M}(m))$
- E.g. 1. $H_{0j}$: gene j has equal mean expression levels in K different types of tumors.
- E.g. 2. $H_{0j}$: gene j is not associated with survival for a particular type of cancer.
- E.g. 3. H0j H0(1) H1(1) H0(m) 0
9. Specify test statistics, $T_n(m)$
- The choice depends on the experimental design and the type of response or covariate.
- Binary covariates: t-statistics
- E.g., two-sample Welch t-statistics
- Polytomous responses: F-statistics
- $\chi^2$ statistics or likelihood ratio statistics
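As an illustration of the binary-covariate case, here is a minimal base-R sketch of a row-wise two-sample Welch t-statistic. The toy matrix `X`, the labels `Y`, and the helper `welch_t` are hypothetical names introduced for this example only; they are not the GSE5245 data and not the multtest implementation.

```r
# Hypothetical toy data: 4 genes (rows) x 8 samples (columns).
set.seed(1)
X <- matrix(rnorm(4 * 8), nrow = 4)
Y <- c(0, 0, 0, 1, 1, 1, 1, 1)   # binary covariate with unequal group sizes

# Two-sample Welch t-statistic for every row of X (illustrative helper).
welch_t <- function(X, Y) {
  x0 <- X[, Y == 0, drop = FALSE]
  x1 <- X[, Y == 1, drop = FALSE]
  # Unpooled standard error: each group contributes its own variance.
  se <- sqrt(apply(x1, 1, var) / ncol(x1) + apply(x0, 1, var) / ncol(x0))
  (rowMeans(x1) - rowMeans(x0)) / se
}

tn <- welch_t(X, Y)

# Cross-check gene 1 against base R's t.test(), which uses Welch by default.
ref <- unname(t.test(X[1, Y == 1], X[1, Y == 0])$statistic)
```

The unpooled standard error is what distinguishes the Welch statistic from the equal-variance t-statistic, which matters when group sizes or variances differ.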
10. Estimate test statistics null distribution, $Q_{0n}$
- In practice, the true distribution $Q_n = Q_n(P)$ of the test statistics $T_n$ is unknown and is replaced by an estimated null distribution $Q_0$.
- Resampling procedures (e.g., bootstrap and permutation) are particularly useful for estimating the null distribution.
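A permutation null distribution can be sketched in a few lines of base R: recompute the test statistics after shuffling the group labels B times. This is only an illustrative sketch on hypothetical toy data (`X`, `Y`, `welch_t`, and `B = 200` are assumptions made for the example); the resampling used by the MTP function is more elaborate.

```r
set.seed(2)
X <- matrix(rnorm(5 * 10), nrow = 5)      # hypothetical: 5 genes x 10 samples
Y <- rep(c(0, 1), times = c(4, 6))        # group labels

# Row-wise two-sample Welch t-statistic (illustrative helper).
welch_t <- function(X, Y) {
  x0 <- X[, Y == 0, drop = FALSE]
  x1 <- X[, Y == 1, drop = FALSE]
  se <- sqrt(apply(x1, 1, var) / ncol(x1) + apply(x0, 1, var) / ncol(x0))
  (rowMeans(x1) - rowMeans(x0)) / se
}

B  <- 200
tn <- welch_t(X, Y)                        # observed statistics
# Each column of T0 holds the M statistics for one permuted labelling.
T0 <- replicate(B, welch_t(X, sample(Y)))

# Unadjusted two-sided p-values estimated from the permutation distribution.
rawp <- rowMeans(abs(T0) >= abs(tn))
```

Keeping the full M x B matrix `T0`, rather than only per-gene p-values, preserves the joint distribution of the statistics, which the maxT and minP procedures rely on.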
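13. Select Type I error rate
- Notation: V is the number of false positives (Type I errors) and R is the number of rejected hypotheses.
- The family-wise error rate (FWER): $\mathrm{FWER} = \Pr(V \geq 1)$.
- Generalized family-wise error rate (gFWER): $\mathrm{gFWER}(k) = \Pr(V > k) = 1 - F_V(k)$.
- The false discovery rate (FDR): $\mathrm{FDR} = E(V/R)$, with $V/R = 0$ when $R = 0$.
- Tail probabilities for the proportion of false positives (TPPFP): $\mathrm{TPPFP}(q) = \Pr(V/R > q) = 1 - F_{V/R}(q)$, $q \in (0, 1)$.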
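To make the four definitions concrete, the sketch below estimates each error rate from simulated counts of false positives V and rejections R. The simulation settings (`n_sim`, the Poisson and binomial parameters) are arbitrary assumptions chosen only to produce example numbers.

```r
set.seed(3)
n_sim <- 1000
# Hypothetical repeated experiments: R rejections, of which V are false positives.
R <- rpois(n_sim, lambda = 20)
V <- rbinom(n_sim, size = R, prob = 0.05)

ratio <- ifelse(R > 0, V / R, 0)          # V/R is taken to be 0 when R = 0

fwer  <- mean(V >= 1)                     # Pr(V >= 1)
gfwer <- function(k) mean(V > k)          # Pr(V > k)
fdr   <- mean(ratio)                      # E(V/R)
tppfp <- function(q) mean(ratio > q)      # Pr(V/R > q)

c(FWER = fwer, gFWER5 = gfwer(5), FDR = fdr, TPPFP0.1 = tppfp(0.1))
```

Note that gFWER(0) coincides with the FWER, and that both gFWER(k) and TPPFP(q) can never exceed the FWER, since each requires at least one false positive.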
14. Apply MTP
- Single-step procedures: equivalent adjustments are made for all $H_{0j}$; each null hypothesis is assessed using a rejection region that is independent of the tests of the other hypotheses.
- Stepwise procedures: adjustments depend on the observed data; rejection regions are constructed based on the acceptance/rejection of other hypotheses, and the procedures are applied to successively smaller nested subsets of tests.
15. Apply MTP
- maxT: based on ordered test statistics
- minP: based on ordered p-values
- Step-down: start with the hypotheses corresponding to the most significant test statistics
- Step-up: start with the hypotheses corresponding to the least significant test statistics
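The difference between single-step and step-down maxT can be sketched in base R as follows. The observed statistics `tn`, the null matrix `T0`, and the helper `maxT_adjp` are hypothetical and written for illustration; this is not the code behind the `ss.maxT` and `sd.maxT` methods.

```r
set.seed(4)
M  <- 6; B <- 500
T0 <- matrix(rnorm(M * B), nrow = M)        # hypothetical M x B null statistics
tn <- c(4.5, -3.2, 2.8, 0.5, -0.3, 1.0)     # hypothetical observed statistics

maxT_adjp <- function(tn, T0) {
  a  <- abs(tn)
  A0 <- abs(T0)
  # Single-step: compare every |t| with the maximum over all M hypotheses.
  max0 <- apply(A0, 2, max)
  ss <- sapply(a, function(s) mean(max0 >= s))
  # Step-down: maxima over successively smaller nested subsets, ordered
  # from the most to the least significant statistic.
  ord <- order(a, decreasing = TRUE)
  u <- apply(A0[ord, , drop = FALSE], 2, function(col) rev(cummax(rev(col))))
  sd_sorted <- cummax(rowMeans(u >= a[ord]))  # enforce monotone adjusted p-values
  sdp <- numeric(length(a))
  sdp[ord] <- sd_sorted
  list(ss = ss, sd = sdp)
}

adj <- maxT_adjp(tn, T0)
```

Because each step-down maximum is taken over a subset of the hypotheses, the step-down adjusted p-values can never be larger than the single-step ones, so a step-down procedure rejects at least as many hypotheses.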
16. Analyses of Dataset GSE5245
- Parameters of interest: mean expression level of gene j
- $H_0$: gene j has equal mean expression levels in two different treatments
- Test statistics: two-sample Welch t-statistics
17. Bootstrap vs. permutation
The bootstrap and permutation null distributions can give different results. Possible reasons: the sample sizes are not equal for the two groups, or the expression measures may have different covariance structures in the two populations.
18.
memory.limit(size = 4000)   # Windows-only
windows(record = T)         # Windows-only graphics device
library(affy)
cels <- list.celfiles("E:\\GSE5245")
cels
data <- ReadAffy(celfile.path = "E:\\GSE5245")
abatch.raw <- data
rma.eset <- rma(abatch.raw)
small.eset <- exprs(rma.eset)
T.cell <- c(rep(0, 5), rep(1, 11))   # 0 for None, 1 for transient or persistent
# filter: keep genes with cv between .7 and 10,
# and where 20% of samples had exprs. > 100
library(genefilter)
e.mat <- 2^small.eset                # back-transform the log2 RMA values
ffun <- filterfun(pOverA(0.20, 100), cv(0.7, 10))
t.fil <- genefilter(e.mat, ffun)
small.eset <- log2(e.mat[t.fil, ])
dim(e.mat)
dim(small.eset)
# [1] 532  16
19.
# permutation resampling
m <- MTP(X = small.eset, Y = T.cell, typeone = 'fwer', B = 100, method = 'sd.maxT', nulldist = 'perm', seed = 99)
summary(m)
m.diff <- m@adjp < 0.05
sum(m.diff)    # [1] 55
r <- m@reject
sum(r)         # [1] 55
# bootstrap resampling
m3 <- MTP(X = small.eset, Y = T.cell, typeone = 'fwer', B = 100, method = 'sd.maxT', nulldist = 'boot', seed = 1)
summary(m3)
r3 <- m3@reject
sum(r3)        # 37
m3.diff <- m3@adjp < 0.05
sum(m3.diff)   # 37
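20. FWER vs. gFWER vs. TPPFP vs. FDR
The results illustrate that stepwise MTPs are no more conservative than their single-step analogues: the step-down procedures reject at least as many genes as the corresponding single-step procedures. With maxT the step-down counts are larger for FWER, gFWER, and TPPFP (e.g., 37 vs. 35 for FWER); with minP, and for FDR, the counts are equal.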
21.
m1 <- MTP(X = small.eset, Y = T.cell, typeone = 'fwer', B = 100, method = 'ss.maxT', seed = 1)
summary(m1)
r1 <- m1@reject; sum(r1)   # 35
m2 <- MTP(X = small.eset, Y = T.cell, typeone = 'fwer', B = 100, method = 'ss.minP', seed = 1)
summary(m2)
r2 <- m2@reject; sum(r2)   # 129
m3 <- MTP(X = small.eset, Y = T.cell, typeone = 'fwer', B = 100, method = 'sd.maxT', seed = 1)
summary(m3)
r3 <- m3@reject; sum(r3)   # 37
m4 <- MTP(X = small.eset, Y = T.cell, typeone = 'fwer', B = 100, method = 'sd.minP', seed = 1)
summary(m4)
r4 <- m4@reject; sum(r4)   # 129
22.
m5 <- MTP(X = small.eset, Y = T.cell, typeone = 'gfwer', B = 100, method = 'ss.maxT', k = 5, seed = 1)
summary(m5)
r5 <- m5@reject; sum(r5)   # 40
m6 <- MTP(X = small.eset, Y = T.cell, typeone = 'gfwer', B = 100, method = 'ss.minP', k = 5, seed = 1)
summary(m6)
r6 <- m6@reject; sum(r6)   # 134
m7 <- MTP(X = small.eset, Y = T.cell, typeone = 'gfwer', B = 100, method = 'sd.maxT', k = 5, seed = 1)
summary(m7)
r7 <- m7@reject; sum(r7)   # 42
m8 <- MTP(X = small.eset, Y = T.cell, typeone = 'gfwer', B = 100, method = 'sd.minP', k = 5, seed = 1)
summary(m8)
r8 <- m8@reject; sum(r8)   # 134
23.
m9 <- MTP(X = small.eset, Y = T.cell, typeone = 'tppfp', B = 100, method = 'ss.maxT', q = 0.1, seed = 1)
summary(m9)
r9 <- m9@reject; sum(r9)      # 38
m10 <- MTP(X = small.eset, Y = T.cell, typeone = 'tppfp', B = 100, method = 'ss.minP', q = 0.1, seed = 1)
summary(m10)
r10 <- m10@reject; sum(r10)   # 143
m11 <- MTP(X = small.eset, Y = T.cell, typeone = 'tppfp', B = 100, method = 'sd.maxT', q = 0.1, seed = 1)
summary(m11)
r11 <- m11@reject; sum(r11)   # 41
m12 <- MTP(X = small.eset, Y = T.cell, typeone = 'tppfp', B = 100, method = 'sd.minP', q = 0.1, seed = 1)
summary(m12)
r12 <- m12@reject; sum(r12)   # 143
24.
m13 <- MTP(X = small.eset, Y = T.cell, typeone = 'fdr', B = 100, method = 'ss.maxT', seed = 1)
summary(m13)
r13 <- m13@reject; sum(r13)   # 25
m14 <- MTP(X = small.eset, Y = T.cell, typeone = 'fdr', B = 100, method = 'ss.minP', seed = 1)
summary(m14)
r14 <- m14@reject; sum(r14)   # 132
m15 <- MTP(X = small.eset, Y = T.cell, typeone = 'fdr', B = 100, method = 'sd.maxT', seed = 1)
summary(m15)
r15 <- m15@reject; sum(r15)   # 25
m16 <- MTP(X = small.eset, Y = T.cell, typeone = 'fdr', B = 100, method = 'sd.minP', seed = 1)
summary(m16)
r16 <- m16@reject; sum(r16)   # 132
25. How many genes are in common among these methods?
- Ratio = (number of genes rejected in common between two methods) / (number of genes rejected in the second method)
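The ratio can be illustrated on a small example. The rejection indicators below are hypothetical (three made-up methods A, B, C and six genes); the sketch uses a matrix product in place of a double loop.

```r
# Hypothetical rejection indicators: rows are genes, columns are methods.
rej <- cbind(
  A = c(TRUE,  TRUE, FALSE, FALSE, TRUE,  FALSE),
  B = c(TRUE,  TRUE, TRUE,  FALSE, TRUE,  TRUE),
  C = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# crossprod() counts, for every pair of methods, the genes rejected by both.
common <- crossprod(rej * 1)
# Divide column j by the number of genes rejected by method j (the "second" method).
ratio <- sweep(common, 2, colSums(rej), "/")
ratio
```

The matrix is not symmetric: `ratio["A", "B"]` is 3/5 because only three of the five genes rejected by B are also rejected by A, while `ratio["B", "A"]` is 1 because every gene rejected by A is also rejected by B.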
28.
mbt.mat <- cbind(r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16)
res <- matrix(ncol = 16, nrow = 16)
# res[i, j]: genes rejected by both methods i and j / genes rejected by method j
for (i in 1:16) for (j in 1:16)
  res[i, j] <- sum(mbt.mat[, i] & mbt.mat[, j]) / sum(mbt.mat[, j])
res
library(cluster)
library(RColorBrewer)
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
colnames(res) <- rownames(res) <-
  c("fwer.ss.maxT", "fwer.ss.minP", "fwer.sd.maxT", "fwer.sd.minP",
    "gfwer.ss.maxT", "gfwer.ss.minP", "gfwer.sd.maxT", "gfwer.sd.minP",
    "tppfp.ss.maxT", "tppfp.ss.minP", "tppfp.sd.maxT", "tppfp.sd.minP",
    "fdr.ss.maxT", "fdr.ss.minP", "fdr.sd.maxT", "fdr.sd.minP")
heatmap(res, Rowv = NA, Colv = NA)
29. The influence of k in gFWER
The number of rejected hypotheses increases linearly with the number k of allowed false positives.
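The linear growth follows from the augmentation idea: starting from an FWER-controlling procedure, the k next most significant hypotheses are rejected as well. The sketch below assumes that this amounts to shifting the sorted FWER adjusted p-values by k ranks; `adjp_fwer` and `augment_gfwer` are hypothetical names used for illustration only.

```r
# Hypothetical FWER adjusted p-values for 8 genes.
adjp_fwer <- c(0.001, 0.20, 0.004, 0.03, 0.60, 0.04, 0.35, 0.90)

# Augmentation sketch: the k most significant hypotheses get adjusted
# p-value 0, and every other hypothesis inherits the FWER adjusted
# p-value of the hypothesis ranked k places ahead of it.
augment_gfwer <- function(adjp, k) {
  ord <- order(adjp)
  shifted <- c(rep(0, k), sort(adjp))[seq_along(adjp)]
  out <- numeric(length(adjp))
  out[ord] <- shifted
  out
}

k <- c(0, 1, 2, 3)
n_rej <- sapply(k, function(kk) sum(augment_gfwer(adjp_fwer, kk) < 0.05))
n_rej   # 4 5 6 7
```

Each extra allowed false positive adds exactly one rejection until all hypotheses are rejected. The same pattern appears in the counts reported in these slides for FWER and gFWER(5): 35 and 40, 37 and 42, 129 and 134.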
30. The influence of k in gFWER
k <- c(5, 10, 50, 100)
m3 <- MTP(X = small.eset, Y = T.cell, typeone = 'fwer', B = 100, method = 'sd.minP', seed = 1)
summary(m3)
cyto.gfwer <- fwer2gfwer(adjp = m3@adjp, k = k)
comp.gfwer <- cbind(m3@adjp, cyto.gfwer)
mtps <- paste("gFWER(", c(0, k), ")", sep = "")
mt.plot(adjp = comp.gfwer, teststat = m3@statistic, proc = mtps, leg = c(0.5, 200), col = 1:5, lty = 1:5, lwd = 3)
title("Comparison of gFWER(k)-controlling AMTPs based on SD minP MTP")
31. The influence of q in TPPFP
The result: the number of rejections increases with the allowed proportion q of false positives, though not linearly.
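The non-linear growth can be seen from a simple count. Assuming the augmentation approach, if an FWER-controlling procedure rejects r0 hypotheses, the number of additional rejections a is the largest integer with a / (r0 + a) <= q, that is, floor(q * r0 / (1 - q)). The helper `extra_rejections` is a hypothetical name for this sketch.

```r
# Largest a with a / (r0 + a) <= q, capped by the M - r0 hypotheses left.
extra_rejections <- function(r0, q, M = Inf) {
  pmin(floor(q * r0 / (1 - q)), M - r0)
}

q <- c(0.05, 0.1, 0.5)
# r0 = 129 is the FWER minP rejection count reported in these slides; M = 532 genes.
total <- 129 + extra_rejections(129, q, M = 532)
total   # 135 143 258
```

For q = 0.1 this gives 143, and starting from the maxT counts 35 and 37 it gives 38 and 41, which agrees with the TPPFP(0.1) counts reported in these slides.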
32.
q <- c(0.05, 0.1, 0.5)
m3 <- MTP(X = small.eset, Y = T.cell, typeone = 'fwer', B = 100, method = 'sd.minP', seed = 1)
summary(m3)
cyto.tppfp <- fwer2tppfp(adjp = m3@adjp, q = q)
comp.tppfp <- cbind(m3@adjp, cyto.tppfp)
mtps <- c("FWER", paste("TPPFP(", q, ")", sep = ""))
mt.plot(adjp = comp.tppfp, teststat = m3@statistic, proc = mtps, leg = c(0.5, 200), col = 1:4, lty = 1:4, lwd = 3)
title("Comparison of TPPFP(q)-controlling AMTPs based on SD minP MTP")
33. Summary
- Control of an appropriate and precisely defined Type I error rate
- Control of this error rate under any combination of true and false null hypotheses
- Accounting for the joint distribution of the test statistics
- Reporting the results in terms of adjusted p-values
- Availability of efficient resampling algorithms for nonparametric procedures