Title: A Course in Multiple Comparisons and Multiple Tests
1 A Course in Multiple Comparisons and Multiple Tests
- Peter H. Westfall, Ph.D.
- Professor of Statistics, Department of Inf. Systems and Quant. Sci., Texas Tech University
2 Learning Outcomes
- Elucidate reasons that multiple comparisons procedures (MCPs) are used, as well as their controversial nature
- Know when and how to use classical interval-based MCPs including Tukey, Dunnett, and Bonferroni
- Understand how MCPs affect power
- Elucidate the definition of closed testing procedures (CTPs)
- Understand specific types of CTPs, benefits and drawbacks
- Distinguish false discovery rate (FDR) from familywise error rate (FWE)
- Understand general issues regarding Bayesian MCPs
3 Outline of Material
- Introduction. Overview of Problems, Issues, and Solutions; Regulatory and Ethical Perspectives; Families of Tests; Familywise Error Rate; Bonferroni. (pp. 5-21)
- Interval-Based Multiple Inferences in the standard linear models framework. One-way ANOVA and ANCOVA; Tukey, Dunnett, and Monte Carlo Methods; Adjusted p-values; general contrasts; Multivariate T distribution; Tight Confidence Bands; Treatment x Covariate Interaction; Subgroup Analysis. (pp. 22-55)
- Power and Sample Size Determinations for multiple comparisons. (pp. 56-65)
- Stepwise and Closed Testing Procedures I: P-value-Based Methods. Closure Method; Global Tests; Holm, Hommel, Hochberg, and Fisher combined methods for p-values. (pp. 66-90)
- Stepwise and Closed Testing Procedures II: Fixed Sequences, Gatekeepers, and I-U tests. Fixed Sequence tests; Gatekeeper procedures; Multiple hypotheses in a gate; Intersection-union tests with application to dose response, primary and secondary endpoints, bioequivalence, and combination therapies. (pp. 91-101)
4 Outline (Continued)
- Stepwise and Closed Testing Procedures III: Methods that use logical constraints and correlations. Lehmacher et al. Method for Multiple endpoints; Range-Based and F-based ANOVA Tests; Fisher's protected LSD; Free and Restricted Combinations; Shaffer-Type Methods for dose comparisons and subgroup analysis. (pp. 102-118)
- Multiple nonparametric and semiparametric tests. Bootstrap and Permutation-based Closed testing; PROC MULTTEST; examples with multiple endpoints, genetic associations, gene expression, binary data, and adverse events. (pp. 119-139)
- More complex models and FWE control. Heteroscedasticity, Repeated measures, and large sample methods. Applications: multiple treatment comparisons, crossover designs, logistic regression of cure rates. (pp. 140-152)
- False Discovery Rate. Benjamini and Hochberg's method; comparison with FWE controlling methods. (pp. 153-158)
- Bayesian methods. Simultaneous credible intervals; ranking probabilities and loss functions; PROC MIXED posterior sampling; Bayesian testing of multiple endpoints. (pp. 159-178)
- Conclusion, discussion, references. (pp. 179-184)
5 Sources of Multiplicity
- Multiple variables (endpoints)
- Multiple timepoints
- Subgroup analysis
- Multiple comparisons
- Multiple tests of the same hypothesis
- Variable and Model selection
- Interim analysis
- Hidden Multiplicity: File Drawers, Outliers
6 The Problem
- Significant results may fail to replicate.
- Documented cases
- Ioannidis (JAMA 2005)
7 An Example
- Phase III clinical trial
- Three arms: Placebo, AC, Drug
- Endpoints: Signs and symptoms
- Measured at weekly visits
- Baseline covariates
8 Example - Continued
- Features displayed at trial conclusion:
- Trends
- Baseline adjusted comparisons of raw data
- Baseline adjusted changes
- Nonparametric and parametric tests
- Specific endpoints and combinations of endpoints
- Particular week results
- AC and Placebo comparisons
- Fact: The features that look the best are biased.
9 Example Continued: Feature Selection
- Effect Size is a feature
- Effect size = (mean difference)/sd
- Dimensionless
- 0.2 = small, 0.5 = medium, 0.8 = large
- Estimated effect sizes $F_1, F_2, \ldots, F_k$
- What if you select $\max\{F_1, F_2, \ldots, F_k\}$ and publish it?
10 The Scientific Concern
11 Feature Selection Model
- Clinical Trials Simulation
- Real data used
- Conservative!
- If you must know more:
- $F_j = \mu_j + \varepsilon_j$, $j = 1, \ldots, 20$.
- Error terms $\varepsilon_j$ are $N(0, 0.2^2)$
- True effect sizes $\mu_j$ are $N(0.3, 0.1^2)$
- Features $F_j$ are highly correlated.
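The effect of publishing the best-looking feature can be illustrated with a small simulation. The sketch below, written in Python rather than the SAS used elsewhere in these slides, draws true effect sizes from N(0.3, 0.1^2) and adds N(0, 0.2^2) errors as in the model above. It assumes independent errors for simplicity, whereas the slide's features are highly correlated, so the size of the bias here is only indicative. The function name and defaults are hypothetical.

```python
import random


def selection_bias(k=20, n_trials=20000, seed=1):
    """Average estimated vs. true effect size of the best-looking feature.

    Assumption: error terms are independent across features (the slide's
    simulation uses highly correlated features, which this sketch ignores).
    """
    rng = random.Random(seed)
    total_selected = 0.0
    total_true = 0.0
    for _ in range(n_trials):
        mu = [rng.gauss(0.3, 0.1) for _ in range(k)]   # true effect sizes
        f = [m + rng.gauss(0.0, 0.2) for m in mu]      # estimated effect sizes
        best = max(range(k), key=lambda j: f[j])       # feature that "looks best"
        total_selected += f[best]
        total_true += mu[best]
    return total_selected / n_trials, total_true / n_trials


mean_selected, mean_true = selection_bias()
print("mean estimated effect of selected feature:", round(mean_selected, 3))
print("mean true effect of selected feature:     ", round(mean_true, 3))
```

The gap between the two printed averages is the selection effect: the selected estimate overstates the true effect of the very feature it estimates.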
12 Key Points: (i) Multiplicity invites Selection; (ii) Selection has an EFFECT
- Just like effects due to
- Treatment
- Confounding
- Learning
- Nonresponse
- Placebo
13 Published Guidelines
- ICH Guidelines
- CPMP Points to consider
- CDRH Statistical Guidance
- ASA Ethical Guidelines
14 Regulatory/Journal/Ethical/Professional Concerns
- Replicability (good science)
- Fairness
- Regulatory report:
- The drug company reported efficacy at p = .047. We repeated the analysis in several different ways that the company might have done. In 20 re-analyses of the data, 18 produced p-values greater than .05. Only one of the 20 re-analyses produced a p-value smaller than .047.
15 Multiple Inferences: Notation
- There is a family of k inferences
- Parameters are $\theta_1, \ldots, \theta_k$
- Null hypotheses are
- $H_{01}: \theta_1 = 0, \ldots, H_{0k}: \theta_k = 0$
16 Comparisonwise Error Rate (CER)
- Intervals:
- $\mathrm{CER}_j = P(\text{Interval}_j \text{ incorrect})$
- Tests:
- $\mathrm{CER}_j = P(\text{Reject } H_{0j} \mid H_{0j} \text{ is true})$
- Usually $\mathrm{CER} = \alpha = .05$
17 Familywise Error Rate (FWE)
Intervals: $\mathrm{FWE} = 1 - P(\text{all intervals are correct})$
Tests: $\mathrm{FWE} = P(\text{reject at least one true null})$
18 False Discovery Rate
- FDR = E(proportion of rejections that are incorrect)
- Let R = total number of rejections
- Let V = number of erroneous rejections
- $\mathrm{FDR} = E(V/R)$ (0/0 defined as 0).
- $\mathrm{FWE} = P(V > 0)$
19 Bonferroni Method
- Identify Family of inferences
- Identify number of elements (k) in the Family
- Use $\alpha/k$ for all inferences.
- Ex: With k = 36, p-values must be less than 0.05/36 = 0.0014 to be significant
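20 FWE Control for Bonferroni
- $\mathrm{FWE} = P(p_{0j_1} \le .05/36 \text{ or } \cdots \text{ or } p_{0j_m} \le .05/36 \mid H_{0j_1}, \ldots, H_{0j_m} \text{ true})$
- $\le P(p_{0j_1} \le .05/36) + \cdots + P(p_{0j_m} \le .05/36)$
- $= (.05)m/36 \le .05$
[Figure: Venn diagram of events A and B] $P(A \cup B) \le P(A) + P(B)$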
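The Bonferroni arithmetic above can be checked numerically. The following Python sketch computes the per-test threshold for k = 36 and, under the added assumption that the tests are independent (the slide's bound does not need it), the exact familywise error rate with and without the adjustment. The function names are illustrative only.

```python
def bonferroni_threshold(alpha: float, k: int) -> float:
    """Per-inference level alpha/k used by the Bonferroni method."""
    return alpha / k


def fwe_if_independent(per_test_level: float, m: int) -> float:
    """FWE for m true nulls tested independently at the given level.

    Assumption: independent tests; Bonferroni's bound itself holds
    under any dependence.
    """
    return 1.0 - (1.0 - per_test_level) ** m


alpha, k = 0.05, 36
threshold = bonferroni_threshold(alpha, k)
print("per-test threshold:", round(threshold, 4))
print("FWE, unadjusted tests:", round(fwe_if_independent(alpha, k), 4))
print("FWE, Bonferroni tests:", round(fwe_if_independent(threshold, k), 4))
```

With all 36 nulls true, the unadjusted familywise error rate is far above .05, while the Bonferroni-adjusted rate stays just below it.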
21 Families in clinical trials [1]
Efficacy
- Main Interest (Primary and Secondary): Approval and Labeling depend on these. Tight FWE control needed.
- Lesser Interest: Depending on goals and reviewers, FWE controlling methods might be needed.
- Supportive Tests: mostly descriptive. FWE control not needed.
- Exploratory Tests: investigate new indications; future trials needed to confirm; do what makes sense.
Safety
- Serious and known treatment-related AEs: FWE control not needed.
- All other AEs: Reasonable to control FWE (or FDR).
[1] Westfall, P. and Bretz, F. (2003). Multiplicity in Clinical Trials. Encyclopedia of Biopharmaceutical Statistics, second edition, Shein-Chung Chow, ed., Marcel Dekker Inc., New York, pp. 666-673.
22 Classical Single-Step Testing and Interval Methods to Control FWE
- Simultaneous confidence intervals
- Adjusted p-values
- Dunnett method
- Tukey's method
- Simulation-based methods for general comparisons
23 Specificity and Sensitivity
If you want ...                                    then use
Estimates of effect sizes and error margins        Simultaneous Confidence Intervals
Confident inequalities                             Stepwise or closed tests
Overall Test                                       F-test, O'Brien, etc.
24 The Model
- $Y = X\beta + \varepsilon$
- where $\varepsilon \sim N(0, \sigma^2 I)$
- Includes ANOVA, ANCOVA, regression
- For group comparisons, covariate adjustment
- Not valid for survival analysis, binary data, multivariate data
25 Example: Pairwise Comparisons against Control
Goal: Estimate all mean differences from control and provide simultaneous 95% error margins
What $c_\alpha$ to use?
26 Comparison of Critical Values
27 Results - Dunnett
The GLM Procedure
Dunnett's t Tests for gain
NOTE: This test controls the Type I experimentwise error for comparisons of all treatments against a control.
Alpha                              0.05
Error Degrees of Freedom           21
Error Mean Square                  210.0048
Critical Value of Dunnett's t      2.78972
Minimum Significant Difference     28.586
Comparisons significant at the 0.05 level are indicated by ***.
g             Difference      Simultaneous 95%
Comparison    Between Means   Confidence Limits
1 - 0          -9.48          -38.07    19.11
4 - 0         -13.50          -42.09    15.09
5 - 0         -20.70          -49.29     7.89
2 - 0         -24.90          -53.49     3.69
6 - 0         -31.14          -59.73    -2.55   ***
3 - 0         -33.24          -61.83    -4.65   ***
28 $c_\alpha$ is the $1-\alpha$ quantile of the distribution of $\max_i |Z_i - Z_0| / (2\chi^2_{df}/df)^{1/2}$, called Dunnett's two-sided range distribution.
29 Adjusted p-Values
Definition: Adjusted p-value = smallest FWE at which the hypothesis is rejected; or, the FWE for which the confidence interval has 0 as a boundary.
30 Adjusted p-values for Dunnett
proc glm data=tox;
  class g;
  model gain=g;
  lsmeans g/adjust=dunnett pdiff;
run;
31 Example: All Pairwise Comparisons
Goal: Estimate all mean differences and provide simultaneous 95% error margins
What $c_\alpha$ to use?
32 Comparison of Critical Values
33 Tukey Comparisons
Alpha                                  0.05
df                                     21
MSE                                    210.0048
Critical Value of Studentized Range    4.597
Minimum Significant Difference         33.311
Means with the same letter are not significantly different.
Tukey Grouping    Mean      N    G
A                 105.38    4    0
A                  95.90    4    1
A                  91.88    4    4
A                  84.68    4    5
A                  80.48    4    2
A                  74.24    4    6
A                  72.14    4    3
34 Tukey Adjusted p-Values
General Linear Models Procedure
Least Squares Means
Adjustment for multiple comparisons: Tukey
Pr > |T| H0: LSMEAN(i)=LSMEAN(j)
G   GAIN LSMEAN   i/j   1        2        3        4        5        6        7
0   105.380       1     .        0.9641   0.2351   0.0507   0.8364   0.4319   0.0769
1    95.900       2     0.9641   .        0.7391   0.2810   0.9996   0.9227   0.3806
2    80.480       3     0.2351   0.7391   .        0.9808   0.9172   0.9995   0.9958
3    72.140       4     0.0507   0.2810   0.9808   .        0.4860   0.8771   1.0000
4    91.880       5     0.8364   0.9996   0.9172   0.4860   .        0.9910   0.6102
5    84.680       6     0.4319   0.9227   0.9995   0.8771   0.9910   .        0.9438
6    74.240       7     0.0769   0.3806   0.9958   1.0000   0.6102   0.9438   .
35 Tukey Simultaneous Intervals
         Simultaneous Lower   Difference       Simultaneous Upper
i   j    Confidence Limit     Between Means    Confidence Limit
1   2    -23.831013             9.480000       42.791013
1   3     -8.411013            24.900000       58.211013
1   4     -0.071013            33.240000       66.551013
1   5    -19.811013            13.500000       46.811013
1   6    -12.611013            20.700000       54.011013
1   7     -2.171013            31.140000       64.451013
2   3    -17.891013            15.420000       48.731013
2   4     -9.551013            23.760000       57.071013
2   5    -29.291013             4.020000       37.331013
2   6    -22.091013            11.220000       44.531013
2   7    -11.651013            21.660000       54.971013
3   4    -24.971013             8.340000       41.651013
3   5    -44.711013           -11.400000       21.911013
3   6    -37.511013            -4.200000       29.111013
3   7    -27.071013             6.240000       39.551013
4   5    -53.051013           -19.740000       13.571013
4   6    -45.851013           -12.540000       20.771013
4   7    -35.411013            -2.100000       31.211013
5   6    -26.111013             7.200000       40.511013
5   7    -15.671013            17.640000       50.951013
6   7    -22.871013            10.440000       43.751013
36 $c_\alpha$ is $(1/\sqrt{2})$ times the $1-\alpha$ quantile of the distribution of $\max_{i,i'} |Z_i - Z_{i'}| / (\chi^2_{df}/df)^{1/2}$, which is called the Studentized range distribution.
37 Unbalanced Designs and/or Covariates
- Tukey method is conservative when the design is unbalanced and/or there are covariates; otherwise exact
- Dunnett method is conservative when there are covariates; otherwise exact
- Conservative means
- True FWE < Nominal FWE
- also means less powerful
38 Tukey-Kramer Method for all pairwise comparisons
- Let $c_\alpha$ be the critical value for the balanced case using Tukey's method and the correct df.
- Intervals are $\bar{y}_i - \bar{y}_{i'} \pm c_\alpha \hat{\sigma} \sqrt{1/n_i + 1/n_{i'}}$
- Conservative (Hayter, 1984 Annals)
39 Exact Method for General Comparisons of Means
40 Multivariate T-Distribution Details
41 Calculation of Exact $c_\alpha$
- Edwards and Berry: Simple simulation
- Hsu and Nelson: Factor analytic control variate (better)
- Genz and Bretz: Integration using lattice methods (best)
- Even with simple simulation, the value $c_\alpha$ can be obtained with reasonable precision.
Edwards, D., and Berry, J. (1987) The efficiency of simulation-based multiple comparisons. Biometrics, 43, 913-928.
Hsu, J.C. and Nelson, B.L. (1998) Multiple comparisons in the general linear model. Journal of Computational and Graphical Statistics, 7, 23-41.
Genz, A. and Bretz, F. (1999), Numerical Computation of Multivariate t Probabilities with Application to Power Calculation of Multiple Contrasts, J. Stat. Comp. Simul. 63, pp. 361-378.
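To make the "simple simulation" idea concrete, here is a minimal Python sketch for the balanced comparisons-with-control case only. It simulates the statistic max_i |Z_i - Z_0| / (2 chi^2_df / df)^(1/2) from the earlier slide and takes its empirical 1 - alpha quantile. It is not the Edwards and Berry algorithm for general contrasts, and the function name, sample count, and seed are hypothetical choices.

```python
import random


def simulate_dunnett_critical_value(n_treatments, df, alpha=0.05,
                                    n_sim=100_000, seed=121011):
    """Monte Carlo estimate of the two-sided Dunnett critical value.

    Assumption: balanced one-way layout with no covariates, so the
    statistic depends only on standard normals and one chi-square.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_sim):
        z0 = rng.gauss(0.0, 1.0)                         # control group
        # chi-square(df) is Gamma(shape=df/2, scale=2)
        s2 = rng.gammavariate(df / 2.0, 2.0) / df        # chi-square / df
        max_diff = max(abs(rng.gauss(0.0, 1.0) - z0)
                       for _ in range(n_treatments))
        stats.append(max_diff / (2.0 * s2) ** 0.5)
    stats.sort()
    return stats[int((1.0 - alpha) * n_sim)]


# Six treatments versus control with 21 error df, as in the Dunnett example.
c_alpha = simulate_dunnett_critical_value(n_treatments=6, df=21)
print("simulated critical value:", round(c_alpha, 3))
```

The result should land close to the critical value 2.78972 reported in the Dunnett output, with Monte Carlo error that shrinks as `n_sim` grows.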
42 Example: ANCOVA with two covariates
Y = Diastolic BP
Group = Therapy (Control, D1, D2, D3)
X1 = Baseline Diastolic BP
X2 = Baseline Systolic BP
Goal: Compare all therapies, controlling for baseline
proc glm data=research.bpr;
  class therapy;
  model dbp10 = therapy dbp7 sbp7;
  lsmeans therapy/pdiff cl adjust=simulate(nsamp=10000000 cvadjust seed=121011 report);
run; quit;
43 Results From ANCOVA
Source     DF   Type III SS    Mean Square    F Value   Pr > F
THERAPY     3    677.429172     225.809724      6.05    0.0006
DBP7        1   6832.878653    6832.878653    183.06    <.0001
SBP7        1     51.123459      51.123459      1.37    0.2435
Least Squares Means for Effect THERAPY
         Difference       Simultaneous 95% Confidence Limits
i   j    Between Means    for LSMean(i)-LSMean(j)
1   2     2.832658        -0.424816     6.09013
1   3     1.328481        -2.099566     4.756527
1   4    -2.536262        -5.981471     0.908947
2   3    -1.504178        -4.846403     1.838047
2   4    -5.368920        -8.734744    -2.003097
3   4    -3.864743        -7.398994    -0.330491
Note: 4 is control
44 Details for Quantile Simulation
Random number seed          121011
Comparison type             All
Sample size                 9999938
Target alpha                0.05
Accuracy radius (target)    0.0002
Accuracy radius (actual)    437E-7
Accuracy confidence         99%
Simulation Results
Method          95% Quantile   Estimated Alpha   99% Confidence Limits
Simulated       2.594159       0.0500            0.0500   0.0500
Tukey-Kramer    2.594637       0.0499            0.0499   0.0500
Bonferroni      2.669484       0.0411            0.0410   0.0411
Sidak           2.662029       0.0419            0.0418   0.0419
GT-2            2.660647       0.0420            0.0420   0.0421
Scheffe         2.823701       0.0270            0.0270   0.0270
T               1.974017       0.2017            0.2016   0.2018
NOTE: PROCEDURE GLM used real time 21.23 seconds
45 Results from ANCOVA-Dunnett
                            H0:LSMean=Control
THERAPY    DBP10 LSMEAN     Pr > |t|
Dose 1     88.8171113       0.1407
Dose 2     85.9844529       0.0002
Dose 3     87.4886307       0.0140
Placebo    91.3533732
Least Squares Means for Effect THERAPY
         Difference       Simultaneous 95% Confidence Limits
i   j    Between Means    for LSMean(i)-LSMean(j)
1   4    -2.536262        -5.675847     0.603323
2   4    -5.368920        -8.436161    -2.301679
3   4    -3.864743        -7.085470    -0.644015
46 Details for Quantile Simulation-Dunnett
Random number seed          121011
Comparison type             Control, two-sided
Sample size                 9999938
Target alpha                0.05
Accuracy radius (target)    0.0002
Accuracy radius (actual)    139E-7
Accuracy confidence         99%
Simulation Results
Method                    95% Quantile   Estimated Alpha   99% Confidence Limits
Simulated                 2.364031       0.0500            0.0500   0.0500
Dunnett-Hsu, two-sided    2.364084       0.0500            0.0500   0.0500
Bonferroni                2.417902       0.0437            0.0437   0.0437
Sidak                     2.411491       0.0444            0.0444   0.0444
GT-2                      2.410664       0.0445            0.0445   0.0445
Scheffe                   2.823701       0.0145            0.0145   0.0145
T                         1.974017       0.1229            0.1229   0.1230
NOTE: PROCEDURE GLM used real time 19.00 seconds
47 More General Inferences
Question: For what values of the covariate is treatment A better than treatment B?
48 Discussion of (Treatment x Covariate) Interaction Example
49 The GLIMMIX Procedure
Computes MC-exact simultaneous confidence intervals and adjusted p-values for any set of linear functions in a linear model
50 GLIMMIX syntax
proc glimmix data=research.tire;
  class make;
  model cost = make mph make*mph;
  estimate "10" make 1 -1 make*mph 10 -10,
           "15" make 1 -1 make*mph 15 -15,
           "20" make 1 -1 make*mph 20 -20,
           "25" make 1 -1 make*mph 25 -25,
           "30" make 1 -1 make*mph 30 -30,
           "35" make 1 -1 make*mph 35 -35,
           "40" make 1 -1 make*mph 40 -40,
           "45" make 1 -1 make*mph 45 -45,
           "50" make 1 -1 make*mph 50 -50,
           "55" make 1 -1 make*mph 55 -55,
           "60" make 1 -1 make*mph 60 -60,
           "65" make 1 -1 make*mph 65 -65,
           "70" make 1 -1 make*mph 70 -70
           / adjust=simulate(nsamp=10000000 report) cl;
run;
51 Output from PROC GLIMMIX
Simultaneous intervals are Estimate ± 2.648 x StdErr
Label   Estimate   StdErr   tValue   AdjLower   AdjUpper
10      -4.1067    0.9143   -4.49    -6.5279    -1.6854
15      -3.4539    0.8084   -4.27    -5.5947    -1.3131
20      -2.8011    0.7101   -3.94    -4.6815    -0.9207
25      -2.1483    0.6230   -3.45    -3.7981    -0.4985
30      -1.4956    0.5524   -2.71    -2.9585    -0.03260
35      -0.8428    0.5054   -1.67    -2.1812     0.4956
40      -0.1900    0.4887   -0.39    -1.4842     1.1042
45       0.4628    0.5054    0.92    -0.8756     1.8012
50       1.1156    0.5524    2.02    -0.3474     2.5785
55       1.7683    0.6230    2.84     0.1185     3.4181
60       2.4211    0.7101    3.41     0.5407     4.3015
65       3.0739    0.8084    3.80     0.9331     5.2147
70       3.7267    0.9143    4.08     1.3054     6.1479
Bonferroni critical value is $t_{16,\,.05/(2 \cdot 13)} = 3.377$.
52 Other Applications of Linear Combinations
- Multiple Trend Tests
- (0,1,2,3), (0,1,2,4), (0,4,6,7) (carcinogenicity)
- (0,0,1), (0,1,1), (0,1,2) (recessive/dominant/ordinal genotype effects)
- Subgroup Analysis
- Subgroups define linear combinations (more on next slide)
53 Subgroup Analysis Example
- Data: $Y_{ijkl}$, where i = Trt, Cntrl; j = Old, Yng; k = GoodInit, PoorInit.
- Model: $Y_{ijkl} = \mu_{ijk} + \varepsilon_{ijkl}$, where $\mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk}$
- Subgroup Contrasts:
             m111   m112   m121   m122   m211   m212   m221   m222
Overall       ¼      ¼      ¼      ¼     -¼     -¼     -¼     -¼
Older         ½      ½      0      0     -½     -½      0      0
Younger       0      0      ½      ½      0      0     -½     -½
GoodInit      ½      0      ½      0     -½      0     -½      0
PoorInit      0      ½      0      ½      0     -½      0     -½
OldGood       1      0      0      0     -1      0      0      0
OldPoor       0      1      0      0      0     -1      0      0
YoungGood     0      0      1      0      0      0     -1      0
YoungPoor     0      0      0      1      0      0      0     -1
(Column headings m111, ..., m222 denote the cell means $\mu_{ijk}$.)
54 Subgroup Analysis Results
Label            Estimate   StdErr   tValue   Probt    Adjp     AdjLower   AdjUpper
Overall          0.7075     0.1956   3.62     0.0002   0.0015    0.2460    I
Older            0.9952     0.2673   3.72     0.0002   0.0011    0.3646    I
Younger          0.4197     0.2847   1.47     0.0717   0.2605   -0.2521    I
GoodInitHealth   0.5871     0.2878   2.04     0.0219   0.0984   -0.09197   I
PoorInitHealth   0.8279     0.2644   3.13     0.0011   0.0068    0.2039    I
OldGood          0.8748     0.3387   2.58     0.0056   0.0295    0.07566   I
OldPoor          1.1157     0.3231   3.45     0.0004   0.0026    0.3532    I
YoungGood        0.2993     0.3562   0.84     0.2014   0.5494   -0.5413    I
YoungPoor        0.5401     0.3338   1.62     0.0544   0.2091   -0.2476    I
(SAS code available upon request)
55 Summary
- Include only comparisons of interest.
- Utilize correlations to be less conservative.
- The critical values can be computed exactly only in balanced ANOVA for all pairwise comparisons, or in unbalanced ANOVA for comparisons with control.
- Simulation-based methods are exact if you let the computer run for a while. This is my general recommendation.
56 Power Analysis
- Sample size: Design of study
- Power is less when you use multiple comparisons, which implies larger sample sizes
- Many power definitions
- Bonferroni and independence are convenient (but conservative) starting points
57 Power Definitions
Complete Power = P(Reject all H0i that are false)
Minimal Power = P(Reject at least one H0i that is false)
Individual Power = P(Reject a particular H0i that is false)
Proportional Power = Average proportion of false H0i that are rejected
58 Power Calculations
Example: H1 and H2 powered individually at 50%; H3 and H4 powered individually at 80%; all tests independent.
Complete Power = P(reject H1 and H2 and H3 and H4) = .5 x .5 x .8 x .8 = 0.16.
Minimal Power = P(reject H1 or H2 or H3 or H4) = 1 - P(accept H1 and H2 and H3 and H4) = 1 - (1-.5) x (1-.5) x (1-.8) x (1-.8) = 0.99.
Individual Power = P(reject H3 (say)) = 0.80 (depends on the test).
Proportional Power = (.5 + .5 + .8 + .8)/4 = 0.65.
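The three aggregate power definitions in this example reduce to simple products and averages when the tests are independent. A minimal Python sketch of that arithmetic follows; the function name is illustrative, and independence is the same assumption the slide makes.

```python
from math import prod


def power_summaries(individual_powers):
    """Complete, minimal, and proportional power for independent tests.

    Assumption: every listed hypothesis is false and the tests are
    mutually independent, as in the slide's example.
    """
    complete = prod(individual_powers)                       # reject all
    minimal = 1.0 - prod(1.0 - p for p in individual_powers)  # reject at least one
    proportional = sum(individual_powers) / len(individual_powers)
    return complete, minimal, proportional


complete, minimal, proportional = power_summaries([0.5, 0.5, 0.8, 0.8])
print(round(complete, 2), round(minimal, 2), round(proportional, 2))
```

Changing the list of individual powers shows how quickly complete power falls as more hypotheses are added.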
59 Sample Size for Adequate Individual Power - Conservative Estimate
60 Individual power of two-tail two-sample Bonferroni t-tests
%let MuDiff = 5;     /* Smallest meaningful difference MUx-MUy that you want to detect */
%let Sigma  = 10.0;  /* A guess of the population std. dev. */
%let alpha  = .05;   /* Familywise Type I error probability of the test */
%let k      = 4;     /* Number of tests */
options ls=76;
data power;
  cer = &alpha/&k;
  do n = 2 to 100 by 2;                      /* n = sample size for each group */
    df = n + n - 2;
    ncp = (&MuDiff)/(&Sigma*sqrt(2/n));      /* The noncentrality parameter */
    tcrit = tinv(1-cer/2, df);               /* The critical t value */
    power = 1 - probt(tcrit, df, ncp) + probt(-tcrit, df, ncp);
    output;
  end;
run;
proc print data=power; run;
proc plot data=power;
  plot power*n/vpos=30;
run;
61 Graph of Power Function
[Figure: plot of power*n, with power (0.0 to 1.0) on the vertical axis and n (0 to 100) on the horizontal axis. Annotation: n = 92 for 80% power.]
62 IndividualPower macro
- Uses PROBMC and PROBT (noncentral)
- Assumes that you want to use the single-step (confidence interval based) Dunnett (one- or two-sided) or Range (two-sided) test
- Less conservative than Bonferroni
- Conservative compared to stepwise procedures
- %IndividualPower(MCP=DUNNETT2,g=4,d=5,s=10)
Westfall et al. (1999), Multiple Comparisons and Multiple Tests Using SAS
63 IndividualPower Output
64 More general Power - Simulate!
Invocation:
%SimPower(method = dunnett, TrueMeans = (10, 10, 13, 15, 15), s = 10, n = 87, seed=12345)
Output:
Method=DUNNETT, Nominal FWE=0.05, nrep=1000
True means = (10, 10, 13, 15, 15), n=87, s=10
Quantity              Estimate   ---95% CI----
Complete Power        0.28800    (0.260,0.316)
Minimal Power         0.92900    (0.913,0.945)
Proportional Power    0.65133    (0.633,0.669)
True FWE              0.01900    (0.011,0.027)
Directional FWE       0.01900    (0.011,0.027)
65 Concluding Remarks - Power
- Need a bigger n
- Like to avoid bigger n (see sequential, gatekeepers methods later)
- Which definition?
- Bonferroni and independence useful
- Simulation useful, especially for the more complex methods that follow
66 Closed and Stepwise Testing Methods I: Standard P-Value Based Methods
If you want ...                                    then use
Estimates of effect sizes and error margins        Simultaneous Confidence Intervals
Confident inequalities                             Stepwise or closed tests: Holm's Method, Hommel's Method, Hochberg's Method, Fisher Combination Method
Overall Test                                       F-test, O'Brien, etc.
67 Closed Testing Method(s)
- Form the closure of the family by including all intersection hypotheses.
- Test every member of the closed family by a (suitable) $\alpha$-level test. (Here, $\alpha$ refers to comparison-wise error rate.)
- A hypothesis can be rejected provided that
- its corresponding test is significant at level $\alpha$, and
- every other hypothesis in the family that implies it is rejected by its $\alpha$-level test.
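68 Closed Testing: Multiple Endpoints
$H_0: \delta_1=\delta_2=\delta_3=\delta_4=0$
$H_0: \delta_1=\delta_2=\delta_3=0$
$H_0: \delta_1=\delta_2=\delta_4=0$
$H_0: \delta_1=\delta_3=\delta_4=0$
$H_0: \delta_2=\delta_3=\delta_4=0$
$H_0: \delta_1=\delta_4=0$
$H_0: \delta_2=\delta_4=0$
$H_0: \delta_1=\delta_2=0$
$H_0: \delta_1=\delta_3=0$
$H_0: \delta_2=\delta_3=0$
$H_0: \delta_3=\delta_4=0$
$H_0: \delta_1=0$, p = 0.0121
$H_0: \delta_2=0$, p = 0.0142
$H_0: \delta_3=0$, p = 0.1986
$H_0: \delta_4=0$, p = 0.0191
Where $\delta_j$ = mean difference, treatment - control, endpoint j.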
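69 Closed Testing: Multiple Comparisons
$\mu_1=\mu_2=\mu_3=\mu_4$
$\mu_1=\mu_3,\ \mu_2=\mu_4$
$\mu_1=\mu_4,\ \mu_2=\mu_3$
$\mu_1=\mu_2,\ \mu_3=\mu_4$
$\mu_1=\mu_2=\mu_3$
$\mu_1=\mu_2=\mu_4$
$\mu_1=\mu_3=\mu_4$
$\mu_2=\mu_3=\mu_4$
$\mu_1=\mu_2$
$\mu_1=\mu_3$
$\mu_1=\mu_4$
$\mu_2=\mu_3$
$\mu_2=\mu_4$
$\mu_3=\mu_4$
Note: Logical implications imply that there are only 14 nodes, not $2^6 - 1 = 63$ nodes.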
70 Control of FWE with Closed Tests
Suppose $H_{0j_1}, \ldots, H_{0j_m}$ all are true (unknown to you which ones).
{Reject at least one of $H_{0j_1}, \ldots, H_{0j_m}$ using CTP} $\subseteq$ {Reject $H_{0j_1} \cap \cdots \cap H_{0j_m}$}
Thus, P(reject at least one of $H_{0j_1}, \ldots, H_{0j_m}$ | $H_{0j_1}, \ldots, H_{0j_m}$ all are true) $\le$ P(reject $H_{0j_1} \cap \cdots \cap H_{0j_m}$ | $H_{0j_1}, \ldots, H_{0j_m}$ all are true) $\le \alpha$
71 Examples of Closed Testing Methods
When the Composite Test is ...     Then the Closed Method is ...
Bonferroni MinP                    Holm's Method
Resampling-Based MinP              Westfall-Young method
Simes                              Hommel's method
O'Brien                            Lehmacher's method
Simple or weighted test            Fixed sequence test (a-priori ordered)
72 P-value Based Methods
- Test global hypotheses using p-value combination tests
- Benefit: Fewer model assumptions; only need to say that the p-values are valid
- Allows for models other than homoscedastic normal linear models (like survival analysis).
73 Holm's Method is Closed Testing Using the Bonferroni MinP Test
- Reject $H_{0j_1} \cap H_{0j_2} \cap \cdots \cap H_{0j_m}$ if
- $\min(p_{0j_1}, p_{0j_2}, \ldots, p_{0j_m}) \le \alpha/m$.
- Or, Reject $H_{0j_1} \cap H_{0j_2} \cap \cdots \cap H_{0j_m}$ if
- $p = m \min(p_{0j_1}, p_{0j_2}, \ldots, p_{0j_m}) \le \alpha$.
- (Note that p is a valid p-value for the joint null, comparable to p-value for Hotelling's $T^2$ test.)
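74 Holm's Stepdown Method
$H_0: \delta_1=\delta_2=\delta_3=\delta_4=0$, minp = 0.0121, p = 0.0484
$H_0: \delta_1=\delta_3=\delta_4=0$, minp = 0.0121, p = 0.0363
$H_0: \delta_2=\delta_3=\delta_4=0$, minp = 0.0142, p = 0.0426
$H_0: \delta_1=\delta_2=\delta_3=0$, minp = 0.0121, p = 0.0363
$H_0: \delta_1=\delta_2=\delta_4=0$, minp = 0.0121, p = 0.0363
$H_0: \delta_1=\delta_2=0$, minp = 0.0121, p = 0.0242
$H_0: \delta_1=\delta_3=0$, minp = 0.0121, p = 0.0242
$H_0: \delta_1=\delta_4=0$, minp = 0.0121, p = 0.0242
$H_0: \delta_2=\delta_3=0$, minp = 0.0142, p = 0.0284
$H_0: \delta_2=\delta_4=0$, minp = 0.0142, p = 0.0284
$H_0: \delta_3=\delta_4=0$, minp = 0.0191, p = 0.0382
$H_0: \delta_1=0$, p = 0.0121
$H_0: \delta_2=0$, p = 0.0142
$H_0: \delta_3=0$, p = 0.1986
$H_0: \delta_4=0$, p = 0.0191
Where $\delta_j$ = mean difference, treatment - control, endpoint j.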
75 Shortcut For Holm's Method
- Let $H_{(1)}, \ldots, H_{(k)}$ be the hypotheses corresponding to $p_{(1)} \le \cdots \le p_{(k)}$
- If $p_{(1)} \le \alpha/k$, reject $H_{(1)}$ and continue, else stop and retain all $H_{(1)}, \ldots, H_{(k)}$.
- If $p_{(2)} \le \alpha/(k-1)$, reject $H_{(2)}$ and continue, else stop and retain all $H_{(2)}, \ldots, H_{(k)}$.
- ...
- If $p_{(k)} \le \alpha$, reject $H_{(k)}$
76 Adjusted p-values for Closed Tests
- The adjusted p-value for $H_{0j}$ is the maximum of all p-values over all relevant nodes
- In the previous example, $p_{A(1)} = 0.0484$, $p_{A(2)} = 0.0484$, $p_{A(3)} = 0.0484$, $p_{A(4)} = 0.1986$.
- General formula for Holm: $p_{A(j)} = \max_{i \le j} (k-i+1)\, p_{(i)}$.
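The general Holm formula can be applied directly to the four endpoint p-values from the example. The Python sketch below is one way to do it; the function name is illustrative, and the cap at 1 is a common convention that the slide does not state.

```python
def holm_adjusted(pvals):
    """Holm step-down adjusted p-values, returned in the input order.

    Implements p_A(j) = max over i <= j of (k - i + 1) * p_(i), with the
    result capped at 1 (the cap is an added convention, not on the slide).
    """
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])   # indices by ascending p
    adjusted = [0.0] * k
    running_max = 0.0
    for rank, idx in enumerate(order):                 # rank 0 is the smallest p
        candidate = min(1.0, (k - rank) * pvals[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.0121, 0.0142, 0.1986, 0.0191]                 # endpoints 1 to 4
print([round(p, 4) for p in holm_adjusted(raw)])
```

The output is in endpoint order, so the third value belongs to endpoint 3 (the largest raw p-value), matching the ordered values 0.0484, 0.0484, 0.0484, 0.1986 listed above.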
77 Worksheet For Holm's Method
78 Simes' Test for Global Hypotheses
- Uses all p-values $p_1, p_2, \ldots, p_m$, not just the MinP
- Simes' test rejects $H_{01} \cap H_{02} \cap \cdots \cap H_{0m}$ if
- $p_{(j)} \le j\alpha/m$ for at least one j.
- This implies that the p-value for the joint test is $p = \min_j\, (m/j)\, p_{(j)}$
- Uniformly smaller p-value than $m \times \mathrm{MinP}$
- Type I error at most $\alpha$ under independence or positive dependence of p-values
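For comparison, the Simes and Bonferroni MinP global p-values can be computed side by side. This Python sketch uses the four endpoint p-values from the running example; the function names are illustrative.

```python
def simes_pvalue(pvals):
    """Simes global p-value: min over j of (m / j) * p_(j)."""
    m = len(pvals)
    ordered = sorted(pvals)
    return min(m / j * p for j, p in enumerate(ordered, start=1))


def bonferroni_minp_pvalue(pvals):
    """Bonferroni global p-value: m * min(p), capped at 1."""
    return min(1.0, len(pvals) * min(pvals))


raw = [0.0121, 0.0142, 0.1986, 0.0191]
print("Simes:     ", round(simes_pvalue(raw), 4))
print("Bonferroni:", round(bonferroni_minp_pvalue(raw), 4))
```

Because the j = 1 term of the Simes minimum equals the Bonferroni value, the Simes p-value can never exceed it.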
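79 Rejection Regions
[Figure: rejection regions in the unit square, with p1 on the horizontal axis and p2 on the vertical axis, each marked at 0, $\alpha/2$, $\alpha$, and 1.]
$P(\text{Simes Reject}) = 1 - (1 - \alpha/2)^2 + (\alpha/2)^2 = \alpha$
$P(\text{Bonferroni Reject}) = 1 - (1 - \alpha/2)^2 = \alpha - (\alpha/2)^2$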
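80 Hommel's Method (Closed Simes)
$H_0: \delta_1=\delta_2=\delta_3=\delta_4=0$, p = 0.0255
$H_0: \delta_1=\delta_2=\delta_3=0$, p = 0.0213
$H_0: \delta_1=\delta_2=\delta_4=0$, p = 0.0191
$H_0: \delta_1=\delta_3=\delta_4=0$, p = 0.0287
$H_0: \delta_2=\delta_3=\delta_4=0$, p = 0.0287
$H_0: \delta_1=\delta_2=0$, p = 0.0142
$H_0: \delta_1=\delta_3=0$, p = 0.0242
$H_0: \delta_1=\delta_4=0$, p = 0.0191
$H_0: \delta_2=\delta_3=0$, p = 0.0284
$H_0: \delta_2=\delta_4=0$, p = 0.0191
$H_0: \delta_3=\delta_4=0$, p = 0.0382
$H_0: \delta_1=0$, p = 0.0121
$H_0: \delta_2=0$, p = 0.0142
$H_0: \delta_3=0$, p = 0.1986
$H_0: \delta_4=0$, p = 0.0191
Where $\delta_j$ = mean difference, treatment - control, endpoint j.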
81 Adjusted P-values for Hommel's Method
- Again, take the maximum p-value over all hypotheses that imply the given one.
- In the previous example, the Hommel adjusted p-values are $p_{A(1)} = 0.0287$, $p_{A(2)} = 0.0287$, $p_{A(3)} = 0.0382$, $p_{A(4)} = 0.1986$.
- These adjusted p-values are never larger than the Holm step-down adjusted p-values.
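The "maximum over all relevant nodes" rule can be carried out by brute force when the number of hypotheses is small. The Python sketch below enumerates every intersection, applies a chosen global test at each node, and takes the maximum for each elementary hypothesis. It assumes free combinations (all 2^k - 1 intersections are distinct), which fits the multiple-endpoints example but not pairwise comparisons. Function names are illustrative.

```python
from itertools import combinations


def simes_pvalue(pvals):
    """Simes global p-value: min over j of (m / j) * p_(j)."""
    m = len(pvals)
    return min(m / j * p for j, p in enumerate(sorted(pvals), start=1))


def bonferroni_minp_pvalue(pvals):
    """Bonferroni global p-value: m * min(p), capped at 1."""
    return min(1.0, len(pvals) * min(pvals))


def closed_adjusted(pvals, global_test):
    """Closed-testing adjusted p-values by full enumeration of the closure.

    Assumption: free combinations, so every nonempty subset is a node.
    """
    k = len(pvals)
    adjusted = [0.0] * k
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            p_node = global_test([pvals[i] for i in subset])
            for i in subset:                       # node is relevant to each member
                adjusted[i] = max(adjusted[i], p_node)
    return adjusted


raw = [0.0121, 0.0142, 0.1986, 0.0191]             # endpoints 1 to 4
hommel = closed_adjusted(raw, simes_pvalue)
holm = closed_adjusted(raw, bonferroni_minp_pvalue)
print("Hommel:", hommel)
print("Holm:  ", holm)
```

Swapping the global test is the only change needed to move between the Holm and Hommel procedures, which is the point of the closure principle. Enumeration grows as 2^k, so the shortcut formulas are preferable for large families.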
82 Adjusted P-values for Hommel's Method
- They are maxima over relevant nodes
- In example, Hommel adjusted p-values are $p_{A(1)} = 0.0287$, $p_{A(2)} = 0.0287$, $p_{A(3)} = 0.0382$, $p_{A(4)} = 0.1986$.
- Hommel adjusted p-value $\le$ Holm adjusted p-value
83 Hochberg's Method
- A conservative but simpler approximation to Hommel's method
- Hommel adjusted p-value $\le$ Hochberg adjusted p-value $\le$ Holm adjusted p-value
84 Hochberg's Shortcut Method
- Let $H_{(1)}, \ldots, H_{(k)}$ be the hypotheses corresponding to $p_{(1)} \le \cdots \le p_{(k)}$
- If $p_{(k)} \le \alpha$, reject all $H_{(j)}$ and stop, else retain $H_{(k)}$ and continue.
- If $p_{(k-1)} \le \alpha/2$, reject $H_{(1)}, \ldots, H_{(k-1)}$ and stop, else retain $H_{(k-1)}$ and continue.
- ...
- If $p_{(1)} \le \alpha/k$, reject $H_{(1)}$
- Adjusted p-values are $p_{A(j)} = \min_{i \ge j} (k-i+1)\, p_{(i)}$.
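The Hochberg formula is the step-up mirror image of Holm's: a running minimum taken from the largest p-value downward. A minimal Python sketch, with an illustrative function name, applied to the same four endpoint p-values:

```python
def hochberg_adjusted(pvals):
    """Hochberg step-up adjusted p-values, returned in the input order.

    Implements p_A(j) = min over i >= j of (k - i + 1) * p_(i).
    """
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])   # indices by ascending p
    adjusted = [0.0] * k
    running_min = 1.0
    for rank in range(k - 1, -1, -1):                  # start from the largest p
        idx = order[rank]
        running_min = min(running_min, (k - rank) * pvals[idx])
        adjusted[idx] = running_min
    return adjusted


raw = [0.0121, 0.0142, 0.1986, 0.0191]                 # endpoints 1 to 4
print([round(p, 4) for p in hochberg_adjusted(raw)])
```

For these data the Hochberg values fall between the Hommel and Holm values for every endpoint.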
85 Worksheet for Hochberg's Method
86 Comparison of Adjusted P-Values
p-Values
                 Stepdown
Test   Raw       Bonferroni   Hochberg   Hommel
1      0.0121    0.0484       0.0382     0.0286
2      0.0142    0.0484       0.0382     0.0286
3      0.1986    0.1986       0.1986     0.1986
4      0.0191    0.0484       0.0382     0.0382
87 Fisher Combination Test for Independent p-Values
Reject $H_{01} \cap H_{02} \cap \cdots \cap H_{0m}$ if $-2 \sum_i \ln(p_i) > \chi^2(1-\alpha, 2m)$
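Equivalently, the Fisher combination test can be reported as a global p-value: the upper tail probability of a chi-square distribution with 2m degrees of freedom evaluated at the statistic. The Python sketch below uses the closed-form tail for even degrees of freedom so that no statistics library is needed; the function names are illustrative, and the result is only valid when the p-values are independent.

```python
from math import exp, log


def chi2_sf_even_df(x, df):
    """Upper tail P(chi-square_df > x); closed form valid only for even df."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):        # sum of (x/2)^i / i! for i < df/2
        term *= half / i
        total += term
    return exp(-half) * total


def fisher_combination_pvalue(pvals):
    """Global p-value from Fisher's -2 * sum(ln p_i) statistic (2m df).

    Assumption: the component p-values are independent.
    """
    stat = -2.0 * sum(log(p) for p in pvals)
    return chi2_sf_even_df(stat, 2 * len(pvals))


print(round(fisher_combination_pvalue([0.05, 0.05]), 4))
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the tail formula.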
88 Example: Non-Overlapping Subgroup p-values
The Multtest Procedure
p-Values
                 Stepdown                            Fisher
Test   Raw       Bonferroni   Hochberg   Hommel     Combination
1      0.0784    0.3918       0.1550     0.1550     0.0784
2      0.0480    0.2883       0.1550     0.1441     0.0480
3      0.0041    0.0325       0.0305     0.0285     0.0053
4      0.0794    0.3918       0.1550     0.1550     0.0794
5      0.0044    0.0325       0.0305     0.0305     0.0056
6      0.0873    0.3918       0.1550     0.1550     0.0873
7      0.1007    0.3918       0.1550     0.1550     0.1007
8      0.1550    0.3918       0.1550     0.1550     0.1550
Non-overlapping is required by the independence assumption.
89 Power Comparison
Liptak test statistic: $T = \sum_i \Phi^{-1}(p_i) = \sum_i Z_i$
90 Concluding Notes
- Closed testing more powerful than single-step ($\alpha/m$ rather than $\alpha/k$).
- P-value based methods can be used whenever p-values are valid
- Dependence issues:
- MinP (Holm) conservative
- Simes (Hommel, Hochberg) less conservative, rarely anti-conservative
- Fisher combination, Liptak require independence
91 Closed and Stepwise Testing Methods II: Fixed Sequences and Gatekeepers
- Methods Covered:
- Fixed Sequences (hierarchical endpoints, dose-response, non-inferiority/superiority)
- Gatekeepers (primary and secondary analyses)
- Multiple Gatekeepers (multiple endpoints and multiple doses)
- Intersection-Union tests (Doesn't really belong in this section)
92 Fixed Sequence Tests
- Pre-specify $H_1, H_2, \ldots, H_k$, and test in this sequence, stopping as soon as you fail to reject.
- No $\alpha$-adjustment is necessary for individual tests.
- Applications:
- Dose response: High vs. Control, then Mid vs. Control, then Low vs. Control
- Primary endpoint, then Secondary endpoint
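93 Fixed Sequence as a Closed Procedure
$H_{123}: \delta_1=\delta_2=\delta_3=0$, Rej if $p_1 \le .05$
$H_{12}: \delta_1=\delta_2=0$, Rej if $p_1 \le .05$
$H_{13}: \delta_1=\delta_3=0$, Rej if $p_1 \le .05$
$H_{23}: \delta_2=\delta_3=0$, Rej if $p_2 \le .05$
$H_1: \delta_1=0$, Rej if $p_1 \le .05$
$H_2: \delta_2=0$, Rej if $p_2 \le .05$
$H_3: \delta_3=0$, Rej if $p_3 \le .05$
- Rej $H_1$ if $p_1 \le .05$
- Rej $H_2$ if $p_1 \le .05$ and $p_2 \le .05$
- Rej $H_3$ if $p_1 \le .05$ and $p_2 \le .05$ and $p_3 \le .05$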
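The three rejection rules above amount to "keep rejecting until the first failure." A minimal Python sketch of that rule follows; the function name and the example p-values are hypothetical.

```python
def fixed_sequence(pvals_in_order, alpha=0.05):
    """Fixed-sequence decisions for hypotheses listed in pre-specified order.

    Each hypothesis is rejected only if its own p-value and all earlier
    p-values are at most alpha; no alpha adjustment is applied.
    """
    decisions = []
    still_testing = True
    for p in pvals_in_order:
        still_testing = still_testing and p <= alpha
        decisions.append(still_testing)
    return decisions


# Hypothetical p-values: the second test fails, so the third is never rejected
# even though its own p-value is small.
print(fixed_sequence([0.01, 0.20, 0.001]))
```

The example shows the price of the procedure: a small p-value late in the sequence cannot be claimed once an earlier test fails, so the ordering must be chosen carefully in advance.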
94 A Seemingly Reasonable But Incorrect Protocol
- 1. Test Dose 2 vs Pbo, and Dose 3 vs Pbo using the Bonferroni method (0.025 level).
- 2. Test Dose 1 vs Pbo at the unadjusted 0.05 level only if at least one of the first two tests is significant at the 0.025 level.
95 The problem: FWE = 0.075
Moral: Caution needed when there are multiple hypotheses at some point in the sequence.
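The slide's supporting figure did not survive extraction, so the following Python sketch shows only one configuration that yields an error rate near the stated 0.075. The scenario is an assumption: Dose 3 is so effective that its test always rejects (the gate is always open), while the Dose 2 and Dose 1 nulls are both true. The sketch also assumes the two remaining tests are independent; without that assumption the Bonferroni inequality bounds the error rate by 0.025 + 0.05 = 0.075.

```python
def fwe_incorrect_protocol(gate_level=0.025, final_level=0.05):
    """FWE of the two-step protocol under an assumed worst-case scenario.

    Assumptions (not from the slide): Dose 3 always rejects, the Dose 2
    and Dose 1 nulls are true, and their tests are independent.
    """
    p_no_false_rejection = (1.0 - gate_level) * (1.0 - final_level)
    return 1.0 - p_no_false_rejection


fwe = fwe_incorrect_protocol()
print("FWE under the assumed scenario:", round(fwe, 5))
print("Bonferroni upper bound:        ", 0.025 + 0.05)
```

Either way the familywise error rate exceeds the nominal 0.05, which is the slide's point.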
96 Correcting the Incorrect Protocol: Use Closure
Where $p_{ij} = 2\min(p_i, p_j)$
97 References: Fixed Sequence and Gatekeeper Tests
- Bauer, P. (1991) Multiple Testing in Clinical Trials, Statistics in Medicine, 10, 871-890.
- O'Neill, R.T. (1997) Secondary endpoints cannot be validly analyzed if the primary endpoint does not demonstrate clear statistical significance. Controlled Clinical Trials, 18, 550-556.
- D'Agostino, R.B. (2000) Controlling alpha in clinical trials: the case for secondary endpoints. Statistics in Medicine, 19, 763-766.
- Chi, G.Y.H. (1998) Multiple testings: multiple comparisons and multiple endpoints. Drug Information Journal, 32, 1347S-1362S.
- Bauer, P., Röhmel, J., Maurer, W., Hothorn, L. (1998) Testing strategies in multi-dose experiments including active control. Statistics in Medicine, 17, 2133-2146.
- Westfall, P.H. and Krishen, A. (2001). Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures, Journal of Statistical Planning and Inference, 99, 25-40.
- Chi, G. Clinical Benefits, Decision Rules, and Multiple Inferences, http://www.fda.gov/cder/Offices/Biostatistics/chi_1/sld001.htm
- Dmitrienko, A., Offen, W. and Westfall, P. (2003). Gatekeeping strategies for clinical trials that do not require all effects to be significant. Stat Med. 22, 2387-2400.
- Chen, X., Luo, X., Capizzi, T. (2005) The application of enhanced parallel gatekeeping strategies. Stat Med. 24, 1385-97.
- Dmitrienko, A., Molenberghs, G., Chuang-Stein, C., and Offen, W. (2005), Analysis of Clinical Trials Using SAS: A Practical Guide, SAS Press.
- Wiens, B., and Dmitrienko, A. (2005). The fallback procedure for evaluating a single family of hypotheses. J Biopharm Stat. 15(6), 929-42.
- Dmitrienko, A., Wiens, B. and Westfall, P. (2006). Fallback Tests in Dose Response Clinical Trials, J Biopharm Stat, 16, 745-755.
98 Intersection-Union (IU) Tests
- Union-Intersection (UI): Nulls are intersections, alternatives are unions.
- $H_0: \delta_1 = 0$ and $\delta_2 = 0$ vs. $H_1: \delta_1 \ne 0$ or $\delta_2 \ne 0$
- Intersection-Union (IU): Nulls are unions, alternatives are intersections
- $H_0: \delta_1 = 0$ or $\delta_2 = 0$ vs. $H_1: \delta_1 \ne 0$ and $\delta_2 \ne 0$
- IU is NOT a closed procedure. It is just a single test of a different kind of null hypothesis.
99 Applications of I-U
- Bioequivalence: The TOST test
- Test 1. $H_{01}: \delta \le -\delta_0$ vs. $H_{A1}: \delta > -\delta_0$
- Test 2. $H_{02}: \delta \ge \delta_0$ vs. $H_{A2}: \delta < \delta_0$
- Can test both at $\alpha = .05$, but must reject both.
- Combination Therapy
- Test 1. $H_{01}: \mu_{12} \le \mu_1$ vs. $H_{A1}: \mu_{12} > \mu_1$
- Test 2. $H_{02}: \mu_{12} \le \mu_2$ vs. $H_{A2}: \mu_{12} > \mu_2$
- Can test both at $\alpha = .05$, but must reject both.
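Both applications share the same decision rule: the union null is rejected only when every component test rejects at the unadjusted level, which is equivalent to comparing the largest component p-value with alpha. A minimal Python sketch, with an illustrative function name and hypothetical p-values:

```python
def intersection_union_reject(component_pvals, alpha=0.05):
    """IU decision: reject the union null only if all component tests reject.

    Each component is tested at the full level alpha; equivalently, the
    largest component p-value must be at most alpha.
    """
    return max(component_pvals) <= alpha


# Hypothetical one-sided p-values for the two TOST components.
print(intersection_union_reject([0.01, 0.04]))   # both reject
print(intersection_union_reject([0.01, 0.06]))   # second component fails
```

The maximum p-value can itself be reported as the p-value of the IU test.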
100 Control of Type I Error for IU tests
Suppose $\delta_1 = 0$ or $\delta_2 = 0$. Then
P(Type I error) = P(Reject $H_0$)   (1)
= $P(p_1 \le .05 \text{ and } p_2 \le .05)$   (2)
$\le \min\{P(p_1 \le .05),\ P(p_2 \le .05)\}$   (3)
$\le .05$.   (4)
Note: The inequality at (3) becomes an approximate equality when $p_2$ is extremely noncentral.
101 Concluding Notes: Fixed Sequences and Gatekeepers
- Many times, no adjustment is necessary at all!
- Other times you can gain power by specifying gatekeeping sequences
- However, you must clearly state the method and follow the rules
- There are many incorrect "no adjustment" methods; use caution
102 Closed and Stepwise Testing Methods III: Methods that Use Logical Constraints and Correlations
Methods                           Application
Lehmacher et al.                  Multiple endpoints
Westfall-Tobias-Shaffer-Royen     General contrasts
103 Lehmacher et al. Method
- Use O'Brien test at each node (incorporates correlations)
- Do closed testing
- Note: Possibly no adjustment whatsoever; possibly big adjustment
104 Calculations for Lehmacher's Method
proc standard data=research.multend1 mean=0 std=1 out=stdzd;
  var Endpoint1-Endpoint4;
run;
data combine;
  set stdzd;
  H1234 = Endpoint1+Endpoint2+Endpoint3+Endpoint4;
  H123  = Endpoint1+Endpoint2+Endpoint3;
  H124  = Endpoint1+Endpoint2+Endpoint4;
  H134  = Endpoint1+Endpoint3+Endpoint4;
  H234  = Endpoint2+Endpoint3+Endpoint4;
  H12   = Endpoint1+Endpoint2;
  H13   = Endpoint1+Endpoint3;
  H14   = Endpoint1+Endpoint4;
  H23   = Endpoint2+Endpoint3;
  H24   = Endpoint2+Endpoint4;
  H34   = Endpoint3+Endpoint4;
  H1    = Endpoint1;
  H2    = Endpoint2;
  H3    = Endpoint3;
  H4    = Endpoint4;
run;
proc ttest;
  class treatment;
  var H1234 H123 H124 H134 H234 H12 H13 H14 H23 H24 H34 H1 H2 H3 H4;
  ods output ttests=ttests;
run;
105Output For Lehmacher's Method
Obs  Variable  Method  Variances  tValue   DF   Probt
  1  H1234     Pooled  Equal        2.69  109  0.0082
  3  H123      Pooled  Equal        2.59  109  0.0108
  5  H124      Pooled  Equal        3.03  109  0.0031
  7  H134      Pooled  Equal        2.36  109  0.0201
  9  H234      Pooled  Equal        2.51  109  0.0136
 11  H12       Pooled  Equal        3.03  109  0.0030
 13  H13       Pooled  Equal        2.12  109  0.0365
 15  H14       Pooled  Equal        2.68  109  0.0085
 17  H23       Pooled  Equal        2.22  109  0.0287
 19  H24       Pooled  Equal        2.88  109  0.0047
 21  H34       Pooled  Equal        2.03  109  0.0450
 23  H1        Pooled  Equal        2.55  109  0.0121
 25  H2        Pooled  Equal        2.49  109  0.0142
 27  H3        Pooled  Equal        1.29  109  0.1986
 29  H4        Pooled  Equal        2.38  109  0.0191
$p_{A1} = \max(0.0121, 0.0085, 0.0365, 0.0030, 0.0201, 0.0031, 0.0108, 0.0082) = 0.0365$
$p_{A2} = \max(0.0142, 0.0047, 0.0287, 0.0030, 0.0136, 0.0031, 0.0108, 0.0082) = 0.0287$
$p_{A3} = \max(0.1986, 0.0450, 0.0287, 0.0365, 0.0136, 0.0201, 0.0108, 0.0082) = 0.1986$
$p_{A4} = \max(0.0191, 0.0450, 0.0047, 0.0085, 0.0136, 0.0201, 0.0031, 0.0082) = 0.0450$
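The maxima above follow one rule: the closed-testing adjusted p-value for an endpoint is the largest p-value among all nodes whose label contains that endpoint. A minimal sketch using the node p-values from the output slide (the dictionary layout and the function name `closed_adjusted` are illustrative choices, not part of the SAS analysis):

```python
# Node p-values copied from the PROC TTEST output (Probt column).
node_p = {
    "1234": 0.0082,
    "123": 0.0108, "124": 0.0031, "134": 0.0201, "234": 0.0136,
    "12": 0.0030, "13": 0.0365, "14": 0.0085,
    "23": 0.0287, "24": 0.0047, "34": 0.0450,
    "1": 0.0121, "2": 0.0142, "3": 0.1986, "4": 0.0191,
}

def closed_adjusted(node_p, endpoints="1234"):
    """Adjusted p-value = max p over every closure node that includes the endpoint."""
    return {e: max(p for node, p in node_p.items() if e in node) for e in endpoints}

adjusted = closed_adjusted(node_p)
```

Each endpoint appears in 8 of the 15 nodes, which is why each maximum above is taken over 8 values.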
106Free and Restricted Combinations
- If the truth of some null hypotheses logically forces other nulls to be true, the hypotheses are restricted; otherwise they are free.
- Examples
- Multiple endpoints, one test per endpoint: free
- All pairwise comparisons: restricted
107Pairwise Comparisons, 3 Groups
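$H_0: \mu_1 = \mu_2 = \mu_3$
$H_0: \mu_1 = \mu_3,\ \mu_2 = \mu_3$
$H_0: \mu_1 = \mu_2,\ \mu_1 = \mu_3$
$H_0: \mu_1 = \mu_2,\ \mu_2 = \mu_3$
$H_0: \mu_1 = \mu_2$
$H_0: \mu_1 = \mu_3$
$H_0: \mu_2 = \mu_3$
Note: The entire middle layer is not needed, because each of its nodes is logically identical to $\mu_1 = \mu_2 = \mu_3$. Fisher's protected LSD is valid!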
108Pairwise Comparisons, 4 Groups
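$\mu_1 = \mu_2 = \mu_3 = \mu_4$
$\mu_1 = \mu_3,\ \mu_2 = \mu_4$
$\mu_1 = \mu_4,\ \mu_2 = \mu_3$
$\mu_1 = \mu_2,\ \mu_3 = \mu_4$
$\mu_1 = \mu_2 = \mu_3$
$\mu_1 = \mu_2 = \mu_4$
$\mu_1 = \mu_3 = \mu_4$
$\mu_2 = \mu_3 = \mu_4$
$\mu_1 = \mu_2$
$\mu_1 = \mu_3$
$\mu_1 = \mu_4$
$\mu_2 = \mu_3$
$\mu_2 = \mu_4$
$\mu_3 = \mu_4$
Note: Logical implications imply that there are only 14 nodes, not $2^6 - 1 = 63$ nodes. Also, Fisher's protected LSD is not valid.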
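The node counts can be checked by brute force. The sketch below (function names are hypothetical) enumerates every nonempty subset of the pairwise nulls, merges means that the subset forces to be equal, and counts the distinct groupings that result.

```python
from itertools import combinations

def find(parent, i):
    """Follow parent links to the representative of i's group."""
    while parent[i] != i:
        i = parent[i]
    return i

def closure_nodes(k):
    """Distinct intersection hypotheses for all pairwise comparisons of k means."""
    pairs = list(combinations(range(k), 2))
    nodes = set()
    for r in range(1, len(pairs) + 1):
        for subset in combinations(pairs, r):
            parent = list(range(k))
            for a, b in subset:            # each null mu_a = mu_b merges two groups
                parent[find(parent, a)] = find(parent, b)
            blocks = {}
            for i in range(k):
                blocks.setdefault(find(parent, i), []).append(i)
            # A node is identified by the grouping of equal means it implies.
            nodes.add(frozenset(frozenset(b) for b in blocks.values()))
    return nodes

n_nodes_3 = len(closure_nodes(3))
n_nodes_4 = len(closure_nodes(4))
```

For three groups this gives 4 distinct nodes (the global null plus the three pairwise nulls), and for four groups it gives the 14 nodes listed above.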
109Restricted Combinations Multipliers (Shaffer Method 1: Modified Holm)
Shaffer, J.P. (1986). Modified sequentially rejective multiple test procedures. JASA 81, 826-831.
110Shaffer's (1) Adjusted p-values
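A hedged sketch of how Shaffer's Method 1 is usually described for all pairwise comparisons of k means: at step j, Holm's multiplier m - j + 1 is replaced by the largest attainable number of simultaneously true nulls that does not exceed it. The function names and the example p-values are hypothetical; this is an illustration, not the slides' own calculation.

```python
from math import comb

def possible_true_counts(k):
    """Attainable numbers of true pairwise nulls: sums of C(size, 2) over groupings of k means."""
    counts = set()
    def extend(remaining, largest, total):
        if remaining == 0:
            counts.add(total)
            return
        for size in range(1, min(remaining, largest) + 1):
            extend(remaining - size, size, total + comb(size, 2))
    extend(k, k, 0)
    return counts

def shaffer1_multipliers(k):
    """Multiplier at each step: largest attainable count not exceeding Holm's m - step."""
    m = comb(k, 2)
    counts = possible_true_counts(k)
    return [max(c for c in counts if c <= m - step) for step in range(m)]

def shaffer1_adjust(pvalues, k):
    """Step-down adjusted p-values with monotonicity enforced."""
    multipliers = shaffer1_multipliers(k)
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted, running = [0.0] * len(pvalues), 0.0
    for step, i in enumerate(order):
        running = max(running, min(1.0, multipliers[step] * pvalues[i]))
        adjusted[i] = running
    return adjusted

mult_3 = shaffer1_multipliers(3)
mult_4 = shaffer1_multipliers(4)
example_adj = shaffer1_adjust([0.001, 0.01, 0.02, 0.03, 0.04, 0.5], k=4)  # hypothetical p-values
```

With four groups the multipliers come out as 6, 3, 3, 3, 2, 1, compared with Holm's 6, 5, 4, 3, 2, 1.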
111Westfall/Tobias/Shaffer/Royen Method
- Uses actual distribution of MinP instead of the conservative Bonferroni approximation
- Closed testing incorporating logical constraints
- Hard-coded in PROC GLIMMIX
- Allows arbitrary linear functions
Westfall, P.H. and Tobias, R.D. (2007). Multiple Testing of General Contrasts: Truncated Closure and the Extended Shaffer-Royen Method. Journal of the American Statistical Association 102, 487-494.
112Application of Truncated Closed MinP to Subgroup
Analysis
- Compare Treatment with control as follows:
- Overall
- In the Older Patients subgroup
- In the Younger Patients subgroup
- In the patients with better initial health subgroup
- In the patients with poorer initial health subgroup
- In each of the four (old/young) x (better/poorer) subgroups
- 9 tests overall (but better: 1 gatekeeper + 8 follow-up)
113Analysis File
ods output estimates=estimates_logicaltests;
proc glimmix data=research.respiratory;
   class Treatment AgeGroup InitHealth;
   model score = Treatment AgeGroup InitHealth Treatment*AgeGroup
                 Treatment*InitHealth AgeGroup*InitHealth;
   estimate
      "Overall"        treatment 4 -4 treatment*Agegroup 2 2 -2 -2 treatment*InitHealth 2 2 -2 -2 (divisor=4),
      "Older"          treatment 2 -2 treatment*Agegroup 2 0 -2 0  treatment*InitHealth 1 1 -1 -1 (divisor=2),
      "Younger"        treatment 2 -2 treatment*Agegroup 0 2 0 -2  treatment*InitHealth 1 1 -1 -1 (divisor=2),
      "GoodInitHealth" treatment 2 -2 treatment*Agegroup 1 1 -1 -1 treatment*InitHealth 2 0 -2 0  (divisor=2),
      "PoorInitHealth" treatment 2 -2 treatment*Agegroup 1 1 -1 -1 treatment*InitHealth 0 2 0 -2  (divisor=2),
      "OldGood"        treatment 1 -1 treatment*Agegroup 1 0 -1 0  treatment*InitHealth 1 0 -1 0,
      "OldPoor"        treatment 1 -1 treatment*Agegroup 1 0 -1 0  treatment*InitHealth 0 1 0 -1,
      "YoungGood"      treatment 1 -1 treatment*Agegroup 0 1 0 -1  treatment*InitHealth 1 0 -1 0,
      "YoungPoor"      treatment 1 -1 treatment*Agegroup 0 1 0 -1  treatment*InitHealth 0 1 0 -1
      / adjust=simulate(nsamp=10000000 report seed=12321) upper stepdown(type=logical report);
run;
proc print data=estimates_logicaltests noobs;
   title "Subgroup Analysis Results: Truncated Closure";
   var label estimate Stderr tvalue probt Adjp;
run;
114Results: Truncated Closure
Subgroup Analysis Results
Label           Estimate  StdErr  tValue   Probt  adjp_logical  adjp_interval
Overall           0.7075  0.1956    3.62  0.0002        0.0011         0.0015
Older             0.9952  0.2673    3.72  0.0002        0.0011         0.0011
Younger           0.4197  0.2847    1.47  0.0717        0.1049         0.2605
GoodInitHealth    0.5871  0.2878    2.04  0.0219        0.0432         0.0984
PoorInitHealth    0.8279  0.2644    3.13  0.0011        0.0023         0.0068
OldGood           0.8748  0.3387    2.58  0.0056        0.0124         0.0295
OldPoor           1.1157  0.3231    3.45  0.0004        0.0011         0.0026
YoungGood         0.2993  0.3562    0.84  0.2014        0.2014         0.5494
YoungPoor         0.5401  0.3338    1.62  0.0544        0.1049         0.2091
The adjusted p-values for the stepdown tests are mathematically no larger than those of the simultaneous interval-based tests.
115Example: Stepwise Pairwise vs. Control Testing
- Teratology data set
- Observations are litters
- Response variable: litter weight
- Treatments: 0, 5, 50, 500
- Covariates: litter size, gestation time
116Analysis File
proc glimmix data=research.litter;
   class dose;
   model weight = dose gesttime number;
   estimate "5 vs 0"   dose -1 1 0 0,
            "50 vs 0"  dose -1 0 1 0,
            "500 vs 0" dose -1 0 0 1
            / adjust=simulate(nsample=10000000 report) stepdown(type=logical);
run;
quit;
117Results
Estimates with Simulated Adjustment
Label     Estimate  Standard Error  DF  t Value  Pr > |t|   Adj P
5 vs 0     -3.3524          1.2908  68    -2.60    0.0115  0.0316
50 vs 0    -2.2909          1.3384  68    -1.71    0.0915  0.0915
500 vs 0   -2.6752          1.3343  68    -2.00    0.0490  0.0907
Note: 50-0 and 500-0 are not significant at .10 with regular Dunnett.
118Concluding Notes
- More power is available when combinations are restricted.
- Power of closed tests can be improved using correlation and other distributional characteristics.
119Nonparametric Multiple Testing Methods
- Overview: Use nonparametric tests at each node of the closure tree
- Bootstrap tests
- Rank-based tests
- Tests for binary data
120Bootstrap MinP Test (Semi-Parametric Test)
- The composite hypothesis $H_1 \cap H_2 \cap \cdots \cap H_k$ may be tested using the p-value
- $p = P(\mathrm{MinP} \le \mathrm{minp} \mid H_1 \cap H_2 \cap \cdots \cap H_k)$
- Westfall and Young (1993) show
- how to obtain $p$ by bootstrapping the residuals in a multivariate regression model.
- how to obtain all $p$'s in the closure tree efficiently
121 Multivariate Regression Model (Next Five
slides are from Westfall and Young, 1993)
122Hypotheses and Test Statistics
123Joint Distribution of the Test Statistics
124Testing Subset Intersection Hypotheses Using the
Extreme Pivotals
125Exact Calculation of $p_K$
Bootstrap Approximation
126Bootstrap Tests (PROC MULTTEST)
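$H_0: \delta_1 = \delta_2 = \delta_3 = \delta_4 = 0$: min p = .0121, p = .0379
$H_0: \delta_1 = \delta_3 = \delta_4 = 0$: min p = .0121, p < .0379
$H_0: \delta_2 = \delta_3 = \delta_4 = 0$: min p = .0142, p = .0351
$H_0: \delta_1 = \delta_2 = \delta_3 = 0$: min p = .0121, p < .0379
$H_0: \delta_1 = \delta_2 = \delta_4 = 0$: min p = .0121, p < .0379
$H_0: \delta_1 = \delta_2 = 0$: min p = .0121, p < .0379
$H_0: \delta_1 = \delta_3 = 0$: min p = .0121, p < .0379
$H_0: \delta_3 = \delta_4 = 0$: min p = .0191, p = .0355
$H_0: \delta_1 = \delta_4 = 0$: min p = .0121, p < .0379
$H_0: \delta_2 = \delta_3 = 0$: min p = .0142, p < .0351
$H_0: \delta_2 = \delta_4 = 0$: min p = .0142, p < .0351
$H_0: \delta_4 = 0$: p = 0.0191, p < .0355
$H_0: \delta_1 = 0$: p = 0.0121, p < .0379
$H_0: \delta_2 = 0$: p = 0.0142, p < .0351
$H_0: \delta_3 = 0$: p = 0.1986, p = .1991
$p = P(\mathrm{MinP} \le \mathrm{minp} \mid H_0)$ (computed using bootstrap resampling). (Recall, for Bonferroni, $p = k(\mathrm{MinP})$.)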
127Permutation Tests for Composite Hypotheses $H_{0K}$
Joint p-value = proportion of the $n!/(n_T!\,n_C!)$ permutations for which $\min_{i \in K} P_i \le \min_{i \in K} p_i$.
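For a very small study the joint p-value can be computed by enumerating every relabeling. The sketch below uses made-up integer data (6 subjects, 2 endpoints) and upper-tailed tests based on treatment-group sums; the data, the choice of statistic, and the function name `permutation_minp` are assumptions for illustration only.

```python
from itertools import combinations

# Hypothetical data: 6 subjects, 2 integer-valued endpoints each.
responses = [(9, 4), (8, 6), (7, 5), (3, 5), (2, 3), (1, 2)]
treated = (0, 1, 2)  # indices of the treatment group; the rest are controls

def permutation_minp(responses, treated):
    n, k = len(responses), len(responses[0])
    splits = list(combinations(range(n), len(treated)))  # all n!/(nT! nC!) relabelings
    # Treatment-group sum of each endpoint under every relabeling
    # (equivalent to the mean difference when group sizes are fixed).
    sums = [[sum(responses[i][j] for i in split) for j in range(k)] for split in splits]
    # counts[s][j] / len(splits) is the upper-tailed permutation p-value of
    # endpoint j that would have been observed under relabeling s.
    counts = [[sum(other[j] >= row[j] for other in sums) for j in range(k)] for row in sums]
    observed = counts[splits.index(tuple(sorted(treated)))]
    min_count = min(observed)
    # Joint p-value: share of relabelings whose smallest p-value is <= the observed minp.
    joint = sum(min(row) <= min_count for row in counts)
    total = len(splits)
    return [c / total for c in observed], joint / total

raw_p, joint_p = permutation_minp(responses, treated)
```

The joint p-value always lies between the smallest raw p-value and the Bonferroni bound, k times that value.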
128Problem Simplification
Problem: There are $2^k - 1$ subsets $K$ to be tested. This might take a while...
Simplification: You need only test $k$ of the $2^k - 1$ subsets! Why? Because $P(\min_{i \in K} P_i \le c) \le P(\min_{i \in K'} P_i \le c)$ when $K \subset K'$. Significance for most lower order subsets is determined by significance of higher order subsets.
129MULTTEST PROCEDURE
Tests only the needed subsets ($k$, not $2^k - 1$).
Samples from the permutation distribution.
Only one sample is needed, not $k$ distinct samples, if the joint distribution of minP is identical under $H_K$ and $H_S$. (Called the "subset pivotality" condition by Westfall and Young, 1993; valid under location shift and other models.)
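The "k subsets, one sample" idea can be sketched as a step-down minP calculation over a single matrix of resampled p-values. This is an illustrative sketch of the general step-down scheme, not PROC MULTTEST's implementation; the function name and the tiny resampled matrix are hypothetical (a real analysis would use thousands of resamples).

```python
def stepdown_minp(observed, resampled):
    """Step-down minP adjusted p-values from one set of resampled p-value vectors.

    observed:  list of k raw p-values.
    resampled: list of rows, each a list of k p-values generated under the null.
    """
    k = len(observed)
    order = sorted(range(k), key=lambda j: observed[j])  # most significant first
    adjusted = [0.0] * k
    running_max = 0.0
    for step, j in enumerate(order):
        remaining = order[step:]  # only k nested subsets are ever examined
        hits = sum(min(row[i] for i in remaining) <= observed[j] for row in resampled)
        running_max = max(running_max, hits / len(resampled))  # enforce monotonicity
        adjusted[j] = running_max
    return adjusted

# Hypothetical inputs: 3 tests and 5 resampled p-value vectors.
observed_p = [0.01, 0.04, 0.30]
resampled_p = [
    [0.50, 0.20, 0.90],
    [0.005, 0.60, 0.70],
    [0.30, 0.03, 0.40],
    [0.80, 0.90, 0.25],
    [0.60, 0.70, 0.10],
]
adjusted_p = stepdown_minp(observed_p, resampled_p)
```

The same resampled rows are reused at every step, which is legitimate only when the subset pivotality condition described above holds.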
130Great Savings are Possible with Exact Permutation
Tests!
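Why? Suppose you test $H_{12 \ldots k}$ using MinP. The joint p-value is
$$p = P(\mathrm{MinP} \le \mathrm{minp}) \le P(P_1 \le \mathrm{minp}) + P(P_2 \le \mathrm{minp}) + \cdots + P(P_k \le \mathrm{minp}).$$
Many summands can be zero, others much less than minp.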
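Why summands vanish can be seen from the discreteness of an exact test. The sketch below computes one summand $P(P_i \le \mathrm{minp})$ for an upper-tailed Fisher exact test with fixed margins; the group sizes, event totals, and function names are hypothetical choices for illustration.

```python
from math import comb

def fisher_upper_pvalues(n_t, n_c, events):
    """Hypergeometric probabilities and upper-tail p-values for each attainable count.

    n_t, n_c: treatment and control group sizes; events: total events in both groups.
    """
    total = comb(n_t + n_c, events)
    lo, hi = max(0, events - n_c), min(events, n_t)
    prob = {x: comb(n_t, x) * comb(n_c, events - x) / total for x in range(lo, hi + 1)}
    pvals, tail = {}, 0.0
    for x in range(hi, lo - 1, -1):  # accumulate P(X >= x) from the top down
        tail += prob[x]
        pvals[x] = tail
    return prob, pvals

def summand(n_t, n_c, events, minp):
    """P(P_i <= minp): total probability of outcomes whose p-value is at most minp."""
    prob, pvals = fisher_upper_pvalues(n_t, n_c, events)
    return sum(prob[x] for x in prob if pvals[x] <= minp)

rare = summand(20, 20, 2, 0.001)     # only 2 events in total
common = summand(20, 20, 10, 0.001)  # 10 events in total
```

With only 2 events the smallest attainable p-value is about .24, so that variable contributes exactly zero to the bound; with 10 events the contribution is positive but well below minp.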
131Multiple Binary Adverse Events
Variable  Contrast     Raw  Stepdown Bonferroni  Stepdown Permutation
ae1       t vs c    0.0008               0.0025                0.0020
ae2       t vs c    0.6955               1.0000                1.0000
ae3       t vs c    0.5000               1.0000                1.0000
ae4       t vs c    0.7525               1.0000                1.0000
ae5       t vs c    0.2213               1.0000                0.6274
ae6       t vs c    0.0601               0.3321                0.2608
ae7       t vs c    0.8165               1.0000                1.0000
ae8       t vs c    0.0293               0.1587                0.1328
ae9       t vs c    0.9399               1.0000                1.0000
ae10      t vs c    0.2484               1.0000                0.9273
ae11      t vs c    1.0000               1.0000                1.0000
ae12      t vs c    1.0000               1.0000                1.0000
ae13      t vs c    1.0000               1.0000                1.0000
ae14      t vs c    1.0000               1.0000                1.0000
ae15      t vs c    0.2484               1.0000                0.9273
ae16      t vs c    0.7516               1.0000                1.0000
ae17      t vs c    1.0000               1.0000                1.0000
ae18      t vs c    1.0000               1.0000                1.0000
ae19      t vs c    1.0000               1.0000                1.0000
ae20      t vs c    0.5000               1.0000                1.0000
ae21      t vs c    0.7516               1.0000                1.0000
ae22      t vs c    1.0000               1.0000                1.0000
ae23      t vs c    0.5000               1.0000                1.0000
ae24      t vs c    1.0000               1.0000                1.0000
ae25      t vs c    1.0000               1.0000                1.0000
ae26      t vs c    1.0000               1.0000                1.0000
ae27      t vs c    1.0000               1.0000                1.0000
ae28      t vs c    0.4344               1.0000                0.9400
132Example: Genetic Associations
Phenotype: 0/1 (diseased or not). Sample $n_1$ from diseased, $n_2$ from not diseased. Compare 100s of genotype frequencies (using dominant and recessive codings) for diseased and non-diseased using multiple Fisher exact tests.
133PROC MULTTEST Code
proc multtest data=research.gen stepperm n=20000 out=pval hommel fdr;
   class y;
   test
fisher(d1-d1