Title: Basic Biostatistics for the Clinical Trialist
1Basic Biostatistics for the Clinical Trialist
- Susan Hilsenbeck, Ph.D.
- Breast Center and Dan L. Duncan Cancer Center at Baylor College of Medicine
- Houston, TX USA
2Overview of Material
- Types of data and summary statistics
- Confidence Intervals
- Tests of hypothesis
- Sample size calculations
3Sample vs Target Population
Do It
Protocol
4Sample vs Target Population
5Design of Clinical Trials: Striking a Balance
- Answer the question (correctly)
- Control risk of errors in conclusions
- Minimize potential harm and maximize potential benefit
- Limit number of participants treated at sub-therapeutic doses
- Limit number of participants treated with ineffective therapy or exposed to toxicity
- Maximize feasibility
- Simple enough to carry out
6Types of Data Typical of Early Phase Trials
Sex, Ethnicity
Freq count
Sex
Gender
Proportion
Tumor location
Performance
Stage
Grade of Tox
Mean, median, etc
7Summary Statistics: Location
8Summary Statistics: Spread
9Graphs as Summary Statistics
10Summary Statistics and Confidence Intervals
- Response rate is a point estimate of the effect of the drug
- Confidence interval gives a range of population response rates that are consistent with the sample data
11Thought Experiment: Catching the real response rate
- Suppose the real response rate for a new therapy is 0.3 (30%)
- Suppose we run a small safety and efficacy clinical trial, and calculate the response rate and a 95% confidence interval for the response rate, over and over and over
- How often will the interval capture the real value?
12True Rate=0.3, N=30, Confidence=95%
95% of CIs contain True Rate
13True Rate=0.3, N=30, Confidence=99.9%
What happens if we want to be more confident?
14True Rate=0.3, N=120, Confidence=95%
What happens if we want to be 95% confident, but we increase the sample size?
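The thought experiment on the last few slides can be tried out directly. The sketch below repeats a hypothetical trial many times with a true response rate of 0.3 and counts how often the interval captures it. The slides do not say which interval formula was used, so the Wilson score interval here is an assumption, as are the function names, the number of repetitions and the random seed.

```python
import random
from statistics import NormalDist


def wilson_interval(successes, n, confidence):
    """Wilson score interval for a binomial proportion (assumed method)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * (p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) ** 0.5 / denom
    return center - half, center + half


def simulate(true_rate, n, confidence, trials=2000, seed=1):
    """Repeat the trial; return (fraction of CIs containing truth, mean width)."""
    rng = random.Random(seed)
    hits = 0
    total_width = 0.0
    for _ in range(trials):
        # One simulated trial: each of n patients responds with prob true_rate
        responders = sum(rng.random() < true_rate for _ in range(n))
        lo, hi = wilson_interval(responders, n, confidence)
        hits += lo <= true_rate <= hi
        total_width += hi - lo
    return hits / trials, total_width / trials


# The three scenarios shown on the slides
results = {}
for n, conf in [(30, 0.95), (30, 0.999), (120, 0.95)]:
    results[(n, conf)] = simulate(0.3, n, conf)
    coverage, width = results[(n, conf)]
    print(f"N={n}, confidence={conf}: coverage={coverage:.3f}, mean width={width:.3f}")
```

Asking for more confidence at the same N buys coverage at the price of a wider interval; increasing N at the same confidence narrows the interval.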
15Making Decisions: Test of Hypothesis
α = probability of Type I error (level of significance)
β = probability of Type II error
1-β = Power
16Hypothesis Testing and Jury Trials
17Hypothesis Testing and Drug Trials
18Type I and Type II Errors
- Common choices
- α = 5%
- β = 20%
- Exploratory study?
- α = 10%
- β = 10%
- Confirmatory study?
- α = 1%
- β = 10%
19Study Paradigm
Hypothesis
20Example of a test of hypothesis
Compare the rate of new breast cancers in Tamoxifen treated and placebo treated subjects over 5 years?

With the margins fixed (1000 subjects per arm, 50 breast cancers in total), the expected count if Ho is true is 25 breast cancers per arm.

Frequency
(Expected)     BRCA    Dis Free    Total
-----------------------------------------
TAM              16       984       1000
               (25)     (975)
-----------------------------------------
Placebo          34       966       1000
               (25)     (975)
-----------------------------------------
Total            50      1950       2000

Test Statistic    DF    Value    P-value
Chi-Square         1     6.65       ?

Hypothetical data representative of Fisher et al, 1998, JNCI 90:1371-1388
21Chi Square Distribution
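[Figure: chi-square distribution. Marked values: 3.84 (P=0.05) and 6.69 (P=0.01). Small values of the statistic: observed data very close to expected. Large values: observed data very different from expected.]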
22Example of a test of hypothesis
Test Statistic    DF    Value    P-value
Chi-Square         1     6.65     0.01

Hypothetical data representative of Fisher et al, 1998, JNCI 90:1371-1388
23What if we double the sample size?
Compare the rate of new breast cancers in Tamoxifen treated and placebo treated subjects over 5 years?

Frequency
(Expected)     BRCA    Dis Free    Total
-----------------------------------------
TAM              32      1968       2000
               (50)    (1950)
-----------------------------------------
Placebo          68      1932       2000
               (50)    (1950)
-----------------------------------------
Total           100      3900       4000

Test Statistic    DF    Value    P-value
Chi-Square         1    13.29    0.0003

Hypothetical data representative of Fisher et al, 1998, JNCI 90:1371-1388
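The chi-square values on these slides can be checked by hand from the observed and expected counts. The sketch below is one way to do that arithmetic using only the Python standard library; the function name is illustrative, and the P-value uses the fact that a chi-square variable with 1 DF is a squared standard normal.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    table is [[a, b], [c, d]] with rows = treatment arms and
    columns = outcome (event, no event).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Upper tail probability of a chi-square distribution with 1 DF
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


stat_1, p_1 = chi_square_2x2([[16, 984], [34, 966]])    # 1000 per arm
stat_2, p_2 = chi_square_2x2([[32, 1968], [68, 1932]])  # 2000 per arm
print(f"N=2000: chi-square={stat_1:.2f}, P={p_1:.4f}")
print(f"N=4000: chi-square={stat_2:.2f}, P={p_2:.4f}")
```

The event rates are identical in both tables; only the sample size differs, and the statistic doubles with it.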
24Chi Square Distribution
[Figure: chi-square distribution. Marked values: 3.84 (P=0.05), 6.69 (P=0.01) and 13.29 (P=0.0003).]
25P-Value
- Descriptive statement: how consistent or inconsistent are the observed data with what we would have expected to see by chance (Ho true)?
- P=0.01 means: IF Ho is true, 1 time in 100 we would get something like this OR something even more inconsistent with Ho
26Effect Size and Confidence Interval
Frequency
(Expected)
Row Pct        BRCA    Dis Free    Total
-----------------------------------------
TAM              16       984       1000
               (25)     (975)
               1.60     98.40
-----------------------------------------
Placebo          34       966       1000
               (25)     (975)
               3.40     96.60
-----------------------------------------
Total            50      1950       2000

RR = 1.6/3.4 = 0.47; 95% CI 0.26 to 0.85
27What if we double the sample?
Frequency
(Expected)
Row Pct        BRCA    Dis Free    Total
-----------------------------------------
TAM              32      1968       2000
               (50)    (1950)
               1.60     98.40
-----------------------------------------
Placebo          68      1932       2000
               (50)    (1950)
               3.40     96.60
-----------------------------------------
Total           100      3900       4000

RR = 0.47; 95% CI 0.31 to 0.71
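The relative risk and its confidence interval can be reproduced from the counts. The slides do not state how the interval was computed; the sketch below assumes the usual large-sample interval on the log scale, which does reproduce the quoted limits. The function name is illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk(events_trt, n_trt, events_ctl, n_ctl, confidence=0.95):
    """Relative risk with a log-scale (Wald) confidence interval."""
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    # Standard error of log(RR)
    se = sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)


rr_1, lo_1, hi_1 = relative_risk(16, 1000, 34, 1000)  # original sample
rr_2, lo_2, hi_2 = relative_risk(32, 2000, 68, 2000)  # doubled sample
print(f"N=2000: RR={rr_1:.2f}, 95% CI {lo_1:.2f} to {hi_1:.2f}")
print(f"N=4000: RR={rr_2:.2f}, 95% CI {lo_2:.2f} to {hi_2:.2f}")
```

Doubling the sample leaves the estimate of effect unchanged and narrows the interval around it, whereas the P-value changed by more than an order of magnitude.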
28P-values and Confidence Intervals
- Before start of trial
- specify α and β errors
- After analysis of trial
- summarize results of testing with p-value
- BUT Small p ≠ Big Effect
- Summarize size of effect with estimate and confidence interval
- Report estimates, confidence intervals and p-values
29When you observe a small P-value
- It means the null hypothesis is unlikely to be true? NO
- It means that the treatment effect is big and clinically important? Not Necessarily
- It means your results are unusual if there is actually NO EFFECT? Yes
Pr(Ho|data) ≠ Pr(data|Ho)
30Planning a Study: Sample Size and Power Analysis
- Sample size calculations estimate the number of patients needed to accomplish study goals
- Power analysis estimates the power to detect specified differences, given a particular sample size
31Break
32Ingredients
- Test: t-test, chi-square test?
- N: sample size (per group?)
- K: imbalance in size of groups
- Effect size (clinically important difference and expected variability)
- α: alpha error rate
- β: beta error rate
- Other: censoring, correlations among variables, ...
33The TTEST Procedure

Summary Statistics
Group     N      Mean    Std Dev
A        15     98.86      10.66
B        15    110.02       9.57

Equality of Variances
Variable       Method      Num DF    Den DF    F Value    Pr > F
assay_value    Folded F        14        14       1.24    0.6933

T-Tests
Variable       Method    Variances    DF    t Value    Pr > |t|
assay_value    Pooled    Equal        28      -3.02      0.0054

The FREQ Procedure

Table of Group by Category
Frequency    High    Low    Total
----------------------------------
A               4     11       15
----------------------------------
B               9      6       15
----------------------------------
Total          13     17       30

Statistic                     DF     Value      Prob
Continuity Adj. Chi-Square     1    2.1719    0.1405

Fisher's Exact Test
Two-sided Pr <= P    0.1394
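The FREQ results above can be checked without SAS. The following is a minimal sketch, using only the Python standard library, of the continuity-adjusted chi-square and the two-sided Fisher's exact test for the Group by Category table. The function names are illustrative, and the two-sided Fisher P-value is taken here as the total probability of all tables no more likely than the observed one, which is an assumption about the definition used.

```python
from math import comb, erfc, sqrt


def continuity_adjusted_chi_square(table):
    """Chi-square with Yates continuity correction for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    numerator = n * max(abs(a * d - b * c) - n / 2, 0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    stat = numerator / denominator
    return stat, erfc(sqrt(stat / 2))  # upper tail, 1 DF


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact P-value for a 2x2 table."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d

    def ways(x):
        # Number of tables with x in the top-left cell, margins fixed
        return comb(col1, x) * comb(n - col1, row1 - x)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    observed = ways(a)
    # Integer counts avoid floating-point ties when comparing tables
    extreme = sum(ways(x) for x in range(lowest, highest + 1) if ways(x) <= observed)
    return extreme / comb(n, row1)


group_by_category = [[4, 11], [9, 6]]  # rows: A, B; columns: High, Low
chi2_adj, p_adj = continuity_adjusted_chi_square(group_by_category)
p_fisher = fisher_exact_two_sided(group_by_category)
print(f"Continuity adj. chi-square = {chi2_adj:.4f}, P = {p_adj:.4f}")
print(f"Fisher's exact two-sided P = {p_fisher:.4f}")
```

The same 30 subjects give P = 0.0054 when the assay value is analyzed as a continuous measurement and roughly P = 0.14 when it is dichotomized into High and Low.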
34Relationships between Ingredients
35Example 1
- Suppose you want to compare the average test scores among research fellows following two different training programs: a web-based self-paced course, and an intensive course at a plush resort. Based on previous experience, you expect the web course students to score about 75 (standard deviation = 10), and you hope that the one-on-one teaching at the resort will result in a 6 point improvement. This comparison will provide objective evidence to justify funding future courses.
- You plan to compare the test scores at the α = 5% level of significance, and you want the study to have 90% power to detect this difference.
- How many students do you need to study?
36Brief Classification of Tests
From Hulley and Cummings, 1988
37E=6, S=10, α=0.05, Power=90%
From Hulley and Cummings, 1988
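Where a table is not at hand, the answer to Example 1 can be approximated with the standard normal-approximation formula for comparing two means, n per group = 2 (z_α + z_β)² (S/E)². The sketch below is not taken from the slides; the function name is illustrative, and a two-sided test is assumed unless stated otherwise. Tables based on the t distribution may give a slightly larger number.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_means(effect, sd, alpha=0.05, power=0.90, two_sided=True):
    """Approximate per-group sample size for comparing two means."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)


# Example 1: E = 6 points, S = 10, alpha = 0.05, power = 90%
n_two_sided = n_per_group_means(effect=6, sd=10)
n_one_sided = n_per_group_means(effect=6, sd=10, two_sided=False)
print(f"Two-sided: about {n_two_sided} students per course")
print(f"One-sided: about {n_one_sided} students per course")
```

Only the ratio E/S enters the formula, so halving the difference to be detected roughly quadruples the number of students needed.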
38Phase I
- Design
- Small sample size (10-40)
- Escalating/de-escalating dose
- Usually route and schedule fixed
- Nonrandomized
- Questions
- Safe dose for further study (MTD, MED, OBD)?
- Toxicity profile? (Hematopoietic, GI, CNS)
- Hints of efficacy?
- Pharmacologic profile? (AUC, half-life, etc)
- Endpoints - toxicity, change in biomarker,
response
39Phase I Designs
- 3+3
- Modified Continual Reassessment Method
- Accelerated titration
- Other
- Pharmacologically guided
- Storer Up and Down
- Escalation with overdose control (EWOC)
- Various Bayesian
40Dose Response
In theory, this idea could be used to home in on the optimal dose for any outcome, but dose/toxicity is assumed to be monotonic increasing; dose/target modulation may not be monotonic.
[Figure: dose-response curve. x-axis: Dose (mg/m2), 0 to 10. y-axis: Probability of DLT, 0.0 to 1.0. MTD marked where P(DLT)=0.3. Annotations: 100, 67, 50, 33, 33.]
41Dose Response
[Figure: dose-response curve. x-axis: Dose (mg/m2), 0 to 10. y-axis: Probability of DLT, 0.0 to 1.0. P(DLT)=0.3 marked. Annotations: 100, 67, 50, 33, 33.]
42Hypothetical 3+3
Define MTD: highest dose with Pr(DLT) < 30%

Level    True Pr(DLT)    Cohort
1        <1%             3/0
2        <1%             3/0
3        4%              3/1, 3/0
4        13%             3/0, 3/0    MTD. Expand?
5        54%             3/2
6        90%

3+3 picks a dose, but does not give any precision for estimate of MTD
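The lack of precision can be made visible by running the hypothetical trial many times. The sketch below simulates a simplified 3+3 rule with the true DLT probabilities from the table, reading a cohort entry such as 3/1 as three patients treated and one DLT. Assumptions: "<1%" is taken as 0.5%, the dose below the one that stops escalation is declared the MTD without expanding it to six patients, and the function name, number of repetitions and seed are illustrative.

```python
import random
from collections import Counter


def run_three_plus_three(true_dlt, rng):
    """Simplified 3+3. Returns the declared MTD level (1-based), 0 if none."""
    level = 0
    while True:
        # First cohort of three patients at this dose level
        dlts = sum(rng.random() < true_dlt[level] for _ in range(3))
        if dlts == 1:
            # Exactly one DLT: treat three more at the same level
            dlts += sum(rng.random() < true_dlt[level] for _ in range(3))
        if dlts >= 2:
            return level  # too toxic: declare the level below (1-based)
        if level == len(true_dlt) - 1:
            return level + 1  # top level tolerated
        level += 1  # 0/3 or 1/6 DLTs: escalate


# True Pr(DLT) for levels 1 to 6; "<1%" assumed to be 0.005
true_dlt = [0.005, 0.005, 0.04, 0.13, 0.54, 0.90]
rng = random.Random(2024)
trials = 5000
declared = Counter(run_three_plus_three(true_dlt, rng) for _ in range(trials))
for level in sorted(declared):
    print(f"Declared MTD = level {level}: {declared[level] / trials:.1%} of trials")
```

Identical trials run on the same drug do not all land on the same dose, and a single 3+3 trial gives no indication of how far off its choice might be.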
43Continual Reassessment Method and Modified CRM
- Designed to treat more patients near therapeutic doses
- Original CRM
- Begin with a prior guess as to dose-response curve and MTD
- Treat a patient at near MTD, observe DLT or not
- Update and choose new dose near MTD
- Treat next patient and repeat
- Modified to improve safety, but increases N
- Start at the traditionally determined 1st level
- Treat several patients in a cohort (2 or 3?)
- Don't skip doses
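The "update and choose new dose" step can be sketched in a few lines. The version below is only one common way to set up a CRM and is not taken from the slides: it assumes a one-parameter power model, Pr(DLT at dose d) = skeleton[d] ** exp(a), a normal prior on a, and a simple grid approximation to the posterior. The skeleton values, the prior standard deviation and the function name are all hypothetical, and the safety modifications (cohorts, no dose skipping) are not implemented.

```python
from math import exp


def crm_update(skeleton, observations, target=0.30, prior_sd=1.16):
    """One CRM update.

    skeleton: prior guesses of Pr(DLT) at each dose (increasing).
    observations: list of (dose_index, had_dlt) for patients so far.
    Returns (recommended dose index, posterior mean Pr(DLT) per dose).
    """
    grid = [-4.0 + 8.0 * k / 400 for k in range(401)]
    weights = []
    for a in grid:
        # Unnormalized posterior: normal prior times binomial likelihood
        w = exp(-a * a / (2 * prior_sd ** 2))
        for dose, had_dlt in observations:
            p = skeleton[dose] ** exp(a)
            w *= p if had_dlt else 1.0 - p
        weights.append(w)
    total = sum(weights)
    posterior = [
        sum(w * s ** exp(a) for a, w in zip(grid, weights)) / total
        for s in skeleton
    ]
    # Next dose: the one whose estimated Pr(DLT) is closest to the target
    recommended = min(range(len(skeleton)), key=lambda d: abs(posterior[d] - target))
    return recommended, posterior


skeleton = [0.05, 0.10, 0.20, 0.30, 0.50]  # hypothetical prior guesses
dose_after_no_dlt, post_no_dlt = crm_update(skeleton, [(2, False)] * 3)
dose_after_dlts, post_dlts = crm_update(skeleton, [(2, True), (2, True), (2, False)])
print("After 0/3 DLTs at dose index 2, next dose index:", dose_after_no_dlt)
print("After 2/3 DLTs at dose index 2, next dose index:", dose_after_dlts)
```

Unlike the 3+3, the model yields an estimated Pr(DLT) at every dose, which is why the method needs a statistician and software but can also report how uncertain the MTD is.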
44Why choose one design over another?
- Most commonly used design is still 3+3
- Is there support for complex design?
- 3+3 easy to implement
- CRM may be better but requires statistician and special software
- Is drug class and toxicity profile already well-known?
- Prior dose-response curve known, rapid escalation
- Biologically targeted agents that require expanded cohorts to estimate target modulation?
45Phase II
- Design
- Moderate sample size (20-100)
- Defined treatment and population
- Nonrandomized (usually)
- Test of hypothesis
- Questions
- Efficacy clinically interesting?
- Toxicity profile acceptable?
- Endpoints: response, TTP, RFS, toxicity, change in biomarker
46Wide Variety of Phase II Designs
- Single stage
- Two-stage (multi-stage)
- Simon Minimax
- Simon Optimal
- Other admissible
- Multiple outcomes efficacy vs toxicity
- Bryant-Day
- Bayesian trade-off
- Randomized Phase II
- Other (Non-Cytotoxic agents?)
- Mick-Ratain paired TTP
- Randomized discontinuation
47Example 2: Single Arm
- Suppose we are planning a Phase II study of a new treatment, theobromococanib, a small molecule inhibitor of CHLTR, the nearly ubiquitously expressed chocolate receptor
- Outcome of interest is 6 month PFS
- If the rate is low (P0 = 10%), then we want probability of keeping the drug < 5% (Type I)
- If the rate is high (P1 = 30%), then we want probability of discarding the drug < 20% (Type II)
48Comparison of Designs: Phase II Trial of theobromococanib
Single Stage: P0=10%, P1=30%, α=5%, β=20%; N=25, R=5; EN=25
- P0: Unacceptably low response rate of bad drug
- P1: Acceptably high response rate of good drug
- α: Risk of keeping a bad drug
- β: Risk of missing a good drug
- R: Conclude drug is bad
- EN: Expected sample size if drug is bad
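The error rates of the single-stage design can be checked with exact binomial probabilities. The sketch below assumes the decision rule is "conclude the drug is bad if R or fewer of the N patients respond", which is the reading consistent with the stated α and β; the function name is illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


n, r = 25, 5          # design from the slide
p0, p1 = 0.10, 0.30   # bad-drug and good-drug response rates

# Type I error: a bad drug produces more than r responses and is kept
alpha = 1 - binom_cdf(r, n, p0)
# Type II error: a good drug produces r or fewer responses and is discarded
beta = binom_cdf(r, n, p1)
print(f"alpha = {alpha:.4f} (target < 0.05)")
print(f"beta  = {beta:.4f} (target < 0.20), power = {1 - beta:.4f}")
```

Because the number of responses is discrete, the achieved error rates fall somewhat below the nominal targets rather than matching them exactly.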
49P0=0.10 vs P1=0.30, α=5%, Power=80%
Two-stage design
Designs for p1 - p0 = 0.20

                             Reject drug if
                             response rate
p0      p1      α       β       ≤r1/n1    ≤r/n    EN(p0)    PET(p0)
0.05    0.25    0.10    0.10    0/9       2/24    14.5      0.63
                0.05    0.20    0/9       2/17    12.0      0.63
                0.05    0.10    0/9       3/30    16.8      0.63
0.10    0.30    0.10    0.10    1/12      5/35    19.8      0.65
                0.05    0.20    1/10      5/29    15.0      0.74
                0.05    0.10    2/18      6/36    22.5      0.71

(Example 1 in Appendix)
50Comparison of Designs: Phase II Trial of theobromococanib
Single Stage: P0=10%, P1=30%, α=5%, β=20%; N=25, R=5; EN=25
Optimal Two Stage: P0=10%, P1=30%, α=5%, β=20%; Stage 1: N1=10, R1=1; Stage 2: N=29, R=5; EN=15
Bryant-Day: P0=10%, P1=30%, α=5%, β=20%, αT=5%, P0nt=60%, P1nt=80%; Stage 1: N1=12, R1=0, NT1=7; Stage 2: N=33, R=3, NT=22; EN=20.9
- P0: Unacceptably low response rate of bad drug
- P1: Acceptably high response rate of good drug
- α: Risk of keeping a bad drug
- β: Risk of missing a good drug
- αT: Risk of keeping a toxic drug
- P0nt: Unacceptably low rate of nontoxicity
- P1nt: Acceptably high rate of nontoxicity
- R: Conclude drug is bad
- EN: Expected sample size if drug is bad
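The operating characteristics of the optimal two-stage design (10 patients, stop if 1 or fewer respond; otherwise continue to 29 and conclude the drug is bad if 5 or fewer respond) can be computed exactly. The sketch below is a plain binomial calculation rather than a search for the optimal design; the function names are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def two_stage(n1, r1, n, r, p):
    """Return (P(drug kept), P(early termination), expected N) at true rate p."""
    n2 = n - n1
    # Stop after stage 1 if r1 or fewer responses
    pet = sum(binom_pmf(x, n1, p) for x in range(r1 + 1))
    kept = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        needed = r + 1 - x1  # stage 2 responses needed to exceed r in total
        if needed <= 0:
            tail = 1.0
        else:
            tail = sum(binom_pmf(x2, n2, p) for x2 in range(needed, n2 + 1))
        kept += binom_pmf(x1, n1, p) * tail
    expected_n = n1 + (1 - pet) * n2
    return kept, pet, expected_n


alpha, pet_p0, en_p0 = two_stage(n1=10, r1=1, n=29, r=5, p=0.10)
power, pet_p1, en_p1 = two_stage(n1=10, r1=1, n=29, r=5, p=0.30)
print(f"Bad drug (10%):  alpha = {alpha:.3f}, PET = {pet_p0:.2f}, EN = {en_p0:.1f}")
print(f"Good drug (30%): power = {power:.3f}, PET = {pet_p1:.2f}, EN = {en_p1:.1f}")
```

The maximum sample size (29) is larger than for the single-stage design (25), but when the drug is bad the trial usually stops at 10 patients, which is what brings the expected sample size down to about 15.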
51P1=0.1 vs P2=0.3, α=5% one-tailed, Power=80%
The Awful Truth about Comparative Trials! Entire
study will be not just 2 times, but nearly 4
times as big as single arm.
From Hulley and Cummings, 1988 (Chi square
without continuity correction)
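The "nearly 4 times" figure can be approximated with a standard normal-approximation formula for comparing two proportions without continuity correction. The sketch below is not the table's own calculation, so its answer may differ from the tabulated value by a patient or two; the function name is illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm_proportions(p1, p2, alpha=0.05, power=0.80, two_sided=False):
    """Approximate per-arm sample size for comparing two proportions."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


n_arm = n_per_arm_proportions(0.1, 0.3)  # one-tailed alpha = 5%, power = 80%
total = 2 * n_arm
single_arm_n = 25  # single-stage single-arm design for the same rates
print(f"Per arm: {n_arm}, total: {total}")
print(f"Ratio to single-arm N={single_arm_n}: {total / single_arm_n:.1f}")
```

The extra factor of two beyond the second arm arises because the control rate is now estimated with error instead of being treated as a known constant.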
52Randomized Phase II: Popular but sometimes misused
- NOT a cheap Phase III
- No power to compare arms
- Original RP2
- Pick winner of two or more treatment variations (i.e. schedules, drug analogs)
- Each arm can stop early
- If there is a difference (pre-specified) the better drug will win with high probability
- If there is NO difference outcome is like coin flip
- Parallel Phase IIs, Adaptive Randomization
53Why choose one design over another?
- Simon Two-stage designs probably most common
- If toxicity is a concern?
- Efficacy/Toxicity design
- If outcome can be evaluated in short term?
- 1 vs 2 stage
- If there are competing schedules, analogs, formulations?
- Randomized Phase II or Adaptive design
- Is there support for complex design?
54Two-sided versus one-sided tests
- Two-tailed test: any difference
- One-tailed test: specific direction of difference
- Same power from smaller sample size, but
- Only appropriate when ONLY ONE direction is important or biologically meaningful
55Special Considerations
- Equivalence
- Interim testing (multi-stage)
- Complex designs, no tables
56What have we done?
- Summary statistics
- Confidence intervals
- Tests of hypotheses
- Sample size calculations
57Questions?