Title: EPI 260 Statistics in Phase II Clinical Trials
1. EPI 260: Statistics in Phase II Clinical Trials
Jimmy Hwang, Ph.D., Biostatistics Core, Cancer Center, UC San Francisco
April 29, 2010
2. Early Phase Clinical Development: Phase II Studies Statistics (in syllabus)
- Purpose of Phase II clinical studies
- Phase II study design
  - Formulation of testable hypotheses
  - Determine the study endpoints and when they will be evaluated
  - Define the population to be studied
  - Select the appropriate study design
- Determine the required sample size by making assumptions about the extent of benefit to be achieved with the new treatment and the acceptable errors in making a final decision about whether the null hypothesis can be rejected
- Methods for statistical analysis
3. Four types of trial designs (1)
- Phase I: pharmacologically oriented
  - The safe dose range
  - The side effects
  - How the body copes with the drug
  - If the treatment shrinks cancer
- Phase II: preliminary evidence of efficacy and safety
  - If the new treatment works well enough to test in phase III
  - Which types of cancer it is effective against
  - More about side effects and how to manage them
  - More about the most effective dose to use
4. Four types of trial designs (2)
- Phase III: new treatments are compared with the best currently available treatment (the standard treatment)
  - A completely new treatment vs. the standard treatment
  - Different doses or ways of giving a standard treatment
  - A new radiotherapy schedule vs. the standard one
- Phase IV: post-marketing surveillance
  - More about the side effects and safety of the drug
  - What the long-term risks and benefits are
  - How well the drug works when it is used more widely than in clinical trials
5. Statistical Considerations
- Define the clinical question (objectives)
- Study development and protocol development
  - Type of study (pilot, clinical trial, observational, etc.)
  - Endpoints (feasibility and appropriateness)
  - Protocol development (objectives, aims, statistical design, patient selection, data collection procedures, number of points, stopping rules and interim analysis, statistical endpoints, analysis plan, sample size)
- During the study: randomization, data quality control, interim analysis and/or monitoring of patient safety
- Study finishing: data lock, data analysis and interpretation, assisting decisions for follow-up studies, and preparation of papers and presentations
6. Statistical Perspectives
- The philosophy of inference divides statisticians: frequentist versus Bayesian
- Statistical procedures are not standardized.
- Things to consider
  - Randomization
  - Intent-to-treat design
  - Unbalanced groups
  - Stratification
  - Large-scale and small clinical trials, meta-analysis
  - Adjusted or weighted analysis
- Trials can provide confirmatory evidence.
- Other methods are valid for making clinical inferences.
7. Basic Question
- Clinical reasoning requires generalizing from individual patients.
- Statistical reasoning emphasizes inference based on structured data processing.
- Which treatment is safer and better?
8. Benefit could be defined as
- Antitumor activity
- Safety
- The pharmacokinetics or pharmacodynamics
- The biologic correlates which may predict response or resistance to treatment and/or toxicity
9. Intent-to-treat (ITT) Principle
- Unlike animal studies, the investigator cannot dictate what a participant should do in a clinical trial.
- A participant may forget to take the pills, receive a dose reduction due to toxicity, drop out of the study at any point, or be lost to follow-up.
- Use only full compliers? Use all subjects?
- ITT compares intervention strategies, not interventions.
10. Standards of Ethical Conduct
- The study participants must give voluntary consent.
- There must be no reasonable alternative to conducting the experiment.
- The anticipated results must have a basis in biological knowledge and animal experimentation.
- The procedures should avoid unnecessary suffering and injury.
- There is no expectation of death or disability as a result of the trial.
11. Standards of Ethical Conduct
- The degree of risk for the patient is consistent with the humanitarian importance of the study.
- The subjects are protected against even a remote possibility of death or injury.
- The study must be conducted by qualified scientists.
- The subject can stop participation at will.
- The investigator has an obligation to terminate the experiment if injury seems likely.
12. Study Protocol
- Every well-designed study requires a protocol.
- The protocol is a written agreement between investigators, participants, and the scientific community.
- The protocol is a comprehensive operational manual. It specifies the standard operating procedures (SOPs).
13. Defining study questions
- Each clinical trial must have a primary question.
- The primary question, as well as any secondary or subsidiary questions, should be carefully selected, clearly defined, and stated in advance.
- Selection of the questions
  - Primary and secondary objectives
  - Interventions
  - Response variables
  - Surrogate endpoints, biomarkers
14. Primary Objective
- Define the one question the investigators are most interested in answering and that is capable of being adequately answered.
- Define the primary endpoint
  - Toxicity, efficacy (response/survival), QOL
- Define the type of study
  - Hypothesis testing or estimation
  - Superiority or equivalence trial
- The sample size is based on the primary objective.
15. Secondary Objectives
- Different endpoints
- Subgroup hypotheses
- Prospectively defined
- Based on reasonable expectations
- Limited in number
- Hypothesis testing vs. hypothesis generating
- Hunting expedition vs. fishing expedition
- Multiplicity Issues
16. What Study Aims Tell You
- Type of study / general design
  - (pilot, phase I, II, or III; study arms)
- Who is eligible
- Outcome measure
  - (e.g., toxicity, response, duration, biomarker)
- When the outcome will be evaluated
  - (timing of evaluations)
17. Interim Analysis: Why?
- Many trials require a large N and/or long duration.
- Interim analysis can result in more efficient designs, and the correct conclusion can be reached sooner.
- Ethical considerations
- The pace of scientific advancement demands learning from the observed data.
- Public health concerns, pressure from activists
- Requirements from the IRB and other regulatory agencies
18. Interim Analysis: Factors to Consider before Early Termination
- Possible differences in prognostic factors among arms
- Bias in assessing response variables
- Impact of missing data
- Differential concomitant treatment or adherence
- Differential side effects
- Secondary outcomes
- Internal consistency
- External consistency, other trials
19. Interim Analysis: Reasons for early stopping
- Efficacy: treatments are convincingly different or not different (as judged by impartial, knowledgeable experts)
- Toxicity: serious adverse events, side effects, or toxicity are too severe (outweigh the potential benefits)
- Futility: a significant difference at the end of the trial is unlikely
- Data are of poor quality
- Accrual is too slow to complete the trial in a timely fashion
- More information becomes available outside the study (unnecessary or unethical to continue)
- Scientific questions are no longer important
- Poor adherence (preventing answers to the basic question)
- Resources for the study are lost or no longer available
- Fraud or misconduct undermines study integrity
20. Interim Analysis: To Stop or Not To Stop?
- How sure?
  - Is the evidence strong enough, or is it just due to stochastic variation, imbalance in covariates, or other factors?
- Wrongly stopping for efficacy: false positive
  - False claim that the drug is active
  - Waste of time and money in future development
- Wrongly stopping for futility: false negative
  - Kill a promising drug
- Group ethics vs. individual ethics
21. Data Assessment: Reasons for Noncompliance
- Toxicity or side effects
- Interventions involving lifestyle or behavior change
- Complex or inconvenient interventions
- Insufficient understanding, or lack of understanding, of the instructions
- Change of mind, refusal
- Lack of family support
- If noncompliance is treatment dependent, it will result in biased data
22. Data Assessment: Non-adherence
- Includes non- or partial compliers, drop-ins, and drop-outs
- Could be due to toxicity, lack of efficacy, or refusal
- Need to compare the non-adherence rate between arms
- Exclude from the analysis
  - Rationale: patients not taking the medication will not benefit from it
  - Compares the optimal intervention vs. control
  - Can lead to biased results
- Include in the analysis
  - Intent-to-treat (ITT) principle
  - Power is reduced, but there is also less bias
  - More relevant for generalizing study results to the real-world setting
- Do both: sensitivity analysis
23. Data Assessment: Poor Quality or Missing Data
- Missing visits may or may not be due to outcomes related to treatment, such as the patient's health status
- Informative or non-informative missingness
  - Missing completely at random
  - Missing at random (missingness does not depend on unobserved values)
  - Not missing at random
- Available methods
  - Complete case analysis
  - Last value carried forward (see the sketch after this list)
  - Single imputation
  - Multiple imputation
  - Sensitivity analysis
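As a small, hypothetical illustration (the patient IDs, visit numbers, and scores below are made up, not data from the lecture), last value carried forward simply fills a missing visit with the patient's most recent observed value:

```python
# Hypothetical illustration of last-value-carried-forward (LOCF) for missing visits.
import numpy as np
import pandas as pd

visits = pd.DataFrame({
    "patient": ["A", "A", "A", "B", "B", "B"],
    "visit":   [1, 2, 3, 1, 2, 3],
    "score":   [10.0, np.nan, np.nan, 7.0, 8.0, np.nan],
})

# Within each patient, carry the last observed score forward into later missing visits.
visits = visits.sort_values(["patient", "visit"])
visits["score_locf"] = visits.groupby("patient")["score"].ffill()
print(visits)
```

LOCF is a single-imputation method; if missingness is informative (for example, drop-out due to toxicity), it can still yield biased results, which is why a sensitivity analysis is recommended.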
24. Defining Response Variables
- Dose-limiting toxicities (DLTs), complications
- Response, incidence of a disease, total mortality, death from a specific cause
- Overall survival, time to progression, time to cancer
- Blood pressure, biomarkers, PSA, CD4 count
- Quality of life
- Cost and ease of administering the intervention
- In general, a single response variable should be identified to answer the primary question.
25. Defining Response Variables
- Define the questions prospectively and specifically
  - Example: the study drug can increase the response rate (PR+CR) from 25% to 50% in patients with a certain cancer
- The primary response variable can be assessed in all participants and as completely as possible
  - Informative drop-out or loss to follow-up due to toxicity
- Participation generally ends when the primary response variable occurs
  - Off-drug, off-study, extended follow-up
- Response variables should be assessed without bias and precisely
  - Hard, objective endpoints vs. soft, subjective endpoints
  - Standardization of evaluation, central lab, and pre-trial training
26. Scales of measurement
- Nominal
- Ordinal
- Interval
- Ratio
27. Statistical Methods for Categorical Data
Goal / Analysis
- Describe one group: proportion
- Compare one group to a hypothetical value: chi-square test
- Compare two unpaired groups: chi-square test
- Compare two paired groups: McNemar's test
- Compare three or more unmatched groups: chi-square test
- Model the effect of multiple prognostic variables: logistic regression
- When the sample size is small, use Fisher's exact test (see the sketch below)
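A short sketch with made-up counts (not data from the lecture) showing how the chi-square and Fisher's exact tests for a 2x2 table are typically run with scipy:

```python
# Hypothetical 2x2 table: responders vs. non-responders in two groups.
from scipy.stats import chi2_contingency, fisher_exact

table = [[18, 12],   # group 1: 18 responders, 12 non-responders
         [ 9, 21]]   # group 2:  9 responders, 21 non-responders

chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)   # preferred when expected counts are small

print(f"chi-square p = {p_chi2:.3f}, Fisher's exact p = {p_fisher:.3f}")
```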
28. Statistical Methods for Continuous Data
Goal / Analysis
- Describe one group: mean, SD
- Compare one group to a hypothetical value: one-sample t-test
- Compare two unpaired groups: two-sample t-test
- Compare paired data: paired t-test
- Compare three or more unmatched groups: one-way ANOVA
29. Statistical Methods for Non-Parametric Data
Goal / Analysis
- Describe one group: median, percentiles
- Compare one group to a hypothetical value: signed-rank test
- Compare two unpaired groups: Mann-Whitney test (Wilcoxon rank-sum test)
- Compare paired data: signed-rank test
- Compare three or more unmatched groups: Kruskal-Wallis test
30. Statistical Methods for Survival Data
Goal / Analysis
- Describe one group: Kaplan-Meier
- Compare two unpaired groups: log-rank test (see the sketch after this list)
- Compare three or more unmatched groups / continuous risk factors: Cox regression
- Model the effect of multiple prognostic factors: Cox regression
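A minimal sketch of the Kaplan-Meier and log-rank analyses, assuming the lifelines package is available (the survival times and event indicators below are made up for illustration):

```python
# Hypothetical survival times (months) and event indicators (1 = event, 0 = censored).
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

time_a, event_a = [6, 9, 12, 15, 20, 24, 30], [1, 1, 1, 0, 1, 0, 1]
time_b, event_b = [4, 7, 8, 10, 13, 18, 22],  [1, 1, 1, 1, 0, 1, 1]

km = KaplanMeierFitter()
km.fit(time_a, event_observed=event_a, label="arm A")   # Kaplan-Meier estimate, arm A
print("Median survival, arm A:", km.median_survival_time_)

result = logrank_test(time_a, time_b,
                      event_observed_A=event_a, event_observed_B=event_b)
print("Log-rank p-value:", result.p_value)
```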
31. Samples and Population
- Research findings are based on samples drawn from populations.
- Inferential statistics allow us to infer what the population is like, based on sample data.
- The population is the defined group of individuals from which a sample is drawn.
- The sample should closely reflect the population; otherwise there is sampling bias.
32. Sampling
- The process of choosing members of a population to be included in the sample
- Research uses data from a sample to make inferences about a population.
33. Variability
- How much do scores vary about the average?
- Variance = (sum of squared deviations of each score from the mean) / (n - 1)
- Variance is small when scores are close to the mean
- Standard deviation = square root of the variance
34. Within-group variability
- Variability within groups is measured by the variance divided by the sample size
- Tells us how far individual scores deviate from the group mean
- This reflects "error"
- The number becomes lower with increasing sample size
35. Two Group Means
- Ask samples of males and females about their number of doctor visits during the past year
- Suppose the mean for males is 1.3 and the mean for females is 2.1
36. Do males and females differ?
- Is the mean number for males different from the mean number for females?
- Obviously, the sample means are different
- Can we infer that the population means differ as well?
37. What's the Problem?
- The difference observed in the samples may be real
- However, the difference could just reflect chance: there is always a margin of error around the sample value
38. Hypothesis Testing
- α = Type I error (level of significance); a false positive. 1 - α = specificity.
- β = Type II error; a false negative. 1 - β = power (sensitivity).
- There is an inverse relationship between α and β for a given sample size.
- Sample size calculation: find N such that α and β are both under control. Typically, compute N for a given α to yield (1 - β) × 100% power. For example, compute N for α = 0.05 to yield 80% power.
39. Null and Research Hypotheses
- Null hypothesis H0
  - Population means are in fact equal
  - Any mean difference observed in the samples reflects the margin of error
  - The "straw man," or what you want to reject
  - Any observed deviation from what we expect to see is due to chance variability
- Research hypothesis H1
  - Population means are not equal
  - The mean difference observed is real
  - The claim, or what you want to accept or test
40. Alternative Hypotheses H1
Is the "new" treatment:
- Different from the standard? (2-sided)
- Better than the standard? (1-sided, directional)
- Not different from the standard? (equivalence)
- Not worse than the standard? (non-inferiority)
41. Hypothesis testing
- Problem: determine whether or not the population means of two groups of subjects truly differ with respect to the outcome of interest.
- Solution: assume that the two groups do not differ, and see if the sample data disagree with this assumption. That is, perform a hypothesis test.
42. Hypothesis testing (cont'd)
- The null hypothesis assumes that there is no difference in outcome between the two groups.
- The alternative hypothesis assumes that one group has a more favorable outcome than the other.
- The research hypothesis is usually the alternative hypothesis.
43. Hypothesis testing (cont'd)
- To do a hypothesis test:
  - Calculate a test statistic from the data.
  - Determine whether the value of the test statistic is likely or unlikely under the null hypothesis.
  - If the value is very unlikely, reject the null hypothesis.
44. Hypothesis testing (cont'd)
- Problem: we might reject the null hypothesis when it is true.
- That is, we might commit a Type I error.
- Solution: construct the test so that there is only a 5% chance of incorrectly rejecting the null hypothesis.
- That is, the level of the test (alpha) is 0.05.
45. Type I Error
- The chance of rejecting a NULL which is true is α; this type of mistake is called a Type I error or false positive
- Reject the null hypothesis when it is true
- The likelihood is set by the alpha-level decision rule (usually .05)
- 5% is a reasonably low probability of being wrong, but it could be set lower
- For early phase II trials, we often use more liberal Type I errors so as not to miss potentially active treatments
- In medical contexts, the specificity of a test is the chance that the test result is negative given that the subject is negative; this is just 1 - α
46. P < .05
- The alpha level for rejecting the null hypothesis is conventionally set at .05
- The obtained sample data are inconsistent with what the null hypothesis expects
- Reject the null hypothesis and therefore accept the research hypothesis
- Therefore, conclude that the obtained difference in means is statistically significant
47. Type II Error
- Incorrectly accepting the null hypothesis when there really is a difference
- The chance of not rejecting a NULL which is false is β; this type of mistake is called a Type II error or a false negative
- In medical contexts, the sensitivity of a test is the chance that the test result is positive given that the subject is positive; this is just 1 - β, also called power
48. Power
- Probability of correctly rejecting the null hypothesis
- Power = 1 - β
- Power is higher with (see the sketch below):
  - Large sample size
  - Large difference between group means
  - Low within-group variability
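A minimal sketch (my own illustration using a two-sample normal approximation with made-up numbers, not a formula from the lecture) of how these three factors drive power:

```python
# Approximate power of a two-sided, two-sample z-test for a difference in means.
from scipy.stats import norm

def approx_power(delta, sigma, n_per_group, alpha=0.05):
    """Normal-approximation power for detecting a true mean difference `delta`
    with common within-group SD `sigma` and n_per_group subjects per arm."""
    se = sigma * (2.0 / n_per_group) ** 0.5       # SE of the difference in means
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - abs(delta) / se)

# Power rises with a larger n, a larger difference, or a smaller within-group SD.
print(approx_power(delta=0.8, sigma=2.0, n_per_group=50))    # baseline
print(approx_power(delta=0.8, sigma=2.0, n_per_group=100))   # larger sample
print(approx_power(delta=0.4, sigma=2.0, n_per_group=100))   # smaller difference
```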
49. What is a p-value?
- The p-value is the probability of obtaining data as extreme as the observed result when the null hypothesis is true.
- That is, the p-value measures the strength of the evidence against the null hypothesis.
- For a level 0.05 test, we reject the null hypothesis if the p-value is 0.05 or less.
- Smaller p-values mean stronger evidence against H0.
- Statistical significance vs. clinical significance
  - In large samples, small differences may be significant
  - In small samples, large differences may not be significant
- Frequentist inference depends on the sample space, i.e., the design.
50. What is a p-value?
- Decide whether or not to reject the NULL hypothesis H0 based on the chance of obtaining a test statistic (TS) as or more extreme (as far away from what we expected, or even farther, in the direction of the ALT) than the one we got, ASSUMING THE NULL IS TRUE
- The likelihood of observing the same outcome or one more extreme if the study were carried out again
- This chance is called the observed significance level, or p-value
- A TS with a p-value less than some prespecified false positive level (or size) α is said to be statistically significant at that level
51. What is a p-value?
- The interpretation of a p-value is a little tricky. In particular, it does NOT tell us the probability that the NULL hypothesis is true
- The p-value represents the chance that we would see a difference as big as we saw (or bigger) if there were really nothing happening other than chance variability
- Example: p = 0.08 means that 8 times out of 100, the same result or one more extreme would occur due to chance alone
- A single convenient number giving a measure of the degree of surprise which the experiment should cause a believer of the null hypothesis
52. Judging a p-value
- 0.01 to 0.05: the results are significant.
- 0.001 to 0.01: the results are highly significant.
- < 0.001: the results are very highly significant.
- > 0.05: the results are not statistically significant.
- 0.05 to 0.10: a trend toward statistical significance.
53. Statistical Significance Tests
- Significance tests provide a way of making a decision about the population means
- There are many such tests, used for different types of data, but all use the same logic
54. Test statistic
- Measure how far the observed data are from what is expected assuming the NULL (H0) by computing the value of a test statistic (TS) from the data
- The particular TS computed depends on the parameter
- For example, to test the population mean µ, the TS is the sample mean (or standardized sample mean)
55. Example
- An experiment is conducted to study the effect of exercise on the reduction of the cholesterol level in slightly obese patients considered to be at risk for heart attack. 80 patients are put on a specified exercise plan while maintaining a normal diet. At the end of 4 weeks the change in cholesterol level will be noted. It is thought that the program will reduce the average cholesterol reading by more than 25 points.
- Data
  - sample mean = 27
  - sample SD = 18
56. Steps in hypothesis testing (I)
1. Identify the population parameter being tested (i.e., a population mean). Here, the parameter being tested is the population mean cholesterol reading µ.
2. Formulate the NULL (H0) and ALT (H1) hypotheses:
   H0: µ = 25 (or µ ≤ 25)
   H1: µ > 25
3. Compute the test statistic (TS):
   t = (27 - 25) / (18 / √80) = 0.99
57. Steps in hypothesis testing (II)
4. Compute the p-value.
   Here, p = P(T79 > 0.99) = 0.16
5. (Optional) Decision rule:
   REJECT H0 if the p-value ≤ α.
   (This is a type of argument by contradiction.) A typical value of α is .05, but there's no law that it needs to be. If we use .05, the decision here will be:
   DO NOT REJECT H0
(A small numerical check of this calculation follows below.)
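```python
# One-sided, one-sample t-test computed from summary statistics (slide 55 example).
from math import sqrt
from scipy.stats import t

n, xbar, sd, mu0 = 80, 27.0, 18.0, 25.0
t_stat = (xbar - mu0) / (sd / sqrt(n))      # about 0.99
p_value = t.sf(t_stat, df=n - 1)            # P(T_79 > t_stat), about 0.16

print(f"t = {t_stat:.2f}, one-sided p = {p_value:.2f}")
```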
58. Summary
- Hypotheses
  - Null: the new drug doesn't work
  - Alternative: the new drug works
- Decisions
  - New drug works: correctly reject H0 (power)
  - Abandon the new drug: correctly don't reject H0
  - Proceed with an ineffective drug: Type I error
  - Abandon a drug that might work: Type II error
59. Pitfalls in hypothesis testing
- Even if a result is statistically significant, it can still be due to chance
- Statistical significance is not the same as practical importance
- A test of significance does not say how important the difference is, or what caused it
- A test does not check the study design: if the test is applied to a nonrandom sample (or the whole population), the p-value may be meaningless
- Data-snooping makes p-values hard to interpret
60. Introduction to the Permutation Test (Rank Test)
- A type of nonparametric hypothesis test
- Also called a randomization test or exact test
- A very widely applicable class of tests
- Introduced in the 1930s
- Usually requires only a few weak assumptions
- Often shows good power
61. 5 Steps to a permutation test
1. Analyze the problem; identify the NULL and ALT hypotheses
2. Choose a test statistic (TS)
3. Compute the TS for the original labeling of the observations
4. Rearrange (permute) the labels and recompute the TS for the rearranged labels (do this for all possible permutations)
5. Decide whether to reject the NULL based on this permutation distribution
62. Permutations
- A permutation is a reordering of the numbers 1, ..., n
- Example: what are some permutations of the numbers 1, 2, 3, 4?
- The NULL specifies that the permutations are all equally likely
- The sampling distribution of the TS under the NULL is computed by forming all permutations, calculating the TS for each, and considering these values all equally likely
63. Example
- Suppose we wish to compare the length of stay in the hospital for patients with the same diagnosis at two different hospitals. We have the following results:
  - 1st hospital: 21, 10, 32, 60, 8, 44, 29, 5, 13, 26, 33
  - 2nd hospital: 86, 27, 10, 68, 87, 76, 125, 60, 35, 73, 96, 44, 238
- How could we carry out a permutation test to test the NULL hypothesis of no difference between the two hospitals?
- Why is a t-test not useful in this case?
64. Example
- The distribution of length of stay is very skewed and far from a normal distribution.
- Using the rank-sum test: R = 83.5, T = 3.10, p = 0.002
- This is an example of an unpaired 2-sample test
- Here, we have to find all of the combinations (since order within each group doesn't matter); a computational sketch follows below
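A minimal Monte Carlo sketch of the permutation idea for the hospital data above (random relabelings approximate the full set of combinations, and the difference in group means is used as the test statistic; the rank-sum statistic quoted on the slide could be substituted):

```python
# Approximate permutation test comparing mean length of stay between two hospitals.
import numpy as np

hosp1 = np.array([21, 10, 32, 60, 8, 44, 29, 5, 13, 26, 33])
hosp2 = np.array([86, 27, 10, 68, 87, 76, 125, 60, 35, 73, 96, 44, 238])

observed = hosp1.mean() - hosp2.mean()
pooled = np.concatenate([hosp1, hosp2])
rng = np.random.default_rng(0)

n_perm, count = 20000, 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)                      # random relabeling
    diff = perm[:len(hosp1)].mean() - perm[len(hosp1):].mean()
    if abs(diff) >= abs(observed):                      # two-sided comparison
        count += 1

print("Approximate permutation p-value:", count / n_perm)
```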
65. Advantages
- Can get a permutation test for any TS, even if its sampling distribution is unknown
- This gives more freedom in choosing a TS
- Can be used on unbalanced designs
- Can combine dependent tests on mixtures of different data types (e.g., with numerical and categorical data)
66. Limitations
- Assumption that the observations are exchangeable under the NULL
- This allows us to randomly move observations between the groups
- For example, when testing for a difference in 2 group means, you would need to assume that the distributions in both groups have the same shape and spread
- Cannot be used for testing hypotheses about a single population, or to compare groups that are different under the NULL
67. Introduction to ROC curves
- ROC = Receiver Operating Characteristic
- Started in electronic signal detection theory (1940s-1950s)
- Has become very popular in biomedical applications, particularly radiology and imaging
- Also used in machine learning applications to assess classifiers
- Can be used to compare tests/procedures
- True positive rate (sensitivity) vs. false positive rate (1 - specificity)
68. Examples using ROC analysis
- Threshold selection for tuning an already-trained classifier (e.g., neural nets)
- Defining signal thresholds in DNA microarrays
- Comparing test statistics for identifying differentially expressed genes in replicated microarray data
- Assessing performance of different protein prediction algorithms
- Inferring protein homology
69. ROC curves: simplest case
- Consider a diagnostic test for a disease
- The test has 2 possible outcomes
  - Positive, suggesting presence of disease
  - Negative
- An individual can test either positive or negative for the disease
70. Specific Example
[Figure: test result distributions]
71. Threshold
[Figure: a decision threshold on the test result axis]
72. Four groups
[Figure: true positives, true negatives, false negatives, and false positives defined by the threshold on the test result axis]
73. Moving the threshold
[Figure: true positives, true negatives, false negatives, and false positives change as the threshold moves]
74. ROC Curve
[Figure: true positive rate (sensitivity) vs. false positive rate (1 - specificity)]
75. ROC Curve
[Figure: true positive rate (sensitivity) vs. false positive rate (1 - specificity)]
76. Area under the ROC curve (AUC)
- Overall measure of test performance
- Comparisons between two tests are based on differences between (estimated) AUCs
- For continuous data, the AUC is equivalent to the Mann-Whitney U-statistic (a non-parametric test of difference in location between two populations); a small sketch follows below
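A small sketch of this connection using made-up test values: the AUC equals the proportion of diseased/non-diseased pairs in which the diseased subject has the higher test value (ties counted as one half), i.e., the Mann-Whitney U statistic divided by the number of pairs:

```python
# Estimate AUC as P(X_diseased > X_nondiseased) by comparing all pairs.
import numpy as np

diseased    = np.array([3.1, 2.7, 4.0, 3.6, 2.9])        # hypothetical test values
nondiseased = np.array([1.8, 2.5, 2.9, 1.2, 2.2, 3.0])

greater = (diseased[:, None] > nondiseased[None, :]).sum()
ties    = (diseased[:, None] == nondiseased[None, :]).sum()
auc = (greater + 0.5 * ties) / (len(diseased) * len(nondiseased))
print("AUC estimate:", auc)
```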
77. Interpretation of AUC
- The probability that the test result from a randomly chosen diseased individual is more indicative of disease than that from a randomly chosen non-diseased individual: P(Xi > Xj | Di = 1, Dj = 0)
- A nonparametric distance between disease/non-disease test results
- Limitations: no direct clinically relevant meaning
  - A lot of the area comes from the range of large false positive values, and no one cares what is going on in that region
  - The curves might cross, so there might be a meaningful difference in performance that is not picked up by the AUC
78. Elements of sample size calculation
- Hypothesis
  - H0: new treatment = standard treatment
  - Ha: new treatment is better
- Type I and Type II errors
  - α = .025 (or two-sided α = .05)
  - β = .15 (power = 85%)
- Effect size
  - Δ = µ1 - µ2 (for continuous outcomes)
  - Δ = π1 - π2 (for dichotomous outcomes)
- Sample variation
  - σ(Δ)
(A small sketch combining these elements follows below.)
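A minimal sketch of how these elements combine for a dichotomous outcome, using the standard normal-approximation formula for comparing two proportions (my own illustration; the example rates echo the response-rate example on slide 25):

```python
# Per-arm sample size for comparing two proportions (normal approximation).
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.85, two_sided=True):
    z_a = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_b = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)     # variability of the difference, per subject
    return (z_a + z_b) ** 2 * var / (p1 - p2) ** 2

# Detect an improvement from 25% to 50% with two-sided alpha = 0.05 and 85% power.
print(round(n_per_arm(0.25, 0.50)))
```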
79. Test of Proportions
- Determining the sample size
  - What is the level of significance (the probability, or α level, of rejecting a true null hypothesis)?
  - What are the chances of detecting a real difference? (power)
  - How large a difference (Δ) is clinically important?
80. Determining the Sample Size
- The criteria are inter-related
- If you know 3 of the 4 parameters, the other is fixed (n, α, β, and Δ)
- Must keep the study feasible
- There are trade-offs
- There is no one correct answer
81. A Sample Size Calculation Is Only an Estimate
- The parameters used in the calculation are themselves estimates, with a level of uncertainty.
- The estimated treatment effect may be based on a different population.
- The estimated treatment effect is often overly optimistic, being based on highly selected pilot studies.
- Patient eligibility criteria may be changed, thus affecting the sample population.
- It is better to design a larger study with early stopping than a smaller study and then try to expand N or extend follow-up during the trial.
82. Sample Size and Power: Why?
- Before a study: how large a sample does the study require? (planning)
- After a study: if no association was found, could it be due either to a true lack of association in the population or to low power and small sample size?
83. Power and sample size
- Problem: we might fail to reject the null hypothesis when the alternative is true.
- That is, we might commit a Type II error.
- Solution: select a large enough sample so that there is an 80% chance of rejecting the null hypothesis if the alternative is true.
- Then the power to detect the alternative is 80%.
84. Power and sample size (cont'd)
- Problem: sometimes the sample size required is too large.
- Solutions
  - Be content to detect the difference with less power (allow more Type II error).
  - Increase the level of the test (allow more Type I error).
  - Pick a more extreme alternative.
85. Sample Size
- Larger sample sizes provide more accurate estimates of the characteristics of the population
- Confidence intervals specify where the population value probably lies
- As the sample size increases, there is less margin of error
86. Change in Sample Size: Test of Proportions
Test of hypothesis for a phase II trial (1 arm):
- H0: p < 0.10; H1: p > 0.25
- n = 40
- Design: α1 = 0.04, 1-sided test, 1 - β = 0.82, Δ = 0.15, 1 arm
87. Change in Sample Size: Test of Proportions
Test of hypothesis for a phase II trial (1 arm): H0: p < 0.10, H1: p > 0.25, Δ = 0.15

α1:             0.05   0.025   0.01
1 - β = 0.80     40     49      62
1 - β = 0.90     55     64      78
1 - β = 0.95     70     79     103

(A sketch of an exact single-arm calculation in this spirit follows below.)
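A sketch of one exact single-arm calculation (my own illustration, not necessarily the software or conventions behind the table): search for the smallest n, and a rejection cutoff r, such that the one-sided type I error at p0 = 0.10 stays below α while the power at p1 = 0.25 reaches the target.

```python
# Exact binomial single-arm design: reject H0 (p <= p0) if responses > r out of n.
from scipy.stats import binom

def smallest_design(p0, p1, alpha, power, n_max=200):
    for n in range(1, n_max + 1):
        for r in range(n + 1):
            type1 = binom.sf(r, n, p0)                     # P(X > r | p = p0)
            if type1 <= alpha and binom.sf(r, n, p1) >= power:
                return n, r, type1, binom.sf(r, n, p1)
    return None

n, r, a, pw = smallest_design(p0=0.10, p1=0.25, alpha=0.05, power=0.80)
print(f"n = {n}, reject if > {r} responses (alpha = {a:.3f}, power = {pw:.2f})")
```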
88. TTP Example
Assumptions (1 arm): α1 = 0.05; power = 0.80; H0: median = 30 mos.; H1: median = 40 mos.; hazard reduction = 26%; accrual = 12/mo.; duration of accrual = 14.7 mos.; follow-up = 24 mos. Total sample size: 176 pts.
89. Change to a 2-Arm Study
Assumptions (2-arm study): α1 = 0.05; power = 0.80; H0: median = 30 mos.; H1: median = 40 mos.; hazard reduction = 26%; accrual = 12/mo.; duration of accrual = 43.1 mos.; follow-up = 24 mos. Total sample size: 518.
90. Increase Power
Assumptions (2-arm study): α1 = 0.05; power = 0.80 vs. 0.90; H0: median = 30 mos.; H1: median = 40 mos.; hazard reduction = 26%; accrual = 12/mo.; duration of accrual = 43.1 vs. 55.8 mos.; follow-up = 24 mos. Total sample size: 518 vs. 670.
91. Statistical Power
- 0.01-0.69: unacceptable
- 0.70-0.80: poor
- 0.80-0.89: good
- 0.90-0.99: excellent
92. Characteristics of Phase I Trials
- Small sample sizes
- Not hypothesis driven
- Toxicity (DLT and MTD) and Efficacy
- Patient safety and benefit
- Dose escalation and drug discovery
- Clinician, Patients and Drug Development
93. Phase I trial designs
- Conventional/standard method
  - 3+3 dose escalation design
- Sequential/Bayesian methods
  - Continual Reassessment Method (CRM)
  - Random Walk Rules (RWR)
  - Decision-theoretic approaches
  - Escalation with Overdose Control (EWOC)
94. Phase I Dose Study: Standard Method (3+3 design)
- At each predefined dose level, treat 3 patients, starting with dose level 1.
  - If 0 of 3 have a DLT, increase to the next level
  - If 2 or more have a DLT, decrease to the previous level
  - If 1 of 3 has a DLT, treat 3 more at the current dose
    - If 1 of 6 has a DLT, increase to the next level
    - If 2 or more have a DLT, decrease to the previous level
- If a dose has been de-escalated to the previous level
  - If only 3 had been treated there, enroll 3 more for a total of 6
  - If 6 have been treated, stop the study and declare it the MTD
- MTD: the largest dose for which 1 or fewer DLTs occurred
- Escalation never occurs to a dose at which 2 or more DLTs have occurred
(A small simulation sketch of this rule follows below.)
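```python
# A minimal simulation sketch of the 3+3 rule described above. The true DLT
# probabilities are hypothetical, and as a simplification the top dose is declared
# tolerable even if only 3 patients were treated there.
import numpy as np

def simulate_3plus3(true_dlt_probs, rng):
    """Run one simulated 3+3 trial; return the declared MTD level (0-based) or None."""
    n = len(true_dlt_probs)
    treated, dlts = [0] * n, [0] * n
    level = 0
    escalation_allowed = True          # never escalate past a dose that proved too toxic
    while True:
        dlts[level] += rng.binomial(3, true_dlt_probs[level])
        treated[level] += 3
        if dlts[level] >= 2:                         # too toxic: de-escalate
            escalation_allowed = False
            if level == 0:
                return None                          # no tolerable dose
            level -= 1
            if treated[level] >= 6:
                return level                         # lower dose already fully evaluated
        elif treated[level] == 3 and dlts[level] == 1:
            continue                                 # expand to 6 at the same dose
        else:                                        # 0/3, 0/6, or 1/6 DLTs: tolerable
            if not escalation_allowed or level == n - 1:
                return level
            level += 1

rng = np.random.default_rng(1)
probs = [0.05, 0.10, 0.25, 0.45]                     # hypothetical true DLT rates
mtds = [simulate_3plus3(probs, rng) for _ in range(2000)]
for lev in [None, 0, 1, 2, 3]:
    print("declared MTD =", lev, ":", mtds.count(lev))
```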
95. Sample Size for Safety Trials
Type I error (alpha), acceptable safety rate (rho), and sample size (N):

Alpha:        0.10   0.05   0.05   0.10   0.10
Rho:           5%    10%    14%    20%    25%
Sample size:   45     28     20     10      8

(A sketch of one way such numbers can be derived follows below.)
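One common way such numbers can be derived (a hedged sketch, my own illustration rather than a stated method from the lecture) is to choose N so that, if the true toxicity rate were rho, the chance of observing zero events among N patients would be about alpha, i.e., (1 - rho)^N = alpha, so N = ln(alpha) / ln(1 - rho):

```python
# N such that the probability of seeing zero toxicities is about alpha when the
# true rate is rho: (1 - rho)**N = alpha  =>  N = ln(alpha) / ln(1 - rho).
import math

for alpha, rho in [(0.10, 0.05), (0.05, 0.10), (0.05, 0.14), (0.10, 0.20), (0.10, 0.25)]:
    n = math.log(alpha) / math.log(1 - rho)
    print(f"alpha = {alpha:.2f}, rho = {rho:.2f}: N ~ {n:.1f}")
# Rounded, these come out near the tabulated 45, 28, 20, 10, and 8.
```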
96. Characteristics of Phase II Trials
- Aim: to determine the efficacy of a new treatment (what outcomes to observe)
- Small study of one experimental treatment (E)
- Often a single-arm trial of E alone, without randomization
- Efficacy and safety are evaluated using an early outcome
- Data on E are compared to historical data on the standard treatment (S)
- If E is promising, then organize a randomized phase III trial of E vs. S based on a time-to-event outcome (T)
97. Primary Outcome Measure and Point Estimate
- Mean Hgb: µ
- Proportion responding: p
- Median nadir PSA
- Failure rate: λ
98. Typical Phase II Trials
- Typical cancer phase II trials investigate the response rate
- Historical reference: p0
- Desired clinically significant response: p1
- Hypotheses
  - H0: p ≤ p0 (the true response rate is no larger than p0, a minimum response rate of interest)
  - H1: p ≥ p1 (the true response rate is at least p1, a target response rate)
- Stop the trial early if p is not sufficiently promising
99. Typical Phase II Trials
- One-stage design: use Fisher's exact test to reject the null
- Two-stage design: the first stage enrolls N1 patients. If there are not enough responses, stop the trial; otherwise, continue to the full N (> N1) patients and evaluate treatment response based on the number of responses (a computational sketch follows below)
- The choice of N and N1 is made according to prespecified Type I and II errors.
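A sketch of how the operating characteristics of such a two-stage rule can be computed exactly with the binomial distribution (the specific design parameters n1 = 12, r1 = 1, n = 35, r = 5 below are hypothetical, not a design taken from the lecture):

```python
# Operating characteristics of a generic two-stage single-arm design.
from scipy.stats import binom

def two_stage_oc(n1, r1, n, r, p):
    """Stop (accept H0) if <= r1 responses among the first n1 patients; otherwise
    enroll to n patients total and declare the drug promising if total responses > r."""
    pet = binom.cdf(r1, n1, p)                           # probability of early termination
    promising = sum(binom.pmf(x1, n1, p) * binom.sf(r - x1, n - n1, p)
                    for x1 in range(r1 + 1, n1 + 1))     # P(declare the drug promising)
    expected_n = n1 + (1 - pet) * (n - n1)               # expected sample size
    return pet, promising, expected_n

for label, p in [("under p0 = 0.10", 0.10), ("under p1 = 0.25", 0.25)]:
    pet, promising, en = two_stage_oc(n1=12, r1=1, n=35, r=5, p=p)
    print(f"{label}: PET = {pet:.2f}, P(promising) = {promising:.2f}, E(N) = {en:.1f}")
```

The type I error is the "promising" probability under p0, the power is the same probability under p1, and E(N | p0) and PET(p0) are the quantities reported on the next slide.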
100. Phase II Trial Designs
- Single sample (1 stage)
- Multiple-stage designs
  - Gehan (2 stage), Fleming
  - Simon's optimal, minimax
  - Bayesian
- Multiple outcome measures
- Interim analyses
  - Stop for toxicity or lack of activity
  - Not rejecting the null hypothesis
101. Phase II Trial Designs
- Randomized phase II (2 samples)
  - Reduce bias by randomizing patients
  - Concurrent accrual / comparative
  - Control / selection
  - Randomized discontinuation
- Interim analysis
  - Stop for toxicity or lack of activity
  - Not rejecting the null hypothesis
- Adaptive
102. Simon's Optimal 2-Stage Design
- α = 0.10, β = 0.10, E(N | p0) = 48, PET(p0) = 0.65
103. Characteristics of Phase III Trials
- Use phase II data to decide what to test in phase III
- Randomize between E and S, usually multi-center
- Typically based on T = survival time or DFS time
- The scientific standard for deciding whether E is effective