Statistical aspects of clinical research

Slides: 65

Provided by: davidgi1

1
Statistical aspects of clinical research

David Giltinan
May 2006

2
Outline

Why is clinical research hard?
Key statistical concerns
Get the correct answer to the right question,
using the appropriate number of subjects
Key components of a clinical trial
Clear, feasible, appropriate study objective(s)
Target patient population
Study design: visit and evaluation schedule
Efficacy and safety endpoints
Sample size
Analysis methods
Next week
Interim analyses: early termination?
Subgroup analyses

3
Clinical research is not for sissies

Answering even relatively simple questions under
the best conditions (a controlled clinical
trial) can be tricky. Possible sources of bias
abound and, if appropriate safeguards are not
taken, may combine to give a false or misleading
conclusion
Some of the factors which make clinical research
hard:
Formulating the right scientific question can
be deceptively tricky
Logistical complexity, especially the need to use
multiple sites
Trial conduct is highly interdisciplinary,
requiring sustained, well-coordinated effort from
many groups
Staggered recruitment of subjects; uncertainty
about the accrual pattern is unavoidable
Patient dropout, particularly in longer trials
Potential for the goalpost to move mid-trial:
unforeseen events can destroy, or severely
reduce, the relevance of the study even before it
ends

4
Laws governing clinical trial conduct ¹

Lasagna's Law
The prevalence of any disease under study drops
dramatically once study enrolment opens up, and
returns to previous levels only once enrolment
closes
Murphy's Law
Anything that can go wrong, will go wrong
In particular, the most egregious breach of
protocol instructions will occur at the
highest-enrolling site
Giltinan's Law
The quality of data obtained from any site is
inversely proportional to the degree of
exaltation of the thought leader or principal
investigator at the site (in extreme cases, the
role of thought leader is so all-consuming that
delays in filing the necessary paperwork result
in actual enrolment levels close to zero)
¹ clearly, all just different manifestations of
Murphy's Law

5
Strategy to tactics: protocol development

A key concern is that each individual study
protocol must achieve its goals, not just on its
own terms: it must also make sense within the
broader picture
A major practical issue is the ever-changing
nature of the landscape: the long duration of
most trials, and the uncertainty about the
results, mean that the original target may have
shifted by completion of a given trial
Nonetheless, a key requirement when designing any
trial is that the proposed design should give the
best chance possible of enabling the development
plan to proceed to the next stage, once results
from the trial become available
The previous condition should be met, even when
results do not correspond to the desired answer:
it is important to remember that a failed
clinical trial is not one which fails to give the
desired answer, but rather one which fails to
give an unambiguous answer

6
Study objectives should be clear, specific, and
relevant

Phase III objectives are determined primarily by (i)
the target product profile (think: desired label
claim) and (ii) norms for the given disease
Primary and secondary objectives should map
readily to corresponding statistical hypotheses
Safety objectives are given greater emphasis in
Phases I and II; Phase III focuses on efficacy
and safety
Objectives should be specified as precisely as
possible. At a minimum, include information on:
What measure of efficacy/safety will be used?
Key features of the target patient population
Dosing regimen, i.e. amount, frequency, and route
of dosing
It is preferable to use neutral language when
specifying objectives (personal opinion). Phrases
like "to compare (investigate) the efficacy" or
"to characterize the pharmacokinetics" are
preferable to, e.g., "to demonstrate efficacy" or
"to establish superiority"

7
Protocol Tip 1: Specify clear study objectives

Examples
"To investigate the effect of a single 5 mg dose
of rhwonderprotein, administered by transgenic
snakebite, on clotting ability in Irish
clergymen, as measured by the change from
baseline in prothrombin time", rather than "To
demonstrate the efficacy of rhwonderprotein in
improving clotting ability"
"To investigate the effect of twice daily SC
injection of 40 µg/kg of rhIGF-I for 12 weeks on
glycemic control, in subjects with moderate to
severe Type II diabetes, as measured by the
average change from baseline in HbA1c, compared
to subjects in the placebo group"

8
Bias: sources and precautions

Source of bias, followed by the corresponding precaution:
Selection bias: unambiguous eligibility criteria
Allocation bias: randomization, stratification,
blinding
Evaluation bias (observer/instrument): blinding,
standardization (training, or central evaluation)
Recall bias: appropriate data collection
instruments
Time (systematic change in patient population,
treatment, or other aspect of study conduct as
trial progresses): balanced treatment allocation;
protocol should specify salient details of study
conduct, avoiding room for differential
interpretations
Withdrawal / drop out patterns: pre-specified
analysis conventions, sensitivity analyses
Lack of compliance with study protocol: training;
engaged study coordinators at site
Unblinding (of patient, physician, or study
personnel): randomized allocation; suitable
precautions surrounding treatment codes and drug
inventory/supply

9
Bias: the statistician's arch-nemesis

Loosely speaking, bias arises as a result of:
Groups differ at baseline w.r.t. an important
prognostic factor
Groups differ w.r.t. some aspect of study conduct
that could affect response
Key statistical tools against bias are
Randomization (allocation of subjects to
treatment groups is randomized)
Blinding
Stratification
Uniform implementation of study procedures across
study sites is also critical. Differences may
complicate interpretation, or compromise
generalizability of results. Of particular
concern
Different interpretation of eligibility criteria
Systematic differences across sites in how key
variables are measured

10
Bias, efficiency, and generalizability

Trial design and execution should
Avoid bias - wrong, or misleading, result
Generalize to the target population of interest -
avoid an irrelevant result
Be efficient - avoid using more subjects than
necessary
Studies which are inadequately powered, or
otherwise deficiently designed, may be viewed as
particularly inefficient (and ethically dubious)

11
Randomization

Randomization is the basis for statistical
inference
A significance level represents the probability
that differences in outcome at least as large as
those observed could arise from random
fluctuations alone, if the treatments did not
truly differ.
Without randomization, a statistically
significant difference may be the result of
non-random differences in the distribution of
unknown prognostic factors
Randomization does not ensure that groups are
medically equivalent, but it distributes randomly
the unknown biasing factors
Randomization plays an important role for the
generalization of the observed clinical trials
data
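
A minimal sketch of why randomization underpins inference: if treatment has no effect, every allocation of the same subjects was equally likely, so the p-value can be computed by enumerating allocations. The function name and the toy data are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations


def randomization_p_value(treated, control):
    """Exact two-sided randomization test for a difference in group means."""
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    n_control = len(pooled) - n_treated
    total = sum(pooled)

    def abs_mean_diff(treated_sum):
        # Absolute difference in means implied by a given treated-group total.
        return abs(treated_sum / n_treated - (total - treated_sum) / n_control)

    observed = abs_mean_diff(sum(treated))
    # Every way the same subjects could have been split into the two groups.
    allocations = list(combinations(range(len(pooled)), n_treated))
    as_extreme = sum(
        1
        for idx in allocations
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12
    )
    return as_extreme / len(allocations)


# Toy data (hypothetical): 3 treated and 3 control responses.
p_value = randomization_p_value([5, 6, 7], [1, 2, 3])
print(p_value)
```

With only 20 possible allocations the smallest attainable two-sided p-value is 2/20, which also illustrates why very small trials cannot reach conventional significance levels.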

12
Randomization: Practical Tips

If prognostic factors are known, use randomization
methods that can account for them
Stratification / blocking
Adaptive randomization
If possible, randomize patients within a site
Patients enrolled early may differ from patients
enrolled later
Watch out for staggered enrollment
Temporary closing of study sites or arms can
cause problems
Protocol amendments that affect
inclusion/exclusion criteria may be tricky
Even in open-label studies, randomization codes
should be locked
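
One common way to implement stratification with blocking is a permuted-block schedule generated separately for each site. The sketch below assumes two arms, a block size of 4, and hypothetical site names; none of these specifics come from the slides.

```python
import random


def permuted_block_schedule(n_subjects, arms=("A", "B"), block_size=4, seed=0):
    """Treatment schedule in which every complete block is balanced across arms."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # fixed seed so the schedule can be reproduced and audited
    schedule = []
    while len(schedule) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)  # random order within the block
        schedule.extend(block)
    return schedule[:n_subjects]


# One independent schedule per site (the stratum); site names are hypothetical.
schedules = {
    site: permuted_block_schedule(12, seed=index)
    for index, site in enumerate(["site_1", "site_2"])
}
for site, schedule in schedules.items():
    print(site, "".join(schedule))
```

Because each block is balanced, the arms stay close to balanced within a site even if enrolment there stops early or the patient mix drifts over time.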

13
Blinding

Randomization does not guarantee that there will
be no bias by subjective judgment in evaluating
and reporting the treatment effect
Such bias can be minimized by concealing the
identity of the treatment (blinding)
Types of blinding
Challenges
Ethical considerations
Unblinding procedures for safety reasons
Unblinding procedures at final analysis

14
Protocol Tip 2: Avoid Ambiguity

Protection against certain types of bias is
through appropriate design precautions
(stratification, randomization, blinding)
Other types of bias are prevented only by giving
unambiguous instructions to the sites on the
intended patient population and how all aspects
of the study should be conducted
Sites will sniff out each ambiguity in the
protocol, and interpret and execute the
instructions more divergently than you can
imagine
There is vagueness regarding key aspects of study
conduct, e.g. use of con meds, evaluation
schedule, endpoint definition, handling of
dropouts, how key evaluations will be carried
out, etc. etc. etc.
Major divergence in interpretation (e.g. in
deciding eligibility, or how to measure a key
response variable)
has the potential to torpedo the protocol
entirely
may not become evident until it's too late

15
Protocol Tip 3: Accommodating multiple sites

As a routine precaution, it is advisable to limit
the contribution to enrolment of any single site
to no more than 15% of the total. Note that this
limit is generally not specified explicitly in
protocol text, but is communicated to sites at
study initiation nonetheless
Non-standard evaluations may require intensive
training of site personnel to reduce systematic
differences in evaluation among sites
Centralized (blinded) evaluation, when feasible,
is often the best option
It is a good idea to develop a prospective
publication strategy, securing upfront buy-in
from key stakeholders
A plan and timetable for disseminating study
results should be developed, following existing
SOPs, and communicated to sites prospectively

16
Protocol Tip 3: Accommodating multiple sites

Regular, frequent communication with sites is
important
Early monitoring of key variables is advisable,
to allow problems to be detected and fixed early
Appropriate mechanisms should be in place to
allow evaluation of aggregated safety data in a
timely fashion, (remember that individual sites
may not be able to discern adverse patterns,
based only on their data)
Each team member should try to attain at least a
basic understanding of the role of every other
team member

17
Endpoints (1)

Discussion here will focus primarily on efficacy
endpoints
What about other kinds of endpoints?
Pharmacokinetic endpoints are generally standard
parameters derived from the observed
concentration-time profiles
Safety endpoints also tend to be fairly standard:
most are common across protocols, with occasional
disease/drug-specific markers
Incidence of adverse events (general,
protocol-specified, by body system, etc.)
Changes in key laboratory parameters
Incidence of antibodies (neutralizing or not)
Pharmacodynamic endpoints, in contrast, are
measures of activity, and will vary from study to
study. Recommendations for efficacy endpoints
apply.

18
Endpoints (2): General Remarks

No problem in Phase I, where focus is primarily
on safety and PK endpoints. Limited sample sizes
preclude formal evaluation of efficacy; if it
must be mentioned in the protocol, it is
preferable to refer to "activity", rather than
"efficacy"
Drug approval requires establishing an acceptable
risk-benefit profile. It is important to bear in
mind that the regulatory expectation is that of
clinical benefit to the patient
Thus, in general, the primary efficacy endpoint
should be a measure of clinical effect (as
opposed to, e.g. a biochemical or physiological
marker)
Taking the primary efficacy endpoint in a pivotal
trial to be a biomarker which is not a direct
measure of clinical benefit is something which
should be done only with prior buy-in from all
relevant regulatory agencies
In general, such buy-in can be attained only in
the case of an established surrogate endpoint
(more on this below)

19
Endpoints (3): relevance should be accepted

Ideally, there is a well-established primary
efficacy endpoint, accepted as a suitable
measure of patient benefit.
This can circumvent much tedious discussion, and
has the added advantage that consensus on what
constitutes a meaningful treatment effect is
likely already to exist.
When such consensus exists, to ignore it would be
foolhardy
Often there may be consensus on the choice of
primary efficacy variable, but secondary aspects,
such as definition of relapse or rebound may
still be under debate
For diseases with no consensus on how best to
measure efficacy, expect longer development times
It is not recommended to launch Phase I without a
reasonably clear vision of what the primary
efficacy variable will be in pivotal studies;
postponing difficult discussions won't
necessarily make them any easier
Agreement on conventions for handling
dropouts/missing data is also important

20
Endpoints (4): Objective is better

Generally speaking, endpoints which can be
measured in a completely objective fashion are
preferred
This may not always be possible: some degree of
subjectivity may be unavoidable (e.g. in
endpoints such as physician's or patient's
evaluation of improvement)
The degree to which this kind of subjectivity may
be acceptable is likely to depend on perceptions
about the integrity of blinding in the study
In evaluating quality of life, use of a
validated instrument is preferable. In many
cases, a disease-specific QOL questionnaire
exists
Consultation with the Health Economics group is
highly recommended, to ensure that collection of
QOL data supports the target product profile
(don't wait until Phase III to do this)

21
Endpoints (5): measurement aspects

In general, key efficacy endpoints should be
straightforward to measure. Avoid measures which
might still be considered experimental, which
require highly complex instrumentation, or
involve extremely specialized assays.
Measurements which rely heavily on technician
skill or judgement can also be problematic
Centralized evaluation of key endpoints may help
guard against inter-site variation
If key variables do involve specialized assays,
make sure that assay procedures are thoroughly
understood, and consistently implemented

22
Endpoints (6): Multiple Endpoints

Multiple secondary endpoints are common
Multiple primary endpoints are sometimes used
If consensus on a single 1° endpoint is
impossible
Should be a course of last resort (personal view)
Have an associated penalty, in terms of a higher
bar to declare statistical significance at a
given level α
A common approach is to require significance at
level α/k, where k is the number of
co-primary endpoints (Bonferroni)
Bonferroni works reasonably, provided k is not
too large, and if the constituent endpoints are
uncorrelated
For highly correlated endpoints, Bonferroni is
inefficient: true attained significance will be <
α
Especially problematic if there is interest in
multiple subsets
Try to show some discipline regarding the number
of 2° endpoints
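
The penalty can be made concrete with a small calculation. This sketch assumes k = 3 independent co-primary endpoints and an overall α of 0.05; both numbers and the function names are illustrative, not from the slides.

```python
def bonferroni_threshold(alpha, k):
    """Per-endpoint significance level under the Bonferroni correction."""
    return alpha / k


def familywise_error(per_test_level, k):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - per_test_level) ** k


alpha, k = 0.05, 3  # assumed values for illustration
unadjusted = familywise_error(alpha, k)
adjusted = familywise_error(bonferroni_threshold(alpha, k), k)
print(f"per-endpoint level: {bonferroni_threshold(alpha, k):.4f}")
print(f"familywise error, each endpoint tested at alpha: {unadjusted:.4f}")
print(f"familywise error, each endpoint tested at alpha/k: {adjusted:.4f}")
```

For independent endpoints the adjusted familywise rate lands just under α; for positively correlated endpoints it falls further below α, which is the inefficiency noted above.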

23
Endpoints (7): a statistical taxonomy

(1) Continuous - e.g. reduction in cholesterol,
HbA1c, visual acuity
(2) Categorical
Multiple categories with no natural ordering
Ordered categorical - e.g. different degrees of
improvement
(3) Dichotomous - e.g. response/non-response,
dead/alive at a specific time post-treatment
(4) Time-to-event - e.g. survival, time to
progression
Different analysis methods are appropriate for
each main endpoint type; sample size requirements
differ as well
(3) is obviously a special case of (2)

24
Endpoints (8): statistical properties

Approximate ordering by information content (from
highest to lowest) is
Continuous > time-to-event ordered
categorical
> categorical > binary
As a result, demonstrating an effect when the
primary efficacy measure is a response rate is
typically most demanding, in terms of sample size
Although continuous response variables may have
preferable statistical properties, it is quite
common for FDA to require the primary efficacy
variable to be a response rate, where response is
defined as the proportion of subjects who reach a
specified threshold of improvement on the
continuous scale (Raptiva, Lucentis)

25
Endpoints in cancer trials

Response rate (where response is based on change
in tumor size, according to well-defined
criteria; best post-treatment evaluation is
counted, so response is not linked to a specific
timepoint)
Duration of response (note that the resolution
with which this can be determined will depend on
the frequency of scheduled evaluations)
Survival time
Time to disease progression, where criteria for
progression are well-defined
Progression-free survival
One major question is the extent to which a
treatment effect on response, in terms of
reduction of tumor size, is predictive of a
treatment effect on survival. Unfortunately,
this seems to vary by tumor and treatment class.

26
Sample Size Considerations

In the standard hypothesis testing framework for
efficacy
Type I error: conclude an ineffective drug is
effective (false positive)
Type II error: conclude an effective drug is
ineffective (false negative)
Ideally, both error probabilities should be
controlled
Generally, sample size is chosen to give
acceptable power (defined as 1 - Type II error
rate, or 1 - β) for a prespecified false positive
rate, α
In phase III efficacy trials, α is 0.05, by
regulatory fiat
Acceptable power is generally taken to be 90% for
pivotal studies

27
Phase III Trials: Sample Sizes

This has implications for sample size, due to
tension between both types of error
Timeline implications, as study duration =
treatment duration + accrual time
Common pitfall: exaggerate extent of the
possible treatment effect (power for the home
run), over-optimistic sample sizes
General guideline: power study to detect
treatment effect specified in the target product
profile (regular, not optimistic, scenario)
In some cases, sample size is dictated by safety,
rather than efficacy, considerations (satisfy
minimum regulatory requirements)

28
Sample Size Considerations

For a given value of α, power depends on
Magnitude of the treatment effect (↑)
Sample size (↑)
Inter-subject variability for continuous
measurements (↓)
Response rates for binary responses (↑↓)
For most pivotal efficacy trials, the standard
approach is to calculate the sample size
necessary to give adequate (90%) power to detect
a clinically meaningful treatment effect, with a
type I error rate of 5%
Calculating the sample size needed for a given
power requires some knowledge about variability
of continuous responses (or response rates, for
binary data)
"Clinically meaningful" needs to be defined in
terms of the target product profile, not as "the
effect size which will give acceptable power for
the sample size I'm willing/able to use"
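
As a rough illustration of how these ingredients combine, the sketch below uses the standard normal-approximation formula for comparing two group means, n per group = 2 (z_{1-α/2} + z_{1-β})² σ² / Δ². It assumes a two-sided test, equal allocation, and a known common standard deviation; the symbols Δ and σ and the example values are assumptions for illustration, not from the slides.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Approximate subjects per arm to detect a mean difference `delta`.

    Normal approximation, two-sided test at level alpha, equal allocation,
    common standard deviation sigma.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


# Hypothetical scenario: effect of half a standard deviation.
print(n_per_group(delta=0.5, sigma=1.0))              # 90% power
print(n_per_group(delta=0.5, sigma=1.0, power=0.80))  # 80% power
print(n_per_group(delta=0.25, sigma=1.0))             # halving the effect
```

Halving the assumed effect roughly quadruples the required sample size, which is why powering for the "home run" scenario is so tempting and so risky.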

29
Sample Size: other approaches

Sample size is not always dictated by this kind
of power analysis; in some cases, safety
requirements may be the deciding factor
(rheumatoid arthritis, psoriasis)
In earlier phases, it may not be practical to run
trials big enough to control both Type I and Type
II error rates as well as we might like
80% power is generally considered adequate in
Phase II; on occasion we may settle for less
Similarly, requiring significance at the 5% level
may be overly stringent in Phase II
Personal view: it is foolish to allow the
hegemony of hypothesis testing to control our
thinking prior to Phase III
Instead, view the issue as an estimation problem
Precision analysis
Choose sample size in such a way that there is a
desired precision at fixed confidence level
Small chance of detecting true treatment effect
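
A precision analysis can be sketched for a single response rate: choose n so that the confidence interval has a desired half-width. This uses the simple normal-approximation (Wald) interval; the anticipated rate of 30% and the half-width of 10 percentage points are hypothetical inputs.

```python
from math import ceil
from statistics import NormalDist


def n_for_rate_precision(expected_rate, half_width, confidence=0.95):
    """Subjects needed so a Wald interval for a rate has the given half-width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * expected_rate * (1 - expected_rate) / half_width ** 2)


# Hypothetical Phase II scenario: anticipated response rate of 30%,
# to be estimated to within +/- 10 percentage points.
n_subjects = n_for_rate_precision(expected_rate=0.30, half_width=0.10)
print(n_subjects)
```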

30
Sample Size for Time to Event Endpoints

Challenge
Power for correctly detecting a clinically
meaningful difference at a fixed type I error
rate depends primarily on the number of events
(deaths, progressions, etc.)
Specifying the number of events doesn't uniquely
determine the number of subjects
For instance, suppose the required number of
events is 280. If 300 subjects per group is
sufficient to give the required number of events,
then 250 per group must as well; it will just
take longer
Thus, sample size calculations are a little more
complex for time-to-event responses and will
depend on
calculating the number of events needed to give
the desired power
an assumption about the median time-to-event in
the control group
an assumption about the size of the difference
between control and treated groups
projected accrual patterns
targeted study duration
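
The first ingredient, the number of events, is often approximated with Schoenfeld's formula for a logrank comparison with 1:1 allocation: d = 4 (z_{1-α/2} + z_{1-β})² / (log HR)². The slides do not say which formula was used, so this is one standard choice; the hazard ratio of 0.7 is a hypothetical input.

```python
from math import ceil, log
from statistics import NormalDist


def events_needed(hazard_ratio, alpha=0.05, power=0.90):
    """Approximate total events for a two-sided logrank test, 1:1 allocation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)


# Hypothetical target: a 30% reduction in hazard (HR = 0.7).
print(events_needed(0.7))              # 90% power
print(events_needed(0.7, power=0.80))  # 80% power
```

Nothing in the formula involves the number of subjects: converting events into an enrolment target requires the further assumptions on control-group median, accrual, and study duration listed above.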

31
Interim Analyses

Interim analysis is a tool to protect the welfare
of subjects
By stopping enrollment/treatment as soon as a
drug is determined to be harmful
By stopping enrollment as soon as a drug is
determined to be beneficial
By stopping trials which will yield little
additional useful information (or which have
negligible chance of demonstrating efficacy if
fully enrolled, given results to date)
The associated statistical methods are generally
referred to as group sequential methods

32
Interim analysis: Concerns

Should preserve an overall false positive rate of
α for the trial: cannot claim statistical
significance at level α if the unadjusted p-value
at one of the interim analyses happens to be less
than α
In general, the unadjusted p-value for testing
treatment effect at any given interim analysis
will be compared to a more stringent (lower)
bound: to stop early (for efficacy) requires
compelling evidence
Regulatory agencies need to be convinced that
interim analyses do not compromise the integrity
of the blind
Regulatory guidelines over the past 10 years have
become stricter and stricter, ultimately
requiring that interim analyses be conducted by
an external, independent group, i.e. study team
members are no longer privy to interim results
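
The inflation of the false positive rate from repeated unadjusted looks can be seen in a small simulation. This is a simplified sketch: a one-sample z-test on standard normal data with no true effect, equally spaced looks, and an arbitrary fixed seed; all of these settings are assumptions made for illustration.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_looks, n_per_look=25, n_trials=4000, alpha=0.05, seed=2006):
    """Share of null trials declared significant at any of n_looks unadjusted looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_trials):
        running_sum, n = 0.0, 0
        for _ in range(n_looks):
            # Sum of n_per_look independent N(0, 1) responses (no true effect).
            running_sum += rng.gauss(0.0, n_per_look ** 0.5)
            n += n_per_look
            # Unadjusted z-test on all data accumulated so far.
            if abs(running_sum / n ** 0.5) > z_crit:
                false_positives += 1
                break
    return false_positives / n_trials


single_look = false_positive_rate(1)
five_looks = false_positive_rate(5)
print(f"one analysis:  {single_look:.3f}")
print(f"five analyses: {five_looks:.3f}")
```

With a single analysis the rate stays near the nominal 5%; with five unadjusted looks it is well above that, which is what group sequential boundaries are designed to prevent.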

33
Interim analysis: Concerns

Basically, interim results should not be shared
with anyone in the sponsor company, or at
participating study centers
The only feedback to the sponsor is in the form
of the recommendations from the Data Monitoring
Committee
Details of any proposed interim analysis,
including the sponsor's expectations of the DMC,
should be laid out prospectively in a written
charter
SOPs and a charter template exist and should be
followed
Although team members do not conduct the actual
analyses, scheduled interim analyses can be
highly labor-intensive nonetheless. Genentech's
biostatistician/statistical programmer will still
need to work with the external data group to
develop detailed specifications for the analyses
and displays to be made available to the Data
Monitoring Board

34
Interim analysis

Early stopping for efficacy is not the only
possibility (recent experience notwithstanding).
Doing so is generally non-controversial, provided
an appropriate group sequential stopping rule,
and the role of the DMC, have been identified
prospectively
Early stopping for safety can range from
scenarios which are very clear-cut to situations
which are considerably more ambiguous. In the
latter case, having an experienced DMC chair can
be particularly important
Early stopping for lack of efficacy (futility
analysis) is not particularly common (with one
exception, discussed on the next slide); the
idea that incorporating this option can result in
substantial reduction in the number of patients
(gating risk) seems slightly misleading
(personal opinion)
Stopping for futility in a controlled trial will
typically happen only if the treatment appears
considerably inferior to control at the interim
analysis
Enrolment continues during preparation for the
interim analysis, which typically occurs at a
point where accrual has gained momentum, so the
number of subjects saved may not be that great

35
Early stopping for futility

An exception is the case of uncontrolled oncology
trials focusing on estimation of response rate
Use of a two-stage (or multi-stage) design is
common
At a given analysis stage, if the observed
response rate is so low that it essentially rules
out the possibility that the true response rate
is acceptable, may choose to stop
Typically the argument is based on the upper 90%
or 95% confidence limit for the true response
rate: stop if this is lower than the minimum
rate identified as interesting in the TPP
Recall the "rule of 3", often invoked in the
context of safety data. If a particular event
(adverse reaction, response) occurs in 0 out of N
subjects tested, then the 95% upper confidence
limit for the true rate of occurrence is
approximately 3/N.
Thus, for instance, if no responses are observed
in the first 20 subjects, this effectively rules
out values of the true response rate greater than
3/20, or 15%. If the TPP requires a response rate
of at least 20%, stopping for futility seems
warranted
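
The rule of 3 is an approximation to the exact one-sided binomial limit: with 0 events in N subjects, the upper limit p solves (1 - p)^N = 0.05. A quick comparison, using the N = 20 example above (the function name is illustrative):

```python
def exact_upper_limit_zero_events(n, confidence=0.95):
    """Exact one-sided upper confidence limit for a rate when 0 of n events occur."""
    return 1 - (1 - confidence) ** (1 / n)


for n in (20, 50, 100):
    exact = exact_upper_limit_zero_events(n)
    print(f"N={n}: exact {exact:.4f}, rule of 3 {3 / n:.4f}")
```

The rule of 3 is slightly conservative (a little above the exact limit), and the two agree more closely as N grows.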

36
Statistical analysis methods for rates

A fairly detailed exposition can be found on our
website at gwiz/projects/stathelp:
introductory course notes, lecture 4
Use of the binomial distribution
Calculating standard errors: normal approximation
for large samples
Estimation and confidence intervals for a single
rate
Testing for difference between two rates (z-test,
χ²-test, Fisher's exact test)
Estimation and confidence intervals for the
difference between two rates
Testing for differences in rates among several
groups (χ²-test, Fisher's exact test)
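
As an illustration of the large-sample approach, here is a minimal sketch of the two-sample z-test for a difference between rates, using the pooled standard error. The counts (30/100 versus 15/100 responders) are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Large-sample z-test for equality of two rates (pooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value


# Hypothetical data: 30/100 responders on treatment, 15/100 on control.
z_stat, p_val = two_proportion_z_test(30, 100, 15, 100)
print(f"z = {z_stat:.2f}, two-sided p = {p_val:.4f}")
```

For small samples or rates near 0 or 1 the normal approximation becomes unreliable, which is where Fisher's exact test comes in.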

37
Statistical methods for survival analysis

If the response of interest is survival time,
then specialized methods are needed, for two main
reasons
Frequency distribution of survival times is
usually not well-behaved: not normal, not even
symmetric
In the context of clinical studies, cannot wait
to observe all survival times; this means, for
some subjects, all we know is that their survival
time exceeds the observation period
In statistical jargon, such survival times are
called (right)-censored observations
Methods for survival times are also applicable to
any response of type time-to-event, e.g. time
to disease progression, etc.

38
Overview of survival analysis methods

Definitions: survivor function, hazard function
Estimation of survival curve: Kaplan-Meier
Comparison of two or more survival curves:
logrank test, Wilcoxon test
Comparing survival curves, allowing adjustment
for other factors (e.g. baseline disease status):
proportional hazard regression, aka the Cox
model

39
[Figure: Kaplan-Meier disease-free survival curves
stratified by p53 mutation status (n = 542).
Solid/dotted lines: without/with a p53 tumor mutation]

40
Graphing survival data: Kaplan-Meier estimation

We wish to estimate the proportion remaining
disease-free at any given time, equivalently, the
estimated probability that a member of the
population from which the sample is drawn is
alive without disease at that time
Because of the censoring we use the Kaplan-Meier
method. For each time interval we estimate the
probability that those without disease at the
beginning remain so throughout the interval. This
is a conditional probability.
The probability of being disease-free at any time
point is calculated as the product of the
conditional probabilities of surviving without
disease through each interval prior to that time
point.
The calculations are simplified by ignoring times
at which there were no recorded events (whether
progressions or losses to censorship).
Censorship is accommodated in the calculations by
ensuring that all subjects previously lost to
censoring are removed from the risk set when
calculating the conditional probability for a
given timepoint
Because the overall probability of being
disease-free at a particular timepoint is calculated as a
product of the relevant conditional
probabilities, this (Kaplan-Meier) method of
estimating the survival curve is sometimes
referred to as the product-limit estimate
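
The product-limit calculation described above can be sketched in a few lines. The data are made up; an event flag of 1 means the event (progression) was observed and 0 means the subject was censored at that time. Subjects censored at the same time as an event are treated as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (event_time, estimated survival probability)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather everyone whose follow-up ends at time t (event or censored).
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Conditional probability of remaining event-free through t.
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after t.
        n_at_risk -= n_leaving
    return curve


# Hypothetical follow-up times (months) and event indicators.
km_curve = kaplan_meier([2, 3, 3, 5, 8, 9], [1, 1, 0, 1, 0, 1])
for time, prob in km_curve:
    print(f"t={time}: S(t)={prob:.4f}")
```

Times at which only censoring occurs (t = 8 here) produce no step in the curve but shrink the risk set for later steps.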

41
Describing survival pattern for a single group

Survival probabilities are usually presented as a
connected "curve". The curve takes the form of a
step function, with changes in the estimated
probability occurring (only) when an event
(progression) was observed
Observations censored during any interval affect
the number still at risk at the start of the next
interval. Censoring is thus accommodated when
calculating the step sizes; its effect on the
curve is relatively subtle, but becomes
cumulatively more important over time. Some
versions of the Kaplan-Meier curve display
censoring times as superimposed short vertical
lines (works best for relatively small sample
sizes)
In practice, a computer is used to do these
calculations.
Standard errors and confidence intervals for
estimated survival probabilities can be found by
using a formula due to Greenwood
Reporting estimated median survival with
associated confidence limits is usual; estimating
other percentiles is also possible
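
For reference, Greenwood's formula in its usual form. The notation is an assumption, not from the slides: d_i is the number of events and n_i the number at risk at event time t_i, and Ŝ(t) is the Kaplan-Meier estimate.

```latex
\widehat{\operatorname{Var}}\bigl[\hat{S}(t)\bigr]
  \approx \hat{S}(t)^{2} \sum_{i:\, t_i \le t} \frac{d_i}{n_i\,(n_i - d_i)}
```

An approximate 95% confidence interval is then Ŝ(t) ± 1.96 times the square root of this quantity, although intervals computed on a transformed scale are often preferred because they stay within [0, 1].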

42
Comparing survival patterns across groups

Two most common tests are
Logrank test
Wilcoxon test
If comparison needs to allow adjustment for other
covariates besides group ID (e.g. baseline disease
status), the most common approach is
Cox (proportional hazards) regression
As the name implies, this analysis frames the
comparison in terms of the effect a treatment or
covariate exerts on the hazard function, rather
than directly on the survival function

43
Comparing survival patterns: testing

Logrank test
Basic idea: at each new event time, figure out
the survival pattern that would be expected if
the null hypothesis (no difference) were true
Quantify the difference between the observed
survival pattern and that expected under the null
hypothesis. This is done at each new event time.
Obtain a cumulative measure of discrepancy from
H0 by adding up the contributions across all
event times
Compare the result to appropriate tables
(chi-square) to obtain a p-value
Wilcoxon test: variation of the logrank test which
gives greater weight to discrepancies occurring
earlier
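
The logrank steps above can be sketched for two groups as follows. At each event time the expected number of events in group A under H0 is the total events times group A's share of the risk set; the variance term is the standard hypergeometric one. The toy data are made up, and the p-value uses the chi-square distribution with 1 degree of freedom.

```python
from math import erfc, sqrt


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group logrank test; returns (chi-square statistic, p-value)."""
    subjects = [(t, e, "A") for t, e in zip(times_a, events_a)]
    subjects += [(t, e, "B") for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in subjects if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for s in subjects if s[2] == "A" and s[0] >= t)  # at risk in A
        n_b = sum(1 for s in subjects if s[2] == "B" and s[0] >= t)  # at risk in B
        d_a = sum(1 for s in subjects if s[2] == "A" and s[0] == t and s[1])
        d_b = sum(1 for s in subjects if s[2] == "B" and s[0] == t and s[1])
        n, d = n_a + n_b, d_a + d_b
        # Observed events in A minus the number expected under H0.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("no informative event times")
    chi_square = observed_minus_expected ** 2 / variance
    # Upper tail of chi-square with 1 df: P(|Z| > sqrt(chi_square)).
    p_value = erfc(sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical data: all events observed, group A failing earlier than group B.
chi2, p = logrank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```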

44
Comparing survival patterns: estimation

Limitations of the logrank test
Only addresses the question "is there a
difference?" No direct quantification of the size
of the difference
Doesn't allow adjustment for other relevant
prognostic factors (e.g. differences at baseline)
These questions are usually addressed by Cox
(proportional hazards) regression. Salient output
is
estimated coefficient with standard error and/or
confidence interval
Usually interested in whether or not coefficient
is zero
Quantifies effect on hazard, rather than the
survival function

45
Definitions of survival and hazard functions

For completeness, here are the definitions
Survival function
S(t) = Probability of surviving past time t
Hazard function
h(t) = Probability (per unit time) of dying at
time t, given one has survived until that time
For calculus fans, the hazard function turns out
to be -d/dt log S(t)
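
The calculus can be spelled out in one line. The notation is an assumption added for illustration: T is the survival time and f(t) = -S'(t) its density.

```latex
h(t) = \lim_{\Delta \to 0} \frac{P(t \le T < t + \Delta \mid T \ge t)}{\Delta}
     = \frac{f(t)}{S(t)}
     = \frac{-S'(t)}{S(t)}
     = -\frac{d}{dt} \log S(t),
\qquad
S(t) = \exp\!\Bigl(-\int_0^t h(u)\,du\Bigr)
```

So the survival function and the hazard function carry the same information: either one determines the other.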

46
Safety analyses

Safety and efficacy data differ in some key
aspects
Safety hypotheses are not specified a priori
Failure to achieve statistical significance does
not mean that a safety finding can be ignored
With safety data the goal is to prove a negative
Safety analyses are usually descriptive
A few serious medical events can lead to the
termination of a product's development; extreme
value distributions are relevant to safety
analyses
Concurrent controls may not provide adequate
context for interpretation

47
Safety Analysis - Challenges

Phase III trials are typically sized based on
efficacy: what type of safety statements are
appropriate?
Drug exposure: how to summarize, how to
correlate with adverse events observed, etc.
Dose response
Open label trials
Placebo-controlled trials
Sources of bias (under-reporting, longer
follow-up leads to more events)
Adverse events: very, very many types, so what is
an appropriate way to summarize/analyze?
Multiplicity

48
Safety Analyses - Challenges

Number of subjects and duration of exposure
during development is minimal relative to the
number of patients that may receive drug post-approval
Only the most common AEs (e.g., incidence of 1%
or more) are identified
Less common AEs (1 in 1000) cannot be reliably
detected
Rare events (1 in 10,000) will almost certainly
not be observed at all
Some patient groups may have been excluded from
trials entirely, or represented too sparsely to
allow identification of any risks specific to
them
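
The detection limits above follow from simple binomial arithmetic. The
sketch below assumes a hypothetical development program with 1500
exposed subjects and independent events; it shows the chance of seeing
at least one event at each incidence, and the upper confidence bound on
incidence when no events are seen (close to the "rule of three", 3/n).
Seeing a single event is of course not the same as attributing it to
the drug.

```python
def prob_at_least_one_event(incidence, n_subjects):
    """Chance that one or more of n subjects experiences the event."""
    return 1.0 - (1.0 - incidence) ** n_subjects

def upper_bound_if_zero_events(n_subjects, confidence=0.95):
    """Exact one-sided upper bound on incidence when 0 of n had the event."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_subjects)

n_exposed = 1500  # hypothetical size of the safety database
for label, incidence in (("1 in 100", 1e-2), ("1 in 1000", 1e-3), ("1 in 10,000", 1e-4)):
    chance = prob_at_least_one_event(incidence, n_exposed)
    print(f"{label:>12}: P(at least one event) = {chance:.3f}")

bound = upper_bound_if_zero_events(n_exposed)
print(f"0 events in {n_exposed}: incidence could still be up to {bound:.5f}")
print(f"rule-of-three approximation: {3 / n_exposed:.5f}")
```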

49
Regulatory Requirements

Safety
Applicant must demonstrate product safety (FDA
has obligation to demand)
Extent of data: There must be sufficient
information to decide whether the drug is safe.
Adequate analyses: "Adequate tests by all
methods reasonably applicable" must be performed
to evaluate safety for labeled use.
Reasonable results: Tests should show that drug
is safe as labeled
Risks must be adequately defined.
Extreme risks (even if rare) must be obvious.

50
Regulatory Requirements

Efficacy
Applicant must demonstrate substantial evidence
of effectiveness claimed.
Substantial evidence: evidence consisting of
adequate and well-controlled investigations,
including clinical investigations, from which
experts could conclude the drug will have the
claimed effect.
The plural "investigations" implies replication
or corroboration.
Typical: 2 Phase III trials with identical or
similar designs
In special circumstances, 1 Phase III trial may
be sufficient.
E.g. life-threatening diseases with very limited
therapeutic options (always a good idea to talk
to regulatory agencies prior to trial initiation)

51
Guidelines and Regulations

Regulatory Agencies
FDA
EEC (European Economic Community)
U.S. Code of Federal Regulations for Clinical
Trials
ICH (International Conference on Harmonization)
Initiatives undertaken by regulatory authorities
and industry associations to promote
international harmonization of regulatory
requirements
Good Clinical Practice (GCP)
Structure and content of clinical studies
Clinical safety data management: Definitions and
standards for expedited reporting
Statistical principles for clinical trials

52
Biomarker - working definition

"... a laboratory measurement or physical sign
used as a substitute for a clinical endpoint that
measures how a patient feels, functions, or
survives."
From a definition of the term "surrogate
endpoint" by Temple, cited in Fleming and DeMets
(1996), "Surrogate endpoints in clinical trials:
are we being misled?", Annals of Internal
Medicine, 125, pages 605-613

53
Appendix

Some thoughts on biomarkers

54
Biomarkers as surrogate endpoints

Predict clinical efficacy of treatment based
on its effect on biomarker (data may be
available earlier; may provide an answer with
fewer subjects)
Use in Phase II is common
dose ranging based on biomarker
Phase III go/no go decision based on observed
treatment effect on biomarker

55
Common biomarker types

Biochemical (cholesterol, HIV viral load,
cytokine concentration, hemoglobin A1c ...)
Immunological (lymphocyte subpopulation counts,
CD4+, CD11a+ T cells, CD20+ B cells ...)
Saturation of target cell surface antigen or
soluble ligand
Physiological (e.g. blood pressure, pulmonary
function testing, episodes of arrhythmia ...)
Imaging (angiography, tumor size, bone density by
DEXA scan ...)

56
Biomarkers as surrogates - successes

Lowering of cholesterol level by treatment with
statins (survival benefit established)
Reduction in viral RNA in peripheral blood
through treatment with protease inhibitors delays
HIV disease progression
Improved glycemic control (HbA1c) predictive of
delayed onset of microvascular complications
(retino-, nephro-, neuropathy) in Type I diabetes
90-minute TIMI flow (angiography) predictive of
30-day survival following thrombolytic therapy
Reduction in free IgE following treatment with an
anti-IgE antibody correlates with symptom
improvement scores in allergic rhinitis and asthma

57
Biomarkers as surrogates - can't win 'em all

Experience with biomarkers is not always positive
CD4 counts as a surrogate in AIDS trials: mixed
performance as a predictor of clinical benefit
Tumor size in cancer trials: experience runs
both ways; appears to depend both on tumor type
and on class of treatments
Experience in the CAST trial demonstrated that
treatment with encainide/flecainide clearly
reduced the incidence of arrhythmias, but
increased mortality
Similar results in context of treating atrial
fibrillation
Blood pressure as surrogate: effect translates
to clinical benefit for some drug classes, but
not others

58
What can make biomarkers unreliable?

Biomarker not on causal pathway of disease
process
Several pathways: intervention affects the one
mediated through the biomarker, but not others
(redundancy)
Biomarker not on the pathway affected by the
intervention, or is insensitive to treatment
effect
Intervention has mechanisms of action unrelated
to the disease process (aka "the law of
unintended consequences")
Failure of either type is possible - biomarker
could falsely predict, or fail to predict,
clinical benefit

59
What can make biomarkers unreliable?

Other potential contributing factors include:
Measurement difficulties due to rater effects
GNE experience (?-interferon in renal cell
carcinoma) strongly supports advisability of
blinded tumor evaluation by a single central
review board (avoid bias, minimize center
differences)
Measurement difficulties arising from sample
preparation, transport, storage, and handling
Time constraints in assaying fresh blood,
possible effects of activation of T-cells, lack
of standardization of FACS assay protocols and
reporting methods, heterogeneity of tumor
samples, center differences (use of local or
central labs)

60
What can make biomarkers unreliable?

Other potential assay-related difficulties
include -
Matrix effects
Interference by other proteins can affect assay
specificity and/or sensitivity
Development of antibodies
Can be hard to detect, harder to quantify
reliably, and extremely difficult to assess for
clinical significance, if any
Inter-laboratory differences
Can be large enough to make biomarker data
uninterpretable

61
Biomarkers - editorial comments

Avoid the "what we can measure is what we should
measure" fallacy
Experience with imaging-based biomarkers to date
has been disappointing
Non-targeted genomic assays (e.g. microarrays
followed by data mining) have the potential for
much wasted effort
Avoid the "rearranging the deckchairs on the
Titanic" fix, e.g. straining to improve assay
precision from a CV of 20% to 15% when the
within-subject CV for the marker is 40% and the
inter-subject CV is 50%.
Cytokines make particularly treacherous
biomarkers
Proteomics is not for sissies
Distinguish between "must know" and
"nice-to-know"
An understanding of mechanism of action may be
nice to know, but is not a requirement for drug
approval
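
The "deckchairs" point about assay precision can be made concrete with a
little arithmetic. The sketch below treats the quoted CVs (20 or 15 for
the assay, 40 within-subject, 50 inter-subject) as percentages and
assumes the three sources of variation are independent, so that their
squared CVs add approximately. That additivity is an assumption made
for illustration, not something stated on the slide.

```python
import math

def total_cv(assay_cv, within_subject_cv, between_subject_cv):
    """Approximate overall CV (%) from independent variance components."""
    return math.sqrt(assay_cv ** 2 + within_subject_cv ** 2 + between_subject_cv ** 2)

cv_before = total_cv(20.0, 40.0, 50.0)   # assay CV of 20%
cv_after = total_cv(15.0, 40.0, 50.0)    # assay CV improved to 15%
relative_gain = (cv_before - cv_after) / cv_before

print(f"overall CV before: {cv_before:.1f}%")
print(f"overall CV after:  {cv_after:.1f}%")
print(f"relative reduction in overall CV: {relative_gain:.1%}")
```

Under these assumptions the overall variability barely moves, because
it is dominated by the biological components rather than the assay.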

62
Personal opinions (tongue in cheek)

If the word "cascade" appears in the description
of the disease process, all bets are off
The topic of biomarkers seems to drive otherwise
thoughtful researchers to an irrational frenzy of
wishful thinking
The message so eloquently expounded by Jagger et
al. remains as relevant today as it was in 1969
Lasagna's Law already militates against rapid
accrual of eligible subjects to clinical trials
To slow recruitment from a trickle to a complete
grinding halt, only two words are needed in the
protocol: "serial biopsy"

63
Biomarkers - general conclusions

Utility of a particular biomarker depends not
only on the disease, but also on the nature of
the therapeutic intervention
Validation of any candidate biomarker must
necessarily be considered on a case-by-case basis
Validity of a marker for a given drug class may
not transfer to other drug classes for the same
disease
Success is most likely when intervention clearly
affects the biomarker, whose role in the disease
process is well-established and clearly
understood
Validation of a putative marker cannot happen
without ultimately generating the required
clinical outcome data
Regulatory conservatism is to be expected, and
seems appropriate