1. Revising FDA's Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests
- FDA/Industry Statistics Workshop
- September 28-29, 2006
- Kristen Meier, Ph.D.
- Mathematical Statistician, Division of Biostatistics, Office of Surveillance and Biometrics
- Center for Devices and Radiological Health, FDA
2. Outline
- Background of guidance development
- Overview of comments
- STARD Initiative and definitions
- Choice of comparative benchmark and implications
- Agreement measures pitfalls
- Bias
- Estimating performance without a perfect reference standard: latest research
- Reporting recommendations
3. Background
- Motivated by CDC concerns with IVDs for sexually transmitted diseases
- Joint meeting of four FDA device panels (2/11/98): Hematology/Pathology, Clinical Chemistry/Toxicology, Microbiology, and Immunology
- Charge: provide recommendations on appropriate data collection, analysis, and resolution of discrepant results, using sound scientific and statistical analysis, to support indications for use of in vitro diagnostic devices when the new device is compared to another device, a recognized reference method or gold standard, other procedures not commonly used, and/or clinical criteria for diagnosis
4. Statistical Guidance Developed
- Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests; Draft Guidance for Industry and FDA Reviewers
- Issued Mar. 12, 2003 with a 90-day comment period
- http://www.fda.gov/cdrh/osb/guidance/1428.html
- For all diagnostic products, not just in vitro diagnostics
- Only addresses diagnostic devices with 2 possible outcomes (positive/negative)
- Does not address design and monitoring of clinical studies for diagnostic devices
5. Dichotomous Diagnostic Test Performance
- Study population cross-classified against TRUTH:

                 Truth +               Truth −
  New   Test +   TP (true positive)    FP (false positive)
  Test  Test −   FN (false negative)   TN (true negative)

- Estimate:
  - sensitivity (sens) = Pr(Test + | Truth +), estimated by 100·TP/(TP + FN)
  - specificity (spec) = Pr(Test − | Truth −), estimated by 100·TN/(FP + TN)
- Perfect test: sens = spec = 100 (FP = FN = 0)
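As an added illustration (not part of the original slides), a minimal Python sketch of these two estimators:

```python
def sens_spec(tp, fp, fn, tn):
    """Estimated sensitivity and specificity, in percent, from a 2x2
    table of a dichotomous test cross-classified against truth."""
    sens = 100.0 * tp / (tp + fn)  # Pr(Test + | Truth +)
    spec = 100.0 * tn / (fp + tn)  # Pr(Test - | Truth -)
    return sens, spec
```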
6. Example Data: 220 Subjects
- New test (rows) cross-classified against each benchmark (columns):

           TRUTH                        Imperfect Standard
           +      −                        +      −
  Test +   44     1              Test +    40     5
  Test −   7      168            Test −    4      171
  total    51     169            total     44     176

- Unbiased estimates: sens 86.3 (44/51), spec 99.4 (168/169)
- Biased estimates: sens 90.9 (40/44), spec 97.2 (171/176)
- Misclassification bias (see Begg 1987)
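Plugging both tables into the hypothetical `sens_spec` helper sketched after slide 5 reproduces the two columns:

```python
# TRUTH as the benchmark: unbiased estimates
print(sens_spec(tp=44, fp=1, fn=7, tn=168))  # approx. (86.3, 99.4)

# Imperfect standard as the benchmark: biased estimates
print(sens_spec(tp=40, fp=5, fn=4, tn=171))  # approx. (90.9, 97.2)
```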
7. Recalculation of Performance Using Discrepant Resolution
- STAGE 1: retest the discordant results using a resolver test
- STAGE 2: revise the 2x2 table based on the resolver result

           Imperfect Standard               Resolver/imperfect std.
           +             −                       +      −
  Test +   40            5 (5+, 0−)    Test +   45     0
  Test −   4 (1+, 3−)    171           Test −   1      174
  total    44            176           total    46     174

- sens: 90.9 (40/44) → 97.8 (45/46)
- spec: 97.2 (171/176) → 100 (174/174)
- Assumes concordant = correct
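The revision arithmetic can be spelled out; a sketch of the slide's numbers, under discrepant resolution's defining assumption that only discordant cells are retested:

```python
a, b, c, d = 40, 5, 4, 171  # new test (rows) vs. imperfect standard (columns)
# Resolver results for the discordant cells only:
# the 5 subjects in cell b resolve as 5 positive, 0 negative;
# the 4 subjects in cell c resolve as 1 positive, 3 negative.
a, b = a + 5, b - 5         # 45, 0: resolved positives join the +/+ cell
c, d = c - 3, d + 3         # 1, 174: resolved negatives join the -/- cell
print(100 * a / (a + c))    # 97.8, up from 90.9
print(100 * d / (b + d))    # 100.0, up from 97.2
```

Because the concordant cells are never retested, any shared errors stay hidden and subjects only ever move into agreement cells, so the recalculated estimates can only go up.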
8. Topics for Guidance
- Realization:
  - The problems are much larger than discrepant resolution
  - The 2x2 table is an oversimplification, but still a useful starting point
- Provide guidance:
  - What constitutes truth?
  - What do we do if we don't know truth?
  - What name do we give performance measures when we don't have truth?
  - Describing the study design: how were subjects, specimens, measurements, and labs collected/chosen?
9. Comments on Guidance
- FDA received comments from 11 individuals/organizations:
- Provide guidance on what constitutes a perfect standard
- Remove the perfect/imperfect standard concept; include and define the reference/non-reference standard concept (STARD)
  - Reference and use STARD concepts
- Provide an approach for indeterminate, inconclusive, equivocal, etc. results
  - Minimal recommendations
- Discuss methods for estimating sens and spec when a perfect reference standard is not used
  - Cite new literature
- Include more discussion on bias, including verification bias
  - Some discussion added; add more references
- Add glossary
10. STARD Initiative
- STAndards for Reporting of Diagnostic accuracy Initiative
- An effort by an international working group to improve the quality of reporting of studies of diagnostic accuracy
- A checklist of 25 items to include when reporting results
- Provides definitions for terminology
- http://www.consort-statement.org/stardstatement.htm
11. STARD Definitions Adopted
- Purpose of a qualitative diagnostic test: to determine whether a target condition is present or absent in a subject from the intended use population
- Target condition (condition of interest): can refer to a particular disease, a disease stage, health status, or any other identifiable condition within a patient, such as staging a disease already known to be present, or a health condition that should prompt clinical action, such as the initiation, modification, or termination of treatment
- Intended use population (target population): those subjects/patients for whom the test is intended to be used
12. Reference Standard (STARD)
- Moves away from the notion of a fixed, theoretical Truth
- Considered to be the best available method for establishing the presence or absence of the target condition; it can be a single test or method, or a combination of methods and techniques, including clinical follow-up
- Dichotomous: divides the intended use population into condition present or absent
- Does not consider the outcome of the new test under evaluation
13. Reference Standard (FDA)
- What constitutes the best available method/reference method?
  - Opinion and practice within the medical, laboratory, and regulatory community
  - Several possible methods could be considered
  - Maybe no consensus reference standard exists
  - Maybe a reference standard exists, but for a non-negligible part of the intended use population the reference standard is known to be in error
- FDA ADVICE:
  - Consult with FDA on the choice of reference standard before beginning your study
  - Performance measures must be interpreted in context: report the reference standard along with the performance measures
14. Benchmarks for Assessing Diagnostic Performance
- NEW: FDA recognizes 2 major categories of benchmarks:
  - Reference standard (STARD)
  - Non-reference standard (a method or predicate other than a reference standard; 510(k) regulations)
- OLD: the perfect standard, imperfect standard, and gold standard concepts and terms were deleted
- The choice of comparative method determines which performance measures can be reported
15. Comparison with Benchmark
- If a reference standard is available: use it
- If a reference standard is available but impractical: use it to the extent possible
- If a reference standard is not available or unacceptable for your situation: consider constructing one
- If a reference standard is not available and cannot be constructed: use a non-reference standard and report agreement
16. Naming Performance Measures Depends on Benchmarks
- Terminology is important: it helps ensure correct interpretation
- Reference standard (STARD):
  - A lot of literature on studies of diagnostic accuracy (Pepe 2003, Zhou et al. 2002)
  - Report sensitivity, specificity (and corresponding CIs), and predictive values of positive and negative results
- Non-reference standard (due to 510(k) regulations):
  - Report positive percent agreement and negative percent agreement
  - NEW: include corresponding CIs (consider score CIs; a score-interval sketch follows this slide)
  - Interpret with care: many pitfalls!
17. Agreement
- Study population: new test (rows) cross-classified against the non-reference standard (columns):

           Non-Reference Standard
           +      −
  Test +   a      b
  Test −   c      d

- Positive percent agreement (new vs. non-ref. std.) = 100·a/(a + c)
- Negative percent agreement (new vs. non-ref. std.) = 100·d/(b + d)
- Overall percent agreement = 100·(a + d)/(a + b + c + d)
- Perfect new test: PPA ≠ 100 and NPA ≠ 100, in general (the non-reference standard itself errs)
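A direct transcription of the three agreement formulas as a Python sketch (the helper name is mine):

```python
def agreement(a, b, c, d):
    """Percent agreement of a new test (rows) with a non-reference
    standard (columns), from the 2x2 table above."""
    ppa = 100.0 * a / (a + c)        # positive percent agreement
    npa = 100.0 * d / (b + d)        # negative percent agreement
    overall = 100.0 * (a + d) / (a + b + c + d)
    return ppa, npa, overall
```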
18. Pitfalls of Agreement
- Agreement as defined here is not symmetric: the calculation is different depending on which marginal total you use for the denominator (a small sketch follows this slide)
- Overall percent agreement is symmetric, but can be misleading (very different 2x2 data can give the same overall agreement)
- Agreement ≠ correct
- Overall agreement, PPA, and NPA can change (possibly a lot) depending on the prevalence (the relative frequency of the target condition in the intended use population)
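To make the asymmetry concrete, here is the slide-6 table with the positive cell divided by each of its two marginal totals:

```python
a, b, c, d = 40, 5, 4, 171  # new test (rows) vs. non-reference standard (columns)
# Denominator taken from the non-reference standard's positive column:
print(100 * a / (a + c))    # 90.9
# Denominator taken from the new test's positive row instead:
print(100 * a / (a + b))    # 88.9: a different number, so state which was used
```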
19. Overall Agreement Misleading

           Non-Ref Standard              Non-Ref Standard
           +      −                        +      −
  Test +   40     1              Test +    40     19
  Test −   19     512            Test −    1      512
  total    59     513            total     41     531

- Overall agreement: 96.5 for both tables ((40 + 512)/572)
- PPA 67.8 (40/59) vs. PPA 97.6 (40/41)
- NPA 99.8 (512/513) vs. NPA 96.4 (512/531)
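Running the `agreement` sketch from slide 17 on both tables makes the pitfall explicit: identical overall agreement, very different PPA and NPA.

```python
print(agreement(40, 1, 19, 512))   # PPA 67.8, NPA 99.8, overall 96.5
print(agreement(40, 19, 1, 512))   # PPA 97.6, NPA 96.4, overall 96.5
```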
20. Agreement ≠ Correct
- Original data: new test (rows) vs. non-reference standard (columns):

           Non-Reference Standard
           +      −
  Test +   40     5
  Test −   4      171

- Stratify the data above by reference standard outcome:

           Reference Std +               Reference Std −
           Non-Ref Std                   Non-Ref Std
           +      −                        +      −
  Test +   39     5              Test +    1      0
  Test −   1      6              Test −    3      165

- The tests agree and are wrong for 6 + 1 = 7 subjects
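A short sketch of the stratified count; each stratum is a 2x2 of the new test (rows) against the non-reference standard (columns):

```python
ref_pos = [[39, 5], [1, 6]]    # stratum with reference standard +
ref_neg = [[1, 0], [3, 165]]   # stratum with reference standard -
# The tests agree on the diagonal cells; that agreement is wrong when
# the shared call contradicts the stratum's reference standard.
both_negative_but_condition_present = ref_pos[1][1]  # 6 subjects
both_positive_but_condition_absent = ref_neg[0][0]   # 1 subject
print(both_negative_but_condition_present + both_positive_but_condition_absent)  # 7
```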
21. Bias
- Unknown and non-quantified uncertainty
- Often the existence, size (magnitude), and direction of a bias cannot be determined
- Increasing the overall number of subjects reduces statistical uncertainty (confidence interval widths) but may do nothing to reduce bias
22. Some Types of Bias
- Error in the reference standard
- Using the test under evaluation to establish the diagnosis
- Spectrum bias: the right subjects are not chosen
- Verification bias: only a non-representative subset of subjects is evaluated by the reference standard, and no statistical adjustments are made to the estimates
- Many other types of bias exist
- See Begg (1987), Pepe (2003), Zhou et al. (2002)
23. Estimating Sens and Spec Without a Reference Standard
- Model-based approaches: latent class models and Bayesian models; see Pepe (2003) and Zhou et al. (2002)
- Albert and Dodd (2004):
  - An incorrect model leads to biased sens and spec estimates
  - Different models can fit the data equally well, yet produce very different estimates of sens and spec
- FDA concerns and recommendations:
  - It is difficult to verify that the model and assumptions are correct
  - Try a range of models and assumptions and report the range of results (a sketch of one such model follows this slide)
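The slides name latent class models without detail; purely as an illustration, here is a minimal EM fit of a two-class latent class model for three or more binary tests under conditional independence (the function and its defaults are assumptions of this sketch, not the cited authors' code):

```python
import numpy as np

def latent_class_em(x, n_iter=500, seed=0):
    """EM for a 2-class latent class model assuming the tests are
    conditionally independent given the unobserved condition.
    x: (subjects, tests) array of 0/1 results.
    Returns estimated prevalence, sensitivities, specificities."""
    rng = np.random.default_rng(seed)
    prev = 0.5
    # Start sens/spec above 0.5 to avoid the label-switched solution.
    sens = rng.uniform(0.6, 0.9, x.shape[1])
    spec = rng.uniform(0.6, 0.9, x.shape[1])
    for _ in range(n_iter):
        # E-step: posterior probability that each subject has the condition
        p1 = prev * np.prod(sens**x * (1 - sens)**(1 - x), axis=1)
        p0 = (1 - prev) * np.prod((1 - spec)**x * spec**(1 - x), axis=1)
        w = p1 / (p1 + p0)
        # M-step: weighted updates of prevalence, sens, and spec
        prev = w.mean()
        sens = (w[:, None] * x).sum(axis=0) / w.sum()
        spec = ((1 - w)[:, None] * (1 - x)).sum(axis=0) / (1 - w).sum()
    return prev, sens, spec
```

Per Albert and Dodd (2004), models with different dependence structures can fit such data equally well yet move these estimates substantially, hence the report-a-range advice.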
24. Reference Standard Outcomes on a Subset
- Albert and Dodd (2006, under review):
  - Use information from verified and non-verified subjects
  - Choosing between competing models is easier
  - Explore subset choice (random, test dependent)
- Albert (2006, under review):
  - Estimation via imputation
  - Study design implications (Albert, 2006)
- Kondratovich (2003; 2002-Mar-8 FDA Microbiology Devices Panel Meeting):
  - Estimation via imputation (a simple sketch follows this slide)
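As one concrete form of estimation via imputation, here is a sketch in the spirit of the Begg and Greenes (1983) correction (my example, not necessarily the cited authors' exact method): disease status for unverified subjects is imputed from verified subjects with the same test result, assuming verification depends only on the test result.

```python
def verification_corrected(n_pos, n_neg, v_pos_d, v_pos_nd, v_neg_d, v_neg_nd):
    """Sens/spec (percent) corrected for verification bias.
    n_pos, n_neg:       all tested subjects, by new-test result
    v_pos_d, v_pos_nd:  verified test-positives with/without the condition
    v_neg_d, v_neg_nd:  verified test-negatives with/without the condition"""
    # Pr(condition | test result), estimated from the verified subset
    p_d_pos = v_pos_d / (v_pos_d + v_pos_nd)
    p_d_neg = v_neg_d / (v_neg_d + v_neg_nd)
    # Impute the expected condition-positive counts among all tested subjects
    d_pos, d_neg = n_pos * p_d_pos, n_neg * p_d_neg
    sens = 100 * d_pos / (d_pos + d_neg)
    spec = 100 * (n_neg - d_neg) / ((n_pos - d_pos) + (n_neg - d_neg))
    return sens, spec
```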
25. Practices to Avoid
- Using the terms sensitivity and specificity if a reference standard is not used
- Discarding equivocal results in data presentations and calculations
- Using data altered or updated by discrepant resolution
- Using the new test as part of the comparative benchmark
26. External Validity
- A study has high external validity if the study results are sufficiently reflective of the real-world performance of the device in the intended use population
27. External Validity
- FDA recommends:
  - Include appropriate subjects and/or specimens
  - Use the final version of the device according to the final instructions for use
  - Use several of these devices in your study
  - Include multiple users with relevant training and a range of expertise
  - Cover a range of expected use and operating conditions
28. Reporting Recommendations
- CRITICAL: provide sufficient detail to allow assessment of potential bias and external validity
- Just as important as computing CIs correctly, and perhaps more so
- See the guidance for specific recommendations
29. References
- Albert, P.S. (2006). Imputation approaches for estimating diagnostic accuracy for multiple tests from partially verified designs. Technical Report 042, Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute (http://linus.nci.nih.gov/brb/TechReport.htm).
- Albert, P.S., Dodd, L.E. (2004). A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60, 427-435.
- Albert, P.S., Dodd, L.E. (2006). On estimating diagnostic accuracy with multiple raters and partial gold standard evaluation. Technical Report 041, Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute (http://linus.nci.nih.gov/brb/TechReport.htm).
- Begg, C.B. (1987). Biases in the assessment of diagnostic tests. Statistics in Medicine, 6, 411-423.
- Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., de Vet, H.C.W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Clinical Chemistry, 49(1), 1-6. (Also appears in Annals of Internal Medicine (2003) 138(1), W1-12, and in British Medical Journal (2003) 326(7379), 41-44.)
30. References (continued)
- Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Moher, D., Rennie, D., de Vet, H.C.W., Lijmer, J.G. (2003). The STARD statement for reporting studies of diagnostic accuracy: Explanation and elaboration. Clinical Chemistry, 49(1), 7-18. (Also appears in Annals of Internal Medicine (2003) 138(1), W1-12, and in British Medical Journal (2003) 326(7379), 41-44.)
- Lang, T.A., Secic, M. (1997). How to Report Statistics in Medicine. Philadelphia: American College of Physicians.
- Kondratovich, M. (2003). Verification bias in the evaluation of diagnostic devices. Proceedings of the 2003 Joint Statistical Meetings, Biopharmaceutical Section, San Francisco, CA.
- Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press.
- Zhou, X.H., Obuchowski, N.A., McClish, D.K. (2002). Statistical Methods in Diagnostic Medicine. New York: John Wiley & Sons.