Title: Assessing agreement for diagnostic devices
1 Assessing agreement for diagnostic devices
- FDA/Industry Statistics Workshop
- September 28-29, 2006
- Bipasa Biswas
- Mathematical Statistician, Division of
Biostatistics - Office of Surveillance and Biometrics
- Center for Devices and Radiological Health, FDA
- No official support or endorsement by the Food
and Drug Administration of this presentation is
intended or should be inferred
2 Outline
- Accuracy measures for diagnostic tests with a dichotomous outcome. Ideal world: tests with a reference standard.
  - Two indices to measure accuracy: Sensitivity and Specificity
- Assessing agreement between two tests in the absence of a reference standard.
  - Overall agreement
  - Cohen's Kappa
  - McNemar's test
  - Proposed remedy
- Extending agreement to tests with more than 2 outcomes.
  - Cohen's Kappa
  - Extension to the Random Marginal Agreement Coefficient (RMAC)
  - Should agreement per cell be reported?
3 Ideal World - Tests with perfect reference standard (single test)
- If a perfect reference standard exists to classify patients as diseased (D) versus not diseased (D-), then we can represent the data as:

                    True Status
  Test              D           D-          Total
  T+                TP          FP          TP+FP
  T-                FN          TN          FN+TN
  Total             TP+FN       FP+TN       TP+FP+FN+TN

- If the true status of the disease is known, then we can estimate Se = TP/(TP+FN) and Sp = TN/(TN+FP).
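As a quick illustration (not part of the original slides), a minimal Python sketch of these two estimates; the counts in the usage line are taken from the slide 5 example for test T:

```python
def se_sp(tp, fp, fn, tn):
    """Sensitivity and specificity from a 2x2 table against a perfect reference standard."""
    se = tp / (tp + fn)   # proportion of diseased subjects the test calls positive
    sp = tn / (tn + fp)   # proportion of non-diseased subjects the test calls negative
    return se, sp

# Test T from the slide 5 example: 85/100 diseased positive, 795/900 non-diseased negative
print(se_sp(tp=85, fp=105, fn=15, tn=795))   # (0.85, 0.8833...)
```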
4 Ideal World - Tests with perfect reference standard (comparing two tests)
- McNemar's test to test equality of either sensitivity or specificity.

  Disease (D):
                    Comparator test
  New test          R+          R-          Total
  T+                a1          b1          a1+b1
  T-                c1          d1          c1+d1
  Total             a1+c1       b1+d1       a1+b1+c1+d1

  No Disease (D-):
                    Comparator test
  New test          R+          R-          Total
  T+                a2          b2          a2+b2
  T-                c2          d2          c2+d2
  Total             a2+c2       b2+d2       a2+b2+c2+d2

- McNemar Chi-square:
  - To check equality of the sensitivities of the two tests: (|b1 - c1| - 1)^2 / (b1 + c1)
  - To check equality of the specificities of the two tests: (|c2 - b2| - 1)^2 / (c2 + b2)
5 Ideal World - Tests with perfect reference standard (comparing two tests)
- Example

  Disease (D):
                    Comparator test
  New test          R+          R-          Total
  T+                80          5           85
  T-                10          5           15
  Total             90          10          100

  No Disease (D-):
                    Comparator test
  New test          R+          R-          Total
  T+                85          20          105
  T-                5           790         795
  Total             90          810         900

- SeT = 85.0% (85/100), SpT = 88.3% (795/900)
- SeR = 90.0% (90/100), SpR = 90.0% (810/900)
- McNemar Chi-square to check equality of the sensitivities of the two tests: (|5 - 10| - 1)^2 / (5 + 10) = 1.07
- p-value = 0.30
- 95% CI for the difference in sensitivities: (-13.5%, 3.5%)
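A short Python check (illustrative, not from the slides) of the continuity-corrected McNemar statistic defined on slide 4, applied to the discordant sensitivity counts above (b1 = 5, c1 = 10):

```python
from scipy.stats import chi2

def mcnemar_cc(b, c):
    """Continuity-corrected McNemar chi-square for paired discordant counts b and c."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)   # statistic and two-sided p-value (1 df)

stat, p = mcnemar_cc(5, 10)            # discordant pairs among the 100 diseased subjects
print(round(stat, 2), round(p, 2))     # ~1.07, p ~0.30, matching the slide
```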
6 McNemar's test when a reference standard exists
- Note, however, that McNemar's test only checks for equality: the null hypothesis is one of equality (equivalence) and the alternative hypothesis is one of difference. This is not an appropriate hypothesis framework, because a failure to find a statistically significant difference is then naively interpreted as evidence of equivalence.
- The 95% confidence interval of the difference in sensitivities and specificities gives a better idea of the difference between the two tests.
7 Imperfect reference standard
- A subject's true disease status is seldom known with certainty.
- What is the effect on sensitivity and specificity when the comparator test R itself has error?

                    Imperfect reference test (comparator test)
  New test          R+          R-          Total
  T+                a           b           a+b
  T-                c           d           c+d
  Total             a+c         b+d         a+b+c+d
8 Imperfect reference standard
- Example 1: Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R (the comparator test) which misses 20% of the diseased subjects but never falsely indicates disease.

  True Status:
              D           D-          Total
  T+          80          30          110
  T-          20          70          90
  Total       100         100         200

  Cross-classified against the imperfect reference test R:
              R+          R-          Total
  T+          64          46          110
  T-          16          74          90
  Total       80          120         200

- Se = 80/100 = 80.0%        Se (relative to R) = 64/80 = 80.0%
- Sp = 70/100 = 70.0%        Sp (relative to R) = 74/120 ≈ 62%
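The cross-classification against R above can be reproduced by a short calculation, under the assumption (implicit in Example 1) that the errors of T and R are independent given true disease status; the sketch below is illustrative and not part of the slides. The same function with a perfect T and Se = Sp = 0.9 for R reproduces Example 3 on slide 10; Example 2, with correlated errors, does not fit this independence assumption.

```python
def cross_classify(n_d, n_nd, se_t, sp_t, se_r, sp_r):
    """Expected T-by-R counts when T and R err independently given true disease status.

    n_d, n_nd  : number of diseased and non-diseased subjects
    se_*, sp_* : sensitivity and specificity of the new test T and the reference R
    Returns (a, b, c, d) = counts for (T+R+, T+R-, T-R+, T-R-).
    """
    a = n_d * se_t * se_r             + n_nd * (1 - sp_t) * (1 - sp_r)
    b = n_d * se_t * (1 - se_r)       + n_nd * (1 - sp_t) * sp_r
    c = n_d * (1 - se_t) * se_r       + n_nd * sp_t * (1 - sp_r)
    d = n_d * (1 - se_t) * (1 - se_r) + n_nd * sp_t * sp_r
    return a, b, c, d

# Example 1: T has Se 0.80 / Sp 0.70; R misses 20% of diseased but never calls a false positive
a, b, c, d = cross_classify(100, 100, 0.80, 0.70, 0.80, 1.00)
print(a, b, c, d)                        # 64, 46, 16, 74 -- the table above
print("Se relative to R:", a / (a + c))  # 0.80
print("Sp relative to R:", d / (b + d))  # ~0.617
```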
9 Imperfect reference standard
- Example 2: Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R which misses 20% of the diseased subjects, but the errors in R are related to the errors in T.

  True Status:
              D           D-          Total
  T+          80          30          110
  T-          20          70          90
  Total       100         100         200

  Cross-classified against the imperfect reference test R:
              R+          R-          Total
  T+          80          30          110
  T-          0           90          90
  Total       80          120         200

- Se = 80/100 = 80.0%        Se (relative to R) = 80/80 = 100.0%
- Sp = 70/100 = 70.0%        Sp (relative to R) = 90/120 = 75.0%
10 Imperfect reference standard
- Example 3: Now suppose our new test is perfect, that is, it has 100% sensitivity and 100% specificity, but the imperfect reference test R has only 90% sensitivity and 90% specificity.

  True Status:
              D           D-          Total
  T+          100         0           100
  T-          0           100         100
  Total       100         100         200

  Cross-classified against the imperfect reference test R:
              R+          R-          Total
  T+          90          10          100
  T-          10          90          100
  Total       100         100         200

- Se = 100/100 = 100.0%       Se (relative to R) = 90/100 = 90.0%
- Sp = 100/100 = 100.0%       Sp (relative to R) = 90/100 = 90.0%
11 Challenges in assessing agreement in the absence of a reference standard
- Two commonly used overall measures are:
  - Overall agreement measure
  - Cohen's Kappa
- McNemar's test is also commonly used.
- Instead, report positive percent agreement (PPA) and negative percent agreement (NPA).
12 Estimate of Agreement
- The overall percent agreement can be calculated as 100 x (a+d)/(a+b+c+d).
- The overall percent agreement, however, does not differentiate between agreement on the positives and agreement on the negatives.
- Instead of overall agreement, report positive percent agreement (PPA) with respect to the imperfect reference standard positives and negative percent agreement (NPA) with respect to the imperfect reference standard negatives (reference: Feinstein et al.).
  - PPA = 100 x a/(a+c)
  - NPA = 100 x d/(b+d)
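A compact Python sketch (illustrative, not from the slides) of overall agreement, PPA, and NPA for a 2x2 table, with exact (Clopper-Pearson) 95% intervals computed from the beta distribution; the choice of exact intervals is my assumption, though it appears consistent with the intervals quoted on slide 14.

```python
from scipy.stats import beta

def exact_ci(x, n, level=0.95):
    """Clopper-Pearson (exact) confidence interval for a binomial proportion x/n."""
    alpha = 1 - level
    lo = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lo, hi

def agreement(a, b, c, d):
    """Overall agreement, PPA, and NPA (with exact 95% CIs) for a new test T (rows)
    cross-classified against an imperfect reference R (columns)."""
    overall = (a + d) / (a + b + c + d)
    ppa, ppa_ci = a / (a + c), exact_ci(a, a + c)
    npa, npa_ci = d / (b + d), exact_ci(d, b + d)
    return overall, (ppa, ppa_ci), (npa, npa_ci)

# Table 1 from slide 14: a=5, b=5, c=5, d=85
print(agreement(5, 5, 5, 85))   # overall 0.90, PPA 0.50 (0.19, 0.81), NPA 0.94 (0.87, 0.98)
```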
13 Why not report just the overall percent agreement?
- The overall percent agreement is insensitive to off-diagonal imbalance.

                    Imperfect reference test
  New test          R+          R-          Total
  T+                70          15          85
  T-                0           15          15
  Total             70          30          100

- The overall percent agreement is 85.0%, and yet it does not account for the off-diagonal imbalance. The PPA is 100% and the NPA is only 50%.
14 Why report both PPA and NPA?

  Table 1 (imperfect reference test):
  New test          R+          R-          Total
  T+                5           5           10
  T-                5           85          90
  Total             10          90          100

  Table 2 (imperfect reference test):
  New test          R+          R-          Total
  T+                35          5           40
  T-                5           55          60
  Total             40          60          100

- Table 1: Overall pct. agreement = 90.0%; PPA = 50.0% (5/10), 95% CI (18.7, 81.3); NPA = 94.4% (85/90), 95% CI (87.5, 98.2)
- Table 2: Overall pct. agreement = 90.0%; PPA = 87.5% (35/40), 95% CI (73.2, 95.8); NPA = 91.7% (55/60), 95% CI (81.6, 97.2)
15 Kappa measure of agreement
- Kappa is defined as the difference between observed and expected agreement expressed as a fraction of the maximum possible difference, and it ranges from -1 to 1.

                    Imperfect reference standard
  New test          R+          R-          Total
  T+                a           b           a+b
  T-                c           d           c+d
  Total             a+c         b+d         n = a+b+c+d

- kappa = (Io - Ie)/(1 - Ie), where Io = (a+d)/n and Ie = [(a+c)(a+b) + (b+d)(c+d)]/n^2
16 Kappa measure of agreement

                    Imperfect reference standard
  New test          R+          R-          Total
  T+                35          15          50
  T-                15          35          50
  Total             50          50          100

- Io = 70/100 = 0.70, Ie = [(50)(50) + (50)(50)]/10000 = 0.50
- kappa = (0.70 - 0.50)/(1 - 0.50) = 0.40
- 95% CI (0.22, 0.58)
- By the way, the overall percent agreement is 70.0%.
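An illustrative Python sketch (not from the slides) of the kappa formula on slide 15, checked against this table; the second call previews the slide 17 table with its larger marginal imbalance. The confidence intervals quoted on the slides are not reproduced here, since the slides do not state which variance formula was used.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table of new test T (rows) vs. reference R (columns)."""
    n = a + b + c + d
    io = (a + d) / n                                      # observed agreement
    ie = ((a + c) * (a + b) + (b + d) * (c + d)) / n**2   # chance-expected agreement
    return (io - ie) / (1 - ie)

print(cohens_kappa(35, 15, 15, 35))   # 0.40 (this slide)
print(cohens_kappa(35, 30, 0, 35))    # ~0.45 (slide 17), despite the unbalanced margins
```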
17 Is the Kappa measure of agreement sensitive to off-diagonal imbalance?

                    Imperfect reference test
  New test          R+          R-          Total
  T+                35          30          65
  T-                0           35          35
  Total             35          65          100

- Kappa = 0.45, 95% CI (0.31, 0.59)
- Although the overall agreement stayed the same (70%) and the marginal differences are much bigger than before, the kappa agreement index suggests otherwise (it increased from 0.40 to 0.45).
- The kappa statistic is affected by the marginal totals even though the overall agreement is the same.
18 McNemar's test to check for equality in the absence of a reference standard
- Hypothesis: equality of the rates of positive response.

                    Imperfect reference test
  New test          R+          R-          Total
  T+                37          30          67
  T-                5           28          33
  Total             42          58          100

- McNemar Chi-square = (|b - c| - 1)^2/(b + c) = (|30 - 5| - 1)^2/(30 + 5) = 16.46
- Two-sided p-value = 0.00005
19 McNemar's test (insensitivity to the main diagonal)

                    Imperfect reference test
  New test          R+          R-          Total
  T+                3700        30          3730
  T-                5           2800        2805
  Total             3705        2830        6535

- Same p-value as when a = 37 and d = 28, even though the new and the old test now agree on 99.5% of individual cases.
20 McNemar's test (insensitivity to the main diagonal)

                    Imperfect reference test
  New test          R+          R-          Total
  T+                0           19          19
  T-                18          0           18
  Total             18          19          37

- Two-sided p-value = 1, even though the old and new test agree on no cases.
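To make the insensitivity concrete, here is an illustrative check (not part of the slides) that reuses the continuity-corrected McNemar sketch from slide 5 on the three tables of slides 18-20:

```python
from scipy.stats import chi2

def mcnemar_cc(b, c):
    """Continuity-corrected McNemar chi-square on discordant counts b (T+R-) and c (T-R+)."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

# The statistic depends only on the discordant cells b and c, so the main-diagonal
# (agreement) cells never enter the calculation.
print(mcnemar_cc(30, 5))    # slide 18: stat ~16.46, p ~0.00005
print(mcnemar_cc(30, 5))    # slide 19: identical, although 99.5% of cases now agree
print(mcnemar_cc(19, 18))   # slide 20: stat 0, p = 1, although no cases agree
```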
21 Proposed remedy
- Instead of reporting the overall agreement, kappa, or the McNemar's test p-value, report both the positive percent agreement and the negative percent agreement.
- In the 510(k) paradigm, where a new device is compared to an already marketed device, the positive percent agreement and the negative percent agreement are relative to the comparator device, which is appropriate.
22 Agreement of tests with more than two outcomes
- For example, in radiology one often compares the standard film mammogram to a digital mammogram, where the radiologists assign a score from 1 (negative finding) to 5 (highly suggestive of malignancy) depending on severity.
- The article by Fay (2005) in Biostatistics proposes a random marginal agreement coefficient (RMAC), which uses a different adjustment for chance than the standard agreement coefficient (Cohen's Kappa).
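For reference, Cohen's kappa generalizes directly to a k x k table. A minimal Python sketch follows (illustrative, not from the slides; the 5 x 5 table is made up and merely stands in for a film vs. digital mammography cross-classification):

```python
import numpy as np

def cohens_kappa_multi(table):
    """Cohen's kappa for a k x k contingency table (rows: test 1, columns: test 2)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    po = np.trace(table) / n                              # observed agreement
    pe = (table.sum(axis=1) @ table.sum(axis=0)) / n**2   # chance agreement from the margins
    return (po - pe) / (1 - pe)

# Hypothetical 5 x 5 cross-classification of scores 1-5 (film rows, digital columns)
scores = [[30,  5,  2,  0,  0],
          [ 4, 20,  6,  1,  0],
          [ 1,  5, 15,  4,  1],
          [ 0,  1,  4, 10,  3],
          [ 0,  0,  1,  2,  8]]
print(round(cohens_kappa_multi(scores), 2))
```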
23 Comparing two tests with more than two outcomes
- The advantage of RMAC is that differences between the two marginal distributions will not induce greater apparent agreement.
- However, as stated in the paper, similar to Cohen's Kappa with the fixed-marginal assumption, the RMAC also depends on the heterogeneity of the population. Thus, in cases where the probability of responding in one category is nearly 1, the chance agreement will be large, leading to low agreement coefficients.
24 Comparing two tests with more than two outcomes
- An omnibus agreement index for situations with more than two outcomes suffers from problems similar to those faced by tests with a dichotomous outcome. Also, in a regulatory set-up where a new test device is being compared to a predicate device, RMAC may not be appropriate, as it gives equal weight to the marginals from the test and the predicate device.
- Instead, report the individual agreement for each category, as sketched below.
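One simple reading of "agreement per cell" (an illustrative interpretation on my part, not a formula given on the slides) is, for each category, the fraction of predicate-device results in that category that the new test also placed in the same category, analogous to PPA relative to the comparator:

```python
import numpy as np

def per_category_agreement(table):
    """For each category j, the fraction of predicate-device cases classified as j
    that the new test also classified as j (rows: new test, columns: predicate)."""
    table = np.asarray(table, dtype=float)
    col_totals = table.sum(axis=0)      # predicate-device counts per category
    return np.diag(table) / col_totals

# Hypothetical 5 x 5 table (same layout as in the previous sketch)
scores = [[30,  5,  2,  0,  0],
          [ 4, 20,  6,  1,  0],
          [ 1,  5, 15,  4,  1],
          [ 0,  1,  4, 10,  3],
          [ 0,  0,  1,  2,  8]]
print(per_category_agreement(scores))   # one agreement proportion per score category
```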
25 Summary
- If a perfect reference standard exists, then for a dichotomous test both sensitivity and specificity can be estimated and appropriate hypothesis tests can be performed.
- If a new test is being compared to an imperfect predicate test, then the positive percent agreement and the negative percent agreement, along with their 95% confidence intervals, are a more appropriate way of comparison than reporting the overall agreement, the kappa statistic, or the McNemar's test.
- In the case of tests with more than two outcomes, the kappa statistic and the overall agreement have the same problems if the goal of the study is to compare the new test against a predicate. A suggestion would be to report agreement for each cell.
26 References
- Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
- Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests; Draft Guidance for Industry and FDA Reviewers. March 2, 2003.
- Fleiss, J.L. (1981). Statistical Methods for Rates and Proportions, 2nd ed. John Wiley & Sons, New York.
- Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., de Vet, H.C.W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Clinical Chemistry, 49(1), 1-6. (Also appears in Annals of Internal Medicine (2003) 138(1), W1-12 and in British Medical Journal (2003) 326(7379), 41-44.)
27 References (continued)
- Dunn, G. and Everitt, B. Clinical Biostatistics: An Introduction to Evidence-Based Medicine. John Wiley & Sons, New York.
- Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543-549.
- Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551-558.
- Fay, M.P. (2005). Random marginal agreement coefficients: rethinking the adjustment for chance when measuring agreement. Biostatistics, 6(1), 171-180.