Title: Validity and Agreement
1 Validity and Agreement
2 A study can only be as good as the data . . .
3 Reproducibility vs Validity
- Reproducibility
  - the degree to which a measurement provides the same result each time it is performed on a given subject or specimen
- Validity
  - from the Latin validus, "strong"
  - the degree to which a measurement truly measures (represents) what it purports to measure (represent)
4 Reproducibility vs Validity
- Reproducibility
  - reliability, repeatability, precision, variability, dependability, consistency, stability
- Validity
  - accuracy
5 Why Care About Reproducibility?
- σ²O = σ²T + σ²E
- More measurement error means more variability in observed measurements
- e.g. measuring height in a group of subjects:
[Figure: distributions of observed height with and without measurement error; x-axis: Height]
6 Impact of Reproducibility on Statistical Precision
- observed value (O) = true value (T) + measurement error (E)
- E is random and distributed N(0, σ²E)
- When measuring a group of subjects, the variability of observed values is a combination of:
  - the variability in their true values, and
  - the variability of the measurement error
- σ²O = σ²T + σ²E (checked by simulation in the sketch below)
7 Why Care About Reproducibility?
- σ²O = σ²T + σ²E
- More variability of observed measurements has profound influences on statistical precision/power:
  - Descriptive studies: wider confidence intervals
  - RCTs: power to detect a treatment difference is reduced
  - Observational studies: power to detect an influence of a particular risk factor upon a given disease is reduced (illustrated in the sketch below)
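To make the power claim concrete, here is a hedged simulation sketch; the effect size, SDs, sample size, and replication count are all invented. The same true treatment difference is tested by a two-sample t-test with and without measurement error, and the noisy version rejects H0 less often.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def power(sigma_E, delta=2.0, sigma_T=4.0, n=50, reps=2000, alpha=0.05):
    """Fraction of simulated trials in which the t-test detects delta."""
    hits = 0
    for _ in range(reps):
        a = rng.normal(0.0, sigma_T, n) + rng.normal(0.0, sigma_E, n)
        b = rng.normal(delta, sigma_T, n) + rng.normal(0.0, sigma_E, n)
        hits += ttest_ind(a, b).pvalue < alpha
    return hits / reps

print(power(sigma_E=0.0))  # power with error-free measurement
print(power(sigma_E=4.0))  # noticeably lower once sigma_E equals sigma_T
```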
8 Mathematical Definition of Reproducibility
- Reproducibility = σ²T / (σ²T + σ²E)
- Varies from 0 (poor) to 1 (optimal)
- As σ²E approaches 0 (no error), reproducibility approaches 1
9 Why Care About Reproducibility?
- Impact on validity
- Mathematically, the upper limit of a measurement's validity is a function of its reproducibility
- Consider a study to measure height in the community:
  - Assume the measurement has imperfect reproducibility: if we measure height twice on a given person, we get two different values, so at least 1 of the 2 values must be wrong (imperfect validity)
  - If the study measures everyone only once, errors, despite being random, will lead to biased inferences when using these measurements (i.e. lack of validity)
10 Sources of Measurement Error
- Observer
  - within-observer (intrarater)
  - between-observer (interrater)
- Instrument
  - within-instrument
  - between-instrument
11 Sources of Measurement Error
- e.g. plasma HIV viral load:
  - observer: measurement-to-measurement differences in tube filling, time before processing
  - instrument: run-to-run differences in reagent concentration, PCR cycle times, enzymatic efficiency
12 Within-Subject Variability
- Although not the fault of the measurement process, moment-to-moment biological variability can have the same effect as errors in the measurement process
- Recall that:
  - observed value (O) = true value (T) + measurement error (E)
  - T = the average of measurements taken over time
  - E is always in reference to T
- Therefore, lots of moment-to-moment within-subject biologic variability will serve to increase the variability in the error term and thus increase overall variability, because σ²O = σ²T + σ²E
14 Selected Indices or Graphic Approaches for the Assessment of Validity and Reliability
15 Selected Indices or Graphic Approaches for the Assessment of Validity and Reliability
16 Indices for Categorical Variables
17 Sensitivity and Specificity
18 Predictive Values at Different Prevalence with Sensitivity = .90 and Specificity = .90

  Prevalence   PPV   NPV
  10%          .50   .99
  25%          .75   .96
  50%          .90   .90

(derived from Bayes' theorem; see the sketch below)
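The table follows directly from Bayes' theorem. A small sketch reproducing it, assuming sensitivity = specificity = 0.90 as stated above:

```python
def predictive_values(prev, sens=0.90, spec=0.90):
    # PPV = P(disease | test+), NPV = P(no disease | test-)
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

for prev in (0.10, 0.25, 0.50):
    ppv, npv = predictive_values(prev)
    print(f"prev {prev:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
# prev 10%: PPV 0.50, NPV 0.99
# prev 25%: PPV 0.75, NPV 0.96
# prev 50%: PPV 0.90, NPV 0.90
```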
19 Influence of prevalence on predictive values
21 Spectrum of severity
22 Youden's J Statistic
- Sensitivity = a
- Specificity = b
- Youden's J statistic = a + b - 1
- All three measures have a range of 0 to 1 (Youden's index can be less than 0, but only if the sensitivity and specificity are worse than would be obtained by chance with a random classification)
23 Youden's J Statistic
- Suppose that we are doing a survey in a population in which the true prevalence is P
- The observed prevalence is aP + (1-b)(1-P) = P(a+b-1) + (1-b)
24 Youden's J Statistic
- If we compare two populations, the observed prevalences are:
  P1(a+b-1) + (1-b) and P0(a+b-1) + (1-b)
- The observed prevalence difference is (P1-P0)(a+b-1)
- Youden's index indicates the reduction in the true prevalence difference due to misclassification (see the worked sketch below)
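A short worked sketch of the algebra above (the sensitivity a, specificity b, and prevalences P1, P0 are invented for illustration), confirming that the observed difference equals the true difference shrunk by the factor J:

```python
a, b = 0.85, 0.95                   # hypothetical sensitivity and specificity
J = a + b - 1                       # Youden's J = 0.80
P1, P0 = 0.30, 0.10                 # hypothetical true prevalences

obs1 = a * P1 + (1 - b) * (1 - P1)  # = P1*(a+b-1) + (1-b)
obs0 = a * P0 + (1 - b) * (1 - P0)

print(obs1 - obs0)                  # 0.16
print((P1 - P0) * J)                # 0.16 -- the same, as the algebra predicts
```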
25 Youden's J Statistic
- In population-based prevalence surveys, Youden's J statistic is the most appropriate measure of validity
- Var(J) = Sen(1-Sen)/n1 + Spe(1-Spe)/n2
- 95% CI for J: J ± 1.96 √Var(J) (see the sketch below)
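A sketch of the variance and confidence-interval formulas above, with invented values for Sen, Spe, and the group sizes n1 (truly positive subjects) and n2 (truly negative subjects):

```python
import math

sen, spe = 0.85, 0.95       # hypothetical sensitivity and specificity
n1, n2 = 120, 480           # hypothetical numbers of diseased / non-diseased

J = sen + spe - 1
var_J = sen * (1 - sen) / n1 + spe * (1 - spe) / n2
half = 1.96 * math.sqrt(var_J)
print(f"J = {J:.2f}, 95% CI: ({J - half:.3f}, {J + half:.3f})")
```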
26 Example: Jenkins et al. (1996), ISAAC questionnaire for asthma in children
27 Reliability or Reproducibility
- Is there good agreement between these two imperfect measurements?
28 Percent agreement
29 Cohen's Kappa
- Reported in 1960
- Kappa corrects for the chance agreement that would be expected to occur if the 2 classifications were completely unrelated
30 Kappa
- Definition
  - Chance-corrected measure of nominal-scale agreement among raters
- Assumptions
  - Subjects are independent
  - Categories are independent, mutually exclusive, and exhaustive
  - Raters operate independently
31 Kappa

K = (p - pe) / (1 - pe)

- p = observed proportion of agreement
- pe = proportion of agreement expected to occur by chance alone
- K varies from -1 to 1
32 Weighted Kappa Coefficient
- Definition
  - Proportion of weighted agreement corrected for chance
- Application
  - All disagreements between categories are not of equal importance
33 Weighted Kappa Coefficient

Kw = (pw - pew) / (1 - pew)

- pw = observed weighted proportion of agreement
- pew = weighted proportion of agreement expected to occur by chance alone
- Kw varies from -1 to 1
34 Notation for the agreement table (Rater A in columns, Rater B in rows)

                        Rater A
                1      2      3     | Pi.
  Rater B  1    P11    P12    P13   | P1.
           2    P21    P22    P23   | P2.
           3    P31    P32    P33   | P3.
           ---------------------------------
           P.j  P.1    P.2    P.3   | 1
35 Example: K and Kw
- Diagnostic categories:
  - 1 = Personality Disorder
  - 2 = Neurosis
  - 3 = Psychosis
- In the table on the next slide, each cell shows three entries:
  - a = disagreement weight Vij
  - b = chance-expected cell proportion Pcij = (Pi.)(P.j)
  - o = observed cell proportion Pij
36 Example data (entries per cell: weight Vij^a, (chance-expected)^b, observed^o)

                                Rater A
                    1                  2                  3          | piB
  Rater B  1   1^a (.30)^b .44^o   0.75 (.18) .07    0.25 (.12) .09  | .6
           2   0.75 (.15) .05      1    (.09) .20    0.1  (.06) .05  | .3
           3   0.25 (.05) .01      0.1  (.03) .03    1    (.02) .06  | .1
           piA      .5                  .3                .2         | 1
37 Calculating K

P = Σ Pij where i = j = .44 + .20 + .06 = .70
Pc = Σ (Pi.)(P.j) where i = j = .30 + .09 + .02 = .41

K = (P - Pc) / (1 - Pc) = (.70 - .41) / (1 - .41) = .49

(verified in the sketch below)
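The same numbers can be verified in a few lines, using the observed proportions from the slide-36 table:

```python
import numpy as np

P = np.array([[0.44, 0.07, 0.09],   # rows: Rater B; columns: Rater A
              [0.05, 0.20, 0.05],
              [0.01, 0.03, 0.06]])

p_obs = np.trace(P)                            # observed agreement: 0.70
p_exp = (P.sum(axis=1) * P.sum(axis=0)).sum()  # chance agreement: 0.41
kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 2))                         # 0.49
```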
38 Calculating Kw

Kw = (Pw - Pew) / (1 - Pew)

Pw = Σ Vij Pij over all i,j
   = 1(.44 + .20 + .06) + 0.75(.07 + .05) + 0.25(.09 + .01) + 0.1(.03 + .05) = .823
Pew = Σ Vij (Pi.)(P.j) over all i,j
   = 1(.30 + .09 + .02) + 0.75(.18 + .15) + 0.25(.12 + .05) + 0.1(.03 + .06) = .709

Kw = (.823 - .709) / (1 - .709) = .39

(verified in the sketch below)
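And the weighted version, reusing the same observed proportions together with the disagreement weights Vij from slide 36:

```python
import numpy as np

P = np.array([[0.44, 0.07, 0.09],   # observed proportions (slide 36)
              [0.05, 0.20, 0.05],
              [0.01, 0.03, 0.06]])
V = np.array([[1.00, 0.75, 0.25],   # agreement weights Vij (slide 36)
              [0.75, 1.00, 0.10],
              [0.25, 0.10, 1.00]])

pw = (V * P).sum()                                        # 0.823
pew = (V * np.outer(P.sum(axis=1), P.sum(axis=0))).sum()  # 0.709
kw = (pw - pew) / (1 - pew)
print(round(kw, 2))                                       # 0.39
```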
39 Standard Error of the Kappa Statistic
- The standard error of the kappa statistic is calculated by [formula not transcribed]
- To test the hypothesis H0: κ = 0 vs. H1: κ > 0, use the test statistic z = K / SE(K)
40 Interpretation of Kappa
- Various authors have developed classifications for the interpretation of a kappa value
- See Altman (1991), Fleiss (1981), or Byrt (1996)
41 Interpretation of Kappa
42 Interpretation of Kappa
- Below 0.00: Poor
- 0.00 - 0.20: Slight
- 0.21 - 0.40: Fair
- 0.41 - 0.60: Moderate
- 0.61 - 0.80: Substantial
- 0.81 - 1.00: Almost Perfect
- Landis & Koch (1977a)
43
- K +'s
  - Adjustment for chance agreement
  - Most commonly used measure of agreement
  - Many variants and generalizations of kappa
  - Interpretability in qualitative as well as quantitative terms
- K -'s
  - Base rate controversy
44
- Kw +'s
  - Adjustment for chance agreement
  - Ability to determine where the largest source of disagreement is occurring
  - Interpretability in qualitative as well as quantitative terms
- Kw -'s
  - Weights are arbitrarily set by the researcher
  - Decreases generalizability across studies
45 Kappa and Prevalence
- A limitation of kappa when comparing the reliability of a diagnostic procedure in different populations is its dependence on the prevalence of true positivity in each population (from Szklo & Nieto, Epidemiology: Beyond the Basics)
46 Population One (Prevalence 0.05): Table for true positives
[Observer A vs. Observer B agreement table not transcribed]
From Szklo and Nieto, 2000

47 Population One (Prevalence 0.05): Table for true negatives
[Observer A vs. Observer B agreement table not transcribed]
From Szklo and Nieto, 2000

48 Population One (Prevalence 0.05): Table for total population
[Observer A vs. Observer B agreement table not transcribed]
κ = 0.296
From Szklo and Nieto, 2000

49 Population Two (Prevalence 0.30): Table for true positives
[Observer A vs. Observer B agreement table not transcribed]
From Szklo and Nieto, 2000

50 Population Two (Prevalence 0.30): Table for true negatives
[Observer A vs. Observer B agreement table not transcribed]
From Szklo and Nieto, 2000

51 Population Two (Prevalence 0.30): Table for total population
[Observer A vs. Observer B agreement table not transcribed]
κ = 0.598
From Szklo and Nieto, 2000
52 Kappa and Prevalence
- So, for the same sensitivity and specificity of the observers, the kappa value is greater in the population in which the prevalence of positivity is higher (see the sketch below)
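A sketch of why this happens. The observers' sensitivity and specificity are invented here, and the two observers are assumed to err independently given true disease status (a simplification, not necessarily the exact Szklo & Nieto setup); even so, the expected kappa rises with prevalence while the observers' operating characteristics stay fixed.

```python
import numpy as np

def expected_kappa(prev, sens=0.90, spec=0.90):
    # probs[a, b] = P(Observer A says a, Observer B says b), 1 = positive,
    # mixing over true status with conditionally independent errors
    probs = np.zeros((2, 2))
    for a in (0, 1):
        for b in (0, 1):
            p_pos = (sens if a else 1 - sens) * (sens if b else 1 - sens) * prev
            p_neg = ((1 - spec) if a else spec) * ((1 - spec) if b else spec) * (1 - prev)
            probs[a, b] = p_pos + p_neg
    po = probs[0, 0] + probs[1, 1]              # observed agreement
    pe = probs.sum(axis=1) @ probs.sum(axis=0)  # chance agreement
    return (po - pe) / (1 - pe)

print(round(expected_kappa(0.05), 2))  # ~0.25: low-prevalence population
print(round(expected_kappa(0.30), 2))  # ~0.60: same observers, higher kappa
```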
53 Indices for Continuous Variables
54 Reproducibility of an Interval-Scale Measurement: Peak Flow
- Assessment requires >1 measurement per subject
- Peak flow rate in 17 adults (Bland & Altman)
55 Assessment by Simple Correlation
56 Pearson Product-Moment Correlation Coefficient
- r (ρ) ranges from -1 to 1
- r describes the strength of linear association
- r² = proportion of variance (variability) of one variable accounted for by the other variable
57 [Figure: example scatterplots illustrating r = -1.0, r = 1.0, r = 0.8, and r = 0.0]
58 Correlation Coefficient for Peak Flow Data
- r(meas. 1, meas. 2) = 0.98
59 Limitations of Simple Correlation for Assessment of Reproducibility
- Depends upon the range of data, e.g. peak flow:
  - r (full range of data) = 0.98
  - r (peak flow < 450) = 0.97
  - r (peak flow > 450) = 0.94
- Measures linear association only
61
- Avoid using the usual correlation coefficient (the Pearson correlation coefficient)
- It does not correct for systematic error!
62
- Instead, calculate the intraclass correlation coefficient:

  ICC = V_between-individuals / V_total
63 Intraclass Correlation Coefficient (ICC)
- Say you have 2 raters
- What if Rater 2 consistently overestimates the measurement when compared to Rater 1?
64 Fake Data (Margo et al., 2002)
65 Plot of Fake Data
66 Evaluation of the Scatter Diagram
- Strong linear association: the Pearson correlation coefficient is 0.99
- However, the ICC is weaker: 0.89
67 Pearson's r vs. ICC
- The weaker concordance is due to the fact that the ICC takes into account the difference in means, which for Rater 1 is 3.7 and for Rater 2 is 4.9 (illustrated in the sketch below)
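A minimal sketch of a one-way ANOVA intraclass correlation for two raters, consistent with the ICC = V_between / V_total definition above. The data below are invented (the Margo et al. fake data were not transcribed) but constructed so that Rater 1's mean is 3.7 and Rater 2's is 4.9: Rater 2 is simply shifted upward, which leaves the Pearson r at 1.00 yet pulls the ICC well below 1.

```python
import numpy as np

r1 = np.array([2.0, 3.0, 4.0, 4.5, 5.0])  # hypothetical Rater 1 scores (mean 3.7)
r2 = r1 + 1.2                             # Rater 2: same ordering, shifted up (mean 4.9)

data = np.column_stack([r1, r2])          # one row per subject
n, k = data.shape                         # subjects, raters

grand = data.mean()
ms_between = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_within = ((data - data.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))

var_between = (ms_between - ms_within) / k     # between-subject variance component
icc = var_between / (var_between + ms_within)  # V_between / V_total

print(f"Pearson r = {np.corrcoef(r1, r2)[0, 1]:.2f}")  # 1.00 despite the shift
print(f"ICC       = {icc:.2f}")                        # 0.60: penalized for the shift
```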
68 Intra- and Interobserver Agreement: the Bland-Altman Method (Lancet, 1986)
- There are two measurements on each of I patients: Yi1 and Yi2
- Per patient, calculate the average and the difference:
  Pi = (Yi1 + Yi2)/2 and di = Yi1 - Yi2
- Make a scatterplot of di versus Pi
- Always calculate the intraclass correlation (not the Pearson correlation) to quantify agreement
69 Purpose of the Bland-Altman Plot
- to check for systematic difference
- to check for equality of variance (slope = 0 if variances are equal)
(a plotting sketch follows below)
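A plotting sketch of the recipe above; the data are simulated (not the LDL example that follows), with a small systematic shift between the two methods.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
true = rng.normal(250, 16, 50)          # hypothetical true values
y1 = true + rng.normal(0, 4, 50)        # method 1 measurements
y2 = true + 1.2 + rng.normal(0, 4, 50)  # method 2, slight systematic shift

mean = (y1 + y2) / 2                    # per-patient average Pi
diff = y1 - y2                          # per-patient difference di
md, sd = diff.mean(), diff.std(ddof=1)

plt.scatter(mean, diff)
plt.axhline(md, color="k")                       # mean difference
plt.axhline(md + 1.96 * sd, color="k", ls="--")  # upper limit of agreement
plt.axhline(md - 1.96 * sd, color="k", ls="--")  # lower limit of agreement
plt.xlabel("Mean of the two measurements")
plt.ylabel("Difference between the two measurements")
plt.show()
```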
70 Example
LDL cholesterol of 50 patients was measured with the Friedewald formula and directly.

                    Friedewald   Direct
  Mean              251.6        250.4
  SD                15.6         17.0

Mean difference = 1.21 (SD 6.08); p-value = 0.16 by paired t-test
71 Mixed-Model ANOVA
- MSe = 18.488 → σe = √18.488 = 4.30
- MSp = 511.794 → σ²p = (511.794 - 18.488)/2 = 246.653, so σp = 15.71
- Intraclass correlation = 246.653 / (246.653 + 18.488) = 0.930
- Repeatability = 2 × SD(d) = 2 × 6.08 = 12.16
(the arithmetic is reproduced in the sketch below)
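Reproducing that arithmetic from the reported mean squares (with two measurements per patient, the between-patient variance component is (MSp - MSe) / 2):

```python
MSe, MSp = 18.488, 511.794               # reported mean squares
var_between = (MSp - MSe) / 2            # 246.653
icc = var_between / (var_between + MSe)  # V_between / V_total

print(round(icc, 3))                     # 0.930
print(round(2 * 6.08, 2))                # repeatability: 12.16
```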
72 Illustrations
[Bland-Altman plot not transcribed]
74 Conclusions
- Measurement reproducibility plays a key role in determining validity and statistical precision in all different study designs
- When assessing reproducibility for interval-scale measurements:
  - avoid correlation coefficients
  - use the intraclass correlation coefficient
  - or the coefficient of variation if the within-subject SD is proportional to the magnitude of the measurement
- For categorical-scale measurements, use kappa
- What counts as acceptable reproducibility depends upon the desired use
- Assessment of validity depends upon whether or not gold standards are present, and can be a challenge when they are absent