Title: Class 5 Additional Psychometric Characteristics: Validity and Bias, Responsiveness, Sensitivity to Change
1. Class 5: Additional Psychometric Characteristics: Validity and Bias, Responsiveness, Sensitivity to Change
October 16, 2008
- Anita L. Stewart
- Institute for Health & Aging
- University of California, San Francisco
2. Overview
- Validity
  - Including bias
  - How bias affects validity
- Responsiveness, sensitivity to change
- Meaningfulness of change
3. Validity
- Does a measure (or instrument) measure what it is supposed to measure?
- And: does a measure NOT measure what it is NOT supposed to measure?
4. Valid Scale? No!
- There is no such thing as a "valid" scale
- We accumulate evidence of validity in a variety of populations in which the scale has been tested
- Similar to reliability
5. Validation of Measures is an Iterative, Lengthy Process
- Accumulation of evidence
  - Different samples
  - Longitudinal designs
6. Types of Measurement Validity
- Content
- Criterion
- Construct
  - Convergent
  - Discriminant
  - Convergent/discriminant
All can be concurrent or predictive
7. Content Validity
- Relevant when writing items
- Extent to which a set of items represents the defined concept
8. Relevance of Content Validity to Selecting Measures
- Conceptual adequacy
- Does the candidate measure adequately represent the concept YOU intend to measure?
9. Content Validity Appropriate at Two Levels
- Battery or instrument: Are all relevant domains represented in an instrument?
- Measure: Are all aspects of a defined concept represented in the items of a scale?
10. Example of Content Validity of Instrument
- You are studying health-related quality of life (HRQL) in clinical depression
- Your HRQL concept includes sleep problems, ability to work, and social functioning
- SF-36: a candidate
  - Missing sleep problems
11. Types of Measurement Validity
- Content
- Criterion
- Construct
  - Convergent
  - Discriminant
  - Convergent/discriminant
All can be concurrent or predictive
12. Criterion Validity
- How well a measure correlates with another measure considered to be an accepted standard (criterion)
- Can be
  - Concurrent
  - Predictive
13. Criterion Validity of Self-reported Health Care Utilization
- Compare self-report with objective data (computer records of utilization)
- MD visits past 6 months (self-report) correlated .64 with computer records
- Hospitalizations past 6 months (self-report) correlated .74 with computer records
Ritter PL et al., J Clin Epid, 2001;54:136-141
14. Criterion Validity of Screening Measure
- Develop depression screening tool to identify persons likely to have disorder
  - Do clinical assessment only on those who screen "likely"
- Criterion validity
  - Extent to which the screening tool detects (predicts) those with disorder
  - Sensitivity and specificity, ROC curves
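A minimal sketch of how sensitivity and specificity might be computed for a screening tool against a clinical criterion. The function name and the 0/1 data below are hypothetical, made up only to illustrate the calculation; they are not from the lecture.

```python
def sensitivity_specificity(screen_positive, has_disorder):
    """Return (sensitivity, specificity) for paired 0/1 lists.

    screen_positive: 1 if the screening tool flagged the person, else 0
    has_disorder:    1 if the clinical assessment (criterion) found the disorder
    """
    pairs = list(zip(screen_positive, has_disorder))
    tp = sum(1 for s, d in pairs if s and d)          # flagged, has disorder
    fn = sum(1 for s, d in pairs if not s and d)      # missed cases
    tn = sum(1 for s, d in pairs if not s and not d)  # correctly cleared
    fp = sum(1 for s, d in pairs if s and not d)      # false alarms
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical screening results and clinical diagnoses for 10 people
screen = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
dx = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

sens, spec = sensitivity_specificity(screen, dx)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

An ROC curve would repeat this calculation across every possible cutoff on the screening score and plot sensitivity against 1 minus specificity.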
15. Criterion Validity of Measure to Predict Outcome
- If goal is to predict health or other outcome
  - Extent to which the measure predicts the outcome
- Example: Develop self-reported war-related stress measure to identify vets at risk of PTSD
  - How well does it predict subsequent PTSD? (Vogt et al., 2004, readings)
16. Interpreting Validity Coefficients
- Magnitude and conformity to hypothesis are important, not statistical significance
- Nunnally: rarely exceed .30 to .40, which may be adequate (1994, p. 99)
- McDowell and Newell: typically between 0.40 and 0.60 (1996, p. 36)
- Max correlation between 2 measures = square root of product of reliabilities
  - 2 scales with .70 reliabilities: max correlation = .70
  - Correlation of .60 would be high
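The ceiling on a validity coefficient can be worked through as a short sketch. The function name is illustrative; the numbers are the ones from the slide.

```python
import math


def max_correlation(reliability_x, reliability_y):
    """Upper bound on the observed correlation between two measures:
    the square root of the product of their reliabilities."""
    return math.sqrt(reliability_x * reliability_y)


ceiling = max_correlation(0.70, 0.70)
print(round(ceiling, 2))         # 0.7
print(round(0.60 / ceiling, 2))  # 0.86: an observed .60 is most of the attainable maximum
```

Read this way, a correlation of .60 between two scales with reliabilities of .70 is high because it reaches roughly 86% of the largest value the measurement error allows.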
17. Types of Measurement Validity
- Content
- Criterion
- Construct
  - Convergent
  - Discriminant
  - Convergent/discriminant
All can be concurrent or predictive
18. Construct Validity Basics
- Does measure relate to other measures in hypothesized ways?
- Do measures "behave as expected"?
- 3-step process
  - State hypothesis: direction and magnitude
  - Calculate correlations
  - Do results confirm hypothesis?
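The 3-step process can be sketched as follows. Everything here is an assumption made for illustration: the scores are invented, the helper names are hypothetical, and the band of .30 to .60 for a "moderate" correlation is one possible convention, not a rule from the lecture.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def supports_hypothesis(r, direction, low, high):
    """Step 3: does r have the hypothesized sign and magnitude?"""
    sign_ok = r > 0 if direction == "positive" else r < 0
    return sign_ok and low <= abs(r) <= high


# Step 1: state the hypothesis before looking at the data
# (hypothetical: positive and moderate, with |r| between .30 and .60)
direction, low, high = "positive", 0.30, 0.60

# Step 2: calculate the correlation (made-up scores for 8 respondents)
depression = [10, 14, 9, 20, 16, 12, 18, 11]
problems = [3, 4, 6, 7, 4, 5, 6, 2]
r = pearson(depression, problems)

# Step 3: compare the result with the hypothesis
result = supports_hypothesis(r, direction, low, high)
print(f"r = {r:.2f}, hypothesis supported: {result}")
```

Stating direction and magnitude first is what makes the check meaningful; a band chosen after seeing r would confirm anything.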
19. Source of Hypotheses in Construct Validity
- Prior literature in which associations between constructs have been observed
  - e.g., other samples, with other measures of constructs you are testing
- Theory that specifies how constructs should be related
- Clinical experience
20. Who Tests for Validity?
- When measure is being developed, investigators should test construct validity
- As measure is applied, results of other studies provide information that can be used as evidence of construct validity
21. Types of Measurement Validity
- Content
- Criterion
- Construct
  - Convergent
  - Discriminant
  - Convergent/discriminant
All can be concurrent or predictive
22. Convergent Validity
- Hypotheses stated as expected direction and magnitude of correlations
- "We expect X measure of depression to be positively and moderately correlated with two measures of psychosocial problems"
  - The higher the depression, the higher the level of problems on both measures
23. Testing Validity of Expectations Regarding Aging Measure
- Hypothesis 1: ERA-38 total score would correlate moderately with ADLs, PCS, MCS, depression, comorbidity, and age
- Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity
Sarkisian CA et al. Gerontologist. 2002;42:534
24. Testing Validity of Expectations Regarding Aging Measure
- Hypothesis 1: ERA-38 total score would correlate moderately with ADLs, PCS, MCS, depression, comorbidity, and age (convergent)
- Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity
Sarkisian CA et al. Gerontologist. 2002;42:534
25. ERA-38 Convergent Validity Results: Hypothesis 1
26. ERA-38 Non-Supporting Convergent Validity Results
27. Types of Measurement Validity
- Content
- Criterion
- Construct
  - Convergent
  - Discriminant
  - Convergent/discriminant
All can be concurrent or predictive
28. Discriminant Validity: Known Groups
- Does the measure distinguish between groups known to differ in concept being measured?
- Tests for mean differences between groups
29. Example of a Known Groups Validity Hypothesis
- Among three groups
  - General population
  - Patients visiting providers
  - Patients in a public health clinic
- Hypothesis: scores on functioning and well-being measures will be the best in a general population and the worst in patients in a public health clinic
30. Mean Scores on MOS 20-item Short Form in Three Groups
                     General      MOS        Public health
                     population   patients   patients
Physical function    91           78         50
Role function        88           78         39
Mental health        78           73         59
Health perceptions   74           63         41
Bindman AB et al., Med Care 1990;28:1142
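As a rough sketch, the hypothesized ordering (general population best, public health clinic patients worst) can be checked against the reported means. This only inspects the order of the means on the slide; a real known-groups test of mean differences would need the individual-level data. The helper name is hypothetical.

```python
# Mean scores from the slide: (general population, MOS patients, public health patients)
means = {
    "Physical function": (91, 78, 50),
    "Role function": (88, 78, 39),
    "Mental health": (78, 73, 59),
    "Health perceptions": (74, 63, 41),
}


def ordered_as_hypothesized(scores):
    """True if general population > MOS patients > public health patients."""
    general, mos, public = scores
    return general > mos > public


results = {scale: ordered_as_hypothesized(v) for scale, v in means.items()}
for scale, ok in results.items():
    print(f"{scale}: ordering matches hypothesis = {ok}")
```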
31. PedsQL Known Groups Validity
- Hypothesis: PedsQL scores would be lower in children with a chronic health condition than without
JW Varni et al. PedsQL 4.0: Reliability and Validity of the Pediatric Quality of Life Inventory, Med Care, 2001;39:800-812.
32. Types of Measurement Validity
- Content
- Criterion
- Construct
  - Convergent
  - Discriminant
  - Convergent/discriminant
All can be concurrent or predictive
33. Convergent/Discriminant Validity
- Does measure correlate lower with measures it is not expected to be related to than with measures it is expected to be related to?
- The extent to which the pattern of correlations conforms to hypothesis is confirmation of construct validity
34. Basis for Convergent/Discriminant Hypotheses
- All measures of health will correlate to some extent
- Hypothesis is of relative magnitude
35. Example of Convergent/Discriminant Validity Hypothesis
- Expected pattern of relationships
  - A measure of physical functioning is hypothesized to be more highly related to a measure of mobility than to a measure of depression
36. Example of Convergent/Discriminant Validity Evidence
Pearson correlation
                       Mobility   Depression
Physical functioning   .57        .25
37. Testing Validity of Expectations Regarding Aging Measure
- Hypothesis 1: ERA-38 total score would correlate moderately with ADLs, PCS, MCS, depression, comorbidity, and age (convergent)
- Hypothesis 2: Functional independence scale would show strongest associations with ADLs, PCS, and comorbidity (convergent/discriminant)
Sarkisian CA et al. Gerontologist. 2002;42:534
38. ERA-38 Convergent/Discriminant Validity Results: Hypothesis 2
39. ERA-38 Non-Supporting Validity Results
40. Construct Validity Thoughts: Lee Sechrest
- There is no point at which construct validity is established
- It can only be established incrementally
- Our attempts to measure constructs help us better understand and revise these constructs
Sechrest L, Health Serv Res, 2005;40(5 part II):1596
41. Construct Validity Thoughts: Lee Sechrest (cont)
- An impression of construct validity emerges from examining a variety of empirical results that together make a compelling case for the assertion of construct validity
42. Construct Validity Thoughts: Lee Sechrest (cont)
- Because of the wide range of constructs in the social sciences, many of which cannot be exactly defined ...
- once measures are developed and in use, we must continue efforts to understand them and their relationships to other measured variables.
43. Overview
- Validity
  - Including bias
- Responsiveness, sensitivity to change
- Meaningfulness of change
44. Components of an Individual's Observed Item Score (from Class 3)
Observed item score = true score + error (random + systematic)
45. Random versus Systematic Error
Observed item score = true score + error (random + systematic)
- Random error: relevant to reliability
- Systematic error: relevant to validity
46. Bias is Systematic Error
- Affects validity of scores
- If scores contain systematic error, cannot know the true mean score
- Will obtain an observed score that is either systematically higher or lower than the true score
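A small simulation can illustrate why systematic error, unlike random error, hides the true mean. This is only a sketch under assumed values: a true score of 50 for everyone, random error with a standard deviation of 10, and a systematic shift of -5 are all hypothetical choices.

```python
import random
import statistics


def simulate_observed(true_scores, random_sd, systematic_shift, seed=0):
    """Observed score = true score + random error + systematic error."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [t + rng.gauss(0, random_sd) + systematic_shift for t in true_scores]


true_scores = [50.0] * 10_000  # hypothetical: everyone has a true score of 50

# Random error only: individual scores are noisy, but the mean stays near 50
noisy = simulate_observed(true_scores, random_sd=10, systematic_shift=0)

# Random plus systematic error: the whole distribution is shifted by -5
biased = simulate_observed(true_scores, random_sd=10, systematic_shift=-5)

print(round(statistics.mean(noisy), 1), round(statistics.mean(biased), 1))
```

Averaging over a large sample cancels most of the random error, but no sample size removes the shift, which is why the observed mean under systematic error cannot recover the true mean.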
47. Bias or Systematic Error?
- Bias: implies that the direction of error is known
- Systematic error: direction neutral
- Same error applies to entire sample
48. Sources of Bias in Observed Scores of Individuals
- Respondent
  - Socially desirable responding
  - Acquiescent response bias
  - Cultural beliefs (e.g., not reporting distress)
  - Halo effects
- Observer
  - Belief that respondent is ill
- Instrument
49. Socially Desirable Responding
- Tendency to respond in socially desirable ways to present oneself favorably
- Observed score is consistently lower or higher than true score in the direction of a more socially acceptable score
50. Socially Desirable Response Set: Looking Good
- After coming up with an answer to a question, respondent screens the answer
  - Will this make the person like me less?
- May edit their answer to be more desirable
- Example: a woman has 2 drinks of alcohol a day, but responds that she drinks a few times a week
- Systematic underreporting of risk behavior
51. Ways to Minimize Socially Desirable Responding
- Write items to increase acceptability of an undesirable response
- Instead of
  - Have you followed your doctor's recommendations?
- Use
  - Have you had any of the following problems following your doctor's recommendations?
52. Example of Bias Due to Cultural Norms or Beliefs
- A person feels sad most of the time
- Unwilling to admit this to the interviewer, so answers "a little of the time"
- Not culturally appropriate to admit to negative feelings
  - Always present a positive personality
- Observed response reflects less sadness than true sadness of respondent
53. Acquiescent Response Set: Yea Saying
- Tendency to
  - agree with statements regardless of content
  - give positive response such as yes, true, satisfied
- Extent and nature of bias depends on direction of wording of the questions
- Minimizing acquiescence
  - Include positively- and negatively-worded items in the same scale
54. Discrepancies in Various Information Sources: Bias or Different Perspectives?
- In reporting on a patient's well-being
  - Patients report highest levels
  - Clinicians report levels in the middle
  - Family members report the lowest levels
- No way to know which is the true score
  - To say one score is biased implies another one is the true score
55. Overview
- Validity
  - Including bias
- Responsiveness, sensitivity to change
- Meaningfulness of change
56. Two Meanings of Sensitivity and Responsiveness to Change
- Measure able to detect true changes
- One knows how much change is meaningful
  - regardless of statistical significance
  - change scores are interpretable in terms of meaningfulness
57. Sensitivity to Change: Detects True Change
- Sensitive to true differences or changes in the attribute being measured
- Sensitive enough to measure differences in outcomes that might be expected given the relative effectiveness of treatments
- Ability of a measure to detect change statistically
58. Instrument has Potential Distribution of Scores to Detect Change
- Evidence of good variability in sample like yours (at baseline)
  - Room to improve
- Multi-item scales: many scale levels
59. Importance of Sensitivity
- Need to know measure can detect change if planning to use it as outcome of intervention
- Approaches for testing sensitivity are often simultaneous tests of
  - effectiveness of an intervention
  - sensitivity or responsiveness of measures
60. Considerations in Developing CHAMPS Physical Activity (PA) Questionnaire
- Needed outcome measure to detect changes in PA due to CHAMPS intervention
  - increase PA levels in everyday life (e.g., walking, stretching) in activities of their choice
- Existing measures designed to capture younger persons' PA
Stewart AL et al. Med Sci Sports Exerc, 2001;33:1126-1141.
61. Changes in Measure Resulting from Intervention: Validity Evidence for Others
- After intervention detected PA change, others used our results as evidence of sensitivity to change
  - Used in Project ACTIVE because of its sensitivity to change in CHAMPS (S Wilcox et al, Am J Pub Health, 2006;96:1201-1209)
- Changes in a depression measure in a drug trial are evidence that the measure is capable of detecting change in another study
62. Measuring Sensitivity
- Score is stable in those who are not changing
- Score changes in those who are actually changing (true change)
- Not easy to quantify
  - can administer multiple measures of same concept in intervention
  - see which measures change the most
63. Responsiveness to Change
- Used DSM-IV criteria to classify patients who had major depression at earlier time into
  - Persistent depression
  - Partial remission
  - Full remission
- Examined PHQ-9 change scores in relation to these criteria
  - PHQ-9: a short screener for depression
Löwe B et al. Med Care, 2004;42:1194-1201
64. Changes in PHQ-9 Scores by Criteria of Change in Depression
Löwe et al, 2004, p. 1200
65. Relevant or Meaningful Change
- Is the observed change important?
- To clinician
  - meaningful to clinician
  - change might influence patient management
- To patient
  - patient notices change
  - amount of change matters
66. Statistical Significance versus Importance
- Statistical significance is not sufficient for clinical importance
  - Depends on sample size
  - Can obtain statistical significance of a very small change
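The dependence on sample size can be seen in a short sketch using the paired t statistic, mean change divided by its standard error. The 1-point mean change, the standard deviation of 10, and the two sample sizes are hypothetical values chosen for illustration, and 1.96 is used as an approximate large-sample critical value.

```python
import math


def paired_t(mean_change, sd_change, n):
    """t statistic for a mean change score: mean / (sd / sqrt(n))."""
    return mean_change / (sd_change / math.sqrt(n))


# Same small change (1 point, SD of change = 10) in two hypothetical samples
t_small = paired_t(mean_change=1.0, sd_change=10.0, n=25)
t_large = paired_t(mean_change=1.0, sd_change=10.0, n=2500)

print(t_small)  # 0.5
print(t_large)  # 5.0

# Approximate two-sided 5% critical value for large samples (an assumption here)
CRITICAL = 1.96
print(t_small > CRITICAL, t_large > CRITICAL)  # False True
```

The change is identical in both samples; only n differs, so the significance test alone says nothing about whether a 1-point change matters.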
67. Minimal Important Difference (MID)
- MID: the minimal difference that is clinically important
  - Smallest difference considered to be worthwhile or important
- Context specific
68. Anchor-Based Approaches to Estimating MID
- Anchor: external information on amount of change
- Identify group that you know has changed by a minimal amount
  - Clinical change
  - Patient reported change
- Change in health measure for this group = MID
69. Example of Patient-Reported Anchor
- Since one year ago, how would you rate your health in general now?
  - Much worse now than one year ago
  - Somewhat worse than one year ago
  - About the same as one year ago
  - Somewhat better than one year ago
  - Much better than one year ago
70. Two Categories Can Define Minimal Change Groups
- Since one year ago, how would you rate your health in general now?
  - Much worse now than one year ago
  - Somewhat worse than one year ago
  - About the same as one year ago
  - Somewhat better than one year ago
  - Much better than one year ago
71. Minimal Change Groups
- Select subset of respondents who reported "somewhat better" or "somewhat worse"
  - Change in your health measure for this subset would constitute the MID
- Could also combine two groups using absolute change
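A minimal sketch of the anchor-based calculation, combining the two minimal-change groups using absolute change. The function name, the change scores, and the anchor responses are all hypothetical.

```python
from statistics import mean


def anchor_based_mid(changes, anchors,
                     minimal=("somewhat better", "somewhat worse")):
    """Mean absolute change among respondents whose anchor response
    falls in one of the minimal-change categories."""
    subset = [abs(c) for c, a in zip(changes, anchors) if a in minimal]
    return mean(subset)


# Hypothetical change scores on a health measure and anchor responses
changes = [6, -4, 0, 5, -5, 12, 1]
anchors = [
    "somewhat better",
    "somewhat worse",
    "about the same",
    "somewhat better",
    "somewhat worse",
    "much better",
    "about the same",
]

mid = anchor_based_mid(changes, anchors)
print(mid)  # 5
```

Estimating the MID separately for the "somewhat better" and "somewhat worse" groups is the alternative when improvement and decline are not expected to be symmetric.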
72. Other Approaches to Assess Meaning of Change (Relative to a Measured Change)
- Patient noticed change
  - Since ..., how would you rate the amount of change in your physical functioning?
  - 7-point scale: very much better ... very much worse
- Patient satisfied with change
  - How satisfied are you with the amount of change in physical functioning?
  - 7-point scale: extremely satisfied ... not at all satisfied
73. Other Measures of Perceived Change
- Study of patients with hip or knee replacement
- How successful was your (hip, knee) replacement in ...
  - allowing you to return to your normal daily activities?
  - relieving your pain?
- Response choices: extremely, very, moderately, slightly, not at all successful
KB Bayley et al. Med Care 1995;33:AS226
74. Next Class (Class 5)
- Factor analysis with Steve Gregorich
75. Homework
- Complete rows 21-27 in matrix for your two measures
  - Nature of samples on which it has been tested, validity, responsiveness and sensitivity to change