Title: Validity
1. Validity: Outline
- Definition
- Two different views: Traditional
- Two different views: CSEPT
- Face Validity
- Content Validity: CSEPT
- Content Validity: Borsboom
2. Validity: Outline
- Criterion Validity: CSEPT
- Predictive vs. Concurrent
- Validity Coefficients
- Criterion Validity: Borsboom
- Construct Validity: CSEPT
- Convergent
- Discriminant
3. Validity: Definition
- Validity measures agreement between a test score
and the characteristic it is believed to measure
4. Validity: CSEPT view
- Validity is a property of test score
interpretations
- Validity exists when actions based on the
interpretation are justified given a theoretical
basis and social consequences
5. Validity: Traditional view
- Validity is a property of tests
- Does the test measure what you think it measures?
6. Note the difference
- Validity exists when actions based on the
interpretation are justified given a theoretical
basis and social consequences
- Does the test measure what you think it measures?
7. A problem with the CSEPT view
- Who is to say the social consequences of test
use are good or bad?
- According to CSEPT, validity is a subjective judgment
- In my view, this makes the concept useless: if you like the result the test gives you, you will consider it valid. If you don't, you won't.
- That's not how scientists think.
8. Borsboom et al. (2004)
- Borsboom et al. reject CSEPT's view
- "Validity is a very basic concept and was correctly formulated, for instance, by Kelley (1927, p. 14) when he stated that a test is valid if it measures what it purports to measure." (p. 1061)
9. Borsboom et al. (2004)
- Variations in what you are measuring cause
variations in your measurements.
- E.g., variations across people in intelligence cause variations in their IQ scores
- This is not a correlational model of validity
10. Borsboom et al. (2004)
- You don't create a test and then do the analysis necessary to establish its validity
- Rather, you begin by doing the theoretical work necessary to understand your subject and create a valid test in the first place.
- On this view, validity is not a big problem.
11. Borsboom et al. vs. CSEPT
- Who is right?
- Each scientist has to make up his or her own
mind on that question
- I agree with Borsboom et al.'s arguments.
- Other psychologists may disagree.
12. The CSEPT view
- CSEPT recognizes 3 types of evidence for test validity:
- Content-related
- Criterion-related
- Construct-related
- Boundaries not clearly defined
- Cronbach (1980): Construct is basic, while Content and Criterion are subtypes.
13. Parenthetical Point: Face Validity
- Face validity refers to the appearance that a
test measures what it is intended to measure.
- Face validity has P.R. value: test-takers may be better motivated if the test appears to be a sensible way to measure what it is supposed to measure.
14. Content validity: CSEPT
- Content-related evidence considers coverage of
the conceptual domain tested.
- Important in educational settings
- Like face validity, it is determined by logic rather than statistics
- Typically assessed by expert judges
15. Content validity: CSEPT
- Construct-irrelevant variance
- arises when irrelevant items are included
- or when external factors such as illness influence test scores
- requires a judgment about what is truly external
- Construct under-representation
- Is domain adequately covered or are parts of it
left out?
16. Content validity: Borsboom et al.
- Borsboom et al. would say that content validity
is not something to be established after the test
has been created.
- Rather, you build it into your test by having a
good theory of what you are testing
17. Criterion validity: CSEPT
- Criterion-related evidence tells us how well a
test score corresponds to a particular criterion
measure.
- Generally, we want the test score to tell us something about the criterion score.
- How well the test does this provides criterion-related evidence
18. Criterion validity: CSEPT
- CSEPT: we could compare undergraduate GPAs to SAT scores to produce evidence of the validity of conclusions drawn on the basis of SAT scores.
- Two basic types
- Predictive
- Concurrent
19. Criterion validity: CSEPT
- Test scores are used to predict future performance: how good is the prediction?
- E.g., the SAT is used to predict final undergraduate GPA
- SAT and GPA are moderately correlated
20. Criterion validity: CSEPT
- Predictive validity
- Concurrent validity
- Correlation between test scores and criterion when the two are measured at the same time.
- Test illuminates current performance rather than predicting future performance (e.g., why does the patient have a temperature? Why can't the student do math?)
21. Criterion validity: Borsboom et al.
- Criterion validity involves a correlation of test scores with some criterion, such as GPA
- That does not establish the test's validity, only its utility.
- E.g., height and weight are correlated, but a test of height is not a test of what bathroom scales measure.
22. Criterion validity: Borsboom et al.
- The SAT is valid because it was developed on the sensible theory that past academic achievement is a good guide to future academic achievement
- Validity is built into the test, not established
after the test has been created
23. Criterion validity
- Note: no point in developing a test if you already have a criterion, unless impracticality or expense makes use of the criterion difficult.
- Criterion measure only available in the future?
- Criterion too expensive to use?
24. Criterion validity
- Compute the correlation (r) between test score and criterion.
- r = .30 or .40 would be considered normal.
- r > .60 is rare
- Note: r varies between -1.0 and +1.0
25. Criterion validity
- r² gives the proportion of variance in the criterion explained by the test score.
- E.g., if r = .30, then r² = .09, so 9% of the variability in Y can be explained by variation in X (worked through in the sketch below)
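A minimal Python sketch of this arithmetic, using simulated (made-up) test and criterion scores; the variable names and the target correlation of .30 are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical examinees: a test score and a criterion score built so that
# they correlate only moderately (about .30), roughly like SAT and GPA.
test = rng.normal(size=n)
criterion = 0.30 * test + np.sqrt(1 - 0.30**2) * rng.normal(size=n)

r = np.corrcoef(test, criterion)[0, 1]   # validity coefficient
r_squared = r ** 2                       # proportion of criterion variance explained

print(f"r   = {r:.2f}")
print(f"r^2 = {r_squared:.2f} -> about {100 * r_squared:.0f}% of the variance in the criterion")
```

With r near .30, r² lands near .09, i.e., the 9% figure above; the exact sample values will fluctuate from run to run.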
26. Interpreting validity coefficients
- Changes in causal relationships
- What does the criterion mean? Is it valid and reliable?
- Is the subject population for the validity study appropriate?
- Sample size
27. Interpreting validity coefficients
- Criterion/predictor confusion
- Range restrictions
- Do validity study results generalize?
- Differential predictions
28. Construct validity: CSEPT
- Problem: for many psychological characteristics of interest there is no agreed-upon universe of content and no clear criterion
- We cannot assess content or criterion validity for such characteristics
- These characteristics involve constructs: something built by mental synthesis.
29. Construct validity: CSEPT
- Examples of constructs
- Intelligence
- Love
- Curiosity
- Mental health
- CSEPT: We obtain evidence of validity by simultaneously defining the construct and developing instruments to measure it.
- This is bootstrapping.
30. Bootstrapping construct validity
- assemble evidence about what a test means: in other words, about the characteristic it is testing.
- CSEPT: this process is never finished
31. Bootstrapping construct validity
- assemble evidence about what a test means: in other words, about the characteristic it is testing.
- Borsboom: this is part of the process of creating the test in the first place, not something done after the fact
32. Bootstrapping construct validity
- assemble evidence
- show relationships between a test and other tests
- CSEPT: none of the other tests is a criterion, but the web of relationships tells us what the test means
33. Bootstrapping construct validity
- assemble evidence
- show relationships between a test and other tests
- Borsboom: these relationships do not tell us what a test score means
- (E.g., age is correlated with annual income, but a measure of age is not a measure of annual income.)
34. Bootstrapping construct validity
- assemble evidence
- show relationships
- each new relationship adds meaning to the test
- CSEPT: a test's meaning is gradually clarified over time
35. Bootstrapping construct validity
- assemble evidence
- show relationships
- each new relationship adds meaning to the test
- Borsboom would say: why all the mystery? The meaning of many tests (e.g., WAIS, academic exams, Piaget's tests) is clear right from the start
36. Construct validity
- Example from the text: Rubin's work on Love.
- Rubin collected a set of items for a Love scale
- He read poetry and novels; he asked people for definitions
- created a scale of Love and one of Liking
37. CSEPT: Construct validity
- Rubin gave the scale to many subjects and factor-analyzed the results (a toy version of such an analysis is sketched below)
- Love integrates Attachment, Caring, and Intimacy
- Liking integrates Adjustment, Maturity, Good Judgment, and Intelligence
- The two are independent: you can love someone you don't like (as song-writers know)
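To show what "factor-analyzed the results" involves, here is a small sketch on simulated item responses. The items, loadings, and the use of scikit-learn's FactorAnalysis (with varimax rotation, available in recent versions) are illustrative assumptions, not Rubin's actual analysis:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 500

# Two independent latent traits, standing in for "Love" and "Liking".
love = rng.normal(size=n)
liking = rng.normal(size=n)

# Six hypothetical items: the first three are driven mainly by the Love
# trait, the last three by the Liking trait, plus item-specific noise.
items = np.column_stack([
    0.8 * love   + 0.4 * rng.normal(size=n),
    0.7 * love   + 0.5 * rng.normal(size=n),
    0.9 * love   + 0.3 * rng.normal(size=n),
    0.8 * liking + 0.4 * rng.normal(size=n),
    0.7 * liking + 0.5 * rng.normal(size=n),
    0.9 * liking + 0.3 * rng.normal(size=n),
])

# Extract two factors; the loading pattern (large vs. near-zero entries)
# recovers the two independent clusters of items.
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
print(np.round(fa.components_, 2))
```

The point of the example is only the shape of the output: two factors, each loading on a distinct set of items, mirroring the independent Love and Liking scales.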
38. Rubin's study of Love
- Borsboom et al.: when creating a test, the researcher specifies the processes that convey the effect of the measured attribute on the test score.
- Rubin laboriously built a theory about what the construct Love means.
- Rubin's process (reading poetry and novels, asking people for definitions) was a good process, so his test has construct validity.
39. Campbell & Fiske (1959)
- Two types of Construct-related Evidence
- Convergent evidence
- When a test correlates well with other tests
believed to measure the same construct
40. Campbell & Fiske (1959)
- Two types of Construct-related Evidence
- Convergent evidence
- Discriminant evidence
- When a test does not correlate with other tests
believed to measure some other construct.
41. Convergent validity
- Scores correlated with age, number of symptoms, chronic medical conditions, and physiological measures
- Treatments designed to improve health should increase Health Index scores. They do.
42. Discriminant validity
- Low correlations between the new test and tests believed to tap unrelated constructs.
- Evidence that the new test measures something unique (both kinds of evidence are sketched below)
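A hedged sketch of how convergent and discriminant evidence is usually summarized: correlate the new test with an established measure of (supposedly) the same construct and with a measure of an unrelated construct. All test names, constructs, and data here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Hypothetical latent constructs.
health = rng.normal(size=n)         # what the new test is meant to measure
extraversion = rng.normal(size=n)   # an unrelated construct

# Hypothetical observed scores: construct signal plus measurement noise.
new_test = health + 0.5 * rng.normal(size=n)
established_health_test = health + 0.5 * rng.normal(size=n)
extraversion_test = extraversion + 0.5 * rng.normal(size=n)

convergent = np.corrcoef(new_test, established_health_test)[0, 1]
discriminant = np.corrcoef(new_test, extraversion_test)[0, 1]

print(f"convergent evidence:   r = {convergent:.2f}  (should be substantial)")
print(f"discriminant evidence: r = {discriminant:.2f}  (should be near zero)")
```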
43. Validity & Reliability: CSEPT
- CSEPT: no point in trying to establish the validity of an unreliable test.
- It's possible to have a reliable test that is not valid (has no meaning).
- Logically impossible to produce evidence of validity for an unreliable test (illustrated in the sketch below).
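One way to see why: as measurement error grows (reliability falls), the observed correlation between the test and any criterion shrinks, so a validity coefficient can never outrun the test's reliability. A minimal simulation with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

true_score = rng.normal(size=n)   # the attribute the test is supposed to measure
criterion = true_score            # a criterion perfectly tied to that attribute

for error_sd in (0.0, 1.0, 3.0):  # more random error = lower reliability
    test = true_score + error_sd * rng.normal(size=n)
    reliability = 1.0 / (1.0 + error_sd**2)   # true-score variance / total variance
    r = np.corrcoef(test, criterion)[0, 1]    # observed validity coefficient
    print(f"reliability ~ {reliability:.2f} -> validity coefficient r = {r:.2f}")
```

Even with a criterion that is a perfect copy of the attribute, the unreliable versions of the test show much weaker correlations.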
44. Validity & Reliability: Borsboom
- Borsboom et al.: what does it mean to say that a test is reliable but not valid?
- What is it a test of?
- It isn't a test at all, just a collection of items
45. Blanton & Jaccard: arbitrary metrics
- We observe a behavior in order to learn about the underlying psychological characteristic
- A person's test score represents their standing on that underlying dimension
- Such scores form an arbitrary metric
- That is, we do not know how the observed scores
are related to the true scores on the underlying
dimension
46. [Figure: Persons A and B occupy fixed positions on an underlying dimension that has a neutral point, but the same positions map onto a Test 1 score scale running 0 to 6 and a Test 2 score scale running 6 to 0. Adapted from Blanton & Jaccard (2006), Figure 1, p. 29.]
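To make the figure concrete, here is a small sketch in which two hypothetical tests are different monotone mappings of the same underlying dimension. The particular mapping functions and the latent positions of persons A and B are invented for illustration:

```python
import numpy as np

# Latent (underlying) positions of two people; 0 marks the neutral point.
person_a, person_b = -0.2, 1.5

def test_1(latent):
    # One hypothetical monotone mapping of the latent dimension onto a 0-6 score scale.
    return 6 / (1 + np.exp(-3 * latent))

def test_2(latent):
    # A different, equally legitimate monotone mapping onto a 0-6 score scale.
    return 4 + 1.2 * latent

for name, mapping in [("Test 1", test_1), ("Test 2", test_2)]:
    a, b = mapping(person_a), mapping(person_b)
    print(f"{name}: A = {a:.1f}, B = {b:.1f}, gap = {b - a:.1f}, "
          f"score at the underlying neutral point = {mapping(0.0):.1f}")
```

Both tests rank A and B the same way, yet the observed scores, the size of the gap between the two people, and the score corresponding to the underlying neutral point all differ: that is what makes the metric arbitrary.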
47. Arbitrary metrics: the IAT
- The Implicit Association Test (IAT) is claimed to diagnose implicit attitudinal preferences or racist attitudes
- IAT authors say you may have prejudices you don't know you have.
- Are these claims true?
48. Arbitrary metrics: the IAT
- Task: categorize stimuli using 2 pairs of categories
- 2 buttons to press, 2 assignments of categories
to buttons, used in sequence
49. Arbitrary metrics: the IAT
- Assignment pattern A
- Button 1: press if the stimulus refers to the category White or the category Pleasant
- Button 2: press if the stimulus refers to the category Black or the category Unpleasant
- Assignment pattern B
- Button 1: press if the stimulus refers to the category White or the category Unpleasant
- Button 2: press if the stimulus refers to the category Black or the category Pleasant
50. Arbitrary metrics: the IAT
- IAT authors claim that if responses are faster to Pattern A than to Pattern B, that indicates a preference for Whites over Blacks; in other words, a racist attitude
- IAT authors also give test-takers feedback about how strong their preferences are, based on how much faster their responses are to Pattern A than to Pattern B (a simplified version of that difference score is sketched below)
- This is inappropriate
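For concreteness, here is a simplified sketch of the kind of response-time difference that feedback is based on. The simulated response times are invented, and this is not the published IAT scoring algorithm, only a bare-bones difference of block means:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated response times in milliseconds for one hypothetical test-taker.
pattern_a_rt = rng.normal(loc=750, scale=120, size=40)  # White+Pleasant / Black+Unpleasant block
pattern_b_rt = rng.normal(loc=820, scale=120, size=40)  # White+Unpleasant / Black+Pleasant block

# A positive difference (slower on Pattern B) is what gets interpreted as a "preference".
diff_ms = pattern_b_rt.mean() - pattern_a_rt.mean()
print(f"mean RT difference (Pattern B - Pattern A) = {diff_ms:.0f} ms")
```

The number the sketch prints is perfectly well defined in milliseconds; the problem raised on the following slides is that the mapping from that number to attitude strength is unknown, so calling a preference weak or strong on that basis rests on an arbitrary metric.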
51. Arbitrary metrics: the IAT
- The IAT does not tell us about racist attitudes
- IAT authors take a dimension which is non-arbitrary when used by physicists (time) and use it in an arbitrary way in psychology
52. Arbitrary metrics: the IAT
- The function relating the response dimension (time) to the underlying dimension (attitudes) is unknown
- Zero on the (Pattern A - Pattern B) difference may not be zero on the underlying attitude/preference dimension
- There are alternative models of how that (Pattern A - Pattern B) difference could arise
53. Review
- CSEPT
- Validity is a characteristic of evidence, not of tests.
- Valid evidence supports conclusions drawn using test results
- Validity is determined by social consequences of test use
- Borsboom et al.
- Validity is not a methodological issue, but a substantive (theoretical) issue
- A test of an attribute is valid if (a) the attribute exists, and (b) variation in the attribute causes variation in test scores
54. Review
- CSEPT
- Validity can be established in three ways, though boundaries between them are fuzzy
- Content-related evidence
- Criterion-related evidence
- Construct-related evidence
- Borsboom et al.
- It's all the same validity: a test is valid if it measures what you think it measures
- Validity is not mysterious
55. Review
- CSEPT
- Content-related evidence: do test items represent the whole domain of interest?
- Criterion-related evidence: do test scores relate to a criterion, either now (concurrent) or in the future (predictive)?
- Borsboom et al.
- These questions are properly part of the process
of creating a test
56. Review
- CSEPT
- Construct-related evidence is obtained when we develop a psychological construct and the way to measure it at the same time.
- A test can be reliable but not valid. A test cannot be valid if not reliable.
- Borsboom et al.
- A test must be valid for a reliability estimate
to have any meaning
57. Review
- Blanton & Jaccard (2006) warn against over-interpretation of scores which are based on an arbitrary metric
- For an arbitrary metric, we have no idea how the
test scores are actually related to the
underlying dimension