Title: Reliability
1. Reliability
- A measure is reliable if it gives the same information every time it is used.
- Reliability is assessed by a number, typically a correlation between two sets of scores.
2. Reliability
- Measurement of human ability and knowledge is challenging because:
- Ability is not directly observable; we infer ability from behavior.
- All behaviors are influenced by many variables, only a few of which matter to us.
3. Observed Scores
- O = observed score
- T = true score
- e = error
4. Reliability: the basics
- A true score on a test does not change with repeated testing.
- A true score would be obtained if there were no error of measurement.
- We assume that errors are random (equally likely to increase or decrease any test result).
5. Reliability: the basics
- Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors).
- The mean of many observed scores for one person will be the person's true score.
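The cancellation claim can be checked with a short simulation; this is a sketch in which the true score of 80 and the error spread of 5 points are made-up numbers:

```python
import random
import statistics

random.seed(1)  # reproducible example

TRUE_SCORE = 80      # hypothetical true spelling score (made-up number)
N_TESTINGS = 10_000  # number of repeated testings

# Each observed score is O = T + e, where e is a random error that is
# equally likely to push the score up or down (mean zero).
observed = [TRUE_SCORE + random.gauss(0, 5) for _ in range(N_TESTINGS)]

mean_observed = statistics.mean(observed)
print(round(mean_observed, 1))  # very close to the true score of 80
```

Any single observed score may miss by several points, but the positive and negative errors cancel in the average.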
6. Reliability: the basics
- Example: measuring Sarah's ability to spell English words.
- We can't ask her to spell every word in the dictionary, so:
- Ask Sarah to spell a subset of English words.
- The proportion correct estimates her true English spelling skill.
- But which words should be in our subset?
7. Estimating Sarah's spelling ability
- Suppose we choose 20 words randomly.
- Then, by chance, we may get a lot of very easy words: cat, tree, chair, stand.
- Or, by chance, we may get a lot of very difficult words: desiccate, arteriosclerosis, numismatics.
8. Estimating Sarah's spelling ability
- Sarah's observed score will vary with the difficulty of the random sets of words we choose.
- But presumably her actual spelling ability remains constant.
9. Reliability: the basics
- Other things can produce error in our measurement.
- E.g., on the first day that we test Sarah she's tired, but on the second day, she's rested.
10. Estimating Sarah's spelling ability
- Conclusion: O = T + e
- But e1 ≠ e2 ≠ e3: the error differs from one testing occasion to the next.
- The variation in Sarah's scores is produced by measurement error.
- How can we measure such effects? How can we measure reliability?
11. Reliability: the basics
- In what follows, we consider various sources of error in measurement.
- Different ways of measuring reliability are sensitive to different sources of error.
12. How do we deal with sources of error?
13. Domain Sampling Error
- A knowledge base or skill set containing many items is to be tested, e.g., the chemical properties of foods.
- We can't test the entire set of items.
- So we sample items.
- That produces sampling error, as in Sarah's spelling test.
14. Domain Sampling Error
- Smaller sets of items may not test the entire knowledge base.
- A person's score may vary depending upon what is included in or excluded from the test.
- Reliability increases with the number of items on a test.
15. Domain Sampling Error
- Parallel Forms Reliability:
- Choose 2 different sets of test items.
- Across all people tested, if the correlation between scores on the 2 sets of words is low, then we probably have domain sampling error.
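A sketch of how parallel-forms reliability is computed, using the standard Pearson correlation; the scores for the two forms below are hypothetical:

```python
import statistics

# Hypothetical scores of 8 people on two parallel forms of the same test
form_a = [12, 15, 9, 18, 11, 16, 14, 10]
form_b = [13, 14, 10, 17, 12, 15, 15, 9]

def pearson(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(form_a, form_b)
print(round(r, 2))  # a high r suggests little domain sampling error
```

A low r here would suggest that the two forms sample different parts of the knowledge domain.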
16. How do we deal with sources of error?
- Error due to test items
- Error due to testing occasions
17. Time Sampling Error
- Test-retest Reliability:
- A person taking a test might be having a very good or very bad day due to fatigue, emotional state, preparedness, etc.
- Give the same test repeatedly and check the correlations among scores.
- High correlations indicate stability: less influence of bad or good days.
18. Time Sampling Error
- Advantage: easy to evaluate, using correlation.
- Disadvantage: carryover and practice effects.
19. How do we deal with sources of error?
- Error due to test items
- Error due to testing occasions
- Error due to testing multiple traits
- Internal consistency error
20. Internal consistency approach
- Suppose a test includes both (1) items on social psychology and (2) items requiring mental rotation of abstract visual shapes.
- Would you expect much correlation between scores on the two parts?
- No, because the two skills are unrelated.
21. Internal consistency approach
- A low correlation between scores on the 2 halves of a test suggests that the test is tapping two different abilities or traits.
- In such a case, the two halves of the test give information about two different, uncorrelated traits.
22. Internal consistency approach
- So we assess internal consistency by dividing the test into 2 halves and computing the correlation between scores on those two halves for the people who took the test.
- But how should we divide the test into halves to check the correlation?
23. Internal consistency approach
- Split-half method
- Kuder-Richardson formula
- Cronbach's alpha
- All of these assess the extent to which items on a given test measure the same ability or trait.
24. Split-half Reliability
- After testing, divide the test items into halves A and B that are scored separately.
- Compute the correlation of the results for A with the results for B.
- There are various ways of dividing the test into two: randomly, first half vs. second half, odd-even.
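A sketch of the odd-even version; the answer matrix is hypothetical. The Spearman-Brown formula, a standard companion to the split-half method, corrects for the fact that each half is only half as long as the full test:

```python
import statistics

# Hypothetical right (1) / wrong (0) answers: 6 people x 8 items
scores = [
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0],
]

# Odd-even split: one half-score per person from the odd-numbered items,
# another from the even-numbered items
odd_half = [sum(row[0::2]) for row in scores]
even_half = [sum(row[1::2]) for row in scores]

def pearson(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r_half = pearson(odd_half, even_half)

# Spearman-Brown correction: estimates the reliability of the full-length
# test from the correlation between its two shortened halves
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```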
25. Kuder-Richardson 20
- Kuder & Richardson (1937): an internal-consistency measure that doesn't require an arbitrary splitting of the test into 2 halves.
- KR-20 avoids the problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.
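The KR-20 formula is KR-20 = (k / (k - 1)) * (1 - Σ p_i q_i / σ²), where k is the number of items, p_i is the proportion of people answering item i correctly, q_i = 1 - p_i, and σ² is the variance of the total scores. A sketch with hypothetical right/wrong data:

```python
import statistics

# Hypothetical right (1) / wrong (0) answers: 6 people x 8 items
scores = [
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0],
]

k = len(scores[0])                     # number of items
totals = [sum(row) for row in scores]  # each person's total score
var_total = statistics.pvariance(totals)

# For each item: p = proportion correct, q = 1 - p
pq_sum = 0.0
for i in range(k):
    p = sum(row[i] for row in scores) / len(scores)
    pq_sum += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - pq_sum / var_total)
print(round(kr20, 2))
```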
26. Internal Consistency: Cronbach's α
- KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
- Cronbach's α (alpha) generalizes KR-20 to tests with multiple response categories.
- α is a more generally useful measure of internal consistency than KR-20.
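The generalization replaces the Σ p_i q_i term of KR-20 with the sum of the item variances, which is defined for any scoring scheme: α = (k / (k - 1)) * (1 - Σ σ_i² / σ²). A sketch with hypothetical 1-5 ratings:

```python
import statistics

# Hypothetical 1-5 ratings: 5 people x 4 items intended to tap one trait
ratings = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [1, 2, 2, 1],
]

k = len(ratings[0])                     # number of items
totals = [sum(row) for row in ratings]  # each person's total score
var_total = statistics.pvariance(totals)

# Sum of the variances of the individual items
item_vars = sum(
    statistics.pvariance([row[i] for row in ratings]) for i in range(k)
)

alpha = (k / (k - 1)) * (1 - item_vars / var_total)
print(round(alpha, 2))  # a high alpha: the items measure the same trait
```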
27. Review: How do we deal with sources of error?

  Approach         Measures                            Issues
  Test-Retest      Stability of scores                 Carryover
  Parallel Forms   Equivalence, stability              Effort
  Split-half       Equivalence, internal consistency   Shortened test
  KR-20 & α        Equivalence, internal consistency   Difficult to calculate
28. Reliability in Observational Studies
- Some psychologists collect data by observing behavior rather than by testing.
- This approach requires time sampling, leading to sampling error.
- There is further error due to:
- observer failures
- inter-observer differences
29. Reliability in Observational Studies
- Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
- Deal with inter-observer differences using:
- Inter-rater reliability
- The Kappa statistic
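Cohen's kappa corrects the raw agreement between two observers for the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). A sketch with hypothetical observation codes:

```python
from collections import Counter

# Hypothetical codes from two observers watching the same 10 behaviors
rater1 = ["play", "aggr", "play", "rest", "play",
          "aggr", "rest", "play", "aggr", "play"]
rater2 = ["play", "aggr", "play", "play", "play",
          "aggr", "rest", "play", "rest", "play"]

n = len(rater1)

# Observed agreement: proportion of behaviors both observers coded the same
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

# Chance agreement: expected if each observer coded independently,
# at their own base rates for each category
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum(c1[cat] * c2[cat] for cat in c1) / n**2

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # → 0.67
```

Unlike a raw percent-agreement figure, kappa is 0 when the observers agree no more often than chance would predict.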
30. Validity
- We distinguish between the validity of a measure of some psychological process or state and the validity of a conclusion.
- Here, we focus on the validity of measures.
- A subsequent lecture will consider the validity of conclusions.
31. We'll look at the validity of these phases today
32. Validity
- A measure is valid if it measures what you think it measures.
- We traditionally distinguish between four types of validity:
- face
- content
- construct
- criterion
33. Four types of validity
- Face: the test appears to measure what it is supposed to measure.
- Face validity is not formally recognized as a type of validity.
34. Four types of validity
- Construct: the measure captures the theoretical construct it is supposed to measure.
35. Four types of validity
- Content: the measure samples the range of behavior covered by the construct.
36. Four types of validity
- Face
- Construct
- Content
- Criterion:
- Results relate closely to those produced by other measures of the same construct.
- Results do not relate to those produced by measures of other constructs.
37. Review (last week & this week)
- We're not really interested in things that stay the same.
- We're interested in variation.
- But only systematic variation, not random variation:
- systematic variation can be explained
- random variation can't
38. Quick Review
- Some variation in performance is random and some is systematic.
- The scientist's tasks are to separate the systematic variation from the random, and then to build models of the systematic variation.
39. Quick Review
- We choose a measurement scale.
- We prefer either ratio or interval scales, when we can get them.
- We try to maximize both the reliability and the validity of our measurements using that scale.
40. Review questions
- Which would you expect to be easier to assess: reliability or validity?
- Why do we have tools and machines to measure some things for us (such as rulers, scales, and money)?
- What are some analogues for rulers and scales, used when we measure psychological constructs?