Title: Chapter 4
1. Chapter 4: Reliability
- Observed Scores and True Scores
- Error
- How We Deal with Sources of Error
- Domain sampling: test items
- Time sampling: test occasions
- Internal consistency: traits
- Reliability in Observational Studies
- Using Reliability Information
- What To Do about Low Reliability
2. Chapter 4: Reliability
- Measurement of human ability and knowledge is challenging because:
- ability is not directly observable; we infer ability from behavior
- all behaviors are influenced by many variables, only a few of which matter to us
3. Observed Scores
- O = T + e
- O = observed score
- T = true score
- e = error
4. Reliability: the basics
- A true score on a test does not change with repeated testing
- A true score would be obtained if there were no error of measurement
- We assume that errors are random (equally likely to increase or decrease any test result)
5. Reliability: the basics
- Because errors are random, if we test one person many times, the errors will cancel each other out (positive errors cancel negative errors); see the simulation sketch below
- The mean of many observed scores for one person will be the person's true score
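A quick way to see why this works is to simulate the O = T + e model. The sketch below (illustrative Python, not from the slides) assumes a true score of 80 and normally distributed random error; the mean of many observed scores converges on the true score as the errors cancel.

```python
import random

# Hypothetical example: the true score is 80 and each testing occasion
# adds random error e drawn from a normal distribution with mean 0.
TRUE_SCORE = 80.0
ERROR_SD = 5.0

def observed_score():
    """One observed score: O = T + e."""
    return TRUE_SCORE + random.gauss(0, ERROR_SD)

for n in (1, 10, 100, 10_000):
    scores = [observed_score() for _ in range(n)]
    print(f"mean of {n:>6} observed scores = {sum(scores) / n:6.2f}")
# As n grows, the mean approaches 80: positive and negative errors cancel.
```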
6. Reliability: the basics
- Example: measuring Sarah's ability to spell English words
- We can't ask her to spell every word in the OED, so:
- Ask Sarah to spell a subset of English words
- The proportion correct estimates her true English spelling skill
- But which words should be in our subset?
7. Estimating Sarah's spelling ability
- Suppose we choose 20 words randomly
- What if, by chance, we get a lot of very easy words: cat, tree, chair, stand
- Or, by chance, we get a lot of very difficult words: desiccate, arteriosclerosis, numismatics
8. Estimating Sarah's spelling ability
- Sarah's observed score varies as the difficulty of the random sets of words varies
- But presumably her true score (her actual spelling ability) remains constant.
9. Reliability: the basics
- Other things can produce error in our measurement
- E.g., on the first day that we test Sarah, she's tired
- But on the second day, she's rested
- This would lead to different scores on the two days
10. Estimating Sarah's spelling ability
- Conclusion:
- O = T + e
- But e1 ≠ e2 ≠ e3: the error differs from one testing occasion to the next
- The variation in Sarah's scores is produced by measurement error.
- How can we measure such effects? How can we measure reliability?
11. Reliability: the basics
- In what follows, we consider various sources of
error in measurement.
- Different ways of measuring reliability are
sensitive to different sources of error.
12. How do we deal with sources of error?
13. How do we deal with sources of error?
- Error due to test items
- Error due to testing occasions
14. How do we deal with sources of error?
- Error due to test items
- Error due to testing occasions
- Error due to testing multiple traits
- Internal consistency error
15. Domain Sampling error
- A knowledge base or skill set containing many items is to be tested, e.g., the chemical properties of foods.
- We can't test the entire set of items.
- So we select a sample of items.
- That produces domain sampling error, as in Sarah's spelling test.
16. Domain Sampling error
- There is a domain of knowledge to be tested
- A person's score may vary depending upon what is included in or excluded from the test.
17. Domain Sampling error
- Smaller sets of items may not test the entire knowledge base.
- Larger sets of items should do a better job of covering the whole knowledge base.
- As a result, the reliability of a test increases with the number of items on that test.
18. Domain Sampling error
- Parallel Forms Reliability
- choose 2 different sets of test items
- these 2 sets give you parallel forms of the test
- Across all people tested, if the correlation between scores on the 2 parallel forms is low, then we probably have domain sampling error (see the sketch below).
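As an illustration (made-up scores, not from the slides), parallel-forms reliability is simply the Pearson correlation between the two forms, computed across the people tested; the same computation gives test-retest reliability when the two columns are two testing occasions.

```python
import statistics

# Hypothetical scores for 8 people on two parallel forms of a spelling test.
form_a = [12, 15, 9, 18, 14, 11, 16, 13]
form_b = [13, 14, 10, 17, 15, 10, 18, 12]

r = statistics.correlation(form_a, form_b)   # Pearson r (Python 3.10+)
print(f"parallel-forms reliability r = {r:.2f}")
# A low r across many test-takers suggests domain sampling error:
# the two item samples are not equivalent measures of the domain.
```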
19. Time Sampling error
- Test-retest Reliability
- the person taking the test might be having a very good or very bad day due to fatigue, emotional state, preparedness, etc.
- Give the same test repeatedly and check the correlations among scores
- High correlations indicate stability: less influence of bad or good days.
20. Time Sampling error
- The test-retest approach is only useful for traits and characteristics that don't change over time
- Not all low test-retest correlations imply a weak test
- Sometimes, the characteristic being measured varies with time (as in learning)
21. Time Sampling error
- The interval over which the correlation is measured matters
- E.g., for young children, use a very short interval (< 1 month, in general)
- In general, the interval should not be > 6 months
- Not all low test-retest correlations imply a weak test
- Sometimes, the characteristic being measured varies with time (as in learning)
22. Time sampling error
- Test-retest approach advantage: easy to evaluate, using correlation
- Disadvantage: carryover and practice effects
- Carryover: the first testing session influences scores on the next session
- Practice: when the carryover effect involves learning
23. Internal Consistency error
- Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
- Would you expect much correlation between scores on the two parts?
- No, because the two skills are unrelated.
24. Internal Consistency Approach
- A low correlation between scores on 2 halves of a test suggests that the test is tapping two different abilities or traits.
- A good test has high correlations between scores on its two halves.
- But how should we divide the test in two to check that correlation?
25. Internal Consistency error
- Split-half method
- Kuder-Richardson formula
- Cronbach's alpha
- All of these assess the extent to which items on a given test measure the same ability or trait.
26. Split-half Reliability
- After testing, divide the test items into halves A and B that are scored separately.
- Check the correlation of results for A with results for B.
- There are various ways of dividing the test in two: randomly, first half vs. second half, odd vs. even items
27. Split-half Reliability: a problem
- Each half-test is smaller than the whole
- Smaller tests have lower reliability (domain sampling error)
- So, we shouldn't use the raw split-half reliability to assess reliability for the whole test
28. Split-half reliability: a problem
- We correct the reliability estimate using the Spearman-Brown formula (see the sketch below):
- re = 2rc / (1 + rc)
- re = estimated reliability for the whole test
- rc = computed reliability (the correlation between scores on the two halves A and B)
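A minimal sketch of the correction, using a made-up half-test correlation:

```python
def spearman_brown(r_half):
    """Estimated full-test reliability from the correlation between two halves."""
    return (2 * r_half) / (1 + r_half)

r_half = 0.70                            # hypothetical correlation between halves A and B
print(f"{spearman_brown(r_half):.2f}")   # 0.82: the full-length test is more reliable
```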
29. Kuder-Richardson 20
- Kuder and Richardson (1937): an internal-consistency measure that doesn't require arbitrary splitting of the test into 2 halves.
- KR-20 avoids problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.
30. Kuder-Richardson 20
- The formula contains two basic terms:
- total variance: a measure of all the variance in the whole set of test results.
31. Kuder-Richardson 20
- The formula contains two basic terms:
- item variance: when items measure the same trait, they co-vary (the same people get them right or wrong). More co-variance means the summed item variance is a smaller share of the total variance, so KR-20 is higher (see the sketch below).
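The slides do not write the formula out; the standard form is KR-20 = (k / (k − 1)) × (1 − Σ p·q / total variance), where k is the number of items, p is the proportion of people passing an item, and q = 1 − p. A minimal sketch with made-up 0/1 data (population variances are used throughout; some texts use sample variances):

```python
def kr20(item_scores):
    """KR-20 for dichotomous (0/1) items.

    item_scores: one list per person, one 0/1 entry per item.
    """
    n = len(item_scores)                         # number of people
    k = len(item_scores[0])                      # number of items

    totals = [sum(person) for person in item_scores]
    mean_total = sum(totals) / n
    total_var = sum((t - mean_total) ** 2 for t in totals) / n

    # Sum of item variances: p * (1 - p) for each item.
    sum_item_var = 0.0
    for i in range(k):
        p = sum(person[i] for person in item_scores) / n
        sum_item_var += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Hypothetical data: 5 people x 4 items, scored right (1) or wrong (0).
data = [[1, 1, 1, 0],
        [1, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 0, 1]]
print(f"KR-20 = {kr20(data):.2f}")
```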
32. Internal Consistency: Cronbach's alpha
- KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
- Cronbach's alpha generalizes KR-20 to tests with multiple response categories (see the sketch below).
- Alpha is a more generally useful measure of internal consistency than KR-20
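Cronbach's alpha has the same structure as KR-20 but uses each item's ordinary variance, so items can be scored on any numeric scale (e.g., 1-5 ratings). A minimal sketch with made-up data; with 0/1 items it reduces to KR-20.

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for items scored on any numeric scale."""
    n = len(item_scores)
    k = len(item_scores[0])

    def variance(values):                        # population variance
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    totals = [sum(person) for person in item_scores]
    sum_item_var = sum(variance([person[i] for person in item_scores])
                       for i in range(k))
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))

# Hypothetical data: 4 people rating 3 items on a 1-5 scale.
ratings = [[4, 5, 4],
           [2, 3, 3],
           [5, 5, 4],
           [3, 2, 3]]
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```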
33. Review: How do we deal with sources of error?
- Test-Retest: measures stability of scores; issue: carryover effects
- Parallel Forms: measures equivalence and stability; issue: effort (two forms needed)
- Split-half: measures equivalence and internal consistency; issue: shortened test
- KR-20 and alpha: measure equivalence and internal consistency; issue: difficult to calculate
34. Reliability in Observational Studies
- Some psychologists collect data by observing behavior rather than by testing.
- This approach requires time sampling, leading to sampling error
- Further error is due to:
- observer failures
- inter-observer differences
35. Reliability in Observational Studies
- Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
- Deal with inter-observer differences using:
- Inter-rater reliability
- The Kappa statistic
36. Reliability in Observational Studies
- Inter-rater reliability: agreement between 2 or more observers
- Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
- This means that percent agreement may over-estimate inter-rater reliability.
37. Reliability in Observational Studies
- Kappa Statistic (Cohen, 1960)
- estimates actual inter-rater agreement as a proportion of potential inter-rater agreement after correction for chance (see the sketch below).
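A minimal sketch of the computation for two raters making a two-choice judgment (made-up codings): kappa = (observed agreement − chance agreement) / (1 − chance agreement), where chance agreement comes from each rater's own base rates.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters assigning categorical labels."""
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Chance agreement: probability both raters pick the same category
    # if each simply followed their own base rates.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_chance = sum((c1[c] / n) * (c2[c] / n) for c in set(rater1) | set(rater2))

    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical: two observers code 10 behaviours as aggressive (A) or not (N).
obs1 = list("AANNANAANN")
obs2 = list("AANNNNAANA")
print(f"kappa = {cohens_kappa(obs1, obs2):.2f}")   # 0.60 here, vs. 80% raw agreement
```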
38. Using Reliability Information
- Standard error of measurement (SEM)
- estimates the extent to which a test score misrepresents a true score.
- SEM = S × √(1 − r), where S is the standard deviation of test scores and r is the reliability coefficient of the test.
39. Standard Error of Measurement
- We use the SEM to compute a confidence interval for a particular test score.
- The interval is centered on the test score
- We have confidence that the true score falls in this interval
- E.g., 95% of the time the true score will fall within 1.96 × SEM either side of the test (observed) score (see the worked example below).
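A short worked example with made-up numbers (test standard deviation 10, reliability 0.84, observed score 110):

```python
import math

S = 10.0          # standard deviation of test scores (hypothetical)
r = 0.84          # reliability coefficient of the test (hypothetical)
observed = 110.0  # one person's observed score (hypothetical)

sem = S * math.sqrt(1 - r)                       # SEM = 10 * sqrt(0.16) = 4.0
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.1f}")
print(f"95% confidence interval for the true score: {low:.1f} to {high:.1f}")
# 102.2 to 117.8: we are 95% confident the true score lies in this range.
```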
40. Standard Error of Measurement
- A simple way to think of the SEM:
- Suppose we gave one student the same test over and over
- Suppose, too, that no learning took place between tests and the student did not memorize questions
- The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.
41. What to do about low reliability
- Increase the number of items
- To find how many items you need, use the Spearman-Brown formula (sketched below)
- Using more items may introduce new sources of error, such as fatigue and boredom
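The general Spearman-Brown formula for a test lengthened by a factor n is r_new = n·r / (1 + (n − 1)·r); solving for n tells you how much longer the test must be to reach a target reliability. A sketch with made-up numbers:

```python
def lengthening_factor(r_current, r_target):
    """Spearman-Brown: how many times longer the test must be to reach r_target."""
    return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

r_current = 0.60        # hypothetical reliability of a 20-item test
r_target = 0.80         # reliability we want
n = lengthening_factor(r_current, r_target)
print(f"lengthen by a factor of {n:.2f} (about {round(n * 20)} items instead of 20)")
```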
42. What to do about low reliability
- Discriminability analysis (see the sketch below):
- Find the correlation between each item and the whole test
- Delete items with low correlations
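A minimal sketch of an item-total analysis with made-up 0/1 data (in practice the item is often correlated with the total excluding that item, a corrected item-total correlation):

```python
import statistics

# Hypothetical data: 6 people x 4 items, scored 0/1.
items = [[1, 1, 0, 1],
         [1, 0, 0, 1],
         [1, 1, 1, 1],
         [0, 0, 1, 0],
         [1, 1, 0, 1],
         [0, 0, 1, 0]]

totals = [sum(person) for person in items]
for i in range(len(items[0])):
    item_col = [person[i] for person in items]
    r = statistics.correlation(item_col, totals)   # Pearson r (Python 3.10+)
    print(f"item {i + 1}: item-total r = {r:.2f}")
# Items with low (or negative) correlations do not discriminate well
# and are candidates for deletion.
```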