Chapter 4 - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Chapter 4


1
Chapter 4 - Reliability
  • Observed Scores and True Scores
  • Error
  • How We Deal with Sources of Error
  • Domain sampling: test items
  • Time sampling: test occasions
  • Internal consistency: traits
  • Reliability in Observational Studies
  • Using Reliability Information
  • What To Do about Low Reliability

2
Chapter 4 - Reliability
  • Measurement of human ability and knowledge is
    challenging because
  • ability is not directly observable; we infer
    ability from behavior
  • all behaviors are influenced by many variables,
    only a few of which matter to us

3
Observed Scores
  • O = T + e
  • O = observed score
  • T = true score
  • e = error

4
Reliability: the basics
  1. A true score on a test does not change with
    repeated testing
  2. A true score would be obtained if there were no
    error of measurement.
  • We assume that errors are random (equally likely
    to increase or decrease any test result).

5
Reliability: the basics
  • Because errors are random, if we test one person
    many times, the errors will cancel each other out
  • (Positive errors cancel negative errors)
  • Mean of many observed scores for one person will
    be the person's true score (see the sketch below)
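
  A minimal simulation sketch of this idea (hypothetical
  true score and error spread; Python with numpy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    true_score = 75                          # T: the (unobservable) true score
    errors = rng.normal(0, 5, size=10_000)   # e: random errors, mean zero
    observed = true_score + errors           # O = T + e, over many testings

    # With many testings the positive and negative errors cancel,
    # so the mean observed score approaches the true score.
    print(observed.mean())                   # approximately 75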

6
Reliability: the basics
  • Example: to measure Sarah's spelling ability for
    English words.
  • We can't ask her to spell every word in the OED,
    so
  • Ask Sarah to spell a subset of English words
  • Percent correct estimates her true English
    spelling skill
  • But which words should be in our subset?

7
Estimating Sarah's spelling ability
  • Suppose we choose 20 words randomly
  • What if, by chance, we get a lot of very easy
    words: cat, tree, chair, stand
  • Or, by chance, we get a lot of very difficult
    words: desiccate, arteriosclerosis, numismatics

8
Estimating Sarah's spelling ability
  • Sarah's observed score varies as the difficulty
    of the random sets of words varies
  • But presumably her true score (her actual
    spelling ability) remains constant.

9
Reliability: the basics
  • Other things can produce error in our measurement
  • E.g., on the first day that we test Sarah, she's
    tired
  • But on the second day, she's rested
  • This would lead to different scores on the two
    days

10
Estimating Sarah's spelling ability
  • Conclusion:
  • O = T + e
  • But e1 ≠ e2 ≠ e3
  • The variation in Sarah's scores is produced by
    measurement error.
  • How can we measure such effects? How can we
    measure reliability?

11
Reliability: the basics
  • In what follows, we consider various sources of
    error in measurement.
  • Different ways of measuring reliability are
    sensitive to different sources of error.

12
How do we deal with sources of error?
  • Error due to test items
  • Domain sampling error

13
How do we deal with sources of error?
  • Error due to test items
  • Error due to testing occasions
  • Time sampling error

14
How do we deal with sources of error?
  • Error due to test items
  • Error due to testing occasions
  • Error due to testing multiple traits
  • Internal consistency error

15
Domain Sampling error
  • A knowledge base or skill set containing many
    items is to be tested.
  • E.g., the chemical properties of foods.
  • We cant test the entire set of items.
  • So we select a sample of items.
  • That produces domain sampling error, as in
Sarah's spelling test.

16
Domain Sampling error
  • There is a domain of knowledge to be tested
  • A person's score may vary depending upon what is
    included or excluded from the test.

17
Domain Sampling error
  • Smaller sets of items may not test entire
    knowledge base.
  • Larger sets of items should do a better job of
    covering the whole knowledge base.
  • As a result, reliability of a test increases with
    the number of items on that test

18
Domain Sampling error
  • Parallel Forms Reliability
  • choose 2 different sets of test items.
  • these 2 sets give you parallel forms of the
    test
  • Across all people tested, if correlation between
    scores on 2 parallel forms is low, then we
    probably have domain sampling error.

19
Time Sampling error
  • Test-retest Reliability
  • person taking test might be having a very good
    or very bad day due to fatigue, emotional
    state, preparedness, etc.
  • Give same test repeatedly; check correlations
    among scores (sketched below)
  • High correlations indicate stability: less
    influence of bad or good days.
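
  A minimal sketch of the test-retest computation
  (hypothetical scores for 8 people; Python with numpy assumed):

    import numpy as np

    # Scores for the same 8 people on two testing occasions
    time1 = np.array([12, 18, 22, 15, 30, 25, 19, 27])
    time2 = np.array([14, 17, 24, 13, 29, 26, 18, 28])

    # Test-retest reliability is the correlation between occasions
    r_test_retest = np.corrcoef(time1, time2)[0, 1]
    print(round(r_test_retest, 3))   # near 1.0 indicates stable scores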

20
Time Sampling error
  • Test-retest approach is only useful for traits or
    characteristics that don't change over time
  • Not all low test-retest correlations imply a weak
    test
  • Sometimes, the characteristic being measured
    varies with time (as in learning)

21
Time Sampling error
  • Interval over which correlation is measured
    matters
  • E.g., for young children, use a very short period
    (< 1 month, in general)
  • In general, interval should not be > 6 months
  • Not all low test-retest correlations imply a weak
    test
  • Sometimes, the characteristic being measured
    varies with time (as in learning)

22
Time sampling error
  • Test-retest approach advantage: easy to evaluate,
    using correlation
  • Disadvantage: carryover and practice effects
  • Carryover: first testing session influences
    scores on next session
  • Practice: when carryover effect involves learning

23
Internal Consistency error
  • Suppose a test includes both items on social
    psychology and items requiring mental rotation of
    abstract visual shapes.
  • Would you expect much correlation between scores
    on the two parts?
  • No, because the two skills are unrelated.

24
Internal Consistency Approach
  • A low correlation between scores on 2 halves of a
    test suggests that the test is tapping two
    different abilities or traits.
  • A good test has high correlations between scores
    on its two halves.
  • But how should we divide the test in two to check
    that correlation?

25
Internal Consistency error
  • Split-half method
  • Kuder-Richardson formula
  • Cronbach's alpha
  • All of these assess the extent to which items on
    a given test measure the same ability or trait.

26
Split-half Reliability
  • After testing, divide test items into halves A
    and B that are scored separately.
  • Check for correlation of results for A with
    results for B.
  • Various ways of dividing test into two:
    randomly, first half vs. second half, odd-even
    (the odd-even split is sketched below)
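
  A minimal odd-even split sketch (hypothetical 0/1 item
  responses; Python with numpy assumed):

    import numpy as np

    # Rows = people, columns = items, scored 1 (right) or 0 (wrong)
    items = np.array([[1, 1, 0, 1, 1, 0, 1, 1],
                      [0, 1, 0, 0, 1, 0, 0, 1],
                      [1, 1, 1, 1, 1, 1, 1, 1],
                      [0, 0, 0, 1, 0, 0, 1, 0],
                      [1, 0, 1, 1, 0, 1, 1, 1]])

    odd_half = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, 7
    even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

    # Raw split-half estimate: correlation between the two half-test scores
    r_c = np.corrcoef(odd_half, even_half)[0, 1]
    print(round(r_c, 3))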

27
Split-half Reliability: a problem
  • Each half-test is smaller than the whole
  • Smaller tests have lower reliability (domain
    sampling error)
  • So, we shouldn't use the raw split-half
    reliability to assess reliability for the whole
    test

28
Split-half reliability: a problem
  • We correct the reliability estimate using the
    Spearman-Brown formula:
  • re = 2rc / (1 + rc)
  • re = estimated reliability for the whole test
  • rc = computed reliability (correlation between
    scores on the two halves A and B)
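
  The same correction as a short Python sketch (the 0.70
  input is a hypothetical split-half correlation):

    def spearman_brown(r_c):
        """Estimate whole-test reliability from the split-half correlation r_c."""
        return 2 * r_c / (1 + r_c)

    print(spearman_brown(0.70))   # ~0.82: the full-length test is more reliable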

29
Kuder-Richardson 20
  • Kuder and Richardson (1937): an internal-consistency
    measure that doesn't require arbitrary splitting
    of test into 2 halves.
  • KR-20 avoids problems associated with splitting
    by simultaneously considering all possible ways
    of splitting a test into 2 halves.

30
Kuder-Richardson 20
  • The formula contains two basic terms
  • a measure of all the variance in the whole set of
    test results.

31
Kuder-Richardson 20
  • The formula contains two basic terms
  1. Item variance: when items measure the same
    trait, they co-vary (same people get them right
    or wrong). More co-variance means less item
    variance
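
  A minimal sketch of the standard KR-20 computation
  (hypothetical 0/1 responses; Python with numpy assumed;
  population variance is used for the total-score variance):

    import numpy as np

    def kr20(items):
        """KR-20 for 0/1-scored items: rows = people, columns = items."""
        k = items.shape[1]
        p = items.mean(axis=0)                # proportion passing each item
        q = 1 - p
        total_var = items.sum(axis=1).var()   # variance of total test scores
        return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

    responses = np.array([[1, 1, 0, 1],
                          [1, 0, 0, 1],
                          [1, 1, 1, 1],
                          [0, 0, 0, 1]])
    print(round(kr20(responses), 3))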

32
Internal Consistency: Cronbach's α
  • KR-20 can only be used with test items scored as
    1 or 0 (e.g., right or wrong, true or false).
  • Cronbach's α (alpha) generalizes KR-20 to tests
    with multiple response categories.
  • α is a more generally useful measure of internal
    consistency than KR-20
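
  A minimal sketch of coefficient alpha (hypothetical
  5-point Likert responses; Python with numpy assumed):

    import numpy as np

    def cronbach_alpha(items):
        """Cronbach's alpha: rows = people, columns = items (any numeric scale)."""
        k = items.shape[1]
        item_vars = items.var(axis=0)         # variance of each item
        total_var = items.sum(axis=1).var()   # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    likert = np.array([[4, 5, 4],
                       [2, 3, 2],
                       [5, 5, 4],
                       [3, 2, 3]])
    print(round(cronbach_alpha(likert), 3))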

33
Review: How do we deal with sources of error?
  Approach          Measures                               Issues
  Test-Retest       Stability of scores                    Carryover
  Parallel Forms    Equivalence and stability              Effort
  Split-half        Equivalence and internal consistency   Shortened test
  KR-20 and alpha   Equivalence and internal consistency   Difficult to calculate

34
Reliability in Observational Studies
  • Some psychologists collect data by observing
    behavior rather than by testing.
  • This approach requires time sampling, leading to
    sampling error
  • Further error due to
  • observer failures
  • inter-observer differences

35
Reliability in Observational Studies
  • Deal with possibility of failure in the
    single-observer situation by having more than 1
    observer.
  • Deal with inter-observer differences using
  • Inter-rater reliability
  • Kappa statistic

36
Reliability in Observational Studies
  • Inter-rater reliability
  • agreement between 2 or more observers
  • problem: in a 2-choice case, 2 judges have a 50%
    chance of agreeing even if they guess!
  • this means that agreement may over-estimate
    inter-rater reliability.

37
Reliability in Observational Studies
  • Kappa Statistic (Cohen, 1960)
  • estimates actual inter-rater agreement as a
    proportion of potential inter-rater agreement
    after correction for chance.
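
  A minimal sketch of the kappa computation (hypothetical
  2-choice codes from two observers; Python with numpy assumed):

    import numpy as np

    def cohen_kappa(rater1, rater2):
        """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
        rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
        categories = np.union1d(rater1, rater2)
        p_o = np.mean(rater1 == rater2)                         # observed agreement
        p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c)   # agreement expected
                  for c in categories)                          # by chance alone
        return (p_o - p_e) / (1 - p_e)

    a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    b = [1, 0, 1, 0, 0, 1, 1, 1, 1, 0]
    print(round(cohen_kappa(a, b), 2))   # lower than the raw 80% agreement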

38
Using Reliability Information
  • Standard error of measurement (SEM)
  • estimates extent to which test score
    misrepresents a true score.
  • SEM = S √(1 - r), where S is the standard
    deviation of the test scores and r is the
    reliability coefficient

39
Standard Error of Measurement
  • We use SEM to compute a confidence interval for a
    particular test score.
  • The interval is centered on the test score
  • We have confidence that the true score falls in
    this interval
  • E.g., 95% of the time the true score will fall
    within 1.96 SEM either way of the test (observed)
    score.
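
  A minimal sketch of the SEM and a 95% confidence interval
  (the SD of 10, reliability of .91, and observed score of 70
  are hypothetical; Python with numpy assumed):

    import numpy as np

    def sem(sd, reliability):
        """Standard error of measurement: SEM = S * sqrt(1 - r)."""
        return sd * np.sqrt(1 - reliability)

    s = sem(10, 0.91)                           # 3.0
    low, high = 70 - 1.96 * s, 70 + 1.96 * s    # interval around the observed score
    print(round(s, 2), (round(low, 1), round(high, 1)))   # 3.0 (64.1, 75.9)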

40
Standard Error of Measurement
  • A simple way to think of the SEM
  • Suppose we gave one student the same test over
    and over
  • Suppose, too, that no learning took place between
    tests and the student did not memorize questions
  • The standard deviation of the resulting set of
    test scores (for this one student) would be the
    standard error of measurement.

41
What to do about low reliability
  • Increase the number of items
  • To find how many items you need, use the
    Spearman-Brown formula (see the sketch below)
  • Using more items may introduce new sources of
    error such as fatigue, boredom
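
  A minimal sketch using the general (prophecy) form of the
  Spearman-Brown formula, which is not shown on the slides;
  the .60, .80, and 20-item figures are hypothetical:

    def lengthening_factor(r_current, r_desired):
        """How many times longer the test must be to reach the desired reliability."""
        return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

    # A 20-item test with r = .60; how many items for r = .80?
    n = lengthening_factor(0.60, 0.80)
    print(round(n, 2), round(n * 20))   # factor ~2.67, about 53 items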

42
What to do about low reliability
  • Discriminability analysis
  • Find correlations between each item and whole
    test (sketched below)
  • Delete items with low correlations
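
  A minimal item-total correlation sketch (hypothetical 0/1
  data and an arbitrary 0.2 cutoff; Python with numpy assumed):

    import numpy as np

    # Rows = people, columns = items
    items = np.array([[1, 1, 0, 1, 1],
                      [0, 1, 0, 0, 1],
                      [1, 1, 1, 1, 0],
                      [0, 0, 0, 1, 1],
                      [1, 0, 1, 1, 0],
                      [1, 1, 1, 1, 1]])
    total = items.sum(axis=1)   # whole-test score for each person

    # Correlate each item with the whole-test score; flag weak items
    for j in range(items.shape[1]):
        r = np.corrcoef(items[:, j], total)[0, 1]
        flag = "  <- candidate for deletion" if r < 0.2 else ""
        print(f"item {j + 1}: r = {r:.2f}{flag}")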