Chapter 4 - Reliability
Transcript and Presenter's Notes
1
Chapter 4 Reliability
  • Observed Scores and True Scores
  • Error
  • How We Deal with Sources of Error
  • Domain sampling: test items
  • Time sampling: test occasions
  • Internal consistency: traits
  • Reliability in Observational Studies
  • Using Reliability Information
  • What To Do about Low Reliability

2
Chapter 4 - Reliability
  • Measurement of human ability and knowledge is
    challenging because
  • ability is not directly observable; we infer ability from behavior
  • all behaviors are influenced by many variables,
    only a few of which matter to us

3
Observed Scores
  • O = T + e
  • O = Observed score
  • T = True score
  • e = error

4
Reliability: the basics
  1. A true score on a test does not change with
    repeated testing
  2. A true score would be obtained if there were no
    error of measurement.
  • We assume that errors are random (equally likely
    to increase or decrease any test result).

5
Reliability: the basics
  • Because errors are random, if we test one person
    many times, the errors will cancel each other out
  • (Positive errors cancel negative errors)
  • Mean of many observed scores for one person will be the person's true score (see the sketch below)
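A quick way to see this is to simulate it. The following sketch (Python with NumPy; the true score of 50 and the error spread of 5 are arbitrary values chosen for illustration) draws many observed scores as true score plus random error and shows that their mean lands close to the true score.

    import numpy as np

    rng = np.random.default_rng(0)

    true_score = 50.0      # the assumed, unobservable true score
    n_testings = 10_000    # number of hypothetical repeated testings

    # O = T + e, with e random and equally likely to be positive or negative
    errors = rng.normal(loc=0.0, scale=5.0, size=n_testings)
    observed = true_score + errors

    print(observed.mean())  # close to 50: errors cancel, so the mean of O estimates T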

6
Reliability: the basics
  • Example: to measure Sarah's spelling ability for English words.
  • We can't ask her to spell every word in the OED, so
  • Ask Sarah to spell a subset of English words
  • % correct estimates her true English spelling skill
  • But which words should be in our subset?

7
Estimating Sarah's spelling ability
  • Suppose we choose 20 words randomly
  • What if, by chance, we get a lot of very easy words: cat, tree, chair, stand
  • Or, by chance, we get a lot of very difficult words: desiccate, arteriosclerosis, numismatics

8
Estimating Sarah's spelling ability
  • Sarah's observed score varies as the difficulty
    of the random sets of words varies
  • But presumably her true score (her actual
    spelling ability) remains constant.

9
Reliability: the basics
  • Other things can produce error in our measurement
  • E.g., on the first day that we test Sarah, she's tired
  • But on the second day, she's rested
  • This would lead to different scores on the two
    days

10
Estimating Sarah's spelling ability
  • Conclusion:
  • O = T + e
  • But e1 ≠ e2 ≠ e3
  • The variation in Sarah's scores is produced by measurement error.
  • How can we measure such effects? How can we measure reliability?

11
Reliability: the basics
  • In what follows, we consider various sources of
    error in measurement.
  • Different ways of measuring reliability are
    sensitive to different sources of error.

12
How do we deal with sources of error?
  • Error due to test items
  • Domain sampling error

13
How do we deal with sources of error?
  • Error due to test items
  • Error due to testing occasions
  • Time sampling error

14
How do we deal with sources of error?
  • Error due to test items
  • Error due to testing occasions
  • Error due to testing multiple traits
  • Internal consistency error

15
Domain Sampling error
  • A knowledge base or skill set containing many
    items is to be tested.
  • E.g., the chemical properties of foods.
  • We can't test the entire set of items.
  • So we select a sample of items.
  • That produces domain sampling error, as in Sarah's spelling test.

16
Domain Sampling error
  • There is a domain of knowledge to be tested
  • A person's score may vary depending upon what is included or excluded from the test.

17
Domain Sampling error
  • Smaller sets of items may not test the entire knowledge base.
  • Larger sets of items should do a better job of covering the whole knowledge base.
  • As a result, the reliability of a test increases with the number of items on that test.

18
Domain Sampling error
  • Parallel Forms Reliability:
  • choose 2 different sets of test items.
  • these 2 sets give you "parallel forms" of the test
  • Across all people tested, if the correlation between scores on the 2 parallel forms is low, then we probably have domain sampling error (see the sketch below).
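As a minimal sketch (Python with NumPy; the score arrays are invented illustrative data), parallel-forms reliability is simply the correlation between the two forms across the people tested:

    import numpy as np

    # Scores for the same 8 people on two parallel forms (invented data)
    form_a = np.array([12, 15, 9, 20, 17, 11, 14, 18])
    form_b = np.array([13, 14, 10, 19, 18, 10, 15, 17])

    # Pearson correlation between forms = parallel-forms reliability estimate
    r_parallel = np.corrcoef(form_a, form_b)[0, 1]
    print(round(r_parallel, 3))  # a low value suggests domain sampling error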

19
Time Sampling error
  • Test-retest Reliability:
  • person taking the test might be having a very good or very bad day, due to fatigue, emotional state, preparedness, etc.
  • Give the same test repeatedly; check correlations among scores
  • High correlations indicate stability: less influence of bad or good days.

20
Time Sampling error
  • Test-retest approach is only useful for traits/characteristics that don't change over time
  • Not all low test-retest correlations imply a weak
    test
  • Sometimes, the characteristic being measured
    varies with time (as in learning)

21
Time Sampling error
  • Interval over which correlation is measured
    matters
  • E.g., for young children, use a very short period (< 1 month, in general)
  • In general, the interval should not be > 6 months
  • Not all low test-retest correlations imply a weak
    test
  • Sometimes, the characteristic being measured
    varies with time (as in learning)

22
Time sampling error
  • Test-retest approach advantage: easy to evaluate, using correlation
  • Disadvantage: carryover and practice effects
  • Carryover: first testing session influences scores on the next session
  • Practice: when the carryover effect involves learning

23
Internal Consistency error
  • Suppose a test includes both items on social
    psychology and items requiring mental rotation of
    abstract visual shapes.
  • Would you expect much correlation between scores
    on the two parts?
  • No, because the two skills are unrelated.

24
Internal Consistency Approach
  • A low correlation between scores on the 2 halves of a test suggests that the test is tapping two different abilities or traits.
  • A good test has high correlations between scores
    on its two halves.
  • But how should we divide the test in two to check
    that correlation?

25
Internal Consistency error
  • Split-half method
  • Kuder-Richardson formula
  • Cronbach's alpha
  • All of these assess the extent to which items on
    a given test measure the same ability or trait.

26
Split-half Reliability
  • After testing, divide test items into halves A and B that are scored separately.
  • Check for correlation of results for A with results for B.
  • Various ways of dividing the test into two: randomly, first half vs. second half, odd-even (see the sketch below)
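A minimal sketch of the odd-even split (Python with NumPy; the item-score matrix is invented): score the odd-numbered and even-numbered items separately, then correlate the two half-scores across people.

    import numpy as np

    # Rows = people, columns = items scored 1 (right) or 0 (wrong); invented data
    items = np.array([
        [1, 1, 0, 1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1, 0],
        [0, 0, 0, 1, 0, 0, 1, 0],
        [1, 0, 1, 1, 1, 1, 1, 1],
    ])

    half_a = items[:, 0::2].sum(axis=1)  # 1st, 3rd, 5th, ... items
    half_b = items[:, 1::2].sum(axis=1)  # 2nd, 4th, 6th, ... items

    r_half = np.corrcoef(half_a, half_b)[0, 1]  # raw split-half correlation
    print(round(r_half, 3))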

27
Split-half Reliability: a problem
  • Each half-test is smaller than the whole test
  • Smaller tests have lower reliability (domain sampling error)
  • So, we shouldn't use the raw split-half reliability to assess reliability for the whole test

28
Split-half reliability: a problem
  • We correct the reliability estimate using the Spearman-Brown formula (see the sketch below):
  • r_e = 2 r_c / (1 + r_c)
  • r_e = estimated reliability for the whole test
  • r_c = computed reliability (correlation between scores on the two halves A and B)
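In code, the correction is a one-liner; a sketch (Python; r_half = 0.70 is just an example value):

    def spearman_brown(r_half: float) -> float:
        """Estimate whole-test reliability from the split-half correlation."""
        return 2 * r_half / (1 + r_half)

    print(round(spearman_brown(0.70), 3))  # 0.824: the whole test is more reliable than either half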

29
Kuder-Richardson 20
  • Kuder & Richardson (1937): an internal-consistency measure that doesn't require arbitrary splitting of the test into 2 halves.
  • KR-20 avoids problems associated with splitting
    by simultaneously considering all possible ways
    of splitting a test into 2 halves.

30
Kuder-Richardson 20
  • The formula contains two basic terms
  • a measure of all the variance in the whole set of test results (the total test variance).

31
Kuder-Richardson 20
  • The formula contains two basic terms
  1. item variance: when items measure the same trait, they co-vary (the same people get them right or wrong). More co-variance means less item variance relative to the total (see the sketch below).
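A common statement of the KR-20 formula, sketched below (Python with NumPy; the right/wrong item matrix is invented), compares the sum of the item variances (p × q for each item) with the variance of the total scores.

    import numpy as np

    def kr20(items: np.ndarray) -> float:
        """KR-20 for items scored 1/0. Rows = people, columns = items."""
        k = items.shape[1]                   # number of items
        p = items.mean(axis=0)               # proportion passing each item
        item_var = (p * (1 - p)).sum()       # sum of item variances (p * q)
        total_var = items.sum(axis=1).var()  # variance of total test scores
        return (k / (k - 1)) * (1 - item_var / total_var)

    items = np.array([                       # invented data, as in the split-half sketch
        [1, 1, 0, 1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1, 0],
        [0, 0, 0, 1, 0, 0, 1, 0],
        [1, 0, 1, 1, 1, 1, 1, 1],
    ])
    print(round(kr20(items), 3))  # higher values = more internally consistent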

32
Internal Consistency: Cronbach's α
  • KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
  • Cronbach's α (alpha) generalizes KR-20 to tests with multiple response categories.
  • α is a more generally useful measure of internal consistency than KR-20 (see the sketch below).
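A sketch of the usual formula for α (Python with NumPy; the 1-to-5 ratings are invented): it has the same shape as KR-20 but uses each item's observed variance, so items need not be scored 0/1.

    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """Cronbach's alpha. Rows = people, columns = items (any numeric scoring)."""
        k = items.shape[1]
        item_var = items.var(axis=0).sum()   # sum of individual item variances
        total_var = items.sum(axis=1).var()  # variance of total scores
        return (k / (k - 1)) * (1 - item_var / total_var)

    ratings = np.array([   # invented 1-5 ratings for 5 people on 4 items
        [4, 5, 4, 4],
        [2, 3, 2, 3],
        [5, 5, 4, 5],
        [1, 2, 1, 2],
        [3, 3, 4, 3],
    ])
    print(round(cronbach_alpha(ratings), 3))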

33
Review How do we deal with sources of error?
  Approach         Measures                            Issues
  Test-Retest      Stability of scores                 Carryover
  Parallel Forms   Equivalence, Stability              Effort
  Split-half       Equivalence, Internal consistency   Shortened test
  KR-20 & alpha    Equivalence, Internal consistency   Difficult to calculate

34
Reliability in Observational Studies
  • Some psychologists collect data by observing
    behavior rather than by testing.
  • This approach requires time sampling, leading to
    sampling error
  • Further error due to
  • observer failures
  • inter-observer differences

35
Reliability in Observational Studies
  • Deal with the possibility of failure in the single-observer situation by having more than 1 observer.
  • Deal with inter-observer differences using
  • Inter-rater reliability
  • Kappa statistic

36
Reliability in Observational Studies
  • Inter-rater reliability:
  • agreement between 2 or more observers
  • problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
  • this means that raw agreement may over-estimate inter-rater reliability.

37
Reliability in Observational Studies
  • Kappa Statistic (Cohen, 1960):
  • estimates actual inter-rater agreement as a proportion of potential inter-rater agreement, after correction for chance (see the sketch below).
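A minimal sketch of the computation for two raters (plain Python; the rating lists are invented): observed agreement minus chance agreement, divided by the maximum possible agreement beyond chance.

    from collections import Counter

    def cohens_kappa(rater1, rater2):
        """Cohen's kappa for two raters who use the same category labels."""
        n = len(rater1)
        observed = sum(a == b for a, b in zip(rater1, rater2)) / n
        # Chance agreement: sum over categories of the product of the raters' marginal proportions
        c1, c2 = Counter(rater1), Counter(rater2)
        expected = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))
        return (observed - expected) / (1 - expected)

    # Two observers classifying 10 behaviors as on-task or off-task (invented)
    r1 = ['on', 'on', 'off', 'on', 'off', 'on', 'on', 'off', 'on', 'off']
    r2 = ['on', 'off', 'off', 'on', 'off', 'on', 'on', 'on', 'on', 'off']
    print(round(cohens_kappa(r1, r2), 2))  # 0.57: agreement beyond what chance alone would give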

38
Using Reliability Information
  • Standard error of measurement (SEM):
  • estimates the extent to which a test score misrepresents the true score.
  • SEM = S √(1 - r), where S is the standard deviation of the test scores and r is the reliability of the test

39
Standard Error of Measurement
  • We use SEM to compute a confidence interval for a
    particular test score.
  • The interval is centered on the test score
  • We have confidence that the true score falls in
    this interval
  • E.g., 95% of the time the true score will fall within 1.96 SEM either side of the test (observed) score (see the sketch below).
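A sketch of the calculation (Python; S = 10, r = 0.91, and the observed score of 72 are example values):

    import math

    S = 10.0         # standard deviation of the test scores (example value)
    r = 0.91         # reliability of the test (example value)
    observed = 72.0  # one person's observed score (example value)

    sem = S * math.sqrt(1 - r)  # standard error of measurement
    low, high = observed - 1.96 * sem, observed + 1.96 * sem

    print(round(sem, 2))                  # 3.0
    print(round(low, 2), round(high, 2))  # 95% interval for the true score: 66.12 to 77.88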

40
Standard Error of Measurement
  • A simple way to think of the SEM
  • Suppose we gave one student the same test over
    and over
  • Suppose, too, that no learning took place between
    tests and the student did not memorize questions
  • The standard deviation of the resulting set of
    test scores (for this one student) would be the
    standard error of measurement.

41
What to do about low reliability
  • Increase the number of items
  • To find how many items you need, use the Spearman-Brown formula (see the sketch below)
  • Using more items may introduce new sources of error, such as fatigue and boredom
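Solving the general Spearman-Brown formula for the lengthening factor gives a quick estimate of how many items are needed. A sketch (Python; the current reliability of 0.70, the target of 0.90, and the 20-item test length are example values):

    def lengthening_factor(r_current: float, r_desired: float) -> float:
        """How many times longer the test must be to reach the desired reliability."""
        return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

    n_items = 20                             # current test length (example)
    factor = lengthening_factor(0.70, 0.90)  # about 3.86
    print(round(factor * n_items))           # about 77 items needed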

42
What to do about low reliability
  • Discriminability analysis:
  • Find correlations between each item and the whole test (see the sketch below)
  • Delete items with low correlations
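A sketch of the item-total correlation step (Python with NumPy; the right/wrong matrix is invented and the 0.2 cut-off is an arbitrary example threshold):

    import numpy as np

    items = np.array([   # rows = people, columns = items (invented 1/0 scores)
        [1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1],
        [1, 1, 1, 1, 0],
        [0, 0, 1, 1, 0],
        [1, 0, 0, 1, 1],
        [1, 1, 1, 1, 1],
    ])
    total = items.sum(axis=1)  # whole-test score for each person

    for i in range(items.shape[1]):
        r_item_total = np.corrcoef(items[:, i], total)[0, 1]
        flag = "  <- candidate for deletion" if r_item_total < 0.2 else ""
        print(f"item {i + 1}: r = {r_item_total:.2f}{flag}")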