Reliability - PowerPoint PPT Presentation

About This Presentation
Title:

Reliability

Slides: 41
Provided by: bro130

Transcript and Presenter's Notes

Title: Reliability


1
Reliability
  • a measure is reliable if it gives the same
    information every time it is used.
  • reliability is assessed by a number, typically a
    correlation between two sets of scores

2
Reliability
  • Measurement of human ability and knowledge is
    challenging because:
  • ability is not directly observable; we infer
    ability from behavior
  • all behaviors are influenced by many variables,
    only a few of which matter to us

3
Observed Scores
  • O = T + e
  • O = Observed score
  • T = True score
  • e = error

4
Reliability: the basics
  • A true score on a test does not change with
    repeated testing
  • A true score would be obtained if there were no
    error of measurement.
  • We assume that errors are random (equally likely
    to increase or decrease any test result).

5
Reliability: the basics
  • Because errors are random, if we test one person
    many times, the errors will cancel each other out
  • (Positive errors cancel negative errors)
  • The mean of many observed scores for one person
    will approach that person's true score
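
A minimal simulation sketch of this idea, assuming errors are drawn from a zero-mean normal distribution. The true score of 70 and the error spread of 10 are made-up values for illustration, not figures from the lecture.

```python
import random
import statistics

def observe(true_score, error_sd, rng):
    """One observed score, O = T + e, with e drawn from a zero-mean normal."""
    return true_score + rng.gauss(0.0, error_sd)

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_SCORE = 70.0        # hypothetical true score
ERROR_SD = 10.0          # hypothetical spread of the random error

few = [observe(TRUE_SCORE, ERROR_SD, rng) for _ in range(5)]
many = [observe(TRUE_SCORE, ERROR_SD, rng) for _ in range(10_000)]

mean_few = statistics.fmean(few)
mean_many = statistics.fmean(many)
print(f"mean of 5 tests:      {mean_few:.2f}")
print(f"mean of 10,000 tests: {mean_many:.2f}")
```

With only a few tests the mean can still sit some distance from the true score; with many tests the positive and negative errors largely cancel.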

6
Reliability: the basics
  • Example: to measure Sarah's spelling ability for
    English words.
  • We can't ask her to spell every word in the
    dictionary, so
  • Ask Sarah to spell a subset of English words
  • Percent correct estimates her true English
    spelling skill
  • But which words should be in our subset?

7
Estimating Sarah's spelling ability
  • Suppose we choose 20 words randomly
  • Then, by chance, we may get a lot of very easy
    words: cat, tree, chair, stand
  • Or, by chance, we may get a lot of very difficult
    words: desiccate, arteriosclerosis, numismatics

8
Estimating Sarah's spelling ability
  • Sarah's observed score will vary with the
    difficulty of the random sets of words we choose
  • But presumably her actual spelling ability
    remains constant.
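
A small simulation sketch of this point. The "dictionary" below is hypothetical: each word is given a made-up probability that Sarah spells it correctly, and that list (her ability) never changes between tests. Only the random sample of 20 words changes.

```python
import random

rng = random.Random(1)   # fixed seed so the run is repeatable

# Hypothetical dictionary: one probability of a correct spelling per word.
p_correct = [rng.random() for _ in range(5000)]
true_skill = sum(p_correct) / len(p_correct)   # expected proportion correct overall

def score_on_random_test(n_words):
    """Proportion correct on a randomly drawn test of n_words words."""
    chosen = rng.sample(p_correct, n_words)
    return sum(rng.random() < p for p in chosen) / n_words

scores = [score_on_random_test(20) for _ in range(10)]
print(f"true skill: {true_skill:.2f}")
print("ten 20-word tests:", [f"{s:.2f}" for s in scores])
```

The ten observed scores differ from one another even though the underlying ability is held fixed.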

9
Reliability: the basics
  • Other things can produce error in our measurement
  • E.g., on the first day that we test Sarah, she's
    tired
  • but on the second day, she's rested

10
Estimating Sarah's spelling ability
  • Conclusion
  • O = T + e
  • But e1 ≠ e2 ≠ e3
  • The variation in Sarah's scores is produced by
    measurement error.
  • How can we measure such effects? That is, how can
    we measure reliability?

11
Reliability: the basics
  • In what follows, we consider various sources of
    error in measurement.
  • Different ways of measuring reliability are
    sensitive to different sources of error.

12
How do we deal with sources of error?
  • Error due to test items
  • Domain sampling error

13
Domain Sampling error
  • A knowledge base or skill set containing many
    items is to be tested.
  • E.g., chemical properties of foods.
  • We can't test the entire set of items.
  • So we sample items.
  • That produces sampling error, as in Sarah's
    spelling test.

14
Domain Sampling error
  • Smaller sets of items may not test entire
    knowledge base.
  • A person's score may vary depending upon what is
    included or excluded from test.
  • Reliability increases with number of items on a
    test
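
The slides do not name a formula for this, but one standard way to quantify it is the Spearman-Brown prophecy formula, sketched below. It assumes the added items are comparable to the existing ones; the starting reliability of 0.50 is a made-up value.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is made length_factor times as long.

    Assumes the added items behave like the items already on the test.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical test with reliability 0.50, lengthened 1x, 2x, 4x and 8x.
for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(0.50, k), 3))
```

Under these assumptions the predicted reliability rises with test length, with diminishing returns as the test gets longer.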

15
Domain Sampling error
  • Parallel Forms Reliability
  • Choose 2 different sets of test items.
  • Across all people tested, if correlation between
    scores on 2 sets of words is low, then we
    probably have domain sampling error.
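
A sketch of the parallel-forms check using a hand-written Pearson correlation. The scores for the two forms are invented for illustration; each position in the lists is one person.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores of eight people on two different sets of items.
form_a = [12, 15, 9, 18, 14, 7, 16, 11]
form_b = [13, 14, 10, 17, 15, 8, 15, 10]

r_parallel = pearson_r(form_a, form_b)
print(f"parallel-forms correlation: {r_parallel:.2f}")
```

A high value here would suggest the two item sets rank people similarly; a low value would point to domain sampling error.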

16
How do we deal with sources of error?
  • Error due to test items
  • Error due to testing occasions
  • Time sampling error

17
Time Sampling error
  • Test-retest Reliability
  • person taking test might be having a very good
    or very bad day due to fatigue, emotional
    state, preparedness, etc.
  • Give same test repeatedly; check correlations
    among scores
  • High correlations indicate stability: less
    influence of bad or good days.

18
Time sampling error
  • Advantage: easy to evaluate, using correlation
  • Disadvantage: carryover and practice effects

19
How do we deal with sources of error?
  • Error due to test items
  • Error due to testing occasions
  • Error due to testing multiple traits
  • Internal consistency error

20
Internal consistency approach
  • Suppose a test includes both (1) items on social
    psychology and (2) items requiring mental
    rotation of abstract visual shapes.
  • Would you expect much correlation between scores
    on the two parts?
  • No, because the two skills are unrelated.

21
Internal consistency approach
  • A low correlation between scores on 2 halves of a
    test suggests that the test is tapping two
    different abilities or traits.
  • In such a case, the two halves of the test give
    information about two different, uncorrelated
    traits

22
Internal consistency approach
  • So we assess internal consistency by dividing the
    test into 2 halves and computing the correlation
    between scores on those two halves for the people
    who took the test
  • But how should we divide the test into halves to
    check the correlation?

23
Internal consistency approach
  • Split-half method
  • Kuder-Richardson formula
  • Cronbach's alpha
  • All of these assess the extent to which items on
    a given test measure the same ability or trait.

24
Split-half Reliability
  • After testing, divide test items into halves A
    and B that are scored separately.
  • Compute correlation of results for A with results
    for B.
  • Various ways of dividing test into two:
    randomly, first half vs. second half, odd-even
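
A sketch of the odd-even version of this procedure. The response matrix is invented: each row is one person, each column one item, scored 1 for correct and 0 for wrong.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: 6 people x 8 items.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

half_a = [sum(row[0::2]) for row in responses]  # items 1, 3, 5, 7
half_b = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, 8

r_half = pearson_r(half_a, half_b)
print("half A scores:", half_a)
print("half B scores:", half_b)
print(f"split-half correlation: {r_half:.2f}")
```

Swapping the two slicing lines for a random or first-half/second-half split would give a different coefficient from the same data, which is the arbitrariness of the split-half method.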

25
Kuder-Richardson 20
  • Kuder and Richardson (1937): an
    internal-consistency measure that doesn't
    require arbitrary splitting of test into 2 halves.
  • KR-20 avoids problems associated with splitting
    by simultaneously considering all possible ways
    of splitting a test into 2 halves.
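
A sketch of the usual KR-20 computation, KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores), where k is the number of items, p is the proportion answering an item correctly and q = 1 - p. The data are invented, and the total-score variance here divides by n so that it matches the p*q item variances.

```python
def kr20(responses):
    """KR-20 for a people x items matrix of 0/1 scores."""
    n_people = len(responses)
    k = len(responses[0])
    # p = proportion of people answering each item correctly
    p = [sum(row[i] for row in responses) / n_people for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    # population variance (divide by n), consistent with p * q above
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical data: 6 people x 8 items, 1 = correct, 0 = wrong.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

print(f"KR-20: {kr20(responses):.3f}")
```

No choice of halves appears anywhere in the computation; every item contributes through its own p*q term.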

26
Internal Consistency: Cronbach's α
  • KR-20 can only be used with test items scored as
    1 or 0 (e.g., right or wrong, true or false).
  • Cronbach's α (alpha) generalizes KR-20 to tests
    with multiple response categories.
  • α is a more generally-useful measure of internal
    consistency than KR-20
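
A sketch of the usual alpha computation, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). The ratings are invented responses on a hypothetical 1-to-5 scale.

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for a people x items matrix of numeric scores."""
    k = len(responses[0])
    items = list(zip(*responses))                 # one tuple of scores per item
    sum_item_var = sum(statistics.variance(col) for col in items)
    total_var = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Hypothetical data: 5 people x 4 items, each rated 1 to 5.
ratings = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]

print(f"Cronbach's alpha: {cronbach_alpha(ratings):.3f}")
```

Applied to items scored 0 or 1, the same function gives the KR-20 value, because the variance of a 0/1 item is proportional to p*q.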

27
Review: How do we deal with sources of error?
  • Approach | Measures | Issues
  • Test-Retest | Stability of scores | Carryover
  • Parallel Forms | Equivalence and stability | Effort
  • Split-half | Equivalence and internal consistency | Shortened test
  • KR-20 and α | Equivalence and internal consistency | Difficult to calculate

28
Reliability in Observational Studies
  • Some psychologists collect data by observing
    behavior rather than by testing.
  • This approach requires time sampling, leading to
    sampling error
  • Further error due to
  • observer failures
  • inter-observer differences

29
Reliability in Observational Studies
  • Deal with possibility of failure in the
    single-observer situation by having more than 1
    observer.
  • Deal with inter-observer differences using
  • Inter-rater reliability
  • Kappa statistic
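
A sketch of Cohen's kappa for two observers, kappa = (observed agreement - chance agreement) / (1 - chance agreement). The codes "on" and "off" are hypothetical labels for one behavior observed in ten time samples.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters coding the same sequence of observations.

    Undefined (division by zero) if chance agreement is exactly 1,
    e.g. when both raters use one single category throughout.
    """
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # chance agreement: product of the two raters' category proportions
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes from two observers over ten time samples.
observer_a = ["on", "on", "off", "on", "off", "off", "on", "on", "off", "on"]
observer_b = ["on", "off", "off", "on", "off", "on", "on", "on", "off", "on"]

kappa = cohens_kappa(observer_a, observer_b)
print(f"kappa: {kappa:.2f}")
```

The observers agree on 8 of 10 samples, but kappa comes out lower than 0.8 because some of that agreement would be expected by chance alone.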

30
Validity
  • We distinguish between the validity of a measure
    of some psychological process or state and the
    validity of a conclusion.
  • Here, we focus on validity of measures.
  • A subsequent lecture will consider the validity
    of conclusions.

31
We'll look at validity of these phases today

32
Validity
  • a measure is valid if it measures what you think
    it measures.
  • we traditionally distinguish between four types
    of validity
  • face
  • content
  • construct
  • criterion

33
Four types of validity
  • Face
  • The test appears to measure what it is supposed
    to measure
  • not formally recognized as a type of validity

34
Four types of validity
  • Face
  • Construct
  • The measure captures the theoretical construct it
    is supposed to measure

35
Four types of validity
  • Face
  • Construct
  • Content
  • The measure samples the range of behavior covered
    by the construct.

36
Four types of validity
  • Face
  • Construct
  • Content
  • Criterion
  • Results relate closely to those produced by other
    measures of the same construct.
  • Results do not relate to those produced by
    measures of other constructs

37
Review (last week and this week)
  • We're not really interested in things that stay
    the same.
  • We're interested in variation.
  • But only systematic variation, not random
    variation
  • systematic variation can be explained
  • random variation can't

38
Quick Review
  • Some variation in performance is random and some
    is systematic
  • The scientist's tasks are to separate the
    systematic variation from the random, and then to
    build models of the systematic variation.

39
Quick Review
  • We choose a measurement scale.
  • We prefer either ratio or interval scales, when
    we can get them.
  • We try to maximize both the reliability and the
    validity of our measurements using that scale.

40
Review questions
  • Which would you expect to be easier to assess:
    reliability or validity?
  • Why do we have tools and machines to measure
    some things for us (such as rulers, scales, and
    money)?
  • What are some analogues for rulers and scales,
    used when we measure psychological constructs?