Testing 05

Transcript and Presenter's Notes
1
Testing 05
  • Reliability

2
Errors & Reliability
  • Errors in the test cause unreliability.
  • The fewer the errors, the more reliable the test.
  • Sources of errors:
  • Obvious: poor health, fatigue, lack of interest
  • Less obvious: facets discussed in Fig. 5.3

3
Reliability & Validity
  • Reliability is a necessary condition for
    validity.
  • Reliability & validity are complementary aspects
    of measurement.
  • Reliability: how much of the performance is due
    to measurement errors, or to factors other than
    the language ability we want to measure.
  • Validity: how much of the performance is due to
    the language ability we want to measure.

4
Reliability Measurement
  • Reliability measurement includes logical
    analysis and empirical research, i.e. identifying
    sources of errors and estimating the magnitude of
    their effects on the scores.

5
Logical Analysis
  • Example of identifying a source of error:
  • Topic in an oral interview: business negotiation
  • A source of error if we want to measure the test
    taker's ability to handle general topics.
  • An indicator of the ability if we want to measure
    the test taker's ability in business English.

6
Empirical Research
  • Procedures are usually complex.
  • Three kinds of theories:
  • Classical true score theory (CTS)
  • Generalizability theory (G-Theory)
  • Item Response Theory (IRT)

7
Factors on Test Scores
  • Characteristics of factors:
  • general vs. specific
  • lasting vs. temporary
  • systematic vs. unsystematic

8
Factors that affect language test scores
9
Variance & Standard Deviation
  • s = standard deviation of the sample
  • σ = standard deviation of the population
  • s² = variance of the sample
  • σ² = variance of the population
  • s = √(Σ(X - X̄)²/(n - 1))
  • where
  • X = individual score
  • X̄ = mean score
  • n = number of students
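A minimal Python sketch of this formula; the score list is made-up illustration data:

```python
import math

scores = [72, 85, 90, 64, 78]   # hypothetical individual scores X
n = len(scores)                 # number of students
mean = sum(scores) / n          # mean score X-bar

# Sample variance: s^2 = sum((X - X-bar)^2) / (n - 1)
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
s = math.sqrt(variance)         # sample standard deviation

print(f"mean = {mean:.2f}, s^2 = {variance:.2f}, s = {s:.2f}")
```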

10
Correlation Coefficient (相关系数)
  • Covariance (COV): how two variables, X and Y, vary
    together.
  • COV(X,Y) = (1/(n-1))·Σ(Xi - X̄)(Yi - Ȳ)
  • Correlation Coefficient (Pearson Product-moment
    Correlation Coefficient 皮尔逊积矩相关系数)
  • r(x,y) = COV(x,y)/(s_x·s_y)
  • r(x,y) = (1/(n-1))·Σ(Xi - X̄)(Yi - Ȳ)/(s_x·s_y)

11
Correlation Coefficient
  • Where
  • n = number of test takers (score pairs)
  • Xi = individual score on the first half
  • X̄ = mean of the scores on the first half
  • Yi = individual score on the second half
  • Ȳ = mean of the scores on the second half
  • s_x = standard deviation of the first half
  • s_y = standard deviation of the second half

12
Calculation of Correlation Coefficient
  • Manually
  • Manually + Excel
  • Excel
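For the manual route, a minimal Python sketch of the covariance and Pearson formulas above (made-up half-test scores, one pair per student):

```python
import math

X = [10, 12, 9, 15, 11]   # hypothetical first-half scores
Y = [11, 13, 8, 14, 12]   # hypothetical second-half scores
n = len(X)                # number of score pairs (test takers)

mx, my = sum(X) / n, sum(Y) / n
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(X, Y)) / (n - 1)
sx = math.sqrt(sum((xi - mx) ** 2 for xi in X) / (n - 1))
sy = math.sqrt(sum((yi - my) ** 2 for yi in Y) / (n - 1))

r = cov / (sx * sy)       # Pearson product-moment correlation
print(f"COV = {cov:.3f}, r = {r:.3f}")
```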

13
Classical True Score Theory
  • Also referred to as the classical reliability
    theory because its major task is to estimate the
    reliability of the observed scores of a test.
    That is, it attempts to estimate the strength of
    the relationship between the observed score and
    the true score.
  • Sometimes referred to as the true score theory
    because its theoretical derivations are based on
    a mathematical model known as the true score
    model.

14
(No Transcript)
15
Assumptions in CTS
  • Assumption 1: The observed score consists of the
    true score and the error score, i.e. x = x_t + x_e
  • Assumption 2: Error scores are unsystematic,
    random and uncorrelated with the true score, i.e.
    s_x² = s_t² + s_e²

16
Parallel Test
  • Two tests are parallel if:
  • x̄ = x̄' (equal means)
  • s_x² = s_x'² (equal variances)
  • r_xy = r_x'y (equal correlations with a third
    measure)

17
Correlation Between Parallel Tests
  • If the observed scores on two parallel tests are
    highly correlated, the effects of the error
    scores are minimal.
  • Reliability is the correlation between the
    observed scores of two parallel tests.
  • The definition is the basis for all estimates of
    reliability within CTS theory.
  • Condition: the observed scores on the two tests
    are experimentally independent.

18
Error Score Estimation and Measurement
  • Relations between reliability, true score and
    error score:
  • The higher the portion of the true score, the
    higher the correlation of the two parallel tests.
    (True scores are systematic)
  • The higher the portion of the error score, the
    lower the correlation of the two parallel tests.
    (Error scores are random)

19
Error Score Estimation and Measurement
  • r_xx' = s_t²/s_x²
  • (s_t² + s_e²)/s_x² = 1
  • s_e²/s_x² = 1 - s_t²/s_x²
  • s_t²/s_x² = r_xx'
  • s_e²/s_x² = 1 - r_xx'
  • s_e² = s_x²(1 - r_xx')

20
Approaches to Estimate Reliability
  • Three approaches, based on different sources of
    errors:
  • Internal consistency: sources of error from
    within the test and the scoring procedure
  • Stability: how consistent test scores are over
    time
  • Equivalence: whether scores on alternative forms
    of a test are equivalent

21
Internal Consistency
  • Dichotomous:
  • Split-half reliability estimates
  • The Spearman-Brown split-half estimate
  • The Guttman split-half estimate
  • Kuder-Richardson reliability coefficients
  • Non-dichotomous:
  • Coefficient alpha
  • Rater consistency

22
Split-half Reliability Estimates
  • Split the test into two halves which have equal
    means and variances (equivalence) and are
    independent of each other (independence).
  • Three ways to split:
  • 1. first half vs. second half
  • 2. random halves
  • 3. odd-even method

23
Spearman-Brown Reliability Estimate
  • r_xx' = 2r_hh/(1 + r_hh)
  • where
  • r_hh = correlation between the two halves of the
    test
  • Procedure:
  • 1. Divide the test into two equal halves
  • 2. Calculate the correlation coefficient between
    the two halves
  • 3. Calculate the Spearman-Brown reliability
    estimate
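Step 3 as a short Python sketch, assuming a hypothetical half-test correlation r_hh from step 2:

```python
r_hh = 0.62                    # hypothetical correlation between the halves
r_xx = 2 * r_hh / (1 + r_hh)   # Spearman-Brown full-test reliability
print(f"r_xx' = {r_xx:.3f}")   # 0.765
```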

24
Guttman Split-Half Estimate
  • r_xx' = 2(1 - (s_h1² + s_h2²)/s_x²)
  • where
  • s_h1² = variance of the first half
  • s_h2² = variance of the second half
  • s_x² = variance of the total scores
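A Python sketch with hypothetical variances plugged into the Guttman formula:

```python
s_h1_sq = 4.2    # hypothetical variance of the first half
s_h2_sq = 3.8    # hypothetical variance of the second half
s_x_sq = 14.5    # hypothetical variance of the total scores

r_xx = 2 * (1 - (s_h1_sq + s_h2_sq) / s_x_sq)
print(f"Guttman r_xx' = {r_xx:.3f}")   # 0.897
```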

25
Kuder-Richardson Formula 20
  • r_xx' = (k/(k-1))(1 - Σpq/s_x²)
  • where
  • k = number of items on the test
  • p = proportion of correct answers on an item,
    i.e. correct answers/total answers (difficulty)
  • q = proportion of incorrect answers, i.e. 1 - p
  • s_x² = total test score variance
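A Python sketch of the whole KR-20 computation on a made-up 0/1 response matrix (rows = students, columns = items):

```python
responses = [        # hypothetical dichotomous item responses
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
n = len(responses)
k = len(responses[0])               # number of items

totals = [sum(row) for row in responses]
mean_total = sum(totals) / n
s_x_sq = sum((t - mean_total) ** 2 for t in totals) / (n - 1)

pq_sum = 0.0
for i in range(k):
    p = sum(row[i] for row in responses) / n   # item difficulty p
    pq_sum += p * (1 - p)                      # p * q for this item

kr20 = (k / (k - 1)) * (1 - pq_sum / s_x_sq)
print(f"KR-20 = {kr20:.3f}")                   # 0.595
```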

26
Kuder-Richardson Formula 21
  • r_xx' = (k·s_x² - x̄(k - x̄))/((k-1)·s_x²)
  • where
  • k = number of items on the test
  • s_x² = total test score variance
  • x̄ = mean score
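A Python sketch with hypothetical values for k, the mean and the variance:

```python
k = 40          # hypothetical number of items
x_bar = 28.0    # hypothetical mean score
s_x_sq = 36.0   # hypothetical total test score variance

kr21 = (k * s_x_sq - x_bar * (k - x_bar)) / ((k - 1) * s_x_sq)
print(f"KR-21 = {kr21:.3f}")   # 0.786
```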

27
Coefficient alpha
  • α = (k/(k-1))(1 - Σs_i²/s_x²)
  • where
  • k = number of parts of the test
  • Σs_i² = sum of the variances of the different
    parts of the test
  • s_x² = variance of the test scores
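A Python sketch of coefficient alpha on a made-up student-by-part score table:

```python
parts = [        # hypothetical scores; rows = students, columns = parts
    [4, 5, 3],
    [3, 4, 4],
    [5, 5, 4],
    [2, 3, 2],
    [4, 4, 5],
]
n = len(parts)
k = len(parts[0])                     # number of parts

def var(xs):                          # sample variance, n - 1 denominator
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

part_var_sum = sum(var([row[i] for row in parts]) for i in range(k))
total_var = var([sum(row) for row in parts])

alpha = (k / (k - 1)) * (1 - part_var_sum / total_var)
print(f"alpha = {alpha:.3f}")         # 0.822
```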

28
Comparison of Estimates & Assumptions
 
29
Summary: Estimate Procedure
  • Spearman-Brown
  • 1. split
  • 2. variances of each half (to check equivalence)
  • 3. correlation coefficient between the two halves
  • 4. reliability coefficient

30
Summary: Estimate Procedure
  • Guttman
  • 1. split
  • 2. variances of each half
  • 3. variance of the whole test
  • 4. reliability coefficient

31
Summary: Estimate Procedure
  • K-R 20
  • 1. number of questions
  • 2. proportion of correct answers of each question
  • 3. proportion of incorrect answers of each
    question
  • 4. sum of the product of p and q
  • 5. variance of the whole test
  • 6. reliability coefficient

32
Summary: Estimate Procedure
  • K-R 21
  • 1. number of questions
  • 2. mean of the test
  • 3. variance of the test
  • 4. reliability coefficient

33
Summary: Estimate Procedure
  • Coefficient alpha
  • 1. number of the parts of the test
  • 2. mean of each part
  • 3. variance of each part
  • 4. sum of variances of all parts
  • 5. mean of the test
  • 6. variance of the test
  • 7. reliability coefficient

34
Rater Consistency
  • Intra-rater
  • Inter-rater

35
Intra-rater Reliability
  • Rate each paper twice. Condition: the two ratings
    must be independent of each other.
  • Two ways of estimating:
  • Spearman-Brown: take each rating as a split half
    and compute the reliability coefficient.

36
Intra-rater Reliability
  • Conditions: the two ratings must have similar
    means and variances to ensure the equivalence of
    the two ratings
  • Coefficient alpha: take the two ratings as two
    parts of a test.
  • α = (k/(k-1))(1 - (s_x1² + s_x2²)/s_(x1+x2)²)

37
Intra-rater Reliability
  • where
  • k = number of ratings
  • s_x1² = variance of the first rating
  • s_x2² = variance of the second rating
  • s_(x1+x2)² = variance of the summed ratings
  • Since k = 2, the formula can be reduced to the
    Guttman reliability coefficient formula.

38
Inter-rater Reliability
  • If there are only two raters, use split-half
    estimates to obtain the reliability coefficient.
  • Or: Grade (rank) Correlation Coefficient
  • r_xx' = 1 - 6ΣD²/(n(n² - 1))
  • where
  • D = difference between the grades (ranks) of the
    two ratings

39
Inter-rater Reliability
  • n = number of test takers
  • See testing 05-2 sheet 5 for an example.
  • Note: tied scores should share the same grade
    (average rank).
  • If there are more than two raters, use the
    coefficient alpha estimate.
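A Python sketch of the grade correlation for two raters, following the slide's formula, with tied scores sharing the average rank as noted above (the ratings are made up):

```python
def avg_ranks(scores):
    """Rank scores; tied values share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1                       # extend the run of tied scores
        shared = (i + 1 + j + 1) / 2     # average of positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks

rater1 = [88, 75, 75, 60, 92]   # hypothetical grades from rater 1
rater2 = [85, 70, 78, 65, 90]   # hypothetical grades from rater 2
n = len(rater1)

d2 = sum((a - b) ** 2 for a, b in zip(avg_ranks(rater1), avg_ranks(rater2)))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(f"grade correlation = {rho:.3f}")   # 0.975
```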

40
Stability (test-retest reliability)
  • Administer the test twice to a group of
    individuals and compute the correlation between
    the two sets of scores. The correlation can then
    be interpreted as an indicator of how stable the
    scores are over time.
  • Learning effects and practice effects must be
    taken into account.

41
Equivalence (parallel forms reliability)
  • Use alternative forms of a given test. Compute
    and compare the means and standard deviations of
    each of the two forms to determine their
    equivalence. The correlation between the two sets
    of scores can be interpreted as an indicator of
    the equivalence of the two tests or an estimate of
    the reliability of either one.

42
GENERALIZABILITY THEORY
43
GENERALIZABILITY THEORY
  • Generalizability theory (G-theory) builds on the
    frameworks of factorial design and the analysis
    of variance. It constitutes a theory and set of
    procedures for specifying and estimating the
    relative effects of different factors on observed
    test scores, and thus provides a means for
    relating test uses or interpretations to the way
    test users specify and interpret different
    factors as either abilities or sources of error.

44
GENERALIZABILITY THEORY
  • G-theory treats a given measure or score as a
    sample from a hypothetical universe of possible
    measures, i.e. on the basis of an individual's
    performance on a test we generalize to his
    performance in other contexts.
  • Reliability = generalizability.
  • The way we define a given universe of measures
    will depend upon the universe of generalization.

45
Application of G-theory
  • Two stages:
  • G-study
  • D-study

46
G-study
  • Consider the uses that will be made of the test
    scores, and investigate the sources of variance
    that are of concern or interest. On the basis of
    this generalizability study, the test developer
    obtains estimates of the relative sizes of the
    different sources of variance ('variance
    components').

47
D-study
  • When the results of the G-study are satisfactory,
    the test developer administers the test under
    operational conditions, and uses G-theory
    procedures to estimate the magnitude of the
    variance components. These estimates provide
    information that can inform the interpretation
    and use of the test scores.

48
Significance of G-theory
  • The application of G-Theory thus enables test
    developers and test users to specify the
    different sources of variance that are of concern
    for a given test use, to estimate the relative
    importance of these different sources
    simultaneously, and to employ these estimates in
    the interpretation and use of test scores.

49
Universes Of Generalization And Universe Of
Measures
  • Universe of generalization: a domain of uses or
    abilities (or both)
  • Universe of possible measures: the types of test
    scores we would be willing to accept as
    indicators of the ability to be measured for the
    purpose intended

50
Populations of Persons
  • In addition to defining the universe of possible
    measures, we must define the group, or population
    of persons about whom we are going to make
    decisions or inferences.

51
Universe Score
  • A universe score x_p is thus defined as the mean
    of a person's scores on all measures from the
    universe of possible measures. The universe score
    is thus the G-theory analog of the CTS-theory
    true score. The variance of a group of persons'
    scores on all measures would be equal to the
    universe score variance s_p², which is similar to
    CTS true score variance in the sense that it
    represents that proportion of observed score
    variance that remains constant across different
    individuals and different measurement facets and
    conditions.

52
Universe Score
  • The universe score is different from the CTS true
    score, however, in that an individual is likely
    to have different universe scores for different
    universes of measures.

53
Generalizability Coefficients
  • The G-theory analog of the CTS-theory reliability
    coefficient is the generalizability coefficient,
    which is defined as the proportion of observed
    score variance that is universe score variance:
  • ρ²_xx' = s_p²/s_x²
  • where s_p² is universe score variance and s_x² is
    observed score variance, which includes both
    universe score and error variance.

54
Estimation
  • Variance components = sources of variance:
  • persons (p), forms (f), raters (r)
  • s_x² = s_p² + s_f² + s_r² + s_pf² + s_pr² +
    s_fr² + s_pfr²
  • Use ANOVA to compute the magnitudes of the
    variance components.
  • Analyse those that are significantly large.
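A minimal single-facet G-study sketch in Python (persons crossed with raters only, made-up ratings; the full design above also includes forms and their interaction terms):

```python
import numpy as np

X = np.array([    # hypothetical ratings; rows = persons, columns = raters
    [7, 8, 6],
    [5, 5, 4],
    [9, 8, 9],
    [4, 5, 3],
], dtype=float)
n_p, n_r = X.shape

# Two-way ANOVA mean squares for a fully crossed p x r design
grand = X.mean()
ms_p = n_r * ((X.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((X.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)
resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
ms_e = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

var_p = (ms_p - ms_e) / n_r   # person (universe score) variance component
var_r = (ms_r - ms_e) / n_p   # rater variance component
var_e = ms_e                  # residual (person x rater, error) component

# Generalizability coefficient for a D-study using n_r raters
g = var_p / (var_p + var_e / n_r)
print(f"var_p = {var_p:.3f}, var_r = {var_r:.3f}, "
      f"var_e = {var_e:.3f}, G = {g:.3f}")
```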

55
Standard Error of Measurement (SEM)
  • We need to know the extent to which a test score
    may vary (SEM).
  • Formula of SEM estimation:
  • s_e = s_x·√(1 - r_xx')
  • From:
  • r_xx' = s_t²/s_x² (1)
  • s_t²/s_x² + s_e²/s_x² = 1 (2)
  • s_e²/s_x² = 1 - s_t²/s_x² (3)
  • s_e²/s_x² = 1 - r_xx'
  • s_e² = s_x²(1 - r_xx')
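A Python sketch with hypothetical values:

```python
import math

s_x = 8.0      # hypothetical standard deviation of the test scores
r_xx = 0.91    # hypothetical reliability estimate

sem = s_x * math.sqrt(1 - r_xx)   # s_e = s_x * sqrt(1 - r_xx')
print(f"SEM = {sem:.2f}")         # 2.40
# An observed score of, say, 70 then suggests a band of about 70 +/- 2*SEM.
```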

56
Interpretation of Test Scores
  • Difficulty
  • Distinction
  • Z score

57
Difficulty for Dichotomous Scoring
  • p = R/n
  • where
  • p = difficulty index
  • R = number of right answers
  • n = number of students

58
Difficulty for Dichotomous Scoring (Corrected)
  • Cp = (kp - 1)/(k - 1)
  • where
  • Cp = corrected difficulty index
  • p = uncorrected difficulty index
  • k = number of choices
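A Python sketch combining the raw and corrected difficulty indices (hypothetical values):

```python
R = 32    # hypothetical number of right answers
n = 40    # hypothetical number of students
k = 4     # hypothetical number of choices per item

p = R / n                    # raw difficulty index
cp = (k * p - 1) / (k - 1)   # corrected for guessing
print(f"p = {p:.2f}, corrected p = {cp:.3f}")   # 0.80, 0.733
```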

59
Difficulty for Non-dichotomous Scoring
  • p = mean/full score
  • Desirable range: 0.30 - 0.85

60
Distinction
  • Label the top 27% of the total as the high group
    and the lowest 27% of the total as the low group.
  • D = PH - PL
  • where
  • D = distinction index
  • PH = rate of correct answers in the high group
  • PL = rate of correct answers in the low group
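A Python sketch with made-up (total score, item correct?) pairs; the high and low 27% groups are taken from the sorted totals:

```python
data = sorted(              # hypothetical (total score, item correct 0/1)
    [(95, 1), (88, 1), (84, 1), (77, 0), (70, 1), (66, 0), (60, 0), (51, 0)],
    key=lambda t: t[0], reverse=True)

n = len(data)
g = max(1, round(0.27 * n))              # size of each 27% group
high, low = data[:g], data[-g:]

ph = sum(item for _, item in high) / g   # correct-answer rate, high group
pl = sum(item for _, item in low) / g    # correct-answer rate, low group
print(f"PH = {ph:.2f}, PL = {pl:.2f}, D = {ph - pl:.2f}")
```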

61
Z score
  • A way of placing an individual score in the whole
    distribution of scores on a test: it expresses
    how many standard deviation units a score lies
    above or below the mean. Scores above the mean
    are positive; those below the mean are negative.
  • An advantage of z scores is that they allow
    scores from different tests to be compared, where
    the mean and standard deviation differ, and where
    score points may not be equal.
  • Z = (X - X̄)/s

62
T-score
  • A transformation of a z score, equivalent to it
    but with the advantage of avoiding negative
    values, and hence often used for reporting
    purposes.
  • T = 10Z + 50

63
Standardized Score
  • A transformation of raw scores which provides a
    measure of relative standing in a group and
    allows comparison of raw scores from different
    distributions, e.g. from tests of different
    lengths. It does this by converting a raw score
    into a standard frame of reference which is
    expressed in terms of its relative position in
    the distribution of scores. The z score is the
    most commonly used standardized score.
  • Standardized score = 100Z + 500
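A Python sketch converting one hypothetical raw score through all three transformations:

```python
X = 72.0              # hypothetical raw score
mean, s = 65.0, 10.0  # hypothetical test mean and standard deviation

z = (X - mean) / s             # z = 0.70
t = 10 * z + 50                # T = 57.0
standardized = 100 * z + 500   # standardized score = 570
print(f"z = {z:.2f}, T = {t:.1f}, standardized = {standardized:.0f}")
```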