1
Maintaining an invariant unit in Rasch measurement
  • Paper presented at an International Symposium
  • Methodological tools for accountability systems
    in education
  • European Commission, Ispra, 6-9 February, 2006,
    Joint Research Centre

2
Acknowledgements
  • An earlier version of this paper was presented at
    the 10th Annual National Roundtable Conference,
    Melbourne, Australia, October 2005. The research
    reported in this paper was supported in part by
    an Australian Research Council grant with the
    following industry partners: the national MCEETYA
    Performance Measurement and Reporting Task Force,
    IIEP (UNESCO) and the Australian Council for
    Educational Research (ACER).

3
Maintaining an invariant unit in Rasch measurement
4
Why the Mars probe went off course (Spectrum
Magazine, December 1999)
  • In 1999 the Mars Climate Orbiter was about 100
    kilometers off course at the end of its
    500-million-kilometer voyage, more than enough
    to accidentally hit the planet's atmosphere and
    be destroyed.
  • Preliminary public statements faulted a slip-up
    between the probe's builders and its operators: a
    failure to convert the English units of measure
    used in construction into the metric units used
    for operation.

5
Setting Benchmarks
  • The Australian Government invested heavily in
    articulating benchmarks of achievement in
    literacy and numeracy in Years 3, 5, 7 and 10.
  • The benchmarks were set independently of any
    metric.
  • However, they imply a metric.
  • Expert judgement is essential to the task.

6
Standard setting methodologies
  • The Angoff methodology is one of the most
    commonly cited standard-setting methodologies in
    the literature.
  • The kernel of the Angoff method is the
    independent judgement of whether a minimally
    competent person can or cannot answer an item
    correctly.

7
Overview of Findings in the Literature
  • Lorge and Kruglov (1953) reported that judges
    were unable to estimate item difficulty very
    accurately, but that they could rank-order items
    in terms of difficulty very accurately.
  • Shepard (1995) concluded the Angoff method may
    not provide valid scores because judges could not
    estimate probabilities.
  • Impara and Plake (1996) confirmed that judges
    could not estimate probabilities even for groups
    of students who are well known to them.

8
Overview of Findings in the Literature
  • Variability between judges has been an area of
    further investigation.
  • Green, Trimble and Lewis (2003) report studies,
    such as Impara and Plake (2000), in which
    convergence of results among multiple standard
    settings is used as evidence of the validity of
    cut-scores. They note that while convergence may
    occur to a reasonable degree when variations of
    the same method are used, there are few reports
    of convergence when different procedures are
    used.

9
The Australian Context
  • The Benchmark Standard describes a minimum
    standard of achievement, or minimal competency.
  • The Benchmark Standard has been defined in terms
    of criteria and exemplar material, detailed in
    background documentation.
  • The standard is therefore criterion-referenced.
    The goal is to determine the location of the
    Benchmark Standard on an existing scale.
  • The raw score on an assessment corresponding
    to the Benchmark location is referred to as the
    cut-score.
  • The raw score provides a tangible quantity for
    interpretation.

10
Scope
  • Two methodologies were used in this study to set
    a benchmark cut-score:
  • the Likelihood Methodology, and
  • the Pair Comparison (Pairwise) Methodology.

11
The Likelihood Methodology
Expert judges were asked to envisage a minimally
competent Year 7 student in reading. A revised
Angoff (1971) procedure was used, involving the
rating scale shown below.
[Rating scale: 0-10, corresponding to 0%-100%
likelihood. The low end is anchored 'more demanding
than benchmark standard', the midpoint 'benchmark
standard', and the high end 'easier than benchmark
standard'.]
  • Judges were instructed that:
  • the minimally competent benchmark student should
    answer an item very close to the benchmark
    standard correctly 50% of the time;
  • if the skills needed to answer an item were
    more demanding than the benchmark standard, the
    likelihood should be rated as less than 50%;
  • if the skills needed to answer an item were
    less demanding than the benchmark standard, the
    rating should be greater than 50%.

12
Setting The Cut-score With The Likelihood
Methodology
  • Judges rated the items on the Year 7 Reading
    assessment. The ratings represent each judge's
    conception of the likelihood of success on each
    item.
  • A rating of 5 was treated as 0.5, a rating of 6
    as 0.6, and so on.
  • The sum of a judge's likelihood ratings is
    treated as the expected benchmark raw score on
    the test, as sketched below.
  • This is consistent with item response theory,
    where the probability of a correct response is
    estimated through a model from student data.
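A minimal sketch of this step in Python, assuming hypothetical ratings for a ten-item test (none of these numbers come from the study):

```python
# One judge's 0-10 likelihood ratings, one per item (hypothetical values).
ratings = [7, 5, 3, 8, 6, 4, 2, 9, 5, 6]

# A rating of 5 is treated as a probability of 0.5, 6 as 0.6, and so on.
probabilities = [r / 10 for r in ratings]

# The sum of the likelihoods is the judge's expected benchmark raw score.
expected_cut_score = sum(probabilities)
print(f"Expected benchmark raw score: {expected_cut_score:.1f}")  # 5.5
```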

13
Setting The Cut-score With The Likelihood
Methodology
  • Each judge takes the place of a student in the
    usual response matrix, supplying a response to
    each item.
  • Each judge's location is therefore taken to
    represent that judge's conception of benchmark
    ability, on the Likelihood scale.
  • Item locations were derived from the judges'
    likelihood data using customised software,
    RUMMmm. The resulting scale is referred to as the
    Likelihood scale.
  • Note: RUMMmm uses Joint Maximum Likelihood (JML)
    estimation, and can handle the non-integer item
    and person totals derived from the Likelihood
    data (the sum of likelihood ratings for a
    particular judge might be 15.4, for example); a
    generic sketch of this step follows.
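The slides do not show RUMMmm's internals, so the following is only a generic sketch of how a judge's location can be found from a non-integer total under the Rasch model: solve for the ability at which the sum of modelled probabilities equals the judge's total. The item difficulties here are hypothetical.

```python
import math

def locate_judge(item_difficulties, raw_score, tol=1e-8):
    """Solve sum_i P(correct | b, d_i) = raw_score for the location b,
    where P is the Rasch probability exp(b - d) / (1 + exp(b - d)).
    raw_score may be non-integer, e.g. a summed likelihood total of 15.4."""
    b = 0.0
    for _ in range(100):  # Newton-Raphson on the expected score
        probs = [1 / (1 + math.exp(-(b - d))) for d in item_difficulties]
        expected = sum(probs)
        info = sum(p * (1 - p) for p in probs)  # slope of expected score in b
        step = (raw_score - expected) / info
        b += step
        if abs(step) < tol:
            return b
    return b

# 35 hypothetical item difficulties and a non-integer judge total of 15.4
difficulties = [-1.5 + 0.1 * i for i in range(35)]
print(locate_judge(difficulties, 15.4))
```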

14
Setting The Cut-score With The Likelihood
Methodology
An example of data collection under the Likelihood
Methodology [table not reproduced].
  • Common-item equating is used to translate the
    benchmark location onto the Student scale, i.e.
    the mean of the item locations on the Likelihood
    and Student scales is equated, as sketched below.
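A minimal sketch of mean-mean common-item equating, with hypothetical item names and locations:

```python
# Hypothetical locations of the common items on each scale.
likelihood_scale = {"item1": -0.8, "item2": 0.1, "item3": 1.2}
student_scale = {"item1": -0.5, "item2": 0.3, "item3": 1.4}

common = likelihood_scale.keys() & student_scale.keys()
shift = (sum(student_scale[i] for i in common) / len(common)
         - sum(likelihood_scale[i] for i in common) / len(common))

benchmark_on_likelihood = 0.4  # hypothetical benchmark location
benchmark_on_student = benchmark_on_likelihood + shift
print(benchmark_on_student)
```

Note that a pure shift of this kind assumes the two scales share a unit; the later slides show that this assumption fails here.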

15
The Pairwise Methodology
  • The pairwise design required that all 54 items
    (40 items from the WALNA test and all 14 exemplar
    items) be compared with each other.
  • The locations of the items were obtained from a
    pair-comparison model identical in form to the
    Rasch model.
  • The Benchmark Standard is operationalised as the
    average location of the 14 exemplar items.

16
Comparisons
  • Nearly all judges who participated in the
    likelihood exercise also participated in the
    pairwise exercise.
  • No conceptualisation of a benchmark student was
    necessary.
  • The benchmark was implicit in the exemplar
    (benchmark) items.

17
Pairwise Design
  • The number of comparisons among 54 items (40
    test items + 14 exemplar items), if each item is
    compared with every other item, is 54 × 53 / 2 =
    1431.
  • A design was constructed in which each item was
    judged 116 or 117 times. Each pair of items was
    judged twice. Each judge was required to compare
    137 pairs of items; a sketch of such an
    allocation follows.
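A naive randomised allocation in Python, for illustration only (the study used a purpose-built balanced design; the number of bundles here is an artefact of the sketch):

```python
from itertools import combinations
import random

items = list(range(54))               # 40 test items + 14 exemplar items
pairs = list(combinations(items, 2))  # 54 * 53 / 2 = 1431 distinct pairs
slots = pairs * 2                     # each pair is to be judged twice

random.seed(1)
random.shuffle(slots)

# Deal the judging slots into bundles of 137 comparisons per judge.
bundle_size = 137
bundles = [slots[k:k + bundle_size]
           for k in range(0, len(slots), bundle_size)]
print(len(pairs), len(slots), len(bundles))  # 1431 2862 21
```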

18
Findings
  • The findings of these two exercises appear to
    support the findings from other standard-setting
    exercises:
  • judges were unable to estimate absolute item
    difficulty for students;
  • where two different procedures are used, there is
    no convergence in estimating a benchmark;
  • judges' ratings within an exercise vary widely.

19
Findings
  • Comparisons between the pairwise scale values and
    the likelihood scale values reveal that the two
    formats of judgement are very highly correlated.
  • The dispersions of the items from the likelihood
    and pairwise designs each differ from that
    obtained from the student responses.
  • The benchmark standard, as represented by the
    exemplar items, spans a wide range of ability.

20
Empirical Benchmark students Likelihood
Methodology
  • The benchmark cut-score determined by the
    likelihood method was 16.5 out of a possible 35.
    Approximately 1400 students with raw scores of 16
    and 17 were extracted.
  • If the judges' likelihood ratings are consistent
    with the performance of actual students of
    benchmark ability, there should be a reasonable
    correspondence between ratings and proportions
    correct; this comparison is sketched below.
  • The comparison revealed a disparity between
    predicted and observed.
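A minimal sketch of the comparison, with a tiny hypothetical response matrix standing in for the roughly 1400 extracted students:

```python
# Rows: students at the cut-score (raw scores 16-17 in the study).
# Columns: items; 1 = correct, 0 = incorrect. All values hypothetical.
responses = [
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1],
]
mean_ratings = [0.7, 0.5, 0.6, 0.9, 0.3]  # judges' mean likelihood ratings

n_students = len(responses)
observed = [sum(row[i] for row in responses) / n_students
            for i in range(len(mean_ratings))]

for i, (pred, obs) in enumerate(zip(mean_ratings, observed)):
    print(f"item {i}: predicted {pred:.2f}  observed {obs:.2f}")
```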

21
Variability of Ratings in the Likelihood Exercise
  • There was considerable variation among the
    cut-scores set by the individual judges.

22
Variations between cut scores
  • While the wide difference between the cut-scores
    may suggest that the judges' conceptualisation of
    the standard varied across the two exercises,
    such a difference is surprising given that the
    standard is explicitly articulated.
  • Summary of cut-scores [table not reproduced]

23
High Correlation Between The Two Formats Of
Judgements
  • The likelihood and pairwise scales correlate very
    highly, and each correlates comparably with the
    student scale, which suggests high consistency
    in the judges' interpretation of relative item
    difficulties across the two methodologies.

24
Correlation Of Item Difficulties Generated From
The Likelihood Data With Item Difficulties From
Pairwise Data [figure not reproduced]
25
Closer examination of the likelihood ratings
Student proportion correct compared with judges'
mean likelihood ratings [figure not reproduced]
There is a clear disparity between predicted and
observed proportions correct. Judges systematically
overrated the likelihood of success on the most
difficult items (left), and systematically
underrated the likelihood of success on the
easiest items (right). Judges were conservative in
their use of the range of the rating scale.
26
Dispersions Of The Items From The Two Designs
  • While the correlation between item locations on
    the two judge scales was 0.95, the standard
    deviation of the pairwise scale locations was
    approximately five times that of the likelihood
    scale locations, and twice that of the student
    scale locations.
  • This implies that the underlying unit of scale is
    different in the two designs; a sketch of a
    dispersion adjustment follows.
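A minimal sketch of adjusting one scale's locations to another scale's unit by matching dispersions (all locations hypothetical):

```python
import statistics

def rescale(locations, target_sd):
    """Rescale locations so their dispersion matches target_sd,
    preserving their mean: an adjustment for unit of scale."""
    mean = statistics.mean(locations)
    sd = statistics.pstdev(locations)
    return [mean + (x - mean) * target_sd / sd for x in locations]

pairwise = [-5.0, -2.0, 0.5, 2.5, 4.0]  # hypothetical, wide dispersion
student = [-1.2, -0.4, 0.1, 0.6, 0.9]   # hypothetical, narrower

pairwise_in_student_unit = rescale(pairwise, statistics.pstdev(student))
print(pairwise_in_student_unit)
```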

27
Predicted And Observed Proportions After Scale
Transformation
Student proportion correct against predicted
proportions after transforming the judge locations
for unit of scale [figure not reproduced].
  • Predicted proportions were derived using the
    Rasch equation after the scale transformation,
    as sketched below.
  • There is greater agreement between the lines
    than was evident in the original data.
  • Data for students with lower raw scores were
    extracted.
  • The evidence suggests that a difference in unit
    of scale does, in some sense, underlie the
    ratings made by judges.
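A sketch of deriving predicted proportions from the Rasch equation once the judge locations have been rescaled (benchmark location and difficulties hypothetical):

```python
import math

def predicted_proportion(benchmark, difficulty):
    # Rasch model probability of a correct response.
    return 1 / (1 + math.exp(-(benchmark - difficulty)))

benchmark = 0.2                       # hypothetical benchmark location
transformed = [-1.0, -0.3, 0.4, 1.1]  # judge item locations after rescaling
print([round(predicted_proportion(benchmark, d), 2) for d in transformed])
```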

28
Comparison Of Cut Scores [table not reproduced]
29
Discussion: accounting for differences between
the units of the scales
  • Given the finding that the judges' ratings from
    the likelihood and pairwise exercises correlate
    highly, it would appear that the underlying
    difference between the units of the scales is
    the factor with the greatest impact on the
    location of the cut-score.
  • When the difference in dispersion is accounted
    for, the two exercises provide similar locations
    for the benchmark cut-score.
  • This explains the findings in the literature that
    judges rank-order the items consistently and
    quite correctly but do not predict the actual
    probabilities of items correctly. Specifically,
    the predicted probabilities depend on the design
    of the data collection from the judges, and each
    design has its own inherent unit of scale.

30
The equations of the Rasch model revealing the
arbitrary unit of scale
  • The adjustments are consistent with the
    arbitrary unit implied by the Rasch models
    applied to each of the different frames of
    reference for data collection.

31
Frame of Reference S: Student assessment data
32
Frame of Reference L: Likelihood data collection
format
β_v: benchmark location according to judge v
δ_i: difficulty of item i
Both are in a unit specific to the Likelihood scale.
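The equation images on these two slides did not survive the transcript. Reconstructed in standard Rasch notation, with frame-specific scaling constants written here as ρ_S and ρ_L (this notation is assumed, not taken from the slides), the models presumably have the form:

```latex
% Frame S: student n (ability \beta_n) responds to item i (difficulty \delta_i)
\Pr(X_{ni} = 1) = \frac{\exp\{\rho_S(\beta_n - \delta_i)\}}
                       {1 + \exp\{\rho_S(\beta_n - \delta_i)\}}

% Frame L: judge v (benchmark location \beta_v) rates item i
\Pr(X_{vi} = 1) = \frac{\exp\{\rho_L(\beta_v - \delta_i)\}}
                       {1 + \exp\{\rho_L(\beta_v - \delta_i)\}}
```

Setting the constant to 1, as is conventional, fixes the unit within a frame but cannot reconcile units across frames.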
33
Elimination of the person parameter
  • For both the Student and the Likelihood designs,
    the person parameters are eliminated,
    irrespective of the value of the scaling
    constant, provided it is constant within each
    frame of reference for data collection.
  • The distinguishing feature of Rasch models is
    therefore preserved: invariant comparisons
    within a specified frame of reference.
  • The person parameter is eliminated conditionally
    by grouping persons according to raw scores;
    consequently, raw scores contain all the
    information about person measures available
    within the frame of reference (sufficiency).

34
Equations for estimation
The resulting equations, with the scaling
constants for each specific frame of reference:
1. Likelihood
2. Student
These functions are expressions of the models for
pairs of items, after eliminating the person
parameter by conditioning on the raw scores, r.
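The equations themselves were images and are missing from the transcript. The standard conditional form for a pair of items, which matches the slide's description and the notation assumed above, is:

```latex
% Conditioning on exactly one of items i and j being answered correctly
% (pair raw score r = 1) eliminates the person parameter:
\Pr(X_i = 1 \mid X_i + X_j = 1)
  = \frac{\exp\{\rho(\delta_j - \delta_i)\}}
         {1 + \exp\{\rho(\delta_j - \delta_i)\}},
\qquad \rho \in \{\rho_L, \rho_S\}.
```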
35
The pair comparison design and model
  • In the pairwise comparison design, the judge
    (person) parameter is eliminated experimentally.

36
Frame of Reference P: Pairwise format and model
Individual judges are not explicitly modelled, but
the judges are inherently part of the data
collection format. The judge parameter is
eliminated by the pair-comparison design.
X_ij denotes the event that item j is selected as
more difficult than item i.
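In the notation assumed above, the pairwise model is Pr(X_ij = 1) = exp{ρ_P(δ_j − δ_i)} / (1 + exp{ρ_P(δ_j − δ_i)}). A generic maximum-likelihood fit of such a model is sketched below; this is not the study's software, and the data are hypothetical:

```python
import math

def fit_pairwise(n_items, judgements, iters=500, lr=0.1):
    """Fit item locations from pair-comparison data by gradient ascent
    on the log-likelihood. judgements is a list of (i, j) meaning item j
    was judged MORE difficult than item i."""
    delta = [0.0] * n_items
    for _ in range(iters):
        grad = [0.0] * n_items
        for i, j in judgements:
            # Model: P(j judged harder than i) = sigmoid(delta_j - delta_i)
            p = 1 / (1 + math.exp(-(delta[j] - delta[i])))
            grad[j] += 1 - p
            grad[i] -= 1 - p
        for k in range(n_items):
            delta[k] += lr * grad[k]
        mean = sum(delta) / n_items  # fix the origin of the scale
        delta = [d - mean for d in delta]
    return delta

# Hypothetical data: item 2 is mostly judged hardest, item 0 easiest.
data = [(0, 1), (0, 1), (1, 0), (0, 2), (0, 2), (1, 2), (1, 2), (2, 1)]
print(fit_pairwise(3, data))
```

The benchmark location can then be operationalised as the mean of the fitted locations of the 14 exemplar items, as described on slide 15.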
37
Equations for each frame of reference
The resulting equations, with the scaling
constants for each specific frame of reference:
1. Likelihood
2. Student
3. Pairwise
Each takes the conditional or pairwise form given
above, with ρ_L, ρ_S and ρ_P respectively.
38
Frame of reference and Units
  • Every frame of reference has its own empirical
    unit.
  • In every analysis we impose an arbitrary scaling
    factor.
  • If the empirical units are different across
    frames of reference, then the arbitrary factor
    does not take this difference into account.
  • The formats in this study make these differences
    understandable.
  • Other differences can be more subtle.

39
The importance of the unit
  • As with the Mars orbiter, an error involving a
    unit mix-up occurred during a shuttle mission in
    the 1980s.
  • The shuttle's mission involved pointing a mirror
    toward a telescope on the top of Mount Haleakala,
    Maui. The author of the control program expected
    the measurement of the altitude of the mountain
    to be in nautical miles, but the measurement was
    entered in metres.
  • The result was that the mirror pointed out to
    space, toward a location 3,000 nautical miles
    (5,556 km) above the earth! Fortunately, this
    error was correctable.
  • The moral: we must keep track of the units.