Title: Maintaining an invariant unit in Rasch measurement
1 Maintaining an invariant unit in Rasch measurement
- Paper presented at an International Symposium, Methodological Tools for Accountability Systems in Education
- European Commission Joint Research Centre, Ispra, 6-9 February 2006
2 Acknowledgements
- An earlier version of this paper was presented at the 10th Annual National Roundtable Conference, Melbourne, Australia, October 2005. The research reported in this paper was supported in part by an Australian Research Council grant with the following industry partners: the national MCEETYA Performance Measurement and Reporting Task Force, IIEP (UNESCO) and the Australian Council for Educational Research (ACER).
3 Maintaining an invariant unit in Rasch measurement
4 Why the Mars probe went off course (Spectrum Magazine, December 1999)
- In 1999 the Mars Climate Orbiter was about 100 kilometers off course at the end of its 500-million-kilometer voyage, more than enough to accidentally hit the planet's atmosphere and be destroyed.
- Preliminary public statements faulted a slip-up between the probe's builders and its operators: a failure to convert the English units of measure used in construction into the metric units used for operation.
5 Setting Benchmarks
- The Australian Government invested a lot of money in articulating benchmarks of achievement in literacy and numeracy in Years 3, 5, 7 and 10.
- The benchmarks were set independently of a metric.
- However, they imply a metric.
- Expert judgement is essential to the task.
6 Standard setting methodologies
- The Angoff methodology is one of the most commonly referred to in the literature.
- The kernel of the Angoff method is the independent judgement of whether a minimally competent person can or cannot answer an item correctly.
7 Overview of Findings in the Literature
- Lorge and Kruglov (1953) reported that judges were unable to estimate item difficulty very accurately, but they could rank order items in terms of difficulty very accurately.
- Shepard (1995) concluded the Angoff method may not provide valid scores because judges could not estimate probabilities.
- Impara and Plake (1996) confirmed judges could not estimate probabilities even for groups of students who are well known to them.
8 Overview of Findings in the Literature
- Variability between judges has been an area of further investigation.
- Green, Trimble and Lewis (2003) report studies, such as Impara and Plake (2000), where convergence of results among multiple standard settings is used as evidence of validity of cut scores, but note that while convergence may occur to a reasonable degree when variations of the same method are used, there are few reports of convergence when different procedures are used.
9 The Australian Context
- The Benchmark Standard describes a minimum standard of achievement, or minimal competency.
- The Benchmark Standard has been defined in terms of criteria and exemplar material, detailed in background documentation.
- Therefore, the standard is criterion-referenced. The goal is to determine the location of the Benchmark Standard on an existing scale.
- The raw score on an assessment corresponding with the Benchmark location is referred to as the cut-score.
- The raw score is tangible for interpretation.
10 Scope
- Two methodologies were used in this study to set a benchmark cut-score. These are
  - the Likelihood Methodology and
  - the Pair Comparison (Pairwise) Methodology.
11 The Likelihood Methodology
Expert judges were asked to envisage a minimally competent Year 7 student in reading. A revised Angoff (1971) procedure was used, involving the rating scale shown below.
[Rating scale: 0-10, corresponding to likelihoods of 0-100%. Ratings below 5 indicate skills more demanding than the benchmark standard; 5 corresponds to the Benchmark Standard; ratings above 5 indicate skills easier than the benchmark standard.]
- Judges were instructed that
  - the minimally competent benchmark student should answer an item very close to the benchmark standard correctly 50% of the time;
  - if the skills needed to answer an item were more demanding than the benchmark standard, the likelihood should be rated as less than 50%;
  - if the skills needed to answer an item were less demanding than the benchmark standard, the rating should be greater than 50%.
12 Setting The Cut-score With The Likelihood Methodology
- Judges rated the items on the Year 7 Reading assessment. The ratings represent each judge's conception of the likelihood of success on each item.
- A rating of 5 was treated as 0.5, 6 as 0.6, etc.
- The sum of the likelihood ratings is treated as the expected benchmark raw score on the test (see the sketch after this list).
- This is consistent with item response theory, where the probability of a correct response is estimated through a model from student data.
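A minimal sketch of this conversion, with invented ratings rather than data from the study:

```python
# Sketch: convert one judge's 0-10 likelihood ratings to probabilities and
# sum them to obtain that judge's expected benchmark raw score.
# The ratings below are invented for illustration, not data from the study.
ratings = [5, 7, 3, 8, 6, 4, 5, 9, 2, 6]   # one rating per item, on the 0-10 scale

probabilities = [r / 10 for r in ratings]   # 5 -> 0.5, 6 -> 0.6, etc.
expected_raw_score = sum(probabilities)     # expected benchmark raw score on this 10-item test

print(f"Expected benchmark raw score: {expected_raw_score:.1f} out of {len(ratings)}")
```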
13 Setting The Cut-score With The Likelihood Methodology
- Each judge takes the place of a student in the usual response matrix, with a rating in place of a response to each item.
- Each judge's location is therefore taken to represent that judge's conception of benchmark ability, on the Likelihood scale.
- Item locations were derived from the judge likelihood data using customised software, RUMMmm. The scale is referred to as the Likelihood scale.
- Note: RUMMmm uses Joint Maximum Likelihood (JML) estimation, and can handle non-integer item and person totals derived from the Likelihood data (the sum of likelihood ratings for a particular judge might be 15.4, for example).
14 Setting The Cut-score With The Likelihood Methodology
An example of data collection under the
Likelihood Methodology
- Common item equating is used to translate the benchmark location onto the Student scale, i.e. the mean of the items on the Likelihood and Student scales is equated (sketched below).
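A minimal sketch of this mean-mean common-item equating, under the assumption that the same items are located on both scales (all numbers are invented for illustration):

```python
# Sketch of mean-mean common-item equating: shift the Likelihood-scale
# benchmark location onto the Student scale so that the common items
# have the same mean location on both scales.
# All values below are illustrative, not values from the study.
likelihood_item_locs = [-1.2, -0.4, 0.1, 0.6, 1.3]   # common items, Likelihood scale
student_item_locs    = [-0.9, -0.2, 0.3, 0.5, 1.1]   # same items, Student scale

shift = (sum(student_item_locs) / len(student_item_locs)
         - sum(likelihood_item_locs) / len(likelihood_item_locs))

benchmark_on_likelihood_scale = 0.45                  # illustrative benchmark location
benchmark_on_student_scale = benchmark_on_likelihood_scale + shift
print(f"Benchmark location on Student scale: {benchmark_on_student_scale:.2f}")
```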
15 The Pairwise Methodology
- The pairwise design required that all 54 items (40 items from the WALNA test and all 14 exemplar items) be compared with each other.
- The locations of the items were obtained with a pair comparison model identical in form to the Rasch model.
- The Benchmark Standard is operationalised as the average of the 14 exemplar items.
16 Comparisons
- Nearly all judges who participated in the likelihood exercise participated in the pairwise exercise.
- No benchmark conceptualisation of a student was necessary.
- The benchmark was implicit in the benchmark items.
17 Pairwise Design
- The number of comparisons among the 54 items (40 test items plus 14 exemplar items), if each item is compared with every other item, is 54 × 53 / 2 = 1431 (see the sketch after this list).
- A design was constructed in which each item was judged 116 or 117 times. Each pair of items was judged twice. Each judge was required to compare 137 pairs of items.
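The full all-pairs count can be checked directly (a minimal sketch):

```python
# Sketch: number of distinct pairs when every one of the 54 items
# (40 test items + 14 exemplar items) is compared with every other item.
from math import comb

n_items = 40 + 14
n_pairs = comb(n_items, 2)   # 54 * 53 / 2
print(n_pairs)               # 1431
```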
18 Findings
- The findings of these two exercises appear to support the findings from other standard setting exercises:
  - judges were unable to estimate absolute item difficulty for students;
  - where two different procedures are used, there is no convergence in estimating a benchmark;
  - judges' ratings within an exercise vary widely.
19 Findings
- Comparisons between the pairwise scale values and the likelihood scale values reveal that the two formats of judgement are very highly correlated.
- The dispersions of the items from the likelihood and pairwise designs both differ from the dispersion obtained from the student responses.
- The benchmark standard, as represented by the exemplar items, covers a wide range of ability.
20 Empirical Benchmark Students: Likelihood Methodology
- The benchmark cut-score determined by the likelihood method was 16.5 of a possible 35. Approximately 1400 students with raw scores of 16 and 17 were extracted.
- If the judges' Likelihood ratings are consistent with the performance of actual students of benchmark ability, there should be a reasonable correspondence between ratings and proportions correct (see the sketch after this list).
- The comparison revealed a disparity between predicted and observed proportions.
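A minimal sketch of the kind of comparison involved, with all values invented for illustration (not data from the study):

```python
# Sketch: compare judges' mean likelihood ratings (converted to proportions)
# with observed proportions correct among students near the cut-score.
# All values below are illustrative, not data from the study.
mean_judge_ratings = [8.2, 6.9, 5.1, 3.8, 2.4]          # per item, 0-10 scale
observed_prop_correct = [0.95, 0.78, 0.52, 0.21, 0.08]  # same items, benchmark-level students

for rating, observed in zip(mean_judge_ratings, observed_prop_correct):
    predicted = rating / 10
    print(f"predicted {predicted:.2f}  observed {observed:.2f}  gap {predicted - observed:+.2f}")
```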
21 Variability of Ratings in the Likelihood Exercise
- There was considerable variation among the cut scores set by individual judges.
22 Variations between cut scores
- While the wide difference between the cut scores may suggest that the judges' conceptualisation of the standard varied in the two exercises, such a difference is surprising given that the standard is articulated.
- Summary of cut scores
23 High Correlation Between The Two Formats Of Judgements
- The likelihood and pairwise scales correlate very highly, and have comparable correlations with the student scale, which suggests high consistency in the judges' interpretation of relative item difficulties across the two methodologies.
24 Correlation Of Item Difficulties Generated From The Likelihood Data With Item Difficulties From The Pairwise Data
25 Closer examination of the likelihood ratings
Student proportion correct compared with judges' mean likelihood ratings
There is a clear disparity between predicted and observed proportions correct. Judges systematically overrated the likelihood of success on the most difficult items (left), and systematically underrated the likelihood of success on the easiest items (right). Judges were conservative in their use of the range of the rating scale.
26 Dispersions Of The Items From The Two Designs
- While the correlation between item locations on the two judge scales was 0.95, the standard deviation of the pairwise scale locations was approximately five times that of the likelihood scale locations and twice that of the student scale locations.
- This implies that the underlying unit of scale is different in the two designs (see the note below).
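One way to see why, sketched in notation not taken from the slides: if the same item locations are expressed in two units related by a factor $\rho$, then

$$\delta_i^{(P)} = \rho\,\delta_i^{(L)} \;\Rightarrow\; \operatorname{SD}\big(\delta^{(P)}\big) = \rho\,\operatorname{SD}\big(\delta^{(L)}\big),$$

so a near-perfect correlation combined with a five-fold difference in standard deviation is exactly the signature of a change of unit.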
27 PREDICTED AND OBSERVED PROPORTIONS AFTER SCALE TRANSFORMATION
Student proportion correct against predicted
proportions after transforming the judge
locations for unit of scale.
- Predicted proportions were derived using the Rasch equation after scale transformation (a sketch follows this list).
- There is greater agreement between the lines than was evident in the original data.
- Data for students with lower raw scores were extracted.
- The evidence suggests that a difference in unit of scale does, in some sense, underlie the ratings made by judges.
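A minimal sketch of this kind of prediction, with an illustrative scaling factor and item locations (none of the values are from the study):

```python
# Sketch: rescale judge-based item locations to the Student-scale unit,
# then predict proportions correct with the dichotomous Rasch equation.
# The scaling factor and locations below are illustrative only.
import math

scale_ratio = 2.0                      # e.g. ratio of the two scales' units
benchmark_ability = 0.4                # benchmark location, Student-scale units
likelihood_item_locs = [-0.6, -0.2, 0.1, 0.5, 0.9]   # judge-derived, Likelihood-scale units

for delta in likelihood_item_locs:
    delta_student = delta * scale_ratio                            # transform for unit of scale
    p = 1 / (1 + math.exp(-(benchmark_ability - delta_student)))   # Rasch probability
    print(f"item at {delta:+.2f} -> predicted proportion correct {p:.2f}")
```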
28 Comparison Of Cut Scores
29 Discussion: Accounting for differences between the units of the scales
- Given the finding that the judges' ratings from the likelihood and pairwise exercises correlate highly, it would appear that underlying differences between the units of the scales are the factor with the greatest impact on the location of the cut score.
- When the difference in dispersion is accounted for, the two exercises provide similar locations for the benchmark cut score.
- This explains the finding in the literature that judges rank order the items consistently and quite correctly but do not predict the actual probabilities of items correctly. Specifically, the predicted probabilities depend on the design of the data collection from the judges, and each design has its own inherent unit of scale.
30 The equations of the Rasch model revealing the arbitrary unit of scale
- The adjustments are consistent with the implied
arbitrary unit in the Rasch models applied to
each of the different frames of reference for
data collection.
31 Frame of Reference S: Student assessment data
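The equation image for this frame is not reproduced; a sketch of its likely form, assuming standard dichotomous Rasch notation with a frame-specific scaling constant $\rho_S$:

$$\Pr\{X_{ni}=1\} \;=\; \frac{\exp\big(\rho_S(\beta_n-\delta_i)\big)}{1+\exp\big(\rho_S(\beta_n-\delta_i)\big)},$$

where $\beta_n$ is the ability of student $n$ and $\delta_i$ the difficulty of item $i$, both in a unit specific to the Student scale.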
32 Frame of Reference L: Likelihood data collection format
- The model involves the benchmark location according to judge v and the difficulty of item i, both in a unit specific to the Likelihood scale (sketched below).
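A sketch of the likely form of the Likelihood-frame model, using the symbol legend above (the notation $\rho_L$, $\beta_v$, $\delta_i$ is assumed, not taken from the slide):

$$\Pr\{X_{vi}=1\} \;=\; \frac{\exp\big(\rho_L(\beta_v-\delta_i)\big)}{1+\exp\big(\rho_L(\beta_v-\delta_i)\big)},$$

where $\beta_v$ is the benchmark location according to judge $v$, $\delta_i$ the difficulty of item $i$, and $\rho_L$ the scaling constant for this frame of reference.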
33 Elimination of the person parameter
- For both the Student and the Likelihood designs, the person parameters are eliminated, irrespective of the value of the scaling constant, provided it is constant within each frame of reference for data collection.
- The distinguishing feature of Rasch models is therefore preserved: invariant comparisons within a specified frame of reference.
- The person parameter is eliminated conditionally by grouping persons according to raw scores; consequently, raw scores contain all the information about person measures available within the frame of reference (sufficiency).
34 Equations for estimation
The resulting equations, with the scaling constants, for each specific frame of reference:
1. Likelihood
2. Student
These functions are expressions of the models containing pairs of items, after the person parameter has been eliminated by conditioning on raw scores, r.
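The equation images are not reproduced; a sketch of the conditional form for a pair of items $i$ and $j$, in the notation assumed above, with $\rho_L$ and $\rho_S$ as the frame-specific scaling constants:

$$\Pr\{X_{vi}=1 \mid X_{vi}+X_{vj}=1\} = \frac{\exp\big(\rho_L(\delta_j-\delta_i)\big)}{1+\exp\big(\rho_L(\delta_j-\delta_i)\big)}, \qquad \Pr\{X_{ni}=1 \mid X_{ni}+X_{nj}=1\} = \frac{\exp\big(\rho_S(\delta_j-\delta_i)\big)}{1+\exp\big(\rho_S(\delta_j-\delta_i)\big)}.$$

The judge and student parameters no longer appear; only item differences remain, each expressed in the unit of its own frame of reference.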
35 The pair comparison design and model
- In the pairwise comparison design, the judge
(person) parameter is eliminated experimentally.
36 Frame of Reference P: Pairwise format and model
Individual judges are not explicitly modelled, but the judges are inherently part of the data collection format. The judge parameter is eliminated by the pair comparison design.
The model concerns the event that item j is selected as more difficult than item i.
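A sketch of the likely form of the pair-comparison model, written with a frame-specific scaling constant $\rho_P$ (notation assumed, not taken from the slide):

$$\Pr\{\text{item } j \text{ judged more difficult than item } i\} \;=\; \frac{\exp\big(\rho_P(\delta_j-\delta_i)\big)}{1+\exp\big(\rho_P(\delta_j-\delta_i)\big)},$$

the same form as the conditional equations above, but with its own unit fixed by $\rho_P$.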
37 Equations for each frame of reference
The resulting equations, with the scaling constants, for each specific frame of reference:
1. Likelihood
2. Student
3. Pairwise
38 Frame of reference and Units
- Every frame of reference has its own empirical unit.
- In every analysis we impose an arbitrary scaling factor.
- If the empirical units are different across frames of reference, then the arbitrary factor does not take this difference into account.
- The formats in this study make these differences understandable.
- Others can be more subtle.
39 The importance of the unit
- As with the Mars orbiter, an error involving a unit mix-up occurred during a shuttle mission in the 1980s.
- The shuttle's mission involved pointing a mirror toward a telescope on the top of Mount Haleakala, Maui. The author of the control program expected the measurement of the altitude of the mountain to be in nautical miles, but the measurement was entered in metres.
- The result was that the mirror pointed out to space, toward a location 3,000 nautical miles (5,556 km) above the Earth! Fortunately, this error was correctable.
- The moral: we must keep track of the units.