Title: Maintaining an invariant unit in Rasch measurement
1 Maintaining an invariant unit in Rasch measurement
- Paper presented at an International Symposium, Methodological Tools for Accountability Systems in Education
- European Commission Joint Research Centre, Ispra, 6-9 February 2006
2 Acknowledgements
- An earlier version of this paper was presented at the 10th Annual National Roundtable Conference, Melbourne, Australia, October 2005. The research reported in this paper was supported in part by an Australian Research Council grant with the following industry partners: the national MCEETYA Performance Measurement and Reporting Task Force, IIEP (UNESCO) and the Australian Council for Educational Research (ACER).
3 Maintaining an invariant unit in Rasch measurement
4 Why the Mars probe went off course (Spectrum Magazine, December 1999)
- In 1999 the Mars Climate Orbiter was about 100 kilometers off course at the end of its 500-million-kilometer voyage, more than enough to accidentally hit the planet's atmosphere and be destroyed.
- Preliminary public statements faulted a slip-up between the probe's builders and its operators: a failure to convert the English units of measure used in construction into the metric units used for operation.
5 Setting Benchmarks
- The Australian Government invested a lot of money in articulating benchmarks of achievement in literacy and numeracy in Years 3, 5, 7 and 10.
- The benchmarks were set independently of a metric.
- However, they imply a metric.
- Expert judgement is essential to the task.
6 Standard setting methodologies
- The Angoff methodology is one of the most commonly referred to in the literature.
- The kernel of the Angoff method is the independent judgement of whether a minimally competent person can or cannot answer an item correctly.
7 Overview of Findings in the Literature
- Lorge and Kruglov (1953) reported that judges were unable to estimate item difficulty very accurately, but they could rank order items in terms of difficulty very accurately.
- Shepard (1995) concluded the Angoff method may not provide valid scores because judges could not estimate probabilities.
- Impara and Plake (1996) confirmed judges could not estimate probabilities even for groups of students who are well known to them.
8 Overview of Findings in the Literature
- Variability between judges has been an area of further investigation.
- Green, Trimble and Lewis (2003) report studies, such as Impara and Plake (2000), where convergence of results among multiple standard settings is used as evidence of validity of cut scores, but note that while convergence may occur to a reasonable degree when variations of the same method are used, there are few reports of convergence when different procedures are used.
9 The Australian Context
- The Benchmark Standard describes a minimum standard of achievement, or minimal competency.
- The Benchmark Standard has been defined in terms of criteria and exemplar material, detailed in background documentation.
- Therefore, the standard is criterion-referenced. The goal is to determine the location of the Benchmark Standard on an existing scale.
- The raw score on an assessment corresponding with the Benchmark location is referred to as the cut-score.
- The raw score is tangible for interpretation.
10 Scope
- Two methodologies were used in this study to set a benchmark cut-score. These are
  - the Likelihood Methodology and
  - the Pair Comparison (Pairwise) Methodology.
11 The Likelihood Methodology
Expert judges were asked to envisage a minimally competent Year 7 student in reading. A revised Angoff (1971) procedure was used, involving the rating scale shown below.
[Rating scale: 0-10, corresponding to likelihoods of 0-100%. Ratings below 5 indicate skills more demanding than the benchmark standard; 5 corresponds to the Benchmark Standard; ratings above 5 indicate skills easier than the benchmark standard.]
- Judges were instructed that
  - the minimally competent benchmark student should answer an item very close to the benchmark standard correctly 50% of the time;
  - if the skills needed to answer an item were more demanding than the benchmark standard, the likelihood should be rated as less than 50%;
  - if the skills needed to answer an item were less demanding than the benchmark standard, the rating should be greater than 50%.
12 Setting The Cut-score With The Likelihood Methodology
- Judges rated the items on the Year 7 Reading assessment. The ratings represent each judge's conception of the likelihood of success on each item.
- A rating of 5 was treated as 0.5, 6 as 0.6, etc.
- The sum of the likelihood ratings is treated as the expected benchmark raw score on the test (see the sketch after this list).
- This is consistent with item response theory, where the probability of a correct response is estimated through a model from student data.
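A minimal sketch of this conversion, with invented ratings rather than data from the study:

```python
# Sketch: convert one judge's 0-10 likelihood ratings to probabilities and
# sum them to obtain that judge's expected benchmark raw score.
# The ratings below are invented for illustration, not data from the study.
ratings = [5, 7, 3, 8, 6, 4, 5, 9, 2, 6]   # one rating per item, on the 0-10 scale

probabilities = [r / 10 for r in ratings]   # 5 -> 0.5, 6 -> 0.6, etc.
expected_raw_score = sum(probabilities)     # expected benchmark raw score on this 10-item test

print(f"Expected benchmark raw score: {expected_raw_score:.1f} out of {len(ratings)}")
```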
13 Setting The Cut-score With The Likelihood Methodology
- Each judge takes the place of a student in the usual response matrix, with a rating in place of a response to each item.
- Each judge's location is therefore taken to represent that judge's conception of benchmark ability, on the Likelihood scale.
- Item locations were derived from the judge likelihood data using customised software, RUMMmm. The scale is referred to as the Likelihood scale.
- Note: RUMMmm uses Joint Maximum Likelihood (JML) estimation, and can handle non-integer item and person totals derived from the Likelihood data (the sum of likelihood ratings for a particular judge might be 15.4, for example).
14 Setting The Cut-score With The Likelihood Methodology
An example of data collection under the
Likelihood Methodology
- Common item equating is used to translate the benchmark location onto the Student scale, i.e. the mean of the items on the Likelihood and Student scales is equated (sketched below).
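A minimal sketch of this mean-mean common-item equating, under the assumption that the same items are located on both scales (all numbers are invented for illustration):

```python
# Sketch of mean-mean common-item equating: shift the Likelihood-scale
# benchmark location onto the Student scale so that the common items
# have the same mean location on both scales.
# All values below are illustrative, not values from the study.
likelihood_item_locs = [-1.2, -0.4, 0.1, 0.6, 1.3]   # common items, Likelihood scale
student_item_locs    = [-0.9, -0.2, 0.3, 0.5, 1.1]   # same items, Student scale

shift = (sum(student_item_locs) / len(student_item_locs)
         - sum(likelihood_item_locs) / len(likelihood_item_locs))

benchmark_on_likelihood_scale = 0.45                  # illustrative benchmark location
benchmark_on_student_scale = benchmark_on_likelihood_scale + shift
print(f"Benchmark location on Student scale: {benchmark_on_student_scale:.2f}")
```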
15 The Pairwise Methodology
- The pairwise design required that all 54 items (40 items from the WALNA test and all 14 exemplar items) be compared with each other.
- The locations of the items were obtained with a pair comparison model identical in form to the Rasch model.
- The Benchmark Standard is operationalised as the average of the 14 exemplar items.
16 Comparisons
- Nearly all judges who participated in the likelihood exercise participated in the pairwise exercise.
- No benchmark conceptualisation of a student was necessary.
- The benchmark was implicit in the benchmark items.
17 Pairwise Design
- The number of comparisons among the 54 items (40 test items plus 14 exemplar items), if each item is compared with every other item, is 54 × 53 / 2 = 1431 (see the sketch after this list).
- A design was constructed in which each item was judged 116 or 117 times. Each pair of items was judged twice. Each judge was required to compare 137 pairs of items.
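The full all-pairs count can be checked directly (a minimal sketch):

```python
# Sketch: number of distinct pairs when every one of the 54 items
# (40 test items + 14 exemplar items) is compared with every other item.
from math import comb

n_items = 40 + 14
n_pairs = comb(n_items, 2)   # 54 * 53 / 2
print(n_pairs)               # 1431
```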
18 Findings
- The findings of these two exercises appear to support the findings from other standard setting exercises:
  - judges were unable to estimate absolute item difficulty for students;
  - where two different procedures are used, there is no convergence in estimating a benchmark;
  - judges' ratings within an exercise vary widely.
19 Findings
- Comparisons between the pairwise scale values and the likelihood scale values reveal that the two formats of judgement are very highly correlated.
- The dispersions of the items from the likelihood and pairwise designs both differ from the dispersion obtained from the student responses.
- The benchmark standard, as represented by the exemplar items, covers a wide range of ability.
20 Empirical Benchmark Students: Likelihood Methodology
- The benchmark cut-score determined by the likelihood method was 16.5 of a possible 35. Approximately 1400 students with raw scores of 16 and 17 were extracted.
- If the judges' Likelihood ratings are consistent with the performance of actual students of benchmark ability, there should be a reasonable correspondence between ratings and proportions correct (see the sketch after this list).
- The comparison revealed a disparity between predicted and observed proportions.
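A minimal sketch of the kind of comparison involved, with all values invented for illustration (not data from the study):

```python
# Sketch: compare judges' mean likelihood ratings (converted to proportions)
# with observed proportions correct among students near the cut-score.
# All values below are illustrative, not data from the study.
mean_judge_ratings = [8.2, 6.9, 5.1, 3.8, 2.4]          # per item, 0-10 scale
observed_prop_correct = [0.95, 0.78, 0.52, 0.21, 0.08]  # same items, benchmark-level students

for rating, observed in zip(mean_judge_ratings, observed_prop_correct):
    predicted = rating / 10
    print(f"predicted {predicted:.2f}  observed {observed:.2f}  gap {predicted - observed:+.2f}")
```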
21 Variability of Ratings in the Likelihood Exercise
- There was considerable variation among the cut scores set by individual judges.
22 Variations between cut scores
- While the wide difference between the cut scores may suggest that the judges' conceptualisation of the standard varied in the two exercises, such a difference is surprising given that the standard is articulated.
- Summary of cut scores
23 High Correlation Between The Two Formats Of Judgements
- The likelihood and pairwise scales correlate very highly, and have comparable correlations with the student scale, which suggests high consistency in the judges' interpretation of relative item difficulties across the two methodologies.
24 Correlation Of Item Difficulties Generated From The Likelihood Data With Item Difficulties From The Pairwise Data
25 Closer examination of the likelihood ratings
Student proportion correct compared with judges' mean likelihood ratings
There is a clear disparity between predicted and observed proportions correct. Judges systematically overrated the likelihood of success on the most difficult items (left), and systematically underrated the likelihood of success on the easiest items (right). Judges were conservative in their use of the range of the rating scale.
26 Dispersions Of The Items From The Two Designs
- While the correlation between item locations on the two judge scales was 0.95, the standard deviation of the pairwise scale locations was approximately five times that of the likelihood scale locations and twice that of the student scale locations.
- This implies that the underlying unit of scale is different in the two designs (see the note below).
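One way to see why, sketched in notation not taken from the slides: if the same item locations are expressed in two units related by a factor $\rho$, then

$$\delta_i^{(P)} = \rho\,\delta_i^{(L)} \;\Rightarrow\; \operatorname{SD}\big(\delta^{(P)}\big) = \rho\,\operatorname{SD}\big(\delta^{(L)}\big),$$

so a near-perfect correlation combined with a five-fold difference in standard deviation is exactly the signature of a change of unit.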
27 PREDICTED AND OBSERVED PROPORTIONS AFTER SCALE TRANSFORMATION
Student proportion correct against predicted
proportions after transforming the judge
locations for unit of scale.
- Predicted proportions were derived using the Rasch equation after scale transformation (a sketch follows this list).
- There is greater agreement between the lines than was evident in the original data.
- Data for students with lower raw scores were extracted.
- The evidence suggests that a difference in unit of scale does, in some sense, underlie the ratings made by judges.
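A minimal sketch of this kind of prediction, with an illustrative scaling factor and item locations (none of the values are from the study):

```python
# Sketch: rescale judge-based item locations to the Student-scale unit,
# then predict proportions correct with the dichotomous Rasch equation.
# The scaling factor and locations below are illustrative only.
import math

scale_ratio = 2.0                      # e.g. ratio of the two scales' units
benchmark_ability = 0.4                # benchmark location, Student-scale units
likelihood_item_locs = [-0.6, -0.2, 0.1, 0.5, 0.9]   # judge-derived, Likelihood-scale units

for delta in likelihood_item_locs:
    delta_student = delta * scale_ratio                            # transform for unit of scale
    p = 1 / (1 + math.exp(-(benchmark_ability - delta_student)))   # Rasch probability
    print(f"item at {delta:+.2f} -> predicted proportion correct {p:.2f}")
```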
28 Comparison Of Cut Scores
29 Discussion: Accounting for differences between the units of the scales
- Given the finding that the judges' ratings from the likelihood and pairwise exercises correlate highly, it would appear that underlying differences between the units of the scales are the factor with the greatest impact on the location of the cut score.
- When the difference in dispersion is accounted for, the two exercises provide similar locations for the benchmark cut score.
- This explains the finding in the literature that judges rank order the items consistently and quite correctly but do not predict the actual probabilities of items correctly. Specifically, the predicted probabilities depend on the design of the data collection from the judges, and each design has its own inherent unit of scale.
30 The equations of the Rasch model revealing the arbitrary unit of scale
- The adjustments are consistent with the implied
arbitrary unit in the Rasch models applied to
each of the different frames of reference for
data collection.
31 Frame of Reference S: Student assessment data
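The equation image for this frame is not reproduced; a sketch of its likely form, assuming standard dichotomous Rasch notation with a frame-specific scaling constant $\rho_S$:

$$\Pr\{X_{ni}=1\} \;=\; \frac{\exp\big(\rho_S(\beta_n-\delta_i)\big)}{1+\exp\big(\rho_S(\beta_n-\delta_i)\big)},$$

where $\beta_n$ is the ability of student $n$ and $\delta_i$ the difficulty of item $i$, both in a unit specific to the Student scale.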
32 Frame of Reference L: Likelihood data collection format
- The model involves the benchmark location according to judge v and the difficulty of item i, both in a unit specific to the Likelihood scale (sketched below).
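A sketch of the likely form of the Likelihood-frame model, using the symbol legend above (the notation $\rho_L$, $\beta_v$, $\delta_i$ is assumed, not taken from the slide):

$$\Pr\{X_{vi}=1\} \;=\; \frac{\exp\big(\rho_L(\beta_v-\delta_i)\big)}{1+\exp\big(\rho_L(\beta_v-\delta_i)\big)},$$

where $\beta_v$ is the benchmark location according to judge $v$, $\delta_i$ the difficulty of item $i$, and $\rho_L$ the scaling constant for this frame of reference.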
33 Elimination of the person parameter
- For both the Student and the Likelihood designs, the person parameters are eliminated, irrespective of the value of the scaling constant, provided it is constant within each frame of reference for data collection.
- The distinguishing feature of Rasch models is therefore preserved: invariant comparisons within a specified frame of reference.
- The person parameter is eliminated conditionally by grouping persons according to raw scores; consequently, raw scores contain all the information about person measures available within the frame of reference (sufficiency).
34 Equations for estimation
The resulting equations, with the scaling constants, for each specific frame of reference:
1. Likelihood
2. Student
These functions are expressions of the models containing pairs of items, after the person parameter has been eliminated by conditioning on raw scores, r.
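The equation images are not reproduced; a sketch of the conditional form for a pair of items $i$ and $j$, in the notation assumed above, with $\rho_L$ and $\rho_S$ as the frame-specific scaling constants:

$$\Pr\{X_{vi}=1 \mid X_{vi}+X_{vj}=1\} = \frac{\exp\big(\rho_L(\delta_j-\delta_i)\big)}{1+\exp\big(\rho_L(\delta_j-\delta_i)\big)}, \qquad \Pr\{X_{ni}=1 \mid X_{ni}+X_{nj}=1\} = \frac{\exp\big(\rho_S(\delta_j-\delta_i)\big)}{1+\exp\big(\rho_S(\delta_j-\delta_i)\big)}.$$

The judge and student parameters no longer appear; only item differences remain, each expressed in the unit of its own frame of reference.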
35 The pair comparison design and model
- In the pairwise comparison design, the judge
(person) parameter is eliminated experimentally.
36 Frame of Reference P: Pairwise format and model
Individual judges are not explicitly modelled, but the judges are inherently part of the data collection format. The judge parameter is eliminated by the pair comparison design.
The model concerns the event that item j is selected as more difficult than item i.
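A sketch of the likely form of the pair-comparison model, written with a frame-specific scaling constant $\rho_P$ (notation assumed, not taken from the slide):

$$\Pr\{\text{item } j \text{ judged more difficult than item } i\} \;=\; \frac{\exp\big(\rho_P(\delta_j-\delta_i)\big)}{1+\exp\big(\rho_P(\delta_j-\delta_i)\big)},$$

the same form as the conditional equations above, but with its own unit fixed by $\rho_P$.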
37 Equations for each frame of reference
The resulting equations, with the scaling constants, for each specific frame of reference:
1. Likelihood
2. Student
3. Pairwise
38 Frame of reference and Units
- Every frame of reference has its own empirical unit.
- In every analysis we impose an arbitrary scaling factor.
- If the empirical units are different across frames of reference, then the arbitrary factor does not take this difference into account.
- The formats in this study make these differences understandable.
- Others can be more subtle.
39 The importance of the unit
- As with the Mars orbiter, an error involving a unit mix-up occurred during a shuttle mission in the 1980s.
- The shuttle's mission involved pointing a mirror toward a telescope on the top of Mount Haleakala, Maui. The author of the control program expected the measurement of the altitude of the mountain to be in nautical miles, but the measurement was entered in metres.
- The result was that the mirror pointed out to space, toward a location 3,000 nautical miles (5,556 km) above the Earth! Fortunately, this error was correctable.
- The moral: we must keep track of the units.