Title: Using the Many-facets Rasch Model to Resolve Standard Setting Issues
Using the Many-facets Rasch Model to Resolve Standard Setting Issues
- Noor Lide Abu Kassim - IIUM
- Trevor G. Bond - HKIEd
Use of educational standards is not without controversy
- The reason for this lies primarily in the judgmental nature of the standard setting process, in which cutscores that correspond to pre-specified performance levels are established. The lack of objectivity due to the use of human judgment in constructing cutscores, instead of a straightforward process of parameter estimation (Kane, 2001, p. 81), renders standards, to some experts, arbitrary and thus invalid at worst or imprudent at best (e.g., Glass, 1978; Burton, 1977).
Some Fundamental Issues in the Choice of Standard Setting Methodology
- In selecting the right standard setting method, several issues are of primary concern. The first relates to the judgment task that judges or panelists are required to perform. The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) takes a very clear stand on this issue:
- "When cut scores defining pass-fail or proficiency categories are based on direct judgments about the adequacy of item or test performances or performance levels, the judgmental process should be designed so that judges can bring their knowledge and experience to bear in a reasonable way." (p. 60)
In non-objective methods (Stone, 1996)
- e.g., the Angoff method, the judgment task requires judges or panelists to estimate the probability that a minimally competent examinee will succeed on test items. This is ineffectual, as judges are asked to perform a task that is too difficult, confusing, and nearly cognitively impossible (e.g., Pellegrino et al., 1999; NCES, 2003).
- A more serious flaw in this judgment task is that it draws the focus of judgment away from content, and therefore from the measured construct, to the prediction of examinee performance on test items (Stone, 1996).
- Methods such as the Angoff procedure begin with content, but end up atomized into hundreds of contentless score fractions devoid of a clear and meaningful description of the standard (Stone, 1995, p. 1).
- One of the criticisms that have been leveled at some widely-used standard setting methods is that these methods are only relevant for use with particular item types, namely selected-response items (e.g., the Nedelsky method).
- A second issue in addressing the utility of a standard setting method is its capacity to deal with diverse item types (Mitzel et al., 2001). Constructed-response items are fast becoming a common feature of most high-stakes assessment programmes.
- It is therefore important to examine the generalizability of the standard setting method to item types other than selected-response. Since different methods focus on different information and use different procedures to arrive at the final results, it is often recommended that the same standard setting method be used to set standards within a particular assessment programme to ensure consistency in the resulting standards.
- The third source of variability relates to judges' internal consistency. The tendency to overlook intrajudge consistency is not peculiar to these two standard setting methods. A review of newly-developed and long-standing standard setting methods indicates no clear strategies or procedures for the examination of intrajudge consistency.
Standard Setting Using Rasch Measurement
- In Rasch measurement, several standard setting procedures or methods have been developed for the setting of standards/cutscores. These procedures capitalize on the two key attributes of a scientific measurement system in the human sciences: the validity of the test being used and the Rasch measurement properties of the resultant scale (Stone, 1995, p. 452).
- Given the limitations of this paper, only three of the procedures are discussed, mainly because of their significant contribution to the standard setting literature.
1. Grosse and Wright (Stone, 1996)
- The first of these is a method introduced by Grosse and Wright (cited in Stone, 1996). In this three-stage method, individual judges are first asked to select a set of criterion items. Then they are required to determine a minimum passing score (percentage correct) for their individual set of items. In the third stage, judges are given performance data and, with this new information, are asked to review their set of criterion items. The final criterion point is computed using the Rasch PROX formula:
b = H + X · ln(P / (1 − P))
- where b = the judge's criterion standard for the entire test, in logits
- H = average difficulty of the judge's criterion items
- X = (1 + w² / 2.89)^½ (where w = SD of the judge's criterion item difficulties), and
- P = the percent-correct standard set by the judge
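- A minimal sketch of this quantification in Python follows; the function name, data layout and example values are illustrative, not part of the original procedure.

```python
import math

def prox_criterion_standard(criterion_item_difficulties, percent_correct):
    """Grosse & Wright-style PROX quantification of one judge's standard (a sketch).

    criterion_item_difficulties: Rasch difficulties (logits) of the items the
        judge selected as criterion items.
    percent_correct: the judge's minimum passing score, e.g. 0.60 for 60%.
    Returns the judge's criterion standard b in logits.
    """
    n = len(criterion_item_difficulties)
    H = sum(criterion_item_difficulties) / n                       # mean criterion item difficulty
    w = math.sqrt(sum((d - H) ** 2 for d in criterion_item_difficulties) / n)  # SD of difficulties
    X = math.sqrt(1 + w ** 2 / 2.89)                               # PROX expansion factor
    P = percent_correct
    return H + X * math.log(P / (1 - P))

# Example: five criterion items of mixed difficulty, 60% passing standard
print(prox_criterion_standard([-0.8, -0.2, 0.1, 0.5, 0.9], 0.60))
```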
2. Julian and Wright (1993)
- The second procedure, which was developed by Julian and Wright (cited in Stone, 1996), is a refinement of the first (Stone, 1996). In this procedure, relevant items and students around the criterion region are first identified (Figure 1). Judges are then asked to decide whether the items are required for a passing examinee to be considered competent (Stone, 1996). Judges are also asked to rate selected students based on their performances and histories.
Relevant Items and Students within the Criterion Region (Source: Wright, 1993)
Judge-by-Item and Judge-by-Student Matrix (Source: Wright, 1993)
- The selected items and students are rated using the scale in Figure 2. The data from this judge-by-item and judge-by-student matrix are then analyzed, separately or together, to locate a coherent nucleus of agreement on the definition of the criteria (Wright, 1993). In the analysis, the judge mean is set to zero, whereas items and persons are allowed to float. With regard to misfitting items and students, it is suggested that these be put aside in order to clearly define the variable (i.e., the Judgment Line); a minimal screening sketch follows.
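- The sketch below illustrates the two ideas just described: selecting the items/students that fall within a band around a provisional criterion, and putting aside misfitting elements. The data layout, the one-logit half-width and the 1.5 infit cutoff are assumptions for illustration, not values from Wright (1993).

```python
def criterion_region(elements, centre, half_width=1.0):
    """Return items/students whose measures fall within a band around a
    provisional criterion (the half-width is an assumed value)."""
    return [e for e in elements if abs(e['measure'] - centre) <= half_width]

def set_aside_misfits(elements, max_infit=1.5):
    """Split elements into a coherent nucleus and a set-aside group of misfits
    (the infit mean-square cutoff is illustrative)."""
    kept = [e for e in elements if e['infit_mnsq'] <= max_infit]
    aside = [e for e in elements if e['infit_mnsq'] > max_infit]
    return kept, aside

students = [
    {'name': 'S07', 'measure': 0.35, 'infit_mnsq': 0.9},
    {'name': 'S11', 'measure': 1.80, 'infit_mnsq': 2.1},
]
print(criterion_region(students, centre=0.5))
print(set_aside_misfits(students))
```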
This method is significant in several respects.
- First, it takes into account both test content and student performance in the establishment of the criterion level/standard.
- Second, it provides the possibility of examining items and persons that cause disagreements in judges' decisions for the refinement of the final standard.
- Third, it provides a useful and much needed technique for the identification of inconsistent judges, a critical component that is missing from most standard setting methods.
3. Objective Standard Setting (OSS) method, Stone (1996)
- This method, which has its roots in the work of Grosse and Wright (Stone, 1996), involves three evaluative decisions and three translations in the quantification of the standard.
These evaluative decisions pertain to
- (a) judgment on the essentiality of each item
- (b) the level of mastery required, and
- (c) decision confidence.
- Quantification of the qualitative or evaluative decisions involves
- (a) the calculation of the mean item difficulty of the essential items,
- (b) translation of the mastery decision via the odds-to-logits transformation, and
- (c) quantification of the standard error (Stone, 1996, 2001).
The OSS steps (Stone, 1996, 2001) are as follows; a worked sketch follows this list.
- 1. Judges assess the content as presented in each item and determine its essentiality.
- 2. Each judge's set of essential items is quantified by calculating the mean item difficulty of those items. The item difficulty means for all judges are then summed and averaged.
- 3. The level of mastery required is then determined by the judges. It is expressed as a probability (e.g., 50%, 60% or 80%) and converted to a logit measure through the odds-to-logits transformation (e.g., 60% is equivalent to .41 logits).
- 4. Confidence in decisions, whether to protect innocence or to ensure quality, is determined using the standard error (Stone, 2001; Wright & Grosse, 1993).
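- The following Python sketch puts these steps together under simple assumptions; the function name, the data layout and the one-standard-error adjustment are illustrative rather than Stone's exact published computation.

```python
import math
from statistics import mean

def oss_cutscore(essential_difficulties_by_judge, mastery_probability,
                 standard_error=0.0, protect_innocence=True):
    """Sketch of the OSS quantification.

    essential_difficulties_by_judge: list of lists; each inner list holds the
        Rasch difficulties (logits) of the items one judge marked as essential.
    mastery_probability: required level of mastery, e.g. 0.60 for 60%.
    standard_error: error of the cutscore; subtracted to protect innocence
        (lower the bar) or added to ensure quality (raise the bar).
    """
    # Step 2: mean essential-item difficulty per judge, averaged across judges
    judge_means = [mean(ds) for ds in essential_difficulties_by_judge]
    content_standard = mean(judge_means)

    # Step 3: odds-to-logits transformation of the mastery decision
    mastery_adjustment = math.log(mastery_probability / (1 - mastery_probability))

    # Step 4: confidence adjustment via the standard error
    se_adjustment = -standard_error if protect_innocence else standard_error

    return content_standard + mastery_adjustment + se_adjustment

# Example: three judges, 60% mastery (about .41 logits), standard error of 0.10
print(oss_cutscore([[-0.5, 0.2, 0.4], [0.0, 0.3], [-0.2, 0.1, 0.6]], 0.60, 0.10))
```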
The OSS is desirable in several respects.
- It refocuses the judgment task from prediction of the performance of a hypothetical group of minimally competent examinees to judgment of the essentiality of content (Stone, 1996, 2001).
- Its simplicity is another factor that contributes to its appeal. By requiring judges to focus only on the essentiality of content, the complexity of the judgment task is greatly reduced. This subsequently makes judge training less demanding and less intensive.
Quantification of judgments is a straightforward process.
- The mean estimate of the essential items is calculated for each judge and the results are averaged across all judges using a simple mathematical operation. Additionally, the OSS is not iterative and does not require quantification of countless judgments resulting from multiple iterations of judgments of examinee performance on test items, as is the case with non-objective standard setting methods.
However, there are some limitations to this method.
- The OSS, which was originally developed for selected-response items, is not easily generalized to constructed-response items, nor is it readily used to handle assessments utilizing a combination of selected-response (SR) and constructed-response (CR) items. Quantification and determination of the criterion region (see Wright, 1993), though possible, are cumbersome when CR items are involved. Second, it requires a well-established construct theory, as item disordinality has serious consequences for the resulting cutscores.
The OSS does not address the problem of intrajudge variability
- Nonetheless, Stone and Engelhard (1998) have developed a procedure for assessing the quality of judges' ratings using the Rasch model. In this procedure, rating errors such as intrajudge inconsistency, halo, central tendency and restriction of range are examined. However, how this procedure could be integrated into the standard setting process and the calibration of cutscores has not been explored.
Purpose of Study
- This study attempts to extend the utility of the OSS to deal with both multiple-choice and constructed-response items through the use of a Facets-based approach (Linacre, personal communication, June 22, 2005) in the quantification of judges' ratings.
Participants
- The standard setting panel consisted of 12 judges who were language instructors as well as item writers at the Centre for Languages and Pre-University Academic Development of the International Islamic University Malaysia. The judges possessed at least a basic degree in TESOL or a master's degree in a similar field, and at least 2 years' teaching experience at the Centre.
Instrument
- A placement battery developed by the Centre for purposes of student exemption from, and placement into, four English language support courses. The placement battery consisted of three subtests: Paper 1 (Grammar and Reading), Paper 2 (Essay Writing) and Paper 3 (Speaking). In this study, the focus was on the first two subtests.
Procedures
- The OSS procedure was used to generate judges' ratings. For the multiple-choice subtest (Paper 1), judges were asked to individually select essential items for each of the four cutscores. The essentiality of items was referenced against verbal descriptions (which the Centre had developed) of what a minimally competent student is expected to be able to do/know in order to be classified as having achieved a given cutscore (or standard).
- Items that were selected as essential were marked as 1 and non-essential items were marked as 0. This procedure was carried out for the dichotomously scored items. Applying the same concept, judgment of the polytomous items (i.e., the essay items) was carried out by matching the performance level description with the corresponding descriptor and numerical rating/value as stated in the evaluation profile (rating scale) used in the evaluation of the given polytomous items (Figure 4). A sketch of the resulting data layout follows.
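- The sketch below shows one way such judgments could be assembled into a three-facet (judge, cutscore level, item) data file for a many-facet analysis. The judge, level and item labels and the file name are made up for the example; they are not the study's actual data.

```python
import csv

# Illustrative judgments: 1/0 essentiality for MC items, band ratings for essays.
mc_judgments = {            # (judge, level, item) -> 1 = essential, 0 = not essential
    ('J01', 'Level1', 'MC01'): 1,
    ('J01', 'Level4', 'MC01'): 0,
    ('J02', 'Level1', 'MC01'): 1,
}
essay_judgments = {         # (judge, level, essay item) -> expected band on the rating scale
    ('J01', 'Level1', 'Essay1'): 2,
    ('J01', 'Level4', 'Essay1'): 5,
}

# One observation per row: this judge-by-level-by-item layout is the kind of
# matrix a many-facet Rasch analysis can calibrate.
with open('standard_setting_ratings.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['judge', 'level', 'item', 'rating'])
    for (judge, level, item), rating in {**mc_judgments, **essay_judgments}.items():
        writer.writerow([judge, level, item, rating])
```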
Facets Data Matrix
Results
- Facets calibrations of judges, cutscores (levels) and items are presented in Figure 1. The first column is the logit scale, followed by the standard setting judges, the cutscores (levels), the test items and the scale used for rating the essay performances.
- From the judges' distribution in Figure 1, it is evident that there is some variation in judges' perception of essential items. The separation index (2.24) and the chi-square value of 72.6 with 11 df, significant at p < .01, indicate that judges consistently differ from one another in overall severity of judgment (Table 1). Judge 9 is seen to be the most severe and Judges 10 and 6 the least severe. With the exception of Judge 9, all the judges cluster within -0.5 to 0.2 logits. (A sketch of how such separation statistics can be computed appears below.)
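- For readers unfamiliar with these indices, the following sketch shows one common way a separation index and a fixed (all-same) chi-square can be computed from element measures and their standard errors. It is an illustration under those assumptions, not the exact computation performed by the Facets program, and the example values are invented.

```python
import math

def separation_and_chisq(measures, standard_errors):
    """Separation index and fixed chi-square for one facet (e.g., judges)."""
    n = len(measures)
    mean_m = sum(measures) / n
    observed_var = sum((m - mean_m) ** 2 for m in measures) / (n - 1)
    error_var = sum(se ** 2 for se in standard_errors) / n        # mean error variance
    true_var = max(observed_var - error_var, 0.0)                 # adjusted (true) variance
    separation = math.sqrt(true_var) / math.sqrt(error_var)       # true SD / RMSE

    # Fixed (all-same) chi-square: can all elements share one common measure?
    weights = [1 / se ** 2 for se in standard_errors]
    weighted_mean = sum(w * m for w, m in zip(weights, measures)) / sum(weights)
    chi_square = sum(w * (m - weighted_mean) ** 2 for w, m in zip(weights, measures))
    return separation, chi_square, n - 1                          # df = n - 1

sep, chisq, df = separation_and_chisq([-0.4, -0.1, 0.0, 0.2, 0.9],
                                      [0.12, 0.11, 0.13, 0.12, 0.14])
print(f"separation = {sep:.2f}, chi-square = {chisq:.1f} with {df} df")
```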
- Figure 1 also indicates that the four criterion levels are clearly separated. Note, however, that criterion Levels 3 and 4 are exceptionally high in relation to the distribution of test items. The first cutscore appears to be low, but it does not fall below some of the easiest items on the test. How the lowest cutscore relates to the actual examinee distribution will be examined later in this paper.
A detailed judge measurement report (Table 1)
- Although the difference in judge severity is quite small (about 1.3 logits from the most severe to the least severe), there is significant variation in judge severity. In terms of judges' self-consistency, Judges 3 and 4 appear to be clearly misfitting. Judge 4, who has Infit and Outfit MnSq statistics of 2.14 and 3.03 respectively, also shows a low discrimination index of .05.
- The Levels (cutscores) measurement report (Table 2) indicates a significant difference between levels. The separation index is 18.59 and the chi-square value is 1337.3 with 3 df, significant at p < .01. Note that Cutscore (Level) 4 has a higher Infit MnSq value (1.48) compared with the other levels. Estimated discrimination showed fairly reasonable values (although below the expected value of 1), with Level 2 showing the lowest discrimination estimate (.61). The cutscore (criterion level) that separates candidates into the first and second levels of English language performance is calibrated at -1.10 logits, whereas the highest cutscore, which exempts examinees from the language support courses, is calibrated at 4.66 logits.
Levels (Cutscores) Measurement Report
Item displacement
- As regards item displacement, five items (Items 27, 39, 63, 69 and 73) showed positive displacement values above 3. Item 39 showed the highest positive displacement estimate, indicating that judges had misjudged this item, which is easy (measure -2.16), as a difficult item and hence assigned it to a high level of language performance. Item 5 (measure 1.10), which has a displacement estimate of -2.98, has on the other hand been misjudged as an easy item. Overall, more items have been misjudged as difficult (as indicated by the empirical calibrations) than have been misjudged as easy. With respect to fit, items that showed high displacement values also showed high Infit and Outfit Mean Square estimates. (A simple sketch for flagging such items follows.)
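- The sketch below flags items whose displacement values suggest they were misjudged. The data layout and the adjustable cutoff are assumptions for illustration; the example rows echo the items discussed above but are not the actual analysis output.

```python
def flag_misjudged_items(item_reports, displacement_cutoff=3.0):
    """Flag items whose displacement suggests they were misjudged.

    item_reports: list of dicts like
        {'item': 'Item 39', 'measure': -2.16, 'displacement': 4.2}
    A large positive displacement is read here as 'judged harder than the
    empirical calibration', and a large negative one as 'judged easier'.
    """
    flagged = []
    for r in item_reports:
        if r['displacement'] > displacement_cutoff:
            flagged.append((r['item'], 'misjudged as difficult'))
        elif r['displacement'] < -displacement_cutoff:
            flagged.append((r['item'], 'misjudged as easy'))
    return flagged

print(flag_misjudged_items([
    {'item': 'Item 39', 'measure': -2.16, 'displacement': 4.2},
    {'item': 'Item 5', 'measure': 1.10, 'displacement': -2.98},
], displacement_cutoff=2.5))
```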
Table 3: Item Fit Statistics and Displacement
- The following figure displays the estimated cutscores (criterion levels) derived from the Facets analysis as applied to the actual examinee and item distributions. It is evident that the judges had underestimated the lowest cutscore (Level 1) and overestimated the highest cutscore (Level 4). Based on the item distribution, the underestimation of the lowest cutscore or criterion level could be attributed to the substantial number of easy (off-target) items on the test.
- The overestimation of the highest cutscore (Level 4), on the other hand, is due to judges' expectations that examinees should get all items correct to be exempted from the language support courses (Figure 8). It must be noted that the cutscore separating examinees into the third and fourth levels of language performance suffers from the same problem.
Scaling Expert Judgment and Rasch Calibrations
Figure 10: Scaling of Reading Test Items According to Expert Judgment and Rasch Calibrations
This Facets-based procedure has some clear advantages.
- The first is efficiency in relation to the identification of judges' internal inconsistency. Using this procedure, inconsistent judges can be identified in the same process in which the cutscores are calibrated. If it is decided that the ratings of inconsistent judges are to be excluded from the computation of the final cutscores, subsequent re-computation of the cutscores can be processed easily and quickly.
- The second advantage of utilizing this approach pertains to the judgment of essential items. Items that are misjudged as difficult or easy can be readily identified through the use of fit statistics and displacement values.
- Third, as Facets computes the error of measurement of the judges' ratings for each cutscore, this can be used in the adjustment of the final cutscore, if so desired.
- In situations where multiple cutscores are necessary, the statistical significance of cutscore separation can also be clearly established. This approach to computing judges' ratings also allows for the investigation of judge-item interaction. Through bias analysis, unexpectedly harsh or lenient ratings of particular items by particular judges can be identified. These idiosyncratic ratings can be intercepted and, if necessary, treated as missing without disturbing the validity of the remainder of the analysis (Linacre, 1989, p. 13). Alternatively, feedback can be given to the judges in question for improvements in the judging process (Linacre, 1989).
- The findings of this study also underscore the importance of a clear understanding of the measured construct. As cutscores and performance standards are set on tests, the quality of the test(s) used is bound to impact the integrity of the derived cutscores to some extent. It is therefore critical that the quality of the test(s) used in a standard setting study/process be carefully examined. This is particularly important when model-based methods are involved.
This is explained in Kane (2002):
- "The model-based, theoretical interpretation is considerably richer than a simple generalization from performance on a sample of tasks to expected performance on the universe of tasks from which the sample is drawn, and as a result, requires more evidence for its support. In particular, the validity argument for model-based, theoretical interpretations will require evidence for the validity of the theory of performance as well as evidence that the assessments can be interpreted in terms of the theory. That is, an interpretation of test scores in terms of a theoretical model depends on evidence for the model and for the relationship between the observed scores and terms in the model." (p. 32)
Finally, judge competency must also be given due attention.
- Based on a study involving judge competency, Chang, Dziuban, Hynes and Olson (1996) have suggested that judges should not only be trained to perform the judgment task but should also be trained in the domain content for which they are to set the competency standard (p. 170).