1
Using the Many-facets Rasch Model to Resolve
Standard Setting Issues.
  • Noor Lide Abu Kassim - IIUM
  • and
  • Trevor G Bond - HKIEd

2
Use of educational standards is not without
controversy
  • The reason for this lies primarily in the judgmental nature of the standard setting process, in which cutscores that correspond to pre-specified performance levels are established. To some experts, the lack of objectivity due to the use of human judgment in constructing cutscores, instead of "a straightforward process of parameter estimation" (Kane, 2001, p.81), renders standards arbitrary, and thus invalid at worst or imprudent at best (e.g., Glass, 1978; Burton, 1977).

3
Some Fundamental Issues in Choice of Standard
Setting Methodology
  • In selecting the right standard setting method, several issues are of primary concern. The first relates to the judgment task that judges or panelists are required to perform. The Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999) takes a very clear stand on this issue:

4
  • When cut scores defining pass-fail or proficiency
    categories are based on direct judgments about
    the adequacy of item or test performances or
    performance levels, the judgmental process should
    be designed so that judges can bring their
    knowledge and experience to bear in a reasonable
    way. (p.60)

5
In non-objective methods (Stone, 1996)
  • e.g., Angoff's, the judgment task requires judges or panelists to estimate the probability that a minimally competent examinee will succeed on test items. This is ineffectual, as judges are asked to perform a task that is too difficult, confusing, and nearly cognitively impossible (e.g., Pellegrino et al., 1999; NCES, 2003).

6
  • A more serious flaw in this judgment task is that it draws the focus of judgment away from content, and therefore from the measured construct, to prediction of examinee performance on test items (Stone, 1996).
  • Methods such as the Angoff procedure begin with content, but end up "atomized into hundreds of contentless score fractions" devoid of a clear and meaningful description of the standard (Stone, 1995, p.1).

7
  • One of the criticisms leveled at some widely-used standard setting methods is that they are relevant only for particular item types, namely selected-response items (e.g., the Nedelsky method).

8
  • A second issue in addressing the utility of a standard setting method is its capacity to deal with diverse item types (Mitzel et al., 2001). Constructed-response items are fast becoming a common feature in most high-stakes assessment programmes.

9
  • It is therefore important to examine the generalizability of a standard setting method to item types other than selected-response. Since different methods focus on different information and use different procedures to arrive at the final results, it is often recommended that the same standard setting method be used to set standards within a particular assessment programme, to ensure consistency in the resulting standards.

10
  • The third issue relates to judges' internal consistency. The tendency to overlook intrajudge consistency is not peculiar to the two methods discussed above: a review of newly-developed and long-standing standard setting methods indicates no clear strategies or procedures for examining intrajudge consistency.

11
Standard Setting Using Rasch Measurement
  • In Rasch measurement, several procedures have been developed for the setting of standards/cutscores. These procedures capitalize on the two key attributes of a scientific measurement system in the human sciences: the validity of the test being used and the Rasch measurement properties of the resultant scale (Stone, 1995, p.452).

12
  • Given the limitations of this paper, only three
    of the procedures are discussed mainly because of
    their significant contribution to the standard
    setting literature.

13
1. Grosse & Wright (Stone, 1996)
  • The first of these is a method introduced by Grosse and Wright (Stone, 1996). In this three-stage method, individual judges are first asked to select a set of criterion items. They are then required to determine a minimum passing score (percentage correct) for their individual set of items. In the third stage, judges are given performance data and, with this new information, are asked to review their set of criterion items. The final criterion point is computed using the Rasch PROX formula:

14
b = H + X · ln(P / (1 − P))
  • where b = the judge's criterion standard for the entire test, in logits
  • H = average difficulty of the judge's criterion items
  • X = (1 + w²/2.89)^½ (where w = SD of the judge's criterion item difficulties)
  • P = the proportion corresponding to the percent-correct standard set by the judge
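As a minimal sketch (not code from the paper), the PROX computation can be expressed as follows; the example item difficulties, the 70% standard, and the population-SD convention for w are illustrative assumptions:

```python
import math

def prox_criterion_standard(item_difficulties, pct_correct):
    """Judge's criterion standard b (in logits) via the Rasch PROX formula."""
    n = len(item_difficulties)
    H = sum(item_difficulties) / n                    # mean criterion-item difficulty
    w = math.sqrt(sum((d - H) ** 2 for d in item_difficulties) / n)  # SD of difficulties
    X = math.sqrt(1 + w ** 2 / 2.89)                  # spread-expansion factor (2.89 = 1.7**2)
    P = pct_correct / 100.0                           # percent-correct standard as a proportion
    return H + X * math.log(P / (1 - P))

# Illustrative: five criterion items and a 70% percent-correct standard
print(round(prox_criterion_standard([-0.8, -0.2, 0.1, 0.4, 0.5], 70), 2))  # ~0.88
```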

15
2. Julian & Wright (1993)
  • The second procedure, developed by Julian and Wright (cited in Stone, 1996), is a refinement of the first. In this procedure, relevant items and students around the criterion region are first identified (Figure 1). Judges are then asked to decide whether the items are required of a passing examinee for that examinee to be considered competent (Stone, 1996). Judges are also asked to rate selected students based on their performances and histories.

16
Relevant Items and Students within the Criterion Region. Source: Wright (1993)
17
Judge-by-Item and Judge-by-Student Matrix. Source: Wright (1993)
18
  • The selected items and students are rated using the scale in Figure 2. The data from this judge-by-item and judge-by-student matrix are then analyzed, separately or together, to locate "a coherent nucleus of agreement on the definition of the criteria" (Wright, 1993). In the analysis, the judge mean is set to zero, whereas items and persons are allowed to float. Misfitting items and students, it is suggested, should be put aside in order to clearly define the variable (i.e., the Judgment Line). A simplified numerical sketch of this centering convention follows.
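The sketch below is only a crude descriptive stand-in for the Rasch calibration itself (Facets estimates measures iteratively from a measurement model), but it illustrates setting judge means to zero, letting item locations float, and how disagreement surfaces as misfit. All ratings are hypothetical:

```python
import numpy as np

# Hypothetical judge-by-item ratings (rows: judges, columns: items).
ratings = np.array([
    [2, 3, 1, 3, 2],
    [2, 3, 2, 3, 1],
    [1, 3, 2, 2, 2],
    [3, 1, 2, 3, 2],   # this judge disagrees sharply on items 1 and 2
], dtype=float)

centered = ratings - ratings.mean(axis=1, keepdims=True)  # judge means set to zero
item_locations = centered.mean(axis=0)                    # items allowed to 'float'
residuals = centered - item_locations                     # disagreement with consensus
item_msq = (residuals ** 2).mean(axis=0) / (residuals ** 2).mean()

print("item locations:", np.round(item_locations, 2))
print("relative item misfit (~1 expected):", np.round(item_msq, 2))  # large -> set aside
```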

20
This method is significant in several respects.
  • First, it takes into account both test content and student performance in establishing the criterion level/standard.
  • Second, it provides the possibility of examining items and persons that cause disagreement in judges' decisions, for the refinement of the final standard.
  • Third, it provides a useful and much-needed technique for identifying inconsistent judges, a critical component missing in most standard setting methods.

21
3. Objective Standard Setting (OSS) method (Stone, 1996)
  • This method which has its roots in the work of
    Grosse and Wright (Stone, 1996) involves three
    evaluative decisions and three translations in
    the quantification of the standard.

22
These evaluative decisions pertain to
  • (a) judgment on the essentiality of each item;
  • (b) the level of mastery required; and
  • (c) decision confidence.

23
  • Quantification of the qualitative or evaluative decisions involves
  • (a) the calculation of the mean item difficulty of essential items;
  • (b) translating the mastery decision via the odds-to-logits transformation; and
  • (c) quantification of the standard error (Stone, 1996, 2001).

24
The OSS steps (Stone, 1996, 2001) are as follows (steps 2-4 are sketched in code after this list):
  • 1. Judges assess the content presented in each item and determine each item's essentiality.
  • 2. Each judge's set of essential items is quantified by calculating the mean item difficulty of those items. The item difficulty means for all judges are then summed and averaged.
  • 3. The level of mastery required is then determined by the judges. It is expressed as a probability (e.g., 50%, 60% or 80%) and converted to a logit measure through the odds-to-logits transformation (e.g., 60% is equivalent to .41 logits).
  • 4. Confidence in decisions, whether to protect innocence or to ensure quality, is determined using the standard error (Stone, 2001; Wright & Grosse, 1993).
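A minimal sketch of that quantification, assuming (as the steps suggest) that the mastery logit simply shifts the averaged essential-item difficulty and that the standard-error adjustment is additive; consult Stone (1996, 2001) for the exact operational rules:

```python
import math

def mastery_logit(mastery_pct):
    """Odds-to-logits transformation of a mastery level given in percent."""
    p = mastery_pct / 100.0
    return math.log(p / (1 - p))

def oss_cutscore(judge_means, mastery_pct, se=0.0, z=0.0):
    """OSS steps 2-4, sketched: average the judges' mean essential-item
    difficulties, shift by the mastery requirement in logits, and optionally
    adjust by z standard errors for decision confidence (all additive here
    by assumption)."""
    mean_difficulty = sum(judge_means) / len(judge_means)
    return mean_difficulty + mastery_logit(mastery_pct) + z * se

print(round(mastery_logit(60), 2))  # 0.41 logits, matching the slide's example
print(round(mastery_logit(50), 2))  # 0.0 (50% = even odds)
```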

25
The OSS is desirable in several respects.
  • It refocuses the judgment task from prediction of the performance of a hypothetical group of minimally competent examinees to judgment of the essentiality of content (Stone, 1996, 2001).
  • Its simplicity is another factor that contributes to its appeal. By requiring judges to focus only on the essentiality of content, the complexity of the judgment task is greatly reduced. This, in turn, makes judge training less demanding and less intensive.

26
Quantification of judgments is a straightforward
process.
  • The mean difficulty of essential items is calculated for each judge, and the results are averaged across all judges using a simple mathematical operation. Additionally, the OSS is not iterative and does not require the quantification of countless judgments arising from multiple iterations of judgments of examinee performance on test items, as is the case with non-objective standard setting methods.

27
However, there are some limitations to this
method.
  • The OSS, which was originally developed for selected-response (SR) items, is not easily generalized to constructed-response (CR) items, nor is it readily used with assessments that combine SR and CR items. Quantification and determination of the criterion region (see Wright, 1993), though possible, is cumbersome when CR items are involved. Second, the OSS requires a well-established construct theory, as item disordinality has serious consequences for the resulting cutscores.

28
OSS does not address the problem of intrajudge
variability
  • Nonetheless, Stone and Engelhard (1998) have developed a procedure for assessing the quality of judges' ratings using the Rasch model. In this procedure, rating errors such as intrajudge inconsistency, halo, central tendency and restriction of range are examined. However, how this procedure could be integrated into the standard setting process and the calibration of cutscores has not been explored.

29
Purpose of Study
  • This study attempts to extend the utility of the OSS to deal with both multiple-choice and constructed-response items through a Facets-based approach (Linacre, personal communication, June 22, 2005) to the quantification of judges' ratings.

30
Participants
  • The standard setting panel consisted of 12 judges who were language instructors as well as item writers at the Centre for Languages and Pre-University Academic Development of the International Islamic University Malaysia. The judges possessed at least a bachelor's degree in TESOL or a master's degree in a related field, and at least two years' teaching experience at the Centre.

31
Instrument
  • The instrument was a placement battery developed by the Centre for the purposes of exempting students from, and placing them into, four English language support courses. The battery consisted of three subtests: Paper 1 (Grammar & Reading), Paper 2 (Essay Writing) and Paper 3 (Speaking). In this study, the focus was on the first two subtests.

32
Procedures
  • The OSS procedure was used to generate judges' ratings. For the multiple-choice subtest (Paper 1), judges were asked to individually select essential items for each of the four cutscores. The essentiality of items is referenced against verbal descriptions, which the Centre had developed, of what a minimally competent student is expected to be able to do or know in order to be classified as having achieved a given cutscore (or standard).

33
  • Items selected as essential were marked 1 and non-essential items were marked 0. This procedure was carried out for dichotomously scored items. Applying the same concept, judgment of the polytomous items (i.e., the essay items) was carried out by matching the performance level description with the corresponding descriptor and numerical rating in the evaluation profile (rating scale) used for scoring those items (Figure 4). A sketch of the resulting data layout follows.
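A minimal sketch of the three-facet (judge × level × item) layout this coding produces; the CSV format is a generic illustration rather than the exact Facets input syntax, and all judge, level, item and rating values are hypothetical:

```python
mc_ratings = {            # (judge, level, MC item) -> 1 = essential, 0 = not
    (1, 1, 12): 1, (1, 1, 13): 0,
    (1, 2, 12): 1, (1, 2, 13): 1,
}
essay_ratings = {         # (judge, level, essay item) -> rating-scale category
    (1, 1, 101): 2, (1, 2, 101): 4,
}

with open("judgments.csv", "w") as out:
    for (judge, level, item), rating in {**mc_ratings, **essay_ratings}.items():
        out.write(f"{judge},{level},{item},{rating}\n")  # one observation per row
```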

35
Facets Data Matrix
36
Results
  • Facets calibrations of judges, cutscores (levels) and items are presented in Figure 1. The first column is the logit scale, followed by the standard setting judges, the cutscores (levels), the test items, and the scale used for rating essay performances.

38
  • From the judges' distribution in Figure 1, it is evident that there is some variation in judges' perceptions of essential items. The separation index (2.24) and the chi-square value of 72.6 with 11 df, significant at p < .01, indicate that judges consistently differ from one another in overall severity of judgment (Table 1). Judge 9 is seen to be the most severe, and Judges 10 and 6 the least severe. With the exception of Judge 9, all the judges cluster within -0.5 to 0.2 logits. (The computations behind these two indices are sketched below.)
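The usual Rasch formulas behind the separation index and the fixed-effect chi-square, sketched below; the exact conventions Facets applies (e.g., for the error variance) may differ slightly:

```python
import numpy as np

def separation_and_chisq(measures, ses):
    """Separation index and fixed-effect ('all equal') chi-square for a set
    of facet measures (e.g., judge severities) and their standard errors."""
    m = np.asarray(measures, dtype=float)
    s = np.asarray(ses, dtype=float)
    error_var = (s ** 2).mean()                     # mean-square measurement error
    true_var = max(m.var(ddof=1) - error_var, 0.0)  # observed variance minus error
    separation = np.sqrt(true_var / error_var)      # 'true' spread in SE units
    w = 1.0 / s ** 2                                # precision weights
    grand_mean = (w * m).sum() / w.sum()
    chisq = (w * (m - grand_mean) ** 2).sum()       # H0: all measures are equal
    return separation, chisq, len(m) - 1            # df = n - 1
```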

39
  • Figure 1 also indicates that the four criterion levels are clearly separated. Note, however, that criterion levels 3 and 4 are exceptionally high in relation to the distribution of test items. The first cutscore appears to be low, but it does not fall below some of the easiest items on the test. How the lowest cutscore relates to the actual examinee distribution is examined later in this paper.

40
Table 1 gives a detailed judge measurement report.
  • Although the difference in judge severity is quite small (about 1.3 logits from the most severe to the least severe), there is significant variation in judge severity. In terms of judges' self-consistency, Judges 3 and 4 appear to be clearly misfitting. Judge 4, who has Infit and Outfit MnSq statistics of 2.14 and 3.03 respectively, also shows a low discrimination index of .05.

42
  • The Levels (cutscores) measurement report (Table 2) indicates a significant difference between levels: the separation index is 18.59 and the chi-square value is 1337.3 with 3 df, significant at p < .01. Note that Cutscore (Level) 4 has a higher Infit MnSq value (1.48) than the other levels. Estimated discrimination showed fairly reasonable values (although below the expected value of 1), with Level 2 showing the lowest discrimination estimate (.61). The cutscore (criterion level) separating candidates into the first and second levels of English language performance is calibrated at -1.10 logits, whereas the highest cutscore, which exempts examinees from the language support courses, is calibrated at 4.66 logits.

43
Levels (Cutscores) Measurement Report
44
Item displacement
  • As regards item displacement, five items (Items 27, 39, 63, 69 and 73) showed positive displacement values above 3. Item 39 showed the highest positive displacement estimate, indicating that judges had misjudged this item, which is easy (measure -2.16), as difficult, and hence assigned it to a high level of language performance. Item 5 (measure 1.10), which has a displacement estimate of -2.98, has on the other hand been misjudged as an easy item. Overall, more items were misjudged as difficult (as indicated by the empirical calibrations) than were misjudged as easy. With respect to fit, items with high displacement values also showed high Infit and Outfit Mean Square estimates. (A sketch of this flagging logic follows.)
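A hedged sketch of the flagging logic, assuming displacement is computed as the judged location minus the empirical calibration; the measures for Items 39 and 5 are those reported above, while Item 27's values, the judged locations, and the cutoff are illustrative:

```python
empirical = {5: 1.10, 27: -1.30, 39: -2.16}   # bank calibrations, in logits
judged    = {5: -1.88, 27: 1.80, 39: 1.00}    # locations implied by judges' ratings

CUTOFF = 2.9   # illustrative flagging threshold
for item, emp in sorted(empirical.items()):
    d = judged[item] - emp                    # displacement: judged minus empirical
    if abs(d) > CUTOFF:
        verdict = "misjudged as difficult" if d > 0 else "misjudged as easy"
        print(f"Item {item}: displacement {d:+.2f} ({verdict})")
```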

45
Table 3 Item Fit Statistics and Displacement
47
  • The following figure displays the estimated cutscores (criterion levels) derived from the Facets analysis, as applied to the actual examinee and item distributions. It is evident that the judges had underestimated the lowest cutscore (Level 1) and overestimated the highest cutscore (Level 4). Based on the item distribution, the underestimation of the lowest cutscore or criterion level could be attributed to the substantial number of easy (off-target) items on the test.

49
  • The overestimation of the highest cutscore (Level 4), on the other hand, is due to judges' expectation that examinees should get all items correct to be exempted from the language support courses (Figure 8). It must be noted that the cutscore separating examinees into the third and fourth levels of language performance suffers from the same problem.

51
Scaling Expert Judgment and Rasch Calibrations
53
Figure 10: Scaling of Reading Test Items According to Expert Judgment and Rasch Calibrations
55
This Facets-based procedure has some clear
advantages.
  • The first is efficiency in identifying judges' internal inconsistency. Using this procedure, inconsistent judges can be identified in the same process in which the cutscores are calibrated. If it is decided that the ratings of inconsistent judges are to be excluded from the computation of the final cutscores, the cutscores can be easily and quickly recomputed.

56
  • The second advantage of utilizing this approach
    pertains to judgment of essential items. Items
    that are misjudged as difficult or easy can be
    easily identified through the use of fit
    statistics and displacement values.

57
  • Third, as Facets computes the error of measurement for judges' ratings at each cutscore, this error can be used to adjust the final cutscore, if so desired, as sketched below.
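A minimal sketch of such an adjustment; the direction convention, the z value of 1.65, and the SE of 0.08 are illustrative assumptions (the -1.10 Level 1 cutscore is from the results above):

```python
def adjust_cutscore(cutscore, se, z=1.65, protect="quality"):
    """Shift a calibrated cutscore by z standard errors: upward to 'ensure
    quality' (fewer false passes), downward to 'protect innocence' (fewer
    false fails). The z of 1.65 (~one-sided 95% confidence) is illustrative."""
    return cutscore + z * se if protect == "quality" else cutscore - z * se

# The -1.10 cutscore is from the results above; the SE of 0.08 is hypothetical.
print(round(adjust_cutscore(-1.10, 0.08), 2))                       # -0.97
print(round(adjust_cutscore(-1.10, 0.08, protect="innocence"), 2))  # -1.23
```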

58
  • In situations where multiple cutscores are necessary, the statistical significance of cutscore separation can also be clearly established. This approach to computing judges' ratings also allows for the investigation of judge-item interaction. Through bias analysis, unexpectedly harsh or lenient ratings of particular items by particular judges can be identified. These idiosyncratic ratings can be intercepted and, if necessary, treated as missing "without disturbing the validity of the remainder of the analysis" (Linacre, 1989, p.13). Alternatively, feedback can be given to the judges in question to improve the judging process (Linacre, 1989). A sketch of such residual-based flagging follows.
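A hedged sketch of flagging idiosyncratic judge-item ratings via standardized residuals; in practice the observed/expected ratings and model variances would come from the Facets residual output, and the cutoff of 2.0 is an illustrative choice:

```python
import numpy as np

def flag_idiosyncratic(observed, expected, variance, cutoff=2.0):
    """Standardized residuals for judge-by-item cells; cells beyond the cutoff
    are set to NaN, i.e., treated as missing before recalibration."""
    z = (observed - expected) / np.sqrt(variance)   # standardized residuals
    cleaned = np.where(np.abs(z) > cutoff, np.nan, observed)
    return z, cleaned
```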

59
  • The findings of this study also underscore the importance of a clear understanding of the measured construct. As cutscores and performance standards are set on tests, the quality of the tests used is bound to affect the integrity of the derived cutscores to some extent. It is therefore critical that the quality of the tests used in a standard setting process be carefully examined. This is particularly important when model-based methods are involved.

60
This is explained by Kane (2002):
  • The model-based, theoretical interpretation is
    considerably richer than a simple generalization
    from performance on a sample of tasks to expected
    performance on the universe of tasks from which
    the sample is drawn, and as a result, requires
    more evidence for its support. In particular, the
    validity argument for model-based, theoretical
    interpretations will require evidence for the
    validity of the theory of performance as well as
    evidence that the assessments can be interpreted
    in terms of the theory. That is, an
    interpretation of test scores in terms of a
    theoretical model depends on evidence for the
    model and for the relationship between the
    observed scores and terms in the model (p.32).

61
Finally, judge competency must also be given due
attention.
  • Based on a study involving judge competency, Chang, Dziuban, Hynes and Olson (1996) have suggested that judges "should not only be trained to perform the judgment task but also should be trained in the domain content for which they are to set the competency standard" (p. 170).
