Title: Equating And Scaling
1Equating And Scaling
2The Goal of this Session Is
- To give you a general idea of
- What equating is
- Why we need to do it
- How it works
- As part of this, we will discuss
- Different equating models
- A quick review of IRT
- Scaling
3Why Equate?
- Why would the average raw scores on a test
administered in 2006 and 2007 differ?
4Why Equate?
- Why would the average raw scores on a test in
2006 and 2007 differ?
- The 2006 form is harder than the 2007 form
- This year's students are better prepared than
last year's were
- Both
5Why Equate?
- Equating allows us to determine the extent to
which
- one test is harder than the other (which is
usually the case)
- one group is more able (i.e., has more of the
construct of interest) than the other (also
usually the case)
- This enables us to ensure that well-prepared
examinees get higher scores than the less
well-prepared, regardless of the test they took
6Why Equate?
- If we gave both groups the same test, we could
directly compare their performance, but this is
not practical
- Security
- Release of items
- This is where equating items come in: a subset
of items administered in both tests
7Why Equate?
- The performance on the equating items is used to
compare student ability across the two groups
- We can use this information to determine to what
extent the difference in performance is due to
one group being better prepared than the other
- Once we know that, we can then determine how much
harder or easier one test is than the other and
adjust so that scores based on the two tests can
be compared directly
- In other words, the tests are equated
8Equating Items
- In order to ensure that this is done accurately,
the equating items should have the following
characteristics
- Good psychometric properties
- Be parallel to the overall test
- Content
- MC vs. CR items
- Passage length
- Graphics
- etc.
9Equating Items
- The difficulty of the non-equating items in each
test can vary, within reason. However:
- We don't want the feel of the assessment to
change differentially for different subgroups of
students. As one example:
- If one test is harder than another, lower
performing students may be more frustrated on the
harder test
- If one test is easier than another, higher
performing students may be more bored and
unmotivated on the easier test
- Either of these could result in differences in
performance on the two tests that are unrelated
to the construct of interest
10Equating Items
- In addition, equating items must not be changed
in any way from one administration to the next
- Again, any change in the item (wording, location
within the test, response options, etc.) can
cause a change in student performance that is
unrelated to the construct of interest
11Equating Models
- Classical test theory models (CTT)
- Item response theory models (IRT)
- Internal anchor (counts toward student scores)
- External anchor (doesn't count toward student
scores)
- Intact, separate anchor test
- Embedded anchor test
12Equating Models
- CTT models are concerned with estimating the
relationships of the anchor test with each total
test, and the anchor test in group 1 with the
anchor test in group 2.
- IRT models focus on estimating the relationship
of each item with the underlying trait (θ) that
is being measured.
13Equating Models
[Figure: Classical test theory equating diagram. Labels: Test 1, Test 2, Anchor 1, Anchor 2, Difficulty, Ability]
14Example
- Group 1 (2006)
- Total test score 30.6
- Score on equating items 14.2
- Group 2 (2007)
- Total test score 38.6
- Score on equating items 15.5
- Based on their performance on the equating items,
we know that Group 2 is a bit higher performing
(15.5 vs. 14.2), but their total score on the
test is quite a bit higher (38.6 vs. 30.6), which
suggests that the 2007 test is easier.
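The reasoning in this example can be sketched as simple arithmetic. This is only a back-of-the-envelope illustration, not one of the CTT equating models named earlier: it assumes (purely for illustration) that on equally hard forms the total score would rise in the same proportion as the score on the equating items. The variable names are made up for this sketch.

```python
# Mean scores from the example above.
anchor_2006, total_2006 = 14.2, 30.6
anchor_2007, total_2007 = 15.5, 38.6

# Observed differences between the two groups.
anchor_gain = anchor_2007 - anchor_2006
total_gain = total_2007 - total_2006

# ASSUMPTION (illustration only): on equally hard forms, the total score
# would rise in the same proportion as the equating-item score.
expected_total_2007 = total_2006 * (anchor_2007 / anchor_2006)

# Whatever is left over is attributed to the 2007 form being easier.
form_effect = total_2007 - expected_total_2007

print(f"Gain on equating items: {anchor_gain:.1f}")
print(f"Gain on total test: {total_gain:.1f}")
print(f"Total expected from ability alone: {expected_total_2007:.1f}")
print(f"Left over, attributed to an easier form: {form_effect:.1f}")
```

Under that rough assumption, only part of the 8-point gain in total score is explained by the 2007 group being more able; the remainder points to an easier 2007 form.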
15Equating Models
- CTT models are well known, commonly used, and are
relatively easy computationally
- IRT models have a shorter history and are
computationally difficult, but they have certain
advantages that make their use desirable
- At MP, we use pretty much exclusively IRT
equating models
16Basics of Item Response Theory
- Why Use IRT?
- Review of IRT
- The Item Characteristic Curve (ICC)
- The Test Characteristic Curve (TCC)
17Why Use IRT?
- Advantages over CTT
- IRT allows us to calculate an estimate of student
ability (θ), not just observe how a particular
student performs on a particular test
- IRT uses the same theta scale to describe
students and items; this has certain advantages
- It provides more sophisticated information that
(depending on the specific model used) takes into
consideration various characteristics of the item
18The ICC
- Describes the interaction between examinees and
test items
- In the simplest case, the probability of a correct
response is a function of examinee ability and
item difficulty
- As more sophisticated models are used, other item
characteristics are taken into consideration as
well
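One common way to write such a curve is the three-parameter logistic function, sketched below. The slides do not say which model or software is used, so the functional form and the parameter names (`a` for discrimination, `b` for difficulty, `c` for guessing) are assumptions made for illustration; some programs also include a scaling constant of 1.7, which is left out here.

```python
import math

def icc(theta, b, a=1.0, c=0.0):
    """Probability of a correct response for ability theta.

    b: item difficulty, a: item discrimination, c: guessing (lower asymptote).
    With a=1 and c=0 this reduces to the simplest case, where only
    ability and difficulty matter.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Simplest case: when ability equals difficulty, the chance of success is 50%.
print(round(icc(0.0, b=0.0), 3))

# A harder item (larger b) lowers the probability for the same student.
print(round(icc(0.0, b=1.0), 3))

# Guessing raises the floor of the curve for very low ability.
print(round(icc(-4.0, b=0.0, c=0.2), 3))
```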
19The Basics
20The Basics
21Item Difficulty
22Item Discrimination
23Item Guessing
24A Test is Made up of Many ICCs
30For a given examinee with ability (θ) = 1.0
31For a given examinee with ability (θ) = 1.0
- The expected score on the total test is equal to
the sum of the probabilities for each item on the
test
0.82 + 0.48 + 0.98 + 0.99 + 0.82 + 0.35 = 4.44
32The TCC
- Summation of ICCs
- Describes the relationship between ability and
expected performance on the whole test
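A minimal sketch of that summation, assuming a three-parameter logistic ICC. The item parameters below are invented for illustration and do not come from any real test.

```python
import math

def icc(theta, a, b, c):
    """Three-parameter logistic ICC (an assumed model for this sketch)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected raw score at ability theta: the sum of the item probabilities."""
    return sum(icc(theta, a, b, c) for a, b, c in items)

# Hypothetical (a, b, c) parameters for a five-item test of one-point items.
ITEMS = [
    (1.0, -1.0, 0.20),
    (1.2, -0.3, 0.20),
    (0.8, 0.4, 0.25),
    (1.5, 1.0, 0.20),
    (1.0, 1.8, 0.25),
]

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}  expected raw score = {tcc(theta, ITEMS):.2f}")
```

The expected score rises steadily with ability and stays between the sum of the guessing parameters and the maximum raw score.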
33TCC is the sum of the ICCs
34TCC is the sum of the ICCs
35Is It Really That Simple?
- Polytomous Items
- Parameter Estimation
- Item Parameters
- Person Parameters
- Various IRT Models
- Examinee-Model Fit
36So What Does This Do For Us?
- Using the TCC, we can estimate the total test
score for a student at a given level of ability
- In actuality, however, this isn't what we want to
do: we already know the students' total raw
scores; what we don't know is their ability.
- Fortunately, once we have the ICCs and TCC, we
can go the other way: we can estimate ability
based on a student's observed total test score.
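Going "the other way" can be sketched as a search for the ability at which the TCC equals the observed raw score. The bisection search, the item parameters, and the function names below are assumptions made for illustration; operational scoring may use a different estimation method.

```python
import math

def icc(theta, a, b, c):
    """Three-parameter logistic ICC (an assumed model for this sketch)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected raw score at ability theta."""
    return sum(icc(theta, a, b, c) for a, b, c in items)

def theta_from_raw(raw, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Find the theta whose expected raw score equals `raw`, by bisection.

    Returns None when the raw score lies outside the range the TCC can
    reach on [lo, hi], e.g. scores below the guessing level or a perfect score.
    """
    if not tcc(lo, items) < raw < tcc(hi, items):
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < raw:
            lo = mid  # expected score too low: ability must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical (a, b, c) parameters for a five-item test.
ITEMS = [(1.0, -1.0, 0.20), (1.2, -0.3, 0.20), (0.8, 0.4, 0.25),
         (1.5, 1.0, 0.20), (1.0, 1.8, 0.25)]

for raw in (2, 3, 4):
    print(raw, round(theta_from_raw(raw, ITEMS), 2))
```

Because the TCC only increases with ability, each attainable raw score maps to exactly one theta, which is what makes the inversion possible.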
37So What Does This Have to Do with Equating?
- Back in 2006, we established the relationship
between the total test and student ability using
the theta scale
- Using the equating items, we can put the 2007
test on the same scale
38How Do We Do This?
- Estimate item parameters (i.e., calibrate the
items) for the 2006 test
- Estimate item parameters for the 2007 test, fixing
the parameters for the equating items to their
2006 values
- This forces the ability estimates for 2007 to
be on the same scale as those for 2006
- As a result, we will get the same ability
estimate for a student regardless of which test
they took
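The idea of fixing the equating-item parameters can be sketched in a heavily simplified form. The sketch below swaps in a one-parameter (Rasch) model and a single two-step pass: abilities are estimated from the equating items alone, using their fixed 2006 difficulties, and a new 2007 item is then calibrated against those abilities. Real calibration software estimates everything jointly; all names and data here are hypothetical.

```python
import math

def p_correct(theta, b):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def solve_increasing(f, lo=-6.0, hi=6.0, tol=1e-6):
    """Root of an increasing function f on [lo, hi], found by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical 2006 difficulties of three equating items, held FIXED in 2007.
ANCHOR_B_2006 = [-1.0, 0.0, 1.0]

def ability_from_anchor(anchor_responses):
    """Maximum-likelihood theta from the equating items only.

    Returns None for zero or perfect scores, which have no finite estimate.
    """
    raw = sum(anchor_responses)
    if raw == 0 or raw == len(anchor_responses):
        return None
    return solve_increasing(
        lambda t: sum(p_correct(t, b) for b in ANCHOR_B_2006) - raw)

def calibrate_new_item(thetas, responses):
    """Difficulty at which the expected number correct matches the observed."""
    observed = sum(responses)
    return solve_increasing(
        lambda b: observed - sum(p_correct(t, b) for t in thetas))

# Hypothetical 2007 data: (responses to the equating items, response to one new item).
students = [([1, 0, 0], 0), ([1, 1, 0], 1), ([1, 1, 0], 0),
            ([1, 0, 0], 0), ([1, 1, 0], 1), ([0, 1, 1], 1)]

thetas = [ability_from_anchor(anchor) for anchor, _ in students]
new_item_responses = [x for _, x in students]
b_new = calibrate_new_item(thetas, new_item_responses)
print("new item difficulty on the 2006 scale:", round(b_new, 2))
```

Because the abilities are defined through the fixed 2006 values, the new item's difficulty comes out on the 2006 scale as well.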
39 2006 and 2007 TCCs on the Same Scale
40Typical Equating Process
- Selecting Equating Items
- IRT Calibrations/equating
- Determining scores for reporting (scaling)
41Selecting Equating Items
- Initial Selection
- Test questions from last year's test are included
in this year's test
- The total points from equating items should be at
least 40% of the total points on the test
- The distribution of the items across different
relevant categories is similar to that of the
whole test
- Each item should be in about the same position
this year and last year
42Selecting Equating Items
- We also do some statistical checks to look for
items that are functioning very differently in
2007 than they did in 2006, relative to the rest
of the equating items
- If we find those, we will exclude them from use
as equating items
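The slides do not say which statistical checks are used. As one hypothetical illustration, the sketch below compares each equating item's difficulty estimate from separate 2006 and 2007 calibrations and flags items whose shift stands out from the shifts of the other equating items. The function name, the data, and the cutoff of 2 standard deviations are all assumptions.

```python
from statistics import mean, stdev

def flag_drifting_items(b_2006, b_2007, z_cut=2.0):
    """Return indices of equating items whose difficulty shift is unusual
    relative to the shifts of the other equating items."""
    shifts = [new - old for old, new in zip(b_2006, b_2007)]
    center, spread = mean(shifts), stdev(shifts)
    if spread == 0:
        return []  # every item shifted by the same amount: nothing stands out
    return [i for i, s in enumerate(shifts)
            if abs(s - center) / spread > z_cut]

# Hypothetical difficulty estimates for eight equating items.
b_2006 = [-1.20, -0.50, 0.00, 0.40, 0.90, 1.50, -0.80, 0.20]
b_2007 = [-1.15, -0.52, 0.03, 0.40, 2.10, 1.46, -0.79, 0.22]

print(flag_drifting_items(b_2006, b_2007))  # the item at index 4 is flagged
```

An item flagged this way would be dropped from the equating set, although it could still count toward student scores.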
43Item Calibrations
- We talked about this earlier, remember?
- Estimate parameters for 2006 items
- Estimate parameters for 2007 items, fixing the
values for the equating items
- Voilà: the same ability estimate for students,
regardless of which test they took!
44Scaling
- It does not really make sense to report scores on
the raw score metric
- Equated raw scores do not equal the number of
points the student achieved on that test, but
rather the number of points that the student
would be expected to achieve on the equated-to
test
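An equated raw score of this kind can be sketched by chaining the two TCCs: find the ability that corresponds to a raw score on the 2007 form, then read off the expected raw score on the 2006 form at that ability. The sketch assumes a Rasch model and made-up item difficulties that are already on a common scale; it handles only raw scores strictly between zero and the maximum.

```python
import math

def tcc(theta, difficulties):
    """Expected raw score at ability theta under a Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def theta_for_raw(raw, difficulties, lo=-8.0, hi=8.0, tol=1e-7):
    """Ability at which the expected raw score equals `raw` (bisection)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, difficulties) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical difficulties on a common theta scale.
FORM_2006 = [-0.5, 0.0, 0.5, 1.0, 1.5]    # the harder, equated-to form
FORM_2007 = [-1.5, -1.0, -0.5, 0.0, 0.5]  # the easier form

def equated_raw(raw_2007):
    """Expected 2006 raw score for a student with this 2007 raw score."""
    theta = theta_for_raw(raw_2007, FORM_2007)
    return tcc(theta, FORM_2006)

for raw in (1, 2, 3, 4):
    print(raw, "->", round(equated_raw(raw), 2))
```

Because the 2007 form is easier in this made-up example, every 2007 raw score maps to a lower, non-integer number of points on the 2006 form, which is one reason such scores are awkward to report.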
45Scaling
- Similarly, it does not really make sense to
report scores on the theta metric
- While psychometricians are quite fond of theta
scores, they have some unfortunate
characteristics (decimal and negative values)
that would make them alarming to most test users
- (Note: "they" in the previous sentence refers to
the theta scores)
46Scaling
- It does make sense to report scores on an
arbitrary scale that has no inherent meaning.
- The meaning of the scale is defined by the
assessment
- Scaled scores are typically a linear
transformation of ability estimates
- Example of a linear transformation
- (Ability × Slope) + Intercept
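A minimal sketch of such a transformation. The slope of 20 and intercept of 250 are arbitrary values chosen for illustration, not the constants of any real reporting scale.

```python
SLOPE = 20.0       # hypothetical
INTERCEPT = 250.0  # hypothetical

def scaled_score(theta, slope=SLOPE, intercept=INTERCEPT):
    """Linear transformation of an ability estimate, rounded for reporting."""
    return round(theta * slope + intercept)

for theta in (-1.35, 0.0, 1.0):
    print(theta, "->", scaled_score(theta))
```

With these constants the decimals and negative values of the theta scale disappear from the reported scores, while the ordering of students is preserved.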
47Scaling
- This appears to be pretty simple, but, like most
things, scaling is more complicated than it
appears at first
48Issues in Scaling
- Endpoints
- If one test is more difficult than the other, the
highest possible raw score on the harder test
ought to result in a higher scaled score than the
top score on the easier test.
- However, top and bottom scores may be truncated so
that a student who gets one or more items wrong
may still receive the top scaled score, or a
student who gets some items right may still
receive the lowest scaled score.
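Truncation at the endpoints can be sketched as clamping the transformed score to the limits of the reporting scale. The limits of 200 and 300 and the slope and intercept are hypothetical.

```python
def truncated_scaled_score(theta, slope=20.0, intercept=250.0,
                           lowest=200, highest=300):
    """Linear scaling followed by truncation to the reporting range."""
    untruncated = round(theta * slope + intercept)
    return max(lowest, min(highest, untruncated))

# Two different high abilities end up with the same top score,
# and a very low ability is raised to the bottom of the scale.
for theta in (-3.0, 0.0, 2.9, 3.5):
    print(theta, "->", truncated_scaled_score(theta))
```

This is how a student with one or more items wrong can still receive the top scaled score: the untruncated value already lies beyond the top of the scale.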
49Issues in Scaling
- Number of points
- Should be sufficient to differentiate examinees.
- Should not be more than the number of raw score
points.
- Cut points
- If more than two cut points are used and each
cut point is a pre-determined scaled score, the
scale will be non-linear. In this case taking
averages is questionable.
50Issues in Scaling
- Scale compression and/or expansion
- Occurs if cut points are very close together on
the theta scale and far apart on the scaled score
scale, or vice versa
- You can have compression in one part of the scale
and expansion in another part
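Both issues can be seen in a small sketch with three cut points, each pinned to a pre-determined scaled score. The scale is then piecewise linear, with a different slope on each segment. The cut points and scaled scores below are invented for illustration.

```python
# Hypothetical cut points: (theta, pre-determined scaled score).
CUTS = [(-1.0, 220), (-0.8, 240), (1.0, 260)]

def segment_slopes(cuts):
    """Scaled-score points per unit of theta on each segment between cuts."""
    return [(s1 - s0) / (t1 - t0)
            for (t0, s0), (t1, s1) in zip(cuts, cuts[1:])]

def piecewise_scaled(theta, cuts):
    """Linear interpolation between adjacent cut points.

    Values beyond the outer cuts extend the nearest segment.
    """
    for (t0, s0), (t1, s1) in zip(cuts, cuts[1:]):
        if theta <= t1:
            break  # theta falls in (or below) this segment
    return s0 + (theta - t0) * (s1 - s0) / (t1 - t0)

print([round(s, 1) for s in segment_slopes(CUTS)])
print(round(piecewise_scaled(-0.9, CUTS), 1))
print(round(piecewise_scaled(0.1, CUTS), 1))
```

The first two cuts are close together in theta but 20 scaled points apart, so that part of the scale is expanded; the next segment covers a much wider theta range with the same 20 points, so it is compressed. With unequal slopes like these, a one-point difference in scaled score does not mean the same thing everywhere, which is why averaging is questionable.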
51Determining Scaled Scores
52Determining Scaled Scores
53Determining Scaled Scores
[Figure: Raw Score vs. Scaled Score]