Equating And Scaling presentation

Transcript and Presenter's Notes

Title: Equating And Scaling

1
Equating And Scaling
2
The Goal of this Session Is

To give you a general idea of
What equating is
Why we need to do it
How it works
As part of this, we will discuss
Different equating models
A quick review of IRT
Scaling

3
Why Equate?

Why would the average raw scores on a test
administered in 2006 and 2007 differ?

4
Why Equate?

Why would the average raw scores on a test in
2006 and 2007 differ?
The 2006 form is harder than the 2007 form
This year's students are better prepared than
last year's were
Both

5
Why Equate?

Equating allows us to determine the extent to
which
one test is harder than the other (which is
usually the case)
one group is more able (i.e., has more of the
construct of interest) than the other (also
usually the case)
This enables us to ensure that well-prepared
examinees get higher scores than the less
well-prepared, regardless of the test they took

6
Why Equate?

If we gave both groups the same test, we could
directly compare their performance, but this is
not practical
Security
Release of items
This is where equating items come in: a subset
of items administered in both tests

7
Why Equate?

The performance on the equating items is used to
compare student ability across the two groups
We can use this information to determine to what
extent the difference in performance is due to
one group being better prepared than the other
Once we know that, we can then determine how much
harder or easier one test is than the other and
adjust so that scores based on the two tests can
be compared directly
That is, the tests are equated

8
Equating Items

In order to ensure that this is done accurately,
the equating items should have the following
characteristics
Good psychometric properties
Be parallel to the overall test
Content
MC vs. CR items
Passage length
Graphics
etc.

9
Equating Items

The difficulty of the non-equating items in each
test can vary -- within reason. However
We don't want the feel of the assessment to
change differentially for different subgroups of
students. As one example
If one test is harder than another, lower
performing students may be more frustrated on the
harder test
If one test is easier than another, higher
performing students may be more bored and
unmotivated on the easier test
Either of these could result in differences in
performance on the two tests that are unrelated
to the construct of interest

10
Equating Items

In addition, equating items should not be changed
in any way from one administration to the next
Again, any change in the item (wording, location
within the test, response options, etc.) can
cause a change in student performance that is
unrelated to the construct of interest

MUST!
11
Equating Models

Classical test theory models (CTT)
Item response theory models (IRT)
Internal anchor (counts toward student scores)
External anchor (doesn't count toward student
scores)
Intact, separate anchor test
Embedded anchor test

12
Equating Models

CTT models are concerned with estimating the
relationships of the anchor test with each total
test, and the anchor test in group 1 with the
anchor test in group 2.
IRT models focus on estimating the relationship
of each item with the underlying trait (θ) that
is being measured.

13
Equating Models
[Figure: classical test theory equating diagram, with elements labeled Test 1, Test 2, Anchor 1, Anchor 2, Difficulty, and Ability]
14
Example

Group 1 (2006)
Total test score 30.6
Score on equating items 14.2
Group 2 (2007)
Total test score 38.6
Score on equating items 15.5
Based on their performance on the equating items,
we know that Group 2 is a bit higher performing,
but their total score on the test is quite a bit
higher, which suggests that the 2007 test is
easier.
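
The comparison in this example is simple arithmetic. As a rough sketch (the variable names are illustrative, and this only reproduces the two differences; it is not a full CTT equating):

```python
# Mean scores from the example (Group 1 = 2006, Group 2 = 2007).
group1 = {"total": 30.6, "anchor": 14.2}
group2 = {"total": 38.6, "anchor": 15.5}

# Difference on the equating (anchor) items reflects a difference in ability.
anchor_gain = group2["anchor"] - group1["anchor"]

# Difference on the total test mixes ability and test difficulty.
total_gain = group2["total"] - group1["total"]

print(f"Gain on equating items: {anchor_gain:.1f} points")
print(f"Gain on total test:     {total_gain:.1f} points")
```

Group 2 gains 1.3 points on the equating items but 8.0 points on the total test, a much larger gap than the common items alone would suggest.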

15
Equating Models

CTT models are well known, commonly used, and are
relatively easy computationally
IRT models have a shorter history and are
computationally difficult, but they have certain
advantages that make their use desirable
At MP, we use IRT equating models almost
exclusively

16
Basics of Item Response Theory

Why Use IRT?
Review of IRT
The Item Characteristic Curve (ICC)
The Test Characteristic Curve (TCC)

17
Why Use IRT?

Advantages over CTT
IRT allows us to calculate an estimate of student
ability (θ), not just observe how a particular
student performs on a particular test
IRT uses the same theta scale to describe
students and items; this has certain advantages
It provides more sophisticated information that
(depending on the specific model used) takes into
consideration various characteristics of the item

18
The ICC

Describes the interaction between examinees and
test items
In the simplest case, the probability of a correct
response is a function of examinee ability and
item difficulty
As more sophisticated models are used, other item
characteristics are taken into consideration as
well
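
One common way to write an ICC is the three-parameter logistic form. The sketch below assumes that form; the function name `icc` and the parameter names `a` (discrimination), `b` (difficulty), and `c` (guessing) are conventional labels chosen here, not something the slides specify.

```python
import math

def icc(theta, b, a=1.0, c=0.0):
    """Probability of a correct response for an examinee with ability theta.

    b: item difficulty, a: item discrimination, c: guessing (lower asymptote).
    With a=1 and c=0 only difficulty matters, which is the simplest case.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# An examinee whose ability equals the item difficulty, with no guessing,
# has a 50% chance of answering correctly.
print(icc(theta=0.0, b=0.0))

# Adding a guessing parameter raises the floor of the curve.
print(icc(theta=0.0, b=0.0, c=0.2))
```

Because ability and difficulty enter only through `theta - b`, students and items sit on the same theta scale.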

19
The Basics
20
The Basics
21
Item Difficulty
22
Item Discrimination
23
Item Guessing
24
A Test is Made up of Many ICCs
25
A Test is Made up of Many ICCs
26
A Test is Made up of Many ICCs
27
A Test is Made up of Many ICCs
28
A Test is Made up of Many ICCs
29
A Test is Made up of Many ICCs
30
For a given examinee with ability (θ) = 1.0
31
For a given examinee with ability (θ) = 1.0

The expected score on the total test is equal to
the sum of the probabilities for each item on the
test: 0.82 + 0.48 + 0.98 + 0.99 + 0.82 + 0.35 = 4.41
(as printed; the six values shown add up to 4.44)

32
The TCC

Summation of ICCs
Describes the relationship between ability and
expected performance on the whole test
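
As a minimal sketch of this summation, assuming a three-parameter logistic ICC and hypothetical item parameters (the `ITEMS` values are made up for illustration and are not taken from the slides):

```python
import math

def icc(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical (discrimination, difficulty, guessing) triples for six items.
ITEMS = [
    (1.0, -0.5, 0.20),
    (0.8, 1.2, 0.20),
    (1.5, -1.0, 0.25),
    (1.2, -1.5, 0.20),
    (0.9, -0.4, 0.25),
    (1.1, 1.8, 0.20),
]

def tcc(theta, items=ITEMS):
    """Expected raw score at ability theta: the sum of the item ICCs."""
    return sum(icc(theta, a, b, c) for a, b, c in items)

for theta in (-2.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}  expected score = {tcc(theta):.2f}")
```

The expected score rises with ability and stays between the sum of the guessing parameters and the number of items.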

33
TCC is the sum of the ICCs
34
TCC is the sum of the ICCs
35
Is It Really That Simple?

Polytomous Items
Parameter Estimation
Item Parameters
Person Parameters
Various IRT Models
Examinee-Model Fit

36
So What Does This Do For Us?

Using the TCC, we can estimate the total test
score for a student at a given level of ability
In actuality, however, this isn't what we want to
do: we already know the students' total raw
scores; what we don't know is their ability.
Fortunately, once we have the ICCs and TCC, we
can go the other way: we can estimate ability
based on a student's observed total test score.
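
Going "the other way" amounts to finding the theta at which the TCC equals the observed raw score. The sketch below does this by bisection, which works because the TCC increases with theta. The 3PL form, the item parameters, and the function names are assumptions for illustration; operational programs may estimate ability differently.

```python
import math

def icc(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Expected raw score: the sum of the item ICCs."""
    return sum(icc(theta, a, b, c) for a, b, c in items)

def theta_from_raw(raw, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Find the theta whose expected score equals raw, by bisection."""
    if not tcc(lo, items) < raw < tcc(hi, items):
        raise ValueError("raw score is outside the range the TCC reaches")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < raw:
            lo = mid  # expected score too low: ability must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical (discrimination, difficulty, guessing) triples for four items.
ITEMS = [(1.0, -0.5, 0.2), (0.8, 1.2, 0.2), (1.5, -1.0, 0.25), (1.2, 0.3, 0.2)]

for raw in (2, 3):
    print(raw, round(theta_from_raw(raw, ITEMS), 2))
```

Raw scores at or below the chance level, or equal to the maximum, fall outside the range of the curve, so the sketch raises an error for them rather than returning an ability estimate.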

37
So What Does This Have to Do with Equating?

Back in 2006, we established the relationship
between the total test and student ability using
the theta scale
Using the equating items, we can put the 2007
test on the same scale

38
How Do We Do This?

Estimate item parameters (i.e., calibrate the
items) for 2006 test
Estimate item parameters for 2007 test, fixing
the parameters for the equating items to their
2006 values
This forces the ability estimates for 2007 to
be on the same scale as those for 2006
As a result, we will get the same ability
estimate for a student regardless of which test
they took

39
2006 and 2007 TCCs on the Same Scale
40
Typical Equating Process

Selecting Equating Items
IRT Calibrations/equating
Determining scores for reporting (scaling)

41
Selecting Equating Items

Initial Selection
Test questions from last year's test are included
in this year's test
The total points from equating items should be at
least 40% of the total points on the test
The distribution of the items across different
relevant categories is similar to that of the
whole test
Each item should be in about the same position
this year and last year

42
Selecting Equating Items

We also do some statistical checks to look for
items that are functioning very differently in
2007 than they did in 2006, relative to the rest
of the equating items
If we find those, we will exclude them from use
as equating items
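
The slides do not say which statistical checks are used. Purely as an illustration of the idea, the sketch below compares hypothetical difficulty estimates for the equating items from separate 2006 and 2007 calibrations and flags any item whose shift differs from the typical shift by more than a threshold. The item names, values, threshold, and function name are all invented.

```python
from statistics import median

# Hypothetical difficulty estimates for the same equating items,
# calibrated separately in each year (values invented for illustration).
b_2006 = {"item01": -1.00, "item02": -0.50, "item03": 0.00,
          "item04": 0.40, "item05": 0.90, "item06": 1.30}
b_2007 = {"item01": -0.90, "item02": -0.45, "item03": 0.05,
          "item04": 1.30, "item05": 0.95, "item06": 1.40}

def flag_drifting_items(old, new, threshold=0.5):
    """Flag items whose difficulty shift is far from the median shift."""
    shifts = {item: new[item] - old[item] for item in old}
    typical = median(shifts.values())
    return [item for item, shift in shifts.items()
            if abs(shift - typical) > threshold]

print(flag_drifting_items(b_2006, b_2007))  # ['item04']
```

Comparing each item to the median shift, rather than to zero, keeps the check relative to the rest of the equating items.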

43
Item Calibrations

We talked about this earlier, remember?
Estimate parameters for 2006 items
Estimate parameters for 2007 items, fixing the
values for the equating items
Voilà: the same ability estimate for students,
regardless of which test they took!

44
Scaling

It does not really make sense to report scores on
the raw score metric
Equated raw scores do not equal the number of
points the student achieved on that test, but
rather the number of points that the student
would be expected to achieve on the equated-to
test

45
Scaling

Similarly, it does not really make sense to
report scores on the theta metric
While psychometricians are quite fond of theta
scores, they have some unfortunate
characteristics (decimal and negative values)
that would make them alarming to most test users
(Note: "they" in the previous sentence refers to
the theta scores)

46
Scaling

It does make sense to report scores on an
arbitrary scale that has no inherent meaning.
The meaning of the scale is defined by the
assessment
Scaled scores are typically a linear
transformation of ability estimates
Example of a linear transformation
Scaled score = (Ability x Slope) + Intercept
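
A minimal sketch of such a transformation follows. The slope of 10 and intercept of 50 are hypothetical constants chosen for illustration; an actual assessment defines its own.

```python
def to_scaled_score(theta, slope=10.0, intercept=50.0):
    """Linear transformation of an ability estimate onto a reporting scale."""
    return theta * slope + intercept

for theta in (-2.0, 0.0, 1.5):
    print(theta, to_scaled_score(theta))
```

With these made-up constants, thetas of -2.0, 0.0, and 1.5 map to 30, 50, and 65, so negative values no longer appear in reports.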

47
Scaling

This appears to be pretty simple, but, like most
things, scaling is more complicated than it
appears at first

48
Issues in Scaling

Endpoints
If one test is more difficult than the other, the
highest possible raw score on the harder test
ought to result in a higher scaled score than the
top score on the easier test.
However, top and bottom scores may be truncated so
that a student who gets one or more items wrong
may still receive the top scaled score, or a
student who gets some items right may still
receive the lowest scaled score.
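
Truncation can be sketched as clamping the scaled score to the reporting range. The slope, intercept, and the range of 20 to 80 below are hypothetical values chosen for illustration.

```python
def report_score(theta, slope=10.0, intercept=50.0, lowest=20, highest=80):
    """Scale an ability estimate, round it, and truncate to the reporting range."""
    scaled = round(theta * slope + intercept)
    return max(lowest, min(highest, scaled))

# Two different high abilities both receive the top scaled score,
# and a very low ability is lifted to the lowest reportable score.
for theta in (3.4, 3.0, 0.0, -3.5):
    print(theta, report_score(theta))
```

Here abilities of 3.4 and 3.0 both report as 80, and an ability of -3.5 reports as 20.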

49
Issues in Scaling

Number of points
Should be sufficient to differentiate examinees.
Should not be more than the number of raw score
points.
Cut points
If more than two cut points are used and each
cut point is a pre-determined scaled score, the
scale will be non-linear. In this case, taking
averages is questionable.

50
Issues in Scaling

Scale compression and/or expansion
Occurs if cut points are very close together on
the theta scale and far apart on the scaled score
scale, or vice versa
You can have compression in one part of the scale
and expansion in another part
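
To see how fixed cut points produce a non-linear scale with compression and expansion, the sketch below maps theta to scaled scores piecewise linearly through three cut points. The theta cuts, scaled cuts, and function names are hypothetical, and piecewise-linear interpolation is only one possible way to build such a scale.

```python
# Hypothetical cut points: theta values and their pre-determined scaled scores.
THETA_CUTS = [-1.5, -0.2, 0.1]
SCALED_CUTS = [30.0, 40.0, 60.0]

def segment_slopes(theta_cuts=THETA_CUTS, scaled_cuts=SCALED_CUTS):
    """Scaled-score points per unit of theta between adjacent cut points."""
    return [
        (scaled_cuts[i + 1] - scaled_cuts[i]) / (theta_cuts[i + 1] - theta_cuts[i])
        for i in range(len(theta_cuts) - 1)
    ]

def piecewise_scale(theta, theta_cuts=THETA_CUTS, scaled_cuts=SCALED_CUTS):
    """Map theta to a scaled score, extending the end segments outward."""
    last = len(theta_cuts) - 2
    i = 0
    while i < last and theta > theta_cuts[i + 1]:
        i += 1
    slope = (scaled_cuts[i + 1] - scaled_cuts[i]) / (theta_cuts[i + 1] - theta_cuts[i])
    return scaled_cuts[i] + slope * (theta - theta_cuts[i])

print([round(s, 1) for s in segment_slopes()])
print(round(piecewise_scale(-0.85), 1), round(piecewise_scale(-0.05), 1))
```

The upper two cuts are close together in theta but far apart in scaled score, so that segment is much steeper than the lower one: the scale is expanded there and compressed below it.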

51
Determining Scaled Scores
52
Determining Scaled Scores
53
Determining Scaled Scores
[Figure: plot relating Raw Score to Scaled Score]
