Title: Scale Development: Theory and Applications

1. Scale Development: Theory and Applications
- Chapter 7
- Item Response Theory
Samuel O. Ortiz, Ph.D., St. John's University
2. Item Response Theory (IRT)
- An alternative to classical measurement theory (CMT), also called classical test theory (CTT)
- In CMT, the observed score is the respondent's true score plus error
- In CMT, error is not differentiated but rather collected in a single error term
- In IRT, error is differentiated more finely, particularly with respect to item characteristics
3. Item Response Theory (IRT)
- In CMT, the focus is on composites, or scales
- In IRT, the focus is on individual items and their characteristics
- IRT is used mainly for ability tests (e.g., the SAT) with dichotomous responses, but it can be applied to other domains
- In CMT, items are aggregated to gain reliability, which is achieved by redundancy
- In IRT, items are individually evaluated in the search for better items with a better relationship to the attribute
- More IRT items increase the ability to differentiate levels of the attribute but do not increase reliability
4. Item Response Theory (IRT)
- In CMT, items share a common cause (the same is true in IRT) and are thus similar to each other
- But IRT items are designed to tap different degrees or levels of the attribute
- IRT seeks to establish certain characteristics of items irrespective of who completes them, like a scale that measures weight or a ruler that measures inches
5. Item Response Theory (IRT)
- Different models
- The main difference among models is the number of parameters of concern
- The most common is the three-parameter model: item difficulty, capacity to discriminate, and susceptibility to false positives
- Rasch scaling is concerned only with item difficulty
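The three parameters map onto the standard three-parameter logistic (3PL) model, where the probability of passing is P(θ) = c + (1 − c) / (1 + exp(−a(θ − b))): b is difficulty, a is discrimination, and c is the false-positive (guessing) floor. A minimal sketch; the parameter values below are illustrative, not taken from the chapter:

```python
import math

def p_correct(theta, a=1.0, b=0.0, c=0.0):
    """Probability of passing an item under the 3PL model.

    theta: respondent's level of the attribute
    a:     discrimination (slope at the curve's inflection point)
    b:     difficulty (location on the attribute scale)
    c:     false-positive / guessing floor
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Setting c = 0 gives a two-parameter model; additionally fixing a
# (the same slope for every item) gives the one-parameter Rasch model.
rasch = lambda theta, b: p_correct(theta, a=1.0, b=b, c=0.0)

# At theta == b, a 3PL item is passed with probability halfway
# between the guessing floor c and 1.
print(p_correct(0.0, a=1.2, b=0.0, c=0.2))  # ~0.6, i.e. 0.2 + 0.8/2
```

Dropping parameters simplifies estimation at the cost of modeling fewer sources of error, which is the trade-off the slides describe.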
6. Item Response Theory (IRT)
- Item difficulty
- Refers to the level of the attribute being measured that is associated with the transition from failing to passing that item
- The idea is to construct items with different degrees of difficulty
- One should be able to calibrate the difficulty of items independently of who is responding
7. Item Response Theory (IRT)
- Item difficulty
- If this is accomplished, then passing the item represents an absolute level of the attribute, and it has a constant meaning with respect to the attribute; that is, you know what amount of the attribute is required
- The attribute is assessed via a common metric that is not subject to individual differences other than the variable of interest
8. Item Response Theory (IRT)
- Item discrimination
- Refers to the degree to which an item unambiguously classifies a response as a pass or a fail
- The less ambiguity about whether someone passed or failed, the higher the item discrimination
- Not as easy as it sounds, even on ability tests (e.g., writing samples on the WJ III)
9. Item Response Theory (IRT)
- Item discrimination
- An item that discriminates very well has a very narrow portion of the range of the phenomenon of interest in which the results are ambiguous
- A less discriminating item has a larger region of ambiguity (similar to the issue of reliability, but not identical)
10. Item Response Theory (IRT)
- False positives
- A false positive is a response indicating that some characteristic or degree of an attribute exists when in actuality it does not
- Difficult to produce on an ability test where the answer is either correct or not, but it can happen to some extent through guessing or luck on certain tests (blocks falling together in the right place on Block Design, or guessing answers to arithmetic questions)
- In cases where guessing or false positives are not an issue (e.g., measuring weight), a two-parameter model may be enough (difficulty and discrimination)
11. Item Response Theory (IRT)
- Summary
- The parameters represent sources of measurement error:
- The difficulty of the item is inappropriate (too hard or too easy)
- The area of ambiguity between a pass and a fail is too large
- The item indicates that the attribute is present when it really is not
- IRT quantifies these sources of error so that items can be selected that will perform well in a given context
12. Item Response Theory (IRT)
- Item characteristic curves (ICCs)
- When parameters are quantified, item characteristic curves provide a graphical summary of them
- The x-axis typically represents the strength of the characteristic or attribute
- The y-axis represents the probability of passing the item
- The parameters are easiest to see when comparing curves
13. Item Response Theory (IRT)
- Difficulty: the point at which 50% pass differs between items; this is a factual difference (B is harder, i.e., more difficult)
[Figure: ICCs for items A and B; x-axis = strength of attribute, y-axis = likelihood of passing (0-100%). Item B's curve lies to the right of item A's, so B reaches the 50% pass point at a higher attribute level.]
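The difficulty contrast in the figure can be sketched numerically with two-parameter logistic ICCs; the a and b values here are illustrative, not taken from the slide:

```python
import math

def icc(theta, a, b):
    # Two-parameter logistic ICC: probability of passing at level theta
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Item A is easier (b = 0); item B is harder (b = 1.5)
for theta in (-1.0, 0.0, 1.5, 3.0):
    pa, pb = icc(theta, a=1.0, b=0.0), icc(theta, a=1.0, b=1.5)
    print(f"theta={theta:+.1f}  P(A)={pa:.2f}  P(B)={pb:.2f}")

# B's curve sits to the right of A's: at every attribute level,
# passing B is less likely than passing A, and B crosses the 50%
# point only at the higher level theta = 1.5.
```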
14. Item Response Theory (IRT)
- Discrimination: a steeper slope around the 50% pass point means a smaller increase in the attribute leads to passing
- Thus, because the region of ambiguity for A (blue) is smaller than the region for B (green), less ambiguity means A discriminates better
[Figure: ICCs for items A and B. Item A's curve is steeper, giving a narrow region of ambiguity around the 50% pass point; item B's shallower curve gives a wide region.]
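Under the same 2PL sketch, the slope directly controls the width of the ambiguous region. For an item with discrimination a, the interval of theta where the pass probability lies between 25% and 75% has width 2·ln(3)/a (a standard 2PL result, not stated in the slides), so doubling a halves the region:

```python
import math

def ambiguity_width(a, lo=0.25, hi=0.75):
    """Width of the theta interval where a 2PL item yields a pass
    probability between lo and hi; independent of difficulty b."""
    logit = lambda p: math.log(p / (1.0 - p))
    return (logit(hi) - logit(lo)) / a

# A discriminates better than B, so its ambiguous region is narrower.
print(ambiguity_width(a=2.0))  # narrow region (steep item A)
print(ambiguity_width(a=0.5))  # wide region (shallow item B)
```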
15. Item Response Theory (IRT)
- False positives: the point at which a curve intersects the y-axis indicates the lowest percentage of passes at a zero level of the attribute
- Thus, the false positive rate is the probability of passing without having any of the attribute
- Lower is better; thus, A has fewer false positives than B and is the better item
[Figure: ICCs for items A and B with their y-intercepts marked: intercept B = 17%, intercept A = 5%.]
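In the 3PL model, this floor is the c parameter: as the attribute level drops, the pass probability approaches c rather than zero. A sketch using the intercepts shown in the figure (5% for A, 17% for B); everything else is illustrative:

```python
import math

def p_correct(theta, a, b, c):
    # 3PL: c is the floor the curve approaches at very low theta
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Item A's intercept ~5%, item B's ~17%
for name, c in (("A", 0.05), ("B", 0.17)):
    p = p_correct(-8.0, a=1.0, b=0.0, c=c)
    print(f"Item {name}: P(pass | almost none of the attribute) ~ {p:.2f}")
# Item A's lower floor means fewer false positives, so A is the
# better item on this parameter.
```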
16. Item Response Theory (IRT)
- Additional issues in IRT
- A utility of IRT is that items can be made for special groups or populations, matching the parameters of performance to the expected levels of the attribute of interest
- High-stakes decision making should use items with better discrimination (low ambiguity) and better (lower) rates of false positive responses
17. Item Response Theory (IRT)
- Differences between IRT and CMT
- In CMT, we know whether an item performs well or poorly, but we don't know the reasons why
- In IRT, we can pinpoint the nature of an item's deficiencies, or its strengths and weaknesses compared to other items
- Like CMT, IRT does not determine the characteristics of items; it only quantifies them
- The work of developing good items still rests on the researcher; IRT doesn't write good items or make bad ones better
18. Item Response Theory (IRT)
- Differences between IRT and CMT
- CMT trades a more general notion of error for simplicity in development; IRT gains precision about the nature of error but adds complexity. It is not easy to do, generally speaking
- Item characteristics must not be associated with attribute-independent sample characteristics such as gender, age, or other variables that SHOULD be uncorrelated with the one being measured. The same assumption holds in CMT: the effect of a single latent variable
19. Item Response Theory (IRT)
- Differences between IRT and CMT
- It is difficult to start out with IRT, since the true level of the attribute, theta (θ), is unknown. One needs to use many people, going back and forth to test items and reveal the nature of the attribute
- Developing anchoring items, those that perform equivalently across groups, can provide calibration points
20. Item Response Theory (IRT)
- When to use IRT
- Hierarchical items: in CMT, items are roughly parallel. In reality, this may not be the case, so in cases where hierarchical phenomena are of interest, an IRT model may be best:
- a. I can ambulate independently
- b. I can ambulate only with an assistive device
- c. I cannot walk at all
21. Item Response Theory (IRT)
- When to use IRT
- The items above are not parallel: an answer of "yes" to (c) means (a) and (b) cannot also be answered "yes". This is not the same as hierarchical responses on, e.g., Likert scales, where response options should lead to similar ratings or the same level of the attribute. IRT is more similar to Guttman or Thurstone scaling
- Another advantage of IRT with hierarchical items is the possibility of developing item banks tailored to specific ranges of attribute, ability, or development (e.g., age or grade level)
22. Item Response Theory (IRT)
- When to use IRT
- By focusing on an appropriate attribute level, items can be selected within the individual's range that discriminate best. This eliminates the need to give all out-of-range items (reflected as basals and ceilings on many tests)
- Psychological variables are not typically assessed via IRT, except for cognitive abilities/intelligence. But there may be some that are well suited to IRT techniques, e.g., self-efficacy (or other notions that are considered stable and intrinsic properties of human beings). But even such variables can be measured quite well by CMT
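Selecting the items that discriminate best within an individual's range is usually done with Fisher item information, which for a 2PL item is I(θ) = a²·P(θ)(1 − P(θ)) and peaks at θ = b. This is a standard IRT tool rather than anything stated in the slides, and the tiny item bank below is invented for illustration:

```python
import math

def icc(theta, a, b):
    # 2PL probability of passing at attribute level theta
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    # Fisher information of a 2PL item; peaks where theta == b
    p = icc(theta, a, b)
    return a * a * p * (1.0 - p)

# A small hypothetical item bank: (name, discrimination a, difficulty b)
bank = [("easy", 1.2, -2.0), ("medium", 1.0, 0.0), ("hard", 1.5, 2.5)]

def best_item(theta_hat):
    """Pick the bank item most informative at the examinee's
    estimated level; out-of-range items lose automatically
    because their information is near zero there."""
    return max(bank, key=lambda it: information(theta_hat, it[1], it[2]))

print(best_item(-1.8)[0])  # the easy item, for a low-ability examinee
print(best_item(2.4)[0])   # the hard item, for a high-ability examinee
```

This is the logic that lets adaptive tests skip the out-of-range items that basals and ceilings approximate on paper-based tests.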
23. Item Response Theory (IRT)
- Differential item functioning (DIF)
- Useful when research questions indicate that it is necessary to distinguish differences in group membership from differences in item characteristics
- Mostly it boils down to figuring out why an item performs differently across groups that are actually equivalent in the attribute being assessed
- We want items to be stable and perform the same across groups, but empirical verification may be necessary when groups differ on some characteristic such as age or grade (DeVellis mentions culture also but doesn't explain much; English language proficiency is likely another relevant variable here)
24. Item Response Theory (IRT)
- Differential item functioning (DIF)
- Note that identified DIF can be interpreted in two ways: the item is flawed (due to the influence of another co-variable), or the two groups actually do differ on the attribute being measured
- Hierarchical items and DIF analysis are often combined in assessment: educational assessment, health outcomes, or other cases where it is important to differentiate true group differences for variables where endpoints or hierarchies matter
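A common practical DIF check, the Mantel-Haenszel procedure, compares the two groups' odds of passing one item within strata of matched overall ability (e.g., total score); a common odds ratio near 1 suggests no DIF. The counts below are invented for illustration, not data from the chapter:

```python
# Each stratum of matched total score holds a 2x2 table for one item:
# (reference passes, reference fails, focal passes, focal fails)
strata = [
    (30, 20, 28, 22),   # low total score
    (40, 10, 38, 12),   # medium total score
    (45,  5, 44,  6),   # high total score
]

def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across score strata."""
    num = sum(rp * ff / (rp + rf + fp + ff) for rp, rf, fp, ff in strata)
    den = sum(fp * rf / (rp + rf + fp + ff) for rp, rf, fp, ff in strata)
    return num / den

# An odds ratio close to 1 (here ~1.22) means the groups, once matched
# on overall ability, pass the item at similar rates: little evidence
# of DIF in this toy data.
print(round(mh_odds_ratio(strata), 2))
```

Note that this check only flags the item; as the slide says, deciding whether the difference reflects a flawed item or a true group difference remains a judgment call.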
25. Item Response Theory (IRT)
- Conclusion
- Both IRT and CMT continue to be useful; one is not necessarily better than the other
- Just because an item has performed well doesn't mean it will continue to do so in every context
- According to DeVellis, "Having credible independent knowledge of the attributes being measured is a requirement of IRT that is difficult to satisfy strictly but that can be very adequately approximated with repeated testing of large and heterogeneous samples."