Title: Item Response Modeling in Behavioral Research
1Item Response Modeling in Behavioral Research
- Diane Allen, Mark Wilson, and Jun Corser Li
- University of California, Berkeley
- March, 2005
2Outline
- Introduction
- The Data
- Results for the Self-Efficacy Scale
- Comparison with Classical Test Theory
- Further Work with IRM
- Conclusion
3Item Response Models Connections
- theory of test/instrument scores (CTT)
- content referencing (e.g., Guttman)
- IRM
4CTT vs IRM Equations
5CTT vs IRM Issues
- CTT
- Confounding of instrument and respondents
- Assumption of linearity of scores
- IRM
- Model needs to fit/allows one to select models
- Comment: IRM addresses CTT issues
6The Rasch Model Idea
(Figure: respondent location θ and item locations δi)
7The Rasch Model Graph
8The Rasch Model Dichotomous Function
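The equation on this slide did not survive in the transcript. As a minimal sketch, assuming the standard dichotomous Rasch form, the probability of a positive response is a logistic function of the difference between the respondent location (theta) and the item location (delta). The function name and the example values are illustrative, not taken from the slides.

```python
import math

def rasch_probability(theta: float, delta: float) -> float:
    """P(X = 1) under the dichotomous Rasch model (standard form, assumed)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# A respondent located exactly at the item location has an even chance.
print(rasch_probability(0.0, 0.0))
# A respondent one logit above the item is more likely to respond positively.
print(rasch_probability(1.0, 0.0))
```

Only the difference theta - delta matters, which is what places persons and items on the same scale.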
9The Rasch Model Polytomous Function
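The polytomous equation is also missing from the transcript. The sketch below assumes the partial credit form of the Rasch model, in which each item has its own step parameters; the function name and the parameter values are hypothetical and chosen only for illustration.

```python
import math

def pcm_probabilities(theta, deltas):
    """Category probabilities for one item under the partial credit model.

    deltas holds the step parameters delta_1..delta_m; the result has
    m + 1 entries, one per response category 0..m.
    """
    # Running sums of (theta - delta_k); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for d in deltas:
        cumulative.append(cumulative[-1] + (theta - d))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical item with three steps (four categories).
probs = pcm_probabilities(0.5, [-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

With a single step parameter the function reduces to the dichotomous Rasch probability.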
10The Data
- Courtesy of Behavior Change Consortium
- Multiple data sources
- (Ory, Jordan, & Bazzarre, 2002)
- Stanford, OHSU, UT, U of Rochester, IIT
- Multiple behaviors/interventions
- exercise, diet, smoking
11Scales for Mediators of Changed Behavior
- self-efficacy scale
- self-determination scale
- decisional balance scale
12Self-efficacy (SE) Scale for exercise
- "a specific belief in one's ability to perform a particular behavior" (Garcia & King, 1991, p. 396)
- 14 items that express the certainty the respondent has that he or she could exercise under various adverse conditions (see next slide)
- Respondents rate each item in 10-point increments from 0, indicating "I cannot do it at all," to 100, indicating "certain that I can do it"
13Self-Efficacy Items
14Self-Determination Scale
- Assesses the motivating factors for pursuing a particular behavior
- A person who is self-determined has autonomous reasons for behaving
- 15 items
- Respondents rate how true a statement is, from 1 ("not at all") to 7 ("very")
15Self-Determination Items, Examples
16Decisional Balance (DB) Scale
- Examines how people think about exercise
- Ten items that acknowledge positive aspects of exercise (pros)
- Six items that focus on the negative aspects (cons)
- Respondents rate the importance of each statement from 1 ("not at all") to 5 ("extremely")
- Score is calculated by subtracting the cons total from the pros total
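The scoring rule for the DB scale (pros total minus cons total) can be sketched as follows. The function name, the validation checks, and the example ratings are assumptions for illustration; only the item counts, the 1 to 5 rating range, and the subtraction rule come from the slide.

```python
def decisional_balance(pros, cons):
    """Return the DB score: sum of the pro ratings minus sum of the con ratings."""
    if len(pros) != 10 or len(cons) != 6:
        raise ValueError("expected 10 pro ratings and 6 con ratings")
    if any(not 1 <= r <= 5 for r in list(pros) + list(cons)):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(pros) - sum(cons)

# Hypothetical respondent: rates every pro as 4 and every con as 2.
print(decisional_balance([4] * 10, [2] * 6))  # 40 - 12 = 28
```

Because there are more pro items than con items, the possible scores run from -20 to 44 rather than being centered on zero.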
17Decisional Balance Items, Examples
- I would feel more comfortable with my body if I exercised regularly
- Regular exercise would help me have a more positive outlook on life
- I think I would be too tired to do my daily work after exercising
- Regular exercise would help me relieve tension
- I would find it difficult to find an exercise activity that I enjoy that is not affected by bad weather
18SE Scale results
- 11 categories--10 thresholds
- Wright map
20Standard Error of Measurement
21Standard Errors of Measurement
22Model fit
23Framework for Comparison
- Standards for Educational and
- Psychological Tests
- (AERA/APA/NCME, 1999)
24Choosing a Model
- CTT
- same model always
- IRM
- Different models fit persons and items better
- may be informative
- Alternative models allow exploration of
measurement implications
25Choosing a Model: Partial Credit Model vs. Rating Scale Model
- RSM constrains all thresholds to the same relative distances apart for every item.
- Likelihood ratio test for SE Scale
- χ² = 336.23 (df = 117), p < .0001
- Effect size (real difference)
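As a rough check on the likelihood ratio test above, the sketch below shows one parameter count that reproduces df = 117 and approximates the p-value. The parameter counting (a shared threshold set with one identification constraint for the RSM) and the use of the Wilson-Hilferty normal approximation are assumptions made here, not details given in the slides.

```python
import math

N_ITEMS = 14        # items on the SE scale (from the slides)
N_THRESHOLDS = 10   # 11 categories give 10 thresholds per item

# Partial credit model: every item has its own set of thresholds.
pcm_params = N_ITEMS * N_THRESHOLDS
# Rating scale model: one location per item plus one shared threshold
# set, minus one constraint to identify the thresholds (assumed).
rsm_params = N_ITEMS + N_THRESHOLDS - 1
df = pcm_params - rsm_params

def chi2_upper_tail(x, k):
    """Approximate P(chi-square with k df > x) via the Wilson-Hilferty transform."""
    z = ((x / k) ** (1 / 3) - (1 - 2 / (9 * k))) / math.sqrt(2 / (9 * k))
    return 0.5 * math.erfc(z / math.sqrt(2))

p_value = chi2_upper_tail(336.23, df)
print(df, p_value < 0.0001)  # prints: 117 True
```

With a sample this size a significant test is expected, which is why the slide pairs it with a question about effect size.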
27Reliability: Reliability Coefficients
- CTT
- Cronbach's α = .91
- IRM
- MML reliability = .92
- Comment
- usually similar except under missing data contexts
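For reference, Cronbach's α can be computed from a complete respondent-by-item score matrix as sketched below. The data are hypothetical (the SE scale responses are not in the transcript), and the function assumes no missing values, which is the situation in which the slide says the two coefficients tend to agree.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for complete data; rows are respondents, columns are items."""
    k = len(rows[0])
    # Variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Variance of the total (summed) score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings from five respondents on three items.
data = [
    [10, 20, 20],
    [40, 50, 30],
    [60, 60, 70],
    [80, 70, 90],
    [90, 100, 100],
]
print(round(cronbach_alpha(data), 3))
```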
28Reliability: Standard Errors of Measurement
- CTT: Constant value, 7.66
- IRM
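The contrast on this slide, a single CTT value versus an IRM standard error that varies by respondent location, can be illustrated with the sketch below. It assumes the usual CTT formula (SD times the square root of one minus reliability) and, for simplicity, dichotomous Rasch items rather than the polytomous SE scale items; the item locations and the SD are hypothetical.

```python
import math

def ctt_sem(sd_total, reliability):
    """Classical SEM: one value for every respondent."""
    return sd_total * math.sqrt(1.0 - reliability)

def rasch_se(theta, deltas):
    """Location-specific standard error for dichotomous Rasch items."""
    info = 0.0
    for d in deltas:
        p = 1.0 / (1.0 + math.exp(-(theta - d)))
        info += p * (1.0 - p)  # item information at theta
    return 1.0 / math.sqrt(info)

deltas = [-1.0, 0.0, 1.0]  # hypothetical item locations
print(ctt_sem(10.0, 0.91))
for theta in (0.0, 3.0):
    print(theta, round(rasch_se(theta, deltas), 2))
```

The standard error grows as the respondent moves away from the items, which is the pattern behind the recommendation to interpret results at the extremes with caution.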
29Validity Based on Instrument Content
- CTT
- Contributes little
- IRM
- Can contribute a lot (cf. work of Wright et al.)
- Comment
- SE Scale not a good example of content validity
30High Self-Efficacy
Low Self-Efficacy
31Validity: Based on Response Process
- Respondents react to the instrument as projected.
- Sources: think-alouds, exit interviews
- No differences in CTT and IRM usage
- Potential uses of IRM may emerge
- Comment: No response process evidence in the SE Scale data
32Validity: Based on Internal Structure 1: Structure of Construct
- CTT
- no usage
- IRM
- Well-established methodology for relating theoretical construct to parameters in Wright maps.
- Comment
- SE Scale not a good example of construct validity
33Validity: Based on Internal Structure 2: Item Analysis
- CTT
- item discrimination index
- for categories, point biserial correlations
- IRM
- means of respondents who chose each category
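The two item-analysis statistics named above can be sketched as follows: a point-biserial correlation between choosing a category and the total score (CTT), and the mean estimated location of the respondents who chose each category (IRM). All data and function names here are hypothetical.

```python
from statistics import mean, pstdev

def point_biserial(indicator, scores):
    """Pearson correlation between a 0/1 indicator and a continuous score."""
    mx, my = mean(indicator), mean(scores)
    cov = mean([(x - mx) * (y - my) for x, y in zip(indicator, scores)])
    return cov / (pstdev(indicator) * pstdev(scores))

def mean_location_by_category(responses, locations):
    """Mean respondent location for each response category of one item."""
    groups = {}
    for category, location in zip(responses, locations):
        groups.setdefault(category, []).append(location)
    return {cat: mean(vals) for cat, vals in sorted(groups.items())}

# Hypothetical responses to one item and estimated respondent locations.
responses = [0, 0, 1, 1, 2, 2]
locations = [-1.0, -0.5, 0.0, 0.2, 0.9, 1.1]
chose_top = [1 if r == 2 else 0 for r in responses]
print(round(point_biserial(chose_top, locations), 3))
print(mean_location_by_category(responses, locations))
```

For a well-behaved item the category means should increase with the category, as they do in this made-up example.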
34CTT: Point-biserial Correlations
35IRM: Mean of Respondent Locations for Each Category
36Validity: Based on Internal Structure 3: Differential Item Functioning
- DIF occurs when respondents in different groups, but with the same location, have different probabilities of positive response on an item
- CTT: no contribution (but could use, say, logistic regression on raw scores, ignoring measurement results)
37Validity: Internal Structure, DIF (Continued)
- IRM: Add an interaction parameter between item i and group g, γig, to the equation:
$$P(X_i = 1 \mid \theta, \delta_i, \gamma_{ig}) = \frac{e^{(\theta - \delta_i + \gamma_{ig})}}{1 + e^{(\theta - \delta_i + \gamma_{ig})}}$$
- Test for statistical significance and effect size of DIF for Gender in SE Scale
- Overall χ² = 13.021 (df = 14), p > .5
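A small sketch of the DIF model described on this slide: the Rasch logit is shifted by an item-by-group interaction term. The sketch assumes the interaction enters with a plus sign (the operator is not legible in the transcript), and the parameter values are hypothetical.

```python
import math

def dif_probability(theta, delta_i, gamma_ig):
    """P(X_i = 1) with an item-by-group interaction (plus sign assumed)."""
    logit = theta - delta_i + gamma_ig
    return 1.0 / (1.0 + math.exp(-logit))

# Two respondents at the same location, in groups with different gamma values.
theta, delta = 0.0, 0.0
print(dif_probability(theta, delta, 0.0))   # no DIF: same as the Rasch model
print(dif_probability(theta, delta, 0.5))   # hypothetical DIF of half a logit
```

When every interaction term is zero the model collapses to the ordinary Rasch model, which is the comparison behind the overall test reported above.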
38Validity: Based on Other Variables
- CTT: Many external validity studies available for SE Scale
- IRM: Would give very similar results
39Validity: Based on Consequences
- Use of the instrument led to the projected consequences.
- CTT and IRM: Similar usage
40Results for the SE Scale
- Aligned with some but not all Standards
- model
- aspects of reliability
- aspects of validity
- Positive features include
- categories cover respondents well, and behave well
- no threat from DIF (for gender)
- Recommend
- incorporating meaningful category labels
- interpreting results at extremes with caution
41Results of Comparing CTT and IRM
- Three types
- Similar usage and results
- reliability coefficients, external validity
- Not much usage currently, neutral results
- response process, consequential validity
- IRM used much more, extended results
- choosing a model, standard error of measurement,
content validity, construct validity
42Further Work with IRM
- Equating
- self-determination scale
- two diverse groups
- simulation study, comparing the effect of different numbers of overlapping items
- Multi-dimensional analyses
- SD and DB scales
- better fit, more information for researcher
- improved reliability with few items
43Conclusion
- IRM has strengths that can benefit behavioral researchers
- refinement of construct
- dimensionality
- different models
- aligning persons and items on same scale
- item and person specific standard error of
measurement