Title: Using IRT Methods to Construct and Score Personality Measures that are Fake-Resistant
1 Using IRT Methods to Construct and Score Personality Measures that are Fake-Resistant
- Stephen Stark
- Georgia Institute of Technology
- Oleksandr S. Chernyshenko
- University of Canterbury
2 Addressing Quality and Fairness of Personality Testing
- IRT methods can be used to:
- Understand nature of response process
- Test hypotheses about behavior by comparing fit of models bearing different assumptions
- Facilitate computer adaptive testing
- Create shorter, informative tests that provide accurate scoring
- Benefits may not be realized unless the IRT model used for parameter estimation adequately fits the data
3 Modeling Responses to Traditional Personality Items
- Stark, Chernyshenko, & Drasgow (2002) compared fit of ideal point and dominance models to 16PF data
- Found comparable fit for several scales
- Some scales, which were fit poorly by dominance models, were fit better by ideal point models
- Conclusion
- Ideal point process seems appropriate for personality items
- Fly in the ointment
- Correct specification of response process does not guarantee more accurate assessment, because traditional items are easily FAKED
4 How to Deal With Faking?
- Social Desirability (SD) scales often used to detect and correct for faking
- Adjustments made to content scale scores
- Little effect on validity
- Correcting for faking using SD scores is problematic, because
- SD scales may function differently across testing situations (Stark, Chernyshenko, Chan, Lee, & Drasgow, 2001)
- Need to develop fake-resistant items
5 Examples of Traditional Items that are Easily Faked
In each case, socially desirable response is
obvious.
- I get along well with others. (A)
- I try to be the best at everything I do. (C)
- I insult people. (A-)
- My peers call me absent minded. (C-)
Because these items consist of individual statements, they are commonly referred to as single stimulus items.
6 Fake-Resistant Format for Administering Personality Items
- Create items by pairing stimuli that are similar in desirability, but representing different dimensions
- Positive item
- I get along well with others. (A)
- I set very high standards for myself. (C)
- Negative item
- I insult people. (A-)
- I work just enough to pass my classes. (C-)
- Variation of this approach (Army AIM) has shown score inflation of only 0.1 SD (as compared to 1.5 SD for traditional items in Army ABLE)
7 Purpose of Research
- Develop IRT methods for constructing and scoring pairwise preference personality items involving statements on different dimensions
- Formulation of model and scoring algorithm
- Construction of fake-resistant tests
- Investigation of scoring accuracy
8 Model Notation
9 General Model for Scoring Pairwise Preference Responses
- Respondent evaluates each stimulus (personality statement) separately and makes independent decisions about endorsement.
- Stimuli may be on different dimensions.
- Single stimulus response probabilities P0 and P1 computed using a unidimensional ideal point model for traditional items (GGUM), where 1 = Agree and 0 = Disagree
Refer to new pairwise preference model as MUPP
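The bullets above can be sketched in code. This is a minimal illustration, not the authors' implementation. It assumes the standard dichotomous GGUM form for the single stimulus probabilities, and it assumes that a preference for statement s over statement t corresponds to endorsing s and not t, given that exactly one of the two is endorsed. The function names and the (alpha, delta, tau) values are hypothetical.

```python
import math

def ggum_p_agree(theta, alpha, delta, tau):
    """P(agree) under the dichotomous GGUM (an ideal point model).

    alpha = discrimination, delta = statement location, tau = threshold.
    """
    x = theta - delta
    agree = math.exp(alpha * (x - tau)) + math.exp(alpha * (2 * x - tau))
    disagree = 1.0 + math.exp(alpha * 3 * x)
    return agree / (agree + disagree)

def mupp_p_prefer_s(theta_s, theta_t, stmt_s, stmt_t):
    """P(s preferred over t), with independent endorsement decisions.

    theta_s and theta_t are the respondent's standings on the dimensions
    measured by statements s and t (they may be different dimensions).
    """
    ps1 = ggum_p_agree(theta_s, *stmt_s)
    pt1 = ggum_p_agree(theta_t, *stmt_t)
    s_wins = ps1 * (1.0 - pt1)      # agree with s, disagree with t
    t_wins = (1.0 - ps1) * pt1      # disagree with s, agree with t
    return s_wins / (s_wins + t_wins)

# Hypothetical (alpha, delta, tau) values, for illustration only
stmt_a = (1.2, 1.0, -1.0)   # statement on dimension A
stmt_c = (1.2, 1.0, -1.0)   # statement on dimension C

# Respondent sits at the location of the A statement but far from the C statement
p = mupp_p_prefer_s(1.0, -2.0, stmt_a, stmt_c)
print(round(p, 3))
```

Under these assumptions, a respondent whose trait level is close to one statement and far from the other is very likely to prefer the closer statement, which is the ideal point idea carried over to pairs.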
10 MUPP Scoring
- Latent trait scores (thetas) and standard errors (SEs) obtained using Bayes modal estimation.
- Latent trait score represents a respondent's standing on a personality dimension
- SE indicates the precision of a respondent's score
11 Test Construction Involves 3 Steps
- Estimating parameters for individual statements representing different dimensions
- Estimating social desirability ratings for individual statements
- Creating fake-resistant items by pairing statements having similar desirability, but representing different dimensions
12 Test Construction (Step 1): Get Parameters for Individual Statements
- Data
- 465 Army recruits were instructed to respond HONESTLY to approximately 500 personality statements measuring six dimensions, using a 1-to-6 response format
- Response data were dichotomized
- GGUM stimulus parameters were estimated for each dimension separately using GGUM2000
- Model-data fit was examined
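The dichotomization step could look like the sketch below. The cut point is an assumption: the slides do not say where the 1-to-6 scale was split, so the midpoint split (1 to 3 = disagree, 4 to 6 = agree) and the function name are hypothetical.

```python
def dichotomize(rating, cut=4):
    """Map a 1-6 rating to 1 (agree) if rating >= cut, else 0 (disagree).

    The default cut of 4 is an assumed midpoint split, not a value from the study.
    """
    if not 1 <= rating <= 6:
        raise ValueError("rating must be between 1 and 6")
    return 1 if rating >= cut else 0

raw = [1, 3, 4, 6, 2, 5]                     # hypothetical responses from one recruit
binary = [dichotomize(r) for r in raw]
print(binary)
```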
13 Calibration and Fit Results from Stark's MODFIT Computer Program
14 Test Construction (Steps 2 & 3): Creating Fake-Resistant Items
- Social desirability ratings obtained by
- Computing mean proportion endorsement scores obtained from 269 recruits instructed to FAKE GOOD
- Values ranged from 1 (Low) to 6 (High) desirability.
- Created fake-resistant items by pairing statements
- Similar desirability
- Different dimensions
- Different location parameters
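The three pairing criteria above can be illustrated with a small greedy matcher. This is only a sketch: the statement pool, the tolerance values, and the greedy strategy are all hypothetical, and the slides do not describe how pairs were actually selected.

```python
from itertools import combinations

# Hypothetical statement pool: (id, dimension, location, desirability on the 1-6 scale)
pool = [
    ("A1", "A", 1.5, 5.2),
    ("C1", "C", 0.4, 5.1),
    ("A2", "A", -1.8, 1.9),
    ("C2", "C", -0.6, 2.0),
    ("A3", "A", 0.2, 3.6),
    ("C3", "C", 1.9, 3.4),
]

def pair_statements(pool, max_desirability_gap=0.3, min_location_gap=0.5):
    """Greedily pair statements: similar desirability, different dimensions,
    different location parameters. Both thresholds are assumed values."""
    candidates = []
    for s, t in combinations(pool, 2):
        gap = abs(s[3] - t[3])
        if (s[1] != t[1]
                and gap <= max_desirability_gap
                and abs(s[2] - t[2]) >= min_location_gap):
            candidates.append((gap, s[0], t[0]))
    candidates.sort()               # closest desirability match first
    used, pairs = set(), []
    for gap, s_id, t_id in candidates:
        if s_id not in used and t_id not in used:
            pairs.append((s_id, t_id))
            used.update((s_id, t_id))
    return pairs

pairs = pair_statements(pool)
print(sorted(pairs))
```

Each statement is used at most once here; a real test could reuse statements or optimize the matching globally.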
15 Investigating MUPP Scoring Accuracy: 1-D Simulation Study Design
- Created 10-, 20-, and 40-item tests by pairing ADJ stimuli
- Could not create items that measured well at extremes
- Scoring accuracy examined by
- Generating responses for 50 simulees at theta values -3, -2.8, ..., 3
- Comparing estimated to known thetas using bias and error statistics, averaged over replications
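As a small worked example of the design above: a grid from -3 to 3 in steps of 0.2 has 31 generating values, so 50 simulees per value gives 1,550 response vectors per test. The sketch below builds that grid and computes bias and root mean squared error for one grid point. Treating RMSE as the "error statistic" is an assumption, and the estimates shown are made up for illustration.

```python
import math

def theta_grid(lo=-3.0, hi=3.0, step=0.2):
    """Generating theta values lo, lo + step, ..., hi."""
    n = int(round((hi - lo) / step)) + 1
    return [round(lo + i * step, 10) for i in range(n)]

def bias(estimates, true_theta):
    """Mean signed difference between estimated and generating theta."""
    return sum(e - true_theta for e in estimates) / len(estimates)

def rmse(estimates, true_theta):
    """Root mean squared error of the estimates around the generating theta."""
    return math.sqrt(sum((e - true_theta) ** 2 for e in estimates) / len(estimates))

grid = theta_grid()
print(len(grid), len(grid) * 50)   # grid points, simulees per test

# Hypothetical estimates for three simulees whose generating theta is 1.0
est = [0.8, 1.1, 0.9]
b = bias(est, 1.0)
r = rmse(est, 1.0)
print(round(b, 4), round(r, 4))
```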
16 1-D Simulation Results: Test Information for 10-, 20-, and 40-Item Tests
High information → high measurement precision
17 1-D Simulation Results: Bias in Estimated Thetas for 10-, 20-, and 40-Item Tests
Correlations between estimated and generating thetas > .9 for all tests.
18 Investigating MUPP Scoring Accuracy: 2-D Simulation Study Design
- Two factors manipulated
- Test length
- Percent of unidimensional pairings (to set common metric)
- Nine tests required
- Created in similar manner to 1-D case
- Parameter recovery examined by
- Generating response vectors for 50 simulees at each of 169 points on 2-D grid, i.e., {-3, -2.5, ..., 3} x {-3, -2.5, ..., 3}
- Comparing bias and error statistics across experimental conditions using graphs and MANOVA
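The 169 grid points follow from crossing 13 values per dimension (-3 to 3 in steps of 0.5), and 50 simulees per point gives 8,450 response vectors per test. The sketch below builds the grid and shows one possible way to average absolute bias across the two dimensions at a grid point. The averaging rule and the example estimates are assumptions, not taken from the study.

```python
from itertools import product

axis = [-3.0 + 0.5 * i for i in range(13)]   # -3, -2.5, ..., 3
grid = list(product(axis, axis))             # all (theta1, theta2) combinations
n_simulees = 50
total_vectors = len(grid) * n_simulees
print(len(grid), total_vectors)

def avg_abs_bias(estimates, true_point):
    """Mean absolute bias across the two dimensions at one grid point.

    estimates: list of (theta1_hat, theta2_hat) tuples, one per simulee.
    """
    n = len(estimates)
    biases = [sum(e[d] for e in estimates) / n - true_point[d] for d in range(2)]
    return sum(abs(b) for b in biases) / 2

# Hypothetical estimates for two simulees generated at (1.0, -0.5)
aab = avg_abs_bias([(1.2, -0.5), (1.0, -0.7)], (1.0, -0.5))
print(round(aab, 4))
```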
19 2-D Simulation Results: Test Information Functions
20 2-D Simulation Results: Avg. Absolute Bias Across Dimensions and Replications
21 2-D Simulation Results
- MANOVA
- Modest main effect for TESTLEN (eta-squared = .39)
- But, biases did not decrease much
- Estimation was accurate over wide range of grid points, even for short tests
- Weak main effect for UNIPCT (eta-squared = .08)
- Only a relatively small percentage of unidimensional pairings was needed.
- Correlations between estimated and generating thetas for all tests were large (.77 to .95).
22 Summary and Conclusions
- MUPP scoring procedure was accurate for 1-D and 2-D tests
- In practice, scoring accuracy depends on quality of estimated stimulus (statement) parameters.
- Tests should be constructed using
- Roughly 20 items per dimension involved
- 10 to 20% of the items should be unidimensional
- Test construction and scoring approach holds promise for reducing effects of faking.
23 Related Research in Progress
- Constructing and validating fake-resistant inventory involving
- Multidimensional paired comparison items
- Lower-order facets
- Computerized adaptive item selection and scoring, based on MUPP