Title: The Science and Art of Exam Development
The Science and Art of Exam Development
Paul E. Jones, PhD, Thomson Prometric
What is validity, and how do I know if my test has it?
Validity
- Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests. (APA Standards, 1999, p. 9)
A test may yield valid judgments about people
- If it measures the domain it was defined to measure.
- If the test items have good measurement properties.
- If the test scores and the pass/fail decisions are reliable.
- If alternate forms of the test are on the same scale.
- If you apply defensible judgment criteria.
- If you allow enough time for competent (but not necessarily speedy) candidates to take the test.
- If it is presented to the candidate in a standardized fashion, without environmental distractions.
- If the test taker is not cheating and the test has not deteriorated.
Is this a Valid Test?
1. 4 - 3 = _____     6. 3 - 2 = _____
2. 9 - 2 = _____     7. 8 - 7 = _____
3. 4 - 4 = _____     8. 9 - 5 = _____
4. 7 - 6 = _____     9. 6 - 2 = _____
5. 5 - 1 = _____    10. 8 - 3 = _____
The Validity / Technical Quality of the Testing System
(Testing-system diagram: Design, Item Bank)
The Validity Argument is Part of the Testing System
(Testing-system diagram: Design, Item Bank)
How should I start a new testing initiative?
A Testing System Begins with Design
(Testing-system diagram: Design, Item Bank)
Test Design Begins with Test Definition
- Test Title
- Credential Name
- Test Purpose ("This test will certify that the successful candidate has important knowledge and skills necessary to …")
- Intended Audience
- Candidate Preparation
- High-Level Knowledge and Skills Covered
- Products or Technologies Addressed
- Knowledge and Skills Assumed but Not Tested
- Knowledge and Skills Related to the Test but Not Tested
- Borderline Candidate Description
- Testing Methods
- Test Organization
- Test Stakeholders
- Other Information
Test Definition Begins with Program Design
Test Definition Leads to Practice Analysis
Practice Analysis Leads to Test Objectives
Test Objectives are Embedded in a Blueprint
Once I have a blueprint, how do I develop appropriate exam items?
The Testing System
(Testing-system diagram: Design, Item Bank)
Creating Items
(Item-creation diagram: an item combines content characteristics with a response mode and a scoring rule.
Content options, choose many: text, graphics, audio, video, simulations, applications.
Response modes, choose one: single M/C, multiple M/C, single PC, multiple PC, drag-and-drop, brief FR, essay FR, simulation/application.)
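As one concrete way to read the diagram above, the sketch below shows a hypothetical item-bank record carrying an item's objective linkage, response mode, content, and scoring rule. The field names and the dichotomous scoring are illustrative assumptions, not an actual banking-tool schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:
    """Minimal item-bank record (illustrative fields, not a real banking schema)."""
    item_id: str
    objective_id: str                 # item-objective linkage back to the blueprint
    response_mode: str                # e.g. "single_mc", "multiple_mc", "drag_drop", "essay_fr"
    stem: str                         # item text (could also reference graphics, audio, video)
    options: List[str] = field(default_factory=list)
    key: List[int] = field(default_factory=list)    # indices of the correct option(s)

    def score(self, selected: List[int]) -> int:
        """Dichotomous scoring: 1 only if the selected options exactly match the key."""
        return int(sorted(selected) == sorted(self.key))

# Example: a single multiple-choice item linked to one blueprint objective
item = Item("ITM-001", "OBJ-2.3", "single_mc", "4 - 3 = ?", ["0", "1", "2"], key=[1])
print(item.score([1]))   # 1 (correct)
print(item.score([2]))   # 0 (incorrect)
```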
Desirable Measurement Properties of Items
- Item-objective linkage
- Appropriate difficulty
- Discrimination
- Interpretability
Item-Objective Linkage
Good Item Development Practices
- SME writers in a social environment
- Industry-accepted item writing principles
- Item banking tool
- Mentoring
- Rapid editing
- Group technical reviews
How can I gather and use data to develop an item bank?
The Testing System
(Testing-system diagram: Design, Item Bank)
Classical Item Analysis: Difficulty and Discrimination
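Classical item analysis typically summarizes difficulty as the proportion of candidates answering correctly (the p-value) and discrimination as the correlation between the item score and the rest of the test (a corrected point-biserial). A minimal sketch, with an invented 0/1 response matrix:

```python
import numpy as np

# rows = candidates, columns = items; 1 = correct, 0 = incorrect (invented data)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

total = responses.sum(axis=1)

for j in range(responses.shape[1]):
    item = responses[:, j]
    p = item.mean()                          # difficulty: proportion correct
    rest = total - item                      # total score with this item removed
    r_pb = np.corrcoef(item, rest)[0, 1]     # corrected point-biserial discrimination
    print(f"item {j + 1}: p = {p:.2f}, r_pb = {r_pb:.2f}")
```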
Classical Option Analysis: Good Item
(Table: for each response option, the number choosing it (n), the overall proportion, a discrimination index, and the proportion choosing it within each total-score quintile, Q1 through Q5.)
Classical Option Analysis: Problem Item
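Option analysis goes one step further: for each response option it tabulates the proportion of candidates choosing that option within each total-score group (the Q1 through Q5 quintiles in these slides). On a good item the keyed option is chosen more often as total score rises while distractors fall away; a problem item shows a flat or reversed pattern. The sketch below builds such a table from invented data, with "B" assumed to be the keyed option.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
total_score = rng.normal(50, 10, n)                     # invented total scores

# invented option choices, loosely tied to ability: higher scorers pick "B" (the key) more often
p_key = 1 / (1 + np.exp(-(total_score - 50) / 5))
choices = np.where(rng.random(n) < p_key, "B",
                   rng.choice(["A", "C", "D"], n))

quintile = pd.qcut(total_score, 5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"])
table = pd.crosstab(choices, quintile, normalize="columns").round(2)
print(table)   # rows = options, columns = score quintiles, cells = proportion choosing the option
```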
IRT Item Analysis: Difficulty and Discrimination
Good IRT Model Fit
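Under a two-parameter logistic (2PL) IRT model, each item has a discrimination a and a difficulty b, and the probability of a correct response at ability theta is P(theta) = 1 / (1 + exp(-a(theta - b))). A minimal sketch of the item response function, plus a crude fit check against observed proportions in ability groups (all numbers invented):

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
a, b = 1.2, 0.5                       # invented item parameters
expected = p_2pl(theta, a, b)

# crude fit check: compare model-expected proportions with observed proportions
# in ability groups (observed values invented for illustration)
observed = np.array([0.04, 0.10, 0.22, 0.40, 0.63, 0.80, 0.92])
residuals = observed - expected
print(np.round(expected, 2))
print(np.round(residuals, 2))         # small residuals across theta suggest good model fit
```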
How can I assemble test forms from my item bank?
The Testing System
(Testing-system diagram: Design, Item Bank)
Reliability
- Reliability refers to the degree to which test scores are free from errors of measurement. (APA Standards, 1985, p. 19)
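For a dichotomously scored test, a common single-administration estimate of score reliability is KR-20 (a special case of Cronbach's alpha): alpha = k/(k-1) * (1 - sum(p_j * q_j) / var(X)), where k is the number of items, p_j is each item's proportion correct, q_j = 1 - p_j, and var(X) is the total-score variance. A minimal sketch with an invented response matrix:

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """KR-20 reliability for a 0/1 response matrix (rows = candidates, cols = items)."""
    k = responses.shape[1]
    p = responses.mean(axis=0)                       # item difficulties
    item_var = (p * (1 - p)).sum()                   # sum of item variances p*q
    total_var = responses.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# invented 0/1 responses: 6 candidates x 5 items
responses = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
])
print(round(kr20(responses), 2))   # about 0.76 for this invented matrix
```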
More Reliable Test
Less Reliable Test
How to Enhance Reliability When Assembling Test Forms
- Score reliability/generalizability
  - Select items with good measurement properties.
  - Present enough items (see the Spearman-Brown sketch after this list).
  - Target items at candidate ability level.
  - Sample items consistently from across the content domain (use a clearly defined test blueprint).
- Score dependability
  - Same as above.
  - Minimize differences in test difficulty.
- Pass/fail consistency
  - Select enough items.
  - Target items at the cut score.
  - Maintain the same score distribution shape between forms.
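The "present enough items" point can be made concrete with the Spearman-Brown prophecy formula, which projects what the reliability would become if the test were lengthened (or shortened) by a factor k with comparable items: r_new = k*r / (1 + (k-1)*r). A minimal sketch with illustrative numbers:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# e.g. a test with reliability 0.70 doubled in length (numbers purely illustrative)
print(round(spearman_brown(0.70, 2.0), 2))   # ~0.82
```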
Building Simultaneous Parallel Forms Using Classical Theory
Building Simultaneous Parallel Forms Using IRT
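A deliberately simplified sketch of simultaneous form assembly under classical theory: split a calibrated item set into two forms while balancing mean difficulty (an IRT assembly would instead match a target test information function). The item statistics are invented, and a real assembler would also enforce blueprint coverage, discrimination targets, enemy-item rules, and equal form lengths.

```python
# Greedy split of a small calibrated pool into two roughly parallel forms by
# balancing total classical difficulty (p-values). Purely illustrative.
items = [("I1", 0.85), ("I2", 0.40), ("I3", 0.62), ("I4", 0.75),
         ("I5", 0.55), ("I6", 0.68), ("I7", 0.48), ("I8", 0.80)]   # (item id, p-value), invented

form_a, form_b = [], []
for item_id, p in sorted(items, key=lambda it: it[1], reverse=True):
    # place each item on whichever form currently has the lower total difficulty
    target = form_a if sum(q for _, q in form_a) <= sum(q for _, q in form_b) else form_b
    target.append((item_id, p))

for name, form in (("Form A", form_a), ("Form B", form_b)):
    mean_p = sum(p for _, p in form) / len(form)
    print(name, [i for i, _ in form], f"mean p = {mean_p:.2f}")
```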
What options do I have for setting the passing score for my exam?
The Testing System
(Testing-system diagram: Design, Item Bank)
Setting Cut Scores
Why not just set the cut score at 75% correct?
Setting Cut Scores
Why not just set the cut score so that 80% of the candidates pass?
The logic of criterion-based cut score setting
- Certain knowledge and skills are necessary for practice.
- The test measures an important subset of these knowledge and skills, and thus readiness for practice.
- The passing cut score is such that those who pass have a high enough level of mastery of the KSJs to be ready for practice at the level defined in the test definition, while those who fail do not. (Kane, Crooks, and Cohen, 1997)
The Main Goal in Setting Cut Scores
Meeting the Goldilocks Criteria
"We want the passing score to be neither too high nor too low, but at least approximately, just right." (Kane, Crooks, and Cohen, 1997, p. 8)
Two General Approaches to Setting Cut Scores
- Test-Centered Approaches
  - Modified Angoff (see the sketch after this list)
  - Bookmark
- Examinee-Centered Approaches
  - Borderline Group
  - Contrasting Groups
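In a Modified Angoff study, each SME judges, item by item, the probability that a borderline (minimally qualified) candidate would answer correctly; the recommended raw cut score is typically the sum of these probabilities, averaged across judges. A minimal sketch with invented ratings:

```python
import numpy as np

# rows = judges, columns = items; each cell is the judged probability that a
# borderline candidate answers the item correctly (invented ratings)
ratings = np.array([
    [0.70, 0.55, 0.80, 0.60, 0.45],
    [0.65, 0.60, 0.75, 0.55, 0.50],
    [0.75, 0.50, 0.85, 0.65, 0.40],
])

per_judge_cut = ratings.sum(axis=1)        # expected raw score of a borderline candidate, per judge
cut_score = per_judge_cut.mean()           # recommended raw cut score (out of 5 items here)
print(per_judge_cut, round(cut_score, 2))  # around 3.1 out of 5 for these invented ratings
```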
The Testing System
(Testing-system diagram: Design, Item Bank)
What should I consider as I manage my testing system?
Security of a Testing System
(Testing-system diagram: Design, Item Bank)
- Write more items!
- Create authentic items.
- Use isomorphs.
- Use Automated Item Generation.
- Use secure banking software and connectivity.
- Use in-person development.
Security of a Testing System
(Testing-system diagram: Design, Item Bank)
- Establish prerequisite qualifications.
- Use narrow testing windows.
- Establish test/retest restrictions.
- Use identity verification and biometrics.
- Require test takers to sign NDAs.
- Monitor test takers on site.
- Intervene if cheating is detected.
- Monitor individual test center performance.
- Track suspicious test takers over time.
Security of a Testing System
(Testing-system diagram: Design, Item Bank)
- Perform frequent, detailed psychometric review.
- Restrict the use of items and test forms.
- Analyze response times.
- Perform DRIFT analyses.
- Calibrate items efficiently.
Item Parameter Drift
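A drift analysis compares an item's parameter estimates across calibration windows and flags items that have shifted, which can signal exposure or compromise. A minimal sketch using IRT b-parameters; all values are invented, and the 0.5-logit flag threshold is an arbitrary illustration:

```python
# (item id, b at original calibration, b at recent recalibration) -- invented values
calibrations = [
    ("I101", 0.20, 0.25),
    ("I102", -0.50, -1.40),   # much easier now: possible exposure/compromise
    ("I103", 1.10, 1.05),
    ("I104", 0.00, 0.70),     # harder now: possible content or key problem
]

THRESHOLD = 0.5   # logits; arbitrary illustrative cutoff

for item_id, b_old, b_new in calibrations:
    drift = b_new - b_old
    if abs(drift) > THRESHOLD:
        print(f"{item_id}: b shifted {drift:+.2f} logits -> review for drift or compromise")
```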
Security of a Testing System
(Testing-system diagram: Design, Item Bank)
- Many unique fixed forms
- Linear on-the-fly testing (LOFT)
- Computerized adaptive testing (CAT); see the item-selection sketch after this list
- Computerized mastery testing (CMT)
- Multi-stage testing (MST)
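As one illustration of these delivery models, a computerized adaptive test typically selects, at each step, the unadministered item with the greatest Fisher information at the current ability estimate. A minimal 2PL sketch; the item parameters are invented, and real CAT engines also add exposure control and content balancing:

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

# (item id, a, b) -- invented calibrated pool
pool = [("I1", 0.8, -1.0), ("I2", 1.2, 0.0), ("I3", 1.5, 0.4), ("I4", 1.0, 1.2)]

theta_hat = 0.3                      # current provisional ability estimate
administered = {"I2"}                # items already given

candidates = [(info_2pl(theta_hat, a, b), iid) for iid, a, b in pool if iid not in administered]
best_info, best_item = max(candidates)
print(f"next item: {best_item} (information {best_info:.2f} at theta = {theta_hat})")
```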
Item Analysis Activity