Title: Test Development
1. Test Development
2. Test Development Process
- Test Conceptualization
- Test Construction
- Test Tryout
- Item Analysis
- Test Revision
3. Test Conceptualization
- Role of self-talk
- Preliminary questions
- What is the test designed to measure?
- What is the objective of the test?
- Is there a need for the test?
- Potential harm/benefits?
- What content will be covered?
4. Test Conceptualization (cont'd)
- How will meaning be attributed to scores on this test?
- Norm-referenced: compare an individual's score to the scores of others who have already taken the test
- Criterion-referenced: compare the score to that of a criterion group (known to have the trait)
5. Test Conceptualization (cont'd)
- Pilot work
- Preliminary research surrounding the creation of the prototype of the test
- Aim: determine how best to measure the targeted construct
6. Test Construction
- Three steps
- Scaling
- Writing items
- Scoring items
7. Test Construction (cont'd)
- Scaling: setting rules for assigning numbers in measurement
- Deciding on the type of scale
- Types of scales
- Age-based
- Grade-based
- Stanine transformation of raw scores
- Uni- or multi-dimensional
- Method of paired comparisons
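The stanine transformation listed above maps raw scores onto a nine-point standard scale (mean 5, SD 2). A minimal sketch, assuming raw scores have already been converted to percentile ranks; the cumulative cut points are the conventional stanine bands (4, 7, 12, 17, 20, 17, 12, 7, 4 percent):

```python
# Conventional cumulative percentage cut points for stanines 1-8
# (stanine 9 takes everything above the last cut).
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile: float) -> int:
    """Convert a percentile rank (0-100) to a stanine (1-9)."""
    for s, cut in enumerate(STANINE_CUTS, start=1):
        if percentile <= cut:
            return s
    return 9

print(stanine(50))  # a median score falls in stanine 5
```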
8. Method of Paired Comparisons
- Select the behavior you think would be more justified:
- a. cheating on taxes if one has a chance
- b. accepting a bribe in the course of one's duties
- Which picture do you prefer?
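Paired-comparison judgments like the ones above can be tallied into scale values. A minimal sketch that scores each stimulus by the proportion of comparisons it wins (Thurstone's full scaling method would further convert these proportions to z-scores):

```python
from collections import Counter

def paired_comparison_scale(judgments):
    """Score stimuli from paired-comparison data.

    judgments: iterable of (option_a, option_b, chosen) tuples,
    one per judge per pair. Returns each stimulus's win proportion.
    """
    wins, appearances = Counter(), Counter()
    for a, b, chosen in judgments:
        appearances[a] += 1
        appearances[b] += 1
        wins[chosen] += 1
    return {s: wins[s] / appearances[s] for s in appearances}

# Three judges compare behaviors a and b from the slide:
data = [("a", "b", "a"), ("a", "b", "a"), ("a", "b", "b")]
print(paired_comparison_scale(data))  # a wins 2 of 3 comparisons
```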
9. Test Construction (cont'd)
- Writing items
- Consider content, item formats, and number of items
- Item pool: the group from which items will be drawn for, or discarded from, the final version of the test
10. Test Construction (cont'd)
- Writing items (cont'd)
- Item format
- Selected-response
- Constructed-response
11. Constructed Response
- The standard deviation is generally considered the most useful measure of _________.
- Answer: variability
12. Construction: Writing Items
- Writing items (cont'd)
- Selected response formats
- Dichotomous
- Polytomous
- Likert
- Categorical
- Checklists
- Matching
- Subjective response format
13. Dichotomous (True/False)
- Variables such as the form, plan, structure, arrangement, and layout of individual test items are collectively referred to as item format.
- True / False
14. Selected-Response: Multiple Choice
- Item A
- A psychological test, an interview, and a case study are
- Psychological assessment tools
- Standardized behavioral samples
- Reliable assessment instruments
- Theory-linked measures
- (Item anatomy: stem, correct alternative, distractors)
15. Selected-Response: Multiple Choice (cont'd)
- Item B
- A good multiple-choice item in an achievement test
- Has one correct alternative
- Has grammatically parallel alternatives
- Has alternatives of similar length
- Has alternatives that fit grammatically with the stem
- Includes as much of the item as possible in the stem to avoid unnecessary repetition
- Avoids ridiculous distractors
- Is not excessively long
- All of the above
- None of the above
16. Likert Scales
- How effective was the textbook in facilitating your learning in this course?
- 1 = Not at all effective
- 2 = A little effective
- 3 = Average effectiveness
- 4 = More effective than usual
- 5 = Extremely effective
17. Categorical
- What level of education have you completed?
- Kindergarten through 5th grade
- Middle school education (6th-8th grade)
- Some high school (9th-11th grade)
- High school diploma
- Associate's degree
- Master's degree
- Professional degree (Ph.D., M.D., J.D., D.O.)
18. Checklists
- Which symptoms have you experienced in the past month?
- ___ Feeling down ___ Anxiety
- ___ Irritability ___ Restlessness
- ___ Sadness ___ Appetite changes
- ___ Crying ___ Less interest in sex
19. Matching
- ___ A. Samuel L. Jackson    1. Mission Impossible
- ___ B. Brad Pitt            2. Dumb and Dumber
- ___ C. Jim Carrey           3. Shaft
- ___ D. Tom Cruise           4. Fight Club
20. Subjective Response Formats
- Fill-in-the-blank (e.g., regression is _________________)
- Short answer
- Essay
- The longer and more complex the answer, the more
difficult it is to score reliably.
21. Summary for Writing Items
- 1. Use a theory or model to guide your test/survey when possible
- 2. Try not to confuse the participant
- 3. Use simple, clear language
- 4. PROOFREAD
- 5. Anticipate confusion
- 6. Consider boredom and fatigue
- 7. Consider short-term memory limitations
- 8. Remember: item writing should proceed with a plan in mind; we should have a clearly defined notion of the construct we wish to measure!
22. Test Construction (cont'd)
- Scoring items
- Class scoring: responses earn credit toward placement in a particular class
- Category scoring: responses earn credit toward placement in a particular category
- Ipsative scoring: compares a testtaker's score on one scale within the test with that testtaker's score on another scale of the same test
23. Ipsative Scoring
- Edwards Personal Preference Schedule (EPPS): forced choice between two equally socially desirable responses yields information on the strength of the testtaker's various needs relative to the strength of that testtaker's other needs (not relative to the needs of the general population), so only intra-individual (within-person) conclusions can be drawn, NOT inter-individual (between-person) ones
- e.g.,
- I feel depressed when I fail at something.
- I feel nervous when giving a talk before a group.
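Ipsative interpretation can be sketched as ranking scales against each other within a single testtaker. The scale names below are illustrative EPPS-style needs, not actual EPPS output:

```python
def ipsative_ranking(scale_scores: dict) -> list:
    """Rank one testtaker's scales against each other (ipsative):
    the ordering is meaningful only within this person, so it
    supports intra-individual, not inter-individual, conclusions."""
    return sorted(scale_scores, key=scale_scores.get, reverse=True)

# Illustrative need scores for a single testtaker:
needs = {"achievement": 18, "affiliation": 12, "autonomy": 15}
print(ipsative_ranking(needs))  # ['achievement', 'autonomy', 'affiliation']
```

Comparing two testtakers' rankings element by element would be an inter-individual claim, which ipsative scores do not support.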
24. Test Tryout
- Try the test out on people similar to those for whom the test was developed
- 5-10 people per test item
- e.g., if the test is meant to aid in the selection of corporate executives with management potential, try it out on corporate employees at the targeted level
- The more people in the tryout, the weaker the role of chance in the data analysis
25. Item Analysis
- Item difficulty: how many people get the item right; the more who get it right, the easier the item
- Optimum difficulty level: first, find half of the difference between 100% success and chance performance; second, add this value to the probability of answering correctly by chance alone (the midway point)
- With 100% success (1.0) and chance at .2 (for 5 alternatives):
- (1.0 - .2) / 2 = .40
- .20 (chance) + .40 = .60 (optimum difficulty level)
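The arithmetic above can be sketched directly, taking chance performance as one over the number of alternatives:

```python
def item_difficulty(responses):
    """Item difficulty: the proportion of testtakers answering
    correctly (1 = correct, 0 = incorrect). Higher = easier item."""
    return sum(responses) / len(responses)

def optimum_difficulty(n_alternatives: int) -> float:
    """Midway point between chance performance and 100% success."""
    chance = 1.0 / n_alternatives          # e.g., .20 for 5 alternatives
    return chance + (1.0 - chance) / 2     # .20 + .40 = .60

print(round(optimum_difficulty(5), 2))  # 0.6
```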
26. Item Analysis
- Item discriminability: determines whether people who have done well on a particular item have also done well on the whole test
- Extreme group method: compares those who do well on the test with those who haven't
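The extreme group method can be sketched as follows: split testtakers into upper and lower groups by total test score (the 27% cut used here is a common convention, not stated in the slides) and subtract the groups' proportions correct on the item:

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Extreme group method: proportion of the upper group answering
    the item correctly minus the proportion of the lower group.
    Values near +1 discriminate well; near 0 (or negative), poorly."""
    order = sorted(range(len(total_scores)), key=total_scores.__getitem__)
    n = max(1, round(fraction * len(total_scores)))
    lower, upper = order[:n], order[-n:]
    p_upper = sum(item_correct[i] for i in upper) / n
    p_lower = sum(item_correct[i] for i in lower) / n
    return p_upper - p_lower

# Ten testtakers; only the high scorers got this item right:
got_it = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
totals = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(discrimination_index(got_it, totals))  # 1.0
```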
27. Item Analysis
- Item reliability
- Item-reliability index: a higher index means a more reliable item (i.e., a measure of internal consistency)
- Factor analysis: can show whether items load on the factors you want them to, or whether several unintended factors are emerging; items can then be eliminated based on what you want the test to do
28. Item Analysis
- Item validity
- Item-validity index: indicates the degree to which a test measures what it says it measures
- Higher is better
- Uses the item-score standard deviation and the correlation between the item score and the criterion score
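Both the item-reliability index (slide 27) and the item-validity index share the same form: the item-score standard deviation times a correlation, with the item correlated against the total test score for reliability and against an external criterion score for validity. A minimal sketch:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def item_index(item_scores, other_scores):
    """Item SD times the item's correlation with other_scores.
    Pass total test scores for the item-reliability index, or
    criterion scores for the item-validity index."""
    m = sum(item_scores) / len(item_scores)
    sd = sqrt(sum((a - m) ** 2 for a in item_scores) / len(item_scores))
    return sd * pearson_r(item_scores, other_scores)
```

For a dichotomous item scored 0/1 with difficulty p, the item SD reduces to sqrt(p * (1 - p)).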
29. Item Analysis
- Item characteristic curve: the relationship between performance on the item and performance on the test
30Item Characteristic Curves
A
B
C
D
High Prob of correct response Low
Low High
Ability
Low High
Ability
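The slides show the curves without a formula; a common way to model an item characteristic curve (an assumption here, borrowed from item response theory) is the two-parameter logistic, where a controls discrimination (steepness) and b controls difficulty (location):

```python
from math import exp

def icc(ability: float, a: float = 1.0, b: float = 0.0) -> float:
    """Two-parameter logistic item characteristic curve:
    probability of a correct response as a function of ability,
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + exp(-a * (ability - b)))

# At ability equal to the item's difficulty, P(correct) = .5;
# a higher-a curve rises more steeply around b.
print(icc(0.0))          # 0.5
print(icc(2.0, a=2.0))   # high ability: close to 1
```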
31. Test Revision
- Mold the test into its final form
- Evaluate strengths/weaknesses of items
- Delete weaker items
- e.g.,
- Some items may be too easy or too hard (these lack reliability and validity because of their restricted ranges of testtaker performance)
- Items could have high reliability but poor criterion validity, or could be unbiased but too easy
- Also reflect on the purpose of the test (for an educational placement test, the developer will be very concerned about item bias)
- If the test should identify the most skilled individuals (e.g., astronaut program candidates), then high item discrimination is wanted
32. Test Revision (cont'd)
- Administer the test under standardized conditions to a second appropriate sample of testtakers
- Standardization: once the test is in its final form, this process introduces objectivity and uniformity into test administration, scoring, and interpretation
- Cross-validation: revalidating the test on another sample of people
- Validity shrinkage: the decrease in item validities that occurs upon cross-validation
33. Example
- Affirmative Action Knowledge Test
- 5 phases of development
- Item-level analysis
- Scale-level analysis
- Convergent/discriminant validity