Title: Effective Implementation of the International Test Commission Guidelines for Adapting Tests
1. Effective Implementation of the International Test Commission Guidelines for Adapting Tests
- Ronald K. Hambleton, Shuhong Li
- University of Massachusetts, USA
- ICP, Beijing, China, August 10, 2004
2. Background
- Interest in Test Translations and Test Adaptations Has Increased Tremendously in the Past 15 Years.
- -- IQ and Personality Tests in 50 Languages
- -- Achievement Tests for Large-Scale Assessments (PISA, TIMSS) in 30 Languages
- -- International Use of Credentialing Exams Is Expanding.
3. Background
- Medical and Health Researchers With Their Quality of Life Measures (a Huge Field)
- Marketing Research
4. Problems
- All Too Often, the Test Translation and Adaptation Process Is Not Understood (e.g., hiring one translator)
- -- Limited Technical Work (e.g., back-translation only)
- -- Literal Translations Only
- -- Validity Initiatives End With Judgmental Analyses
5. Problems
- The 22 ITC Guidelines for Test Adaptation Are Becoming Well Known and Are Often Referred to in the Literature.
- But It Is Not Always Clear to Practitioners How These Guidelines Might Be Applied.
- The paper by van de Vijver and Tanzer (1997) is very useful, but it is not directly linked to the guidelines, and there have been many new developments since 1997.
6. Purposes of the Research
- Provide Specific Ideas for Applying the ITC Test Adaptation Guidelines.
- The paper includes many excellent examples of applications.
- Successful adaptation is a mixture of good designs, excellent translators, questionnaires, observations, good judgments, statistical analyses, validity studies, etc.
7. C.1 The amount of overlap in the constructs in the populations of interest should be assessed.
- Is the meaning of the construct the same across language groups?
- Useful tools: exploratory factor analysis, and especially confirmatory factor analysis (SEM) and multidimensional scaling (a small computational sketch follows below).
- -- See work by Byrne with self-concept measures, van de Vijver (2004) with the CPI, and Gregoire (2004) with the WAIS-III. (cont.)
8. C.1 The amount of overlap in the constructs in the populations of interest should be assessed.
- Judgment of content suitability
- -- Routinely done with the international assessments, via questionnaires and face-to-face committee meetings.
- Assessing nomological nets: basically, investigating a pattern of test results that includes external factors (i.e., construct validity investigations)
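A small computational sketch of the construct-overlap check mentioned under C.1. Tucker's congruence coefficient compares factor loadings estimated separately in two language groups; the loading values below are hypothetical, and the .95 rule of thumb is a common convention rather than part of the guidelines.

    # Tucker's congruence coefficient between factor-loading vectors
    # from two language groups (hypothetical values).
    import numpy as np

    def congruence(x, y):
        # phi = x.y / sqrt((x.x)(y.y)); values near 1 suggest the
        # factor carries essentially the same meaning in both groups.
        return float(x @ y / np.sqrt((x @ x) * (y @ y)))

    source = np.array([0.71, 0.64, 0.58, 0.69, 0.55, 0.62])
    target = np.array([0.68, 0.60, 0.61, 0.66, 0.49, 0.59])

    print(f"Tucker's phi = {congruence(source, target):.3f}")

Values above roughly .95 are often read as supporting construct equivalence; lower values call for the judgmental reviews described above.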
9. C.2 Effects of cultural differences which are not relevant to the purpose of the study should be minimized.
- Across language and cultural groups: Same motivational level? Same understanding of directions? Same impact of speed? Common experience? If not, fix!
- -- Questionnaires, observations, local experts, etc. can provide valuable evidence. (cont.)
10. C.2 Effects of cultural differences which are not relevant to the purpose of the study should be minimized.
- Assessment of cultural differences
- -- Assess differences in language, family structures, religion, lifestyle, values, etc. (see van de Vijver & Leung, 1997)
- -- Statistical methods such as ANCOVA may allow group differences to be removed statistically (a small sketch follows below).
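A minimal ANCOVA sketch of the statistical adjustment mentioned in the last bullet. The data, group labels, and covariate (years of schooling) are all invented for illustration; the point is only that the group effect is tested after the covariate is partialled out.

    # ANCOVA: test the group effect on scores after adjusting for a
    # covariate thought to carry construct-irrelevant differences.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "score":     [24, 31, 28, 35, 22, 30, 27, 33, 25, 29],
        "group":     ["source"] * 5 + ["target"] * 5,
        "schooling": [10, 13, 12, 15, 9, 12, 11, 14, 10, 12],
    })

    model = smf.ols("score ~ C(group) + schooling", data=df).fit()
    print(model.summary())  # inspect the adjusted C(group) coefficient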
11. D.1 Insure test adaptation takes account of linguistic and cultural differences.
- This guideline is really about the translators and their qualifications: they must know the languages, the cultures, basic test development, and the subject matter/construct.
- -- Evaluate the process used in selecting translators (avoid convenience as a criterion)
- -- Use multiple translators (cont.)
12. D.1 Insure test adaptation takes account of linguistic and cultural differences.
- D.1 has been one of the most widely applied guidelines: agencies are evaluating translators more carefully and frequently using multiple translators. See Meara (2004) and Grisay (2004).
13. D.2 Provide evidence that directions, scoring rubrics, and formats are applicable.
- If possible, begin in the source language by choosing concepts, formats, etc. that will adapt easily.
- -- Qualified translators can be very useful.
- -- Develop checklists for translators to watch for unfamiliar words, sentence length, culturally specific concepts, etc., and have them sign off. See Meara (2004). (cont.)
14. D.2 Provide evidence that directions and scoring rubrics, item formats, and items are widely applicable.
- Another successful guideline: checklists/rating scales have been developed to focus on content, conceptual, and linguistic equivalence. See Jeanrie and Bertrand (1999).
15. D.3 Formats, instructions, and the test itself should be developed to maximize utility in multiple groups.
- The meaning of this guideline seems clear.
- -- Compile evidence on the target group via questionnaires, observations, discussions with testing specialists, and small tryouts (van de Vijver & Tanzer, 1997). (cont.)
16. D.3 Formats, instructions, and the test itself should be developed to maximize utility in multiple groups.
- -- Are training materials available, in case the tests are unusual or new?
- -- Were the training materials evaluated for their success?
- -- Consider balancing formats in the test.
17. D.4 Item content should be familiar to persons in the target languages.
- Here, we mean judgmental reviews.
- -- Develop checklists for reviewers in the target language (as is done to detect gender and ethnic bias or to evaluate test items).
- -- If changes are made (e.g., dollars to pounds), be sure the changes are judged to be psychologically equivalent.
18. D.5 Linguistic and psychological evidence should be used to improve the test and address equivalence.
- Implement forward- and backward-translation designs for effective review.
- -- Were multiple translators used?
- -- Both designs?
- -- Probes of target language/culture respondents?
- -- Administration of source and back-translated versions? (cont.)
19. D.5 Linguistic and psychological evidence should be used to improve the test and address equivalence.
- See Gregoire's (2004) work to build equivalent scales and test dimensionality in French to match the English version of the WAIS-III.
20. D.6 Choose a data collection design to provide statistical evidence to establish item equivalence.
- DIF, SEM, and IRT studies are valuable, but suitable designs are needed for effective analyses (bilingual and monolingual-monolingual designs).
- -- Are sample sizes large enough? Representative of the populations?
- See Muniz et al. (2001) for a small-sample study of Mantel-Haenszel (MH) statistics and conditional p-values; a small MH sketch follows below.
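A minimal Mantel-Haenszel sketch for a single item, using hypothetical counts. Examinees are first stratified on total score; within each stratum the counts are (reference correct, reference incorrect, focal correct, focal incorrect), with the source-language group as reference and the target-language group as focal.

    # Mantel-Haenszel common odds ratio and the ETS delta metric for
    # one item across score strata (hypothetical counts).
    import math

    strata = [
        (30, 20, 22, 28),   # (A, B, C, D) per score stratum
        (45, 15, 38, 22),
        (60, 10, 55, 15),
    ]

    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha_mh = num / den                   # common odds ratio
    delta_mh = -2.35 * math.log(alpha_mh)  # ETS delta scale

    print(f"alpha_MH = {alpha_mh:.3f}, delta_MH = {delta_mh:.3f}")

Under the common ETS convention, |delta_MH| values of roughly 1.5 or more, coupled with statistical significance, flag large DIF.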
21. D.7 Use psychometric and statistical techniques to establish test equivalence and identify test shortcomings.
- The primary concern is that statistical procedures be consistent with data assumptions.
- -- No common scale: SEM can be used
- -- Common scale (unconditional): ANOVA, logistic regression, delta plots
- -- Common scale (conditional): IRT
22. D.8 Provide technical evidence to support the validity of the test in its adapted form.
- The validity of a translated/adapted test cannot be assumed; many cultural factors can be present. Assumed validity is one of the most common myths! The level of effort should be tied to the importance of the test and to the cultural difference between the source and target groups. (cont.)
23. D.8 Provide technical evidence to support the validity of the test in its adapted form.
- -- Item analysis, reliability, and validity studies (content, criterion-related, construct) in relation to stated purposes are needed on the translated/adapted test.
24. D.9 Provide evidence of item equivalence in multiple languages.
- There is a lot of methodology here, much of it a simple extension of DIF methodology.
- -- Delta plots, standardized p-differences, b-value plots, Mantel-Haenszel, logistic regression, and much more (a logistic-regression sketch follows below).
- Van de Vijver (2004) provides an excellent example and grapples with the issue of effect sizes (not all statistical differences are consequential). (cont.)
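A logistic-regression DIF sketch with simulated data, as referenced in the list above. The matching variable is the total score; significant group and score-by-group terms signal uniform and non-uniform DIF, respectively. Sample size, effect sizes, and variable names are all invented for illustration.

    # Logistic-regression DIF: compare a model with only the matching
    # score against one that adds group and score-by-group terms.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    n = 400
    df = pd.DataFrame({
        "total": rng.integers(0, 31, n),  # matching score
        "group": rng.integers(0, 2, n),   # 0 = source, 1 = target
    })
    # Simulated item responses with a small uniform-DIF effect built in.
    logit = -3 + 0.2 * df["total"] - 0.5 * df["group"]
    df["item"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

    base = smf.logit("item ~ total", data=df).fit(disp=False)
    dif = smf.logit("item ~ total + group + total:group",
                    data=df).fit(disp=False)

    lr = 2 * (dif.llf - base.llf)  # likelihood-ratio statistic, 2 df
    print(f"LR chi2(2) = {lr:.2f}, p = {chi2.sf(lr, 2):.4f}")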
25. D.9 Provide evidence of item equivalence in multiple languages.
- Zumbo (2004) shows that SEM procedures do not necessarily spot item-level DIF, so these analyses are very important.
26. D.10 Non-equivalent items should be eliminated from the linking process.
- If linking of scales is in the plan, watch for non-functioning test items and eliminate them from the link. The items can remain in the source-language version.
- -- See the example in Figure 3 below.
- -- Items eliminated from the link may still be valuable in the source and target languages, with unique item statistics.
27. Figure 3. Delta Plot With 40 Anchor Items.
Linear equating line: y = 1.09x - 0.44. Major axis of the ellipse: y = 1.11x - 0.66.
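A small sketch in the spirit of Figure 3, using simulated p-values rather than the real anchor-item data (so the fitted line will not reproduce the figure's coefficients). Item difficulties are placed on the ETS delta scale and the major (principal) axis of the scatter is fitted; items far from the line are candidates for removal from the link, per D.10.

    # Delta plot: convert p-values to deltas, fit the major axis, and
    # flag items with large perpendicular distances (simulated data).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    p_src = rng.uniform(0.3, 0.9, 40)
    p_tgt = np.clip(p_src + rng.normal(0, 0.05, 40), 0.05, 0.95)

    dx = 13 - 4 * norm.ppf(p_src)   # source-language deltas
    dy = 13 - 4 * norm.ppf(p_tgt)   # target-language deltas

    sx2, sy2 = dx.var(), dy.var()
    sxy = np.cov(dx, dy, bias=True)[0, 1]
    slope = (sy2 - sx2 + np.sqrt((sy2 - sx2) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = dy.mean() - slope * dx.mean()
    print(f"major axis: y = {slope:.2f}x {intercept:+.2f}")

    # Perpendicular distance to the major axis; the 1.5 cutoff here is
    # illustrative only.
    dist = np.abs(slope * dx - dy + intercept) / np.sqrt(slope ** 2 + 1)
    print("flagged items:", np.where(dist > 1.5)[0])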
28. A.1 Try to anticipate test administration problems and eliminate them.
- The next six test administration guidelines are clear.
- -- Empirical evidence is needed to support the claim of equivalence: observations, interviews with respondents and administrators, local experts, small tryouts, analysis of response times, practice effects, well-trained administrators, etc.
29. A.2 Be sensitive to problems with tests, administration, format, etc. that might lower validity.
- This guideline is clear.
- -- Work from checklists of common problems (e.g., hard words, format familiarity, rating scales, the role of speed, separate answer sheets, use of keyboards, availability of practice materials, etc.)
30. A.3 Eliminate aspects of the environment that may affect performance.
- Watch for environmental factors that may affect test performance and reduce validity.
- -- Again, observations, interviews, and checklists can be helpful. Will respondents be honest? Is performance maximized, in the case of achievement tests? Does the administrator have the skills to follow directions closely?
31. A.4 Minimize problems with test administration directions.
- Instructions can be problematic across cultural groups.
- -- Use a checklist to watch for clarity in the instructions to respondents (e.g., simple words, avoidance of the passive voice, specific rather than general directions, use of examples to explain item formats, etc.)
32. A.5 Identify in the manual the administration details that need to be considered.
- Test manuals need to reflect all the details of test administration.
- -- Does the test manual describe administration procedures that are understandable and based on field-test experience?
- -- Does the manual emphasize the need for standardization in administration?
33. A.6 Administrators need to be unobtrusive, and examiner-examinee interaction minimized.
- This guideline is about minimizing the role of the administrator: gender, ethnic background, age, etc.
- -- Were standardized procedures followed?
- -- Was training effective?
- -- Were local cultural norms respected?
- -- Were pilot studies carried out?
34. I.1 With adapted tests, document the changes that have been made and the evidence of equivalence.
- It is important to keep records of the procedures used in the adaptation process, and of the changes made.
- -- A record of the process can be valuable; it becomes part of the argument for validity. For example, how were qualified translators identified?
35. I.2 Score differences should not be taken at face value. Compile validity data to substantiate the differences.
- It is easy to interpret differences in terms of achievement, but why are there differences?
- -- Look at educational policies, resources, etc.
- -- Form a committee to interpret findings. (A diversity of opinions is needed.)
- -- Be aware of other research that might help the interpretations.
- See Chung (2004) with psychological tests in China, and the TIMSS and PISA studies.
36. I.3 Comparisons across populations can be made at the level of invariance that is established for the test.
- Are scores in the different populations linked to a common scale? If not, comparisons are problematic. (And statistical equating is a complicated process; a small linking sketch follows below.)
- -- Are scores being interpreted at the level of invariance that has been established?
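A minimal linear-linking sketch (the mean-sigma idea) of what "linked to a common scale" can mean in practice. The score summaries are hypothetical, and a real linking study involves far more design work (common items or common examinees, equating error, invariance checks) than this suggests.

    # Mean-sigma linear linking: place source-form scores on the
    # target-form scale using hypothetical summary statistics.
    import numpy as np

    mean_x, sd_x = 52.0, 9.5   # source-language form
    mean_y, sd_y = 48.0, 8.7   # target-language form

    def to_target_scale(x):
        return sd_y / sd_x * (x - mean_x) + mean_y

    print(to_target_scale(np.array([40.0, 52.0, 65.0])))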
37. I.4 Specific suggestions for interpreting the results need to be offered.
- The idea here is that it is not good enough to produce the test results; a basis for interpretation should be offered.
- -- Are possible interpretations offered?
- -- Are cautions against misinterpretation offered?
- -- Are factors discussed that might impact the results? (cont.)
38. I.4 Specific suggestions for interpreting the results need to be offered.
- -- Hierarchical Linear Modeling (HLM) is being used to build causal models to explain results (a small sketch follows below).
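A minimal two-level sketch of the kind of hierarchical model mentioned above, with simulated students nested in schools. The covariate (SES) and all effect sizes are invented; the point is the random school intercept plus a student-level fixed effect.

    # Two-level model: random intercept for school, fixed effect for
    # a student-level covariate (simulated data).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n_schools, n_students = 20, 25
    school = np.repeat(np.arange(n_schools), n_students)
    u = rng.normal(0, 3, n_schools)[school]        # school effects
    ses = rng.normal(0, 1, n_schools * n_students)
    score = 50 + 4 * ses + u + rng.normal(0, 5, len(ses))

    df = pd.DataFrame({"score": score, "ses": ses, "school": school})
    model = smf.mixedlm("score ~ ses", data=df, groups=df["school"]).fit()
    print(model.summary())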
39. Conclusions
- Test adaptation practices should improve with methodology linked to the guidelines.
- What's needed now are more comprehensive examples of how these guidelines are being applied, e.g., the Grisay (2004) and Meara (2004) papers in this symposium.
40. Follow-Up Reading
- See the work being done by TIMSS and OECD/PISA: outstanding quality.
- Language Testing (2004 special issue)
- Hambleton, R. K., et al. (2004). Adaptation of educational and psychological tests. Erlbaum.
41. Paper Request
- Please contact the first author at RKH@educ.umass.edu for a copy of the paper.