1
Effective Implementation of the International
Test Commission Guidelines for Adapting Tests
  • Ronald K. Hambleton, Shuhong Li
  • University of Massachusetts, USA
  • ICP, Beijing, China, August 10, 2004

2
Background
  • Interest in Test Translations and Test
    Adaptations Has Increased Tremendously in the
    Past 15 Years.
  • --IQ and Personality Tests in 50 Languages
  • --Achievement Tests for Large-Scale Assessments
    (PISA, TIMSS) in 30 Languages
  • --International Use of Credentialing Exams Is
    Expanding.

3
Background
  • Medical and Health Researchers With Their
    Quality-of-Life Measures (a Huge Field)
  • Marketing Research

4
Problems
  • All Too Often, the Test Translation and
    Adaptation Process Is Not Understood (e.g.,
    hiring a single translator)
  • --Limited Technical Work (e.g., back-translation
    only)
  • --Literal Translations Only
  • --Validity Initiatives End With Judgmental
    Analyses

5
Problems
  • The 22 ITC Guidelines for Test Adaptation Are
    Becoming Well-Known, and Are Often Referred to in
    the Literature.
  • But It Is Not Always Clear to Practitioners How
    These Guidelines Might Be Applied.
  • The paper by van de Vijver and Tanzer (1997) is
    very useful, but it is not directly linked to the
    guidelines, and there have been many new
    developments since 1997.

6
Purposes of the Research
  • Provide Specific Ideas for Applying the ITC Test
    Adaptation Guidelines.
  • The paper includes many excellent examples of
    applications.
  • Successful adaptation is a mixture of good
    designs, excellent translators, questionnaires,
    observations, good judgments, statistical
    analyses, validity studies, etc.

7
C.1 The amount of overlap in the constructs in
the populations of interest should be assessed.
  • Is the meaning of the construct the same over
    language groups?
  • Exploratory factor analysis, confirmatory factor
    analysis (SEM), and multidimensional scaling are
    the main statistical tools (a sketch of one such
    check follows below).
  • --See work by Byrne with self-concept measures,
    van de Vijver (2004) with the CPI, and Gregoire
    (2004) with the WAIS-III. cont.
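
A minimal sketch of a construct-comparability check of this kind, assuming Python with NumPy and scikit-learn; the response matrices are synthetic placeholders, and the congruence cutoffs are common rules of thumb rather than values from the paper:

```python
# Sketch: compare factor structure across two language groups with
# exploratory factor analysis plus Tucker's congruence coefficient.
# The response matrices below are random placeholders; real use would
# load item responses (respondents x items) for each group.
import numpy as np
from sklearn.decomposition import FactorAnalysis

def congruence(a, b):
    """Tucker's congruence coefficient between two loading vectors."""
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

rng = np.random.default_rng(0)
source_X = rng.normal(size=(500, 10))   # source-language responses
target_X = rng.normal(size=(500, 10))   # target-language responses

n_factors = 2  # hypothesized dimensionality of the construct
fa_source = FactorAnalysis(n_components=n_factors).fit(source_X)
fa_target = FactorAnalysis(n_components=n_factors).fit(target_X)

# In practice the target solution is first rotated toward the source
# (Procrustes rotation) before comparing; with that caveat, values
# above ~0.95 are usually read as equivalent factors.
for k in range(n_factors):
    phi = congruence(fa_source.components_[k], fa_target.components_[k])
    print(f"Factor {k + 1}: congruence = {phi:.3f}")
```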

8
C.1 The amount of overlap in the constructs in
the populations of interest should be assessed.
  • Judgment of the content suitability
  • --Routinely done with the international
    assessments, via questionnaires, and face-to-face
    committee meetings.
  • Assessing nomological nets: basically,
    investigating a pattern of test results that
    includes external factors (i.e., construct
    validity investigations)

9
C.2 Effects of cultural differences which are
not relevant to the purpose of the study should
be minimized.
  • Across language and cultural groups: Same
    motivational level? Same understanding of
    directions? Same impact of speed? Common
    experience? If not, fix!
  • --Questionnaires, observations, local experts,
    etc. can provide valuable evidence.
  • cont.

10
C.2 Effects of cultural differences which are
not relevant to the purpose of the study should
be minimized.
  • Assessment of cultural differences.
  • --Assess differences in language, family
    structures, religion, lifestyle, values, etc.
    (see van de Vijver & Leung, 1997)
  • --Statistical methods such as ANCOVA may allow
    differences to be removed statistically (a sketch
    follows below).
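
A minimal ANCOVA sketch along these lines, assuming Python with pandas and statsmodels; the DataFrame, the "exposure" covariate, and the effect sizes are synthetic illustrations, not from the paper:

```python
# Sketch: ANCOVA to adjust a group comparison for a measured cultural
# covariate (here a hypothetical "exposure" variable), via statsmodels.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "group": np.repeat(["source", "target"], n // 2),
    "exposure": rng.normal(size=n),          # covariate to control for
})
df["score"] = 50 + 3 * df["exposure"] + rng.normal(scale=5, size=n)

# Test the group effect with the covariate partialled out.
model = smf.ols("score ~ C(group) + exposure", data=df).fit()
print(anova_lm(model, typ=2))
```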

11
D.1 Ensure test adaptation takes account of
linguistic and cultural differences.
  • This guideline is really about the translators
    and their qualifications: they must know the
    languages, the cultures, basic test development,
    and the subject matter/construct.
  • --Evaluate the process used in selecting
    translators (avoid convenience as a criterion)
  • --Use of multiple translators cont.

12
D.1 Ensure test adaptation takes account of
linguistic and cultural differences.
  • D.1 has been one of the most widely applied
    guidelines. Agencies are evaluating translators
    more carefully, and frequently using multiple
    translators. See Meara (2004), Grisay (2004).

13
D.2 Provide evidence that directions, scoring
rubrics, formats, are applicable.
  • If possible, begin in the source language to
    choose concepts, formats, etc. that will adapt
    easily.
  • --Qualified translators can be very useful.
  • --Develop checklists for translators to watch for
    unfamiliar words, lengths of sentences,
    culturally specific concepts, etc. Have them
    sign off. See Meara (2004). cont.

14
D.2 Provide evidence that directions and scoring
rubrics, item formats, and items are widely
applicable.
  • Another successful guideline: checklists/rating
    scales have been developed to focus on content,
    conceptual, and linguistic equivalence. See
    Jeanrie and Bertrand (1999).

15
D.3 Formats, instructions, and test itself
should be developed to maximize utility in
multiple groups.
  • The meaning of this guideline seems clear.
  • --Compile evidence on the target group via
    questionnaires, observations, discussions with
    testing specialists, and a small tryout. (van de
    Vijver & Tanzer, 1997)
  • cont.

16
D.3 Formats, instructions, and test itself
should be developed to maximize utility in
multiple groups.
  • --Are training materials available, in case the
    tests are unusual or new?
  • --Were training materials evaluated for their
    success?
  • --Consider balancing formats in the test

17
D.4 Item content should be familiar to persons
in the target languages.
  • Here, we mean judgmental reviews.
  • --Develop checklists for reviewers in the target
    language. (As is done to detect gender and
    ethnic bias or evaluate test items.)
  • --If changes are made (e.g., dollars to pounds)
    be sure the changes are judged as psychologically
    equivalent.

18
D.5 Linguistic and psychological evidence should
be used to improve the test, and address
equivalence.
  • Implement forward and backward translation
    designs for effective review.
  • --Were multiple translators used?
  • --Both designs?
  • --Probes of target language/culture respondents?
  • --Administration of source and back-translated
    versions? cont.

19
D.5 Linguistic and psychological evidence should
be used to improve the test, and address
equivalence.
  • Gregoire's (2004) work to build equivalent scales
    and test dimensionality in French to match the
    English version of the WAIS-III

20
D.6 Choose a data collection design to provide
statistical evidence to establish item
equivalence.
  • DIF, SEM, and IRT studies are valuable, but
    suitable data collection designs are needed for
    effective analyses (bilingual, mono-mono).
  • --Are sample sizes large enough? Representative
    of the populations?
  • See Muñiz et al. (2001) for a small-sample study
    of MH statistics and conditional p-values (an MH
    sketch follows below).
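
A minimal Mantel-Haenszel sketch of the kind such designs support, assuming Python with NumPy and statsmodels; the item responses, groups, and score strata below are synthetic placeholders:

```python
# Sketch: Mantel-Haenszel DIF check for one item, stratifying examinees
# on total test score. All data here are random placeholders.
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

rng = np.random.default_rng(2)
n = 400
group = rng.integers(0, 2, size=n)     # 0 = reference, 1 = focal group
total = rng.integers(0, 21, size=n)    # matching score, 0-20
item = rng.integers(0, 2, size=n)      # scored response to the studied item

tables = []
for lo in range(0, 21, 5):             # five-point score strata
    mask = (total >= lo) & (total < lo + 5)
    # 2x2 table for this stratum: rows = group, cols = incorrect/correct
    t = np.array([[np.sum((group == g) & (item == r) & mask)
                   for r in (0, 1)] for g in (0, 1)])
    if t.sum(axis=1).min() > 0:        # keep strata with both groups present
        tables.append(t)

st = StratifiedTable(tables)
print("MH common odds ratio:", st.oddsratio_pooled)
print(st.test_null_odds())             # Cochran-Mantel-Haenszel test
```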

21
D.7 Use psychometric and statistical techniques
to establish test equivalence and test
shortcomings.
  • The primary concern is that statistical
    procedures are consistent with data assumptions.
  • --No common scale: can do SEM
  • --Common scale (unconditional): ANOVA, LR, delta
    plots
  • --Common scale (conditional): IRT

22
D.8 Provide technical evidence to support the
validity of the test in its adapted form.
  • The validity of a translated/adapted test cannot
    be assumed (one of the most common myths!), and
    many cultural factors can be present. The level
    of effort should be tied to the importance of the
    test and the cultural distance between the source
    and target groups.
  • cont.

23
D.8 Provide technical evidence to support the
validity of the test in its adapted form.
  • --Item analysis, reliability, and validity
    studies (content, criterion-related, construct)
    in relation to stated purposes are needed on the
    translated/adapted test.

24
D.9 Provide evidence of item equivalence in
multiple languages.
  • Lots of methodology here; mostly a simple
    extension of DIF methodology.
  • --Delta plots, standardized p-differences,
    b-value plots, Mantel-Haenszel, logistic
    regression, and much more.
  • Van de Vijver (2004) provides an excellent
    example, and grapples with the issue of effect
    sizes (not all statistically significant
    differences are consequential).
    cont.

25
D.9 Provide evidence of item equivalence in
multiple languages.
  • Zumbo (2004) shows that SEM procedures do not
    necessarily spot item-level DIF, so these
    analyses are very important (a logistic
    regression sketch follows below).
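
A minimal logistic-regression DIF sketch for a single item, assuming Python with pandas and statsmodels; the data and the built-in DIF effect are synthetic illustrations:

```python
# Sketch: logistic-regression DIF for one item, testing uniform (group)
# and nonuniform (group x score) effects. Data are random placeholders
# with a uniform DIF effect deliberately built in.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 600
df = pd.DataFrame({
    "total": rng.normal(size=n),              # matching variable
    "group": rng.integers(0, 2, size=n),      # 0 = source, 1 = target
})
logit = 0.8 * df["total"] + 0.4 * df["group"]  # 0.4 = injected uniform DIF
df["item"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

fit = smf.logit("item ~ total + group + total:group", data=df).fit(disp=0)
print(fit.summary())
# A significant `group` term signals uniform DIF;
# a significant `total:group` term signals nonuniform DIF.
```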

26
D.10 Non-equivalent items should be eliminated
from linking process.
  • If linking of scales is in the plan, watch for
    non-equivalent test items and eliminate them from
    the link. But the items can remain in the source
    language version.
  • --See the example in Figure 3 below.
  • --Items eliminated from the link may still be
    valuable in the source and target languages, with
    unique item statistics.

27
Figure 3. Delta plot with 40 anchor items. Linear
equating line: y = 1.09x - 0.44; major axis of the
ellipse: y = 1.11x - 0.66.
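
A minimal delta-plot sketch of the kind behind Figure 3, assuming Python with NumPy and SciPy; the 40 item p-values are synthetic, so the fitted line will not reproduce the figure's coefficients:

```python
# Sketch: delta plot for cross-language item comparison. Item p-values
# are converted to the ETS delta scale and the major axis of the
# bivariate ellipse is fitted; p-values below are random placeholders.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
p_source = rng.uniform(0.3, 0.9, size=40)   # 40 anchor items
p_target = np.clip(p_source + rng.normal(scale=0.05, size=40), 0.05, 0.95)

def to_delta(p):
    """ETS delta = 13 + 4 * z(proportion incorrect); hard items score high."""
    return 13 + 4 * norm.ppf(1 - p)

dx, dy = to_delta(p_source), to_delta(p_target)

# Major (principal) axis of the ellipse, analogous to the line in Figure 3.
sx2, sy2 = dx.var(ddof=1), dy.var(ddof=1)
cov = np.cov(dx, dy)[0, 1]
slope = (sy2 - sx2 + np.sqrt((sy2 - sx2) ** 2 + 4 * cov**2)) / (2 * cov)
intercept = dy.mean() - slope * dx.mean()
print(f"major axis: y = {slope:.2f}x + {intercept:+.2f}")

# Items far from the line (large perpendicular distance) are DIF suspects.
dist = np.abs(slope * dx - dy + intercept) / np.sqrt(slope**2 + 1)
print("flagged items:", np.where(dist > 1.5)[0])  # 1.5 is a common cutoff
```
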
28
A.1 Try to anticipate test administration
problems and eliminate them.
  • The next six test administration guidelines are
    clear.
  • --Empirical evidence is needed to support the
    claim of equivalence. Observations, interviews
    with respondents and administrators, local
    experts, small tryout, analysis of response
    times, practice effects, well-trained
    administrators, etc.

29
A.2 Be sensitive to problems with tests,
administration, format, etc. that might lower
validity.
  • This guideline is clear.
  • --Work from checklists of common problems (e.g.,
    hard words, format familiarity, rating scales,
    role of speed, separate answer sheets, using
    keyboards, availability of practice materials,
    etc.)

30
A.3 Eliminate aspects of the environment that
may impact on performance.
  • Watch for environmental factors that may impact
    on test performance and reduce validity.
  • --Again, observations, interviews, and checklists
    can be helpful. Will respondents be honest? Will
    they maximize performance, if achievement tests?
    Does the administrator have the skills to follow
    directions closely?

31
A.4 Minimize problems with test administration
directions.
  • Instructions can be problematic across cultural
    groups.
  • --Use a checklist to watch for clarity in the
    instructions to respondents (e.g., simple words,
    avoidance of the passive voice, specific rather
    than general directions, use of examples to
    explain item formats, etc.)

32
A.5 Identify in the manual the administration
details that need to be considered.
  • Test manuals need to reflect all the details of
    test administration.
  • --Does the test manual describe administration
    procedures that are understandable and based on
    field-test experience?
  • --Does the manual emphasize the need for
    standardization in administration?

33
A.6 Administrators need to be unobtrusive, and
examiner-examinee interaction minimized.
  • This guideline is about minimizing the role of
    the administrator: gender, ethnic background,
    age, etc.
  • --Were standardized procedures followed?
  • --Was training effective?
  • --Were local cultural norms respected?
  • --Were pilot studies carried out?

34
I.1 With adapted tests, document changes that
have been made, and evidence of equivalence.
  • It is important to keep records of the procedures
    used in the adaptation process, and of any
    changes made.
  • --A record of the process can be valuable; it
    becomes part of the argument for validity. For
    example, how were qualified translators
    identified?

35
I.2 Score differences should not be taken at
face value. Compile validity data to
substantiate the differences.
  • It is easy to interpret differences in terms of
    achievement, but why are there differences?
  • --Look at educational policies, resources, etc.
  • --Form a committee to interpret findings. (Need
    diversity of opinions.)
  • --Be aware of other research that might help the
    interpretations.
  • Chung (2004) with psychological tests in China.
    TIMSS and PISA studies.

36
I.3 Comparisons across populations can be made
at the level of invariance that is established
for the test.
  • Are scores in different populations linked to a
    common scale? If not, comparisons are
    problematic. (And statistical equating is a
    complicated process.)
  • --Are scores being interpreted at the level of
    invariance that has been established?

37
I.4 Specific suggestions for interpreting the
results need to be offered.
  • The idea here is that it is not good enough to
    produce the test results; a basis for
    interpretation should be offered.
  • --Are possible interpretations offered?
  • --Are cautions for misinterpretations offered?
  • --Are factors discussed that might impact on the
    results? cont.

38
I.4 Specific suggestions for interpreting the
results need to be offered.
  • --Hierarchical Linear Modeling (HLM) is being
    used to build causal models to explain results
    (a sketch follows below).
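
A minimal two-level sketch of this kind of model (students nested within countries), assuming Python with pandas and statsmodels' MixedLM; the variables and effect sizes are hypothetical:

```python
# Sketch: students nested within countries, with a country-level
# predictor of mean performance. All data are simulated placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_countries, n_students = 20, 50
resources = rng.normal(size=n_countries)        # country-level predictor
country_effect = rng.normal(scale=15, size=n_countries)

df = pd.DataFrame({
    "country": np.repeat(np.arange(n_countries), n_students),
    "resources": np.repeat(resources, n_students),
})
df["score"] = (500 + 20 * df["resources"]
               + np.repeat(country_effect, n_students)
               + rng.normal(scale=40, size=len(df)))

# Random intercept for country; fixed effect for the country-level factor.
fit = smf.mixedlm("score ~ resources", df, groups=df["country"]).fit()
print(fit.summary())
```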

39
Conclusions
  • Test adaptation practices should improve with
    methodology linked to the guidelines.
  • What's needed now are more comprehensive examples
    of how these guidelines are being applied, e.g.,
    the Grisay (2004) and Meara (2004) papers in this
    symposium.

40
Follow-Up Reading
  • See work being done by TIMSS and OECD/PISA
    (outstanding quality)
  • Language Testing (2004 special issue)
  • Hambleton et al. (2004). Adaptation of
    educational and psychological tests. Erlbaum.

41
Paper Request
  • Please contact the first author at
  • RKH@educ.umass.edu for a copy of the paper.