Change and Stability in Educational Assessment: an Oxymoron

1
Change and Stability in Educational Assessment:
an Oxymoron?
  • Alina A. von Davier
  • Educational Testing Service
  • April 24th, 2008
  • Institute of Educational Assessors
  • National Conference

2
Change in Educational Assessment
  • There are many major educational assessments that
    have been shaping society for decades
  • Maintaining a test that is in sync with the
    curriculum, shifting demographics, and
    society's fluid expectations requires adapting
    the testing instruments
  • Examples of long-term assessments: SAT I, GRE,
    TOEFL

3
Stability in Educational Assessment
  • The questions are:
  • Whether the testing instrument can be improved or
    adapted without changing the meaning of the
    reported scores
  • How to relate scores on the old and new versions
    of the test (linking)
  • What claims the score linking might or might not
    support

4
Outline
  • Present an overview of the educational assessment
    transition process
  • Discuss statistical and policy implications of
    the process components
  • Outline several techniques for investigating the
    relationship between the test scores (linking)
    when tests undergo major changes

5
Frequently Asked Questions
  • What are the consequences of change for score
    meaning?
  • Why should we care about whether score meaning is
    changed? How will that affect the decisions we
    want to make?
  • How do we find out if score meaning has been
    affected?
  • What can we do to preserve score meaning across
    changes?

6
Change Purpose
Communication and Accountability
Test Purpose
Change Specification
Tools and Processes
Constraints and Quality

7
Communication and Accountability
  • "Accountability is the glue which holds the
    examinations and testing system together." (David
    Gee, from the Keynote Address at IEA in 2007)
  • "The problem with communication ... is the
    illusion that it has been accomplished." (George
    Bernard Shaw)

8
Communication and Accountability (cont.)
  • The Assessment Agency (AA) should communicate the
    goals and requirements to the Testing Agency
    (TA)
  • In turn, the TA should provide suggestions,
    solutions, and risk assessment
  • The TA chooses the tools and processes and
    validates them with the AA

9
Running Example
  • The AA plans a change in the administration mode
    for a national end-of-course test
  • The change is to administer the test on computer
    (CB) instead of using paper-and-pencil (PP)
  • The change was recommended by the experts in
    education, the Board of the Assessment, etc.
  • The AA and a chosen TA begin the dialog

10
Change Purpose
Communication and Accountability
Test Purpose
Change Specification
Tools and Processes
Constraints and Quality

11
Test Purpose
  • "The results of the test will be used for a
    myriad of purposes ranging from helping
    students to improve their learning, assessing
    school performance, influencing house prices,
    and, yes, even rating the effectiveness of
    politicians. It is important to ensure that
    the data outputs from assessment are not put to
    too many diverse purposes." (David Gee, from the
    Keynote Address at IEA in 2007)

12
Test Purpose (cont.)
  • The purposes of the existing test should be
    carefully discussed by the AA and TA
  • An assessment should be constructed to support a
    primary use
  • Additional purposes need to be explicit and
    considered

13
Test Purpose: Examples
  • Survey assessments: group-level reporting;
    low-stakes for the test takers, high-stakes for
    states/countries
  • Summative/achievement tests: individual
    reporting; precision at all score/ability
    levels, including at cut-score(s); high-stakes
    for the test takers and for schools/districts
  • End-of-course tests: individual reporting;
    precision at all score/ability levels and at
    cut-score(s); high-stakes for the test takers

14
Test Purpose (cont.)
  • Licensure and certification tests: individual
    reporting; precision at cut-score(s);
    high-stakes for the test takers
  • Placement or locator tests: individual
    reporting; age-independent tests; multiple
    cut-scores; potentially high-stakes for the
    test takers
  • Formative assessments: individual reporting;
    multiple cut-scores; low-stakes for the test
    takers (though the stakes are increasing)

15
Running Example (cont.)
  • The AA communicates the main purpose of the test
    and describes it in detail:
  • End-of-course test
  • Individual reporting
  • Precision at all ability levels and at
    cut-score(s)
  • High-stakes for the test takers
  • The AA communicates other uses of the test:
  • School accountability
  • Teacher evaluation

16
Change Purpose
Communication and Accountability
Test Purpose
Change Specification
Tools and Processes
Constraints and Quality

17
Change Purpose
  • "At some point early in the redesign process, the
    organization must make a conscious decision
    about what is most important in the test
    revision ... All the revisions and data collections
    should be guided by this redesign principle."
    (Liu and Walker, 2007)

18
Change Purpose (cont.)
  • Administration mode
  • Content
  • Populations
  • Format

19
Running Example (cont.)
  • The change is to administer the test on computer
    instead of using paper-and-pencil
  • The other aspects of the tests should not change

20
Change Purpose
Communication and Accountability
Test Purpose
Change Specification
Tools and Processes
Constraints and Quality

21
Change Specifications and Claims: Standard 4.16
(AERA, APA, NCME, 1999)
  • "If test specifications are changed from one
    version of a test to a subsequent version, such
    changes should be identified in the test manual,
    and an indication should be given that converted
    scores for the two versions may not be strictly
    equivalent. When substantial changes in test
    specifications occur, either scores should be
    reported on a new scale or a clear statement
    should be provided to alert users that the scores
    are not directly comparable with those on earlier
    versions of the test." (Standards for Educational
    and Psychological Testing)

22
Change Specifications and Claims (cont.)
  • Claims about content comparability (expert
    judgment, factor structure, correlations,
    validity)
  • Claims about score comparability (equating,
    linking, concordance, reliability, or standard
    setting)
  • Equating: it should be a matter of indifference
    to a test taker which version of the test he or
    she takes

23
Change Specifications and Claims (cont.)
  • Claims about the scale scores
  • The meaning of the reporting scale persists over
    time even though norms change -> Rescaling?
  • New/additional linkings to other existing tests

24
Running Example (cont.)
  • The content of the test should not change
  • The test should measure the same construct
  • The scores should be interchangeable: it should
    be a matter of indifference to a test taker
    whether the test is CB or PP (equating)
  • The meaning of the scores should be preserved
    because test users want to make the same
    inferences (scaling)
  • The precision at all ability/score levels should
    be maintained

25
Change Purpose
Communication and Accountability
Test Purpose
Change Specification
Tools and Processes
Constraints and Quality

26
Constraints and Quality
  • Cost
  • Testing time
  • Technology availability
  • Security issues
  • Item disclosure policy
  • Inability to pretest items

27
Constraints and Quality (cont.)
  • Time for reporting scores
  • Request for Constructed Response (CR) items
  • Ability to mark the CR items
  • Rater reliability, training, availability of
    raters
  • Sample sizes for the special studies for the
    tests
  • Accessibility (for, say, students with
    disabilities or ESL)

28
Running Example (cont.)
  • The test will be taken on computers with monitors
    of at least 12 inches and keyboards of a
    standard size
  • The number of items displayed on one screen is
    the same as the number of items on the test taken
    on paper
  • A sufficiently large sample of motivated test
    takers should be available for the special
    studies
  • The computer-based test should be as reliable as
    the paper-based test
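
One way to check the reliability requirement above is to compare an internal-consistency coefficient across the two modes. The sketch below uses Cronbach's alpha, which is an assumption (the slides do not name a reliability index), and the response tables are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a table with one row per test taker, one column per item."""
    n_items = len(responses[0])
    # Variance of each item's scores across test takers
    item_variances = [pvariance([row[j] for row in responses]) for j in range(n_items)]
    # Variance of the total scores
    total_variance = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical right/wrong (1/0) responses of five test takers to four items
pp_responses = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
cb_responses = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]

alpha_pp = cronbach_alpha(pp_responses)
alpha_cb = cronbach_alpha(cb_responses)
print(f"alpha PP = {alpha_pp:.2f}, alpha CB = {alpha_cb:.2f}")
```

In a real study the comparison would rest on large samples and would also report the uncertainty of each coefficient.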

29
Running Example (cont.)
  • The test should not be speeded
  • The items should be as difficult as before
  • All schools at the national level should have the
    needed technology available
  • All students should have been trained to take a
    test on a computer
  • Training materials have been made available

30
Running Example (cont.)
  • Ability to write and mark the open-ended
    questions on the computer
  • The raters have received the proper training
  • The ability to manage the data, merge the files,
    clean the data on the computer
  • Special accommodations

31
Change Purpose
Communication and Accountability
Test Purpose
Change Specification
Tools and Processes
Constraints and Quality

32
Tools and Processes: Overview
  • Special studies: data collection design; equity
    and fairness
  • Linking versus Rescaling
  • Statistical methods and their assumptions
  • Software

33
Tools and Processes: Special Studies
  • Data collection
  • Sample size
  • Sample representativeness
  • Attrition (in a CB vs. PP study, one builds a
    randomized assignment to one mode or the other,
    but some disappointed students assigned to PP
    might drop out of the assessment; Eignor, 2007)
  • Design choices: counterbalanced, equivalent
    groups, common items
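
The randomized assignment and the attrition concern can be illustrated with a small sketch. The function names, the seed, and the drop-out pattern are all hypothetical; the point is only that attrition should be tracked separately for each mode, because unequal drop-out undermines the equivalence of the groups.

```python
import random


def assign_modes(test_taker_ids, seed=0):
    """Randomly split test takers into two equal-sized groups, one per mode."""
    ids = list(test_taker_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return {tid: ("CB" if i < half else "PP") for i, tid in enumerate(ids)}


def attrition_by_mode(assignment, completed_ids):
    """Share of assigned test takers in each mode who did not complete the test."""
    completed = set(completed_ids)
    rates = {}
    for mode in ("CB", "PP"):
        group = [tid for tid, m in assignment.items() if m == mode]
        dropped = [tid for tid in group if tid not in completed]
        rates[mode] = len(dropped) / len(group)
    return rates


assignment = assign_modes(range(1, 101), seed=42)
# Hypothetical outcome: everyone assigned to CB finishes, ten PP test takers drop out
pp_group = sorted(tid for tid, m in assignment.items() if m == "PP")
completed = [tid for tid in assignment if tid not in pp_group[:10]]
rates = attrition_by_mode(assignment, completed)
print(rates)
```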

34
Tools and Processes: Special Studies (cont.)
  • Equity and fairness
  • Check the claims at the item and test levels, for
    example by computing DIF indices and equating
    population invariance indices with respect to
    the background variables of interest
  • Choose appropriate methods and check their
    assumptions
  • Interpretation of results and recommendations
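
As an illustration of an item-level DIF index, here is a minimal sketch of the Mantel-Haenszel statistic on the delta scale (MH D-DIF = -2.35 ln of the common odds ratio). Choosing Mantel-Haenszel is an assumption, since the slides do not name a specific index, and all counts are hypothetical. Test takers are matched on total score, and a negative value indicates an item that is harder for the focal group than for matched members of the reference group.

```python
import math
from collections import defaultdict


def mh_d_dif(records):
    """records: (matching_score, group, correct), group 'ref' or 'focal', correct 0 or 1."""
    # One 2x2 table per score level: [ref right, ref wrong, focal right, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for score, group, correct in records:
        column = (0 if group == "ref" else 2) + (0 if correct else 1)
        tables[score][column] += 1
    numerator = denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in tables.values():
        n = ref_right + ref_wrong + focal_right + focal_wrong
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    common_odds_ratio = numerator / denominator
    return -2.35 * math.log(common_odds_ratio)


def expand(score, group, n_right, n_wrong):
    """Turn cell counts into individual records (helper for the example only)."""
    return [(score, group, 1)] * n_right + [(score, group, 0)] * n_wrong


# Hypothetical counts: same odds in both groups at every score level
fair_item = (expand(5, "ref", 6, 4) + expand(5, "focal", 3, 2)
             + expand(8, "ref", 8, 2) + expand(8, "focal", 4, 1))
# Hypothetical counts: matched focal-group members succeed less often
flagged_item = expand(5, "ref", 8, 2) + expand(5, "focal", 5, 5)
print(mh_d_dif(fair_item), mh_d_dif(flagged_item))
```

The sketch does not handle empty cells (a zero denominator) and omits the standard error that an operational DIF analysis would report.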

35
Tools and Processes: Linking versus Rescaling
  • "Rescaling might be a psychometric issue, but
    decisions about whether to rescale are seldom
    made by psychometricians." (R. Brennan, 2007,
    p. 174)

36
Tools and Processes: Linking vs. Rescaling (cont.)
  • When is linking not enough?
  • Adapted example (inspired by Golub-Smith,
    2005)
  • The scale for an achievement test was set many
    years ago on the cohort that took the test at
    that time
  • The mean of the scale was set to 400 for both
    Verbal and Math
  • Meanwhile, the test-taker population and the
    mode changed
  • The mean of the current norm is 320 for Verbal
    and 470 for Math
  • So a test taker who scores 400 on both Verbal
    and Math today is actually above average on
    Verbal and below average on Math, although
    tradition may suggest that the score is
    average
  • In a case like this, psychometricians usually
    suggest that a rescaling is needed
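
The arithmetic of this example can be sketched as follows. The means (400, 320, 470) come from the slide; the standard deviation of 100 and the choice of a linear rescaling that re-centers the current norm at 400 are assumptions made only for illustration.

```python
def deviation_from_norm(score, current_mean):
    """Distance of a score from the mean of the current norm group."""
    return score - current_mean


def rescale(score, current_mean, current_sd, target_mean, target_sd):
    """Linear rescaling that gives the current norm group the target mean and SD."""
    return target_mean + target_sd * (score - current_mean) / current_sd


current_means = {"Verbal": 320, "Math": 470}
ASSUMED_SD = 100  # hypothetical: the slide gives no standard deviation

results = {}
for section, current_mean in current_means.items():
    results[section] = (
        deviation_from_norm(400, current_mean),
        rescale(400, current_mean, ASSUMED_SD, 400, ASSUMED_SD),
    )
    print(section, results[section])
```

Under these assumptions a score of 400 sits 80 points above the current Verbal mean and 70 points below the current Math mean, so on a re-centered scale it would be reported as 480 and 330 respectively.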

37
Tools and Processes: Statistical Methods and Their
Assumptions
  • "To a man with a hammer, any problem is a nail"
  • The solution should be chosen through a
    data-centered approach rather than through a
    particular hammer
  • The methods should support the claims

38
Tools and Processes: Methods and Assumptions (cont.)
  • Usually, the set of assumptions needed for a
    particular method needs to be checked against the
    constraints.
  • The TA needs to assess the risk of not meeting
    one or more assumptions and needs to provide
    alternative suggestions and methods

39
Tools and Processes: Methods (Classical Test
Theory-based)
  • Traditional linear and equipercentile methods
  • Kernel equating method
  • Diagnostic and accuracy methods
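
As a minimal sketch of the traditional linear method, the function below maps a score on the new form to the scale of the old form by matching means and standard deviations. It assumes the two score samples come from equivalent groups (or from the same test takers in a counterbalanced design); the scores are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, new_scores, old_scores):
    """Map score x on the new form to the old form's scale by matching mean and SD."""
    slope = pstdev(old_scores) / pstdev(new_scores)
    return mean(old_scores) + slope * (x - mean(new_scores))


# Hypothetical raw scores on the computer-based (new) and paper (old) forms
cb_scores = [12, 15, 18, 21, 24]
pp_scores = [14, 18, 22, 26, 30]

for x in (12, 18, 21):
    print(x, "->", round(linear_equate(x, cb_scores, pp_scores), 2))
```

Equipercentile and kernel equating relax the linearity assumption by matching the whole score distributions rather than only the first two moments.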

40
Tools and Processes: Methods (IRT-based)
  • The choice of the appropriate IRT models
  • The choice of the estimation methods (related to
    the software below)
  • The choice of linking method (concurrent
    calibration, separate calibration with the
    Stocking-Lord method)
  • The choice of equating (IRT true-score, IRT
    observed-score, or linear transformation)
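
To make the separate-calibration option concrete, here is a rough sketch of the idea behind the Stocking-Lord criterion: find the slope A and intercept B that bring the test characteristic curve of the common items, as calibrated on the new scale, as close as possible to the curve from the old calibration. Several things are assumptions for illustration only: the 2PL model, the unweighted ability grid, the hypothetical item parameters, and the crude grid search, which stands in for the numerical optimizer an operational program would use.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at ability theta."""
    return sum(p_correct(theta, a, b) for a, b in items)


def stocking_lord_loss(A, B, old_items, new_items, thetas):
    # Move new-scale parameters onto the old scale: a* = a / A, b* = A * b + B
    moved = [(a / A, A * b + B) for a, b in new_items]
    return sum((tcc(t, old_items) - tcc(t, moved)) ** 2 for t in thetas)


def grid_search(old_items, new_items):
    thetas = [-4 + 0.5 * i for i in range(17)]  # ability grid from -4 to 4
    best = None
    for i in range(50, 151, 5):          # slope A from 0.50 to 1.50
        for j in range(-100, 101, 5):    # intercept B from -1.00 to 1.00
            A, B = i / 100, j / 100
            loss = stocking_lord_loss(A, B, old_items, new_items, thetas)
            if best is None or loss < best[0]:
                best = (loss, A, B)
    return best[1], best[2]


# Hypothetical (a, b) parameters of three common items on the old scale
old_items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
TRUE_A, TRUE_B = 1.2, 0.3  # transformation used to simulate the new calibration
new_items = [(a * TRUE_A, (b - TRUE_B) / TRUE_A) for a, b in old_items]

A_hat, B_hat = grid_search(old_items, new_items)
print(A_hat, B_hat)
```

Because the simulated new-scale parameters are an exact transformation of the old ones, the search recovers the slope and intercept used to generate them; with real calibrations the minimum loss would be above zero.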

41
Tools and Processes: High-quality software
  • Appropriate for the chosen method
  • Estimation methods
  • License and availability
  • Training and interface

42
Running Example (cont.)
  • Design: a sufficiently large and representative
    sample of motivated test takers who take both
    versions of the test in different orders
    (counterbalanced design)
  • Exploratory analyses: correlations, distribution
    characteristics, order effects analyses, factor
    analyses, DIF
  • If the tests are similar in difficulty, if the
    factorial structure is the same, if the items
    function similarly, if there are no particular
    order effects, if the tests are similarly
    reliable, and if there are no subgroup effects,
    THEN equating is possible

43
Running Example (cont.)
  • Which equating method?
  • If possible, one would choose a method based on
    the same principles as the scoring method for
    the PP test
  • Say, classical test theory-based: choose
    appropriate models to fit the data; choose an
    equating function that a) preserves the same
    pass/fail proportion and similar distributions at
    the highest scores as in the previous
    administrations, b) has the smallest standard
    errors, c) shows no dependence on the subgroups
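
Criterion a) can be checked with a simple sketch: given the pass rate at the PP cut score, find the CB raw score that yields the closest pass rate. This assumes the CB and PP samples are equivalent groups, and the scores and cut score below are hypothetical; it illustrates the pass/fail check only, not a full equating function.

```python
def pass_rate(scores, cut):
    """Proportion of test takers scoring at or above the cut score."""
    return sum(s >= cut for s in scores) / len(scores)


def matching_cut(pp_scores, pp_cut, cb_scores):
    """CB raw score whose pass rate is closest to the PP pass rate at pp_cut."""
    target = pass_rate(pp_scores, pp_cut)
    candidates = sorted(set(cb_scores))
    return min(candidates, key=lambda c: abs(pass_rate(cb_scores, c) - target))


# Hypothetical raw scores; here the CB form runs one point harder
pp_scores = list(range(1, 11))   # 1..10
cb_scores = list(range(0, 10))   # 0..9

cb_cut = matching_cut(pp_scores, pp_cut=6, cb_scores=cb_scores)
print(cb_cut, pass_rate(pp_scores, 6), pass_rate(cb_scores, cb_cut))
```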

44
Running Example (cont.)
  • Investigate the scaling method
  • If everything goes well, recommend the use of the
    newly produced conversion table
  • If the analyses support the claims, then the
    scores can be used interchangeably
  • In this case, there is no need to rescale

45
Change Purpose
Communication and Accountability
Test Purpose
Change Specification
Tools and Processes
Constraints and Quality

46
Discussion
  • The dialectic of the stability of the reporting
    scale and continuous changes of various aspects
    of the testing instruments of state or national
    assessments is a potential basis for informative
    analyses not only for the test practitioners, but
    also for the policy makers.

47
Discussion (cont.)
  • Nowadays, when more standardized testing is used
    for assessing competencies in different domains
    nationally and internationally, we are also
    discovering more challenges in ensuring that the
    process and the results are fair and accurate.