Learn more at: http://www.ealta.eu.org
1
Evaluation and Control of Rater Reliability: Holistic vs. Analytic Scoring
EALTA, Athens, May 9-11, 2008
Claudia Harsch, IQB; Guido Martin, IEA DPC
2
Overview
  • Background
  • - Standards-based assessment in Germany, here:
    Writing in EFL
  • Writing tasks and rating approach
  • Feasibility Studies
  • - Feasibility Study I, May 2007: trial scales and
    approach
  • - Feasibility Study II, June 2007: trial holistic
    vs. analytic approach
  • Pilot Study, July/August 2007
  • - Training
  • - Comparison FS II vs. Pilot Study Training

4
Background: Assessing ES in Germany
  • Evaluation of Educational Standards for grades 9
    and 10 by IQB Berlin
  • In Foreign Languages, standards are linked to the
    CEF, targeting
  • - A2 for lower track of secondary school
  • - B1 for middle track of secondary school
  • Assessment of 4 skills: reading, listening,
    writing and speaking (under development)
  • Tasks based on CEF levels A1 to C1: uni-level
    approach

5
Sample task: Keeper, targeting B1
6
Assessment of Writing Tasks
  • Criteria of assessment, each defined by
    descriptors based on CEF, Manual, Into Europe
  • - task fulfilment
  • - organisation
  • - grammar
  • - vocabulary
  • - overall impression
  • Rating approach
  • - A uni-level approach to grading the tasks in line
    with the specific target level
  • - Performance to be graded on a below / pass / pass
    plus basis
  • - "Holistic approach": ratings are the result of a
    weighted assessment of several descriptors per
    criterion

8
Feasibility Study I, May 2007
  • Aims
  • - Trial training / rating approach with student
    teachers
  • - Gain insight into scales and criteria
  • - Get feedback on accessibility of handbooks,
    benchmarks, coding software
  • Procedure
  • - 2 tasks: A2 "Lost dog" / B1 "Keeper for a day"
  • - 6 raters: student teachers of English, proficient
    in writing English
  • - First training session (1 day): introduction to
    CEF, scales and tasks
  • - Practice 1: 30 scripts per task (over 1 week)
  • - Second training session (1 day): evaluation and
    discussion of practice results
  • - Practice 2: 28 scripts per task (over 1 week)
  • - Evaluation of results in terms of rating
    reliability

9
Feasibility Study I, May 2007
  • Evaluation: Assessing Rater Reliability
  • Index used: Percent Agreement with Mode
  • - Measures the percentage of agreement with the
    value most often awarded, on the level of
    individual ratings
  • - Can be aggregated on item (variable) and rater
    level
  • - Easily interpreted
  • - No assumptions about scale level
  • - No assumptions about value distributions
  • - No estimation errors
  • - Can be interpreted as a proxy for validity
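The index described on this slide can be sketched in a few lines. This is a minimal illustration, not the software used in the study: the data layout, the function names, the rater labels and the numeric codes (0 = below, 1 = pass, 2 = pass plus) are all assumptions made for the example, and ties for the mode are broken by first occurrence, which the slides do not specify.

```python
from collections import Counter

def mode_value(values):
    """Most frequently awarded value (ties: first one encountered)."""
    return Counter(values).most_common(1)[0][0]

def percent_agreement_with_mode(ratings):
    """ratings maps (script, item) -> {rater: awarded value}.

    Each individual rating counts as a hit if it equals the mode of all
    ratings for the same script and item. Hits are then aggregated
    overall, per item and per rater, and reported as percentages.
    """
    hits_item, n_item = Counter(), Counter()
    hits_rater, n_rater = Counter(), Counter()
    for (script, item), by_rater in ratings.items():
        mode = mode_value(list(by_rater.values()))
        for rater, value in by_rater.items():
            agree = int(value == mode)
            hits_item[item] += agree
            n_item[item] += 1
            hits_rater[rater] += agree
            n_rater[rater] += 1
    per_item = {i: 100.0 * hits_item[i] / n_item[i] for i in n_item}
    per_rater = {r: 100.0 * hits_rater[r] / n_rater[r] for r in n_rater}
    overall = 100.0 * sum(hits_item.values()) / sum(n_item.values())
    return overall, per_item, per_rater

# Made-up ratings: two scripts, two criteria, four raters.
ratings = {
    ("s1", "grammar"):    {"R1": 1, "R2": 1, "R3": 1, "R4": 2},
    ("s2", "grammar"):    {"R1": 0, "R2": 1, "R3": 1, "R4": 1},
    ("s1", "vocabulary"): {"R1": 1, "R2": 1, "R3": 1, "R4": 1},
    ("s2", "vocabulary"): {"R1": 2, "R2": 2, "R3": 1, "R4": 2},
}
overall, per_item, per_rater = percent_agreement_with_mode(ratings)
print(overall)    # 81.25
print(per_item)   # {'grammar': 75.0, 'vocabulary': 87.5}
print(per_rater)  # {'R1': 75.0, 'R2': 100.0, 'R3': 75.0, 'R4': 75.0}
```

Because the index only counts matches with the most frequent value, it needs no assumption about scale level or value distribution, in line with the properties listed above.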

10
Outcome: Feasibility Study I, May 2007
Reliability per Item
11
Outcome: Feasibility Study I, May 2007
Reliability per Rater and Item
12
Outcome: Feasibility Study I, May 2007
  • Approach appears feasible
  • Scales seem to be usable and applicable
  • BUT: We do not know what raters do on the
    sub-criterion level
  • Need to further explore behaviour at descriptor
    level -> Feasibility Study II

14
Feasibility Study II, June 2007
  • Comparison
  • - Holistic scores for the five criteria (FS I)
  • - Scoring each descriptor on its own and, in
    addition, scoring the criteria holistically (FS
    II)
  • Reasons behind
  • - below / pass / pass plus in a uni-level
    approach targeting a specific population:
    tendency towards the pass value
  • - Similar outcomes can be achieved by purely random
    value distributions at the descriptor level
  • - Data on scoring each descriptor show whether
    raters interpret descriptors uniformly before
    using them to compile the weighted overall
    criterion rating
  • - Reliable usage of descriptors is a precondition
    for valid ratings on the criterion level
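The point about purely random descriptor values can be illustrated with a small simulation. It is a sketch under stated assumptions, not a reconstruction of the study: the slides do not give the rule by which descriptors are compiled into a criterion rating, so an unweighted, rounded mean of five descriptors is assumed here, along with six raters, the codes 0 = below, 1 = pass, 2 = pass plus, and a fixed random seed.

```python
import random
from collections import Counter

def agreement_with_mode(values):
    """Share of values equal to the most frequently awarded value."""
    return Counter(values).most_common(1)[0][1] / len(values)

def simulate(n_scripts=2000, n_raters=6, n_descriptors=5, seed=0):
    """Return (descriptor-level, criterion-level) percent agreement with mode."""
    rng = random.Random(seed)
    desc_agree, crit_agree = [], []
    for _ in range(n_scripts):
        # Every rater scores every descriptor purely at random: 0, 1 or 2.
        scores = [[rng.choice((0, 1, 2)) for _ in range(n_descriptors)]
                  for _ in range(n_raters)]
        # Descriptor level: compare raters one descriptor at a time.
        for d in range(n_descriptors):
            desc_agree.append(agreement_with_mode([row[d] for row in scores]))
        # Criterion level: ASSUMED compilation rule, a rounded unweighted mean.
        # With five descriptors the mean is never exactly x.5, so no rounding ties.
        criterion = [round(sum(row) / n_descriptors) for row in scores]
        crit_agree.append(agreement_with_mode(criterion))
    return (100 * sum(desc_agree) / len(desc_agree),
            100 * sum(crit_agree) / len(crit_agree))

descriptor_pct, criterion_pct = simulate()
print(f"descriptor level: {descriptor_pct:.1f}%")
print(f"criterion level:  {criterion_pct:.1f}%")
```

Under these assumptions, 201 of the 243 possible descriptor patterns round to "pass", so random raters agree far more often on the criterion than on any single descriptor. High criterion-level agreement alone therefore says little about how uniformly the descriptors are used.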

15
Outcome: Feasibility Study II, June 2007
16
Outcome: Feasibility Study II, June 2007
17
Outcome: Feasibility Study II, June 2007
  • Fairly high agreement on criterion-level ratings
    is NOT the result of uniform interpretation of
    descriptors
  • BUT rather results from cancellation of
    deviations on the descriptor level during the
    compilation of the criterion ratings
  • Rating holistic criteria by evaluating several
    pre-defined descriptors can only be valid if
    descriptors are understood uniformly by all
    raters
  • Descriptors need to be revised
  • Training and assessment in the pilot study have to
    be conducted on the descriptor level in order to
    be able to control rating behaviour
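The cancellation effect can be made concrete with a hand-built case. The scores, rater labels, equal weights and the rounded weighted mean used as the compilation rule are all hypothetical; the slides do not state the actual weighting.

```python
from collections import Counter

def pct_agreement_with_mode(values):
    """Percentage of values equal to the most frequently awarded value."""
    return 100.0 * Counter(values).most_common(1)[0][1] / len(values)

# Hypothetical descriptor scores for ONE script and ONE criterion
# (0 = below, 1 = pass, 2 = pass plus): four raters, four descriptors.
descriptor_scores = {
    "R1": [0, 1, 2, 1],
    "R2": [2, 1, 0, 1],
    "R3": [1, 2, 1, 0],
    "R4": [1, 0, 1, 2],
}
weights = [1, 1, 1, 1]  # assumed equal weights

def criterion_rating(scores, weights):
    """ASSUMED compilation rule: weighted mean of descriptor scores, rounded."""
    return round(sum(s * w for s, w in zip(scores, weights)) / sum(weights))

criterion = {r: criterion_rating(s, weights)
             for r, s in descriptor_scores.items()}
criterion_agreement = pct_agreement_with_mode(list(criterion.values()))
descriptor_agreement = [
    pct_agreement_with_mode([s[d] for s in descriptor_scores.values()])
    for d in range(len(weights))
]
print(criterion)             # {'R1': 1, 'R2': 1, 'R3': 1, 'R4': 1}
print(criterion_agreement)   # 100.0
print(descriptor_agreement)  # [50.0, 50.0, 50.0, 50.0]
```

All four raters arrive at "pass" for the criterion although they agree with the mode on only half of the ratings for each descriptor: their deviations cancel out when the descriptor scores are compiled.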

19
Background: Pilot Study
  • Sample size: N = 2932
  • Number of items
  • - Listening: 349
  • - Reading: 391
  • - Writing: 19 tasks
  • n = 300-370 / item (M = 330)
  • All Länder
  • All school types
  • 8th, 9th and 10th graders

20
Summer Training
  • 13 Raters, selected on the basis of English
    language proficiency, study background and DPC
    coding test
  • Challenge of piloting tasks, rating approach and
    scales simultaneously
  • First one-week seminar
  • - Introduction of CEF, scales and tasks
  • - Introduction of rating procedures
  • - Introduction of benchmarks

21
Summer Training
  • 6 one-day sessions
  • - Weekly practice
  • - Discussion and evaluation of practice results
  • - Introduction of further tasks / levels
  • - Revision of scale descriptors
  • Five levels, 19 tasks: simultaneous introduction
    of several levels and tasks necessary in order to
    control level and task interdependencies
  • Three rounds of practice per task ideal:
  • - 1. Intro, practice
  • - 2. Feedback, practice
  • - 3. Feedback, practice
  • - 4. Evaluation of reliabilities

22
Training Progress: "Sports Accident", B1
23
Training Progress: "Sports Accident", B1
24
Summer Training
  • Second one-week seminar
  • - Feedback on last round of practice
  • - Addition of benchmarks for borderline cases
  • - Addition of detailed justifications for
    benchmarks
  • - Finalisation of scale descriptors
  • - Revision of rating handbooks

25
Comparison FS II - Training
26
Comparison FS II - Training
27
Conclusion
  • Training concept for the future
  • Materials prepared: weekly seminars not
    necessary
  • Training and rating on descriptor level
  • Multiple one-day sessions, one per week, to give
    time for practice
  • - Introduction
  • - Practice: 3 rounds per task ideal
  • - Feedback

28
Thank you for your attention!
29
Claudia Harsch
Phone: +49 (0)30 2093-5508
Telefax: +49 (0)30 2093-5336
E-mail: Claudia.Harsch_at_IQB.hu-berlin.de
Website: www.IQB.hu-berlin.de
Mail address: Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, GERMANY

Guido Martin
Phone: +49 (0)40 48 500 612
E-mail: guido.martin_at_iea-dpc.de
Website: www.iea-dpc.de
Mail address: IEA DPC, Mexikoring 37, D-22297 Hamburg, GERMANY