1. Evaluation and Control of Rater Reliability: Holistic vs. Analytic Scoring
EALTA, Athens, May 9-11, 2008
Claudia Harsch, IQB; Guido Martin, IEA DPC
2. Overview
- Background
  - Standards-based assessment in Germany; here: writing in EFL
  - Writing tasks and rating approach
- Feasibility Studies
  - Feasibility Study I, May 2007: trial of scales and approach
  - Feasibility Study II, June 2007: trial of holistic vs. analytic approach
  - Pilot Study, July/August 2007
- Training
- Comparison FS II vs. summer training
3. Overview
- Background
  - Standards-based assessment in Germany; here: writing in EFL
  - Writing tasks and rating approach
- Feasibility Studies
  - Feasibility Study I, May 2007: trial of scales and approach
  - Feasibility Study II, June 2007: trial of holistic vs. analytic approach
  - Pilot Study, July/August 2007
- Training
- Comparison FS II vs. summer training
4. Background: Assessing Educational Standards in Germany
- Evaluation of the Educational Standards for grades 9 and 10 by the IQB Berlin
- In foreign languages, the standards are linked to the CEF, targeting:
  - A2 for the lower track of secondary school
  - B1 for the middle track of secondary school
- Assessment of 4 skills: reading, listening, writing and speaking (under development)
- Tasks based on CEF levels A1 to C1: uni-level approach
5. Sample task: "Keeper", targeting B1
6. Assessment of Writing Tasks
- Criteria of assessment, each defined by descriptors based on the CEF, the Manual and Into Europe:
  - task fulfilment
  - organisation
  - grammar
  - vocabulary
  - overall impression
- Rating approach
  - A uni-level approach: each task is graded in line with its specific target level
  - Performance is graded on a below / pass / pass plus basis
  - "Holistic approach": ratings are the result of a weighted assessment of several descriptors per criterion (a sketch of this compilation follows below)
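To make the weighted compilation concrete, here is a minimal Python sketch. The descriptor names, the weights and the 0/1/2 coding of below / pass / pass plus are invented for the example; the actual descriptors and weightings were defined in the study's rating handbooks.

```python
# Hypothetical descriptor weights for one criterion (e.g. task fulfilment);
# names and weights are illustrative, not those of the study handbooks.
DESCRIPTOR_WEIGHTS = {
    "content points addressed": 0.4,
    "register appropriate":     0.3,
    "required length reached":  0.3,
}

def criterion_rating(descriptor_scores):
    """Compile one criterion rating (0 = below, 1 = pass, 2 = pass plus)
    as the weighted sum of descriptor scores, rounded back to the scale."""
    total = sum(DESCRIPTOR_WEIGHTS[name] * score
                for name, score in descriptor_scores.items())
    return round(total)

print(criterion_rating({
    "content points addressed": 2,
    "register appropriate":     1,
    "required length reached":  1,
}))  # -> 1 (pass): 0.4*2 + 0.3*1 + 0.3*1 = 1.4
```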
7. Overview
- Background
  - Standards-based assessment in Germany; here: writing in EFL
  - Writing tasks and rating approach
- Feasibility Studies
  - Feasibility Study I, May 2007: trial of scales and approach
  - Feasibility Study II, June 2007: trial of holistic vs. analytic approach
  - Pilot Study, July/August 2007
- Training
- Comparison FS II vs. summer training
8. Feasibility Study I, May 2007
- Aims
  - Trial the training and rating approach with student teachers
  - Gain insight into scales and criteria
  - Get feedback on the accessibility of handbooks, benchmarks and the coding software
- Procedure
  - 2 tasks: A2 "Lost dog" / B1 "Keeper for a day"
  - 6 raters: student teachers of English, proficient in writing English
  - First training session (1 day): introduction to the CEF, scales and tasks
  - Practice 1: 30 scripts per task (over 1 week)
  - Second training session (1 day): evaluation and discussion of practice results
  - Practice 2: 28 scripts per task (over 1 week)
  - Evaluation of results in terms of rating reliability
9. Feasibility Study I, May 2007
- Evaluation: Assessing Rater Reliability
- Index used: Percent Agreement with Mode
  - Measures the percentage of agreement with the value most often awarded, at the level of individual ratings
  - Can be aggregated at the item (variable) and rater level (see the sketch below)
  - Easily interpreted
  - No assumptions about scale level
  - No assumptions about value distributions
  - No estimation errors
  - Can be interpreted as a proxy for validity
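A minimal sketch of Percent Agreement with Mode, computed per rater and per item. The data layout, the 0/1/2 coding of below / pass / pass plus, and all function names are illustrative assumptions, not part of the original study materials.

```python
from collections import Counter

# Illustrative ratings matrix: rows = raters, columns = items,
# values code the below / pass / pass plus scale as 0 / 1 / 2.
ratings = [
    [1, 2, 1, 0],  # rater A
    [1, 1, 1, 0],  # rater B
    [1, 2, 0, 0],  # rater C
]

def item_modes(ratings):
    """Value most often awarded per item (ties broken arbitrarily)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*ratings)]

def agreement_per_rater(ratings):
    """Share of each rater's ratings that match the item mode."""
    modes = item_modes(ratings)
    return [sum(r == m for r, m in zip(row, modes)) / len(row)
            for row in ratings]

def agreement_per_item(ratings):
    """Share of ratings on each item that match that item's mode."""
    modes = item_modes(ratings)
    return [sum(r == m for r in col) / len(col)
            for col, m in zip(zip(*ratings), modes)]

print(agreement_per_rater(ratings))  # -> [1.0, 0.75, 0.75]
print(agreement_per_item(ratings))   # -> approx. [1.0, 0.67, 0.67, 1.0]
```

Both aggregations need no distributional assumptions, which is what makes the index easy to interpret even for a small rater panel.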
10. Outcome Feasibility Study I, May 2007
[Chart: reliability per item]
11. Outcome Feasibility Study I, May 2007
[Chart: reliability per rater and item]
12. Outcome Feasibility Study I, May 2007
- The approach appears feasible
- The scales seem to be usable and applicable
- BUT: we do not know what raters do at the sub-criterion level
- Need to further explore behaviour at the descriptor level -> Feasibility Study II
13. Overview
- Background
  - Standards-based assessment in Germany; here: writing in EFL
  - Writing tasks and rating approach
- Feasibility Studies
  - Feasibility Study I, May 2007: trial of scales and approach
  - Feasibility Study II, June 2007: trial of holistic vs. analytic approach
  - Pilot Study, July/August 2007
- Training
- Comparison FS II vs. summer training
14. Feasibility Study II, June 2007
- Comparison
  - Holistic scores for the five criteria (FS I)
  - Scoring each descriptor on its own and, in addition, scoring the criteria holistically (FS II)
- Rationale
  - With below / pass / pass plus in a uni-level approach targeting a specific population, there is a tendency towards the "pass" value
  - Similar outcomes can be achieved by purely random value distributions at the descriptor level (simulated in the sketch after this list)
  - Data on scoring each descriptor show whether raters interpret the descriptors uniformly before using them to compile the weighted overall criterion rating
  - Reliable usage of descriptors is a precondition for valid ratings at the criterion level
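The second rationale point can be checked with a small simulation: when the population tendency pulls values towards "pass", even purely random descriptor ratings compile into deceptively agreeable criterion scores. The distribution, panel size and compilation rule below are assumptions for illustration only.

```python
import random
from collections import Counter

random.seed(1)  # reproducible illustration

# Skewed value distribution on the 0/1/2 (below/pass/pass plus) scale,
# mimicking a population tendency towards "pass"; an assumption, not data.
VALUES, WEIGHTS = (0, 1, 2), (0.2, 0.6, 0.2)
N_RATERS, N_SCRIPTS, N_DESCRIPTORS = 6, 200, 4

def random_criterion():
    """Criterion score compiled from purely random descriptor scores."""
    scores = random.choices(VALUES, weights=WEIGHTS, k=N_DESCRIPTORS)
    return round(sum(scores) / N_DESCRIPTORS)

total = 0.0
for _ in range(N_SCRIPTS):
    panel = [random_criterion() for _ in range(N_RATERS)]
    _, count = Counter(panel).most_common(1)[0]
    total += count / N_RATERS

print(f"mean agreement with mode: {total / N_SCRIPTS:.2f}")
# Well above chance on a 3-point scale, although every descriptor
# rating was random: criterion-level agreement alone proves little.
```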
15. Outcome Feasibility Study II, June 2007
16. Outcome Feasibility Study II, June 2007
17. Outcome Feasibility Study II, June 2007
- The fairly high agreement on criterion-level ratings is NOT the result of a uniform interpretation of descriptors
- It rather results from the cancellation of deviations at the descriptor level during the compilation of the criterion ratings (a worked example follows below)
- Rating holistic criteria by evaluating several pre-defined descriptors can only be valid if the descriptors are understood uniformly by all raters
- The descriptors need to be revised
- Training and assessment in the pilot study have to be conducted at the descriptor level in order to control rating behaviour
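A tiny worked example of the cancellation effect, assuming equal descriptor weights and the 0/1/2 coding used above; the rater profiles are invented.

```python
# Two raters with opposite descriptor profiles compile to the same
# criterion rating; equal weights assumed purely for illustration.
rater_a = [2, 0, 2, 0]  # descriptor scores: high-low-high-low
rater_b = [0, 2, 0, 2]  # the exact opposite profile

def compile_criterion(scores):
    """Equal-weight compilation onto the 0/1/2 scale."""
    return round(sum(scores) / len(scores))

print(compile_criterion(rater_a), compile_criterion(rater_b))  # -> 1 1
# Both land on "pass": perfect criterion-level agreement masking
# total disagreement at the descriptor level.
```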
18. Overview
- Background
  - Standards-based assessment in Germany; here: writing in EFL
  - Writing tasks and rating approach
- Feasibility Studies
  - Feasibility Study I, May 2007: trial of scales and approach
  - Feasibility Study II, June 2007: trial of holistic vs. analytic approach
  - Pilot Study, July/August 2007
- Training
- Comparison FS II vs. summer training
19. Background: Pilot Study
- Sample size: N = 2,932
- Number of items
  - Listening: 349
  - Reading: 391
  - Writing: 19 tasks
  - n = 300-370 per item (M = 330)
- All Länder
- All school types
- 8th, 9th and 10th graders
20. Summer Training
- 13 raters, selected on the basis of English language proficiency, study background and the DPC coding test
- Challenge of piloting tasks, rating approach and scales simultaneously
- First one-week seminar
  - Introduction to the CEF, scales and tasks
  - Introduction to the rating procedures
  - Introduction to the benchmarks
21. Summer Training
- 6 one-day sessions
  - Weekly practice
  - Discussion and evaluation of practice results
  - Introduction of further tasks / levels
  - Revision of scale descriptors
- Five levels, 19 tasks: simultaneous introduction of several levels and tasks was necessary in order to control level and task interdependencies
- Three rounds of practice per task are ideal:
  1. Introduction + practice
  2. Feedback + practice
  3. Feedback + practice
  4. Evaluation of reliabilities
22. Training Progress: "Sports Accident", B1
23. Training Progress: "Sports Accident", B1
24. Summer Training
- Second one-week seminar
  - Feedback on the last round of practice
  - Addition of benchmarks for borderline cases
  - Addition of detailed justifications for benchmarks
  - Finalisation of scale descriptors
  - Revision of rating handbooks
25. Comparison: FS II vs. Training
26. Comparison: FS II vs. Training
27. Conclusion
- Training concept for the future
  - With materials prepared, weekly seminars are not necessary
  - Training and rating at the descriptor level
  - Multiple one-day sessions, one per week, to give time for practice:
    - Introduction
    - Practice: 3 rounds per task are ideal
    - Feedback
28. Thank you for your attention!
29. Claudia Harsch
Phone: +49 (0)30 2093-5508
Fax: +49 (0)30 2093-5336
E-mail: Claudia.Harsch_at_IQB.hu-berlin.de
Website: www.IQB.hu-berlin.de
Mail address: Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, GERMANY

Guido Martin
Phone: +49 (0)40 48 500 612
E-mail: guido.martin_at_iea-dpc.de
Website: www.iea-dpc.de
Mail address: IEA DPC, Mexikoring 37, D-22297 Hamburg, GERMANY