1. Evaluation and Control of Rater Reliability: Holistic vs. Analytic Scoring
EALTA, Athens, May 9-11, 2008
Claudia Harsch, IQB; Guido Martin, IEA DPC
2. Overview
- Background
  - Standards-based assessment in Germany; here: writing in EFL
  - Writing tasks and rating approach
- Feasibility Studies
  - Feasibility Study I, May 2007: trial of scales and approach
  - Feasibility Study II, June 2007: trial of holistic vs. analytic approach
  - Pilot Study, July/August 2007
- Training
- Comparison FS II vs. summer training
3. Overview
- Background
  - Standards-based assessment in Germany; here: writing in EFL
  - Writing tasks and rating approach
- Feasibility Studies
  - Feasibility Study I, May 2007: trial of scales and approach
  - Feasibility Study II, June 2007: trial of holistic vs. analytic approach
  - Pilot Study, July/August 2007
- Training
- Comparison FS II vs. summer training
4. Background: Assessing Educational Standards in Germany
- Evaluation of the Educational Standards for grades 9 and 10 by the IQB Berlin
- In foreign languages, the standards are linked to the CEF, targeting:
  - A2 for the lower track of secondary school
  - B1 for the middle track of secondary school
- Assessment of 4 skills: reading, listening, writing and speaking (under development)
- Tasks based on CEF levels A1 to C1: uni-level approach
5. Sample task: "Keeper", targeting B1
6. Assessment of Writing Tasks
- Criteria of assessment, each defined by descriptors based on the CEF, the Manual and Into Europe:
  - task fulfilment
  - organisation
  - grammar
  - vocabulary
  - overall impression
- Rating approach
  - A uni-level approach: each task is graded in line with its specific target level
  - Performance is graded on a below / pass / pass plus basis
  - "Holistic approach": ratings are the result of a weighted assessment of several descriptors per criterion (a sketch of this compilation follows below)
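To make the weighted compilation concrete, here is a minimal Python sketch. The descriptor names, the weights and the 0/1/2 coding of below / pass / pass plus are invented for the example; the actual descriptors and weightings were defined in the study's rating handbooks.

```python
# Hypothetical descriptor weights for one criterion (e.g. task fulfilment);
# names and weights are illustrative, not those of the study handbooks.
DESCRIPTOR_WEIGHTS = {
    "content points addressed": 0.4,
    "register appropriate":     0.3,
    "required length reached":  0.3,
}

def criterion_rating(descriptor_scores):
    """Compile one criterion rating (0 = below, 1 = pass, 2 = pass plus)
    as the weighted sum of descriptor scores, rounded back to the scale."""
    total = sum(DESCRIPTOR_WEIGHTS[name] * score
                for name, score in descriptor_scores.items())
    return round(total)

print(criterion_rating({
    "content points addressed": 2,
    "register appropriate":     1,
    "required length reached":  1,
}))  # -> 1 (pass): 0.4*2 + 0.3*1 + 0.3*1 = 1.4
```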
7. Overview
- Background
  - Standards-based assessment in Germany; here: writing in EFL
  - Writing tasks and rating approach
- Feasibility Studies
  - Feasibility Study I, May 2007: trial of scales and approach
  - Feasibility Study II, June 2007: trial of holistic vs. analytic approach
  - Pilot Study, July/August 2007
- Training
- Comparison FS II vs. summer training
8. Feasibility Study I, May 2007
- Aims
  - Trial the training and rating approach with student teachers
  - Gain insight into scales and criteria
  - Get feedback on the accessibility of handbooks, benchmarks and the coding software
- Procedure
  - 2 tasks: A2 "Lost dog" / B1 "Keeper for a day"
  - 6 raters: student teachers of English, proficient in writing English
  - First training session (1 day): introduction to the CEF, scales and tasks
  - Practice 1: 30 scripts per task (over 1 week)
  - Second training session (1 day): evaluation and discussion of practice results
  - Practice 2: 28 scripts per task (over 1 week)
  - Evaluation of results in terms of rating reliability
9. Feasibility Study I, May 2007
- Evaluation: Assessing Rater Reliability
- Index used: Percent Agreement with Mode
  - Measures the percentage of agreement with the value most often awarded, at the level of individual ratings
  - Can be aggregated at the item (variable) and rater level (see the sketch below)
  - Easily interpreted
  - No assumptions about scale level
  - No assumptions about value distributions
  - No estimation errors
  - Can be interpreted as a proxy for validity
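A minimal sketch of Percent Agreement with Mode, computed per rater and per item. The data layout, the 0/1/2 coding of below / pass / pass plus, and all function names are illustrative assumptions, not part of the original study materials.

```python
from collections import Counter

# Illustrative ratings matrix: rows = raters, columns = items,
# values code the below / pass / pass plus scale as 0 / 1 / 2.
ratings = [
    [1, 2, 1, 0],  # rater A
    [1, 1, 1, 0],  # rater B
    [1, 2, 0, 0],  # rater C
]

def item_modes(ratings):
    """Value most often awarded per item (ties broken arbitrarily)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*ratings)]

def agreement_per_rater(ratings):
    """Share of each rater's ratings that match the item mode."""
    modes = item_modes(ratings)
    return [sum(r == m for r, m in zip(row, modes)) / len(row)
            for row in ratings]

def agreement_per_item(ratings):
    """Share of ratings on each item that match that item's mode."""
    modes = item_modes(ratings)
    return [sum(r == m for r in col) / len(col)
            for col, m in zip(zip(*ratings), modes)]

print(agreement_per_rater(ratings))  # -> [1.0, 0.75, 0.75]
print(agreement_per_item(ratings))   # -> approx. [1.0, 0.67, 0.67, 1.0]
```

Both aggregations need no distributional assumptions, which is what makes the index easy to interpret even for a small rater panel.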
10. Outcome Feasibility Study I, May 2007
[Chart: reliability per item]
11. Outcome Feasibility Study I, May 2007
[Chart: reliability per rater and item]
12. Outcome Feasibility Study I, May 2007
- The approach appears feasible
- The scales seem to be usable and applicable
- BUT: we do not know what raters do at the sub-criterion level
- Need to further explore behaviour at the descriptor level -> Feasibility Study II
13. Overview
- Background
  - Standards-based assessment in Germany; here: writing in EFL
  - Writing tasks and rating approach
- Feasibility Studies
  - Feasibility Study I, May 2007: trial of scales and approach
  - Feasibility Study II, June 2007: trial of holistic vs. analytic approach
  - Pilot Study, July/August 2007
- Training
- Comparison FS II vs. summer training
14. Feasibility Study II, June 2007
- Comparison
  - Holistic scores for the five criteria (FS I)
  - Scoring each descriptor on its own and, in addition, scoring the criteria holistically (FS II)
- Rationale
  - With below / pass / pass plus in a uni-level approach targeting a specific population, there is a tendency towards the "pass" value
  - Similar outcomes can be achieved by purely random value distributions at the descriptor level (simulated in the sketch after this list)
  - Data on scoring each descriptor show whether raters interpret the descriptors uniformly before using them to compile the weighted overall criterion rating
  - Reliable usage of descriptors is a precondition for valid ratings at the criterion level
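The second rationale point can be checked with a small simulation: when the population tendency pulls values towards "pass", even purely random descriptor ratings compile into deceptively agreeable criterion scores. The distribution, panel size and compilation rule below are assumptions for illustration only.

```python
import random
from collections import Counter

random.seed(1)  # reproducible illustration

# Skewed value distribution on the 0/1/2 (below/pass/pass plus) scale,
# mimicking a population tendency towards "pass"; an assumption, not data.
VALUES, WEIGHTS = (0, 1, 2), (0.2, 0.6, 0.2)
N_RATERS, N_SCRIPTS, N_DESCRIPTORS = 6, 200, 4

def random_criterion():
    """Criterion score compiled from purely random descriptor scores."""
    scores = random.choices(VALUES, weights=WEIGHTS, k=N_DESCRIPTORS)
    return round(sum(scores) / N_DESCRIPTORS)

total = 0.0
for _ in range(N_SCRIPTS):
    panel = [random_criterion() for _ in range(N_RATERS)]
    _, count = Counter(panel).most_common(1)[0]
    total += count / N_RATERS

print(f"mean agreement with mode: {total / N_SCRIPTS:.2f}")
# Well above chance on a 3-point scale, although every descriptor
# rating was random: criterion-level agreement alone proves little.
```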
15. Outcome Feasibility Study II, June 2007
16. Outcome Feasibility Study II, June 2007
17. Outcome Feasibility Study II, June 2007
- The fairly high agreement on criterion-level ratings is NOT the result of a uniform interpretation of descriptors
- It rather results from the cancellation of deviations at the descriptor level during the compilation of the criterion ratings (a worked example follows below)
- Rating holistic criteria by evaluating several pre-defined descriptors can only be valid if the descriptors are understood uniformly by all raters
- The descriptors need to be revised
- Training and assessment in the pilot study have to be conducted at the descriptor level in order to control rating behaviour
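A tiny worked example of the cancellation effect, assuming equal descriptor weights and the 0/1/2 coding used above; the rater profiles are invented.

```python
# Two raters with opposite descriptor profiles compile to the same
# criterion rating; equal weights assumed purely for illustration.
rater_a = [2, 0, 2, 0]  # descriptor scores: high-low-high-low
rater_b = [0, 2, 0, 2]  # the exact opposite profile

def compile_criterion(scores):
    """Equal-weight compilation onto the 0/1/2 scale."""
    return round(sum(scores) / len(scores))

print(compile_criterion(rater_a), compile_criterion(rater_b))  # -> 1 1
# Both land on "pass": perfect criterion-level agreement masking
# total disagreement at the descriptor level.
```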
18. Overview
- Background
  - Standards-based assessment in Germany; here: writing in EFL
  - Writing tasks and rating approach
- Feasibility Studies
  - Feasibility Study I, May 2007: trial of scales and approach
  - Feasibility Study II, June 2007: trial of holistic vs. analytic approach
  - Pilot Study, July/August 2007
- Training
- Comparison FS II vs. summer training
19. Background: Pilot Study
- Sample size: N = 2,932
- Number of items
  - Listening: 349
  - Reading: 391
  - Writing: 19 tasks
  - n = 300-370 per item (M = 330)
- All Länder
- All school types
- 8th, 9th and 10th graders
20. Summer Training
- 13 raters, selected on the basis of English language proficiency, study background and the DPC coding test
- Challenge of piloting tasks, rating approach and scales simultaneously
- First one-week seminar
  - Introduction to the CEF, scales and tasks
  - Introduction to the rating procedures
  - Introduction to the benchmarks
21. Summer Training
- 6 one-day sessions
  - Weekly practice
  - Discussion and evaluation of practice results
  - Introduction of further tasks / levels
  - Revision of scale descriptors
- Five levels, 19 tasks: simultaneous introduction of several levels and tasks was necessary in order to control level and task interdependencies
- Three rounds of practice per task are ideal:
  1. Introduction + practice
  2. Feedback + practice
  3. Feedback + practice
  4. Evaluation of reliabilities
22. Training Progress: "Sports Accident", B1
23. Training Progress: "Sports Accident", B1
24. Summer Training
- Second one-week seminar
  - Feedback on the last round of practice
  - Addition of benchmarks for borderline cases
  - Addition of detailed justifications for benchmarks
  - Finalisation of scale descriptors
  - Revision of rating handbooks
25. Comparison: FS II vs. Training
26. Comparison: FS II vs. Training
27. Conclusion
- Training concept for the future
  - With materials prepared, weekly seminars are not necessary
  - Training and rating at the descriptor level
  - Multiple one-day sessions, one per week, to give time for practice:
    - Introduction
    - Practice: 3 rounds per task are ideal
    - Feedback
28. Thank you for your attention!
29. Claudia Harsch
Phone: +49 (0)30 2093-5508
Fax: +49 (0)30 2093-5336
E-mail: Claudia.Harsch_at_IQB.hu-berlin.de
Website: www.IQB.hu-berlin.de
Mail address: Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, GERMANY

Guido Martin
Phone: +49 (0)40 48 500 612
E-mail: guido.martin_at_iea-dpc.de
Website: www.iea-dpc.de
Mail address: IEA DPC, Mexikoring 37, D-22297 Hamburg, GERMANY