Title: Test tasks for speaking


1
Test tasks for speaking: balancing between
authenticity and reliability
  • Raili Hildén, University of Helsinki, Finland
  • Raili.hilden_at_helsinki.fi
  • TBLT 2009, Lancaster
  • "Tasks: context, purpose and use", 3rd Biennial
    International Conference on Task-Based Language
    Teaching
  • 13-16 September 2009

2
Background: HY-Talk project of speaking
assessment
  • The project is funded by the University of
    Helsinki
  • To validate the illustrative scales of speaking
    included in the national core curricula for
    general education and upper secondary level by
    trialing a prototype test of speaking.
  • The subscales (overall task completion, fluency,
    pronunciation, range and accuracy) are empirically
    aligned to relevant scales of the CEFR.
  • http://blogs.helsinki.fi/hy-talk/

3
The conceptual framework
  • Validity argumentation scheme for interpretation
    of the HY-Talk project data (adapted from Kane,
    2001; Fulcher & Davidson, 2007, 164-174;
    Bachman, 2005)
  • The claim to be probed:
  • The illustrative scales of descriptors of oral
    proficiency included in the national core
    curricula for language education enable
    sufficiently valid conclusions on students' oral
    proficiency in general school education in
    Finland.

4
The purpose of the HY-Talk study
  • The validity claim is supported and challenged by
    warrants and rebuttals regarding:
  • relevance
  • utility
  • (Intended consequences)
  • sufficiency

5
Warrants
  • The tasks used to elicit student performance
    correspond to pedagogic tasks and target language
    use tasks of students at the age of general
    education. (utility)
  • Reliability of assessments based on the scale and
    the tasks to elicit performances is found to be
    high enough. (sufficiency)

6
Backing to support the utility claim
  • Rater and test-taker feedback confirms the
    perceived authenticity of the tasks and the
    appropriateness of administration.
  • The level ratings correspond to the target levels
    in the curricula.

7
Backing data to support the sufficiency claim
  • Statistical reliability evidence confirms a
    sufficient level of consistency across raters,
    tasks, languages and interlocutors.

8
Counterclaims
  • The tasks used to elicit student performance
    correspond inadequately to pedagogic tasks or TLU
    tasks of students. (utility)
  • The link to the scale descriptors may be weak.
    (utility)
  • The level assignments do not match the target
    levels set in the curricula.
  • Reliability of assessments is not stable, but
    varies too much across tasks, raters or
    languages, or is caused by intervening variables
    or inadequate evidence base. (sufficiency)

9
Rebuttal data to support the utility claim
  • Statistical evidence challenges the intended
    utility of the tasks.
  • Verbal data from students and teachers question
    the utility and/or sufficiency of the tasks for
    the purpose.

10
Research questions
  • 1. How high is the inter-rater reliability of the
    judgements?
  • 2. How are the tasks and corresponding salient
    task features related to target level judgements,
    assessment criteria and their combination?
    (numeric data, analysed with Facets)
  • 3. How are the tasks perceived by students and
    raters? (verbal data based on feedback sheets and
    audio recorded rating sessions)
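
A simple way to get a first feel for the inter-rater question in RQ1 is to count how often pairs of raters assign the same level, or levels within one step of each other. The sketch below is only an illustration of that idea, not the many-facet analysis run with Facets in the study. The function name `pairwise_agreement` and the ratings in `toy_ratings` are invented for the example and are not project data.

```python
from itertools import combinations


def pairwise_agreement(ratings, tolerance=0):
    """Share of rater pairs whose levels differ by at most `tolerance`.

    `ratings` holds one list per speech sample, with one level per rater.
    """
    agreeing_pairs = 0
    total_pairs = 0
    for sample in ratings:
        for first, second in combinations(sample, 2):
            total_pairs += 1
            if abs(first - second) <= tolerance:
                agreeing_pairs += 1
    return agreeing_pairs / total_pairs


# Hypothetical levels on a 1-10 scale: four samples, three raters each.
# These numbers are made up for illustration only.
toy_ratings = [
    [5, 5, 6],
    [6, 6, 6],
    [4, 5, 6],
    [5, 6, 6],
]

exact = pairwise_agreement(toy_ratings)                  # identical levels
adjacent = pairwise_agreement(toy_ratings, tolerance=1)  # within one level

print(f"exact agreement:    {exact:.2f}")
print(f"adjacent agreement: {adjacent:.2f}")
```

Raw agreement of this kind ignores differences in rater severity and task difficulty, which is one reason a many-facet approach is used for the actual analysis.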

11
Speaking Tasks
  • Tasks were designed to reflect the average target
    level specified for good mastery of the syllabus
  • English (grade 7: A1.3, grade 1: A2.2)
  • German etc. (grade 7: A1.2, grade 1: A2.1)
  • They also draw on the thematic content of the
    curricula
  • Discussed, revised and piloted by the project
    group

12
Prototype tasks (with examples)
  • 1. Presentation (A2.2): partly controlled
    monologue
  • 2. Everyday life (A2.1-A2.2): rigidly controlled
    dialogues
  • At the airport, grade 7
  • At home, grade 7
  • Accommodation, grade 1
  • On the way home, grade 1
  • 3. Negotiation: partly controlled dialogue.
    Planning an outing (A2.1-B1.1)

13
Speaking Tasks
  • Prompts in L1
  • Time on task: 10-15 min
  • Conducted in pairs
  • Rated by 5-10 language experts

14
Data of this study
  • Speech samples in English (56)
  • Speech samples in German (66)

15
Facets examined in this study
  • Raters (5 English, 7 German)
  • Tasks 1-4
  • Task dimensions
  • Overall task performance
  • Fluency
  • Pronunciation
  • Range
  • Accuracy

16
Results RQ1, English samples: overall inter-rater
agreement
  • The majority of total ratings were placed between
    levels 5-6 (CEFR A2-B1)
  • Across all facets and raters, the distance between
    the most severe and the most lenient rater was 1
    logit (levels 5/6)
  • Average of ratings given by R4: 6.66
  • Average of ratings given by R1: 5.87
  • For more detailed record please contact the
    author.
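
To make the "1 logit" distance more concrete, the sketch below converts a logit gap into an odds ratio and shows how it would shift a probability. It also computes the raw difference between the two rater averages reported on this slide. The 0.5 baseline probability is an assumption chosen for illustration; the function names are hypothetical, and the raw averages and the logit measures are on different scales.

```python
import math


def odds_ratio_from_logits(logit_gap):
    """Odds ratio implied by a difference expressed in logits."""
    return math.exp(logit_gap)


def probability_shift(base_probability, logit_gap):
    """Probability after moving `logit_gap` logits away from a baseline."""
    base_logit = math.log(base_probability / (1 - base_probability))
    return 1 / (1 + math.exp(-(base_logit + logit_gap)))


# Raw averages reported on the slide (levels on the rating scale).
rater_means = {"R4": 6.66, "R1": 5.87}
raw_gap = rater_means["R4"] - rater_means["R1"]

# One logit multiplies the odds by e, roughly 2.72.
odds_ratio = odds_ratio_from_logits(1.0)

# Assumed baseline of 0.5: a one-logit shift moves it to about 0.73.
shifted = probability_shift(0.5, 1.0)

print(f"raw gap between rater averages: {raw_gap:.2f} levels")
print(f"odds ratio for a 1-logit gap:   {odds_ratio:.2f}")
print(f"0.50 shifted by 1 logit:        {shifted:.2f}")
```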

17
Results RQ1, English samples: overall task
difficulty
  • The easiest task
  • "Presentation" was assigned the highest fair
    average of 6.29
  • The trickiest task
  • Everyday life task "Accommodation" was assigned
    the lowest fair average of 6.21
  • For more detailed record please contact the
    author.

18
Results RQ1, English samples: criteria
  • The easiest criterion
  • Pronunciation (fair average 6.39)
  • The trickiest criterion
  • Range (fair average 6.02)
  • For more detailed record please contact the
    author.

19
Results RQ1, English samples: combined difficulty,
task × criteria
  • The easiest combination
  • Presentation × Accuracy
  • Presentation × Fluency
  • The trickiest combination
  • Everyday situation "Accommodation" × Range
  • For more detailed record please contact the
    author.

20
Results RQ1, German samples: overall inter-rater
agreement
  • The majority of total ratings were placed between
    levels 5-6/10 (CEFR A2-B1)
  • Across all facets and raters, the distance
    between the most severe and the most lenient
    rater was 1 logit (levels 5-6)
  • Average of ratings given by R6: 3.96/10
  • Average of ratings given by R2: 3.57/10
  • For more detailed record please contact the
    author.

21
Results RQ1, German samples: overall task
difficulty
  • The easiest task
  • "Presentation" task was assigned the highest fair
    average of 4.21/10
  • The trickiest task
  • Everyday life task "On the way home" was
    assigned the lowest fair average of 3.57/10
  • For more detailed record please contact the
    author.

22
Results RQ1, German samples: criteria
  • The easiest criterion: Pronunciation
    (fair average 4.24/10)
  • The trickiest criterion: Range
    (fair average 3.49/10)
  • For more detailed record please contact the
    author.
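
The fair averages reported for tasks and criteria can be compared by looking at the gap between the easiest and the trickiest element in each case. The sketch below does this using only the figures given on the slides. Only the two extremes are reported per facet, so the gap is all that can be computed here; the variable and function names are illustrative.

```python
# Fair averages as reported on the slides (easiest and trickiest only).
fair_averages = {
    ("English", "task"): {"Presentation": 6.29, "Accommodation": 6.21},
    ("English", "criterion"): {"Pronunciation": 6.39, "Range": 6.02},
    ("German", "task"): {"Presentation": 4.21, "On the way home": 3.57},
    ("German", "criterion"): {"Pronunciation": 4.24, "Range": 3.49},
}


def spread(values):
    """Gap between the highest and lowest fair average, in scale levels."""
    return round(max(values.values()) - min(values.values()), 2)


spreads = {key: spread(values) for key, values in fair_averages.items()}

for (language, facet), gap in spreads.items():
    print(f"{language} {facet}: {gap:.2f}")
```

On these figures the gaps are smaller for the English samples than for the German samples, both for tasks and for criteria.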

23
Results RQ1, German samples: combined difficulty,
task × criteria
  • The easiest combination
  • Presentation × Pronunciation (level 6, B1.1)
  • The trickiest combination
  • Negotiation (Planning an outing) × Range (level 5,
    A2.2, lower band)
  • For more detailed record please contact the
    author.

24
RQ2: English and German
  • The tasks were conceived as authentic in regard
    to themes and situations
  • Authenticity (Bachman & Palmer, 1996) was
    questioned by raters during the sessions because
    of the high degree of control imposed by the L1
    prompts (to increase reliability)
  • Students regarded the tasks as relevant and
    highly probable in real life.
  • The raters of German discussed the interlocutor
    impact of the pair setting as a biasing factor.
  • The results suggest that the target level
    requirements set in the Finnish curricula are
    attained reasonably well.

25
Discussion
  • The utility claim was confirmed with respect to
    the high level of rater agreement across facets
    (reliability)
  • Sufficiency and relevance were partly questioned
    because of the claimed lack of authenticity of
    the tasks (rigour of instructions)
  • How should this dilemma be addressed in future
    versions of the test?

26
References
  • Bachman, L. F. (2005). Building and supporting a
    case for test use. Language Assessment Quarterly,
    2(1), 1-34.
  • Fulcher, G. & Davidson, F. (2007). Language
    Testing and Assessment: An advanced resource
    book. Abingdon & New York: Routledge.
  • Hildén, R. & Takala, S. (2007). Relating
    Descriptors of the Finnish School Scale to the
    CEF Overall Scales for Communicative Activities.
    In Koskensalo, A., Smeds, J., Kaikkonen, P. &
    Kohonen, V. (Eds.), Foreign languages and
    multicultural perspectives in the European
    context / Fremdsprachen und multikulturelle
    Perspektiven im europäischen Kontext. Dichtung,
    Wahrheit und Sprache (pp. 73-88). LIT-Verlag.

27
Bibliography
  • National Core Curriculum for the Comprehensive
    School 2004. Helsinki: Finnish National Board of
    Education. In Finnish: http://www.oph.fi/info/ops/
  • National Core Curriculum for the Upper Secondary
    Level 2003. Helsinki: Finnish National Board of
    Education. In Finnish:
    http://www.oph.fi/pageLast.asp?path1,17627,1830,23059
  • Kane, M. T. (2001). Current concerns in validity
    theory. Journal of Educational Measurement,
    38(4), 319-342.

28
Thank you!
raili.hilden_at_helsinki.fi