The Montclair Electronic Language Learner Database (MELD)

Slides: 17
Provided by: ralphgr


1
The Montclair Electronic Language Learner Database (MELD)
  • www.chss.montclair.edu/linguistics/MELD/
  • Eileen Fitzpatrick, Steve Seegmiller
  • Montclair State University

2
Non-native speaker (NNS) corpora
  • Begun in early 1990s
  • Data
  • written performance only
  • essays of students of English as a foreign
    language
  • Corpus development (academic)
  • in Europe: Louvain, Lodz, Uppsala
  • in Asia: Tokyo Gakugei University, Hong Kong University
    of Science and Technology (HKUST)
  • Annotation
  • Lodz: part of speech
  • HKUST, Lodz: error tags

3
Gaps in NNS Corpus Creation
  • No NNS Corpus in America, so no corpus of English
    as a Second Language (ESL)
  • No NNS corpus is publicly available
  • No NNS corpus annotates errors without a
    predetermined list of error types

4
MELD Goals
  • Initial Goals
  • Collect ESL student writing
  • Tag writing for errors
  • Provide publicly available NNS data
  • Initial goals support:
  • 2nd language pedagogy
  • Language acquisition research
  • tool building (grammar checkers, student editing
    aids, parallel texts from NS and NNS)

5
MELD Overview
  • Data
  • 44,477 words of annotated text
  • 53,826 more words of raw data
  • language and education data for each student author
  • upper level ESL students
  • Tools written to
  • link essays to student background data
  • produce an error-free version from tagged text
  • allow fast entry of background data

6
Annotation
  • Annotators reconstruct a grammatical form
  • error/reconstruction
  • school systems is/are
  • since children 0/are usually inspired
  • becoming a/0 good citizens
  • Agreement between annotators is an issue
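
The error/reconstruction pairs above are what allow an error-free version to be generated from tagged text. The following is a minimal sketch of that idea, not MELD's actual tool: the brace markup `{error/reconstruction}` and the function names are hypothetical, chosen only for illustration, with `0` standing for an empty side as in the examples on this slide.

```python
import re

# Hypothetical markup (an assumption, not MELD's real tag format):
# each annotation is written as {error/reconstruction}, with 0 for "nothing".
TAG = re.compile(r"\{([^{}/]*)/([^{}/]*)\}")


def _pick(tagged: str, group: int) -> str:
    """Keep one side of every tag and tidy up the spacing left behind."""
    def choose(match: re.Match) -> str:
        side = match.group(group)
        return "" if side == "0" else side

    text = TAG.sub(choose, tagged)
    # Collapsing whitespace is fine for single sentences like these examples.
    return re.sub(r"\s+", " ", text).strip()


def reconstruct(tagged: str) -> str:
    """Return the error-free version (reconstruction side of each tag)."""
    return _pick(tagged, 2)


def original(tagged: str) -> str:
    """Return what the student wrote (error side of each tag)."""
    return _pick(tagged, 1)


samples = [
    "school systems {is/are}",
    "since children {0/are} usually inspired",
    "becoming {a/0} good citizens",
]
for s in samples:
    print(original(s), "->", reconstruct(s))
```

Keeping both sides in a single tag means the student's text and the reconstruction can each be recovered from the same file, which is one way tags could be "deleted or ignored" by a user.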

7
Error Classification from a Predetermined List
  • Benefit
  • annotators agree on what an error is: only those
    items in the classification scheme
  • Problems
  • annotators have to learn a classification scheme
  • the existence of a classification scheme means
    that the annotators can misclassify
  • errors not in the scheme will be missed

8
Error Identification and Reconstruction
  • Benefits
  • speed in annotating since there is no
    classification scheme to learn
  • no chance of misclassifying
  • less common errors will be captured
  • a reconstructed text can be more easily parsed
    and tagged for part of speech
  • Question
  • How well can we agree on what is an error?

9
Agreement Measures
  • Reliability: What percentage of the errors do
    both taggers tag?
    (T1 ∩ T2) / ((T1 + T2)/2)
  • Precision: What percentage of the non-expert's
    (T2) tags are accurate?
    (T1 ∩ T2) / T2
  • Recall: What percentage of true errors did the
    non-expert (T2) find?
    (T1 ∩ T2) / T1
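
The three measures can be sketched as set operations. This is an illustrative reading rather than the authors' code: it assumes T1 is the expert's set of tagged errors and T2 the non-expert's, that the garbled operator between T1 and T2 is a set intersection, and that the reliability denominator is the mean of the two tag counts. The function name and the token positions in the example are made up.

```python
def agreement(expert_tags, nonexpert_tags):
    """Compare two taggers' error tags, each given as a collection of
    error locations (here: token positions; any hashable ID works)."""
    t1, t2 = set(expert_tags), set(nonexpert_tags)
    both = t1 & t2  # errors tagged by both taggers

    # Precision: share of the non-expert's tags that the expert also made.
    precision = len(both) / len(t2) if t2 else 0.0
    # Recall: share of the expert's ("true") errors the non-expert found.
    recall = len(both) / len(t1) if t1 else 0.0
    # Reliability (assumed form): shared tags over the mean tag count.
    mean_count = (len(t1) + len(t2)) / 2
    reliability = len(both) / mean_count if mean_count else 0.0

    return {"precision": precision, "recall": recall,
            "reliability": reliability}


# Made-up example: the expert tags 5 positions, the non-expert tags 3,
# and 2 of those positions coincide.
example = agreement({3, 7, 12, 18, 25}, {7, 12, 30})
print(example)
```

In this example precision is 2/3, recall is 2/5 and reliability is 2/4: a cautious non-expert who tags little can be fairly precise while still missing most of the expert's errors.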

10
Agreement Measures
  • [Figure: tags of the non-expert and the expert; labels: high precision, low recall, low reliability]

11
Agreement Measures
  Taggers  Essays  Recall  Precision  Reliability
  JL       1-10    .54     .58        .39
  JL       11-22   .57     .78        .49
  JN       1-10    .58     .48        .23
  JN       11-22   .37     .54        .27
  LN       1-10    .65     .70        .37
  LN       11-22   .60     .78        .36

12
Conclusions on Tagging Agreement
  • Unsatisfactory level of agreement as to what is
    an error
  • Disagreements resolved through regular meetings
  • There are now 2 types of tags: one for
    lexico-syntactic errors and one for stylistic errors
  • The tags are transparent to the user and can be
    deleted or ignored

13
The Future
  • Immediate
  • Internet access to data and tools
  • an error concordancer
  • automatic part of speech and syntactic markup
  • data from different ESL skill levels
  • Long Range
  • statistical tool to correlate error frequency
    with student background
  • student editing aid
  • grammar checker
  • NNS speech data

14
Some Possible Applications
  • Preparation of instructional materials
  • Studies of progress over a semester
  • Research on error types by L1
  • Research on writing characteristics by L1

15
Writing Characteristics by L1
  • L1 Spanish: tense
  • 1 would/will
  • 1 went/go
  • 1 stay/stayed
  • 1 gave/give
  • 1 cannot/could
  • 1 can/could
  • TOTAL 6
  • Word Ct 2305
  • L1 Gujarati: tense
  • 5 was/is
  • 3 were/are
  • 2 would/will
  • 2 is/was
  • 2 have/had
  • 2 had/have
  • 1 would start/started
  • 1 will/0
  • 1 will/were to
  • 1 was/were
  • 1 wanted/want
  • 1 passes/passed
  • 1 love/loved
  • 1 left/leave
  • 1 kept/keeps
  • 1 involved/involves
  • 1 get/got
  • 1 do/did
  • 1 can/could
  • 1 are/were
  • 1 spend/spent
  • TOTAL 31
  • Word Ct 2500
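
Because the two samples differ in length, the totals are easier to compare as rates. A small sketch of that arithmetic, using only the totals and word counts on this slide; normalizing per 1,000 words is a choice made here for illustration, not something the slide specifies.

```python
# (tense errors, word count) per L1, taken from the slide.
tense_errors = {
    "Spanish": (6, 2305),
    "Gujarati": (31, 2500),
}


def per_thousand(errors: int, words: int) -> float:
    """Errors per 1,000 words of text."""
    return 1000 * errors / words


rates = {l1: round(per_thousand(e, w), 2)
         for l1, (e, w) in tense_errors.items()}

for l1, rate in rates.items():
    print(f"L1 {l1}: {rate} tense errors per 1,000 words")
```

On these figures the rates are about 2.6 (Spanish) and 12.4 (Gujarati) tense errors per 1,000 words. With samples this small, the gap is suggestive rather than conclusive.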

16
Acknowledgments
  • Jacqueline Cassidy
  • Jennifer Higgins
  • Norma Pravec
  • Lenore Rosenbluth
  • Donna Samko
  • Jory Samkoff
  • Kae Shigeta