Title: Automatic Identification of Discourse Moves in Scientific Article Introductions
1. Automatic Identification of Discourse Moves in Scientific Article Introductions
Nick Pendar and Elena Cotos, Iowa State University
The 3rd Workshop on Innovative Use of NLP for Building Educational Applications, June 19, 2008
2. Outline
- Background and motivation
- Discourse move identification
- Data and annotation scheme
- Feature selection
- Sentence representation
- Classifier
- Evaluation
- Inter-annotator agreement
- Further work
3. Automated evaluation: Background
- Automated essay scoring (AES) in performance-based and high-stakes standardized tests (e.g., ACT, GMAT, TOEFL, etc.)
- Automated error detection in L2 output (Burstein and Chodorow, 1999; Chodorow et al., 2007; Han et al., 2006; Leacock and Chodorow, 2003)
- Assessment of various constructs, e.g., topical content, grammar, style, mechanics, syntactic complexity, and deviance or plagiarism (Burstein, 2003; Elliott, 2003; Landauer et al., 2003; Mitchell et al., 2002; Page, 2003; Rudner and Liang, 2002)
- Text organization limited to recognizing the five-paragraph essay format, thesis, and topic sentences
- AntMover (Anthony and Lashkia, 2003)
4. Automated evaluation: CALI Motivation
- Wide range of possibilities for high-quality evaluation and feedback (Criterion; Burstein, Chodorow, and Leacock, 2004)
- Potential in formative assessment, but the effects of intelligent formative feedback are not fully investigated
- Warschauer and Ware (2006) call for the development of a classroom research agenda that would help evaluate and guide the application of AES in writing pedagogy
- "the potential of automated essay evaluation for improving student writing is an empirical question, and virtually no peer-reviewed research has yet been published" (Hyland and Hyland, 2006, p. 109)
5. Automated evaluation: EAP Motivation
- EAP pedagogical approaches (Cortes, 2006; Levis and Levis-Muller, 2003; Vann and Myers, 2001) fail to provide NNSs with sufficient academic writing practice and remedial guidance
- Problem of disciplinarity
- An NLP-based academic discourse evaluation application could address this drawback
- Such an application has not yet been developed
6. Automated evaluation: Research Motivation
- Long-term research goals
  - Design and implementation of IADE (Intelligent Academic Discourse Evaluator)
  - Analysis of IADE's effectiveness for formative assessment purposes
7. IADE
- Evaluates students' research article introductions in terms of moves/steps (Swales, 1990, 2004)
- Draws from
  - SLA models: interactionist views (Carroll, 1999; Gass, 1997; Long, 1996; Long and Robinson, 1998; Mackey, Gass, and McDonough, 2000; Swain, 1993) and Systemic Functional Linguistics (Martin, 1992; Halliday, 1985)
  - Skill Acquisition Theory of learning (DeKeyser, 2007)
- Is informed by empirical research on the provision of feedback
- Is informed by Evidence-Centered Design principles (Mislevy et al., 2006)
8. Discourse Move Identification
- Approached as a classification problem (similar to Burstein et al., 2003): given a sentence and a finite set of moves and steps, what move/step does the sentence signify?
- ISUAW corpus: 1,623 articles; 1,322,089 words; average article length 814.09 words
- Stratified sampling of 401 introduction sections representative of 20 academic disciplines
- Sub-corpus: 267,029 words; average length 665.91 words; 11,149 sentences
- Manual annotation
9. Discourse Move Identification
- Annotation scheme (Swales, 1990; Swales, 2004)
10. Discourse Move Identification
- Multiple layers of annotation for cases where the same sentence signified more than one move or more than one step
11. Feature Selection
- Features that reliably indicate a move/step
- Text-categorization approach (see Sebastiani, 2002)
- Each sentence treated as a data item to be classified and represented as an n-dimensional vector in Euclidean space
- The task of the learning algorithm is to find a function f : S → M that maps the sentences in the corpus S to the classes in M = {m1, m2, m3}
- Identification of moves, not yet steps
12. Feature Selection
- Extraction of word unigrams, bigrams, and trigrams from the annotated corpus
- Preprocessing (see the sketch below)
  - All tokens stemmed using the NLTK port of the Porter stemmer (Porter, 1980)
  - All numbers in the texts replaced by the string _number_
  - The tokens inside each bigram and trigram alphabetized
  - All n-grams with a frequency of less than five excluded
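A minimal sketch of this preprocessing pipeline in Python; the regex tokenizer and the in-memory corpus format are assumptions, since the slides specify only the four steps above.

    import re
    from collections import Counter
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()

    def preprocess(sentence):
        """Tokenize, replace numbers with _number_, and stem every token."""
        tokens = re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?", sentence)
        return ["_number_" if t[0].isdigit() else stemmer.stem(t.lower())
                for t in tokens]

    def ngrams(tokens, n):
        """Yield n-grams; tokens inside bigrams and trigrams are alphabetized."""
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            yield tuple(gram) if n == 1 else tuple(sorted(gram))

    def candidate_features(sentences, min_freq=5):
        """Count uni-, bi-, and trigrams; drop those with frequency < min_freq."""
        counts = Counter()
        for sent in sentences:
            toks = preprocess(sent)
            for n in (1, 2, 3):
                counts.update(ngrams(toks, n))
        return {g: c for g, c in counts.items() if c >= min_freq}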
13. Feature Selection
- Odds ratio used as the feature selection criterion
- Conditional probabilities calculated as maximum likelihood estimates
- N-grams with the maximum odds ratios selected as features (see the sketch below)
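The formula itself is not reproduced in this text, so the sketch below assumes the standard text-categorization odds ratio (cf. Sebastiani, 2002): OR(t, m) = [P(t|m) (1 - P(t|~m))] / [(1 - P(t|m)) P(t|~m)], with the conditional probabilities estimated by maximum likelihood over sentences. The smoothing constant eps and the input format are also assumptions, added for illustration.

    def odds_ratio(count_in_move, n_move, count_in_rest, n_rest, eps=0.5):
        """OR(t, m) with MLE probabilities; eps is an assumed smoothing term."""
        p_m = (count_in_move + eps) / (n_move + 2 * eps)   # P(t|m)
        p_r = (count_in_rest + eps) / (n_rest + 2 * eps)   # P(t|~m)
        return (p_m * (1 - p_r)) / ((1 - p_m) * p_r)

    def select_features(move_counts, k=3000):
        """Keep the k n-grams with the highest odds ratio for each move.

        move_counts: {move: (Counter of n-gram sentence counts, n_sentences)},
        a hypothetical input format for illustration.
        """
        selected = {}
        for move, (counts, n_move) in move_counts.items():
            rest = [(c, n) for m, (c, n) in move_counts.items() if m != move]
            n_rest = sum(n for _, n in rest)
            scores = {g: odds_ratio(c, n_move,
                                    sum(rc.get(g, 0) for rc, _ in rest), n_rest)
                      for g, c in counts.items()}
            selected[move] = sorted(scores, key=scores.get, reverse=True)[:k]
        return selected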
14. Sentence Representation
- Each sentence represented as a vector
- Presence or absence of the selected terms in a sentence recorded as Boolean values (0 for the absence of the corresponding term, 1 for its presence); see the sketch below
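A minimal sketch of this Boolean encoding; the fixed feature ordering is an assumption:

    def vectorize(sentence_ngrams, feature_list):
        """Map a sentence's n-grams onto a 0/1 vector over the selected features."""
        present = set(sentence_ngrams)
        return [1 if f in present else 0 for f in feature_list]

    # e.g. vectorize([("studi",), ("present", "studi")], features) -> [0, 1, ...]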
15. Classifier
- Support Vector Machines (SVMs) (Basu et al., 2003; Burges, 1998; Cortes and Vapnik, 1995; Joachims, 1998; Vapnik, 1995)
- Five-fold cross-validation
- Machine learning environment: RapidMiner (Mierswa et al., 2006)
- RBF kernel; kernel parameters found by trying a set of different parameter settings on the feature set with 3,000 unigrams
- These parameters are not necessarily the best; exhaustive searches will be performed on the other feature sets (a sketch of the setup follows below)
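The experiments were run in RapidMiner, so the following scikit-learn equivalent is purely illustrative; the C and gamma values are placeholders, not the parameters the authors found.

    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def evaluate_svm(X, y, C=1.0, gamma=0.01):
        """Five-fold cross-validation of an RBF-kernel SVM on the Boolean vectors."""
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        return cross_val_score(clf, X, y, cv=5)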
16. Evaluation
- Five-fold cross-validation was performed on 14 different feature sets
17. Evaluation
- Accuracy: the proportion of classifications that agreed with the manually assigned labels
18. Evaluation
- Precision: the proportion of the items assigned to a given category that actually belonged to it
- Recall: the proportion of the items actually belonging to a category that were labeled correctly (see the sketch below)
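A minimal sketch of these per-move metrics, computed from parallel lists of gold and predicted move labels (an assumed input format):

    def metrics(gold, predicted, move):
        """Accuracy over all sentences; precision and recall for one move."""
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == move and p == move)
        fp = sum(1 for g, p in pairs if g != move and p == move)
        fn = sum(1 for g, p in pairs if g == move and p != move)
        accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return accuracy, precision, recall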
19. Evaluation
- Trigram models result in the best precision
- Unigram models result in the best recall
20. Evaluation
- Error analysis reveals that Move 2 is the most difficult to identify; Move 2 gets misclassified as Move 1
- Possible remedy: use the relative position of the sentence in the text to disambiguate the move involved (see the sketch below)
- Check what percentage of Move 2 sentences identified as Move 1 by the system were also labeled Move 1 by the annotator
- Extracted features are not discipline-dependent
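One way to realize the proposed positional cue, sketched as an extra vector component; the exact encoding is an assumption, since the slide only names the idea:

    def add_position(vector, sentence_index, n_sentences):
        """Append the sentence's relative position in its introduction (0 to 1);
        Move 1 sentences tend to appear earlier than Move 2 sentences."""
        return vector + [sentence_index / max(n_sentences - 1, 1)]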
21. This just in
- Built a model with the top 3,000 unigrams and the top 3,000 trigrams
- Precision: 91.14
- Recall: 82.98
- Kappa: 87.57
22. Inter-annotator agreement
- Second annotations collected on a sample of files across all 20 disciplines (487 sentences)
- k (kappa): inter-annotator agreement, k = (P(A) - P(E)) / (1 - P(E))
  - P(A): observed probability of agreement
  - P(E): expected probability of agreement
- Average k = 0.945 over the three moves (see the sketch below)
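A minimal sketch of this agreement computation (Cohen's kappa for two annotators over the move labels); the parallel-list input format is an assumption:

    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """k = (P(A) - P(E)) / (1 - P(E))."""
        n = len(labels_a)
        p_a = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(freq_a[m] * freq_b[m] for m in freq_a) / (n * n)  # chance agreement
        return (p_a - p_e) / (1 - p_e)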
23. Further work on IADE
- Ongoing experiments to improve accuracy: experimenting with different kernel parameters to find optimal models
- More annotation
- Inter-annotator agreement (3 annotators)
- Identification of steps
- Development of intelligent feedback
- Web interface design
24. Further research with IADE
- Evaluation of IADE effectiveness
  - Learning potential
  - Learner fit
  - Meaning focus
  - Authenticity
  - Impact
  - Practicality (Chapelle, 2001)
- Process/product research direction: interaction between use and outcome (Warschauer and Ware, 2006)
- Target for evaluation: what is taught through technology (Chapelle, 2007, p. 30)
25. Questions? Suggestions?
Thank you!