Corpus Mark-up - PowerPoint PPT Presentation


Slides: 29
Provided by: MatthewBr2

1
Corpus Mark-up
  • UoL Summer Institute in Corpus Linguistics
  • Matthew Brook O'Donnell

2
Aims
  • Introduce the concepts of corpus mark-up and
    annotation
  • Consider why we would want to add extra
    non-textual information to corpus texts
  • Use a POS tagger and tagged text

3
What is Corpus Annotation?
  • the practice of adding interpretative linguistic
    information to a corpus (Leech 2005)
  • interpretative
  • linguistic
  • results in -> value-added corpus

4
Terminology
  • Corpus Markup
  • processing/formatting information
  • metadata/text classifications
  • structural representation
  • Tagging
  • (usually) inline addition of category to word(s)
  • Parsing
  • higher-level, multiword units (constituents)
  • chunking/shallow vs. full syntactical parsing
  • needn't just be syntactical analysis
  • XML
  • eXtensible Markup Language
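
One way to picture how markup, tagging and XML fit together is to wrap each word in an element and store its tag as an attribute. The sketch below uses Python's standard xml.etree.ElementTree module. The element and attribute names (s, w, pos, n) are illustrative assumptions, not a corpus encoding standard, and the tags are borrowed from the CLAWS C5 example used later in these slides.

```python
import xml.etree.ElementTree as ET

# (word, tag) pairs; the tags follow the CLAWS C5 example in this deck.
tagged = [("Corpus", "NN1"), ("annotation", "NN1"), ("is", "VBZ")]

# Element/attribute names (s, w, pos, n) are assumptions for illustration.
sentence = ET.Element("s", attrib={"n": "1"})
for word, tag in tagged:
    w = ET.SubElement(sentence, "w", attrib={"pos": tag})
    w.text = word

xml_text = ET.tostring(sentence, encoding="unicode")
print(xml_text)

# Reading the markup back: query by tag, or drop the tags to get the words.
parsed = ET.fromstring(xml_text)
words = [w.text for w in parsed.iter("w")]
nouns = [w.text for w in parsed.iter("w") if w.get("pos") == "NN1"]
print(words)
print(nouns)
```

Because the tag sits in an attribute rather than inside the word itself, the original word sequence can be recovered by reading only the element text.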

5
Why Annotate?
  1. Manual examination of corpus
  2. Automatic analysis of corpus
  3. Reusability of annotations
  4. Multi-functionality
  5. Objective record of analysis
  6. Annotation process is corpus analysis

(Leech 2005; McEnery 2003; O'Donnell 1999)

6
Types of Corpus Annotation
  • Part-of-speech (POS)
  • Lemmatization
  • Syntactical (parsing)
  • Semantic (domain classifications)
  • Coreference (Discourse)
  • Pragmatic (speech acts, dialogue)
  • Stylistic
  • Research specific (ad hoc)

7
POS Tagging: CLAWS C5
  • Corpus_NN1 annotation_NN1 is_VBZ
  • the_AT0 practice_NN1 of_PRF
  • adding_VVG interpretative_AJ0
  • linguistic_AJ0 information_NN1
  • to_PRP a_AT0 corpus_NN1 ._.

Key:
NN1 = singular noun
AJ0 = adjective (unmarked)
VBZ = -s form of the verb BE
PRF = the preposition OF
VVG = -ing form of lexical verb
AT0 = article

8
POS Tagging: CLAWS C7
  • Corpus_NN1 annotation_NN1 is_VBZ
  • the_AT practice_NN1 of_IO
  • adding_VVG interpretative_JJ
  • linguistic_JJ information_NN1
  • to_II a_AT1 corpus_NN1 ._.

http://www.comp.lancs.ac.uk/ucrel/claws/trial.html

9
POS Tagging: POSTagger
  • Corpus/NN annotation/NN is/VBZ
  • the/DT practice/NN of/IN
  • adding/VBG interpretative/JJ
  • linguistic/JJ information/NN
  • to/TO a/DT corpus/NN ./.
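
The three tagged versions of the sentence above differ in tagset and in separator (an underscore in the CLAWS output, a slash in the POSTagger output), but all use the same inline word-plus-tag layout. As a minimal sketch, assuming whitespace-separated tokens with the tag after the last separator, such text could be read back into (word, tag) pairs as follows. The function name read_tagged is illustrative only.

```python
from collections import Counter

def read_tagged(text, sep="_"):
    """Split 'word<sep>TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains the separator character is kept whole.
        word, found, tag = token.rpartition(sep)
        if not found:
            raise ValueError(f"token without a tag: {token!r}")
        pairs.append((word, tag))
    return pairs

c5 = ("Corpus_NN1 annotation_NN1 is_VBZ the_AT0 practice_NN1 of_PRF "
      "adding_VVG interpretative_AJ0 linguistic_AJ0 information_NN1 "
      "to_PRP a_AT0 corpus_NN1 ._.")
slash = ("Corpus/NN annotation/NN is/VBZ the/DT practice/NN of/IN "
         "adding/VBG interpretative/JJ linguistic/JJ information/NN "
         "to/TO a/DT corpus/NN ./.")

c5_pairs = read_tagged(c5, sep="_")
slash_pairs = read_tagged(slash, sep="/")

# Same words, different tagsets.
same_words = [w for w, _ in c5_pairs] == [w for w, _ in slash_pairs]
print(same_words)

# Tag frequencies for the C5 version.
tag_counts = Counter(tag for _, tag in c5_pairs)
print(tag_counts.most_common())
```

Once the pairs are separated, the words alone give back the plain text, and the tags can be counted or searched independently.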

10
Parsing: Chunking
  • [NP (NN Corpus) (NN annotation)]
  • (VBZ is)
  • [NP (DT the) (NN practice)]
  • (IN of) (VBG adding)
  • [NP (JJ interpretative) (JJ linguistic) (NN
    information)]
  • [PP (TO to) [NP (DT a) (NN corpus)]]
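
The noun-phrase grouping above can be approximated with a very small rule: collect each maximal run of determiner, adjective and noun tags into one NP. This is only a sketch under that assumption (the set NP_TAGS and the function name are illustrative), not how any particular chunker works. It does not build the PP chunk, and it would wrongly merge two adjacent noun phrases.

```python
# Assumption: only these Penn-style tags may appear inside an NP chunk.
NP_TAGS = {"DT", "JJ", "NN"}

def chunk_noun_phrases(tagged):
    """Group maximal runs of NP_TAGS into ('NP', [...]); pass other tokens through."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            current.append((word, tag))
        else:
            if current:  # a non-NP tag closes the chunk being built
                chunks.append(("NP", current))
                current = []
            chunks.append((word, tag))
    if current:  # close a chunk that runs to the end of the sentence
        chunks.append(("NP", current))
    return chunks

tagged = [("Corpus", "NN"), ("annotation", "NN"), ("is", "VBZ"),
          ("the", "DT"), ("practice", "NN"), ("of", "IN"), ("adding", "VBG"),
          ("interpretative", "JJ"), ("linguistic", "JJ"), ("information", "NN"),
          ("to", "TO"), ("a", "DT"), ("corpus", "NN"), (".", ".")]

chunks = chunk_noun_phrases(tagged)
noun_phrases = [" ".join(w for w, _ in c[1]) for c in chunks if c[0] == "NP"]
for np_text in noun_phrases:
    print(np_text)
```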

11
Parsing
  (S
    (NP Corpus annotation)
    (VP is
      (NP
        (NP the practice)
        (PP of
          (S (VP adding
            (NP interpretative linguistic information)
            (PP to (NP a corpus))
          ))
        )
      )
    )
  .)
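
Bracketed parses like the one above are easy to process automatically because every constituent opens with a bracket and a label and ends with a closing bracket. The following sketch reads the notation into nested Python lists and recovers the words in order. It assumes well-formed, balanced input; the function names are illustrative.

```python
import re

def parse_brackets(text):
    """Parse '(LABEL child ...)' notation into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected an opening bracket")
        pos += 1                      # consume "("
        node = [tokens[pos]]          # first token is the constituent label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())   # nested constituent
            else:
                node.append(tokens[pos])    # a word
                pos += 1
        pos += 1                      # consume ")"
        return node

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the closing bracket")
    return tree

def leaves(node):
    """Return the words of the tree in left-to-right order."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words

parse = """(S (NP Corpus annotation)
   (VP is (NP (NP the practice)
              (PP of (S (VP adding (NP interpretative linguistic information)
                                   (PP to (NP a corpus)))))))
   .)"""

tree = parse_brackets(parse)
sentence = " ".join(leaves(tree))
print(tree[0])
print(sentence)
```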

12
Semantic Annotation
  • Each word given code from thesaurus-style
    dictionary
  • Also called Word Sense Tagging
  • Examples
  • UCREL Semantic Analysis System
  • http://www.comp.lancs.ac.uk/ucrel/usas/
  • WordNet
  • http://wordnet.princeton.edu/

13
Semantic Annotation
  • The noun move has 5 senses (first 5 from tagged
    texts)
  • 1. (377) move -- (the act of deciding to do
    something; "he didn't make a move to help"; "his
    first move was to hire a lawyer")
  • 2. (70) move, relocation -- (the act of changing
    your residence or place of business; "they say
    that three moves equal one fire")
  • 3. (57) motion, movement, move, motility -- (a
    change of position that does not entail a change
    of location; "the reflex motion of his eyebrows
    revealed his surprise"; "movement is a sign of
    life"; "an impatient move of his hand";
    "gastrointestinal motility")
  • 4. (30) motion, movement, move -- (the act of
    changing location from one place to another;
    "police controlled the motion of the crowd"; "the
    movement of people from the farms to the cities";
    "his move put him directly in my path")
  • 5. (5) move -- ((game) a player's turn to take
    some action permitted by the rules of the game)

14
Semantic Annotation
  • The verb move has 16 senses (first 13 from tagged
    texts)
  • 1. (130) travel, go, move, locomote -- (change
    location; move, travel, or proceed; "How fast
    does your new car go?"; "We travelled from Rome
    to Naples by bus"; "The policemen went from door
    to door looking for the suspect"; "The soldiers
    moved towards the city in an attempt to take it
    before night fell")
  • 2. (60) move, displace -- (cause to move, both in
    a concrete and in an abstract sense; "Move those
    boxes into the corner, please"; "I'm moving my
    money to another bank"; "The director moved more
    responsibilities onto his new assistant")
  • 3. (52) move -- (move so as to change position,
    perform a nontranslational motion; "He moved his
    hand slightly to the right")
  • 4. (20) move -- (change residence, affiliation,
    or place of employment; "We moved from Idaho to
    Nebraska"; "The basketball player moved from one
    team to another")

15
Tools
  • XML
  • Annotation Editors
  • GATE
  • WordSmith

16
The Great Annotation Debate
  • Leech et al.: annotation = value added
  • Sinclair: annotation = perilous activity
  • Scott: beware of the POS prison!

17
Sinclair on the perils of corpus annotation
  • The interspersing of tags in a language text is
    a perilous activity, because the text thereby
    loses integrity

Current Issues in Corpus Linguistics (Sinclair 2004: 191)

18
Sinclair on the perils of corpus annotation
  • ... one cosy consequence of using tagged text is
    that the description which produces the tags in
    the first place is not challenged; it is
    protected. The corpus data can only be observed
    through the tags; that is to say, anything the
    tags are not sensitive to will be missed

Current Issues in Corpus Linguistics (Sinclair 2004: 191)

19
Sinclair on the perils of corpus annotation
  • In corpus-driven linguistics you do not use
    pre-tagged text, but you process the raw text
    directly and then patterns of this uncontaminated
    text are able to be observed.

Current Issues in Corpus Linguistics (Sinclair 2004: 191)

20
Hunston: annotation as double-edged sword
  • the categories used to annotate a corpus are
    typically determined before any corpus analysis
    is carried out, which in turn tends to limit, not
    the kind of question that can be asked, but the
    kind of question that usually is asked.

(Hunston 2002: 93)

21
Hunston: annotation as double-edged sword
  • Most of the work that is done using annotated
    corpora uses categories that have been developed
    in pre-corpus days, such as nominal clauses,
    anaphoric reference. Phenomena such as frames or
    semantic prosody tend to have been identified
    from plain text corpora and word-based studies.

(Hunston 2002: 93)

22
Corpus-based approach
  • [Diagram labels: plain corpus; ANALYSIS
    (categorization); annotated corpus; DATA; CORPUS
    METHODS; ANALYSIS (generalization); RESULTS]
  • Annotate Corpus
  • POS
  • Parsing
  • Semantic
  • Reference

23
Corpus-driven approach
  • [Diagram labels: plain corpus; DATA; CORPUS
    METHODS; ANALYSIS (generalization,
    categorization); RESULTS]

24
Problem for both Corpus-Based (CB) & Corpus-Driven (CD) Approaches
  • Serial/Sequential process
  • CB: analysis before (annotation) and after
    processing
  • CD: analysis only after processing (so no need for
    annotation)
  • Empirical process is cyclic
  • analysis feeds back into process and around
    again and again

25
So what if...
  • Hunston: Most of the work that is done using
    annotated corpora uses categories that have been
    developed in pre-corpus days.
  • ... we annotate categories that have come out of
    corpus analysis instead of/as well as traditional
    categories?

(Hunston 2002: 93)

26
New uses for corpus annotation
  • Cyclic investigation process
  1. KWIC/Frequency list/Collocates etc.
  2. Annotate results
  3. Goto 1
  • How should we annotate:
  • collocates
  • lexical items
  • semantic associations/prosodies
  • Local textual functions
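
The cycle above (concordance, annotate results, go round again) can be sketched in a few lines. Everything here is an assumption made for illustration: the kwic function, the window width, and especially the labelling rule, which stands in for a judgement the analyst would normally make by hand. The sample sentences are taken from the WordNet examples earlier in the deck.

```python
def kwic(tokens, node, width=3):
    """Return (index, left context, right context) for each hit of node."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((i, left, right))
    return hits

tokens = ("he didn't make a move to help . "
          "his first move was to hire a lawyer .").split()

# Step 1: KWIC lines for the node word.
hits = kwic(tokens, "move")
for i, left, right in hits:
    print(f"{' '.join(left):>20}  [{tokens[i]}]  {' '.join(right)}")

# Step 2: annotate the results. The rule below (label each hit by the word
# immediately to its left) is a stand-in for the analyst's own categories.
annotations = {i: ("L1=" + left[-1]) if left else "L1=<start>"
               for i, left, _ in hits}

# Step 3 (goto 1): the labels are now data that the next pass can search.
by_label = {}
for i, label in annotations.items():
    by_label.setdefault(label, []).append(i)
print(by_label)
```

Keeping the labels in a separate table keyed by token position leaves the text itself untouched, so each new pass still starts from the plain corpus.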

27
References
  • Leech, G.
  • 2005 Adding Linguistic Annotation, in M.
    Wynne (ed.), Developing Linguistic Corpora: a Guide
    to Good Practice (Oxford: Oxbow Books), pp. 17-29
  • http://ahds.ac.uk/linguistic-corpora/
  • Hunston, S.
  • 2002 Corpora in Applied Linguistics (Cambridge:
    Cambridge University Press)
  • McEnery, A.
  • 2003 Corpus Linguistics, in R. Mitkov (ed.),
    The Oxford Handbook of Computational Linguistics
    (Oxford: Oxford University Press), pp. 448-463

28
References
  • O'Donnell, M.B.
  • 1999 The Use of Annotated Corpora for New Testament
    Discourse Analysis: A Survey of Current Practice
    and Future Prospects, in S.E. Porter and J.T.
    Reed (eds.), Discourse Analysis and the New
    Testament: Results and Applications (Sheffield:
    Sheffield Academic Press), pp. 71-117
  • Sinclair, J.
  • 2004 Trust the Text: Language, Corpus and
    Discourse (London: Routledge)