Corpus Annotation - PowerPoint PPT Presentation

Slides: 37
Provided by: don78
Transcript and Presenter's Notes


1
Corpus Annotation

2
  • Apart from the pure text, a corpus can also be
    provided with additional linguistic information,
    called 'annotation'.

3
Types of annotation
  • Certain kinds of linguistic annotation, which
    involve the attachment of special codes to words
    in order to indicate particular features, are
    often known as "tagging" rather than annotation,
    and the codes which are assigned to features are
    known as "tags".

4
  • Part of Speech annotation
  • Lemmatisation
  • Parsing
  • Semantics
  • Discoursal and text linguistic annotation
  • Phonetic transcription
  • Prosody
  • Problem-oriented tagging

5
  • At the phonological level
      • Phonetic/phonemic annotation (syllable
        boundaries)
      • Prosodic annotation (prosodic features)
  • At the morphological level
      • Morphological annotation (prefixes, suffixes
        and stems)
  • At the lexical level
      • POS tagging (part-of-speech annotation)
      • Lemmatization (lemmas)
      • Semantic annotation (semantic fields)
  • At the syntactic level
      • Parsing, treebanking or bracketing (syntactic
        analysis)
  • At the discoursal level
      • Coreference annotation (anaphoric relations)
      • Pragmatic annotation (pragmatic information
        like speech acts)
      • Stylistic annotation (stylistic features such
        as speech or thought presentation)

6
Part-of-speech Annotation.
  • This is the most basic type of linguistic corpus
    annotation - the aim being to assign to each
    lexical unit in the text a code indicating its
    part of speech. Part-of-speech annotation is
    useful because it increases the specificity of
    data retrieval from corpora, and also forms an
    essential foundation for further forms of
    analysis (such as syntactic parsing and semantic
    field annotation). Part-of-speech annotation also
    allows us to distinguish between homographs.
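
The homograph point can be illustrated with a minimal Python sketch. The sentence, the word_TAG layout and the tag names (PRON, VERB, DET, NOUN, PREP) are assumptions made up for this example and do not belong to any particular tagset.

```python
# Toy tagged text in word_TAG form; tags are illustrative only.
TAGGED_TEXT = "they_PRON present_VERB the_DET present_NOUN to_PREP the_DET winner_NOUN"


def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


def find(pairs, word, tag=None):
    """Return the positions of `word`, optionally restricted to one tag."""
    return [
        i
        for i, (w, t) in enumerate(pairs)
        if w.lower() == word and (tag is None or t == tag)
    ]


pairs = parse_tagged(TAGGED_TEXT)
all_hits = find(pairs, "present")           # both homographs
noun_hits = find(pairs, "present", "NOUN")  # only the noun
verb_hits = find(pairs, "present", "VERB")  # only the verb
```

A search on the plain text cannot separate the two uses of "present"; a search on the tagged text can, which is what is meant by more specific retrieval.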

7
  • Part-of-speech annotation was one of the first
    types of annotation to be performed on corpora
    and is the most common today. One reason is that
    it is a task that can be carried out to a high
    degree of accuracy by a computer. Greene and
    Rubin (1971) achieved a 71% rate of correctly
    tagged words with their early part-of-speech
    tagging program (TAGGIT). In the early 1980s the
    UCREL team at Lancaster University reported a
    success rate of 95% using their program CLAWS.
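
An accuracy rate of this kind is normally the share of words whose automatic tag matches a hand-checked reference tag. The sketch below assumes that definition; the tag lists are invented and the function name is hypothetical.

```python
def tagging_accuracy(gold, predicted):
    """Proportion of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Invented example: the tagger mislabels the third word.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted = ["DET", "NOUN", "NOUN", "DET", "NOUN"]
accuracy = tagging_accuracy(gold, predicted)  # 4 of 5 tags agree
```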

8
  • The Brown Corpus, the LOB Corpus and the British
    National Corpus (BNC) are examples of
    grammatically annotated corpora.

9
Lemmatisation
  • Lemmatisation is closely allied to the
    identification of parts-of-speech and involves
    the reduction of the words in a corpus to their
    respective lexemes. Lemmatisation allows the
    researcher to extract and examine all the
    variants of a particular lexeme without having to
    input all the possible variants, and to produce
    frequency and distribution information for the
    lexeme. Although accurate software has been
    developed for this purpose (Beale 1987),
    lemmatisation has not been applied to many of the
    more widely available corpora. However, the
    SUSANNE corpus does contain lemmatised forms of
    the corpus words, along with other information.
    See the example below, in which the final column
    of each line contains the lemmatised word.

10
  • N12:0510g - PPHS1m He he
  • N12:0510h - VVDv studied study
  • N12:0510i - AT the the
  • N12:0510j - NN1c problem problem
  • N12:0510k - IF for for
  • N12:0510m - DD221 a a
  • N12:0510n - DD222 few few
  • N12:0510p - NNT2 seconds second
  • N12:0520a - CC and and
  • N12:0520b - VVDv thought think
  • N12:0520c - IO of of
  • N12:0520d - AT1 a a
  • N12:0520e - NNc means means
  • N12:0520f - IIb by by
  • N12:0520g - DDQr which which
  • N12:0520h - PPH1 it it
  • N12:0520i - VMd might may
  • N12:0520j - VB0 be be
  • N12:0520k - VVNt solved solve
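
The lexeme-level retrieval that lemmatisation makes possible can be sketched in Python. The records below are invented toy lines that only imitate a five-field layout like the one in the example (reference, status, tag, word, lemma). The references, some of the tags and the grouping code are assumptions for illustration and are not part of any SUSANNE tooling.

```python
from collections import Counter, defaultdict

# Invented records: reference, status, tag, word form, lemma.
SAMPLE = """
X01 - VVDv studied study
X02 - VVZ studies study
X03 - NN1c problem problem
X04 - NN2 problems problem
X05 - VV0 study study
"""

variants = defaultdict(set)  # lemma -> set of word forms seen
frequency = Counter()        # lemma -> number of occurrences

for line in SAMPLE.strip().splitlines():
    reference, status, tag, word, lemma = line.split()
    variants[lemma].add(word.lower())
    frequency[lemma] += 1

study_forms = sorted(variants["study"])
study_count = frequency["study"]
```

Looking up the lemma "study" returns all of its variants and their combined frequency without the variants having to be listed by hand. The sketch groups on the lemma string alone, so it would not separate a noun and a verb that share a lemma.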

11
Parsing
  • Parsing involves the procedure of bringing basic
    morphosyntactic categories into high-level
    syntactic relationships with one another. This is
    probably the most commonly encountered form of
    corpus annotation after part-of-speech tagging.
    Parsed corpora are sometimes known as treebanks.
    This term alludes to the tree diagrams or "phrase
    markers" used in parsing.

12
  • (S = sentence, NP = noun phrase, VP = verb
    phrase, PP = prepositional phrase, N = noun,
    V = verb, AT = article, P = preposition.)
  • Such visual diagrams are rarely encountered in
    corpus annotation - more often the identical
    information is represented using sets of labelled
    brackets.

13
  • [S[NP Claudia_NP1 NP][VP sat_VVD [PP on_II [NP a_AT1 stool_NN1 NP] PP] VP] S]
  • Morphosyntactic information is attached to the
    words by underscore characters ( _ ) in the form
    of part-of-speech tags, whereas the constituents
    are indicated by opening and closing square
    brackets annotated at the beginning and end with
    the phrase type, e.g. [S ...... S]
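
A labelled bracketing of this kind can be read back into a tree with a short stack-based routine. The Python sketch below assumes each constituent opens with `[LABEL` and closes with `LABEL]`, and that words carry their tag after an underscore. The function names and the tuple layout of the tree are hypothetical choices for this example.

```python
import re

SENTENCE = "[S[NP Claudia_NP1 NP][VP sat_VVD [PP on_II [NP a_AT1 stool_NN1 NP] PP] VP] S]"

# Opening bracket + label, label + closing bracket, or a word_TAG token.
TOKEN = re.compile(r"\[[A-Z]+|[A-Z]+\]|[^\s\[\]]+")


def parse_brackets(text):
    """Return a (label, children) tree; leaves are (word, tag) pairs."""
    stack = [("ROOT", [])]
    for tok in TOKEN.findall(text):
        if tok.startswith("["):
            stack.append((tok[1:], []))          # open a new constituent
        elif tok.endswith("]"):
            label, children = stack.pop()        # close the current one
            if label != tok[:-1] or not stack:
                raise ValueError(f"unexpected closing label {tok!r}")
            stack[-1][1].append((label, children))
        else:
            word, _, tag = tok.rpartition("_")
            stack[-1][1].append((word, tag))
    if len(stack) != 1 or len(stack[0][1]) != 1:
        raise ValueError("input is not a single well-formed bracketing")
    return stack[0][1][0]


def words(node):
    """List the words under a node, left to right."""
    first, rest = node
    if isinstance(rest, str):    # leaf: (word, tag)
        return [first]
    return [w for child in rest for w in words(child)]


tree = parse_brackets(SENTENCE)
top_level = [child[0] for child in tree[1]]
```

Checking that each closing label matches its opening label is the main benefit of writing the phrase type at both ends of a bracket: a mismatched bracketing is detected rather than silently misread.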

14
  • Because automatic parsing (via computer programs)
    has a lower success rate than part-of-speech
    annotation, it is often either post-edited by
    human analysts or carried out by hand (although
    possibly with the help of parsing software).
  • The disadvantage of manual parsing, however, is
    inconsistency, especially where more than one
    person is parsing or editing the corpus, as is
    often the case on large projects. The usual
    solution is more detailed guidelines, but even
    then ambiguities can occur where more than one
    interpretation is possible.
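
Inconsistency between analysts can be put into numbers. One widely used measure, which the slides themselves do not mention, is Cohen's kappa: the observed agreement between two annotators, corrected for the agreement expected by chance. The sketch below is a plain-Python version under that definition, with invented labels.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented phrase labels from two analysts for the same six constituents.
annotator_1 = ["NP", "VP", "PP", "NP", "VP", "NP"]
annotator_2 = ["NP", "VP", "NP", "NP", "VP", "PP"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the analysts agree on four of six items, but because some of that agreement would arise by chance, kappa comes out lower than the raw agreement rate.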

15
Semantics
  • Semantic annotation is the marking of semantic
    relationships between items in the text, for
    example the agents or patients of particular
    actions. This has scarcely begun to be widely
    accepted at the time of writing, although some
    forms of parsing capture much of its import.

16
  • There is no universal agreement about which
    semantic features ought to be annotated.

17
Discoursal and text linguistic annotation
  • Annotations of language at the levels of text
    and discourse are among the least frequently
    encountered in corpora. However, such
    annotations are occasionally applied.

18
Discourse tags
  • Stenström (1984) annotated the London-Lund spoken
    corpus with 16 "discourse tags". They included
    categories such as
  • "apologies" e.g. sorry, excuse me
  • "greetings" e.g. hello
  • "hedges" e.g. kind of, sort of thing
  • "politeness" e.g. please
  • "responses" e.g. really, that's right
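
As a rough illustration only, the example phrases listed above can be turned into a lookup-based tagger in Python. This is not Stenström's procedure: the lexicon contains just the examples from the slide, the function name is hypothetical, and a pure lookup ignores context (for instance "really" is not always a response).

```python
import re

# Only the example phrases given on the slide; not the full 16-tag scheme.
DISCOURSE_LEXICON = {
    "apologies": ["sorry", "excuse me"],
    "greetings": ["hello"],
    "hedges": ["kind of", "sort of thing"],
    "politeness": ["please"],
    "responses": ["really", "that's right"],
}


def tag_discourse(utterance):
    """Return (phrase, category) pairs in order of appearance."""
    text = utterance.lower()
    found = []
    for category, phrases in DISCOURSE_LEXICON.items():
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            for match in re.finditer(pattern, text):
                found.append((match.start(), phrase, category))
    return [(phrase, category) for _, phrase, category in sorted(found)]


tags = tag_discourse("Hello, sorry, could you kind of move, please?")
```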

19
  • Despite their potential role in the analysis of
    discourse these kinds of annotation have never
    become widely used, possibly because the
    linguistic categories are context-dependent and
    their identification in texts is a greater source
    of dispute than other forms of linguistic
    phenomena.

20
Anaphoric annotation
  • Anaphoric annotation is the marking of pronoun
    reference. Our pronoun system can only be
    realised and understood by reference to large
    amounts of empirical data, in other words,
    corpora.
  • This annotation has to be carried out by human
    analysts, since one of its aims is to provide
    data with which computer programs can be trained
    to carry out the task.
  • Only a few corpora have been anaphorically
    annotated; one of these is the Lancaster/IBM
    anaphoric treebank.
21
Phonetic transcription
  • Spoken language corpora can also be transcribed
    using a form of phonetic transcription.
  • It is carried out by humans rather than
    computers, and the transcribers have to be well
    skilled in the perception and transcription of
    speech sounds.
  • Phonetic transcription is therefore a very
    time-consuming task.
  • Speech "sounds" do not have clear boundaries, so
    what phonetic transcription takes to be the same
    sound may differ according to context.

22
  • Phonetically transcribed corpora are extremely
    useful to the linguist who lacks the
    technological tools and expertise for the
    laboratory analysis of recorded speech.
  • One such example is the MARSEC corpus, which is
    derived from the Lancaster/IBM Spoken English
    Corpus and has been manipulated by the
    Universities of Lancaster and Leeds. The MARSEC
    corpus will include a phonetic transcription.

23
Prosody
  • Prosody refers to all aspects of the sound system
    above the level of segmental sounds e.g. stress,
    intonation and rhythm. The annotations in
    prosodically annotated corpora typically follow
    widely accepted descriptive frameworks for
    prosody such as that of O'Connor and Arnold
    (1961). Usually, only the most prominent
    intonations are annotated, rather than the
    intonation of every syllable. The example below
    is taken from the London-Lund corpus

24
Problem-oriented tagging
  • Problem-oriented tagging is the practice whereby
    users take a corpus, whether already annotated
    or not, and add to it their own form of
    annotation, oriented particularly towards their
    own research goal. This differs in two ways from
    the other types of annotation we have examined
    in this session.

25
  • It is not exhaustive. Not every word (or
    sentence) is tagged - only those which are
    directly relevant to the research. This is
    something which problem-oriented tagging has in
    common with anaphoric annotation.

26
  • Annotation schemes are selected, not for broad
    coverage and theory-neutrality, but for the
    relevance of the distinctions they make to the
    specific questions that the researcher wishes to
    ask of his/her data.

27
  • Although it is difficult to generalise further
    about this form of corpus annotation, it is an
    important type to keep in mind in the context of
    practical research using corpora.
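
A minimal Python sketch of problem-oriented tagging follows. It assumes a hypothetical research question about modal verbs, so only modals receive a tag and every other word is left untouched. The word list, the MODAL tag and the function name are all invented for this example.

```python
# Hypothetical research goal: study modal verbs only.
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}


def tag_modals(tokens):
    """Attach _MODAL to modal verbs; leave all other tokens as they are."""
    return [
        f"{token}_MODAL" if token.lower() in MODALS else token
        for token in tokens
    ]


tokens = "it might be solved and it must be simple".split()
tagged = tag_modals(tokens)
tagged_count = sum(token.endswith("_MODAL") for token in tagged)
```

Only two of the nine words are tagged, which reflects the two properties described above: the annotation is not exhaustive, and the scheme is chosen for the research question rather than for broad coverage.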

28
Error Tagging

29
Problem-oriented annotation

30
Automatic or manual

31
  • CLAWS tagger (online trial service)
  • http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/tagservice.html
  • http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/ (for licensed users)
  • AMALGAM tagger (email)
  • http://www.scs.leeds.ac.uk/ccalas/amalgam/amalgtag3.html
  • Brill's tagger
  • http://www.cs.jhu.edu/brill/code.html
  • GoTagger (free)
  • http://uluru.lang.osaka-u.ac.jp/k-goto/download/download.cgi?nameGoTagger07.zip
  • TreeTagger (free, DOS, Windows GUI available but not user-friendly)
  • http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

36
Anno Tool