Corpus Annotation: The Index Thomisticus Treebank

1
Corpus Annotation: The Index Thomisticus Treebank
  • Marco Passarotti
  • Università Cattolica del Sacro Cuore
  • Milan

2
1. Corpus annotation
3
Annotation
  • Annotation is the practice of adding different
    kinds of meta-data to corpus data
  • Editorial (relationship between the corpus and
    its original source): additions/omissions,
    corrections, normalizations
  • Descriptive (classificatory info): bibliographic
    info, author, year, language
  • Administrative (documentary info): file and tag
    descriptions, availability of the corpus,
    revision status
  • Analytic (interpretation and analysis of corpus
    components): structural features (e.g. text
    subdivision), quotations, foreign words,
    named-entity recognition (names, addresses,
    dates, measures), linguistic information
    (corpus annotation)
  • Leech (2004): "Corpus annotation is the practice
    of adding interpretative linguistic information
    to a corpus"

4
Layers of linguistic annotation
  • Digital text: tokenization
  • (Spoken corpora) Phonetic and prosodic
    annotation (tone, stress)
  • Lemmatization (lmz)
  • Morphological annotation
  • Morpho-syntactic annotation
  • Syntactic analysis (treebanks)
  • Topic-focus articulation; anaphora resolution
  • Pragmatic-rhetorical annotation (RST; speech
    acts, illocutionary force)
  • Semantic annotation: word-sense disambiguation
  • Multimodal annotation of speech: word, gesture,
    facial expression, intonation, perceivable
    context
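
The layers above can be pictured as successive fields attached to each token. The following is a minimal sketch in Python, assuming a hypothetical `Token` record; the field names, the labels "Sb" and "Pred", and the toy sentence "rosa floret" are illustrative choices, not the IT-TB data format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    """One token with optional annotation layers (hypothetical record)."""
    form: str                          # tokenization: the word as it appears
    lemma: Optional[str] = None        # lemmatization layer
    pos: Optional[str] = None          # morpho-syntactic layer (part of speech)
    feats: dict = field(default_factory=dict)  # morphological layer
    head: Optional[int] = None         # syntactic layer: 1-based head, 0 = root
    deprel: Optional[str] = None       # syntactic layer: function label


def tokenize(text):
    """Naive whitespace tokenization, enough for the toy example."""
    return [Token(form=w) for w in text.split()]


def annotated_layers(token):
    """List the layers that have been filled in for a token."""
    layers = ["form"]
    if token.lemma is not None:
        layers.append("lemma")
    if token.pos is not None:
        layers.append("pos")
    if token.feats:
        layers.append("feats")
    if token.head is not None:
        layers.append("syntax")
    return layers


# Toy sentence, annotated by hand for illustration only.
sentence = tokenize("rosa floret")
sentence[0].lemma, sentence[0].pos = "rosa", "NOUN"
sentence[0].feats = {"case": "nom", "number": "sg"}
sentence[0].head, sentence[0].deprel = 2, "Sb"
sentence[1].lemma, sentence[1].pos = "floreo", "VERB"
sentence[1].head, sentence[1].deprel = 0, "Pred"

for tok in sentence:
    print(tok.form, annotated_layers(tok))
```

Keeping every layer optional reflects the idea that a corpus can be annotated incrementally, one layer at a time.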

6
2. Treebanks: Corpora annotated at syntactic level
7
PhSG vs. DG
  • PhSG (Phrase-Structure Grammar; Chomsky,
    Schützenberger):
  • Words, PoS, phrases, start symbols
  • Set inclusion: categorization (e.g. word, N, NP,
    S)
  • DG (Dependency Grammar; Tesnière, Mel'čuk, etc.):
  • Only words (Chomsky's terminals)
  • Head-dependent relations between words
  • Lexical nodes are linked through binary relations
    (dependencies) and are annotated with functional
    categories
  • Sentence word order is not marked
  • Suitable for free-word-order (richly inflected)
    languages: Dutch (Alpino), Czech (PDT), Latin
    (IT-TB and LDT)
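
The contrast between the two formalisms can be sketched with a toy English sentence. This is a minimal illustration in Python; the nested-tuple encoding, the head list, and the labels "det", "subj" and "pred" are assumptions made for the example, not the notation of any particular treebank.

```python
# Phrase structure: nested (label, child, ...) tuples; words are the leaves.
phrase_structure = (
    "S",
    ("NP", ("Det", "the"), ("N", "dog")),
    ("VP", ("V", "barks")),
)

# Dependency: only the words themselves are nodes.
# heads[i] is the 1-based position of the head of word i+1; 0 is the root.
words = ["the", "dog", "barks"]
heads = [2, 3, 0]
labels = ["det", "subj", "pred"]  # functional categories (illustrative)


def leaves(tree):
    """Return the words of a phrase-structure tree in order."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree[1:]:
        result.extend(leaves(child))
    return result


def count_nodes(tree):
    """Count all nodes (phrases, PoS symbols, and words)."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(count_nodes(child) for child in tree[1:])


def dependents(position):
    """Return the words directly governed by the word at a 1-based position."""
    return [words[i] for i, h in enumerate(heads) if h == position]


print(leaves(phrase_structure))       # same words as the dependency analysis
print(count_nodes(phrase_structure))  # phrase structure adds non-terminal nodes
print(dependents(3))                  # words depending on "barks"
```

The dependency analysis has exactly one node per word, whereas the phrase-structure tree adds PoS and phrase nodes on top of the same words.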

8
3. Latin treebanks
9
Latin features
  • Richly inflected
  • Much homonymy
  • Free word order: non-projective dependencies
  • If we put the words in the linear order, the
    edges cannot be drawn above the words without
    crossings (McDonald, Hajič, Pereira, Ribarov,
    2005)

that glory would know my old age (Prop., I.8.46)
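
The crossing-edges description of non-projectivity can be checked mechanically. Below is a minimal sketch in Python, assuming a sentence is given as a list of 1-based head positions with 0 for an artificial root placed before the first word; the function names and both head lists are made up for illustration and do not encode the Propertius line.

```python
def crossing_edges(heads):
    """Return pairs of edges that cross when drawn above the words.

    heads[i] is the 1-based position of the head of word i+1;
    0 stands for an artificial root placed to the left of the sentence.
    """
    # Represent each edge by its (left, right) endpoints in linear order.
    arcs = [tuple(sorted((dep, head)))
            for dep, head in enumerate(heads, start=1)]
    crossings = []
    for i, (a, b) in enumerate(arcs):
        for c, d in arcs[i + 1:]:
            # Two edges cross if exactly one endpoint of one edge
            # lies strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                crossings.append(((a, b), (c, d)))
    return crossings


def is_projective(heads):
    """A tree is treated as projective here if no two edges cross."""
    return not crossing_edges(heads)


# Hypothetical head lists, for illustration only.
nested = [2, 3, 0]        # word 1 -> 2, word 2 -> 3, word 3 is the root
crossed = [0, 4, 1, 1]    # word 2 depends on word 4 across the edge 1-3

print(is_projective(nested))
print(crossing_edges(crossed))
```

The pairwise comparison is quadratic in sentence length, which is acceptable for a sketch; it trades efficiency for a direct match with the definition given above.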
10
IT-Treebank and LDT
  • A collaborative project (official partnership)
    between Università Cattolica (Milan) and the
    Perseus Project (Tufts University, Boston)
  • Latin Dependency Treebank: texts from the
    Classical era (around 35,000 tokens annotated)
  • Index Thomisticus Treebank (Busa): Thomas
    Aquinas, opera omnia (around 35,000 tokens
    annotated)
  • Dependency Grammar (via PDT): a single annotation
    manual (guidelines)

11
4. Samples from IT-TB: http://itreebank.marginalia.it
19
5. Research perspectives
20
Research perspectives
  • IT-TB:
  • Manual syntactic annotation
  • Train data-driven NLP tools (parsers, PoS
    taggers); shallow parsing for semi-automatic
    annotation
  • Dynamic valency-frame lexicon
  • Updating the IT with new texts and new critical
    editions
  • Enhancing text and manuscript images (University
    of Navarra)
  • IT-TB and LDT:
  • Joint workshops (TLT 2009?)
  • Annotation back-off (comparison)
  • Consistency checking in related annotation (e.g.
    variation n-grams, Dickinson and Meurers)
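
The idea behind n-gram-based consistency checking can be sketched as follows. This is a simplified illustration in Python, not Dickinson and Meurers' implementation: it assumes a corpus given as lists of (word, tag) pairs and flags every word n-gram that recurs with more than one tag sequence. The function name, the tags, and the toy English corpus are hypothetical.

```python
from collections import defaultdict


def variation_ngrams(tagged_sentences, n=3):
    """Map each recurring word n-gram to its tag sequences when they differ."""
    seen = defaultdict(set)
    for sentence in tagged_sentences:
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            words = tuple(word for word, _ in window)
            tags = tuple(tag for _, tag in window)
            seen[words].add(tags)
    # Keep only n-grams annotated in more than one way.
    return {words: tags for words, tags in seen.items() if len(tags) > 1}


# Toy corpus, for illustration only.
corpus = [
    [("they", "PRON"), ("can", "AUX"), ("fish", "VERB")],
    [("they", "PRON"), ("can", "VERB"), ("fish", "NOUN")],
    [("we", "PRON"), ("can", "AUX"), ("swim", "VERB")],
]

flagged = variation_ngrams(corpus, n=3)
for words, tag_sequences in flagged.items():
    print(" ".join(words), sorted(tag_sequences))
```

A flagged n-gram is only a candidate for manual review: the same words in the same context may be annotated differently because of genuine ambiguity rather than annotator error.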