Title: Corpus Annotation II
1. Corpus Annotation II
- Martin Volk
- Stockholm University
2. Overview
- Clean-Up and Text Structure Recognition
- Sentence Boundary Recognition
- Proper Name Recognition and Classification
- Part-of-Speech Tagging
- Tagging Correction and Sentence Boundary Correction
- Lemmatisation and Lemma Filtering
- NP/PP Chunk Recognition
- Recognition of Local and Temporal PPs
- Clause Boundary Recognition
3. Input Docs
- (Pipeline figure) Input documents are processed in stages:
  - Tokenizer and sentence boundary recognizer (abbreviation lists)
  - Proper name recognizer: persons, locations (first-name list, location list)
  - Part-of-speech tagger and lemmatiser (training corpus: SUC)
  - Swetwol morphological analyser for lemmas, tags, compounds (morphological rules, lexicon)
4. Part-of-Speech Tagging for German
- was done with the TreeTagger (from Helmut Schmid, IMS Stuttgart).
- The TreeTagger
  - is a statistical tagger.
  - uses the STTS tag set (50 PoS tags and 3 tags for punctuation).
  - assigns 1 tag to each word form.
  - preserves pre-set tags.
5. A statistical Part-of-Speech tagger
- learns tagging rules from a manually Part-of-Speech annotated corpus (= training corpus).
  - Vid/PR kliniken/NN i/PR Huddinge/PM övervakas/VB nu/AB Mijailovic/PM ständigt/AB av/PR två/RG vårdare/NN.
- applies the learned rules to new sentences (see the sketch after this slide).
- Problems:
  - words that were not in the training corpus.
  - words with many possible tags.
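To make the learning step concrete, here is a minimal sketch of the statistical idea, reduced to a toy unigram baseline (the TreeTagger itself uses decision trees over tag contexts): tag frequencies are counted from a small hand-annotated corpus and the most frequent tag is applied to new text. The toy corpus and the NN fallback for unknown words are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Toy training corpus: (word form, PoS tag) pairs in SUC-style annotation.
train = [("vid", "PR"), ("kliniken", "NN"), ("i", "PR"), ("Huddinge", "PM"),
         ("övervakas", "VB"), ("nu", "AB"), ("av", "PR"), ("två", "RG"),
         ("vårdare", "NN"), ("av", "PL")]

# Learning step: count how often each tag occurs with each word form.
freq = defaultdict(Counter)
for word, tag in train:
    freq[word.lower()][tag] += 1

def tag(tokens, fallback="NN"):
    """Assign each token its most frequent training tag.
    Unknown words (not in the training corpus) get the fallback tag --
    exactly the problem case named on the slide."""
    return [(t, freq[t.lower()].most_common(1)[0][0] if t.lower() in freq
             else fallback) for t in tokens]

print(tag(["Vårdare", "av", "kliniken", "ständigt"]))
# [('Vårdare', 'NN'), ('av', 'PR'), ('kliniken', 'NN'), ('ständigt', 'NN')]
```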
6. Two Swedish example word forms with multiple PoS tags in SUC
- av
- adverb (AB) 48 times
- particle (PL) 407 times
- proper name (PM) 4 times
- preposition (PR) 14580 times
- foreign word (UO) 2 times
- lagar (EN: 'laws' or 'makes/repairs')
- noun (NN) 43 times
- verb (VB) 5 times
7. Part-of-Speech Tagging for Swedish
- is done with the TreeTagger,
- which is trained on SUC (Stockholm-Umeå Corpus, 1 million words)
- with the SUC tag set (slightly enlarged):
  - originally 22 tags, plus VBFIN, VBINF, VBSUP, VBIMP
- has an estimated error rate of 4% (i.e. every 25th word is incorrectly tagged!)
8. Part-of-Speech Tagging with Lemmatisation
- The TreeTagger also assigns lemmas that it has learned from the training corpus.
- Rule: If word form W in the corpus has
  - lemma L1 with tag T1 and
  - lemma L2 with tag T2,
  - then the TreeTagger will assign the lemma corresponding to the chosen tag.
- Example: Swedish låg
  - lemma ligga (EN: to lie) with tag VBFIN (finite full verb)
  - lemma låg (EN: low) with tag JJ (adjective)
- a nice example of PoS tagging as word sense disambiguation
9. PoS Tagging with Lemmatisation
- But it is possible that word form W has more than one lemma with tag T1 in the training corpus.
- Example: Swedish kön
  - kö (EN: queue), noun
  - kön (EN: gender, sex), noun
- The TreeTagger will simply assign all lemmas to W that go with T1 (no lemma disambiguation).
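A minimal sketch of the lemma-assignment rule from the last two slides: a table learned from the training corpus maps (word form, tag) to lemmas; the tag chosen by the tagger selects the lemma, and if several lemmas share that tag, all are returned undisambiguated. The table entries merely mirror the slides' examples.

```python
from collections import defaultdict

# (word form, tag) -> set of lemmas, learned from the training corpus.
lemma_table = defaultdict(set)
lemma_table[("låg", "VBFIN")].add("ligga")   # 'lay' (past tense of 'to lie')
lemma_table[("låg", "JJ")].add("låg")        # 'low'
lemma_table[("kön", "NN")].add("kö")         # 'queue' (definite form)
lemma_table[("kön", "NN")].add("kön")        # 'gender, sex'

def lemmatise(word, chosen_tag):
    """Return the lemma(s) matching the tag chosen by the tagger.
    If several lemmas share that tag, return all of them joined by '|'
    (no lemma disambiguation, as the slide notes)."""
    lemmas = lemma_table.get((word, chosen_tag))
    return "|".join(sorted(lemmas)) if lemmas else word  # fall back to the form

print(lemmatise("låg", "JJ"))    # låg
print(lemmatise("kön", "NN"))    # kö|kön
```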
10. Tagging Correction in German
- Correction of observed tagger problems (a sketch follows after this slide):
- Sentence-initial adjectives
  - are often tagged as noun (NN)
  - words ending in '...lichen' or '...ischen' → ADJA
- Verb group patterns
  - the verb in front of 'worden' must be a perfect participle:
    - VVXXX 'worden' → VVPP
  - if a verb precedes a modal verb, then the verb must be an infinitive:
    - VVXXX VMYYY → VVINF
- Unknown prepositions (a, via, innert, ennet)
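A sketch of how such corrections can be applied as a post-processing pass over the tagger output; the two verb-group rules from this slide are shown with STTS tags, and the simple one-token lookahead is an implementation assumption.

```python
def correct_tags(tagged):
    """Post-correction pass over (word, tag) pairs, sketching the slide's
    two verb-group rules (STTS tags assumed)."""
    out = list(tagged)
    for i in range(len(out) - 1):
        word, tag = out[i]
        nxt_word, nxt_tag = out[i + 1]
        # A full verb directly before 'worden' must be a perfect participle.
        if nxt_word == "worden" and tag.startswith("VV"):
            out[i] = (word, "VVPP")
        # A full verb directly before a modal verb must be an infinitive.
        elif tag.startswith("VV") and nxt_tag.startswith("VM"):
            out[i] = (word, "VVINF")
    return out

print(correct_tags([("gebaut", "VVFIN"), ("worden", "VAPP")]))
# [('gebaut', 'VVPP'), ('worden', 'VAPP')]
```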
11. Correction of sentence boundaries
- E.g. a suspected ordinal number followed by a capitalized
  - determiner or
  - pronoun or
  - preposition or
  - adverb
  - → insert a sentence boundary (sketched below).
- Open question: Could all sentence boundary detection be done after PoS tagging?
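A sketch of the boundary-insertion rule, assuming STTS-style tags and a `<S>` marker for the inserted boundary; the trigger tag set and the ordinal test ("digits plus period") are illustrative assumptions.

```python
def insert_boundaries(tagged):
    """Insert a sentence boundary after a suspected ordinal number when the
    next token is a capitalized determiner, pronoun, preposition or adverb."""
    triggers = {"ART", "PDS", "PPER", "APPR", "ADV"}  # assumed trigger tags
    out = []
    for i, (word, tag) in enumerate(tagged):
        out.append((word, tag))
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        if (word.endswith(".") and word[:-1].isdigit()   # e.g. '3.' = ordinal?
                and nxt and nxt[1] in triggers and nxt[0][:1].isupper()):
            out.append(("<S>", "SENT"))
    return out

print(insert_boundaries([("Kapitel", "NN"), ("3.", "CARD"),
                         ("Die", "ART"), ("Firma", "NN")]))
# [('Kapitel', 'NN'), ('3.', 'CARD'), ('<S>', 'SENT'), ('Die', 'ART'), ('Firma', 'NN')]
```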
12. Lemmatisation for Swedish
- is (partly) done by the TreeTagger by re-using the lemmas from SUC (Stockholm-Umeå Corpus)
- Limits:
  - word forms that are not in SUC, in particular
    - names → proper name recognition
    - compounds → Swetwol
    - neologisms, foreign expressions → ??
  - SUC lemmas have no compound boundaries
    - (byskolan → byskola), (konstindustriskolan → konstindustriskola)
  - elliptical compounds (e.g. kostnads- och tidseffektivt) → ??
    - the TreeTagger ignores the hyphen.
  - upper case / lower case (e.g. Bo vs. bo) → ??
    - the TreeTagger treats them separately.
13. Morphological information
- such as case, number, gender etc.
- is important for correct linguistic analysis.
- could be taken from SUC based on the triple (see the sketch after this slide)
  - word form + PoS tag + lemma
- Examples:
  - kön NN kön → NEUtrum SINgular INDefinite NOMinative
  - kön NN kö → UTRum SINgular DEFinite NOMinative
- Limits:
  - word forms that are not in SUC, and
  - triples that have more than one set of morphological features.
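A sketch of the triple lookup, with the slide's two kön examples as the only entries; both limits (unknown forms, ambiguous triples) surface naturally as return values.

```python
# (word form, PoS tag, lemma) -> morphological feature sets seen in SUC.
# The entries mirror the slide's examples; real coverage comes from the corpus.
morph = {
    ("kön", "NN", "kön"): [{"NEU", "SIN", "IND", "NOM"}],
    ("kön", "NN", "kö"):  [{"UTR", "SIN", "DEF", "NOM"}],
}

def features(word, tag, lemma):
    """Return the feature set for a triple, or flag the two limit cases."""
    sets = morph.get((word, tag, lemma))
    if sets is None:
        return None          # word form not in SUC
    if len(sets) > 1:
        return "ambiguous"   # triple with more than one feature set
    return sets[0]

print(features("kön", "NN", "kö"))   # e.g. {'UTR', 'SIN', 'DEF', 'NOM'}
```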
14. Lemmatisation for Swedish
- can be done with Swetwol (Lingsoft Oy, Helsinki) for
  - adjectives (inflection: lyckligt - lyckliga; gradation: söt - sötare - sötaste),
  - nouns (inflection: hus - husen - huset),
  - verbs (inflection: arbeta - arbetar - arbetat).
- Swetwol
  - is a two-level morphology analyzer for Swedish
  - is lexicon-based
  - returns all possible interpretations for each word form
    - kön → kön N NEU INDEF SG/PL NOM
    - kön → kö N UTR DEF SG NOM
  - segments compound words dynamically if all parts are known
    - cirkusskolan → cirkusskola
  - analyzes hyphenated compounds only if all parts are known
    - FN-uppdraget → FN-uppdrag
    - tPA-plantan → ?? although plantan → planta
    - → feed the last element to Swetwol (sketched below)
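A sketch of the "feed the last element" fallback for hyphenated compounds unknown to Swetwol; the `analyse` callable stands in for a Swetwol lookup and the toy lexicon is an assumption.

```python
def lemmatise_hyphenated(compound, analyse):
    """If a hyphenated compound is unknown to the analyser, analyse its last
    element alone and re-attach the prefix unchanged."""
    lemma = analyse(compound)
    if lemma is not None:
        return lemma                      # analyser knew the whole compound
    prefix, _, last = compound.rpartition("-")
    last_lemma = analyse(last)            # feed only the last element
    return prefix + "-" + last_lemma if last_lemma else None

# Toy stand-in for Swetwol: knows 'plantan' but not the full compound.
lexicon = {"plantan": "planta", "FN-uppdraget": "FN-uppdrag"}
print(lemmatise_hyphenated("tPA-plantan", lexicon.get))  # tPA-planta
```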
15. Lemmatisation for German
- can be done with Gertwol (Lingsoft Oy, Helsinki) for
  - adjectives (inflection: schöne - schönes; gradation: schöner - schönste),
  - nouns (inflection: Haus - Hauses - Häuser - Häusern),
  - prepositions (contractions: zum, zur → zu), and
  - verbs (inflection: zeige - zeigst - zeigt - zeigte - zeigten).
- Gertwol
  - is a two-level morphology analyzer for German
  - is lexicon-based
  - returns all possible interpretations for each word form
  - segments compound words dynamically
  - analyzes hyphenated compounds only if all parts are known
    - e.g. Software-Aktien but not Informix-Aktien
    - → feed the last element to Gertwol
16. Lemma Filtering (a project by Julian Käser)
- After lemmatisation: merging of Ger/Swetwol and tagger information
- Case 1: The lemma was prespecified during proper name recognition (IBMs → IBM)
- Case 2: Ger/Swetwol does not find a lemma → insert the word form as lemma (mark it with '?')
17. Lemma Filtering
- Case 3: Ger/Swetwol finds exactly one lemma for the given PoS → insert the lemma
- Case 4: Ger/Swetwol finds multiple lemmas for the given PoS → disambiguate and insert the best lemma (see the sketch below)
- Disambiguation weights the segmentation symbols:
  - strong compound segment boundary: 4 points
  - weak compound segment boundary: 2 points
  - derivational segment boundary: 1 point
  - the lemma with the lowest score wins!
- Examples:
  - Abteilungen → Abteilunge (5 points) vs. Abteilung (3 points)
  - rådhusklockan → rådhusklocka (6 p.) vs. rådhusklocka (8 p., with a different segmentation)
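A sketch of the scoring scheme, assuming `#`, `|` and `~` as the strong, weak and derivational segmentation symbols (the concrete symbols and the candidate segmentations below are assumptions; only the weights come from the slide).

```python
# Segmentation-symbol weights from the slide; the symbols are assumed.
WEIGHTS = {"#": 4,   # strong compound segment boundary
           "|": 2,   # weak compound segment boundary
           "~": 1}   # derivational segment boundary

def score(segmented_lemma):
    """Sum the weights of all segmentation symbols in a lemma candidate."""
    return sum(WEIGHTS.get(ch, 0) for ch in segmented_lemma)

def best_lemma(candidates):
    """The candidate with the lowest score wins."""
    return min(candidates, key=score)

# Two hypothetical candidate segmentations for the same word form:
cands = ["råd#hus#klocka", "rådhus#klocka"]
print(best_lemma(cands))   # rådhus#klocka  (score 4 beats score 8)
```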
18. Lemma Filtering
- Case 5: Ger/Swetwol finds a lemma, but not for the given PoS
  - → this indicates a tagger error (Ger/Swetwol is more reliable than the tagger).
- Case 5.1: Ger/Swetwol finds a lemma for exactly one PoS → insert the lemma and exchange the PoS tag
- Case 5.2: Ger/Swetwol finds lemmas for more than one PoS → find the closest PoS tag, or guess
- Option: Check if the PoS tag in the corpus was licensed by SUC. If yes, ask the user for a decision.
19. Lemma Filtering for German
- 0.74% of all PoS tags were exchanged (2% of adjective tags, noun tags, verb tags).
- In other words: 14'000 tags per annual volume of the ComputerZeitung were exchanged.
- 85% are cases with exactly one Gertwol tag; 15% are guesses.
20. Limitations of Gertwol
- Compounds are lemmatized only if all parts are known.
- Idea: Use a corpus for lemmatizing the remaining compounds (a sketch follows below)
- Examples: kaputtreden, Waferfabriken
- Solution:
  - If the first part occurs standing alone AND
  - the second part occurs standing alone with a lemma,
  - then segment and lemmatize!
  - and store the first part as a lemma (of itself)!
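A sketch of this corpus-based fallback: try every split point and accept the first one where both conditions hold. `forms_seen` and `lemma_of` stand in for corpus lookups and are assumptions.

```python
def lemmatise_compound(compound, forms_seen, lemma_of):
    """Corpus-based fallback for compounds the analyser cannot segment:
    if the first part occurs standing alone in the corpus AND the second
    part occurs standing alone with a known lemma, segment and lemmatise."""
    for i in range(1, len(compound)):
        first, second = compound[:i], compound[i:]
        if first.lower() in forms_seen and lemma_of(second):
            return first + lemma_of(second)
    return None   # no licensed split found

corpus_forms = {"wafer", "kaputt"}            # toy corpus evidence
lemmas = {"fabriken": "fabrik", "reden": "reden"}
print(lemmatise_compound("Waferfabriken", corpus_forms, lemmas.get))
# Waferfabrik
```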
21. NP/PP Chunk Recognition (a project by Dominik A. Merz)
- adapted to Swedish by Jerker Hagman, 2004
- Pattern matcher with patterns over PoS tags (a sketch of the matching loop follows below)
- Example patterns:
  - ADV ADJ → AP
  - ART AP NN → NP
  - PR NP → PP
- Example tree (figure)
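A sketch of the pattern-matching idea using the three example rules above as regular expressions over the tag sequence; the real system's 135 rules and the tree output are not reproduced.

```python
import re

# Chunking rules as patterns over PoS-tag sequences (from the slide).
RULES = [(r"ADV ADJ", "AP"),
         (r"ART (AP )?NN", "NP"),
         (r"PR NP", "PP")]

def chunk(tags):
    """Repeatedly rewrite the tag sequence with the first rule that matches,
    until no rule applies; only chunk labels are kept, not the tree."""
    s = " ".join(tags)
    changed = True
    while changed:
        changed = False
        for pattern, label in RULES:
            new = re.sub(pattern, label, s, count=1)
            if new != s:
                s, changed = new, True
                break
    return s.split()

print(chunk(["PR", "ART", "ADV", "ADJ", "NN"]))  # ['PP']
```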
22. Jerker Hagman's results
- 135 chunking rules
- Categories:
  - AdjP, AdvP,
  - MPN, Coordinated_MPN, MPN_genitive,
  - NP, Coordinated_NP, NP_genitive,
  - PP,
  - VerbCluster (hade gått), InfinitiveGroup (att minska)
- Evaluation against a small treebank:
  - 75% precision
  - 68% recall
23. Recognition of temporal PPs in German (a project by Stefan Höfler)
- A second step towards semantic annotation.
- Starting point:
  - Prepositions (3) that always introduce a temporal PP: binnen, während, zeit
  - Prepositions (30) that may introduce a temporal PP: ab, an, auf, bis, ... → require additional evidence
- Additional evidence (see the sketch below):
  - temporal adverb in the PP: heute, niemals, wann, ...
  - temporal noun in the PP: Minute, Stunde, Jahr, Anfang, ...
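A sketch of the two-tier decision from this slide; the word lists are small excerpts and the function interface is an assumption.

```python
ALWAYS_TEMPORAL = {"binnen", "während", "zeit"}
MAYBE_TEMPORAL = {"ab", "an", "auf", "bis"}          # excerpt of the 30
TEMPORAL_ADVERBS = {"heute", "niemals", "wann"}
TEMPORAL_NOUNS = {"minute", "stunde", "jahr", "anfang"}

def is_temporal_pp(preposition, pp_tokens):
    """Some prepositions always introduce a temporal PP; others count only
    if the PP also contains a temporal adverb or noun (additional evidence)."""
    prep = preposition.lower()
    if prep in ALWAYS_TEMPORAL:
        return True
    if prep in MAYBE_TEMPORAL:
        words = {t.lower() for t in pp_tokens}
        return bool(words & (TEMPORAL_ADVERBS | TEMPORAL_NOUNS))
    return False

print(is_temporal_pp("während", ["der", "Sitzung"]))   # True
print(is_temporal_pp("an", ["diesem", "Anfang"]))      # True (noun evidence)
print(is_temporal_pp("an", ["der", "Wand"]))           # False
```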
24. Recognition of temporal PPs
- Evaluation corpus: 990 sentences with 263 manually checked temporal PPs
- Results:
  - Precision: 81%
  - Recall: 76%
25. Recognition of local PPs
- Starting point:
  - Prepositions that always introduce a local PP: fern, oberhalb, südlich von
  - Prepositions that may introduce a local PP: ab, auf, bei, ... → require additional evidence
- Additional evidence:
  - local adverb in the PP: dort, hier, oben, rechts, ...
  - local noun in the PP: Strasse, Quartier, Land, Norden, <LOC>, ...
26. Recognition of temporal and local PPs
27. A Word on Recall and Precision
- The focus varies with the application!
- Often precision is more important than recall!
- Idea: If I annotate something, then I want to be 'sure' that it is correct.
28. Clause Boundary Recognition (a project by Gaudenz Lügstenmann)
- Definition: A clause is a unit consisting of a full verb together with its (non-clausal) complements and adjuncts.
- A sentence consists of one or more clauses, and a clause consists of one or more phrases.
- Clauses are important for determining the cooccurrence of verbs and PPs (among other things).
29. Dagens Nyheter, 20 Sept. 2004
- <S> Mijailovic vårdas på sjukhus
- <S> Anna Lindhs mördare Mijailo Mijailovic är så sjuk <CB> att han förts till sjukhus.
- <S> Sedan i lördags vårdas han vid rättspsykiatriska kliniken på Karolinska universitetssjukhuset i Huddinge. <S> Dit fördes han <CB> sedan en läkare vid Kronobergshäktet i Stockholm konstaterat <CB> att det fanns risk <CB> att han skulle försöka <CB> ta livet av sig i häktet. <S> Det skriver Aftonbladet och Expressen.
- <S> Mijailovic, <CB> som väntar på rättegången i Högsta domstolen <CB> efter att ha dömts till sluten psykiatrisk vård och inte till fängelse, <CB> ska enligt tidningarna ha slutat ta sina tabletter <CB> och blivit starkt förvirrad. <S> Enligt Kriminalvårdsstyrelsens bestämmelser ska i sådana fall en fånge föras till sjukhus.
30. Clause Boundary Recognition
- Exceptions from the definition: clauses with more than one verb
  - Coordinated verbs
    - Daten können überführt und verarbeitet werden.
  - Perception verb + infinitive verb (AcI)
    - die den Markt wachsen sehen.
  - 'lassen' + infinitive verb
    - lässt die Handbücher übertragen
31. Clause Boundary Recognition
- Exceptions from the definition: clauses without a verb
  - Elliptical clauses (e.g. in coordinated structures)
- Examples:
  - Er beobachtet den Markt und seine Mitarbeiter die Konkurrenz.
  - Heute kann die Welt nur mehr knapp 30 dieser früher äusserst populären Riesenbilder bewundern, drei davon in der Schweiz.
32. Clause Boundary Recognition
- The German CB recognizer is realized as a pattern matcher over PoS tags (34 patterns; a sketch follows below).
- Example patterns:
  - comma + relative pronoun
  - finite verb ... conjunction ... finite verb
- Most difficult: a CB without an overt punctuation symbol or trigger word
  - Example: Budgetreduzierungen in der IT in den Vordergrund zu stellen <CB> ist der falsche Ansatz.
  - This happens often in Swedish.
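A sketch of the pattern-matcher idea for the "comma + trigger word" cases, using STTS tags; the tag sets and the `<CB>` insertion format are assumptions, and the 34 real patterns (including the verb-based ones) are not reproduced.

```python
def mark_clause_boundaries(tagged):
    """Insert <CB> markers after a comma that is followed by a relative
    pronoun or a subordinating conjunction (two of the slide's patterns)."""
    cb_after_comma = {"PRELS", "PRELAT", "KOUS"}   # assumed trigger tags
    out = []
    for i, (word, tag) in enumerate(tagged):
        nxt_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        out.append(word)
        if tag == "$," and nxt_tag in cb_after_comma:
            out.append("<CB>")
    return " ".join(out)

sent = [("Präsident", "NN"), (",", "$,"), ("der", "PRELS"),
        ("Eisner", "NE"), ("kannte", "VVFIN"), (",", "$,"),
        ("schrieb", "VVFIN")]
print(mark_clause_boundaries(sent))
# Präsident , <CB> der Eisner kannte , schrieb
```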
33. Clause Boundary Recognition for German
- Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
- Results (counting all CBs):
  - Precision: 95.8%
  - Recall: 84.9%
- Results (counting only intra-sentential CBs):
  - Precision: 90.5%
  - Recall: 61.1%
34. Using a PoS Tagger for Clause Boundary Recognition in German
- A CB recognizer can be seen as a disambiguator over commas and CB-trigger tokens (if we disregard the CBs without trigger).
- A tagger may serve the same purpose.
- Example:
  - ... schrieb der Präsident,<Co> Michael Eisner,<Co> im Jahresbericht.
  - ... schrieb der Präsident,<CB> der Michael Eisner kannte,<CB> im Jahresbericht.
35. Using a PoS Tagger for Clause Boundary Recognition in German
- Evaluation corpus: 1150 sentences with 754 intra-sentential CBs.
- Training the Brill tagger on 75% and applying it to the remaining 25%
- Results:
  - Precision: 93%
  - Recall: 91%
- Caution: very small evaluation corpus!
36. Clause Boundary Recognition vs. Clause Recognition
- CB recognition marks only the boundaries. It does not identify discontinuous parts of clauses. It does not identify nesting.
- Example:
  - <S> Mijailovic, <CB> som väntar på rättegången i Högsta domstolen <CB> efter att ha dömts till sluten psykiatrisk vård och inte till fängelse, <CB> ska enligt tidningarna ha slutat ta sina tabletter <CB> och blivit starkt förvirrad.
  - <C> Mijailovic, <C> som väntar på rättegången i Högsta domstolen <C> efter att ha dömts till sluten psykiatrisk vård och inte till fängelse, </C></C> ska enligt tidningarna ha slutat ta sina tabletter </C><C> och blivit starkt förvirrad.</C>
- Clause Recognition should be done with a recursive parsing approach because of clause nesting.
37. Summary
- Part-of-Speech tagging based on statistical methods is robust and reliable.
- The TreeTagger assigns PoS tags and lemmas.
- Swetwol is a morphological analyser that, given a word form, outputs the PoS tag, the lemma and the morphological features for all its readings.
- Multiple knowledge sources (e.g. PoS tagger and Swetwol) may lead to conflicting tags.
- Chunking (partial parsing) builds partial trees.
- Clause boundary detection can be realized as pattern matching over PoS tags.