Transcript and Presenter's Notes

Title: Computational Tools for Linguists


1
Computational Tools for Linguists
  • Inderjeet Mani
  • Georgetown University
  • im5@georgetown.edu

2
Topics
  • Computational tools for
  • manual and automatic annotation of linguistic
    data
  • exploration of linguistic hypotheses
  • Case studies
  • Demonstrations and training
  • Inter-annotator reliability
  • Effectiveness of annotation scheme
  • Costs and tradeoffs in corpus preparation

3
Outline
  • Topics
  • Concordances
  • Data sparseness
  • Chomsky's Critique
  • Ngrams
  • Mutual Information
  • Part-of-speech tagging
  • Annotation Issues
  • Inter-Annotator Reliability
  • Named Entity Tagging
  • Relationship Tagging
  • Case Studies
  • metonymy
  • adjective ordering
  • Discourse markers: then
  • TimeML

4
Corpus Linguistics
  • Use of linguistic data from corpora to test
    linguistic hypotheses => emphasizes language use
  • Uses computers to do the searching and counting
    from on-line material
  • Faster than doing it by hand! Check?
  • Most typical tool is a concordancer, but there
    are many others!
  • Tools can analyze a certain amount, rest is left
    to human!
  • Corpus Linguistics is also a particular approach
    to linguistics, namely an empiricist approach
  • Sometimes (extreme view) opposed to the
    rationalist approach, at other times (more
    moderate view) viewed as complementary to it
  • Cf. Theoretical vs. Applied Linguistics

5
Empirical Approaches in Computational Linguistics
  • Empiricism: the doctrine that knowledge is
    derived from experience
  • Rationalism: the doctrine that knowledge is
    derived from reason
  • Computational Linguistics is, by necessity,
    focused on performance, in that naturally
    occurring linguistic data has to be processed
  • Naturally occurring data is messy! This means we
    have to process data characterized by false
    starts, hesitations, elliptical sentences, long
    and complex sentences, input that is in a complex
    format, etc.
  • The methodology used is corpus-based
  • linguistic analysis (phonological, morphological,
    syntactic, semantic, etc.) carried out on a
    fairly large scale
  • rules are derived by humans or machines from
    looking at phenomena in situ (with statistics
    playing an important role)

6
Example metonymy
  • Metonymy: substituting the name of one referent
    for another
  • George W. Bush invaded Iraq
  • A Mercedes rear-ended me
  • Is metonymy involving institutions as agents more
    common in print news than in fiction?
  • The X + V(reporting)
  • Let's start with "The X said"
  • This pattern will provide a handle to identify
    the data

7
Exploring Corpora
  • Datasets
  • http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/DataSets.cgi
  • Metonymy Test using Corpora
  • http://complingtwo.georgetown.edu/gwilson/Tools/Metonymy/TheXSaid_MST.html

8
"The X said" from Concordance data

Corpus        Words   Freq   Freq / M words
Fiction 1870  1.7M    60     35
Fiction 2000  1.5M    219    146
Print News    1.9M    915    481

The preference for metonymy in print news arises
because of the need to communicate information
from companies and governments.
9
Chomsky's Critique of Corpus-Based Methods
  • 1. Corpora model performance, while linguistics
    is aimed at the explanation of competence
  • If you define linguistics that way, linguistic
    theories will never be able to deal with actual,
    messy data
  • Many linguists don't find the competence-performance
    distinction to be clear-cut. Sociolinguists
    have argued that the variability of linguistic
    performance is systematic, predictable, and
    meaningful to speakers of a language.
  • Grammatical theories vary in where they draw the
    line between competence and performance, with
    some grammars (such as Halliday's Systemic
    Grammar) organized as systems of
    functionally-oriented choices.

10
Chomsky's Critique (concluded)
  • 2. Natural language is in principle infinite,
    whereas corpora are finite, so many examples will
    be missed
  • Excellent point, which needs to be understood by
    anyone working with a corpus.
  • But does that mean corpora are useless?
  • Introspection is unreliable (prone to performance
    factors, cf. only short sentences), and pretty
    useless with child data.
  • Also, insights from a corpus might lead to
    generalization/induction beyond the corpus if
    the corpus is a good sample of the text
    population
  • 3. Ungrammatical examples won't be available in a
    corpus
  • Depends on the corpus, e.g., spontaneous speech,
    language learners, etc.
  • The notion of grammaticality is not that clear
  • Who did you see pictures/?a picture/??his
    picture/John's picture of?
  • ARG/ADJUNCT example

11
Which Words are the Most Frequent?
Common Words in Tom Sawyer (71,370 words), from
Manning & Schütze, p. 21
Will these counts hold in a different corpus (and
genre, cf. Tom)? What happens if you have 8-9M
words? (check usage demo!)
12
Data Sparseness
  • Many low-frequency words
  • Fewer high-frequency words.
  • Only a few words will have lots of examples.
  • About 50% of word types occur only once
  • Over 90% occur 10 times or less.
  • So, there is merit to Chomsky's 2nd objection

Word frequency    Number of word types with that frequency
1                 3993
2                 1292
3                 664
4                 410
5                 243
6                 199
7                 172
8                 131
9                 82
10                91
11-50             540
51-100            99
>100              102
Frequency of word types in Tom Sawyer, from Manning & Schütze, p. 22.
13
Zipf's Law: frequency is inversely proportional
to rank

Word         Freq (f)   Rank (r)   f × r
turned       51         200        10200
you'll       30         300        9000
name         21         400        8400
comes        16         500        8000
group        13         600        7800
lead         11         700        7700
friends      10         800        8000
begin        9          900        8100
family       8          1000       8000
brushed      4          2000       8000
sins         2          3000       6000
could        2          4000       8000
applausive   1          8000       8000
Empirical evaluation of Zipf's Law on Tom Sawyer, from Manning & Schütze, p. 23.
14
Illustration of Zipf's Law (Brown Corpus, from
Manning & Schütze, p. 30), plotted on a logarithmic scale
  • See also http://www.georgetown.edu/faculty/wilsong/IR/WordDist.html

15
Tokenizing words for corpus analysis
  • 1. Break on:
  • Spaces? Not in Japanese, which is written without
    them: inuo butta otokonokowa otooto da
  • Periods? (U.K. Products)
  • Hyphens? data-base / database / data base
  • Apostrophes? won't, couldn't, O'Riley, car's
  • 2. Should different word forms be counted as
    distinct?
  • Lemma: a set of lexical forms having the same
    stem, the same POS, and the same word-sense. So,
    cat and cats are the same lemma.
  • Sometimes, words are lemmatized by stemming,
    other times by morphological analysis, using a
    dictionary and/or morphological rules
  • 3. Fold case or not (usually folded)?
  • The / the / THE; Mark versus mark
  • One may need, however, to regenerate the original
    case when presenting it to the user (a small
    tokenizer sketch follows below)
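
A minimal Python sketch of the tokenization choices above; the regex treatment of hyphens, internal apostrophes, and case folding is an illustrative assumption, not the course's own tool:

import re

def tokenize(text, fold_case=True):
    # Keep internal apostrophes and hyphens ("won't", "data-base") inside a token,
    # split everything else off; digits and punctuation become their own tokens.
    tokens = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:\.\d+)?|[^\w\s]", text)
    return [t.lower() for t in tokens] if fold_case else tokens

print(tokenize("The U.K. won't lower-case O'Riley's data-base."))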

16
Counting Word Tokens vs Word Types
  • Word tokens in Tom Sawyer: 71,370
  • Word types (i.e., how many different words):
    8,018
  • In newswire text of that number of tokens, you
    would have 11,000 word types. Perhaps because Tom
    Sawyer is written in a simple style.

17
Inspecting word frequencies in a corpus
  • http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/DataSets.cgi
  • Usage demo
  • http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/Usage.cgi

18
Ngrams
  • Sequences of linguistic items of length n
  • See count.pl
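
count.pl is presumably a Perl script; a comparable word-bigram count in Python might look like this sketch (the toy sentence is just a placeholder):

from collections import Counter

def ngrams(tokens, n=2):
    # Slide a window of length n over the token sequence.
    return zip(*(tokens[i:] for i in range(n)))

tokens = "the cat sat on the mat the cat slept".split()
print(Counter(ngrams(tokens, 2)).most_common(3))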

19
A test for association strength Mutual
Information
Data from (Church et al. 1991), 1988 AP corpus, N = 44.3M
20
Interpreting Mutual Information
  • High scores, e.g., strong supporter (8.85),
    indicate strong association in the corpus
  • MI is a logarithmic score. To convert it, recall
    that x = 2^(log2 x)
  • so 2^8.85 ≈ 461.44, i.e., about 461 times chance
  • Low scores: powerful support (1.74) is only about
    3 times chance, since 2^1.74 ≈ 3
  • I(x,y) = log2( N f(x,y) / (f(x) f(y)) )
  • For powerful support: f(x,y) = 2, f(powerful) = 1984,
    f(support) = 13,428
  • I = log2( 2N / (1984 × 13428) ) = 1.74
  • So a low score doesn't necessarily mean weak
    association; it could be due to data sparseness
    (see the computational sketch below)
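
To make the arithmetic concrete, a small sketch of the pointwise MI computation using the powerful support counts quoted above (N = 44.3M, from the Church et al. data):

import math

def mutual_information(f_xy, f_x, f_y, n):
    # I(x, y) = log2( N * f(x,y) / (f(x) * f(y)) )
    return math.log2(n * f_xy / (f_x * f_y))

print(mutual_information(2, 1984, 13428, 44.3e6))  # ~1.73, i.e. the ~1.74 quoted above, up to rounding
print(2 ** 8.85)                                   # ~461: "strong supporter" vs. chance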

21
Mutual Information over Grammatical Relations
  • Parse a corpus
  • Determine subject-verb-object triples
  • Identify head nouns of subject and object NPs
  • Score subj-verb and verb-obj associations using MI

22
Demo of Verb-Subj, Verb-Obj Parses
  • Who devours or what gets devoured?
  • Demo: http://www.cs.ualberta.ca/lindek/demos/depindex.htm

23
MI over verb-obj relations
  • Data from (Church et al. 1991)

24
A Subj-Verb MI Example Who does what in news?
  executive          police              politician
  reprimand  16.36    shoot        17.37   clamor     16.94
  conceal    17.46    raid         17.65   jockey     17.53
  bank       18.27    arrest       17.96   wrangle    17.59
  foresee    18.85    detain       18.04   woo        18.92
  conspire   18.91    disperse     18.14   exploit    19.57
  convene    19.69    interrogate  18.36   brand      19.65
  plead      19.83    swoop        18.44   behave     19.72
  sue        19.85    evict        18.46   dare       19.73
  answer     20.02    bundle       18.50   sway       19.77
  commit     20.04    manhandle    18.59   criticize  19.78
  worry      20.04    search       18.60   flank      19.87
  accompany  20.11    confiscate   18.63   proclaim   19.91
  own        20.22    apprehend    18.71   annul      19.91
  witness    20.28    round        18.78   favor      19.92

Data from (Schiffman et al. 2001)
25
Famous Corpora
  • Must see: http://www.ldc.upenn.edu/Catalog/
  • Brown Corpus
  • British National Corpus
  • International Corpus of English
  • Penn Treebank
  • Lancaster-Oslo-Bergen Corpus
  • Canadian Hansard Corpus
  • U.N. Parallel Corpus
  • TREC Corpora
  • MUC Corpora
  • English, Arabic, Chinese Gigawords
  • Chinese, Arabic Treebanks
  • North American News Text Corpus
  • Multext East Corpus: 1984 in multiple
    Eastern/Central European languages

26
Links to Corpora
  • Corpora
  • Linguistic Data Consortium (LDC):
    http://www.ldc.upenn.edu/
  • Oxford Text Archive: http://sable.ox.ac.uk/ota/
  • Project Gutenberg: http://www.promo.net/pg/
  • CORPORA list: http://www.hd.uib.no/corpora/archive.html
  • Other
  • Chris Manning's Corpora Page:
  • http://www-nlp.stanford.edu/links/statnlp.html#Corpora
  • Michael Barlow's Corpus Linguistics page:
    http://www.ruf.rice.edu/barlow/corpus.html
  • Cathy Ball's Corpora tutorial: http://www.georgetown.edu/faculty/ballc/corpora/tutorial.html

27
Summary Introduction
  • Concordances and corpora are widely used and
    available, to help one to develop
    empirically-based linguistic theories and
    computer implementations
  • The linguistic items that can be counted are
    many, but words (defined appropriately) are
    basic items
  • The frequency distribution of words in any
    natural language is Zipfian
  • Data sparseness is a basic problem when using
    observations in a corpus sample of language
  • Sequences of linguistic items (e.g., word
    sequences, n-grams) can also be counted, but
    longer sequences will have very low counts
  • Associations between items can be easily computed
  • e.g., associations between verbs and
    parser-discovered subjs or objs

28
Outline
  • Topics
  • Concordances
  • Data sparseness
  • Chomsky's Critique
  • Ngrams
  • Mutual Information
  • Part-of-speech tagging
  • Annotation Issues
  • Inter-Annotator Reliability
  • Named Entity Tagging
  • Relationship Tagging
  • Case Studies
  • metonymy
  • adjective ordering
  • Discourse markers: then
  • TimeML

29
Using POS in Concordances
Corpus        POS            Words   Freq   Freq / M words
Fiction 2000  N (\bdeal_NN)   1.5M    115    7.66
Fiction 2000  VB              1.5M    14     9.33
Gigaword      N               10.5M   2857   2.72
Gigaword      VB              10.5M   139    1.32
deal is more often a verb in Fiction 2000;
deal is more often a noun in English Gigaword;
deal is more prevalent in Fiction 2000 than in
Gigaword.
30
POS Tagging What is it?
  • Given a sentence and a tagset of lexical
    categories, find the most likely tag for each
    word in the sentence
  • Tagset: e.g., Penn Treebank (45 tags, derived
    from the 87-tag Brown corpus tagset)
  • Note that many of the words may have unambiguous
    tags
  • Example
  • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
    tomorrow/NN
  • People/NNS continue/VBP to/TO inquire/VB the/DT
    reason/NN for/IN the/DT race/NN for/IN outer/JJ
    space/NN

31
More details of POS problem
  • How ambiguous?
  • Most words in English have only one Brown Corpus
    tag
  • Unambiguous (1 tag): 35,340 word types
  • Ambiguous (2-7 tags): 4,100 word types (11.5%)
  • 7 tags: 1 word type (still)
  • But many of the most common words are ambiguous
  • Over 40% of Brown corpus tokens are ambiguous
  • Obvious strategies may be suggested based on
    intuition
  • to/TO race/VB
  • the/DT race/NN
  • will/MD race/NN
  • Sentences can also contain unknown words for
    which tags have to be guessed Secretariat/NNP
    is/VBZ

32
Different English Part-of-Speech Tagsets
  • Brown corpus - 87 tags
  • Allows compound tags
  • I'm tagged as PPSS+BEM
  • PPSS for "non-3rd person nominative personal
    pronoun" and BEM for "am, 'm"
  • Others have derived their work from the Brown Corpus
  • LOB Corpus: 135 tags
  • Lancaster UCREL Group: 165 tags
  • London-Lund Corpus: 197 tags
  • BNC: 61 tags (C5)
  • PTB: 45 tags
  • For comparisons and mappings of tagsets, go to
    www.comp.leeds.ac.uk/amalgam/tagsets/tagmenu.html

33
PTB Tagset (36 main tags + 9 punctuation tags)
34
PTB Tagset Development
  • Several changes were made to Brown Corpus tagset
  • Recoverability
  • Lexical: same treatment of be, do, have, whereas
    BC gave each its own symbol
  • Do/VB does/VBZ did/VBD doing/VBG done/VBN
  • Syntactic: since parse trees were used as part of
    the Treebank, certain categories were conflated
    under the assumption that they would be
    recoverable from syntax
  • subject vs. object pronouns (both PP)
  • subordinating conjunctions vs. prepositions: on
    being informed vs. on the table (both IN)
  • Preposition to vs. infinitive marker (both TO)
  • Syntactic Function
  • BC the/DT one/CD vs. PTB the/DT one/NN
  • BC both/ABX vs.
  • PTB both/PDT the boys, the boys both/RB,
    both/NNS of the boys, both/CC boys and girls

35
PTB Tagging Process
  • Tagset developed
  • Automatic tagging by rule-based and statistical
    pos taggers
  • Human correction using an editor embedded in Gnu
    Emacs
  • Takes under a month for humans to learn this (at
    15 hours a week), and annotation speeds after a
    month exceed 3,000 words/hour
  • Inter-annotator disagreement (4 annotators, eight
    2,000-word docs) was 7.2% for the tagging task and
    4.1% for the correcting task
  • Manual tagging took about 2x as long as
    correcting, with about 2x the inter-annotator
    disagreement rate and an error rate that was
    about 50% higher.
  • So, for certain problems, having a linguist
    correct automatically tagged output is far more
    efficient and leads to better reliability among
    linguists compared to having them annotate the
    text from scratch!

36
Automatic POS tagging
  • http://complingone.georgetown.edu/linguist/

37
A Baseline Strategy
  • Choose the most likely tag for each ambiguous
    word, independent of previous words
  • i.e., assign each token to the pos-category it
    occurred in most often in the training set
  • E.g., race: which POS is more likely in a
    corpus?
  • This strategy gives you 90% accuracy in
    controlled tests
  • So, this unigram baseline must always be
    compared against (a sketch follows below)
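
A minimal sketch of the unigram baseline just described; the toy training list stands in for a real tagged corpus, and the NN default for unknown words is an assumption:

from collections import Counter, defaultdict

def train_baseline(tagged_tokens):
    # Keep, for each word, the tag it occurred with most often.
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

train = [("the", "DT"), ("race", "NN"), ("race", "NN"), ("race", "VB"), ("to", "TO")]
model = train_baseline(train)
print([model.get(w, "NN") for w in ["the", "race", "horse"]])  # ['DT', 'NN', 'NN']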

38
Beyond the Baseline
  • Hand-coded rules
  • Sub-symbolic machine learning
  • Symbolic machine learning

39
Machine Learning
  • Machines can learn from examples
  • Learning can be supervised or unsupervised
  • Given training data, machines analyze the data,
    and learn rules which generalize to new examples
  • Can be sub-symbolic (rule may be a mathematical
    function) e.g. neural nets
  • Or it can be symbolic (rules are in a
    representation that is similar to representation
    used for hand-coded rules)
  • In general, machine learning approaches allow for
    more tuning to the needs of a corpus, and can be
    reused across corpora

40
A Probabilistic Approach to POS tagging
  • What you want to do is find the best sequence
    of POS tags C = C1..Cn for a sentence W = W1..Wn.
  • (Here C1 is pos_tag(W1).)
  • In other words, find the sequence of POS tags C
    that maximizes P(C | W)
  • Using Bayes' Rule, we can say
  • P(C | W) = P(W | C) P(C) / P(W)
  • Since we are interested in finding the value of C
    which maximizes the RHS, the denominator can be
    discarded, since it will be the same for every C
  • So, the problem is: find the C which maximizes
  • P(W | C) P(C)
  • Example: He will race
  • Possible sequences:
  • He/PP will/MD race/NN
  • He/PP will/NN race/NN
  • He/PP will/MD race/VB
  • He/PP will/NN race/VB
  • W = W1 W2 W3
  •     He will race
  • C = C1 C2 C3
  • Choices:
  • C = PP MD NN
  • C = PP NN NN
  • C = PP MD VB
  • C = PP NN VB

41
Independence Assumptions
  • P(C1..Cn) ≈ ∏ i=1..n P(Ci | Ci-1)
  • assumes that the event of a POS tag occurring is
    independent of the event of any other POS tag
    occurring, except for the immediately previous
    POS tag
  • From a linguistic standpoint, this seems an
    unreasonable assumption, due to long-distance
    dependencies
  • P(W1..Wn | C1..Cn) ≈ ∏ i=1..n P(Wi | Ci)
  • assumes that the event of a word appearing in a
    category is independent of the event of any other
    word appearing in a category
  • Ditto
  • However, the proof of the pudding is in the
    eating!
  • N-gram models work well for part-of-speech
    tagging

42
A Statistical Method for POS Tagging

Lexical generation probabilities P(Wi | Ci):
        MD    NN    VB    PRP
he      0     0     0     .3
will    .8    .2    0     0
race    0     .4    .6    0

  • Find the value of C1..Cn which maximizes
  • ∏ i=1..n P(Wi | Ci) P(Ci | Ci-1)

POS bigram probabilities P(Ci | Ci-1) (rows = Ci-1):
        MD    NN    VB    PRP
MD      -     .4    .6    -
NN      -     .3    .7    -
PP      .8    .2    -     -
<s>     -     -     -     1
43
Finding the best path through an HMM (Viterbi algorithm)

(Trellis figure: A = <s>, B = he/PP (lex .3), C = will/MD
(lex .8), D = will/NN (lex .2), E = race/NN (lex .4),
F = race/VB (lex .6); the arcs carry the POS bigram
probabilities from the previous slide.)

  • Score(I) = max over predecessors J of I of
    Score(J) × transition(J → I) × lex(I)
  • Score(B) = P(PP | <s>) × P(he | PP) = 1 × .3 = .3
  • Score(C) = Score(B) × P(MD | PP) × P(will | MD)
    = .3 × .8 × .8 ≈ .19
  • Score(D) = Score(B) × P(NN | PP) × P(will | NN)
    = .3 × .2 × .2 = .012
  • Score(E) = max{Score(C) × P(NN | MD),
    Score(D) × P(NN | NN)} × P(race | NN)
  • Score(F) = max{Score(C) × P(VB | MD),
    Score(D) × P(VB | NN)} × P(race | VB)
    (a runnable sketch follows below)
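
A runnable sketch of the computation traced above, using the lexical and bigram tables from the previous slide (missing table entries are treated as zero; backpointers are omitted, so only the final scores are returned):

def viterbi_scores(words, tags, trans, lex, start="<s>"):
    # best[t] = score of the best tag sequence for the words so far that ends in tag t.
    best = {t: trans.get((start, t), 0.0) * lex.get((words[0], t), 0.0) for t in tags}
    for w in words[1:]:
        best = {t: max(best[p] * trans.get((p, t), 0.0) for p in tags)
                   * lex.get((w, t), 0.0) for t in tags}
    return best

tags = ["PP", "MD", "NN", "VB"]
trans = {("<s>", "PP"): 1.0, ("PP", "MD"): .8, ("PP", "NN"): .2,
         ("MD", "NN"): .4, ("MD", "VB"): .6, ("NN", "NN"): .3, ("NN", "VB"): .7}
lex = {("he", "PP"): .3, ("will", "MD"): .8, ("will", "NN"): .2,
       ("race", "NN"): .4, ("race", "VB"): .6}
print(viterbi_scores(["he", "will", "race"], tags, trans, lex))  # VB wins with ~0.069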

44
But Data Sparseness Bites Again!
  • Lexical generation probabilities will lack
    observations for low-frequency and unknown words
  • Most systems do one of the following
  • Smooth the counts
  • E.g., add a small number to unseen data (to zero
    counts). For example, assume a bigram not seen in
    the data has a very small probability, e.g.,
    .0001.
  • Backoff bigrams with unigrams, etc.
  • Use lots more data (you'll still lose, thanks to
    Zipf!)
  • Group items into classes, thus increasing class
    frequency
  • e.g., group words into ambiguity classes, based
    on their set of tags. For counting, all words in
    an ambiguity class are treated as variants of the
    same word (a smoothing sketch follows below)
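
A tiny sketch of add-one (Laplace) style smoothing for lexical generation counts; the counts and vocabulary size are made up for illustration:

def smoothed_prob(count_word_tag, count_tag, vocab_size, k=1.0):
    # Add k to every count so unseen (word, tag) pairs get a small non-zero probability.
    return (count_word_tag + k) / (count_tag + k * vocab_size)

print(smoothed_prob(0, 500, 10000))    # unseen word under this tag: small but non-zero
print(smoothed_prob(120, 500, 10000))  # seen word: note how much mass add-one takes (0.24 unsmoothed vs ~0.012)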

45
A Symbolic Learning Method
  • HMMs are sub-symbolic: they don't give you rules
    that you can inspect
  • A method called Transformational Rule Sequence
    learning (the Brill algorithm) can be used for
    symbolic learning (among other approaches)
  • Performs at least as accurately as other
    statistical approaches
  • The rules (actually, a sequence of rules) are
    learnt from an annotated corpus
  • Has better treatment of context compared to HMMs
  • rules which use the next (or previous) POS
  • HMMs just use P(Ci | Ci-1) or P(Ci | Ci-2, Ci-1)
  • rules which use the previous (next) word
  • HMMs just use P(Wi | Ci)

46
Brill Algorithm (Overview)
  • Assume you are given a training corpus G (for
    gold standard)
  • First, create a tag-free version V of it
  • Notes
  • As the algorithm proceeds, each successive rule
    becomes narrower (covering fewer examples, i.e.,
    changing fewer tags), but also potentially more
    accurate
  • Some later rules may change tags changed by
    earlier rules
  • 1. First label every word token in V with the most
    likely tag for that word type from G. If this
    initial state annotator is perfect, you're
    done!
  • 2. Then consider every possible transformational
    rule, selecting the one that leads to the most
    improvement in V, using G to measure the error
  • 3. Retag V based on this rule
  • 4. Go back to 2, until there is no significant
    improvement in accuracy over the previous iteration

47
Brill Algorithm (Detailed)
  • 1. Label every word token with its most likely
    tag (based on lexical generation probabilities).
  • 2. List the positions of tagging errors and their
    counts, by comparing with ground-truth (GT)
  • 3. For each error position, consider each
    instantiation I of X, Y, and Z in the rule template.
    If Y = GT, increment improvements[I], else
    increment errors[I].
  • 4. Pick the I which results in the greatest
    error reduction, and add it to the output
  • e.g., VB NN PREV1OR2TAG DT improves 98 errors,
    but produces 18 new errors, so net decrease of 80
    errors
  • 5. Apply that I to the corpus
  • 6. Go to 2, unless the stopping criterion is reached
  • Most likely tag:
  • P(NN | race) = .98
  • P(VB | race) = .02
  • Is/VBZ expected/VBN to/TO race/NN tomorrow/NN
  • Rule template: change a word from tag X to
    tag Y when the previous tag is Z
  • Rule instantiation for the above example: NN VB
    PREV1OR2TAG TO
  • Applying this rule yields:
  • Is/VBZ expected/VBN to/TO race/VB tomorrow/NN
    (a schematic sketch of the learning loop follows below)
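
A schematic Python rendering of steps 1-6, under simplifying assumptions: the corpus is a single toy sentence, and the only template considered is "change X to Y when the previous tag is Z" (a one-tag version of PREV1OR2TAG):

def apply_rule(tags, rule):
    # rule = (from_tag, to_tag, prev_tag)
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == rule[0] and tags[i - 1] == rule[2]:
            out[i] = rule[1]
    return out

def learn_rules(current, gold, tagset, max_rules=5):
    rules = []
    for _ in range(max_rules):
        candidates = [(x, y, z) for x in tagset for y in tagset for z in tagset if x != y]
        def gain(rule):
            new = apply_rule(current, rule)
            return sum(n == g for n, g in zip(new, gold)) - sum(c == g for c, g in zip(current, gold))
        best = max(candidates, key=gain)
        if gain(best) <= 0:
            break                      # stopping criterion: no further improvement
        rules.append(best)
        current = apply_rule(current, best)
    return rules

gold    = ["VBZ", "VBN", "TO", "VB", "NN"]   # is expected to race tomorrow
initial = ["VBZ", "VBN", "TO", "NN", "NN"]   # most-likely-tag baseline tags race as NN
print(learn_rules(initial, gold, {"VBZ", "VBN", "TO", "VB", "NN"}))  # [('NN', 'VB', 'TO')]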

48
Example of Error Reduction
From Eric Brill (1995) Computational
Linguistics, 21, 4, p. 7
49
Example of Learnt Rule Sequence
  • 1. NN VB PREVTAG TO
  • to/TO race/NN → VB
  • 2. VBP VB PREV1OR2OR3TAG MD
  • might/MD vanish/VBP → VB
  • 3. NN VB PREV1OR2TAG MD
  • might/MD not/MD reply/NN → VB
  • 4. VB NN PREV1OR2TAG DT
  • the/DT great/JJ feast/VB → NN
  • 5. VBD VBN PREV1OR2OR3TAG VBZ
  • He/PP was/VBZ killed/VBD → VBN by/IN Chapman/NNP

50
Handling Unknown Words
  • Can also use the Brill method
  • Guess NNP if capitalized, NN otherwise.
  • Or use the tag most common for words ending in
    the last 3 letters.
  • etc.

Example Learnt Rule Sequence for Unknown Words
51
POS Tagging using Unsupervised Methods
  • Reason: annotated data isn't always available!
  • Example: the can
  • Let's take unambiguous words from the dictionary, and
    count their occurrences after the
  • the .. elephant
  • the .. guardian
  • Conclusion: immediately after the, nouns are more
    common than verbs or modals
  • Initial state annotator: for each word, list all
    tags in the dictionary
  • Transformation template:
  • Change tag X of a word to tag Y if the previous
    (next) tag (word) is Z, where X is a set of 2 or
    more tags
  • Don't change any other tags

52
Error Reduction in Unsupervised Method
  • Let a rule that changes X to Y in context C be
    represented as Rule(X, Y, C).
  • Rule1: {VB, MD, NN} → NN, PREVWORD the
  • Rule2: {VB, MD, NN} → VB, PREVWORD the
  • Idea:
  • since annotated data isn't available, score rules
    so as to prefer those where Y appears much more
    frequently in the context C than all others in X
  • frequency is measured by counting unambiguously
    tagged words
  • so, prefer {VB, MD, NN} → NN, PREVWORD the
  • to {VB, MD, NN} → VB, PREVWORD the
  • since dictionary-unambiguous nouns are more common
    in a corpus after the than dictionary-unambiguous
    verbs

53
Summary POS tagging
  • A variety of POS tagging schemes exist, even for
    a single language
  • Preparing a POS-tagged corpus requires, for
    efficiency, a combination of automatic tagging
    and human correction
  • Automatic part-of-speech tagging can use
  • Hand-crafted rules based on inspecting a corpus
  • Machine Learning-based approaches based on corpus
    statistics
  • e.g., HMM lexical generation probability table,
    pos transition probability table
  • Machine Learning-based approaches using rules
    derived automatically from a corpus
  • Combinations of different methods often improve
    performance

54
Outline
  • Topics
  • Concordances
  • Data sparseness
  • Chomsky's Critique
  • Ngrams
  • Mutual Information
  • Part-of-speech tagging
  • Annotation Issues
  • Inter-Annotator Reliability
  • Named Entity Tagging
  • Relationship Tagging
  • Case Studies
  • metonymy
  • adjective ordering
  • Discourse markers: then
  • TimeML

55
Adjective Ordering
  • A political serious problem
  • A social extravagant life
  • red lovely hair
  • old little lady
  • green little men
  • Adjectives have been grouped into various classes
    to explain ordering phenomena

56
Collins COBUILD L2 Grammar
  • qualitative < color < classifying
  • Qualitative: expresses a quality that someone or
    something has, e.g., sad, pretty, small, etc.
  • Qualitative adjectives are gradable, i.e., the
    person or thing can have more or less of the
    quality
  • Classifying: used to identify the class
    something belongs to, i.e., distinguishing
  • financial help, American citizens
  • Classifying adjectives aren't gradable.
  • So, the ordering reduces to:
  • Gradable < color < non-gradable
  • A serious political problem
  • Lovely red hair
  • Big rectangular green Chinese carpet

57
Vendler 68
  • A9 < A8 < ... < A2 < A1x < A1m < ... < A1a
  • A9: probably, likely, certain
  • A8: useful, profitable, necessary
  • A7: possible, impossible
  • A6: clever, stupid, reasonable, nice, kind,
    thoughtful, considerate
  • A5: ready, willing, anxious
  • A4: easy
  • A3: slow, fast, good, bad, weak, careful,
    beautiful
  • A2: contrastive/polar adjectives: long-short,
    thick-thin, big-little, wide-narrow
  • A1j: verb-derivatives (washed)
  • A1i: verb-derivatives (washing)
  • A1h: luminous
  • A1g: rectangular
  • A1f: color adjectives
  • A1a: iron, steel, metal
  • big rectangular green Chinese carpet

58
Other Adjective Ordering Theories
Goyvaerts 68:         quality < size/length/shape < age < color < naturally < style < general < denominal
Quirk & Greenbaum 73: intensifying (perfect) < general-measurable (careful, wealthy) < age (young, old) < color < denominal material (woollen scarf) < denominal style (Parisian dress)
Dixon 82:             value < dimension < physical property < speed < human propensity < age < color
Frawley 92:           value < size < color (English, German, Hungarian, Polish, Turkish, Hindi, Persian, Indonesian, Basque)
Collins COBUILD:      gradable < color < non-gradable
  color < non-gradable (Goyvaerts, Q&G, Dixon)
  size < age < color (Goyvaerts, Q&G)
  color < denominal (Goyvaerts, Dixon)
  shape < color
59
Testing the Theories on Large Corpora
  • Selective coverage of a particular language or
    (small) set of languages
  • Based on categories that aren't defined precisely
    or aren't computable
  • Based on small rather than large numbers of examples
  • Test: gradable < color < non-gradable

60
Computable Tests for Gradable Adjectives
  • Submodifiers expressing gradation:
  • very|rather|somewhat|extremely A
  • But what about very British?
  • http://complingtwo.georgetown.edu/gwilson/Tools/Adj/GW_Grad.txt
  • Periphrastic comparatives:
  • more A than / "the most A"
  • Inflectional comparatives:
  • -er / -est
  • http://complingtwo.georgetown.edu/gwilson/Tools/Adj/BothLists.txt
    (a small search sketch follows below)
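
One way to run the submodifier test over a POS-tagged corpus in Python; the word_TAG convention and the toy text are assumptions for illustration, not the course's own search tool:

import re
from collections import Counter

SUBMOD = re.compile(r"\b(?:very|rather|somewhat|extremely)_RB\s+(\w+)_JJ\b", re.I)

text = "a_DT very_RB serious_JJ problem_NN ; rather_RB British_JJ ; an_DT extremely_RB red_JJ flag_NN"
hits = Counter(m.group(1).lower() for m in SUBMOD.finditer(text))
print(hits)  # candidate gradable adjectives, still to be checked by hand (cf. very British)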

61
Challenges Data Sparseness
  • Data sparseness
  • Only some pairs will be present in a given corpus
  • few adjectives on the gradable list may be
    present
  • Even fewer longer sequences will be present in a
    corpus
  • Use transitivity?
  • small < red, red < wooden → small < red <
    wooden?

62
Challenges Tool Incompleteness
  • Search pattern will return many non-examples
  • Collocations
  • common or marked ones
  • American green card
  • national Blue Cross
  • Adjective Modification
  • bright blue
  • POS-tagging errors
  • May also miss many examples

63
Results from Corpus Analysis
  • G < C < not-G generally holds
  • However, there are exceptions
  • Classifying/Non-Gradable < Color
  • After all, the maple leaf replaced the British
    red ensign as Canada's flag almost 30 years ago.
  • http://complingtwo.georgetown.edu/gwilson/Tools/Adj/Color2.html
  • where he stood on a stage festooned with balloons
    displaying the Palestinian green, white and red
    flag
  • http://complingtwo.georgetown.edu/gwilson/Tools/Adj/Color4.html
  • Color < Shape
  • paintings in which pink, roundish shapes,
    enriched with flocking, gel, lentils and thread,
    suggest the insides of the female body.
  • http://complingtwo.georgetown.edu/gwilson/Tools/Adj/Color4.html

64
Summary Adjective Ordering
  • It is possible to test concrete predictions of a
    linguistic theory in a corpus-based setting
  • The testing means that the machine searches for
    examples satisfying patterns that the human
    specifies
  • The patterns can pre-suppose a certain/high
    degree of automatic tagging, with attendant loss
    of accuracy
  • The patterns should be chosen so that they
    provide handles to identify the phenomena of
    interest
  • The patterns should be restricted enough that the
    number of examples the human has to judge is not
    infeasible
  • This is usually an iterative process

65
Outline
  • Topics
  • Concordances
  • Data sparseness
  • Chomsky's Critique
  • Ngrams
  • Mutual Information
  • Part-of-speech tagging
  • Annotation Issues
  • Named Entity Tagging
  • Inter-Annotator Reliability
  • Relationship Tagging
  • Case Studies
  • metonymy
  • adjective ordering
  • Discourse markers: then
  • TimeML

66
The Art of Annotation 101
  • Define Goal
  • Eyeball Data (with the help of Computers)
  • Design Annotation Scheme
  • Develop Example-based Guidelines
  • Unless satisfied/exhausted, goto 1
  • Write Training Manuals
  • Initiate Human Training Sessions
  • Annotate Data / Train Computers
  • Computers can also help with the annotation
  • Evaluate Humans and Computers
  • Unless satisfied/exhausted, goto 1

67
Annotation Methodology Picture

(Flow diagram with components: Raw Corpus, Initial Tagger,
Annotation Editor, Annotation Guidelines, Machine Learning
Program, Learned Rules, Rule Apply, Annotated Corpus,
Knowledge Base?)
68
Goals of an Annotation Scheme
  • Simplicity: simple enough for a human to carry
    out
  • Precision: precise enough to be useful in CLI
    applications
  • Text-based: annotation of an item should be
    based on information conveyed by the text, rather
    than information conveyed by background
    information
  • Human-centered: should be based on what a human
    can infer from the text, rather than what a
    machine can currently do or not do
  • Reproducible: your annotation should be
    reproducible by other humans (i.e.,
    inter-annotator agreement should be high)
  • obviously, these other humans may have to have
    particular expertise and training

69
What Should An Annotation Contain
  • Additional information about the text being
    annotated: e.g., EAGLES external and internal
    criteria
  • Information about the annotator: who, when, what
    version of the tool, etc. (usually in meta-tags
    associated with the text)
  • The tagged text itself
  • Example:
  • http://www.emille.lancs.ac.uk/spoken.htm

70
External and Internal Criteria (EAGLES)
  • External: participants, occasion, social
    setting, communicative function
  • origin: aspects of the origin of the text that
    are thought to affect its structure or content
  • state: the appearance of the text, its layout and
    relation to non-textual matter, at the point when
    it is selected for the corpus
  • aims: the reason for making the text and the
    intended effect it is expected to have
  • Internal: patterns of language use
  • Topic (economics, sports, etc.)
  • Style (formal/informal, etc.)

71
External Criteria state (EAGLES)
  • Mode
  • spoken
  • participant awareness: surreptitious/warned/aware
  • venue: studio/on location/telephone
  • written
  • Relation to the medium
  • written: how it is laid out, the paper, print,
    etc.
  • spoken: the acoustic conditions, etc.
  • Relation to non-linguistic communicative matter
  • diagrams, illustrations, other media that are
    coupled with the language in a communicative
    event.
  • Appearance
  • e.g., advertising leaflets, aspects of
    presentation that are unique in design and are
    important enough to have an effect on the
    language.

72
Examples of annotation schemes (changing the way
we do business!)
  • POS tagging annotation: Penn Treebank Scheme
  • Named entity annotation: ACE Scheme
  • Phrase Structure annotation: Penn Treebank
    Scheme
  • Time Expression annotation: TIMEX2 Scheme
  • Protein Name Annotation: GU Scheme
  • Event Annotation: TimeML Scheme
  • Rhetorical Structure Annotation: RST Scheme
  • Coreference Annotation, Subjectivity Annotation,
    Gesture Annotation, Intonation Annotation,
    Metonymy Annotation, etc., etc.
  • Etc.
  • Several hundred schemes exist, for different
    problems in different languages

73
POS Tag Formats Non-SGML to SGML
  • CLAWS tagger (non-SGML):
  • What_DTQ can_VM0 CLAWS_NN2 do_VDI to_PRP
    Inderjeet_NP0 's_POS noonsense_NN1 text_NN1 ?_?
  • Brill tagger (non-SGML):
  • What/WP can/MD CLAWS/NNP do/VB to/TO
    Inderjeet/NNP 's/POS noonsense/NN text/NN ?/.
  • Alembic POS tagger:
  • <s><lex pos=WP>What</lex> <lex pos=MD>can</lex>
    <lex pos=NNP>CLAWS</lex> <lex pos=VB>do</lex>
    <lex pos=TO>to</lex> <lex pos=NNP>Inderjeet</lex>
    <lex pos=POS>'</lex><lex pos=PRP>s</lex> <lex
    pos=VBP>noonsense</lex> <lex pos=NN>text</lex>
    <lex pos=".">?</lex></s>
  • Conversion to SGML is pretty trivial in such
    cases

74
SGML (Standard Generalized Markup Language)
  • A general markup language for text
  • HTML is an instance of an SGML encoding
  • Text Encoding Initiative (TEI) defines SGML
    schemes for marking up humanities text resources
    as well as dictionaries
  • Examples:
  • <p><s>I'm really hungry right now.</s><s>Oh,
    yeah?</s>
  • <utt speak="Fred" date="10-Feb-1998">That is an
    ugly couch.</utt>
  • Note: some elements (e.g., <p>) can consist just
    of a single tag
  • Character references: ways of referring to
    non-ASCII characters using a numeric code
  • &#229; (this is in decimal), &#xE5; (this is in
    hexadecimal)
  • å
  • Entity references are used to encode a special
    character or sequence of characters via a
    symbolic name
  • r&eacute;sum&eacute;
  • &docdate;

75
DTDs
  • A document type definition, or DTD, is used to
    define a grammar of legal SGML structures for a
    document
  • e.g., para should consist of one or more
    sentences and nothing else
  • SGML parser verifies that document is compliant
    with DTD
  • DTDs can therefore be used for XML as well
  • DTDs can specify what attributes are required, in
    what order, what their legit values are, etc.
  • The DTDs are often ignored in practice!
  • DTD:
  • <!ENTITY writer SYSTEM "http://www.mysite.com/all-entities.dtd">
  • <!ATTLIST payment type (check|cash) "cash">
  • XML:
  • <author>&writer;</author>
  • <payment type="check">

76
XML
  • Extensible Markup Language (XML) is a simple,
    very flexible text format derived from SGML.
  • Originally designed to meet the challenges of
    large-scale electronic publishing, XML is also
    playing an increasingly important role in the
    exchange of a wide variety of data on the Web and
    elsewhere. www.w3.org/XML/
  • Defines a simplified subset of SGML, designed
    especially for Web applications
  • Unlike HTML, separates out display (e.g., XSL)
    from content (XML)
  • Example:
  • <p/><s><lex pos="WP">What</lex> <lex
    pos="MD">can</lex></s>
  • Makes use of DTDs, but also RDF Schemas

77
RDF Schemas
  • Example of Real RDF Schema
  • http://www.cs.brandeis.edu/jamesp/arda/time/documentation/TimeML.xsd
    (see EVENT tag and attributes)

78
Inline versus Standoff Annotation
  • Usually, when tags are added, an annotation tool
    is used, to avoid spurious insertions or
    deletions
  • The annotation tool may use inline or standoff
    annotation
  • Inline: tags are stored internally in (a copy
    of) the source text
  • Tagged text can be substantially larger than the
    original text
  • Web pages are a good example, i.e., HTML tags
  • Standoff: tags are stored internally in separate
    files, with information as to what positions in
    the source text the tags occupy
  • e.g., PERSON 335 337
  • However, the annotation tool displays the text as
    if the tags were in-line (see the small example below)
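
A tiny illustration of the two representations; the sentence, tag names, and character offsets below are invented to mirror the PERSON 335 337 style of record mentioned above:

text = "Dr. Smith met Jones in Miami."
standoff = [("PERSON", 4, 9), ("PERSON", 14, 19), ("LOCATION", 23, 28)]

def to_inline(text, spans):
    # Insert tags from right to left so earlier character offsets stay valid.
    for tag, start, end in sorted(spans, key=lambda s: -s[1]):
        text = text[:start] + f"<{tag}>" + text[start:end] + f"</{tag}>" + text[end:]
    return text

print(to_inline(text, standoff))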

79
Summary Annotation Issues
  • A best-practices methodology is widely used
    for annotating corpora
  • The annotation process involves computational
    tools at all stages
  • Standard guidelines are available for use
  • To share annotated corpora (and to ensure their
    survivability), it is crucial that the data be
    represented in a standard rather than ad hoc
    format
  • XML provides a well-established, Web-compliant
    standard for markup languages
  • DTDs and RDF provide mechanisms for checking
    well-formedness of annotation

80
Outline
  • Topics
  • Concordances
  • Data sparseness
  • Chomsky's Critique
  • Ngrams
  • Mutual Information
  • Part-of-speech tagging
  • Annotation Issues
  • Inter-Annotator Reliability
  • Named Entity Tagging
  • Relationship Tagging
  • Case Studies
  • metonymy
  • adjective ordering
  • Discourse markers: then
  • TimeML

81
Background
  • Deborah Schiffrin. Anaphoric then: aspectual,
    textual, and epistemic meaning. Linguistics 30
    (1992), 753-792
  • Schiffrin examines uses of then in data elicited
    via 20 sociolinguistic interviews, each an hour
    long
  • Distinguishes two anaphoric temporal senses,
    showing that they are differentiated by clause
    position
  • Shows that they have systematic effects on
    aspectual interpretation
  • A parallel argument is made for two epistemic
    temporal senses

82
Schiffrin Temporal and Non-Temporal Senses
  • Anaphoric Senses
  • Narrative temporal sense (shifts reference
    time):
  • And then I uh lived there until I was sixteen
  • Continuing Temporal sense (continues a previous
    reference time):
  • I was only a little boy then.
  • Epistemic senses
  • Conditional sentences (rare, but often have
    temporal antecedents in her data):
  • But if I think about it for a few days -- well,
    then I seem to remember a great deal
  • if I'm still in the situation where I am
    now ... I'm not gonna have no more then
  • Initiation-response-evaluation sequences (in
    that case?):
  • Freda: Do y' still need the light?
  • Debby: Um.
  • Freda: W'll have t' go in then. Because the bugs
    are out.

83
Schiffrins Argument (Simplified) and Its Test
  • Shifting-RT thens (call these Narrative) + then
    in if-then conditionals:
  • similar semantic function
  • mainly clause-initial
  • Continuing-RT thens (call these Temporal) + IRE
    thens:
  • similar semantic function
  • mainly clause-final
  • stative verb more likely (since RT overlaps,
    verbs conveying duration are expected)
  • Call the rest Other
  • it isn't differentiated into if-then versus IRE
  • So, only part of her claims is tested

84
So, What do we do Then?
  • Define environments of interest, each one defined
    by a pattern
  • For each environment
  • Find examples matching the pattern
  • If classifying the examples is manageable, carry
    it out and stop
  • Otherwise restrict the environment by adding new
    elements to the pattern, and go back to 1
  • So, for each final environment, we claim that X%
    of the examples in that environment are of a
    particular class
  • Initial then pattern: (_CC|_RB)\sthen\w\s\w
  • Final then pattern: \,\sthen[\.\?\'\"\!]
    (a simplified search sketch follows below)
85
Exceptions
  • Non-Narrative Initial then
  • then there be
  • then come
  • then again
  • then and now
  • only then
  • even then
  • so then
  • Non-Temporal Final then
  • What then?
  • All right/OK , then
  • And then?

86
Results
Distribution of then senses (T = Temporal, N = Narrative, O = Other):

Clause-initial then:
  Written Fiction 2000:   T 1.73% (23/1322)    N 96.67% (1276/1322)   O 1.58% (21/1322)
  Spoken Broadcast News:  T 0.73% (6/818)      N 93.88% (768/818)     O 5.3% (44/818)
  Written Gigaword News:  T 3.64% (27/740)     N 75.94% (562/740)     O 20.40% (151/740)
Clause-final then:
  Written Fiction 2000:   T 71.81% (79/110)    N 2.72% (3/110)        O 25.45% (28/110)
  Spoken Broadcast News:  T 72.61% (61/84)     N 5.95% (5/84)         O 21.42% (18/84)
  Written Gigaword News:  T 93.23% (179/192)   N 0                    O 6.77% (13/192)

Other is a presence in final position in fiction
and broadcast news, and in initial position in
print news. Is this real or an artifact of the
catch-all class? Conclusion: only part of her claims
was tested. But those claims are borne out across
three different genres and much more data!
87
Outline
  • Topics
  • Concordances
  • Data sparseness
  • Chomsky's Critique
  • Ngrams
  • Mutual Information
  • Part-of-speech tagging
  • Annotation Issues
  • Inter-Annotator Reliability
  • Named Entity Tagging
  • Relationship Tagging
  • Case Studies
  • metonymy
  • adjective ordering
  • Discourse markers: then
  • TimeML

88
Considerations in Inter-Annotator Agreement
  • Size of tagset
  • Structure of tagset
  • Clarity of Guidelines
  • Number of raters
  • Experience of raters
  • Training of raters
  • Independent ratings (preferred)
  • Consensus (not preferred)
  • Exact, partial, and equivalent matches
  • Metrics
  • Lessons Learned: disagreement patterns suggest
    guideline revisions

89
Protein Names
  • Considerable variability in the forms of the
    names
  • Multiple naming conventions
  • Researchers may name a newly discovered protein
    based on
  • function
  • sequence features
  • gene name
  • cellular location
  • molecular weight
  • discoverer
  • or other properties
  • Prolific use of abbreviations and acronyms
  • fushi tarazu 1 factor homolog
  • Fushi tarazu factor (Drosophila) homolog 1
  • FTZ-F1 homolog ELP
  • steroid/thyroid/retinoic nuclear hormone
    receptor homolog nhr-35
  • V-INT 2 murine mammary tumor virus integration
    site oncogene homolog
  • fibroblast growth factor 1 (acidic) isoform 1
    precursor
  • nuclear hormone receptor subfamily 5, Group A,
    member 1

90
Guidelines v1 TOC
91
Agreement Metrics
                      Candidate: Yes   Candidate: No
Reference: Yes        TP               FN
Reference: No         FP               TN

Measure                 Definition
Percentage Agreement    100 (TP + TN) / (TP + FP + TN + FN)
Precision               TP / (TP + FP)
Recall                  TP / (TP + FN)
(Balanced) F-Measure    2 × Precision × Recall / (Precision + Recall)
92
Example for F-measure Scorer Output (Protein
Name Tagging)
  • REFERENCE                     CANDIDATE
  • CORR   FTZ-F1 homolog ELP     FTZ-F1 homolog ELP
  • INCO   M2-LHX3                M2
  • SPUR                          -
  • SPUR                          LHX3
  • Precision = 1/4 = 0.25
  • Recall = 1/2 = 0.5
  • F-measure = 2 × 1/4 × 1/2 / (1/4 + 1/2) = 0.33
    (a small computation sketch follows below)
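
The same numbers computed from the counts implied above (TP = 1 correct, FP = 3 candidate-only, FN = 1 reference item not correctly found), as a quick sketch:

def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

print(precision_recall_f(tp=1, fp=3, fn=1))  # (0.25, 0.5, 0.333...)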

93
The importance of disagreement
  • Measuring inter-annotator agreement is very
    useful in debugging the annotation scheme
  • Disagreement can lead to improvements in the
    annotation scheme
  • Extreme disagreement can lead to abandonment of
    the scheme

94
V2 Assessment (ABS2)
  • Old Guidelines:
  • protein: 0.71 F
  • acronym: 0.85 F
  • array-protein: 0.15 F
  • New Guidelines:
  • protein: 0.86 F
  • long-form: 0.71 F
  • these are only 4 of the tags

Coders    Correct   Precision   Recall   F-measure
<protein>
A1-A3     4497      0.874       0.852    0.863
A1-A4     4769      0.884       0.904    0.894
A3-A4     4476      0.830       0.870    0.849
Average             0.862       0.875    0.868
<long-form>
A1-A3     172       0.720       0.599    0.654
A1-A4     241       0.837       0.840    0.838
A3-A4     175       0.608       0.732    0.664
Average             0.721       0.723    0.718
95
TIMEX2 Annotation Scheme
  • Time Points: <TIMEX2 VAL="2000-W42">the third week
    of October</TIMEX2>
  • Durations: <TIMEX2 VAL="PT30M">half an hour
    long</TIMEX2>
  • Indexicality: <TIMEX2 VAL="2000-10-04">tomorrow</TIMEX2>
  • Sets: <TIMEX2 VAL="XXXX-WXX-2" SET="YES"
    PERIODICITY="F1W" GRANULARITY="G1D">every
    Tuesday</TIMEX2>
  • Fuzziness: <TIMEX2 VAL="1990-SU">Summer of 1990</TIMEX2>
  • <TIMEX2 VAL="1999-07-15TMO">This
    morning</TIMEX2>
  • Non-specificity: <TIMEX2 VAL="XXXX-04"
    NON_SPECIFIC="YES">April</TIMEX2> is usually wet.
    timex2.mitre.org

96
TIMEX2 Inter-Annotator Agreement
193 NYT news docs, 5 annotators, 10 pairs of
annotators
  • Human annotation quality is acceptable on
    EXTENT and VAL
  • Poor performance on Granularity and Non-Specific
  • But only a small number of instances of these
    (about 200 of 6000)
  • Annotators deviate from guidelines, and produce
    systematic errors (fatigue?)
  • several years ago: PXY instead of PAST_REF
  • all day: P1D instead of YYYY-MM-DD
97
TempEx in Qanda
98
Summary Inter-Annotator Reliability
  • There's no point going on with an annotation
    scheme if it can't be reproduced
  • There are standard methods for measuring
    inter-annotator reliability
  • An analysis of inter-annotator disagreements is
    critical for debugging an annotation scheme

99
Outline
  • Topics
  • Concordances
  • Data sparseness
  • Chomsky's Critique
  • Ngrams
  • Mutual Information
  • Part-of-speech tagging
  • Annotation Issues
  • Inter-Annotator Reliability
  • Named Entity Tagging
  • Relationship Tagging
  • Case Studies
  • metonymy
  • adjective ordering
  • Discourse markers: then
  • TimeML

100
Information Extraction
  • Types
  • Flag names of people, organizations, places,
  • Flag and normalize phrases such as time
    expressions, measure phrases, currency
    expressions, etc.
  • Group coreferring expressions together
  • Find relations between named entities (works for,
    located at, etc.)
  • Find events mentioned in the text
  • Find relations between events and entities
  • A hot commercial technology!
  • Example patterns
  • Mr. ---,
  • , Ill.

101
Message Understanding Conferences (MUCs)
  • Idea: precise tasks to measure success, rather
    than a test suite of inputs and logical forms.
  • MUC-1 1987 and MUC-2 1989 - messages about navy
    operations
  • MUC-3 1991 and MUC-4 1992 - news articles and
    transcripts of radio broadcasts about terrorist
    activity
  • MUC-5 1993 - news articles about joint ventures
    and microelectronics
  • MUC-6 1995 - news articles about management
    changes, additional tasks of named entity
    recognition, coreference, and template element
  • MUC-7 1998 - mostly multilingual information
    extraction
  • Has also been applied to hundreds of other
    domains - scientific articles, etc., etc.

102
Historical Perspective
  • Until MUC-3 (1993), many IE systems used a
    Knowledge Engineering approach
  • They did something like full chart parsing with a
    unification-based grammar with full logical
    forms, a rich lexicon and KB
  • E.g., SRI's Tacitus
  • Then, they discovered that things could work much
    faster using finite-state methods and partial
    parsing
  • And that using domain-specific rather than
    general-purpose lexicons simplified parsing (less
    ambiguity due to fewer irrelevant senses)
  • And that these methods worked even better for the
    IE tasks
  • E.g., SRI's Fastus, SRA's Nametag
  • Meanwhile, people also started using statistical
    learning methods from annotated corpora
  • Including CFG parsing

103
An instantiated scenario template
Source: Wall Street Journal, 06/15/88
MAXICARE HEALTH PLANS INC and UNIVERSAL HEALTH
SERVICES INC have dissolved a joint venture which
provided health services.
104
Templates Can get Complex! (MUC-5)
105
2002 Automatic Content Extraction (ACE) Program
Entity Types
  • Person
  • Organization
  • (Place)
  • Location: e.g., geographical areas, landmasses,
    bodies of water, geological formations
  • Geo-Political Entity: e.g., nations, states,
    cities
  • Created due to metonymies involving this class of
    places:
  • The riots in Miami
  • Miami imposed a curfew
  • Miami railed against a curfew
  • Facility: buildings, streets, airports, etc.

106
ACE Entity Attributes and Relations
  • Attributes
  • Name: an entity mentioned by name
  • Pronoun
  • Nominal
  • Relations
  • AT: based-in, located, residence
  • NEAR: relative-location
  • PART: part-of, subsidiary, other
  • ROLE: affiliate-partner, citizen-of, client,
    founder, general-staff, manager, member, owner,
    other
  • SOCIAL: associate, grandparent, parent, sibling,
    spouse, other-relative, other-personal,
    other-professional

107
Designing an Information Extraction Task
  • Define the overall task
  • Collect a corpus
  • Design an Annotation Scheme
  • linguistic theories help
  • Use Annotation Tools
  • - authoring tools
  • - automatic extraction tools
  • Apply the annotation to the corpus, assessing
    reliability
  • Use the training portion of the corpus to train
    information extraction (IE) systems
  • Use the test portion to test IE systems, using a
    scoring program

108
Annotation Tools
  • Specialized authoring tools used for marking up
    text without damaging it
  • Some tools are tied to particular annotation
    schemes

109
Annotation Tool Example Alembic Workbench
110
Callisto (Java successor to Alembic Workbench)
111
Relationship Annotation Callisto
112
Steps in Information Extraction
  • Tokenization
  • Language Identification
  • D