Title: Computational Tools for Linguists
1Computational Tools for Linguists
- Inderjeet Mani
- Georgetown University
- im5_at_georgetown.edu
2Topics
- Computational tools for
- manual and automatic annotation of linguistic data
- exploration of linguistic hypotheses
- Case studies
- Demonstrations and training
- Inter-annotator reliability
- Effectiveness of annotation scheme
- Costs and tradeoffs in corpus preparation
3Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
4Corpus Linguistics
- Use of linguistic data from corpora to test linguistic hypotheses => emphasizes language use
- Uses computers to do the searching and counting from on-line material - faster than doing it by hand! Check?
- Most typical tool is a concordancer, but there are many others!
- Tools can analyze a certain amount; the rest is left to the human!
- Corpus Linguistics is also a particular approach to linguistics, namely an empiricist approach
- Sometimes (extreme view) opposed to the rationalist approach, at other times (more moderate view) viewed as complementary to it
- Cf. Theoretical vs. Applied Linguistics
5Empirical Approaches in Computational Linguistics
- Empiricism: the doctrine that knowledge is derived from experience
- Rationalism: the doctrine that knowledge is derived from reason
- Computational Linguistics is, by necessity, focused on performance, in that naturally occurring linguistic data has to be processed
- Naturally occurring data is messy! This means we have to process data characterized by false starts, hesitations, elliptical sentences, long and complex sentences, input that is in a complex format, etc.
- The methodology used is corpus-based
- linguistic analysis (phonological, morphological, syntactic, semantic, etc.) carried out on a fairly large scale
- rules are derived by humans or machines from looking at phenomena in situ (with statistics playing an important role)
6Example: metonymy
- Metonymy: substituting the name of one referent for another
- George W. Bush invaded Iraq
- A Mercedes rear-ended me
- Is metonymy involving institutions as agents more common in print news than in fiction?
- The X V-reporting
- Let's start with "The X said"
- This pattern will provide a handle to identify the data
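A minimal concordance-style search for this handle might look like the following Python sketch; the toy text and the capitalized-word approximation of X are illustrative assumptions, not part of the original demo.

```python
import re
from collections import Counter

# Toy stand-in for a print-news corpus; in practice this would be read from a corpus file.
text = ("The Pentagon said the report was flawed. "
        "The Court said it would review the case. The girl said nothing.")

# "The X said", where X is a single capitalized token: a crude handle for
# institution-as-agent metonymy (e.g. "The Pentagon said").
pattern = re.compile(r"\bThe\s+([A-Z][a-z]+)\s+said\b")

fillers = Counter()
for m in pattern.finditer(text):
    fillers[m.group(1)] += 1
    # Print a small concordance window around each match.
    start, end = max(0, m.start() - 20), min(len(text), m.end() + 20)
    print(text[start:end])

print(fillers.most_common())
```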
7Exploring Corpora
- Datasets
- http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/DataSets.cgi
- Metonymy Test using Corpora
- http://complingtwo.georgetown.edu/gwilson/Tools/Metonymy/TheXSaid_MST.html
8"The X said" from Concordance data

Corpus         Words   Freq   Freq per M words
Fiction 1870   1.7M    60     35
Fiction 2000   1.5M    219    146
Print News     1.9M    915    481

The preference for metonymy in print news arises because of the need to communicate information from companies and governments.
9Chomsky's Critique of Corpus-Based Methods
- 1. Corpora model performance, while linguistics is aimed at the explanation of competence
- If you define linguistics that way, linguistic theories will never be able to deal with actual, messy data
- Many linguists don't find the competence-performance distinction to be clear-cut. Sociolinguists have argued that the variability of linguistic performance is systematic, predictable, and meaningful to speakers of a language.
- Grammatical theories vary in where they draw the line between competence and performance, with some grammars (such as Halliday's Systemic Grammar) organized as systems of functionally-oriented choices.
10Chomsky's Critique (concluded)
- 2. Natural language is in principle infinite, whereas corpora are finite, so many examples will be missed
- Excellent point, which needs to be understood by anyone working with a corpus.
- But does that mean corpora are useless?
- Introspection is unreliable (prone to performance factors, cf. only short sentences), and pretty useless with child data.
- Also, insights from a corpus might lead to generalization/induction beyond the corpus, if the corpus is a good sample of the text population
- 3. Ungrammatical examples won't be available in a corpus
- Depends on the corpus, e.g., spontaneous speech, language learners, etc.
- The notion of grammaticality is not that clear
- Who did you see pictures / ?a picture / ??his picture / John's picture of?
- ARG/ADJUNCT example
11Which Words are the Most Frequent?
Common Words in Tom Sawyer (71,370 words), from Manning & Schütze, p. 21
Will these counts hold in a different corpus (and genre, cf. Tom)? What happens if you have 8-9M words? (check the usage demo!)
12Data Sparseness
- Many low-frequency words
- Fewer high-frequency words.
- Only a few words will have lots of examples.
- About 50% of word types occur only once
- Over 90% occur 10 times or less.
- So, there is merit to Chomsky's 2nd objection
Word Frequency Number of words of that frequency
1 3993
2 1292
3 664
4 410
5 243
6 199
7 172
8 131
9 82
10 91
11-50 540
51-100 99
- >100 102
Frequency of word types in Tom Sawyer, from Manning & Schütze, p. 22.
13Zipf's Law: Frequency is inversely proportional to rank

Word        f      r      f·r
turned      51     200    10200
you'll      30     300    9000
name        21     400    8400
comes       16     500    8000
group       13     600    7800
lead        11     700    7700
friends     10     800    8000
begin       9      900    8100
family      8      1000   8000
brushed     4      2000   8000
sins        2      3000   6000
could       2      4000   8000
applausive  1      8000   8000

Empirical evaluation of Zipf's Law on Tom Sawyer, from Manning & Schütze, p. 23.
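The f × r ≈ constant pattern can be checked on any corpus with a few lines of Python; the inline text below is a toy stand-in, and for a real test one would read a corpus the size of Tom Sawyer from disk.

```python
import re
from collections import Counter

# Tiny stand-in text so the sketch runs as-is; replace with a real corpus for a meaningful check.
text = "the boy and the dog ran and the boy laughed " * 50

words = re.findall(r"[a-z']+", text.lower())
ranked = sorted(Counter(words).values(), reverse=True)

# freq * rank should stay roughly constant if Zipf's law holds.
for rank in range(1, len(ranked) + 1):
    f = ranked[rank - 1]
    print(f"rank={rank}  freq={f}  freq*rank={f * rank}")
```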
14Illustration of Zipf's Law (Brown Corpus, from Manning & Schütze, p. 30)
[Plot of frequency against rank, logarithmic scale on both axes]
- See also http://www.georgetown.edu/faculty/wilsong/IR/WordDist.html
15Tokenizing words for corpus analysis
- 1. Break on:
- Spaces? Not for Japanese: inuo butta otokonokowa otooto da
- Periods? (U.K. Products)
- Hyphens? data-base / database / data base
- Apostrophes? won't, couldn't, O'Riley, car's
- 2. Should different word forms be counted as distinct?
- Lemma: a set of lexical forms having the same stem, the same POS, and the same word-sense. So, cat and cats are the same lemma.
- Sometimes, words are lemmatized by stemming, other times by morphological analysis, using a dictionary and/or morphological rules
- 3. Fold case or not (usually folded)?
- The / the / THE; Mark versus mark
- One may need, however, to regenerate the original case when presenting it to the user
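The following Python sketch illustrates the kind of crude tokenization and case folding discussed above; the regular expression is a simplification, since real corpus tools handle periods, hyphens, and clitics with language-specific rules.

```python
import re
from collections import Counter

def tokenize(text, fold_case=True):
    """Very rough tokenizer: optionally fold case, then pull out
    letter sequences, allowing a single internal apostrophe (won't, car's)."""
    if fold_case:
        text = text.lower()
    return re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?", text)

sample = "The cat won't chase the cats. THE END."
tokens = tokenize(sample)
types = Counter(tokens)
print(len(tokens), "tokens,", len(types), "types")
print(types)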
16Counting Word Tokens vs Word Types
- Word tokens in Tom Sawyer: 71,370
- Word types (i.e., how many different words): 8,018
- In newswire text of that many tokens, you would have 11,000 word types. Perhaps because Tom Sawyer is written in a simple style.
17Inspecting word frequencies in a corpus
- http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/DataSets.cgi
- Usage demo
- http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/Usage.cgi
18Ngrams
- Sequences of linguistic items of length n
- See count.pl
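count.pl is a Perl script; purely as an illustration, a Python analogue of basic n-gram counting could look like this.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the sequence of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat the cat slept".split()
bigram_counts = Counter(ngrams(tokens, 2))
for bg, c in bigram_counts.most_common(5):
    print(" ".join(bg), c)
```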
19A test for association strength: Mutual Information
Data from (Church et al. 1991), 1988 AP corpus, N = 44.3M
20Interpreting Mutual Information
- High scores, e.g., strong supporter (8.85), indicate words strongly associated in the corpus
- MI is a logarithmic score. To convert it, recall that x = 2^(log2 x)
- So 2^8.85 ≈ 461. This pair is about 461 times more frequent than chance.
- Low scores: powerful support (1.74) is only about 3 times chance, since 2^1.74 ≈ 3
- I = 1.74, f(x,y) = 2, f(x) = 1984, f(y) = 13,428, for x y = powerful support
- I = log2( 2N / (1984 × 13,428) ) ≈ 1.74
- So a low score doesn't necessarily mean weakly associated; it could be due to data sparseness
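A sketch of the MI computation, using the counts quoted above; the function name and rounding are mine, not from the original.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2( N * f(x,y) / (f(x) * f(y)) )."""
    return math.log2((n * f_xy) / (f_x * f_y))

N = 44_300_000  # size of the 1988 AP corpus used on the slide

# "powerful support": f(x,y) = 2, f(powerful) = 1984, f(support) = 13428
print(round(mutual_information(2, 1984, 13428, N), 2))  # ~1.73 (the slide rounds to 1.74)
# Converting a score back to a ratio over chance: "strong supporter" at 8.85
print(round(2 ** 8.85))  # ~461 times chance
```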
21Mutual Information over Grammatical Relations
- Parse a corpus
- Determine subject-verb-object triples
- Identify head nouns of subject and object NPs
- Score subj-verb and verb-obj associations using MI
22Demo of Verb-Subj, Verb-Obj Parses
- Who devours or what gets devoured?
- Demo: http://www.cs.ualberta.ca/lindek/demos/depindex.htm
23MI over verb-obj relations
- Data from (Church et al. 1991)
24A Subj-Verb MI Example Who does what in news?
- executive police politician
- reprimand 16.36 shoot 17.37 clamor
16.94 - conceal 17.46 raid 17.65 jockey 17.53
- bank 18.27 arrest 17.96 wrangle 17.59
- foresee 18.85 detain 18.04 woo 18.92
- conspire 18.91 disperse 18.14 exploit 19.57
- convene 19.69 interrogate 18.36 brand 19.65
- plead 19.83 swoop 18.44 behave 19.72
- sue 19.85 evict 18.46 dare 19.73
- answer 20.02 bundle 18.50 sway 19.77
- commit 20.04 manhandle 18.59 criticize 19.78
- worry 20.04 search 18.60 flank 19.87
- accompany 20.11 confiscate 18.63 proclaim 19.91
- own 20.22 apprehend 18.71 annul 19.91
- witness 20.28 round 18.78 favor 19.92
Data from (Schiffman et al. 2001)
25Famous Corpora
- Must see http://www.ldc.upenn.edu/Catalog/
- Brown Corpus
- British National Corpus
- International Corpus of English
- Penn Treebank
- Lancaster-Oslo-Bergen Corpus
- Canadian Hansard Corpus
- U.N. Parallel Corpus
- TREC Corpora
- MUC Corpora
- English, Arabic, Chinese Gigawords
- Chinese and Arabic Treebanks
- North American News Text Corpus
- Multext East Corpus: 1984 in multiple Eastern/Central European languages
26Links to Corpora
- Corpora
- Linguistic Data Consortium (LDC): http://www.ldc.upenn.edu/
- Oxford Text Archive: http://sable.ox.ac.uk/ota/
- Project Gutenberg: http://www.promo.net/pg/
- CORPORA list: http://www.hd.uib.no/corpora/archive.html
- Other
- Chris Manning's Corpora Page: http://www-nlp.stanford.edu/links/statnlp.html#Corpora
- Michael Barlow's Corpus Linguistics page: http://www.ruf.rice.edu/barlow/corpus.html
- Cathy Ball's Corpora tutorial: http://www.georgetown.edu/faculty/ballc/corpora/tutorial.html
27Summary Introduction
- Concordances and corpora are widely used and available, to help one develop empirically-based linguistic theories and computer implementations
- The linguistic items that can be counted are many, but words (defined appropriately) are basic items
- The frequency distribution of words in any natural language is Zipfian
- Data sparseness is a basic problem when using observations in a corpus sample of language
- Sequences of linguistic items (e.g., word sequences, n-grams) can also be counted, but longer items will be very rare
- Associations between items can be easily computed
- e.g., associations between verbs and parser-discovered subjects or objects
28Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
29Using POS in Concordances
Corpus         POS   Words    Freq   Freq/M Words
Fiction 2000   NN    1.5M     115    7.66   (pattern: \bdeal_NN)
Fiction 2000   VB    1.5M     14     9.33
Gigaword       NN    10.5M    2857   2.72
Gigaword       VB    10.5M    139    1.32

Relative to Gigaword, deal is more often a verb in Fiction 2000; relative to Fiction 2000, deal is more often a noun in English Gigaword; and deal is more prevalent in Fiction 2000 than in Gigaword.
30POS Tagging What is it?
- Given a sentence and a tagset of lexical categories, find the most likely tag for each word in the sentence
- Tagset: e.g., Penn Treebank (45 tags, derived from the 87-tag Brown corpus tagset)
- Note that many of the words may have unambiguous tags
- Example:
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
- People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
31More details of POS problem
- How ambiguous?
- Most words in English have only one Brown Corpus tag
- Unambiguous (1 tag): 35,340 word types
- Ambiguous (2-7 tags): 4,100 word types (11.5%)
- 7 tags: 1 word type (still)
- But many of the most common words are ambiguous
- Over 40% of Brown corpus tokens are ambiguous
- Obvious strategies may be suggested based on intuition:
- to/TO race/VB
- the/DT race/NN
- will/MD race/NN
- Sentences can also contain unknown words for which tags have to be guessed: Secretariat/NNP is/VBZ
32Different English Part-of-Speech Tagsets
- Brown corpus - 87 tags
- Allows compound tags
- "I'm" is tagged as PPSS+BEM
- PPSS for "non-3rd person nominative personal pronoun" and BEM for "am, 'm"
- Others have derived their work from the Brown Corpus
- LOB Corpus: 135 tags
- Lancaster UCREL Group: 165 tags
- London-Lund Corpus: 197 tags
- BNC: 61 tags (C5)
- PTB: 45 tags
- To see comparisons and mappings of tagsets, go to www.comp.leeds.ac.uk/amalgam/tagsets/tagmenu.html
33PTB Tagset (36 main tags + 9 punctuation tags)
34PTB Tagset Development
- Several changes were made to the Brown Corpus tagset
- Recoverability
- Lexical: same treatment of be, do, have, whereas BC gave each its own symbol
- Do/VB does/VBZ did/VBD doing/VBG done/VBN
- Syntactic: since parse trees were used as part of the Treebank, certain categories were conflated under the assumption that they would be recoverable from syntax
- subject vs. object pronouns (both PP)
- subordinating conjunctions vs. prepositions: on being informed vs. on the table (both IN)
- preposition to vs. infinitive marker (both TO)
- Syntactic Function
- BC: the/DT one/CD vs. PTB: the/DT one/NN
- BC: both/ABX vs. PTB: both/PDT the boys, the boys both/RB, both/NNS of the boys, both/CC boys and girls
35PTB Tagging Process
- Tagset developed
- Automatic tagging by rule-based and statistical POS taggers
- Human correction using an editor embedded in GNU Emacs
- It takes under a month for humans to learn this (at 15 hours a week), and annotation speeds after a month exceed 3,000 words/hour
- Inter-annotator disagreement (4 annotators, eight 2,000-word docs) was 7.2% for the tagging task and 4.1% for the correcting task
- Manual tagging took about 2X as long as correcting, with about 2X the inter-annotator disagreement rate and an error rate that was about 50% higher
- So, for certain problems, having a linguist correct automatically tagged output is far more efficient and leads to better reliability among linguists compared to having them annotate the text from scratch!
36Automatic POS tagging
- http://complingone.georgetown.edu/linguist/
37A Baseline Strategy
- Choose the most likely tag for each ambiguous word, independent of previous words
- i.e., assign each token to the POS category it occurred in most often in the training set
- E.g., race: which POS is more likely in a corpus?
- This strategy gives you 90% accuracy in controlled tests
- So, this unigram baseline must always be compared against
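A sketch of this unigram baseline with a hypothetical toy training set; the fallback tag for unknown words is an illustrative assumption.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Pick, for each word, the tag it was seen with most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sentence, model, default="NN"):
    # Unknown words fall back to a default tag (NN here, as a crude guess).
    return [(w, model.get(w, default)) for w in sentence]

train = [[("the", "DT"), ("race", "NN")], [("to", "TO"), ("race", "VB")],
         [("the", "DT"), ("race", "NN")]]
model = train_unigram_tagger(train)
print(tag(["the", "race"], model))  # race -> NN, its most frequent training tag
```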
38Beyond the Baseline
- Hand-coded rules
- Sub-symbolic machine learning
- Symbolic machine learning
39Machine Learning
- Machines can learn from examples
- Learning can be supervised or unsupervised
- Given training data, machines analyze the data,
and learn rules which generalize to new examples - Can be sub-symbolic (rule may be a mathematical
function) e.g. neural nets - Or it can be symbolic (rules are in a
representation that is similar to representation
used for hand-coded rules) - In general, machine learning approaches allow for
more tuning to the needs of a corpus, and can be
reused across corpora
40A Probabilistic Approach to POS tagging
- What you want to do is find the "best" sequence of POS tags C = C1..Cn for a sentence W = W1..Wn
- (Here C1 is pos_tag(W1).)
- In other words, find a sequence of POS tags C that maximizes P(C | W)
- Using Bayes' Rule, we can say:
- P(C | W) = P(W | C) P(C) / P(W)
- Since we are interested in finding the value of C which maximizes the RHS, the denominator can be discarded, since it will be the same for every C
- So, the problem is: find C which maximizes
- P(W | C) P(C)
- Example: He will race
- Possible sequences:
- He/PP will/MD race/NN
- He/PP will/NN race/NN
- He/PP will/MD race/VB
- He/PP will/NN race/VB
- W = W1 W2 W3 = He will race
- C = C1 C2 C3
- Choices:
- C = PP MD NN
- C = PP NN NN
- C = PP MD VB
- C = PP NN VB
41Independence Assumptions
- P(C1..Cn) ≈ ∏i=1..n P(Ci | Ci-1)
- This assumes that the event of a POS tag occurring is independent of the event of any other POS tag occurring, except for the immediately previous POS tag
- From a linguistic standpoint, this seems an unreasonable assumption, due to long-distance dependencies
- P(W1..Wn | C1..Cn) ≈ ∏i=1..n P(Wi | Ci)
- This assumes that the event of a word appearing in a category is independent of the event of any other word appearing in a category
- Ditto
- However, the proof of the pudding is in the eating!
- N-gram models work well for part-of-speech tagging
42A Statistical Method for POS Tagging
- Find the value of C1..Cn which maximizes:
  ∏i=1..n P(Wi | Ci) × P(Ci | Ci-1)
  (lexical generation probs × POS bigram probs)

Lexical generation probabilities P(word | tag):
         MD    NN    VB    PRP
he       0     0     0     .3
will     .8    .2    0     0
race     0     .4    .6    0

POS bigram probabilities P(Ci | Ci-1) (rows = Ci-1, columns = Ci):
         MD    NN    VB    PRP
MD             .4    .6
NN             .3    .7
PRP      .8    .2
<s>                        1
43Finding the best path through an HMM
[Trellis diagram for the Viterbi algorithm: start state A = <s>; B = he/PP (lex .3); C = will/MD (lex .8); D = will/NN (lex .2); E = race/NN (lex .4); F = race/VB (lex .6); arcs carry the POS bigram probabilities from the previous slide]
- Score(I) = max over J in pred(I) of Score(J) × transition(J→I) × lex(I)
- Score(B) = P(PP | <s>) × P(he | PP) = 1 × .3 = .3
- Score(C) = Score(B) × P(MD | PP) × P(will | MD) = .3 × .8 × .8 = .19
- Score(D) = Score(B) × P(NN | PP) × P(will | NN) = .3 × .2 × .2 = .012
- Score(E) = max[Score(C) × P(NN | MD), Score(D) × P(NN | NN)] × P(race | NN)
- Score(F) = max[Score(C) × P(VB | MD), Score(D) × P(VB | NN)] × P(race | VB)
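A compact Viterbi sketch over this toy example, using the probability tables from the previous slide; the code structure, and the use of PRP for the pronoun tag (written PP in the trellis), are illustrative choices.

```python
# Probability tables copied from the slides; everything else is a sketch.
lex = {  # P(word | tag)
    ("he", "PRP"): 0.3,
    ("will", "MD"): 0.8, ("will", "NN"): 0.2,
    ("race", "NN"): 0.4, ("race", "VB"): 0.6,
}
trans = {  # P(tag_i | tag_{i-1}); "<s>" is the start state
    ("<s>", "PRP"): 1.0,
    ("PRP", "MD"): 0.8, ("PRP", "NN"): 0.2,
    ("MD", "NN"): 0.4, ("MD", "VB"): 0.6,
    ("NN", "NN"): 0.3, ("NN", "VB"): 0.7,
}
TAGS = ["PRP", "MD", "NN", "VB"]

def viterbi(words):
    # scores[tag] = (best probability of a path ending in tag, tag sequence so far)
    scores = {"<s>": (1.0, [])}
    for w in words:
        new_scores = {}
        for t in TAGS:
            emit = lex.get((w, t), 0.0)
            if emit == 0.0:
                continue
            best = max(
                ((p * trans.get((prev, t), 0.0) * emit, path + [t])
                 for prev, (p, path) in scores.items()),
                key=lambda x: x[0],
            )
            if best[0] > 0:
                new_scores[t] = best
        scores = new_scores
    return max(scores.values(), key=lambda x: x[0])

prob, tags = viterbi(["he", "will", "race"])
print(tags, prob)  # ['PRP', 'MD', 'VB'], since .3*.8*.8*.6*.6 beats the alternatives
```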
44But Data Sparseness Bites Again!
- Lexical generation probabilities will lack observations for low-frequency and unknown words
- Most systems do one of the following:
- Smooth the counts
- E.g., add a small number to unseen data (to zero counts). For example, assume a bigram not seen in the data has a very small probability, e.g., .0001.
- Back off from bigrams to unigrams, etc.
- Use lots more data (you'll still lose, thanks to Zipf!)
- Group items into classes, thus increasing class frequency
- e.g., group words into ambiguity classes, based on their set of tags. For counting, all words in an ambiguity class are treated as variants of the same word
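As an illustration of the first option, here is a minimal add-one (Laplace) smoothing sketch over toy tag bigrams; the data and function name are hypothetical.

```python
from collections import Counter

# Toy tag sequences; in practice the counts would come from a tagged training corpus.
tag_sequences = [["DT", "NN", "VBD"], ["DT", "JJ", "NN"], ["PRP", "MD", "VB"]]

bigram_counts = Counter()
unigram_counts = Counter()
for seq in tag_sequences:
    for prev, cur in zip(seq, seq[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

tagset = sorted({t for seq in tag_sequences for t in seq})

def p_smoothed(cur, prev):
    """Add-one smoothed P(cur | prev): unseen bigrams get a small, non-zero mass."""
    return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + len(tagset))

print(p_smoothed("NN", "DT"))  # seen bigram
print(p_smoothed("VB", "DT"))  # unseen bigram, but non-zero after smoothing
```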
45A Symbolic Learning Method
- HMMs are sub-symbolic: they don't give you rules that you can inspect
- A method called Transformational Rule Sequence learning (the Brill algorithm) can be used for symbolic learning (among other approaches)
- The rules (actually, a sequence of rules) are learnt from an annotated corpus
- Performs at least as accurately as other statistical approaches
- Has better treatment of context compared to HMMs
- rules which use the next (or previous) POS
- HMMs just use P(Ci | Ci-1) or P(Ci | Ci-2, Ci-1)
- rules which use the previous (next) word
- HMMs just use P(Wi | Ci)
46Brill Algorithm (Overview)
- Assume you are given a training corpus G (for gold standard)
- First, create a tag-free version V of it
- Notes:
- As the algorithm proceeds, each successive rule becomes narrower (covering fewer examples, i.e., changing fewer tags), but also potentially more accurate
- Some later rules may change tags changed by earlier rules
- 1. First label every word token in V with the most likely tag for that word type from G. If this initial state annotator is perfect, you're done!
- 2. Then consider every possible transformational rule, selecting the one that leads to the most improvement in V, using G to measure the error
- 3. Retag V based on this rule
- 4. Go back to 2, until there is no significant improvement in accuracy over the previous iteration
47Brill Algorithm (Detailed)
- 1. Label every word token with its most likely tag (based on lexical generation probabilities).
- 2. List the positions of tagging errors and their counts, by comparing with ground truth (GT)
- 3. For each error position, consider each instantiation I of X, Y, and Z in the rule template. If Y = GT, increment improvements[I], else increment errors[I].
- 4. Pick the I which results in the greatest error reduction, and add it to the output
- e.g., VB NN PREV1OR2TAG DT improves 98 errors, but produces 18 new errors, so a net decrease of 80 errors
- 5. Apply that I to the corpus
- 6. Go to 2, unless the stopping criterion is reached
- Most likely tags:
- P(NN | race) = .98
- P(VB | race) = .02
- Is/VBZ expected/VBN to/TO race/NN tomorrow/NN
- Rule template: Change a word from tag X to tag Y when the previous tag is Z
- Rule instantiation for the above example: NN VB PREV1OR2TAG TO
- Applying this rule yields:
- Is/VBZ expected/VBN to/TO race/VB tomorrow/NN
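A sketch of how a single learnt transformation is applied; for brevity it implements the simpler "previous tag is Z" form of the rule template stated above (rather than the PREV1OR2TAG condition), and the function name is mine.

```python
def apply_prevtag_rule(tagged, from_tag, to_tag, trigger_tag):
    """Change from_tag to to_tag when the immediately previous tag is trigger_tag.

    Changes are applied left to right, so later positions see updated tags.
    """
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == trigger_tag:
            out[i] = (word, to_tag)
    return out

sentence = [("Is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]

# Learnt rule from the slide: change NN to VB when the previous tag is TO.
print(apply_prevtag_rule(sentence, "NN", "VB", "TO"))
# [('Is','VBZ'), ('expected','VBN'), ('to','TO'), ('race','VB'), ('tomorrow','NN')]
```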
48Example of Error Reduction
From Eric Brill (1995) Computational
Linguistics, 21, 4, p. 7
49Example of Learnt Rule Sequence
- 1. NN VB PREVTAG TO
- to/TO race/NN → VB
- 2. VBP VB PREV1OR2OR3TAG MD
- might/MD vanish/VBP → VB
- 3. NN VB PREV1OR2TAG MD
- might/MD not/MD reply/NN → VB
- 4. VB NN PREV1OR2TAG DT
- the/DT great/JJ feast/VB → NN
- 5. VBD VBN PREV1OR2OR3TAG VBZ
- He/PP was/VBZ killed/VBD → VBN by/IN Chapman/NNP
50Handling Unknown Words
- Can also use the Brill method
- Guess NNP if capitalized, NN otherwise.
- Or use the tag most common for words ending in
the last 3 letters. - etc.
Example Learnt Rule Sequence for Unknown Words
51POS Tagging using Unsupervised Methods
- Reason: annotated data isn't always available!
- Example: the can
- Let's take unambiguous words from the dictionary, and count their occurrences after the
- the .. elephant
- the .. guardian
- Conclusion: immediately after the, nouns are more common than verbs or modals
- Initial state annotator: for each word, list all tags in the dictionary
- Transformation template:
- Change the tag χ of a word to tag Y if the previous (next) tag (word) is Z, where χ is a set of 2 or more tags
- Don't change any other tags
52Error Reduction in Unsupervised Method
- Let a rule to change χ to Y in context C be represented as Rule(χ, Y, C).
- Rule1: {VB, MD, NN} → NN PREVWORD the
- Rule2: {VB, MD, NN} → VB PREVWORD the
- Idea:
- Since annotated data isn't available, score rules so as to prefer those where Y appears much more frequently in the context C than all others in χ
- Frequency is measured by counting unambiguously tagged words
- So, prefer {VB, MD, NN} → NN PREVWORD the
- to {VB, MD, NN} → VB PREVWORD the
- since dictionary-unambiguous nouns are more common in a corpus after the than dictionary-unambiguous verbs
53Summary POS tagging
- A variety of POS tagging schemes exist, even for a single language
- Preparing a POS-tagged corpus requires, for efficiency, a combination of automatic tagging and human correction
- Automatic part-of-speech tagging can use:
- Hand-crafted rules based on inspecting a corpus
- Machine learning approaches based on corpus statistics
- e.g., an HMM's lexical generation probability table and POS transition probability table
- Machine learning approaches using rules derived automatically from a corpus
- Combinations of different methods often improve performance
54Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
55Adjective Ordering
- A political serious problem
- A social extravagant life
- red lovely hair
- old little lady
- green little men
- Adjectives have been grouped into various classes
to explain ordering phenomena
56Collins COBUILD L2 Grammar
- qualitative < color < classifying
- Qualitative: expresses a quality that someone or something has, e.g., sad, pretty, small, etc.
- Qualitative adjectives are gradable, i.e., the person or thing can have more or less of the quality
- Classifying: used to identify the class something belongs to, i.e., distinguishing
- financial help, American citizens
- Classifying adjectives aren't gradable.
- So, the ordering reduces to:
- gradable < color < non-gradable
- A serious political problem
- Lovely red hair
- Big rectangular green Chinese carpet
57Vendler 68
- A9 < A8 < ... < A2 < A1x < A1m < ... < A1a
- A9: probably, likely, certain
- A8: useful, profitable, necessary
- A7: possible, impossible
- A6: clever, stupid, reasonable, nice, kind, thoughtful, considerate
- A5: ready, willing, anxious
- A4: easy
- A3: slow, fast, good, bad, weak, careful, beautiful
- A2: contrastive/polar adjectives: long-short, thick-thin, big-little, wide-narrow
- A1j: verb-derivatives (washed)
- A1i: verb-derivatives (washing)
- A1h: luminous
- A1g: rectangular
- A1f: color adjectives
- A1a: iron, steel, metal
- big rectangular green Chinese carpet
58Other Adjective Ordering Theories
Goyvaerts 68: quality < size/length/shape < age < color < naturally < style < general < denominal
Quirk & Greenbaum 73: intensifying (perfect) < general-measurable (careful, wealthy) < age (young, old) < color < denominal material (woollen scarf) < denominal style (Parisian dress)
Dixon 82: value < dimension < physical property < speed < human propensity < age < color
Frawley 92: value < size < color (English, German, Hungarian, Polish, Turkish, Hindi, Persian, Indonesian, Basque)

- Collins COBUILD: gradable < color < non-gradable
- Goyvaerts, Q&G, Dixon: size < age < color
- Goyvaerts, Q&G: color < denominal
- Goyvaerts, Dixon: shape < color
59Testing the Theories on Large Corpora
- Selective coverage of a particular language or (small) set of languages
- Based on categories that aren't defined precisely enough to be computable
- Based on small rather than large numbers of examples
- Test: gradable < color < non-gradable
60Computable Tests for Gradable Adjectives
- Submodifiers expressing gradation:
- very|rather|somewhat|extremely A
- But what about very British?
- http://complingtwo.georgetown.edu/gwilson/Tools/Adj/GW_Grad.txt
- Periphrastic comparatives:
- "more A than", "the most A"
- Inflectional comparatives:
- -er / -est
- http://complingtwo.georgetown.edu/gwilson/Tools/Adj/BothLists.txt
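A hedged sketch of how such computable tests could be run over plain text; the patterns and sample sentence are illustrative, and (as noted above for very British) they will over-generate.

```python
import re
from collections import Counter

text = ("a very serious problem ; an extremely sad story ; "
        "more serious than ever ; the most political move ; very British")

# Submodifier test for gradability.
grad_submod = re.compile(r"\b(?:very|rather|somewhat|extremely)\s+(\w+)")
# Periphrastic comparative/superlative test.
grad_peri = re.compile(r"\bmore\s+(\w+)\s+than\b|\bthe\s+most\s+(\w+)")

candidates = Counter(grad_submod.findall(text))
for m in grad_peri.finditer(text):
    candidates[m.group(1) or m.group(2)] += 1

# Counts per candidate adjective; "British" shows why a human still has to judge the hits.
print(candidates)
```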
61Challenges Data Sparseness
- Data sparseness
- Only some pairs will be present in a given corpus
- Few adjectives on the gradable list may be present
- Even fewer longer sequences will be present in a corpus
- Use transitivity?
- small < red, red < wooden → small < red < wooden?
62Challenges Tool Incompleteness
- Search pattern will return many non-examples
- Collocations
- common or marked ones
- American green card
- national Blue Cross
- Adjective Modification
- bright blue
- POS-tagging errors
- May also miss many examples
63Results from Corpus Analysis
- G < C < non-G generally holds
- However, there are exceptions
- Classifying/Non-Gradable < Color:
- After all, the maple leaf replaced the British red ensign as Canada's flag almost 30 years ago.
- http://complingtwo.georgetown.edu/gwilson/Tools/Adj/Color2.html
- where he stood on a stage festooned with balloons displaying the Palestinian green, white and red flag
- http://complingtwo.georgetown.edu/gwilson/Tools/Adj/Color4.html
- Color < Shape:
- paintings in which pink, roundish shapes, enriched with flocking, gel, lentils and thread, suggest the insides of the female body.
- http://complingtwo.georgetown.edu/gwilson/Tools/Adj/Color4.html
64Summary Adjective Ordering
- It is possible to test concrete predictions of a
linguistic theory in a corpus-based setting - The testing means that the machine searches for
examples satisfying patterns that the human
specifies - The patterns can pre-suppose a certain/high
degree of automatic tagging, with attendant loss
of accuracy - The patterns should be chosen so that they
provide handles to identify the phenomena of
interest - The patterns should be restricted enough that the
number of examples the human has to judge is not
infeasible - This is usually an iterative process
65Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Named Entity Tagging
- Inter-Annotator Reliability
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
66The Art of Annotation 101
- Define Goal
- Eyeball Data (with the help of Computers)
- Design Annotation Scheme
- Develop Example-based Guidelines
- Unless satisfied/exhausted, goto 1
- Write Training Manuals
- Initiate Human Training Sessions
- Annotate Data / Train Computers
- Computers can also help with the annotation
- Evaluate Humans and Computers
- Unless satisfied/exhausted, goto 1
67Annotation Methodology Picture
[Workflow diagram; components: Raw Corpus, Initial Tagger, Annotation Editor, Annotation Guidelines, Annotated Corpus, Machine Learning Program, Learned Rules, Rule Apply (over a raw corpus), Annotated Corpus, Knowledge Base?]
68Goals of an Annotation Scheme
- Simplicity simple enough for a human to carry
out - Precision precise enough to be useful in CLI
applications - Text-based annotation of an item should be
based on information conveyed by the text, rather
than information conveyed by background
information - Human-centered should be based on what a human
can infer from the text, rather than what a
machine can currently do or not do - Reproducible your annotation should be
reproducible by other humans (i.e.,
inter-annotator agreement should be high) - obviously, these other humans may have to have
particular expertise and training
69What Should An Annotation Contain
- Additional Information about the text being
annotated e.g., EAGLES external and internal
criteria - Information about the annotator who, when, what
version of tool, etc. (usually in meta-tags
associated with the text) - The tagged text itself
- Example
- http//www.emille.lancs.ac.uk/spoken.htm
70External and Internal Criteria (EAGLES)
- External participants, occasion, social
setting, communicative function - origin Aspects of the origin of the text that
are thought to affect its structure or content. - state the appearance of the text, its layout and
relation to non-textual matter, at the point when
it is selected for the corpus. - aims the reason for making the text and the
intended effect it is expected to have. - Internal patterns of language use
- Topic (economics, sports, etc.)
- Style (formal/informal, etc.)
71External Criteria state (EAGLES)
- Mode
- spoken
- participant awareness surreptitious/warned/aware
- venue studio/on location/telephone
- written
- Relation to the medium
- written how it is laid out, the paper, print,
etc. - spoken the acoustic conditions, etc.
- Relation to non-linguistic communicative matter
- diagrams, illustrations, other media that are
coupled with the language in a communicative
event. - Appearance
- e.g., advertising leaflets, aspects of
presentation that are unique in design and are
important enough to have an effect on the
language.
72Examples of annotation schemes (changing the way
we do business!)
- POS tagging annotation Penn Treebank Scheme
- Named entity annotation ACE Scheme
- Phrase Structure annotation Penn Treebank
scheme - Time Expression annotation TIMEX2 Scheme
- Protein Name Annotation GU Scheme
- Event Annotation TimeML Scheme
- Rhetorical Structure Annotation - RST Scheme
- Coreference Annotation, Subjectivity Annotation,
Gesture Annotation, Intonation Annotation,
Metonymy Annotation, etc., etc. - Etc.
- Several hundred schemes exist, for different
problems in different languages
73POS Tag Formats: Non-SGML to SGML
- CLAWS tagger (non-SGML):
- What_DTQ can_VM0 CLAWS_NN2 do_VDI to_PRP Inderjeet_NP0 's_POS noonsense_NN1 text_NN1 ?_?
- Brill tagger (non-SGML):
- What/WP can/MD CLAWS/NNP do/VB to/TO Inderjeet/NNP 's/POS noonsense/NN text/NN ?/.
- Alembic POS tagger:
- <s><lex pos="WP">What</lex> <lex pos="MD">can</lex> <lex pos="NNP">CLAWS</lex> <lex pos="VB">do</lex> <lex pos="TO">to</lex> <lex pos="NNP">Inderjeet</lex> <lex pos="POS">'</lex><lex pos="PRP">s</lex> <lex pos="VBP">noonsense</lex> <lex pos="NN">text</lex> <lex pos=".">?</lex></s>
- Conversion to SGML is pretty trivial in such cases
74SGML (Standard Generalized Markup Language)
- A general markup language for text
- HTML is an instance of an SGML encoding
- The Text Encoding Initiative (TEI) defines SGML schemes for marking up humanities text resources as well as dictionaries
- Examples:
- <p><s>I'm really hungry right now.</s><s>Oh, yeah?</s>
- <utt speak="Fred" date="10-Feb-1998">That is an ugly couch.</utt>
- Note: some elements (e.g., <p>) can consist just of a single tag
- Character references: ways of referring to non-ASCII characters using a numeric code
- &#229; (decimal) or &#xE5; (hexadecimal) → å
- Entity references are used to encode a special character or sequence of characters via a symbolic name
- r&eacute;sum&eacute; → résumé
- &docdate;
75DTDs
- A document type definition, or DTD, is used to define a grammar of legal SGML structures for a document
- e.g., a para should consist of one or more sentences and nothing else
- An SGML parser verifies that a document is compliant with its DTD
- DTDs can therefore be used for XML as well
- DTDs can specify what attributes are required, in what order, what their legitimate values are, etc.
- The DTDs are often ignored in practice!
- DTD:
- <!ENTITY writer SYSTEM "http://www.mysite.com/all-entities.dtd">
- <!ATTLIST payment type (check|cash) "cash">
- XML:
- <author>&writer;</author>
- <payment type="check">
76XML
- Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML.
- Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. (www.w3.org/XML/)
- Defines a simplified subset of SGML, designed especially for Web applications
- Unlike HTML, separates out display (e.g., XSL) from content (XML)
- Example:
- <p/><s><lex pos="WP">What</lex> <lex pos="MD">can</lex></s>
- Makes use of DTDs, but also RDF Schemas
77RDF Schemas
- Example of Real RDF Schema
- http://www.cs.brandeis.edu/jamesp/arda/time/documentation/TimeML.xsd (see the EVENT tag and attributes)
78Inline versus Standoff Annotation
- Usually, when tags are added, an annotation tool is used, to avoid spurious insertions or deletions
- The annotation tool may use inline or standoff annotation
- Inline: tags are stored internally in (a copy of) the source text
- Tagged text can be substantially larger than the original text
- Web pages are a good example, i.e., HTML tags
- Standoff: tags are stored internally in separate files, with information as to what positions in the source text the tags occupy
- e.g., PERSON 335 337
- However, the annotation tool displays the text as if the tags were in-line
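A small sketch of the inline-to-standoff conversion idea, using a hypothetical PERSON tag and input string.

```python
import re

# Convert inline <PERSON>...</PERSON> tags into standoff (tag, start, end)
# records over the untagged source text.
inline = "President <PERSON>Bill Clinton</PERSON> met <PERSON>Arafat</PERSON>."

standoff = []
plain_parts = []
offset = 0
pos = 0
for m in re.finditer(r"<PERSON>(.*?)</PERSON>", inline):
    plain_parts.append(inline[pos:m.start()])
    offset += m.start() - pos
    standoff.append(("PERSON", offset, offset + len(m.group(1))))
    plain_parts.append(m.group(1))
    offset += len(m.group(1))
    pos = m.end()
plain_parts.append(inline[pos:])
plain_text = "".join(plain_parts)

print(plain_text)
for tag, start, end in standoff:
    print(tag, start, end, repr(plain_text[start:end]))
```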
79Summary Annotation Issues
- A best-practices methodology is widely used
for annotating corpora - The annotation process involves computational
tools at all stages - Standard guidelines are available for use
- To share annotated corpora (and to ensure their
survivability), it is crucial that the data be
represented in a standard rather than ad hoc
format - XML provides a well-established, Web-compliant
standard for markup languages - DTDs and RDF provide mechanisms for checking
well-formedness of annotation
80Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
81Background
- Deborah Schiffrin. Anaphoric then: aspectual, textual, and epistemic meaning. Linguistics 30 (1992), 753-792
- Schiffrin examines uses of then in data elicited via 20 sociolinguistic interviews, each an hour long
- Distinguishes two anaphoric temporal senses, showing that they are differentiated by clause position
- Shows that they have systematic effects on aspectual interpretation
- A parallel argument is made for two epistemic temporal senses
82Schiffrin: Temporal and Non-Temporal Senses
- Anaphoric senses
- Narrative temporal sense (shifts reference time):
- And then I uh lived there until I was sixteen
- Continuing temporal sense (continues a previous reference time):
- I was only a little boy then.
- Epistemic senses
- Conditional sentences (rare, but often have temporal antecedents in her data):
- But if I think about it for a few days -- well, then I seem to remember a great deal
- if I'm still in the situation where I am now. I'm not gonna have no more then
- Initiation-response-evaluation sequences (in that case?):
- Freda: Do y' still need the light?
- Debby: Um.
- Freda: W'll have t' go in then. Because the bugs are out.
83Schiffrin's Argument (Simplified) and Its Test
- Shifting-RT thens (call these Narrative) and then in if-then conditionals:
- similar semantic function
- mainly clause-initial
- Continuing-RT thens (call these Temporal) and IRE thens:
- similar semantic function
- mainly clause-final
- stative verbs more likely (since RT overlaps, verbs conveying duration are expected)
- Call the rest Other
- it isn't differentiated into if-then versus IRE
- So, only part of her claims are tested
84So, What do we do Then?
- Define environments of interest, each one defined by a pattern
- For each environment:
- 1. Find examples matching the pattern
- 2. If classifying the examples is manageable, carry it out and stop
- 3. Otherwise restrict the environment by adding new elements to the pattern, and go back to 1
- So, for each final environment, we claim that X% of the examples in that environment are of a particular class
- Initial then pattern: (_CC|_RB)\sthen\w\s\w
- Final then pattern: \,\sthen[\.\?\'\"\!] (a hedged regex sketch of this search follows below)
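The exact patterns above are only partially recoverable, so the following Python sketch is an approximation over word_TAG text; the assumption that then is tagged RB, and the example strings, are mine.

```python
import re

# Approximate environments for clause-initial vs. clause-final "then".
initial_then = re.compile(r"(?:_CC|_RB)?\s*\bthen_RB\s+\w+_\w+", re.I)
final_then = re.compile(r",\s*then_RB\s*[\.\?\!\'\"]", re.I)

examples = [
    "and_CC then_RB I_PRP lived_VBD there_RB",
    "I_PRP was_VBD only_RB a_DT little_JJ boy_NN ,_, then_RB ._.",
]
for ex in examples:
    kind = ("initial" if initial_then.search(ex) else
            "final" if final_then.search(ex) else "other")
    print(kind, "=>", ex)
```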
85Exceptions
- Non-Narrative Initial then
- then there be
- then come
- then again
- then and now
- only then
- even then
- so then
- Non-Temporal Final then
- What then?
- All right/OK, then
- And then?
86Results
Clause Initial:
- Written Fiction 2000: T 1.73% (23/1322), N 96.67% (1276/1322), O 1.58% (21/1322)
- Spoken Broadcast News: T 0.73% (6/818), N 93.88% (768/818), O 5.3% (44/818)
- Written Gigaword News: T 3.64% (27/740), N 75.94% (562/740), O 20.40% (151/740)
Clause Final:
- Written Fiction 2000: T 71.81% (79/110), N 2.72% (3/110), O 25.45% (28/110)
- Spoken Broadcast News: T 72.61% (61/84), N 5.95% (5/84), O 21.42% (18/84)
- Written Gigaword News: T 93.23% (179/192), N 0, O 6.77% (13/192)
(T = Temporal, N = Narrative, O = Other)

Other is a presence in final position in fiction and broadcast news, and in initial position in print news. Is this real or an artifact of the catch-all class? Conclusion: only part of her claims were tested. But those claims are borne out across three different genres and much more data!
87Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
88Considerations in Inter-Annotator Agreement
- Size of tagset
- Structure of tagset
- Clarity of Guidelines
- Number of raters
- Experience of raters
- Training of raters
- Independent ratings (preferred)
- Consensus (not preferred)
- Exact, partial, and equivalent matches
- Metrics
- Lessons learned: disagreement patterns suggest guideline revisions
89Protein Names
- Considerable variability in the forms of the
names - Multiple naming conventions
- Researchers may name a newly discovered protein
based on - function
- sequence features
- gene name
- cellular location
- molecular weight
- discoverer
- or other properties
- Prolific use of abbreviations and acronyms
- fushi tarazu 1 factor homolog
- Fushi tarazu factor (Drosophila) homolog 1
- FTZ-F1 homolog ELP
- steroid/thyroid/retinoic nuclear hormone
receptor homolog nhr-35 - V-INT 2 murine mammary tumor virus integration
site oncogene homolog - fibroblast growth factor 1 (acidic) isoform 1
precursor - nuclear hormone receptor subfamily 5, Group A,
member 1
90Guidelines v1 TOC
91Agreement Metrics

                  Candidate: Yes    Candidate: No
Reference: Yes    TP                FN
Reference: No     FP                TN

Measure                  Definition
Percentage Agreement     100 × (TP+TN) / (TP+FP+TN+FN)
Precision                TP / (TP+FP)
Recall                   TP / (TP+FN)
(Balanced) F-Measure     2 × Precision × Recall / (Precision + Recall)
92Example for F-measure: Scorer Output (Protein Name Tagging)

        REFERENCE             CANDIDATE
CORR    FTZ-F1 homolog ELP    FTZ-F1 homolog ELP
INCO    M2-LHX3               M2
SPUR                          -
SPUR                          LHX3

- Precision = 1/4 = 0.25
- Recall = 1/2 = 0.5
- F-measure = (2 × 1/4 × 1/2) / (1/4 + 1/2) = 0.33
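The same computation as a small Python sketch; the function name and the mapping of the scorer categories to tp/fp/fn counts are illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall, and balanced F-measure from match counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Counts from the scorer output above: 1 correct candidate name,
# 3 incorrect/spurious candidate names (fp), 1 reference name not matched (fn).
print(prf(tp=1, fp=3, fn=1))  # (0.25, 0.5, 0.333...)
```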
93The importance of disagreement
- Measuring inter-annotator agreement is very
useful in debugging the annotation scheme - Disagreement can lead to improvements in the
annotation scheme - Extreme disagreement can lead to abandonment of
the scheme
94V2 Assessment (ABS2)
- Old Guidelines:
- protein: 0.71 F
- acronym: 0.85 F
- array-protein: 0.15 F
- New Guidelines:
- protein: 0.86 F
- long-form: 0.71 F
- (these are only 4 of the tags)

<protein>
Coders    Correct   Precision   Recall   F-measure
A1-A3     4497      0.874       0.852    0.863
A1-A4     4769      0.884       0.904    0.894
A3-A4     4476      0.830       0.870    0.849
Average             0.862       0.875    0.868

<long-form>
Coders    Correct   Precision   Recall   F-measure
A1-A3     172       0.720       0.599    0.654
A1-A4     241       0.837       0.840    0.838
A3-A4     175       0.608       0.732    0.664
Average             0.721       0.723    0.718
95TIMEX2 Annotation Scheme
- Time Points: <TIMEX2 VAL="2000-W42">the third week of October</TIMEX2>
- Durations: <TIMEX2 VAL="PT30M">half an hour long</TIMEX2>
- Indexicality: <TIMEX2 VAL="2000-10-04">tomorrow</TIMEX2>
- Sets: <TIMEX2 VAL="XXXX-WXX-2" SET="YES" PERIODICITY="F1W" GRANULARITY="G1D">every Tuesday</TIMEX2>
- Fuzziness: <TIMEX2 VAL="1990-SU">Summer of 1990</TIMEX2>
- <TIMEX2 VAL="1999-07-15TMO">This morning</TIMEX2>
- Non-specificity: <TIMEX2 VAL="XXXX-04" NON_SPECIFIC="YES">April</TIMEX2> is usually wet.
- For guidelines, tools, and corpora, please see timex2.mitre.org
96TIMEX2 Inter-Annotator Agreement
(193 NYT news docs, 5 annotators, 10 pairs of annotators)
- Human annotation quality is acceptable on EXTENT and VAL
- Poor performance on Granularity and Non-Specific
- But there are only a small number of instances of these (200 out of 6000)
- Annotators deviate from guidelines, and produce systematic errors (fatigue?)
- several years ago: PXY instead of PAST_REF
- all day: P1D instead of YYYY-MM-DD
97TempEx in Qanda
98Summary Inter-Annotator Reliability
- There's no point going on with an annotation scheme if it can't be reproduced
- There are standard methods for measuring inter-annotator reliability
- An analysis of inter-annotator disagreements is critical for debugging an annotation scheme
99Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
100Information Extraction
- Types:
- Flag names of people, organizations, places, ...
- Flag and normalize values: phrases such as time expressions, measure phrases, currency expressions, etc.
- Group coreferring expressions together
- Find relations between named entities (works for, located at, etc.)
- Find events mentioned in the text
- Find relations between events and entities
- A hot commercial technology!
- Example patterns:
- Mr. ---,
- , Ill.
101Message Understanding Conferences (MUCs)
- Idea precise tasks to measure success, rather
than test suite of input and logical forms. - MUC-1 1987 and MUC-2 1989 - messages about navy
operations - MUC-3 1991 and MIC-4 1992 - news articles and
transcripts of radio broadcasts about terrorist
activity - MUC-5 1993 - news articles about joint ventures
and microelectronics - MUC-6 1995 - news articles about management
changes, additional tasks of named entity
recognition, coreference, and template element - MUC-7 1998 mostly multilingual information
extraction - Has also been applied to hundreds of other
domains - scientific articles, etc., etc.
102Historical Perspective
- Until MUC-3 (1993), many IE systems used a
Knowledge Engineering approach - They did something like full chart parsing with a
unification-based grammar with full logical
forms, a rich lexicon and KB - E.g., SRI's Tacitus
- Then, they discovered that things could work much
faster using finite-state methods and partial
parsing - And that using domain-specific rather than
general purpose lexicons simplified parsing (less
ambiguity due to fewer irrelevant senses) - And that these methods worked even better for the
IE tasks - E.g., SRI's Fastus, SRA's Nametag
- Meanwhile, people also started using statistical
learning methods from annotated corpora - Including CFG parsing
103An instantiated scenario template
Source
Wall Street Journal, 06/15/88 MAXICARE HEALTH
PLANS INC and UNIVERSAL HEALTH SERVICES INC have
dissolved a joint venture which provided health
services.
104Templates Can get Complex! (MUC-5)
1052002 Automatic Content Extraction (ACE) Program
Entity Types
- Person
- Organization
- (Place)
- Location: e.g., geographical areas, landmasses, bodies of water, geological formations
- Geo-Political Entity: e.g., nations, states, cities
- Created due to metonymies involving this class of places:
- The riots in Miami
- Miami imposed a curfew
- Miami railed against a curfew
- Facility: buildings, streets, airports, etc.
106ACE Entity Attributes and Relations
- Attributes
- Name: an entity mentioned by name
- Pronoun
- Nominal
- Relations
- AT: based-in, located, residence
- NEAR: relative-location
- PART: part-of, subsidiary, other
- ROLE: affiliate-partner, citizen-of, client, founder, general-staff, manager, member, owner, other
- SOCIAL: associate, grandparent, parent, sibling, spouse, other-relative, other-personal, other-professional
107Designing an Information Extraction Task
- Define the overall task
- Collect a corpus
- Design an Annotation Scheme
- linguistic theories help
- Use Annotation Tools
- authoring tools
- automatic extraction tools
- Apply the annotation to the corpus, assessing reliability
- Use the training portion of the corpus to train information extraction (IE) systems
- Use the test portion to test IE systems, using a scoring program
108Annotation Tools
- Specialized authoring tools used for marking up
text without damaging it - Some tools are tied to particular annotation
schemes
109Annotation Tool Example Alembic Workbench
110Callisto (Java successor to Alembic Workbench)
111Relationship Annotation Callisto
112Steps in Information Extraction
- Tokenization
- Language Identification
- D