Title: Computational Tools for Linguists
1Computational Tools for Linguists
- Inderjeet Mani
- Georgetown University
- im5_at_georgetown.edu
2Topics
- Computational tools for
- manual and automatic annotation of linguistic data
- exploration of linguistic hypotheses
- Case studies
- Demonstrations and training
- Inter-annotator reliability
- Effectiveness of annotation scheme
- Costs and tradeoffs in corpus preparation
3Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
4Corpus Linguistics
- Use of linguistic data from corpora to test linguistic hypotheses => emphasizes language use
- Uses computers to do the searching and counting from on-line material - faster than doing it by hand! Check?
- Most typical tool is a concordancer, but there are many others!
- Tools can analyze a certain amount; the rest is left to the human!
- Corpus Linguistics is also a particular approach to linguistics, namely an empiricist approach
- Sometimes (extreme view) opposed to the rationalist approach, at other times (more moderate view) viewed as complementary to it
- Cf. Theoretical vs. Applied Linguistics
5Empirical Approaches in Computational Linguistics
- Empiricism: the doctrine that knowledge is derived from experience
- Rationalism: the doctrine that knowledge is derived from reason
- Computational Linguistics is, by necessity, focused on performance, in that naturally occurring linguistic data has to be processed
- Naturally occurring data is messy! This means we have to process data characterized by false starts, hesitations, elliptical sentences, long and complex sentences, input that is in a complex format, etc.
- The methodology used is corpus-based
- linguistic analysis (phonological, morphological, syntactic, semantic, etc.) carried out on a fairly large scale
- rules are derived by humans or machines from looking at phenomena in situ (with statistics playing an important role)
6Example: metonymy
- Metonymy: substituting the name of one referent for another
- George W. Bush invaded Iraq
- A Mercedes rear-ended me
- Is metonymy involving institutions as agents more common in print news than in fiction?
- The X V-reporting
- Let's start with "The X said"
- This pattern will provide a handle to identify the data
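A minimal concordance-style search for this handle might look like the following Python sketch; the toy text and the capitalized-word approximation of X are illustrative assumptions, not part of the original demo.

```python
import re
from collections import Counter

# Toy stand-in for a print-news corpus; in practice this would be read from a corpus file.
text = ("The Pentagon said the report was flawed. "
        "The Court said it would review the case. The girl said nothing.")

# "The X said", where X is a single capitalized token: a crude handle for
# institution-as-agent metonymy (e.g. "The Pentagon said").
pattern = re.compile(r"\bThe\s+([A-Z][a-z]+)\s+said\b")

fillers = Counter()
for m in pattern.finditer(text):
    fillers[m.group(1)] += 1
    # Print a small concordance window around each match.
    start, end = max(0, m.start() - 20), min(len(text), m.end() + 20)
    print(text[start:end])

print(fillers.most_common())
```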
7Exploring Corpora
- Datasets
- http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/DataSets.cgi
- Metonymy Test using Corpora
- http://complingtwo.georgetown.edu/gwilson/Tools/Metonymy/TheXSaid_MST.html
8"The X said" from Concordance data

Corpus         Words   Freq   Freq per M words
Fiction 1870   1.7M    60     35
Fiction 2000   1.5M    219    146
Print News     1.9M    915    481

The preference for metonymy in print news arises because of the need to communicate information from companies and governments.
9Chomsky's Critique of Corpus-Based Methods
- 1. Corpora model performance, while linguistics is aimed at the explanation of competence
- If you define linguistics that way, linguistic theories will never be able to deal with actual, messy data
- Many linguists don't find the competence-performance distinction to be clear-cut. Sociolinguists have argued that the variability of linguistic performance is systematic, predictable, and meaningful to speakers of a language.
- Grammatical theories vary in where they draw the line between competence and performance, with some grammars (such as Halliday's Systemic Grammar) organized as systems of functionally-oriented choices.
10Chomsky's Critique (concluded)
- 2. Natural language is in principle infinite, whereas corpora are finite, so many examples will be missed
- Excellent point, which needs to be understood by anyone working with a corpus.
- But does that mean corpora are useless?
- Introspection is unreliable (prone to performance factors, cf. only short sentences), and pretty useless with child data.
- Also, insights from a corpus might lead to generalization/induction beyond the corpus, if the corpus is a good sample of the text population
- 3. Ungrammatical examples won't be available in a corpus
- Depends on the corpus, e.g., spontaneous speech, language learners, etc.
- The notion of grammaticality is not that clear
- Who did you see pictures / ?a picture / ??his picture / John's picture of?
- ARG/ADJUNCT example
11Which Words are the Most Frequent?
Common Words in Tom Sawyer (71,370 words), from Manning & Schütze, p. 21
Will these counts hold in a different corpus (and genre, cf. Tom)? What happens if you have 8-9M words? (check the usage demo!)
12Data Sparseness
- Many low-frequency words
- Fewer high-frequency words.
- Only a few words will have lots of examples.
- About 50% of word types occur only once
- Over 90% occur 10 times or less.
- So, there is merit to Chomsky's 2nd objection
Word Frequency Number of words of that frequency
1 3993
2 1292
3 664
4 410
5 243
6 199
7 172
8 131
9 82
10 91
11-50 540
51-100 99
- >100 102
Frequency of word types in Tom Sawyer, from Manning & Schütze, p. 22.
13Zipf's Law: Frequency is inversely proportional to rank

Word        f      r      f·r
turned      51     200    10200
you'll      30     300    9000
name        21     400    8400
comes       16     500    8000
group       13     600    7800
lead        11     700    7700
friends     10     800    8000
begin       9      900    8100
family      8      1000   8000
brushed     4      2000   8000
sins        2      3000   6000
could       2      4000   8000
applausive  1      8000   8000

Empirical evaluation of Zipf's Law on Tom Sawyer, from Manning & Schütze, p. 23.
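The f × r ≈ constant pattern can be checked on any corpus with a few lines of Python; the inline text below is a toy stand-in, and for a real test one would read a corpus the size of Tom Sawyer from disk.

```python
import re
from collections import Counter

# Tiny stand-in text so the sketch runs as-is; replace with a real corpus for a meaningful check.
text = "the boy and the dog ran and the boy laughed " * 50

words = re.findall(r"[a-z']+", text.lower())
ranked = sorted(Counter(words).values(), reverse=True)

# freq * rank should stay roughly constant if Zipf's law holds.
for rank in range(1, len(ranked) + 1):
    f = ranked[rank - 1]
    print(f"rank={rank}  freq={f}  freq*rank={f * rank}")
```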
14Illustration of Zipf's Law (Brown Corpus, from Manning & Schütze, p. 30)
[Plot of frequency against rank, logarithmic scale on both axes]
- See also http://www.georgetown.edu/faculty/wilsong/IR/WordDist.html
15Tokenizing words for corpus analysis
- 1. Break on:
- Spaces? Not for Japanese: inuo butta otokonokowa otooto da
- Periods? (U.K. Products)
- Hyphens? data-base / database / data base
- Apostrophes? won't, couldn't, O'Riley, car's
- 2. Should different word forms be counted as distinct?
- Lemma: a set of lexical forms having the same stem, the same POS, and the same word-sense. So, cat and cats are the same lemma.
- Sometimes, words are lemmatized by stemming, other times by morphological analysis, using a dictionary and/or morphological rules
- 3. Fold case or not (usually folded)?
- The / the / THE; Mark versus mark
- One may need, however, to regenerate the original case when presenting it to the user
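The following Python sketch illustrates the kind of crude tokenization and case folding discussed above; the regular expression is a simplification, since real corpus tools handle periods, hyphens, and clitics with language-specific rules.

```python
import re
from collections import Counter

def tokenize(text, fold_case=True):
    """Very rough tokenizer: optionally fold case, then pull out
    letter sequences, allowing a single internal apostrophe (won't, car's)."""
    if fold_case:
        text = text.lower()
    return re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?", text)

sample = "The cat won't chase the cats. THE END."
tokens = tokenize(sample)
types = Counter(tokens)
print(len(tokens), "tokens,", len(types), "types")
print(types)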
16Counting Word Tokens vs Word Types
- Word tokens in Tom Sawyer: 71,370
- Word types (i.e., how many different words): 8,018
- In newswire text of that many tokens, you would have 11,000 word types. Perhaps because Tom Sawyer is written in a simple style.
17Inspecting word frequencies in a corpus
- http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/DataSets.cgi
- Usage demo
- http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/Usage.cgi
18Ngrams
- Sequences of linguistic items of length n
- See count.pl
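count.pl is a Perl script; purely as an illustration, a Python analogue of basic n-gram counting could look like this.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the sequence of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat the cat slept".split()
bigram_counts = Counter(ngrams(tokens, 2))
for bg, c in bigram_counts.most_common(5):
    print(" ".join(bg), c)
```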
19A test for association strength: Mutual Information
Data from (Church et al. 1991), 1988 AP corpus, N = 44.3M
20Interpreting Mutual Information
- High scores, e.g., strong supporter (8.85), indicate words strongly associated in the corpus
- MI is a logarithmic score. To convert it, recall that x = 2^(log2 x)
- So 2^8.85 ≈ 461. This pair is about 461 times more frequent than chance.
- Low scores: powerful support (1.74) is only about 3 times chance, since 2^1.74 ≈ 3
- I = 1.74, f(x,y) = 2, f(x) = 1984, f(y) = 13,428, for x y = powerful support
- I = log2( 2N / (1984 × 13,428) ) ≈ 1.74
- So a low score doesn't necessarily mean weakly associated; it could be due to data sparseness
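A sketch of the MI computation, using the counts quoted above; the function name and rounding are mine, not from the original.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2( N * f(x,y) / (f(x) * f(y)) )."""
    return math.log2((n * f_xy) / (f_x * f_y))

N = 44_300_000  # size of the 1988 AP corpus used on the slide

# "powerful support": f(x,y) = 2, f(powerful) = 1984, f(support) = 13428
print(round(mutual_information(2, 1984, 13428, N), 2))  # ~1.73 (the slide rounds to 1.74)
# Converting a score back to a ratio over chance: "strong supporter" at 8.85
print(round(2 ** 8.85))  # ~461 times chance
```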
21Mutual Information over Grammatical Relations
- Parse a corpus
- Determine subject-verb-object triples
- Identify head nouns of subject and object NPs
- Score subj-verb and verb-obj associations using MI
22Demo of Verb-Subj, Verb-Obj Parses
- Who devours or what gets devoured?
- Demo: http://www.cs.ualberta.ca/lindek/demos/depindex.htm
23MI over verb-obj relations
- Data from (Church et al. 1991)
24A Subj-Verb MI Example Who does what in news?
- executive police politician
- reprimand 16.36 shoot 17.37 clamor
16.94 - conceal 17.46 raid 17.65 jockey 17.53
- bank 18.27 arrest 17.96 wrangle 17.59
- foresee 18.85 detain 18.04 woo 18.92
- conspire 18.91 disperse 18.14 exploit 19.57
- convene 19.69 interrogate 18.36 brand 19.65
- plead 19.83 swoop 18.44 behave 19.72
- sue 19.85 evict 18.46 dare 19.73
- answer 20.02 bundle 18.50 sway 19.77
- commit 20.04 manhandle 18.59 criticize 19.78
- worry 20.04 search 18.60 flank 19.87
- accompany 20.11 confiscate 18.63 proclaim 19.91
- own 20.22 apprehend 18.71 annul 19.91
- witness 20.28 round 18.78 favor 19.92
Data from (Schiffman et al. 2001)
25Famous Corpora
- Must see http://www.ldc.upenn.edu/Catalog/
- Brown Corpus
- British National Corpus
- International Corpus of English
- Penn Treebank
- Lancaster-Oslo-Bergen Corpus
- Canadian Hansard Corpus
- U.N. Parallel Corpus
- TREC Corpora
- MUC Corpora
- English, Arabic, Chinese Gigawords
- Chinese and Arabic Treebanks
- North American News Text Corpus
- Multext East Corpus: 1984 in multiple Eastern/Central European languages
26Links to Corpora
- Corpora
- Linguistic Data Consortium (LDC): http://www.ldc.upenn.edu/
- Oxford Text Archive: http://sable.ox.ac.uk/ota/
- Project Gutenberg: http://www.promo.net/pg/
- CORPORA list: http://www.hd.uib.no/corpora/archive.html
- Other
- Chris Manning's Corpora Page: http://www-nlp.stanford.edu/links/statnlp.html#Corpora
- Michael Barlow's Corpus Linguistics page: http://www.ruf.rice.edu/barlow/corpus.html
- Cathy Ball's Corpora tutorial: http://www.georgetown.edu/faculty/ballc/corpora/tutorial.html
27Summary Introduction
- Concordances and corpora are widely used and available, to help one develop empirically-based linguistic theories and computer implementations
- The linguistic items that can be counted are many, but words (defined appropriately) are basic items
- The frequency distribution of words in any natural language is Zipfian
- Data sparseness is a basic problem when using observations in a corpus sample of language
- Sequences of linguistic items (e.g., word sequences, n-grams) can also be counted, but longer items will be very rare
- Associations between items can be easily computed
- e.g., associations between verbs and parser-discovered subjects or objects
28Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
29Using POS in Concordances
Corpus         POS   Words    Freq   Freq/M Words
Fiction 2000   NN    1.5M     115    7.66   (pattern: \bdeal_NN)
Fiction 2000   VB    1.5M     14     9.33
Gigaword       NN    10.5M    2857   2.72
Gigaword       VB    10.5M    139    1.32

Relative to Gigaword, deal is more often a verb in Fiction 2000; relative to Fiction 2000, deal is more often a noun in English Gigaword; and deal is more prevalent in Fiction 2000 than in Gigaword.
30POS Tagging What is it?
- Given a sentence and a tagset of lexical categories, find the most likely tag for each word in the sentence
- Tagset: e.g., Penn Treebank (45 tags, derived from the 87-tag Brown corpus tagset)
- Note that many of the words may have unambiguous tags
- Example:
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
- People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
31More details of POS problem
- How ambiguous?
- Most words in English have only one Brown Corpus tag
- Unambiguous (1 tag): 35,340 word types
- Ambiguous (2-7 tags): 4,100 word types (11.5%)
- 7 tags: 1 word type (still)
- But many of the most common words are ambiguous
- Over 40% of Brown corpus tokens are ambiguous
- Obvious strategies may be suggested based on intuition:
- to/TO race/VB
- the/DT race/NN
- will/MD race/NN
- Sentences can also contain unknown words for which tags have to be guessed: Secretariat/NNP is/VBZ
32Different English Part-of-Speech Tagsets
- Brown corpus - 87 tags
- Allows compound tags
- "I'm" is tagged as PPSS+BEM
- PPSS for "non-3rd person nominative personal pronoun" and BEM for "am, 'm"
- Others have derived their work from the Brown Corpus
- LOB Corpus: 135 tags
- Lancaster UCREL Group: 165 tags
- London-Lund Corpus: 197 tags
- BNC: 61 tags (C5)
- PTB: 45 tags
- To see comparisons and mappings of tagsets, go to www.comp.leeds.ac.uk/amalgam/tagsets/tagmenu.html
33PTB Tagset (36 main tags + 9 punctuation tags)
34PTB Tagset Development
- Several changes were made to the Brown Corpus tagset
- Recoverability
- Lexical: same treatment of be, do, have, whereas BC gave each its own symbol
- Do/VB does/VBZ did/VBD doing/VBG done/VBN
- Syntactic: since parse trees were used as part of the Treebank, certain categories were conflated under the assumption that they would be recoverable from syntax
- subject vs. object pronouns (both PP)
- subordinating conjunctions vs. prepositions: on being informed vs. on the table (both IN)
- preposition to vs. infinitive marker (both TO)
- Syntactic Function
- BC: the/DT one/CD vs. PTB: the/DT one/NN
- BC: both/ABX vs. PTB: both/PDT the boys, the boys both/RB, both/NNS of the boys, both/CC boys and girls
35PTB Tagging Process
- Tagset developed
- Automatic tagging by rule-based and statistical POS taggers
- Human correction using an editor embedded in GNU Emacs
- It takes under a month for humans to learn this (at 15 hours a week), and annotation speeds after a month exceed 3,000 words/hour
- Inter-annotator disagreement (4 annotators, eight 2,000-word docs) was 7.2% for the tagging task and 4.1% for the correcting task
- Manual tagging took about 2X as long as correcting, with about 2X the inter-annotator disagreement rate and an error rate that was about 50% higher
- So, for certain problems, having a linguist correct automatically tagged output is far more efficient and leads to better reliability among linguists compared to having them annotate the text from scratch!
36Automatic POS tagging
- http://complingone.georgetown.edu/linguist/
37A Baseline Strategy
- Choose the most likely tag for each ambiguous word, independent of previous words
- i.e., assign each token to the POS category it occurred in most often in the training set
- E.g., race: which POS is more likely in a corpus?
- This strategy gives you 90% accuracy in controlled tests
- So, this unigram baseline must always be compared against
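A sketch of this unigram baseline with a hypothetical toy training set; the fallback tag for unknown words is an illustrative assumption.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Pick, for each word, the tag it was seen with most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sentence, model, default="NN"):
    # Unknown words fall back to a default tag (NN here, as a crude guess).
    return [(w, model.get(w, default)) for w in sentence]

train = [[("the", "DT"), ("race", "NN")], [("to", "TO"), ("race", "VB")],
         [("the", "DT"), ("race", "NN")]]
model = train_unigram_tagger(train)
print(tag(["the", "race"], model))  # race -> NN, its most frequent training tag
```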
38Beyond the Baseline
- Hand-coded rules
- Sub-symbolic machine learning
- Symbolic machine learning
39Machine Learning
- Machines can learn from examples
- Learning can be supervised or unsupervised
- Given training data, machines analyze the data,
and learn rules which generalize to new examples - Can be sub-symbolic (rule may be a mathematical
function) e.g. neural nets - Or it can be symbolic (rules are in a
representation that is similar to representation
used for hand-coded rules) - In general, machine learning approaches allow for
more tuning to the needs of a corpus, and can be
reused across corpora
40A Probabilistic Approach to POS tagging
- What you want to do is find the "best" sequence of POS tags C = C1..Cn for a sentence W = W1..Wn
- (Here C1 is pos_tag(W1).)
- In other words, find a sequence of POS tags C that maximizes P(C | W)
- Using Bayes' Rule, we can say:
- P(C | W) = P(W | C) P(C) / P(W)
- Since we are interested in finding the value of C which maximizes the RHS, the denominator can be discarded, since it will be the same for every C
- So, the problem is: find C which maximizes
- P(W | C) P(C)
- Example: He will race
- Possible sequences:
- He/PP will/MD race/NN
- He/PP will/NN race/NN
- He/PP will/MD race/VB
- He/PP will/NN race/VB
- W = W1 W2 W3 = He will race
- C = C1 C2 C3
- Choices:
- C = PP MD NN
- C = PP NN NN
- C = PP MD VB
- C = PP NN VB
41Independence Assumptions
- P(C1..Cn) ≈ ∏i=1..n P(Ci | Ci-1)
- This assumes that the event of a POS tag occurring is independent of the event of any other POS tag occurring, except for the immediately previous POS tag
- From a linguistic standpoint, this seems an unreasonable assumption, due to long-distance dependencies
- P(W1..Wn | C1..Cn) ≈ ∏i=1..n P(Wi | Ci)
- This assumes that the event of a word appearing in a category is independent of the event of any other word appearing in a category
- Ditto
- However, the proof of the pudding is in the eating!
- N-gram models work well for part-of-speech tagging
42A Statistical Method for POS Tagging
- Find the value of C1..Cn which maximizes:
  ∏i=1..n P(Wi | Ci) × P(Ci | Ci-1)
  (lexical generation probs × POS bigram probs)

Lexical generation probabilities P(word | tag):
         MD    NN    VB    PRP
he       0     0     0     .3
will     .8    .2    0     0
race     0     .4    .6    0

POS bigram probabilities P(Ci | Ci-1) (rows = Ci-1, columns = Ci):
         MD    NN    VB    PRP
MD             .4    .6
NN             .3    .7
PRP      .8    .2
<s>                        1
43Finding the best path through an HMM
[Trellis diagram for the Viterbi algorithm: start state A = <s>; B = he/PP (lex .3); C = will/MD (lex .8); D = will/NN (lex .2); E = race/NN (lex .4); F = race/VB (lex .6); arcs carry the POS bigram probabilities from the previous slide]
- Score(I) = max over J in pred(I) of Score(J) × transition(J→I) × lex(I)
- Score(B) = P(PP | <s>) × P(he | PP) = 1 × .3 = .3
- Score(C) = Score(B) × P(MD | PP) × P(will | MD) = .3 × .8 × .8 = .19
- Score(D) = Score(B) × P(NN | PP) × P(will | NN) = .3 × .2 × .2 = .012
- Score(E) = max[Score(C) × P(NN | MD), Score(D) × P(NN | NN)] × P(race | NN)
- Score(F) = max[Score(C) × P(VB | MD), Score(D) × P(VB | NN)] × P(race | VB)
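A compact Viterbi sketch over this toy example, using the probability tables from the previous slide; the code structure, and the use of PRP for the pronoun tag (written PP in the trellis), are illustrative choices.

```python
# Probability tables copied from the slides; everything else is a sketch.
lex = {  # P(word | tag)
    ("he", "PRP"): 0.3,
    ("will", "MD"): 0.8, ("will", "NN"): 0.2,
    ("race", "NN"): 0.4, ("race", "VB"): 0.6,
}
trans = {  # P(tag_i | tag_{i-1}); "<s>" is the start state
    ("<s>", "PRP"): 1.0,
    ("PRP", "MD"): 0.8, ("PRP", "NN"): 0.2,
    ("MD", "NN"): 0.4, ("MD", "VB"): 0.6,
    ("NN", "NN"): 0.3, ("NN", "VB"): 0.7,
}
TAGS = ["PRP", "MD", "NN", "VB"]

def viterbi(words):
    # scores[tag] = (best probability of a path ending in tag, tag sequence so far)
    scores = {"<s>": (1.0, [])}
    for w in words:
        new_scores = {}
        for t in TAGS:
            emit = lex.get((w, t), 0.0)
            if emit == 0.0:
                continue
            best = max(
                ((p * trans.get((prev, t), 0.0) * emit, path + [t])
                 for prev, (p, path) in scores.items()),
                key=lambda x: x[0],
            )
            if best[0] > 0:
                new_scores[t] = best
        scores = new_scores
    return max(scores.values(), key=lambda x: x[0])

prob, tags = viterbi(["he", "will", "race"])
print(tags, prob)  # ['PRP', 'MD', 'VB'], since .3*.8*.8*.6*.6 beats the alternatives
```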
44But Data Sparseness Bites Again!
- Lexical generation probabilities will lack observations for low-frequency and unknown words
- Most systems do one of the following:
- Smooth the counts
- E.g., add a small number to unseen data (to zero counts). For example, assume a bigram not seen in the data has a very small probability, e.g., .0001.
- Back off from bigrams to unigrams, etc.
- Use lots more data (you'll still lose, thanks to Zipf!)
- Group items into classes, thus increasing class frequency
- e.g., group words into ambiguity classes, based on their set of tags. For counting, all words in an ambiguity class are treated as variants of the same word
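As an illustration of the first option, here is a minimal add-one (Laplace) smoothing sketch over toy tag bigrams; the data and function name are hypothetical.

```python
from collections import Counter

# Toy tag sequences; in practice the counts would come from a tagged training corpus.
tag_sequences = [["DT", "NN", "VBD"], ["DT", "JJ", "NN"], ["PRP", "MD", "VB"]]

bigram_counts = Counter()
unigram_counts = Counter()
for seq in tag_sequences:
    for prev, cur in zip(seq, seq[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

tagset = sorted({t for seq in tag_sequences for t in seq})

def p_smoothed(cur, prev):
    """Add-one smoothed P(cur | prev): unseen bigrams get a small, non-zero mass."""
    return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + len(tagset))

print(p_smoothed("NN", "DT"))  # seen bigram
print(p_smoothed("VB", "DT"))  # unseen bigram, but non-zero after smoothing
```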
45A Symbolic Learning Method
- HMMs are sub-symbolic: they don't give you rules that you can inspect
- A method called Transformational Rule Sequence learning (the Brill algorithm) can be used for symbolic learning (among other approaches)
- The rules (actually, a sequence of rules) are learnt from an annotated corpus
- Performs at least as accurately as other statistical approaches
- Has better treatment of context compared to HMMs
- rules which use the next (or previous) POS
- HMMs just use P(Ci | Ci-1) or P(Ci | Ci-2, Ci-1)
- rules which use the previous (next) word
- HMMs just use P(Wi | Ci)
46Brill Algorithm (Overview)
- Assume you are given a training corpus G (for gold standard)
- First, create a tag-free version V of it
- Notes:
- As the algorithm proceeds, each successive rule becomes narrower (covering fewer examples, i.e., changing fewer tags), but also potentially more accurate
- Some later rules may change tags changed by earlier rules
- 1. First label every word token in V with the most likely tag for that word type from G. If this initial state annotator is perfect, you're done!
- 2. Then consider every possible transformational rule, selecting the one that leads to the most improvement in V, using G to measure the error
- 3. Retag V based on this rule
- 4. Go back to 2, until there is no significant improvement in accuracy over the previous iteration
47Brill Algorithm (Detailed)
- 1. Label every word token with its most likely tag (based on lexical generation probabilities).
- 2. List the positions of tagging errors and their counts, by comparing with ground truth (GT)
- 3. For each error position, consider each instantiation I of X, Y, and Z in the rule template. If Y = GT, increment improvements[I], else increment errors[I].
- 4. Pick the I which results in the greatest error reduction, and add it to the output
- e.g., VB NN PREV1OR2TAG DT improves 98 errors, but produces 18 new errors, so a net decrease of 80 errors
- 5. Apply that I to the corpus
- 6. Go to 2, unless the stopping criterion is reached
- Most likely tags:
- P(NN | race) = .98
- P(VB | race) = .02
- Is/VBZ expected/VBN to/TO race/NN tomorrow/NN
- Rule template: Change a word from tag X to tag Y when the previous tag is Z
- Rule instantiation for the above example: NN VB PREV1OR2TAG TO
- Applying this rule yields:
- Is/VBZ expected/VBN to/TO race/VB tomorrow/NN
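A sketch of how a single learnt transformation is applied; for brevity it implements the simpler "previous tag is Z" form of the rule template stated above (rather than the PREV1OR2TAG condition), and the function name is mine.

```python
def apply_prevtag_rule(tagged, from_tag, to_tag, trigger_tag):
    """Change from_tag to to_tag when the immediately previous tag is trigger_tag.

    Changes are applied left to right, so later positions see updated tags.
    """
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == trigger_tag:
            out[i] = (word, to_tag)
    return out

sentence = [("Is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]

# Learnt rule from the slide: change NN to VB when the previous tag is TO.
print(apply_prevtag_rule(sentence, "NN", "VB", "TO"))
# [('Is','VBZ'), ('expected','VBN'), ('to','TO'), ('race','VB'), ('tomorrow','NN')]
```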
48Example of Error Reduction
From Eric Brill (1995) Computational
Linguistics, 21, 4, p. 7
49Example of Learnt Rule Sequence
- 1. NN VB PREVTAG TO
- to/TO race/NN → VB
- 2. VBP VB PREV1OR2OR3TAG MD
- might/MD vanish/VBP → VB
- 3. NN VB PREV1OR2TAG MD
- might/MD not/MD reply/NN → VB
- 4. VB NN PREV1OR2TAG DT
- the/DT great/JJ feast/VB → NN
- 5. VBD VBN PREV1OR2OR3TAG VBZ
- He/PP was/VBZ killed/VBD → VBN by/IN Chapman/NNP
50Handling Unknown Words
- Can also use the Brill method
- Guess NNP if capitalized, NN otherwise.
- Or use the tag most common for words ending in
the last 3 letters. - etc.
Example Learnt Rule Sequence for Unknown Words
51POS Tagging using Unsupervised Methods
- Reason: annotated data isn't always available!
- Example: the can
- Let's take unambiguous words from the dictionary, and count their occurrences after the
- the .. elephant
- the .. guardian
- Conclusion: immediately after the, nouns are more common than verbs or modals
- Initial state annotator: for each word, list all tags in the dictionary
- Transformation template:
- Change the tag χ of a word to tag Y if the previous (next) tag (word) is Z, where χ is a set of 2 or more tags
- Don't change any other tags
52Error Reduction in Unsupervised Method
- Let a rule to change χ to Y in context C be represented as Rule(χ, Y, C).
- Rule1: {VB, MD, NN} → NN PREVWORD the
- Rule2: {VB, MD, NN} → VB PREVWORD the
- Idea:
- Since annotated data isn't available, score rules so as to prefer those where Y appears much more frequently in the context C than all others in χ
- Frequency is measured by counting unambiguously tagged words
- So, prefer {VB, MD, NN} → NN PREVWORD the
- to {VB, MD, NN} → VB PREVWORD the
- since dictionary-unambiguous nouns are more common in a corpus after the than dictionary-unambiguous verbs
53Summary POS tagging
- A variety of POS tagging schemes exist, even for a single language
- Preparing a POS-tagged corpus requires, for efficiency, a combination of automatic tagging and human correction
- Automatic part-of-speech tagging can use:
- Hand-crafted rules based on inspecting a corpus
- Machine learning approaches based on corpus statistics
- e.g., an HMM's lexical generation probability table and POS transition probability table
- Machine learning approaches using rules derived automatically from a corpus
- Combinations of different methods often improve performance
54Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
55Adjective Ordering
- A political serious problem
- A social extravagant life
- red lovely hair
- old little lady
- green little men
- Adjectives have been grouped into various classes
to explain ordering phenomena
56Collins COBUILD L2 Grammar
- qualitative < color < classifying
- Qualitative: expresses a quality that someone or something has, e.g., sad, pretty, small, etc.
- Qualitative adjectives are gradable, i.e., the person or thing can have more or less of the quality
- Classifying: used to identify the class something belongs to, i.e., distinguishing
- financial help, American citizens
- Classifying adjectives aren't gradable.
- So, the ordering reduces to:
- gradable < color < non-gradable
- A serious political problem
- Lovely red hair
- Big rectangular green Chinese carpet
57Vendler 68
- A9 < A8 < ... < A2 < A1x < A1m < ... < A1a
- A9: probably, likely, certain
- A8: useful, profitable, necessary
- A7: possible, impossible
- A6: clever, stupid, reasonable, nice, kind, thoughtful, considerate
- A5: ready, willing, anxious
- A4: easy
- A3: slow, fast, good, bad, weak, careful, beautiful
- A2: contrastive/polar adjectives: long-short, thick-thin, big-little, wide-narrow
- A1j: verb-derivatives (washed)
- A1i: verb-derivatives (washing)
- A1h: luminous
- A1g: rectangular
- A1f: color adjectives
- A1a: iron, steel, metal
- big rectangular green Chinese carpet
58Other Adjective Ordering Theories
Goyvaerts 68: quality < size/length/shape < age < color < naturally < style < general < denominal
Quirk & Greenbaum 73: intensifying (perfect) < general-measurable (careful, wealthy) < age (young, old) < color < denominal material (woollen scarf) < denominal style (Parisian dress)
Dixon 82: value < dimension < physical property < speed < human propensity < age < color
Frawley 92: value < size < color (English, German, Hungarian, Polish, Turkish, Hindi, Persian, Indonesian, Basque)

- Collins COBUILD: gradable < color < non-gradable
- Goyvaerts, Q&G, Dixon: size < age < color
- Goyvaerts, Q&G: color < denominal
- Goyvaerts, Dixon: shape < color
59Testing the Theories on Large Corpora
- Selective coverage of a particular language or (small) set of languages
- Based on categories that aren't defined precisely enough to be computable
- Based on small rather than large numbers of examples
- Test: gradable < color < non-gradable
60Computable Tests for Gradable Adjectives
- Submodifiers expressing gradation:
- very|rather|somewhat|extremely A
- But what about very British?
- http://complingtwo.georgetown.edu/gwilson/Tools/Adj/GW_Grad.txt
- Periphrastic comparatives:
- "more A than", "the most A"
- Inflectional comparatives:
- -er / -est
- http://complingtwo.georgetown.edu/gwilson/Tools/Adj/BothLists.txt
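A hedged sketch of how such computable tests could be run over plain text; the patterns and sample sentence are illustrative, and (as noted above for very British) they will over-generate.

```python
import re
from collections import Counter

text = ("a very serious problem ; an extremely sad story ; "
        "more serious than ever ; the most political move ; very British")

# Submodifier test for gradability.
grad_submod = re.compile(r"\b(?:very|rather|somewhat|extremely)\s+(\w+)")
# Periphrastic comparative/superlative test.
grad_peri = re.compile(r"\bmore\s+(\w+)\s+than\b|\bthe\s+most\s+(\w+)")

candidates = Counter(grad_submod.findall(text))
for m in grad_peri.finditer(text):
    candidates[m.group(1) or m.group(2)] += 1

# Counts per candidate adjective; "British" shows why a human still has to judge the hits.
print(candidates)
```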
61Challenges Data Sparseness
- Data sparseness
- Only some pairs will be present in a given corpus
- Few adjectives on the gradable list may be present
- Even fewer longer sequences will be present in a corpus
- Use transitivity?
- small < red, red < wooden → small < red < wooden?
62Challenges Tool Incompleteness
- Search pattern will return many non-examples
- Collocations
- common or marked ones
- American green card
- national Blue Cross
- Adjective Modification
- bright blue
- POS-tagging errors
- May also miss many examples
63Results from Corpus Analysis
- G < C < non-G generally holds
- However, there are exceptions
- Classifying/Non-Gradable < Color:
- After all, the maple leaf replaced the British red ensign as Canada's flag almost 30 years ago.
- http://complingtwo.georgetown.edu/gwilson/Tools/Adj/Color2.html
- where he stood on a stage festooned with balloons displaying the Palestinian green, white and red flag
- http://complingtwo.georgetown.edu/gwilson/Tools/Adj/Color4.html
- Color < Shape:
- paintings in which pink, roundish shapes, enriched with flocking, gel, lentils and thread, suggest the insides of the female body.
- http://complingtwo.georgetown.edu/gwilson/Tools/Adj/Color4.html
64Summary Adjective Ordering
- It is possible to test concrete predictions of a
linguistic theory in a corpus-based setting - The testing means that the machine searches for
examples satisfying patterns that the human
specifies - The patterns can pre-suppose a certain/high
degree of automatic tagging, with attendant loss
of accuracy - The patterns should be chosen so that they
provide handles to identify the phenomena of
interest - The patterns should be restricted enough that the
number of examples the human has to judge is not
infeasible - This is usually an iterative process
65Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Named Entity Tagging
- Inter-Annotator Reliability
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
66The Art of Annotation 101
- Define Goal
- Eyeball Data (with the help of Computers)
- Design Annotation Scheme
- Develop Example-based Guidelines
- Unless satisfied/exhausted, goto 1
- Write Training Manuals
- Initiate Human Training Sessions
- Annotate Data / Train Computers
- Computers can also help with the annotation
- Evaluate Humans and Computers
- Unless satisfied/exhausted, goto 1
67Annotation Methodology Picture
[Workflow diagram; components: Raw Corpus, Initial Tagger, Annotation Editor, Annotation Guidelines, Annotated Corpus, Machine Learning Program, Learned Rules, Rule Apply (over a raw corpus), Annotated Corpus, Knowledge Base?]
68Goals of an Annotation Scheme
- Simplicity simple enough for a human to carry
out - Precision precise enough to be useful in CLI
applications - Text-based annotation of an item should be
based on information conveyed by the text, rather
than information conveyed by background
information - Human-centered should be based on what a human
can infer from the text, rather than what a
machine can currently do or not do - Reproducible your annotation should be
reproducible by other humans (i.e.,
inter-annotator agreement should be high) - obviously, these other humans may have to have
particular expertise and training
69What Should An Annotation Contain
- Additional Information about the text being
annotated e.g., EAGLES external and internal
criteria - Information about the annotator who, when, what
version of tool, etc. (usually in meta-tags
associated with the text) - The tagged text itself
- Example
- http//www.emille.lancs.ac.uk/spoken.htm
70External and Internal Criteria (EAGLES)
- External participants, occasion, social
setting, communicative function - origin Aspects of the origin of the text that
are thought to affect its structure or content. - state the appearance of the text, its layout and
relation to non-textual matter, at the point when
it is selected for the corpus. - aims the reason for making the text and the
intended effect it is expected to have. - Internal patterns of language use
- Topic (economics, sports, etc.)
- Style (formal/informal, etc.)
71External Criteria state (EAGLES)
- Mode
- spoken
- participant awareness surreptitious/warned/aware
- venue studio/on location/telephone
- written
- Relation to the medium
- written how it is laid out, the paper, print,
etc. - spoken the acoustic conditions, etc.
- Relation to non-linguistic communicative matter
- diagrams, illustrations, other media that are
coupled with the language in a communicative
event. - Appearance
- e.g., advertising leaflets, aspects of
presentation that are unique in design and are
important enough to have an effect on the
language.
72Examples of annotation schemes (changing the way
we do business!)
- POS tagging annotation Penn Treebank Scheme
- Named entity annotation ACE Scheme
- Phrase Structure annotation Penn Treebank
scheme - Time Expression annotation TIMEX2 Scheme
- Protein Name Annotation GU Scheme
- Event Annotation TimeML Scheme
- Rhetorical Structure Annotation - RST Scheme
- Coreference Annotation, Subjectivity Annotation,
Gesture Annotation, Intonation Annotation,
Metonymy Annotation, etc., etc. - Etc.
- Several hundred schemes exist, for different
problems in different languages
73POS Tag Formats: Non-SGML to SGML
- CLAWS tagger (non-SGML):
- What_DTQ can_VM0 CLAWS_NN2 do_VDI to_PRP Inderjeet_NP0 's_POS noonsense_NN1 text_NN1 ?_?
- Brill tagger (non-SGML):
- What/WP can/MD CLAWS/NNP do/VB to/TO Inderjeet/NNP 's/POS noonsense/NN text/NN ?/.
- Alembic POS tagger:
- <s><lex pos="WP">What</lex> <lex pos="MD">can</lex> <lex pos="NNP">CLAWS</lex> <lex pos="VB">do</lex> <lex pos="TO">to</lex> <lex pos="NNP">Inderjeet</lex> <lex pos="POS">'</lex><lex pos="PRP">s</lex> <lex pos="VBP">noonsense</lex> <lex pos="NN">text</lex> <lex pos=".">?</lex></s>
- Conversion to SGML is pretty trivial in such cases
74SGML (Standard Generalized Markup Language)
- A general markup language for text
- HTML is an instance of an SGML encoding
- The Text Encoding Initiative (TEI) defines SGML schemes for marking up humanities text resources as well as dictionaries
- Examples:
- <p><s>I'm really hungry right now.</s><s>Oh, yeah?</s>
- <utt speak="Fred" date="10-Feb-1998">That is an ugly couch.</utt>
- Note: some elements (e.g., <p>) can consist just of a single tag
- Character references: ways of referring to non-ASCII characters using a numeric code
- &#229; (decimal) or &#xE5; (hexadecimal) → å
- Entity references are used to encode a special character or sequence of characters via a symbolic name
- r&eacute;sum&eacute; → résumé
- &docdate;
75DTDs
- A document type definition, or DTD, is used to define a grammar of legal SGML structures for a document
- e.g., a para should consist of one or more sentences and nothing else
- An SGML parser verifies that a document is compliant with its DTD
- DTDs can therefore be used for XML as well
- DTDs can specify what attributes are required, in what order, what their legitimate values are, etc.
- The DTDs are often ignored in practice!
- DTD:
- <!ENTITY writer SYSTEM "http://www.mysite.com/all-entities.dtd">
- <!ATTLIST payment type (check|cash) "cash">
- XML:
- <author>&writer;</author>
- <payment type="check">
76XML
- Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML.
- Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. (www.w3.org/XML/)
- Defines a simplified subset of SGML, designed especially for Web applications
- Unlike HTML, separates out display (e.g., XSL) from content (XML)
- Example:
- <p/><s><lex pos="WP">What</lex> <lex pos="MD">can</lex></s>
- Makes use of DTDs, but also RDF Schemas
77RDF Schemas
- Example of Real RDF Schema
- http://www.cs.brandeis.edu/jamesp/arda/time/documentation/TimeML.xsd (see the EVENT tag and attributes)
78Inline versus Standoff Annotation
- Usually, when tags are added, an annotation tool is used, to avoid spurious insertions or deletions
- The annotation tool may use inline or standoff annotation
- Inline: tags are stored internally in (a copy of) the source text
- Tagged text can be substantially larger than the original text
- Web pages are a good example, i.e., HTML tags
- Standoff: tags are stored internally in separate files, with information as to what positions in the source text the tags occupy
- e.g., PERSON 335 337
- However, the annotation tool displays the text as if the tags were in-line
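A small sketch of the inline-to-standoff conversion idea, using a hypothetical PERSON tag and input string.

```python
import re

# Convert inline <PERSON>...</PERSON> tags into standoff (tag, start, end)
# records over the untagged source text.
inline = "President <PERSON>Bill Clinton</PERSON> met <PERSON>Arafat</PERSON>."

standoff = []
plain_parts = []
offset = 0
pos = 0
for m in re.finditer(r"<PERSON>(.*?)</PERSON>", inline):
    plain_parts.append(inline[pos:m.start()])
    offset += m.start() - pos
    standoff.append(("PERSON", offset, offset + len(m.group(1))))
    plain_parts.append(m.group(1))
    offset += len(m.group(1))
    pos = m.end()
plain_parts.append(inline[pos:])
plain_text = "".join(plain_parts)

print(plain_text)
for tag, start, end in standoff:
    print(tag, start, end, repr(plain_text[start:end]))
```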
79Summary Annotation Issues
- A best-practices methodology is widely used
for annotating corpora - The annotation process involves computational
tools at all stages - Standard guidelines are available for use
- To share annotated corpora (and to ensure their
survivability), it is crucial that the data be
represented in a standard rather than ad hoc
format - XML provides a well-established, Web-compliant
standard for markup languages - DTDs and RDF provide mechanisms for checking
well-formedness of annotation
80Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
81Background
- Deborah Schiffrin. Anaphoric then: aspectual, textual, and epistemic meaning. Linguistics 30 (1992), 753-792
- Schiffrin examines uses of then in data elicited via 20 sociolinguistic interviews, each an hour long
- Distinguishes two anaphoric temporal senses, showing that they are differentiated by clause position
- Shows that they have systematic effects on aspectual interpretation
- A parallel argument is made for two epistemic temporal senses
82Schiffrin: Temporal and Non-Temporal Senses
- Anaphoric senses
- Narrative temporal sense (shifts reference time):
- And then I uh lived there until I was sixteen
- Continuing temporal sense (continues a previous reference time):
- I was only a little boy then.
- Epistemic senses
- Conditional sentences (rare, but often have temporal antecedents in her data):
- But if I think about it for a few days -- well, then I seem to remember a great deal
- if I'm still in the situation where I am now. I'm not gonna have no more then
- Initiation-response-evaluation sequences (in that case?):
- Freda: Do y' still need the light?
- Debby: Um.
- Freda: W'll have t' go in then. Because the bugs are out.
83Schiffrin's Argument (Simplified) and Its Test
- Shifting-RT thens (call these Narrative) and then in if-then conditionals:
- similar semantic function
- mainly clause-initial
- Continuing-RT thens (call these Temporal) and IRE thens:
- similar semantic function
- mainly clause-final
- stative verbs more likely (since RT overlaps, verbs conveying duration are expected)
- Call the rest Other
- it isn't differentiated into if-then versus IRE
- So, only part of her claims are tested
84So, What do we do Then?
- Define environments of interest, each one defined by a pattern
- For each environment:
- 1. Find examples matching the pattern
- 2. If classifying the examples is manageable, carry it out and stop
- 3. Otherwise restrict the environment by adding new elements to the pattern, and go back to 1
- So, for each final environment, we claim that X% of the examples in that environment are of a particular class
- Initial then pattern: (_CC|_RB)\sthen\w\s\w
- Final then pattern: \,\sthen[\.\?\'\"\!] (a hedged regex sketch of this search follows below)
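The exact patterns above are only partially recoverable, so the following Python sketch is an approximation over word_TAG text; the assumption that then is tagged RB, and the example strings, are mine.

```python
import re

# Approximate environments for clause-initial vs. clause-final "then".
initial_then = re.compile(r"(?:_CC|_RB)?\s*\bthen_RB\s+\w+_\w+", re.I)
final_then = re.compile(r",\s*then_RB\s*[\.\?\!\'\"]", re.I)

examples = [
    "and_CC then_RB I_PRP lived_VBD there_RB",
    "I_PRP was_VBD only_RB a_DT little_JJ boy_NN ,_, then_RB ._.",
]
for ex in examples:
    kind = ("initial" if initial_then.search(ex) else
            "final" if final_then.search(ex) else "other")
    print(kind, "=>", ex)
```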
85Exceptions
- Non-Narrative Initial then
- then there be
- then come
- then again
- then and now
- only then
- even then
- so then
- Non-Temporal Final then
- What then?
- All right/OK, then
- And then?
86Results
Clause Initial:
- Written Fiction 2000: T 1.73% (23/1322), N 96.67% (1276/1322), O 1.58% (21/1322)
- Spoken Broadcast News: T 0.73% (6/818), N 93.88% (768/818), O 5.3% (44/818)
- Written Gigaword News: T 3.64% (27/740), N 75.94% (562/740), O 20.40% (151/740)
Clause Final:
- Written Fiction 2000: T 71.81% (79/110), N 2.72% (3/110), O 25.45% (28/110)
- Spoken Broadcast News: T 72.61% (61/84), N 5.95% (5/84), O 21.42% (18/84)
- Written Gigaword News: T 93.23% (179/192), N 0, O 6.77% (13/192)
(T = Temporal, N = Narrative, O = Other)

Other is a presence in final position in fiction and broadcast news, and in initial position in print news. Is this real or an artifact of the catch-all class? Conclusion: only part of her claims were tested. But those claims are borne out across three different genres and much more data!
87Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
88Considerations in Inter-Annotator Agreement
- Size of tagset
- Structure of tagset
- Clarity of Guidelines
- Number of raters
- Experience of raters
- Training of raters
- Independent ratings (preferred)
- Consensus (not preferred)
- Exact, partial, and equivalent matches
- Metrics
- Lessons learned: disagreement patterns suggest guideline revisions
89Protein Names
- Considerable variability in the forms of the
names - Multiple naming conventions
- Researchers may name a newly discovered protein
based on - function
- sequence features
- gene name
- cellular location
- molecular weight
- discoverer
- or other properties
- Prolific use of abbreviations and acronyms
- fushi tarazu 1 factor homolog
- Fushi tarazu factor (Drosophila) homolog 1
- FTZ-F1 homolog ELP
- steroid/thyroid/retinoic nuclear hormone
receptor homolog nhr-35 - V-INT 2 murine mammary tumor virus integration
site oncogene homolog - fibroblast growth factor 1 (acidic) isoform 1
precursor - nuclear hormone receptor subfamily 5, Group A,
member 1
90Guidelines v1 TOC
91Agreement Metrics

                  Candidate: Yes    Candidate: No
Reference: Yes    TP                FN
Reference: No     FP                TN

Measure                  Definition
Percentage Agreement     100 × (TP+TN) / (TP+FP+TN+FN)
Precision                TP / (TP+FP)
Recall                   TP / (TP+FN)
(Balanced) F-Measure     2 × Precision × Recall / (Precision + Recall)
92Example for F-measure: Scorer Output (Protein Name Tagging)

        REFERENCE             CANDIDATE
CORR    FTZ-F1 homolog ELP    FTZ-F1 homolog ELP
INCO    M2-LHX3               M2
SPUR                          -
SPUR                          LHX3

- Precision = 1/4 = 0.25
- Recall = 1/2 = 0.5
- F-measure = (2 × 1/4 × 1/2) / (1/4 + 1/2) = 0.33
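The same computation as a small Python sketch; the function name and the mapping of the scorer categories to tp/fp/fn counts are illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall, and balanced F-measure from match counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Counts from the scorer output above: 1 correct candidate name,
# 3 incorrect/spurious candidate names (fp), 1 reference name not matched (fn).
print(prf(tp=1, fp=3, fn=1))  # (0.25, 0.5, 0.333...)
```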
93The importance of disagreement
- Measuring inter-annotator agreement is very
useful in debugging the annotation scheme - Disagreement can lead to improvements in the
annotation scheme - Extreme disagreement can lead to abandonment of
the scheme
94V2 Assessment (ABS2)
- Old Guidelines:
- protein: 0.71 F
- acronym: 0.85 F
- array-protein: 0.15 F
- New Guidelines:
- protein: 0.86 F
- long-form: 0.71 F
- (these are only 4 of the tags)

<protein>
Coders    Correct   Precision   Recall   F-measure
A1-A3     4497      0.874       0.852    0.863
A1-A4     4769      0.884       0.904    0.894
A3-A4     4476      0.830       0.870    0.849
Average             0.862       0.875    0.868

<long-form>
Coders    Correct   Precision   Recall   F-measure
A1-A3     172       0.720       0.599    0.654
A1-A4     241       0.837       0.840    0.838
A3-A4     175       0.608       0.732    0.664
Average             0.721       0.723    0.718
95TIMEX2 Annotation Scheme
- Time Points: <TIMEX2 VAL="2000-W42">the third week of October</TIMEX2>
- Durations: <TIMEX2 VAL="PT30M">half an hour long</TIMEX2>
- Indexicality: <TIMEX2 VAL="2000-10-04">tomorrow</TIMEX2>
- Sets: <TIMEX2 VAL="XXXX-WXX-2" SET="YES" PERIODICITY="F1W" GRANULARITY="G1D">every Tuesday</TIMEX2>
- Fuzziness: <TIMEX2 VAL="1990-SU">Summer of 1990</TIMEX2>
- <TIMEX2 VAL="1999-07-15TMO">This morning</TIMEX2>
- Non-specificity: <TIMEX2 VAL="XXXX-04" NON_SPECIFIC="YES">April</TIMEX2> is usually wet.
- For guidelines, tools, and corpora, please see timex2.mitre.org
96TIMEX2 Inter-Annotator Agreement
(193 NYT news docs, 5 annotators, 10 pairs of annotators)
- Human annotation quality is acceptable on EXTENT and VAL
- Poor performance on Granularity and Non-Specific
- But there are only a small number of instances of these (200 out of 6000)
- Annotators deviate from guidelines, and produce systematic errors (fatigue?)
- several years ago: PXY instead of PAST_REF
- all day: P1D instead of YYYY-MM-DD
97TempEx in Qanda
98Summary Inter-Annotator Reliability
- There's no point going on with an annotation scheme if it can't be reproduced
- There are standard methods for measuring inter-annotator reliability
- An analysis of inter-annotator disagreements is critical for debugging an annotation scheme
99Outline
- Topics
- Concordances
- Data sparseness
- Chomsky's Critique
- Ngrams
- Mutual Information
- Part-of-speech tagging
- Annotation Issues
- Inter-Annotator Reliability
- Named Entity Tagging
- Relationship Tagging
- Case Studies
- metonymy
- adjective ordering
- Discourse markers then
- TimeML
100Information Extraction
- Types:
- Flag names of people, organizations, places, ...
- Flag and normalize values: phrases such as time expressions, measure phrases, currency expressions, etc.
- Group coreferring expressions together
- Find relations between named entities (works for, located at, etc.)
- Find events mentioned in the text
- Find relations between events and entities
- A hot commercial technology!
- Example patterns:
- Mr. ---,
- , Ill.
101Message Understanding Conferences (MUCs)
- Idea precise tasks to measure success, rather
than test suite of input and logical forms. - MUC-1 1987 and MUC-2 1989 - messages about navy
operations - MUC-3 1991 and MIC-4 1992 - news articles and
transcripts of radio broadcasts about terrorist
activity - MUC-5 1993 - news articles about joint ventures
and microelectronics - MUC-6 1995 - news articles about management
changes, additional tasks of named entity
recognition, coreference, and template element - MUC-7 1998 mostly multilingual information
extraction - Has also been applied to hundreds of other
domains - scientific articles, etc., etc.
102Historical Perspective
- Until MUC-3 (1993), many IE systems used a
Knowledge Engineering approach - They did something like full chart parsing with a
unification-based grammar with full logical
forms, a rich lexicon and KB - E.g., SRI's Tacitus
- Then, they discovered that things could work much
faster using finite-state methods and partial
parsing - And that using domain-specific rather than
general purpose lexicons simplified parsing (less
ambiguity due to fewer irrelevant senses) - And that these methods worked even better for the
IE tasks - E.g., SRI's Fastus, SRA's Nametag
- Meanwhile, people also started using statistical
learning methods from annotated corpora - Including CFG parsing
103An instantiated scenario template
Source
Wall Street Journal, 06/15/88 MAXICARE HEALTH
PLANS INC and UNIVERSAL HEALTH SERVICES INC have
dissolved a joint venture which provided health
services.
104Templates Can get Complex! (MUC-5)
1052002 Automatic Content Extraction (ACE) Program
Entity Types
- Person
- Organization
- (Place)
- Location: e.g., geographical areas, landmasses, bodies of water, geological formations
- Geo-Political Entity: e.g., nations, states, cities
- Created due to metonymies involving this class of places:
- The riots in Miami
- Miami imposed a curfew
- Miami railed against a curfew
- Facility: buildings, streets, airports, etc.
106ACE Entity Attributes and Relations
- Attributes
- Name: an entity mentioned by name
- Pronoun
- Nominal
- Relations
- AT: based-in, located, residence
- NEAR: relative-location
- PART: part-of, subsidiary, other
- ROLE: affiliate-partner, citizen-of, client, founder, general-staff, manager, member, owner, other
- SOCIAL: associate, grandparent, parent, sibling, spouse, other-relative, other-personal, other-professional
107Designing an Information Extraction Task
- Define the overall task
- Collect a corpus
- Design an Annotation Scheme
- linguistic theories help
- Use Annotation Tools
- authoring tools
- automatic extraction tools
- Apply the annotation to the corpus, assessing reliability
- Use the training portion of the corpus to train information extraction (IE) systems
- Use the test portion to test IE systems, using a scoring program
108Annotation Tools
- Specialized authoring tools used for marking up
text without damaging it - Some tools are tied to particular annotation
schemes
109Annotation Tool Example Alembic Workbench
110Callisto (Java successor to Alembic Workbench)
111Relationship Annotation Callisto
112Steps in Information Extraction
- Tokenization
- Language Identification
- D