Title: SIMS 290-2: Applied Natural Language Processing
1. SIMS 290-2: Applied Natural Language Processing
Marti Hearst, Sept 13, 2004
2. Today
- Purpose of Part-of-Speech Tagging
- Training and Testing Collections
- Intro to N-grams and Language Modeling
- Using NLTK for POS Tagging
3. Class Exercise
- I will read off a few words from the beginning of a sentence.
- You should write down the very first 2 words that come to mind that should follow these words.
- Example:
  - I say "One fish"
  - You write "two fish"
- Don't second-guess or try to be clever.
- Note: there are no correct answers.
4. Terminology
- Tagging: the process of associating labels with each token in a text
- Tags: the labels
- Tag set: the collection of tags used for a particular task
5. Example
- Typically a tagged text is a sequence of whitespace-separated base/tag tokens:
- The/at Pantheon's/np$ interior/nn ,/, still/rb in/in its/pp$ original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp$ rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp$ diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
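Tokens in this base/tag format are easy to pull apart in plain Python. A minimal sketch (the helper name `parse_tagged` is mine, not from NLTK):

```python
def parse_tagged(text):
    """Split a whitespace-separated string of base/tag tokens into
    (word, tag) pairs.  rpartition splits on the LAST slash, so a
    token like './.' still parses correctly."""
    pairs = []
    for tok in text.split():
        word, _, tag = tok.rpartition('/')
        pairs.append((word, tag))
    return pairs

print(parse_tagged("truly/ql majestic/jj and/cc triumph/nn ./."))
# [('truly', 'ql'), ('majestic', 'jj'), ('and', 'cc'), ('triumph', 'nn'), ('.', '.')]
```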
6. What does Tagging do?
- Collapses distinctions
  - Lexical identity may be discarded
  - e.g., all personal pronouns tagged with PRP
- Introduces distinctions
  - Ambiguities may be removed
  - e.g., "deal" tagged with NN or VB
  - e.g., "deal" tagged with DEAL1 or DEAL2
- Helps classification and prediction
7. Significance of Parts of Speech
- A word's POS tells us a lot about the word and its neighbors
  - Limits the range of meanings ("deal"), pronunciation (OBject vs obJECT), or both ("wind")
  - Helps in stemming
  - Limits the range of following words for speech recognition
  - Can help select nouns from a document for IR
- Basis for partial parsing (chunked parsing)
  - Parsers can build trees directly on the POS tags instead of maintaining a lexicon
8. Choosing a tagset
- The choice of tagset greatly affects the difficulty of the problem
- Need to strike a balance between:
  - Getting better information about context (best to introduce more distinctions)
  - Making it possible for classifiers to do their job (need to minimize distinctions)
9. Some of the best-known Tagsets
- Brown corpus: 87 tags
- Penn Treebank: 45 tags
- Lancaster UCREL C5 (used to tag the BNC): 61 tags
- Lancaster C7: 145 tags
10. The Brown Corpus
- The first digital corpus (1961)
  - Francis and Kucera, Brown University
- Contents: 500 texts, each 2000 words long
  - From American books, newspapers, magazines
  - Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore
11. Penn Treebank
- First syntactically annotated corpus
- 1 million words from the Wall Street Journal
- Part-of-speech tags and syntax trees
12. How hard is POS tagging?
In the Brown corpus:
- 11.5% of word TYPES are ambiguous
- 40% of word TOKENS are ambiguous

Number of tags:       1      2     3    4   5   6  7
Number of word types: 35340  3760  264  61  12  2  1
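The ambiguity figure can be recomputed from the table. A quick sketch using the counts above (the exact percentage depends on the corpus version and tokenization, so it need not match the quoted figure exactly):

```python
# word types in the Brown corpus, keyed by how many tags each can take
# (counts taken from the table above)
types_by_tag_count = {1: 35340, 2: 3760, 3: 264, 4: 61, 5: 12, 6: 2, 7: 1}

total = sum(types_by_tag_count.values())
ambiguous = total - types_by_tag_count[1]   # types with 2 or more tags
print(ambiguous, total, round(100 * ambiguous / total, 1))   # 4100 39440 10.4
```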
13. Important Penn Treebank tags
14. Verb inflection tags
15. The entire Penn Treebank tagset
16. Quick test
DoCoMo and Sony are to develop a chip that would let people pay for goods through their mobiles.
17. Tagging methods
- Hand-coded
- Statistical taggers
- Brill (transformation-based) tagger
18. Reading Tagged Corpora
>>> corpus = brown.read('ca01')
>>> corpus['WORDS'][0:10]
[<The/at>, <Fulton/np-tl>, <County/nn-tl>, <Grand/jj-tl>, <Jury/nn-tl>, <said/vbd>, <Friday/nr>, <an/at>, <investigation/nn>, <of/in>]
>>> corpus['WORDS'][2]['TAG']
'nn-tl'
>>> corpus['WORDS'][2]['TEXT']
'County'
19. Default Tagger
- We need something to use for unseen words
  - E.g., guess NNP for a word with an initial capital
- How to do this?
  - Apply a sequence of regular expression tests
  - Assign the word to a suitable tag
- If there are no matches:
  - Assign the most frequent tag for unknown words, NN
  - Other common choices are verb, proper noun, adjective
- Note the role of closed-class words in English
  - Prepositions, auxiliaries, etc.
  - New ones do not tend to appear.
20. A Default Tagger
>>> from nltk.tokenizer import *
>>> from nltk.tagger import *
>>> text_token = Token(TEXT="John saw 3 polar bears .")
>>> WhitespaceTokenizer().tokenize(text_token)
>>> NN_CD_tagger = RegexpTagger([(r'^[0-9]+(\.[0-9]+)?$', 'cd'), (r'.*', 'nn')])
>>> NN_CD_tagger.tag(text_token)
<[<John/nn>, <saw/nn>, <3/cd>, <polar/nn>, <bears/nn>, <./nn>]>

- NN_CD_tagger assigns CD to numbers and NN to everything else.
- Poor performance (20-30%) in isolation, but when used with other taggers it can significantly improve performance.
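The `nltk.tagger` API above is from NLTK 1.x and no longer exists in current NLTK. The same regex-cascade idea can be sketched in plain Python (all names here are mine):

```python
import re

# ordered (pattern, tag) rules; the first pattern that matches wins
RULES = [
    (re.compile(r'^[0-9]+(\.[0-9]+)?$'), 'cd'),  # cardinal numbers
    (re.compile(r'.'), 'nn'),                    # everything else
]

def regexp_tag(words):
    """Tag each word with the tag of the first matching rule."""
    tagged = []
    for w in words:
        for pattern, tag in RULES:
            if pattern.match(w):
                tagged.append((w, tag))
                break
    return tagged

print(regexp_tag("John saw 3 polar bears .".split()))
# [('John', 'nn'), ('saw', 'nn'), ('3', 'cd'), ('polar', 'nn'), ('bears', 'nn'), ('.', 'nn')]
```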
21. Finding the most frequent tag
>>> from nltk.probability import FreqDist
>>> from nltk.corpus import brown
>>> fd = FreqDist()
>>> corpus = brown.read('ca01')
>>> for token in corpus['WORDS']:
...     fd.inc(token['TAG'])
>>> fd.max()
>>> fd.count(fd.max())
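The FreqDist calls above are also NLTK 1.x. In current Python, `collections.Counter` does the same count; a sketch over plain (word, tag) pairs:

```python
from collections import Counter

tagged = [('The', 'at'), ('Fulton', 'np-tl'), ('County', 'nn-tl'),
          ('Grand', 'jj-tl'), ('Jury', 'nn-tl'), ('said', 'vbd')]

# count how often each tag occurs, then take the most frequent one
fd = Counter(tag for _, tag in tagged)
most_common_tag, count = fd.most_common(1)[0]
print(most_common_tag, count)   # nn-tl 2
```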
22. Evaluating the Tagger
This gets 3 wrong out of 16, an 18.75% error rate. Can also say an accuracy of 81.25%.
23. Training vs. Testing
- A fundamental idea in computational linguistics
- Start with a collection labeled with the right answers: supervised learning
  - Usually the labels are done by hand
- "Train" or "teach" the algorithm on a subset of the labeled text.
- Test the algorithm on a different set of data.
- Why?
  - If memorization worked, we'd be done.
  - Need to generalize so the algorithm works on examples that you haven't seen yet.
  - Thus testing only makes sense on examples you didn't train on.
- NLTK has an excellent interface for doing this easily.
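A train/test split like this takes only a few lines of Python; a minimal sketch (the 90/10 ratio and function name are my choices, not from the slides):

```python
import random

def train_test_split(items, test_fraction=0.1, seed=0):
    """Shuffle labeled data and split it so the test set is unseen."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed for reproducibility
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]

data = [(f"word{i}", "nn") for i in range(100)]
train, test = train_test_split(data)
print(len(train), len(test))   # 90 10
```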
24. Training the Unigram Tagger
25. Creating Separate Training and Testing Sets
26. Evaluating a Tagger
- Tagged tokens: the original data
- Untag (exclude) the data
- Tag the data with your own tagger
- Compare the original and new tags
  - Iterate over the two lists, checking for identity and counting
  - Accuracy = fraction correct
27. Assessing the Errors
- Why the tuple method? Dictionaries cannot be indexed by lists, so convert lists to tuples.
- exclude returns a new token containing only the properties that are not named in the given list.
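The tuple point is easy to demonstrate in the interpreter; a short illustration:

```python
# lists are mutable, hence unhashable, and cannot be dict keys;
# converting to a tuple makes the value hashable
tags = ['nn', 'vb']
errors = {tuple(tags): 3}          # {('nn', 'vb'): 3}
print(errors[('nn', 'vb')])        # 3

try:
    {tags: 3}                      # raises TypeError
except TypeError as e:
    print(e)                       # unhashable type: 'list'
```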
28. Assessing the Errors
29. Language Modeling
- Another fundamental concept in NLP
- Main idea:
  - For a given language, some words are more likely than others to follow each other, or
  - You can predict (with some degree of accuracy) the probability that a given word will follow another word.
- Illustration: distributions of words in the class-participation exercise.
30. N-Grams
- The N stands for how many terms are used
  - Unigram: 1 term
  - Bigram: 2 terms
  - Trigram: 3 terms
  - Usually don't go beyond this
- You can use different kinds of terms, e.g.:
  - Character-based n-grams
  - Word-based n-grams
  - POS-based n-grams
- Ordering
  - Often adjacent, but not required
- We use n-grams to help determine the context in which some linguistic phenomenon happens.
  - E.g., look at the words before and after the period to see if it is the end of a sentence or not.
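Extracting adjacent n-grams from a sequence is a one-liner, and the same code works for characters, words, or tags alike; a sketch (the helper name is mine):

```python
def ngrams(seq, n):
    """Return the list of adjacent n-grams in seq."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

words = "one fish two fish".split()
print(ngrams(words, 2))    # word bigrams
# [('one', 'fish'), ('fish', 'two'), ('two', 'fish')]
print(ngrams("fish", 2))   # character bigrams
# [('f', 'i'), ('i', 's'), ('s', 'h')]
```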
31Features and Contexts
wn-2 wn-1 wn wn1
CONTEXT FEATURE CONTEXT
tn-1
tn
tn1
tn-2
32. Unigram Tagger
- Trained using a tagged corpus to determine which tags are most common for each word.
  - E.g., in a tagged WSJ sample, "deal" is tagged with NN 11 times, with VB 1 time, and with VBP 1 time
- Performance is highly dependent on the quality of its training set.
  - Can't be too small
  - Can't be too different from the texts we actually want to tag
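Training a unigram tagger amounts to counting (word, tag) pairs and keeping the majority tag per word; a minimal sketch (not the NLTK implementation):

```python
from collections import Counter, defaultdict

def train_unigram(tagged_pairs):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_pairs:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

# toy training data modeled on the "deal" counts above
train = [('deal', 'NN')] * 11 + [('deal', 'VB'), ('deal', 'VBP'), ('a', 'DT')]
model = train_unigram(train)
print(model['deal'])   # NN
```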
33. Nth-Order Tagging
- "Order" refers to how much context is used
  - It's one less than the N in N-gram here, because we use the target word itself as part of the context.
  - 0th order: unigram tagger
  - 1st order: bigram tagger
  - 2nd order: trigram tagger
- Bigram tagger
  - In addition to considering the token's type, the context also considers the tags of the n preceding tokens
  - What is the most likely tag for w_n, given w_{n-1} and t_{n-1}?
  - The tagger picks the tag which is most likely for that context.
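A bigram (1st-order) tagger conditions on the previous tag as well as the current word. A rough sketch of the counting step (names are mine, and a real tagger also needs backoff for unseen contexts):

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sent):
    """Count observed tags for each (previous_tag, word) context."""
    counts = defaultdict(Counter)
    prev = '<START>'
    for word, tag in tagged_sent:
        counts[(prev, word)][tag] += 1
        prev = tag
    return counts

counts = train_bigram([('to', 'TO'), ('race', 'VB'),
                       ('the', 'DT'), ('race', 'NN')])
print(counts[('TO', 'race')].most_common(1)[0][0])   # VB
print(counts[('DT', 'race')].most_common(1)[0][0])   # NN
```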
34. Reading the Bigram table
35. Tagging with lexical frequencies
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
- People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- Problem: assign a tag to "race" given its lexical frequency
- Solution: choose the tag with the greater probability:
  - P(race|VB)
  - P(race|NN)
- Actual estimates from the Switchboard corpus:
  - P(race|NN) = .00041
  - P(race|VB) = .00003
36. Combining Taggers
- Use more accurate algorithms when we can; back off to wider coverage when needed.
  - Try tagging the token with the 1st-order tagger.
  - If the 1st-order tagger is unable to find a tag for the token, try finding a tag with the 0th-order tagger.
  - If the 0th-order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag.
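The backoff chain itself is a simple loop over taggers. A sketch where each tagger is a function returning a tag or None (my framing, not NLTK's class API; the toy taggers are hypothetical stand-ins):

```python
def backoff_tag(word, context, taggers):
    """Try each tagger in order; return the first non-None tag."""
    for tagger in taggers:
        tag = tagger(word, context)
        if tag is not None:
            return tag
    return None

# toy taggers standing in for the 1st-order, 0th-order, and regex taggers
bigram  = lambda w, c: {'race': 'VB'}.get(w) if c == 'TO' else None
unigram = lambda w, c: {'race': 'NN', 'the': 'DT'}.get(w)
default = lambda w, c: 'cd' if w.isdigit() else 'nn'

taggers = [bigram, unigram, default]
print(backoff_tag('race', 'TO', taggers))   # VB
print(backoff_tag('race', 'DT', taggers))   # NN
print(backoff_tag('blah', None, taggers))   # nn
```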
37. BackoffTagger class
>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)

# Construct the taggers
>>> tagger1 = NthOrderTagger(1, SUBTOKENS='WORDS')
>>> tagger2 = UnigramTagger()   # 0th order
>>> tagger3 = NN_CD_Tagger()

# Train the taggers
>>> for tok in train_toks:
...     tagger1.train(tok)
...     tagger2.train(tok)
38. Backoff (continued)
# Combine the taggers (in order, by specificity)
>>> tagger = BackoffTagger([tagger1, tagger2, tagger3])

# Use the combined tagger
>>> accuracy = tagger_accuracy(tagger, unseen_tokens)
39. Rule-Based Tagger
- The linguistic complaint
  - Where is the linguistic knowledge of a tagger?
  - Just a massive table of numbers
  - Aren't there any linguistic insights that could emerge from the data?
- Could thus use handcrafted sets of rules to tag input sentences; for example, if the input follows a determiner, tag it as a noun.
40. The Brill tagger
- An example of TRANSFORMATION-BASED LEARNING
- Very popular (freely available, works fairly well)
- A SUPERVISED method: requires a tagged corpus
- Basic idea: do a quick job first (using frequency), then revise it using contextual rules
41. Brill Tagging: In more detail
- Start with simple (less accurate) rules; learn better ones from the tagged corpus
  - Tag each word initially with its most likely POS
  - Examine a set of transformations to see which most improves tagging decisions compared to the tagged corpus
  - Re-tag the corpus using the best transformation
  - Repeat until, e.g., performance doesn't improve
- Result: a tagging procedure (an ordered list of transformations) which can be applied to new, untagged text
42. An example
- Examples:
  - It is expected to race tomorrow.
  - The race for outer space.
- Tagging algorithm:
  - Tag all uses of "race" as NN (the most likely tag in the Brown corpus)
    - It is expected to race/NN tomorrow
    - the race/NN for outer space
  - Use a transformation rule to replace the tag NN with VB for all uses of "race" preceded by the tag TO
    - It is expected to race/VB tomorrow
    - the race/NN for outer space
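A single Brill transformation of the kind shown ("change NN to VB when the previous tag is TO") is easy to apply; a minimal sketch:

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Retag word i from from_tag to to_tag when tag i-1 == prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sent = [('expected', 'VBN'), ('to', 'TO'), ('race', 'NN'), ('tomorrow', 'NN')]
print(apply_rule(sent, 'NN', 'VB', 'TO'))
# [('expected', 'VBN'), ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```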
43. Transformation-based learning in the Brill tagger
- Tag the corpus with the most likely tag for each word
- Choose a TRANSFORMATION that deterministically replaces an existing tag with a new one such that the resulting tagged corpus has the lowest error rate
- Apply that transformation to the training corpus
- Repeat
- Return a tagger that:
  - first tags using unigrams
  - then applies the learned transformations in order
44. Examples of learned transformations
45. Templates
46. Additional issues
- Most of the difference in performance between POS tagging algorithms depends on their treatment of UNKNOWN WORDS
- Multiple-token words (Penn Treebank)
- Class-based N-grams
47. Upcoming
- I will email the procedures for turning in the first assignment on Wed, Sept 15
  - It will be over the web
- On Wed I'll discuss shallow parsing
  - Start reading the Chunking (Shallow Parsing) tutorial
  - I will assign homework from this on Wed, due in one week on Sept 22.
- Next Monday I'll briefly discuss syntactic parsing
  - There is a tutorial on this; feel free to read it
  - In the interest of reducing workload, I'm not assigning it, however