Title: NLTK Tagging
1. NLTK Tagging
- CS1573 AI Application Development, Spring 2003 
- (modified from Steven Bird's notes) 
2. Today's Outline
- Administration 
- Final Words on Regular Expressions 
- Regular Expressions in NLTK 
- New Topic: Tagging 
- Motivation and Linguistic Background 
- NLTK Tutorial Tagging 
- Part-of-Speech Tagging 
- The nltk.tagger Module 
- A Few Tagging Algorithms 
- Some Gory Details 
3. Regular Expressions, again
- Python 
- Regular expression syntax 
- NLTK uses 
- The regular expression tokenizer 
- A simple regular expression tagging algorithm 
4. Regular Expression Tokenizers
- Mimicking the WSTokenizer 
- >>> tokenizer = RETokenizer(r'[^\s]+') 
- >>> tokenizer.tokenize(example_text) 
- ['Hello.'@[0w], "Isn't"@[1w], 'this'@[2w], 'fun?'@[3w]]
5. RE Tokenization, continued
- >>> regexp = r'\w+|[^\w\s]+' 
- >>> tokenizer = RETokenizer(regexp) 
- >>> tokenizer.tokenize(example_text) 
- ['Hello'@[0w], '.'@[1w], 'Isn'@[2w], "'"@[3w], 't'@[4w], 'this'@[5w], 'fun'@[6w], '?'@[7w]]
- Why is this version better?
6. RE Tokenization, continued
- >>> regexp = r'\w+|[^\w\s]+' 
- Why is this version better? 
- includes punctuation as separate tokens 
- matches either a sequence of alphanumeric characters (letters and numbers) or a sequence of punctuation characters 
- But it still has problems, for example $22.40?
7. Improved Example
- >>> example_text = 'That poster costs $22.40.' 
- >>> regexp = r'(\w+)|(\$\d+\.\d+)|([^\w\s]+)' 
- >>> tokenizer = RETokenizer(regexp) 
- >>> tokenizer.tokenize(example_text) 
- ['That'@[0w], 'poster'@[1w], 'costs'@[2w], '$22.40'@[3w], '.'@[4w]]
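- As a cross-check, the same alternation works with Python's standard re module (a minimal sketch; re.findall stands in here for RETokenizer, and without the capturing parentheses findall returns the whole matches):

  import re

  example_text = 'That poster costs $22.40.'
  # word characters, or a dollar amount, or a run of punctuation
  regexp = r'\w+|\$\d+\.\d+|[^\w\s]+'
  print(re.findall(regexp, example_text))
  # ['That', 'poster', 'costs', '$22.40', '.']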
8. Regular Expression Limitations
- While regular languages can model many things, there are still limitations (no guidance about why a string is rejected, and all-or-one solutions when the accepting condition is ambiguous).
9. New Topic
- Now we're going to start looking at tagging, and especially approaches that depend on looking at words in context. 
- We'll start with what looks like an artificial task: predicting the next word in a sequence. 
- We'll then move to tagging, the process of associating auxiliary information with each token, often for use in later stages of text processing.
10. Word Prediction Example
- From NY Times 
- Stocks plunged this 
11. Word Prediction Example
- From NY Times 
- Stocks plunged this morning, despite a cut in 
 interest
12. Word Prediction Example
- From NY Times 
- Stocks plunged this morning, despite a cut in 
 interest rates by the Federal Reserve, as Wall
13. Word Prediction Example
- From NY Times 
- Stocks plunged this morning, despite a cut in 
 interest rates by the Federal Reserve, as Wall
 Street began
14. Word Prediction Example
- From NY Times 
- Stocks plunged this morning, despite a cut in 
 interest rates by the Federal Reserve, as Wall
 Street began trading for the first time since
 last
15. Word Prediction Example
- From NY Times 
- Stocks plunged this morning, despite a cut in 
 interest rates by the Federal Reserve, as Wall
 Street began trading for the first time since
 last Tuesday's terrorist attacks.
16. Format Change
- Move to pdf slides (highlights of Jurafsky and 
 Martin Chapters 6 and 8)
17. Tagging Overview / Review
- Motivation 
- What is tagging? What does tagging do? Kinds of 
 tagging?
- Significance of part of speech 
- Basics 
- Features and context 
- Brown and Penn Treebank tagsets 
- Tagging in NLTK (nltk.tagger module) 
- Tagging 
- Algorithms, statistical and rule-based tagging 
- Evaluation 
18. Terminology
- Tagging 
- The process of associating labels with each token 
 in a text
- Tags 
- The labels 
- Tag Set 
- The collection of tags used for a particular task
19. Example
- Typically a tagged text is a sequence of 
 white-space separated base/tag tokens
- The/at Pantheon's/np$ interior/nn ,/, still/rb in/in its/pp$ original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp$ rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp$ diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
20. What does Tagging do?
- Collapses Distinctions 
- Lexical identity may be discarded 
- e.g. all personal pronouns tagged with PRP 
- Introduces Distinctions 
- Ambiguities may be removed 
- e.g. deal tagged with NN or VB 
- e.g. deal tagged with DEAL1 or DEAL2 
- Helps classification and prediction
21. Kinds of Tagging
- Part-of-Speech tagging 
- Grammatical tagging 
- Divides words into categories based on how they 
 can be combined to form sentences (e.g., articles
 can combine with nouns but not verbs)
- Semantic Sense tagging 
- Sense disambiguation 
- Homonym disambiguation 
- Discourse tagging 
- Speech acts (request, inform, greet, etc.) 
22. Significance of Parts of Speech
- A word's POS tells us a lot about the word and its neighbors 
- Limits the range of meanings (deal), pronunciation (OBject vs. obJECT), or both (wind) 
- Helps in stemming 
- Limits the range of following words for ASR 
- Helps select nouns from a document for IR 
- Basis for partial parsing 
- Basis for searching for linguistic constructions 
- Parsers can build trees directly on the POS tags 
 instead of maintaining a lexicon
23. Features and Contexts
- [Diagram: a window of words w(n-2) w(n-1) w(n) w(n+1) with tags t(n-2) t(n-1) t(n) t(n+1); the CONTEXT (the neighboring words and the preceding tags) is used to predict the FEATURE (the tag t(n)).]
24. Why there are many tag sets
- Definition of POS tag 
- Semantic, syntactic, morphological 
- Tagsets differ in both how they define the tags, 
 and at what level of granularity
- Balancing classification and prediction 
- Introducing more distinctions 
- Better information about context 
- Harder to classify current token 
- Introducing fewer distinctions 
- Less information about context 
- Less work to do for classifying current token
25. The Brown Corpus
- The first digital corpus (1961) 
- Francis and Kucera, Brown University 
- Contents: 500 texts, each 2000 words long 
- From American books, newspapers, magazines 
- Representing genres 
- Science fiction, romance fiction, press reportage, scientific writing, popular lore
26. Penn Treebank
- First syntactically annotated corpus 
- 1 million words from Wall Street Journal 
- Part of speech tags and syntax trees
27. Representing Tags in NLTK
- TaggedType class 
- >>> ttype1 = TaggedType('dog', 'NN') 
- 'dog'/'NN' 
- >>> ttype1.base() 
- 'dog' 
- >>> ttype1.tag() 
- 'NN' 
- Tagged tokens 
- >>> ttoken = Token(ttype1, Location(5)) 
- 'dog'/'NN'@[5]
28. Reading Tagged Corpora
- >>> tagged_text_str = open('corpus.txt').read() 
- 'John/NN saw/VB the/AT book/NN on/IN the/AT table/NN ./END He/NN sighed/VB ./END' 
- >>> tokens = TaggedTokenizer().tokenize(tagged_text_str) 
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], 'table'/'NN'@[6w], '.'/'END'@[7w], 'He'/'NN'@[8w], 'sighed'/'VB'@[9w], '.'/'END'@[10w]]
- If TaggedTokenizer encounters a word without a tag, it will assign it the default tag None.
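- A minimal plain-Python sketch of the same behavior (the function name parse_tagged is hypothetical, not part of NLTK):

  def parse_tagged(text):
      # Split on whitespace, then split each token at its last '/';
      # a token with no '/' gets the default tag None.
      tokens = []
      for i, tok in enumerate(text.split()):
          base, sep, tag = tok.rpartition('/')
          tokens.append((base, tag, i) if sep else (tok, None, i))
      return tokens

  print(parse_tagged('John/NN saw/VB the book/NN'))
  # [('John', 'NN', 0), ('saw', 'VB', 1), ('the', None, 2), ('book', 'NN', 3)]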
29. The TaggerI Interface
- >>> tokens = WSTokenizer().tokenize(untagged_text_str) 
- ['John'@[0w], 'saw'@[1w], 'the'@[2w], 'book'@[3w], 'on'@[4w], 'the'@[5w], 'table'@[6w], '.'@[7w], 'He'@[8w], 'sighed'@[9w], '.'@[10w]]
- >>> my_tagger.tag(tokens) 
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], 'table'/'NN'@[6w], '.'/'END'@[7w], 'He'/'NN'@[8w], 'sighed'/'VB'@[9w], '.'/'END'@[10w]]
- The interface defines a single method, tag, which 
 assigns a tag to each token in a list, and
 returns the resulting list of tagged tokens.
30. Tagging Algorithms
- Default tagger 
- Inspect the word and guess a tag 
- Unigram tagger 
- Assign the tag which is the most probable for the 
 word in question, based on raw frequency
- Uses training data 
- Bigram tagger, n-gram tagger 
- Rule-based taggers, HMM taggers (outside scope of 
 this class)
31. Default Tagger
- We need something to use for unseen words 
- E.g., guess NNP for a word with an initial capital 
- Do regular-expression processing of the words 
- Sequence of regular expression tests (see the sketch below) 
- Assignment of the word to a suitable tag 
- If there are no matches 
- Assign to the most frequent tag, NN
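- A minimal sketch of such a regular-expression cascade in plain Python (the patterns and tags are illustrative choices, not NLTK's defaults):

  import re

  # Ordered (pattern, tag) tests; the first match wins.
  PATTERNS = [
      (r'^[0-9]+(\.[0-9]+)?$', 'CD'),   # numbers
      (r'^[A-Z][a-z]*$',       'NNP'),  # initial capital: guess proper noun
  ]

  def default_tag(word):
      for pattern, tag in PATTERNS:
          if re.match(pattern, word):
              return tag
      return 'NN'  # no test matched: fall back to the most frequent tag

  print([default_tag(w) for w in ['John', 'saw', '3.5']])
  # ['NNP', 'NN', 'CD']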
32. Finding the most frequent tag
- nltk.probability module 
- for ttoken in ttext:
      freq_dist.inc(ttoken.tag())
  def_tag = freq_dist.max()
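- The same count as a plain-Python sketch with collections.Counter:

  from collections import Counter

  tags = ['NN', 'VB', 'NN', 'AT', 'NN']         # tags seen in a tagged text
  def_tag = Counter(tags).most_common(1)[0][0]  # 'NN', the most frequent tag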
33. A Default Tagger
- >>> tokens = WSTokenizer().tokenize(untag_text_str) 
- ['John'@[0w], 'saw'@[1w], '3'@[2w], 'polar'@[3w], 'bears'@[4w], '.'@[5w]]
- >>> my_tagger.tag(tokens) 
- ['John'/'NN'@[0w], 'saw'/'NN'@[1w], '3'/'CD'@[2w], 'polar'/'NN'@[3w], 'bears'/'NN'@[4w], '.'/'NN'@[5w]]
- NN_CD_Tagger assigns CD to numbers, otherwise NN 
- Poor performance (20-30%) in isolation, but when used with other taggers can significantly improve performance
34. Unigram Tagger
- Unigram = table of frequencies 
- E.g., in a tagged WSJ sample, "deal" is tagged with NN 11 times, with VB 1 time, and with VBP 1 time 
- 90% accuracy 
- Counting events: 
- freq_dist = CFFreqDist() 
  for ttoken in ttext:
      context = ttoken.type().base()
      feature = ttoken.type().tag()
      freq_dist.inc(CFSample(context, feature))
- context_event = ContextEvent(token.type()) 
  sample = freq_dist.cond_max(context_event)
  tag = sample.feature()
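- Since the CFFreqDist API above comes from the 2003-era NLTK, here is an equivalent from-scratch sketch of the same counting idea (all names are hypothetical):

  from collections import Counter, defaultdict

  counts = defaultdict(Counter)  # word -> Counter of tags seen for it

  def train(tagged_tokens):
      for word, tag in tagged_tokens:
          counts[word][tag] += 1

  def unigram_tag(word):
      # most common tag for this word; None if the word was never seen
      return counts[word].most_common(1)[0][0] if word in counts else None

  train([('deal', 'NN')] * 11 + [('deal', 'VB'), ('deal', 'VBP')])
  print(unigram_tag('deal'))  # 'NN'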
35. Unigram Tagger (continued)
- Before being used, UnigramTaggers are trained using the train method, which uses a tagged corpus to determine which tags are most common for each word 
- # 'train.txt' is a tagged training corpus 
- >>> tagged_text_str = open('train.txt').read() 
- >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str) 
- >>> tagger = UnigramTagger() 
- >>> tagger.train(train_toks) 
36. Unigram Tagger (continued)
- Once a UnigramTagger has been trained, the tag method can be used to tag untagged corpora 
- >>> tokens = WSTokenizer().tokenize(untagged_text_str) 
- >>> tagger.tag(tokens) 
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
37. Unigram Tagger (continued)
- Performance is highly dependent on the quality of its training set 
- Can't be too small 
- Can't be too different from the texts we actually want to tag 
- How is this related to the homework that we just did?
38. Nth-Order Tagging
- Bigram = table of frequencies of pairs 
- Not necessarily adjacent or of the same category 
- What is the most likely tag for w_n, given w_{n-1} and t_{n-1}? 
- What is the context for NLTK? 
- N-gram tagger 
- Consider the n-1 previous tags 
- Sparse data problem 
- Accuracy versus coverage tradeoff 
- Backoff 
- Throwing away order 
- Put context into a set 
39. Nth-Order Tagging (continued)
- In addition to considering the token's type, the context also considers the tags of the n preceding tokens 
- The tagger then picks the tag which is most likely for that context 
- Different values of n are possible: 
- 0th order = unigram tagger 
- 1st order = bigrams 
- 2nd order = trigrams 
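- As a sketch, a 1st-order (bigram) context pairs the previous tag with the current word; this extends the from-scratch counting idea above (names hypothetical):

  from collections import Counter, defaultdict

  counts = defaultdict(Counter)  # (previous tag, word) -> Counter of tags

  def train_bigram(tagged_tokens):
      prev_tag = None  # nothing precedes the first token
      for word, tag in tagged_tokens:
          counts[(prev_tag, word)][tag] += 1
          prev_tag = tag

  def bigram_tag(prev_tag, word):
      context = (prev_tag, word)
      if context in counts:
          return counts[context].most_common(1)[0][0]
      return None  # unseen context: the sparse data problem

  train_bigram([('the', 'AT'), ('deal', 'NN'), ('to', 'TO'), ('deal', 'VB')])
  print(bigram_tag('AT', 'deal'), bigram_tag('TO', 'deal'))  # NN VB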
40. Nth-Order Tagging (continued)
- A tagged training corpus determines the most likely tag for each context 
- >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str) 
- >>> tagger = NthOrderTagger(3)  # 3rd-order tagger 
- >>> tagger.train(train_toks)
41. Nth-Order Tagging (continued)
- Once trained, it can tag untagged corpora 
- >>> tokens = WSTokenizer().tokenize(untag_text_str) 
- >>> tagger.tag(tokens) 
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
42. Combining Taggers
- Use more accurate algorithms when we can, and back off to wider-coverage ones when needed (see the sketch below). 
- Try tagging the token with the 1st-order tagger. 
- If the 1st-order tagger is unable to find a tag for the token, try finding a tag with the 0th-order tagger. 
- If the 0th-order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag.
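- A sketch of that backoff chain in plain Python, assuming each tagger returns None when it has no answer (the taggers here are toy stand-ins):

  # Try each tagger in order, most specific first; the last always answers.
  def backoff_tag(word, prev_tag, taggers):
      for tagger in taggers:
          tag = tagger(word, prev_tag)
          if tag is not None:
              return tag

  bigrams  = {('AT', 'deal'): 'NN'}  # toy 1st-order table
  unigrams = {'saw': 'VB'}           # toy 0th-order table
  taggers = [
      lambda w, p: bigrams.get((p, w)),
      lambda w, p: unigrams.get(w),
      lambda w, p: 'CD' if w.replace('.', '', 1).isdigit() else 'NN',
  ]
  print([backoff_tag(w, p, taggers)
         for w, p in [('deal', 'AT'), ('saw', None), ('3', 'NN')]])
  # ['NN', 'VB', 'CD']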
43. BackoffTagger class
- >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str) 
- # Construct the taggers 
- >>> tagger1 = NthOrderTagger(1)  # 1st order 
- >>> tagger2 = UnigramTagger()    # 0th order 
- >>> tagger3 = NN_CD_Tagger() 
- # Train the taggers 
- >>> tagger1.train(train_toks) 
- >>> tagger2.train(train_toks)
44. Backoff (continued)
- # Combine the taggers (in order, by specificity) 
- >>> tagger = BackoffTagger([tagger1, tagger2, tagger3]) 
- # Use the combined tagger 
- >>> tokens = TaggedTokenizer().tokenize(untagged_text_str) 
- >>> tagger.tag(tokens) 
- ['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
45. Rule-Based Tagger
- The Linguistic Complaint 
- Where is the linguistic knowledge of a tagger? 
- Just a massive table of numbers 
- Aren't there any linguistic insights that could emerge from the data? 
- Could thus use handcrafted sets of rules to tag input sentences, for example: if a word follows a determiner, tag it as a noun.
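- A minimal sketch of the determiner-noun rule just mentioned (the rule set is illustrative):

  DETERMINERS = {'the', 'a', 'an'}

  def rule_tag(words):
      tags = []
      for i, w in enumerate(words):
          if w.lower() in DETERMINERS:
              tags.append('AT')
          elif i > 0 and tags[i - 1] == 'AT':
              tags.append('NN')   # rule: a word following a determiner is a noun
          else:
              tags.append('UNK')  # no rule fired
      return tags

  print(rule_tag(['the', 'book', 'fell']))  # ['AT', 'NN', 'UNK']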
46. Evaluating a Tagger
- Tagged tokens = the original data 
- Untag the data 
- Tag the data with your own tagger 
- Compare the original and new tags 
- Iterate over the two lists, checking for identity and counting 
- Accuracy = fraction correct
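- A sketch of that evaluation loop in plain Python (gold is the original tagged data, predicted is your tagger's output on the untagged version):

  def accuracy(gold, predicted):
      # gold and predicted are parallel lists of (word, tag) pairs
      assert len(gold) == len(predicted)
      correct = sum(1 for (_, t1), (_, t2) in zip(gold, predicted) if t1 == t2)
      return correct / len(gold)

  gold      = [('John', 'NN'), ('saw', 'VB'), ('the', 'AT')]
  predicted = [('John', 'NN'), ('saw', 'NN'), ('the', 'AT')]
  print(accuracy(gold, predicted))  # 0.666...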
47. A Look at Tagging Implementations
- It demonstrates how to write classes implementing 
 the interfaces defined by NLTK.
- It provides you with a better understanding of 
 the algorithms and data structures underlying
 each approach to tagging.
- It gives you a chance to see some of the code 
 used to implement NLTK. The developers have tried
 hard to ensure that the implementation of every
 class in NLTK is easy to understand.
48. A Sequential Tagger
- The taggers in this tutorial are implemented as sequential taggers 
- Assigns tags to one token at a time, starting with the first token of the text, and proceeding in sequential order 
- Decides which tag to assign a token on the basis of that token, the tokens that precede it, and the predicted tags for the tokens that precede it 
- To capture this commonality, we define a common base class, SequentialTagger (class SequentialTagger(TaggerI)) 
- The next_tag method (written "next.tag" in the tutorial, a typo) returns the appropriate tag for the next token; each tagger subclass provides its own implementation
49. SequentialTagger.next_tag
- Decides which tag to assign a token, given the list of tagged tokens that precedes it 
- Two arguments: a list of tagged tokens preceding the token to be tagged, and the token to be tagged; it returns the appropriate tag for that token 
- def next_tag(self, tagged_tokens, next_token):
      assert 0, "next_tag not defined by SequentialTagger subclass"
50. SequentialTagger.tag
- def tag(self, text):
      tagged_text = []
      # Tag each token, in sequential order.
      for token in text:
          # Get the tag for the next token.
          tag = self.next_tag(tagged_text, token)
          # Use tag to build a tagged token; add it to tagged_text.
          tagged_token = Token(TaggedType(token.type(), tag), token.loc())
          tagged_text.append(tagged_token)
      return tagged_text
51. Example Subclass: NN_CD_Tagger
- class NN_CD_Tagger(SequentialTagger):
      def __init__(self): pass  # empty constructor
      def next_tag(self, tagged_tokens, next_token):
          # Assign 'CD' for numbers, 'NN' for anything else.
          if re.match(r'^[0-9]+(\.[0-9]+)?$', next_token.type()):
              return 'CD'
          else:
              return 'NN'
- We just define this method; when the tag method is called, the definition given by SequentialTagger will be used.
52. Another Example: UnigramTagger
- class UnigramTagger(TaggerI) 
- class UnigramTagger(SequentialTagger)
53. Unigram Tagger Training
- def train(self, tagged_tokens):
      for token in tagged_tokens:
          outcome = token.type().tag()
          context = token.type().base()
          self._freqdist[context].inc(outcome)
54. Unigram Tagger Tagging
- def next_tag(self, tagged_tokens, next_token):
      context = next_token.type()
      return self._freqdist[context].max()
- E.g., access a context and find the most likely outcome: 
  >>> freqdist['bank'].max()
  'NN'
55. Unigram Tagger Initialization
- The constructor for UnigramTagger simply initializes self._freqdist with a new conditional frequency distribution. 
- def __init__(self):
      self._freqdist = probability.ConditionalFreqDist()
56. For Self-Study
- NthOrderTagger Implementation 
- BackoffTagger Implementation 
57. For Next Time