Part-of-Speech Tagging - PPT Transcript and Presenter's Notes
1
Part-of-Speech Tagging

2
The beginning
  • The task of labeling (or tagging) each word in a
    sentence with its appropriate part of speech.
  • The  representative  put  chairs  on  the  table
    AT   NN              VBD  NNS     IN  AT   NN
    (see the representation sketch after this list)
  • Using the Brown/Penn tag sets
  • A problem of limited scope
  • Instead of constructing a complete parse, fix the
    syntactic categories of the words in a sentence
  • Tagging is a limited but useful application
  • Information extraction
  • Question answering
  • Shallow parsing
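A minimal illustration of how such a tagged sentence might be held in code: just a list of word/tag pairs, using the Brown-style tags from the example above (the Python representation is an assumption for illustration, not part of the original slides).

    # A tagged sentence as a list of (word, tag) pairs.
    tagged = [("The", "AT"), ("representative", "NN"), ("put", "VBD"),
              ("chairs", "NNS"), ("on", "IN"), ("the", "AT"), ("table", "NN")]

    # Recover the plain sentence and the tag sequence separately.
    words = [w for w, t in tagged]
    tags = [t for w, t in tagged]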

3
The Information Sources in Tagging
  • Syntagmatic: look at the tags assigned to nearby
    words; some combinations are highly likely while
    others are highly unlikely or impossible
  • ex) a new play
    AT JJ NN   (likely)
    AT JJ VBP  (unlikely)
  • Lexical: look at the word itself (about 90%
    accuracy just by picking the most likely tag for
    each word; see the baseline sketch after this list)
  • ex) play is more often a noun than a verb
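A sketch of that lexical baseline: pick the single most frequent tag for each word, estimated from a tagged corpus (the toy training data and the noun fallback for unseen words are assumptions for illustration).

    from collections import Counter, defaultdict

    # Hypothetical training data: a list of (word, tag) pairs.
    training = [("the", "AT"), ("play", "NN"), ("play", "NN"),
                ("play", "VB"), ("can", "MD"), ("can", "MD"), ("can", "NN")]

    # Count how often each word occurs with each tag.
    tag_counts = defaultdict(Counter)
    for word, tag in training:
        tag_counts[word][tag] += 1

    # Baseline tagger: always choose the word's most frequent tag.
    def most_frequent_tag(word):
        if word in tag_counts:
            return tag_counts[word].most_common(1)[0][0]
        return "NN"  # crude fallback for unknown words (an assumption)

    print([most_frequent_tag(w) for w in ["the", "play", "can"]])
    # ['AT', 'NN', 'MD']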

4
Notation
  • w_i : the word at position i in the corpus
  • t_i : the tag of w_i
  • w_{i,i+m} : the words occurring at positions i
    through i+m
  • t_{i,i+m} : the tags t_i ... t_{i+m} for w_i ... w_{i+m}
  • w^l : the l-th word in the lexicon
  • t^j : the j-th tag in the tag set
  • C(w^l) : the number of occurrences of w^l in the
    training set
  • C(t^j) : the number of occurrences of t^j in the
    training set
  • C(t^j, t^k) : the number of occurrences of t^j
    followed by t^k
  • C(w^l, t^j) : the number of occurrences of w^l that
    are tagged as t^j
  • T : the number of tags in the tag set
  • W : the number of words in the lexicon
  • n : sentence length

5
The Probabilistic Model (I)
  • The sequence of tags in a text is modeled as a
    Markov chain.
  • A word's tag only depends on the previous tag
    (limited horizon)
  • The dependency does not change over time (time
    invariance)
  • Compact notation for the limited horizon property:
    P(t_{i+1} | t_{1,i}) = P(t_{i+1} | t_i)

6
The Probabilistic Model (II)
  • Maximum likelihood estimate of tag t^k following
    tag t^j:
    P(t^k | t^j) = C(t^j, t^k) / C(t^j)

7
The Probabilistic Model (III)
  • (We define P(t_1 | t_0) = 1.0 to simplify our
    notation)
  • The final equation: choose the tag sequence
    argmax_{t_{1,n}} P(t_{1,n} | w_{1,n})
      = argmax_{t_{1,n}} prod_{i=1..n} P(w_i | t_i) P(t_i | t_{i-1})

8
The Probabilistic Model (III)
  • Algorithm for training a Visible Markov Model
    Tagger
  • Syntagmatic probabilities:
    for all tags t^j do
      for all tags t^k do
        P(t^k | t^j) = C(t^j, t^k) / C(t^j)
      end
    end
  • Lexical probabilities:
    for all tags t^j do
      for all words w^l do
        P(w^l | t^j) = C(w^l, t^j) / C(t^j)
      end
    end
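A minimal Python sketch of this training step, computing the maximum likelihood estimates as relative frequencies (the toy tagged corpus is an assumption for illustration).

    from collections import Counter

    # Hypothetical tagged corpus: one list of (word, tag) pairs.
    corpus = [("the", "AT"), ("representative", "NN"), ("put", "VBD"),
              ("chairs", "NNS"), ("on", "IN"), ("the", "AT"), ("table", "NN")]

    tag_count = Counter(t for _, t in corpus)          # C(t^j)
    word_tag_count = Counter(corpus)                   # C(w^l, t^j)
    tag_pair_count = Counter(                          # C(t^j, t^k)
        (corpus[i][1], corpus[i + 1][1]) for i in range(len(corpus) - 1))

    # Syntagmatic probabilities P(t^k | t^j) = C(t^j, t^k) / C(t^j)
    trans = {(tj, tk): c / tag_count[tj]
             for (tj, tk), c in tag_pair_count.items()}

    # Lexical probabilities P(w^l | t^j) = C(w^l, t^j) / C(t^j)
    emit = {(w, tj): c / tag_count[tj]
            for (w, tj), c in word_tag_count.items()}

    print(trans[("AT", "NN")], emit[("the", "AT")])    # 1.0 1.0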

9
The Probabilistic Model (IV)
<Idealized counts of some tag transitions in the
Brown Corpus>
10
The Probabilistic Model (V)
<Idealized counts for the tags that some words
occur with in the Brown Corpus>
11
The Viterbi algorithm
  • comment: Given a sentence of length n
  • comment: Initialization
    delta_1(PERIOD) = 1.0
    delta_1(t) = 0.0 for t != PERIOD
  • comment: Induction
    for i := 1 to n step 1 do
      for all tags t^j do
        delta_{i+1}(t^j) = max_{1<=k<=T} [delta_i(t^k) P(w_{i+1} | t^j) P(t^j | t^k)]
        psi_{i+1}(t^j) = argmax_{1<=k<=T} [delta_i(t^k) P(w_{i+1} | t^j) P(t^j | t^k)]
      end
    end
  • comment: Termination and path readout (by backtracking)
    X_{n+1} = argmax_{1<=j<=T} delta_{n+1}(t^j)
    for j := n to 1 step -1 do
      X_j = psi_{j+1}(X_{j+1})
    end
    P(X_1, ..., X_n) = max_{1<=j<=T} delta_{n+1}(t^j)
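A compact Python version of the same procedure (the dictionary-based trans/emit parameterization and the zero-probability default for unseen pairs are simplifying assumptions; a real tagger would smooth the estimates).

    def viterbi(sentence, tags, trans, emit, start="PERIOD"):
        """Most likely tag sequence for `sentence`.

        trans[(prev, cur)] : P(cur | prev), the syntagmatic probability
        emit[(word, tag)]  : P(word | tag), the lexical probability
        """
        n = len(sentence)
        # delta[i][t]: probability of the best path ending in tag t at position i
        delta = [{t: 0.0 for t in tags} for _ in range(n + 1)]
        psi = [{} for _ in range(n + 1)]
        delta[0] = {t: (1.0 if t == start else 0.0) for t in tags}

        for i in range(n):
            word = sentence[i]
            for tj in tags:
                best_k, best_p = None, -1.0
                for tk in tags:
                    p = (delta[i][tk]
                         * emit.get((word, tj), 0.0)
                         * trans.get((tk, tj), 0.0))
                    if p > best_p:
                        best_k, best_p = tk, p
                delta[i + 1][tj] = best_p
                psi[i + 1][tj] = best_k

        # Termination and path readout by backtracking.
        last = max(tags, key=lambda t: delta[n][t])
        path = [last]
        for i in range(n, 1, -1):
            path.append(psi[i][path[-1]])
        return list(reversed(path))

To decode the example sentence with the toy trans/emit tables from the training sketch above, the tables would also need start transitions out of PERIOD (e.g. P(AT | PERIOD)), which that toy corpus does not supply.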

12
Variations (I)
  • Unknown words
  • Unknown words are a major problem for taggers
  • The simplest model for unknown words: assume that
    they can be of any part of speech
  • Better: use morphological and other cues (see the
    sketch after this list)
  • Past tense form: words ending in -ed
  • Capitalized: likely a proper noun
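A toy illustration of such cues (the suffix list and the returned tag guesses are assumptions for illustration, not the rules of any particular tagger).

    def guess_tags(word):
        """Return a crude set of candidate tags for an unknown word."""
        candidates = set()
        if word[:1].isupper():
            candidates.add("NP")        # capitalized: maybe a proper noun
        if word.endswith("ed"):
            candidates.add("VBD")       # maybe a past-tense verb form
            candidates.add("VBN")       # or a past participle
        if word.endswith("s"):
            candidates.add("NNS")       # maybe a plural noun
        if not candidates:
            candidates = {"NN"}         # fall back to common noun
        return candidates

    print(guess_tags("frobnicated"))    # e.g. {'VBD', 'VBN'}
    print(guess_tags("Zygmund"))        # {'NP'}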

13
Variations (II)
  • Trigram taggers
  • The basic Markov Model tagger is a bigram tagger
  • A two-tag memory can disambiguate more cases
  • Interpolation and variable memory
  • A trigram tagger may make worse predictions than a
    bigram tagger when counts are sparse
  • Linear interpolation of unigram, bigram, and
    trigram estimates (see the sketch after this list)
  • Variable Memory Markov Model
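A sketch of the linear interpolation idea: mix unigram, bigram, and trigram tag probabilities with fixed weights (the weights and the probability tables below are placeholders; in practice the lambdas are estimated, e.g. on held-out data).

    # Placeholder component models; in a real tagger these come from counts.
    p_uni = {"NN": 0.3}                       # P(t_i)
    p_bi = {("JJ", "NN"): 0.5}                # P(t_i | t_{i-1})
    p_tri = {("AT", "JJ", "NN"): 0.8}         # P(t_i | t_{i-2}, t_{i-1})

    LAMBDAS = (0.2, 0.3, 0.5)                 # assumed fixed weights, sum to 1

    def interpolated(t, t_prev, t_prev2):
        """P(t | t_prev2, t_prev) as a weighted mix of the three estimates."""
        l1, l2, l3 = LAMBDAS
        return (l1 * p_uni.get(t, 0.0)
                + l2 * p_bi.get((t_prev, t), 0.0)
                + l3 * p_tri.get((t_prev2, t_prev, t), 0.0))

    print(interpolated("NN", "JJ", "AT"))     # 0.2*0.3 + 0.3*0.5 + 0.5*0.8 = 0.61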

14
Variations (III)
  • Smoothing: adjust the maximum likelihood estimates
    so that unseen word/tag pairs do not get zero
    probability
  • Reversibility
  • A Markov model tagger gives the same results
    whether it decodes from left to right or from
    right to left

K^l is the number of possible parts of speech of w^l
(used in the smoothing formula)
15
Variations (IV)
  • Maximum likelihood sequence vs. tag-by-tag
  • The Viterbi algorithm maximizes P(t_{1,n} | w_{1,n})
  • Alternatively, maximize P(t_i | w_{1,n}) for each i,
    which amounts to summing over the different tag
    sequences
  • ex) Time flies like an arrow.
  • a. NN VBZ RB AT NN.  P(.) = 0.01
  • b. NN NNS VB AT NN.  P(.) = 0.01
  • c. NN NNS RB AT NN.  P(.) = 0.001
  • d. NN VBZ VB AT NN.  P(.) = 0
  • Tag-by-tag here prefers NNS for flies (0.011 vs.
    0.01) and RB for like (0.011 vs. 0.01), i.e.
    sequence c, even though c itself has low probability
  • With tag-by-tag decoding, one error does not affect
    the tagging of the other words

16
Applying HMMs to POS tagging (I)
  • If we have no training data, we can use an HMM to
    learn the regularities of tag sequences.
  • An HMM consists of the following elements (a small
    sketch follows this list):
  • a set of states ( tags )
  • an output alphabet ( words or classes of words )
  • initial state probabilities
  • state transition probabilities
  • symbol emission probabilities
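A minimal sketch of how these five elements might be held in code, as plain dictionaries (the field names and toy numbers are assumptions for illustration, not any particular library's API).

    from dataclasses import dataclass

    @dataclass
    class HMM:
        states: list        # the tags
        alphabet: list      # words, or classes of words
        initial: dict       # initial[t]         = P(start in tag t)
        transition: dict    # transition[(s, t)] = P(t | s)
        emission: dict      # emission[(t, w)]   = P(w | t)

    # A tiny example instance (numbers made up for illustration).
    hmm = HMM(
        states=["AT", "NN"],
        alphabet=["the", "play"],
        initial={"AT": 0.9, "NN": 0.1},
        transition={("AT", "NN"): 1.0, ("NN", "AT"): 1.0},
        emission={("AT", "the"): 1.0, ("NN", "play"): 1.0},
    )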

17
Applying HMMs to POS tagging (II)
  • Jelinek's method
  • b_{j.l} : the probability that word (or word class)
    l is emitted by tag j

18
Applying HMMs to POS tagging (III)
  • Kupiec's method: group words into equivalence
    classes according to the set of tags they can take
    (see the sketch below)

|L| is the number of indices in L
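A sketch of that grouping step, building equivalence classes from a hypothetical dictionary of admissible tags per word (the dictionary itself is an assumption; the class keys correspond to the index sets L mentioned in the note above).

    from collections import defaultdict

    # Hypothetical lexicon: the tags each word may take.
    admissible = {
        "the": {"AT"},
        "a": {"AT"},
        "play": {"NN", "VB"},
        "walk": {"NN", "VB"},
        "put": {"VB", "VBD", "VBN"},
    }

    # Group words whose admissible tag sets are identical; all words in
    # one class then share their emission parameters.
    classes = defaultdict(list)
    for word, tags in admissible.items():
        classes[frozenset(tags)].append(word)

    for L, words in classes.items():
        print(sorted(L), "->", words)
    # e.g. ['AT'] -> ['the', 'a']
    #      ['NN', 'VB'] -> ['play', 'walk']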
19
Transformation-Based Learning of Tags
  • The Markov assumptions may be too crude, so
    consider transformation-based tagging
  • It can exploit a wider range of lexical and
    syntactic regularities
  • An order of magnitude fewer decisions
  • Two key components:
  • a specification of which error-correcting
    transformations are admissible
  • the learning algorithm

20
Transformation (I)
  • A triggering environment
  • A rewrite rule
  • Form t1 -> t2 : replace t1 by t2

21
Transformation (II)
  • Triggering environments can be conditioned on a
    combination of words and tags
  • Morphology-triggered transformations (see the
    example rule below)
  • ex) Replace NN by NNS if the unknown word's
    suffix is -s
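One way such a rule could be written down in code; the rule representation and the apply_rule helper are illustrative assumptions, not Brill's actual implementation.

    # A transformation: a rewrite rule plus a triggering environment.
    rule = {
        "from": "NN",
        "to": "NNS",
        # triggering environment: the (unknown) word ends in -s
        "trigger": lambda word, prev_tag: word.endswith("s"),
    }

    def apply_rule(rule, tagged):
        """Apply one transformation to a list of (word, tag) pairs."""
        out = []
        prev_tag = None
        for word, tag in tagged:
            if tag == rule["from"] and rule["trigger"](word, prev_tag):
                tag = rule["to"]
            out.append((word, tag))
            prev_tag = tag
        return out

    print(apply_rule(rule, [("the", "AT"), ("chairs", "NN")]))
    # [('the', 'AT'), ('chairs', 'NNS')]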

22
The learning algorithm
  • C_0 := corpus with each word tagged with its most
    frequent tag
  • for k := 0 step 1 do
      v := the transformation u_i that minimizes
           E(u_i(C_k))
      if (E(C_k) - E(v(C_k))) < epsilon then break fi
      C_{k+1} := v(C_k)
      tau_{k+1} := v
    end
  • Output sequence: tau_1, ..., tau_k
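A sketch of this greedy loop in Python; the error function, candidate rule list, apply_rule helper, and epsilon threshold are placeholders supplied by the caller, mirroring the pseudocode above rather than any specific implementation.

    def tbl_learn(corpus, gold, candidates, apply_rule, epsilon=1):
        """Greedy transformation-based learning.

        corpus     : (word, tag) pairs tagged with most frequent tags
        gold       : the reference (word, tag) pairs for the same text
        candidates : list of candidate transformations
        apply_rule : function (rule, corpus) -> retagged corpus
        """
        def errors(c):
            # E(C): number of tagging errors against the gold standard
            return sum(1 for (_, t), (_, g) in zip(c, gold) if t != g)

        learned = []
        while True:
            # pick the transformation that minimizes the resulting error
            best = min(candidates, key=lambda u: errors(apply_rule(u, corpus)))
            if errors(corpus) - errors(apply_rule(best, corpus)) < epsilon:
                break
            corpus = apply_rule(best, corpus)
            learned.append(best)
        return learned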

23
Relation to other models
  • Decision trees
  • Similarity with transformation-based learning:
    a series of relabelings
  • Difference with transformation-based learning:
    a decision tree splits the data at each node, so a
    different sequence of transformations is applied at
    each node
  • Probabilistic models in general

24
Automata
  • Transformation-based tagging has a rule component,
    but it also has a quantitative component (used only
    during learning)
  • Once learning is complete, transformation-based
    tagging is purely symbolic
  • A transformation-based tagger can therefore be
    converted into another symbolic object
  • Roche and Schabes (1995): a finite-state transducer
  • Advantage: speed

25
Other Methods, Other Languages
  • Other approaches to tagging
  • In chapter 16
  • Languages other than English
  • In many other languages, word order is much freer
  • The rich inflections of a word contribute more
    information about part of speech

26
Tagging accuracy
  • 95-97% when calculated over all words
  • Factors to consider:
  • The amount of training data available
  • The tag set
  • The difference between training set and test set
  • Unknown words
  • A dumb tagger
  • Always chooses a word's most frequent tag
  • Accuracy of about 90%
  • EngCG

27
Applications of tagging
  • Applications that benefit from syntactically
    disambiguated text
  • Partial parsing
  • Finding the noun phrases of a sentence
  • Information extraction
  • Finding values for the predefined slots of a
    template
  • Finding good indexing terms in information
    retrieval
  • Question answering
  • Returning an appropriate noun phrase such as a
    location, a person, or a date