6122009 - PowerPoint PPT Presentation

Provided by: neals
Learn more at: https://web.stanford.edu
1
LING 138/238 SYMBSYS 138: Intro to Computer Speech
and Language Processing
  • Lecture 6: Part of Speech Tagging (II)
  • October 14, 2004
  • Neal Snider

Thanks to Dan Jurafsky, Jim Martin, Dekang Lin,
and Bonnie Dorr for some of the examples and
details in these slides!
2
Week 3: Part of Speech Tagging
  • Part of speech tagging
  • Parts of speech
  • What's POS tagging good for anyhow?
  • Tag sets
  • Rule-based tagging
  • Statistical tagging
  • TBL tagging

3
Rule-based tagging
  • Start with a dictionary
  • Assign all possible tags to words from the
    dictionary
  • Write rules by hand to selectively remove tags,
    leaving the correct tag for each word.

4
3 methods for POS tagging
  • Rule-based tagging (ENGTWOL)
  • Stochastic (probabilistic) tagging (HMM, i.e.
    Hidden Markov Model, tagging)
  • Transformation-based tagging (Brill tagger)

5
Statistical Tagging
  • Based on probability theory
  • Today we'll go over a few basic ideas of
    probability theory
  • Then we'll do HMM and TBL tagging.

6
Probability and part of speech tags
  • What's the probability of drawing a 2 from a deck
    of 52 cards with four 2s?
  • What's the probability of a random word (from a
    random dictionary page) being a verb?

7
Probability and part of speech tags
  • What's the probability of a random word (from a
    random dictionary page) being a verb?
  • How to compute each of these:
  • All words = just count all the words in the
    dictionary
  • # of ways to get a verb = number of words which
    are verbs!
  • If a dictionary has 50,000 entries, and 10,000
    are verbs, $P(V) = 10000/50000 = 1/5 = .20$

8
Probability and Independent Events
  • What's the probability of picking two verbs
    randomly from the dictionary?
  • Events are independent, so multiply probs
  • $P(w_1 = V, w_2 = V) = P(V) \times P(V)$
  • $= 1/5 \times 1/5$
  • $= 0.04$
  • What if events are not independent?
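
The dictionary arithmetic from the last two slides can be checked in a few lines of Python. This is only a sketch: the counts are the ones quoted on the slides, and the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the slides: 50,000 dictionary entries, 10,000 of them verbs.
total_words = 50_000
verb_count = 10_000

# P(V) = (number of verbs) / (number of words)
p_verb = Fraction(verb_count, total_words)

# Two independent draws: multiply the individual probabilities.
p_two_verbs = p_verb * p_verb

print(p_verb, float(p_verb))            # 1/5 0.2
print(p_two_verbs, float(p_two_verbs))  # 1/25 0.04
```

Exact fractions are used so that 1/5 × 1/5 comes out as 1/25 rather than a rounded float.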

9
Conditional Probability
  • Written $P(A \mid B)$.
  • Let's say A is "it's raining".
  • Let's say $P(A)$ in drought-stricken California is
    .01
  • Let's say B is "it was sunny ten minutes ago"
  • $P(A \mid B)$ means "what is the probability of it
    raining now if it was sunny 10 minutes ago?"
  • $P(A \mid B)$ is probably way less than $P(A)$
  • Let's say $P(A \mid B)$ is .0001

10
Conditional Probability and Tags
  • $P(\text{Verb})$ is the probability of a randomly
    selected word being a verb.
  • $P(\text{Verb} \mid \text{race})$ is "what's the
    probability of a word being a verb given that it's
    the word race?"
  • Race can be a noun or a verb.
  • It's more likely to be a noun.
  • $P(\text{Verb} \mid \text{race})$ can be estimated by
    looking at some corpus and saying "out of all the
    times we saw race, how many were verbs?"
  • In the Brown corpus, $P(\text{Noun} \mid \text{race}) = 96/98 = .98$,
    so $P(\text{Verb} \mid \text{race}) = 2/98 \approx .02$
  • How to calculate for a tag sequence, say
    $P(\text{NN} \mid \text{DT})$?
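
The corpus estimate described above is a relative frequency, and it can be sketched as follows. The toy data assumes that 96 of the 98 Brown tokens of "race" are the noun reading, which is what the later TBL slide's P(NN|race) = .98 suggests. The function name `p_tag_given_word` is made up for this example.

```python
from collections import Counter

# Toy tagged data standing in for the Brown counts of "race":
# assumed 96 noun (NN) tokens and 2 verb (VB) tokens.
tagged_tokens = [("race", "NN")] * 96 + [("race", "VB")] * 2

pair_counts = Counter(tagged_tokens)
word_counts = Counter(word for word, _ in tagged_tokens)

def p_tag_given_word(tag, word):
    """Relative-frequency estimate: c(word, tag) / c(word)."""
    if word_counts[word] == 0:
        raise ValueError(f"{word!r} was never seen, so no estimate exists")
    return pair_counts[(word, tag)] / word_counts[word]

print(round(p_tag_given_word("NN", "race"), 2))  # 0.98
print(round(p_tag_given_word("VB", "race"), 2))  # 0.02
```

The `ValueError` branch reflects a limit of this kind of estimate: a word that never occurs in the corpus has no counts to divide.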

11
Stochastic Tagging
  • Based on the probability of a certain tag
    occurring given various possibilities
  • Necessitates a training corpus
  • No probabilities for words not in the corpus.
  • The training corpus may be too different from the
    test corpus.

12
Stochastic Tagging (cont.)
  • Simple method: choose the most frequent tag in
    the training text for each word!
  • Result: 90% accuracy
  • Why?
  • Baseline: others will do better
  • HMM is an example
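
A most-frequent-tag baseline of the kind described above might look like the sketch below. The training sentences are invented for illustration (not real corpus data), and the fallback tag for unseen words (`default_tag`) is an assumption of this example, not something the slide specifies.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, model, default_tag="NN"):
    """Tag each word independently; unseen words get default_tag."""
    return [(w, model.get(w, default_tag)) for w in words]

# Tiny made-up training set.
train = [
    [("the", "DT"), ("race", "NN"), ("ended", "VBD")],
    [("the", "DT"), ("race", "NN"), ("started", "VBD")],
    [("to", "TO"), ("race", "VB")],
]
model = train_baseline(train)
print(tag_baseline(["to", "race"], model))
# [('to', 'TO'), ('race', 'NN')]
```

Because the baseline ignores context, "race" after "to" still receives its overall most frequent tag, NN. That is the kind of error a context-sensitive tagger is meant to fix.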

13
HMM Tagger
  • Intuition: pick the most likely tag for this
    word.
  • HMM taggers choose the tag sequence that maximizes
    this formula:
  • $P(\text{word} \mid \text{tag}) \times P(\text{tag} \mid \text{previous } n \text{ tags})$
  • Let $T = t_1, t_2, \ldots, t_n$ and let
    $W = w_1, w_2, \ldots, w_n$. Find the POS
    tags that generate a sequence of words, i.e.,
    look for the most probable sequence of tags $T$
    underlying the observed words $W$.

14
HMM Tagger
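  • $\operatorname{argmax}_T P(T \mid W)$
  • $= \operatorname{argmax}_T P(W \mid T)\,P(T)$ (Bayes' rule)
  • $= \operatorname{argmax}_T P(w_1 \ldots w_n \mid t_1 \ldots t_n)\,P(t_1 \ldots t_n)$
  • Remember, we are trying to find the sequence $T$
    that will maximize $P(T \mid W)$, so this equation is
    calculated over the whole sentence.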
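
The Bayes' rule step skips a denominator, so it may help to see it written out. The hat notation $\hat{T}$ for the winning tag sequence is introduced here and is not from the slides. The derivation assumes $P(W) > 0$.

```latex
\begin{align*}
\hat{T} &= \operatorname*{argmax}_{T} P(T \mid W) \\
        &= \operatorname*{argmax}_{T} \frac{P(W \mid T)\,P(T)}{P(W)} \\
        &= \operatorname*{argmax}_{T} P(W \mid T)\,P(T)
\end{align*}
```

The last step holds because $P(W)$ is the same for every candidate $T$, so dividing by it cannot change which $T$ wins.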

15
HMM Tagger
  • Assume each word depends only on its own POS tag;
    it is independent of the others around it
  • $\operatorname{argmax}_T P(w_1 \mid t_1)\,P(w_2 \mid t_2) \cdots P(w_n \mid t_n)\,P(t_1)\,P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})$

16
Bigram HMM Tagger
  • Also assume that the probability of a tag depends
    only on the previous tag
  • For each word and possible tag, we need to
    calculate
  • $P(t_i) = P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$
  • then multiply these values over the words of the
    sequence, for each possible sequence of tags

17
Bigram HMM Tagger
  • How do we compute $P(t_i \mid t_{i-1})$?
  • $c(t_{i-1}, t_i) / c(t_{i-1})$
  • How do we compute $P(w_i \mid t_i)$?
  • $c(w_i, t_i) / c(t_i)$
  • How do we compute the most probable tag sequence?
  • Viterbi algorithm
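
The two count ratios above can be computed directly from a tagged corpus. The sketch below is one plausible way to do it. The start marker `"<s>"`, the function name `estimate_bigram_hmm`, and the three training sentences are all assumptions of this example. It applies no smoothing.

```python
from collections import Counter

def estimate_bigram_hmm(tagged_sentences, start="<s>"):
    """Relative-frequency estimates for a bigram HMM tagger.

    transition[(prev, tag)] = c(prev, tag) / c(prev)
    emission[(word, tag)]   = c(word, tag) / c(tag)
    """
    tag_counts = Counter()       # c(t), including the start marker
    bigram_counts = Counter()    # c(t_{i-1}, t_i)
    word_tag_counts = Counter()  # c(w_i, t_i)
    for sentence in tagged_sentences:
        prev = start
        tag_counts[start] += 1
        for word, tag in sentence:
            bigram_counts[(prev, tag)] += 1
            word_tag_counts[(word, tag)] += 1
            tag_counts[tag] += 1
            prev = tag
    transition = {bg: n / tag_counts[bg[0]] for bg, n in bigram_counts.items()}
    emission = {wt: n / tag_counts[wt[1]] for wt, n in word_tag_counts.items()}
    return transition, emission

# Made-up training data.
train = [
    [("to", "TO"), ("race", "VB")],
    [("the", "DT"), ("race", "NN")],
    [("to", "TO"), ("run", "VB")],
]
transition, emission = estimate_bigram_hmm(train)
print(transition[("TO", "VB")])  # 1.0
print(emission[("race", "VB")])  # 0.5
```

Unseen tag pairs and word/tag pairs simply have no entry here. A real tagger would need some form of smoothing for them.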

18
An Example
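  • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
    tomorrow/NN
  • People/NNS continue/VBP to/TO inquire/VB the/DT
    reason/NN for/IN the/DT race/NN for/IN outer/JJ
    space/NN
  • to/TO race/???, the/DT race/???
  • $t_i = \operatorname{argmax}_j P(t_j \mid t_{i-1})\,P(w_i \mid t_j)$
  • $i$ = number of the word in the sequence, $j$ =
    number among the possible tags
  • $\max[P(\text{VB} \mid \text{TO})\,P(\text{race} \mid \text{VB}),\ P(\text{NN} \mid \text{TO})\,P(\text{race} \mid \text{NN})]$
  • Brown:
  • $P(\text{NN} \mid \text{TO}) = .021$, $P(\text{race} \mid \text{NN}) = .00041$, product: .000007
  • $P(\text{VB} \mid \text{TO}) = .34$, $P(\text{race} \mid \text{VB}) = .00003$, product: .00001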
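
The comparison for "to race" can be reproduced with the four Brown figures quoted on the slide. This is a sketch with an illustrative helper name (`best_tag`). Multiplying the rounded figures gives about .0000086 for NN rather than the slide's .000007, presumably because the quoted inputs are rounded. The ranking is the same either way.

```python
# Figures quoted on the slide (Brown corpus).
p_tag_given_prev = {("VB", "TO"): 0.34, ("NN", "TO"): 0.021}
p_word_given_tag = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}

def best_tag(word, prev_tag, candidates):
    """Return the candidate t_j maximizing P(t_j | prev_tag) * P(word | t_j)."""
    scores = {
        t: p_tag_given_prev[(t, prev_tag)] * p_word_given_tag[(word, t)]
        for t in candidates
    }
    return max(scores, key=scores.get), scores

tag, scores = best_tag("race", "TO", ["VB", "NN"])
print(tag)  # VB
```

The verb reading wins even though "race" is far more often a noun, because a verb is much more likely than a noun after TO.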

19
Viterbi Algorithm
[Figure: diagram with states S1, S2, S3, S4, S5]
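
The slide's diagram did not survive, so the following is a generic textbook-style Viterbi sketch for a bigram HMM, not necessarily the formulation shown in the lecture. The transition and emission figures for "race" are the Brown numbers from the earlier example. The start probability and the emission for "to" are made up so that the example runs.

```python
def viterbi(words, tags, trans, emit, start="<s>"):
    """Most probable tag sequence under a bigram HMM.

    trans[(prev_tag, tag)] and emit[(word, tag)] are probabilities;
    missing entries count as 0.
    """
    if not words:
        return [], 1.0
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               previous tag on that path)
    best = [{t: (trans.get((start, t), 0.0) * emit.get((words[0], t), 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = emit.get((words[i], t), 0.0)
            column[t] = max(
                ((best[i - 1][p][0] * trans.get((p, t), 0.0) * e, p) for p in tags),
                key=lambda pair: pair[0],
            )
        best.append(column)
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path)), best[-1][last][0]

tags = ["TO", "VB", "NN"]
trans = {("<s>", "TO"): 1.0, ("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
emit = {("to", "TO"): 1.0, ("race", "VB"): 0.00003, ("race", "NN"): 0.00041}
path, prob = viterbi(["to", "race"], tags, trans, emit)
print(path)  # ['TO', 'VB']
```

Each column keeps only the best path into each tag, so the work grows linearly with sentence length instead of enumerating every tag sequence. Longer inputs would normally use log probabilities to avoid underflow.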
20
Transformation-Based Tagging (Brill Tagging)
  • Combination of rule-based and stochastic tagging
    methodologies
  • Like rule-based because rules are used to specify
    tags in a certain environment
  • Like stochastic approach because machine learning
    is used, with a tagged corpus as input
  • Input:
  • tagged corpus
  • dictionary (with most frequent tags)

21
Transformation-Based Tagging (cont.)
  • Basic idea:
  • Set the most probable tag for each word as a
    start value
  • Change tags according to rules of type "if word-1
    is a determiner and word is a verb then change
    the tag to noun" in a specific order
  • Training is done on a tagged corpus:
  • 1. Write a set of rule templates
  • 2. Among the set of rules, find one with highest
    score
  • 3. Continue from 2 until lowest score threshold is
    passed
  • 4. Keep the ordered set of rules
  • Rules make errors that are corrected by later
    rules

22
TBL Rule Application
  • Tagger labels every word with its most-likely tag
  • For example, race has the following probabilities
    in the Brown corpus:
  • $P(\text{NN} \mid \text{race}) = .98$
  • $P(\text{VB} \mid \text{race}) = .02$
  • Transformation rules make changes to tags
  • "Change NN to VB when previous tag is TO":
    is/VBZ expected/VBN to/TO race/NN tomorrow/NN
    becomes
    is/VBZ expected/VBN to/TO race/VB tomorrow/NN
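
Applying a single transformation of this shape is simple to sketch. The function name `apply_rule` is invented for illustration. Scanning left to right and letting earlier changes affect later positions is a choice made for this example, not something the slide states.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous tag is prev_tag."""
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        # The previous position may already have been rewritten by this rule.
        if tag == from_tag and result[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result

sentence = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]
print(apply_rule(sentence, "NN", "VB", "TO"))
# [('is', 'VBZ'), ('expected', 'VBN'), ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```

Only "race" changes: "tomorrow" is also NN, but its previous tag is not TO.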

23
TBL Rule Learning
  • 2 parts to a rule:
  • Triggering environment
  • Rewrite rule
  • The range of triggering environments of templates
    (from Manning & Schütze 1999: 363)

[Table: schemas 1-9 (rows) against tag positions $t_{i-3}$, $t_{i-2}$, $t_{i-1}$, $t_i$, $t_{i+1}$, $t_{i+2}$, $t_{i+3}$ (columns)]

24
TBL: The Tagging Algorithm
  • Step 1: Label every word with its most likely tag
    (from dictionary)
  • Step 2: Check every possible transformation and
    select the one which most improves tagging
  • Step 3: Re-tag corpus applying the rules
  • Repeat 2-3 until some criterion is reached, e.g.,
    X% correct with respect to training corpus
  • RESULT: Sequence of transformation rules
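
The three steps can be sketched as a greedy loop. This is a deliberately small illustration, not Brill's implementation. It uses one hypothetical template ("change X to Y when the previous tag is Z") and treats the corpus as one flat token sequence. It stops when no rule removes at least `min_gain` errors. The toy corpus and all names are assumptions of this example.

```python
from collections import Counter
from itertools import product

def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, prev_tag); returns a new tag list."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def errors(current, gold):
    """Number of positions where the current tag differs from gold."""
    return sum(c != g for c, g in zip(current, gold))

def learn_tbl(words, gold, min_gain=1):
    # Step 1: label every word with its most frequent tag in the training data.
    by_word = {}
    for w, g in zip(words, gold):
        by_word.setdefault(w, Counter())[g] += 1
    current = [by_word[w].most_common(1)[0][0] for w in words]

    tagset = sorted(set(gold))
    rules = []
    while True:
        # Step 2: score every instantiation of the template by error reduction.
        base = errors(current, gold)
        best_rule, best_gain = None, 0
        for rule in product(tagset, repeat=3):
            if rule[0] == rule[1]:
                continue
            gain = base - errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None or best_gain < min_gain:
            break
        # Step 3: re-tag the corpus with the winning rule and record it.
        current = apply_rule(current, best_rule)
        rules.append(best_rule)
    return rules, current

words = "to race the race the race to race the race".split()
gold = ["TO", "VB", "DT", "NN", "DT", "NN", "TO", "VB", "DT", "NN"]
rules, final_tags = learn_tbl(words, gold)
print(rules)  # [('NN', 'VB', 'TO')]
```

On this toy corpus "race" starts out as NN everywhere, and the single learned rule is the slide's "change NN to VB when previous tag is TO".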

25
TBL Rule Learning (cont.)
  • Problem: could apply transformations ad
    infinitum!
  • Constrain the set of transformations with
    templates
  • Replace tag X with tag Y, provided tag Z or word
    Z appears in some position
  • Rules are learned in ordered sequence
  • Rules may interact.
  • Rules are compact and can be inspected by humans

26
Templates for TBL
27
TBL Problems
  • First 100 rules achieve 96.8% accuracy; first 200
    rules achieve 97.0% accuracy
  • Execution speed: TBL tagger is slower than HMM
    approach
  • Learning speed: Brill's implementation takes over
    a day (600k tokens)
  • BUT
  • (1) Learns a small number of simple,
    non-stochastic rules
  • (2) Can be made to work faster with FSTs
    (finite-state transducers)
  • (3) Best performing algorithm on unknown words

28
Tagging Unknown Words
  • New words added to (newspaper) language: 20 per
    month
  • Plus many proper names
  • Increases error rates by 1-2%
  • Method 1: assume they are nouns
  • Method 2: assume the unknown words have a
    probability distribution similar to words only
    occurring once in the training set.
  • Method 3: use morphological information, e.g.,
    words ending with -ed tend to be tagged VBN.
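
Methods 1 and 3 can be combined in a small fallback routine, sketched below. Only the "-ed tends to be VBN" pairing comes from the slide. The other suffix guesses, the minimum-stem-length check, and the names `SUFFIX_TAGS` and `guess_tag` are illustrative assumptions.

```python
# Hypothetical suffix-to-tag guesses; only ("ed", "VBN") is from the slide.
SUFFIX_TAGS = [("ed", "VBN"), ("ing", "VBG"), ("ly", "RB"), ("s", "NNS")]

def guess_tag(word, known_tags, default="NN"):
    """Known words keep their tag; unknown ones are guessed from suffixes."""
    if word in known_tags:
        return known_tags[word]
    lowered = word.lower()
    for suffix, tag in SUFFIX_TAGS:
        # Require a stem of at least two characters before the suffix.
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 1:
            return tag
    return default  # Method 1: fall back to noun

known = {"race": "NN"}
print(guess_tag("blorked", known))  # VBN
print(guess_tag("zorp", known))     # NN
```

A hand-written suffix list like this is only a starting point. Method 2 on the slide would instead learn the distribution from words seen once in training.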

29
Using Morphological Information
30
Evaluation
  • The result is compared with a manually coded
    Gold Standard
  • Typically accuracy reaches 96-97%
  • This may be compared with the result for a baseline
    tagger (one that uses no context).
  • Important: 100% is impossible even for human
    annotators.
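
Token-level accuracy against a gold standard can be computed as in this sketch. The tag sequences are made up for illustration, and the function name `accuracy` is not from the slides.

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold-standard tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

gold = ["VBZ", "VBN", "TO", "VB", "NN"]
baseline = ["VBZ", "VBN", "TO", "NN", "NN"]  # hypothetical no-context output
system = ["VBZ", "VBN", "TO", "VB", "NN"]    # hypothetical context-aware output
print(accuracy(baseline, gold))  # 0.8
print(accuracy(system, gold))    # 1.0
```

Scoring the baseline and the system against the same gold tags shows how much of the final figure is due to context.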