Generative and Discriminative Models in NLP: A Survey

Transcript and Presenter's Notes
1
Generative and Discriminative Models in NLP: A Survey
  • Kristina Toutanova
  • Computer Science Department
  • Stanford University

2
Motivation
  • Many problems in natural language processing are
    disambiguation problems
  • word senses
  • jaguar - a big cat, a car, the name of a Java package
  • line - phone, queue, in mathematics, airline, etc.
  • part-of-speech tags (noun, verb, proper noun,
    etc.)
  • Example: Joy makes progress every day .
    (candidate tags: Joy - NN or NNP, makes - VBZ or NNS, progress - NN or VB, every - DT, day - NN)
3
Motivation
  • Parsing: choosing preferred phrase structure trees for sentences, corresponding
    to the likely semantics
  • Possible approaches to disambiguation
  • Encode knowledge about the problem, define rules,
    hand-engineer grammars and patterns (requires
    much effort, not always possible to have
    categorical answers)
  • Treat the problem as a classification task and
    learn classifiers from labeled training data

[Parse tree for "I saw Mary with the telescope", illustrating the PP-attachment ambiguity: the PP "with the telescope" can attach to the VP ("saw") or to the NP ("Mary")]
4
Overview
  • General ML perspective
  • Examples
  • The case of Part-of-Speech Tagging
  • The case of Syntactic Parsing
  • Conclusions

5
The Classification Problem
  • Given a training set of i.i.d. samples T = {(X1,Y1), ..., (Xn,Yn)} of input and
    class variables drawn from an unknown distribution D(X,Y), estimate a function
    h(X) that predicts the class from the input variables
  • The goal is to come up with a hypothesis h with minimum expected loss
    (usually 0-1 loss)
  • Under 0-1 loss the hypothesis with minimum expected loss is the Bayes optimal
    classifier h(x) = argmax over y of D(y|x)

6
Approaches to Solving Classification Problems - I
  • 1. Generative. Try to estimate the probability distribution of the data D(X,Y)
  • specify a parametric model family
  • choose parameters by maximum likelihood on training data
  • estimate conditional probabilities P(Y|X) by Bayes rule
  • classify new instances to the most probable class Y according to the estimated
    P(Y|X) (equivalently, the largest estimated P(X,Y))

7
Approaches to Solving Classification Problems - II
  • 2. Discriminative. Try to estimate the conditional distribution D(Y|X) from data
  • specify a parametric model family
  • estimate parameters by maximum conditional likelihood of the training data
  • classify new instances to the most probable class Y according to the estimated
    P(Y|X)
  • 3. Discriminative, distribution-free. Try to estimate a classifier h(X) directly
    from data so that its expected loss will be minimized (a minimal sketch of the
    three approaches follows below)
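
To make these three approaches concrete, here is a small illustrative sketch (mine, not from the slides) that fits a generative model (Naive Bayes), a conditional discriminative model (logistic regression), and a distribution-free discriminative model (a linear SVM minimizing hinge loss) on the same synthetic data with scikit-learn:

  # Sketch only: the dataset and models are stand-ins for the three approaches above.
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.naive_bayes import GaussianNB           # 1. generative: models P(X, Y), classifies via Bayes rule
  from sklearn.linear_model import LogisticRegression  # 2. discriminative: models P(Y | X) by conditional likelihood
  from sklearn.svm import LinearSVC                    # 3. distribution-free: minimizes a surrogate of 0-1 loss

  X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

  for name, clf in [("generative (NB)", GaussianNB()),
                    ("conditional (LogReg)", LogisticRegression(max_iter=1000)),
                    ("distribution-free (SVM)", LinearSVC())]:
      clf.fit(X_tr, y_tr)
      print(f"{name:25s} test accuracy = {clf.score(X_te, y_te):.3f}")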

8
Axes for comparison of different approaches
  • Asymptotic accuracy
  • Accuracy for limited training data
  • Speed of convergence to the best hypothesis
  • Complexity of training
  • Modeling ease

9
Generative-Discriminative Pairs
  • Definition: If a generative and a discriminative parametric model family can
    represent the same set of conditional probability distributions, they are a
    generative-discriminative pair
  • Example Naïve Bayes and Logistic Regression

[Naive Bayes graphical model: class Y with feature children X1 and X2]
10
Comparison of Naïve Bayes and Logistic Regression
  • The NB assumption that features are independent
    given the class is not made by logistic
    regression
  • The logistic regression model is more general
    because it allows a larger class of probability
    distributions for the features given classes

11
Example: Traffic Lights
Reality (observed relative frequencies):
  Lights Working: P(g,r,w) = 3/7, P(r,g,w) = 3/7
  Lights Broken: P(r,r,b) = 1/7
  • Model assumptions are false!
  • Joint-likelihood (JL) and conditional-likelihood (CL) estimates differ
  • JL estimates: P(w) = 6/7, P(r|w) = 1/2, P(r|b) = 1
  • CL estimates: P̂(r|w) = 1/2, P̂(r|b) = 1, but P̂(w) is left as a free parameter
    that conditional likelihood pushes toward 0

NB model:
[Graphical model: class variable Working? with the two lights NS and EW as children]
12
Joint Traffic Lights
Joint distribution assigned by the NB model trained by joint (maximum) likelihood:
  Lights Working: 3/14 for each of the four (NS, EW) configurations
  Lights Broken: 2/14 for (r,r); 0 for the other configurations
So the jointly trained model prefers "working" for the observation (r,r): P(working | r,r) = 3/5.
13
Conditional Traffic Lights
Joint distribution implied by the conditionally trained NB model:
  Lights Working: P̂(w)/4 for each of the four (NS, EW) configurations
  Lights Broken: 1 - P̂(w) for (r,r); 0 for the other configurations
Conditional training drives P̂(w) toward 0, so the model gets P(broken | r,r) right
(the sketch below reproduces these numbers).
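
As a check on the traffic-lights numbers, here is a small stand-alone sketch (mine, not from the slides) that fits the Naive Bayes model by joint maximum likelihood and prints the joint table and the posterior for the ambiguous observation (r, r):

  # Joint-likelihood (MLE) training of the traffic-lights Naive Bayes model.
  from itertools import product

  # Observed data: (NS light, EW light, class), 7 samples as on the "Reality" slide.
  data = [("g", "r", "w")] * 3 + [("r", "g", "w")] * 3 + [("r", "r", "b")]

  n = len(data)
  p_class = {c: sum(1 for d in data if d[2] == c) / n for c in ("w", "b")}

  def cond(idx, val, c):
      """MLE of P(light at position idx = val | class c)."""
      rows = [d for d in data if d[2] == c]
      return sum(1 for d in rows if d[idx] == val) / len(rows)

  def joint(ns, ew, c):
      """NB joint probability P(c) * P(ns | c) * P(ew | c)."""
      return p_class[c] * cond(0, ns, c) * cond(1, ew, c)

  for c in ("w", "b"):
      for ns, ew in product("gr", repeat=2):
          print(f"P({ns},{ew},{c}) = {joint(ns, ew, c):.4f}")  # 3/14 each for w; 2/14 for (r,r,b)

  # Posterior for (r, r): the jointly trained model prefers "working" (0.6),
  # even though (r, r) only ever occurred with broken lights.
  p_w = joint("r", "r", "w") / (joint("r", "r", "w") + joint("r", "r", "b"))
  print(f"P(working | r,r) = {p_w:.3f}")
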
14
Comparison of Naïve Bayes and Logistic Regression
Naïve Bayes (NB) vs. Logistic Regression (LR)
  • Accuracy: NB - lower when its assumptions are violated; LR - higher
  • Convergence: NB - faster; LR - slower
  • Training speed: NB - faster; LR - slower
  • Model assumptions: NB - independence of features given the class; LR - linear log-odds
  • Advantages: NB - faster convergence, uses information in P(X), faster training; LR - more robust and accurate because of fewer assumptions
  • Disadvantages: NB - large bias if the independence assumptions are very wrong; LR - harder parameter estimation problem, ignores information in P(X)
15
Some Experimental Comparisons
[Plots of test error vs. training data size, comparing Naïve Bayes and logistic regression]
Ng & Jordan 2002 (15 datasets from the UCI ML repository); Klein & Manning 2002 (WSD, "line" and "hard" data)
16
Part-of-Speech Tagging
  • POS tagging is determining the part of speech
    of every word in a sentence.
  • Example: Joy makes progress every day .
    (candidate tags: Joy - NN or NNP, makes - VBZ or NNS, progress - NN or VB, every - DT, day - NN)
  • Sequence classification problem with 45 classes (Penn Treebank). Accuracies are
    high, around 97%! Some argue it can't go much higher
  • Existing approaches
  • rule-based (hand-crafted, TBL)
  • generative (HMM)
  • discriminative (maxent, memory-based, decision tree, neural network, linear
    models (boosting, perceptron))

17
Part-of-Speech Tagging: Useful Features
  • The complete solution of the problem requires
    full syntactic and semantic understanding of
    sentences
  • In most cases, information about surrounding words/tags is a strong disambiguator
  • The long fenestration was tiring .
  • Useful features
  • tags of previous/following words, e.g. P(NN|JJ) = 0.45, P(VBP|JJ) = 0.0005
  • identity of word being tagged/surrounding words
  • suffix/prefix for unknown words, hyphenation,
    capitalization
  • longer distance features
  • others we haven't figured out yet

18
HMM Tagging Models - I
  • Independence Assumptions
  • ti is independent of t1 ... ti-2 and w1 ... wi-1 given ti-1
  • words are independent given their tags

[HMM graphical model: tag chain t1 -> t2 -> t3, with each tag ti emitting word wi]
States can be single tags, pairs of successive tags, or variable-length sequences of
the last tags (a Viterbi decoding sketch follows at the end of this slide).
[Unknown-word model (Weischedel et al. 93): the tag t generates capitalization, suffix, and hyphenation features of the unknown word]
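
To make the HMM tagging model concrete, here is a minimal Viterbi decoding sketch for a bigram HMM with made-up toy parameters (real taggers such as Brants 2000 use trigram transitions and smoothing):

  # Bigram HMM sketch: P(tags, words) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i).
  # Tags, words, and all probability tables below are toy illustrations.
  import numpy as np

  tags = ["NN", "VBZ", "DT"]
  start = np.log(np.array([0.5, 0.2, 0.3]))       # P(t_1)
  trans = np.log(np.array([[0.3, 0.5, 0.2],       # trans[i, j] = P(t_j | t_i)
                           [0.4, 0.1, 0.5],
                           [0.9, 0.05, 0.05]]))
  emit = {"dog":   np.log([0.6, 0.01, 0.01]),     # P(word | tag), unsmoothed toy values
          "barks": np.log([0.1, 0.7, 0.01]),
          "the":   np.log([0.01, 0.01, 0.9])}

  def viterbi(words):
      """Most likely tag sequence under the bigram HMM (computed in log space)."""
      n, k = len(words), len(tags)
      delta = np.full((n, k), -np.inf)            # best log score ending in tag j at position i
      back = np.zeros((n, k), dtype=int)
      delta[0] = start + emit[words[0]]
      for i in range(1, n):
          scores = delta[i - 1][:, None] + trans + emit[words[i]][None, :]
          back[i] = scores.argmax(axis=0)
          delta[i] = scores.max(axis=0)
      seq = [int(delta[-1].argmax())]
      for i in range(n - 1, 0, -1):               # follow back-pointers
          seq.append(int(back[i][seq[-1]]))
      return [tags[j] for j in reversed(seq)]

  print(viterbi(["the", "dog", "barks"]))         # -> ['DT', 'NN', 'VBZ']
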
19
HMM Tagging Models - Brants 2000
  • Highly competitive with other state-of-the-art models
  • Trigram HMM with smoothed transition probabilities
  • Capitalization becomes part of the state: each tag state is split in two, e.g.
    NN -> <NN, cap>, <NN, not cap>
  • Suffix features for unknown words

[Suffix model for unknown words: the tag t generates the final letters suffix1 ... suffixn of the word]
20
CMM Tagging Models
  • Independence Assumptions
  • ti is independent of t1 ... ti-2 and w1 ... wi-1 given ti-1
  • ti is independent of all following observations
  • no independence assumptions on the observation sequence

[CMM graphical model: each tag ti depends on the previous tag ti-1 and on the observation wi]
  • Dependence of the current tag on previous and future observations can be added;
    overlapping features of the observation can be taken as predictors (see the
    sketch below)
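
For illustration only (not Ratnaparkhi's exact feature set or training setup), a CMM-style tagger can be built from a local maximum-entropy classifier over overlapping word features plus the previous tag, decoded greedily left to right:

  # CMM-style tagger sketch: a local maxent (logistic regression) classifier over
  # overlapping observation features plus the previous tag, decoded greedily.
  from sklearn.feature_extraction import DictVectorizer
  from sklearn.linear_model import LogisticRegression

  def features(words, i, prev_tag):
      w = words[i]
      return {"w": w, "suffix3": w[-3:], "cap": w[0].isupper(),
              "prev_w": words[i - 1] if i else "<S>",
              "next_w": words[i + 1] if i + 1 < len(words) else "</S>",
              "prev_tag": prev_tag}

  # Tiny hand-made training set, just to make the sketch runnable.
  train = [[("Joy", "NNP"), ("makes", "VBZ"), ("progress", "NN")],
           [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")]]

  X, y = [], []
  for sent in train:
      words, prev = [w for w, _ in sent], "<S>"
      for i, (_, tag) in enumerate(sent):
          X.append(features(words, i, prev))
          y.append(tag)
          prev = tag

  vec = DictVectorizer()
  clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

  def tag(words):
      prev, out = "<S>", []
      for i in range(len(words)):
          t = clf.predict(vec.transform([features(words, i, prev)]))[0]
          out.append(t)
          prev = t
      return list(zip(words, out))

  print(tag(["The", "dog", "makes", "progress"]))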

21
MEMM Tagging Models -II
  • Ratnaparkhi (1996)
  • local distributions are estimated using maximum
    entropy models
  • used previous two tags, current word, previous
    two words, next two words
  • suffix, prefix, hyphenation, and capitalization
    features for unknown words

Model                              Overall Accuracy    Unknown Words
HMM (Brants 2000)                  96.7                85.5
MEMM (Ratnaparkhi 1996)            96.63               85.56
MEMM (Toutanova & Manning 2000)    96.86               86.91
22
HMM vs CMM I
Johnson (2001) compared three tagging model structures, shown on the slide as
graphical models over tags tj, tj+1 and words wj, wj+1; their accuracies were
95.5, 94.4, and 95.3.
23
HMM vs CMM - II
  • The per-state conditioning of the CMM has been observed to exhibit label bias
    (Bottou, Lafferty) and observation bias (Klein & Manning)
  • Klein & Manning (2002)

HMM      CMM      CMM (unambiguous words unobserved)
91.23    89.22    90.44

Unobserving words with unambiguous tags improved performance significantly.
24
Conditional Random Fields (Lafferty et al 2001)
  • Models that are globally conditioned on the observation sequence: they define a
    distribution P(Y|X) of the tag sequence given the word sequence (a scoring
    sketch follows below)
  • No independence assumptions about the
    observations no need to model their distribution
  • The labels can depend on past and future
    observations
  • Avoids the independence assumption of CMMs that
    labels are independent of future observations and
    thus the label and observation bias problems
  • The parameter estimation problem is much harder
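
As a sketch of what the global conditional model looks like (illustrative random parameters, not a trained tagger), the score of a whole tag sequence is normalized over all tag sequences using the forward algorithm:

  # Linear-chain CRF sketch: P(y | x) = exp(score(x, y)) / Z(x), where
  # score(x, y) = sum_i [ emit[i, y_i] + trans[y_{i-1}, y_i] ].
  # The emission scores may depend on arbitrary overlapping features of the whole
  # observation sequence x; here they are just random illustrative numbers.
  import numpy as np
  from itertools import product

  rng = np.random.default_rng(0)
  n_pos, n_tags = 4, 3                        # sentence length, tag set size
  emit = rng.normal(size=(n_pos, n_tags))     # score of tag j at position i given all of x
  trans = rng.normal(size=(n_tags, n_tags))   # score of transition from tag j to tag k

  def sequence_score(y):
      s = emit[0, y[0]]
      for i in range(1, n_pos):
          s += trans[y[i - 1], y[i]] + emit[i, y[i]]
      return s

  def log_partition():
      """log Z(x) via the forward algorithm (log-sum-exp over all tag sequences)."""
      alpha = emit[0].copy()
      for i in range(1, n_pos):
          alpha = emit[i] + np.logaddexp.reduce(alpha[:, None] + trans, axis=0)
      return np.logaddexp.reduce(alpha)

  y = [0, 2, 1, 1]                                       # one particular tag sequence
  print(f"P(y | x) = {np.exp(sequence_score(y) - log_partition()):.4f}")

  # Sanity check: the probabilities of all 3**4 tag sequences sum to 1.
  total = sum(np.exp(sequence_score(list(s)) - log_partition())
              for s in product(range(n_tags), repeat=n_pos))
  print(f"sum over all sequences = {total:.4f}")         # ~ 1.0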

25
CRF - II
[Chain CRF graphical model: tags t1 - t2 - t3 linked as a chain, all conditioned on the words w1, w2, w3]
  • HMM and this chain CRF form a generative-discriminative pair
  • Independence assumptions: a tag is independent of all other tags in the sequence
    given its neighbors and the word sequence

26
CRF-Experimental Results
Model                            Accuracy    Unknown Word Accuracy
HMM                              94.31       54.01
CMM (MEMM)                       93.63       45.39
CRF                              94.45       51.95
CMM (MEMM) + spelling features   95.19       73.01
CRF + spelling features          95.73       76.24
27
Discriminative Tagging Model Voted Perceptron
  • Collins (2002): best reported tagging results on WSJ
  • Uses all features used by Ratnaparkhi (96)
  • Learns a linear scoring function of a global feature vector: score(w, t) = Φ(w, t) · α
  • Classifies according to the highest-scoring tag sequence (argmax over tag
    sequences t)
  • Accuracy: MEMM (Ratnaparkhi 96) 96.72%, Voted Perceptron 97.11% (a training
    sketch follows below)
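
Here is a minimal sketch (toy features and data, exhaustive decoding instead of Viterbi) of Collins-style averaged perceptron training for tagging:

  # Collins-style (averaged) perceptron sketch for tagging. Global features here
  # are simple emission/transition counts; a real tagger would use Ratnaparkhi's
  # feature set and Viterbi decoding instead of exhaustive search.
  from collections import Counter
  from itertools import product

  TAGS = ["DT", "NN", "VBZ"]

  def phi(words, tags):
      """Global feature vector Phi(w, t), represented as a Counter of counts."""
      f, prev = Counter(), "<S>"
      for w, t in zip(words, tags):
          f[("emit", w, t)] += 1
          f[("trans", prev, t)] += 1
          prev = t
      return f

  def score(weights, words, tags):
      return sum(weights.get(k, 0.0) * v for k, v in phi(words, tags).items())

  def decode(weights, words):
      # Exhaustive argmax over tag sequences (fine for toy sentences).
      return max(product(TAGS, repeat=len(words)),
                 key=lambda t: score(weights, words, t))

  train = [(["the", "dog", "barks"], ("DT", "NN", "VBZ")),
           (["the", "cat", "sleeps"], ("DT", "NN", "VBZ"))]

  weights, total, epochs = Counter(), Counter(), 5
  for epoch in range(epochs):
      for words, gold in train:
          pred = decode(weights, words)
          if pred != gold:                       # perceptron update on mistakes
              weights.update(phi(words, gold))
              weights.subtract(phi(words, pred))
          total.update(weights)                  # accumulate for averaging

  avg = {k: v / (epochs * len(train)) for k, v in total.items()}
  print(decode(avg, ["the", "dog", "sleeps"]))   # -> ('DT', 'NN', 'VBZ')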

28
Summary of Tagging Review
  • For tagging, the change from generative to
    discriminative model does not by itself result
    in great improvement (e.g. HMM and CRF)
  • One profits from discriminative models for specifying dependence on overlapping
    features of the observation, such as spelling, suffix analysis, etc.
  • The CMM model allows integration of rich features of the observations, but
    suffers strongly from assuming independence from following observations; this
    effect can be relieved by adding dependence on following words
  • This additional power (of the CMM, CRF, and perceptron models) has been shown to
    result in improvements in accuracy, though not dramatic ones (up to 11% error
    reduction)
  • The higher accuracy of discriminative models comes at the price of much slower
    training
  • More research is needed on specifying useful features (or perhaps tagging the
    WSJ Penn Treebank is a noisy task and the limit has been reached)

29
Parsing Models
  • Syntactic parsing is the task of assigning a
    parse tree to a sentence corresponding to its
    most likely interpretation
  • Existing approaches
  • hand-crafted rule-based heuristic methods
  • probabilistic generative models
  • conditional probabilistic discriminative models
  • discriminative ranking models

[Parse tree for "I saw Mary with the telescope"]
30
Generative Parsing Models
  • Generative models based on PCFG grammars learned from corpora are still among
    the best performing (Collins 97, Charniak 97, 00): 88%-89% labeled
    precision/recall
  • The generative models learn a distribution P(X,Y) on <sentence, parse tree>
    pairs and select the single most likely parse for a sentence X (the tree Y
    maximizing the estimated P(X,Y))
  • Easy to train using relative frequency estimation (RFE) for maximum likelihood
    (see the sketch below)
  • These models have the advantage of being usable as language models
    (Chelba & Jelinek 00, Charniak 00)
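
A minimal sketch of relative frequency estimation for a PCFG, using two hand-written trees in place of a treebank (illustrative only):

  # Relative frequency estimation (RFE) for a PCFG: count each rule A -> beta in
  # the treebank and divide by the total count of rules whose left-hand side is A.
  from collections import Counter

  # Trees as nested tuples: (label, child, child, ...); leaves are plain strings.
  treebank = [
      ("S", ("NP", ("NNP", "Marks")), ("VP", ("VBD", "bought"), ("NP", ("NNP", "Brooks")))),
      ("S", ("NP", ("NNP", "Brooks")), ("VP", ("VBD", "slept"))),
  ]

  rule_counts, lhs_counts = Counter(), Counter()

  def collect(node):
      if isinstance(node, str):                 # a word, not a rule
          return
      lhs, children = node[0], node[1:]
      rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
      rule_counts[(lhs, rhs)] += 1
      lhs_counts[lhs] += 1
      for c in children:
          collect(c)

  for tree in treebank:
      collect(tree)

  # P(A -> beta) = count(A -> beta) / count(A)
  pcfg = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
  for (lhs, rhs), p in sorted(pcfg.items()):
      print(f"{lhs} -> {' '.join(rhs)}    {p:.2f}")   # e.g. VP -> VBD NP gets 0.50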

31
Generative History-Based Model - Collins 97
  • Accuracy (sentences < 100 words): 88.1% LP, 87.5% LR
[Lexicalized parse tree for "Last week Marks bought Brooks": TOP -> S(bought);
S(bought) -> NP(week) NP-C(Marks) VP(bought); NP(week) -> JJ(Last) NN(week);
NP-C(Marks) -> NNP(Marks); VP(bought) -> VBD(bought) NP-C(Brooks);
NP-C(Brooks) -> NNP(Brooks)]
32
Discriminative models
  • Shift-reduce parser - Ratnaparkhi (98)
  • Learns a distribution P(T|S) of parse trees given sentences using the sequence
    of actions of a shift-reduce parser
  • Uses a maximum entropy model to learn the conditional distribution of a parse
    action given the history
  • Suffers from independence assumptions that actions are independent of future
    observations, as in the CMM
  • Higher parameter estimation cost to learn the local maximum entropy models
  • Lower but still good accuracy: 86%-87% labeled precision/recall

33
Discriminative Models - Distribution-Free Re-ranking
  • Represent sentence-parse tree pairs by a feature vector F(X,Y)
  • Learn a linear ranking model with a parameter vector, using the boosting loss
    (a re-ranking sketch follows after the results)

Model                       LP      LR
Collins 99 (Generative)     88.3    88.1
Collins 00 (BoostLoss)      89.9    89.6

13% error reduction. Still very close in accuracy to the generative model (Charniak 00).
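
An illustrative sketch (not Collins' actual features or boosting loss) of the re-ranking step: each candidate parse from a base parser is mapped to a feature vector F(X, Y), and the candidate with the highest linear score is returned:

  # Linear re-ranking sketch: return the candidate parse with the highest score
  # alpha . F(X, Y). The candidates, features, and weights below are made up;
  # Collins (2000) learns the weights with a boosting-style loss over n-best parses.
  import numpy as np

  # Hypothetical n-best candidates: (parse string, base-model log prob, tree depth, high PP attachment?)
  candidates = [
      ("(S (NP I) (VP saw (NP Mary) (PP with the telescope)))", -20.1, 5, 1),
      ("(S (NP I) (VP saw (NP (NP Mary) (PP with the telescope))))", -20.5, 6, 0),
  ]

  def feature_fn(cand):
      _, logprob, depth, pp_high = cand
      return np.array([logprob, depth, pp_high])    # F(X, Y): base score plus extra overlapping features

  def rerank(cands, alpha):
      scores = [float(np.dot(alpha, feature_fn(c))) for c in cands]
      best = int(np.argmax(scores))
      return cands[best], scores[best]

  alpha = np.array([1.0, -0.1, 0.5])                # illustrative weight vector
  best, s = rerank(candidates, alpha)
  print(best[0], round(s, 2))
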
34
Comparison of Generative-Discriminative Pairs
  • Johnson (2001) compared a simple PCFG trained to maximize the joint likelihood
    L(T,S) and the conditional likelihood L(T|S)
  • A simple PCFG has one parameter P(A -> β) per rule A -> β
  • Models: MLE (joint likelihood) and MCLE (maximum conditional likelihood)
    estimates of the same PCFG
  • Results:

Model    Labeled Precision    Labeled Recall
MLE      0.815                0.789
MCLE     0.817                0.794
35
Weighted CFGs for Unification-Based Grammars - I
  • Unification-based grammars (UBG) are often defined using a context-free base and
    a set of path equations
  • S[number=X] -> NP[number=X] VP[number=X]
  • NP[number=X] -> N[number=X]
  • VP[number=X] -> V[number=X]
  • N[number=sg] -> dog    N[number=pl] -> dogs
  • V[number=sg] -> barks  V[number=pl] -> bark
  • A PCFG can be defined using the context-free backbone CFG_UBG (S -> NP VP, etc.)
  • The UBG generates "dogs bark" and "dog barks". The CFG_UBG generates "dogs bark",
    "dog barks", "dog bark", and "dogs barks".

36
Weighted CFGs for Unification-Based Grammars - II
  • A simple PCFG for CFG_UBG has one parameter P(A -> β) per rule of the backbone
  • It defines a joint distribution P(T,S) and a conditional distribution P(T|S) of
    trees given sentences
  • A conditional weighted CFG defines only a conditional probability; the
    conditional probability of any tree T outside the UBG is 0

37
Weighted CFGs for Unification-based grammars - III
[Accuracy comparison chart]
The conditional weighted CFGs perform consistently better than their generative
counterparts. Negative information is extremely helpful here: knowing that the
conditional probability of trees outside the UBG is zero, plus conditional training,
amounts to a 38% error reduction for the simple PCFG model.
38
Summary of Parsing Results
  • The single small study comparing a parsing generative-discriminative pair for
    PCFG parsing showed a small (insignificant) advantage for the discriminative
    model; the added computational cost is probably not worth it
  • The best performing statistical parsers are still generative (Charniak 00,
    Collins 99) or use a generative model as a preprocessing stage (Collins 00,
    Collins 2002), part of which has to do with computational complexity
  • Discriminative models allow more complex representations, such as the
    all-subtrees representation (Collins 2002) or other overlapping features
    (Collins 00), and this has led to up to 13% improvement over a generative model
  • Discriminative training seems promising for parse selection tasks for UBG, where
    the number of possible analyses is not enormous

39
Conclusions
  • For the current sizes of training data available
    for NLP tasks such as tagging and parsing,
    discriminative training has not by itself yielded
    large gains in accuracy
  • The flexibility of including non-independent
    features of the observations in discriminative
    models has resulted in improved part-of-speech
    tagging models (for some tasks it might not
    justify the added computational complexity)
  • For parsing, discriminative training has shown
    improvements when used for re-ranking or when
    using negative information (UBG)
  • If you come up with a feature that is very hard to incorporate in a generative
    model and seems extremely useful, see if a discriminative approach will be
    computationally feasible!