Title: Generative and Discriminative Models in NLP: A Survey
1 Generative and Discriminative Models in NLP: A Survey
- Kristina Toutanova
- Computer Science Department
- Stanford University
2 Motivation
- Many problems in natural language processing are disambiguation problems
  - word senses: "jaguar" - a big cat, a car, the name of a Java package; "line" - phone, queue, in mathematics, air line, etc.
  - part-of-speech tags (noun, verb, proper noun, etc.)
- Example: Joy makes progress every day .
  (candidate tags: Joy NN/NNP, makes VBZ/NNS, progress NN/VB, every DT, day NN)
3 Motivation
- Parsing: choosing preferred phrase structure trees for sentences, corresponding to their likely semantics
- Possible approaches to disambiguation:
  - Encode knowledge about the problem: define rules, hand-engineer grammars and patterns (requires much effort, and it is not always possible to give categorical answers)
  - Treat the problem as a classification task and learn classifiers from labeled training data
[Figure: phrase structure tree for "I saw Mary with the telescope" (NP, VP, PP, VBD, NNP, IN nodes), illustrating the PP-attachment ambiguity]
4 Overview
- General ML perspective
- Examples
- The case of Part-of-Speech Tagging
- The case of Syntactic Parsing
- Conclusions
5 The Classification Problem
- Given a training set of i.i.d. samples T = (X1,Y1), ..., (Xn,Yn) of input and class variables from an unknown distribution D(X,Y), estimate a function that predicts the class from the input variables
- The goal is to come up with a hypothesis with minimum expected loss (usually 0-1 loss)
- Under 0-1 loss the hypothesis with minimum expected loss is the Bayes optimal classifier argmax_Y D(Y|X)
6 Approaches to Solving Classification Problems - I
- 1. Generative. Try to estimate the probability distribution of the data D(X,Y)
  - specify a parametric model family P(X,Y)
  - choose parameters by maximum likelihood on training data
  - estimate conditional probabilities by Bayes rule: P(Y|X) = P(X,Y) / P(X)
  - classify new instances to the most probable class Y according to argmax_Y P(Y|X) = argmax_Y P(X,Y) (see the sketch below)
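To make the generative recipe concrete, here is a minimal sketch (my own, not from the slides) of a generative classifier with a Naïve Bayes factorization: relative-frequency (maximum likelihood) estimates of P(Y) and P(X_j|Y), and classification by Bayes rule. The toy word-sense data and feature names are hypothetical.

```python
# Generative classification sketch: estimate P(X,Y) = P(Y) * prod_j P(X_j|Y) by
# relative frequencies (MLE), then classify via Bayes rule.
from collections import Counter, defaultdict

def train_generative(samples):
    """samples: list of (features_dict, label). Returns MLE estimates of P(Y) and P(X_j=v|Y)."""
    label_counts = Counter(y for _, y in samples)
    feat_counts = defaultdict(Counter)          # (label, feature_name) -> Counter over values
    for x, y in samples:
        for name, value in x.items():
            feat_counts[(y, name)][value] += 1
    p_y = {y: c / len(samples) for y, c in label_counts.items()}
    def p_x_given_y(name, value, y, alpha=1.0):  # add-alpha smoothing to avoid zeros
        c = feat_counts[(y, name)]
        return (c[value] + alpha) / (sum(c.values()) + alpha * (len(c) + 1))
    return p_y, p_x_given_y

def classify(x, p_y, p_x_given_y):
    """Bayes rule: argmax_y P(y) * prod_j P(x_j|y), proportional to P(y|x)."""
    def joint(y):
        p = p_y[y]
        for name, value in x.items():
            p *= p_x_given_y(name, value, y)
        return p
    return max(p_y, key=joint)

# toy word-sense example (hypothetical data)
data = [({"left": "fast", "right": "engine"}, "car"),
        ({"left": "wild", "right": "jungle"}, "cat"),
        ({"left": "fast", "right": "jungle"}, "cat")]
p_y, p_xy = train_generative(data)
print(classify({"left": "fast", "right": "engine"}, p_y, p_xy))  # -> 'car'
```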
7 Approaches to Solving Classification Problems - II
- 2. Discriminative. Try to estimate the conditional distribution D(Y|X) from data (see the sketch below)
  - specify a parametric model family P(Y|X)
  - estimate parameters by maximum conditional likelihood of the training data
  - classify new instances to the most probable class Y according to argmax_Y P(Y|X)
- 3. Discriminative, distribution-free. Try to estimate a classifier h(X) directly from data so that its expected loss will be minimized
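As a concrete instance of approach 2, here is a minimal sketch (mine, not from the slides) of binary logistic regression with P(y=1|x) = sigmoid(w.x), trained by gradient ascent on the conditional log-likelihood. The toy data is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(xs, ys, lr=0.1, epochs=200):
    """Maximize sum_i log P(y_i|x_i) by stochastic gradient ascent; last weight is the bias."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            x = list(x) + [1.0]                       # append bias feature
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            # gradient of log P(y|x) with respect to w is (y - p) * x
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

def classify(x, w):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])))
    return 1 if p >= 0.5 else 0

xs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
ys = [1, 0, 1, 0]
w = train_logreg(xs, ys)
print(classify([0.0, 1.0], w))  # -> 1
```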
8 Axes for comparison of different approaches
- Asymptotic accuracy
- Accuracy for limited training data
- Speed of convergence to the best hypothesis
- Complexity of training
- Modeling ease
9 Generative-Discriminative Pairs
- Definition: If a generative and a discriminative parametric model family can represent the same set of conditional probability distributions, they are a generative-discriminative pair
- Example: Naïve Bayes and Logistic Regression
[Figure: Naïve Bayes graphical model - class Y generating features X1 and X2]
10 Comparison of Naïve Bayes and Logistic Regression
- The NB assumption that features are independent given the class is not made by logistic regression
- The logistic regression model is more general because it allows a larger class of probability distributions for the features given classes
11 Example: Traffic Lights
- Reality:
  - Lights Working: P(g,r,w) = 3/7, P(r,g,w) = 3/7
  - Lights Broken: P(r,r,b) = 1/7
- NB Model: class variable Working? with the two lights NS and EW as features
- Model assumptions false! (the two lights are not independent given the class)
- JL and CL estimates differ (see the worked numbers below):
  - JL: P(w) = 6/7        CL: P(w) = ?
  - JL: P(r|w) = 1/2      CL: P(r|w) = 1/2
  - JL: P(r|b) = 1        CL: P(r|b) = 1
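The following small computation (my own, using only the numbers on this slide) shows why the joint-likelihood estimates are problematic for the NB model: for the observation (red, red) they put more posterior mass on "working" than on "broken", even though in the data (red, red) only occurs when the lights are broken.

```python
# Joint-likelihood (relative-frequency) NB estimates from the slide:
p_w, p_b = 6/7, 1/7          # P(working), P(broken)
p_r_given_w = 1/2            # each light is red with probability 1/2 when working
p_r_given_b = 1.0            # both lights red when broken

# NB posterior for the observation (red, red):
joint_w = p_w * p_r_given_w * p_r_given_w   # = 6/28
joint_b = p_b * p_r_given_b * p_r_given_b   # = 4/28
print(joint_w / (joint_w + joint_b))        # P(working | r, r) = 0.6 -> the wrong class
```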
12 Joint Traffic Lights
- Joint distribution assigned by the JL-trained NB model:
  - Lights Working: each of the four light configurations (NS, EW each red or green) gets 6/7 x 1/2 x 1/2 = 3/14
  - Lights Broken: (r,r) gets 2/14; the other configurations get 0
13 Conditional Traffic Lights
- Distribution implied by the CL-trained NB model (with free parameter ?):
  - Lights Working: each of the four light configurations gets ?/4
  - Lights Broken: (r,r) gets 1-?; the other configurations get 0
14 Comparison of Naïve Bayes and Logistic Regression

                    Naïve Bayes                                     Logistic Regression
Model assumptions   independence of features given the class        linear log-odds
Advantages          faster convergence, uses information in P(X),   more robust and accurate because it makes
                    faster training                                 fewer assumptions
Disadvantages       large bias if the independence assumptions      harder parameter estimation problem,
                    are very wrong                                  ignores information in P(X)
15 Some Experimental Comparisons
[Figure: learning curves - classification error vs. training data size]
- Ng & Jordan 2002 (15 datasets from the UCI ML repository)
- Klein & Manning 2002 (word sense disambiguation on the "line" and "hard" data)
16 Part-of-Speech Tagging
- POS tagging is determining the part of speech of every word in a sentence
- Example: Joy makes progress every day .
  (candidate tags: Joy NN/NNP, makes VBZ/NNS, progress NN/VB, every DT, day NN)
- Sequence classification problem with 45 classes (Penn Treebank tag set). Accuracies are high - 97%! Some argue it can't go much higher
- Existing approaches:
  - rule-based (hand-crafted rules, TBL)
  - generative (HMM)
  - discriminative (maxent, memory-based, decision tree, neural network, linear models (boosting, perceptron))
17 Part-of-Speech Tagging: Useful Features
- The complete solution of the problem requires full syntactic and semantic understanding of sentences
- In most cases information about surrounding words/tags is a strong disambiguator
  - The long fenestration was tiring .
- Useful features:
  - tags of previous/following words, e.g. P(NN|JJ) = 0.45 vs. P(VBP|JJ) = 0.0005
  - identity of the word being tagged / surrounding words
  - suffix/prefix for unknown words, hyphenation, capitalization
  - longer distance features
  - others we haven't figured out yet
18 HMM Tagging Models - I
- Independence assumptions (see the Viterbi decoding sketch below):
  - t_i is independent of t_1...t_{i-2} and w_1...w_{i-1} given t_{i-1}
  - words are independent given their tags
[Figure: HMM as a chain of tag states t1 -> t2 -> t3, each emitting its word w1, w2, w3]
- States can be single tags, pairs of successive tags, or variable-length sequences of last tags
[Figure: unknown-word model (Weischedel et al. 93) - the tag t generates word features such as capitalization (Cap?), suffix, and hyphenation]
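The standard way to tag with such an HMM is Viterbi decoding. The sketch below is mine, not from the slides: it decodes a bigram HMM given already-estimated transition and emission probabilities; the toy tag set and probabilities are hypothetical.

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Return the most probable tag sequence argmax_t P(t_1..t_n, w_1..w_n)."""
    # best[i][t] = log prob of the best tag sequence for words[:i+1] ending in tag t
    best = [{t: math.log(trans[start].get(t, 1e-12)) +
                math.log(emit[t].get(words[0], 1e-12)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda tp: best[i - 1][tp] +
                                            math.log(trans[tp].get(t, 1e-12)))
            best[i][t] = (best[i - 1][prev] + math.log(trans[prev].get(t, 1e-12)) +
                          math.log(emit[t].get(words[i], 1e-12)))
            back[i][t] = prev
    # follow back-pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# toy usage with hypothetical probabilities
tags = ["NN", "VBZ"]
trans = {"<s>": {"NN": 0.8, "VBZ": 0.2},
         "NN": {"NN": 0.3, "VBZ": 0.7},
         "VBZ": {"NN": 0.9, "VBZ": 0.1}}
emit = {"NN": {"Joy": 0.5, "progress": 0.5}, "VBZ": {"makes": 1.0}}
print(viterbi(["Joy", "makes", "progress"], tags, trans, emit))  # ['NN', 'VBZ', 'NN']
```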
19 HMM Tagging Models - Brants 2000
- Highly competitive with other state-of-the-art models
- Trigram HMM with smoothed transition probabilities
- Capitalization feature becomes part of the state: each tag state is split into two, e.g. NN -> <NN,cap>, <NN,not cap>
- Suffix features for unknown words
[Figure: the tag state t generates the unknown word's suffixes suffix1, suffix2, ..., suffix_{n-1}, suffix_n]
20 CMM Tagging Models
- Independence assumptions:
  - t_i is independent of t_1...t_{i-2} and w_1...w_{i-1} given t_{i-1}
  - t_i is independent of all following observations
  - no independence assumptions on the observation sequence
[Figure: conditional Markov model - a chain of tags t1 -> t2 -> t3, each conditioned on its observation w1, w2, w3]
- Dependence of the current tag on previous and future observations can be added; overlapping features of the observation can be taken as predictors
21 MEMM Tagging Models - II
- Ratnaparkhi (1996), sketched below:
  - local distributions are estimated using maximum entropy models
  - uses the previous two tags, the current word, the previous two words, and the next two words
  - suffix, prefix, hyphenation, and capitalization features for unknown words

Model               Overall Accuracy   Unknown Words
HMM (Brants 2000)   96.7               85.5
MEMM (Ratn 1996)    96.63              85.56
MEMM (TM 2000)      96.86              86.91
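Here is a minimal sketch (mine, not Ratnaparkhi's code) of how such a MEMM is used at test time: a local log-linear model P(t_i | history) over overlapping features of the previous tags and surrounding words, decoded left to right. The weights and feature names are hypothetical; a real system estimates the weights by maximum conditional likelihood, and Ratnaparkhi decodes with beam search rather than the greedy pass used here for brevity.

```python
import math

def local_features(words, i, prev_tag, prev2_tag):
    w = words[i]
    return [
        f"word={w}", f"prev_tag={prev_tag}", f"prev2_tags={prev2_tag},{prev_tag}",
        f"prev_word={words[i-1] if i > 0 else '<s>'}",
        f"next_word={words[i+1] if i + 1 < len(words) else '</s>'}",
        f"suffix3={w[-3:]}", f"capitalized={w[0].isupper()}",
    ]

def p_tag_given_history(feats, tags, weights):
    """Maximum-entropy form: P(t|h) = exp(sum_k w_k f_k(h,t)) / Z(h)."""
    scores = {t: math.exp(sum(weights.get((f, t), 0.0) for f in feats)) for t in tags}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

def greedy_tag(words, tags, weights):
    prev, prev2, out = "<s>", "<s>", []
    for i in range(len(words)):
        dist = p_tag_given_history(local_features(words, i, prev, prev2), tags, weights)
        best = max(dist, key=dist.get)
        out.append(best)
        prev2, prev = prev, best
    return out

# toy usage with a few hand-set weights
weights = {("capitalized=True", "NNP"): 2.0,
           ("suffix3=kes", "VBZ"): 1.5,
           ("word=progress", "NN"): 1.0}
print(greedy_tag(["Joy", "makes", "progress"], ["NN", "NNP", "VBZ"], weights))
# -> ['NNP', 'VBZ', 'NN']
```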
22 HMM vs CMM - I
- Johnson (2001) compared tagging models with different dependency structures between the tags t_j, t_{j+1} and the words w_j, w_{j+1}
- The three structures achieved accuracies of 95.5, 94.4, and 95.3 respectively
[Figure: the three directed dependency structures over t_j, t_{j+1}, w_j, w_{j+1} being compared]
23 HMM vs CMM - II
- The per-state conditioning of the CMM has been observed to exhibit label bias (Bottou, Lafferty) and observation bias (Klein & Manning)
- Klein & Manning (2002):

Model      HMM     CMM     CMM (unobserving)
Accuracy   91.23   89.22   90.44

- Unobserving words with unambiguous tags improved performance significantly
24 Conditional Random Fields (Lafferty et al. 2001)
- Models that are globally conditioned on the observation sequence; they define a distribution P(Y|X) of the tag sequence given the word sequence
- No independence assumptions about the observations, and no need to model their distribution
- The labels can depend on past and future observations
- Avoids the independence assumption of CMMs that labels are independent of future observations, and thus the label and observation bias problems
- The parameter estimation problem is much harder
25 CRF - II
[Figure: linear-chain CRF - tags t1, t2, t3 linked in a chain and all conditioned on the word sequence w1, w2, w3]
- The HMM and this chain CRF form a generative-discriminative pair (a scoring sketch follows below)
- Independence assumptions: a tag is independent of all other tags in the sequence given its neighbors and the word sequence
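A minimal sketch (mine, not from the paper) of the linear-chain CRF as a globally normalized model: P(t_1..t_n | w_1..w_n) = exp(sum_i score(t_{i-1}, t_i, w, i)) / Z(w), where Z(w) sums over all tag sequences and is computed with the forward algorithm. The `score` function below is a hand-set stand-in for a weighted sum of overlapping features; real training sets the weights by maximizing conditional likelihood, which is the harder estimation problem mentioned above.

```python
import math
from itertools import product

def seq_logscore(tag_seq, words, score):
    prev, total = "<s>", 0.0
    for i, t in enumerate(tag_seq):
        total += score(prev, t, words, i)
        prev = t
    return total

def log_partition(words, tags, score):
    """Forward algorithm: log Z(w) in O(n * |tags|^2) instead of enumerating sequences."""
    alpha = {t: score("<s>", t, words, 0) for t in tags}
    for i in range(1, len(words)):
        alpha = {t: math.log(sum(math.exp(alpha[tp] + score(tp, t, words, i)) for tp in tags))
                 for t in tags}
    return math.log(sum(math.exp(a) for a in alpha.values()))

def conditional_prob(tag_seq, words, tags, score):
    return math.exp(seq_logscore(tag_seq, words, score) - log_partition(words, tags, score))

# toy usage: a hand-set scoring function (hypothetical weights)
def score(prev, t, words, i):
    s = 0.0
    if words[i][0].isupper() and t == "NNP": s += 2.0
    if prev == "NNP" and t == "VBZ": s += 1.0
    return s

tags = ["NN", "NNP", "VBZ"]
words = ["Joy", "makes", "progress"]
# the conditional probabilities over all tag sequences sum to one:
print(sum(conditional_prob(list(seq), words, tags, score) for seq in product(tags, repeat=3)))
```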
26 CRF - Experimental Results

Model                             Accuracy   Unknown Word Accuracy
HMM                               94.31      54.01
CMM (MEMM)                        93.63      45.39
CRF                               94.45      51.95
CMM (MEMM) + spelling features    95.19      73.01
CRF + spelling features           95.73      76.24
27 Discriminative Tagging Model - Voted Perceptron
- Collins 2002: best reported tagging results on WSJ
- Uses all features used by Ratnaparkhi (96)
- Learns a linear function score(w,t) = alpha . F(w,t) over feature vectors of a word/tag sequence pair (see the update sketch below)
- Classifies according to the highest-scoring tag sequence, argmax_t alpha . F(w,t)
- Accuracy: MEMM (Ratn 96) 96.72, Voted Perceptron 97.11
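A minimal sketch (mine, not Collins' code) of the structured perceptron update behind this tagger: decode the best tag sequence under the current weights and, when it differs from the gold sequence, move alpha toward the gold features and away from the predicted ones. Collins' voted/averaged variant additionally averages the intermediate weight vectors; the feature templates and the brute-force decoder below are simplifications for illustration (a real tagger decodes with Viterbi).

```python
from collections import Counter
from itertools import product

def features(words, tags):
    """F(w,t): global feature counts, here simple word/tag and tag-bigram indicators."""
    phi, prev = Counter(), "<s>"
    for word, tag in zip(words, tags):
        phi[("word_tag", word, tag)] += 1
        phi[("tag_bigram", prev, tag)] += 1
        prev = tag
    return phi

def brute_force_decode(words, tagset, alpha):
    """argmax_t alpha . F(w,t); exhaustive, for toy examples only."""
    return max((list(seq) for seq in product(tagset, repeat=len(words))),
               key=lambda t: sum(alpha[f] * c for f, c in features(words, t).items()))

def perceptron_train(data, tagset, decode, epochs=5):
    alpha = Counter()
    for _ in range(epochs):
        for words, gold in data:
            pred = decode(words, tagset, alpha)
            if pred != gold:
                alpha.update(features(words, gold))       # + F(w, gold)
                alpha.subtract(features(words, pred))     # - F(w, pred)
    return alpha

# toy usage
data = [(["Joy", "makes", "progress"], ["NNP", "VBZ", "NN"])]
alpha = perceptron_train(data, ["NN", "NNP", "VBZ"], brute_force_decode)
print(brute_force_decode(["Joy", "makes", "progress"], ["NN", "NNP", "VBZ"], alpha))
# -> ['NNP', 'VBZ', 'NN']
```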
28 Summary of Tagging Review
- For tagging, the change from a generative to a discriminative model does not by itself result in great improvement (e.g. HMM vs. CRF)
- One profits from discriminative models when specifying dependence on overlapping features of the observation, such as spelling, suffix analysis, etc.
- The CMM model allows integration of rich features of the observations, but suffers strongly from assuming independence from following observations; this effect can be relieved by adding dependence on following words
- This additional power (of the CMM, CRF, and Perceptron models) has been shown to result in improvements in accuracy, though not dramatic ones (up to 11% error reduction)
- The higher accuracy of discriminative models comes at the price of much slower training
- More research is needed on specifying useful features (or tagging the WSJ Penn Treebank is a noisy task and the limit has been reached)
29 Parsing Models
- Syntactic parsing is the task of assigning a parse tree to a sentence, corresponding to its most likely interpretation
- Existing approaches:
  - hand-crafted rule-based heuristic methods
  - probabilistic generative models
  - conditional probabilistic discriminative models
  - discriminative ranking models
[Figure: phrase structure tree for "I saw Mary with the telescope" (NP, VP, PP, VBD, NNP, IN nodes)]
30 Generative Parsing Models
- Generative models based on PCFG grammars learned from corpora are still among the best performing (Collins 97, Charniak 97, 00): 88-89% labeled precision/recall
- The generative models learn a distribution P(X,Y) on <sentence, parse tree> pairs and select the single most likely parse for a sentence X based on argmax_Y P(Y|X) = argmax_Y P(X,Y)
- Easy to train using relative frequency estimation (RFE) for maximum likelihood (see the sketch below)
- These models have the advantage of being usable as language models (Chelba & Jelinek 00, Charniak 00)
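Here is a minimal sketch (mine) of the two ingredients above for a plain PCFG: rule probabilities estimated by relative frequency from treebank trees, and the probability of a parse as the product of its rule probabilities. Trees are nested tuples with a word string under each preterminal; the lexicalization and head annotation used by Collins 97 and Charniak 97 are omitted, and the toy treebank is hypothetical.

```python
from collections import Counter

def rules(tree):
    """Yield (parent, rhs) for every node; a preterminal rewrites to its single word string."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):   # preterminal -> word
        yield (label, children)
        return
    yield (label, tuple(c[0] for c in children))
    for c in children:
        yield from rules(c)

def train_pcfg(treebank):
    """Relative-frequency (maximum likelihood) estimates P(rule | parent)."""
    counts, parent_totals = Counter(), Counter()
    for tree in treebank:
        for parent, rhs in rules(tree):
            counts[(parent, rhs)] += 1
            parent_totals[parent] += 1
    return {r: c / parent_totals[r[0]] for r, c in counts.items()}

def tree_prob(tree, probs):
    """P(X,Y) as a product of rule probabilities; 0 if the tree uses an unseen rule."""
    p = 1.0
    for r in rules(tree):
        p *= probs.get(r, 0.0)
    return p

# toy treebank
t1 = ("S", ("NP", "dogs"), ("VP", "bark"))
t2 = ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats")))
probs = train_pcfg([t1, t2])
print(tree_prob(t1, probs))   # P(S->NP VP) * P(NP->dogs) * P(VP->bark) = 1.0 * 2/3 * 0.5
```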
31 Generative History-Based Model - Collins 97
- Accuracy (sentences of < 100 words): 88.1 LP, 87.5 LR
[Figure: lexicalized parse tree for "Last week Marks bought Brooks", with head-annotated nodes TOP, S(bought), NP(week), NP-C(Marks), VP(bought), VBD(bought), NP-C(Brooks)]
32 Discriminative Models
- Shift-reduce parser: Ratnaparkhi (98)
  - Learns a distribution P(T|S) of parse trees given sentences using the sequence of actions of a shift-reduce parser (see the decomposition sketch below)
  - Uses a maximum entropy model to learn the conditional distribution of each parse action given the history
  - Suffers from the same independence assumption as the CMM: actions are independent of future observations
  - Higher parameter estimation cost to learn the local maximum entropy models
  - Lower but still good accuracy: 86-87% labeled precision/recall
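A small sketch (mine) of the decomposition used by this approach: P(T|S) is the product of the probabilities of the parser actions that build T, each conditioned on the history of previous actions. The `action_model` below stands in for the local maximum entropy classifier, and the action names are hypothetical.

```python
import math

def tree_conditional_logprob(actions, sentence, action_model):
    """log P(T|S) = sum_i log P(action_i | sentence, action_1..i-1)."""
    logp, history = 0.0, []
    for a in actions:
        logp += math.log(action_model(a, sentence, history))
        history.append(a)
    return logp

# toy usage with a uniform stand-in model over 3 possible actions
uniform = lambda a, s, h: 1.0 / 3.0
print(tree_conditional_logprob(["shift", "shift", "reduce:NP"], ["dogs", "bark"], uniform))
```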
33 Discriminative Models - Distribution-Free Re-ranking
- Represent sentence-parse tree pairs by a feature vector F(X,Y)
- Learn a linear ranking model with parameters alpha using the boosting loss (see the sketch below)

Model                     LP     LR
Collins 99 (Generative)   88.3   88.1
Collins 00 (BoostLoss)    89.9   89.6

- 13% error reduction
- Still very close in accuracy to the generative model (Charniak 00)
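A minimal sketch (mine, not Collins' system) of the re-ranking setup: a base generative parser proposes candidate trees, each sentence-tree pair is mapped to a feature vector F(X,Y), and a learned weight vector alpha picks the highest-scoring candidate. The feature names and weights here are hypothetical; Collins 00 learns alpha with a boosting-style loss over such candidate lists.

```python
def rerank(candidates, feature_fn, alpha):
    """Return argmax_Y alpha . F(X,Y) over the candidate parses."""
    return max(candidates, key=lambda y: sum(alpha.get(f, 0.0) * v
                                             for f, v in feature_fn(y).items()))

# toy usage: candidates are (tree id, feature dict) pairs produced by a base parser
feature_fn = lambda cand: cand[1]
candidates = [("tree_low_attach",  {"base_logprob": -10.2, "pp_attach_low": 1.0}),
              ("tree_high_attach", {"base_logprob": -10.5, "pp_attach_high": 1.0})]
alpha = {"base_logprob": 1.0, "pp_attach_high": 0.8}
print(rerank(candidates, feature_fn, alpha)[0])   # -> 'tree_high_attach'
```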
34 Comparison of Generative-Discriminative Pairs
- Johnson (2001) compared a simple PCFG trained to maximize the joint likelihood L(T,S) versus the conditional likelihood L(T|S)
- A simple PCFG has one parameter per rule, P(A -> beta | A), with one distribution per nonterminal
- Models: MLE (maximum joint likelihood) vs. MCLE (maximum conditional likelihood estimation) of the same PCFG
- Results:

Model   LPrecision   LRecall
MLE     0.815        0.789
MCLE    0.817        0.794
35 Weighted CFGs for Unification-Based Grammars - I
- Unification-based grammars (UBG) are often defined using a context-free base and a set of path equations
  - S[number X] -> NP[number X] VP[number X]
  - NP[number X] -> N[number X]
  - VP[number X] -> V[number X]
  - N[number sg] -> dog      N[number pl] -> dogs
  - V[number sg] -> barks    V[number pl] -> bark
- A PCFG grammar can be defined using the context-free backbone CFG_UBG (S -> NP VP, etc.)
- The UBG generates "dogs bark" and "dog barks". The CFG_UBG generates "dogs bark", "dog barks", "dog bark", and "dogs barks".
36 Weighted CFGs for Unification-Based Grammars - II
- A simple PCFG for CFG_UBG has one parameter per rule of the context-free backbone; it defines a joint distribution P(T,S) and a conditional distribution P(T|S) of trees given sentences
- A conditional weighted CFG defines only a conditional probability; the conditional probability of any tree T outside the UBG is 0 (see the sketch below)
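A small sketch (mine) of this point: a conditional weighted CFG only has to distribute mass over the candidate trees of each sentence, so trees outside the UBG can simply be given weight zero. The candidate trees and their rule-weight sums below are hypothetical inputs produced by the CFG_UBG backbone.

```python
import math

def conditional_tree_probs(candidates, in_ubg, weight):
    """P(T|S) = exp(weight(T)) / sum over UBG-licensed candidates; 0 outside the UBG."""
    scores = {t: (math.exp(weight(t)) if in_ubg(t) else 0.0) for t in candidates}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

# toy usage: two CFG_UBG trees for "dog barks", one violating number agreement
weight = lambda t: {"T_sg_agree": 1.2, "T_num_clash": 0.7}[t]
in_ubg = lambda t: t != "T_num_clash"
print(conditional_tree_probs(["T_sg_agree", "T_num_clash"], in_ubg, weight))
# -> {'T_sg_agree': 1.0, 'T_num_clash': 0.0}
```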
37 Weighted CFGs for Unification-Based Grammars - III
[Figure: accuracy of generative vs. conditional weighted CFGs]
- The conditional weighted CFGs perform consistently better than their generative counterparts
- Negative information is extremely helpful here: knowing that the conditional probability of trees outside the UBG is zero, plus conditional training, amounts to a 38% error reduction for the simple PCFG model
38 Summary of Parsing Results
- The single small study comparing a generative-discriminative pair for PCFG parsing showed a small (insignificant) advantage for the discriminative model; the added computational cost is probably not worth it
- The best performing statistical parsers are still generative (Charniak 00, Collins 99) or use a generative model as a preprocessing stage (Collins 00, Collins 2002), part of which has to do with computational complexity
- Discriminative models allow more complex representations, such as the all-subtrees representation (Collins 2002) or other overlapping features (Collins 00), and this has led to up to 13% improvement over a generative model
- Discriminative training seems promising for parse selection tasks for UBGs, where the number of possible analyses is not enormous
39 Conclusions
- For the current sizes of training data available for NLP tasks such as tagging and parsing, discriminative training has not by itself yielded large gains in accuracy
- The flexibility of including non-independent features of the observations in discriminative models has resulted in improved part-of-speech tagging models (though for some tasks this might not justify the added computational complexity)
- For parsing, discriminative training has shown improvements when used for re-ranking or when using negative information (UBG)
- If you come up with a feature that is very hard to incorporate in a generative model and seems extremely useful, see if a discriminative approach will be computationally feasible!