1. Conditional Structure versus Conditional Estimation in NLP Models (and a little about Conditional Random Fields)
- Chris Manning, Dan Klein
- Empirical Methods in Natural Language Processing (EMNLP), 2002
- CRFs: John Lafferty et al., ICML 2001
2. Abstract
- Learning as parameter estimation in probabilistic models
- Goal: separate the effect of conditional parameter estimation from that of conditional model structure
- 1. Naïve Bayes vs. Logistic Regression
- 2. Markov model vs. maximum entropy tagging
- Empirically determine what works and what does not on a word sense disambiguation task and a part-of-speech tagging task
3. Outline
- Probabilistic graphical models
- Objective functions
- Joint log-likelihood (JLL) and conditional log-likelihood (CLL)
- Model structures for part-of-speech tagging
- HMM and CMM
- Discussion and conclusion
4. Classification problem
- Learn f: X → Y from D = {(x_i, y_i)}, i = 1..n
- f can be a probability function!
- What if Y is multivariate?
- Probabilistic model
- How do we model p(x, y)?
- For a new instance x_new we can assign the most likely output
- y* = argmax_y p(y | x_new)
     = argmax_y p(y, x_new) / p(x_new)
     = argmax_y p(y, x_new)
- MAP (maximum a posteriori) principle (a toy sketch follows below)
- What if we maximize p(y | x) directly?
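A minimal sketch of the MAP decision rule above, using a joint (Naïve Bayes-style) model estimated from counts with add-one smoothing; the toy data, vocabulary, and function names are illustrative assumptions, not from the paper.

```python
from collections import Counter, defaultdict

# Toy word sense disambiguation data: (context words, sense label) -- purely illustrative.
data = [
    (["bank", "river"], "WATER"),
    (["bank", "money"], "FINANCE"),
    (["loan", "money"], "FINANCE"),
]
vocab = {w for words, _ in data for w in words}

# Joint (Naive Bayes) estimation: relative-frequency counts.
class_counts = Counter(y for _, y in data)
word_counts = defaultdict(Counter)
for words, y in data:
    word_counts[y].update(words)

def joint_prob(words, y):
    """p(y, x) = p(y) * prod_i p(x_i | y), with add-one smoothing."""
    p = class_counts[y] / sum(class_counts.values())
    total = sum(word_counts[y].values())
    for w in words:
        p *= (word_counts[y][w] + 1) / (total + len(vocab))
    return p

def map_classify(words):
    """MAP rule: argmax_y p(y, x); dividing by p(x) would not change the argmax."""
    return max(class_counts, key=lambda y: joint_prob(words, y))

print(map_classify(["bank", "money"]))  # -> 'FINANCE'
```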
5. Probabilistic Graphical Models (for example, HMMs)
[Diagram: HMM with hidden states Y1, Y2, Y3, ..., Yn emitting observations X1, X2, X3, ..., Xn]
- Decompose p(x, y) into a product of smaller terms
- p(x, y) = ∏_{i=1..n} p(x_i | y_i) p(y_i | y_{i-1})
- Tractable: we only need to estimate each p(x_i | y_i) and p(y_i | y_{i-1})
- Inference (computing the argmax) is done via Viterbi (a toy sketch follows below)
- Learning: maximization of p(x, y), which has a closed form when the data is fully observable (which it is if we have everything labeled)
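A minimal sketch of this factorization and of Viterbi decoding, under assumed toy tags, vocabulary, and probabilities (none of these numbers are from the paper):

```python
import numpy as np

# Toy HMM parameters (illustrative only).
states = ["DT", "NN"]                        # hidden tags Y
vocab = {"the": 0, "dog": 1}                 # observations X
start = np.array([0.8, 0.2])                 # p(y_1)
trans = np.array([[0.1, 0.9],                # p(y_i | y_{i-1})
                  [0.6, 0.4]])
emit = np.array([[0.9, 0.1],                 # p(x_i | y_i)
                 [0.2, 0.8]])

def hmm_joint(words, tags):
    """p(x, y) = prod_i p(x_i | y_i) p(y_i | y_{i-1})."""
    p = start[tags[0]] * emit[tags[0], vocab[words[0]]]
    for i in range(1, len(words)):
        p *= trans[tags[i - 1], tags[i]] * emit[tags[i], vocab[words[i]]]
    return p

def viterbi(words):
    """argmax_y p(x, y) by dynamic programming."""
    obs = [vocab[w] for w in words]
    delta = start * emit[:, obs[0]]                    # best score ending in each state
    backptrs = []
    for o in obs[1:]:
        scores = delta[:, None] * trans * emit[:, o]   # [previous state, next state]
        backptrs.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for bp in reversed(backptrs):
        path.append(int(bp[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["the", "dog"]))  # -> ['DT', 'NN']
```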
6. Maximizing the conditional likelihood
- From the HMM we have p(x, y)
- p(y | x) = p(x, y) / Σ_{y'} p(x, y')
- The sum can be computed efficiently using the forward-backward procedure (sketch below)
- Since the parameters are probabilities, we need
- Σ_x p(x | y) = 1 and Σ_{y'} p(y' | y) = 1
- Learning becomes a constrained optimization problem
- We can drop the constraints if we consider an undirected model
- We have seen these when we talked about Maximum Entropy models
- The undirected version of the HMM that maximizes p(y | x) is the equivalent of a CRF (Conditional Random Field)
- Lafferty et al., 2001
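Continuing the toy HMM sketch above, a sketch of p(y | x): the denominator Σ_{y'} p(x, y') is computed with the forward recursion. This reuses the hmm_joint, start, trans, emit, and vocab names defined in the previous sketch; all numbers remain illustrative.

```python
def forward_normalizer(words):
    """Σ_{y'} p(x, y'), computed with the forward recursion."""
    obs = [vocab[w] for w in words]
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()

def conditional_likelihood(words, tags):
    """p(y | x) = p(x, y) / Σ_{y'} p(x, y')."""
    return hmm_joint(words, tags) / forward_normalizer(words)

print(conditional_likelihood(["the", "dog"], [0, 1]))  # p(DT NN | "the dog")
```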
7. Conditional Random Fields
[Diagram: linear-chain CRF over states Y1, Y2, Y3, ..., Yn and observations X1, X2, X3, ..., Xn]
- Now the probabilities are replaced by weights
- p(x_i | y_i) → exp(w · f)
- p(y_i | y_{i-1}) → exp(λ · f)
- Decompose the (unnormalized) score of (x, y) into a product of smaller terms
- log p(y | x) = Σ_i w · f_em(x_i, y_i) + Σ_i λ · f_tr(y_{i-1}, y_i) − log Z(x)
- Z(x) is the normalization term: a sum over all the outputs (use forward-backward as in the HMM)
- f are feature functions (as in maximum entropy models)
- Drop the constraints of HMMs
- Use gradient methods to solve for the weights (a toy sketch follows below)
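A compact sketch of a linear-chain CRF with assumed toy weights (an emission matrix W and a transition matrix T stand in for w · f_em and λ · f_tr; nothing here is learned or taken from the paper): log p(y | x) is the path score minus log Z(x), with log Z(x) computed by a forward recursion in log space.

```python
import numpy as np

# Toy linear-chain CRF weights (illustrative, not learned):
# W[tag, word] plays the role of w · f_em, T[prev, next] plays the role of λ · f_tr.
W = np.array([[1.5, -0.5],
              [-1.0, 2.0]])
T = np.array([[-2.0, 1.0],
              [0.5, -0.5]])

def path_score(obs, tags):
    """Unnormalized log-score of a tag sequence: sum of emission and transition weights."""
    s = sum(W[t, o] for t, o in zip(tags, obs))
    s += sum(T[a, b] for a, b in zip(tags, tags[1:]))
    return s

def log_Z(obs):
    """log Z(x): log-sum over all tag sequences, via a forward recursion in log space."""
    alpha = W[:, obs[0]].copy()
    for o in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + T, axis=0) + W[:, o]
    return np.logaddexp.reduce(alpha)

def log_conditional(obs, tags):
    """log p(y | x) = score(x, y) - log Z(x); gradient ascent on this trains the weights."""
    return path_score(obs, tags) - log_Z(obs)

print(np.exp(log_conditional([0, 1], [0, 1])))  # p of tag path (0, 1) given word path (0, 1)
```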
8. Why does the conditional model perform better?
- Smoothing (adding one instance containing each vocabulary word for each class)
- Helps control overfitting
- Words that appear only once for a class can otherwise get large weights
- If other indicator words are present, this doesn't make much of a difference
- If no strong indicator words are present, the weight for word w will grow until sense s is predicted correctly (a toy sketch follows below)
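A small illustration of the rare-word effect described above (toy data, plain gradient ascent on the conditional log-likelihood, no smoothing or regularization; not from the paper): the weight on a word that occurs only once keeps growing until its single example is classified correctly.

```python
import numpy as np

# Features: [common_word, rare_word]; the rare word occurs in exactly one example,
# and that example belongs to the other class.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 1])
w = np.zeros(2)

for step in range(2000):                     # plain gradient ascent on the conditional LL
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w += 0.5 * X.T @ (y - p)                 # no regularization: weights can grow freely

print(w)  # the rare word's weight grows large so its single example is predicted correctly
```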
9. Example
10. Trade-offs
- Implementation difficulty?
- Running time?
- Accuracy?
11. Joint vs. conditional likelihood maximization as learning
- Joint likelihood
- Less training time
- Closed-form solutions if the data is fully observable
- Needs less data
- Lower accuracy
- Conditional likelihood
- More training time
- No closed-form solutions; need to use (constrained) gradient methods
- Needs more data
- Higher accuracy
12. Model Structures
- Hidden Markov Model
- Assumes independence of future observations from past ones given the intermediate state
- Joint likelihood
- Conditional likelihood
- CRFs
13. Model Structures
- Upward conditional Markov Model (CMM)
- Limitations
- Assumption that states are independent of future observations (a toy sketch of the CMM follows below)
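For contrast with the CRF sketch above, a toy sketch of the CMM structure (illustrative weights, plus an assumed dummy start tag): each tag is drawn from a locally normalized distribution conditioned on the previous tag and the current observation, so there is no global normalizer Z(x). This local normalization is what makes the label and observation biases discussed next possible.

```python
import numpy as np

# Toy conditional Markov model (CMM) sketch; weights and the dummy start tag are illustrative.
# Each state is predicted from the previous state and the current observation:
#   p(y_i | y_{i-1}, x_i) = softmax over y_i of ( T[y_{i-1}, y_i] + W[y_i, x_i] )
W = np.array([[1.5, -0.5],
              [-1.0, 2.0]])
T = np.array([[-2.0, 1.0],
              [0.5, -0.5]])

def local_dist(prev_tag, obs):
    """Locally normalized next-tag distribution p(y_i | y_{i-1}, x_i)."""
    scores = T[prev_tag] + W[:, obs]
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cmm_conditional(obs_seq, tags, start_tag=0):
    """p(y | x) as a product of local conditionals -- no global Z(x) to compute."""
    p, prev = 1.0, start_tag
    for o, t in zip(obs_seq, tags):
        p *= local_dist(prev, o)[t]
        prev = t
    return p

print(cmm_conditional([0, 1], [0, 1]))
```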
14. When things go wrong with a CMM
- In theory: label bias
- States whose following-state distribution has low entropy are preferred
- The previous state explains away the next state so well that the observation at the current state is ignored
15. Observation bias example
- "All the indexes dove"
- PDT DT NNS VBD
- All-DT is more common, so the CMM labels "All" as DT
- The HMM picks up that DT-DT is less common than PDT-DT, so it labels "All" correctly
16. Unobserving variables
- Change the status of the observed variable at the current position to unobserved
- During inference we retain that the state above "the" is DT, but forget that the observation is "the", and sum over all observation values at that position given DT and the previous state
17. When things go wrong with a CMM
- In practice: observation bias
- State-state distributions are not as sharp as state-observation distributions
- Observations explain the states above them so well that the previous states are ignored
18. Conclusion and discussion
- Optimizing the objective being evaluated has a positive but often small effect
- Model structure is important
- In the CMM, observation bias is more evident than label bias
- A CMM is not desirable unless its structure makes it possible to incorporate better features
20. Objective functions
- Constraints
- A constraint is needed to get a non-deficient model (the constrained versions are annotated SCL and CL)
- Unconstrained CL is equivalent to a maximum entropy model / logistic regression, which is known to be concave
- Constrained CL and SCL are neither convex nor concave
- In practice, no local maximum was found that was not also global over the feasible region
21. Objective functions
- Conditional likelihood
- Sum of conditional likelihoods
- Naïve Bayes
- Accuracy, but optimization is NP-hard!
23. [Chart: relative accuracy difference (Acc(CL) − Acc(JL)) / Acc(JL)]