1. Conditional Structure versus Conditional Estimation in NLP Models (and a little about Conditional Random Fields)
- Chris Manning, Dan Klein
- Empirical Methods in Natural Language Processing (EMNLP), 2002
- CRFs: John Lafferty et al., ICML 2001
2. Abstract
- Learning as parameter estimation in probabilistic models
- Goal: separate the effect of conditional parameter estimation from that of conditional model structure
- 1. Naïve Bayes vs. Logistic Regression
- 2. Markov model vs. maximum entropy tagging
- Empirically determine what works and what does not on a word sense disambiguation task and a part-of-speech tagging task
3. Outline
- Probabilistic graphical models
- Objective functions
- Joint log-likelihood (JLL) and conditional log-likelihood (CLL)
- Model structures for part-of-speech tagging
- HMM and CMM
- Discussion and conclusion
4. Classification problem
- Learn f: X → Y from D = {(x_i, y_i)}, i = 1..n
- f can be a probability function!
- What if Y is multivariate?
- Probabilistic model
- How do we model p(x, y)?
- For a new instance x_new we can assign the most likely output
- y* = argmax_y p(y | x_new)
     = argmax_y p(y, x_new) / p(x_new)
     = argmax_y p(y, x_new)
- MAP (maximum a posteriori) principle (a toy sketch follows below)
- What if we maximize p(y | x) directly?
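A minimal sketch of the MAP decision rule above, using a joint (Naïve Bayes-style) model estimated from counts with add-one smoothing; the toy data, vocabulary, and function names are illustrative assumptions, not from the paper.

```python
from collections import Counter, defaultdict

# Toy word sense disambiguation data: (context words, sense label) -- purely illustrative.
data = [
    (["bank", "river"], "WATER"),
    (["bank", "money"], "FINANCE"),
    (["loan", "money"], "FINANCE"),
]
vocab = {w for words, _ in data for w in words}

# Joint (Naive Bayes) estimation: relative-frequency counts.
class_counts = Counter(y for _, y in data)
word_counts = defaultdict(Counter)
for words, y in data:
    word_counts[y].update(words)

def joint_prob(words, y):
    """p(y, x) = p(y) * prod_i p(x_i | y), with add-one smoothing."""
    p = class_counts[y] / sum(class_counts.values())
    total = sum(word_counts[y].values())
    for w in words:
        p *= (word_counts[y][w] + 1) / (total + len(vocab))
    return p

def map_classify(words):
    """MAP rule: argmax_y p(y, x); dividing by p(x) would not change the argmax."""
    return max(class_counts, key=lambda y: joint_prob(words, y))

print(map_classify(["bank", "money"]))  # -> 'FINANCE'
```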
5. Probabilistic Graphical Models (for example, HMMs)
[Diagram: HMM with hidden states Y1, Y2, Y3, ..., Yn emitting observations X1, X2, X3, ..., Xn]
- Decompose p(x, y) into a product of smaller terms
- p(x, y) = ∏_{i=1..n} p(x_i | y_i) p(y_i | y_{i-1})
- Tractable: we only need to estimate each p(x_i | y_i) and p(y_i | y_{i-1})
- Inference (computing the argmax) is done via Viterbi (a toy sketch follows below)
- Learning: maximization of p(x, y), which has a closed form when the data is fully observable (which it is if we have everything labeled)
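A minimal sketch of this factorization and of Viterbi decoding, under assumed toy tags, vocabulary, and probabilities (none of these numbers are from the paper):

```python
import numpy as np

# Toy HMM parameters (illustrative only).
states = ["DT", "NN"]                        # hidden tags Y
vocab = {"the": 0, "dog": 1}                 # observations X
start = np.array([0.8, 0.2])                 # p(y_1)
trans = np.array([[0.1, 0.9],                # p(y_i | y_{i-1})
                  [0.6, 0.4]])
emit = np.array([[0.9, 0.1],                 # p(x_i | y_i)
                 [0.2, 0.8]])

def hmm_joint(words, tags):
    """p(x, y) = prod_i p(x_i | y_i) p(y_i | y_{i-1})."""
    p = start[tags[0]] * emit[tags[0], vocab[words[0]]]
    for i in range(1, len(words)):
        p *= trans[tags[i - 1], tags[i]] * emit[tags[i], vocab[words[i]]]
    return p

def viterbi(words):
    """argmax_y p(x, y) by dynamic programming."""
    obs = [vocab[w] for w in words]
    delta = start * emit[:, obs[0]]                    # best score ending in each state
    backptrs = []
    for o in obs[1:]:
        scores = delta[:, None] * trans * emit[:, o]   # [previous state, next state]
        backptrs.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for bp in reversed(backptrs):
        path.append(int(bp[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["the", "dog"]))  # -> ['DT', 'NN']
```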
6. Maximizing the conditional likelihood
- From the HMM we have p(x, y)
- p(y | x) = p(x, y) / Σ_{y'} p(x, y')
- The sum can be computed efficiently using the forward-backward procedure (sketch below)
- Since the parameters are probabilities, we need
- Σ_x p(x | y) = 1 and Σ_{y'} p(y' | y) = 1
- Learning becomes a constrained optimization problem
- We can drop the constraints if we consider an undirected model
- We have seen these when we talked about Maximum Entropy models
- The undirected version of the HMM that maximizes p(y | x) is the equivalent of a CRF (Conditional Random Field)
- Lafferty et al., 2001
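Continuing the toy HMM sketch above, a sketch of p(y | x): the denominator Σ_{y'} p(x, y') is computed with the forward recursion. This reuses the hmm_joint, start, trans, emit, and vocab names defined in the previous sketch; all numbers remain illustrative.

```python
def forward_normalizer(words):
    """Σ_{y'} p(x, y'), computed with the forward recursion."""
    obs = [vocab[w] for w in words]
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()

def conditional_likelihood(words, tags):
    """p(y | x) = p(x, y) / Σ_{y'} p(x, y')."""
    return hmm_joint(words, tags) / forward_normalizer(words)

print(conditional_likelihood(["the", "dog"], [0, 1]))  # p(DT NN | "the dog")
```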
7. Conditional Random Fields
[Diagram: linear-chain CRF over states Y1, Y2, Y3, ..., Yn and observations X1, X2, X3, ..., Xn]
- Now the probabilities are replaced by weights
- p(x_i | y_i) → exp(w · f)
- p(y_i | y_{i-1}) → exp(λ · f)
- Decompose the (unnormalized) score of (x, y) into a product of smaller terms
- log p(y | x) = Σ_i w · f_em(x_i, y_i) + Σ_i λ · f_tr(y_{i-1}, y_i) − log Z(x)
- Z(x) is the normalization term: a sum over all the outputs (use forward-backward as in the HMM)
- f are feature functions (as in maximum entropy models)
- Drop the constraints of HMMs
- Use gradient methods to solve for the weights (a toy sketch follows below)
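A compact sketch of a linear-chain CRF with assumed toy weights (an emission matrix W and a transition matrix T stand in for w · f_em and λ · f_tr; nothing here is learned or taken from the paper): log p(y | x) is the path score minus log Z(x), with log Z(x) computed by a forward recursion in log space.

```python
import numpy as np

# Toy linear-chain CRF weights (illustrative, not learned):
# W[tag, word] plays the role of w · f_em, T[prev, next] plays the role of λ · f_tr.
W = np.array([[1.5, -0.5],
              [-1.0, 2.0]])
T = np.array([[-2.0, 1.0],
              [0.5, -0.5]])

def path_score(obs, tags):
    """Unnormalized log-score of a tag sequence: sum of emission and transition weights."""
    s = sum(W[t, o] for t, o in zip(tags, obs))
    s += sum(T[a, b] for a, b in zip(tags, tags[1:]))
    return s

def log_Z(obs):
    """log Z(x): log-sum over all tag sequences, via a forward recursion in log space."""
    alpha = W[:, obs[0]].copy()
    for o in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + T, axis=0) + W[:, o]
    return np.logaddexp.reduce(alpha)

def log_conditional(obs, tags):
    """log p(y | x) = score(x, y) - log Z(x); gradient ascent on this trains the weights."""
    return path_score(obs, tags) - log_Z(obs)

print(np.exp(log_conditional([0, 1], [0, 1])))  # p of tag path (0, 1) given word path (0, 1)
```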
8. Why does the conditional model perform better?
- Smoothing (adding one instance containing each vocabulary word for each class)
- Helps control overfitting
- Words that appear only once for a class can otherwise get large weights
- If other indicator words are present, this doesn't make much of a difference
- If no strong indicator words are present, the weight for word w will grow until sense s is predicted correctly (a toy sketch follows below)
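A small illustration of the rare-word effect described above (toy data, plain gradient ascent on the conditional log-likelihood, no smoothing or regularization; not from the paper): the weight on a word that occurs only once keeps growing until its single example is classified correctly.

```python
import numpy as np

# Features: [common_word, rare_word]; the rare word occurs in exactly one example,
# and that example belongs to the other class.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 1])
w = np.zeros(2)

for step in range(2000):                     # plain gradient ascent on the conditional LL
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w += 0.5 * X.T @ (y - p)                 # no regularization: weights can grow freely

print(w)  # the rare word's weight grows large so its single example is predicted correctly
```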
9. Example
10. Trade-offs
- Implementation difficulty?
- Running time?
- Accuracy?
11. Joint vs. conditional likelihood maximization as learning
- Joint likelihood
- Less training time
- Closed-form solutions if the data is fully observable
- Needs less data
- Lower accuracy
- Conditional likelihood
- More training time
- No closed-form solutions; need to use (constrained) gradient methods
- Needs more data
- Higher accuracy
12. Model Structures
- Hidden Markov Model
- Assumes independence of future observations from past ones given the intermediate state
- Joint likelihood
- Conditional likelihood
- CRFs
13. Model Structures
- Upward conditional Markov Model (CMM)
- Limitations
- Assumption that states are independent of future observations (a toy sketch of the CMM follows below)
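For contrast with the CRF sketch above, a toy sketch of the CMM structure (illustrative weights, plus an assumed dummy start tag): each tag is drawn from a locally normalized distribution conditioned on the previous tag and the current observation, so there is no global normalizer Z(x). This local normalization is what makes the label and observation biases discussed next possible.

```python
import numpy as np

# Toy conditional Markov model (CMM) sketch; weights and the dummy start tag are illustrative.
# Each state is predicted from the previous state and the current observation:
#   p(y_i | y_{i-1}, x_i) = softmax over y_i of ( T[y_{i-1}, y_i] + W[y_i, x_i] )
W = np.array([[1.5, -0.5],
              [-1.0, 2.0]])
T = np.array([[-2.0, 1.0],
              [0.5, -0.5]])

def local_dist(prev_tag, obs):
    """Locally normalized next-tag distribution p(y_i | y_{i-1}, x_i)."""
    scores = T[prev_tag] + W[:, obs]
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cmm_conditional(obs_seq, tags, start_tag=0):
    """p(y | x) as a product of local conditionals -- no global Z(x) to compute."""
    p, prev = 1.0, start_tag
    for o, t in zip(obs_seq, tags):
        p *= local_dist(prev, o)[t]
        prev = t
    return p

print(cmm_conditional([0, 1], [0, 1]))
```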
14. When things go wrong with a CMM
- In theory: label bias
- States whose following-state distribution has low entropy are preferred
- The previous state explains away the next state so well that the observation at the current state is ignored
15. Observation bias example
- "All the indexes dove"
- PDT DT NNS VBD
- All-DT is more common, so the CMM labels "All" as DT
- The HMM picks up that DT-DT is less common than PDT-DT, so it labels "All" correctly
16. Unobserving variables
- Change the status of the observed variable at the current position to unobserved
- During inference we retain that the state above "the" is DT, but forget that the observation is "the", and sum over all observation values at that position given DT and the previous state
17. When things go wrong with a CMM
- In practice: observation bias
- State-state distributions are not as sharp as state-observation distributions
- Observations explain the states above them so well that the previous states are ignored
18. Conclusion and discussion
- Optimizing the objective being evaluated has a positive but often small effect
- Model structure is important
- In the CMM, observation bias is more evident than label bias
- A CMM is not desirable unless its structure makes it possible to incorporate better features
20. Objective functions
- Constraints
- A constraint is needed to get a non-deficient model (the constrained versions are annotated SCL and CL)
- Unconstrained CL is equivalent to a maximum entropy model / logistic regression, which is known to be concave
- Constrained CL and SCL are neither convex nor concave
- In practice, no local maximum was found that was not also global over the feasible region
21. Objective functions
- Conditional likelihood
- Sum of conditional likelihoods
- Naïve Bayes
- Accuracy, but optimization is NP-hard!
23. [Chart: relative accuracy difference (Acc(CL) − Acc(JL)) / Acc(JL)]