Conditional Structure versus Conditional Estimation in NLP Models (and a little about Conditional Random Fields) - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Conditional Structure versus Conditional Estimation in NLP Models (and a little about Conditional Random Fields)
  • Dan Klein and Chris Manning, Empirical Methods in Natural Language Processing (EMNLP), 2002
  • CRFs: John Lafferty et al., ICML 2001

2
Abstract
  • Parameter estimation for probabilistic NLP models
  • Goal: separate the effect of conditional parameter estimation from that of conditional model structure
  • 1. Naïve Bayes vs. logistic regression
  • 2. Hidden Markov Model vs. maximum-entropy (conditional) Markov Model for tagging
  • Empirically determine what works and what does not on a word sense disambiguation task and a part-of-speech tagging task

3
Outline
  • Probabilistic graphical models
  • Objective functions: joint log-likelihood (JLL) and conditional log-likelihood (CLL)
  • Model structures for part-of-speech tagging: HMM and CMM
  • Discussion and conclusion

4
Classification problem
  • Learn f: X → Y from D = {(x_i, y_i)}, i = 1..n
  • f can be a probability function!
  • What if Y is multivariate?
  • Probabilistic model: how do we model p(x, y)?
  • For a new instance x_new, assign the most likely output (sketched below):
  • y = arg max_y p(y | x_new)
  •   = arg max_y p(y, x_new) / p(x_new)
  •   = arg max_y p(y, x_new)
  • MAP (maximum a posteriori) principle
  • What if we maximize p(y | x) directly?
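A minimal sketch of MAP prediction with a joint model (the class names and probabilities below are made up, not from the slides): pick the class that maximizes p(y, x_new) = p(y) p(x_new | y), working in log space.

# MAP prediction sketch: arg max_y p(y | x_new) = arg max_y p(y) p(x_new | y)
import numpy as np

log_prior = {"class_a": np.log(0.7), "class_b": np.log(0.3)}       # log p(y)
log_likelihood = {"class_a": np.log(0.1), "class_b": np.log(0.4)}  # log p(x_new | y)

scores = {y: log_prior[y] + log_likelihood[y] for y in log_prior}  # log p(y, x_new)
y_hat = max(scores, key=scores.get)                                # MAP class for x_new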

5
Probabilistic Graphical Models (for example HMMs)
[Diagram: hidden states y_1, y_2, y_3, ..., y_n with observations x_1, x_2, x_3, ..., x_n below them]
  • Decompose p(x, y) into a product of smaller terms
  • p(x, y) = ∏_{i=1..n} p(x_i | y_i) p(y_i | y_{i-1})
  • Tractable: only need to estimate each p(x_i | y_i) and p(y_i | y_{i-1})
  • Inference (computing the arg max) is done via the Viterbi algorithm (see the sketch below)
  • Learning: maximization of p(x, y) when the data is fully observable (which it is if we have everything labeled)


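A minimal sketch of Viterbi decoding for such an HMM, assuming a toy setup where the initial, transition, and emission probabilities are already given as log-space numpy arrays (this is illustrative code, not the authors'):

# Viterbi decoding: most likely state sequence under the HMM factorization above.
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """log_init[s], log_trans[s_prev, s_cur], log_emit[s, o]; obs is a list of symbol ids."""
    n_states = log_init.shape[0]
    T = len(obs)
    delta = np.full((T, n_states), -np.inf)    # best log score of any path ending in state s at time t
    back = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans           # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    path = [int(delta[-1].argmax())]                         # best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                                        # most likely state sequence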
6
Maximizing the conditional likelihood
  • From the HMM we have p(x, y)
  • p(y | x) = p(x, y) / Σ_{y'} p(x, y')
  • The sum can be computed efficiently using the forward-backward procedure (sketched below)
  • Since the parameters are probabilities, we need
  • Σ_x p(x | y) = 1 and Σ_{y'} p(y' | y) = 1
  • Learning becomes a constrained optimization problem
  • We can drop the constraints if we consider an undirected model
  • We have seen these when we talked about Maximum Entropy models
  • The undirected version of an HMM that maximizes p(y | x) is equivalent to a CRF (Conditional Random Field)
  • Lafferty et al., 2001
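A minimal sketch of the forward recursion that computes Z = Σ_y p(x, y), assuming the same toy log-space arrays as in the Viterbi sketch above (illustrative only):

# Forward algorithm: log Σ_y p(x, y), so that log p(y | x) = log p(x, y) - this value.
import numpy as np
from scipy.special import logsumexp

def forward_log_marginal(log_init, log_trans, log_emit, obs):
    alpha = log_init + log_emit[:, obs[0]]                   # log p(x_1, y_1 = s) for each state s
    for o in obs[1:]:
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_emit[:, o]
    return logsumexp(alpha)                                  # sum out the final state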

7
Conditional Random Fields
[Diagram: undirected chain of states y_1, y_2, y_3, ..., y_n with observations x_1, x_2, x_3, ..., x_n]
  • Now the probabilities become weights
  • p(x_i | y_i) → exp(w · f(em))
  • p(y_i | y_{i-1}) → exp(λ · f(tr))
  • Decompose p(x, y) into a product of smaller terms
  • log p(y | x) = w · f(em) + λ · f(tr) - log Z
  • Z is the normalization term, a sum over all output sequences (computed with forward-backward as in the HMM)
  • The f are feature functions (as in maximum entropy models)
  • Drop the constraints of HMMs
  • Use gradient methods to solve for the weights (see the sketch below)

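A minimal sketch of that conditional log-probability for a linear-chain CRF, with hypothetical feature functions f_em and f_tr and a brute-force partition function (real implementations use forward-backward; none of this is the authors' code):

# Linear-chain CRF: log p(y | x) = score(x, y) - log Z(x).
import numpy as np
from itertools import product
from scipy.special import logsumexp

def crf_log_prob(w, lam, f_em, f_tr, x, y, all_labels):
    """w, lam are weight vectors; f_em(x, i, label) and f_tr(prev, cur) return feature vectors."""
    def score(labels):
        s = sum(w @ f_em(x, i, labels[i]) for i in range(len(x)))                  # emission features
        s += sum(lam @ f_tr(labels[i - 1], labels[i]) for i in range(1, len(x)))   # transition features
        return s
    # Brute-force normalizer over every label sequence (only feasible for tiny examples).
    log_Z = logsumexp([score(seq) for seq in product(all_labels, repeat=len(x))])
    return score(y) - log_Z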
8
Why does the conditional model perform better?
  • Smoothing (adding one instance of each vocabulary word for each class; sketched below)
  • Helps with overfitting
  • Words that appear only once for a class can get large weights
  • If other indicator words are present, this doesn't make much difference
  • If no strong indicator words are present, the weight for word w will grow until sense s is predicted correctly
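A minimal sketch of that kind of smoothing for a Naïve Bayes word-given-class table, with invented counts: adding one pseudo-instance of every vocabulary word to every class is the same as adding one to every count.

# Add-one smoothing of p(word | class); rows are classes, columns are vocabulary words.
import numpy as np

counts = np.array([[5.0, 0.0, 2.0],
                   [0.0, 7.0, 1.0]])                       # made-up word counts per class
smoothed = counts + 1.0                                    # one pseudo-count per (class, word)
p_word_given_class = smoothed / smoothed.sum(axis=1, keepdims=True)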

9
Example
10
Trade-offs
  • Implementation difficulty?
  • Running time?
  • Accuracy?

11
Joint vs. conditional likelihood maximization as
learning
  • Joint likelihood
  • Less training time
  • Closed-form solutions if the data is fully observable
  • Needs less data
  • Lower accuracy
  • Conditional likelihood
  • More training time
  • No closed-form solutions; need (constrained) gradient methods (see the sketch below)
  • Needs more data
  • Higher accuracy
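A minimal sketch of the "no closed form, use gradients" case: fitting a binary logistic regression p(y | x) by gradient ascent on the conditional log-likelihood, over invented synthetic data (not from the paper).

# Gradient ascent on the conditional log-likelihood of logistic regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                              # 100 instances, 3 features (synthetic)
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)     # synthetic labels

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))                     # p(y = 1 | x) under current weights
    grad = X.T @ (y - p)                                   # gradient of the conditional log-likelihood
    w += lr * grad / len(y)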

12
Model Structures
  • Hidden Markov Model
  • Assumes future observations are independent of past observations given the intermediate state
  • Can be estimated by joint likelihood or by conditional likelihood (CRFs)

13
Model Structures
  • Conditional Markov Model (CMM), with arrows pointing upward from observations to states
  • Limitation: assumes states are independent of future observations

14
When things go wrong with CMM
  • In theory: label bias
  • States whose following-state distribution has low entropy are preferred
  • The previous state explains away the next state so well that the observation at the current state is ignored

15
Observation bias example
  • "All the indexes dove"
  •  PDT  DT  NNS   VBD
  • "All" as DT is more common, so the CMM tags "All" as DT
  • The HMM picks up that DT DT is less common than PDT DT, so it tags "All" correctly (as PDT)

16
Unobserving variables
  • Change the status of observed variables at the current position from observed to unobserved
  • During inference we retain that the state above "the" is DT, but forget that the observation is "the", and sum over all possible observations at that position given DT and the previous state

17
When things go wrong with CMM
  • In practice: observation bias
  • State-to-state distributions are not as sharp as state-to-observation distributions
  • Observations explain the states above them so well that previous states are ignored

18
Conclusion and discussion
  • Optimizing the objective being evaluated has a positive but often small effect
  • Model structure is important
  • In the CMM, observation bias is more evident than label bias
  • The CMM structure is not desirable unless it enables incorporating better features

19
  • Other slides

20
Objective functions
  • Constraints
  • Constrained to obtain a non-deficient model (applied to the SCL and CL objectives)
  • Unconstrained CL is equivalent to a maximum entropy model / logistic regression, which is known to be concave
  • Constrained CL and SCL are neither convex nor concave
  • In practice, no local maximum was found that was not also global over the feasible region

21
Objective functions
  • Conditional likelihood (CL)
  • Sum of conditional likelihoods (SCL)
  • Naïve Bayes (joint likelihood)
  • Accuracy (but optimizing it directly is NP-hard; all four objectives are sketched below)
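A minimal sketch of these objectives for any model that defines a joint p(x, y), over a made-up array of log-joint scores (the formulas on the original slide were images; this is only an illustration of the names above):

# Joint log-likelihood, conditional log-likelihood, sum of conditional likelihoods, accuracy.
import numpy as np

def objectives(log_joint, labels):
    """log_joint[i, y] = log p(x_i, y) under current parameters; labels is an int array of true y_i."""
    idx = np.arange(len(labels))
    log_px = np.logaddexp.reduce(log_joint, axis=1)        # log p(x_i) = log Σ_y p(x_i, y)
    log_cond = log_joint[idx, labels] - log_px             # log p(y_i | x_i)
    jl = log_joint[idx, labels].sum()                      # joint log-likelihood
    cl = log_cond.sum()                                    # conditional log-likelihood
    scl = np.exp(log_cond).sum()                           # sum of conditional likelihoods
    acc = (log_joint.argmax(axis=1) == labels).mean()      # accuracy of MAP predictions
    return jl, cl, scl, acc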
22
(No Transcript)
23
Relative accuracy improvement: (Acc(CL) - Acc(JL)) / Acc(JL)
24
(No Transcript)