Title: CS546 Machine Learning and Natural Language: Multi-Class and Structured Prediction Problems
1. CS546 Machine Learning and Natural Language: Multi-Class and Structured Prediction Problems
- Slides from Taskar and Klein are used in this lecture
2. Outline
- Multi-class classification
- Structured prediction
- Models for structured prediction and classification
- Example: POS tagging
3. Multiclass problems
- Most of the machinery we discussed before was focused on binary classification problems, e.g., the SVMs covered so far
- However, most problems we encounter in NLP are either
  - Multiclass, e.g., text categorization
  - Structured prediction, e.g., predicting the syntactic structure of a sentence
- How do we deal with them?
4. Binary linear classification
5. Multiclass classification
6. Perceptron
7. Structured Perceptron
- Joint feature representation
- Algorithm (see the sketch below)
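The two bullets above name the ingredients of the structured perceptron: a joint feature map over input and output, and a decoding step. A minimal sketch of the mistake-driven update, assuming hypothetical `phi` and `argmax_y` helpers (e.g., Viterbi decoding for tagging); this is an illustration, not the exact formulation on the slides:

```python
import numpy as np

def structured_perceptron(data, phi, argmax_y, n_features, epochs=5):
    """Structured perceptron training (Collins-style update).

    data: list of (x, y_gold) pairs
    phi(x, y): joint feature map returning a vector of length n_features
    argmax_y(w, x): decoder returning the highest-scoring output under w
    """
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = argmax_y(w, x)            # decode with current weights
            if y_pred != y_gold:               # mistake-driven update
                w += phi(x, y_gold) - phi(x, y_pred)
    return w
```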
8. Perceptron
9. Binary Classification Margin
10. Generalize to Multiclass
11. Converting to Multiclass SVM
12. Max Margin / Min Norm
- As before, these are equivalent formulations (sketched below)
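A sketch of the two standard separable-case formulations of the multiclass SVM, written in the usual Crammer-Singer style; the notation on the original slides may differ:

```latex
% Max margin: fix the weight norm, maximize the margin
\max_{\gamma,\; \mathbf{w} : \|\mathbf{w}\| = 1} \; \gamma
\quad \text{s.t.} \quad
\mathbf{w}_{y_i} \cdot \mathbf{x}_i - \mathbf{w}_{y} \cdot \mathbf{x}_i \ge \gamma
\quad \forall i,\; \forall y \ne y_i

% Min norm: fix the margin to 1, minimize the weight norm
\min_{\mathbf{w}} \; \tfrac{1}{2} \|\mathbf{w}\|^2
\quad \text{s.t.} \quad
\mathbf{w}_{y_i} \cdot \mathbf{x}_i - \mathbf{w}_{y} \cdot \mathbf{x}_i \ge 1
\quad \forall i,\; \forall y \ne y_i
```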
13. Problems
- Requires separability
- What if we have noise in the data?
- What if we only have a limited, simple feature space?
14. Non-separable case
15. Non-separable case
16. Compare with MaxEnt
17. Loss Comparison
18. Multiclass → Structured
- So far, we considered multiclass classification
  - 0-1 losses l(y, y') (a structured alternative is sketched below)
- What if we want instead to predict
  - sequences of POS tags
  - syntactic trees
  - translations
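For structured outputs such as tag sequences, the 0-1 loss is usually replaced by a per-position (Hamming) loss. A minimal illustration (the function names are mine, not from the slides):

```python
def zero_one_loss(y_true, y_pred):
    """0-1 loss: any mismatch costs 1, even a single wrong tag."""
    return 0 if y_true == y_pred else 1

def hamming_loss(y_true, y_pred):
    """Hamming loss: number of positions where the two tag sequences differ."""
    return sum(t != p for t, p in zip(y_true, y_pred))

# One wrong tag out of four: the 0-1 loss treats this the same as all-wrong,
# the Hamming loss does not.
gold = ["DT", "NN", "VBZ", "JJ"]
pred = ["DT", "NN", "VBD", "JJ"]
print(zero_one_loss(gold, pred))  # 1
print(hamming_loss(gold, pred))   # 1 (out of 4 positions)
```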
19. Predicting word alignments
20. Predicting Syntactic Trees
21. Structured Models
22. Parsing
23. Max Margin Markov Networks (M3Ns)
Taskar et al., 2003; a similar formulation in Tsochantaridis et al., 2004
24. Max Margin Markov Networks (M3Ns)
25. Solving Multiclass with binary learning
- Multiclass classifier
  - Function f: R^d → {1, 2, 3, ..., k}
- Decompose into binary problems
- Not always possible to learn
- Different scale
- No theoretical justification
[Figure: the real problem]
26. Learning via One-Versus-All (OvA): Assumption
- Find v_r, v_b, v_g, v_y ∈ R^n such that
  - v_r · x > 0 iff y = red
  - v_b · x > 0 iff y = blue
  - v_g · x > 0 iff y = green
  - v_y · x > 0 iff y = yellow
- Classifier: f(x) = argmax_i v_i · x (prediction sketch below)
- Hypothesis space H = R^{kn}
[Figures: individual classifiers; decision regions]
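A minimal sketch of OvA prediction with one learned weight vector per class (the toy weights below are placeholders; in practice each vector is trained as a binary classifier of its class against the rest):

```python
import numpy as np

def ova_predict(weight_vectors, x):
    """One-versus-All prediction: score x with every class vector, take the argmax.

    weight_vectors: dict mapping class label -> np.ndarray of shape (n,)
    """
    scores = {label: v.dot(x) for label, v in weight_vectors.items()}
    return max(scores, key=scores.get)

# Toy 2-dimensional weights for the four colors from the slide
weights = {
    "red":    np.array([ 1.0,  0.0]),
    "blue":   np.array([-1.0,  0.0]),
    "green":  np.array([ 0.0,  1.0]),
    "yellow": np.array([ 0.0, -1.0]),
}
print(ova_predict(weights, np.array([0.2, 0.9])))  # -> "green"
```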
27. Learning via All-versus-All (AvA): Assumption
- Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^d such that
  - v_rb · x > 0 if y = red, < 0 if y = blue
  - v_rg · x > 0 if y = red, < 0 if y = green
  - ... (for all pairs)
[Figures: individual classifiers; decision regions]
28. Classifying with AvA
- All of these decision rules are applied after learning and may behave inconsistently (a majority-vote sketch follows)
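One common post-learning rule is majority voting over the pairwise classifiers; a minimal sketch with placeholder pairwise weight vectors (illustrative only, other combination rules exist):

```python
from collections import Counter

def ava_predict(pairwise_weights, x):
    """All-versus-All prediction by majority vote.

    pairwise_weights: dict mapping (class_a, class_b) -> weight vector;
    a positive score votes for class_a, a negative score for class_b.
    """
    votes = Counter()
    for (a, b), v in pairwise_weights.items():
        votes[a if v.dot(x) > 0 else b] += 1
    return votes.most_common(1)[0][0]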
29. POS Tagging
30. POS Tagging, examples from WSJ
(from McCallum)
31. POS Tagging
- Ambiguity: not a trivial task
- A useful task: important features for later processing steps are based on POS tags
  - E.g., use POS tags as input to a parser
32. But still, why so popular?
- Historically the first statistical NLP problem
- Easy to apply arbitrary classifiers
  - both as sequence models and as independent per-word classifiers
- Can be regarded as a finite-state problem
- Easy to evaluate
- Annotation is cheaper to obtain than treebanks (relevant for other languages)
33. HMM (reminder)
34. HMM (reminder): transitions
35. Transition Estimates
36. Emission Estimates
37. MaxEnt (reminder)
38. Decoding: HMM vs MaxEnt
39. Accuracies overview
40. Accuracies overview
41. SVMs for tagging
- We can use SVMs in a similar way to MaxEnt (or other classifiers)
- We can use a window around the word (feature sketch below)
- 97.16% accuracy on WSJ
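A minimal sketch of window-based features for classifier-style tagging; the exact feature templates used on the slides (and in the cited work) may differ:

```python
def window_features(words, i, window=2):
    """Features for tagging words[i]: the word itself plus its neighbours
    within +/- `window` positions, padded with boundary symbols."""
    padded = ["<s>"] * window + list(words) + ["</s>"] * window
    j = i + window
    feats = {f"w[{k}]={padded[j + k]}" for k in range(-window, window + 1)}
    feats.add(f"suffix3={words[i][-3:]}")   # simple orthographic feature
    return feats

print(sorted(window_features(["the", "cat", "sleeps"], 1)))
```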
42. SVMs for tagging
(from Jimenez and Marquez)
43. No sequence modeling
44. CRFs and other global models
45. CRFs and other global models
46. Compare (models over a word sequence W and tag sequence T; factorizations sketched below)
- HMMs: generative, with locally normalized transition and emission factors
- MEMMs: discriminative, with locally normalized per-step factors
  - Note: after each step t the remaining probability mass cannot be reduced; it can only be redistributed across the possible state transitions
- CRFs: no local normalization
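A sketch of the standard factorizations behind this comparison (my notation; the slides' notation may differ):

```latex
% HMM: generative, joint distribution over tags T and words W
P(T, W) = \prod_{i} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)

% MEMM: discriminative, locally normalized at each step
P(T \mid W) = \prod_{i}
  \frac{\exp\big(\mathbf{w} \cdot \mathbf{f}(t_i, t_{i-1}, w_i)\big)}
       {\sum_{t'} \exp\big(\mathbf{w} \cdot \mathbf{f}(t', t_{i-1}, w_i)\big)}

% CRF: discriminative, a single global normalization Z(W)
P(T \mid W) = \frac{\exp\big(\sum_{i} \mathbf{w} \cdot \mathbf{f}(t_i, t_{i-1}, W, i)\big)}{Z(W)}
```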
47. Label Bias
(based on a slide from Joe Drish)
48. Label Bias
- Recall transition-based parsing: Nivre's algorithm (with beam search)
- At each step we can observe only local features (limited look-ahead)
- If we later see that the following word is impossible, we can only redistribute the probability across all (im-)possible decisions
- If there are only a few such decisions, we cannot decrease the probability dramatically (see the worked case below)
- So, label bias is likely to be a serious problem if
  - there are non-local dependencies
  - states have a small number of possible outgoing transitions
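A worked extreme case of this (my own illustration, not from the slides): with local normalization, a state that has a single outgoing transition must assign it all of the probability mass, whatever the observation says:

```latex
% If state s has exactly one successor s', local normalization forces
P(t_i = s' \mid t_{i-1} = s, \, w_i)
  = \frac{\exp\big(\mathbf{w} \cdot \mathbf{f}(s', s, w_i)\big)}
         {\exp\big(\mathbf{w} \cdot \mathbf{f}(s', s, w_i)\big)}
  = 1 \qquad \text{for every observation } w_i ,
% so evidence observed at step i cannot lower the score of a bad path.
```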
49. POS Tagging Experiments
- Uses an extended feature set (hard to integrate in a generative model)
- oov = out-of-vocabulary
50. Supervision
- So far we considered the supervised case
  - the training set is labeled
- However, we can try to induce word classes without supervision
  - unsupervised tagging
  - we will discuss the EM algorithm later
- We can also do it in a partly supervised way
  - seed tags
  - a small labeled dataset
  - a parallel corpus
  - ...
51. Why not predict POS tags and parse trees simultaneously?
- It is possible and often done this way
- Doing tagging internally often benefits parsing accuracy
- Unfortunately, parsing models are less robust than taggers
  - e.g., on non-grammatical sentences or in different domains
- It is more expensive and does not help...
52. Questions
- Why is there no label-bias problem for a generative model (e.g., an HMM)?
- How would you integrate word features into a generative model (e.g., HMMs for POS tagging)?
  - e.g., if the word has
    - suffixes: -ing, -s, -ed, -d, -ment, ...
    - prefixes: post-, de-, ...
53. CRFs for more complex structured output problems
- So far we considered sequence labeling problems
  - here, the structure of dependencies is fixed
- What if we do not know the structure, but would like to have interactions that respect the structure?
54. CRFs for more complex structured output problems
- Recall, we had the MST algorithm (McDonald and Pereira, 05)
55. CRFs for more complex structured output problems
- Complex inference
  - E.g., arbitrary second-order non-projective dependency parsing models are not tractable
  - NP-complete (McDonald and Pereira, EACL 06)
- Recently, conditional models for constituent parsing
  - (Finkel et al., ACL 08)
  - (Carreras et al., CoNLL 08)
  - ...
56. Back to Multiclass
- Let us review how to decompose a multiclass problem into binary classification problems
57. Summary
- Margin-based methods for multiclass classification and structured prediction
- CRFs vs HMMs vs MEMMs for POS tagging
58. Conclusions
- All approaches use a linear representation
- The differences are
  - the features
  - how the weights are learned
- Training paradigms
  - Global training (CRF, global Perceptron)
  - Modular training (PMM, MEMM, ...)
    - These approaches are easier to train, but may require additional mechanisms to enforce global constraints.