Title: CS546 Machine Learning and Natural Language: Multi-Class and Structured Prediction Problems
1. CS546 Machine Learning and Natural Language: Multi-Class and Structured Prediction Problems
- Slides from Taskar and Klein are used in this lecture
2. Outline
- Multi-class classification
- Structured prediction
- Models for structured prediction and classification
- Example: POS tagging
3. Multiclass problems
- Most of the machinery we discussed before was focused on binary classification problems, e.g., the SVMs covered so far
- However, most problems we encounter in NLP are either
  - Multiclass, e.g., text categorization
  - Structured prediction, e.g., predicting the syntactic structure of a sentence
- How do we deal with them?
4. Binary linear classification
5. Multiclass classification
6. Perceptron
7. Structured Perceptron
- Joint feature representation
- Algorithm (see the sketch below)
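The two bullets above name the ingredients of the structured perceptron: a joint feature map over input and output, and a decoding step. A minimal sketch of the mistake-driven update, assuming hypothetical `phi` and `argmax_y` helpers (e.g., Viterbi decoding for tagging); this is an illustration, not the exact formulation on the slides:

```python
import numpy as np

def structured_perceptron(data, phi, argmax_y, n_features, epochs=5):
    """Structured perceptron training (Collins-style update).

    data: list of (x, y_gold) pairs
    phi(x, y): joint feature map returning a vector of length n_features
    argmax_y(w, x): decoder returning the highest-scoring output under w
    """
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = argmax_y(w, x)            # decode with current weights
            if y_pred != y_gold:               # mistake-driven update
                w += phi(x, y_gold) - phi(x, y_pred)
    return w
```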
8. Perceptron
9. Binary Classification Margin
10. Generalize to Multiclass
11. Converting to Multiclass SVM
12. Max Margin / Min Norm
- As before, these are equivalent formulations (sketched below)
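A sketch of the two standard separable-case formulations of the multiclass SVM, written in the usual Crammer-Singer style; the notation on the original slides may differ:

```latex
% Max margin: fix the weight norm, maximize the margin
\max_{\gamma,\; \mathbf{w} : \|\mathbf{w}\| = 1} \; \gamma
\quad \text{s.t.} \quad
\mathbf{w}_{y_i} \cdot \mathbf{x}_i - \mathbf{w}_{y} \cdot \mathbf{x}_i \ge \gamma
\quad \forall i,\; \forall y \ne y_i

% Min norm: fix the margin to 1, minimize the weight norm
\min_{\mathbf{w}} \; \tfrac{1}{2} \|\mathbf{w}\|^2
\quad \text{s.t.} \quad
\mathbf{w}_{y_i} \cdot \mathbf{x}_i - \mathbf{w}_{y} \cdot \mathbf{x}_i \ge 1
\quad \forall i,\; \forall y \ne y_i
```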
13. Problems
- Requires separability
- What if we have noise in the data?
- What if we only have a limited, simple feature space?
14. Non-separable case
15. Non-separable case
16. Compare with MaxEnt
17. Loss Comparison
18. Multiclass → Structured
- So far, we considered multiclass classification
  - 0-1 losses l(y, y') (a structured alternative is sketched below)
- What if we want instead to predict
  - sequences of POS tags
  - syntactic trees
  - translations
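For structured outputs such as tag sequences, the 0-1 loss is usually replaced by a per-position (Hamming) loss. A minimal illustration (the function names are mine, not from the slides):

```python
def zero_one_loss(y_true, y_pred):
    """0-1 loss: any mismatch costs 1, even a single wrong tag."""
    return 0 if y_true == y_pred else 1

def hamming_loss(y_true, y_pred):
    """Hamming loss: number of positions where the two tag sequences differ."""
    return sum(t != p for t, p in zip(y_true, y_pred))

# One wrong tag out of four: the 0-1 loss treats this the same as all-wrong,
# the Hamming loss does not.
gold = ["DT", "NN", "VBZ", "JJ"]
pred = ["DT", "NN", "VBD", "JJ"]
print(zero_one_loss(gold, pred))  # 1
print(hamming_loss(gold, pred))   # 1 (out of 4 positions)
```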
19. Predicting word alignments
20. Predicting Syntactic Trees
21. Structured Models
22. Parsing
23. Max Margin Markov Networks (M3Ns)
Taskar et al., 2003; a similar formulation in Tsochantaridis et al., 2004
24. Max Margin Markov Networks (M3Ns)
25. Solving Multiclass with binary learning
- Multiclass classifier
  - Function f: R^d → {1, 2, 3, ..., k}
- Decompose into binary problems
- Not always possible to learn
- Different scale
- No theoretical justification
[Figure: the real problem]
26. Learning via One-Versus-All (OvA): Assumption
- Find v_r, v_b, v_g, v_y ∈ R^n such that
  - v_r · x > 0 iff y = red
  - v_b · x > 0 iff y = blue
  - v_g · x > 0 iff y = green
  - v_y · x > 0 iff y = yellow
- Classifier: f(x) = argmax_i v_i · x (prediction sketch below)
- Hypothesis space H = R^{kn}
[Figures: individual classifiers; decision regions]
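A minimal sketch of OvA prediction with one learned weight vector per class (the toy weights below are placeholders; in practice each vector is trained as a binary classifier of its class against the rest):

```python
import numpy as np

def ova_predict(weight_vectors, x):
    """One-versus-All prediction: score x with every class vector, take the argmax.

    weight_vectors: dict mapping class label -> np.ndarray of shape (n,)
    """
    scores = {label: v.dot(x) for label, v in weight_vectors.items()}
    return max(scores, key=scores.get)

# Toy 2-dimensional weights for the four colors from the slide
weights = {
    "red":    np.array([ 1.0,  0.0]),
    "blue":   np.array([-1.0,  0.0]),
    "green":  np.array([ 0.0,  1.0]),
    "yellow": np.array([ 0.0, -1.0]),
}
print(ova_predict(weights, np.array([0.2, 0.9])))  # -> "green"
```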
27. Learning via All-versus-All (AvA): Assumption
- Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^d such that
  - v_rb · x > 0 if y = red, < 0 if y = blue
  - v_rg · x > 0 if y = red, < 0 if y = green
  - ... (for all pairs)
[Figures: individual classifiers; decision regions]
28. Classifying with AvA
- All of these decision rules are applied after learning and may behave inconsistently (a majority-vote sketch follows)
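One common post-learning rule is majority voting over the pairwise classifiers; a minimal sketch with placeholder pairwise weight vectors (illustrative only, other combination rules exist):

```python
from collections import Counter

def ava_predict(pairwise_weights, x):
    """All-versus-All prediction by majority vote.

    pairwise_weights: dict mapping (class_a, class_b) -> weight vector;
    a positive score votes for class_a, a negative score for class_b.
    """
    votes = Counter()
    for (a, b), v in pairwise_weights.items():
        votes[a if v.dot(x) > 0 else b] += 1
    return votes.most_common(1)[0][0]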
29. POS Tagging
30. POS Tagging, examples from WSJ
(from McCallum)
31. POS Tagging
- Ambiguity: not a trivial task
- A useful task: important features for later processing steps are based on POS tags
  - E.g., use POS tags as input to a parser
32. But still, why so popular?
- Historically the first statistical NLP problem
- Easy to apply arbitrary classifiers
  - both as sequence models and as independent per-word classifiers
- Can be regarded as a finite-state problem
- Easy to evaluate
- Annotation is cheaper to obtain than treebanks (relevant for other languages)
33. HMM (reminder)
34. HMM (reminder): transitions
35. Transition Estimates
36. Emission Estimates
37. MaxEnt (reminder)
38. Decoding: HMM vs MaxEnt
39. Accuracies overview
40. Accuracies overview
41. SVMs for tagging
- We can use SVMs in a similar way to MaxEnt (or other classifiers)
- We can use a window around the word (feature sketch below)
- 97.16% accuracy on WSJ
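A minimal sketch of window-based features for classifier-style tagging; the exact feature templates used on the slides (and in the cited work) may differ:

```python
def window_features(words, i, window=2):
    """Features for tagging words[i]: the word itself plus its neighbours
    within +/- `window` positions, padded with boundary symbols."""
    padded = ["<s>"] * window + list(words) + ["</s>"] * window
    j = i + window
    feats = {f"w[{k}]={padded[j + k]}" for k in range(-window, window + 1)}
    feats.add(f"suffix3={words[i][-3:]}")   # simple orthographic feature
    return feats

print(sorted(window_features(["the", "cat", "sleeps"], 1)))
```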
42. SVMs for tagging
(from Jimenez and Marquez)
43. No sequence modeling
44. CRFs and other global models
45. CRFs and other global models
46. Compare (models over a word sequence W and tag sequence T; factorizations sketched below)
- HMMs: generative, with locally normalized transition and emission factors
- MEMMs: discriminative, with locally normalized per-step factors
  - Note: after each step t the remaining probability mass cannot be reduced; it can only be redistributed across the possible state transitions
- CRFs: no local normalization
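A sketch of the standard factorizations behind this comparison (my notation; the slides' notation may differ):

```latex
% HMM: generative, joint distribution over tags T and words W
P(T, W) = \prod_{i} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)

% MEMM: discriminative, locally normalized at each step
P(T \mid W) = \prod_{i}
  \frac{\exp\big(\mathbf{w} \cdot \mathbf{f}(t_i, t_{i-1}, w_i)\big)}
       {\sum_{t'} \exp\big(\mathbf{w} \cdot \mathbf{f}(t', t_{i-1}, w_i)\big)}

% CRF: discriminative, a single global normalization Z(W)
P(T \mid W) = \frac{\exp\big(\sum_{i} \mathbf{w} \cdot \mathbf{f}(t_i, t_{i-1}, W, i)\big)}{Z(W)}
```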
47. Label Bias
(based on a slide from Joe Drish)
48. Label Bias
- Recall transition-based parsing: Nivre's algorithm (with beam search)
- At each step we can observe only local features (limited look-ahead)
- If we later see that the following word is impossible, we can only redistribute the probability across all (im-)possible decisions
- If there are only a few such decisions, we cannot decrease the probability dramatically (see the worked case below)
- So, label bias is likely to be a serious problem if
  - there are non-local dependencies
  - states have a small number of possible outgoing transitions
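A worked extreme case of this (my own illustration, not from the slides): with local normalization, a state that has a single outgoing transition must assign it all of the probability mass, whatever the observation says:

```latex
% If state s has exactly one successor s', local normalization forces
P(t_i = s' \mid t_{i-1} = s, \, w_i)
  = \frac{\exp\big(\mathbf{w} \cdot \mathbf{f}(s', s, w_i)\big)}
         {\exp\big(\mathbf{w} \cdot \mathbf{f}(s', s, w_i)\big)}
  = 1 \qquad \text{for every observation } w_i ,
% so evidence observed at step i cannot lower the score of a bad path.
```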
49. POS Tagging Experiments
- Uses an extended feature set (hard to integrate in a generative model)
- oov = out-of-vocabulary
50. Supervision
- So far we considered the supervised case
  - the training set is labeled
- However, we can try to induce word classes without supervision
  - unsupervised tagging
  - we will discuss the EM algorithm later
- We can also do it in a partly supervised way
  - seed tags
  - a small labeled dataset
  - a parallel corpus
  - ...
51. Why not predict POS tags and parse trees simultaneously?
- It is possible and often done this way
- Doing tagging internally often benefits parsing accuracy
- Unfortunately, parsing models are less robust than taggers
  - e.g., on non-grammatical sentences or in different domains
- It is more expensive and does not help...
52. Questions
- Why is there no label-bias problem for a generative model (e.g., an HMM)?
- How would you integrate word features into a generative model (e.g., HMMs for POS tagging)?
  - e.g., if the word has
    - suffixes: -ing, -s, -ed, -d, -ment, ...
    - prefixes: post-, de-, ...
53. CRFs for more complex structured output problems
- So far we considered sequence labeling problems
  - here, the structure of dependencies is fixed
- What if we do not know the structure, but would like to have interactions that respect the structure?
54. CRFs for more complex structured output problems
- Recall, we had the MST algorithm (McDonald and Pereira, 05)
55. CRFs for more complex structured output problems
- Complex inference
  - E.g., arbitrary second-order non-projective dependency parsing models are not tractable
  - NP-complete (McDonald and Pereira, EACL 06)
- Recently, conditional models for constituent parsing
  - (Finkel et al., ACL 08)
  - (Carreras et al., CoNLL 08)
  - ...
56. Back to Multiclass
- Let us review how to decompose a multiclass problem into binary classification problems
57. Summary
- Margin-based methods for multiclass classification and structured prediction
- CRFs vs HMMs vs MEMMs for POS tagging
58. Conclusions
- All approaches use a linear representation
- The differences are
  - the features
  - how the weights are learned
- Training paradigms
  - Global training (CRF, global Perceptron)
  - Modular training (PMM, MEMM, ...)
    - These approaches are easier to train, but may require additional mechanisms to enforce global constraints.