1
CS546: Machine Learning and Natural Language
Multi-Class and Structured Prediction Problems
  • Slides from Taskar and Klein are used in this
    lecture

2
Outline
  • Multi-Class classification
  • Structured Prediction
  • Models for Structured Prediction and
    Classification
  • Example of POS tagging

3
Multiclass problems
  • Most of the machinery we covered before was
    focused on binary classification problems
  • e.g., the SVMs we discussed so far
  • However, most problems we encounter in NLP are
    either
  • MultiClass, e.g., text categorization
  • Structured Prediction, e.g., predicting the
    syntactic structure of a sentence
  • How do we deal with them?

4
Binary linear classification
5
Multiclass classification
6
Perceptron
7
Structured Perceptron
  • Joint feature representation
  • Algorithm (a sketch follows below)
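A minimal sketch of the structured perceptron update, assuming a user-supplied joint feature map phi(x, y) and a decode routine (e.g., Viterbi) that returns the highest-scoring structure under the current weights; both names are illustrative, not from the slides:

```python
import numpy as np

def structured_perceptron(train_data, phi, decode, n_features, epochs=5):
    """Mistake-driven structured perceptron (sketch).

    train_data : list of (x, y_gold) pairs
    phi        : joint feature map phi(x, y) -> np.ndarray of length n_features
    decode     : decode(x, w) -> argmax_y  w . phi(x, y)   (e.g., Viterbi)
    """
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y_gold in train_data:
            y_hat = decode(x, w)                # best structure under current w
            if y_hat != y_gold:                 # update only on mistakes
                w += phi(x, y_gold) - phi(x, y_hat)
    return w
```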

8
Perceptron
9
Binary Classification Margin
10
Generalize to MultiClass
11
Converting to MultiClass SVM
12
Max Margin / Min Norm
  • As before, these are equivalent formulations (see
    the sketch below)
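For reference, the two standard formulations for the separable multiclass case, written with a joint feature map f(x, y); this is a reconstruction of what such slides typically show, not the lecture's exact notation:

```latex
% Max margin (fixed norm):
\max_{\|w\| = 1} \ \gamma \quad \text{s.t.} \quad
  w \cdot f(x_i, y_i) - w \cdot f(x_i, y) \ \ge\ \gamma
  \qquad \forall i,\ \forall y \neq y_i

% Min norm (fixed margin):
\min_{w} \ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad
  w \cdot f(x_i, y_i) - w \cdot f(x_i, y) \ \ge\ 1
  \qquad \forall i,\ \forall y \neq y_i
```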

13
Problems
  • Requires separability
  • What if we have noise in the data?
  • What if we have only a simple, limited feature
    space?

14
Non-separable case
15
Non-separable case
16
Compare with MaxEnt
17
Loss Comparison
18
Multiclass → Structured
  • So far, we considered multiclass classification
  • 0-1 loss l(y, y')
  • What if what we want to predict is
  • sequences of POS tags
  • syntactic trees
  • translation
  • (a typical structured loss is sketched below)
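For structured outputs the 0-1 loss is usually too coarse; a common choice for sequences (an illustrative example, not necessarily the one on the original slide) is the Hamming loss, which counts mislabeled positions:

```latex
\ell(y, y') \;=\; \sum_{t=1}^{|y|} \mathbf{1}\,[\, y_t \neq y'_t \,]
```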

19
Predicting word alignments

20
Predicting Syntactic Trees

21
Structured Models

22
Parsing

23
Max Margin Markov Networks (M3Ns)

(Taskar et al., 2003; similar: Tsochantaridis et al., 2004)
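A hedged sketch of the core M3N idea, consistent with the max-margin formulation above: the margin between the gold structure and any other structure must scale with the loss of that structure, and both the features and the loss factor over the edges of a Markov network so the exponentially many constraints remain tractable:

```latex
\min_{w} \ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad
  w \cdot f(x_i, y_i) - w \cdot f(x_i, y) \ \ge\ \ell(y_i, y)
  \qquad \forall i,\ \forall y
```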
24
Max Margin Markov Networks (M3Ns)

25
Solving MultiClass with binary learning
  • MultiClass classifier
  • Function f : Rd → {1, 2, 3, ..., k}
  • Decompose into binary problems
  • Not always possible to learn
  • Different scale
  • No theoretical justification

Real Problem
26
Learning via One-Versus-All (OvA) Assumption
  • Find vr, vb, vg, vy ∈ Rn such that
  • vr · x > 0 iff y = red
  • vb · x > 0 iff y = blue
  • vg · x > 0 iff y = green
  • vy · x > 0 iff y = yellow
  • Classifier: f(x) = argmax_i vi · x

Hypothesis space: H = Rkn (k weight vectors of dimension n)
(Figure: individual classifiers and their decision regions)
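A compact sketch of OvA in code; train_binary stands for any binary linear learner (perceptron, SVM, ...) that returns a weight vector and is a placeholder name, not something from the slides:

```python
import numpy as np

def train_ova(X, y, classes, train_binary):
    """One weight vector per class; class c is positive, all others negative."""
    return {c: train_binary(X, np.where(y == c, 1, -1)) for c in classes}

def predict_ova(x, weights):
    """f(x) = argmax_c  v_c . x"""
    return max(weights, key=lambda c: weights[c] @ x)
```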
27
Learning via All-Versus-All (AvA) Assumption
  • Find vrb, vrg, vry, vbg, vby, vgy ∈ Rd such that
  • vrb · x > 0 if y = red
  •           < 0 if y = blue
  • vrg · x > 0 if y = red
  •           < 0 if y = green
  • ... (for all pairs; a training/voting sketch
    follows below)

(Figure: individual classifiers and their decision regions)
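A corresponding sketch for AvA with simple majority voting at prediction time (voting is only one of the post-learning combination schemes mentioned on the next slide); train_binary is again a placeholder for any binary learner:

```python
from itertools import combinations
import numpy as np

def train_ava(X, y, classes, train_binary):
    """One binary classifier per unordered pair of classes."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = train_binary(X[mask], np.where(y[mask] == a, 1, -1))
    return models

def predict_ava(x, models, classes):
    """Each pairwise classifier votes for one of its two classes."""
    votes = {c: 0 for c in classes}
    for (a, b), v in models.items():
        votes[a if v @ x > 0 else b] += 1
    return max(votes, key=votes.get)
```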
28
Classifying with AvA
  • All of these combination schemes are applied
    post-learning and might behave oddly
29
POS Tagging
  • English tags

30
POS Tagging, examples from WSJ
From McCallum
31
POS Tagging
  • Ambiguity makes it a non-trivial task
  • Useful task
  • important features for later steps are based on
    POS tags
  • e.g., use POS tags as input to a parser

32
But still, why so popular?
  • Historically, the first statistical NLP problem
  • Easy to apply arbitrary classifiers
  • both as sequence models and as independent
    per-word classifiers
  • Can be regarded as a finite-state problem
  • Easy to evaluate
  • Annotation is cheaper to obtain than treebanks
    (especially for other languages)

33
HMM (reminder)

34
HMM (reminder) - transitions

35
Transition Estimates
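The formula on this slide is not in the transcript; the standard maximum-likelihood estimate it typically shows is (counts taken over the tagged training corpus, possibly smoothed):

```latex
\hat{P}(t_i \mid t_{i-1}) \;=\; \frac{C(t_{i-1},\, t_i)}{C(t_{i-1})}
```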

36
Emission Estimates
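Likewise for emissions, the usual maximum-likelihood estimate (again a reconstruction, since the slide's formula is missing from the transcript):

```latex
\hat{P}(w_i \mid t_i) \;=\; \frac{C(t_i,\, w_i)}{C(t_i)}
```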

37
MaxEnt (reminder)
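As a reminder of the model form (the slide's own formula is not in the transcript), a conditional maximum-entropy / log-linear classifier over tags, given a feature map f(x, y):

```latex
P(y \mid x) \;=\; \frac{\exp\big( w \cdot f(x, y) \big)}
                       {\sum_{y'} \exp\big( w \cdot f(x, y') \big)}
```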

38
Decoding HMM vs MaxEnt
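A hedged sketch of the contrast this slide draws: the HMM decodes the whole tag sequence jointly (Viterbi over transition and emission scores), while a plain MaxEnt tagger can simply pick the best tag independently at each position from local features:

```latex
% HMM (joint decoding with Viterbi):
\hat{t}_{1:n} \;=\; \arg\max_{t_{1:n}} \ \prod_{i=1}^{n}
    P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)

% MaxEnt as an independent per-position classifier:
\hat{t}_i \;=\; \arg\max_{t} \ P(t \mid \text{context}_i)
    \qquad \text{for each } i
```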

39
Accuracies overview

40
Accuracies overview

41
SVMs for tagging
  • We can use SVMs in a similar way to MaxEnt (or
    other classifiers)
  • We can use a window around the word (see the
    feature sketch below)
  • 97.16% accuracy on WSJ
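An illustrative feature extractor for a ±2 word window; the exact templates behind the 97.16% result above are richer (affixes, capitalization patterns, ambiguity classes, ...), so this is only a sketch:

```python
def window_features(words, i, size=2):
    """String-valued features for tagging words[i] from a +/- size window."""
    feats = []
    for k in range(-size, size + 1):
        j = i + k
        w = words[j] if 0 <= j < len(words) else "<PAD>"
        feats.append(f"w[{k:+d}]={w.lower()}")
    feats.append(f"suffix3={words[i][-3:].lower()}")   # crude affix feature
    feats.append(f"is_cap={words[i][:1].isupper()}")
    return feats

# Example: window_features("The dog barks".split(), 1)
# -> ['w[-2]=<PAD>', 'w[-1]=the', 'w[+0]=dog', 'w[+1]=barks', 'w[+2]=<PAD>',
#     'suffix3=dog', 'is_cap=False']
```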

42
SVMs for tagging
from Giménez and Màrquez
43
No sequence modeling
44
CRFs and other global models
45
CRFs and other global models
46
Compare
  • HMMs
  • MEMMs - note that after each step t the remaining
    probability mass cannot be reduced; it can only be
    distributed among the possible state transitions
  • CRFs - no local normalization
  • (the three model forms are contrasted below)
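The contrast in model form, written out as a hedged reconstruction (notation: W = word sequence, T = tag sequence, f a local feature function):

```latex
% HMM: generative, joint distribution
P(W, T) \;=\; \prod_{i} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)

% MEMM: locally normalized conditional model
P(T \mid W) \;=\; \prod_{i} P(t_i \mid t_{i-1}, W)
  \;=\; \prod_{i} \frac{\exp\big( w \cdot f(t_{i-1}, t_i, W, i) \big)}
                       {\sum_{t'} \exp\big( w \cdot f(t_{i-1}, t', W, i) \big)}

% CRF: globally normalized conditional model
P(T \mid W) \;=\; \frac{\exp\big( \sum_{i} w \cdot f(t_{i-1}, t_i, W, i) \big)}{Z(W)}
```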
47
Label Bias
based on a slide from Joe Drish
48
Label Bias
  • Recall transition-based parsing -- Nivre's
    algorithm (with beam search)
  • At each step we can observe only local features
    (limited look-ahead)
  • If we later see that the following word is
    impossible, we can only distribute the probability
    uniformly across all (im-)possible decisions
  • If there is only a small number of such decisions,
    we cannot decrease the probability dramatically
  • So, label bias is likely to be a serious problem if
  • there are non-local dependencies
  • states have a small number of possible outgoing
    transitions

49
POS Tagging Experiments
  • ... is an extended feature set (hard to
    integrate in a generative model)
  • oov = out-of-vocabulary

50
Supervision
  • So far, we considered the supervised case
  • Training set is labeled
  • However, we can try to induce word classes
    without supervision
  • Unsupervised tagging
  • We will discuss the EM algorithm later
  • It can also be done in a partly supervised way
  • Seed tags
  • Small labeled dataset
  • Parallel corpus
  • ...

51
Why not predict POS tags and parse trees
simultaneously?
  • It is possible and often done this way
  • Doing tagging internally often benefits parsing
    accuracy
  • Unfortunately, parsing models are less robust
    than taggers
  • e.g., on non-grammatical sentences or in different
    domains
  • It is also more expensive and does not always help...

52
Questions
  • Why is there no label-bias problem for a
    generative model (e.g., an HMM)?
  • How would you integrate word features into a
    generative model (e.g., HMMs for POS tagging)?
  • e.g., whether the word has
  • -ing, -s, -ed, -d, -ment, ...
  • post-, de-, ...

53
CRFs for more complex structured output problems
  • We considered sequence labeling problems
  • Here, the structure of dependencies is fixed
  • What if we do not know the structure but would
    like to have interactions that respect the
    structure?

54
CRFs for more complex structured output problems
  • Recall, we had the MST algorithm (McDonald and
    Pereira, 05)

55
CRFs for more complex structured output problems
  • Complex inference
  • E.g., arbitrary 2nd-order dependency parsing
    models are not tractable (non-projective)
  • NP-complete (McDonald and Pereira, EACL 06)
  • Recently, conditional models for constituent
    parsing
  • (Finkel et al, ACL 08)
  • (Carreras et al, CoNLL 08)
  • ...

56
Back to MultiClass
  • Let us review how to decompose a multiclass
    problem into binary classification problems

57
Summary
  • Margin-based methods for multiclass classification
    and structured prediction
  • CRFs vs HMMs vs MEMMs for POS tagging

58
Conclusions
  • All approaches use a linear representation
  • The differences are
  • Features
  • How the weights are learned
  • Training Paradigms
  • Global Training (CRF, Global Perceptron)
  • Modular Training (PMM, MEMM, ...)
  • These approaches are easier to train, but may
    require additional mechanisms to enforce global
    constraints.