Multiclass Classification in NLP - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Multiclass Classification in NLP
  • Named Entity Recognition
  • Label people, locations, and organizations in a
    sentence
  • [PER Sam Houston], born in [LOC Virginia], was
    a member of the [ORG US Congress].
  • Decompose into sub-problems
  • [Sam Houston], born in Virginia...
    → {PER, LOC, ORG, None} → PER (1)
  • Sam Houston, [born in] Virginia...
    → {PER, LOC, ORG, None} → None (0)
  • Sam Houston, born in [Virginia]...
    → {PER, LOC, ORG, None} → LOC (2)
  • Many problems in NLP are decomposed this way
  • Disambiguation tasks
  • POS Tagging
  • Word-sense disambiguation
  • Verb Classification
  • Semantic Role Labeling

2
Outline
  • Multi-Categorical Classification Tasks
  • Example: Semantic Role Labeling (SRL)
  • Decomposition Approaches
  • Constraint Classification
  • Unifies learning of multi-categorical classifiers
  • Structured-Output Learning
  • Revisit SRL
  • Decomposition versus Constraint Classification
  • Goal
  • Discuss multi-class and structured output from
    the same perspective.
  • Discuss similarities and differences

3
Multi-Categorical Output Tasks
  • Multi-class Classification (y ∈ {1,...,K})
  • character recognition (6)
  • document classification (homepage)
  • Multi-label Classification (y ⊆ {1,...,K})
  • document classification ({homepage, facultypage})
  • Category Ranking (y ∈ π(K))
  • user preference (love > like > hate)
  • document classification (homepage > facultypage >
    sports)
  • Hierarchical Classification (y ⊆ {1,...,K})
  • cohere with class hierarchy
  • place document into index where soccer is-a
    sport

4
(more) Multi-Categorical Output Tasks
  • Sequential Prediction (y ∈ {1,...,K}^L)
  • e.g. POS tagging (N V N N A)
  • This is a sentence. → D V D N
  • e.g. phrase identification
  • Many labels: K^L for a length-L sentence
  • Structured Output Prediction (y ∈ C({1,...,K}^L))
  • e.g. parse tree, multi-level phrase
    identification
  • e.g. sequential prediction
  • Constrained by
  • domain, problem, data, background knowledge,
    etc.

5
Semantic Role Labeling: A Structured-Output Problem
  • For each verb in a sentence
  • Identify all constituents that fill a semantic
    role
  • Determine their roles
  • Core Arguments, e.g., Agent, Patient or
    Instrument
  • Their adjuncts, e.g., Locative, Temporal or Manner

6
Semantic Role Labeling
I left my pearls to my daughter-in-law in my will.
  • Many possible valid outputs
  • Many possible invalid outputs

7
Structured Output Problems
  • Multi-Class
  • View y = 4 as (y1,...,yk) = (0 0 0 1 0 0 0)
  • The output is restricted: exactly one yi = 1
  • Learn f1(x),...,fk(x)
  • Sequence Prediction
  • e.g. POS tagging: x = (My name is Dav), y =
    (Pr, N, V, N)
  • e.g. restriction: every sentence must have a
    verb
  • Structured Output
  • Arbitrary global constraints
  • Local functions do not have access to global
    constraints!
  • Goal
  • Discuss multi-class and structured output from
    the same perspective.
  • Discuss similarities and differences

8
Transform the sub-problems
  • [Sam Houston], born in Virginia...
    → {PER, LOC, ORG, None} → PER (1)
  • Transform each problem to a feature vector (see
    the sketch below)
  • Sam Houston, born in Virginia
  • → (Bob-, JOHN-, SAM HOUSTON, HAPPY, -BORN,
    --BORN, ...)
  • → ( 0 ,  0 ,  1 ,  0 ,  1 ,  1 , ...)
  • Transform each label to a class label
  • PER → 1
  • LOC → 2
  • ORG → 3
  • None → 0
  • Input: {0,1}^d or R^d
  • Output: {0,1,2,3,...,k}
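To make the transformation concrete, here is a minimal Python sketch (the feature inventory and helper names are illustrative, not from the slides): each sub-problem becomes a binary feature vector in {0,1}^d paired with an integer class label.

```python
# Illustrative sketch; FEATURES and LABELS mirror the example above
# but are otherwise hypothetical.
FEATURES = ["Bob-", "JOHN-", "SAM HOUSTON", "HAPPY", "-BORN", "--BORN"]
LABELS = {"None": 0, "PER": 1, "LOC": 2, "ORG": 3}

def to_feature_vector(active):
    """Map the set of features that fire on a span to a 0/1 vector."""
    return [1 if f in active else 0 for f in FEATURES]

# "[Sam Houston], born in Virginia": the focus span matches SAM HOUSTON,
# and the context features -BORN and --BORN fire.
x = to_feature_vector({"SAM HOUSTON", "-BORN", "--BORN"})
y = LABELS["PER"]
print(x, y)  # [0, 0, 1, 0, 1, 1] 1
```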

9
Solving multiclass with binary learning
  • Multiclass classifier
  • Function f : R^d → {1,2,3,...,k}
  • Decompose into binary problems
  • Not always possible to learn
  • No theoretical justification (unless the problem
    is easy)

10
The Real MultiClass Problem
  • General framework
  • Extend binary algorithms
  • Theoretically justified
  • Provably correct
  • Generalizes well
  • Verified Experimentally
  • Naturally extends binary classification
    algorithms to the multiclass setting
  • e.g. linear binary separation induces linear
    boundaries in the multiclass setting

11
Multi Class over Linear Functions
  • One versus all (OvA)
  • All versus all (AvA)
  • Direct winner-take-all (D-WTA)

12
WTA over linear functions
  • Assume examples generated from winner-take-all
  • y = argmax_i (wi · x + ti)
  • wi, x ∈ R^n, ti ∈ R
  • Note: Voronoi diagrams are WTA functions (checked
    in the sketch below)
  • argmin_c ||ci - x|| = argmax_c (ci · x - ||ci||² / 2)
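A quick numeric check of the identity above (a sketch; the centers and data are random, not from the slides): the nearest center under Euclidean distance is always the winner of the linear scores ci · x - ||ci||²/2.

```python
import numpy as np

# Verify: argmin_i ||c_i - x||  ==  argmax_i (c_i . x - ||c_i||^2 / 2)
rng = np.random.default_rng(0)
C = rng.normal(size=(5, 3))     # five centers c_i in R^3
for _ in range(100):
    x = rng.normal(size=3)
    nearest = np.argmin(np.linalg.norm(C - x, axis=1))
    wta = np.argmax(C @ x - 0.5 * (C ** 2).sum(axis=1))
    assert nearest == wta       # the two rules always agree
```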

13
Learning via One-Versus-All (OvA) Assumption
  • Find vr, vb, vg, vy ∈ R^n such that
  • vr · x > 0 iff y = red
  • vb · x > 0 iff y = blue
  • vg · x > 0 iff y = green
  • vy · x > 0 iff y = yellow
  • Classifier: f(x) = argmax_i vi · x (a training
    sketch follows below)

Hypothesis space: H = R^kn
[Figure: individual classifiers and their decision regions]
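A minimal sketch of OvA training with perceptrons (illustrative, not the slides' implementation): one weight vector per class, each trained on the binary task "class i versus the rest", with prediction by winner-take-all.

```python
import numpy as np

def train_ova_perceptron(X, y, k, epochs=10):
    """X: (m, d) data matrix; y: labels in {0,...,k-1}."""
    V = np.zeros((k, X.shape[1]))
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            for c in range(k):
                target = 1 if y_i == c else -1   # class c vs. rest
                if target * (V[c] @ x_i) <= 0:   # mistake on task c
                    V[c] += target * x_i         # perceptron update
    return V

def predict(V, x):
    return int(np.argmax(V @ x))                 # f(x) = argmax_i vi . x
```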
14
Learning via All-versus-All (AvA) Assumption
  • Find vrb, vrg, vry, vbg, vby, vgy ∈ R^d such that
  • vrb · x > 0 if y = red
  •          < 0 if y = blue
  • vrg · x > 0 if y = red
  •          < 0 if y = green
  • ... (for all pairs)

[Figure: individual pairwise classifiers and their decision regions]
15
Classifying with AvA
All decision rules are applied post-learning and may behave inconsistently; one common rule is sketched below.
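A sketch of majority voting over the pairwise classifiers, under the assumption (not from the slides) that v[(i, j)] is the learned separator for classes i vs. j with i < j:

```python
import numpy as np

def ava_vote(v, x, k):
    """v: dict mapping (i, j) with i < j to a weight vector."""
    votes = np.zeros(k, dtype=int)
    for i in range(k):
        for j in range(i + 1, k):
            winner = i if v[(i, j)] @ x > 0 else j
            votes[winner] += 1
    return int(np.argmax(votes))   # ties broken by lowest class index
```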
16
Summary (1) Learning Binary Classifiers
  • On-line: Perceptron, Winnow
  • Mistake bounded
  • Generalizes well (VC-Dim)
  • Works well in practice
  • SVM
  • Well motivated to maximize margin
  • Generalizes well
  • Works well in practice
  • Boosting, Neural Networks, etc...

17
From Binary to Multi-categorical
  • Decompose multi-categorical problems
  • into multiple (independent) binary problems
  • Multi-class: OvA, AvA, ECOC, DT, etc.
  • Multi-label: reduce to multi-class
  • Category Ranking: reduce to multi-class, or use
    regression
  • Sequence Prediction
  • Reduce to Multi-class
  • part/alphabet based decompositions
  • Structured Output
  • learn parts of output based on local
    information!!!

18
Problems with Decompositions
  • Learning optimizes over local metrics
  • Poor global performance
  • What is the metric?
  • We don't care about the performance of the local
    classifiers
  • Poor decomposition → poor performance
  • Difficult local problems
  • Irrelevant local problems
  • Not clear how to decompose all Multi-category
    problems

19
Multi-class OvA Decomposition: a Linear
Representation
  • Hypothesis: h(x) = argmax_i vi · x
  • Decomposition
  • Each class represented by a linear function vi · x
  • Learning: One-versus-all (OvA)
  • For each class i: vi · x > 0 iff i = y
  • General Case
  • Each class represented by a function fi(x) > 0

20
Learning via One-Versus-All (OvA) Assumption
  • Classifier: f(x) = argmax_i vi · x

[Figure: individual classifiers]
  • OvA Learning: find vi such that vi · x > 0 iff y = i
  • OvA is fine only if the data is OvA separable!
  • A linear classifier can represent this function!
  • (Voronoi) argmin_c d(ci, x) ≡ (WTA) argmax_c (ci · x + di)

21
Other Issues we Mentioned
  • Error-Correcting Output Codes (ECOC)
  • Another (class of) decomposition
  • Difficulty: how to ensure that the resulting
    problems are separable
  • Commented on the advantage of All vs. All when
    working in the dual space (e.g., with kernels)

22
Example: SNoW Multi-class Classifier
How do we train?
How do we evaluate?
[Figure: feature nodes connected by weighted edges
(weight vectors) to target nodes, each an LTU]
  • SNoW only represents the targets and weighted
    edges

23
Winnow Extensions
  • Winnow learns monotone Boolean functions
  • To learn non-monotone Boolean functions
  • For each variable x, introduce its negation x'
  • Learn monotone functions over 2n variables
  • To learn functions with real-valued inputs
  • Balanced Winnow
  • Two weights per variable; the effective weight is
    their difference
  • Update rule (a sketch follows below)
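The slide's update rule did not survive extraction; below is a sketch of the textbook Balanced Winnow update (threshold simplified to 0): on a mistake, the positive weights are multiplicatively promoted or demoted, and the negative weights move the opposite way.

```python
import numpy as np

def balanced_winnow_update(w_pos, w_neg, x, y, r=1.5):
    """One mistake-driven step; y in {+1, -1}, x non-negative.
    The effective weight vector is w_pos - w_neg."""
    y_hat = 1 if (w_pos - w_neg) @ x > 0 else -1
    if y_hat != y:                 # update only on mistakes
        w_pos *= r ** (y * x)      # promote/demote elementwise
        w_neg *= r ** (-y * x)     # opposite move for negative weights
    return w_pos, w_neg
```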

24
An Intuition: Balanced Winnow
  • In most multi-class classifiers you have a target
    node that represents positive examples and a
    target node that represents negative examples.
  • Typically, we train each node separately (my /
    not-my example).
  • Rather, given an example we could say: this is
    more a positive example than a negative example.
  • We compare the activation of the different
    target nodes (classifiers) on a given example.
    (This example is more class + than class -.)

25
Constraint Classification
  • Can be viewed as a generalization of the balanced
    Winnow to the multi-class case
  • Unifies multi-class, multi-label,
    category-ranking
  • Reduces learning to a single binary learning task
  • Captures theoretical properties of binary
    algorithm
  • Experimentally verified
  • Naturally extends Perceptron, SVM, etc...
  • Do all of this by representing labels as a set of
    constraints or preferences among output labels.

26
Multi-category to Constraint Classification
  • Multiclass (see the sketch after this list)
  • (x, A) → (x, (A>B, A>C, A>D))
  • Multilabel
  • (x, {A, B}) → (x, (A>C, A>D, B>C, B>D))
  • Label Ranking
  • (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1))
  • Examples: (x, y), y ∈ Sk
  • Sk: partial order over class labels {1,...,k}
  • defines a preference relation ( > ) for class
    labeling
  • Constraint Classifier: h : X → Sk
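A small sketch of this mapping (illustrative helper names, with classes as indices): each labeled example is rewritten as a set of pairwise preference constraints (i > j) over class labels.

```python
def multiclass_constraints(y, k):
    """(x, y) -> constraints (y > j) for every other class j."""
    return [(y, j) for j in range(k) if j != y]

def multilabel_constraints(labels, k):
    """(x, {labels}) -> (i > j) for each labeled i, unlabeled j."""
    return [(i, j) for i in labels for j in range(k) if j not in labels]

print(multiclass_constraints(0, 4))       # A>B, A>C, A>D as index pairs
print(multilabel_constraints({0, 1}, 4))  # A>C, A>D, B>C, B>D
```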

27
Learning Constraint Classification: Kesler
Construction
  • Transform Examples

i > j  ⇔  fi(x) - fj(x) > 0
       ⇔  wi · x - wj · x > 0
       ⇔  W · Xi - W · Xj > 0
       ⇔  W · (Xi - Xj) > 0
       ⇔  W · Xij > 0

Xi = (0, x, 0, 0) ∈ R^kd
Xj = (0, 0, 0, x) ∈ R^kd
Xij = Xi - Xj = (0, x, 0, -x)
W = (w1, w2, w3, w4) ∈ R^kd
28
Kesler's Construction (1)
  • y = argmax_{i ∈ {r,b,g,y}} vi · x
  • vi, x ∈ R^n
  • Find vr, vb, vg, vy ∈ R^n such that
  • vr · x > vb · x
  • vr · x > vg · x
  • vr · x > vy · x

Hypothesis space: H = R^kn
29
Kesler's Construction (2)
  • Let v = (vr, vb, vg, vy) ∈ R^kn
  • Let 0n be the n-dimensional zero vector
  • vr · x > vb · x ⇔ v · (x, -x, 0n, 0n) > 0 ⇔ v · (-x, x, 0n, 0n) < 0
  • vr · x > vg · x ⇔ v · (x, 0n, -x, 0n) > 0 ⇔ v · (-x, 0n, x, 0n) < 0
  • vr · x > vy · x ⇔ v · (x, 0n, 0n, -x) > 0 ⇔ v · (-x, 0n, 0n, x) < 0

30
Kesler's Construction (3)
  • Let
  • v = (v1, ..., vk) ∈ R^n × ... × R^n = R^kn
  • xij = (0_(i-1)n, x, 0_(k-i)n) - (0_(j-1)n, x,
    0_(k-j)n) ∈ R^kn
  • Given (x, y) ∈ R^n × {1,...,k}
  • For all j ≠ y
  • Add (xyj, +1) to P+(x,y)
  • Add (-xyj, -1) to P-(x,y)
  • P+(x,y) has k-1 positive examples (∈ R^kn)
  • P-(x,y) has k-1 negative examples (∈ R^kn)

31
Learning via Kesler's Construction
  • Given (x1, y1), ..., (xN, yN) ∈ R^n × {1,...,k}
  • Create
  • P+ = ∪ P+(xi, yi)
  • P- = ∪ P-(xi, yi)
  • Find v = (v1, ..., vk) ∈ R^kn such that
  • v · x separates P+ from P-
  • Output
  • f(x) = argmax_i vi · x (a sketch follows below)
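A minimal sketch of the construction above (illustrative, not from the slides): each multiclass example is expanded into k-1 binary examples in R^(kn), any binary separator learned on the expansion yields the multiclass classifier f(x) = argmax_i vi · x.

```python
import numpy as np

def kesler_expand(x, y, k):
    """Expand one example (x in R^n, y in {0,...,k-1}) into k-1 positive
    binary examples in R^(k*n); negatives are their reflections."""
    n = len(x)
    positives = []
    for j in range(k):
        if j == y:
            continue
        z = np.zeros(k * n)
        z[y * n:(y + 1) * n] = x     # +x in block y
        z[j * n:(j + 1) * n] = -x    # -x in block j
        positives.append(z)          # label +1; (-z, -1) is the negative
    return positives

def predict(v, x, k):
    """v in R^(k*n) stacked as (v1,...,vk); f(x) = argmax_i vi . x"""
    n = len(x)
    scores = [v[i * n:(i + 1) * n] @ x for i in range(k)]
    return int(np.argmax(scores))
```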

32
Constraint Classification
  • Examples: (x, y)
  • y ∈ Sk
  • Sk: partial order over class labels {1,...,k}
  • defines a preference relation (<) for class
    labels
  • e.g. Multiclass: 2<1, 2<3, 2<4, 2<5
  • e.g. Multilabel: 1<3, 1<4, 1<5, 2<3, 2<4, 4<5
  • Constraint Classifier
  • f : X → Sk
  • f(x) is a partial order
  • f(x) is consistent with y if (i<j) ∈ y ⇒ (i<j)
    ∈ f(x)

33
Implementation
  • Examples: (x, y)
  • y ∈ Sk
  • Sk: partial order over class labels {1,...,k}
  • defines a preference relation (>) for class
    labels
  • e.g. Multiclass: 2>1, 2>3, 2>4, 2>5
  • Given an example that is labeled 2, the
    activation of target 2 on it should be larger
    than the activation of the other targets.
  • SNoW implementation: conservative.
  • Only the target node corresponding to the
    correct label and the node with the highest
    activation are compared.
  • If both are the same target node: no change.
  • Otherwise, promote one and demote the other
    (a sketch follows below).
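A minimal sketch of this conservative update (illustrative; it uses Perceptron-style additive updates rather than SNoW's Winnow-style multiplicative ones): compare the correct target with the highest-activation target, and update only those two.

```python
import numpy as np

def conservative_update(V, x, y, lr=1.0):
    """V: (k, d) weight matrix, one row per target; y: correct label."""
    y_hat = int(np.argmax(V @ x))   # highest-activation target
    if y_hat != y:                  # no change if they coincide
        V[y] += lr * x              # promote the correct target
        V[y_hat] -= lr * x          # demote the competing target
    return V
```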

34
Properties of Construction
  • Can learn any argmax_i vi · x function
  • Can use any algorithm to find a linear separation
  • Perceptron Algorithm
  • ultraconservative online algorithms [Crammer &
    Singer 2001]
  • Winnow Algorithm
  • multiclass Winnow [Mesterharm 2000]
  • Defines a multiclass margin
  • via the binary margin in R^kd
  • multiclass SVM [Crammer & Singer 2001]

35
Margin Generalization Bounds
  • Linear hypothesis space:
  • h(x) = argsort_i vi · x
  • vi, x ∈ R^d
  • argsort returns a permutation of {1,...,k}
  • CC margin-based bound
  • γ = min_{(x,y) ∈ S} min_{(i<j) ∈ y} (vi · x - vj · x)
  • m - number of examples
  • R = max_x ||x||
  • δ - confidence
  • C - average number of constraints

36
VC-style Generalization Bounds
  • Linear hypothesis space:
  • h(x) = argsort_i vi · x
  • vi, x ∈ R^d
  • argsort returns a permutation of {1,...,k}
  • CC VC-based bound
  • m - number of examples
  • d - dimension of the input space
  • δ - confidence
  • k - number of classes

37
Beyond Multiclass Classification
  • Ranking
  • category ranking (over classes)
  • ordinal regression (over examples)
  • Multilabel
  • x is both red and blue
  • Complex relationships
  • x is more red than blue, but not green
  • Millions of classes
  • sequence labeling (e.g. POS tagging) LATER
  • SNoW has an implementation of Constraint
    Classification for the Multi-Class case. Try to
    compare with 1-vs-all.
  • Experimental issue: when is this version of
    multi-class better?
  • Several easy improvements are possible via
    modifying the loss function.

38
Multi-class Experiments
The picture isn't so clear for very high dimensional
problems. Why?
39
Summary
OvA:
  Learning (independent): fi(x) > 0 iff y = i
  Evaluation (global): h(x) = argmax fi(x)
Constraint Classification:
  Learning (global): find fi(x) s.t. y = argmax fi(x)
  Evaluation (global): h(x) = argmax fi(x)

Learn + Inference:
  Learning (independent): fi(x) > 0 iff i is a part of y
  Evaluation (global inference): h(x) = argmax_{y ∈ C} Σ fi(x)
Inference Based Training:
  Learning (global): find fi(x) s.t. y = argmax fi(x)
  Evaluation (global): h(x) = argmax fi(x)
40
Structured Output Learning
  • Abstract View
  • Decomposition versus Constraint Classification
  • More details: Inference with Classifiers

41
Structured Output Learning: Semantic Role Labeling
  • For each verb in a sentence
  • Identify all constituents that fill a semantic
    role
  • Determine their roles
  • Core Arguments, e.g., Agent, Patient or
    Instrument
  • Their adjuncts, e.g., Locative, Temporal or Manner

Y: all possible ways to label the tree
C(Y): all valid ways to label the tree
argmax_{y ∈ C(Y)} g(x, y)
[Figure: parse tree over "I left my pearls to my child"]
42
Components of Structured Output Learning
  • Input: X
  • Output: a collection of variables
  • Y = (y1,...,yL) ∈ {1,...,K}^L
  • Length is example dependent
  • Constraints on the Output: C(Y)
  • e.g. non-overlapping, no repeated values...
  • partition outputs into valid and invalid
    assignments
  • Representation
  • scoring function g(x, y)
  • e.g. linear: g(x, y) = w · Φ(x, y)
  • Inference
  • h(x) = argmax_{valid y} g(x, y) (a sketch follows
    below)

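A toy sketch of these components (illustrative names; a real system would use dynamic programming or integer linear programming rather than enumeration): a linear scoring function g(x, y) = w · Φ(x, y) with inference as an argmax over the valid outputs C(Y).

```python
import itertools
import numpy as np

def inference(x, w, phi, L, K, is_valid):
    """h(x) = argmax over valid y in {0..K-1}^L of g(x,y) = w . phi(x,y).
    Brute-force enumeration, feasible only for tiny L and K."""
    best_y, best_score = None, -np.inf
    for y in itertools.product(range(K), repeat=L):
        if not is_valid(y):        # C(Y): constraints on the output
            continue
        score = w @ phi(x, y)      # linear scoring function
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```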
43
Decomposition-based Learning
  • Many choices for decomposition
  • Depends on problem, learning model, computational
    resources, etc.
  • Value-based decomposition
  • A function for each output value
  • fk(x, l), k = 1,...,K
  • e.g. SRL tagging: fA0(x, node), fA1(x, node), ...
  • OvA learning
  • fk(x, node) > 0 iff k = y

44
Learning Discriminant Functions: The General
Setting
  • g(x, y) > g(x, y')  ∀ y' ∈ Y \ {y}
  • w · Φ(x, y) > w · Φ(x, y')  ∀ y' ∈ Y \ {y}
  • w · Φ(x, y, y') = w · (Φ(x, y) - Φ(x, y')) > 0
  • P(x, y) = { Φ(x, y, y') : y' ∈ Y \ {y} }
  • P(S) = ∪_{(x,y) ∈ S} P(x, y)
  • Learn a unary classifier over P(S)
  • (binary): (P(S), -P(S))
  • Used in many works [C02, WW00, CS01, CM03, TGK03]

45
Structured Output Learning: Semantic Role Labeling
  • Learn a collection of scoring functions
  • wA0 · ΦA0(x, y, n), wA1 · ΦA1(x, y, n), ...
  • score_v(x, y, n) = wv · Φv(x, y, n)
  • Global score
  • g(x, y) = Σ_n score_{yn}(x, y, n) = Σ_n w_{yn} · Φ_{yn}(x, y, n)
  • Learn locally (LO, L+I)
  • for each label variable (node) n and label, e.g. A0
  • gA0(x, y, n) = wA0 · ΦA0(x, y, n) > 0 iff yn = A0
  • Discriminant model dictates
  • g(x, y) > g(x, y'), ∀ y' ∈ C(Y)
  • argmax_{y ∈ C(Y)} g(x, y)
  • Learn globally (IBT)
  • g(x, y) = w · Φ(x, y)

46
Summary
Multi-class:
  OvA:
    Learning (independent): fi(x) > 0 iff y = i
    Evaluation (global): h(x) = argmax fi(x)
  Constraint Classification:
    Learning (global): find fi(x) s.t. y = argmax fi(x)
    Evaluation (global): h(x) = argmax fi(x)

Structured Output:
  Learn + Inference:
    Learning (independent): fi(x) > 0 iff i is a part of y
    Evaluation (global inference): h(x) = Inference{fi(x)}
    Efficient learning
  Inference Based Training:
    Learning (global): find fi(x) s.t. y = Inference{fi(x)}
    Evaluation (global inference): h(x) = Inference{fi(x)}
    Less efficient learning