Multiclass Classification in NLP - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Multiclass Classification in NLP
  • Named Entity Recognition
  • Label people, locations, and organizations in a
    sentence
  • [PER Sam Houston], born in [LOC Virginia], was
    a member of the [ORG US Congress].
  • Decompose into sub-problems
  • [Sam Houston], born in Virginia...
    → {PER, LOC, ORG, None} → PER (1)
  • Sam Houston, [born in] Virginia...
    → {PER, LOC, ORG, None} → None (0)
  • Sam Houston, born in [Virginia]...
    → {PER, LOC, ORG, None} → LOC (2)
  • Many problems in NLP are decomposed this way
  • Disambiguation tasks
  • POS Tagging
  • Word-sense disambiguation
  • Verb Classification
  • Semantic Role Labeling

2
Outline
  • Multi-Categorical Classification Tasks
  • Example: Semantic Role Labeling (SRL)
  • Decomposition Approaches
  • Constraint Classification
  • Unifies learning of multi-categorical classifiers
  • Structured-Output Learning
  • Revisit SRL
  • Decomposition versus Constraint Classification
  • Goal
  • Discuss multi-class and structured output from
    the same perspective.
  • Discuss similarities and differences

3
Multi-Categorical Output Tasks
  • Multi-class Classification (y ∈ {1,...,K})
  • character recognition (6)
  • document classification (homepage)
  • Multi-label Classification (y ⊆ {1,...,K})
  • document classification ({homepage, facultypage})
  • Category Ranking (y ∈ π(K))
  • user preference (love > like > hate)
  • document classification (homepage > facultypage >
    sports)
  • Hierarchical Classification (y ⊆ {1,...,K})
  • cohere with class hierarchy
  • place document into index where soccer is-a
    sport

4
(more) Multi-Categorical Output Tasks
  • Sequential Prediction (y ∈ {1,...,K}^L)
  • e.g. POS tagging (N V N N A)
  • This is a sentence. → D V D N
  • e.g. phrase identification
  • Many labels: K^L for a length-L sentence
  • Structured Output Prediction (y ∈ C({1,...,K}^L))
  • e.g. parse tree, multi-level phrase
    identification
  • e.g. sequential prediction
  • Constrained by
  • domain, problem, data, background knowledge,
    etc.

5
Semantic Role Labeling: A Structured-Output Problem
  • For each verb in a sentence
  • Identify all constituents that fill a semantic
    role
  • Determine their roles
  • Core Arguments, e.g., Agent, Patient or
    Instrument
  • Their adjuncts, e.g., Locative, Temporal or Manner

6
Semantic Role Labeling
I left my pearls to my daughter-in-law in my will.
  • Many possible valid outputs
  • Many possible invalid outputs

7
Structured Output Problems
  • Multi-Class
  • View y = 4 as (y1,...,yk) = (0 0 0 1 0 0 0)
  • The output is restricted: exactly one yi = 1
  • Learn f1(x),...,fk(x)
  • Sequence Prediction
  • e.g. POS tagging: x = (My name is Dav), y =
    (Pr, N, V, N)
  • e.g. restriction: every sentence must have a
    verb
  • Structured Output
  • Arbitrary global constraints
  • Local functions do not have access to global
    constraints!
  • Goal
  • Discuss multi-class and structured output from
    the same perspective.
  • Discuss similarities and differences

8
Transform the sub-problems
  • [Sam Houston], born in Virginia...
    → {PER, LOC, ORG, None} → PER (1)
  • Transform each problem to a feature vector (see
    the sketch below)
  • Sam Houston, born in Virginia
  • → (Bob-, JOHN-, SAM HOUSTON, HAPPY, -BORN,
    --BORN, ...)
  • → ( 0 ,  0 ,  1 ,  0 ,  1 ,  1 , ...)
  • Transform each label to a class label
  • PER → 1
  • LOC → 2
  • ORG → 3
  • None → 0
  • Input: {0,1}^d or R^d
  • Output: {0,1,2,3,...,k}
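To make the transformation concrete, here is a minimal Python sketch (the feature inventory and helper names are illustrative, not from the slides): each sub-problem becomes a binary feature vector in {0,1}^d paired with an integer class label.

```python
# Illustrative sketch; FEATURES and LABELS mirror the example above
# but are otherwise hypothetical.
FEATURES = ["Bob-", "JOHN-", "SAM HOUSTON", "HAPPY", "-BORN", "--BORN"]
LABELS = {"None": 0, "PER": 1, "LOC": 2, "ORG": 3}

def to_feature_vector(active):
    """Map the set of features that fire on a span to a 0/1 vector."""
    return [1 if f in active else 0 for f in FEATURES]

# "[Sam Houston], born in Virginia": the focus span matches SAM HOUSTON,
# and the context features -BORN and --BORN fire.
x = to_feature_vector({"SAM HOUSTON", "-BORN", "--BORN"})
y = LABELS["PER"]
print(x, y)  # [0, 0, 1, 0, 1, 1] 1
```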

9
Solving multiclass with binary learning
  • Multiclass classifier
  • Function f : R^d → {1,2,3,...,k}
  • Decompose into binary problems
  • Not always possible to learn
  • No theoretical justification (unless the problem
    is easy)

10
The Real MultiClass Problem
  • General framework
  • Extend binary algorithms
  • Theoretically justified
  • Provably correct
  • Generalizes well
  • Verified Experimentally
  • Naturally extends binary classification
    algorithms to the multiclass setting
  • e.g. linear binary separation induces linear
    boundaries in the multiclass setting

11
Multi Class over Linear Functions
  • One versus all (OvA)
  • All versus all (AvA)
  • Direct winner-take-all (D-WTA)

12
WTA over linear functions
  • Assume examples generated from winner-take-all
  • y = argmax_i (wi · x + ti)
  • wi, x ∈ R^n, ti ∈ R
  • Note: Voronoi diagrams are WTA functions (checked
    in the sketch below)
  • argmin_c ||ci - x|| = argmax_c (ci · x - ||ci||² / 2)
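A quick numeric check of the identity above (a sketch; the centers and data are random, not from the slides): the nearest center under Euclidean distance is always the winner of the linear scores ci · x - ||ci||²/2.

```python
import numpy as np

# Verify: argmin_i ||c_i - x||  ==  argmax_i (c_i . x - ||c_i||^2 / 2)
rng = np.random.default_rng(0)
C = rng.normal(size=(5, 3))     # five centers c_i in R^3
for _ in range(100):
    x = rng.normal(size=3)
    nearest = np.argmin(np.linalg.norm(C - x, axis=1))
    wta = np.argmax(C @ x - 0.5 * (C ** 2).sum(axis=1))
    assert nearest == wta       # the two rules always agree
```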

13
Learning via One-Versus-All (OvA) Assumption
  • Find vr, vb, vg, vy ∈ R^n such that
  • vr · x > 0 iff y = red
  • vb · x > 0 iff y = blue
  • vg · x > 0 iff y = green
  • vy · x > 0 iff y = yellow
  • Classifier: f(x) = argmax_i vi · x (a training
    sketch follows below)

Hypothesis space: H = R^kn
[Figure: individual classifiers and their decision regions]
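A minimal sketch of OvA training with perceptrons (illustrative, not the slides' implementation): one weight vector per class, each trained on the binary task "class i versus the rest", with prediction by winner-take-all.

```python
import numpy as np

def train_ova_perceptron(X, y, k, epochs=10):
    """X: (m, d) data matrix; y: labels in {0,...,k-1}."""
    V = np.zeros((k, X.shape[1]))
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            for c in range(k):
                target = 1 if y_i == c else -1   # class c vs. rest
                if target * (V[c] @ x_i) <= 0:   # mistake on task c
                    V[c] += target * x_i         # perceptron update
    return V

def predict(V, x):
    return int(np.argmax(V @ x))                 # f(x) = argmax_i vi . x
```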
14
Learning via All-versus-All (AvA) Assumption
  • Find vrb, vrg, vry, vbg, vby, vgy ∈ R^d such that
  • vrb · x > 0 if y = red
  •          < 0 if y = blue
  • vrg · x > 0 if y = red
  •          < 0 if y = green
  • ... (for all pairs)

[Figure: individual pairwise classifiers and their decision regions]
15
Classifying with AvA
All decision rules are applied post-learning and may behave inconsistently; one common rule is sketched below.
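A sketch of majority voting over the pairwise classifiers, under the assumption (not from the slides) that v[(i, j)] is the learned separator for classes i vs. j with i < j:

```python
import numpy as np

def ava_vote(v, x, k):
    """v: dict mapping (i, j) with i < j to a weight vector."""
    votes = np.zeros(k, dtype=int)
    for i in range(k):
        for j in range(i + 1, k):
            winner = i if v[(i, j)] @ x > 0 else j
            votes[winner] += 1
    return int(np.argmax(votes))   # ties broken by lowest class index
```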
16
Summary (1) Learning Binary Classifiers
  • On-line: Perceptron, Winnow
  • Mistake bounded
  • Generalizes well (VC-Dim)
  • Works well in practice
  • SVM
  • Well motivated to maximize margin
  • Generalizes well
  • Works well in practice
  • Boosting, Neural Networks, etc...

17
From Binary to Multi-categorical
  • Decompose multi-categorical problems
  • into multiple (independent) binary problems
  • Multi-class: OvA, AvA, ECOC, DT, etc.
  • Multi-label: reduce to multi-class
  • Category Ranking: reduce to multi-class, or use
    regression
  • Sequence Prediction
  • Reduce to Multi-class
  • part/alphabet based decompositions
  • Structured Output
  • learn parts of output based on local
    information!!!

18
Problems with Decompositions
  • Learning optimizes over local metrics
  • Poor global performance
  • What is the metric?
  • We don't care about the performance of the local
    classifiers
  • Poor decomposition → poor performance
  • Difficult local problems
  • Irrelevant local problems
  • Not clear how to decompose all Multi-category
    problems

19
Multi-class OvA Decomposition: a Linear
Representation
  • Hypothesis: h(x) = argmax_i vi · x
  • Decomposition
  • Each class represented by a linear function vi · x
  • Learning: One-versus-all (OvA)
  • For each class i: vi · x > 0 iff i = y
  • General Case
  • Each class represented by a function fi(x) > 0

20
Learning via One-Versus-All (OvA) Assumption
  • Classifier: f(x) = argmax_i vi · x

[Figure: individual classifiers]
  • OvA Learning: find vi such that vi · x > 0 iff y = i
  • OvA is fine only if the data is OvA separable!
  • A linear classifier can represent this function!
  • (Voronoi) argmin_c d(ci, x) ≡ (WTA) argmax_c (ci · x + di)

21
Other Issues we Mentioned
  • Error-Correcting Output Codes (ECOC)
  • Another (class of) decomposition
  • Difficulty: how to ensure that the resulting
    problems are separable
  • Commented on the advantage of All vs. All when
    working in the dual space (e.g., with kernels)

22
Example: SNoW Multi-class Classifier
How do we train?
How do we evaluate?
[Figure: feature nodes connected by weighted edges
(weight vectors) to target nodes, each an LTU]
  • SNoW only represents the targets and weighted
    edges

23
Winnow Extensions
  • Winnow learns monotone Boolean functions
  • To learn non-monotone Boolean functions
  • For each variable x, introduce its negation x'
  • Learn monotone functions over 2n variables
  • To learn functions with real-valued inputs
  • Balanced Winnow
  • Two weights per variable; the effective weight is
    their difference
  • Update rule (a sketch follows below)
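The slide's update rule did not survive extraction; below is a sketch of the textbook Balanced Winnow update (threshold simplified to 0): on a mistake, the positive weights are multiplicatively promoted or demoted, and the negative weights move the opposite way.

```python
import numpy as np

def balanced_winnow_update(w_pos, w_neg, x, y, r=1.5):
    """One mistake-driven step; y in {+1, -1}, x non-negative.
    The effective weight vector is w_pos - w_neg."""
    y_hat = 1 if (w_pos - w_neg) @ x > 0 else -1
    if y_hat != y:                 # update only on mistakes
        w_pos *= r ** (y * x)      # promote/demote elementwise
        w_neg *= r ** (-y * x)     # opposite move for negative weights
    return w_pos, w_neg
```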

24
An Intuition: Balanced Winnow
  • In most multi-class classifiers you have a target
    node that represents positive examples and a
    target node that represents negative examples.
  • Typically, we train each node separately (my /
    not-my example).
  • Rather, given an example we could say: this is
    more a positive example than a negative example.
  • We compare the activation of the different
    target nodes (classifiers) on a given example.
    (This example is more class + than class -.)

25
Constraint Classification
  • Can be viewed as a generalization of the balanced
    Winnow to the multi-class case
  • Unifies multi-class, multi-label,
    category-ranking
  • Reduces learning to a single binary learning task
  • Captures theoretical properties of binary
    algorithm
  • Experimentally verified
  • Naturally extends Perceptron, SVM, etc...
  • Do all of this by representing labels as a set of
    constraints or preferences among output labels.

26
Multi-category to Constraint Classification
  • Multiclass (see the sketch after this list)
  • (x, A) → (x, (A>B, A>C, A>D))
  • Multilabel
  • (x, {A, B}) → (x, (A>C, A>D, B>C, B>D))
  • Label Ranking
  • (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1))
  • Examples: (x, y), y ∈ Sk
  • Sk: partial order over class labels {1,...,k}
  • defines a preference relation ( > ) for class
    labeling
  • Constraint Classifier: h : X → Sk
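A small sketch of this mapping (illustrative helper names, with classes as indices): each labeled example is rewritten as a set of pairwise preference constraints (i > j) over class labels.

```python
def multiclass_constraints(y, k):
    """(x, y) -> constraints (y > j) for every other class j."""
    return [(y, j) for j in range(k) if j != y]

def multilabel_constraints(labels, k):
    """(x, {labels}) -> (i > j) for each labeled i, unlabeled j."""
    return [(i, j) for i in labels for j in range(k) if j not in labels]

print(multiclass_constraints(0, 4))       # A>B, A>C, A>D as index pairs
print(multilabel_constraints({0, 1}, 4))  # A>C, A>D, B>C, B>D
```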

27
Learning Constraint Classification: Kesler
Construction
  • Transform Examples

i > j  ⇔  fi(x) - fj(x) > 0
       ⇔  wi · x - wj · x > 0
       ⇔  W · Xi - W · Xj > 0
       ⇔  W · (Xi - Xj) > 0
       ⇔  W · Xij > 0

Xi = (0, x, 0, 0) ∈ R^kd
Xj = (0, 0, 0, x) ∈ R^kd
Xij = Xi - Xj = (0, x, 0, -x)
W = (w1, w2, w3, w4) ∈ R^kd
28
Kesler's Construction (1)
  • y = argmax_{i ∈ {r,b,g,y}} vi · x
  • vi, x ∈ R^n
  • Find vr, vb, vg, vy ∈ R^n such that
  • vr · x > vb · x
  • vr · x > vg · x
  • vr · x > vy · x

Hypothesis space: H = R^kn
29
Kesler's Construction (2)
  • Let v = (vr, vb, vg, vy) ∈ R^kn
  • Let 0n be the n-dimensional zero vector
  • vr · x > vb · x ⇔ v · (x, -x, 0n, 0n) > 0 ⇔ v · (-x, x, 0n, 0n) < 0
  • vr · x > vg · x ⇔ v · (x, 0n, -x, 0n) > 0 ⇔ v · (-x, 0n, x, 0n) < 0
  • vr · x > vy · x ⇔ v · (x, 0n, 0n, -x) > 0 ⇔ v · (-x, 0n, 0n, x) < 0

30
Kesler's Construction (3)
  • Let
  • v = (v1, ..., vk) ∈ R^n × ... × R^n = R^kn
  • xij = (0_(i-1)n, x, 0_(k-i)n) - (0_(j-1)n, x,
    0_(k-j)n) ∈ R^kn
  • Given (x, y) ∈ R^n × {1,...,k}
  • For all j ≠ y
  • Add (xyj, +1) to P+(x,y)
  • Add (-xyj, -1) to P-(x,y)
  • P+(x,y) has k-1 positive examples (∈ R^kn)
  • P-(x,y) has k-1 negative examples (∈ R^kn)

31
Learning via Kesler's Construction
  • Given (x1, y1), ..., (xN, yN) ∈ R^n × {1,...,k}
  • Create
  • P+ = ∪ P+(xi, yi)
  • P- = ∪ P-(xi, yi)
  • Find v = (v1, ..., vk) ∈ R^kn such that
  • v · x separates P+ from P-
  • Output
  • f(x) = argmax_i vi · x (a sketch follows below)
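A minimal sketch of the construction above (illustrative, not from the slides): each multiclass example is expanded into k-1 binary examples in R^(kn), any binary separator learned on the expansion yields the multiclass classifier f(x) = argmax_i vi · x.

```python
import numpy as np

def kesler_expand(x, y, k):
    """Expand one example (x in R^n, y in {0,...,k-1}) into k-1 positive
    binary examples in R^(k*n); negatives are their reflections."""
    n = len(x)
    positives = []
    for j in range(k):
        if j == y:
            continue
        z = np.zeros(k * n)
        z[y * n:(y + 1) * n] = x     # +x in block y
        z[j * n:(j + 1) * n] = -x    # -x in block j
        positives.append(z)          # label +1; (-z, -1) is the negative
    return positives

def predict(v, x, k):
    """v in R^(k*n) stacked as (v1,...,vk); f(x) = argmax_i vi . x"""
    n = len(x)
    scores = [v[i * n:(i + 1) * n] @ x for i in range(k)]
    return int(np.argmax(scores))
```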

32
Constraint Classification
  • Examples: (x, y)
  • y ∈ Sk
  • Sk: partial order over class labels {1,...,k}
  • defines a preference relation (<) for class
    labels
  • e.g. Multiclass: 2<1, 2<3, 2<4, 2<5
  • e.g. Multilabel: 1<3, 1<4, 1<5, 2<3, 2<4, 4<5
  • Constraint Classifier
  • f : X → Sk
  • f(x) is a partial order
  • f(x) is consistent with y if (i<j) ∈ y ⇒ (i<j)
    ∈ f(x)

33
Implementation
  • Examples: (x, y)
  • y ∈ Sk
  • Sk: partial order over class labels {1,...,k}
  • defines a preference relation (>) for class
    labels
  • e.g. Multiclass: 2>1, 2>3, 2>4, 2>5
  • Given an example that is labeled 2, the
    activation of target 2 on it should be larger
    than the activation of the other targets.
  • SNoW implementation: conservative.
  • Only the target node corresponding to the
    correct label and the node with the highest
    activation are compared.
  • If both are the same target node: no change.
  • Otherwise, promote one and demote the other
    (a sketch follows below).
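A minimal sketch of this conservative update (illustrative; it uses Perceptron-style additive updates rather than SNoW's Winnow-style multiplicative ones): compare the correct target with the highest-activation target, and update only those two.

```python
import numpy as np

def conservative_update(V, x, y, lr=1.0):
    """V: (k, d) weight matrix, one row per target; y: correct label."""
    y_hat = int(np.argmax(V @ x))   # highest-activation target
    if y_hat != y:                  # no change if they coincide
        V[y] += lr * x              # promote the correct target
        V[y_hat] -= lr * x          # demote the competing target
    return V
```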

34
Properties of Construction
  • Can learn any argmax_i vi · x function
  • Can use any algorithm to find a linear separation
  • Perceptron Algorithm
  • ultraconservative online algorithms [Crammer &
    Singer 2001]
  • Winnow Algorithm
  • multiclass Winnow [Mesterharm 2000]
  • Defines a multiclass margin
  • via the binary margin in R^kd
  • multiclass SVM [Crammer & Singer 2001]

35
Margin Generalization Bounds
  • Linear hypothesis space:
  • h(x) = argsort_i vi · x
  • vi, x ∈ R^d
  • argsort returns a permutation of {1,...,k}
  • CC margin-based bound
  • γ = min_{(x,y) ∈ S} min_{(i<j) ∈ y} (vi · x - vj · x)
  • m - number of examples
  • R = max_x ||x||
  • δ - confidence
  • C - average number of constraints

36
VC-style Generalization Bounds
  • Linear hypothesis space:
  • h(x) = argsort_i vi · x
  • vi, x ∈ R^d
  • argsort returns a permutation of {1,...,k}
  • CC VC-based bound
  • m - number of examples
  • d - dimension of the input space
  • δ - confidence
  • k - number of classes

37
Beyond Multiclass Classification
  • Ranking
  • category ranking (over classes)
  • ordinal regression (over examples)
  • Multilabel
  • x is both red and blue
  • Complex relationships
  • x is more red than blue, but not green
  • Millions of classes
  • sequence labeling (e.g. POS tagging) LATER
  • SNoW has an implementation of Constraint
    Classification for the Multi-Class case. Try to
    compare with 1-vs-all.
  • Experimental issue: when is this version of
    multi-class better?
  • Several easy improvements are possible via
    modifying the loss function.

38
Multi-class Experiments
The picture isn't so clear for very high dimensional
problems. Why?
39
Summary
OvA:
  Learning (independent): fi(x) > 0 iff y = i
  Evaluation (global): h(x) = argmax fi(x)
Constraint Classification:
  Learning (global): find fi(x) s.t. y = argmax fi(x)
  Evaluation (global): h(x) = argmax fi(x)

Learn + Inference:
  Learning (independent): fi(x) > 0 iff i is a part of y
  Evaluation (global inference): h(x) = argmax_{y ∈ C} Σ fi(x)
Inference Based Training:
  Learning (global): find fi(x) s.t. y = argmax fi(x)
  Evaluation (global): h(x) = argmax fi(x)
40
Structured Output Learning
  • Abstract View
  • Decomposition versus Constraint Classification
  • More details: Inference with Classifiers

41
Structured Output Learning: Semantic Role Labeling
  • For each verb in a sentence
  • Identify all constituents that fill a semantic
    role
  • Determine their roles
  • Core Arguments, e.g., Agent, Patient or
    Instrument
  • Their adjuncts, e.g., Locative, Temporal or Manner

Y: all possible ways to label the tree
C(Y): all valid ways to label the tree
argmax_{y ∈ C(Y)} g(x, y)
[Figure: parse tree over "I left my pearls to my child"]
42
Components of Structured Output Learning
  • Input: X
  • Output: a collection of variables
  • Y = (y1,...,yL) ∈ {1,...,K}^L
  • Length is example dependent
  • Constraints on the Output: C(Y)
  • e.g. non-overlapping, no repeated values...
  • partition outputs into valid and invalid
    assignments
  • Representation
  • scoring function g(x, y)
  • e.g. linear: g(x, y) = w · Φ(x, y)
  • Inference
  • h(x) = argmax_{valid y} g(x, y) (a sketch follows
    below)

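A toy sketch of these components (illustrative names; a real system would use dynamic programming or integer linear programming rather than enumeration): a linear scoring function g(x, y) = w · Φ(x, y) with inference as an argmax over the valid outputs C(Y).

```python
import itertools
import numpy as np

def inference(x, w, phi, L, K, is_valid):
    """h(x) = argmax over valid y in {0..K-1}^L of g(x,y) = w . phi(x,y).
    Brute-force enumeration, feasible only for tiny L and K."""
    best_y, best_score = None, -np.inf
    for y in itertools.product(range(K), repeat=L):
        if not is_valid(y):        # C(Y): constraints on the output
            continue
        score = w @ phi(x, y)      # linear scoring function
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```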
43
Decomposition-based Learning
  • Many choices for decomposition
  • Depends on problem, learning model, computational
    resources, etc.
  • Value-based decomposition
  • A function for each output value
  • fk(x, l), k = 1,...,K
  • e.g. SRL tagging: fA0(x, node), fA1(x, node), ...
  • OvA learning
  • fk(x, node) > 0 iff k = y

44
Learning Discriminant Functions: The General
Setting
  • g(x, y) > g(x, y')  ∀ y' ∈ Y \ {y}
  • w · Φ(x, y) > w · Φ(x, y')  ∀ y' ∈ Y \ {y}
  • w · Φ(x, y, y') = w · (Φ(x, y) - Φ(x, y')) > 0
  • P(x, y) = { Φ(x, y, y') : y' ∈ Y \ {y} }
  • P(S) = ∪_{(x,y) ∈ S} P(x, y)
  • Learn a unary classifier over P(S)
  • (binary): (P(S), -P(S))
  • Used in many works [C02, WW00, CS01, CM03, TGK03]

45
Structured Output Learning: Semantic Role Labeling
  • Learn a collection of scoring functions
  • wA0 · ΦA0(x, y, n), wA1 · ΦA1(x, y, n), ...
  • score_v(x, y, n) = wv · Φv(x, y, n)
  • Global score
  • g(x, y) = Σ_n score_{yn}(x, y, n) = Σ_n w_{yn} · Φ_{yn}(x, y, n)
  • Learn locally (LO, L+I)
  • for each label variable (node) n and label, e.g. A0
  • gA0(x, y, n) = wA0 · ΦA0(x, y, n) > 0 iff yn = A0
  • Discriminant model dictates
  • g(x, y) > g(x, y'), ∀ y' ∈ C(Y)
  • argmax_{y ∈ C(Y)} g(x, y)
  • Learn globally (IBT)
  • g(x, y) = w · Φ(x, y)

46
Summary
Multi-class:
  OvA:
    Learning (independent): fi(x) > 0 iff y = i
    Evaluation (global): h(x) = argmax fi(x)
  Constraint Classification:
    Learning (global): find fi(x) s.t. y = argmax fi(x)
    Evaluation (global): h(x) = argmax fi(x)

Structured Output:
  Learn + Inference:
    Learning (independent): fi(x) > 0 iff i is a part of y
    Evaluation (global inference): h(x) = Inference{fi(x)}
    Efficient learning
  Inference Based Training:
    Learning (global): find fi(x) s.t. y = Inference{fi(x)}
    Evaluation (global inference): h(x) = Inference{fi(x)}
    Less efficient learning