Title: Multiclass Classification in NLP
1. Multiclass Classification in NLP
- Named Entity Recognition
  - Label people, locations, and organizations in a sentence
  - [PER Sam Houston], born in [LOC Virginia], was a member of the [ORG US Congress].
- Decompose into sub-problems
  - Sam Houston, born in Virginia... → (PER, LOC, ORG, none) → PER (1)
  - Sam Houston, born in Virginia... → (PER, LOC, ORG, none) → None (0)
  - Sam Houston, born in Virginia... → (PER, LOC, ORG, none) → LOC (2)
- Many problems in NLP are decomposed this way
- Disambiguation tasks
  - POS tagging
  - Word-sense disambiguation
  - Verb classification
  - Semantic-role labeling
2. Outline
- Multi-Categorical Classification Tasks
  - example: Semantic Role Labeling (SRL)
- Decomposition Approaches
- Constraint Classification
  - Unifies learning of multi-categorical classifiers
- Structured-Output Learning
  - revisit SRL
- Decomposition versus Constraint Classification
- Goal
  - Discuss multi-class and structured output from the same perspective.
  - Discuss similarities and differences
3. Multi-Categorical Output Tasks
- Multi-class Classification (y ∈ {1,...,K})
  - character recognition ('6')
  - document classification ('homepage')
- Multi-label Classification (y ⊆ {1,...,K})
  - document classification ('(homepage, facultypage)')
- Category Ranking (y = a ranking over {1,...,K})
  - user preference (love > like > hate)
  - document classification (homepage > facultypage > sports)
- Hierarchical Classification (y ∈ {1,...,K})
  - must cohere with the class hierarchy
  - place a document into an index where 'soccer' is-a 'sport'
4. (more) Multi-Categorical Output Tasks
- Sequential Prediction (y ∈ {1,...,K}^L)
  - e.g. POS tagging ('(N, V, N, N, A)')
    - "This is a sentence." → D V D N
  - e.g. phrase identification
  - Many labels: K^L for a length-L sentence
- Structured Output Prediction (y ∈ C({1,...,K}^L))
  - e.g. parse tree, multi-level phrase identification
  - e.g. sequential prediction
  - Constrained by domain, problem, data, background knowledge, etc...
5. Semantic Role Labeling: A Structured-Output Problem
- For each verb in a sentence
  - Identify all constituents that fill a semantic role
  - Determine their roles
    - Core arguments, e.g., Agent, Patient, or Instrument
    - Their adjuncts, e.g., Locative, Temporal, or Manner
6. Semantic Role Labeling
I left my pearls to my daughter-in-law in my will.
- Many possible valid outputs
- Many possible invalid outputs
7. Structured Output Problems
- Multi-Class
  - View y = 4 as (y1,...,yk) = (0 0 0 1 0 0 0)
  - The output is restricted by: exactly one y_i = 1
  - Learn f1(x),...,fk(x)
- Sequence Prediction
  - e.g. POS tagging: x = (My name is Dav), y = (Pr, N, V, N)
  - e.g. restriction: "Every sentence must have a verb"
- Structured Output
  - Arbitrary global constraints
  - Local functions do not have access to global constraints!
- Goal
  - Discuss multi-class and structured output from the same perspective.
  - Discuss similarities and differences
8. Transform the Sub-problems
- Sam Houston, born in Virginia... → (PER, LOC, ORG, none) → PER (1)
- Transform each problem to a feature vector
  - Sam Houston, born in Virginia
  - → (BOB-, JOHN-, SAM HOUSTON, HAPPY, -BORN, --BORN, ...)
  - → ( 0,    0,     1,           0,     1,     1,      ... )
- Transform each label to a class label
  - PER → 1
  - LOC → 2
  - ORG → 3
  - none → 0
- Input: {0,1}^d or R^d
- Output: {0, 1, 2, 3, ..., k}
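The transformation above can be sketched in a few lines. The feature list mirrors the slide's example; the substring-matching logic per feature is an illustrative assumption.

```python
# A sketch of the slide's transformation, assuming this small feature
# list (from the slide) and a simple substring test per feature.
FEATURES = ["BOB-", "JOHN-", "SAM HOUSTON", "HAPPY", "-BORN", "--BORN"]
LABELS = {"PER": 1, "LOC": 2, "ORG": 3, None: 0}

def to_feature_vector(window_tokens):
    """Binary vector: 1 if the feature fires in the window, else 0."""
    text = " ".join(window_tokens).upper()
    return [1 if f.strip("-") in text else 0 for f in FEATURES]

x = to_feature_vector(["Sam", "Houston", "born"])   # input in {0,1}^d
y = LABELS["PER"]                                   # output in {0,...,k}
```

With these features, the window "Sam Houston born" maps to the vector (0, 0, 1, 0, 1, 1, ...) shown on the slide.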
9. Solving Multiclass with Binary Learning
- Multiclass classifier
  - A function f: R^d → {1, 2, 3, ..., k}
- Decompose into binary problems
  - Not always possible to learn
  - No theoretical justification (unless the problem is easy)
10. The Real Multiclass Problem
- General framework
  - Extend binary algorithms
  - Theoretically justified
    - Provably correct
    - Generalizes well
  - Verified experimentally
- Naturally extends binary classification algorithms to the multiclass setting
  - e.g. linear binary separation induces linear boundaries in the multiclass setting
11. Multiclass over Linear Functions
- Direct winner-take-all (D-WTA)
12. WTA over Linear Functions
- Assume examples are generated from a winner-take-all model
  - y = argmax_i (w_i·x + t_i)
  - w_i, x ∈ R^n, t_i ∈ R
- Note: Voronoi diagrams are WTA functions
  - argmin_i ||c_i − x|| = argmax_i (c_i·x − ||c_i||² / 2)
13. Learning via One-Versus-All (OvA) Assumption
- Find vr, vb, vg, vy ∈ R^n such that
  - vr·x > 0 iff y = red
  - vb·x > 0 iff y = blue
  - vg·x > 0 iff y = green
  - vy·x > 0 iff y = yellow
- Classifier: f(x) = argmax v_i·x
- H = R^{kn}
(figures: individual classifiers; decision regions)
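A minimal sketch of OvA learning with perceptrons, assuming the data is OvA separable. The four-class toy dataset (with a bias feature appended as the last coordinate) is an illustrative assumption.

```python
# One perceptron per class on a +1/-1 relabeling of the data;
# the multiclass classifier is f(x) = argmax_i v_i . x.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_ova(X, Y, classes, epochs=20):
    models = {c: [0.0] * len(X[0]) for c in classes}
    for c, w in models.items():
        for _ in range(epochs):
            for x, y in zip(X, Y):
                t = 1 if y == c else -1        # class c versus the rest
                if t * dot(w, x) <= 0:         # perceptron mistake
                    for i, xi in enumerate(x):
                        w[i] += t * xi
    return models

def predict(models, x):
    return max(models, key=lambda c: dot(models[c], x))

X = [[1, 0, 1], [0, 1, 1], [-1, 0, 1], [0, -1, 1]]   # bias appended
Y = ["red", "blue", "green", "yellow"]
models = train_ova(X, Y, Y)
```

Each binary problem here is separable, so every classifier is correct and the argmax recovers the right class; when the data is not OvA separable, individual classifiers fail even though a WTA separator may still exist.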
14. Learning via All-Versus-All (AvA) Assumption
- Find vrb, vrg, vry, vbg, vby, vgy ∈ R^d such that
  - vrb·x > 0 if y = red, < 0 if y = blue
  - vrg·x > 0 if y = red, < 0 if y = green
  - ... (for all pairs)
(figures: individual classifiers; decision regions)
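AvA can be sketched the same way: one perceptron per unordered pair of classes, trained only on examples of those two classes, with prediction by majority vote. The toy dataset (bias appended) is again an illustrative assumption.

```python
# All-versus-all: k(k-1)/2 pairwise perceptrons, majority-vote prediction.
from itertools import combinations

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pair(X, Y, a, b, epochs=20):
    """Perceptron separating class a (+1) from class b (-1)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            if y not in (a, b):
                continue                       # other classes are ignored
            t = 1 if y == a else -1
            if t * dot(w, x) <= 0:
                w = [wi + t * xi for wi, xi in zip(w, x)]
    return w

def predict_ava(models, classes, x):
    votes = {c: 0 for c in classes}
    for (a, b), w in models.items():
        votes[a if dot(w, x) > 0 else b] += 1
    return max(votes, key=votes.get)

X = [[1, 0, 1], [0, 1, 1], [-1, 0, 1], [0, -1, 1]]   # bias appended
Y = ["red", "blue", "green", "yellow"]
classes = ["red", "blue", "green", "yellow"]
models = {(a, b): train_pair(X, Y, a, b) for a, b in combinations(classes, 2)}
```

Note that each pairwise problem only needs the two classes involved to be separable from each other, a weaker assumption than OvA separability.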
15. Classifying with AvA
- All decision rules are applied post-learning and may behave inconsistently
16. Summary (1): Learning Binary Classifiers
- On-line: Perceptron, Winnow
  - Mistake bounded
  - Generalizes well (VC-dim)
  - Works well in practice
- SVM
  - Well motivated: maximizes the margin
  - Generalizes well
  - Works well in practice
- Boosting, Neural Networks, etc...
17. From Binary to Multi-Categorical
- Decompose multi-categorical problems into multiple (independent) binary problems
- Multi-class: OvA, AvA, ECOC, DT, etc...
- Multi-label: reduce to multi-class
- Category ranking: reduce, or regression
- Sequence Prediction
  - Reduce to multi-class
  - part/alphabet-based decompositions
- Structured Output
  - learn parts of the output based on local information!!!
18. Problems with Decompositions
- Learning optimizes over local metrics
  - Poor global performance
  - What is the metric? We don't care about the performance of the local classifiers
- Poor decomposition → poor performance
  - Difficult local problems
  - Irrelevant local problems
- Not clear how to decompose all multi-category problems
19. Multi-class OvA Decomposition: a Linear Representation
- Hypothesis: h(x) = argmax_i v_i·x
- Decomposition
  - Each class represented by a linear function v_i·x
- Learning: One-versus-all (OvA)
  - For each class i: v_i·x > 0 iff y = i
- General case
  - Each class represented by a function f_i(x) > 0
20. Learning via One-Versus-All (OvA) Assumption
- Classifier: f(x) = argmax v_i·x
(figure: individual classifiers)
- OvA learning: find v_i·x > 0 iff y = i
- OvA is fine only if the data is OvA separable!
  - A linear classifier can represent this function!
  - (Voronoi) argmin_i d(c_i, x) ⇔ (WTA) argmax_i (c_i·x + d_i)
21. Other Issues We Mentioned
- Error-Correcting Output Codes (ECOC)
  - Another (class of) decomposition
  - Difficulty: how to make sure that the resulting problems are separable
- Commented on the advantage of All-vs-All when working in the dual space (e.g., with kernels)
22. Example: SNoW Multi-class Classifier
How do we train? How do we evaluate?
(figure: targets (each an LTU) connected to features by weighted edges (weight vectors))
- SNoW only represents the targets and the weighted edges
23. Winnow Extensions
- Winnow learns monotone Boolean functions
- To learn non-monotone Boolean functions:
  - For each variable x, introduce its negation x'
  - Learn monotone functions over 2n variables
- To learn functions with real-valued inputs:
  - Balanced Winnow
  - 2 weights per variable; the effective weight is their difference
  - Update rule
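The Balanced Winnow scheme above can be sketched as follows. The promotion factor alpha, the threshold theta, and the toy target function (x1 AND NOT x2) are illustrative assumptions.

```python
# Balanced Winnow sketch: two weights per variable (w_plus, w_minus);
# the effective weight is their difference, so non-monotone functions
# become learnable. Updates are multiplicative on active features.

def balanced_winnow(X, Y, alpha=2.0, theta=1.0, epochs=50):
    n = len(X[0])
    wp, wm = [1.0] * n, [1.0] * n
    for _ in range(epochs):
        for x, y in zip(X, Y):                  # y in {+1, -1}
            if bw_predict(wp, wm, x, theta) != y:
                for i, xi in enumerate(x):
                    if xi:                      # update only active features
                        if y == 1:              # promote: grow w_plus
                            wp[i] *= alpha; wm[i] /= alpha
                        else:                   # demote: grow w_minus
                            wp[i] /= alpha; wm[i] *= alpha
    return wp, wm

def bw_predict(wp, wm, x, theta=1.0):
    score = sum((p - m) * xi for p, m, xi in zip(wp, wm, x))
    return 1 if score >= theta else -1

X = [[1, 0], [1, 1], [0, 1], [0, 0]]
Y = [1, -1, -1, -1]                             # target: x1 AND NOT x2
wp, wm = balanced_winnow(X, Y)
```

The target is non-monotone in x2, so plain Winnow over (x1, x2) could not represent it; the effective weight w_plus − w_minus can go negative, which is exactly the point of the construction.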
24. An Intuition: Balanced Winnow
- In most multi-class classifiers you have a target node that represents positive examples and a target node that represents negative examples.
- Typically, we train each node separately (my example / not my example).
- Rather, given an example, we could say: this is more a positive example than a negative example.
- We compare the activations of the different target nodes (classifiers) on a given example. (This example is more one class than another.)
25. Constraint Classification
- Can be viewed as a generalization of Balanced Winnow to the multi-class case
- Unifies multi-class, multi-label, category-ranking
- Reduces learning to a single binary learning task
- Captures the theoretical properties of the binary algorithm
- Experimentally verified
- Naturally extends Perceptron, SVM, etc...
- Does all of this by representing labels as a set of constraints, or preferences, among output labels.
26. Multi-category to Constraint Classification
- Multiclass
  - (x, A) → (x, (A>B, A>C, A>D))
- Multilabel
  - (x, (A, B)) → (x, (A>C, A>D, B>C, B>D))
- Label Ranking
  - (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1))
- Examples: (x, y), y ∈ S_k
  - S_k: a partial order over the class labels {1,...,k}
  - defines a preference relation (>) over class labels
- Constraint Classifier: h: X → S_k
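The three transformations on this slide can be written down directly. Representing a preference "i beats j" as the pair (i, j) is an illustrative choice.

```python
# Label-to-constraint transformations: each example keeps its input x;
# the label becomes a set of preference pairs (i, j) meaning i > j.

def multiclass_constraints(label, classes):
    """(x, A) -> A beats every other class."""
    return {(label, c) for c in classes if c != label}

def multilabel_constraints(labels, classes):
    """(x, (A, B)) -> every true label beats every false one."""
    return {(a, c) for a in labels for c in classes if c not in labels}

def ranking_constraints(ranking):
    """Full ranking -> adjacent pairs (transitivity gives the rest)."""
    return {(a, b) for a, b in zip(ranking, ranking[1:])}
```

For example, multiclass_constraints("A", ["A", "B", "C", "D"]) yields {("A","B"), ("A","C"), ("A","D")}, matching (A>B, A>C, A>D) on the slide.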
27. Learning Constraint Classification: Kesler Construction
i > j:  f_i(x) − f_j(x) > 0
    ⇔  w_i·x − w_j·x > 0
    ⇔  W·X_i − W·X_j > 0
    ⇔  W·(X_i − X_j) > 0
    ⇔  W·X_ij > 0
where X_i = (0, x, 0, 0) ∈ R^{kd} (x in block i), X_j = (0, 0, 0, x) ∈ R^{kd} (x in block j),
X_ij = X_i − X_j = (0, x, 0, −x), and W = (w_1, w_2, w_3, w_4) ∈ R^{kd}.
28. Kesler's Construction (1)
- y = argmax_{i ∈ {r,b,g,y}} v_i·x
  - v_i, x ∈ R^n
- Find vr, vb, vg, vy ∈ R^n such that
  - vr·x > vb·x
  - vr·x > vg·x
  - vr·x > vy·x
- H = R^{kn}
29. Kesler's Construction (2)
- Let v = (vr, vb, vg, vy) ∈ R^{kn}
- Let 0^n be the n-dimensional zero vector
- vr·x > vb·x ⇔ v·(x, −x, 0^n, 0^n) > 0 ⇔ v·(−x, x, 0^n, 0^n) < 0
- vr·x > vg·x ⇔ v·(x, 0^n, −x, 0^n) > 0 ⇔ v·(−x, 0^n, x, 0^n) < 0
- vr·x > vy·x ⇔ v·(x, 0^n, 0^n, −x) > 0 ⇔ v·(−x, 0^n, 0^n, x) < 0
30. Kesler's Construction (3)
- Let
  - v = (v_1, ..., v_k) ∈ R^n × ... × R^n = R^{kn}
  - x_ij = (0^{(i-1)n}, x, 0^{(k-i)n}) − (0^{(j-1)n}, x, 0^{(k-j)n}) ∈ R^{kn}
- Given (x, y) ∈ R^n × {1,...,k}
  - For all j ≠ y
    - Add (x_yj, +1) to P+(x,y)
    - Add (−x_yj, −1) to P−(x,y)
  - P+(x,y) has k−1 positive examples (∈ R^{kn})
  - P−(x,y) has k−1 negative examples (∈ R^{kn})
31. Learning via Kesler's Construction
- Given (x_1, y_1), ..., (x_N, y_N) ∈ R^n × {1,...,k}
- Create
  - P+ = ∪_i P+(x_i, y_i)
  - P− = ∪_i P−(x_i, y_i)
- Find v = (v_1, ..., v_k) ∈ R^{kn} such that
  - v·x separates P+ from P−
- Output
  - f(x) = argmax v_i·x
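A sketch of learning via Kesler's construction: each constraint "y beats j" becomes a point x_yj in R^{kn}, and a single binary perceptron is run over those points. The three-class toy dataset (last coordinate is a bias) is an illustrative assumption.

```python
# Kesler's construction with one binary perceptron over R^{kn}.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def kesler_point(x, i, j, k):
    """x in block i, -x in block j, zeros elsewhere: x_ij in R^{kn}."""
    n = len(x)
    z = [0.0] * (k * n)
    for t, xt in enumerate(x):
        z[i * n + t] = xt
        z[j * n + t] = -xt
    return z

def train_kesler(X, Y, k, epochs=30):
    n = len(X[0])
    w = [0.0] * (k * n)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            for j in range(k):
                if j != y:
                    z = kesler_point(x, y, j, k)   # constraint: y beats j
                    if dot(w, z) <= 0:             # violated -> update
                        w = [wi + zi for wi, zi in zip(w, z)]
    return w

def predict(w, x, k):
    """f(x) = argmax_i v_i . x, reading v_i off the blocks of w."""
    n = len(x)
    return max(range(k), key=lambda i: dot(w[i * n:(i + 1) * n], x))

X = [[1, 0, 1], [0, 1, 1], [-1, -1, 1]]            # bias appended
Y = [0, 1, 2]
w = train_kesler(X, Y, 3)
```

Note that only the negative point −x_yj is redundant for the perceptron (it is a mirror image of x_yj), so training on the positive constraints alone suffices here.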
32. Constraint Classification
- Examples: (x, y)
  - y ∈ S_k
  - S_k: a partial order over the class labels {1,...,k}
  - defines a preference relation (>) over class labels
  - e.g. Multiclass: 2>1, 2>3, 2>4, 2>5
  - e.g. Multilabel: 1>3, 1>4, 1>5, 2>3, 2>4, 2>5
- Constraint Classifier
  - f: X → S_k
  - f(x) is a partial order
  - f(x) is consistent with y if (i>j) ∈ y ⇒ (i>j) ∈ f(x)
33. Implementation
- Examples: (x, y), y ∈ S_k
  - S_k: a partial order over the class labels {1,...,k}
  - defines a preference relation (>) over class labels
  - e.g. multiclass: 2>1, 2>3, 2>4, 2>5
- Given an example labeled 2, the activation of target 2 on it should be larger than the activations of the other targets.
- SNoW implementation: conservative.
  - Only the target node corresponding to the correct label and the node with the highest activation are compared.
  - If both are the same target node: no change.
  - Otherwise, promote one and demote the other.
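The conservative update can be sketched as follows. SNoW's targets use multiplicative (Winnow) updates; this sketch uses additive perceptron-style promotion/demotion for simplicity, and the toy data is an assumption.

```python
# Conservative update: compare only the correct target and the
# highest-activation target; if they differ, promote one, demote the other.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def conservative_train(X, Y, k, epochs=20):
    n = len(X[0])
    W = [[0.0] * n for _ in range(k)]          # one weight vector per target
    for _ in range(epochs):
        for x, y in zip(X, Y):
            winner = max(range(k), key=lambda i: dot(W[i], x))
            if winner != y:                    # same node -> no change
                for i, xi in enumerate(x):
                    W[y][i] += xi              # promote the correct target
                    W[winner][i] -= xi         # demote the current winner
    return W

X = [[1, 0, 1], [0, 1, 1]]                     # last coordinate: bias
Y = [0, 1]
W = conservative_train(X, Y, 2)
```

Updating only two of the k targets per mistake is what makes the rule "conservative" (ultraconservative, in Crammer and Singer's terminology).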
34. Properties of Construction
- Can learn any argmax v_i·x function
- Can use any algorithm to find a linear separation
  - Perceptron algorithm
    - ultraconservative online algorithm [Crammer, Singer 2001]
  - Winnow algorithm
    - multiclass Winnow [Mesterharm 2000]
- Defines a multiclass margin
  - via the binary margin in R^{kd}
  - multiclass SVM [Crammer, Singer 2001]
35. Margin Generalization Bounds
- Linear hypothesis space
  - h(x) = argsort_i v_i·x
  - v_i, x ∈ R^d
  - argsort returns a permutation of {1,...,k}
- CC margin-based bound
  - γ = min_{(x,y)∈S} min_{(i>j)∈y} (v_i·x − v_j·x)
  - m: number of examples
  - R: max_x ||x||
  - δ: confidence
  - C: average number of constraints
36. VC-style Generalization Bounds
- Linear hypothesis space
  - h(x) = argsort_i v_i·x
  - v_i, x ∈ R^d
  - argsort returns a permutation of {1,...,k}
- CC VC-dimension-based bound
  - m: number of examples
  - d: dimension of the input space
  - δ: confidence
  - k: number of classes
37. Beyond Multiclass Classification
- Ranking
  - category ranking (over classes)
  - ordinal regression (over examples)
- Multilabel
  - x is both red and blue
- Complex relationships
  - x is more red than blue, but not green
- Millions of classes
  - sequence labeling (e.g. POS tagging) (covered later)
- SNoW has an implementation of Constraint Classification for the multi-class case. Try to compare it with 1-vs-all.
- Experimental issues: when is this version of multi-class better?
- Several easy improvements are possible via modifying the loss function.
38. Multi-class Experiments
The picture isn't so clear for very high-dimensional problems. Why?
39. Summary

Multi-class:
- OvA: learning is independent (f_i(x) > 0 iff y = i); evaluation is global (h(x) = argmax f_i(x))
- Constraint Classification: learning is global (find f_i(x) s.t. y = argmax f_i(x)); evaluation is global (h(x) = argmax f_i(x))

Structured Output:
- Learn + Inference: learning is independent (f_i(x) > 0 iff i is a part of y); evaluation is global inference (h(x) = argmax_{y∈C} Σ f_i(x))
- Inference Based Training: learning is global (find f_i(x) s.t. y = argmax f_i(x)); evaluation is global (h(x) = argmax f_i(x))
40. Structured Output Learning
- Abstract view
- Decomposition versus Constraint Classification
- More details: Inference with Classifiers
41. Structured Output Learning: Semantic Role Labeling
- For each verb in a sentence
  - Identify all constituents that fill a semantic role
  - Determine their roles
    - Core arguments, e.g., Agent, Patient, or Instrument
    - Their adjuncts, e.g., Locative, Temporal, or Manner
- Y: all possible ways to label the tree
- C(Y): all valid ways to label the tree
- argmax_{y ∈ C(Y)} g(x, y)
(figure: parse tree over "I left my pearls to my child")
42. Components of Structured Output Learning
- Input: x ∈ X
- Output: a collection of variables
  - y = (y1,...,yL) ∈ {1,...,K}^L
  - the length L is example dependent
- Constraints on the output: C(Y)
  - e.g. non-overlapping, no repeated values...
  - partition outputs into valid and invalid assignments
- Representation
  - scoring function g(x, y)
  - e.g. linear: g(x, y) = w · Φ(x, y)
- Inference
  - h(x) = argmax_{valid y} g(x, y)
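A minimal sketch of these components, with exhaustive enumeration as the inference procedure. The feature map Φ (counts of (input value, label) pairs) and the "no repeated labels" constraint are illustrative assumptions.

```python
# Linear score g(x, y) = w . phi(x, y); inference searches only the
# valid assignments C(Y).
from itertools import product

K = 3                                          # label set {0, 1, 2}

def phi(x, y):
    """Joint feature map: count of each (input value, label) pair."""
    feats = {}
    for xi, yi in zip(x, y):
        feats[(xi, yi)] = feats.get((xi, yi), 0) + 1
    return feats

def g(w, x, y):
    return sum(w.get(f, 0.0) * v for f, v in phi(x, y).items())

def valid(y):
    return len(set(y)) == len(y)               # constraint: labels distinct

def inference(w, x):
    """h(x) = argmax over valid y of g(x, y), by exhaustive enumeration."""
    return max((y for y in product(range(K), repeat=len(x)) if valid(y)),
               key=lambda y: g(w, x, y))

w = {("a", 0): 2.0, ("b", 0): 1.5, ("b", 1): 1.0}
y_hat = inference(w, ["a", "b"])
```

With these weights the unconstrained argmax would label both positions 0, but that assignment is invalid; the constrained inference returns (0, 1) instead, illustrating why local functions alone cannot enforce global constraints.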
43. Decomposition-based Learning
- Many choices for decomposition
  - depends on the problem, learning model, computational resources, etc...
- Value-based decomposition
  - a function for each output value
    - f_k(x, l), k = 1,...,K
  - e.g. SRL tagging: f_A0(x, node), f_A1(x, node), ...
- OvA learning
  - f_k(x, node) > 0 iff y = k
44. Learning Discriminant Functions: The General Setting
- g(x, y) > g(x, y'), ∀ y' ∈ Y \ {y}
- w · Φ(x, y) > w · Φ(x, y'), ∀ y' ∈ Y \ {y}
- w · Δ(x, y, y') = w · (Φ(x, y) − Φ(x, y')) > 0
- P(x, y) = { Δ(x, y, y') : y' ∈ Y \ {y} }
- P(S) = ∪_{(x,y)∈S} P(x, y)
- Learn a unary classifier over P(S)
  - (binary): (P(S), −P(S))
- Used in many works [C02, WW00, CS01, CM03, TGK03]
45. Structured Output Learning: Semantic Role Labeling
- Learn a collection of scoring functions
  - w_A0·Φ_A0(x, y, n), w_A1·Φ_A1(x, y, n), ...
  - score_v(x, y, n) = w_v·Φ_v(x, y, n)
- Global score
  - g(x, y) = Σ_n score_{y_n}(x, y, n) = Σ_n w_{y_n}·Φ_{y_n}(x, y, n)
- Learn locally (LO, LI)
  - for each label variable (node) n and label A0:
    - g_A0(x, y, n) = w_A0·Φ_A0(x, y, n) > 0 iff y_n = A0
- Discriminant model dictates
  - g(x, y) > g(x, y'), ∀ y' ∈ C(Y)
  - h(x) = argmax_{y ∈ C(Y)} g(x, y)
- Learn globally (IBT)
  - g(x, y) = w · Φ(x, y)
46. Summary

Multi-class:
- OvA: learning is independent (f_i(x) > 0 iff y = i); evaluation is global (h(x) = argmax f_i(x))
- Constraint Classification: learning is global (find f_i(x) s.t. y = argmax f_i(x)); evaluation is global (h(x) = argmax f_i(x))

Structured Output:
- Learn + Inference: learning is independent (f_i(x) > 0 iff i is a part of y); evaluation is global (h(x) = inference over the f_i(x)). Efficient learning.
- Inference Based Training: learning is global (find f_i(x) s.t. y = inference over the f_i(x)); evaluation is global (h(x) = inference over the f_i(x)). Less efficient learning.