Title: Max-margin sequential learning methods
1. Max-margin sequential learning methods
2. Announcements
- Upcoming assignments
- Wed 3/3: project proposal due
- 1-2 pages per person
- Spring break next week, no class
- Will get feedback on project proposals by the end of break
- Write-ups for the Distance Metrics for Text week are due Wed 3/17, not the Monday after spring break
3. Collins paper
- Notation
- label (y) is a tag t
- observation (x) is a word w
- history h is a 4-tuple ⟨t_{i-1}, t_{i-2}, w_{[1:n]}, i⟩
- φ_s(h, t) is a feature of the pair (h, t)
4. Collins paper
- Notation, continued
- Φ_s is the sum of φ_s over all positions i (spelled out below)
- α_s is the weight given to Φ_s
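Putting the notation together, the global features and the decoding objective can be written as follows; this is a reconstruction from the definitions above, with indexing as in Collins' paper:

```latex
\Phi_s(w_{[1:n]}, t_{[1:n]}) = \sum_{i=1}^{n} \phi_s(h_i, t_i),
\qquad
t^{*}_{[1:n]} = \arg\max_{t_{[1:n]}} \sum_{s} \alpha_s \, \Phi_s(w_{[1:n]}, t_{[1:n]})
```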
5. Collins paper
6. The theory
Claim 1: the algorithm is an instance of this perceptron variant.
Claim 2: the arguments in the mistake-bound classification results of FS99 extend immediately to this ranking task as well.
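To make the perceptron variant behind Claim 1 concrete, here is a minimal runnable sketch of a Collins-style structured perceptron for tagging. The feature set (one-hot word/tag emissions plus tag-bigram transitions) and the toy data are illustration-only assumptions, not the paper's experimental setup.

```python
import numpy as np

N_WORDS, N_TAGS = 5, 3

def feat_index(kind, a, b):
    # Flat index for emission (word, tag) and transition (prev_tag, tag) features.
    if kind == "emit":
        return a * N_TAGS + b
    return N_WORDS * N_TAGS + a * N_TAGS + b

N_FEATS = N_WORDS * N_TAGS + N_TAGS * N_TAGS

def phi(words, tags):
    # Global feature vector Phi(x, y): sum of local features over positions i.
    v = np.zeros(N_FEATS)
    for i, (w, t) in enumerate(zip(words, tags)):
        v[feat_index("emit", w, t)] += 1
        if i > 0:
            v[feat_index("trans", tags[i - 1], t)] += 1
    return v

def viterbi(words, alpha):
    # Highest-scoring tag sequence under the current weights alpha.
    n = len(words)
    score = np.zeros((n, N_TAGS))
    back = np.zeros((n, N_TAGS), dtype=int)
    for t in range(N_TAGS):
        score[0, t] = alpha[feat_index("emit", words[0], t)]
    for i in range(1, n):
        for t in range(N_TAGS):
            cand = [score[i - 1, p] + alpha[feat_index("trans", p, t)]
                    for p in range(N_TAGS)]
            back[i, t] = int(np.argmax(cand))
            score[i, t] = max(cand) + alpha[feat_index("emit", words[i], t)]
    tags = [int(np.argmax(score[-1]))]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i, tags[-1]]))
    return tags[::-1]

def train(data, epochs=5):
    alpha = np.zeros(N_FEATS)
    for _ in range(epochs):
        for words, gold in data:
            guess = viterbi(words, alpha)
            if guess != gold:
                # Perceptron update: promote gold features, demote guessed ones.
                alpha += phi(words, gold) - phi(words, guess)
    return alpha

data = [([0, 1, 2], [0, 1, 2]), ([2, 3, 4], [2, 1, 0])]
alpha = train(data)
print(viterbi([0, 1, 2], alpha))
```

The update promotes the features of the gold sequence and demotes those of the Viterbi guess; this is the ranking-style perceptron step the two claims analyze.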
8. FS99 algorithm
9. FS99 result
10. Collins result
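Slides 8-10 are figures. The result they build to, restated here from memory of Collins (2002), Theorem 1, adapted from Freund & Schapire 1999, so check the paper for the exact statement: if some unit-norm U separates the data with margin δ, and R bounds ‖Φ(x_i, y_i) − Φ(x_i, z)‖ over all candidates z, then

```latex
\exists U,\ \lVert U \rVert = 1:\;
U \cdot \big(\Phi(x_i, y_i) - \Phi(x_i, z)\big) \ge \delta
\;\;\forall i,\ \forall z \ne y_i
\quad\Longrightarrow\quad
\#\text{mistakes} \;\le\; \frac{R^2}{\delta^2}
```

Notably, the bound is independent of the (exponential) number of candidate sequences z per example.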
11. Results
- Two experiments
- POS tagging, using Adwait Ratnaparkhi's features
- NP chunking (Start, Continue, Outside tags)
- NER on a special AT&T dataset (from another paper)
12. Features for NP chunking
13. Results
14. More ideas
- The dual version of a perceptron
- w is built up by repeatedly adding examples, so w is a weighted sum of the examples x_1, ..., x_n
- the inner product ⟨w, x⟩ can therefore be rewritten (see the derivation below)
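Spelling out that rewrite (a standard identity, with α_i the signed count of how often example x_i was added to w):

```latex
w = \sum_{i=1}^{n} \alpha_i x_i
\quad\Longrightarrow\quad
\langle w, x \rangle
= \Big\langle \sum_{i=1}^{n} \alpha_i x_i,\; x \Big\rangle
= \sum_{i=1}^{n} \alpha_i \langle x_i, x \rangle
```

The payoff is that ⟨x_i, x⟩ can be replaced by a kernel K(x_i, x), so the perceptron can work in an implicit feature space without ever forming w explicitly.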
15. Dual version of perceptron ranking
α_{i,j}, where i ranges over examples and j over the candidate (correct or incorrect) tag sequences for example i; a sketch follows.
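Here is a minimal runnable sketch of that dual ranking perceptron in the re-ranking setting: each training example comes with a candidate list (standing in for n-best tagger output), and α_{i,j} is the signed weight on candidate j of example i. The data layout and the plain dot-product kernel are illustration-only assumptions.

```python
import numpy as np

def score(x, alpha, candidates, kernel=np.dot):
    # <w, x> = sum_{i,j} alpha[i][j] <h_{i,j}, x>; w is never formed explicitly.
    return sum(a * kernel(h, x)
               for cand_list, alphas in zip(candidates, alpha)
               for h, a in zip(cand_list, alphas))

def train_rank(candidates, correct, epochs=10):
    # candidates[i]: feature vectors for example i's candidate tag sequences;
    # correct[i]: index of the correct candidate for example i.
    alpha = [np.zeros(len(c)) for c in candidates]
    for _ in range(epochs):
        for i, cands in enumerate(candidates):
            best = max(range(len(cands)),
                       key=lambda j: score(cands[j], alpha, candidates))
            if best != correct[i]:
                alpha[i][correct[i]] += 1.0  # promote the correct sequence
                alpha[i][best] -= 1.0        # demote the top-ranked mistake
    return alpha

# Tiny made-up example: two examples, two candidates each.
cands = [[np.array([1.0, 0.0]), np.array([0.0, 1.0])],
         [np.array([1.0, 1.0]), np.array([0.0, 2.0])]]
alpha = train_rank(cands, correct=[1, 1])
print(alpha)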
16. NER features for re-ranking MAXENT tagger output
17. NER features
18. NER results
19. Altun et al. paper
- Starting point: the dual version of Collins' perceptron algorithm
- the final hypothesis is a weighted sum of inner products with a subset of the examples
- this is a lot like an SVM, except that the perceptron algorithm is used to set the weights rather than quadratic optimization
20. SVM optimization
- Notation
- y_i is the correct tag sequence for x_i
- y is an incorrect tag sequence
- F(x_i, y_i) is the feature vector
- Optimization problem
- find weights w that maximize the minimal margin while constraining ||w|| = 1, or
- minimize ||w||² such that every margin is ≥ 1 (written out below)
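Written out, the second form is the usual hard-margin program over ranking constraints (a standard formulation; (14)/(15) on the next slides refer to the paper's own variants):

```latex
\min_{w}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{s.t.} \quad
\langle w,\; F(x_i, y_i) - F(x_i, y) \rangle \;\ge\; 1
\qquad \forall i,\ \forall y \ne y_i
```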
21. SVMs for ranking
22. SVMs for ranking
Proposition: (14) and (15) are equivalent.
23. SVMs for ranking
A binary classification problem with (x_i, y_i) as the positive example and the pairs (x_i, y) as negative examples, except that the threshold θ_i varies for each example. Why? Because we're ranking: scores only need to separate the correct sequence from the incorrect ones within the same example, not across examples.
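One plausible way to write that per-example threshold (a reconstruction; the exact constants in the paper may differ): classify F(x_i, y_i) as positive and each F(x_i, y) as negative relative to a threshold θ_i that is free per example:

```latex
\langle w, F(x_i, y_i) \rangle \;\ge\; \theta_i + 1,
\qquad
\langle w, F(x_i, y) \rangle \;\le\; \theta_i - 1
\quad \forall y \ne y_i
```

Subtracting the two eliminates θ_i and recovers the pure ranking constraint ⟨w, F(x_i, y_i) − F(x_i, y)⟩ ≥ 2, which is why the threshold can vary freely across examples.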
24. SVMs for ranking
- Altun et al. work out the remaining details
- as with perceptron learning, negative data is found by running Viterbi with the learned weights and looking for errors (see the sketch after this list)
- each mistake is a possible new support vector
- need to iterate over the data repeatedly
- could take exponential time before convergence if the support vectors are dense...
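A runnable sketch of that loop, reusing phi(), viterbi(), N_FEATS, and data from the perceptron sketch earlier. Everything specific here is a simplifying assumption: the mined constraints are mirrored into a binary problem over difference vectors and re-solved with scikit-learn's LinearSVC, rather than with the paper's dual optimizer.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_svm_struct(data, rounds=5, C=1.0):
    alpha = np.zeros(N_FEATS)
    X, y = [], []  # working set of constraints, kept across rounds
    for _ in range(rounds):
        new_constraints = 0
        for words, gold in data:
            guess = viterbi(words, alpha)      # negative mining via decoding
            if guess != gold:
                d = phi(words, gold) - phi(words, guess)
                X.extend([d, -d])              # each mistake: candidate support vector
                y.extend([1, -1])
                new_constraints += 1
        if new_constraints == 0:
            break                              # no violated constraints remain
        svm = LinearSVC(C=C, fit_intercept=False)
        svm.fit(np.array(X), np.array(y))      # re-solve on the current working set
        alpha = svm.coef_.ravel()
    return alpha

alpha = train_svm_struct(data)
print(viterbi(data[0][0], alpha))
```

Caching the difference vectors across rounds is what makes each mistake a persistent candidate support vector, and the repeated decode-then-retrain passes are the iteration the slide warns about.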
25. Altun et al. results
- NER on 300 sentences from CoNLL2002 shared task
- Spanish
- Four entity types, nine labels (beginning-T, intermediate-T, other)
- POS tagging on 300 sentences from the Penn TreeBank
- 5-fold CV, window of size 3, simple features
26. Altun et al. results
27. Altun et al. results