1
Support Vector Machines
2
Part I
  • Introduction
  • Supervised learning
  • Input/output and hyperplanes
  • Support vectors
  • Optimisation
  • Key ideas
  • Input space → feature space
  • Kernels
  • Overfitting

3
Part II
  • Fast and Accurate Part-of-Speech Tagging: The
    SVM Approach Revisited
  • (Jesús Giménez and Lluís Màrquez)
  • Problem Setting
  • Experiments, Results
  • Conclusion

4
1. Supervised Learning
  • Learning from examples
  • Training set: pairs of input data labelled with
    the output
  • e.g. input = word, output = tag → <word/tag>
  • Target function: a mapping from input data to
    output, i.e. classifying <word> correctly as
    <tag>
  • Task: approximate this mapping
  • → solution / decision function

5
1. Supervised Learning
  • The algorithm may select from a set of possible
    solutions
  • → hypothesis space
  • If the hypotheses, i.e. the outputs, are
  • - binary → binary classification task
  • - finite → multiclass classification task
  • - real-valued → regression
  • Correctly classifying new data: generalisation

6
Support Vector Machines (SVMs)
  • Hypothesis space of linear functions (linear
    separator)
  • Training data x ∈ R^n (n-dimensional vectors)
  • Labels/classes d_i (i different classes)
  • Training set L = {(x_1,d_1),...,(x_M,d_M)}
    = {(x_m,d_m) | m = 1,...,M}
  • Binary classification: function f, input x_m
  • if f(x_m) > 0 → positive class, d = 1
  • if f(x_m) < 0 → negative class, d = -1
  • (see the sketch below)
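As a concrete illustration of this decision rule, here is a minimal Python sketch; the weight vector w and bias b are made-up values, not taken from the presentation:

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = -0.5                    # hypothetical bias

def f(x):
    # linear separator: f(x) = w^T x + b
    return w @ x + b

def classify(x):
    # d = +1 if f(x) > 0, d = -1 if f(x) < 0
    return 1 if f(x) > 0 else -1

print(classify(np.array([1.0, 0.0])))   # f = 1.5  -> +1
print(classify(np.array([0.0, 1.0])))   # f = -1.5 -> -1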

7
2. SVM Linear Separator
  • f(x) separates instances → hyperplane H
  • H = H(w,b) = { x ∈ R^n | w^T x + b = 0 }
  • with w ∈ R^n and a bias b ∈ R
  • For w^T x + b ≥ 1: d_m = 1
  • For w^T x + b ≤ -1: d_m = -1
  • ⇒ d_m (w^T x + b) ≥ 1
  • (normalised, combined constraint)

8
3. Margin of Separation
  • Geometric margin r between x and H:
  • r = (w^T x + b) / ||w||
  • Margin of separation µ_L:
  • µ_L(w,b) = min_{m=1,...,M} (w^T x_m + b) / ||w||
    (computed in the sketch below)
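A minimal sketch of these two formulas, with assumed (made-up) values for w, b and a toy training set:

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical hyperplane parameters
b = -0.5

def geometric_margin(x):
    # r = (w^T x + b) / ||w||
    return (w @ x + b) / np.linalg.norm(w)

# toy training set with labels d_m in {+1, -1}
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
d = np.array([1, -1, 1])

# margin of separation: smallest label-signed distance over the training set
mu_L = min(dm * geometric_margin(x) for x, dm in zip(X, d))
print(mu_L)   # ~0.67 for these made-up values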

9
(Figure: the separating hyperplane H; diagram only, no transcript text)
10
3. Margin of Separation
  • The larger µ, the more robust H is
  • If µ is maximal in all directions, then there exist
    positive and negative instances lying exactly on the
    margin
  • → Support Vectors (of H with respect to L)
  • normalised distance r_pos of the positive support
    vectors from H: 1 / ||w||
  • normalised distance r_neg of the negative support
    vectors from H: -1 / ||w||

11
4. Optimisation
  • Binary classification: finding a hyperplane that
    separates pos. and neg. instances (decision
    function)
  • → finding the optimal hyperplane!
  • No instances in the (2 / ||w||)-wide band around H
  • → maximise 2 / ||w||

12
4. Optimisation
  • i.e. minimise
  • 0.5 w^T w
  • subject to the constraint
  • d_m (w^T x_m + b) ≥ 1
  • How? (one practical route is sketched below)
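The slides answer "How?" with Lagrange multipliers on the following slides. In practice the same quadratic program is usually handed to an off-the-shelf solver. A minimal sketch using scikit-learn (an assumption of this write-up, not a tool mentioned in the presentation), where a very large C approximates the hard-margin problem on separable toy data:

import numpy as np
from sklearn.svm import SVC

# linearly separable toy data
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
d = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard margin
clf.fit(X, d)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)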

13
4. Lagrange Multiplier
  • Calculate the saddle point of a function which has
    to satisfy a certain constraint
  • Introduce (pos. real-valued) Lagrange multipliers α
    and minimise the function J
  • Q(α) = J(w,b,α), subject to d_m (w^T x_m + b) ≥ 1
  • such that
  • J(w*, α) ≤ J(w*, α*) ≤ J(w, α*)  (saddle point)
  • Solve and find the optimal w

14
4. Lagrange Multiplier
  • The optimal w is a linear combination of the
    training set L:
  • w = Σ_{m=1...M} α_m d_m x_m
  • but α_m > 0 only for d_m (w^T x_m + b) - 1 = 0,
    i.e. for the support vectors
  • → the optimal w is a linear combination of the
    support vectors of L

15
4. Lagrange Multiplier
  • Q(α) =
  • 0.5 w^T w - Σ_{m=1...M} α_m (d_m (w^T x_m + b) - 1)
  • = -0.5 Σ_{m=1...M} Σ_{n=1...M} α_m α_n d_m d_n
    x_m^T x_n + Σ_{m=1...M} α_m
  • (only the dot/scalar product of inputs appears in
    the equation)

16
5. a) Feature Space
  • If not linearly separable (e.g. XOR in 2D):
  • project to a higher-dimensional space → feature
    space
  • Φ: R^n → R^N  (n lower dim., N higher dim.)
  • Input space R^n, feature space R^N

17
5. a) Feature Space
  • Instead of L = {(x_m,d_m) | m = 1,...,M}
  • → L = {(Φ(x_m),d_m) | m = 1,...,M}
  • Also for the optimisation problem:
  • Q_Φ(α) =
  • -0.5 Σ_m Σ_n α_m α_n d_m d_n Φ(x_m)^T Φ(x_n) + Σ_m α_m
  • (only the dot product ⟨Φ(x_m), Φ(x_n)⟩ appears!)

18
5. b) Kernel Functions
  • When Φ: R^n → R^N,
  • then the kernel K: R^n × R^n → R
  • computes the dot products:
  • K_Φ(x_1, x_2) = ⟨Φ(x_1), Φ(x_2)⟩
  • Find a K computable via a less complex function k,
    e.g.
  • K(x,y) = k(x^T y)

19
5. b) Kernel Functions
  • E.g.
  • Φ: R^2 → R^4, Φ(x) = Φ((x_1, x_2))
  • Φ(x) = (x_1^2, x_1 x_2, x_2 x_1, x_2^2)
  • k_Φ(x^T y) = ?  (worked out in the sketch below)
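For this mapping the answer is that k simply squares the input-space dot product, since ⟨Φ(x),Φ(y)⟩ = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x^T y)^2. A minimal sketch verifying this numerically:

import numpy as np

def phi(x):
    # explicit mapping Phi: R^2 -> R^4 from the slide
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])

def k(x, y):
    # kernel computed entirely in the input space R^2
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # 1.0  (dot product in feature space R^4)
print(k(x, y))           # 1.0  (same value, no explicit mapping needed)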

20
6. Overfitting
  • w becomes too complex, the data is modelled too
    closely
  • Allow for errors (the data is noisy anyway),
  • otherwise generalisation becomes poor
  • Soft margin: s = 0.5 w^T w + C Σ_m ξ_m
  • New constraint on the function:
  • d_m (w^T x_m + b) - (1 - ξ_m) ≥ 0
  • ⇔ d_m (w^T x_m + b) ≥ 1 - ξ_m
  • (the slack values ξ_m are illustrated below)
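A minimal sketch of the slack each instance needs under the soft-margin constraint, i.e. ξ_m = max(0, 1 - d_m (w^T x_m + b)), with made-up values for w, b and C:

import numpy as np

w, b, C = np.array([1.0, 1.0]), -1.0, 10.0            # made-up parameters
X = np.array([[2.0, 1.0], [0.5, 0.2], [-1.0, 0.0]])   # toy instances
d = np.array([1, 1, -1])                              # their labels

xi = np.maximum(0.0, 1.0 - d * (X @ w + b))   # slack needed per instance
objective = 0.5 * (w @ w) + C * xi.sum()      # soft-margin objective s
print(xi, objective)                          # [0.  1.3 0. ] 14.0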

21
Part II
  • Fast and Accurate Part-of-Speech Tagging: The
    SVM Approach Revisited
  • (Jesús Giménez and Lluís Màrquez)
  • Problem Setting
  • Experiments, Results
  • Conclusion

22
1. Problem Setting
  • Tagging is a multiclass classification task
  • → binarise it by training one SVM per class:
  • learn to distinguish the current class (i.e.
    <tag>) from the rest
  • Restrict the classes/tags using a lexicon and use
    only the word's other possible classes/tags as
    negative instances for the current one
  • When tagging, choose the most confident tag out of
    all binary SVM predictions for <word>,
  • e.g. the tag with the greatest distance to the
    separator (sketched below)
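A minimal sketch of this one-vs-rest decision; the per-tag weight vectors and the two-dimensional feature vector are hypothetical, not taken from the paper:

import numpy as np

# hypothetical per-tag models (w, b), restricted to the word's possible tags
models = {
    "NN": (np.array([0.8, -0.2]),  0.1),
    "VB": (np.array([-0.5, 0.9]),  0.0),
    "JJ": (np.array([0.1,  0.3]), -0.4),
}

def predict_tag(x):
    # distance of the feature vector x to each tag's separator; the largest wins
    scores = {tag: (w @ x + b) / np.linalg.norm(w) for tag, (w, b) in models.items()}
    return max(scores, key=scores.get)

print(predict_tag(np.array([1.0, 0.5])))   # -> "NN" for this toy feature vector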

23
1. Problem Setting
  • Coding of features

24
1. Problem Setting
  • Features evaluate to binary values,
  • e.g. bigram previous_word_is_XXX = true/false
  • Context set to a seven-token window
  • When tagging, the right-hand tags are not yet known
    → use the ambiguity class (a tag built out of the
    token's possible tag combinations) instead
  • Explicit n-grams only need to be included when
    linear kernels are used,
  • i.e. either a higher-dimensional feature vector or
    a higher-dimensional kernel (see the sketch below)
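A minimal sketch of such a binary feature encoding over a seven-token window; the feature names and the ambiguity-class handling are illustrative assumptions, not the paper's exact feature set:

def window_features(words, tags, amb_classes, i):
    # binary (present/absent) features for the token at position i
    feats = set()
    for offset in range(-3, 4):                      # seven-token window
        j = i + offset
        if 0 <= j < len(words):
            feats.add(f"word[{offset}]={words[j]}")
            if offset < 0:                           # left context: tags already assigned
                feats.add(f"tag[{offset}]={tags[j]}")
            elif offset > 0:                         # right context: ambiguity class only
                feats.add(f"amb[{offset}]={amb_classes[j]}")
    if i > 0:                                        # an explicit bigram feature
        feats.add(f"bigram[-1,0]={words[i-1]}_{words[i]}")
    return feats

words = ["the", "old", "man", "the", "boats"]
tags  = ["DT", "JJ", None, None, None]               # left tags known at tagging time
amb   = ["DT", "JJ_NN", "NN_VB", "DT_NN", "NNS_VBZ"] # possible tags per token
print(sorted(window_features(words, tags, amb, 2)))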

25
2. Experiments
  • Corpus: Penn Treebank III (1.17 million words)
  • Corpus divided into:
  • training set (60%)
  • validation, i.e. parameter optimisation (20%)
  • test set (20%)
  • Tagset with 48 tags; only the 34 ambiguous ones are
    used as in 1., i.e. 34 SVMs; the rest are unambiguous

26
2. Experiments
  • Linear vs. polynomial kernels:
  • test various kernels according to their degree d,
  • each with the default C parameter,
  • features filtered by their number of occurrences n

27
(No Transcript)
28
2. Experiments
  • For feature set 1, a degree-2 polynomial kernel is
    best
  • higher degrees lead to overfitting:
  • more support vectors, lower accuracy
  • For feature set 2 (incl. n-grams), a linear kernel
    is best, even better than the degree-2 kernel:
  • fewer support vectors (sparse) and 3 times faster!
  • → preferable to extend the feature set with n-grams
    and use a linear kernel

29
2. Experiments - Results
Accuracy (%)   Ambiguous words   Overall
TnT            92.64             97.27
SVM-tagger     94.35             97.74
  • Linear kernel
  • greedy left-to-right tagging with no optimisation
    at the sentence level
  • closed-vocabulary assumption
  • → performance in accuracy (compared to the
    state-of-the-art HMM-based tagger TnT) is shown in
    the table above

30
2. Experiments
  • Include unknown words:
  • treat an unknown word as an ambiguous word with all
    open word classes as possible tags (18)
  • use a feature template (sketched below), e.g.
  • all upper/lower case: yes/no
  • contains capital letters: yes/no
  • contains a period/number/...: yes/no
  • suffixes: s1, s1s2, s1s2s3, ...
  • AND all features for known words
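A minimal sketch of an unknown-word feature template along these lines; the feature names are illustrative, not the paper's exact template:

def unknown_word_features(word, max_suffix=4):
    # orthographic yes/no features plus suffix features s1, s1s2, s1s2s3, ...
    feats = {
        "all_upper": word.isupper(),
        "all_lower": word.islower(),
        "contains_capital": any(c.isupper() for c in word),
        "contains_period": "." in word,
        "contains_number": any(c.isdigit() for c in word),
    }
    for n in range(1, min(max_suffix, len(word)) + 1):
        feats[f"suffix={word[-n:]}"] = True
    return feats

print(unknown_word_features("Weather-proofing"))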

31
2. Experiments - Results
Accuracy (%)   Ambiguous   Known   Unknown   All
TnT            92.2        96.9    84.6      96.5
SVM-tagger     93.6        97.2    83.5      96.9
SVM-tagger     94.1        97.3    83.6      97.0
32
2. Experiments - Results
  • SVM-tagger, implemented in Perl:
  • tagging speed of 1,335 words/sec
  • maybe faster in C?!
  • TnT:
  • speed of 50,000 words/sec

33
3. Conclusion
  • A state-of-the-art NLP tool suited for real
    applications
  • represents a good balance of:
  • simplicity
  • flexibility (not domain-specific)
  • high performance
  • efficiency

34
3. Future Work
  • Experiment with and improve the learning model for
    unknown words
  • Implement in C
  • Include probabilities of the whole sentence's tag
    sequence in the tagging scheme
  • Simplify the model, i.e. the decision function /
    hyperplane based on w
  • (accuracy is hardly worse with up to 70% of w's
    dimensions discarded; how? one possible reading is
    sketched below)
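One possible reading of "discarding dimensions", sketched here as an assumption (the presentation itself leaves the "how?" open), is to zero out the components of w with the smallest magnitude and keep only the top 30%:

import numpy as np

def prune_weights(w, keep_fraction=0.3):
    # keep only the keep_fraction largest-magnitude components of w, zero the rest
    k = max(1, int(round(keep_fraction * w.size)))
    keep = np.argsort(np.abs(w))[-k:]
    w_pruned = np.zeros_like(w)
    w_pruned[keep] = w[keep]
    return w_pruned

w = np.array([0.05, -1.2, 0.7, 0.01, -0.3, 0.9, 0.02, -0.04, 0.6, 0.1])
print(prune_weights(w))   # 7 of the 10 dimensions are set to zero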

35
Questions & Discussion