Title: An Introduction to Support Vector Machine Classification
1. An Introduction to Support Vector Machine Classification
Bioinformatics Lecture 7/2/2003
by Pierre Dönnes
2. Outline
- What do we mean by classification, and why is it useful
- Machine learning: basic concepts
- Support Vector Machines (SVM)
- Linear SVM: basic terminology and some formulas
- Non-linear SVM: the kernel trick
- An example: predicting protein subcellular location with SVM
- Performance measurements
3. Classification
- Every day, all the time, we classify things.
- E.g. crossing the street:
- Is there a car coming?
- At what speed?
- How far is it to the other side?
- Classification: safe to walk or not!
4. Decision tree learning
- IF (Outlook = Sunny) AND (Humidity = High)
- THEN PlayTennis = No
- IF (Outlook = Sunny) AND (Humidity = Normal)
- THEN PlayTennis = Yes
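As a minimal sketch, the two rules above can be written directly as code. The function name, argument names, and the "Unknown" fallback are illustrative additions, not part of the original slide.

```python
def play_tennis(outlook: str, humidity: str) -> str:
    """Encode the two decision-tree rules from the slide."""
    if outlook == "Sunny" and humidity == "High":
        return "No"
    if outlook == "Sunny" and humidity == "Normal":
        return "Yes"
    return "Unknown"  # the slide only gives rules for sunny days

print(play_tennis("Sunny", "High"))    # No
print(play_tennis("Sunny", "Normal"))  # Yes
```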
5. Classification tasks in Bioinformatics
- Learning task
- Given: expression profiles of leukemia patients and healthy persons.
- Compute: a model that distinguishes, from expression data, whether a person has leukemia.
- Classification task
- Given: the expression profile of a new patient and a learned model.
- Determine: whether the patient has leukemia or not.
6. Problems in classifying biological data
- Often high-dimensional data.
- Hard to write down simple rules.
- Large amounts of data.
- Need automated ways to deal with the data.
- Use computers: data processing, statistical analysis, trying to learn patterns from the data (machine learning).
7. Examples of machine learning methods
- Support Vector Machines
- Artificial Neural Networks
- Boosting
- Hidden Markov Models
8. Black-box view of machine learning
[Figure: training data go into a magic black box (the learning machine), which produces a model.]
Training data: expression patterns of some cancer and expression data from healthy persons.
Model: the model can distinguish between healthy and sick persons, and can be used for prediction.
9. Tennis example 2
[Figure: scatter plot of Temperature vs. Humidity, with points labelled "play tennis" and "do not play tennis".]
10. Linear Support Vector Machines
Data: <x_i, y_i>, i = 1, ..., l, where x_i ∈ R^d and y_i ∈ {-1, +1}.
[Figure: the two classes (+1 and -1) plotted in the (x1, x2) plane.]
11. Linear SVM 2
Data: <x_i, y_i>, i = 1, ..., l, where x_i ∈ R^d and y_i ∈ {-1, +1}.
All hyperplanes in R^d are parameterized by a vector w and a constant b, and can be expressed as w·x + b = 0 (remember the equation for a hyperplane from algebra!). Our aim is to find a hyperplane f(x) = sign(w·x + b) that correctly classifies our data.
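As a minimal, hedged sketch (the toy data and variable names below are illustrative, not from the lecture), scikit-learn's SVC with a linear kernel can be used to fit such a hyperplane and read off w and b:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: three +1 points and three -1 points.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates the hard-margin case
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]   # hyperplane parameters w and b
f = lambda x: np.sign(w @ x + b)         # decision function f(x) = sign(w.x + b)
print(w, b, f(np.array([2.2, 2.4])))     # classify a new point
```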
12. Definitions
Define the hyperplane H such that:
x_i·w + b ≥ +1 when y_i = +1
x_i·w + b ≤ -1 when y_i = -1
H1 and H2 are the planes:
H1: x_i·w + b = +1
H2: x_i·w + b = -1
The points on the planes H1 and H2 are the support vectors.
d+ is the shortest distance from H to the closest positive point.
d- is the shortest distance from H to the closest negative point.
The margin of a separating hyperplane is d+ + d-.
13. Maximizing the margin
We want a classifier with as large a margin as possible.
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A^2 + B^2).
The distance between H and H1 is |w·x + b| / ||w|| = 1 / ||w||.
The distance between H1 and H2 is 2 / ||w||.
In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
x_i·w + b ≥ +1 when y_i = +1
x_i·w + b ≤ -1 when y_i = -1
These two conditions can be combined into y_i(x_i·w + b) ≥ 1.
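Restated in standard notation (a hedged restatement of the slide, using the usual equivalent objective: minimizing ½||w||² is the same as minimizing ||w|| and is easier to differentiate), the optimization problem is:

```latex
% Hard-margin primal problem implied by the slide
\min_{w,\,b} \;\; \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(x_i \cdot w + b) \ge 1, \qquad i = 1, \dots, l.
```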
14. The Lagrangian trick
Reformulate the optimization problem. A trick often used in optimization is a Lagrangian formulation of the problem: the constraints are replaced by constraints on the Lagrange multipliers, and the training data occur only as dot products.
This gives us the task:
Max L_D = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
subject to: w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
What we need to see: x_i and x_j (the input vectors) appear only in the form of dot products; we will soon see why that is important.
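As a hedged sketch of why this matters in practice (the toy data and variable names are illustrative): once the matrix of all pairwise dot products (the Gram matrix) is computed, the classifier can be trained without ever touching the raw vectors again, e.g. via scikit-learn's precomputed-kernel mode, and w can be recovered as Σ α_i y_i x_i from the dual coefficients.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative toy data (same kind of problem as in the slides).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

gram = X @ X.T                                  # all dot products x_i . x_j
clf = SVC(kernel="precomputed").fit(gram, y)    # training sees only dot products

# Recover w = sum_i alpha_i y_i x_i; sklearn stores alpha_i * y_i
# for the support vectors in dual_coef_.
w = (clf.dual_coef_ @ X[clf.support_]).ravel()
print(w)
```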
15. Problems with linear SVM
[Figure: data where the +1 and -1 classes cannot be separated by a straight line.]
What if the decision function is not linear?
16. Non-linear SVM 1
The kernel trick: imagine a function Φ that maps the data into another space, Φ: R^d → Ω.
[Figure: in R^d the +1 and -1 classes are not linearly separable; after mapping them with Φ into Ω they can be separated by a hyperplane.]
17. Non-linear SVM 2
The function we end up optimizing is:
Max L_D = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
subject to: w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
Another kernel example: the polynomial kernel K(x_i, x_j) = (x_i · x_j + 1)^p, where p is a tunable parameter. Evaluating K requires only one addition and one exponentiation more than the original dot product.
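A minimal sketch of this kernel in code (the XOR-style data, the C value, and the helper name are illustrative assumptions): scikit-learn's built-in polynomial kernel matches the formula above when gamma=1 and coef0=1.

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel(xi, xj, p=2):
    """Polynomial kernel from the slide: K(xi, xj) = (xi . xj + 1)^p."""
    return (np.dot(xi, xj) + 1) ** p

# XOR-like data that no linear SVM can separate.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([-1, -1, 1, 1])

# degree=2, gamma=1, coef0=1 reproduces (xi . xj + 1)^2; C=10 is large
# enough here to behave essentially like the hard-margin case.
clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=10).fit(X, y)
print(clf.predict(X))   # recovers the XOR labels
```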
18. Solving the optimization problem
- In many cases, any general-purpose optimization package that handles linearly constrained problems will do (see the code sketch below).
- Newton's method
- Conjugate gradient descent
- Other methods involve nonlinear programming techniques.
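As a hedged illustration of this point (the toy data, the 1e-6 threshold, and the choice of SciPy's SLSQP solver are my own assumptions), the dual problem from slide 14 can be handed to a general-purpose optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, linearly separable toy set (illustrative).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

G = (y[:, None] * X) @ (y[:, None] * X).T        # G[i, j] = y_i y_j x_i . x_j

def neg_dual(alpha):
    # Negative of L_D = sum(alpha) - 1/2 alpha^T G alpha (we minimize).
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),                        # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                  # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                    # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)       # b from the support-vector conditions
print(w, b)
```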
19. Overtraining/overfitting
A well-known problem with machine learning methods is overtraining. This means that we have learned the training data very well, but we cannot classify unseen examples correctly.
An example: a botanist who really knows trees. Every time he sees a new tree, he claims it is not a tree.
20. Overtraining/overfitting 2
A measure of the risk of overtraining with SVM (there are also other measures):
It can be shown that the fraction, n, of unseen data that will be misclassified is bounded by
n ≤ (number of support vectors) / (number of training examples)
(see the code sketch below).
Ockham's razor principle: simpler systems are better than more complex ones. In the SVM case, fewer support vectors mean a simpler representation of the hyperplane.
Example: understanding a certain cancer is easier if it can be described by one gene than if we have to describe it with 5000.
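A minimal sketch of computing this bound from a trained model (the synthetic dataset and its parameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative synthetic data; any labelled training set would do.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = SVC(kernel="linear").fit(X, y)
n_sv = len(clf.support_)     # number of support vectors
bound = n_sv / len(y)        # bound from the slide: n <= #SV / #training examples
print(f"{n_sv} support vectors -> error bound {bound:.2f}")
```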
21. A practical example: protein localization
- Proteins are synthesized in the cytosol.
- They are transported into different subcellular locations, where they carry out their functions.
- Aim: to predict in which location a certain protein will end up!
22. Subcellular Locations
23. Method
- Hypothesis: the amino acid composition of proteins from different compartments should differ.
- Extract proteins with known subcellular location from SWISSPROT.
- Calculate the amino acid composition of the proteins.
- Try to differentiate between cytosolic, extracellular, mitochondrial, and nuclear proteins using SVM.
24. Input encoding
Prediction of nuclear proteins: label the known nuclear proteins as +1 and all others as -1. The input vector x_i represents the amino acid composition, e.g. x_i = (4.2, 6.7, 12, ..., 0.5) for the amino acids (A, C, D, ..., Y).
[Figure: the nuclear proteins and all other proteins are fed to the SVM, which produces a model.]
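A minimal sketch of this encoding (the example sequence and function name are illustrative, not from the study): each protein is reduced to a 20-dimensional vector of amino acid percentages.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids, A..Y

def composition(sequence):
    """Percentage of each amino acid in the sequence (one 20-dim vector)."""
    counts = Counter(sequence)
    return [100.0 * counts[aa] / len(sequence) for aa in AMINO_ACIDS]

# Illustrative sequence fragment, not a real protein from the study.
x_i = composition("MKWVTFISLLFLFSSAYS")
print(len(x_i), [round(v, 1) for v in x_i[:4]])   # 20-dimensional SVM input
```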
25. Cross-validation
Cross-validation: split the data into n sets, train on n-1 sets, and test on the set left out of training.
[Figure: the nuclear and "all others" data are each split into sets 1, 2, and 3; in each round one set is the test set and the remaining two form the training set.]
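A hedged sketch of the same idea with scikit-learn (the synthetic data stand in for the nuclear / all-others sets; n = 3 folds as in the figure):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Illustrative stand-in for the nuclear vs. all-others data.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 3-fold cross-validation: train on 2 sets, test on the held-out set.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=3)
print(scores, scores.mean())
```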
26. Performance measurements
[Figure: test data are run through the model; the +1/-1 predictions are compared with the true labels, giving true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).]
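A minimal sketch of extracting these counts from predictions (the label vectors below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative true labels and model predictions (+1 / -1).
y_true = np.array([ 1,  1, -1, -1,  1, -1,  1, -1])
y_pred = np.array([ 1, -1, -1,  1,  1, -1,  1, -1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
sensitivity = tp / (tp + fn)    # fraction of true positives recovered
specificity = tn / (tn + fp)    # fraction of true negatives recovered
print(tp, fp, tn, fn, sensitivity, specificity)
```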
27. Results
- We definitely get some predictive power out of our models.
- There seems to be a difference in composition between proteins from different subcellular locations.
- Another question: what about nuclear proteins? Is there a difference between DNA-binding proteins and others?
28. Conclusions
- We have (hopefully) learned some basic concepts and terminology of SVMs.
- We know about the risk of overtraining and how to put a measure on the risk of bad generalization.
- SVMs can be useful, for example, in predicting the subcellular location of proteins.
29. You can't input just anything into a learning machine!
Image classification of tanks: autofire when an enemy tank is spotted. Input data: photos of own and enemy tanks. It worked really well with the training set used; in reality it failed completely.
Reason: all enemy tank photos were taken in the morning, and all photos of own tanks at dusk. The classifier had simply learned to tell dusk from dawn!
30. References
http://www.kernel-machines.org/
http://www.support-vector.net/
Papers by Vapnik
C. J. C. Burges: A tutorial on Support Vector Machines. Data Mining and Knowledge Discovery 2:121-167, 1998.