Title: Support Vector Machines Classification (Venables & Ripley, Section 12.5)
1 Support Vector Machines Classification
Venables & Ripley, Section 12.5
CSU Hayward, Statistics 6601
- Joseph Rickert
- Timothy McKusick
- December 1, 2004
2 Support Vector Machine
- What is the SVM? The SVM is a generalization of the Optimal Hyperplane Algorithm.
- Why is the SVM important? It allows the use of more similarity measures than the OHA, and through the use of kernel methods it works with non-vector data.
3Simple Linear Classifier
- XRp
- f(x) wTx b
- Each x ? X is classified into 2
- classes labeled y ? 1,-1
- y 1 if f(x) ? 0 and
- y -1 if f(x) lt 0
- S (x1,y1),(x2,y2),...
- Given S, the problem is to learn f (find w and b)
. - For each f check to see if all (xi,yi) are
correctly classified i.e. yif(xi) ? 0 - Choose f so that the number of errors is minimized
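The rule above is easy to state directly in R. A minimal sketch on made-up data; the points and the candidate (w, b) are invented for illustration only:

## f(x) = w'x + b; classify y = +1 when f(x) >= 0, y = -1 otherwise,
## then count how many training pairs violate y_i * f(x_i) >= 0.
set.seed(1)
X <- matrix(rnorm(40), ncol = 2)            # 20 points in R^2
y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)     # labels from a known hyperplane

w <- c(1, 1); b <- 0                        # a candidate classifier (w, b)
f <- as.vector(X %*% w + b)
y.hat <- ifelse(f >= 0, 1, -1)
sum(y * f < 0)                              # number of misclassified points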
4 But what if the training set is not linearly separable?
- f(x) = w^T x + b defines two half-planes {x : f(x) ≥ 1} and {x : f(x) ≤ -1}
- Classify with the hinge loss function c(f, x, y) = max(0, 1 - y f(x))
- c(f, x, y) increases as the distance from the correct half-plane increases.
- If (x, y) is correctly classified with large confidence, then c(f, x, y) = 0.
[Figure: hinge loss plotted against y f(x), with the half-planes w^T x + b > 1 and w^T x + b < -1 and margin 2/||w||]
- y f(x) ≥ 1: correct with large confidence
- 0 ≤ y f(x) < 1: correct with small confidence
- y f(x) < 0: misclassified
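As a quick illustration of the three regimes in the figure, the hinge loss can be written in one line of R (the decision values below are made up):

hinge <- function(fx, y) pmax(0, 1 - y * fx)   # c(f,x,y) = max(0, 1 - y f(x))

hinge( 2.0, 1)   # y f(x) >= 1: correct with large confidence, loss 0
hinge( 0.4, 1)   # 0 <= y f(x) < 1: correct with small confidence, loss 0.6
hinge(-0.7, 1)   # y f(x) < 0: misclassified, loss 1.7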
5 SVMs combine requirements of large margin and few misclassifications by solving the problem:
- New formulation: min (1/2)||w||^2 + C Σ c(f, xi, yi) w.r.t. w and b
- C is a parameter that controls the tradeoff between margin and misclassification.
- Large C → small margins, but more samples correctly classified with strong confidence.
- Technical difficulty: the hinge loss function c(f, xi, yi) is not differentiable.
- Even better formulation: use slack variables ξi: min (1/2)||w||^2 + C Σ ξi w.r.t. w, ξ, and b
- under the constraint ξi ≥ c(f, xi, yi)   (*)
- But (*) is equivalent to ξi ≥ 0 and ξi - 1 + yi(w^T xi + b) ≥ 0, for i = 1, ..., n.
- Solve this quadratic optimization problem with Lagrange multipliers (a sketch of evaluating the objective follows).
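A sketch of what this objective looks like numerically, with the slack variables taken at their minimal feasible values ξi = max(0, 1 - yi(w^T xi + b)); the data and the candidate (w, b) are invented:

svm.objective <- function(w, b, X, y, C = 1) {
  xi <- pmax(0, 1 - y * as.vector(X %*% w + b))   # minimal feasible slacks
  0.5 * sum(w^2) + C * sum(xi)                    # (1/2)||w||^2 + C * sum(xi)
}

set.seed(2)
X <- matrix(rnorm(40), ncol = 2)
y <- ifelse(X[, 1] - X[, 2] > 0, 1, -1)
svm.objective(c(1, -1), 0, X, y, C = 1)    # larger C weights the slack term more heavily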
6 Support Vectors
- Lagrange multiplier formulation: find α that maximizes W(α) = Σ αi - (1/2) Σi Σj yi yj αi αj xi^T xj
- under the constraints Σ yi αi = 0 and 0 ≤ αi ≤ C
- The points with positive Lagrange multipliers, αi > 0, are called support vectors.
- The set of support vectors contains all the information used by the SVM to learn a discrimination function.
[Figure: separating hyperplane and margin, with points labeled α = 0, α = C, and 0 < α < C]
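With the e1071 package used later in these slides, the multipliers of a fitted model can be inspected directly; a small sketch on a two-class subset of iris (the subset and the cost value are chosen only for illustration):

library(e1071)
iris2 <- droplevels(subset(iris, Species != "setosa"))   # two-class problem
fit <- svm(Species ~ ., data = iris2, cost = 1)

fit$index               # rows of the training set with a_i > 0 (the support vectors)
head(fit$coefs)         # y_i * a_i for those points
range(abs(fit$coefs))   # magnitudes lie in (0, C]; here C = cost = 1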
7 Kernel Methods: data are not represented individually, but only through a set of pairwise comparisons
- X: a set of objects (proteins)
- S = (aatcgagtcac, atggacgtct, tgcactact): each object is represented by a sequence.
- K = the kernel matrix of pairwise comparisons:
      1    0.5  0.3
      0.5  1    0.6
      0.3  0.6  1
- Each number in the kernel matrix is a measure of the similarity or distance between two objects.
8 Kernels
- Properties of kernels
- Kernels are measures of similarity: K(x, x') is large when x and x' are similar.
- Kernels must be positive definite and symmetric.
- For every kernel K there exist a Hilbert space F and a mapping Φ: X → F such that K(x, x') = <Φ(x), Φ(x')> for all x, x' ∈ X.
- Hence all kernels can be thought of as dot products in some feature space (a small numerical check follows below).
- Advantages of kernels
- Data of very different nature can be analyzed in a unified framework.
- No matter what the objects are, n objects are always represented by an n x n matrix.
- Many times it is easier to compare objects than to represent them numerically.
- Complete modularity between the function used to represent the data and the algorithm used to analyze them.
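A tiny numerical check of the "dot product in a feature space" claim, for the quadratic kernel K(x, z) = (x^T z)^2 on R^2 with the explicit map Φ(x) = (x1^2, √2 x1 x2, x2^2); the two vectors are arbitrary:

K   <- function(x, z) sum(x * z)^2                            # kernel computed directly
Phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)   # explicit feature map

x <- c(1, 2); z <- c(3, -1)
K(x, z)                      # 1
sum(Phi(x) * Phi(z))         # the same value: <Phi(x), Phi(z)>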
9 The Kernel Trick
- Any algorithm for vector data that can be expressed in terms of dot products can be performed implicitly in the feature space associated with the kernel, by replacing each dot product with a kernel evaluation.
- e.g. For some feature space F, let d(x, x') = ||Φ(x) - Φ(x')||
- But ||Φ(x) - Φ(x')||^2 = <Φ(x), Φ(x)> + <Φ(x'), Φ(x')> - 2<Φ(x), Φ(x')>
- So d(x, x') = (K(x, x) + K(x', x') - 2 K(x, x'))^(1/2)
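The same identity in R, sketched with a Gaussian (RBF) kernel; the test points and the value of gamma are arbitrary:

rbf <- function(x, z, gamma = 0.5) exp(-gamma * sum((x - z)^2))

## distance in the feature space, computed from kernel values only
kernel.dist <- function(x, z, K) sqrt(K(x, x) + K(z, z) - 2 * K(x, z))

kernel.dist(c(0, 0), c(1, 2), rbf)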
10 Nonlinear Separation
- Nonlinear kernel:
- X is a vector space
- the mapping Φ is nonlinear
- linear separation in the feature space F corresponds to nonlinear separation in X
[Figure: a nonlinear boundary in X mapped by Φ to a linear boundary in F]
11 SVM with Kernel
- Final formulation:
- Find α that maximizes W(α) = Σ αi - (1/2) Σi Σj yi yj αi αj K(xi, xj)
- under the constraints Σ yi αi = 0 and 0 ≤ αi ≤ C
- Find an index i with 0 < αi < C and set b = yi - Σj yj αj K(xi, xj)
- The classification of a new object x ∈ X is then determined by the sign of the function f(x) = Σi yi αi K(xi, x) + b (a sketch of computing this from a fitted model follows).
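For a two-class fit from e1071 (the package used on the next slides), f(x) can be reconstructed by hand from the stored support vectors. A sketch, using scale = FALSE so the support vectors stay in the original coordinates; the data subset is chosen only for illustration, and the sign of the decision value depends on the class ordering:

library(e1071)
iris2 <- droplevels(subset(iris, Species != "setosa"))
fit <- svm(Species ~ ., data = iris2, kernel = "radial", scale = FALSE)

rbf <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))   # same kernel as the fit

## f(x) = sum_i y_i a_i K(x_i, x) + b:
## fit$coefs holds y_i * a_i, fit$SV the support vectors, and fit$rho equals -b.
f <- function(x) {
  k <- apply(fit$SV, 1, rbf, v = x, gamma = fit$gamma)
  sum(fit$coefs * k) - fit$rho
}

f(as.numeric(iris2[1, 1:4]))   # manual decision value for the first case
attr(predict(fit, iris2[1, 1:4], decision.values = TRUE), "decision.values")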
12 iris data set (Anderson 1935): 150 cases, 50 each of 3 species of iris. Example from page 48 of "The e1071 Package".
- First 10 lines of iris:
> iris
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
13 SVM ANALYSIS OF IRIS DATA
# SVM ANALYSIS OF IRIS DATA SET
# classification mode
# default with factor response
library(e1071)
model <- svm(Species ~ ., data = iris)
summary(model)

Call:
svm(formula = Species ~ ., data = iris)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  51
 ( 8 22 21 )

Number of Classes:  3

Levels:
 setosa versicolor virginica

cost is the parameter C of the Lagrange formulation; the radial kernel is exp(-γ ||u - v||^2).
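Both reported values can also be set explicitly when fitting; a short sketch (the particular cost and gamma values here are arbitrary, not a recommendation):

model.tuned <- svm(Species ~ ., data = iris,
                   kernel = "radial", cost = 10, gamma = 0.1)
summary(model.tuned)    # compare the support-vector count with the cost = 1 fit above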
14 Exploring the SVM Model
# test with training data
x <- subset(iris, select = -Species)
y <- iris$Species
pred <- predict(model, x)
# check accuracy
table(pred, y)
# compute decision values
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4, ]

            y
pred         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          2        48

     setosa/versicolor setosa/virginica versicolor/virginica
[1,]          1.196000         1.091667            0.6706543
[2,]          1.064868         1.055877            0.8482041
[3,]          1.181229         1.074370            0.6438237
[4,]          1.111282         1.052820            0.6780645
15 Visualize classes with MDS
# visualize (classes by color, SV by crosses)
plot(cmdscale(dist(iris[, -5])),
     col = as.integer(iris[, 5]),
     pch = c("o", "+")[1:150 %in% model$index + 1])

cmdscale: multidimensional scaling, or principal coordinates analysis
black = setosa, red = versicolor, green = virginica
16 iris split into training and test sets: the first 25 cases of each species form the training set
# SECOND SVM ANALYSIS OF IRIS DATA SET
# classification mode
# default with factor response
# train with iris.train data
model.2 <- svm(fS.TR ~ ., data = iris.train)
# output from summary
summary(model.2)

Call:
svm(formula = fS.TR ~ ., data = iris.train)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  32
 ( 7 13 12 )

Number of Classes:  3

Levels:
 setosa versicolor virginica
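The slides do not show how iris.train, iris.test, fS.TR, and fS.TE were built; a plausible reconstruction of the "first 25 of each species" split described above (the object names simply mirror those used in the slide code):

train.rows <- c(1:25, 51:75, 101:125)                  # first 25 cases of each species
iris.train <- data.frame(iris[train.rows, 1:4], fS.TR = iris$Species[train.rows])
iris.test  <- data.frame(iris[-train.rows, 1:4], fS.TE = iris$Species[-train.rows])
fS.TR <- iris.train$fS.TR
fS.TE <- iris.test$fS.TE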
17 iris test results
# test with iris.test data
x.2 <- subset(iris.test, select = -fS.TE)
y.2 <- fS.TE
pred.2 <- predict(model.2, x.2)
# check accuracy
table(pred.2, y.2)
# compute decision values and probabilities
pred.2 <- predict(model.2, x.2, decision.values = TRUE)
attr(pred.2, "decision.values")[1:4, ]

            y.2
pred.2       setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         25         0
  virginica       0          0        25

     setosa/versicolor setosa/virginica versicolor/virginica
[1,]          1.253378         1.086341            0.6065033
[2,]          1.000251         1.021445            0.8012664
[3,]          1.247326         1.104700            0.6068924
[4,]          1.164226         1.078913            0.6311566
18 iris training and test sets
19 Microarray Data from Golub et al., "Molecular Classification of Cancer: Class Prediction by Gene Expression Monitoring", Science, Vol. 286, 10/15/1999
- Expression levels of predictive genes.
- Rows: genes
- Columns: samples
- Expression levels (EL) of each gene are relative to the mean EL for that gene in the initial dataset.
- Red if EL > mean
- Blue if EL < mean
- The scale indicates standard deviations above or below the mean.
- Top panel: genes highly expressed in ALL.
- Bottom panel: genes more highly expressed in AML.
20 Microarray Data Transposed: rows = samples, columns = genes
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,] -214 -153  -58   88 -295 -558  199 -176  252   206
 [2,] -139  -73   -1  283 -264 -400 -330 -168  101    74
 [3,]  -76  -49 -307  309 -376 -650   33 -367  206  -215
 [4,] -135 -114  265   12 -419 -585  158 -253   49    31
 [5,] -106 -125  -76  168 -230 -284    4 -122   70   252
 [6,] -138  -85  215   71 -272 -558   67 -186   87   193
 [7,]  -72 -144  238   55 -399 -551  131 -179  126   -20
 [8,] -413 -260    7   -2 -541 -790 -275 -463   70  -169
 [9,]    5 -127  106  268 -210 -535    0 -174   24   506
[10,]  -88 -105   42  219 -178 -246  328 -148  177   183
[11,] -165 -155  -71   82 -163 -430  100 -109   56   350
[12,]  -67  -93   84   25 -179 -323 -135 -127   -2   -66
[13,]  -92 -119  -31  173 -233 -227  -49  -62   13   230
[14,] -113 -147 -118  243 -127 -398 -249 -228  -37   113
[15,] -107  -72 -126  149 -205 -284 -166 -185    1   -23

- Training Data
- 38 samples
- 7129 x 38 matrix
- ALL: 27
- AML: 11
- Test Data
- 34 samples
- 7129 x 34 matrix
- ALL: 20
- AML: 14
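The slides never show how fmat.train and fmat.test were created. A sketch, assuming the raw Golub expression values sit in matrices golub.train (7129 x 38, ALL samples first) and golub.test (7129 x 34) with genes in rows; these matrix names are hypothetical:

fmat.train <- data.frame(t(golub.train))   # transpose: rows = samples, columns = genes
fmat.test  <- data.frame(t(golub.test))
dim(fmat.train)                            # 38 x 7129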
21 SVM ANALYSIS OF MICROARRAY DATA: classification mode
# default with factor response
y <- c(rep(0, 27), rep(1, 11))
fy <- factor(y, levels = 0:1)
levels(fy) <- c("ALL", "AML")
# compute svm on first 3000 genes only because of memory overflow problems
model.ma <- svm(fy ~ ., data = fmat.train[, 1:3000])

Call:
svm(formula = fy ~ ., data = fmat.train[, 1:3000])

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.0003333333

Number of Support Vectors:  37
 ( 26 11 )

Number of Classes:  2

Levels:
 ALL AML
22 Visualize Microarray Training Data with Multidimensional Scaling
# visualize training data
# (classes by color, SV by crosses)
# multidimensional scaling
pc <- cmdscale(dist(fmat.train[, 1:3000]))
plot(pc,
     col = as.integer(fy),
     pch = c("o", "+")[1:3000 %in% model.ma$index + 1],
     main = "Training Data ALL 'Black' and AML 'Red' Classes")
23 Check Model with Training Data; Predict Outcomes of Test Data
# check the training data
x <- fmat.train[, 1:3000]
pred.train <- predict(model.ma, x)
# check accuracy
table(pred.train, fy)
# classify the test data
y2 <- c(rep(0, 20), rep(1, 14))
fy2 <- factor(y2, levels = 0:1)
levels(fy2) <- c("ALL", "AML")
x2 <- fmat.test[, 1:3000]
pred <- predict(model.ma, x2)
# check accuracy
table(pred, fy2)

            fy
pred.train   ALL AML
       ALL    27   0
       AML     0  11

        fy2
pred     ALL AML
  ALL     20  13
  AML      0   1

The training data are correctly classified, but on the test set nearly all of the AML samples are misclassified: the model is worthless so far.
24 Conclusion
- The SVM appears to be a powerful classifier applicable to many different kinds of data.
- But:
- Kernel design is a full-time job.
- Selecting model parameters is far from obvious (one possible aid is sketched below).
- The math is formidable.
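On the parameter-selection point, e1071 does offer some help: tune.svm runs a cross-validated grid search over cost and gamma. A short sketch on the iris data (the grid values are arbitrary):

library(e1071)
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = 2^(-4:0), cost = 2^(0:4))
summary(tuned)    # best gamma/cost combination and its cross-validation error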