Introduction to SVMs - PowerPoint PPT Presentation

Slides: 86
Provided by: awm9
Transcript and Presenter's Notes

Title: Introduction to SVMs


1
Introduction to SVMs
2
SVMs
  • Geometric
  • Maximizing Margin
  • Kernel Methods
  • Making nonlinear decision boundaries linear
  • Efficiently!
  • Capacity
  • Structural Risk Minimization

3
Linear Classifiers
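[Figure: x → f (parameters α) → y_est; scatter plot of the two classes]
$f(x, w, b) = \operatorname{sign}(w \cdot x - b)$
One marker denotes +1, the other denotes -1
How would you classify this data?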
7
Linear Classifiers
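[Figure: x → f (parameters α) → y_est; scatter plot of the two classes with several candidate separating lines]
$f(x, w, b) = \operatorname{sign}(w \cdot x - b)$
One marker denotes +1, the other denotes -1
Any of these would be fine... but which is best?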
8
Classifier Margin
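[Figure: x → f (parameters α) → y_est; scatter plot of the two classes with a widened boundary]
$f(x, w, b) = \operatorname{sign}(w \cdot x - b)$
One marker denotes +1, the other denotes -1
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.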
9
Maximum Margin
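[Figure: x → f (parameters α) → y_est; scatter plot of the two classes with the maximum-margin boundary, labeled "Linear SVM"]
$f(x, w, b) = \operatorname{sign}(w \cdot x - b)$
One marker denotes +1, the other denotes -1
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM).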
10
Maximum Margin
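[Figure: x → f (parameters α) → y_est; scatter plot of the two classes with the maximum-margin boundary, labeled "Linear SVM"]
$f(x, w, b) = \operatorname{sign}(w \cdot x - b)$
One marker denotes +1, the other denotes -1
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
Support vectors are those datapoints that the margin pushes up against.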
11
Why Maximum Margin?
  1. Intuitively this feels safest.
  2. If we've made a small error in the location of
    the boundary (it's been jolted in its
    perpendicular direction) this gives us least
    chance of causing a misclassification.
  3. There's some theory (using VC dimension) that is
    related to (but not the same as) the proposition
    that this is a good thing.
  4. Empirically it works very very well.

12
A Good Separator
[Figure: scatter plot of O and X points]
13
Noise in the Observations
[Figure: scatter plot of O and X points]
14
Ruling Out Some Separators
[Figure: scatter plot of O and X points]
15
Lots of Noise
[Figure: scatter plot of O and X points]
16
Maximizing the Margin
[Figure: scatter plot of O and X points]
17
Specifying a line and margin
[Figure: Plus-Plane, Classifier Boundary, and Minus-Plane separating the "Predict Class = +1" zone from the "Predict Class = -1" zone]
  • How do we represent this mathematically?
  • ...in m input dimensions?

18
Specifying a line and margin
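[Figure: Plus-Plane ($w \cdot x + b = +1$), Classifier Boundary ($w \cdot x + b = 0$), and Minus-Plane ($w \cdot x + b = -1$), separating the "Predict Class = +1" zone from the "Predict Class = -1" zone]
  • Plus-plane = $\{ x : w \cdot x + b = +1 \}$
  • Minus-plane = $\{ x : w \cdot x + b = -1 \}$

Classify as
  +1 if $w \cdot x + b \ge 1$
  -1 if $w \cdot x + b \le -1$
Universe explodes if $-1 < w \cdot x + b < 1$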
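
The three-zone rule above can be sketched in a few lines of Python. This is only an illustration: the helper names (`dot`, `classify_with_margin`) and the sample values of w and b are made up for the example, and returning `None` stands in for the "universe explodes" case.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def classify_with_margin(x, w, b):
    """Return +1 or -1, or None when x lies strictly between the two planes."""
    score = dot(w, x) + b
    if score >= 1:
        return 1
    if score <= -1:
        return -1
    return None  # inside the margin: the rule above does not assign a class


# Hypothetical classifier and query points, chosen only for illustration.
w, b = (1.0, 1.0), -3.0
points = [(3.0, 2.0), (0.5, 0.5), (1.5, 1.5)]
labels = [classify_with_margin(p, w, b) for p in points]
print(labels)
```

The third point sits exactly on the boundary $w \cdot x + b = 0$, so it falls in the zone that a perfectly separating classifier must keep empty.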
19
Computing the margin width
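[Figure: margin of width M between the "Predict Class = +1" zone and the "Predict Class = -1" zone; the three parallel lines are $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$]
How do we compute M in terms of w and b?
  • Plus-plane = $\{ x : w \cdot x + b = +1 \}$
  • Minus-plane = $\{ x : w \cdot x + b = -1 \}$
  • Claim: The vector w is perpendicular to the Plus Plane. Why?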

20
Computing the margin width
[Figure: margin of width M between the "Predict Class = +1" zone and the "Predict Class = -1" zone; the three parallel lines are $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$]
How do we compute M in terms of w and b?
  • Plus-plane = $\{ x : w \cdot x + b = +1 \}$
  • Minus-plane = $\{ x : w \cdot x + b = -1 \}$
  • Claim: The vector w is perpendicular to the Plus Plane. Why?

Let u and v be two vectors on the Plus Plane. What is $w \cdot (u - v)$?
And so of course the vector w is also perpendicular to the Minus Plane
21
Computing the margin width
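[Figure: margin of width M between the planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$, with the boundary $w \cdot x + b = 0$ between them; $x^+$ lies on the plus-plane and $x^-$ on the minus-plane]
How do we compute M in terms of w and b?
  • Plus-plane = $\{ x : w \cdot x + b = +1 \}$
  • Minus-plane = $\{ x : w \cdot x + b = -1 \}$
  • The vector w is perpendicular to the Plus Plane
  • Let $x^-$ be any point on the minus plane
  • Let $x^+$ be the closest plus-plane-point to $x^-$.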

22
Computing the margin width
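[Figure: margin of width M between the planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$, with the boundary $w \cdot x + b = 0$ between them; $x^+$ lies on the plus-plane and $x^-$ on the minus-plane]
How do we compute M in terms of w and b?
  • Plus-plane = $\{ x : w \cdot x + b = +1 \}$
  • Minus-plane = $\{ x : w \cdot x + b = -1 \}$
  • The vector w is perpendicular to the Plus Plane
  • Let $x^-$ be any point on the minus plane
  • Let $x^+$ be the closest plus-plane-point to $x^-$.
  • Claim: $x^+ = x^- + \lambda w$ for some value of $\lambda$. Why?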

23
Computing the margin width
[Figure: margin of width M between the planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$, with the boundary $w \cdot x + b = 0$ between them; $x^+$ lies on the plus-plane and $x^-$ on the minus-plane]
How do we compute M in terms of w and b?
  • Plus-plane = $\{ x : w \cdot x + b = +1 \}$
  • Minus-plane = $\{ x : w \cdot x + b = -1 \}$
  • The vector w is perpendicular to the Plus Plane
  • Let $x^-$ be any point on the minus plane
  • Let $x^+$ be the closest plus-plane-point to $x^-$.
  • Claim: $x^+ = x^- + \lambda w$ for some value of $\lambda$. Why?

The line from $x^-$ to $x^+$ is perpendicular to the planes. So to get from $x^-$ to $x^+$, travel some distance in direction w.

24
Computing the margin width
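[Figure: margin of width M between the planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$, with the boundary $w \cdot x + b = 0$ between them; $x^+$ lies on the plus-plane and $x^-$ on the minus-plane]
  • What we know
  • $w \cdot x^+ + b = +1$
  • $w \cdot x^- + b = -1$
  • $x^+ = x^- + \lambda w$
  • $|x^+ - x^-| = M$
  • It's now easy to get M in terms of w and b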

25
Computing the margin width
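[Figure: margin of width M between the planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$, with the boundary $w \cdot x + b = 0$ between them; $x^+$ lies on the plus-plane and $x^-$ on the minus-plane]
  • What we know
  • $w \cdot x^+ + b = +1$
  • $w \cdot x^- + b = -1$
  • $x^+ = x^- + \lambda w$
  • $|x^+ - x^-| = M$
  • It's now easy to get M in terms of w and b

$w \cdot (x^- + \lambda w) + b = 1$
$\Rightarrow\; w \cdot x^- + b + \lambda \, w \cdot w = 1$
$\Rightarrow\; -1 + \lambda \, w \cdot w = 1$
$\Rightarrow\; \lambda = \dfrac{2}{w \cdot w}$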

26
Computing the margin width
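[Figure: margin of width M between the planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$, with the boundary $w \cdot x + b = 0$ between them; $x^+$ lies on the plus-plane and $x^-$ on the minus-plane]
$M = |x^+ - x^-| = |\lambda w| = \lambda \sqrt{w \cdot w} = \dfrac{2 \sqrt{w \cdot w}}{w \cdot w} = \dfrac{2}{\sqrt{w \cdot w}}$
  • What we know
  • $w \cdot x^+ + b = +1$
  • $w \cdot x^- + b = -1$
  • $x^+ = x^- + \lambda w$
  • $|x^+ - x^-| = M$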
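
As a quick numerical sanity check of the "what we know" list, the sketch below picks an arbitrary w and b, steps from a minus-plane point along w by $\lambda = 2 / (w \cdot w)$, and compares the distance travelled with $2 / \sqrt{w \cdot w}$. The function names and the sample numbers are illustrative assumptions, not part of the slides.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def margin_width(w):
    """Margin width M = 2 / sqrt(w . w)."""
    return 2.0 / math.sqrt(dot(w, w))


def closest_plus_plane_point(x_minus, w):
    """Step from a minus-plane point along w by lambda = 2 / (w . w)."""
    lam = 2.0 / dot(w, w)
    return tuple(xi + lam * wi for xi, wi in zip(x_minus, w))


# Hypothetical classifier; x_minus satisfies w . x + b = -1.
w, b = (3.0, 4.0), -2.0
x_minus = (-1.0, 1.0)
x_plus = closest_plus_plane_point(x_minus, w)

# Euclidean distance between the two points, which should equal M.
gap = math.sqrt(sum((p - q) ** 2 for p, q in zip(x_plus, x_minus)))
print(x_plus, gap, margin_width(w))
```

Note that b never enters the width: shifting b slides both planes together without changing the distance between them.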

27
Learning the Maximum Margin Classifier
[Figure: margin of width M between the planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$, with the boundary $w \cdot x + b = 0$ between them; $x^+$ lies on the plus-plane and $x^-$ on the minus-plane]
  • Given a guess of w and b we can
  • Compute whether all data points are in the correct
    half-planes
  • Compute the width of the margin
  • So now we just need to write a program to search
    the space of w's and b's to find the widest
    margin that matches all the datapoints. How?
  • Gradient descent? Simulated Annealing? Matrix
    Inversion? EM? Newton's Method?

28
Don't worry, it's good for you
  • Linear Programming
  • find w
  • $\arg\max_w \; c \cdot w$
  • subject to
  • $w \cdot a_i \le b_i$, for $i = 1, \dots, m$
  • $w_j \ge 0$ for $j = 1, \dots, n$

There are fast algorithms for solving linear
programs, including the simplex algorithm and
Karmarkar's algorithm
29
Learning via Quadratic Programming
  • QP is a well-studied class of optimization
    algorithms to maximize a quadratic function of
    some real-valued variables subject to linear
    constraints.

30
Quadratic Programming
Quadratic criterion
Find
Subject to
n additional linear inequality constraints
And subject to
e additional linear equality constraints
31
Quadratic Programming
Quadratic criterion
Find
There exist algorithms for finding such
constrained quadratic optima much more
efficiently and reliably than gradient
ascent. (But they are very fiddly; you probably
don't want to write one yourself.)
Subject to
n additional linear inequality constraints
And subject to
e additional linear equality constraints
32
Learning the Maximum Margin Classifier
  • Given guess of w , b we can
  • Compute whether all data points are in the
    correct half-planes
  • Compute the margin width
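  • Assume R datapoints, each $(x_k, y_k)$ where $y_k = \pm 1$

[Figure: margin of width M between the "Predict Class = +1" zone and the "Predict Class = -1" zone; the three parallel lines are $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$]
What should our quadratic optimization criterion be?
Minimize $w \cdot w$
How many constraints will we have? R
What should they be?
$w \cdot x_k + b \ge 1$ if $y_k = 1$
$w \cdot x_k + b \le -1$ if $y_k = -1$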
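
To make the criterion and the R constraints concrete, the following sketch checks a few hand-picked (w, b) guesses against a toy dataset. It does not solve the QP; it only evaluates feasibility and the value of $w \cdot w$ for each guess. The dataset, the candidate values, and the helper names are assumptions made for illustration.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def satisfies_constraints(data, w, b):
    """True when every point obeys y_k * (w . x_k + b) >= 1."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in data)


def qp_criterion(w):
    """The quantity to minimize: w . w (a smaller value means a wider margin)."""
    return dot(w, w)


# Hypothetical, linearly separable toy data: (x_k, y_k) pairs.
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((1.0, 0.0), -1)]

# Hypothetical guesses for (w, b).
candidates = {
    "A": ((1.0, 1.0), -3.0),
    "B": ((2.0, 2.0), -6.0),
    "C": ((0.5, 0.5), -1.5),
}

feasible = {name: satisfies_constraints(data, w, b) for name, (w, b) in candidates.items()}
criterion = {name: qp_criterion(w) for name, (w, _) in candidates.items()}
print(feasible, criterion)
```

Guess C has the smallest $w \cdot w$ but violates a constraint, so it is not admissible; among the feasible guesses, A beats B. A QP solver searches over all (w, b) rather than a short list like this.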
33
Uh-oh!
This is going to be a problem! What should we do?
Idea 1: Find minimum $w \cdot w$, while minimizing the number of training set errors.
Problem: Two things to minimize makes for an ill-defined optimization
34
Uh-oh!
This is going to be a problem! What should we do?
Idea 1.1: Minimize $w \cdot w + C \cdot (\text{number of train errors})$, where C is a tradeoff parameter.
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
35
Uh-oh!
This is going to be a problem! What should we do?
Idea 1.1: Minimize $w \cdot w + C \cdot (\text{number of train errors})$, where C is a tradeoff parameter.
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
It can't be expressed as a Quadratic Programming problem. Solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)
So any other ideas?
36
Uh-oh!
This is going to be a problem! What should we do?
Idea 2.0: Minimize $w \cdot w + C \cdot (\text{distance of error points to their correct place})$
37
Learning Maximum Margin with Noise
  • Given guess of w , b we can
  • Compute sum of distances of points to their
    correct zones
  • Compute the margin width
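  • Assume R datapoints, each $(x_k, y_k)$ where $y_k = \pm 1$

[Figure: margin of width M; the three parallel lines are $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$]
What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?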
38
Large-margin Decision Boundary
  • The decision boundary should be as far away from
    the data of both classes as possible
  • We should maximize the margin, m
  • Distance between the origin and the line $w^T x = k$ is
    $k / \|w\|$

[Figure: Class 1 and Class 2 point clouds separated by a boundary with margin m]
39
Finding the Decision Boundary
  • Let $x_1, \dots, x_n$ be our data set and let $y_i \in \{1, -1\}$
    be the class label of $x_i$
  • The decision boundary should classify all points
    correctly $\Rightarrow\; y_i (w^T x_i + b) \ge 1$ for all $i$
  • The decision boundary can be found by solving the
    following constrained optimization problem:
    minimize $\tfrac{1}{2} \|w\|^2$ subject to $y_i (w^T x_i + b) \ge 1$ for all $i$
  • This is a constrained optimization problem.
    Solving it requires some new tools
  • Feel free to ignore the following several slides;
    what is important is the constrained optimization
    problem above

40
Back to the Original Problem
  • The Lagrangian is
  • Note that $\|w\|^2 = w^T w$
  • Setting the gradient of the Lagrangian w.r.t. w and b to
    zero, we have

41
  • The Karush-Kuhn-Tucker conditions,

42
The Dual Problem
  • If we substitute $w = \sum_i \alpha_i y_i x_i$ into the Lagrangian,
    we have
  • Note that $\sum_i \alpha_i y_i = 0$
  • This is a function of $\alpha_i$ only

43
The Dual Problem
  • The new objective function is in terms of $\alpha_i$ only
  • It is known as the dual problem: if we know w, we
    know all $\alpha_i$; if we know all $\alpha_i$, we know w
  • The original problem is known as the primal
    problem
  • The objective function of the dual problem needs
    to be maximized!
  • The dual problem is therefore

[Annotation on the constraint $\alpha_i \ge 0$] Properties of $\alpha_i$ when we introduce the Lagrange multipliers
[Annotation on the constraint $\sum_i \alpha_i y_i = 0$] The result when we differentiate the original Lagrangian w.r.t. b
44
The Dual Problem
  • This is a quadratic programming (QP) problem
  • A global maximum over the $\alpha_i$ can always be found
  • w can be recovered by $w = \sum_i \alpha_i y_i x_i$
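
The formulas on the Lagrangian and dual-problem slides did not survive in this transcript. For orientation, the chain of steps in its standard textbook form is sketched below. It assumes the primal problem is to minimize $\tfrac{1}{2} w^T w$ subject to $y_i (w^T x_i + b) \ge 1$, with one multiplier $\alpha_i \ge 0$ per constraint; the exact notation on the original slides may differ.

```latex
\begin{align*}
\mathcal{L}(w, b, \alpha)
  &= \tfrac{1}{2} w^T w
     + \sum_{i=1}^{n} \alpha_i \bigl( 1 - y_i (w^T x_i + b) \bigr),
     \qquad \alpha_i \ge 0 \\[4pt]
\frac{\partial \mathcal{L}}{\partial w} = 0
  &\;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i \\[4pt]
\frac{\partial \mathcal{L}}{\partial b} = 0
  &\;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0 \\[4pt]
\max_{\alpha} \; W(\alpha)
  &= \sum_{i=1}^{n} \alpha_i
     - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
       \alpha_i \alpha_j y_i y_j \, x_i^T x_j
  \quad \text{subject to } \alpha_i \ge 0, \;
  \sum_{i=1}^{n} \alpha_i y_i = 0
\end{align*}
```

Substituting the first stationarity condition into the Lagrangian and using the second to cancel the term in b yields $W(\alpha)$, which depends on the data only through the inner products $x_i^T x_j$.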

45
Characteristics of the Solution
  • Many of the $\alpha_i$ are zero
  • w is a linear combination of a small number of
    data points
  • This sparse representation can be viewed as
    data compression, as in the construction of a k-NN
    classifier
  • $x_i$ with non-zero $\alpha_i$ are called support vectors
    (SV)
  • The decision boundary is determined only by the
    SV
  • Let $t_j$ ($j = 1, \dots, s$) be the indices of the s
    support vectors. We can write $w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}$
  • For testing with a new data point z
  • Compute $w^T z + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} (x_{t_j}^T z) + b$
    and classify z as class 1 if
    the sum is positive, and class 2 otherwise
  • Note: w need not be formed explicitly
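
The testing rule above can be sketched directly from the support vectors, without ever building w. The two support vectors, their multipliers, and b below are hand-picked toy values (they correspond to the points (1, 0) and (-1, 0) with a boundary at the vertical axis), and the function names are illustrative.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def decision_value(z, support_vectors, b):
    """Sum of alpha * y * (x . z) over the support vectors, plus b."""
    return sum(alpha * y * dot(x, z) for alpha, y, x in support_vectors) + b


def classify(z, support_vectors, b):
    """Class 1 when the sum is positive, class 2 otherwise."""
    return 1 if decision_value(z, support_vectors, b) > 0 else 2


# Hypothetical support vectors as (alpha, y, x) triples, with y = +1 for class 1.
support_vectors = [(0.5, 1, (1.0, 0.0)), (0.5, -1, (-1.0, 0.0))]
b = 0.0

print(classify((2.0, 3.0), support_vectors, b))   # point on the class-1 side
print(classify((-0.5, 4.0), support_vectors, b))  # point on the class-2 side
```

The cost per prediction is proportional to the number of support vectors, not to the size of the training set.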

46
A Geometrical Interpretation
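[Figure: Class 1 and Class 2 points labeled with their multipliers: $\alpha_1 = 0.8$, $\alpha_2 = 0$, $\alpha_3 = 0$, $\alpha_4 = 0$, $\alpha_5 = 0$, $\alpha_6 = 1.4$, $\alpha_7 = 0$, $\alpha_8 = 0.6$, $\alpha_9 = 0$, $\alpha_{10} = 0$]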
47
Non-linearly Separable Problems
  • We allow an "error" $\xi_i$ in classification; it is
    based on the output of the discriminant function
    $w^T x + b$
  • $\xi_i$ approximates the number of misclassified
    samples

48
Learning Maximum Margin with Noise
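  • Given guess of w, b we can
  • Compute sum of distances of points to their
    correct zones
  • Compute the margin width
  • Assume R datapoints, each $(x_k, y_k)$ where $y_k = \pm 1$

[Figure: margin of width M with slack distances $\varepsilon_2$, $\varepsilon_7$, $\varepsilon_{11}$ drawn for points on the wrong side of their plane; the three parallel lines are $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$]
What should our quadratic optimization criterion be?
Minimize $\tfrac{1}{2} w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$
How many constraints will we have? R
What should they be?
$w \cdot x_k + b \ge 1 - \varepsilon_k$ if $y_k = 1$
$w \cdot x_k + b \le -1 + \varepsilon_k$ if $y_k = -1$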
49
Learning Maximum Margin with Noise
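  • Given guess of w, b we can
  • Compute sum of distances of points to their
    correct zones
  • Compute the margin width
  • Assume R datapoints (R records), each $(x_k, y_k)$ with m input dimensions, where $y_k = \pm 1$

[Figure: margin of width M with slack distances $\varepsilon_2$, $\varepsilon_7$, $\varepsilon_{11}$ drawn for points on the wrong side of their plane; the three parallel lines are $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$]
Our original (noiseless data) QP had m+1 variables: $w_1, w_2, \dots, w_m$, and b.
Our new (noisy data) QP has m+1+R variables: $w_1, w_2, \dots, w_m, b, \varepsilon_1, \dots, \varepsilon_R$
What should our quadratic optimization criterion be?
Minimize $\tfrac{1}{2} w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$
How many constraints will we have? R
What should they be?
$w \cdot x_k + b \ge 1 - \varepsilon_k$ if $y_k = 1$
$w \cdot x_k + b \le -1 + \varepsilon_k$ if $y_k = -1$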
50
Learning Maximum Margin with Noise
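  • Given guess of w, b we can
  • Compute sum of distances of points to their
    correct zones
  • Compute the margin width
  • Assume R datapoints, each $(x_k, y_k)$ where $y_k = \pm 1$

[Figure: margin of width M with slack distances $\varepsilon_2$, $\varepsilon_7$, $\varepsilon_{11}$ drawn for points on the wrong side of their plane; the three parallel lines are $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$]
What should our quadratic optimization criterion be?
Minimize $\tfrac{1}{2} w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$
How many constraints will we have? R
What should they be?
$w \cdot x_k + b \ge 1 - \varepsilon_k$ if $y_k = 1$
$w \cdot x_k + b \le -1 + \varepsilon_k$ if $y_k = -1$
There's a bug in this QP. Can you spot it?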
51
Learning Maximum Margin with Noise
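  • Given guess of w, b we can
  • Compute sum of distances of points to their
    correct zones
  • Compute the margin width
  • Assume R datapoints, each $(x_k, y_k)$ where $y_k = \pm 1$

[Figure: margin of width M with slack distances $\varepsilon_2$, $\varepsilon_7$, $\varepsilon_{11}$ drawn for points on the wrong side of their plane; the three parallel lines are $w \cdot x + b = +1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$]
How many constraints will we have? 2R
What should they be?
$w \cdot x_k + b \ge 1 - \varepsilon_k$ if $y_k = 1$
$w \cdot x_k + b \le -1 + \varepsilon_k$ if $y_k = -1$
$\varepsilon_k \ge 0$ for all k
What should our quadratic optimization criterion be?
Minimize $\tfrac{1}{2} w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$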
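
For a fixed guess of w and b, the smallest slack values that satisfy all 2R constraints are $\varepsilon_k = \max(0,\; 1 - y_k (w \cdot x_k + b))$. The sketch below computes them and evaluates the noisy-data criterion, assumed here to be the standard form $\tfrac{1}{2} w \cdot w + C \sum_k \varepsilon_k$ (the formula itself is missing from the transcript). The data, w, b, C, and helper names are illustrative assumptions.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(data, w, b):
    """Smallest epsilon_k >= 0 with y_k * (w . x_k + b) >= 1 - epsilon_k."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data]


def soft_margin_criterion(data, w, b, C):
    """Assumed criterion: 0.5 * w . w + C * (sum of slacks)."""
    return 0.5 * dot(w, w) + C * sum(slacks(data, w, b))


# Hypothetical noisy data: the last point is on the wrong side of the boundary.
data = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-1.0, 0.0), -1), ((0.5, 0.0), -1)]
w, b, C = (1.0, 0.0), 0.0, 10.0

eps = slacks(data, w, b)
value = soft_margin_criterion(data, w, b, C)
print(eps, value)
```

A point inside the margin but correctly classified gets a slack between 0 and 1, while a misclassified point gets a slack above 1, so near misses cost less than disastrous errors.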
53
An Equivalent Dual QP
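Minimize $\tfrac{1}{2} w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$ subject to
$w \cdot x_k + b \ge 1 - \varepsilon_k$ if $y_k = 1$
$w \cdot x_k + b \le -1 + \varepsilon_k$ if $y_k = -1$
$\varepsilon_k \ge 0$, for all k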
Maximize
where
Subject to these constraints
54
An Equivalent Dual QP
Maximize
where
Subject to these constraints
Then classify with f(x,w,b) sign(w. x - b)
55
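Example: XOR problem revisited
Let the nonlinear mapping be
$\varphi(x) = (1,\; x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2)^T$
and
$\varphi(x_i) = (1,\; x_{i1}^2,\; \sqrt{2}\, x_{i1} x_{i2},\; x_{i2}^2,\; \sqrt{2}\, x_{i1},\; \sqrt{2}\, x_{i2})^T$
Therefore the feature space is 6-dimensional, with input data in 2D:
$x_1 = (-1, -1)$, $d_1 = -1$
$x_2 = (-1, 1)$, $d_2 = 1$
$x_3 = (1, -1)$, $d_3 = 1$
$x_4 = (1, 1)$, $d_4 = -1$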
56
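$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j d_i d_j \, \varphi(x_i)^T \varphi(x_j)$
$= \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \tfrac{1}{2} \bigl( 9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2 \bigr)$
To maximize Q, we only need to calculate $\partial Q(\alpha) / \partial \alpha_i = 0$ for $i = 1, \dots, 4$ (due to optimality conditions), which gives
$1 = 9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4$
$1 = -\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4$
$1 = -\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4$
$1 = \alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4$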
57
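The solution of which gives the optimal values
$\alpha_{0,1} = \alpha_{0,2} = \alpha_{0,3} = \alpha_{0,4} = 1/8$
$w_0 = \sum_i \alpha_{0,i} \, d_i \, \varphi(x_i) = \tfrac{1}{8} \bigl[ -\varphi(x_1) + \varphi(x_2) + \varphi(x_3) - \varphi(x_4) \bigr]$
where the first element of $w_0$ gives the bias b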
From earlier we have that the optimal hyperplane
is defined by $w_0^T \varphi(x) = 0$. That is,

$w_0^T \varphi(x) = -x_1 x_2 = 0$
which is the optimal decision boundary for the
XOR problem. Furthermore, we note that the
solution is unique since the optimal decision
boundary is unique
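
The XOR solution can be checked numerically without forming the 6-dimensional feature vectors, by using the kernel $K(x, x_i) = (1 + x^T x_i)^2$ in place of the inner product of feature vectors. The sketch assumes the four XOR inputs are (-1,-1), (-1,1), (1,-1), (1,1) with labels -1, +1, +1, -1 and that every multiplier equals 1/8; variable and function names are illustrative.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def kernel(u, v):
    """Degree-2 polynomial kernel (1 + u . v)^2."""
    return (1 + dot(u, v)) ** 2


X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # assumed XOR inputs
d = [-1, 1, 1, -1]                        # assumed XOR labels
alpha = [0.125] * 4                       # the claimed optimum, 1/8 each

# Optimality condition for each i: sum_j alpha_j d_i d_j K(x_i, x_j) = 1
stationarity = [
    sum(alpha[j] * d[i] * d[j] * kernel(X[i], X[j]) for j in range(4))
    for i in range(4)
]


def decision(x):
    """Kernel form of w0 . phi(x): sum_i alpha_i d_i K(x_i, x)."""
    return sum(a * di * kernel(xi, x) for a, di, xi in zip(alpha, d, X))


outputs = [decision(x) for x in X]
print(stationarity, outputs)
```

Each training point reproduces its own label, and evaluating `decision` at other points agrees with $-x_1 x_2$, the boundary function derived above.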
59
Output for polynomial RBF
60
Harder 1-dimensional dataset
Remember how permitting non-linear basis
functions made linear regression so much
nicer? Let's permit them here too
[Figure: 1-dimensional dataset on a number line, with x = 0 marked]
61
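For a non-linearly separable problem we have to
first map the data onto a feature space so that they
are linearly separable:
$x_i \rightarrow \varphi(x_i)$
Given the training data sample $\{(x_i, y_i),\; i = 1, \dots, N\}$,
find the optimum values of the weight vector w
and bias b:
$w = \sum_i \alpha_{0,i} \, y_i \, \varphi(x_i)$
where $\alpha_{0,i}$ are the optimal Lagrange multipliers determined
by maximizing the following objective function
$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \varphi(x_i)^T \varphi(x_j)$
subject to the constraints
$\sum_i \alpha_i y_i = 0$, $\alpha_i \ge 0$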
62
  • SVM building procedure:
  • Pick a nonlinear mapping $\varphi$
  • Solve for the optimal weight vector
  • However: how do we pick the function $\varphi$?
  • In practical applications, if it is not totally
    impossible to find $\varphi$, it is very hard
  • In the previous example, the function $\varphi$ is quite
    complex. How would we find it?
  • Answer: the Kernel Trick

63
Notice that in the dual problem the images of the
input vectors are only involved as an inner product,
meaning that the optimization can be performed in
the (lower dimensional) input space and that the
inner product can be replaced by an inner-product
kernel $K(x, x_i) = \varphi(x)^T \varphi(x_i)$.
How do we relate the output of
the SVM to the kernel K? Look at the equation of
the boundary in the feature space and use the
optimality conditions derived from the Lagrangian
formulations
64
(No Transcript)
65
(No Transcript)
66
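In the XOR problem, we chose to use the kernel function
$K(x, x_i) = (x^T x_i + 1)^2 = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$
which implied the form of our nonlinear functions
$\varphi(x) = (1,\; x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2)^T$
and
$\varphi(x_i) = (1,\; x_{i1}^2,\; \sqrt{2}\, x_{i1} x_{i2},\; x_{i2}^2,\; \sqrt{2}\, x_{i1},\; \sqrt{2}\, x_{i2})^T$
However, we did not need to calculate $\varphi$
at all and could simply have used the kernel to calculate
$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j d_i d_j \, K(x_i, x_j)$,
maximized it and solved for the $\alpha_i$, and
derived the hyperplane via $\sum_i \alpha_{0,i} \, d_i \, K(x, x_i) = 0$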
67
We therefore only need a suitable choice of
kernel function; cf. Mercer's Theorem:
Let $K(x, y)$ be a continuous symmetric kernel that is
defined in the closed interval $[a, b]$. The
kernel K can be expanded in the form
$K(x, y) = \varphi(x)^T \varphi(y)$
provided it is positive definite.
Some of the usual choices for K are:
Polynomial SVM: $(x^T x_i + 1)^p$, p specified by user
RBF SVM: $\exp\bigl(-\tfrac{1}{2\sigma^2} \|x - x_i\|^2\bigr)$, $\sigma$ specified by user
MLP SVM: $\tanh(s_0 \, x^T x_i + s_1)$
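
The three kernel choices listed above translate into short functions. This is a minimal sketch in plain Python; the function names are illustrative, and the parameters `p`, `sigma`, `s0`, and `s1` mirror the symbols in the list. Note that the tanh form is generally understood to satisfy Mercer's condition only for some parameter values, so it should be used with care.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def polynomial_kernel(x, xi, p):
    """(x . xi + 1)^p, with the degree p chosen by the user."""
    return (dot(x, xi) + 1) ** p


def rbf_kernel(x, xi, sigma):
    """exp(-||x - xi||^2 / (2 sigma^2)), with the width sigma chosen by the user."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mlp_kernel(x, xi, s0, s1):
    """tanh(s0 * x . xi + s1)."""
    return math.tanh(s0 * dot(x, xi) + s1)


a, b = (1.0, 2.0), (3.0, 4.0)
print(polynomial_kernel(a, b, 2), rbf_kernel(a, b, 1.0), mlp_kernel(a, b, 0.1, 0.0))
```

All three are symmetric in their two arguments, and the RBF kernel equals 1 exactly when the two inputs coincide.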
68
An Equivalent Dual QP
Maximize
where
Subject to these constraints
Datapoints with $\alpha_k > 0$ will be the support vectors
Then classify with $f(x, w, b) = \operatorname{sign}(w \cdot x - b)$
...so this sum only needs to be over the support
vectors.
69
Quadratic Basis Functions
[Figure: the basis vector $\Phi(x)$ with its entries grouped as Constant Term, Linear Terms, Pure Quadratic Terms, and Quadratic Cross-Terms]
  • Number of terms (assuming m input dimensions)
    = (m+2)-choose-2
  • = $(m+2)(m+1)/2$
  • = (as near as makes no difference) $m^2/2$
  • You may be wondering what those $\sqrt{2}$'s are doing.
  • You should be happy that they do no harm
  • You'll find out why they're there soon.
70
QP with basis functions
where
Subject to these constraints
Then define
Then classify with f(x,w,b) sign(w. f(x) - b)
71
QP with basis functions
where
We must do $R^2/2$ dot products to get this matrix
ready. Each dot product requires $m^2/2$ additions
and multiplications. The whole thing costs $R^2 m^2 / 4$.
Yeeks! ...or does it?
Subject to these constraints
Then define
Then classify with f(x,w,b) sign(w. f(x) - b)
72


Quadratic Dot Products

73
Just out of casual, innocent interest, let's
look at another function of a and b: $(a \cdot b + 1)^2$
Quadratic Dot Products
74
Just out of casual, innocent interest, let's
look at another function of a and b: $(a \cdot b + 1)^2$
Quadratic Dot Products
They're the same! And this is only O(m) to
compute!
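
The identity can be verified numerically. The sketch below builds the quadratic basis vector explicitly, assuming the layout described for the quadratic basis functions (a constant 1, the linear terms scaled by $\sqrt{2}$, the pure squares, and the cross-terms scaled by $\sqrt{2}$), and compares the explicit dot product with $(a \cdot b + 1)^2$. The function names and test vectors are illustrative.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def quadratic_features(a):
    """Explicit quadratic basis: 1, sqrt(2)*a_i, a_i^2, sqrt(2)*a_i*a_j for i < j."""
    m = len(a)
    root2 = math.sqrt(2.0)
    features = [1.0]
    features += [root2 * ai for ai in a]
    features += [ai * ai for ai in a]
    features += [root2 * a[i] * a[j] for i in range(m) for j in range(i + 1, m)]
    return features


def explicit_dot(a, b):
    """Dot product in the expanded space: roughly m^2/2 multiplications."""
    return dot(quadratic_features(a), quadratic_features(b))


def kernel_dot(a, b):
    """Same value from the input space: only O(m) work."""
    return (dot(a, b) + 1) ** 2


a = (1.0, 2.0, 3.0)
b = (4.0, -5.0, 6.0)
print(len(quadratic_features(a)), explicit_dot(a, b), kernel_dot(a, b))
```

For m = 3 the explicit vector has (m+2)(m+1)/2 = 10 entries; the $\sqrt{2}$ factors are exactly what makes the cross terms of the expanded square line up.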
75
QP with Quadratic basis functions
where
We must do $R^2/2$ dot products to get this matrix
ready. Each dot product now only requires m
additions and multiplications
Subject to these constraints
Then define
Then classify with f(x,w,b) sign(w. f(x) - b)
76
Higher Order Polynomials
Polynomial | $\Phi(x)$ | Cost to build $Q_{kl}$ matrix traditionally | Cost if 100 inputs | $\Phi(a) \cdot \Phi(b)$ | Cost to build $Q_{kl}$ matrix efficiently | Cost if 100 inputs
Quadratic | All $m^2/2$ terms up to degree 2 | $m^2 R^2 / 4$ | 2,500 $R^2$ | $(a \cdot b + 1)^2$ | $m R^2 / 2$ | 50 $R^2$
Cubic | All $m^3/6$ terms up to degree 3 | $m^3 R^2 / 12$ | 83,000 $R^2$ | $(a \cdot b + 1)^3$ | $m R^2 / 2$ | 50 $R^2$
Quartic | All $m^4/24$ terms up to degree 4 | $m^4 R^2 / 48$ | 1,960,000 $R^2$ | $(a \cdot b + 1)^4$ | $m R^2 / 2$ | 50 $R^2$
77
QP with Quintic basis functions
Maximize
where
Subject to these constraints
Then define
Then classify with f(x,w,b) sign(w. f(x) - b)
78
QP with Quintic basis functions
We must do $R^2/2$ dot products to get this matrix
ready. In 100-d, each dot product now needs 103
operations instead of 75 million. But there are
still worrying things lurking away. What are they?
Maximize
where
Subject to these constraints
  • The fear of overfitting with this enormous number
    of terms
  • The evaluation phase (doing a set of predictions
    on a test set) will be very expensive (why?)

Then define
Then classify with f(x,w,b) sign(w. f(x) - b)
79
QP with Quintic basis functions
We must do $R^2/2$ dot products to get this matrix
ready. In 100-d, each dot product now needs 103
operations instead of 75 million. But there are
still worrying things lurking away. What are they?
Maximize
where
The use of Maximum Margin magically makes this
not a problem
Subject to these constraints
  • The fear of overfitting with this enormous number
    of terms
  • The evaluation phase (doing a set of predictions
    on a test set) will be very expensive (why?)

Then define
Because each w. f(x) (see below) needs 75 million
operations. What can be done?
Then classify with f(x,w,b) sign(w. f(x) - b)
80
QP with Quintic basis functions
We must do $R^2/2$ dot products to get this matrix
ready. In 100-d, each dot product now needs 103
operations instead of 75 million. But there are
still worrying things lurking away. What are they?
Maximize
where
Subject to these constraints
  • The fear of overfitting with this enormous number
    of terms
    (The use of Maximum Margin magically makes this not a problem)
  • The evaluation phase (doing a set of predictions
    on a test set) will be very expensive (why?)
    Because each $w \cdot \Phi(x)$ (see below) needs 75 million
    operations. What can be done?
    Only $S \cdot m$ operations ($S$ = number of support vectors)

Then define
Then classify with $f(x, w, b) = \operatorname{sign}(w \cdot \Phi(x) - b)$
82
QP with Quintic basis functions
where
Why SVMs don't overfit as much as you'd think: No
matter what the basis function, there are really
only up to R parameters: $\alpha_1, \alpha_2, \dots, \alpha_R$, and
usually most are set to zero by the Maximum
Margin. Asking for small $w \cdot w$ is like weight
decay in Neural Nets, like Ridge Regression
parameters in Linear regression, and like the use
of Priors in Bayesian Regression: all designed
to smooth the function and reduce overfitting.
Subject to these constraints
Then define
Only $S \cdot m$ operations ($S$ = number of support vectors)
Then classify with $f(x, w, b) = \operatorname{sign}(w \cdot \Phi(x) - b)$
83
SVM Kernel Functions
  • $K(a, b) = (a \cdot b + 1)^d$ is an example of an SVM Kernel
    Function
  • Beyond polynomials there are other very high
    dimensional basis functions that can be made
    practical by finding the right Kernel Function
  • Radial-Basis-style Kernel Function
  • Neural-net-style Kernel Function

$\sigma$, $\kappa$ and $\delta$ are magic parameters that must be
chosen by a model selection method such as CV or
VCSRM
84
SVM Implementations
  • Sequential Minimal Optimization (SMO), an efficient
    implementation of SVMs (Platt)
  • in Weka
  • SVMlight
  • http://svmlight.joachims.org/

85
References
  • Tutorial on VC-dimension and Support Vector
    Machines
  • C. Burges. A tutorial on support vector machines
    for pattern recognition. Data Mining and
    Knowledge Discovery, 2(2):121-167, 1998.
    http://citeseer.nj.nec.com/burges98tutorial.html
  • The VC/SRM/SVM Bible
  • Statistical Learning Theory by Vladimir Vapnik,
    Wiley-Interscience, 1998