Title: A Review of Our Course
1. A Review of Our Course: Classification and Regression
- The perceptron algorithm: primal vs. dual form (a minimal sketch follows below)
- An online and mistake-driven procedure
- Converges when the problem is linearly separable
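A minimal sketch of the primal perceptron described above (the data layout, stopping rule, and epoch cap are illustrative assumptions, not part of the slides):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Online, mistake-driven primal perceptron.
    X: (m, n) training points; y: labels in {-1, +1}.
    Stops making updates only if the data are linearly separable."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            if y[i] * (X[i] @ w + b) <= 0:   # a mistake triggers an update
                w += y[i] * X[i]
                b += y[i]
                mistakes += 1
        if mistakes == 0:                    # a full pass with no mistakes: done
            break
    return w, b
```

In the dual form the same algorithm keeps a count $\alpha_i$ of how many times point $i$ caused an update, and the weight vector is recovered as $w = \sum_i \alpha_i y_i x_i$.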
2. Classification Problem: 2-Category Linearly Separable Case
(Figure: benign and malignant training points separated by a hyperplane)
3. Algebra of the Classification Problem: Linearly Separable Case
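A standard way to write the algebra this slide refers to, using the bounding-plane notation that appears in the next slides ($A_+$ and $A_-$ stack the points of the two classes as rows; the symbols $D$, $e$, and $\gamma$ are the conventional choices here, not text recovered from the slide):
$$A_+ w \ge e\gamma + e, \qquad A_- w \le e\gamma - e, \qquad \text{equivalently} \qquad D(Aw - e\gamma) \ge e,$$
where $A$ stacks all training points, $D$ is the diagonal matrix of $\pm 1$ labels, and $e$ is a vector of ones.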
4. Robust Linear Programming: A Preliminary Approach to SVM
5. Support Vector Machines: Maximizing the Margin between Bounding Planes
(Figure: the bounding planes for the two point sets A+ and A-, with the margin between them)
6. Support Vector Classification (Linearly Separable Case, Primal)
The hyperplane $(w, b)$ that solves the minimization problem
$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i(\langle w, x_i\rangle + b) \ge 1,\quad i = 1,\dots,\ell,$$
realizes the maximal margin hyperplane with geometric margin $\gamma = 1/\|w\|_2$.
7. Soft Margin SVM (Nonseparable Case)
- If the data are not linearly separable:
  - the primal problem is infeasible
  - the dual problem is unbounded above
- Introduce a slack variable $\xi_i \ge 0$ for each training point
- The inequality system $y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i$ is then always feasible, e.g. $w = 0$, $b = 0$, $\xi_i = 1$ for all $i$
8. (No transcript)
9. Two Different Measures of Training Error
- 2-norm soft margin: $\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\sum_{i=1}^{\ell}\xi_i^2$
- 1-norm soft margin: $\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{\ell}\xi_i$
both subject to $y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i$ (with $\xi \ge 0$ in the 1-norm case).
10. Optimization Problem Formulation
Problem setting: given functions $f$, $g_i$ ($i = 1, \dots, k$) and $h_j$ ($j = 1, \dots, m$) defined on a domain $\Omega \subseteq \mathbb{R}^n$,
$$\min_{x \in \Omega}\ f(x) \quad \text{subject to} \quad g_i(x) \le 0,\ \ h_j(x) = 0,$$
where $f(x)$ is called the objective function and $g_i(x) \le 0$, $h_j(x) = 0$ are called constraints.
11. Definitions and Notation
12. Definitions and Notation
An inequality constraint $g_i(x) \le 0$ can be rewritten as an equality $g_i(x) + s_i = 0$ with $s_i \ge 0$, where $s_i$ is called the slack variable.
13. Definitions and Notation
- Removing an inactive constraint from an optimization problem will NOT affect the optimal solution
  - a very useful feature in SVM
- If the feasible region is the whole space, the problem is an unconstrained minimization problem
  - the least squares problem is in this category
  - the SSVM formulation is in this category
- It is difficult to find the global minimum without a convexity assumption
14. Gradient and Hessian
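For a twice continuously differentiable $f: \mathbb{R}^n \to \mathbb{R}$, the two objects this slide covers are
$$\nabla f(x) = \Big[\frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x)\Big]^{\top}, \qquad \big[\nabla^2 f(x)\big]_{ij} = \frac{\partial^2 f}{\partial x_i\,\partial x_j}(x).$$
For example, the quadratic $f(x) = \tfrac{1}{2}x^\top A x - b^\top x$ with symmetric $A$ has $\nabla f(x) = Ax - b$ and $\nabla^2 f(x) = A$.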
15. The Most Important Concept in Optimization (Minimization)
- A point is an optimal solution of an unconstrained minimization problem if there exists no descent direction
- A point is an optimal solution of a constrained minimization problem if there exists no feasible descent direction
- A descent direction might exist, but moving along it would leave the feasible region
16. Two Important Algorithms for the Unconstrained Minimization Problem
- Steepest descent with exact line search (a minimal sketch follows below)
- Newton's method
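A minimal sketch of steepest descent with exact line search on a convex quadratic, where the exact step length has a closed form (the test problem is made up for illustration):

```python
import numpy as np

def steepest_descent_quadratic(A, b, x0, tol=1e-8, max_iter=500):
    """Minimize f(x) = 1/2 x'Ax - b'x (A symmetric positive definite)
    by steepest descent; the exact line-search step is alpha = g'g / g'Ag."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = A @ x - b                        # gradient of f at x
        if np.linalg.norm(g) < tol:          # no descent direction left
            break
        alpha = (g @ g) / (g @ (A @ g))      # exact minimizer along -g
        x = x - alpha * g
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(steepest_descent_quadratic(A, b, x0=np.zeros(2)))  # approaches A^{-1} b
```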
17. Linear Program and Quadratic Program
- An optimization problem in which the objective function and all constraints are linear functions is called a linear programming problem
- If the objective function is convex quadratic while the constraints are all linear, the problem is called a convex quadratic programming problem (generic forms of both are shown below)
- The standard SVM formulation is in this category
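In generic form (the symbols are illustrative), the two problem classes are
$$\text{LP:}\quad \min_x\ c^\top x\ \ \text{s.t.}\ \ Ax \le b, \qquad\qquad \text{convex QP:}\quad \min_x\ \tfrac{1}{2}x^\top Q x + c^\top x\ \ \text{s.t.}\ \ Ax \le b,\ \ Q \succeq 0.$$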
18. Lagrangian Dual Problem
For the primal problem $\min_{x \in \Omega} f(x)$ subject to $g(x) \le 0$, $h(x) = 0$, the Lagrangian dual problem is
$$\max_{u,\,v}\ \min_{x \in \Omega}\ f(x) + u^\top g(x) + v^\top h(x) \quad \text{subject to} \quad u \ge 0.$$
19. Lagrangian Dual Problem
Equivalently, $\max_{u \ge 0,\,v}\ \theta(u, v)$, where $\theta(u, v) = \min_{x \in \Omega}\ f(x) + u^\top g(x) + v^\top h(x)$ is the Lagrangian dual function.
20Weak Duality Theorem
be a feasible solution of the primal
Let
problem and
a feasible solution of the
dual problem. Then
21. Saddle Point of the Lagrangian
Let $(x^*, u^*, v^*)$ with $x^* \in \Omega$, $u^* \ge 0$ satisfy
$$L(x^*, u, v) \le L(x^*, u^*, v^*) \le L(x, u^*, v^*) \qquad \forall\, x \in \Omega,\ \forall\, u \ge 0.$$
Then $(x^*, u^*, v^*)$ is called a saddle point of the Lagrangian function.
22. Dual Problem of a Linear Program
Primal LP: $\min_x\ c^\top x$ subject to $Ax \ge b,\ x \ge 0$
Dual LP: $\max_u\ b^\top u$ subject to $A^\top u \le c,\ u \ge 0$
- All duality theorems hold and work perfectly!
23. Dual Problem of a Strictly Convex Quadratic Program
Primal QP: $\min_x\ \tfrac{1}{2}x^\top Q x + c^\top x$ subject to $Ax \le b$, with $Q$ positive definite.
With the strictly convex assumption we can solve $\nabla_x L(x, u) = 0$ for $x = -Q^{-1}(c + A^\top u)$, and we have the
Dual QP: $\max_{u \ge 0}\ -\tfrac{1}{2}(c + A^\top u)^\top Q^{-1}(c + A^\top u) - b^\top u$.
25. Support Vector Classification (Linearly Separable Case, Dual Form)
The dual problem of the previous mathematical program:
$$\max_{\alpha}\ \sum_{i=1}^{\ell}\alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j\,\alpha_i\alpha_j\langle x_i, x_j\rangle \quad \text{subject to} \quad \sum_{i=1}^{\ell} y_i\alpha_i = 0,\ \ \alpha \ge 0.$$
Don't forget $\alpha \ge 0$!
26. Dual Representation of SVM (Key of Kernel Methods)
The hypothesis is determined by $(\alpha^*, b^*)$:
$$h(x) = \operatorname{sgn}\Big(\sum_{i=1}^{\ell} y_i\alpha_i^*\langle x_i, x\rangle + b^*\Big),$$
so only the training points with $\alpha_i^* > 0$ (the support vectors) enter the decision function.
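A quick numerical check of this dual representation (assuming scikit-learn is available; the data and parameters are made up for illustration). The fitted hypothesis is reproduced exactly from the support vectors, their dual coefficients $y_i\alpha_i^*$, and $b^*$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# f(x) = sum_i (y_i alpha_i) <x_i, x> + b, summed over support vectors only
f_dual = (clf.dual_coef_ @ (clf.support_vectors_ @ X.T)).ravel() + clf.intercept_
print(np.allclose(f_dual, clf.decision_function(X)))  # True
```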
27. Learning in Feature Space (Could Simplify the Classification Task)
- Learning in a high-dimensional space could degrade generalization performance
  - this phenomenon is called the curse of dimensionality
- We may not even know the dimensionality of the feature space
- We have to deal with a huge and dense kernel matrix
  - a reduced kernel can avoid this difficulty
28. (No transcript)
29. Kernel Technique
Based on Mercer's condition (1909)
- The value of the kernel function represents the inner product of two points in feature space
- Kernel functions merge two steps:
  1. map the input data from input space to feature space (which might be infinite-dimensional)
  2. compute the inner product in the feature space
30. Linear Machine in Feature Space
A linear machine in feature space: $f(x) = \sum_{j} w_j\,\varphi_j(x) + b$.
Make it in the dual form: $f(x) = \sum_{i=1}^{\ell} \alpha_i y_i\,\langle \varphi(x_i), \varphi(x)\rangle + b$.
31. Kernels Represent Inner Products in Feature Space
Definition: a kernel is a function $K: X \times X \to \mathbb{R}$ such that for all $x, z \in X$
$$K(x, z) = \langle \varphi(x), \varphi(z)\rangle,$$
where $\varphi$ is a mapping from the input space $X$ to an (inner product) feature space $F$.
The classifier will become $f(x) = \operatorname{sgn}\big(\sum_{i=1}^{\ell}\alpha_i y_i K(x_i, x) + b\big)$.
32. A Simple Example of a Kernel: the Polynomial Kernel of Degree 2
Let $x, z \in \mathbb{R}^2$, $K(x, z) = \langle x, z\rangle^2$, and the nonlinear map $\varphi: \mathbb{R}^2 \to \mathbb{R}^3$ be defined by $\varphi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)$. Then $\langle \varphi(x), \varphi(z)\rangle = \langle x, z\rangle^2 = K(x, z)$.
- There are many other nonlinear maps, $\psi(x)$, that satisfy the relation $\langle \psi(x), \psi(z)\rangle = \langle x, z\rangle^2$.
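A numerical check of this example (illustrative values only): the kernel computed in input space agrees with the inner product computed through the explicit map.

```python
import numpy as np

def phi(x):
    """One explicit degree-2 feature map for 2-D input (one of many valid choices)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z) ** 2)        # kernel value in input space -> 1.0
print(phi(x) @ phi(z))     # inner product in feature space -> 1.0
```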
33. Power of the Kernel Technique
Consider a nonlinear map $\varphi: \mathbb{R}^n \to \mathbb{R}^p$ that consists of distinct features of all the monomials of degree $d$. Then $p = \binom{n + d - 1}{d}$. For example, $n = 100$ and $d = 4$ already give $p = \binom{103}{4} = 4{,}421{,}275$ features.
- Is it necessary to compute $\varphi(x)$ explicitly? No, we only need to know $K(x, z) = \langle x, z\rangle^d$!
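A small illustration of the count above versus the cost of one kernel evaluation (the numbers are only an example):

```python
import math
import numpy as np

n, d = 100, 4
print(math.comb(n + d - 1, d))   # 4421275 distinct degree-4 monomial features

x, z = np.random.rand(n), np.random.rand(n)
print((x @ z) ** d)              # one kernel evaluation: O(n) work, no explicit map
```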
34. 2-Norm Soft Margin: Dual Formulation
The Lagrangian for the 2-norm soft margin:
$$L(w, b, \xi, \alpha) = \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\sum_{i=1}^{\ell}\xi_i^2 + \sum_{i=1}^{\ell}\alpha_i\big(1 - \xi_i - y_i(\langle w, x_i\rangle + b)\big), \qquad \alpha \ge 0.$$
Setting the partial derivatives with respect to the primal variables to zero:
$$w = \sum_{i=1}^{\ell} y_i\alpha_i x_i, \qquad \sum_{i=1}^{\ell} y_i\alpha_i = 0, \qquad C\,\xi_i = \alpha_i.$$
35. Dual Maximization Problem for the 2-Norm Soft Margin
Dual:
$$\max_{\alpha}\ \sum_{i=1}^{\ell}\alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j\,\alpha_i\alpha_j\Big(\langle x_i, x_j\rangle + \tfrac{1}{C}\delta_{ij}\Big) \quad \text{subject to} \quad \sum_{i=1}^{\ell} y_i\alpha_i = 0,\ \ \alpha \ge 0.$$
- The corresponding KKT complementarity condition: $\alpha_i\big(y_i(\langle w, x_i\rangle + b) - 1 + \xi_i\big) = 0$, $i = 1, \dots, \ell$
- Use the above conditions to find $b^*$
36. Introducing the Kernel in the Dual Formulation for the 2-Norm Soft Margin
- In the feature space implicitly defined by the kernel $K(x, z)$, let $\alpha^*$ solve the QP problem
$$\max_{\alpha}\ \sum_{i=1}^{\ell}\alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j\,\alpha_i\alpha_j\Big(K(x_i, x_j) + \tfrac{1}{C}\delta_{ij}\Big) \quad \text{subject to} \quad \sum_{i=1}^{\ell} y_i\alpha_i = 0,\ \ \alpha \ge 0.$$
- Then the decision rule is defined by $\operatorname{sgn}\big(\sum_{i=1}^{\ell} y_i\alpha_i^* K(x_i, x) + b^*\big)$
- Use the above conditions to find $b^*$
37. Introducing the Kernel in the Dual Formulation for the 2-Norm Soft Margin
$b^*$ is chosen so that
$$y_j\Big(\sum_{i=1}^{\ell} y_i\alpha_i^* K(x_i, x_j) + b^*\Big) = 1 - \frac{\alpha_j^*}{C}$$
for any $j$ with $\alpha_j^* \ne 0$, because $\xi_j^* = \alpha_j^*/C$ and the KKT complementarity condition forces $y_j f(x_j) = 1 - \xi_j^*$ whenever $\alpha_j^* > 0$.
38. Sequential Minimal Optimization (SMO)
- Deals with the equality constraint and the box constraints of the dual problem
- Works on the smallest possible working set (only 2 points)
- Finds the optimal solution by changing only the values in the working set
- The two-variable subproblem can be solved analytically
39. Analytical Solution for Two Points
- We have one more restriction on changing $\alpha_i$ and $\alpha_j$: the equality constraint forces $y_i\alpha_i + y_j\alpha_j$ to remain constant
40. A Restrictive Constraint on the New $\alpha_j$
$L \le \alpha_j^{\text{new}} \le H$, where
$L = \max(0,\ \alpha_j - \alpha_i)$ and $H = \min(C,\ C + \alpha_j - \alpha_i)$ if $y_i \ne y_j$, and
$L = \max(0,\ \alpha_i + \alpha_j - C)$ and $H = \min(C,\ \alpha_i + \alpha_j)$ if $y_i = y_j$.
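A minimal sketch of the analytic two-point update, written here for the 1-norm soft margin dual with box constraint $0 \le \alpha \le C$ (the working-set choice, bias update, and stopping heuristics of the full SMO algorithm are omitted, and the variable names are illustrative):

```python
import numpy as np

def smo_two_point_update(alpha, i, j, y, C, K):
    """One analytic update of (alpha_i, alpha_j) that keeps sum_k y_k alpha_k fixed.
    K is the precomputed kernel matrix; y holds labels in {-1, +1}."""
    f = (alpha * y) @ K                       # decision values without bias
    E_i, E_j = f[i] - y[i], f[j] - y[j]       # bias cancels in E_i - E_j

    if y[i] != y[j]:                          # feasible interval for the new alpha_j
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the update direction
    if eta <= 0 or L >= H:
        return alpha                          # skip degenerate pairs in this sketch

    a_j = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)   # restore the equality constraint

    new_alpha = alpha.copy()
    new_alpha[i], new_alpha[j] = a_i, a_j
    return new_alpha
```

Repeatedly applying this update to well-chosen pairs, together with a bias update, is what the full SMO algorithm does.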
41. ε-Support Vector Regression (Linear Case)
- Some tiny errors should be discarded
42. ε-Insensitive Loss Function (Tiny Errors Should Be Discarded)
43. (No transcript)
44. Five Popular Loss Functions
45. ε-Insensitive Loss Regression
- Linear ε-insensitive loss function:
$$L^{\varepsilon}(x, y, f) = |y - f(x)|_{\varepsilon} = \max\big(0,\ |y - f(x)| - \varepsilon\big),$$
where $f$ is a real-valued function on the input domain and $(x, y)$ is a training point.
- Quadratic ε-insensitive loss function: $L^{\varepsilon}_2(x, y, f) = \big(|y - f(x)|_{\varepsilon}\big)^2$
46. ε-Insensitive Support Vector Regression Model
- As in SVM, $\|w\|_2$ should be as small as possible
- Some tiny errors should be discarded, where an error smaller than ε contributes nothing: the loss of a training point is $|y_i - f(x_i)|_{\varepsilon} = \max\big(0,\ |y_i - f(x_i)| - \varepsilon\big)$
47. Why Minimize $\|w\|$? Probably Approximately Correct (pac)
Consider performing linear regression for any training data distribution; pac learning theory then bounds the expected error in terms of $\|w\|$, and the bound becomes tighter as $\|w\|$ becomes smaller.
- Occam's razor: the simplest is the best
48. Reformulated ε-SVR as a Constrained Minimization Problem
$$\min_{w,\,b,\,\xi,\,\xi^*}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*)$$
subject to
$$y_i - (\langle w, x_i\rangle + b) \le \varepsilon + \xi_i, \qquad (\langle w, x_i\rangle + b) - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i,\ \xi_i^* \ge 0.$$
- This is a minimization problem with $n + 1 + 2m$ variables and $2m$ constraints
- It enlarges the problem size and the computational complexity of solving the problem
49. SV Regression by Minimizing the Quadratic ε-Insensitive Loss
- We have the following unconstrained problem:
$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\sum_{i=1}^{m}\big|y_i - (\langle w, x_i\rangle + b)\big|_{\varepsilon}^{2},$$
where $|t|_{\varepsilon} = \max(0,\ |t| - \varepsilon)$.
50. Primal Formulation of SVR for the Quadratic ε-Insensitive Loss
$$\min_{w,\,b,\,\xi,\,\xi^*}\ \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\sum_{i=1}^{m}(\xi_i^2 + \xi_i^{*2})$$
subject to
$$y_i - (\langle w, x_i\rangle + b) \le \varepsilon + \xi_i, \qquad (\langle w, x_i\rangle + b) - y_i \le \varepsilon + \xi_i^*.$$
- Extremely important: at the solution $\xi_i\,\xi_i^* = 0$, i.e. a point cannot lie above and below the ε-tube at the same time.
51. Simplified Dual Formulation of SVR
$$\max_{\beta}\ \sum_{i=1}^{m} y_i\beta_i - \varepsilon\sum_{i=1}^{m}|\beta_i| - \tfrac{1}{2}\sum_{i,j=1}^{m}\beta_i\beta_j\Big(\langle x_i, x_j\rangle + \tfrac{1}{C}\delta_{ij}\Big) \quad \text{subject to} \quad \sum_{i=1}^{m}\beta_i = 0.$$
- In the case $\varepsilon = 0$, the problem becomes least squares linear regression with a weight decay factor
52. Kernel in the Dual Formulation for SVR
In the feature space implicitly defined by the kernel $K(x, z)$, let $\beta^*$ solve
$$\max_{\beta}\ \sum_{i=1}^{m} y_i\beta_i - \varepsilon\sum_{i=1}^{m}|\beta_i| - \tfrac{1}{2}\sum_{i,j=1}^{m}\beta_i\beta_j\Big(K(x_i, x_j) + \tfrac{1}{C}\delta_{ij}\Big) \quad \text{subject to} \quad \sum_{i=1}^{m}\beta_i = 0.$$
- Then the regression function is defined by $f(x) = \sum_{i=1}^{m}\beta_i^* K(x_i, x) + b^*$, where $b^*$ is chosen such that $f(x_j) - y_j = -\varepsilon - \beta_j^*/C$ for any $j$ with $\beta_j^* > 0$.
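A quick illustration (assuming scikit-learn is available; data and parameter values are made up): an ε-insensitive SVR with an RBF kernel, where the points with nonzero dual coefficients are the support vectors that define the regression function.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", model.support_.size, "of", len(X))
print("prediction at x = 2.5:", model.predict([[2.5]]))
```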
53. Probably Approximately Correct Learning (pac Model)
- Key assumption: training and test examples are drawn i.i.d. according to a fixed but unknown distribution $D$
- When we evaluate the quality of a hypothesis (classification function) $h$ we should take the unknown distribution $D$ into account (i.e. the average error, or expected error, made by $h$)
- We call such a measure the risk functional and denote it as $\operatorname{err}_D(h) = D\big(\{(x, y) : h(x) \ne y\}\big)$
54. Generalization Error of the pac Model
55. Probably Approximately Correct
We want a hypothesis whose error is small with high probability: $\Pr\big(\operatorname{err}_D(h) \le \varepsilon\big) \ge 1 - \delta$, or equivalently $\Pr\big(\operatorname{err}_D(h) > \varepsilon\big) < \delta$.
56. Find the Hypothesis with Minimum Expected Risk?
Ideally, the learned hypothesis should have the smallest expected risk over the unknown distribution, $h^* = \arg\min_{h \in H}\operatorname{err}_D(h)$.
Unrealistic!!! The distribution $D$ is unknown.
57. Empirical Risk Minimization (ERM)
Replace the expected risk by the empirical risk measured on the training set,
$$\operatorname{err}_{\mathrm{emp}}(h) = \frac{1}{\ell}\sum_{i=1}^{\ell}\mathbf{1}\big[h(x_i) \ne y_i\big]$$
($D$ and $\operatorname{err}_D(h)$ are not needed).
- Focusing only on the empirical risk will cause overfitting
58. Overfitting
Overfitting is the phenomenon in which the resulting function fits the training set too well but does not have good prediction performance on unseen data.
(Figure: the red dots are 8 points generated by f(x) with random noise; a nonlinear regression curve passes exactly through these 8 points.)
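A tiny numeric illustration of the same phenomenon (the target function, the noise level, and the polynomial degree are made-up choices, not the slide's exact figure): a degree-7 polynomial that passes through all 8 noisy points has essentially zero training error but a much larger error on unseen points.

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)               # hypothetical target function

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 8)
y_train = true_f(x_train) + 0.15 * rng.standard_normal(8)   # the "red dots"

coef = np.polyfit(x_train, y_train, deg=7)     # interpolates all 8 points

x_test = np.linspace(0, 1, 200)
train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coef, x_test) - true_f(x_test)) ** 2)
print(f"train MSE ~ {train_mse:.1e}, test MSE ~ {test_mse:.1e}")
```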
59. Tuning Procedure
The final value of the parameter is the one with the maximum testing set correctness!
60. VC Confidence
(The Bound between the Expected Risk and the Empirical Risk)
C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2(2) (1998), pp. 121-167.
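The bound referred to here, as stated in the cited Burges tutorial: with probability $1 - \eta$,
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\big(\log(2\ell/h) + 1\big) - \log(\eta/4)}{\ell}},$$
where $\ell$ is the number of training points, $h$ is the VC dimension of the hypothesis family, and the square-root term is the VC confidence.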
61. Capacity (Complexity) of a Hypothesis Space: the VC-dimension
62. Shattering Points with Hyperplanes in $\mathbb{R}^n$
Can you always shatter three points with a line in $\mathbb{R}^2$? (Three points in general position can; three collinear points cannot.)
63. Definition of VC-dimension
- The Vapnik-Chervonenkis dimension, $\mathrm{VC}(H)$, of a hypothesis space $H$ defined over the input space $X$ is the size of the largest finite subset of $X$ shattered by $H$ (provided such a largest finite subset exists).