Title: A Review of Our Course
1. A Review of Our Course: Classification and Regression
- The perceptron algorithm: primal vs. dual form (a minimal sketch follows below)
- An online and mistake-driven procedure
- Converges when the problem is linearly separable
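A minimal sketch of the primal perceptron described above (the data layout, stopping rule, and epoch cap are illustrative assumptions, not part of the slides):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Online, mistake-driven primal perceptron.
    X: (m, n) training points; y: labels in {-1, +1}.
    Stops making updates only if the data are linearly separable."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            if y[i] * (X[i] @ w + b) <= 0:   # a mistake triggers an update
                w += y[i] * X[i]
                b += y[i]
                mistakes += 1
        if mistakes == 0:                    # a full pass with no mistakes: done
            break
    return w, b
```

In the dual form the same algorithm keeps a count $\alpha_i$ of how many times point $i$ caused an update, and the weight vector is recovered as $w = \sum_i \alpha_i y_i x_i$.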
2. Classification Problem: 2-Category Linearly Separable Case
(Figure: benign and malignant training points separated by a hyperplane)
3. Algebra of the Classification Problem: Linearly Separable Case
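A standard way to write the algebra this slide refers to, using the bounding-plane notation that appears in the next slides ($A_+$ and $A_-$ stack the points of the two classes as rows; the symbols $D$, $e$, and $\gamma$ are the conventional choices here, not text recovered from the slide):
$$A_+ w \ge e\gamma + e, \qquad A_- w \le e\gamma - e, \qquad \text{equivalently} \qquad D(Aw - e\gamma) \ge e,$$
where $A$ stacks all training points, $D$ is the diagonal matrix of $\pm 1$ labels, and $e$ is a vector of ones.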
4. Robust Linear Programming: A Preliminary Approach to SVM
5. Support Vector Machines: Maximizing the Margin between Bounding Planes
(Figure: the bounding planes for the two point sets A+ and A-, with the margin between them)
6. Support Vector Classification (Linearly Separable Case, Primal)
The hyperplane $(w, b)$ that solves the minimization problem
$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i(\langle w, x_i\rangle + b) \ge 1,\quad i = 1,\dots,\ell,$$
realizes the maximal margin hyperplane with geometric margin $\gamma = 1/\|w\|_2$.
7. Soft Margin SVM (Nonseparable Case)
- If the data are not linearly separable:
  - the primal problem is infeasible
  - the dual problem is unbounded above
- Introduce a slack variable $\xi_i \ge 0$ for each training point
- The inequality system $y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i$ is then always feasible, e.g. $w = 0$, $b = 0$, $\xi_i = 1$ for all $i$
8. (No transcript)
9. Two Different Measures of Training Error
- 2-norm soft margin: $\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\sum_{i=1}^{\ell}\xi_i^2$
- 1-norm soft margin: $\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{\ell}\xi_i$
both subject to $y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i$ (with $\xi \ge 0$ in the 1-norm case).
10. Optimization Problem Formulation
Problem setting: given functions $f$, $g_i$ ($i = 1, \dots, k$) and $h_j$ ($j = 1, \dots, m$) defined on a domain $\Omega \subseteq \mathbb{R}^n$,
$$\min_{x \in \Omega}\ f(x) \quad \text{subject to} \quad g_i(x) \le 0,\ \ h_j(x) = 0,$$
where $f(x)$ is called the objective function and $g_i(x) \le 0$, $h_j(x) = 0$ are called constraints.
11. Definitions and Notation
12. Definitions and Notation
An inequality constraint $g_i(x) \le 0$ can be rewritten as an equality $g_i(x) + s_i = 0$ with $s_i \ge 0$, where $s_i$ is called the slack variable.
13. Definitions and Notation
- Removing an inactive constraint from an optimization problem will NOT affect the optimal solution
  - a very useful feature in SVM
- If the feasible region is the whole space, the problem is an unconstrained minimization problem
  - the least squares problem is in this category
  - the SSVM formulation is in this category
- It is difficult to find the global minimum without a convexity assumption
14. Gradient and Hessian
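For a twice continuously differentiable $f: \mathbb{R}^n \to \mathbb{R}$, the two objects this slide covers are
$$\nabla f(x) = \Big[\frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x)\Big]^{\top}, \qquad \big[\nabla^2 f(x)\big]_{ij} = \frac{\partial^2 f}{\partial x_i\,\partial x_j}(x).$$
For example, the quadratic $f(x) = \tfrac{1}{2}x^\top A x - b^\top x$ with symmetric $A$ has $\nabla f(x) = Ax - b$ and $\nabla^2 f(x) = A$.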
15. The Most Important Concept in Optimization (Minimization)
- A point is an optimal solution of an unconstrained minimization problem if there exists no descent direction
- A point is an optimal solution of a constrained minimization problem if there exists no feasible descent direction
- A descent direction might exist, but moving along it would leave the feasible region
16. Two Important Algorithms for the Unconstrained Minimization Problem
- Steepest descent with exact line search (a minimal sketch follows below)
- Newton's method
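A minimal sketch of steepest descent with exact line search on a convex quadratic, where the exact step length has a closed form (the test problem is made up for illustration):

```python
import numpy as np

def steepest_descent_quadratic(A, b, x0, tol=1e-8, max_iter=500):
    """Minimize f(x) = 1/2 x'Ax - b'x (A symmetric positive definite)
    by steepest descent; the exact line-search step is alpha = g'g / g'Ag."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = A @ x - b                        # gradient of f at x
        if np.linalg.norm(g) < tol:          # no descent direction left
            break
        alpha = (g @ g) / (g @ (A @ g))      # exact minimizer along -g
        x = x - alpha * g
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(steepest_descent_quadratic(A, b, x0=np.zeros(2)))  # approaches A^{-1} b
```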
17. Linear Program and Quadratic Program
- An optimization problem in which the objective function and all constraints are linear functions is called a linear programming problem
- If the objective function is convex quadratic while the constraints are all linear, the problem is called a convex quadratic programming problem (generic forms of both are shown below)
- The standard SVM formulation is in this category
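In generic form (the symbols are illustrative), the two problem classes are
$$\text{LP:}\quad \min_x\ c^\top x\ \ \text{s.t.}\ \ Ax \le b, \qquad\qquad \text{convex QP:}\quad \min_x\ \tfrac{1}{2}x^\top Q x + c^\top x\ \ \text{s.t.}\ \ Ax \le b,\ \ Q \succeq 0.$$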
18. Lagrangian Dual Problem
For the primal problem $\min_{x \in \Omega} f(x)$ subject to $g(x) \le 0$, $h(x) = 0$, the Lagrangian dual problem is
$$\max_{u,\,v}\ \min_{x \in \Omega}\ f(x) + u^\top g(x) + v^\top h(x) \quad \text{subject to} \quad u \ge 0.$$
19. Lagrangian Dual Problem
Equivalently, $\max_{u \ge 0,\,v}\ \theta(u, v)$, where $\theta(u, v) = \min_{x \in \Omega}\ f(x) + u^\top g(x) + v^\top h(x)$ is the Lagrangian dual function.
20Weak Duality Theorem
be a feasible solution of the primal
Let
problem and
a feasible solution of the
dual problem. Then
21. Saddle Point of the Lagrangian
Let $(x^*, u^*, v^*)$ with $x^* \in \Omega$, $u^* \ge 0$ satisfy
$$L(x^*, u, v) \le L(x^*, u^*, v^*) \le L(x, u^*, v^*) \qquad \forall\, x \in \Omega,\ \forall\, u \ge 0.$$
Then $(x^*, u^*, v^*)$ is called a saddle point of the Lagrangian function.
22. Dual Problem of a Linear Program
Primal LP: $\min_x\ c^\top x$ subject to $Ax \ge b,\ x \ge 0$
Dual LP: $\max_u\ b^\top u$ subject to $A^\top u \le c,\ u \ge 0$
- All duality theorems hold and work perfectly!
23. Dual Problem of a Strictly Convex Quadratic Program
Primal QP: $\min_x\ \tfrac{1}{2}x^\top Q x + c^\top x$ subject to $Ax \le b$, with $Q$ positive definite.
With the strictly convex assumption we can solve $\nabla_x L(x, u) = 0$ for $x = -Q^{-1}(c + A^\top u)$, and we have the
Dual QP: $\max_{u \ge 0}\ -\tfrac{1}{2}(c + A^\top u)^\top Q^{-1}(c + A^\top u) - b^\top u$.
25. Support Vector Classification (Linearly Separable Case, Dual Form)
The dual problem of the previous mathematical program:
$$\max_{\alpha}\ \sum_{i=1}^{\ell}\alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j\,\alpha_i\alpha_j\langle x_i, x_j\rangle \quad \text{subject to} \quad \sum_{i=1}^{\ell} y_i\alpha_i = 0,\ \ \alpha \ge 0.$$
Don't forget $\alpha \ge 0$!
26. Dual Representation of SVM (Key of Kernel Methods)
The hypothesis is determined by $(\alpha^*, b^*)$:
$$h(x) = \operatorname{sgn}\Big(\sum_{i=1}^{\ell} y_i\alpha_i^*\langle x_i, x\rangle + b^*\Big),$$
so only the training points with $\alpha_i^* > 0$ (the support vectors) enter the decision function.
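A quick numerical check of this dual representation (assuming scikit-learn is available; the data and parameters are made up for illustration). The fitted hypothesis is reproduced exactly from the support vectors, their dual coefficients $y_i\alpha_i^*$, and $b^*$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# f(x) = sum_i (y_i alpha_i) <x_i, x> + b, summed over support vectors only
f_dual = (clf.dual_coef_ @ (clf.support_vectors_ @ X.T)).ravel() + clf.intercept_
print(np.allclose(f_dual, clf.decision_function(X)))  # True
```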
27. Learning in Feature Space (Could Simplify the Classification Task)
- Learning in a high-dimensional space could degrade generalization performance
  - this phenomenon is called the curse of dimensionality
- We may not even know the dimensionality of the feature space
- We have to deal with a huge and dense kernel matrix
  - a reduced kernel can avoid this difficulty
28. (No transcript)
29. Kernel Technique
Based on Mercer's condition (1909)
- The value of the kernel function represents the inner product of two points in feature space
- Kernel functions merge two steps:
  1. map the input data from input space to feature space (which might be infinite-dimensional)
  2. compute the inner product in the feature space
30. Linear Machine in Feature Space
A linear machine in feature space: $f(x) = \sum_{j} w_j\,\varphi_j(x) + b$.
Make it in the dual form: $f(x) = \sum_{i=1}^{\ell} \alpha_i y_i\,\langle \varphi(x_i), \varphi(x)\rangle + b$.
31. Kernels Represent Inner Products in Feature Space
Definition: a kernel is a function $K: X \times X \to \mathbb{R}$ such that for all $x, z \in X$
$$K(x, z) = \langle \varphi(x), \varphi(z)\rangle,$$
where $\varphi$ is a mapping from the input space $X$ to an (inner product) feature space $F$.
The classifier will become $f(x) = \operatorname{sgn}\big(\sum_{i=1}^{\ell}\alpha_i y_i K(x_i, x) + b\big)$.
32. A Simple Example of a Kernel: the Polynomial Kernel of Degree 2
Let $x, z \in \mathbb{R}^2$, $K(x, z) = \langle x, z\rangle^2$, and the nonlinear map $\varphi: \mathbb{R}^2 \to \mathbb{R}^3$ be defined by $\varphi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)$. Then $\langle \varphi(x), \varphi(z)\rangle = \langle x, z\rangle^2 = K(x, z)$.
- There are many other nonlinear maps, $\psi(x)$, that satisfy the relation $\langle \psi(x), \psi(z)\rangle = \langle x, z\rangle^2$.
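A numerical check of this example (illustrative values only): the kernel computed in input space agrees with the inner product computed through the explicit map.

```python
import numpy as np

def phi(x):
    """One explicit degree-2 feature map for 2-D input (one of many valid choices)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z) ** 2)        # kernel value in input space -> 1.0
print(phi(x) @ phi(z))     # inner product in feature space -> 1.0
```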
33. Power of the Kernel Technique
Consider a nonlinear map $\varphi: \mathbb{R}^n \to \mathbb{R}^p$ that consists of distinct features of all the monomials of degree $d$. Then $p = \binom{n + d - 1}{d}$. For example, $n = 100$ and $d = 4$ already give $p = \binom{103}{4} = 4{,}421{,}275$ features.
- Is it necessary to compute $\varphi(x)$ explicitly? No, we only need to know $K(x, z) = \langle x, z\rangle^d$!
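A small illustration of the count above versus the cost of one kernel evaluation (the numbers are only an example):

```python
import math
import numpy as np

n, d = 100, 4
print(math.comb(n + d - 1, d))   # 4421275 distinct degree-4 monomial features

x, z = np.random.rand(n), np.random.rand(n)
print((x @ z) ** d)              # one kernel evaluation: O(n) work, no explicit map
```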
34. 2-Norm Soft Margin: Dual Formulation
The Lagrangian for the 2-norm soft margin:
$$L(w, b, \xi, \alpha) = \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\sum_{i=1}^{\ell}\xi_i^2 + \sum_{i=1}^{\ell}\alpha_i\big(1 - \xi_i - y_i(\langle w, x_i\rangle + b)\big), \qquad \alpha \ge 0.$$
Setting the partial derivatives with respect to the primal variables to zero:
$$w = \sum_{i=1}^{\ell} y_i\alpha_i x_i, \qquad \sum_{i=1}^{\ell} y_i\alpha_i = 0, \qquad C\,\xi_i = \alpha_i.$$
35. Dual Maximization Problem for the 2-Norm Soft Margin
Dual:
$$\max_{\alpha}\ \sum_{i=1}^{\ell}\alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j\,\alpha_i\alpha_j\Big(\langle x_i, x_j\rangle + \tfrac{1}{C}\delta_{ij}\Big) \quad \text{subject to} \quad \sum_{i=1}^{\ell} y_i\alpha_i = 0,\ \ \alpha \ge 0.$$
- The corresponding KKT complementarity condition: $\alpha_i\big(y_i(\langle w, x_i\rangle + b) - 1 + \xi_i\big) = 0$, $i = 1, \dots, \ell$
- Use the above conditions to find $b^*$
36. Introducing the Kernel in the Dual Formulation for the 2-Norm Soft Margin
- In the feature space implicitly defined by the kernel $K(x, z)$, let $\alpha^*$ solve the QP problem
$$\max_{\alpha}\ \sum_{i=1}^{\ell}\alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j\,\alpha_i\alpha_j\Big(K(x_i, x_j) + \tfrac{1}{C}\delta_{ij}\Big) \quad \text{subject to} \quad \sum_{i=1}^{\ell} y_i\alpha_i = 0,\ \ \alpha \ge 0.$$
- Then the decision rule is defined by $\operatorname{sgn}\big(\sum_{i=1}^{\ell} y_i\alpha_i^* K(x_i, x) + b^*\big)$
- Use the above conditions to find $b^*$
37. Introducing the Kernel in the Dual Formulation for the 2-Norm Soft Margin
$b^*$ is chosen so that
$$y_j\Big(\sum_{i=1}^{\ell} y_i\alpha_i^* K(x_i, x_j) + b^*\Big) = 1 - \frac{\alpha_j^*}{C}$$
for any $j$ with $\alpha_j^* \ne 0$, because $\xi_j^* = \alpha_j^*/C$ and the KKT complementarity condition forces $y_j f(x_j) = 1 - \xi_j^*$ whenever $\alpha_j^* > 0$.
38. Sequential Minimal Optimization (SMO)
- Deals with the equality constraint and the box constraints of the dual problem
- Works on the smallest possible working set (only 2 points)
- Finds the optimal solution by changing only the values in the working set
- The two-variable subproblem can be solved analytically
39. Analytical Solution for Two Points
- We have one more restriction on changing $\alpha_i$ and $\alpha_j$: the equality constraint forces $y_i\alpha_i + y_j\alpha_j$ to remain constant
40. A Restrictive Constraint on the New $\alpha_j$
$L \le \alpha_j^{\text{new}} \le H$, where
$L = \max(0,\ \alpha_j - \alpha_i)$ and $H = \min(C,\ C + \alpha_j - \alpha_i)$ if $y_i \ne y_j$, and
$L = \max(0,\ \alpha_i + \alpha_j - C)$ and $H = \min(C,\ \alpha_i + \alpha_j)$ if $y_i = y_j$.
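A minimal sketch of the analytic two-point update, written here for the 1-norm soft margin dual with box constraint $0 \le \alpha \le C$ (the working-set choice, bias update, and stopping heuristics of the full SMO algorithm are omitted, and the variable names are illustrative):

```python
import numpy as np

def smo_two_point_update(alpha, i, j, y, C, K):
    """One analytic update of (alpha_i, alpha_j) that keeps sum_k y_k alpha_k fixed.
    K is the precomputed kernel matrix; y holds labels in {-1, +1}."""
    f = (alpha * y) @ K                       # decision values without bias
    E_i, E_j = f[i] - y[i], f[j] - y[j]       # bias cancels in E_i - E_j

    if y[i] != y[j]:                          # feasible interval for the new alpha_j
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the update direction
    if eta <= 0 or L >= H:
        return alpha                          # skip degenerate pairs in this sketch

    a_j = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)   # restore the equality constraint

    new_alpha = alpha.copy()
    new_alpha[i], new_alpha[j] = a_i, a_j
    return new_alpha
```

Repeatedly applying this update to well-chosen pairs, together with a bias update, is what the full SMO algorithm does.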
41. ε-Support Vector Regression (Linear Case)
- Some tiny errors should be discarded
42. ε-Insensitive Loss Function (Tiny Errors Should Be Discarded)
43. (No transcript)
44. Five Popular Loss Functions
45. ε-Insensitive Loss Regression
- Linear ε-insensitive loss function:
$$L^{\varepsilon}(x, y, f) = |y - f(x)|_{\varepsilon} = \max\big(0,\ |y - f(x)| - \varepsilon\big),$$
where $f$ is a real-valued function on the input domain and $(x, y)$ is a training point.
- Quadratic ε-insensitive loss function: $L^{\varepsilon}_2(x, y, f) = \big(|y - f(x)|_{\varepsilon}\big)^2$
46. ε-Insensitive Support Vector Regression Model
- As in SVM, $\|w\|_2$ should be as small as possible
- Some tiny errors should be discarded, where an error smaller than ε contributes nothing: the loss of a training point is $|y_i - f(x_i)|_{\varepsilon} = \max\big(0,\ |y_i - f(x_i)| - \varepsilon\big)$
47. Why Minimize $\|w\|$? Probably Approximately Correct (pac)
Consider performing linear regression for any training data distribution; pac learning theory then bounds the expected error in terms of $\|w\|$, and the bound becomes tighter as $\|w\|$ becomes smaller.
- Occam's razor: the simplest is the best
48. Reformulated ε-SVR as a Constrained Minimization Problem
$$\min_{w,\,b,\,\xi,\,\xi^*}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*)$$
subject to
$$y_i - (\langle w, x_i\rangle + b) \le \varepsilon + \xi_i, \qquad (\langle w, x_i\rangle + b) - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i,\ \xi_i^* \ge 0.$$
- This is a minimization problem with $n + 1 + 2m$ variables and $2m$ constraints
- It enlarges the problem size and the computational complexity of solving the problem
49. SV Regression by Minimizing the Quadratic ε-Insensitive Loss
- We have the following unconstrained problem:
$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\sum_{i=1}^{m}\big|y_i - (\langle w, x_i\rangle + b)\big|_{\varepsilon}^{2},$$
where $|t|_{\varepsilon} = \max(0,\ |t| - \varepsilon)$.
50. Primal Formulation of SVR for the Quadratic ε-Insensitive Loss
$$\min_{w,\,b,\,\xi,\,\xi^*}\ \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\sum_{i=1}^{m}(\xi_i^2 + \xi_i^{*2})$$
subject to
$$y_i - (\langle w, x_i\rangle + b) \le \varepsilon + \xi_i, \qquad (\langle w, x_i\rangle + b) - y_i \le \varepsilon + \xi_i^*.$$
- Extremely important: at the solution $\xi_i\,\xi_i^* = 0$, i.e. a point cannot lie above and below the ε-tube at the same time.
51. Simplified Dual Formulation of SVR
$$\max_{\beta}\ \sum_{i=1}^{m} y_i\beta_i - \varepsilon\sum_{i=1}^{m}|\beta_i| - \tfrac{1}{2}\sum_{i,j=1}^{m}\beta_i\beta_j\Big(\langle x_i, x_j\rangle + \tfrac{1}{C}\delta_{ij}\Big) \quad \text{subject to} \quad \sum_{i=1}^{m}\beta_i = 0.$$
- In the case $\varepsilon = 0$, the problem becomes least squares linear regression with a weight decay factor
52. Kernel in the Dual Formulation for SVR
In the feature space implicitly defined by the kernel $K(x, z)$, let $\beta^*$ solve
$$\max_{\beta}\ \sum_{i=1}^{m} y_i\beta_i - \varepsilon\sum_{i=1}^{m}|\beta_i| - \tfrac{1}{2}\sum_{i,j=1}^{m}\beta_i\beta_j\Big(K(x_i, x_j) + \tfrac{1}{C}\delta_{ij}\Big) \quad \text{subject to} \quad \sum_{i=1}^{m}\beta_i = 0.$$
- Then the regression function is defined by $f(x) = \sum_{i=1}^{m}\beta_i^* K(x_i, x) + b^*$, where $b^*$ is chosen such that $f(x_j) - y_j = -\varepsilon - \beta_j^*/C$ for any $j$ with $\beta_j^* > 0$.
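A quick illustration (assuming scikit-learn is available; data and parameter values are made up): an ε-insensitive SVR with an RBF kernel, where the points with nonzero dual coefficients are the support vectors that define the regression function.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", model.support_.size, "of", len(X))
print("prediction at x = 2.5:", model.predict([[2.5]]))
```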
53. Probably Approximately Correct Learning (pac Model)
- Key assumption: training and test examples are drawn i.i.d. according to a fixed but unknown distribution $D$
- When we evaluate the quality of a hypothesis (classification function) $h$ we should take the unknown distribution $D$ into account (i.e. the average error, or expected error, made by $h$)
- We call such a measure the risk functional and denote it as $\operatorname{err}_D(h) = D\big(\{(x, y) : h(x) \ne y\}\big)$
54. Generalization Error of the pac Model
55. Probably Approximately Correct
We want a hypothesis whose error is small with high probability: $\Pr\big(\operatorname{err}_D(h) \le \varepsilon\big) \ge 1 - \delta$, or equivalently $\Pr\big(\operatorname{err}_D(h) > \varepsilon\big) < \delta$.
56. Find the Hypothesis with Minimum Expected Risk?
Ideally, the learned hypothesis should have the smallest expected risk over the unknown distribution, $h^* = \arg\min_{h \in H}\operatorname{err}_D(h)$.
Unrealistic!!! The distribution $D$ is unknown.
57. Empirical Risk Minimization (ERM)
Replace the expected risk by the empirical risk measured on the training set,
$$\operatorname{err}_{\mathrm{emp}}(h) = \frac{1}{\ell}\sum_{i=1}^{\ell}\mathbf{1}\big[h(x_i) \ne y_i\big]$$
($D$ and $\operatorname{err}_D(h)$ are not needed).
- Focusing only on the empirical risk will cause overfitting
58. Overfitting
Overfitting is the phenomenon in which the resulting function fits the training set too well but does not have good prediction performance on unseen data.
(Figure: the red dots are 8 points generated by f(x) with random noise; a nonlinear regression curve passes exactly through these 8 points.)
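A tiny numeric illustration of the same phenomenon (the target function, the noise level, and the polynomial degree are made-up choices, not the slide's exact figure): a degree-7 polynomial that passes through all 8 noisy points has essentially zero training error but a much larger error on unseen points.

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)               # hypothetical target function

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 8)
y_train = true_f(x_train) + 0.15 * rng.standard_normal(8)   # the "red dots"

coef = np.polyfit(x_train, y_train, deg=7)     # interpolates all 8 points

x_test = np.linspace(0, 1, 200)
train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coef, x_test) - true_f(x_test)) ** 2)
print(f"train MSE ~ {train_mse:.1e}, test MSE ~ {test_mse:.1e}")
```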
59. Tuning Procedure
The final value of the parameter is the one with the maximum testing set correctness!
60. VC Confidence
(The Bound between the Expected Risk and the Empirical Risk)
C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2(2) (1998), pp. 121-167.
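The bound referred to here, as stated in the cited Burges tutorial: with probability $1 - \eta$,
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\big(\log(2\ell/h) + 1\big) - \log(\eta/4)}{\ell}},$$
where $\ell$ is the number of training points, $h$ is the VC dimension of the hypothesis family, and the square-root term is the VC confidence.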
61. Capacity (Complexity) of a Hypothesis Space: the VC-dimension
62. Shattering Points with Hyperplanes in $\mathbb{R}^n$
Can you always shatter three points with a line in $\mathbb{R}^2$? (Three points in general position can; three collinear points cannot.)
63. Definition of VC-dimension
- The Vapnik-Chervonenkis dimension, $\mathrm{VC}(H)$, of a hypothesis space $H$ defined over the input space $X$ is the size of the largest finite subset of $X$ shattered by $H$ (provided such a largest finite subset exists).