A Review of Our Course

Transcript and Presenter's Notes

Title: A Review of Our Course


1
A Review of Our Course: Classification and
Regression
  • The Perceptron Algorithm: primal vs. dual form
  • An on-line, mistake-driven procedure
  • Converges when the problem is linearly separable
    (see the sketch below)
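The course slides contain no code; as a minimal sketch of the primal perceptron described above (data layout and names are my own, not the course's), the mistake-driven update looks like this:

import numpy as np

def perceptron_primal(X, y, epochs=100, lr=1.0):
    """Primal perceptron: w, b are updated only on mistakes (labels y in {-1, +1})."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # mistake-driven update
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                        # no mistakes: data are separated
            break
    return w, b

In the dual form, one instead keeps a count alpha_i of the mistakes made on each example and represents w as the sum of alpha_i * y_i * x_i, which is what later enables the kernel trick.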

2
Classification Problem: 2-Category Linearly
Separable Case
(Figure: benign vs. malignant data points separated by a plane)
3
Algebra of the Classification Problem: Linearly
Separable Case
4
Robust Linear Programming
Preliminary Approach to SVM
5
Support Vector Machines: Maximizing the Margin
between Bounding Planes
(Figure: bounding planes for the two classes A+ and A-, separated by the margin)
6
Support Vector Classification
(Linearly Separable Case, Primal)
The hyperplane (w, b) that solves the
minimization problem
  min_{w,b} (1/2)||w||_2^2   subject to   y_i ( <w, x_i> + b ) >= 1,  i = 1, ..., m
realizes the maximal margin hyperplane with
geometric margin  γ = 1 / ||w||_2
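As a hedged illustration (not from the slides; the data and parameter values are invented), scikit-learn's SVC with a linear kernel and a very large C approximates this hard-margin problem, and the geometric margin can be read off as 1/||w||_2:

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (hypothetical, for illustration only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates the hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 1.0 / np.linalg.norm(w)    # geometric margin of the separating plane
print("w =", w, "b =", b, "margin =", margin)
print("support vectors:\n", clf.support_vectors_)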
7
Soft Margin SVM
(Nonseparable Case)
  • If the data are not linearly separable
  • The primal problem is infeasible
  • The dual problem is unbounded above
  • Introduce a slack variable ξ_i for each training
    point
  • The resulting inequality system is always feasible

e.g.  y_i ( <w, x_i> + b ) >= 1 - ξ_i,  ξ_i >= 0
8
(No Transcript)
9
Two Different Measures of Training Error
2-Norm Soft Margin
1-Norm Soft Margin
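The two formulas were images in the original slides and did not survive this transcript; as a sketch of the standard objectives (my notation, with m training points), they are:

% 2-norm soft margin: quadratic penalty on the slack variables
\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|_2^2 + \tfrac{C}{2}\sum_{i=1}^{m}\xi_i^2
\quad\text{s.t.}\quad y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i

% 1-norm soft margin: linear penalty on the slack variables
\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0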
10
Optimization Problem Formulation
Problem setting: Given functions f, g_i, i = 1, ..., k,
and h_j, j = 1, ..., l, defined on a domain Ω ⊆ R^n,

  min f(x)   subject to   g_i(x) <= 0,  h_j(x) = 0,

where f is called the objective function and
g_i(x) <= 0, h_j(x) = 0 are called the constraints.
11
Definitions and Notation
  • Feasible region:  F = { x ∈ Ω : g(x) <= 0, h(x) = 0 }

12
Definitions and Notation
Write the inequality constraint g_i(x) <= 0 as g_i(x) + s_i = 0 with s_i >= 0,
where s_i
is called the slack variable
13
Definitions and Notation
  • Removing an inactive constraint from an optimization
    problem will NOT affect the optimal solution
  • A very useful feature in SVM
  • The least squares problem is in this category
  • The SSVM formulation is in this category
  • It is difficult to find the global minimum without
    a convexity assumption

14
Gradient and Hessian
15
The Most Important Concept in Optimization
(minimization)
  • A point is said to be an optimal solution of an
    unconstrained minimization if there exists no
    descent direction
  • A point is said to be an optimal solution of a
    constrained minimization if there exists no
    feasible descent direction
  • A descent direction might exist, but moving
    along it would leave the feasible region

16
Two Important Algorithms for Unconstrained
Minimization Problem
  • Steepest descent with exact line search
  • Newton's method (see the sketch below)
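As a minimal sketch (mine, not the course's), both methods applied to a strictly convex quadratic f(x) = (1/2) x^T Q x - c^T x, for which the exact line-search step length has a closed form:

import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
c = np.array([1.0, 1.0])
grad = lambda x: Q @ x - c               # gradient of f(x) = 0.5 x'Qx - c'x

def steepest_descent(x, iters=50):
    for _ in range(iters):
        d = -grad(x)                      # steepest-descent direction
        if np.linalg.norm(d) < 1e-10:
            break
        t = d @ d / (d @ Q @ d)           # exact line search for a quadratic
        x = x + t * d
    return x

def newton(x, iters=10):
    for _ in range(iters):
        x = x - np.linalg.solve(Q, grad(x))   # Hessian of f is Q
    return x

x0 = np.zeros(2)
print(steepest_descent(x0), newton(x0), np.linalg.solve(Q, c))  # all approach Q^{-1} c

On a quadratic, Newton's method reaches the minimizer Q^{-1} c in a single step, while steepest descent zig-zags toward it; this is the usual motivation for comparing the two.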

17
Linear Program and Quadratic Program
  • An optimization problem in which the objective
    function and all constraints are linear functions
    is called a linear programming (LP) problem
  • If the objective function is convex quadratic while
    the constraints are all linear, then the problem is
    called a convex quadratic programming (QP) problem
  • The standard SVM formulation is in this category

18
Lagrangian Dual Problem
max_{u,v} θ(u, v)   subject to   u >= 0
19
Lagrangian Dual Problem
max_{u,v} θ(u, v)   subject to   u >= 0,   where   θ(u, v) = inf_{x ∈ Ω} [ f(x) + u^T g(x) + v^T h(x) ]
20
Weak Duality Theorem
Let x̄ be a feasible solution of the primal
problem and (u, v) a feasible solution of the
dual problem. Then  θ(u, v) <= f(x̄)
21
Saddle Point of Lagrangian
Let (x̄, ū, v̄) with ū >= 0 satisfy
  L(x̄, u, v) <= L(x̄, ū, v̄) <= L(x, ū, v̄)   for all x ∈ Ω and all u >= 0.
Then (x̄, ū, v̄) is called
a saddle point of the Lagrangian function
22
Dual Problem of Linear Program
Primal LP:   min_x  c^T x    subject to   A x >= b,  x >= 0
Dual LP:     max_y  b^T y    subject to   A^T y <= c,  y >= 0
  • All duality theorems hold and work perfectly!
    (see the numeric check below)
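As a numeric illustration under my own toy data (the slide's exact primal/dual pair was an image), scipy's linprog can solve an LP of the above form together with its dual, and the two optimal values coincide:

import numpy as np
from scipy.optimize import linprog

# Primal LP:  min c'x  s.t.  A x >= b,  x >= 0   (toy data, illustration only)
A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([4.0, 6.0])
c = np.array([2.0, 3.0])

primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 2, method="highs")

# Dual LP:  max b'y  s.t.  A'y <= c,  y >= 0   (solved as a minimization of -b'y)
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(0, None)] * 2, method="highs")

print(primal.fun, -dual.fun)   # equal at optimality: strong duality for LPs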

23
Dual Problem of Strictly Convex Quadratic Program
Primal QP:   min_x  (1/2) x^T Q x + c^T x    subject to   A x <= b
With the strictly convex assumption (Q positive definite), we have
x = -Q^{-1}(c + A^T α), and the
Dual QP:   max_{α >= 0}   -(1/2)(c + A^T α)^T Q^{-1} (c + A^T α) - b^T α
25
Support Vector Classification
(Linearly Separable Case, Dual Form)
The dual problem of the previous mathematical program:
  max_α  Σ_i α_i - (1/2) Σ_{i,j} y_i y_j α_i α_j <x_i, x_j>
  subject to   Σ_i y_i α_i = 0,   α_i >= 0
Don't forget that the primal solution is recovered from the dual:  w = Σ_i y_i α_i x_i
26
Dual Representation of SVM
(Key of Kernel Methods)
The hypothesis is determined by (α*, b*):
  h(x) = sign( Σ_i y_i α_i* <x_i, x> + b* )
27
Learning in Feature Space
(Could Simplify the Classification Task)
  • Learning in a high-dimensional space could degrade
    generalization performance
  • This phenomenon is called the curse of dimensionality
  • We may not even know the dimensionality of the feature space
  • There is no free lunch
  • We have to deal with a huge and dense kernel matrix
  • A reduced kernel can avoid this difficulty

28
(No Transcript)
29
Kernel Technique
Based on Mercer's Condition (1909)
  • The value of the kernel function represents the
    inner product in the feature space
  • Kernel functions merge two steps:
  • 1. map the input data from the input space to the
       feature space (which might be infinite-dimensional)
  • 2. compute the inner product in the feature space

30
Linear Machine in Feature Space
Rewrite it in the dual form
31
Kernels Represent Inner Products in Feature Space
Definition: A kernel is a function K : X × X → R
such that  K(x, z) = <φ(x), φ(z)>,
where φ is a map from the input space X to an (inner product) feature space F.
The classifier will become  f(x) = sign( Σ_i y_i α_i K(x_i, x) + b )
32
A Simple Example of Kernel
Polynomial Kernel of Degree 2
Let K(x, z) = <x, z>^2 for x, z ∈ R^2, and the nonlinear map
φ : R^2 → R^3 defined by φ(x) = ( x_1^2, x_2^2, √2 x_1 x_2 ).
Then  <φ(x), φ(z)> = <x, z>^2 = K(x, z).
  • There are many other nonlinear maps, ψ(x), that
    satisfy the relation  <ψ(x), ψ(z)> = <x, z>^2
    (see the numeric check below)
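A quick numeric check of this identity, with sample points of my own choosing:

import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x in R^2: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = <x, z>^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(kernel(x, z), np.dot(phi(x), phi(z)))   # both print 1.0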
33
Power of the Kernel Technique
Consider a nonlinear map φ : R^n → R^N that consists
of distinct features of all the monomials of
degree d.
Then N = C(n + d - 1, d), which grows very quickly with n and d.
For example, even moderate values of n and d make N astronomically large.
  • Is it necessary to compute φ explicitly? We only need to know
    K(x, z) = <φ(x), φ(z)> = <x, z>^d !
  • This can be achieved without ever forming φ(x)

34
2-Norm Soft Margin Dual Formulation
The Lagrangian for 2-norm soft margin
The partial derivatives with respect to the
primal variables (w, b, ξ) are set equal to zero
35
Dual Maximization Problem For 2-Norm Soft Margin
Dual:
  • The corresponding KKT complementarity conditions
  • Use the above conditions to find b*
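The dual itself was an image in the slide; as a sketch of the standard result (the 2-norm penalty shows up as 1/C added to the diagonal of the Gram matrix):

\max_{\alpha}\ \sum_{i=1}^{m}\alpha_i
 - \frac{1}{2}\sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j
   \Bigl(\langle x_i, x_j\rangle + \tfrac{1}{C}\,\delta_{ij}\Bigr)
\quad\text{s.t.}\quad \sum_{i=1}^{m} y_i\alpha_i = 0,\quad \alpha_i \ge 0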

36
Introduce Kernel in Dual Formulation For 2-Norm
Soft Margin
  • The feature space is implicitly defined by the kernel k(x, z)
  • Suppose α* solves the QP problem
  • Then the decision rule is defined by  sign( Σ_i y_i α_i* k(x_i, x) + b* )
  • Use the above conditions to find b*

37
Introduce Kernel in Dual Formulation for 2-Norm
Soft Margin

b* is chosen so that  y_i ( Σ_j y_j α_j* k(x_j, x_i) + b* ) = 1 - α_i*/C
for any i
with α_i* ≠ 0,
because ξ_i = α_i/C
and y_i f(x_i) = 1 - ξ_i at the solution
38
Sequential Minimal Optimization
(SMO)
  • Deals with an equality constraint and box
    constraints of the dual problem
  • Works on the smallest possible working set (only 2 variables)
  • Finds the optimal solution by changing only the
    values in the working set
  • The solution can be derived analytically
  • This is the best feature of SMO (see the sketch below)
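A heavily simplified sketch of the analytic two-variable update at the heart of SMO for the 1-norm soft margin dual (function and variable names are mine; the pair-selection heuristics and bias update are omitted). Here alpha[j] plays the role of the α_2 on the next slides, K is the kernel (Gram) matrix and errs[k] = f(x_k) - y_k:

def smo_pair_update(i, j, alpha, y, K, errs, C):
    """Analytically re-optimize (alpha_i, alpha_j), keeping sum_k y_k alpha_k fixed."""
    if i == j:
        return False
    # Segment [L, H] that keeps 0 <= alpha <= C and the equality constraint satisfied
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = K[i][i] + K[j][j] - 2.0 * K[i][j]      # curvature along the pair
    if H <= L or eta <= 0.0:
        return False                              # no progress possible for this pair
    aj_new = alpha[j] + y[j] * (errs[i] - errs[j]) / eta
    aj_new = min(H, max(L, aj_new))               # clip to the feasible segment
    ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)
    alpha[i], alpha[j] = ai_new, aj_new
    return True

Because only two multipliers change at a time, each step is solved in closed form rather than by calling a QP solver, which is the "best feature" noted above.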

39
Analytical Solution for Two Points
  • We have one more restriction on changing α_1 and α_2:
    the quantity y_1 α_1 + y_2 α_2 must stay constant

40
A Restrictive Constraint on the New α_2
  If y_1 ≠ y_2:  L = max(0, α_2 - α_1)  and  H = min(C, C + α_2 - α_1)
  If y_1 = y_2:  L = max(0, α_1 + α_2 - C)  and  H = min(C, α_1 + α_2)
41
ε-Support Vector Regression
(Linear Case)
  • Given the training set S = { (x_i, y_i) : i = 1, ..., m }
  • Motivated by SVM
  • Some tiny errors should be discarded

42
ε-Insensitive Loss Function
(Tiny Errors Should Be Discarded)
43
(No Transcript)
44
Five Popular Loss Functions
45
ε-Insensitive Loss Regression
  • Linear ε-insensitive loss function
    |y - f(x)|_ε := max(0, |y - f(x)| - ε),  where f is a real-valued function
  • Quadratic ε-insensitive loss function
    |y - f(x)|_ε^2 := max(0, |y - f(x)| - ε)^2
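In code form (a trivial sketch with my own function names):

import numpy as np

def eps_insensitive(y, f, eps):
    """Linear epsilon-insensitive loss: errors inside the eps-tube cost nothing."""
    return np.maximum(0.0, np.abs(y - f) - eps)

def eps_insensitive_sq(y, f, eps):
    """Quadratic epsilon-insensitive loss."""
    return np.maximum(0.0, np.abs(y - f) - eps) ** 2

print(eps_insensitive(np.array([1.0, 2.0]), np.array([1.05, 3.0]), eps=0.1))  # [0.  0.9]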

46
ε-Insensitive Support Vector Regression Model
  • Motivated by SVM:  ||w||_2
    should be as small as possible
  • Some tiny errors should be discarded

  min_{w,b} (1/2)||w||_2^2 + C Σ_{i=1}^{m} |y_i - f(x_i)|_ε,   where  f(x) = <w, x> + b
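As a modern, hedged illustration (scikit-learn's SVR, not the course's own formulation or solver; data and parameters are invented), the same C / ε trade-off appears as constructor arguments:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(80, 1))
y = 1.5 * X.ravel() - 0.7 + 0.05 * rng.standard_normal(80)   # noisy linear targets

# epsilon sets the width of the tube in which errors are ignored;
# C trades the flatness of w against errors outside the tube
reg = SVR(kernel="linear", C=10.0, epsilon=0.1).fit(X, y)
print(reg.coef_, reg.intercept_)   # roughly [1.5] and -0.7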
47
Why Minimize ||w|| ? Probably Approximately
Correct (pac)
Consider performing linear regression for any
training data distribution; then a pac-style bound on the
generalization error holds with high probability.
  • Occam's razor: the simplest is the best

48
Reformulated ε-SVR as a Constrained
Minimization Problem
  min_{w,b,ξ,ξ*}  (1/2)||w||_2^2 + C Σ_{i=1}^{m} (ξ_i + ξ_i*)
  subject to   y_i - <w, x_i> - b <= ε + ξ_i,
               <w, x_i> + b - y_i <= ε + ξ_i*,   ξ_i, ξ_i* >= 0 (simple bounds)
A minimization problem in n + 1 + 2m variables with 2m constraints
This enlarges the problem size and the computational
complexity of solving the problem
49
SV Regression by Minimizing the Quadratic
ε-Insensitive Loss
  • We have the following problem

where
50
Primal Formulation of SVR for Quadratic
-Insensitive Loss
subject to
  • Extremely important: at the solution, ξ_i ξ_i* = 0
    (a point cannot lie above and below the ε-tube at once)

51
Simplify Dual Formulation of SVR
subject to
  • In the case ε = 0, the problem becomes least squares
    linear regression with a weight
    decay factor

52
Kernel in Dual Formulation for SVR
subject to
  • Then the regression function is defined by

where b is chosen such that the corresponding KKT
condition is satisfied for any training point
with a nonzero Lagrange multiplier
53
Probably Approximately Correct Learning: pac Model
  • The training examples are drawn i.i.d. according to a
    fixed but unknown
    distribution D
  • When we evaluate the quality of a hypothesis
    (classification function) h, we should take the
    unknown distribution D into account
    (i.e., the error or expected error made by h)
  • We call such a measure the risk functional and denote
    it as err_D(h)
54

Generalization Error of pac Model
55
Probably Approximately Correct
  • We assert that, with probability at least 1 - δ,
    err_D(h) <= ε,   or equivalently   Pr{ err_D(h) > ε } < δ
56
Find the Hypothesis with Minimum Expected Risk?
  • The ideal hypothesis h* should have the smallest
    expected risk:  err_D(h*) <= err_D(h) for all h ∈ H
Unrealistic !!!
57
Empirical Risk Minimization (ERM)
( the unknown distribution D and the true target function are not needed )
  • Only focusing on the empirical risk will cause
    overfitting

58
Overfitting
Overfitting is a phenomenon in which the resulting
function fits the training set too well but does
not have good prediction performance on unseen
data.
(Figure: red dots generated by f(x) with random noise; solid curve: f(x);
dotted curve: a nonlinear regression which passes through all 8 points)
59
Tuning Procedure
The final value of the parameter is the one with the
maximum testing-set correctness!
60
VC Confidence
(The Bound between the Expected Risk and the Empirical Risk)
C. J. C. Burges, "A tutorial on support vector machines for pattern
recognition," Data Mining and Knowledge Discovery 2(2) (1998), pp. 121-167.
61
Capacity (Complexity) of Hypothesis Space
VC-dimension
62
Shattering Points with Hyperplanes in R^n
Can you always shatter three points with a line in
R^2 ? (See the check below.)
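A brute-force check of this question under my own construction (three non-collinear points, a near-hard-margin linear SVC playing the role of the line): every one of the 2^3 labelings is realized, so the answer is yes for points in general position (but not for three collinear points):

import numpy as np
from itertools import product
from sklearn.svm import SVC

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # non-collinear

def separable(X, y):
    if len(set(y)) < 2:          # constant labelings are trivially realizable
        return True
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    return clf.score(X, y) == 1.0

print(all(separable(points, np.array(lab)) for lab in product([-1, 1], repeat=3)))
# True: the three points are shattered by lines in R^2

Together with the fact that no four points in the plane can be shattered by a line, this gives VC-dimension 3 for hyperplanes in R^2, matching the definition on the next slide.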
63
Definition of VC-dimension
  • The Vapnik-Chervonenkis dimension, VC(H), of a
    hypothesis space H defined over the input space X
    is the size of the largest finite subset of X
    shattered by H