Title: Classification
1Classification
Yan Pan
2Under and Over Fitting
3Probability Theory
- Non-negativity and unit measure
- 0 ≤ p(y) ≤ 1, p(Ω) = 1, p(∅) = 0
- Conditional probability p(y|x)
- p(x, y) = p(y|x) p(x) = p(x|y) p(y)
- Bayes Theorem
- p(y|x) = p(x|y) p(y) / p(x)
- Marginalization
- p(x) = ∫ p(x, y) dy
- Independence
- p(x1, x2) = p(x1) p(x2) ⇔ p(x1|x2) = p(x1)
- Chris Bishop, Pattern Recognition and Machine Learning
5The Univariate Gaussian Density
- p(x|μ,σ²) = exp( -(x - μ)² / 2σ² ) / (2πσ²)^½
(Figure: the univariate Gaussian density curve, with the horizontal axis marked at μ and μ ± 1σ, 2σ, 3σ)
6The Multivariate Gaussian Density
- p(x|μ,Σ) = exp( -½ (x - μ)ᵗ Σ⁻¹ (x - μ) ) / ( (2π)^(D/2) |Σ|^½ )
7The Beta Density
- p(θ|a,b) = θ^(a-1) (1 - θ)^(b-1) Γ(a+b) / ( Γ(a) Γ(b) )
8Probability Distribution Functions
- Bernoulli: a single trial with probability of success θ
- n ∈ {0, 1}, θ ∈ [0, 1]
- p(n|θ) = θ^n (1 - θ)^(1-n)
- Binomial: N iid Bernoulli trials with n successes
- n ∈ {0, 1, …, N}, θ ∈ [0, 1]
- p(n|N,θ) = NCn θ^n (1 - θ)^(N-n)
9A Toy Example
- We don't know whether a coin is fair or not. We are told that heads occurred n times in N coin flips.
- We are asked to predict whether the next coin flip will result in a head or a tail.
- Let y be a binary random variable such that y = 1 represents the event that the next coin flip will be a head and y = 0 that it will be a tail.
- We should predict heads if p(y=1|n,N) > p(y=0|n,N)
10The Maximum Likelihood Approach
- Let p(y=1|n,N) = θ and p(y=0|n,N) = 1 - θ, so that we should predict heads if θ > ½
- How should we estimate θ?
- Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of θ that maximizes the likelihood of observing the data
- θ_ML = argmax_θ p(n|θ) = argmax_θ NCn θ^n (1 - θ)^(N-n)
- = argmax_θ n log(θ) + (N - n) log(1 - θ)
- = n / N
- We should predict heads if n > ½ N
11The Maximum A Posteriori Approach
- We should choose the value of θ maximizing the posterior probability of θ conditioned on the data
- We assume a
- Binomial likelihood: p(n|θ) = NCn θ^n (1 - θ)^(N-n)
- Beta prior: p(θ|a,b) = θ^(a-1) (1 - θ)^(b-1) Γ(a+b) / ( Γ(a) Γ(b) )
- θ_MAP = argmax_θ p(θ|n,a,b) = argmax_θ p(n|θ) p(θ|a,b)
- = argmax_θ θ^n (1 - θ)^(N-n) θ^(a-1) (1 - θ)^(b-1)
- = (n + a - 1) / (N + a + b - 2), as if we saw an extra a - 1 heads and b - 1 tails
- We should predict heads if n > ½ (N + b - a)
12The Bayesian Approach
- We should marginalize over θ
- p(y=1|n,a,b) = ∫ p(y=1|n,θ) p(θ|a,b,n) dθ
- = ∫ θ p(θ|a,b,n) dθ
- = ∫ θ Beta(θ|a + n, b + N - n) dθ
- = (n + a) / (N + a + b), as if we saw an extra a heads and b tails
- We should predict heads if n > ½ (N + b - a)
- The Bayesian and MAP predictions coincide in this case
- In the very large data limit, both the Bayesian and MAP predictions coincide with the ML prediction (n > ½ N)
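The three approaches above differ only in how the prior pseudo-counts a and b enter the estimate. A minimal runnable sketch of the coin example (the counts n, N and the prior a, b below are my own illustrative choices, not from the slides):

```python
# Coin-flip toy example: ML, MAP and Bayesian predictions side by side.
n, N = 7, 10        # heads observed out of N flips (hypothetical numbers)
a, b = 2.0, 2.0     # Beta(a, b) prior pseudo-counts (hypothetical)

theta_ml    = n / N                            # maximum likelihood: n / N
theta_map   = (n + a - 1) / (N + a + b - 2)    # posterior mode (MAP)
theta_bayes = (n + a) / (N + a + b)            # posterior mean (full Bayesian)

for name, theta in [("ML", theta_ml), ("MAP", theta_map), ("Bayes", theta_bayes)]:
    print(f"{name:5s} theta = {theta:.3f} -> predict {'heads' if theta > 0.5 else 'tails'}")
```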
18Classification
19Binary Classification
20Approaches to Classification
- Memorization
- Cannot deal with previously unseen data
- Large-scale annotated data acquisition cost might be very high
- Rule-based expert system
- Dependent on the competence of the expert.
- Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.
- Rules might not transfer to similar problems
- Learning from training data and prior knowledge
- Focuses on generalization to novel data
21Notation
- Training Data
- Set of N labeled examples of the form (xi, yi)
- Feature vector x ∈ ℝ^D. X = [x1 x2 … xN]
- Label y ∈ {±1}. y = [y1, y2, …, yN]ᵗ. Y = diag(y)
- Example: Gender Identification
(x1, y1 = +1)
(x2, y2 = +1)
(x3, y3 = +1)
(x4, y4 = -1)
22Binary Classification
23Binary Classification
(Figure: a linear classifier; the separating hyperplane wᵗx + b = 0 with normal vector w and offset b)
24Machine Learning from the Optimization View
- Before we go into the details of classification
and regression methods, we should take a close
look at the objective functions of machine
learning
- From this view, machine learning amounts to defining an objective function (typically a regularization term plus a loss on the training data) and then minimizing it
- The next slides make this form explicit and then walk through several examples
25Supervised Learning
26Common Form of Supervised Learning Problems
- Minimize the following objective function
- Objective = Regularization term + Loss function
- Regularization term: controls the model complexity and helps avoid overfitting
- Loss function: measures the quality of the learned function, i.e., the prediction error on the training data.
27Ex.1 Linear Regression
- E(w) = ½ Σn (yn - wᵗxn)² + ½λ wᵗw
28Ex.2 Logistic Regression (classification method)
- ℓ(w, b) = ½λ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi)))
29Ex.3 SVM
- E(w) = ½λ wᵗw + Σi max(0, 1 - yi wᵗxi)
- Or
- E(w) = ½λ wᵗw + Σi max(0, 1 - yi wᵗxi)²
30How to measure error?
- True label: yi
- Predicted value: wᵗxi
- Natural error measures include
- I(yi ≠ wᵗxi)
- (yi - wᵗxi)²
- If the labels take values in {-1, 1}, we can also use the margin
- yi wᵗxi
31Approximate the Zero-One Loss
- Squared Error
- Exponential Loss
- Logistic Loss
- Hinge Loss
- Sigmoid Loss
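Each surrogate above can be written as a function of the margin m = y·f(x). A minimal sketch (my own illustration; the exact scaling conventions for these losses vary between references):

```python
import numpy as np

# Surrogate losses as functions of the margin m = y * f(x).
def zero_one(m):    return (m <= 0).astype(float)     # the loss being approximated
def squared(m):     return (1.0 - m) ** 2             # since (y - f)^2 = (1 - yf)^2 for y in {-1, +1}
def exponential(m): return np.exp(-m)
def logistic(m):    return np.log1p(np.exp(-m))       # logistic regression loss
def hinge(m):       return np.maximum(0.0, 1.0 - m)   # SVM hinge loss
def sigmoid(m):     return 1.0 / (1.0 + np.exp(m))    # one common (non-convex) sigmoid surrogate

m = np.linspace(-2.0, 2.0, 5)
for loss in (zero_one, squared, exponential, logistic, hinge, sigmoid):
    print(f"{loss.__name__:11s}", np.round(loss(m), 3))
```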
32Regularized Logistic Regression
Zhu & Hastie, KLR and the
Import Vector Machine, NIPS 01
33Regularized Logistic Regression
Zhu & Hastie, KLR and the
Import Vector Machine, NIPS 01
36Convex Functions
- Convex f: f(λx1 + (1 - λ)x2) ≤ λ f(x1) + (1 - λ) f(x2) for λ ∈ [0, 1]
- The Hessian ∇²f is always positive semi-definite
- The tangent is always a lower bound to f
38Gradient Descent
- Iteration: x_(n+1) = x_n - η_n ∇f(x_n)
- Step size selection: Armijo rule
- Stopping criterion: change in f is minuscule
39Gradient Descent Logistic Regression
- ℓ(w, b) = ½λ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi)))
- ∇w ℓ(w, b) = λw - Σi p(-yi|xi,w) yi xi
- ∇b ℓ(w, b) = -Σi p(-yi|xi,w) yi
- Beware of numerical issues while coding!
40Gradient Descent Algorithm
- Input: x_0, objective f(x), e, T
- Output: x_star that minimizes f(x)
- t = 0
- while ( t == 0 || ( f(x_(t-1)) - f(x_t) >= e && t < T ) )   // e.g., T = 100000
- {
-   g_t = gradient of f(x) at x_t
-   for ( i = 10; i >= -6; i-- )
-   {
-     s = 2^i
-     x_(t+1) = x_t - s * g_t
-     if ( f(x_(t+1)) < f(x_t) )
-       break
-   }
-   t++
- }
- Output x_t
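A runnable sketch of this procedure applied to the regularized logistic regression objective of the previous slide (the synthetic data, λ and stopping constants are my own illustrative choices):

```python
import numpy as np

def lr_objective(wb, X, y, lam):
    # wb = [w, b]; objective = ½λ wᵗw + Σ_i log(1 + exp(-y_i (b + wᵗx_i)))
    w, b = wb[:-1], wb[-1]
    m = y * (X @ w + b)
    return 0.5 * lam * w @ w + np.sum(np.logaddexp(0.0, -m))   # log(1 + exp(-m)), computed stably

def lr_gradient(wb, X, y, lam):
    w, b = wb[:-1], wb[-1]
    m = y * (X @ w + b)
    p = 1.0 / (1.0 + np.exp(m))                   # p(-y_i | x_i, w)
    return np.concatenate([lam * w - X.T @ (p * y), [-np.sum(p * y)]])

def gradient_descent(f, grad, x0, eps=1e-6, T=100000):
    x, t, f_prev = x0, 0, np.inf
    while t == 0 or (f_prev - f(x) >= eps and t < T):
        f_prev, g = f(x), grad(x)
        for i in range(10, -7, -1):               # try step sizes 2^10 down to 2^-6
            x_new = x - (2.0 ** i) * g
            if f(x_new) < f(x):
                x = x_new
                break
        t += 1
    return x

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=100))
wb = gradient_descent(lambda v: lr_objective(v, X, y, 1.0),
                      lambda v: lr_gradient(v, X, y, 1.0), np.zeros(3))
print("w =", wb[:-1], "b =", wb[-1])
```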
41Newton Methods
- Iteration: x_(n+1) = x_n - η_n H⁻¹ ∇f(x_n)
- Approximate f by a 2nd order Taylor expansion
- The error can now decrease quadratically
42Newton Descent Algorithm
- Input: x_0, objective f(x), e, T
- Output: x_star that minimizes f(x)
- t = 0
- while ( t == 0 || ( f(x_(t-1)) - f(x_t) >= e && t < T ) )   // e.g., T = 10
- {
-   g_t = gradient of f(x) at x_t
-   H_t = Hessian matrix of f(x) at x_t
-   S = inverse matrix of H_t
-   x_(t+1) = x_t - S * g_t
-   t++
- }
- Output x_t
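A runnable sketch of Newton's method for the same regularized logistic regression objective (illustrative only; it takes the full Newton step and skips any line search, and the data are synthetic):

```python
import numpy as np

def newton_lr(X, y, lam=1.0, eps=1e-8, T=10):
    # Work with theta = [w; b] and the augmented data Z = [X, 1].
    Z = np.hstack([X, np.ones((X.shape[0], 1))])
    reg = lam * np.eye(Z.shape[1]); reg[-1, -1] = 0.0         # do not regularize b
    theta = np.zeros(Z.shape[1])
    f = lambda th: 0.5 * th @ reg @ th + np.sum(np.logaddexp(0.0, -y * (Z @ th)))
    f_prev, t = np.inf, 0
    while t == 0 or (f_prev - f(theta) >= eps and t < T):
        f_prev = f(theta)
        p = 1.0 / (1.0 + np.exp(y * (Z @ theta)))             # p(-y_i | x_i, theta)
        g = reg @ theta - Z.T @ (p * y)                       # gradient
        H = reg + Z.T @ (Z * (p * (1 - p))[:, None])          # Hessian
        theta = theta - np.linalg.solve(H, g)                 # full Newton step
        t += 1
    return theta[:-1], theta[-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=200))
w, b = newton_lr(X, y)
print("w =", w, "b =", b)
```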
43Quasi-Newton Methods
- Computing and inverting the Hessian is expensive
- Quasi-Newton methods can approximate H⁻¹ directly (L-BFGS)
- Iteration: x_(n+1) = x_n - η_n B_n⁻¹ ∇f(x_n)
- Secant equation: ∇f(x_(n+1)) - ∇f(x_n) = B_(n+1) (x_(n+1) - x_n)
- The secant equation does not fully determine B
- L-BFGS updates B_(n+1)⁻¹ using two rank-one matrices
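In practice L-BFGS is rarely coded by hand; a minimal sketch handing the regularized logistic regression objective and gradient to SciPy's off-the-shelf solver (synthetic data, illustrative settings):

```python
import numpy as np
from scipy.optimize import minimize

def f_and_grad(wb, X, y, lam):
    # Returns both the objective and its gradient, as L-BFGS expects with jac=True.
    w, b = wb[:-1], wb[-1]
    m = y * (X @ w + b)
    p = 1.0 / (1.0 + np.exp(m))                 # p(-y_i | x_i, w)
    f = 0.5 * lam * w @ w + np.sum(np.logaddexp(0.0, -m))
    g = np.concatenate([lam * w - X.T @ (p * y), [-np.sum(p * y)]])
    return f, g

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=100))

res = minimize(f_and_grad, np.zeros(3), args=(X, y, 1.0),
               jac=True, method="L-BFGS-B")
print("w =", res.x[:-1], "b =", res.x[-1], "converged:", res.success)
```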
44Machine Learning Problems from the Probability
View
45Bayes Decision Rule
- Bayes decision rule
- If p(y=1|x) > p(y=-1|x), predict y = 1; else predict y = -1
- Equivalently, if p(y=1|x) > ½, predict y = 1; else predict y = -1
46Bayesian Approach
- p(y|x,X,Y) = ∫ p(y,f|x,X,Y) df
- = ∫ p(y|f,x,X,Y) p(f|x,X,Y) df
- = ∫ p(y|f,x) p(f|X,Y) df
- This integral is often intractable.
- To solve it we can
- Choose the distributions so that the solution is analytic (conjugate priors)
- Approximate the true distribution p(f|X,Y) by a simpler distribution (variational methods)
- Sample from p(f|X,Y) (MCMC)
47Maximum A Posteriori (MAP)
- p(y|x,X,Y) = ∫ p(y|f,x) p(f|X,Y) df
- ≈ p(y|f_MAP,x) when p(f|X,Y) ≈ δ(f - f_MAP)
- The more training data there is, the better p(f|X,Y) approximates a delta function
- We can make predictions using a single function, f_MAP, and our focus shifts to estimating f_MAP.
48MAP Maximum Likelihood (ML)
- f_MAP = argmax_f p(f|X,Y)
- = argmax_f p(X,Y|f) p(f) / p(X,Y)
- = argmax_f p(X,Y|f) p(f)
- f_ML = argmax_f p(X,Y|f) (Maximum Likelihood)
- Maximum Likelihood holds if
- There is a lot of training data so that
- p(X,Y|f) >> p(f)
- Or if there is no prior knowledge so that p(f)
is uniform (improper)
49IID Data
- f_ML = argmax_f p(X,Y|f)
- = argmax_f Πi p(xi,yi|f)
- The independent and identically distributed
assumption holds only if we know everything about
the joint distribution of the features and
labels.
- In particular, p(X,Y) ≠ Πi p(xi,yi)
50Discriminative Methods Logistic Regression
51Discriminative Methods
- θ_MAP = argmax_θ p(θ) Πi p(xi,yi|θ)
- We assume that
- p(θ) = p(w) p(w')
- p(xi,yi|θ) = p(yi|xi,θ) p(xi|θ)
- = p(yi|xi,w) p(xi|w')
- θ_MAP = ( argmax_w p(w) Πi p(yi|xi,w), argmax_w' p(w') Πi p(xi|w') )
- It turns out that only w plays a role in determining the posterior distribution
- p(y|x,X,Y) ≈ p(y|x,θ_MAP) = p(y|x,w_MAP)
- where w_MAP = argmax_w p(w) Πi p(yi|xi,w)
52Disc. Methods Logistic Regression
- θ_MAP = argmax_{w,b} p(w) Πi p(yi|xi,w)
- Regularized Logistic Regression
- Gaussian prior: p(w) ∝ exp( -½λ wᵗw )
- Logistic likelihood
- p(yi|xi,w) = 1 / (1 + exp(-yi(b + wᵗxi)))
53Regularized Logistic Regression
- θ_MAP = argmax_{w,b} p(w) Πi p(yi|xi,w)
- = argmin_{w,b} ½λ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi)))
- Bad news: no closed form solution for w and b
- Good news: we have to minimize a convex function
- We can obtain the global optimum
- The function is smooth
- Tom Minka, A comparison of numerical optimizers
for LR (Matlab code)
- Keerthi et al., A Fast Dual Algorithm for Kernel Logistic Regression, ML 05
- Andrew and Gao, OWL-QN, ICML 07
- Krishnapuram et al., SMLR, PAMI 05
54Regularized Logistic Regression
Zhu & Hastie, KLR and the
Import Vector Machine, NIPS 01
55Regularized Logistic Regression
Zhu & Hastie, KLR and the
Import Vector Machine, NIPS 01
56Convex Functions
- Convex f: f(λx1 + (1 - λ)x2) ≤ λ f(x1) + (1 - λ) f(x2) for λ ∈ [0, 1]
- The Hessian ∇²f is always positive semi-definite
- The tangent is always a lower bound to f
57Gradient Descent
- Iteration: x_(n+1) = x_n - η_n ∇f(x_n)
- Step size selection: Armijo rule
- Stopping criterion: change in f is minuscule
58Gradient Descent Logistic Regression
- ℓ(w, b) = ½λ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi)))
- ∇w ℓ(w, b) = λw - Σi p(-yi|xi,w) yi xi
- ∇b ℓ(w, b) = -Σi p(-yi|xi,w) yi
- Beware of numerical issues while coding!
59Newton Methods
- Iteration: x_(n+1) = x_n - η_n H⁻¹ ∇f(x_n)
- Approximate f by a 2nd order Taylor expansion
- The error can now decrease quadratically
60Quasi-Newton Methods
- Computing and inverting the Hessian is expensive
- Quasi-Newton methods can approximate H⁻¹ directly (L-BFGS)
- Iteration: x_(n+1) = x_n - η_n B_n⁻¹ ∇f(x_n)
- Secant equation: ∇f(x_(n+1)) - ∇f(x_n) = B_(n+1) (x_(n+1) - x_n)
- The secant equation does not fully determine B
- L-BFGS updates B_(n+1)⁻¹ using two rank-one matrices
61Multi-class Logistic Regression
- Multinomial Logistic Regression
- 1-vs-All (a sketch follows this list)
- Learn L binary classifiers for an L class problem
- For the lth classifier, examples from class l are +ve while examples from all other classes are -ve
- Classify new points according to max probability
- 1-vs-1
- Learn L(L-1)/2 binary classifiers for an L class problem by considering every class pair
- Classify novel points by majority vote
- Classify novel points by building a DAG
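A minimal 1-vs-All sketch on synthetic 3-class data (illustrative only; it uses scikit-learn's LogisticRegression as the base binary classifier, and the data are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
L, N = 3, 50
X = np.vstack([rng.normal(loc=3 * l, size=(N, 2)) for l in range(L)])
y = np.repeat(np.arange(L), N)

# Train L binary classifiers: class l is +ve, every other class is -ve.
classifiers = []
for l in range(L):
    y_binary = np.where(y == l, 1, -1)
    classifiers.append(LogisticRegression(C=1.0).fit(X, y_binary))

# Classify points by the classifier with the highest score.
scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
y_pred = np.argmax(scores, axis=1)
print("training accuracy:", np.mean(y_pred == y))
```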
62Multi-class Logistic Regression
- Assume
- Non-linear multi-class classifier
- Number of classes L
- Number of training points per class N
- Algorithm training time for M points: O(M³)
- Classification time given M training points: O(M)
63Multi-class Logistic Regression
- Multinomial Logistic Regression
- Training time: O(L⁶N³)
- Classification time for a new point: O(L²N)
- 1-vs-All
- Training time: O(L⁴N³)
- Classification time for a new point: O(L²N)
- 1-vs-1
- Training time: O(L²N³)
- Majority vote classification time: O(L²N)
- DAG classification time: O(LN)
64Multinomial Logistic Regression
- θ_MAP = argmax_{w,b} p(w) Πi p(yi|xi,w)
- Regularized Multinomial Logistic Regression
- Gaussian prior
- p(w) ∝ exp( -½ λ Σl wlᵗwl )
- Multinomial logistic posterior
- p(yi = l|xi,w) = e^(fl(xi)) / Σk e^(fk(xi))
- where fk(xi) = wkᵗxi + bk
- Note that we have to learn an extra classifier by not explicitly enforcing Σl p(yi = l|xi,w) = 1
65Multinomial Logistic Regression
- ℓ(w, b) = ½λ Σk wkᵗwk + Σi [ log(Σk e^(fk(xi))) - Σk δ_(k,yi) fk(xi) ]
- ∇wk ℓ(w, b) = λ wk + Σi ( p(yi = k|xi,w) - δ_(k,yi) ) xi
- ∇bk ℓ(w, b) = Σi ( p(yi = k|xi,w) - δ_(k,yi) )
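A minimal sketch of the softmax probabilities and the gradients above (shapes and data are illustrative; W is D×L, b has one entry per class, and y holds labels in {0, …, L-1}):

```python
import numpy as np

def softmax_probs(W, b, X):
    F = X @ W + b                              # f_k(x_i) = w_k^t x_i + b_k
    F -= F.max(axis=1, keepdims=True)          # subtract the row max for numerical stability
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)    # p(y_i = k | x_i, w)

def gradients(W, b, X, y, lam):
    P = softmax_probs(W, b, X)
    onehot = np.eye(W.shape[1])[y]             # delta_{k, y_i}
    G = P - onehot                             # p(y_i = k | x_i, w) - delta_{k, y_i}
    return lam * W + X.T @ G, G.sum(axis=0)    # gradients w.r.t. W and b

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([0, 1, 2, 0, 1, 2])
gW, gb = gradients(np.zeros((4, 3)), np.zeros(3), X, y, lam=1.0)
print(gW.shape, gb)
```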
66Multi-class Logistic Regression
67Multi-class Logistic Regression
68Multi-class Logistic Regression
69Multi-class Logistic Regression
70Multi-class Logistic Regression
71Multi-class Logistic Regression
72From Probabilities to Loss Functions
θ_MAP = argmin_{w,b} ½λ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi)))
73Support Vector Machines
74Binary Classification
75A Separating Hyperplane
76Maximum Margin Hyperplane
Geometric Intuition: Choose the perpendicular
bisector of the shortest line segment joining the
convex hulls of the two classes
77SVM Notation
Margin = 2 / √(wᵗw)
(Figure: the maximum margin hyperplane wᵗx + b = 0 with supporting hyperplanes wᵗx + b = +1 and wᵗx + b = -1; the support vectors are the points lying on the supporting hyperplanes)
78Calculating the Margin
- Let x+ be any point on the +ve supporting plane and x- the closest point on the -ve supporting plane
- Margin = ||x+ - x-||
- = λ||w|| (since x+ - x- = λw)
- = 2||w|| / ||w||² (assuming λ = 2/||w||²)
- = 2/||w||
- wᵗx+ + b = +1
- wᵗx- + b = -1
- ⇒ wᵗ(x+ - x-) = 2 ⇒ λ wᵗw = 2 ⇒ λ = 2/||w||²
79Hard Margin SVM Primal
- Maximize 2/||w||
- such that wᵗxi + b ≥ +1 if yi = +1
- wᵗxi + b ≤ -1 if yi = -1
- Difficult to optimize directly
- Convex Quadratic Program (QP) reformulation
- Minimize ½wᵗw
- such that yi(wᵗxi + b) ≥ 1
- Convex QPs can be easy to optimize
80Linearly Inseparable Data
- Minimize ½wᵗw + C · (number of misclassified points)
- such that yi(wᵗxi + b) ≥ 1 (for good points)
- The optimization problem is NP-hard in general
- Disastrous errors are penalized the same as near
misses
81Inseparable Data Hinge Loss
Margin = 2 / √(wᵗw)
(Figure: the soft-margin SVM; points with slack ξ = 0 are support vectors on the supporting hyperplanes wᵗx + b = ±1, points with 0 < ξ < 1 violate the margin, and points with ξ > 1 are misclassified)
82The C-SVM Primal Formulation
- Minimize ½wᵗw + C Σi ξi
- such that yi(wᵗxi + b) ≥ 1 - ξi
- ξi ≥ 0
- The optimization is a convex QP
- The globally optimal solution will be obtained
- Number of variables: D + N + 1
- Number of constraints: 2N
- Solvers can train on 800K points in 47K (sparse)
dimensions in less than 2 minutes on a standard
PC
- Fan et al., LIBLINEAR, JMLR 08
- Bordes et al., LaRank, ICML 07
83The C-SVM Dual Formulation
- Maximize 1ᵗα - ½αᵗYKYα
- such that 1ᵗYα = 0
- 0 ≤ α ≤ C
- K is a kernel matrix such that Kij = K(xi, xj) = xiᵗxj
- α are the dual variables (Lagrange multipliers)
- Knowing α gives us w and b
- The dual is also a convex QP
- Number of variables: N
- Number of constraints: 2N + 1
- Fan et al., LIBSVM, JMLR 05
- Joachims, SVMLight
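A quick usage sketch (illustrative data and C): scikit-learn's SVC wraps LIBSVM, so the fitted model directly exposes the support vectors, i.e. the points whose dual variables α are non-zero:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=+2, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", clf.support_.size, "of", len(y))   # most alphas are zero
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```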
84SVMs versus Regularized LR
Most of the SVM αs are zero!
85SVMs versus Regularized LR
Most of the SVM αs are zero!
86SVMs versus Regularized LR
Most of the SVM αs are not zero
87Duality
- Primal P: min_x f0(x)
- s.t. fi(x) ≤ 0, 1 ≤ i ≤ N
- hi(x) = 0, 1 ≤ i ≤ M
- Lagrangian: L(x,λ,ν) = f0(x) + Σi λi fi(x) + Σi νi hi(x)
- Dual D: max_(λ,ν) min_x L(x,λ,ν)
- s.t. λ ≥ 0
88Duality
- The Lagrange dual is always concave (even if the
primal is not convex) and might be an easier
problem to optimize
- Weak duality: P* ≥ D*
- Always holds
- Strong duality: P* = D*
- Does not always hold
- Usually holds for convex problems
- Holds for the SVM QP
89Karush-Kuhn-Tucker (KKT) Conditions
- If strong duality holds, then for x*, λ* and ν* to be optimal the following KKT conditions must necessarily hold
- Primal feasibility: fi(x*) ≤ 0, hi(x*) = 0 for all i
- Dual feasibility: λ* ≥ 0
- Stationarity: ∇x L(x*, λ*, ν*) = 0
- Complementary slackness: λi* fi(x*) = 0
- If x*, λ* and ν* satisfy the KKT conditions for a convex problem then they are optimal
90SVM Duality
- Primal P: min_(w,ξ,b) ½wᵗw + Cᵗξ
- s.t. Y(Xᵗw + b1) ≥ 1 - ξ
- ξ ≥ 0
- Lagrangian: L(α,β,w,ξ,b) = ½wᵗw + Cᵗξ - βᵗξ - αᵗ[ Y(Xᵗw + b1) - 1 + ξ ]
- Dual D: max_α 1ᵗα - ½αᵗYKYα
- s.t. 1ᵗYα = 0
- 0 ≤ α ≤ C
91SVM KKT Conditions
- Lagrangian: L(α,β,w,ξ,b) = ½wᵗw + Cᵗξ - βᵗξ - αᵗ[ Y(Xᵗw + b1) - 1 + ξ ]
- Stationarity conditions
- ∇w L = 0 ⇒ w = XYα (Representer Theorem)
- ∇ξ L = 0 ⇒ C = α + β
- ∇b L = 0 ⇒ αᵗY1 = 0
- Complementary slackness conditions
- αi [ yi(xiᵗw + b) - 1 + ξi ] = 0
- βi ξi = 0
92Hinge Loss and Sparseness in ?
- From the stationarity and complementary slackness conditions it is easy to show that
- αi = 0 ⇒ xi has been classified correctly and lies beyond its supporting hyperplane
- 0 < αi < C ⇒ xi is a support vector and lies on its supporting hyperplane
- αi = C ⇒ xi has been misclassified or is a margin violator
93Hinge Loss and Sparseness in ?
- SVM αs are sparse but LR αs are not
94Linearly Inseparable Data
- This 1D dataset cannot be separated using a single hyperplane (threshold)
- We need a non-linear decision boundary
95Increasing Dimensionality Non-linearly
- The dataset is now linearly separable in φ space
- φ(x) = (x, x²)
96The Kernel Trick
- Let the lifted training set be (φ(xi), yi)
- Define the kernel such that Kij = K(xi, xj) = φ(xi)ᵗφ(xj)
- Primal P: min_(w,ξ,b) ½wᵗw + Cᵗξ
- s.t. Y(φ(X)ᵗw + b1) ≥ 1 - ξ
- ξ ≥ 0
- Dual D: max_α 1ᵗα - ½αᵗYKYα
- s.t. 1ᵗYα = 0
- 0 ≤ α ≤ C
- Classifier: f(x) = sign(φ(x)ᵗw + b) = sign(αᵗYK(·,x) + b)
97The Kernel Trick
- Let φ(x) = [1, √2 x1, …, √2 xD, x1², …, xD², √2 x1x2, …, √2 x1xD, …, √2 xD-1xD]ᵗ
- Define K(xi, xj) = φ(xi)ᵗφ(xj) = (xiᵗxj + 1)² (verified numerically in the sketch after this slide)
- Primal
- Number of variables: D' + N + 1 (where D' is the dimension of φ(x))
- Number of constraints: 2N
- Number of flops for calculating φ(x)ᵗw: O(D²)
- Number of flops for a degree 20 polynomial: O(D²⁰)
- Dual
- Number of variables: N
- Number of constraints: 2N + 1
- Number of flops for calculating Kij: O(D)
- Number of flops for a degree 20 polynomial: O(D)
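A small numerical check of the identity above: the kernel value (xᵗz + 1)², computed with O(D) flops, equals the inner product of the explicit O(D²)-dimensional feature map (my own illustration):

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map: [1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j (i < j)]
    D = x.size
    quad = [x[i] * x[j] * (np.sqrt(2.0) if i != j else 1.0)
            for i in range(D) for j in range(i, D)]
    return np.concatenate([[1.0], np.sqrt(2.0) * x, quad])

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
print(phi(x) @ phi(z))       # explicit lift: O(D^2) features
print((x @ z + 1.0) ** 2)    # kernel trick: O(D) flops, same value
```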
98Some Popular Kernels
- Linear: K(xi,xj) = xiᵗΣ⁻¹xj
- Polynomial: K(xi,xj) = (xiᵗΣ⁻¹xj + c)^d
- Gaussian (RBF): K(xi,xj) = exp( -Σk γk (xik - xjk)² )
- Chi-squared: K(xi,xj) = exp( -χ²(xi, xj) )
- Sigmoid: K(xi,xj) = tanh(xiᵗxj + c)
- Σ should be positive definite, c ≥ 0, γ ≥ 0 and d should be a natural number
99Valid Kernels: Mercer's Theorem
- Let Z be a compact subset of ℝ^D and K a continuous symmetric function. Then K is a kernel if
- ∫_Z ∫_Z f(x) K(x,z) f(z) dx dz ≥ 0
- for all square integrable real valued functions f on Z.
100Valid Kernels: Mercer's Theorem
- Let Z be a compact subset of ℝ^D and K a continuous symmetric function. Then K is a kernel if
- ∫_Z ∫_Z f(x) K(x,z) f(z) dx dz ≥ 0
- for all square integrable real valued functions f on Z.
- K is a kernel if every finite symmetric matrix formed by evaluating K on pairs of points from Z is positive semi-definite
101Operations on Kernels
- The following operations result in valid kernels
- K(xi,xj) = Σk λk Kk(xi,xj) (λk ≥ 0)
- K(xi,xj) = Πk Kk(xi,xj)
- K(xi,xj) = f(xi) f(xj) (f: ℝ^D → ℝ)
- K(xi,xj) = p(K1(xi,xj)) (p a polynomial with +ve coefficients)
- K(xi,xj) = exp(K1(xi,xj))
- Kernels can be defined over graphs, sets,
strings and many other interesting data structures
102Kernels
- Kernels should encode all our prior knowledge about feature similarities.
- Kernel parameters can be chosen through cross validation or learnt (see Multiple Kernel Learning).
- Non-linear kernels can sometimes boost classification performance tremendously.
- Non-linear kernels are generally expensive (both during training and for prediction)
103Polynomial Kernel of Degree 2
104Polynomial Kernel of Degree 5
105RBF Kernel
106Exponential ?2 Kernel
107Kernel Parameter Setting - Underfitting
108Kernel Parameter Setting
109Kernel Parameter Setting - Overfitting
110Structured Output Prediction
- Minimize over f: ½||f||² + C Σi ξi
- such that f(xi,yi) ≥ f(xi,y) + Δ(yi,y) - ξi ∀ y ≠ yi
- ξi ≥ 0
- Prediction: argmax_y f(x,y)
- This formulation minimizes the hinge on the loss Δ on the training set subject to regularization on f
- Can be used to predict sets, graphs, etc. for suitable choices of Δ
- Taskar et al., Max-Margin Markov Networks, NIPS 03
- Tsochantaridis et al., Large Margin Methods for Structured and Interdependent Output Variables, JMLR 05
111Multi-Class SVM
- Minimize over f: ½||f||² + C Σi ξi
- such that f(xi,yi) ≥ f(xi,y) + Δ(yi,y) - ξi ∀ y ≠ yi
- ξi ≥ 0
- Prediction: argmax_y f(x,y)
- Δ(yi,y) = 1 - δ_(yi,y)
- f(x,y) = wᵗ ( φ(x) ⊗ ψ(y) ) + bᵗψ(y)
- = wyᵗφ(x) + by (assuming ψ(y) = e_y)
- Weston and Watkins, SVMs for Multi-Class Pattern Recognition, ESANN 99
- Bordes et al., LaRank, ICML 07
112Multi-Class SVM
- min_(w,b) ½ Σk wkᵗwk + C Σi ξi
- s.t. w_(yi)ᵗφ(xi) + b_(yi) ≥ wyᵗφ(xi) + by + 1 - ξi ∀ i, ∀ y ≠ yi
- ξi ≥ 0
- Prediction: argmax_y wyᵗφ(x) + by
- For L classes, with N points per class, the number of constraints is NL²
- Finding the exact solution for real world non-linear problems is often infeasible
- In practice, we can obtain an approximate solution or switch to the 1-vs-All or 1-vs-1 formulations