Title: Classification
1Classification
Yan Pan
2Under and Over Fitting
3Probability Theory
- Non-negativity and unit measure
- 0 ≤ p(y) ≤ 1, p(Ω) = 1, p(∅) = 0
- Conditional probability p(y|x)
- p(x, y) = p(y|x) p(x) = p(x|y) p(y)
- Bayes Theorem
- p(y|x) = p(x|y) p(y) / p(x)
- Marginalization
- p(x) = ∫ p(x, y) dy
- Independence
- p(x1, x2) = p(x1) p(x2) ⇔ p(x1|x2) = p(x1)
- Chris Bishop, Pattern Recognition and Machine Learning
5The Univariate Gaussian Density
- p(x|μ,σ²) = exp( -(x - μ)² / 2σ² ) / (2πσ²)^½
(Figure: the univariate Gaussian density curve, with the horizontal axis marked at μ and μ ± 1σ, 2σ, 3σ)
6The Multivariate Gaussian Density
- p(x|μ,Σ) = exp( -½ (x - μ)ᵗ Σ⁻¹ (x - μ) ) / ( (2π)^(D/2) |Σ|^½ )
7The Beta Density
- p(θ|a,b) = θ^(a-1) (1 - θ)^(b-1) Γ(a+b) / ( Γ(a) Γ(b) )
8Probability Distribution Functions
- Bernoulli: a single trial with probability of success θ
- n ∈ {0, 1}, θ ∈ [0, 1]
- p(n|θ) = θ^n (1 - θ)^(1-n)
- Binomial: N iid Bernoulli trials with n successes
- n ∈ {0, 1, …, N}, θ ∈ [0, 1]
- p(n|N,θ) = NCn θ^n (1 - θ)^(N-n)
9A Toy Example
- We don't know whether a coin is fair or not. We are told that heads occurred n times in N coin flips.
- We are asked to predict whether the next coin flip will result in a head or a tail.
- Let y be a binary random variable such that y = 1 represents the event that the next coin flip will be a head and y = 0 that it will be a tail.
- We should predict heads if p(y=1|n,N) > p(y=0|n,N)
10The Maximum Likelihood Approach
- Let p(y=1|n,N) = θ and p(y=0|n,N) = 1 - θ, so that we should predict heads if θ > ½
- How should we estimate θ?
- Assuming that the observed coin flips followed a Binomial distribution, we could choose the value of θ that maximizes the likelihood of observing the data
- θ_ML = argmax_θ p(n|θ) = argmax_θ NCn θ^n (1 - θ)^(N-n)
- = argmax_θ n log(θ) + (N - n) log(1 - θ)
- = n / N
- We should predict heads if n > ½ N
11The Maximum A Posteriori Approach
- We should choose the value of θ maximizing the posterior probability of θ conditioned on the data
- We assume a
- Binomial likelihood: p(n|θ) = NCn θ^n (1 - θ)^(N-n)
- Beta prior: p(θ|a,b) = θ^(a-1) (1 - θ)^(b-1) Γ(a+b) / ( Γ(a) Γ(b) )
- θ_MAP = argmax_θ p(θ|n,a,b) = argmax_θ p(n|θ) p(θ|a,b)
- = argmax_θ θ^n (1 - θ)^(N-n) θ^(a-1) (1 - θ)^(b-1)
- = (n + a - 1) / (N + a + b - 2), as if we saw an extra a - 1 heads and b - 1 tails
- We should predict heads if n > ½ (N + b - a)
12The Bayesian Approach
- We should marginalize over θ
- p(y=1|n,a,b) = ∫ p(y=1|n,θ) p(θ|a,b,n) dθ
- = ∫ θ p(θ|a,b,n) dθ
- = ∫ θ Beta(θ|a + n, b + N - n) dθ
- = (n + a) / (N + a + b), as if we saw an extra a heads and b tails
- We should predict heads if n > ½ (N + b - a)
- The Bayesian and MAP predictions coincide in this case
- In the very large data limit, both the Bayesian and MAP predictions coincide with the ML prediction (n > ½ N)
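The three approaches above differ only in how the prior pseudo-counts a and b enter the estimate. A minimal runnable sketch of the coin example (the counts n, N and the prior a, b below are my own illustrative choices, not from the slides):

```python
# Coin-flip toy example: ML, MAP and Bayesian predictions side by side.
n, N = 7, 10        # heads observed out of N flips (hypothetical numbers)
a, b = 2.0, 2.0     # Beta(a, b) prior pseudo-counts (hypothetical)

theta_ml    = n / N                            # maximum likelihood: n / N
theta_map   = (n + a - 1) / (N + a + b - 2)    # posterior mode (MAP)
theta_bayes = (n + a) / (N + a + b)            # posterior mean (full Bayesian)

for name, theta in [("ML", theta_ml), ("MAP", theta_map), ("Bayes", theta_bayes)]:
    print(f"{name:5s} theta = {theta:.3f} -> predict {'heads' if theta > 0.5 else 'tails'}")
```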
18Classification
19Binary Classification
20Approaches to Classification
- Memorization
- Cannot deal with previously unseen data
- Large-scale annotated data acquisition cost might be very high
- Rule-based expert system
- Dependent on the competence of the expert.
- Complex problems lead to a proliferation of rules, exceptions, exceptions to exceptions, etc.
- Rules might not transfer to similar problems
- Learning from training data and prior knowledge
- Focuses on generalization to novel data
21Notation
- Training Data
- Set of N labeled examples of the form (xi, yi)
- Feature vector x ∈ ℝ^D. X = [x1 x2 … xN]
- Label y ∈ {±1}. y = [y1, y2, …, yN]ᵗ. Y = diag(y)
- Example: Gender Identification
(x1, y1 = +1)
(x2, y2 = +1)
(x3, y3 = +1)
(x4, y4 = -1)
22Binary Classification
23Binary Classification
(Figure: a linear classifier; the separating hyperplane wᵗx + b = 0 with normal vector w and offset b)
24Machine Learning from the Optimization View
- Before we go into the details of classification
and regression methods, we should take a close
look at the objective functions of machine
learning
- From this view, machine learning amounts to defining an objective function (typically a regularization term plus a loss on the training data) and then minimizing it
- The next slides make this form explicit and then walk through several examples
25Supervised Learning
26Common Form of Supervised Learning Problems
- Minimize the following objective function
- Objective = Regularization term + Loss function
- Regularization term: controls the model complexity and helps avoid overfitting
- Loss function: measures the quality of the learned function, i.e., the prediction error on the training data.
27Ex.1 Linear Regression
- E(w) = ½ Σn (yn - wᵗxn)² + ½λ wᵗw
28Ex.2 Logistic Regression (classification method)
- ℓ(w, b) = ½λ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi)))
29Ex.3 SVM
- E(w) = ½λ wᵗw + Σi max(0, 1 - yi wᵗxi)
- Or
- E(w) = ½λ wᵗw + Σi max(0, 1 - yi wᵗxi)²
30How to measure error?
- True label: yi
- Predicted value: wᵗxi
- Natural error measures include
- I(yi ≠ wᵗxi)
- (yi - wᵗxi)²
- If the labels take values in {-1, 1}, we can also use the margin
- yi wᵗxi
31Approximate the Zero-One Loss
- Squared Error
- Exponential Loss
- Logistic Loss
- Hinge Loss
- Sigmoid Loss
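Each surrogate above can be written as a function of the margin m = y·f(x). A minimal sketch (my own illustration; the exact scaling conventions for these losses vary between references):

```python
import numpy as np

# Surrogate losses as functions of the margin m = y * f(x).
def zero_one(m):    return (m <= 0).astype(float)     # the loss being approximated
def squared(m):     return (1.0 - m) ** 2             # since (y - f)^2 = (1 - yf)^2 for y in {-1, +1}
def exponential(m): return np.exp(-m)
def logistic(m):    return np.log1p(np.exp(-m))       # logistic regression loss
def hinge(m):       return np.maximum(0.0, 1.0 - m)   # SVM hinge loss
def sigmoid(m):     return 1.0 / (1.0 + np.exp(m))    # one common (non-convex) sigmoid surrogate

m = np.linspace(-2.0, 2.0, 5)
for loss in (zero_one, squared, exponential, logistic, hinge, sigmoid):
    print(f"{loss.__name__:11s}", np.round(loss(m), 3))
```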
32Regularized Logistic Regression
Zhu & Hastie, KLR and the
Import Vector Machine, NIPS 01
33Regularized Logistic Regression
Zhu & Hastie, KLR and the
Import Vector Machine, NIPS 01
36Convex Functions
- Convex f: f(λx1 + (1 - λ)x2) ≤ λ f(x1) + (1 - λ) f(x2) for λ ∈ [0, 1]
- The Hessian ∇²f is always positive semi-definite
- The tangent is always a lower bound to f
38Gradient Descent
- Iteration: x_(n+1) = x_n - η_n ∇f(x_n)
- Step size selection: Armijo rule
- Stopping criterion: change in f is minuscule
39Gradient Descent Logistic Regression
- ℓ(w, b) = ½λ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi)))
- ∇w ℓ(w, b) = λw - Σi p(-yi|xi,w) yi xi
- ∇b ℓ(w, b) = -Σi p(-yi|xi,w) yi
- Beware of numerical issues while coding!
40Gradient Descent Algorithm
- Input: x_0, objective f(x), e, T
- Output: x_star that minimizes f(x)
- t = 0
- while ( t == 0 || ( f(x_(t-1)) - f(x_t) >= e && t < T ) )   // e.g., T = 100000
- {
-   g_t = gradient of f(x) at x_t
-   for ( i = 10; i >= -6; i-- )
-   {
-     s = 2^i
-     x_(t+1) = x_t - s * g_t
-     if ( f(x_(t+1)) < f(x_t) )
-       break
-   }
-   t++
- }
- Output x_t
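A runnable sketch of this procedure applied to the regularized logistic regression objective of the previous slide (the synthetic data, λ and stopping constants are my own illustrative choices):

```python
import numpy as np

def lr_objective(wb, X, y, lam):
    # wb = [w, b]; objective = ½λ wᵗw + Σ_i log(1 + exp(-y_i (b + wᵗx_i)))
    w, b = wb[:-1], wb[-1]
    m = y * (X @ w + b)
    return 0.5 * lam * w @ w + np.sum(np.logaddexp(0.0, -m))   # log(1 + exp(-m)), computed stably

def lr_gradient(wb, X, y, lam):
    w, b = wb[:-1], wb[-1]
    m = y * (X @ w + b)
    p = 1.0 / (1.0 + np.exp(m))                   # p(-y_i | x_i, w)
    return np.concatenate([lam * w - X.T @ (p * y), [-np.sum(p * y)]])

def gradient_descent(f, grad, x0, eps=1e-6, T=100000):
    x, t, f_prev = x0, 0, np.inf
    while t == 0 or (f_prev - f(x) >= eps and t < T):
        f_prev, g = f(x), grad(x)
        for i in range(10, -7, -1):               # try step sizes 2^10 down to 2^-6
            x_new = x - (2.0 ** i) * g
            if f(x_new) < f(x):
                x = x_new
                break
        t += 1
    return x

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=100))
wb = gradient_descent(lambda v: lr_objective(v, X, y, 1.0),
                      lambda v: lr_gradient(v, X, y, 1.0), np.zeros(3))
print("w =", wb[:-1], "b =", wb[-1])
```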
41Newton Methods
- Iteration: x_(n+1) = x_n - η_n H⁻¹ ∇f(x_n)
- Approximate f by a 2nd order Taylor expansion
- The error can now decrease quadratically
42Newton Descent Algorithm
- Input: x_0, objective f(x), e, T
- Output: x_star that minimizes f(x)
- t = 0
- while ( t == 0 || ( f(x_(t-1)) - f(x_t) >= e && t < T ) )   // e.g., T = 10
- {
-   g_t = gradient of f(x) at x_t
-   H_t = Hessian matrix of f(x) at x_t
-   S = inverse matrix of H_t
-   x_(t+1) = x_t - S * g_t
-   t++
- }
- Output x_t
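A runnable sketch of Newton's method for the same regularized logistic regression objective (illustrative only; it takes the full Newton step and skips any line search, and the data are synthetic):

```python
import numpy as np

def newton_lr(X, y, lam=1.0, eps=1e-8, T=10):
    # Work with theta = [w; b] and the augmented data Z = [X, 1].
    Z = np.hstack([X, np.ones((X.shape[0], 1))])
    reg = lam * np.eye(Z.shape[1]); reg[-1, -1] = 0.0         # do not regularize b
    theta = np.zeros(Z.shape[1])
    f = lambda th: 0.5 * th @ reg @ th + np.sum(np.logaddexp(0.0, -y * (Z @ th)))
    f_prev, t = np.inf, 0
    while t == 0 or (f_prev - f(theta) >= eps and t < T):
        f_prev = f(theta)
        p = 1.0 / (1.0 + np.exp(y * (Z @ theta)))             # p(-y_i | x_i, theta)
        g = reg @ theta - Z.T @ (p * y)                       # gradient
        H = reg + Z.T @ (Z * (p * (1 - p))[:, None])          # Hessian
        theta = theta - np.linalg.solve(H, g)                 # full Newton step
        t += 1
    return theta[:-1], theta[-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=200))
w, b = newton_lr(X, y)
print("w =", w, "b =", b)
```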
43Quasi-Newton Methods
- Computing and inverting the Hessian is expensive
- Quasi-Newton methods can approximate H⁻¹ directly (L-BFGS)
- Iteration: x_(n+1) = x_n - η_n B_n⁻¹ ∇f(x_n)
- Secant equation: ∇f(x_(n+1)) - ∇f(x_n) = B_(n+1) (x_(n+1) - x_n)
- The secant equation does not fully determine B
- L-BFGS updates B_(n+1)⁻¹ using two rank-one matrices
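In practice L-BFGS is rarely coded by hand; a minimal sketch handing the regularized logistic regression objective and gradient to SciPy's off-the-shelf solver (synthetic data, illustrative settings):

```python
import numpy as np
from scipy.optimize import minimize

def f_and_grad(wb, X, y, lam):
    # Returns both the objective and its gradient, as L-BFGS expects with jac=True.
    w, b = wb[:-1], wb[-1]
    m = y * (X @ w + b)
    p = 1.0 / (1.0 + np.exp(m))                 # p(-y_i | x_i, w)
    f = 0.5 * lam * w @ w + np.sum(np.logaddexp(0.0, -m))
    g = np.concatenate([lam * w - X.T @ (p * y), [-np.sum(p * y)]])
    return f, g

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=100))

res = minimize(f_and_grad, np.zeros(3), args=(X, y, 1.0),
               jac=True, method="L-BFGS-B")
print("w =", res.x[:-1], "b =", res.x[-1], "converged:", res.success)
```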
44Machine Learning Problems from the Probability
View
45Bayes Decision Rule
- Bayes decision rule
- If p(y=1|x) > p(y=-1|x), predict y = 1; else predict y = -1
- Equivalently, if p(y=1|x) > ½, predict y = 1; else predict y = -1
46Bayesian Approach
- p(y|x,X,Y) = ∫ p(y,f|x,X,Y) df
- = ∫ p(y|f,x,X,Y) p(f|x,X,Y) df
- = ∫ p(y|f,x) p(f|X,Y) df
- This integral is often intractable.
- To solve it we can
- Choose the distributions so that the solution is analytic (conjugate priors)
- Approximate the true distribution p(f|X,Y) by a simpler distribution (variational methods)
- Sample from p(f|X,Y) (MCMC)
47Maximum A Posteriori (MAP)
- p(y|x,X,Y) = ∫ p(y|f,x) p(f|X,Y) df
- ≈ p(y|f_MAP,x) when p(f|X,Y) ≈ δ(f - f_MAP)
- The more training data there is, the better p(f|X,Y) approximates a delta function
- We can make predictions using a single function, f_MAP, and our focus shifts to estimating f_MAP.
48MAP Maximum Likelihood (ML)
- f_MAP = argmax_f p(f|X,Y)
- = argmax_f p(X,Y|f) p(f) / p(X,Y)
- = argmax_f p(X,Y|f) p(f)
- f_ML = argmax_f p(X,Y|f) (Maximum Likelihood)
- Maximum Likelihood holds if
- There is a lot of training data so that
- p(X,Y|f) >> p(f)
- Or if there is no prior knowledge so that p(f)
is uniform (improper)
49IID Data
- f_ML = argmax_f p(X,Y|f)
- = argmax_f Πi p(xi,yi|f)
- The independent and identically distributed
assumption holds only if we know everything about
the joint distribution of the features and
labels.
- In particular, p(X,Y) ≠ Πi p(xi,yi)
50Discriminative Methods Logistic Regression
51Discriminative Methods
- θ_MAP = argmax_θ p(θ) Πi p(xi,yi|θ)
- We assume that
- p(θ) = p(w) p(w')
- p(xi,yi|θ) = p(yi|xi,θ) p(xi|θ)
- = p(yi|xi,w) p(xi|w')
- θ_MAP = ( argmax_w p(w) Πi p(yi|xi,w), argmax_w' p(w') Πi p(xi|w') )
- It turns out that only w plays a role in determining the posterior distribution
- p(y|x,X,Y) ≈ p(y|x,θ_MAP) = p(y|x,w_MAP)
- where w_MAP = argmax_w p(w) Πi p(yi|xi,w)
52Disc. Methods Logistic Regression
- θ_MAP = argmax_{w,b} p(w) Πi p(yi|xi,w)
- Regularized Logistic Regression
- Gaussian prior: p(w) ∝ exp( -½λ wᵗw )
- Logistic likelihood
- p(yi|xi,w) = 1 / (1 + exp(-yi(b + wᵗxi)))
53Regularized Logistic Regression
- θ_MAP = argmax_{w,b} p(w) Πi p(yi|xi,w)
- = argmin_{w,b} ½λ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi)))
- Bad news: no closed form solution for w and b
- Good news: we have to minimize a convex function
- We can obtain the global optimum
- The function is smooth
- Tom Minka, A comparison of numerical optimizers
for LR (Matlab code)
- Keerthi et al., A Fast Dual Algorithm for Kernel Logistic Regression, ML 05
- Andrew and Gao, OWL-QN, ICML 07
- Krishnapuram et al., SMLR, PAMI 05
54Regularized Logistic Regression
Zhu & Hastie, KLR and the
Import Vector Machine, NIPS 01
55Regularized Logistic Regression
Zhu & Hastie, KLR and the
Import Vector Machine, NIPS 01
56Convex Functions
- Convex f: f(λx1 + (1 - λ)x2) ≤ λ f(x1) + (1 - λ) f(x2) for λ ∈ [0, 1]
- The Hessian ∇²f is always positive semi-definite
- The tangent is always a lower bound to f
57Gradient Descent
- Iteration: x_(n+1) = x_n - η_n ∇f(x_n)
- Step size selection: Armijo rule
- Stopping criterion: change in f is minuscule
58Gradient Descent Logistic Regression
- ℓ(w, b) = ½λ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi)))
- ∇w ℓ(w, b) = λw - Σi p(-yi|xi,w) yi xi
- ∇b ℓ(w, b) = -Σi p(-yi|xi,w) yi
- Beware of numerical issues while coding!
59Newton Methods
- Iteration: x_(n+1) = x_n - η_n H⁻¹ ∇f(x_n)
- Approximate f by a 2nd order Taylor expansion
- The error can now decrease quadratically
60Quasi-Newton Methods
- Computing and inverting the Hessian is expensive
- Quasi-Newton methods can approximate H⁻¹ directly (L-BFGS)
- Iteration: x_(n+1) = x_n - η_n B_n⁻¹ ∇f(x_n)
- Secant equation: ∇f(x_(n+1)) - ∇f(x_n) = B_(n+1) (x_(n+1) - x_n)
- The secant equation does not fully determine B
- L-BFGS updates B_(n+1)⁻¹ using two rank-one matrices
61Multi-class Logistic Regression
- Multinomial Logistic Regression
- 1-vs-All (a sketch follows this list)
- Learn L binary classifiers for an L class problem
- For the lth classifier, examples from class l are +ve while examples from all other classes are -ve
- Classify new points according to max probability
- 1-vs-1
- Learn L(L-1)/2 binary classifiers for an L class problem by considering every class pair
- Classify novel points by majority vote
- Classify novel points by building a DAG
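A minimal 1-vs-All sketch on synthetic 3-class data (illustrative only; it uses scikit-learn's LogisticRegression as the base binary classifier, and the data are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
L, N = 3, 50
X = np.vstack([rng.normal(loc=3 * l, size=(N, 2)) for l in range(L)])
y = np.repeat(np.arange(L), N)

# Train L binary classifiers: class l is +ve, every other class is -ve.
classifiers = []
for l in range(L):
    y_binary = np.where(y == l, 1, -1)
    classifiers.append(LogisticRegression(C=1.0).fit(X, y_binary))

# Classify points by the classifier with the highest score.
scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
y_pred = np.argmax(scores, axis=1)
print("training accuracy:", np.mean(y_pred == y))
```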
62Multi-class Logistic Regression
- Assume
- Non-linear multi-class classifier
- Number of classes L
- Number of training points per class N
- Algorithm training time for M points: O(M³)
- Classification time given M training points: O(M)
63Multi-class Logistic Regression
- Multinomial Logistic Regression
- Training time: O(L⁶N³)
- Classification time for a new point: O(L²N)
- 1-vs-All
- Training time: O(L⁴N³)
- Classification time for a new point: O(L²N)
- 1-vs-1
- Training time: O(L²N³)
- Majority vote classification time: O(L²N)
- DAG classification time: O(LN)
64Multinomial Logistic Regression
- θ_MAP = argmax_{w,b} p(w) Πi p(yi|xi,w)
- Regularized Multinomial Logistic Regression
- Gaussian prior
- p(w) ∝ exp( -½ λ Σl wlᵗwl )
- Multinomial logistic posterior
- p(yi = l|xi,w) = e^(fl(xi)) / Σk e^(fk(xi))
- where fk(xi) = wkᵗxi + bk
- Note that we have to learn an extra classifier by not explicitly enforcing Σl p(yi = l|xi,w) = 1
65Multinomial Logistic Regression
- ℓ(w, b) = ½λ Σk wkᵗwk + Σi [ log(Σk e^(fk(xi))) - Σk δ_(k,yi) fk(xi) ]
- ∇wk ℓ(w, b) = λ wk + Σi ( p(yi = k|xi,w) - δ_(k,yi) ) xi
- ∇bk ℓ(w, b) = Σi ( p(yi = k|xi,w) - δ_(k,yi) )
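A minimal sketch of the softmax probabilities and the gradients above (shapes and data are illustrative; W is D×L, b has one entry per class, and y holds labels in {0, …, L-1}):

```python
import numpy as np

def softmax_probs(W, b, X):
    F = X @ W + b                              # f_k(x_i) = w_k^t x_i + b_k
    F -= F.max(axis=1, keepdims=True)          # subtract the row max for numerical stability
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)    # p(y_i = k | x_i, w)

def gradients(W, b, X, y, lam):
    P = softmax_probs(W, b, X)
    onehot = np.eye(W.shape[1])[y]             # delta_{k, y_i}
    G = P - onehot                             # p(y_i = k | x_i, w) - delta_{k, y_i}
    return lam * W + X.T @ G, G.sum(axis=0)    # gradients w.r.t. W and b

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([0, 1, 2, 0, 1, 2])
gW, gb = gradients(np.zeros((4, 3)), np.zeros(3), X, y, lam=1.0)
print(gW.shape, gb)
```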
66Multi-class Logistic Regression
67Multi-class Logistic Regression
68Multi-class Logistic Regression
69Multi-class Logistic Regression
70Multi-class Logistic Regression
71Multi-class Logistic Regression
72From Probabilities to Loss Functions
θ_MAP = argmin_{w,b} ½λ wᵗw + Σi log(1 + exp(-yi(b + wᵗxi)))
73Support Vector Machines
74Binary Classification
75A Separating Hyperplane
76Maximum Margin Hyperplane
Geometric Intuition: Choose the perpendicular
bisector of the shortest line segment joining the
convex hulls of the two classes
77SVM Notation
Margin = 2 / √(wᵗw)
(Figure: the maximum margin hyperplane wᵗx + b = 0 with supporting hyperplanes wᵗx + b = +1 and wᵗx + b = -1; the support vectors are the points lying on the supporting hyperplanes)
78Calculating the Margin
- Let x+ be any point on the +ve supporting plane and x- the closest point on the -ve supporting plane
- Margin = ||x+ - x-||
- = λ||w|| (since x+ - x- = λw)
- = 2||w|| / ||w||² (assuming λ = 2/||w||²)
- = 2/||w||
- wᵗx+ + b = +1
- wᵗx- + b = -1
- ⇒ wᵗ(x+ - x-) = 2 ⇒ λ wᵗw = 2 ⇒ λ = 2/||w||²
79Hard Margin SVM Primal
- Maximize 2/||w||
- such that wᵗxi + b ≥ +1 if yi = +1
- wᵗxi + b ≤ -1 if yi = -1
- Difficult to optimize directly
- Convex Quadratic Program (QP) reformulation
- Minimize ½wᵗw
- such that yi(wᵗxi + b) ≥ 1
- Convex QPs can be easy to optimize
80Linearly Inseparable Data
- Minimize ½wᵗw + C · (number of misclassified points)
- such that yi(wᵗxi + b) ≥ 1 (for good points)
- The optimization problem is NP-hard in general
- Disastrous errors are penalized the same as near
misses
81Inseparable Data Hinge Loss
Margin = 2 / √(wᵗw)
(Figure: the soft-margin SVM; points with slack ξ = 0 are support vectors on the supporting hyperplanes wᵗx + b = ±1, points with 0 < ξ < 1 violate the margin, and points with ξ > 1 are misclassified)
82The C-SVM Primal Formulation
- Minimize ½wᵗw + C Σi ξi
- such that yi(wᵗxi + b) ≥ 1 - ξi
- ξi ≥ 0
- The optimization is a convex QP
- The globally optimal solution will be obtained
- Number of variables: D + N + 1
- Number of constraints: 2N
- Solvers can train on 800K points in 47K (sparse)
dimensions in less than 2 minutes on a standard
PC
- Fan et al., LIBLINEAR, JMLR 08
- Bordes et al., LaRank, ICML 07
83The C-SVM Dual Formulation
- Maximize 1ᵗα - ½αᵗYKYα
- such that 1ᵗYα = 0
- 0 ≤ α ≤ C
- K is a kernel matrix such that Kij = K(xi, xj) = xiᵗxj
- α are the dual variables (Lagrange multipliers)
- Knowing α gives us w and b
- The dual is also a convex QP
- Number of variables: N
- Number of constraints: 2N + 1
- Fan et al., LIBSVM, JMLR 05
- Joachims, SVMLight
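A quick usage sketch (illustrative data and C): scikit-learn's SVC wraps LIBSVM, so the fitted model directly exposes the support vectors, i.e. the points whose dual variables α are non-zero:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=+2, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", clf.support_.size, "of", len(y))   # most alphas are zero
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```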
84SVMs versus Regularized LR
Most of the SVM αs are zero!
85SVMs versus Regularized LR
Most of the SVM αs are zero!
86SVMs versus Regularized LR
Most of the SVM αs are not zero
87Duality
- Primal P: min_x f0(x)
- s.t. fi(x) ≤ 0, 1 ≤ i ≤ N
- hi(x) = 0, 1 ≤ i ≤ M
- Lagrangian: L(x,λ,ν) = f0(x) + Σi λi fi(x) + Σi νi hi(x)
- Dual D: max_(λ,ν) min_x L(x,λ,ν)
- s.t. λ ≥ 0
88Duality
- The Lagrange dual is always concave (even if the
primal is not convex) and might be an easier
problem to optimize
- Weak duality: P* ≥ D*
- Always holds
- Strong duality: P* = D*
- Does not always hold
- Usually holds for convex problems
- Holds for the SVM QP
89Karush-Kuhn-Tucker (KKT) Conditions
- If strong duality holds, then for x*, λ* and ν* to be optimal the following KKT conditions must necessarily hold
- Primal feasibility: fi(x*) ≤ 0, hi(x*) = 0 for all i
- Dual feasibility: λ* ≥ 0
- Stationarity: ∇x L(x*, λ*, ν*) = 0
- Complementary slackness: λi* fi(x*) = 0
- If x*, λ* and ν* satisfy the KKT conditions for a convex problem then they are optimal
90SVM Duality
- Primal P: min_(w,ξ,b) ½wᵗw + Cᵗξ
- s.t. Y(Xᵗw + b1) ≥ 1 - ξ
- ξ ≥ 0
- Lagrangian: L(α,β,w,ξ,b) = ½wᵗw + Cᵗξ - βᵗξ - αᵗ[ Y(Xᵗw + b1) - 1 + ξ ]
- Dual D: max_α 1ᵗα - ½αᵗYKYα
- s.t. 1ᵗYα = 0
- 0 ≤ α ≤ C
91SVM KKT Conditions
- Lagrangian: L(α,β,w,ξ,b) = ½wᵗw + Cᵗξ - βᵗξ - αᵗ[ Y(Xᵗw + b1) - 1 + ξ ]
- Stationarity conditions
- ∇w L = 0 ⇒ w = XYα (Representer Theorem)
- ∇ξ L = 0 ⇒ C = α + β
- ∇b L = 0 ⇒ αᵗY1 = 0
- Complementary slackness conditions
- αi [ yi(xiᵗw + b) - 1 + ξi ] = 0
- βi ξi = 0
92Hinge Loss and Sparseness in ?
- From the stationarity and complementary slackness conditions it is easy to show that
- αi = 0 ⇒ xi has been classified correctly and lies beyond its supporting hyperplane
- 0 < αi < C ⇒ xi is a support vector and lies on its supporting hyperplane
- αi = C ⇒ xi has been misclassified or is a margin violator
93Hinge Loss and Sparseness in ?
- SVM αs are sparse but LR αs are not
94Linearly Inseparable Data
- This 1D dataset cannot be separated using a single hyperplane (threshold)
- We need a non-linear decision boundary
95Increasing Dimensionality Non-linearly
- The dataset is now linearly separable in φ space
- φ(x) = (x, x²)
96The Kernel Trick
- Let the lifted training set be (φ(xi), yi)
- Define the kernel such that Kij = K(xi, xj) = φ(xi)ᵗφ(xj)
- Primal P: min_(w,ξ,b) ½wᵗw + Cᵗξ
- s.t. Y(φ(X)ᵗw + b1) ≥ 1 - ξ
- ξ ≥ 0
- Dual D: max_α 1ᵗα - ½αᵗYKYα
- s.t. 1ᵗYα = 0
- 0 ≤ α ≤ C
- Classifier: f(x) = sign(φ(x)ᵗw + b) = sign(αᵗYK(·,x) + b)
97The Kernel Trick
- Let φ(x) = [1, √2 x1, …, √2 xD, x1², …, xD², √2 x1x2, …, √2 x1xD, …, √2 xD-1xD]ᵗ
- Define K(xi, xj) = φ(xi)ᵗφ(xj) = (xiᵗxj + 1)² (verified numerically in the sketch after this slide)
- Primal
- Number of variables: D' + N + 1 (where D' is the dimension of φ(x))
- Number of constraints: 2N
- Number of flops for calculating φ(x)ᵗw: O(D²)
- Number of flops for a degree 20 polynomial: O(D²⁰)
- Dual
- Number of variables: N
- Number of constraints: 2N + 1
- Number of flops for calculating Kij: O(D)
- Number of flops for a degree 20 polynomial: O(D)
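A small numerical check of the identity above: the kernel value (xᵗz + 1)², computed with O(D) flops, equals the inner product of the explicit O(D²)-dimensional feature map (my own illustration):

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map: [1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j (i < j)]
    D = x.size
    quad = [x[i] * x[j] * (np.sqrt(2.0) if i != j else 1.0)
            for i in range(D) for j in range(i, D)]
    return np.concatenate([[1.0], np.sqrt(2.0) * x, quad])

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
print(phi(x) @ phi(z))       # explicit lift: O(D^2) features
print((x @ z + 1.0) ** 2)    # kernel trick: O(D) flops, same value
```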
98Some Popular Kernels
- Linear: K(xi,xj) = xiᵗΣ⁻¹xj
- Polynomial: K(xi,xj) = (xiᵗΣ⁻¹xj + c)^d
- Gaussian (RBF): K(xi,xj) = exp( -Σk γk (xik - xjk)² )
- Chi-squared: K(xi,xj) = exp( -χ²(xi, xj) )
- Sigmoid: K(xi,xj) = tanh(xiᵗxj + c)
- Σ should be positive definite, c ≥ 0, γ ≥ 0 and d should be a natural number
99Valid Kernels: Mercer's Theorem
- Let Z be a compact subset of ℝ^D and K a continuous symmetric function. Then K is a kernel if
- ∫_Z ∫_Z f(x) K(x,z) f(z) dx dz ≥ 0
- for all square integrable real valued functions f on Z.
100Valid Kernels: Mercer's Theorem
- Let Z be a compact subset of ℝ^D and K a continuous symmetric function. Then K is a kernel if
- ∫_Z ∫_Z f(x) K(x,z) f(z) dx dz ≥ 0
- for all square integrable real valued functions f on Z.
- K is a kernel if every finite symmetric matrix formed by evaluating K on pairs of points from Z is positive semi-definite
101Operations on Kernels
- The following operations result in valid kernels
- K(xi,xj) = Σk λk Kk(xi,xj) (λk ≥ 0)
- K(xi,xj) = Πk Kk(xi,xj)
- K(xi,xj) = f(xi) f(xj) (f: ℝ^D → ℝ)
- K(xi,xj) = p(K1(xi,xj)) (p a polynomial with +ve coefficients)
- K(xi,xj) = exp(K1(xi,xj))
- Kernels can be defined over graphs, sets,
strings and many other interesting data structures
102Kernels
- Kernels should encode all our prior knowledge about feature similarities.
- Kernel parameters can be chosen through cross validation or learnt (see Multiple Kernel Learning).
- Non-linear kernels can sometimes boost classification performance tremendously.
- Non-linear kernels are generally expensive (both during training and for prediction)
103Polynomial Kernel of Degree 2
104Polynomial Kernel of Degree 5
105RBF Kernel
106Exponential ?2 Kernel
107Kernel Parameter Setting - Underfitting
108Kernel Parameter Setting
109Kernel Parameter Setting - Overfitting
110Structured Output Prediction
- Minimize over f: ½||f||² + C Σi ξi
- such that f(xi,yi) ≥ f(xi,y) + Δ(yi,y) - ξi ∀ y ≠ yi
- ξi ≥ 0
- Prediction: argmax_y f(x,y)
- This formulation minimizes the hinge on the loss Δ on the training set subject to regularization on f
- Can be used to predict sets, graphs, etc. for suitable choices of Δ
- Taskar et al., Max-Margin Markov Networks, NIPS 03
- Tsochantaridis et al., Large Margin Methods for Structured and Interdependent Output Variables, JMLR 05
111Multi-Class SVM
- Minimize over f: ½||f||² + C Σi ξi
- such that f(xi,yi) ≥ f(xi,y) + Δ(yi,y) - ξi ∀ y ≠ yi
- ξi ≥ 0
- Prediction: argmax_y f(x,y)
- Δ(yi,y) = 1 - δ_(yi,y)
- f(x,y) = wᵗ ( φ(x) ⊗ ψ(y) ) + bᵗψ(y)
- = wyᵗφ(x) + by (assuming ψ(y) = e_y)
- Weston and Watkins, SVMs for Multi-Class Pattern Recognition, ESANN 99
- Bordes et al., LaRank, ICML 07
112Multi-Class SVM
- min_(w,b) ½ Σk wkᵗwk + C Σi ξi
- s.t. w_(yi)ᵗφ(xi) + b_(yi) ≥ wyᵗφ(xi) + by + 1 - ξi ∀ i, ∀ y ≠ yi
- ξi ≥ 0
- Prediction: argmax_y wyᵗφ(x) + by
- For L classes, with N points per class, the number of constraints is NL²
- Finding the exact solution for real world non-linear problems is often infeasible
- In practice, we can obtain an approximate solution or switch to the 1-vs-All or 1-vs-1 formulations