Title: Discussion of Assignments 3 and 4 and Cross-Validation Methods
1. Discussion of Assignments 3 and 4 and Cross-Validation Methods
- Padhraic Smyth
- Information and Computer Science
- CS 175, Fall 2007
2. Review of Assignment 3 (Perceptron)
3. perceptron.m function

function thresholded_outputs = perceptron(weights,data)
% Compute the class predictions for a perceptron (linear classifier)
% Sample code for CS 175
%
% Inputs
%   weights: 1 x (d+1) row vector of weights
%   data: N x (d+1) matrix of training data
%
% Outputs
%   outputs: N x 1 vector of perceptron outputs

% error checking
if size(weights,1) ~= 1
    error('The first argument (weights) should be a row vector')
end
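The slide ends at the error check; a plausible completion of the body (a sketch, not the official solution) would compute the linear outputs for all rows at once and threshold them:

% compute the unthresholded linear outputs for all N rows of data at once
f = (weights * data')';            % N x 1 vector of linear outputs
% threshold at zero to get class predictions of +1 or -1
thresholded_outputs = sign(f);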
4. perceptron_error.m function

function [cerror, mse] = perceptron_error(weights,data,targets)
% Compute misclassification error and mean squared error for
% a perceptron (linear) classifier
% Sample code for CS 175
%
% Inputs
%   weights: 1 x (d+1) row vector of weights
%   data: N x (d+1) matrix of training data
%   targets: N x 1 vector of target values (+1 or -1)
%
% Outputs
%   cerror: the percentage of examples misclassified (between 0 and 100)
%   mse: the mean-square error (sum of squared errors divided by N)
5. perceptron_error.m function

N = size(data, 1);

% error checking
if nargin ~= 3
    error('The function takes three arguments (weights, data, targets)')
end
if size(weights,1) ~= 1
    error('The first argument (weights) should be a row vector')
end
if size(data,2) ~= size(weights,2)
    error('The first two arguments (weights and data) should have the same number of columns')
end
if size(data,1) ~= size(targets,1)
    error('The last two arguments (targets and data) should have the same number of rows')
end
6. perceptron_error.m function

% calculate the unthresholded outputs, for all rows in data, N x 1 vector
f = (weights * data')';

% compare thresholded output to the target values to get the accuracy
cerror = 100 * sum(sign(f) ~= targets)/N;

% calculate the sigmoid version of the outputs, for all rows in data, N x 1 vector
outputs = sigmoid(f);

% compare sigmoid output vector to the target vector to get the mse
mse = sum((outputs - targets).^2)/N;
7. perceptron_error.m function (annotated)

(This slide repeats the code above, with the following callouts:)
- Vectorized computation of the classification error rate
- Vectorized computation of the sigmoid output
- Vectorized computation of the MSE
- Local function defining the sigmoid; note that it works on vectors
8. Principle of Gradient Descent

- Gradient descent algorithm
  - Start with some initial guess at w
  - Move downhill in small steps in the direction of steepest descent
- What is the direction of steepest descent?
  - The negative of the gradient, evaluated at w
- What is the gradient?
  - The gradient is the vector of derivatives with respect to each component of w
  - E.g., if w = [w1, w2, w3] then grad g(w) = [dg(w)/dw1, dg(w)/dw2, dg(w)/dw3]
  - Note that the gradient is itself a vector (or a direction)
- After moving, recompute the gradient, get a new downhill direction, and move again.
- Keep repeating this until the decrease in g(w) is less than some threshold, i.e., we appear to be on a flat part of the g(w) surface.
9. Illustration of Gradient Descent
[Figure: surface g(w) plotted over the weight axes w1 and w2]

10. Illustration of Gradient Descent
[Figure: the same g(w) surface over w1 and w2]

11. Illustration of Gradient Descent
[Figure: the g(w) surface; the direction of steepest descent is the direction of the negative gradient]

12. Illustration of Gradient Descent
[Figure: the g(w) surface, showing the original point in weight space and the new point in weight space after one step]
13. Gradient Descent Algorithm

- The algorithm converges to either:
  - the global minimum, if g(w) is convex (has a single minimum)
    - this is the case for the perceptron
  - a local minimum, if g(w) has multiple local minima
    - this is the case for multilayer neural networks
- To avoid local minima, in practice we rerun the gradient descent algorithm from multiple random starting points and pick the solution with the lowest MSE.
- Note that the backpropagation algorithm is based on gradient descent (using a clever way to calculate the gradient)
- Note that the algorithm need not converge at all if the learning rate (i.e., step size) is too large, as the sketch below illustrates.
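To make the learning-rate point concrete, here is a minimal MATLAB sketch of gradient descent on a simple convex quadratic; the function g, the starting point, the rate, and the stopping tolerance are all invented for illustration:

% gradient descent on g(w) = w(1)^2 + 10*w(2)^2 (convex, minimum at the origin)
g        = @(w) w(1)^2 + 10*w(2)^2;
gradient = @(w) [2*w(1), 20*w(2)];

w    = [4, 4];       % initial guess
rate = 0.01;         % learning rate; try rate = 0.11 and the iterates diverge
oldg = g(w);
for iteration = 1:1000
    w    = w - rate * gradient(w);    % step along the negative gradient
    newg = g(w);
    if abs(oldg - newg) < 1e-9        % stop on a (nearly) flat part of g
        break
    end
    oldg = newg;
end
fprintf('reached w = (%.4f, %.4f) after %d iterations\n', w(1), w(2), iteration);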
14. Gradient Descent Algorithm

- Mathematically, the Gradient Descent Rule:
  - w_new = w_old - η ∇g(w)
  - where
    - ∇g(w) is the gradient, and
    - η is the learning rate (small, positive)
15. Gradient Descent Algorithm

- Mathematically, the Gradient Descent Rule:
  - w_new = w_old - η ∇g(w)
  - where
    - ∇g(w) is the gradient, and
    - η is the learning rate (small, positive)
- In MATLAB, for the perceptron with sigmoid outputs this translates into the following update rule:

    weights = weights - rate * (o - targets(i)) * dsigmoid(o) * data(i,:);

  (everything after "rate *" is the gradient, evaluated at the current weight vector)
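A quick chain-rule check (not on the slide) of why that product is the gradient of the per-example squared error:

\[
E_i(\mathbf{w}) = \bigl(o - t_i\bigr)^2, \qquad o = \sigma(\mathbf{w}\cdot\mathbf{x}_i)
\]
\[
\nabla E_i(\mathbf{w}) = 2\,(o - t_i)\,\sigma'(\mathbf{w}\cdot\mathbf{x}_i)\,\mathbf{x}_i
\]

The constant factor of 2 is absorbed into the learning rate; what remains is exactly the product in the MATLAB line: the output error (o - targets(i)), the sigmoid derivative dsigmoid(o), and the input row data(i,:).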
16. learn_perceptron.m function

function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)
% Learn the weights for a perceptron (linear) classifier to minimize its
% mean squared error.
% Sample code for CS 175
%
% Inputs
%   data: N x (d+1) matrix of training data
%   targets: N x 1 vector of target values (+1 or -1)
%   rate: learning rate for the perceptron algorithm (e.g., rate = 0.001)
%   threshold: if the reduction in MSE from one iteration to the next is less
%       than threshold, then halt learning (e.g., threshold = 0.000001)
%   init_method: method used to initialize the weights (1 = random, 2 = half
%       way between 2 random points in each group, 3 = half way between
%       the centroids in each group)
%   random_seed: an integer used to "seed" the random number generator
%       for either methods 1 or 2 for initialization (this is useful
17. learn_perceptron.m function

[N, d] = size(data);

% error checking
if nargin < 4
    error('The function takes at least 4 arguments (data, targets, rate, threshold)')
end
if size(data,1) ~= size(targets,1)
    error('The number of rows in the first two arguments (data, targets) does not match!')
end

% initialize the input arguments
if ~exist('k')
    k = 100;
end
if ~exist('plotflag')
    plotflag = 0;
end
18. learn_perceptron.m function

% initialize the weights
weights = initialize_weights175(data,targets,init_method,random_seed);

iteration = 0;
while iteration < 2 || ( abs(mse(iteration) - mse(iteration-1)) > threshold )

    iteration = iteration + 1;

    % cycle through all of the examples
    for i=1:N
        % calculate the unthresholded output for the ith row of "data"
        o = sigmoid( weights * data(i,:)' );
        % update the weight vector
        weights = weights + rate * (targets(i) - o) * dsigmoid(o) * data(i,:);
    end

    % calculate the errors using current parameter values
    [cerror(iteration), mse(iteration)] = perceptron_error(weights, data, targets);
19. learn_perceptron.m function

% create the plots of the MSE and accuracy vs. iteration number
if (plotflag == 1)
    figure(2)
    subplot(2, 1, 1)
    plot(mse,'b-')
    xlabel('iteration')
    ylabel('MSE')

    subplot(2, 1, 2)
    plot(100-cerror,'b-')
    xlabel('iteration')
    ylabel('Accuracy')
end

% local functions...
function s = sigmoid(x)
% Compute the sigmoid function, scaled from -1 to +1
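The transcript cuts off before the body of the local function. A plausible sketch of sigmoid, plus the matching derivative helper dsigmoid used in the update rule, assuming a scaled-logistic form (the exact form in the course code may differ):

function s = sigmoid(x)
% scaled logistic: maps any real input elementwise into (-1, +1)
s = 2 ./ (1 + exp(-x)) - 1;

function d = dsigmoid(s)
% derivative of the scaled sigmoid, written in terms of its output s:
% if s = 2./(1+exp(-x)) - 1, then ds/dx = (1 - s.^2)/2
d = (1 - s.^2) / 2;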
20. MATLAB Demonstration

- Download the MATLAB demo code (Zip file) from the class Web page
- Run demo_perceptron_image_classification.m
21. Additional Concepts in Classification (Relevant to Assignment 4)
22. Assignment 4

- threshold_image.m
  - Simple function to display thresholded images
- knn_dispset.m
  - Finds and displays the k-nearest-neighbors for a given image
- test_classifiers.m
  - Uses cross-validation to compare classifiers (code is provided)
- test_imageclassifiers.m
  - Compares different classification methods on image data
  - Uses cross-validation
23. Assignment 4: using kNN to find similar images

function list = knndispset(imageset,i,j,k,plotflag)
% a brief description of what the function does ......
% Your Name, CS 175, date
%
% Inputs
%   imageset: an array structure of images (CS 175 format)
%   i, j: integers specifying that imageset(i,j).image is the query image
%   k: number of neighbors to find
%   plotflag: display the k nearest neighbors if plotflag = 1
%
% Outputs
%   list: a k x 2 matrix, where the first row contains the indices from
%       imageset of the nearest neighbor, the second row contains the
%       indices of the 2nd nearest neighbor, and so forth.
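A minimal sketch of one way to implement this, assuming each imageset(i,j).image is a numeric matrix and using Euclidean distance between flattened pixel vectors; everything beyond the names in the spec above is invented for illustration:

function list = knndispset(imageset,i,j,k,plotflag)
% sketch: k nearest neighbors by Euclidean distance on raw pixels
query = double(imageset(i,j).image(:));   % flatten the query image to a vector
[m, n] = size(imageset);
dist = zeros(m*n, 1);
for idx = 1:m*n
    x = double(imageset(idx).image(:));   % flatten each candidate image
    dist(idx) = norm(x - query);          % Euclidean distance to the query
end
dist(sub2ind([m n], i, j)) = Inf;         % exclude the query image itself
[ignore, order] = sort(dist);             % ascending: nearest first
[rows, cols] = ind2sub([m n], order(1:k));
list = [rows(:) cols(:)];                 % k x 2 matrix of (i,j) index pairs
if plotflag == 1
    figure
    for p = 1:k
        subplot(3, ceil(k/3), p)
        imagesc(imageset(list(p,1), list(p,2)).image)
        colormap(gray), axis image, axis off
    end
end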
24. MATLAB demo of knndispset
- knndispset(i2straight,5,1,15,1)

25. MATLAB demo of knndispset
- knndispset(i2straight,18,1,15,1)
26. Training Data and Test Data

- Training data
  - labeled data used to build a classifier
- Test data
  - new data, not used in the training process, used to evaluate how well a classifier does on new data
- Memorization versus Generalization
  - better training accuracy means memorizing the training data
  - better test accuracy means generalizing to new data
- In general, we would like our classifier to perform well on new test data, not just on the training data, i.e., we would like it to generalize well to new data
  - Test accuracy is more important than training accuracy
27. Test Accuracy and Generalization

- The accuracy of our classifier on new, unseen data is a fair/honest assessment of its performance
- Why is training accuracy not good enough?
  - Training accuracy is optimistic
  - a classifier like nearest-neighbor can construct boundaries that always separate all training data points, but that do not separate new points
  - e.g., what is the training accuracy of kNN with k = 1? (100%, since each training point is its own nearest neighbor)
  - A flexible classifier can "overfit" the training data: in effect it just memorizes the training data, but does not learn the general relationship between x and C
- Generalization
  - We are really interested in how our classifier generalizes to new data
  - test data accuracy is a good estimate of generalization performance
28. Another Example
29. A More Complex Decision Boundary

[Figure: two-class data in a two-dimensional feature space (axes: Feature 1, Feature 2), showing Decision Region 1, Decision Region 2, and the decision boundary between them]
30. Example: The Overfitting Phenomenon
[Figure: scatterplot of Y versus X]

31. A Complex Model
Y = high-order polynomial in X
[Figure: the polynomial fit to the Y-versus-X data]

32. The True (simpler) Model
Y = aX + b + noise
[Figure: the linear fit to the same Y-versus-X data]
33. How Overfitting affects Prediction
[Figure: predictive error versus model complexity, showing the error on the training data]

34. How Overfitting affects Prediction
[Figure: predictive error versus model complexity, showing the error on the training data and the error on the test data]

35. How Overfitting affects Prediction
[Figure: the same curves, annotated with the underfitting region, the overfitting region, and the ideal range for model complexity]
36. Comparing Two Classifiers

- Say we have 2 classifiers, C1 and C2
- We want to choose the better one to use for future predictions
  - e.g., medical diagnosis
  - e.g., email filtering
- Can we use training accuracy to choose between them?
  - No!
  - e.g., C1 = perceptron, C2 = kNN
  - e.g., the training accuracy of kNN (k = 1) is 100%, but it is not necessarily the best classifier
- We can choose according to whichever of test_accuracy(C1) or test_accuracy(C2) is larger
37. Training and Validation Data

[Figure: the full data set split into a training set and a validation set]

Idea: train each model on the training data, and then test each model's accuracy on the validation data.
38. The v-fold Cross-Validation Method

- Why choose just one particular 90/10 split of the data?
  - In principle we could do this multiple times
- v-fold Cross-Validation (e.g., v = 10)
  - randomly partition our full data set into v disjoint subsets (each roughly of size n/v, where n = total number of training data points)
  - for i = 1:10 (here v = 10)
    - train on 90% of the data
    - Acc(i) = accuracy on the other 10%
  - end
  - Cross-Validation-Accuracy = (1/v) * Σ_i Acc(i)
- choose the method with the highest cross-validation accuracy
- common values for v are 5 and 10
- we can also do "leave-one-out" cross-validation, where v = n
39. Disjoint Validation Data Sets

[Figure: the full data set with the 1st partition held out as validation data and the rest used as training data]

40. Disjoint Validation Data Sets

[Figure: the full data set again, now with the 2nd (disjoint) partition held out as validation data and the rest used as training data]
41. More on Cross-Validation

- Notes
  - cross-validation generates an approximate estimate of how well the classifier will do on unseen data
  - by averaging over different partitions it is more robust than a single train/validate partition of the data
  - v-fold cross-validation is a generalization
    - partition the data into v disjoint validation subsets of size n/v
    - train, validate, and average over the v partitions
    - e.g., v = 10 is commonly used
  - v-fold cross-validation is approximately v times more computationally expensive than just fitting a model to all of the data
42. Sample MATLAB code for Cross-Validation

% first randomly order the data (n = number of data points)
rand('state',rseed);
index = randperm(n);
data = ordereddata(index,:);
labels = orderedlabels(index);
43. Sample MATLAB code for Cross-Validation

% now perform v-fold cross-validation
olddata = data;
oldlabels = labels;
nvalidate = floor(n/v);
for i=1:v
    % set testdata and testlabels to be the first nvalidate rows
    % of olddata, oldlabels
    ...
    % set traindata and trainlabels to be the rest of the rows
    % of olddata, oldlabels
    ...
    % call classifiers with traindata, trainlabels, testdata, testlabels
    cvaccuracy(i) = classifier(..);
    olddata = [traindata; testdata];
    oldlabels = [trainlabels; testlabels];
end
overall_cvaccuracy = mean(cvaccuracy);
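The two elided steps above are straightforward indexing; a filled-in sketch follows. Here classifier stands for any function with the signature acc = classifier(traindata,trainlabels,testdata,testlabels); it is a placeholder name, not one of the provided functions.

% v-fold cross-validation with the elided indexing steps filled in (sketch)
olddata = data;
oldlabels = labels;
nvalidate = floor(n/v);
for i=1:v
    % the first nvalidate rows form the validation fold for this pass
    testdata = olddata(1:nvalidate, :);
    testlabels = oldlabels(1:nvalidate);
    % the remaining rows form the training set
    traindata = olddata(nvalidate+1:end, :);
    trainlabels = oldlabels(nvalidate+1:end);
    % evaluate the classifier on this train/validation split
    cvaccuracy(i) = classifier(traindata, trainlabels, testdata, testlabels);
    % rotate the fold just used to the end, so the next pass holds out new rows
    olddata = [traindata; testdata];
    oldlabels = [trainlabels; testlabels];
end
overall_cvaccuracy = mean(cvaccuracy);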
44. Assignment 4: Cross-Validation code (provided)

function [cvacc, trainacc] = test_classifiers(data1,data2,kvalues,v,rseed)
% cross-validation results with minimum distance and knn classifiers
%
% INPUTS
%   data1: n1 x d feature data for class 1
%   data2: n2 x d feature data for class 2
%   kvalues: row vector of values of k for knn
%   v: for "v-fold" cross-validation
%   rseed: random seed setting before permuting the data order
%
% OUTPUTS
%   cvacc: accuracy estimated using cross-validation
%   trainacc: accuracy on the training data
%   (accuracy expressed as a percentage, between 0 and 100)
45. Example of running the cross-validation code

>> test_classifiers(d1,d2,1,5,1234)

Training Data Results:
  Minimum distance accuracy = 87.50
  KNN, k = 1, accuracy = 100.00
Cross Validation Results (v = 5):
  Minimum distance accuracy = 85.00
  KNN, k = 1, accuracy = 82.50

If we change to k = 3 nearest-neighbors, the results are as follows:

>> test_classifiers(d1,d2,3,5,1234)

Training Data Results:
  Minimum distance accuracy = 87.50
  KNN, k = 3, accuracy = 95.00
Cross Validation Results (v = 5):
  Minimum distance accuracy = 85.00
  KNN, k = 3, accuracy = 85.00
46. Assignment 4: Classifying images

function [cvacc, trainacc] = test_imageclassifiers(imageset1,imageset2,plotflag,kvalues,v,rseed)
% Learns a classifier to discriminate the images in imageset1 from the images
% in imageset2, using minimum distance and knn classifiers, and returns the
% training and cross-validation accuracies.
% Your name, CS 175A
%
% INPUTS
%   imageset1, imageset2: arrays (of size m x n, and m2 x n2) of structures,
%       where imageset1(i,j).image is a matrix of pixel (image) values of
%       size nx by ny. It is assumed that all images are of the same size
%       in both imageset1 and imageset2.
%   plotflag: if plotflag = 1, plot the mean image for each set, and plot the
%       difference of the means of the images in the two sets.
%   kvalues: a K x 1 vector of k values for the knn classifier
%   v: number of "folds" for v-fold cross-validation
47. The Minimum-Distance Classifier

- A very simple classifier
- Assume we have data from M classes (e.g., M = 2)
- Calculate the mean for each class, e.g., Mean1 and Mean2
  - mean vector = sum of all vectors / number of vectors
  - mean vector = centroid of the points
- Classify each new point x as follows:
  - for j = 1:M
    - calculate the distance dj = Euclidean distance(x, Meanj), the distance from x to the jth class mean
  - choose the class with the minimum distance as the predicted class
    - i.e., assign x to the closest mean
49. Assignment 4: Minimum Distance Classifier (provided)

function acc = minimum_distance(traindata,trainlabels,testdata,testlabels)
% implementation of a minimum distance classifier
%
% INPUTS
%   traindata: N1 x d matrix of feature data
%   trainlabels: N1 x 1 column vector of class labels
%   testdata: N2 x d matrix of feature data
%   testlabels: N2 x 1 column vector of class labels
%
% OUTPUTS
%   acc: accuracy (percentage) on the test data for a classifier
%        trained on the training data
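The body of minimum_distance is provided with the assignment; purely as an illustration of the method from slide 47, a sketch assuming exactly two classes labeled +1 and -1 (the label convention is an assumption) might look like:

% sketch of a possible body (assumes two classes labeled +1 and -1)
mean1 = mean(traindata(trainlabels == 1, :), 1);    % centroid of class +1
mean2 = mean(traindata(trainlabels == -1, :), 1);   % centroid of class -1
N2 = size(testdata, 1);
predictions = zeros(N2, 1);
for i = 1:N2
    d1 = norm(testdata(i,:) - mean1);   % distance to class +1 centroid
    d2 = norm(testdata(i,:) - mean2);   % distance to class -1 centroid
    if d1 < d2
        predictions(i) = 1;             % assign x to the closest mean
    else
        predictions(i) = -1;
    end
end
acc = 100 * sum(predictions == testlabels) / N2;    % accuracy as a percentage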
50. Summary

- Assignment 3
  - Perceptron code
  - Can use perceptrons (or any classifier) to classify images
- Assignment 4
  - Nearest-neighbor with images
  - Cross-validation
  - Minimum distance classifier
  - Due Tuesday at 9:30am