Title: Discussion of Assignments 3 and 4 and Cross-Validation Methods
1. Discussion of Assignments 3 and 4 and Cross-Validation Methods
- Padhraic Smyth
- Information and Computer Science
- CS 175, Fall 2007
2. Review of Assignment 3 (Perceptron)
3. perceptron.m function

function thresholded_outputs = perceptron(weights,data)
% Compute the class predictions for a perceptron (linear classifier)
% Sample code for CS 175
%
% Inputs
%   weights: 1 x (d+1) row vector of weights
%   data: N x (d+1) matrix of training data
%
% Outputs
%   outputs: N x 1 vector of perceptron outputs

% error checking
if size(weights,1) ~= 1
    error('The first argument (weights) should be a row vector')
end
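The slide ends at the error check; a plausible completion of the body (a sketch, not the official solution) would compute the linear outputs for all rows at once and threshold them:

% compute the unthresholded linear outputs for all N rows of data at once
f = (weights * data')';            % N x 1 vector of linear outputs
% threshold at zero to get class predictions of +1 or -1
thresholded_outputs = sign(f);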
4. perceptron_error.m function

function [cerror, mse] = perceptron_error(weights,data,targets)
% Compute misclassification error and mean squared error for
% a perceptron (linear) classifier
% Sample code for CS 175
%
% Inputs
%   weights: 1 x (d+1) row vector of weights
%   data: N x (d+1) matrix of training data
%   targets: N x 1 vector of target values (+1 or -1)
%
% Outputs
%   cerror: the percentage of examples misclassified (between 0 and 100)
%   mse: the mean-square error (sum of squared errors divided by N)
5. perceptron_error.m function

N = size(data, 1);

% error checking
if nargin ~= 3
    error('The function takes three arguments (weights, data, targets)')
end
if size(weights,1) ~= 1
    error('The first argument (weights) should be a row vector')
end
if size(data,2) ~= size(weights,2)
    error('The first two arguments (weights and data) should have the same number of columns')
end
if size(data,1) ~= size(targets,1)
    error('The last two arguments (targets and data) should have the same number of rows')
end
6. perceptron_error.m function

% calculate the unthresholded outputs, for all rows in data, N x 1 vector
f = (weights * data')';

% compare thresholded output to the target values to get the accuracy
cerror = 100 * sum(sign(f) ~= targets)/N;

% calculate the sigmoid version of the outputs, for all rows in data, N x 1 vector
outputs = sigmoid(f);

% compare sigmoid output vector to the target vector to get the mse
mse = sum((outputs - targets).^2)/N;
7. perceptron_error.m function (annotated)

(This slide repeats the code above, with the following callouts:)
- Vectorized computation of the classification error rate
- Vectorized computation of the sigmoid output
- Vectorized computation of the MSE
- Local function defining the sigmoid; note that it works on vectors
8. Principle of Gradient Descent

- Gradient descent algorithm
  - Start with some initial guess at w
  - Move downhill in small steps in the direction of steepest descent
- What is the direction of steepest descent?
  - The negative of the gradient, evaluated at w
- What is the gradient?
  - The gradient is the vector of derivatives with respect to each component of w
  - E.g., if w = [w1, w2, w3] then grad g(w) = [dg(w)/dw1, dg(w)/dw2, dg(w)/dw3]
  - Note that the gradient is itself a vector (or a direction)
- After moving, recompute the gradient, get a new downhill direction, and move again.
- Keep repeating this until the decrease in g(w) is less than some threshold, i.e., we appear to be on a flat part of the g(w) surface.
9. Illustration of Gradient Descent
[Figure: surface g(w) plotted over the weight axes w1 and w2]

10. Illustration of Gradient Descent
[Figure: the same g(w) surface over w1 and w2]

11. Illustration of Gradient Descent
[Figure: the g(w) surface; the direction of steepest descent is the direction of the negative gradient]

12. Illustration of Gradient Descent
[Figure: the g(w) surface, showing the original point in weight space and the new point in weight space after one step]
13. Gradient Descent Algorithm

- The algorithm converges to either:
  - the global minimum, if g(w) is convex (has a single minimum)
    - this is the case for the perceptron
  - a local minimum, if g(w) has multiple local minima
    - this is the case for multilayer neural networks
- To avoid local minima, in practice we rerun the gradient descent algorithm from multiple random starting points and pick the solution with the lowest MSE.
- Note that the backpropagation algorithm is based on gradient descent (using a clever way to calculate the gradient)
- Note that the algorithm need not converge at all if the learning rate (i.e., step size) is too large, as the sketch below illustrates.
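To make the learning-rate point concrete, here is a minimal MATLAB sketch of gradient descent on a simple convex quadratic; the function g, the starting point, the rate, and the stopping tolerance are all invented for illustration:

% gradient descent on g(w) = w(1)^2 + 10*w(2)^2 (convex, minimum at the origin)
g        = @(w) w(1)^2 + 10*w(2)^2;
gradient = @(w) [2*w(1), 20*w(2)];

w    = [4, 4];       % initial guess
rate = 0.01;         % learning rate; try rate = 0.11 and the iterates diverge
oldg = g(w);
for iteration = 1:1000
    w    = w - rate * gradient(w);    % step along the negative gradient
    newg = g(w);
    if abs(oldg - newg) < 1e-9        % stop on a (nearly) flat part of g
        break
    end
    oldg = newg;
end
fprintf('reached w = (%.4f, %.4f) after %d iterations\n', w(1), w(2), iteration);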
14. Gradient Descent Algorithm

- Mathematically, the Gradient Descent Rule:
  - w_new = w_old - η ∇g(w)
  - where
    - ∇g(w) is the gradient, and
    - η is the learning rate (small, positive)
15. Gradient Descent Algorithm

- Mathematically, the Gradient Descent Rule:
  - w_new = w_old - η ∇g(w)
  - where
    - ∇g(w) is the gradient, and
    - η is the learning rate (small, positive)
- In MATLAB, for the perceptron with sigmoid outputs this translates into the following update rule:

    weights = weights - rate * (o - targets(i)) * dsigmoid(o) * data(i,:);

  (everything after "rate *" is the gradient, evaluated at the current weight vector)
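A quick chain-rule check (not on the slide) of why that product is the gradient of the per-example squared error:

\[
E_i(\mathbf{w}) = \bigl(o - t_i\bigr)^2, \qquad o = \sigma(\mathbf{w}\cdot\mathbf{x}_i)
\]
\[
\nabla E_i(\mathbf{w}) = 2\,(o - t_i)\,\sigma'(\mathbf{w}\cdot\mathbf{x}_i)\,\mathbf{x}_i
\]

The constant factor of 2 is absorbed into the learning rate; what remains is exactly the product in the MATLAB line: the output error (o - targets(i)), the sigmoid derivative dsigmoid(o), and the input row data(i,:).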
16. learn_perceptron.m function

function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)
% Learn the weights for a perceptron (linear) classifier to minimize its
% mean squared error.
% Sample code for CS 175
%
% Inputs
%   data: N x (d+1) matrix of training data
%   targets: N x 1 vector of target values (+1 or -1)
%   rate: learning rate for the perceptron algorithm (e.g., rate = 0.001)
%   threshold: if the reduction in MSE from one iteration to the next is less
%       than threshold, then halt learning (e.g., threshold = 0.000001)
%   init_method: method used to initialize the weights (1 = random, 2 = half
%       way between 2 random points in each group, 3 = half way between
%       the centroids in each group)
%   random_seed: an integer used to "seed" the random number generator
%       for either methods 1 or 2 for initialization (this is useful
17. learn_perceptron.m function

[N, d] = size(data);

% error checking
if nargin < 4
    error('The function takes at least 4 arguments (data, targets, rate, threshold)')
end
if size(data,1) ~= size(targets,1)
    error('The number of rows in the first two arguments (data, targets) does not match!')
end

% initialize the input arguments
if ~exist('k')
    k = 100;
end
if ~exist('plotflag')
    plotflag = 0;
end
18. learn_perceptron.m function

% initialize the weights
weights = initialize_weights175(data,targets,init_method,random_seed);

iteration = 0;
while iteration < 2 || ( abs(mse(iteration) - mse(iteration-1)) > threshold )

    iteration = iteration + 1;

    % cycle through all of the examples
    for i=1:N
        % calculate the unthresholded output for the ith row of "data"
        o = sigmoid( weights * data(i,:)' );
        % update the weight vector
        weights = weights + rate * (targets(i) - o) * dsigmoid(o) * data(i,:);
    end

    % calculate the errors using current parameter values
    [cerror(iteration), mse(iteration)] = perceptron_error(weights, data, targets);
19. learn_perceptron.m function

% create the plots of the MSE and accuracy vs. iteration number
if (plotflag == 1)
    figure(2)
    subplot(2, 1, 1)
    plot(mse,'b-')
    xlabel('iteration')
    ylabel('MSE')

    subplot(2, 1, 2)
    plot(100-cerror,'b-')
    xlabel('iteration')
    ylabel('Accuracy')
end

% local functions...
function s = sigmoid(x)
% Compute the sigmoid function, scaled from -1 to +1
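The transcript cuts off before the body of the local function. A plausible sketch of sigmoid, plus the matching derivative helper dsigmoid used in the update rule, assuming a scaled-logistic form (the exact form in the course code may differ):

function s = sigmoid(x)
% scaled logistic: maps any real input elementwise into (-1, +1)
s = 2 ./ (1 + exp(-x)) - 1;

function d = dsigmoid(s)
% derivative of the scaled sigmoid, written in terms of its output s:
% if s = 2./(1+exp(-x)) - 1, then ds/dx = (1 - s.^2)/2
d = (1 - s.^2) / 2;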
20. MATLAB Demonstration

- Download the MATLAB demo code (Zip file) from the class Web page
- Run demo_perceptron_image_classification.m
21. Additional Concepts in Classification (Relevant to Assignment 4)
22. Assignment 4

- threshold_image.m
  - Simple function to display thresholded images
- knn_dispset.m
  - Finds and displays the k-nearest-neighbors for a given image
- test_classifiers.m
  - Uses cross-validation to compare classifiers (code is provided)
- test_imageclassifiers.m
  - Compares different classification methods on image data
  - Uses cross-validation
23. Assignment 4: using kNN to find similar images

function list = knndispset(imageset,i,j,k,plotflag)
% a brief description of what the function does ......
% Your Name, CS 175, date
%
% Inputs
%   imageset: an array structure of images (CS 175 format)
%   i, j: integers specifying that imageset(i,j).image is the query image
%   k: number of neighbors to find
%   plotflag: display the k nearest neighbors if plotflag = 1
%
% Outputs
%   list: a k x 2 matrix, where the first row contains the indices from
%       imageset of the nearest neighbor, the second row contains the
%       indices of the 2nd nearest neighbor, and so forth.
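A minimal sketch of one way to implement this, assuming each imageset(i,j).image is a numeric matrix and using Euclidean distance between flattened pixel vectors; everything beyond the names in the spec above is invented for illustration:

function list = knndispset(imageset,i,j,k,plotflag)
% sketch: k nearest neighbors by Euclidean distance on raw pixels
query = double(imageset(i,j).image(:));   % flatten the query image to a vector
[m, n] = size(imageset);
dist = zeros(m*n, 1);
for idx = 1:m*n
    x = double(imageset(idx).image(:));   % flatten each candidate image
    dist(idx) = norm(x - query);          % Euclidean distance to the query
end
dist(sub2ind([m n], i, j)) = Inf;         % exclude the query image itself
[ignore, order] = sort(dist);             % ascending: nearest first
[rows, cols] = ind2sub([m n], order(1:k));
list = [rows(:) cols(:)];                 % k x 2 matrix of (i,j) index pairs
if plotflag == 1
    figure
    for p = 1:k
        subplot(3, ceil(k/3), p)
        imagesc(imageset(list(p,1), list(p,2)).image)
        colormap(gray), axis image, axis off
    end
end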
24. MATLAB demo of knndispset
- knndispset(i2straight,5,1,15,1)

25. MATLAB demo of knndispset
- knndispset(i2straight,18,1,15,1)
26. Training Data and Test Data

- Training data
  - labeled data used to build a classifier
- Test data
  - new data, not used in the training process, used to evaluate how well a classifier does on new data
- Memorization versus Generalization
  - better training accuracy means memorizing the training data
  - better test accuracy means generalizing to new data
- In general, we would like our classifier to perform well on new test data, not just on the training data, i.e., we would like it to generalize well to new data
  - Test accuracy is more important than training accuracy
27. Test Accuracy and Generalization

- The accuracy of our classifier on new, unseen data is a fair/honest assessment of its performance
- Why is training accuracy not good enough?
  - Training accuracy is optimistic
  - a classifier like nearest-neighbor can construct boundaries that always separate all training data points, but that do not separate new points
  - e.g., what is the training accuracy of kNN with k = 1? (100%, since each training point is its own nearest neighbor)
  - A flexible classifier can "overfit" the training data: in effect it just memorizes the training data, but does not learn the general relationship between x and C
- Generalization
  - We are really interested in how our classifier generalizes to new data
  - test data accuracy is a good estimate of generalization performance
28. Another Example
29. A More Complex Decision Boundary

[Figure: two-class data in a two-dimensional feature space (axes: Feature 1, Feature 2), showing Decision Region 1, Decision Region 2, and the decision boundary between them]
30. Example: The Overfitting Phenomenon
[Figure: scatterplot of Y versus X]

31. A Complex Model
Y = high-order polynomial in X
[Figure: the polynomial fit to the Y-versus-X data]

32. The True (simpler) Model
Y = aX + b + noise
[Figure: the linear fit to the same Y-versus-X data]
33. How Overfitting affects Prediction
[Figure: predictive error versus model complexity, showing the error on the training data]

34. How Overfitting affects Prediction
[Figure: predictive error versus model complexity, showing the error on the training data and the error on the test data]

35. How Overfitting affects Prediction
[Figure: the same curves, annotated with the underfitting region, the overfitting region, and the ideal range for model complexity]
36. Comparing Two Classifiers

- Say we have 2 classifiers, C1 and C2
- We want to choose the better one to use for future predictions
  - e.g., medical diagnosis
  - e.g., email filtering
- Can we use training accuracy to choose between them?
  - No!
  - e.g., C1 = perceptron, C2 = kNN
  - e.g., the training accuracy of kNN (k = 1) is 100%, but it is not necessarily the best classifier
- We can choose according to whichever of test_accuracy(C1) or test_accuracy(C2) is larger
37. Training and Validation Data

[Figure: the full data set split into a training set and a validation set]

Idea: train each model on the training data, and then test each model's accuracy on the validation data.
38. The v-fold Cross-Validation Method

- Why choose just one particular 90/10 split of the data?
  - In principle we could do this multiple times
- v-fold Cross-Validation (e.g., v = 10)
  - randomly partition our full data set into v disjoint subsets (each roughly of size n/v, where n = total number of training data points)
  - for i = 1:10 (here v = 10)
    - train on 90% of the data
    - Acc(i) = accuracy on the other 10%
  - end
  - Cross-Validation-Accuracy = (1/v) * Σ_i Acc(i)
- choose the method with the highest cross-validation accuracy
- common values for v are 5 and 10
- we can also do "leave-one-out" cross-validation, where v = n
39. Disjoint Validation Data Sets

[Figure: the full data set with the 1st partition held out as validation data and the rest used as training data]

40. Disjoint Validation Data Sets

[Figure: the full data set again, now with the 2nd (disjoint) partition held out as validation data and the rest used as training data]
41. More on Cross-Validation

- Notes
  - cross-validation generates an approximate estimate of how well the classifier will do on unseen data
  - by averaging over different partitions it is more robust than a single train/validate partition of the data
  - v-fold cross-validation is a generalization
    - partition the data into v disjoint validation subsets of size n/v
    - train, validate, and average over the v partitions
    - e.g., v = 10 is commonly used
  - v-fold cross-validation is approximately v times more computationally expensive than just fitting a model to all of the data
42. Sample MATLAB code for Cross-Validation

% first randomly order the data (n = number of data points)
rand('state',rseed);
index = randperm(n);
data = ordereddata(index,:);
labels = orderedlabels(index);
43. Sample MATLAB code for Cross-Validation

% now perform v-fold cross-validation
olddata = data;
oldlabels = labels;
nvalidate = floor(n/v);
for i=1:v
    % set testdata and testlabels to be the first nvalidate rows
    % of olddata, oldlabels
    ...
    % set traindata and trainlabels to be the rest of the rows
    % of olddata, oldlabels
    ...
    % call classifiers with traindata, trainlabels, testdata, testlabels
    cvaccuracy(i) = classifier(..);
    olddata = [traindata; testdata];
    oldlabels = [trainlabels; testlabels];
end
overall_cvaccuracy = mean(cvaccuracy);
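The two elided steps above are straightforward indexing; a filled-in sketch follows. Here classifier stands for any function with the signature acc = classifier(traindata,trainlabels,testdata,testlabels); it is a placeholder name, not one of the provided functions.

% v-fold cross-validation with the elided indexing steps filled in (sketch)
olddata = data;
oldlabels = labels;
nvalidate = floor(n/v);
for i=1:v
    % the first nvalidate rows form the validation fold for this pass
    testdata = olddata(1:nvalidate, :);
    testlabels = oldlabels(1:nvalidate);
    % the remaining rows form the training set
    traindata = olddata(nvalidate+1:end, :);
    trainlabels = oldlabels(nvalidate+1:end);
    % evaluate the classifier on this train/validation split
    cvaccuracy(i) = classifier(traindata, trainlabels, testdata, testlabels);
    % rotate the fold just used to the end, so the next pass holds out new rows
    olddata = [traindata; testdata];
    oldlabels = [trainlabels; testlabels];
end
overall_cvaccuracy = mean(cvaccuracy);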
44. Assignment 4: Cross-Validation code (provided)

function [cvacc, trainacc] = test_classifiers(data1,data2,kvalues,v,rseed)
% cross-validation results with minimum distance and knn classifiers
%
% INPUTS
%   data1: n1 x d feature data for class 1
%   data2: n2 x d feature data for class 2
%   kvalues: row vector of values of k for knn
%   v: for "v-fold" cross-validation
%   rseed: random seed setting before permuting the data order
%
% OUTPUTS
%   cvacc: accuracy estimated using cross-validation
%   trainacc: accuracy on the training data
%   (accuracy expressed as a percentage, between 0 and 100)
45. Example of running the cross-validation code

>> test_classifiers(d1,d2,1,5,1234)

Training Data Results:
  Minimum distance accuracy = 87.50
  KNN, k = 1, accuracy = 100.00
Cross Validation Results (v = 5):
  Minimum distance accuracy = 85.00
  KNN, k = 1, accuracy = 82.50

If we change to k = 3 nearest-neighbors, the results are as follows:

>> test_classifiers(d1,d2,3,5,1234)

Training Data Results:
  Minimum distance accuracy = 87.50
  KNN, k = 3, accuracy = 95.00
Cross Validation Results (v = 5):
  Minimum distance accuracy = 85.00
  KNN, k = 3, accuracy = 85.00
46. Assignment 4: Classifying images

function [cvacc, trainacc] = test_imageclassifiers(imageset1,imageset2,plotflag,kvalues,v,rseed)
% Learns a classifier to discriminate the images in imageset1 from the images
% in imageset2, using minimum distance and knn classifiers, and returns the
% training and cross-validation accuracies.
% Your name, CS 175A
%
% INPUTS
%   imageset1, imageset2: arrays (of size m x n, and m2 x n2) of structures,
%       where imageset1(i,j).image is a matrix of pixel (image) values of
%       size nx by ny. It is assumed that all images are of the same size
%       in both imageset1 and imageset2.
%   plotflag: if plotflag = 1, plot the mean image for each set, and plot the
%       difference of the means of the images in the two sets.
%   kvalues: a K x 1 vector of k values for the knn classifier
%   v: number of "folds" for v-fold cross-validation
47. The Minimum-Distance Classifier

- A very simple classifier
- Assume we have data from M classes (e.g., M = 2)
- Calculate the mean for each class, e.g., Mean1 and Mean2
  - mean vector = sum of all vectors / number of vectors
  - mean vector = centroid of the points
- Classify each new point x as follows:
  - for j = 1:M
    - calculate the distance dj = Euclidean distance(x, Meanj), the distance from x to the jth class mean
  - choose the class with the minimum distance as the predicted class
    - i.e., assign x to the closest mean
49. Assignment 4: Minimum Distance Classifier (provided)

function acc = minimum_distance(traindata,trainlabels,testdata,testlabels)
% implementation of a minimum distance classifier
%
% INPUTS
%   traindata: N1 x d matrix of feature data
%   trainlabels: N1 x 1 column vector of class labels
%   testdata: N2 x d matrix of feature data
%   testlabels: N2 x 1 column vector of class labels
%
% OUTPUTS
%   acc: accuracy (percentage) on the test data for a classifier
%        trained on the training data
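The body of minimum_distance is provided with the assignment; purely as an illustration of the method from slide 47, a sketch assuming exactly two classes labeled +1 and -1 (the label convention is an assumption) might look like:

% sketch of a possible body (assumes two classes labeled +1 and -1)
mean1 = mean(traindata(trainlabels == 1, :), 1);    % centroid of class +1
mean2 = mean(traindata(trainlabels == -1, :), 1);   % centroid of class -1
N2 = size(testdata, 1);
predictions = zeros(N2, 1);
for i = 1:N2
    d1 = norm(testdata(i,:) - mean1);   % distance to class +1 centroid
    d2 = norm(testdata(i,:) - mean2);   % distance to class -1 centroid
    if d1 < d2
        predictions(i) = 1;             % assign x to the closest mean
    else
        predictions(i) = -1;
    end
end
acc = 100 * sum(predictions == testlabels) / N2;    % accuracy as a percentage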
50. Summary

- Assignment 3
  - Perceptron code
  - Can use perceptrons (or any classifier) to classify images
- Assignment 4
  - Nearest-neighbor with images
  - Cross-validation
  - Minimum distance classifier
  - Due Tuesday at 9:30am