1
Discussion of Assignments 3 and 4 and Cross-Validation Methods
  • Padhraic Smyth
  • Information and Computer Science
  • CS 175, Fall 2007

2
Review of Assignment 3 (Perceptron)
3
perceptron.m function
  function thresholded_outputs = perceptron(weights,data)
  % Compute the class predictions for a perceptron (linear classifier)
  %
  % Sample code for CS 175
  % Inputs
  %   weights  1 x (d+1) row vector of weights
  %   data     N x (d+1) matrix of training data
  % Outputs
  %   outputs  N x 1 vector of perceptron outputs

  % error checking
  if size(weights,1) ~= 1
    error('The first argument (weights) should be a row vector')
  end
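The slide shows only the header and the error checking; the prediction step itself would be a single vectorized line, sketched here from the input/output description above (a sketch, not the provided code):

  % compute the linear outputs for all N rows at once, then threshold
  % them to class predictions of +1 or -1
  thresholded_outputs = sign(data * weights');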

4
perceptron_error.m function
  function [cerror, mse] = perceptron_error(weights,data,targets)
  % Compute mis-classification error and mean squared error for
  % a perceptron (linear) classifier
  %
  % Sample code for CS 175
  % Inputs
  %   weights  1 x (d+1) row vector of weights
  %   data     N x (d+1) matrix of training data
  %   targets  N x 1 vector of target values (+1 or -1)
  % Outputs
  %   cerror   the percentage of examples misclassified (between 0 and 100)
  %   mse      the mean-square error (sum of squared errors divided by N)

5
perceptron_error.m function
  N = size(data, 1);

  % error checking
  if nargin ~= 3
    error('The function takes three arguments (weights, data, targets)')
  end
  if size(weights,1) ~= 1
    error('The first argument (weights) should be a row vector')
  end
  if size(data,2) ~= size(weights,2)
    error('The first two arguments (weights and data) should have the same number of columns')
  end
  if size(data,1) ~= size(targets,1)
    error('The last two arguments (targets and data) should have the same number of rows')
  end

6
perceptron_error.m function
  % calculate the unthresholded outputs, for all rows in data, N x 1 vector
  f = (weights * data')';

  % compare thresholded output to the target values to get the accuracy
  cerror = 100 * sum(sign(f) ~= targets)/N;

  % calculate the sigmoid version of the outputs, for all rows in data, N x 1 vector
  outputs = sigmoid(f);

  % compare sigmoid output vector to the target vector to get the mse
  mse = sum((outputs - targets).^2)/N;

7
perceptron_error.m function
  % calculate the unthresholded outputs, for all rows in data, N x 1 vector
  f = (weights * data')';

  % vectorized computation of classification error rate
  cerror = 100 * sum(sign(f) ~= targets)/N;

  % vectorized computation of sigmoid output
  outputs = sigmoid(f);

  % vectorized computation of MSE
  mse = sum((outputs - targets).^2)/N;

  % (sigmoid is a local function defining the sigmoid; note that it works on vectors)
8
Principle of Gradient Descent
  • Gradient descent algorithm
  • Start with some initial guess at w
  • Move downhill in small steps in the direction of steepest descent
  • What is the direction of steepest descent?
  • The negative of the gradient, evaluated at w
  • What is the gradient?
  • Gradient = vector of derivatives with respect to each component of w
  • E.g., if w = [w1, w2, w3], then gradient ∇g(w) = [∂g(w)/∂w1, ∂g(w)/∂w2, ∂g(w)/∂w3]
  • Note that the gradient is itself a vector (or a direction)
  • After moving, recompute the gradient, get a new downhill direction, and move again
  • Keep repeating this until the decrease in g(w) is less than some threshold, i.e., we appear to be on a flat part of the g(w) surface (see the sketch below)
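The sketch below illustrates these steps in MATLAB on a simple convex function; the function g, its gradient, the starting point, and the step size are all hypothetical choices for illustration.

  g     = @(w) w(1)^2 + 10*w(2)^2;        % example convex error surface g(w)
  gradg = @(w) [2*w(1), 20*w(2)];         % its gradient (a vector/direction)
  w = [4, 2];                             % initial guess at w
  rate = 0.02;  threshold = 1e-8;         % step size and stopping threshold
  gold = g(w);
  while true
    w = w - rate * gradg(w);              % step in the direction of steepest descent
    gnew = g(w);
    if abs(gold - gnew) < threshold       % decrease is tiny: flat part of the surface
      break
    end
    gold = gnew;
  end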

9
Illustration of Gradient Descent
[Figure: surface plot of g(w) over weights w1 and w2]
10
Illustration of Gradient Descent
[Figure: surface plot of g(w) over weights w1 and w2]
11
Illustration of Gradient Descent
[Figure: the g(w) surface; direction of steepest descent = direction of negative gradient]
12
Illustration of Gradient Descent
[Figure: the g(w) surface, showing the move from the original point in weight space to a new point in weight space]
13
Gradient Descent Algorithm
  • Algorithm converges to either
  • Global minimum if g(w) is convex (has a single minimum)
  • this is the case for the perceptron
  • Local minimum if g(w) has multiple local minima
  • this is the case for multilayer neural networks
  • To avoid local minima, in practice we rerun the gradient descent algorithm from multiple random starting points and pick the solution with the lowest MSE (see the sketch below)
  • Note that the backpropagation algorithm is based on gradient descent (using a clever way to calculate the gradient)
  • Note that the algorithm need not converge at all if the learning rate (i.e., step size) is too large
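A sketch of the random-restarts idea, assuming the learn_perceptron function described on the later slides (its mse output is the per-iteration MSE trace; using the run index as the random seed is just one illustrative choice):

  % rerun gradient descent from several random starting points
  % (init_method 1 = random) and keep the weights with the lowest final MSE
  bestmse = Inf;
  for run = 1:10
    [w, mse, acc] = learn_perceptron(data, targets, 0.001, 1e-6, 1, run, 0, 100);
    if mse(end) < bestmse
      bestmse = mse(end);
      bestweights = w;
    end
  end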

14
Gradient Descent Algorithm
  • Mathematically, the Gradient Descent Rule:
  • w_new = w_old - η ∇(w)
  • where
  • ∇(w) is the gradient and
  • η is the learning rate (small, positive)

15
Gradient Descent Algorithm
  • Mathematically, the Gradient Descent Rule:
  • w_new = w_old - η ∇(w)
  • where
  • ∇(w) is the gradient and
  • η is the learning rate (small, positive)
  • In MATLAB, for the perceptron with sigmoid outputs this translates into the following update rule:
  • weights = weights - rate * (o - targets(i)) * dsigmoid(o) * data(i,:)
  • Here (o - targets(i)) * dsigmoid(o) * data(i,:) is the gradient, evaluated at the current weight vector
16
learn_perceptron.m function
  function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)
  % Learn the weights for a perceptron (linear) classifier to minimize its
  % mean squared error.
  %
  % Sample code for CS 175
  % Inputs
  %   data         N x (d+1) matrix of training data
  %   targets      N x 1 vector of target values (+1 or -1)
  %   rate         learning rate for the perceptron algorithm (e.g., rate = 0.001)
  %   threshold    if the reduction in MSE from one iteration to the next is less
  %                than threshold, then halt learning (e.g., threshold = 0.000001)
  %   init_method  method used to initialize the weights (1 = random, 2 = halfway
  %                between 2 random points in each group, 3 = halfway between
  %                the centroids in each group)
  %   random_seed  this is an integer used to "seed" the random number generator
  %                for either methods 1 or 2 for initialization (this is useful

17
learn_perceptron.m function
  [N, d] = size(data);

  % error checking
  if nargin < 4
    error('The function takes at least 4 arguments (data, targets, rate, threshold)')
  end
  if size(data,1) ~= size(targets,1)
    error('The number of rows in the first two arguments (data, targets) does not match!')
  end

  % initialize the optional input arguments to defaults if not supplied
  if ~exist('k')
    k = 100;
  end
  if ~exist('plotflag')
    plotflag = 0;
  end

18
learn_perceptron.m function
  % initialize the weights
  weights = initialize_weights175(data,targets,init_method,random_seed);

  iteration = 0;
  while iteration < 2 || ( abs(mse(iteration) - mse(iteration-1)) > threshold )
    iteration = iteration + 1;
    % cycle through all of the examples
    for i=1:N
      % calculate the unthresholded output for the ith row of "data"
      o = sigmoid( weights * data(i,:)' );
      % update the weight vector
      weights = weights + rate * (targets(i) - o) * dsigmoid(o) * data(i,:);
    end
    % calculate the errors using the current parameter values
    [cerror(iteration), mse(iteration)] = perceptron_error(weights, data, targets);
  end

19
learn_perceptron.m function
  % create the plots of the MSE and Accuracy vs. iteration number
  if (plotflag == 1)
    figure(2)
    subplot(2, 1, 1)
    plot(mse,'b-')
    xlabel('iteration')
    ylabel('MSE')
    subplot(2, 1, 2)
    plot(100-cerror,'b-')
    xlabel('iteration')
    ylabel('Accuracy')
  end

  % local functions...
  function s = sigmoid(x)
  % Compute the sigmoid function, scaled from -1 to 1
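The transcript cuts off before the function bodies; below is a sketch of the two local functions, assuming the -1 to +1 scaling stated in the comment, with dsigmoid written in terms of the sigmoid output o (matching its use in the update rule):

  function s = sigmoid(x)
  % sigmoid scaled from -1 to +1; works elementwise on vectors
  s = 2 ./ (1 + exp(-x)) - 1;

  function d = dsigmoid(s)
  % derivative of the scaled sigmoid, expressed via its output s:
  % ds/dx = (1 - s.^2)/2
  d = (1 - s.^2) / 2;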

20
MATLAB Demonstration
  • Download MATLAB demo code (Zip file) from Web
    page
  • Run demo_perceptron_image_classification.m

21
Additional Concepts in Classification (Relevant
to Assignment 4)
22
Assignment 4
  • threshold_image.m
  • Simple function to display thresholded images
  • knn_dispset.m
  • Finds and displays the k-nearest-neighbors for a
    given image
  • test_classifiers.m
  • Uses cross-validation to compare classifiers
  • (code is provided)
  • test_imageclassifiers.m
  • Compare different classification methods on image
    data
  • Uses cross-validation

23
Assignment 4: using kNN to find similar images
  function list = knndispset(imageset,i,j,k,plotflag)
  % a brief description of what the function does
  % ......
  %
  % Your Name, CS 175, date
  %
  % Inputs
  %   imageset  an array structure of images (CS 175 format)
  %   i, j      integers specifying that imageset(i,j).image is the query image
  %   k         number of neighbors to find
  %   plotflag  display the k nearest neighbors if plotflag = 1
  % Outputs
  %   list      a k x 2 matrix, where the first row contains the indices from
  %             imageset of the nearest neighbor, the second row contains the
  %             indices of the 2nd nearest neighbor, and so forth
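A sketch of the neighbor search inside this function, assuming images are compared by Euclidean distance between their vectorized pixel values (the plotting part is omitted):

  query = double(imageset(i,j).image(:))';     % query image as a row vector
  [m, n] = size(imageset);
  dists = zeros(m*n, 3);                       % columns: [distance, row, col]
  c = 0;
  for a = 1:m
    for b = 1:n
      c = c + 1;
      x = double(imageset(a,b).image(:))';
      dists(c,:) = [sqrt(sum((x - query).^2)), a, b];
    end
  end
  dists = sortrows(dists, 1);                  % sort ascending by distance
  list = dists(2:k+1, 2:3);                    % skip row 1, the query itself (distance 0)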

24
MATLAB demo of knndispset
  • knndispset(i2straight,5,1,15,1)

25
MATLAB demo of knndispset
  • knndispset(i2straight,18,1,15,1)

26
Training Data and Test Data
  • Training data
  • labeled data used to build a classifier
  • Test data
  • new data, not used in the training process, used to evaluate how well a classifier does on new data
  • Memorization versus Generalization
  • better training accuracy = memorizing the training data
  • better test accuracy = generalizing to new data
  • in general, we would like our classifier to perform well on new test data, not just on training data
  • i.e., we would like it to generalize well to new data
  • Test accuracy is more important than training accuracy

27
Test Accuracy and Generalization
  • The accuracy of our classifier on new, unseen data is a fair/honest assessment of its performance
  • Why is training accuracy not good enough?
  • Training accuracy is optimistic
  • a classifier like nearest-neighbor can construct boundaries which always separate all training data points, but which do not separate new points
  • e.g., what is the training accuracy of kNN with k = 1?
  • A flexible classifier can overfit the training data
  • in effect it just memorizes the training data, but does not learn the general relationship between x and C
  • Generalization
  • We are really interested in how our classifier generalizes to new data
  • test data accuracy is a good estimate of generalization performance

28
Another Example
29
A More Complex Decision Boundary
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2); a decision boundary separates Decision Region 1 from Decision Region 2]
30
Example: The Overfitting Phenomenon
[Figure: data points, Y vs. X]
31
A Complex Model
Y = high-order polynomial in X
[Figure: Y vs. X]
32
The True (simpler) Model
Y = aX + b + noise
[Figure: Y vs. X]
33
How Overfitting affects Prediction
[Figure: predictive error vs. model complexity; the error on training data decreases as complexity grows]
34
How Overfitting affects Prediction
[Figure: predictive error vs. model complexity, now also showing the error on test data]
35
How Overfitting affects Prediction
[Figure: predictive error vs. model complexity; the underfitting region (low complexity) and overfitting region (high complexity) bracket the ideal range for model complexity, where the error on test data is lowest]
36
Comparing Two Classifiers
  • Say we have 2 classifiers, C1 and C2
  • We want to choose the best one to use for future predictions
  • e.g., medical diagnosis
  • e.g., email filtering
  • Can we use Training Accuracy to choose between them?
  • No
  • e.g., C1 = perceptron, C2 = kNN
  • e.g., training accuracy(kNN) = 100%, but it is not necessarily best
  • We can choose according to whichever of test_accuracy(C1) or test_accuracy(C2) is larger

37
Training and Validation Data
Full Data Set
Idea: train each model on the training data and then test each model's accuracy on the validation data
[Diagram: the Full Data Set split into Training Data and Validation Data]
38
The v-fold Cross-Validation Method
  • Why just choose one particular 90/10 split of the data?
  • In principle we could do this multiple times
  • v-fold Cross-Validation (e.g., v = 10)
  • randomly partition our full data set into v disjoint subsets (each roughly of size n/v, where n = total number of training data points)
  • for i = 1:10 (here v = 10)
  •   train on 90% of the data
  •   Acc(i) = accuracy on the other 10%
  • end
  • Cross-Validation Accuracy = (1/v) Σi Acc(i)
  • choose the method with the highest cross-validation accuracy
  • common values for v are 5 and 10
  • Can also do "leave-one-out", where v = n

39
Disjoint Validation Data Sets
Full Data Set
[Diagram: the 1st partition of the data held out as Validation Data; the remainder is Training Data]
40
Disjoint Validation Data Sets
Full Data Set
[Diagram: the 1st and 2nd partitions shown as disjoint Validation Data subsets; in each case the remainder is Training Data]
41
More on Cross-Validation
  • Notes
  • cross-validation generates an approximate estimate of how well the classifier will do on unseen data
  • by averaging over different partitions it is more robust than just a single train/validate partition of the data
  • v-fold cross-validation is a generalization
  • partition data into v disjoint validation subsets of size n/v
  • train, validate, and average over the v partitions
  • e.g., v = 10 is commonly used
  • v-fold cross-validation is approximately v times computationally more expensive than just fitting a model to all of the data

42
Sample MATLAB code for Cross-Validation
  % first randomly order the data (n = number of data points)
  rand('state',rseed);
  index = randperm(n);
  data = ordereddata(index,:);
  labels = orderedlabels(index);
43
Sample MATLAB code for Cross-Validation
  % now perform v-fold cross-validation
  olddata = data;
  oldlabels = labels;
  nvalidate = floor(n/v);
  for i=1:v
    % set testdata and testlabels to be the first nvalidate rows
    % of olddata, oldlabels
    ..
    % set traindata and trainlabels to be the rest of the rows
    % of olddata, oldlabels
    ...
    % call classifiers with traindata, trainlabels, testdata, testlabels
    cvaccuracy(i) = classifier(..);
    olddata = [traindata; testdata];
    oldlabels = [trainlabels; testlabels];
  end
  overall_cvaccuracy = mean(cvaccuracy);
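Filling in the elided steps as the comments describe gives the runnable sketch below; here classifier stands for any function with the signature acc = classifier(traindata,trainlabels,testdata,testlabels), e.g. the minimum_distance function provided later. Rotating olddata/oldlabels at the end of each pass moves a fresh block of rows to the front, so the v validation sets are disjoint.

  olddata = data;  oldlabels = labels;
  nvalidate = floor(n/v);
  for i = 1:v
    testdata    = olddata(1:nvalidate, :);      % first nvalidate rows = validation fold
    testlabels  = oldlabels(1:nvalidate);
    traindata   = olddata(nvalidate+1:end, :);  % the remaining rows = training data
    trainlabels = oldlabels(nvalidate+1:end);
    cvaccuracy(i) = classifier(traindata, trainlabels, testdata, testlabels);
    olddata   = [traindata; testdata];          % rotate so the next fold is a new block
    oldlabels = [trainlabels; testlabels];
  end
  overall_cvaccuracy = mean(cvaccuracy);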
44
Assignment 4 Cross-Validation code (provided)
  function [cvacc, trainacc] = test_classifiers(data1,data2,kvalues,v,rseed)
  % cross-validation results with minimum distance and knn classifiers
  %
  % INPUTS
  %   data1    n1 x d feature data for class 1
  %   data2    n2 x d feature data for class 2
  %   kvalues  row vector of values of k for knn
  %   v        for "v-fold" cross-validation
  %   rseed    random seed setting before permuting the data order
  % OUTPUTS
  %   cvacc     accuracy estimated using cross-validation
  %   trainacc  accuracy on the training data
  %   (accuracy expressed as a percentage, between 0 and 100)
45
Example of running cross-validation code
  • >> test_classifiers(d1,d2,1,5,1234)
  • Training Data Results: Minimum distance accuracy = 87.50; KNN, k = 1, accuracy = 100.00
  • Cross-Validation Results (v = 5): Minimum distance accuracy = 85.00; KNN, k = 1, accuracy = 82.50
  • If we change to k = 3 nearest-neighbors, the results are as follows:
  • >> test_classifiers(d1,d2,3,5,1234)
  • Training Data Results: Minimum distance accuracy = 87.50; KNN, k = 3, accuracy = 95.00
  • Cross-Validation Results (v = 5): Minimum distance accuracy = 85.00; KNN, k = 3, accuracy = 85.00

46
Assignment 4: Classifying images
  function [cvacc, trainacc] = test_imageclassifiers(imageset1,imageset2,plotflag,kvalues,v,rseed)
  % Learns a classifier to classify images in imageset1 from images in
  % imageset2, using minimum distance and knn classifiers, and returns
  % the training and cross-validation accuracies.
  %
  % Your name, CS 175
  %
  % INPUTS
  %   imageset1, imageset2  arrays (of size m x n, and m2 x n2) of structures,
  %       where imageset1(i,j).image is a matrix of pixel (image) values of
  %       size nx by ny. It is assumed that all images are of the same size
  %       in both imageset1 and imageset2.
  %   plotflag  if plotflag = 1, plot the mean image for each set, and plot
  %       the difference of the means of the images in the two sets
  %   kvalues   a K x 1 vector of k values for the knn classifier
  %   v         number of "folds" for v-fold cross-validation
47
The Minimum-Distance Classifier
  • A very simple classifier
  • Assume we have data from M classes (e.g., M = 2)
  • Calculate the mean for each class, e.g., Mean1 and Mean2
  • mean vector = sum of all vectors / number of vectors
  • mean vector = centroid of points
  • Classify each new point x as follows:
  • for j = 1:M
  •   calculate the distance dj = Euclidean distance(x, Meanj)
  •   (the distance from x to the jth class mean)
  • choose the minimum distance as the predicted class
  • assign x to the closest mean

48
(No Transcript)
49
Assignment 4: Minimum Distance Classifier (provided)
  function acc = minimum_distance(traindata,trainlabels,testdata,testlabels)
  % implementation of a minimum distance classifier
  %
  % INPUTS
  %   traindata    N1 x d matrix of feature data
  %   trainlabels  N1 x 1 column vector of class labels
  %   testdata     N2 x d matrix of feature data
  %   testlabels   N2 x 1 column vector of class labels
  % OUTPUTS
  %   acc  accuracy (percentage) on the test data for a classifier
  %        trained on the training data
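The slide shows only the header; a minimal sketch of a body consistent with it, assuming the class labels take exactly two values (here 1 and 2), is:

  mean1 = mean(traindata(trainlabels == 1, :), 1);   % centroid of class 1
  mean2 = mean(traindata(trainlabels == 2, :), 1);   % centroid of class 2
  N2 = size(testdata, 1);
  predictions = zeros(N2, 1);
  for i = 1:N2
    d1 = sqrt(sum((testdata(i,:) - mean1).^2));      % distance to each class mean
    d2 = sqrt(sum((testdata(i,:) - mean2).^2));
    if d1 < d2
      predictions(i) = 1;                            % assign x to the closest mean
    else
      predictions(i) = 2;
    end
  end
  acc = 100 * mean(predictions == testlabels);       % accuracy as a percentage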
50
Summary
  • Assignment 3
  • Perceptron code
  • Can use perceptrons (or any classifier) to classify images
  • Assignment 4
  • Nearest-neighbor with images
  • Cross-validation
  • Minimum distance classifier
  • Due Tuesday at 9:30am