Transcript and Presenter's Notes

Title: Beyond Perceptrons


1
Beyond Perceptrons
  • Yes, there is a lot of math in the text; we will
    concern ourselves with only enough to understand
    the important algorithms.

2
Overview / Review
  • Concepts to remember about perceptrons
  • Linear decision surface
  • Transfer function
  • Hardlimit
  • Single layer, single neuron
  • What are we trying to optimize?
  • Error

3
Perceptron Learning Rule
  • Initialize w(0) = 0
  • Input x(n) and compute y(n)
  • Update w as
  • w(n+1) = w(n) + η[d(n) − y(n)]x(n)
  • Repeat steps 2 and 3 until there are no more
    weight changes (see the sketch below)
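A minimal runnable sketch of this rule, assuming a
hard-limit transfer function, 0/1 targets, and a
learning rate eta; the data and eta value below are
illustrative, not from the slides:

    import numpy as np

    def hardlim(v):
        # Hard-limit transfer function: 1 if v >= 0, else 0
        return 1.0 if v >= 0 else 0.0

    def train_perceptron(X, d, eta=1.0, max_epochs=100):
        # X: (N, m) inputs, already augmented with a bias column of 1s
        # d: (N,) desired outputs in {0, 1}
        w = np.zeros(X.shape[1])                  # step 1: w(0) = 0
        for _ in range(max_epochs):
            changed = False
            for x, target in zip(X, d):
                y = hardlim(w @ x)                # step 2: compute y(n)
                if y != target:
                    w += eta * (target - y) * x   # step 3: update w
                    changed = True
            if not changed:                       # step 4: stop when stable
                break
        return w

    # Example: learn logical AND, which is linearly separable
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], float)
    d = np.array([0, 0, 0, 1], float)
    print(train_perceptron(X, d))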

4
Taylor Series Expansion (TSE)
  • The TSE is a way to approximate the value of a
    function F(w) near a point w*, given its value
    F(w*) and its derivatives at w*.
  • One restriction is that F must be an analytic
    function, meaning that all of its derivatives
    exist.
  • We will use the TSE to approximate the value of
    an error function.

5
Error Function
  • Remember, the error function is what we want to
    minimize for a classifier or function
    approximator.
  • In other words, we want to find the value of W (a
    vector of weights wi) that minimizes the error
    function.
  • With the perceptron, we used gradient or steepest
    descent to get the perceptron learning rule
    (PLR).
  • In the PLR we move in the direction of minus the
    gradient (i.e., down the slope).

6
TSE
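The expansion itself appeared on the slide as an
image; a standard form, written about a point w* and
consistent with the two-term truncation used on the
next slide, is:

    F(w) = F(w^*) + F'(w^*)\,(w - w^*)
         + \frac{1}{2!} F''(w^*)\,(w - w^*)^2
         + \frac{1}{3!} F'''(w^*)\,(w - w^*)^3 + \cdots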
7
TSE
  • Note that if (w − w*) is small, then only the
    first two terms of the TSE are needed
  • i.e. with a small (w − w*), the next term has a
    (w − w*)² factor, which is even smaller, etc.

8
Steepest Descent
  • If one uses just the first two terms of the TSE,
    then we get the following steepest-descent update
    for an error function
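The update rule itself was an image on the slide;
the standard steepest-descent form, with learning
rate η, is:

    w(n+1) = w(n) - \eta \, \nabla F\big(w(n)\big)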

11
LLS (Linear Least Square), vs LMS (Least Mean
Square)
  • In general, the Linear Least Square approach
    tries to find the weights that minimize the total
    error over the whole training set (see below)
  • This means we don't update the W's until all of
    the training data has been input and an error
    computed.
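The cost function itself was an image; the usual
batch least-squares cost over N training samples,
with e(n) = d(n) − y(n), is:

    E(W) = \frac{1}{2} \sum_{n=1}^{N} e^2(n)
         = \frac{1}{2} \sum_{n=1}^{N} \big(d(n) - y(n)\big)^2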

12
Other Methods
  • There are other methods to determine the weight
    update or learning rate, e.g. Newton's method,
    but all are much more complicated, so we will
    ignore them.

13
LLS vs LMS
  • In contrast, the LMS approach tries to find the
    weights that minimize the error based on
    instantaneous (per-sample) errors.

14
Least Mean Square (LMS)
  • We then have an estimate of the gradient of the
    error function
  • We can use it in the steepest descent formula, as
    shown below
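Both formulas were images on the slide; for a linear
neuron y(n) = wᵀx(n) with instantaneous cost
E(n) = ½e²(n), the standard LMS gradient estimate
and update are:

    \hat{\nabla} E(n) = -\, e(n) \, x(n), \qquad
    w(n+1) = w(n) + \eta \, e(n) \, x(n)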

15
LMS Convergence Considerations
  • What changes the weight at each iteration is the
    learning rate η and the input vector x
  • Stability of the LMS algorithm is a function of
    the statistical characteristics of the input and
    the size of the learning rate.
  • Stated another way, we have to select η according
    to the environment in which the x's are given

16
LMS Convergence Considerations
  • What we want is for the mean squared error to
    converge
  • It can be shown that the LMS algorithm will
    converge to the minimum mean square error if
    0 < η < 2/λmax, where λmax is the largest
    eigenvalue of the input correlation matrix Rx

17
LMS Convergence Considerations
  • The correlation matrix Rx = E[xxᵀ]
  • Generally we don't have the eigenvalues of Rx,
    and so must use something else
  • Use the trace of Rx
  • 0 < η < 2/Tr[Rx]

18
LMS Convergence Considerations
  • Tr[Rx] gives just a conservative estimate, since
    the trace is the sum of the (nonnegative)
    eigenvalues and so is at least λmax
  • The trace of a square matrix Rx is the sum of its
    diagonal elements
  • The diagonal elements of Rx are the average
    values of the squares of the inputs

19
LMS Convergence Considerations
  • Let's say we have the inputs x1 = [1 2 3],
    x2 = [4 7 8], and x3 = [2 3 4]
  • Then for the diagonal of Rx we have
  • (1·1 + 4·4 + 2·2)/3 = 7
  • (2·2 + 7·7 + 3·3)/3 = 62/3 ≈ 20.7
  • (3·3 + 8·8 + 4·4)/3 = 89/3 ≈ 29.7
  • Tr[Rx] = 7 + 20.7 + 29.7 ≈ 57.3 (checked in the
    sketch below)
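A quick numerical check of this arithmetic with
numpy (the original slide rounded some of these
values differently); the conservative bound on η
then follows from the trace:

    import numpy as np

    # The three example inputs from the slide
    X = np.array([[1, 2, 3],
                  [4, 7, 8],
                  [2, 3, 4]], float)

    # Sample estimate of the correlation matrix Rx = E[x x^T]
    Rx = sum(np.outer(x, x) for x in X) / len(X)

    print(np.diag(Rx))       # [ 7.  20.67  29.67] approximately
    print(np.trace(Rx))      # ~57.33
    print(2 / np.trace(Rx))  # conservative bound on eta: ~0.035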

20
LMS
  • Typically requires a number of iterations > 10
    times the dimensionality (number of elements in a
    training vector) of the input vector
  • Sometimes, rather than a fixed learning rate, one
    can use an annealing schedule that shrinks η as
    training proceeds

21
Perceptrons
  • At least with respect to classification, it
    appears you can do anything you want with
    perceptrons if you can have at least 3 layers.
    So what's the problem?
  • The problem is that the perceptron training rule
    only works for linearly separable
    classifications, and for a single layer.
    Therefore, the process of finding the hyperplanes
    cannot be automated beyond one layer.

22
BackPropagation - BP
  • In terms of impact and use, BP is probably the
    most important training algorithm developed for
    neural networks
  • Developed and published by Rumelhart and the PDP
    group in 1986
  • Allows one to automatically update the weights in
    the hidden layers, i.e. one can train a
    multilayer network

23
BP
  • BP uses the LMS (Least Mean Squares)
    minimization, like perceptrons
  • Some authors call all neurons "perceptrons," but
    in our case we will simply call them neurons
    because, as you will see, in a BP network the
    transfer functions for the neurons cannot be like
    those for perceptrons, i.e. they cannot be
    hard-limits. They must be differentiable.

24
BP
  • The error signal at the output of neuron j at
    iteration n, i.e. on the nth training sample, is
    ej(n) = dj(n) − yj(n)
  • The instantaneous value of the error (we will
    call it energy) for neuron j is Ej(n) = ½ej²(n)

25
BP
  • Since there may be more than one output layer
    neuron, we define the total energy over all the
    output layer neurons as E(n) = ½ Σj ej²(n)
  • The average squared error energy over N training
    samples becomes Eav = (1/N) Σn E(n)

26
BP
  • The objective of the learning process is to
    adjust the free parameters (weights) to minimize
    Eav
  • To do this, we use an LMS approach, i.e. the free
    parameters (weights and biases) are updated on a
    pattern-by-pattern basis until one epoch has
    occurred.
  • If the error is then acceptable, we stop;
    otherwise, we run for another epoch.

27
BP: Hold on to your hats!
  • For neuron j with m inputs, and thus m weights,
    the induced local field is
    vj(n) = Σi wji(n) yi(n)

28
BP
  • Thus the actual output of neuron j is
    yj(n) = φ(vj(n))
  • In the LMS algorithm we want to apply a weight
    correction that is proportional to the gradient
    at this point. For each weight, the correction is
    based on the partial derivative of the error with
    respect to the weight being updated (next slide).

29
BP
  • We want ∂E(n)/∂wji(n)
  • We can use the chain rule of calculus to find
    this, as shown below
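The expansion (an image on the slide) factors the
gradient into four local derivatives:

    \frac{\partial E(n)}{\partial w_{ji}(n)} =
    \frac{\partial E(n)}{\partial e_j(n)}
    \frac{\partial e_j(n)}{\partial y_j(n)}
    \frac{\partial y_j(n)}{\partial v_j(n)}
    \frac{\partial v_j(n)}{\partial w_{ji}(n)}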

30
BP
  • The derivative ∂E(n)/∂wji(n)
  • is called by the author a sensitivity factor, in
    that it determines the direction of search in
    weight space with respect to a particular weight

31
BP
  • Earlier we had E(n) = ½ Σj ej²(n)

32
BP
  • So now, from the preceding we have
    ∂E(n)/∂ej(n) = ej(n)

33
BP
  • Using our definition of ej(n) = dj(n) − yj(n),
    we have ∂ej(n)/∂yj(n) = −1

34
BP
  • Similarly, ∂yj(n)/∂vj(n) = φ′j(vj(n)) and
    ∂vj(n)/∂wji(n) = yi(n)
35
BP
  • This finally gives us
    ∂E(n)/∂wji(n) = −ej(n) φ′j(vj(n)) yi(n)

36
BP
  • According to the learning rule that we have, we
    want to update a weight by minus the gradient of
    the error function with respect to that weight
    (remember, we want to go in the negative
    direction of the gradient), multiplied by the
    learning rate, i.e.
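The update (shown on the slide as an image) combines
the four factors above; defining the local gradient
δj(n) = ej(n) φ′j(vj(n)) gives the standard form:

    \Delta w_{ji}(n) = -\eta \, \frac{\partial E(n)}{\partial w_{ji}(n)}
                     = \eta \, \delta_j(n) \, y_i(n)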

39
BP
  • In words, the preceding derivation says
  • Give the network an input, say x(n)
  • For an output neuron j, calculate the output
    yj(n) and the error ej(n) = dj(n) − yj(n)
  • We also need the derivative of the transfer
    function for the neuron.

40
BP
  • With the preceding, we know how to update the
    neuron, as sketched below
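A minimal sketch of this single-neuron update,
assuming a logistic sigmoid transfer function; the
input, weights, target, and learning rate below are
illustrative, not from the slides:

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    # Illustrative values (assumed for this sketch)
    x   = np.array([0.5, -0.3, 1.0])   # inputs, with bias appended
    w   = np.array([0.2,  0.4, 0.1])   # current weights of neuron j
    d   = 1.0                          # desired output d_j(n)
    eta = 0.1                          # learning rate

    v = w @ x                 # induced local field v_j(n)
    y = sigmoid(v)            # output y_j(n)
    e = d - y                 # error e_j(n) = d_j(n) - y_j(n)
    delta = e * y * (1 - y)   # local gradient; sigmoid'(v) = y(1 - y)
    w = w + eta * delta * x   # update: delta_w_ji = eta * delta_j * y_i
    print(w)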

41
BP Example
  • Consider an output layer neuron defined as
    follows
  • Update W

42
BP
  • There are two questions about the BP algorithm
    that need to be addressed
  • Question 1: Since we need the derivative of the
    transfer function as part of the weight update
    formula, what happens with a perceptron with,
    say, a hardlimit transfer function?
  • Answer: We can't use it, i.e. the transfer
    function must be everywhere differentiable

43
BP
  • Question 2: What if the neuron is not an output
    layer neuron? How do we update its weight(s)?
  • Answer: We use the same basic approach, but we
    need a way to figure out how to assign an error
    value to each hidden layer neuron

44
BP
  • Note that "hidden layer neuron" is simply what a
    neuron in any layer other than the output layer
    is called. This is because its output is not
    directly visible at the output layer.
  • By error assignment, we mean: how much of the
    error in the output layer is the responsibility
    of the given hidden layer neuron?

45
BP
  • Note that the error assigned to a hidden layer
    neuron is not with respect to a single output
    neuron. This is because the output of a hidden
    layer neuron is input to all of the output layer
    neurons.
  • Thus, the error assigned to a hidden layer neuron
    will be a function of all of the errors in the
    next layer of neurons.

50
BP FINALLY!
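The derivation leading up to this point appeared
only as equation images; its end result, the
standard backpropagated local gradient for a hidden
neuron j (summing over the neurons k that it feeds
in the next layer), is:

    \delta_j(n) = \varphi_j'\big(v_j(n)\big)
                  \sum_k \delta_k(n) \, w_{kj}(n)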
51
Put It All Together
52
  • In this example I have tried to simplify as much
    as possible, while retaining everything that is
    needed for a thorough example.
  • Network configuration:
  • 3 layers, numbered 1, 2, 3, where 1 is the first
    or input layer, and 3 is the output layer.
  • Therefore, there are 2 hidden layers, i.e. layers
    1 and 2
  • The transfer function for all neurons is simply a
    linear function: the output is the weighted sum
    of the inputs

53
  • The input vector X is a 2-element vector, which
    we augment with a 1 value so that we can take
    care of the bias as part of the weight vector

54
  • The network is set up as follows

[Figure: network diagram showing layers 1, 2, and 3;
inputs X1 and X2 with a constant bias input of 1 at
each neuron; neuron outputs labeled Y11, Y12, Y13,
Y22, and Y23; and initial connection weights 0.25,
0.3, 0.1, 0.6, -0.1, 0.05, 0.2, 0.4, 0.3, -0.1, 0.2,
0.35, and 0.15.]
55
  • From the preceding example, the weight matrices
    become

56
  • For this example we will use as the initial
    input and target values
  • Remember, in our notation Y0 = X
  • Also, let the learning rate η = 0.145

57
  • We first feed the inputs forward to get an
    output. That will be as follows

58
  • Assuming that my calculations are correct, we have

[Figure: the same network diagram, annotated with
the computed neuron outputs (0.4), (0.17), (0.51),
(0.303), and (0.5915).]
59
  • Now we need to feed back the error.
  • For the output layer we have

60
  • Now we evaluate the updates for the weights of
    the output layer neurons, using η = 0.145

61
  • Assuming that my calculations are correct, we now
    have calculated

[Figure: the network diagram with the updated
output-layer weights: 0.26, 0.659, -0.0698, 0.342,
0.325, and 0.396.]
62
  • The challenging part is the calculation of the
    hidden layer weight updates. From class and the
    text we have (or we still want) the update
    Δwji(n) = η δj(n) yi(n)
  • The problem is we don't know ej(n), because we
    don't know the target output for a hidden neuron

63
  • In the text, the author defines the local
    gradient δ
  • Using our notation, for an output layer neuron
    we have δj(n) = ej(n) φ′j(vj(n))
  • So the question is, what is the δ for hidden
    layer neurons?

64
  • In the text, the author derives the hidden-layer
    δ as δj(n) = φ′j(vj(n)) Σk δk(n) wkj(n)

65
  • Now we are ready to update the weights in layers
    1 and 2. As before, Δwji(n) = η δj(n) yi(n)

67
  • Assuming that my calculations are correct, we now
    have calculated

[Figure: the network diagram, now also showing the
updated layer-2 weights: 0.335, 0.139, 0.417, and
0.393.]
68
  • Now we only have to calculate the weight update
    for the first or input layer

69
  • Assuming that my calculations are correct, we now
    have calculated

[Figure: the network diagram, now also showing the
updated layer-1 weights: 0.144, 0.288, and -0.056.]
70
What Next?
  • Now, input the next (X, target output) pair and
    again update as shown; a sketch of one full
    training step is below
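To make this loop concrete, here is a compact sketch
of one full BP training step in the spirit of the
example (linear transfer functions, so φ′(v) = 1,
with biases absorbed by augmenting each layer's
input with a 1). The layer sizes, weights, input,
and target below are illustrative placeholders, not
the slide's values:

    import numpy as np

    # Illustrative 2-2-1 network with linear neurons
    W1 = np.array([[0.1, 0.2, 0.3],      # hidden layer: 2 neurons x (2 inputs + bias)
                   [0.4, 0.3, 0.2]])
    W2 = np.array([[0.25, 0.35, 0.15]])  # output layer: 1 neuron x (2 inputs + bias)
    eta = 0.145                          # learning rate from the example

    def aug(y):
        return np.append(y, 1.0)         # append the constant bias input

    def train_step(x, d):
        global W1, W2
        # Forward pass: linear transfer, y = W @ [inputs; 1]
        y1 = W1 @ aug(x)
        y2 = W2 @ aug(y1)
        # Backward pass: phi'(v) = 1 for linear neurons
        delta2 = d - y2                   # output-layer local gradient
        delta1 = W2[:, :-1].T @ delta2    # backpropagate (skip bias column)
        # Weight updates: delta_w = eta * delta * input
        W2 += eta * np.outer(delta2, aug(y1))
        W1 += eta * np.outer(delta1, aug(x))
        return y2

    print(train_step(np.array([1.0, 2.0]), np.array([1.0])))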