Title: Function Approximation With ANNs
1. Function Approximation With ANNs
- Resources
  - Chapter 20, textbook; Section 20.5
  - Fausett (1994), chapter on the delta rule
- Supervised training
  - Delta rule: single-layer networks
  - Backpropagation: multi-layer networks
2. Single-Layer Feed-Forward ANN
- Network has n inputs and m outputs
- One layer of weights
- Training data
  - pairs (input, output)
  - input is a vector of length n
  - output is a vector of length m
- The function approximation problem is to find neural network weights defining a function f such that f(input) ≈ output
- Sometimes called association learning
[Figure: single-layer network, n inputs fully connected to m outputs]
3. Interpolation Problem
- May be viewed as multidimensional interpolation
  - the numbers of dimensions correspond to the numbers of input units and output units
- Example: a one-dimensional input, one-dimensional output problem
  - input vectors have only one component; output vectors also have one component
  - this is now a curve-fitting problem: which curve we choose to fit the data will change the result for new inputs, i.e., for generalization
  - in general, there is no unique curve to choose
- Raises a sampling issue
  - we need data that adequately sample the distribution
4. Sampling Issue

[Figure: example of an underlying system from which data are obtained]

If this were the underlying system we are obtaining data from, how would we select the samples?
5. Example of Association Learning
- Images of three types, converted to grayscale
- Want to associate each image with a shape name
  - dog shape with an audio representation of the word "dog"
  - yellow pail with an audio representation of "pail"
  - green bucket with an audio representation of "bucket"
- Assume each image is 16x16 pixels (256 inputs)
- Assume the output audio is represented by 10 real-number values
- How do we find neural network weights to approximate f(image) ≈ audio?
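A rough sketch of the shapes involved, assuming the 256-input, 10-output layout above; the random image and weights here are placeholders, not the course data.

```python
import numpy as np

# Illustrative sketch only: a single weight layer mapping a 16x16 grayscale
# image (flattened to 256 inputs) to a 10-value audio representation.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(10, 256))   # one weight per (output, input) pair

def forward(image_16x16):
    x = image_16x16.reshape(256)             # flatten the image into the input vector
    return W @ x                             # weighted sums give the 10 output values

audio_estimate = forward(rng.random((16, 16)))
print(audio_estimate.shape)                  # (10,)
```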
6. Simplify the Problem
- One output unit
- Activation function: identity, f(x) = x
  - output is defined just by the weighted sum of the inputs
- Example problem, with n = 3
[Figure: n inputs fully connected to 1 output unit]
How do we establish weights for the ANN?
7. General Weight Update Algorithm
- Initialize the weights to random values
- Compute errors
- While the errors produced by the network are too great:
  - for the current sample:
    - update the network weights using a weight update rule
    - re-compute the error for the current sample
- (Incremental weight update; a sketch of this loop follows below)
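A minimal sketch of this loop in Python. The update rule is passed in (e.g., the delta rule introduced on the next slide); the error measure, tolerance, and names are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Incremental (per-sample) training loop sketched on this slide.
def train(samples, n_inputs, update_rule, tolerance=1e-3, max_epochs=1000):
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.5, 0.5, size=n_inputs)     # initialize weights randomly
    for _ in range(max_epochs):
        total_error = 0.0
        for x, t in samples:                      # one pass over the data = one epoch
            w = update_rule(w, x, t)              # update weights for the current sample
            total_error += (t - w @ x) ** 2       # re-compute error for this sample (identity activation)
        if total_error < tolerance:               # stop once the errors are not "too great"
            break
    return w
```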
8. Weight Update Rule 1: Delta Rule

Δw_I = α (t − y_in) x_I

- Δw_I: change in the I-th weight of the weight vector
- α: learning rate (scalar, constant)
- t: target, or correct, output
- y_in: net (summed, weighted) input to the output unit
- x_I: I-th input value
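A minimal Python rendering of this rule (the function and parameter names are mine, not from the slides):

```python
import numpy as np

# Delta rule for a single output unit with an identity activation:
# delta_w_I = alpha * (t - y_in) * x_I
def delta_rule_update(w, x, t, alpha=0.5):
    y_in = w @ x                       # net (summed, weighted) input to the output unit
    return w + alpha * (t - y_in) * x  # adjust every weight in proportion to its input

# Usable as the 'update_rule' in the training loop sketched above.
```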
9. Example
- W = (W1, W2, W3)
- Initially W = (0.5, 0.2, 0.4)
- Let α = 0.5
- Apply the delta rule, Δw_I = α (t − y_in) x_I

[Figure: three-input network with weights W1, W2, W3 feeding one output unit]
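As an illustration of a single update (the actual training samples are in the linked spreadsheet; this input/target pair is made up): with x = (1, 1, 1) and t = 1, y_in = 0.5 + 0.2 + 0.4 = 1.1, so Δw_I = 0.5 · (1 − 1.1) · 1 = −0.05 for every weight, and W becomes (0.45, 0.15, 0.35).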
10. One Epoch of Training
[Table: delta-rule weight updates over one epoch of training]

11. Step 1 of Training
[Table: delta-rule weight update for the first training step]

12. Remaining Steps in First Epoch of Training
[Table: delta-rule weight updates for the remaining steps of the first epoch]
13. Completing the Example
- After 18 epochs, the weights are
  - W1 = 0.990735
  - W2 = -0.970018005
  - W3 = 0.98147
- Does this adequately approximate the training data?

[Figure: network with the learned weights W1, W2, W3]
http://www.cprince.com/courses/cs5541fall03/lectures/neural-networks/delta-rule1.xls
14. Actual Outputs
So, we have one method to incrementally adjust the network weights, based on a series of training samples. This is typically called training or learning.
15. What About...
- The following weights?
  - W1 = 1
  - W2 = -1
  - W3 = 1
- Generalization? E.g., the new inputs below (see the quick check that follows):
  - (0 1 0)
  - (1 1 0)
  - (0 1 1)
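A quick check of what these weights produce for the new inputs, using the identity activation from slide 6 (the output is just the weighted sum):

```python
import numpy as np

W = np.array([1.0, -1.0, 1.0])
for x in ([0, 1, 0], [1, 1, 0], [0, 1, 1]):
    print(x, W @ np.array(x, dtype=float))   # -> -1.0, 0.0, 0.0
```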
16. Why Is the Delta Rule Effective?
- The delta rule implements a form of error minimization
  - change the weights to reduce the sum squared error, E
- For a specific training pattern, the sum squared error is E = (t − y_in)²
- Recall
  - t: desired output of the network
  - y_in: actual output of the network
- The derivative of E gives the slope, or gradient, of E
  - it gives both the direction of most rapid increase in E, the error, and the direction of most rapid decrease
- We want the derivative with respect to the weights
  - we are adjusting the weights in an effort to decrease the error
  - since y_in is a function of multiple weights, we will have partial derivatives
- Adjusting weight w_I in the direction of −∂E/∂w_I will reduce the error (see the numeric check below)
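A quick numeric check of this claim; the weights, sample, and target here are hypothetical.

```python
import numpy as np

# One delta-rule step moves the weights opposite the gradient of
# E = (t - y_in)^2, so E for this sample should shrink.
w = np.array([0.5, 0.2, 0.4])
x = np.array([1.0, 1.0, 1.0])
t, alpha = 1.0, 0.5

error_before = (t - w @ x) ** 2
w_new = w + alpha * (t - w @ x) * x    # delta-rule step, proportional to -dE/dw
error_after = (t - w_new @ x) ** 2

print(error_before, error_after)       # 0.01 -> 0.0025: error_after < error_before
```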
17. Delta Rule Approach

E = (t − y_in)²

- E and y_in are defined as before
  - note that y_in is computed for one training sample
- Define training as modifying the weights so that the error is reduced
  - typically, this is done iteratively
  - e.g., the weights are modified to reduce the error for the current training sample, then modified to reduce the error for another training sample, etc.
- Approach
  - take the derivative of E, the error, with respect to the weights
  - this tells us how to change the weights so as to minimize E
  - results in changes in the weights that reduce E, the error
18. Partial Derivative of E, the Error

Since E = (t − y_in)²,

∂E/∂w_I = −2 (t − y_in) ∂y_in/∂w_I        (i.e., chain rule; t is a constant in this context)

Since y_in = Σ_i x_i w_i, we have ∂y_in/∂w_I = x_I, so

∂E/∂w_I = −2 (t − y_in) x_I
19. Completing the Derivation of the Delta Rule

Because we are looking for the direction that decreases E, we negate the gradient:

−∂E/∂w_I = 2 (t − y_in) x_I

Incorporating the constant of 2 into the learning rate, α, gives

Δw_I = α (t − y_in) x_I

Changing the weight by this amount will reduce the error, E, for this data sample.
20. Delta Rule With Activation Functions

Δw_I = α (t − y) f′(y_in) x_I

- Δw_I: change in the I-th weight of the weight vector
- α: learning rate (scalar, constant)
- t: target, or correct, output
- y_in: net (summed, weighted) input to the output unit
- x_I: I-th input value
- f: differentiable activation function (e.g., a sigmoid)
- y: output of the network, y = f(y_in)
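A direct Python rendering of this rule; f and its derivative are passed in by the caller, and the names are illustrative.

```python
import numpy as np

# Delta rule with a differentiable activation function f:
# delta_w_I = alpha * (t - y) * f'(y_in) * x_I
def delta_rule_update_f(w, x, t, f, f_prime, alpha=0.5):
    y_in = w @ x               # net input to the output unit
    y = f(y_in)                # network output y = f(y_in)
    return w + alpha * (t - y) * f_prime(y_in) * x
```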
21. Derivation of the Delta Rule for Use With Activation Functions
- f: differentiable activation function
  - e.g., sigmoidal
- Output of the network: y = f(y_in)

Now we need ∂E/∂w_I for E = (t − y)².
22. Derivation of ∂E/∂w_I

∂E/∂w_I = −2 (t − y) ∂y/∂w_I

In the derivation, we can apply the chain rule to this: d/dx f(g(x)) = f′(g(x)) g′(x), giving

∂y/∂w_I = f′(y_in) ∂y_in/∂w_I = f′(y_in) x_I
23. Delta Rule With Activation Function

Because we are looking for the direction that decreases E, we negate:

−∂E/∂w_I = 2 (t − y) f′(y_in) x_I

Incorporating the constant into α gives

Δw_I = α (t − y) f′(y_in) x_I
24. How do we modify this to use the sigmoidal activation function?
25. Delta Rule With Sigmoidal Activation Function
- Need to take the derivative of the sigmoidal function (see below)
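Assuming the binary sigmoid f(x) = 1 / (1 + e^(−x)), its derivative can be written in terms of the function itself:

f′(x) = e^(−x) / (1 + e^(−x))² = f(x) (1 − f(x))

so, with y = f(y_in), the update becomes Δw_I = α (t − y) y (1 − y) x_I.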
26. Extension to Multiple Output Units
- We have been dealing with only a single output unit
- Need to generalize to multiple output units

Δw_IJ = α (t_J − y_J) f′(y_in_J) x_I

- w_ij: weight from the i-th input unit to the j-th output unit
- t_j: expected output from the j-th output unit
- y_j: actual output from the j-th output unit
- y_in_j: summed, weighted input to the j-th output unit
- x_i: i-th input value

[Figure: n inputs fully connected to m outputs]
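A vectorized sketch of the multi-output rule; the sigmoid activation and the (outputs × inputs) matrix layout are my choices, not fixed by the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# W has shape (m outputs, n inputs); every weight w_ij is updated from the
# error of its own output unit: delta_w_ij = alpha * (t_j - y_j) f'(y_in_j) x_i
def delta_rule_update_multi(W, x, t, alpha=0.5):
    y_in = W @ x                             # summed weighted input per output unit
    y = sigmoid(y_in)                        # actual outputs
    delta = (t - y) * y * (1.0 - y)          # (t_j - y_j) f'(y_in_j), using f' = y(1 - y)
    return W + alpha * np.outer(delta, x)    # outer product gives the full weight update
```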
27. What About Multiple Layers?
- This works when we have
  - known outputs (supervision)
  - a single-layer ANN
- How can we train the weights of multi-layer ANNs?
28. Backpropagation
- A method of training the weights in a multi-layer, feed-forward network
- The problem is to establish correct or expected values for layers other than the output layer
- Method
  - start at the output layer
  - work backwards from the output layer to successive layers to the left
  - modify the weights at each step
29. Problem

[Figure: general form of the connections between layers — inputs x, hidden units z reached through weights v, output units y reached through weights w]

- First, compute the weight updates for the rightmost (hidden-to-output) weight layer using the delta rule
30. Now, We Need to Consider the Input-Layer-to-Hidden-Layer Weights
- Previously, in the delta rule, we needed the partial derivative ∂E/∂w_IJ for a hidden-to-output weight
- We now need the partial derivative ∂E/∂v_IJ for an input-to-hidden weight
31. Error Over All Output Units

Previously, E = (t − y)², where t is the expected output and y is the actual output.

Now, we'll consider all p output units (the error for one training sample):

E = Σ_{j=1..p} (t_j − y_j)²

Our goal is to find ∂E/∂v_IJ.
32. Taking the partial derivative of this, we'll find it defined in terms of the v weights. Why? Because the y outputs are indirectly generated, in part, by the v weights. Since each v weight may have an indirect effect on potentially each of the y outputs, we need to consider each of the y outputs in the formulation.
33. Since E = Σ_j (t_j − y_j)², the derivative can be taken term by term. Collapsing back to the summation, we have

∂E/∂v_IJ = −2 Σ_j (t_j − y_j) ∂y_j/∂v_IJ
34. Recall, y_j = f(y_in_j). Now, deriving ∂y_j/∂v_IJ:

∂y_j/∂v_IJ = f′(y_in_j) ∂y_in_j/∂v_IJ        (chain rule)

Substituting gives

∂E/∂v_IJ = −2 Σ_j (t_j − y_j) f′(y_in_j) ∂y_in_j/∂v_IJ
35. For convenience, let

δ_j = (t_j − y_j) f′(y_in_j)        (we can calculate this directly)

Now, derive ∂y_in_j/∂v_IJ.
36. Deriving ∂y_in_j/∂v_IJ

Recall, y_in_j = Σ_k z_k w_kj, where z_J is the single hidden-layer unit that weight v_IJ is affecting, so

∂y_in_j/∂v_IJ = w_Jj ∂z_J/∂v_IJ

Now, we need ∂z_J/∂v_IJ.
37. Deriving ∂z_J/∂v_IJ

Recall, z_J = f(z_in_J), with z_in_J = Σ_i x_i v_iJ.

∂z_J/∂v_IJ = f′(z_in_J) ∂z_in_J/∂v_IJ        (chain rule)

Since ∂z_in_J/∂v_IJ = x_I        (we can calculate this directly)

Finally, we have an expression in terms of the v weights!
38. Now, we have all the pieces. Putting it together:

∂E/∂v_IJ = −2 Σ_j (t_j − y_j) f′(y_in_j) w_Jj f′(z_in_J) x_I = −2 f′(z_in_J) x_I Σ_j δ_j w_Jj
39. Finally, the weight change for connections to hidden-layer units (negating the gradient and folding the constant into α):

Δv_IJ = α x_I f′(z_in_J) Σ_j δ_j w_Jj
40. Application of Backpropagation
- The equations so far have been general, for two-weight-layer networks
- Specialize the weight update equations for the following network architecture

[Figure: specific two-layer architecture — input units x, hidden units z, output units y]
41. z-y and x-z Weight Updates

z-y weights:  Δw_JK = α (t_K − y_K) f′(y_in_K) z_J

x-z weights:  Δv_IJ = α x_I f′(z_in_J) Σ_k (t_k − y_k) f′(y_in_k) w_Jk

These update rules are applied once per data sample per epoch (a code sketch follows below).
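A minimal sketch of one per-sample backpropagation update for the two-layer network above, assuming sigmoid activations on both layers; the variable names follow the x/z/y and v/w notation, but the code layout is mine.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# V maps inputs x to hidden units z; W maps hidden units z to outputs y.
# V has shape (n_inputs, n_hidden), W has shape (n_hidden, n_outputs).
def backprop_step(V, W, x, t, alpha=0.5):
    # Forward pass
    z_in = x @ V                 # z_in_J = sum_i x_i v_iJ
    z = sigmoid(z_in)
    y_in = z @ W                 # y_in_K = sum_J z_J w_JK
    y = sigmoid(y_in)

    # Output-layer error terms: delta_K = (t_K - y_K) f'(y_in_K), with f' = y(1 - y)
    delta_out = (t - y) * y * (1.0 - y)

    # z-y weight updates: delta_w_JK = alpha * delta_K * z_J
    W_new = W + alpha * np.outer(z, delta_out)

    # Hidden-layer error terms: f'(z_in_J) * sum_K delta_K w_JK
    delta_hidden = (W @ delta_out) * z * (1.0 - z)

    # x-z weight updates: delta_v_IJ = alpha * x_I * delta_hidden_J
    V_new = V + alpha * np.outer(x, delta_hidden)
    return V_new, W_new
```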