1
Introduction to Neural Networks
  • John Paxton
  • Montana State University
  • Summer 2003

2
Chapter 6 Backpropagation
  • 1986 Rumelhart, Hinton, Williams
  • Gradient descent method that minimizes the total
    squared error of the output.
  • Applicable to multilayer, feedforward, supervised
    neural networks.
  • Revitalizes interest in neural networks!

3
Backpropagation
  • Appropriate for any domain where inputs must be
    mapped onto outputs.
  • 1 hidden layer is sufficient to learn any
    continuous mapping to arbitrary accuracy!
  • Memorization versus generalization tradeoff.

4
Architecture
  • input layer, hidden layer, output layer

  [Figure: feedforward architecture. Input units x1 .. xn (plus a bias
  unit 1) feed hidden units z1 .. zp through weights v (e.g., vnp);
  hidden units (plus a bias unit 1) feed output units y1 .. ym through
  weights w (e.g., wpm).]
5
General Process
  • Feed the input signals forward.
  • Backpropagate the error.
  • Adjust the weights.

6
Activation Function Characteristics
  • Continuous.
  • Differentiable.
  • Monotonically nondecreasing.
  • Easy to compute.
  • Saturates (reaches limits).

7
Activation Functions
  • Binary sigmoid: f(x) = 1 / (1 + e^-x)
    f '(x) = f(x) · [1 - f(x)]
  • Bipolar sigmoid: f(x) = -1 + 2 / (1 + e^-x)
    f '(x) = 0.5 · [1 + f(x)] · [1 - f(x)]
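
  The two functions and their derivatives, as a minimal NumPy sketch
  (function names are illustrative):

    import numpy as np

    def binary_sigmoid(x):
        # f(x) = 1 / (1 + e^-x), range (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def binary_sigmoid_prime(x):
        f = binary_sigmoid(x)
        return f * (1.0 - f)

    def bipolar_sigmoid(x):
        # f(x) = -1 + 2 / (1 + e^-x), range (-1, 1)
        return -1.0 + 2.0 / (1.0 + np.exp(-x))

    def bipolar_sigmoid_prime(x):
        f = bipolar_sigmoid(x)
        return 0.5 * (1.0 + f) * (1.0 - f)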

8
Training Algorithm
  • 1. initialize weights to small random values,
    for example -0.5 .. 0.5
  • 2. while the stopping condition is false, do
    steps 3-8
  • 3. for each training pair do steps 4-8

9
Training Algorithm
  • 4. z_in,j = Σi (xi · vij)
       zj = f(z_in,j)
  • 5. y_in,j = Σi (zi · wij)
       yj = f(y_in,j)
  • 6. error(yj) = (tj - yj) · f '(y_in,j)
       tj is the target value
  • 7. error(zk) = [ Σj error(yj) · wkj ] · f '(z_in,k)

10
Training Algorithm
  • 8. wkj(new) = wkj(old) + α · error(yj) · zk
       vkj(new) = vkj(old) + α · error(zj) · xk
  • α is the learning rate
  • An epoch is one cycle through the training
    vectors.
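
  Steps 1-8 as one compact NumPy sketch (array names, shapes, and the
  learning rate are illustrative; bias units are handled by appending a
  constant 1 to the input and hidden vectors):

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):                                # bipolar sigmoid
        return -1.0 + 2.0 / (1.0 + np.exp(-x))

    def f_prime(x):
        fx = f(x)
        return 0.5 * (1.0 + fx) * (1.0 - fx)

    n, p, m = 2, 4, 1                        # input, hidden, output units
    # Step 1: small random weights, e.g. in -0.5 .. 0.5.
    V = rng.uniform(-0.5, 0.5, (n + 1, p))   # +1 row for the bias unit
    W = rng.uniform(-0.5, 0.5, (p + 1, m))

    alpha = 0.2
    X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], float)  # XOR inputs
    T = np.array([[-1], [1], [1], [-1]], float)                # targets

    for epoch in range(1000):                # step 2: stopping condition
        for x, t in zip(X, T):               # step 3: each training pair
            x1 = np.append(x, 1.0)           # bias input
            z_in = x1 @ V                    # step 4
            z1 = np.append(f(z_in), 1.0)     # bias hidden unit
            y_in = z1 @ W                    # step 5
            y = f(y_in)
            err_y = (t - y) * f_prime(y_in)             # step 6
            err_z = (err_y @ W[:p].T) * f_prime(z_in)   # step 7
            W += alpha * np.outer(z1, err_y)            # step 8
            V += alpha * np.outer(x1, err_z)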

11
Choices
  • Initial Weights
  • random values in -0.5 .. 0.5; we don't want the
    derivative to be 0
  • Nguyen-Widrow: β = 0.7 · p^(1/n), where n = number
    of input units and p = number of hidden units
    vij = β · vij(random) / || vj(random) ||
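
  A sketch of this initialization, assuming NumPy; each hidden unit's
  incoming weight vector is rescaled to length β, and bias weights are
  commonly drawn from -β .. β:

    import numpy as np

    def nguyen_widrow(n, p, rng):
        # beta = 0.7 * p^(1/n)
        beta = 0.7 * p ** (1.0 / n)
        v = rng.uniform(-0.5, 0.5, (n, p))
        # Rescale each hidden unit's column to length beta.
        v = beta * v / np.linalg.norm(v, axis=0)
        bias = rng.uniform(-beta, beta, p)
        return v, bias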

12
Choices
  • Stopping Condition (avoid overtraining!)
  • Set aside some of the training pairs as a
    validation set.
  • Stop training when the error on the validation
    set stops decreasing.
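
  A minimal early-stopping loop; train_one_epoch and validation_error
  are hypothetical helpers, and the patience threshold is an assumption
  (the slide only says to stop when the error stops decreasing):

    best_error, patience, bad_epochs = float("inf"), 5, 0
    while bad_epochs < patience:
        train_one_epoch()              # steps 3-8 over the training pairs
        error = validation_error()     # error on the held-out pairs
        if error < best_error:
            best_error, bad_epochs = error, 0
        else:
            bad_epochs += 1            # error stopped decreasing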

13
Choices
  • Number of Training Pairs
  • rule of thumb: number of pairs ≈ total number of
    weights / desired average error on the test set
  • where the average error on the training pairs is
    driven to half of the desired test-set average
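
  For example (an illustrative calculation): a 2-4-1 network with bias
  units has (2+1)·4 + (4+1)·1 = 17 weights, so a desired average
  test-set error of 0.1 suggests roughly 17 / 0.1 = 170 training pairs,
  trained until the average error on the training pairs falls to 0.05.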

14
Choices
  • Data Representation
  • Bipolar is better than binary because units with
    a 0 activation don't learn (their weight updates
    are multiplied by 0).
  • Discrete values (e.g. red, green, blue)?
  • Continuous values (e.g. 15.0 .. 35.0)?
  • Number of Hidden Layers
  • 1 is sufficient
  • Sometimes multiple layers can speed up learning
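
  One common encoding sketch (the specific choices here are assumptions,
  not from the slides): a discrete value gets one bipolar unit per
  category, and a continuous value is rescaled into -1 .. 1:

    def encode_color(color):
        # One bipolar unit per category: red, green, blue.
        return [1.0 if color == c else -1.0
                for c in ("red", "green", "blue")]

    def encode_temperature(t, lo=15.0, hi=35.0):
        # Map a continuous value from lo .. hi into -1 .. 1.
        return 2.0 * (t - lo) / (hi - lo) - 1.0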

15
Example
  • XOR.
  • Bipolar data representation.
  • Bipolar sigmoid activation function.
  • α = 1
  • 3 input units, 5 hidden units, 1 output unit
    (these counts include the bias units)
  • Initial weights are all 0.
  • Training example (1, -1). Target: 1.

16
Example
  • 4. z1 = f(1·0 + 1·0 + (-1)·0) = f(0) = 0
       z2 = z3 = z4 = 0
  • 5. y1 = f(1·0 + 0·0 + 0·0 + 0·0 + 0·0) = f(0) = 0
  • 6. error(y1) = (1 - 0) · 0.5 · (1 + 0) · (1 - 0)
       = 0.5
  • 7. error(z1) = 0 · f '(z_in,1) = 0 = error(z2) =
       error(z3) = error(z4)

17
Example
  • 8. w01(new) = w01(old) + α · error(y1) · z0
       = 0 + 1 · 0.5 · 1 = 0.5
  • v21(new) = v21(old) + α · error(z1) · x2
       = 0 + 1 · 0 · (-1) = 0.

18
Exercise
  • Draw the updated neural network.
  • Present (1, -1) as an example to classify. How is
    it classified now?
  • If learning were to occur, how would the network
    weights change this time?

19
XOR Experiments
  • Binary activation / binary representation: 3000
    epochs.
  • Bipolar activation / bipolar representation: 400
    epochs.
  • Bipolar activation / modified bipolar
    representation (-0.8 .. 0.8): 265 epochs.
  • The above experiment with Nguyen-Widrow weight
    initialization: 125 epochs.

20
Variations
  • Momentum: Δwjk(t+1) = α · error(yj) · zk
    + μ · Δwjk(t); Δvij(t+1) is similar
  • μ is in 0.0 .. 1.0
  • With momentum, the previous experiment takes 38
    epochs.
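
  Plugged into the per-pair loop sketched earlier (mu is an
  illustrative value; dW and dV start as zero arrays shaped like W
  and V):

    mu = 0.9
    dW = alpha * np.outer(z1, err_y) + mu * dW   # step 8 with momentum
    dV = alpha * np.outer(x1, err_z) + mu * dV
    W += dW
    V += dV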

21
Variations
  • Batch update the weights to smooth the changes.
  • Adapt the learning rate. For example, in the
    delta-bar-delta procedure each weight has its own
    learning rate that varies over time.
  • Two consecutive weight changes in the same
    direction increase that weight's learning rate; a
    change in direction decreases it.
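
  A simplified per-weight adaptive-rate sketch in the spirit of
  delta-bar-delta, reusing names from the earlier loop; the constants,
  and comparing against the raw previous gradient rather than its
  exponential average, are simplifications:

    kappa, phi = 0.01, 0.9            # additive increase, multiplicative decrease
    rates = np.full_like(W, 0.1)      # one learning rate per weight
    prev_grad = np.zeros_like(W)

    grad = np.outer(z1, err_y)        # per-pair gradient-like term for W
    same_sign = grad * prev_grad > 0
    rates[same_sign] += kappa         # two changes in the same direction
    rates[~same_sign] *= phi          # direction flipped (or zero): decrease
    W += rates * grad
    prev_grad = grad.copy()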

22
Variations
  • Alternate Activation Functions
  • Strictly Local Backpropagation
  • makes the algorithm more biologically plausible
    by making all computations local
  • cortical units sum their inputs
  • synaptic units apply an activation function
  • thalamic units compute errors
  • equivalent to standard backpropagation

23
Variations
  • Strictly Local Backpropagation: input cortical
    layer → input synaptic layer → hidden cortical
    layer → hidden synaptic layer → output cortical
    layer → output synaptic layer → output thalamic
    layer
  • Number of Hidden Layers

24
Hecht-Nielsen Theorem
  • Given any continuous function f : I^n → R^m, where
    I is the interval [0, 1], f can be represented
    exactly by a feedforward network having n input
    units, 2n + 1 hidden units, and m output units.