1
Introduction to Neural Networks
  • John Paxton
  • Montana State University
  • Summer 2003

2
Chapter 6 Backpropagation
  • 1986 Rumelhart, Hinton, Williams
  • Gradient descent method that minimizes the total
    squared error of the output.
  • Applicable to multilayer, feedforward, supervised
    neural networks.
  • Revitalizes interest in neural networks!

3
Backpropagation
  • Appropriate for any domain where inputs must be
    mapped onto outputs.
  • 1 hidden layer is sufficient to learn any
    continuous mapping to arbitrary accuracy!
  • Memorization versus generalization tradeoff.

4
Architecture
  • input layer, hidden layer, output layer

  [Figure: feedforward architecture. Input units x1 .. xn (plus a bias
  unit 1) feed hidden units z1 .. zp through weights v (e.g., vnp);
  hidden units (plus a bias unit 1) feed output units y1 .. ym through
  weights w (e.g., wpm).]
5
General Process
  • Feed the input signals forward.
  • Backpropagate the error.
  • Adjust the weights.

6
Activation Function Characteristics
  • Continuous.
  • Differentiable.
  • Monotonically nondecreasing.
  • Easy to compute.
  • Saturates (reaches limits).

7
Activation Functions
  • Binary sigmoid: f(x) = 1 / (1 + e^-x)
    f '(x) = f(x) · [1 - f(x)]
  • Bipolar sigmoid: f(x) = -1 + 2 / (1 + e^-x)
    f '(x) = 0.5 · [1 + f(x)] · [1 - f(x)]
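
  The two functions and their derivatives, as a minimal NumPy sketch
  (function names are illustrative):

    import numpy as np

    def binary_sigmoid(x):
        # f(x) = 1 / (1 + e^-x), range (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def binary_sigmoid_prime(x):
        f = binary_sigmoid(x)
        return f * (1.0 - f)

    def bipolar_sigmoid(x):
        # f(x) = -1 + 2 / (1 + e^-x), range (-1, 1)
        return -1.0 + 2.0 / (1.0 + np.exp(-x))

    def bipolar_sigmoid_prime(x):
        f = bipolar_sigmoid(x)
        return 0.5 * (1.0 + f) * (1.0 - f)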

8
Training Algorithm
  • 1. initialize weights to small random values,
    for example -0.5 .. 0.5
  • 2. while the stopping condition is false, do
    steps 3-8
  • 3. for each training pair do steps 4-8

9
Training Algorithm
  • 4. z_in,j = Σi (xi · vij)
       zj = f(z_in,j)
  • 5. y_in,j = Σi (zi · wij)
       yj = f(y_in,j)
  • 6. error(yj) = (tj - yj) · f '(y_in,j)
       tj is the target value
  • 7. error(zk) = [ Σj error(yj) · wkj ] · f '(z_in,k)

10
Training Algorithm
  • 8. wkj(new) = wkj(old) + α · error(yj) · zk
       vkj(new) = vkj(old) + α · error(zj) · xk
  • α is the learning rate
  • An epoch is one cycle through the training
    vectors.
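
  Steps 1-8 as one compact NumPy sketch (array names, shapes, and the
  learning rate are illustrative; bias units are handled by appending a
  constant 1 to the input and hidden vectors):

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):                                # bipolar sigmoid
        return -1.0 + 2.0 / (1.0 + np.exp(-x))

    def f_prime(x):
        fx = f(x)
        return 0.5 * (1.0 + fx) * (1.0 - fx)

    n, p, m = 2, 4, 1                        # input, hidden, output units
    # Step 1: small random weights, e.g. in -0.5 .. 0.5.
    V = rng.uniform(-0.5, 0.5, (n + 1, p))   # +1 row for the bias unit
    W = rng.uniform(-0.5, 0.5, (p + 1, m))

    alpha = 0.2
    X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], float)  # XOR inputs
    T = np.array([[-1], [1], [1], [-1]], float)                # targets

    for epoch in range(1000):                # step 2: stopping condition
        for x, t in zip(X, T):               # step 3: each training pair
            x1 = np.append(x, 1.0)           # bias input
            z_in = x1 @ V                    # step 4
            z1 = np.append(f(z_in), 1.0)     # bias hidden unit
            y_in = z1 @ W                    # step 5
            y = f(y_in)
            err_y = (t - y) * f_prime(y_in)             # step 6
            err_z = (err_y @ W[:p].T) * f_prime(z_in)   # step 7
            W += alpha * np.outer(z1, err_y)            # step 8
            V += alpha * np.outer(x1, err_z)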

11
Choices
  • Initial Weights
  • random values in -0.5 .. 0.5; we don't want the
    derivative to be 0
  • Nguyen-Widrow: β = 0.7 · p^(1/n), where n = number
    of input units and p = number of hidden units
    vij = β · vij(random) / || vj(random) ||
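
  A sketch of this initialization, assuming NumPy; each hidden unit's
  incoming weight vector is rescaled to length β, and bias weights are
  commonly drawn from -β .. β:

    import numpy as np

    def nguyen_widrow(n, p, rng):
        # beta = 0.7 * p^(1/n)
        beta = 0.7 * p ** (1.0 / n)
        v = rng.uniform(-0.5, 0.5, (n, p))
        # Rescale each hidden unit's column to length beta.
        v = beta * v / np.linalg.norm(v, axis=0)
        bias = rng.uniform(-beta, beta, p)
        return v, bias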

12
Choices
  • Stopping Condition (avoid overtraining!)
  • Set aside some of the training pairs as a
    validation set.
  • Stop training when the error on the validation
    set stops decreasing.
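
  A minimal early-stopping loop; train_one_epoch and validation_error
  are hypothetical helpers, and the patience threshold is an assumption
  (the slide only says to stop when the error stops decreasing):

    best_error, patience, bad_epochs = float("inf"), 5, 0
    while bad_epochs < patience:
        train_one_epoch()              # steps 3-8 over the training pairs
        error = validation_error()     # error on the held-out pairs
        if error < best_error:
            best_error, bad_epochs = error, 0
        else:
            bad_epochs += 1            # error stopped decreasing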

13
Choices
  • Number of Training Pairs
  • rule of thumb: number of pairs ≈ total number of
    weights / desired average error on the test set
  • where the average error on the training pairs is
    driven to half of the desired test-set average
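
  For example (an illustrative calculation): a 2-4-1 network with bias
  units has (2+1)·4 + (4+1)·1 = 17 weights, so a desired average
  test-set error of 0.1 suggests roughly 17 / 0.1 = 170 training pairs,
  trained until the average error on the training pairs falls to 0.05.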

14
Choices
  • Data Representation
  • Bipolar is better than binary because units with
    a 0 activation don't learn (their weight updates
    are multiplied by 0).
  • Discrete values (e.g. red, green, blue)?
  • Continuous values (e.g. 15.0 .. 35.0)?
  • Number of Hidden Layers
  • 1 is sufficient
  • Sometimes multiple layers can speed up learning
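
  One common encoding sketch (the specific choices here are assumptions,
  not from the slides): a discrete value gets one bipolar unit per
  category, and a continuous value is rescaled into -1 .. 1:

    def encode_color(color):
        # One bipolar unit per category: red, green, blue.
        return [1.0 if color == c else -1.0
                for c in ("red", "green", "blue")]

    def encode_temperature(t, lo=15.0, hi=35.0):
        # Map a continuous value from lo .. hi into -1 .. 1.
        return 2.0 * (t - lo) / (hi - lo) - 1.0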

15
Example
  • XOR.
  • Bipolar data representation.
  • Bipolar sigmoid activation function.
  • α = 1
  • 3 input units, 5 hidden units, 1 output unit
    (these counts include the bias units)
  • Initial weights are all 0.
  • Training example (1, -1). Target: 1.

16
Example
  • 4. z1 = f(1·0 + 1·0 + (-1)·0) = f(0) = 0
       z2 = z3 = z4 = 0
  • 5. y1 = f(1·0 + 0·0 + 0·0 + 0·0 + 0·0) = f(0) = 0
  • 6. error(y1) = (1 - 0) · 0.5 · (1 + 0) · (1 - 0)
       = 0.5
  • 7. error(z1) = 0 · f '(z_in,1) = 0 = error(z2) =
       error(z3) = error(z4)

17
Example
  • 8. w01(new) = w01(old) + α · error(y1) · z0
       = 0 + 1 · 0.5 · 1 = 0.5
  • v21(new) = v21(old) + α · error(z1) · x2
       = 0 + 1 · 0 · (-1) = 0.

18
Exercise
  • Draw the updated neural network.
  • Present (1, -1) as an example to classify. How is
    it classified now?
  • If learning were to occur, how would the network
    weights change this time?

19
XOR Experiments
  • Binary activation / binary representation: 3000
    epochs.
  • Bipolar activation / bipolar representation: 400
    epochs.
  • Bipolar activation / modified bipolar
    representation (-0.8 .. 0.8): 265 epochs.
  • The above experiment with Nguyen-Widrow weight
    initialization: 125 epochs.

20
Variations
  • Momentum: Δwjk(t+1) = α · error(yj) · zk
    + μ · Δwjk(t); Δvij(t+1) is similar
  • μ is in 0.0 .. 1.0
  • With momentum, the previous experiment takes 38
    epochs.
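
  Plugged into the per-pair loop sketched earlier (mu is an
  illustrative value; dW and dV start as zero arrays shaped like W
  and V):

    mu = 0.9
    dW = alpha * np.outer(z1, err_y) + mu * dW   # step 8 with momentum
    dV = alpha * np.outer(x1, err_z) + mu * dV
    W += dW
    V += dV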

21
Variations
  • Batch update the weights to smooth the changes.
  • Adapt the learning rate. For example, in the
    delta-bar-delta procedure each weight has its own
    learning rate that varies over time.
  • Two consecutive weight changes in the same
    direction increase that weight's learning rate; a
    change in direction decreases it.
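
  A simplified per-weight adaptive-rate sketch in the spirit of
  delta-bar-delta, reusing names from the earlier loop; the constants,
  and comparing against the raw previous gradient rather than its
  exponential average, are simplifications:

    kappa, phi = 0.01, 0.9            # additive increase, multiplicative decrease
    rates = np.full_like(W, 0.1)      # one learning rate per weight
    prev_grad = np.zeros_like(W)

    grad = np.outer(z1, err_y)        # per-pair gradient-like term for W
    same_sign = grad * prev_grad > 0
    rates[same_sign] += kappa         # two changes in the same direction
    rates[~same_sign] *= phi          # direction flipped (or zero): decrease
    W += rates * grad
    prev_grad = grad.copy()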

22
Variations
  • Alternate Activation Functions
  • Strictly Local Backpropagation
  • makes the algorithm more biologically plausible
    by making all computations local
  • cortical units sum their inputs
  • synaptic units apply an activation function
  • thalamic units compute errors
  • equivalent to standard backpropagation

23
Variations
  • Strictly Local Backpropagation: input cortical
    layer → input synaptic layer → hidden cortical
    layer → hidden synaptic layer → output cortical
    layer → output synaptic layer → output thalamic
    layer
  • Number of Hidden Layers

24
Hecht-Nielsen Theorem
  • Given any continuous function f : I^n → R^m, where
    I is the interval [0, 1], f can be represented
    exactly by a feedforward network having n input
    units, 2n + 1 hidden units, and m output units.