Title: Neural Networks
1 Neural Networks
2 WHY ARTIFICIAL NEURAL NETWORKS?
- Characteristics of the human brain that are not present in von Neumann or modern parallel computers include
- massive parallelism,
- distributed representation and computation,
- learning ability,
- generalization ability,
- adaptivity,
- inherent contextual information processing,
- fault tolerance, and
- low energy consumption.
- It is hoped that devices based on biological
neural networks will possess some of these
desirable characteristics.
4 ANNs
- Inspired by biological neural networks, ANNs are massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections.
- ANN models attempt to use some organizational principles believed to be used in the human brain.
5 Brief historical review
- ANN research has experienced three periods of extensive activity.
- The first peak, in the 1940s, was due to McCulloch and Pitts' pioneering model of the neuron.
- The second occurred in the 1960s with Rosenblatt's perceptron convergence theorem and Minsky and Papert's work showing the limitations of a simple perceptron. Minsky and Papert's results dampened the enthusiasm of most researchers, a lull that lasted almost 20 years.
- Since the early 1980s, ANNs have received considerable renewed interest. The major developments include
- Hopfield's energy approach in 1982, and
- the back-propagation learning algorithm for multilayer perceptrons (multilayer feed-forward networks), first proposed by Werbos and then popularized by Rumelhart et al. in 1986.
6 Biological neural networks
- A neuron (or nerve cell) is a special biological cell that processes information. It is composed of a cell body, or soma, and two types of out-reaching tree-like branches: the axon and the dendrites.
7 Biological neural networks (cont.)
- A neuron receives signals (impulses) from other neurons through its dendrites (receivers) and transmits signals generated by its cell body along the axon (transmitter), which eventually branches into strands and substrands.
- At the terminals of these strands are the synapses.
- A synapse is an elementary structural and functional unit between two neurons (an axon strand of one neuron and a dendrite of another).
8 Biological neural networks (cont.)
- The human brain contains about 10^11 neurons, which is approximately the number of stars in the Milky Way.
- Neurons are massively connected, much more complex and dense than telephone networks.
- Each neuron is connected to 10^3 to 10^4 other neurons.
- In total, the human brain contains approximately 10^14 to 10^15 interconnections.
9 Biological neural networks (cont.)
- Complex perceptual decisions such as face recognition are typically made by humans within a few hundred milliseconds.
- These decisions are made by a network of neurons whose operational speed is only a few milliseconds. This implies that
- the computations cannot take more than about 100 serial stages, and
- the brain runs parallel programs that are about 100 steps long for such perceptual tasks. This is known as the hundred-step rule.
10 Computational models of neurons
- This mathematical neuron computes a weighted sum of its n input signals x_j, j = 1, 2, . . . , n.
- It generates an output of 1 if this sum exceeds a certain threshold u. Otherwise, an output of 0 results.
11 - Mathematically,

    y = θ( Σ_{j=1}^{n} w_j x_j − u )

- θ(·) is the unit step function, and w_j is the synapse weight associated with the jth input.
- For simplicity of notation, we often consider the threshold u as another weight w_0 = −u attached to the neuron with a constant input x_0 = 1.
12 Activation Functions
13 The Sigmoid
- The standard sigmoid function is the logistic function, defined by

    g(v) = 1 / (1 + e^(−βv))

  where β is the slope parameter.
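A minimal Python sketch of this function, assuming the β notation above:

    import math

    def sigmoid(v, beta=1.0):
        """Logistic function: squashes any real v into (0, 1); beta sets the steepness."""
        return 1.0 / (1.0 + math.exp(-beta * v))

    print(sigmoid(0.0))            # 0.5 at the origin
    print(sigmoid(2.0, beta=5.0))  # closer to 1 with a steeper slope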
14 Network architectures
- ANNs can be viewed as weighted directed graphs in which artificial neurons are nodes and directed edges (with weights) are connections between neuron outputs and neuron inputs.
- Based on this connection pattern, ANNs fall into two categories:
- feed-forward networks, in which the graphs have no loops, and
- recurrent (or feedback) networks, in which loops occur because of feedback connections.
15 Network architectures
Different connectivities yield different network behaviors.
16 Network architectures
- Feed-forward networks are
- static, that is, they produce only one set of output values rather than a sequence of values from a given input, and
- memory-less in the sense that their response to an input is independent of the previous network state.
- Recurrent, or feedback, networks are dynamic systems.
- When a new input pattern is presented, the neuron outputs are computed. Because of the feedback paths, the inputs to each neuron are then modified, which leads the network to enter a new state.
- Different network architectures require appropriate learning algorithms.
17 Learning
- A learning process in the ANN context can be viewed as the problem of updating network architecture and connection weights so that a network can efficiently perform a specific task.
- The network usually must learn the connection weights from available training patterns.
- Performance is improved over time by iteratively updating the weights in the network.
18 Learning
- ANNs' ability to automatically learn from examples makes them attractive and exciting.
- ANNs appear to learn underlying rules (like input-output relationships) from the given collection of representative examples.
- This is one of the major advantages of neural networks over traditional expert systems.
19 Learning algorithm
- To understand or design a learning process, you must have
- a learning paradigm: a model of the environment in which a neural network operates, i.e., you must know what information is available to the network, and
- learning rules: you must understand how network weights are updated, i.e., which learning rules govern the updating process.
- A learning algorithm refers to a procedure in which learning rules are used for adjusting the weights.
20 Learning paradigms
- Supervised learning: The network is provided with a correct answer (output) for every input pattern (learning with a teacher).
- Weights are determined to allow the network to produce answers as close as possible to the known correct answers.
- Reinforcement learning is a variant of supervised learning in which the network is provided with only a critique on the correctness of network outputs, not the correct answers themselves.
- Unsupervised learning: The network explores the underlying structure in the data, or correlations between patterns in the data, and organizes patterns into categories from these correlations (learning without a teacher).
- Hybrid learning: Combines supervised and unsupervised learning; part of the weights are usually determined through supervised learning, while the others are obtained through unsupervised learning.
21 Learning theory
- Learning theory must address three fundamental and practical issues associated with learning from samples: capacity, sample complexity, and computational complexity.
- Capacity: how many patterns can be stored, and what functions and decision boundaries a network can form.
- Sample complexity: determines the number of training patterns needed to train the network to guarantee a valid generalization.
- Too few patterns may cause over-fitting (wherein the network performs well on the training data set, but poorly on independent test patterns drawn from the same distribution as the training patterns).
- Computational complexity: refers to the time required for a learning algorithm to estimate a solution from training patterns.
- Many existing learning algorithms have high computational complexity.
22 Learning rules
- There are four basic types of learning rules: error-correction, Boltzmann, Hebbian, and competitive learning.
- ERROR-CORRECTION RULES: During the learning process, the actual output y generated by the network may not equal the desired output d.
- The basic principle of error-correction learning rules is to use the error signal (d − y) to modify the connection weights so as to gradually reduce this error.
- The perceptron learning rule is based on this error-correction principle.
- A perceptron consists of a single neuron with adjustable weights w_j, j = 1, 2, . . . , n, and threshold u (threshold function).
23 ERROR-CORRECTION RULES
- Given an input vector x = (x_1, x_2, . . . , x_n)^t, the net input to the neuron is

    v = Σ_{j=1}^{n} w_j x_j − u

- The output y of the perceptron is 1 if v > 0, and 0 otherwise.
- In a two-class classification problem, the perceptron assigns an input pattern to one class if y = 1, and to the other class if y = 0.
- The linear equation Σ_{j=1}^{n} w_j x_j − u = 0 defines the decision boundary (a hyperplane in the n-dimensional input space) that halves the space.
24 Perceptron learning algorithm
- Randomly initialize the weights and threshold w_1, w_2, . . . , w_n.
- Present an input vector x = (x_1, x_2, . . . , x_n)^t and evaluate the output of the neuron.
- Update the weights according to

    w_j(t + 1) = w_j(t) + η (d − y) x_j

  where d is the desired output, t is the iteration number, and η is the gain (step size), 0.0 < η < 1.0.
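A minimal Python sketch of this procedure. The AND training set, the bias-as-weight trick, and the fixed epoch count are illustrative choices, not prescribed by the slides.

    import random

    def train_perceptron(patterns, n, eta=0.1, epochs=100):
        """patterns: list of (x, d) pairs, x a length-n tuple, d in {0, 1}."""
        w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]  # w[0] plays the role of -u
        for _ in range(epochs):
            for x, d in patterns:
                v = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
                y = 1 if v > 0 else 0
                # Weights change only when the perceptron makes an error (d != y)
                w[0] += eta * (d - y)
                for j in range(n):
                    w[j + 1] += eta * (d - y) * x[j]
        return w

    # Example: learning the linearly separable AND function
    and_patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    print(train_perceptron(and_patterns, n=2))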
25 Perceptron learning algorithm
- Note that learning occurs only when the perceptron makes an error.
- The perceptron convergence theorem: Rosenblatt proved that when training patterns are drawn from two linearly separable classes, the perceptron learning procedure converges after a finite number of iterations.
- In practice, you do not know whether the patterns are linearly separable.
- Many variations of this learning algorithm have been proposed in the literature.
- Other activation functions that lead to different learning characteristics can also be used.
- The back-propagation learning algorithm is based on this error-correction principle.
26 Perceptrons and Boolean Functions
- If inputs are all 0s and 1s and outputs are all 0s and 1s:
- A perceptron can learn the function x1 ∧ x2 (AND).
- A perceptron can learn the function x1 ∨ x2 (OR).
27 Perceptrons and Boolean Functions
- What about the exclusive-or function?
- f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2)
28 XOR problem
- Desired: make an ANN which will produce Y = X1 xor X2 on inputs X1 and X2.
- Problem: there is no single line that can cut the X1-X2 space into two proper regions; therefore, a single-layer neural net cannot be used.
- Solution: use a multilayer network, as sketched below.
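A minimal sketch of the multilayer solution: three threshold neurons with hand-picked weights (illustrative values, not from the slides) compute XOR as AND(OR(x1, x2), NAND(x1, x2)).

    def step(v):
        return 1 if v > 0 else 0

    def xor_net(x1, x2):
        h1 = step(x1 + x2 - 0.5)      # hidden unit 1: OR
        h2 = step(-x1 - x2 + 1.5)     # hidden unit 2: NAND
        return step(h1 + h2 - 1.5)    # output unit: AND of the hidden units

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))  # prints the XOR truth table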
29 HEBBIAN RULE
- The oldest learning rule is Hebb's postulate of learning. Hebb based it on the following observation from neurobiological experiments:
- If neurons on both sides of a synapse are activated synchronously and repeatedly, the synapse's strength is selectively increased.
- Mathematically, the Hebbian rule can be described as

    w_ij(t + 1) = w_ij(t) + η y_j(t) x_i(t)

  where x_i and y_j are the output values of neurons i and j, respectively, which are connected by the synapse w_ij, and η is the learning rate. Note that x_i is the input to the synapse.
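A minimal sketch of one Hebbian step for a single synapse, following the rule above: the weight grows only when pre- and post-synaptic activity co-occur. The activity values are illustrative.

    def hebbian_update(w_ij, x_i, y_j, eta=0.01):
        """Return the new synapse weight after one Hebbian step."""
        return w_ij + eta * y_j * x_i

    w = 0.0
    for x, y in [(1.0, 1.0), (1.0, 1.0), (0.0, 1.0)]:
        w = hebbian_update(w, x, y)
    print(w)  # strengthened only by the two correlated presentations -> 0.02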
30 HEBBIAN RULE
- An important property of this rule is that learning is done locally, i.e., the change in synapse weight depends only on the activities of the two neurons connected by it.
- This significantly simplifies the complexity of the learning circuit in a VLSI implementation.
31 HEBBIAN RULE
- A single neuron trained using the Hebbian rule exhibits orientation selectivity.
- The points depicted are drawn from a two-dimensional Gaussian distribution and used for training a neuron.
- The weight vector of the neuron is initialized to w0.
- As the learning proceeds, the weight vector moves progressively closer to the direction w of maximal variance in the data, where w is the eigenvector of the covariance matrix of the data corresponding to the largest eigenvalue.
32 BOLTZMANN LEARNING
- The Boltzmann machine was named by its inventors in honour of the 19th-century scientist Ludwig Boltzmann.
- Boltzmann machines are symmetric recurrent networks consisting of binary units (+1 for on and -1 for off).
- Symmetric means that the weight on the connection from unit i to unit j is equal to the weight on the connection from unit j to unit i.
- A subset of the neurons, called visible, interact with the environment; the rest, called hidden, do not.
- Each neuron is a stochastic unit that generates an output (or state) according to the Boltzmann distribution of statistical mechanics.
33 - Boltzmann machines operate in two modes:
- Clamped: visible neurons are clamped onto specific states determined by the environment, and
- Free-running: both visible and hidden neurons are allowed to operate freely. The hidden neurons always operate freely.
- K is the number of visible neurons; L is the number of hidden neurons.
34 BOLTZMANN LEARNING
- Boltzmann learning is a stochastic learning rule derived from information-theoretic and thermodynamic principles.
- The objective of Boltzmann learning is to adjust the connection weights so that the states of the visible units satisfy a particular desired probability distribution.
- According to the Boltzmann learning rule, the change in the connection weight w_ij is given by

    Δw_ij = η (ρ̄_ij − ρ_ij)

  where η is the learning rate, and ρ̄_ij and ρ_ij are the correlations between the states of units i and j when the network operates in the clamped mode and free-running mode, respectively.
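A minimal sketch of this weight update, assuming the correlations have already been estimated (e.g., by the simulated annealing runs described next); estimating them is the expensive part and is omitted here. The numeric values are illustrative.

    def boltzmann_update(w, rho_clamped, rho_free, eta=0.1):
        """w, rho_clamped, rho_free: dicts keyed by unit pair (i, j)."""
        return {pair: w[pair] + eta * (rho_clamped[pair] - rho_free[pair])
                for pair in w}

    w = {(0, 1): 0.2}
    print(boltzmann_update(w, {(0, 1): 0.8}, {(0, 1): 0.5}))  # -> {(0, 1): 0.23}, up to float rounding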
35 Summary of the Boltzmann Machine Learning Procedure
- 1. Initialization: set the weights to random numbers in [-1, 1].
- 2. Clamping Phase: Present the net with the mapping it is supposed to learn by clamping the input and output units to patterns. For each pattern, perform simulated annealing on the hidden units at a sequence T0, T1, ..., T_final of temperatures. At the final temperature, collect statistics to estimate the correlations ρ̄_ij.
36 Summary of the Boltzmann Machine Learning Procedure
- 3. Free-Running Phase: Repeat the calculations performed in step 2, but this time clamp only the input units. Hence, at the final temperature, estimate the correlations ρ_ij.
- 4. Updating of Weights: update them using the learning rule

    Δw_ji = η (ρ̄_ji − ρ_ji)

  where η is a learning rate parameter.
37 Summary of the Boltzmann Machine Learning Procedure
- 5. Iterate until Convergence: Iterate steps 2 to 4 until the learning procedure converges, with no more changes taking place in the synaptic weights w_ji for all j, i.
39 Alternative Boltzmann Architecture
- Alternatively, the visible units may be viewed as divided into input and output units.
- In this case the Boltzmann machine performs association under the supervision of a teacher, with the input units receiving information from the environment and the output units reporting the outcome for that input pattern.
40 Boltzmann vs Hopfield
- Similarities:
- 1. Processing units have binary states (±1).
- 2. Connections between units are symmetric.
- 3. Units are picked at random and one at a time for updating.
- 4. Units have no self-feedback.
- Differences:
- 1. The Boltzmann machine permits the use of hidden neurons.
- 2. The Boltzmann machine uses stochastic neurons with a probabilistic firing mechanism, whereas the standard Hopfield net uses neurons based on the McCulloch-Pitts model with a deterministic firing mechanism.
- 3. The Boltzmann machine may also be trained by a probabilistic form of supervision.
41 COMPETITIVE LEARNING RULES
- In competitive learning, output units compete among themselves for activation. As a result, only one output unit is active at any given time. This phenomenon is known as winner-take-all.
- Competitive learning has been found to exist in biological neural networks.
- Competitive learning often clusters or categorizes the input data. Similar patterns are grouped by the network and represented by a single unit. This grouping is done automatically based on data correlations.
42 COMPETITIVE LEARNING RULES
- The simplest competitive learning network consists of a single layer of output units.
- Each output unit i in the network connects to all the input units (the x_j's) via weights w_ij, j = 1, 2, . . . , n.
- Each output unit also connects to all other output units via inhibitory weights, but has a self-feedback with an excitatory weight.
43 COMPETITIVE LEARNING RULES
- A simple competitive learning rule can be stated as

    Δw_ij = η (x_j − w_ij)   if unit i wins the competition,
    Δw_ij = 0                otherwise

  (a code sketch of this update follows below).
- Note that only the weights of the winner unit get updated.
- The effect of this learning rule is to move the stored pattern in the winner unit (weights) a little bit closer to the input pattern.
- Assume that all input vectors have been normalized to have unit length.
- The weight vectors of the three units are randomly initialized. Their initial and final positions on the sphere after competitive learning are marked as Xs.
- Each of the three natural groups (clusters) of patterns has been discovered by an output unit whose weight vector points to the center of gravity of the discovered group.
44 COMPETITIVE LEARNING RULES
- You can see from the competitive learning rule that the network will not stop learning (updating weights) unless the learning rate η is 0.
- A particular input pattern can fire different output units at different iterations during learning.
- The system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations.
- One way to achieve stability is to force the learning rate to decrease gradually towards 0 as the learning process proceeds. However, this artificial freezing of learning causes another problem, the loss of plasticity, which is the ability to adapt to new data. This is known as Grossberg's stability-plasticity dilemma in competitive learning.
45 COMPETITIVE LEARNING RULES
- The most well-known example of competitive learning is vector quantization for data compression.
- It has been widely used in speech and image processing for efficient storage, transmission, and modeling.
- Its goal is to represent a set or distribution of input vectors with a relatively small number of prototype vectors (weight vectors), or a codebook. Once a codebook has been constructed and agreed upon by both the transmitter and the receiver, you need only transmit or store the index of the prototype corresponding to the input vector.
- Given an input vector, its corresponding prototype can be found by searching for the nearest prototype in the codebook.
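A minimal sketch of this encode/decode scheme with a fixed codebook (illustrative values): the transmitter sends only the index of the nearest prototype, and the receiver decodes it by table lookup.

    def encode(x, codebook):
        """Index of the nearest prototype (squared Euclidean distance)."""
        return min(range(len(codebook)),
                   key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)))

    def decode(index, codebook):
        return codebook[index]

    codebook = [[0.0, 0.0], [1.0, 1.0], [1.0, -1.0]]
    idx = encode([0.9, 0.8], codebook)
    print(idx, decode(idx, codebook))  # 1 [1.0, 1.0]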
46 Well-known learning algorithms
47 Well-known learning algorithms
48 SUMMARY
- Learning rules based on error-correction can be used for training feed-forward networks.
- Hebbian learning rules have been used for all types of network architectures.
- Each learning algorithm is designed for training a specific architecture.
- When we discuss a learning algorithm, a particular network architecture association is implied.
- Each algorithm can perform only a few tasks well.
- Other algorithms include Adaline, Madaline, linear discriminant analysis, Sammon's projection, and principal component analysis.
49 Multilayer Networks
- The class of functions representable by perceptrons is limited.
- The output of a multilayer network, by contrast, is a nonlinear function of a linear combination of nonlinear functions of linear combinations of inputs.
50 A 1-HIDDEN LAYER NET
[Figure: a network with NINPUTS = 2 inputs (x1, x2) and NHIDDEN = 3 hidden units; first-layer weights w11, w21, w31, w12, w22, w32 connect the inputs to the hidden units, and output weights w1, w2, w3 connect the hidden units to the output.]
51 OTHER NEURAL NETS
52 Multilayer perceptron
- The most popular class of multilayer feed-forward networks is multilayer perceptrons.
- Each computational unit employs either the thresholding function or the sigmoid function.
- Multilayer perceptrons can form arbitrarily complex decision boundaries and represent any Boolean function.
- The development of the back-propagation learning algorithm for determining weights in a multilayer perceptron has made these networks the most popular among researchers and users of neural networks.
53 Multilayer perceptron
- We denote by w_ij(l) the weight on the connection from the ith unit in layer (l-1) to the jth unit in layer l.
- Let {(x(1), d(1)), (x(2), d(2)), . . . , (x(p), d(p))} be a set of p training patterns (input-output pairs), where x(i) ∈ R^n is the input vector in the n-dimensional pattern space, and d(i) ∈ [0, 1]^m, an m-dimensional hypercube.
- For classification purposes, m is the number of classes. The squared-error cost function most frequently used in the ANN literature is defined as

    E = (1/2) Σ_{i=1}^{p} || d(i) − y(i) ||^2

  where y(i) is the actual network output for pattern i.
54 Back-propagation
- The back-propagation algorithm is a
gradient-descent method to minimize the
squared-error cost function E.
55 GRADIENT DESCENT
- Suppose we have a scalar function E(w).
- We want to find a local minimum.
- Assume our current weight is w.
- GRADIENT DESCENT RULE:

    w ← w − η ∂E/∂w

- η is called the LEARNING RATE: a small positive number, e.g. η = 0.05.
56 Gradient Descent in m Dimensions
- The gradient ∇E(w) points in the direction of steepest ascent.
- GRADIENT DESCENT RULE:

    w ← w − η ∇E(w)

- Equivalently, w_j ← w_j − η ∂E/∂w_j, where w_j is the jth weight: just like a linear feedback system.
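A minimal sketch of the rule on a toy one-dimensional cost E(w) = (w − 3)^2, whose gradient is 2(w − 3); the cost function, starting point, and step count are illustrative.

    def gradient_descent(grad, w, eta=0.05, steps=200):
        for _ in range(steps):
            w = w - eta * grad(w)   # move against the gradient (steepest descent)
        return w

    print(gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0))  # converges near 3.0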
57 A RULE KNOWN BY MANY NAMES
The Widrow-Hoff rule
The LMS rule
The delta rule
The adaline rule
Classical conditioning
58 Back-propagation algorithm
- 1. Initialize the weights to small random values.
- 2. Randomly choose an input pattern x(u).
- 3. Propagate the signal forward through the network.
- 4. Compute δ_i^L in the output layer (o_i = y_i^L):

    δ_i^L = g'(h_i^L) (d_i^u − y_i^L)

  where h_i^l represents the net input to the ith unit in the lth layer, and g' is the derivative of the activation function g.
- 5. Compute the deltas for the preceding layers by propagating the errors backwards:

    δ_i^l = g'(h_i^l) Σ_j w_ij^(l+1) δ_j^(l+1),   for l = (L−1), . . . , 1.
- 6. Update the weights using

    Δw_ji^l = η δ_i^l y_j^(l−1)
- 7. Go to step 2 and repeat for the next pattern until the error in the output layer is acceptably low, or a prespecified number of iterations is reached.
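A minimal end-to-end sketch of steps 1-7 for a 2-2-1 sigmoid network trained on XOR. The architecture, learning rate, seed, and iteration count are illustrative; a different seed may need more iterations or can stall in a local minimum.

    import math, random

    def g(h):                 # logistic activation
        return 1.0 / (1.0 + math.exp(-h))

    def g_prime_from_y(y):    # for the logistic, g'(h) = y (1 - y)
        return y * (1.0 - y)

    random.seed(0)
    W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden rows: [bias, x1, x2]
    W2 = [random.uniform(-1, 1) for _ in range(3)]                      # output: [bias, h1, h2]
    eta = 0.5
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    for _ in range(20000):
        x, d = random.choice(data)                                     # step 2
        h = [g(w[0] + w[1] * x[0] + w[2] * x[1]) for w in W1]          # step 3
        y = g(W2[0] + W2[1] * h[0] + W2[2] * h[1])
        d_out = g_prime_from_y(y) * (d - y)                            # step 4
        d_hid = [g_prime_from_y(h[i]) * W2[i + 1] * d_out for i in range(2)]  # step 5
        W2 = [W2[0] + eta * d_out,                                     # step 6
              W2[1] + eta * d_out * h[0],
              W2[2] + eta * d_out * h[1]]
        for i in range(2):
            W1[i][0] += eta * d_hid[i]
            W1[i][1] += eta * d_hid[i] * x[0]
            W1[i][2] += eta * d_hid[i] * x[1]

    for x, d in data:                                                  # step 7 check
        h = [g(w[0] + w[1] * x[0] + w[2] * x[1]) for w in W1]
        print(x, d, round(g(W2[0] + W2[1] * h[0] + W2[2] * h[1]), 2))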
59 Backpropagation algorithm (instance-based)
- 1. Randomize the weights w to small random values (both positive and negative) to ensure that the network is not saturated by large weight values.
- 2. Select an instance t, that is, the vector x_k(t), k = 1, ..., N_inp (a pair of input and output patterns), from the training set.
- 3. Apply the network input vector to the network input.
- 4. Calculate the network output vector z_k(t), k = 1, ..., N_out.
- 5. Calculate the errors for each of the outputs k, k = 1, ..., N_out: the difference between the desired output and the network output (for simplicity we will denote it simply as E).
- 6. Calculate the necessary updates for the weights Δw in a way that minimizes this error (discussed below).
- 7. Adjust the weights of the network by Δw.
- 8. Repeat steps 2-6 for each instance (pair of input-output vectors) in the training set until the error for the entire system (the error E defined above, or the error on a cross-validation set) is acceptably low, or the pre-defined number of iterations is reached.
60 Backpropagation algorithm
- Often it is reasonable not to update the weights immediately after processing each instance, but to accumulate (sum up) the necessary changes across a subset of training instances (called an epoch) and only then update the weights. This allows for faster convergence (Smith 1993).
- An epoch can be part of, or the whole of, the training set. After the whole training set is processed (this sequence of steps is called an iteration), the whole process is repeated again in an iterative fashion until the total error is acceptably low.
- The number of such iterations may sometimes be as high as several thousand.
61 Backpropagation algorithm (epoch-based, with cumulative updates)
- 1-6: as above.
- 7. Add up the calculated weight updates Δw to the accumulated total updates ΔW.
- 8. Repeat steps 2-7 for the several instances comprising an epoch.
- 9. Adjust the weights w of the network by the updates ΔW.
- 10. Repeat steps 2-9 until all instances in the training set are processed. This constitutes one iteration.
- 11. Repeat the iteration of steps 2-10 until the error for the entire system (the error E defined above, or the error on a cross-validation set) is acceptably low, or the pre-defined number of iterations is reached.
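A minimal sketch of the accumulate-then-apply pattern in steps 7-9. Here compute_updates stands in for steps 2-6; it is a hypothetical helper assumed to return one Δw per weight for a single instance.

    def train_epoch(weights, epoch_instances, compute_updates):
        total = [0.0] * len(weights)
        for instance in epoch_instances:
            for j, dw in enumerate(compute_updates(weights, instance)):
                total[j] += dw                            # step 7: accumulate
        return [w + dW for w, dW in zip(weights, total)]  # step 9: apply once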
63 Backpropagation
- In a single-layer network, each neuron adjusts its weights according to what output was expected of it and the output it gave. This can be mathematically expressed by the perceptron delta rule:

    Δw = η (d − y) x

  where w is the array of weights and x is the array of inputs.
64 The Sigmoid (logistic) function
- One of the more popular alternative functions used with back-propagation nets is the sigmoid (logistic) function.
65 - The perceptron learning rule: in the rule above, w is the array of weights, x is the array of inputs, and η is defined as the learning rate.
- y_i and d_i are the actual and desired outputs, respectively.
- For the sigmoid activation, the deltas for the output layer are calculated as δ_i = y_i (1 − y_i) (d_i − y_i).
66 Calculate delta for the hidden layers
- We have to know the effect on the output of the neuron if a weight is to change.
- Therefore, we need to know the derivative of the error with respect to that weight.
- It has been proven that for neuron q in hidden layer p, delta is

    δ_q^p = y_q^p (1 − y_q^p) Σ_j w_qj^(p+1) δ_j^(p+1)

- Each delta value for a hidden layer requires that the delta values for the layer after it be calculated first.
67 Backpropagation example
[Figure: a network with NINPUTS = 2 inputs (x1, x2) and NHIDDEN = 2 hidden units. Constant inputs of 1 feed the bias weights W1(0,1), W1(0,2), and W2(0,1); x1 and x2 feed the first-layer weights W1(1,1), W1(1,2), W1(2,1), W1(2,2); the hidden outputs feed the second-layer weights W2(1,1) and W2(2,1).]
68 Back-propagation algorithm
- 1. Initialize the weights to small random values: the Layer 1 weights W1 and the Layer 2 weights W2.
69 - 2. Randomly choose an input pattern X(u).
- 3. Propagate the signal forward through the network.

    Layer 1: X2(i) = Σ_{k=0,1,2} W1(k,i) X(k)
70 - Out(x) = g( Σ_{k=0,1,2} W1(k,i) X(k) )
- X2(i) = g( Σ_{k=0,1,2} W1(k,i) X(k) )
- g(x) = 1 / (1 + e^(−x))
71 - 4. Compute δ_i^L in the output layer (o_i = y_i^L):

    δ_i^l = g'(h_i^l) (d_i^u − y_i^l)

  where h_i^l represents the net input to the ith unit in the lth layer, and g' is the derivative of the activation function g.
- For the sigmoid this gives δ3(1) = x3(1) (1 − x3(1)) (d − x3(1)).
72 - 5. Compute the deltas for the preceding layers by propagating the errors backwards:

    δ_i^l = g'(h_i^l) Σ_j w_ij^(l+1) δ_j^(l+1),   for l = (L−1), . . . , 1.
73 - 6. Update the weights using

    Δw_ji^l = η δ_i^l y_j^(l−1)

- Taking η as 0.05:
- Δw2(0,1) = η x1(0) δ2(1)
74 - 7. Go to step 2 and repeat for the next pattern until the error in the output layer is acceptably low, or a prespecified number of iterations is reached.
75 - Run the entire process again on the next set of training data.
- Slowly, as the training data is fed in and the network is retrained a few thousand times, the network can balance out to certain stable values.
76 APPLICATIONS
- To successfully work with real-world problems, you must deal with numerous design issues, including network model, network size, activation function, learning parameters, and number of training samples.
- Pattern classification
- Clustering
- Function approximation
- Prediction
- Optimization
- Content-addressable memory
- Control
77 Reference