Title: One-layer neural networks: Approximation problems
1. One-layer neural networks: Approximation problems
- Approximation problems
- Architecture and functioning (ADALINE, MADALINE)
- Learning based on error minimization
- The gradient algorithm
- Widrow-Hoff and delta algorithms
2. Approximation problems
- Approximation (regression)
- Problem: estimate a functional dependence between two variables
- The training set contains pairs of corresponding values
(Figure: linear approximation and nonlinear approximation)
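A minimal illustration of the two cases (the notation here is assumed, not taken from the slides): given training pairs, a linear model fits a straight line, while a nonlinear model allows a curved dependence.

\[ \{(x_l, d_l)\}_{l=1}^{L}, \qquad \text{linear: } y = w_1 x + w_0, \qquad \text{nonlinear: e.g. } y = w_2 x^2 + w_1 x + w_0 . \]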
3. Architecture
- One-layer NN: one layer of input units and one layer of functional units
(Figure: architecture of a one-layer network: the input vector X enters through N input units plus a fictive unit with constant value -1; total connectivity through the weight matrix W links them to the M functional (output) units, which produce the output vector Y.)
4. Functioning
- Computing the output signal (see the formula below)
- Usually the activation function is linear
- Examples
- ADALINE (ADAptive LINear Element)
- MADALINE (Multiple ADAptive LINear Element)
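The output computation is not written out above; a standard form, assuming the fictive unit with value -1 supplies the bias weight w_{i0}, is

\[ y_i = f\Big(\sum_{j=1}^{N} w_{ij}\, x_j - w_{i0}\Big), \qquad i = 1, \dots, M, \]

with f the identity function for ADALINE (linear activation).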
5. Learning based on error minimization
- Training set: (X^1, d^1), ..., (X^L, d^L)
- X^l is a vector from R^N, d^l is a vector from R^M
- Error function: a measure of the distance between the output produced by the network and the desired output
- Notations: see the error function below
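The error function itself is not reproduced above; a common choice consistent with the algorithms that follow (the 1/2 factor is an assumption) is the sum of squared errors over the training set:

\[ E(W) = \frac{1}{2} \sum_{l=1}^{L} \sum_{i=1}^{M} \big(d_i^l - y_i^l\big)^2, \qquad y_i^l = f\Big(\sum_{j} w_{ij}\, x_j^l\Big). \]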
6. Learning based on error minimization
- Learning = optimization task: find W which minimizes E(W)
- Variants:
- In the case of linear activation functions, W can be computed by using tools from linear algebra
- In the case of nonlinear activation functions, the minimum can be estimated by using a numerical method
7. Learning based on error minimization
- First variant. Particular case:
- M = 1 (one output unit with linear activation function)
- L = 1 (one example)
8. Learning based on error minimization
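The computation for this particular case is not reproduced; a sketch, assuming the squared-error function above with M = 1 and L = 1 (indices dropped), is:

\[ E(w) = \tfrac{1}{2}\Big(d - \sum_{j=1}^{N} w_j x_j\Big)^2, \qquad \frac{\partial E}{\partial w_j} = -\Big(d - \sum_{k} w_k x_k\Big) x_j = 0 \ \Rightarrow\ \sum_{j} w_j x_j = d, \]

a linear condition on w that can be solved with standard linear-algebra tools (with more examples it becomes a least-squares system).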
9. Learning based on error minimization
- Second variant: use of a numerical minimization method
- Gradient method:
- It is an iterative method based on the idea that the gradient of a function indicates the direction in which the function is increasing
- In order to estimate the minimum of a function, the current position is moved in the opposite direction of the gradient
10. Learning based on error minimization
(Figure: one-dimensional illustration of the gradient method: starting from x0, the points x1, ..., xk-1 move in the direction opposite to the gradient, to the left where f'(x) > 0 and to the right where f'(x) < 0, approaching the minimum.)
11. Learning based on error minimization
- Algorithm to minimize E(W) based on the gradient method
- Initialization:
- W(0) = initial values
- k = 0 (iteration counter)
- Iterative process:
- REPEAT
- W(k+1) = W(k) - eta * grad(E(W(k)))
- k = k + 1
- UNTIL a stopping condition is satisfied
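A minimal Python sketch of this iterative process (the test function, learning rate and stopping thresholds are illustrative assumptions, not part of the lecture):

import numpy as np

def gradient_descent(grad, w0, eta=0.1, tol=1e-6, k_max=1000):
    """Minimize a function from its gradient: W(k+1) = W(k) - eta * grad(W(k))."""
    w = np.asarray(w0, dtype=float)
    for k in range(k_max):
        g = grad(w)
        if np.linalg.norm(g) < tol:   # stopping condition: gradient close to zero
            break
        w = w - eta * g               # move in the direction opposite to the gradient
    return w

# Example: E(w) = (w1 - 3)^2 + (w2 + 1)^2 has gradient 2 * (w - [3, -1])
w_min = gradient_descent(lambda w: 2 * (w - np.array([3.0, -1.0])), w0=[0.0, 0.0])
print(w_min)   # approximately [3, -1]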
12. Learning based on error minimization
- Remark: the gradient method is a local optimization method; it can easily be trapped in local minima
13. Widrow-Hoff algorithm
- Learning algorithm for a linear network
- It minimizes E(W) by applying a gradient-like adjustment for each example from the training set
- Gradient computation (see below):
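For the squared error of a single example (X^l, d^l) with linear activation (the error function assumed above), the gradient is

\[ \frac{\partial E_l}{\partial w_{ij}} = -\big(d_i^l - y_i^l\big)\, x_j^l, \qquad \text{so the adjustment is } \Delta w_{ij} = \eta\,\big(d_i^l - y_i^l\big)\, x_j^l . \]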
14. Widrow-Hoff algorithm
- Algorithm structure
- Initialization:
- wij(0) = rand(-1,1) (the weights are randomly initialized in [-1,1])
- k = 0 (iteration counter)
- Iterative process:
- REPEAT
- FOR l = 1,L DO
- Compute yi(l) and deltai(l) = di(l) - yi(l), i = 1,M
- Adjust the weights: wij = wij + eta * deltai(l) * xj(l)
- Compute E(W) for the new values of the weights
- k = k + 1
- UNTIL E(W) < E* OR k > kmax
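A minimal Python sketch of this algorithm (array shapes, learning rate and thresholds are illustrative assumptions):

import numpy as np

def widrow_hoff(X, D, eta=0.01, e_star=1e-4, k_max=1000):
    """One-layer linear network trained with per-example (Widrow-Hoff) updates.
    X: (L, N) input vectors, D: (L, M) desired outputs; returns W of shape (M, N)."""
    L, N = X.shape
    M = D.shape[1]
    W = np.random.uniform(-1, 1, size=(M, N))     # weights randomly initialized in [-1, 1]
    for k in range(k_max):
        for l in range(L):
            y = W @ X[l]                          # linear activation: y_i = sum_j w_ij x_j
            delta = D[l] - y                      # delta_i = d_i - y_i
            W += eta * np.outer(delta, X[l])      # w_ij = w_ij + eta * delta_i * x_j
        E = 0.5 * np.sum((X @ W.T - D) ** 2)      # error for the new values of the weights
        if E < e_star:                            # stopping condition
            break
    return W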
15. Widrow-Hoff algorithm
- Remarks:
- If the error function has only one optimum, the algorithm converges (but not in a finite number of steps) to the optimal values of W
- The convergence speed is influenced by the value of the learning rate (eta)
- The value E* is a measure of the accuracy we expect to obtain
- It is one of the simplest learning algorithms, but it can be applied only for one-layer networks with linear activation functions
16. Delta algorithm
- Algorithm similar to Widrow-Hoff, but for networks with nonlinear activation functions
- The only difference is in the gradient computation
- Gradient computation (see below):
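A sketch of this gradient, assuming the same squared error and an activation function f applied to the weighted sum net_i:

\[ \frac{\partial E_l}{\partial w_{ij}} = -\big(d_i^l - y_i^l\big)\, f'(net_i^l)\, x_j^l, \qquad net_i^l = \sum_{j} w_{ij}\, x_j^l, \]

so the adjustment becomes wij = wij + eta * deltai(l) * f'(neti(l)) * xj(l); the only change with respect to Widrow-Hoff is the factor f'(neti(l)).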
17. Delta algorithm
- Particularities:
- 1. The error function can have many minima, thus the algorithm can be trapped in one of these (meaning that the learning is not complete)
- 2. For sigmoidal functions the derivatives can be computed in an efficient way by using the following relations:
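For the two usual sigmoidal functions (assumed here to be the logistic and tanh functions) these relations are:

\[ \text{logistic: } f(x) = \frac{1}{1 + e^{-x}}, \quad f'(x) = f(x)\,\big(1 - f(x)\big); \qquad \text{tanh: } f(x) = \tanh(x), \quad f'(x) = 1 - f(x)^2 . \]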
18. Limits of one-layer networks
- One-layer networks have limited capability, being able only to:
- Solve simple (e.g. linearly separable) classification problems
- Approximate simple (e.g. linear) dependences
- Solution: include hidden layers
- Remark: the hidden units should have nonlinear activation functions