Transcript and Presenter's Notes

Title: Overview over different methods Supervised Learning


1
Overview over different methods: Supervised Learning
(Overview diagram of learning methods, with supervised learning marked "You are here!" among "many more")
2
Some more basics: Threshold Logic Unit (TLU)
(Diagram: inputs u1, u2, ..., un with weights w1, w2, ..., wn feed a summation unit with threshold θ that produces the output v)
activation: a = Σ_{i=1..n} wi ui
output: v = 1 if a ≥ θ, 0 if a < θ
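A minimal Python sketch of this forward pass (the function name and the use of NumPy are illustrative, not from the slides):

```python
import numpy as np

def tlu(u, w, theta):
    """Threshold Logic Unit: output 1 if the weighted input sum reaches the threshold."""
    a = np.dot(w, u)                 # activation a = sum_i w_i * u_i
    return 1 if a >= theta else 0

# Example: two inputs, weights (1, 1), threshold 1.5 (the AND unit of a later slide)
print(tlu(np.array([1, 1]), np.array([1.0, 1.0]), 1.5))  # -> 1
```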

3
Activation Functions
(Four plots of output v versus activation a: threshold, linear, piece-wise linear, and sigmoid activation functions)
4
Decision Surface of a TLU
Decision line: w1 u1 + w2 u2 = θ
(Diagram in the (u1, u2) plane: patterns with w·u > θ are classified as 1, patterns with w·u < θ as 0, separated by the decision line)
5
Scalar Products and Projections
(Diagrams: vectors u and w enclosing an angle φ, for the three cases w·u > 0, w·u = 0, and w·u < 0, with the projection of u onto w)
w·u = |w| |u| cos φ
6
Geometric Interpretation
The relation w·u = θ implicitly defines the
decision line.
Decision line: w1 u1 + w2 u2 = θ
(Diagram: the weight vector w is perpendicular to the decision line; the projection of any point u on the line onto w has length uw = θ/|w|; v = 1 on the side toward w, v = 0 on the other side)
7
Geometric Interpretation
  • In n dimensions the relation w·u = θ defines an
    (n-1)-dimensional hyper-plane, which is perpendicular
    to the weight vector w.
  • On one side of the hyper-plane (w·u > θ) all
    patterns are classified by the TLU as 1, while
    those that get classified as 0 lie on the other
    side of the hyper-plane.
  • If patterns cannot be separated by a hyper-plane,
    then they cannot be correctly classified with a
    TLU.

8
Linear Separability
(Two diagrams in the (u1, u2) plane showing the target outputs at the four corners (0,0), (0,1), (1,0), (1,1))
Logical XOR: w1 = ?, w2 = ?, θ = ? (not linearly separable)
Logical AND: w1 = 1, w2 = 1, θ = 1.5 (linearly separable)
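A quick check in Python (the TLU helper is redefined here so the snippet is self-contained): the weights w1 = w2 = 1 with θ = 1.5 reproduce AND, while no single TLU can reproduce XOR.

```python
import numpy as np

def tlu(u, w, theta):
    return 1 if np.dot(w, u) >= theta else 0

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
for u in patterns:
    # w1 = 1, w2 = 1, theta = 1.5 reproduces the AND targets 0, 0, 0, 1
    print(u, "AND ->", tlu(np.array(u), np.array([1.0, 1.0]), 1.5))
# For XOR (targets 0, 1, 1, 0) no choice of w1, w2, theta works:
# the four points cannot be separated by a single line.
```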
9
Threshold as Weight
θ = wn+1, un+1 = -1
(Diagram: the threshold is realized as an additional weight wn+1 on a constant input un+1 = -1)
a = Σ_{i=1..n+1} wi ui
v = 1 if a ≥ 0, 0 if a < 0
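A minimal sketch of this trick (the helper name is my own): append a constant -1 input and treat θ as the extra weight, so the unit compares against zero.

```python
import numpy as np

def tlu_with_bias_weight(u, w_aug):
    """TLU where w_aug = (w1, ..., wn, theta) and the input is augmented with -1."""
    u_aug = np.append(u, -1.0)           # u_{n+1} = -1
    return 1 if np.dot(w_aug, u_aug) >= 0 else 0

# Equivalent to the AND unit with w = (1, 1) and theta = 1.5
print(tlu_with_bias_weight(np.array([1, 1]), np.array([1.0, 1.0, 1.5])))  # -> 1
```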


10
Geometric Interpretation
The relation w·u = 0 defines the decision line.
(Diagram: with the threshold absorbed into the weights, the decision line w·u = 0 passes through the origin; v = 1 on the side toward w, v = 0 on the other side)
11
Training ANNs
  • Training set S of examples ⟨u, vt⟩
  • u is an input vector and
  • vt the desired target output
  • Example: Logical AND
  • S = { ⟨(0,0), 0⟩, ⟨(0,1), 0⟩, ⟨(1,0), 0⟩, ⟨(1,1), 1⟩ }
  • Iterative process
  • Present a training example u, compute the network
    output v, compare the output v with the target vt,
    adjust weights and thresholds
  • Learning rule
  • Specifies how to change the weights w and
    thresholds θ of the network as a function of the
    inputs u, output v and target vt.

12
Adjusting the Weight Vector
(Diagram: the angle φ between w and u is greater than 90°)
Target vt = 1, output v = 0:  w' = w + μ u
Move w in the direction of u.
(Diagram: the angle φ between w and u is less than 90°)
Target vt = 0, output v = 1:  w' = w - μ u
Move w away from the direction of u.
13
Perceptron Learning Rule
  • w' = w + μ (vt - v) u
  • Or in components:
  • wi' = wi + Δwi = wi + μ (vt - v) ui   (i = 1..n+1)
  • with wn+1 = θ and un+1 = -1
  • The parameter μ is called the learning rate. It
    determines the magnitude of the weight updates Δwi.
  • If the output is correct (vt = v), the weights are
    not changed (Δwi = 0).
  • If the output is incorrect (vt ≠ v), the weights
    wi are changed such that the new weight vector is
    moved closer to, or further from, the input u.

14
Perceptron Training Algorithm
  • Repeat
  •   for each training vector pair (u, vt)
  •     evaluate the output v when u is the input
  •     if v ≠ vt then
  •       form a new weight vector w' according
  •       to w' = w + μ (vt - v) u
  •     else
  •       do nothing
  •     end if
  •   end for
  • Until v = vt for all training vector pairs
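A runnable Python sketch of this loop under the threshold-as-weight convention from slide 9 (the function name, NumPy usage, and the epoch cap are my own additions, not from the slides):

```python
import numpy as np

def train_perceptron(examples, mu=0.1, max_epochs=100):
    """examples: list of (u, v_t) pairs with u a NumPy array and v_t in {0, 1}."""
    n = len(examples[0][0])
    w = np.zeros(n + 1)                      # last weight plays the role of theta
    for _ in range(max_epochs):
        all_correct = True
        for u, vt in examples:
            u_aug = np.append(u, -1.0)       # augmented input u_{n+1} = -1
            v = 1 if np.dot(w, u_aug) >= 0 else 0
            if v != vt:                      # update only on errors
                w = w + mu * (vt - v) * u_aug
                all_correct = False
        if all_correct:                      # stop once every example is classified correctly
            break
    return w

# Logical AND training set from the earlier slide
S = [(np.array([0, 0]), 0), (np.array([0, 1]), 0),
     (np.array([1, 0]), 0), (np.array([1, 1]), 1)]
print(train_perceptron(S))
```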

15
Perceptron Learning Rule
16
Perceptron Convergence Theorem
  • The algorithm converges to the correct
    classification
  • if the training data is linearly separable
  • and μ is sufficiently small.
  • If two classes of vectors u1 and u2 are
    linearly separable, the application of the
    perceptron training algorithm will eventually
    result in a weight vector w0, such that w0
    defines a TLU whose decision hyper-plane
    separates u1 and u2 (Rosenblatt 1962).
  • The solution w0 is not unique, since if w0·u = 0
    defines a hyper-plane, so does w0' = k·w0 for any
    k > 0.

17
(No Transcript)
18
Linear Separability
(Repeat of the earlier Linear Separability slide: two diagrams in the (u1, u2) plane showing the target outputs at the four corners)
Logical XOR: w1 = ?, w2 = ?, θ = ? (not linearly separable)
Logical AND: w1 = 1, w2 = 1, θ = 1.5 (linearly separable)
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
Multiple TLUs
  • Handwritten alphabetic character recognition
  • 26 classes: A, B, C, ..., Z
  • First TLU distinguishes between As and
    non-As, second TLU between Bs and non-Bs,
    etc.

(Diagram: a layer of TLUs with outputs v1, v2, ..., v26; weight wji connects input ui with output vj)
wji' = wji + μ (vt,j - vj) ui
Essentially this makes the output and target a
vector, too.
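A short sketch of this per-class update with a weight matrix (the 26-class setup, sizes, and names are illustrative):

```python
import numpy as np

def update_multi_tlu(W, u, v_target, mu=0.1):
    """W has one row of weights per class; each row is trained like an independent TLU."""
    v = (W @ u >= 0).astype(float)       # one 0/1 output per class
    W += mu * np.outer(v_target - v, u)  # w_ji <- w_ji + mu * (vt_j - v_j) * u_i
    return W

# Toy example: 26 classes, 960 inputs (e.g. 30x32 pixels) plus the -1 bias input
W = np.zeros((26, 961))
u = np.append(np.random.rand(960), -1.0)
target = np.zeros(26); target[0] = 1.0   # the pattern is an "A"
W = update_multi_tlu(W, u, target)
```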
23
Generalized Perceptron Learning Rule
If we do not include the threshold as an input, we
use the following description of the perceptron with
symmetrical outputs (this does not matter much,
though):
v = 1 if w·u ≥ θ, v = -1 if w·u < θ.
Then we get for the learning rule
w' = w + μ (vt - v) u
and
θ' = θ - μ (vt - v).
This implies
w'·u - θ' = (w·u - θ) + μ (vt - v)(u·u + 1).
Hence, if vt = 1 and v = -1 the weight change
increases the term w·u - θ, and vice versa. This is
what we need to compensate the error!
24
Linear Unit, no Threshold!
(Diagram: inputs u1, ..., un with weights w1, ..., wn feed a summation unit whose output equals the activation)
v = a = Σ_{i=1..n} wi ui
Let's abbreviate the target output (vectors) by t
in the next slides.
25
Gradient Descent Learning Rule
  • Consider a linear unit without threshold and with
    continuous output v (not just -1, 1):
  • v = w0 + w1 u1 + ... + wn un
  • Train the wi such that they minimize the
    squared error
  • E[w1,...,wn] = ½ Σ_{d∈D} (td - vd)²
  • where D is the set of training examples and
  • t the target outputs.

26
Gradient Descent
D = { ⟨(1,1), 1⟩, ⟨(-1,-1), 1⟩, ⟨(1,-1), -1⟩, ⟨(-1,1), -1⟩ }
Δw = -μ ∇E[w]
-1/μ Δwi = ∂E/∂wi = ∂/∂wi ½ Σd (td - vd)²
         = ∂/∂wi ½ Σd (td - Σi wi ui,d)²
         = Σd (td - vd)(-ui,d)
27
Gradient Descent
  • Gradient-Descent(training_examples, μ)
  • Each training example is a pair of the form
    ⟨(u1,...,un), t⟩, where (u1,...,un) is the vector of
    input values and t is the target output value; μ
    is the learning rate (e.g. 0.1)
  • Initialize each wi to some small random value
  • Until the termination condition is met, Do
  •   Initialize each Δwi to zero
  •   For each ⟨(u1,...,un), t⟩ in training_examples Do
  •     Input the instance (u1,...,un) to the linear unit
        and compute the output v
  •     For each linear unit weight wi Do
  •       Δwi = Δwi + μ (t - v) ui
  •   For each linear unit weight wi Do
  •     wi = wi + Δwi
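A runnable Python sketch of this batch procedure for a linear unit (the names, the fixed-epoch stopping rule, and the initialization scale are my own choices):

```python
import numpy as np

def gradient_descent(examples, mu=0.1, epochs=50):
    """examples: list of (u, t) pairs with u a NumPy array and t a real target."""
    n = len(examples[0][0])
    w = 0.01 * np.random.randn(n)          # small random initial weights
    for _ in range(epochs):                # simple fixed-epoch termination condition
        dw = np.zeros(n)                   # accumulate the update over the whole set
        for u, t in examples:
            v = np.dot(w, u)               # linear unit output
            dw += mu * (t - v) * u         # delta_wi += mu * (t - v) * ui
        w += dw                            # apply the batch update
    return w

# The training set D from the previous slide
D = [(np.array([1.0, 1.0]), 1.0), (np.array([-1.0, -1.0]), 1.0),
     (np.array([1.0, -1.0]), -1.0), (np.array([-1.0, 1.0]), -1.0)]
print(gradient_descent(D))
```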

28
Incremental Stochastic Gradient Descent
  • Batch mode gradient descent:
  • w = w - μ ∇E_D[w] over the entire data D
  • E_D[w] = ½ Σd (td - vd)²
  • Incremental (stochastic) mode gradient descent:
  • w = w - μ ∇E_d[w] over individual training
    examples d
  • E_d[w] = ½ (td - vd)²
  • Incremental gradient descent can approximate
    batch gradient descent arbitrarily closely if μ
    is small enough.
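A sketch of the incremental (stochastic) variant, updating the weights after every single example instead of after a full pass (same assumptions and naming as the batch sketch above):

```python
import numpy as np

def incremental_gradient_descent(examples, mu=0.1, epochs=50):
    """Update the weights after every individual example d instead of after a full pass."""
    n = len(examples[0][0])
    w = 0.01 * np.random.randn(n)
    for _ in range(epochs):
        for u, t in examples:              # one gradient step per training example
            v = np.dot(w, u)
            w += mu * (t - v) * u          # w <- w - mu * grad E_d[w]
    return w
```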

29
Perceptron vs. Gradient Descent Rule
  • Perceptron rule
  • wi' = wi + μ (tp - vp) ui,p
  • derived from manipulation of the decision surface.
  • Gradient descent rule
  • wi' = wi + μ (tp - vp) ui,p
  • derived from minimization of the error function
  • E[w1,...,wn] = ½ Σp (tp - vp)²
  • by means of gradient descent.

30
Perceptron vs. Gradient Descent Rule
  • Perceptron learning rule guaranteed to succeed if
  • the training examples are linearly separable
  • and the learning rate μ is sufficiently small.
  • Linear unit training rules using gradient descent
  • guaranteed to converge to the hypothesis with minimum
    squared error
  • given a sufficiently small learning rate μ
  • even when the training data contains noise
  • even when the training data is not separable by a
    hyperplane.

31
Presentation of Training Examples
  • Presenting all training examples once to the ANN
    is called an epoch.
  • In incremental stochastic gradient descent the
    training examples can be presented in one of the
    following orders (sketched below):
  • fixed order (1, 2, 3, ..., M)
  • randomly permuted order (5, 2, 7, ..., 3)
  • completely random order with repetition (4, 1, 7, 1, 5, 4, ...)
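A tiny NumPy sketch of the three presentation orders (values and seed are illustrative):

```python
import numpy as np

M = 8
rng = np.random.default_rng(0)

fixed_order = np.arange(1, M + 1)                 # 1, 2, 3, ..., M
permuted_order = rng.permutation(fixed_order)     # a fresh permutation each epoch
random_order = rng.integers(1, M + 1, size=M)     # drawn with repetition
print(fixed_order, permuted_order, random_order)
```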

32
Neuron with Sigmoid-Function
(Diagram: inputs x1, ..., xn with weights w1, ..., wn feed a summation unit followed by a sigmoid that produces the output y)
a = Σ_{i=1..n} wi xi
y = σ(a) = 1/(1 + e^-a)
33
Sigmoid Unit
(Diagram: the bias input x0 = -1 with weight w0 and inputs u1, ..., un with weights w1, ..., wn feed a summation unit followed by a sigmoid)
a = Σ_{i=0..n} wi ui
v = σ(a) = 1/(1 + e^-a)
σ(x) is the sigmoid function 1/(1 + e^-x)
dσ(x)/dx = σ(x) (1 - σ(x))
  • Derive gradient descent rules to train
  • one sigmoid unit:
  • ∂E/∂wi = -Σp (tp - vp) vp (1 - vp) ui,p
  • multilayer networks of sigmoid units:
  • backpropagation
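A small sketch of the sigmoid and its derivative in Python (names are illustrative):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid function sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative d sigma/dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Sanity check of the derivative identity at a few points
for a in (-2.0, 0.0, 2.0):
    print(a, sigmoid(a), sigmoid_prime(a))
```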

34
Gradient Descent Rule for Sigmoid Output Function
(Plots of the sigmoid σ(a) and of its derivative σ'(a))
Ep[w1,...,wn] = ½ (tp - vp)²
  • ∂Ep/∂wi = ∂/∂wi ½ (tp - vp)²
  • = ∂/∂wi ½ (tp - σ(Σi wi ui,p))²
  • = (tp - vp) σ'(Σi wi ui,p) (-ui,p)
  • For v = σ(a) = 1/(1 + e^-a):
  • σ'(a) = e^-a/(1 + e^-a)² = σ(a) (1 - σ(a))

wi' = wi + Δwi = wi + μ v(1 - v)(tp - vp) ui,p
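A runnable sketch of this per-example update for a single sigmoid unit (variable names, learning rate, and the toy inputs are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_unit_step(w, u, t, mu=0.5):
    """One gradient descent step on E_p = 1/2 (t - v)^2 for a sigmoid unit."""
    v = sigmoid(np.dot(w, u))                 # forward pass
    w = w + mu * v * (1.0 - v) * (t - v) * u  # w_i <- w_i + mu * v(1-v)(t - v) * u_i
    return w, v

# Toy usage: three inputs including the constant bias input u0 = -1
w = np.zeros(3)
u = np.array([-1.0, 0.5, 1.0])
for _ in range(100):
    w, v = sigmoid_unit_step(w, u, t=1.0)
print(w, v)
```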
35
Gradient Descent Learning Rule
(Diagram: weight wji connects the pre-synaptic neuron with activation ui to the post-synaptic neuron with output vj)
  • Δwji = μ · vj,p(1 - vj,p) · (tj,p - vj,p) · ui,p
where μ is the learning rate, vj,p(1 - vj,p) the
derivative of the activation function, (tj,p - vj,p)
the error δj of the post-synaptic neuron, and ui,p
the activation of the pre-synaptic neuron.
36
(No Transcript)
37
ALVINN
Automated driving at 70 mph on a public highway.
(Network diagram: a 30x32-pixel camera image as input, 30x32 weights into each of the 4 hidden units, and 30 output units for steering)