Transcript and Presenter's Notes

Title: Artificial Neural Networks


1
Artificial Neural Networks
2
Outline
  • Biological Motivation
  • Perceptron
  • Gradient Descent
  • Least Mean Square Error
  • Multi-layer networks
  • Sigmoid node
  • Backpropagation

3
Biological Neural Systems
  • Neuron switching time: > 10^-3 secs
  • Number of neurons in the human brain: 10^10
  • Connections (synapses) per neuron: 10^4 - 10^5
  • Face recognition: 0.1 secs
  • High degree of parallel computation
  • Distributed representations

4
Artificial Neural Networks
  • Many simple neuron-like threshold units
  • Many weighted interconnections
  • Multiple outputs
  • Highly parallel and distributed processing
  • Learning by tuning the connection weights

5
Perceptron: Linear Threshold Unit
[Diagram: inputs x_0 = 1, x_1, ..., x_n with weights w_0, w_1, ..., w_n feed a summation unit computing Σ_{i=0}^n w_i x_i, followed by a threshold output o]
o(x) = 1 if Σ_{i=0}^n w_i x_i > 0, -1 otherwise

6
Decision Surface of a Perceptron
[Figure: positive and negative examples in the (x1, x2) plane, separated by a line]
Linearly separable data
Theorem: VC-dim = n + 1
7
Perceptron Learning Rule
S: the sample; x_i: input vector; t = c(x): the target value; o: the perceptron output; η: learning rate (a small constant), assume η = 1
w_i ← w_i + Δw_i,   where Δw_i = η (t - o) x_i
8
Perceptron Algorithm
  • Correct output (t = o)
  • Weights are unchanged
  • Incorrect output (t ≠ o)
  • Change the weights! (a code sketch of the rule follows below)
  • False negative (t = 1 and o = -1)
  • Add x to w
  • False positive (t = -1 and o = 1)
  • Subtract x from w
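A minimal sketch of this rule in Python (not from the slides; the NumPy setup, the epoch cap, and handling the bias via x_0 = 1 are assumptions):

```python
import numpy as np

def perceptron_train(X, t, epochs=100):
    """Perceptron learning rule: on an error, add x to w (missed positive)
    or subtract x from w (missed negative); weights are unchanged otherwise.
    X: (m, n) inputs, t: (m,) targets in {-1, +1}."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1 so w_0 acts as the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1             # linear threshold unit
            if o != target:                        # incorrect output: change the weights
                w += target * x                    # +x for a missed positive, -x for a missed negative
                mistakes += 1
        if mistakes == 0:                          # no errors in a full pass: done
            break
    return w
```

On linearly separable data this loop terminates; the analysis on the following slides bounds how many mistakes it can make.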

9
Perceptron Learning Rule
10
Perceptron Algorithm: Analysis
  • Theorem: the number of errors made by the Perceptron Algorithm is bounded
  • Proof:
  • Make all examples positive:
  • change <x_i, b_i> to <b_i x_i, +1>
  • Margin of the separating hyperplane w*

11
Perceptron Algorithm: Analysis II
  • Let m_i be the number of errors made on x_i
  • M = Σ_i m_i
  • From the algorithm: w = Σ_i m_i x_i
  • Let w* be a separating hyperplane
12
Perceptron Algorithm: Analysis III
  • Change in weights when the algorithm errs on x_i: w ← w + x_i
  • Since w errs on x_i, we have w · x_i < 0
  • Total weight: ||w||² therefore grows by at most ||x_i||² per error
13
Perceptron Algorithm: Analysis IV
  • Consider the angle between w and w*
  • Putting it all together (a reconstruction of the lost equations follows below)
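The equations on these analysis slides did not survive extraction. The following is a reconstruction of the standard mistake-bound argument, under the added assumptions that every example satisfies ||x_i|| ≤ R and that the separating hyperplane w* has unit norm and margin γ (neither R nor γ appears explicitly in the surviving text):

```latex
% Reconstruction of the standard argument; R (a bound on \|x_i\|) and the
% margin \gamma of the unit-norm separator w^* are assumptions.
\begin{align*}
&\text{After $M$ errors: } w = \sum_i m_i x_i, \qquad M = \sum_i m_i.\\
&\text{Projection grows: } w \cdot w^* = \sum_i m_i\,(x_i \cdot w^*) \;\ge\; M\gamma.\\
&\text{Norm grows slowly: on an error } \|w + x_i\|^2 = \|w\|^2 + 2\,w\cdot x_i + \|x_i\|^2
  \le \|w\|^2 + R^2 \text{ (since } w\cdot x_i < 0\text{)},\\
&\qquad\text{so } \|w\|^2 \le M R^2.\\
&\text{Angle between $w$ and $w^*$: } 1 \;\ge\; \frac{w\cdot w^*}{\|w\|\,\|w^*\|}
  \;\ge\; \frac{M\gamma}{\sqrt{M}\,R}
  \;\Longrightarrow\; M \le \frac{R^2}{\gamma^2}.
\end{align*}
```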

14
Gradient Descent Learning Rule
  • Consider a linear unit without threshold and with continuous output o (not just ±1)
  • o = w_0 + w_1 x_1 + ... + w_n x_n
  • Train the w_i's so that they minimize the squared error
  • E[w_1, ..., w_n] = ½ Σ_{d∈S} (t_d - o_d)²
  • where S is the set of training examples

15
Gradient Descent
S = {<(1,1), 1>, <(-1,-1), 1>, <(1,-1), -1>, <(-1,1), -1>}
Δw = -η ∇E[w]
Δw_i = -η ∂E/∂w_i
∂E/∂w_i = ∂/∂w_i ½ Σ_d (t_d - o_d)² = ∂/∂w_i ½ Σ_d (t_d - Σ_i w_i x_{i,d})² = Σ_d (t_d - o_d)(-x_{i,d})
16
Gradient Descent
  • Gradient-Descent(S: training examples, η)   (a runnable sketch follows below)
  • Until TERMINATION Do
  • Initialize each Δw_i to zero
  • For each <x, t> in S Do
  • Compute o = o(<x, w>)
  • For each weight w_i Do
  • Δw_i ← Δw_i + η (t - o) x_i
  • For each weight w_i Do
  • w_i ← w_i + Δw_i
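A runnable sketch of this pseudocode for the linear unit, in Python (the NumPy representation, using a fixed iteration count as TERMINATION, and reusing the sample S from the previous slide are assumptions):

```python
import numpy as np

def gradient_descent(S, eta=0.05, iterations=200):
    """Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn,
    minimizing E[w] = 1/2 * sum_d (t_d - o_d)^2."""
    n = len(S[0][0])
    w = np.zeros(n + 1)                      # w_0 is the bias weight (x_0 = 1)
    for _ in range(iterations):              # TERMINATION: fixed number of passes
        delta_w = np.zeros_like(w)           # initialize each delta w_i to zero
        for x, t in S:
            x = np.append(1.0, x)            # x_0 = 1
            o = w @ x                        # linear unit, no threshold
            delta_w += eta * (t - o) * x     # accumulate eta * (t - o) * x_i
        w += delta_w                         # apply the accumulated update
    return w

# The sample from the previous slide: S = {<(1,1),1>, <(-1,-1),1>, <(1,-1),-1>, <(-1,1),-1>}
S = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
w = gradient_descent(S)
```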

17
Incremental Stochastic Gradient Descent
  • Batch mode gradient descent:
  • w = w - η ∇E_S[w] over the entire data S
  • E_S[w] = ½ Σ_d (t_d - o_d)²
  • Incremental mode gradient descent:
  • w = w - η ∇E_d[w] over individual training examples d
  • E_d[w] = ½ (t_d - o_d)²
  • Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough (see the sketch below)
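For contrast, a minimal incremental-mode (stochastic) variant of the batch sketch above; the only change is that each example updates the weights immediately, using the per-example error E_d[w] (same assumed data layout):

```python
import numpy as np

def incremental_gradient_descent(S, eta=0.05, iterations=200):
    """Incremental (stochastic) mode: update w after every single example."""
    n = len(S[0][0])
    w = np.zeros(n + 1)
    for _ in range(iterations):
        for x, t in S:
            x = np.append(1.0, x)
            o = w @ x
            w += eta * (t - o) * x    # step on the gradient of E_d[w] = 1/2 (t - o)^2
    return w
```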

18
Comparison: Perceptron and Gradient Descent Rule
  • Perceptron learning rule: guaranteed to succeed if
  • the training examples are linearly separable
  • No guarantee otherwise
  • Linear unit using gradient descent:
  • Converges to the hypothesis with minimum squared error
  • Given a sufficiently small learning rate η
  • Even when the training data contains noise
  • Even when the training data is not linearly separable

19
Multi-Layer Networks
[Figure: feed-forward network with an input layer, one or more hidden layers, and an output layer]
20
Sigmoid Unit
[Diagram: inputs x_0 = 1, x_1, ..., x_n with weights w_0, w_1, ..., w_n feed a summation unit, followed by a sigmoid]
z = Σ_{i=0}^n w_i x_i
o = σ(z) = 1/(1 + e^-z), the sigmoid function
21
Sigmoid Function
σ(z) = 1/(1 + e^-z)
dσ(z)/dz = σ(z) (1 - σ(z))
  • Gradient descent rule for a single sigmoid unit:
  • ∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d}
  • Multilayer networks of sigmoid units:
  • backpropagation (a code sketch follows below)
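A small sketch in Python of the sigmoid, its derivative, and one gradient-descent step for a single sigmoid unit (the learning-rate value and the bias handling are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # sigma(z) = 1 / (1 + e^-z)

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)                       # d sigma/dz = sigma(z) * (1 - sigma(z))

def sigmoid_unit_step(w, S, eta=0.1):
    """One gradient-descent step for a single sigmoid unit:
    dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}."""
    grad = np.zeros_like(w)
    for x, t in S:
        x = np.append(1.0, x)                  # x_0 = 1 bias input
        o = sigmoid(w @ x)
        grad += -(t - o) * o * (1 - o) * x     # contribution of example d to dE/dw
    return w - eta * grad                      # w <- w - eta * grad E
```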

22
Backpropagation overview
  • Make threshold units differentiable
  • Use sigmoid functions
  • Given a sample compute
  • The error
  • The Gradient
  • Use the chain rule to compute the Gradient

23
Backpropagation Motivation
  • Consider the squared error
  • E_S[w] = ½ Σ_{d∈S} Σ_{k∈outputs} (t_{d,k} - o_{d,k})²
  • Gradient: ∇E_S[w]
  • Update: w = w - η ∇E_S[w]
  • How do we compute the gradient?

24
Backpropagation Algorithm
  • Forward phase:
  • Given input x, compute the output of each unit
  • Backward phase:
  • For each output unit k compute its error term δ_k = o_k (1 - o_k)(t_k - o_k)

25
Backpropagation Algorithm
  • Backward phase (continued):
  • For each hidden unit h compute δ_h = o_h (1 - o_h) Σ_{k∈outputs} w_{h,k} δ_k
  • Update weights (a runnable sketch follows below):
  • w_{i,j} = w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_i
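A runnable sketch of the algorithm for a network with one hidden layer of sigmoid units (the weight-matrix shapes, bias handling, per-example updating, and the XOR-style usage below are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hidden, W_output, eta=0.5):
    """One stochastic backpropagation step: sigmoid hidden and output units,
    squared error. x: (n,) input, t: (k,) target values in (0, 1)."""
    # Forward phase: given input x, compute the output of each unit.
    x = np.append(x, 1.0)                        # bias input x_0 = 1
    h = sigmoid(W_hidden @ x)                    # hidden-unit outputs
    h_b = np.append(h, 1.0)                      # bias for the output layer
    o = sigmoid(W_output @ h_b)                  # network outputs

    # Backward phase.
    delta_k = o * (1 - o) * (t - o)                          # output units
    delta_h = h * (1 - h) * (W_output[:, :-1].T @ delta_k)   # hidden units

    # Update weights: w_ij <- w_ij + eta * delta_j * x_i (in place).
    W_output += eta * np.outer(delta_k, h_b)
    W_hidden += eta * np.outer(delta_h, x)
    return o

# Hypothetical usage: a few thousand stochastic steps on XOR-style data,
# with 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.5, size=(3, 3))         # (hidden units, inputs + bias)
W_o = rng.normal(scale=0.5, size=(1, 4))         # (output units, hidden units + bias)
data = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
for _ in range(5000):
    for x, t in data:
        backprop_step(np.array(x, float), np.array(t, float), W_h, W_o)
```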

26
Backpropagation: output node
27
Backpropagation: output node (cont.)
28
Backpropagation: inner node
29
Backpropagation: inner node (cont.; the lost derivation is reconstructed below)
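The equations on slides 26 through 29 were lost in extraction; a standard chain-rule reconstruction for a sigmoid output node k and an inner (hidden) node h, under the squared error E_d used above, is:

```latex
% Standard chain-rule reconstruction (not verbatim from the slides).
% Output node k: net input z_k = \sum_i w_{i,k} x_{i,k}, output o_k = \sigma(z_k).
\begin{align*}
\frac{\partial E_d}{\partial w_{i,k}}
  &= \frac{\partial E_d}{\partial o_k}\,
     \frac{\partial o_k}{\partial z_k}\,
     \frac{\partial z_k}{\partial w_{i,k}}
   = -(t_k - o_k)\, o_k (1 - o_k)\, x_{i,k}
  &&\Rightarrow\; \delta_k = o_k (1 - o_k)(t_k - o_k). \\
% Inner node h: its output o_h feeds every output node k, so the chain rule sums over k.
\frac{\partial E_d}{\partial z_h}
  &= \sum_{k \in \mathrm{outputs}}
     \frac{\partial E_d}{\partial z_k}\,
     \frac{\partial z_k}{\partial o_h}\,
     \frac{\partial o_h}{\partial z_h}
   = -\sum_{k} \delta_k\, w_{h,k}\, o_h (1 - o_h)
  &&\Rightarrow\; \delta_h = o_h (1 - o_h) \sum_{k} w_{h,k}\, \delta_k.
\end{align*}
```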
30
Backpropagation Summary
  • Gradient descent over the entire network weight vector
  • Easily generalized to arbitrary directed graphs
  • Finds a local, not necessarily global, error minimum
  • in practice it often works well
  • may require multiple runs with different initial weights
  • A variation is to include a momentum term (a sketch follows below):
  • Δw_{i,j}(n) = η δ_j x_i + α Δw_{i,j}(n-1)
  • Minimizes error over the training examples
  • Training is fairly slow, yet prediction is fast
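A minimal sketch of the momentum variant in Python (the function and parameter names, the default α, and passing the gradient explicitly are assumptions):

```python
import numpy as np

def momentum_update(w, grad_E, prev_delta, eta=0.1, alpha=0.9):
    """Momentum update: delta_w(n) = -eta * grad E + alpha * delta_w(n-1).
    The -eta * grad E term equals eta * delta_j * x_i in the slide's notation."""
    delta = -eta * grad_E + alpha * prev_delta   # keep a fraction of the previous step
    return w + delta, delta                      # new weights and the step to reuse next time
```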

31
Expressive Capabilities of ANN
  • Boolean functions
  • Every Boolean function can be represented by a network with a single hidden layer
  • but it might require a number of hidden units exponential in the number of inputs
  • Continuous functions
  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

32
VC-dim of ANN
  • A more general bound.
  • Concept class F(C,G):
  • G: a directed acyclic graph
  • C: a concept class, d = VC-dim(C)
  • n input nodes
  • s inner nodes (of degree r)
  • Theorem: VC-dim(F(C,G)) < 2ds log(es)

33
Proof
  • Bound the growth function Π_{F(C,G)}(m)
  • Find the smallest m s.t. Π_{F(C,G)}(m) < 2^m
  • Let S = {x_1, ..., x_m}
  • For each fixed G we define a matrix U:
  • U_{i,j} = c_i(x_j), where c_i is the specific concept at the i-th node
  • U describes the computations of S in G
  • T_{F(C,G)} = the number of different matrices

34
Proof (continued)
  • Clearly Π_{F(C,G)}(m) ≤ T_{F(C,G)}
  • Let G' be G without the root.
  • Π_{F(C,G)}(m) ≤ T_{F(C,G)} ≤ T_{F(C,G')} · Π_C(m)
  • Inductively, Π_{F(C,G)}(m) ≤ Π_C(m)^s
  • Recall the VC bound: Π_C(m) ≤ (em/d)^d
  • Combined bound: Π_{F(C,G)}(m) ≤ (em/d)^{ds}

35
Proof (cont.)
  • Solve (em/d)^{ds} ≤ 2^m for m (the algebra is sketched below)
  • This holds for m ≥ 2ds log(es)
  • QED
  • Back to ANN:
  • VC-dim(C) = n + 1
  • VC-dim(ANN) ≤ 2(n+1)s log(es)
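A hedged sketch of the algebra behind these last bullets (taking the logarithm base 2 is an assumption):

```latex
% Reconstructed final step; \log is taken base 2 here (an assumption).
\begin{align*}
\Pi_{F(C,G)}(m) \le \Big(\tfrac{em}{d}\Big)^{ds} < 2^{m}
  \;&\Longleftrightarrow\; ds\,\log\!\Big(\tfrac{em}{d}\Big) < m,\\
\text{which holds for } m = 2ds\log(es)
  \;&\Longrightarrow\; \mathrm{VCdim}\big(F(C,G)\big) \le 2ds\,\log(es),\\
\text{and with } d = \mathrm{VCdim}(C) = n+1:
  \;&\quad \mathrm{VCdim}(\mathrm{ANN}) \le 2(n+1)\,s\,\log(es).
\end{align*}
```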