CS621 : Artificial Intelligence - PowerPoint PPT Presentation
1
CS621 Artificial Intelligence
  • Pushpak Bhattacharyya, CSE Dept., IIT Bombay
  • Lecture 21: Perceptron training; Introducing Feedforward N/W

2
Perceptron as a learning device
3
Perceptron Training Algorithm
  • 1. Start with a random value of w, e.g. <0,0,0>
  • 2. Test for w.xi > 0
  •    If the test succeeds for i = 1, 2, …, n, then return w
  • 3. Modify w: wnext = wprev + xfail
  • 4. Go to 2
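A minimal sketch of the algorithm above in Python (the function name perceptron_train and the iteration cap are illustrative; the input vectors are assumed to be already augmented and negated, so the goal is w . x > 0 for every vector):

```python
import random

def perceptron_train(vectors, w=None, max_iters=100000):
    """Perceptron Training Algorithm on augmented, negated vectors."""
    dim = len(vectors[0])
    w = list(w) if w is not None else [0.0] * dim        # step 1: initial w, e.g. <0,0,0>
    for _ in range(max_iters):
        failed = [x for x in vectors
                  if sum(wi * xi for wi, xi in zip(w, x)) <= 0]
        if not failed:                                    # step 2: w.xi > 0 for every i
            return w
        x_fail = random.choice(failed)                    # step 3: pick a failing vector,
        w = [wi + xi for wi, xi in zip(w, x_fail)]        #         w_next = w_prev + x_fail
    raise RuntimeError("no convergence; data may not be linearly separable")  # step 4 loops back to 2
```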

4
Tracing PTA on OR-example
  • w = <0,0,0>; w.x1 fails
  • w = <-1,0,1>; w.x4 fails
  • w = <0,0,1>; w.x2 fails
  • w = <-1,1,1>; w.x1 fails
  • w = <0,1,2>; w.x4 fails
  • w = <1,1,2>; w.x2 fails
  • w = <0,2,2>; w.x4 fails
  • w = <1,2,2>; success
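Running the sketch above on the OR example (the augmented, negated vectors below are one possible encoding; because the failing vector is picked arbitrarily, the intermediate weights need not match the trace on this slide, but PTA still terminates with a separating w):

```python
# OR on inputs (x1, x2): the 0-class vector (0,0) is augmented with -1 and negated;
# the 1-class vectors just get -1 prepended as the threshold component.
X_or = [
    (1, 0, 0),    # from 0-class input (0,0)
    (-1, 0, 1),   # from 1-class input (0,1)
    (-1, 1, 0),   # from 1-class input (1,0)
    (-1, 1, 1),   # from 1-class input (1,1)
]
w = perceptron_train(X_or, w=[0, 0, 0])
print(w, all(sum(wi * xi for wi, xi in zip(w, x)) > 0 for x in X_or))
```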

5
Proof of Convergence of PTA
  • Perceptron Training Algorithm (PTA)
  • Statement
  • Whatever be the initial choice of weights and
    whatever be the vector chosen for testing, PTA
    converges if the vectors are from a linearly
    separable function.

6
Proof of Convergence of PTA
  • Consider the expression
  •   G(wn) = (wn . w*) / |wn|
  •   where wn = the weight vector at the nth iteration; w* exists since the vectors are from a linearly separable function
  • G(wn) = (|wn| . |w*| . cos θ) / |wn|
  •   where θ = angle between wn and w*
  • G(wn) = |w*| . cos θ
  • G(wn) ≤ |w*|   (as -1 ≤ cos θ ≤ 1)

7
Behavior of Numerator of G
  • wn . w* = (wn-1 + Xn-1fail) . w*
  •   = wn-1 . w* + Xn-1fail . w*
  •   = (wn-2 + Xn-2fail) . w* + Xn-1fail . w* = …
  •   = w0 . w* + (X0fail + X1fail + … + Xn-1fail) . w*
  • w* . Xifail is always positive: note carefully
  • Let δ = min over j of (w* . Xj); δ > 0, since w* classifies every training vector correctly and there are finitely many of them.
  • Numerator of G ≥ w0 . w* + n . δ
  • So, the numerator of G grows with n.

8
Behavior of Denominator of G
  • |wn|2 = wn . wn
  •   = (wn-1 + Xn-1fail)2
  •   = (wn-1)2 + 2 . wn-1 . Xn-1fail + (Xn-1fail)2
  •   ≤ (wn-1)2 + (Xn-1fail)2   (as wn-1 . Xn-1fail ≤ 0, since Xn-1fail failed the test)
  •   ≤ (w0)2 + (X0fail)2 + (X1fail)2 + … + (Xn-1fail)2
  • Let |Xj| ≤ M for all j, where M is the maximum magnitude of the training vectors.
  • So, Denominator2 = |wn|2 ≤ (w0)2 + n . M2, i.e. the denominator of G grows at most as √n.

9
Some Observations
  • Numerator of G grows as n
  • Denominator of G grows as √n
  • ⇒ the numerator grows faster than the denominator
  • If PTA does not terminate, the G(wn) values will become unbounded.

10
Some Observations contd.
  • But G(wn) ≤ |w*|, which is finite, so this is impossible!
  • Hence, PTA has to converge.
  • Proof is due to Marvin Minsky.
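Putting the two bounds together makes the contradiction explicit (a sketch; δ is the minimum of w* . Xj over the training vectors and M the maximum vector magnitude, as introduced above):

$$G(w_n) \;=\; \frac{w_n \cdot w^*}{|w_n|} \;\ge\; \frac{w_0 \cdot w^* + n\,\delta}{\sqrt{|w_0|^2 + n\,M^2}} \;\sim\; \sqrt{n},$$

which grows without bound in the number of updates n, while G(wn) ≤ |w*| is finite. Hence the number of weight updates is bounded and PTA terminates.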

11
Study of Linear Separability
  • W . Xj = 0 defines a hyperplane in the (n+1)-dimensional space.
  • ⇒ the W vector and the Xj vectors are perpendicular to each other.

(Figure: sketch of the separating hyperplane in the x1–y plane)
12
Linear Separability
(Figure: the vectors X1, X2, …, Xk lie on the positive side of the separating hyperplane and Xk+1, Xk+2, …, Xm on the negative side)
Positive set: w . Xj > 0 for all j ≤ k
Negative set: w . Xj < 0 for all j > k
13
Test for Linear Separability (LS)
  • Theorem
  • A function is linearly separable iff the vectors corresponding to the function do not have a Positive Linear Combination (PLC)
  • PLC: both a necessary and sufficient condition.
  • X1, X2, …, Xm - vectors of the function
  • Y1, Y2, …, Ym - augmented, negated set
  • Prepending -1 to the 0-class vector Xi and negating it gives Yi (the 1-class vectors are simply augmented with -1)

14
Example (1) - XNOR
  • The set Yi has a PLC if Σ Pi Yi = 0, 1 ≤ i ≤ m,
  • where each Pi is a non-negative scalar and
  • at least one Pi > 0
  • Example: 2-bit even parity (the X-NOR function)

15
Example (1) - XNOR
  • P1 [-1 0 0]T + P2 [1 0 -1]T + P3 [1 -1 0]T + P4 [-1 1 1]T = [0 0 0]T
  • All Pi = 1 gives the result.
  • For the parity function, a PLC exists ⇒ not linearly separable.
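A quick check of the combination above in plain Python (the Yi are the augmented, negated XNOR vectors from the slide; taking every Pi = 1):

```python
# Augmented, negated vectors Y1..Y4 for the 2-bit XNOR function, as on the slide.
Y = [(-1, 0, 0), (1, 0, -1), (1, -1, 0), (-1, 1, 1)]
P = [1, 1, 1, 1]                       # non-negative, at least one > 0
combo = [sum(p * y[k] for p, y in zip(P, Y)) for k in range(3)]
print(combo)                           # [0, 0, 0] -> a PLC exists, so XNOR is not LS
```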

16
Example (2) Majority function
  • 3-bit majority function
  • Suppose a PLC exists. The equations obtained (one per component of Σ Pi Yi = 0) are
  • P1 + P2 + P3 - P4 + P5 - P6 - P7 - P8 = 0
  • -P5 + P6 + P7 + P8 = 0
  • -P3 + P4 + P7 + P8 = 0
  • -P2 + P4 + P6 + P8 = 0
  • On solving, all Pi are forced to 0.
  • So for the 3-bit majority function: no PLC ⇒ linearly separable (LS). A programmatic version of this test is sketched below.
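The same test can be run mechanically: a PLC exists iff the system Σ Pi Yi = 0, Pi ≥ 0, Σ Pi = 1 is feasible (the normalisation Σ Pi = 1 stands in for "at least one Pi > 0"). A sketch using SciPy's linear-programming routine; the function name has_plc and the vector ordering are illustrative choices:

```python
import numpy as np
from scipy.optimize import linprog

def has_plc(Y):
    """True iff the augmented, negated vectors Y admit a Positive Linear
    Combination, i.e. a non-negative, non-zero P with sum_i P_i * Y_i = 0
    (which would mean the function is NOT linearly separable)."""
    Y = np.asarray(Y, dtype=float)           # shape (m, d)
    m, d = Y.shape
    A_eq = np.vstack([Y.T, np.ones(m)])      # d equations Y^T P = 0, plus sum(P) = 1
    b_eq = np.append(np.zeros(d), 1.0)
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    return res.success

# 3-bit majority, each vector augmented with -1 and the 0-class ones negated
# (inputs listed in the order 000, 001, ..., 111):
majority = [(1, 0, 0, 0), (1, 0, 0, -1), (1, 0, -1, 0), (-1, 0, 1, 1),
            (1, -1, 0, 0), (-1, 1, 0, 1), (-1, 1, 1, 0), (-1, 1, 1, 1)]
print(has_plc(majority))                     # False -> no PLC -> linearly separable
```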

17
Limitations of perceptron
  • Non-linear separability is all-pervading
  • A single perceptron does not have enough computing power
  • E.g., XOR cannot be computed by a perceptron

18
Solutions
  • Tolerate error (e.g., the pocket algorithm used by connectionist expert systems).
  • Try to get the best possible hyperplane using only perceptrons
  • Use higher-order (non-linear) separating surfaces
  • E.g., degree-2 surfaces like the parabola
  • Use a layered network

19
Example - XOR
(Figure: calculation of XOR by a layered network of threshold units. The output unit, with threshold 0.5 and weights w1 = 1 and w2 = 1, combines two hidden units computing x1·(NOT x2) and (NOT x1)·x2. The slide also details the calculation of the (NOT x1)·x2 unit: threshold 1, weight w1 = -1 from input x1 and w2 = 1.5 from input x2.)
20
Example - XOR
(Figure: the complete XOR network. Output unit: threshold 0.5, weights 1 and 1 from the two hidden units. Hidden units: each with threshold 1; the x1·(NOT x2) unit has weights 1.5 from x1 and -1 from x2, and the (NOT x1)·x2 unit has weights -1 from x1 and 1.5 from x2. A sketch of this network in code follows.)
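The network read off the figure above can be checked directly (a sketch; the exact thresholds and weights are my reading of the reconstructed figure, and a threshold unit is taken to fire 1 when its net input strictly exceeds its threshold):

```python
def step(net, theta):
    """Threshold unit: fires 1 iff the net input exceeds the threshold theta."""
    return 1 if net > theta else 0

def xor(x1, x2):
    h1 = step(1.5 * x1 - 1.0 * x2, 1.0)    # computes x1 AND (NOT x2)
    h2 = step(-1.0 * x1 + 1.5 * x2, 1.0)   # computes (NOT x1) AND x2
    return step(1.0 * h1 + 1.0 * h2, 0.5)  # ORs the two hidden outputs

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))             # prints 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```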
21
Multi Layer Perceptron (MLP)
  • Question: how to find weights for the hidden layers when no target output is available?
  • Credit assignment problem - to be solved by Gradient Descent

22
Some Terminology
  • A multilayer feedforward neural network has
  • Input layer
  • Output layer
  • Hidden layer (assists computation)
  • Output units and hidden units are called
  • computation units.

23
Training of the MLP
  • Multilayer Perceptron (MLP)
  • Question: how to find weights for the hidden layers when no target output is available?
  • Credit assignment problem - to be solved by Gradient Descent

24
Gradient Descent Technique
  • Let E be the error at the output layer
  • ti = target output, oi = observed output
  • i is the index going over the n neurons in the outermost layer
  • j is the index going over the p patterns (1 to p)
  • E.g., for XOR: p = 4 and n = 1
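The expression for E itself was on the slide image and is not in the transcript; the standard sum-of-squares form consistent with the index conventions above would be

$$E \;=\; \frac{1}{2} \sum_{j=1}^{p} \sum_{i=1}^{n} \bigl(t_{ij} - o_{ij}\bigr)^2,$$

where t_ij and o_ij are the target and observed outputs of neuron i for pattern j.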

25
Weights in a ff NN
  • wmn is the weight of the connection from the nth neuron to the mth neuron
  • The E versus W surface is a complex surface in the space defined by the weights wij
  • The negative gradient -∂E/∂wmn gives the direction in which a movement of the operating point in the wmn co-ordinate space will result in maximum decrease in error

(Figure: neurons m and n connected by the weight wmn)
26
Sigmoid neurons
  • Gradient Descent needs a derivative computation
  • - not possible in the perceptron due to the discontinuous step function used!
  • ⇒ Sigmoid neurons, with easy-to-compute derivatives, are used!
  • Computing power comes from the non-linearity of the sigmoid function.

27
Derivative of Sigmoid function
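The slide's content is an image; for reference, the sigmoid and its derivative are

$$\sigma(x) \;=\; \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma}{dx} \;=\; \sigma(x)\,\bigl(1 - \sigma(x)\bigr),$$

so the derivative can be computed directly from the neuron's output.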
28
Training algorithm
  • Initialize weights to random values.
  • For input x = <xn, xn-1, …, x0>, modify weights as follows
  • Target output = t, Observed output = o
  • Iterate until E < ε (a threshold)

29
Calculation of Δwi
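The calculation itself is on the slide image; a sketch of the standard derivation for a single sigmoid neuron with E = ½(t - o)², o = σ(Σ_i w_i x_i) and learning rate η:

$$\Delta w_i \;=\; -\eta\,\frac{\partial E}{\partial w_i} \;=\; \eta\,(t - o)\,o\,(1 - o)\,x_i.$$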
30
Observations
  • Does the training technique support our
    intuition?
  • The larger the xi, the larger is Δwi
  • The error burden is borne by the weight values corresponding to large input values

31
Observations contd.
  • Δwi is proportional to the departure from the target
  • Saturation behaviour when o is 0 or 1
  • If o < t, Δwi > 0, and if o > t, Δwi < 0, which is consistent with Hebb's law

32
Hebb's law

(Figure: neurons ni and nj connected by the weight wji)

  • If nj and ni are both in the excitatory state (+1),
  • then the change in weight must be such that it enhances the excitation
  • The change is proportional to both the levels of excitation: Δwji ∝ e(nj) e(ni)
  • If ni and nj are in a mutual state of inhibition (one is +1 and the other is -1),
  • then the change in weight is such that the inhibition is enhanced (the change in weight is negative)

33
Saturation behavior
  • The algorithm is iterative and incremental
  • If the weight values or the number of inputs is very large, the net input becomes large and the output lands in the saturation region.
  • The weight values hardly change in the saturation
    region

34
Backpropagation algorithm
(Figure: a fully connected feedforward network with an input layer of n input neurons, one or more hidden layers, and an output layer of m output neurons; wji is the weight of the connection from neuron i to neuron j in the next layer.)

  • Fully connected feedforward network
  • Pure FF network (no jumping of connections over layers)

35
Gradient Descent Equations
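The equations on this slide are an image; the standard gradient-descent form, consistent with the notation of the surrounding slides, would be

$$\Delta w_{mn} \;=\; -\eta\,\frac{\partial E}{\partial w_{mn}}, \qquad \eta > 0 \ \text{(learning rate)},$$

applied to every weight of the feedforward network.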
36
Example - Character Recognition
  • Output layer: 26 neurons (one per capital letter)
  • The first output neuron has the responsibility of detecting all forms of 'A'
  • This is a centralized representation of the outputs
  • In distributed representations, all output neurons participate in the output

37
Backpropagation for outermost layer
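The derivation is on the slide image; the standard result for a sigmoid output neuron j receiving input o_i over the weight w_ji is

$$\Delta w_{ji} \;=\; \eta\,\delta_j\,o_i, \qquad \delta_j \;=\; (t_j - o_j)\,o_j\,(1 - o_j).$$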
38
Backpropagation for hidden layers
(Figure: the error term δk at output-layer neuron k is propagated backwards through the weight wkj to the hidden-layer neuron j, which in turn connects to neuron i below it; the network has an output layer of m neurons, hidden layers, and an input layer of n neurons.)

δk is propagated backwards to find the value of δj
39
Backpropagation for hidden layers
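Again the derivation is an image; the standard hidden-layer rule, with k ranging over the neurons of the layer above j, is

$$\Delta w_{ji} \;=\; \eta\,\delta_j\,o_i, \qquad \delta_j \;=\; o_j\,(1 - o_j)\sum_{k} w_{kj}\,\delta_k.$$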
40
General Backpropagation Rule
  • General weight updating rule: Δwji = η δj oi
  • where δj = (tj - oj) oj (1 - oj) for the outermost layer
  • and δj = oj (1 - oj) Σk wkj δk for hidden layers (a worked sketch in code follows)
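A compact sketch tying these rules together: one hidden layer of sigmoid units, sum-of-squares error, trained on XOR. The 2-2-1 architecture, learning rate, iteration count and random seed are illustrative choices, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR patterns (p = 4) and targets (n = 1 output neuron)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=1.0, size=(3, 2))    # input (+ bias column) -> 2 hidden neurons
W2 = rng.normal(scale=1.0, size=(3, 1))    # hidden (+ bias) -> 1 output neuron
eta = 0.5

for epoch in range(20000):
    Xb = np.hstack([X, np.ones((4, 1))])   # append a constant bias input of 1
    H = sigmoid(Xb @ W1)                   # hidden outputs o_j
    Hb = np.hstack([H, np.ones((4, 1))])
    O = sigmoid(Hb @ W2)                   # network outputs

    delta_out = (T - O) * O * (1 - O)                 # outermost-layer delta
    delta_hid = H * (1 - H) * (delta_out @ W2[:2].T)  # delta propagated to the hidden layer

    W2 += eta * Hb.T @ delta_out           # delta_w_ji = eta * delta_j * o_i (batch form)
    W1 += eta * Xb.T @ delta_hid

print(np.round(O.ravel(), 2))  # typically close to [0, 1, 1, 0]; a bad start can
                               # stall in a local minimum (see the 'Issues' slides)
```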
41
Issues in the training algorithm
  • The algorithm is greedy. It always changes the weights such that E reduces.
  • The algorithm may get stuck in a local minimum.
  • If we observe that E is not getting reduced anymore, the following may be the reasons:

42
Issues in the training algorithm contd.
  • Stuck in a local minimum.
  • Network paralysis. (Highly positive or negative inputs make the neurons saturate.)
  • The learning rate η is too small.

43
Diagnostics in action(1)
  • 1) If stuck in a local minimum, try the following:
  • Re-initialize the weight vector.
  • Increase the learning rate.
  • Introduce more neurons in the hidden layer.

44
Diagnostics in action (1) contd.
  • 2) If it is network paralysis, then increase the number of neurons in the hidden layer.
  • Problem: how to configure the hidden layer?
  • Known: one hidden layer seems to be sufficient. Kolmogorov (1960s)

45
Diagnostics in action(2)
  • Kolmogorov's statement:
  • A feedforward network with three layers (input, output and hidden), with an appropriate I/O relation that can vary from neuron to neuron, is sufficient to compute any function.
  • More hidden layers reduce the size of the individual layers.

46
Diagnostics in action(3)
  • 3) Observe the outputs. If they are close to 0 or 1, try the following:
  • Scale the inputs, or divide by a normalizing factor.
  • Change the shape and size of the sigmoid.

47
Diagnostics in action(3)
  • Introduce a momentum factor.
  • It accelerates the movement out of a trough.
  • It dampens oscillation inside a trough.
  • Choosing the momentum factor: if it is too large, we may jump over the minimum.
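The momentum-augmented update, written with the usual symbols η for the learning rate and β for the momentum factor (the slide's own symbol did not survive the transcript), is

$$\Delta w(t) \;=\; -\eta\,\frac{\partial E}{\partial w} \;+\; \beta\,\Delta w(t-1), \qquad 0 \le \beta < 1.$$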

48
An application in Medical Domain
49
Expert System for Skin Diseases Diagnosis
  • Bumpiness and scaliness of skin
  • Mostly for symptom gathering and for developing
    diagnosis skills
  • Not replacing the doctor's diagnosis

50
Architecture of the FF NN
  • 96-20-10
  • 96 input neurons, 20 hidden-layer neurons, 10 output neurons (a sketch of these shapes follows)
  • Inputs: skin disease symptoms and their parameters
  • Location, distribution, shape, arrangement, pattern, number of lesions, presence of an active border, amount of scale, elevation of papules, color, altered pigmentation, itching, pustules, lymphadenopathy, palmar thickening, results of microscopic examination, presence of herald patch, result of the dermatology test called KOH
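For concreteness, the 96-20-10 architecture amounts to two weight matrices of the shapes below (a sketch only; the original system's encoding and implementation details are not given in the slides, and the extra row in each matrix is an assumed bias/threshold input):

```python
import numpy as np

n_in, n_hidden, n_out = 96, 20, 10      # the 96-20-10 feedforward network
W1 = np.zeros((n_in + 1, n_hidden))     # input (+ bias) -> hidden weights
W2 = np.zeros((n_hidden + 1, n_out))    # hidden (+ bias) -> output weights
print(W1.shape, W2.shape)               # (97, 20) (21, 10)
```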

51
Output
  • 10 neurons, indicative of the diseases:
  • psoriasis, pityriasis rubra pilaris, lichen planus, pityriasis rosea, tinea versicolor, dermatophytosis, cutaneous T-cell lymphoma, secondary syphilis, chronic contact dermatitis, seborrheic dermatitis

52
Training data
  • Input: specs of the 10 model diseases from 250 patients
  • 0.5 if some specific symptom value is not known
  • Trained using the standard error backpropagation algorithm

53
Testing
  • Previously unused symptom and disease data of 99
    patients
  • Result
  • Correct diagnosis achieved for 70% of the papulosquamous group of skin diseases
  • Success rate above 80% for the remaining diseases, except for psoriasis
  • Psoriasis was diagnosed correctly in only 30% of the cases
  • Psoriasis resembles other diseases within the papulosquamous group, and is somewhat difficult even for specialists to recognise.

54
Explanation capability
  • Rule-based systems reveal the explicit path of reasoning through textual statements
  • Connectionist expert systems reach conclusions through complex, non-linear and simultaneous interaction of many units
  • Analysing the effect of a single input or a single group of inputs would be difficult and would yield incorrect results

55
Explanation contd.
  • The hidden layer re-represents the data
  • Outputs of hidden neurons are neither symptoms nor decisions

56
(No Transcript)
57
Discussion
  • Symptoms and parameters contributing to the diagnosis are found from the n/w
  • Standard deviation, mean and other tests of significance are used to arrive at the importance of the contributing parameters
  • The n/w acts as an apprentice to the expert