Title: CS621 : Artificial Intelligence
1. CS621 Artificial Intelligence
- Pushpak Bhattacharyya, CSE Dept., IIT Bombay
- Lecture 21: Perceptron training, Introducing Feedforward N/W
2. Perceptron as a learning device
3. Perceptron Training Algorithm
- 1. Start with a random value of w, e.g. <0,0,0>
- 2. Test for w.xi > 0
- If the test succeeds for every i = 1, 2, ..., n, then return w
- 3. Modify w: wnext = wprev + xfail
- 4. Go to 2
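- A minimal sketch of the above algorithm in Python (the function name, the iteration cap, and the convention that the vectors are already augmented with a -1 component and the 0-class vectors negated, as on the later slides, are choices made here):

    def pta(vectors, dim, max_iters=10000):
        """Perceptron Training Algorithm on an augmented, negated vector set.

        Seeks w such that w . y > 0 for every vector y; returns w,
        assuming the vectors come from a linearly separable function.
        """
        w = [0.0] * dim                      # step 1: initial w, e.g. <0,0,0>
        for _ in range(max_iters):
            failed = None
            for y in vectors:                # step 2: test w . y > 0 for every vector
                if sum(wi * yi for wi, yi in zip(w, y)) <= 0:
                    failed = y
                    break
            if failed is None:
                return w                     # every test succeeded
            w = [wi + yi for wi, yi in zip(w, failed)]   # step 3: w_next = w_prev + x_fail
        raise RuntimeError("no convergence; function may not be linearly separable")

    # Usage: 2-input OR, augmented with -1 and with the 0-class vector negated
    or_vectors = [(1, 0, 0),    # negation of <-1,0,0>  (input 00, output 0)
                  (-1, 0, 1),   # input 01, output 1
                  (-1, 1, 0),   # input 10, output 1
                  (-1, 1, 1)]   # input 11, output 1
    print(pta(or_vectors, dim=3))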
4. Tracing PTA on the OR example
- w = <0,0,0>; w.x1 fails
- w = <-1,0,1>; w.x4 fails
- w = <0,0,1>; w.x2 fails
- w = <-1,1,1>; w.x1 fails
- w = <0,1,2>; w.x4 fails
- w = <1,1,2>; w.x2 fails
- w = <0,2,2>; w.x4 fails
- w = <1,2,2>; success
5. Proof of Convergence of PTA
- Perceptron Training Algorithm (PTA)
- Statement
- Whatever be the initial choice of weights and
whatever be the vector chosen for testing, PTA
converges if the vectors are from a linearly
separable function.
6. Proof of Convergence of PTA
- Consider the expression
- G(wn) = (wn . w*) / |wn|
- where wn = weight at the nth iteration; w* is a separating weight vector, which exists since the vectors are from a linearly separable function
- G(wn) = (|wn| |w*| cos θ) / |wn|
- where θ = angle between wn and w*
- G(wn) = |w*| cos θ
- G(wn) ≤ |w*|   (as -1 ≤ cos θ ≤ 1)
7. Behavior of Numerator of G
- wn . w* = (wn-1 + X(n-1)fail) . w*
-         = wn-1 . w* + X(n-1)fail . w*
-         = (wn-2 + X(n-2)fail) . w* + X(n-1)fail . w* = ...
-         = w0 . w* + (X(0)fail + X(1)fail + ... + X(n-1)fail) . w*
- w* . X(i)fail is always positive: note carefully
- Suppose |Xj| ≥ δ, where δ is the minimum magnitude.
- Numerator of G ≥ w0 . w* + n δ |w*|
- So, the numerator of G grows with n.
8. Behavior of Denominator of G
- |wn| = √(wn . wn)
-      = √((wn-1 + X(n-1)fail)²)
-      = √((wn-1)² + 2 wn-1 . X(n-1)fail + (X(n-1)fail)²)
-      ≤ √((wn-1)² + (X(n-1)fail)²)   (as wn-1 . X(n-1)fail ≤ 0)
-      ≤ √((w0)² + (X(0)fail)² + (X(1)fail)² + ... + (X(n-1)fail)²)
- |Xj| ≤ μ (max magnitude)
- So, Denom ≤ √((w0)² + n μ²)
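- Putting the two bounds together (a LaTeX restatement of the argument on the last two slides, with w* the separating weight vector, δ the minimum and μ the maximum vector magnitudes):

    \[
      G(w_n) \;=\; \frac{w_n \cdot w^*}{\lVert w_n \rVert}
      \;\ge\; \frac{w_0 \cdot w^* + n\,\delta\,\lVert w^* \rVert}
                   {\sqrt{\lVert w_0 \rVert^2 + n\,\mu^2}}
    \]

- The right-hand side grows like √n, which is the observation drawn on the next two slides.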
9. Some Observations
- Numerator of G grows as n
- Denominator of G grows as √n
- => Numerator grows faster than denominator
- If PTA does not terminate, G(wn) values will
become unbounded.
10. Some Observations contd.
- But, as G(wn) ≤ |w*|, which is finite, this is impossible!
- Hence, PTA has to converge.
- Proof is due to Marvin Minsky.
11. Study of Linear Separability
- W . Xj = 0 defines a hyperplane in the (n+1) dimension.
- => The W vector and the Xj vectors are perpendicular to each other.
[Figure: 2-D illustration of the hyperplane, with axes x1 and y]
12. Linear Separability
- Positive set: w . Xj > 0, ∀ j ≤ k (vectors X1, X2, ..., Xk)
- Negative set: w . Xj < 0, ∀ j > k (vectors Xk+1, Xk+2, ..., Xm)
[Figure: the two sets of vectors on either side of the separating hyperplane]
13. Test for Linear Separability (LS)
- Theorem
- A function is linearly separable iff the vectors corresponding to the function do not have a Positive Linear Combination (PLC)
- PLC: both a necessary and sufficient condition.
- X1, X2, ..., Xm - vectors of the function
- Y1, Y2, ..., Ym - augmented negated set
- Prepending -1 to each Xi gives the augmented set; negating the augmented 0-class vectors gives the Yi
14. Example (1) - XNOR
- The set {Yi} has a PLC if Σ Pi Yi = 0, 1 ≤ i ≤ m,
- where each Pi is a non-negative scalar and
- at least one Pi > 0
- Example: 2-bit even parity (X-NOR function)
15. Example (1) - XNOR
- Y1 = [-1 0 0]T,  Y2 = [1 0 -1]T,
- Y3 = [1 -1 0]T,  Y4 = [-1 1 1]T
- Y1 + Y2 + Y3 + Y4 = [0 0 0]T
- All Pi = 1 gives the result.
- For the parity function,
- PLC exists => not linearly separable.
16. Example (2) - Majority function
- 3-bit majority function
- Suppose a PLC exists. The equations obtained are
- P1 + P2 + P3 - P4 + P5 - P6 - P7 - P8 = 0
- -P5 + P6 + P7 + P8 = 0
- -P3 + P4 + P7 + P8 = 0
- -P2 + P4 + P6 + P8 = 0
- On solving, all Pi are forced to 0
- => No PLC => LS
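- The PLC test can be mechanized as a linear-programming feasibility check: look for a non-negative P, kept away from zero by requiring its components to sum to 1, with Σ Pi Yi = 0. The sketch below uses scipy purely as an illustration (not part of the lecture); the Y vectors are built with the augmentation convention of slide 13.

    import numpy as np
    from scipy.optimize import linprog

    def has_plc(Y):
        """True if the augmented negated set Y (one row per Yi) has a
        Positive Linear Combination, i.e. the function is NOT linearly separable."""
        Y = np.asarray(Y, dtype=float)
        m, d = Y.shape
        # Feasibility LP: find P >= 0 with Y^T P = 0 and sum(P) = 1 (so P != 0)
        A_eq = np.vstack([Y.T, np.ones((1, m))])
        b_eq = np.concatenate([np.zeros(d), [1.0]])
        res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
        return res.success

    # XNOR (2-bit even parity): PLC exists -> not linearly separable
    xnor_Y = [(-1, 0, 0), (1, 0, -1), (1, -1, 0), (-1, 1, 1)]
    print(has_plc(xnor_Y))   # True

    # 3-bit majority: no PLC -> linearly separable
    maj_Y = [(1, 0, 0, 0), (1, 0, 0, -1), (1, 0, -1, 0), (-1, 0, 1, 1),
             (1, -1, 0, 0), (-1, 1, 0, 1), (-1, 1, 1, 0), (-1, 1, 1, 1)]
    print(has_plc(maj_Y))    # False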
17. Limitations of the perceptron
- Non-linear separability is all pervading
- A single perceptron does not have enough computing power
- e.g. XOR cannot be computed by a perceptron
18. Solutions
- Tolerate error (e.g. the pocket algorithm used by connectionist expert systems): try to get the best possible hyperplane using only perceptrons
- Use higher-dimension surfaces
- e.g. degree-2 surfaces like the parabola
- Use a layered network
19. Example - XOR
- Calculation of XOR as (x1 AND NOT x2) OR (NOT x1 AND x2): an output unit with weights w1 = 1, w2 = 1 combines the two conjunct units
- Calculation of (NOT x1 AND x2): a unit over inputs x1, x2 with weights w1 = -1, w2 = 1.5
[Figure: the two-layer network, with x1 and x2 feeding the conjunct units, which in turn feed the output unit]
20. Example - XOR
[Figure: the same XOR network with numeric values filled in: weights 1 and 1 into the output unit, and 1.5, -1, -1, 1.5 on the connections from x1 and x2 to the two conjunct units]
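- A runnable sketch of the layered-network idea, with one standard choice of weights and thresholds (not necessarily the exact values drawn on these slides): two threshold units compute the conjuncts and an output threshold unit ORs them.

    def step(z, theta):
        """Threshold (perceptron) unit: fires iff the weighted sum exceeds theta."""
        return 1 if z > theta else 0

    def xor_net(x1, x2):
        h1 = step(1 * x1 - 1 * x2, 0.5)    # x1 AND NOT x2
        h2 = step(-1 * x1 + 1 * x2, 0.5)   # NOT x1 AND x2
        return step(1 * h1 + 1 * h2, 0.5)  # OR of the two conjuncts = XOR

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))     # prints the XOR truth table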
21. Multi Layer Perceptron (MLP)
- Question: How to find weights for the hidden layers when no target output is available for them?
- Credit assignment problem, to be solved by Gradient Descent
22. Some Terminology
- A multilayer feedforward neural network has
- Input layer
- Output layer
- Hidden layer (assists computation)
- Output units and hidden units are called computation units.
23. Training of the MLP
- Multilayer Perceptron (MLP)
- Question: How to find weights for the hidden layers when no target output is available for them?
- Credit assignment problem, to be solved by Gradient Descent
24. Gradient Descent Technique
- Let E be the error at the output layer:
- E = ½ Σj Σi (ti - oi)², summed over the p patterns (index j) and the n output neurons (index i)
- ti = target output, oi = observed output
- i is the index going over the n neurons in the outermost layer
- j is the index going over the p patterns (1 to p)
- e.g. for XOR, p = 4 and n = 1
25. Weights in a ff NN
- wmn is the weight of the connection from the nth neuron to the mth neuron
- The E vs. W surface is a complex surface in the space defined by the weights wij
- -∂E/∂wmn gives the direction in which a movement of the operating point in the wmn co-ordinate space will result in maximum decrease in error
[Figure: neuron n connected to neuron m through the weight wmn]
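- The corresponding gradient-descent step is the standard one (η is the learning rate, which reappears on the later diagnostics slides):

    \[
      \Delta w_{mn} \;=\; -\,\eta\,\frac{\partial E}{\partial w_{mn}},
      \qquad
      w_{mn} \leftarrow w_{mn} + \Delta w_{mn}
    \]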
26. Sigmoid neurons
- Gradient Descent needs a derivative computation
- not possible in the perceptron due to the discontinuous step function used!
- => Sigmoid neurons with easy-to-compute derivatives are used!
- Computing power comes from the non-linearity of the sigmoid function.
27. Derivative of Sigmoid function
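- The standard sigmoid and its derivative, which is what makes the computation easy (net denotes the weighted input to the neuron):

    \[
      o \;=\; \sigma(net) \;=\; \frac{1}{1 + e^{-net}},
      \qquad
      \frac{do}{d\,net} \;=\; \sigma(net)\,\bigl(1 - \sigma(net)\bigr) \;=\; o\,(1 - o)
    \]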
28. Training algorithm
- Initialize weights to random values.
- For input x = <xn, xn-1, ..., x0>, modify weights as follows
- Target output t, observed output o
- Iterate until E < ε (threshold)
29. Calculation of Δwi
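- For a single sigmoid neuron with error E = ½(t - o)², the chain rule gives the standard update; this is the Δwi the next slides comment on:

    \[
      \Delta w_i \;=\; -\,\eta\,\frac{\partial E}{\partial w_i}
      \;=\; \eta\,(t - o)\,\frac{\partial o}{\partial net}\,\frac{\partial net}{\partial w_i}
      \;=\; \eta\,(t - o)\,o\,(1 - o)\,x_i
    \]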
30. Observations
- Does the training technique support our intuition?
- The larger the xi, the larger is Δwi
- The error burden is borne by the weight values corresponding to large input values
31. Observations contd.
- Δwi is proportional to the departure from the target
- Saturation behaviour when o is 0 or 1
- If o < t, Δwi > 0, and if o > t, Δwi < 0, which is consistent with Hebb's law
32. Hebb's law
[Figure: neuron ni connected to neuron nj through the weight wji]
- If nj and ni are both in the excitatory state (+1)
- Then the change in weight must be such that it enhances the excitation
- The change is proportional to both levels of excitation
- Δwji ∝ e(nj) e(ni)
- If ni and nj are in a mutual state of inhibition (one is +1 and the other is -1),
- Then the change in weight is such that the inhibition is enhanced (the change in weight is negative)
33. Saturation behavior
- The algorithm is iterative and incremental
- If the weight values or the number of input values is very large, the output will be large and will lie in the saturation region.
- The weight values hardly change in the saturation region
34. Backpropagation algorithm
[Figure: layered network with an input layer (n i/p neurons), hidden layers, and an output layer (m o/p neurons); wji is the weight on the connection from neuron i to neuron j in the layer above]
- Fully connected feed forward network
- Pure FF network (no jumping of connections over
layers)
35. Gradient Descent Equations
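- The standard decomposition these equations stand for, with net_j the weighted input of neuron j and o_i the output of the neuron i feeding it:

    \[
      \Delta w_{ji} \;=\; -\,\eta\,\frac{\partial E}{\partial w_{ji}}
      \;=\; -\,\eta\,\frac{\partial E}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}}
      \;=\; \eta\,\delta_j\,o_i,
      \qquad
      \delta_j \;=\; -\,\frac{\partial E}{\partial net_j}
    \]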
36. Example - Character Recognition
- Output layer: 26 neurons (all capital letters)
- The first output neuron has the responsibility of detecting all forms of 'A'
- Centralized representation of outputs
- In distributed representations, all output neurons participate in the output
37. Backpropagation for outermost layer
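- For sigmoid output units and squared error, the outermost-layer rule takes the standard form:

    \[
      \delta_j \;=\; (t_j - o_j)\,o_j\,(1 - o_j),
      \qquad
      \Delta w_{ji} \;=\; \eta\,(t_j - o_j)\,o_j\,(1 - o_j)\,o_i
    \]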
38. Backpropagation for hidden layers
[Figure: neuron i in one layer feeding neuron j in the hidden layer above it, which in turn feeds neurons k in the output layer (m o/p neurons); the input layer has n i/p neurons]
δk is propagated backwards to find the value of δj
39. Backpropagation for hidden layers
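- The standard hidden-layer rule pictured above, with k ranging over the neurons of the layer that neuron j feeds into:

    \[
      \delta_j \;=\; o_j\,(1 - o_j)\,\sum_{k} w_{kj}\,\delta_k
    \]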
40. General Backpropagation Rule
- General weight updating rule: Δwji = η δj oi
- where δj = (tj - oj) oj (1 - oj) for the outermost layer
- and δj = oj (1 - oj) Σk wkj δk for hidden layers
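- A compact sketch of the general rule in code, for a single hidden layer (an illustration under conventions chosen here, not the course's reference implementation: numpy, with inputs and hidden outputs augmented by a constant -1 so the first weight of each unit acts as its threshold):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, t, W1, W2, eta=0.5):
        """One gradient-descent step on a single pattern for a 1-hidden-layer MLP."""
        x_aug = np.concatenate(([-1.0], x))
        o_hid = sigmoid(W1 @ x_aug)                 # hidden-layer outputs
        h_aug = np.concatenate(([-1.0], o_hid))
        o_out = sigmoid(W2 @ h_aug)                 # output-layer outputs

        # deltas: outermost layer first, then propagated back to the hidden layer
        delta_out = (t - o_out) * o_out * (1 - o_out)
        delta_hid = o_hid * (1 - o_hid) * (W2[:, 1:].T @ delta_out)

        # general rule: delta_w_ji = eta * delta_j * o_i  (updates happen in place)
        W2 += eta * np.outer(delta_out, h_aug)
        W1 += eta * np.outer(delta_hid, x_aug)
        return o_out

    # Usage: XOR (p = 4 patterns, n = 1 output neuron, 2 hidden neurons)
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.5, size=(2, 3))   # 2 hidden units, 2 inputs + threshold
    W2 = rng.normal(scale=0.5, size=(1, 3))   # 1 output unit, 2 hidden + threshold
    patterns = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
    for _ in range(10000):
        for x, t in patterns:
            backprop_step(np.array(x, float), np.array(t, float), W1, W2)

- After enough passes the outputs approach the XOR targets, unless a run gets stuck in a local minimum, which is exactly the issue the next slides turn to.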
41. Issues in the training algorithm
- The algorithm is greedy. It always changes the weights such that E reduces.
- The algorithm may get stuck in a local minimum.
- If we observe that E is not getting reduced anymore, the following may be the reasons:
42. Issues in the training algorithm contd.
- Stuck in a local minimum.
- Network paralysis. (High +ve or -ve i/p makes neurons saturate.)
- η (the learning rate) is too small.
43. Diagnostics in action (1)
- 1) If stuck in a local minimum, try the following:
- Re-initialize the weight vector.
- Increase the learning rate.
- Introduce more neurons in the hidden layer.
44. Diagnostics in action (1) contd.
- 2) If it is network paralysis, then increase the number of neurons in the hidden layer.
- Problem: How to configure the hidden layer?
- Known: One hidden layer seems to be sufficient. Kolmogorov (1960s)
45. Diagnostics in action (2)
- Kolmogorov's statement
- A feedforward network with three layers (input, output and hidden), with an appropriate I/O relation that can vary from neuron to neuron, is sufficient to compute any function.
- More hidden layers reduce the size of the individual layers.
46. Diagnostics in action (3)
- 3) Observe the outputs: if they are close to 0 or 1, try the following:
- Scale the inputs, or divide them by a normalizing factor.
- Change the shape and size of the sigmoid.
47. Diagnostics in action (3)
- Introduce a momentum factor.
- Accelerates the movement out of the trough.
- Dampens oscillation inside the trough.
- Choosing the momentum factor: if it is large, we may jump over the minimum.
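- The momentum-augmented update being described has the standard form (β, the momentum factor, is a symbol chosen here; the slide leaves it unnamed):

    \[
      \Delta w_{ji}(t) \;=\; \eta\,\delta_j\,o_i \;+\; \beta\,\Delta w_{ji}(t-1),
      \qquad 0 \le \beta < 1
    \]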
48. An application in the Medical Domain
49. Expert System for Skin Disease Diagnosis
- Bumpiness and scaliness of skin
- Mostly for symptom gathering and for developing diagnosis skills
- Not replacing the doctor's diagnosis
50. Architecture of the FF NN
- 96-20-10
- 96 input neurons, 20 hidden-layer neurons, 10 output neurons
- Inputs: skin disease symptoms and their parameters
- Location, distribution, shape, arrangement, pattern, number of lesions, presence of an active border, amount of scale, elevation of papules, color, altered pigmentation, itching, pustules, lymphadenopathy, palmar thickening, results of microscopic examination, presence of herald patch, result of the dermatology test called KOH
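- In code, the 96-20-10 architecture amounts to the following parameter shapes (a sketch only; the layer sizes are the ones quoted above, while the bias handling and the symptom encoding are assumptions):

    import numpy as np

    n_in, n_hidden, n_out = 96, 20, 10        # 96-20-10, as on this slide
    rng = np.random.default_rng(42)
    W_hidden = rng.normal(scale=0.1, size=(n_hidden, n_in + 1))   # 20 x 97 (with threshold)
    W_out = rng.normal(scale=0.1, size=(n_out, n_hidden + 1))     # 10 x 21 (with threshold)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def diagnose(symptoms):
        """symptoms: vector of 96 encoded symptom parameters (0.5 where a value is unknown)."""
        h = sigmoid(W_hidden @ np.concatenate(([-1.0], symptoms)))
        return sigmoid(W_out @ np.concatenate(([-1.0], h)))       # 10 disease scores

    print(diagnose(np.full(96, 0.5)).shape)   # (10,)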
51. Output
- 10 neurons indicative of the diseases:
- psoriasis, pityriasis rubra pilaris, lichen planus, pityriasis rosea, tinea versicolor, dermatophytosis, cutaneous T-cell lymphoma, secondary syphilis, chronic contact dermatitis, seborrheic dermatitis
52. Training data
- Input specs of the 10 model diseases from 250 patients
- 0.5 if some specific symptom value is not known
- Trained using the standard error backpropagation algorithm
53. Testing
- Previously unused symptom and disease data of 99 patients
- Result:
- Correct diagnosis achieved for 70% of papulosquamous group skin diseases
- Success rate above 80% for the remaining diseases, except for psoriasis
- Psoriasis diagnosed correctly in only 30% of the cases
- Psoriasis resembles other diseases within the papulosquamous group of diseases, and is somewhat difficult even for specialists to recognise.
54. Explanation capability
- Rule-based systems reveal the explicit path of reasoning through textual statements
- Connectionist expert systems reach conclusions through the complex, non-linear and simultaneous interaction of many units
- Analysing the effect of a single input or a single group of inputs would be difficult and would yield incorrect results
55. Explanation contd.
- The hidden layer re-represents the data
- Outputs of hidden neurons are neither symptoms nor decisions
57. Discussion
- Symptoms and parameters contributing to the diagnosis are found from the n/w
- Standard deviation, mean and other tests of significance are used to arrive at the importance of the contributing parameters
- The n/w acts as an apprentice to the expert