CS621 : Artificial Intelligence - PowerPoint PPT Presentation
1
CS621 Artificial Intelligence
  • Pushpak Bhattacharyya, CSE Dept., IIT Bombay
  • Lecture 21: Perceptron training; Introducing Feedforward N/W

2
Perceptron as a learning device
3
Perceptron Training Algorithm
  • 1. Start with a random value of w, e.g. <0,0,0>
  • 2. Test for w.xi > 0
  •    If the test succeeds for i = 1, 2, …, n, then return w
  • 3. Modify w: wnext = wprev + xfail
  • 4. Go to 2
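A minimal sketch of the algorithm above in Python (the function name perceptron_train and the iteration cap are illustrative; the input vectors are assumed to be already augmented and negated, so the goal is w . x > 0 for every vector):

```python
import random

def perceptron_train(vectors, w=None, max_iters=100000):
    """Perceptron Training Algorithm on augmented, negated vectors."""
    dim = len(vectors[0])
    w = list(w) if w is not None else [0.0] * dim        # step 1: initial w, e.g. <0,0,0>
    for _ in range(max_iters):
        failed = [x for x in vectors
                  if sum(wi * xi for wi, xi in zip(w, x)) <= 0]
        if not failed:                                    # step 2: w.xi > 0 for every i
            return w
        x_fail = random.choice(failed)                    # step 3: pick a failing vector,
        w = [wi + xi for wi, xi in zip(w, x_fail)]        #         w_next = w_prev + x_fail
    raise RuntimeError("no convergence; data may not be linearly separable")  # step 4 loops back to 2
```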

4
Tracing PTA on OR-example
  • w = <0,0,0>; w.x1 fails
  • w = <-1,0,1>; w.x4 fails
  • w = <0,0,1>; w.x2 fails
  • w = <-1,1,1>; w.x1 fails
  • w = <0,1,2>; w.x4 fails
  • w = <1,1,2>; w.x2 fails
  • w = <0,2,2>; w.x4 fails
  • w = <1,2,2>; success
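Running the sketch above on the OR example (the augmented, negated vectors below are one possible encoding; because the failing vector is picked arbitrarily, the intermediate weights need not match the trace on this slide, but PTA still terminates with a separating w):

```python
# OR on inputs (x1, x2): the 0-class vector (0,0) is augmented with -1 and negated;
# the 1-class vectors just get -1 prepended as the threshold component.
X_or = [
    (1, 0, 0),    # from 0-class input (0,0)
    (-1, 0, 1),   # from 1-class input (0,1)
    (-1, 1, 0),   # from 1-class input (1,0)
    (-1, 1, 1),   # from 1-class input (1,1)
]
w = perceptron_train(X_or, w=[0, 0, 0])
print(w, all(sum(wi * xi for wi, xi in zip(w, x)) > 0 for x in X_or))
```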

5
Proof of Convergence of PTA
  • Perceptron Training Algorithm (PTA)
  • Statement
  • Whatever be the initial choice of weights and
    whatever be the vector chosen for testing, PTA
    converges if the vectors are from a linearly
    separable function.

6
Proof of Convergence of PTA
  • Consider the expression
  •   G(wn) = (wn . w*) / |wn|
  •   where wn = the weight vector at the nth iteration; w* exists since the vectors are from a linearly separable function
  • G(wn) = (|wn| . |w*| . cos θ) / |wn|
  •   where θ = angle between wn and w*
  • G(wn) = |w*| . cos θ
  • G(wn) ≤ |w*|   (as -1 ≤ cos θ ≤ 1)

7
Behavior of Numerator of G
  • wn . w* = (wn-1 + Xn-1fail) . w*
  •   = wn-1 . w* + Xn-1fail . w*
  •   = (wn-2 + Xn-2fail) . w* + Xn-1fail . w* = …
  •   = w0 . w* + (X0fail + X1fail + … + Xn-1fail) . w*
  • w* . Xifail is always positive: note carefully
  • Let δ = min over j of (w* . Xj); δ > 0, since w* classifies every training vector correctly and there are finitely many of them.
  • Numerator of G ≥ w0 . w* + n . δ
  • So, the numerator of G grows with n.

8
Behavior of Denominator of G
  • |wn|2 = wn . wn
  •   = (wn-1 + Xn-1fail)2
  •   = (wn-1)2 + 2 . wn-1 . Xn-1fail + (Xn-1fail)2
  •   ≤ (wn-1)2 + (Xn-1fail)2   (as wn-1 . Xn-1fail ≤ 0, since Xn-1fail failed the test)
  •   ≤ (w0)2 + (X0fail)2 + (X1fail)2 + … + (Xn-1fail)2
  • Let |Xj| ≤ M for all j, where M is the maximum magnitude of the training vectors.
  • So, Denominator2 = |wn|2 ≤ (w0)2 + n . M2, i.e. the denominator of G grows at most as √n.

9
Some Observations
  • Numerator of G grows as n
  • Denominator of G grows as √n
  • ⇒ the numerator grows faster than the denominator
  • If PTA does not terminate, the G(wn) values will become unbounded.

10
Some Observations contd.
  • But G(wn) ≤ |w*|, which is finite, so this is impossible!
  • Hence, PTA has to converge.
  • Proof is due to Marvin Minsky.
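Putting the two bounds together makes the contradiction explicit (a sketch; δ is the minimum of w* . Xj over the training vectors and M the maximum vector magnitude, as introduced above):

$$G(w_n) \;=\; \frac{w_n \cdot w^*}{|w_n|} \;\ge\; \frac{w_0 \cdot w^* + n\,\delta}{\sqrt{|w_0|^2 + n\,M^2}} \;\sim\; \sqrt{n},$$

which grows without bound in the number of updates n, while G(wn) ≤ |w*| is finite. Hence the number of weight updates is bounded and PTA terminates.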

11
Study of Linear Separability
  • W . Xj = 0 defines a hyperplane in the (n+1)-dimensional space.
  • ⇒ the W vector and the Xj vectors are perpendicular to each other.

(Figure: sketch of the separating hyperplane in the x1–y plane)
12
Linear Separability
(Figure: the vectors X1, X2, …, Xk lie on the positive side of the separating hyperplane and Xk+1, Xk+2, …, Xm on the negative side)
Positive set: w . Xj > 0 for all j ≤ k
Negative set: w . Xj < 0 for all j > k
13
Test for Linear Separability (LS)
  • Theorem
  • A function is linearly separable iff the vectors corresponding to the function do not have a Positive Linear Combination (PLC)
  • PLC: both a necessary and sufficient condition.
  • X1, X2, …, Xm - vectors of the function
  • Y1, Y2, …, Ym - augmented, negated set
  • Prepending -1 to the 0-class vector Xi and negating it gives Yi (the 1-class vectors are simply augmented with -1)

14
Example (1) - XNOR
  • The set Yi has a PLC if Σ Pi Yi = 0, 1 ≤ i ≤ m,
  • where each Pi is a non-negative scalar and
  • at least one Pi > 0
  • Example: 2-bit even parity (the X-NOR function)

15
Example (1) - XNOR
  • P1 [-1 0 0]T + P2 [1 0 -1]T + P3 [1 -1 0]T + P4 [-1 1 1]T = [0 0 0]T
  • All Pi = 1 gives the result.
  • For the parity function, a PLC exists ⇒ not linearly separable.
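A quick check of the combination above in plain Python (the Yi are the augmented, negated XNOR vectors from the slide; taking every Pi = 1):

```python
# Augmented, negated vectors Y1..Y4 for the 2-bit XNOR function, as on the slide.
Y = [(-1, 0, 0), (1, 0, -1), (1, -1, 0), (-1, 1, 1)]
P = [1, 1, 1, 1]                       # non-negative, at least one > 0
combo = [sum(p * y[k] for p, y in zip(P, Y)) for k in range(3)]
print(combo)                           # [0, 0, 0] -> a PLC exists, so XNOR is not LS
```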

16
Example (2) Majority function
  • 3-bit majority function
  • Suppose a PLC exists. The equations obtained (one per component of Σ Pi Yi = 0) are
  • P1 + P2 + P3 - P4 + P5 - P6 - P7 - P8 = 0
  • -P5 + P6 + P7 + P8 = 0
  • -P3 + P4 + P7 + P8 = 0
  • -P2 + P4 + P6 + P8 = 0
  • On solving, all Pi are forced to 0.
  • So for the 3-bit majority function: no PLC ⇒ linearly separable (LS). A programmatic version of this test is sketched below.
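The same test can be run mechanically: a PLC exists iff the system Σ Pi Yi = 0, Pi ≥ 0, Σ Pi = 1 is feasible (the normalisation Σ Pi = 1 stands in for "at least one Pi > 0"). A sketch using SciPy's linear-programming routine; the function name has_plc and the vector ordering are illustrative choices:

```python
import numpy as np
from scipy.optimize import linprog

def has_plc(Y):
    """True iff the augmented, negated vectors Y admit a Positive Linear
    Combination, i.e. a non-negative, non-zero P with sum_i P_i * Y_i = 0
    (which would mean the function is NOT linearly separable)."""
    Y = np.asarray(Y, dtype=float)           # shape (m, d)
    m, d = Y.shape
    A_eq = np.vstack([Y.T, np.ones(m)])      # d equations Y^T P = 0, plus sum(P) = 1
    b_eq = np.append(np.zeros(d), 1.0)
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    return res.success

# 3-bit majority, each vector augmented with -1 and the 0-class ones negated
# (inputs listed in the order 000, 001, ..., 111):
majority = [(1, 0, 0, 0), (1, 0, 0, -1), (1, 0, -1, 0), (-1, 0, 1, 1),
            (1, -1, 0, 0), (-1, 1, 0, 1), (-1, 1, 1, 0), (-1, 1, 1, 1)]
print(has_plc(majority))                     # False -> no PLC -> linearly separable
```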

17
Limitations of perceptron
  • Non-linear separability is all-pervading
  • A single perceptron does not have enough computing power
  • E.g., XOR cannot be computed by a perceptron

18
Solutions
  • Tolerate error (e.g., the pocket algorithm used by connectionist expert systems).
  • Try to get the best possible hyperplane using only perceptrons
  • Use higher-order (non-linear) separating surfaces
  • E.g., degree-2 surfaces like the parabola
  • Use a layered network

19
Example - XOR
(Figure: calculation of XOR by a layered network of threshold units. The output unit, with threshold 0.5 and weights w1 = 1 and w2 = 1, combines two hidden units computing x1·(NOT x2) and (NOT x1)·x2. The slide also details the calculation of the (NOT x1)·x2 unit: threshold 1, weight w1 = -1 from input x1 and w2 = 1.5 from input x2.)
20
Example - XOR
(Figure: the complete XOR network. Output unit: threshold 0.5, weights 1 and 1 from the two hidden units. Hidden units: each with threshold 1; the x1·(NOT x2) unit has weights 1.5 from x1 and -1 from x2, and the (NOT x1)·x2 unit has weights -1 from x1 and 1.5 from x2. A sketch of this network in code follows.)
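The network read off the figure above can be checked directly (a sketch; the exact thresholds and weights are my reading of the reconstructed figure, and a threshold unit is taken to fire 1 when its net input strictly exceeds its threshold):

```python
def step(net, theta):
    """Threshold unit: fires 1 iff the net input exceeds the threshold theta."""
    return 1 if net > theta else 0

def xor(x1, x2):
    h1 = step(1.5 * x1 - 1.0 * x2, 1.0)    # computes x1 AND (NOT x2)
    h2 = step(-1.0 * x1 + 1.5 * x2, 1.0)   # computes (NOT x1) AND x2
    return step(1.0 * h1 + 1.0 * h2, 0.5)  # ORs the two hidden outputs

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))             # prints 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```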
21
Multi Layer Perceptron (MLP)
  • Question: how to find weights for the hidden layers when no target output is available?
  • Credit assignment problem - to be solved by Gradient Descent

22
Some Terminology
  • A multilayer feedforward neural network has
  • Input layer
  • Output layer
  • Hidden layer (assists computation)
  • Output units and hidden units are called
  • computation units.

23
Training of the MLP
  • Multilayer Perceptron (MLP)
  • Question: how to find weights for the hidden layers when no target output is available?
  • Credit assignment problem - to be solved by Gradient Descent

24
Gradient Descent Technique
  • Let E be the error at the output layer
  • ti = target output, oi = observed output
  • i is the index going over the n neurons in the outermost layer
  • j is the index going over the p patterns (1 to p)
  • E.g., for XOR: p = 4 and n = 1
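The expression for E itself was on the slide image and is not in the transcript; the standard sum-of-squares form consistent with the index conventions above would be

$$E \;=\; \frac{1}{2} \sum_{j=1}^{p} \sum_{i=1}^{n} \bigl(t_{ij} - o_{ij}\bigr)^2,$$

where t_ij and o_ij are the target and observed outputs of neuron i for pattern j.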

25
Weights in a ff NN
  • wmn is the weight of the connection from the nth neuron to the mth neuron
  • The E versus W surface is a complex surface in the space defined by the weights wij
  • The negative gradient -∂E/∂wmn gives the direction in which a movement of the operating point in the wmn co-ordinate space will result in maximum decrease in error

(Figure: neurons m and n connected by the weight wmn)
26
Sigmoid neurons
  • Gradient Descent needs a derivative computation
  • - not possible in the perceptron due to the discontinuous step function used!
  • ⇒ Sigmoid neurons, with easy-to-compute derivatives, are used!
  • Computing power comes from the non-linearity of the sigmoid function.

27
Derivative of Sigmoid function
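The slide's content is an image; for reference, the sigmoid and its derivative are

$$\sigma(x) \;=\; \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma}{dx} \;=\; \sigma(x)\,\bigl(1 - \sigma(x)\bigr),$$

so the derivative can be computed directly from the neuron's output.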
28
Training algorithm
  • Initialize weights to random values.
  • For input x = <xn, xn-1, …, x0>, modify weights as follows
  • Target output = t, Observed output = o
  • Iterate until E < ε (a threshold)

29
Calculation of Δwi
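The calculation itself is on the slide image; a sketch of the standard derivation for a single sigmoid neuron with E = ½(t - o)², o = σ(Σ_i w_i x_i) and learning rate η:

$$\Delta w_i \;=\; -\eta\,\frac{\partial E}{\partial w_i} \;=\; \eta\,(t - o)\,o\,(1 - o)\,x_i.$$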
30
Observations
  • Does the training technique support our
    intuition?
  • The larger the xi, the larger is Δwi
  • The error burden is borne by the weight values corresponding to large input values

31
Observations contd.
  • Δwi is proportional to the departure from the target
  • Saturation behaviour when o is 0 or 1
  • If o < t, Δwi > 0, and if o > t, Δwi < 0, which is consistent with Hebb's law

32
Hebb's law

(Figure: neurons ni and nj connected by the weight wji)

  • If nj and ni are both in the excitatory state (+1),
  • then the change in weight must be such that it enhances the excitation
  • The change is proportional to both the levels of excitation: Δwji ∝ e(nj) e(ni)
  • If ni and nj are in a mutual state of inhibition (one is +1 and the other is -1),
  • then the change in weight is such that the inhibition is enhanced (the change in weight is negative)

33
Saturation behavior
  • The algorithm is iterative and incremental
  • If the weight values or the number of inputs is very large, the net input becomes large and the output lands in the saturation region.
  • The weight values hardly change in the saturation
    region

34
Backpropagation algorithm
(Figure: a fully connected feedforward network with an input layer of n input neurons, one or more hidden layers, and an output layer of m output neurons; wji is the weight of the connection from neuron i to neuron j in the next layer.)

  • Fully connected feedforward network
  • Pure FF network (no jumping of connections over layers)

35
Gradient Descent Equations
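The equations on this slide are an image; the standard gradient-descent form, consistent with the notation of the surrounding slides, would be

$$\Delta w_{mn} \;=\; -\eta\,\frac{\partial E}{\partial w_{mn}}, \qquad \eta > 0 \ \text{(learning rate)},$$

applied to every weight of the feedforward network.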
36
Example - Character Recognition
  • Output layer: 26 neurons (one per capital letter)
  • The first output neuron has the responsibility of detecting all forms of 'A'
  • This is a centralized representation of the outputs
  • In distributed representations, all output neurons participate in the output

37
Backpropagation for outermost layer
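The derivation is on the slide image; the standard result for a sigmoid output neuron j receiving input o_i over the weight w_ji is

$$\Delta w_{ji} \;=\; \eta\,\delta_j\,o_i, \qquad \delta_j \;=\; (t_j - o_j)\,o_j\,(1 - o_j).$$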
38
Backpropagation for hidden layers
(Figure: the error term δk at output-layer neuron k is propagated backwards through the weight wkj to the hidden-layer neuron j, which in turn connects to neuron i below it; the network has an output layer of m neurons, hidden layers, and an input layer of n neurons.)

δk is propagated backwards to find the value of δj
39
Backpropagation for hidden layers
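Again the derivation is an image; the standard hidden-layer rule, with k ranging over the neurons of the layer above j, is

$$\Delta w_{ji} \;=\; \eta\,\delta_j\,o_i, \qquad \delta_j \;=\; o_j\,(1 - o_j)\sum_{k} w_{kj}\,\delta_k.$$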
40
General Backpropagation Rule
  • General weight updating rule: Δwji = η δj oi
  • where δj = (tj - oj) oj (1 - oj) for the outermost layer
  • and δj = oj (1 - oj) Σk wkj δk for hidden layers (a worked sketch in code follows)
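A compact sketch tying these rules together: one hidden layer of sigmoid units, sum-of-squares error, trained on XOR. The 2-2-1 architecture, learning rate, iteration count and random seed are illustrative choices, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR patterns (p = 4) and targets (n = 1 output neuron)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=1.0, size=(3, 2))    # input (+ bias column) -> 2 hidden neurons
W2 = rng.normal(scale=1.0, size=(3, 1))    # hidden (+ bias) -> 1 output neuron
eta = 0.5

for epoch in range(20000):
    Xb = np.hstack([X, np.ones((4, 1))])   # append a constant bias input of 1
    H = sigmoid(Xb @ W1)                   # hidden outputs o_j
    Hb = np.hstack([H, np.ones((4, 1))])
    O = sigmoid(Hb @ W2)                   # network outputs

    delta_out = (T - O) * O * (1 - O)                 # outermost-layer delta
    delta_hid = H * (1 - H) * (delta_out @ W2[:2].T)  # delta propagated to the hidden layer

    W2 += eta * Hb.T @ delta_out           # delta_w_ji = eta * delta_j * o_i (batch form)
    W1 += eta * Xb.T @ delta_hid

print(np.round(O.ravel(), 2))  # typically close to [0, 1, 1, 0]; a bad start can
                               # stall in a local minimum (see the 'Issues' slides)
```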
41
Issues in the training algorithm
  • The algorithm is greedy. It always changes the weights such that E reduces.
  • The algorithm may get stuck in a local minimum.
  • If we observe that E is not getting reduced anymore, the following may be the reasons:

42
Issues in the training algorithm contd.
  • Stuck in a local minimum.
  • Network paralysis. (Highly positive or negative inputs make the neurons saturate.)
  • The learning rate η is too small.

43
Diagnostics in action(1)
  • 1) If stuck in a local minimum, try the following:
  • Re-initialize the weight vector.
  • Increase the learning rate.
  • Introduce more neurons in the hidden layer.

44
Diagnostics in action (1) contd.
  • 2) If it is network paralysis, then increase the number of neurons in the hidden layer.
  • Problem: how to configure the hidden layer?
  • Known: one hidden layer seems to be sufficient. Kolmogorov (1960s)

45
Diagnostics in action(2)
  • Kolmogorov's statement:
  • A feedforward network with three layers (input, output and hidden), with an appropriate I/O relation that can vary from neuron to neuron, is sufficient to compute any function.
  • More hidden layers reduce the size of the individual layers.

46
Diagnostics in action(3)
  • 3) Observe the outputs. If they are close to 0 or 1, try the following:
  • Scale the inputs, or divide by a normalizing factor.
  • Change the shape and size of the sigmoid.

47
Diagnostics in action(3)
  • Introduce a momentum factor.
  • It accelerates the movement out of a trough.
  • It dampens oscillation inside a trough.
  • Choosing the momentum factor: if it is too large, we may jump over the minimum.
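The momentum-augmented update, written with the usual symbols η for the learning rate and β for the momentum factor (the slide's own symbol did not survive the transcript), is

$$\Delta w(t) \;=\; -\eta\,\frac{\partial E}{\partial w} \;+\; \beta\,\Delta w(t-1), \qquad 0 \le \beta < 1.$$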

48
An application in Medical Domain
49
Expert System for Skin Diseases Diagnosis
  • Bumpiness and scaliness of skin
  • Mostly for symptom gathering and for developing
    diagnosis skills
  • Not replacing the doctor's diagnosis

50
Architecture of the FF NN
  • 96-20-10
  • 96 input neurons, 20 hidden-layer neurons, 10 output neurons (a sketch of these shapes follows)
  • Inputs: skin disease symptoms and their parameters
  • Location, distribution, shape, arrangement, pattern, number of lesions, presence of an active border, amount of scale, elevation of papules, color, altered pigmentation, itching, pustules, lymphadenopathy, palmar thickening, results of microscopic examination, presence of herald patch, result of the dermatology test called KOH
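For concreteness, the 96-20-10 architecture amounts to two weight matrices of the shapes below (a sketch only; the original system's encoding and implementation details are not given in the slides, and the extra row in each matrix is an assumed bias/threshold input):

```python
import numpy as np

n_in, n_hidden, n_out = 96, 20, 10      # the 96-20-10 feedforward network
W1 = np.zeros((n_in + 1, n_hidden))     # input (+ bias) -> hidden weights
W2 = np.zeros((n_hidden + 1, n_out))    # hidden (+ bias) -> output weights
print(W1.shape, W2.shape)               # (97, 20) (21, 10)
```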

51
Output
  • 10 neurons, indicative of the diseases:
  • psoriasis, pityriasis rubra pilaris, lichen planus, pityriasis rosea, tinea versicolor, dermatophytosis, cutaneous T-cell lymphoma, secondary syphilis, chronic contact dermatitis, seborrheic dermatitis

52
Training data
  • Input: specs of the 10 model diseases from 250 patients
  • 0.5 if some specific symptom value is not known
  • Trained using the standard error backpropagation algorithm

53
Testing
  • Previously unused symptom and disease data of 99
    patients
  • Result
  • Correct diagnosis achieved for 70% of the papulosquamous group of skin diseases
  • Success rate above 80% for the remaining diseases, except for psoriasis
  • Psoriasis was diagnosed correctly in only 30% of the cases
  • Psoriasis resembles other diseases within the papulosquamous group, and is somewhat difficult even for specialists to recognise.

54
Explanation capability
  • Rule-based systems reveal the explicit path of reasoning through textual statements
  • Connectionist expert systems reach conclusions through complex, non-linear and simultaneous interaction of many units
  • Analysing the effect of a single input or a single group of inputs would be difficult and would yield incorrect results

55
Explanation contd.
  • The hidden layer re-represents the data
  • Outputs of hidden neurons are neither symptoms nor decisions

56
(No Transcript)
57
Discussion
  • Symptoms and parameters contributing to the diagnosis are found from the n/w
  • Standard deviation, mean and other tests of significance are used to arrive at the importance of the contributing parameters
  • The n/w acts as an apprentice to the expert