Title: Spring 2006 Artificial Intelligence, COSC 40503, week 5
1 Spring 2006 Artificial Intelligence
COSC 40503, week 5
- Antonio Sanchez
- Texas Christian University
2 About Learning and Behavior
Herbert Simon (1916-2001)
- "Learning is any change in a system that produces a more or less permanent change in its capacity for adapting to its environment."
- "Human beings, viewed as behaving systems, are quite simple. The apparent complexity of our behavior over time is largely a reflection of the complexity of the environment in which we find ourselves."
3 Connectionism
- The concept of a neuron has invited many lines of research.
- Yet the power of living neurons lies in both:
  - Their huge number: 10^10 in humans, 10^4 in a small bug
  - Their connectivity: 10^5 connections each in humans
- It is therefore by the power of the ganglia (10^15 connections) that we present complex and intelligent behavior.
- Neurons can be modeled as digital or continuous systems.
4A simple Neuron Model
- Orthogonally fh 0
- Normality ff 1 execute fi
nfi/sqrt(nf.nf) - Soma ni,j ? ai,kdk,j
- for all k dendrites
- Linear Axon ai,j ni,j
- Non Linear Axon if ni,j gt Thresholdi,j
- then ai,j 1 else
ai,j 0 - Lineal compensation ?di,j ?ai,kai,j
0 lt ? lt 1 - k is entry axon
j exit axon - N pattern compensation ?di,j
??i,k,hai,j,h for all h patterns
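A minimal sketch of these equations in Python, assuming one neuron with three dendrites; the activations, weights, threshold, and learning rate are made-up illustrative values, not values from the slide.

    # Minimal sketch of the simple neuron model (illustrative values only).
    def soma(activations, weights):
        # n = sum over all dendrites k of a(k) * d(k)
        return sum(a * d for a, d in zip(activations, weights))

    def linear_axon(n):
        return n

    def nonlinear_axon(n, threshold):
        # step axon: fire (1) only when the soma exceeds the threshold
        return 1 if n > threshold else 0

    def linear_compensation(weights, activations, out, alpha=0.1):
        # delta d(k) = alpha * a(k) * a(out), with 0 < alpha < 1
        return [d + alpha * a * out for d, a in zip(weights, activations)]

    inputs = [1, 0, 1]           # activations arriving on three dendrites (assumed)
    weights = [0.4, -0.2, 0.7]   # dendritic weights (assumed)
    n = soma(inputs, weights)
    out = nonlinear_axon(n, threshold=1.0)
    weights = linear_compensation(weights, inputs, out)
    print(n, out, weights)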
5 A Linear Neuronal Network
It can learn a linear function, but not a non-linear one such as the Exclusive OR.
6 A Hidden Layer Neuronal Network
Exclusive OR
[Diagram: two inputs feeding a hidden layer of threshold units (Th = 1) and an output unit computing the Exclusive OR]
Excel file (an illustrative code sketch follows the source below)
Source: Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations (pp. 318-362). Cambridge, MA: MIT Press.
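The Excel file itself is not reproduced here. As an illustration only, the sketch below builds a two-input XOR network with a small hidden layer of step-threshold units; the weights and thresholds are a common textbook choice and are not necessarily those shown in the slide's diagram.

    # Illustrative XOR network: 2 inputs, 2 hidden threshold units, 1 output.
    # Weights and thresholds are a standard textbook choice (assumed).
    def step(n, threshold):
        return 1 if n > threshold else 0

    def xor_net(x1, x2):
        h_or  = step(1 * x1 + 1 * x2, 0.5)      # hidden unit computing OR
        h_and = step(1 * x1 + 1 * x2, 1.5)      # hidden unit computing AND
        return step(1 * h_or - 1 * h_and, 0.5)  # output: OR and not AND = XOR

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor_net(x1, x2))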
7 Credit Assignment: using the Gradient Descent method to change the dendritic weights
- Change the firing function of the neurons from a step function to a continuous one with similar behavior, e.g. a sigmoid axon: a(j) = 1 / (1 + e^(-n(j)))
- Calculate the derivative of the sigmoid: da(j)/dn(j) = a(j) (1 - a(j))
- Determine the output error: e(output) = TrueValue - a(j)
- Determine the axon error: ea(j) = a(j) (1 - a(j)) e(output)
- Minimize the error E by gradient descent: change each weight by an amount proportional to the partial derivative, delta d = alpha ea(j) a(h)
If the slope is negative, increase n(j); if the slope is positive, decrease n(j). Local minima of the error are the places where the derivative equals zero.
(A small code sketch of this update follows below.)
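A minimal sketch of the sigmoid axon and the output-weight update above for a single output neuron; the learning rate, starting weight, and incoming activation are assumed illustrative values.

    import math

    def sigmoid(n):
        # continuous axon: a(j) = 1 / (1 + e^(-n(j)))
        return 1.0 / (1.0 + math.exp(-n))

    def weight_change(target, a_j, a_h, alpha=0.5):
        # axon error: ea(j) = a(j) (1 - a(j)) (target - a(j))
        ea_j = a_j * (1.0 - a_j) * (target - a_j)
        # gradient-descent step: delta d = alpha * ea(j) * a(h)
        return alpha * ea_j * a_h

    a_h = 0.8                 # activation arriving from unit h (assumed)
    d = 0.3                   # current dendritic weight (assumed)
    a_j = sigmoid(d * a_h)    # output of the sigmoid axon
    d += weight_change(target=1.0, a_j=a_j, a_h=a_h)
    print(a_j, d)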
8 Faster Convergence: the Momentum rule
- Add a fraction (a momentum term, beta) of the last change to the current change:
delta d(t) = alpha ea(j) a(h) + beta delta d(t-1)
Study and run the Excel file. (A small code sketch follows below.)
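A short sketch of the momentum rule above; the axon errors, activation, and the alpha and beta values are made up for illustration.

    def momentum_update(ea_j, a_h, prev_delta, alpha=0.5, beta=0.9):
        # delta d(t) = alpha * ea(j) * a(h) + beta * delta d(t-1)
        return alpha * ea_j * a_h + beta * prev_delta

    delta = 0.0
    for ea_j in (0.12, 0.10, 0.08):   # assumed axon errors over three steps
        delta = momentum_update(ea_j, a_h=0.8, prev_delta=delta)
        print(delta)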
9 Extended Hidden-Layer Neuronal Network
[Diagram: inputs, hidden layer, outputs]
10 A Lesson from Mother Nature: Using the Scientific Method - Observation
11 Hypothesis
"When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." - Donald O. Hebb, 1949
12 Synthesis and Validation
The connectivity of a brain is a product of its interaction with its environment.
[Diagram: neural connectivity under high interaction vs. low interaction]
13 Argumentation
- In any case, the connectivity is due to at least the following three factors:
  - Time: born, child, young, adult
  - Interaction: low, medium, high
  - Activity of the brain: low, medium, high
14 How about storing Information?
2-bit net for ...
[Diagram: threshold units with Th = 1, 0, 1 and connection values of -1]
15 How about storing Knowledge?
2-bit net for > (greater than)
[Diagram: threshold units with Th = 1, 1, 0 and connection values of 1, 1, -1]
(A small code sketch of the idea follows below.)
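The diagram's exact wiring is not recoverable from the text. As a hedged illustration of storing the relation "greater than" in a net, a single threshold unit with weights +1 and -1 and threshold 0 already encodes a > b for one-bit inputs; the slide's 2-bit version presumably uses more units.

    # Illustrative only: one threshold unit encoding the relation a > b for single bits.
    def greater_than(a, b):
        n = 1 * a + (-1) * b       # weights +1 and -1 (assumed)
        return 1 if n > 0 else 0   # threshold 0

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, greater_than(a, b))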
16 Digital Neural Network requirements
- Extrapolating the previous data to gigabytes of knowledge and information, we would require an enormous number of cells and connections, something like 10^11 for a full gigabyte of information and knowledge.
- However, we must take into account three important facts for the case of natural neurons:
  - They are not binary digital, but continuous
  - There is a lot of implicit coding in the ganglia
  - They do not store all the information, but only important patterns of it (a.k.a. knowledge)
17 Some important knowledge on Artificial Neural Networks
- They are slow to train, but once they learn they perform beautifully
- Treat the threshold as a negative dendrite attached to an axon with a value of 1
- To speed up convergence, try random seeds for the initial weights
- If they do not converge, use another set of random weights
- To reduce the size of the network, use binary coding for both the inputs and the outputs (see the sketch after this list)
- Use any other deterministic filters you deem necessary, such as the mean and standard deviation
- Do not overload the network with more patterns than a maximum of 20 to 25 percent of the number of neurons
- A single hidden layer is enough; it should comprise about half the number of input cells and/or twice the number of output cells
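To make the binary-coding tip concrete, here is a small sketch (with assumed category labels) that packs a categorical input into bits, so that n categories need only about log2(n) input neurons instead of n.

    # Illustrative: encode a category index as binary inputs for the network.
    def binary_code(index, bits):
        return [(index >> k) & 1 for k in reversed(range(bits))]

    categories = ["red", "green", "blue", "yellow", "purple"]   # assumed labels
    bits = (len(categories) - 1).bit_length()                   # 3 bits cover 5 categories
    for i, name in enumerate(categories):
        print(name, binary_code(i, bits))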
18 Linear Recurrent Networks
- Associative Memory
- Hopfield Network (see the sketch below)
Excel files
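The Excel files are not reproduced here. As a hedged sketch, the following builds a tiny Hopfield-style associative memory with Hebbian (outer-product) weights over +1/-1 units; the stored patterns and the noisy probe are made up for illustration.

    # Minimal Hopfield-style associative memory (illustrative patterns).
    def train(patterns):
        n = len(patterns[0])
        w = [[0.0] * n for _ in range(n)]
        for p in patterns:
            for i in range(n):
                for j in range(n):
                    if i != j:
                        w[i][j] += p[i] * p[j]   # Hebbian outer-product rule
        return w

    def recall(w, probe, passes=5):
        s = list(probe)
        for _ in range(passes):
            for i in range(len(s)):              # asynchronous update in index order
                net = sum(w[i][j] * s[j] for j in range(len(s)))
                s[i] = 1 if net >= 0 else -1
        return s

    patterns = [[1, 1, 1, -1, -1, -1], [1, -1, 1, -1, 1, -1]]   # stored patterns (assumed)
    w = train(patterns)
    print(recall(w, [1, 1, -1, -1, -1, -1]))     # probe: first pattern with one bit flipped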
19 Unsupervised Learning
So far we have talked about training with a purpose, i.e. showing some examples and giving some type of compensation; this is called supervised learning. Yet following Donald O. Hebb's hypothesis, we do not even need to supervise the learning process: it will take place anyway!
- Examples
  - Kohonen Maps (a small sketch follows below)
  - Bayesian Nets
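As a hedged illustration of learning without a teacher, the sketch below runs a tiny Kohonen-style winner-take-all update: the prototype closest to each input simply moves toward it, with no target value ever supplied. The data points and the learning rate are made up.

    import random

    # Tiny winner-take-all (Kohonen-style) sketch: no teacher signal is used.
    random.seed(0)
    units = [[random.random(), random.random()] for _ in range(4)]   # 4 prototype vectors

    def winner(x):
        return min(range(len(units)),
                   key=lambda i: sum((units[i][k] - x[k]) ** 2 for k in range(2)))

    data = [[0.1, 0.2], [0.9, 0.8], [0.15, 0.25], [0.85, 0.9]]       # assumed inputs
    for epoch in range(20):
        for x in data:
            w = winner(x)                        # unit closest to the input
            for k in range(2):                   # move the winner toward the input
                units[w][k] += 0.2 * (x[k] - units[w][k])

    print(units)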
20 Bayesian Belief Networks
Based on the joint probability concepts of Thomas Bayes, these networks are used to draw inferences from the connections that occur between related events. Here are the basic equations (try to remember them):
p(A ∪ B) = p(A) + p(B) - p(A ∩ B)
p(A ∩ B) = p(A) p(B|A) = p(B) p(A|B) = p(A,B)
p(¬A ∩ ¬B) = 1 - p(A ∪ B)
If p(A ∩ B) = p(A) p(B), then A and B are independent events and no inference can be obtained from having one event happen.
If p(A ∩ B) = 0, then A and B are mutually exclusive events, that is, one cannot happen if the other is present.
(A small numeric check follows below.)
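A quick numeric check of these identities, using assumed probability values.

    # Assumed numbers for a quick check of the joint-probability identities.
    pA, pB = 0.4, 0.5
    pA_and_B = 0.2                          # chosen so that p(A ∩ B) = p(A) p(B)

    pA_or_B = pA + pB - pA_and_B            # p(A ∪ B) = 0.7
    pB_given_A = pA_and_B / pA              # p(B|A)
    pA_given_B = pA_and_B / pB              # p(A|B)

    print(pA * pB_given_A, pB * pA_given_B, pA_and_B)    # all equal 0.2
    print(1 - pA_or_B)                      # p(¬A ∩ ¬B) = 0.3
    print(abs(pA_and_B - pA * pB) < 1e-12)  # True: A and B are independent here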
21 Bayesian Belief Networks 2
An important extension is that
p(A,B,C) = p(A) p(B|A) p(C|A,B)
         = p(B) p(A|B) p(C|A,B)
         = p(A) p(C|A) p(B|A,C)
         = p(B) p(C|B) p(A|C,B)
         = p(C) p(A|C) p(B|A,C)
         = p(A ∩ B ∩ C)
For the general case, p(x1, ..., xn) = Π p(xi | E) over all i, where E is the required a priori evidence.
22 Bayesian Belief Networks 3
The key aspect is to arrange the events in such a fashion that we obtain an adequate tree of the joint probabilities for the various events, such that we can assume that
p(F,P,C) = p(C) p(P|C) p(F|C,P) is computed only as p(C) p(P|C) p(F|P)
and
p(E,P,C,S,F) = p(C) p(P|C) p(S|C) p(E|C,P,S) p(F|C,P,S,E) is computed only as p(C) p(S|C) p(P|C) p(E|P,S) p(F|P)
See Excel files. (A small sketch of the first factorization follows below.)
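The Excel files are not reproduced here. As a hedged sketch with made-up conditional probability values, the code below evaluates the reduced factorization p(F,P,C) = p(C) p(P|C) p(F|P) for one assignment and checks that the joint still sums to 1.

    # Assumed conditional probability tables for three binary events C, P, F.
    p_C = {True: 0.3, False: 0.7}
    p_P_given_C = {True: {True: 0.8, False: 0.2},     # outer key is C, inner key is P
                   False: {True: 0.1, False: 0.9}}
    p_F_given_P = {True: {True: 0.6, False: 0.4},     # outer key is P, inner key is F
                   False: {True: 0.05, False: 0.95}}

    def joint(f, p, c):
        # reduced factorization: p(F,P,C) = p(C) * p(P|C) * p(F|P)
        return p_C[c] * p_P_given_C[c][p] * p_F_given_P[p][f]

    print(joint(True, True, True))                    # 0.3 * 0.8 * 0.6 = 0.144
    print(sum(joint(f, p, c) for f in (True, False)
                             for p in (True, False)
                             for c in (True, False))) # sums to 1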