Title: Other and related for 2 hours
1. Other and related for 2 hours
- Christer Johansson
- Computational Linguistics
- Bergen University
2. New Developments
- Local learning
- Cautious generalization (instance-based learning)
3. Radial Basis Functions (RBFs)
- Features
- One hidden layer
- The activation of a hidden unit is determined by the distance between the input vector and a prototype vector
[Figure: RBF network architecture, inputs → radial units → outputs]
4. Learning
- The training is performed by deciding on
- how many hidden nodes there should be
- the centers and the sharpness of the Gaussians
- Training proceeds in 2 steps:
- In the 1st stage, the input data set is used to determine the parameters of the basis functions
- In the 2nd stage, the basis functions are kept fixed while the second-layer weights are estimated (a simple BP algorithm, as for MLPs); see the sketch below
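A minimal sketch of this two-stage scheme, assuming k-means for placing the centers and a least-squares fit for the second-layer weights (the slide only mentions a simple BP-like estimation for stage 2, so these concrete choices are assumptions):

```python
# Sketch of two-stage RBF training: stage 1 places the basis functions,
# stage 2 fits only the second-layer weights while the basis stays fixed.
import numpy as np

def rbf_activations(X, centers, width):
    # Gaussian basis: exp(-||x - c||^2 / (2 * width^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def train_rbf(X, y, k=10, width=1.0):
    centers = kmeans(X, k)                       # stage 1: basis-function centers
    Phi = rbf_activations(X, centers, width)     # fixed radial-unit activations
    W, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # stage 2: second-layer weights only
    return centers, W

def predict_rbf(X, centers, W, width=1.0):
    return rbf_activations(X, centers, width) @ W
```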
5. MLPs versus RBFs
- Classification
- MLPs separate classes via hyperplanes
- RBFs separate classes via hyperspheres
- Learning
- MLPs use distributed learning
- RBFs use localized learning
- RBFs train faster
- Structure
- MLPs have one or more hidden layers
- RBFs have only one hidden layer
- RBFs require more hidden neurons → curse of dimensionality
[Figure: decision regions in (x1, x2) space, an MLP separating the classes with a hyperplane and an RBF with a hypersphere]
6. Temporal Processing
- Simple Recurrent Networks (SRN)
7. SRN
- Uses the back-propagation algorithm
- Feeds back (a copy of) the hidden layer's activation
- This becomes part of the input in the next processing step
- Initially the activation of the hidden layer is undefined; it may take a little while to stabilize
8. SRN
9. SRN: an example
An SRN can be used to predict the next character of a sequence. The prediction error typically drops from the start of a word onwards. A method to detect words? (A minimal sketch follows below.)
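A minimal Elman-style SRN sketch for next-character prediction; the toy corpus, layer sizes, learning rate, and the use of plain one-step backprop (the context treated as a fixed extra input) are illustrative assumptions:

```python
# Minimal SRN (Elman network) for next-character prediction.
import numpy as np

rng = np.random.default_rng(0)
text = "the cat sat on the mat " * 50          # toy corpus (assumption)
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}
V, H, lr = len(chars), 20, 0.1

W_in = rng.normal(0, 0.1, (H, V))    # input -> hidden
W_ctx = rng.normal(0, 0.1, (H, H))   # context (previous hidden state) -> hidden
W_out = rng.normal(0, 0.1, (V, H))   # hidden -> output

def one_hot(i):
    v = np.zeros(V); v[i] = 1.0; return v

for epoch in range(5):
    context = np.zeros(H)            # initially undefined; start from zeros
    for t in range(len(text) - 1):
        x, target = one_hot(idx[text[t]]), idx[text[t + 1]]
        h = np.tanh(W_in @ x + W_ctx @ context)   # hidden sees input + context copy
        z = W_out @ h
        p = np.exp(z - z.max()); p /= p.sum()     # softmax over the next character
        # plain backprop, with the context treated as a fixed extra input
        d_out = p.copy(); d_out[target] -= 1.0
        d_h = (W_out.T @ d_out) * (1 - h ** 2)
        W_out -= lr * np.outer(d_out, h)
        W_in -= lr * np.outer(d_h, x)
        W_ctx -= lr * np.outer(d_h, context)
        context = h                               # feed the hidden activation back
```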
10. XOR in time (spurious regularities)
If we run a loop over this data set, we typically get good prediction of the input bits, but not of the XOR function. To learn XOR in time we must make sure the input is random, to give XOR a chance (see the sketch below).
[Example bit sequence from the slide; garbled in this extraction.]
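A small data-generator sketch, assuming the usual formulation of the task (two random input bits followed by their XOR); it illustrates why the input bits must be random:

```python
# XOR-in-time data generator (assumed task layout: a, b, a XOR b, a, b, ...).
import numpy as np

def xor_in_time(n_triples, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.integers(0, 2, n_triples)
    b = rng.integers(0, 2, n_triples)
    return np.stack([a, b, a ^ b], axis=1).reshape(-1)

# Only every third bit (the XOR) is predictable from its predecessors; looping
# over one fixed, repeated sequence would instead let the net memorize the loop.
print(xor_in_time(5))
```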
11. Neural Nets with a Lexicon
- Can symbolic and connectionist processing be combined?
12. NN with lexicon
- Early models of language acquisition stressed that connectionist models didn't need a separate lexicon; everything was stored in the net.
- Miikkulainen & Dyer showed that adding a lexicon could help the net invent a useful input representation.
13. NN with lexicon
- Their model mapped short sentences (the task was to assign thematic roles to words).
- Words were index numbers.
- The index pointed to a representation that was itself trained.
- The error signal was sent one step further, and used to update the representations in the lexicon (!); see the sketch below.
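A minimal sketch of the "error one step further" idea: a word index looks up a trainable representation, and the gradient with respect to that representation updates the lexicon entry itself. The sizes and the toy role-assignment step are assumptions, not Miikkulainen & Dyer's actual FGREP setup:

```python
# Backprop extended one step into a trainable lexicon (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
V, D, H, R = 50, 8, 16, 4              # vocab size, repr. dim, hidden units, roles
lexicon = rng.normal(0, 0.1, (V, D))   # trainable word representations
W1 = rng.normal(0, 0.1, (H, D))
W2 = rng.normal(0, 0.1, (R, H))
lr = 0.1

word_id, target_role = 7, 2            # toy training pair: word index -> role
x = lexicon[word_id].copy()            # look the representation up by index
h = np.tanh(W1 @ x)
p = np.exp(W2 @ h); p /= p.sum()       # softmax over thematic roles

d_out = p.copy(); d_out[target_role] -= 1.0
d_h = (W2.T @ d_out) * (1 - h ** 2)
d_x = W1.T @ d_h                       # error signal sent one step further, to the input
W2 -= lr * np.outer(d_out, h)
W1 -= lr * np.outer(d_h, x)
lexicon[word_id] -= lr * d_x           # ...and used to update the lexicon entry itself
```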
14. Inventing features
15. Miikkulainen: FGREP and DISCERN
- The combination of a lexicon and a neural net proved successful.
- Interesting because it marries symbolic AI with connectionism.
- There is suddenly room for hybrid models.
16. An alternative Learning Law
- Winner Takes All
- Kohonen Maps (SOM, LVQ)
17. Self-organizing maps
- The purpose of a SOM is to map a multidimensional input space onto a topology-preserving map of neurons.
- The topology is preserved so that neighboring neurons respond to similar input patterns.
- The topological structure is often a 2- or 3-dimensional space.
- Each neuron is assigned a weight vector with the same dimensionality as the input space.
- Input patterns are compared to each weight vector, and the closest wins (Euclidean distance).
18. Self-organizing maps
- The result of a SOM is a clustering of the data, so that similar inputs appear closer together in the map space.
- An implicit categorization is discovered without feedback.
- Problems with winner-takes-all:
- If the training sequence is ordered in a certain way, one neuron could form a universal class for everything.
- Remedy: conscience. Neurons are discouraged from being greedy; their probability of being the winner is influenced by how many times they have won previously (see the sketch below).
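A sketch of such a conscience bias in the winner selection; the specific penalty (proportional to how far a neuron's win frequency exceeds its fair share) is an assumed, common formulation, not necessarily the one intended on the slide:

```python
# Winner-takes-all with a "conscience": frequent winners are handicapped.
import numpy as np

def winner_with_conscience(x, weights, wins, beta=10.0):
    # weights: (n_neurons, dim); wins: array counting how often each neuron has won
    dist = np.linalg.norm(weights - x, axis=1)         # Euclidean distance to the input
    total = max(int(wins.sum()), 1)
    bias = beta * (wins / total - 1.0 / len(weights))  # above-average winners pay a penalty
    return int(np.argmin(dist + bias))
```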
19.
- The activation of the neuron spreads to its direct neighborhood → neighbors become sensitive to similar input patterns
- The size of the neighborhood is initially large, but is reduced over time → specialization of the network
- Other measures of distance are possible, which define the neighborhood
[Figure: first and second neighborhoods around the winning neuron]
20. Adaptation
- During training, the winning neuron and its neighborhood adapt to make their weight vectors more similar to the input pattern that caused the activation.
- The neurons are moved closer to the input pattern (the weights from the inputs to the neurons are adapted).
- The magnitude of the adaptation is controlled via a learning parameter which decays over time (see the sketch below).
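A minimal SOM training sketch pulling slides 17-20 together; the 10x10 map, the Gaussian neighborhood function and the exponential decay schedules are illustrative assumptions:

```python
# Minimal self-organizing map: winner-takes-all plus neighborhood adaptation.
import numpy as np

def train_som(weights, grid, data, epochs=20, lr0=0.5, radius0=5.0):
    t, t_max = 0, epochs * len(data)
    for _ in range(epochs):
        for x in data:
            lr = lr0 * np.exp(-t / t_max)           # learning parameter decays over time
            radius = radius0 * np.exp(-t / t_max)   # neighborhood shrinks over time
            d = np.linalg.norm(weights - x, axis=2) # Euclidean distance to the input
            win = np.unravel_index(d.argmin(), d.shape)     # winner takes all
            g2 = ((grid - np.array(win)) ** 2).sum(axis=2)  # grid distance to the winner
            h = np.exp(-g2 / (2.0 * radius ** 2))   # Gaussian neighborhood function
            weights += lr * h[..., None] * (x - weights)    # move toward the input
            t += 1
    return weights

rows, cols, dim = 10, 10, 3
rng = np.random.default_rng(0)
weights = rng.random((rows, cols, dim))   # one weight vector per map neuron
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
train_som(weights, grid, rng.random((200, dim)))
```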
21. Interpretation of Neural Nets
- Exclusive or inclusive? Probabilities / Fuzzy Logic
22. Fuzzy Logic
Values in neural networks are usually shades, as neurons gradually become activated. Fuzzy Logic uses shades of truth, and can combine some of the strengths of symbolic AI with the strengths of connectionist AI.
23. Fuzzy is not Probabilistic
In Fuzzy Logic we can say that the car is 0.70 in parking pocket A and 0.30 in parking pocket B. This is not the same as saying it is in A with probability 0.70 (it would then be either in A or somewhere else).
[Figure: a car straddling parking pockets A and B]
24. Vagueness
When do we get old? Are we not old at 39, but old at 40? In crisp logic there would have to be one sharp dividing line between the "old" and the "not old". Fuzzy logic allows us to be young and old at the same time, to some degree (cf. the non-linear activation function that helped save the perceptron for multi-layered networks). A sketch follows below.
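A small sketch of a fuzzy membership function for "old"; the 30-60 age range and the smoothstep shape are illustrative assumptions, the point being only that membership changes gradually rather than at one sharp line:

```python
# Gradual fuzzy membership in "old" (no single sharp dividing line).
import numpy as np

def membership_old(age, lower=30.0, upper=60.0):
    x = np.clip((age - lower) / (upper - lower), 0.0, 1.0)
    return 3 * x**2 - 2 * x**3            # smoothstep ramp from 0 to 1

def membership_young(age):
    return 1.0 - membership_old(age)      # one can be young AND old to some degree

for a in (25, 39, 40, 70):
    print(a, round(float(membership_old(a)), 2), round(float(membership_young(a)), 2))
```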
25. Neural Nets
- Fuzzy Logic or Probabilistic Reasoning?
- NNs are often used to estimate parameters in Fuzzy Logic systems. The main question is to what degree the input is a member of the valid classes.
26. Neural Nets
- When we provide NNs as a knowledge source, the information about the detected vagueness is preserved.
- It is also possible to allow some adaptation of that information in the final product.
- NNs can also be used to approximate probability distributions.
27. Neural Nets and Hidden Markov Models
- Alike and different
- Hybrid models
28. Neural Nets
- NNs can be used to approximate probability distributions.
- This is a main problem in probabilistic modeling (Hidden Markov Models).
- Probabilities are estimated from large databases, but there will always be a need for more data. Larger contexts mean sparser data; neural networks may help generalize to unseen data.
29. HMM
- Works with probabilities of state transitions
- Derived from a corpus?
- Either you are in a state or you are not.
30. HMM
- Assumes that new input is going to be like the old input (from a corpus).
- What happens if we see a new unit (word)?
- Markov assumption: the probability of the next state depends only on the current state (see the sketch below).
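A small sketch of the Markov assumption, with transition probabilities estimated from a toy tag sequence by relative-frequency counting (the corpus and tag set are made up for illustration):

```python
# Transition probabilities from counts, under the Markov assumption.
from collections import Counter

tag_sequence = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]

bigrams = Counter(zip(tag_sequence, tag_sequence[1:]))
unigrams = Counter(tag_sequence[:-1])

def p_next(state, next_state):
    # Markov assumption: P(next | whole history) = P(next | current state)
    return bigrams[(state, next_state)] / unigrams[state] if unigrams[state] else 0.0

print(p_next("DET", "NOUN"))   # estimated from counts in the corpus
print(p_next("DET", "VERB"))   # zero for unseen transitions -> needs smoothing
```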
31. HMM
- What happens if we see a new unit (word)?
- Or a completely new sequence?
- We might estimate the probabilities associated with this new word based on probabilities observed for new (low-frequency) words previously.
- We might use a neural network to integrate information about word form, position, etc. into an activation vector for this word. This vector could be used to approximate the probabilities we need (see the sketch below).
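A hedged sketch of this hybrid idea: a tiny network maps crude word-form and position cues to an activation vector over tags that could stand in for the missing probabilities. The features, tag set and (untrained) weights are all illustrative assumptions:

```python
# Activation vector over tags from word-form and position cues (illustrative).
import numpy as np

TAGS = ["NOUN", "VERB", "ADJ"]

def word_features(word, position):
    return np.array([
        word.endswith("ing"),      # crude word-form cues...
        word.endswith("ed"),
        word.endswith("s"),
        word[0].isupper(),
        position == 0,             # ...plus position in the sentence
    ], dtype=float)

rng = np.random.default_rng(0)
W = rng.normal(0, 0.5, (len(TAGS), 5))   # would be trained on known words

def tag_distribution(word, position):
    z = W @ word_features(word, position)
    p = np.exp(z - z.max())
    return p / p.sum()                   # activation vector ~ P(tag | cues)

print(dict(zip(TAGS, tag_distribution("blorking", 3).round(2))))
```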
32. Comparison
- HMM
- Maximize the probability of the observation (etc.)
- Probabilities from observation
- Discrete units
- Smoothing techniques for assigning probability to unseen events
- NN
- Minimize errors for recognition (etc.)
- Activation vectors / fuzzy membership
- Sub-symbolic
- Smoothing based on regularities and cues in the corpus
33. LINKS
- Tlearn is a free neural network simulator (tlearn, CRL, UCSD). Jeffrey Elman popularized neural networks in Cognitive Science / Linguistics.
- Course: http://www.cs.wisc.edu/dyer/cs540/notes/nn.html
- The PDP book: Rumelhart & McClelland, Parallel Distributed Processing I & II.
- Fuzzy Logic: Bart Kosko, 1994, Fuzzy Thinking; Michael Negnevitsky, 2002, Artificial Intelligence. Search for Lotfi Zadeh.
- Neurons: http://cti.itc.virginia.edu/psyc220
34. More resources (some of the illustrations were found on the internet)
- Connectionism (fact-laden)
- http://www.dcs.shef.ac.uk/yorick/ai_course/com1070.9.ppt, COM1070.10.ppt
- Technical
- http://acat02.sinp.msu.ru/presentations/prevotet/tutorial.ppt (ACAT2002.ppt)
- The tlearn simulator
- http://crl.ucsd.edu/innate/tlearn.html (check out the book: http://crl.ucsd.edu/innate/tlearn.html)