Title: Overview of Different Methods
1. Overview of Different Methods
2. Different Types/Classes of Learning
- Unsupervised Learning (non-evaluative feedback)
  - Trial-and-error learning.
  - No error signal.
  - No influence from a teacher; correlation evaluation only.
- Reinforcement Learning (evaluative feedback)
  - (Classical, instrumental) conditioning, reward-based learning.
  - Good/bad error signals.
  - The teacher defines what is good and what is bad.
- Supervised Learning (evaluative error-signal feedback)
  - Teaching, coaching, imitation learning, learning from examples, and more.
  - Rigorous error signals.
  - Direct influence from a teacher/teaching signal.
3. An Unsupervised Learning Rule
For learning: one input, one output.
4. Self-Organizing Maps (Unsupervised Learning)
(Figure: mapping from the input space onto the map.)
Neighborhood relationships are usually preserved (+).
Absolute structure depends on the initial condition and cannot be predicted (-).
5. An Unsupervised Learning Rule
For learning: one input, one output.
6. Classical Conditioning
I. Pavlov
7. An Unsupervised Learning Rule
For learning: one input, one output.
8. Supervised Learning Example: OCR
9. The influence of the type of learning on speed and autonomy of the learner
10. Hebbian Learning
"When an axon of cell A excites cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells so that A's efficiency ... is increased." Donald Hebb (1949)
(Figure: cells A and B; A fires shortly before B along the time axis t.)
11. Overview of Different Methods
You are here!
12. Hebbian Learning
(Figure: a single neuron with input u1, synaptic weight w1, and output v.)
Vector notation, cell activity: v = w · u
This is a dot product, where w is the weight vector and u the input vector. Strictly, we need to assume that weight changes are slow; otherwise this turns into a differential equation.
13. Hebbian Learning: The Math

Single input:      dw1/dt = μ v u1,   μ << 1

Many inputs:       dw/dt = μ v u,   μ << 1
(As v is a single output, it is a scalar.)

Averaging inputs:  dw/dt = μ ⟨v u⟩,   μ << 1
We can just average over all input patterns and approximate the weight change by this. Remember, this assumes that weight changes are slow.

If we replace v with w · u we can write

dw/dt = μ Q · w,   where Q = ⟨u uᵀ⟩ is the input correlation matrix.

Note: Hebb yields an unstable (always growing) weight vector!
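A minimal numerical sketch (toy data and parameters assumed, not from the slides) of the plain Hebb rule above, illustrating the unbounded weight growth:

```python
# Plain Hebbian learning in discrete time: v = w . u, dw ~ mu * v * u, mu << 1.
# The weight vector keeps growing (it aligns with the dominant eigenvector of Q = <u u^T>).
import numpy as np

rng = np.random.default_rng(0)
mu = 0.005                            # learning rate, mu << 1
patterns = rng.normal(size=(200, 4))  # assumed toy input patterns u
w = rng.normal(scale=0.1, size=4)     # initial weight vector

for epoch in range(20):
    for u in patterns:
        v = w @ u                     # postsynaptic activity (dot product)
        w += mu * v * u               # plain Hebb update
    # no bound on |w|: this is the instability noted above
print("final |w| =", np.linalg.norm(w))
```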
14. Synaptic plasticity evoked artificially: examples of long-term potentiation (LTP) and long-term depression (LTD).
LTP: first demonstrated by Bliss and Lomo in 1973. Since then induced in many different ways, usually in slice.
LTD: robustly shown by Dudek and Bear in 1992, in hippocampal slice.
18. LTP will lead to new synaptic contacts.
19. Conventional LTP = Hebbian Learning
(Figure: symmetrical weight-change curve; synaptic change as a function of relative timing.)
The temporal order of input and output does not play any role.
21. Spike-timing-dependent plasticity (STDP)
22. Spike-Timing-Dependent Plasticity = Temporal Hebbian Learning
(Figure: weight-change curve (Bi and Poo, 2001); synaptic change with a causal (pre before post, possibly potentiating) and an acausal side.)
23. Back to the math. We had:

Single input:      dw1/dt = μ v u1,   μ << 1

Many inputs:       dw/dt = μ v u,   μ << 1
(As v is a single output, it is a scalar.)

Averaging inputs:  dw/dt = μ ⟨v u⟩,   μ << 1
We can just average over all input patterns and approximate the weight change by this. Remember, this assumes that weight changes are slow.

If we replace v with w · u we can write

dw/dt = μ Q · w,   where Q = ⟨u uᵀ⟩ is the input correlation matrix.

Note: Hebb yields an unstable (always growing) weight vector!
24. Covariance Rule(s)
Normally firing rates are only positive and plain Hebb would yield only LTP. Hence we introduce a threshold to also get LTD:

Output threshold:        dw/dt = μ (v - θ) u,   μ << 1
Input vector threshold:  dw/dt = μ v (u - θ),   μ << 1

Many times one sets the threshold to the average activity over some reference time period (training period): θ = ⟨v⟩ or θ = ⟨u⟩. Together with v = w · u we get

dw/dt = μ C · w,   where C is the covariance matrix of the input:

C = ⟨(u - ⟨u⟩)(u - ⟨u⟩)ᵀ⟩ = ⟨u uᵀ⟩ - ⟨u⟩⟨u⟩ᵀ = ⟨(u - ⟨u⟩) uᵀ⟩

http://en.wikipedia.org/wiki/Covariance_matrix
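A minimal sketch (toy data assumed) of the covariance rule with an output threshold; the threshold θ = ⟨v⟩ is estimated over a reference period, so LTD occurs whenever v < θ:

```python
# Covariance rule with an output threshold: dw ~ mu * (v - theta) * u, theta = <v>.
import numpy as np

rng = np.random.default_rng(1)
mu = 0.005
patterns = np.abs(rng.normal(loc=1.0, scale=0.5, size=(200, 4)))  # positive firing rates
w = rng.normal(scale=0.1, size=4)

theta = float(np.mean(patterns @ w))  # reference-period average output <v>
for u in patterns:
    v = w @ u
    w += mu * (v - theta) * u         # LTP if v > theta, LTD if v < theta
```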
25. The covariance rule can produce LTP without (!) postsynaptic output. This is biologically unrealistic, and the BCM rule (Bienenstock, Cooper, Munro) takes care of this.

BCM rule:   dw/dt = μ v u (v - θ),   μ << 1

As such this rule is again unstable, but BCM introduces a sliding threshold:

dθ/dt = ν (v² - θ),   ν < 1

Note: the rate of threshold change ν should be faster than the weight changes (μ), but slower than the presentation of the individual input patterns. This way the weight growth will be over-dampened relative to the (weight-induced) activity increase.
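A minimal sketch (toy data and rates assumed) of the BCM rule with its sliding threshold:

```python
# BCM rule: dw ~ mu * v * u * (v - theta), with sliding threshold dtheta ~ nu * (v**2 - theta).
import numpy as np

rng = np.random.default_rng(2)
mu, nu = 0.001, 0.05                          # nu > mu: the threshold moves faster than the weights
patterns = np.abs(rng.normal(size=(500, 4)))  # positive firing rates
w = np.abs(rng.normal(scale=0.1, size=4))
theta = 0.0

for u in patterns:
    v = w @ u
    w += mu * v * u * (v - theta)             # LTD for v < theta, LTP for v > theta
    theta += nu * (v**2 - theta)              # threshold tracks a running average of v^2
```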
26. Problem: Hebbian learning can lead to unlimited weight growth.
Solution: weight normalization, either
a) subtractive (subtract the mean change of all weights from each individual weight), or
b) multiplicative (multiply each weight by a gradually decreasing factor).
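A minimal sketch (toy update assumed) of the two normalization schemes; the multiplicative variant is shown here as a rescaling that keeps the weight-vector norm fixed, which is one common way of realizing a gradually decreasing factor:

```python
# Two ways to bound Hebbian weight growth: subtractive and multiplicative normalization.
import numpy as np

def hebb_step_subtractive(w, u, mu=0.01):
    v = w @ u
    dw = mu * v * u
    return w + (dw - dw.mean())      # subtract the mean change from every individual weight

def hebb_step_multiplicative(w, u, mu=0.01):
    v = w @ u
    w = w + mu * v * u
    return w / np.linalg.norm(w)     # rescale the whole vector so |w| stays constant
```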
27. Examples of Applications
- Kohonen (1984): speech recognition, a map of phonemes in the Finnish language
- Goodhill (1993): proposed a model for the development of retinotopy and ocular dominance, based on Kohonen maps (SOMs)
- Angeliol et al. (1988): travelling salesman problem (an optimization problem)
- Kohonen (1990): learning vector quantization (pattern classification problem)
- Ritter and Kohonen (1989): semantic maps
28. Differential Hebbian Learning of Sequences
Learning to act in response to sequences of sensor events.
29. Overview of Different Methods
You are here!
30. History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning
I. Pavlov
32. History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning
Correlating two stimuli which are shifted with respect to each other in time. Pavlov's dog: the bell comes earlier than the food. This requires the system to remember the stimuli.
Eligibility trace: a synapse remains eligible for modification for some time after it was active (Hull 1938, then a still abstract concept).
I. Pavlov
33. Classical Conditioning: Eligibility Traces
The first stimulus needs to be remembered in the system.
34. History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning and Eligibility Traces
Note: there are vastly different time scales. (Pavlov's) behavioural experiments: typically up to 4 seconds, as compared to STDP at neurons: typically 40-60 milliseconds (max.).
I. Pavlov
35. Defining the Trace
In general there are many ways to do this, but usually one chooses a trace that looks biologically realistic and allows for some analytical calculations, too.
EPSP-like functions:
- α-function
- Dampened sine wave (shows an oscillation)
- Double exponential (the easiest to handle analytically and, thus, often used)
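A minimal sketch (time constants assumed) of these trace kernels; a traced input is obtained by convolving a stimulus train with one of them:

```python
# EPSP-like trace kernels: alpha function, double exponential, dampened sine wave.
import numpy as np

def alpha_function(t, tau=20.0):
    return (t / tau) * np.exp(1.0 - t / tau) * (t >= 0)

def double_exponential(t, tau_rise=5.0, tau_decay=40.0):
    return (np.exp(-t / tau_decay) - np.exp(-t / tau_rise)) * (t >= 0)

def dampened_sine(t, tau=40.0, freq=0.02):
    return np.exp(-t / tau) * np.sin(2.0 * np.pi * freq * t) * (t >= 0)

t = np.arange(0.0, 200.0)                           # time in ms
x = np.zeros_like(t); x[10] = 1.0                   # a single input event
u = np.convolve(x, double_exponential(t))[:t.size]  # traced input
```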
36. Overview of Different Methods
37. Differential Hebb Learning Rule
Simpler notation: x = input, u = traced input.
(Figure: the early input x1 ("bell") and the late input x0 ("food") are filtered into traces u1 and u0, weighted and summed (Σ) to give the output v.)
38. Convolution is used to define the traced input; correlation is used to calculate the weight growth.
39. Differential Hebbian Learning
(Figure: output and resulting weight change.)
Produces an asymmetric weight-change curve (if the filters h produce unimodal humps).
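A minimal sketch (signals, kernel, and parameters assumed) of one common form of differential Hebbian learning, in which the plastic weight changes in proportion to the traced early input times the temporal derivative of the output; the weight grows because the "bell" precedes the "food":

```python
# Differential Hebb on traced inputs: dw1 ~ mu * u1 * dv/dt, with u = x convolved with h.
import numpy as np

def double_exponential(t, tau_rise=5.0, tau_decay=40.0):
    return (np.exp(-t / tau_decay) - np.exp(-t / tau_rise)) * (t >= 0)

T = 1000
h = double_exponential(np.arange(200.0))

x1 = np.zeros(T); x1[100] = 1.0       # early stimulus ("bell"), plastic weight w1
x0 = np.zeros(T); x0[140] = 1.0       # late stimulus ("food"), fixed weight w0 = 1
u1 = np.convolve(x1, h)[:T]           # traced inputs
u0 = np.convolve(x0, h)[:T]

mu, w1 = 0.01, 0.0
for k in range(1, T):
    v_now  = w1 * u1[k]     + u0[k]
    v_prev = w1 * u1[k - 1] + u0[k - 1]
    w1 += mu * u1[k] * (v_now - v_prev)   # positive growth for the causal order x1 before x0
print("w1 =", w1)
```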
40. Conventional LTP
(Figure: symmetrical weight-change curve; synaptic change as a function of relative timing.)
The temporal order of input and output does not play any role.
41. Differential Hebbian Learning
(Figure: output and resulting weight change.)
Produces an asymmetric weight-change curve (if the filters h produce unimodal humps).
42. Spike-timing-dependent plasticity (STDP): some vague shape similarity
(Figure: weight-change curve (Bi and Poo, 2001); synaptic change plotted against T = t_Post - t_Pre in ms.)
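A minimal sketch (amplitudes and time constants assumed) of the standard exponential STDP window as it is usually fitted to such data:

```python
# Exponential STDP window: potentiation for T = t_post - t_pre > 0, depression for T < 0.
import numpy as np

def stdp_window(T, A_plus=1.0, A_minus=0.5, tau_plus=17.0, tau_minus=34.0):
    T = np.asarray(T, dtype=float)
    return np.where(T >= 0,
                    A_plus * np.exp(-T / tau_plus),    # causal order: LTP
                    -A_minus * np.exp(T / tau_minus))  # acausal order: LTD

dT = np.arange(-100.0, 100.0)  # ms
dw = stdp_window(dT)
```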
43. Overview of Different Methods
44. The Biophysical Equivalent of Hebb's Postulate
(Figure: plastic synapse with NMDA/AMPA receptors.)
Pre-post correlation, but why is this needed?
45. Plasticity is mainly mediated by so-called N-methyl-D-aspartate (NMDA) channels. These channels respond to glutamate as their transmitter and they are voltage dependent.
46. Biophysical Model: Structure
(Figure: input x onto an NMDA synapse; postsynaptic potential v.)
Hence NMDA synapses (channels) do require a (Hebbian) correlation between pre- and postsynaptic activity!
47. Local Events at the Synapse
(Figure: input x1; current sources under the synapse generate a local potential u1, which contributes to the output v.)
48. On Eligibility Traces
(Figure: traced inputs are weighted by w and summed (Σ).)
49. Model Structure
- Dendritic compartment
- Plastic synapse with NMDA channels
- Source of Ca2+ influx and coincidence detector
(Figure: plastic synapse with NMDA/AMPA receptors.)
50. Plasticity Rule (Differential Hebb)
Instantaneous weight change:
- Presynaptic influence: glutamate effect on NMDA channels
- Postsynaptic influence
51. Presynaptic Influence
Normalized NMDA conductance.
NMDA channels are instrumental for LTP and LTD induction (Malenka and Nicoll, 1999; Dudek and Bear, 1992).
52. Depolarizing Potentials in the Dendritic Tree
- Dendritic spikes (Larkum et al., 2001; Golding et al., 2002; Häusser and Mel, 2003)
- Back-propagating spikes (Stuart et al., 1997)
53. Postsynaptic Influence
For F we use a low-pass filtered (slow) version of a back-propagating or a dendritic spike.
54. BP and D-Spikes
55. Weight-Change Curves, Source of Depolarization: Back-Propagating Spikes
(Figure: NMDAr activation paired with a back-propagating spike yields the weight-change curve, plotted against T = t_Post - t_Pre.)
56. CLOSED-LOOP LEARNING
- Learning to act (to produce appropriate behavior)
- Instrumental (operant) conditioning
57. Conditioned Input
This is an open-loop system.
58. Closed Loop
(Figure: an adaptable neuron senses and behaves in the environment (Env.), closing the loop: behaving, sensing, adaptable neuron.)
59. Instrumental/Operant Conditioning
60. Behaviorism
"All we need to know in order to describe and explain behavior is this: actions followed by good outcomes are likely to recur, and actions followed by bad outcomes are less likely to recur." (Skinner, 1953)
Skinner had invented the type of experiments called operant conditioning.
B. F. Skinner (1904-1990)
61. Operant behavior occurs without an observable external stimulus. It operates on the organism's environment. The behavior is instrumental in securing a stimulus and is more representative of everyday learning.
Skinner Box
62. OPERANT CONDITIONING TECHNIQUES
- POSITIVE REINFORCEMENT: increasing a behavior by administering a reward
- NEGATIVE REINFORCEMENT: increasing a behavior by removing an aversive stimulus when the behavior occurs
- PUNISHMENT: decreasing a behavior by administering an aversive stimulus following the behavior OR by removing a positive stimulus
- EXTINCTION: decreasing a behavior by not rewarding it
63. Overview of Different Methods
You are here!
64. How to assure behavioral learning convergence?
This is achieved by starting with a stable reflex-like action and learning to supersede it by an anticipatory action.
65. Reflex Only (compare to an electronic closed-loop controller!)
Think of a thermostat!
This structure assures initial (behavioral) stability (homeostasis).
66. Robot Application
67. Robot Application
Learning goal: correlate the vision signals with the touch signals and navigate without collisions.
Initially built-in behavior: retraction reaction whenever an obstacle is touched.
68. Robot Example
69. What has happened to the system during learning?
The primary reflex reaction has effectively been eliminated and replaced by an anticipatory action.
70. Overview of Different Methods: Supervised Learning
And many more.
You are here!
71. Supervised learning methods are
mostly non-neuronal and will therefore not be
discussed here.
72. Reinforcement Learning (RL)
Learning from rewards (and punishments). Learning to assess the value of states. Learning goal-directed behavior.
RL has been developed rather independently from two different fields:
- Dynamic programming and machine learning (Bellman equation).
- Psychology (classical conditioning) and later neuroscience (the dopamine system in the brain).
73. Back to Classical Conditioning
U(C)S = Unconditioned Stimulus
U(C)R = Unconditioned Response
CS = Conditioned Stimulus
CR = Conditioned Response
I. Pavlov
74. Less classical, but also conditioning! (Example from a car advertisement)
Learning the association CS → U(C)R
Porsche → Good Feeling
75. Overview of Different Methods: Reinforcement Learning
You are here!
76. Overview of Different Methods: Reinforcement Learning
And later also here!
77. Notation
US: r, R (Reward)
CS: s, u (Stimulus = State¹)
CR: v, V ((Strength of the) Expected Reward = Value)
UR: --- (not required in the mathematical formalisms of RL)
Weight: w, the weight used for calculating the value, e.g. v = wu
Action: a
Policy: π
¹ Note: the notion of a state really only makes sense as soon as there is more than one state.
78. A Note on Value and Reward Expectation
If you are at a certain state, then you would value this state according to how much reward you can expect when moving on from this state to the end-point of your trial.
Hence: Value = Expected Reward!
More accurately: Value = Expected cumulative future discounted reward. (For this, see later!)
79. Types of Rules
- Rescorla-Wagner rule: allows for explaining several types of conditioning experiments.
- TD rule (TD algorithm): allows measuring the value of states and allows accumulating rewards. Thereby it generalizes the Rescorla-Wagner rule.
- The TD algorithm can be extended to allow measuring the value of actions and thereby to control behavior, either by ways of
  - Q or SARSA learning, or with
  - Actor-Critic architectures.
80. Overview of Different Methods: Reinforcement Learning
You are here!
81. Rescorla-Wagner Rule

Pavlovian:   Train: u → r.   Result: u → v = max.
Extinction:  Pre-Train: u → r.   Train: u → (no reward).   Result: u → v = 0.
Partial:     Train: u → r in some trials, u → (no reward) in others.   Result: u → v < max.

We define v = wu, with u = 1 or u = 0 (binary), and w → w + μδu with δ = r - v.
The associability between stimulus u and reward r is represented by the learning rate μ.
This learning rule minimizes the average squared error between the actual reward r and the prediction v, hence min ⟨(r - v)²⟩.
We realize that δ is the prediction error.
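A minimal simulation sketch (trial setup assumed) of the Rescorla-Wagner update for the Pavlovian (reward on every trial) and partial (reward on 50% of trials) paradigms:

```python
# Rescorla-Wagner: w <- w + mu * delta * u, with delta = r - v and v = w * u.
import numpy as np

rng = np.random.default_rng(3)
mu = 0.1

def train(p_reward, trials=200):
    w = 0.0
    for _ in range(trials):
        u = 1.0                               # stimulus present
        v = w * u                             # prediction
        r = 1.0 if rng.random() < p_reward else 0.0
        w += mu * (r - v) * u                 # delta = prediction error
    return w

print("Pavlovian:", train(1.0))               # approaches v = 1 (max)
print("Partial  :", train(0.5))               # approaches v = 0.5 (< max)
```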
82. Pavlovian / Extinction / Partial
(Figure: learning curves for the three paradigms.)
Stimulus u is paired with r = 1 in 100% of the discrete epochs for Pavlovian and in 50% of the cases for Partial.
83. Rescorla-Wagner Rule, Vector Form for Multiple Stimuli
We define v = w · u, and w → w + μδu with δ = r - v, where we minimize δ.
84. Rescorla-Wagner Rule, Vector Form for Multiple Stimuli

Inhibitory
Pre-Train: -
Train: u1 + u2 → (no reward), alternating with u1 → r
Result: u1 → v = max, u2 → v < 0

Inhibitory conditioning: presentation of one stimulus together with the reward, alternating with presentation of a pair of stimuli where the reward is missing. In this case the second stimulus actually predicts the ABSENCE of the reward (negative v). Trials in which the first stimulus is presented together with the reward lead to w1 > 0. In trials where both stimuli are present the net prediction will be v = w1 u1 + w2 u2 = 0. As u1, u2 = 1 (or zero) and w1 > 0, we get w2 < 0 and, consequentially, v(u2) < 0.
85. Rescorla-Wagner Rule, Vector Form for Multiple Stimuli

Overshadow
Pre-Train: -
Train: u1 + u2 → r
Result: u1 → v < max, u2 → v < max

Overshadowing: always presenting two stimuli together with the reward will lead to a sharing of the reward prediction between them. We get v = w1 u1 + w2 u2 → r. Using different learning rates μ will lead to differently strong growth of w1, w2 and represents the often observed different saliency of the two stimuli.
86. Rescorla-Wagner Rule, Vector Form for Multiple Stimuli

Secondary
Pre-Train: u1 → r
Train: u2 → u1
Result: u2 → v = max

Secondary conditioning reflects the replacement of one stimulus by a new one for the prediction of a reward. As we have seen, the Rescorla-Wagner rule is very simple but still able to represent many of the basic findings of diverse conditioning experiments. Secondary conditioning, however, CANNOT be captured.
87. Predicting Future Reward
The Rescorla-Wagner rule cannot deal with the sequentiality of stimuli (required to deal with secondary conditioning). As a consequence, it treats this case similarly to inhibitory conditioning and leads to a negative w2.
Animals can predict such sequences to some degree and form the correct associations. For this we need algorithms that keep track of time. Here we do this by ways of states that are subsequently visited and evaluated.
88. Prediction and Control
The goal of RL is two-fold:
- To predict the value of states (exploring the state space following a policy): the Prediction Problem.
- To change the policy towards finding the optimal policy: the Control Problem.
Terminology (again):
- State
- Action
- Reward
- Value
- Policy
89. Markov Decision Problems (MDPs)
(Figure: states, actions, and rewards.)
If the future of the system always depends only on the current state and action, then the system is said to be Markovian.
90. What does an RL agent do?
An RL agent explores the state space, trying to accumulate as much reward as possible. It follows a behavioral policy, performing actions (which usually will lead the agent from one state to the next). For the Prediction Problem: it updates the value of each given state by assessing how much future (!) reward can be obtained when moving onwards from this state (State Space). It does not change the policy; rather, it evaluates it (Policy Evaluation).
91. For the Control Problem: it updates the value of each given action at a given state by assessing how much future reward can be obtained when performing this action at that state (State-Action Space, which is larger than the State Space), and all following actions at the following states moving onwards.
Guess: will we have to evaluate ALL states and actions onwards?
92. What does an RL agent do?
Exploration-Exploitation Dilemma: the agent wants to get as much cumulative reward (also often called return) as possible. For this it should always perform the most rewarding action, exploiting its (learned) knowledge of the state space. This way, however, it might miss an action which leads (a bit further on) to a much more rewarding path. Hence the agent must also explore unknown parts of the state space. The agent must, thus, balance its policy to include exploitation and exploration.
Policies
- Greedy policy: the agent always exploits and selects the most rewarding action. This is sub-optimal, as the agent never finds better new paths.
93. Policies
- ε-greedy policy: with a small probability ε the agent will choose a non-optimal action. All non-optimal actions are chosen with equal probability. This can take very long, as it is not known how big ε should be. One can also anneal the system by gradually lowering ε to become more and more greedy.
- Softmax policy: ε-greedy can be problematic because of the above (all non-optimal actions are equally likely). Softmax ranks the actions according to their values and chooses roughly following that ranking, using, for example, a Boltzmann/Gibbs distribution over the action values (see the sketch below).
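A minimal sketch (action values assumed) of both action-selection schemes:

```python
# Epsilon-greedy and softmax (Boltzmann/Gibbs) action selection over a set of action values.
import numpy as np

rng = np.random.default_rng(4)

def epsilon_greedy(q_values, epsilon=0.1):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))     # explore: any action, uniformly
    return int(np.argmax(q_values))                 # exploit: best-valued action

def softmax_policy(q_values, temperature=1.0):
    prefs = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(prefs - prefs.max())             # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))  # higher-valued actions are chosen more often

q = [0.2, 0.5, 0.1]
a_eps, a_soft = epsilon_greedy(q), softmax_policy(q)
```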
94. Overview of Different Methods: Reinforcement Learning
You are here!
95. Towards TD-Learning: Pictorial View
In the following slides we will treat Policy Evaluation: we define some given policy and want to evaluate the state space. We are at the moment still not interested in evaluating actions or in improving policies.
Back to the question: to get the value of a given state, will we have to evaluate ALL states and actions onwards?
There is no unique answer to this! Different methods exist which assign the value of a state by using differently many (weighted) values of subsequent states. We will discuss a few but concentrate on the most commonly used TD algorithm(s): Temporal Difference (TD) Learning.
96. Formalising RL: Policy Evaluation with the goal of finding the optimal value function of the state space
We consider a sequence s_t, r_{t+1}, s_{t+1}, r_{t+2}, ..., r_T, s_T. Note, rewards occur downstream (in the future) from a visited state. Thus, r_{t+1} is the next future reward which can be reached starting from state s_t. The complete return R_t to be expected in the future from state s_t is thus given by

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_k γ^k r_{t+k+1},

where γ ≤ 1 is a discount factor. This accounts for the fact that rewards in the far future should be valued less. Reinforcement learning assumes that the value of a state V(s) is directly equivalent to the expected return E_π at this state, where π denotes the (here unspecified) action policy to be followed:

V(s_t) = E_π[R_t].

Thus, the value of state s_t can be iteratively updated with

V(s_t) ← V(s_t) + α [R_t - V(s_t)].
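A minimal sketch (toy episode data assumed) of the discounted return and this update, which the next slide identifies as the constant-α Monte Carlo update:

```python
# Discounted return R_t = r_{t+1} + gamma * R_{t+1} and the constant-alpha Monte Carlo update.
import numpy as np

gamma, alpha = 0.9, 0.1
states  = [0, 1, 2, 3]                # visited states s_t, ..., s_T of one episode (3 = terminal)
rewards = [0.0, 0.0, 1.0]             # r_{t+1} received after leaving each non-terminal state
V = np.zeros(4)

# compute the returns backwards through the episode
returns, R = [0.0] * len(rewards), 0.0
for t in reversed(range(len(rewards))):
    R = rewards[t] + gamma * R
    returns[t] = R

# the Monte Carlo update can only run after the episode has terminated
for t, s in enumerate(states[:-1]):
    V[s] += alpha * (returns[t] - V[s])
```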
97. We use α as a step-size parameter, which is not of great importance here, though, and can be held constant. Note, if V(s_t) correctly predicts the expected complete return R_t, the update will be zero and we have found the final value. This method is called the constant-α Monte Carlo update. It requires waiting until a sequence has reached its terminal state (see some slides before!) before the update can commence. For long sequences this may be problematic. Thus, one should try to use an incremental procedure instead. We define a different update rule with

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)].

The elegant trick is to assume that, if the process converges, the value of the next state V(s_{t+1}) should be an accurate estimate of the expected return downstream of this state (i.e., downstream of s_{t+1}). Thus, we would hope that the following holds:

R_t ≈ r_{t+1} + γ V(s_{t+1}).

Indeed, proofs exist that under certain boundary conditions this procedure, known as TD(0), converges to the optimal value function for all states.
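A minimal sketch (toy chain environment assumed) of tabular TD(0) policy evaluation, updating after every step instead of waiting for the end of the episode:

```python
# Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)) after every transition.
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states + 1)            # last entry is the terminal state

for episode in range(500):
    s = 0
    while s < n_states:               # fixed policy: always move right
        s_next = s + 1
        r = 1.0 if s_next == n_states else 0.0    # reward only on reaching the end
        td_error = r + gamma * V[s_next] - V[s]   # the delta of TD-learning
        V[s] += alpha * td_error
        s = s_next

print(V[:n_states])                   # approaches gamma ** (n_states - 1 - s) for each state s
```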
98. Reinforcement Learning: Relations to Brain Function I
You are here!
99. How to Implement TD in a Neuronal Way
Now we have:
100. How to Implement TD in a Neuronal Way
101. Reinforcement Learning: Relations to Brain Function II
You are here!
102. TD-Learning and Brain Function
This neuron is supposed to represent the δ-error of TD-learning, which has moved forward as expected.
DA responses in the basal ganglia: pars compacta of the substantia nigra and the medially adjoining ventral tegmental area (VTA).
103. TD-Learning and Brain Function
This neuron is supposed to represent the reward expectation signal v. It has extended forward (almost) to the CS (here called Tr), as expected from the TD rule. Such neurons are found in the striatum, orbitofrontal cortex and amygdala.
104. Reinforcement Learning: The Control Problem
So far we have concentrated on evaluating an unchanging policy. Now comes the question of how to actually improve a policy π, trying to find the optimal policy.
We will discuss:
- Actor-Critic architectures
But not:
- SARSA learning
- Q-learning
(π is the abbreviation for policy.)
105. Reinforcement Learning: Control Problem I
You are here!
106. Control Loops
A basic feedback-loop controller (reflex), as in the slide before.
107. Control Loops
An Actor-Critic architecture: the Critic produces evaluative reinforcement feedback for the Actor by observing the consequences of its actions. The Critic takes the form of a TD-error, which gives an indication of whether things have gone better or worse than expected with the preceding action. Thus, this TD-error can be used to evaluate the preceding action: if the error is positive, the tendency to select this action should be strengthened; or else, lessened.
108. Example of an Actor-Critic Procedure
Action selection here follows the Gibbs softmax method:

π(s, a) = exp(p(s, a)) / Σ_b exp(p(s, b)),

where the p(s, a) are the values of the modifiable (by the Critic!) policy parameters of the Actor, indicating the tendency to select action a when being in state s.
We can now modify p for a given state-action pair at time t with

p(s_t, a_t) ← p(s_t, a_t) + β δ_t,

where δ_t is the δ-error of the TD Critic.
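A minimal sketch (toy chain task and parameters assumed) of a tabular Actor-Critic in this spirit: the Critic learns V(s) via TD(0), and its δ both updates V and reinforces the Actor's preferences p(s, a):

```python
# Tabular Actor-Critic: Critic = TD(0) value learner; Actor = Gibbs-softmax preferences p(s, a)
# updated with the Critic's TD-error delta.
import numpy as np

rng = np.random.default_rng(5)
n_states, gamma, alpha, beta = 5, 0.9, 0.1, 0.1
V = np.zeros(n_states + 1)            # state values (last entry = terminal)
p = np.zeros((n_states, 2))           # action preferences: 0 = left, 1 = right

def gibbs_softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

for episode in range(2000):
    s = 0
    while s < n_states:
        a = int(rng.choice(2, p=gibbs_softmax(p[s])))
        s_next = s + 1 if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states else 0.0
        delta = r + gamma * V[s_next] - V[s]   # Critic's TD-error
        V[s] += alpha * delta                  # Critic update
        p[s, a] += beta * delta                # Actor update: strengthen the action if delta > 0
        s = s_next
```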
109. Reinforcement Learning: Control I and Brain Function III
You are here!
110. Actor-Critics and the Basal Ganglia
The basal ganglia are a brain structure involved in motor control. It has been suggested that they learn by ways of an Actor-Critic mechanism.
VP = ventral pallidum, SNr = substantia nigra pars reticulata, SNc = substantia nigra pars compacta, GPi = globus pallidus pars interna, GPe = globus pallidus pars externa, VTA = ventral tegmental area, RRA = retrorubral area, STN = subthalamic nucleus.
111. Actor-Critics and the Basal Ganglia: The Critic
(Figure: C = cortex, S = striatum, STN = subthalamic nucleus, DA = dopamine system, r = reward.)
So-called striosomal modules fulfill the functions of the adaptive Critic. The prediction-error (δ) characteristics of the DA neurons of the Critic are generated by: 1) equating the reward r with excitatory input from the lateral hypothalamus; 2) equating the term v(t) with indirect excitation at the DA neurons, which is initiated from striatal striosomes and channelled through the subthalamic nucleus onto the DA neurons; 3) equating the term v(t-1) with direct, long-lasting inhibition from striatal striosomes onto the DA neurons. There are many problems with this simplistic view, though: timing, mismatch to anatomy, etc.
112. The End