Title: Overview of Different Methods
1. Overview of Different Methods
2. Different Types/Classes of Learning
- Unsupervised Learning (non-evaluative feedback)
  - Trial-and-error learning.
  - No error signal.
  - No influence from a teacher; correlation evaluation only.
- Reinforcement Learning (evaluative feedback)
  - (Classical, instrumental) conditioning, reward-based learning.
  - Good/bad error signals.
  - The teacher defines what is good and what is bad.
- Supervised Learning (evaluative error-signal feedback)
  - Teaching, coaching, imitation learning, learning from examples, and more.
  - Rigorous error signals.
  - Direct influence from a teacher/teaching signal.
3. An Unsupervised Learning Rule
For learning: one input, one output.
4. Self-Organizing Maps (Unsupervised Learning)
(Figure: mapping from the input space onto the map.)
Neighborhood relationships are usually preserved (+).
Absolute structure depends on the initial condition and cannot be predicted (-).
5. An Unsupervised Learning Rule
For learning: one input, one output.
6. Classical Conditioning
I. Pavlov
7. An Unsupervised Learning Rule
For learning: one input, one output.
8. Supervised Learning Example: OCR
9. The influence of the type of learning on speed and autonomy of the learner
10. Hebbian Learning
"When an axon of cell A excites cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells so that A's efficiency ... is increased." Donald Hebb (1949)
(Figure: cells A and B; A fires shortly before B along the time axis t.)
11. Overview of Different Methods
You are here!
12. Hebbian Learning
(Figure: a single neuron with input u1, synaptic weight w1, and output v.)
Vector notation, cell activity: v = w · u
This is a dot product, where w is the weight vector and u the input vector. Strictly, we need to assume that weight changes are slow; otherwise this turns into a differential equation.
13. Hebbian Learning: The Math

Single input:      dw1/dt = μ v u1,   μ << 1

Many inputs:       dw/dt = μ v u,   μ << 1
(As v is a single output, it is a scalar.)

Averaging inputs:  dw/dt = μ ⟨v u⟩,   μ << 1
We can just average over all input patterns and approximate the weight change by this. Remember, this assumes that weight changes are slow.

If we replace v with w · u we can write

dw/dt = μ Q · w,   where Q = ⟨u uᵀ⟩ is the input correlation matrix.

Note: Hebb yields an unstable (always growing) weight vector!
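A minimal numerical sketch (toy data and parameters assumed, not from the slides) of the plain Hebb rule above, illustrating the unbounded weight growth:

```python
# Plain Hebbian learning in discrete time: v = w . u, dw ~ mu * v * u, mu << 1.
# The weight vector keeps growing (it aligns with the dominant eigenvector of Q = <u u^T>).
import numpy as np

rng = np.random.default_rng(0)
mu = 0.005                            # learning rate, mu << 1
patterns = rng.normal(size=(200, 4))  # assumed toy input patterns u
w = rng.normal(scale=0.1, size=4)     # initial weight vector

for epoch in range(20):
    for u in patterns:
        v = w @ u                     # postsynaptic activity (dot product)
        w += mu * v * u               # plain Hebb update
    # no bound on |w|: this is the instability noted above
print("final |w| =", np.linalg.norm(w))
```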
14. Synaptic plasticity evoked artificially: examples of long-term potentiation (LTP) and long-term depression (LTD).
LTP: first demonstrated by Bliss and Lomo in 1973. Since then induced in many different ways, usually in slice.
LTD: robustly shown by Dudek and Bear in 1992, in hippocampal slice.
18. LTP will lead to new synaptic contacts.
19. Conventional LTP = Hebbian Learning
(Figure: symmetrical weight-change curve; synaptic change as a function of relative timing.)
The temporal order of input and output does not play any role.
21. Spike-timing-dependent plasticity (STDP)
22. Spike-Timing-Dependent Plasticity = Temporal Hebbian Learning
(Figure: weight-change curve (Bi and Poo, 2001); synaptic change with a causal (pre before post, possibly potentiating) and an acausal side.)
23. Back to the math. We had:

Single input:      dw1/dt = μ v u1,   μ << 1

Many inputs:       dw/dt = μ v u,   μ << 1
(As v is a single output, it is a scalar.)

Averaging inputs:  dw/dt = μ ⟨v u⟩,   μ << 1
We can just average over all input patterns and approximate the weight change by this. Remember, this assumes that weight changes are slow.

If we replace v with w · u we can write

dw/dt = μ Q · w,   where Q = ⟨u uᵀ⟩ is the input correlation matrix.

Note: Hebb yields an unstable (always growing) weight vector!
24. Covariance Rule(s)
Normally firing rates are only positive and plain Hebb would yield only LTP. Hence we introduce a threshold to also get LTD:

Output threshold:        dw/dt = μ (v - θ) u,   μ << 1
Input vector threshold:  dw/dt = μ v (u - θ),   μ << 1

Many times one sets the threshold to the average activity over some reference time period (training period): θ = ⟨v⟩ or θ = ⟨u⟩. Together with v = w · u we get

dw/dt = μ C · w,   where C is the covariance matrix of the input:

C = ⟨(u - ⟨u⟩)(u - ⟨u⟩)ᵀ⟩ = ⟨u uᵀ⟩ - ⟨u⟩⟨u⟩ᵀ = ⟨(u - ⟨u⟩) uᵀ⟩

http://en.wikipedia.org/wiki/Covariance_matrix
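A minimal sketch (toy data assumed) of the covariance rule with an output threshold; the threshold θ = ⟨v⟩ is estimated over a reference period, so LTD occurs whenever v < θ:

```python
# Covariance rule with an output threshold: dw ~ mu * (v - theta) * u, theta = <v>.
import numpy as np

rng = np.random.default_rng(1)
mu = 0.005
patterns = np.abs(rng.normal(loc=1.0, scale=0.5, size=(200, 4)))  # positive firing rates
w = rng.normal(scale=0.1, size=4)

theta = float(np.mean(patterns @ w))  # reference-period average output <v>
for u in patterns:
    v = w @ u
    w += mu * (v - theta) * u         # LTP if v > theta, LTD if v < theta
```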
25. The covariance rule can produce LTP without (!) postsynaptic output. This is biologically unrealistic, and the BCM rule (Bienenstock, Cooper, Munro) takes care of this.

BCM rule:   dw/dt = μ v u (v - θ),   μ << 1

As such this rule is again unstable, but BCM introduces a sliding threshold:

dθ/dt = ν (v² - θ),   ν < 1

Note: the rate of threshold change ν should be faster than the weight changes (μ), but slower than the presentation of the individual input patterns. This way the weight growth will be over-dampened relative to the (weight-induced) activity increase.
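A minimal sketch (toy data and rates assumed) of the BCM rule with its sliding threshold:

```python
# BCM rule: dw ~ mu * v * u * (v - theta), with sliding threshold dtheta ~ nu * (v**2 - theta).
import numpy as np

rng = np.random.default_rng(2)
mu, nu = 0.001, 0.05                          # nu > mu: the threshold moves faster than the weights
patterns = np.abs(rng.normal(size=(500, 4)))  # positive firing rates
w = np.abs(rng.normal(scale=0.1, size=4))
theta = 0.0

for u in patterns:
    v = w @ u
    w += mu * v * u * (v - theta)             # LTD for v < theta, LTP for v > theta
    theta += nu * (v**2 - theta)              # threshold tracks a running average of v^2
```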
26. Problem: Hebbian learning can lead to unlimited weight growth.
Solution: weight normalization, either
a) subtractive (subtract the mean change of all weights from each individual weight), or
b) multiplicative (multiply each weight by a gradually decreasing factor).
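A minimal sketch (toy update assumed) of the two normalization schemes; the multiplicative variant is shown here as a rescaling that keeps the weight-vector norm fixed, which is one common way of realizing a gradually decreasing factor:

```python
# Two ways to bound Hebbian weight growth: subtractive and multiplicative normalization.
import numpy as np

def hebb_step_subtractive(w, u, mu=0.01):
    v = w @ u
    dw = mu * v * u
    return w + (dw - dw.mean())      # subtract the mean change from every individual weight

def hebb_step_multiplicative(w, u, mu=0.01):
    v = w @ u
    w = w + mu * v * u
    return w / np.linalg.norm(w)     # rescale the whole vector so |w| stays constant
```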
27. Examples of Applications
- Kohonen (1984): speech recognition, a map of phonemes in the Finnish language
- Goodhill (1993): proposed a model for the development of retinotopy and ocular dominance, based on Kohonen maps (SOMs)
- Angeliol et al. (1988): travelling salesman problem (an optimization problem)
- Kohonen (1990): learning vector quantization (pattern classification problem)
- Ritter and Kohonen (1989): semantic maps
28. Differential Hebbian Learning of Sequences
Learning to act in response to sequences of sensor events.
29. Overview of Different Methods
You are here!
30. History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning
I. Pavlov
32. History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning
Correlating two stimuli which are shifted with respect to each other in time. Pavlov's dog: the bell comes earlier than the food. This requires the system to remember the stimuli.
Eligibility trace: a synapse remains eligible for modification for some time after it was active (Hull 1938, then a still abstract concept).
I. Pavlov
33. Classical Conditioning: Eligibility Traces
The first stimulus needs to be remembered in the system.
34. History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning and Eligibility Traces
Note: there are vastly different time scales. (Pavlov's) behavioural experiments: typically up to 4 seconds, as compared to STDP at neurons: typically 40-60 milliseconds (max.).
I. Pavlov
35. Defining the Trace
In general there are many ways to do this, but usually one chooses a trace that looks biologically realistic and allows for some analytical calculations, too.
EPSP-like functions:
- α-function
- Dampened sine wave (shows an oscillation)
- Double exponential (the easiest to handle analytically and, thus, often used)
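A minimal sketch (time constants assumed) of these trace kernels; a traced input is obtained by convolving a stimulus train with one of them:

```python
# EPSP-like trace kernels: alpha function, double exponential, dampened sine wave.
import numpy as np

def alpha_function(t, tau=20.0):
    return (t / tau) * np.exp(1.0 - t / tau) * (t >= 0)

def double_exponential(t, tau_rise=5.0, tau_decay=40.0):
    return (np.exp(-t / tau_decay) - np.exp(-t / tau_rise)) * (t >= 0)

def dampened_sine(t, tau=40.0, freq=0.02):
    return np.exp(-t / tau) * np.sin(2.0 * np.pi * freq * t) * (t >= 0)

t = np.arange(0.0, 200.0)                           # time in ms
x = np.zeros_like(t); x[10] = 1.0                   # a single input event
u = np.convolve(x, double_exponential(t))[:t.size]  # traced input
```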
36. Overview of Different Methods
37. Differential Hebb Learning Rule
Simpler notation: x = input, u = traced input.
(Figure: the early input x1 ("bell") and the late input x0 ("food") are filtered into traces u1 and u0, weighted and summed (Σ) to give the output v.)
38. Convolution is used to define the traced input; correlation is used to calculate the weight growth.
39. Differential Hebbian Learning
(Figure: output and resulting weight change.)
Produces an asymmetric weight-change curve (if the filters h produce unimodal humps).
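A minimal sketch (signals, kernel, and parameters assumed) of one common form of differential Hebbian learning, in which the plastic weight changes in proportion to the traced early input times the temporal derivative of the output; the weight grows because the "bell" precedes the "food":

```python
# Differential Hebb on traced inputs: dw1 ~ mu * u1 * dv/dt, with u = x convolved with h.
import numpy as np

def double_exponential(t, tau_rise=5.0, tau_decay=40.0):
    return (np.exp(-t / tau_decay) - np.exp(-t / tau_rise)) * (t >= 0)

T = 1000
h = double_exponential(np.arange(200.0))

x1 = np.zeros(T); x1[100] = 1.0       # early stimulus ("bell"), plastic weight w1
x0 = np.zeros(T); x0[140] = 1.0       # late stimulus ("food"), fixed weight w0 = 1
u1 = np.convolve(x1, h)[:T]           # traced inputs
u0 = np.convolve(x0, h)[:T]

mu, w1 = 0.01, 0.0
for k in range(1, T):
    v_now  = w1 * u1[k]     + u0[k]
    v_prev = w1 * u1[k - 1] + u0[k - 1]
    w1 += mu * u1[k] * (v_now - v_prev)   # positive growth for the causal order x1 before x0
print("w1 =", w1)
```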
40. Conventional LTP
(Figure: symmetrical weight-change curve; synaptic change as a function of relative timing.)
The temporal order of input and output does not play any role.
41. Differential Hebbian Learning
(Figure: output and resulting weight change.)
Produces an asymmetric weight-change curve (if the filters h produce unimodal humps).
42. Spike-timing-dependent plasticity (STDP): some vague shape similarity
(Figure: weight-change curve (Bi and Poo, 2001); synaptic change plotted against T = t_Post - t_Pre in ms.)
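A minimal sketch (amplitudes and time constants assumed) of the standard exponential STDP window as it is usually fitted to such data:

```python
# Exponential STDP window: potentiation for T = t_post - t_pre > 0, depression for T < 0.
import numpy as np

def stdp_window(T, A_plus=1.0, A_minus=0.5, tau_plus=17.0, tau_minus=34.0):
    T = np.asarray(T, dtype=float)
    return np.where(T >= 0,
                    A_plus * np.exp(-T / tau_plus),    # causal order: LTP
                    -A_minus * np.exp(T / tau_minus))  # acausal order: LTD

dT = np.arange(-100.0, 100.0)  # ms
dw = stdp_window(dT)
```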
43. Overview of Different Methods
44. The Biophysical Equivalent of Hebb's Postulate
(Figure: plastic synapse with NMDA/AMPA receptors.)
Pre-post correlation, but why is this needed?
45. Plasticity is mainly mediated by so-called N-methyl-D-aspartate (NMDA) channels. These channels respond to glutamate as their transmitter and they are voltage dependent.
46. Biophysical Model: Structure
(Figure: input x onto an NMDA synapse; postsynaptic potential v.)
Hence NMDA synapses (channels) do require a (Hebbian) correlation between pre- and postsynaptic activity!
47. Local Events at the Synapse
(Figure: input x1; current sources under the synapse generate a local potential u1, which contributes to the output v.)
48. On Eligibility Traces
(Figure: traced inputs are weighted by w and summed (Σ).)
49. Model Structure
- Dendritic compartment
- Plastic synapse with NMDA channels
- Source of Ca2+ influx and coincidence detector
(Figure: plastic synapse with NMDA/AMPA receptors.)
50. Plasticity Rule (Differential Hebb)
Instantaneous weight change:
- Presynaptic influence: glutamate effect on NMDA channels
- Postsynaptic influence
51. Presynaptic Influence
Normalized NMDA conductance.
NMDA channels are instrumental for LTP and LTD induction (Malenka and Nicoll, 1999; Dudek and Bear, 1992).
52. Depolarizing Potentials in the Dendritic Tree
- Dendritic spikes (Larkum et al., 2001; Golding et al., 2002; Häusser and Mel, 2003)
- Back-propagating spikes (Stuart et al., 1997)
53. Postsynaptic Influence
For F we use a low-pass filtered (slow) version of a back-propagating or a dendritic spike.
54. BP and D-Spikes
55. Weight-Change Curves, Source of Depolarization: Back-Propagating Spikes
(Figure: NMDAr activation paired with a back-propagating spike yields the weight-change curve, plotted against T = t_Post - t_Pre.)
56. CLOSED-LOOP LEARNING
- Learning to act (to produce appropriate behavior)
- Instrumental (operant) conditioning
57. Conditioned Input
This is an open-loop system.
58. Closed Loop
(Figure: an adaptable neuron senses and behaves in the environment (Env.), closing the loop: behaving, sensing, adaptable neuron.)
59. Instrumental/Operant Conditioning
60. Behaviorism
"All we need to know in order to describe and explain behavior is this: actions followed by good outcomes are likely to recur, and actions followed by bad outcomes are less likely to recur." (Skinner, 1953)
Skinner had invented the type of experiments called operant conditioning.
B. F. Skinner (1904-1990)
61. Operant behavior occurs without an observable external stimulus. It operates on the organism's environment. The behavior is instrumental in securing a stimulus and is more representative of everyday learning.
Skinner Box
62. OPERANT CONDITIONING TECHNIQUES
- POSITIVE REINFORCEMENT: increasing a behavior by administering a reward
- NEGATIVE REINFORCEMENT: increasing a behavior by removing an aversive stimulus when the behavior occurs
- PUNISHMENT: decreasing a behavior by administering an aversive stimulus following the behavior OR by removing a positive stimulus
- EXTINCTION: decreasing a behavior by not rewarding it
63. Overview of Different Methods
You are here!
64. How to assure behavioral learning convergence?
This is achieved by starting with a stable reflex-like action and learning to supersede it by an anticipatory action.
65. Reflex Only (compare to an electronic closed-loop controller!)
Think of a thermostat!
This structure assures initial (behavioral) stability (homeostasis).
66. Robot Application
67. Robot Application
Learning goal: correlate the vision signals with the touch signals and navigate without collisions.
Initially built-in behavior: retraction reaction whenever an obstacle is touched.
68. Robot Example
69. What has happened to the system during learning?
The primary reflex reaction has effectively been eliminated and replaced by an anticipatory action.
70. Overview of Different Methods: Supervised Learning
And many more.
You are here!
71. Supervised learning methods are
mostly non-neuronal and will therefore not be
discussed here.
72. Reinforcement Learning (RL)
Learning from rewards (and punishments). Learning to assess the value of states. Learning goal-directed behavior.
RL has been developed rather independently from two different fields:
- Dynamic programming and machine learning (Bellman equation).
- Psychology (classical conditioning) and later neuroscience (the dopamine system in the brain).
73. Back to Classical Conditioning
U(C)S = Unconditioned Stimulus
U(C)R = Unconditioned Response
CS = Conditioned Stimulus
CR = Conditioned Response
I. Pavlov
74. Less classical, but also conditioning! (Example from a car advertisement)
Learning the association CS → U(C)R
Porsche → Good Feeling
75. Overview of Different Methods: Reinforcement Learning
You are here!
76. Overview of Different Methods: Reinforcement Learning
And later also here!
77. Notation
US: r, R (Reward)
CS: s, u (Stimulus = State¹)
CR: v, V ((Strength of the) Expected Reward = Value)
UR: --- (not required in the mathematical formalisms of RL)
Weight: w, the weight used for calculating the value, e.g. v = wu
Action: a
Policy: π
¹ Note: the notion of a state really only makes sense as soon as there is more than one state.
78. A Note on Value and Reward Expectation
If you are at a certain state, then you would value this state according to how much reward you can expect when moving on from this state to the end-point of your trial.
Hence: Value = Expected Reward!
More accurately: Value = Expected cumulative future discounted reward. (For this, see later!)
79. Types of Rules
- Rescorla-Wagner rule: allows for explaining several types of conditioning experiments.
- TD rule (TD algorithm): allows measuring the value of states and allows accumulating rewards. Thereby it generalizes the Rescorla-Wagner rule.
- The TD algorithm can be extended to allow measuring the value of actions and thereby to control behavior, either by ways of
  - Q or SARSA learning, or with
  - Actor-Critic architectures.
80. Overview of Different Methods: Reinforcement Learning
You are here!
81. Rescorla-Wagner Rule

Pavlovian:   Train: u → r.   Result: u → v = max.
Extinction:  Pre-Train: u → r.   Train: u → (no reward).   Result: u → v = 0.
Partial:     Train: u → r in some trials, u → (no reward) in others.   Result: u → v < max.

We define v = wu, with u = 1 or u = 0 (binary), and w → w + μδu with δ = r - v.
The associability between stimulus u and reward r is represented by the learning rate μ.
This learning rule minimizes the average squared error between the actual reward r and the prediction v, hence min ⟨(r - v)²⟩.
We realize that δ is the prediction error.
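A minimal simulation sketch (trial setup assumed) of the Rescorla-Wagner update for the Pavlovian (reward on every trial) and partial (reward on 50% of trials) paradigms:

```python
# Rescorla-Wagner: w <- w + mu * delta * u, with delta = r - v and v = w * u.
import numpy as np

rng = np.random.default_rng(3)
mu = 0.1

def train(p_reward, trials=200):
    w = 0.0
    for _ in range(trials):
        u = 1.0                               # stimulus present
        v = w * u                             # prediction
        r = 1.0 if rng.random() < p_reward else 0.0
        w += mu * (r - v) * u                 # delta = prediction error
    return w

print("Pavlovian:", train(1.0))               # approaches v = 1 (max)
print("Partial  :", train(0.5))               # approaches v = 0.5 (< max)
```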
82. Pavlovian / Extinction / Partial
(Figure: learning curves for the three paradigms.)
Stimulus u is paired with r = 1 in 100% of the discrete epochs for Pavlovian and in 50% of the cases for Partial.
83. Rescorla-Wagner Rule, Vector Form for Multiple Stimuli
We define v = w · u, and w → w + μδu with δ = r - v, where we minimize δ.
84. Rescorla-Wagner Rule, Vector Form for Multiple Stimuli

Inhibitory
Pre-Train: -
Train: u1 + u2 → (no reward), alternating with u1 → r
Result: u1 → v = max, u2 → v < 0

Inhibitory conditioning: presentation of one stimulus together with the reward, alternating with presentation of a pair of stimuli where the reward is missing. In this case the second stimulus actually predicts the ABSENCE of the reward (negative v). Trials in which the first stimulus is presented together with the reward lead to w1 > 0. In trials where both stimuli are present the net prediction will be v = w1 u1 + w2 u2 = 0. As u1, u2 = 1 (or zero) and w1 > 0, we get w2 < 0 and, consequentially, v(u2) < 0.
85. Rescorla-Wagner Rule, Vector Form for Multiple Stimuli

Overshadow
Pre-Train: -
Train: u1 + u2 → r
Result: u1 → v < max, u2 → v < max

Overshadowing: always presenting two stimuli together with the reward will lead to a sharing of the reward prediction between them. We get v = w1 u1 + w2 u2 → r. Using different learning rates μ will lead to differently strong growth of w1, w2 and represents the often observed different saliency of the two stimuli.
86. Rescorla-Wagner Rule, Vector Form for Multiple Stimuli

Secondary
Pre-Train: u1 → r
Train: u2 → u1
Result: u2 → v = max

Secondary conditioning reflects the replacement of one stimulus by a new one for the prediction of a reward. As we have seen, the Rescorla-Wagner rule is very simple but still able to represent many of the basic findings of diverse conditioning experiments. Secondary conditioning, however, CANNOT be captured.
87. Predicting Future Reward
The Rescorla-Wagner rule cannot deal with the sequentiality of stimuli (required to deal with secondary conditioning). As a consequence, it treats this case similarly to inhibitory conditioning and leads to a negative w2.
Animals can predict such sequences to some degree and form the correct associations. For this we need algorithms that keep track of time. Here we do this by ways of states that are subsequently visited and evaluated.
88. Prediction and Control
The goal of RL is two-fold:
- To predict the value of states (exploring the state space following a policy): the Prediction Problem.
- To change the policy towards finding the optimal policy: the Control Problem.
Terminology (again):
- State
- Action
- Reward
- Value
- Policy
89. Markov Decision Problems (MDPs)
(Figure: states, actions, and rewards.)
If the future of the system always depends only on the current state and action, then the system is said to be Markovian.
90. What does an RL agent do?
An RL agent explores the state space, trying to accumulate as much reward as possible. It follows a behavioral policy, performing actions (which usually will lead the agent from one state to the next). For the Prediction Problem: it updates the value of each given state by assessing how much future (!) reward can be obtained when moving onwards from this state (State Space). It does not change the policy; rather, it evaluates it (Policy Evaluation).
91. For the Control Problem: it updates the value of each given action at a given state by assessing how much future reward can be obtained when performing this action at that state (State-Action Space, which is larger than the State Space), and all following actions at the following states moving onwards.
Guess: will we have to evaluate ALL states and actions onwards?
92. What does an RL agent do?
Exploration-Exploitation Dilemma: the agent wants to get as much cumulative reward (also often called return) as possible. For this it should always perform the most rewarding action, exploiting its (learned) knowledge of the state space. This way, however, it might miss an action which leads (a bit further on) to a much more rewarding path. Hence the agent must also explore unknown parts of the state space. The agent must, thus, balance its policy to include exploitation and exploration.
Policies
- Greedy policy: the agent always exploits and selects the most rewarding action. This is sub-optimal, as the agent never finds better new paths.
93. Policies
- ε-greedy policy: with a small probability ε the agent will choose a non-optimal action. All non-optimal actions are chosen with equal probability. This can take very long, as it is not known how big ε should be. One can also anneal the system by gradually lowering ε to become more and more greedy.
- Softmax policy: ε-greedy can be problematic because of the above (all non-optimal actions are equally likely). Softmax ranks the actions according to their values and chooses roughly following that ranking, using, for example, a Boltzmann/Gibbs distribution over the action values (see the sketch below).
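A minimal sketch (action values assumed) of both action-selection schemes:

```python
# Epsilon-greedy and softmax (Boltzmann/Gibbs) action selection over a set of action values.
import numpy as np

rng = np.random.default_rng(4)

def epsilon_greedy(q_values, epsilon=0.1):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))     # explore: any action, uniformly
    return int(np.argmax(q_values))                 # exploit: best-valued action

def softmax_policy(q_values, temperature=1.0):
    prefs = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(prefs - prefs.max())             # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))  # higher-valued actions are chosen more often

q = [0.2, 0.5, 0.1]
a_eps, a_soft = epsilon_greedy(q), softmax_policy(q)
```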
94. Overview of Different Methods: Reinforcement Learning
You are here!
95. Towards TD-Learning: Pictorial View
In the following slides we will treat Policy Evaluation: we define some given policy and want to evaluate the state space. We are at the moment still not interested in evaluating actions or in improving policies.
Back to the question: to get the value of a given state, will we have to evaluate ALL states and actions onwards?
There is no unique answer to this! Different methods exist which assign the value of a state by using differently many (weighted) values of subsequent states. We will discuss a few but concentrate on the most commonly used TD algorithm(s): Temporal Difference (TD) Learning.
96. Formalising RL: Policy Evaluation with the goal of finding the optimal value function of the state space
We consider a sequence s_t, r_{t+1}, s_{t+1}, r_{t+2}, ..., r_T, s_T. Note, rewards occur downstream (in the future) from a visited state. Thus, r_{t+1} is the next future reward which can be reached starting from state s_t. The complete return R_t to be expected in the future from state s_t is thus given by

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_k γ^k r_{t+k+1},

where γ ≤ 1 is a discount factor. This accounts for the fact that rewards in the far future should be valued less. Reinforcement learning assumes that the value of a state V(s) is directly equivalent to the expected return E_π at this state, where π denotes the (here unspecified) action policy to be followed:

V(s_t) = E_π[R_t].

Thus, the value of state s_t can be iteratively updated with

V(s_t) ← V(s_t) + α [R_t - V(s_t)].
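A minimal sketch (toy episode data assumed) of the discounted return and this update, which the next slide identifies as the constant-α Monte Carlo update:

```python
# Discounted return R_t = r_{t+1} + gamma * R_{t+1} and the constant-alpha Monte Carlo update.
import numpy as np

gamma, alpha = 0.9, 0.1
states  = [0, 1, 2, 3]                # visited states s_t, ..., s_T of one episode (3 = terminal)
rewards = [0.0, 0.0, 1.0]             # r_{t+1} received after leaving each non-terminal state
V = np.zeros(4)

# compute the returns backwards through the episode
returns, R = [0.0] * len(rewards), 0.0
for t in reversed(range(len(rewards))):
    R = rewards[t] + gamma * R
    returns[t] = R

# the Monte Carlo update can only run after the episode has terminated
for t, s in enumerate(states[:-1]):
    V[s] += alpha * (returns[t] - V[s])
```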
97. We use α as a step-size parameter, which is not of great importance here, though, and can be held constant. Note, if V(s_t) correctly predicts the expected complete return R_t, the update will be zero and we have found the final value. This method is called the constant-α Monte Carlo update. It requires waiting until a sequence has reached its terminal state (see some slides before!) before the update can commence. For long sequences this may be problematic. Thus, one should try to use an incremental procedure instead. We define a different update rule with

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)].

The elegant trick is to assume that, if the process converges, the value of the next state V(s_{t+1}) should be an accurate estimate of the expected return downstream of this state (i.e., downstream of s_{t+1}). Thus, we would hope that the following holds:

R_t ≈ r_{t+1} + γ V(s_{t+1}).

Indeed, proofs exist that under certain boundary conditions this procedure, known as TD(0), converges to the optimal value function for all states.
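A minimal sketch (toy chain environment assumed) of tabular TD(0) policy evaluation, updating after every step instead of waiting for the end of the episode:

```python
# Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)) after every transition.
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states + 1)            # last entry is the terminal state

for episode in range(500):
    s = 0
    while s < n_states:               # fixed policy: always move right
        s_next = s + 1
        r = 1.0 if s_next == n_states else 0.0    # reward only on reaching the end
        td_error = r + gamma * V[s_next] - V[s]   # the delta of TD-learning
        V[s] += alpha * td_error
        s = s_next

print(V[:n_states])                   # approaches gamma ** (n_states - 1 - s) for each state s
```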
98. Reinforcement Learning: Relations to Brain Function I
You are here!
99. How to Implement TD in a Neuronal Way
Now we have:
100. How to Implement TD in a Neuronal Way
101. Reinforcement Learning: Relations to Brain Function II
You are here!
102. TD-Learning and Brain Function
This neuron is supposed to represent the δ-error of TD-learning, which has moved forward as expected.
DA responses in the basal ganglia: pars compacta of the substantia nigra and the medially adjoining ventral tegmental area (VTA).
103. TD-Learning and Brain Function
This neuron is supposed to represent the reward expectation signal v. It has extended forward (almost) to the CS (here called Tr), as expected from the TD rule. Such neurons are found in the striatum, orbitofrontal cortex and amygdala.
104. Reinforcement Learning: The Control Problem
So far we have concentrated on evaluating an unchanging policy. Now comes the question of how to actually improve a policy π, trying to find the optimal policy.
We will discuss:
- Actor-Critic architectures
But not:
- SARSA learning
- Q-learning
(π is the abbreviation for policy.)
105. Reinforcement Learning: Control Problem I
You are here!
106. Control Loops
A basic feedback-loop controller (reflex), as in the slide before.
107. Control Loops
An Actor-Critic architecture: the Critic produces evaluative reinforcement feedback for the Actor by observing the consequences of its actions. The Critic takes the form of a TD-error, which gives an indication of whether things have gone better or worse than expected with the preceding action. Thus, this TD-error can be used to evaluate the preceding action: if the error is positive, the tendency to select this action should be strengthened; or else, lessened.
108. Example of an Actor-Critic Procedure
Action selection here follows the Gibbs softmax method:

π(s, a) = exp(p(s, a)) / Σ_b exp(p(s, b)),

where the p(s, a) are the values of the modifiable (by the Critic!) policy parameters of the Actor, indicating the tendency to select action a when being in state s.
We can now modify p for a given state-action pair at time t with

p(s_t, a_t) ← p(s_t, a_t) + β δ_t,

where δ_t is the δ-error of the TD Critic.
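A minimal sketch (toy chain task and parameters assumed) of a tabular Actor-Critic in this spirit: the Critic learns V(s) via TD(0), and its δ both updates V and reinforces the Actor's preferences p(s, a):

```python
# Tabular Actor-Critic: Critic = TD(0) value learner; Actor = Gibbs-softmax preferences p(s, a)
# updated with the Critic's TD-error delta.
import numpy as np

rng = np.random.default_rng(5)
n_states, gamma, alpha, beta = 5, 0.9, 0.1, 0.1
V = np.zeros(n_states + 1)            # state values (last entry = terminal)
p = np.zeros((n_states, 2))           # action preferences: 0 = left, 1 = right

def gibbs_softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

for episode in range(2000):
    s = 0
    while s < n_states:
        a = int(rng.choice(2, p=gibbs_softmax(p[s])))
        s_next = s + 1 if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states else 0.0
        delta = r + gamma * V[s_next] - V[s]   # Critic's TD-error
        V[s] += alpha * delta                  # Critic update
        p[s, a] += beta * delta                # Actor update: strengthen the action if delta > 0
        s = s_next
```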
109. Reinforcement Learning: Control I and Brain Function III
You are here!
110. Actor-Critics and the Basal Ganglia
The basal ganglia are a brain structure involved in motor control. It has been suggested that they learn by ways of an Actor-Critic mechanism.
VP = ventral pallidum, SNr = substantia nigra pars reticulata, SNc = substantia nigra pars compacta, GPi = globus pallidus pars interna, GPe = globus pallidus pars externa, VTA = ventral tegmental area, RRA = retrorubral area, STN = subthalamic nucleus.
111. Actor-Critics and the Basal Ganglia: The Critic
(Figure: C = cortex, S = striatum, STN = subthalamic nucleus, DA = dopamine system, r = reward.)
So-called striosomal modules fulfill the functions of the adaptive Critic. The prediction-error (δ) characteristics of the DA neurons of the Critic are generated by: 1) equating the reward r with excitatory input from the lateral hypothalamus; 2) equating the term v(t) with indirect excitation at the DA neurons, which is initiated from striatal striosomes and channelled through the subthalamic nucleus onto the DA neurons; 3) equating the term v(t-1) with direct, long-lasting inhibition from striatal striosomes onto the DA neurons. There are many problems with this simplistic view, though: timing, mismatch to anatomy, etc.
112. The End