ONLINE Q-LEARNER USING MOVING PROTOTYPES presentation

1
ONLINE Q-LEARNER USING MOVING PROTOTYPES

by
Miguel Ángel Soto Santibáñez

2
Reinforcement Learning

What does it do?
Tackles the problem of learning control
strategies for autonomous agents.
What is the goal?
The goal of the agent is to learn an action
policy that maximizes the total reward it will
receive from any starting state.

3
Reinforcement Learning

What does it need?
This method assumes that training information
is available in the form of a real-valued reward
signal given for each state-action transition.
i.e. (s, a, r)
What problems?
Very often, reinforcement learning fits a
problem setting known as a Markov decision
process (MDP).

4
Reinforcement Learning vs. Dynamic programming

reward function
r(s, a) → r
state transition function
δ(s, a) → s'

5
Q-learning

An off-policy control algorithm.
Advantage
Converges to an optimal policy in both
deterministic and nondeterministic MDPs.
Disadvantage
Only practical on a small number of problems.

6
Q-learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat (for each episode)
    Initialize s
    Repeat (for each step of the episode)
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
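The algorithm on this slide can be sketched in Python as follows. This is a minimal illustration, not the presenter's code. The `step` callback, its `done` flag for ending an episode, the epsilon-greedy exploratory policy, and the default values of `alpha`, `gamma`, and `epsilon` are all assumptions introduced here.

```python
import random
from collections import defaultdict


def q_learning(step, start_states, actions, episodes=200,
               alpha=0.5, gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning.

    step(s, a) is a hypothetical environment callback that returns
    (reward, next_state, done). All names here are illustrative.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q(s, a) initialised arbitrarily (here: 0)
    for _ in range(episodes):
        s = rng.choice(start_states)  # initialise s
        for _ in range(max_steps):
            # Exploratory policy: epsilon-greedy with random tie-breaking (an assumption).
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            r, s_next, done = step(s, a)  # take action a, observe r, s'
            # A terminal transition has no successor value to bootstrap from.
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
            if done:
                break
    return Q
```

The returned table maps `(state, action)` pairs to values, so a greedy policy picks, in each state, the action with the largest entry.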

7
Introduction to Q-learning Algorithm

An episode: (s1, a1, r1), (s2, a2, r2), ..., (sn, an, rn)
s': δ(s, a) → s'
Q(s, a)
γ, α

8
A Sample Problem

r = 0
r = -8
9
States and actions
states
actions
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
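The grid above, together with the reward function r(s, a) and the transition function δ(s, a) defined earlier, could be modeled as in the sketch below. The picture of the world is lost in this transcript, so the layout is an assumption inferred from the Q(s, a) tables on the later slides: the six interior states are the only ones from which actions are taken, stepping from state 9 into state 10 pays 8, stepping onto any other border state pays -8, and all other moves pay 0. The names `delta`, `reward`, `INTERIOR`, and `GOAL` are illustrative.

```python
COLS = 5                               # states 1..20, numbered row by row
INTERIOR = {7, 8, 9, 12, 13, 14}       # assumption: the non-terminal states
GOAL = 10                              # assumption: entering it pays +8
MOVES = {"N": -COLS, "S": +COLS, "W": -1, "E": +1}


def delta(s: int, a: str) -> int:
    """State transition function: the state reached from s by action a."""
    if s not in INTERIOR:
        raise ValueError("actions are only taken from interior states")
    return s + MOVES[a]


def reward(s: int, a: str) -> int:
    """Reward function r(s, a) for the transition out of s."""
    s_next = delta(s, a)
    if s_next == GOAL:
        return 8
    if s_next in INTERIOR:
        return 0
    return -8
```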
10
The Q(s, a) function
states
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
N
S
W
E
a c t i o n s
11
Q-learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat (for each episode)
    Initialize s
    Repeat (for each step of the episode)
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'

12
Initializing the Q(s, a) function
states
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
N 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a c t i o n s
13
Q-learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat (for each episode)
    Initialize s
    Repeat (for each step of the episode)
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'

14
An episode
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
15
Q-learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat (for each episode)
    Initialize s
    Repeat (for each step of the episode)
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'

16
Calculating new Q(s, a) values

17
The Q(s, a) function after the first episode
states
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
N 0 0 0 0 0 0 -8 0 0 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a c t i o n s
18
A second episode
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
19
Calculating new Q(s, a) values

20
The Q(s, a) function after the second episode
states
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
N 0 0 0 0 0 0 -8 0 0 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0
a c t i o n s
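The calculation slides are empty in this transcript, so the sketch below reproduces the table entries from a single application of the Q-learning update. The learning rate and discount factor are not stated on the slides. The values `alpha = 1.0` and `gamma = 0.5` are assumptions, chosen because they are consistent with the tables (for example, the table after a few episodes has Q(8, E) = 4, which is half of Q(9, E) = 8).

```python
def q_update(q_sa, r, max_q_next, alpha=1.0, gamma=0.5):
    """One Q-learning update: Q(s,a) + alpha * (r + gamma * max Q(s',a') - Q(s,a))."""
    return q_sa + alpha * (r + gamma * max_q_next - q_sa)


# First episode: N from state 7 earns r = -8; every Q value is still 0.
print(q_update(0.0, -8, 0.0))   # -8.0
# Second episode: E from state 9 earns r = 8.
print(q_update(0.0, 8, 0.0))    # 8.0
# Later: E from state 8 earns r = 0 and lands in state 9, where max Q is 8.
print(q_update(0.0, 0, 8.0))    # 4.0
```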
21
The Q(s, a) function after a few episodes
states
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
N 0 0 0 0 0 0 -8 -8 -8 0 0 1 2 4 0 0 0 0 0 0
S 0 0 0 0 0 0 0.5 1 2 0 0 -8 -8 -8 0 0 0 0 0 0
W 0 0 0 0 0 0 -8 1 2 0 0 -8 0.5 1 0 0 0 0 0 0
E 0 0 0 0 0 0 2 4 8 0 0 1 2 -8 0 0 0 0 0 0
a c t i o n s
22
One of the optimal policies
states
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
N 0 0 0 0 0 0 -8 -8 -8 0 0 1 2 4 0 0 0 0 0 0
S 0 0 0 0 0 0 0.5 1 2 0 0 -8 -8 -8 0 0 0 0 0 0
W 0 0 0 0 0 0 -8 1 2 0 0 -8 0.5 1 0 0 0 0 0 0
E 0 0 0 0 0 0 2 4 8 0 0 1 2 -8 0 0 0 0 0 0
a c t i o n s
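An optimal policy can be read off the table by choosing, in each state, an action with the largest Q value. The sketch below does this for the six states that have non-zero entries. The values are copied from the table, while the names `STATES`, `Q`, and `greedy_actions` are illustrative.

```python
STATES = [7, 8, 9, 12, 13, 14]          # states with non-zero entries in the table
Q = {
    "N": [-8, -8, -8, 1, 2, 4],
    "S": [0.5, 1, 2, -8, -8, -8],
    "W": [-8, 1, 2, -8, 0.5, 1],
    "E": [2, 4, 8, 1, 2, -8],
}


def greedy_actions(q, states):
    """Return, for each state, every action whose Q value is maximal."""
    policy = {}
    for i, s in enumerate(states):
        best = max(row[i] for row in q.values())
        policy[s] = [a for a, row in q.items() if row[i] == best]
    return policy


print(greedy_actions(Q, STATES))
# {7: ['E'], 8: ['E'], 9: ['E'], 12: ['N', 'E'], 13: ['N', 'E'], 14: ['N']}
```

States 12 and 13 each have two equally good actions, which is why the same table supports more than one optimal policy.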
23
An optimal policy graphically
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
24
Another of the optimal policies
states
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
N 0 0 0 0 0 0 -8 -8 -8 0 0 1 2 4 0 0 0 0 0 0
S 0 0 0 0 0 0 0.5 1 2 0 0 -8 -8 -8 0 0 0 0 0 0
W 0 0 0 0 0 0 -8 1 2 0 0 -8 0.5 1 0 0 0 0 0 0
E 0 0 0 0 0 0 2 4 8 0 0 1 2 -8 0 0 0 0 0 0
a c t i o n s
25
Another optimal policy graphically
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
26
The problem with tabular Q-learning

What is the problem?
Only practical in a small number of problems
because
a) Q-learning can require many thousands of
training iterations to converge in even
modest-sized problems.
b) Very often, the memory resources required by
this method become too large.
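The memory problem in point b) can be made concrete with a short calculation. The sketch assumes a state space discretized into the same number of values along each dimension. The function name and parameters are illustrative, and the six-dimensional case is a made-up example rather than one of the applications in this presentation.

```python
def q_table_entries(values_per_dimension: int, dimensions: int, num_actions: int) -> int:
    """Number of Q(s, a) entries when each state dimension is discretised."""
    return (values_per_dimension ** dimensions) * num_actions


print(q_table_entries(20, 1, 4))   # the 20-state, 4-action sample problem: 80
print(q_table_entries(20, 6, 4))   # six such dimensions: 256000000
```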

27
Solution

What can we do about it?
Use generalization.
What are some examples?
Tile coding, Radial Basis Functions, Fuzzy
function approximation, Hashing, Artificial
Neural Networks, LSPI, Regression Trees, Kanerva
coding, etc.

28
Shortcomings

Tile coding: curse of dimensionality.
Kanerva coding: static prototypes.
LSPI: requires a priori knowledge of the
Q-function.
ANN: requires a large number of learning
experiences.
Batch regression trees: slow and require lots
of memory.

29
Needed properties

1) Memory requirements should not explode
exponentially with the dimensionality of the
problem.
2) It should tackle the pitfalls caused by the
usage of static prototypes.
3) It should try to reduce the number of learning
experiences required to generate an acceptable
policy.
NOTE: All this without requiring a priori
knowledge of the Q-function.

30
Overview of the proposed method

1) The proposed method limits the number of
prototypes available to describe the Q-function
(as in Kanerva coding).
2) The Q-function is modeled using a regression
tree (as in the batch method proposed by Sridharan
and Tesauro).
3) Unlike in Kanerva coding, the prototypes are
not static but dynamic.
4) The proposed method can update the Q-function
once for every available learning experience
(it can be an online learner).

31
Changes on the normal regression tree
32
Basic operations in the regression tree

Rupture
Merging
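The diagrams that define these two operations are lost in this transcript. The sketch below therefore shows only a generic version of splitting and merging on a binary regression tree over a one-dimensional input. It is not the presenter's method. The `Node` layout, the midpoint split, the rule that only two sibling leaves can be merged, and the averaging of their values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A node covering the interval [lo, hi) of a 1-D input (hypothetical layout)."""
    lo: float
    hi: float
    value: float = 0.0            # Q estimate stored in a leaf (its "prototype")
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def rupture(leaf: Node) -> None:
    """Split a leaf at its midpoint; both children inherit the parent's value."""
    if not leaf.is_leaf():
        raise ValueError("only a leaf can be ruptured")
    mid = (leaf.lo + leaf.hi) / 2
    leaf.left = Node(leaf.lo, mid, leaf.value)
    leaf.right = Node(mid, leaf.hi, leaf.value)


def merge(parent: Node) -> None:
    """Collapse two sibling leaves into their parent, averaging their values."""
    if parent.is_leaf() or not (parent.left.is_leaf() and parent.right.is_leaf()):
        raise ValueError("can only merge a node whose two children are leaves")
    parent.value = (parent.left.value + parent.right.value) / 2
    parent.left = parent.right = None


def count_leaves(node: Node) -> int:
    """Number of leaves, i.e. prototypes currently in use."""
    return 1 if node.is_leaf() else count_leaves(node.left) + count_leaves(node.right)
```

In a scheme like this, a rupture adds one leaf and a merge removes one, so pairing the two keeps the number of prototypes bounded while letting them move to where resolution is needed.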

33
Impossible Merging
34
Rules for a sound tree
35
Impossible Merging
36
Sample Merging
37
Sample Merging
38
Sample Merging
The node to be inserted
39
Sample Merging
40
Sample Merging
41
Sample Merging
42
Sample Merging
43
The agent
Agent
44
Applications
45
Results first application
                          Tabular Q-learning   Moving Prototypes      Batch Method
Policy Quality            Best                 Best                   Worst
Computational Complexity  O(n)                 O(n log(n)), O(n^2)    O(n^3)
Memory Usage              Bad                  Best                   Worst
46
Results first application (details)
                Tabular Q-learning   Moving Prototypes   Batch Method
Policy Quality  2,423,355            2,423,355           2,297,100
Memory Usage    10,202 prototypes    413 prototypes      11,975 prototypes
47
Results second application
                          Moving Prototypes      LSPI (least-squares policy iteration)
Policy Quality            Best                   Worst
Computational Complexity  O(n log(n)), O(n^2)    O(n)
Memory Usage              Worst                  Best
48
Results second application (details)
Moving Prototypes
Policy Quality                 forever (succeeded)    forever (succeeded)    forever (succeeded)
Required Learning Experiences  216                    324                    216
Memory Usage                   about 170 prototypes   about 170 prototypes   about 170 prototypes

LSPI (least-squares policy iteration)
Policy Quality                 26 time steps (failed)   170 time steps (failed)   forever (succeeded)
Required Learning Experiences  1,902,621                183,618                   648
Memory Usage                   2 weight parameters      2 weight parameters       2 weight parameters
49
Results third application

Reason for this experiment
Evaluate the performance of the proposed method
in a scenario that we consider ideal for this
method, namely one for which no
application-specific knowledge is available.
What did it take to learn a good policy?
Less than 2 minutes of CPU time.
Fewer than 25,000 learning experiences.
Fewer than 900 state-action-value tuples.

50
Swimmer first movie
51
Swimmer second movie
52
Swimmer third movie
53
Future Work

Different types of splits.
Continue characterizing the Moving Prototypes
method.
Moving prototypes + LSPI.
Moving prototypes + eligibility traces.
