Title: ONLINE Q-LEARNER USING MOVING PROTOTYPES
1ONLINE Q-LEARNER USING MOVING PROTOTYPES
- by
- Miguel Ángel Soto Santibáñez
2Reinforcement Learning
- What does it do?
  - Tackles the problem of learning control strategies for autonomous agents.
- What is the goal?
  - The goal of the agent is to learn an action policy that maximizes the total reward it will receive from any starting state.
3Reinforcement Learning
- What does it need?
  - This method assumes that training information is available in the form of a real-valued reward signal given for each state-action transition.
    - i.e. (s, a, r)
- What problems?
  - Very often, reinforcement learning fits a problem setting known as a Markov decision process (MDP).
4Reinforcement Learning vs. Dynamic programming
- reward function
  - $r(s, a) \to r$
- state transition function
  - $\delta(s, a) \to s'$
5Q-learning
- An off-policy control algorithm.
- Advantage
  - Converges to an optimal policy in both deterministic and nondeterministic MDPs.
- Disadvantage
  - Only practical on a small number of problems.
6Q-learning Algorithm
- Initialize Q(s, a) arbitrarily
- Repeat (for each episode)
  - Initialize s
  - Repeat (for each step of the episode)
    - Choose a from s using an exploratory policy
    - Take action a, observe r, s'
    - $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
    - $s \leftarrow s'$
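The steps above can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: the `step` callback, the epsilon-greedy exploratory policy, the parameter values, and the five-state corridor used to exercise it are all assumptions introduced here.

```python
import random
from collections import defaultdict

def q_learning(step, actions, start_state, is_terminal,
               episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy exploratory policy.

    step(s, a) -> (r, s_next) is a hypothetical environment callback.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)                     # Q(s, a) initialized to 0
    for _ in range(episodes):
        s = start_state                        # initialize s
        while not is_terminal(s):
            # Choose a from s using an exploratory policy (ties broken at random).
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            r, s_next = step(s, a)             # take action a, observe r, s'
            best_next = max(Q[(s_next, b)] for b in actions)
            # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next                         # s <- s'
    return Q

# Hypothetical corridor: states 0..4, reward 1 on reaching state 4, else 0.
def corridor_step(s, a):
    s_next = min(4, max(0, s + (1 if a == "right" else -1)))
    return (1.0 if s_next == 4 else 0.0), s_next

ACTIONS = ["left", "right"]
Q = q_learning(corridor_step, ACTIONS, 0, lambda s: s == 4)
greedy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(4)]
```

Because the update bootstraps from `max` over the next state's actions rather than from the action actually taken next, the learned values do not depend on the exploratory policy, which is what makes the method off-policy.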
7Introduction to Q-learning Algorithm
- An episode: (s1, a1, r1), (s2, a2, r2), ..., (sn, an, rn)
- $s'$: $\delta(s, a) \to s'$
- $Q(s, a)$
- $\gamma$, $\alpha$
8A Sample Problem
r = 0
r = -8
9States and actions
states
actions
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
10The Q(s, a) function
Rows are actions, columns are states.
state    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N
S
W
E
11Q-learning Algorithm
- Initialize Q(s, a) arbitrarily
- Repeat (for each episode)
  - Initialize s
  - Repeat (for each step of the episode)
    - Choose a from s using an exploratory policy
    - Take action a, observe r, s'
    - $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
    - $s \leftarrow s'$
12Initializing the Q(s, a) function
Rows are actions, columns are states.
state    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
S        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
W        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
E        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
13Q-learning Algorithm
- Initialize Q(s, a) arbitrarily
- Repeat (for each episode)
  - Initialize s
  - Repeat (for each step of the episode)
    - Choose a from s using an exploratory policy
    - Take action a, observe r, s'
    - $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
    - $s \leftarrow s'$
14An episode
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
15Q-learning Algorithm
- Initialize Q(s, a) arbitrarily
- Repeat (for each episode)
  - Initialize s
  - Repeat (for each step of the episode)
    - Choose a from s using an exploratory policy
    - Take action a, observe r, s'
    - $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
    - $s \leftarrow s'$
16Calculating new Q(s, a) values
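As a worked instance of the update rule, consider Q(7, N), the only entry that differs from 0 after the first episode. This sketch assumes every entry starts at 0 (as on the initialization slide), a learning rate $\alpha = 1$, and a reward $r = -8$ for that transition. Both $\alpha$ and $r$ are inferred from the table for the first episode and are not stated on the slides.

```latex
Q(7, N) \leftarrow Q(7, N) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(7, N) \right]
        = 0 + 1 \cdot \left[ -8 + \gamma \cdot 0 - 0 \right]
        = -8
```

The value of $\gamma$ does not matter at this point, because every $Q(s', a')$ is still 0.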
17The Q(s, a) function after the first episode
Rows are actions, columns are states.
state    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N        0    0    0    0    0    0   -8    0    0    0    0    0    0    0    0    0    0    0    0    0
S        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
W        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
E        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
18A second episode
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
19Calculating new Q(s, a) values
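A worked instance for Q(9, E), the entry that changes in the second episode. This sketch assumes the entry starts at 0, a learning rate $\alpha = 1$, and a reward $r = 8$ for that transition. Both $\alpha$ and $r$ are inferred from the table for the second episode and are not stated on the slides.

```latex
Q(9, E) \leftarrow Q(9, E) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(9, E) \right]
        = 0 + 1 \cdot \left[ 8 + \gamma \cdot 0 - 0 \right]
        = 8
```

Under the same assumptions, any step with $r = 0$ that lands in a state whose largest entry is still 0 leaves its own entry at 0, which would explain why only one new value appears in the table.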
20The Q(s, a) function after the second episode
Rows are actions, columns are states.
state    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N        0    0    0    0    0    0   -8    0    0    0    0    0    0    0    0    0    0    0    0    0
S        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
W        0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
E        0    0    0    0    0    0    0    0    8    0    0    0    0    0    0    0    0    0    0    0
21The Q(s, a) function after a few episodes
Rows are actions, columns are states.
state    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N        0    0    0    0    0    0   -8   -8   -8    0    0    1    2    4    0    0    0    0    0    0
S        0    0    0    0    0    0  0.5    1    2    0    0   -8   -8   -8    0    0    0    0    0    0
W        0    0    0    0    0    0   -8    1    2    0    0   -8  0.5    1    0    0    0    0    0    0
E        0    0    0    0    0    0    2    4    8    0    0    1    2   -8    0    0    0    0    0    0
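The nonzero entries in this table are consistent with the reward being backed up one step per update. As a sketch, assume $\alpha = 1$ and $\gamma = 0.5$ (both inferred from the table values, not stated on the slides), a reward of 0 for moves between interior states, and that action E moves the agent from state $s$ to state $s + 1$ within a row of the grid. With $\alpha = 1$ the update reduces to $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$:

```latex
\begin{aligned}
Q(9, E) &= 8 \\
Q(8, E) &= r + \gamma \max_{a'} Q(9, a') = 0 + 0.5 \cdot 8 = 4 \\
Q(7, E) &= r + \gamma \max_{a'} Q(8, a') = 0 + 0.5 \cdot 4 = 2
\end{aligned}
```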
22One of the optimal policies
Rows are actions, columns are states.
state    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N        0    0    0    0    0    0   -8   -8   -8    0    0    1    2    4    0    0    0    0    0    0
S        0    0    0    0    0    0  0.5    1    2    0    0   -8   -8   -8    0    0    0    0    0    0
W        0    0    0    0    0    0   -8    1    2    0    0   -8  0.5    1    0    0    0    0    0    0
E        0    0    0    0    0    0    2    4    8    0    0    1    2   -8    0    0    0    0    0    0
23An optimal policy graphically
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
24Another of the optimal policies
Rows are actions, columns are states.
state    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
N        0    0    0    0    0    0   -8   -8   -8    0    0    1    2    4    0    0    0    0    0    0
S        0    0    0    0    0    0  0.5    1    2    0    0   -8   -8   -8    0    0    0    0    0    0
W        0    0    0    0    0    0   -8    1    2    0    0   -8  0.5    1    0    0    0    0    0    0
E        0    0    0    0    0    0    2    4    8    0    0    1    2   -8    0    0    0    0    0    0
25Another optimal policy graphically
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
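One way to see why more than one optimal policy exists is to read the greedy action off the Q(s, a) table for each state and look for ties. The sketch below copies the values for the six states with nonzero entries (7, 8, 9, 12, 13, 14) and is only an illustration. Which particular policies the slides highlighted is not recoverable from the text, and the helper name `greedy_actions` is introduced here.

```python
from math import prod

# Q-values copied from the table above for the six states with nonzero entries.
Q = {
    7:  {"N": -8, "S": 0.5, "W": -8,  "E": 2},
    8:  {"N": -8, "S": 1,   "W": 1,   "E": 4},
    9:  {"N": -8, "S": 2,   "W": 2,   "E": 8},
    12: {"N": 1,  "S": -8,  "W": -8,  "E": 1},
    13: {"N": 2,  "S": -8,  "W": 0.5, "E": 2},
    14: {"N": 4,  "S": -8,  "W": 1,   "E": -8},
}

def greedy_actions(q_row):
    """Return every action whose value equals the row maximum."""
    best = max(q_row.values())
    return [a for a, q in q_row.items() if q == best]

greedy = {s: greedy_actions(row) for s, row in Q.items()}

# Each tie doubles the number of distinct greedy policies over these states.
n_policies = prod(len(acts) for acts in greedy.values())
```

States 12 and 13 each have two maximizing actions (N and E), so any choice among them yields a policy that is greedy with respect to the same table.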
26The problem with tabular Q-learning
- What is the problem?
  - Only practical in a small number of problems because:
    - a) Q-learning can require many thousands of training iterations to converge in even modest-sized problems.
    - b) Very often, the memory resources required by this method become too large.
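To make point b) concrete, the sketch below counts the entries of a Q-table when each state variable is discretized into a fixed number of bins. The function name and the numbers (10 bins, 4 actions) are hypothetical and chosen only for illustration.

```python
def table_entries(bins_per_dim, state_dims, n_actions):
    """Number of Q(s, a) entries for a uniformly discretized state space."""
    return bins_per_dim ** state_dims * n_actions

# Hypothetical setting: 10 bins per state variable and 4 actions.
sizes = {d: table_entries(10, d, 4) for d in (2, 4, 8)}
```

The count grows exponentially with the number of state variables, so a table that is trivial for a 2-dimensional state quickly becomes impractical as dimensions are added.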
27Solution
- What can we do about it?
  - Use generalization.
- What are some examples?
  - Tile coding, Radial Basis Functions, Fuzzy function approximation, Hashing, Artificial Neural Networks, LSPI, Regression Trees, Kanerva coding, etc.
28Shortcomings
- Tile coding: curse of dimensionality.
- Kanerva coding: static prototypes.
- LSPI: requires a priori knowledge of the Q-function.
- ANN: requires a large number of learning experiences.
- Batch regression trees: slow and require lots of memory.
29Needed properties
- 1) Memory requirements should not explode exponentially with the dimensionality of the problem.
- 2) It should tackle the pitfalls caused by the usage of static prototypes.
- 3) It should try to reduce the number of learning experiences required to generate an acceptable policy.
- NOTE: All this without requiring a priori knowledge of the Q-function.
30Overview of the proposed method
- 1) The proposed method limits the number of prototypes available to describe the Q-function (as in Kanerva coding).
- 2) The Q-function is modeled using a regression tree (as in the batch method proposed by Sridharan and Tesauro).
- 3) Unlike in Kanerva coding, the prototypes are not static but dynamic.
- 4) The proposed method has the capacity to update the Q-function once for every available learning experience (it can be an online learner).
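The general idea of a regression tree with a capped number of leaves, updated once per learning experience, can be sketched as below. This is not the Moving Prototypes algorithm: the slides do not spell out its splitting and merging rules, so the sketch uses a hypothetical midpoint split, a hypothetical error threshold `tol`, a scalar input, and no merging at all. All class and parameter names are assumptions.

```python
class Node:
    """Node of a 1-D regression tree over a scalar input x (hypothetical)."""
    def __init__(self, lo, hi, value=0.0):
        self.lo, self.hi = lo, hi      # interval covered by this node
        self.value = value             # estimate stored at a leaf
        self.split = None              # split point; None means leaf
        self.left = self.right = None

    def leaf_for(self, x):
        node = self
        while node.split is not None:
            node = node.left if x < node.split else node.right
        return node

class TreeQ:
    """Online function approximator with a cap on the number of leaves."""
    def __init__(self, lo, hi, max_leaves=8, alpha=0.5, tol=1.0):
        self.root = Node(lo, hi)
        self.n_leaves, self.max_leaves = 1, max_leaves
        self.alpha, self.tol = alpha, tol

    def predict(self, x):
        return self.root.leaf_for(x).value

    def update(self, x, target):
        """Process one learning experience (x, target)."""
        leaf = self.root.leaf_for(x)
        if abs(target - leaf.value) > self.tol and self.n_leaves < self.max_leaves:
            # Assumed rule: split the leaf at its midpoint while budget remains.
            mid = (leaf.lo + leaf.hi) / 2
            leaf.split = mid
            leaf.left = Node(leaf.lo, mid, leaf.value)
            leaf.right = Node(mid, leaf.hi, leaf.value)
            self.n_leaves += 1
            leaf = leaf.leaf_for(x)
        # Move the leaf estimate toward the target.
        leaf.value += self.alpha * (target - leaf.value)

# Usage: learn a step function from repeated samples with at most 4 leaves.
q = TreeQ(0.0, 1.0, max_leaves=4)
for _ in range(20):
    for x in (0.1, 0.3, 0.6, 0.9):
        q.update(x, 0.0 if x < 0.5 else 10.0)
```

In this simplified version the leaf budget can be spent on splits that later turn out to be unnecessary, and nothing can reclaim them. That limitation is the kind of problem that non-static prototypes are meant to address.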
31Changes on the normal regression tree
32Basic operations in the regression tree
33Impossible Merging
34Rules for a sound tree
35Impossible Merging
36Sample Merging
37Sample Merging
38Sample Merging
The node to be inserted
39Sample Merging
40Sample Merging
41Sample Merging
42Sample Merging
43The agent
Agent
44Applications
45Results first application
                         | Tabular Q-learning | Moving Prototypes   | Batch Method
Policy Quality           | Best               | Best                | Worst
Computational Complexity | O(n)               | O(n log(n)), O(n^2) | O(n^3)
Memory Usage             | Bad                | Best                | Worst
46Results first application (details)
               | Tabular Q-learning | Moving Prototypes | Batch Method
Policy Quality | 2,423,355          | 2,423,355         | 2,297,100
Memory Usage   | 10,202 prototypes  | 413 prototypes    | 11,975 prototypes
47Results second application
                         | Moving Prototypes   | LSPI (least-squares policy iteration)
Policy Quality           | Best                | Worst
Computational Complexity | O(n log(n)), O(n^2) | O(n)
Memory Usage             | Worst               | Best
48Results second application (details)
Method            | Policy Quality          | Required Learning Experiences | Memory Usage
Moving Prototypes | forever (succeeded)     | 216                           | about 170 prototypes
Moving Prototypes | forever (succeeded)     | 324                           | about 170 prototypes
Moving Prototypes | forever (succeeded)     | 216                           | about 170 prototypes
LSPI              | 26 time steps (failed)  | 1,902,621                     | 2 weight parameters
LSPI              | 170 time steps (failed) | 183,618                       | 2 weight parameters
LSPI              | forever (succeeded)     | 648                           | 2 weight parameters
49Results third application
- Reason for this experiment
  - Evaluate the performance of the proposed method in a scenario that we consider ideal for this method, namely one for which there is no application-specific knowledge available.
- What it took to learn a good policy
  - Less than 2 minutes of CPU time.
  - Fewer than 25,000 learning experiences.
  - Fewer than 900 state-action-value tuples.
50Swimmer first movie
51Swimmer second movie
52Swimmer third movie
53Future Work
- Different types of splits.
- Continue characterization of the Moving Prototypes method.
- Moving Prototypes + LSPI.
- Moving Prototypes + eligibility traces.