ONLINE Q-LEARNER USING MOVING PROTOTYPES


1
ONLINE Q-LEARNER USING MOVING PROTOTYPES
  • by Miguel Ángel Soto Santibáñez

2
Reinforcement Learning
  • What does it do?
  • Tackles the problem of learning control
    strategies for autonomous agents.
  • What is the goal?
  • The goal of the agent is to learn an action
    policy that maximizes the total reward it will
    receive from any starting state.

3
Reinforcement Learning
  • What does it need?
  • This method assumes that training information
    is available in the form of a real-valued reward
    signal given for each state-action transition.
  • i.e., (s, a, r)
  • What problems does it apply to?
  • Very often, reinforcement learning fits a
    problem setting known as a Markov decision
    process (MDP).

4
Reinforcement Learning vs. Dynamic programming
  • reward function
  • $r(s, a) \rightarrow r$
  • state transition function
  • $\delta(s, a) \rightarrow s'$

5
Q-learning
  • An off-policy control algorithm.
  • Advantage
  • Converges to an optimal policy in both
    deterministic and nondeterministic MDPs.
  • Disadvantage
  • Only practical on a small number of problems.

6
Q-learning Algorithm
  • Initialize Q(s, a) arbitrarily
  • Repeat (for each episode):
    • Initialize s
    • Repeat (for each step of the episode):
      • Choose a from s using an exploratory policy
      • Take action a, observe r, s'
      • $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
      • $s \leftarrow s'$
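
The loop above can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the `step` callback, the five-state corridor used to exercise it, the reward of 1 at the right end, the epsilon-greedy exploratory policy, and the values of `alpha`, `gamma`, and `epsilon` are all assumptions made for the example.

```python
import random
from collections import defaultdict

def q_learning(step, start_states, actions, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning. `step(s, a)` returns (reward, next_state, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)                       # Q(s, a) initialized to 0
    for _ in range(episodes):                    # for each episode
        s = rng.choice(start_states)             # initialize s
        done = False
        while not done:                          # for each step of the episode
            # Exploratory policy (assumed here): epsilon-greedy, random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            r, s_next, done = step(s, a)         # take a, observe r and s'
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next                           # s <- s'
    return Q

# Hypothetical corridor: states 0..4; reaching state 4 pays 1 and ends the episode.
def corridor_step(s, a):
    s_next = max(0, s + (1 if a == "right" else -1))
    done = s_next == 4
    return (1.0 if done else 0.0), s_next, done

Q = q_learning(corridor_step, start_states=[0, 1, 2, 3],
               actions=["left", "right"])
greedy = {s: max(["left", "right"], key=lambda b: Q[(s, b)]) for s in range(4)}
print(greedy)
```

Because the target uses the maximum over next actions rather than the action the exploratory policy actually takes next, the learned values describe the greedy policy, which is what makes the method off-policy.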

7
Introduction to Q-learning Algorithm
  • An episode: $(s_1, a_1, r_1), (s_2, a_2, r_2), \ldots, (s_n, a_n, r_n)$
  • $s'$: $\delta(s, a) \rightarrow s'$
  • $Q(s, a)$
  • $\gamma$, $\alpha$

8
A Sample Problem




r = 0
r = -8
9
States and actions
states
actions
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
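
One way to encode this grid is sketched below. It assumes the states are numbered row by row, as laid out above, and that the four actions N, S, W, E move one cell up, down, left, and right. The function name `delta` and the choice to leave the agent in place when a move would leave the grid are assumptions for this sketch, not details from the slides.

```python
ROWS, COLS = 4, 5
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}

def delta(s, a):
    """Deterministic transition: state s (1..20) and action a give s'."""
    row, col = divmod(s - 1, COLS)
    d_row, d_col = MOVES[a]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < ROWS and 0 <= new_col < COLS):
        return s  # assumption: a move off the grid leaves the state unchanged
    return new_row * COLS + new_col + 1

print([delta(8, a) for a in "NSWE"])  # prints [3, 13, 7, 9]
```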
10
The Q(s, a) function
        states
action     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
N
S
W
E
11
Q-learning Algorithm
  • Initialize Q(s, a) arbitrarily
  • Repeat (for each episode):
    • Initialize s
    • Repeat (for each step of the episode):
      • Choose a from s using an exploratory policy
      • Take action a, observe r, s'
      • $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
      • $s \leftarrow s'$

12
Initializing the Q(s, a) function
        states
action     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
N          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
W          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
E          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
13
Q-learning Algorithm
  • Initialize Q(s, a) arbitrarily
  • Repeat (for each episode):
    • Initialize s
    • Repeat (for each step of the episode):
      • Choose a from s using an exploratory policy
      • Take action a, observe r, s'
      • $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
      • $s \leftarrow s'$

14
An episode
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
15
Q-learning Algorithm
  • Initialize Q(s, a) arbitrarily
  • Repeat (for each episode):
    • Initialize s
    • Repeat (for each step of the episode):
      • Choose a from s using an exploratory policy
      • Take action a, observe r, s'
      • $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
      • $s \leftarrow s'$

16
Calculating new Q(s, a) values
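
The worked calculation on this slide did not survive extraction. As a hedged reconstruction: suppose the first episode ends with the agent moving N from state 7 into state 2 (numbering the grid row by row) and receiving r = -8, and suppose α = 1. Both are assumptions, chosen because they yield the -8 that the table after the first episode shows for Q(7, N). With every Q value initialized to 0, the value of γ does not affect this step.

```latex
\begin{align*}
Q(7, N) &\leftarrow Q(7, N) + \alpha \left[ r + \gamma \max_{a'} Q(2, a') - Q(7, N) \right] \\
        &= 0 + 1 \cdot \left[ -8 + \gamma \cdot 0 - 0 \right] \\
        &= -8
\end{align*}
```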



17
The Q(s, a) function after the first episode
        states
action     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
N          0   0   0   0   0   0  -8   0   0   0   0   0   0   0   0   0   0   0   0   0
S          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
W          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
E          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
18
A second episode
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
19
Calculating new Q(s, a) values



20
The Q(s, a) function after the second episode
        states
action     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
N          0   0   0   0   0   0  -8   0   0   0   0   0   0   0   0   0   0   0   0   0
S          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
W          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
E          0   0   0   0   0   0   0   0   8   0   0   0   0   0   0   0   0   0   0   0
21
The Q(s, a) function after a few episodes
        states
action     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
N          0   0   0   0   0   0  -8  -8  -8   0   0   1   2   4   0   0   0   0   0   0
S          0   0   0   0   0   0 0.5   1   2   0   0  -8  -8  -8   0   0   0   0   0   0
W          0   0   0   0   0   0  -8   1   2   0   0  -8 0.5   1   0   0   0   0   0   0
E          0   0   0   0   0   0   2   4   8   0   0   1   2  -8   0   0   0   0   0   0
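
The values in this table are consistent with a deterministic form of the update (α = 1) and a discount factor γ = 0.5. Both are inferred from the numbers, not stated on the slides. The sketch below also assumes that states 7, 8, 9, 12, 13, and 14 are the only non-terminal states, that entering state 10 pays 8, that entering any other border state pays -8, and that moves between interior states pay 0. Under those assumptions, repeatedly sweeping the update over every interior state-action pair reproduces the table.

```python
COLS = 5
MOVES = {"N": -COLS, "S": COLS, "W": -1, "E": 1}   # row-by-row numbering, 1..20
INTERIOR = [7, 8, 9, 12, 13, 14]                   # assumed non-terminal states
GOAL = 10                                          # assumed goal state
GAMMA = 0.5                                        # inferred, not from the slides

def step(s, a):
    """Return (reward, next_state) for an interior state s."""
    s_next = s + MOVES[a]
    if s_next == GOAL:
        return 8, s_next
    if s_next not in INTERIOR:
        return -8, s_next
    return 0, s_next

# Terminal states keep Q = 0 because they are never updated.
Q = {(s, a): 0.0 for s in range(1, 21) for a in MOVES}

for _ in range(10):                # a few sweeps are enough for the values to settle
    for s in INTERIOR:
        for a in MOVES:
            r, s_next = step(s, a)
            # alpha = 1, so the update reduces to Q(s,a) <- r + gamma * max Q(s',a')
            Q[(s, a)] = r + GAMMA * max(Q[(s_next, b)] for b in MOVES)

for a in MOVES:
    print(a, [Q[(s, a)] for s in INTERIOR])
```

Sweeping every pair is a shortcut for the example; the slides instead fill the table in episode by episode, which reaches the same values once every interior pair has been visited often enough.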
22
One of the optimal policies
        states
action     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
N          0   0   0   0   0   0  -8  -8  -8   0   0   1   2   4   0   0   0   0   0   0
S          0   0   0   0   0   0 0.5   1   2   0   0  -8  -8  -8   0   0   0   0   0   0
W          0   0   0   0   0   0  -8   1   2   0   0  -8 0.5   1   0   0   0   0   0   0
E          0   0   0   0   0   0   2   4   8   0   0   1   2  -8   0   0   0   0   0   0
23
An optimal policy graphically
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
24
Another of the optimal policies
        states
action     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
N          0   0   0   0   0   0  -8  -8  -8   0   0   1   2   4   0   0   0   0   0   0
S          0   0   0   0   0   0 0.5   1   2   0   0  -8  -8  -8   0   0   0   0   0   0
W          0   0   0   0   0   0  -8   1   2   0   0  -8 0.5   1   0   0   0   0   0   0
E          0   0   0   0   0   0   2   4   8   0   0   1   2  -8   0   0   0   0   0   0
25
Another optimal policy graphically
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
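
An optimal policy can be read off the Q table by choosing, in each state, an action with the largest Q value. The sketch below does this for the six states that have nonzero entries, using the values from the table. The dictionary layout and the function name `greedy_actions` are illustrative choices.

```python
ACTIONS = ["N", "S", "W", "E"]

# Q values copied from the table, one row per state, in the order N, S, W, E.
Q = {
    7:  [-8, 0.5, -8, 2],
    8:  [-8, 1, 1, 4],
    9:  [-8, 2, 2, 8],
    12: [1, -8, -8, 1],
    13: [2, -8, 0.5, 2],
    14: [4, -8, 1, -8],
}

def greedy_actions(q_row):
    """Return every action whose Q value equals the row maximum."""
    best = max(q_row)
    return [a for a, q in zip(ACTIONS, q_row) if q == best]

policy = {s: greedy_actions(row) for s, row in Q.items()}
for s, acts in policy.items():
    print(s, acts)
```

States 12 and 13 each have two equally good actions (N and E), which is why more than one optimal policy exists for this table.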
26
The problem with tabular Q-learning
  • What is the problem?
  • It is only practical for a small number of problems
    because:
  • a) Q-learning can require many thousands of
    training iterations to converge, even in
    modest-sized problems.
  • b) Very often, the memory resources required by
    this method become too large.
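
To make point b) concrete: a table needs one entry per state-action pair, so if each of d state variables is discretized into k levels, the table holds k^d × |A| entries. The numbers used below (10 levels per variable, 4 actions) are illustrative assumptions, not figures from the presentation.

```python
def table_entries(levels_per_dim, dims, num_actions):
    """Number of Q(s, a) entries for a uniformly discretized state space."""
    return levels_per_dim ** dims * num_actions

for dims in (2, 4, 6, 8):
    print(dims, table_entries(10, dims, 4))
```

Going from 2 to 8 state variables takes the table from 400 entries to 400 million.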

27
Solution
  • What can we do about it?
  • Use generalization.
  • What are some examples?
  • Tile coding, Radial Basis Functions, Fuzzy
    function approximation, Hashing, Artificial
    Neural Networks, LSPI, Regression Trees, Kanerva
    coding, etc.

28
Shortcomings
  • Tile coding: curse of dimensionality.
  • Kanerva coding: static prototypes.
  • LSPI: requires a priori knowledge of the
    Q-function.
  • ANN: requires a large number of learning
    experiences.
  • Batch regression trees: slow and require lots
    of memory.

29
Needed properties
  • 1) Memory requirements should not explode
    exponentially with the dimensionality of the
    problem.
  • 2) It should tackle the pitfalls caused by the
    usage of static prototypes.
  • 3) It should try to reduce the number of learning
    experiences required to generate an acceptable
    policy.
  • NOTE: All this without requiring a priori
    knowledge of the Q-function.

30
Overview of the proposed method
  • 1) The proposed method limits the number of
    prototypes available to describe the Q-function
    (as in Kanerva coding).
  • 2) The Q-function is modeled using a regression
    tree (as in the batch method proposed by Sridharan
    and Tesauro).
  • 3) Unlike in Kanerva coding, however, the
    prototypes are dynamic rather than static.
  • 4) The proposed method can update the Q-function
    once for every available learning experience
    (that is, it can be an online learner).

31
Changes on the normal regression tree
32
Basic operations in the regression tree

  • Rupture
  • Merging
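
The slides name these two operations without spelling them out. As a generic illustration only, and not the presenter's algorithm, the sketch below shows what splitting a leaf ("rupture") and collapsing two sibling leaves ("merging") can look like in a one-dimensional regression tree whose leaves each store a single value. The class name `Node`, the midpoint split, the averaging rule, and the restriction that only sibling leaves can merge are all assumptions.

```python
class Node:
    """Node of a 1-D regression tree covering the interval [lo, hi)."""

    def __init__(self, lo, hi, value=0.0):
        self.lo, self.hi, self.value = lo, hi, value
        self.left = self.right = None

    def is_leaf(self):
        return self.left is None

    def predict(self, x):
        """Walk down to the leaf whose interval contains x."""
        node = self
        while not node.is_leaf():
            node = node.left if x < node.left.hi else node.right
        return node.value

    def rupture(self):
        """Split a leaf at its midpoint; both children inherit its value."""
        assert self.is_leaf()
        mid = (self.lo + self.hi) / 2
        self.left = Node(self.lo, mid, self.value)
        self.right = Node(mid, self.hi, self.value)

    def merge(self):
        """Collapse two leaf children into this node, averaging their values."""
        if self.is_leaf() or not (self.left.is_leaf() and self.right.is_leaf()):
            return False   # assumed rule: only sibling leaves can be merged
        self.value = (self.left.value + self.right.value) / 2
        self.left = self.right = None
        return True

    def count_leaves(self):
        if self.is_leaf():
            return 1
        return self.left.count_leaves() + self.right.count_leaves()

root = Node(0.0, 1.0)
root.rupture()                 # two leaves: [0, 0.5) and [0.5, 1)
root.left.value = 2.0
root.right.value = 4.0
root.right.rupture()           # three leaves
print(root.count_leaves(), root.predict(0.25), root.predict(0.9))
print(root.merge())            # False: the right child is no longer a leaf
print(root.right.merge(), root.merge(), root.count_leaves(), root.predict(0.9))
```

Rupture adds resolution where the stored function needs it, and merging gives leaves back, which is one way a tree can stay within a fixed budget of prototypes.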

33
Impossible Merging
34
Rules for a sound tree
35
Impossible Merging
36
Sample Merging
37
Sample Merging
38
Sample Merging
The node to be inserted
39
Sample Merging
40
Sample Merging
41
Sample Merging
42
Sample Merging
43
The agent
Agent
44
Applications
45
Results: first application
                          Tabular Q-learning   Moving Prototypes        Batch Method
Policy Quality            Best                 Best                     Worst
Computational Complexity  O(n)                 O(n log(n)) to O(n^2)    O(n^3)
Memory Usage              Bad                  Best                     Worst
46
Results: first application (details)
                Tabular Q-learning   Moving Prototypes   Batch Method
Policy Quality  2,423,355            2,423,355           2,297,100
Memory Usage    10,202 prototypes    413 prototypes      11,975 prototypes
47
Results: second application
                          Moving Prototypes        LSPI (least-squares policy iteration)
Policy Quality            Best                     Worst
Computational Complexity  O(n log(n)) to O(n^2)    O(n)
Memory Usage              Worst                    Best
48
Results: second application (details)

Moving Prototypes
Policy Quality                 forever (succeeded)     forever (succeeded)      forever (succeeded)
Required Learning Experiences  216                     324                      216
Memory Usage                   about 170 prototypes    about 170 prototypes     about 170 prototypes

LSPI (least-squares policy iteration)
Policy Quality                 26 time steps (failed)  170 time steps (failed)  forever (succeeded)
Required Learning Experiences  1,902,621               183,618                  648
Memory Usage                   2 weight parameters     2 weight parameters      2 weight parameters
49
Results: third application
  • Reason for this experiment
  • Evaluate the performance of the proposed method
    in a scenario that we consider ideal for it,
    namely one for which no application-specific
    knowledge is available.
  • What did it take to learn a good policy?
  • Less than 2 minutes of CPU time.
  • Fewer than 25,000 learning experiences.
  • Fewer than 900 state-action-value tuples.

50
Swimmer first movie
51
Swimmer second movie
52
Swimmer third movie
53
Future Work
  • Different types of splits.
  • Continue characterizing the Moving Prototypes
    method.
  • Moving prototypes + LSPI.
  • Moving prototypes + eligibility traces.