Title: Reinforcement Learning (II.)
1. Reinforcement Learning (II.)
- Ata Kaban
- A.Kaban_at_cs.bham.ac.uk
- School of Computer Science
- University of Birmingham
2. Recall
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what
- Reinforcement Learning: learning what to do from interactions with the environment
3. Recall
- Markov Decision Process
- r_t and s_{t+1} depend only on the current state (s_t) and action (a_t) (the Markov property; written out below)
- Goal: get as much eventual reward as possible, no matter which state you start from.
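In symbols, the Markov property reads (a standard formulation, using the time-indexed notation above):

P(s_{t+1}, r_t \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1}, r_t \mid s_t, a_t)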
4. Today's lecture
- We recall the formalism for the deterministic case
- We reformulate the formalism for the non-deterministic case
- We learn about
  - Bellman equations
  - Optimal policy
  - Policy iteration
  - Q-learning in non-deterministic environments
5. What to learn
- Learn an action policy
- so that it maximises the expected eventual reward from each state (non-deterministic case; in symbols below)
- Learn this from interaction examples, i.e. data of the form ((s,a),r)
- Learn an action policy
- so that it maximises the eventual reward from each state (deterministic case)
- Learn this from interaction examples, i.e. data of the form ((s,a),r)
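Under the usual discounted-return formulation (an assumption here; the discount factor \gamma is 0.9 in the later examples), the goal in symbols is:

\pi^* = \arg\max_{\pi} \; E_{\pi}\Big[ \sum_{k \ge 0} \gamma^k r_{t+k} \,\Big|\, s_t = s \Big] \quad \text{for every state } s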
6. Notations used
- We assume we are at a time t (so time subscripts are dropped below)
- Recall / summarize the other notations as well:
- Policy: π
- Remember: in the deterministic case, π(s) is an action
- In the non-deterministic case, π(s) is a random variable, i.e. we can only talk about π(a|s), which is the probability of doing action a in state s
- State values under a policy: V^π(s)
- Values of state-action pairs (i.e. Q-values): Q(s,a)
- State transitions: the next state depends on the current state and current action
- The state that deterministically follows s if action a is taken: δ(s,a)
- Now the state transitions may also be probabilistic; then the probability that s' follows s if action a is taken is p(s'|s,a)
7. State value function
- How much reward can I expect to accumulate from state s if I follow policy π
- This is called the Bellman equation (a standard form is written out below)
- It is a linear system, which has a unique solution!
- (Deterministic case: how much reward can I accumulate from state s if I follow policy π)
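Written out in the slide 6 notation, a standard form of this Bellman equation is (non-deterministic case first, then the deterministic case; r denotes the immediate reward received on the transition):

V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \, \big[ r + \gamma V^{\pi}(s') \big]

V^{\pi}(s) = r + \gamma V^{\pi}(\delta(s, \pi(s)))

With one such equation per state and the values V^π(s) as the unknowns, this is the linear system referred to above.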
8. Example
- Compute V^{π_a} for the random policy π_a (a code sketch of this kind of computation follows below)
- V^{π_a}(s_6) = 0.5(100 + 0.9·0) + 0.5(0 + 0.9·0) = 50
- V^{π_a}(s_5) = 0.33(0 + 0.9·50) + 0.66(0 + 0.9·0) = 30
- V^{π_a}(s_4) = 0.5(0 + 0.9·30) + 0.5(0 + 0.9·0) = 13.5
- Etc. If computed for all states, then start again and keep iterating till the values converge:
- V(s_6) = 100 + 0.9·0 = 100
- V(s_5) = 0 + 0.9·100 = 90
- V(s_4) = 0 + 0.9·90 = 81
- Btw, has p(s'|s,a) disappeared?
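A minimal Python sketch of this kind of iterative computation. The grid world encoded in the transitions table below is an assumption made for illustration (it is not read off the slide's figure), so the printed numbers need not match the numbers above.

GAMMA = 0.9

# transitions[state][action] = (next_state, immediate_reward); a deterministic world
transitions = {
    's1': {'right': ('s2', 0), 'down': ('s4', 0)},
    's2': {'left': ('s1', 0), 'right': ('s3', 100), 'down': ('s5', 0)},
    's3': {},                                    # absorbing goal state
    's4': {'right': ('s5', 0), 'up': ('s1', 0)},
    's5': {'left': ('s4', 0), 'right': ('s6', 0), 'up': ('s2', 0)},
    's6': {'left': ('s5', 0), 'up': ('s3', 100)},
}

V = {s: 0.0 for s in transitions}                # start from all-zero values
for sweep in range(1000):                        # keep iterating till the values converge
    delta = 0.0
    for s, acts in transitions.items():
        if not acts:
            continue                             # the absorbing state keeps value 0
        # random policy: each available action is chosen with equal probability
        new_v = sum(r + GAMMA * V[s2] for (s2, r) in acts.values()) / len(acts)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:
        break

print({s: round(v, 1) for s, v in V.items()})

Each sweep updates every state's value from the current values of its successors; the loop stops once a sweep no longer changes any value by more than a small threshold.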
9. What is good about it?
- For any policy, the Bellman equation has a unique solution, so it has a unique solution for an optimal policy as well. Denote this by V*.
- The optimal policy is what we want to learn. Denote it by π*.
- If we could learn V*, then with one look-ahead we could compute π*.
- How to learn V*? It depends on π*, which is unknown as well.
- Iterate and improve on both in each iteration until converging to V* and an associated π*. This is called (generalised) policy iteration (the two alternating steps are written out below).
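Written out in a standard form consistent with the notation above (r again denoting the immediate reward on the transition), the two alternating steps are:

Policy evaluation: solve the Bellman equation for the current policy \pi_k,
V^{\pi_k}(s) = \sum_a \pi_k(a \mid s) \sum_{s'} p(s' \mid s, a) \, \big[ r + \gamma V^{\pi_k}(s') \big]

Policy improvement: act greedily with respect to these values,
\pi_{k+1}(s) = \arg\max_a \sum_{s'} p(s' \mid s, a) \, \big[ r + \gamma V^{\pi_k}(s') \big]

Repeating the two steps converges to V* and an associated optimal policy π*.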
10. Generalised Policy Iteration
Geometric illustration
11. Before going on: what exactly is an optimal policy?
- A policy π is said to be better than another policy π' if it achieves at least as much value in every state (in symbols below).
- In an MDP there always exists at least one policy which is at least as good as all others. This is called an optimal policy.
- Any policy which is greedy with respect to V* is an optimal policy.
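In symbols (a standard way to write these statements, using the earlier notation):

\pi \ge \pi' \iff V^{\pi}(s) \ge V^{\pi'}(s) \ \text{for all states } s

V^*(s) = \max_{\pi} V^{\pi}(s), \qquad \pi^*(s) = \arg\max_a \sum_{s'} p(s' \mid s, a) \, \big[ r + \gamma V^*(s') \big] \quad \text{(greedy w.r.t. } V^* \text{)}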
12. What is bad about it?
- We need to know the model of the environment for doing policy iteration
- i.e. we need to know the state transitions (what follows what)
- We need to be able to look one step ahead
- i.e. to try out all actions in order to choose the best one
- In some applications this is not feasible
- In some others it is
- Can you think of any examples?
- a fierce battle?
- an Internet crawler?
13. Looking ahead
Backup diagram
14. The other route: action value function
- How much eventual reward can I expect to get if taking action a from state s
- This is also a Bellman equation (a standard form is written out below)
- It has a unique solution!
- (Deterministic case: how much eventual reward can I get if taking action a from state s)
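A standard form of this equation for the optimal action values, in the earlier notation, is (non-deterministic case first, then the deterministic case):

Q(s, a) = \sum_{s'} p(s' \mid s, a) \, \big[ r + \gamma \max_{a'} Q(s', a') \big]

Q(s, a) = r + \gamma \max_{a'} Q(\delta(s, a), a')

Knowing Q, the greedy choice \pi^*(s) = \arg\max_a Q(s, a) is optimal, which is why no look-ahead through the model is needed.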
15. What is good about it?
- If we know Q, look-ahead is not needed for following an optimal policy!
- i.e. if we know all action values, then just do the best action from each state
- Different implementations exist in the literature to improve efficiency. We will stick with turning the Bellman equation for action-value functions into an iterative update (a sketch follows below).
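A minimal Python sketch of such an iterative update, in the form of tabular Q-learning with a decaying learning rate for a non-deterministic environment. The env object and its reset()/step() methods are an assumed interface introduced for illustration, not one from the lecture.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; returns a dict mapping (state, action) to a value."""
    Q = defaultdict(float)        # Q values, initialised to 0
    visits = defaultdict(int)     # visit counts, used for the decaying step size

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy choice of the next action
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])

            s2, r, done = env.step(a)     # assumed to return (next_state, reward, done)

            # decaying learning rate, so that noisy rewards and transitions get averaged out
            visits[(s, a)] += 1
            alpha = 1.0 / (1.0 + visits[(s, a)])

            # iterative update derived from the Bellman equation for Q
            target = r + gamma * max(Q[(s2, act)] for act in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

            s = s2
    return Q

With alpha fixed at 1 this reduces to the deterministic update Q(s,a) <- r + gamma * max_a' Q(s',a'); the decaying alpha is what makes the update behave like an average in a non-deterministic world.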
16. Simple example of updating Q
[Figure: a grid world with states s1 to s6; the rewards are given on the arrows, and the Q values from a previous iteration are also given on the arrows. Simple, i.e. observe this is a deterministic world. Recall the update rule (below).]
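The rule being recalled here is presumably the deterministic training rule for Q; in the notation above it reads:

Q(s, a) \leftarrow r + \gamma \max_{a'} Q(\delta(s, a), a')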
18. First iteration
- Q(L1, A) = 0 + 0.9(0.7·50 + 0.3·0)
- Q(L1, S) = 0 + 0.9(0.5·(-50) + 0.5·0)
- Q(L1, M) = ?
- What is your optimal action plan?
19. Key points
- Learning by reinforcement
- Markov Decision Processes
- Value functions
- Optimal policy
- Bellman equations
- Methods and implementations for computing value functions
- Policy iteration
- Q-learning