Title: Reinforcement Learning (II.)
1. Reinforcement Learning (II.)
- Ata Kaban
- A.Kaban_at_cs.bham.ac.uk
- School of Computer Science
- University of Birmingham
2. Recall
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what
- Reinforcement Learning: learning what to do from interactions with the environment
3. Recall
- Markov Decision Process
- r_t and s_{t+1} depend only on the current state (s_t) and action (a_t) (the Markov property; written out below)
- Goal: get as much eventual reward as possible, no matter which state you start from.
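In symbols, the Markov property reads (a standard formulation, using the time-indexed notation above):

P(s_{t+1}, r_t \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1}, r_t \mid s_t, a_t)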
4. Today's lecture
- We recall the formalism for the deterministic case
- We reformulate the formalism for the non-deterministic case
- We learn about
  - Bellman equations
  - Optimal policy
  - Policy iteration
  - Q-learning in non-deterministic environments
5. What to learn
- Learn an action policy
- so that it maximises the expected eventual reward from each state (non-deterministic case; in symbols below)
- Learn this from interaction examples, i.e. data of the form ((s,a),r)
- Learn an action policy
- so that it maximises the eventual reward from each state (deterministic case)
- Learn this from interaction examples, i.e. data of the form ((s,a),r)
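Under the usual discounted-return formulation (an assumption here; the discount factor \gamma is 0.9 in the later examples), the goal in symbols is:

\pi^* = \arg\max_{\pi} \; E_{\pi}\Big[ \sum_{k \ge 0} \gamma^k r_{t+k} \,\Big|\, s_t = s \Big] \quad \text{for every state } s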
6. Notations used
- We assume we are at a time t (so time subscripts are dropped below)
- Recall / summarize the other notations as well:
- Policy: π
- Remember: in the deterministic case, π(s) is an action
- In the non-deterministic case, π(s) is a random variable, i.e. we can only talk about π(a|s), which is the probability of doing action a in state s
- State values under a policy: V^π(s)
- Values of state-action pairs (i.e. Q-values): Q(s,a)
- State transitions: the next state depends on the current state and current action
- The state that deterministically follows s if action a is taken: δ(s,a)
- Now the state transitions may also be probabilistic; then the probability that s' follows s if action a is taken is p(s'|s,a)
7. State value function
- How much reward can I expect to accumulate from state s if I follow policy π
- This is called the Bellman equation (a standard form is written out below)
- It is a linear system, which has a unique solution!
- (Deterministic case: how much reward can I accumulate from state s if I follow policy π)
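Written out in the slide 6 notation, a standard form of this Bellman equation is (non-deterministic case first, then the deterministic case; r denotes the immediate reward received on the transition):

V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \, \big[ r + \gamma V^{\pi}(s') \big]

V^{\pi}(s) = r + \gamma V^{\pi}(\delta(s, \pi(s)))

With one such equation per state and the values V^π(s) as the unknowns, this is the linear system referred to above.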
8. Example
- Compute V^{π_a} for the random policy π_a (a code sketch of this kind of computation follows below)
- V^{π_a}(s_6) = 0.5(100 + 0.9·0) + 0.5(0 + 0.9·0) = 50
- V^{π_a}(s_5) = 0.33(0 + 0.9·50) + 0.66(0 + 0.9·0) = 30
- V^{π_a}(s_4) = 0.5(0 + 0.9·30) + 0.5(0 + 0.9·0) = 13.5
- Etc. If computed for all states, then start again and keep iterating till the values converge:
- V(s_6) = 100 + 0.9·0 = 100
- V(s_5) = 0 + 0.9·100 = 90
- V(s_4) = 0 + 0.9·90 = 81
- Btw, has p(s'|s,a) disappeared?
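A minimal Python sketch of this kind of iterative computation. The grid world encoded in the transitions table below is an assumption made for illustration (it is not read off the slide's figure), so the printed numbers need not match the numbers above.

GAMMA = 0.9

# transitions[state][action] = (next_state, immediate_reward); a deterministic world
transitions = {
    's1': {'right': ('s2', 0), 'down': ('s4', 0)},
    's2': {'left': ('s1', 0), 'right': ('s3', 100), 'down': ('s5', 0)},
    's3': {},                                    # absorbing goal state
    's4': {'right': ('s5', 0), 'up': ('s1', 0)},
    's5': {'left': ('s4', 0), 'right': ('s6', 0), 'up': ('s2', 0)},
    's6': {'left': ('s5', 0), 'up': ('s3', 100)},
}

V = {s: 0.0 for s in transitions}                # start from all-zero values
for sweep in range(1000):                        # keep iterating till the values converge
    delta = 0.0
    for s, acts in transitions.items():
        if not acts:
            continue                             # the absorbing state keeps value 0
        # random policy: each available action is chosen with equal probability
        new_v = sum(r + GAMMA * V[s2] for (s2, r) in acts.values()) / len(acts)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:
        break

print({s: round(v, 1) for s, v in V.items()})

Each sweep updates every state's value from the current values of its successors; the loop stops once a sweep no longer changes any value by more than a small threshold.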
9. What is good about it?
- For any policy, the Bellman equation has a unique solution, so it has a unique solution for an optimal policy as well. Denote this by V*.
- The optimal policy is what we want to learn. Denote it by π*.
- If we could learn V*, then with one look-ahead we could compute π*.
- How to learn V*? It depends on π*, which is unknown as well.
- Iterate and improve on both in each iteration until converging to V* and an associated π*. This is called (generalised) policy iteration (the two alternating steps are written out below).
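Written out in a standard form consistent with the notation above (r again denoting the immediate reward on the transition), the two alternating steps are:

Policy evaluation: solve the Bellman equation for the current policy \pi_k,
V^{\pi_k}(s) = \sum_a \pi_k(a \mid s) \sum_{s'} p(s' \mid s, a) \, \big[ r + \gamma V^{\pi_k}(s') \big]

Policy improvement: act greedily with respect to these values,
\pi_{k+1}(s) = \arg\max_a \sum_{s'} p(s' \mid s, a) \, \big[ r + \gamma V^{\pi_k}(s') \big]

Repeating the two steps converges to V* and an associated optimal policy π*.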
10. Generalised Policy Iteration
Geometric illustration
11. Before going on: what exactly is an optimal policy?
- A policy π is said to be better than another policy π' if it achieves at least as much value in every state (in symbols below).
- In an MDP there always exists at least one policy which is at least as good as all others. This is called an optimal policy.
- Any policy which is greedy with respect to V* is an optimal policy.
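In symbols (a standard way to write these statements, using the earlier notation):

\pi \ge \pi' \iff V^{\pi}(s) \ge V^{\pi'}(s) \ \text{for all states } s

V^*(s) = \max_{\pi} V^{\pi}(s), \qquad \pi^*(s) = \arg\max_a \sum_{s'} p(s' \mid s, a) \, \big[ r + \gamma V^*(s') \big] \quad \text{(greedy w.r.t. } V^* \text{)}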
12. What is bad about it?
- We need to know the model of the environment for doing policy iteration
- i.e. we need to know the state transitions (what follows what)
- We need to be able to look one step ahead
- i.e. to try out all actions in order to choose the best one
- In some applications this is not feasible
- In some others it is
- Can you think of any examples?
- a fierce battle?
- an Internet crawler?
13. Looking ahead
Backup diagram
14. The other route: action value function
- How much eventual reward can I expect to get if taking action a from state s
- This is also a Bellman equation (a standard form is written out below)
- It has a unique solution!
- (Deterministic case: how much eventual reward can I get if taking action a from state s)
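A standard form of this equation for the optimal action values, in the earlier notation, is (non-deterministic case first, then the deterministic case):

Q(s, a) = \sum_{s'} p(s' \mid s, a) \, \big[ r + \gamma \max_{a'} Q(s', a') \big]

Q(s, a) = r + \gamma \max_{a'} Q(\delta(s, a), a')

Knowing Q, the greedy choice \pi^*(s) = \arg\max_a Q(s, a) is optimal, which is why no look-ahead through the model is needed.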
15. What is good about it?
- If we know Q, look-ahead is not needed for following an optimal policy!
- i.e. if we know all action values, then just do the best action from each state
- Different implementations exist in the literature to improve efficiency. We will stick with turning the Bellman equation for action-value functions into an iterative update (a sketch follows below).
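A minimal Python sketch of such an iterative update, in the form of tabular Q-learning with a decaying learning rate for a non-deterministic environment. The env object and its reset()/step() methods are an assumed interface introduced for illustration, not one from the lecture.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; returns a dict mapping (state, action) to a value."""
    Q = defaultdict(float)        # Q values, initialised to 0
    visits = defaultdict(int)     # visit counts, used for the decaying step size

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy choice of the next action
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])

            s2, r, done = env.step(a)     # assumed to return (next_state, reward, done)

            # decaying learning rate, so that noisy rewards and transitions get averaged out
            visits[(s, a)] += 1
            alpha = 1.0 / (1.0 + visits[(s, a)])

            # iterative update derived from the Bellman equation for Q
            target = r + gamma * max(Q[(s2, act)] for act in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

            s = s2
    return Q

With alpha fixed at 1 this reduces to the deterministic update Q(s,a) <- r + gamma * max_a' Q(s',a'); the decaying alpha is what makes the update behave like an average in a non-deterministic world.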
16. Simple example of updating Q
[Figure: a grid world with states s1 to s6; the rewards are given on the arrows, and the Q values from a previous iteration are also given on the arrows. Simple, i.e. observe this is a deterministic world. Recall the update rule (below).]
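The rule being recalled here is presumably the deterministic training rule for Q; in the notation above it reads:

Q(s, a) \leftarrow r + \gamma \max_{a'} Q(\delta(s, a), a')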
18. First iteration
- Q(L1, A) = 0 + 0.9(0.7·50 + 0.3·0)
- Q(L1, S) = 0 + 0.9(0.5·(-50) + 0.5·0)
- Q(L1, M) = ?
- What is your optimal action plan?
19. Key points
- Learning by reinforcement
- Markov Decision Processes
- Value functions
- Optimal policy
- Bellman equations
- Methods and implementations for computing value functions
- Policy iteration
- Q-learning