Title: Learning to Maximize Reward: Reinforcement Learning
1. Learning to Maximize Reward: Reinforcement Learning
Brian C. Williams, 16.412J/6.834J, October 28th, 2002
Slides adapted from Manuela Veloso, Reid Simmons, and Tom Mitchell, CMU
2. Reading
- Today: Reinforcement Learning
  - Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
  - Read "Reinforcement Learning: A Survey" by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996) 237-285.
- For Markov Decision Processes:
  - Read 1st/2nd ed. AIMA Chapter 17, sections 1-4.
- Optional reading: "Planning and Acting in Partially Observable Stochastic Domains" by L. Kaelbling, M. Littman and A. Cassandra, Artificial Intelligence 101 (1998) 99-134.
3. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Q values
- Q learning
- Multi-step backups
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
4. Example: TD-Gammon [Tesauro, 1995]
- Learns to play Backgammon
- Situations: board configurations (roughly 10^20)
- Actions: moves
- Rewards:
  - +100 if win
  - -100 if lose
  - 0 for all other states
- Trained by playing 1.5 million games against itself.
- Currently roughly equal to the best human player.
5. Reinforcement Learning Problem
- Given, repeatedly:
  - Executed action
  - Observed state
  - Observed reward
- Learn action policy π: S → A that maximizes life reward r_0 + γ r_1 + γ² r_2 + ... from any start state.
- Discount factor 0 < γ < 1
- Note:
  - Unsupervised learning
  - Delayed reward
Goal: Learn to choose actions that maximize life reward r_0 + γ r_1 + γ² r_2 + ...
6. How About Learning the Policy Directly?
- π: S → A
- Fill out table entries for π by collecting statistics on training pairs ⟨s, a⟩.
- Where does a come from?
7. How About Learning the Value Function?
- Have the agent learn the optimal value function V^π*, denoted V*.
- Given learned V*, the agent selects the optimal action by one-step lookahead:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
- Problem:
  - Works well if the agent knows the environment model:
    - δ: S × A → S
    - r: S × A → ℝ
  - With no model, the agent can't choose an action from V*.
  - With a model, we could compute V* via value iteration, so why learn it?
8. How About Learning the Model as Well?
- Have the agent learn δ and r from statistics on training instances ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩.
- Compute V* by value iteration: V_{t+1}(s) ← max_a [r(s, a) + γ V_t(δ(s, a))]
- The agent selects the optimal action by one-step lookahead:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
- Problem: a viable strategy for many problems, but
  - When do you stop learning the model and compute V*?
  - It may take a long time to converge on the model.
  - We would like to continuously interleave learning and acting, but repeatedly computing V* is costly.
- How can we avoid learning the model and V* explicitly?
9. Eliminating the Model with Q Functions
- π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
- Key idea:
  - Define a function that encapsulates V*, δ and r:
  - Q(s, a) = r(s, a) + γ V*(δ(s, a))
- From learned Q, we can choose an optimal action without knowing δ or r:
  - π*(s) = argmax_a Q(s, a)
- V*: cumulative reward of being in s.
- Q: cumulative reward of being in s and taking action a.
10. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Q values
- Q learning
- Multi-step backups
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
11. How Do We Learn Q?
- Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
- Idea:
  - Create an update rule similar to the Bellman equation.
  - Perform updates on training examples ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩:
  - Q(s_t, a_t) ← r_{t+1} + γ V*(s_{t+1})
- How do we eliminate V*?
  - Q and V* are closely related: V*(s) = max_{a'} Q(s, a')
  - Substituting Q for V*:
  - Q(s_t, a_t) ← r_{t+1} + γ max_{a'} Q(s_{t+1}, a')
  (called a backup)
12. Q-Learning for Deterministic Worlds
- Let Q̂ denote the current approximation to Q.
- Initially:
  - For each s, a: initialize the table entry Q̂(s, a) ← 0
  - Observe the initial state s_0
- Do for all time t:
  - Select an action a_t and execute it
  - Receive immediate reward r_{t+1}
  - Observe the new state s_{t+1}
  - Update the table entry for Q̂(s_t, a_t) as follows:
    - Q̂(s_t, a_t) ← r_{t+1} + γ max_{a'} Q̂(s_{t+1}, a')
  - s_t ← s_{t+1}
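The table-based loop above is short enough to sketch directly in Python. This is a minimal illustration, not code from the lecture: the `env` object with `reset()`, `actions(s)` and `step(s, a)` methods is an assumed interface for whatever world supplies states and rewards, and the action choice is shown as purely greedy for brevity (slide 23 discusses better exploration strategies).

```python
from collections import defaultdict

def q_learning_deterministic(env, gamma=0.9, num_steps=10_000):
    """Tabular Q-learning for a deterministic world.

    Assumed environment interface (illustrative, not from the slides):
      env.reset()    -> initial state
      env.actions(s) -> iterable of actions available in s
      env.step(s, a) -> (immediate reward, next state)
    """
    Q = defaultdict(float)                     # Q-hat table, all entries start at 0
    s = env.reset()                            # observe initial state s0
    for _ in range(num_steps):
        # Select an action (greedy here for brevity; see Exploration vs. Exploitation)
        a = max(env.actions(s), key=lambda act: Q[(s, act)])
        r, s_next = env.step(s, a)             # receive reward, observe new state
        # Backup: Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + gamma * max(Q[(s_next, act)] for act in env.actions(s_next))
        s = s_next
    return Q
```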
13. Example: Q Learning Update
[Figure: before the update, the arc Q̂(s1, a_right) = 72; the arcs leaving s2 have Q̂ values 63, 81, and 100; reward 0 is received on the move from s1 to s2.]
14. Example: Q Learning Update
[Figure: after the update, the arc Q̂(s1, a_right) changes from 72 to 90; the arcs leaving s2 keep Q̂ values 63, 81, and 100; reward 0 is received.]
- Q̂(s1, a_right) ← r(s1, a_right) + γ max_{a'} Q̂(s2, a')
  ← 0 + 0.9 × max{63, 81, 100}
  ← 90
- Note: if rewards are non-negative, then
  - For all s, a, n: Q̂_n(s, a) ≤ Q̂_{n+1}(s, a)
  - For all s, a, n: 0 ≤ Q̂_n(s, a) ≤ Q(s, a)
15-21. Q-Learning Iterations (Episodic)
- Start at the upper left and move clockwise; the table is initially all 0; γ = 0.8.
- Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
- Building up the table of Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) episode by episode:
  - Episode 1: Q̂(s4,W) ← r + γ max_{a'} Q̂(s5, a') = 10 + 0.8 × 0 = 10; all other entries remain 0.
  - Episode 2: Q̂(s3,S) ← r + γ max{Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8 × max{10, 0} = 8; Q̂(s4,W) stays 10.
  - Episode 3: Q̂(s2,E) ← r + γ max{Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4; Q̂(s3,S) stays 8, Q̂(s4,W) stays 10.
  - Q̂(s1,E) is still 0 after three episodes.
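The episode-by-episode numbers above can be reproduced with the same backup rule. The grid structure used below (a clockwise path s1 →E s2 →E s3 →S s4 →W s5 with reward 10 only on the final move, s5 absorbing, γ = 0.8) is my reading of the table, not something spelled out in the slides.

```python
gamma = 0.8
Q = {}   # (state, action) -> Q-hat; missing entries are treated as 0

# One clockwise episode of (s, a, r, s') transitions; s5 is absorbing.
episode = [("s1", "E", 0, "s2"), ("s2", "E", 0, "s3"),
           ("s3", "S", 0, "s4"), ("s4", "W", 10, "s5")]
actions = {"s1": ["E"], "s2": ["E"], "s3": ["S", "W"],
           "s4": ["W", "N"], "s5": ["loop"]}

def backup(s, a, r, s_next):
    # Q(s, a) <- r + gamma * max_a' Q(s', a')
    Q[(s, a)] = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions[s_next])

for _ in range(3):                      # three episodes, as in the table
    for s, a, r, s_next in episode:
        backup(s, a, r, s_next)

# Expected to match the table: Q(s4,W)=10, Q(s3,S)=8, Q(s2,E)=6.4, Q(s1,E)=0
print(Q[("s4", "W")], Q[("s3", "S")], Q[("s2", "E")], Q[("s1", "E")])
```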
22. Example Summary: Value Iteration and Q-Learning
[Figure: the example grid world annotated with its r(s, a) values.]
23. Exploration vs. Exploitation
- How do you pick actions as you learn?
- Greedy action selection:
  - Always select the action that looks best:
  - π(s) = argmax_a Q̂(s, a)
- Probabilistic action selection:
  - Likelihood of a_i is proportional to the current Q̂ value:
  - P(a_i | s) = k^Q̂(s, a_i) / Σ_j k^Q̂(s, a_j)
24. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Q values
- Q learning
- Multi-step backups
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
25. TD(λ): Temporal Difference Learning (from lecture slides for Machine Learning, T. Mitchell, McGraw Hill, 1997)
- Q learning: reduce the discrepancy between successive Q estimates.
- One-step time difference:
  - Q^(1)(s_t, a_t) = r_t + γ max_a Q̂(s_{t+1}, a)
- Why not two steps?
  - Q^(2)(s_t, a_t) = r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
- Or n?
  - Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)
- Blend all of these:
  - Q^λ(s_t, a_t) = (1 - λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ...]
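For concreteness, the n-step estimate and the λ-weighted blend can be computed as below. This is a sketch over a finite, already-collected trajectory; the true Q^λ sums over infinitely many n-step estimates, so the blend here is a truncation.

```python
def n_step_return(rewards, q_tail, gamma, n):
    """Q^(n) estimate for one (s_t, a_t):
    r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * q_tail,
    where q_tail = max_a Q-hat(s_{t+n}, a) and rewards = [r_t, r_{t+1}, ...]."""
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards[:n]))
    return discounted + gamma ** n * q_tail

def lambda_blend(q_estimates, lam):
    """Blend Q^(1), Q^(2), ... with weights (1 - lam) * lam^(n-1).
    `q_estimates` is a finite list [Q^(1), Q^(2), ...], so this truncates
    the infinite sum in the definition of Q^lambda."""
    return (1 - lam) * sum(lam ** (n - 1) * q for n, q in enumerate(q_estimates, 1))
```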
26. Eligibility Traces
- Idea: perform backups on the N previous data points, as well as the most recent data point.
- Select data to back up based on frequency of visitation.
- Bias towards frequent data by a geometric decay γ^(i-j).
27. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Nondeterministic MDPs
- Value Iteration
- Q Learning
- Function Approximators
- Model-based Learning
- Summary
28. Nondeterministic MDPs
- State transitions become probabilistic: δ(s, a, s') gives the probability of moving to state s' when action a is taken in state s.
29. Nondeterministic Case
- How do we redefine cumulative reward to handle nondeterminism?
- Define V and Q based on expected values:
  - V^π(s_t) = E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
  - V^π(s_t) = E[Σ_i γ^i r_{t+i}]
  - Q(s_t, a_t) = E[r(s_t, a_t) + γ V*(δ(s_t, a_t))]
30. Value Iteration for Nondeterministic MDPs
- V_1(s) := 0 for all s
- t := 1
- loop
  - t := t + 1
  - loop for all s in S
    - loop for all a in A
      - Q_t(s, a) := r(s, a) + γ Σ_{s' in S} δ(s, a, s') V_{t-1}(s')
    - end loop
    - V_t(s) := max_a Q_t(s, a)
  - end loop
- until |V_t(s) - V_{t-1}(s)| < ε for all s in S
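A direct Python transcription of this loop, assuming the model is given as plain dictionaries (T for transition probabilities, R for expected rewards); these container choices are illustrative only.

```python
def value_iteration(S, A, T, R, gamma=0.9, eps=1e-4):
    """Value iteration for a nondeterministic MDP.

    T[(s, a, s2)] is the transition probability delta(s, a, s2) (missing keys
    are treated as 0); R[(s, a)] is the expected immediate reward.
    """
    V = {s: 0.0 for s in S}
    while True:
        Q = {(s, a): R[(s, a)] + gamma * sum(T.get((s, a, s2), 0.0) * V[s2]
                                             for s2 in S)
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:   # |V_t - V_{t-1}| < eps
            return V_new, Q
        V = V_new
```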
31. Q Learning for Nondeterministic MDPs
- Q(s, a) = r(s, a) + γ Σ_{s' in S} δ(s, a, s') max_{a'} Q(s', a')
- Alter the training rule for the nondeterministic case:
  - Q̂_n(s_t, a_t) ← (1 - α_n) Q̂_{n-1}(s_t, a_t) + α_n [r_{t+1} + γ max_{a'} Q̂_{n-1}(s_{t+1}, a')]
  - where α_n = 1 / (1 + visits_n(s, a))
- Convergence of Q̂ to Q can still be proven [Watkins and Dayan, 92].
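The altered rule only changes how each backup is mixed into the old estimate, so the deterministic sketch from slide 12 needs just a decaying learning rate. A hedged illustration, keeping visit counts in a dictionary:

```python
from collections import defaultdict

Q = defaultdict(float)       # Q-hat estimates, start at 0
visits = defaultdict(int)    # visit counts per (s, a)

def q_update(s, a, r, s_next, actions, gamma=0.9):
    """Nondeterministic Q-learning backup with decaying learning rate
    alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```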
32. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
33. Function Approximation
[Diagram: a function approximator takes state s and action a as inputs and outputs Q̂(s, a); it is trained from targets or errors.]
- Function approximators:
  - Backprop neural network
  - Radial basis function network
  - CMAC network
  - Nearest neighbor, memory-based
  - Decision tree
- The network-based approximators above are trained by gradient-descent methods.
34. Function Approximation Example: Adjusting Network Weights
[Diagram: the same function approximator as above, with weights w adjusted from targets or errors.]
- Function approximator: Q̂(s, a) = f(s, a, w)
- Update: gradient-descent Sarsa:
  - w ← w + α [r_{t+1} + γ Q̂(s_{t+1}, a_{t+1}) - Q̂(s_t, a_t)] ∇_w f(s_t, a_t, w)
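For the simplest case, a linear approximator Q̂(s, a) = w · f(s, a), the gradient ∇_w f(s, a, w) is just the feature vector, and the update becomes one line of numpy. The `features(s, a)` helper below is hypothetical; it stands in for whatever encoding of state and action is used.

```python
import numpy as np

def sarsa_update(w, features, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
    """One gradient-descent Sarsa step for a linear approximator
    Q-hat(s, a) = w . features(s, a), so grad_w f(s, a, w) = features(s, a).

    `features(s, a)` returning a numpy vector is an assumed helper.
    """
    x = features(s, a)
    td_error = r + gamma * np.dot(w, features(s_next, a_next)) - np.dot(w, x)
    return w + alpha * td_error * x   # w <- w + alpha * [target - estimate] * gradient
```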
35. Example: TD-Gammon [Tesauro, 1995]
- (Recap of slide 4: backgammon, roughly 10^20 board configurations, rewards +100 / -100 / 0, trained on 1.5 million games of self-play, roughly equal to the best human player.)
36. Example: TD-Gammon [Tesauro, 1995]
[Diagram: a neural network maps the raw board position (number of pieces at each position) through 0-160 hidden units, starting from random initial weights, to V̂(s).]
- V̂(s) = predicted probability of winning
- On a win the outcome is 1; on a loss the outcome is 0.
- TD error: V̂(s_{t+1}) - V̂(s_t)
37. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
38. Model-based Learning: the Certainty-Equivalence Method
- At every step:
  - Use the new experience to update the model parameters:
    - Transitions
    - Rewards
  - Solve the model for V and π:
    - Value iteration, or
    - Policy iteration.
  - Use the policy to choose the next action.
39. Learning the Model
- For each state-action pair ⟨s, a⟩ visited, accumulate:
  - Mean transition:
    T(s, a, s') = number-times-seen(s, a → s') / number-times-tried(s, a)
  - Mean reward R(s, a): the average of the rewards received when a was taken in s.
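A count-based sketch of these statistics in Python; the dictionary-based storage and the function names `record`, `T` and `R` are illustrative choices, not from the slides.

```python
from collections import defaultdict

seen = defaultdict(int)          # (s, a, s') -> number-times-seen(s, a -> s')
tried = defaultdict(int)         # (s, a)     -> number-times-tried(s, a)
reward_sum = defaultdict(float)  # (s, a)     -> running total of observed rewards

def record(s, a, r, s_next):
    """Accumulate statistics for one observed transition."""
    tried[(s, a)] += 1
    seen[(s, a, s_next)] += 1
    reward_sum[(s, a)] += r

def T(s, a, s_next):
    """Estimated transition probability."""
    return seen[(s, a, s_next)] / tried[(s, a)] if tried[(s, a)] else 0.0

def R(s, a):
    """Mean observed reward."""
    return reward_sum[(s, a)] / tried[(s, a)] if tried[(s, a)] else 0.0
```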
40. Comparison of Model-based and Model-free Methods
- Temporal differencing / Q learning: only does computation for the states the system is actually in.
  - Good real-time performance
  - Inefficient use of data
- Model-based methods: compute the best estimates for every state on every time step.
  - Efficient use of data
  - Terrible real-time performance
- What is a middle ground?
41. Dyna: A Middle Ground [Sutton, Intro to RL, 97]
- At each step, incrementally:
  - Update the model based on new data
  - Update the policy based on new data
  - Update the policy based on the updated model
- Performance until optimal, on Grid World:
  - Q-Learning: 531,000 steps, 531,000 backups
  - Dyna: 61,908 steps, 3,055,000 backups
42. Dyna Algorithm
- Given state s:
  - Choose action a using the estimated policy.
  - Observe the new state s' and reward r.
  - Update T and R of the model.
  - Update V at ⟨s, a⟩:
    - V(s) ← max_a [R(s, a) + γ Σ_{s'} T(s, a, s') V(s')]
  - Perform k additional updates:
    - Pick k random states s_1, s_2, ..., s_k
    - Update each V(s_j):
      - V(s_j) ← max_a [R(s_j, a) + γ Σ_{s'} T(s_j, a, s') V(s')]
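A sketch of one Dyna iteration under assumed interfaces: `policy(s)` picks an action, `env_step(s, a)` returns the next state and reward, and `model` exposes the count-based `record`, `T` and `R` from the previous sketch. None of these names come from the slides.

```python
import random

def dyna_step(s, V, model, states, actions, policy, env_step, gamma=0.9, k=20):
    """One Dyna iteration: act, update the model, then perform k extra backups
    on randomly chosen states using the learned model."""
    a = policy(s)                          # choose action from the estimated policy
    s_next, r = env_step(s, a)             # observe new state and reward
    model.record(s, a, r, s_next)          # update T and R statistics

    def backup(sj):
        V[sj] = max(model.R(sj, b) +
                    gamma * sum(model.T(sj, b, s2) * V[s2] for s2 in states)
                    for b in actions)

    backup(s)                              # update V at the state just visited
    for sj in random.sample(states, min(k, len(states))):
        backup(sj)                         # k additional simulated backups
    return s_next
```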
43. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
44. Ongoing Research
- Handling cases where the state is only partially observable
- Design of optimal exploration strategies
- Extending to continuous actions and states
- Learning and using a model δ: S × A → S
- Scaling up in the size of the state space:
  - Function approximators (neural net instead of table)
  - Generalization
  - Macros
  - Exploiting substructure
- Multiple learners: multi-agent reinforcement learning
45. Markov Decision Processes (MDPs)
- Model:
  - Finite set of states, S
  - Finite set of actions, A
  - Probabilistic state transitions, δ(s, a)
  - Reward for each state and action, R(s, a)
- Process:
  - Observe state s_t in S
  - Choose action a_t in A
  - Receive immediate reward r_t
  - State changes to s_{t+1}
[Diagram: the interaction unrolls as s_0 --a_0, r_0--> s_1 --a_1, r_1--> s_2 ...]
46. Crib Sheet: MDPs by Value Iteration
- Insight: optimal values can be calculated iteratively using dynamic programming.
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - V_{t+1}(s) ← max_a [r(s, a) + γ V_t(δ(s, a))]
  - Terminate when the values are close enough:
    - |V_{t+1}(s) - V_t(s)| < ε
- The agent selects the optimal action by one-step lookahead on V*:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
47. Crib Sheet: Q-Learning for Deterministic Worlds
- Let Q̂ denote the current approximation to Q.
- Initially:
  - For each s, a: initialize the table entry Q̂(s, a) ← 0
  - Observe the current state s
- Do forever:
  - Select an action a and execute it
  - Receive immediate reward r
  - Observe the new state s'
  - Update the table entry for Q̂(s, a) as follows:
    - Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
  - s ← s'