Title: Learning to Maximize Reward: Reinforcement Learning
1. Learning to Maximize Reward: Reinforcement Learning
Brian C. Williams, 16.412J/6.834J, October 28th, 2002
Slides adapted from Manuela Veloso, Reid Simmons, and Tom Mitchell, CMU
2. Reading
- Today: Reinforcement Learning
  - Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
  - Read "Reinforcement Learning: A Survey" by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996) 237-285.
- For Markov Decision Processes:
  - Read 1st/2nd ed. AIMA Chapter 17, sections 1-4.
- Optional reading: "Planning and Acting in Partially Observable Stochastic Domains" by L. Kaelbling, M. Littman and A. Cassandra, Artificial Intelligence 101 (1998) 99-134.
3. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Q values
- Q learning
- Multi-step backups
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
4. Example: TD-Gammon [Tesauro, 1995]
- Learns to play Backgammon
- Situations: board configurations (roughly 10^20)
- Actions: moves
- Rewards:
  - +100 if win
  - -100 if lose
  - 0 for all other states
- Trained by playing 1.5 million games against itself.
- Currently roughly equal to the best human player.
5. Reinforcement Learning Problem
- Given, repeatedly:
  - Executed action
  - Observed state
  - Observed reward
- Learn action policy π: S → A that maximizes life reward r_0 + γ r_1 + γ² r_2 + ... from any start state.
- Discount factor 0 < γ < 1
- Note:
  - Unsupervised learning
  - Delayed reward
Goal: Learn to choose actions that maximize life reward r_0 + γ r_1 + γ² r_2 + ...
6. How About Learning the Policy Directly?
- π: S → A
- Fill out table entries for π by collecting statistics on training pairs ⟨s, a⟩.
- Where does a come from?
7. How About Learning the Value Function?
- Have the agent learn the optimal value function V^π*, denoted V*.
- Given learned V*, the agent selects the optimal action by one-step lookahead:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
- Problem:
  - Works well if the agent knows the environment model:
    - δ: S × A → S
    - r: S × A → ℝ
  - With no model, the agent can't choose an action from V*.
  - With a model, we could compute V* via value iteration, so why learn it?
8. How About Learning the Model as Well?
- Have the agent learn δ and r from statistics on training instances ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩.
- Compute V* by value iteration: V_{t+1}(s) ← max_a [r(s, a) + γ V_t(δ(s, a))]
- The agent selects the optimal action by one-step lookahead:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
- Problem: a viable strategy for many problems, but
  - When do you stop learning the model and compute V*?
  - It may take a long time to converge on the model.
  - We would like to continuously interleave learning and acting, but repeatedly computing V* is costly.
- How can we avoid learning the model and V* explicitly?
9. Eliminating the Model with Q Functions
- π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
- Key idea:
  - Define a function that encapsulates V*, δ and r:
  - Q(s, a) = r(s, a) + γ V*(δ(s, a))
- From learned Q, we can choose an optimal action without knowing δ or r:
  - π*(s) = argmax_a Q(s, a)
- V*: cumulative reward of being in s.
- Q: cumulative reward of being in s and taking action a.
10. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Q values
- Q learning
- Multi-step backups
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
11. How Do We Learn Q?
- Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
- Idea:
  - Create an update rule similar to the Bellman equation.
  - Perform updates on training examples ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩:
  - Q(s_t, a_t) ← r_{t+1} + γ V*(s_{t+1})
- How do we eliminate V*?
  - Q and V* are closely related: V*(s) = max_{a'} Q(s, a')
  - Substituting Q for V*:
  - Q(s_t, a_t) ← r_{t+1} + γ max_{a'} Q(s_{t+1}, a')
  (called a backup)
12. Q-Learning for Deterministic Worlds
- Let Q̂ denote the current approximation to Q.
- Initially:
  - For each s, a: initialize the table entry Q̂(s, a) ← 0
  - Observe the initial state s_0
- Do for all time t:
  - Select an action a_t and execute it
  - Receive immediate reward r_{t+1}
  - Observe the new state s_{t+1}
  - Update the table entry for Q̂(s_t, a_t) as follows:
    - Q̂(s_t, a_t) ← r_{t+1} + γ max_{a'} Q̂(s_{t+1}, a')
  - s_t ← s_{t+1}
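The table-based loop above is short enough to sketch directly in Python. This is a minimal illustration, not code from the lecture: the `env` object with `reset()`, `actions(s)` and `step(s, a)` methods is an assumed interface for whatever world supplies states and rewards, and the action choice is shown as purely greedy for brevity (slide 23 discusses better exploration strategies).

```python
from collections import defaultdict

def q_learning_deterministic(env, gamma=0.9, num_steps=10_000):
    """Tabular Q-learning for a deterministic world.

    Assumed environment interface (illustrative, not from the slides):
      env.reset()    -> initial state
      env.actions(s) -> iterable of actions available in s
      env.step(s, a) -> (immediate reward, next state)
    """
    Q = defaultdict(float)                     # Q-hat table, all entries start at 0
    s = env.reset()                            # observe initial state s0
    for _ in range(num_steps):
        # Select an action (greedy here for brevity; see Exploration vs. Exploitation)
        a = max(env.actions(s), key=lambda act: Q[(s, act)])
        r, s_next = env.step(s, a)             # receive reward, observe new state
        # Backup: Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + gamma * max(Q[(s_next, act)] for act in env.actions(s_next))
        s = s_next
    return Q
```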
13. Example: Q Learning Update
[Figure: before the update, the arc Q̂(s1, a_right) = 72; the arcs leaving s2 have Q̂ values 63, 81, and 100; reward 0 is received on the move from s1 to s2.]
14. Example: Q Learning Update
[Figure: after the update, the arc Q̂(s1, a_right) changes from 72 to 90; the arcs leaving s2 keep Q̂ values 63, 81, and 100; reward 0 is received.]
- Q̂(s1, a_right) ← r(s1, a_right) + γ max_{a'} Q̂(s2, a')
  ← 0 + 0.9 × max{63, 81, 100}
  ← 90
- Note: if rewards are non-negative, then
  - For all s, a, n: Q̂_n(s, a) ≤ Q̂_{n+1}(s, a)
  - For all s, a, n: 0 ≤ Q̂_n(s, a) ≤ Q(s, a)
15-21. Q-Learning Iterations (Episodic)
- Start at the upper left and move clockwise; the table is initially all 0; γ = 0.8.
- Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
- Building up the table of Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) episode by episode:
  - Episode 1: Q̂(s4,W) ← r + γ max_{a'} Q̂(s5, a') = 10 + 0.8 × 0 = 10; all other entries remain 0.
  - Episode 2: Q̂(s3,S) ← r + γ max{Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8 × max{10, 0} = 8; Q̂(s4,W) stays 10.
  - Episode 3: Q̂(s2,E) ← r + γ max{Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4; Q̂(s3,S) stays 8, Q̂(s4,W) stays 10.
  - Q̂(s1,E) is still 0 after three episodes.
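The episode-by-episode numbers above can be reproduced with the same backup rule. The grid structure used below (a clockwise path s1 →E s2 →E s3 →S s4 →W s5 with reward 10 only on the final move, s5 absorbing, γ = 0.8) is my reading of the table, not something spelled out in the slides.

```python
gamma = 0.8
Q = {}   # (state, action) -> Q-hat; missing entries are treated as 0

# One clockwise episode of (s, a, r, s') transitions; s5 is absorbing.
episode = [("s1", "E", 0, "s2"), ("s2", "E", 0, "s3"),
           ("s3", "S", 0, "s4"), ("s4", "W", 10, "s5")]
actions = {"s1": ["E"], "s2": ["E"], "s3": ["S", "W"],
           "s4": ["W", "N"], "s5": ["loop"]}

def backup(s, a, r, s_next):
    # Q(s, a) <- r + gamma * max_a' Q(s', a')
    Q[(s, a)] = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions[s_next])

for _ in range(3):                      # three episodes, as in the table
    for s, a, r, s_next in episode:
        backup(s, a, r, s_next)

# Expected to match the table: Q(s4,W)=10, Q(s3,S)=8, Q(s2,E)=6.4, Q(s1,E)=0
print(Q[("s4", "W")], Q[("s3", "S")], Q[("s2", "E")], Q[("s1", "E")])
```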
22. Example Summary: Value Iteration and Q-Learning
[Figure: the example grid world annotated with its r(s, a) values.]
23. Exploration vs. Exploitation
- How do you pick actions as you learn?
- Greedy action selection:
  - Always select the action that looks best:
  - π(s) = argmax_a Q̂(s, a)
- Probabilistic action selection:
  - Likelihood of a_i is proportional to the current Q̂ value:
  - P(a_i | s) = k^Q̂(s, a_i) / Σ_j k^Q̂(s, a_j)
24. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Q values
- Q learning
- Multi-step backups
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
25. TD(λ): Temporal Difference Learning (from lecture slides for Machine Learning, T. Mitchell, McGraw Hill, 1997)
- Q learning: reduce the discrepancy between successive Q estimates.
- One-step time difference:
  - Q^(1)(s_t, a_t) = r_t + γ max_a Q̂(s_{t+1}, a)
- Why not two steps?
  - Q^(2)(s_t, a_t) = r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
- Or n?
  - Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)
- Blend all of these:
  - Q^λ(s_t, a_t) = (1 - λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ...]
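For concreteness, the n-step estimate and the λ-weighted blend can be computed as below. This is a sketch over a finite, already-collected trajectory; the true Q^λ sums over infinitely many n-step estimates, so the blend here is a truncation.

```python
def n_step_return(rewards, q_tail, gamma, n):
    """Q^(n) estimate for one (s_t, a_t):
    r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * q_tail,
    where q_tail = max_a Q-hat(s_{t+n}, a) and rewards = [r_t, r_{t+1}, ...]."""
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards[:n]))
    return discounted + gamma ** n * q_tail

def lambda_blend(q_estimates, lam):
    """Blend Q^(1), Q^(2), ... with weights (1 - lam) * lam^(n-1).
    `q_estimates` is a finite list [Q^(1), Q^(2), ...], so this truncates
    the infinite sum in the definition of Q^lambda."""
    return (1 - lam) * sum(lam ** (n - 1) * q for n, q in enumerate(q_estimates, 1))
```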
26. Eligibility Traces
- Idea: perform backups on the N previous data points, as well as the most recent data point.
- Select data to back up based on frequency of visitation.
- Bias towards frequent data by a geometric decay γ^(i-j).
27. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Nondeterministic MDPs
- Value Iteration
- Q Learning
- Function Approximators
- Model-based Learning
- Summary
28. Nondeterministic MDPs
- State transitions become probabilistic: δ(s, a, s') gives the probability of moving to state s' when action a is taken in state s.
29. Nondeterministic Case
- How do we redefine cumulative reward to handle nondeterminism?
- Define V and Q based on expected values:
  - V^π(s_t) = E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
  - V^π(s_t) = E[Σ_i γ^i r_{t+i}]
  - Q(s_t, a_t) = E[r(s_t, a_t) + γ V*(δ(s_t, a_t))]
30. Value Iteration for Nondeterministic MDPs
- V_1(s) := 0 for all s
- t := 1
- loop
  - t := t + 1
  - loop for all s in S
    - loop for all a in A
      - Q_t(s, a) := r(s, a) + γ Σ_{s' in S} δ(s, a, s') V_{t-1}(s')
    - end loop
    - V_t(s) := max_a Q_t(s, a)
  - end loop
- until |V_t(s) - V_{t-1}(s)| < ε for all s in S
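A direct Python transcription of this loop, assuming the model is given as plain dictionaries (T for transition probabilities, R for expected rewards); these container choices are illustrative only.

```python
def value_iteration(S, A, T, R, gamma=0.9, eps=1e-4):
    """Value iteration for a nondeterministic MDP.

    T[(s, a, s2)] is the transition probability delta(s, a, s2) (missing keys
    are treated as 0); R[(s, a)] is the expected immediate reward.
    """
    V = {s: 0.0 for s in S}
    while True:
        Q = {(s, a): R[(s, a)] + gamma * sum(T.get((s, a, s2), 0.0) * V[s2]
                                             for s2 in S)
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:   # |V_t - V_{t-1}| < eps
            return V_new, Q
        V = V_new
```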
31. Q Learning for Nondeterministic MDPs
- Q(s, a) = r(s, a) + γ Σ_{s' in S} δ(s, a, s') max_{a'} Q(s', a')
- Alter the training rule for the nondeterministic case:
  - Q̂_n(s_t, a_t) ← (1 - α_n) Q̂_{n-1}(s_t, a_t) + α_n [r_{t+1} + γ max_{a'} Q̂_{n-1}(s_{t+1}, a')]
  - where α_n = 1 / (1 + visits_n(s, a))
- Convergence of Q̂ to Q can still be proven [Watkins and Dayan, 92].
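The altered rule only changes how each backup is mixed into the old estimate, so the deterministic sketch from slide 12 needs just a decaying learning rate. A hedged illustration, keeping visit counts in a dictionary:

```python
from collections import defaultdict

Q = defaultdict(float)       # Q-hat estimates, start at 0
visits = defaultdict(int)    # visit counts per (s, a)

def q_update(s, a, r, s_next, actions, gamma=0.9):
    """Nondeterministic Q-learning backup with decaying learning rate
    alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```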
32. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
33. Function Approximation
[Diagram: a function approximator takes state s and action a as inputs and outputs Q̂(s, a); it is trained from targets or errors.]
- Function approximators:
  - Backprop neural network
  - Radial basis function network
  - CMAC network
  - Nearest neighbor, memory-based
  - Decision tree
- The network-based approximators above are trained by gradient-descent methods.
34. Function Approximation Example: Adjusting Network Weights
[Diagram: the same function approximator as above, with weights w adjusted from targets or errors.]
- Function approximator: Q̂(s, a) = f(s, a, w)
- Update: gradient-descent Sarsa:
  - w ← w + α [r_{t+1} + γ Q̂(s_{t+1}, a_{t+1}) - Q̂(s_t, a_t)] ∇_w f(s_t, a_t, w)
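For the simplest case, a linear approximator Q̂(s, a) = w · f(s, a), the gradient ∇_w f(s, a, w) is just the feature vector, and the update becomes one line of numpy. The `features(s, a)` helper below is hypothetical; it stands in for whatever encoding of state and action is used.

```python
import numpy as np

def sarsa_update(w, features, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
    """One gradient-descent Sarsa step for a linear approximator
    Q-hat(s, a) = w . features(s, a), so grad_w f(s, a, w) = features(s, a).

    `features(s, a)` returning a numpy vector is an assumed helper.
    """
    x = features(s, a)
    td_error = r + gamma * np.dot(w, features(s_next, a_next)) - np.dot(w, x)
    return w + alpha * td_error * x   # w <- w + alpha * [target - estimate] * gradient
```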
35. Example: TD-Gammon [Tesauro, 1995]
- (Recap of slide 4: backgammon, roughly 10^20 board configurations, rewards +100 / -100 / 0, trained on 1.5 million games of self-play, roughly equal to the best human player.)
36. Example: TD-Gammon [Tesauro, 1995]
[Diagram: a neural network maps the raw board position (number of pieces at each position) through 0-160 hidden units, starting from random initial weights, to V̂(s).]
- V̂(s) = predicted probability of winning
- On a win the outcome is 1; on a loss the outcome is 0.
- TD error: V̂(s_{t+1}) - V̂(s_t)
37. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
38. Model-based Learning: the Certainty-Equivalence Method
- At every step:
  - Use the new experience to update the model parameters:
    - Transitions
    - Rewards
  - Solve the model for V and π:
    - Value iteration, or
    - Policy iteration.
  - Use the policy to choose the next action.
39. Learning the Model
- For each state-action pair ⟨s, a⟩ visited, accumulate:
  - Mean transition:
    T(s, a, s') = number-times-seen(s, a → s') / number-times-tried(s, a)
  - Mean reward R(s, a): the average of the rewards received when a was taken in s.
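A count-based sketch of these statistics in Python; the dictionary-based storage and the function names `record`, `T` and `R` are illustrative choices, not from the slides.

```python
from collections import defaultdict

seen = defaultdict(int)          # (s, a, s') -> number-times-seen(s, a -> s')
tried = defaultdict(int)         # (s, a)     -> number-times-tried(s, a)
reward_sum = defaultdict(float)  # (s, a)     -> running total of observed rewards

def record(s, a, r, s_next):
    """Accumulate statistics for one observed transition."""
    tried[(s, a)] += 1
    seen[(s, a, s_next)] += 1
    reward_sum[(s, a)] += r

def T(s, a, s_next):
    """Estimated transition probability."""
    return seen[(s, a, s_next)] / tried[(s, a)] if tried[(s, a)] else 0.0

def R(s, a):
    """Mean observed reward."""
    return reward_sum[(s, a)] / tried[(s, a)] if tried[(s, a)] else 0.0
```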
40. Comparison of Model-based and Model-free Methods
- Temporal differencing / Q learning: only does computation for the states the system is actually in.
  - Good real-time performance
  - Inefficient use of data
- Model-based methods: compute the best estimates for every state on every time step.
  - Efficient use of data
  - Terrible real-time performance
- What is a middle ground?
41. Dyna: A Middle Ground [Sutton, Intro to RL, 97]
- At each step, incrementally:
  - Update the model based on new data
  - Update the policy based on new data
  - Update the policy based on the updated model
- Performance until optimal, on Grid World:
  - Q-Learning: 531,000 steps, 531,000 backups
  - Dyna: 61,908 steps, 3,055,000 backups
42. Dyna Algorithm
- Given state s:
  - Choose action a using the estimated policy.
  - Observe the new state s' and reward r.
  - Update T and R of the model.
  - Update V at ⟨s, a⟩:
    - V(s) ← max_a [R(s, a) + γ Σ_{s'} T(s, a, s') V(s')]
  - Perform k additional updates:
    - Pick k random states s_1, s_2, ..., s_k
    - Update each V(s_j):
      - V(s_j) ← max_a [R(s_j, a) + γ Σ_{s'} T(s_j, a, s') V(s')]
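A sketch of one Dyna iteration under assumed interfaces: `policy(s)` picks an action, `env_step(s, a)` returns the next state and reward, and `model` exposes the count-based `record`, `T` and `R` from the previous sketch. None of these names come from the slides.

```python
import random

def dyna_step(s, V, model, states, actions, policy, env_step, gamma=0.9, k=20):
    """One Dyna iteration: act, update the model, then perform k extra backups
    on randomly chosen states using the learned model."""
    a = policy(s)                          # choose action from the estimated policy
    s_next, r = env_step(s, a)             # observe new state and reward
    model.record(s, a, r, s_next)          # update T and R statistics

    def backup(sj):
        V[sj] = max(model.R(sj, b) +
                    gamma * sum(model.T(sj, b, s2) * V[s2] for s2 in states)
                    for b in actions)

    backup(s)                              # update V at the state just visited
    for sj in random.sample(states, min(k, len(states))):
        backup(sj)                         # k additional simulated backups
    return s_next
```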
43. Markov Decision Processes and Reinforcement Learning
- Motivation
- Learning policies through reinforcement
- Nondeterministic MDPs
- Function Approximators
- Model-based Learning
- Summary
44. Ongoing Research
- Handling cases where the state is only partially observable
- Design of optimal exploration strategies
- Extending to continuous actions and states
- Learning and using a model δ: S × A → S
- Scaling up in the size of the state space:
  - Function approximators (neural net instead of table)
  - Generalization
  - Macros
  - Exploiting substructure
- Multiple learners: multi-agent reinforcement learning
45. Markov Decision Processes (MDPs)
- Model:
  - Finite set of states, S
  - Finite set of actions, A
  - Probabilistic state transitions, δ(s, a)
  - Reward for each state and action, R(s, a)
- Process:
  - Observe state s_t in S
  - Choose action a_t in A
  - Receive immediate reward r_t
  - State changes to s_{t+1}
[Diagram: the interaction unrolls as s_0 --a_0, r_0--> s_1 --a_1, r_1--> s_2 ...]
46. Crib Sheet: MDPs by Value Iteration
- Insight: optimal values can be calculated iteratively using dynamic programming.
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - V_{t+1}(s) ← max_a [r(s, a) + γ V_t(δ(s, a))]
  - Terminate when the values are close enough:
    - |V_{t+1}(s) - V_t(s)| < ε
- The agent selects the optimal action by one-step lookahead on V*:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
47. Crib Sheet: Q-Learning for Deterministic Worlds
- Let Q̂ denote the current approximation to Q.
- Initially:
  - For each s, a: initialize the table entry Q̂(s, a) ← 0
  - Observe the current state s
- Do forever:
  - Select an action a and execute it
  - Receive immediate reward r
  - Observe the new state s'
  - Update the table entry for Q̂(s, a) as follows:
    - Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
  - s ← s'