Title: Planning to Maximize Reward: Markov Decision Processes
1. Planning to Maximize Reward: Markov Decision Processes
Brian C. Williams, 16.412J/6.835J, September 10th, 2001
Slides adapted from Manuela Veloso, Reid Simmons, and Tom Mitchell, CMU
2. Reading for Those with Gaps to Fill
- Search and Constraint Satisfaction
  - Read AIMA Chapters 3 and 4
  - Needed for planning (next week) and diagnosis
- Probabilities
  - Read AIMA Chapter 14
  - Needed for diagnosis, HMMs, and Bayes nets
- AIMA = Artificial Intelligence: A Modern Approach, by Stuart Russell and Peter Norvig
3. Markov Decision Processes
- Motivation
- What are Markov Decision Processes?
- Computing Action Policies From a Model
- Summary
4. Reading and Assignments
- Markov Decision Processes
  - Read AIMA Chapter 17, Sections 1-4.
- Reinforcement Learning
  - Read AIMA Chapter 20
- Homework
  - Out Wednesday; involves coding MDPs and RL
- Lecture based on the development in Machine Learning by Tom Mitchell, Chapter 13: Reinforcement Learning
5. How Might a Mouse Search a Maze for Cheese?
[Figure: maze with cheese]
- State Space Search?
- As a Constraint Satisfaction Problem?
- Goal-directed Planning?
- As a Rule or Production System?
- What is missing?
6. [Images courtesy NASA Ames and NASA Lewis]
7. Ideas in This Lecture
- The objective is to accumulate rewards, rather than to reach goal states.
- Objectives are achieved along the way, rather than at the end.
- The task is to generate policies for how to act in all situations, rather than a plan for a single starting situation.
- Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
- Value functions are iteratively approximated.
8. Markov Decision Processes
- Motivation
- What are Markov Decision Processes (MDPs)?
  - Models
  - Lifetime Reward
  - Policies
- Computing Policies From a Model
- Summary
9. MDP Problem
[Figure: agent-environment loop, starting in state s0; the agent observes the state, chooses an action, and receives a reward from the environment]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
10. MDP Problem: Model
[Figure: agent-environment loop, with the environment model highlighted]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
11. Markov Decision Processes (MDPs)
- Model
  - Finite set of states, S
  - Finite set of actions, A
  - Probabilistic state transitions, δ(s, a)
  - Reward for each state and action, R(s, a)
- Process
  - Observe state st in S
  - Choose action at in A
  - Receive immediate reward rt
  - State changes to st+1
[Figure: example state-transition graph with states s0, s1, ... and goal G; only legal transitions are shown; reward on unlabeled transitions is 0]
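Below is a minimal Python sketch of how such a deterministic model might be represented. The grid, state names, and reward values are hypothetical stand-ins for the example pictured on the slide, not a transcription of it.

# Hypothetical deterministic MDP model: states, actions, transition function
# delta, and reward function r, all as plain dictionaries.
STATES = ["s0", "s1", "s2", "s3", "s4", "G"]
ACTIONS = ["up", "down", "left", "right"]

# (state, action) -> next state; pairs not listed are illegal transitions.
TRANSITIONS = {
    ("s0", "right"): "s1", ("s1", "right"): "s2", ("s2", "down"): "G",
    ("s0", "down"): "s3",  ("s3", "right"): "s4", ("s4", "right"): "G",
    ("s1", "down"): "s4",  ("s4", "up"): "s1",
    ("s1", "left"): "s0",  ("s2", "left"): "s1",
    ("s3", "up"): "s0",    ("s4", "left"): "s3",
}

# Rewards on labeled transitions; unlabeled transitions have reward 0.
REWARDS = {("s2", "down"): 100, ("s4", "right"): 100}

def delta(s, a):
    """Deterministic transition function delta(s, a); None if the action is illegal."""
    return TRANSITIONS.get((s, a))

def r(s, a):
    """Immediate reward r(s, a); 0 on unlabeled transitions."""
    return REWARDS.get((s, a), 0)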
12. MDP Environment Assumptions
- Markov assumption: the next state and reward are a function only of the current state and action
  - st+1 = δ(st, at)
  - rt = r(st, at)
- Uncertain and unknown environment
  - δ and r may be nondeterministic and unknown
Today: deterministic case only.
13. MDP Nondeterministic Example
[Figure: nondeterministic example with actions R = Register for the academic route and D = Depart to industry]
(We only consider the deterministic case today.)
14. MDP Problem: Model
[Figure: agent-environment loop, with the model highlighted]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
15. MDP Problem: Lifetime Reward
[Figure: agent-environment loop, with the lifetime reward highlighted]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
16. Lifetime Reward
- Finite horizon
  - Rewards accumulate for a fixed period.
  - 100K + 100K + 100K = 300K
- Infinite horizon
  - Assume rewards accumulate forever.
  - 100K + 100K + ... = infinity
- Discounting
  - Future rewards are not worth as much (a bird in the hand ...).
  - Introduce a discount factor γ: 100K + γ 100K + γ² 100K + ... converges.
  - Will make the math work.
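A quick numerical sketch of why discounting makes the infinite sum converge; the 100K-per-step reward and γ = 0.9 are illustrative values, not tied to a specific slide example.

# Truncated discounted sum versus the geometric-series closed form.
gamma = 0.9
reward_per_step = 100_000

def discounted_sum(n):
    # sum over i of gamma^i * reward, truncated after n steps
    return sum(gamma**i * reward_per_step for i in range(n))

print(round(discounted_sum(10)))     # ~651,322 after 10 steps
print(round(discounted_sum(1000)))   # ~1,000,000: the sum has converged
print(reward_per_step / (1 - gamma)) # closed form r / (1 - gamma) ~ 1,000,000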
17. MDP Problem
[Figure: agent-environment loop]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
18. MDP Problem: Policy
[Figure: agent-environment loop, with the policy highlighted]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
19. Assume a Deterministic World
- Policy π : S → A
  - Selects an action for each state.
- Optimal policy π* : S → A
  - Selects the action for each state that maximizes lifetime reward.
20. Policies (continued)
- There are many policies; not all are necessarily optimal.
- There may be several optimal policies.
21. Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
  - Value Functions
  - Mapping Value Functions to Policies
  - Computing Value Functions
- Summary
22. Value Function Vπ for a Given Policy π
- Vπ(st) is the accumulated lifetime reward resulting from starting in state st and repeatedly executing policy π:
  - Vπ(st) = rt + γ rt+1 + γ² rt+2 + ...
  - Vπ(st) = Σi γ^i rt+i, where rt, rt+1, rt+2, ... are generated by following π starting at st.
[Figure: example grid showing Vπ values (9, 9, 10, 10, 10, and 0 at the goal) for one policy, assuming γ = 0.9]
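A small sketch of computing Vπ in the deterministic case by simply following a policy and accumulating discounted rewards. It reuses the hypothetical STATES / delta / r model from the earlier sketch; the particular policy shown is likewise illustrative.

# V^pi by rollout in a deterministic MDP with an absorbing goal state.
GAMMA = 0.9
policy = {"s0": "right", "s1": "right", "s2": "down",
          "s3": "right", "s4": "right"}

def V_pi(policy, s, horizon=100):
    """Discounted lifetime reward from state s under the given policy."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if s == "G":                  # treat the goal as absorbing in this sketch
            break
        a = policy[s]
        total += discount * r(s, a)   # immediate reward r(s, a)
        s = delta(s, a)               # deterministic next state
        discount *= GAMMA             # discount future rewards by gamma
    return total

print({s: round(V_pi(policy, s), 1) for s in STATES})
# For this sketch grid: s2 and s4 -> 100, s1 and s3 -> 90, s0 -> 81, G -> 0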
23. An Optimal Policy π* Given Value Function V*
- Idea: given state s,
  - Examine the possible actions ai in state s.
  - Select the action ai with the greatest lifetime reward.
- The lifetime reward Q(s, ai) is
  - the immediate reward of taking the action, r(s, ai),
  - plus the lifetime reward starting in the target state, V*(δ(s, ai)),
  - discounted by γ.
- π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
- Requires:
  - Value function
  - Environment model:
    - δ : S × A → S
    - r : S × A → ℝ
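A sketch of this one-step lookahead, written against the hypothetical dictionary model (STATES, ACTIONS, delta, r) introduced earlier; V is assumed to be a dict from state to value.

# pi*(s) = argmax_a [r(s, a) + gamma * V(delta(s, a))]
def greedy_action(s, V, gamma=0.9):
    """Best action in state s given a value function V (dict: state -> value)."""
    best_a, best_q = None, float("-inf")
    for a in ACTIONS:
        s_next = delta(s, a)
        if s_next is None:                # illegal action in this state
            continue
        q = r(s, a) + gamma * V[s_next]   # Q(s, a) = r(s, a) + gamma * V(delta(s, a))
        if q > best_q:
            best_a, best_q = a, q
    return best_a

def extract_policy(V, gamma=0.9):
    """Greedy policy for every non-goal state."""
    return {s: greedy_action(s, V, gamma) for s in STATES if s != "G"}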
24. Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
[Figure: grid-world model with rewards of 100 on the transitions into the goal G, alongside its value function V*]
25. Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
[Figure: grid-world model and V* values (100, 90, 81, and 0 at G); candidate actions a and b from one state]
- a: 0 + 0.9 × 100 = 90
- b: 0 + 0.9 × 81 = 72.9
- Select a.
26. Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
[Figure: grid-world model and V* values; candidate actions a and b from the state next to the goal G, and the partial policy π* built so far]
- a: 100 + 0.9 × 0 = 100
- b: 0 + 0.9 × 90 = 81
- Select a.
27. Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
[Figure: grid-world model and V* values; candidate actions a, b, and c from the remaining state, and the completed policy π*]
28. Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
  - Value Functions
  - Mapping Value Functions to Policies
  - Computing Value Functions
- Summary
29. Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*1(s) = max_ai r(s, ai)
30. Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*1(s) = max_ai r(s, ai)
- Optimal value function for a two-step horizon:
  - V*2(s) = max_ai [r(s, ai) + γ V*1(δ(s, ai))]
[Figure: from state s, action A leads to state SA with reward RA and one-step value V*1(SA); action B leads to state SB with one-step value V*1(SB); future values are discounted by γ]
- An instance of the dynamic programming principle:
  - Reuse shared sub-results
  - Exponential savings
31. Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*1(s) = max_ai r(s, ai)
- Optimal value function for a two-step horizon:
  - V*2(s) = max_ai [r(s, ai) + γ V*1(δ(s, ai))]
- Optimal value function for an n-step horizon:
  - V*n(s) = max_ai [r(s, ai) + γ V*n-1(δ(s, ai))]
32. Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*1(s) = max_ai r(s, ai)
- Optimal value function for a two-step horizon:
  - V*2(s) = max_ai [r(s, ai) + γ V*1(δ(s, ai))]
- Optimal value function for an n-step horizon:
  - V*n(s) = max_ai [r(s, ai) + γ V*n-1(δ(s, ai))]
- Optimal value function for an infinite horizon:
  - V*(s) = max_ai [r(s, ai) + γ V*(δ(s, ai))]
33. Solving MDPs by Value Iteration
- Insight: optimal values can be calculated iteratively using dynamic programming.
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
  - Terminate when the values are close enough:
    - |Vt+1(s) - Vt(s)| < ε
- Agent selects the optimal action by a one-step lookahead on V:
  - π*(s) = argmax_a [r(s, a) + γ V(δ(s, a))]
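A compact value-iteration sketch for the deterministic case, following the update above. It uses the hypothetical dictionary model and the extract_policy helper from the earlier sketches, so the specific grid and values are illustrative.

# Synchronous value iteration: V_{t+1}(s) <- max_a [r(s,a) + gamma * V_t(delta(s,a))]
def value_iteration(gamma=0.9, eps=1e-3):
    V = {s: 0.0 for s in STATES}                  # V_0 = 0 everywhere
    while True:
        V_next, max_change = {}, 0.0
        for s in STATES:
            if s == "G":                          # goal is absorbing with value 0
                V_next[s] = 0.0
                continue
            candidates = [r(s, a) + gamma * V[delta(s, a)]
                          for a in ACTIONS if delta(s, a) is not None]
            V_next[s] = max(candidates)
            max_change = max(max_change, abs(V_next[s] - V[s]))
        V = V_next
        if max_change < eps:                      # |V_{t+1}(s) - V_t(s)| < eps for all s
            return V

V_star = value_iteration()            # converges to values like 100, 90, 81 on this sketch grid
policy_star = extract_policy(V_star)  # one-step lookahead, as on this slide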
34. Convergence of Value Iteration
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
  - Terminate when the values are close enough:
    - |Vt+1(s) - Vt(s)| < ε
- Then:
  - max over s in S: |Vt+1(s) - V*(s)| < 2εγ / (1 - γ)
- Note: convergence is guaranteed even if the updates are performed in any order, as long as each state is updated infinitely often.
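For concreteness, a quick calculation of the guaranteed error bound at the termination threshold; γ = 0.9 matches the slides, while ε = 0.001 is an illustrative choice.

# Bound from the slide: max_s |V_{t+1}(s) - V*(s)| < 2 * eps * gamma / (1 - gamma)
gamma, eps = 0.9, 0.001
bound = 2 * eps * gamma / (1 - gamma)
print(bound)  # 0.018: stopping at eps = 0.001 guarantees values within ~0.02 of V*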
35. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: model grid with rewards of 100 into the goal G; Vt and Vt+1 grids, initially all zeros; candidate actions a and b from one state]
- a: 0 + 0.9 × 0 = 0
- b: 0 + 0.9 × 0 = 0
- Max = 0
36. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids; candidate actions a, b, and c from the state next to the goal G]
- a: 100 + 0.9 × 0 = 100
- b: 0 + 0.9 × 0 = 0
- c: 0 + 0.9 × 0 = 0
- Max = 100
37. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids after updating another state]
38. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids after updating another state]
39. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids after updating another state]
40. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids; the value 100 has been entered for the state next to the goal G]
41. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids; values of 100 and 90 have propagated back from the goal G]
42. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids; values of 100, 90, and 81 have propagated back from the goal G]
43. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids; the values (100, 90, 81, and 0 at G) have converged]
44. Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
- Summary
45. Markov Decision Processes (MDPs)
- Model
  - Finite set of states, S
  - Finite set of actions, A
  - Probabilistic state transitions, δ(s, a)
  - Reward for each state and action, R(s, a)
[Figure: deterministic example state-transition graph]
46. Crib Sheet: MDPs by Value Iteration
- Insight: optimal values can be calculated iteratively using dynamic programming.
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
  - Terminate when the values are close enough:
    - |Vt+1(s) - Vt(s)| < ε
- Agent selects the optimal action by a one-step lookahead on V:
  - π*(s) = argmax_a [r(s, a) + γ V(δ(s, a))]
47. How Might a Mouse Search a Maze for Cheese?
[Figure: maze with cheese]
- By Value Iteration?
- What is missing?