Title: Planning to Maximize Reward: Markov Decision Processes
1. Planning to Maximize Reward: Markov Decision Processes
Brian C. Williams, 16.412J/6.835J, September 10th, 2001
Slides adapted from Manuela Veloso, Reid Simmons, and Tom Mitchell, CMU
2. Reading for Those with Gaps to Fill
- Search and Constraint Satisfaction
  - Read AIMA Chapters 3 and 4
  - Needed for planning (next week) and diagnosis
- Probabilities
  - Read AIMA Chapter 14
  - Needed for diagnosis, HMMs, and Bayes nets
- AIMA = Artificial Intelligence: A Modern Approach, by Stuart Russell and Peter Norvig
3. Markov Decision Processes
- Motivation
- What are Markov Decision Processes?
- Computing Action Policies From a Model
- Summary
4. Reading and Assignments
- Markov Decision Processes
  - Read AIMA Chapter 17, Sections 1-4.
- Reinforcement Learning
  - Read AIMA Chapter 20
- Homework
  - Out Wednesday; involves coding MDPs and RL
- Lecture based on the development in Machine Learning by Tom Mitchell, Chapter 13: Reinforcement Learning
5. How Might a Mouse Search a Maze for Cheese?
[Figure: maze with cheese]
- State Space Search?
- As a Constraint Satisfaction Problem?
- Goal-directed Planning?
- As a Rule or Production System?
- What is missing?
6. [Images courtesy NASA Ames and NASA Lewis]
7. Ideas in This Lecture
- The objective is to accumulate rewards, rather than to reach goal states.
- Objectives are achieved along the way, rather than at the end.
- The task is to generate policies for how to act in all situations, rather than a plan for a single starting situation.
- Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
- Value functions are iteratively approximated.
8. Markov Decision Processes
- Motivation
- What are Markov Decision Processes (MDPs)?
  - Models
  - Lifetime Reward
  - Policies
- Computing Policies From a Model
- Summary
9. MDP Problem
[Figure: agent-environment loop, starting in state s0; the agent observes the state, chooses an action, and receives a reward from the environment]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
10. MDP Problem: Model
[Figure: agent-environment loop, with the environment model highlighted]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
11. Markov Decision Processes (MDPs)
- Model
  - Finite set of states, S
  - Finite set of actions, A
  - Probabilistic state transitions, δ(s, a)
  - Reward for each state and action, R(s, a)
- Process
  - Observe state st in S
  - Choose action at in A
  - Receive immediate reward rt
  - State changes to st+1
[Figure: example state-transition graph with states s0, s1, ... and goal G; only legal transitions are shown; reward on unlabeled transitions is 0]
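Below is a minimal Python sketch of how such a deterministic model might be represented. The grid, state names, and reward values are hypothetical stand-ins for the example pictured on the slide, not a transcription of it.

# Hypothetical deterministic MDP model: states, actions, transition function
# delta, and reward function r, all as plain dictionaries.
STATES = ["s0", "s1", "s2", "s3", "s4", "G"]
ACTIONS = ["up", "down", "left", "right"]

# (state, action) -> next state; pairs not listed are illegal transitions.
TRANSITIONS = {
    ("s0", "right"): "s1", ("s1", "right"): "s2", ("s2", "down"): "G",
    ("s0", "down"): "s3",  ("s3", "right"): "s4", ("s4", "right"): "G",
    ("s1", "down"): "s4",  ("s4", "up"): "s1",
    ("s1", "left"): "s0",  ("s2", "left"): "s1",
    ("s3", "up"): "s0",    ("s4", "left"): "s3",
}

# Rewards on labeled transitions; unlabeled transitions have reward 0.
REWARDS = {("s2", "down"): 100, ("s4", "right"): 100}

def delta(s, a):
    """Deterministic transition function delta(s, a); None if the action is illegal."""
    return TRANSITIONS.get((s, a))

def r(s, a):
    """Immediate reward r(s, a); 0 on unlabeled transitions."""
    return REWARDS.get((s, a), 0)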
12. MDP Environment Assumptions
- Markov assumption: the next state and reward are a function only of the current state and action
  - st+1 = δ(st, at)
  - rt = r(st, at)
- Uncertain and unknown environment
  - δ and r may be nondeterministic and unknown
Today: deterministic case only.
13. MDP Nondeterministic Example
[Figure: nondeterministic example with actions R = Register for the academic route and D = Depart to industry]
(We only consider the deterministic case today.)
14. MDP Problem: Model
[Figure: agent-environment loop, with the model highlighted]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
15. MDP Problem: Lifetime Reward
[Figure: agent-environment loop, with the lifetime reward highlighted]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
16. Lifetime Reward
- Finite horizon
  - Rewards accumulate for a fixed period.
  - 100K + 100K + 100K = 300K
- Infinite horizon
  - Assume rewards accumulate forever.
  - 100K + 100K + ... = infinity
- Discounting
  - Future rewards are not worth as much (a bird in the hand ...).
  - Introduce a discount factor γ: 100K + γ 100K + γ² 100K + ... converges.
  - Will make the math work.
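A quick numerical sketch of why discounting makes the infinite sum converge; the 100K-per-step reward and γ = 0.9 are illustrative values, not tied to a specific slide example.

# Truncated discounted sum versus the geometric-series closed form.
gamma = 0.9
reward_per_step = 100_000

def discounted_sum(n):
    # sum over i of gamma^i * reward, truncated after n steps
    return sum(gamma**i * reward_per_step for i in range(n))

print(round(discounted_sum(10)))     # ~651,322 after 10 steps
print(round(discounted_sum(1000)))   # ~1,000,000: the sum has converged
print(reward_per_step / (1 - gamma)) # closed form r / (1 - gamma) ~ 1,000,000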
17. MDP Problem
[Figure: agent-environment loop]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
18. MDP Problem: Policy
[Figure: agent-environment loop, with the policy highlighted]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r0 + γ r1 + γ² r2 + ...
19. Assume a Deterministic World
- Policy π : S → A
  - Selects an action for each state.
- Optimal policy π* : S → A
  - Selects the action for each state that maximizes lifetime reward.
20. Policies (continued)
- There are many policies; not all are necessarily optimal.
- There may be several optimal policies.
21. Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
  - Value Functions
  - Mapping Value Functions to Policies
  - Computing Value Functions
- Summary
22. Value Function Vπ for a Given Policy π
- Vπ(st) is the accumulated lifetime reward resulting from starting in state st and repeatedly executing policy π:
  - Vπ(st) = rt + γ rt+1 + γ² rt+2 + ...
  - Vπ(st) = Σi γ^i rt+i, where rt, rt+1, rt+2, ... are generated by following π starting at st.
[Figure: example grid showing Vπ values (9, 9, 10, 10, 10, and 0 at the goal) for one policy, assuming γ = 0.9]
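A small sketch of computing Vπ in the deterministic case by simply following a policy and accumulating discounted rewards. It reuses the hypothetical STATES / delta / r model from the earlier sketch; the particular policy shown is likewise illustrative.

# V^pi by rollout in a deterministic MDP with an absorbing goal state.
GAMMA = 0.9
policy = {"s0": "right", "s1": "right", "s2": "down",
          "s3": "right", "s4": "right"}

def V_pi(policy, s, horizon=100):
    """Discounted lifetime reward from state s under the given policy."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if s == "G":                  # treat the goal as absorbing in this sketch
            break
        a = policy[s]
        total += discount * r(s, a)   # immediate reward r(s, a)
        s = delta(s, a)               # deterministic next state
        discount *= GAMMA             # discount future rewards by gamma
    return total

print({s: round(V_pi(policy, s), 1) for s in STATES})
# For this sketch grid: s2 and s4 -> 100, s1 and s3 -> 90, s0 -> 81, G -> 0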
23. An Optimal Policy π* Given Value Function V*
- Idea: given state s,
  - Examine the possible actions ai in state s.
  - Select the action ai with the greatest lifetime reward.
- The lifetime reward Q(s, ai) is
  - the immediate reward of taking the action, r(s, ai),
  - plus the lifetime reward starting in the target state, V*(δ(s, ai)),
  - discounted by γ.
- π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
- Requires:
  - Value function
  - Environment model:
    - δ : S × A → S
    - r : S × A → ℝ
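A sketch of this one-step lookahead, written against the hypothetical dictionary model (STATES, ACTIONS, delta, r) introduced earlier; V is assumed to be a dict from state to value.

# pi*(s) = argmax_a [r(s, a) + gamma * V(delta(s, a))]
def greedy_action(s, V, gamma=0.9):
    """Best action in state s given a value function V (dict: state -> value)."""
    best_a, best_q = None, float("-inf")
    for a in ACTIONS:
        s_next = delta(s, a)
        if s_next is None:                # illegal action in this state
            continue
        q = r(s, a) + gamma * V[s_next]   # Q(s, a) = r(s, a) + gamma * V(delta(s, a))
        if q > best_q:
            best_a, best_q = a, q
    return best_a

def extract_policy(V, gamma=0.9):
    """Greedy policy for every non-goal state."""
    return {s: greedy_action(s, V, gamma) for s in STATES if s != "G"}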
24. Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
[Figure: grid-world model with rewards of 100 on the transitions into the goal G, alongside its value function V*]
25. Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
[Figure: grid-world model and V* values (100, 90, 81, and 0 at G); candidate actions a and b from one state]
- a: 0 + 0.9 × 100 = 90
- b: 0 + 0.9 × 81 = 72.9
- Select a.
26. Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
[Figure: grid-world model and V* values; candidate actions a and b from the state next to the goal G, and the partial policy π* built so far]
- a: 100 + 0.9 × 0 = 100
- b: 0 + 0.9 × 90 = 81
- Select a.
27. Example: Mapping Value Function to Policy
- Agent selects the optimal action from V*:
  - π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
[Figure: grid-world model and V* values; candidate actions a, b, and c from the remaining state, and the completed policy π*]
28. Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
  - Value Functions
  - Mapping Value Functions to Policies
  - Computing Value Functions
- Summary
29. Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*1(s) = max_ai r(s, ai)
30. Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*1(s) = max_ai r(s, ai)
- Optimal value function for a two-step horizon:
  - V*2(s) = max_ai [r(s, ai) + γ V*1(δ(s, ai))]
[Figure: from state s, action A leads to state SA with reward RA and one-step value V*1(SA); action B leads to state SB with one-step value V*1(SB); future values are discounted by γ]
- An instance of the dynamic programming principle:
  - Reuse shared sub-results
  - Exponential savings
31. Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*1(s) = max_ai r(s, ai)
- Optimal value function for a two-step horizon:
  - V*2(s) = max_ai [r(s, ai) + γ V*1(δ(s, ai))]
- Optimal value function for an n-step horizon:
  - V*n(s) = max_ai [r(s, ai) + γ V*n-1(δ(s, ai))]
32. Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
  - V*1(s) = max_ai r(s, ai)
- Optimal value function for a two-step horizon:
  - V*2(s) = max_ai [r(s, ai) + γ V*1(δ(s, ai))]
- Optimal value function for an n-step horizon:
  - V*n(s) = max_ai [r(s, ai) + γ V*n-1(δ(s, ai))]
- Optimal value function for an infinite horizon:
  - V*(s) = max_ai [r(s, ai) + γ V*(δ(s, ai))]
33. Solving MDPs by Value Iteration
- Insight: optimal values can be calculated iteratively using dynamic programming.
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
  - Terminate when the values are close enough:
    - |Vt+1(s) - Vt(s)| < ε
- Agent selects the optimal action by a one-step lookahead on V:
  - π*(s) = argmax_a [r(s, a) + γ V(δ(s, a))]
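A compact value-iteration sketch for the deterministic case, following the update above. It uses the hypothetical dictionary model and the extract_policy helper from the earlier sketches, so the specific grid and values are illustrative.

# Synchronous value iteration: V_{t+1}(s) <- max_a [r(s,a) + gamma * V_t(delta(s,a))]
def value_iteration(gamma=0.9, eps=1e-3):
    V = {s: 0.0 for s in STATES}                  # V_0 = 0 everywhere
    while True:
        V_next, max_change = {}, 0.0
        for s in STATES:
            if s == "G":                          # goal is absorbing with value 0
                V_next[s] = 0.0
                continue
            candidates = [r(s, a) + gamma * V[delta(s, a)]
                          for a in ACTIONS if delta(s, a) is not None]
            V_next[s] = max(candidates)
            max_change = max(max_change, abs(V_next[s] - V[s]))
        V = V_next
        if max_change < eps:                      # |V_{t+1}(s) - V_t(s)| < eps for all s
            return V

V_star = value_iteration()            # converges to values like 100, 90, 81 on this sketch grid
policy_star = extract_policy(V_star)  # one-step lookahead, as on this slide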
34. Convergence of Value Iteration
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
  - Terminate when the values are close enough:
    - |Vt+1(s) - Vt(s)| < ε
- Then:
  - max over s in S: |Vt+1(s) - V*(s)| < 2εγ / (1 - γ)
- Note: convergence is guaranteed even if the updates are performed in any order, as long as each state is updated infinitely often.
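For concreteness, a quick calculation of the guaranteed error bound at the termination threshold; γ = 0.9 matches the slides, while ε = 0.001 is an illustrative choice.

# Bound from the slide: max_s |V_{t+1}(s) - V*(s)| < 2 * eps * gamma / (1 - gamma)
gamma, eps = 0.9, 0.001
bound = 2 * eps * gamma / (1 - gamma)
print(bound)  # 0.018: stopping at eps = 0.001 guarantees values within ~0.02 of V*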
35. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: model grid with rewards of 100 into the goal G; Vt and Vt+1 grids, initially all zeros; candidate actions a and b from one state]
- a: 0 + 0.9 × 0 = 0
- b: 0 + 0.9 × 0 = 0
- Max = 0
36. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids; candidate actions a, b, and c from the state next to the goal G]
- a: 100 + 0.9 × 0 = 100
- b: 0 + 0.9 × 0 = 0
- c: 0 + 0.9 × 0 = 0
- Max = 100
37. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids after updating another state]
38. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids after updating another state]
39. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids after updating another state]
40. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids; the value 100 has been entered for the state next to the goal G]
41. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids; values of 100 and 90 have propagated back from the goal G]
42. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids; values of 100, 90, and 81 have propagated back from the goal G]
43. Example of Value Iteration
- Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
[Figure: Vt and Vt+1 grids; the values (100, 90, 81, and 0 at G) have converged]
44. Markov Decision Processes
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
- Summary
45. Markov Decision Processes (MDPs)
- Model
  - Finite set of states, S
  - Finite set of actions, A
  - Probabilistic state transitions, δ(s, a)
  - Reward for each state and action, R(s, a)
[Figure: deterministic example state-transition graph]
46. Crib Sheet: MDPs by Value Iteration
- Insight: optimal values can be calculated iteratively using dynamic programming.
- Algorithm:
  - Iteratively calculate values using Bellman's equation:
    - Vt+1(s) ← max_a [r(s, a) + γ Vt(δ(s, a))]
  - Terminate when the values are close enough:
    - |Vt+1(s) - Vt(s)| < ε
- Agent selects the optimal action by a one-step lookahead on V:
  - π*(s) = argmax_a [r(s, a) + γ V(δ(s, a))]
47. How Might a Mouse Search a Maze for Cheese?
[Figure: maze with cheese]
- By Value Iteration?
- What is missing?