Planning to Maximize Reward: Markov Decision Processes

Transcript and Presenter's Notes

1
Planning to Maximize Reward: Markov Decision
Processes
Brian C. Williams
16.412J/6.835J, September 10th, 2001
Slides adapted from Manuela Veloso, Reid Simmons,
and Tom Mitchell (CMU)
2
Reading for Those with Gaps to Fill
  • Search and Constraint Satisfaction
  • Read AIMA Chapters 3 and 4
  • Needed for planning (next week) and diagnosis
  • Probabilities
  • Read AIMA Chapter 14
  • Needed for diagnosis, HMMs and Bayes nets
  • AIMA = Artificial Intelligence: A Modern
    Approach by Stuart Russell and
    Peter Norvig

3
Markov Decision Processes
  • Motivation
  • What are Markov Decision Processes?
  • Computing Action Policies From a Model
  • Summary

4
Reading and Assignments
  • Markov Decision Processes
  • Read AIMA Chapter 17, Sections 1-4.
  • Reinforcement Learning
  • Read AIMA Chapter 20
  • Homework
  • Out Wednesday, involves coding MDPs and RL
  • Lecture based on the development in Machine
    Learning by Tom Mitchell, Chapter 13:
    Reinforcement Learning

5
How Might a Mouse Search a Maze for Cheese?
[Maze diagram with cheese as the goal.]
  • State Space Search?
  • As a Constraint Satisfaction Problem?
  • Goal-directed Planning?
  • As a Rule-Based or Production System?
  • What is missing?

6
[Images courtesy of NASA Ames and NASA Lewis.]
7
Ideas in this lecture
  • The objective is to accumulate rewards, rather
    than to reach goal states.
  • Objectives are achieved along the way, rather
    than at the end.
  • Task is to generate policies for how to act in
    all situations, rather than a plan for a single
    starting situation.
  • Policies fall out of value functions, which
    describe the greatest lifetime reward achievable
    at every state.
  • Value functions are iteratively approximated.

8
Markov Decision Processes
  • Motivation
  • What are Markov Decision Processes (MDPs)?
  • Models
  • Lifetime Reward
  • Policies
  • Computing Policies From a Model
  • Summary

9
MDP Problem
[Diagram: the agent observes the state and reward from the
environment and chooses an action; initial state s0.]
Given an environment model as an MDP, create a
policy for acting that maximizes lifetime
reward V = r0 + γ r1 + γ² r2 + . . .
10
MDP Problem: Model
[Agent-environment diagram as before.]
Given an environment model as an MDP, create a
policy for acting that maximizes lifetime
reward V = r0 + γ r1 + γ² r2 + . . .
11
Markov Decision Processes (MDPs)
  • Model
  • Finite set of states, S
  • Finite set of actions, A
  • Probabilistic state transitions, δ(s,a)
  • Reward for each state and action, R(s,a)
  • Process
  • Observe state st in S
  • Choose action at in A
  • Receive immediate reward rt
  • State changes to st+1

[State-transition diagram with initial state s0, successor state s1,
reward r0, and goal state G. Legal transitions shown; reward on
unlabeled transitions is 0.]
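To make the model concrete, here is a minimal sketch of how a small deterministic MDP like this could be represented in Python. The state and action names, the dictionary encoding, and the helper reward are illustrative assumptions, not part of the original slides.

# A minimal deterministic MDP: a finite state set, a finite action set,
# a transition function delta(s, a), and a reward function r(s, a).
# The tiny two-state chain ending in a goal G is an illustrative example.

States = ["s0", "s1", "G"]        # finite set of states S
Actions = ["go", "stay"]          # finite set of actions A

# delta : S x A -> S, encoded as a dictionary of the legal transitions
delta = {
    ("s0", "go"): "s1",
    ("s0", "stay"): "s0",
    ("s1", "go"): "G",
    ("s1", "stay"): "s1",
}

# r : S x A -> reals; reward 100 for the action that enters the goal G,
# 0 on all other transitions
def reward(s, a):
    return 100 if delta.get((s, a)) == "G" else 0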
12
MDP Environment Assumptions
  • Markov Assumption: the next state and reward are
    a function only of the current state and action
  • st+1 = δ(st, at)
  • rt = r(st, at)
  • Uncertain and Unknown Environment
  • δ and r may be nondeterministic and unknown

Today: Deterministic Case Only
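The observe / act / reward loop above can be written down directly. A short sketch, assuming the dictionary-based delta and reward from the previous sketch (the function and parameter names are hypothetical):

# Simulate the agent-environment loop for T steps under a fixed policy,
# accumulating the (undiscounted) reward received along the way.
def rollout(policy, s0, delta, reward, T=10):
    s, total = s0, 0
    for t in range(T):
        a = policy(s)                  # choose action a_t in A
        if (s, a) not in delta:        # no legal transition (e.g., absorbing goal)
            break
        total += reward(s, a)          # receive immediate reward r_t = r(s_t, a_t)
        s = delta[(s, a)]              # state changes: s_(t+1) = delta(s_t, a_t)
    return total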
13
MDP Nondeterministic Example
We only consider the deterministic case in this lecture.
[Transition diagram for the nondeterministic example:
R = Register for academic route, D = Depart to industry.]
14
MDP Problem: Model
[Agent-environment diagram as before.]
Given an environment model as an MDP, create a
policy for acting that maximizes lifetime
reward V = r0 + γ r1 + γ² r2 + . . .
15
MDP Problem: Lifetime Reward
[Agent-environment diagram as before.]
Given an environment model as an MDP, create a
policy for acting that maximizes lifetime
reward V = r0 + γ r1 + γ² r2 + . . .
16
Lifetime Reward
  • Finite horizon
  • Rewards accumulate for a fixed period.
  • $100K + $100K + $100K = $300K
  • Infinite horizon
  • Assume rewards accumulate forever
  • $100K + $100K + . . . = infinity
  • Discounting
  • Future rewards are not worth as much
    (a bird in the hand ...)
  • Introduce discount factor γ: $100K + γ·$100K +
    γ²·$100K + . . . converges (sketched below)
  • Will make the math work
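A quick numerical illustration of why discounting makes the infinite sum converge; the $100K-per-step reward comes from the slide, and γ = 0.9 is the value used later in the lecture:

# Discounted lifetime reward: sum over t of gamma^t * r_t. With a
# constant reward of 100 (i.e., $100K) per step and gamma = 0.9, the
# partial sums approach the closed form r / (1 - gamma) = 1000.
gamma, r = 0.9, 100
partial = sum(gamma ** t * r for t in range(1000))
print(partial)            # approximately 1000.0
print(r / (1 - gamma))    # closed-form limit: 1000.0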

17
MDP Problem
[Agent-environment diagram as before.]
Given an environment model as an MDP, create a
policy for acting that maximizes lifetime
reward V = r0 + γ r1 + γ² r2 + . . .
18
MDP Problem: Policy
[Agent-environment diagram as before.]
Given an environment model as an MDP, create a
policy for acting that maximizes lifetime
reward V = r0 + γ r1 + γ² r2 + . . .
19
Assume a deterministic world
  • Policy π : S → A
  • Selects an action for each state.
  • Optimal policy π* : S → A
  • Selects the action for each state that maximizes
    lifetime reward.

20
  • There are many policies, and not all of them are
    necessarily optimal.
  • There may be several optimal policies.

21
Markov Decision Processes
  • Motivation
  • Markov Decision Processes
  • Computing Policies From a Model
  • Value Functions
  • Mapping Value Functions to Policies
  • Computing Value Functions
  • Summary

22
Value Function Vπ for a Given Policy π
  • Vπ(st) is the accumulated lifetime reward
    resulting from starting in state st and
    repeatedly executing policy π
  • Vπ(st) = rt + γ rt+1 + γ² rt+2 + . . .
  • Vπ(st) = Σi γ^i rt+i, where rt, rt+1,
    rt+2 . . . are generated by following π starting
    at st.

[Grid of example Vπ values, assuming γ = 0.9: 9, 9, 10, 10, 10,
and 0 at the goal.]
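In the deterministic case, Vπ(s) can be computed by simply following π from s and summing discounted rewards. A minimal sketch, reusing the dictionary-based model assumed earlier and truncating the infinite sum at a finite horizon H:

# Evaluate V^pi(s) = sum_i gamma^i * r_(t+i) by following the policy.
# With gamma < 1, the tail beyond a large horizon H is negligible.
def v_pi(s, policy, delta, reward, gamma=0.9, H=100):
    total = 0.0
    for i in range(H):
        a = policy(s)
        if (s, a) not in delta:        # absorbing state: no further reward
            break
        total += (gamma ** i) * reward(s, a)
        s = delta[(s, a)]
    return total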
23
An Optimal Policy π* Given Value Function V*
  • Idea: given state s
  • Examine the possible actions ai in state s.
  • Select the action ai with the greatest lifetime reward.
  • The lifetime reward Q(s, ai) is
  • the immediate reward of taking the action, r(s, ai),
  • plus the lifetime reward starting in the target state,
    V*(δ(s, ai)),
  • discounted by γ.
  • π*(s) = argmaxa [ r(s,a) + γ V*(δ(s,a)) ]
  • Requires
  • Value function
  • Environment model
  • δ : S × A → S
  • r : S × A → ℝ
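A minimal sketch of this one-step lookahead, under the same assumed dictionary-based model; V is a dictionary mapping each state to its value:

# Greedy action choice: pi*(s) = argmax_a [ r(s,a) + gamma * V(delta(s,a)) ].
# Returns None if no action is legal in s (e.g., an absorbing goal).
def greedy_action(s, V, Actions, delta, reward, gamma=0.9):
    best_a, best_q = None, float("-inf")
    for a in Actions:
        if (s, a) not in delta:        # skip actions that are illegal in s
            continue
        q = reward(s, a) + gamma * V[delta[(s, a)]]   # Q(s, a)
        if q > best_q:
            best_a, best_q = a, q
    return best_a

The following slides step through exactly this computation on an example grid.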

24
Example: Mapping a Value Function to a Policy
  • Agent selects the optimal action from V*
  • π*(s) = argmaxa [ r(s,a) + γ V*(δ(s,a)) ]
  • γ = 0.9

[Side-by-side grids of the model (rewards of 100 for entering the
goal G) and the value function V*.]
25
Example: Mapping a Value Function to a Policy
  • Agent selects the optimal action from V*
  • π*(s) = argmaxa [ r(s,a) + γ V*(δ(s,a)) ]
  • γ = 0.9

[Model and V* grids (V* values 90, 100, 0 at G, 81, 90, 100). From
the current state, action a leads to a state with V* = 100 and
action b to a state with V* = 81.]
  • a: 0 + 0.9 × 100 = 90
  • b: 0 + 0.9 × 81 = 72.9
  • select a
26
Example: Mapping a Value Function to a Policy
  • Agent selects the optimal action from V*
  • π*(s) = argmaxa [ r(s,a) + γ V*(δ(s,a)) ]
  • γ = 0.9

[Model and V* grids. From the current state, action a enters the
goal G (reward 100) and action b leads to a state with V* = 90.]
  • a: 100 + 0.9 × 0 = 100
  • b: 0 + 0.9 × 90 = 81
  • select a

[Partial policy π* diagram with the goal G.]
27
Example: Mapping a Value Function to a Policy
  • Agent selects the optimal action from V*
  • π*(s) = argmaxa [ r(s,a) + γ V*(δ(s,a)) ]
  • γ = 0.9

[Model and V* grids, with three candidate actions a, b, and c from
the current state.]
  • a = ?
  • b = ?
  • c = ?
  • select ?

[Partial policy π* diagram with the goal G.]
28
Markov Decision Processes
  • Motivation
  • Markov Decision Processes
  • Computing Policies From a Model
  • Value Functions
  • Mapping Value Functions to Policies
  • Computing Value Functions
  • Summary

29
Value Function V* for an Optimal Policy π*:
Example
  • Optimal value function for a one-step horizon:
  • V1*(s) = maxai r(s, ai)
30
Value Function V* for an Optimal Policy π*:
Example
  • Optimal value function for a one-step horizon:
  • V1*(s) = maxai r(s, ai)
  • Optimal value function for a two-step horizon:
  • V2*(s) = maxai [ r(s, ai) + γ V1*(δ(s, ai)) ]

[One-step lookahead tree: from state s, action A reaches state SA
with reward RA and value V1*(SA); action B reaches state SB with
value V1*(SB); further actions elided.]
  • Instance of the Dynamic Programming Principle
  • Reuse shared sub-results
  • Exponential saving
31
Value Function V* for an Optimal Policy π*:
Example
  • Optimal value function for a one-step horizon:
  • V1*(s) = maxai r(s, ai)
  • Optimal value function for a two-step horizon:
  • V2*(s) = maxai [ r(s, ai) + γ V1*(δ(s, ai)) ]
  • Optimal value function for an n-step horizon:
  • Vn*(s) = maxai [ r(s, ai) + γ Vn-1*(δ(s, ai)) ]
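The n-step recursion can be computed bottom-up, building each Vn from the stored Vn-1 (the reuse of shared sub-results mentioned on the previous slide). A brief sketch under the same assumed dictionary-based model:

# Finite-horizon dynamic programming:
#   V_1(s) = max_a r(s, a)
#   V_n(s) = max_a [ r(s, a) + gamma * V_(n-1)(delta(s, a)) ]
def finite_horizon_values(n, States, Actions, delta, reward, gamma=0.9):
    V = {s: 0.0 for s in States}                   # V_0 is zero everywhere
    for _ in range(n):
        V = {
            s: max(
                (reward(s, a) + gamma * V[delta[(s, a)]]
                 for a in Actions if (s, a) in delta),
                default=0.0,                       # states with no legal action
            )
            for s in States
        }
    return V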
32
Value Function V* for an Optimal Policy π*:
Example
  • Optimal value function for a one-step horizon:
  • V1*(s) = maxai r(s, ai)
  • Optimal value function for a two-step horizon:
  • V2*(s) = maxai [ r(s, ai) + γ V1*(δ(s, ai)) ]
  • Optimal value function for an n-step horizon:
  • Vn*(s) = maxai [ r(s, ai) + γ Vn-1*(δ(s, ai)) ]
  • Optimal value function for an infinite horizon:
  • V*(s) = maxai [ r(s, ai) + γ V*(δ(s, ai)) ]
33
Solving MDPs by Value Iteration
  • Insight: we can calculate the optimal values
    iteratively using dynamic programming.
  • Algorithm
  • Iteratively calculate values using Bellman's
    equation:
  • Vt+1(s) ← maxa [ r(s,a) + γ Vt(δ(s,a)) ]
  • Terminate when the values are close enough:
  • |Vt+1(s) - Vt(s)| < ε
  • The agent selects the optimal action by one-step
    lookahead on V*:
  • π*(s) = argmaxa [ r(s,a) + γ V*(δ(s,a)) ]
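A minimal value iteration sketch that follows this algorithm, again assuming the dictionary-based model used in the earlier sketches:

# Value iteration: repeatedly apply the Bellman update
#   V_(t+1)(s) <- max_a [ r(s,a) + gamma * V_t(delta(s,a)) ]
# and stop when the largest per-state change falls below epsilon.
def value_iteration(States, Actions, delta, reward, gamma=0.9, eps=1e-4):
    V = {s: 0.0 for s in States}
    while True:
        V_new = {
            s: max(
                (reward(s, a) + gamma * V[delta[(s, a)]]
                 for a in Actions if (s, a) in delta),
                default=0.0,                       # e.g., an absorbing goal
            )
            for s in States
        }
        if max(abs(V_new[s] - V[s]) for s in States) < eps:
            return V_new
        V = V_new

The policy is then read off with the one-step lookahead sketched earlier (greedy_action).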

34
Convergence of Value Iteration
  • Algorithm
  • Iteratively calculate values using Bellman's
    equation:
  • Vt+1(s) ← maxa [ r(s,a) + γ Vt(δ(s,a)) ]
  • Terminate when the values are close enough:
  • |Vt+1(s) - Vt(s)| < ε
  • Then
  • maxs in S |Vt+1(s) - V*(s)| < 2εγ/(1 - γ)
  • Note: convergence is guaranteed even if the updates
    are performed in any order, provided every state is
    updated infinitely often.

35
Example of Value Iteration
  • Vt+1(s) ← maxa [ r(s,a) + γ Vt(δ(s,a)) ]
  • γ = 0.9

[Grids of Vt and Vt+1, both initially all zero, over the example
model with reward 100 for entering the goal G. One state considers
actions a and b.]
  • a: 0 + 0.9 × 0 = 0
  • b: 0 + 0.9 × 0 = 0
  • Max = 0

36
Example of Value Iteration
  • Vt1(s) ? maxa r(s,a) gV t(d(s, a))
  • g 0.9

V t
V t1
0
0
0
100
100
G
G
0
100
a
c
b
0
0
0
100
100
  • a 100 0.9 x 0 100
  • b 0 0.9 x 0 0
  • c 0 0.9 x 0 0
  • Max 100

37
Example of Value Iteration
  • Vt1(s) ? maxa r(s,a) gV t(d(s, a))
  • g 0.9

V t
V t1
0
0
0
100
100
G
G
0
100
0
a
0
0
0
100
100
  • a 0 0.9 x 0 0
  • Max 0

38
Example of Value Iteration
  • Vt+1(s) ← maxa [ r(s,a) + γ Vt(δ(s,a)) ]
  • γ = 0.9

[Vt and Vt+1 grids after updating further states in this sweep.]
39
Example of Value Iteration
  • Vt+1(s) ← maxa [ r(s,a) + γ Vt(δ(s,a)) ]
  • γ = 0.9

[Vt and Vt+1 grids after updating further states in this sweep.]
40
Example of Value Iteration
  • Vt+1(s) ← maxa [ r(s,a) + γ Vt(δ(s,a)) ]
  • γ = 0.9

[Vt and Vt+1 grids at the end of this sweep; the states that can
enter the goal directly now have value 100.]
41
Example of Value Iteration
  • Vt+1(s) ← maxa [ r(s,a) + γ Vt(δ(s,a)) ]
  • γ = 0.9

[Vt and Vt+1 grids in a later sweep: the value 90 (= 0.9 × 100)
propagates to a state one step away from the goal.]
42
Example of Value Iteration
  • Vt+1(s) ← maxa [ r(s,a) + γ Vt(δ(s,a)) ]
  • γ = 0.9

[Vt and Vt+1 grids: the value 81 (= 0.9 × 90) appears two steps away
from the goal.]
43
Example of Value Iteration
  • Vt+1(s) ← maxa [ r(s,a) + γ Vt(δ(s,a)) ]
  • γ = 0.9

[Converged value grid V*: 90, 100, and 0 (at the goal G) in the top
row; 81, 90, and 100 in the bottom row.]
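Putting the pieces together: the worked example appears to be a 2 x 3 grid world with the goal G in one corner, reward 100 for any action that enters G, 0 otherwise, and γ = 0.9. Assuming that layout (a reconstruction, not taken verbatim from the deck), the value_iteration sketch from the earlier slide reproduces the values shown above:

# Reconstructed 2 x 3 grid world: states are (row, col), goal at (0, 2).
# Actions move between adjacent cells; the goal is absorbing.
GOAL = (0, 2)
States = [(row, col) for row in range(2) for col in range(3)]
Actions = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

delta = {}
for (row, col) in States:
    if (row, col) == GOAL:
        continue                       # no actions out of the absorbing goal
    for a, (dr, dc) in MOVES.items():
        nr, nc = row + dr, col + dc
        if 0 <= nr < 2 and 0 <= nc < 3:
            delta[((row, col), a)] = (nr, nc)

def reward(s, a):
    return 100 if delta.get((s, a)) == GOAL else 0   # 100 for entering G

V = value_iteration(States, Actions, delta, reward, gamma=0.9)
print(V)   # values: 90, 100, 0 (at G) in the top row; 81, 90, 100 below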
44
Markov Decision Processes
  • Motivation
  • Markov Decision Processes
  • Computing policies from a model
  • Summary

45
Markov Decision Processes (MDPs)
  • Model
  • Finite set of states, S
  • Finite set of actions, A
  • Probabilistic state transitions, δ(s,a)
  • Reward for each state and action, R(s,a)

[Deterministic example: state-transition diagram.]
46
Crib Sheet: MDPs by Value Iteration
  • Insight: we can calculate the optimal values
    iteratively using dynamic programming.
  • Algorithm
  • Iteratively calculate values using Bellman's
    equation:
  • Vt+1(s) ← maxa [ r(s,a) + γ Vt(δ(s,a)) ]
  • Terminate when the values are close enough:
  • |Vt+1(s) - Vt(s)| < ε
  • The agent selects the optimal action by one-step
    lookahead on V*:
  • π*(s) = argmaxa [ r(s,a) + γ V*(δ(s,a)) ]

47
How Might a Mouse Search a Maze for Cheese?
[Maze diagram with cheese as the goal.]
  • By Value Iteration?
  • What is missing?