Finish MDPs: Transcript and Presenter's Notes
1
Finish MDPs - Last Class
Computer Science CPSC 322, Lecture 36 (Textbook Chpt 12.5)
April 11, 2008
2
Lecture Overview
  • Recap MDPs and More on MDP Example
  • Optimal Policies
  • Key Ideas behind computation
  • Some Examples
  • Course Conclusions

3
Markov Decision Processes
Often an agent needs to go beyond a fixed set of
decisions. It would like to manage an ongoing
(indefinite / infinite) decision process (e.g.,
helping older adults living with cognitive
disabilities).
Properties
4
Example MDP: Scenario and Actions
  • The agent moves in the above grid via actions Up,
    Down, Left, Right
  • Each action has
  • 0.8 probability of achieving its intended effect
  • 0.1 probability (each) of moving at right angles to
    the intended direction
  • If the agent bumps into a wall, it stays where it is
  • Eleven states ( (3,4) and (2,4) are terminal
    states)

5
Example MDP: Underlying info structures
Four actions: Up, Down, Left, Right. Eleven states:
(1,1), (1,2), ..., (3,4). (A sketch of these structures in code follows below.)
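A minimal sketch, in Python, of how these structures could be represented. It assumes the standard 3x4 grid-world setup: an obstacle at (2,2), terminal payoffs +1 at (3,4) and -1 at (2,4), and a per-step reward r in non-terminal states; none of these names or the obstacle position come from the slides themselves.

```python
# Sketch of the grid MDP's underlying structures (assumptions noted above).
# Coordinates are (row, col): rows 1-3, columns 1-4.

ACTIONS = {            # intended displacement of each action as (d_row, d_col)
    "Up":    (1, 0),
    "Down":  (-1, 0),
    "Left":  (0, -1),
    "Right": (0, 1),
}

OBSTACLE = (2, 2)                          # assumed blocked cell
TERMINALS = {(3, 4): +1.0, (2, 4): -1.0}   # terminal payoffs
STATES = [(r, c) for r in range(1, 4) for c in range(1, 5) if (r, c) != OBSTACLE]

def right_angles(delta):
    """The two directions at right angles to the intended one."""
    dr, dc = delta
    return [(dc, dr), (-dc, -dr)]

def move(state, delta):
    """Deterministic effect of one step; bumping into a wall or the obstacle stays put."""
    r, c = state
    nr, nc = r + delta[0], c + delta[1]
    if (nr, nc) == OBSTACLE or not (1 <= nr <= 3 and 1 <= nc <= 4):
        return state
    return (nr, nc)

def transition(state, action):
    """P(s' | s, a): 0.8 for the intended direction, 0.1 for each right angle."""
    probs, delta = {}, ACTIONS[action]
    for p, d in [(0.8, delta)] + [(0.1, d) for d in right_angles(delta)]:
        s2 = move(state, d)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

def reward(state, r_nonterminal=-0.04):
    """Reward collected on entering a state: terminal payoff or the per-step reward r."""
    return TERMINALS.get(state, r_nonterminal)
```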
6
Lecture Overview
  • Recap MDPs and More on MDP Example
  • Optimal Policies
  • Key Ideas behind computation
  • Some Examples
  • Course Conclusions

7
MDPs: Policy
  • In MDPs, a policy π(s) specifies what the agent
    should do for each state s (a tiny example follows
    below)
  • The optimal policy π*(s) is the one that
    maximizes the expected total reward for each
    state

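Concretely, under the grid representation assumed in the earlier sketch, a policy is just a table mapping each non-terminal state to an action; the entries below are purely illustrative, not the optimal policy.

```python
# A (hypothetical) policy for the grid MDP: one action per non-terminal state.
policy = {
    (1, 1): "Up", (1, 2): "Left", (1, 3): "Left",   # illustrative entries only
    # ... one entry for each remaining non-terminal state
}
```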
8
Sketch of ideas to find the optimal policy for an
MDP (Value Iteration)
  • We first need a couple of definitions
  • V^π(s): the expected value of following policy π
    in state s
  • Q^π(s, a), where a is an action: the expected value
    of performing a in s, and then following policy
    π
  • We have, by definition:

Q^π(s, a) = Σ_{s'} P(s' | s, a) [ r(s, a, s') + V^π(s') ]

where the sum ranges over the states s' reachable from s by doing a,
P(s' | s, a) is the probability of getting to s' from s via a,
r(s, a, s') is the reward of getting to s' from s via a, and
V^π(s') is the expected value of following policy π in s'.
9
Value of a policy and Optimal policy
We can then compute V^π(s) in terms of Q^π(s, a):

V^π(s) = Q^π(s, π(s))

i.e., the expected value of following π in s is the expected value of
performing the action indicated by π in s and following π after that.
The optimal policy π* is the one that gives the action that maximizes
Q^π* for each state:

π*(s) = argmax_a Q^π*(s, a)

(A value-iteration sketch in code follows below.)
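As a concrete illustration of the value-iteration idea, here is a minimal sketch in Python. It reuses the hypothetical helpers from the earlier grid sketch (STATES, TERMINALS, ACTIONS, transition, reward), which are assumptions rather than anything given on the slides; it repeatedly applies V(s) <- max_a Q(s, a) and then reads off the greedy policy.

```python
def q_value(s, a, V, r_nt):
    """Q(s, a) = Σ_{s'} P(s'|s,a) [ r(s, a, s') + V(s') ], as on slide 8."""
    return sum(p * (reward(s2, r_nt) + V[s2])
               for s2, p in transition(s, a).items())

def value_iteration(r_nt=-0.04, eps=1e-6, max_iters=1000):
    """Iterate V(s) <- max_a Q(s, a) until the values stop changing, then
    extract the greedy policy. A sketch only, not the course's own code."""
    V = {s: 0.0 for s in STATES}     # terminals keep value 0: their payoff
    for _ in range(max_iters):       # is collected by reward() on entry
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue
            new_v = max(q_value(s, a, V, r_nt) for a in ACTIONS)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:
            break
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V, r_nt))
              for s in STATES if s not in TERMINALS}
    return V, policy
```

For example, `V, pi = value_iteration(-0.04)` should produce a policy of the kind shown on the next slides; the exact values depend on the assumptions above.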
10
Lecture Overview
  • Recap MDPs and More on MDP Example
  • Optimal Policies
  • Key Ideas behind computation
  • Some Examples
  • Course Conclusions

11
Optimal Policy in our Example
  • Let's suppose that, in our example, the total
    reward of an environment history is the sum of
    the individual rewards
  • For instance, with a penalty of -0.04 in non-terminal
    states, reaching (3,4) in 10 steps gives
    a total reward of ... (a worked sum follows below)
  • The penalty is designed to make the agent go for short
    solution paths
  • Below is the optimal policy when the penalty in
    non-terminal states is -0.04
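A worked instance of that sum, assuming the standard +1 payoff for reaching (3,4) and that the -0.04 penalty is collected on each of the 10 steps taken before reaching it:

    10 × (-0.04) + 1 = 0.6

Whether the starting state's reward is also counted is left implicit on the slide; if it is, the total shifts by one more -0.04.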

12
Rewards and Optimal Policy
Optimal Policy when the penalty in non-terminal
states is -0.04
  • Note that here the cost of taking steps is small
    compared to the cost of ending up in (2,4)
  • Thus, the optimal policy for state (1,3) is to
    take the long way around the obstacle rather than
    risk falling into (2,4) by taking the shorter
    way that passes next to it
  • But the optimal policy may change if the reward
    in the non-terminal states (let's call it r)
    changes, as the sweep sketched below illustrates

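A small usage sketch of that sensitivity, reusing the hypothetical value_iteration helper from the earlier sketch to recompute the policy for one value of r from each of the regimes discussed on the next slides (r > 0 is omitted because, without discounting, the values grow without bound, which is exactly the situation slide 17 asks about):

```python
# Recompute the optimal policy for several per-step rewards r and compare.
for r in [-2.0, -0.2, -0.01]:
    V, policy = value_iteration(r)
    print(f"r = {r}: action at (1,3) -> {policy[(1, 3)]}")
```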
13
Rewards and Optimal Policy
Optimal Policy when r < -1.6284
(Grid figure: optimal policy arrows on the 3x4 grid)
Why is the agent heading straight into (2,4)
from its surrounding states?
14
Rewards and Optimal Policy
Optimal Policy when -0.427 < r < -0.085
(Grid figure: optimal policy arrows on the 3x4 grid)
The cost of taking a step is high enough to make
the agent take the shortcut to (3,4) from (1,3)
15
Rewards and Optimal Policy
Optimal Policy when -0.0218 < r < 0
(Grid figure: optimal policy arrows on the 3x4 grid)
Why is the agent heading straight into the
obstacle from (2,3)?
16
Rewards and Optimal Policy
Optimal Policy when -0.0218 < r < 0
(Grid figure: optimal policy arrows on the 3x4 grid)
  • Staying longer in the grid is not penalized as much
    as before. The agent is willing to take longer
    routes to avoid (2,4)
  • This is true even when it means banging against
    the obstacle a few times when moving from (2,3)

17
Rewards and Optimal Policy
Optimal Policy when r > 0
(Grid figure: optimal policy arrows; the marked states are those where every
action belongs to an optimal policy)
What happens when the agent is rewarded for every
step it takes?
18
Lecture Overview
  • Recap MDPs and More on MDP Example
  • Optimal Policies
  • Key Ideas behind computation
  • Some Examples
  • Course Conclusions

19
322 Conclusions
  • Artificial Intelligence has become a huge field.
  • After taking this course you should have achieved
    a reasonable understanding of the basic AI
    principles and techniques
  • But there is much more
  • 422 Advanced AI
  • 340 Machine Learning
  • 425 Machine Vision
  • Grad courses: Natural Language Processing,
    Intelligent User Interfaces, Multi-Agent
    Systems, Machine Learning, Vision

20
Final Exam
  • When: Friday, Apr 18, 12-3 pm
  • Where: DMP 110 (not the regular room)

To revise the material you can check, on WebCT, "Final
Review" from the Course Menu. It contains some
sample questions from previous course offerings
(so some questions may be inappropriate for what
we covered this time). It is nonetheless a good source
for revising the material, and I may even use a
few of those questions in the final :-)
  • Come to the remaining office hours
  • Ashiqur KhudaBukhsh: Mon 5-6, X150 (Learning
    Center)
  • Jacek Kisynski: Mon 2-3 and Wed 5-6, X150
    (Learning Center)
  • Giuseppe Carenini: Tue 2-3

21
Final Exam (cont.)
  • Assignments 20%
  • Midterm 30%
  • Final 50%
  • If your final grade is more than 20% higher than your
    midterm grade:
  • Assignments 20%
  • Midterm 15%
  • Final 65%

Assignment 4 will be marked by Tuesday. I'll put the
marked assignments outside my office, 129 West Wing.