Title: Decision Making Under Uncertainty
1. Decision Making Under Uncertainty
- Russell and Norvig, ch. 16, 17
- CMSC 671, Fall 2005
- Material from Lise Getoor, Jean-Claude Latombe, and Daphne Koller
2. Decision Making Under Uncertainty
- Many environments have multiple possible outcomes
- Some of these outcomes may be good; others may be bad
- Some may be very likely; others unlikely
- What's a poor agent to do??
3. Non-Deterministic vs. Probabilistic Uncertainty
- Non-deterministic model: an action may lead to any of several outcomes {a, b, c}; choose the decision that is best for the worst case (cf. adversarial search)
- Probabilistic model: each outcome has a probability; choose the decision with the best expected outcome
4. Expected Utility
- Random variable X with n values x1, ..., xn and distribution (p1, ..., pn)
  - E.g., X is the state reached after doing an action A under uncertainty
- Function U of X
  - E.g., U is the utility of a state
- The expected utility of A is
  EU[A] = Σ_{i=1..n} P(xi | A) U(xi)
5. One State / One Action Example
- U(S0) = 100 × 0.2 + 50 × 0.7 + 70 × 0.1
        = 20 + 35 + 7 = 62
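A quick sketch of this expected-utility computation in Python; the probability/utility pairs are the ones on the slide:

    # Expected utility of an action: EU[A] = sum_i P(xi | A) * U(xi)
    def expected_utility(outcomes):
        """outcomes: list of (probability, utility) pairs for one action."""
        return sum(p * u for p, u in outcomes)

    # Values from the one-state / one-action example above
    print(expected_utility([(0.2, 100), (0.7, 50), (0.1, 70)]))  # 62 (up to rounding)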
6. One State / Two Actions Example
- U1(S0) = 62
- U2(S0) = 74
- U(S0) = max{U1(S0), U2(S0)} = 74
7. Introducing Action Costs
- Action costs: -5 for action 1, -25 for action 2
- U1(S0) = 62 - 5 = 57
- U2(S0) = 74 - 25 = 49
- U(S0) = max{U1(S0), U2(S0)} = 57
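A small sketch of choosing between the two actions once their costs are subtracted; the action names A1/A2 are made up here, and the expected utilities 62 and 74 are taken from the slides rather than recomputed from individual outcome utilities:

    # Pick the action maximizing expected utility minus action cost
    def best_action(actions):
        """actions: dict mapping action name -> (expected_utility, cost)."""
        return max(actions, key=lambda a: actions[a][0] - actions[a][1])

    actions = {"A1": (62, 5), "A2": (74, 25)}   # numbers from slides 6-7
    print(best_action(actions))                  # A1  (57 vs. 49)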
8. MEU Principle
- A rational agent should choose the action that maximizes the agent's expected utility
- This is the basis of the field of decision theory
- The MEU principle provides a normative criterion for rational choice of action
AI is Solved!!!
9. Not quite...
- Must have a complete model of
  - Actions
  - Utilities
  - States
- Even if you have a complete model, it will be computationally intractable
- In fact, a truly rational agent takes into account the utility of reasoning as well (bounded rationality)
- Nevertheless, great progress has been made in this area recently, and we are able to solve much more complex decision-theoretic problems than ever before
10. We'll look at
- Decision-Theoretic Planning
- Simple decision making (ch. 16)
- Sequential decision making (ch. 17)
11. Axioms of Utility Theory
- Orderability
  (A ≻ B) ∨ (B ≻ A) ∨ (A ~ B)
- Transitivity
  (A ≻ B) ∧ (B ≻ C) ⇒ (A ≻ C)
- Continuity
  A ≻ B ≻ C ⇒ ∃p [p, A; 1−p, C] ~ B
- Substitutability
  A ~ B ⇒ [p, A; 1−p, C] ~ [p, B; 1−p, C]
- Monotonicity
  A ≻ B ⇒ (p ≥ q ⇔ [p, A; 1−p, B] ≿ [q, A; 1−q, B])
- Decomposability
  [p, A; 1−p, [q, B; 1−q, C]] ~ [p, A; (1−p)q, B; (1−p)(1−q), C]
12. Money Versus Utility
- Money ≠ Utility
- More money is better, but not always in a linear relationship to the amount of money
- Expected Monetary Value (EMV)
- Risk-averse: U(L) < U(S_EMV(L))
- Risk-seeking: U(L) > U(S_EMV(L))
- Risk-neutral: U(L) = U(S_EMV(L))
  (U(L) is the utility of the lottery itself; U(S_EMV(L)) is the utility of being handed the lottery's expected monetary value for sure)
13. Value Function
- Provides a ranking of alternatives, but not a meaningful metric scale
- Also known as an ordinal utility function
- Remember the expectiminimax example
- Sometimes, only relative judgments (value functions) are necessary
- At other times, absolute judgments (utility functions) are required
14. Multiattribute Utility Theory
- A given state may have multiple utilities
- ...because of multiple evaluation criteria
- ...because of multiple agents (interested parties) with different utility functions
- We will talk about this more later in the semester, when we discuss multi-agent systems and game theory
15. Decision Networks
- Extend BNs to handle actions and utilities
- Also called influence diagrams
- Use BN inference methods to solve
- Perform Value of Information calculations
16. Decision Networks (cont.)
- Chance nodes: random variables, as in BNs
- Decision nodes: actions that the decision maker can take
- Utility/value nodes: the utility of the outcome state
17. R&N example
18. Umbrella Network
- Decision node: umbrella (take / don't take)
- Chance nodes: weather (P(rain) = 0.4), forecast, have umbrella
- Utility node: happiness
- P(have | take) = 1.0;  P(¬have | ¬take) = 1.0
- Forecast model:
    f      w        p(f | w)
    sunny  rain     0.3
    rainy  rain     0.7
    sunny  no rain  0.8
    rainy  no rain  0.2
- Utilities:
    U(have, rain)  = -25     U(have, ¬rain)  = 0
    U(¬have, rain) = -100    U(¬have, ¬rain) = 100
19. Evaluating Decision Networks
- Set the evidence variables for the current state
- For each possible value of the decision node:
  - Set the decision node to that value
  - Calculate the posterior probability of the parent nodes of the utility node, using BN inference
  - Calculate the resulting utility for the action
- Return the action with the highest utility (see the sketch below)
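A minimal sketch of that evaluation loop, assuming a hypothetical posterior() helper that stands in for the BN inference step (this is not any particular library's API):

    def best_decision(decision_values, evidence, posterior, utility):
        """Evaluate a decision network by enumerating the decision values.

        decision_values: possible values of the decision node
        evidence: dict of observed chance variables
        posterior: function (evidence, decision) -> {outcome: probability}
                   over the utility node's parents (hypothetical BN inference hook)
        utility: function outcome -> utility value
        """
        best, best_eu = None, float("-inf")
        for d in decision_values:
            dist = posterior(evidence, d)                      # BN inference step
            eu = sum(p * utility(o) for o, p in dist.items())  # expected utility
            if eu > best_eu:
                best, best_eu = d, eu
        return best, best_eu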
20. Decision Making: Umbrella Network
Should I take my umbrella??
(Same umbrella network and tables as slide 18; a worked sketch follows.)
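A worked sketch of the no-forecast decision, using P(rain) = 0.4 and the utility table as reconstructed on slide 18 (the exact utility assignments are a reconstruction, so treat the numbers as illustrative):

    # Utilities indexed by (have umbrella?, rain?)
    U = {(True, True): -25, (True, False): 0,
         (False, True): -100, (False, False): 100}
    p_rain = 0.4

    def eu(take):
        have = take        # P(have | take) = 1, P(not have | not take) = 1
        return p_rain * U[(have, True)] + (1 - p_rain) * U[(have, False)]

    print(eu(True))    # -10.0 : expected happiness if I take the umbrella
    print(eu(False))   #  20.0 : expected happiness if I don't

So, before seeing any forecast, the MEU choice under these numbers is to leave the umbrella at home.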
21. Value of Information (VOI)
- Suppose the agent's current knowledge is E. The value of the current best action α is
  EU(α | E) = max_a Σ_i P(Result_i(a) | Do(a), E) U(Result_i(a))
22. Value of Information: Umbrella Network
What is the value of knowing the weather forecast?
(Same umbrella network and tables as slide 18; see the sketch below.)
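A sketch of answering that question as a value-of-perfect-information calculation: compare the expected utility of deciding now with the expected utility of deciding after seeing the forecast, using the CPTs and the reconstructed utility table from slide 18:

    p_rain = 0.4
    P_f_given_w = {("sunny", True): 0.3, ("rainy", True): 0.7,
                   ("sunny", False): 0.8, ("rainy", False): 0.2}
    U = {(True, True): -25, (True, False): 0,
         (False, True): -100, (False, False): 100}

    def best_eu(p_r):
        """Expected utility of the best take / don't-take choice given P(rain)."""
        return max(p_r * U[(h, True)] + (1 - p_r) * U[(h, False)]
                   for h in (True, False))

    eu_now = best_eu(p_rain)                  # decide without the forecast: 20

    eu_with_forecast = 0.0
    for f in ("sunny", "rainy"):
        p_f = P_f_given_w[(f, True)] * p_rain + P_f_given_w[(f, False)] * (1 - p_rain)
        p_rain_given_f = P_f_given_w[(f, True)] * p_rain / p_f   # Bayes' rule
        eu_with_forecast += p_f * best_eu(p_rain_given_f)

    print(eu_with_forecast - eu_now)   # about 9: the forecast is worth ~9 utility units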
23. Sequential Decision Making
- Finite Horizon
- Infinite Horizon
24. Simple Robot Navigation Problem
- In each state, the possible actions are U, D, R,
and L
25. Probabilistic Transition Model
- In each state, the possible actions are U, D, R, and L
- The effect of U is as follows (transition model):
  - With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move)
26. Probabilistic Transition Model
- In each state, the possible actions are U, D, R, and L
- The effect of U is as follows (transition model):
  - With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move)
  - With probability 0.1 the robot moves right one square (if the robot is already in the rightmost column, then it does not move)
27. Probabilistic Transition Model
- In each state, the possible actions are U, D, R, and L
- The effect of U is as follows (transition model):
  - With probability 0.8 the robot moves up one square (if the robot is already in the top row, then it does not move)
  - With probability 0.1 the robot moves right one square (if the robot is already in the rightmost column, then it does not move)
  - With probability 0.1 the robot moves left one square (if the robot is already in the leftmost column, then it does not move)
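A small sketch of this transition model for the U action; the 4×3 grid dimensions are assumed from the figures, and the other actions would be written analogously:

    def transition_up(x, y, width=4, height=3):
        """P(next square | U, (x, y)): 0.8 up, 0.1 right, 0.1 left;
        bumping into the grid boundary leaves the robot where it is."""
        dist = {}
        for prob, (dx, dy) in [(0.8, (0, 1)), (0.1, (1, 0)), (0.1, (-1, 0))]:
            nx, ny = x + dx, y + dy
            if not (1 <= nx <= width and 1 <= ny <= height):
                nx, ny = x, y                     # blocked: stay put
            dist[(nx, ny)] = dist.get((nx, ny), 0) + prob
        return dist

    print(transition_up(3, 2))   # {(3, 3): 0.8, (4, 2): 0.1, (2, 2): 0.1}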
28. Markov Property
The transition probabilities depend only on the current state, not on the previous history (how that state was reached)
29Sequence of Actions
3,2
3
2
1
4
3
2
1
- Planned sequence of actions (U, R)
30. Sequence of Actions
[4×3 grid-world figure]
- Planned sequence of actions: (U, R)
- U is executed
31. Histories
[4×3 grid-world figure]
- Planned sequence of actions: (U, R)
- U has been executed
- R is executed
- There are 9 possible sequences of states, called histories, and 6 possible final states for the robot!
32. Probability of Reaching the Goal
[4×3 grid-world figure]
Note the importance of the Markov property in this derivation
- P([4,3] | (U,R), [3,2])
    = P([4,3] | R, [3,3]) × P([3,3] | U, [3,2])
      + P([4,3] | R, [4,2]) × P([4,2] | U, [3,2])
- P([3,3] | U, [3,2]) = 0.8
- P([4,2] | U, [3,2]) = 0.1
- P([4,3] | R, [3,3]) = 0.8
- P([4,3] | R, [4,2]) = 0.1
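Plugging these numbers in gives the probability that the sequence (U, R) reaches the goal; a quick arithmetic check:

    p_via_33 = 0.8 * 0.8   # U reaches [3,3], then R reaches [4,3]
    p_via_42 = 0.1 * 0.1   # U slips right to [4,2], then R slips up to [4,3]
    print(p_via_33 + p_via_42)   # 0.65 (up to floating-point rounding)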
33. Utility Function
- [4,3] provides a power supply
- [4,2] is a sand area from which the robot cannot escape
34. Utility Function
- [4,3] provides a power supply
- [4,2] is a sand area from which the robot cannot escape
- The robot needs to recharge its batteries
35. Utility Function
- [4,3] provides a power supply
- [4,2] is a sand area from which the robot cannot escape
- The robot needs to recharge its batteries
- [4,3] and [4,2] are terminal states
36. Utility of a History
- [4,3] provides a power supply
- [4,2] is a sand area from which the robot cannot escape
- The robot needs to recharge its batteries
- [4,3] and [4,2] are terminal states
- The utility of a history is defined by the utility of the last state (+1 or -1) minus n/25, where n is the number of moves
37. Utility of an Action Sequence
[4×3 grid-world figure: +1 at [4,3], -1 at [4,2]]
- Consider the action sequence (U, R) from [3,2]
38. Utility of an Action Sequence
[4×3 grid-world figure: +1 at [4,3], -1 at [4,2]]
- Consider the action sequence (U, R) from [3,2]
- A run produces one among 7 possible histories, each with some probability
39. Utility of an Action Sequence
[4×3 grid-world figure: +1 at [4,3], -1 at [4,2]]
- Consider the action sequence (U, R) from [3,2]
- A run produces one among 7 possible histories, each with some probability
- The utility of the sequence is the expected utility of the histories:
  U = Σ_h U(h) P(h)
40. Optimal Action Sequence
[4×3 grid-world figure: +1 at [4,3], -1 at [4,2]]
- Consider the action sequence (U, R) from [3,2]
- A run produces one among 7 possible histories, each with some probability
- The utility of the sequence is the expected utility of the histories
- The optimal sequence is the one with maximal utility
41. Optimal Action Sequence
[4×3 grid-world figure: +1 at [4,3], -1 at [4,2]]
- Consider the action sequence (U, R) from [3,2]
- A run produces one among 7 possible histories, each with some probability
- The utility of the sequence is the expected utility of the histories
- The optimal sequence is the one with maximal utility
- But is the optimal action sequence what we want to compute?
42. Reactive Agent Algorithm
- Repeat:
  - s ← sensed state
  - If s is terminal then exit
  - a ← choose action (given s)
  - Perform a
43. Policy (Reactive/Closed-Loop Strategy)
- A policy Π is a complete mapping from states to actions
44. Reactive Agent Algorithm
- Repeat:
  - s ← sensed state
  - If s is terminal then exit
  - a ← Π(s)
  - Perform a
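A minimal sketch of this closed-loop agent; sense_state, is_terminal, and perform are hypothetical hooks standing in for the robot's sensors and actuators:

    def run_policy(policy, sense_state, is_terminal, perform):
        """Reactive agent: repeatedly sense the state and apply the policy.

        policy: dict (or function) mapping each state to an action
        sense_state, is_terminal, perform: environment hooks (assumed)
        """
        while True:
            s = sense_state()
            if is_terminal(s):
                return
            a = policy[s] if isinstance(policy, dict) else policy(s)
            perform(a)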
45. Optimal Policy
[4×3 grid-world figure: +1 at [4,3], -1 at [4,2]]
- A policy Π is a complete mapping from states to actions
- The optimal policy Π* is the one that always yields a history (ending at a terminal state) with maximal expected utility
46. Optimal Policy
[4×3 grid-world figure: +1 at [4,3], -1 at [4,2]]
- A policy Π is a complete mapping from states to actions
- The optimal policy Π* is the one that always yields a history with maximal expected utility
47. Additive Utility
- History H = (s0, s1, ..., sn)
- The utility of H is additive iff:
  U(s0, s1, ..., sn) = R(0) + U(s1, ..., sn) = Σ_i R(i)
  (R = reward)
48. Additive Utility
- History H = (s0, s1, ..., sn)
- The utility of H is additive iff:
  U(s0, s1, ..., sn) = R(0) + U(s1, ..., sn) = Σ_i R(i)
- Robot navigation example:
  - R(n) = +1 if sn = [4,3]
  - R(n) = -1 if sn = [4,2]
  - R(i) = -1/25 if i = 0, ..., n-1
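A tiny sketch of that reward structure for the navigation example; the per-step reward depends only on the state reached and on whether it is the final (terminal) state of the history:

    def reward(state, is_final):
        """Per-step reward from the robot-navigation example above."""
        if is_final and state == (4, 3):
            return +1
        if is_final and state == (4, 2):
            return -1
        return -1.0 / 25                 # each move costs 1/25

    # Additive utility of a history = sum of the per-step rewards
    def history_utility(states):
        return sum(reward(s, i == len(states) - 1) for i, s in enumerate(states))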
49. Principle of Max Expected Utility
- History H = (s0, s1, ..., sn)
- Utility of H: U(s0, s1, ..., sn) = Σ_i R(i)
- First-step analysis →
  U(i) = R(i) + max_a Σ_k P(k | a, i) U(k)
  Π*(i) = argmax_a Σ_k P(k | a, i) U(k)
50. Value Iteration
- Initialize the utility of each non-terminal state si to U0(i) = 0
- For t = 0, 1, 2, ..., do:
  Ut+1(i) ← R(i) + max_a Σ_k P(k | a, i) Ut(k)
51. Value Iteration
Note the importance of terminal states and the connectivity of the state-transition graph
- Initialize the utility of each non-terminal state si to U0(i) = 0
- For t = 0, 1, 2, ..., do:
  Ut+1(i) ← R(i) + max_a Σ_k P(k | a, i) Ut(k)
[4×3 grid-world figure showing the converged utilities: top row 0.812, 0.868, 0.918, +1; middle row 0.762, 0.660, -1; bottom row 0.705, 0.655, 0.611, 0.388; see the sketch below]
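A compact sketch of value iteration for an MDP given as plain dictionaries; with the grid world's transition model, a reward of -1/25 per step, and terminal utilities ±1 (no discounting), it converges to utilities like the ones in the figure. This is a sketch under those assumptions, not the lecture's own code:

    def value_iteration(states, actions, P, R, terminals, iterations=100):
        """U_{t+1}(i) <- R(i) + max_a sum_k P(k | a, i) U_t(k)

        P[(s, a)]: dict {next_state: probability}
        R[s]:      per-step reward of state s
        terminals: dict {terminal_state: its fixed utility}
        """
        U = {s: 0.0 for s in states}
        U.update(terminals)                       # terminal utilities stay fixed
        for _ in range(iterations):
            new_U = dict(U)
            for s in states:
                if s in terminals:
                    continue
                new_U[s] = R[s] + max(
                    sum(p * U[k] for k, p in P[(s, a)].items())
                    for a in actions)
            U = new_U
        return U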
52. Policy Iteration
- Pick a policy Π at random
53. Policy Iteration
- Pick a policy Π at random
- Repeat:
  - Compute the utility of each state for Π:
    Ut+1(i) ← R(i) + Σ_k P(k | Π(i), i) Ut(k)
54. Policy Iteration
- Pick a policy Π at random
- Repeat:
  - Compute the utility of each state for Π:
    Ut+1(i) ← R(i) + Σ_k P(k | Π(i), i) Ut(k)
  - Compute the policy Π' given these utilities:
    Π'(i) = argmax_a Σ_k P(k | a, i) U(k)
55. Policy Iteration
- Pick a policy Π at random
- Repeat:
  - Compute the utility of each state for Π:
    Ut+1(i) ← R(i) + Σ_k P(k | Π(i), i) Ut(k)
  - Compute the policy Π' given these utilities:
    Π'(i) = argmax_a Σ_k P(k | a, i) U(k)
  - If Π' = Π then return Π (see the sketch below)
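A minimal sketch of this loop, using the same dictionary-based MDP representation as the value-iteration sketch above; the policy-evaluation step is done here by repeated sweeps rather than by solving the linear system exactly:

    import random

    def policy_iteration(states, actions, P, R, terminals, eval_sweeps=50):
        """Alternate policy evaluation and improvement until the policy is stable."""
        pi = {s: random.choice(actions) for s in states if s not in terminals}
        U = dict.fromkeys(states, 0.0)
        U.update(terminals)
        while True:
            # Policy evaluation: U(i) <- R(i) + sum_k P(k | pi(i), i) U(k)
            for _ in range(eval_sweeps):
                for s in pi:
                    U[s] = R[s] + sum(p * U[k] for k, p in P[(s, pi[s])].items())
            # Policy improvement: pi'(i) = argmax_a sum_k P(k | a, i) U(k)
            new_pi = {s: max(actions,
                             key=lambda a: sum(p * U[k] for k, p in P[(s, a)].items()))
                      for s in pi}
            if new_pi == pi:                      # policy unchanged: return it
                return pi, U
            pi = new_pi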
56. n-Step Decision Process
- Assume that:
  - Each state reached after n steps is terminal, hence has known utility
  - There is a single initial state
  - Any two states reached after i and j steps (i ≠ j) are different
57. n-Step Decision Process
  U(i) = R(i) + max_a Σ_k P(k | a, i) U(k)
  Π*(i) = argmax_a Σ_k P(k | a, i) U(k)
- For j = n-1, n-2, ..., 0 do:
  - For every state si attained after step j:
    - Compute the utility of si (first equation)
    - Label that state with the corresponding action (second equation)
58. What is the Difference?
  Π*(i) = argmax_a Σ_k P(k | a, i) U(k)
  U(i) = R(i) + max_a Σ_k P(k | a, i) U(k)
59. Infinite Horizon
In many problems, e.g., the robot navigation example, histories are potentially unbounded and the same state can be reached many times
What if the robot lives forever?
One trick: use discounting to make the infinite-horizon problem mathematically tractable
[4×3 grid-world figure: +1 at [4,3], -1 at [4,2]]
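One common way to make that trick concrete (the discount factor γ, 0 < γ < 1, is standard notation for discounted MDPs but is not shown on the slide): future utilities are scaled down by γ at each step, so the value-iteration update becomes

  Ut+1(i) ← R(i) + γ max_a Σ_k P(k | a, i) Ut(k)

and the infinite sum of rewards stays bounded (by R_max / (1 - γ)), which is what makes the infinite-horizon problem tractable.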
60. Example: Tracking a Target
- An optimal policy cannot be computed ahead of time:
  - The environment might be unknown
  - The environment may only be partially observable
  - The target may not wait
- → A policy must be computed on-the-fly
- The robot must keep the target in view
- The target's trajectory is not known in advance
- The environment may or may not be known
61. POMDP (Partially Observable Markov Decision Problem)
- A sensing operation returns multiple states, with a probability distribution
- Choosing the action that maximizes the expected utility of this state distribution, assuming state utilities computed as above, is not good enough, and actually does not make sense (is not rational)
62. Example: Target Tracking
63. Summary
- Decision making under uncertainty
- Utility function
- Optimal policy
- Maximal expected utility
- Value iteration
- Policy iteration