Title: Decisions under uncertainty
1. Decisions under uncertainty
- Reading: Ch. 16, AIMA 2nd Ed.
2. Outline
- Decisions, preferences, utility functions
- Influence diagrams
- Value of information
3. Decision making
- Decisions: irrevocable allocations of domain resources.
- Decisions should be made so as to maximize expected utility.
- Questions:
  - Why make decisions based on average (expected) utility?
  - Why can one assume that utility functions exist?
  - Can an agent act rationally by expressing preferences between states without giving them numeric values?
  - Can every preference structure be captured by assigning a single number to every state?
4. Simple decision problem
- Party decision problem: inside or outside?

  Action | Weather | Outcome
  IN     | Dry     | Regret
  IN     | Wet     | Relief
  OUT    | Dry     | Perfect!
  OUT    | Wet     | Disaster
5. Value function
- A numerical score over all possible states of the world:

  Action | Weather | Value
  OUT    | Dry     | 100
  IN     | Wet     | 60
  IN     | Dry     | 50
  OUT    | Wet     | 0
6. Preferences
- An agent chooses among prizes (A, B, ...) and lotteries (situations with uncertain prizes), e.g.
  L1 = [0.2, $40,000; 0.8, $0]
  L2 = [0.25, $30,000; 0.75, $0]
  Which do you prefer?
- Notation:
  A ≻ B : A is preferred to B
  A ≺ B : B is preferred to A
  A ∼ B : indifference between A and B
7. Desired properties for preferences over lotteries
- If you prefer $100 to $0 and p < q, then for
  L1 = [p, $100; 1-p, $0]
  L2 = [q, $100; 1-q, $0]
  you should prefer L2 ≻ L1.
8. Properties of (rational) preference
- These properties lead to rational agent behavior:
- Orderability: (A ≻ B) ∨ (A ≺ B) ∨ (A ∼ B)
- Transitivity: (A ≻ B) ∧ (B ≻ C) ⇒ (A ≻ C)
- Continuity: A ≻ B ≻ C ⇒ ∃p [p, A; 1-p, C] ∼ B
- Substitutability: A ∼ B ⇒ ∀p [p, A; 1-p, C] ∼ [p, B; 1-p, C]
- Monotonicity: A ≻ B ⇒ (p > q ⇔ [p, A; 1-p, B] ≻ [q, A; 1-q, B])
9. Preference ⇒ expected utility
- The properties of rational preference imply the existence (Ramsey 1931; von Neumann & Morgenstern 1944) of a utility function U such that, for example, for
  L1 = [p, $100; 1-p, $0]
  L2 = [q, $100; 1-q, $0],
  L2 ≻ L1 if and only if EU(L2) > EU(L1), where EU(L) = Σ_i p_i U(prize_i).
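As a small illustration (not from the slides), the sketch below compares the two lotteries from slide 6 under two assumed utility functions: a linear (risk-neutral) utility favors L1, while a concave (risk-averse) square-root utility favors L2.

```python
# Sketch: expected utility of the two lotteries from slide 6.
# The utility functions below are illustrative assumptions, not from the slides.
from math import sqrt

L1 = [(0.20, 40_000), (0.80, 0)]   # lottery as (probability, prize) pairs
L2 = [(0.25, 30_000), (0.75, 0)]

def expected_utility(lottery, U):
    """EU(L) = sum_i p_i * U(prize_i)."""
    return sum(p * U(x) for p, x in lottery)

linear = lambda x: x            # risk-neutral utility (assumption)
concave = lambda x: sqrt(x)     # a risk-averse utility (assumption)

print(expected_utility(L1, linear), expected_utility(L2, linear))    # 8000.0 7500.0 -> prefers L1
print(expected_utility(L1, concave), expected_utility(L2, concave))  # 40.0 ~43.3   -> prefers L2
```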
10. Properties of utility
- Utility is a function that maps states to real numbers.
- Standard approach to assessing the utility of a state:
  - Compare state A to a standard lottery L = [p, u_best; 1-p, u_worst], where u_best is the best possible outcome and u_worst the worst possible outcome.
  - Adjust p until A ∼ L.
- Example standard lottery: [0.999999, continue as before; 0.000001, instant death], compared against, e.g., paying $30.
11. Utility scales
- Normalized utilities: u_best = 1.0, u_worst = 0.0.
- Micromorts: a one-millionth chance of death; useful for Russian roulette, paying to reduce product risks, etc.
- QALYs: quality-adjusted life years; useful for medical decisions involving substantial risk.
- Note: behavior is invariant w.r.t. positive linear transformations U'(s) = A·U(s) + B, with A > 0.
12. Utility vs Money
- Utility is NOT monetary payoff.
- Example: most people prefer L2 = [$30,000 for sure] over L1 = [0.8, $40,000; 0.2, $0], even though EMV(L1) = $32,000 > EMV(L2) = $30,000.
13. Attitudes toward risk
[Figure: utility curve U(reward) for a 50/50 lottery L over $0 and $1000. EMV(L) = $500, but U(L) < U($500); the certain monetary equivalent is about $400, and the difference is the insurance / risk premium.]
- U concave: risk averse; U linear: risk neutral; U convex: risk seeking.
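To make the certainty-equivalent and risk-premium notions concrete, here is a minimal numeric sketch; U(x) = sqrt(x) is an assumed risk-averse utility, not the curve from the original figure.

```python
# Sketch: certainty equivalent and risk premium for a 50/50 lottery over $0 and $1000.
# U(x) = sqrt(x) is an assumed concave (risk-averse) utility, not the slide's curve.
from math import sqrt

U = sqrt
p, lo, hi = 0.5, 0.0, 1000.0

emv = p * hi + (1 - p) * lo           # expected monetary value of the lottery = 500
eu = p * U(hi) + (1 - p) * U(lo)      # expected utility of the lottery
ce = eu ** 2                          # certainty equivalent: U(ce) = eu, so ce = eu**2 for sqrt
risk_premium = emv - ce               # what a risk-averse agent would pay to avoid the risk

print(emv, round(ce, 2), round(risk_premium, 2))  # 500.0 250.0 250.0 (risk averse: ce < emv)
```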
14. Human judgment under uncertainty
- Is decision theory compatible with human judgment under uncertainty?
- Are people experts in reasoning under uncertainty? How well do they perform? What kinds of heuristics do they use?
- Choosing L1 over L2 on slide 6 implies 0.2·U($40k) > 0.25·U($30k), i.e. 0.8·U($40k) > U($30k); yet the common preference for the sure $30,000 on slide 12 implies 0.8·U($40k) < U($30k), a contradiction.
15. Student group utility
- For each monetary amount, adjust p until half the class votes for the lottery [p, $10,000; 1-p, $0] over the sure amount.
16. Technology forecasting
- "I think there is a world market for about five computers." - Thomas J. Watson, Sr., Chairman of the Board of IBM, 1943
- "There doesn't seem to be any real limit to the growth of the computer industry." - Thomas J. Watson, Sr., Chairman of the Board of IBM, 1968
17. Maximizing expected utility
- EU(IN)  = 0.7 × 0.632 + 0.3 × 0.699 = 0.6521
- EU(OUT) = 0.7 × 0.865 + 0.3 × 0 = 0.6055
- Action* = arg max {EU(IN), EU(OUT)} = IN
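The same party computation as a short sketch; the probabilities P(Dry) = 0.7, P(Wet) = 0.3 and the normalized utilities are read off the slide.

```python
# MEU for the party problem: P(Dry) = 0.7, P(Wet) = 0.3, and the
# normalized utilities U(action, weather) given on the slide.
P = {"Dry": 0.7, "Wet": 0.3}
U = {("IN", "Dry"): 0.632, ("IN", "Wet"): 0.699,
     ("OUT", "Dry"): 0.865, ("OUT", "Wet"): 0.0}

def EU(action):
    """Expected utility of an action, averaging over the weather."""
    return sum(P[w] * U[(action, w)] for w in P)

best = max(["IN", "OUT"], key=EU)
print({a: round(EU(a), 4) for a in ["IN", "OUT"]}, "->", best)
# {'IN': 0.6521, 'OUT': 0.6055} -> IN
```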
18. Multi-attribute utilities
- Many aspects of an outcome combine to determine our preferences:
  - Vacation planning: cost, flying time, beach quality, food quality, etc.
  - Medical decision making: risk of death (micromorts), quality of life (QALYs), cost of treatment, etc.
- For rational decision making, all relevant factors must be combined into a single utility function: U(a, b, c, ...) = f(f1(a), f2(b), ...), where f is a simple function such as addition (see the sketch below).
- f can be additive in the case of mutual preference independence, which occurs when it is always preferable to increase the value of an attribute given that all other attributes are fixed.
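A minimal sketch of an additive multi-attribute utility for the vacation example; the single-attribute value functions and weights are illustrative assumptions, not values from the slides.

```python
# Sketch: an additive multi-attribute utility for the vacation example.
# The single-attribute value functions and weights are illustrative assumptions.
f_cost  = lambda cost: -cost / 1000.0    # cheaper is better
f_fly   = lambda hours: -hours / 10.0    # shorter flights are better
f_beach = lambda quality: quality        # beach quality on a 0..1 scale

def U(cost, fly_hours, beach_quality, w=(0.5, 0.2, 0.3)):
    # Mutual preference independence lets us add weighted single-attribute terms.
    parts = (f_cost(cost), f_fly(fly_hours), f_beach(beach_quality))
    return sum(wi * fi for wi, fi in zip(w, parts))

# Compare two candidate vacations: (cost, flying hours, beach quality).
print(U(1200, 3, 0.9), U(800, 8, 0.4))   # roughly -0.39 vs -0.44: the first wins under these weights
```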
19. Decision graphs / influence diagrams
[Influence diagram: chance nodes earthquake, burglary, alarm, call, newscast, goods recovered, miss meeting; action (decision) node go home?; utility node.]
20. Optimal policy
- Choose the action given the evidence: MEU(go home? | call)
[Influence diagram as on slide 19.]

  Call? | EU(Go home) | EU(Stay)
  Yes   | ?           | ?
  No    | ?           | ?
22. Optimal policy
- Choose the action given the evidence: MEU(go home? | call)
[Influence diagram as on slide 19.]

  Call? | EU(Go home) | EU(Stay) | MEU(· | Call)
  Yes   | 37          | 13       | 37
  No    | 53          | 83       | 83
23. Value of information
- What is it worth to get another piece of information?
- What is the increase in (maximized) expected utility if I make the decision with an additional piece of information?
- Additional information (if free) cannot make you worse off.
- Information has no value if it will not change your decision.
24. Optimal policy with additional evidence
[Influence diagram as on slide 19.]
- How much better can we do if we have evidence about newscast? (Should we ask for evidence about newscast?)
25. Optimal policy with additional evidence
[Influence diagram as on slide 19.]

  Call? | Newscast | EU(Go home) / EU(Stay)
  Yes   | Quake    | 44 / 45
  Yes   | NoQuake  | 35 / 6
  No    | Quake    | 51 / 80
  No    | NoQuake  | 52 / 84

  Call? | Newscast | Go home?
  Yes   | Quake    | NO
  Yes   | NoQuake  | YES
  No    | Quake    | NO
  No    | NoQuake  | NO
26. Value of perfect information
- The general case: we assume that exact evidence can be obtained about the value of some random variable Ej.
- The agent's current knowledge is E.
- The value of the current best action α is defined by
  EU(α | E) = max_a Σ_s P(Result(a) = s | a, E) · U(s)
- With the new evidence Ej = ej, the value of the new best action α_ej will be
  EU(α_ej | E, ej) = max_a Σ_s P(Result(a) = s | a, E, ej) · U(s)
27. VPI (cont'd)
- However, we do not have this new evidence in hand, so we can only compute the expectation of the resulting utility over the possible values of Ej:
  Σ_ej P(Ej = ej | E) · EU(α_ej | E, Ej = ej)
- The value of perfect information about Ej is then
  VPI_E(Ej) = ( Σ_ej P(Ej = ej | E) · EU(α_ej | E, Ej = ej) ) − EU(α | E)
28. Properties of VPI
- Positive: ∀E, E1: VPI_E(E1) ≥ 0
- Non-additive (in general): VPI_E(E1, E2) ≠ VPI_E(E1) + VPI_E(E2)
- Order-invariant: VPI_E(E1, E2) = VPI_E(E1) + VPI_{E,E1}(E2) = VPI_E(E2) + VPI_{E,E2}(E1)
29. Example
- What is the value of information about Newscast?
30. Example (cont'd)

  Call? | MEU(· | Call)
  Yes   | 36.74
  No    | 83.23

  Call? | Newscast | MEU(· | Call, Newscast)
  Yes   | Quake    | 45.20
  Yes   | NoQuake  | 35.16
  No    | Quake    | 80.89
  No    | NoQuake  | 83.39

  Call? | P(Newscast = Quake | Call) | P(Newscast = NoQuake | Call)
  Yes   | .1794                      | .8206
  No    | .0453                      | .9547
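Combining the tables above with the VPI definition from slide 27, a minimal sketch of the computation:

```python
# VPI of Newscast given the Call evidence, using the tabulated values above:
# VPI_E(Newscast) = sum_n P(n | E) * MEU(. | E, n)  -  MEU(. | E)
meu_call = {"Yes": 36.74, "No": 83.23}
meu_call_news = {("Yes", "Quake"): 45.20, ("Yes", "NoQuake"): 35.16,
                 ("No", "Quake"): 80.89, ("No", "NoQuake"): 83.39}
p_news = {("Yes", "Quake"): 0.1794, ("Yes", "NoQuake"): 0.8206,
          ("No", "Quake"): 0.0453, ("No", "NoQuake"): 0.9547}

for call in ["Yes", "No"]:
    expected_meu = sum(p_news[(call, n)] * meu_call_news[(call, n)]
                       for n in ["Quake", "NoQuake"])
    vpi = expected_meu - meu_call[call]
    print(f"Call={call}: VPI(Newscast) = {vpi:.2f}")
# Call=Yes: VPI(Newscast) = 0.22   (small but positive)
# Call=No: VPI(Newscast) = 0.05
```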
31. Sequential decisions
- So far we have considered decisions in static situations, but most situations are dynamic!
- If I don't attend CS440 today, will I be kicked out of the class?
- If I don't attend CS440 today, will I be better off in the future?
- Action A: Attend / Do not attend
- State S: Professor hates me / Professor does not care, with transition model P(S | A)
- Evidence E: Professor looks upset / not upset
- Utility U(S): the probability of being expelled from the class
32. Sequential decisions
- Extend the static structure over time, just like an HMM, with decisions and utilities added.
- One small caveat: a slightly different representation works better.
33. Partially observable Markov decision processes (POMDPs)
- Actions at time t should impact the state at t+1.
- Use rewards (R) instead of utilities (U).
- Actions directly determine rewards.
[Dynamic decision network: actions A0, A1, A2, ...; states S0, S1, S2, ... with transition model P(St | St-1, At-1); observations E0, E1, E2, ... with sensor model P(Et | St); rewards R0, R1, R2, ... with Rt = R(St).]
34. POMDP Problems
- Objective: find a sequence of actions that takes one from an initial state to a final state while maximizing some notion of total / future reward. E.g., find a sequence of actions that takes a car from point A to point B while minimizing time and fuel consumed.
35. Example (POMDPs)
- Optimal dialog modeling, e.g., an automated airline reservation system.
- Actions:
  - System prompts: "How may I help you?", "Please specify your favorite airline", "Where are you leaving from?", "Do you mind leaving at a different time?", ...
- States:
  - (Origin, Destination, Airline, Flight, Departure, Arrival, ...)
- Reference: "A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies", Levin, Pieraccini, and Eckert, IEEE TSAP, 2000.
36. Example 2 (POMDP)
- Optimal control:
  - States and actions are continuous.
  - Objective: design optimal control laws for guiding objects from a start position to a goal position (e.g., a lunar lander).
- Actions:
  - Engine thrust, robot arm torques, ...
- States:
  - Positions and velocities of objects / robotic arm joints, ...
- Reward:
  - Usually specified in terms of cost (reward × (-1)): cost of fuel, battery charge loss, energy loss, ...
37. Markov decision processes (MDPs)
- Let's make life a bit simpler: assume we know the state of the world exactly.
[Dynamic decision network as for the POMDP, but without observation nodes: actions A0, A1, A2, ...; states S0, S1, S2, ... with transition model P(St | St-1, At-1); rewards R0, R1, R2, ... with Rt = R(St).]
38. Examples
- Blackjack:
  - Objective: have your card sum be greater than the dealer's without exceeding 21.
  - States (200 of them):
    - Current sum (12-21)
    - Dealer's showing card (ace-10)
    - Do I have a usable ace?
  - Reward: +1 for winning, 0 for a draw, -1 for losing.
  - Actions: stick (stop receiving cards), hit (receive another card).
39. MDP Fundamentals
- We mentioned (for POMDPs) that the goal is to select actions that maximize reward. But which reward? (The discounted case is sketched after this list.)
  - Immediate? a_t = arg max_a E[ R(s_{t+1}) ]
  - Cumulative? a_t = arg max_a E[ R(s_{t+1}) + R(s_{t+2}) + R(s_{t+3}) + ... ]
  - Discounted? a_t = arg max_a E[ R(s_{t+1}) + γ·R(s_{t+2}) + γ²·R(s_{t+3}) + ... ]
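For instance, the discounted criterion weights a reward sequence as in this small sketch (γ = 0.9 is the discount factor used in the later example):

```python
# Sketch: discounted return of a reward sequence R(s_{t+1}), R(s_{t+2}), ...
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * R(s_{t+1+k}) over the sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1, -1]))  # 1 + 0.9 + 0.81 - 0.729 = 1.981
```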
40. Utility and utility maximization in MDPs
- Assume we are in state s_t and want to find the best sequence of future actions that maximizes the discounted reward from s_t on.
[Diagram: actions a_t, a_{t+1}, a_{t+2}, ...; states s_t, s_{t+1}, s_{t+2}, ...; rewards R_t, R_{t+1}, R_{t+2}, ...]
41. Utility and utility maximization in MDPs
- Assume we are in state s_t and want to find the best sequence of future actions that maximizes the discounted reward from s_t on.
- Convert this into a simplified model by compounding the future states, actions, and rewards.
[Diagram: state s_t with a single compound action node A and a single utility node U representing the maximum expected utility from s_t.]
42. Bellman equation and the value iteration algorithm
- Do we need to search over all action sequences a^T, T → ∞? No!
- Best immediate action: a*(s) = arg max_a Σ_{s'} P(s' | s, a) · U(s')
- Bellman update: U(s) ← R(s) + γ · max_a Σ_{s'} P(s' | s, a) · U(s')
43. Proof of the Bellman update equation
44. Example
- If I don't attend CS440 today, will I be better off in the future?
- Actions: Attend (A) / Don't attend (NA)
- States: Learned topic (L) / Did not learn topic (NL)
- Reward: +1 if Learned, -1 if Did not learn
- Discount factor: γ = 0.9
- Transition probabilities P(next state | current state, action):

  Next state         | (A, L) | (A, NL) | (NA, L) | (NA, NL)
  Learned (L)        | 0.9    | 0.5     | 0.6     | 0.2
  Did not learn (NL) | 0.1    | 0.5     | 0.4     | 0.8

- Bellman equations:
  U(L)  =  1 + 0.9 · max{ 0.9·U(L) + 0.1·U(NL), 0.6·U(L) + 0.4·U(NL) }
  U(NL) = -1 + 0.9 · max{ 0.5·U(L) + 0.5·U(NL), 0.2·U(L) + 0.8·U(NL) }
45. Computing MDP state utilities: value iteration
- How can one solve for U(L) and U(NL) in the previous example?
- Answer: the value iteration algorithm (sketched below). Start with some initial utilities U(L), U(NL), then iterate the Bellman update until convergence.
[Plot: U(L) and U(NL) versus iteration number; both curves level off within about 50 iterations, with U(L) above U(NL).]
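A minimal sketch of value iteration for this two-state example, using the transition probabilities, rewards, and γ from slide 44; after about 50 sweeps it produces utilities close to the ≈7.16 and ≈4.03 used on the next slide.

```python
# Value iteration for the attend / don't-attend MDP of slide 44.
# States: L (learned), NL (did not learn); actions: A (attend), NA (don't attend).
P = {  # P[(action, current state)] -> distribution over next states
    ("A", "L"):   {"L": 0.9, "NL": 0.1},
    ("A", "NL"):  {"L": 0.5, "NL": 0.5},
    ("NA", "L"):  {"L": 0.6, "NL": 0.4},
    ("NA", "NL"): {"L": 0.2, "NL": 0.8},
}
R = {"L": 1.0, "NL": -1.0}
gamma = 0.9

U = {"L": 0.0, "NL": 0.0}
for _ in range(50):  # ~50 Bellman sweeps, matching the plot above
    U = {s: R[s] + gamma * max(sum(P[(a, s)][s2] * U[s2] for s2 in U)
                               for a in ("A", "NA"))
         for s in U}

print({s: round(u, 4) for s, u in U.items()})
# roughly {'L': 7.15, 'NL': 4.03}, close to the values used on the next slide
```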
46. Optimal policy
- Given the utilities from value iteration, find the optimal policy:
  A*(L)  = arg max { 0.9·U(L) + 0.1·U(NL), 0.6·U(L) + 0.4·U(NL) }
         = arg max { 0.9·7.1574 + 0.1·4.0324, 0.6·7.1574 + 0.4·4.0324 }
         = arg max { 6.8449, 5.9074 } = Attend
  A*(NL) = arg max { 0.5·U(L) + 0.5·U(NL), 0.2·U(L) + 0.8·U(NL) }
         = arg max { 5.5949, 4.6574 } = Attend
47. Policy iteration
- Instead of iterating in the space of utility values, iterate over policies:
  - Start with some policy, e.g., A(L), A(NL).
  - Compute the utility values U(L), U(NL) under that policy (policy evaluation).
  - Compute a new policy that is greedy with respect to these utilities (policy improvement), and repeat until the policy stops changing (see the sketch below).
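A minimal sketch of policy iteration on the same two-state example (same transition model, rewards, and γ as in the value-iteration sketch); policy evaluation here simply iterates the fixed-policy Bellman equation instead of solving the linear system exactly.

```python
# Policy iteration for the same attend / don't-attend MDP (same P, R, gamma as above).
P = {("A", "L"): {"L": 0.9, "NL": 0.1}, ("A", "NL"): {"L": 0.5, "NL": 0.5},
     ("NA", "L"): {"L": 0.6, "NL": 0.4}, ("NA", "NL"): {"L": 0.2, "NL": 0.8}}
R = {"L": 1.0, "NL": -1.0}
gamma, states, actions = 0.9, ("L", "NL"), ("A", "NA")

policy = {"L": "NA", "NL": "NA"}   # arbitrary initial policy
while True:
    # Policy evaluation: iterate U(s) = R(s) + gamma * sum_s' P(s' | s, policy(s)) * U(s')
    U = {s: 0.0 for s in states}
    for _ in range(100):
        U = {s: R[s] + gamma * sum(P[(policy[s], s)][s2] * U[s2] for s2 in states)
             for s in states}
    # Policy improvement: act greedily with respect to the current utilities
    new_policy = {s: max(actions,
                         key=lambda a: sum(P[(a, s)][s2] * U[s2] for s2 in states))
                  for s in states}
    if new_policy == policy:
        break
    policy = new_policy

print(policy, {s: round(U[s], 2) for s in states})
# {'L': 'A', 'NL': 'A'}: Attend in both states, matching slide 46
```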