Decisions under uncertainty

Rutgers CS440, Fall 2003

1
Decisions under uncertainty
  • Reading Ch. 16, AIMA 2nd Ed.

2
Outline
  • Decisions, preferences, utility functions
  • Influence diagrams
  • Value of information

3
Decision making
  • Decision: an irrevocable allocation of domain
    resources
  • Decisions should be made so as to maximize
    expected utility
  • Questions
  • Why make decisions based on average or expected
    utility?
  • Why can one assume that utility functions exist?
  • Can an agent act rationally by expressing
    preferences between states without giving them
    numeric values?
  • Can every preference structure be captured by
    assigning a single number to every state?

4
Simple decision problem
  • Party decision problem: inside (IN) or outside (OUT)?

[Decision tree: Action IN: state Dry -> Regret, state Wet -> Relief;
 Action OUT: state Dry -> Perfect!, state Wet -> Disaster]
5
Value function
  • Numerical score over all possible states of the
    world

Action Weather Value
OUT Dry 100
IN Wet 60
IN Dry 50
OUT Wet 0
6
Preferences
  • Agent chooses among prizes (A, B, ...) and lotteries
    (situations with uncertain prizes)

L1 = [ 0.2, $40,000; 0.8, $0 ]
L2 = [ 0.25, $30,000; 0.75, $0 ]

Which is preferred, L1 or L2?

A ≻ B : A is preferred to B
A ≺ B : B is preferred to A
A ∼ B : indifference between A and B
7
Desired properties for preferences over lotteries
  • If you prefer $100 over $0 AND p < q, then L1 ≺ L2, where

L1 = [ p, $100; 1-p, $0 ]
L2 = [ q, $100; 1-q, $0 ]
8
Properties of (rational) preference
  • Lead to rational agent behavior
  • Orderability: ( A ≻ B ) ∨ ( B ≻ A ) ∨ ( A ∼ B )
  • Transitivity: ( A ≻ B ) ∧ ( B ≻ C ) ⇒ ( A ≻ C )
  • Continuity: A ≻ B ≻ C ⇒ ∃p  [ p, A; 1-p, C ] ∼ B
  • Substitutability: A ∼ B ⇒ ∀p  [ p, A; 1-p, C ] ∼ [ p, B; 1-p, C ]
  • Monotonicity: A ≻ B ⇒ ( p > q ⇔ [ p, A; 1-p, B ] ≻ [ q, A; 1-q, B ] )

9
Preference ⇒ expected utility
  • Properties of preference lead to the existence
    (Ramsey 1931; von Neumann & Morgenstern 1944) of a
    utility function U such that

L1 = [ p, $100; 1-p, $0 ]  ≺  L2 = [ q, $100; 1-q, $0 ]

IFF

EU(L1) < EU(L2)
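
The existence claim can be made concrete with a small numerical sketch. The
Python snippet below (illustrative only; the linear and square-root utility
functions are assumptions, not from the slides) computes expected utilities of
the two lotteries from the Preferences slide:

# Expected utility of a lottery: a lottery is a list of (probability, prize) pairs.
def expected_utility(lottery, U):
    return sum(p * U(x) for p, x in lottery)

L1 = [(0.20, 40000), (0.80, 0)]
L2 = [(0.25, 30000), (0.75, 0)]

linear = lambda x: x               # assumed risk-neutral utility: money itself
concave = lambda x: x ** 0.5       # assumed risk-averse (concave) utility

print(expected_utility(L1, linear), expected_utility(L2, linear))
# 8000.0 vs 7500.0 -> L1 preferred under the linear utility
print(expected_utility(L1, concave), expected_utility(L2, concave))
# 40.0 vs ~43.3 -> L2 preferred under the concave utility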
10
Properties of utility
  • Utility is a function that maps states to real
    numbers
  • Standard approach to assessing utilities of states:
  • Compare state A to a standard lottery L = [ p, Ubest; 1-p, Uworst ],
    where Ubest = best possible event, Uworst = worst possible event
  • Adjust p until A ∼ L

[Figure: assessing $30 against the standard lottery
 ( 0.999999, continue as before; 0.000001, instant death )]
11
Utility scales
  • Normalized utilities: Ubest = 1.0, Uworst = 0.0
  • Micromorts: one-millionth chance of death
  • useful for Russian roulette, paying to reduce
    product risks, etc.
  • QALYs: quality-adjusted life years
  • useful for medical decisions involving
    substantial risk
  • Note: behavior is invariant w.r.t. positive
    linear transformations U'(s) = A·U(s) + B, A > 0

12
Utility vs Money
  • Utility is NOT monetary payoff
  • Example: most people prefer L2 (a sure $30,000) to L1 (an 80% chance of
    $40,000), i.e., L2 ≻ L1, even though
    EMV(L1) = $32,000  >  EMV(L2) = $30,000
13
Attitudes toward risk
[Figure: concave utility curve U(reward). For a 50/50 lottery L over $0 and
 $1000, the expected monetary value is 500, but the certain monetary
 equivalent (the reward whose utility equals U(L)) is only about 400; the
 difference is the insurance / risk premium.]

U concave: risk averse    U linear: risk neutral    U convex: risk seeking
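
As a numerical illustration (the square-root utility below is an assumed
example; it yields a certainty equivalent of 250 rather than the roughly 400
sketched in the figure), the certainty equivalent and risk premium of a 50/50
lottery over $0 and $1000 can be computed directly:

# Certainty equivalent and risk premium under an assumed concave utility.
import math

U = math.sqrt                                  # assumed utility U(x) = sqrt(x)
lottery = [(0.5, 0.0), (0.5, 1000.0)]

emv = sum(p * x for p, x in lottery)           # expected monetary value = 500
eu = sum(p * U(x) for p, x in lottery)         # expected utility ~ 15.81
certainty_equivalent = eu ** 2                 # U^-1(eu) for U = sqrt -> ~ 250
risk_premium = emv - certainty_equivalent      # ~ 250: what insuring is worth

print(emv, certainty_equivalent, risk_premium)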
14
Human judgment under uncertainty
  • Is decision theory compatible with human judgment
    under uncertainty?
  • Are people experts in reasoning under
    uncertainty? How well do they perform? What kind
    of heuristics do they use?

Choosing L1 = [ 0.2, $40k; 0.8, $0 ] over L2 = [ 0.25, $30k; 0.75, $0 ] implies
0.2 U(40k) > 0.25 U(30k), i.e., 0.8 U(40k) > U(30k);
yet preferring a sure $30k to [ 0.8, $40k; 0.2, $0 ] implies 0.8 U(40k) < U(30k),
so the two common choices cannot both follow from a single utility function.
15
Student group utility
  • For each sure amount, adjust p until half the class votes for the
    lottery [ p, $10,000; 1-p, $0 ] over the sure amount

16
Technology forecasting
  • "I think there is a world market for about five computers."
    - Thomas J. Watson, Sr., Chairman of the Board of IBM, 1943
  • "There doesn't seem to be any real limit to the growth of the computer
    industry." - Thomas J. Watson, Sr., Chairman of the Board of IBM, 1968

17
Maximizing expected utility
EU(IN)  = 0.7 × 0.632 + 0.3 × 0.699 = 0.6521
EU(OUT) = 0.7 × 0.865 + 0.3 × 0     = 0.6055

Action = arg MEU{ IN, OUT } = arg max{ EU(IN), EU(OUT) } = IN
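
A minimal sketch of this calculation (assuming, as the numbers above suggest,
P(Dry) = 0.7 and the normalized utilities shown):

# Maximum expected utility for the party problem.
P_weather = {"Dry": 0.7, "Wet": 0.3}                     # assumed weather model
utility = {("IN", "Dry"): 0.632, ("IN", "Wet"): 0.699,   # normalized utilities
           ("OUT", "Dry"): 0.865, ("OUT", "Wet"): 0.0}

def expected_utility(action):
    return sum(p * utility[(action, w)] for w, p in P_weather.items())

for a in ("IN", "OUT"):
    print(a, expected_utility(a))            # EU(IN) ~ 0.6521, EU(OUT) ~ 0.6055
print(max(("IN", "OUT"), key=expected_utility))   # IN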
18
Multi-attribute utilities
  • Many aspects of an outcome combine to determine
    our preferences
  • vacation planning: cost, flying time, beach
    quality, food quality, etc.
  • medical decision making: risk of death
    (micromorts), quality of life (QALYs), cost of
    treatment, etc.
  • For rational decision making, must combine all
    relevant factors into a single utility
    function: U(a,b,c,...) = f[ f1(a), f2(b), ... ],
    where f is a simple function such as addition
  • f can be additive in the case of mutual preference
    independence, which occurs when it is always
    preferable to increase the value of an attribute
    given that all other attributes are fixed

19
Decision graphs / Influence diagrams
[Influence diagram: chance nodes earthquake, burglary, alarm, call, newscast,
 goods recovered, miss meeting; action node "go home?"; utility node Utility]
20
Optimal policy
[Same influence diagram as before]

Choose the action given the evidence: MEU( go home | call )

Call? | EU( Go home ) | EU( Stay )
Yes   | ?             | ?
No    | ?             | ?
21
Optimal policy
[Same influence diagram as before]

Choose the action given the evidence: MEU( go home | call )
22
Optimal policy
[Same influence diagram as before]

Choose the action given the evidence: MEU( go home | call )

Call? | EU( Go home ) | EU( Stay ) | MEU( · | Call )
Yes   | 37            | 13         | 37
No    | 53            | 83         | 83
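
A minimal sketch of how the policy is read off this table (the numbers are
copied from above):

# Pick the MEU action for each value of the observed evidence "Call?".
eu = {("Yes", "Go home"): 37, ("Yes", "Stay"): 13,
      ("No",  "Go home"): 53, ("No",  "Stay"): 83}

policy = {call: max(("Go home", "Stay"), key=lambda a: eu[(call, a)])
          for call in ("Yes", "No")}
print(policy)   # {'Yes': 'Go home', 'No': 'Stay'}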
23
Value of information
  • What is it worth to get another piece of
    information?
  • What is the increase in (maximized) expected
    utility if I make a decision with an additional
    piece of information?
  • Additional information (if free) cannot make you
    worse off.
  • Information has no value if it will not change
    your decision.

24
Optimal policy with additional evidence
[Same influence diagram as before]

How much better can we do if we have evidence about newscast?
( Should we ask for evidence about newscast? )
25
Optimal policy with additional evidence
[Same influence diagram as before]

Call | Newscast | EU( Go home ) / EU( Stay )
Yes  | Quake    | 44 / 45
Yes  | No       | 35 / 6
No   | Quake    | 51 / 80
No   | No       | 52 / 84

Call | Newscast | Go home?
Yes  | Quake    | NO
Yes  | No       | YES
No   | Quake    | NO
No   | No       | NO
26
Value of perfect information
  • The general case: we assume that exact evidence
    can be obtained about the value of some random
    variable Ej.
  • The agent's current knowledge is E.
  • The value of the current best action α is defined
    as shown below.
  • With the new evidence Ej, the value of the new best
    action αEj is also shown below.
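
In the notation of AIMA 2nd ed., Ch. 16 (a sketch; Result_i(A) denotes the
i-th outcome of action A), these two quantities can be written as

EU(\alpha \mid E) = \max_A \sum_i U(\mathrm{Result}_i(A)) \, P(\mathrm{Result}_i(A) \mid Do(A), E)

EU(\alpha_{e_j} \mid E, E_j = e_j) = \max_A \sum_i U(\mathrm{Result}_i(A)) \, P(\mathrm{Result}_i(A) \mid Do(A), E, E_j = e_j)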

27
VPI (contd)
  • However, we do not have this new evidence in
    hand. Hence, we can only compute the expectation,
    over the possible values ej of Ej, of the resulting
    expected utility
  • The value of perfect information of Ej is then
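
In the same notation, a sketch of the value-of-perfect-information formula:

VPI_E(E_j) = \Big( \sum_k P(E_j = e_{jk} \mid E) \, EU(\alpha_{e_{jk}} \mid E, E_j = e_{jk}) \Big) - EU(\alpha \mid E)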

28
Properties of VPI
  1. Positive: ∀E, E1  VPI_E(E1) ≥ 0
  2. Non-additive (in general): VPI_E(E1,E2) ≠
     VPI_E(E1) + VPI_E(E2)
  3. Order-invariant: VPI_E(E1,E2) = VPI_E(E1) +
     VPI_{E,E1}(E2) = VPI_E(E2) + VPI_{E,E2}(E1)

29
Example
  • What is the value of information Newscast?

30
Example (contd)
Call? | MEU( · | Call )
Yes   | 36.74
No    | 83.23

Call? | Newscast | MEU( · | Call, Newscast )
Yes   | Quake    | 45.20
Yes   | NoQuake  | 35.16
No    | Quake    | 80.89
No    | NoQuake  | 83.39

Call? | P( Newscast=Quake | Call ) | P( Newscast=NoQuake | Call )
Yes   | .1794                      | .8206
No    | .0453                      | .9547
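
A minimal sketch of the VPI computation from these tables (all values copied
from above):

# VPI of Newscast, computed separately for each value of the evidence Call.
meu = {"Yes": 36.74, "No": 83.23}                        # MEU( . | Call )
meu_with_news = {("Yes", "Quake"): 45.20, ("Yes", "NoQuake"): 35.16,
                 ("No",  "Quake"): 80.89, ("No",  "NoQuake"): 83.39}
p_news = {("Yes", "Quake"): .1794, ("Yes", "NoQuake"): .8206,
          ("No",  "Quake"): .0453, ("No",  "NoQuake"): .9547}

for call in ("Yes", "No"):
    expected_meu = sum(p_news[(call, n)] * meu_with_news[(call, n)]
                       for n in ("Quake", "NoQuake"))
    print(call, expected_meu - meu[call])
# VPI(Newscast) is roughly 0.22 when Call = Yes and 0.05 when Call = No
# (small but positive, as expected).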
31
Sequential Decisions
  • So far, decisions in static situations. But most
    situations are dynamic!
  • If I don't attend CS440 today, will I be kicked
    out of the class?
  • If I don't attend CS440 today, will I be better
    off in the future?

[Decision network: action A = Attend / Do not attend; state S = Professor
 hates me / Professor does not care, with model P(S | A); evidence E =
 Professor looks upset / not upset; utility U(S) = probability of being
 expelled from class]
32
Sequential decisions
  • Extend the static structure over time, just like an
    HMM, but with decisions and utilities added
  • One small caveat: a slightly different representation
    works better

33
Partially-observed Markov decision processes
(POMDP)
  • Actions at time t should impact the state at t+1
  • Use Rewards (R) instead of utilities (U)
  • Actions directly determine rewards

[Dynamic decision network: actions A0, A1, A2, ...; hidden states S0, S1,
 S2, ... with transition model P(St | St-1, At-1); observations E0, E1,
 E2, ... with sensor model P(Et | St); rewards R0, R1, R2, ... with Rt = R(St)]
34
POMDP Problems
  • Objective: find a sequence of actions that
    takes one from an initial state to a final state
    while maximizing some notion of total/future
    reward. E.g., find a sequence of actions that
    takes a car from point A to point B while
    minimizing time and consumed fuel.

35
Example (POMDPs)
  • Optimal dialog modeling, e.g., an automated airline
    reservation system
  • Actions
  • System prompts: "How may I help you?", "Please
    specify your favorite airline", "Where are you
    leaving from?", "Do you mind leaving at a
    different time?", ...
  • States
  • (Origin, Destination, Airline, Flight,
    Departure, Arrival, ...)

A Stochastic Model of Human-Machine Interaction
for Learning Dialog Strategies Levin,
Pieraccini, and Eckert, IEEE TSAP, 2000
36
Example 2 (POMDP)
  • Optimal control
  • States/actions are continuous
  • Objective: design optimal control laws for
    guiding objects from a start position to a goal
    position (e.g., the lunar lander)
  • Actions
  • Engine thrust, robot arm torques,
  • States
  • Positions, velocities of objects/robotic arm
    joints,
  • Reward
  • Usually specified in terms of cost (reward⁻¹):
    cost of fuel, battery charge loss, energy loss, ...

37
Markov decision processes (MDPs)
  • Let's make life a bit simpler: assume we exactly
    know the state of the world

[Markov decision process: actions A0, A1, A2, ...; fully observed states
 S0, S1, S2, ... with transition model P(St | St-1, At-1); rewards
 R0, R1, R2, ... with Rt = R(St)]
38
Examples
  • Blackjack
  • Objective: have your card sum be greater than the
    dealer's without exceeding 21.
  • States (200 of them):
  • current sum (12-21)
  • dealer's showing card (ace-10)
  • do I have a usable ace?
  • Reward: +1 for winning, 0 for a draw, -1 for
    losing
  • Actions: stick (stop receiving cards), hit
    (receive another card)

39
MDP Fundamentals
  • We mentioned (in POMDPs) that the goal is to
    select actions that maximize reward
  • What reward?
  • Immediate?   at = arg max E[ R(st+1) ]
  • Cumulative?  at = arg max E[ R(st+1) + R(st+2) + R(st+3) + ... ]
  • Discounted?  at = arg max E[ R(st+1) + γ R(st+2) + γ² R(st+3) + ... ]

40
Utility utility maximization in MDPs
  • Assume we are in state st and want to find the
    best sequence of future actions that will
    maximize discounted reward from st on.

[Dynamic decision network unrolled from time t: actions at, at+1, at+2, ...;
 states st, st+1, st+2, ...; rewards Rt, Rt+1, Rt+2, ...; utility U]
41
Utility utility maximization in MDPs
  • Assume we are in state st and want to find the
    best sequence of future actions that will
    maximize discounted reward from st on.
  • Convert it into a simplified model by
    compounding states, actions, rewards

[Figure: the compounded model, with a single action node A, the current state
 st, and a utility node U whose value is the maximum expected utility]
42
Bellman equations value iteration algorithm
  • Do we need to search over all |A|^T action sequences, with T → ∞?
  • No! The best immediate action can be computed with the
    Bellman update, shown below.
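
In standard MDP notation (a sketch consistent with the worked example on the
following slides; R(s) is the reward, γ the discount factor, and P(s' | s, a)
the transition model):

Best immediate action:  \pi(s) = \arg\max_a \sum_{s'} P(s' \mid s, a) \, U(s')

Bellman update:  U_{i+1}(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a) \, U_i(s')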
43
Proof of Bellman update equation
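
A one-line sketch of the standard argument: separate the first reward from
the rest of the discounted sum and apply the Markov property,

U(s) = \max_\pi E\Big[ \sum_{t \ge 0} \gamma^t R(s_t) \,\Big|\, s_0 = s \Big]
     = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a) \, U(s')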
44
Example
  • If I don't attend CS440 today, will I be better
    off in the future?
  • Actions: Attend / Don't attend
  • States: Learned topic / Did not learn topic
  • Reward: +1 if Learned, -1 if Did not learn
  • Discount factor γ = 0.9
  • Transition probabilities:
  • Transition probabilities

P( St+1 | St, At )  | Attend, L | Attend, NL | Not attend, L | Not attend, NL
Learned (L)         | 0.9       | 0.5        | 0.6           | 0.2
Did not learn (NL)  | 0.1       | 0.5        | 0.4           | 0.8

U(L)  =  1 + 0.9 max{ 0.9 U(L) + 0.1 U(NL),  0.6 U(L) + 0.4 U(NL) }
U(NL) = -1 + 0.9 max{ 0.5 U(L) + 0.5 U(NL),  0.2 U(L) + 0.8 U(NL) }
45
Computing MDP state utilities: value iteration
  • How can one solve for U(L) and U(NL) in the
    previous example?
  • Answer: the value-iteration algorithm. Start with some
    initial utilities U(L), U(NL), then iterate the Bellman
    updates (see the sketch below)

[Plot: U(L) and U(NL) versus iteration number (0 to 50); both converge
 within roughly 20 iterations, with U(L) settling above U(NL)]
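
A minimal value-iteration sketch for this example (transition model, rewards,
and discount factor taken from the previous slide; the 50-update cutoff
mirrors the plot above):

# Value iteration for the attend-or-not MDP from the slides.
gamma = 0.9
R = {"L": 1.0, "NL": -1.0}
# P[s_next][(s, a)] = P( S_t+1 = s_next | S_t = s, A_t = a )
P = {"L":  {("L", "A"): 0.9, ("NL", "A"): 0.5, ("L", "NA"): 0.6, ("NL", "NA"): 0.2},
     "NL": {("L", "A"): 0.1, ("NL", "A"): 0.5, ("L", "NA"): 0.4, ("NL", "NA"): 0.8}}

U = {"L": 0.0, "NL": 0.0}           # arbitrary initial utilities
for _ in range(50):                 # 50 Bellman updates
    U = {s: R[s] + gamma * max(sum(P[s2][(s, a)] * U[s2] for s2 in ("L", "NL"))
                               for a in ("A", "NA"))
         for s in ("L", "NL")}

print(U)
# about U(L) = 7.15, U(NL) = 4.03, close to the 7.1574 / 4.0324 used on the
# next slide; iterating further converges to roughly 7.19 / 4.06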
46
Optimal policy
  • Given the utilities from value iteration, find the optimal policy

A(L)  = arg max{ 0.9 U(L) + 0.1 U(NL),  0.6 U(L) + 0.4 U(NL) }
      = arg max{ 0.9 · 7.1574 + 0.1 · 4.0324,  0.6 · 7.1574 + 0.4 · 4.0324 }
      = arg max{ 6.8449, 5.9074 }
      = Attend

A(NL) = arg max{ 0.5 U(L) + 0.5 U(NL),  0.2 U(L) + 0.8 U(NL) }
      = arg max{ 5.5949, 4.6574 }
      = Attend
47
Policy iteration
  • Instead of iterating in the space of utility
    values, iterate over policies (see the sketch below)
  • Start by assuming some policy, e.g., A(L), A(NL)
  • Compute the utility values U(L), U(NL) under that policy
  • Compute a new optimal policy from the utilities U(L), U(NL),
    and repeat until the policy no longer changes
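
A minimal policy-iteration sketch for the same attend-or-not MDP (same
transition model, rewards, and discount as above; policy evaluation is done
here by simple iterative updates rather than by solving the linear system):

# Policy iteration for the attend-or-not MDP from the slides.
gamma = 0.9
R = {"L": 1.0, "NL": -1.0}
P = {"L":  {("L", "A"): 0.9, ("NL", "A"): 0.5, ("L", "NA"): 0.6, ("NL", "NA"): 0.2},
     "NL": {("L", "A"): 0.1, ("NL", "A"): 0.5, ("L", "NA"): 0.4, ("NL", "NA"): 0.8}}
states, actions = ("L", "NL"), ("A", "NA")

def q(s, a, U):
    # Expected discounted value of taking action a in state s under utilities U.
    return R[s] + gamma * sum(P[s2][(s, a)] * U[s2] for s2 in states)

policy = {"L": "NA", "NL": "NA"}                 # arbitrary initial policy
while True:
    # Policy evaluation: utilities of the current (fixed) policy.
    U = {s: 0.0 for s in states}
    for _ in range(200):
        U = {s: q(s, policy[s], U) for s in states}
    # Policy improvement: act greedily with respect to those utilities.
    new_policy = {s: max(actions, key=lambda a: q(s, a, U)) for s in states}
    if new_policy == policy:
        break
    policy = new_policy

print(policy)   # {'L': 'A', 'NL': 'A'}: attend in both states, as on the previous slide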