Title: Decisions under uncertainty
1. Decisions under uncertainty
- Reading: Ch. 16, AIMA 2nd Ed.
2. Outline
- Decisions, preferences, utility functions
- Influence diagrams
- Value of information
3. Decision making
- Decisions: irrevocable allocations of domain resources.
- Decisions should be made so as to maximize expected utility.
- Questions:
  - Why make decisions based on average (expected) utility?
  - Why can one assume that utility functions exist?
  - Can an agent act rationally by expressing preferences between states without giving them numeric values?
  - Can every preference structure be captured by assigning a single number to every state?
4. Simple decision problem
- Party decision problem: inside or outside?

  Action | Weather | Outcome
  IN     | Dry     | Regret
  IN     | Wet     | Relief
  OUT    | Dry     | Perfect!
  OUT    | Wet     | Disaster
5. Value function
- A numerical score over all possible states of the world:

  Action | Weather | Value
  OUT    | Dry     | 100
  IN     | Wet     | 60
  IN     | Dry     | 50
  OUT    | Wet     | 0
6. Preferences
- An agent chooses among prizes (A, B, ...) and lotteries (situations with uncertain prizes), e.g.
  L1 = [0.2, $40,000; 0.8, $0]
  L2 = [0.25, $30,000; 0.75, $0]
  Which do you prefer?
- Notation:
  A ≻ B : A is preferred to B
  A ≺ B : B is preferred to A
  A ∼ B : indifference between A and B
7. Desired properties for preferences over lotteries
- If you prefer $100 to $0 and p < q, then for
  L1 = [p, $100; 1-p, $0]
  L2 = [q, $100; 1-q, $0]
  you should prefer L2 ≻ L1.
8. Properties of (rational) preference
- These properties lead to rational agent behavior:
- Orderability: (A ≻ B) ∨ (A ≺ B) ∨ (A ∼ B)
- Transitivity: (A ≻ B) ∧ (B ≻ C) ⇒ (A ≻ C)
- Continuity: A ≻ B ≻ C ⇒ ∃p [p, A; 1-p, C] ∼ B
- Substitutability: A ∼ B ⇒ ∀p [p, A; 1-p, C] ∼ [p, B; 1-p, C]
- Monotonicity: A ≻ B ⇒ (p > q ⇔ [p, A; 1-p, B] ≻ [q, A; 1-q, B])
9. Preference ⇒ expected utility
- The properties of rational preference imply the existence (Ramsey 1931; von Neumann & Morgenstern 1944) of a utility function U such that, for example, for
  L1 = [p, $100; 1-p, $0]
  L2 = [q, $100; 1-q, $0],
  L2 ≻ L1 if and only if EU(L2) > EU(L1), where EU(L) = Σ_i p_i U(prize_i).
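As a small illustration (not from the slides), the sketch below compares the two lotteries from slide 6 under two assumed utility functions: a linear (risk-neutral) utility favors L1, while a concave (risk-averse) square-root utility favors L2.

```python
# Sketch: expected utility of the two lotteries from slide 6.
# The utility functions below are illustrative assumptions, not from the slides.
from math import sqrt

L1 = [(0.20, 40_000), (0.80, 0)]   # lottery as (probability, prize) pairs
L2 = [(0.25, 30_000), (0.75, 0)]

def expected_utility(lottery, U):
    """EU(L) = sum_i p_i * U(prize_i)."""
    return sum(p * U(x) for p, x in lottery)

linear = lambda x: x            # risk-neutral utility (assumption)
concave = lambda x: sqrt(x)     # a risk-averse utility (assumption)

print(expected_utility(L1, linear), expected_utility(L2, linear))    # 8000.0 7500.0 -> prefers L1
print(expected_utility(L1, concave), expected_utility(L2, concave))  # 40.0 ~43.3   -> prefers L2
```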
10. Properties of utility
- Utility is a function that maps states to real numbers.
- Standard approach to assessing the utility of a state:
  - Compare state A to a standard lottery L = [p, u_best; 1-p, u_worst], where u_best is the best possible outcome and u_worst the worst possible outcome.
  - Adjust p until A ∼ L.
- Example standard lottery: [0.999999, continue as before; 0.000001, instant death], compared against, e.g., paying $30.
11. Utility scales
- Normalized utilities: u_best = 1.0, u_worst = 0.0.
- Micromorts: a one-millionth chance of death; useful for Russian roulette, paying to reduce product risks, etc.
- QALYs: quality-adjusted life years; useful for medical decisions involving substantial risk.
- Note: behavior is invariant w.r.t. positive linear transformations U'(s) = A·U(s) + B, with A > 0.
12. Utility vs Money
- Utility is NOT monetary payoff.
- Example: most people prefer L2 = [$30,000 for sure] over L1 = [0.8, $40,000; 0.2, $0], even though EMV(L1) = $32,000 > EMV(L2) = $30,000.
13. Attitudes toward risk
[Figure: utility curve U(reward) for a 50/50 lottery L over $0 and $1000. EMV(L) = $500, but U(L) < U($500); the certain monetary equivalent is about $400, and the difference is the insurance / risk premium.]
- U concave: risk averse; U linear: risk neutral; U convex: risk seeking.
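To make the certainty-equivalent and risk-premium notions concrete, here is a minimal numeric sketch; U(x) = sqrt(x) is an assumed risk-averse utility, not the curve from the original figure.

```python
# Sketch: certainty equivalent and risk premium for a 50/50 lottery over $0 and $1000.
# U(x) = sqrt(x) is an assumed concave (risk-averse) utility, not the slide's curve.
from math import sqrt

U = sqrt
p, lo, hi = 0.5, 0.0, 1000.0

emv = p * hi + (1 - p) * lo           # expected monetary value of the lottery = 500
eu = p * U(hi) + (1 - p) * U(lo)      # expected utility of the lottery
ce = eu ** 2                          # certainty equivalent: U(ce) = eu, so ce = eu**2 for sqrt
risk_premium = emv - ce               # what a risk-averse agent would pay to avoid the risk

print(emv, round(ce, 2), round(risk_premium, 2))  # 500.0 250.0 250.0 (risk averse: ce < emv)
```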
14. Human judgment under uncertainty
- Is decision theory compatible with human judgment under uncertainty?
- Are people experts in reasoning under uncertainty? How well do they perform? What kinds of heuristics do they use?
- Choosing L1 over L2 on slide 6 implies 0.2·U($40k) > 0.25·U($30k), i.e. 0.8·U($40k) > U($30k); yet the common preference for the sure $30,000 on slide 12 implies 0.8·U($40k) < U($30k), a contradiction.
15. Student group utility
- For each monetary amount, adjust p until half the class votes for the lottery [p, $10,000; 1-p, $0] over the sure amount.
16. Technology forecasting
- "I think there is a world market for about five computers." - Thomas J. Watson, Sr., Chairman of the Board of IBM, 1943
- "There doesn't seem to be any real limit to the growth of the computer industry." - Thomas J. Watson, Sr., Chairman of the Board of IBM, 1968
17. Maximizing expected utility
- EU(IN)  = 0.7 × 0.632 + 0.3 × 0.699 = 0.6521
- EU(OUT) = 0.7 × 0.865 + 0.3 × 0 = 0.6055
- Action* = arg max {EU(IN), EU(OUT)} = IN
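The same party computation as a short sketch; the probabilities P(Dry) = 0.7, P(Wet) = 0.3 and the normalized utilities are read off the slide.

```python
# MEU for the party problem: P(Dry) = 0.7, P(Wet) = 0.3, and the
# normalized utilities U(action, weather) given on the slide.
P = {"Dry": 0.7, "Wet": 0.3}
U = {("IN", "Dry"): 0.632, ("IN", "Wet"): 0.699,
     ("OUT", "Dry"): 0.865, ("OUT", "Wet"): 0.0}

def EU(action):
    """Expected utility of an action, averaging over the weather."""
    return sum(P[w] * U[(action, w)] for w in P)

best = max(["IN", "OUT"], key=EU)
print({a: round(EU(a), 4) for a in ["IN", "OUT"]}, "->", best)
# {'IN': 0.6521, 'OUT': 0.6055} -> IN
```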
18. Multi-attribute utilities
- Many aspects of an outcome combine to determine our preferences:
  - Vacation planning: cost, flying time, beach quality, food quality, etc.
  - Medical decision making: risk of death (micromorts), quality of life (QALYs), cost of treatment, etc.
- For rational decision making, all relevant factors must be combined into a single utility function: U(a, b, c, ...) = f(f1(a), f2(b), ...), where f is a simple function such as addition (see the sketch below).
- f can be additive in the case of mutual preference independence, which occurs when it is always preferable to increase the value of an attribute given that all other attributes are fixed.
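A minimal sketch of an additive multi-attribute utility for the vacation example; the single-attribute value functions and weights are illustrative assumptions, not values from the slides.

```python
# Sketch: an additive multi-attribute utility for the vacation example.
# The single-attribute value functions and weights are illustrative assumptions.
f_cost  = lambda cost: -cost / 1000.0    # cheaper is better
f_fly   = lambda hours: -hours / 10.0    # shorter flights are better
f_beach = lambda quality: quality        # beach quality on a 0..1 scale

def U(cost, fly_hours, beach_quality, w=(0.5, 0.2, 0.3)):
    # Mutual preference independence lets us add weighted single-attribute terms.
    parts = (f_cost(cost), f_fly(fly_hours), f_beach(beach_quality))
    return sum(wi * fi for wi, fi in zip(w, parts))

# Compare two candidate vacations: (cost, flying hours, beach quality).
print(U(1200, 3, 0.9), U(800, 8, 0.4))   # roughly -0.39 vs -0.44: the first wins under these weights
```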
19. Decision graphs / influence diagrams
[Influence diagram: chance nodes earthquake, burglary, alarm, call, newscast, goods recovered, miss meeting; action (decision) node go home?; utility node.]
20. Optimal policy
- Choose the action given the evidence: MEU(go home? | call)
[Influence diagram as on slide 19.]

  Call? | EU(Go home) | EU(Stay)
  Yes   | ?           | ?
  No    | ?           | ?
22. Optimal policy
- Choose the action given the evidence: MEU(go home? | call)
[Influence diagram as on slide 19.]

  Call? | EU(Go home) | EU(Stay) | MEU(· | Call)
  Yes   | 37          | 13       | 37
  No    | 53          | 83       | 83
23. Value of information
- What is it worth to get another piece of information?
- What is the increase in (maximized) expected utility if I make the decision with an additional piece of information?
- Additional information (if free) cannot make you worse off.
- Information has no value if it will not change your decision.
24. Optimal policy with additional evidence
[Influence diagram as on slide 19.]
- How much better can we do if we have evidence about newscast? (Should we ask for evidence about newscast?)
25. Optimal policy with additional evidence
[Influence diagram as on slide 19.]

  Call? | Newscast | EU(Go home) / EU(Stay)
  Yes   | Quake    | 44 / 45
  Yes   | NoQuake  | 35 / 6
  No    | Quake    | 51 / 80
  No    | NoQuake  | 52 / 84

  Call? | Newscast | Go home?
  Yes   | Quake    | NO
  Yes   | NoQuake  | YES
  No    | Quake    | NO
  No    | NoQuake  | NO
26. Value of perfect information
- The general case: we assume that exact evidence can be obtained about the value of some random variable Ej.
- The agent's current knowledge is E.
- The value of the current best action α is defined by
  EU(α | E) = max_a Σ_s P(Result(a) = s | a, E) · U(s)
- With the new evidence Ej = ej, the value of the new best action α_ej will be
  EU(α_ej | E, ej) = max_a Σ_s P(Result(a) = s | a, E, ej) · U(s)
27. VPI (cont'd)
- However, we do not have this new evidence in hand, so we can only compute the expectation of the resulting utility over the possible values of Ej:
  Σ_ej P(Ej = ej | E) · EU(α_ej | E, Ej = ej)
- The value of perfect information about Ej is then
  VPI_E(Ej) = ( Σ_ej P(Ej = ej | E) · EU(α_ej | E, Ej = ej) ) − EU(α | E)
28. Properties of VPI
- Positive: ∀E, E1: VPI_E(E1) ≥ 0
- Non-additive (in general): VPI_E(E1, E2) ≠ VPI_E(E1) + VPI_E(E2)
- Order-invariant: VPI_E(E1, E2) = VPI_E(E1) + VPI_{E,E1}(E2) = VPI_E(E2) + VPI_{E,E2}(E1)
29. Example
- What is the value of information about Newscast?
30. Example (cont'd)

  Call? | MEU(· | Call)
  Yes   | 36.74
  No    | 83.23

  Call? | Newscast | MEU(· | Call, Newscast)
  Yes   | Quake    | 45.20
  Yes   | NoQuake  | 35.16
  No    | Quake    | 80.89
  No    | NoQuake  | 83.39

  Call? | P(Newscast = Quake | Call) | P(Newscast = NoQuake | Call)
  Yes   | .1794                      | .8206
  No    | .0453                      | .9547
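Combining the tables above with the VPI definition from slide 27, a minimal sketch of the computation:

```python
# VPI of Newscast given the Call evidence, using the tabulated values above:
# VPI_E(Newscast) = sum_n P(n | E) * MEU(. | E, n)  -  MEU(. | E)
meu_call = {"Yes": 36.74, "No": 83.23}
meu_call_news = {("Yes", "Quake"): 45.20, ("Yes", "NoQuake"): 35.16,
                 ("No", "Quake"): 80.89, ("No", "NoQuake"): 83.39}
p_news = {("Yes", "Quake"): 0.1794, ("Yes", "NoQuake"): 0.8206,
          ("No", "Quake"): 0.0453, ("No", "NoQuake"): 0.9547}

for call in ["Yes", "No"]:
    expected_meu = sum(p_news[(call, n)] * meu_call_news[(call, n)]
                       for n in ["Quake", "NoQuake"])
    vpi = expected_meu - meu_call[call]
    print(f"Call={call}: VPI(Newscast) = {vpi:.2f}")
# Call=Yes: VPI(Newscast) = 0.22   (small but positive)
# Call=No: VPI(Newscast) = 0.05
```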
31. Sequential decisions
- So far we have considered decisions in static situations, but most situations are dynamic!
- If I don't attend CS440 today, will I be kicked out of the class?
- If I don't attend CS440 today, will I be better off in the future?
- Action A: Attend / Do not attend
- State S: Professor hates me / Professor does not care, with transition model P(S | A)
- Evidence E: Professor looks upset / not upset
- Utility U(S): the probability of being expelled from the class
32. Sequential decisions
- Extend the static structure over time, just like an HMM, with decisions and utilities added.
- One small caveat: a slightly different representation works better.
33. Partially observable Markov decision processes (POMDPs)
- Actions at time t should impact the state at t+1.
- Use rewards (R) instead of utilities (U).
- Actions directly determine rewards.
[Dynamic decision network: actions A0, A1, A2, ...; states S0, S1, S2, ... with transition model P(St | St-1, At-1); observations E0, E1, E2, ... with sensor model P(Et | St); rewards R0, R1, R2, ... with Rt = R(St).]
34. POMDP Problems
- Objective: find a sequence of actions that takes one from an initial state to a final state while maximizing some notion of total / future reward. E.g., find a sequence of actions that takes a car from point A to point B while minimizing time and fuel consumed.
35. Example (POMDPs)
- Optimal dialog modeling, e.g., an automated airline reservation system.
- Actions:
  - System prompts: "How may I help you?", "Please specify your favorite airline", "Where are you leaving from?", "Do you mind leaving at a different time?", ...
- States:
  - (Origin, Destination, Airline, Flight, Departure, Arrival, ...)
- Reference: "A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies", Levin, Pieraccini, and Eckert, IEEE TSAP, 2000.
36. Example 2 (POMDP)
- Optimal control:
  - States and actions are continuous.
  - Objective: design optimal control laws for guiding objects from a start position to a goal position (e.g., a lunar lander).
- Actions:
  - Engine thrust, robot arm torques, ...
- States:
  - Positions and velocities of objects / robotic arm joints, ...
- Reward:
  - Usually specified in terms of cost (reward × (-1)): cost of fuel, battery charge loss, energy loss, ...
37. Markov decision processes (MDPs)
- Let's make life a bit simpler: assume we know the state of the world exactly.
[Dynamic decision network as for the POMDP, but without observation nodes: actions A0, A1, A2, ...; states S0, S1, S2, ... with transition model P(St | St-1, At-1); rewards R0, R1, R2, ... with Rt = R(St).]
38. Examples
- Blackjack:
  - Objective: have your card sum be greater than the dealer's without exceeding 21.
  - States (200 of them):
    - Current sum (12-21)
    - Dealer's showing card (ace-10)
    - Do I have a usable ace?
  - Reward: +1 for winning, 0 for a draw, -1 for losing.
  - Actions: stick (stop receiving cards), hit (receive another card).
39. MDP Fundamentals
- We mentioned (for POMDPs) that the goal is to select actions that maximize reward. But which reward? (The discounted case is sketched after this list.)
  - Immediate? a_t = arg max_a E[ R(s_{t+1}) ]
  - Cumulative? a_t = arg max_a E[ R(s_{t+1}) + R(s_{t+2}) + R(s_{t+3}) + ... ]
  - Discounted? a_t = arg max_a E[ R(s_{t+1}) + γ·R(s_{t+2}) + γ²·R(s_{t+3}) + ... ]
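For instance, the discounted criterion weights a reward sequence as in this small sketch (γ = 0.9 is the discount factor used in the later example):

```python
# Sketch: discounted return of a reward sequence R(s_{t+1}), R(s_{t+2}), ...
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * R(s_{t+1+k}) over the sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1, -1]))  # 1 + 0.9 + 0.81 - 0.729 = 1.981
```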
40. Utility and utility maximization in MDPs
- Assume we are in state s_t and want to find the best sequence of future actions that maximizes the discounted reward from s_t on.
[Diagram: actions a_t, a_{t+1}, a_{t+2}, ...; states s_t, s_{t+1}, s_{t+2}, ...; rewards R_t, R_{t+1}, R_{t+2}, ...]
41. Utility and utility maximization in MDPs
- Assume we are in state s_t and want to find the best sequence of future actions that maximizes the discounted reward from s_t on.
- Convert this into a simplified model by compounding the future states, actions, and rewards.
[Diagram: state s_t with a single compound action node A and a single utility node U representing the maximum expected utility from s_t.]
42. Bellman equation and the value iteration algorithm
- Do we need to search over all action sequences a^T, T → ∞? No!
- Best immediate action: a*(s) = arg max_a Σ_{s'} P(s' | s, a) · U(s')
- Bellman update: U(s) ← R(s) + γ · max_a Σ_{s'} P(s' | s, a) · U(s')
43. Proof of the Bellman update equation
44. Example
- If I don't attend CS440 today, will I be better off in the future?
- Actions: Attend (A) / Don't attend (NA)
- States: Learned topic (L) / Did not learn topic (NL)
- Reward: +1 if Learned, -1 if Did not learn
- Discount factor: γ = 0.9
- Transition probabilities P(next state | current state, action):

  Next state         | (A, L) | (A, NL) | (NA, L) | (NA, NL)
  Learned (L)        | 0.9    | 0.5     | 0.6     | 0.2
  Did not learn (NL) | 0.1    | 0.5     | 0.4     | 0.8

- Bellman equations:
  U(L)  =  1 + 0.9 · max{ 0.9·U(L) + 0.1·U(NL), 0.6·U(L) + 0.4·U(NL) }
  U(NL) = -1 + 0.9 · max{ 0.5·U(L) + 0.5·U(NL), 0.2·U(L) + 0.8·U(NL) }
45. Computing MDP state utilities: value iteration
- How can one solve for U(L) and U(NL) in the previous example?
- Answer: the value iteration algorithm (sketched below). Start with some initial utilities U(L), U(NL), then iterate the Bellman update until convergence.
[Plot: U(L) and U(NL) versus iteration number; both curves level off within about 50 iterations, with U(L) above U(NL).]
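A minimal sketch of value iteration for this two-state example, using the transition probabilities, rewards, and γ from slide 44; after about 50 sweeps it produces utilities close to the ≈7.16 and ≈4.03 used on the next slide.

```python
# Value iteration for the attend / don't-attend MDP of slide 44.
# States: L (learned), NL (did not learn); actions: A (attend), NA (don't attend).
P = {  # P[(action, current state)] -> distribution over next states
    ("A", "L"):   {"L": 0.9, "NL": 0.1},
    ("A", "NL"):  {"L": 0.5, "NL": 0.5},
    ("NA", "L"):  {"L": 0.6, "NL": 0.4},
    ("NA", "NL"): {"L": 0.2, "NL": 0.8},
}
R = {"L": 1.0, "NL": -1.0}
gamma = 0.9

U = {"L": 0.0, "NL": 0.0}
for _ in range(50):  # ~50 Bellman sweeps, matching the plot above
    U = {s: R[s] + gamma * max(sum(P[(a, s)][s2] * U[s2] for s2 in U)
                               for a in ("A", "NA"))
         for s in U}

print({s: round(u, 4) for s, u in U.items()})
# roughly {'L': 7.15, 'NL': 4.03}, close to the values used on the next slide
```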
46. Optimal policy
- Given the utilities from value iteration, find the optimal policy:
  A*(L)  = arg max { 0.9·U(L) + 0.1·U(NL), 0.6·U(L) + 0.4·U(NL) }
         = arg max { 0.9·7.1574 + 0.1·4.0324, 0.6·7.1574 + 0.4·4.0324 }
         = arg max { 6.8449, 5.9074 } = Attend
  A*(NL) = arg max { 0.5·U(L) + 0.5·U(NL), 0.2·U(L) + 0.8·U(NL) }
         = arg max { 5.5949, 4.6574 } = Attend
47. Policy iteration
- Instead of iterating in the space of utility values, iterate over policies:
  - Start with some policy, e.g., A(L), A(NL).
  - Compute the utility values U(L), U(NL) under that policy (policy evaluation).
  - Compute a new policy that is greedy with respect to these utilities (policy improvement), and repeat until the policy stops changing (see the sketch below).
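A minimal sketch of policy iteration on the same two-state example (same transition model, rewards, and γ as in the value-iteration sketch); policy evaluation here simply iterates the fixed-policy Bellman equation instead of solving the linear system exactly.

```python
# Policy iteration for the same attend / don't-attend MDP (same P, R, gamma as above).
P = {("A", "L"): {"L": 0.9, "NL": 0.1}, ("A", "NL"): {"L": 0.5, "NL": 0.5},
     ("NA", "L"): {"L": 0.6, "NL": 0.4}, ("NA", "NL"): {"L": 0.2, "NL": 0.8}}
R = {"L": 1.0, "NL": -1.0}
gamma, states, actions = 0.9, ("L", "NL"), ("A", "NA")

policy = {"L": "NA", "NL": "NA"}   # arbitrary initial policy
while True:
    # Policy evaluation: iterate U(s) = R(s) + gamma * sum_s' P(s' | s, policy(s)) * U(s')
    U = {s: 0.0 for s in states}
    for _ in range(100):
        U = {s: R[s] + gamma * sum(P[(policy[s], s)][s2] * U[s2] for s2 in states)
             for s in states}
    # Policy improvement: act greedily with respect to the current utilities
    new_policy = {s: max(actions,
                         key=lambda a: sum(P[(a, s)][s2] * U[s2] for s2 in states))
                  for s in states}
    if new_policy == policy:
        break
    policy = new_policy

print(policy, {s: round(U[s], 2) for s in states})
# {'L': 'A', 'NL': 'A'}: Attend in both states, matching slide 46
```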