Title: Risk-Sensitive Markov Decision Processes
1 Risk-Sensitive Markov Decision Processes
Matthew J. Sobel
Dept. of Operations, Case Weatherhead School of Management
Chinese Academy of Sciences
Frontiers of Research in Supply Chain Management
June 15, 2007
2 Outline
- Mean-variance tradeoffs in dynamic stochastic models
- Approaches and algorithms
- Pareto-optimal stationary policies
- Time preference, risk preference, and discounting
- Discounting without risk neutrality
- Application to coordination of operational and financial decisions
- Application to supply chain coordination
- Summary
3 References with 100 citations
- "Mean-Variance Tradeoffs in an Undiscounted MDP," M. J. Sobel, Operations Research, Vol. 42 (1994), pp. 175-183.
- "Discounting and Risk Neutrality," M. J. Sobel, 2006. http://weatherhead.case.edu/orom/research/technicalReports/Technical%20Memorandum%20Number%20724F.pdf
- "Risk Neutrality and Ordered Vector Spaces," J. C. Alexander and M. J. Sobel, 2006. http://ssrn.com/abstract=896201
4 MULTI-STAGE SUPPLY SYSTEM
[Figure: serial flow Unit Proc. N -> WIP N -> Unit Proc. N-1 -> WIP N-1 -> ... -> WIP 2 -> Unit Proc. 1 -> FGI -> Demand]
5 A Multi-Stage Supply System
- Know the amount of material in each buffer inventory
- Each period, at each stage, decide how much to process at that stage
- End-item demand is random (independent and identically distributed random variables > 0)
- This is a Markov decision process (MDP)
- The STATE is the vector of amounts of material in the buffer inventories
- The ACTION is the vector of amounts processed at the various stages
- Costs: processing, storage, demand in excess of supply
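The state, action, and cost structure above can be sketched as a one-period transition. This is only an illustration: zero processing lead time, unlimited raw material at the last stage, lost sales, and the unit costs `c_proc`, `c_hold`, `c_short` are all assumptions that the slides do not specify.

```python
from typing import List, Tuple


def step(buffers: List[int], processed: List[int], demand: int) -> Tuple[List[int], int]:
    """Advance the serial system by one period.

    buffers[i] holds the output of stage i+1, so buffers[0] is FGI.
    processed[i] is the amount stage i+1 processes this period.
    Assumptions (not from the slides): zero processing lead time,
    unlimited raw material for the last stage, and lost sales.
    Returns (next_buffers, unmet_demand).
    """
    n = len(buffers)
    if len(processed) != n:
        raise ValueError("need one processing amount per stage")
    for i in range(n - 1):
        # Stage i+1 can only process what its upstream buffer holds.
        if not 0 <= processed[i] <= buffers[i + 1]:
            raise ValueError(f"stage {i + 1} action is infeasible")
    if processed[n - 1] < 0:
        raise ValueError("processing amounts must be nonnegative")
    nxt = list(buffers)
    for i in range(n):
        nxt[i] += processed[i]          # output enters the stage's own buffer
        if i + 1 < n:
            nxt[i + 1] -= processed[i]  # input leaves the upstream buffer
    sales = min(nxt[0], demand)         # demand is met from FGI
    nxt[0] -= sales
    return nxt, demand - sales


def period_cost(processed, next_buffers, unmet,
                c_proc=1.0, c_hold=0.5, c_short=4.0):
    """Processing + storage + shortage cost with hypothetical unit costs."""
    return c_proc * sum(processed) + c_hold * sum(next_buffers) + c_short * unmet


state, unmet = step([5, 3, 2], [2, 1, 4], demand=9)
cost = period_cost([2, 1, 4], state, unmet)
```

Because the next state depends only on the current state, the action, and an independent demand draw, the process is Markov, which is what makes the MDP formulation applicable.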
6 Optimization Criteria
- Minimize the expected value of the costs during a number of periods
- Finite number: Clark and Scarf, more than 40 years ago
- Infinitely many periods
- Expected value of the long-run average cost in a period
- Expected value of the sum of discounted costs
- These are risk-neutral criteria
- They depend only on the expected value of appropriate random variables
- Risk-sensitive criteria depend on other moments too
7 Why use risk-sensitive criteria?
- Most managers are risk averse
- They are eager to trade some expected value to reduce the downside risk!
- There are large risks in some operations phenomena, including supply chain activities
- Invest in capacity now? Will the markets be strong when the capacity is available?
- Invest in new technology now? Will it be outdated technology by the time that it is available?
- Large rivers and lakes as supply chains
8 Markov decision process (MDP)
9 Stationary policies and distributions
- A stationary policy induces a Markov chain (MC) with stationary transition probabilities
- Each ergodic class in this MC has a stationary distribution. Use it to calculate the mean and variance of the steady-state reward
- The mean is the gain rate. It is the usual criterion for infinite-horizon MDPs under the average-reward criterion
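A minimal sketch of that calculation for one ergodic class, assuming the policy has already been fixed so that the chain has transition matrix `p` and a reward `r[i]` that depends only on the state. The function names and the two-state numbers are illustrative, not taken from the talk.

```python
from fractions import Fraction


def solve(a, b):
    """Gauss-Jordan elimination with exact rational arithmetic."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]


def stationary_distribution(p):
    """Solve pi P = pi, sum(pi) = 1 for an irreducible chain."""
    n = len(p)
    # n-1 balance equations, then the normalization equation.
    a = [[p[i][j] - (1 if i == j else 0) for i in range(n)] for j in range(n - 1)]
    a.append([Fraction(1)] * n)
    b = [Fraction(0)] * (n - 1) + [Fraction(1)]
    return solve(a, b)


def mean_and_variance(p, r):
    """Mean (gain rate) and variance of the steady-state reward."""
    pi = stationary_distribution(p)
    mean = sum(w * x for w, x in zip(pi, r))
    var = sum(w * (x - mean) ** 2 for w, x in zip(pi, r))
    return mean, var


P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]
R = [Fraction(0), Fraction(3)]
gain, variance = mean_and_variance(P, R)
```

For this two-state chain the stationary distribution is (1/3, 2/3), so the gain rate is 2 and the variance of the steady-state reward is also 2.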
10 Pareto optimal policies
- Consider all (mean, variance) pairs generated by stationary policies on sub-chains.
- A pair (mean, variance) is said to be Pareto optimal if it is not possible to increase the gain rate or lower the variance without damaging the other criterion
- How can you calculate policies that generate Pareto optimal pairs? How can you explore the mean-variance tradeoffs?
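The definition can be checked by brute force when the (mean, variance) pairs are already listed. The sketch below is only a filter over a given list with made-up numbers; it is not the parametric linear-programming method of the talk, and the policy labels are hypothetical.

```python
def dominates(a, b):
    """a = (mean, variance) dominates b if it is at least as good on both
    criteria (higher mean, lower variance) and differs from b."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b


def pareto_optimal(pairs):
    """Return the labels whose (mean, variance) pair is not dominated."""
    return [
        name
        for name, pair in pairs.items()
        if not any(dominates(other, pair) for other in pairs.values())
    ]


# Hypothetical (gain rate, variance) pairs for four stationary policies.
candidates = {
    "A": (2.0, 2.0),
    "B": (1.5, 0.5),
    "C": (1.0, 0.8),   # dominated by B
    "D": (2.0, 2.5),   # dominated by A
}
efficient = pareto_optimal(candidates)
```

Enumeration like this grows with the number of policies, which is exponential in the number of states; that is the motivation for a method that generates only the Pareto-optimal policies.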
11 Mean-variance tradeoffs: 3 approaches
- Several papers explore approach 1 for a fixed value of its parameter
- Nobody has explored approach 2
- Today: approach 3, to generate all the unrandomized stationary policies that are Pareto-optimal
12 Why use approach 3?
- A parametric solution to 1 can miss some Pareto optima
- A parametric solution to 3 generates all solutions to 1
- Basic idea to do 3
- Add one constraint to the linear program that optimizes the gain rate of an MDP that satisfies the unichain assumption (that assumption is not made here)
- Solve the linear program parametrically with respect to the extra constraint
13 Basic idea of 3
- Add one constraint to the linear program for optimizing the gain rate of an MDP that satisfies the unichain assumption (that assumption is not made here)
- Size of the linear program: (number of states) x (number of state-action pairs)
- Solving the linear program parametrically with respect to the extra constraint generates a series of extreme points
- Each extreme point corresponds to a deterministic stationary policy that is Pareto optimal, and it identifies the corresponding sub-chain
- So this procedure solves the problem efficiently if you can choose the policy and the sub-chain
14 Research questions
- Many properties are known about the multi-stage supply system that was described early in this talk. The unichain linear program would be very large. How can you use the known properties to reduce the computation in 3?
- The same idea can be applied to many other operations models (including supply chain models)
- Nobody has explored approach 2
15 Strategic Operational Decisions
- Examples
- Location
- Capacity
- Technology
- Product design
- Process design
- Supply chain design
- Consequences
- Uncertain time streams of revenues, costs, etc.
- So we face tradeoffs over time and under
uncertainty
16 Risky Business: Risk-Neutral Analyses!
- Strategic operational choices: tradeoffs over time and under uncertainty
- But most of our models and methods assume risk neutrality
- Expected net profit in the newsvendor model
- EPV (expected present value) in MDP applications
- Canonical form for risk-sensitive preferences in a static situation: expected utility of the monetary payoff
- What is the canonical form in dynamic situations?
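To make the static contrast concrete, here is a small sketch of a newsvendor who maximizes expected profit versus one who maximizes the expected utility of profit. The price, cost, demand distribution, and the exponential utility with coefficient `gamma` are illustrative assumptions, not values from the talk.

```python
import math

PRICE, COST = 10.0, 4.5            # hypothetical unit revenue and unit cost
DEMAND = {d: 0.1 for d in range(1, 11)}  # demand uniform on 1..10 (assumption)


def profit(q, d):
    """Monetary payoff: sell min(q, d) units, pay for q units."""
    return PRICE * min(q, d) - COST * q


def expected_profit(q):
    return sum(prob * profit(q, d) for d, prob in DEMAND.items())


def expected_utility(q, gamma=0.1):
    """Expected exponential utility u(x) = -exp(-gamma * x) of the payoff."""
    return sum(prob * -math.exp(-gamma * profit(q, d)) for d, prob in DEMAND.items())


orders = range(0, 11)
q_neutral = max(orders, key=expected_profit)
q_averse = max(orders, key=expected_utility)
```

With these numbers the risk-averse order quantity is smaller than the risk-neutral one, which is the familiar static effect of a concave utility on the newsvendor.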
17 Logic of Time-Risk Preferences
- Preferences among alternative risky time streams
- Stochastic processes could be vector-valued
- Sequences of consumption, environmental attributes, or indicators of timing of resolution of risk, or ...
18 Standard Approach
-
- Advantages
- Markov decision process (MDP) with X rewards → MDP with f(X) rewards
- Investigate risk sensitivity via quadratic f() with normally distributed randomness
19 Justification for Standard Approach
- If X and Y are deterministic, Koopmans' axioms imply
- In a stochastic world,
- Would have to estimate discount factors, U( ), and ?( )
20 Preference Theory
- Risk preference
- Implications of properties of
- Von Neumann and Morgenstern
- Many others since then
- Time preference
- T. C. Koopmans; Williams and Nassar; 1960s
- Empirical research: past 20 years
- Reference markets: alternative approach
- Why not use the same formalism for risk preference and time preference?
21 Time-Risk Preferences
22 Is it Logical to Discount without Risk Neutrality?
- 2) If preferences satisfy the four axioms (and there is a utility function for random variables), then that function is linear!
- So preferences are risk neutral.
- 3) Koopmans' assumptions include the four axioms
23 Four Axioms
- First three are common in axiomatic theories
- Decomposition: seriously restrictive in a stochastic setting!
24 Discounting Theorem
- Consider stochastic processes with T periods
- Theorem: The four axioms imply that there are unique positive β1, β2, ..., βT such that, for all X and Y,
- Corollary: Adding a fifth axiom implies that the βt are powers of a single discount factor β
- Adding a sixth axiom implies β < 1
25 Proof of Discounting Theorem
- The four axioms induce an algebra of preference
- For example, if (X1, X2, ...) ~ (0, 0, ...) then c(X1, X2, ...) ~ (0, 0, ...) for all numbers c
- The algebra of preference implies the existence of discount factors
26 Risk Neutrality
- A felicity function assigns a number to each random variable, is linear, and is order-preserving (reflects preferences among random variables)
- The four axioms are rationality, continuity, non-triviality, and decomposition.
- Risk neutrality
27 Risk Neutrality Theorem
- If preferences among stochastic processes satisfy the four axioms, then
- (A) preferences are consistent with discounting (the discounting theorem), and
- (B) the following properties are equivalent
- Risk neutrality
- Existence of a felicity function
- Preferences satisfy composition (converse of decomposition)
28 Risk Neutrality Theorem (cont.)
- If preferences among stochastic processes satisfy the four axioms, then the following properties are equivalent
- Risk neutrality
- Existence of a felicity function
- Preferences satisfy composition
- Koopmans' assumptions 40 years ago included the four axioms. So there is no basis for the standard approach
29 Discounting and Risk Sensitivity
- At present, this seems to be the only formalism for time-risk tradeoffs that has a logical foundation
- This formalism invites an exponential inter-period utility function
- Consequences in structured models
- Markov decision processes with Kun-Jen Chung
- Sequential games with Madhvi Shinde Bhatt
- Inventory model with Mokrane Bouakiz
- Insurance with Danko Turcic
- Supply chain contracts with Danko Turcic
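The difference between an intra-period utility (applied to each period's reward before discounting) and an inter-period utility (applied to the present value) can be sketched with two hypothetical two-period streams. The discount factor, the risk coefficient, the convention that the first period is undiscounted, and the streams themselves are assumptions chosen for illustration.

```python
import math

BETA, GAMMA = 0.9, 1.0   # hypothetical discount factor and risk coefficient


def u(x):
    """Exponential utility, used here in both roles."""
    return -math.exp(-GAMMA * x)


def intra_period_value(stream):
    """E[ sum_t beta^t u(X_t) ]: utility inside the discounted sum."""
    return sum(prob * sum(BETA ** t * u(x) for t, x in enumerate(path))
               for prob, path in stream)


def inter_period_value(stream):
    """E[ u( sum_t beta^t X_t ) ]: utility of the present value."""
    return sum(prob * u(sum(BETA ** t * x for t, x in enumerate(path)))
               for prob, path in stream)


# Each stream is a list of (probability, reward path) pairs.
# Both have the same per-period marginals: 0 or 1 with probability 1/2.
positively_correlated = [(0.5, (1, 1)), (0.5, (0, 0))]
negatively_correlated = [(0.5, (1, 0)), (0.5, (0, 1))]

intra_gap = intra_period_value(positively_correlated) - intra_period_value(negatively_correlated)
inter_gap = inter_period_value(negatively_correlated) - inter_period_value(positively_correlated)
```

The intra-period criterion sees only the per-period marginals, so it is indifferent between the two streams. The inter-period criterion prefers the negatively correlated stream, whose present value has the same mean but less spread.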
30 Risk-neutral SC coordination
- A contract coordinates the SC if the actions that optimize the entire chain (viewed as a single entity) are a Nash equilibrium of the strategic game induced by the contract (among the members of the SC).
- "Optimize the entire chain" means expected value of the PV (present value) of total profit
- Payoffs in the game are the parties' expected values of the PVs of their profits.
31 Risk-sensitive SC coordination
- A contract coordinates the SC if the actions that optimize the entire chain (viewed as a single entity) are a Nash equilibrium of the strategic game induced by the contract (among the members of the SC).
- "Optimize the entire chain" means expected value of inter-period utility of the PV of total profit
- Whose inter-period utility? Linear in the rest of this talk.
- A payoff in the game is the party's expected value of its inter-period utility of the PV of its profits.
- Are some parties more sensitive to risk than others?
32 Coordinating the newsvendor
- Risk neutrality
- Various types of contracts coordinate the SC
- Buy-back contracts and revenue-sharing contracts
- These types of contracts are equivalent for the retailer
- Any division of the "pie" is achievable
- Risk sensitivity
- There may not be any buy-back contract or revenue-sharing contract that coordinates the SC
- The retailer is not indifferent between a buy-back contract and a revenue-sharing contract
33 Time line: buy-back contract
- M (manufacturer) announces wholesale price w and buy-back price b
- R (retailer) orders Q, and M incurs -cQ
- M ships Q units to R, who incurs cost kQ and pays wQ to M
- R receives
- M pays to R M gets revenue
-
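For the risk-neutral benchmark, the time line can be sketched as follows, using w, b, c, and k as on the slide. The retail price `p`, the rule that M pays b per unsold unit, the absence of salvage value, and all numbers are assumptions, since the payment formulas did not survive on the slide.

```python
from fractions import Fraction as F

p, c, k, w = F(10), F(2), F(3, 2), F(5)   # p is a hypothetical retail price
DEMAND = {d: F(1, 10) for d in range(1, 11)}  # uniform on 1..10 (assumption)

# Buy-back price that makes the retailer's profit a fixed share of the
# chain's profit under the assumptions above.
b = p * (w - c) / (p - c - k)
share = (p - w - k) / (p - c - k)


def chain_profit(q, d):
    return p * min(q, d) - (c + k) * q


def retailer_profit(q, d):
    # Revenue from sales, buy-back refund on leftovers, wholesale and own cost.
    return p * min(q, d) + b * max(q - d, 0) - (w + k) * q


def expected(f, q):
    return sum(prob * f(q, d) for d, prob in DEMAND.items())


orders = range(0, 11)
q_chain = max(orders, key=lambda q: expected(chain_profit, q))
q_retailer = max(orders, key=lambda q: expected(retailer_profit, q))
```

Here the retailer's profit equals `share` times the chain's profit for every order quantity and demand outcome, so a risk-neutral retailer orders the chain-optimal quantity. A strictly concave utility breaks this argument because expected utility is no longer proportional to expected profit.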
34 Buy-back contract: risk-sensitive retailer
35 Buy-back: risk-sensitive retailer (more)
- There is an example of parameters and a strictly concave inter-period utility function for which there is no coordinating buy-back contract. That is, there is a Qo for which no Q is a solution.
36 Misspecification bias
- Mistaken use of intra-period utility function instead of inter-period utility function
-
- It is more difficult for the SC to overcome double marginalization if the retailer's consultant neglects to use an inter-period utility function
37 Buy-back vs. revenue-sharing
- Take any pair of buy-back and revenue-sharing contracts that are equivalent under risk neutrality
- The buy-back contract has a higher value of
- So the risk-sensitive retailer prefers the buy-back contract
38 Summary 1
- The axioms that have long been the justification for discounting with a non-linear intra-period utility function imply that the preferences are risk neutral
- Capital asset pricing theory
- Other areas of economics and finance
39 Summary 2
- Weakening the axioms yields discounting without
risk neutrality if and only if the composition
axiom is not satisfied. Then the logically
correct formalism uses an inter-period utility
function
40 Summary 3
- There are many unanswered questions such as
- Can preferences be consistent with discounting under weaker assumptions than rationality, continuity, non-triviality, and decomposition?
- What are the effects of inter-period utility functions in prescriptive sciences?
- Is there a reasonable resolution of dynamic inconsistency?
41 Summary 4
- Most supply chain coordination research assumes risk neutrality
- There are alternative risk-sensitive definitions of coordination
- It is possible to analyze a simple two-member supply chain with a risk-sensitive newsvendor retailer and risk-neutral supply chain optimization
42 Summary 5
- There are risk-sensitive models that cannot be coordinated with any buy-back contract
- Misspecification with an intra-period utility function yields an order quantity that is too small
- If buy-back and revenue-sharing contracts are equivalent under risk neutrality, then a risk-sensitive retailer prefers buy-back
43 Example
- The following example satisfies the first three axioms, but neither decomposition nor composition
- Two-element sample space Ω = {a, b}, P{a} = 3/4, P{b} = 1/4
- Preference is determined by variance - mean
- X(a) = Y(b) = 0, X(b) = Y(a) = -1
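The moments in this example are easy to verify. The sketch below computes the mean, the variance, and the quantity "variance - mean" for X, Y, and X + Y; it takes the slide's wording literally and makes no claim about which direction of that quantity is preferred.

```python
from fractions import Fraction as F

PROB = {"a": F(3, 4), "b": F(1, 4)}
X = {"a": F(0), "b": F(-1)}
Y = {"a": F(-1), "b": F(0)}
X_PLUS_Y = {s: X[s] + Y[s] for s in PROB}


def mean(z):
    return sum(PROB[s] * z[s] for s in PROB)


def variance(z):
    m = mean(z)
    return sum(PROB[s] * (z[s] - m) ** 2 for s in PROB)


def variance_minus_mean(z):
    return variance(z) - mean(z)


summary = {
    name: (mean(z), variance(z), variance_minus_mean(z))
    for name, z in {"X": X, "Y": Y, "X+Y": X_PLUS_Y}.items()
}
```

X and Y have the same variance, 3/16, but different means, while X + Y is the constant -1 with zero variance. The variance of a sum is not the sum of the variances here, which is the kind of interaction that additive axioms such as decomposition and composition rule out.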
44 Stochastic Order is not Rational
If the distribution functions of X and Y cross, then neither is stochastically larger than the other. So the ordering is not complete.
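A small check of this incompleteness for discrete random variables, using the usual first-order definition (X is stochastically at least as large as Y when its distribution function lies on or below that of Y everywhere). The three distributions are made up for illustration.

```python
from fractions import Fraction as F


def cdf(dist, t):
    """P(Z <= t) for a discrete distribution given as {value: probability}."""
    return sum(prob for value, prob in dist.items() if value <= t)


def stochastically_larger(x, y):
    """True if F_X(t) <= F_Y(t) at every support point of X and Y."""
    points = sorted(set(x) | set(y))
    return all(cdf(x, t) <= cdf(y, t) for t in points)


def comparable(x, y):
    return stochastically_larger(x, y) or stochastically_larger(y, x)


X = {0: F(1, 2), 2: F(1, 2)}   # spread out around 1
Y = {1: F(1)}                  # constant 1; its CDF crosses that of X
Z = {1: F(1, 2), 3: F(1, 2)}   # X shifted up by 1

crossing_pair_comparable = comparable(X, Y)
shifted_pair_comparable = comparable(Z, X)
```

X and Y cannot be ranked because their distribution functions cross, whereas Z is stochastically larger than X. A complete ordering would have to rank every pair, including X and Y.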
45 Mean-Variance Tradeoffs
- If X and Y are independent,
- The ordering satisfies decomposition but not composition
- Generally
- It is easy to find examples that satisfy decomposition but not composition
- It is difficult to find examples that satisfy composition but not decomposition
46 Decomposition vs. Composition
47 Where Does this Leave Us?
- DA Denardo and Rothblum; van Mieghem; Chen, Sim, Simchi-Levi, and Sun; and I have used the following ordering
- Robert Rosenthal (deceased) challenged my justification, which was the obvious orthogonality of axioms for time preference and risk preference
- He was correct: the two sets of axioms are NOT orthogonal
- Nevertheless, there is a strong justification for this ordering
48 Role of the Composition Axiom
- Let V be an abstract real vector space (application: stochastic processes with the zero process as the 0 in V)
- A real-valued function on V is weakly continuous if it is continuous on each finite-dimensional subspace of V, and it is linear if it is linear as a map of vector spaces.
- A real-valued function u on V is a pseudo-utility function if
- A pseudo-utility function is a utility function if it satisfies
49 Recent Result with James Alexander
- If a binary relation on a real vector space satisfies the four axioms, then there is a utility function of the form f ∘ u in which u: V → R is a linear pseudo-utility function. Also, f: R → R is weakly monotonic and is linear if and only if the binary relation satisfies the composition axiom
- So if V is the set of stochastic processes on a probability space and if preferences satisfy the four axioms but not composition, then there is a nonlinear inter-period utility function such that
50 Mathematical Novelty
- Hausner and Wendel (1952) showed that a binary relation on a real vector space has a linear pseudo-utility function if the binary relation's properties include
- Rationality
- Anti-symmetry
- Cone property
- Composition and decomposition
- Our theorem
- Does not require composition, anti-symmetry, or the cone property for existence of a linear pseudo-utility function, but it requires continuity and non-triviality
- Exactly specifies the consequence of augmenting decomposition with composition
51 In Operations
- Standard approach
-
- Apply to dynamic newsvendor as in much supply chain research
- There is a literature on this problem
52 What's the Difference?
- In interpret coordinates as
- Time indices → time preference
- Sample space outcomes → risk preference
- Preference theory is largely abstract, so it applies to both time and risk preference
- Issues unique to each kind: discounting, risk neutrality
- Why not invoke von Neumann-Morgenstern axioms for risk preference and Williams-Nassar axioms for time preference?
- Are the two sets of axioms orthogonal?
- Robert Rosenthal
53 Risk-Averse Dynamic News Vendor
54 Risk-Neutral Optimization of a Firm's Value
- Market value of a firm is the present value of the time stream of dividends
- Paper with Lode Li and Martin Shubik
- Firm makes periodic operational and financial decisions
- Operational decisions as in dynamic news vendor
- Financial decisions
- Dividend (net of capital subscription - entrepreneurial firm)
- Short-term loan (model includes a default penalty)
- Augment constraints in dynamic news vendor model
- Liquidity
- Cash flow balance
- Additional state variable: retained earnings
55 Risk-Averse Optimization of a Firm's Value
- Market value of a firm is the present value of the time stream of dividends
- Again use an exponential inter-period utility function
- Risk-neutral and risk-averse analyses share conclusions
- There are optimal base-stock inventory and retained earnings levels
- Don't borrow unless you have to for liquidity, and then as little as possible (pecking order principle)
56 Risk-Aversion Effects
- Market value of a firm is the present value of the time stream of dividends
- Again use an exponential inter-period utility function
- Some effects of risk aversion
- Inventory base-stock level rises as time elapses
- Retained earnings base-stock level drops as time elapses
- So dividends rise as time passes
- Effects of initial capitalization
- Work with Qiaohai (Joice) Hu in the risk-neutral case
- Further results in the risk-sensitive case