Risk-Sensitive Markov Decision Processes

1
Risk-Sensitive Markov Decision Processes
Matthew J. Sobel, Dept. of Operations, Case
Weatherhead School of Management
Chinese Academy of Sciences, Frontiers of Research
in Supply Chain Management, June 15, 2007
2
Outline
  • Mean-variance tradeoffs in dynamic stochastic
    models
  • Approaches and algorithms
  • Pareto-optimal stationary policies
  • Time preference, risk preference, and discounting
  • Discounting without risk neutrality
  • Application to coordination of operational and
    financial decisions
  • Application to supply chain coordination
  • Summary

3
References with 100 citations
  • "Mean-Variance Tradeoffs in an Undiscounted MDP,"
    M. J. Sobel, Operations Research, Vol. 42,
    (1994), pp. 175-183.
  • "Discounting and Risk Neutrality," M. J. Sobel,
    2006,
    http://weatherhead.case.edu/orom/research/technicalReports/Technical%20Memorandum%20Number%20724F.pdf
  • "Risk Neutrality and Ordered Vector Spaces," J.
    C. Alexander and M. J. Sobel, 2006,
    http://ssrn.com/abstract=896201

4
MULTI-STAGE SUPPLY SYSTEM
[Figure: serial supply system. Unit Proc. N → WIP N → Unit Proc. N-1 → WIP N-1 → ... → WIP 2 → Unit Proc. 1 → FGI → Demand]
5
A Multi-Stage Supply System
  • Know the amount of material in each buffer
    inventory
  • Each period at each stage decide how much to
    process at that stage
  • End-item demand is random (independent and
    identically distributed random variables > 0)
  • This is a Markov decision process (MDP)
  • The STATE is the vector of amounts of material in
    the buffer inventories
  • The ACTION is the vector of amounts processed at
    the various stages
  • Costs: processing, storage, demand in excess of
    supply

6
Optimization Criteria
  • Minimize the expected value of the costs during a
    number of periods
  • Finite number: Clark and Scarf, more than 40 years
    ago
  • Infinitely many periods
  • Expected value of the long-run average cost in a
    period
  • Expected value of the sum of discounted costs
  • These are risk-neutral criteria
  • They depend only on the expected value of
    appropriate random variables
  • Risk-sensitive criteria depend on other moments
    too

7
Why use risk-sensitive criteria?
  • Most managers are risk averse
  • They are eager to trade some expected value to
    reduce the downside risk!
  • There are large risks in some operations
    phenomena, including supply chain activities
  • Invest in capacity now? Will the markets be
    strong when the capacity is available?
  • Invest in new technology now? Will it be
    outdated technology by the time that it is
    available?
  • Large rivers and lakes as supply chains

8
Markov decision process (MDP)
9
Stationary policies and distributions
  • A stationary policy induces a Markov chain (MC)
    with stationary transition probabilities
  • Each ergodic class in this MC has a stationary
    distribution. Use it to calculate the mean and
    variance of the steady-state reward
  • The mean is the gain rate. It is the usual
    criterion for infinite-horizon MDPs under the
    average-reward criterion
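
A minimal sketch of the calculation described on this slide, assuming a single ergodic class and a reward that depends only on the current state. The transition matrix `P`, the reward vector `r`, and the function names are hypothetical, and "variance" here means the variance of the one-period reward under the stationary distribution.

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration; assumes the chain is irreducible and aperiodic."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def steady_state_mean_variance(P, r):
    """Mean (gain rate) and variance of the reward under the stationary distribution."""
    pi = stationary_distribution(P)
    mean = sum(p * x for p, x in zip(pi, r))
    var = sum(p * (x - mean) ** 2 for p, x in zip(pi, r))
    return mean, var


# Hypothetical two-state chain induced by some stationary policy.
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 4.0]  # reward earned in each state

mean, var = steady_state_mean_variance(P, r)
print(mean, var)
```

For this chain the stationary distribution is (5/6, 1/6), so the gain rate is 1.5 and the variance is 1.25.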

10
Pareto optimal policies
  • Consider all (mean, variance) pairs generated by
    stationary policies on sub-chains.
  • A pair (mean, variance) is said to be Pareto
    optimal if it is not possible to increase the
    gain-rate or lower the variance without damaging
    the other criterion
  • How can you calculate policies that generate
    Pareto optimal pairs? How can you explore the
    mean-variance tradeoffs?
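
As a brute-force illustration of the definition (not the linear-programming method of this talk), the sketch below filters a list of (mean, variance) pairs down to the Pareto-optimal ones. The candidate pairs are made-up numbers, and `pareto_optimal` is a hypothetical helper.

```python
def pareto_optimal(pairs):
    """Keep the (mean, variance) pairs that no other pair dominates."""
    def dominates(p, q):
        # p is at least as good on both criteria and differs from q
        return p[0] >= q[0] and p[1] <= q[1] and p != q

    return [q for q in pairs if not any(dominates(p, q) for p in pairs)]


# Hypothetical (gain rate, variance) pairs, one per stationary policy.
candidates = [(1.5, 1.25), (1.2, 0.40), (1.0, 0.50), (1.5, 2.00), (0.8, 0.10)]

frontier = sorted(pareto_optimal(candidates))
print(frontier)
```

Enumerating every stationary policy this way is only feasible for tiny models, which is why a parametric method is of interest.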

11
Mean-variance tradeoffs: 3 approaches
  • Several papers explore approach 1 for a fixed
    value of its parameter
  • Nobody has explored approach 2
  • Today: approach 3, to generate all the unrandomized
    stationary policies that are Pareto-optimal

12
Why use approach 3?
  • A parametric solution to approach 1 can miss some
    Pareto optima
  • A parametric solution to approach 3 generates all
    solutions to approach 1
  • Basic idea of approach 3:
  • Add one constraint to the linear program that
    optimizes the gain rate of an MDP that satisfies
    the unichain assumption (that assumption is not
    made here)
  • Solve the linear program parametrically with
    respect to the extra constraint

13
Basic idea of approach 3
  • Add one constraint to the linear program for
    optimizing the gain rate of an MDP that satisfies
    the unichain assumption (that assumption is not
    made here)
  • Size of linear program: (number of states) ×
    (number of state-action pairs)
  • Solving the linear program parametrically with
    respect to the extra constraint generates a
    series of extreme points
  • Each extreme point corresponds to a deterministic
    stationary policy that is Pareto optimal, and it
    identifies the corresponding sub-chain
  • So this procedure solves the problem efficiently
    if you can choose the policy and the sub-chain

14
Research questions
  • Many properties are known about the multi-stage
    supply system that was described early in this
    talk. The unichain linear program would be very
    large. How can you use the known properties to
    reduce the computation in approach 3?
  • The same idea can be applied to many other
    operations models (including supply chain models)
  • Nobody has explored approach 2

15
Strategic Operational Decisions
  • Examples: location, capacity, technology, product
    design, process design, supply chain design
  • Consequences: uncertain time streams of revenues,
    costs, etc.
  • So we face tradeoffs over time and under
    uncertainty

16
Risky Business Risk Neutral Analyses!
  • Strategic operational choices: tradeoffs over
    time and under uncertainty
  • But most of our models and methods assume risk
    neutrality
  • Expected net profit in the newsvendor model
  • EPV (expected present value) in MDP
    applications
  • Canonical form for risk-sensitive preferences in
    static situations: expected utility of the
    monetary payoff
  • What is the canonical form in dynamic situations?

17
Logic of Time - Risk Preferences
  • Preferences among alternative risky time
    streams
  • Stochastic processes could be vector-valued
  • Sequences of consumption, environmental
    attributes, or indicators of timing of
    resolution of risk, or ...

18
Standard Approach
  • Advantages
  • Markov decision process (MDP) with X rewards →
    MDP with f(X) rewards
  • Investigate risk sensitivity via quadratic f()
    with normally distributed randomness

19
Justification for Standard Approach
  • If X and Y are deterministic, Koopmans' axioms
    imply
  • In a stochastic world,
  • Would have to estimate discount factors, U( ),
    and ?( )

20
Preference Theory
  • Risk preference
  • Implications of properties of
  • von Neumann and Morgenstern
  • Many others since then
  • Time preference
  • T. C. Koopmans; Williams and Nassar; 1960s
  • Empirical research - past 20 years
  • Reference markets: alternative approach
  • Why not use the same formalism for risk
    preference and time preference?

21
Time Risk Preferences
22
Is it Logical to Discount without Risk Neutrality?
  • 2) If preferences satisfy the four axioms
    and there is a utility function (for random
    variables), then that function is linear!
    So preferences are risk neutral.
  • 3) Koopmans' assumptions include the four
    axioms

23
Four Axioms
  • First three are common in axiomatic theories
  • Decomposition: seriously restrictive in a
    stochastic setting!

24
Discounting Theorem
  • Consider stochastic processes with T periods
  • Theorem: The four axioms imply that there are
    unique positive β1, β2, ..., βT such that, for all X
    and Y,
  • Corollary: Adding a fifth axiom implies βt = β^t
  • Adding a sixth axiom implies
    β < 1

25
Proof of Discounting Theorem
  • The four axioms induce an algebra of preference
  • For example, if (X1, X2, ...) ~ (0, 0, ...) then
    c(X1, X2, ...) ~ (0, 0, ...) for all numbers c
  • The algebra of preference implies the existence
    of discount factors

26
Risk Neutrality
  • A felicity function assigns a number to each
    random variable, is linear, and is
    order-preserving (reflects preferences among
    random variables)
  • The four axioms are rationality, continuity,
    non-triviality, and decomposition.
  • Risk neutrality

27
Risk Neutrality Theorem
  • If preferences among stochastic processes satisfy
    the four axioms, then
  • (A) preferences are consistent with discounting
    (the discounting theorem), and
  • (B) the following properties are equivalent:
  • Risk neutrality
  • Existence of a felicity function
  • Preferences satisfy composition (converse of
    decomposition)

28
Risk Neutrality Theorem (cont'd)
  • If preferences among stochastic processes satisfy
    the four axioms, then the following properties
    are equivalent
  • Risk neutrality
  • Existence of a felicity function
  • Preferences satisfy composition
  • Koopmans' assumptions 40 years ago included the
    four axioms. So there is no basis for the
    standard approach

29
Discounting and Risk Sensitivity
  • At present, this seems to be the only formalism
    for time-risk tradeoffs that has a logical
    foundation
  • This formalism invites an exponential
    inter-period utility function
  • Consequences in structured models
  • Markov decision processes with Kun-Jen Chung
  • Sequential games with Madhvi Shinde Bhatt
  • Inventory model with Mokrane Bouakiz
  • Insurance with Danko Turcic
  • Supply chain contracts with Danko Turcic
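
To illustrate the difference between discounting with risk neutrality and discounting with an exponential inter-period utility function, here is a small sketch. The utility form u(v) = -exp(-γv), the parameter values, and the two cash-flow lotteries are illustrative assumptions, not taken from the talk.

```python
import math


def present_value(stream, beta):
    """Discounted sum with beta_t = beta**t for periods t = 1, 2, ..."""
    return sum(beta ** t * x for t, x in enumerate(stream, start=1))


def expected_pv(lottery, beta):
    """Risk-neutral criterion: expected present value."""
    return sum(prob * present_value(stream, beta) for prob, stream in lottery)


def expected_exp_utility(lottery, beta, gamma):
    """Expected inter-period utility of the PV, u(v) = -exp(-gamma * v), gamma > 0."""
    return sum(prob * -math.exp(-gamma * present_value(stream, beta))
               for prob, stream in lottery)


beta, gamma = 0.9, 0.5  # hypothetical discount factor and risk-aversion level

# Each lottery is a list of (probability, cash-flow stream) outcomes.
safe = [(1.0, [1.0, 1.0])]
risky = [(0.5, [2.0, 2.0]), (0.5, [0.0, 0.0])]

print(expected_pv(safe, beta), expected_pv(risky, beta))
print(expected_exp_utility(safe, beta, gamma),
      expected_exp_utility(risky, beta, gamma))
```

Both lotteries have the same expected present value, so a risk-neutral decision maker is indifferent. The concave utility is applied to the whole present value rather than period by period, and it ranks the safe stream higher.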

30
Risk-neutral SC coordination
  • A contract coordinates the SC if the actions that
    optimize the entire chain (viewed as a single
    entity) are a Nash equilibrium of the strategic
    game induced by the contract (among the members
    of the SC).
  • "Optimize the entire chain" means expected value
    of the PV (present value) of total profit
  • Payoffs in the game are the parties' expected
    values of the PVs of their profits.

31
Risk-sensitive SC coordination
  • A contract coordinates the SC if the actions that
    optimize the entire chain (viewed as a single
    entity) are a Nash equilibrium of the strategic
    game induced by the contract (among the members
    of the SC).
  • "Optimize the entire chain" means expected value
    of inter-period utility of the PV of total profit
  • Whose inter-period utility? Linear in rest of
    this talk.
  • A payoff in the game is the party's expected
    value of its inter-period utility of the PV of
    its profits.
  • Are some parties more sensitive to risk than
    others?

32
Coordinating the newsvendor
  • Risk neutrality
  • Various types of contracts coordinate the SC
  • Buy-back contracts and revenue-sharing contracts
  • These types of contracts are equivalent for the
    retailer
  • Any division of the "pie" is achievable
  • Risk sensitivity
  • There may not be any buy-back contract or
    revenue-sharing contract that coordinates the SC
  • The retailer is not indifferent between a
    buy-back contract and a revenue-sharing contract

33
Time line: buy-back contract
  • M (manufacturer) announces wholesale price w and
    buy-back price b
  • R (retailer) orders Q, and M incurs -cQ
  • M ships Q units to R who incurs cost kQ and pays
    wQ to M
  • R receives
  • M pays to R M gets revenue
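
For reference, the risk-neutral benchmark can be sketched as follows. This is the textbook single-period newsvendor calculation under assumptions not stated on the slide: a retail price `p` (hypothetical), no salvage value, and a buy-back payment of b per unsold unit. The symbols c, k, w, and b follow the time line above.

```python
def chain_critical_ratio(p, c, k):
    """Optimal service level F(Q*) for the chain viewed as a single entity."""
    return (p - c - k) / p


def retailer_critical_ratio(p, w, k, b):
    """Service level F(Q) chosen by a risk-neutral retailer under (w, b)."""
    return (p - w - k) / (p - b)


def coordinating_wholesale_price(p, c, k, b):
    """Wholesale price w that equates the two critical ratios for a given b."""
    return p - k - (p - b) * (p - c - k) / p


# Hypothetical parameter values.
p, c, k, b = 10.0, 3.0, 1.0, 4.0
w = coordinating_wholesale_price(p, c, k, b)

print(w)
print(chain_critical_ratio(p, c, k), retailer_critical_ratio(p, w, k, b))
```

When the two ratios agree, the retailer's order quantity matches the chain-optimal one for any demand distribution. With b = 0 the formula gives w = c, the familiar result that a wholesale price alone coordinates only at cost.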

34
Buy-back contract: risk-sensitive retailer
35
Buy-back: risk-sensitive retailer - more
  • There is an example of parameters and strictly
    concave ? for which there is no coordinating
    buy-back contract. That is, there is a Qo for
    which no Q is a solution.

36
Misspecification bias
  • Mistaken use of intra-period utility function
    instead of inter-period utility function
  • It is more difficult for the SC to overcome
    double marginalization if the retailer's
    consultant neglects to use an inter-period
    utility function

37
Buy-back vs. revenue-sharing
  • Take any pair of buy-back and revenue-sharing
    contracts that are equivalent under risk
    neutrality
  • The buy-back contract has a higher value of
  • So the risk-sensitive retailer prefers the
    buy-back contract

38
Summary 1
  • The axioms that have long been the justification
    for discounting with a non-linear intra-period
    utility function imply that the preferences are
    risk neutral
  • Capital asset pricing theory
  • Other areas of economics and finance

39
Summary 2
  • Weakening the axioms yields discounting without
    risk neutrality if and only if the composition
    axiom is not satisfied. Then the logically
    correct formalism uses an inter-period utility
    function

40
Summary 3
  • There are many unanswered questions such as
  • Can preferences be consistent with discounting
    under weaker assumptions than rationality,
    continuity, non-triviality, and decomposition?
  • What are the effects of inter-period utility
    functions in prescriptive sciences?
  • Is there a reasonable resolution of dynamic
    inconsistency?

41
Summary 4
  • Most supply chain coordination research assumes
    risk neutrality
  • There are alternative risk-sensitive definitions
    of coordination
  • It is possible to analyze a simple two-member
    supply chain with a risk-sensitive newsvendor
    retailer and risk-neutral supply chain
    optimization

42
Summary 5
  • There are risk-sensitive models that cannot be
    coordinated with any buy-back contract
  • Misspecification with intra-period utility
    function yields an order quantity that is too
    small
  • If buy-back and revenue-sharing contracts are
    equivalent under risk neutrality, then a
    risk-sensitive retailer prefers buy-back

43
Example
  • The following example satisfies the first three
    axioms, but neither decomposition nor composition
  • Two-element sample space Ω = {a, b}, P(a) =
    3/4, P(b) = 1/4
  • Preference is determined by variance - mean
  • X(a) = Y(b) = 0, X(b) = Y(a) = -1
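
The numbers in this example can be checked directly. The sketch below reads the slide as P(a) = 3/4, P(b) = 1/4, X(a) = Y(b) = 0 and X(b) = Y(a) = -1, and it computes variance minus mean for each random variable. The transcript does not say whether a smaller or larger value is preferred, so the code only reports the scores.

```python
from fractions import Fraction

# Probabilities and outcomes as read from the slide (an interpretation).
P = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
X = {"a": 0, "b": -1}
Y = {"a": -1, "b": 0}


def mean(Z):
    return sum(P[w] * Z[w] for w in P)


def variance(Z):
    m = mean(Z)
    return sum(P[w] * (Z[w] - m) ** 2 for w in P)


def score(Z):
    """The criterion named on the slide: variance minus mean."""
    return variance(Z) - mean(Z)


print(mean(X), variance(X), score(X))  # -1/4 3/16 7/16
print(mean(Y), variance(Y), score(Y))  # -3/4 3/16 15/16
```

X and Y have the same variance, so the two scores differ only through the means.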

44
Stochastic Order is not Rational
If the distribution functions of X and Y
cross, then neither is stochastically larger than
the other. So the ordering is not complete.
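
A small check of this point, assuming the usual (first-order) stochastic order on discrete distributions. The distributions and function names are hypothetical.

```python
def cdf(dist, x):
    """P(Z <= x) for a discrete distribution given as {value: probability}."""
    return sum(p for v, p in dist.items() if v <= x)


def stochastically_larger(X, Y):
    """True if F_X(t) <= F_Y(t) at every t, i.e. X is stochastically larger than Y."""
    points = sorted(set(X) | set(Y))  # step functions only change at support points
    return all(cdf(X, t) <= cdf(Y, t) + 1e-12 for t in points)


X = {0: 0.5, 10: 0.5}  # risky: 0 or 10 with equal probability
Y = {4: 1.0}           # sure thing: 4
Z = {5: 1.0}           # sure thing: 5

print(stochastically_larger(X, Y), stochastically_larger(Y, X))
print(stochastically_larger(Z, Y))
```

The distribution functions of X and Y cross, so both comparisons in the first line are False. Z and Y are comparable, so the second line is True.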
45
Mean Variance Tradeoffs
  • If X and Y are independent,
  • The ordering satisfies decomposition but not
    composition
  • Generally
  • It is easy to find examples that satisfy
    decomposition but not composition
  • It is difficult to find examples that satisfy
    composition but not decomposition

46
Decomposition vs. Composition
47
Where Does this Leave Us?
  • DA Denardo and Rothblum; van Mieghem; Chen, Sim,
    Simchi-Levi, and Sun; and I have used the
    following ordering
  • Robert Rosenthal (deceased) challenged my
    justification, which was the "obvious"
    orthogonality of axioms for time preference and
    risk preference
  • He was correct - the two sets of axioms are NOT
    orthogonal
  • Nevertheless, there is a strong justification for
    this ordering

48
Role of the Composition Axiom
  • Let V be an abstract real vector space
    (application: stochastic processes with the zero
    process as the 0 in V)
  • A real-valued function on V is weakly continuous
    if it is continuous on each finite-dimensional
    subspace of V, and it is linear if it is linear
    as a map of vector spaces.
  • A real-valued function u on V is a pseudo-utility
    function if
  • A pseudo-utility function is a utility function
    if it satisfies

49
Recent Result with James Alexander
  • If a binary relation on a real vector space
    satisfies the four axioms, then there is a
    utility function of the form f ∘ u in which
    u: V → R is a linear pseudo-utility function.
  • Also, f: R → R is weakly monotonic and is linear if
    and only if the binary relation satisfies the
    composition axiom
  • So if V is the set of stochastic processes on a
    probability space and if preferences satisfy the
    four axioms but not composition, then there is a
    nonlinear inter-period utility function ? such
    that

50
Mathematical Novelty
  • Hausner and Wendel (1952) showed that a binary
    relation on a real vector space has a linear
    pseudo-utility function if the binary relation's
    properties include
  • Rationality
  • Anti-symmetry
  • Cone property
  • Composition and decomposition
  • Our theorem
  • Does not require composition, anti-symmetry, or
    the cone property for existence of a linear
    pseudo-utility function, but it requires
    continuity and non-triviality
  • Exactly specifies the consequence of augmenting
    decomposition with composition

51
In Operations
  • Standard approach
  • Apply to dynamic newsvendor as in much supply
    chain research
  • There is a literature on this problem

52
What's the Difference?
  • In interpret
    coordinates as
  • Time indices → time preference
  • Sample space outcomes → risk preference
  • Preference theory is largely abstract so it
    applies to both time and risk preference
  • Issues unique to each kind: discounting, risk
    neutrality
  • Why not invoke von Neumann-Morgenstern axioms for
    risk preference and Williams-Nassar axioms for
    time preference?
  • Are the two sets of axioms orthogonal?
  • Robert Rosenthal

53
Risk-Averse Dynamic News Vendor
54
Risk-Neutral Optimization of a Firm's Value
  • Market value of a firm is the present value of
    time stream of dividends
  • Paper with Lode Li and Martin Shubik
  • Firm makes periodic operational and financial
    decisions
  • Operational decisions as in dynamic news vendor
  • Financial decisions
  • Dividend (net of capital subscription -
    entrepreneurial firm)
  • Short-term loan (model includes a default
    penalty)
  • Augment constraints in dynamic news vendor model
  • Liquidity
  • Cash flow balance
  • Additional state variable: retained earnings

55
Risk-Averse Optimization of a Firm's Value
  • Market value of a firm is the present value of
    time stream of dividends
  • Again use an exponential inter-period utility
    function
  • Risk-neutral and risk-averse analyses share
    conclusions
  • There are optimal base-stock inventory and
    retained earnings levels
  • Don't borrow unless you have to for liquidity,
    and then as little as possible (pecking order
    principle)

56
Risk-Aversion Effects
  • Market value of a firm is the present value of
    time stream of dividends
  • Again use an exponential inter-period utility
    function
  • Some effects of risk aversion
  • Inventory base-stock level rises as time elapses
  • Retained earnings base-stock level drops as time
    elapses
  • So dividends rise as time passes
  • Effects of initial capitalization
  • Work with Qiaohai (Joice) Hu in the risk-neutral
    case
  • Further results in the risk-sensitive case