Title: Markov Decision Processes: A Survey
1. Markov Decision Processes: A Survey
2. Outline
- Example - Airline Meal Planning
- MDP Overview and Applications
- Airline Meal Planning Models and Results
- MDP Theory and Computation
- Bayesian MDPs and Censored Models
- Reinforcement Learning
- Concluding Remarks
3. Airline Meal Planning
- Goal: Get the right number of meals on each flight
- Why is this hard?
- Meal preparation lead times
- Load uncertainty
- Last-minute uploading capacity constraints
- Why is this important to an airline?
- 500 flights per day × 365 days × $5/meal = $912,500 per year
4-5. How Significant is the Problem? (figures)
6. The Meal Planning Decision Process
- At several key decision points up to 3 hours before departure, the meal planner observes reservations and meals allocated, and adjusts the allocated meal quantity.
- Hourly in the last three hours, adjustments are still made, but the cost of adjustment is significantly higher and is limited by delivery van capacity and uploading logistics.
7. Meal Planning Timeline (figure)
8. Airline Meal Planning
- Operational goal: develop a meal planning strategy that minimizes expected total overage, underage, and operational costs
- A meal planning strategy specifies, at each decision point, the number of extra meals to prepare or deliver for any observed meal allocation and reservation quantity.
9. Why is Finding an Optimal Meal Planning Strategy Challenging?
- 6 decision points
- 108 passengers
- 108 possible actions
- One strategy requires 108 × 108 × 6 = 69,984 order quantities.
- There are 7,558,272 strategies to consider.
- Demand must be forecasted.
10. Airline Meal Planning Characteristics
- A similar decision is made at several time points
- There are costs associated with each decision
- The decision has future consequences
- The overall cost depends on several events
- There is uncertainty about the future
11. What is a Markov Decision Process?
- A mathematical representation of a sequential decision-making problem in which
  - a system evolves through time,
  - a decision maker controls it by taking actions at pre-specified points of time, and
  - actions incur immediate costs or accrue immediate rewards and affect the subsequent system state.
12. MDP Overview
13. Markov Decision Processes are also known as
- MDPs
- Dynamic Programs
- Stochastic Dynamic Programs
- Sequential Decision Processes
- Stochastic Control Problems
14. Early Historical Perspective
- Massé - Reservoir Control (1940s)
- Wald - Sequential Analysis (1940s)
- Bellman - Dynamic Programming (1950s)
- Arrow, Dvoretzky, Wolfowitz, Kiefer, Karlin - Inventory (1950s)
- Howard (1960) - Finite State and Action Models
- Blackwell (1962) - Theoretical Foundation
- Derman, Ross, Denardo, Veinott (1960s) - Theory (USA)
- Dynkin, Krylov, Shiryaev, Yushkevich (1960s) - Theory (USSR)
15. Basic Model Ingredients
- Decision epochs: {0, 1, 2, ..., N}, [0, N], {0, 1, 2, ...}, or [0, ∞)
- State space S (generic state s)
- Action sets A_s (generic action a)
- Rewards r_t(s, a)
- Transition probabilities p_t(j | s, a)
- A model is called stationary if rewards and transition probabilities are independent of t
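To make the ingredients concrete, here is a minimal sketch (not from the talk) of a stationary finite MDP in Python; the two-state example data are hypothetical.

```python
# A minimal stationary finite MDP: states, action sets A_s, rewards r(s, a),
# and transition probabilities p[(s, a)][j] = Pr(next state = j | s, a).
mdp = {
    "states": ["low", "high"],
    "actions": {"low": ["wait", "invest"], "high": ["wait", "invest"]},
    "reward": {
        ("low", "wait"): 0.0, ("low", "invest"): -1.0,
        ("high", "wait"): 2.0, ("high", "invest"): 1.0,
    },
    "p": {
        ("low", "wait"): {"low": 0.9, "high": 0.1},
        ("low", "invest"): {"low": 0.4, "high": 0.6},
        ("high", "wait"): {"low": 0.5, "high": 0.5},
        ("high", "invest"): {"low": 0.1, "high": 0.9},
    },
}
```

This dictionary representation is reused in the computational sketches on later slides.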
16. System Evolution
(figure: at decision epoch t the system occupies state s_t; choosing action a_t yields reward r_t(s_t, a_t), and the system moves to state s_{t+1} at decision epoch t+1, where action a_{t+1} yields r_{t+1}(s_{t+1}, a_{t+1}))
17. Another Perspective
(figure: a decision-tree view in which actions a1, a2 branch from states s1, ..., s4 to successor states)
18. Yet Another Perspective: An Event Timeline
(figure: a timeline of ordering events, e.g.
- May 15: place June order
- June 1: May order arrives; ship product to DCs
- June 10: May sales data arrive; prepare July forecast
- June 15: place July order)
19. Some Variants on the Basic Model
- There may be a continuum of states and/or actions
- Decisions may be made in continuous time
- Rewards and transition rates may change over time
- System state may not be observable
- Some model parameters may not be known
20. Derived Quantities
- Decision rules d_t(s)
- Policies, strategies, or plans: π = (d_1, d_2, ...) or π = (d_1, d_2, ..., d_N)
- Stochastic processes (X_t, Y_t) and expectations E_s^π
- Value functions v_t^π(s), v_λ^π(s), g^π
- Value functions differ from immediate rewards: they represent the value, starting in a state, of all future events
21. Objective
- Identify a policy that maximizes either the
  - expected total reward (finite or infinite horizon): v(s) = E_s^π [ Σ_{t=0}^∞ r_t(X_t, Y_t) ]
  - expected discounted reward
  - expected long-run average reward
  - expected utility
- possibly subject to constraints on system performance
22. The Bellman Equation
- MDP computation and theory focus on solving the optimality (Bellman) equation, which for infinite-horizon discounted models is
  v(s) = max_{a ∈ A_s} { r(s, a) + λ Σ_{j ∈ S} p(j | s, a) v(j) }
- This can also be expressed as v = Tv or Bv = 0
- v(s) is the value function of the MDP (a sketch of applying T appears below)
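As an illustration (not from the talk), here is a minimal sketch of applying the operator T once to a value function v, using the dictionary MDP representation from slide 15; `lam` stands for the discount factor λ.

```python
def bellman_operator(mdp, v, lam=0.95):
    """One application of T: (Tv)(s) = max_a { r(s,a) + lam * sum_j p(j|s,a) * v(j) }."""
    tv = {}
    for s in mdp["states"]:
        tv[s] = max(
            mdp["reward"][(s, a)]
            + lam * sum(prob * v[j] for j, prob in mdp["p"][(s, a)].items())
            for a in mdp["actions"][s]
        )
    return tv
```

A fixed point of T (v = Tv) is exactly a solution of the Bellman equation above.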
23. Some Theoretical Issues
- When does an optimal policy with nice structure exist?
  - Markov or stationary policy
  - (s, S) or control-limit policy
- When do computational algorithms converge, and how fast?
- What properties do solutions of the optimality equation have?
24. Computing Optimal Policies
- Why?
  - Implementation
  - Gold standard for heuristics
- Basic principle: transform a multi-period problem into a sequence of one-period problems.
- Why is computation difficult in practice?
  - Curse of dimensionality
25. Computational Methods
- Finite Horizon Models
- Backward Induction (Dynamic Programming)
- Infinite Horizon Models
- Value Iteration
- Policy Iteration
- Modified Policy Iteration
- Linear Programming
- Neuro-Dynamic Programming / Reinforcement Learning
26. Infinite Horizon Computation
- Iterative algorithms work as follows (see the sketch after this list):
  - Approximate the value function by v(s)
  - Select a new decision rule, e.g. one attaining the maximum in the Bellman equation at the current v
  - Re-approximate the value function
- Approximation methods
  - exact - policy iteration
  - iterative - value iteration and modified policy iteration
  - simulation-based - reinforcement learning
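A minimal sketch of value iteration built on the `bellman_operator` sketch above; the discount factor and stopping tolerance are illustrative, and the stopping rule is the standard ε-optimality bound for discounted models.

```python
def value_iteration(mdp, lam=0.95, eps=1e-6):
    """Iterate v <- Tv until the sup-norm change is small, then extract a decision rule."""
    v = {s: 0.0 for s in mdp["states"]}
    while True:
        tv = bellman_operator(mdp, v, lam)
        delta = max(abs(tv[s] - v[s]) for s in mdp["states"])
        v = tv
        if delta < eps * (1 - lam) / (2 * lam):  # guarantees an eps-optimal policy
            break
    # Greedy decision rule attaining the maximum at the near-fixed point.
    d = {
        s: max(
            mdp["actions"][s],
            key=lambda a: mdp["reward"][(s, a)]
            + lam * sum(prob * v[j] for j, prob in mdp["p"][(s, a)].items()),
        )
        for s in mdp["states"]
    }
    return v, d
```

Policy iteration would instead evaluate the current decision rule exactly (solving a linear system) before each improvement step.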
27. Applications (A to N)
- Airline Meal Planning
- Behavioural Ecology
- Capacity Expansion
- Decision Analysis
- Equipment Replacement
- Fisheries Management
- Gambling Systems
- Highway Pavement Repair
- Inventory Control
- Job Seeking Strategies
- Knapsack Problems
- Learning
- Medical Treatment
- Network Control
28. Applications (O to Z)
- Option Pricing
- Project Selection
- Queueing System Control
- Robotic Motion
- Scheduling
- Tetris
- User Modeling
- Vision (Computer)
- Water Resources
- X-Ray Dosage
- Yield Management
- Zebra Hunting
29. Coffee, Tea or ...? A Markov Decision Process Model for Airline Meal Provisioning (J. Goto, M.E. Lewis and MLP)
- Decision epochs: t = 1, ..., 5
  - 0 - departure time
  - 1-3 - 1, 2 and 3 hours pre-departure
  - 4 - 6 hours pre-departure
  - 5 - 36 hours pre-departure
- States: (l, q) with 0 ≤ l ≤ booking limit, 0 ≤ q ≤ capacity
- Actions (meal quantity after delivery)
  - A_t(l, q) = {0, 1, ..., plane capacity} for t = 3, 4, 5
  - A_t(l, q) = {q - van capacity, ..., q + van capacity} for t = 1, 2
30. Markov Decision Process Formulation
- Costs (depending on t)
  - r_t((l, q), a) = meal cost + return penalty + late delivery charge + shortage cost
- Transition probabilities (meals move deterministically to the chosen quantity; the load evolves stochastically):
  p_t((l', q') | (l, q), a) = p_t(l' | l) if q' = a, and 0 if q' ≠ a
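A sketch of how this transition law could be coded; `load_pmf(l_next, l, t)`, the one-step distribution of the reservation level, is a hypothetical stand-in for the airline's booking model, not something specified in the talk.

```python
def transition_prob(next_state, state, a, t, load_pmf):
    """p_t((l', q') | (l, q), a): meals after delivery equal the chosen
    quantity a with certainty, while the passenger load evolves randomly."""
    (l, q) = state
    (l_next, q_next) = next_state
    if q_next != a:                    # meal count moves deterministically to a
        return 0.0
    return load_pmf(l_next, l, t)      # hypothetical booking-evolution pmf
```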
31. An Optimal Decision Rule
(figure: optimal meal quantity as a function of passenger load at decision epoch 1, with van adjustments near departure)
32. Empirical Performance (figure)
33. Overage versus Shortage
- Evaluate the model over a range of terminal costs
- Observe the relationship between average overage and the proportion of flights short-catered
- 55 flight number / aircraft capacity combinations (evaluated separately)
34. Overage versus Shortage
- Performance of optimal policies
35. Information Acquisition
36. Information Acquisition and Optimization
- Objective: investigate the tradeoff between acquiring information and optimal policy choice
- Examples
  - Harpaz, Lee and Winkler (1982) study output decisions of a competitive firm in a market with random demand in which the demand distribution is unknown.
  - Braden and Oren (1994) study dynamic pricing decisions of a firm in a market with unknown consumer demand curves.
  - Lariviere and Porteus (1999) and Ding, Puterman and Bisi (2002) study order decisions of a censored newsvendor with unobservable lost sales and unknown demand distributions.
- Key result: it is optimal to experiment
37. Bayesian Newsvendor Model
- Newsvendor cost structure (c - unit cost, h - salvage value, p - penalty cost)
- Demand assumptions
  - positive, continuous
  - i.i.d. sample from f(x | θ) with unknown θ
  - prior on θ is π_1(θ)
- Assume first that demand is fully observable
- Demand = observed sales + lost sales
38. (figure: timeline over periods 1, 2, 3 - set order quantity y_1, observe demand x_1, set y_2, observe x_2, ...)
39. Demand Updating
- After observing demand x_n, the prior π_n(θ) is updated by Bayes' rule:
  π_{n+1}(θ) ∝ f(x_n | θ) π_n(θ)
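As a concrete sketch (the talk does not fix a demand family): with exponential demand f(x | θ) = θ e^{-θx} and a Gamma(a, b) prior on θ, this update stays in the Gamma family, so the distribution over θ can be carried as the pair (a, b).

```python
def update_exact(a, b, x):
    """Posterior after an exactly observed demand x:
    pi_{n+1}(theta) ∝ theta * exp(-theta * x) * pi_n(theta),
    i.e. Gamma(a, b) -> Gamma(a + 1, b + x)."""
    return a + 1.0, b + x

# Hypothetical prior and demand observations:
a, b = 2.0, 10.0
for x in [3.2, 1.4, 7.9]:
    a, b = update_exact(a, b, x)
```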
40. Bayesian Newsvendor Model
- Bayesian MDP formulation at decision epoch n (n = 1, 2, ..., N)
  - States: all probability distributions π_n on the unknown parameter θ
  - Actions: order quantities y_n
  - Costs: expected one-period newsvendor cost under the predictive demand distribution
  - Transition probabilities: Bayes' rule updating of π_n
41. Bayesian Newsvendor Model
- The optimality equations, for n = 1, ..., N, with a terminal boundary condition
- Key observation
  - The transition probabilities are independent of the actions, so the problem can be reduced to a sequence of single-period problems.
42. The Bayesian Newsvendor Policy
- The BMDP reduces to a sequence of single-period, two-step problems:
  - demand distribution parameter updating
  - cost minimization
- The minimizing order quantity is a critical fractile of the predictive demand distribution, where M_n is the CDF of the predictive density m_n (see the sketch below).
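A sketch of the cost-minimization step under the Gamma-exponential assumption introduced above, for which the predictive (Lomax) CDF M_n(x) = 1 - (b / (b + x))^a inverts in closed form; the critical fractile (p - c)/(p - h) uses the usual newsvendor convention for c, h, p and is an assumption of this sketch rather than a formula quoted from the talk.

```python
def bayesian_newsvendor_order(a, b, c, h, p):
    """y_n = M_n^{-1}((p - c) / (p - h)) for the Gamma(a, b)-exponential model,
    where M_n(x) = 1 - (b / (b + x))**a is the predictive demand CDF."""
    kappa = (p - c) / (p - h)   # critical fractile; requires h < c < p
    return b * ((1.0 - kappa) ** (-1.0 / a) - 1.0)

# Example with a hypothetical posterior (a, b) from the updating sketch:
y = bayesian_newsvendor_order(a=5.0, b=22.5, c=1.0, h=0.25, p=4.0)
```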
43. Bayesian Newsvendor with Unobservable Lost Sales
- Model set-up
  - Same as the fully observable case, but unmet demand is lost and unobservable
- Question
  - Is the Bayesian newsvendor policy still optimal?
44. Bayesian Newsvendor with Unobservable Lost Sales
- Demand is censored by the order quantity:
  - sales below y_n: demand is exactly observed
  - sales equal to y_n: demand is censored (only demand ≥ y_n is known)
- Demand updating is different in this case
45. Bayesian Newsvendor with Unobservable Lost Sales
(figure: possible observations after ordering y_{n-1} - sales x_{n-1} = 0 with probability m_{n-1}(0), x_{n-1} = 1 with m_{n-1}(1), ..., or censored sales x_{n-1} = y_{n-1} with probability 1 - M_{n-1}(y_{n-1}); each outcome yields a different posterior)
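Continuing the Gamma-exponential sketch: a censored observation (all y units sell, so only demand ≥ y is known) contributes the likelihood e^{-θy}, which updates b but not a. This illustrates why censoring slows learning; it is an illustrative assumption, not the paper's general model.

```python
def update_censored(a, b, y):
    """All y units sold (demand >= y): likelihood exp(-theta * y),
    so Gamma(a, b) -> Gamma(a, b + y); the shape a does not grow."""
    return a, b + y
```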
46. Bayesian Newsvendor with Unobservable Lost Sales
- Model formulation
  - States, actions, costs: as above
  - Transition probabilities: now depend on the order quantity through censoring
- The optimality equations no longer decouple across periods
47. The Key Result (Bayesian Newsvendor with Unobservable Lost Sales)
- The optimal order quantity exceeds the Bayesian newsvendor quantity if f(x | θ) is likelihood-ratio increasing in θ.
- In this model, decisions in separate periods are interrelated through the optimality equation.
- This means that it is optimal to trade off short-term optimality for learning.
- Question: What is an upper bound on y_n?
48. Bayesian Newsvendor with Unobservable Lost Sales
- Solving the optimality equations gives:
  - for n = N: the single-period Bayesian newsvendor solution
  - for n = 1, ..., N-1: the same form with the penalty p replaced by p(y_n), a policy-dependent penalty cost
- The proof of the key result is based on showing p(y_n) > p for n < N.
49. Bayesian Newsvendor with Unobservable Lost Sales
- Some comments
  - The extra penalty can be interpreted as the marginal expected value of information at decision epoch n.
  - Numerical results show small improvements when using the optimal policy as opposed to the Bayesian newsvendor policy.
  - We have extended this to a two-level supply chain.
50. Partially Observed MDPs
- In POMDPs, the system state is not observable.
- The decision maker receives a signal y related to the system state by q(y | s, a).
- Analysis is based on using Bayes' theorem to estimate the distribution of the system state given the signal
- Similar to the Bayesian MDPs described above
  - the posterior state distribution is a sufficient statistic for decision making
  - the state space is a continuum
- Early work by Smallwood and Sondik (1972)
- Applications
  - Medical diagnosis and treatment
  - Equipment repair
51. Reinforcement Learning and Neuro-Dynamic Programming
52. Neuro-Dynamic Programming or Reinforcement Learning
- A different way to think about dynamic programming
- Basis in artificial intelligence research
  - mimics the learning behavior of animals
- Developed to
  - solve problems with high-dimensional state spaces, and/or
  - solve problems in which the system is described by a simulator (as opposed to a mathematical model)
- NDP/RL refers to a collection of methods combining Monte Carlo methods with MDP concepts
53. Reinforcement Learning
- Mimics learning by experimenting in real life
  - learn by interacting with the environment
  - the goal is long term
  - uncertainty may be present (a task must be repeated many times to obtain its value)
- Trades off exploration and exploitation
- Key focus is estimating the value function (v(s) or Q(s, a))
  - start with a guess of the value function
  - carry out the task and observe the immediate outcome (reward and transition)
  - update the value function
54. Reinforcement Learning - Example
- Playing Tic-Tac-Toe (Sutton and Barto, 2000)
- You know the rules of the game but not the opponent's strategy (assumed fixed over time but with a random component)
- Approach
  - list the possible system states
  - start with an initial guess of the probability of winning in each state
  - observe the current state s and choose the action that moves you to the state with the highest probability of winning
  - observe the state s' after the opponent plays
  - revise the value of state s based on the value of s'
  - the player might not always choose the best action, but may try other actions to learn about different states
55. Reinforcement Learning - Example
- Observations about the Tic-Tac-Toe example
  - the dimension of the state space is 3^9
  - writing down a mathematical model for the game is challenging; simulating it is easy
  - the goal is to maximize the probability of winning; there is no immediate reward
- Possible updating mechanism using observations (see the sketch below):
  - v_new(s) ← v_old(s) + α ( v_old(s') - v_old(s) )
  - α is a step-size parameter
  - v_old(s') - v_old(s) is a temporal difference
  - the subsequent state s' depends on the player's action
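A minimal sketch of this update; `values` is assumed to be a dict mapping (abstractly encoded) board states to estimated win probabilities.

```python
def td_update(values, s, s_next, alpha=0.1):
    """Move v(s) toward v(s') by a step-size multiple of the temporal difference."""
    values[s] += alpha * (values[s_next] - values[s])
```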
56. Reinforcement Learning
- Problems can be classified in two ways: whether a system model is available, and whether a policy is specified
57. Temporal Difference Updating
- No-model example, discounted case, based on Q(s, a)
- Algorithm (policy specified):
  - start the system in s, choose action a, and observe s' and a'
  - update: Q(s, a) ← Q(s, a) + α ( r + λ Q(s', a') - Q(s, a) )
  - repeat, replacing (s, a) by (s', a')
- Algorithm (no policy specified) (Q-learning; see the sketch below):
  - start the system in s, choose action a, and observe s'
  - update: Q(s, a) ← Q(s, a) + α ( r + λ max_{a' ∈ A_{s'}} Q(s', a') - Q(s, a) )
  - repeat, replacing s by s'
- Issues include choosing α and stopping criteria
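A tabular Q-learning sketch against a simulator. The `env.reset() / env.done(s) / env.step(s, a) -> (r, s_next)` interface and the ε-greedy action choice are assumptions of this sketch, not part of the talk.

```python
import random

def q_learning(env, states, actions, episodes=1000, alpha=0.1, lam=0.95, eps=0.1):
    """Tabular update: Q(s,a) <- Q(s,a) + alpha * (r + lam * max_a' Q(s',a') - Q(s,a))."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = env.reset()                       # hypothetical simulator interface
        while not env.done(s):
            if random.random() < eps:         # explore occasionally...
                a = random.choice(actions)
            else:                             # ...otherwise exploit current Q
                a = max(actions, key=lambda a2: Q[(s, a2)])
            r, s_next = env.step(s, a)
            target = r + lam * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```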
58. RL Function Approximation
- High dimensionality is addressed by
  - replacing v(s) or Q(s, a) by a parametric representation such as Q(s, a; w) = Σ_i w_i φ_i(s, a),
  - and then applying the Q-learning algorithm, updating the weights w_i at each iteration (see the sketch below), or
  - approximating v(s) or Q(s, a) by a neural network
- Issue: choose basis functions φ_i(s, a) to reflect problem structure
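A sketch of the linear-architecture update, where the temporal difference drives a gradient-style step on the weights; the feature function `phi(s, a)`, returning a list of basis-function values, is an assumed input.

```python
def q_approx(w, phi, s, a):
    """Linear architecture: Q(s, a; w) = sum_i w[i] * phi(s, a)[i]."""
    return sum(wi * fi for wi, fi in zip(w, phi(s, a)))

def td_weight_update(w, phi, s, a, r, s_next, actions, alpha=0.05, lam=0.95):
    """One Q-learning step on the weights: w_i <- w_i + alpha * delta * phi_i(s, a)."""
    delta = (r + lam * max(q_approx(w, phi, s_next, a2) for a2 in actions)
             - q_approx(w, phi, s, a))
    return [wi + alpha * delta * fi for wi, fi in zip(w, phi(s, a))]
```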
59. RL Applications
- Backgammon
- Checker Player
- Robot Control
- Elevator Dispatching
- Dynamic Telecommunications Channel Allocation
- Job Shop Scheduling
- Supply Chain Management
60. Neuro-Dynamic Programming / Reinforcement Learning
"It is unclear which algorithms and parameter settings will work on a particular problem, and when a method does work, it is still unclear which ingredients are actually necessary for success. As a result, applications often require trial and error in a long process of parameter tweaking and experimentation." (Van Roy, 2002)
61. Conclusions
62. Concluding Comments
- MDPs provide an elegant formal framework for sequential decision making
- They are widely applicable
- They can be used to compute optimal policies
- They can be used as a baseline to evaluate heuristics
- They can be used to determine structural results about optimal policies
- Recent research is addressing the curse of dimensionality
63. Some References
- Bertsekas, D.P. and Tsitsiklis, J.N., Neuro-Dynamic Programming, Athena Scientific, 1996.
- Feinberg, E.A. and Shwartz, A. (eds.), Handbook of Markov Decision Processes: Methods and Applications, Kluwer, 2002.
- Puterman, M.L., Markov Decision Processes, Wiley, 1994.
- Sutton, R.S. and Barto, A.G., Reinforcement Learning, MIT Press, 2000.