Title: Neuro-Dynamic Programming
1Neuro-Dynamic Programming
- José A. Ramírez, Yan Liao
- Advanced Decision Processes
- ECECS 841, Spring 2003, University of Cincinnati
2Outline
- 1. Neuro-Dynamic Programming (NDP) motivation.
- 2. Introduction to Infinite Horizon Problems
- Minimization of Total Cost,
- Discounted Problems,
- Finite-State Systems,
- Value Iteration and Error Bounds,
- Policy Iteration,
- The Role of Contraction Mappings.
- 3. Stochastic Control Overview
- State Equation (system model),
- Value Function,
- Stationary policies and value function,
- Introductory example: Tetris (game).
3Outline
- 4. Control of Complex Systems
- Motivation for the use of NDP in complex systems,
- Examples of complex systems where NDP could be applied.
- 5. Value Function Approximation
- Linear parameterization: parameter vector and basis functions,
- Continuation of the Tetris example.
4Outline
- 6. Temporal-Difference Learning (TD(λ))
- Introduction: autonomous systems, the general TD(λ) algorithm,
- Controlled systems, TD(λ) for more general systems,
- Approximate policy iteration,
- Controlled TD,
- Q-functions, and approximating the Q-function (Q-learning),
- Comments on the relationship with Approximate Value Iteration.
- 7. Actors and Critics
- Averaged Rewards,
- Independent actors,
- Using critic feedback.
51. Neuro-Dynamic Programming (NDP) motivation
Study of Decision-Making
- How decisions are made: rational and irrational behavior (psychologists, economists).
- How decisions ought to be made: rational decision-making, with clear objectives and strategic behavior (engineers and management scientists).
Rational decision problems
- Development of a mathematical theory: understanding of dynamic models, uncertainties, objectives, and characterization of optimal decision strategies.
- If optimal strategies do exist, then computational methods are used as a complement (e.g., implementation).
61. Neuro-Dynamic Programming (NDP) motivation
- In contrast to rational decision-making, there is no clear-cut mathematical theory about decisions made by participants of natural systems (speculative theories, ideas refined by experimentation).
- One approach: hypothesize that behavior is in some sense rational; ideas from the study of rational decision-making are then used to characterize such behavior, e.g., utility and equilibrium theory in financial economics.
- The study of animal behavior is also of interest: evolutionary theory and its popular precept, survival of the fittest, support the possibility that behavior to some extent concurs with that of a rational agent.
- Contributions from the study of natural systems to the science of rational decision-making: the computational complexity of decision problems and the lack of systematic approaches for dealing with it.
71. Neuro-Dynamic Programming (NDP) motivation
- For example, practical problems addressed by the theory of dynamic programming (DP) can rarely be solved using DP algorithms, because the computational time required to generate optimal strategies typically grows exponentially in the number of variables involved (the curse of dimensionality).
- This calls for an understanding of suboptimal solutions and of decision-making under computational constraints. Problem: no satisfactory theory has been developed to this end.
- Interestingly, the fact that biological mechanisms facilitate the efficient synthesis of adequate strategies motivates the possibility that understanding such mechanisms can inspire new, computationally feasible methodologies for strategic decision-making.
81. Neuro-Dynamic Programming (NDP) motivation
- Reinforcement Learning (RL): over two decades, RL algorithms, originally conceived as descriptive models for phenomena observed in animal behavior, have grown out of the field of artificial intelligence and been applied to solving complex sequential decision problems.
- The success of RL in solving large-scale problems has generated special interest among operations researchers and control theorists, with research devoted to understanding those methods and their potential.
- Developments from operations research and control theory have focused on the normative view, acknowledging a relative disconnect from descriptive models of animal behavior; hence some operations researchers and control theorists have come to refer to this area of research as Neuro-Dynamic Programming (NDP) instead of RL.
91. Neuro-Dynamic Programming (NDP) motivation
- During these lectures we will present a sample of recent developments and open research issues in NDP.
- Specifically, we will focus on the two algorithmic ideas of greatest use in NDP, for which there has been significant theoretical progress in recent years:
- Temporal-Difference learning,
- Actor-Critic methods.
- First, we provide some background and perspective on the methodology and the problems it may address. Comments about references.
102. Introduction to Infinite Horizon Problems
Material taken from Dynamic Programming and Optimal Control, vols. I and II, and Neuro-Dynamic Programming, by Dimitri P. Bertsekas and John Tsitsiklis.
Dynamic programming problems with an infinite horizon are characterized by the following aspects:
a) The number of stages is infinite. This assumption is never satisfied in practice, but it is a reasonable approximation for problems with a finite but very large number of stages.
b) The system is stationary: the system equation, the cost per stage, and the random disturbance statistics do not change from one stage to the next.
Why Infinite Horizon Problems?
- They are interesting because their analysis is elegant and insightful.
- Implementation of optimal policies is often simple: optimal policies are typically stationary, i.e., the rule used to choose controls does not change from stage to stage.
- NDP is aimed at complex systems.
112. Introduction to Infinite Horizon Problems
- They require more sophisticated analysis than finite horizon problems: one needs to analyze limiting behavior as the horizon tends to infinity.
- We consider four principal classes of infinite horizon problems. The first two classes try to minimize the total cost over an infinite number of stages,
  J_π(x_0) = lim_{N→∞} E{ Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) }.
- i) Stochastic shortest path problems: in this case α = 1, and we assume that there is an additional state 0, which is a cost-free termination state; once the system reaches the termination state, it remains there at no additional cost. The objective is to reach the termination state with minimal cost.
- ii) Discounted problems with bounded cost per stage: here α < 1, and the absolute one-stage cost |g(x,u,w)| is bounded from above by some constant M. Thus, J_π(x_0) is well defined because it is the infinite sum of a sequence of numbers that are bounded in absolute value by the decreasing geometric progression Mα^k.
122. Introduction to Infinite Horizon Problems
- iii) Discounted problems with unbounded cost per stage: here the discount factor α may or may not be less than 1, and the cost per stage is unbounded. This problem is difficult to analyze because of the possibility of infinite cost for some policies (more details in chap. 3 of Dynamic Programming, vol. II, by Bertsekas).
- iv) Average cost problems: in some problems we have J_π(x_0) = ∞ for every policy π and initial state x_0. In many such problems the average cost per stage,
  lim_{N→∞} (1/N) J_π^N(i),
where J_π^N(i) is the N-stage cost-to-go of policy π starting at state i, is well defined as a limit and is finite.
132. Introduction to Infinite Horizon Problems
- A Preview of Infinite Horizon Results
- Let J* be the optimal cost-to-go function of the infinite horizon problem, and consider the case α = 1, with J_N(x) the optimal cost of the problem involving N stages, initial state x, cost per stage g(x,u,w), and zero terminal cost. The N-stage cost can be computed after N iterations of the DP algorithm
  J_{k+1}(x) = min_{u ∈ U(x)} E_w{ g(x,u,w) + J_k(f(x,u,w)) },  k = 0, 1, ...,  with J_0(x) = 0.
- Thus, we can speculate the following:
- i) The optimal infinite horizon cost is the limit of the corresponding N-stage optimal costs as N → ∞,
  J*(x) = lim_{N→∞} J_N(x)  for all states x.
Note that the time indexing has been reversed relative to the original DP algorithm, so the optimal finite horizon cost functions can be computed with a single DP recursion (more details in chap. 1 of Dynamic Programming, vol. II, by D.P. Bertsekas).
142. Introduction to Infinite Horizon Problems
- ii) The following limiting form of the DP algorithm should hold for all states x:
  J*(x) = min_{u ∈ U(x)} E_w{ g(x,u,w) + J*(f(x,u,w)) }.
This is not an algorithm, but a system of equations (one equation per state) whose solution is the cost-to-go of every state. It can also be viewed as a functional equation for the cost-to-go function J*, and it is called Bellman's equation.
- iii) If μ(x) attains the minimum in the right-hand side of Bellman's equation for each x, then the stationary policy π = {μ, μ, ...} should be optimal. This is true for most infinite horizon problems of interest.
- Most of the analysis of infinite horizon problems is focused on the above three issues and on efficient methods to compute J* and optimal policies.
152. Introduction to Infinite Horizon Problems
Stationary Policy
A stationary policy is an admissible policy of the form π = {μ, μ, ...}, with a corresponding cost function J_μ(x). The stationary policy μ is optimal if J_μ(x) = J*(x) for all states x.
Some Shorthand Notation
The use of single recursions in the DP algorithm to compute optimal costs over a finite horizon motivates the introduction of two mappings that play an important theoretical role and give us a convenient shorthand notation for expressions that are complicated to write.
For any function J: S → ℝ, where S is the state space, we consider the function TJ obtained by applying the DP mapping to J:
  (TJ)(x) = min_{u ∈ U(x)} E_w{ g(x,u,w) + α J(f(x,u,w)) },  x ∈ S.
T can be viewed as a mapping that transforms the function J on S into the function TJ on S. TJ represents the optimal cost function for the one-stage problem that has stage cost g and terminal cost αJ.
162. Introduction to Infinite Horizon Problems
Similarly, for any control function μ: S → C, where C is the space of controls, we have
  (T_μ J)(x) = E_w{ g(x, μ(x), w) + α J(f(x, μ(x), w)) },  x ∈ S.
Also, we denote by T^k the composition of the mapping T with itself k times,
  (T^k J)(x) = (T(T^{k-1} J))(x),  k = 1, 2, ...
Then, for k = 0 we have (T^0 J)(x) = J(x).
172. Introduction to Infinite Horizon Problems
Some Basic Properties
Monotonicity Lemma: For any functions J: S → ℝ and J': S → ℝ such that J(x) ≤ J'(x) for all x ∈ S, and for any function μ: S → C with μ(x) ∈ U(x) for all x ∈ S, we have
  (T^k J)(x) ≤ (T^k J')(x)  and  (T_μ^k J)(x) ≤ (T_μ^k J')(x),  for all x ∈ S and k = 1, 2, ...
182. Introduction to Infinite Horizon Problems
The Role of Contraction Mappings (Dynamic Programming, vol. II, Bertsekas)
Definition 1.4.1: A mapping H: B(S) → B(S) is said to be a contraction mapping if there exists a scalar ρ < 1 such that
  ||HJ - HJ'|| ≤ ρ ||J - J'||,  for all J, J' ∈ B(S),
where ||·|| is the sup norm, ||J|| = max_{x ∈ S} |J(x)|.
It is said to be an m-stage contraction mapping if there exist a positive integer m and some ρ < 1 such that
  ||H^m J - H^m J'|| ≤ ρ ||J - J'||,  for all J, J' ∈ B(S),
where H^m denotes the composition of H with itself m times.
Note: B(S) is the set of all bounded real-valued functions J: S → ℝ on S.
192. Introduction to Infinite Horizon Problems
The Role of Contraction Mappings (Dynamic Programming, vol. II, Bertsekas)
Proposition 1.4.1 (Contraction Mapping Fixed-Point Theorem): If H: B(S) → B(S) is a contraction mapping or an m-stage contraction mapping, then there exists a unique fixed point of H, i.e., there exists a unique function J* ∈ B(S) such that
  HJ* = J*.
Furthermore, if J is any function in B(S) and H^k is the composition of H with itself k times, then
  lim_{k→∞} ||H^k J - J*|| = 0.
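A small numerical check (ours) of Proposition 1.4.1: the sketch below iterates the mapping H = T_μ, i.e. HJ = g_μ + α P_μ J, which is a contraction of modulus α in the sup norm, from an arbitrary starting function and compares the result with the unique fixed point J_μ. The two-state chain and costs are made up.

```python
import numpy as np

alpha = 0.8
P_mu = np.array([[0.9, 0.1], [0.4, 0.6]])   # transition matrix under a fixed policy mu
g_mu = np.array([1.0, 2.0])                 # expected stage cost under mu

def H(J):
    """H = T_mu, a contraction with modulus alpha with respect to the sup norm."""
    return g_mu + alpha * P_mu @ J

# The unique fixed point J_mu solves (I - alpha * P_mu) J = g_mu.
J_fixed = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)

J = np.array([100.0, -50.0])                # arbitrary starting function in B(S)
for k in range(60):
    J = H(J)                                # H^k J -> J_mu as k -> infinity

print("H^k J        :", J)
print("fixed point  :", J_fixed)
print("sup-norm error:", np.max(np.abs(J - J_fixed)))
```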
203. Stochastic Control Overview
State Equation
Let's consider a discrete-time dynamic system that, at each time t, takes on a state x_t and evolves according to
  x_{t+1} = f(x_t, a_t, w_t),
where w_t is an i.i.d. disturbance and a_t is a control decision. We restrict attention to finite state, disturbance, and control spaces, denoted by X, W, and A, respectively.
Value Function
Let r: X × A → ℝ associate a reward r(x_t, a_t) with a decision a_t made at state x_t. Let μ be a stationary policy, with μ: X → A. For each policy μ we define a value function v(·, μ): X → ℝ,
  v(x, μ) = E[ Σ_{t=0}^{∞} α^t r(x_t, μ(x_t)) | x_0 = x ],
where α ∈ (0,1) is a discount factor.
213. Stochastic Control Overview
Optimal Value Function
We define the optimal value function V* as follows:
  V*(x) = max_μ v(x, μ),  for all x ∈ X.
From dynamic programming, we have that any stationary policy μ* given by
  μ*(x) ∈ argmax_{a ∈ A} E[ r(x, a) + α V*(f(x, a, w)) ]
is optimal in the sense that
  v(x, μ*) = V*(x),  for all x ∈ X.
223. Stochastic Control Overview
Example 1: Tetris. The video arcade game of Tetris
can be viewed as an instance of stochastic
control. In particular, we can view the state xt
as an encoding of the current wall of bricks
and the shape of the current falling piece. The
decision at identifies an orientation and
horizontal position for placement of the falling
piece onto the wall. Though the arcade game
employs a more complicated scoring system,
consider for simplicity a reward r(xt, at) equal
to the number of rows eliminated by placing the
piece in the position described by at. Then, a
stationary policy that maximizes the value
essentially optimizes a combination of present
and future row elimination, with decreasing
emphasis placed on rows to be eliminated at times
farther into the future.
233. Stochastic Control Overview
Example 1: Tetris, cont. Tetris was first
programmed by Alexey Pajitnov, Dmitry Pavlovsky,
and Vadim Gerasimov, computer engineers at the
Computer Center of the Russian Academy of
Sciences in 1985-86.
(Slide figures: the standard Tetris shapes, and the number of states.)
243. Stochastic Control Overview
Dynamic programming algorithms compute the optimal value function V*. The result is stored in a look-up table with one entry V*(x) per state x ∈ X. When required, the value function is used to generate optimal decisions. For example, given a current state x_t ∈ X, a decision a_t is selected according to
  a_t ∈ argmax_{a ∈ A} E[ r(x_t, a) + α V*(f(x_t, a, w)) ].
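The sketch below (an illustration, not code from the lectures) shows the look-up-table usage just described: given stored values V(x), a decision is selected by averaging r(x,a) + α V(f(x,a,w)) over the disturbance for each candidate action. The system model f, rewards r, value table V, and disturbance distribution are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical finite problem: 4 states, 2 actions, 2 equally likely disturbances.
n_states, n_actions, n_dist = 4, 2, 2
rng = np.random.default_rng(0)
f = rng.integers(0, n_states, size=(n_states, n_actions, n_dist))  # x' = f(x, a, w)
r = rng.random(size=(n_states, n_actions))                          # reward r(x, a)
V = rng.random(size=n_states)          # stands in for the stored value table V*(x)
alpha = 0.95

def greedy_decision(x):
    # a_t in argmax_a E[ r(x,a) + alpha * V(f(x,a,w)) ], expectation over w
    q = [r[x, a] + alpha * V[f[x, a]].mean() for a in range(n_actions)]
    return int(np.argmax(q))

print([greedy_decision(x) for x in range(n_states)])
```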
254. Control of Complex Systems
The main objective is the development of a methodology for the control of complex systems. Two common characteristics of this type of system are:
i) An intractable state space: intractable state spaces preclude the use of classical DP algorithms, which compute and store one numerical value per state.
ii) Severe nonlinearities: methods of traditional linear control, which are applicable even in large state spaces, are ruled out by severe nonlinearities.
Let's review some examples of complex systems where NDP could be, and has been, applied.
264. Control of Complex Systems
a) Call Admission and Routing: With rising demand
in telecommunication network resources, effective
management is as important as ever. Admission
(deciding which calls to accept/reject) and
routing (allocating links in the network to
particular calls) are examples of decisions that
must be made at any point in time. The objective
is to make the best use of limited network
resources. In principle, such sequential decision
problems can be addressed by dynamic programming.
Unfortunately, the enormous state spaces involved
render dynamic programming algorithms
inapplicable, and heuristic control strategies
are used instead.
b) Strategic Asset Allocation: Strategic asset allocation is the problem of distributing an investor's wealth among assets in the market in order to take on a combination of risk and expected return that best suits the investor's preferences. In general, the
optimal strategy involves dynamic rebalancing of
wealth among assets over time. If each asset
offers a fixed rate of risk and return, and some
additional simplifying assumptions are made, the
only state variable is wealth, and the problem
can be solved efficiently by dynamic programming
algorithms. There are even closed form solutions
in cases involving certain types of investor
preferences. However, in the more realistic
setting involving risks and returns that
fluctuate with economic conditions, economic
indicators must be taken into account as state
variables, and this quickly leads to an
intractable state space. The design of effective
strategies in such situations constitutes an
important challenge in the growing field of
financial engineering.
274. Control of Complex Systems
c) Supply-Chain Management: With today's tight
vertical integration, increased production
complexity, and diversification, the inventory
flow within and among corporations can be viewed
as a complex network called a supply chain
consisting of storage, production, and
distribution sites. In a supply chain, raw
materials and parts from external vendors are
processed through several stages to produce
finished goods. Finished goods are then
transported to distributors, then to wholesalers,
and finally retailers, before reaching customers.
The goal in supply-chain management is to achieve
a particular level of product availability while
minimizing costs. The solution is a policy that
decides how much to order or produce at various
sites given the present state of the company and
the operating environment.
d) Emissions Reductions: The threat of global warming that may
result from accumulation of carbon dioxide and
other greenhouse gasses poses a serious
dilemma. In particular, cuts in emission levels
bear a detrimental short-term impact on economic
growth. At the same time, a depleting environment
can severely hurt the economy especially the
agricultural sector in the longer term. To
complicate the matter further, scientific
evidence on the relationship between emission
levels and global warming is inconclusive,
leading to uncertainty about the benefits of
various cuts. One systematic approach to
considering these conflicting goals involves the
formulation of a dynamic system model that
describes our understanding of economic growth
and environmental science. Given such a model,
the design of environmental policy amounts to
dynamic programming. Unfortunately, classical
algorithms are inapplicable due to the size of
the state space.
284. Control of Complex Systems
e) Semiconductor Wafer Fabrication The
manufacturing floor at a semiconductor wafer
fabrication facility is organized into service
stations, each equipped with specialized
machinery. There is a single stream of jobs
arriving on a production floor. Each job follows
a deterministic route that revisits the same
station multiple times. This leads to a
scheduling problem where, at any time, each
station must select a job to service such that
(long term) production capacity is maximized.
Such a system can be viewed as a special class of
queueing networks, which are models suitable for
a variety of applications in manufacturing,
telecommunications, and computer systems. Optimal
control of queueing networks is notoriously
difficult, and this reputation is strengthened by
formal characterizations of computational
complexity. Other systems: parking lots, football game strategy, combinatorial optimization, maintenance and repair, dynamic channel allocation, backgammon. Some papers on applications:
- Tsitsiklis, J. and Van Roy, B., "Neuro-Dynamic Programming: Overview and a Case Study in Optimal Stopping," Proceedings of the 36th IEEE Conference on Decision and Control, San Diego, California, pp. 1181-1186, December 1997.
- Van Roy, B., Bertsekas, D.P., Lee, Y., and Tsitsiklis, J., "A Neuro-Dynamic Programming Approach to Retailer Inventory Management," Proceedings of the 36th IEEE Conference on Decision and Control, San Diego, California, pp. 4052-4057, December 1997.
- Marbach, P. and Tsitsiklis, J., "A Neuro-Dynamic Programming Approach to Admission Control in ATM Networks: The Single Link Case," Technical Report LIDS-P-2402, Laboratory for Information and Decision Systems, M.I.T., November 1997.
- Marbach, P., Mihatsch, O., and Tsitsiklis, J., "Call Admission Control and Routing in Integrated Services Networks Using Reinforcement Learning," Proceedings of the 37th IEEE Conference on Decision and Control, Tampa, Florida, pp. 563-568, December 1998.
- Bertsekas, D.P. and Homer, M.L., "Missile Defense and Interceptor Allocation by Neuro-Dynamic Programming," IEEE Transactions on Systems, Man and Cybernetics, vol. 30, pp. 101-110, 2000.
294. Control of Complex Systems
- For the examples presented, state spaces are intractable as a consequence of the curse of dimensionality; that is, state spaces grow exponentially in the number of state variables, making it difficult (if not impossible) to compute and store one value per state as required by classical DP.
- An additional shortcoming of classical DP: computations require the use of transition probabilities. For many complex systems, such probabilities are not readily accessible. On the other hand, it is often easier to develop simulation models of the system and generate sample trajectories.
- Objective of NDP: overcoming the curse of dimensionality through the use of parameterized (value) function approximators and through the use of output generated by simulators, rather than explicit transition probabilities.
305. Value Function Approximation
- Intractability of state spaces motivates value function approximation.
- Two important preconditions for the development of an effective approximation:
i) Choose a parameterization Ṽ(·, u) that yields a good approximation;
ii) Algorithms for computing appropriate parameter values.
Note: the choice of a suitable parameterization requires some practical experience or theoretical analysis that provides rough information about the shape of the function to be approximated.
315. Value Function Approximation
Linear parameterization
- General classes of parameterizations have found use in NDP; to keep the exposition simple, let us focus on linear parameterizations, which take the form
  Ṽ(x, u) = Σ_{k=1}^{K} u(k) φ_k(x),
where φ_1, ..., φ_K are basis functions mapping X to ℝ and u = (u(1), ..., u(K)) is a vector of scalar weights. Similar to statistical regression, the basis functions φ_1, ..., φ_K are selected by a human, based on intuition or analysis of the problem at hand.
Hint: one interpretation that is useful for the construction of basis functions involves viewing each function φ_k as a "feature," that is, a numerical value capturing a salient characteristic of the state that may be pertinent to effective decision making.
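A minimal sketch (ours) of a linear parameterization: hand-chosen basis functions φ_k map a state to feature values, and the approximate value is their weighted sum. The scalar state and the particular features are invented purely for illustration.

```python
import numpy as np

def features(x):
    """Basis functions phi_1..phi_K evaluated at a state x (here x is just a float)."""
    return np.array([1.0, x, x**2])     # a constant, a linear, and a quadratic feature

def v_tilde(x, u):
    """Linear architecture: V~(x, u) = sum_k u(k) * phi_k(x)."""
    return features(x) @ u

u = np.array([0.5, -1.0, 0.25])         # weight vector u = (u(1), ..., u(K))
print(v_tilde(3.0, u))                  # 0.5 - 3.0 + 2.25 = -0.25
```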
325. Value Function Approximation
Example 2: Tetris, continuation. In our
stochastic control formulation of Tetris, the
state is an encoding of the current wall
configuration and the current falling piece.
There are clearly too many states for exact
dynamic programming algorithms to be applicable.
However, we may believe that most information relevant to game-playing decisions can be captured by a few intuitive features. In particular, one feature, say φ_1, may map states to the height of the wall. Another, say φ_2, could map states to a measure of the jaggedness of the wall. A third might provide a scalar encoding of the type of the current falling piece (there are seven different shapes in the arcade game).
Given a collection of such features, the next task is to select weights u(1), ..., u(K) such that
  Σ_{k=1}^{K} u(k) φ_k(x) ≈ V*(x)  for all states x.
This approximation could then be used to generate a game-playing strategy.
335. Value Function Approximation
- Example 2: Tetris, continuation.
- A similar approach is presented in the book Neuro-Dynamic Programming (chapter 8, case studies) by D.P. Bertsekas and J. Tsitsiklis, with the following parameterization, arrived at after some experimentation:
- The height h_k of the kth column of the wall. There are w such features, where w is the wall's width.
- The absolute difference |h_k - h_{k+1}| between the heights of the kth and (k+1)st columns, k = 1, ..., w-1.
- The maximum wall height, max_k h_k.
- The number of holes L in the wall, that is, the number of empty positions of the wall that are surrounded by full positions.
345. Value Function Approximation
Example 2: Tetris, continuation. Thus, there are 2w+1 features, which together with a constant offset require 2w+2 weights in a linear architecture of the form
  Ṽ(x, u) = u(0) + Σ_{k=1}^{2w+1} u(k) φ_k(x),
where u(0) is the weight of the constant offset. Using this parameterization, with w = 10 (22 features), a strategy generated by NDP eliminates an average of 3,554 rows per game, reflecting performance comparable to that of an expert player.
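The sketch below (an illustrative reconstruction, not the course's code) computes a 2w+2-dimensional feature vector in the spirit of the parameterization above from a boolean wall array: the w column heights, the w-1 absolute height differences, the maximum height, a hole count (here taken as empty cells lying below the top filled cell of a column, one common reading), and a constant offset.

```python
import numpy as np

def tetris_features(wall):
    """wall: boolean array of shape (rows, w), True where a cell is occupied.
    Returns (h_1..h_w, |h_k - h_{k+1}| for k=1..w-1, max_k h_k, holes, 1)."""
    rows, w = wall.shape
    heights = np.zeros(w, dtype=int)
    holes = 0
    for k in range(w):
        filled = np.nonzero(wall[:, k])[0]              # row indices, 0 = top of board
        if filled.size:
            heights[k] = rows - filled[0]               # column height h_k
            holes += int(np.sum(~wall[filled[0]:, k]))  # empty cells below the top block
    diffs = np.abs(np.diff(heights))                    # |h_k - h_{k+1}|
    return np.concatenate([heights, diffs, [heights.max(), holes, 1]])

wall = np.zeros((20, 10), dtype=bool)
wall[19, :4] = True                                     # a single bottom row, 4 cells filled
print(tetris_features(wall).shape)                      # (2*10 + 2,) = (22,)
```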
356. Temporal-Difference Learning
Introduction to Temporal-Difference Learning
Material from Richard Sutton, "Learning to Predict by the Methods of Temporal Differences," Machine Learning, 3:9-44, 1988. This paper provides the first formal results in the theory of temporal-difference (TD) methods.
- Learning to predict: the use of past experience (historical information) with an incompletely known system to predict its future behavior.
- Learning to predict is one of the most basic and prevalent kinds of learning.
- In prediction learning, training examples can be taken directly from the temporal sequence of ordinary sensory input; no special supervisor or teacher is required.
- Conventional prediction-learning methods (Widrow-Hoff, LMS, Delta Rule, Backpropagation) are driven by the error between predicted and actual outcomes.
366. Temporal-Difference Learning
- TD methods
- Driven by the error or difference between temporally successive predictions: learning occurs whenever there is a change in prediction over time.
- Advantages of TD methods over conventional methods:
- They are more incremental, and therefore easier to compute.
- They tend to make more efficient use of their experience: they converge faster and produce better predictions.
- TD approach
- Predictions are based on numerical features combined using adjustable parameters or weights, similar to connectionist models (neural networks).
376. Temporal-Difference Learning
- TD and supervised-learning approaches to prediction
- Historically, the most important learning paradigm has been supervised learning: the learner is asked to associate pairs of items (input, output).
- Supervised learning has been used in pattern classification, concept acquisition, learning from examples, system identification, and associative memory.
(Block diagram: an input and the real output of a system A; a learning algorithm produces an estimated output, and the error between real and estimated outputs is used to adjust the estimator parameters.)
386. Temporal-Difference Learning
- Single-step and multi-step prediction problems
- Single-step: all information about the correctness of each prediction is revealed at once (supervised-learning methods).
- Multi-step: correctness is not revealed until more than one step after the prediction is made, but partial information relevant to its correctness is revealed at each step (TD learning methods).
- Computational issues
- Sutton introduces a particular TD procedure by formally relating it to a classical supervised-learning procedure, the Widrow-Hoff rule (also known as the delta rule, the ADALINE (Adaptive Linear Element), and the Least Mean Squares (LMS) filter).
- We consider multi-step prediction problems in which experience comes in observation-outcome sequences of the form x_1, x_2, x_3, ..., x_m, z, where each x_t is a vector of observations available at time t in the sequence, and z is the outcome of the sequence. Also, x_t ∈ ℝ^n and z ∈ ℝ.
396. Temporal-Difference Learning
- Computational issues (cont.)
- For each observation-outcome sequence, the learner produces a corresponding sequence of predictions P_1, P_2, P_3, ..., P_m, each of which is an estimate of z.
- Predictions P_t are based on a vector of modifiable parameters w, so we write P_t(x_t, w).
- All learning procedures are expressed as rules for updating w. For each observation, an increment to w, denoted Δw_t, is determined. After a complete sequence has been processed, w is changed by the sum of all the sequence's increments,
  w ← w + Σ_{t=1}^{m} Δw_t.
- The supervised-learning approach treats each sequence of observations and its outcome as a sequence of observation-outcome pairs (x_1, z), (x_2, z), ..., (x_m, z). In this case the increment due to time t depends on the error between P_t and z, and on how changing w will affect P_t.
406. Temporal-Difference Learning
- Computational issues (cont.)
- Then, a prototypical supervised-learning update procedure is
  Δw_t = α (z - P_t) ∇_w P_t,
where α is a positive parameter affecting the rate of learning, and the gradient ∇_w P_t is the vector of partial derivatives of P_t with respect to each component of w.
- Special case: consider P_t a linear function of x_t and w, that is, P_t = w^T x_t = Σ_i w(i) x_t(i), where w(i) and x_t(i) are the ith components of w and x_t, so that ∇_w P_t = x_t. Thus,
  Δw_t = α (z - P_t) x_t,
which corresponds to the Widrow-Hoff rule.
- These equations depend critically on z, and thus cannot be evaluated until the end of the sequence, when z becomes known. All observations and predictions made during a sequence must be remembered until its end; Δw_t cannot be computed incrementally.
416. Temporal-Difference Learning
- TD Procedure: There is a temporal-difference procedure that produces exactly the same result and can be computed incrementally. The key is to represent the error z - P_t as a sum of changes in predictions,
  z - P_t = Σ_{k=t}^{m} (P_{k+1} - P_k),  where P_{m+1} ≡ z.
Using this identity in the prototypical supervised-learning equation, we have
  Δw_t = α (P_{t+1} - P_t) Σ_{k=1}^{t} ∇_w P_k.
This update can be computed incrementally, because it depends only on a pair of successive predictions and on the sum of all past values of the gradient. We refer to this procedure as TD(1).
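To see the claimed equivalence concretely, the short sketch below (ours) forms the per-sequence sum of Widrow-Hoff increments and the per-sequence sum of TD(1) increments for a randomly generated linear prediction problem and checks that they coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lr = 5, 3, 0.1                   # sequence length, feature dimension, learning rate
xs = rng.standard_normal((m, n))       # observations x_1 .. x_m
z = rng.standard_normal()              # outcome of the sequence
w = rng.standard_normal(n)             # current parameter vector

P = xs @ w                             # linear predictions P_t = w^T x_t
P_next = np.append(P[1:], z)           # P_{t+1}, with P_{m+1} = z

# Widrow-Hoff total increment: sum_t lr * (z - P_t) * x_t
dw_wh = lr * ((z - P)[:, None] * xs).sum(axis=0)

# TD(1) total increment: sum_t lr * (P_{t+1} - P_t) * sum_{k<=t} x_k
cum_grad = np.cumsum(xs, axis=0)       # running sum of gradients (grad_w P_k = x_k)
dw_td1 = lr * ((P_next - P)[:, None] * cum_grad).sum(axis=0)

print(np.allclose(dw_wh, dw_td1))      # True: the two total increments coincide
```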
426. Temporal-Difference Learning
- The TD(λ) family of learning procedures
- The hallmark of temporal-difference methods is their sensitivity to changes in successive predictions rather than to the overall error between predictions and the final outcome.
- In response to an increase (decrease) in prediction from P_t to P_{t+1}, an increment Δw_t is determined that increases (decreases) the predictions for some or all of the preceding observation vectors x_1, ..., x_t.
- TD(1) is the special case where all the predictions are altered to an equal extent.
- Now consider the case where greater alterations are made to more recent predictions. We consider an exponential weighting with recency, in which alterations to the predictions of observation vectors occurring k steps in the past are weighted according to λ^k, for 0 ≤ λ ≤ 1.
(Figure: the weight λ^{t-k} plotted against t-k; λ = 1 weights all past predictions equally (TD(1)), λ = 0 affects only the most recent prediction (TD(0)), and for 0 < λ < 1 the weight decays as k increases.)
436. Temporal-Difference Learning
- The TD(λ) family of learning procedures
- The general TD(λ) update is
  Δw_t = α (P_{t+1} - P_t) Σ_{k=1}^{t} λ^{t-k} ∇_w P_k.
- For λ = 0 we have the TD(0) procedure,
  Δw_t = α (P_{t+1} - P_t) ∇_w P_t.
- For λ = 1 we have the TD(1) procedure, which is equivalent to the Widrow-Hoff rule, except that TD(1) is incremental.
- Alterations of past predictions can be weighted in ways other than the exponential form given previously; let
  e_t = Σ_{k=1}^{t} λ^{t-k} ∇_w P_k,  so that  e_{t+1} = ∇_w P_{t+1} + λ e_t.
These are also referred to in the literature as eligibility vectors.
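A compact sketch (ours) of the off-line TD(λ) update using the eligibility-vector recursion e_t = ∇_w P_t + λ e_{t-1} for a linear predictor; setting λ = 1 reproduces the Widrow-Hoff total increment, per the equivalence above, while λ = 0 gives TD(0). All data are random placeholders.

```python
import numpy as np

def td_lambda_increments(xs, z, w, lam, lr=0.1):
    """Off-line TD(lambda) for a linear predictor P_t = w^T x_t.
    Returns the total weight increment accumulated over one sequence."""
    m, n = xs.shape
    total, e = np.zeros(n), np.zeros(n)
    P = np.append(xs @ w, z)                # predictions, with P_{m+1} = z by convention
    for t in range(m):
        e = xs[t] + lam * e                 # eligibility: e_t = grad P_t + lam * e_{t-1}
        total += lr * (P[t + 1] - P[t]) * e
    return total

rng = np.random.default_rng(2)
xs, z, w = rng.standard_normal((6, 4)), 1.5, rng.standard_normal(4)
wh = 0.1 * ((z - xs @ w)[:, None] * xs).sum(axis=0)               # Widrow-Hoff total
print(np.allclose(td_lambda_increments(xs, z, w, lam=1.0), wh))   # True
print(td_lambda_increments(xs, z, w, lam=0.0))                    # TD(0) increments
```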
446. Temporal-Difference Learning
- (Material taken from Neuro-Dynamic Programming, Chapter 5)
- Monte Carlo Simulation: brief overview
- Suppose v is a random variable with unknown mean m that we want to estimate.
- Using Monte Carlo simulation to estimate m: generate a number of samples v_1, v_2, ..., v_N, and then estimate m by forming the sample mean
  M_N = (1/N) Σ_{i=1}^{N} v_i.
Also, we can compute the sample mean recursively,
  M_{N+1} = M_N + (1/(N+1)) (v_{N+1} - M_N),
with M_1 = v_1.
456. Temporal-Difference Learning
- Monte Carlo simulation: the case of i.i.d. samples
- Suppose the N samples v_1, v_2, ..., v_N are independent and identically distributed, with mean m and variance σ². Then we have
  E[M_N] = m,
so M_N is said to be an unbiased estimator of m. Also, its variance is given by
  Var(M_N) = σ²/N.
- As N → ∞ the variance of M_N converges to zero, so M_N converges to m.
- Also, the strong law of large numbers provides an additional property: the sequence M_N converges to m with probability one, i.e., the estimator is consistent.
466. Temporal-Difference Learning
- Policy Evaluation by Monte Carlo simulation
- Consider the stochastic shortest path problem, with state space {0, 1, 2, ..., n}, where 0 is a cost-free absorbing state. Let g(i, j) be the cost of the transition from i to j (given the control action μ(i) and transition probabilities p_ij(μ(i))). Suppose that we have a fixed, proper stationary policy μ and we want to calculate, using simulation, the corresponding cost-to-go vector
  J^μ = ( J^μ(1), J^μ(2), ..., J^μ(n) ).
- Approach: generate, starting from each i, many sample state trajectories and average the corresponding costs to obtain an approximation to J^μ(i). Instead of doing this separately for each state i, let us use each trajectory to obtain cost samples for all states visited by the trajectory, by considering the cost of the trajectory portion that starts at each intermediate state.
476. Temporal-Difference Learning
- Policy Evaluation by Monte Carlo simulation
- Suppose that a number of simulation runs are performed, each ending at the termination state 0.
- Consider the m-th time a given state i_0 is encountered, and let (i_0, i_1, ..., i_N) be the remainder of the corresponding trajectory, where i_N = 0.
- Let c(i_0, m) be the cumulative cost up to reaching state 0:
  c(i_0, m) = g(i_0, i_1) + g(i_1, i_2) + ... + g(i_{N-1}, i_N).
- Some assumptions: different simulated trajectories are statistically independent, and each trajectory is generated according to the Markov process determined by the policy μ. Then we have
  J^μ(i_0) = E[ c(i_0, m) ].
486. Temporal-Difference Learning
Policy Evaluation by Monte Carlo simulation
The estimate of J^μ(i) is obtained by forming the sample mean
  J(i) = (1/K) Σ_{m=1}^{K} c(i, m)
subsequent to the Kth encounter with state i. The sample mean can be expressed in iterative form,
  J(i) := J(i) + γ_m ( c(i, m) - J(i) ),  m = 1, ..., K,
where γ_m = 1/m, starting with J(i) = 0.
496. Temporal-Difference Learning
Policy Evaluation by Monte Carlo simulation
Consider the trajectory (i_0, i_1, ..., i_N), and let k be an integer such that 1 ≤ k ≤ N. The trajectory contains the subtrajectory (i_k, i_{k+1}, ..., i_N), a sample trajectory with initial state i_k that can be used to update J(i_k) using the iterative equation previously presented.
Algorithm: run a simulation and generate the state trajectory (i_0, i_1, ..., i_N); update the estimates J(i_k), for each k = 0, 1, ..., N-1, with the formula
  J(i_k) := J(i_k) + γ(i_k) ( g(i_k, i_{k+1}) + g(i_{k+1}, i_{k+2}) + ... + g(i_{N-1}, i_N) - J(i_k) ).
The step size γ(i_k) can change from one iteration to the next. Additional details: Neuro-Dynamic Programming by Bertsekas and Tsitsiklis, chapter 5.
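Below is an illustrative sketch (ours, not from the book) of the Monte Carlo policy-evaluation algorithm just described, on a made-up three-state stochastic shortest path problem under a fixed proper policy: each trajectory provides a cost sample for every state it visits, and J(i) is updated by the iterative sample-mean formula.

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up SSP under a fixed proper policy mu: state 0 is cost-free and absorbing;
# for states 1 and 2, P[i] lists successor states and their probabilities.
P = {1: ([0, 2], [0.5, 0.5]), 2: ([0, 1], [0.7, 0.3])}
g = {(1, 0): 1.0, (1, 2): 2.0, (2, 0): 1.0, (2, 1): 0.5}   # transition costs g(i, j)

J = np.zeros(3)
visits = np.zeros(3)

for _ in range(20000):
    i, traj = int(rng.integers(1, 3)), []
    while i != 0:                                  # simulate until termination state 0
        j = int(rng.choice(P[i][0], p=P[i][1]))
        traj.append((i, j))
        i = j
    tail_cost = 0.0
    for i, j in reversed(traj):                    # cost of the tail starting at each i_k
        tail_cost = g[(i, j)] + tail_cost
        visits[i] += 1
        J[i] += (tail_cost - J[i]) / visits[i]     # J(i) := J(i) + (1/m)(c(i,m) - J(i))

print("Monte Carlo estimate:", J[1:])
# Exact J^mu for comparison, solving J = g_mu + P_mu J over the transient states:
P_mu = np.array([[0.0, 0.5], [0.3, 0.0]])
g_mu = np.array([0.5 * 1.0 + 0.5 * 2.0, 0.7 * 1.0 + 0.3 * 0.5])
print("Exact cost-to-go:   ", np.linalg.solve(np.eye(2) - P_mu, g_mu))
```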
506. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences
Here we consider an implementation of the Monte Carlo policy evaluation algorithm that incrementally updates the cost-to-go estimates J(i). First, assume that for any trajectory i_0, i_1, ..., i_N, with i_N = 0, we set i_k = 0 for k > N and g(i_k, i_{k+1}) = 0 for k ≥ N, and J(0) = 0. Also, the policy under consideration is proper. Let's rewrite the previous formula in the following form:
  J(i_k) := J(i_k) + γ Σ_{l=k}^{N-1} ( g(i_l, i_{l+1}) + J(i_{l+1}) - J(i_l) ).
Note that we use the property J(i_N) = 0.
516. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences
Equivalently, we can rewrite the previous equation as
  J(i_k) := J(i_k) + γ ( d_k + d_{k+1} + ... + d_{N-1} ),
where
  d_l = g(i_l, i_{l+1}) + J(i_{l+1}) - J(i_l)
are called temporal differences (TD). The temporal difference d_k represents the difference between g(i_k, i_{k+1}) + J(i_{k+1}), the estimate of the cost-to-go based on the simulated outcome of the current stage, and the current estimate J(i_k).
526. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences
Advantage: the estimates can be computed incrementally. The temporal difference d_l appears in the update formula for J(i_k) for every k ≤ l, so as soon as the transition to i_{l+1} has been simulated and d_l becomes available, we can perform the updates
  J(i_k) := J(i_k) + γ d_l,  for all k ≤ l.
536. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ)
Here we introduce the TD(λ) algorithm as a stochastic approximation method for solving a suitably reformulated Bellman equation. The Monte Carlo evaluation algorithm can be viewed as a Robbins-Monro stochastic approximation method (more details in chapter 4 of Neuro-Dynamic Programming) for solving the equations
  J^μ(i_k) = E[ Σ_{l=k}^{N-1} g(i_l, i_{l+1}) ],
for the unknowns J^μ(i_k), as i_k ranges over the states in the state space. Other algorithms can be generated in a similar way, e.g., starting from other systems of equations involving J^μ and then replacing expectations by single-sample estimates. For example, from Bellman's equation
  J^μ(i_k) = E[ g(i_k, i_{k+1}) + J^μ(i_{k+1}) ],
546. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ)
the stochastic approximation method takes the form
  J(i_k) := J(i_k) + γ ( g(i_k, i_{k+1}) + J(i_{k+1}) - J(i_k) ) = J(i_k) + γ d_k,
which is updated each time state i_k is visited. Let us now take a fixed nonnegative integer l and take into consideration the cost of the first l+1 transitions; then the stochastic algorithm could be based on the (l+1)-step Bellman equation
  J^μ(i_k) = E[ Σ_{m=k}^{k+l} g(i_m, i_{m+1}) + J^μ(i_{k+l+1}) ].
Without any special knowledge to select one value of l over another, we consider forming a weighted average of all possible multistep Bellman equations. Specifically, we fix some λ < 1, multiply the (l+1)-step equation by (1-λ)λ^l, and sum over all nonnegative l.
556. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ)
Then, we have
  J^μ(i_k) = (1-λ) Σ_{l=0}^{∞} λ^l E[ Σ_{m=k}^{k+l} g(i_m, i_{m+1}) + J^μ(i_{k+l+1}) ].
Interchanging the order of the two summations, and using the fact that
  (1-λ) Σ_{l=m-k}^{∞} λ^l = λ^{m-k},
we have
  J^μ(i_k) = E[ Σ_{m=k}^{∞} λ^{m-k} ( g(i_m, i_{m+1}) + (1-λ) J^μ(i_{m+1}) ) ].
566. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ)
The previous equation can be expressed in terms of the temporal differences as follows:
  J^μ(i_k) = J^μ(i_k) + E[ Σ_{m=k}^{∞} λ^{m-k} d_m ],
where
  d_m = g(i_m, i_{m+1}) + J^μ(i_{m+1}) - J^μ(i_m)
are the temporal differences, and E[d_m] = 0 for all m (Bellman's equation). The Robbins-Monro stochastic approximation method corresponding to the previous equation is
  J(i_k) := J(i_k) + γ Σ_{m=k}^{∞} λ^{m-k} d_m,
where γ is a stepsize parameter (it can change from iteration to iteration). The above equation provides us with a family of algorithms, one for each choice of λ, known as TD(λ). Note that for λ = 1 we have the Monte Carlo policy evaluation method, also called TD(1).
576. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ)
Also, for λ = 0 we have another limiting case; using the convention 0^0 = 1, the TD(0) method is
  J(i_k) := J(i_k) + γ d_k = J(i_k) + γ ( g(i_k, i_{k+1}) + J(i_{k+1}) - J(i_k) ).
This equation coincides with the stochastic approximation based on the one-step Bellman equation presented previously.
Off-line and On-line variants: when all of the updates are carried out simultaneously, after the entire trajectory has been simulated, we have the off-line version of TD(λ). Alternatively, when the updates are performed one term at a time, as each temporal difference becomes available, we have the on-line version of TD(λ).
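An illustrative sketch (ours) of the on-line tabular TD(λ) iteration on the same kind of toy stochastic shortest path problem used earlier: each temporal difference d_l = g(i_l, i_{l+1}) + J(i_{l+1}) - J(i_l) is applied, as soon as it becomes available, to every state visited earlier in the trajectory, weighted by λ raised to the age of that visit. The step size and run counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
P = {1: ([0, 2], [0.5, 0.5]), 2: ([0, 1], [0.7, 0.3])}     # same toy SSP and policy
g = {(1, 0): 1.0, (1, 2): 2.0, (2, 0): 1.0, (2, 1): 0.5}

def td_lambda_ssp(lam, gamma=0.01, runs=20000):
    J = np.zeros(3)                 # J(0) = 0 stays fixed (termination state)
    for _ in range(runs):
        i = int(rng.integers(1, 3))
        z = np.zeros(3)             # z(i_k) accumulates lam**(l - k) for visited states
        while i != 0:               # on-line variant: apply each d_l as it arrives
            j = int(rng.choice(P[i][0], p=P[i][1]))
            d = g[(i, j)] + J[j] - J[i]        # temporal difference d_l
            z[i] += 1.0
            J[1:] += gamma * d * z[1:]         # J(i_k) += gamma * lam^(l-k) * d_l
            z *= lam
            i = j
    return J[1:]

print("TD(0):", td_lambda_ssp(lam=0.0))
print("TD(1):", td_lambda_ssp(lam=1.0))        # TD(1) = Monte Carlo policy evaluation
```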
58Temporal-Difference Learning (TD(?))
- Discounted Problem
- The (l+1)-step Bellman equation is
  J^μ(i_k) = E[ Σ_{m=k}^{k+l} α^{m-k} g(i_m, i_{m+1}) + α^{l+1} J^μ(i_{k+l+1}) ].
- Specifically, we fix some λ < 1, multiply by (1-λ)λ^l, and sum over all nonnegative l.
59Temporal-Difference Learning (TD(?))
- Interchanging the order of the two summations, and using the fact that
  (1-λ) Σ_{l=m-k}^{∞} λ^l = λ^{m-k},
we have
  J^μ(i_k) = E[ Σ_{m=k}^{∞} (αλ)^{m-k} ( g(i_m, i_{m+1}) + α(1-λ) J^μ(i_{m+1}) ) ].
60Temporal-Difference Learning (TD(?))
- In terms of the temporal differences, defined for the discounted problem by
  d_m = g(i_m, i_{m+1}) + α J^μ(i_{m+1}) - J^μ(i_m),
we have
  J^μ(i_k) = J^μ(i_k) + E[ Σ_{m=k}^{∞} (αλ)^{m-k} d_m ].
- Again we have E[d_m] = 0 for all m.
61Temporal-Difference Learning (TD(?))
- From here on, the development is entirely similar to the development for the undiscounted case.
- The only difference is that the discount factor α enters the definition of the temporal difference, and that λ is replaced by αλ.
- In particular, we have
  J(i_k) := J(i_k) + γ Σ_{m=k}^{∞} (αλ)^{m-k} d_m.
62Temporal-Difference Learning (TD(?))
- Approximation (linear): to tune basis function weights
- Value function
- Autonomous systems
- Controlled systems
- Approximate policy iteration
- Controlled TD
- Q-function
- Relationship with Approximate Value Iteration
- Historical View
63Value function Autonomous systems
- Problem formulation
- Autonomous process:
  x_{t+1} = f(x_t, w_t).
- Value function:
  V(x) = E[ Σ_{t=0}^{∞} α^t r(x_t) | x_0 = x ],
where r(x_t) is a scalar reward and α ∈ (0,1) is a discount factor.
- Linear approximation:
  Ṽ(x, u) = Σ_{k=1}^{K} u(k) φ_k(x),
where {φ_1, ..., φ_K} is a collection of basis functions.
64Value function Autonomous systems
- Suppose that we observe a sequence of states x_0, x_1, x_2, ...
- At time t the weight vector has been set to some value u_t.
- The temporal difference corresponding to the transition from x_t to x_{t+1} is
  d_t = r(x_t) + α Ṽ(x_{t+1}, u_t) - Ṽ(x_t, u_t),
where Ṽ(x_t, u_t) is a prediction of V(x_t) given our current approximation to the value function, and r(x_t) + α Ṽ(x_{t+1}, u_t) is an improved prediction that incorporates knowledge of the reward and the next state.
65Value function Autonomous systems
- Given an arbitrary initial weight vector u_0, we want eventually to find a suitable weight vector.
- The updating law for the weight vector is
  u_{t+1} = u_t + γ_t d_t z_t,
where γ_t is a scalar step size and z_t is called the eligibility vector.
66Value function Autonomous systems
- The eligibility vector is defined as
  z_t = Σ_{k=0}^{t} (αλ)^{t-k} ∇_u Ṽ(x_k, u_k),
where, for the linear architecture, ∇_u Ṽ(x_k, u) = φ(x_k) = (φ_1(x_k), ..., φ_K(x_k)), providing a direction for the adjustment of u_t such that Ṽ(x_t, ·) moves towards the improved prediction.
67Value function Autonomous systems
- Note that the eligibility vectors can be recursively updated according to
  z_{t+1} = αλ z_t + ∇_u Ṽ(x_{t+1}, u_{t+1}),  with z_0 = ∇_u Ṽ(x_0, u_0).
68Value function Autonomous systems
- Consequently, when λ = 0 the updating law for the weight vector can be rewritten as
  u_{t+1} = u_t + γ_t d_t ∇_u Ṽ(x_t, u_t).
That means only the last state has an effect on the update.
- In the more general case of λ > 0,
  u_{t+1} = u_t + γ_t d_t z_t,
where the temporal difference d_t acts as the trigger for learning, γ_t is the step size, and the eligibility vector z_t gives the direction of the adjustment.
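The sketch below (ours, not from the lectures) implements the update u_{t+1} = u_t + γ d_t z_t with the recursion z_{t+1} = αλ z_t + φ(x_{t+1}) for an autonomous two-state Markov chain and a two-feature linear architecture; because the features here span all functions on the state space, the result can be compared against the exactly computed value function. The chain, rewards, and features are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
P = np.array([[0.9, 0.1], [0.2, 0.8]])      # autonomous Markov chain x_{t+1} ~ P[x_t, :]
r = np.array([1.0, -1.0])                   # reward r(x)
alpha, lam, gamma = 0.9, 0.7, 0.01
phi = np.array([[1.0, 0.0], [1.0, 1.0]])    # phi(x): two basis functions per state

def v_tilde(x, u):
    return phi[x] @ u                        # linear architecture V~(x, u)

u = np.zeros(2)
x = 0
z = phi[x].copy()                            # eligibility vector z_0 = phi(x_0)
for t in range(200000):
    x_next = int(rng.choice(2, p=P[x]))
    d = r[x] + alpha * v_tilde(x_next, u) - v_tilde(x, u)   # temporal difference d_t
    u = u + gamma * d * z                    # u_{t+1} = u_t + gamma * d_t * z_t
    z = alpha * lam * z + phi[x_next]        # z_{t+1} = alpha*lam*z_t + phi(x_{t+1})
    x = x_next

V_true = np.linalg.solve(np.eye(2) - alpha * P, r)   # exact V for comparison
print("TD(lambda):", phi @ u, " exact:", V_true)
```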
69Convergence Linear Approximators
- Under appropriate technical conditions:
- i) For any λ ∈ [0,1], there exists a vector u(λ) such that the sequence u_t generated by the algorithm converges (with probability one) to u(λ).
- ii) The limit of convergence u(λ) satisfies the error bound
  ||Φ u(λ) - V||_D ≤ ((1 - λα)/(1 - α)) ||Π V - V||_D,
where Φ is the matrix whose columns are the basis functions, Π is the projection onto the span of the basis functions with respect to the weighted norm ||·||_D, and D is the diagonal matrix of steady-state state probabilities.
- J. N. Tsitsiklis and B. Van Roy, An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.
- Bertsekas and Tsitsiklis, Neuro-Dynamic Programming, chapter 6.
70Value function Controlled systems
- Unlike an autonomous system, a controlled system
cannot be passively simulated and observed.
Control decisions are required and influence the
system's dynamics.
- The objective here is to approximate the optimal
value function of a controlled system.
71Value function Controlled systems
- Approximate Policy Iteration
- A classical dynamic programming algorithm: policy iteration.
- Given a value function v(·, μ_m) corresponding to a stationary policy μ_m, an improved policy μ_{m+1} can be defined by
  μ_{m+1}(x) ∈ argmax_{a ∈ A} E[ r(x, a) + α v(f(x, a, w), μ_m) ].
- In particular, v(x, μ_{m+1}) ≥ v(x, μ_m) for all x ∈ X.
- Furthermore, a sequence of policies μ_0, μ_1, μ_2, ..., initialized with some arbitrary μ_0 and updated according to the rule above, converges to an optimal policy μ*.
72Value function Controlled systems
- Approximate Policy Iteration
- For each value function v(·, μ_m), use temporal-difference learning to find a weight vector u_m such that
  Ṽ(x, u_m) ≈ v(x, μ_m),
generating a sequence of weight vectors u_0, u_1, u_2, ...
- Select μ_{m+1} such that
  μ_{m+1}(x) ∈ argmax_{a ∈ A} E[ r(x, a) + α Ṽ(f(x, a, w), u_m) ],
with an arbitrary initial stationary policy μ_0.
73Value function Controlled systems
- Approximate Policy Iteration
- Two loops:
- External: find a converged weight vector, then update the present stationary policy.
- Internal: apply temporal-difference learning to generate each iterate of the weight vector (value function approximation).
- Initialization: an arbitrary initial stationary policy μ_0.
74Value function Controlled systems
- Approximate Policy Iteration
- A result from section 6.2 of D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996:
- if there exists some ε such that, for all m,
  max_x | Ṽ(x, u_m) - v(x, μ_m) | ≤ ε,
then
  lim sup_{m→∞} max_x ( V*(x) - v(x, μ_m) ) ≤ 2αε / (1 - α)².
- The external sequence does not always converge.
75Value function Controlled systems
- Controlled TD
- Arbitrarily initialize the weight vector u_0 and the eligibility vector z_0, and update
  u_{t+1} = u_t + γ_t d_t z_t,   z_{t+1} = αλ z_t + ∇_u Ṽ(x_{t+1}, u_{t+1}).
- Generate a decision according to
  a_t ∈ argmax_{a ∈ A} E[ r(x_t, a) + α Ṽ(f(x_t, a, w), u_t) ],
where
  d_t = r(x_t, a_t) + α Ṽ(x_{t+1}, u_t) - Ṽ(x_t, u_t).
76Value function Controlled systems
- Big problem: convergence.
- A modification that has been found to be useful in practical applications involves adding exploration noise to the controls.
- One approach to this end involves randomizing decisions by choosing at each time t a decision a, for a ∈ A, with probability proportional to
  exp( E[ r(x_t, a) + α Ṽ(f(x_t, a, w), u_t) ] / ε ),
where ε is a small scalar.
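A small sketch (ours; the exact expression on the slide was an image, so the exponential weighting used here is an assumption consistent with the notes on the next slide) of exploration by sampling decisions with probability proportional to exp(score/ε): every decision keeps positive probability, and as ε shrinks the probability mass concentrates on the greedy decision.

```python
import numpy as np

def exploratory_decision(scores, eps, rng):
    """Sample a decision with probability proportional to exp(score / eps).
    scores[a] stands for E[ r(x,a) + alpha * V~(f(x,a,w), u) ], assumed precomputed."""
    logits = np.asarray(scores) / eps
    p = np.exp(logits - logits.max())        # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(scores), p=p)), p

rng = np.random.default_rng(6)
scores = [1.0, 1.2, 0.8]
for eps in (1.0, 0.1, 0.01):
    _, p = exploratory_decision(scores, eps, rng)
    print(eps, np.round(p, 3))               # as eps shrinks, mass concentrates on argmax
```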
77Value function Controlled systems
- Note: 1) Every decision is chosen with probability > 0.
- 2) As ε → 0, the probability of choosing a decision a_t such that a_t ∈ argmax_{a ∈ A} E[ r(x_t, a) + α Ṽ(f(x_t, a, w), u_t) ] tends to 1.
- (Simple proof.)
78Q-Function
- Given V*,
- define a Q-function
  Q*(x, a) = E[ r(x, a) + α V*(f(x, a, w)) ];
- then
  μ*(x) ∈ argmax_{a ∈ A} Q*(x, a)  and  V*(x) = max_{a ∈ A} Q*(x, a).
- Q-learning is a variant of temporal-difference learning that approximates Q-functions rather than value functions.
79Q-Function
- Q-learning
- Arbitrarily initialize the weight vector u_0 and the eligibility vector z_0, and update
  u_{t+1} = u_t + γ_t d_t z_t,   z_{t+1} = αλ z_t + ∇_u Q̃(x_{t+1}, a_{t+1}, u_{t+1}).
- Generate a decision according to
  a_t ∈ argmax_{a ∈ A} Q̃(x_t, a, u_t),
where
  d_t = r(x_t, a_t) + α max_{a ∈ A} Q̃(x_{t+1}, a, u_t) - Q̃(x_t, a_t, u_t).
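An illustrative sketch (ours) of Q-learning in the simplest look-up-table case, which corresponds to a linear architecture with one indicator feature per state-action pair and λ = 0; state-action pairs are sampled uniformly as a crude form of exploration, and the step size decays as one over the visit count. The toy model is invented.

```python
import numpy as np

rng = np.random.default_rng(7)
# Toy problem: 3 states, 2 actions, 2 equally likely disturbances; all values invented.
nxt = rng.integers(0, 3, size=(3, 2, 2))       # next state f(x, a, w)
r = rng.random(size=(3, 2))                    # reward r(x, a)
alpha = 0.9

Q = np.zeros((3, 2))                           # look-up table Q~(x, a)
counts = np.zeros((3, 2))
for _ in range(200000):
    x = int(rng.integers(0, 3))                # sample state-action pairs uniformly,
    a = int(rng.integers(0, 2))                # a crude form of exploration
    x_next = int(nxt[x, a, rng.integers(0, 2)])
    d = r[x, a] + alpha * Q[x_next].max() - Q[x, a]   # temporal difference
    counts[x, a] += 1
    Q[x, a] += d / counts[x, a]                # decaying step size 1 / (visit count)

print("greedy policy:", Q.argmax(axis=1))
print(Q)
```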
80Q-Function
- As in the case of controlled TD, it is often desirable to incorporate exploration.
- Randomize decisions by choosing at each time t a decision a, for a ∈ A, with probability proportional to exp( Q̃(x_t, a, u_t) / ε ), where ε is a small scalar.
- Note: 1) Every decision is chosen with probability > 0.
- 2) As ε → 0, the probability of choosing a decision a_t such that a_t ∈ argmax_{a ∈ A} Q̃(x_t, a, u_t) tends to 1.
- The analysis of Q-learning bears many similarities to that of controlled TD, and results that apply to one can often be generalized in a straightforward way to accommodate the other.
81Relationship with Approximate Value Iteration
- The classical value iteration algorithm can be described compactly in terms of the dynamic programming operator T:
  V_{k+1} = T V_k.
- Approximate value iteration:
  Ṽ(·, u_{k+1}) = Π T Ṽ(·, u_k),
where Π denotes a projection onto the set of functions representable by the chosen approximation architecture.
- Disadvantage:
- approximate value iteration need not possess fixed points, and therefore should not be expected to converge.
- In fact, even in cases where a fixed point exists, and even when the system is autonomous, the algorithm can generate a diverging sequence of weight vectors.
82Relationship with Approximate Value Iteration
83Relationship with Approximate Value Iteration
- Controlled TD can be thought of as a stochastic approximation algorithm designed to converge on fixed points of approximate value iteration.
- Advantage:
- Controlled TD uses simulation to effectively bypass the need to explicitly compute the projections required for approximate value iteration.
- Autonomous systems.
- Controlled systems: the possible introduction of exploration.
84Historical View
- A long history and big names:
- Sutton: temporal-difference learning, based on earlier work by Barto and Sutton on models for classical conditioning phenomena observed in animal behavior, and by Barto, Sutton, and Anderson on actor-critic methods.
- Witten: a lookup-table algorithm proposed a decade earlier bears similarities with temporal-difference learning.
- Watkins: Q-learning was proposed in his thesis, and the study of temporal-difference learning was integrated with classical ideas from dynamic programming and stochastic approximation theory.
- The work of Werbos and of Barto, Bradtke, and Singh also contributed to the above integration.
85Historical View
- Applications
- Tesauro: a world-class Backgammon-playing program, where the practical potential of these methods was first demonstrated; since then:
- channel allocation in cellular communication networks,
- elevator dispatching,
- inventory management,
- job-shop scheduling.
86Actors and Critics
- Averaged Rewards
- Independent actors
- An actor is a parameterized class of policies.
87Actors and Critics
- Independent actors (cont.)
- one stochastic gradient method, which was
proposed by Marbach and Tsitsiklis - where
- Using critic feedback.
89Bibliography
- [1] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995.
- [2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
- [3] R. S. Sutton, Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
- [4] R. S. Sutton, Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3:9-44, 1988.
- [5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
- [6] J. N. Tsitsiklis and B. Van Roy, An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.