Neuro-Dynamic Programming


1
Neuro-Dynamic Programming
  • José A. Ramírez, Yan Liao
  • Advanced Decision Processes
  • ECECS 841, Spring 2003, University of Cincinnati

2
Outline
  • 1. Neuro-Dynamic Programming (NDP) motivation.
  • 2. Introduction to Infinite Horizon Problems
  • Minimization of Total Cost,
  • Discounted Problems,
  • Finite-State Systems,
  • Value Iteration and Error Bounds,
  • Policy Iteration,
  • The Role of Contraction Mappings.
  • 3. Stochastic Control Overview
  • State Equation (system model),
  • Value Function,
  • Stationary policies and value function,
  • Introductory example: Tetris (game).

3
Outline
  • 4. Control of Complex Systems
  • Motivation for the use of NDP in complex systems,
  • Examples of complex systems where NDP could be
    applied.
  • 5. Value Function Approximation
  • Linear parameterization: parameter vector and basis functions,
  • Continuation of the Tetris example.

4
Outline
  • 6. Temporal-Difference Learning (TD(λ))
  • Introduction: autonomous systems, general TD(λ) algorithm,
  • Controlled systems: TD(λ) for more general systems,
  • Approximate policy iteration,
  • Controlled TD,
  • Q-functions, and approximating the Q-function
    (Q-learning),
  • Comments about relationship with Approximate
    Value Iteration.
  • 7. Actors and Critics
  • Averaged Rewards,
  • Independent actors,
  • Using critic feedback.

5
1. Neuro-Dynamic Programming (NDP) motivation
Study of Decision-Making:
- Rational and irrational behavior: how decisions are made (psychologists, economists).
- How decisions ought to be made: rational decision-making (engineers and management scientists); clear objectives, strategic behavior.
Rational decision problems:
- Development of a mathematical theory: understanding of dynamic models, uncertainties, objectives, and characterization of optimal decision strategies.
- If optimal strategies do exist, then computational methods are used as a complement (e.g., for implementation).
6
1. Neuro-Dynamic Programming (NDP) motivation
- In contrast to rational decision-making, there is no clear-cut mathematical theory about decisions made by participants of natural systems (speculative theories, refined by experimentation).
- One approach hypothesizes that behavior is in some sense rational; ideas from the study of rational decision-making are then used to characterize such behavior, e.g., utility and equilibrium theory in financial economics.
- The study of animal behavior is also of interest: evolutionary theory and its popular precept, "survival of the fittest," support the possibility that behavior to some extent concurs with that of a rational agent.
- The study of natural systems has also contributed to the science of rational decision-making, in particular regarding the computational complexity of decision problems and the lack of systematic approaches for dealing with it.
7
1. Neuro-Dynamic Programming (NDP) motivation
- For example, practical problems addressed by the theory of dynamic programming (DP) can rarely be solved using DP algorithms, because the computational time required to generate optimal strategies typically grows exponentially in the number of variables involved: the curse of dimensionality.
- This calls for an understanding of suboptimal solutions and of decision-making under computational constraints. Problem: no satisfactory theory has been developed to this end.
- Interestingly, the fact that biological mechanisms facilitate the efficient synthesis of adequate strategies motivates the possibility that understanding such mechanisms can inspire new and computationally feasible methodologies for strategic decision-making.
8
1. Neuro-Dynamic Programming (NDP) motivation
- Reinforcement Learning (RL): over two decades, RL algorithms, originally conceived as descriptive models for phenomena observed in animal behavior, have grown out of the field of artificial intelligence and been applied to solving complex sequential decision problems.
- The success of RL in solving large-scale problems has generated special interest among operations researchers and control theorists, and much research has been devoted to understanding these methods and their potential.
- Developments from operations research and control theory focus on the normative view and acknowledge a relative disconnect from descriptive models of animal behavior; for this reason, some operations researchers and control theorists have come to refer to this area of research as Neuro-Dynamic Programming (NDP) rather than RL.
9
1. Neuro-Dynamic Programming (NDP) motivation
- During these lectures we will present a sample of the recent developments and open issues of research in NDP.
- Specifically, we will focus on the two algorithmic ideas of greatest use in NDP, for which there has been significant theoretical progress in recent years:
  - Temporal-Difference learning,
  - Actor-Critic methods.
- First, we provide some background and perspective on the methodology and the problems it may address.
- Comments about references.
10
2. Introduction to Infinite Horizon Problems
Material taken from Dynamic Programming and Optimal Control, vols. I and II, by Dimitri P. Bertsekas, and Neuro-Dynamic Programming by Dimitri P. Bertsekas and John Tsitsiklis.
Dynamic programming problems with an infinite horizon are characterized by the following aspects:
a) The number of stages is infinite.
b) The system is stationary: the system equation, the cost per stage, and the random disturbance statistics do not change from one stage to the next.
Why infinite horizon problems?
- They are interesting because their analysis is elegant and insightful.
- Implementation of optimal policies is often simple: optimal policies are typically stationary, i.e., the optimal rule used to choose controls does not change from stage to stage.
- NDP targets complex systems.
Note: assumption (a) is never satisfied in practice, but it is a reasonable approximation for problems with a finite but very large number of stages.
11
2. Introduction to Infinite Horizon Problems
  • They require more sophisticated analysis than finite horizon problems: we need to analyze the limiting behavior as the horizon tends to infinity.
  • We consider four principal classes of infinite horizon problems. The first two classes try to minimize Jπ(x0), the total cost over an infinite number of stages:
  • i) Stochastic shortest path problems: in this case α = 1, and we assume that there is an additional state 0, which is a cost-free termination state; once the system reaches the termination state, it remains there at no additional cost. The objective is to reach the termination state with minimal cost.
  • ii) Discounted problems with bounded cost per stage: here α < 1, and the absolute one-stage cost |g(x,u,w)| is bounded from above by some constant M. Thus, Jπ(x0) is well defined because it is the infinite sum of a sequence of numbers that are bounded in absolute value by the decreasing geometric progression M·α^k.

12
2. Introduction to Infinite Horizon Problems
  • iii) Discounted problems with unbounded cost per stage: here the discount factor α may or may not be less than 1, and the cost per stage is unbounded. This problem is difficult to analyze because of the possibility of infinite cost for some policies (more details in chap. 3, Dynamic Programming, vol. II, by Bertsekas).
  • iv) Average cost problems: in some problems we have Jπ(x0) = ∞ for every policy π and initial state x0. In many such problems the average cost per stage, given by
      lim_{N→∞} (1/N) JπN(x0),
    where JπN(x0) is the N-stage cost-to-go of policy π starting at state x0, is well defined as a limit and is finite.

13
2. Introduction to Infinite Horizon Problems
  • A Preview of Infinite Horizon Results
  • Let J* be the optimal cost-to-go function of the infinite horizon problem, and consider the case α = 1, with JN(x) the optimal cost of the problem involving N stages, initial state x, cost per stage g(x,u,w), and zero terminal cost. The N-stage cost can be computed after N iterations of the DP algorithm
      J_{k+1}(x) = min_{u∈U(x)} E_w[ g(x,u,w) + Jk(f(x,u,w)) ],  with J0(x) = 0.
  • Thus, we can speculate the following:
  • i) The optimal infinite horizon cost is the limit of the corresponding N-stage optimal costs as N → ∞:
      J*(x) = lim_{N→∞} JN(x), for all states x.

Note that the time indexing has been reversed from the original DP algorithm, so the optimal finite horizon cost functions can be computed with a single DP recursion (more details in chap. 1, Dynamic Programming, vol. II, by D.P. Bertsekas).
14
2. Introduction to Infinite Horizon Problems
  • ii) The following limiting form of the DP algorithm should hold for all states:
      J*(x) = min_{u∈U(x)} E_w[ g(x,u,w) + J*(f(x,u,w)) ].
    This is not an algorithm, but a system of equations (one equation per state), whose solution is the cost-to-go of every state. It can also be viewed as a functional equation for the cost-to-go function J*, and it is called Bellman's equation.
  • iii) If μ(x) attains the minimum in the right-hand side of Bellman's equation for each x, then the stationary policy {μ, μ, ...} should be optimal. This is true for most infinite horizon problems of interest, so attention can be restricted to stationary policies.
  • Most of the analysis of infinite horizon problems is focused on the above three issues and on efficient methods to compute J* and optimal policies.

15
2. Introduction to Infinite Horizon Problems
Stationary Policy: a stationary policy is an admissible policy of the form π = {μ, μ, ...}, with a corresponding cost function Jμ(x). The stationary policy μ is optimal if Jμ(x) = J*(x) for all states x.
Some Shorthand Notation: the use of single recursions in the DP algorithm to compute optimal costs over a finite horizon motivates the introduction of two mappings that play an important theoretical role and give us a convenient shorthand notation for expressions that are complicated to write. For any function J: S → ℝ, where S is the state space, we consider the function TJ obtained by applying the DP mapping to J:
  (TJ)(x) = min_{u∈U(x)} E_w[ g(x,u,w) + α J(f(x,u,w)) ],  for all x ∈ S.
T can be viewed as a mapping that transforms a function J on S into the function TJ on S. TJ represents the optimal cost function for the one-stage problem that has stage cost g and terminal cost αJ.
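To make the shorthand concrete, here is a minimal Python sketch (not from the slides) of the mapping T and of value iteration by repeated application of T on a small finite-state model; the random transition probabilities P, stage costs g, and discount α below are illustrative assumptions standing in for the expectation over the disturbance w.

```python
# Hedged sketch: the DP mapping T on a finite-state, finite-control discounted
# problem, and value iteration J, TJ, T^2 J, ... by repeated application of T.
import numpy as np

n_states, n_controls, alpha = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_controls, n_states))  # P[u, x, y] = Pr(x -> y | u)
g = rng.uniform(size=(n_states, n_controls))                        # stage cost g(x, u)

def T(J):
    """Apply the DP mapping: (TJ)(x) = min_u E[ g(x,u) + alpha * J(next state) ]."""
    Q = g + alpha * np.einsum("uxy,y->xu", P, J)    # one-stage cost plus discounted terminal cost
    return Q.min(axis=1)

J = np.zeros(n_states)
for _ in range(200):            # T is a contraction (see the results a few slides ahead),
    J = T(J)                    # so the iterates converge to the fixed point J*
print("approximate J*:", J)
```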
16
2. Introduction to Infinite Horizon Problems
Similarly, for any control function μ: S → C, where C is the space of controls, we have
  (TμJ)(x) = E_w[ g(x, μ(x), w) + α J(f(x, μ(x), w)) ],  for all x ∈ S.
We also denote by T^k the composition of the mapping T with itself k times, T^k J = T(T^{k-1} J). Then, for k = 0 we have T^0 J = J.
17
2. Introduction to Infinite Horizon Problems
Some Basic Properties. Monotonicity Lemma: for any functions J: S → ℝ and J′: S → ℝ such that J(x) ≤ J′(x) for all x ∈ S, and for any function μ: S → C with μ(x) ∈ U(x) for all x ∈ S, we have
  (T^k J)(x) ≤ (T^k J′)(x)  and  (Tμ^k J)(x) ≤ (Tμ^k J′)(x),  for all x ∈ S and all k.
18
2. Introduction to Infinite Horizon Problems
The Role of Contraction Mappings (Dynamic Programming, vol. II, Bertsekas)
Definition 1.4.1: A mapping H: B(S) → B(S) is said to be a contraction mapping if there exists a scalar ρ < 1 such that
  ||HJ - HJ′|| ≤ ρ ||J - J′||,  for all J, J′ ∈ B(S),
where ||·|| is the sup-norm, ||J|| = max_{x∈S} |J(x)|. It is said to be an m-stage contraction mapping if there exists a positive integer m and some ρ < 1 such that
  ||H^m J - H^m J′|| ≤ ρ ||J - J′||,  for all J, J′ ∈ B(S),
where H^m denotes the composition of H with itself m times.
Note: B(S) is the set of all bounded real-valued functions J: S → ℝ on S.
19
2. Introduction to Infinite Horizon Problems
The Role of Contraction Mappings (Dynamic Programming, vol. II, Bertsekas)
Proposition 1.4.1 (Contraction Mapping Fixed-Point Theorem): If H: B(S) → B(S) is a contraction mapping or an m-stage contraction mapping, then there exists a unique fixed point of H, i.e., there exists a unique function J* ∈ B(S) such that
  HJ* = J*.
Furthermore, if J is any function in B(S) and H^k is the composition of H with itself k times, then
  lim_{k→∞} ||H^k J - J*|| = 0.
20
3. Stochastic Control Overview
State Equation: Let us consider a discrete-time dynamic system that, at each time t, takes on a state xt and evolves according to
  x_{t+1} = f(xt, at, wt),
where wt is an (i.i.d.) disturbance and at is a control decision. We restrict attention to finite state, disturbance, and control spaces, denoted by X, W, and A, respectively.
Value Function: Let r: X × A → ℝ associate a reward r(xt, at) with a decision at made at state xt. Let π be a stationary policy, with π: X → A. For each policy π we define a value function v(·, π): X → ℝ,
  v(x, π) = E[ Σ_{t=0}^{∞} α^t r(xt, at) | x0 = x, at = π(xt) ],
where α ∈ (0, 1) is a discount factor.
21
3. Stochastic Control Overview
Optimal Value Function: we define the optimal value function V as
  V(x) = max_π v(x, π),  for all x ∈ X.
From dynamic programming, we have that any stationary policy π given by
  π(x) ∈ argmax_{a∈A} E[ r(x, a) + α V(f(x, a, w)) ]
is optimal, in the sense that
  v(x, π) = V(x),  for all x ∈ X.
22
3. Stochastic Control Overview
Example 1: Tetris. The video arcade game of Tetris
can be viewed as an instance of stochastic
control. In particular, we can view the state xt
as an encoding of the current wall of bricks
and the shape of the current falling piece. The
decision at identifies an orientation and
horizontal position for placement of the falling
piece onto the wall. Though the arcade game
employs a more complicated scoring system,
consider for simplicity a reward r(xt, at) equal
to the number of rows eliminated by placing the
piece in the position described by at. Then, a
stationary policy that maximizes the value
essentially optimizes a combination of present
and future row elimination, with decreasing
emphasis placed on rows to be eliminated at times
farther into the future.
23
3. Stochastic Control Overview
Example 1: Tetris, cont. Tetris was first
programmed by Alexey Pajitnov, Dmitry Pavlovsky,
and Vadim Gerasimov, computer engineers at the
Computer Center of the Russian Academy of
Sciences in 1985-86.
(Slide figures: the standard Tetris shapes and the number of states.)
24
3. Stochastic Control Overview
Dynamic programming algorithms compute the optimal value function V. The result is stored in a look-up table with one entry V(x) per state x ∈ X. When required, the value function is used to generate optimal decisions. For example, given a current state xt ∈ X, a decision at is selected according to
  at ∈ argmax_{a∈A} E[ r(xt, a) + α V(f(xt, a, w)) ].
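As a small illustration, here is a hedged Python sketch of selecting a greedy decision from a tabular value function; the finite transition model P and reward table r are made-up stand-ins for the expectation over the disturbance w.

```python
# Hedged sketch: greedy decision a_t in argmax_a E[ r(x,a) + alpha * V(next state) ]
# from a lookup-table value function V, on an illustrative random finite model.
import numpy as np

def greedy_decision(x, V, P, r, alpha):
    """P[a, x, :] are transition probabilities, r[x, a] the rewards."""
    q = [r[x, a] + alpha * P[a, x, :] @ V for a in range(r.shape[1])]
    return int(np.argmax(q))

rng = np.random.default_rng(1)
n_states, n_actions, alpha = 5, 3, 0.95
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(size=(n_states, n_actions))
V = rng.uniform(size=n_states)        # pretend this is the optimal value function
print(greedy_decision(2, V, P, r, alpha))
```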
25
4. Control of Complex Systems
The main objective is the development of a methodology for the control of complex systems. Two common characteristics of this type of system are:
i) An intractable state space: intractable state spaces preclude the use of classical DP algorithms, which compute and store one numerical value per state.
ii) Severe nonlinearities: methods of traditional linear control, which are applicable even with large state spaces, are ruled out by severe nonlinearities.
Let us review some examples of complex systems where NDP could be, and has been, applied.
26
4. Control of Complex Systems
a) Call Admission and Routing: With rising demand
in telecommunication network resources, effective
management is as important as ever. Admission
(deciding which calls to accept/reject) and
routing (allocating links in the network to
particular calls) are examples of decisions that
must be made at any point in time. The objective
is to make the best use of limited network
resources. In principle, such sequential decision
problems can be addressed by dynamic programming.
Unfortunately, the enormous state spaces involved
render dynamic programming algorithms
inapplicable, and heuristic control strategies
are used in their place. b) Strategic Asset Allocation: Strategic asset allocation is the problem of distributing an investor's wealth
among assets in the market in order to take on a
combination of risk and expected return that best
suits the investor's preferences. In general, the
optimal strategy involves dynamic rebalancing of
wealth among assets over time. If each asset
offers a fixed rate of risk and return, and some
additional simplifying assumptions are made, the
only state variable is wealth, and the problem
can be solved efficiently by dynamic programming
algorithms. There are even closed form solutions
in cases involving certain types of investor
preferences. However, in the more realistic
setting involving risks and returns that
fluctuate with economic conditions, economic
indicators must be taken into account as state
variables, and this quickly leads to an
intractable state space. The design of effective
strategies in such situations constitutes an
important challenge in the growing field of
financial engineering.
27
4. Control of Complex Systems
c) Supply-Chain Management: With today's tight vertical integration, increased production
complexity, and diversification, the inventory
flow within and among corporations can be viewed
as a complex network called a supply chain
consisting of storage, production, and
distribution sites. In a supply chain, raw
materials and parts from external vendors are
processed through several stages to produce
finished goods. Finished goods are then
transported to distributors, then to wholesalers,
and finally retailers, before reaching customers.
The goal in supply-chain management is to achieve
a particular level of product availability while
minimizing costs. The solution is a policy that
decides how much to order or produce at various
sites given the present state of the company and
the operating environment. d) Emissions Reductions: The threat of global warming that may result from the accumulation of carbon dioxide and other greenhouse gases poses a serious dilemma. In particular, cuts in emission levels have a detrimental short-term impact on economic
growth. At the same time, a depleting environment
can severely hurt the economy especially the
agricultural sector in the longer term. To
complicate the matter further, scientific
evidence on the relationship between emission
levels and global warming is inconclusive,
leading to uncertainty about the benefits of
various cuts. One systematic approach to
considering these conflicting goals involves the
formulation of a dynamic system model that
describes our understanding of economic growth
and environmental science. Given such a model,
the design of environmental policy amounts to
dynamic programming. Unfortunately, classical
algorithms are inapplicable due to the size of
the state space.
28
4. Control of Complex Systems
e) Semiconductor Wafer Fabrication: The manufacturing floor at a semiconductor wafer
fabrication facility is organized into service
stations, each equipped with specialized
machinery. There is a single stream of jobs
arriving on a production floor. Each job follows
a deterministic route that revisits the same
station multiple times. This leads to a
scheduling problem where, at any time, each
station must select a job to service such that
(long term) production capacity is maximized.
Such a system can be viewed as a special class of
queueing networks, which are models suitable for
a variety of applications in manufacturing,
telecommunications, and computer systems. Optimal
control of queueing networks is notoriously
difficult, and this reputation is strengthened by
formal characterizations of computational
complexity. Other systems: parking lots, football game strategy, combinatorial optimization, maintenance and repair, dynamic channel allocation, backgammon.
Some papers on applications:
- Tsitsiklis, J. and Van Roy, B., "Neuro-Dynamic Programming Overview and a Case Study in Optimal Stopping," Proceedings of the 36th IEEE Conference on Decision and Control, San Diego, California, pp. 1181-1186, December 1997.
- Van Roy, B., Bertsekas, D.P., Lee, Y., and Tsitsiklis, J., "A Neuro-Dynamic Programming Approach to Retailer Inventory Management," Proceedings of the 36th IEEE Conference on Decision and Control, San Diego, California, pp. 4052-4057, December 1997.
- Marbach, P. and Tsitsiklis, J., "A Neuro-Dynamic Programming Approach to Admission Control in ATM Networks: The Single Link Case," Technical Report LIDS-P-2402, Laboratory for Information and Decision Systems, M.I.T., November 1997.
- Marbach, P., Mihatsch, O., and Tsitsiklis, J., "Call Admission Control and Routing in Integrated Services Networks Using Reinforcement Learning," Proceedings of the 37th IEEE Conference on Decision and Control, Tampa, Florida, pp. 563-568, December 1998.
- Bertsekas, D.P. and Homer, M.L., "Missile Defense and Interceptor Allocation by Neuro-Dynamic Programming," IEEE Transactions on Systems, Man and Cybernetics, vol. 30, pp. 101-110, 2000.
29
4. Control of Complex Systems
- For the examples presented, state spaces are intractable as a consequence of the curse of dimensionality, that is, state spaces grow exponentially in the number of state variables. It is therefore difficult (if not impossible) to compute and store one value per state, as required by classical DP.
- An additional shortcoming of classical DP: computations require the use of transition probabilities. For many complex systems, such probabilities are not readily accessible; on the other hand, it is often easier to develop simulation models of the system and generate sample trajectories.
- Objective of NDP: overcoming the curse of dimensionality through the use of parameterized (value) function approximators, and through the use of output generated by simulators rather than explicit transition probabilities.
30
5. Value Function Approximation
- Intractability of state spaces leads to value function approximation.
- Two important pre-conditions for the development of an effective approximation:
  i) choosing a parameterization Ṽ(x, u), depending on a parameter vector u, that yields a good approximation to the value function;
  ii) algorithms for computing appropriate parameter values.
Note: the choice of a suitable parameterization requires some practical experience or theoretical analysis that provides rough information about the shape of the function to be approximated.
31
5. Value Function Approximation
Linear parameterization
- General classes of parameterizations have found use in NDP; to keep the exposition simple, let us focus on linear parameterizations, which take the form
    Ṽ(x, u) = Σ_{k=1}^{K} u(k) φk(x),
  where φ1, ..., φK are basis functions mapping X to ℝ and u = (u(1), ..., u(K)) is a vector of scalar weights. Similar to statistical regression, the basis functions φ1, ..., φK are selected by a human, based on intuition or analysis of the problem at hand.
- Hint: one interpretation that is useful for the construction of basis functions involves viewing each function φk as a feature, that is, a numerical value capturing a salient characteristic of the state that may be pertinent to effective decision making.
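As a tiny illustration, the sketch below implements a linear approximator of this form in Python; the specific basis functions used (a constant, x, and x²) are hypothetical placeholders, not features from the slides.

```python
# Hedged sketch: a linear value function approximator built from basis functions.
import numpy as np

def phi(x):
    """Map a state x (a scalar here, for illustration) to the feature vector (phi_1(x), ..., phi_K(x))."""
    return np.array([1.0, x, x ** 2])

def v_tilde(x, u):
    """Linear parameterization V_tilde(x, u) = sum_k u(k) * phi_k(x)."""
    return float(u @ phi(x))

u = np.array([0.5, -1.0, 0.25])           # weight vector u = (u(1), ..., u(K))
print(v_tilde(2.0, u))
```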
32
5. Value Function Approximation
Example 2: Tetris, continuation. In our stochastic control formulation of Tetris, the state is an encoding of the current wall configuration and the current falling piece. There are clearly too many states for exact dynamic programming algorithms to be applicable. However, we may believe that most information relevant to game-playing decisions can be captured by a few intuitive features. In particular, one feature, say φ1, may map states to the height of the wall. Another, say φ2, could map states to a measure of the jaggedness of the wall. A third might provide a scalar encoding of the type of the current falling piece (there are seven different shapes in the arcade game). Given a collection of such features, the next task is to select weights u(1), ..., u(K) such that Ṽ(x, u) ≈ V(x) for all states x. This approximation could then be used to generate a game-playing strategy.
33
5. Value Function Approximation
  • Example 2: Tetris, continuation.
  • A similar approach is presented in the book Neuro-Dynamic Programming (chapter 8, case studies) by D.P. Bertsekas and J. Tsitsiklis, with the following parameterization, chosen after some experimentation:
  • The height hk of the kth column of the wall. There are w such features, where w is the wall's width.
  • The absolute difference |hk - h_{k+1}| between the heights of the kth and the (k+1)st columns, k = 1, ..., w-1.
  • The maximum wall height, maxk hk.
  • The number of holes L in the wall, that is, the number of empty positions of the wall that are surrounded by full positions.

34
5. Value Function Approximation
Example 2 Tetris, continuation. Thus, there are
2w1 features, which together with a constant
offset, require 2w2 weights in a linear
architecture of the form Using this
parameterization, with w10 (22 features), an
strategy is generated by NDP that eliminates an
average of 3554 rows per game, reflecting
performance comparable of an expert player.
offset
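The sketch below shows one way such a 2w+2-dimensional feature vector could be computed in Python; the board encoding (a 0/1 array with row 0 at the bottom) and the exact hole-counting rule are assumptions for illustration, not the implementation from the book.

```python
# Hedged sketch of the 2w+2 Tetris features described above: column heights,
# adjacent height differences, maximum height, number of holes, plus a constant offset.
import numpy as np

def tetris_features(board):
    """board: 2-D 0/1 array of shape (n_rows, w), row 0 at the bottom; returns a (2w+2)-vector."""
    n_rows, w = board.shape
    filled = board > 0
    # Column heights h_k: position of the highest filled cell (0 if the column is empty).
    heights = np.where(filled.any(axis=0), n_rows - np.argmax(filled[::-1], axis=0), 0)
    height_diffs = np.abs(np.diff(heights))          # |h_k - h_{k+1}|, k = 1, ..., w-1
    max_height = heights.max()
    # Holes: empty cells lying below the top filled cell of their column.
    holes = sum(int(np.sum(~filled[:h, k])) for k, h in enumerate(heights))
    return np.concatenate([heights, height_diffs, [max_height, holes, 1.0]])  # offset last

board = np.zeros((20, 10), dtype=int)
board[0, :4] = 1                                     # a few bricks on the bottom row
print(tetris_features(board).shape)                  # (2*10 + 2,) = (22,)
```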
35
6. Temporal-Difference Learning
Introduction to Temporal-Difference Learning
Material from Richard Sutton, "Learning to Predict by the Methods of Temporal Differences," Machine Learning, 3:9-44, 1988. This paper provides the first formal results in the theory of temporal-difference (TD) methods.
- Learning to predict: use of past experience (historical information) with an incompletely known system to predict its future behavior.
- Learning to predict is one of the most basic and prevalent kinds of learning.
- In prediction learning, training examples can be taken directly from the temporal sequence of ordinary sensory input; no special supervisor or teacher is required.
- Conventional prediction-learning methods (Widrow-Hoff, LMS, Delta Rule, Backpropagation) are driven by the error between predicted and actual outcomes.
36
6. Temporal-Difference Learning
  • TD methods are driven by the error or difference between temporally successive predictions; learning occurs whenever there is a change in prediction over time.
  • Advantages of TD methods over conventional methods:
  • They are more incremental, and therefore easier to compute.
  • They tend to make more efficient use of their experience: they converge faster and produce better predictions.
  • TD approach: predictions are based on numerical features combined using adjustable parameters or weights, similar to connectionist models (neural networks).

37
6. Temporal-Difference Learning
  • TD and supervised-learning approaches to prediction:
  • Historically, the most important learning paradigm has been supervised learning: the learner is asked to associate pairs of items (input, output).
  • Supervised learning has been used in pattern classification, concept acquisition, learning from examples, system identification, and associative memory.

(Diagram: the input is fed both to the real system, producing the real output, and to the learning algorithm, producing an estimated output; the error between the two outputs is used to adjust the estimator parameters.)
38
6. Temporal-Difference Learning
  • Single-step and multi-step prediction problems:
  • Single-step: all information about the correctness of each prediction is revealed at once; these problems suit supervised learning methods.
  • Multi-step: correctness is not revealed until more than one step after the prediction is made, but partial information relevant to its correctness is revealed at each step; these problems call for TD learning methods.
  • Computational issues:
  • Sutton introduces a particular TD procedure by formally relating it to a classical supervised-learning procedure, the Widrow-Hoff rule (also known as the delta rule, the ADALINE (Adaptive Linear Element), and the Least Mean Squares (LMS) filter).
  • We consider multi-step prediction problems in which experience comes in observation-outcome sequences of the form x1, x2, x3, ..., xm, z, where each xt is a vector of observations available at time t in the sequence, and z is the outcome of the sequence. Also, xt ∈ ℝⁿ and z ∈ ℝ.

39
6. Temporal-Difference Learning
  • -Computational issues (cont.)
  • For each observation-outcome, the learner
    produces a corresponding sequence of predictions
    P1, P2, P3, , Pm, each of which is an estimate
    of z.
  • Predictions Pt are based on a vector of
    modifiable parameters w. ? Pt(xt ,w)
  • All learning procedures are expressed as rules
    for updating w. For each observation, an
    increment to w, denoted ?wt , is determined.
    After a complete sequence has been processed, w
    is changed by (the sum of) all the sequences
    increments
  • The supervised-learning approach treats each
    sequence of observations and its outcome as a
    sequence of observation-outcome pairs (x1 , z) ,
    (x2 , z), , (xm , z). In this case increments
    due to time t depends on the error between Pt
    and z, and on how to change w will affect Pt .

40
6. Temporal-Difference Learning
  • Computational issues (cont.)
  • A prototypical supervised-learning update procedure is
      Δwt = α (z - Pt) ∇w Pt,
    where α is a positive parameter affecting the rate of learning, and the gradient ∇w Pt is the vector of partial derivatives of Pt with respect to each component of w.
  • Special case: consider Pt a linear function of xt and w, that is, Pt = wᵀxt = Σi w(i) xt(i), where w(i) and xt(i) are the ith components of w and xt. Then ∇w Pt = xt, and
      Δwt = α (z - Pt) xt,
    which corresponds to the Widrow-Hoff rule.
  • This update depends critically on z, and thus cannot be evaluated until the end of the sequence, when z becomes known. All observations and predictions made during a sequence must be remembered until its end; Δwt cannot be computed incrementally.
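A minimal Python sketch of this rule, applied over one observation-outcome sequence with illustrative data (the learning rate and the vectors are assumptions):

```python
# Hedged sketch of the Widrow-Hoff (LMS) update for linear predictions P_t = w . x_t,
# accumulated over one observation-outcome sequence and applied at its end.
import numpy as np

def widrow_hoff_sequence_update(w, xs, z, alpha=0.1):
    """Accumulate Delta w_t = alpha * (z - P_t) * x_t over one sequence, then apply the sum."""
    delta = np.zeros_like(w)
    for x in xs:
        p = w @ x                       # prediction P_t of the final outcome z
        delta += alpha * (z - p) * x    # step in the negative gradient direction of (z - P_t)^2 / 2
    return w + delta

rng = np.random.default_rng(2)
w = np.zeros(3)
xs = rng.normal(size=(5, 3))            # observation vectors x_1, ..., x_m (illustrative)
z = 1.0                                 # outcome revealed at the end of the sequence
print(widrow_hoff_sequence_update(w, xs, z))
```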

41
6. Temporal-Difference Learning
- TD procedure: there is a temporal-difference procedure that produces exactly the same result and can be computed incrementally. The key is to represent the error z - Pt as a sum of changes in predictions:
    z - Pt = Σ_{k=t}^{m} (P_{k+1} - Pk),  where P_{m+1} ≡ z.
Using this representation together with the prototypical supervised-learning update, we obtain
    Δwt = α (P_{t+1} - Pt) Σ_{k=1}^{t} ∇w Pk.
This update can be computed incrementally, because it depends only on a pair of successive predictions and on the sum of all past values of the gradient. We refer to this procedure as TD(1).
42
6. Temporal-Difference Learning
  • The TD(λ) family of learning procedures
  • The hallmark of temporal-difference methods is their sensitivity to changes in successive predictions rather than to the overall error between predictions and the final outcome.
  • In response to an increase (decrease) in prediction from Pt to P_{t+1}, an increment Δwt is determined that increases (decreases) the predictions for some or all of the preceding observation vectors x1, ..., xt.
  • TD(1) is the special case where all the predictions are altered to an equal extent.
  • Now consider the case where greater alterations are made to more recent predictions. We consider an exponential weighting with recency, in which alterations to the predictions of observation vectors occurring k steps in the past are weighted according to λ^k, for 0 ≤ λ ≤ 1.

(Figure: the weight λ^{t-k} applied to the prediction made at time k, plotted against t - k: constant at 1 for λ = 1 (TD(1)), nonzero only at t - k = 0 for λ = 0 (TD(0)), and decaying geometrically for 0 < λ < 1.)
43
6. Temporal-Difference Learning
  • The TD(?) family of learning procedures
  • For ? 0 we have the TD(0) procedure
  • For ? 1 we have the TD(1) procedure, that is
    equivalent to the Widrow-Hoff rule, except that
    TD(1) is incremental
  • Alterations of past predictions can be weighted
    in ways other than the exponential form given
    previously, let

Also referred in literature as eligibility
vectors.
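The following hedged Python sketch applies an on-line variant of this TD(λ) rule to linear predictions Pt = w·xt over a single observation-outcome sequence; the data, step size, and λ value are illustrative.

```python
# Hedged sketch of Sutton-style TD(lambda) for linear predictions P_t = w . x_t,
# using the eligibility recursion e_t = grad_w P_t + lam * e_{t-1} and the update
# Delta w_t = alpha * (P_{t+1} - P_t) * e_t, applied on-line as the sequence unfolds.
import numpy as np

def td_lambda_sequence(w, xs, z, alpha=0.1, lam=0.7):
    w = w.copy()
    e = np.zeros_like(w)                         # eligibility vector
    for t in range(len(xs)):
        p_t = w @ xs[t]
        p_next = z if t == len(xs) - 1 else w @ xs[t + 1]   # final "next prediction" is the outcome z
        e = xs[t] + lam * e                      # grad_w P_t = x_t for linear predictors
        w += alpha * (p_next - p_t) * e
    return w

rng = np.random.default_rng(3)
print(td_lambda_sequence(np.zeros(3), rng.normal(size=(6, 3)), z=1.0))
```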
44
6. Temporal-Difference Learning
  • (Material taken from Neuro-Dynamic Programming, Chapter 5.)
  • Monte Carlo simulation: brief overview.
  • Suppose v is a random variable with unknown mean m that we want to estimate.
  • Using Monte Carlo simulation to estimate m: generate a number of samples v1, v2, ..., vN, and then estimate m by forming the sample mean
      MN = (1/N) Σ_{i=1}^{N} vi.
    We can also compute the sample mean recursively,
      M_{N+1} = MN + (1/(N+1)) (v_{N+1} - MN),  with M1 = v1.
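A quick Python check of this recursion against the batch sample mean, on illustrative Gaussian samples:

```python
# Hedged sketch: recursive computation of the sample mean,
#   M_{N+1} = M_N + (v_{N+1} - M_N) / (N + 1),  with M_1 = v_1.
import numpy as np

rng = np.random.default_rng(4)
samples = rng.normal(loc=3.0, scale=1.0, size=1000)   # v_1, ..., v_N with mean m = 3

m_est, n = 0.0, 0
for v in samples:
    n += 1
    m_est += (v - m_est) / n                           # recursive sample mean
print(m_est, samples.mean())                           # the two estimates coincide
```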

45
6. Temporal-Difference Learning
  • Monte Carlo simulation: case of i.i.d. samples.
  • Suppose the N samples v1, v2, ..., vN are independent and identically distributed, with mean m and variance σ². Then we have E[MN] = m, i.e., MN is an unbiased estimator of m. Also, its variance is given by var(MN) = σ²/N.
  • As N → ∞ the variance of MN converges to zero, so MN converges to m.
  • The strong law of large numbers provides an additional property: the sequence MN converges to m with probability one. The estimator is consistent.

46
6. Temporal-Difference Learning
  • Policy Evaluation by Monte Carlo simulation
  • Consider the stochastic shortest path problem with state space {0, 1, 2, ..., n}, where 0 is a cost-free absorbing state. Let g(i, j) be the cost of the transition from i to j (given the control action µ(i) and the transition probabilities pij(µ(i))). Suppose that we have a fixed (proper) stationary policy µ and we want to calculate, using simulation, the corresponding cost-to-go vector Jµ = (Jµ(1), Jµ(2), ..., Jµ(n)).
  • Approach: generate, starting from each i, many sample state trajectories and average the corresponding costs to obtain an approximation to Jµ(i). Instead of doing this separately for each state i, let us use each trajectory to obtain cost samples for all states visited by the trajectory, considering the cost of the trajectory portion that starts at each intermediate state.

47
6. Temporal-Difference Learning
  • Policy Evaluation by Monte Carlo simulation
  • Suppose that a number of simulation runs are performed, each ending at the termination state 0.
  • Consider the m-th time a given state i0 is encountered, and let (i0, i1, ..., iN) be the remainder of the corresponding trajectory, where iN = 0.
  • Let c(i0, m) be the cumulative cost up to reaching state 0:
      c(i0, m) = Σ_{k=0}^{N-1} g(ik, i_{k+1}).
  • Some assumptions: different simulated trajectories are statistically independent, and each trajectory is generated according to the Markov process determined by the policy µ. Then we have
      E[c(i0, m)] = Jµ(i0).

48
6. Temporal-Difference Learning
Policy Evaluation by Monte Carlo simulation: the estimate of Jµ(i) is obtained by forming the sample mean
  J(i) = (1/K) Σ_{m=1}^{K} c(i, m)
subsequent to the Kth encounter with state i. The sample mean can be expressed in iterative form,
  J(i) := J(i) + γm (c(i, m) - J(i)),  m = 1, ..., K,
where γm = 1/m, starting with J(i) = 0.
49
6. Temporal-Difference Learning
Policy Evaluation by Monte Carlo simulation: consider the trajectory (i0, i1, ..., iN), and let k be an integer such that 1 ≤ k ≤ N. The trajectory contains the subtrajectory (ik, i_{k+1}, ..., iN), a sample trajectory with initial state ik that can be used to update J(ik) using the iterative equation presented previously.
Algorithm: run a simulation to generate the state trajectory (i0, i1, ..., iN), and update the estimates J(ik) for each k = 0, 1, ..., N-1 with the formula
  J(ik) := J(ik) + γ(ik) ( g(ik, i_{k+1}) + g(i_{k+1}, i_{k+2}) + ... + g(i_{N-1}, iN) - J(ik) ).
The step size γ(ik) can change from one iteration to the next. Additional details: Neuro-Dynamic Programming by Bertsekas and Tsitsiklis, chapter 5.
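A hedged Python sketch of this every-visit Monte Carlo update on a tiny made-up stochastic shortest path chain (states 1..n drifting toward the absorbing state 0, unit transition costs) is shown below; the model is purely illustrative.

```python
# Hedged sketch: Monte Carlo policy evaluation, updating J(i_k) from the tail cost
# of every state visited along each simulated trajectory.
import numpy as np

rng = np.random.default_rng(5)
n = 5                                     # states 1..n; state 0 is absorbing and cost-free

def simulate_trajectory(start):
    """Simulate one trajectory under the fixed policy until the termination state 0."""
    traj, costs, i = [start], [], start
    while i != 0:
        j = max(i - 1, 0) if rng.random() < 0.7 else min(i + 1, n)   # assumed dynamics
        costs.append(1.0)                 # assumed transition cost g(i, j) = 1
        traj.append(j)
        i = j
    return traj, costs

J, visits = np.zeros(n + 1), np.zeros(n + 1)
for _ in range(2000):
    traj, costs = simulate_trajectory(rng.integers(1, n + 1))
    tail_costs = np.cumsum(costs[::-1])[::-1]        # cost-to-go from each i_k to state 0
    for k, i in enumerate(traj[:-1]):
        visits[i] += 1
        J[i] += (tail_costs[k] - J[i]) / visits[i]   # step size gamma = 1 / (number of visits)
print(J[1:])
```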
50
6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: here we consider an implementation of the Monte Carlo policy evaluation algorithm that incrementally updates the cost-to-go estimates J(i). First, assume that for any trajectory i0, i1, ..., iN, with iN = 0, we set ik = 0 for k > N, g(ik, i_{k+1}) = 0 for k ≥ N, and J(0) = 0. Also, the policy under consideration is proper. Let us rewrite the previous formula in the following (telescoping) form:
  J(ik) := J(ik) + γ(ik) [ (g(ik, i_{k+1}) + J(i_{k+1}) - J(ik)) + (g(i_{k+1}, i_{k+2}) + J(i_{k+2}) - J(i_{k+1})) + ... + (g(i_{N-1}, iN) + J(iN) - J(i_{N-1})) ].
Note that we use the property J(iN) = 0.
51
6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: equivalently, we can rewrite the previous equation as
  J(ik) := J(ik) + γ(ik) Σ_{l=k}^{N-1} dl,
where
  dl = g(il, i_{l+1}) + J(i_{l+1}) - J(il)
are called temporal differences (TD). The temporal difference dl represents the difference between g(il, i_{l+1}) + J(i_{l+1}), the estimate of the cost-to-go based on the simulated outcome of the current stage, and the current estimate J(il).
52
6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences. Advantage: the estimates can be computed incrementally. The temporal difference dl appears in the update formula for J(ik) for every k ≤ l, so as soon as the transition to i_{l+1} has been simulated and dl becomes available, we can carry out the partial updates
  J(ik) := J(ik) + γ(ik) dl,  for all k ≤ l.
53
6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ). Here we introduce the TD(λ) algorithm as a stochastic approximation method for solving a suitably reformulated Bellman equation. The Monte Carlo evaluation algorithm can be viewed as a Robbins-Monro stochastic approximation method (more details in chapter 4 of Neuro-Dynamic Programming) for solving the equations
  Jµ(ik) = E[ Σ_{l=k}^{N-1} g(il, i_{l+1}) ],
for the unknowns Jµ(ik), as ik ranges over the states in the state space. Other algorithms can be generated in a similar way, e.g., starting from other systems of equations involving Jµ and then replacing expectations by single-sample estimates. For example, we can start from Bellman's equation
  Jµ(i) = E[ g(i, j) + Jµ(j) ].
54
6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ). Based on Bellman's equation, the stochastic approximation method takes the form
  J(ik) := J(ik) + γ(ik) ( g(ik, i_{k+1}) + J(i_{k+1}) - J(ik) ),
which is applied each time state ik is visited. Let us now take a fixed nonnegative integer l and take into consideration the cost of the first l+1 transitions; the stochastic algorithm could then be based on the (l+1)-step Bellman equation
  Jµ(i) = E[ Σ_{k=0}^{l} g(ik, i_{k+1}) + Jµ(i_{l+1}) | i0 = i ].
Without any special knowledge that would favor one value of l over another, we consider forming a weighted average of all possible multistep Bellman equations: specifically, we fix some λ < 1, multiply the (l+1)-step equation by (1 - λ) λ^l, and sum over all nonnegative l.
55
6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ). Then, we have the weighted average of the multistep Bellman equations. Interchanging the order of the two summations, and using the fact that Σ_{l=m}^{∞} (1 - λ) λ^l = λ^m, we obtain a reformulated Bellman equation for Jµ.
56
6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ). The previous equation can be expressed in terms of the temporal differences as
  Jµ(i) = Jµ(i) + E[ Σ_{m=0}^{∞} λ^m dm | i0 = i ],
where
  dm = g(im, i_{m+1}) + Jµ(i_{m+1}) - Jµ(im)
are the temporal differences, and E[dm] = 0 for all m (Bellman's equation). The Robbins-Monro stochastic approximation method corresponding to this equation is
  J(ik) := J(ik) + γ(ik) Σ_{m=k}^{N-1} λ^{m-k} dm,
where γ(ik) is a stepsize parameter (which can change from iteration to iteration). The above update provides us with a family of algorithms, one for each choice of λ, known as TD(λ). Note that for λ = 1 we recover the Monte Carlo policy evaluation method, also called TD(1).
57
6. Temporal-Difference Learning
Monte Carlo simulation using Temporal Differences: TD(λ). Also, for λ = 0 we have another limiting case; using the convention 0⁰ = 1, the TD(0) method is
  J(ik) := J(ik) + γ(ik) ( g(ik, i_{k+1}) + J(i_{k+1}) - J(ik) ),
which coincides with the stochastic approximation based on the one-step Bellman equation presented previously.
Off-line and on-line variants: when all of the updates are carried out simultaneously, after the entire trajectory has been simulated, we have the off-line version of TD(λ). Alternatively, when the updates are evaluated one term at a time, as the trajectory is generated, we have the on-line version of TD(λ).
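Below is a hedged Python sketch of the off-line tabular TD(λ) update on the same kind of made-up chain used above; the simulator, costs, step size, and λ are illustrative assumptions.

```python
# Hedged sketch: off-line tabular TD(lambda) policy evaluation, using
#   J(i_k) := J(i_k) + gamma * sum_{m >= k} lambda^(m-k) * d_m,
# with d_m = g(i_m, i_{m+1}) + J(i_{m+1}) - J(i_m) and unit transition costs.
import numpy as np

rng = np.random.default_rng(6)
n, lam, gamma = 5, 0.5, 0.05               # states 1..n, trace parameter, constant step size

def simulate_trajectory(start):
    traj, i = [start], start
    while i != 0:
        i = max(i - 1, 0) if rng.random() < 0.7 else min(i + 1, n)   # assumed dynamics
        traj.append(i)
    return traj

J = np.zeros(n + 1)                        # J(0) = 0 stays fixed (termination state)
for _ in range(3000):
    traj = simulate_trajectory(rng.integers(1, n + 1))
    d = [1.0 + J[traj[m + 1]] - J[traj[m]] for m in range(len(traj) - 1)]   # g = 1
    for k in range(len(traj) - 1):
        J[traj[k]] += gamma * sum(lam ** (m - k) * d[m] for m in range(k, len(d)))
print(J[1:])
```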
58
Temporal-Difference Learning (TD(λ))
  • Discounted problem: the (l+1)-step Bellman equation is
      Jµ(i) = E[ Σ_{k=0}^{l} α^k g(ik, i_{k+1}) + α^{l+1} Jµ(i_{l+1}) | i0 = i ].
  • Specifically, we fix some λ < 1, multiply by (1 - λ) λ^l, and sum over all nonnegative l.

59
Temporal-Difference Learning (TD(λ))
  • Interchanging the order of the two summations, and using the fact that Σ_{l=m}^{∞} (1 - λ) λ^l = λ^m,
  • we obtain the corresponding reformulated Bellman equation for the discounted problem.

60
Temporal-Difference Learning (TD(λ))
  • In terms of the temporal differences, now defined by
      dm = g(im, i_{m+1}) + α Jµ(i_{m+1}) - Jµ(im),
  • we obtain the analogous equation. Again we have E[dm] = 0 for all m.

61
Temporal-Difference Learning (TD(λ))
  • From here on, the development is entirely similar to the development for the undiscounted case.
  • The only difference is that the discount factor α enters the definition of the temporal difference, and that the weight λ^{m-k} is replaced by (αλ)^{m-k}.
  • In particular, we have
      J(ik) := J(ik) + γ(ik) Σ_{m≥k} (αλ)^{m-k} dm.

62
Temporal-Difference Learning (TD(λ))
  • Approximation (linear): to tune basis function weights
  • Value function
  • Autonomous systems
  • Controlled systems
  • Approximate policy iteration
  • Controlled TD
  • Q-functions
  • Relationship with Approximate Value Iteration
  • Historical View

63
Value function Autonomous systems
  • Problem formulation: an autonomous process evolves according to x_{t+1} = f(xt, wt) (no control decisions).
  • Value function:
      V(x) = E[ Σ_{t=0}^{∞} α^t r(xt) | x0 = x ],
    where r(x) is a scalar reward and α ∈ (0, 1) is a discount factor.
  • Linear approximation:
      Ṽ(x, u) = Σ_{k=1}^{K} u(k) φk(x),
    where {φ1, ..., φK} is a collection of basis functions.

64
Value function Autonomous systems
  • Suppose that we observe a sequence of states x0, x1, x2, ...
  • At time t the weight vector has been set to some value ut.
  • The temporal difference corresponding to the transition from xt to x_{t+1} is
      dt = r(xt) + α Ṽ(x_{t+1}, ut) - Ṽ(xt, ut).

Here Ṽ(xt, ut) is a prediction of V(xt) given our current approximation to the value function, while r(xt) + α Ṽ(x_{t+1}, ut) is an improved prediction that incorporates knowledge of the reward r(xt) and of the next state x_{t+1}.
65
Value function Autonomous systems
  • Given an arbitrary initial weight vector u0, the algorithm iteratively adjusts the weights until a suitable weight vector is found.
  • The update law for the weight vector is
      u_{t+1} = ut + γt dt zt,
    where γt is a scalar step size and zt is called the eligibility vector.

66
Value function Autonomous systems
  • The eligibility vector is defined as
      zt = Σ_{k=0}^{t} (αλ)^{t-k} ∇u Ṽ(xk, uk),
    where λ ∈ [0, 1].

It provides a direction for the adjustment of the weights such that Ṽ(xt, ut) moves towards the improved prediction r(xt) + α Ṽ(x_{t+1}, ut).
67
Value function Autonomous systems
  • Note that the eligibility vectors can be recursively updated according to
      zt = αλ z_{t-1} + ∇u Ṽ(xt, ut),  with z_{-1} = 0.
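A hedged Python sketch of TD(λ) with a linear approximator and this recursive eligibility update is shown below; the process dynamics, reward, basis functions, and constants are illustrative assumptions.

```python
# Hedged sketch: TD(lambda) with V_tilde(x, u) = u . phi(x) on an autonomous process,
# using z_t = alpha*lam * z_{t-1} + phi(x_t) and u_{t+1} = u_t + step * d_t * z_t.
import numpy as np

rng = np.random.default_rng(7)
alpha, lam, step = 0.9, 0.6, 0.01

def phi(x):
    return np.array([1.0, x, x ** 2])         # hypothetical basis functions

def reward(x):
    return -x ** 2                             # hypothetical scalar reward r(x)

u, z, x = np.zeros(3), np.zeros(3), 0.0
for t in range(20000):
    x_next = 0.8 * x + 0.1 * rng.normal()      # assumed autonomous dynamics x_{t+1} = f(x_t, w_t)
    d = reward(x) + alpha * (u @ phi(x_next)) - u @ phi(x)   # temporal difference d_t
    z = alpha * lam * z + phi(x)               # eligibility vector recursion
    u += step * d * z                          # weight update
    x = x_next
print(u)
```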

68
Value function Autonomous systems
  • Consequently, when λ = 0 the update law of the weight vector can be rewritten as
      u_{t+1} = ut + γt dt ∇u Ṽ(xt, ut).
    That means that only the most recent state affects the update.
  • In the more general case of 0 < λ ≤ 1, the update u_{t+1} = ut + γt dt zt spreads the adjustment over previously visited states: the temporal difference dt acts as the trigger, γt as the step size, and zt as the direction.
69
Convergence: Linear Approximators
  • Under appropriate technical conditions:
  • i) For any λ ∈ [0, 1], there exists a vector u(λ) such that the sequence ut generated by the algorithm converges (with probability one) to u(λ).
  • ii) The limit of convergence u(λ) satisfies a bound on the distance between Ṽ(·, u(λ)) and the true value function V (see the references below).
  • References: J. N. Tsitsiklis and B. Van Roy, "An Analysis of Temporal-Difference Learning with Function Approximation," IEEE Transactions on Automatic Control, 42(5):674-690, 1997; D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, chapter 6.

70
Value function Controlled systems
  • Unlike an autonomous system, a controlled system
    cannot be passively simulated and observed.
    Control decisions are required and influence the system's dynamics.
  • The objective here is to approximate the optimal
    value function of a controlled system.

71
Value function Controlled systems
  • Approximate Policy Iteration
  • A classical dynamic programming algorithm: policy iteration.
  • Given a value function v(·, π) corresponding to a stationary policy π, an improved policy π′ can be defined by
      π′(x) ∈ argmax_{a∈A} E[ r(x, a) + α v(f(x, a, w), π) ].
  • In particular, v(x, π′) ≥ v(x, π) for all x.
  • Furthermore, a sequence of policies {πm}, initialized with some arbitrary π0 and updated according to
      π_{m+1}(x) ∈ argmax_{a∈A} E[ r(x, a) + α v(f(x, a, w), πm) ],
    converges to an optimal policy π*.

72
Value function Controlled systems
  • Approximate Policy Iteration
  • For each value function v(·, πm), apply temporal-difference learning, generating a sequence of weight vectors that converges to some um.
  • Select um such that Ṽ(x, um) ≈ v(x, πm) for all x, and use Ṽ(·, um) in place of v(·, πm) in the policy update.
  • Start with an arbitrary initial stationary policy π0.

73
Value function Controlled systems
  • Approximate Policy Iteration: two loops (see the sketch below).
  • External loop: once a converged weight vector is found, update the present stationary policy.
  • Internal loop: apply temporal-difference learning to generate each iterate of the weight vector, i.e., the value function approximation for the current policy.
  • Initialization: an arbitrary initial stationary policy.
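A hedged Python sketch of this two-loop structure is given below; the evaluator and improver passed in are placeholder stubs standing in for a TD(λ) run and a greedy one-step lookahead on a real model.

```python
# Hedged sketch of approximate policy iteration: an inner loop that fits weights
# u_m for the current policy (e.g., with TD(lambda)), and an outer loop that
# replaces the policy by one that is greedy with respect to the fitted approximation.
def approximate_policy_iteration(policy, evaluate_policy, improve_policy, n_outer=10):
    """Alternate approximate evaluation (inner loop) and policy improvement (outer loop)."""
    u = None
    for m in range(n_outer):
        u = evaluate_policy(policy)      # inner loop: weights u_m with V_tilde(., u_m) ~ v(., policy)
        policy = improve_policy(u)       # outer loop: next policy, greedy w.r.t. V_tilde(., u_m)
    return policy, u

# Toy usage with stand-in components (a real application would plug in the TD(lambda)
# evaluator sketched earlier and a greedy one-step lookahead on its own model).
evaluate = lambda policy: [0.0] * 3
improve = lambda u: (lambda x: 0)
policy, u = approximate_policy_iteration(lambda x: 0, evaluate, improve)
print(u)
```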

74
Value function Controlled systems
  • Approximate Policy Iteration
  • A result from section 6.2 of Bertsekas and Tsitsiklis, Neuro-Dynamic Programming (Athena Scientific, Belmont, MA, 1996):
    if there exists some ε such that, for all m, the approximation error satisfies max_x |Ṽ(x, um) - v(x, πm)| ≤ ε, then
      limsup_{m→∞} max_x |V(x) - v(x, πm)| ≤ 2αε / (1 - α)².
  • The external sequence of policies does not always converge.

75
Value function Controlled systems
  • Controlled TD
  • Arbitrarily initialize the weight vector u0 and the eligibility vector z_{-1} = 0.
  • At each time t, generate a decision according to
      at ∈ argmax_{a∈A} E[ r(xt, a) + α Ṽ(f(xt, a, w), ut) ],
  • and update the weights as in the autonomous case, u_{t+1} = ut + γt dt zt, where
      dt = r(xt, at) + α Ṽ(x_{t+1}, ut) - Ṽ(xt, ut)  and  zt = αλ z_{t-1} + ∇u Ṽ(xt, ut).

76
Value function Controlled systems
  • Big problem: convergence.
  • A modification that has been found to be useful in practical applications involves adding exploration noise to the controls.
  • One approach to this end involves randomizing decisions: at each time t, decisions other than the greedy one are chosen with small probability, governed by a small scalar ε (one common variant is sketched below).

77
Value function Controlled systems
  • Note: 1) every decision has probability > 0 of being selected;
  • 2) as ε becomes small, the probability of selecting the greedy decision approaches 1 (simple proof).
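The slides' exact randomization formula is not reproduced here, so the following Python sketch uses one common variant, ε-greedy exploration, purely to illustrate the two properties noted above.

```python
# Hedged sketch of epsilon-greedy exploration: with probability epsilon pick a
# uniformly random decision, otherwise the greedy one.
import numpy as np

def explore(greedy_action, n_actions, epsilon, rng):
    """Return the greedy decision with probability 1 - epsilon + epsilon/n_actions."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # every decision has probability > 0
    return greedy_action

rng = np.random.default_rng(8)
picks = [explore(greedy_action=2, n_actions=4, epsilon=0.1, rng=rng) for _ in range(10000)]
print(np.mean(np.array(picks) == 2))          # close to 0.925 = 1 - 0.1 + 0.1/4
```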

78
Q-Function
  • Given V, define a Q-function by
      Q(x, a) = E[ r(x, a) + α V(f(x, a, w)) ];
  • then a greedy policy π(x) ∈ argmax_{a∈A} Q(x, a) is optimal, and decisions can be generated without evaluating a further expectation over the disturbance.
  • Q-learning is a variant of temporal-difference learning that approximates Q-functions rather than value functions.

79
Q-Function
  • Q-learning
  • Arbitrarily initialize the weight vector u0 (and the eligibility vector, if traces are used).
  • At each time t, generate a decision according to
      at ∈ argmax_{a∈A} Q̃(xt, a, ut),
  • and update the weights using the temporal difference
      dt = r(xt, at) + α max_{a∈A} Q̃(x_{t+1}, a, ut) - Q̃(xt, at, ut),
    e.g., u_{t+1} = ut + γt dt ∇u Q̃(xt, at, ut).
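Since the slide's equations are not reproduced, here is a hedged Python sketch of the familiar tabular special case of Q-learning on a made-up finite model; the model, step size, and exploration parameter are illustrative.

```python
# Hedged sketch: tabular Q-learning with the standard update
#   Q(x, a) += step * ( r + alpha * max_a' Q(x', a') - Q(x, a) )
# and epsilon-greedy exploration, on an illustrative random finite model.
import numpy as np

rng = np.random.default_rng(9)
n_states, n_actions, alpha, step, eps = 6, 3, 0.9, 0.1, 0.1
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[x, a, :]
R = rng.uniform(size=(n_states, n_actions))                        # reward r(x, a)

Q = np.zeros((n_states, n_actions))
x = 0
for t in range(50000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[x]))
    x_next = int(rng.choice(n_states, p=P[x, a]))                  # simulate a transition
    Q[x, a] += step * (R[x, a] + alpha * Q[x_next].max() - Q[x, a])
    x = x_next
print(np.argmax(Q, axis=1))                                        # greedy policy from Q
```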

80
Q-Function
  • As in the case of controlled TD, it is often desirable to incorporate exploration.
  • Randomize decisions by occasionally choosing, at each time t, a decision other than the greedy one, with a probability governed by a small scalar ε (as in the exploration sketch above).
  • Note: 1) every decision has probability > 0 of being selected; 2) as ε becomes small, the probability of the greedy decision approaches 1.
  • The analysis of Q-learning bears many similarities with that of controlled TD, and results that apply to one can often be generalized in a straightforward way to accommodate the other.

81
Relationship with Approximate Value Iteration
  • The classical value iteration algorithm can be described compactly in terms of the dynamic programming operator T:
      J_{k+1} = T Jk.
  • Approximate value iteration replaces each iterate by its best representation within the chosen parameterization, e.g.,
      Ṽ(·, u_{k+1}) = Π T Ṽ(·, uk),
    where Π denotes a projection onto the set of functions representable by the approximator.
  • Disadvantage:
  • Approximate value iteration need not possess fixed points, and therefore should not be expected to converge.
  • In fact, even in cases where a fixed point exists, and even when the system is autonomous, the algorithm can generate a diverging sequence of weight vectors.

82
Relationship with Approximate Value Iteration
83
Relationship with Approximate Value Iteration
  • Controlled TD can be thought of as a stochastic
    approximation algorithm designed to converge on
    fixed points of approximate value iteration.
  • Advantage
  • Controlled TD uses simulation to effectively
    bypass the need to explicitly compute projections
    required for approximate value iteration.
  • Autonomous systems
  • Controlled systems: the possible introduction of exploration.

84
Historical View
  • A long history and big names:
  • Sutton: temporal-difference learning, based on earlier work by Barto and Sutton on models for classical conditioning phenomena observed in animal behavior, and by Barto, Sutton, and Anderson on actor-critic methods.
  • Witten: the lookup-table algorithm bears similarities with one proposed by Witten a decade earlier.
  • Watkins: Q-learning was proposed in his thesis, and the study of temporal-difference learning was integrated with classical ideas from dynamic programming and stochastic approximation theory.
  • The work of Werbos and of Barto, Bradtke, and Singh also contributed to the above integration.

85
Historical View
  • Applications:
  • Tesauro: a world-class Backgammon-playing program, which first demonstrated the practical potential of these methods. Since then:
  • channel allocation in cellular communication networks,
  • elevator dispatching,
  • inventory management,
  • job-shop scheduling.

86
Actors and Critics
  • Averaged Rewards
  • Independent actors
  • An actor is a parameterized class of policies.

87
Actors and Critics
  • Independent actors (cont.)
  • One stochastic gradient method, proposed by Marbach and Tsitsiklis, updates the policy parameters along an estimate of the gradient of the average reward.
  • Using critic feedback.

88
(No Transcript)
89
Bibliography
  • [1] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995.
  • [2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
  • [3] R. S. Sutton, Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
  • [4] R. S. Sutton, "Learning to Predict by the Methods of Temporal Differences." Machine Learning, 3:9-44, 1988.
  • [5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
  • [6] J. N. Tsitsiklis and B. Van Roy, "An Analysis of Temporal-Difference Learning with Function Approximation." IEEE Transactions on Automatic Control, 42(5):674-690, 1997.