Title: Reinforcement Learning: Learning Algorithms
1. Reinforcement Learning: Learning Algorithms
- Csaba Szepesvári
- University of Alberta
- Kioloa, MLSS08
- Slides: http://www.cs.ualberta.ca/szepesva/MLSS08/
2. Contents
- Defining the problem(s)
- Learning optimally
- Learning a good policy
- Monte-Carlo
- Temporal Difference (bootstrapping)
- Batch fitted value iteration and relatives
3. The Learning Problem
- The MDP is unknown, but the agent can interact with the system
- Goals:
  - Learn an optimal policy
    - Where do the samples come from?
      - Samples are generated externally
      - The agent interacts with the system to get the samples (active learning)
    - Performance measure: what is the performance of the policy obtained?
  - Learn optimally: minimize regret while interacting with the system
    - Performance measure: loss in rewards due to not using the optimal policy from the beginning
    - Exploration vs. exploitation
4. Learning from Feedback
- A protocol for prediction problems:
  - x_t: situation (observed by the agent)
  - y_t ∈ Y: value to be predicted
  - p_t ∈ Y: predicted value (can depend on all past values ⇒ learning!)
  - r_t(x_t, y_t, y): value of predicting y; loss of the learner: λ_t = r_t(x_t, y_t, y_t) − r_t(x_t, y_t, p_t)
- Supervised learning: the agent is told y_t and r_t(x_t, y_t, ·)
  - Regression: r_t(x_t, y_t, y) = −(y − y_t)^2 ⇒ λ_t = (y_t − p_t)^2
- Full-information prediction problem: ∀ y ∈ Y, r_t(x_t, y) is communicated to the agent, but not y_t
- Bandit (partial-information) problem: only r_t(x_t, p_t) is communicated to the agent
5. Learning Optimally
- Explore or exploit?
- Bandit problems
- Simple schemes
- Optimism in the face of uncertainty (OFU) → UCB
- Learning optimally in MDPs with the OFU principle
6. Learning Optimally: Exploration vs. Exploitation
- Two treatments
- Unknown success probabilities
- Goal: find the best treatment while losing few patients
- Explore or exploit?
7. Exploration vs. Exploitation: Some Applications
- Simple processes
- Clinical trials
- Job shop scheduling (random jobs)
- What ad to put on a web-page
- More complex processes (memory)
- Optimizing production
- Controlling an inventory
- Optimal investment
- Poker
- ..
8. Bernoulli Bandits
- Payoff is 0 or 1
- Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), …
- Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), …
[Figure: example payoff sequences observed from the two arms]
9. Some Definitions
- T_a(t): number of times arm a was played up to time t; A_t: the arm played at time t
- Example: now t = 9, T_1(t−1) = 4, T_2(t−1) = 4, A_1 = 1, A_2 = 2, …
- Payoff is 0 or 1
- Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), …
- Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), …
[Figure: same example payoff sequences as on the previous slide]
10. The Exploration/Exploitation Dilemma
- Action values: Q*(a) = E[R_t(a)]
- Suppose you form estimates Q_t(a) ≈ Q*(a)
- The greedy action at time t: A_t* = argmax_a Q_t(a)
- Exploitation: when the agent chooses to follow A_t*
- Exploration: when the agent chooses to do something else
- You can't exploit all the time; you can't explore all the time
- You can never stop exploring, but you should always reduce exploring. Maybe.
11. Action-Value Methods
- Methods that adapt action-value estimates and nothing else
- How to estimate action values? Sample average: Q_t(a) = ( R_1(a) + … + R_{n_t(a)}(a) ) / n_t(a)
- Claim: Q_t(a) → Q*(a) if n_t(a) → ∞
- Why?
12. ε-Greedy Action Selection
- Greedy action selection: A_t = A_t* = argmax_a Q_t(a)
- ε-greedy: with probability 1 − ε take the greedy action, with probability ε take an action uniformly at random
- … the simplest way to balance exploration and exploitation (see the sketch below)
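A minimal ε-greedy sketch in Python; Q is assumed to be a list of current action-value estimates, and the function name is illustrative rather than from the slides.

import random

def epsilon_greedy(Q, epsilon):
    # With probability epsilon, explore: pick an action uniformly at random.
    if random.random() < epsilon:
        return random.randrange(len(Q))
    # Otherwise exploit: pick a greedy action (ties broken by lowest index).
    return max(range(len(Q)), key=lambda a: Q[a])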
13. The 10-Armed Testbed
- n = 10 possible actions
- Repeat 2000 times:
  - Q*(a) ~ N(0, 1)
  - Play 1000 rounds
  - R_t(a) ~ N(Q*(a), 1)
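A sketch of running this testbed with ε-greedy and sample-average estimates; the function and variable names are mine, and plotting of the learning curve is omitted.

import random

def run_testbed(n_arms=10, steps=1000, runs=2000, epsilon=0.1):
    # Average reward at each time step, averaged over all independent runs.
    avg_reward = [0.0] * steps
    for run in range(runs):
        q_star = [random.gauss(0, 1) for _ in range(n_arms)]   # Q*(a) ~ N(0, 1)
        Q = [0.0] * n_arms                                     # sample-average estimates
        N = [0] * n_arms                                       # visit counts
        for t in range(steps):
            if random.random() < epsilon:
                a = random.randrange(n_arms)                   # explore
            else:
                a = max(range(n_arms), key=lambda i: Q[i])     # exploit
            r = random.gauss(q_star[a], 1)                     # R_t(a) ~ N(Q*(a), 1)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]                          # incremental sample average
            avg_reward[t] += (r - avg_reward[t]) / (run + 1)   # running mean over runs
    return avg_reward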
14. ε-Greedy Methods on the 10-Armed Testbed
15. Softmax Action Selection
- Problem with ε-greedy: it neglects the action values when exploring
- Softmax idea: grade action probabilities by the estimated values
- Gibbs (Boltzmann) action selection, or exponential weights: P(A_t = a) ∝ exp( Q_t(a) / τ_t )
- τ_t is the computational temperature
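A sketch of Boltzmann (softmax) selection; tau is the temperature and the interface is illustrative.

import math
import random

def softmax_action(Q, tau):
    # Subtract the max before exponentiating for numerical stability; probabilities are unchanged.
    m = max(Q)
    weights = [math.exp((q - m) / tau) for q in Q]
    # P(a) proportional to exp(Q(a)/tau): high tau -> near uniform, low tau -> near greedy.
    return random.choices(range(len(Q)), weights=weights)[0]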
16. Incremental Implementation
- Sample average: Q_n(a) = ( R_1(a) + … + R_n(a) ) / n
- Incremental computation: Q_{n+1}(a) = Q_n(a) + ( R_{n+1}(a) − Q_n(a) ) / (n + 1)
- Common update-rule form: NewEstimate ← OldEstimate + StepSize · ( Target − OldEstimate )
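The common update-rule form in code (a one-liner; names are illustrative). With StepSize = 1/(n+1) and Target = the newest reward it reproduces the incremental sample average; with a constant StepSize it becomes a recency-weighted average.

def update(old_estimate, target, step_size):
    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    return old_estimate + step_size * (target - old_estimate)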
17. UCB: Upper Confidence Bounds
Auer et al. 02
- Principle: optimism in the face of uncertainty
- Works when the environment is not adversarial
- Assume rewards are in [0, 1]. Pick the arm maximizing the index U_t(a) = Q_t(a) + sqrt( p log t / (2 n_t(a)) ), with p > 2
- For a stationary environment with i.i.d. rewards this algorithm is hard to beat!
- Formally: the regret in T steps is O(log T)
- Improvement: estimate the variance and use it in place of p [AuSzeMu 07]
- This principle can be used for achieving small regret in the full RL problem!
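A sketch of the UCB rule with a Hoeffding-style index; the exact constant is one common form and may differ from the slides'. means[a] is the empirical mean reward of arm a, counts[a] its play count, t the current time step.

import math

def ucb_action(means, counts, t, p=2.5):
    # Play every arm once before using the index.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    # Index: empirical mean plus an exploration bonus that shrinks as the arm is played more.
    index = [means[a] + math.sqrt(p * math.log(t) / (2 * counts[a]))
             for a in range(len(means))]
    return max(range(len(means)), key=lambda a: index[a])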
18. UCRL2: UCB Applied to RL
- Auer, Jaksch & Ortner 07
- Algorithm UCRL2(δ):
  - Phase initialization:
    - Estimate the mean model p̂ using maximum likelihood (counts)
    - Confidence set: C = { p : ||p(·|x,a) − p̂(·|x,a)||_1 ≤ c sqrt( |X| log(|A|T/δ) / N(x,a) ) }
    - Optimistic model and policy: p̃ = argmax_{p ∈ C} ρ*(p), π = π*(p̃)
    - N_0(x,a) ← N(x,a) for all (x,a) ∈ X × A
  - Execution:
    - Execute π until some (x,a) has been visited at least N_0(x,a) times in this phase
19. UCRL2 Results
- Definition: the diameter of an MDP M is D(M) = max_{x,y} min_π E[ T(x → y; π) ]
- Regret bounds:
  - Lower bound: E[L_T] = Ω( sqrt( D |X| |A| T ) )
  - Upper bounds:
    - w.p. 1 − δ/T: L_T = O( D |X| sqrt( |A| T log(|A|T/δ) ) )
    - w.p. 1 − δ: L_T = O( D^2 |X|^2 |A| log(|A|T/δ) / Δ ), where Δ is the performance gap between the best and the second-best policy
20. Learning a Good Policy
- Monte-Carlo methods
- Temporal Difference methods
- Tabular case
- Function approximation
- Batch learning
21. Learning a Good Policy
- Model-based learning:
  - Learn p, r
  - Solve the resulting MDP
- Model-free learning:
  - Learn the optimal action-value function and (then) act greedily
  - Actor-critic learning
  - Policy gradient methods
- Hybrid:
  - Learn a model and mix planning and a model-free method, e.g., Dyna
22. Monte-Carlo Methods
- Episodic MDPs!
- Goal: learn V^π(·), where V^π(x) = E_π[ Σ_t γ^t R_t | X_0 = x ]
- (X_t, A_t, R_t): trajectory of π
- Visits to a state:
  - f(x) = min { t : X_t = x } — first visit
  - E(x) = { t : X_t = x } — every visit
- Return: S(t) = γ^0 R_t + γ^1 R_{t+1} + …
- K independent trajectories ⇒ S^(k), E^(k), f^(k), k = 1..K
- First-visit MC: average over { S^(k)( f^(k)(x) ) : k = 1..K }
- Every-visit MC: average over { S^(k)(t) : k = 1..K, t ∈ E^(k)(x) }
- Claim: both converge to V^π(·) (a first-visit sketch follows below)
- From now on, S_t = S(t)
Singh & Sutton 96
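A first-visit Monte-Carlo sketch under the definitions above; representing each episode as a list of (state, reward) pairs is my assumption.

def first_visit_mc(episodes, gamma):
    returns = {}                                   # state -> list of first-visit returns S(f(x))
    for episode in episodes:
        # Compute the return S(t) from every time step, working backwards.
        G = 0.0
        returns_from = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns_from[t] = G
        # Record the return only at the first visit to each state.
        seen = set()
        for t, (x, _) in enumerate(episode):
            if x not in seen:
                seen.add(x)
                returns.setdefault(x, []).append(returns_from[t])
    # V(x) is the average of the recorded returns.
    return {x: sum(g) / len(g) for x, g in returns.items()}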
23. Learning to Control with MC
- Goal: learn to behave optimally
- Method:
  - Learn Q^π(x,a)
  - … to be used in an approximate policy iteration (PI) algorithm
- Idea/algorithm:
  - Add randomness
    - Goal: all actions are sampled eventually infinitely often
    - e.g., ε-greedy or exploring starts
  - Use the first-visit or the every-visit method to estimate Q^π(x,a)
  - Update the policy
    - once the values have converged … or …
    - always, at the states visited
24. Monte-Carlo Evaluation
- Convergence rate: Var( S(0) | X = x ) / N
- Advantages over DP:
  - Learn from interaction with the environment
  - No need for full models
  - No need to learn about ALL states
  - Less harm from violations of the Markov property (no bootstrapping)
- Issue: maintaining sufficient exploration
  - exploring starts, soft policies
25. Temporal Difference Methods
Samuel 59, Holland 75, Sutton 88
- Every-visit Monte-Carlo:
  - V(X_t) ← V(X_t) + α_t(X_t) ( S_t − V(X_t) )
- Bootstrapping:
  - S_t = R_t + γ S_{t+1}
  - S_t ≈ R_t + γ V(X_{t+1})
- TD(0):
  - V(X_t) ← V(X_t) + α_t(X_t) ( R_t + γ V(X_{t+1}) − V(X_t) )
- Value iteration:
  - V(X_t) ← E[ S_t | X_t ]
- Theorem: Let V_t be the sequence of functions generated by TD(0). Assume that for all x, w.p. 1, Σ_t α_t(x) = ∞ and Σ_t α_t^2(x) < ∞. Then V_t → V^π w.p. 1.
- Proof: stochastic approximation: V_{t+1} = T_t(V_t, V_t), U_{t+1} = T_t(U_t, V^π) → T V^π. [Jaakkola et al. 94, Tsitsiklis 94, SzeLi 99]
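A TD(0) update sketch matching the rule above; V is assumed to be a dict from states to values.

def td0_update(V, x, r, x_next, alpha, gamma):
    # V(X_t) <- V(X_t) + alpha * (R_t + gamma * V(X_{t+1}) - V(X_t))
    # For a terminal next state, pass a state whose value is fixed at 0.
    V[x] += alpha * (r + gamma * V[x_next] - V[x])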
26. TD or MC?
- TD advantages:
  - Can be fully incremental, i.e., learn before knowing the final outcome
    - Less memory
    - Less peak computation
  - Can learn without the final outcome
    - From incomplete sequences
- MC advantage:
  - Less harm from violations of the Markov property
- Convergence rate?
  - Var( S(0) | X = x ) decides!
27. Learning to Control with TD
- Q-learning [Watkins 90]: Q(X_t, A_t) ← Q(X_t, A_t) + α_t(X_t, A_t) [ R_t + γ max_a Q(X_{t+1}, a) − Q(X_t, A_t) ]
- Theorem: converges to Q* [JJS 94, Tsi 94, SzeLi 99]
- SARSA [Rummery & Niranjan 94]:
  - A_t: Greedy^ε(Q, X_t)
  - Q(X_t, A_t) ← Q(X_t, A_t) + α_t(X_t, A_t) [ R_t + γ Q(X_{t+1}, A_{t+1}) − Q(X_t, A_t) ]
- Off-policy (Q-learning) vs. on-policy (SARSA); sketches of both updates follow below
- Expecti-SARSA
- Actor-critic [Witten 77; Barto, Sutton & Anderson 83; Sutton 84]
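Sketches of the two updates; representing Q as a dict of dicts Q[x][a] is my assumption.

def q_learning_update(Q, x, a, r, x_next, alpha, gamma):
    # Off-policy: bootstrap on the greedy action at X_{t+1}.
    target = r + gamma * max(Q[x_next].values())
    Q[x][a] += alpha * (target - Q[x][a])

def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma):
    # On-policy: bootstrap on the action A_{t+1} actually chosen (e.g. epsilon-greedily).
    target = r + gamma * Q[x_next][a_next]
    Q[x][a] += alpha * (target - Q[x][a])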
28. Cliffwalking
ε-greedy, ε = 0.1
29. N-step TD Prediction
- Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps)
30. N-step TD Prediction
- Monte Carlo: S_t = R_t + γ R_{t+1} + … + γ^{T−t} R_T
- TD: S_t^(1) = R_t + γ V(X_{t+1})
  - Use V to estimate the remaining return
- n-step TD:
  - 2-step return: S_t^(2) = R_t + γ R_{t+1} + γ^2 V(X_{t+2})
  - n-step return: S_t^(n) = R_t + γ R_{t+1} + … + γ^{n−1} R_{t+n−1} + γ^n V(X_{t+n}) (see the sketch below)
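A sketch of computing the n-step return from the quantities above; rewards[k] holds R_{t+k}, x_n is X_{t+n}, and the names are mine.

def n_step_return(rewards, V, x_n, gamma):
    # S_t^(n) = R_t + gamma*R_{t+1} + ... + gamma^(n-1)*R_{t+n-1} + gamma^n * V(X_{t+n})
    n = len(rewards)
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    return G + gamma ** n * V[x_n]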
31. Learning with n-step Backups
- Learning with n-step backups:
  - V(X_t) ← V(X_t) + α_t ( S_t^(n) − V(X_t) )
- n controls how much to bootstrap
32. Random Walk Examples
- How does 2-step TD work here?
- How about 3-step TD?
33. A Larger Example
- Task: 19-state random walk
- Do you think there is an optimal n? For everything?
34. Averaging N-step Returns
- Idea: back up an average of several returns
  - e.g., back up half of the 2-step and half of the 4-step return
- Such an average is called a complex backup (still one backup)
35. Forward View of TD(λ)
Sutton 88
- Idea: average over multiple n-step backups
- λ-return:
  - S_t^(λ) = (1 − λ) Σ_{n=0..∞} λ^n S_t^(n+1)
- TD(λ):
  - V(X_t) ← V(X_t) + α_t ( S_t^(λ) − V(X_t) )
- Relation to TD(0) and MC:
  - λ = 0 → TD(0)
  - λ = 1 → MC
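A sketch computing S_t^(λ) for a finite episode via the standard recursion S_t^(λ) = R_t + γ[ (1 − λ) V(X_{t+1}) + λ S_{t+1}^(λ) ], which is equivalent to the weighted sum above; the input conventions are my assumption.

def lambda_returns(rewards, next_values, gamma, lam):
    # rewards[t] = R_t, next_values[t] = V(X_{t+1}) (use 0 for the terminal state).
    G = 0.0
    out = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):
        G = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * G)
        out[t] = G
    return out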
36. λ-return on the Random Walk
- Same 19-state random walk as before
- Why are intermediate values of λ best?
37. Backward View of TD(λ)
Sutton 88, Singh & Sutton 96
- δ_t = R_t + γ V(X_{t+1}) − V(X_t)
- V(x) ← V(x) + α_t δ_t e(x)
- e(x) ← γ λ e(x) + I(x = X_t)
- Off-line updates ⇒ same as forward-view TD(λ)
- e(x): eligibility trace
  - Accumulating trace (above)
  - Replacing traces speed up convergence:
    - e(x) ← max( γ λ e(x), I(x = X_t) )
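A sketch of one backward-view TD(λ) step for the tabular case; V and e are dicts over the same states (traces initialized to 0), and the interface is mine.

def td_lambda_step(V, e, x, r, x_next, alpha, gamma, lam, replacing=False):
    delta = r + gamma * V[x_next] - V[x]           # TD error delta_t
    for s in e:
        e[s] *= gamma * lam                        # decay every trace
    e[x] = 1.0 if replacing else e[x] + 1.0        # replacing trace sets to 1, accumulating adds 1
    for s in V:
        V[s] += alpha * delta * e[s]               # credit states in proportion to their traces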
38. Function Approximation with TD
39. Gradient Descent Methods
- Assume V_t is a differentiable function of the parameter vector θ: V_t(x) = V(x; θ)
- Assume, for now, training examples of the form (X_t, V^π(X_t))
40. Performance Measures
- Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P
  - Why P?
  - Why minimize MSE?
- Let us assume that P is always the distribution of states at which backups are done
- The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
41. Gradient Descent
- Let L be any function of the parameters. Its gradient at a point θ is ∇_θ L(θ).
- Iteratively move down the gradient: θ_{t+1} = θ_t − α_t ∇_θ L(θ_t)
42. Gradient Descent in RL
- Function to descend on: the MSE of V(·; θ) under P
- Gradient-descent procedure: θ_{t+1} = θ_t + α_t ( V^π(X_t) − V(X_t; θ_t) ) ∇_θ V(X_t; θ_t)
- Bootstrapping: replace V^π(X_t) by a target S_t
- TD(λ) (forward view): use the λ-return S_t^(λ) as the target
43. Linear Methods
Sutton 84, 88, Tsitsiklis & Van Roy 97
- Linear function approximation: V(x; θ) = θ^T φ(x)
- ∇_θ V(x; θ) = φ(x)
- Tabular representation: φ(x)_y = I(x = y)
- Backward view:
  - δ_t = R_t + γ V(X_{t+1}) − V(X_t)
  - θ ← θ + α_t δ_t e
  - e ← γ λ e + ∇_θ V(X_t; θ)
- Theorem [TsiVaR 97]: V_t converges to a V such that ||V − V^π||_{D,2} ≤ ||Π V^π − V^π||_{D,2} / (1 − κ), where Π is the projection onto the function space.
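A sketch of the backward view with linear function approximation; θ, e, and the feature vectors are plain Python lists, and the function name is mine.

def linear_td_lambda_step(theta, e, phi, phi_next, r, alpha, gamma, lam):
    # V(x|theta) = theta^T phi(x), so grad_theta V(x|theta) = phi(x).
    v = sum(t * f for t, f in zip(theta, phi))
    v_next = sum(t * f for t, f in zip(theta, phi_next))
    delta = r + gamma * v_next - v                 # TD error
    for i in range(len(theta)):
        e[i] = gamma * lam * e[i] + phi[i]         # e <- gamma*lambda*e + grad V(X_t)
        theta[i] += alpha * delta * e[i]           # theta <- theta + alpha * delta * e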
44. Control with FA
Rummery & Niranjan 94
- Learning state-action values
- Training examples of the form ( (X_t, A_t), target )
- The general gradient-descent rule
- Gradient-descent SARSA(λ)
45. Mountain-Car Task
Sutton 96, Singh & Sutton 96
46. Mountain-Car Results
47. Baird's Counterexample: Off-policy Updates Can Diverge
Baird 95
48. Baird's Counterexample (cont.)
49. Should We Bootstrap?
50. Batch Reinforcement Learning
51. Batch RL
- Goal: given a trajectory of the behavior policy π_b, X_1, A_1, R_1, …, X_t, A_t, R_t, …, X_N, compute a good policy!
- Batch learning
- Properties:
  - Data collection is not influenced
  - Emphasis is on the quality of the solution
  - Computational complexity plays a secondary role
- Performance measures:
  - ||V − V^π||_∞ = sup_x |V(x) − V^π(x)|
  - ||V − V^π||_2 = ( ∫ (V(x) − V^π(x))^2 dμ(x) )^{1/2}
52. Solution Methods
Bradtke & Barto 96, Lagoudakis & Parr 03, AnSzeMu 07
- Build a model
- Do not build a model, but find an approximation to Q*:
  - using value iteration → fitted Q-iteration
  - using policy iteration →
    - policy evaluated by approximate value iteration
    - policy evaluated by Bellman-residual minimization (BRM)
    - policy evaluated by least-squares temporal difference learning (LSTD) → LSPI
- Policy search
53. Evaluating a Policy: Fitted Value Iteration
- Choose a function space F.
- Solve, for m = 1, 2, …, M, the least-squares (regression) problems Q_m ≈ T^π Q_{m−1} (see the sketch below)
- Counterexamples?!? [Baird 95, Tsitsiklis & Van Roy 96]
- When does this work?
- Requirement: if M is big enough and the number of samples is big enough, Q_M should be close to Q^π
- We have to make some assumptions on F
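A sketch of the iteration, treating each step as a supervised regression problem. The transition format (x, a, r, x'), the policy callable pi, and the regressor interface (fit/predict on (x, a) pairs) are all assumptions for illustration.

def fitted_policy_evaluation(data, pi, new_regressor, gamma, M):
    Q = lambda x, a: 0.0                                     # Q_0 = 0
    for m in range(M):
        inputs = [(x, a) for (x, a, _, _) in data]
        # Regression targets: samples of (T^pi Q_m)(x, a) = r + gamma * Q_m(x', pi(x')).
        targets = [r + gamma * Q(x_next, pi(x_next)) for (_, _, r, x_next) in data]
        reg = new_regressor()                                # fit Q_{m+1} in the chosen space F
        reg.fit(inputs, targets)
        Q = lambda x, a, reg=reg: reg.predict([(x, a)])[0]
    return Q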
54. Least-Squares vs. Gradient
- Linear least squares (ordinary regression): y_t = w^T x_t + ε_t, where (x_t, y_t) are jointly distributed r.v.s, i.i.d., E[ε_t | x_t] = 0.
- Seeing (x_t, y_t), t = 1, …, T, find out w.
- Loss function: L(w) = E[ (y_1 − w^T x_1)^2 ].
- Least-squares approach:
  - w_T = argmin_w Σ_{t=1..T} (y_t − w^T x_t)^2
- Stochastic gradient method:
  - w_{t+1} = w_t + α_t (y_t − w_t^T x_t) x_t
- Tradeoffs:
  - Sample complexity: how good is the estimate?
  - Computational complexity: how expensive is the computation?
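The two estimators side by side in NumPy (names are mine). The batch solve is done once over all T samples; the gradient method costs only O(d) work per sample.

import numpy as np

def least_squares(X, y):
    # w_T = argmin_w sum_t (y_t - w^T x_t)^2, solved directly.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def stochastic_gradient(X, y, alpha=0.01, passes=10):
    # w_{t+1} = w_t + alpha * (y_t - w_t^T x_t) * x_t
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        for x_t, y_t in zip(X, y):
            w += alpha * (y_t - w @ x_t) * x_t
    return w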
55. Fitted Value Iteration: Analysis
After AnSzeMu 07
- Goal: bound ||Q_M − Q^π||_{μ,2} in terms of max_m ||ε_m||_{ν,2}, where ||ε_m||^2_{ν,2} = ∫ ε_m^2(x,a) ν(dx,da), Q_{m+1} = T^π Q_m + ε_m, and ε_{−1} = Q_0 − Q^π
- U_m = Q_m − Q^π
56. Analysis / 2
57. Summary
- If the regression errors are all small and the system is noisy (∀ π, ρ: ρ P^π ≤ C_1 ν), then the final error will be small.
- How to make the regression errors small?
- Regression error decomposition
58-61. Controlling the approximation error
62. Learning with (Lots of) Historical Data
- Data: a long trajectory of some exploration policy
- Goal: an efficient algorithm to learn a policy
- Idea: use fitted action values
- Algorithms:
  - Bellman-residual minimization (BRM), FQI [AnSzeMu 07]
  - LSPI [Lagoudakis & Parr 03]
- Bounds:
  - Oracle inequalities (BRM, FQI and LSPI)
  - ⇒ consistency
63. BRM Insight
AnSzeMu 07
- TD error: δ_t = R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t)
- Bellman error: E[ E[δ_t | X_t, A_t]^2 ]
- What we can compute/estimate: E[ δ_t^2 | X_t, A_t ]
- They are different!
- However … (see the sketch below)
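A small sketch of why the two quantities differ, and how two independent next states per (x, a), when available, give an unbiased estimate of the Bellman error; this "double sampling" device is illustrative and not necessarily the slides' fix.

def mean_squared_td_error(deltas):
    # Estimates E[delta^2 | x, a] = (E[delta | x, a])^2 + Var(delta | x, a)  -- biased upwards.
    return sum(d * d for d in deltas) / len(deltas)

def double_sample_bellman_error(delta_pairs):
    # With delta and delta' built from two independent next states for the same (x, a),
    # E[delta * delta' | x, a] = (E[delta | x, a])^2, the squared Bellman error at (x, a).
    return sum(d1 * d2 for d1, d2 in delta_pairs) / len(delta_pairs)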
64. Loss Function
65. Algorithm (BRM)
66. Do We Need to Reweight or Throw Away Data?
- NO!
- WHY?
  - Intuition from regression: m(x) = E[Y | X = x] can be learnt no matter what p(x) is!
  - π(a|x): the same should be possible!
- BUT …
  - performance might be poor! ⇒ YES!
  - like in supervised learning when the training and test distributions are different
67. Bound
68. The Concentration Coefficients
- Lyapunov exponents
- Our case:
  - y_t is infinite dimensional
  - P_t depends on the policy chosen
- If the top Lyapunov exponent is ≤ 0, we are good
69. Open Question
70. Relation to LSTD
AnSzeMu 07
- LSTD
- Linear function space
- Bootstrap the normal equation
71. Open Issues
- Adaptive algorithms to take advantage of regularity when present, to address the curse of dimensionality
  - Penalized least-squares / aggregation?
  - Feature relevance
  - Factorization
  - Manifold estimation
- Abstraction: build automatically
- Active learning
- Optimal on-line learning for infinite problems
72. References
- [Auer et al. 02] P. Auer, N. Cesa-Bianchi and P. Fischer: Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
- [AuSzeMu 07] J.-Y. Audibert, R. Munos and Cs. Szepesvári: Tuning bandit algorithms in stochastic environments. ALT, 2007.
- [Auer, Jaksch & Ortner 07] P. Auer, T. Jaksch and R. Ortner: Near-optimal regret bounds for reinforcement learning. 2007. Available at http://www.unileoben.ac.at/infotech/publications/ucrlrevised.pdf
- [Singh & Sutton 96] S.P. Singh and R.S. Sutton: Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.
- [Sutton 88] R.S. Sutton: Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
- [Jaakkola et al. 94] T. Jaakkola, M.I. Jordan and S.P. Singh: On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185-1201, 1994.
- [Tsitsiklis 94] J.N. Tsitsiklis: Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185-202, 1994.
- [SzeLi 99] Cs. Szepesvári and M.L. Littman: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation, 11:2017-2059, 1999.
- [Watkins 90] C.J.C.H. Watkins: Learning from Delayed Rewards. PhD thesis, 1990.
- [Rummery & Niranjan 94] G.A. Rummery and M. Niranjan: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
- [Sutton 84] R.S. Sutton: Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
- [Tsitsiklis & Van Roy 97] J.N. Tsitsiklis and B. Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674-690, 1997.
- [Sutton 96] R.S. Sutton: Generalization in reinforcement learning: Successful examples using sparse coarse coding. NIPS, 1996.
- [Baird 95] L.C. Baird: Residual algorithms: Reinforcement learning with function approximation. ICML, 1995.
- [Bradtke & Barto 96] S.J. Bradtke and A.G. Barto: Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.
- [Lagoudakis & Parr 03] M. Lagoudakis and R. Parr: Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.
- [AnSzeMu 07] A. Antos, Cs. Szepesvári and R. Munos: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2007.