Title: Reinforcement Learning: Learning Algorithms
1. Reinforcement Learning: Learning Algorithms
- Csaba Szepesvári
- University of Alberta
- Kioloa, MLSS08
- Slides: http://www.cs.ualberta.ca/szepesva/MLSS08/
2. Contents
- Defining the problem(s)
- Learning optimally
- Learning a good policy
- Monte-Carlo
- Temporal Difference (bootstrapping)
- Batch fitted value iteration and relatives
3. The Learning Problem
- The MDP is unknown, but the agent can interact with the system
- Goals:
  - Learn an optimal policy
    - Where do the samples come from?
      - Samples are generated externally
      - The agent interacts with the system to get the samples (active learning)
    - Performance measure: what is the performance of the policy obtained?
  - Learn optimally: minimize regret while interacting with the system
    - Performance measure: loss in rewards due to not using the optimal policy from the beginning
    - Exploration vs. exploitation
4. Learning from Feedback
- A protocol for prediction problems:
  - x_t: situation (observed by the agent)
  - y_t ∈ Y: value to be predicted
  - p_t ∈ Y: predicted value (can depend on all past values ⇒ learning!)
  - r_t(x_t, y_t, y): value of predicting y; loss of the learner: λ_t = r_t(x_t, y_t, y_t) − r_t(x_t, y_t, p_t)
- Supervised learning: the agent is told y_t and r_t(x_t, y_t, ·)
  - Regression: r_t(x_t, y_t, y) = −(y − y_t)^2 ⇒ λ_t = (y_t − p_t)^2
- Full-information prediction problem: ∀ y ∈ Y, r_t(x_t, y) is communicated to the agent, but not y_t
- Bandit (partial-information) problem: only r_t(x_t, p_t) is communicated to the agent
5. Learning Optimally
- Explore or exploit?
- Bandit problems
- Simple schemes
- Optimism in the face of uncertainty (OFU) → UCB
- Learning optimally in MDPs with the OFU principle
6. Learning Optimally: Exploration vs. Exploitation
- Two treatments
- Unknown success probabilities
- Goal: find the best treatment while losing few patients
- Explore or exploit?
7. Exploration vs. Exploitation: Some Applications
- Simple processes
- Clinical trials
- Job shop scheduling (random jobs)
- What ad to put on a web-page
- More complex processes (memory)
- Optimizing production
- Controlling an inventory
- Optimal investment
- Poker
- ..
8. Bernoulli Bandits
- Payoff is 0 or 1
- Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), …
- Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), …
[Figure: example payoff sequences observed from the two arms]
9. Some Definitions
- T_a(t): number of times arm a was played up to time t; A_t: the arm played at time t
- Example: now t = 9, T_1(t−1) = 4, T_2(t−1) = 4, A_1 = 1, A_2 = 2, …
- Payoff is 0 or 1
- Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), …
- Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), …
[Figure: same example payoff sequences as on the previous slide]
10. The Exploration/Exploitation Dilemma
- Action values: Q*(a) = E[R_t(a)]
- Suppose you form estimates Q_t(a) ≈ Q*(a)
- The greedy action at time t: A_t* = argmax_a Q_t(a)
- Exploitation: when the agent chooses to follow A_t*
- Exploration: when the agent chooses to do something else
- You can't exploit all the time; you can't explore all the time
- You can never stop exploring, but you should always reduce exploring. Maybe.
11. Action-Value Methods
- Methods that adapt action-value estimates and nothing else
- How to estimate action values? Sample average: Q_t(a) = ( R_1(a) + … + R_{n_t(a)}(a) ) / n_t(a)
- Claim: Q_t(a) → Q*(a) if n_t(a) → ∞
- Why?
12. ε-Greedy Action Selection
- Greedy action selection: A_t = A_t* = argmax_a Q_t(a)
- ε-greedy: with probability 1 − ε take the greedy action, with probability ε take an action uniformly at random
- … the simplest way to balance exploration and exploitation (see the sketch below)
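A minimal ε-greedy sketch in Python; Q is assumed to be a list of current action-value estimates, and the function name is illustrative rather than from the slides.

import random

def epsilon_greedy(Q, epsilon):
    # With probability epsilon, explore: pick an action uniformly at random.
    if random.random() < epsilon:
        return random.randrange(len(Q))
    # Otherwise exploit: pick a greedy action (ties broken by lowest index).
    return max(range(len(Q)), key=lambda a: Q[a])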
13. The 10-Armed Testbed
- n = 10 possible actions
- Repeat 2000 times:
  - Q*(a) ~ N(0, 1)
  - Play 1000 rounds
  - R_t(a) ~ N(Q*(a), 1)
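A sketch of running this testbed with ε-greedy and sample-average estimates; the function and variable names are mine, and plotting of the learning curve is omitted.

import random

def run_testbed(n_arms=10, steps=1000, runs=2000, epsilon=0.1):
    # Average reward at each time step, averaged over all independent runs.
    avg_reward = [0.0] * steps
    for run in range(runs):
        q_star = [random.gauss(0, 1) for _ in range(n_arms)]   # Q*(a) ~ N(0, 1)
        Q = [0.0] * n_arms                                     # sample-average estimates
        N = [0] * n_arms                                       # visit counts
        for t in range(steps):
            if random.random() < epsilon:
                a = random.randrange(n_arms)                   # explore
            else:
                a = max(range(n_arms), key=lambda i: Q[i])     # exploit
            r = random.gauss(q_star[a], 1)                     # R_t(a) ~ N(Q*(a), 1)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]                          # incremental sample average
            avg_reward[t] += (r - avg_reward[t]) / (run + 1)   # running mean over runs
    return avg_reward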
14. ε-Greedy Methods on the 10-Armed Testbed
15. Softmax Action Selection
- Problem with ε-greedy: it neglects the action values when exploring
- Softmax idea: grade action probabilities by the estimated values
- Gibbs (Boltzmann) action selection, or exponential weights: P(A_t = a) ∝ exp( Q_t(a) / τ_t )
- τ_t is the computational temperature
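A sketch of Boltzmann (softmax) selection; tau is the temperature and the interface is illustrative.

import math
import random

def softmax_action(Q, tau):
    # Subtract the max before exponentiating for numerical stability; probabilities are unchanged.
    m = max(Q)
    weights = [math.exp((q - m) / tau) for q in Q]
    # P(a) proportional to exp(Q(a)/tau): high tau -> near uniform, low tau -> near greedy.
    return random.choices(range(len(Q)), weights=weights)[0]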
16. Incremental Implementation
- Sample average: Q_n(a) = ( R_1(a) + … + R_n(a) ) / n
- Incremental computation: Q_{n+1}(a) = Q_n(a) + ( R_{n+1}(a) − Q_n(a) ) / (n + 1)
- Common update-rule form: NewEstimate ← OldEstimate + StepSize · ( Target − OldEstimate )
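The common update-rule form in code (a one-liner; names are illustrative). With StepSize = 1/(n+1) and Target = the newest reward it reproduces the incremental sample average; with a constant StepSize it becomes a recency-weighted average.

def update(old_estimate, target, step_size):
    # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    return old_estimate + step_size * (target - old_estimate)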
17. UCB: Upper Confidence Bounds
Auer et al. 02
- Principle: optimism in the face of uncertainty
- Works when the environment is not adversarial
- Assume rewards are in [0, 1]. Pick the arm maximizing the index U_t(a) = Q_t(a) + sqrt( p log t / (2 n_t(a)) ), with p > 2
- For a stationary environment with i.i.d. rewards this algorithm is hard to beat!
- Formally: the regret in T steps is O(log T)
- Improvement: estimate the variance and use it in place of p [AuSzeMu 07]
- This principle can be used for achieving small regret in the full RL problem!
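A sketch of the UCB rule with a Hoeffding-style index; the exact constant is one common form and may differ from the slides'. means[a] is the empirical mean reward of arm a, counts[a] its play count, t the current time step.

import math

def ucb_action(means, counts, t, p=2.5):
    # Play every arm once before using the index.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    # Index: empirical mean plus an exploration bonus that shrinks as the arm is played more.
    index = [means[a] + math.sqrt(p * math.log(t) / (2 * counts[a]))
             for a in range(len(means))]
    return max(range(len(means)), key=lambda a: index[a])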
18. UCRL2: UCB Applied to RL
- Auer, Jaksch & Ortner 07
- Algorithm UCRL2(δ):
  - Phase initialization:
    - Estimate the mean model p̂ using maximum likelihood (counts)
    - Confidence set: C = { p : ||p(·|x,a) − p̂(·|x,a)||_1 ≤ c sqrt( |X| log(|A|T/δ) / N(x,a) ) }
    - Optimistic model and policy: p̃ = argmax_{p ∈ C} ρ*(p), π = π*(p̃)
    - N_0(x,a) ← N(x,a) for all (x,a) ∈ X × A
  - Execution:
    - Execute π until some (x,a) has been visited at least N_0(x,a) times in this phase
19. UCRL2 Results
- Definition: the diameter of an MDP M is D(M) = max_{x,y} min_π E[ T(x → y; π) ]
- Regret bounds:
  - Lower bound: E[L_T] = Ω( sqrt( D |X| |A| T ) )
  - Upper bounds:
    - w.p. 1 − δ/T: L_T = O( D |X| sqrt( |A| T log(|A|T/δ) ) )
    - w.p. 1 − δ: L_T = O( D^2 |X|^2 |A| log(|A|T/δ) / Δ ), where Δ is the performance gap between the best and the second-best policy
20. Learning a Good Policy
- Monte-Carlo methods
- Temporal Difference methods
- Tabular case
- Function approximation
- Batch learning
21. Learning a Good Policy
- Model-based learning:
  - Learn p, r
  - Solve the resulting MDP
- Model-free learning:
  - Learn the optimal action-value function and (then) act greedily
  - Actor-critic learning
  - Policy gradient methods
- Hybrid:
  - Learn a model and mix planning and a model-free method, e.g., Dyna
22. Monte-Carlo Methods
- Episodic MDPs!
- Goal: learn V^π(·), where V^π(x) = E_π[ Σ_t γ^t R_t | X_0 = x ]
- (X_t, A_t, R_t): trajectory of π
- Visits to a state:
  - f(x) = min { t : X_t = x } — first visit
  - E(x) = { t : X_t = x } — every visit
- Return: S(t) = γ^0 R_t + γ^1 R_{t+1} + …
- K independent trajectories ⇒ S^(k), E^(k), f^(k), k = 1..K
- First-visit MC: average over { S^(k)( f^(k)(x) ) : k = 1..K }
- Every-visit MC: average over { S^(k)(t) : k = 1..K, t ∈ E^(k)(x) }
- Claim: both converge to V^π(·) (a first-visit sketch follows below)
- From now on, S_t = S(t)
Singh & Sutton 96
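A first-visit Monte-Carlo sketch under the definitions above; representing each episode as a list of (state, reward) pairs is my assumption.

def first_visit_mc(episodes, gamma):
    returns = {}                                   # state -> list of first-visit returns S(f(x))
    for episode in episodes:
        # Compute the return S(t) from every time step, working backwards.
        G = 0.0
        returns_from = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns_from[t] = G
        # Record the return only at the first visit to each state.
        seen = set()
        for t, (x, _) in enumerate(episode):
            if x not in seen:
                seen.add(x)
                returns.setdefault(x, []).append(returns_from[t])
    # V(x) is the average of the recorded returns.
    return {x: sum(g) / len(g) for x, g in returns.items()}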
23. Learning to Control with MC
- Goal: learn to behave optimally
- Method:
  - Learn Q^π(x,a)
  - … to be used in an approximate policy iteration (PI) algorithm
- Idea/algorithm:
  - Add randomness
    - Goal: all actions are sampled eventually infinitely often
    - e.g., ε-greedy or exploring starts
  - Use the first-visit or the every-visit method to estimate Q^π(x,a)
  - Update the policy
    - once the values have converged … or …
    - always, at the states visited
24. Monte-Carlo Evaluation
- Convergence rate: Var( S(0) | X = x ) / N
- Advantages over DP:
  - Learn from interaction with the environment
  - No need for full models
  - No need to learn about ALL states
  - Less harm from violations of the Markov property (no bootstrapping)
- Issue: maintaining sufficient exploration
  - exploring starts, soft policies
25. Temporal Difference Methods
Samuel 59, Holland 75, Sutton 88
- Every-visit Monte-Carlo:
  - V(X_t) ← V(X_t) + α_t(X_t) ( S_t − V(X_t) )
- Bootstrapping:
  - S_t = R_t + γ S_{t+1}
  - S_t ≈ R_t + γ V(X_{t+1})
- TD(0):
  - V(X_t) ← V(X_t) + α_t(X_t) ( R_t + γ V(X_{t+1}) − V(X_t) )
- Value iteration:
  - V(X_t) ← E[ S_t | X_t ]
- Theorem: Let V_t be the sequence of functions generated by TD(0). Assume that for all x, w.p. 1, Σ_t α_t(x) = ∞ and Σ_t α_t^2(x) < ∞. Then V_t → V^π w.p. 1.
- Proof: stochastic approximation: V_{t+1} = T_t(V_t, V_t), U_{t+1} = T_t(U_t, V^π) → T V^π. [Jaakkola et al. 94, Tsitsiklis 94, SzeLi 99]
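A TD(0) update sketch matching the rule above; V is assumed to be a dict from states to values.

def td0_update(V, x, r, x_next, alpha, gamma):
    # V(X_t) <- V(X_t) + alpha * (R_t + gamma * V(X_{t+1}) - V(X_t))
    # For a terminal next state, pass a state whose value is fixed at 0.
    V[x] += alpha * (r + gamma * V[x_next] - V[x])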
26. TD or MC?
- TD advantages:
  - Can be fully incremental, i.e., learn before knowing the final outcome
    - Less memory
    - Less peak computation
  - Can learn without the final outcome
    - From incomplete sequences
- MC advantage:
  - Less harm from violations of the Markov property
- Convergence rate?
  - Var( S(0) | X = x ) decides!
27. Learning to Control with TD
- Q-learning [Watkins 90]: Q(X_t, A_t) ← Q(X_t, A_t) + α_t(X_t, A_t) [ R_t + γ max_a Q(X_{t+1}, a) − Q(X_t, A_t) ]
- Theorem: converges to Q* [JJS 94, Tsi 94, SzeLi 99]
- SARSA [Rummery & Niranjan 94]:
  - A_t: Greedy^ε(Q, X_t)
  - Q(X_t, A_t) ← Q(X_t, A_t) + α_t(X_t, A_t) [ R_t + γ Q(X_{t+1}, A_{t+1}) − Q(X_t, A_t) ]
- Off-policy (Q-learning) vs. on-policy (SARSA); sketches of both updates follow below
- Expecti-SARSA
- Actor-critic [Witten 77; Barto, Sutton & Anderson 83; Sutton 84]
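Sketches of the two updates; representing Q as a dict of dicts Q[x][a] is my assumption.

def q_learning_update(Q, x, a, r, x_next, alpha, gamma):
    # Off-policy: bootstrap on the greedy action at X_{t+1}.
    target = r + gamma * max(Q[x_next].values())
    Q[x][a] += alpha * (target - Q[x][a])

def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma):
    # On-policy: bootstrap on the action A_{t+1} actually chosen (e.g. epsilon-greedily).
    target = r + gamma * Q[x_next][a_next]
    Q[x][a] += alpha * (target - Q[x][a])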
28. Cliffwalking
ε-greedy, ε = 0.1
29. N-step TD Prediction
- Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps)
30. N-step TD Prediction
- Monte Carlo: S_t = R_t + γ R_{t+1} + … + γ^{T−t} R_T
- TD: S_t^(1) = R_t + γ V(X_{t+1})
  - Use V to estimate the remaining return
- n-step TD:
  - 2-step return: S_t^(2) = R_t + γ R_{t+1} + γ^2 V(X_{t+2})
  - n-step return: S_t^(n) = R_t + γ R_{t+1} + … + γ^{n−1} R_{t+n−1} + γ^n V(X_{t+n}) (see the sketch below)
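A sketch of computing the n-step return from the quantities above; rewards[k] holds R_{t+k}, x_n is X_{t+n}, and the names are mine.

def n_step_return(rewards, V, x_n, gamma):
    # S_t^(n) = R_t + gamma*R_{t+1} + ... + gamma^(n-1)*R_{t+n-1} + gamma^n * V(X_{t+n})
    n = len(rewards)
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    return G + gamma ** n * V[x_n]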
31. Learning with n-step Backups
- Learning with n-step backups:
  - V(X_t) ← V(X_t) + α_t ( S_t^(n) − V(X_t) )
- n controls how much to bootstrap
32. Random Walk Examples
- How does 2-step TD work here?
- How about 3-step TD?
33. A Larger Example
- Task: 19-state random walk
- Do you think there is an optimal n? For everything?
34. Averaging N-step Returns
- Idea: back up an average of several returns
  - e.g., back up half of the 2-step and half of the 4-step return
- Such an average is called a complex backup (still one backup)
35. Forward View of TD(λ)
Sutton 88
- Idea: average over multiple n-step backups
- λ-return:
  - S_t^(λ) = (1 − λ) Σ_{n=0..∞} λ^n S_t^(n+1)
- TD(λ):
  - V(X_t) ← V(X_t) + α_t ( S_t^(λ) − V(X_t) )
- Relation to TD(0) and MC:
  - λ = 0 → TD(0)
  - λ = 1 → MC
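A sketch computing S_t^(λ) for a finite episode via the standard recursion S_t^(λ) = R_t + γ[ (1 − λ) V(X_{t+1}) + λ S_{t+1}^(λ) ], which is equivalent to the weighted sum above; the input conventions are my assumption.

def lambda_returns(rewards, next_values, gamma, lam):
    # rewards[t] = R_t, next_values[t] = V(X_{t+1}) (use 0 for the terminal state).
    G = 0.0
    out = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):
        G = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * G)
        out[t] = G
    return out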
36. λ-return on the Random Walk
- Same 19-state random walk as before
- Why are intermediate values of λ best?
37. Backward View of TD(λ)
Sutton 88, Singh & Sutton 96
- δ_t = R_t + γ V(X_{t+1}) − V(X_t)
- V(x) ← V(x) + α_t δ_t e(x)
- e(x) ← γ λ e(x) + I(x = X_t)
- Off-line updates ⇒ same as forward-view TD(λ)
- e(x): eligibility trace
  - Accumulating trace (above)
  - Replacing traces speed up convergence:
    - e(x) ← max( γ λ e(x), I(x = X_t) )
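A sketch of one backward-view TD(λ) step for the tabular case; V and e are dicts over the same states (traces initialized to 0), and the interface is mine.

def td_lambda_step(V, e, x, r, x_next, alpha, gamma, lam, replacing=False):
    delta = r + gamma * V[x_next] - V[x]           # TD error delta_t
    for s in e:
        e[s] *= gamma * lam                        # decay every trace
    e[x] = 1.0 if replacing else e[x] + 1.0        # replacing trace sets to 1, accumulating adds 1
    for s in V:
        V[s] += alpha * delta * e[s]               # credit states in proportion to their traces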
38. Function Approximation with TD
39. Gradient Descent Methods
- Assume V_t is a differentiable function of the parameter vector θ: V_t(x) = V(x; θ)
- Assume, for now, training examples of the form (X_t, V^π(X_t))
40. Performance Measures
- Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P
  - Why P?
  - Why minimize MSE?
- Let us assume that P is always the distribution of states at which backups are done
- The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
41. Gradient Descent
- Let L be any function of the parameters. Its gradient at a point θ is ∇_θ L(θ).
- Iteratively move down the gradient: θ_{t+1} = θ_t − α_t ∇_θ L(θ_t)
42. Gradient Descent in RL
- Function to descend on: the MSE of V(·; θ) under P
- Gradient-descent procedure: θ_{t+1} = θ_t + α_t ( V^π(X_t) − V(X_t; θ_t) ) ∇_θ V(X_t; θ_t)
- Bootstrapping: replace V^π(X_t) by a target S_t
- TD(λ) (forward view): use the λ-return S_t^(λ) as the target
43. Linear Methods
Sutton 84, 88, Tsitsiklis & Van Roy 97
- Linear function approximation: V(x; θ) = θ^T φ(x)
- ∇_θ V(x; θ) = φ(x)
- Tabular representation: φ(x)_y = I(x = y)
- Backward view:
  - δ_t = R_t + γ V(X_{t+1}) − V(X_t)
  - θ ← θ + α_t δ_t e
  - e ← γ λ e + ∇_θ V(X_t; θ)
- Theorem [TsiVaR 97]: V_t converges to a V such that ||V − V^π||_{D,2} ≤ ||Π V^π − V^π||_{D,2} / (1 − κ), where Π is the projection onto the function space.
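A sketch of the backward view with linear function approximation; θ, e, and the feature vectors are plain Python lists, and the function name is mine.

def linear_td_lambda_step(theta, e, phi, phi_next, r, alpha, gamma, lam):
    # V(x|theta) = theta^T phi(x), so grad_theta V(x|theta) = phi(x).
    v = sum(t * f for t, f in zip(theta, phi))
    v_next = sum(t * f for t, f in zip(theta, phi_next))
    delta = r + gamma * v_next - v                 # TD error
    for i in range(len(theta)):
        e[i] = gamma * lam * e[i] + phi[i]         # e <- gamma*lambda*e + grad V(X_t)
        theta[i] += alpha * delta * e[i]           # theta <- theta + alpha * delta * e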
44. Control with FA
Rummery & Niranjan 94
- Learning state-action values
- Training examples of the form ( (X_t, A_t), target )
- The general gradient-descent rule
- Gradient-descent SARSA(λ)
45. Mountain-Car Task
Sutton 96, Singh & Sutton 96
46. Mountain-Car Results
47. Baird's Counterexample: Off-policy Updates Can Diverge
Baird 95
48. Baird's Counterexample (cont.)
49. Should We Bootstrap?
50. Batch Reinforcement Learning
51. Batch RL
- Goal: given a trajectory of the behavior policy π_b, X_1, A_1, R_1, …, X_t, A_t, R_t, …, X_N, compute a good policy!
- Batch learning
- Properties:
  - Data collection is not influenced
  - Emphasis is on the quality of the solution
  - Computational complexity plays a secondary role
- Performance measures:
  - ||V − V^π||_∞ = sup_x |V(x) − V^π(x)|
  - ||V − V^π||_2 = ( ∫ (V(x) − V^π(x))^2 dμ(x) )^{1/2}
52. Solution Methods
Bradtke & Barto 96, Lagoudakis & Parr 03, AnSzeMu 07
- Build a model
- Do not build a model, but find an approximation to Q*:
  - using value iteration → fitted Q-iteration
  - using policy iteration →
    - policy evaluated by approximate value iteration
    - policy evaluated by Bellman-residual minimization (BRM)
    - policy evaluated by least-squares temporal difference learning (LSTD) → LSPI
- Policy search
53. Evaluating a Policy: Fitted Value Iteration
- Choose a function space F.
- Solve, for m = 1, 2, …, M, the least-squares (regression) problems Q_m ≈ T^π Q_{m−1} (see the sketch below)
- Counterexamples?!? [Baird 95, Tsitsiklis & Van Roy 96]
- When does this work?
- Requirement: if M is big enough and the number of samples is big enough, Q_M should be close to Q^π
- We have to make some assumptions on F
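A sketch of the iteration, treating each step as a supervised regression problem. The transition format (x, a, r, x'), the policy callable pi, and the regressor interface (fit/predict on (x, a) pairs) are all assumptions for illustration.

def fitted_policy_evaluation(data, pi, new_regressor, gamma, M):
    Q = lambda x, a: 0.0                                     # Q_0 = 0
    for m in range(M):
        inputs = [(x, a) for (x, a, _, _) in data]
        # Regression targets: samples of (T^pi Q_m)(x, a) = r + gamma * Q_m(x', pi(x')).
        targets = [r + gamma * Q(x_next, pi(x_next)) for (_, _, r, x_next) in data]
        reg = new_regressor()                                # fit Q_{m+1} in the chosen space F
        reg.fit(inputs, targets)
        Q = lambda x, a, reg=reg: reg.predict([(x, a)])[0]
    return Q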
54. Least-Squares vs. Gradient
- Linear least squares (ordinary regression): y_t = w^T x_t + ε_t, where (x_t, y_t) are jointly distributed r.v.s, i.i.d., E[ε_t | x_t] = 0.
- Seeing (x_t, y_t), t = 1, …, T, find out w.
- Loss function: L(w) = E[ (y_1 − w^T x_1)^2 ].
- Least-squares approach:
  - w_T = argmin_w Σ_{t=1..T} (y_t − w^T x_t)^2
- Stochastic gradient method:
  - w_{t+1} = w_t + α_t (y_t − w_t^T x_t) x_t
- Tradeoffs:
  - Sample complexity: how good is the estimate?
  - Computational complexity: how expensive is the computation?
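The two estimators side by side in NumPy (names are mine). The batch solve is done once over all T samples; the gradient method costs only O(d) work per sample.

import numpy as np

def least_squares(X, y):
    # w_T = argmin_w sum_t (y_t - w^T x_t)^2, solved directly.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def stochastic_gradient(X, y, alpha=0.01, passes=10):
    # w_{t+1} = w_t + alpha * (y_t - w_t^T x_t) * x_t
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        for x_t, y_t in zip(X, y):
            w += alpha * (y_t - w @ x_t) * x_t
    return w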
55. Fitted Value Iteration: Analysis
After AnSzeMu 07
- Goal: bound ||Q_M − Q^π||_{μ,2} in terms of max_m ||ε_m||_{ν,2}, where ||ε_m||^2_{ν,2} = ∫ ε_m^2(x,a) ν(dx,da), Q_{m+1} = T^π Q_m + ε_m, and ε_{−1} = Q_0 − Q^π
- U_m = Q_m − Q^π
56. Analysis / 2
57. Summary
- If the regression errors are all small and the system is noisy (∀ π, ρ: ρ P^π ≤ C_1 ν), then the final error will be small.
- How to make the regression errors small?
- Regression error decomposition
58-61. Controlling the approximation error
62. Learning with (Lots of) Historical Data
- Data: a long trajectory of some exploration policy
- Goal: an efficient algorithm to learn a policy
- Idea: use fitted action values
- Algorithms:
  - Bellman-residual minimization (BRM), FQI [AnSzeMu 07]
  - LSPI [Lagoudakis & Parr 03]
- Bounds:
  - Oracle inequalities (BRM, FQI and LSPI)
  - ⇒ consistency
63. BRM Insight
AnSzeMu 07
- TD error: δ_t = R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t)
- Bellman error: E[ E[δ_t | X_t, A_t]^2 ]
- What we can compute/estimate: E[ δ_t^2 | X_t, A_t ]
- They are different!
- However … (see the sketch below)
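A small sketch of why the two quantities differ, and how two independent next states per (x, a), when available, give an unbiased estimate of the Bellman error; this "double sampling" device is illustrative and not necessarily the slides' fix.

def mean_squared_td_error(deltas):
    # Estimates E[delta^2 | x, a] = (E[delta | x, a])^2 + Var(delta | x, a)  -- biased upwards.
    return sum(d * d for d in deltas) / len(deltas)

def double_sample_bellman_error(delta_pairs):
    # With delta and delta' built from two independent next states for the same (x, a),
    # E[delta * delta' | x, a] = (E[delta | x, a])^2, the squared Bellman error at (x, a).
    return sum(d1 * d2 for d1, d2 in delta_pairs) / len(delta_pairs)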
64. Loss Function
65. Algorithm (BRM)
66. Do We Need to Reweight or Throw Away Data?
- NO!
- WHY?
  - Intuition from regression: m(x) = E[Y | X = x] can be learnt no matter what p(x) is!
  - π(a|x): the same should be possible!
- BUT …
  - performance might be poor! ⇒ YES!
  - like in supervised learning when the training and test distributions are different
67. Bound
68. The Concentration Coefficients
- Lyapunov exponents
- Our case:
  - y_t is infinite dimensional
  - P_t depends on the policy chosen
- If the top Lyapunov exponent is ≤ 0, we are good
69. Open Question
70. Relation to LSTD
AnSzeMu 07
- LSTD
- Linear function space
- Bootstrap the normal equation
71. Open Issues
- Adaptive algorithms to take advantage of regularity when present, to address the curse of dimensionality
  - Penalized least-squares / aggregation?
  - Feature relevance
  - Factorization
  - Manifold estimation
- Abstraction: build automatically
- Active learning
- Optimal on-line learning for infinite problems
72. References
- [Auer et al. 02] P. Auer, N. Cesa-Bianchi and P. Fischer: Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
- [AuSzeMu 07] J.-Y. Audibert, R. Munos and Cs. Szepesvári: Tuning bandit algorithms in stochastic environments. ALT, 2007.
- [Auer, Jaksch & Ortner 07] P. Auer, T. Jaksch and R. Ortner: Near-optimal regret bounds for reinforcement learning. 2007. Available at http://www.unileoben.ac.at/infotech/publications/ucrlrevised.pdf
- [Singh & Sutton 96] S.P. Singh and R.S. Sutton: Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.
- [Sutton 88] R.S. Sutton: Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
- [Jaakkola et al. 94] T. Jaakkola, M.I. Jordan and S.P. Singh: On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185-1201, 1994.
- [Tsitsiklis 94] J.N. Tsitsiklis: Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185-202, 1994.
- [SzeLi 99] Cs. Szepesvári and M.L. Littman: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation, 11:2017-2059, 1999.
- [Watkins 90] C.J.C.H. Watkins: Learning from Delayed Rewards. PhD thesis, 1990.
- [Rummery & Niranjan 94] G.A. Rummery and M. Niranjan: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
- [Sutton 84] R.S. Sutton: Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
- [Tsitsiklis & Van Roy 97] J.N. Tsitsiklis and B. Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674-690, 1997.
- [Sutton 96] R.S. Sutton: Generalization in reinforcement learning: Successful examples using sparse coarse coding. NIPS, 1996.
- [Baird 95] L.C. Baird: Residual algorithms: Reinforcement learning with function approximation. ICML, 1995.
- [Bradtke & Barto 96] S.J. Bradtke and A.G. Barto: Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.
- [Lagoudakis & Parr 03] M. Lagoudakis and R. Parr: Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.
- [AnSzeMu 07] A. Antos, Cs. Szepesvári and R. Munos: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2007.