Title: COMP538 Reinforcement Learning Recent Development
1 COMP538 Reinforcement Learning: Recent Development
- Group 7
- Chan Ka Ki (cski_at_ust.hk)
- Fung On Tik Andy (cpegandy_at_ust.hk)
- Li Yuk Hin (tonyli_at_ust.hk)
Instructor: Nevin L. Zhang
2 Outline
- Introduction
- Three Solving Methods
- Main Considerations
  - Exploration vs. Exploitation
    - Directed / Undirected Exploration
  - Function Approximation
- Planning and Learning
  - Direct RL vs. Indirect RL
  - Dyna-Q and Prioritized Sweeping
- Conclusion on recent development
3 Introduction
- Agent interacts with environment
- Goal-directed learning from interaction
4 Key Features
- The agent is NOT told which actions to take, but learns by itself
  - by trial and error
  - from experience
- Explore and exploit
  - Exploitation: the agent takes the best action based on its current knowledge
  - Exploration: the agent tries an action other than the current best in order to gain more knowledge
5 Elements of RL
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what
6 Dynamic Programming
- Model-based
  - computes optimal policies given a perfect model of the environment as a Markov decision process (MDP)
- Bootstraps
  - updates estimates based in part on other learned estimates, without waiting for a final outcome
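As a concrete illustration of model-based bootstrapping, here is a minimal value-iteration sketch. The dictionary-based MDP representation and function names are illustrative assumptions, not part of the original slides.

```python
# Minimal value-iteration sketch for a known MDP (illustrative only).
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """P[s][a] is a list of (prob, next_state); R[s][a] is the expected reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bootstrap: back up from current estimates of successor values.
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy with respect to the converged value function.
    pi = {s: max(actions, key=lambda a: R[s][a] +
                 gamma * sum(p * V[s2] for p, s2 in P[s][a]))
          for s in states}
    return V, pi
```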
7 Dynamic Programming
8 Monte Carlo
- Model-free
- Does NOT bootstrap
- Entire episode included
- Only one choice at each state (unlike DP)
- Time required to estimate one state does not depend on the total number of states
9 Monte Carlo
10 Temporal Difference
- Model-free
- Bootstraps
- Partial episode included
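Since TD bootstraps from partial episodes, a single tabular TD(0) prediction update looks like the following sketch; the environment interface (env.reset(), env.step()) is an assumption for illustration.

```python
# Tabular TD(0) prediction sketch; env.step(a) -> (next_state, reward, done)
# is an assumed interface, not from the slides.
from collections import defaultdict

def td0_episode(env, policy, V=None, alpha=0.1, gamma=0.9):
    V = V if V is not None else defaultdict(float)
    s, done = env.reset(), False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        target = r + (0.0 if done else gamma * V[s_next])  # bootstrap from V(s')
        V[s] += alpha * (target - V[s])                    # TD(0) update
        s = s_next
    return V
```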
11 Temporal Difference
12 Example: Driving Home
13 Driving Home
- Changes recommended by Monte Carlo methods
- Changes recommended by TD methods
14 N-step TD Prediction
- MC and TD are extreme cases!
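For reference, the n-step return that interpolates between the one-step TD target (n = 1) and the full Monte Carlo return (n reaching the end of the episode) is the standard one:

```latex
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V_t(s_{t+n})
```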
15 Averaging N-step Returns
- n-step methods were introduced to help with understanding TD(λ)
- Idea: back up an average of several returns
  - e.g. back up half of the 2-step return and half of the 4-step return
- Called a complex backup
  - Draw each component
  - Label it with the weight for that component
16 Forward View of TD(λ)
- TD(λ) is a method for averaging all n-step backups
  - weight by λ^(n-1) (time since visitation)
- λ-return
- Backup using the λ-return
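The λ-return (shown only as a figure in the original slides) is the standard weighted average of n-step returns, with weights proportional to λ^(n-1):

```latex
R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_t^{(n)}
```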
17 Forward View of TD(λ)
- Look forward from each state to determine the update from future states and rewards
18 Backward View of TD(λ)
- The forward view was for theory
- The backward view is for mechanism
- New variable called the eligibility trace
  - On each step, decay all traces by γλ and increment the trace for the current state by 1
  - Accumulating trace
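A minimal tabular TD(λ) sketch with accumulating traces; the decay-by-γλ and increment-by-1 steps mirror the bullets above, while the environment interface is an assumption for illustration.

```python
# Tabular TD(lambda) prediction with accumulating eligibility traces.
# env.step(a) -> (next_state, reward, done) is an assumed interface.
from collections import defaultdict

def td_lambda_episode(env, policy, V=None, alpha=0.1, gamma=0.9, lam=0.8):
    V = V if V is not None else defaultdict(float)
    e = defaultdict(float)                     # eligibility traces
    s, done = env.reset(), False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e[s] += 1.0                            # accumulating trace: increment current state
        for state in list(e.keys()):
            V[state] += alpha * delta * e[state]  # "shout" delta back along the traces
            e[state] *= gamma * lam               # decay all traces by gamma * lambda
        s = s_next
    return V
```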
19 Backward View
- Shout δ_t backwards over time
- The strength of your voice decreases with temporal distance by γλ
20 Forward View = Backward View
- The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
21 Adaptive Exploration in Reinforcement Learning
Relu Patrascu, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada (relu_at_pami.uwaterloo.ca)
Deborah Stacey, Dept. of Computing and Information Science, University of Guelph, Ontario, Canada (dastacey_at_uoguelph.ca)
22 Objectives
- Explains the trade-off between exploitation and exploration
- Introduces two categories of exploration methods
  - Undirected exploration
    - ε-greedy exploration
  - Directed exploration
    - Counter-based exploration
    - Past-success directed exploration
- Function approximation: the backpropagation algorithm and Fuzzy ARTMAP
23 Introduction
- Main problem: how to make the learning process adapt to a non-stationary environment?
- Sub-problems
  - How to balance exploitation and exploration when the environment changes?
  - How can the function approximators adapt to the environment?
24 Exploitation and Exploration
- Exploit or explore?
  - To maximize reward, a learner must exploit the knowledge it already has
  - Exploring an action with a small immediate reward may yield more reward in the long run
- An example: choosing a job
  - Suppose you are working at a small company with a 25,000 salary
  - You have another offer from a large enterprise, but it only starts at 12,000
  - Keeping the job at the small company guarantees a stable income
  - Working at the enterprise may offer more opportunities for promotion, which would increase your income in the long run
25 Undirected Exploration
- Undirected exploration
  - Not biased
  - Purely random
  - E.g. ε-greedy exploration (see the code sketch below)
    - When it explores, it chooses equally among all actions
    - It is as likely to choose the worst-appearing action as it is to choose the next-to-best
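A minimal ε-greedy selection sketch: during exploration it samples uniformly over all actions, so the worst-looking action is chosen as readily as the second-best. The Q-table layout is an assumption for illustration.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Q[(state, action)] holds current action-value estimates (e.g. a defaultdict)."""
    if random.random() < epsilon:
        return random.choice(actions)                 # undirected: purely random
    return max(actions, key=lambda a: Q[(state, a)])  # exploit current knowledge
```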
26 Directed Exploration
- Directed exploration
  - Memorizes exploration-specific knowledge
  - Is biased by features of the learning process
  - E.g. counter-based techniques
    - Favor the choice of actions resulting in a transition to a state that has not been frequently visited
- The main idea is to encourage the learner to explore
  - parts of the state space that have not been sampled often
  - parts that have not been sampled recently
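A sketch of a simple counter-based rule in the spirit described above: visit counts bias action selection toward rarely tried state-action pairs. The exact scoring rule is an assumption; the paper's formulation may differ.

```python
from collections import defaultdict

def counter_based_action(Q, counts, state, actions, bonus_weight=1.0):
    """counts[(state, action)] = number of times the pair has been tried (assumed layout).
    Adds a bonus that shrinks as a pair is visited more often."""
    def score(a):
        n = counts[(state, a)]
        return Q[(state, a)] + bonus_weight / (1.0 + n)  # favor rarely tried actions
    a = max(actions, key=score)
    counts[(state, a)] += 1
    return a
```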
27 Past-Success Directed Exploration
- Based on ε-greedy exploration
- The bias adapts to the environment using feedback from the learning process
  - Increase the exploitation rate if reward is received at an increasing rate
  - Increase the exploration rate when reward stops being received
- Average discounted reward
  - Reflects the amount and frequency of received immediate rewards
  - The further back in time a reward was received, the less effect it has on the average
28 Past-Success Directed Exploration
- The average discounted reward is defined as a discounted sum of past rewards, where γ ∈ (0,1] is the discount factor and r_t is the reward received at time t
- It is applied to the ε-greedy algorithm to adapt the exploration rate
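The paper's exact update equations are not reproduced in the slide text, so the following is only a sketch under assumed definitions: an exponentially discounted running average of reward that raises ε when reward stops arriving and lowers it when reward arrives at an increasing rate. The class name and step sizes are hypothetical.

```python
# Illustrative sketch only: the discounted-average and epsilon-update rules
# below are assumptions, not the paper's exact formulas.
class PastSuccessExploration:
    def __init__(self, gamma=0.95, eps_min=0.01, eps_max=0.5, step=0.01):
        self.gamma = gamma        # discount for the running average of reward
        self.avg_reward = 0.0     # average discounted reward
        self.prev_avg = 0.0
        self.epsilon = eps_max
        self.eps_min, self.eps_max, self.step = eps_min, eps_max, step

    def update(self, r):
        self.prev_avg = self.avg_reward
        # Older rewards contribute less: discounted running average.
        self.avg_reward = self.gamma * self.avg_reward + r
        if self.avg_reward > self.prev_avg:
            self.epsilon = max(self.eps_min, self.epsilon - self.step)  # exploit more
        else:
            self.epsilon = min(self.eps_max, self.epsilon + self.step)  # explore more
        return self.epsilon
```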
29 Gradient Descent Method
- Why use a gradient descent method?
  - RL applications use a table to store the value function
  - A large number of states makes this practically impossible
  - Solution: use a function approximator to predict the values
- Error backpropagation algorithm
  - Suffers from catastrophic interference
    - it cannot learn incrementally in a non-stationary environment
    - acquiring new knowledge makes it forget much of its previous knowledge
30 Gradient Descent Method

Initialize w arbitrarily and e = 0
Repeat (for each episode):
    Initialize s
    Pass s through each network and obtain Q_a for all a
    a ← arg max_a Q_a
    With probability ε: a ← a random action ∈ A(s)
    Repeat (for each step of episode):
        e ← γλ e
        e_a ← e_a + ∇_w Q_a
        Take action a, observe reward r and next state s'
        δ ← r - Q_a
        Pass s' through each network and obtain Q_a' for all a'
        a' ← arg max_a' Q_a'
        With probability ε: a' ← a random action ∈ A(s')
        δ ← δ + γ Q_a'
        w ← w + α δ e
        a ← a'
    until s' is terminal

where
- a ← arg max_a Q_a means a is set to the action for which the expression is maximal, in this case the highest Q
- α is a constant step-size parameter, the learning rate
- ∇_w Q_a is the partial derivative of Q_a with respect to the weights w
- γ is the discount factor
- e is the vector of eligibility traces
- λ ∈ (0, 1] is the eligibility trace parameter
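A compact runnable counterpart of the pseudocode above, using a linear approximator with one weight vector per action in place of the backpropagation networks described in the paper. The feature function phi and the environment interface are assumptions for illustration.

```python
import numpy as np

def sarsa_lambda_fa(env, phi, n_features, n_actions, episodes=100,
                    alpha=0.01, gamma=0.99, lam=0.9, epsilon=0.1):
    """Gradient-descent Sarsa(lambda) with linear Q(s, a) = w[a] . phi(s).
    phi(s) -> feature vector; env.reset()/env.step(a) are assumed interfaces."""
    w = np.zeros((n_actions, n_features))
    for _ in range(episodes):
        e = np.zeros_like(w)                  # eligibility traces
        s, done = env.reset(), False
        x = phi(s)
        q = w @ x
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(q.argmax())
        while not done:
            e *= gamma * lam                  # decay all traces
            e[a] += x                         # gradient of Q_a w.r.t. w[a] is phi(s)
            s_next, r, done = env.step(a)
            delta = r - q[a]
            if not done:
                x = phi(s_next)
                q = w @ x
                a_next = (np.random.randint(n_actions) if np.random.rand() < epsilon
                          else int(q.argmax()))
                delta += gamma * q[a_next]
                a = a_next
            w += alpha * delta * e            # gradient-descent weight update
    return w
```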
31 Fuzzy ARTMAP
- ARTMAP: Adaptive Resonance Theory mapping between an input vector and an output pattern
  - a neural network specifically designed to deal with the stability/plasticity dilemma
- This dilemma means a neural network is unable to learn new information without damaging what was learned previously, similar to catastrophic interference
32 Experiments
- Gridworld with a non-stationary environment
  - The learning agent can move up, down, left or right
  - There are two gates; the agent must pass through one of them to get from the start state to the goal state
  - For the first 1000 episodes, gate 1 is open and gate 2 is closed
  - For episodes 1001-5000, gate 1 is closed and gate 2 is open
- Tests how well the algorithm adapts to the changed environment
33 Results
- Backpropagation algorithm
  - After the 1000th episode
    - the average discounted reward drops rapidly and monotonically
    - it surges to maximum exploitation
- Fuzzy ARTMAP
  - After the 1000th episode
    - the reward drops for a few episodes and then returns to high values
    - there is a temporary surge in exploration
34 Planning and Learning
Objectives
- Use of environment models
- Integration of planning and learning methods
35 Models
- Model: anything the agent can use to predict how the environment will respond to its actions
- Distribution model: a description of all possibilities and their probabilities
  - e.g.,
- Sample model: produces sample experiences
  - e.g., a simulation model, a set of data
- Both types of models can be used to produce simulated experience
- Often sample models are much easier to obtain
36 Planning
- Planning: any computational process that uses a model to create or improve a policy
- We take the following view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly
  - they all apply backups to simulated experience
37 Learning, Planning, and Acting
- Two uses of real experience
  - model learning: to improve the model
  - direct RL: to directly improve the value function and policy
- Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning.
38 Direct vs. Indirect RL
- Indirect methods
  - make fuller use of experience: get a better policy with fewer environment interactions
- Direct methods
  - simpler
  - not affected by bad models
But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
39 The Dyna-Q Architecture (Sutton 1990)
40 The Dyna-Q Architecture (Sutton 1990)
- Dyna uses experience to build the model (R, T), uses experience to adjust the policy, and uses the model to adjust the policy
- For each interaction with the environment, experiencing <s, a, s', r> (a code sketch follows below):
  - use experience to adjust the policy:
    Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
  - use experience to update the model (T, R):
    Model(s,a) ← (s', r)
  - use the model to simulate experience and adjust the policy:
    s ← Rand(s), a ← Rand(a)  (a random previously observed state-action pair)
    (s', r) ← Model(s, a)
    Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
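A minimal tabular Dyna-Q sketch following the three steps above (direct RL update, model update, and N simulated planning updates). The environment interface is an assumption for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=50, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q; env.reset()/env.step(a) -> (s', r, done) is an assumed interface."""
    Q = defaultdict(float)
    model = {}                                    # Model(s, a) -> (s', r)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda x: Q[(s, x)]))
            s_next, r, done = env.step(a)
            # (a) direct RL: Q-learning update from real experience
            target = r + (0.0 if done else gamma * max(Q[(s_next, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning
            model[(s, a)] = (s_next, r)
            # (c) planning: N simulated one-step updates from the model
            for _ in range(n_planning):
                (ps, pa), (ps_next, pr) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```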
41 The Dyna-Q Algorithm
42 Dyna-Q Snapshots: Midway in the 2nd Episode
43 Dyna-Q Properties
- The Dyna algorithm requires about N times the computation of Q-learning per instance
- But this is typically vastly less than that of a naïve model-based method
- N can be determined by the relative speed of computation and of taking actions
- What if the environment changes?
  - It may change to become harder or easier
44 Blocking Maze
The changed environment is harder
45 Shortcut Maze
The changed environment is easier
46 What is Dyna-Q+?
- Uses an exploration bonus
  - Keeps track of the time since each state-action pair was tried for real
  - An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
  - The agent actually plans how to visit long-unvisited states
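In Sutton and Barto's description of this exploration bonus, the reward used in planning backups takes the form below, where r is the modeled reward, τ is the number of time steps since the state-action pair was last tried for real, and κ is a small scaling constant:

```latex
r + \kappa \sqrt{\tau}
```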
47 Prioritized Sweeping
- The updating of the model is no longer random
- Instead, additional information is stored in the model in order to make an appropriate choice of which update to perform
- Store the change of each state value ΔV(s), and use it to modify the priority of the predecessors of s according to their transition probabilities T(s,a,s')
(Figure: states s1-s5 with value changes Δ = 10 and Δ = 5; predecessors are queued by priority from high to low)
48 Prioritized Sweeping
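A sketch of the prioritized-sweeping planning loop, using a priority queue keyed by the magnitude of the predicted value change. The tabular model, predecessor bookkeeping, and queue handling are assumptions for illustration.

```python
import heapq

def prioritized_sweeping_planning(Q, model, predecessors, actions,
                                  pqueue, n_updates=5, alpha=0.1,
                                  gamma=0.95, theta=1e-4):
    """pqueue holds (-priority, s, a); model[(s, a)] = (s', r);
    predecessors[s'] = set of (s, a) pairs known to lead to s' (assumed layout)."""
    for _ in range(n_updates):
        if not pqueue:
            break
        _, s, a = heapq.heappop(pqueue)           # highest-priority pair first
        s_next, r = model[(s, a)]
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        # Push predecessors of s whose values would change noticeably.
        for (ps, pa) in predecessors.get(s, ()):
            _, pr = model[(ps, pa)]
            p = abs(pr + gamma * max(Q[(s, b)] for b in actions) - Q[(ps, pa)])
            if p > theta:
                heapq.heappush(pqueue, (-p, ps, pa))
    return Q
```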
49 Prioritized Sweeping vs. Dyna-Q
Both use N = 5 backups per environmental interaction
50 Full and Sample (One-Step) Backups
51 Summary
- Emphasized the close relationship between planning and learning
- Important distinction between distribution models and sample models
- Looked at some ways to integrate planning and learning
  - synergy among planning, acting, and model learning
52 RL Recent Development: Problem Modeling
Problems are classified along two axes:
- Model of the environment: known or unknown
- State: completely observable or partially observable
53 Research Topics
- Exploration-exploitation trade-off
- Problem of delayed reward (credit assignment)
- Input generalization
  - Function approximators
- Multi-agent reinforcement learning
  - Global goal vs. local goals
  - Achieving several goals in parallel
  - Agent cooperation and communication
54 RL Application: TD-Gammon
- Tesauro 1992, 1994, 1995, ...
- 30 pieces and 24 locations imply an enormous number of configurations
- Effective branching factor of 400
- TD(λ) algorithm
- Multi-layer neural network
- Near the level of the world's strongest grandmasters
55 RL Application: Elevator Dispatching
- Crites and Barto 1996
56 RL Application: Elevator Dispatching
- 18 hall call buttons: 2^18 combinations
- positions and directions of cars: 18^4 (rounding to the nearest floor)
- motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
- 40 car buttons: 2^40
- 18 discretized real numbers giving the elapsed time since each hall button was pushed
- The set of passengers riding each car and their destinations is observable only through the car buttons
Conservatively about 10^22 states
57 RL Application
- Dynamic Channel Allocation: Singh and Bertsekas 1997
- Job-Shop Scheduling: Zhang and Dietterich 1995, 1996
58 Q & A