COMP538 Reinforcement Learning Recent Development


1
COMP538 Reinforcement Learning: Recent Development
  • Group 7
  • Chan Ka Ki (cski@ust.hk)
  • Fung On Tik Andy (cpegandy@ust.hk)
  • Li Yuk Hin (tonyli@ust.hk)

Instructor: Nevin L. Zhang
2
Outline
  • Introduction
  • 3 Solving Methods
  • Main Consideration
  • Exploration vs. Exploitation
  • Directed / Undirected Exploration
  • Function Approximation
  • Planning and Learning
  • Direct RL vs. Indirect RL
  • Dyna-Q and Prioritized Sweeping
  • Conclusion on recent development

3
Introduction
  • Agent interacts with environment
  • Goal-directed learning from interaction

(Diagram: agent interacting with the environment)
4
Key Features
  • Agent is NOT told which actions to take, but learns by itself
  • By trial-and-error
  • From experiences
  • Explore and exploit
  • Exploitation: the agent takes the best action based on its current knowledge
  • Exploration: the agent tries an action other than the current best in order to gain more knowledge

5
Elements of RL
  • Policy: what to do
  • Reward: what is good
  • Value: what is good because it predicts reward
  • Model: what follows what

6
Dynamic Programming
  • Model-based
  • compute optimal policies given a perfect model of
    the environment as a Markov decision process
    (MDP)
  • Bootstrap
  • update estimates based in part on other learned estimates, without waiting for a final outcome (see the value-iteration sketch below)

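A minimal value-iteration sketch of the model-based bootstrapping just described; the dict-based MDP layout, parameter values, and function name are illustrative assumptions, not part of the slides.

def value_iteration(states, actions, P, R, gamma=0.95, theta=1e-6):
    # P[s][a]: list of (probability, next_state); R[s][a]: expected immediate reward.
    # Each sweep bootstraps from the current estimates V, never from complete returns.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                         for a in actions)
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < theta:
            return V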
7
Dynamic Programming
8
Monte Carlo
  • Model-free
  • Does NOT bootstrap
  • Entire episode included
  • Only one choice at each state (unlike DP)
  • Time required to estimate one state does not depend on the total number of states (see the first-visit MC sketch below)

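A first-visit Monte Carlo prediction sketch: it averages complete returns, so nothing is updated until whole episodes are available. The (state, reward) episode format is an assumption for illustration.

import collections

def mc_first_visit_prediction(episodes, gamma=0.95):
    # episodes: list of episodes, each a list of (state, reward) pairs, where
    # reward is the immediate reward received after leaving that state.
    returns = collections.defaultdict(list)   # all sampled returns per state
    for episode in episodes:
        # compute the return following every time step, working backwards
        G = 0.0
        G_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            G_at[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                  # first visit to s in this episode
                seen.add(s)
                returns[s].append(G_at[t])
    # value estimate is the plain average of sampled returns (no bootstrapping)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}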
9
Monte Carlo
10
Temporal Difference
  • Model-free
  • Bootstrap
  • Partial episode included (see the one-step TD(0) sketch below)

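A one-step TD(0) sketch to make bootstrapping on partial episodes concrete; the environment interface (reset/sample_action/step) and parameter values are assumptions for illustration.

def td0_episode(env, V, alpha=0.1, gamma=0.95):
    # V: value estimates, e.g. a collections.defaultdict(float)
    s = env.reset()                          # assumed API: reset() -> initial state
    done = False
    while not done:
        a = env.sample_action(s)             # any behaviour policy
        s_next, r, done = env.step(a)        # assumed API: (next_state, reward, done)
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])      # bootstrap from the estimate V[s_next]
        s = s_next
    return V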
11
Temporal Difference
12
Example Driving home
13
Driving home
  • Changes recommended by Monte Carlo methods
  • Changes recommended by TD methods

14
N-step TD Prediction
  • MC and TD are extreme cases!

15
Averaging N-step Returns
  • n-step methods were introduced to help with understanding TD(λ)
  • Idea: back up an average of several n-step returns
  • e.g., back up half of the 2-step return and half of the 4-step return
  • Called a complex backup
  • Draw each component
  • Label with the weights for that component

16
Forward View of TD(l)
  • TD(λ) is a method for averaging all n-step backups
  • weight the n-step backup by λ^(n-1)
  • λ-return (see the formula below)
  • Backup using the λ-return

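For reference, the λ-return being averaged here, written in the standard Sutton and Barto notation (the slide presented it graphically, so take the exact symbols as the textbook's):

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\qquad
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n})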
17
Forward View of TD(l)
  • Look forward from each state to determine update
    from future states and rewards

18
Backward View of TD(l)
  • The forward view was for theory
  • The backward view is for mechanism
  • New variable called the eligibility trace
  • On each step, decay all traces by γλ and increment the trace for the current state by 1
  • Accumulating trace (see the sketch below)

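A minimal tabular sketch of this backward-view mechanism with accumulating traces; the environment interface and parameter values are illustrative assumptions.

import collections

def td_lambda_episode(env, V, alpha=0.1, gamma=0.95, lam=0.9):
    # V: value estimates, e.g. a collections.defaultdict(float)
    e = collections.defaultdict(float)        # eligibility traces, one per state
    s = env.reset()
    done = False
    while not done:
        a = env.sample_action(s)
        s_next, r, done = env.step(a)         # assumed API: (next_state, reward, done)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        for k in e:                           # decay all traces by gamma*lambda
            e[k] *= gamma * lam
        e[s] += 1.0                           # accumulating trace for the current state
        for k, trace in e.items():            # "shout delta backwards over time"
            V[k] += alpha * delta * trace
        s = s_next
    return V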
19
Backward View
  • Shout δ_t backwards over time
  • The strength of your voice decreases with temporal distance by γλ

20
Forward View = Backward View
  • The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating

21
Adaptive Exploration in Reinforcement Learning
Relu Patrascu, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada (relu@pami.uwaterloo.ca)
Deborah Stacey, Dept. of Computing and Information Science, University of Guelph, Ontario, Canada (dastacey@uoguelph.ca)
22
Objectives
  • Explains the trade-off between exploitation and exploration
  • Introduces two categories of exploration methods
  • Undirected exploration
  • ε-greedy exploration
  • Directed exploration
  • Counter-based exploration
  • Past-success directed exploration
  • Function approximation: backpropagation algorithm and Fuzzy ARTMAP

23
Introduction
  • Main problem: how can the learning process adapt to a non-stationary environment?
  • Sub-problems:
  • How to balance exploitation and exploration when the environment changes?
  • How can the function approximators adapt to the environment?

24
Exploitation and Exploration
  • Exploit or explore?
  • To maximize reward, a learner must exploit the knowledge it already has
  • Exploring an action may give a small immediate reward but yield more reward in the long run
  • An example: choosing a job
  • Suppose you are working at a small company with a 25,000 salary
  • You have another offer from a large enterprise, but it starts at only 12,000
  • Keeping the job at the small company may guarantee a stable income
  • Working at the enterprise may offer more opportunities for promotion, which increases income in the long run

25
Undirected Exploration
  • Undirected exploration
  • Not biased
  • purely random
  • E.g., ε-greedy exploration (sketched below)
  • when it explores, it chooses equally among all actions
  • it is as likely to choose the worst-appearing action as the next-to-best

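A minimal ε-greedy sketch; the exploratory branch chooses uniformly over all actions, which is exactly the unbiased, purely random behaviour described above. The NumPy random generator is just an implementation choice.

import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    # With probability epsilon, explore: pick uniformly among ALL actions, so the
    # worst-looking action is as likely as the next-to-best.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    # Otherwise exploit: take the best action according to current knowledge.
    return int(np.argmax(q_values))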
26
Directed Exploration
  • Directed exploration
  • Memorizes exploration-specific knowledge
  • Biased by some features of the learning process
  • E.g., counter-based techniques (sketched below)
  • Favor the choice of actions resulting in a transition to a state that has not been frequently visited
  • The main idea is to encourage the learner to explore
  • parts of the state space that have not been sampled often
  • parts that have not been sampled recently

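A possible counter-based sketch in the spirit described above: visit counts bias action selection toward actions expected to lead to rarely visited states. The inverse-count bonus and the predicted_next_state helper are illustrative assumptions, not the formula from the paper.

import collections
import numpy as np

visit_count = collections.defaultdict(int)    # how often each state has been visited

def counter_based_action(s, q_values, predicted_next_state, exploration_weight=1.0):
    # Score each action by its value plus a bonus that grows for actions whose
    # predicted successor state has rarely been visited (assumed bonus: 1/(1+count)).
    scores = []
    for a, q in enumerate(q_values):
        s_next = predicted_next_state(s, a)   # hypothetical model of the transition
        bonus = exploration_weight / (1 + visit_count[s_next])
        scores.append(q + bonus)
    return int(np.argmax(scores))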
27
Past-success Directed Exploration
  • Based on ε-greedy exploration
  • The exploration bias is adapted to the environment using feedback from the learning process
  • Increases the exploitation rate if the agent receives reward at an increasing rate
  • Increases the exploration rate when reward stops arriving
  • Average discounted reward
  • Reflects the amount and frequency of received immediate rewards
  • The further back in time a reward was received, the less effect it has on the average

28
Past-Success Directed Exploration
  • Average discounted reward: a γ-discounted average of past rewards
  • Apply it to the ε-greedy algorithm (see the sketch below)

where γ ∈ (0,1] is the discount factor and r_t is the reward received at time t
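A sketch of one plausible reading of the rule: keep a γ-discounted running average of reward and decrease ε while that average is rising (reward arriving at an increasing rate), increase ε when it falls. The recursive form of the average and the ε-adjustment step are assumptions standing in for the equation omitted from the slide.

class PastSuccessEpsilon:
    # Assumed recursive average discounted reward: avg_t = gamma*avg_{t-1} + r_t,
    # so rewards further back in time contribute less to the average.
    def __init__(self, gamma=0.9, eps_min=0.05, eps_max=0.5, step=0.01):
        self.gamma, self.eps_min, self.eps_max, self.step = gamma, eps_min, eps_max, step
        self.avg = 0.0
        self.epsilon = eps_max

    def update(self, r):
        prev_avg = self.avg
        self.avg = self.gamma * self.avg + r
        if self.avg > prev_avg:   # reward arriving at an increasing rate: exploit more
            self.epsilon = max(self.eps_min, self.epsilon - self.step)
        else:                     # reward flat or stopped: explore more
            self.epsilon = min(self.eps_max, self.epsilon + self.step)
        return self.epsilon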
29
Gradient Descent Method
  • Why use a gradient-descent method?
  • Many RL applications use a table to store the value function
  • A large number of states makes this practically impossible
  • Solution: use a function approximator to predict the values
  • Error backpropagation algorithm
  • Catastrophic interference
  • cannot learn incrementally in a non-stationary environment
  • acquiring new knowledge makes it forget much of its previous knowledge

30
Gradient Descent Method
Initialize w arbitrarily and e = 0
Repeat (for each episode):
    Initialize s
    Pass s through each network and obtain Q_a
    a ← argmax_a Q_a
    With probability ε: a ← a random action ∈ A(s)
    Repeat (for each step of episode):
        e ← γλe
        e_a ← e_a + ∇_w Q_a
        Take action a, observe reward r and next state s'
        δ ← r - Q_a
        Pass s' through each network and obtain Q_a'
        a' ← argmax_a' Q_a'
        With probability ε: a' ← a random action ∈ A(s')
        δ ← δ + γ Q_a'
        w ← w + αδe
        a ← a'; s ← s'
    until s is terminal

where a ← argmax_a Q_a means a is set to the action for which the expression is maximal (here, the highest Q); α is a constant step-size parameter, the learning rate; ∇_w Q_a is the partial derivative of Q_a with respect to the weights w; γ is the discount factor; e is the vector of eligibility traces; λ ∈ (0,1] is the eligibility-trace parameter; ε is the exploration probability
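A compact Python sketch of the loop above, using a linear function approximator (one weight vector per action) in place of the paper's backpropagation network or Fuzzy ARTMAP; the environment interface, feature representation, and parameter values are assumptions for illustration.

import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # explore with probability epsilon, otherwise pick the greedy action
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def gradient_sarsa_lambda_episode(env, w, alpha=0.1, gamma=0.95, lam=0.8,
                                  epsilon=0.1, rng=np.random.default_rng(0)):
    # w: weight matrix of shape (n_actions, n_features); Q(s, .) = w @ features(s).
    # For this linear approximator, grad_w Q_a is the feature vector, in row a only.
    e = np.zeros_like(w)                      # eligibility traces, one per weight
    x = env.reset()                           # assumed API: feature vector of start state
    q = w @ x
    a = epsilon_greedy(q, epsilon, rng)
    done = False
    while not done:
        e *= gamma * lam                      # e <- gamma*lambda*e
        e[a] += x                             # e_a <- e_a + grad_w Q_a
        x_next, r, done = env.step(a)         # assumed API: (features, reward, done)
        delta = r - q[a]                      # delta <- r - Q_a
        if not done:
            q_next = w @ x_next
            a_next = epsilon_greedy(q_next, epsilon, rng)
            delta += gamma * q_next[a_next]   # delta <- delta + gamma*Q_a'
        w += alpha * delta * e                # w <- w + alpha*delta*e
        if not done:
            x, q, a = x_next, q_next, a_next
    return w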
31
Fuzzy ARTMAP
  • ARTMAP - Adaptive Resonance Theory mapping between an input vector and an output pattern
  • a neural network specifically designed to deal with the stability/plasticity dilemma
  • This dilemma is that a neural network cannot learn new information without damaging what was learned previously, similar to catastrophic interference

32
Experiments
  • Gridworld with a non-stationary environment
  • Learning agent can move up, down, left or right
  • Two gates; the agent must pass through one of them to get from the start state to the goal state
  • For the first 1000 episodes, gate 1 is open and gate 2 is closed
  • For episodes 1001-5000, gate 1 is closed and gate 2 is open
  • Tests how well the algorithm adapts to the changed environment

33
Results
  • Backpropagation algorithm
  • After the 1000th episode
  • average discounted reward drops rapidly and monotonically
  • surges to maximum exploitation
  • Fuzzy ARTMAP
  • After the 1000th episode
  • reward drops for a few episodes and then returns to high values
  • a temporary surge in exploration

34
Planning and Learning
Objectives
  • Use of environment models
  • Integration of planning and learning methods

35
Models
  • Model: anything the agent can use to predict how the environment will respond to its actions
  • Distribution model: description of all possibilities and their probabilities
  • e.g., the transition probabilities and expected rewards of an MDP
  • Sample model: produces sample experiences
  • e.g., a simulation model, a set of data
  • Both types of models can be used to produce simulated experience
  • Often sample models are much easier to obtain

36
Planning
  • Planning: any computational process that uses a model to create or improve a policy
  • We take the following view:
  • all state-space planning methods involve computing value functions, either explicitly or implicitly
  • they all apply backups to simulated experience

(Diagram: backups applied to simulated experience produced by the model)
37
Learning, Planning, and Acting
  • Two uses of real experience:
  • model learning: to improve the model
  • direct RL: to directly improve the value function and policy
  • Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

38
Direct vs. Indirect RL
  • Indirect methods
  • make fuller use of experience; get a better policy with fewer environment interactions
  • Direct methods
  • simpler
  • not affected by bad models

But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
39
The Dyna-Q Architecture (Sutton 1990)
40
The Dyna-Q Architecture (Sutton 1990)
  • Dyna uses experience to build the model (R, T), uses experience to adjust the policy, and uses the model to adjust the policy
  • For each interaction with the environment, experiencing ⟨s, a, s', r⟩:
  • use experience to adjust the policy:
    Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s', a') - Q(s,a)]
  • use experience to update the model (T, R):
    Model(s, a) ← (s', r)
  • use the model to simulate experience and adjust the policy:
    a ← Rand(a), s ← Rand(s)
    (s', r) ← Model(s, a)
    Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s', a') - Q(s,a)]

41
The Dyna-Q Algorithm
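A minimal tabular Dyna-Q sketch for a deterministic environment, following the three steps on the previous slide (direct RL, model learning, planning); the environment interface and parameter values are illustrative assumptions.

import collections
import random

def dyna_q(env, n_episodes=50, n_planning=5, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = collections.defaultdict(float)        # Q[(state, action)]
    model = {}                                # model[(state, action)] = (next_state, reward)
    actions = env.actions                     # assumed: list of available actions

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)     # assumed API: (next_state, reward, done)
            # (1) direct RL: adjust the policy from the real experience <s, a, s', r>
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, greedy(s_next))] - Q[(s, a)])
            # (2) model learning: remember what followed what
            model[(s, a)] = (s_next, r)
            # (3) planning: N backups on experience simulated from the model
            for _ in range(n_planning):
                (ps, pa), (ps_next, pr) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * Q[(ps_next, greedy(ps_next))]
                                        - Q[(ps, pa)])
            s = s_next
    return Q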
42
Dyna-Q Snapshots: Midway in 2nd Episode
43
Dyna-Q Properties
  • The Dyna algorithm requires about N times the computation of Q-learning per instance
  • But this is typically vastly less than the computation needed by a naïve model-based method
  • N can be determined by the relative speed of computation and of taking actions
  • What if the environment changes?
  • It may change to become harder or easier.

44
Blocking Maze
The changed environment is harder
45
Shortcut Maze
The changed environment is easier
46
What is Dyna-Q+?
  • Uses an exploration bonus
  • Keeps track of the time since each state-action pair was tried for real
  • An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the larger the bonus for visiting (see the sketch below)
  • The agent actually plans how to visit long-unvisited states

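A small sketch of the exploration bonus; the r + κ·sqrt(τ) form follows Sutton and Barto's description of Dyna-Q+, while the slide itself only says that longer-unvisited pairs get a larger bonus, so treat the exact form as an assumption.

import math

def bonus_reward(r, last_tried_step, current_step, kappa=1e-3):
    # tau: time steps since this state-action pair was last tried for real
    tau = current_step - last_tried_step
    # assumed Dyna-Q+ style bonus, applied to the reward used in planning backups
    return r + kappa * math.sqrt(tau)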
47
Prioritized Sweeping
  • The updating of the model is no longer random
  • Instead, store additional information in the model in order to make an appropriate choice of which values to update
  • Store the change of each state value, ΔV(s), and use it to set the priority of the predecessors of s according to the transition probability with which they lead to s

(Diagram: states s1-s5 with value changes Δ = 10 and Δ = 5, ordered from high to low priority)
48
Prioritized Sweeping
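A minimal sketch of the prioritized-sweeping planning step for a deterministic tabular model: the state-action pair just experienced seeds a priority queue keyed on how much its backed-up value would change, and after each backup the predecessors of the updated state are queued in turn. The queue threshold, deterministic model, and data layout are assumptions.

import heapq
import itertools

def prioritized_sweeping_planning(Q, model, predecessors, actions, start_sa,
                                  gamma=0.95, alpha=0.5, theta=1e-4, n_backups=5):
    # Q: value table, e.g. collections.defaultdict(float), keyed by (state, action)
    # model[(s, a)] = (next_state, reward) for a deterministic model
    # predecessors[s]: set of (s_prev, a_prev) pairs known to lead to s
    tie = itertools.count()                    # tie-breaker so the heap never compares states
    pq = []                                    # min-heap of (-priority, tie, (s, a))

    def priority(s, a):
        s_next, r = model[(s, a)]
        best_next = max(Q[(s_next, b)] for b in actions)
        return abs(r + gamma * best_next - Q[(s, a)])   # size of the potential change

    p = priority(*start_sa)
    if p > theta:
        heapq.heappush(pq, (-p, next(tie), start_sa))

    for _ in range(n_backups):
        if not pq:
            break
        _, _, (s, a) = heapq.heappop(pq)
        s_next, r = model[(s, a)]
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # queue the predecessors of s whose own backup would now change enough
        for (sp, ap) in predecessors.get(s, ()):
            p = priority(sp, ap)
            if p > theta:
                heapq.heappush(pq, (-p, next(tie), (sp, ap)))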
49
Prioritized Sweeping vs. Dyna-Q
Both use N = 5 backups per environmental interaction
50
Full and Sample (One-Step) Backups
51
Summary
  • Emphasized the close relationship between planning and learning
  • Important distinction between distribution models and sample models
  • Looked at some ways to integrate planning and learning
  • synergy among planning, acting, and model learning

52
RL Recent Development: Problem Modeling
(Table: problems classified by whether the model of the environment is known or unknown, and whether the state is completely or partially observable)
53
Research topics
  • Exploration-Exploitation tradeoff
  • Problem of delayed reward (credit assignment)
  • Input generalization
  • Function Approximator
  • Multi-Agent Reinforcement Learning
  • Global goal vs Local goal
  • Achieve several goals in parallel
  • Agent cooperation and communication

54
RL Application
TD-Gammon
  • Tesauro 1992, 1994, 1995, ...
  • 30 pieces, 24 locations implies an enormous number of configurations
  • Effective branching factor of 400
  • TD(λ) algorithm
  • Multi-layer neural network
  • Near the level of the world's strongest grandmasters

55
RL Application
  • Elevator Dispatching
  • Crites and Barto 1996

56
RL Application
  • Elevator Dispatching
  • 18 hall call buttons: 2^18 combinations
  • positions and directions of cars: 18^4 (rounding to the nearest floor)
  • motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
  • 40 car buttons: 2^40
  • 18 discretized real numbers are available giving the elapsed time since the hall buttons were pushed
  • The set of passengers riding each car and their destinations is observable only through the car buttons

Conservatively, about 10^22 states
57
RL Application
  • Dynamic Channel Allocation (Singh and Bertsekas 1997)
  • Job-Shop Scheduling (Zhang and Dietterich 1995, 1996)


58
Q & A