COMP538 Reinforcement Learning Recent Development


1
COMP538 Reinforcement Learning: Recent Development
  • Group 7
  • Chan Ka Ki (cski@ust.hk)
  • Fung On Tik Andy (cpegandy@ust.hk)
  • Li Yuk Hin (tonyli@ust.hk)

Instructor: Nevin L. Zhang
2
Outline
  • Introduction
  • 3 Solving Methods
  • Main Consideration
  • Exploration vs. Exploitation
  • Directed / Undirected Exploration
  • Function Approximation
  • Planning and Learning
  • Direct RL vs. Indirect RL
  • Dyna-Q and Prioritized Sweeping
  • Conclusion on recent development

3
Introduction
  • Agent interacts with environment
  • Goal-directed learning from interaction

(Diagram: agent interacting with the environment)
4
Key Features
  • Agent is NOT told which actions to take, but learns by itself
  • By trial-and-error
  • From experiences
  • Explore and exploit
  • Exploitation: the agent takes the best action based on its current knowledge
  • Exploration: the agent tries an action other than the current best in order to gain more knowledge

5
Elements of RL
  • Policy: what to do
  • Reward: what is good
  • Value: what is good because it predicts reward
  • Model: what follows what

6
Dynamic Programming
  • Model-based
  • compute optimal policies given a perfect model of
    the environment as a Markov decision process
    (MDP)
  • Bootstrap
  • update estimates based in part on other learned estimates, without waiting for a final outcome (see the value-iteration sketch below)

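A minimal value-iteration sketch of the model-based bootstrapping just described; the dict-based MDP layout, parameter values, and function name are illustrative assumptions, not part of the slides.

def value_iteration(states, actions, P, R, gamma=0.95, theta=1e-6):
    # P[s][a]: list of (probability, next_state); R[s][a]: expected immediate reward.
    # Each sweep bootstraps from the current estimates V, never from complete returns.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                         for a in actions)
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < theta:
            return V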
7
Dynamic Programming
8
Monte Carlo
  • Model-free
  • Does NOT bootstrap
  • Entire episode included
  • Only one choice at each state (unlike DP)
  • Time required to estimate one state does not depend on the total number of states (see the first-visit MC sketch below)

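A first-visit Monte Carlo prediction sketch: it averages complete returns, so nothing is updated until whole episodes are available. The (state, reward) episode format is an assumption for illustration.

import collections

def mc_first_visit_prediction(episodes, gamma=0.95):
    # episodes: list of episodes, each a list of (state, reward) pairs, where
    # reward is the immediate reward received after leaving that state.
    returns = collections.defaultdict(list)   # all sampled returns per state
    for episode in episodes:
        # compute the return following every time step, working backwards
        G = 0.0
        G_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            G_at[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                  # first visit to s in this episode
                seen.add(s)
                returns[s].append(G_at[t])
    # value estimate is the plain average of sampled returns (no bootstrapping)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}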
9
Monte Carlo
10
Temporal Difference
  • Model-free
  • Bootstrap
  • Partial episode included (see the one-step TD(0) sketch below)

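A one-step TD(0) sketch to make bootstrapping on partial episodes concrete; the environment interface (reset/sample_action/step) and parameter values are assumptions for illustration.

def td0_episode(env, V, alpha=0.1, gamma=0.95):
    # V: value estimates, e.g. a collections.defaultdict(float)
    s = env.reset()                          # assumed API: reset() -> initial state
    done = False
    while not done:
        a = env.sample_action(s)             # any behaviour policy
        s_next, r, done = env.step(a)        # assumed API: (next_state, reward, done)
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])      # bootstrap from the estimate V[s_next]
        s = s_next
    return V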
11
Temporal Difference
12
Example Driving home
13
Driving home
  • Changes recommended by Monte Carlo methods
  • Changes recommended by TD methods

14
N-step TD Prediction
  • MC and TD are extreme cases!

15
Averaging N-step Returns
  • n-step methods were introduced to help with understanding TD(λ)
  • Idea: back up an average of several n-step returns
  • e.g., back up half of the 2-step return and half of the 4-step return
  • Called a complex backup
  • Draw each component
  • Label with the weights for that component

16
Forward View of TD(l)
  • TD(λ) is a method for averaging all n-step backups
  • weight the n-step backup by λ^(n-1)
  • λ-return (see the formula below)
  • Backup using the λ-return

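For reference, the λ-return being averaged here, written in the standard Sutton and Barto notation (the slide presented it graphically, so take the exact symbols as the textbook's):

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\qquad
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n})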
17
Forward View of TD(l)
  • Look forward from each state to determine update
    from future states and rewards

18
Backward View of TD(l)
  • The forward view was for theory
  • The backward view is for mechanism
  • New variable called the eligibility trace
  • On each step, decay all traces by γλ and increment the trace for the current state by 1
  • Accumulating trace (see the sketch below)

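A minimal tabular sketch of this backward-view mechanism with accumulating traces; the environment interface and parameter values are illustrative assumptions.

import collections

def td_lambda_episode(env, V, alpha=0.1, gamma=0.95, lam=0.9):
    # V: value estimates, e.g. a collections.defaultdict(float)
    e = collections.defaultdict(float)        # eligibility traces, one per state
    s = env.reset()
    done = False
    while not done:
        a = env.sample_action(s)
        s_next, r, done = env.step(a)         # assumed API: (next_state, reward, done)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        for k in e:                           # decay all traces by gamma*lambda
            e[k] *= gamma * lam
        e[s] += 1.0                           # accumulating trace for the current state
        for k, trace in e.items():            # "shout delta backwards over time"
            V[k] += alpha * delta * trace
        s = s_next
    return V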
19
Backward View
  • Shout δ_t backwards over time
  • The strength of your voice decreases with temporal distance by γλ

20
Forward View = Backward View
  • The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating

21
Adaptive Exploration in Reinforcement Learning
Relu Patrascu, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada (relu@pami.uwaterloo.ca)
Deborah Stacey, Dept. of Computing and Information Science, University of Guelph, Ontario, Canada (dastacey@uoguelph.ca)
22
Objectives
  • Explains the trade-off between exploitation and exploration
  • Introduces two categories of exploration methods
  • Undirected exploration
  • ε-greedy exploration
  • Directed exploration
  • Counter-based exploration
  • Past-success directed exploration
  • Function approximation: backpropagation algorithm and Fuzzy ARTMAP

23
Introduction
  • Main problem: how can the learning process adapt to a non-stationary environment?
  • Sub-problems:
  • How to balance exploitation and exploration when the environment changes?
  • How can the function approximators adapt to the environment?

24
Exploitation and Exploration
  • Exploit or explore?
  • To maximize reward, a learner must exploit the knowledge it already has
  • Exploring an action may give a small immediate reward but yield more reward in the long run
  • An example: choosing a job
  • Suppose you are working at a small company with a 25,000 salary
  • You have another offer from a large enterprise, but it starts at only 12,000
  • Keeping the job at the small company may guarantee a stable income
  • Working at the enterprise may offer more opportunities for promotion, which increases income in the long run

25
Undirected Exploration
  • Undirected exploration
  • Not biased
  • purely random
  • E.g., ε-greedy exploration (sketched below)
  • when it explores, it chooses equally among all actions
  • it is as likely to choose the worst-appearing action as the next-to-best

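A minimal ε-greedy sketch; the exploratory branch chooses uniformly over all actions, which is exactly the unbiased, purely random behaviour described above. The NumPy random generator is just an implementation choice.

import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    # With probability epsilon, explore: pick uniformly among ALL actions, so the
    # worst-looking action is as likely as the next-to-best.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    # Otherwise exploit: take the best action according to current knowledge.
    return int(np.argmax(q_values))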
26
Directed Exploration
  • Directed exploration
  • Memorizes exploration-specific knowledge
  • Biased by some features of the learning process
  • E.g., counter-based techniques (sketched below)
  • Favor the choice of actions resulting in a transition to a state that has not been frequently visited
  • The main idea is to encourage the learner to explore
  • parts of the state space that have not been sampled often
  • parts that have not been sampled recently

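A possible counter-based sketch in the spirit described above: visit counts bias action selection toward actions expected to lead to rarely visited states. The inverse-count bonus and the predicted_next_state helper are illustrative assumptions, not the formula from the paper.

import collections
import numpy as np

visit_count = collections.defaultdict(int)    # how often each state has been visited

def counter_based_action(s, q_values, predicted_next_state, exploration_weight=1.0):
    # Score each action by its value plus a bonus that grows for actions whose
    # predicted successor state has rarely been visited (assumed bonus: 1/(1+count)).
    scores = []
    for a, q in enumerate(q_values):
        s_next = predicted_next_state(s, a)   # hypothetical model of the transition
        bonus = exploration_weight / (1 + visit_count[s_next])
        scores.append(q + bonus)
    return int(np.argmax(scores))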
27
Past-success Directed Exploration
  • Based on ε-greedy exploration
  • The exploration bias is adapted to the environment using feedback from the learning process
  • Increases the exploitation rate if the agent receives reward at an increasing rate
  • Increases the exploration rate when reward stops arriving
  • Average discounted reward
  • Reflects the amount and frequency of received immediate rewards
  • The further back in time a reward was received, the less effect it has on the average

28
Past-Success Directed Exploration
  • Average discounted reward: a γ-discounted average of past rewards
  • Apply it to the ε-greedy algorithm (see the sketch below)

where γ ∈ (0,1] is the discount factor and r_t is the reward received at time t
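A sketch of one plausible reading of the rule: keep a γ-discounted running average of reward and decrease ε while that average is rising (reward arriving at an increasing rate), increase ε when it falls. The recursive form of the average and the ε-adjustment step are assumptions standing in for the equation omitted from the slide.

class PastSuccessEpsilon:
    # Assumed recursive average discounted reward: avg_t = gamma*avg_{t-1} + r_t,
    # so rewards further back in time contribute less to the average.
    def __init__(self, gamma=0.9, eps_min=0.05, eps_max=0.5, step=0.01):
        self.gamma, self.eps_min, self.eps_max, self.step = gamma, eps_min, eps_max, step
        self.avg = 0.0
        self.epsilon = eps_max

    def update(self, r):
        prev_avg = self.avg
        self.avg = self.gamma * self.avg + r
        if self.avg > prev_avg:   # reward arriving at an increasing rate: exploit more
            self.epsilon = max(self.eps_min, self.epsilon - self.step)
        else:                     # reward flat or stopped: explore more
            self.epsilon = min(self.eps_max, self.epsilon + self.step)
        return self.epsilon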
29
Gradient Descent Method
  • Why use a gradient-descent method?
  • Many RL applications use a table to store the value function
  • A large number of states makes this practically impossible
  • Solution: use a function approximator to predict the values
  • Error backpropagation algorithm
  • Catastrophic interference
  • cannot learn incrementally in a non-stationary environment
  • acquiring new knowledge makes it forget much of its previous knowledge

30
Gradient Descent Method
Initialize w arbitrarily and e = 0
Repeat (for each episode):
    Initialize s
    Pass s through each network and obtain Q_a
    a ← argmax_a Q_a
    With probability ε: a ← a random action ∈ A(s)
    Repeat (for each step of episode):
        e ← γλe
        e_a ← e_a + ∇_w Q_a
        Take action a, observe reward r and next state s'
        δ ← r - Q_a
        Pass s' through each network and obtain Q_a'
        a' ← argmax_a' Q_a'
        With probability ε: a' ← a random action ∈ A(s')
        δ ← δ + γ Q_a'
        w ← w + αδe
        a ← a'; s ← s'
    until s is terminal

where a ← argmax_a Q_a means a is set to the action for which the expression is maximal (here, the highest Q); α is a constant step-size parameter, the learning rate; ∇_w Q_a is the partial derivative of Q_a with respect to the weights w; γ is the discount factor; e is the vector of eligibility traces; λ ∈ (0,1] is the eligibility-trace parameter; ε is the exploration probability
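A compact Python sketch of the loop above, using a linear function approximator (one weight vector per action) in place of the paper's backpropagation network or Fuzzy ARTMAP; the environment interface, feature representation, and parameter values are assumptions for illustration.

import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # explore with probability epsilon, otherwise pick the greedy action
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def gradient_sarsa_lambda_episode(env, w, alpha=0.1, gamma=0.95, lam=0.8,
                                  epsilon=0.1, rng=np.random.default_rng(0)):
    # w: weight matrix of shape (n_actions, n_features); Q(s, .) = w @ features(s).
    # For this linear approximator, grad_w Q_a is the feature vector, in row a only.
    e = np.zeros_like(w)                      # eligibility traces, one per weight
    x = env.reset()                           # assumed API: feature vector of start state
    q = w @ x
    a = epsilon_greedy(q, epsilon, rng)
    done = False
    while not done:
        e *= gamma * lam                      # e <- gamma*lambda*e
        e[a] += x                             # e_a <- e_a + grad_w Q_a
        x_next, r, done = env.step(a)         # assumed API: (features, reward, done)
        delta = r - q[a]                      # delta <- r - Q_a
        if not done:
            q_next = w @ x_next
            a_next = epsilon_greedy(q_next, epsilon, rng)
            delta += gamma * q_next[a_next]   # delta <- delta + gamma*Q_a'
        w += alpha * delta * e                # w <- w + alpha*delta*e
        if not done:
            x, q, a = x_next, q_next, a_next
    return w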
31
Fuzzy ARTMAP
  • ARTMAP - Adaptive Resonance Theory mapping between an input vector and an output pattern
  • a neural network specifically designed to deal with the stability/plasticity dilemma
  • This dilemma is that a neural network cannot learn new information without damaging what was learned previously, similar to catastrophic interference

32
Experiments
  • Gridworld with a non-stationary environment
  • Learning agent can move up, down, left or right
  • Two gates; the agent must pass through one of them to get from the start state to the goal state
  • For the first 1000 episodes, gate 1 is open and gate 2 is closed
  • For episodes 1001-5000, gate 1 is closed and gate 2 is open
  • Tests how well the algorithm adapts to the changed environment

33
Results
  • Backpropagation algorithm
  • After the 1000th episode
  • average discounted reward drops rapidly and monotonically
  • surges to maximum exploitation
  • Fuzzy ARTMAP
  • After the 1000th episode
  • reward drops for a few episodes and then returns to high values
  • a temporary surge in exploration

34
Planning and Learning
Objectives
  • Use of environment models
  • Integration of planning and learning methods

35
Models
  • Model: anything the agent can use to predict how the environment will respond to its actions
  • Distribution model: description of all possibilities and their probabilities
  • e.g., the transition probabilities and expected rewards of an MDP
  • Sample model: produces sample experiences
  • e.g., a simulation model, a set of data
  • Both types of models can be used to produce simulated experience
  • Often sample models are much easier to obtain

36
Planning
  • Planning: any computational process that uses a model to create or improve a policy
  • We take the following view:
  • all state-space planning methods involve computing value functions, either explicitly or implicitly
  • they all apply backups to simulated experience

(Diagram: backups applied to simulated experience produced by the model)
37
Learning, Planning, and Acting
  • Two uses of real experience:
  • model learning: to improve the model
  • direct RL: to directly improve the value function and policy
  • Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

38
Direct vs. Indirect RL
  • Indirect methods
  • make fuller use of experience; get a better policy with fewer environment interactions
  • Direct methods
  • simpler
  • not affected by bad models

But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
39
The Dyna-Q Architecture (Sutton 1990)
40
The Dyna-Q Architecture (Sutton 1990)
  • Dyna uses experience to build the model (R, T), uses experience to adjust the policy, and uses the model to adjust the policy
  • For each interaction with the environment, experiencing ⟨s, a, s', r⟩:
  • use experience to adjust the policy:
    Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s', a') - Q(s,a)]
  • use experience to update the model (T, R):
    Model(s, a) ← (s', r)
  • use the model to simulate experience and adjust the policy:
    a ← Rand(a), s ← Rand(s)
    (s', r) ← Model(s, a)
    Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s', a') - Q(s,a)]

41
The Dyna-Q Algorithm
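A minimal tabular Dyna-Q sketch for a deterministic environment, following the three steps on the previous slide (direct RL, model learning, planning); the environment interface and parameter values are illustrative assumptions.

import collections
import random

def dyna_q(env, n_episodes=50, n_planning=5, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = collections.defaultdict(float)        # Q[(state, action)]
    model = {}                                # model[(state, action)] = (next_state, reward)
    actions = env.actions                     # assumed: list of available actions

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)     # assumed API: (next_state, reward, done)
            # (1) direct RL: adjust the policy from the real experience <s, a, s', r>
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, greedy(s_next))] - Q[(s, a)])
            # (2) model learning: remember what followed what
            model[(s, a)] = (s_next, r)
            # (3) planning: N backups on experience simulated from the model
            for _ in range(n_planning):
                (ps, pa), (ps_next, pr) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * Q[(ps_next, greedy(ps_next))]
                                        - Q[(ps, pa)])
            s = s_next
    return Q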
42
Dyna-Q Snapshots: Midway in 2nd Episode
43
Dyna-Q Properties
  • The Dyna algorithm requires about N times the computation of Q-learning per instance
  • But this is typically vastly less than the computation needed by a naïve model-based method
  • N can be determined by the relative speed of computation and of taking actions
  • What if the environment changes?
  • It may change to become harder or easier.

44
Blocking Maze
The changed environment is harder
45
Shortcut Maze
The changed environment is easier
46
What is Dyna-Q+?
  • Uses an exploration bonus
  • Keeps track of the time since each state-action pair was tried for real
  • An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the larger the bonus for visiting (see the sketch below)
  • The agent actually plans how to visit long-unvisited states

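A small sketch of the exploration bonus; the r + κ·sqrt(τ) form follows Sutton and Barto's description of Dyna-Q+, while the slide itself only says that longer-unvisited pairs get a larger bonus, so treat the exact form as an assumption.

import math

def bonus_reward(r, last_tried_step, current_step, kappa=1e-3):
    # tau: time steps since this state-action pair was last tried for real
    tau = current_step - last_tried_step
    # assumed Dyna-Q+ style bonus, applied to the reward used in planning backups
    return r + kappa * math.sqrt(tau)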
47
Prioritized Sweeping
  • The updating of the model is no longer random
  • Instead, store additional information in the model in order to make an appropriate choice of which values to update
  • Store the change of each state value, ΔV(s), and use it to set the priority of the predecessors of s according to the transition probability with which they lead to s

(Diagram: states s1-s5 with value changes Δ = 10 and Δ = 5, ordered from high to low priority)
48
Prioritized Sweeping
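A minimal sketch of the prioritized-sweeping planning step for a deterministic tabular model: the state-action pair just experienced seeds a priority queue keyed on how much its backed-up value would change, and after each backup the predecessors of the updated state are queued in turn. The queue threshold, deterministic model, and data layout are assumptions.

import heapq
import itertools

def prioritized_sweeping_planning(Q, model, predecessors, actions, start_sa,
                                  gamma=0.95, alpha=0.5, theta=1e-4, n_backups=5):
    # Q: value table, e.g. collections.defaultdict(float), keyed by (state, action)
    # model[(s, a)] = (next_state, reward) for a deterministic model
    # predecessors[s]: set of (s_prev, a_prev) pairs known to lead to s
    tie = itertools.count()                    # tie-breaker so the heap never compares states
    pq = []                                    # min-heap of (-priority, tie, (s, a))

    def priority(s, a):
        s_next, r = model[(s, a)]
        best_next = max(Q[(s_next, b)] for b in actions)
        return abs(r + gamma * best_next - Q[(s, a)])   # size of the potential change

    p = priority(*start_sa)
    if p > theta:
        heapq.heappush(pq, (-p, next(tie), start_sa))

    for _ in range(n_backups):
        if not pq:
            break
        _, _, (s, a) = heapq.heappop(pq)
        s_next, r = model[(s, a)]
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # queue the predecessors of s whose own backup would now change enough
        for (sp, ap) in predecessors.get(s, ()):
            p = priority(sp, ap)
            if p > theta:
                heapq.heappush(pq, (-p, next(tie), (sp, ap)))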
49
Prioritized Sweeping vs. Dyna-Q
Both use N = 5 backups per environmental interaction
50
Full and Sample (One-Step) Backups
51
Summary
  • Emphasized the close relationship between planning and learning
  • Important distinction between distribution models and sample models
  • Looked at some ways to integrate planning and learning
  • synergy among planning, acting, and model learning

52
RL Recent Development: Problem Modeling
(Table: problems classified by whether the model of the environment is known or unknown, and whether the state is completely or partially observable)
53
Research topics
  • Exploration-Exploitation tradeoff
  • Problem of delayed reward (credit assignment)
  • Input generalization
  • Function Approximator
  • Multi-Agent Reinforcement Learning
  • Global goal vs Local goal
  • Achieve several goals in parallel
  • Agent cooperation and communication

54
RL Application
TD-Gammon
  • Tesauro 1992, 1994, 1995, ...
  • 30 pieces, 24 locations implies an enormous number of configurations
  • Effective branching factor of 400
  • TD(λ) algorithm
  • Multi-layer neural network
  • Near the level of the world's strongest grandmasters

55
RL Application
  • Elevator Dispatching
  • Crites and Barto 1996

56
RL Application
  • Elevator Dispatching
  • 18 hall call buttons: 2^18 combinations
  • positions and directions of cars: 18^4 (rounding to the nearest floor)
  • motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
  • 40 car buttons: 2^40
  • 18 discretized real numbers are available giving the elapsed time since the hall buttons were pushed
  • The set of passengers riding each car and their destinations is observable only through the car buttons

Conservatively, about 10^22 states
57
RL Application
  • Dynamic Channel Allocation (Singh and Bertsekas 1997)
  • Job-Shop Scheduling (Zhang and Dietterich 1995, 1996)


58
Q & A