Title: Applying Online Search Techniques to Reinforcement Learning
1. Applying Online Search Techniques to Reinforcement Learning
- Scott Davies, Andrew Ng, and Andrew Moore
- Carnegie Mellon University
2. The Agony of Continuous State Spaces
- Learning useful value functions for continuous-state optimal control problems can be difficult
- Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably
- Accurate value functions can be very expensive to compute even in relatively low-dimensional spaces with perfectly accurate state transition models
3. Combining Value Functions With Online Search
- Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent's current position to compensate for value function inaccuracies
- We examine two different types of search:
  - Local searches, in which the agent performs a finite-depth look-ahead search
  - Global searches, in which the agent searches for trajectories all the way to goal states
4. Typical One-Step Search
- Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes R_T + γV(x_T)
- This takes O(|A|) time, where A is the set of possible actions
- Given a perfect V(x), this would lead to optimal behavior
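The one-step selection rule above can be sketched as follows (a minimal sketch; `model`, `reward`, and `V` are placeholders for the agent's transition model, reward function, and approximate value function):

```python
def greedy_one_step(x, actions, model, reward, V, gamma=1.0):
    """Pick the action maximizing R_T + gamma * V(x_T) by one-step lookahead."""
    best_a, best_val = None, float("-inf")
    for a in actions:            # O(|A|) model evaluations
        x_next = model(x, a)     # predicted successor state
        val = reward(x, a) + gamma * V(x_next)
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val
```

With a perfect V this is exactly greedy optimal behavior; with an approximate V it inherits V's errors, which is what motivates the deeper searches on the following slides.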
5. Local Search
- An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes R_T + γ^d V(x_T)
  - Computational expense: O(|A|^d)
- To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times
  - Computational expense: O(d^s |A|^(s+1)), counting the choices of switch points and actions (considerably cheaper than full d-step search if s << d)
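The restricted trajectory set can be enumerated with a short recursion; this sketch assumes a discrete action set and simply counts/returns the allowed action sequences:

```python
def trajectories_with_switches(actions, d, s):
    """Enumerate all length-d action sequences with at most s action switches."""
    seqs = []
    def extend(seq, switches):
        if len(seq) == d:
            seqs.append(tuple(seq))
            return
        for a in actions:
            # switching to a different action than the previous step costs one switch
            sw = switches + (1 if seq and a != seq[-1] else 0)
            if sw <= s:
                extend(seq + [a], sw)
    extend([], 0)
    return seqs
```

For two actions and d = 3, for example, s = 0 leaves only the 2 constant sequences, s = 1 allows 6 sequences, and s = 2 (= d−1) recovers all 2^3 = 8, illustrating the savings when s << d.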
6. Local Search Example
- Two-dimensional state space (position, velocity)
- Car must back up to take a running start to make it
- Search over 20-step trajectories with at most one switch in actions
7. Using Local Search Online
- Repeat:
  - From the current state, consider all possible d-step trajectories T in which the action is changed at most s times
  - Perform the first action in the trajectory that maximizes R_T + γ^d V(x_T)
- Let B denote the parallel backup operator such that (BV)(x) = max_a [r(x, a) + γV(x_a)], where x_a is the state reached from x under action a
- If s ≥ d−1, Local Search is formally equivalent to behaving greedily with respect to the new value function B^(d−1)V. Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
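For a finite set of grid states, the parallel (synchronous) backup operator B can be sketched as follows; `step` and `reward` here stand in for the deterministic transition model and per-step reward:

```python
def backup(V, states, actions, step, reward, gamma=1.0):
    """One application of B: (BV)(x) = max_a [ r(x,a) + gamma * V(step(x,a)) ].
    All states are backed up from the *old* V, i.e. in parallel."""
    return {x: max(reward(x, a) + gamma * V[step(x, a)] for a in actions)
            for x in states}
```

Iterating `backup` on a small chain problem converges to the familiar cost-to-go values; behaving greedily with respect to B^(d−1)V is what the d-step local search computes implicitly.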
8. Uninformed Global Search
- Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal?
- Problem: combinatorial explosion
- Possible solution:
  - Break the state space into partitions, e.g. a uniform grid (can be represented sparsely)
  - Use the previously discussed local search procedure to find trajectories between partitions
  - Prune all but the least-cost trajectory entering any given partition
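A sparse uniform-grid partition only needs a function mapping a continuous state to a cell index; visited cells can then be stored in a plain dictionary (a minimal sketch; the bounds `lo`/`hi` and resolution `res` are assumed given):

```python
def cell_of(x, lo, hi, res):
    """Map a continuous state x (a tuple) to its uniform-grid cell index.
    Out-of-range coordinates are clamped to the boundary cells."""
    return tuple(min(res - 1, max(0, int((xi - l) / (h - l) * res)))
                 for xi, l, h in zip(x, lo, hi))
```

Only cells actually reached by the search appear as keys in the dictionary, so memory grows with the explored region rather than with res^dimension.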
9. Uninformed Global Search
- Problems:
  - Still computationally expensive
  - Even with fine partitioning of the state space, pruning the wrong trajectories can cause the search to fail
10. Informed Global Search
- Use the approximate value function V to guide the selection of which points to search from next
- A reasonably accurate V will cause the search to stay along the optimal path to the goal: dramatic reduction in search time
- V can help choose effective points within each partition from which to search, thereby improving solution quality
- Uninformed Global Search is the same as Informed Global Search with V(x) ≡ 0
11. Informed Global Search Algorithm
- Let x0 be the current state, and g(x0) be the grid element containing x0
- Set g(x0)'s representative state to x0, and add g(x0) to priority queue P with priority V(x0)
- Until goal state found or P empty:
  - Remove grid element g from top of P. Let x denote g's representative state.
  - SEARCH-FROM(g, x)
- If goal found, execute trajectory; otherwise signal failure
12. Informed Global Search Algorithm, cont'd
- SEARCH-FROM(g, x):
  - Starting from x, perform local search as described earlier, but prune the search wherever it reaches a different grid element g′ ≠ g
  - Each time another grid element g′ is reached at state x′:
    - If g′ previously SEARCHED-FROM, do nothing
    - If g′ never previously reached, add g′ to P with priority R_T(x0→x′) + γ^|T| V(x′), where T is the trajectory from x0 to x′. Set g′'s representative state to x′. Record the trajectory from x to x′.
    - If g′ previously reached but its previous priority is lower than R_T(x0→x′) + γ^|T| V(x′), update g′'s priority to R_T(x0→x′) + γ^|T| V(x′) and set its representative state to x′. Record the trajectory from x to x′.
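The whole algorithm fits naturally on a priority queue. This is a minimal sketch, not the authors' implementation: it assumes `local_search(x, traj)` returns, for each neighboring cell reached, a triple of (state reached, action segment, priority R_T(x0→x′) + γ^|T| V(x′)), and that `cell_of` maps states to hashable grid cells:

```python
import heapq

def informed_global_search(x0, V, local_search, cell_of, is_goal):
    """Expand grid cells in order of R_T + gamma^|T| * V, keeping one
    representative state per cell; returns a trajectory to the goal or None."""
    g0 = cell_of(x0)
    best = {g0: (V(x0), x0, [])}      # cell -> (priority, representative state, trajectory from x0)
    pq = [(-V(x0), g0)]               # heapq is a min-heap, so negate priorities
    searched = set()
    while pq:
        _, g = heapq.heappop(pq)
        if g in searched:             # stale queue entry
            continue
        searched.add(g)
        _, x, traj = best[g]
        if is_goal(x):
            return traj
        for x2, seg, p2 in local_search(x, traj):
            g2 = cell_of(x2)
            if g2 in searched:        # previously SEARCHED-FROM: do nothing
                continue
            if g2 not in best or p2 > best[g2][0]:
                best[g2] = (p2, x2, traj + seg)   # new or improved entry point
                heapq.heappush(pq, (-p2, g2))
    return None                        # P empty: signal failure
```

On a trivial 1-D chain with unit step costs this recovers the direct path to the goal; the interesting behavior comes from V steering which cells get expanded first.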
13. Informed Global Search Examples
[Figures: Hill-car search trees, using a 7×7 and a 13×13 simplex-interpolated V]
14. Informed Global Search as A*
- Informed Global Search is essentially an A* search using the value function V as a search heuristic
- Using A* with an optimistic heuristic function normally guarantees an optimal path to the goal
- Uninformed global search effectively uses the trivially optimistic heuristic V(s) ≡ 0. Might we expect better solution quality with uninformed search than with a non-optimistic crude approximate value function V?
- Not necessarily! A crude approximate non-optimistic value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree
15. Hill-car
- Car on steep hill
- State variables: position and velocity (2-d)
- Actions: accelerate forward or backward
- Goal: park near top
- Random start states
- Cost: total time to goal
16. Acrobot
- Two-link planar robot acting in vertical plane under gravity
- Underactuated: joint at elbow is actuated, shoulder is not
- Two angular positions and their velocities (4-d)
- Goal: raise tip at least one link's height above shoulder
- Two actions: full torque clockwise / counterclockwise
- Random starting positions
- Cost: total time to goal
[Figure: two-link acrobot with joint angles θ1, θ2 and goal height above shoulder]
17. Move-Cart-Pole
- Upright pole attached to cart by unactuated joint
- State: horizontal position of cart, angle of pole, and associated velocities (4-d)
- Actions: accelerate left or right
- Goal configuration: cart moved, pole balanced
- Start with random x ≠ 0
- Per-step cost quadratic in distance from goal configuration
- Big penalty if pole falls over
[Figure: cart-pole with pole angle θ, cart position x, and goal configuration]
18. Planar Slider
- Puck sliding on bumpy 2-d surface
- Two spatial variables and their velocities (4-d)
- Actions: accelerate NW, NE, SW, or SE
- Goal in NW corner
- Random start states
- Cost: total time to goal
19. Local Search Experiments: Move-Cart-Pole
- CPU time and solution cost vs. search depth d
- No limits imposed on number of action switches (s = d)
- Value function: 13^4 simplex-interpolation grid
20. Local Search Experiments: Hill-car
- CPU time and solution cost vs. search depth d
- Max. number of action switches fixed at 2 (s = 2)
- Value function: 7^2 simplex-interpolated grid
21. Comparative Experiments: Hill-Car
- Local search: d = 6, s = 2
- Global searches:
  - Local search between grid elements: d = 20, s = 1
  - 50^2 search grid resolution
  - 7^2 simplex-interpolated value function
22. Hill-Car Results, cont'd
- Uninformed Global Search prunes wrong trajectories
- Increase search grid to 100^2 so this doesn't happen
- Uninformed does near-optimal
- Informed doesn't: crude value function not optimistic
[Figure: failed search trajectory]
23. Comparative Results: Four-d Domains
- All value functions: 13^4 simplex interpolations
- All local searches between global search elements: depth 20, with at most 1 action switch (d = 20, s = 1)
- Acrobot
  - Local search: depth 4, no action switch restriction (d = 4, s = 4)
  - Global: 50^4 search grid
- Move-Cart-Pole: same as Acrobot
- Slider
  - Local search: depth 10, max. 1 action switch (d = 10, s = 1)
  - Global: 20^4 search grid
24. Acrobot
- LS = number of local searches performed to find paths between elements of global search grid
- Local search significantly improves solution quality, but increases CPU time by an order of magnitude
- Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning
- Informed global search finds much better solutions in relatively little time. The value function drastically reduces search, and better pruning leads to better solutions.
25. Move-Cart-Pole
- No search: pole often falls, incurring large penalties; overall poor solution quality
- Local search improves things a bit
- Uninformed search finds better solutions than informed
  - Few grid cells in which pruning is required
  - Value function not optimistic, so informed search solutions suboptimal
- Informed search reduces costs by an order of magnitude with no increase in required CPU time
26. Planar Slider
- Local search almost useless, and incurs massive CPU expense
- Uninformed search decreases solution cost by 50%, but at even greater CPU expense
- Informed search decreases solution cost by a factor of 4, at no increase in CPU time
27. Using Search with Learned Models
- Toy example: Hill-Car
  - 7^2 simplex-interpolated value function
  - One nearest-neighbor function approximator per possible action used to learn dx/dt
  - States sufficiently far away from the nearest neighbor optimistically assumed to be absorbing, to encourage exploration
- Average costs over first few hundred trials:
  - No search: 212
  - Local search: 127
  - Informed global search: 155
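The per-action nearest-neighbor model with optimistic absorbing states can be sketched as below (a hypothetical implementation, not the authors' code; one such model would be kept per action, and the `max_dist` threshold is an assumed tuning parameter):

```python
import math

class NearestNeighborModel:
    """Nearest-neighbor dynamics model for a single action.  Queries far
    from every stored sample return None, which the planner can treat as
    an absorbing state -- optimism that encourages exploration."""
    def __init__(self, max_dist):
        self.data = []            # list of (state, next_state) pairs
        self.max_dist = max_dist
    def add(self, x, x_next):
        self.data.append((x, x_next))
    def predict(self, x):
        if not self.data:
            return None                           # no data yet: unknown
        x_n, y_n = min(self.data, key=lambda p: math.dist(p[0], x))
        if math.dist(x_n, x) > self.max_dist:
            return None                           # too far from data: absorbing
        return y_n
```

A brute-force scan like this is O(n) per query; a real implementation would use a kd-tree or similar structure, since (as the next slide notes) model approximators must be fast as well as accurate.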
28. Using Search with Learned Models
- Problems do arise when using learned models:
  - Inaccuracies in models may cause global searches to fail. It is then unclear whether failure should be blamed on model inaccuracies or on insufficiently fine state space partitioning.
  - Trajectories found will be inaccurate
    - Need adaptive closed-loop controller
  - Fortunately, we will get new data with which to increase the accuracy of our model
  - Model approximators must be fast and accurate
29. Avenues for Future Research
- Extensions to nondeterministic systems?
- Higher-dimensional problems
- Better function approximators for model learning
- Variable-resolution search grids
- Optimistic value function generation?