Title: Applying Online Search Techniques to Reinforcement Learning
1. Applying Online Search Techniques to Reinforcement Learning
- Scott Davies, Andrew Ng, and Andrew Moore
- Carnegie Mellon University
2. The Agony of Continuous State Spaces
- Learning useful value functions for continuous-state optimal control problems can be difficult
- Small inaccuracies/inconsistencies in approximated value functions can cause simple controllers to fail miserably
- Accurate value functions can be very expensive to compute even in relatively low-dimensional spaces with perfectly accurate state transition models
3. Combining Value Functions With Online Search
- Instead of modeling the value function accurately everywhere, we can perform online searches for good trajectories from the agent's current position to compensate for value function inaccuracies
- We examine two different types of search:
  - Local searches, in which the agent performs a finite-depth look-ahead search
  - Global searches, in which the agent searches for trajectories all the way to goal states
4. Typical One-Step Search
- Given a value function V(x) over the state space, an agent typically uses a model to predict where each possible one-step trajectory T takes it, then chooses the trajectory that maximizes R_T + γV(x_T)
- This takes O(|A|) time, where A is the set of possible actions
- Given a perfect V(x), this would lead to optimal behavior
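The one-step selection rule above can be sketched as follows (a minimal sketch; `model`, `reward`, and `V` are placeholders for the agent's transition model, reward function, and approximate value function):

```python
def greedy_one_step(x, actions, model, reward, V, gamma=1.0):
    """Pick the action maximizing R_T + gamma * V(x_T) by one-step lookahead."""
    best_a, best_val = None, float("-inf")
    for a in actions:            # O(|A|) model evaluations
        x_next = model(x, a)     # predicted successor state
        val = reward(x, a) + gamma * V(x_next)
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val
```

With a perfect V this is exactly greedy optimal behavior; with an approximate V it inherits V's errors, which is what motivates the deeper searches on the following slides.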
5. Local Search
- An obvious possible extension: consider all possible d-step trajectories T, selecting the one that maximizes R_T + γ^d V(x_T)
  - Computational expense: O(|A|^d)
- To make deeper searches more computationally tractable, we can limit the agent to considering only trajectories in which the action is switched at most s times
  - Computational expense: O(d^s |A|^(s+1)), counting the choices of switch points and actions (considerably cheaper than full d-step search if s << d)
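The restricted trajectory set can be enumerated with a short recursion; this sketch assumes a discrete action set and simply counts/returns the allowed action sequences:

```python
def trajectories_with_switches(actions, d, s):
    """Enumerate all length-d action sequences with at most s action switches."""
    seqs = []
    def extend(seq, switches):
        if len(seq) == d:
            seqs.append(tuple(seq))
            return
        for a in actions:
            # switching to a different action than the previous step costs one switch
            sw = switches + (1 if seq and a != seq[-1] else 0)
            if sw <= s:
                extend(seq + [a], sw)
    extend([], 0)
    return seqs
```

For two actions and d = 3, for example, s = 0 leaves only the 2 constant sequences, s = 1 allows 6 sequences, and s = 2 (= d−1) recovers all 2^3 = 8, illustrating the savings when s << d.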
6. Local Search Example
- Two-dimensional state space (position, velocity)
- Car must back up to take a running start to make it
- Search over 20-step trajectories with at most one switch in actions
7. Using Local Search Online
- Repeat:
  - From the current state, consider all possible d-step trajectories T in which the action is changed at most s times
  - Perform the first action in the trajectory that maximizes R_T + γ^d V(x_T)
- Let B denote the parallel backup operator such that (BV)(x) = max_a [r(x, a) + γV(x_a)], where x_a is the state reached from x under action a
- If s ≥ d−1, Local Search is formally equivalent to behaving greedily with respect to the new value function B^(d−1)V. Since V is typically arrived at through iterations of a much cruder backup operator, this value function is often much more accurate than V.
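For a finite set of grid states, the parallel (synchronous) backup operator B can be sketched as follows; `step` and `reward` here stand in for the deterministic transition model and per-step reward:

```python
def backup(V, states, actions, step, reward, gamma=1.0):
    """One application of B: (BV)(x) = max_a [ r(x,a) + gamma * V(step(x,a)) ].
    All states are backed up from the *old* V, i.e. in parallel."""
    return {x: max(reward(x, a) + gamma * V[step(x, a)] for a in actions)
            for x in states}
```

Iterating `backup` on a small chain problem converges to the familiar cost-to-go values; behaving greedily with respect to B^(d−1)V is what the d-step local search computes implicitly.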
8. Uninformed Global Search
- Suppose we have a minimum-cost-to-goal problem in a continuous state space with nonnegative costs. Why not forget about explicitly calculating V and just extend the search from the current position all the way to the goal?
- Problem: combinatorial explosion
- Possible solution:
  - Break the state space into partitions, e.g. a uniform grid (can be represented sparsely)
  - Use the previously discussed local search procedure to find trajectories between partitions
  - Prune all but the least-cost trajectory entering any given partition
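A sparse uniform-grid partition only needs a function mapping a continuous state to a cell index; visited cells can then be stored in a plain dictionary (a minimal sketch; the bounds `lo`/`hi` and resolution `res` are assumed given):

```python
def cell_of(x, lo, hi, res):
    """Map a continuous state x (a tuple) to its uniform-grid cell index.
    Out-of-range coordinates are clamped to the boundary cells."""
    return tuple(min(res - 1, max(0, int((xi - l) / (h - l) * res)))
                 for xi, l, h in zip(x, lo, hi))
```

Only cells actually reached by the search appear as keys in the dictionary, so memory grows with the explored region rather than with res^dimension.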
9. Uninformed Global Search
- Problems:
  - Still computationally expensive
  - Even with fine partitioning of the state space, pruning the wrong trajectories can cause the search to fail
10. Informed Global Search
- Use the approximate value function V to guide the selection of which points to search from next
- A reasonably accurate V will cause the search to stay along the optimal path to the goal: dramatic reduction in search time
- V can help choose effective points within each partition from which to search, thereby improving solution quality
- Uninformed Global Search is the same as Informed Global Search with V(x) ≡ 0
11. Informed Global Search Algorithm
- Let x0 be the current state, and g(x0) be the grid element containing x0
- Set g(x0)'s representative state to x0, and add g(x0) to priority queue P with priority V(x0)
- Until goal state found or P empty:
  - Remove grid element g from top of P. Let x denote g's representative state.
  - SEARCH-FROM(g, x)
- If goal found, execute trajectory; otherwise signal failure
12. Informed Global Search Algorithm, cont'd
- SEARCH-FROM(g, x):
  - Starting from x, perform local search as described earlier, but prune the search wherever it reaches a different grid element g′ ≠ g
  - Each time another grid element g′ is reached at state x′:
    - If g′ previously SEARCHED-FROM, do nothing
    - If g′ never previously reached, add g′ to P with priority R_T(x0→x′) + γ^|T| V(x′), where T is the trajectory from x0 to x′. Set g′'s representative state to x′. Record the trajectory from x to x′.
    - If g′ previously reached but its previous priority is lower than R_T(x0→x′) + γ^|T| V(x′), update g′'s priority to R_T(x0→x′) + γ^|T| V(x′) and set its representative state to x′. Record the trajectory from x to x′.
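The whole algorithm fits naturally on a priority queue. This is a minimal sketch, not the authors' implementation: it assumes `local_search(x, traj)` returns, for each neighboring cell reached, a triple of (state reached, action segment, priority R_T(x0→x′) + γ^|T| V(x′)), and that `cell_of` maps states to hashable grid cells:

```python
import heapq

def informed_global_search(x0, V, local_search, cell_of, is_goal):
    """Expand grid cells in order of R_T + gamma^|T| * V, keeping one
    representative state per cell; returns a trajectory to the goal or None."""
    g0 = cell_of(x0)
    best = {g0: (V(x0), x0, [])}      # cell -> (priority, representative state, trajectory from x0)
    pq = [(-V(x0), g0)]               # heapq is a min-heap, so negate priorities
    searched = set()
    while pq:
        _, g = heapq.heappop(pq)
        if g in searched:             # stale queue entry
            continue
        searched.add(g)
        _, x, traj = best[g]
        if is_goal(x):
            return traj
        for x2, seg, p2 in local_search(x, traj):
            g2 = cell_of(x2)
            if g2 in searched:        # previously SEARCHED-FROM: do nothing
                continue
            if g2 not in best or p2 > best[g2][0]:
                best[g2] = (p2, x2, traj + seg)   # new or improved entry point
                heapq.heappush(pq, (-p2, g2))
    return None                        # P empty: signal failure
```

On a trivial 1-D chain with unit step costs this recovers the direct path to the goal; the interesting behavior comes from V steering which cells get expanded first.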
13. Informed Global Search Examples
[Figures: Hill-car search trees, using a 7×7 and a 13×13 simplex-interpolated V]
14. Informed Global Search as A*
- Informed Global Search is essentially an A* search using the value function V as a search heuristic
- Using A* with an optimistic heuristic function normally guarantees an optimal path to the goal
- Uninformed global search effectively uses the trivially optimistic heuristic V(s) ≡ 0. Might we expect better solution quality with uninformed search than with a non-optimistic crude approximate value function V?
- Not necessarily! A crude approximate non-optimistic value function can improve solution quality by helping the algorithm avoid pruning the wrong parts of the search tree
15. Hill-car
- Car on steep hill
- State variables: position and velocity (2-d)
- Actions: accelerate forward or backward
- Goal: park near top
- Random start states
- Cost: total time to goal
16. Acrobot
- Two-link planar robot acting in vertical plane under gravity
- Underactuated: joint at elbow is actuated, shoulder is not
- Two angular positions and their velocities (4-d)
- Goal: raise tip at least one link's height above shoulder
- Two actions: full torque clockwise / counterclockwise
- Random starting positions
- Cost: total time to goal
[Figure: two-link acrobot with joint angles θ1, θ2 and goal height above shoulder]
17. Move-Cart-Pole
- Upright pole attached to cart by unactuated joint
- State: horizontal position of cart, angle of pole, and associated velocities (4-d)
- Actions: accelerate left or right
- Goal configuration: cart moved, pole balanced
- Start with random x ≠ 0
- Per-step cost quadratic in distance from goal configuration
- Big penalty if pole falls over
[Figure: cart-pole with pole angle θ, cart position x, and goal configuration]
18. Planar Slider
- Puck sliding on bumpy 2-d surface
- Two spatial variables and their velocities (4-d)
- Actions: accelerate NW, NE, SW, or SE
- Goal in NW corner
- Random start states
- Cost: total time to goal
19. Local Search Experiments: Move-Cart-Pole
- CPU time and solution cost vs. search depth d
- No limits imposed on number of action switches (s = d)
- Value function: 13^4 simplex-interpolation grid
20. Local Search Experiments: Hill-car
- CPU time and solution cost vs. search depth d
- Max. number of action switches fixed at 2 (s = 2)
- Value function: 7^2 simplex-interpolated grid
21. Comparative Experiments: Hill-Car
- Local search: d = 6, s = 2
- Global searches:
  - Local search between grid elements: d = 20, s = 1
  - 50^2 search grid resolution
  - 7^2 simplex-interpolated value function
22. Hill-Car Results, cont'd
- Uninformed Global Search prunes wrong trajectories
- Increase search grid to 100^2 so this doesn't happen
- Uninformed does near-optimal
- Informed doesn't: crude value function not optimistic
[Figure: failed search trajectory]
23. Comparative Results: Four-d Domains
- All value functions: 13^4 simplex interpolations
- All local searches between global search elements: depth 20, with at most 1 action switch (d = 20, s = 1)
- Acrobot
  - Local search: depth 4, no action switch restriction (d = 4, s = 4)
  - Global: 50^4 search grid
- Move-Cart-Pole: same as Acrobot
- Slider
  - Local search: depth 10, max. 1 action switch (d = 10, s = 1)
  - Global: 20^4 search grid
24. Acrobot
- LS = number of local searches performed to find paths between elements of global search grid
- Local search significantly improves solution quality, but increases CPU time by an order of magnitude
- Uninformed global search takes even more time; poor solution quality indicates suboptimal trajectory pruning
- Informed global search finds much better solutions in relatively little time. The value function drastically reduces search, and better pruning leads to better solutions.
25. Move-Cart-Pole
- No search: pole often falls, incurring large penalties; overall poor solution quality
- Local search improves things a bit
- Uninformed search finds better solutions than informed
  - Few grid cells in which pruning is required
  - Value function not optimistic, so informed search solutions suboptimal
- Informed search reduces costs by an order of magnitude with no increase in required CPU time
26. Planar Slider
- Local search almost useless, and incurs massive CPU expense
- Uninformed search decreases solution cost by 50%, but at even greater CPU expense
- Informed search decreases solution cost by a factor of 4, at no increase in CPU time
27. Using Search with Learned Models
- Toy example: Hill-Car
  - 7^2 simplex-interpolated value function
  - One nearest-neighbor function approximator per possible action used to learn dx/dt
  - States sufficiently far away from the nearest neighbor optimistically assumed to be absorbing, to encourage exploration
- Average costs over first few hundred trials:
  - No search: 212
  - Local search: 127
  - Informed global search: 155
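The per-action nearest-neighbor model with optimistic absorbing states can be sketched as below (a hypothetical implementation, not the authors' code; one such model would be kept per action, and the `max_dist` threshold is an assumed tuning parameter):

```python
import math

class NearestNeighborModel:
    """Nearest-neighbor dynamics model for a single action.  Queries far
    from every stored sample return None, which the planner can treat as
    an absorbing state -- optimism that encourages exploration."""
    def __init__(self, max_dist):
        self.data = []            # list of (state, next_state) pairs
        self.max_dist = max_dist
    def add(self, x, x_next):
        self.data.append((x, x_next))
    def predict(self, x):
        if not self.data:
            return None                           # no data yet: unknown
        x_n, y_n = min(self.data, key=lambda p: math.dist(p[0], x))
        if math.dist(x_n, x) > self.max_dist:
            return None                           # too far from data: absorbing
        return y_n
```

A brute-force scan like this is O(n) per query; a real implementation would use a kd-tree or similar structure, since (as the next slide notes) model approximators must be fast as well as accurate.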
28. Using Search with Learned Models
- Problems do arise when using learned models:
  - Inaccuracies in models may cause global searches to fail. It is then unclear whether failure should be blamed on model inaccuracies or on insufficiently fine state space partitioning.
  - Trajectories found will be inaccurate
    - Need adaptive closed-loop controller
  - Fortunately, we will get new data with which to increase the accuracy of our model
  - Model approximators must be fast and accurate
29. Avenues for Future Research
- Extensions to nondeterministic systems?
- Higher-dimensional problems
- Better function approximators for model learning
- Variable-resolution search grids
- Optimistic value function generation?