1
Applying Online Search Techniques to Reinforcement Learning
  • Scott Davies, Andrew Ng, and Andrew Moore
  • Carnegie Mellon University

2
The Agony of Continuous State Spaces
  • Learning useful value functions for
    continuous-state optimal control problems can be
    difficult
  • Small inaccuracies/inconsistencies in
    approximated value functions can cause simple
    controllers to fail miserably
  • Accurate value functions can be very expensive to
    compute even in relatively low-dimensional spaces
    with perfectly accurate state transition models

3
Combining Value Functions With Online Search
  • Instead of modeling the value function accurately
    everywhere, we can perform online searches for
    good trajectories from the agent's current
    position to compensate for value function
    inaccuracies
  • We examine two different types of search:
  • Local searches, in which the agent performs a
    finite-depth look-ahead search
  • Global searches, in which the agent searches for
    trajectories all the way to goal states

4
Typical One-Step Search
  • Given a value function V(x) over the state space,
    an agent typically uses a model to predict where
    each possible one-step trajectory T takes it,
    then chooses the trajectory that maximizes

R_T + γ·V(x_T)

  • This takes O(|A|) time, where A is the set of
    possible actions.
  • Given a perfect V(x), this would lead to optimal
    behavior.
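
A minimal sketch of this one-step selection rule, assuming a
deterministic model step(x, a) -> (reward, x_next) and a value
function V (both hypothetical names, not from the slides):

    def one_step_greedy(x, actions, step, V, gamma=0.99):
        """Pick the action maximizing R_T + gamma * V(x_T)."""
        best_action, best_score = None, float("-inf")
        for a in actions:                # O(|A|) one-step look-ahead
            reward, x_next = step(x, a)  # model predicts the next state
            score = reward + gamma * V(x_next)
            if score > best_score:
                best_action, best_score = a, score
        return best_action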
5
Local Search
  • An obvious extension: consider all
    possible d-step trajectories T, selecting the one
    that maximizes R_T + γ^d·V(x_T).
  • Computational expense: O(A^d).
  • To make deeper searches more computationally
    tractable, we can limit the agent to considering only
    trajectories in which the action is switched at
    most s times (see the sketch below).
  • Computational expense: roughly O(A^(s+1)·d^s)
    (considerably cheaper than full d-step
    search if s ≪ d)
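
A sketch of this bounded-switch search, under the same hypothetical
step/V interface as above; the pruning on the switch count is what
buys the speedup:

    def best_local_trajectory(x, actions, step, V, d, s, gamma=0.99):
        """Search all d-step trajectories with at most s action switches;
        return (score, first_action) maximizing R_T + gamma**d * V(x_T)."""
        def recurse(x, depth, prev_a, switches, reward):
            if depth == d:
                return reward + gamma ** d * V(x), None
            best = (float("-inf"), None)
            for a in actions:
                sw = switches + (prev_a is not None and a != prev_a)
                if sw > s:
                    continue  # prune: too many action switches
                r, x_next = step(x, a)
                score, _ = recurse(x_next, depth + 1, a, sw,
                                   reward + gamma ** depth * r)
                if score > best[0]:
                    best = (score, a)
            return best
        return recurse(x, 0, None, 0, 0.0)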

6
Local Search Example
  • Two-dimensional state space (position, velocity)
  • Car must back up to take a running start to make
    it up the hill

Search over 20-step trajectories with at most one
switch in actions
7
Using Local Search Online
  • Repeat:
  • From the current state, consider all possible d-step
    trajectories T in which the action is changed at
    most s times
  • Perform the first action of the trajectory that
    maximizes R_T + γ^d·V(x_T).
  • Let B denote the parallel backup operator such
    that

(BV)(x) = max_a [ R(x, a) + γ·V(next(x, a)) ]

If s ≥ (d−1), Local Search is formally equivalent
to behaving greedily with respect to the new
value function B^(d−1)V. Since V is typically
arrived at through iterations of a much cruder
backup operator, this value function is often
much more accurate than V.
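
The online control loop as a short sketch, reusing the hypothetical
best_local_trajectory from the earlier slide; env(x, a), which applies
the action in the real system, is also an assumed helper:

    def act_with_local_search(x0, actions, step, env, V, d, s,
                              goal_test, max_steps=1000):
        """Re-plan at every step: execute only the first action of the
        best bounded-switch trajectory, then search again."""
        x = x0
        for _ in range(max_steps):
            if goal_test(x):
                break
            _, a = best_local_trajectory(x, actions, step, V, d, s)
            x = env(x, a)  # take one real step, then re-plan
        return x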
8
Uninformed Global Search
  • Suppose we have a minimum-cost-to-goal problem in
    a continuous state space with nonnegative costs.
    Why not forget about explicitly calculating V and
    just extend the search from the current position
    all the way to the goal?
  • Problem: combinatorial explosion.
  • Possible solution (sketched in code after this list):
  • Break the state space into partitions, e.g. a uniform
    grid (can be represented sparsely)
  • Use the previously discussed local search procedure
    to find trajectories between partitions
  • Prune all but the least-cost trajectory entering any
    given partition
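
A minimal sketch of the sparse partition and the per-cell pruning;
the names cell_size and entries are hypothetical, not from the slides:

    def grid_key(x, cell_size):
        """Map a continuous state to the index tuple of the uniform-grid
        cell containing it; a dict keyed by these tuples represents the
        partition sparsely."""
        return tuple(int(xi // ci) for xi, ci in zip(x, cell_size))

    def prune_to_cheapest(entries, cell_size):
        """Given (cost, state, trajectory) tuples, keep only the
        least-cost entry per grid cell; the rest are pruned."""
        best = {}
        for cost, state, traj in entries:
            key = grid_key(state, cell_size)
            if key not in best or cost < best[key][0]:
                best[key] = (cost, state, traj)
        return best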

9
Uninformed Global Search
  • Problems:
  • Still computationally expensive
  • Even with fine partitioning of the state space,
    pruning the wrong trajectories can cause the search
    to fail

10
Informed Global Search
  • Use approximate value function V to guide the
    selection of which points to search from next
  • A reasonably accurate V will cause the search to stay
    along the optimal path to the goal: a dramatic reduction
    in search time
  • V can help choose effective points within each
    partition from which to search, thereby improving
    solution quality
  • Uninformed Global Search is the same as Informed
    Global Search with V(x) = 0

11
Informed Global Search Algorithm
  • Let x0 be the current state, and g(x0) be the grid
    element containing x0
  • Set g(x0)'s representative state to x0, and add
    g(x0) to priority queue P with priority V(x0)
  • Until a goal state is found or P is empty:
  • Remove the grid element g from the top of P. Let x
    denote g's representative state.
  • SEARCH-FROM(g, x)
  • If the goal was found, execute the trajectory;
    otherwise signal failure

12
Informed Global Search Algorithm, cont'd
  • SEARCH-FROM(g, x):
  • Starting from x, perform local search as
    described earlier, but prune the search wherever
    it reaches a different grid element g′ ≠ g.
  • Each time another grid element g′ is reached at a
    state x′:
  • If g′ was previously SEARCHED-FROM, do nothing.
  • If g′ was never previously reached, add g′ to P with
    priority R_T(x0→x′) + γ^|T|·V(x′), where T is the
    trajectory from x0 to x′. Set g′'s
    representative state to x′. Record the trajectory
    from x to x′.
  • If g′ was previously reached but its previous priority
    is lower than R_T(x0→x′) + γ^|T|·V(x′), update g′'s
    priority to R_T(x0→x′) + γ^|T|·V(x′) and set its
    representative state to x′. Record the trajectory
    from x to x′.
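
Putting slides 11 and 12 together, a hedged sketch in code; expand,
grid_key_fn, and is_goal are assumed helpers (expand(x, path) yields
(priority, x′, path-to-x′) for each neighboring cell reached by the
bounded local search, with priority = R_T(x0→x′) + γ^|T|·V(x′)):

    import heapq

    def informed_global_search(x0, V, expand, grid_key_fn, is_goal):
        """Expand grid cells in best-first order of the priority above,
        keeping one representative state per cell."""
        g0 = grid_key_fn(x0)
        rep = {g0: (V(x0), x0, [])}  # cell -> (priority, rep. state, path)
        searched = set()
        frontier = [(-V(x0), g0)]    # heapq is a min-heap, so negate
        while frontier:
            _, g = heapq.heappop(frontier)
            if g in searched:
                continue             # stale queue entry
            searched.add(g)
            _, x, path = rep[g]
            if is_goal(x):
                return path          # trajectory from x0 to the goal
            for p, x_next, path_next in expand(x, path):
                g_next = grid_key_fn(x_next)
                if g_next in searched:
                    continue         # already SEARCHED-FROM: do nothing
                if g_next not in rep or p > rep[g_next][0]:
                    rep[g_next] = (p, x_next, path_next)
                    heapq.heappush(frontier, (-p, g_next))
        return None                  # queue empty: signal failure

Setting V(x) = 0 everywhere in this sketch gives the uninformed
variant of slide 8.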

13
Informed Global Search Examples
[Figures: hill-car search trees using a 7² and a 13² simplex-interpolated V]
14
Informed Global Search as A*
  • Informed Global Search is essentially an A*
    search using the value function V as a search
    heuristic
  • Using A* with an optimistic heuristic function
    normally guarantees an optimal path to the goal.
  • Uninformed global search effectively uses the
    trivially optimistic heuristic V(s) = 0. Might
    we expect better solution quality with uninformed
    search than with a non-optimistic, crude approximate
    value function V?
  • Not necessarily! A crude, non-optimistic
    approximate value function can improve
    solution quality by helping the algorithm avoid
    pruning the wrong parts of the search tree

15
Hill-car
  • Car on a steep hill
  • State variables: position and velocity (2-d)
  • Actions: accelerate forward or backward
  • Goal: park near the top
  • Random start states
  • Cost: total time to goal

16
Acrobot
  • Two-link planar robot acting in vertical plane
    under gravity
  • Actuated joint at the elbow; unactuated shoulder
  • Two angular positions and their velocities (4-d)
  • Goal: raise the tip at least one link's height above
    the shoulder
  • Two actions: full torque clockwise or
    counterclockwise
  • Random starting positions
  • Cost: total time to goal

[Figure: acrobot with joint angles θ1 and θ2; goal height marked]
17
Move-Cart-Pole
  • Upright pole attached to cart by unactuated joint
  • State: horizontal position of the cart, angle of the
    pole, and the associated velocities (4-d)
  • Actions: accelerate left or right
  • Goal configuration: cart moved, pole balanced
  • Start with random x ≠ 0
  • Per-step cost: quadratic in distance from the goal
    configuration
  • Big penalty if the pole falls over

[Figure: cart-pole with pole angle θ, cart position x, and the goal configuration]
18
Planar Slider
  • Puck sliding on bumpy 2-d surface
  • Two spatial variables and their velocities (4-d)
  • Actions: accelerate NW, NE, SW, or SE
  • Goal: in the NW corner
  • Random start states
  • Cost: total time to goal

19
Local Search Experiments
Move-Cart-Pole
  • CPU time and solution cost vs. search depth d
  • No limits imposed on the number of action switches
    (s = d)
  • Value function: 13⁴ simplex-interpolation grid

20
Local Search Experiments
Hill-car
  • CPU time and solution cost vs. search depth d
  • Max. number of action switches fixed at 2 (s = 2)
  • Value function: 7² simplex-interpolated grid

21
Comparative Experiments: Hill-Car
  • Local search: d = 6, s = 2
  • Global searches:
  • Local search between grid elements: d = 20, s = 1
  • 50² search grid resolution
  • 7² simplex-interpolated value function

22
Hill-Car Results, cont'd
  • Uninformed Global Search prunes the wrong
    trajectories
  • Increase the search grid to 100² so this doesn't
    happen
  • Uninformed search then does near-optimally
  • Informed search doesn't: the crude value function
    is not optimistic

[Figure: failed search trajectory]
23
Comparative Results: Four-d Domains
  • All value functions: 13⁴ simplex interpolations
  • All local searches between global search
    elements:
  • depth 20, with at most 1 action switch (d = 20,
    s = 1)
  • Acrobot:
  • Local search: depth 4, no action-switch
    restriction (d = 4, s = 4)
  • Global: 50⁴ search grid
  • Move-Cart-Pole: same as Acrobot
  • Slider:
  • Local search: depth 10, max. 1 action switch
    (d = 10, s = 1)
  • Global: 20⁴ search grid

24
Acrobot
LS = number of local searches performed to find
paths between elements of the global search grid
  • Local search significantly improves solution
    quality, but increases CPU time by an order of
    magnitude
  • Uninformed global search takes even more time;
    its poor solution quality indicates suboptimal
    trajectory pruning
  • Informed global search finds much better
    solutions in relatively little time. Value
    function drastically reduces search, and better
    pruning leads to better solutions

25
Move-Cart-Pole
  • No search: the pole often falls, incurring large
    penalties; overall poor solution quality
  • Local search improves things a bit
  • Uninformed search finds better solutions than
    informed:
  • Few grid cells in which pruning is required
  • The value function is not optimistic, so informed-search
    solutions are suboptimal
  • Informed search still reduces costs by an order of
    magnitude with no increase in required CPU time

26
Planar Slider
  • Local search is almost useless, and incurs massive
    CPU expense
  • Uninformed search decreases solution cost by 50%,
    but at even greater CPU expense
  • Informed search decreases solution cost by a factor
    of 4, at no increase in CPU time

27
Using Search with Learned Models
  • Toy example: Hill-Car
  • 7² simplex-interpolated value function
  • One nearest-neighbor function approximator per
    possible action used to learn dx/dt
  • States sufficiently far away from their nearest
    neighbor optimistically assumed to be absorbing,
    to encourage exploration
  • Average costs over the first few hundred trials:
  • No search: 212
  • Local search: 127
  • Informed global search: 155
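
A hedged sketch of such a per-action model; the class name, storage
scheme, and distance threshold below are assumptions, not details
from the slides:

    import numpy as np

    class NearestNeighborModel:
        """One nearest-neighbor dynamics model for a single action.
        Queries far from all stored data return None so the planner can
        optimistically treat those states as absorbing."""
        def __init__(self, max_dist):
            self.max_dist = max_dist
            self.X, self.dXdt = [], []  # observed states, derivatives

        def add(self, x, dxdt):
            self.X.append(np.asarray(x, dtype=float))
            self.dXdt.append(np.asarray(dxdt, dtype=float))

        def predict(self, x):
            """Estimated dx/dt, or None if x is too far from the data."""
            if not self.X:
                return None
            x = np.asarray(x, dtype=float)
            dists = [np.linalg.norm(x - xi) for xi in self.X]
            i = int(np.argmin(dists))
            return self.dXdt[i] if dists[i] <= self.max_dist else None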

28
Using Search with Learned Models
  • Problems do arise when using learned models:
  • Inaccuracies in the models may cause global searches
    to fail. It is then unclear whether failure should be
    blamed on model inaccuracies or on an insufficiently
    fine state-space partitioning
  • Trajectories found will be inaccurate
  • Need an adaptive closed-loop controller
  • Fortunately, we will get new data with which to
    increase the accuracy of our model
  • Model approximators must be fast and accurate

29
Avenues for Future Research
  • Extensions to nondeterministic systems?
  • Higher-dimensional problems
  • Better function approximators for model learning
  • Variable-resolution search grids
  • Optimistic value function generation?