Using Inaccurate Models in Reinforcement Learning
S T A N F O R D
  • Pieter Abbeel, Morgan Quigley, and Andrew Y. Ng
  • 1. Preliminaries
  • Markov Decision Process (MDP)
    M = (S, A, T, H, s0, R).
  • S ⊆ ℝ^n (continuous state space).
  • Time-varying, deterministic dynamics
    T = { f_t : S × A → S }, t = 0, …, H.
  • Goal: find a policy π : S → A that maximizes
    U(π) = E[ Σ_{t=0}^{H} R(s_t) | π ]
    (a rollout sketch follows this list).
  • Focus: the task of trajectory following.
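As a concrete reading of the objective above, the following minimal Python
sketch evaluates U(π) by rolling out the deterministic, time-varying
dynamics. The callables policy, dynamics[t], and reward, and the indexing
convention (dynamics[t] maps s_t to s_{t+1} for t = 0, …, H−1), are
illustrative assumptions for this sketch, not part of the poster.

```python
import numpy as np

def utility(policy, dynamics, reward, s0, H):
    """Evaluate U(pi) = sum_{t=0}^{H} R(s_t) by rolling out the
    time-varying deterministic dynamics of the MDP."""
    s = np.asarray(s0, dtype=float)
    total = 0.0
    for t in range(H + 1):
        total += reward(s)           # accumulate R(s_t)
        if t < H:
            a = policy(s, t)         # a_t = pi(s_t), time-indexed policy
            s = dynamics[t](s, a)    # s_{t+1} = f_t(s_t, a_t)
    return total
```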
  • 0. Overview
  • RL for high-dimensional continuous state-space
    tasks.
  • Model-based RL: difficult to build an accurate
    model.
  • Model-free RL: often requires a large number of
    real-life trials.
  • We present a hybrid algorithm that requires only
  • an approximate model, and
  • a small number of real-life trials.
  • The resulting policy is an approximate local
    optimum.
  • 2. Motivating Example
  • A student driver learning to make a 90-degree
    right turn: only a few trials are needed, and no
    accurate model.
  • Key aspects
  • A real-life trial shows whether the turn is wide
    or short.
  • Crude model: turning the steering wheel more to
    the right results in a sharper turn; turning it
    more to the left results in a wider turn.
  • Result: a good policy gradient estimate.

4. Main Idea
Test the model-based optimal policy in real life.
How should we proceed when the real-life trajectory
is not the desired trajectory predicted by the model?
The policy gradient is zero according to the model,
so no improvement is possible based on the model
alone.
Solution: update the model so that it becomes exact
for the current policy. More specifically, add a bias
to the model for each time step. See the illustration
below for details.
  • Effect on the policy gradient estimate
    (illustration): the exact policy gradient versus
    the model-based policy gradient.
  • The model-based estimate has two sources of error:
    (1) evaluation of the derivatives along the wrong
    trajectory (the trajectory predicted by the model,
    which equals the desired trajectory, rather than
    the real-life trajectory), and (2) the derivative
    of the approximate transition function.

Our algorithm eliminates the error from evaluating
the derivatives along the wrong trajectory: the new
model perfectly predicts the state sequence obtained
by the current policy, so the derivatives are
evaluated along the real-life trajectory.
Consequently, the new model knows that more right
steering is required.
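The bias update described in Section 4 can be written down directly: after
executing the current policy in real life and recording the states and
actions, each approximate transition function f_t gets a constant,
time-indexed offset that makes it reproduce the recorded trajectory. The
sketch below assumes the model is a list of per-time-step functions and that
the recorded trajectory contains one more state than actions; the names are
placeholders, not the authors' code.

```python
import numpy as np

def add_time_indexed_bias(model, real_states, real_actions):
    """Return a corrected model with f_new[t](s, a) = f[t](s, a) + b_t,
    where b_t = s_{t+1}^real - f[t](s_t^real, a_t^real).  By construction
    the corrected model exactly reproduces the recorded real-life state
    sequence under the recorded actions."""
    corrected = []
    for t, f_t in enumerate(model):
        b_t = (np.asarray(real_states[t + 1], dtype=float)
               - f_t(real_states[t], real_actions[t]))
        # Bind f_t and b_t as default arguments to avoid late-binding bugs.
        corrected.append(lambda s, a, f=f_t, b=b_t: f(s, a) + b)
    return corrected
```

Note that the bias only shifts each f_t; it does not change its derivatives,
which is why the remaining error comes from the derivative of the
approximate transition function.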
  • 5. Complete Algorithm
  • 1. Find the (locally) optimal policy π for the
    model.
  • 2. Execute the current policy π and record the
    state trajectory.
  • 3. Update the model such that the new model is
    exact for the current policy π.
  • 4. Compute the policy gradient in the new model
    and update the policy parameters:
    θ ← θ + α ∇_θ U(θ).
  • 5. Go back to Step 2.
  • Notes
  • The step-size parameter α is determined by a line
    search.
  • Instead of the policy gradient, any algorithm
    that provides a local policy improvement
    direction can be used. In our experiments we
    used differential dynamic programming (a sketch
    of the loop follows these notes).
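Putting the steps together, here is a hedged sketch of the loop in Python.
It reuses the utility and add_time_indexed_bias helpers sketched above,
substitutes a finite-difference policy gradient for the DDP-based
improvement direction used in the experiments, and performs the line search
in the bias-corrected model (the poster does not spell out these details);
policy_fn(theta, s, t) and run_real_trial are assumed interfaces.

```python
import numpy as np

def improve_policy(theta, model, reward, s0, H, run_real_trial, policy_fn,
                   n_iters=10, step_sizes=(1.0, 0.3, 0.1, 0.03), eps=1e-4):
    """Iterate: run the current policy in real life, bias-correct the model
    so it is exact for that policy, and take a line-searched policy-gradient
    step computed in the corrected model."""
    theta = np.asarray(theta, dtype=float).copy()
    for _ in range(n_iters):
        # Step 2: execute the current policy and record the trajectory.
        real_states, real_actions = run_real_trial(
            lambda s, t: policy_fn(theta, s, t))
        # Step 3: make the model exact for the current policy.
        corrected = add_time_indexed_bias(model, real_states, real_actions)
        # Step 4: policy gradient in the corrected model (finite differences
        # stand in for the analytic gradient or DDP).
        def u(th):
            return utility(lambda s, t: policy_fn(th, s, t),
                           corrected, reward, s0, H)
        u0 = u(theta)
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            th = theta.copy()
            th[i] += eps
            grad[i] = (u(th) - u0) / eps
        # Step size chosen by a crude line search over a few candidates.
        theta = max((theta + a * grad for a in step_sizes), key=u)
    return theta
```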

7. Experiments
Videos are available.
  • Real RC Car
  • Control actions: throttle and steering.
  • We used DDP.
  • Our algorithm took 10 iterations.
  • Flight Simulator
  • We generated approximate models by randomly
    perturbing the 43 model parameters (a
    perturbation sketch follows this list).
  • All 4 standard fixed-wing control actions:
    throttle, ailerons, elevators, and rudder.
  • We used differential dynamic programming (DDP)
    for the model-based RL and to provide local
    policy improvements.
  • Our algorithm took 5 iterations.
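For the flight-simulator experiments, the approximate models come from
randomly perturbing the simulator's 43 model parameters. A minimal sketch of
one way to do this (multiplicative noise) is below; the noise magnitude,
distribution, and seed are illustrative assumptions, not the values used in
the experiments.

```python
import numpy as np

def perturb_parameters(nominal_params, rel_noise=0.3, seed=0):
    """Create an 'approximate model' by multiplicatively perturbing each
    nominal parameter by up to +/- rel_noise."""
    rng = np.random.default_rng(seed)
    p = np.asarray(nominal_params, dtype=float)
    return p * (1.0 + rel_noise * rng.uniform(-1.0, 1.0, size=p.shape))
```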

Results (figure panels: Open-Loop Turn, Circle
Maneuver, Figure-8 Maneuver):
  • Open-loop turn: 76% utility improvement over
    model-based RL.
  • Improvements over model-based RL: turn 97%,
    circle 88%, figure-8 67%.
  • 6. Theoretical Guarantees
  • Let the local policy improvement algorithm be the
    policy gradient.
  • Notes
  • These assumptions are insufficient to give the
    same performance guarantees for model-based RL.
  • The constant K depends only on the dimensionality
    of the state, the action, and the policy
    parameterization (θ), the horizon H, and an upper
    bound on the first and second derivatives of the
    transition model, the policy, and the reward
    function.

The fixed-wing flight simulator is available at
http://sourceforge.net/projects/aviones.
  • 9. Conclusion
  • We presented an algorithm that uses a crude
    model and a small number of real-life trials to
    find a policy that works well in real life.
  • Our theoretical results show that, assuming a
    deterministic setting and an approximate model,
    our algorithm returns a policy that is (locally)
    near-optimal.
  • Our experiments show that our algorithm can
    significantly improve on purely model-based RL
    by using only a small number of real-life trials,
    even when the true system is not deterministic.
  • 8. Related Work
  • Iterative learning control:
    Uchiyama (1978), Longman et al. (1992), Moore
    (1993), Horowitz (1993), Bien et al. (1991),
    Owens et al. (1995), Chen et al. (1997).
  • Successful robot control with a limited number
    of trials:
    Atkeson and Schaal (1997), Morimoto and Doya
    (2001).
  • Non-parametric learning:
    Atkeson et al. (1997).
  • Classical and robust control theory:
    Anderson and Moore (1989), Zhou et al. (1995),
    Bagnell et al. (2001), Morimoto and Atkeson
    (2002).