1
Apprenticeship learning for robotic control
Pieter Abbeel, Stanford University.
Joint work with Andrew Y. Ng, Adam Coates, and Morgan Quigley.

2
This talk
Reinforcement learning
Reward function R
Control policy π
Recurring theme: apprenticeship learning.
3
Motivation
  • In practice, reward functions are hard to specify,
    and people tend to tweak them a lot.
  • Motivating example: helicopter tasks, e.g., flips.
  • Another motivating example: highway driving.

4
Apprenticeship Learning
  • Learning from observing an expert.
  • Previous work:
    • Learn to predict the expert's actions as a function
      of states.
    • Usually lacks strong performance guarantees.
    • (E.g., Pomerleau, 1989; Sammut et al., 1992;
      Kuniyoshi et al., 1994; Demiris & Hayes, 1994;
      Amit & Mataric, 2002; Atkeson & Schaal, 1997.)
  • Our approach:
    • Based on inverse reinforcement learning (Ng &
      Russell, 2000).
    • Returns a policy with performance as good as the
      expert's, as measured according to the expert's
      unknown reward function.
    • Most closely related work: Ratliff et al., 2005,
      2006.

5
Algorithm
  • For t = 1, 2, …
  • Inverse RL step:
    Estimate the expert's reward function R(s) = wᵀφ(s)
    such that under R(s) the expert performs better
    than all previously found policies πi.
  • RL step:
    Compute the optimal policy πt for
    the estimated reward w.

(Abbeel & Ng, 2004)
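To make the alternation concrete, here is a minimal Python sketch of the loop above, under assumed interfaces: random_policy, solve_irl_step, solve_mdp, and estimate_feature_expectations are placeholder names, not code from the talk (rough sketches of solve_irl_step and estimate_feature_expectations follow on later slides).

```python
# Minimal sketch of the alternating IRL/RL loop (assumed interfaces, not the
# original implementation). random_policy, solve_irl_step, solve_mdp, and
# estimate_feature_expectations are placeholders.

def apprenticeship_learning(mdp, phi, mu_expert, max_iters=50, eps=0.1):
    policies, feature_exps = [], []
    pi0 = random_policy(mdp)                      # arbitrary starting policy
    policies.append(pi0)
    feature_exps.append(estimate_feature_expectations(mdp, pi0, phi))

    for t in range(1, max_iters + 1):
        # Inverse RL step: reward weights w under which the expert outperforms
        # every previously found policy by the largest possible margin.
        w, margin = solve_irl_step(mu_expert, feature_exps)
        if margin <= eps:
            break                                 # expert no longer clearly better
        # RL step: (approximately) optimal policy for the reward R(s) = w.phi(s).
        pi_t = solve_mdp(mdp, reward_weights=w)
        policies.append(pi_t)
        feature_exps.append(estimate_feature_expectations(mdp, pi_t, phi))
    return policies
```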
6
Algorithm: IRL step
  • Maximize γ over γ, w with ‖w‖₂ ≤ 1
    s.t. Uw(πE) ≥ Uw(πi) + γ,  i = 1, …, t-1.
  • γ = margin of the expert's performance over the
    performance of previously found policies.
  • Uw(π) = E[Σt=1..T R(st) | π] = E[Σt=1..T wᵀφ(st) | π]
          = wᵀ E[Σt=1..T φ(st) | π]
          = wᵀ μ(π)
  • μ(π) = E[Σt=1..T φ(st) | π] are the feature
    expectations.

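As a concrete (assumed, not from the talk) rendering of this step, the max-margin program can be written as a small convex problem, e.g. with cvxpy; mu_expert and mu_policies stand for μ(πE) and the μ(πi) of previously found policies.

```python
import numpy as np
import cvxpy as cp

def solve_irl_step(mu_expert, mu_policies):
    """Max-margin IRL step: maximize gamma subject to
    w.mu(piE) >= w.mu(pii) + gamma for all i, and ||w||_2 <= 1."""
    mu_expert = np.asarray(mu_expert, dtype=float)
    w = cp.Variable(mu_expert.shape[0])
    gamma = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    for mu_i in mu_policies:
        constraints.append(w @ mu_expert >= w @ np.asarray(mu_i, dtype=float) + gamma)
    cp.Problem(cp.Maximize(gamma), constraints).solve()
    return w.value, gamma.value
```

The returned gamma is the margin used as the stopping test in the loop sketched on the previous slide.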
7
Feature Expectation Closeness and Performance
  • If we can find a policy π such that
    ‖μ(πE) - μ(π)‖₂ ≤ ε,
  • then for any underlying reward R(s) = wᵀφ(s)
    with ‖w‖₂ ≤ 1, we have that
    |Uw(πE) - Uw(π)| = |wᵀμ(πE) - wᵀμ(π)|
                     ≤ ‖w‖₂ ‖μ(πE) - μ(π)‖₂
                     ≤ ε.
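The feature expectations μ(π) themselves can be estimated by Monte Carlo rollouts. The sketch below assumes a generic environment with reset/step methods, a policy(s) function, and a feature map phi(s); none of these interfaces come from the talk.

```python
import numpy as np

def estimate_feature_expectations(env, policy, phi, T=100, n_rollouts=100):
    """Monte Carlo estimate of mu(pi) = E[ sum_{t=1..T} phi(s_t) | pi ]."""
    mu = None
    for _ in range(n_rollouts):
        s = env.reset()
        total = np.zeros_like(phi(s), dtype=float)
        for _ in range(T):
            total += phi(s)
            s = env.step(policy(s))               # follow pi for T steps
        mu = total if mu is None else mu + total
    return mu / n_rollouts
```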

8
Theoretical Results: Convergence
  • Theorem. Let an MDP (without reward function), a
    k-dimensional feature vector φ and the expert's
    feature expectations μ(πE) be given. Then after
    at most
      kT²/ε²
    iterations, the algorithm outputs a policy π
    that performs nearly as well as the expert, as
    evaluated on the unknown reward function
    R(s) = wᵀφ(s), i.e.,
      Uw(π) ≥ Uw(πE) - ε.
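For a feel of the bound's scale, a quick plug-in with arbitrary illustrative numbers (not values from the talk):

```python
k, T, eps = 10, 100, 1.0          # features, horizon, target accuracy (illustrative)
print(int(k * T**2 / eps**2))     # worst-case iteration bound: 100000
```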

9
Case study: highway driving
Input: driving demonstration
Output: learned behavior
The only input to the learning algorithm was the
driving demonstration (left panel). No reward
function was provided.
10
More driving examples
In each video, the left sub-panel shows a
demonstration of a different driving style, and
the right sub-panel shows the behavior learned
from watching the demonstration.
11
Inverse reinforcement learning summary
  • Our algorithm returns a policy with performance
    as good as the expert's, as evaluated according to
    the expert's unknown reward function.
  • The algorithm is guaranteed to converge in
    poly(k, 1/ε) iterations.
  • The algorithm exploits reward simplicity (vs.
    policy simplicity in previous approaches).

12
The dynamics model

13
Collecting data to learn the dynamics model

14
Learning the dynamics model Psa from data
Estimate Psa from data.
For example, in discrete-state problems, estimate
Psa(s') as the fraction of times you transitioned
to state s' after taking action a in state s.
Challenge: collecting enough data to guarantee
that you can model the entire flight envelope.
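A small sketch of the counting estimator just described, for the discrete-state case; the (s, a, s') transition-tuple format and the uniform fallback for unvisited pairs are assumptions of this sketch, not details from the talk.

```python
import numpy as np

def estimate_transition_model(transitions, n_states, n_actions):
    """Estimate P_sa(s') as the empirical fraction of (s, a) -> s' transitions."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs get a uniform distribution as a placeholder.
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), 1.0 / n_states)
```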
15
Collecting data to learn the dynamics model
  • State-of-the-art: the E3 algorithm (Kearns and Singh,
    2002).
  • Do we have a good model of the dynamics?
    If YES: exploit. If NO: explore.
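Schematically (this is only a caricature of the explore/exploit switch, not the actual E3 algorithm, which works with known-state MDPs and explicit exploration policies); model_confidence, exploit_policy, and explore_policy are placeholder callables:

```python
def choose_action(state, model_confidence, exploit_policy, explore_policy,
                  threshold=0.9):
    if model_confidence(state) >= threshold:
        return exploit_policy(state)      # dynamics well modeled here: exploit
    return explore_policy(state)          # dynamics poorly modeled: explore
```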
16
Aggressive exploration (Manual flight)
Aggressively exploring the edges of the flight
envelope isn't always a good idea.
17
Learning the dynamics
[Diagram] Trajectories (a1, s1, a2, s2, a3, s3, …) from expert
human pilot flight and from autonomous flight are used to fit
the dynamics model Psa; reinforcement learning with the reward
function R then yields the control policy π.
18
Apprenticeship learning of the model
  • Theorem. Suppose that we obtain m = O(poly(|S|, |A|,
    T, 1/ε)) examples from a human expert
    demonstrating the task. Then, after a polynomial
    number k of iterations of testing/re-learning,
    with high probability we will obtain a policy π
    whose performance is comparable to the expert's:

      U(π) ≥ U(πE) - ε.

Thus, so long as a demonstration is available, it
isn't necessary to explicitly explore. In
practice, k = 1 or 2 is almost always
enough. (Abbeel & Ng, 2005)
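A minimal sketch of the testing/re-learning loop the theorem refers to; fit_dynamics_model, solve_mdp, fly_and_record, and comparable_to_expert are placeholder names (assumed interfaces, not code from the talk).

```python
def learn_from_demonstration(expert_trajectories, real_system, max_rounds=5):
    data = list(expert_trajectories)              # start from pilot demonstrations
    policy = None
    for _ in range(max_rounds):
        model = fit_dynamics_model(data)          # (re-)estimate P_sa from all data
        policy = solve_mdp(model)                 # RL step in the learned simulator
        trajectory = fly_and_record(real_system, policy)   # test on the real system
        if comparable_to_expert(trajectory):      # U(pi) >= U(piE) - eps: done
            break
        # A failed flight visits poorly modeled parts of the flight envelope,
        # exactly the data the model needs (see the proof idea on the next slide).
        data.append(trajectory)
    return policy
```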
19
Proof idea
  • From the initial pilot demonstrations, our
    model/simulator Psa will be accurate for the part
    of the flight envelope (s,a) visited by the pilot.
  • Our model/simulator will correctly predict the
    helicopter's behavior under the pilot's policy
    πE.
  • Consequently, there is at least one policy
    (namely πE) that looks like it's able to fly the
    helicopter in our simulation.
  • Thus, each time we solve the MDP using the
    current simulator Psa, we will find a policy that
    successfully flies the helicopter according to
    Psa.
  • If, on the actual helicopter, this policy fails
    to fly the helicopter---despite the model Psa
    predicting that it should---then it must be
    visiting parts of the flight envelope that the
    model fails to model accurately.
  • Hence, this gives useful training data to model
    new parts of the flight envelope.

20
Configurations flown (exploitation only)
21
Tail-in funnel
22
Nose-in funnel
23
In-place rolls
24
In-place flips
25
Acknowledgements
Andrew Ng, Adam Coates, Morgan Quigley
26
Thank You!