Apprenticeship Learning via Inverse Reinforcement Learning

Slides: 46
Provided by: OAO97
Learn more at: http://ai.stanford.edu

Transcript and Presenter's Notes


1
Apprenticeship Learning via Inverse
Reinforcement Learning
  • Pieter Abbeel
  • Stanford University
  • Joint work with Andrew Ng.

2
Overview
  • Reinforcement Learning (RL)
  • Motivation for Apprenticeship Learning
  • Proposed algorithm
  • Theoretical results
  • Experimental results
  • Conclusion

3
Example of Reinforcement Learning Problem
  • Highway driving.

4
RL formalism
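[Diagram: trajectory $s_0 \to s_1 \to s_2 \to \dots \to s_{T-1} \to s_T$ generated by the system dynamics, with rewards $R(s_0), R(s_1), R(s_2), \dots, R(s_{T-1}), R(s_T)$ collected along the way.]
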
  • Assume that at each time step, our system is in
    some state $s_t$.
  • Upon taking an action $a_t$, our system randomly
    transitions to some new state $s_{t+1}$.
  • We are also given a reward function $R$.
  • The goal: pick actions over time so as to
    maximize the expected sum of rewards
    $E[R(s_0) + R(s_1) + \dots + R(s_T)]$.
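
The objective above can be made concrete with a small sketch. The following is a minimal illustration, not code from the talk: it assumes a finite MDP given as nested lists (`P[s][a][s2]` and `R[s]` are hypothetical names) and uses backward induction to compute the largest achievable expected value of $R(s_0) + \dots + R(s_T)$ from each start state.

```python
def finite_horizon_values(P, R, T):
    """Return V with V[s] = max expected R(s_0) + ... + R(s_T) when s_0 = s.

    P[s][a][s2] is the probability of moving from s to s2 under action a.
    R[s] is the reward for being in state s.
    """
    n = len(R)
    V = list(R)  # with no steps left, only the final reward R(s_T) is collected
    for _ in range(T):
        # One backward-induction step: current reward plus best expected future value.
        V = [
            R[s] + max(
                sum(P[s][a][s2] * V[s2] for s2 in range(n))
                for a in range(len(P[s]))
            )
            for s in range(n)
        ]
    return V


# Hypothetical 2-state, 2-action MDP: action 0 stays put, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1
]
R = [0.0, 1.0]  # only state 1 is rewarding
values = finite_horizon_values(P, R, T=3)
```

In this toy example the best plan from state 0 is to switch once and then stay, which collects reward on three of the four time steps.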

5
RL formalism
  • Markov Decision Process $(S, A, P, s_0, R)$
  • W.l.o.g. we assume
  • Policy
  • Utility of a policy $\pi$ for reward $R = w^T \phi$
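
The expression that followed the last bullet did not survive in this transcript. A reconstruction consistent with the rest of the deck is sketched below. It assumes the undiscounted horizon-$T$ sum of rewards from the RL formalism slide and writes $\mu(\pi)$ for the feature expectations of $\pi$; the resulting identity $U_w(\pi) = w^T \mu(\pi)$ is the one that labels the QP-version figures later in the deck.

```latex
\begin{align*}
U_w(\pi) &= E\Big[\sum_{t=0}^{T} R(s_t) \,\Big|\, \pi\Big]
          = E\Big[\sum_{t=0}^{T} w^T \phi(s_t) \,\Big|\, \pi\Big] \\
         &= w^T E\Big[\sum_{t=0}^{T} \phi(s_t) \,\Big|\, \pi\Big]
          = w^T \mu(\pi),
\qquad \text{where } \mu(\pi) := E\Big[\sum_{t=0}^{T} \phi(s_t) \,\Big|\, \pi\Big].
\end{align*}
```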

6
Motivation for Apprenticeship Learning
  • Reinforcement learning (RL) gives powerful tools
    for solving MDPs, but it can be difficult to specify
    the reward function. Example: highway driving.

7
Apprenticeship Learning
  • Learning from observing an expert.
  • Previous work:
  • Learn to predict the expert's actions as a function
    of states.
  • Usually lacks strong performance guarantees.
  • (E.g., Pomerleau, 1989; Sammut et al., 1992;
    Kuniyoshi et al., 1994; Demiris & Hayes, 1994;
    Amit & Mataric, 2002; Atkeson & Schaal, 1997.)
  • Our approach:
  • Based on inverse reinforcement learning (Ng &
    Russell, 2000).
  • Returns a policy with performance as good as the
    expert's, as measured according to the expert's
    unknown reward function.

8
Algorithm
  • For $i = 1, 2, \dots$
  • Inverse RL step:
  • Estimate the expert's reward function $R(s) = w^T \phi(s)$
    such that under $R(s)$ the expert performs better
    than all previously found policies $\pi_j$.
  • RL step:
  • Compute the optimal policy $\pi_i$ for
    the estimated reward $w$.

9
Algorithm: Inverse RL step
10
Algorithm: Inverse RL step
Quadratic programming problem (same as for SVM).
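
The optimization problem shown on these two slides is not preserved in the transcript. As a hedged reconstruction, the max-margin program in the published Abbeel and Ng paper has the following form, where $\pi_0, \dots, \pi_{i-1}$ are the policies found so far and $t$ is the margin by which the expert outperforms them. Whether the slide used exactly this notation is an assumption.

```latex
\begin{align*}
\max_{t,\, w} \quad & t \\
\text{s.t.} \quad & w^T \mu(\pi_E) \ge w^T \mu(\pi_j) + t, \qquad j = 0, \dots, i-1, \\
& \|w\|_2 \le 1 .
\end{align*}
```

The norm constraint on $w$ is what makes the problem a quadratic program of the same type as the one solved when training a support vector machine.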
11
Algorithm
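[Figure: feature-expectation space with axes $\mu_1$ and $\mu_2$, showing the expert's $\mu(\pi_E)$, the iterates $\mu(\pi_0)$, $\mu(\pi_1)$, $\mu(\pi_2)$, and the weight vectors $w^{(1)}$, $w^{(2)}$, $w^{(3)}$.]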
12
Feature Expectation Closeness and Performance
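  • If we can find a policy $\pi$ such that
  • $\|\mu(\pi_E) - \mu(\pi)\|_2 \le \epsilon$,
  • then for any underlying reward $R(s) = w^T \phi(s)$,
  • we have that
  • $|U_w(\pi_E) - U_w(\pi)| = |w^T \mu(\pi_E) - w^T \mu(\pi)|$
  • $\le \|w\|_2 \, \|\mu(\pi_E) - \mu(\pi)\|_2$
  • $\le \epsilon$ (using $\|w\|_2 \le 1$).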

13
Theoretical Results: Convergence
  • Theorem. Let an MDP (without reward function), a
    k-dimensional feature vector $\phi$ and the expert's
    feature expectations $\mu(\pi_E)$ be given. Then after
    at most
  • $k T^2 / \epsilon^2$
  • iterations, the algorithm outputs a policy $\pi$
    that performs nearly as well as the expert, as
    evaluated on the unknown reward function
    $R(s) = w^T \phi(s)$, i.e.,
  • $U_w(\pi) \ge U_w(\pi_E) - \epsilon$.

14
Algorithm (projection version)
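[Figure: projection version in feature-expectation space with axes $\mu_1$ and $\mu_2$, showing $\mu(\pi_E)$, the iterates $\mu(\pi_0)$, $\mu(\pi_1)$, $\mu(\pi_2)$, the projection points $\bar\mu^{(1)}$, $\bar\mu^{(2)}$, and the weight vectors $w^{(1)}$, $w^{(2)}$, $w^{(3)}$.]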
15
Theoretical Results: Sampling
  • In practice, we have to use sampling to estimate
    the feature expectations of the expert. We still
    have $\epsilon$-optimal performance with high probability
    if the number of observed samples is at least
  • $O(\mathrm{poly}(k, 1/\epsilon))$.
  • Note: the bound has no dependence on the
    complexity of the policy.
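
The sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: trajectories are lists of states, `phi` is a hypothetical feature function returning a list of floats, and the sum over time is undiscounted, matching the finite-horizon objective on the RL formalism slide.

```python
def estimate_feature_expectations(trajectories, phi):
    """Empirical feature expectations: the per-trajectory sum of phi(s_t),
    averaged over the observed trajectories."""
    k = len(phi(trajectories[0][0]))
    mu_hat = [0.0] * k
    for trajectory in trajectories:
        for state in trajectory:
            features = phi(state)
            for i in range(k):
                mu_hat[i] += features[i]
    m = len(trajectories)
    return [total / m for total in mu_hat]


# Hypothetical example: states are lane indices 0..2, features are one-hot lane indicators.
def phi(state):
    return [1.0 if state == lane else 0.0 for lane in range(3)]


demos = [[0, 1, 1, 2], [1, 1, 1, 1]]  # two made-up expert demonstrations
mu_hat = estimate_feature_expectations(demos, phi)
```

Only the $k$ entries of the feature-expectation vector are estimated here, never the policy itself, which is one way to read the remark that the bound does not depend on the complexity of the policy.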

16
Gridworld Experiments
Reward function is piecewise constant over small
regions. The features $\phi$ for IRL are indicators of
these small regions.
128x128 grid, small regions of size 16x16.
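
A minimal sketch of such region features, assuming row/column grid coordinates and a row-major numbering of the regions (both are illustrative choices, not taken from the slides). A 128x128 grid split into 16x16 regions gives 8 regions per side, so 64 indicator features.

```python
GRID_SIZE = 128
REGION_SIZE = 16
REGIONS_PER_SIDE = GRID_SIZE // REGION_SIZE  # 8 per side, 64 regions in total


def region_index(row, col):
    """Index of the 16x16 region containing cell (row, col), numbered row-major."""
    return (row // REGION_SIZE) * REGIONS_PER_SIDE + (col // REGION_SIZE)


def phi(row, col):
    """Indicator feature vector: 1.0 for the region containing the cell, else 0.0."""
    features = [0.0] * (REGIONS_PER_SIDE * REGIONS_PER_SIDE)
    features[region_index(row, col)] = 1.0
    return features


def reward(w, row, col):
    """R(s) = w^T phi(s); with indicator features this is the weight of the cell's region."""
    return w[region_index(row, col)]
```

With indicator features, a reward of the form $w^T \phi(s)$ is automatically piecewise constant over the regions, which is why these features can represent the reward described above.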
17
Gridworld Experiments
18
Gridworld Experiments
19
Gridworld Experiments
20
Gridworld Experiments
21
Case study: Highway driving
Input: driving demonstration (left panel)
Output: learned behavior (right panel)
The only input to the learning algorithm was the
driving demonstration (left panel). No reward
function was provided.
22
More driving examples
In each video, the left sub-panel shows a
demonstration of a different driving style, and
the right sub-panel shows the behavior learned
from watching the demonstration.
23
Car driving results
Style  Quantity         Collision  Left Shoulder  Left Lane  Middle Lane  Right Lane  Right Shoulder
1      $\mu$ (expert)   0          0              0.13       0.20         0.60        0.07
1      $\mu$ (learned)  0          0              0.09       0.23         0.60        0.08
1      $w$ (learned)    -0.08      -0.04          0.01       0.01         0.03        -0.01
2      $\mu$ (expert)   0.12       0              0.06       0.47         0.47        0
2      $\mu$ (learned)  0.13       0              0.10       0.32         0.58        0
2      $w$ (learned)    0.23       -0.11          0.01       0.05         0.06        -0.01
3      $\mu$ (expert)   0          0              0          0.01         0.70        0.29
3      $\mu$ (learned)  0          0              0          0            0.74        0.26
3      $w$ (learned)    -0.11      -0.01          -0.06      -0.04        0.09        0.01
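
As a small worked check, the numbers for driving style 1 can be plugged into the closeness-implies-performance bound from the earlier slide. This sketch treats the table rows as feature-expectation vectors and uses the learned weights in place of the unknown true ones; both are assumptions made for illustration only.

```python
import math

# Driving style 1, values copied from the results table.
mu_expert = [0.0, 0.0, 0.13, 0.20, 0.60, 0.07]
mu_learned = [0.0, 0.0, 0.09, 0.23, 0.60, 0.08]
w_learned = [-0.08, -0.04, 0.01, 0.01, 0.03, -0.01]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm2(u):
    return math.sqrt(dot(u, u))


diff = [a - b for a, b in zip(mu_expert, mu_learned)]
feature_gap = norm2(diff)                # ||mu_E - mu||_2
utility_gap = abs(dot(w_learned, diff))  # |w^T mu_E - w^T mu|
bound = norm2(w_learned) * feature_gap   # Cauchy-Schwarz upper bound on utility_gap
```

The feature gap comes out at roughly 0.05, and the utility gap under the learned weights is far below the Cauchy-Schwarz bound.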
24
Different Formulation
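  • LP formulation for RL problem:
  • $\max_{\lambda} \sum_{s,a} \lambda(s,a) R(s)$
  • s.t.
  • $\forall s: \; \sum_a \lambda(s,a) = \sum_{s',a} P(s \mid s',a) \, \lambda(s',a)$
  • QP formulation for Apprenticeship Learning:
  • $\min_{\lambda, \mu} \sum_i (\mu_{E,i} - \mu_i)^2$
  • s.t.
  • $\forall s: \; \sum_a \lambda(s,a) = \sum_{s',a} P(s \mid s',a) \, \lambda(s',a)$
  • $\forall i: \; \mu_i = \sum_{s,a} \phi_i(s) \, \lambda(s,a)$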

25
Different Formulation (ctd.)
  • Our algorithm is equivalent to iteratively
  • linearizing the QP at the current point (Inverse RL
    step),
  • solving the resulting LP (RL step).
  • Why not solve the QP directly? That is typically only
    possible for very small toy problems (curse of
    dimensionality). Our algorithm makes use of
    existing RL solvers to deal with the curse of
    dimensionality.
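
The linearization claim can be spelled out in a few lines. The sketch below is a reconstruction rather than material from the slides: it writes $\bar\mu$ for the current point and $\lambda(s,a)$ for the state-action variables of the LP (the symbol is an assumption).

```latex
\begin{align*}
f(\mu) &= \sum_i (\mu_{E,i} - \mu_i)^2 = \|\mu_E - \mu\|_2^2,
\qquad \nabla f(\bar\mu) = -2\,(\mu_E - \bar\mu), \\
f(\mu) &\approx f(\bar\mu) - 2\,(\mu_E - \bar\mu)^T (\mu - \bar\mu).
\end{align*}
% Minimizing the linear approximation over the feasible set is the same as
\begin{align*}
\max_{\lambda} \; w^T \mu
  = \max_{\lambda} \sum_{s,a} \lambda(s,a) \, w^T \phi(s),
\qquad w = \mu_E - \bar\mu ,
\end{align*}
% which is the LP for an RL problem with reward R(s) = w^T \phi(s).
```

Under this reading, forming $w = \mu_E - \bar\mu$ plays the role of the Inverse RL step and solving the resulting LP plays the role of the RL step.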

26
Conclusions
  • Our algorithm returns a policy with performance
    as good as the expert's (up to $\epsilon$), as evaluated
    according to the expert's unknown reward function.
  • The algorithm is guaranteed to converge in
    $\mathrm{poly}(k, 1/\epsilon)$ iterations.
  • Sample complexity: $\mathrm{poly}(k, 1/\epsilon)$.
  • The algorithm exploits reward simplicity (vs.
    policy simplicity in previous approaches).

27
Proof (sketch)
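[Figure: geometric proof sketch in feature-expectation space with axes $\mu_1$ and $\mu_2$, showing $\mu(\pi_E)$, $\mu(\pi_0)$, $\mu(\pi_1)$, the projection point $\bar\mu^{(1)}$, the weight vector $w^{(1)}$, and the distances $d_0$ and $d_1$.]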
28
Proof (sketch)
29
More driving examples
30
Additional slides for poster
  • (The slides that follow are additional material not
    included in the talk, in particular: the projection
    (vs. QP) version of the Inverse RL step, and another
    formulation of the apprenticeship learning
    problem together with its relation to our algorithm.)

31
Simplification of Inverse RL step: QP $\to$ Euclidean
projection
  • In the Inverse RL step:
  • set $\bar\mu^{(i-1)}$ = orthogonal projection of $\mu_E$ onto
    the line through $\bar\mu^{(i-2)}$ and $\mu(\pi^{(i-1)})$
  • set $w^{(i)} = \mu_E - \bar\mu^{(i-1)}$
  • Note: the theoretical results on convergence and
    sample complexity hold unchanged for the simpler
    algorithm.
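
The projection version can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the MDP is finite and given as nested lists, the RL step is exact finite-horizon backward induction, feature expectations are computed exactly rather than sampled, and all function and variable names are hypothetical.

```python
import math


def optimal_policy(P, phi, w, T):
    """RL step: time-dependent optimal policy for R(s) = w . phi(s) over horizon T."""
    n = len(P)
    R = [sum(wi * fi for wi, fi in zip(w, phi[s])) for s in range(n)]
    V = list(R)
    policy = []  # policy[t][s] = action taken in state s at time t
    for _ in range(T):
        q = [[sum(p * V[s2] for s2, p in enumerate(P[s][a]))
              for a in range(len(P[s]))] for s in range(n)]
        step = [max(range(len(q[s])), key=lambda a: q[s][a]) for s in range(n)]
        V = [R[s] + q[s][step[s]] for s in range(n)]
        policy.insert(0, step)  # backward induction: earlier time steps go in front
    return policy


def feature_expectations(P, phi, policy, s0):
    """mu(pi) = E[sum_t phi(s_t) | pi], by propagating the state distribution."""
    n, k = len(P), len(phi[0])
    dist = [0.0] * n
    dist[s0] = 1.0
    mu = [0.0] * k
    for t in range(len(policy) + 1):
        for s in range(n):
            for i in range(k):
                mu[i] += dist[s] * phi[s][i]
        if t < len(policy):
            nxt = [0.0] * n
            for s in range(n):
                for s2, p in enumerate(P[s][policy[t][s]]):
                    nxt[s2] += dist[s] * p
            dist = nxt
    return mu


def projection_algorithm(P, phi, mu_E, s0, T, eps, max_iter=50):
    k = len(mu_E)
    policies = [optimal_policy(P, phi, [0.0] * k, T)]  # arbitrary initial policy
    mu_bar = feature_expectations(P, phi, policies[0], s0)
    for _ in range(max_iter):
        w = [e - b for e, b in zip(mu_E, mu_bar)]  # w = mu_E - mu_bar
        if math.sqrt(sum(x * x for x in w)) <= eps:
            break
        policies.append(optimal_policy(P, phi, w, T))  # RL step
        mu = feature_expectations(P, phi, policies[-1], s0)
        d = [m - b for m, b in zip(mu, mu_bar)]
        dd = sum(x * x for x in d)
        if dd == 0.0:
            break  # the new policy does not move mu_bar
        # Orthogonal projection of mu_E onto the line through mu_bar and mu.
        scale = sum(x * y for x, y in zip(d, w)) / dd
        mu_bar = [b + scale * x for b, x in zip(mu_bar, d)]
    return policies, mu_bar


# Hypothetical 2-state MDP: action 0 stays, action 1 switches; one-hot state features.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
phi = [[1.0, 0.0], [0.0, 1.0]]
mu_E = [1.0, 3.0]  # made-up expert: starts in state 0, moves to state 1 and stays there
policies, mu_bar = projection_algorithm(P, phi, mu_E, s0=0, T=3, eps=1e-6)
```

Note that `mu_bar` is a combination of the feature expectations of the policies found so far, so it generally corresponds to a mixture of those policies rather than to the last one. Turning the returned list into a single policy needs an extra selection or mixing step that is not shown here.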

32
Algorithm (projection version)
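[Figure: projection version, first build. Axes $\mu_1$ and $\mu_2$; shows $\mu_E$, $\mu(\pi_0)$, $\mu(\pi_1)$ and the weight vector $w^{(1)}$.]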
33
Algorithm (projection version)
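[Figure: projection version, second build. Axes $\mu_1$ and $\mu_2$; adds $\mu(\pi_2)$, the projection point $\bar\mu^{(1)}$ and the weight vector $w^{(2)}$ to $\mu_E$, $\mu(\pi_0)$, $\mu(\pi_1)$ and $w^{(1)}$.]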
34
Algorithm (projection version)
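[Figure: projection version, third build. Axes $\mu_1$ and $\mu_2$; shows $\mu_E$, $\mu(\pi_0)$, $\mu(\pi_1)$, $\mu(\pi_2)$, the projection points $\bar\mu^{(1)}$, $\bar\mu^{(2)}$, and the weight vectors $w^{(1)}$, $w^{(2)}$, $w^{(3)}$.]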
35
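Appendix: Different View
  • Bellman LP for solving MDPs:
  • $\min_V \; c^T V$ s.t.
  • $\forall s, a: \; V(s) \ge R(s,a) + \gamma \sum_{s'} P(s,a,s') \, V(s')$
  • Dual LP:
  • $\max_{\lambda} \sum_{s,a} \lambda(s,a) R(s,a)$ s.t.
  • $\forall s: \; c(s) - \sum_a \lambda(s,a) + \gamma \sum_{s',a} P(s',a,s) \, \lambda(s',a) = 0$
  • Apprenticeship Learning as QP:
  • $\min_{\lambda} \sum_i \big(\mu_{E,i} - \sum_{s,a} \lambda(s,a) \, \phi_i(s)\big)^2$ s.t.
  • $\forall s: \; c(s) - \sum_a \lambda(s,a) + \gamma \sum_{s',a} P(s',a,s) \, \lambda(s',a) = 0$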

36
Different View (ctd.)
  • Our algorithm is equivalent to iteratively
  • linearizing the QP at the current point (Inverse RL step),
  • solving the resulting LP (RL step).
  • Why not solve the QP directly? That is typically only
    possible for very small toy problems (curse of
    dimensionality). Our algorithm makes use of
    existing RL solvers to deal with the curse of
    dimensionality.

37
Slides that are different for poster
  • (slides to come are slightly different for
    poster, but already appeared earlier)

38
Algorithm (QP version)
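[Figure: QP version, first build. Axes $\mu_1$ and $\mu_2$; shows $\mu(\pi_E)$, $\mu(\pi_0)$, $\mu(\pi_1)$ and the weight vector $w^{(1)}$, with the label $U_w(\pi) = w^T \mu(\pi)$.]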
39
Algorithm (QP version)
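[Figure: QP version, second build. Axes $\mu_1$ and $\mu_2$; adds $\mu(\pi_2)$ and the weight vector $w^{(2)}$ to $\mu(\pi_E)$, $\mu(\pi_0)$, $\mu(\pi_1)$ and $w^{(1)}$, with the label $U_w(\pi) = w^T \mu(\pi)$.]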
40
Algorithm (QP version)
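[Figure: QP version, third build. Axes $\mu_1$ and $\mu_2$; shows $\mu(\pi_E)$, $\mu(\pi_0)$, $\mu(\pi_1)$, $\mu(\pi_2)$ and the weight vectors $w^{(1)}$, $w^{(2)}$, $w^{(3)}$, with the label $U_w(\pi) = w^T \mu(\pi)$.]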
41
Gridworld Experiments
42
Case study: Highway driving
Input: driving demonstration
Output: learned behavior
(Videos available.)
43
More driving examples
(Videos available.)
44
Car driving results (more detail)
Style  Quantity                Collision  Offroad Left  Left Lane  Middle Lane  Right Lane  Offroad Right
1      Feature Distr. Expert   0          0             0.1325     0.2033       0.5983      0.0658
1      Feature Distr. Learned  5.00E-05   0.0004        0.0904     0.2286       0.604       0.0764
1      Weights Learned         -0.0767    -0.0439       0.0077     0.0078       0.0318      -0.0035
2      Feature Distr. Expert   0.1167     0             0.0633     0.4667       0.47        0
2      Feature Distr. Learned  0.1332     0             0.1045     0.3196       0.5759      0
2      Weights Learned         0.234      -0.1098       0.0092     0.0487       0.0576      -0.0056
3      Feature Distr. Expert   0          0             0          0.0033       0.7058      0.2908
3      Feature Distr. Learned  0          0             0          0            0.7447      0.2554
3      Weights Learned         -0.1056    -0.0051       -0.0573    -0.0386      0.0929      0.0081
4      Feature Distr. Expert   0.06       0             0          0.0033       0.2908      0.7058
4      Feature Distr. Learned  0.0569     0             0          0            0.2666      0.7334
4      Weights Learned         0.1079     -0.0001       -0.0487    -0.0666      0.059       0.0564
5      Feature Distr. Expert   0.06       0             0          1            0           0
5      Feature Distr. Learned  0.0542     0             0          1            0           0
5      Weights Learned         0.0094     -0.0108       -0.2765    0.8126       -0.51       -0.0153
45
Proof (sketch)
46
Apprenticeship Learning via Inverse
Reinforcement Learning
  • Pieter Abbeel and Andrew Y. Ng
  • Stanford University