Hierarchical Reinforcement Learning Using Graphical Models

1
Hierarchical Reinforcement Learning Using
Graphical Models
  • Victoria Manfredi and Sridhar Mahadevan
  • Rich Representations for Reinforcement Learning
  • ICML 2005 Workshop
  • August 7, 2005

2
Introduction
  • Abstraction necessary to scale RL → hierarchical RL
  • Want to learn abstractions automatically
  • Other approaches
  • Find subgoals [McGovern & Barto 01; Simsek & Barto 04; Simsek, Wolfe & Barto 05; Mannor et al. 04]
  • Build policy hierarchy [Hengst 02]
  • Potentially proto-value functions [Mahadevan 05]
  • Our approach
  • Learn initial policy hierarchy using graphical
    model framework, then learn how to use policies
    using reinforcement learning and reward
  • Related to imitation
  • [Price & Boutilier 03; Abbeel & Ng 04]

3
Outline
  • Dynamic Abstraction Networks
  • Approach
  • Experiments
  • Results
  • Summary
  • Future Work

4
Dynamic Abstraction Network
HHMM [Fine, Singer & Tishby 98]; AHMM [Bui, Venkatesh & West 02]; DAN [Manfredi & Mahadevan 05]
Just one realization of a DAN; others are possible
5
Approach
6
DANs vs MAXQ/HAMs
  • DANs
  • # of levels in state/policy hierarchies
  • # of values for each (abstract) state/policy node
  • Training sequences of (flat state, action) pairs
  • MAXQ [Dietterich 00]
  • # of levels, # of tasks at each level
  • Connections between levels
  • Initiation set for each task
  • Termination set for each task
  • HAMs [Parr & Russell 98]
  • # of levels
  • Hierarchy of stochastic finite state machines
  • Explicit action, call, choice, stop states
  • (These required inputs are contrasted in the sketch below)
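
The comparison above can be made concrete with a small, purely illustrative sketch: the dictionaries and field names below are hypothetical (they do not come from any of these frameworks' implementations) and simply list the prior knowledge a designer would supply in each case.

```python
# Hypothetical sketch of the designer-supplied inputs for each framework.
# All names and fields are illustrative only.

dan_spec = {
    "num_levels": 2,                    # depth of the state/policy hierarchies
    "abstract_state_values": [5, 25],   # # of values for S1 and S0 (as in Phase 1)
    "abstract_policy_values": [6, 6],   # # of values for Pi1 and Pi0
    # plus training sequences of flat (state, action) pairs
}

maxq_spec = {
    "num_levels": 2,
    "tasks_per_level": [...],     # # of tasks at each level
    "task_graph": {...},          # connections between levels
    "initiation_sets": {...},     # where each task may be invoked
    "termination_sets": {...},    # where each task terminates
}

ham_spec = {
    "num_levels": 2,
    "machines": {...},            # hierarchy of stochastic finite state machines
                                  # with explicit action/call/choice/stop states
}
```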

7
Why Graphical Models?
  • Advantages of Graphical Models
  • Joint learning of multiple policy/state
    abstractions
  • Continuous/hidden domains
  • Full machinery of inference can be used
  • Disadvantages
  • Parameter learning with hidden variables is
    expensive
  • Expectation-Maximization can get stuck in local
    maxima

8
Domain
  • Dietterich's Taxi (2000); see the sketch below
  • States
  • Taxi Location (TL): 25 values
  • Passenger Location (PL): 5 values
  • Passenger Destination (PD): 5 values
  • Actions
  • North, South, East, West
  • Pickup, Putdown
  • Hand-coded policies
  • GotoRed
  • GotoGreen
  • GotoYellow
  • GotoBlue
  • Pickup, Putdown
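
As a concrete reference, here is a minimal Python sketch of the flat state and action spaces listed above (the variable names are mine; this is not code from the paper).

```python
from itertools import product

# Flat Taxi domain as listed on this slide (illustrative names only).
TAXI_LOCATIONS = range(25)         # TL: 25 cells of the 5x5 grid
PASSENGER_LOCATIONS = range(5)     # PL: 5 values
PASSENGER_DESTINATIONS = range(5)  # PD: 5 values

ACTIONS = ["North", "South", "East", "West", "Pickup", "Putdown"]
HAND_CODED_POLICIES = ["GotoRed", "GotoGreen", "GotoYellow", "GotoBlue",
                       "Pickup", "Putdown"]

# A flat state is a (TL, PL, PD) triple: 25 * 5 * 5 = 625 states.
STATES = list(product(TAXI_LOCATIONS, PASSENGER_LOCATIONS,
                      PASSENGER_DESTINATIONS))
assert len(STATES) == 625 and len(ACTIONS) == 6
```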

9
Experiments
  • Phase 1
  • |S1| = 5, |S0| = 25, |Π1| = 6, |Π0| = 6
  • 1000 sequences from an SMDP Q-learner
  • ⟨(TL, PL, PD, A)1, ..., (TL, PL, PD, A)n⟩
  • Bayes Net Toolbox [Murphy 01]
  • Phase 2 (see the sketch below)
  • SMDP Q-learning
  • Choose policy π1 using ε-greedy
  • Compute most likely abstract state s0 given TL, PL, PD
  • Select action π0 using Pr(Π0 | Π1 = π1, S0 = s0)
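
A minimal sketch of the Phase-2 decision loop described above, assuming two hypothetical helpers: `infer_abstract_state` (most likely S0 given TL, PL, PD) and `dan_conditional` (the learned distribution Pr(Π0 | Π1, S0)). This is illustrative pseudocode, not the authors' implementation.

```python
import random

def epsilon_greedy(Q, state, policies, epsilon=0.1):
    """Pick an abstract policy pi1 epsilon-greedily from the SMDP Q-table."""
    if random.random() < epsilon:
        return random.choice(policies)
    return max(policies, key=lambda p: Q.get((state, p), 0.0))

def phase2_step(Q, obs, policies, infer_abstract_state, dan_conditional):
    """One Phase-2 decision: choose pi1, infer s0, sample pi0 from the DAN."""
    tl, pl, pd = obs                          # flat observation (TL, PL, PD)
    pi1 = epsilon_greedy(Q, obs, policies)    # choose abstract policy pi1
    s0 = infer_abstract_state(tl, pl, pd)     # most likely abstract state s0
    dist = dan_conditional(pi1, s0)           # dict: pi0 -> Pr(pi0 | pi1, s0)
    pi0 = random.choices(list(dist), weights=list(dist.values()))[0]
    return pi1, s0, pi0                       # pi0 drives the executed action
```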

[Figure: Taxi DAN. A two-slice dynamic abstraction network with policy and action nodes, abstract state nodes S1 and S0, and termination nodes F1 and F0.]
10
Policy Improvement
  • A policy learned over the DAN policies performs well
  • Each plot is an average over 10 RL runs and 1 EM run

11
Policy Recognition

[Figure: DAN policy recognition. Legend: Initial Passenger Loc, Passenger Dest, Policy 1 ... Policy 6.]
  • Can (sometimes!) recognize a specific sequence of
    actions as composing a single policy

12
Summary
  • Two-phase method for automating hierarchical RL using graphical models
  • Advantages
  • Limited info needed (# of levels, # of values)
  • Permits continuous and partially observable states/actions
  • Disadvantages
  • EM is expensive
  • Needs a mentor
  • Abstractions learned can be hard to decipher (local maxima?)

13
Future Work
  • Approximate inference in DANs
  • [Saria & Mahadevan 04]: Rao-Blackwellized particle filtering for multi-agent AHMMs
  • [Johns & Mahadevan 05]: variational inference for AHMMs
  • Take advantage of ability to do inference in
    hierarchical RL phase
  • Incorporate reward in DAN

14
  • Thank You
  • Questions?

15
Abstract State Transitions S0
  • Regardless of which abstract policy Π0 is being executed, abstract S0 states self-transition with high probability
  • Depending on the abstract policy Π0, they may alternatively transition to one of a few other abstract S0 states
  • Similarly for abstract S1 states and abstract policies Π1

16
State Abstractions
The abstract state to which the agent is most likely to transition is a consequence, in part, of the learned state abstractions
17
Semi-MDP Q-learning
  • Q(s,o): activity-value for state s and activity o
  • α: learning rate
  • γ^τ: discount rate raised to the number of time steps τ that o took
  • r: accumulated discounted reward since o began
  • (The update rule is written out below)
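
For reference, the standard SMDP Q-learning update using the quantities defined above (this is the textbook form, assumed to match the authors' variant), where s' is the state in which activity o terminated after τ time steps:

```latex
Q(s,o) \;\leftarrow\; Q(s,o) + \alpha \Big[ r + \gamma^{\tau} \max_{o'} Q(s', o') - Q(s, o) \Big]
```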

18
Abstract State S1 Transitions
  • Abstract state S1 transitions under abstract policy Π1

19
Expectation-Maximization (EM)
  • Hidden variables and unknown parameters
  • E(xpectation)-step
  • Assume the parameters are known and compute the conditional expected values of the hidden variables
  • M(aximization)-step
  • Assume the hidden variables are observed and compute the maximizing (argmax) parameters
  • (Both steps are written out below)
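
Written out in the standard form (matching the E- and M-step descriptions above), for observations X, hidden variables Z, and parameters θ; here Q denotes the EM auxiliary function, not a Q-value:

```latex
\text{E-step:}\quad Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\big[\log p(X, Z \mid \theta)\big]
\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta}\; Q(\theta \mid \theta^{(t)})
```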

20
Abstract State S0 Transitions
  • Abstract state S0 transitions under abstract policy Π0