Title: Hierarchical Reinforcement Learning Using Graphical Models
1 Hierarchical Reinforcement Learning Using Graphical Models
- Victoria Manfredi and Sridhar Mahadevan
- Rich Representations for Reinforcement Learning
- ICML05 Workshop
- August 7, 2005
2 Introduction
- Abstraction necessary to scale RL → hierarchical RL
- Want to learn abstractions automatically
- Other approaches
  - Find subgoals (McGovern & Barto 01; Simsek & Barto 04; Simsek, Wolfe & Barto 05; Mannor et al. 04)
  - Build policy hierarchy (Hengst 02)
  - Potentially proto-value functions (Mahadevan 05)
- Our approach
  - Learn an initial policy hierarchy using a graphical model framework, then learn how to use the policies via reinforcement learning and reward
  - Related to imitation (Price & Boutilier 03; Abbeel & Ng 04)
3 Outline
- Dynamic Abstraction Networks
- Approach
- Experiments
- Results
- Summary
- Future Work
4 Dynamic Abstraction Network
- HHMM (Fine, Singer & Tishby 98), AHMM (Bui, Venkatesh & West 02), DAN (Manfredi & Mahadevan 05)
- Just one realization of a DAN; others are possible
5 Approach
6 DANs vs MAXQ/HAMs
- DANs
  - # of levels in state/policy hierarchies
  - # of values for each (abstract) state/policy node
  - Training sequences of (flat state, action) pairs
- MAXQ (Dietterich 00)
  - # of levels, # of tasks at each level
  - Connections between levels
  - Initiation set for each task
  - Termination set for each task
- HAMs (Parr & Russell 98)
  - # of levels
  - Hierarchy of stochastic finite state machines
  - Explicit action, call, choice, stop states
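To make the comparison concrete, here is a hypothetical, minimal sketch of what a DAN specification amounts to, using the cardinalities from the taxi experiment later in the talk; the dictionary keys are illustrative, not an API from the paper.

```python
# Hypothetical DAN specification: only the number of hierarchy levels and the
# cardinality of each abstract state/policy node is supplied up front.
dan_spec = {
    "num_levels": 2,
    "abstract_state_values": {"S1": 5, "S0": 25},
    "policy_values": {"P1": 6, "P0": 6},
    # Training data: sequences of (flat state, action) pairs from a mentor.
}

# A MAXQ decomposition, by contrast, also requires the task graph itself:
# the subtasks at each level, how they connect, and each subtask's
# initiation and termination sets.
```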
7 Why Graphical Models?
- Advantages of graphical models
  - Joint learning of multiple policy/state abstractions
  - Continuous/hidden domains
  - Full machinery of inference can be used
- Disadvantages
  - Parameter learning with hidden variables is expensive
  - Expectation-Maximization can get stuck in local maxima
8 Domain
- Dietterich's Taxi (2000)
- States
  - Taxi Location (TL): 25
  - Passenger Location (PL): 5
  - Passenger Destination (PD): 5
- Actions
  - North, South, East, West
  - Pickup, Putdown
- Hand-coded policies
  - GotoRed
  - GotoGreen
  - GotoYellow
  - GotoBlue
  - Pickup, Putdown
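As a concrete illustration, a minimal sketch of how the flat taxi state and the action/policy sets above could be encoded; the names TaxiState, PRIMITIVE_ACTIONS, and HAND_CODED_POLICIES are assumptions for illustration, not from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxiState:
    taxi_location: int          # TL: one of 25 grid cells
    passenger_location: int     # PL: one of 5 values
    passenger_destination: int  # PD: one of 5 values

PRIMITIVE_ACTIONS = ("North", "South", "East", "West", "Pickup", "Putdown")

# Hand-coded mentor policies used to generate training sequences for Phase 1.
HAND_CODED_POLICIES = ("GotoRed", "GotoGreen", "GotoYellow", "GotoBlue",
                       "Pickup", "Putdown")
```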
9 Experiments
- Phase 1
  - |S1| = 5, |S0| = 25, |P1| = 6, |P0| = 6
  - 1000 sequences from an SMDP Q-learner
  - (TL, PL, PD, A)1, ..., (TL, PL, PD, A)n
  - Bayes Net Toolbox (Murphy 01)
- Phase 2 (a code sketch follows the DAN figure below)
  - SMDP Q-learning
  - Choose policy π1 using ε-greedy
  - Compute most likely abstract state s0 given TL, PL, PD
  - Select action π0 using Pr(P0 | P1 = π1, S0 = s0)
Figure: Taxi DAN, shown as a two-slice dynamic Bayesian network with policy nodes, action nodes, abstract state nodes S1 and S0, and termination nodes F, F1, F0.
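A minimal sketch of one Phase 2 decision point as described above, assuming hypothetical helpers dan_most_likely_s0 and dan_action_dist that wrap inference in the learned DAN; the function name and interfaces are illustrative, not from the talk.

```python
import random

def phase2_decision(q_table, flat_state, policies, epsilon,
                    dan_most_likely_s0, dan_action_dist):
    """One decision point of Phase 2: SMDP Q-learning over the DAN's policies."""
    # Choose an abstract policy pi1 epsilon-greedily from the SMDP Q-values.
    if random.random() < epsilon:
        pi1 = random.choice(policies)
    else:
        pi1 = max(policies, key=lambda p: q_table.get((flat_state, p), 0.0))

    # Compute the most likely abstract state s0 given the flat state (TL, PL, PD).
    s0 = dan_most_likely_s0(flat_state)

    # Select a primitive action by sampling from Pr(P0 | P1 = pi1, S0 = s0).
    dist = dan_action_dist(pi1, s0)        # e.g. {"North": 0.7, "East": 0.3}
    actions, probs = zip(*dist.items())
    action = random.choices(actions, weights=probs, k=1)[0]
    return pi1, action
```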
10 Policy Improvement
- Policy learned over DAN policies performs well
- Each plot is an average over 10 RL runs and 1 EM run
11 Policy Recognition
Figure: DAN policy-recognition example (labels: initial state, passenger location, passenger destination, Policy 1 through Policy 6).
- Can (sometimes!) recognize a specific sequence of actions as composing a single policy
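One way to state the recognition step formally, assuming it is the usual maximum a posteriori query over the DAN's top-level policy node (the slides do not give the equation):

$$\hat{\pi}_1 = \arg\max_{\pi_1} \Pr\big(P_1 = \pi_1 \mid \langle TL, PL, PD, A\rangle_{1:t}\big)$$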
12 Summary
- Two-phase method for automating hierarchical RL using graphical models
- Advantages
  - Limited info needed (# of levels, # of values)
  - Permits continuous and partially observable states/actions
- Disadvantages
  - EM is expensive
  - Need a mentor
  - Abstractions learned can be hard to decipher (local maxima?)
13 Future Work
- Approximate inference in DANs
  - Saria & Mahadevan 04: Rao-Blackwellized particle filtering for multi-agent AHMMs
  - Johns & Mahadevan 05: variational inference for AHMMs
- Take advantage of the ability to do inference in the hierarchical RL phase
- Incorporate reward in the DAN
15 Abstract State Transitions (S0)
- Regardless of the abstract P0 policy being executed, abstract S0 states self-transition with high probability
- Depending on the abstract P0 policy, the agent may alternatively transition to one of a few abstract S0 states
- Similarly for abstract S1 states and abstract P1 policies
16 State Abstractions
The abstract state to which the agent is most likely to transition is a consequence, in part, of the learned state abstractions
17 Semi-MDP Q-learning
- Q(s,o): activity-value for state s and activity o
- α: learning rate
- γ^τ: discount rate raised to the number of time steps o took
- r: accumulated discounted reward since o began
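For reference, the standard SMDP Q-learning update these symbols describe (the slide lists only the symbols; this is the usual textbook form), where s' is the state in which activity o terminates:

$$Q(s,o) \leftarrow Q(s,o) + \alpha \left[ r + \gamma^{\tau} \max_{o'} Q(s',o') - Q(s,o) \right]$$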
18 Abstract State S1 Transitions
- Abstract state S1 transitions under abstract policy P1
19 Expectation-Maximization (EM)
- Hidden variables and unknown parameters
- E(xpectation)-step
  - Assume parameters known and compute the conditional expected values for the variables
- M(aximization)-step
  - Assume variables observed and compute the argmax parameters
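For reference, the standard EM updates, with observed data x, hidden variables z, and parameters θ (not taken verbatim from the slide):

$$\text{E-step:}\quad Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim \Pr(z \mid x,\, \theta^{(t)})}\big[\log \Pr(x, z \mid \theta)\big]$$
$$\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)})$$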
20 Abstract State S0 Transitions
- Abstract state S0 transitions under abstract policy P0