Title: Hierarchical Reinforcement Learning Using Graphical Models
1 Hierarchical Reinforcement Learning Using Graphical Models
- Victoria Manfredi and Sridhar Mahadevan
- Rich Representations for Reinforcement Learning
- ICML05 Workshop
- August 7, 2005
2 Introduction
- Abstraction necessary to scale RL → hierarchical RL
- Want to learn abstractions automatically
- Other approaches
  - Find subgoals (McGovern & Barto 01; Simsek & Barto 04; Simsek, Wolfe & Barto 05; Mannor et al. 04)
  - Build policy hierarchy (Hengst 02)
  - Potentially proto-value functions (Mahadevan 05)
- Our approach
  - Learn an initial policy hierarchy using a graphical model framework, then learn how to use the policies via reinforcement learning and reward
  - Related to imitation (Price & Boutilier 03; Abbeel & Ng 04)
3 Outline
- Dynamic Abstraction Networks
- Approach
- Experiments
- Results
- Summary
- Future Work
4 Dynamic Abstraction Network
- HHMM (Fine, Singer & Tishby 98), AHMM (Bui, Venkatesh & West 02), DAN (Manfredi & Mahadevan 05)
- Just one realization of a DAN; others are possible
5 Approach
6 DANs vs MAXQ/HAMs
- DANs
  - # of levels in state/policy hierarchies
  - # of values for each (abstract) state/policy node
  - Training sequences of (flat state, action) pairs
- MAXQ (Dietterich 00)
  - # of levels, # of tasks at each level
  - Connections between levels
  - Initiation set for each task
  - Termination set for each task
- HAMs (Parr & Russell 98)
  - # of levels
  - Hierarchy of stochastic finite state machines
  - Explicit action, call, choice, stop states
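To make the comparison concrete, here is a hypothetical, minimal sketch of what a DAN specification amounts to, using the cardinalities from the taxi experiment later in the talk; the dictionary keys are illustrative, not an API from the paper.

```python
# Hypothetical DAN specification: only the number of hierarchy levels and the
# cardinality of each abstract state/policy node is supplied up front.
dan_spec = {
    "num_levels": 2,
    "abstract_state_values": {"S1": 5, "S0": 25},
    "policy_values": {"P1": 6, "P0": 6},
    # Training data: sequences of (flat state, action) pairs from a mentor.
}

# A MAXQ decomposition, by contrast, also requires the task graph itself:
# the subtasks at each level, how they connect, and each subtask's
# initiation and termination sets.
```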
7 Why Graphical Models?
- Advantages of graphical models
  - Joint learning of multiple policy/state abstractions
  - Continuous/hidden domains
  - Full machinery of inference can be used
- Disadvantages
  - Parameter learning with hidden variables is expensive
  - Expectation-Maximization can get stuck in local maxima
8 Domain
- Dietterich's Taxi (2000)
- States
  - Taxi Location (TL): 25
  - Passenger Location (PL): 5
  - Passenger Destination (PD): 5
- Actions
  - North, South, East, West
  - Pickup, Putdown
- Hand-coded policies
  - GotoRed
  - GotoGreen
  - GotoYellow
  - GotoBlue
  - Pickup, Putdown
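As a concrete illustration, a minimal sketch of how the flat taxi state and the action/policy sets above could be encoded; the names TaxiState, PRIMITIVE_ACTIONS, and HAND_CODED_POLICIES are assumptions for illustration, not from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxiState:
    taxi_location: int          # TL: one of 25 grid cells
    passenger_location: int     # PL: one of 5 values
    passenger_destination: int  # PD: one of 5 values

PRIMITIVE_ACTIONS = ("North", "South", "East", "West", "Pickup", "Putdown")

# Hand-coded mentor policies used to generate training sequences for Phase 1.
HAND_CODED_POLICIES = ("GotoRed", "GotoGreen", "GotoYellow", "GotoBlue",
                       "Pickup", "Putdown")
```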
9 Experiments
- Phase 1
  - |S1| = 5, |S0| = 25, |P1| = 6, |P0| = 6
  - 1000 sequences from an SMDP Q-learner
  - (TL, PL, PD, A)1, ..., (TL, PL, PD, A)n
  - Bayes Net Toolbox (Murphy 01)
- Phase 2 (a code sketch follows the DAN figure below)
  - SMDP Q-learning
  - Choose policy π1 using ε-greedy
  - Compute most likely abstract state s0 given TL, PL, PD
  - Select action π0 using Pr(P0 | P1 = π1, S0 = s0)
Figure: Taxi DAN, shown as a two-slice dynamic Bayesian network with policy nodes, action nodes, abstract state nodes S1 and S0, and termination nodes F, F1, F0.
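A minimal sketch of one Phase 2 decision point as described above, assuming hypothetical helpers dan_most_likely_s0 and dan_action_dist that wrap inference in the learned DAN; the function name and interfaces are illustrative, not from the talk.

```python
import random

def phase2_decision(q_table, flat_state, policies, epsilon,
                    dan_most_likely_s0, dan_action_dist):
    """One decision point of Phase 2: SMDP Q-learning over the DAN's policies."""
    # Choose an abstract policy pi1 epsilon-greedily from the SMDP Q-values.
    if random.random() < epsilon:
        pi1 = random.choice(policies)
    else:
        pi1 = max(policies, key=lambda p: q_table.get((flat_state, p), 0.0))

    # Compute the most likely abstract state s0 given the flat state (TL, PL, PD).
    s0 = dan_most_likely_s0(flat_state)

    # Select a primitive action by sampling from Pr(P0 | P1 = pi1, S0 = s0).
    dist = dan_action_dist(pi1, s0)        # e.g. {"North": 0.7, "East": 0.3}
    actions, probs = zip(*dist.items())
    action = random.choices(actions, weights=probs, k=1)[0]
    return pi1, action
```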
10 Policy Improvement
- Policy learned over DAN policies performs well
- Each plot is an average over 10 RL runs and 1 EM run
11 Policy Recognition
Figure: DAN policy-recognition example (labels: initial state, passenger location, passenger destination, Policy 1 through Policy 6).
- Can (sometimes!) recognize a specific sequence of actions as composing a single policy
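One way to state the recognition step formally, assuming it is the usual maximum a posteriori query over the DAN's top-level policy node (the slides do not give the equation):

$$\hat{\pi}_1 = \arg\max_{\pi_1} \Pr\big(P_1 = \pi_1 \mid \langle TL, PL, PD, A\rangle_{1:t}\big)$$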
12 Summary
- Two-phase method for automating hierarchical RL using graphical models
- Advantages
  - Limited info needed (# of levels, # of values)
  - Permits continuous and partially observable states/actions
- Disadvantages
  - EM is expensive
  - Need a mentor
  - Abstractions learned can be hard to decipher (local maxima?)
13 Future Work
- Approximate inference in DANs
  - Saria & Mahadevan 04: Rao-Blackwellized particle filtering for multi-agent AHMMs
  - Johns & Mahadevan 05: variational inference for AHMMs
- Take advantage of the ability to do inference in the hierarchical RL phase
- Incorporate reward in the DAN
15 Abstract State Transitions (S0)
- Regardless of the abstract P0 policy being executed, abstract S0 states self-transition with high probability
- Depending on the abstract P0 policy, the agent may alternatively transition to one of a few abstract S0 states
- Similarly for abstract S1 states and abstract P1 policies
16 State Abstractions
The abstract state to which the agent is most likely to transition is a consequence, in part, of the learned state abstractions
17 Semi-MDP Q-learning
- Q(s,o): activity-value for state s and activity o
- α: learning rate
- γ^τ: discount rate raised to the number of time steps o took
- r: accumulated discounted reward since o began
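For reference, the standard SMDP Q-learning update these symbols describe (the slide lists only the symbols; this is the usual textbook form), where s' is the state in which activity o terminates:

$$Q(s,o) \leftarrow Q(s,o) + \alpha \left[ r + \gamma^{\tau} \max_{o'} Q(s',o') - Q(s,o) \right]$$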
18 Abstract State S1 Transitions
- Abstract state S1 transitions under abstract policy P1
19 Expectation-Maximization (EM)
- Hidden variables and unknown parameters
- E(xpectation)-step
  - Assume parameters known and compute the conditional expected values for the variables
- M(aximization)-step
  - Assume variables observed and compute the argmax parameters
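For reference, the standard EM updates, with observed data x, hidden variables z, and parameters θ (not taken verbatim from the slide):

$$\text{E-step:}\quad Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim \Pr(z \mid x,\, \theta^{(t)})}\big[\log \Pr(x, z \mid \theta)\big]$$
$$\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)})$$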
20 Abstract State S0 Transitions
- Abstract state S0 transitions under abstract policy P0