Title: Hierarchical Reinforcement Learning
1. Hierarchical Reinforcement Learning
- Gideon Maillette de Buy Wenniger
- Recent Advances in Hierarchical Reinforcement Learning - Andrew G. Barto, Sridhar Mahadevan
- Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition - Thomas G. Dietterich
2. Reinforcement Learning, Formulas
- Value function, discounted reward
- Future expected discounted reward
- Bellman equations
- Optimal values (the standard formulas are written out below)
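The formulas themselves did not survive extraction; they are the standard definitions, reconstructed here in LaTeX (r_t is the reward at step t, gamma the discount factor, P the transition model):

    % Future expected discounted reward: value of state s under policy \pi
    V^{\pi}(s) = E\Big[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\Big|\; s_0 = s, \pi \Big]

    % Bellman equation for a fixed policy \pi
    V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma V^{\pi}(s') \big]

    % Bellman optimality equations
    V^{*}(s)    = \max_{a} \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma V^{*}(s') \big]
    Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \big]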
3. Value Iteration
- Dynamic programming update rules
- Q-learning update rule (off-policy)
- Sarsa update rule (on-policy)
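A minimal tabular sketch of these update rules in Python; the array shapes, default hyperparameters, and function names are illustrative assumptions rather than anything from the slides:

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        """Off-policy Q-learning: bootstrap with the greedy action in s_next."""
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        """On-policy Sarsa: bootstrap with the action actually taken in s_next."""
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])

    def value_iteration(P, R, gamma=0.99, tol=1e-6):
        """Dynamic-programming update: apply the Bellman optimality backup until convergence.
        P[s, a, s'] holds transition probabilities, R[s, a] expected rewards."""
        V = np.zeros(P.shape[0])
        while True:
            Q = R + gamma * (P @ V)   # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new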
4. Extension: SMDPs
- Extension of MDPs
- Amount of time between decisions is a random variable (discrete or continuous)
- Necessary for operations that take multiple timesteps
- The random variable denotes the waiting time
- New formulas (reconstructed below)
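The "new formulas" are the SMDP analogues of the Bellman equations; a reconstruction in standard notation, where the random waiting time tau enters through the discount factor:

    % SMDP Bellman optimality equations; P(s', \tau \mid s, a) is the joint
    % distribution over the next state and the waiting time \tau
    V^{*}(s)    = \max_{a} \Big[ R(s, a) + \sum_{s', \tau} \gamma^{\tau} P(s', \tau \mid s, a)\, V^{*}(s') \Big]
    Q^{*}(s, a) = R(s, a) + \sum_{s', \tau} \gamma^{\tau} P(s', \tau \mid s, a) \max_{a'} Q^{*}(s', a')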
5. Approaches to Hierarchical Reinforcement Learning
- Idea of the macro-operator
  - A sequence of actions that can be invoked as a single action
  - Macros can call other macros
- Hierarchical policies extend the macro idea
  - Specify termination conditions
  - Partial policies / temporally extended actions
6. Options approach
- Simplest option: Markov options (a minimal data-structure sketch follows this list)
  - Stationary stochastic policy
  - Termination condition
  - Input set
- Semi-Markov options: option policies may depend on the history since the option was called
- Expansion of each option into primitive actions
  - Flat policy
  - Non-Markovian
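A Markov option is exactly the triple listed above: an input (initiation) set, a stationary policy over primitive actions, and a per-state termination condition. A minimal sketch as a Python data structure; the class and field names are illustrative, not from the slides:

    from dataclasses import dataclass
    from typing import Callable, Set

    @dataclass
    class MarkovOption:
        # Input set I: states in which the option may be invoked
        initiation_set: Set[int]
        # Stationary policy pi over primitive actions (stochastic in general;
        # modelled here as a callable returning one action, for brevity)
        policy: Callable[[int], int]
        # Termination condition beta: probability of stopping in each state
        termination: Callable[[int], float]

        def can_start(self, state: int) -> bool:
            return state in self.initiation_set

A primitive action is then just the degenerate option that can start anywhere and always terminates after one step.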
7. Adapted value functions, update rules
- Event of an option being initiated at time t in state s
- Semi-Markov policy that follows o until it terminates after some number of timesteps and then continues according to the overall policy
- Analogous value-iteration step
- Analogous Q-learning step (both steps are reconstructed below)
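A reconstruction of these definitions and updates in the usual options notation, where k is the number of timesteps option o ran before terminating in s' and r the discounted reward accumulated over those steps; the exact symbols used on the slide were lost in extraction:

    % Option value: reward while o runs, then the value of continuing with \mu,
    % conditioned on o being initiated at time t in state s
    Q^{\mu}(s, o) = E\big[ r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k}
                           + \gamma^{k} V^{\mu}(s_{t+k}) \big]

    % Analogue of the value-iteration step, over the options O_s available in s
    V(s) \leftarrow \max_{o \in O_s} E\big[ r + \gamma^{k} V(s') \big]

    % Analogue of the Q-learning step (SMDP Q-learning)
    Q(s, o) \leftarrow Q(s, o) + \alpha \big[ r + \gamma^{k} \max_{o'} Q(s', o') - Q(s, o) \big]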
8. MAXQ motivation
9. MAXQ value decomposition
10. MAXQ
- Decompose the task into a set of closed hierarchical subtasks M0, M1, ..., Mn (the resulting value decomposition is written out below)
- Subtasks have to be solved to complete the root task M0
- Assign a local reward to completing a subtask
- When a subtask is called it runs until it, or a subtask higher in the hierarchy, completes
- Use deterministic completion states
- Assign reward depending on the completion state
- Recursive instead of hierarchical optimality
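The MAXQ value function decomposition (slide 9), written out as in Dietterich's paper: the value of invoking child task a in state s inside parent task i splits into the value of a itself plus a completion function C for finishing i afterwards:

    % MAXQ value function decomposition
    Q(i, s, a) = V(a, s) + C(i, s, a)

    % Projected value of task i in state s
    V(i, s) = Q(i, s, \pi_i(s))                           \quad \text{if } i \text{ is composite}
    V(i, s) = \sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i)  \quad \text{if } i \text{ is primitive}

    % Completion function: expected discounted reward for completing i after a
    % finishes, where N is the number of steps a takes
    C(i, s, a) = \sum_{s', N} P(s', N \mid s, a)\, \gamma^{N}\, Q(i, s', \pi_i(s'))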
11. Optimalities
12. MAXQ - continued
- Lowest level of the hierarchy gives primitive actions and direct rewards
- Use these in combination with local rewards to implement learning (update rules below)
- Find a recursively optimal policy (vs. hierarchical optimality)
- Enable state abstractions
- Speed up learning by minimizing the number of states
- Proof of convergence to recursive optimality
- Possibility of executing the policy non-hierarchically
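The learning updates this refers to, following the MAXQ-0 algorithm in Dietterich's paper, where alpha_t is the learning rate and N the number of steps the child action took:

    % Primitive action a executed in s with observed reward r
    V_{t+1}(a, s) \leftarrow (1 - \alpha_t)\, V_t(a, s) + \alpha_t\, r

    % Composite task i, after child action a terminates in s' having taken N steps
    C_{t+1}(i, s, a) \leftarrow (1 - \alpha_t)\, C_t(i, s, a)
                     + \alpha_t\, \gamma^{N} \max_{a'} \big[ V_t(a', s') + C_t(i, s', a') \big]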
13. MAXQ graph
14. Results MAXQ
15. Topics for Future Research
- 1. Compact representations
- 2. Learning task Hierarchies
- 3. Dynamic Abstraction
- 4. Large Applications
16. Conclusions
- Two main approaches
  - Extend the action space with macros
  - Limit the state space using a hierarchy
- Macros: the problem is how to learn good macros autonomously. Either suboptimal performance (limited action space) or no real gain (extended action space), but they might speed up learning significantly.
- MAXQ: making use of a programmer-defined hierarchical decomposition makes state spaces smaller and learning faster. The problem is the effort required from the programmer.
17. The MAXQ algorithm
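A minimal Python sketch of the recursive MAXQ-0 learning algorithm this slide covers, assuming a hypothetical task object that exposes the subtask hierarchy (children, is_primitive, is_terminal) and a one-step environment interface (execute); the class and method names and the epsilon-greedy exploration are illustrative choices, only the value and completion-function updates follow the paper:

    import random
    from collections import defaultdict

    class MaxQ0:
        def __init__(self, task, alpha=0.1, gamma=0.99, epsilon=0.1):
            self.task = task
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
            self.V = defaultdict(float)   # V[(a, s)] for primitive actions a
            self.C = defaultdict(float)   # C[(i, s, a)]: completion function of parent task i

        def value(self, i, s):
            # Projected value V(i, s): stored for primitives, computed recursively for composites.
            if self.task.is_primitive(i):
                return self.V[(i, s)]
            return max(self.value(a, s) + self.C[(i, s, a)]
                       for a in self.task.children(i))

        def choose(self, i, s):
            # Epsilon-greedy exploration over the child actions of subtask i.
            actions = self.task.children(i)
            if random.random() < self.epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: self.value(a, s) + self.C[(i, s, a)])

        def run(self, i, s):
            """Execute subtask i starting in state s; return (steps taken, resulting state)."""
            if self.task.is_primitive(i):
                s_next, r = self.task.execute(i, s)                 # one primitive environment step
                self.V[(i, s)] += self.alpha * (r - self.V[(i, s)])
                return 1, s_next
            steps = 0
            while not self.task.is_terminal(i, s):
                a = self.choose(i, s)
                n, s_next = self.run(a, s)                          # recursively run the child subtask
                # Update the completion function toward the best value reachable from s_next.
                target = self.gamma ** n * self.value(i, s_next)
                self.C[(i, s, a)] += self.alpha * (target - self.C[(i, s, a)])
                steps += n
                s = s_next
            return steps, s

Training would repeatedly call run(root_task, start_state) over episodes; the greedy recursively optimal policy can then be read off the learned V and C tables.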