Title: Transfer in Variable-Reward Hierarchical Reinforcement Learning
Slide 1: Transfer in Variable-Reward Hierarchical Reinforcement Learning
Hui Li, March 31, 2006
Slide 2: Overview
- Multi-criteria reinforcement learning
- Transfer in variable-reward hierarchical reinforcement learning
- Results
- Conclusions
Slide 3: Multi-criteria reinforcement learning
Reinforcement Learning
- Definition: reinforcement learning is the process by which the agent learns an approximately optimal policy through trial-and-error interactions with the environment.
Slide 4:
- Goal: the agent's goal is to maximize the cumulative reward it receives over the long run.
- A new value function: the average-adjusted sum of rewards (bias).
- Average reward (gain) per time step under a given policy π.
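For reference, the gain and bias of average-reward RL are standardly defined as follows (textbook formulation, e.g. [6]; the slide's own equations were shown as images and are not reproduced verbatim):

```latex
% Gain (average reward per step) and bias (average-adjusted sum of rewards) of a policy \pi
\rho^{\pi} \;=\; \lim_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}\!\left[\sum_{t=0}^{N-1} r_t \,\middle|\, \pi\right],
\qquad
h^{\pi}(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty}
  \bigl(r_t - \rho^{\pi}\bigr) \,\middle|\, s_0 = s,\; \pi\right].
```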
Slide 5:
- H-learning: model-based version of average-reward reinforcement learning. Its gain update mixes the old estimate ρ_old with the new observation r_s(a), using a learning rate α, 0 < α < 1.
- R-learning: model-free version of average-reward reinforcement learning (a sketch follows below).
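The slide's update rules were shown as images. As a point of reference for the model-free variant, here is a minimal sketch of the classic R-learning update (Schwartz-style, as surveyed in [6]); the function name, defaults, and dict-based tables are illustrative choices, not the presentation's code.

```python
def r_learning_step(R, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """One step of the classic R-learning update (model-free, average reward).

    R       : dict mapping (state, action) -> average-adjusted value estimate
    rho     : current estimate of the gain (average reward per step)
    actions : iterable of all actions
    alpha   : value-table learning rate, 0 < alpha < 1
    beta    : gain learning rate, 0 < beta < 1
    """
    def q(st, ac):
        return R.get((st, ac), 0.0)

    best_next = max(q(s_next, b) for b in actions)
    best_here = max(q(s, b) for b in actions)
    was_greedy = q(s, a) >= best_here  # check before updating the table

    # Average-adjusted value update: the immediate reward is corrected by rho.
    R[(s, a)] = q(s, a) + alpha * (r - rho + best_next - q(s, a))

    # Gain update, applied only after greedy actions.
    if was_greedy:
        rho = (1.0 - beta) * rho + beta * (r + best_next - best_here)
    return R, rho
```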
Slide 6: Multi-criteria reinforcement learning
In many situations it is natural to express the objective as making appropriate tradeoffs between different kinds of rewards.
- Goals in the Buridan's donkey problem:
- Eating food
- Guarding food
- Minimizing the number of steps the donkey walks
Slide 7: Weighted optimization criterion
The weight vector represents the importance of each kind of reward.
If the weight vector w is static and never changes over time, the problem reduces to ordinary reinforcement learning with a scalar reward.
If the weight vector varies from time to time, learning a policy for each weight vector from scratch is very inefficient.
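For reference (notation assumed, since the slide's formula was an image): with a vector-valued reward r(s,a) ∈ R^d and weight vector w, the weighted criterion scalarizes the reward and the gain as

```latex
r_w(s,a) \;=\; w \cdot r(s,a) \;=\; \sum_{j=1}^{d} w_j\, r_j(s,a),
\qquad
\rho_w^{\pi} \;=\; \lim_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}\!\left[\sum_{t=0}^{N-1} w \cdot r(s_t, a_t) \,\middle|\, \pi\right].
```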
Slide 8:
Since the MDP model is a linear transformation, the average reward ρ^π and the average-adjusted reward h^π(s) are linear in the reward weights for a given policy π.
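Written out (in my notation; the slide's own equations were images): if ρ^π and h^π(s) denote the vectors of per-reward-type gains and biases of a fixed policy π, then

```latex
\rho_w^{\pi} \;=\; w \cdot \rho^{\pi},
\qquad
h_w^{\pi}(s) \;=\; w \cdot h^{\pi}(s),
```

so both are linear functions of the weight vector w.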
Slide 9:
- Each line represents the weighted average reward for a given policy π_k.
- Solid lines represent the active weighted average rewards.
- Dotted lines represent the inactive weighted average rewards.
- Dark line segments represent the best average reward for each weight vector.
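In other words (my phrasing, since the plot itself is not in the transcript): each policy's weighted gain is linear in the weight vector, so the best achievable gain is the upper envelope of these lines, and a stored policy is "active" exactly when its line attains that envelope for some weight:

```latex
\rho^{*}(w) \;=\; \max_{k}\; w \cdot \rho^{\pi_k}
\quad\text{(piecewise-linear and convex in } w\text{)},
\qquad
\pi_k \ \text{is active} \iff \exists\, w :\; w \cdot \rho^{\pi_k} = \rho^{*}(w).
```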
Slide 10: The key idea
- Π: the set of all stored policies.
- Only those policies whose weighted average rewards are active are stored (an illustrative sketch follows below).
- Update equations.
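The update equations themselves were shown as images. The following is only an illustrative sketch (my own names and structure, not the paper's code) of the caching idea described above: keep each stored policy's gain vector ρ^π, initialize learning for a new weight vector w from the stored policy that maximizes w · ρ^π, and discard policies that are not the best for any weight seen so far.

```python
import numpy as np

class PolicyStore:
    """Cache of learned policies with their vector-valued gains and biases."""

    def __init__(self):
        self.policies = []  # list of (policy, rho_vec, h_table) tuples

    def best_for(self, w):
        """Stored policy whose weighted gain w . rho is largest for weight w."""
        if not self.policies:
            return None
        return max(self.policies, key=lambda p: np.dot(w, p[1]))

    def add(self, policy, rho_vec, h_table, weights_seen):
        """Store a new policy, then keep only 'active' policies: those that
        are the best stored policy for at least one weight seen so far."""
        self.policies.append((policy, rho_vec, h_table))
        self.policies = [p for p in self.policies
                         if any(p is self.best_for(w) for w in weights_seen)]
```

Usage would be: when a new weight vector w arrives, call `store.best_for(w)` and start learning from that policy's value functions instead of from scratch.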
Slide 11: (no transcript)
Slide 12: Variable-reward hierarchical reinforcement learning
- The original MDP M is split into sub-SMDPs M0, ..., Mn, each sub-SMDP representing a subtask.
- Solving the root task M0 solves the entire MDP M.
- The task hierarchy is represented as a directed acyclic graph known as the task graph.
- A local policy πi for the subtask Mi is a mapping from states to the child tasks of Mi.
- A hierarchical policy π for the whole task is an assignment of a local policy πi to each subtask Mi.
- The objective is to learn an optimal policy that optimizes the policy for each subtask assuming that its children's policies are optimized (recursive optimality).
Slide 13: (figure of the resource-gathering domain: forest, home base, enemy base, goldmine, peasants)
Slide 14: Two kinds of subtasks
- Composite subtasks:
- Root: the whole task
- Harvest: the goal is to harvest wood or gold
- Deposit: the goal is to deposit a resource into the home base
- Attack: the goal is to attack the enemy base
- Primitive subtasks (primitive actions):
- north, south, east, west
- pick a resource, put a resource
- attack the enemy base
- idle
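To make the hierarchy concrete, here is an illustrative task graph for this domain as a plain Python mapping from each composite task to its children. Only the list of tasks comes from the slide; which children each composite task calls is my assumption for the example.

```python
# Illustrative task graph for the resource-gathering domain (a sketch).
# The child assignments below are assumptions, not stated on the slide.
TASK_GRAPH = {
    "Root":    ["Harvest", "Deposit", "Attack", "idle"],
    "Harvest": ["north", "south", "east", "west", "pick"],
    "Deposit": ["north", "south", "east", "west", "put"],
    "Attack":  ["north", "south", "east", "west", "attack_enemy_base"],
}

PRIMITIVES = {"north", "south", "east", "west",
              "pick", "put", "attack_enemy_base", "idle"}

def children(task):
    """Child tasks of a composite task; primitives have no children."""
    return TASK_GRAPH.get(task, [])
```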
Slide 15: SMDP (semi-MDP)
An SMDP is a tuple ⟨S, A, P, r, t⟩:
- S, A, P, r are defined the same as in an MDP.
- t(s, a) is the execution time of taking action a in state s.
- Bellman equation of the SMDP for average-reward learning (a standard form is sketched below).
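The equation itself was an image on the slide; the standard average-reward Bellman equation for an SMDP (as in the average-reward RL literature cited on the references slide) has the form

```latex
h(s) \;=\; \max_{a \in A}\Big[\, r(s,a) \;-\; \rho\, t(s,a)
      \;+\; \sum_{s'} P(s' \mid s, a)\, h(s') \,\Big].
```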
A subtask is a tuple ⟨Bi, Ai, Gi⟩:
- Bi: state abstraction function, which maps a state s of the original MDP to an abstract state in Mi.
- Ai: the set of subtasks that can be called by Mi.
- Gi: the termination (goal) condition of Mi.
Slide 16:
The value function decomposition satisfies a set of Bellman equations (not reproduced in the transcript; a rough sketch follows below). At the root, we only store the average-adjusted reward.
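As a rough guide only: this is my reconstruction of the general shape such MAXQ-style, average-reward decompositions take (cf. [1], [7]), not the paper's exact equations. The bias of a composite task i is decomposed over its children, with the gain ρ entering through the primitive tasks:

```latex
% Assumed form of a MAXQ-style average-reward decomposition (illustrative only)
h_i(s) \;\approx\; \max_{a \in A_i}\Big[\, h_a\big(B_a(s)\big)
        \;+\; \sum_{s'} P(s' \mid s, a)\, h_i(s') \,\Big],
\qquad
h_a(s) \;=\; r(s,a) \;-\; \rho\, t(s,a) \ \ \text{for primitive } a .
```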
Slide 17: (no transcript)
Slide 18: Results
- Learning curves for a test reward weight after having seen 0, 1, 2, ..., 10 previous training weight vectors.
- Negative transfer: learning based on one previous weight is worse than learning from scratch.
Slide 19: Transfer ratio
Transfer ratio = F_Y / F_{Y|X}
- F_Y is the area between the learning curve and its optimal value for the problem Y with no prior learning experience on X.
- F_{Y|X} is the area between the learning curve and its optimal value for the problem Y given prior training on X.
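A minimal sketch of how this quantity could be computed from two learning curves sampled at the same episodes; the function names and the discrete-sum approximation of the area are my own, not the paper's.

```python
import numpy as np

def regret_area(curve, optimal_value):
    """Area between a learning curve and its optimal value
    (the F quantities on the slide), approximated by a discrete sum."""
    curve = np.asarray(curve, dtype=float)
    return float(np.sum(optimal_value - curve))

def transfer_ratio(curve_no_prior, curve_with_prior, optimal_value):
    """Transfer ratio F_Y / F_{Y|X}; values above 1 indicate positive transfer."""
    f_y = regret_area(curve_no_prior, optimal_value)           # no prior experience on X
    f_y_given_x = regret_area(curve_with_prior, optimal_value)  # prior training on X
    return f_y / f_y_given_x
```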
Slide 20: Conclusions
- This paper showed that hierarchical task structure can accelerate transfer across variable-reward MDPs more than in the flat-MDP setting.
- The hierarchical task structure also facilitates multi-agent learning.
Slide 21: References
[1] T. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13:227-303, 2000.
[2] N. Mehta and P. Tadepalli. Multi-Agent Shared Hierarchy Reinforcement Learning. ICML Workshop on Richer Representations in Reinforcement Learning, 2005.
[3] S. Natarajan and P. Tadepalli. Dynamic Preferences in Multi-Criteria Reinforcement Learning. In Proceedings of ICML-05, 2005.
[4] N. Mehta, S. Natarajan, P. Tadepalli, and A. Fern. Transfer in Variable-Reward Hierarchical Reinforcement Learning. In NIPS Workshop on Transfer Learning, 2005.
[5] A. Barto and S. Mahadevan. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems, 2003.
[6] S. Mahadevan. Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Machine Learning, 22:169-196, 1996.
[7] P. Tadepalli and D. Ok. Model-based Average Reward Reinforcement Learning. Artificial Intelligence, 1998.