Title: Transfer in Variable-Reward Hierarchical Reinforcement Learning
Slide 1: Transfer in Variable-Reward Hierarchical Reinforcement Learning
Hui Li, March 31, 2006
Slide 2: Overview
- Multi-criteria reinforcement learning
- Transfer in variable-reward hierarchical reinforcement learning
- Results
- Conclusions
Slide 3: Multi-criteria reinforcement learning
Reinforcement Learning
- Definition: reinforcement learning is the process by which the agent learns an approximately optimal policy through trial-and-error interactions with the environment.
Slide 4:
- Goal: the agent's goal is to maximize the cumulative reward it receives over the long run.
- A new value function: the average-adjusted sum of rewards (bias).
- Average reward (gain) per time step under a given policy π.
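For reference, the gain and bias of average-reward RL are standardly defined as follows (textbook formulation, e.g. [6]; the slide's own equations were shown as images and are not reproduced verbatim):

```latex
% Gain (average reward per step) and bias (average-adjusted sum of rewards) of a policy \pi
\rho^{\pi} \;=\; \lim_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}\!\left[\sum_{t=0}^{N-1} r_t \,\middle|\, \pi\right],
\qquad
h^{\pi}(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty}
  \bigl(r_t - \rho^{\pi}\bigr) \,\middle|\, s_0 = s,\; \pi\right].
```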
Slide 5:
- H-learning: model-based version of average-reward reinforcement learning. Its gain update mixes the old estimate ρ_old with the new observation r_s(a), using a learning rate α, 0 < α < 1.
- R-learning: model-free version of average-reward reinforcement learning (a sketch follows below).
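The slide's update rules were shown as images. As a point of reference for the model-free variant, here is a minimal sketch of the classic R-learning update (Schwartz-style, as surveyed in [6]); the function name, defaults, and dict-based tables are illustrative choices, not the presentation's code.

```python
def r_learning_step(R, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """One step of the classic R-learning update (model-free, average reward).

    R       : dict mapping (state, action) -> average-adjusted value estimate
    rho     : current estimate of the gain (average reward per step)
    actions : iterable of all actions
    alpha   : value-table learning rate, 0 < alpha < 1
    beta    : gain learning rate, 0 < beta < 1
    """
    def q(st, ac):
        return R.get((st, ac), 0.0)

    best_next = max(q(s_next, b) for b in actions)
    best_here = max(q(s, b) for b in actions)
    was_greedy = q(s, a) >= best_here  # check before updating the table

    # Average-adjusted value update: the immediate reward is corrected by rho.
    R[(s, a)] = q(s, a) + alpha * (r - rho + best_next - q(s, a))

    # Gain update, applied only after greedy actions.
    if was_greedy:
        rho = (1.0 - beta) * rho + beta * (r + best_next - best_here)
    return R, rho
```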
Slide 6: Multi-criteria reinforcement learning
In many situations it is natural to express the objective as making appropriate tradeoffs between different kinds of rewards.
- Goals in the Buridan's donkey problem:
- Eating food
- Guarding food
- Minimizing the number of steps the donkey walks
Slide 7: Weighted optimization criterion
The weight vector represents the importance of each kind of reward.
If the weight vector w is static and never changes over time, the problem reduces to ordinary reinforcement learning with a scalar reward.
If the weight vector varies from time to time, learning a policy for each weight vector from scratch is very inefficient.
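For reference (notation assumed, since the slide's formula was an image): with a vector-valued reward r(s,a) ∈ R^d and weight vector w, the weighted criterion scalarizes the reward and the gain as

```latex
r_w(s,a) \;=\; w \cdot r(s,a) \;=\; \sum_{j=1}^{d} w_j\, r_j(s,a),
\qquad
\rho_w^{\pi} \;=\; \lim_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}\!\left[\sum_{t=0}^{N-1} w \cdot r(s_t, a_t) \,\middle|\, \pi\right].
```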
Slide 8:
Since the MDP model is a linear transformation, the average reward ρ^π and the average-adjusted reward h^π(s) are linear in the reward weights for a given policy π.
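Written out (in my notation; the slide's own equations were images): if ρ^π and h^π(s) denote the vectors of per-reward-type gains and biases of a fixed policy π, then

```latex
\rho_w^{\pi} \;=\; w \cdot \rho^{\pi},
\qquad
h_w^{\pi}(s) \;=\; w \cdot h^{\pi}(s),
```

so both are linear functions of the weight vector w.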
Slide 9:
- Each line represents the weighted average reward for a given policy π_k.
- Solid lines represent the active weighted average rewards.
- Dotted lines represent the inactive weighted average rewards.
- Dark line segments represent the best average reward for each weight vector.
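In other words (my phrasing, since the plot itself is not in the transcript): each policy's weighted gain is linear in the weight vector, so the best achievable gain is the upper envelope of these lines, and a stored policy is "active" exactly when its line attains that envelope for some weight:

```latex
\rho^{*}(w) \;=\; \max_{k}\; w \cdot \rho^{\pi_k}
\quad\text{(piecewise-linear and convex in } w\text{)},
\qquad
\pi_k \ \text{is active} \iff \exists\, w :\; w \cdot \rho^{\pi_k} = \rho^{*}(w).
```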
Slide 10: The key idea
- Π: the set of all stored policies.
- Only those policies whose weighted average rewards are active are stored (an illustrative sketch follows below).
- Update equations.
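The update equations themselves were shown as images. The following is only an illustrative sketch (my own names and structure, not the paper's code) of the caching idea described above: keep each stored policy's gain vector ρ^π, initialize learning for a new weight vector w from the stored policy that maximizes w · ρ^π, and discard policies that are not the best for any weight seen so far.

```python
import numpy as np

class PolicyStore:
    """Cache of learned policies with their vector-valued gains and biases."""

    def __init__(self):
        self.policies = []  # list of (policy, rho_vec, h_table) tuples

    def best_for(self, w):
        """Stored policy whose weighted gain w . rho is largest for weight w."""
        if not self.policies:
            return None
        return max(self.policies, key=lambda p: np.dot(w, p[1]))

    def add(self, policy, rho_vec, h_table, weights_seen):
        """Store a new policy, then keep only 'active' policies: those that
        are the best stored policy for at least one weight seen so far."""
        self.policies.append((policy, rho_vec, h_table))
        self.policies = [p for p in self.policies
                         if any(p is self.best_for(w) for w in weights_seen)]
```

Usage would be: when a new weight vector w arrives, call `store.best_for(w)` and start learning from that policy's value functions instead of from scratch.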
Slide 11: (no transcript)
Slide 12: Variable-reward hierarchical reinforcement learning
- The original MDP M is split into sub-SMDPs M0, ..., Mn, each sub-SMDP representing a subtask.
- Solving the root task M0 solves the entire MDP M.
- The task hierarchy is represented as a directed acyclic graph known as the task graph.
- A local policy πi for the subtask Mi is a mapping from states to the child tasks of Mi.
- A hierarchical policy π for the whole task is an assignment of a local policy πi to each subtask Mi.
- The objective is to learn an optimal policy that optimizes the policy for each subtask assuming that its children's policies are optimized (recursive optimality).
Slide 13: (figure of the resource-gathering domain: forest, home base, enemy base, goldmine, peasants)
Slide 14: Two kinds of subtasks
- Composite subtasks:
- Root: the whole task
- Harvest: the goal is to harvest wood or gold
- Deposit: the goal is to deposit a resource into the home base
- Attack: the goal is to attack the enemy base
- Primitive subtasks (primitive actions):
- north, south, east, west
- pick a resource, put a resource
- attack the enemy base
- idle
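To make the hierarchy concrete, here is an illustrative task graph for this domain as a plain Python mapping from each composite task to its children. Only the list of tasks comes from the slide; which children each composite task calls is my assumption for the example.

```python
# Illustrative task graph for the resource-gathering domain (a sketch).
# The child assignments below are assumptions, not stated on the slide.
TASK_GRAPH = {
    "Root":    ["Harvest", "Deposit", "Attack", "idle"],
    "Harvest": ["north", "south", "east", "west", "pick"],
    "Deposit": ["north", "south", "east", "west", "put"],
    "Attack":  ["north", "south", "east", "west", "attack_enemy_base"],
}

PRIMITIVES = {"north", "south", "east", "west",
              "pick", "put", "attack_enemy_base", "idle"}

def children(task):
    """Child tasks of a composite task; primitives have no children."""
    return TASK_GRAPH.get(task, [])
```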
Slide 15: SMDP (semi-MDP)
An SMDP is a tuple ⟨S, A, P, r, t⟩:
- S, A, P, r are defined the same as in an MDP.
- t(s, a) is the execution time of taking action a in state s.
- Bellman equation of the SMDP for average-reward learning (a standard form is sketched below).
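The equation itself was an image on the slide; the standard average-reward Bellman equation for an SMDP (as in the average-reward RL literature cited on the references slide) has the form

```latex
h(s) \;=\; \max_{a \in A}\Big[\, r(s,a) \;-\; \rho\, t(s,a)
      \;+\; \sum_{s'} P(s' \mid s, a)\, h(s') \,\Big].
```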
A subtask is a tuple ⟨Bi, Ai, Gi⟩:
- Bi: state abstraction function, which maps a state s of the original MDP to an abstract state in Mi.
- Ai: the set of subtasks that can be called by Mi.
- Gi: the termination (goal) condition of Mi.
Slide 16:
The value function decomposition satisfies a set of Bellman equations (not reproduced in the transcript; a rough sketch follows below). At the root, we only store the average-adjusted reward.
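As a rough guide only: this is my reconstruction of the general shape such MAXQ-style, average-reward decompositions take (cf. [1], [7]), not the paper's exact equations. The bias of a composite task i is decomposed over its children, with the gain ρ entering through the primitive tasks:

```latex
% Assumed form of a MAXQ-style average-reward decomposition (illustrative only)
h_i(s) \;\approx\; \max_{a \in A_i}\Big[\, h_a\big(B_a(s)\big)
        \;+\; \sum_{s'} P(s' \mid s, a)\, h_i(s') \,\Big],
\qquad
h_a(s) \;=\; r(s,a) \;-\; \rho\, t(s,a) \ \ \text{for primitive } a .
```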
Slide 17: (no transcript)
Slide 18: Results
- Learning curves for a test reward weight after having seen 0, 1, 2, ..., 10 previous training weight vectors.
- Negative transfer: learning based on one previous weight is worse than learning from scratch.
Slide 19: Transfer ratio
Transfer ratio = F_Y / F_{Y|X}
- F_Y is the area between the learning curve and its optimal value for the problem Y with no prior learning experience on X.
- F_{Y|X} is the area between the learning curve and its optimal value for the problem Y given prior training on X.
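A minimal sketch of how this quantity could be computed from two learning curves sampled at the same episodes; the function names and the discrete-sum approximation of the area are my own, not the paper's.

```python
import numpy as np

def regret_area(curve, optimal_value):
    """Area between a learning curve and its optimal value
    (the F quantities on the slide), approximated by a discrete sum."""
    curve = np.asarray(curve, dtype=float)
    return float(np.sum(optimal_value - curve))

def transfer_ratio(curve_no_prior, curve_with_prior, optimal_value):
    """Transfer ratio F_Y / F_{Y|X}; values above 1 indicate positive transfer."""
    f_y = regret_area(curve_no_prior, optimal_value)           # no prior experience on X
    f_y_given_x = regret_area(curve_with_prior, optimal_value)  # prior training on X
    return f_y / f_y_given_x
```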
Slide 20: Conclusions
- This paper showed that hierarchical task structure can accelerate transfer across variable-reward MDPs more than in the flat-MDP setting.
- The hierarchical task structure also facilitates multi-agent learning.
Slide 21: References
[1] T. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13:227-303, 2000.
[2] N. Mehta and P. Tadepalli. Multi-Agent Shared Hierarchy Reinforcement Learning. ICML Workshop on Richer Representations in Reinforcement Learning, 2005.
[3] S. Natarajan and P. Tadepalli. Dynamic Preferences in Multi-Criteria Reinforcement Learning. In Proceedings of ICML-05, 2005.
[4] N. Mehta, S. Natarajan, P. Tadepalli, and A. Fern. Transfer in Variable-Reward Hierarchical Reinforcement Learning. In NIPS Workshop on Transfer Learning, 2005.
[5] A. Barto and S. Mahadevan. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems, 2003.
[6] S. Mahadevan. Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Machine Learning, 22:169-196, 1996.
[7] P. Tadepalli and D. Ok. Model-based Average Reward Reinforcement Learning. Artificial Intelligence, 1998.