Title: Multiagent Reinforcement Learning in a Dynamic Environment
1. Multi-agent Reinforcement Learning in a Dynamic Environment
- The research goal is to enable multiple agents to learn suitable behaviors in a dynamic environment using reinforcement learning.
- We found that this approach can produce cooperative behavior among the agents without any prior knowledge.
Footnote: Work done by Sachiyo Arai, Katia Sycara.
2. Reinforcement Learning Approach
- Feature
- The reward is not given immediately after an agent's action.
- Usually, it is given only after the goal is achieved.
- This delayed reward is the only clue for the agents' learning.
- Overview
- TD [Sutton 88], Q-learning [Watkins 92]
- The agent can estimate a model of the state transition probabilities of E (the environment), provided E has fixed state transition probabilities (E is an MDP).
- Profit Sharing [Grefenstette 88]
- The agent can learn even though E does not have fixed state transition probabilities; no model of E is required.
- c.f. Dynamic programming
- The agent needs a perfect model of the state transition probabilities of E.
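For concreteness, a value-based learner such as Q-learning keeps a table of action values and updates it one step at a time. The sketch below is the textbook tabular rule [Watkins 92]; the action set and hyperparameter values are assumptions for illustration, not taken from this work.

```python
import random
from collections import defaultdict

# Textbook tabular Q-learning update [Watkins 92]; the action set and the
# hyperparameters are illustrative assumptions, not values from the slides.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = ["north", "south", "east", "west", "stay"]

Q = defaultdict(float)                 # Q[(state, action)] -> estimated return

def select_action(state):
    """Epsilon-greedy selection over the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One-step TD backup toward reward + GAMMA * max_a' Q(next_state, a')."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```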
3. Our Approach: Profit Sharing Plan (PSP)
Usually, a multi-agent environment is non-Markovian, because the transition probability from S_t to S_{t+1} can vary due to the agents' concurrent learning and perceptual aliasing.
PSP is robust against non-Markovian environments, because PSP does not require the environment to have a fixed transition probability from S_t to S_{t+1}.
f: the reinforcement function for temporal credit assignment.
Rationality Theorem (suppression of ineffective rules):
L \sum_{j=0}^{t} f_j < f_{t+1}, for all t = 1, 2, ..., T
(L: the number of available actions at each time step.)
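The slides do not specify the concrete reinforcement function f; a commonly used choice that satisfies this condition is a geometric decay from the terminal reward, assumed here only for illustration:

```latex
% Assumed reinforcement function (illustrative, not stated in the slides):
%   f_j = r_T \,\gamma^{\,T-j}, \qquad 0 < \gamma \le \tfrac{1}{L+1}
% Checking the rationality condition for any t:
L \sum_{j=0}^{t} f_j
  = L\, r_T\, \gamma^{\,T-t} \sum_{k=0}^{t} \gamma^{k}
  < \frac{L\, r_T\, \gamma^{\,T-t}}{1-\gamma}
  \le r_T\, \gamma^{\,T-t-1} = f_{t+1},
\qquad \text{since } \frac{L\gamma}{1-\gamma} \le 1 \iff \gamma \le \frac{1}{L+1}.
```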
Example (figure): a world with start state S and goal G worth a reward of 100, with actions a1 and a2; in this environment, a1 should be reinforced less than a2.
r_T: reward at time T (goal reached).
W_n: weight of a state-action pair after n episodes.
(x_t, a_t): state and action at time t of the n-th episode.
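Putting these definitions together, a minimal sketch of the PSP table update at the end of one episode might look as follows; the geometric credit function and the constants are assumptions for illustration, not the authors' exact implementation.

```python
from collections import defaultdict

L = 5                       # assumed number of available actions at each step
GAMMA = 1.0 / (L + 1)       # decay ratio chosen so the rationality theorem holds

W = defaultdict(float)      # W[(x, a)]: weight of a state-action pair (lookup table)

def psp_update(episode, r_T):
    """Profit Sharing credit assignment applied once the goal is reached.

    episode: list of (x_t, a_t) pairs for t = 1..T of the n-th episode
    r_T:     reward received at time T (the goal)
    """
    T = len(episode)
    for t, (x_t, a_t) in enumerate(episode, start=1):
        f_t = r_T * GAMMA ** (T - t)   # rules fired close to the goal get more credit
        W[(x_t, a_t)] += f_t           # W_{n+1}(x_t, a_t) = W_n(x_t, a_t) + f_t
```

Because f decays geometrically toward the start of the episode, the update favors rules fired near the goal, which is how an ineffective rule such as a1 in the example above ends up reinforced less than a2.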
4. Our Experiments
1. Pursuit Game: 4 hunters and 1 prey. Torus grid world. Requires the agents' cooperative work to capture the prey.
2. Pursuit Game: 3 hunters and multiple prey. Torus triangular world. Requires the agents' task scheduling over conjunctive multiple goals (Experiment 2 below).
3. Neo Block World domain: 3 groups of evacuees and 3 shelters of varying degrees of safety. Grid world. Required cooperative work includes conflict resolution and information sharing to evacuate.
5. Experiment 1: 4 Hunters and 1 Prey Pursuit Game
Objective: To verify that cooperative behavior emerges through Profit Sharing.
Hypothesis: Cooperative behavior, such as result sharing, task sharing, and conflict resolution, will emerge.
Setting: Torus grid world, size 15x15; sight size of each agent 5x5.
- Each hunter modifies its own lookup table by PSP independently.
- The hunters and the prey are located randomly at the initial state of each episode.
- The hunters learn by PSP; the prey moves randomly.
Modeling: Each hunter consists of a State Recognizer, an Action Selector, a Lookup Table, and a PSP module as the learner.
(Figure: 4 hunters and 1 prey. Hunter agent architecture: Input -> State Recognizer -> Action Selector -> Action; the Profit Sharing module updates the agent from the Reward.)
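A compact sketch of the hunter modeling above (State Recognizer, Action Selector, Lookup Table, PSP module); the egocentric torus window, the weight-proportional (roulette) action selection, and all constants are assumptions made for illustration.

```python
import random
from collections import defaultdict

GRID = 15                                               # torus grid world, 15 x 15
SIGHT = 5                                               # sight size of each hunter, 5 x 5
ACTIONS = ["north", "south", "east", "west", "stay"]    # assumed move set

class Hunter:
    """One hunter: State Recognizer -> Action Selector -> Lookup Table -> PSP."""

    def __init__(self):
        self.W = defaultdict(lambda: 1.0)       # lookup table W[(state, action)]
        self.episode = []                       # (state, action) pairs of this episode

    def recognize(self, grid, x, y):
        """State Recognizer: egocentric SIGHT x SIGHT window on the torus."""
        half = SIGHT // 2
        return tuple(
            tuple(grid[(y + dy) % GRID][(x + dx) % GRID]
                  for dx in range(-half, half + 1))
            for dy in range(-half, half + 1)
        )

    def act(self, state):
        """Action Selector: roulette selection proportional to the weights."""
        weights = [self.W[(state, a)] for a in ACTIONS]
        action = random.choices(ACTIONS, weights=weights, k=1)[0]
        self.episode.append((state, action))
        return action

    def learn(self, reward, gamma=1.0 / (len(ACTIONS) + 1)):
        """PSP module: credit the whole episode when the prey is captured."""
        T = len(self.episode)
        for t, (state, action) in enumerate(self.episode, start=1):
            self.W[(state, action)] += reward * gamma ** (T - t)
        self.episode.clear()
```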
7. Experiment 2: 3 Hunters and Multiple Prey Pursuit Game
Objective: To verify that task-scheduling knowledge emerges through PSP in an environment with conjunctive multiple goals.
Which proverb holds for reinforcement learning agents?
- Proverb 1: He who runs after two hares will catch neither.
- Proverb 2: Kill two birds with one stone.
Hypothesis: If the agents know the locations of the prey and the other agents, they realize Proverb 2, but sensory limitation makes them behave like Proverb 1.
Setting: Torus triangular world with 7 triangles on each edge.
- Sight size: 5 triangles on each edge, or 7 triangles on each edge (the full world).
- The prey move randomly.
- Each hunter modifies its own lookup table by PSP independently.
Modeling: Each hunter consists of a State Recognizer, an Action Selector, a Lookup Table, and a PSP module as the learner (same architecture as in Experiment 1).
8. Experiment 2: Results
1. Convergence
2. Discussion
- Without a global scheduling mechanism, the hunters capture the prey in a reasonable order (e.g., the closest prey first).
- The larger the number of prey in the environment, the more steps are required to capture the 1st prey, because it becomes more difficult to coordinate each hunter's choice of target. This fact implies that the hunters' targets are scattered (Proverb 1).
- The number of steps required to capture the last prey in the multiple-prey environment is smaller than that required to capture the 1st prey in the single-prey environment. This fact implies that the hunters pursue multiple prey simultaneously (Proverb 2).
9. Experiment 3: Neo Block World Domain -No.1-
- Objective: To verify that opportunistic knowledge emerges through PSP in an environment with disjunctive multiple goals.
- When there is more than one alternative for obtaining a reward in the environment, can the agents behave reasonably?
- Hypothesis
- If the agents know the locations of the safe places correctly, each agent can select the best place to evacuate to, but sensory limitation makes them go back and forth in confusion.
- Setting: Graph world, size 15 x 15; sight size 7 x 7.
- 2 groups of evacuees, 2 shelters.
- Each group of evacuees learns by PSP independently.
- The groups and shelters are located randomly at the initial state of each episode.
- Input of a group: the agent's own 7x7 input; no input sharing.
- Output of a group: walk-north, walk-south, walk-east, walk-west, stay.
- Reward
- Each group gets a reward only when it moves into a shelter.
- The amount of the reward depends on the degree of the shelter's safety.
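A small sketch of the reward scheme described above; the shelter coordinates and safety values are placeholders, since the slides only state that the reward amount depends on the shelter's degree of safety.

```python
# Illustrative reward scheme for Experiment 3; positions and safety degrees
# are made-up placeholders, not values taken from the slides.
SHELTERS = {
    (2, 3): 0.5,     # shelter 1: lower degree of safety
    (12, 11): 1.0,   # shelter 2: higher degree of safety
}
MAX_REWARD = 100.0

def reward_for(group_position):
    """A group is rewarded only when it moves into a shelter; the amount
    scales with that shelter's degree of safety, and is 0 elsewhere."""
    safety = SHELTERS.get(group_position)
    return MAX_REWARD * safety if safety is not None else 0.0
```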
10. Experiment 3: Results
1. Convergence
(Figure: convergence results; an unavailable path is marked in the world.)
2. Discussion
1. The agents learned to obtain a larger amount of reward: if the reward of shelter 1 is the same as that of shelter 2, they learned stochastic policies; on the other hand, if the difference between the rewards is large, they learned deterministic policies that appear to be nearly optimal.
2. In the latter case (large reward difference), the other agent works as a landmark for finding the shelter.
11. Experiment 4: Neo Block World Domain
- Objective: To verify the effects of sharing sensory information on the agents' learning and their behaviors.
- Hypothesis
- Sharing sensory input increases the size of the state space and the time required to converge, but the agents' policy becomes closer to optimal than that of agents without information sharing, because sharing reduces the agents' perceptual aliasing problem.
- Setting: Graph world, size 15 x 15; sight size 7 x 7.
- 3 groups of evacuees, 3 shelters.
- Each group of evacuees learns by PSP independently.
- The groups are located randomly at the initial state of each episode.
- Input of a group: the agent's own 7x7 input, plus information from the Blackboard.
- Output of a group: walk-north, walk-south, walk-east, walk-west, stay.
- Reward
- Each group gets a reward only when it moves into a shelter.
- The degree of safety is the same for each shelter.
- The rewards are not shared among the agents.
12. Experiment 4: Neo Block World Domain -No.2-
- Modeling
- Model 1: Each agent (group) consists of a State Recognizer, an Action Selector, a Lookup Table, and a PSP module as the learner. The agents share their sensory input by means of a blackboard (B.B.) and combine it with their own input.
(Figure: Model 1 architecture. The agent exchanges information with the other agents through the BlackBoard while interacting with the environment. Notation: observation O_t ∈ {O_1, O_2, ..., O_m} (t = 1, ..., T); lookup table W_n(O, a) of size m x l; action a_t ∈ A_t = {a_1, a_2, ..., a_l} (t = 1, ..., T); Profit Sharing f(R_n, O_j) (j = 1, ..., T); reward R_n received at t = T.)
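A minimal sketch of the blackboard sharing in Model 1: each group posts its latest local observation and reads the other groups' posts to build the combined state it looks up in its own table. The data structures and key encoding are assumptions for illustration.

```python
class Blackboard:
    """Shared store where every group posts its most recent local observation."""

    def __init__(self):
        self.posts = {}                        # group_id -> latest 7x7 observation

    def post(self, group_id, observation):
        self.posts[group_id] = observation

    def combined_state(self, group_id, own_observation):
        """Combine a group's own input with the other groups' posted
        observations; the result is the key used in that group's lookup table."""
        others = tuple(obs for gid, obs in sorted(self.posts.items())
                       if gid != group_id)
        return (own_observation, others)
```

Because the lookup-table key now also encodes the other groups' observations, the discriminated state space is larger than in the non-sharing case, which is consistent with the slower early convergence and better final performance discussed on the next slide.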
13. Experiment 4: Results
1. Convergence
2. Discussion
1. In the initial stage: the number of steps to reach a shelter decreases faster for a non-sharing agent than for a sharing agent. The non-sharing agent seems able to generalize over states and behave rationally even in states it has not experienced. The sharing agent, on the other hand, has to experience a discriminated state space that is larger than the generalized one, so it takes longer to reduce the number of steps than the non-sharing agent does.
2. In the final stage: the performance of a sharing agent is better than that of a non-sharing agent. The non-sharing agent seems to overgeneralize the state space and to be confused by aliases, whereas the sharing agent seems to refine its policy successfully and is hard to confuse.
14. Conclusion
- Agents learn suitable behaviors in a dynamic environment that includes multiple agents and goals, provided there is no aliasing due to sensory limitation, the concurrent learning of other agents, or the existence of multiple sources of reward.
- A strict division of the state space causes state explosion and worse performance in the early stage of learning.
Future Work
- Development of a structured mechanism for reinforcement learning.
- Hypothesis: a structured mechanism facilitates knowledge transfer.
- The agent learns knowledge about the appropriate generalization level of the state space.
- The agent learns knowledge about the appropriate amount of communication with others.
- Competitive learning
- Agents compete for resources.
- We need to resolve the structural credit assignment problem.