Title: Multi-Agent Systems, Lecture 10

Multi-Agent Systems
Lecture 10
University Politehnica of Bucharest, 2005-2006
Adina Magda Florea
adina_at_cs.pub.ro
http://turing.cs.pub.ro/blia_06
Machine Learning - Lecture outline
- 1 Learning in AI (machine learning)
- 2 Reinforcement learning
- 3 Learning in multi-agent systems
  - 3.1 Learning action coordination
  - 3.2 Learning individual performance
  - 3.3 Learning to communicate
  - 3.4 Layered learning
- 4 Conclusions
 
1 Learning in AI
- What is machine learning?
- Herbert Simon defines learning as "any change in a system that allows it to perform better the second time on repetition of the same task or another task drawn from the same population" (Simon, 1983).
- In ML the agent learns
  - knowledge representation of the problem domain
  - problem solving rules, inferences
  - problem solving strategies
 
3 
Classifying learning
- In MAS learning the agents should learn
  - what an agent learns in ML, but in the context of MAS - both cooperative and self-interested agents
  - how to cooperate for problem solving - cooperative agents
  - how to communicate - both cooperative and self-interested agents
  - how to negotiate - self-interested agents
- Different dimensions
  - explicitly represented domain knowledge
  - how the critic component (performance evaluation) of a learning agent works
  - the use of knowledge of the domain/environment
 
4 
Single agent learning
[Diagram: the Environment supplies data and feedback to the Learning Process; the learning results feed the Problem Solving component (knowledge base, inferences, strategy); its results go to Performance Evaluation, which returns feedback to the Learning Process.]
5 
Self-interested learning agent
[Diagram: a single self-interested learning agent exchanging data, communication, actions, and feedback with the Environment.]
- NB: Both in this diagram and the next, not all components or flow arrows are always present - it depends on the type of agent (cognitive, reactive), the type of learning, etc.
6 
Cooperative learning agents
[Diagram: two cooperative agents, each with its own Learning Process, learning results, Performance Evaluation, and feedback loops, exchanging communication and actions with the Environment and with each other.]
7 
2 Reinforcement learning
- Combines dynamic programming and AI machine learning techniques
- Trial-and-error interactions with a dynamic environment
- The feedback of the environment = reward or reinforcement
- Two main approaches
  - search in the space of behaviors (e.g., genetic algorithms)
  - learn utilities based on statistical techniques and dynamic programming methods
8 
2.1 A reinforcement-learning model
- B = agent's behavior
- i = input = current state of the environment
- r = value of reinforcement (reinforcement signal)
- T = model of the world
- The model consists of
  - a discrete set of environment states S (s ∈ S)
  - a discrete set of agent actions A (a ∈ A)
  - a set of scalar reinforcement signals, typically {0, 1} or real numbers
  - the transition model of the world, T
- The environment is nondeterministic
  - T : S x A → P(S), T = transition model
  - T(s, a, s')
- Environment history = a sequence of states that leads to a terminal state
9 
A 4 x 3 environment
- The intended outcome occurs with probability 0.8, and with probability 0.2 (0.1 + 0.1) the agent moves at right angles to the intended direction.
- The two terminal states have reward +1 and -1; all other states have a reward of -0.04.
- The action sequence [Up, Up, Right, Right, Right] reaches the terminal state (4,3), if every intended outcome occurs, with probability 0.8^5 = 0.32768.
[Figure: the 4 x 3 grid world (columns 1-4, rows 1-3) with transition probabilities 0.8 straight ahead and 0.1 to each side.]
10 
2.2 Features varying RL
- accessible / inaccessible environment
- has (T known) / has not a model of the environment
- learn behavior / learn behavior + model
- reward received only in terminal states or in any state
- passive / active learner
  - passive learner: learn utilities of states
  - active learner: learn also what to do
- how does the agent represent B, namely its behavior
  - utility functions on states or state histories (T is known)
  - action-value functions (T is not necessarily known) - assign an expected utility to taking a given action in a given state
11 
Agents
- States and goals
  - goal : E → {0, 1}
- Utilities
  - utility : E → R
  - env : E x A → P(E)
- Expected utility of an action a in a state e
  - EU(a, e) = Σ_{e'} P(e' | e, a) · utility(e')
- Maximum Expected Utility (MEU): choose the action with the highest expected utility (a small sketch follows)
 
12 
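A minimal Python sketch of MEU action selection (the states, actions, environment model, and utility values below are illustrative assumptions, not part of the lecture):

    # Hypothetical MEU action selection; `env`, `utility`, and the state/action
    # names are illustrative assumptions.

    def expected_utility(action, state, env, utility):
        """EU(a, e) = sum over e' of P(e' | e, a) * utility(e')."""
        return sum(p * utility[e2] for e2, p in env(state, action).items())

    def meu_action(state, actions, env, utility):
        """Pick the action with Maximum Expected Utility in `state`."""
        return max(actions, key=lambda a: expected_utility(a, state, env, utility))

    # Toy usage: two outcome states, two actions.
    utility = {"good": 1.0, "bad": 0.0}

    def env(state, action):
        # P(e' | e, a) returned as a dict of outcome probabilities
        return {"good": 0.8, "bad": 0.2} if action == "safe" else {"good": 0.5, "bad": 0.5}

    print(meu_action("start", ["safe", "risky"], env, utility))   # -> "safe"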
2.3 The RL problem
- The agent has to find a policy π = a function which maps states to actions and which maximizes some long-run measure of reinforcement.
- The agent has to learn an optimal behavior = an optimal policy = a policy which yields the highest expected utility - π*
- The utility function depends on the environment history (a sequence of states)
- In each state s the agent receives a reward - R(s)
- U_h(s0, s1, ..., sn) = utility function on histories
13 
Models of behavior (a small sketch of the three models follows)
- Finite-horizon model: at a given moment of time the agent should optimize its expected reward for the next h steps
  - E(Σ_{t=0..h} R(s_t))
  - R(s_t) represents the reward received t steps into the future.
- Infinite-horizon model: optimize the long-run reward
  - E(Σ_{t=0..∞} R(s_t))
- Infinite-horizon discounted model: optimize the long-run reward, but rewards received in the future are geometrically discounted according to a discount factor γ
  - E(Σ_{t=0..∞} γ^t R(s_t)), with 0 ≤ γ < 1
- γ can be interpreted in several ways: as an interest rate, as a probability of living another step, or as a mathematical trick to bound an infinite sum.
14 
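A tiny illustration of the three reward models on a made-up reward sequence (the rewards and the discount factor are illustrative):

    # Finite-horizon, infinite-horizon, and discounted sums for one reward sequence.
    rewards = [1.0, 0.0, 2.0, 1.0, 0.5]        # R(s_t) for t = 0..4 (made-up numbers)
    gamma = 0.9

    finite_h = sum(rewards[t] for t in range(3))                    # finite horizon, h = 3
    undiscounted = sum(rewards)                                     # infinite horizon (truncated here)
    discounted = sum(gamma**t * r for t, r in enumerate(rewards))   # discounted model

    print(finite_h, undiscounted, discounted)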
2.4 Markov systems
- Discounted rewards
  - Example: an AP gets paid 20/year: 20 + 20 + 20 + ...
  - Discounted value = (reward now) + γ(reward at time 1) + γ²(reward at time 2) + ...
- A Markov System with rewards
  - a set of states {S1, S2, ..., Sn}
  - a transition probability matrix P_ij = Prob(Next = S_j | This = S_i)
  - each state has a reward r1, r2, ..., rn
  - a discount factor γ in [0, 1)
- On each time step
  - assume the state is S_i
  - get reward r_i
  - randomly move to another state S_j with probability P_ij
  - all future rewards are discounted by γ
 
15 
- U(S_i) = expected discounted sum of future rewards starting in state S_i
- U(S_i) = r_i + γ(P_i1 U(S_1) + P_i2 U(S_2) + ... + P_in U(S_n)), i = 1..n
- Solving the equations gives an exact answer, but with 100 000 states you must solve a 100 000 x 100 000 system of equations
- Value iteration to solve a Markov system (a small sketch follows)
  - U^1(S_i) = r_i
  - U^2(S_i) = r_i + γ Σ_{j=1..N} P_ij U^1(S_j)
  - Compute U^1(S_i) for each state
  - Compute U^2(S_i) for each state, etc.
  - Stop when max_i |U^{k+1}(S_i) - U^k(S_i)| < ε
 
16 
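A minimal value-iteration sketch for a Markov system with rewards; the 3-state transition matrix, rewards, and discount below are made-up numbers:

    # U_{k+1}(S_i) = r_i + gamma * sum_j P_ij * U_k(S_j), iterated until convergence.
    P = [[0.5, 0.5, 0.0],      # P[i][j] = Prob(Next = S_j | This = S_i)
         [0.1, 0.0, 0.9],
         [0.0, 0.0, 1.0]]
    r = [0.0, 1.0, 2.0]        # reward r_i received in state S_i
    gamma, eps = 0.9, 1e-6

    U = r[:]                   # U_1(S_i) = r_i
    while True:
        U_new = [r[i] + gamma * sum(P[i][j] * U[j] for j in range(len(U)))
                 for i in range(len(U))]
        if max(abs(U_new[i] - U[i]) for i in range(len(U))) < eps:
            break
        U = U_new

    print([round(u, 3) for u in U])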
2.5 Markov Decision Problem (MDP)
- An MDP consists of <S, A, T, R>
  - S - a set of states
  - A - a set of actions
  - R = reward function, R : S x A → R
  - T : S x A → Π(S), with Π(S) the probability distribution over the states S
- On each time step
  - assume the state is S_i
  - get reward R_i
  - choose action a (from a1 ... ak)
  - move to another state S_j with probability given by T(S_i, a)
  - all future rewards are discounted by γ
- We shall use T(s, a, s') = Prob(Next = s' | This = s and I use action a)
17 
Markov Decision Problem (MDP)
- The model is Markov if the state transitions are independent of any previous environment states or agent actions.
- MDPs may have finite state and action spaces (the focus here) or infinite state and action spaces
- For every MDP there exists an optimal policy
  - It is a policy such that for every possible start state there is no better option than to follow the policy
- Finding the optimal policy given a model T: calculate the utility of each state, U(state), and use the state utilities to select an optimal action in each state.
18 
Value iteration to solve an MDP
- U^1(s) = R(s)
- U^2(s) = max_a (R(s) + γ Σ_{s'} T(s,a,s') U^1(s'))
- ...
- U^{k+1}(s) = max_a (R(s) + γ Σ_{s'} T(s,a,s') U^k(s'))
- Compute U^1(s_i) for each state s_i
- Compute U^2(s_i) for each state, etc.
- Stop when max_i |U^{k+1}(s_i) - U^k(s_i)| < ε  => convergence
- (dynamic programming)

Compare with value iteration for a Markov system: U^{k+1}(S_i) = r_i + γ Σ_{j=1..N} P_ij U^k(S_j)
19 
- The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action
  - U(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') U(s')
- Bellman equation - the U*(s) are its unique solutions
- The utility function U(s) allows the agent to select actions by using the Maximum Expected Utility principle
  - π*(s) = argmax_a (R(s) + γ Σ_{s'} T(s,a,s') U(s')) = the optimal policy
 
20 
A 4 x 3 environment
- The intended outcome occurs with probability 0.8, and with probability 0.2 (0.1 + 0.1) the agent moves at right angles to the intended direction.
- The two terminal states have reward +1 and -1; all other states have a reward of -0.04; γ = 1.
Utilities of the states computed by value iteration (columns 1-4, rows 1-3; (2,2) is blocked):
  row 3:  0.812   0.868   0.918    +1
  row 2:  0.762     -     0.660    -1
  row 1:  0.705   0.655   0.611   0.388
21 
Bellman equation for the 4 x 3 world. Equation for the state (1,1):
U(1,1) = -0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),   (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }  (Right)
Up is the best action.
(Utility values as in the grid on the previous slide.)
Value Iteration
- Given the maximal expected utility, the optimal policy is
  - π*(s) = argmax_a (R(s) + γ Σ_{s'} T(s,a,s') U(s'))
  - it defines the best action in state s
- Compute U(s) using an iterative approach => Value Iteration (a small sketch follows)
  - U_0(s) = R(s)
  - U_{t+1}(s) = R(s) + max_a (γ Σ_{s'} T(s,a,s') U_t(s'))   (computed for all s)
  - as t → ∞, the utility values converge to the optimal values
23 
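A minimal value-iteration sketch for an MDP; the toy states, actions, transition model T, and rewards R are illustrative assumptions, not the 4 x 3 world of the slides:

    # U_{t+1}(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') * U_t(s')
    R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}
    T = {"s0":   {"go":   {"s1": 0.8, "s0": 0.2},
                  "stay": {"s0": 1.0}},
         "s1":   {"go":   {"goal": 0.8, "s1": 0.2},
                  "stay": {"s1": 1.0}},
         "goal": {}}                      # terminal state: no actions
    gamma, eps = 0.9, 1e-6

    U = {s: R[s] for s in R}              # U_0(s) = R(s)
    while True:
        U_new = {}
        for s in R:
            if not T[s]:                  # a terminal state keeps its reward
                U_new[s] = R[s]
            else:
                U_new[s] = R[s] + gamma * max(
                    sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s])
        if max(abs(U_new[s] - U[s]) for s in R) < eps:
            break
        U = U_new

    # Extract the optimal policy from the converged utilities.
    policy = {s: max(T[s], key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
              for s in R if T[s]}
    print({s: round(u, 3) for s, u in U.items()}, policy)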
Policy iteration
- Manipulate the policy directly, rather than finding it indirectly via the optimal value function (a small sketch follows)
- choose an arbitrary policy π (randomly)
- at each step t, compute the long-run reward of starting in s and using π_t, i.e. solve the equations
  - U_t(s) = R(s) + γ Σ_{s'} T(s, π_t(s), s') U_t(s')
- improve the policy at each state
  - π_{t+1}(s) ← argmax_a (R(s) + γ Σ_{s'} T(s,a,s') U_t(s'))
- Involves all next states - complex
 
24 
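A compact policy-iteration sketch on a tiny made-up MDP; policy evaluation is done here by simple iterative sweeps rather than by solving the linear system exactly (a deliberate simplification):

    import random

    R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}
    T = {"s0": {"go": {"s1": 0.8, "s0": 0.2}, "stay": {"s0": 1.0}},
         "s1": {"go": {"goal": 0.8, "s1": 0.2}, "stay": {"s1": 1.0}},
         "goal": {}}
    gamma = 0.9

    def evaluate(policy, sweeps=100):
        # U_t(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') * U_t(s')
        U = {s: 0.0 for s in R}
        for _ in range(sweeps):
            U = {s: R[s] + (gamma * sum(p * U[s2] for s2, p in T[s][policy[s]].items())
                            if T[s] else 0.0)
                 for s in R}
        return U

    policy = {s: random.choice(list(T[s])) for s in R if T[s]}   # arbitrary initial policy
    while True:
        U = evaluate(policy)
        improved = {s: max(T[s], key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
                    for s in policy}                             # policy improvement step
        if improved == policy:
            break
        policy = improved

    print(policy)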
2.6 RL learning
- Use observed rewards to learn an optimal (or nearly optimal) policy for the environment
- Example: play 100 moves and you lose
- In an MDP the agent has a complete model of the environment
- Now the agent has no such model
- Passive learning: the agent's policy is fixed. The task is to learn the utilities of states (or state-action pairs)
- Active learning: the agent must also learn what to do (exploitation/exploration)
25 
(a) Passive reinforcement learning
- The policy is fixed: in state s always execute π(s)
- Goal: learn how good the policy is = learn U^π(s)
- The agent does not know T(s,a,s') and does not know R(s) in advance
- ADP (Adaptive Dynamic Programming) learning
  - The problem of calculating an optimal policy in an accessible, stochastic environment
  - ADP: plug the learned T(s, π(s), s') and the observed rewards R(s) into the Bellman equations to calculate the utility of states
  - Supervised learning: input = state-action pairs, output = resulting state
  - Estimate the transition probabilities T(s,a,s') from the frequencies with which s' is reached after executing a in s
  - Example: from (1,3), Right reached (2,3) 2 times and stayed in (1,3) 1 time => T((1,3), Right, (2,3)) = 2/3
 
26 
ADP (Adaptive Dynamic Programming) learning

function Passive-ADP-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: π, a fixed policy
             mdp, an MDP with model T, rewards R, discount γ
             U, a table of utilities, initially empty
             Nsa, a table of frequencies for state-action pairs, initially zero
             Nsas', a table of frequencies of state-action-state triples, initially zero
             s, a, the previous state and action, initially null
  if s' is new then U[s'] ← r'; R[s'] ← r'
  if s is not null then
     increment Nsa[s,a] and Nsas'[s,a,s']
     for each t such that Nsas'[s,a,t] ≠ 0 do
        T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  U ← Value-Determination(π, U, mdp)   // according to the MDP (value iteration or policy iteration)
  if Terminal?[s'] then s, a ← null else s, a ← s', π[s']
  return a
end
27 
Temporal difference learning (TD learning)
- The value function is no longer computed by solving a set of linear equations; it is computed iteratively.
- Use observed transitions to adjust the values of the observed states so that they agree with the constraint equations.
  - U^π(s) ← U^π(s) + α(R(s) + γ U^π(s') - U^π(s))
  - α is the learning rate (a small update sketch follows)
- Whatever state is visited, its estimated value is updated to be closer to R(s) + γ U^π(s'), since R(s) is the instantaneous reward received and U^π(s') is the estimated value of the actually occurring next state.
- simpler than ADP, involves only the next states
- α decreases as the number of times the state is visited increases
28 
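A minimal TD(0) update sketch for passive learning under a fixed policy; the learning-rate schedule and the example transition below are illustrative assumptions:

    from collections import defaultdict

    U = defaultdict(float)        # utility estimates U^pi(s)
    N = defaultdict(int)          # visit counts, used to decay the learning rate
    gamma = 0.9

    def alpha(n):
        return 1.0 / (1.0 + n)    # learning rate decreasing with the number of visits

    def td_update(s, r, s_next):
        # U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
        N[s] += 1
        U[s] += alpha(N[s]) * (r + gamma * U[s_next] - U[s])

    # Example: one observed transition from "s0" (reward -0.04) to "s1".
    td_update("s0", -0.04, "s1")
    print(dict(U))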
Temporal difference learning

function Passive-TD-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: π, a fixed policy
             U, a table of utilities, initially empty
             Ns, a table of frequencies for states, initially zero
             s, a, r, the previous state, action, and reward, initially null
  if s' is new then U[s'] ← r'
  if s is not null then
     increment Ns[s]
     U[s] ← U[s] + α(Ns[s])(r + γ U[s'] - U[s])
  if Terminal?[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
  return a
end
 
29 
Temporal difference learning
- Does not need a model to perform its updates
- The environment supplies the connections between neighboring states in the form of observed transitions.
- ADP and TD comparison
  - Both ADP and TD try to make local adjustments to the utility estimates in order to make each state "agree" with its successors
  - TD adjusts a state to agree with the observed successor
  - ADP adjusts a state to agree with all of the successors that might occur, weighted by their probabilities
30 
(b) Active reinforcement learning
- A passive learning agent has a fixed policy that determines its behavior
- An active learning agent must decide what action to take
- The agent must learn a complete model with outcome probabilities for all actions (instead of a model for the fixed policy)
- Compute/learn the utilities that obey the Bellman equation
  - U(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') U(s')
  - using value iteration or policy iteration
  - If value iteration, then look for the action that maximizes utility
  - If policy iteration, you already have the action
- Exploration/exploitation
  - The representative problem is the n-armed bandit problem
- Solutions
  - 1/t of the time choose random actions, the rest of the time follow π
  - give weights to actions that have not been explored, avoid actions with low utilities
  - Exploration function f(u, n): determines how greedy (preferring high utility values) or exploratory the agent is
31 
Q-learning
- Active learning of action-value functions
- action-value function: assigns an expected utility to taking a given action in a given state; Q-values
- Q(a, s) = the value of doing action a in state s (expected utility)
- Q-values are related to utility values by the equation
  - U(s) = max_a Q(a, s)
- Approach 1
  - Q(a,s) = R(s) + γ Σ_{s'} T(s,a,s') max_{a'} Q(a',s')
  - This requires a model
- Approach 2
  - Use TD
  - The agent does not need to learn a model => model-free
32 
Q-learning
- TD learning, unknown environment
  - Q(a,s) ← Q(a,s) + α(R(s) + γ max_{a'} Q(a',s') - Q(a,s))
  - calculated after each transition from state s to s'
- Is it better to learn a model and a utility function, or to learn an action-value function with no model?
33 
Q-learning

function Q-Learning-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: Q, a table of action values indexed by state and action
             Nsa, a table of frequencies for state-action pairs
             s, a, r, the previous state, action, and reward, initially null
  if s is not null then
     increment Nsa[s,a]
     Q[a,s] ← Q[a,s] + α(Nsa[s,a])(r + γ max_{a'} Q[a',s'] - Q[a,s])
  if Terminal?[s'] then s, a, r ← null
  else s, a, r ← s', argmax_{a'} f(Q[a',s'], Nsa[a',s']), r'
  return a
end

(without the exploration function: s, a, r ← s', argmax_{a'} Q[a',s'], r')
A small Python sketch follows.
34 
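A compact tabular Q-learning sketch; the 5-state chain environment and the epsilon-greedy exploration rule (standing in for the exploration function f) are illustrative assumptions, not the lecture's exact setup:

    import random
    from collections import defaultdict

    ACTIONS = ["left", "right"]
    GOAL = 4                              # states 0..4, state 4 is terminal with reward +1

    def step(s, a):
        s2 = max(0, s - 1) if a == "left" else min(GOAL, s + 1)
        return s2, (1.0 if s2 == GOAL else -0.04), s2 == GOAL

    Q = defaultdict(float)                # Q[(a, s)], initially 0
    gamma, alpha, epsilon = 0.9, 0.1, 0.1

    for _ in range(2000):                 # learning episodes
        s, done = 0, False
        while not done:
            a = random.choice(ACTIONS) if random.random() < epsilon \
                else max(ACTIONS, key=lambda x: Q[(x, s)])
            s2, r, done = step(s, a)
            best_next = 0.0 if done else max(Q[(a2, s2)] for a2 in ACTIONS)
            # Q(a,s) <- Q(a,s) + alpha * (R + gamma * max_a' Q(a',s') - Q(a,s))
            Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])
            s = s2

    print({s: max(ACTIONS, key=lambda a: Q[(a, s)]) for s in range(GOAL)})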
Generalization of RL
- The problem of learning in large spaces: a large number of states
- Generalization techniques allow compact storage of learned information and transfer of knowledge between "similar" states and actions:
  - neural nets
  - decision trees
  - U(state) = U(most similar state in memory)
  - U(state) = average of U(most similar states in memory)
 
35 
3 Learning in MAS
- The credit-assignment problem (CAP): the problem of assigning the feedback (credit or blame) for an overall performance change of the MAS (increase, decrease) to each agent that contributed to that change
  - inter-agent CAP: assigns credit or blame to the external actions of agents
  - intra-agent CAP: assigns credit or blame for a particular external action of an agent to its internal inferences and decisions
  - the distinction is not always obvious; often it is one or the other
 
36 
3.1 Learning action coordination
- s = current environment state
- Agent i determines the set of actions it can do in s: A_i(s) = {A_ij(s)}
- It computes the goal relevance E_ij(s) of each action
- Agent i announces a bid for each action with E_ij(s) > threshold
  - B_ij(s) = (α + β) E_ij(s)
  - α - risk factor (small), β - noise term (to prevent convergence to local minima)
37 
- The action with the highest bid is selected
- Incompatible actions are eliminated
- The process is repeated until all actions in the bids are either selected or eliminated
- A = the selected actions = the activity context
- Execute the selected actions
- Update the goal relevance of the actions in A:
  - E_ij(s) ← E_ij(s) - B_ij(s) + (R / |A|)
  - R = the external reward received
- Update the goal relevance of the actions A_kl in the previous activity context A_p:
  - E_kl(s_p) ← E_kl(s_p) + (Σ_{A_ij ∈ A} B_ij(s) / |A_p|)
(A small sketch of one bidding round follows.)
 
38 
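An illustrative sketch of one bidding round of the coordination scheme above; the goal-relevance values, the compatibility relation, the constants, and the uniform noise term are made up:

    import random

    alpha, threshold = 0.1, 0.2               # risk factor and bid threshold (made up)
    E = {"a1": 0.9, "a2": 0.7, "a3": 0.1}     # goal relevance E_ij(s) of candidate actions
    incompatible = {("a1", "a2")}             # action pairs that cannot run together

    # B_ij(s) = (alpha + beta) * E_ij(s), with beta a small random noise term
    bids = {a: (alpha + random.uniform(0.0, 0.05)) * e
            for a, e in E.items() if e > threshold}

    selected, remaining = [], dict(bids)
    while remaining:
        best = max(remaining, key=remaining.get)     # the highest bid wins
        selected.append(best)
        remaining = {a: b for a, b in remaining.items()
                     if a != best and (a, best) not in incompatible
                     and (best, a) not in incompatible}

    R = 1.0                                          # external reward for this round
    for a in selected:                               # E_ij(s) <- E_ij(s) - B_ij(s) + R/|A|
        E[a] = E[a] - bids[a] + R / len(selected)

    print(selected, {a: round(e, 3) for a, e in E.items()})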
3.2 Learning individual performance
- The agent learns how to improve its individual performance in a multi-agent setting
- Examples
  - cooperative agents - learning organizational roles
  - competitive agents - learning from market conditions
39 
3.2.1 Learning organizational roles (Nagendra e.a.)
- Agents learn to adopt a specific role in a particular situation (state) of a cooperative MAS
- Aim: to increase the utility of the final states
- Each agent may play several roles in a situation
- The agents learn to select the most appropriate role
- Use reinforcement learning
- Utility, Probability, and Cost (UPC) estimates of a role in a situation
  - Utility - the agent's estimate of a final state's worth for a specific role in a situation; world states are mapped to a smaller set of situations
  - S = {s0, ..., sf}
  - U_rs = U(sf), for a path s0 → ... → sf
40 
- Probability - the likelihood of reaching a final state for a specific role in a situation
  - P_rs = p(sf), for a path s0 → ... → sf
- Cost - the computational cost of reaching a final state for a specific role in a situation
- Potential of a role - estimates the usefulness of a role in discovering pertinent global information and constraints (orthogonal to utilities)
- Representation
  - S_k - vector of situations for agent k, [S_k1, ..., S_kn]
  - R_k - vector of roles for agent k, [R_k1, ..., R_km]
  - |S_k| x |R_k| x 4 values to describe UPC and Potential
41 
Functioning
- Phase I: Learning
  - several learning cycles; in each cycle each agent goes from s0 to sf and selects its role as the one with the highest probability
  - the probability of selecting a role r in a situation s is proportional to f(U_rs, P_rs, C_rs, Pot_rs)
  - f - the objective function used to rate the roles (e.g., f(U, P, C, Pot) = U·P·C + Pot); it depends on the domain
 
42 
- Use reinforcement learning to update the UPC values and the potential of a role
- For every s ∈ {s0, ..., sf} and the chosen role r in s:
  - U_rs^{i+1} = (1 - α) U_rs^i + α U(sf)
    - i - the learning cycle
    - U(sf) - the utility of the final state
    - 0 ≤ α ≤ 1 - the learning rate
  - P_rs^{i+1} = (1 - α) P_rs^i + α O(sf)
    - O(sf) = 1 if sf is successful, 0 otherwise
 
43 
- Pot_rs^{i+1} = (1 - α) Pot_rs^i + α Conf(Path)
  - Path = s0, ..., sf
  - Conf(Path) = 0 if there are conflicts on the Path, 1 otherwise
- The update rules for cost are domain dependent (a small sketch of the updates follows)
- Phase II: Performing
  - In a situation s, the role r chosen is the one that maximizes f(U_rs, P_rs, C_rs, Pot_rs)
44 
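A small sketch of the UPC reinforcement updates for one (role, situation) pair; the initial estimates and the final-state outcome are illustrative:

    alpha = 0.2                                 # learning rate, 0 <= alpha <= 1
    U   = {("r1", "s0"): 0.5}                   # utility estimates U_rs
    P   = {("r1", "s0"): 0.5}                   # probability estimates P_rs
    Pot = {("r1", "s0"): 0.5}                   # potential estimates Pot_rs

    def update(role, situation, u_final, success, conflict_free):
        # Apply the three update rules for a (role, situation) visited on the path.
        key = (role, situation)
        U[key]   = (1 - alpha) * U[key]   + alpha * u_final                         # utility
        P[key]   = (1 - alpha) * P[key]   + alpha * (1.0 if success else 0.0)       # probability
        Pot[key] = (1 - alpha) * Pot[key] + alpha * (1.0 if conflict_free else 0.0) # potential

    # One learning cycle ended in a successful, conflict-free final state of utility 0.9.
    update("r1", "s0", u_final=0.9, success=True, conflict_free=True)
    print(U, P, Pot)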
3.2.2 Learning in market environments (Vidal & Durfee)
- Agents use past experience and evolved models of other agents to better sell and buy goods
- Environment: a market in which agents buy and sell information (an electronic marketplace)
- Open environment
- The agents are self-interested (they maximize local utility)
  - G - a set of goods
  - P - the set of possible prices for goods
  - Q_g - the set of possible qualities for a good g
 
45 
- information has a cost for the seller and a value for the buyer
- information is sold at a certain price
- a buyer announces a good it needs
- sellers bid their prices for delivering the good
- the buyer selects from these bids and pays the corresponding price
- the buyer assesses the quality of the information after it receives it from the seller
- Profit of a seller s for selling the good g at price p:
  - Profit_sg(p) = p - c_sg
  - c_sg - the cost of producing the good g by s, p - the price
- Value of a good g for a buyer b:
  - V_bg(p, q), with p - the price b paid for g, q - the quality of the good g
- Goal: the seller maximizes profit in a transaction, the buyer maximizes value in a transaction
46 
3 types of agents
- 0-level agents
  - they set their buying and selling prices based on their own past experience
  - they do not model the behavior of other agents
- 1-level agents
  - model other agents based on previous interactions
  - they set their buying and selling prices based on these models and on past experience
  - they model the other agents as 0-level agents
- 2-level agents
  - same as 1-level agents, but they model the other agents as 1-level agents
47 
Strategy of 0-level agents (a small sketch follows)
- 0-level buyer
  - learns the expected value function f_g(p) of buying g at price p
  - uses reinforcement learning
    - f_g^{i+1}(p) = (1 - α) f_g^i(p) + α V_bg(p, q), with α_min ≤ α ≤ 1 and α = 1 for i = 0
  - chooses the seller s for supplying a good g
- 0-level seller
  - learns the expected profit function h_g(p) if it offers good g at price p
  - uses reinforcement learning
    - h_g^{i+1}(p) = (1 - α) h_g^i(p) + α Profit_sg(p)
    - where Profit_sg(p) = p - c_sg if it wins the auction, 0 otherwise
  - chooses the price p_s at which to sell the good g so as to maximize profit
48 
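A small sketch of the 0-level buyer and seller updates; the prices, the production cost, and the fixed learning rate are illustrative assumptions (the lecture's schedule starts with alpha = 1 and then decreases):

    PRICES = [1, 2, 3]
    f = {p: 0.0 for p in PRICES}     # buyer: expected value of buying good g at price p
    h = {p: 0.0 for p in PRICES}     # seller: expected profit of offering good g at price p
    alpha, cost = 0.3, 1.0           # learning rate and the seller's production cost c_sg

    def buyer_update(p, value):
        # f(p) <- (1 - alpha) * f(p) + alpha * V_bg(p, q)
        f[p] = (1 - alpha) * f[p] + alpha * value

    def seller_update(p, won_auction):
        # h(p) <- (1 - alpha) * h(p) + alpha * Profit_sg(p)
        profit = (p - cost) if won_auction else 0.0
        h[p] = (1 - alpha) * h[p] + alpha * profit

    buyer_update(2, value=1.5)                     # bought at price 2, obtained value 1.5
    seller_update(2, won_auction=True)             # sold at price 2
    best_price = max(PRICES, key=lambda p: h[p])   # the seller picks the price maximizing h
    print(f, h, best_price)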
Strategy of 1-level agents
- 1-level buyer
  - models the sellers for good g
  - does not model the other buyers
  - uses a probability distribution function q_sg(x) over the qualities x of a good g
  - computes the expected utility E_sg of buying good g from seller s
  - chooses the seller s for supplying a good g that maximizes this expected utility
49 
1-level seller
- models the buyers for good g
- models the other sellers s' for good g
- Buyer model
  - uses a probability distribution function m_bg(p) - the probability that b will choose price p for good g
- Seller model
  - uses a probability distribution function n_s'g(y) - the probability that s' will bid price y for good g
  - computes the probability of bidding lower than a given seller s' with the price p:
    - Prob_of_bidding_lower_than_s' = Σ_{p'} (prob. of a bid of s' with p' for which s wins)
      = Σ_{p'} N(g, b, s, s', p, p')
    - N(g, b, s, s', p, p') = n_s'g(p') if m_bg(p') ≤ m_bg(p), 0 otherwise
50 
- computes the probability of bidding lower than all the other sellers with the price p:
  - Prob_of_bidding_lower_with_p = Π_{s' ∈ S - {s}} Prob_of_bidding_lower_than_s'
- chooses the best price p to bid so as to maximize the expected profit (a small sketch follows)
 
51 
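A small sketch of the 1-level seller's price choice; the distributions m and n, the independence assumption across competitors, and the numbers are illustrative:

    PRICES = [1, 2, 3]
    m = {1: 0.6, 2: 0.3, 3: 0.1}                 # m_bg(p): prob. the buyer picks price p
    n = {"s2": {1: 0.2, 2: 0.5, 3: 0.3},         # n_s'g(p'): bid distribution of seller s'
         "s3": {1: 0.1, 2: 0.4, 3: 0.5}}
    cost = 0.5                                    # production cost c_sg

    def p_lower_than(other, p):
        # Probability of beating one competitor s' when bidding price p.
        return sum(prob for p2, prob in n[other].items() if m[p2] <= m[p])

    def p_win(p):
        # Probability of beating all other sellers with price p (independence assumed).
        result = 1.0
        for other in n:
            result *= p_lower_than(other, p)
        return result

    best = max(PRICES, key=lambda p: p_win(p) * (p - cost))   # maximize expected profit
    print({p: round(p_win(p), 3) for p in PRICES}, best)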
3.3 Learning to communicate
- What to communicate (e.g., what information is of interest to the others)
- When to communicate (e.g., when to try doing something by itself and when to look for help)
- With which agents to communicate
- How to communicate (e.g., language, protocol, ontology)
52 
Learning with which agents to communicate (Ohko e.a.)
- Learning which agents to ask to perform a task
- Used in a contract net protocol for task allocation, to reduce the communication needed for task announcement
- Goal: acquire and refine knowledge about the other agents' task-solving abilities
- Case-based reasoning is used for knowledge acquisition and refinement
- A case consists of:
  - (1) a task specification
  - (2) information about which agents solved the task or similar tasks in the past, and the quality of the provided solution
53 
- (1) Task specification
  - T_i = {A_i1 = V_i1, ..., A_im_i = V_im_i}
  - A_ij - task attribute, V_ij - value of the attribute
- Similar tasks (a small sketch follows)
  - Sim(T_i, T_j) = Σ_r Σ_s Dist(A_ir, A_js), with A_ir ∈ T_i, A_js ∈ T_j
  - Dist(A_ir, A_js) = Sim_Attr(A_ir, A_js) · Sim_Vals(V_ir, V_js)
- Set of similar tasks
  - S(T) = {T_j | Sim(T, T_j) ≥ 0.85}
54 
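A small sketch of similarity-based case retrieval; the attribute and value similarity functions are crude stand-ins, and the normalization of Sim to [0, 1] is an added assumption so that the 0.85 threshold is meaningful:

    def sim_attr(a1, a2):
        return 1.0 if a1 == a2 else 0.0            # crude attribute similarity

    def sim_vals(v1, v2):
        return 1.0 if v1 == v2 else 0.5            # crude value similarity

    def dist(attr_i, val_i, attr_j, val_j):
        return sim_attr(attr_i, attr_j) * sim_vals(val_i, val_j)

    def sim(task_i, task_j):
        # Sim(Ti, Tj) = sum_r sum_s Dist(Air, Ajs), normalized here to [0, 1].
        total = sum(dist(ai, vi, aj, vj)
                    for ai, vi in task_i.items() for aj, vj in task_j.items())
        return total / max(len(task_i), len(task_j))

    new_task = {"type": "transport", "size": "large"}
    cases = {"T1": {"type": "transport", "size": "large"},
             "T2": {"type": "assembly",  "size": "small"}}

    similar = [name for name, t in cases.items() if sim(new_task, t) >= 0.85]
    print(similar)                                 # cases eligible for addressee selection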
- (2) Which agents performed T or similar tasks in the past
- Suitability of agent A_k:
  - Perform(A_k, T_j) - the quality of the solution for T_j obtained when agent A_k performed T_j in the past
  - the agent computes Suit(A_k, T), with Suit(A_k, T) > 0, and selects the agent k with the best suitability, or the first m agents with the best suitability
- After each encounter, the agent stores the tasks performed by the other agents and the solution quality
- Trade-off between exploitation and exploration
 
55 
3.4 Layered learning
- (Stone & Veloso)
- A hierarchical machine learning paradigm for MAS
- Used in simulated robotic soccer (RoboCup)
- Learning a direct mapping Input → Output is intractable
- Decompose the learning task L into subtasks L1, ..., Ln
- Characteristics of the environment
  - cooperative MAS
  - teammates and adversaries
  - hidden state: agents have a partial world view at any given moment
  - agents have noisy sensors and actuators
  - perception and action cycles are asynchronous
  - agents must make their decisions in real time
 
56 
- Problem: the agent receives a moving ball and must decide what to do with it: dribble, pass to a teammate, or shoot towards the goal
- Decompose the problem into 3 subtasks:
  - L1 - individual behavior - ball interception
  - L2 - multiagent behavior - pass evaluation
  - L3 - team behavior - pass selection
- The decomposition into subtasks enables the learning of more complex behaviors
- The hierarchical task decomposition is constructed bottom-up, in a domain-dependent fashion
- Learning methods are chosen to suit the task
- Learning in one layer feeds into the next layer either by providing a portion of the behavior used for training (ball interception → pass evaluation) or by creating the input representation and pruning the action space (pass evaluation → pass selection)
57 
L1: Ball interception (individual behavior)
- Aim
  - block or intercept opponents' shots or passes, or
  - receive passes from teammates
- Learning method: a fully connected backpropagation neural network
- Training: repeatedly shoot the ball towards a defender placed in front of a goal; the defender collects training examples by acting randomly and noticing when it successfully stops the ball
- Classification
  - saves: successful interceptions
  - goals: unsuccessful attempts
  - misses: shots that went wide of the goal
 
58 
L2: Pass evaluation (multiagent behavior)
- Uses the learned ball-interception skill as part of the behavior used for training the multiagent behavior
- Aim: the agent must decide
  - whether or not to pass the ball to a teammate, and
  - whether the teammate will successfully receive the ball (based on the positions and abilities of the teammates and opponents to receive or intercept a pass)
- Learning method: decision trees (C4.5)
- Training: kick the ball towards randomly placed teammates interspersed with randomly placed opponents
- The intended pass recipient and the opponents all use the learned ball-interception behavior
- Classification of a potential pass to a receiver
  - success, with a confidence factor ∈ (0, 1]
  - failure, with a confidence factor ∈ [-1, 0)
  - miss (= 0)
59 
L3: Pass selection (team behavior)
- Uses the learned pass-evaluation capabilities to create the input and output sets for learning pass selection
- Aim: the agent has the ball and must decide
  - to which teammate to pass the ball, or
  - whether to shoot on goal
- Learning method: Q-learning of a function that depends on the agent's position on the field
- Training: simulate 2 teams playing with identical behaviors other than their pass-selection policies
- Reinforcement: the total number of goals scored
- Learns
  - when to shoot on goal
  - the teammate to which to pass
60 
4 Conclusions
- There is no unique method or set of methods for learning in MAS
- Many approaches are based on extending ML techniques to a MAS setting
- Many approaches use reinforcement learning, but also neural networks or genetic algorithms
61 
References
- S. Sen, G. Weiss. Learning in Multiagent Systems. In: Multiagent Systems - A Modern Approach to Distributed Artificial Intelligence, G. Weiss (Ed.), The MIT Press, 2001, p. 257-298.
- T. Ohko, e.a. Addressee Learning and Message Interception for Communication Load Reduction in Multiple Robot Environments. In: Distributed Artificial Intelligence Meets Machine Learning, G. Weiss (Ed.), Lecture Notes in Artificial Intelligence, Vol. 1221, Springer-Verlag, 1997, p. 242-258.
- M.V. Nagendra, e.a. Learning Organizational Roles in a Heterogeneous Multi-Agent System. In: Proc. of the Second International Conference on Multiagent Systems, AAAI Press, 1996, p. 291-298.
- J.M. Vidal, E.H. Durfee. The Impact of Nested Agent Models in an Information Economy. In: Proc. of the Second International Conference on Multiagent Systems, AAAI Press, 1996, p. 377-384.
- P. Stone, M. Veloso. Layered Learning. In: Eleventh European Conference on Machine Learning, ECML-2000.
62 
Web References
- An interesting set of training examples and the connection between decision trees and rules:
  http://www.dcs.napier.ac.uk/peter/vldb/dm/node11.html
- Decision tree construction:
  http://www.cs.uregina.ca/hamilton/courses/831/notes/ml/dtrees/4_dtrees2.html
- Building Classification Models: ID3 and C4.5:
  http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/
- Introduction to Reinforcement Learning:
  http://www.cs.indiana.edu/gasser/Salsa/rl.html
- On-line book on Reinforcement Learning:
  http://www-anw.cs.umass.edu/rich/book/the-book.html
63