Title: Multi-Agent Systems, Lecture 10

Multi-Agent Systems
Lecture 10
University Politehnica of Bucharest, 2005-2006
Adina Magda Florea
adina_at_cs.pub.ro
http://turing.cs.pub.ro/blia_06
Machine Learning - Lecture outline
- 1 Learning in AI (machine learning)
- 2 Reinforcement learning
- 3 Learning in multi-agent systems
  - 3.1 Learning action coordination
  - 3.2 Learning individual performance
  - 3.3 Learning to communicate
  - 3.4 Layered learning
- 4 Conclusions
 
1 Learning in AI
- What is machine learning?
- Herbert Simon defines learning as "any change in a system that allows it to perform better the second time on repetition of the same task or another task drawn from the same population" (Simon, 1983).
- In ML the agent learns
  - knowledge representation of the problem domain
  - problem solving rules, inferences
  - problem solving strategies
 
3 
Classifying learning
- In MAS learning the agents should learn
  - what an agent learns in ML, but in the context of MAS - both cooperative and self-interested agents
  - how to cooperate for problem solving - cooperative agents
  - how to communicate - both cooperative and self-interested agents
  - how to negotiate - self-interested agents
- Different dimensions
  - explicitly represented domain knowledge
  - how the critic component (performance evaluation) of a learning agent works
  - the use of knowledge of the domain/environment
 
4 
Single agent learning
[Diagram: the Environment supplies data and feedback to the Learning Process; the learning results feed the Problem Solving component (knowledge base, inferences, strategy); its results go to Performance Evaluation, which returns feedback to the Learning Process.]
5 
Self-interested learning agent
[Diagram: a single self-interested learning agent exchanging data, communication, actions, and feedback with the Environment.]
- NB: Both in this diagram and the next, not all components or flow arrows are always present - it depends on the type of agent (cognitive, reactive), the type of learning, etc.
6 
Cooperative learning agents
[Diagram: two cooperative agents, each with its own Learning Process, learning results, Performance Evaluation, and feedback loops, exchanging communication and actions with the Environment and with each other.]
7 
2 Reinforcement learning
- Combines dynamic programming and AI machine learning techniques
- Trial-and-error interactions with a dynamic environment
- The feedback of the environment = reward or reinforcement
- Two main approaches
  - search in the space of behaviors (e.g., genetic algorithms)
  - learn utilities based on statistical techniques and dynamic programming methods
8 
2.1 A reinforcement-learning model
- B = agent's behavior
- i = input = current state of the environment
- r = value of reinforcement (reinforcement signal)
- T = model of the world
- The model consists of
  - a discrete set of environment states S (s ∈ S)
  - a discrete set of agent actions A (a ∈ A)
  - a set of scalar reinforcement signals, typically {0, 1} or real numbers
  - the transition model of the world, T
- The environment is nondeterministic
  - T : S x A → P(S), T = transition model
  - T(s, a, s')
- Environment history = a sequence of states that leads to a terminal state
9 
A 4 x 3 environment
- The intended outcome occurs with probability 0.8, and with probability 0.2 (0.1 + 0.1) the agent moves at right angles to the intended direction.
- The two terminal states have reward +1 and -1; all other states have a reward of -0.04.
- The action sequence [Up, Up, Right, Right, Right] reaches the terminal state (4,3), if every intended outcome occurs, with probability 0.8^5 = 0.32768.
[Figure: the 4 x 3 grid world (columns 1-4, rows 1-3) with transition probabilities 0.8 straight ahead and 0.1 to each side.]
10 
2.2 Features varying RL
- accessible / inaccessible environment
- has (T known) / has not a model of the environment
- learn behavior / learn behavior + model
- reward received only in terminal states or in any state
- passive / active learner
  - passive learner: learn utilities of states
  - active learner: learn also what to do
- how does the agent represent B, namely its behavior
  - utility functions on states or state histories (T is known)
  - action-value functions (T is not necessarily known) - assign an expected utility to taking a given action in a given state
11 
Agents
- States and goals
  - goal : E → {0, 1}
- Utilities
  - utility : E → R
  - env : E x A → P(E)
- Expected utility of an action a in a state e
  - EU(a, e) = Σ_{e'} P(e' | e, a) · utility(e')
- Maximum Expected Utility (MEU): choose the action with the highest expected utility (a small sketch follows)
 
12 
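A minimal Python sketch of MEU action selection (the states, actions, environment model, and utility values below are illustrative assumptions, not part of the lecture):

    # Hypothetical MEU action selection; `env`, `utility`, and the state/action
    # names are illustrative assumptions.

    def expected_utility(action, state, env, utility):
        """EU(a, e) = sum over e' of P(e' | e, a) * utility(e')."""
        return sum(p * utility[e2] for e2, p in env(state, action).items())

    def meu_action(state, actions, env, utility):
        """Pick the action with Maximum Expected Utility in `state`."""
        return max(actions, key=lambda a: expected_utility(a, state, env, utility))

    # Toy usage: two outcome states, two actions.
    utility = {"good": 1.0, "bad": 0.0}

    def env(state, action):
        # P(e' | e, a) returned as a dict of outcome probabilities
        return {"good": 0.8, "bad": 0.2} if action == "safe" else {"good": 0.5, "bad": 0.5}

    print(meu_action("start", ["safe", "risky"], env, utility))   # -> "safe"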
2.3 The RL problem
- The agent has to find a policy π = a function which maps states to actions and which maximizes some long-run measure of reinforcement.
- The agent has to learn an optimal behavior = an optimal policy = a policy which yields the highest expected utility - π*
- The utility function depends on the environment history (a sequence of states)
- In each state s the agent receives a reward - R(s)
- U_h(s0, s1, ..., sn) = utility function on histories
13 
Models of behavior (a small sketch of the three models follows)
- Finite-horizon model: at a given moment of time the agent should optimize its expected reward for the next h steps
  - E(Σ_{t=0..h} R(s_t))
  - R(s_t) represents the reward received t steps into the future.
- Infinite-horizon model: optimize the long-run reward
  - E(Σ_{t=0..∞} R(s_t))
- Infinite-horizon discounted model: optimize the long-run reward, but rewards received in the future are geometrically discounted according to a discount factor γ
  - E(Σ_{t=0..∞} γ^t R(s_t)), with 0 ≤ γ < 1
- γ can be interpreted in several ways: as an interest rate, as a probability of living another step, or as a mathematical trick to bound an infinite sum.
14 
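A tiny illustration of the three reward models on a made-up reward sequence (the rewards and the discount factor are illustrative):

    # Finite-horizon, infinite-horizon, and discounted sums for one reward sequence.
    rewards = [1.0, 0.0, 2.0, 1.0, 0.5]        # R(s_t) for t = 0..4 (made-up numbers)
    gamma = 0.9

    finite_h = sum(rewards[t] for t in range(3))                    # finite horizon, h = 3
    undiscounted = sum(rewards)                                     # infinite horizon (truncated here)
    discounted = sum(gamma**t * r for t, r in enumerate(rewards))   # discounted model

    print(finite_h, undiscounted, discounted)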
2.4 Markov systems
- Discounted rewards
  - Example: an AP gets paid 20/year: 20 + 20 + 20 + ...
  - Discounted value = (reward now) + γ(reward at time 1) + γ²(reward at time 2) + ...
- A Markov System with rewards
  - a set of states {S1, S2, ..., Sn}
  - a transition probability matrix P_ij = Prob(Next = S_j | This = S_i)
  - each state has a reward r1, r2, ..., rn
  - a discount factor γ in [0, 1)
- On each time step
  - assume the state is S_i
  - get reward r_i
  - randomly move to another state S_j with probability P_ij
  - all future rewards are discounted by γ
 
15 
- U(S_i) = expected discounted sum of future rewards starting in state S_i
- U(S_i) = r_i + γ(P_i1 U(S_1) + P_i2 U(S_2) + ... + P_in U(S_n)), i = 1..n
- Solving the equations gives an exact answer, but with 100 000 states you must solve a 100 000 x 100 000 system of equations
- Value iteration to solve a Markov system (a small sketch follows)
  - U^1(S_i) = r_i
  - U^2(S_i) = r_i + γ Σ_{j=1..N} P_ij U^1(S_j)
  - Compute U^1(S_i) for each state
  - Compute U^2(S_i) for each state, etc.
  - Stop when max_i |U^{k+1}(S_i) - U^k(S_i)| < ε
 
16 
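A minimal value-iteration sketch for a Markov system with rewards; the 3-state transition matrix, rewards, and discount below are made-up numbers:

    # U_{k+1}(S_i) = r_i + gamma * sum_j P_ij * U_k(S_j), iterated until convergence.
    P = [[0.5, 0.5, 0.0],      # P[i][j] = Prob(Next = S_j | This = S_i)
         [0.1, 0.0, 0.9],
         [0.0, 0.0, 1.0]]
    r = [0.0, 1.0, 2.0]        # reward r_i received in state S_i
    gamma, eps = 0.9, 1e-6

    U = r[:]                   # U_1(S_i) = r_i
    while True:
        U_new = [r[i] + gamma * sum(P[i][j] * U[j] for j in range(len(U)))
                 for i in range(len(U))]
        if max(abs(U_new[i] - U[i]) for i in range(len(U))) < eps:
            break
        U = U_new

    print([round(u, 3) for u in U])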
2.5 Markov Decision Problem (MDP)
- An MDP consists of <S, A, T, R>
  - S - a set of states
  - A - a set of actions
  - R = reward function, R : S x A → R
  - T : S x A → Π(S), with Π(S) the probability distribution over the states S
- On each time step
  - assume the state is S_i
  - get reward R_i
  - choose action a (from a1 ... ak)
  - move to another state S_j with probability given by T(S_i, a)
  - all future rewards are discounted by γ
- We shall use T(s, a, s') = Prob(Next = s' | This = s and I use action a)
17 
Markov Decision Problem (MDP)
- The model is Markov if the state transitions are independent of any previous environment states or agent actions.
- MDPs may have finite state and action spaces (the focus here) or infinite state and action spaces
- For every MDP there exists an optimal policy
  - It is a policy such that for every possible start state there is no better option than to follow the policy
- Finding the optimal policy given a model T: calculate the utility of each state, U(state), and use the state utilities to select an optimal action in each state.
18 
Value iteration to solve an MDP
- U^1(s) = R(s)
- U^2(s) = max_a (R(s) + γ Σ_{s'} T(s,a,s') U^1(s'))
- ...
- U^{k+1}(s) = max_a (R(s) + γ Σ_{s'} T(s,a,s') U^k(s'))
- Compute U^1(s_i) for each state s_i
- Compute U^2(s_i) for each state, etc.
- Stop when max_i |U^{k+1}(s_i) - U^k(s_i)| < ε  => convergence
- (dynamic programming)

Compare with value iteration for a Markov system: U^{k+1}(S_i) = r_i + γ Σ_{j=1..N} P_ij U^k(S_j)
19 
- The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action
  - U(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') U(s')
- Bellman equation - the U*(s) are its unique solutions
- The utility function U(s) allows the agent to select actions by using the Maximum Expected Utility principle
  - π*(s) = argmax_a (R(s) + γ Σ_{s'} T(s,a,s') U(s')) = the optimal policy
 
20 
A 4 x 3 environment
- The intended outcome occurs with probability 0.8, and with probability 0.2 (0.1 + 0.1) the agent moves at right angles to the intended direction.
- The two terminal states have reward +1 and -1; all other states have a reward of -0.04; γ = 1.
Utilities of the states computed by value iteration (columns 1-4, rows 1-3; (2,2) is blocked):
  row 3:  0.812   0.868   0.918    +1
  row 2:  0.762     -     0.660    -1
  row 1:  0.705   0.655   0.611   0.388
21 
Bellman equation for the 4 x 3 world. Equation for the state (1,1):
U(1,1) = -0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),   (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }  (Right)
Up is the best action.
(Utility values as in the grid on the previous slide.)
Value Iteration
- Given the maximal expected utility, the optimal policy is
  - π*(s) = argmax_a (R(s) + γ Σ_{s'} T(s,a,s') U(s'))
  - it defines the best action in state s
- Compute U(s) using an iterative approach => Value Iteration (a small sketch follows)
  - U_0(s) = R(s)
  - U_{t+1}(s) = R(s) + max_a (γ Σ_{s'} T(s,a,s') U_t(s'))   (computed for all s)
  - as t → ∞, the utility values converge to the optimal values
23 
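A minimal value-iteration sketch for an MDP; the toy states, actions, transition model T, and rewards R are illustrative assumptions, not the 4 x 3 world of the slides:

    # U_{t+1}(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') * U_t(s')
    R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}
    T = {"s0":   {"go":   {"s1": 0.8, "s0": 0.2},
                  "stay": {"s0": 1.0}},
         "s1":   {"go":   {"goal": 0.8, "s1": 0.2},
                  "stay": {"s1": 1.0}},
         "goal": {}}                      # terminal state: no actions
    gamma, eps = 0.9, 1e-6

    U = {s: R[s] for s in R}              # U_0(s) = R(s)
    while True:
        U_new = {}
        for s in R:
            if not T[s]:                  # a terminal state keeps its reward
                U_new[s] = R[s]
            else:
                U_new[s] = R[s] + gamma * max(
                    sum(p * U[s2] for s2, p in T[s][a].items()) for a in T[s])
        if max(abs(U_new[s] - U[s]) for s in R) < eps:
            break
        U = U_new

    # Extract the optimal policy from the converged utilities.
    policy = {s: max(T[s], key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
              for s in R if T[s]}
    print({s: round(u, 3) for s, u in U.items()}, policy)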
Policy iteration
- Manipulate the policy directly, rather than finding it indirectly via the optimal value function (a small sketch follows)
- choose an arbitrary policy π (randomly)
- at each step t, compute the long-run reward of starting in s and using π_t, i.e. solve the equations
  - U_t(s) = R(s) + γ Σ_{s'} T(s, π_t(s), s') U_t(s')
- improve the policy at each state
  - π_{t+1}(s) ← argmax_a (R(s) + γ Σ_{s'} T(s,a,s') U_t(s'))
- Involves all next states - complex
 
24 
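A compact policy-iteration sketch on a tiny made-up MDP; policy evaluation is done here by simple iterative sweeps rather than by solving the linear system exactly (a deliberate simplification):

    import random

    R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}
    T = {"s0": {"go": {"s1": 0.8, "s0": 0.2}, "stay": {"s0": 1.0}},
         "s1": {"go": {"goal": 0.8, "s1": 0.2}, "stay": {"s1": 1.0}},
         "goal": {}}
    gamma = 0.9

    def evaluate(policy, sweeps=100):
        # U_t(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') * U_t(s')
        U = {s: 0.0 for s in R}
        for _ in range(sweeps):
            U = {s: R[s] + (gamma * sum(p * U[s2] for s2, p in T[s][policy[s]].items())
                            if T[s] else 0.0)
                 for s in R}
        return U

    policy = {s: random.choice(list(T[s])) for s in R if T[s]}   # arbitrary initial policy
    while True:
        U = evaluate(policy)
        improved = {s: max(T[s], key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
                    for s in policy}                             # policy improvement step
        if improved == policy:
            break
        policy = improved

    print(policy)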
2.6 RL learning
- Use observed rewards to learn an optimal (or nearly optimal) policy for the environment
- Example: play 100 moves and you lose
- In an MDP the agent has a complete model of the environment
- Now the agent has no such model
- Passive learning: the agent's policy is fixed. The task is to learn the utilities of states (or state-action pairs)
- Active learning: the agent must also learn what to do (exploitation/exploration)
25 
(a) Passive reinforcement learning
- The policy is fixed: in state s always execute π(s)
- Goal: learn how good the policy is = learn U^π(s)
- The agent does not know T(s,a,s') and does not know R(s) in advance
- ADP (Adaptive Dynamic Programming) learning
  - The problem of calculating an optimal policy in an accessible, stochastic environment
  - ADP: plug the learned T(s, π(s), s') and the observed rewards R(s) into the Bellman equations to calculate the utility of states
  - Supervised learning: input = state-action pairs, output = resulting state
  - Estimate the transition probabilities T(s,a,s') from the frequencies with which s' is reached after executing a in s
  - Example: from (1,3), Right reached (2,3) 2 times and stayed in (1,3) 1 time => T((1,3), Right, (2,3)) = 2/3
 
26 
ADP (Adaptive Dynamic Programming) learning

function Passive-ADP-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: π, a fixed policy
             mdp, an MDP with model T, rewards R, discount γ
             U, a table of utilities, initially empty
             Nsa, a table of frequencies for state-action pairs, initially zero
             Nsas', a table of frequencies of state-action-state triples, initially zero
             s, a, the previous state and action, initially null
  if s' is new then U[s'] ← r'; R[s'] ← r'
  if s is not null then
     increment Nsa[s,a] and Nsas'[s,a,s']
     for each t such that Nsas'[s,a,t] ≠ 0 do
        T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  U ← Value-Determination(π, U, mdp)   // according to the MDP (value iteration or policy iteration)
  if Terminal?[s'] then s, a ← null else s, a ← s', π[s']
  return a
end
27 
Temporal difference learning (TD learning)
- The value function is no longer computed by solving a set of linear equations; it is computed iteratively.
- Use observed transitions to adjust the values of the observed states so that they agree with the constraint equations.
  - U^π(s) ← U^π(s) + α(R(s) + γ U^π(s') - U^π(s))
  - α is the learning rate (a small update sketch follows)
- Whatever state is visited, its estimated value is updated to be closer to R(s) + γ U^π(s'), since R(s) is the instantaneous reward received and U^π(s') is the estimated value of the actually occurring next state.
- simpler than ADP, involves only the next states
- α decreases as the number of times the state is visited increases
28 
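A minimal TD(0) update sketch for passive learning under a fixed policy; the learning-rate schedule and the example transition below are illustrative assumptions:

    from collections import defaultdict

    U = defaultdict(float)        # utility estimates U^pi(s)
    N = defaultdict(int)          # visit counts, used to decay the learning rate
    gamma = 0.9

    def alpha(n):
        return 1.0 / (1.0 + n)    # learning rate decreasing with the number of visits

    def td_update(s, r, s_next):
        # U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
        N[s] += 1
        U[s] += alpha(N[s]) * (r + gamma * U[s_next] - U[s])

    # Example: one observed transition from "s0" (reward -0.04) to "s1".
    td_update("s0", -0.04, "s1")
    print(dict(U))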
Temporal difference learning

function Passive-TD-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: π, a fixed policy
             U, a table of utilities, initially empty
             Ns, a table of frequencies for states, initially zero
             s, a, r, the previous state, action, and reward, initially null
  if s' is new then U[s'] ← r'
  if s is not null then
     increment Ns[s]
     U[s] ← U[s] + α(Ns[s])(r + γ U[s'] - U[s])
  if Terminal?[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
  return a
end
 
29 
Temporal difference learning
- Does not need a model to perform its updates
- The environment supplies the connections between neighboring states in the form of observed transitions.
- ADP and TD comparison
  - Both ADP and TD try to make local adjustments to the utility estimates in order to make each state "agree" with its successors
  - TD adjusts a state to agree with the observed successor
  - ADP adjusts a state to agree with all of the successors that might occur, weighted by their probabilities
30 
(b) Active reinforcement learning
- A passive learning agent has a fixed policy that determines its behavior
- An active learning agent must decide what action to take
- The agent must learn a complete model with outcome probabilities for all actions (instead of a model for the fixed policy)
- Compute/learn the utilities that obey the Bellman equation
  - U(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') U(s')
  - using value iteration or policy iteration
  - If value iteration, then look for the action that maximizes utility
  - If policy iteration, you already have the action
- Exploration/exploitation
  - The representative problem is the n-armed bandit problem
- Solutions
  - 1/t of the time choose random actions, the rest of the time follow π
  - give weights to actions that have not been explored, avoid actions with low utilities
  - Exploration function f(u, n): determines how greedy (preferring high utility values) or exploratory the agent is
31 
Q-learning
- Active learning of action-value functions
- action-value function: assigns an expected utility to taking a given action in a given state; Q-values
- Q(a, s) = the value of doing action a in state s (expected utility)
- Q-values are related to utility values by the equation
  - U(s) = max_a Q(a, s)
- Approach 1
  - Q(a,s) = R(s) + γ Σ_{s'} T(s,a,s') max_{a'} Q(a',s')
  - This requires a model
- Approach 2
  - Use TD
  - The agent does not need to learn a model => model-free
32 
Q-learning
- TD learning, unknown environment
  - Q(a,s) ← Q(a,s) + α(R(s) + γ max_{a'} Q(a',s') - Q(a,s))
  - calculated after each transition from state s to s'
- Is it better to learn a model and a utility function, or to learn an action-value function with no model?
33 
Q-learning

function Q-Learning-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  variables: Q, a table of action values indexed by state and action
             Nsa, a table of frequencies for state-action pairs
             s, a, r, the previous state, action, and reward, initially null
  if s is not null then
     increment Nsa[s,a]
     Q[a,s] ← Q[a,s] + α(Nsa[s,a])(r + γ max_{a'} Q[a',s'] - Q[a,s])
  if Terminal?[s'] then s, a, r ← null
  else s, a, r ← s', argmax_{a'} f(Q[a',s'], Nsa[a',s']), r'
  return a
end

(without the exploration function: s, a, r ← s', argmax_{a'} Q[a',s'], r')
A small Python sketch follows.
34 
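A compact tabular Q-learning sketch; the 5-state chain environment and the epsilon-greedy exploration rule (standing in for the exploration function f) are illustrative assumptions, not the lecture's exact setup:

    import random
    from collections import defaultdict

    ACTIONS = ["left", "right"]
    GOAL = 4                              # states 0..4, state 4 is terminal with reward +1

    def step(s, a):
        s2 = max(0, s - 1) if a == "left" else min(GOAL, s + 1)
        return s2, (1.0 if s2 == GOAL else -0.04), s2 == GOAL

    Q = defaultdict(float)                # Q[(a, s)], initially 0
    gamma, alpha, epsilon = 0.9, 0.1, 0.1

    for _ in range(2000):                 # learning episodes
        s, done = 0, False
        while not done:
            a = random.choice(ACTIONS) if random.random() < epsilon \
                else max(ACTIONS, key=lambda x: Q[(x, s)])
            s2, r, done = step(s, a)
            best_next = 0.0 if done else max(Q[(a2, s2)] for a2 in ACTIONS)
            # Q(a,s) <- Q(a,s) + alpha * (R + gamma * max_a' Q(a',s') - Q(a,s))
            Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])
            s = s2

    print({s: max(ACTIONS, key=lambda a: Q[(a, s)]) for s in range(GOAL)})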
Generalization of RL
- The problem of learning in large spaces: a large number of states
- Generalization techniques allow compact storage of learned information and transfer of knowledge between "similar" states and actions:
  - neural nets
  - decision trees
  - U(state) = U(most similar state in memory)
  - U(state) = average of U(most similar states in memory)
 
35 
3 Learning in MAS
- The credit-assignment problem (CAP): the problem of assigning the feedback (credit or blame) for an overall performance change of the MAS (increase, decrease) to each agent that contributed to that change
  - inter-agent CAP: assigns credit or blame to the external actions of agents
  - intra-agent CAP: assigns credit or blame for a particular external action of an agent to its internal inferences and decisions
  - the distinction is not always obvious; often it is one or the other
 
36 
3.1 Learning action coordination
- s = current environment state
- Agent i determines the set of actions it can do in s: A_i(s) = {A_ij(s)}
- It computes the goal relevance E_ij(s) of each action
- Agent i announces a bid for each action with E_ij(s) > threshold
  - B_ij(s) = (α + β) E_ij(s)
  - α - risk factor (small), β - noise term (to prevent convergence to local minima)
37 
- The action with the highest bid is selected
- Incompatible actions are eliminated
- The process is repeated until all actions in the bids are either selected or eliminated
- A = the selected actions = the activity context
- Execute the selected actions
- Update the goal relevance of the actions in A:
  - E_ij(s) ← E_ij(s) - B_ij(s) + (R / |A|)
  - R = the external reward received
- Update the goal relevance of the actions A_kl in the previous activity context A_p:
  - E_kl(s_p) ← E_kl(s_p) + (Σ_{A_ij ∈ A} B_ij(s) / |A_p|)
(A small sketch of one bidding round follows.)
 
38 
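An illustrative sketch of one bidding round of the coordination scheme above; the goal-relevance values, the compatibility relation, the constants, and the uniform noise term are made up:

    import random

    alpha, threshold = 0.1, 0.2               # risk factor and bid threshold (made up)
    E = {"a1": 0.9, "a2": 0.7, "a3": 0.1}     # goal relevance E_ij(s) of candidate actions
    incompatible = {("a1", "a2")}             # action pairs that cannot run together

    # B_ij(s) = (alpha + beta) * E_ij(s), with beta a small random noise term
    bids = {a: (alpha + random.uniform(0.0, 0.05)) * e
            for a, e in E.items() if e > threshold}

    selected, remaining = [], dict(bids)
    while remaining:
        best = max(remaining, key=remaining.get)     # the highest bid wins
        selected.append(best)
        remaining = {a: b for a, b in remaining.items()
                     if a != best and (a, best) not in incompatible
                     and (best, a) not in incompatible}

    R = 1.0                                          # external reward for this round
    for a in selected:                               # E_ij(s) <- E_ij(s) - B_ij(s) + R/|A|
        E[a] = E[a] - bids[a] + R / len(selected)

    print(selected, {a: round(e, 3) for a, e in E.items()})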
3.2 Learning individual performance
- The agent learns how to improve its individual performance in a multi-agent setting
- Examples
  - cooperative agents - learning organizational roles
  - competitive agents - learning from market conditions
39 
3.2.1 Learning organizational roles (Nagendra e.a.)
- Agents learn to adopt a specific role in a particular situation (state) of a cooperative MAS
- Aim: to increase the utility of the final states
- Each agent may play several roles in a situation
- The agents learn to select the most appropriate role
- Use reinforcement learning
- Utility, Probability, and Cost (UPC) estimates of a role in a situation
  - Utility - the agent's estimate of a final state's worth for a specific role in a situation; world states are mapped to a smaller set of situations
  - S = {s0, ..., sf}
  - U_rs = U(sf), for a path s0 → ... → sf
40 
- Probability - the likelihood of reaching a final state for a specific role in a situation
  - P_rs = p(sf), for a path s0 → ... → sf
- Cost - the computational cost of reaching a final state for a specific role in a situation
- Potential of a role - estimates the usefulness of a role in discovering pertinent global information and constraints (orthogonal to utilities)
- Representation
  - S_k - vector of situations for agent k, [S_k1, ..., S_kn]
  - R_k - vector of roles for agent k, [R_k1, ..., R_km]
  - |S_k| x |R_k| x 4 values to describe UPC and Potential
41 
Functioning
- Phase I: Learning
  - several learning cycles; in each cycle each agent goes from s0 to sf and selects its role as the one with the highest probability
  - the probability of selecting a role r in a situation s is proportional to f(U_rs, P_rs, C_rs, Pot_rs)
  - f - the objective function used to rate the roles (e.g., f(U, P, C, Pot) = U·P·C + Pot); it depends on the domain
 
42 
- Use reinforcement learning to update the UPC values and the potential of a role
- For every s ∈ {s0, ..., sf} and the chosen role r in s:
  - U_rs^{i+1} = (1 - α) U_rs^i + α U(sf)
    - i - the learning cycle
    - U(sf) - the utility of the final state
    - 0 ≤ α ≤ 1 - the learning rate
  - P_rs^{i+1} = (1 - α) P_rs^i + α O(sf)
    - O(sf) = 1 if sf is successful, 0 otherwise
 
43 
- Pot_rs^{i+1} = (1 - α) Pot_rs^i + α Conf(Path)
  - Path = s0, ..., sf
  - Conf(Path) = 0 if there are conflicts on the Path, 1 otherwise
- The update rules for cost are domain dependent (a small sketch of the updates follows)
- Phase II: Performing
  - In a situation s, the role r chosen is the one that maximizes f(U_rs, P_rs, C_rs, Pot_rs)
44 
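A small sketch of the UPC reinforcement updates for one (role, situation) pair; the initial estimates and the final-state outcome are illustrative:

    alpha = 0.2                                 # learning rate, 0 <= alpha <= 1
    U   = {("r1", "s0"): 0.5}                   # utility estimates U_rs
    P   = {("r1", "s0"): 0.5}                   # probability estimates P_rs
    Pot = {("r1", "s0"): 0.5}                   # potential estimates Pot_rs

    def update(role, situation, u_final, success, conflict_free):
        # Apply the three update rules for a (role, situation) visited on the path.
        key = (role, situation)
        U[key]   = (1 - alpha) * U[key]   + alpha * u_final                         # utility
        P[key]   = (1 - alpha) * P[key]   + alpha * (1.0 if success else 0.0)       # probability
        Pot[key] = (1 - alpha) * Pot[key] + alpha * (1.0 if conflict_free else 0.0) # potential

    # One learning cycle ended in a successful, conflict-free final state of utility 0.9.
    update("r1", "s0", u_final=0.9, success=True, conflict_free=True)
    print(U, P, Pot)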
3.2.2 Learning in market environments (Vidal & Durfee)
- Agents use past experience and evolved models of other agents to better sell and buy goods
- Environment: a market in which agents buy and sell information (an electronic marketplace)
- Open environment
- The agents are self-interested (they maximize local utility)
  - G - a set of goods
  - P - the set of possible prices for goods
  - Q_g - the set of possible qualities for a good g
 
45 
- information has a cost for the seller and a value for the buyer
- information is sold at a certain price
- a buyer announces a good it needs
- sellers bid their prices for delivering the good
- the buyer selects from these bids and pays the corresponding price
- the buyer assesses the quality of the information after it receives it from the seller
- Profit of a seller s for selling the good g at price p:
  - Profit_sg(p) = p - c_sg
  - c_sg - the cost of producing the good g by s, p - the price
- Value of a good g for a buyer b:
  - V_bg(p, q), with p - the price b paid for g, q - the quality of the good g
- Goal: the seller maximizes profit in a transaction, the buyer maximizes value in a transaction
46 
3 types of agents
- 0-level agents
  - they set their buying and selling prices based on their own past experience
  - they do not model the behavior of other agents
- 1-level agents
  - model other agents based on previous interactions
  - they set their buying and selling prices based on these models and on past experience
  - they model the other agents as 0-level agents
- 2-level agents
  - same as 1-level agents, but they model the other agents as 1-level agents
47 
Strategy of 0-level agents (a small sketch follows)
- 0-level buyer
  - learns the expected value function f_g(p) of buying g at price p
  - uses reinforcement learning
    - f_g^{i+1}(p) = (1 - α) f_g^i(p) + α V_bg(p, q), with α_min ≤ α ≤ 1 and α = 1 for i = 0
  - chooses the seller s for supplying a good g
- 0-level seller
  - learns the expected profit function h_g(p) if it offers good g at price p
  - uses reinforcement learning
    - h_g^{i+1}(p) = (1 - α) h_g^i(p) + α Profit_sg(p)
    - where Profit_sg(p) = p - c_sg if it wins the auction, 0 otherwise
  - chooses the price p_s at which to sell the good g so as to maximize profit
48 
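A small sketch of the 0-level buyer and seller updates; the prices, the production cost, and the fixed learning rate are illustrative assumptions (the lecture's schedule starts with alpha = 1 and then decreases):

    PRICES = [1, 2, 3]
    f = {p: 0.0 for p in PRICES}     # buyer: expected value of buying good g at price p
    h = {p: 0.0 for p in PRICES}     # seller: expected profit of offering good g at price p
    alpha, cost = 0.3, 1.0           # learning rate and the seller's production cost c_sg

    def buyer_update(p, value):
        # f(p) <- (1 - alpha) * f(p) + alpha * V_bg(p, q)
        f[p] = (1 - alpha) * f[p] + alpha * value

    def seller_update(p, won_auction):
        # h(p) <- (1 - alpha) * h(p) + alpha * Profit_sg(p)
        profit = (p - cost) if won_auction else 0.0
        h[p] = (1 - alpha) * h[p] + alpha * profit

    buyer_update(2, value=1.5)                     # bought at price 2, obtained value 1.5
    seller_update(2, won_auction=True)             # sold at price 2
    best_price = max(PRICES, key=lambda p: h[p])   # the seller picks the price maximizing h
    print(f, h, best_price)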
Strategy of 1-level agents
- 1-level buyer
  - models the sellers for good g
  - does not model the other buyers
  - uses a probability distribution function q_sg(x) over the qualities x of a good g
  - computes the expected utility E_sg of buying good g from seller s
  - chooses the seller s for supplying a good g that maximizes this expected utility
49 
1-level seller
- models the buyers for good g
- models the other sellers s' for good g
- Buyer model
  - uses a probability distribution function m_bg(p) - the probability that b will choose price p for good g
- Seller model
  - uses a probability distribution function n_s'g(y) - the probability that s' will bid price y for good g
  - computes the probability of bidding lower than a given seller s' with the price p:
    - Prob_of_bidding_lower_than_s' = Σ_{p'} (prob. of a bid of s' with p' for which s wins)
      = Σ_{p'} N(g, b, s, s', p, p')
    - N(g, b, s, s', p, p') = n_s'g(p') if m_bg(p') ≤ m_bg(p), 0 otherwise
50 
- computes the probability of bidding lower than all the other sellers with the price p:
  - Prob_of_bidding_lower_with_p = Π_{s' ∈ S - {s}} Prob_of_bidding_lower_than_s'
- chooses the best price p to bid so as to maximize the expected profit (a small sketch follows)
 
51 
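A small sketch of the 1-level seller's price choice; the distributions m and n, the independence assumption across competitors, and the numbers are illustrative:

    PRICES = [1, 2, 3]
    m = {1: 0.6, 2: 0.3, 3: 0.1}                 # m_bg(p): prob. the buyer picks price p
    n = {"s2": {1: 0.2, 2: 0.5, 3: 0.3},         # n_s'g(p'): bid distribution of seller s'
         "s3": {1: 0.1, 2: 0.4, 3: 0.5}}
    cost = 0.5                                    # production cost c_sg

    def p_lower_than(other, p):
        # Probability of beating one competitor s' when bidding price p.
        return sum(prob for p2, prob in n[other].items() if m[p2] <= m[p])

    def p_win(p):
        # Probability of beating all other sellers with price p (independence assumed).
        result = 1.0
        for other in n:
            result *= p_lower_than(other, p)
        return result

    best = max(PRICES, key=lambda p: p_win(p) * (p - cost))   # maximize expected profit
    print({p: round(p_win(p), 3) for p in PRICES}, best)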
3.3 Learning to communicate
- What to communicate (e.g., what information is of interest to the others)
- When to communicate (e.g., when to try doing something by itself and when to look for help)
- With which agents to communicate
- How to communicate (e.g., language, protocol, ontology)
52 
Learning with which agents to communicate (Ohko e.a.)
- Learning which agents to ask to perform a task
- Used in a contract net protocol for task allocation, to reduce the communication needed for task announcement
- Goal: acquire and refine knowledge about the other agents' task-solving abilities
- Case-based reasoning is used for knowledge acquisition and refinement
- A case consists of:
  - (1) a task specification
  - (2) information about which agents solved the task or similar tasks in the past, and the quality of the provided solution
53 
- (1) Task specification
  - T_i = {A_i1 = V_i1, ..., A_im_i = V_im_i}
  - A_ij - task attribute, V_ij - value of the attribute
- Similar tasks (a small sketch follows)
  - Sim(T_i, T_j) = Σ_r Σ_s Dist(A_ir, A_js), with A_ir ∈ T_i, A_js ∈ T_j
  - Dist(A_ir, A_js) = Sim_Attr(A_ir, A_js) · Sim_Vals(V_ir, V_js)
- Set of similar tasks
  - S(T) = {T_j | Sim(T, T_j) ≥ 0.85}
54 
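A small sketch of similarity-based case retrieval; the attribute and value similarity functions are crude stand-ins, and the normalization of Sim to [0, 1] is an added assumption so that the 0.85 threshold is meaningful:

    def sim_attr(a1, a2):
        return 1.0 if a1 == a2 else 0.0            # crude attribute similarity

    def sim_vals(v1, v2):
        return 1.0 if v1 == v2 else 0.5            # crude value similarity

    def dist(attr_i, val_i, attr_j, val_j):
        return sim_attr(attr_i, attr_j) * sim_vals(val_i, val_j)

    def sim(task_i, task_j):
        # Sim(Ti, Tj) = sum_r sum_s Dist(Air, Ajs), normalized here to [0, 1].
        total = sum(dist(ai, vi, aj, vj)
                    for ai, vi in task_i.items() for aj, vj in task_j.items())
        return total / max(len(task_i), len(task_j))

    new_task = {"type": "transport", "size": "large"}
    cases = {"T1": {"type": "transport", "size": "large"},
             "T2": {"type": "assembly",  "size": "small"}}

    similar = [name for name, t in cases.items() if sim(new_task, t) >= 0.85]
    print(similar)                                 # cases eligible for addressee selection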
- (2) Which agents performed T or similar tasks in the past
- Suitability of agent A_k:
  - Perform(A_k, T_j) - the quality of the solution for T_j obtained when agent A_k performed T_j in the past
  - the agent computes Suit(A_k, T), with Suit(A_k, T) > 0, and selects the agent k with the best suitability, or the first m agents with the best suitability
- After each encounter, the agent stores the tasks performed by the other agents and the solution quality
- Trade-off between exploitation and exploration
 
55 
3.4 Layered learning
- (Stone & Veloso)
- A hierarchical machine learning paradigm for MAS
- Used in simulated robotic soccer (RoboCup)
- Learning a direct mapping Input → Output is intractable
- Decompose the learning task L into subtasks L1, ..., Ln
- Characteristics of the environment
  - cooperative MAS
  - teammates and adversaries
  - hidden state: agents have a partial world view at any given moment
  - agents have noisy sensors and actuators
  - perception and action cycles are asynchronous
  - agents must make their decisions in real time
 
56 
- Problem: the agent receives a moving ball and must decide what to do with it: dribble, pass to a teammate, or shoot towards the goal
- Decompose the problem into 3 subtasks:
  - L1 - individual behavior - ball interception
  - L2 - multiagent behavior - pass evaluation
  - L3 - team behavior - pass selection
- The decomposition into subtasks enables the learning of more complex behaviors
- The hierarchical task decomposition is constructed bottom-up, in a domain-dependent fashion
- Learning methods are chosen to suit the task
- Learning in one layer feeds into the next layer either by providing a portion of the behavior used for training (ball interception → pass evaluation) or by creating the input representation and pruning the action space (pass evaluation → pass selection)
57 
L1: Ball interception (individual behavior)
- Aim
  - block or intercept opponents' shots or passes, or
  - receive passes from teammates
- Learning method: a fully connected backpropagation neural network
- Training: repeatedly shoot the ball towards a defender placed in front of a goal; the defender collects training examples by acting randomly and noticing when it successfully stops the ball
- Classification
  - saves: successful interceptions
  - goals: unsuccessful attempts
  - misses: shots that went wide of the goal
 
58 
L2: Pass evaluation (multiagent behavior)
- Uses the learned ball-interception skill as part of the behavior used for training the multiagent behavior
- Aim: the agent must decide
  - whether or not to pass the ball to a teammate, and
  - whether the teammate will successfully receive the ball (based on the positions and abilities of the teammates and opponents to receive or intercept a pass)
- Learning method: decision trees (C4.5)
- Training: kick the ball towards randomly placed teammates interspersed with randomly placed opponents
- The intended pass recipient and the opponents all use the learned ball-interception behavior
- Classification of a potential pass to a receiver
  - success, with a confidence factor ∈ (0, 1]
  - failure, with a confidence factor ∈ [-1, 0)
  - miss (= 0)
59 
L3: Pass selection (team behavior)
- Uses the learned pass-evaluation capabilities to create the input and output sets for learning pass selection
- Aim: the agent has the ball and must decide
  - to which teammate to pass the ball, or
  - whether to shoot on goal
- Learning method: Q-learning of a function that depends on the agent's position on the field
- Training: simulate 2 teams playing with identical behaviors other than their pass-selection policies
- Reinforcement: the total number of goals scored
- Learns
  - when to shoot on goal
  - the teammate to which to pass
60 
4 Conclusions
- There is no unique method or set of methods for learning in MAS
- Many approaches are based on extending ML techniques to a MAS setting
- Many approaches use reinforcement learning, but also neural networks or genetic algorithms
61 
References
- S. Sen, G. Weiss. Learning in Multiagent Systems. In: Multiagent Systems - A Modern Approach to Distributed Artificial Intelligence, G. Weiss (Ed.), The MIT Press, 2001, p. 257-298.
- T. Ohko, e.a. Addressee Learning and Message Interception for Communication Load Reduction in Multiple Robot Environments. In: Distributed Artificial Intelligence Meets Machine Learning, G. Weiss (Ed.), Lecture Notes in Artificial Intelligence, Vol. 1221, Springer-Verlag, 1997, p. 242-258.
- M.V. Nagendra, e.a. Learning Organizational Roles in a Heterogeneous Multi-Agent System. In: Proc. of the Second International Conference on Multiagent Systems, AAAI Press, 1996, p. 291-298.
- J.M. Vidal, E.H. Durfee. The Impact of Nested Agent Models in an Information Economy. In: Proc. of the Second International Conference on Multiagent Systems, AAAI Press, 1996, p. 377-384.
- P. Stone, M. Veloso. Layered Learning. In: Eleventh European Conference on Machine Learning, ECML-2000.
62 
Web References
- An interesting set of training examples and the connection between decision trees and rules:
  http://www.dcs.napier.ac.uk/peter/vldb/dm/node11.html
- Decision tree construction:
  http://www.cs.uregina.ca/hamilton/courses/831/notes/ml/dtrees/4_dtrees2.html
- Building Classification Models: ID3 and C4.5:
  http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/
- Introduction to Reinforcement Learning:
  http://www.cs.indiana.edu/gasser/Salsa/rl.html
- On-line book on Reinforcement Learning:
  http://www-anw.cs.umass.edu/rich/book/the-book.html
63