
1
Multi-Agent Systems
Lecture 10
University Politehnica of Bucharest, 2005-2006
Adina Magda Florea
adina@cs.pub.ro
http://turing.cs.pub.ro/blia_06
2
Machine LearningLecture outline
  • 1 Learning in AI (machine learning)
  • 2 Reinforcement learning
  • 3 Learning in multi-agent systems
  • 3.1 Learning action coordination
  • 3.2 Learning individual performance
  • 3.3 Learning to communicate
  • 3.4 Layered learning
  • 4 Conclusions

3
1 Learning in AI
  • What is machine learning?
  • Herbert Simon defines learning as
  • any change in a system that allows it to perform
    better the second time on repetition of the same
    task or another task drawn from the same
    population (Simon, 1983).
  • In ML the agent learns
  • knowledge representation of the problem domain
  • problem solving rules, inferences
  • problem solving strategies

3
4
Classifying learning
  • In MAS learning the agents should learn
  • what an agent learns in ML but in the context of
    MAS - both cooperative and self-interested agents
  • how to cooperate for problem solving -
    cooperative agents
  • how to communicate - both cooperative and
    self-interested agents
  • how to negotiate - self-interested agents
  • Different dimensions
  • explicitly represented domain knowledge
  • how the critic component (performance evaluation)
    of a learning agent works
  • the use of knowledge of the domain/environment

4
5
Single agent learning
[Diagram: single-agent learning. Components: Environment, Data, Learning Process, Learning results, Problem Solving (KB, Inferences, Strategy), Results, Performance Evaluation, and Feed-back links between them.]
5
6
Self-interested learning agent
[Diagram: self-interested learning agent. Components: Environment, Data, Communication, Actions, and Feed-back links.]
  • NB Both in this diagram and the next, not all
    components or flow arrows are always present - it
    depends on the type of agent (cognitive,
    reactive), type of learning, etc.

6
7
Cooperative learning agents
[Diagram: cooperative learning agents. Two agents, each with its own Learning Process, Learning results, Results, Actions, Communication and Feed-back; a Performance Evaluation component; and a shared Environment providing Data.]
7
8
2 Reinforcement learning
  • Combines dynamic programming and AI machine
    learning techniques
  • Trial-and-error interactions with a dynamic
    environment
  • The feedback of the environment: reward or
    reinforcement
  • Two main approaches:
  • search in the space of behaviors
    (genetic algorithms)
  • learn utility based on statistical techniques
    and dynamic programming methods

8
9
2.1 A reinforcement-learning model
  • B: the agent's behavior
  • i: input, the current state of the environment
  • r: value of reinforcement
    (reinforcement signal)
  • T: model of the world
  • The model consists of:
  • - a discrete set of environment states S (s ∈ S)
  • - a discrete set of agent actions A (a ∈ A)
  • - a set of scalar reinforcement signals,
    typically {0, 1} or real numbers
  • - the transition model of the world, T
  • The environment is nondeterministic:
  • T: S × A → P(S), T = transition model
  • T(s, a, s')
  • Environment history: a sequence of states that
    leads to a terminal state

9
10
  • A 4 x 3 environment
  • The intended outcome occurs with probability 0.8,
    and with probability 0.2 (0.1 + 0.1) the agent
    moves at right angles to the intended direction.
  • The two terminal states have rewards +1 and -1;
    all other states have a reward of -0.04

[Figure: the 4 x 3 grid world (rows 1-3, columns 1-4) and the action model: intended direction with probability 0.8, each perpendicular direction with probability 0.1. Example: the action sequence Up, Up, Right, Right, Right reaches (4,3) with probability 0.8^5 = 0.32768 if every step goes as intended.]
10
11
  • 2.2 Features along which RL varies
  • accessible / inaccessible environment
  • has a model of the environment (T known) / does
    not have one
  • learn a behavior / learn a behavior model
  • reward received only in terminal states or in any
    state
  • passive / active learner
  • a passive learner learns utilities of states
  • an active learner must also learn what to do
  • how does the agent represent B, namely its
    behavior
  • utility functions on states or state histories (T
    is known)
  • action-value functions (T is not necessarily
    known) - assign an expected utility to taking a
    given action in a given state

11
12
Agents
  • States and goals
  • goal: E → {0, 1}
  • Utilities
  • utility: E → R
  • env: E × A → P(E)
  • Expected utility of an action a in a state e:
    EU(a, e) = Σ_{e'∈E} P(e' | e, a) · utility(e')
  • Maximum Expected Utility (MEU): choose the action
    with the highest expected utility

12
13
  • 2.3 The RL problem
  • The agent has to find a policy π: a function
    which maps states to actions and which maximizes
    some long-term measure of reinforcement.
  • The agent has to learn an optimal behavior =
    optimal policy = the policy which yields the
    highest expected utility - π*
  • The utility function depends on the environment
    history (a sequence of states)
  • In each state s the agent receives a reward -
    R(s)
  • U_h([s0, s1, ..., sn]) = utility function on
    histories

13
14
  • Models of behavior
  • Finite-horizon model: at a given moment in time
    the agent should optimize its expected reward for
    the next h steps
  • E(Σ_{t=0,h} R(s_t))
  • R(s_t) represents the reward received t steps into
    the future.
  • Infinite-horizon model: optimize the long-run
    reward
  • E(Σ_{t=0,∞} R(s_t))
  • Infinite-horizon discounted model: optimize the
    long-run reward, but rewards received in the
    future are geometrically discounted according to
    a discount factor γ (see the example below)
  • E(Σ_{t=0,∞} γ^t R(s_t))
  • 0 ≤ γ < 1
  • γ can be interpreted in several ways. It can be
    seen as an interest rate, a probability of living
    another step, or as a mathematical trick to bound
    an infinite sum.
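As a quick illustration (not from the original slides), the discounted return of a short reward sequence can be computed directly; the reward values and γ = 0.9 below are invented for the example:

    # Discounted return: sum over t of gamma^t * R(s_t)
    rewards = [0.0, 0.0, 0.0, 1.0]   # invented reward sequence R(s_0)..R(s_3)
    gamma = 0.9                      # assumed discount factor
    discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
    print(discounted_return)         # 0.9**3 * 1.0 = 0.729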

14
15
  • 2.4 Markov systems
  • Discounted rewards
  • Example: an AP gets paid 20/year:
    20 + 20 + 20 + ...
  • Discounted sum: (reward now) + γ(reward at time 1) +
    γ^2(reward at time 2) + ...
  • A Markov system with rewards:
  • a set of states {S_1, S_2, ..., S_n}
  • a transition probability matrix
    P_ij = Prob(Next = S_j | This = S_i)
  • each state has a reward: r_1, r_2, ..., r_n
  • a discount factor γ in [0, 1]
  • On each time step:
  • Assume the state is S_i
  • Get reward r_i
  • Move randomly to another state S_j according to P_ij
  • All future rewards are discounted by γ

15
16
  • U(S_i) = expected discounted sum of future rewards
    starting in state S_i
  • U(S_i) = r_i + γ(P_i1 U(S_1) + P_i2 U(S_2) + ... + P_in U(S_n)),
    i = 1,n
  • Solving these equations gives an exact answer, but with
    100 000 states you must solve a 100 000 by 100 000 system
    of equations
  • Value iteration to solve a Markov system (sketched in
    code below):
  • U^1(S_i) = r_i
  • U^2(S_i) = r_i + γ Σ_{j=1,N} P_ij U^1(S_j)
  • Compute U^1(S_i) for each state
  • Compute U^2(S_i) for each state, etc.
  • Stop when |U^{k+1}(S_i) - U^k(S_i)| < eps
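A minimal Python sketch of this value iteration for a Markov system with rewards; the 3-state transition matrix and reward vector below are invented for illustration, and numpy is assumed to be available:

    import numpy as np

    def markov_system_values(P, r, gamma, eps=1e-6):
        # Iterate U^{k+1}(S_i) = r_i + gamma * sum_j P_ij * U^k(S_j) until convergence
        U = np.array(r, dtype=float)            # U^1(S_i) = r_i
        while True:
            U_next = r + gamma * P.dot(U)       # one sweep over all states
            if np.max(np.abs(U_next - U)) < eps:
                return U_next
            U = U_next

    # Invented example: 3 states, reward only in the last one
    P = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 0.5],
                  [0.0, 0.5, 0.5]])
    r = np.array([0.0, 0.0, 10.0])
    print(markov_system_values(P, r, gamma=0.9))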

16
17
  • 2.5 Markov Decision Problem (MDP)
  • An MDP consists of:
  • <S, A, T, R>
  • S - a set of states
  • A - a set of actions
  • R - reward function, R: S × A → R
  • T: S × A → Π(S), with Π(S) the probability
    distribution over the states S
  • On each time step:
  • Assume the state is S_i
  • Get reward r_i
  • Choose action a (from a_1 ... a_k)
  • Move to another state S_j with the probability given
    by T(S_i, a)
  • All future rewards are discounted by γ
  • We shall use the notation T(s, a, s')

P^a_ss' = Prob(Next = s' | This = s and I use action a)
17
18
  • Markov Decision Problem (MDP)
  • The model is Markov if the state transitions are
    independent of any previous environment states or
    agent actions.
  • MDPs may have finite or infinite state and action
    spaces; we focus on the finite-state, finite-action case
  • For every MDP there exists an optimal policy
  • It is a policy such that for every possible start
    state there is no better option than to follow
    the policy
  • Finding the optimal policy given a model T:
    calculate the utility of each state, U(state), and
    use the state utilities to select an optimal action
    in each state.

18
19
  • Value iteration to solve an MDP (sketched in code below)
  • U^1(s) = R(s)
  • U^2(s) = max_a (R(s) + γ Σ_s' T(s,a,s') U^1(s'))
  • ...
  • U^{k+1}(s) = max_a (R(s) + γ Σ_s' T(s,a,s') U^k(s'))
  • Compute U^1(s_i) for each state s_i
  • Compute U^2(s_i) for each state, etc.
  • Stop when max_i |U^{k+1}(s_i) - U^k(s_i)| < eps
  • convergence
  • (dynamic programming)

Compare with value iteration for a Markov system:
U^{k+1}(S_i) = r_i + γ Σ_{j=1,N} P_ij U^k(S_j)
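A compact Python sketch of MDP value iteration as defined above. The representation is an assumption: T[(s, a)] is a list of (s', probability) pairs, R is a dict of rewards, and actions(s) returns the actions applicable in s (empty for terminal states):

    def value_iteration(states, actions, T, R, gamma, eps=1e-6):
        U = {s: R[s] for s in states}                 # U^1(s) = R(s)
        while True:
            U_next = {}
            for s in states:
                # expected utility sum_{s'} T(s,a,s') U^k(s') for each action a
                q = [sum(p * U[s2] for s2, p in T[(s, a)]) for a in actions(s)]
                if q:                                 # non-terminal state
                    U_next[s] = R[s] + gamma * max(q)
                else:                                 # terminal state: utility = reward
                    U_next[s] = R[s]
            if max(abs(U_next[s] - U[s]) for s in states) < eps:
                return U_next
            U = U_next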
19
20
  • The utility of a state is the immediate reward
    for that state plus the expected discounted
    utility of the next state, assuming that the
    agent chooses the optimal action:
  • U(s) = R(s) + γ max_a Σ_s' T(s,a,s') U(s')
  • Bellman equation - U*(s) is its unique solution
  • The utility function U(s) allows the agent to
    select actions by using the Maximum Expected
    Utility principle:
  • π*(s) = argmax_a (R(s) + γ Σ_s' T(s,a,s') U(s'))
  • = the optimal policy (greedy extraction sketched below)
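A short sketch of extracting the greedy policy from computed utilities, using the same assumed representation as the value-iteration sketch above. R(s) and γ do not change the argmax over a, so only the expectation term is compared:

    def extract_policy(states, actions, T, U):
        # pi*(s) = argmax_a sum_{s'} T(s,a,s') U(s')
        policy = {}
        for s in states:
            if actions(s):                            # skip terminal states
                policy[s] = max(actions(s),
                                key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)]))
        return policy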

20
21
  •  A 4 x 3 environment
  • The intended outcome occurs with probability 0.8,
    and with probability 0.2 (0.1 + 0.1) the agent
    moves at right angles to the intended direction.
  • The two terminal states have rewards +1 and -1;
    all other states have a reward of -0.04; γ = 1

[Figure: action model (0.8 intended, 0.1 to each side) and the resulting utilities of the states of the 4 x 3 world:
 row 3:  0.812   0.868   0.918    +1
 row 2:  0.762   (wall)  0.660    -1
 row 1:  0.705   0.655   0.611   0.388
         col 1   col 2   col 3   col 4]
21
22
Bellman equation for the 4x3 world, written for the state (1,1):

U(1,1) = -0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),    (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                 (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                 (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }   (Right)

Up is the best action.

[Utility grid of the 4 x 3 world, as on the previous slide:
 row 3:  0.812   0.868   0.918    +1
 row 2:  0.762   (wall)  0.660    -1
 row 1:  0.705   0.655   0.611   0.388]
23
  • Value Iteration
  • Given the maximal expected utility, the optimal
    policy is
  • π*(s) = argmax_a (R(s) + γ Σ_s' T(s,a,s') U(s'))
    (defines the best action in state s)
  • Compute U(s) using an iterative approach → Value
    Iteration
  • U_0(s) = R(s)
  • U_{t+1}(s) = R(s) + max_a (γ Σ_s' T(s,a,s') U_t(s'))
    (compute for all s)
  • t → ∞: the utility values converge to the
    optimal values
23
24
  • Policy iteration (sketched in code below)
  • Manipulate the policy directly, rather than
    finding it indirectly via the optimal value
    function
  • choose an arbitrary policy π (randomly)
  • at each time t, compute the long-term reward
    starting in s, using π_t, i.e. solve the equations
  • U_t(s) = R(s) + γ Σ_s' T(s, π_t(s), s') U_t(s')
  • improve the policy at each state
  • π_{t+1}(s) ← argmax_a (R(s) + γ Σ_s' T(s,a,s') U_t(s'))
  • Involves all next states - complex
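A Python sketch of policy iteration under the same assumed representation as the earlier value-iteration sketch (T[(s, a)] as (s', probability) pairs, actions(s) empty for terminal states). Policy evaluation is approximated here by repeated sweeps instead of solving the linear system exactly, a modified-policy-iteration simplification:

    import random

    def policy_iteration(states, actions, T, R, gamma, eval_sweeps=50):
        # start from an arbitrary (random) policy over the non-terminal states
        policy = {s: random.choice(actions(s)) for s in states if actions(s)}
        while True:
            # approximate policy evaluation by repeated sweeps of
            # U(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') U(s')
            U = {s: R[s] for s in states}
            for _ in range(eval_sweeps):
                new_U = {}
                for s in states:
                    if s in policy:
                        new_U[s] = R[s] + gamma * sum(p * U[s2] for s2, p in T[(s, policy[s])])
                    else:
                        new_U[s] = R[s]
                U = new_U
            # greedy policy improvement
            changed = False
            for s in policy:
                best = max(actions(s),
                           key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)]))
                if best != policy[s]:
                    policy[s], changed = best, True
            if not changed:
                return policy, U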

24
25
  • 2.6 RL learning
  • Use observed rewards to learn an optimal (or near
    optimal) policy for the environment
  • Example: play 100 moves and you lose
  • In an MDP the agent has a complete model of the
    environment
  • Now the agent does not have such a model
  • Passive learning: the agent's policy is fixed; the
    task is to learn the utilities of states (or
    state-action pairs)
  • Active learning: the agent must also learn what
    to do - exploitation/exploration

25
26
  • (a) Passive reinforcement learning
  • The policy is fixed: in state s always execute π(s)
  • Goal: learn how good the policy is, i.e. learn U^π(s)
  • The agent does not know T(s,a,s') and does not know
    R(s) in advance
  • ADP (Adaptive Dynamic Programming) learning
  • The problem of calculating an optimal policy in
    an accessible, stochastic environment.
  • ADP: plug the learned T(s, π(s), s') and the
    observed rewards R(s) into the Bellman equations
    to calculate the utility of states
  • Supervised learning: input = state-action pairs,
    output = resulting state
  • Estimate the transition probabilities T(s,a,s') from
    the frequencies with which s' is reached after
    executing a in s (counting sketched in code below)
  • Example: from (1,3), Right led 2 times to (2,3) and
    1 time stayed in (1,3) =>
    T((1,3), Right, (2,3)) = 2/3
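The frequency estimate used by ADP can be kept in simple counters; a minimal sketch, reusing the slide's example states and action:

    from collections import defaultdict

    Nsa = defaultdict(int)     # counts of (s, a)
    Nsas = defaultdict(int)    # counts of (s, a, s')

    def record_transition(s, a, s_next):
        Nsa[(s, a)] += 1
        Nsas[(s, a, s_next)] += 1

    def T_hat(s, a, s_next):
        # estimated T(s, a, s') = N[s,a,s'] / N[s,a]
        return Nsas[(s, a, s_next)] / Nsa[(s, a)] if Nsa[(s, a)] else 0.0

    record_transition((1, 3), "Right", (2, 3))
    record_transition((1, 3), "Right", (2, 3))
    record_transition((1, 3), "Right", (1, 3))
    print(T_hat((1, 3), "Right", (2, 3)))      # 2/3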

26
27
  • ADP (Adaptive Dynamic Programming) learning
  • function Passive-ADP-Agent(percept) returns an
    action
  • inputs: percept, a percept indicating the current
    state s' and reward signal r'
  • variables: π, a fixed policy
  • mdp, an MDP with model T, rewards R, discount γ
  • U, a table of utilities, initially empty
  • Nsa, a table of frequencies for state-action
    pairs, initially zero
  • Nsas', a table of frequencies of
    state-action-state triples, initially zero
  • s, a, the previous state and action, initially
    null
  • if s' is new then U[s'] ← r', R[s'] ← r'
  • if s is not null then
  •   increment Nsa[s,a] and Nsas'[s,a,s']
  •   for each t such that Nsas'[s,a,t] ≠ 0 do
  •     T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  • U ← Value-Determination(π, U, mdp)
    (according to the MDP: value iteration or policy
    iteration)
  • if Terminal?[s'] then s,a ← null else s,a ← s',
    π[s']
  • return a
  • end
27
28
  • Temporal difference learning
  • (TD learning)
  • The value function is no longer computed by
    solving a set of linear equations; it is
    computed iteratively.
  • Use observed transitions to adjust the values of
    the observed states so that they agree with the
    constraint equations.
  • U^π(s) ← U^π(s) + α(R(s) + γ U^π(s') - U^π(s))
    (sketched in code below)
  • α is the learning rate.
  • Whatever state is visited, its estimated value is
    updated to be closer to R(s) + γ U^π(s'),
    since R(s) is the instantaneous reward received
    and U^π(s') is the estimated value of the actually
    occurring next state.
  • Simpler than ADP: involves only the next state
  • α decreases as the number of times the state is
    visited increases
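The TD update itself is one line; a hedged sketch with a fixed learning rate (in the lecture α decreases with the visit count, which a table of counts would provide):

    def td_update(U, s, r, s_next, gamma=0.9, alpha=0.1):
        # U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
        U.setdefault(s, 0.0)
        U.setdefault(s_next, 0.0)
        U[s] += alpha * (r + gamma * U[s_next] - U[s])

    U = {}
    td_update(U, s="A", r=-0.04, s_next="B")   # invented states and reward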

28
29
  • Temporal difference learning
  • function Passive-TD-Agent(percept) returns an
    action
  • inputs: percept, a percept indicating the current
    state s' and reward signal r'
  • variables: π, a fixed policy
  • U, a table of utilities, initially empty
  • Ns, a table of frequencies for states,
    initially zero
  • s, a, r, the previous state, action, and
    reward, initially null
  • if s' is new then U[s'] ← r'
  • if s is not null then
  •   increment Ns[s]
  •   U[s] ← U[s] + α(Ns[s])(r + γ U[s'] - U[s])
  • if Terminal?[s'] then s, a, r ← null else s, a, r
    ← s', π[s'], r'
  • return a
  • end

29
30
  • Temporal difference learning
  • Does not need a model to perform its updates
  • The environment supplies the connections between
    neighboring states in the form of observed
    transitions.
  • ADP and TD comparison
  • Both ADP and TD make local adjustments to
    the utility estimates in order to make each state
    "agree" with its successors
  • TD adjusts a state to agree with the observed
    successor
  • ADP adjusts a state to agree with all of the
    successors that might occur, weighted by their
    probabilities

30
31
  • (b) Active reinforcement learning
  • A passive learning agent has a fixed policy that
    determines its behavior
  • An active learning agent must decide what action
    to take
  • The agent must learn a complete model with
    outcome probabilities for all actions (instead of
    a model for the fixed policy)
  • Compute/learn the utilities that obey the Bellman
    equation:
  • U(s) = R(s) + γ max_a Σ_s' T(s,a,s') U(s')
  • using value iteration or policy iteration
  • - If value iteration, then look for the action
    that maximizes utility
  • - If policy iteration, you already have the action
  • - Exploration/exploitation
  • - The representative problem is the n-armed
    bandit problem
  • Solutions:
  • with probability 1/t choose a random action, the rest
    of the time follow π
  • give weights to actions that have not been
    explored, avoid actions with low utilities
  • Exploration function f(u,n): controls how greedy
    (preferring high utility values) or curious
    (preferring rarely tried actions, i.e. exploration)
    the agent is

31
32
  • Q-learning
  • Active learning of action-value functions
  • An action-value function assigns an expected
    utility to taking a given action in a given
    state: Q-values
  • Q(a, s) = the value of doing action a in state s
    (expected utility)
  • Q-values are related to utility values by the
    equation:
  • U(s) = max_a Q(a, s)
  • Approach 1:
  • Q(a,s) = R(s) + γ Σ_s' T(s,a,s') max_a' Q(a',s')
  • This requires a model
  • Approach 2:
  • Use TD
  • The agent does not need to learn a model: model-free

32
33
  • Q-learning
  • TD learning, unknown environment
  • Q(a,s) ← Q(a,s) + α(R(s) + γ max_a' Q(a',s') - Q(a,s))
    (sketched in code below)
  • calculated after each transition from state s to
    s'.
  • Is it better to learn a model and a utility
    function, or to learn an action-value function
    with no model?
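A minimal model-free Q-learning sketch matching the update above. The table layout Q[(a, s)] and the greedy choice function are assumptions, and exploration (e.g. occasionally choosing a random action) is omitted for brevity:

    from collections import defaultdict

    Q = defaultdict(float)                     # Q[(a, s)], default 0.0

    def q_update(s, a, r, s_next, next_actions, gamma=0.9, alpha=0.1):
        # Q(a,s) <- Q(a,s) + alpha * (R(s) + gamma * max_a' Q(a',s') - Q(a,s))
        best_next = max((Q[(a2, s_next)] for a2 in next_actions), default=0.0)
        Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])

    def greedy_action(s, available_actions):
        # the action the agent would exploit in state s
        return max(available_actions, key=lambda a: Q[(a, s)])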

33
34
  • Q-learning
  • function Q-Learning-Agent(percept) returns an
    action
  • inputs: percept, a percept indicating the current
    state s' and reward signal r'
  • variables: Q, a table of action values indexed by
    state and action
  • Nsa, a table of frequencies for state-action
    pairs
  • s, a, r, the previous state, action, and reward,
    initially null
  • if s is not null then
  •   increment Nsa[s,a]
  •   Q[a,s] ← Q[a,s] + α(Nsa[s,a])(r + γ max_a' Q[a',s'] - Q[a,s])
  • if Terminal?[s'] then s, a, r ← null
  • else s, a, r ← s', argmax_a' f(Q[a',s'],
    Nsa[a',s']), r'
  • return a
  • end

Without the exploration function f:
s, a, r ← s', argmax_a' Q[a',s'], r'
34
35
  • Generalization in RL
  • The problem of learning in large spaces: a large
    number of states
  • Generalization techniques allow compact storage
    of learned information and transfer of knowledge
    between "similar" states and actions:
  • Neural nets
  • Decision trees
  • U(state) = U(most similar state in memory)
  • U(state) = average of U(most similar states in memory)

35
36
3 Learning in MAS
  • The credit-assignment problem (CAP): the problem
    of assigning feedback (credit or blame) for an
    overall performance change of the MAS (increase,
    decrease) to each agent that contributed to that
    change
  • inter-agent CAP: assigns credit or blame to the
    external actions of agents
  • intra-agent CAP: assigns credit or blame for a
    particular external action of an agent to its
    internal inferences and decisions
  • the distinction is not always obvious; in practice one
    usually focuses on one or the other

36
37
3.1 Learning action coordination
  • s = current environment state
  • Agent i determines the set of actions it can do
    in s: A_i(s) = {A_ij(s)}
  • It computes the goal relevance of each action:
    E_ij(s)
  • Agent i announces a bid for each action with
    E_ij(s) > threshold
  • B_ij(s) = (α + β) E_ij(s)
  • α - risk factor (small), β - noise term (to
    prevent convergence to local minima)

37
38
  • The action with the highest bid is selected
  • Incompatible actions are eliminated
  • The process is repeated until all actions in bids are
    either selected or eliminated
  • A = the set of selected actions = the activity context
  • Execute the selected actions
  • Update the goal relevance of the actions in A
    (both updates are sketched in code below):
  • E_ij(s) ← E_ij(s) - B_ij(s) + (R / |A|)
  • R = external reward received
  • Update the goal relevance of the actions in the previous
    activity context A_p (actions A_kl):
  • E_kl(s_p) ← E_kl(s_p) + (Σ_{A_ij ∈ A} B_ij(s) / |A_p|)
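A Python sketch of these two updates as reconstructed above (a bucket-brigade style credit flow: selected actions pay their bids and share the external reward, while the previous activity context is credited with the bids just paid). The dictionary keys (i, j, s) are an assumed representation:

    def update_selected(E, B, s, selected, R):
        # E_ij(s) <- E_ij(s) - B_ij(s) + R / |A|
        for (i, j) in selected:
            E[(i, j, s)] = E[(i, j, s)] - B[(i, j, s)] + R / len(selected)

    def update_previous_context(E, B, s, selected, s_prev, prev_context):
        # E_kl(s_p) <- E_kl(s_p) + (sum of current bids) / |A_p|
        paid = sum(B[(i, j, s)] for (i, j) in selected)
        for (k, l) in prev_context:
            E[(k, l, s_prev)] += paid / len(prev_context)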

38
39
3.2 Learning individual performance
  • The agent learns how to improve its individual
    performance in a multi-agent setting
  • Examples
  • Cooperative agents - learning organizational
    roles
  • Competitive agents - learning from market
    conditions

39
40
3.2.1 Learning organizational roles (Nagendra et al.)
  • Agents learn to adopt a specific role in a
    particular situation (state) in a cooperative
    MAS.
  • Aim: to increase the utility of final states
  • Each agent may play several roles in a situation
  • The agents learn to select the most appropriate
    role
  • Use reinforcement learning
  • Utility, Probability, and Cost (UPC) estimates of
    a role in a situation
  • Utility - the agent's estimate of the worth of a final
    state for a specific role in a situation (world
    states are mapped to a smaller set of situations)
  • S: s_0, ..., s_f
  • U_rs = U(s_f), where s_0 → ... → s_f

40
41
  • Probability - the likelihood of reaching a final
    state for a specific role in a situation
  • P_rs = p(s_f), s_0 → ... → s_f
  • Cost - the computational cost of reaching a final
    state for a specific role in a situation
  • Potential of a role - estimates the usefulness of
    a role for discovering pertinent global information
    and constraints (orthogonal to utilities)
  • Representation
  • S^k - vector of situations for agent k: S^k_1, ..., S^k_n
  • R^k - vector of roles for agent k: R^k_1, ..., R^k_m
  • |S^k| x |R^k| x 4 values to describe UPC and
    Potential

41
42
  • Functioning
  • Phase I: Learning
  • Several learning cycles; in each cycle
  • each agent goes from s_0 to s_f and selects its
    role as the one with the highest probability
  • Probability of selecting a role r in a situation
    s: determined by the roles' ratings f
  • f - objective function used to rate the roles
    (e.g., f(U, P, C, Pot) = U·P·C + Pot)
  • - f depends on the domain

42
43
  • Use reinforcement learning to update UPC and the
    potential of a role
  • For every s ∈ {s_0, ..., s_f} and chosen role r in s:
  • U_rs^{i+1} = (1-α) U_rs^i + α U(s_f)
  • i - the learning cycle
  • U(s_f) - the utility of a final state
  • 0 ≤ α ≤ 1 - the learning rate
  • P_rs^{i+1} = (1-α) P_rs^i + α O(s_f)
  • O(s_f) = 1 if s_f is successful, 0 otherwise

43
44
  • Pot_rs^{i+1} = (1-α) Pot_rs^i + α Conf(Path)
  • Path = s_0, ..., s_f
  • Conf(Path) = 0 if there are conflicts on the
    Path, 1 otherwise
  • The update rules for cost are domain dependent
    (the three updates above are sketched in code below)
  • Phase II: Performing
  • In a situation s, the role r with the best rating
    f(U_rs, P_rs, C_rs, Pot_rs) is chosen
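A sketch of the Phase I updates, assuming the UPC values are stored in dictionaries keyed by (role, situation); the cost update is omitted since it is domain dependent:

    def update_upc(U, P, Pot, role, situation, alpha, u_final, success, conflict_free):
        key = (role, situation)
        # U_rs <- (1-alpha) U_rs + alpha * U(s_f)
        U[key] = (1 - alpha) * U[key] + alpha * u_final
        # P_rs <- (1-alpha) P_rs + alpha * O(s_f), O = 1 if the run succeeded
        P[key] = (1 - alpha) * P[key] + alpha * (1.0 if success else 0.0)
        # Pot_rs <- (1-alpha) Pot_rs + alpha * Conf(Path), Conf = 1 if no conflicts
        Pot[key] = (1 - alpha) * Pot[key] + alpha * (1.0 if conflict_free else 0.0)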

44
45
3.2.2 Learning in market environments (Vidal & Durfee)
  • Agents use past experience and evolved models of
    other agents to better sell and buy goods
  • Environment: a market in which agents buy and
    sell information (an electronic marketplace)
  • Open environment
  • The agents are self-interested (they maximize local
    utility)
  • G - a set of goods
  • P - set of possible prices for goods
  • Q^g - set of possible qualities for a good g

45
46
  • information has a cost for the seller and a value
    for the buyer
  • information is sold at a certain price
  • a buyer announces a good it needs
  • sellers bid their prices for delivering the good
  • the buyer selects from these bids and pays the
    corresponding price
  • the buyer assesses the quality of the information
    after it receives it from the seller
  • Profit of a seller s for selling the good g at
    price p:
  • Profit_s^g(p) = p - c_s^g
  • c_s^g - the cost of producing the good g
    by s, p - the price
  • Value of a good g for a buyer b:
  • V_b^g(p, q), where p is the price b paid for g and
    q is the quality of the good g
  • Goal: the seller maximizes profit in a transaction,
    the buyer maximizes value in a transaction

46
47
  • 3 types of agents
  • 0-level agents
  • they set their buying and selling prices based on
    their own past experience
  • they do not model the behavior of other agents
  • 1-level agents
  • model other agents based on previous interactions
  • they set their buying and selling prices based on
    these models and on past experience
  • they model the other agents as 0-level agents
  • 2-level agents
  • same as 1-level agents but they model the other
    agents as 1-level agents

47
48
  • Strategy of 0-level agents (both updates sketched
    in code below)
  • 0-level buyer
  • - learns the expected value function, f^g(p), of
    buying g at price p
  • - uses reinforcement learning:
  • f^g_{i+1}(p) = (1-α) f^g_i(p) + α V_b^g(p,q),
    α_min ≤ α ≤ 1; for i = 0, α = 1
  • - chooses the seller s for supplying a good g
  • 0-level seller
  • - learns the expected profit function, h^g(p), if
    it offers good g at price p
  • - uses reinforcement learning:
  • h^g_{i+1}(p) = (1-α) h^g_i(p) + α Profit_s^g(p)
  • where Profit_s^g(p) = p - c_s^g if it
    wins the auction, 0 otherwise
  • - chooses the price p^s to sell the good g so as
    to maximize profit
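A sketch of the two 0-level updates, with the learned functions stored per (good, price) pair; the dictionary layout and the default value of 0 are assumptions:

    def update_buyer_value(f, good, price, observed_value, alpha):
        # f_g(p) <- (1-alpha) f_g(p) + alpha * V_b^g(p, q)
        f[(good, price)] = (1 - alpha) * f.get((good, price), 0.0) + alpha * observed_value

    def update_seller_profit(h, good, price, won_auction, cost, alpha):
        # h_g(p) <- (1-alpha) h_g(p) + alpha * profit, profit = p - c if the bid won, else 0
        profit = (price - cost) if won_auction else 0.0
        h[(good, price)] = (1 - alpha) * h.get((good, price), 0.0) + alpha * profit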

48
49
  • Strategy of 1-level agents
  • 1-level buyer
  • - models sellers for good g
  • - does not model other buyers
  • - uses a probability distribution function q_s^g(x)
    over the qualities x of a good g
  • - computes the expected utility, E_s^g, of buying good
    g from seller s
  • - chooses the seller s for supplying a good g
    that maximizes this expected utility

49
50
  • 1-level seller
  • - models the buyers for good g
  • - models the other sellers s' for good g
  • Buyer modeling
  • - uses a probability distribution function m_b^g(p)
    - the probability that b will choose price p for
    good g
  • Seller modeling
  • - uses a probability distribution function
    n_s'^g(y) - the probability that s' will bid price
    y for good g
  • - computes the probability of bidding lower than
    a given seller s' with the price p:
  • Prob_of_bidding_lower_than_s'
    = Σ_p' (probability of a bid p' of s' for which s wins)
    = Σ_p' N(g, b, s, s', p, p')
  • N(g, b, s, s', p, p') = n_s'^g(p') if m_b^g(p') ≤ m_b^g(p),
    0 otherwise

50
51
  • - computes the probability of bidding lower than
    all other sellers with the price p:
  • Prob_of_bidding_lower_with_p
    = Π_{s' ∈ S - {s}} Prob_of_bidding_lower_than_s'
  • - chooses the best price p to bid so as to
    maximize profit

51
52
3.3 Learning to communicate
  • What to communicate (e.g., what information is of
    interest to the others)
  • When to communicate (e.g., when to try doing
    something by itself and when to look for help)
  • With which agents to communicate
  • How to communicate (e.g., language, protocol,
    ontology)

52
53
Learning with which agents to communicate (Ohko et al.)
  • Learning which agents to ask to perform a
    task
  • Used in a contract net protocol for task
    allocation, to reduce the communication needed for
    task announcement
  • Goal: acquire and refine knowledge about other
    agents' task-solving abilities
  • Case-based reasoning used for knowledge
    acquisition and refinement
  • A case consists of
  • (1) A task specification
  • (2) Information about which agents solved a task
    or similar tasks in the past and the quality of
    the provided solution

53
54
  • (1) Task specification
  • T_i = {A_i1 = V_i1, ..., A_im_i = V_im_i}
  • A_ij - task attribute, V_ij - value of the attribute
  • Similar tasks (similarity computation sketched in
    code below):
  • Sim(T_i, T_j) = Σ_r Σ_s Dist(A_ir, A_js),
    A_ir ∈ T_i, A_js ∈ T_j
  • Dist(A_ir, A_js) = Sim_Attr(A_ir, A_js) ·
    Sim_Vals(V_ir, V_js)
  • Set of similar tasks:
  • S(T) = {T_j | Sim(T, T_j) ≥ 0.85}
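A sketch of the similarity computation and case retrieval, assuming tasks are represented as dictionaries mapping attributes to values and that domain-supplied functions sim_attr and sim_vals return the two similarity terms:

    def sim(task_i, task_j, sim_attr, sim_vals):
        # Sim(Ti, Tj) = sum_r sum_s Sim_Attr(A_ir, A_js) * Sim_Vals(V_ir, V_js)
        return sum(sim_attr(a_i, a_j) * sim_vals(v_i, v_j)
                   for a_i, v_i in task_i.items()
                   for a_j, v_j in task_j.items())

    def similar_tasks(task, case_base, sim_attr, sim_vals, threshold=0.85):
        # S(T) = { Tj | Sim(T, Tj) >= 0.85 }
        return [t for t in case_base if sim(task, t, sim_attr, sim_vals) >= threshold]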

54
55
  • (2) Which agents performed T or similar tasks in
    the past
  • Suitability of agent k:
  • Perform(A_k, T_j) - the quality of the solution for T_j
    obtained when agent A_k performed T_j in the past
  • The agent computes
    Suit(A_k, T), Suit(A_k, T) > 0, and selects the
    agent k with the highest suitability,
    or the first m agents with the best suitability
  • After each encounter, the agent stores the tasks
    performed by other agents and the solution
    quality
  • Tradeoff between exploitation and exploration

55
56
3.4 Layered learning
  • (Stone & Veloso)
  • A hierarchical machine learning paradigm for MAS
  • Used in simulated robotic soccer (RoboCup)
  • Learning a direct mapping Input → Output is intractable
  • Decompose the learning task L into subtasks L1,
    ..., Ln
  • Characteristics of the environment:
  • Cooperative MAS
  • Teammates and adversaries
  • Hidden states: agents have a partial world view
    at any given moment
  • Agents have noisy sensory data and actuators
  • Perception and action cycles are asynchronous
  • Agents must make their decisions in real-time

56
57
  • Problem: the agent receives a moving ball and
    must decide what to do with it: dribble, pass to
    a teammate, or shoot towards the goal
  • Decompose the problem into 3 subtasks:
  • Layer  Behavior type  Example
  • L1     Individual     Ball interception
  • L2     Multiagent     Pass evaluation
  • L3     Team           Pass selection
  • The decomposition into subtasks enables the
    learning of more complex behaviors
  • The hierarchical task decomposition is
    constructed bottom-up, in a domain-dependent
    fashion
  • Learning methods are chosen to suit the task
  • Learning in one layer feeds into the next layer,
    either by providing a portion of the behavior
    used for training (ball interception → pass
    evaluation) or by creating the input
    representation and pruning the action space (pass
    evaluation → pass selection)

57
58
  • L1: Ball interception
  • behavior: individual
  • Aim:
  • block or intercept opponents' shots or passes, or
  • receive passes from teammates
  • Learning method: a fully connected
    backpropagation neural network
  • Training: repeatedly shoot the ball towards a defender
    in front of a goal. The defender collects training
    examples by acting randomly and noticing when it
    successfully stops the ball
  • Classification:
  • Saves: successful interceptions
  • Goals: unsuccessful attempts
  • Misses: shots that went wide of the goal

58
59
  • L2: Pass evaluation
  • behavior: multiagent
  • Uses the learned ball-interception skill as part
    of the behavior for training the multiagent behavior
  • Aim: the agent must decide
  • whether (or not) to pass the ball to a teammate, and
  • whether the teammate will successfully receive the
    ball (based on the positions and abilities of the
    teammates to receive or intercept a pass)
  • Learning method: decision trees (C4.5)
  • Training: kick the ball towards randomly placed teammates
    interspersed with randomly placed opponents
  • The intended pass recipient and the opponents all
    use the learned ball-interception behavior
  • Classification of a potential pass to a receiver:
  • Success, with a confidence factor in (0, 1]
  • Failure, with a confidence factor in [-1, 0)
  • Miss (= 0)

59
60
  • L3: Pass selection
  • behavior: team
  • Uses the learned pass-evaluation capability to
    create the input and output sets for learning pass
    selection
  • Aim: the agent has the ball and must decide
  • to which teammate to pass the ball, or
  • whether to shoot on goal
  • Learning method: Q-learning of a function that
    depends on the agent's position on the field
  • Training: simulate 2 teams whose behaviors are identical
    other than their pass-selection policies
  • Reinforcement: total goals scored
  • Learns:
  • whether to shoot at the goal
  • the teammate to which to pass

60
61
4 Conclusions
  • There is no unique method or set of methods for
    learning in MAS
  • Many approaches are based on extending ML
    techniques in a MAS setting
  • Many approaches use reinforcement learning, but
    also NN or genetic algorithms

61
62
  • References
  • S. Sen, G. Weiss. Learning in Multiagent systems.
    In Multiagent Systems - A Modern Approach to
    Distributed Artificial Intelligence, G. Weiss
    (Ed.), The MIT Press, 2001, p.257-298.
  • T. Ohko et al. Addressee learning and message
    interception for communication load reduction in
    multiple robot environment. In Distributed
    Artificial Intelligence Meets Machine Learning,
    G. Weiss, Ed., Lecture Notes in Artificial
    Intelligence, Vol. 1221, Springer-Verlag, 1997,
    p.242-258.
  • M.V. Nagendra et al. Learning organizational roles
    in a heterogeneous multi-agent system. In Proc.
    of the Second International Conference on
    Multiagent Systems, AAAI Press, 1996, p.291-298.
  • J.M. Vidal, E.H. Durfee. The impact of nested
    agent models in an information economy. In Proc.
    of the Second International Conference on
    Multiagent Systems, AAAI Press, 1996, p.377-384.
  • P. Stone, M. Veloso. Layered Learning, Eleventh
    European Conference on Machine Learning,
    ECML-2000.

62
63
  • Web References
  • An interesting set of training examples and the
    connection between decision trees and rules:
  • http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html
  • Decision tree construction:
  • http://www.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/4_dtrees2.html
  • Building Classification Models: ID3 and C4.5:
  • http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/
  • Introduction to Reinforcement Learning:
  • http://www.cs.indiana.edu/~gasser/Salsa/rl.html
  • On-line book on Reinforcement Learning:
  • http://www-anw.cs.umass.edu/~rich/book/the-book.html

63