Multiagent Planning with Factored MDPs

1
Multiagent Planning with Factored MDPs
  • Carlos Guestrin
  • Daphne Koller
  • Stanford University
  • Ronald Parr
  • Duke University

2
Multiagent Coordination Examples
  • Search and rescue
  • Factory management
  • Supply chain
  • Firefighting
  • Network routing
  • Air traffic control
  • Multiple, simultaneous decisions
  • Limited observability
  • Limited communication

3
Network Management Problem
  • Administrators must coordinate to maximize global reward

4
Joint Decision Space
  • Represent as an MDP
  • Action space: joint action a = (a1, …, an) over all agents
  • State space: joint state x of the entire system
  • Reward function: total reward r
  • Action space is exponential in the number of agents
  • State space is exponential in the number of variables
  • Global decision requires complete observation
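A tiny Python illustration of the blow-up (hypothetical SysAdmin-style actions, not from the talk): enumerating the joint action space with itertools.product grows as |A|^n in the number of agents n.

from itertools import product

per_agent_actions = ["noop", "reboot"]   # hypothetical choices for one SysAdmin agent
n_agents = 4

# The joint action space is every combination of individual choices: 2**n_agents tuples.
joint_actions = list(product(per_agent_actions, repeat=n_agents))
print(len(joint_actions))   # 16 for 4 agents; for 500 agents, 2**500 is astronomically large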

5
Long-term Utilities
  • One-step utility:
  • SysAdmin Ai receives a reward if its process completes
  • Total utility: sum of rewards
  • Optimal action requires long-term planning
  • Long-term utility Q(x,a):
  • Expected total reward, given current state x and action a
  • Optimal action at state x is the one maximizing Q(x,a)
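In symbols (a reconstruction of the formulas that did not survive the transcript; γ is the discount factor):

    Q(\mathbf{x},\mathbf{a}) \;=\; \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_t \;\Big|\; \mathbf{x}_0=\mathbf{x},\ \mathbf{a}_0=\mathbf{a}\Big],
    \qquad
    \mathbf{a}^{*}(\mathbf{x}) \;=\; \arg\max_{\mathbf{a}}\, Q(\mathbf{x},\mathbf{a}).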

6
Local Q function Approximation
Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
(Q3 is the term associated with Agent 3.)
Limited observability: agent i only observes the variables in Qi
Must choose the joint action maximizing ∑i Qi
7
Maximizing ∑i Qi: Coordination Graph
  • Use variable elimination for the maximization
    [Bertele & Brioschi '72] (a small code sketch follows below)

Here we need only 23, instead of 63, sum operations.
  • Limited communication suffices for the optimal action choice
  • Communication bandwidth = induced width of the coordination graph
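Below is a minimal Python sketch of this elimination step, using made-up local Q values and binary actions (my own illustration, not the authors' code): each agent is maxed out in turn, and the optimal joint action is recovered by walking the elimination order backwards.

from itertools import product

ACTIONS = [0, 1]  # each agent chooses one of two actions

# Each factor: (scope, table) where table maps a tuple of actions (in scope order) to a value.
def make_factor(scope, fn):
    return (scope, {assign: fn(*assign) for assign in product(ACTIONS, repeat=len(scope))})

# Hypothetical local Q functions on the four edges of the coordination graph
# (state variables Xi are assumed already instantiated to their observed values).
factors = [
    make_factor(("A1", "A4"), lambda a1, a4: 2.0 * (a1 == a4)),
    make_factor(("A1", "A2"), lambda a1, a2: 1.0 * a1 + 0.5 * a2),
    make_factor(("A2", "A3"), lambda a2, a3: 1.5 * (a2 != a3)),
    make_factor(("A3", "A4"), lambda a3, a4: 0.7 * a3 * a4),
]

def eliminate(factors, order):
    """Max out agents one at a time; return the optimal value and joint action."""
    argmax_tables = []                       # per eliminated agent, its best response
    for agent in order:
        involved = [f for f in factors if agent in f[0]]
        rest = [f for f in factors if agent not in f[0]]
        # New scope: every other variable mentioned by the involved factors.
        new_scope = tuple(sorted({v for s, _ in involved for v in s if v != agent}))
        new_table, best_action = {}, {}
        for assign in product(ACTIONS, repeat=len(new_scope)):
            ctx = dict(zip(new_scope, assign))
            scores = []
            for a in ACTIONS:
                ctx[agent] = a
                scores.append((sum(t[tuple(ctx[v] for v in s)] for s, t in involved), a))
            new_table[assign], best_action[assign] = max(scores)
        factors = rest + [(new_scope, new_table)]
        argmax_tables.append((agent, new_scope, best_action))
    value = sum(t[()] for _, t in factors)   # all scopes are empty by now
    # Decode the maximizing joint action by walking the elimination order backwards.
    joint = {}
    for agent, scope, best in reversed(argmax_tables):
        joint[agent] = best[tuple(joint[v] for v in scope)]
    return value, joint

print(eliminate(factors, order=["A4", "A3", "A2", "A1"]))

Each elimination is exponential only in the number of variables appearing together in a factor, which is why communication and computation scale with the induced width of the coordination graph rather than with the number of agents.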

8
Where do the Qi come from?
  • Use function approximation to find the Qi
  • Q(X1,…,X4, A1,…,A4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
  • Long-term planning requires a Markov Decision Process
  • Number of states is exponential
  • Number of actions is exponential
  • Efficient approximation by exploiting structure!

9
Dynamic Decision Diagram
P(X1′ | X1, X4, A1)
  • State
  • Dynamics
  • Decisions
  • Rewards
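A small Python sketch of what such a factored transition model looks like (status encoding, action encoding, and probabilities are hypothetical, not from the talk): each variable's next value depends only on a few parent variables and one agent's action, so the joint transition never has to be enumerated.

# Factored dynamics: X1' depends only on its parents (X1, X4) and agent 1's action A1.
# Encoding (made up for illustration): machine status 0 = down, 1 = up; action 0 = noop, 1 = reboot.
def p_x1_next(x1_next, x1, x4, a1):
    if a1 == 1:                 # rebooting brings the machine up with high probability
        p_up = 0.95
    elif x4 == 0:               # a failed neighbor makes staying up less likely
        p_up = 0.5 * x1
    else:                       # healthy neighbor: mostly keep the current status
        p_up = 0.9 * x1
    return p_up if x1_next == 1 else 1.0 - p_up

print(p_x1_next(1, x1=0, x4=1, a1=1))   # 0.95

# The full joint transition factors into a product of such local models:
# P(x' | x, a) = prod_i P(Xi' | parents(Xi'), Ai)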

10
Long-term Utility: Value of the MDP
  • Value computed by linear programming
  • One variable V(x) for each state
  • One constraint for each state x and action a
  • Number of states and actions is exponential!
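For reference, this is the standard exact LP for an MDP (state-relevance weights α(x) > 0, discount factor γ; notation mine, consistent with the slides):

    \min_{V} \;\; \sum_{\mathbf{x}} \alpha(\mathbf{x})\, V(\mathbf{x})
    \quad \text{s.t.} \quad
    V(\mathbf{x}) \;\ge\; R(\mathbf{x},\mathbf{a}) + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a})\, V(\mathbf{x}')
    \qquad \forall\, \mathbf{x}, \mathbf{a}.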

11
Decomposable Value Functions
Linear combination of restricted-domain functions:
V(x) ≈ ∑i wi hi(x)
[Bellman et al. '63; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]
  • Each hi depends only on a small part of the complex system, e.g.:
  • Status of a machine and its neighbors
  • Load on a machine
  • Must find weights w giving a good approximate value function
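A minimal Python sketch of such basis functions (my own illustration of the idea, not the talk's exact features): one indicator feature per machine plus a bias, so every hi looks only at a tiny piece of the joint state.

# Hypothetical indicator basis functions for the SysAdmin problem.
def make_basis(n_machines):
    basis = [lambda x: 1.0]                                   # h0: constant bias feature
    for i in range(n_machines):
        basis.append(lambda x, i=i: 1.0 if x[i] == "up" else 0.0)   # hi: "machine i is up"
    return basis

def value(x, weights, basis):
    # V(x) ~ sum_i w_i * h_i(x): each h_i depends on a small part of the state.
    return sum(w * h(x) for w, h in zip(weights, basis))

x = ("up", "down", "up")                                      # status of 3 machines
print(value(x, weights=[0.5, 2.0, 2.0, 2.0], basis=make_basis(3)))  # 0.5 + 2 + 0 + 2 = 4.5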

12
Single LP Solution for Factored MDPs
[Schweitzer & Seidmann '85]
  • One variable wi for each basis function ⇒ polynomially many LP variables
  • One constraint for every state and action ⇒ exponentially many LP constraints
  • hi, Qi depend only on small sets of variables/actions ⇒ the constraints can be represented compactly (next slides)
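Substituting the linear value function V(x) ≈ ∑i wi hi(x) into the exact LP gives the approximate LP the slide refers to (a reconstruction in standard notation):

    \min_{w} \;\; \sum_{\mathbf{x}} \alpha(\mathbf{x}) \sum_{i} w_i\, h_i(\mathbf{x})
    \quad \text{s.t.} \quad
    \sum_{i} w_i\, h_i(\mathbf{x}) \;\ge\; R(\mathbf{x},\mathbf{a}) + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a}) \sum_{i} w_i\, h_i(\mathbf{x}')
    \qquad \forall\, \mathbf{x}, \mathbf{a}.

The LP now has only the weights wi as variables, but still one constraint per state-action pair, which the next two slides address.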

13
Representing Exponentially Many Constraints
[Guestrin et al. '01]
Exponentially many linear constraints ⇒ one equivalent nonlinear constraint
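Equivalently (reconstructing the formula from the surrounding slides), all of those constraints can be folded into the single nonlinear constraint:

    0 \;\ge\; \max_{\mathbf{x},\mathbf{a}} \Big[ R(\mathbf{x},\mathbf{a}) + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a}) \sum_i w_i\, h_i(\mathbf{x}') \;-\; \sum_i w_i\, h_i(\mathbf{x}) \Big].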
14
Representing the Constraints
  • The functions are factored, so use variable elimination to represent the constraints compactly (a small worked example follows below)

The number of constraints becomes exponentially smaller.
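As a tiny illustration (my own example with made-up factors f1, f2, f3, not from the talk): suppose the bracketed expression above factors as f1(x1) + f2(x1, x2) + f3(x2). Eliminating x2 with a new LP variable u(x1) replaces the constraint over all joint assignments by constraints whose scopes never exceed the factor scopes:

    u(x_1) \;\ge\; f_2(x_1, x_2) + f_3(x_2) \quad \forall\, x_1, x_2,
    \qquad
    0 \;\ge\; f_1(x_1) + u(x_1) \quad \forall\, x_1.

Because f1, f2, f3 are linear in the weights w, these remain linear constraints. With only two variables this saves nothing, but with many variables the constraint count is exponential only in the induced width of the elimination order, not in the total number of state and action variables.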
15
Summary of Algorithm
  1. Pick local basis functions hi
  2. Solve a single LP to compute the local Qi's in the factored MDP
  3. Use the coordination graph to compute the maximizing action

16
Network Management Problem
[Figures: network topologies — Server, Unidirectional Ring, Ring of Rings, Star.]
17
Single Agent Policy Quality
[Plot: Single LP versus Approximate Policy Iteration — discounted reward (0–350) vs. number of machines (10–30), comparing the LP approach with a single basis against Approximate Policy Iteration with max-norm projection [Guestrin et al. '01].]
18
Single Agent Running Time
[Plot: running time comparison against Approximate Policy Iteration with max-norm projection [Guestrin et al. '01].]
19
Multiagent Policy Quality
  • Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]

20
Multiagent Policy Quality
[Plot: policy quality of the Distributed Reward and Distributed Value Function baselines [Schneider et al. '99].]
21
Multiagent Policy Quality
[Plot: policy quality of the LP approach (single-variable and pair bases) versus the Distributed Reward and Distributed Value Function baselines [Schneider et al. '99].]
22
Multiagent Running Time
[Plot: running time on the Ring of Rings topology and on the Star topology with single-variable and pair bases.]
23
Conclusions
  • Multiagent planning algorithm
  • Limited Communication
  • Limited Observability
  • Unified view of function approximation and
    multiagent communication
  • Single LP solution is simple and very efficient
  • Exploit structure to reduce computation costs!
  • Solve very large MDPs efficiently

24
Solve Very Large MDPs
Solved MDPs with:
  • 500 agents
  • over 10^150 actions
  • a state space whose exact size, a number several hundred digits long, was printed in full on the original slide