Title: Multiagent Planning with Factored MDPs
1. Multiagent Planning with Factored MDPs
- Carlos Guestrin, Stanford University
- Daphne Koller, Stanford University
- Ronald Parr, Duke University
2. Multiagent Coordination Examples
- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control
- Multiple, simultaneous decisions
- Limited observability
- Limited communication
3. Network Management Problem
- Administrators must coordinate to maximize the global reward
4. Joint Decision Space
- Represent the problem as an MDP:
- Action space: joint action a = (a1, …, an) over all agents
- State space: joint state x of the entire system
- Reward function: total reward r
- The action space is exponential in the number of agents
- The state space is exponential in the number of state variables
- A global decision requires complete observation
5. Long-term Utilities
- One-step utility: SysAdmin Ai receives a reward if its process completes
- Total utility: sum of rewards over time
- Choosing the optimal action requires long-term planning
- Long-term utility Q(x, a): expected total reward, given current state x and joint action a
- The optimal action at state x is the joint action maximizing Q(x, a)
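For reference, a worked-out version of these definitions in standard MDP notation (γ is the discount factor; the notation is mine, not from the slides):

```latex
Q(\mathbf{x}, \mathbf{a}) \;=\; R(\mathbf{x}, \mathbf{a})
  \;+\; \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})\, V(\mathbf{x}'),
\qquad
\mathbf{a}^*(\mathbf{x}) \;=\; \arg\max_{\mathbf{a}} Q(\mathbf{x}, \mathbf{a})
```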
6. Local Q-function Approximation
Q(A1, …, A4, X1, …, X4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)
- Each Qi is associated with one agent (e.g., Q3 with Agent 3)
- Limited observability: agent i only observes the variables appearing in Qi
- The agents must choose a joint action maximizing ∑i Qi (see the sketch below)
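A minimal sketch of this decomposition for the four-agent ring, assuming binary actions and with the current state fixed (so the X arguments are dropped); all numeric values are invented for illustration:

```python
import itertools

# Hypothetical local Q-functions for the 4-agent ring; each depends only on the
# actions of two neighboring agents (state arguments omitted, values made up).
# Action 0 = do nothing, action 1 = reboot.
def Q1(a1, a4): return 1.0 if a1 == 0 else 0.2
def Q2(a1, a2): return 0.5 * a2 - 0.1 * a1
def Q3(a2, a3): return 0.8 if (a2, a3) == (0, 1) else 0.0
def Q4(a3, a4): return 0.3 * a4 + 0.1 * a3

def total_Q(a1, a2, a3, a4):
    """The global Q-value is approximated by the sum of the local terms."""
    return Q1(a1, a4) + Q2(a1, a2) + Q3(a2, a3) + Q4(a3, a4)

# Naive maximization enumerates all 2^4 joint actions -- exponential in the
# number of agents; the coordination graph on the next slide avoids this.
best = max(itertools.product([0, 1], repeat=4), key=lambda a: total_Q(*a))
print(best, total_Q(*best))
```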
7. Maximizing ∑i Qi: the Coordination Graph
- Use variable elimination for the maximization [Bertele & Brioschi '72]; a sketch of the elimination follows
- Here we need only 23 instead of 63 sum operations
- Limited communication suffices for the optimal action choice
- Communication bandwidth = induced width of the coordination graph
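A sketch of max-sum variable elimination over the coordination graph, reusing the hypothetical local Q-values above as numeric tables (the factor representation and elimination order are my choices, not from the slides):

```python
from itertools import product

def eliminate(factors, agent, actions=(0, 1)):
    """Eliminate `agent` by maximizing it out of every factor that mentions it.
    A factor is (scope, table): scope is a tuple of agent indices, and table maps
    a tuple of their actions (in scope order) to a value."""
    involved = [f for f in factors if agent in f[0]]
    rest = [f for f in factors if agent not in f[0]]
    scope = tuple(sorted({i for s, _ in involved for i in s if i != agent}))
    table = {}
    for assignment in product(actions, repeat=len(scope)):
        ctx = dict(zip(scope, assignment))
        table[assignment] = max(
            sum(t[tuple((ctx | {agent: a})[i] for i in s)] for s, t in involved)
            for a in actions)
    return rest + [(scope, table)]

# The ring Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4), tabulated with the
# same made-up values as in the previous sketch.
factors = [
    ((1, 4), {(a1, a4): (1.0 if a1 == 0 else 0.2) for a1 in (0, 1) for a4 in (0, 1)}),
    ((1, 2), {(a1, a2): 0.5 * a2 - 0.1 * a1 for a1 in (0, 1) for a2 in (0, 1)}),
    ((2, 3), {(a2, a3): (0.8 if (a2, a3) == (0, 1) else 0.0) for a2 in (0, 1) for a3 in (0, 1)}),
    ((3, 4), {(a3, a4): 0.3 * a4 + 0.1 * a3 for a3 in (0, 1) for a4 in (0, 1)}),
]
for agent in (4, 3, 2, 1):          # one elimination order
    factors = eliminate(factors, agent)
print(factors[0][1][()])            # maximal value of sum_i Q_i
```

At each elimination step the new factor's scope stays small (bounded by the induced width of the graph), which is where the savings over brute-force enumeration of joint actions come from.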
8. Where Do the Qi Come From?
- Use function approximation to find the Qi:
Q(X1, …, X4, A1, …, A4) ≈ Q1(A1, A4, X1, X4) + Q2(A1, A2, X1, X2) + Q3(A2, A3, X2, X3) + Q4(A3, A4, X3, X4)
- Long-term planning requires solving a Markov Decision Process with exponentially many states and exponentially many actions
- Efficient approximation is possible by exploiting structure!
9. Dynamic Decision Diagram
- Factored transition model, e.g. P(X1' | X1, X4, A1)
- The diagram represents the state variables, the dynamics, the decisions, and the rewards (a toy sketch of such factored dynamics follows)
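A minimal sketch of the factored dynamics such a diagram encodes, for the ring network; the probabilities and the single-neighbor dependence are illustrative assumptions, not the talk's actual model:

```python
import random

def sample_next_status(x_i, x_neighbor, a_i, rng=random):
    """Next status of machine i depends only on its own status, one neighbor's
    status, and its administrator's action -- i.e. P(Xi' | Xi, Xneighbor, Ai).
    Illustrative numbers: 1 = working, 0 = faulty; action 1 = reboot."""
    if a_i == 1:
        return 1                                   # reboot restores the machine
    p_up = 0.9 if (x_i == 1 and x_neighbor == 1) else (0.5 if x_i == 1 else 0.1)
    return 1 if rng.random() < p_up else 0

def sample_next_state(x, a, rng=random):
    """x and a are lists over the ring; machine i's neighbor is i-1 (mod n)."""
    n = len(x)
    return [sample_next_status(x[i], x[(i - 1) % n], a[i], rng) for i in range(n)]

print(sample_next_state([1, 1, 0, 1], [0, 0, 1, 0]))
```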
10. Long-term Utility = Value of the MDP
- The value function can be computed by linear programming (a flat-LP sketch follows):
- One variable V(x) for each state
- One constraint for each state x and action a
- But the number of states and actions is exponential!
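A sketch of this (flat) LP for a tiny enumerated MDP, using scipy.optimize.linprog; the point of the rest of the talk is that this direct formulation blows up when states and actions are factored. The dynamics and rewards below are made up:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma=0.95):
    """Exact LP for a flat MDP.
    P: array [A, S, S] of transition probabilities, R: array [S, A] of rewards.
    Variables: V(x) for every state.
    Constraints: V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x') for all (x, a);
    minimizing sum_x V(x) subject to them yields the optimal value function."""
    A, S, _ = P.shape
    c = np.ones(S)
    # Each constraint rewritten as (gamma * P[a] - I) @ V <= -R[:, a].
    A_ub = np.vstack([gamma * P[a] - np.eye(S) for a in range(A)])
    b_ub = np.concatenate([-R[:, a] for a in range(A)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x

# Tiny 2-state, 2-action example with made-up dynamics.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(solve_mdp_lp(P, R))
```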
11. Decomposable Value Functions
- Linear combination of restricted-domain basis functions (written out below) [Bellman et al. '63; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]
- Each hi depends only on the status of a small part of the complex system, e.g.:
- the status of a machine and its neighbors
- the load on a machine
- Must find weights w giving a good approximate value function
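The linear combination referred to here is the standard factored value-function form (notation mine):

```latex
V(\mathbf{x}) \;\approx\; V_{\mathbf{w}}(\mathbf{x}) \;=\; \sum_{i} w_i \, h_i(\mathbf{x}[C_i]),
\quad \text{where each } C_i \text{ is a small subset of the state variables.}
```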
12. Single LP Solution for Factored MDPs [Schweitzer & Seidmann '85]
- One variable wi for each basis function → polynomially many LP variables
- One constraint for every state and action → exponentially many LP constraints
- But hi and Qi depend only on small sets of variables and actions → structure to exploit (see the sketch below)
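A sketch of the Schweitzer–Seidmann approximate LP with the states still enumerated explicitly, showing the polynomially many weight variables but also the exponentially many constraints that the factored construction on the next slides removes; the basis matrix and dynamics are made up:

```python
import numpy as np
from scipy.optimize import linprog

def approx_lp_weights(P, R, H, gamma=0.95):
    """Find weights w for V_w(x) = sum_i w_i * h_i(x).
    P: [A, S, S] transitions, R: [S, A] rewards, H: [S, K] basis values.
    Variables: one w_i per basis function (polynomially many).
    Constraints: V_w(x) >= R(x,a) + gamma * E[V_w(x') | x, a] for every (x, a)
    -- one per state-action pair, i.e. exponentially many in a factored MDP."""
    A, S, _ = P.shape
    K = H.shape[1]
    c = H.sum(axis=0)                                 # minimize sum_x V_w(x)
    A_ub = np.vstack([gamma * P[a] @ H - H for a in range(A)])
    b_ub = np.concatenate([-R[:, a] for a in range(A)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * K)
    return res.x

# Tiny example: 2 states, 2 actions, basis = [constant, indicator of state 0].
# Including a constant basis function keeps the LP feasible.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
H = np.array([[1.0, 1.0],
              [1.0, 0.0]])
print(approx_lp_weights(P, R, H))
```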
13. Representing Exponentially Many Constraints [Guestrin et al. '01]
- Replace the exponentially many linear constraints with one equivalent nonlinear constraint (spelled out below)
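Spelled out, the equivalence this slide relies on is (my notation): the exponentially many linear constraints of the approximate LP hold for every state and action iff a single max-constraint holds:

```latex
\forall \mathbf{x}, \mathbf{a}: \;
\sum_i w_i h_i(\mathbf{x}) \;\ge\; R(\mathbf{x},\mathbf{a})
  + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}'\mid\mathbf{x},\mathbf{a}) \sum_i w_i h_i(\mathbf{x}')
\;\;\Longleftrightarrow\;\;
0 \;\ge\; \max_{\mathbf{x},\mathbf{a}} \Big[ R(\mathbf{x},\mathbf{a})
  + \gamma \,\mathbb{E}\big[V_{\mathbf{w}}(\mathbf{x}') \mid \mathbf{x},\mathbf{a}\big]
  - V_{\mathbf{w}}(\mathbf{x}) \Big]
```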
14. Representing the Constraints
- The functions are factored, so use variable elimination to represent the constraints compactly (a sketch of one elimination step follows)
- The number of constraints becomes exponentially smaller
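Roughly (following Guestrin et al. '01), this works by mirroring variable elimination inside the LP itself. The nonlinear constraint has the form 0 ≥ max_x ∑_j F_j(x), where each F_j has a small scope (and coefficients linear in w). Eliminating a variable X_l introduces new LP variables u_z, one per assignment z of the neighboring scope, with linear constraints:

```latex
u_{\mathbf{z}} \;\ge\; \sum_{j \,:\, X_l \in \mathrm{Scope}(F_j)} F_j(\mathbf{z}, x_l)
\qquad \text{for every assignment } \mathbf{z} \text{ and every value } x_l .
```

At the LP optimum each u_z equals the maximum over x_l of that sum, so the new variable plays the role of the eliminated factor; once all variables are eliminated, a single constraint 0 ≥ (sum of the remaining scalars) is left. The number of constraints is exponential only in the induced width of the elimination order, not in the total number of variables.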
15. Summary of the Algorithm
- Pick local basis functions hi
- Solve a single LP to compute the local Qi in the factored MDP
- Use the coordination graph to compute the maximizing joint action
16. Network Management Problem (experimental topologies)
- Server
- Unidirectional ring
- Ring of rings
- Star
17. Single Agent Policy Quality
- Single LP versus Approximate Policy Iteration
[Plot: discounted reward vs. number of machines (10-30) for the LP with single basis; PI = Approximate Policy Iteration with max-norm projection, Guestrin et al. '01]
18. Single Agent Running Time
[Plot: running time vs. number of machines; PI = Approximate Policy Iteration with max-norm projection, Guestrin et al. '01]
19-21. Multiagent Policy Quality
- Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]
[Plot, built up over three slides: LP with pair basis, LP with single basis, distributed reward, and distributed value function]
22. Multiagent Running Time
[Plot: running time for the ring-of-rings topology and for the star topology with single and pair basis functions]
23. Conclusions
- A multiagent planning algorithm with limited communication and limited observability
- A unified view of function approximation and multiagent communication
- The single-LP solution is simple and very efficient
- Exploiting structure reduces computation costs!
- Very large MDPs can be solved efficiently
24. Solve Very Large MDPs
- Solved MDPs with 500 agents,
- over 10^150 actions, and
- approximately 1.32 x 10^477 states (the slide displays the full digit string of this state count)
25. Conclusions (repeat of slide 23)