Multiagent Planning with Factored MDPs

1
Multiagent Planning with Factored MDPs
  • Carlos Guestrin
  • Daphne Koller
  • Stanford University
  • Ronald Parr
  • Duke University

2
Multiagent Coordination Examples
  • Search and rescue
  • Factory management
  • Supply chain
  • Firefighting
  • Network routing
  • Air traffic control
  • Multiple, simultaneous decisions
  • Limited observability
  • Limited communication

3
Network Management Problem
  • Administrators must coordinate to maximize global reward

4
Joint Decision Space
  • Represent as an MDP
  • Action space: joint action a = (a1, …, an) over all agents
  • State space: joint state x of the entire system
  • Reward function: total reward r
  • Action space is exponential in the number of agents
  • State space is exponential in the number of variables
  • Global decision requires complete observation
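A tiny Python illustration of the blow-up (hypothetical SysAdmin-style actions, not from the talk): enumerating the joint action space with itertools.product grows as |A|^n in the number of agents n.

from itertools import product

per_agent_actions = ["noop", "reboot"]   # hypothetical choices for one SysAdmin agent
n_agents = 4

# The joint action space is every combination of individual choices: 2**n_agents tuples.
joint_actions = list(product(per_agent_actions, repeat=n_agents))
print(len(joint_actions))   # 16 for 4 agents; for 500 agents, 2**500 is astronomically large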

5
Long-term Utilities
  • One-step utility:
  • SysAdmin Ai receives a reward if its process completes
  • Total utility: sum of rewards
  • Optimal action requires long-term planning
  • Long-term utility Q(x,a):
  • Expected total reward, given current state x and action a
  • Optimal action at state x is the one maximizing Q(x,a)
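In symbols (a reconstruction of the formulas that did not survive the transcript; γ is the discount factor):

    Q(\mathbf{x},\mathbf{a}) \;=\; \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_t \;\Big|\; \mathbf{x}_0=\mathbf{x},\ \mathbf{a}_0=\mathbf{a}\Big],
    \qquad
    \mathbf{a}^{*}(\mathbf{x}) \;=\; \arg\max_{\mathbf{a}}\, Q(\mathbf{x},\mathbf{a}).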

6
Local Q function Approximation
Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
(Q3 is the term associated with Agent 3.)
Limited observability: agent i only observes the variables in Qi
Must choose the joint action maximizing ∑i Qi
7
Maximizing ∑i Qi: Coordination Graph
  • Use variable elimination for the maximization
    [Bertele & Brioschi '72] (a small code sketch follows below)

Here we need only 23, instead of 63, sum operations.
  • Limited communication suffices for the optimal action choice
  • Communication bandwidth = induced width of the coordination graph
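Below is a minimal Python sketch of this elimination step, using made-up local Q values and binary actions (my own illustration, not the authors' code): each agent is maxed out in turn, and the optimal joint action is recovered by walking the elimination order backwards.

from itertools import product

ACTIONS = [0, 1]  # each agent chooses one of two actions

# Each factor: (scope, table) where table maps a tuple of actions (in scope order) to a value.
def make_factor(scope, fn):
    return (scope, {assign: fn(*assign) for assign in product(ACTIONS, repeat=len(scope))})

# Hypothetical local Q functions on the four edges of the coordination graph
# (state variables Xi are assumed already instantiated to their observed values).
factors = [
    make_factor(("A1", "A4"), lambda a1, a4: 2.0 * (a1 == a4)),
    make_factor(("A1", "A2"), lambda a1, a2: 1.0 * a1 + 0.5 * a2),
    make_factor(("A2", "A3"), lambda a2, a3: 1.5 * (a2 != a3)),
    make_factor(("A3", "A4"), lambda a3, a4: 0.7 * a3 * a4),
]

def eliminate(factors, order):
    """Max out agents one at a time; return the optimal value and joint action."""
    argmax_tables = []                       # per eliminated agent, its best response
    for agent in order:
        involved = [f for f in factors if agent in f[0]]
        rest = [f for f in factors if agent not in f[0]]
        # New scope: every other variable mentioned by the involved factors.
        new_scope = tuple(sorted({v for s, _ in involved for v in s if v != agent}))
        new_table, best_action = {}, {}
        for assign in product(ACTIONS, repeat=len(new_scope)):
            ctx = dict(zip(new_scope, assign))
            scores = []
            for a in ACTIONS:
                ctx[agent] = a
                scores.append((sum(t[tuple(ctx[v] for v in s)] for s, t in involved), a))
            new_table[assign], best_action[assign] = max(scores)
        factors = rest + [(new_scope, new_table)]
        argmax_tables.append((agent, new_scope, best_action))
    value = sum(t[()] for _, t in factors)   # all scopes are empty by now
    # Decode the maximizing joint action by walking the elimination order backwards.
    joint = {}
    for agent, scope, best in reversed(argmax_tables):
        joint[agent] = best[tuple(joint[v] for v in scope)]
    return value, joint

print(eliminate(factors, order=["A4", "A3", "A2", "A1"]))

Each elimination is exponential only in the number of variables appearing together in a factor, which is why communication and computation scale with the induced width of the coordination graph rather than with the number of agents.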

8
Where do the Qi come from?
  • Use function approximation to find the Qi
  • Q(X1,…,X4, A1,…,A4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
  • Long-term planning requires a Markov Decision Process
  • Number of states is exponential
  • Number of actions is exponential
  • Efficient approximation by exploiting structure!

9
Dynamic Decision Diagram
P(X1′ | X1, X4, A1)
  • State
  • Dynamics
  • Decisions
  • Rewards
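A small Python sketch of what such a factored transition model looks like (status encoding, action encoding, and probabilities are hypothetical, not from the talk): each variable's next value depends only on a few parent variables and one agent's action, so the joint transition never has to be enumerated.

# Factored dynamics: X1' depends only on its parents (X1, X4) and agent 1's action A1.
# Encoding (made up for illustration): machine status 0 = down, 1 = up; action 0 = noop, 1 = reboot.
def p_x1_next(x1_next, x1, x4, a1):
    if a1 == 1:                 # rebooting brings the machine up with high probability
        p_up = 0.95
    elif x4 == 0:               # a failed neighbor makes staying up less likely
        p_up = 0.5 * x1
    else:                       # healthy neighbor: mostly keep the current status
        p_up = 0.9 * x1
    return p_up if x1_next == 1 else 1.0 - p_up

print(p_x1_next(1, x1=0, x4=1, a1=1))   # 0.95

# The full joint transition factors into a product of such local models:
# P(x' | x, a) = prod_i P(Xi' | parents(Xi'), Ai)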

10
Long-term Utility: Value of the MDP
  • Value computed by linear programming
  • One variable V(x) for each state
  • One constraint for each state x and action a
  • Number of states and actions is exponential!
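For reference, this is the standard exact LP for an MDP (state-relevance weights α(x) > 0, discount factor γ; notation mine, consistent with the slides):

    \min_{V} \;\; \sum_{\mathbf{x}} \alpha(\mathbf{x})\, V(\mathbf{x})
    \quad \text{s.t.} \quad
    V(\mathbf{x}) \;\ge\; R(\mathbf{x},\mathbf{a}) + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a})\, V(\mathbf{x}')
    \qquad \forall\, \mathbf{x}, \mathbf{a}.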

11
Decomposable Value Functions
Linear combination of restricted-domain functions:
V(x) ≈ ∑i wi hi(x)
[Bellman et al. '63; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]
  • Each hi depends only on a small part of the complex system, e.g.:
  • Status of a machine and its neighbors
  • Load on a machine
  • Must find weights w giving a good approximate value function
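A minimal Python sketch of such basis functions (my own illustration of the idea, not the talk's exact features): one indicator feature per machine plus a bias, so every hi looks only at a tiny piece of the joint state.

# Hypothetical indicator basis functions for the SysAdmin problem.
def make_basis(n_machines):
    basis = [lambda x: 1.0]                                   # h0: constant bias feature
    for i in range(n_machines):
        basis.append(lambda x, i=i: 1.0 if x[i] == "up" else 0.0)   # hi: "machine i is up"
    return basis

def value(x, weights, basis):
    # V(x) ~ sum_i w_i * h_i(x): each h_i depends on a small part of the state.
    return sum(w * h(x) for w, h in zip(weights, basis))

x = ("up", "down", "up")                                      # status of 3 machines
print(value(x, weights=[0.5, 2.0, 2.0, 2.0], basis=make_basis(3)))  # 0.5 + 2 + 0 + 2 = 4.5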

12
Single LP Solution for Factored MDPs
[Schweitzer & Seidmann '85]
  • One variable wi for each basis function ⇒ polynomially many LP variables
  • One constraint for every state and action ⇒ exponentially many LP constraints
  • hi, Qi depend only on small sets of variables/actions ⇒ the constraints can be represented compactly (next slides)
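Substituting the linear value function V(x) ≈ ∑i wi hi(x) into the exact LP gives the approximate LP the slide refers to (a reconstruction in standard notation):

    \min_{w} \;\; \sum_{\mathbf{x}} \alpha(\mathbf{x}) \sum_{i} w_i\, h_i(\mathbf{x})
    \quad \text{s.t.} \quad
    \sum_{i} w_i\, h_i(\mathbf{x}) \;\ge\; R(\mathbf{x},\mathbf{a}) + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a}) \sum_{i} w_i\, h_i(\mathbf{x}')
    \qquad \forall\, \mathbf{x}, \mathbf{a}.

The LP now has only the weights wi as variables, but still one constraint per state-action pair, which the next two slides address.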

13
Representing Exponentially Many Constraints
[Guestrin et al. '01]
Exponentially many linear constraints ⇒ one equivalent nonlinear constraint
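Equivalently (reconstructing the formula from the surrounding slides), all of those constraints can be folded into the single nonlinear constraint:

    0 \;\ge\; \max_{\mathbf{x},\mathbf{a}} \Big[ R(\mathbf{x},\mathbf{a}) + \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a}) \sum_i w_i\, h_i(\mathbf{x}') \;-\; \sum_i w_i\, h_i(\mathbf{x}) \Big].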
14
Representing the Constraints
  • The functions are factored, so use variable elimination to represent the constraints compactly (a small worked example follows below)

The number of constraints becomes exponentially smaller.
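As a tiny illustration (my own example with made-up factors f1, f2, f3, not from the talk): suppose the bracketed expression above factors as f1(x1) + f2(x1, x2) + f3(x2). Eliminating x2 with a new LP variable u(x1) replaces the constraint over all joint assignments by constraints whose scopes never exceed the factor scopes:

    u(x_1) \;\ge\; f_2(x_1, x_2) + f_3(x_2) \quad \forall\, x_1, x_2,
    \qquad
    0 \;\ge\; f_1(x_1) + u(x_1) \quad \forall\, x_1.

Because f1, f2, f3 are linear in the weights w, these remain linear constraints. With only two variables this saves nothing, but with many variables the constraint count is exponential only in the induced width of the elimination order, not in the total number of state and action variables.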
15
Summary of Algorithm
  1. Pick local basis functions hi
  2. Solve a single LP to compute the local Qi's in the factored MDP
  3. Use the coordination graph to compute the maximizing action

16
Network Management Problem
[Figures: network topologies — Server, Unidirectional Ring, Ring of Rings, Star.]
17
Single Agent Policy Quality
[Plot: Single LP versus Approximate Policy Iteration — discounted reward (0–350) vs. number of machines (10–30), comparing the LP approach with a single basis against Approximate Policy Iteration with max-norm projection [Guestrin et al. '01].]
18
Single Agent Running Time
[Plot: running time comparison against Approximate Policy Iteration with max-norm projection [Guestrin et al. '01].]
19
Multiagent Policy Quality
  • Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]

20
Multiagent Policy Quality
[Plot: policy quality of the Distributed Reward and Distributed Value Function baselines [Schneider et al. '99].]
21
Multiagent Policy Quality
[Plot: policy quality of the LP approach (single-variable and pair bases) versus the Distributed Reward and Distributed Value Function baselines [Schneider et al. '99].]
22
Multiagent Running Time
[Plot: running time on the Ring of Rings topology and on the Star topology with single-variable and pair bases.]
23
Conclusions
  • Multiagent planning algorithm
  • Limited Communication
  • Limited Observability
  • Unified view of function approximation and
    multiagent communication
  • Single LP solution is simple and very efficient
  • Exploit structure to reduce computation costs!
  • Solve very large MDPs efficiently

24
Solve Very Large MDPs
Solved MDPs with:
  • 500 agents
  • over 10^150 actions
  • a state space whose exact size, a number several hundred digits long, was printed in full on the original slide