Networked Distributed POMDPs: DCOP-Inspired Distributed POMDPs


  • Ranjit Nair, Honeywell Labs
  • Pradeep Varakantham, USC
  • Milind Tambe, USC
  • Makoto Yokoo, Kyushu University

Background: DPOMDP
  • Distributed Partially Observable Markov Decision
    Problems (DPOMDPs): a decision-theoretic approach
  • Performance linked to optimality of decision
  • Explicitly reasons about positive and negative rewards and
    uncertainty
  • Current methods use centralized planning and
    distributed execution
  • The complexity of finding the optimal policy is
    NEXP-complete
  • In many domains, not all agents can interact or
    affect each other
  • Most current DPOMDP algorithms do not exploit
    locality of interaction

Example domains: disaster rescue simulations, distributed sensors, battlefield simulations

Background: DCOP
  • Distributed Constraint Optimization Problem
  • Constraint Graph (V,E)
  • Vertices are agents' variables (x1, ..., x4), each
    with a domain (d1, ..., d4)
  • Edges represent rewards
  • DCOP algorithms exploit locality of interaction
  • DCOP algorithms do not reason about uncertainty
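To make the DCOP formulation concrete, here is a minimal Python sketch that brute-forces a tiny constraint graph. The four binary variables and the edge reward functions are invented for illustration and do not come from the presentation. Real DCOP algorithms distribute this search among the agents instead of enumerating assignments centrally.

```python
from itertools import product

# Hypothetical domains for variables x1..x4 (illustrative values only).
domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1], "x4": [0, 1]}

def agree(a, b):
    """Assumed edge reward: pays 2 when both endpoints take the same value."""
    return 2 if a == b else 0

def differ(a, b):
    """Assumed edge reward: pays 1 when the endpoints take different values."""
    return 1 if a != b else 0

# Edges of the constraint graph; each edge carries a reward function.
edges = {("x1", "x2"): agree, ("x2", "x3"): differ, ("x3", "x4"): agree}

def total_reward(assignment):
    """Sum the rewards over all edges for one complete assignment."""
    return sum(f(assignment[u], assignment[v]) for (u, v), f in edges.items())

def brute_force_dcop():
    """Enumerate every assignment and return the best one with its value."""
    names = list(domains)
    candidates = (
        dict(zip(names, values))
        for values in product(*(domains[n] for n in names))
    )
    best = max(candidates, key=total_reward)
    return best, total_reward(best)

best_assignment, best_value = brute_force_dcop()
```

Each edge reward depends only on its two endpoints, which is the locality that DCOP algorithms exploit.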

Key ideas and contributions
  • Key ideas
  • Exploit locality of interaction to enable
  • Hybrid DCOP-DPOMDP approach to collaboratively
    find joint policy
  • Distributed offline planning and distributed
    execution
  • Key contributions
  • Distributed POMDP model that captures locality of
    interaction
  • Locally Interacting Distributed Joint
    Equilibrium-based Search for Policies (LID-JESP)
  • Hill climbing, like the Distributed Breakout Algorithm
  • Distributed Parallel Algorithm for Finding
    Locally Optimal Joint Policy
  • Globally Optimal Algorithm (GOA)
  • Variable Elimination

  • Sensor net domain
  • Networked Distributed POMDPs (ND-POMDPs)
  • Locally interacting distributed joint
    equilibrium-based search for policies (LID-JESP)
  • Globally optimal algorithm
  • Experiments
  • Conclusions and Future Work

Example Domain
  • Two independent targets
  • Each changes position based on its stochastic
    transition function
  • Sensing agents cannot affect each other or
    the targets' positions
  • False positives and false negatives are possible
    when observing targets
  • Reward obtained if two agents track a target
    correctly together
  • Cost for leaving sensor on

Networked Distributed POMDP
  • ND-POMDP for a set of n agents Ag: $\langle S, A, P, \Omega, O, R, b \rangle$
  • World state $s \in S$, where $S = S_1 \times \cdots \times S_n \times S_u$
  • Each agent $i \in Ag$ has local state $s_i \in S_i$
  • E.g. Is sensor on or off?
  • $S_u$ is the part of the state that no agent can
    affect
  • E.g. Location of the two targets
  • b is the initial belief state, a probability
    distribution over S
  • $b = b_1 \cdot \ldots \cdot b_n \cdot b_u$
  • $A = A_1 \times \cdots \times A_n$, where $A_i$ is the set of actions for
    agent i
  • E.g. Scan East, Scan West, Turn Off
  • No communication during execution
  • Agents communicate during planning

  • Transition independence: agent i's local state
    cannot be affected by other agents
  • $P_i : S_i \times S_u \times A_i \times S_i \to [0,1]$
  • $P_u : S_u \times S_u \to [0,1]$
  • $\Omega = \Omega_1 \times \cdots \times \Omega_n$, where $\Omega_i$ is the set of observations
    for agent i
  • E.g. Target present in sector
  • Observation independence: agent i's observations
    are not dependent on others
  • $O_i : S_i \times S_u \times A_i \times \Omega_i \to [0,1]$
  • Reward function R is decomposable
  • $R(s,a) = \sum_l R_l(s_{l1}, \ldots, s_{lk}, s_u, a_{l1}, \ldots, a_{lk})$
  • $l \subseteq Ag$, and $k = |l|$
  • Goal: to find a joint policy $\pi = \langle \pi_1, \ldots, \pi_n \rangle$,
    where $\pi_i$ is the local policy of agent i, such
    that $\pi$ maximizes the expected joint reward over
    finite horizon T
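The decomposable reward can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the presenters' implementation: the three-agent chain, the sector names "A" and "B", and the constants `SCAN_COST` and `TRACK_REWARD` are invented, and the local states of the agents are omitted so that each component depends only on the unaffectable state and the actions of the agents on that link.

```python
# Illustrative constants; the presentation gives no numeric values.
SCAN_COST = -1.0
TRACK_REWARD = 10.0

def scan_cost(s_u, actions):
    """R_l for a single-agent link: cost for leaving the sensor on."""
    (action,) = actions
    return 0.0 if action == "off" else SCAN_COST

def track_reward(sector):
    """Build R_l for a two-agent link whose agents share `sector`."""
    def r_l(s_u, actions):
        # Assumed geometry: the first agent must scan east and the
        # second must scan west to cover the shared sector together.
        hit = s_u[sector] and actions == ("east", "west")
        return TRACK_REWARD if hit else 0.0
    return r_l

# One reward component per link l (a tuple of agent ids).
links = {
    (1,): scan_cost,
    (2,): scan_cost,
    (3,): scan_cost,
    (1, 2): track_reward("A"),
    (2, 3): track_reward("B"),
}

def joint_reward(links, s_u, joint_action):
    """R(s, a) as the sum over links l of R_l(s_u, a_l)."""
    return sum(
        r_l(s_u, tuple(joint_action[i] for i in l))
        for l, r_l in links.items()
    )

s_u = {"A": True, "B": False}  # a target is in sector A only
joint_action = {1: "east", 2: "west", 3: "off"}
reward = joint_reward(links, s_u, joint_action)
```

Agent 3 appears in no link with agent 1, so nothing agent 3 does changes the components that involve agent 1.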

  • Inter-agent interactions captured by an
    interaction hypergraph (Ag, E)
  • Each agent is a node
  • Set of hyperedges $E = \{\, l \mid l \subseteq Ag \text{ and } R_l \text{ is a component of } R \,\}$
  • Neighborhood of agent i: the set of i's neighbors
  • $N_i = \{\, j \in Ag \mid j \neq i,\ \exists l \in E,\ i \in l \text{ and } j \in l \,\}$
  • Agents are solving a DCOP where
  • Constraint graph is the interaction hypergraph
  • Variable at each node is the local policy of that
    agent
  • Optimize expected joint reward
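The neighborhood definition translates directly into code. The sketch below assumes hyperedges are given as Python sets of agent ids; the four-agent chain is a made-up example.

```python
def neighborhoods(agents, hyperedges):
    """Return N_i for every agent: all j != i that share a hyperedge with i."""
    return {
        i: {j for l in hyperedges if i in l for j in l if j != i}
        for i in agents
    }

# Hypothetical interaction hypergraph: a chain 1 - 2 - 3 - 4 with
# one single-agent hyperedge per agent (e.g. a scanning cost).
agents = [1, 2, 3, 4]
hyperedges = [{1}, {2}, {3}, {4}, {1, 2}, {2, 3}, {3, 4}]

N = neighborhoods(agents, hyperedges)
```

Single-agent hyperedges contribute no neighbors, so only the pairwise links shape the neighborhoods here.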

$R_1$: Ag1's cost for scanning; $R_{12}$: reward for Ag1 and Ag2 tracking a target

ND-POMDP theorems
  • Theorem 1: For an ND-POMDP, the expected reward for a
    policy $\pi$ is the sum of the expected rewards for each
    of the links for policy $\pi$
  • Global value function is decomposable into value
    functions for each link
  • Local neighborhood utility $V_\pi[N_i]$: expected
    reward obtained from all links involving agent i
    for executing policy $\pi$
  • Theorem 2 (locality of interaction): For policies
    $\pi$ and $\pi'$, if $\pi_i = \pi'_i$ and $\pi_{N_i} = \pi'_{N_i}$, then $V_\pi[N_i] = V_{\pi'}[N_i]$
  • Given its neighbors' policies, the local neighborhood
    utility of agent i does not depend on any
    non-neighbor's policy
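In symbols, the two theorems can be written as follows. Here $\pi$ is the joint policy, and the per-link value $V_\pi^{l}$ is notation assumed for this sketch rather than taken from the slides.

```latex
V_\pi = \sum_{l \in E} V_\pi^{l},
\qquad
V_\pi[N_i] = \sum_{l \in E,\; i \in l} V_\pi^{l}

% If \pi and \pi' differ only in the policy of agent i, every link that
% does not contain i keeps its value, so the global change equals the
% local change:
V_{\pi'} - V_\pi = V_{\pi'}[N_i] - V_\pi[N_i]
```

The last line is why an agent can measure the benefit of changing its own policy using only its neighbors' policies.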

  • LID-JESP Algorithm (based on the Distributed Breakout
    Algorithm)
  • 1. Choose local policy randomly
  • 2. Communicate local policy to neighbors
  • 3. Compute local neighborhood utility of current
    policy with respect to neighbors' policies
  • 4. Compute local neighborhood utility of best
    response policy with respect to neighbors' policies (GetValue)
  • 5. Communicate the gain (step 4 minus step 3) to neighbors
  • 6. If gain is greater than the gains of all neighbors
  •      Change local policy to best response policy
  •      Communicate changed policy to neighbors
  • Else
  • 7. If termination has not been reached, go to step 3
  • Theorem 3: Global utility is strictly increasing
    with each iteration until a local optimum is
    reached
Termination Detection
  • Each agent maintains a termination counter
  • Reset to zero if gain > 0, else increment by 1
  • Exchange counter with neighbors
  • Set counter to min of own counter and neighbors'
    counters
  • Termination detected if counter = d (diameter of
    the graph)
  • Theorem 4: LID-JESP will terminate within d
    cycles of reaching a local optimum
  • Theorem 5: If LID-JESP terminates, agents are in
    a local optimum
  • From Theorems 3-5, LID-JESP will terminate in a
    local optimum within d cycles
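The counter scheme can be simulated in a few lines. The sketch below is a synchronous approximation under assumptions: the function name, the `gain_schedule` format (cycle number mapped to per-agent gains, with missing entries meaning zero gain), and the four-agent chain are all invented for illustration.

```python
def cycles_until_termination(neighbors, gain_schedule, d, max_cycles=1000):
    """Return the first cycle in which some agent's counter reaches d."""
    counters = {i: 0 for i in neighbors}
    for cycle in range(1, max_cycles + 1):
        gains = gain_schedule.get(cycle, {})
        # Reset to zero on positive gain, otherwise increment by one.
        own = {
            i: 0 if gains.get(i, 0) > 0 else counters[i] + 1
            for i in neighbors
        }
        # Exchange counters and keep the minimum over self and neighbors.
        counters = {
            i: min([own[i]] + [own[j] for j in neighbors[i]])
            for i in neighbors
        }
        if any(c >= d for c in counters.values()):
            return cycle
    return None  # not detected within max_cycles

# Hypothetical chain 1 - 2 - 3 - 4, whose diameter is 3.
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
# Agent 4 still improves in cycle 1; nobody improves afterwards.
detected_at = cycles_until_termination(chain, {1: {4: 1.0}}, d=3)
```

In this run the last positive gain occurs in cycle 1 and termination is detected in cycle 4, that is, d = 3 cycles later. Taking the minimum lets a single reset spread one hop per cycle, so no agent stops while a distant agent is still improving.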

Computing best response policy
  • Given its neighbors' fixed policies, each agent
    faces a single-agent POMDP
  • State is
  • Note state is not fully observable
  • Transition function
  • Observation function
  • Reward function
  • Best response computed using Bellman backup

Global Optimal Algorithm (GOA)
  • Similar to variable elimination
  • Relies on a tree structured interaction graph
  • Cycle cutset algorithm to eliminate cycles
  • Assumes only binary interactions
  • Phase 1: Values are propagated upwards from
    leaves to root
  • For each policy, sum up the values of its children's
    optimal responses
  • Compute the value of the optimal response to each of the
    parent's policies
  • Communicate these values to parent
  • Phase 2: Policies are propagated downwards from
    root to leaves.
  • Agent chooses the policy corresponding to the optimal
    response to its parent's policy
  • Communicates its policy to child
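The two phases can be sketched as a recursive dynamic program over a rooted tree. This is a centralized illustration under assumptions, not the distributed algorithm itself: `link_value` is a hypothetical function standing in for the expected reward of a parent-child link under two fixed policies, policies are plain integers, and message passing is replaced by function calls.

```python
def goa_tree(root, children, policy_space, link_value):
    """Return (joint policy, value) that is optimal for a tree of binary links."""
    best_value = {}   # (agent, parent policy) -> value of the optimal response
    best_choice = {}  # (agent, parent policy) -> the optimal response itself

    def upward(i, parent):
        # Phase 1: children report first, so values flow from leaves to root.
        for c in children.get(i, []):
            upward(c, i)
        parent_policies = policy_space[parent] if parent is not None else [None]
        for pj in parent_policies:
            def score(pi):
                link = link_value(parent, i, pj, pi) if parent is not None else 0
                below = sum(best_value[(c, pi)] for c in children.get(i, []))
                return link + below
            choice = max(policy_space[i], key=score)
            best_choice[(i, pj)] = choice
            best_value[(i, pj)] = score(choice)

    def downward(i, pj, joint):
        # Phase 2: each agent picks its optimal response to its parent's policy.
        joint[i] = best_choice[(i, pj)]
        for c in children.get(i, []):
            downward(c, joint[i], joint)
        return joint

    upward(root, None)
    return downward(root, None, {}), best_value[(root, None)]

# Hypothetical tree: 1 is the root, 2 and 3 are its children, 4 hangs off 3.
children = {1: [2, 3], 3: [4]}
policy_space = {i: [0, 1, 2] for i in (1, 2, 3, 4)}

def link_value(parent, child, p, q):
    """Made-up link reward table, used only to exercise the sketch."""
    return (parent * 3 + child * 5 + p * 2 + q * 7 + p * q) % 11

joint, value = goa_tree(1, children, policy_space, link_value)
```

Each agent only tabulates values against its parent's policies, so the work grows with the number of agents rather than with the size of the joint policy space.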

  • Compared to
  • LID-JESP-no-nw: ignores interaction graph
  • JESP: centralized solver (Nair 2003)
  • 3 agent chain
  • LID-JESP exponentially faster than GOA
  • 4 agent chain
  • LID-JESP is faster than JESP and LID-JESP-no-nw
  • LID-JESP exponentially faster than GOA

  • 5 agent chain
  • LID-JESP is much faster than JESP and
    LID-JESP-no-nw
  • Values
  • LID-JESP values are comparable to GOA
  • Random restarts can be used to find the global
    optimum

  • Reasons for speedup
  • C: No. of cycles
  • G: No. of GetValue calls
  • W: No. of agents that change their policies in a
    cycle
  • LID-JESP converges in fewer cycles (column C)
  • LID-JESP allows multiple agents to change their
    policies in a single cycle (column W)
  • JESP has fewer GetValue calls than LID-JESP
  • But each such call was slower

  • Complexity of best response
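  • JESP: $O(|S|^2 \cdot |A_i| \cdot \prod_j |\Omega_j|^T)$
  • depends on entire world state
  • depends on observation histories of all agents
  • LID-JESP: $O(|S_u \times S_i \times S_{N_i}|^2 \cdot |A_i| \cdot \prod_{j \in N_i} |\Omega_j|^T)$
  • depends on observation histories of only
    neighbors
  • depends only on $S_u$, $S_i$ and $S_{N_i}$
  • Increasing number of agents does not affect
    complexity
  • Fixed number of neighbors
  • Complexity of GOA
  • Brute force global optimal: $O(\prod_j |\pi_j| \cdot |S|^2 \cdot \prod_j |\Omega_j|^T)$
  • GOA: $O(n \cdot |\pi_j| \cdot |S_u \times S_i \times S_j|^2 \cdot |A_i| \cdot |\Omega_i|^T \cdot |\Omega_j|^T)$
  • Increasing number of agents will cause a linear
    increase in run time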
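A quick numeric comparison shows how the two best-response bounds diverge. The sketch evaluates the expressions as read from the slide, assuming every agent has the same number of local states, actions, and observations, and that the product for JESP runs over all agents. The function names and the sizes used are invented for illustration.

```python
def jesp_best_response_cost(n_agents, n_local, n_unaffectable, n_actions, n_obs, horizon):
    """|S|^2 * |A_i| * prod_j |Omega_j|^T, with the product over all agents."""
    world_states = n_unaffectable * n_local ** n_agents
    return world_states ** 2 * n_actions * n_obs ** (horizon * n_agents)

def lid_jesp_best_response_cost(n_neighbors, n_local, n_unaffectable, n_actions, n_obs, horizon):
    """|S_u x S_i x S_Ni|^2 * |A_i| * prod over neighbors of |Omega_j|^T."""
    local_view = n_unaffectable * n_local * n_local ** n_neighbors
    return local_view ** 2 * n_actions * n_obs ** (horizon * n_neighbors)

# Assumed sizes: 2 local states, 4 unaffectable states, 3 actions,
# 2 observations, horizon 3.
sizes = dict(n_local=2, n_unaffectable=4, n_actions=3, n_obs=2, horizon=3)

jesp_3_agents = jesp_best_response_cost(3, **sizes)
jesp_5_agents = jesp_best_response_cost(5, **sizes)
lid_2_neighbors = lid_jesp_best_response_cost(2, **sizes)
```

The LID-JESP term takes the number of neighbors, not the number of agents, as input, so adding agents to a chain leaves it unchanged while the JESP term grows exponentially.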

  • DCOP algorithms are applied to finding solutions
    to Distributed POMDPs
  • Exploiting locality of interaction reduces run
    time
  • LID-JESP based on DBA
  • Agents converge to locally optimal joint policy
  • GOA based on variable elimination
  • First distributed parallel algorithms for
    Distributed POMDPs
  • Exploiting locality of interaction reduces run
    time
  • Complexity increases linearly with increased
    number of agents
  • Fixed number of neighbors

Future Work
  • How can communication be incorporated?
  • Will introducing communication cause agents to
    lose locality of interaction?
  • Remove assumption of transition independence
  • May cause all agents to be dependent on each
    other
  • Other globally optimal algorithms
  • Increased parallelism

Backup slides
Global Optimal
  • Consider only binary constraints. Can be applied
    to n-ary constraints
  • Run distributed cycle cutset algorithm in case
    graph is not a tree
  • Algorithm
  • Convert graph into trees and a cycle cutset C
  • For each possible joint policy $\pi_C$ of agents in C
  •      $Val[\pi_C] \leftarrow 0$
  •      For each tree of agents
  •          $Val[\pi_C] \leftarrow Val[\pi_C] +$ DP-Global(tree, $\pi_C$)
  • Choose joint policy with highest value

Global Optimal Algorithm (GOA)
  • Phase 1: Values are propagated upwards from
    leaves to root
  • From the deepest nodes in the tree to the root,
  • 1. For each of agent i's policies $\pi_i$ do
  •      $eval(\pi_i) \leftarrow \sum_{c_i} value_{\pi_i}^{c_i}$
  •          where $value_{\pi_i}^{c_i}$ is received from child $c_i$
  • 2. For each of parent j's policies $\pi_j$ do
  •      $value_{\pi_j}^{i} \leftarrow 0$
  •      for each of agent i's policies $\pi_i$ do
  •          set current-eval $\leftarrow$ expected-reward($\pi_j$, $\pi_i$)
  •          if $value_{\pi_j}^{i} <$ current-eval then
  •              $value_{\pi_j}^{i} \leftarrow$ current-eval
  •      send $value_{\pi_j}^{i}$ to parent j
  • Phase 2: Policies are propagated downwards from
    root to leaves.