Title: Hierarchical Methods for Planning under Uncertainty
1 Hierarchical Methods for Planning under Uncertainty
- Thesis Proposal
- Joelle Pineau
- Thesis Committee
- Sebastian Thrun, Chair
- Matthew Mason
- Andrew Moore
- Craig Boutilier, U. of Toronto
2 Integrating robots in living environments
The robot's role:
- Social interaction
- Mobile manipulation
- Intelligent reminding
- Remote-operation
- Data collection / monitoring
3 A broad perspective
[Diagram: the user, the world, and the robot; the robot receives observations of the underlying state, maintains a belief state, and issues actions.]
GOAL: Selecting appropriate actions
4 Why is this a difficult problem?
UNCERTAINTY
- Cause 1: Non-deterministic effects of actions
- Cause 2: Partial and noisy sensor information
- Cause 3: Inaccurate model of the world and the user
5 Why is this a difficult problem?
UNCERTAINTY
- Cause 1: Non-deterministic effects of actions
- Cause 2: Partial and noisy sensor information
- Cause 3: Inaccurate model of the world and the user
A solution: Partially Observable Markov Decision Processes (POMDPs)
6 The truth about POMDPs
- Bad news
- Finding an optimal POMDP action selection policy
is computationally intractable for complex
problems.
7 The truth about POMDPs
- Bad news
- Finding an optimal POMDP action selection policy is computationally intractable for complex problems.
- Good news
- Many real-world decision-making problems exhibit structure inherent to the problem domain.
- By leveraging structure in the problem domain, I propose an algorithm that makes POMDPs tractable, even for large domains.
8 How is it done?
- Use a divide-and-conquer approach
- We decompose a large monolithic problem into a collection of loosely-related smaller problems.
9 Thesis statement
Decision-making under uncertainty can be made
tractable for complex problems by exploiting
hierarchical structure in the problem domain.
10 Outline
- Problem motivation
- Partially observable Markov decision processes
- The hierarchical POMDP algorithm
- Proposed research
11 POMDPs within the family of Markov models
[Figure: POMDPs within the family of Markov models, alongside HMMs and MDPs.]
12 What are POMDPs?
Components:
- Set of states s ∈ S
- Set of actions a ∈ A
- Set of observations o ∈ O
[Diagram: a three-state example with states S1, S2, S3, actions a1, a2, transition probabilities, and observation probabilities Pr(o1), Pr(o2) attached to each state.]
POMDP parameters:
- Initial belief: b0(s) = Pr(s0 = s)
- Observation probabilities: O(s,a,o) = Pr(o | s,a)
- Transition probabilities: T(s,a,s') = Pr(s' | s,a)
- Rewards: R(s,a)
13 A POMDP example: The tiger problem
Reward function:
- R(a=listen) = -1
- R(a=open-right, s=tiger-left) = +10
- R(a=open-left, s=tiger-left) = -100
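As a concrete illustration, here is a minimal sketch of the tiger problem written as explicit arrays in Python. The rewards follow the slide above; the 0.85 accuracy of the listen observation and the door-opening reset are the standard values from the tiger problem literature and are assumed here rather than taken from this slide.

```python
import numpy as np

# States, actions, observations for the tiger problem.
S = ["tiger-left", "tiger-right"]
A = ["listen", "open-left", "open-right"]
O = ["growl-left", "growl-right"]

# T[a][s, s']: listening leaves the state unchanged; opening a door
# resets the problem (tiger placed behind either door with prob 0.5).
T = {
    "listen":     np.array([[1.0, 0.0], [0.0, 1.0]]),
    "open-left":  np.array([[0.5, 0.5], [0.5, 0.5]]),
    "open-right": np.array([[0.5, 0.5], [0.5, 0.5]]),
}

# Z[a][s', o]: listening yields the correct growl with prob 0.85 (assumed,
# standard value); after opening a door the observation is uninformative.
Z = {
    "listen":     np.array([[0.85, 0.15], [0.15, 0.85]]),
    "open-left":  np.array([[0.5, 0.5], [0.5, 0.5]]),
    "open-right": np.array([[0.5, 0.5], [0.5, 0.5]]),
}

# R[a][s]: rewards from the slide (listening costs 1, opening the wrong
# door costs 100, opening the other door earns 10).
R = {
    "listen":     np.array([-1.0, -1.0]),
    "open-left":  np.array([-100.0, 10.0]),
    "open-right": np.array([10.0, -100.0]),
}

b0 = np.array([0.5, 0.5])  # initial belief: tiger equally likely on either side
```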
14 What can we do with POMDPs?
- 1) State tracking
- After an action, what is the state of the world, st? Not so hard.
- 2) Computing a policy
- Which action, aj, should the controller apply next? Very hard!
[Diagram: the world moves through states st-1, st, ... under actions at-1, ...; the robot's control layer receives observations ot and maintains beliefs bt-1, bt, ...]
15 The tiger problem: State tracking
[Figure: initial belief b0 over the belief vector, S1 = tiger-left, S2 = tiger-right.]
16 The tiger problem: State tracking
[Figure: belief b0 over S1 = tiger-left, S2 = tiger-right, about to be updated with action = listen and observation = growl-left.]
17 The tiger problem: State tracking
[Figure: updated belief b1 alongside b0 over S1 = tiger-left, S2 = tiger-right, after action = listen and observation = growl-left.]
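The belief update shown on these slides is a Bayes filter over the hidden state. Below is a minimal sketch, reusing the T, Z, and b0 arrays defined in the earlier tiger-model sketch; with the assumed 0.85 listening accuracy, hearing growl-left after listening shifts the belief to roughly (0.85, 0.15).

```python
def belief_update(b, a, o, T, Z, obs_names):
    """Bayes filter step: b'(s') ∝ Z(s', a, o) * sum_s T(s, a, s') b(s)."""
    o_idx = obs_names.index(o)
    predicted = T[a].T @ b                     # prediction: sum_s T(s,a,s') b(s)
    unnormalized = Z[a][:, o_idx] * predicted  # correction by observation likelihood
    return unnormalized / unnormalized.sum()

# Reproduces the transition shown on these slides: listening and hearing
# growl-left shifts the uniform initial belief toward tiger-left.
b1 = belief_update(b0, "listen", "growl-left", T, Z, O)
print(b1)   # ~[0.85, 0.15] given the assumed 0.85 listening accuracy
```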
18 Policy Optimization
- Which action, aj, should the controller apply next?
- In MDPs:
- Policy is a mapping from state to action, π: si → aj
- In POMDPs:
- Policy is a mapping from belief to action, π: b → aj
- Recursively calculate the expected long-term reward for each state/belief
- Find the action that maximizes the expected reward
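To make "recursively calculate the expected reward" concrete, here is a sketch of a single Bellman backup on a belief state, again reusing the tiger-model arrays from the sketches above. Exact POMDP solvers represent the value function with sets of alpha-vectors rather than an arbitrary callable V as here; this sketch only illustrates the backup such methods repeat.

```python
def q_value(b, a, T, Z, R, gamma, V):
    """One Bellman backup at belief b: expected immediate reward plus the
    discounted expected value of each possible updated belief."""
    q = float(R[a] @ b)
    predicted = T[a].T @ b                     # Pr(s' | b, a)
    for o_idx in range(Z[a].shape[1]):
        unnormalized = Z[a][:, o_idx] * predicted
        p_o = unnormalized.sum()               # Pr(o | b, a)
        if p_o > 0.0:
            q += gamma * p_o * V(unnormalized / p_o)
    return q

def greedy_action(b, T, Z, R, gamma=0.95, V=lambda b: 0.0):
    """Choose the action maximizing the backup (myopic when V is zero)."""
    return max(R, key=lambda a: q_value(b, a, T, Z, R, gamma, V))

print(greedy_action(b0, T, Z, R))   # -> 'listen' at the uniform initial belief
```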
19 The tiger problem: Optimal policy
[Figure: optimal policy over the belief vector (S1 = tiger-left, S2 = tiger-right), with regions for open-right, listen, and open-left.]
20 Complexity of policy optimization
- Finite-horizon POMDPs are in the worst case doubly exponential
- Infinite-horizon undiscounted stochastic POMDPs are EXPTIME-hard, and may not be decidable
21 The essence of the problem
- How can we find good policies for complex POMDPs?
- Is there a principled way to provide near-optimal policies in reasonable time?
22 Outline
- Problem motivation
- Partially observable Markov decision processes
- The hierarchical POMDP algorithm
- Proposed research
23 A hierarchical approach to POMDP planning
- Key idea: Exploit hierarchical structure in the problem domain to break a problem into many related POMDPs.
- What type of structure?
- Action set partitioning
[Figure: an action set partitioned into subtasks, each invoked through an abstract action.]
24 Assumptions
- Each POMDP controller has a subset of A0.
- Each POMDP controller has the full state set S0 and observation set O0.
- Each controller includes discriminative reward information.
- We are given the action set partitioning graph.
- We are given a full POMDP model of the problem: {S0, A0, O0, M0}.
25 The tiger problem: An action hierarchy
[Figure: action hierarchy — act at the root, with children open-left and investigate; investigate has children listen and open-right.]
P_investigate = {S0, A_investigate, O0, M_investigate}
A_investigate = {listen, open-right}
26 Optimizing the investigate controller
[Figure: locally optimal policy of the investigate controller over the belief vector (S1 = tiger-left, S2 = tiger-right), with regions for open-right and listen.]
27 The tiger problem: An action hierarchy
P_act = {S0, A_act, O0, M_act}
A_act = {open-left, investigate}
But... R(s, a=investigate) is not defined!
[Figure: the same action hierarchy, highlighting the abstract action investigate.]
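One convenient way to hold this structure is a mapping from each subtask to the action set of its controller, matching the partition on slides 25 and 27; the small traversal helper is a hypothetical illustration.

```python
# The tiger action hierarchy from slides 25 and 27: each subtask maps to
# the actions of its controller; names not listed are primitive actions.
action_hierarchy = {
    "act":         ["open-left", "investigate"],   # A_act
    "investigate": ["listen", "open-right"],       # A_investigate
}

def primitive_actions(node, hierarchy):
    """Recursively flatten a subtask into the primitive actions beneath it."""
    if node not in hierarchy:            # a primitive action is its own leaf
        return {node}
    leaves = set()
    for child in hierarchy[node]:
        leaves |= primitive_actions(child, hierarchy)
    return leaves

print(primitive_actions("act", action_hierarchy))
# -> {'open-left', 'listen', 'open-right'}
```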
28 Modeling abstract actions
Insight: Use the local policy of the corresponding low-level controller.
General form: R(si, ak) = R(si, Policy(controller_k, si))
Example: R(s=tiger-left, ak=investigate) = R(s=tiger-left, Policy(investigate, tiger-left)), where Policy(investigate, tiger-left) = open-right

              open-right   listen   open-left
tiger-left        10         -1       -100
tiger-right      -100        -1         10
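A minimal sketch of this construction, assuming per-state (point-belief) sub-controller policies. The tiger-left entry of the investigate policy comes from the slide's example; the tiger-right entry (listen) is an assumption, since open-left is not in A_investigate.

```python
# Per-state rewards of the primitive actions, from the table above.
R_primitive = {
    "open-right": {"tiger-left": 10.0,   "tiger-right": -100.0},
    "listen":     {"tiger-left": -1.0,   "tiger-right": -1.0},
    "open-left":  {"tiger-left": -100.0, "tiger-right": 10.0},
}

# Local policy of the investigate controller at each (point-belief) state.
# The tiger-left entry is from the slide; the tiger-right entry is assumed.
investigate_policy = {"tiger-left": "open-right", "tiger-right": "listen"}

def abstract_reward(state, sub_policy, R_primitive):
    """R(s, a_k) = R(s, Policy(controller_k, s)): the abstract action is
    rewarded as the primitive action its controller would take in s."""
    return R_primitive[sub_policy[state]][state]

print(abstract_reward("tiger-left", investigate_policy, R_primitive))
# -> 10.0, i.e. R(tiger-left, investigate) = R(tiger-left, open-right)
```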
29 Optimizing the act controller
[Figure: locally optimal policy of the act controller over the belief vector (S1 = tiger-left, S2 = tiger-right), with regions for investigate and open-left.]
30 The complete hierarchical policy
[Figure: hierarchical policy over the belief vector (S1 = tiger-left, S2 = tiger-right), with regions for open-left, listen, and open-right.]
31 The complete hierarchical policy
[Figure: the hierarchical policy compared against the optimal policy over the belief vector (S1 = tiger-left, S2 = tiger-right).]
32 Results for larger simulation domains
33 Related work on hierarchical methods
- Hierarchical HMMs
  - Fine et al., 1998.
- Hierarchical MDPs
  - Dayan & Hinton, 1993; Dietterich, 1998; McGovern et al., 1998; Parr & Russell, 1998; Singh, 1992.
- Loosely-coupled MDPs
  - Boutilier et al., 1997; Dean & Lin, 1995; Meuleau et al., 1998; Singh & Cohn, 1998; Wang & Mahadevan, 1999.
- Factored-state POMDPs
  - Boutilier et al., 1999; Boutilier & Poole, 1996; Hansen & Feng, 2000.
- Hierarchical POMDPs
  - Castanon, 1997; Hernandez-Gardiol & Mahadevan, 2001; Theocharous et al., 2001; Wiering & Schmidhuber, 1997.
34 Outline
- Problem motivation
- Partially observable Markov decision processes
- The hierarchical POMDP algorithm
- Proposed research
35 Proposed research
- 1) Algorithmic design
- 2) Algorithmic analysis
- 3) Model learning
- 4) System development and application
36 Research block 1: Algorithmic design
- Goal 1.1: Developing/implementing the hierarchical POMDP algorithm.
- Goal 1.2: Extending H-POMDP to factored state representations.
- Goal 1.3: Using state/observation abstraction.
- Goal 1.4: Planning for controllers with no local reward information.
37 Goal 1.3: State/observation abstraction
- Assumption 2
- Each POMDP controller has the full state set S0 and observation set O0.
- Can we reduce the number of states/observations, |S| and |O|?
38 Goal 1.3: State/observation abstraction
- Assumption 2
- Each POMDP controller has the full state set S0 and observation set O0.
- Can we reduce the number of states/observations, |S| and |O|?
- Yes! Each controller only needs a subset of the state/observation features.
- What is the computational speed-up?
39 Goal 1.4: Local controller reward information
- Assumption 3
- Each controller includes some amount of discriminative reward information.
- Can we relax this assumption?
40 Goal 1.4: Local controller reward information
- Assumption 3
- Each controller includes some amount of discriminative reward information.
- Can we relax this assumption?
- Possibly. Use reward shaping to select a policy-invariant reward function.
- What is the benefit?
- H-POMDP could solve problems with sparse reward functions.
41 Research block 2: Algorithmic analysis
- Goal 2.1: Evaluating the performance of the H-POMDP algorithm.
- Goal 2.2: Quantifying the loss due to the hierarchy.
- Goal 2.3: Comparing different possible decompositions of a problem.
42 Goal 2.1: Performance evaluation
- How does the hierarchical POMDP algorithm compare to:
- Exact value function methods
  - Sondik, 1971; Monahan, 1982; Littman, 1996; Cassandra et al., 1997.
- Policy search methods
  - Hansen, 1998; Kearns et al., 1999; Ng & Jordan, 2000; Baxter & Bartlett, 2000.
- Value approximation methods
  - Parr & Russell, 1995; Thrun, 2000.
- Belief approximation methods
  - Nourbakhsh, 1995; Koenig & Simmons, 1996; Hauskrecht, 2000; Roy & Thrun, 2000.
- Memory-based methods
  - McCallum, 1996.
- Consider problems from the POMDP literature and the dialogue management domain.
43 Goal 2.2: Quantifying the loss
- The hierarchical POMDP planning algorithm provides an approximately-optimal policy.
- How near-optimal is the policy?
- Subject to some (very restrictive) conditions, the value function of the top-level controller is an upper bound on the value of the approximation: Vtop(b) ≥ Vactual(b)
- Can we loosen the restrictions? Tighten the bound?
- Find a lower bound?
44 Goal 2.3: Comparing different decompositions
- Assumption 4
- We are given an action set partitioning graph.
- What makes a good hierarchical action decomposition?
- Comparing decompositions is the first step towards automatic decomposition.
[Figure: two alternative action decompositions over the actions manufacture, examine, inspect, and replace.]
45 Research block 3: Model learning
- Goal 3.1: Automatically generating good action hierarchies.
- Assumption 4: We are given an action set partitioning graph.
- Can we automatically generate a good hierarchical decomposition?
- Maybe. It is being done for hierarchical MDPs.
- Goal 3.2: Including parameter learning.
- Assumption 5: We are given a full POMDP model of the problem.
- Can we introduce parameter learning?
- Yes! Maximum-likelihood parameter optimization (Baum-Welch) can be used for POMDPs.
46 Research block 4: System development and application
- Goal 4.1: Building an extensive dialogue manager
[Diagram: dialogue manager architecture — the Dialogue Manager mediates between the user (touchscreen input and messages, speech utterances) and three modules: a Teleoperation module (remote-control commands, facemail operations), a Reminding module (reminder messages, status information), and a Robot module (sensor readings, motion commands).]
47 An implemented scenario
Problem size: |S| = 288, |A| = 14, |O| = 15
State features: RobotLocation, UserLocation, UserStatus, ReminderGoal, UserMotionGoal, UserSpeechGoal
[Map: environment including the patient room, the robot home, and the physiotherapy area.]
Test subjects: 3 elderly residents in an assisted living facility
48 Contributions
- Algorithmic contribution: A novel POMDP algorithm based on hierarchical structure.
- Enables use of POMDPs for much larger problems.
- Application contribution: Application of POMDPs to dialogue management is novel.
- Allows design of robust robot behavioural managers.
49 Research schedule
- 1) Algorithmic design/implementation: fall 01
- 2) Algorithmic analysis: spring/summer 02
- 3) Model learning: spring/summer/fall 02
- 4) System development and application: ongoing
- 5) Thesis writing: fall 02 / spring 03
50 Questions?
51 A simulated robot navigation example
[Figure: simulated navigation environment.]
Domain size: |S| = 11, |A| = 6, |O| = 6
52 A dialogue management example
[Figure: action hierarchy rooted at Act.]
- Act: SayTime, Greet, CheckHealth, CheckWeather, Move, DoMeds, Phone
- Greet: GreetGeneral, GreetMorning, GreetNight, RespondThanks
- CheckHealth: AskHealth, OfferHelp
- CheckWeather: AskWeatherTime, SayCurrent, SayToday, SayTomorrow
- Move: AskGoWhere, GoToRoom, GoToKitchen, GoToFollow, VerifyRoom, VerifyKitchen, VerifyFollow
- DoMeds: StartMeds, NextMeds, ForceMeds, QuitMeds
- Phone: AskCallWho, Call911, CallNurse, CallRelative, Verify911, VerifyNurse, VerifyRelative
Domain size: |S| = 20, |A| = 30, |O| = 27
53 Action hierarchy for the implemented scenario
[Figure: action hierarchy — Act at the root, with subtasks Remind, Assist, Rest, Move, Contact, and Inform.]
54 Sondik's parts manufacturing problem
[Figure: two example action decompositions (Decomposition 1 and Decomposition 2) over the actions manufacture, examine, inspect, and replace; 5 more decompositions were also considered.]
55 Manufacturing task results
56 Using state/observation abstraction
Controller: CheckHealth
- Action set: AskHealth, OfferHelp
- State set: ReminderGoal = {none, medsX}; CommunicationGoal = {none, personX}; UserHealth = {good, poor, emergency}
Controller: Phone
- Action set: AskCallWho, CallHelp, CallNurse, CallRelative, VerifyHelp, VerifyNurse, VerifyRelative
- State set: CommunicationGoal = {none, nurse, 911, relative}
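A minimal sketch of what this abstraction buys: with a factored state, each controller can track a belief only over the features it needs, as the Phone controller does above. The feature values and probabilities below are purely illustrative and not taken from the implemented system.

```python
# Purely illustrative factored belief over two hypothetical features.
FEATURES = ["ReminderGoal", "CommunicationGoal"]
belief = {("none", "none"): 0.4, ("none", "nurse"): 0.3,
          ("meds", "none"): 0.2, ("meds", "nurse"): 0.1}

def project_belief(belief, feature_names, keep):
    """Marginalize a factored belief onto the features a controller needs,
    e.g. a Phone-like controller that only tracks CommunicationGoal."""
    keep_idx = [feature_names.index(f) for f in keep]
    projected = {}
    for state, p in belief.items():
        key = tuple(state[i] for i in keep_idx)
        projected[key] = projected.get(key, 0.0) + p
    return projected

print(project_belief(belief, FEATURES, ["CommunicationGoal"]))
# -> {('none',): 0.6, ('nurse',): 0.4}
```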
57 Related work on robot planning and control
- Manually-scripted dialogue strategies
  - Denecke & Waibel, 1997; Walker et al., 1997.
- Markov decision processes (MDPs) for dialogue management
  - Levin et al., 1997; Fromer, 1998; Walker et al., 1998; Goddeau & Pineau, 2000; Singh et al., 2000; Walker, 2000.
- Robot interfaces
  - Torrance, 1996; Asoh et al., 1999.
- Classical planning
  - Fikes & Nilsson, 1971; Simmons, 1987; McAllester & Rosenblitt, 1991; Penberthy & Weld, 1992; Kushmerick, 1995; Veloso et al., 1995; Smith & Weld, 1998.
- Execution architectures
  - Firby, 1987; Musliner, 1993; Simmons, 1994; Bonasso & Kortenkamp, 1996.
58 Decision-theoretic planning models
59 The tiger problem: Value function solution
[Figure: optimal value function V over the belief space from s = tiger-left to s = tiger-right, with segments corresponding to open-right, listen, and open-left.]
60 Optimizing the investigate controller
[Figure: value function V of the investigate controller over the belief space from s = tiger-left to s = tiger-right, with segments for listen and open-right.]
61 Optimizing the act controller
[Figure: value function V of the act controller over the belief space from s = tiger-left to s = tiger-right, with segments for investigate and open-left.]