Title: Combining Exploration and Exploitation in Landmine Detection
1 Combining Exploration and Exploitation in Landmine Detection
Lihan He, Shihao Ji and Lawrence Carin
ECE, Duke University
2 Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition and offline learning
- Lifelong-learning algorithm
- Experimental results
3 Landmine detection (1)
- Landmine detection
  - By robots instead of human beings
  - Underlying model controlling the robot: POMDP
- Multiple sensors
  - A single sensor is sensitive to only certain types of objects
    - EMI sensor: conductivity
    - GPR sensor: dielectric property
    - Seismic sensor: mechanical property
  - Multiple complementary sensors improve detection performance
4 Landmine detection (2)
- Given a minefield where some landmines and clutter are buried underground
- Two types of sensors are available: an EMI sensor and a GPR sensor
- Sensing has a cost
- Correct / incorrect declarations have a corresponding reward / penalty
How can we develop a strategy to effectively find the landmines in this minefield with minimal cost?
Questions involved:
- How to optimally choose sensing positions in the field, so as to use as few sensing points as possible to find the landmines?
- How to optimally choose sensors at each sensing position?
- When to sense and when to declare?
5 Landmine detection (3)
A partially observable Markov decision process (POMDP) model is built to solve this problem, since it provides an approach to select actions (sensor deployment, sensing positions and declarations) optimally based on maximal reward / minimal cost.
- The robot learns the model at the same time as it moves and senses in the minefield (combining exploration and exploitation)
- The model is updated based on the exploration process
6 Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition and offline learning
- Lifelong-learning algorithm
- Experimental results
7 POMDP (1)
POMDP: a hidden Markov model (HMM) augmented with controllable actions and rewards.
A POMDP is a model of an agent interacting synchronously with its environment. The agent takes as input observations of the environment, estimates the state from the observed information, and then generates as output actions based on its policy. Through these repeated observation-action loops, the agent seeks to obtain maximal reward, or equivalently, minimal cost.
8 POMDP (2)
A POMDP model is defined by the tuple ⟨S, A, T, R, Ω, O⟩:
- S is a finite set of discrete states of the environment.
- A is a finite set of discrete actions.
- Ω is a finite set of discrete observations providing noisy state information.
- T: S × A → Π(S) is the state transition probability
  - the probability of transitioning from state s to s' when taking action a.
- O: S × A → Π(Ω) is the observation function
  - the probability of receiving observation o after taking action a and landing in state s'.
- R: S × A → ℝ, where R(s, a) is the expected reward the agent receives by taking action a in state s.
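As a concrete, hypothetical illustration (not code from the paper), the tuple can be held in a small Python container; the array shapes below are assumptions consistent with the definitions above:

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class POMDP:
        T: np.ndarray        # T[a, s, s']: probability of moving from s to s' under action a
        O: np.ndarray        # O[a, s', o]: probability of observing o after action a lands in s'
        R: np.ndarray        # R[s, a]: expected immediate reward for taking action a in state s
        gamma: float = 0.95  # discount factor (an assumed value; the slides do not give one)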
9 POMDP (3)
Belief state b
- The agent's belief about which state it is currently in
- A probability distribution over all states in S
- A summary of past information
- Updated at each step by Bayes rule (a sketch of the update follows)
  - based on the latest action and observation, and the previous belief state
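A minimal sketch of this Bayes-rule update, reusing the hypothetical container above: b'(s') is proportional to the observation likelihood times the predicted belief, and the normalizer is P(o | a, b).

    def update_belief(model, b, a, o):
        # b'(s') ∝ O(a, s', o) * sum_s T(a, s, s') * b(s)
        b_next = model.O[a, :, o] * (model.T[a].T @ b)
        return b_next / b_next.sum()   # divide by P(o | a, b)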
10 POMDP (4)
Policy
- A mapping from belief states to actions
- Tells the agent which action it should take given the current belief state
Optimal policy
- Maximizes the expected discounted reward over the horizon length: the immediate reward plus the discounted future reward
- V(b) is piecewise linear and convex in the belief state (Sondik, 1971)
- Represent V(b) by a set of |S|-dimensional vectors α_1, ..., α_m
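The slide's own equation did not survive extraction; in the standard form consistent with the bullets above (immediate reward plus discounted future reward, with discount factor γ and updated belief b_o^a), the optimal value and its α-vector representation are:

    V^*(b) = \max_{a \in A} \Big[ \sum_{s \in S} R(s,a)\, b(s) + \gamma \sum_{o \in \Omega} P(o \mid b, a)\, V^*(b_o^a) \Big],
    \qquad V(b) = \max_{1 \le i \le m} \alpha_i \cdot b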
11 POMDP (5)
Policy learning
- Solve for the vectors α_1, ..., α_m
- Point-based value iteration (PBVI) algorithm
  - Iteratively updates the vectors α and the values V for a set of sampled belief points
  - Starts one step from the horizon; the value (n+1) steps from the horizon is computed from the results of the n-step backup
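The PBVI backup in its standard form (Pineau et al., 2003), matching the description above of computing the (n+1)-step vectors from the n-step vectors {α_i^n} at each sampled belief point b, is roughly:

    \alpha^{a,*}(s) = R(s,a), \qquad
    \alpha_i^{a,o}(s) = \gamma \sum_{s'} T(s,a,s')\, O(s',a,o)\, \alpha_i^{n}(s'),
    \alpha_b^{a} = \alpha^{a,*} + \sum_{o \in \Omega} \operatorname*{arg\,max}_{\alpha \in \{\alpha_i^{a,o}\}} (\alpha \cdot b), \qquad
    \alpha_b^{n+1} = \operatorname*{arg\,max}_{a \in A} (\alpha_b^{a} \cdot b)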
12 Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition and offline learning
- Lifelong-learning algorithm
- Experimental results
13 Model definition (1)
- Feature extraction: EMI sensor
  - (Figure: EMI model fit to the sensor measurements)
  - Model parameters are extracted by a nonlinear fitting method
14 Model definition (2)
- Feature extraction: GPR sensor
  - (Figure: GPR return, time vs. down-track position)
  - Raw moments: energy features
  - Central moments: variance and asymmetry of the wave (see the sketch below)
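The slide does not spell out the exact moment formulas; as one plausible reading (an assumption, not the authors' feature code), raw and central moments of the GPR waveform could be computed as:

    import numpy as np

    def gpr_moments(waveform, orders=(1, 2, 3)):
        x = np.asarray(waveform, dtype=float)
        raw = {k: np.mean(x ** k) for k in orders}             # raw moments (energy-type features)
        mu = x.mean()
        central = {k: np.mean((x - mu) ** k) for k in orders}  # variance (k=2), asymmetry (k=3)
        return raw, central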
15 Model definition (3)
16 Model definition (4)
- Variational Bayesian (VB) expectation-maximization (EM) method
  - used for model selection
  - Bayesian learning
  - Criterion: compare the model evidence (marginal likelihood) of candidate models
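This criterion can be written out explicitly; the following is the standard marginal-likelihood comparison the bullet refers to (D the training data, θ the HMM parameters, M a candidate model), with VB-EM optimizing a lower bound on log p(D | M):

    p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta, \qquad
    M^{*} = \operatorname*{arg\,max}_{M} p(D \mid M)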
17 Model definition (5)
Candidate models: HMMs with two sets of observations
Candidate sizes: S = 1, 5, 9, ...; O = 2, 3, 4, ...
18 Model definition (6)
(Figures: estimating S; estimating O)
19 Model definition (7)
- Specification of the action set A
- 10 sensing actions, allowing movement in 4 directions:
  1 Stay, GPR sensing; 2 South, GPR sensing; 3 North, GPR sensing; 4 West, GPR sensing; 5 East, GPR sensing; 6 Stay, EMI sensing; 7 South, EMI sensing; 8 North, EMI sensing; 9 West, EMI sensing; 10 East, EMI sensing
- 5 declaration actions, declaring the target as one type:
  11 Declare as metal mine; 12 Declare as plastic mine; 13 Declare as Type-1 clutter; 14 Declare as Type-2 clutter; 15 Declare as clean
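These 15 actions map naturally onto an enumeration; a minimal, hypothetical encoding with indices following the slide's numbering:

    from enum import IntEnum

    class Action(IntEnum):
        STAY_GPR = 1
        SOUTH_GPR = 2
        NORTH_GPR = 3
        WEST_GPR = 4
        EAST_GPR = 5
        STAY_EMI = 6
        SOUTH_EMI = 7
        NORTH_EMI = 8
        WEST_EMI = 9
        EAST_EMI = 10
        DECLARE_METAL_MINE = 11
        DECLARE_PLASTIC_MINE = 12
        DECLARE_TYPE1_CLUTTER = 13
        DECLARE_TYPE2_CLUTTER = 14
        DECLARE_CLEAN = 15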
20 Model definition (8)
Across all 5 types of mines and clutter (metal mine, plastic mine, type-1 clutter, type-2 clutter, clean), a total of 29 states are defined.
(Figure: state layout for each target type)
21 Model definition (9)
- Stay actions do not cause state transitions: identity matrix
- Other sensing actions cause state transitions, computed by elementary geometric probability
- Declaration actions reset the problem: uniform distribution over states
(Example for state 5: a = "walk south and then sense with EMI" or "walk south and then sense with GPR"; d = distance traveled in a single step by the robot; s1, s2, s3 and s4 denote the 4 borders of state 5, as well as their respective area metrics.)
22 Model definition (10)
- Assume each mine or clutter item is buried separately
- State transitions happen only within the states of a single target as the robot moves
- Clean (state 29) acts as a bridge between the targets
(Figure: metal-mine states connected through state 29, clean, to the other targets)
23 Model definition (11)
- State transition matrix: block diagonal
- Model expansion: add more diagonal blocks, each one a target
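A minimal sketch of the block-diagonal structure and of model expansion (block sizes and values are placeholders, and the clean state that bridges targets is omitted for brevity):

    import numpy as np
    from scipy.linalg import block_diag

    # Placeholder per-target transition blocks; real blocks come from the learned target sub-models.
    T_metal_mine   = np.full((7, 7), 1.0 / 7)
    T_plastic_mine = np.full((7, 7), 1.0 / 7)

    T = block_diag(T_metal_mine, T_plastic_mine)           # block-diagonal transition structure
    T_expanded = block_diag(T, np.full((7, 7), 1.0 / 7))   # expansion: add one more target as a diagonal block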
24 Model definition (12)
- Specification of the reward R
  - Sensing: -1
    - Each sensing action (either EMI or GPR) has a cost of -1
  - Correct declaration: +10
    - Correctly declaring a target
  - Partially correct declaration: +5
    - Confusion between different types of landmines, or between different types of clutter
  - Incorrect declaration: large penalty
    - Miss (declare as clean or clutter when it is a landmine): -100
    - False alarm (declare as a landmine when it is clean or clutter): -50
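A compact sketch of this reward specification (the target-type labels are hypothetical names; the numerical values are the ones listed above):

    MINES   = {'metal_mine', 'plastic_mine'}
    CLUTTER = {'type1_clutter', 'type2_clutter'}

    def sensing_reward():
        return -1          # every EMI or GPR sensing action costs -1

    def declaration_reward(declared, true_type):
        if declared == true_type:
            return 10      # correct declaration
        if (declared in MINES and true_type in MINES) or \
           (declared in CLUTTER and true_type in CLUTTER):
            return 5       # partially correct: mine/mine or clutter/clutter confusion
        if true_type in MINES:
            return -100    # miss: a landmine declared as clean or clutter
        if declared in MINES:
            return -50     # false alarm: clean or clutter declared as a landmine
        return 0           # clean vs. clutter confusion is not specified on the slide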
25 Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition and offline learning
- Lifelong-learning algorithm
- Experimental results
26 Lifelong learning (1)
- Model-based algorithm; no training data available in advance
- Learn the POMDP model by a Bayesian approach during the exploration / exploitation process
- Assume a rough model is given, but some model parameters are uncertain
- An oracle is available, which can provide exact information about the target label, size and position, but querying the oracle is expensive
- Criteria for using the oracle (a sketch of these criteria follows):
  1. The policy selects the oracle query action
  2. The agent finds new observations (new knowledge)
  3. After much sensing, the agent still cannot make a decision (the target is too difficult)
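A minimal sketch of these three criteria as a query test (the threshold and argument names are assumptions, not from the slides):

    def should_query_oracle(action, observation_is_new, num_senses_on_target, max_senses=20):
        policy_asked = (action == 'query_oracle')      # 1. the policy selects the oracle query action
        novel = observation_is_new                     # 2. a new observation, i.e. new knowledge
        stuck = num_senses_on_target > max_senses      # 3. much sensing, but still no confident declaration
        return policy_asked or novel or stuck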
27 Lifelong learning (2)
- An oracle query includes three steps:
  1. Measure data from both sensors on a grid
  2. The true target label is revealed
  3. Build the target model based on the measured data
- Two learning approaches:
  1. Model expansion (more target types are considered)
  2. Model hyper-parameter update
28 Lifelong learning (3)
Dirichlet distribution
- A distribution over the parameters of a multinomial distribution
- A conjugate prior to the multinomial distribution
- We can put a Dirichlet prior on each state-action pair in the transition probability and in the observation function
For variables θ = (θ_1, ..., θ_K) and parameters α = (α_1, ..., α_K):

    \mathrm{Dir}(\theta; \alpha) = \frac{\Gamma\!\big(\sum_{k} \alpha_k\big)}{\prod_{k} \Gamma(\alpha_k)} \prod_{k} \theta_k^{\alpha_k - 1},
    \quad \text{with } \theta_k \ge 0 \text{ and } \sum_{k} \theta_k = 1
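Conjugacy is what makes the later hyper-parameter update simple: counts of observed transitions (or observations) are added to the Dirichlet parameters. A minimal sketch for one (s, a) pair (the learning-rate weighting is an assumption; the slides only say a learning rate is set):

    import numpy as np

    def update_dirichlet(alpha, counts, learning_rate=1.0):
        # posterior parameters: prior pseudo-counts plus (weighted) observed counts
        return np.asarray(alpha, dtype=float) + learning_rate * np.asarray(counts, dtype=float)

    def expected_probabilities(alpha):
        # posterior-mean transition (or observation) probabilities
        a = np.asarray(alpha, dtype=float)
        return a / a.sum()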
29 Lifelong learning (4)
Algorithm
1. Start from an imperfect model M0, containing clean and some mine or clutter types, with the corresponding S and O; S and O can be expanded during the learning process
2. Oracle query is one possible action
3. Set the learning rate
4. Set a Dirichlet prior, according to the imperfect model M0, for every unknown transition probability and every unknown observation distribution
30 Lifelong learning (5)
Algorithm (continued)
5. Sample N models from the priors and solve their policies
6. Initialize the weights w_i = 1/N
7. Initialize the history h
8. Initialize the belief state b_0 for each model
9. Run the experiment. At each time step:
   a. Compute the optimal action for each model: a_i = π_i(b_i), for i = 1, ..., N
   b. Pick an action a according to the weights: p(a_i) = w_i (a code sketch of steps 5-9b follows this list)
   c. If one of the three query conditions is met (exploration):
      (1) Sense the current local area on a grid
      (2) The current target label is revealed
      (3) Build the sub-model for the current target and compute the hyper-parameters
          - If the target is a new target type: expand the model by including the new target type as a diagonal block
          - Else (the target is an existing target type): update the Dirichlet parameters of that target type (next page)
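A minimal sketch of steps 5-9b, with one action sampled among the per-model recommendations according to the model weights (policy solving and the exploration branch are omitted; all names are hypothetical):

    import numpy as np

    def choose_action(policies, beliefs, weights, rng=None):
        rng = rng or np.random.default_rng()
        recommended = [pi(b) for pi, b in zip(policies, beliefs)]   # a_i = pi_i(b_i)
        w = np.asarray(weights, dtype=float)
        i = rng.choice(len(recommended), p=w / w.sum())             # p(a_i) = w_i
        return recommended[i]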
31 Lifelong learning (6)
Algorithm (continued)
32 Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition and offline learning
- Lifelong-learning algorithm
- Experimental results
33 Results (1)
- Minefields: 1.6 × 1.6 m², sensed on a spatial grid of 2 cm by 2 cm, with two sensors (EMI and GPR)
- Search almost everywhere to avoid missing landmines; active sensing to minimize the cost
- Basic path: lanes
  - The basic path restrains the robot from moving across the lanes
  - The robot takes actions to determine its sensing positions within the lanes
34 Results (2)
- Offline-learning approach: performance summary

                                                      Minefield 1    Minefield 2    Minefield 3
Ground truth: number of mines (metal + plastic)       5 (3 + 2)      7 (4 + 3)      7 (4 + 3)
Ground truth: number of clutter (metal + nonmetal)    21 (18 + 3)    57 (34 + 23)   29 (23 + 6)
Detection result: number of mines missed              1              1              2
Detection result: number of false alarms              2              2              2

Metal clutter: soda can, shell, nail, quarter, penny, screw, lead, rod, ball bearing
Nonmetal clutter: rock, bag of wet sand, bag of dry sand, CD
35 Results (3)
- Offline-learning approach: Minefield 1
(Figures: detection result and ground truth; M = metal mine, P = plastic mine, other marks = clutter)
- 1 miss, 2 false alarms
36 Results (4)
- Plastic mine: GPR sensor; metal mine: EMI sensor
- Clean areas and the center of a mine: sensed few times (2-3 times in general)
- Interface between mine and clean: sensed many times
37 Results (5)
- Lifelong-learning approach: Minefield 1 (initial learning from Minefield 1)
(Figures: declarations and ground truth. Red rectangular region = oracle query; other marks = declarations. Red = metal mine, pink = plastic mine, yellow = clutter 1, cyan = clutter 2, blue = clean.)
38 Results (6)
- Lifelong-learning approach compared with offline learning
(Figure: difference between the parameters of the model learned by lifelong learning and the model learned by offline learning, where the training data are given in advance. The three big error drops correspond to adding new targets into the model.)
39 Results (7)
- Lifelong-learning approach: Minefield 2 (sensing Minefield 2 after the model was learned from Minefield 1)
(Figures: declarations and ground truth. Red rectangular region = oracle query; other marks = declarations. Red = metal mine, pink = plastic mine, yellow = clutter 1, cyan = clutter 2, blue = clean.)
40 Results (8)
- Lifelong-learning approach: Minefield 3
(Figures: declarations and ground truth. Red rectangular region = oracle query; other marks = declarations. Red = metal mine, pink = plastic mine, yellow = clutter 1, cyan = clutter 2, blue = clean.)