Title: Reinforcement Learning by Policy Search
1. Reinforcement Learning by Policy Search
Dr. Leonid Peshkin, Harvard University
2. Learning agent
- A system that has an ongoing interaction with an external environment:
  - household robot
  - factory controller
  - web agent
  - Mars explorer
  - pizza delivery robot
3. Reinforcement learning
- Given a connection to the environment, find a behavior that maximizes long-run reinforcement.
[Figure: agent-environment loop with action, observation, and reinforcement signals]
4. Why Reinforcement Learning?
- Supervision signal is rarely available
- Reward is easier than behavior for humans to specify:
  - + for removing dirt
  - − for consuming energy
  - − − for damaging furniture
  - − − − for terrorizing the cat
5. Major Successes
- Backgammon (Tesauro @ IBM)
- Elevator scheduling (Crites & Barto @ UMass)
- Cellular phone channel allocation (Singh & Bertsekas)
- Space-shuttle scheduling (Zhang & Dietterich)
- Real robots: crawling (Kimura & Kobayashi)
6. Outline
- Model of an agent learning from the environment
- Learning as a stochastic optimization problem
- Re-using the experience in policy evaluation
- Theoretical highlight:
  - sample complexity bounds for likelihood ratio estimation
- Empirical highlight:
  - adaptive network routing
7. Interaction loop
Design a learning algorithm from the agent's perspective.
8. Interaction loop
[Figure: agent-environment loop with action a_t, state s_t, new state s_{t+1}]
Markov decision process (MDP)
9. Model with partial observability
[Figure: POMDP loop with observation o_t, action a_t, state s_t, new state s_{t+1}, reward r_t]
POMDP:
- set of states
- set of actions
- set of observations
10. Model with partial observability
[Figure: POMDP loop with observation o_{t-1}, action a_t, state s_{t-1}, new state s_t, reward r_t]
POMDP:
- observation function
- world state transition function
- reward function
11. Objective
[Figure: one step of the interaction: observation o_{t-1}, action a_t, states s_{t-1}, s_t, reward r_t]
12. Objective
[Figure: the interaction unrolled over time: states s_{t-1}, s_t, s_{t+1}; observations o_{t-1}, o_t, o_{t+1}; actions a_t, a_{t+1}; rewards r_{t-1}, r_t, r_{t+1}]
13. Cumulative reward
[Figure: the unrolled interaction history h]
Return(h) = Σ_t r_t
14. Cumulative discounted reward
[Figure: the unrolled interaction history h, with rewards r_{t-1}, r_t, r_{t+1} weighted by γ^{t-1}, γ^t, γ^{t+1}]
Return(h) = Σ_t γ^t r_t
15. Objective
Experience: a history h.
E[Return] = Σ_h Pr(h) Return(h)
Maximize expected return!
16. Objective
[Figure: experience generated by the policy: observations o_t, actions a_t, states s_t, s_{t+1}, rewards r_t]
E[Return] = Σ_h Pr(h) Return(h)
The policy determines which experiences occur.
17. Policy
[Figure: policy μ maps state s_t to action a_t; the environment returns new state s_{t+1} and reward r_{t+1}]
A Markov decision process assumes complete observability.
18. Markov Decision Processes
Bellman, 64
- Good news: many techniques to learn in an MDP
  - value: indication of a potential payoff
  - guaranteed to converge to the best policy
- Bad news: guaranteed to converge only if
  - the environment is Markov and observable
  - the value can be represented exactly
  - every action is tried in every state infinitely often
19. Partial Observability
[Figure: the agent receives observation o_t instead of the state; action a_t, state s_t, new state s_{t+1}, reward r_{t+1}]
A partially observable Markov decision process (POMDP) assumes only partial observability.
20. Policy with memory
[Figure: a policy with memory; action a_t may depend on past observations o_{t-1}, o_t and past actions a_{t-1}]
21. Reactive policy
[Figure: a reactive policy; action a_t depends only on the current observation o_t]
Finding the optimal reactive policy is NP-hard (Papadimitriou, 89).
22. Finite-State Controller
Meuleau, Peshkin, Kim, Kaelbling, UAI-99
[Figure: the environment (POMDP) with states s_t, s_{t+1}, emitting observation o_{t+1} and reward r_t in response to action a_t]
23. Finite-State Controller
[Figure: the agent (FSC) with internal states n_t, n_{t+1} interacting with the environment (POMDP): action a_t, observation o_{t+1}, states s_t, s_{t+1}, reward r_t; the resulting history is the experience]
24. Finite-State Controller
[Figure: the agent (FSC) with internal states n_t, n_{t+1}, action a_t, observation o_{t+1}]
FSC:
- set of controller states
- internal state transition function
- action function
Policy μ, with parameters θ to be optimized.
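A minimal Python sketch of such a controller, assuming soft-max parameterizations for both the internal-state transition and the action choice; the class and attribute names are illustrative, not from the talk.

```python
import numpy as np

class FiniteStateController:
    """Stochastic finite-state controller (a sketch, not the authors' code).

    theta_psi[n, o, n'] parameterizes the internal-state transition,
    theta_eta[n, a]     parameterizes the action choice;
    both are turned into probabilities with a soft-max.
    """

    def __init__(self, n_nodes, n_obs, n_actions, rng=None):
        self.rng = rng or np.random.default_rng()
        self.theta_psi = np.zeros((n_nodes, n_obs, n_nodes))
        self.theta_eta = np.zeros((n_nodes, n_actions))
        self.node = 0  # current internal state n_t

    @staticmethod
    def _softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def step(self, observation):
        # update internal state from (n_t, o_t), then pick an action from n_{t+1}
        p_next = self._softmax(self.theta_psi[self.node, observation])
        self.node = self.rng.choice(len(p_next), p=p_next)
        p_act = self._softmax(self.theta_eta[self.node])
        return self.rng.choice(len(p_act), p=p_act)
```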
25. Learning as optimization
- Choose a point θ in policy space
- Evaluate θ
- Improve θ
26. Learning algorithm
- Choose a policy vector θ = <θ_1, θ_2, ..., θ_k>
- Evaluate θ (by following policy θ several times)
- Improve θ (by changing each θ_i according to some credit)
27. Policy evaluation
Estimate the expected return of policy μ, parameterized by θ, by sampling: gather experience by following the policy and build an estimator from it.
28. Gradient descent
Adjust every parameter θ_i, i = 1..n, along the gradient of the expected return:
θ_i ← θ_i + α ∂E_θ[Return] / ∂θ_i
29. Policy improvement
Williams, 92
Optimize the expected return by stochastic gradient descent, estimating the gradient from sampled experience.
Finds a locally optimal policy.
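For reference, the likelihood-ratio (REINFORCE) form of this gradient, which the Williams, 92 citation points to; the notation π_θ(a_t | o_t) for the action-selection rule is mine, not recovered from the slide:

\[
\nabla_\theta \mathbb{E}_\theta[\mathrm{Return}]
 = \sum_h \mathrm{Return}(h)\,\nabla_\theta \Pr\nolimits_\theta(h)
 = \mathbb{E}_\theta\Big[\mathrm{Return}(h)\sum_t \nabla_\theta \log \pi_\theta(a_t \mid o_t)\Big]
\]

The stochastic version replaces the expectation with an average over sampled histories.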
30. Algorithm for reactive policy
Look-up table: one parameter θ_oa per (o, a) pair.
Action selection rule and its per-step contribution to the gradient (a reconstruction of both is sketched below).
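The two formulas did not survive extraction; a plausible reconstruction, assuming the soft-max action-choosing rule that the routing slides mention explicitly:

\[
\Pr(a \mid o, \theta) = \frac{e^{\theta_{oa}}}{\sum_{a'} e^{\theta_{oa'}}},
\qquad
\frac{\partial \log \Pr(a \mid o, \theta)}{\partial \theta_{oa'}} = \mathbf{1}[a' = a] - \Pr(a' \mid o, \theta)
\]

Summed over an episode, these per-step contributions give exactly the counter expression N_oa − Pr(a | o, θ) N_o used in the update on the next slide.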
31. Algorithm for reactive policy
Peshkin, Meuleau, Kaelbling, ICML-99
- Initialize controller weights θ_oa
- Initialize counters N_o, N_oa and the return R
- At each time step t:
  - a. Draw action a_t from Pr(a_t | o_t, θ)
  - b. Increment N_o, N_oa; R ← R + γ^t r_t
- At the end of the episode, update for all (o, a):
  - θ_oa ← θ_oa + α R (N_oa − Pr(a | o, θ) N_o)
- Loop
The term (N_oa − Pr(a | o, θ) N_o) measures surprise: how much more often action a was taken in o than the policy predicts.
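A minimal Python sketch of this update, assuming a soft-max lookup-table policy and a generic environment whose reset() returns an integer observation and whose step(a) returns (obs, reward, done); the interface and all names are my assumptions, not the paper's code.

```python
import numpy as np

def reinforce_reactive(env, n_obs, n_actions, episodes=1000,
                       alpha=0.01, gamma=0.99, rng=None):
    """Episodic REINFORCE with a lookup-table reactive policy (sketch)."""
    rng = rng or np.random.default_rng()
    theta = np.zeros((n_obs, n_actions))          # one theta_oa per (o, a) pair

    for _ in range(episodes):
        N_o = np.zeros(n_obs)                     # visit counters
        N_oa = np.zeros((n_obs, n_actions))
        R, discount = 0.0, 1.0

        obs, done = env.reset(), False
        while not done:
            # a. draw action from Pr(a | o, theta): soft-max over the row
            p = np.exp(theta[obs] - theta[obs].max())
            p /= p.sum()
            a = rng.choice(n_actions, p=p)
            # b. increment counters and accumulate the discounted return
            N_o[obs] += 1
            N_oa[obs, a] += 1
            obs, r, done = env.step(a)            # assumed interface
            R += discount * r
            discount *= gamma

        # update for all (o, a): theta_oa += alpha * R * (N_oa - Pr(a|o) * N_o)
        probs = np.exp(theta - theta.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        theta += alpha * R * (N_oa - probs * N_o[:, None])

    return theta
```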
32. Slow learning!
Peshkin, Meuleau, Kim, Kaelbling, 00; Meuleau, Peshkin, Kaelbling, 99
[Learning curves: learning distributed control in simulated soccer, and learning with FSCs on pole and cart; both require a large number of trials.]
33. Issues
- Learning takes lots of experience.
- We are wasting experience! Could we re-use it?
- Crucial dependence on the complexity of the controller.
  - How to choose the right one (number of states in the FSC)?
  - How to initialize the controller?
  - Combine with supervised learning?
34. Wasting experience
[Figure: a single policy θ in the policy space Θ]
35. Wasting experience
[Figure: two policies θ in the policy space Θ]
36. Evaluation by re-using data
[Figure: policies θ_1, θ_2, ..., θ_k scattered in the policy space Θ]
Given experiences under other policies θ_1, θ_2, ..., θ_k, evaluate an arbitrary policy θ.
37. Likelihood ratio enables re-use
Experience: a history h.
The Markov property warrants a factorization of Pr(h), so we can calculate the part of it that depends on the policy.
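In standard likelihood-ratio form (my notation, not recovered from the slide; written for a reactive policy, with an FSC the internal-state transitions join the policy factor):

\[
\Pr\nolimits_\theta(h) =
\underbrace{\Pr(s_0)\prod_t \Pr(o_t \mid s_t)\,\Pr(s_{t+1} \mid s_t, a_t)}_{\text{environment: unknown, independent of }\theta}
\;\prod_t \pi_\theta(a_t \mid o_t),
\qquad
\frac{\Pr_\theta(h)}{\Pr_{\theta'}(h)} = \prod_t \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta'}(a_t \mid o_t)}
\]

The environment terms cancel in the ratio, so only the policy terms are needed.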
38. Likelihood ratio sampling
Naïve sampling: draw histories from the policy being evaluated and average their returns.
Weighted sampling: draw histories from a different policy and weight each return by the likelihood ratio.
39. Likelihood ratio estimator
Accumulate experiences by following a sampling policy π.
Naïve sampling uses only the experience generated by the evaluated policy; weighted sampling re-weights every experience by the likelihood ratio.
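A sketch of the weighted (importance-sampling) estimator, assuming each stored history is a list of (observation, action) pairs and `pi(theta, obs)` returns the action probabilities; both names are mine.

```python
import numpy as np

def likelihood_ratio_estimate(histories, returns, theta_eval, theta_sample, pi):
    """Estimate the expected return of theta_eval from histories drawn
    under theta_sample, reweighting each return by the likelihood ratio
    prod_t pi(a_t | o_t, theta_eval) / pi(a_t | o_t, theta_sample)."""
    weighted = []
    for h, ret in zip(histories, returns):
        log_w = 0.0
        for obs, act in h:                       # h = [(o_0, a_0), (o_1, a_1), ...]
            log_w += np.log(pi(theta_eval, obs)[act])
            log_w -= np.log(pi(theta_sample, obs)[act])
        weighted.append(np.exp(log_w) * ret)     # environment terms cancel in the ratio
    return float(np.mean(weighted))
```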
40. Outline
- Model of an agent learning from the environment
- Learning as a stochastic optimization problem
- Re-using the samples in policy evaluation
- Theoretical highlight:
  - sample complexity bounds for likelihood ratio estimation
- Empirical highlight:
  - adaptive network routing
41. Learning revisited
- Two problems:
  - Approximation
  - Optimization
42. Approximation
Valiant, 84
PAC-style question: with what confidence δ does the average return stay within deviation ε of the expected return?
- How many samples N do we need, as a function of ε, δ and
  - the maximal return V_max
  - the likelihood ratio bound η
  - the complexity K(Θ) of the policy space Θ
43. Complexity of the policy space
[Figure: policies θ_1, θ_2, ..., θ_k in the policy space Θ]
Evaluate an arbitrary policy θ, given experiences under other policies θ_1, ..., θ_k.
44Complexity of the policy space
K(Q) k means that any new policy q is close
to one of the cover set policies q1...qk
Policy space Q
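One standard way to formalize this as a covering number; the precise metric used in the COLT-01 paper is not recoverable from the slide, so d is a placeholder:

\[
K_\epsilon(\Theta) = \min\Big\{ k : \exists\, \theta_1, \dots, \theta_k \in \Theta
  \;\;\text{s.t.}\;\; \forall \theta \in \Theta,\; \min_i d(\theta, \theta_i) \le \epsilon \Big\}
\]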
45. Sample complexity result
Peshkin, Mukherjee, COLT-01
- How many samples N do we need, as a function of ε, δ and
  - the maximal return V_max
  - the likelihood ratio bound η
  - the complexity K(Θ) of the policy space Θ
- Given the sample size, calculate the confidence
- Given the sample size and confidence, choose the policy class
46. Comparison of bounds
Kearns, Ng, Mansour, NIPS-00 (KNM) vs. this work (PM)
KNM "reusable trajectories":
- Partial reuse: the estimate is built only on experience consistent with the evaluated policy
- Fixed sampling policy: all choices are made uniformly at random
- Linear in the VC(Θ) dimension, which is greater than the covering number K(Θ)
- Exponential dependency on the experience size T in the general case
47Sample complexity result
- Proof is done by using concentration
inequalities bounds on deviations from function
expectations McDiarmidBernstein - How far is weighted average return from expected
return? - We obtained better result by bounding likelihood
ratio through log regret in sequence guessing
game Cesa-Bianchi, Lugosi99
48. Outline
- Model of an agent learning from the environment
- Learning as a stochastic optimization problem
- Re-using the experience in policy evaluation
- Theoretical highlight:
  - sample complexity bounds for likelihood ratio estimation
- Empirical highlight:
  - adaptive network routing
49. Adaptive routing
50. Adaptive routing: a problem
[Figure: a network with a source node S and a target node T]
51. Adaptive routing: a problem
- Observation: destination node
- Action: the next node in the route
- Remove packet: delivered or defunct
- Process: takes a unit of time per packet
- Forward: takes a unit of time per hop
- Reward: inverse average routing time
- Shaping: penalize for loops in the route
- Inject: source and target chosen uniformly; Poisson process, parameter 0.5
52. Adaptive routing: an algorithm
Peshkin, Meuleau, Kaelbling, UAI-00
- Algorithm: distributed gradient descent; biologically plausible, local, reactive policy (sketched below)
- Policy: soft-max action-choosing rule, one lookup table per destination
- Learning rate and temperature: constant
- Coordination: identical reinforcement distributed by acknowledgement packets
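A per-node sketch of what such a controller could look like, with a soft-max lookup table indexed by destination; the class, the eligibility-trace bookkeeping, and the reward plumbing are illustrative assumptions, not the UAI-00 implementation.

```python
import numpy as np

class RoutingNode:
    """One router learning a reactive soft-max policy over its neighbours (sketch)."""

    def __init__(self, neighbours, n_destinations, alpha=0.01, temperature=1.0, rng=None):
        self.neighbours = neighbours                      # list of neighbour ids
        self.alpha = alpha
        self.temperature = temperature
        self.rng = rng or np.random.default_rng()
        # one row of parameters per destination, one column per neighbour
        self.theta = np.zeros((n_destinations, len(neighbours)))
        self.trace = np.zeros_like(self.theta)            # eligibility of recent choices

    def _probs(self, dest):
        z = np.exp((self.theta[dest] - self.theta[dest].max()) / self.temperature)
        return z / z.sum()

    def route(self, dest):
        """Observation = packet destination; action = next hop."""
        p = self._probs(dest)
        choice = self.rng.choice(len(self.neighbours), p=p)
        # remember the choice so later reinforcement can credit it
        self.trace[dest] += np.eye(len(self.neighbours))[choice] - p
        return self.neighbours[choice]

    def reinforce(self, r):
        """Apply the reinforcement carried back by acknowledgement packets."""
        self.theta += self.alpha * r * self.trace
        self.trace[:] = 0.0
```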
53Performance comparison
Shortest path, considering loads industrys
favorite optimal (deterministic) policy relies
on global info
Q-route assigns value of estimated routing time
to every (destination,action) pair sends along
estimated best route (deterministically)
Boyan,Littman
PolSearch (my personal favorite)
54. Performance comparison
[Plot: routing performance vs. network load, load levels 0.5 to 3.0]
55. Regular 6x6 network
56. Modified 6x6 network
57. Performance comparison
[Plot: routing performance vs. network load, load levels 0.5 to 3.0]
58. Performance comparison
[Plot: routing performance vs. network load, load levels 0.5 to 3.0]
59. Theses and Conclusion
- Policy search is often the only option available
- It performs reasonably well on sample domains
- Performance depends on the policy encoding
- Building controllers is an art; it must become a science
- Complexity-driven choice of controllers
- Global search by constructive covering of the space
60. Why do you need RL?
- Information extraction, web spider index (McCallum @ CMU, Baum @ NECI, Kushmerick @ Dublin U)
- Selective visual attention (Mahadevan @ UMass); foveation for target detection (Schmidhuber @ IDSIA)
- Vision and NLP: sequential processing via routines
61. Contributions
- Learning with external memory. (Peshkin, Meuleau, Kaelbling, ICML-99)
- Learning with finite-state controllers. (Meuleau, Peshkin, Kim, Kaelbling, UAI-99)
- Equivalence of centralized and distributed gradient ascent; relating optima to Nash equilibria. (Peshkin, Meuleau, Kaelbling, UAI-00)
- Likelihood ratio estimation for experience re-use. (Peshkin, Shelton, ICML-02)
- Sample complexity bounds for likelihood ratio estimation. (Peshkin, Mukherjee, COLT-01)
- Empirical results for several domains:
  - adaptive routing (Peshkin, Savova, IJCNN-02)
  - pole balancing
  - load-unload
  - simulated soccer
62. Acknowledgments
- I have benefited from technical interactions with many people, including:
  - Tom Dean, John Tsitsiklis, Nicolas Meuleau, Christian Shelton, Kee-Eung Kim, Luis Ortiz, Theos Evgeniou, Mike Schwarz
  - and Leslie Kaelbling
63. Multi-Agent Learning
[Figure: the environment as a partially observable Markov game with states s_t, s_{t+1}, emitting observations o^1_{t+1} and o^2_{t+1} to two agents]
64. Multi-Agent Learning
[Figure: Agent 1 (FSC with internal states n^1_t, n^1_{t+1}) and Agent 2 (FSC with internal state n^2_t) both interact with the environment, a partially observable Markov game with states s_t, s_{t+1} and observations o^1_{t+1}, o^2_{t+1}]
65. Multi-Agent Learning
Peshkin, Meuleau, Kaelbling, UAI-00
Joint gradient descent on the full history
h = <o^1_0, o^2_0, n^1_0, n^2_0, a^1_0, a^2_0, r_0, ..., o^1_t, o^2_t, n^1_t, n^2_t, a^1_t, a^2_t, r_t, ...>
is equivalent to distributed gradient descent, with each agent using only its own part of the history.