Title: Reinforcement Learning
1 Reinforcement Learning
- Kenji Doya
- doya_at_atr.co.jp
- ATR Human Information Science Laboratories
- CREST, Japan Science and Technology Corporation
2 Outline
- Introduction to Reinforcement Learning (RL)
- Markov decision process (MDP)
- Current topics
- RL in Continuous Space and Time
- Model-free and model-based approaches
- Learning to Stand Up
- Discrete plans and continuous control
- Modular Decomposition
- Multiple model-based RL (MMRL)
3 Learning to Walk (Doya & Nakano, 1985)
- Action cycle of 4 postures
- Reward: speed sensor output
- Multiple solutions: creeping, jumping, …
4 Markov Decision Process (MDP)
- Environment
- dynamics P(s′|s,a)
- reward P(r|s,a)
- Agent
- policy P(a|s)
- Goal: maximize cumulative future rewards
- E[ r(t+1) + γ r(t+2) + … ]
- 0 ≤ γ < 1: discount factor
5 Value Function and TD Error
- State value function
- V(s) = E[ r(t+1) + γ r(t+2) + … | s(t) = s, P(a|s) ]
- 0 ≤ γ < 1: discount factor
- Consistency condition
- δ(t) = r(t) + γ V(s(t)) − V(s(t−1)) = 0
- (new estimate − old estimate)
- Dual role of the temporal difference (TD) error δ(t)
- Reward prediction: δ(t) → 0 on average
- Action selection: δ(t) > 0 means better than average (sketched in code below)
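As a concrete illustration of the TD error driving value learning, here is a minimal sketch of tabular TD(0) on a toy chain environment. The environment, learning rate, and all parameter values are illustrative assumptions, not from the slides.

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states)
rng = np.random.default_rng(0)

def step(s):
    """Toy environment: drift right, reward 1 on reaching the last state."""
    s_next = min(s + int(rng.integers(0, 2)), n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

for episode in range(200):
    s = 0
    while s < n_states - 1:
        s_next, r = step(s)
        delta = r + gamma * V[s_next] - V[s]   # TD error delta(t)
        V[s] += alpha * delta                  # update proportional to delta(t)
        s = s_next
```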
6 Example: Navigation
7 Actor-Critic Architecture
[Diagram: the environment sends state s and reward r to the agent; the critic V(s) computes the TD error δ, which trains both the critic and the actor P(a|s); the actor sends action a back to the environment]
- Critic: future reward prediction
- value update ΔV(s(t−1)) ∝ δ(t)
- Actor: action reinforcement
- increase P(a(t−1)|s(t−1)) if δ(t) > 0 (see the sketch below)
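A minimal tabular actor-critic sketch along the lines of the diagram above: the critic updates V(s) by the TD error, and the actor raises the preference of the previous action when δ(t) > 0. The chain environment and softmax-preference parameterization are assumptions for illustration.

```python
import numpy as np

n_states, n_actions = 6, 2
gamma, alpha_v, alpha_p = 0.9, 0.1, 0.1
V = np.zeros(n_states)
pref = np.zeros((n_states, n_actions))        # actor preferences
rng = np.random.default_rng(1)

def policy(s):
    """Softmax P(a|s) over the actor's preferences."""
    p = np.exp(pref[s] - pref[s].max())
    p /= p.sum()
    return int(rng.choice(n_actions, p=p))

def step(s, a):
    """Move right (a=1) or left (a=0); reward 1 at the right end."""
    s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

for episode in range(300):
    s = 0
    while s < n_states - 1:
        a = policy(s)
        s_next, r = step(s, a)
        delta = r + gamma * V[s_next] - V[s]  # TD error
        V[s] += alpha_v * delta               # critic update
        pref[s, a] += alpha_p * delta         # actor: reinforce if delta > 0
        s = s_next
```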
8 Q-Learning
- Action value function
- Q(s,a) = E[ r(t+1) + γ r(t+2) + … | s(t) = s, a(t) = a, P(a|s) ] = E[ r(t+1) + γ V(s(t+1)) | s(t) = s, a(t) = a ]
- Action selection
- a(t) = argmax_a Q(s(t),a) with prob. 1 − ε
- Update (both variants sketched below)
- Q-learning: Q(s(t),a(t)) ← r(t+1) + γ max_a Q(s(t+1),a)
- SARSA: Q(s(t),a(t)) ← r(t+1) + γ Q(s(t+1),a(t+1))
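The two update targets above differ only in how the next action enters: the first is the Q-learning (off-policy) target with a max over next actions, the second the SARSA (on-policy) target with the action actually taken. A sketch of both, with a learning rate α added as an assumption:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, gamma=0.9, alpha=0.1):
    """Off-policy: bootstrap from the greedy next action."""
    target = r + gamma * np.max(Q[s_next])        # max_a' Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, gamma=0.9, alpha=0.1):
    """On-policy: bootstrap from the action actually taken."""
    target = r + gamma * Q[s_next, a_next]        # Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])

def epsilon_greedy(Q, s, rng, eps=0.1):
    """a = argmax_a Q(s,a) with probability 1 - eps, else random."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))
```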
9 Dynamic Programming and RL
- Dynamic Programming
- given models P(s′|s,a) and P(r|s,a)
- off-line solution of the Bellman equation (see the value-iteration sketch below)
- V(s) = max_a [ Σ_r r P(r|s,a) + γ Σ_s′ V(s′) P(s′|s,a) ]
- Reinforcement Learning
- on-line learning with the TD error
- δ(t) = r(t) + γ V(s(t)) − V(s(t−1))
- ΔV(s(t−1)) = α δ(t)
- ΔQ(s(t−1),a(t−1)) = α δ(t)
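A sketch of the off-line DP side: value iteration repeatedly applies the Bellman backup until convergence. The models P and R below are randomly generated placeholders, standing in for the given P(s′|s,a) and expected rewards.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s,a,s']
R = rng.random((n_states, n_actions))                             # E[r|s,a]

V = np.zeros(n_states)
for sweep in range(1000):
    # Bellman backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    V_new = np.max(R + gamma * (P @ V), axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
```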
10 Model-free and Model-based RL
- Model-free: e.g., learn action values
- Q(s,a) ← r(s,a) + γ Q(s′,a′)
- a = argmax_a Q(s,a)
- Model-based: forward model P(s′|s,a)
- action selection (sketched below)
- a = argmax_a E[ R(s,a) + γ Σ_s′ V(s′) P(s′|s,a) ]
- simulation: learn V(s) and/or Q(s,a) off-line
- dynamic programming: solve the Bellman equation
- V(s) = max_a E[ R(s,a) + γ Σ_s′ V(s′) P(s′|s,a) ]
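For the model-based action-selection rule above, a one-step lookahead sketch; the arrays P, R, and V are assumed to be available elsewhere (e.g., from the value-iteration sketch earlier):

```python
import numpy as np

def greedy_model_based_action(s, P, R, V, gamma=0.9):
    """a = argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ].
    P[s,a,s'] is the forward model, R[s,a] the reward model."""
    lookahead = R[s] + gamma * (P[s] @ V)   # one value per action
    return int(np.argmax(lookahead))
```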
11 Current Topics
- Convergence proofs
- with function approximators
- Learning with hidden states (POMDP)
- estimate belief states
- reactive, stochastic policy
- parameterized finite-state policies
- Hierarchical architectures
- learn to select fixed sub-modules
- train sub-modules
- both
12 Partially Observable Markov Decision Process (POMDP)
- Update the belief state
- observation P(o|s) is not the identity
- belief state b = (P(s1), P(s2), …): real-valued
- P(s_k|o) ∝ P(o|s_k) Σ_i P(s_k|s_i,a) P(s_i) (see the sketch below)
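A direct transcription of the belief update into code; the transition and observation models are placeholder arrays supplied by the caller:

```python
import numpy as np

def belief_update(b, a, o, P_trans, P_obs):
    """b: current belief over states; P_trans[a][i,k] = P(s_k|s_i,a);
    P_obs[k,o] = P(o|s_k). Returns the normalized posterior belief."""
    predicted = b @ P_trans[a]          # sum_i P(s_k|s_i,a) P(s_i)
    posterior = P_obs[:, o] * predicted # times the likelihood P(o|s_k)
    return posterior / posterior.sum()
```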
13 Tiger Problem (Kaelbling et al., 1998)
- State: a tiger is behind the left or the right door
- Actions: left, right, listen
- Observation: which side the tiger is on, with 15% error
- Policy: policy tree / finite-state policy (a belief-update example follows)
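As a usage example, the tiger problem with the slide's 15% observation error, using the belief_update sketch above; the "listen" action leaves the hidden state unchanged:

```python
import numpy as np

# Two states: 0 = tiger-left, 1 = tiger-right. Listening is correct
# with probability 0.85 (15% error), as on the slide.
P_trans = {"listen": np.eye(2)}
P_obs = np.array([[0.85, 0.15],    # tiger-left:  P(hear-left), P(hear-right)
                  [0.15, 0.85]])   # tiger-right

b = np.array([0.5, 0.5])           # uniform initial belief
b = belief_update(b, "listen", 0, P_trans, P_obs)   # heard "left"
print(b)                           # -> [0.85, 0.15]
```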
14 Outline
- Introduction to Reinforcement Learning (RL)
- Markov decision process (MDP)
- Current topics
- RL in Continuous Space and Time
- Model-free and model-based approaches
- Learning to Stand Up
- Discrete plans and continuous control
- Modular Decomposition
- Multiple model-based RL (MMRL)
15 Why Continuous?
- Analog control problems
- discretization → poor control performance
- how to discretize?
- Better theoretical properties
- differential algorithms
- use of local linear models
16 Continuous TD Learning
- Dynamics: dx/dt = f(x(t), u(t))
- Value function: V(x(t)) = E[ ∫_t^∞ e^{−(s−t)/τ} r(x(s), u(s)) ds ]
- TD error: δ(t) = r(t) − (1/τ) V(x(t)) + dV(x(t))/dt (sketched below)
- Discount factor: the time constant τ plays the role of γ (γ ↔ e^{−Δt/τ})
- Policy: value-gradient based, using ∂V/∂x and the input gain ∂f/∂u
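A minimal sketch of the continuous-time TD error under Euler discretization, with dV/dt approximated by a backward difference; the time step dt and time constant τ are assumed values, and the value estimates are taken to come from whatever function approximator is in use:

```python
# Euler-discretized continuous TD error:
#   delta(t) = r(t) - (1/tau) * V(x(t)) + dV(x(t))/dt
tau, dt = 1.0, 0.02

def continuous_td_error(r, v_prev, v_now):
    """TD error from reward r and consecutive value estimates
    V(x(t - dt)) and V(x(t))."""
    dv_dt = (v_now - v_prev) / dt
    return r - v_now / tau + dv_dt
```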
17 On-line Learning of the State Value
- state x = (angle, angular velocity)
- learned value function V(x)
18 Example: Cart-pole Swing-up
- Reward: height of the tip of the pole
- Punishment: crashing into the wall
19 Fast Learning by Internal Models
- Pole balancing (Stefan Schaal, USC)
- Forward model of pole dynamics
- Inverse model of arm dynamics
20 Internal Models for Planning
- Devil sticking (Chris Atkeson, CMU)
21 Outline
- Introduction to Reinforcement Learning (RL)
- Markov decision process (MDP)
- Current topics
- RL in Continuous Space and Time
- Model-free and model-based approaches
- Learning to Stand Up
- Discrete plans and continuous control
- Modular Decomposition
- Multiple model-based RL (MMRL)
22 Need for a Hierarchical Architecture
- Performance of control
- many high-precision sensors and actuators
- prohibitively long time for learning
- Speed of learning
- search in a low-dimensional, low-resolution space
23 Learning to Stand Up (Morimoto & Doya, 1998)
- Reward: height of the head
- Punishment: tumbling
- State: pitch and joint angles and their derivatives
- Simulation: many thousands of trials to learn
24 Hierarchical Architecture
- Upper level
- discrete state/time
- kinematics
- action: subgoals
- reward: total task
- Lower level
- continuous state/time
- dynamics
- action: motor torque
- reward: achieving subgoals
[Diagram: the upper level learns Q(S,A) over a sequence of subgoals; the lower level learns V(s) and actions for each subgoal g(s)]
25 Learning in Simulation
- Upper level: subgoals
- Lower level: control
- [Video: early learning vs. after 700 trials]
26 Learning with Real Hardware (Morimoto & Doya, 2001)
- after simulation
- after 100 physical trials
- adaptation by the lower control modules
27 Outline
- Introduction to Reinforcement Learning (RL)
- Markov decision process (MDP)
- Current topics
- RL in Continuous Space and Time
- Model-free and model-based approaches
- Learning to Stand Up
- Discrete plans and continuous control
- Modular Decomposition
- Multiple model-based RL (MMRL)
28 Modularity in Motor Learning
- Fast De-adaptation and Re-adaptation
- switching rather than re-learning
- Combination of Learned Modules
- serial/parallel/sigmoidal mixture
29 Soft Switching of Adaptive Modules
- Hard switching based on prediction errors (Narendra et al., 1995)
- can result in sub-optimal task decomposition with initially poor prediction models
- Soft switching by a softmax of prediction errors (Wolpert and Kawato, 1998)
- can use annealing for optimal decomposition (Pawelzik et al., 1996)
30 Responsibility by Competition
- each module predicts the state change
- responsibility λ_i: softmax of the modules' prediction errors (sketched below)
- λ_i weights both each module's output and its learning rate
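A sketch of the responsibility computation: a softmax over the modules' squared prediction errors, with a width parameter σ added as an assumption; the resulting λ_i gate both the modules' outputs and their learning rates.

```python
import numpy as np

def responsibilities(x_next, predictions, sigma=0.1):
    """Softmax competition: modules with smaller prediction error on the
    observed state change x_next get larger responsibility lambda_i."""
    errors = np.array([np.sum((x_next - p) ** 2) for p in predictions])
    z = np.exp(-errors / (2 * sigma ** 2))
    return z / z.sum()     # lambda_i, sums to 1

# The blended control output is then u = sum_i lambda_i * u_i, and each
# module's parameter update is scaled by its lambda_i.
```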
31 Multiple Linear Quadratic Controllers
- Linear dynamic models (one per module)
- Quadratic reward models
- Value functions: quadratic, obtained by solving a Riccati equation
- Action outputs: linear feedback, blended by the responsibilities (see the sketch below)
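A sketch of a single module of such an architecture, using SciPy's continuous-time Riccati solver: a local linear model and quadratic costs yield a quadratic value function and a linear feedback action. All matrices below are placeholders, not the models from the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [1.0, 0.0]])   # local linearization dx/dt = Ax + Bu
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                             # quadratic state cost
R = np.array([[0.1]])                     # quadratic action cost

P = solve_continuous_are(A, B, Q, R)      # cost-to-go x'Px, i.e. value -x'Px
K = np.linalg.solve(R, B.T @ P)           # feedback gain

def module_action(x):
    """Linear feedback u_i(x) = -K x, to be blended by responsibility."""
    return -K @ x
```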
32 Swing-up Control of a Pendulum
- [Plot: red = module 1, green = module 2]
33 Non-linearity and Non-stationarity
- Specialization by predictability in space and time
34 Swing-up Control of an Acrobot
- Reward: height of the center of mass
- Linearized around four fixed points
35 Swing-up Motions
36 Module Switching
- trajectories x(t) for R = 0.001 and R = 0.002
- responsibility λ_i: a symbol-like representation
- module sequences 1-2-1-2-1-3-4-1-3-4-3-4 and 1-2-1-2-1-2-1-3-4-1-3-4
37 Stand Up by Multiple Modules
- Seven locally linear models
38 Segmentation of Observed Trajectory
- Predicted motor output
- Predicted state change
- Predicted responsibility
39 Imitation of Acrobot Swing-up
- initial conditions: q1(0) = π/12, q1(0) = π/6, q1(0) = π/12 (imitation)
40 Outline
- Introduction to Reinforcement Learning (RL)
- Markov decision process (MDP)
- Current topics
- RL in Continuous Space and Time
- Model-free and model-based approaches
- Learning to Stand Up
- Discrete plans and continuous control
- Modular Decomposition
- Multiple model-based RL (MMRL)
41 Future Directions
- Autonomous learning agents
- Tuning of meta-parameters
- Design of rewards
- Selection of necessary/sufficient state coding
- Neural mechanisms of RL
- Dopamine neurons: encoding the TD error
- Basal ganglia: value-based action selection
- Cerebellum: internal models
- Cerebral cortex: modular decomposition
42 What is Reward for a Robot?
- Should be grounded in
- Self-preservation: self-recharging
- Self-reproduction: copying the control program
- Cyber Rodent
43 The Cyber Rodent Project
- Learning mechanisms under the realistic constraints of self-preservation and self-reproduction
- acquisition of task-oriented internal representations
- metalearning algorithms
- constraints of finite time and energy
- mechanisms for collaborative behaviors
- roles of communication
- abstract/emotional vs. concrete/symbolic
- gene exchange rules for evolution
44 Input/Output
- Sensory
- CCD camera
- range sensor
- IR proximity sensors ×8
- acceleration/gyro sensors
- microphones ×2
- Motor
- two wheels
- jaw
- R/G/B LED
- speaker
45 Computation/Communication
- CPU: Hitachi SH-4
- FPGA: image processor
- I/O modules
- Communication
- IR port
- wireless LAN
- Software
- learning/evolution
- dynamic simulation