Title: Reinforcement Learning
1 Reinforcement Learning
- Kenji Doya
- doya_at_atr.co.jp
- ATR Human Information Science Laboratories
- CREST, Japan Science and Technology Corporation
2 Outline
- Introduction to Reinforcement Learning (RL)
- Markov decision process (MDP)
- Current topics
- RL in Continuous Space and Time
- Model-free and model-based approaches
- Learning to Stand Up
- Discrete plans and continuous control
- Modular Decomposition
- Multiple model-based RL (MMRL)
3 Learning to Walk (Doya & Nakano, 1985)
- Action cycle of 4 postures
- Reward: speed sensor output
- Multiple solutions: creeping, jumping, …
4 Markov Decision Process (MDP)
- Environment
- dynamics P(s′|s,a)
- reward P(r|s,a)
- Agent
- policy P(a|s)
- Goal: maximize cumulative future rewards
- E[ r(t+1) + γ r(t+2) + … ]
- 0 ≤ γ < 1: discount factor
5 Value Function and TD Error
- State value function
- V(s) = E[ r(t+1) + γ r(t+2) + … | s(t) = s, P(a|s) ]
- 0 ≤ γ < 1: discount factor
- Consistency condition
- δ(t) = r(t) + γ V(s(t)) − V(s(t−1)) = 0
- (new estimate − old estimate)
- Dual role of the temporal difference (TD) error δ(t)
- Reward prediction: δ(t) → 0 on average
- Action selection: δ(t) > 0 means better than average (sketched in code below)
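As a concrete illustration of the TD error driving value learning, here is a minimal sketch of tabular TD(0) on a toy chain environment. The environment, learning rate, and all parameter values are illustrative assumptions, not from the slides.

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states)
rng = np.random.default_rng(0)

def step(s):
    """Toy environment: drift right, reward 1 on reaching the last state."""
    s_next = min(s + int(rng.integers(0, 2)), n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

for episode in range(200):
    s = 0
    while s < n_states - 1:
        s_next, r = step(s)
        delta = r + gamma * V[s_next] - V[s]   # TD error delta(t)
        V[s] += alpha * delta                  # update proportional to delta(t)
        s = s_next
```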
6 Example: Navigation
7 Actor-Critic Architecture
[Diagram: the environment sends state s and reward r to the agent; the critic V(s) computes the TD error δ, which trains both the critic and the actor P(a|s); the actor sends action a back to the environment]
- Critic: future reward prediction
- value update ΔV(s(t−1)) ∝ δ(t)
- Actor: action reinforcement
- increase P(a(t−1)|s(t−1)) if δ(t) > 0 (see the sketch below)
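A minimal tabular actor-critic sketch along the lines of the diagram above: the critic updates V(s) by the TD error, and the actor raises the preference of the previous action when δ(t) > 0. The chain environment and softmax-preference parameterization are assumptions for illustration.

```python
import numpy as np

n_states, n_actions = 6, 2
gamma, alpha_v, alpha_p = 0.9, 0.1, 0.1
V = np.zeros(n_states)
pref = np.zeros((n_states, n_actions))        # actor preferences
rng = np.random.default_rng(1)

def policy(s):
    """Softmax P(a|s) over the actor's preferences."""
    p = np.exp(pref[s] - pref[s].max())
    p /= p.sum()
    return int(rng.choice(n_actions, p=p))

def step(s, a):
    """Move right (a=1) or left (a=0); reward 1 at the right end."""
    s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

for episode in range(300):
    s = 0
    while s < n_states - 1:
        a = policy(s)
        s_next, r = step(s, a)
        delta = r + gamma * V[s_next] - V[s]  # TD error
        V[s] += alpha_v * delta               # critic update
        pref[s, a] += alpha_p * delta         # actor: reinforce if delta > 0
        s = s_next
```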
8 Q-Learning
- Action value function
- Q(s,a) = E[ r(t+1) + γ r(t+2) + … | s(t) = s, a(t) = a, P(a|s) ] = E[ r(t+1) + γ V(s(t+1)) | s(t) = s, a(t) = a ]
- Action selection
- a(t) = argmax_a Q(s(t),a) with prob. 1 − ε
- Update (both variants sketched below)
- Q-learning: Q(s(t),a(t)) ← r(t+1) + γ max_a Q(s(t+1),a)
- SARSA: Q(s(t),a(t)) ← r(t+1) + γ Q(s(t+1),a(t+1))
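The two update targets above differ only in how the next action enters: the first is the Q-learning (off-policy) target with a max over next actions, the second the SARSA (on-policy) target with the action actually taken. A sketch of both, with a learning rate α added as an assumption:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, gamma=0.9, alpha=0.1):
    """Off-policy: bootstrap from the greedy next action."""
    target = r + gamma * np.max(Q[s_next])        # max_a' Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, gamma=0.9, alpha=0.1):
    """On-policy: bootstrap from the action actually taken."""
    target = r + gamma * Q[s_next, a_next]        # Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])

def epsilon_greedy(Q, s, rng, eps=0.1):
    """a = argmax_a Q(s,a) with probability 1 - eps, else random."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))
```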
9 Dynamic Programming and RL
- Dynamic Programming
- given models P(s′|s,a) and P(r|s,a)
- off-line solution of the Bellman equation (see the value-iteration sketch below)
- V(s) = max_a [ Σ_r r P(r|s,a) + γ Σ_s′ V(s′) P(s′|s,a) ]
- Reinforcement Learning
- on-line learning with the TD error
- δ(t) = r(t) + γ V(s(t)) − V(s(t−1))
- ΔV(s(t−1)) = α δ(t)
- ΔQ(s(t−1),a(t−1)) = α δ(t)
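A sketch of the off-line DP side: value iteration repeatedly applies the Bellman backup until convergence. The models P and R below are randomly generated placeholders, standing in for the given P(s′|s,a) and expected rewards.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s,a,s']
R = rng.random((n_states, n_actions))                             # E[r|s,a]

V = np.zeros(n_states)
for sweep in range(1000):
    # Bellman backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    V_new = np.max(R + gamma * (P @ V), axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
```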
10 Model-free and Model-based RL
- Model-free: e.g., learn action values
- Q(s,a) ← r(s,a) + γ Q(s′,a′)
- a = argmax_a Q(s,a)
- Model-based: forward model P(s′|s,a)
- action selection (sketched below)
- a = argmax_a E[ R(s,a) + γ Σ_s′ V(s′) P(s′|s,a) ]
- simulation: learn V(s) and/or Q(s,a) off-line
- dynamic programming: solve the Bellman equation
- V(s) = max_a E[ R(s,a) + γ Σ_s′ V(s′) P(s′|s,a) ]
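For the model-based action-selection rule above, a one-step lookahead sketch; the arrays P, R, and V are assumed to be available elsewhere (e.g., from the value-iteration sketch earlier):

```python
import numpy as np

def greedy_model_based_action(s, P, R, V, gamma=0.9):
    """a = argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ].
    P[s,a,s'] is the forward model, R[s,a] the reward model."""
    lookahead = R[s] + gamma * (P[s] @ V)   # one value per action
    return int(np.argmax(lookahead))
```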
11 Current Topics
- Convergence proofs
- with function approximators
- Learning with hidden states (POMDP)
- estimate belief states
- reactive, stochastic policy
- parameterized finite-state policies
- Hierarchical architectures
- learn to select fixed sub-modules
- train sub-modules
- both
12 Partially Observable Markov Decision Process (POMDP)
- Update the belief state
- observation P(o|s) is not the identity
- belief state b = (P(s1), P(s2), …): real-valued
- P(s_k|o) ∝ P(o|s_k) Σ_i P(s_k|s_i,a) P(s_i) (see the sketch below)
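A direct transcription of the belief update into code; the transition and observation models are placeholder arrays supplied by the caller:

```python
import numpy as np

def belief_update(b, a, o, P_trans, P_obs):
    """b: current belief over states; P_trans[a][i,k] = P(s_k|s_i,a);
    P_obs[k,o] = P(o|s_k). Returns the normalized posterior belief."""
    predicted = b @ P_trans[a]          # sum_i P(s_k|s_i,a) P(s_i)
    posterior = P_obs[:, o] * predicted # times the likelihood P(o|s_k)
    return posterior / posterior.sum()
```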
13 Tiger Problem (Kaelbling et al., 1998)
- State: a tiger is behind the left or the right door
- Actions: left, right, listen
- Observation: which side the tiger is on, with 15% error
- Policy: policy tree / finite-state policy (a belief-update example follows)
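As a usage example, the tiger problem with the slide's 15% observation error, using the belief_update sketch above; the "listen" action leaves the hidden state unchanged:

```python
import numpy as np

# Two states: 0 = tiger-left, 1 = tiger-right. Listening is correct
# with probability 0.85 (15% error), as on the slide.
P_trans = {"listen": np.eye(2)}
P_obs = np.array([[0.85, 0.15],    # tiger-left:  P(hear-left), P(hear-right)
                  [0.15, 0.85]])   # tiger-right

b = np.array([0.5, 0.5])           # uniform initial belief
b = belief_update(b, "listen", 0, P_trans, P_obs)   # heard "left"
print(b)                           # -> [0.85, 0.15]
```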
14 Outline
- Introduction to Reinforcement Learning (RL)
- Markov decision process (MDP)
- Current topics
- RL in Continuous Space and Time
- Model-free and model-based approaches
- Learning to Stand Up
- Discrete plans and continuous control
- Modular Decomposition
- Multiple model-based RL (MMRL)
15 Why Continuous?
- Analog control problems
- discretization → poor control performance
- how to discretize?
- Better theoretical properties
- differential algorithms
- use of local linear models
16 Continuous TD Learning
- Dynamics: dx/dt = f(x(t), u(t))
- Value function: V(x(t)) = E[ ∫_t^∞ e^{−(s−t)/τ} r(x(s), u(s)) ds ]
- TD error: δ(t) = r(t) − (1/τ) V(x(t)) + dV(x(t))/dt (sketched below)
- Discount factor: the time constant τ plays the role of γ (γ ↔ e^{−Δt/τ})
- Policy: value-gradient based, using ∂V/∂x and the input gain ∂f/∂u
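A minimal sketch of the continuous-time TD error under Euler discretization, with dV/dt approximated by a backward difference; the time step dt and time constant τ are assumed values, and the value estimates are taken to come from whatever function approximator is in use:

```python
# Euler-discretized continuous TD error:
#   delta(t) = r(t) - (1/tau) * V(x(t)) + dV(x(t))/dt
tau, dt = 1.0, 0.02

def continuous_td_error(r, v_prev, v_now):
    """TD error from reward r and consecutive value estimates
    V(x(t - dt)) and V(x(t))."""
    dv_dt = (v_now - v_prev) / dt
    return r - v_now / tau + dv_dt
```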
17 On-line Learning of the State Value
- state x = (angle, angular velocity)
- learned value function V(x)
18 Example: Cart-pole Swing-up
- Reward: height of the tip of the pole
- Punishment: crashing into the wall
19 Fast Learning by Internal Models
- Pole balancing (Stefan Schaal, USC)
- Forward model of pole dynamics
- Inverse model of arm dynamics
20 Internal Models for Planning
- Devil sticking (Chris Atkeson, CMU)
21 Outline
- Introduction to Reinforcement Learning (RL)
- Markov decision process (MDP)
- Current topics
- RL in Continuous Space and Time
- Model-free and model-based approaches
- Learning to Stand Up
- Discrete plans and continuous control
- Modular Decomposition
- Multiple model-based RL (MMRL)
22 Need for a Hierarchical Architecture
- Performance of control
- many high-precision sensors and actuators
- prohibitively long time for learning
- Speed of learning
- search in a low-dimensional, low-resolution space
23 Learning to Stand Up (Morimoto & Doya, 1998)
- Reward: height of the head
- Punishment: tumbling
- State: pitch and joint angles and their derivatives
- Simulation: many thousands of trials to learn
24 Hierarchical Architecture
- Upper level
- discrete state/time
- kinematics
- action: subgoals
- reward: total task
- Lower level
- continuous state/time
- dynamics
- action: motor torque
- reward: achieving subgoals
[Diagram: the upper level learns Q(S,A) over a sequence of subgoals; the lower level learns V(s) and actions for each subgoal g(s)]
25 Learning in Simulation
- Upper level: subgoals
- Lower level: control
- [Video: early learning vs. after 700 trials]
26 Learning with Real Hardware (Morimoto & Doya, 2001)
- after simulation
- after 100 physical trials
- adaptation by the lower control modules
27 Outline
- Introduction to Reinforcement Learning (RL)
- Markov decision process (MDP)
- Current topics
- RL in Continuous Space and Time
- Model-free and model-based approaches
- Learning to Stand Up
- Discrete plans and continuous control
- Modular Decomposition
- Multiple model-based RL (MMRL)
28 Modularity in Motor Learning
- Fast De-adaptation and Re-adaptation
- switching rather than re-learning
- Combination of Learned Modules
- serial/parallel/sigmoidal mixture
29 Soft Switching of Adaptive Modules
- Hard switching based on prediction errors (Narendra et al., 1995)
- can result in sub-optimal task decomposition with initially poor prediction models
- Soft switching by a softmax of prediction errors (Wolpert and Kawato, 1998)
- can use annealing for optimal decomposition (Pawelzik et al., 1996)
30 Responsibility by Competition
- each module predicts the state change
- responsibility λ_i: softmax of the modules' prediction errors (sketched below)
- λ_i weights both each module's output and its learning rate
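A sketch of the responsibility computation: a softmax over the modules' squared prediction errors, with a width parameter σ added as an assumption; the resulting λ_i gate both the modules' outputs and their learning rates.

```python
import numpy as np

def responsibilities(x_next, predictions, sigma=0.1):
    """Softmax competition: modules with smaller prediction error on the
    observed state change x_next get larger responsibility lambda_i."""
    errors = np.array([np.sum((x_next - p) ** 2) for p in predictions])
    z = np.exp(-errors / (2 * sigma ** 2))
    return z / z.sum()     # lambda_i, sums to 1

# The blended control output is then u = sum_i lambda_i * u_i, and each
# module's parameter update is scaled by its lambda_i.
```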
31 Multiple Linear Quadratic Controllers
- Linear dynamic models (one per module)
- Quadratic reward models
- Value functions: quadratic, obtained by solving a Riccati equation
- Action outputs: linear feedback, blended by the responsibilities (see the sketch below)
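A sketch of a single module of such an architecture, using SciPy's continuous-time Riccati solver: a local linear model and quadratic costs yield a quadratic value function and a linear feedback action. All matrices below are placeholders, not the models from the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [1.0, 0.0]])   # local linearization dx/dt = Ax + Bu
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                             # quadratic state cost
R = np.array([[0.1]])                     # quadratic action cost

P = solve_continuous_are(A, B, Q, R)      # cost-to-go x'Px, i.e. value -x'Px
K = np.linalg.solve(R, B.T @ P)           # feedback gain

def module_action(x):
    """Linear feedback u_i(x) = -K x, to be blended by responsibility."""
    return -K @ x
```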
32 Swing-up Control of a Pendulum
- [Plot: red = module 1, green = module 2]
33 Non-linearity and Non-stationarity
- Specialization by predictability in space and time
34 Swing-up Control of an Acrobot
- Reward: height of the center of mass
- Linearized around four fixed points
35 Swing-up Motions
36 Module Switching
- trajectories x(t) for R = 0.001 and R = 0.002
- responsibility λ_i: a symbol-like representation
- module sequences 1-2-1-2-1-3-4-1-3-4-3-4 and 1-2-1-2-1-2-1-3-4-1-3-4
37 Stand Up by Multiple Modules
- Seven locally linear models
38 Segmentation of Observed Trajectory
- Predicted motor output
- Predicted state change
- Predicted responsibility
39 Imitation of Acrobot Swing-up
- initial conditions: q1(0) = π/12, q1(0) = π/6, q1(0) = π/12 (imitation)
40 Outline
- Introduction to Reinforcement Learning (RL)
- Markov decision process (MDP)
- Current topics
- RL in Continuous Space and Time
- Model-free and model-based approaches
- Learning to Stand Up
- Discrete plans and continuous control
- Modular Decomposition
- Multiple model-based RL (MMRL)
41 Future Directions
- Autonomous learning agents
- Tuning of meta-parameters
- Design of rewards
- Selection of necessary/sufficient state coding
- Neural mechanisms of RL
- Dopamine neurons: encoding the TD error
- Basal ganglia: value-based action selection
- Cerebellum: internal models
- Cerebral cortex: modular decomposition
42 What is Reward for a Robot?
- Should be grounded in
- Self-preservation: self-recharging
- Self-reproduction: copying the control program
- Cyber Rodent
43 The Cyber Rodent Project
- Learning mechanisms under the realistic constraints of self-preservation and self-reproduction
- acquisition of task-oriented internal representations
- metalearning algorithms
- constraints of finite time and energy
- mechanisms for collaborative behaviors
- roles of communication
- abstract/emotional vs. concrete/symbolic
- gene exchange rules for evolution
44 Input/Output
- Sensory
- CCD camera
- range sensor
- IR proximity sensors ×8
- acceleration/gyro sensors
- microphones ×2
- Motor
- two wheels
- jaw
- R/G/B LED
- speaker
45 Computation/Communication
- CPU: Hitachi SH-4
- FPGA: image processor
- I/O modules
- Communication
- IR port
- wireless LAN
- Software
- learning/evolution
- dynamic simulation