Reinforcement Learning

1
Reinforcement Learning
  • HUT Spatial Intelligence course
    August/September 2004
  • Bram Bakker
  • Computer Science, University of Amsterdam
  • bram@science.uva.nl

2
Overview day 1 (Monday 13-16)
  • Basic concepts
  • Formalized model
  • Value functions
  • Learning value functions
  • In-class assignment & discussion

3
Overview day 2 (Tuesday 9-12)
  • Learning value functions more efficiently
  • Generalization
  • Case studies
  • In-class assignment & discussion

4
Overview day 3 (Thursday 13-16)
  • Models and planning
  • Multi-agent reinforcement learning
  • Other advanced RL issues
  • Presentation of home assignments & discussion

5
Machine Learning
  • What is it?
  • Subfield of Artificial Intelligence
  • Making computers learn tasks rather than directly
    programming them
  • Why is it interesting?
  • Some tasks are very difficult to program, or
    difficult to optimize, so learning might be
    better
  • Relevance for geoinformatics/spatial
    intelligence
  • Geoinformatics deals with many such tasks:
    transport optimization, water management, etc.

6
Classes of Machine Learning techniques
  • Supervised learning
  • Works by instructing the learning system what
    output to give for each input
  • Unsupervised learning
  • Clustering inputs based on similarity (e.g.
    Kohonen Self-organizing maps)
  • Reinforcement learning
  • Works by letting the learning system learn
    autonomously what is good and bad

7
Some well-known Machine Learning techniques
  • Neural networks
  • Work in a way analogous to brains; can be used
    with supervised, unsupervised, or reinforcement
    learning, or with genetic algorithms
  • Genetic algorithms
  • Work in a way analogous to evolution
  • Ant Colony Optimization
  • Works in a way analogous to ant colonies

8
What is Reinforcement Learning?
  • Learning from interaction
  • Goal-oriented learning
  • Learning about, from, and while interacting with
    an external environment
  • Learning what to do (how to map situations to
    actions) so as to maximize a numerical reward
    signal

9
Some Notable RL Applications
  • TD-Gammon (Tesauro)
  • world's best backgammon program
  • Elevator Control (Crites & Barto)
  • high-performance elevator controller
  • Dynamic Channel Assignment (Singh & Bertsekas;
    Nie & Haykin)
  • high-performance assignment of radio channels to
    mobile telephone calls
  • Traffic light control (Wiering et al.; Choy et
    al.)
  • high-performance control of traffic lights to
    optimize traffic flow
  • Water systems control (Bhattacharya et al.)
  • high-performance control of water levels of
    regional water systems

10
Relationships to other fields
(Diagram: Reinforcement Learning (RL) at the intersection of
Artificial Intelligence / planning methods, Control Theory and
Operations Research, Psychology, Neuroscience, and Artificial
Neural Networks.)
11
Recommended literature
  • Sutton & Barto (1998). Reinforcement Learning: An
    Introduction. MIT Press.
  • Kaelbling, Littman, & Moore (1996).
    Reinforcement learning: a survey. Journal of
    Artificial Intelligence Research, vol. 4, pp. 237-285.

12
Complete Agent
  • Temporally situated
  • Continual learning and planning
  • Agent affects the environment
  • Environment is stochastic and uncertain

(Diagram: the agent-environment loop: the agent sends an action
to the environment; the environment returns a new state and a
reward to the agent.)
13
Supervised Learning
(Diagram: training info = desired (target) outputs; a supervised
learning system maps inputs to outputs; error = target output -
actual output.)
14
Reinforcement Learning (RL)
(Diagram: training info = evaluations (rewards / penalties); the
RL system maps inputs to outputs (actions); objective: get as much
reward as possible.)
15
Key Features of RL
  • Learner is not told which actions to take
  • Trial-and-Error search
  • Possibility of delayed reward
  • Sacrifice short-term gains for greater long-term
    gains
  • The need to explore and exploit
  • Considers the whole problem of a goal-directed
    agent interacting with an uncertain environment

16
What is attractive about RL?
  • Online, autonomous learning without a need for
    preprogrammed behavior or instruction
  • Learning to satisfy long-term goals
  • Applicable to many tasks

17
Some RL History
(Timeline diagram tracing the threads that led to modern RL:
trial-and-error learning, optimal control / value functions, and
temporal-difference learning. Names on the timeline: Hamilton
(physics, 1800s), Thorndike (psychology, 1911), secondary
reinforcement (psychology), Shannon, Samuel, Minsky,
Bellman/Howard (OR), Holland, Klopf, Witten, Werbos, Barto et al.,
Sutton, Watkins.)
18
Elements of RL
  • Policy: what to do
  • Maps states to actions
  • Reward: what is good
  • Value: what is good because it predicts reward
  • Reflects total, long-term reward
  • Model: what follows what
  • Maps states and actions to new states and rewards

19
An Extended Example Tic-Tac-Toe
(Diagram: a tree of tic-tac-toe board positions, branching on
x's moves and o's moves, with '...' marking unexplored branches.)
Assume an imperfect opponent: he/she sometimes makes mistakes.
20
An RL Approach to Tic-Tac-Toe
1. Make a table with one entry per state:
   State, V(s) = estimated probability of winning
   (initially 0.5 for every state; 1 for a state with three x's
   in a row: a win; 0 for a state with three o's in a row: a
   loss; 0 for a draw)
2. Now play lots of games. To pick our moves, look ahead one
   step: just pick the next state with the highest estimated
   probability of winning, i.e. the largest V(s): a greedy move.
   But 10% of the time pick a move at random: an exploratory move.
21
RL Learning Rule for Tic-Tac-Toe
(Diagram: a sequence of game states with value backups along the
greedy moves; one move is marked as an exploratory move. The
update rule shown on the slide is restated below.)
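The update rule itself is an image that does not survive in this transcript. For reference, the temporal-difference rule used in Sutton & Barto's tic-tac-toe example has the form

$$V(s) \leftarrow V(s) + \alpha\,\bigl[V(s') - V(s)\bigr]$$

where $s$ is the state before our greedy move, $s'$ the state after our next move, and $\alpha$ a small step-size parameter; exploratory moves do not cause this backup.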
22
How can we improve this T.T.T. player?
  • Take advantage of symmetries
  • representation/generalization
  • Do we need random moves? Why?
  • Do we always need a full 10%?
  • Can we learn from random moves?
  • Can we learn offline?
  • Pre-training from self play?
  • Using learned models of opponent?
  • . . .

23
How is Tic-Tac-Toe easy?
  • Small number of states and actions
  • Small number of steps until reward
  • . . .

24
RL Formalized
25
The Agent Learns a Policy
  • Reinforcement learning methods specify how the
    agent changes its policy as a result of
    experience.
  • Roughly, the agent's goal is to get as much
    reward as it can over the long run.

26
Getting the Degree of Abstraction Right
  • Time steps need not refer to fixed intervals of
    real time.
  • Actions can be low level (e.g., voltages to
    motors), or high level (e.g., accept a job
    offer), mental (e.g., shift in focus of
    attention), etc.
  • States can be low-level sensations, or they can
    be abstract, symbolic, based on memory, or
    subjective (e.g., the state of being surprised
    or lost).
  • Reward computation is in the agent's environment
    because the agent cannot change it arbitrarily.

27
Goals and Rewards
  • Is a scalar reward signal an adequate notion of a
    goal? Maybe not, but it is surprisingly flexible.
  • A goal should specify what we want to achieve,
    not how we want to achieve it.
  • A goal must be outside the agent's direct
    control, and thus outside the agent.
  • The agent must be able to measure success
  • explicitly
  • frequently during its lifespan.

28
Returns
Episodic tasks: interaction breaks naturally into
episodes, e.g., plays of a game, trips through a
maze. The return is the sum of the rewards,
$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$,
where $T$ is a final time step at which a terminal
state is reached, ending an episode.
29
Returns for Continuing Tasks
Continuing tasks: interaction does not have
natural episodes.
Discounted return:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,
where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.
30
An Example
Avoid failure: the pole falling beyond a critical
angle or the cart hitting the end of the track.
As an episodic task where the episode ends upon
failure: reward = +1 for each step before failure,
so return = number of steps before failure.
As a continuing task with discounted return:
reward = -1 upon failure, 0 otherwise, so the return
is related to $-\gamma^k$ for $k$ steps before failure.
In either case, return is maximized by avoiding
failure for as long as possible.
31
Another Example
Get to the top of the hill as quickly as
possible.
Return is maximized by minimizing the number of
steps to reach the top of the hill.
32
A Unified Notation
  • Think of each episode as ending in an absorbing
    state that always produces a reward of zero
  • We can cover all cases by writing
    $R_t = \sum_{k=0}^{T-t-1} \gamma^k\, r_{t+k+1}$,
    where $T$ can be $\infty$ or $\gamma$ can be 1 (but not both)
33
The Markov Property
  • A state should retain all essential
    information, i.e., it should have the Markov
    Property (restated formally below)
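The formal statement on this slide is not reproduced in the transcript; in the notation of Sutton & Barto it requires, for all $s'$, $r$, and histories,

$$\Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t\} \;=\; \Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\}$$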

34
Markov Decision Processes
  • If a reinforcement learning task has the Markov
    Property, it is a Markov Decision Process (MDP).
  • If state and action sets are finite, it is a
    finite MDP.
  • To define a finite MDP, you need to give
  • state and action sets
  • one-step dynamics defined by state transition
    probabilities and expected rewards (written out
    below)
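The one-step dynamics referred to above are written, in the book's notation, as

$$P^a_{ss'} = \Pr\{s_{t+1}=s' \mid s_t=s,\, a_t=a\}, \qquad R^a_{ss'} = E\{r_{t+1} \mid s_t=s,\, a_t=a,\, s_{t+1}=s'\}$$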

35
Value Functions
  • The value of a state is the expected return
    starting from that state; it depends on the
    agent's policy
  • The value of taking an action in a state under
    policy π is the expected return starting from
    that state, taking that action, and thereafter
    following π (formal definitions below)
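The defining equations are missing from the transcript; in the standard notation they are

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\}, \qquad Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s,\, a_t = a\}$$

where $R_t$ is the return defined earlier.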

36
Bellman Equation for a Policy π
The basic idea:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}$
So:
$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$
Or, without the expectation operator:
$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$
37
Gridworld
  • Actions: north, south, east, west; deterministic.
  • If an action would take the agent off the grid: no
    move, but reward -1
  • Other actions produce reward 0, except actions
    that move the agent out of the special states A and
    B as shown.

State-value function for the equiprobable random
policy; γ = 0.9
38
Optimal Value Functions
  • For finite MDPs, policies can be partially
    ordered
  • There is always at least one (and possibly many)
    policies that is better than or equal to all the
    others. This is an optimal policy. We denote them
    all by π*.
  • Optimal policies share the same optimal
    state-value function V*
  • Optimal policies also share the same optimal
    action-value function Q*

This is the expected return for taking action a
in state s and thereafter following an optimal
policy.
39
Bellman Optimality Equation for V*
The value of a state under an optimal policy must
equal the expected return for the best action
from that state:
$V^*(s) = \max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^*(s')\right]$
$V^*$ is the unique solution of this system of
nonlinear equations.
40
Bellman Optimality Equation for Q*
$Q^*(s,a) = \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma \max_{a'} Q^*(s',a')\right]$
$Q^*$ is the unique solution of this system of
nonlinear equations.
41
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to $V^*$
is an optimal policy.
Therefore, given $V^*$, one-step-ahead search
produces the long-term optimal actions.
E.g., back to the gridworld:
42
What About Optimal Action-Value Functions?
Given $Q^*$, the agent does not even have to do a
one-step-ahead search: $\pi^*(s) = \arg\max_a Q^*(s,a)$
43
Solving the Bellman Optimality Equation
  • Finding an optimal policy by solving the Bellman
    Optimality Equation exactly requires the
    following
  • accurate knowledge of environment dynamics
  • we have enough space and time to do the
    computation
  • the Markov Property.
  • How much space and time do we need?
  • polynomial in number of states (via dynamic
    programming methods),
  • BUT, the number of states is often huge (e.g.,
    backgammon has about $10^{20}$ states).
  • We usually have to settle for approximations.
  • Many RL methods can be understood as
    approximately solving the Bellman Optimality
    Equation.

44
Temporal Difference (TD) Learning
  • Basic idea: transform the Bellman Equation into
    an update rule, using two consecutive timesteps
    (a tabular sketch follows below)
  • Policy Evaluation: learn an approximation to the
    value function of the current policy
  • Policy Improvement: act greedily with respect to
    the intermediate, learned value function
  • Repeating this over and over again leads to
    approximations of the optimal value function
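As a concrete illustration, here is a minimal sketch of the tabular TD(0) evaluation step described above. The `env.reset`/`env.step` interface and the `policy` callable are hypothetical placeholders, not part of the original slides.

```python
from collections import defaultdict

def td0_episode(env, policy, V=None, alpha=0.1, gamma=0.9):
    """Run one episode, applying V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]."""
    V = V if V is not None else defaultdict(float)
    s = env.reset()
    done = False
    while not done:
        a = policy(s)                    # current policy chooses an action
        s_next, r, done = env.step(a)    # hypothetical environment interface
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])  # temporal-difference backup
        s = s_next
    return V
```

Acting greedily with respect to the learned values (policy improvement) and repeating the process yields successively better policies.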

45
Q-Learning: TD-learning of action values
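The backup rule on this slide is an image that is missing from the transcript; the standard Q-learning update (Watkins) is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$

i.e. the same TD idea applied to action values, with the max making it learn about the greedy policy regardless of the actions actually taken.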
46
Exploration/Exploitation revisited
  • Suppose you form action-value estimates
    $Q_t(a) \approx Q^*(a)$
  • The greedy action at $t$ is
    $a_t^* = \arg\max_a Q_t(a)$
  • You can't exploit all the time; you can't explore
    all the time
  • You can never stop exploring, but you should
    always reduce exploring

47
ε-Greedy Action Selection
  • Greedy action selection:
    $a_t = a_t^* = \arg\max_a Q_t(a)$
  • ε-greedy: with probability $1-\varepsilon$ choose the greedy
    action; with probability $\varepsilon$ choose an action at
    random


. . . the simplest way to try to balance
exploration and exploitation
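A small illustrative sketch of ε-greedy selection over tabular action values; the `Q` mapping and `actions` list are illustrative names, not from the slides.

```python
import random

def epsilon_greedy(Q, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)        # exploratory move
    return max(actions, key=lambda a: Q[a])  # greedy move (first maximal action wins ties)
```

Decaying `epsilon` over time is one simple way to "always reduce exploring", as suggested on the previous slide.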
48
Softmax Action Selection
  • Softmax action selection methods grade action
    probabilities by estimated values.
  • The most common softmax uses a Gibbs, or
    Boltzmann, distribution: choose action $a$ at time $t$ with
    probability $e^{Q_t(a)/\tau} / \sum_b e^{Q_t(b)/\tau}$,
    where $\tau$ is the computational temperature
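A possible sketch of softmax (Gibbs/Boltzmann) action selection with temperature τ; the names are illustrative, not from the slides.

```python
import math
import random

def softmax_action(Q, actions, tau=1.0):
    """Choose an action with probability proportional to exp(Q[a] / tau)."""
    prefs = [Q[a] / tau for a in actions]
    m = max(prefs)                              # subtract the max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights, k=1)[0]
```

A high temperature makes all actions nearly equiprobable; as τ approaches 0 the choice becomes greedy.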
49
Pole balancing learned using RL
50
Improving the basic TD learning scheme
  • Can we learn more efficiently?
  • Can we update multiple values at the same
    timestep?
  • Can we look ahead further in time, rather than
    just use the value at the next timestep?
  • Yes! All of these can be done simultaneously with
    one extension: eligibility traces

51
N-step TD Prediction
  • Idea: look farther into the future when you do a TD
    backup (1, 2, 3, …, n steps)

52
Mathematics of N-step TD Prediction
  • Monte Carlo:
    $R_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T$
  • TD: $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
  • Use V to estimate the remaining return
  • n-step TD:
  • 2-step return:
    $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
  • n-step return:
    $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$

53
Learning with N-step Backups
  • Backup (on-line or off-line)

54
Random Walk Example
  • How does 2-step TD work here?
  • How about 3-step TD?

55
Forward View of TD(λ)
  • TD(λ) is a method for averaging all n-step
    backups
  • weight by $\lambda^{n-1}$ (time since visitation)
  • λ-return:
    $R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
  • Backup using the λ-return:
    $\Delta V_t(s_t) = \alpha\left[R_t^\lambda - V_t(s_t)\right]$

56
λ-return Weighting Function
57
Relation to TD(0) and MC
  • The λ-return can be rewritten as
    $R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$,
    the first term covering the n-step returns until
    termination, the second term the complete return
    after termination
  • If λ = 1, you get MC: $R_t^\lambda = R_t$
  • If λ = 0, you get TD(0): $R_t^\lambda = R_t^{(1)}$
58
Forward View of TD(λ) II
  • Look forward from each state to determine update
    from future states and rewards

59
λ-return on the Random Walk
  • Same random walk as before, but now with 19
    states
  • Why do you think intermediate values of λ are
    best?

60
Backward View of TD(λ)
  • The forward view was for theory
  • The backward view is for mechanism
  • New variable called the eligibility trace, $e_t(s)$
  • On each step, decay all traces by $\gamma\lambda$ and
    increment the trace for the current state by 1
  • Accumulating trace:
    $e_t(s) = \gamma\lambda\, e_{t-1}(s) + 1$ if $s = s_t$,
    and $e_t(s) = \gamma\lambda\, e_{t-1}(s)$ otherwise
    (a code sketch follows below)
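A minimal sketch of one backward-view TD(λ) step with accumulating traces, assuming a tabular value function; the argument names are illustrative.

```python
from collections import defaultdict

def td_lambda_step(V, traces, s, r, s_next, done,
                   alpha=0.1, gamma=0.9, lam=0.8):
    """One backward-view TD(lambda) update; V and traces are defaultdict(float)."""
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]  # TD error (delta_t)
    traces[s] += 1.0                    # accumulating trace: increment current state's trace
    for state in list(traces):
        V[state] += alpha * delta * traces[state]  # every visited state learns, weighted by its trace
        traces[state] *= gamma * lam               # decay all traces by gamma * lambda
```

With `lam=0` this reduces to the TD(0) update, and with `lam=1` it behaves like an incremental Monte Carlo method, matching the relation stated on the "Relation of Backwards View" slide below.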

61
Backward View
  • Shout the TD error $\delta_t$ backwards over time
  • The strength of your voice decreases with
    temporal distance by $\gamma\lambda$

62
Relation of Backwards View to MC & TD(0)
  • Using the update rule
    $\Delta V_t(s) = \alpha\,\delta_t\, e_t(s)$
  • As before, if you set λ to 0, you get TD(0)
  • If you set λ to 1, you get MC, but in a better way
  • Can apply TD(1) to continuing tasks
  • Works incrementally and on-line (instead of
    waiting until the end of the episode)

63
Forward View = Backward View
  • The forward (theoretical) view of TD(λ) is
    equivalent to the backward (mechanistic) view
  • Sutton & Barto's book shows the equivalence
    (algebra shown in the book)
64
Q(λ)-learning
  • Zero out the eligibility trace after a non-greedy
    action. Do the max when backing up at the first
    non-greedy choice.

65
Q(λ)-learning
66
Q(λ) Gridworld Example
  • With one trial, the agent has much more
    information about how to get to the goal
  • not necessarily the best way
  • Can considerably accelerate learning

67
Conclusions: TD(λ)/Q(λ) methods
  • Can significantly speed learning
  • Robustness against unreliable value estimations
    (e.g. caused by Markov violation)
  • Does have a cost in computation

68
Generalization and Function Approximation
  • Look at how experience with a limited part of the
    state set can be used to produce good behavior over
    a much larger part
  • Overview of function approximation (FA) methods
    and how they can be adapted to RL

69
Generalization
(Diagram: a lookup table lists a value V for each state s1, s2,
s3, ..., sN, whereas a generalizing function approximator is
trained on only some of the states ("train here") and produces
values over the whole state space.)
70
So with function approximation a single value
update affects a larger region of the state space
71
Value Prediction with FA
Before, value functions were stored in lookup
tables; now the value function $V_t$ is a parameterized
function with parameter vector $\vec\theta_t$.
72
Adapt Supervised Learning Algorithms
(Diagram: as in supervised learning, the learning system maps
inputs to outputs; each training example pairs an input with a
target output, and error = target output - actual output.)
73
Backups as Training Examples
As a training example:
input = a description of $s_t$;
target output = $r_{t+1} + \gamma V(s_{t+1})$
74
Any FA Method?
  • In principle, yes
  • artificial neural networks
  • decision trees
  • multivariate regression methods
  • etc.
  • But RL has some special requirements
  • usually want to learn while interacting
  • ability to handle nonstationarity
  • other?

75
Gradient Descent Methods
76
Performance Measures for Gradient Descent
  • Many are applicable, but
  • a common and simple one is the mean-squared error
    (MSE) over a distribution P (written out below)
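The MSE formula referred to above is an image; in the book's notation it reads

$$\mathrm{MSE}(\vec\theta_t) = \sum_{s \in S} P(s)\,\bigl[V^\pi(s) - V_t(s)\bigr]^2$$

where $P$ weights the error over states, typically by how often they are visited under the policy.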

77
Gradient Descent
Iteratively move down the gradient
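The update shown on this slide is missing from the transcript; for value prediction the gradient-descent step has the familiar form

$$\vec\theta_{t+1} = \vec\theta_t - \tfrac{1}{2}\,\alpha\, \nabla_{\vec\theta}\bigl[V^\pi(s_t) - V_t(s_t)\bigr]^2 = \vec\theta_t + \alpha\bigl[V^\pi(s_t) - V_t(s_t)\bigr]\,\nabla_{\vec\theta} V_t(s_t)$$

with the unknown target $V^\pi(s_t)$ replaced in practice by a backed-up estimate such as $r_{t+1} + \gamma V_t(s_{t+1})$.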
78
Control with FA
  • Learning state-action values
  • The general gradient-descent rule
  • Gradient-descent Q(l) (backward view)
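The formulas on this slide are images; in the book's notation, the general gradient-descent rule for action values and its eligibility-trace (backward-view) form are roughly

$$\vec\theta_{t+1} = \vec\theta_t + \alpha\bigl[v_t - Q_t(s_t,a_t)\bigr]\,\nabla_{\vec\theta} Q_t(s_t,a_t)$$

$$\vec\theta_{t+1} = \vec\theta_t + \alpha\,\delta_t\,\vec e_t, \qquad \vec e_t = \gamma\lambda\,\vec e_{t-1} + \nabla_{\vec\theta} Q_t(s_t,a_t)$$

where $v_t$ is the backed-up target and $\delta_t$ the corresponding TD error.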

79
Linear Gradient Descent Q(λ)
80
Mountain-Car Task
81
Mountain-Car Results
82
Summary
  • Generalization can be done in those cases where
    there are too many states
  • Adapting supervised-learning function
    approximation methods
  • Gradient-descent methods

83
Case Studies
  • Illustrate the promise of RL
  • Illustrate the difficulties, such as long
    learning times and finding good state representations

84
TD Gammon
Tesauro 1992, 1994, 1995, ...
  • Objective is to advance all pieces to points
    19-24
  • 30 pieces, 24 locations implies enormous number
    of configurations
  • Effective branching factor of 400

85
A Few Details
  • Reward: 0 at all times except when the game
    is won, when it is 1
  • Episodic (one game per episode), undiscounted
  • Gradient-descent TD(λ) with a multi-layer neural
    network
  • weights initialized to small random numbers
  • backpropagation of the TD error
  • four input units for each point: unary encoding
    of the number of white pieces, plus other features
  • Learning during self-play

86
Multi-layer Neural Network
87
Summary of TD-Gammon Results
88
The Acrobot
Spong 1994; Sutton 1996
89
Acrobot Learning Curves for Q(λ)
90
Typical Acrobot Learned Behavior
91
Elevator Dispatching
Crites and Barto 1996
92
State Space
  • 18 hall call buttons: $2^{18}$ combinations
  • positions and directions of cars: $18^4$
    (rounding to nearest floor)
  • motion states of cars (accelerating, moving,
    decelerating, stopped, loading, turning): $6^4$
  • 40 car buttons: $2^{40}$
  • Set of passengers waiting at each floor, each
    passenger's arrival time and destination:
    unobservable. However, 18 real numbers are
    available giving elapsed time since hall buttons
    were pushed; we discretize these.
  • Set of passengers riding each car and their
    destinations: observable only through the car
    buttons

Conservatively about $10^{22}$ states
93
Control Strategies
  • Zoning: divide the building into zones; park in
    a zone when idle. Robust in heavy traffic.
  • Search-based methods: greedy or non-greedy;
    receding-horizon control.
  • Rule-based methods: expert systems / fuzzy
    logic from human experts
  • Other heuristic methods: Longest Queue First
    (LQF), Highest Unanswered Floor First (HUFF),
    Dynamic Load Balancing (DLB)
  • Adaptive/Learning methods: NNs for prediction,
    parameter space search using simulation, DP on a
    simplified model, non-sequential RL

94
Performance Criteria
Minimize:
  • Average wait time
  • Average system time (wait + travel time)
  • Percentage waiting > T seconds (e.g., T = 60)
  • Average squared wait time (to encourage fast
    and fair service)

95
Average Squared Wait Time
  • Instantaneous cost: based on the squared wait
    times of the currently waiting passengers
  • Define the return as an integral rather than a sum
    (Bradtke and Duff, 1994):
    $\sum_{k=0}^{\infty} \gamma^k r_{t+k}$ becomes
    $\int_0^{\infty} e^{-\beta\tau}\, r_{t+\tau}\, d\tau$
96
Algorithm
97
Neural Networks
47 inputs, 20 sigmoid hidden units, 1 or 2 output
units
Inputs:
  • 9 binary: state of each hall down button
  • 9 real: elapsed time of each hall down button, if
    pushed
  • 16 binary, one on at a time: position and
    direction of the car making the decision
  • 10 real: location/direction of the other cars
  • 1 binary: at the highest floor with a waiting
    passenger?
  • 1 binary: at the floor with the longest-waiting
    passenger?
  • 1 bias unit ≡ 1

98
Elevator Results
99
Dynamic Channel Allocation
Details in Singh and Bertsekas 1997
100
Helicopter flying
  • Difficult nonlinear control problem
  • Also difficult for humans
  • Approach: learn in simulation, then transfer to the
    real helicopter
  • Uses a function approximator for generalization
  • Bagnell, Ng, and Schneider (2001, 2003, ...)

101
In-class assignment
  • Think again of your own RL problem, with states,
    actions, and rewards
  • This time think especially about how uncertainty
    may play a role, and about how generalization may
    be important
  • Discussion

102
Homework assignment
  • Due Thursday 13-16
  • Think again of your own RL problem, with states,
    actions, and rewards
  • Do a web search on your RL problem or related
    work
  • What is there already, and what, roughly, have
    they done to solve the RL problem?
  • Present briefly in class

103
Overview day 3
  • Summary of what we've learnt about RL so far
  • Models and planning
  • Multi-agent RL
  • Presentation of homework assignments and
    discussion

104
RL summary
  • Objective: maximize the total amount of
    (discounted) reward
  • Approach: estimate a value function (defined over
    the state space) which represents this total amount
    of reward
  • Learn this value function incrementally by doing
    updates based on values of consecutive states
    (temporal differences)
  • After having learnt the optimal value function,
    optimal behavior can be obtained by taking the
    action which has or leads to the highest value
  • Use function approximation techniques for
    generalization if the state space becomes too large
    for tables

105
RL weaknesses
  • Still art involved in defining good state (and
    action) representations
  • Long learning times

106
Planning and Learning
  • Use of environment models
  • Integration of planning and learning methods

107
Models
  • Model: anything the agent can use to predict how
    the environment will respond to its actions
  • Models can be used to produce simulated experience

108
Planning
  • Planning: any computational process that uses a
    model to create or improve a policy

109
Learning, Planning, and Acting
  • Two uses of real experience:
  • model learning: to improve the model
  • direct RL: to directly improve the value function
    and policy
  • Improving the value function and/or policy via a
    model is sometimes called indirect RL or
    model-based RL. Here, we call it planning.

110
Direct vs. Indirect RL
  • Indirect methods:
  • make fuller use of experience; get a better policy
    with fewer environment interactions
  • Direct methods:
  • simpler
  • not affected by bad models

But they are very closely related and can be
usefully combined: planning, acting, model
learning, and direct RL can occur simultaneously
and in parallel
111
The Dyna Architecture (Sutton 1990)
112
The Dyna-Q Algorithm
(Pseudocode figure: each time step interleaves direct RL, model
learning, and planning; a sketch follows below.)
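The pseudocode itself is not reproduced here; the following is a compact, illustrative sketch of the Dyna-Q loop (direct RL + model learning + planning) for a tabular problem. The `env` interface, `actions` list, and hyperparameters are assumptions, not taken from the slides.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: learn from real steps, learn a model, replay simulated steps."""
    Q = defaultdict(float)   # Q[(state, action)] action-value estimates
    model = {}               # model[(state, action)] = (reward, next_state, done)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)                 # act in the real environment
            # direct RL: Q-learning backup from real experience
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # model learning: remember what followed (s, a)
            model[(s, a)] = (r, s2, done)
            # planning: replay simulated transitions drawn from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * max(Q[(ps2, b)] for b in actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```

The planning loop is what lets one real trial spread value information along many state-action pairs, which is why Dyna-Q needs far fewer real environment steps on the maze example that follows.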
113
Dyna-Q on a Simple Maze
rewards are 0 until the goal is reached, when the
reward is 1
Dyna-Q Snapshots Midway in 2nd Episode
115
Using Dyna-Q for real-time robot learning
After learning (approx. 15 minutes)
Before learning
116
Multi-agent RL
  • So far considered only single-agent RL
  • But many domains have multiple agents!
  • Group of industrial robots working on a single
    car
  • Robot soccer
  • Traffic
  • Can we extend the methods of single-agent RL to
    multi-agent RL?

117
Dimensions of multi-agent RL
  • Is the objective to maximize individual rewards
    or to maximize global rewards?
  • Competition vs. cooperation
  • Do the agents share information?
  • Shared state representation?
  • Communication?
  • Homogeneous or heterogeneous agents?
  • Do some agents have special capabilities?

118
Competition
  • Like multiple single-agent cases simultaneously
  • Related to game theory
  • Nash equilibria etc.
  • Research goals
  • study how to optimize individual rewards in the
    face of competition
  • study group dynamics

119
Cooperation
  • More different from single-agent case than
    competition
  • How can we make the individual agents work
    together?
  • Are rewards shared among the agents?
  • should all agents be punished for individual
    mistakes?

120
Robot soccer: an example of cooperation
  • Riedmiller group in Karlsruhe
  • Robots must play together to beat other groups of
    robots in Robocup tournaments
  • Riedmiller group uses reinforcement learning
    techniques to do this

121
Opposite approaches to cooperative case
  • Consider the multi-agent system as a collection
    of individual reinforcement learners
  • Design individual reward functions such that
    cooperation emerges
  • They may become selfish, or may not cooperate
    in a desirable way
  • Consider the whole multi-agent system as one big
    MDP with a large action vector
  • State-action space may become very large, but
    perhaps possible with advanced function
    approximation

122
Interesting intermediate approach
  • Let agents learn mostly individually
  • Assign (or learn!) a limited number of states
    where agents must coordinate, and at those points
    consider those agents as a larger single agent
  • This can be represented and computed efficiently
    using coordination graphs
  • Guestrin & Koller (2003); Kok & Vlassis (2004)

123
Robocup simulation league
  • Kok & Vlassis (2002-2004)

124
Advanced Generalization Issues
  • Generalization over states
  • tables
  • linear methods
  • nonlinear methods
  • Generalization over actions
  • Proving convergence with generalization methods

125
Non-Markov case
  • Try to do the best you can with non-Markov states
  • Partially Observable MDPs (POMDPs)
  • Bayesian approach: belief states
  • construct state from sequence of observations

126
Other issues
  • Model-free vs. model-based
  • Value functions vs. directly searching for good
    policies (e.g. using genetic algorithms)
  • Hierarchical methods
  • Incorporating prior knowledge
  • advice and hints
  • trainers and teachers
  • shaping
  • Lyapunov functions
  • etc.

127
The end!