Title: Reinforcement Learning
1 Reinforcement Learning
2 Content
- Introduction
- Main Elements
- Markov Decision Process (MDP)
- Value Functions
3 Reinforcement Learning
4 Reinforcement Learning
- Learning from interaction (with environment)
- Goal-directed learning
- Learning what to do and its effect
- Trial-and-error search and delayed reward
- The two most important distinguishing features of reinforcement learning
5 Exploration and Exploitation
- The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future.
- Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task (see the sketch below).
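This tradeoff is often handled with ε-greedy action selection: exploit the best-looking action most of the time, but explore at random with a small probability ε. The sketch below is illustrative only; the function name and the toy action-value numbers are assumptions, not from the slides.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

# Toy usage: estimated action values for a 3-armed bandit (made-up numbers).
q = [0.2, 0.5, 0.1]
action = epsilon_greedy(q, epsilon=0.1)
```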
6 Supervised Learning
Training Info: desired (target) outputs
Error = (target output - actual output)
7 Reinforcement Learning
Training Info: evaluations ("rewards" / "penalties")
Objective: get as much reward as possible
8 Reinforcement Learning
9 Main Elements
To maximize value
10 Main Elements
Total reward (long term)
Immediate reward (short term)
To maximize value
11 Example (Bioreactor)
- State
- current temperature and other sensory readings, composition, target chemical
- Actions
- how much heating, stirring are required?
- what ingredients need to be added?
- Reward
- moment-by-moment production of desired chemical
12 Example (Pick-and-Place Robot)
- State
- current positions and velocities of joints
- Actions
- voltages to apply to motors
- Reward
- reach end-position successfully, speed,
smoothness of trajectory
13 Example (Recycling Robot)
- State
- charge level of battery
- Actions
- look for cans, wait for can, go recharge
- Reward
- positive for finding cans, negative for running
out of battery
14 Main Elements
- Environment
- Its state is perceivable
- Reinforcement Function
- To generate reward
- A function of states (or state/action pairs)
- Value Function
- The potential to reach the goal (with maximum total reward)
- To determine the policy
- A function of state
15 The Agent-Environment Interface
Frequently, we model the environment as a Markov Decision Process (MDP).
[Figure: the agent-environment interaction loop between Agent and Environment]
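A minimal sketch of this interface, assuming nothing beyond the loop itself: the Environment class, its method names, and the random dynamics below are placeholders for illustration, not a particular library's API.

```python
import random

class Environment:
    """Placeholder environment: states, actions, and dynamics are made up for illustration."""
    def __init__(self):
        self.states = ["s0", "s1", "s2"]
        self.actions = ["a0", "a1"]
        self.state = None

    def reset(self):
        self.state = random.choice(self.states)
        return self.state

    def step(self, action):
        # A real environment would compute these from (self.state, action);
        # here they are random placeholders.
        self.state = random.choice(self.states)
        reward = random.random()
        return self.state, reward

# The interface itself: the agent observes a state, selects an action,
# and receives a reward plus the next state from the environment.
env = Environment()
state = env.reset()
for t in range(5):
    action = random.choice(env.actions)    # a trivial (random) agent
    state, reward = env.step(action)
```

At each step the agent observes s_t, chooses a_t, and the environment returns r_{t+1} and s_{t+1}; an MDP formalizes exactly this loop.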
16 Reward Function
S: a set of states
A: a set of actions
R: S → ℝ  or  R: S × A → ℝ
- A reward function is closely related to the goal in reinforcement learning.
- It maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state.
17 Goals and Rewards
- The agent's goal is to maximize the total amount of reward it receives.
- This means maximizing not just immediate reward, but cumulative reward in the long run.
18 Goals and Rewards
Can you design another reward function?
[Figure: example task with two candidate reward values, 1 and 0]
19 Goals and Rewards
State                   Reward
Win                     +1
Loss                    -1
Draw or Non-terminal     0
20 Goals and Rewards
The reward signal is the way of communicating to
the agent what we want it to achieve, not how we
want it achieved.
[Figure: example reward assignment using the values 0 and -1]
21 Reinforcement Learning
- Markov Decision Processes
22 Definition
- An MDP consists of
- A set of states S, and a set of actions A
- A transition distribution: P(s′ | s, a) = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }
- Expected next rewards: R(s, a, s′) = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ ]
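To make the definition concrete, the transition distribution and the expected rewards can be stored as simple lookup tables. The two-state MDP below is a made-up example (state and action names are assumptions), shown only as one possible representation.

```python
# States S = {s0, s1}, actions A = {stay, go} (names invented for illustration).
# P[(s, a)] is the transition distribution over next states;
# R[(s, a, s_next)] is the expected next reward.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
}
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "go", "s0"): 0.0,   ("s0", "go", "s1"): 1.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "go", "s0"): -1.0,  ("s1", "go", "s1"): 2.0,
}

# Sanity check: each transition distribution must sum to 1.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```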
23 Example (Recycling Robot)
[Figure: state-transition diagram with battery states High and Low]
24 Example (Recycling Robot)
25 Decision Making
- Many stochastic processes can be modeled within the MDP framework.
- The process is controlled by choosing actions in each state, trying to attain the maximum long-term reward.
How to find the optimal policy?
26 Reinforcement Learning
27 Value Functions
V^π(s)  or  Q^π(s, a)
- To estimate how good it is for the agent to be in a given state
- (or how good it is to perform a given action in a given state).
- The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return.
- Value functions are defined with respect to particular policies.
28 Returns
- Episodic Tasks
- finite-horizon tasks
- terminates after a fixed number of time steps
- indefinite-horizon tasks
- can last arbitrarily long but eventually terminate
- Continual Tasks
- infinite-horizon tasks
29 Finite Horizon Tasks
k-armed bandit problem
Return at time t: R_t = r_{t+1} + r_{t+2} + ... + r_{t+h}   (h: fixed, finite horizon)
Expected return at time t: E[ R_t ]
30 Indefinite Horizon Tasks
Play chess
Return at time t: R_t = r_{t+1} + r_{t+2} + ... + r_T   (T: final time step of the episode)
Expected return at time t: E[ R_t ]
31 Continual Tasks
Control
Return at time t: R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}
Expected return at time t: E[ R_t ]
32 Unified Notation
Reformulation of episodic tasks: an episode is treated as ending in an absorbing state that yields reward 0 forever after.
Discounted return at time t: R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},  with 0 ≤ γ ≤ 1 (γ < 1 for continual tasks)
γ: discounting factor
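A tiny helper makes the discounted return concrete; the function name is arbitrary, and the reward list stands in for r_{t+1}, r_{t+2}, ... of one (truncated) trajectory.

```python
def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
    computed over a finite list of rewards (a truncated sum for continual tasks)."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Three rewards of 1 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```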
33 Policies
- A policy, π, is a mapping from states, s ∈ S, and actions, a ∈ A(s), to the probability π(s, a) of taking action a when in state s.
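A stochastic policy can be stored directly as the table π(s, a) and sampled from. In the sketch below the probabilities are invented for illustration, while the state and action names echo the recycling-robot example.

```python
import random

# pi[s][a] = probability of taking action a in state s (made-up numbers).
pi = {
    "high": {"search": 0.7, "wait": 0.3},
    "low":  {"search": 0.1, "wait": 0.4, "recharge": 0.5},
}

def sample_action(pi, state):
    """Draw an action with probability pi(state, action)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "low"))
```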
34 Value Functions under a Policy
State-Value Function: V^π(s) = E_π[ R_t | s_t = s ]
Action-Value Function: Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ]
35 Bellman Equation for a Policy π: State-Value Function
V^π(s) = Σ_a π(s, a) Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ V^π(s′) ]
36 Backup Diagram: State-Value Function
37 Bellman Equation for a Policy π: Action-Value Function
Q^π(s, a) = Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ Σ_{a′} π(s′, a′) Q^π(s′, a′) ]
38 Backup Diagram: Action-Value Function
39 Bellman Equation for a Policy π
- This is a set of equations (in fact, linear), one for each state.
- It specifies the consistency condition between values of states and successor states, and rewards.
- Its unique solution is the value function for π (an iterative sketch follows below).
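Because the system is linear, V^π can be found by solving it directly or, as sketched below, by iterating the Bellman equation until the values stop changing (iterative policy evaluation). The tabular P, R, and pi dictionaries follow the made-up representation sketched earlier and are assumptions, not the slides' notation.

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, theta=1e-8):
    """Iterate the Bellman equation for a fixed policy pi until the values settle:
    V(s) <- sum_a pi(s,a) * sum_s' P(s'|s,a) * [ R(s,a,s') + gamma * V(s') ]"""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                prob_a = pi.get(s, {}).get(a, 0.0)
                if prob_a == 0.0:
                    continue
                for s_next, p in P.get((s, a), {}).items():
                    v_new += prob_a * p * (R[(s, a, s_next)] + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```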
40 Example (Grid World)
- State: position on the grid
- Actions: north, south, east, west; the resulting state is deterministic.
- Reward: if an action would take the agent off the grid, the agent does not move but receives reward -1.
- Other actions produce reward 0, except actions that move the agent out of the special states A and B as shown.
State-value function for the equiprobable random policy, γ = 0.9
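A numerical check of this slide is straightforward, with one caveat: the figure defining the special states is not reproduced in the text, so the positions of A and B, the jump targets A′ and B′, and the rewards +10 and +5 below follow the standard textbook version of this grid world and should be read as assumptions.

```python
import numpy as np

N, gamma = 5, 0.9
# Assumed special-state dynamics (standard textbook values, not given in the text):
# every action from A jumps to A' with reward +10; every action from B to B' with +5.
A, A_prime = (0, 1), (4, 1)
B, B_prime = (0, 3), (2, 3)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # north, south, west, east

V = np.zeros((N, N))
for _ in range(1000):                          # sweep the Bellman equation to convergence
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            for dr, dc in moves:               # equiprobable random policy: prob 1/4 each
                if (r, c) == A:
                    (nr, nc), reward = A_prime, 10.0
                elif (r, c) == B:
                    (nr, nc), reward = B_prime, 5.0
                else:
                    nr, nc, reward = r + dr, c + dc, 0.0
                    if not (0 <= nr < N and 0 <= nc < N):
                        nr, nc, reward = r, c, -1.0   # off the grid: stay put, reward -1
                V_new[r, c] += 0.25 * (reward + gamma * V[nr, nc])
    V = V_new
print(np.round(V, 1))   # state values of the random policy under the assumed dynamics
```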
41 Optimal Policy (π*)
Optimal State-Value Function: V*(s) = max_π V^π(s)
What is the relation between them?
Optimal Action-Value Function: Q*(s, a) = max_π Q^π(s, a)
42 Optimal Value Functions
Bellman Optimality Equations:
V*(s) = max_a Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ V*(s′) ]
43 Optimal Value Functions
Bellman Optimality Equations:
Q*(s, a) = Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ max_{a′} Q*(s′, a′) ]
How do we apply the value function to determine the action to take in each state?
How to compute? How to store?
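One standard answer to "how to compute" is value iteration, which turns the Bellman optimality equation into a repeated update and stores V as a table with one entry per state. The sketch below reuses the tabular P and R dictionaries assumed earlier.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """Repeatedly apply V(s) <- max_a sum_s' P(s'|s,a) * [ R(s,a,s') + gamma * V(s') ],
    then read off a greedy policy from the converged values."""
    def backup(s, a, V):
        return sum(p * (R[(s, a, s2)] + gamma * V[s2])
                   for s2, p in P[(s, a)].items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(backup(s, a, V) for a in actions if (s, a) in P)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Greedy policy: in each state, pick the action with the largest backed-up value.
    policy = {s: max((a for a in actions if (s, a) in P),
                     key=lambda a: backup(s, a, V))
              for s in states}
    return V, policy
```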
44 Example (Grid World)
[Figure: the random policy and the optimal policy for the grid world, with the optimal state-value function V* and optimal policy π*]
45 Finding Optimal Solution via Bellman
- Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
- accurate knowledge of environment dynamics
- enough space and time for computation
- the Markov Property.
46 Example (Recycling Robot)
47 Example (Recycling Robot)
48 Optimality and Approximation
- How much space and time do we need?
- polynomial in the number of states (via dynamic programming methods)
- BUT, the number of states is often huge (e.g., backgammon has about 10^20 states).
- We usually have to settle for approximations.
- Many RL methods can be understood as approximately solving the Bellman Optimality Equation.