Reinforcement Learning

1
Reinforcement Learning

2
Content
  • Introduction
  • Main Elements
  • Markov Decision Process (MDP)
  • Value Functions

3
Reinforcement Learning
  • Introduction

4
Reinforcement Learning
  • Learning from interaction (with environment)
  • Goal-directed learning
  • Learning what to do and its effect
  • Trial-and-error search and delayed reward
  • The two most important distinguishing features of
    reinforcement learning

5
Exploration and Exploitation
  • The agent has to exploit what it already knows in
    order to obtain reward, but it also has to
    explore in order to make better action selections
    in the future.
  • Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task.
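One common way to balance the two is ε-greedy action selection. The sketch below is a minimal Python illustration; the function name, the action-value list, and ε = 0.1 are assumptions for the example, not taken from the slides.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# With estimates [1.0, 2.5, 0.3] the agent usually picks action 1,
# but occasionally tries the others to keep improving its estimates.
action = epsilon_greedy([1.0, 2.5, 0.3], epsilon=0.1)
```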

6
Supervised Learning
Training info: desired (target) outputs
Error = (target output - actual output)
7
Reinforcement Learning
Training info: evaluations (rewards / penalties)
Objective: get as much reward as possible
8
Reinforcement Learning
  • Main Elements

9
Main Elements

[Figure: main elements of reinforcement learning; the agent selects actions to maximize value]
10
Main Elements

Value: total reward (long term)
Reward: immediate reward (short term)
The agent selects actions to maximize value.
11
Example (Bioreactor)
  • State
  • current temperature and other sensory readings,
    composition, target chemical
  • Actions
  • how much heating and stirring are required?
  • what ingredients need to be added?
  • Reward
  • moment-by-moment production of desired chemical

12
Example (Pick-and-Place Robot)
  • State
  • current positions and velocities of joints
  • Actions
  • voltages to apply to motors
  • Reward
  • reach end-position successfully, speed,
    smoothness of trajectory

13
Example (Recycling Robot)
  • State
  • charge level of battery
  • Actions
  • look for cans, wait for can, go recharge
  • Reward
  • positive for finding cans, negative for running
    out of battery

14
Main Elements
  • Environment
  • Its state is perceivable
  • Reinforcement Function
  • To generate reward
  • A function of states (or state/action pairs)
  • Value Function
  • The potential to reach the goal (with maximum
    total reward)
  • To determine the policy
  • A function of state

15
The Agent-Environment Interface
Frequently, we model the environment as a Markov
Decision Process (MDP).
[Figure: the agent-environment interaction loop]
16
Reward Function
S: a set of states
A: a set of actions
r: S → ℝ   or   r: S × A → ℝ
  • A reward function is closely related to the goal
    in reinforcement learning.
  • It maps perceived states (or state-action pairs)
    of the environment to a single number, a reward,
    indicating the intrinsic desirability of the
    state.

17
Goals and Rewards
  • The agent's goal is to maximize the total amount
    of reward it receives.
  • This means maximizing not just immediate reward,
    but cumulative reward in the long run.

18
Goals and Rewards
Can you design another reward function?
[Figure: a task where reward = 1 on success and reward = 0 otherwise]
19
Goals and Rewards
State                   Reward
Win                       +1
Loss                      -1
Draw or non-terminal       0
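Read as code, this reward design is just a lookup from game outcome to a number. A minimal sketch; the outcome labels come from the table above, while the function name is illustrative.

```python
def reward(outcome):
    """Reward for a game-playing task: +1 for a win, -1 for a loss,
    0 for a draw or any non-terminal position."""
    return {"win": 1, "loss": -1, "draw": 0, "non-terminal": 0}[outcome]
```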
20
Goals and Rewards
The reward signal is the way of communicating to
the agent what we want it to achieve, not how we
want it achieved.
21
Reinforcement Learning
  • Markov Decision Processes

22
Definition
  • An MDP consists of
  • A set of states S and a set of actions A
  • A transition distribution: P(s' | s, a) = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
  • Expected next rewards: R(s, a, s') = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }
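The two quantities in this definition can be stored as simple lookup tables. The sketch below is a minimal Python representation; the dictionary layout and the battery-style state and action names echo the recycling-robot example, but the probabilities and rewards are made-up placeholders, not values from the slides.

```python
# Transition distribution P(s' | s, a) and expected reward R(s, a, s'),
# stored as dictionaries keyed by (state, action).
P = {
    ("high", "search"):   {"high": 0.7, "low": 0.3},
    ("high", "wait"):     {"high": 1.0},
    ("low",  "search"):   {"low": 0.6, "high": 0.4},
    ("low",  "recharge"): {"high": 1.0},
}
R = {
    ("high", "search"):   {"high": 2.0, "low": 2.0},
    ("high", "wait"):     {"high": 1.0},
    ("low",  "search"):   {"low": 2.0, "high": -3.0},
    ("low",  "recharge"): {"high": 0.0},
}

def expected_reward(s, a):
    """E[r_{t+1} | s_t = s, a_t = a] = sum over s' of P(s'|s,a) * R(s,a,s')."""
    return sum(p * R[(s, a)][s_p] for s_p, p in P[(s, a)].items())
```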

23
Example (Recycling Robot)
[Figure: state-transition diagram for the recycling robot, with battery states High and Low]
24
Example (Recycling Robot)
25
Decision Making
  • Many stochastic processes can be modeled within
    the MDP framework.
  • The process is controlled by choosing actions in each state so as to attain the maximum long-term reward.

How to find the optimal policy?
26
Reinforcement Learning
  • Value Functions

27
Value Functions
V^π(s)   or   Q^π(s, a)
  • To estimate how good it is for the agent to be in a given state
  • (or how good it is to perform a given action in a given state).
  • The notion of "how good" here is defined in terms of the future rewards that can be expected or, to be precise, in terms of expected return.
  • Value functions are defined with respect to
    particular policies.

28
Returns
  • Episodic Tasks
  • finite-horizon tasks
  • terminates after a fixed number of time steps
  • indefinite-horizon tasks
  • can last arbitrarily long but eventually
    terminate
  • Continual Tasks
  • infinite-horizon tasks

29
Finite Horizon Tasks
Example: the k-armed bandit problem
Return at time t: R_t = r_{t+1} + r_{t+2} + ... + r_{t+K}   (fixed horizon of K steps)
Expected return at time t: E[ R_t ]
30
Indefinite Horizon Tasks
Example: playing chess
Return at time t: R_t = r_{t+1} + r_{t+2} + ... + r_T,   where T is the final time step
Expected return at time t: E[ R_t ]
31
Continual Tasks
Example: a continuing control task
Return at time t: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}
Expected return at time t: E[ R_t ]
32
Unified Notation
Reformulation of episodic tasks: the terminal state becomes an absorbing state that yields reward 0 forever after.
Discounted return at time t: R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},   0 ≤ γ ≤ 1
γ: the discounting factor (γ = 1 is allowed only for episodic tasks; otherwise γ < 1)
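To make the formula concrete, the sketch below computes a discounted return from a finite list of sampled rewards; for an episodic task the sum simply stops at the absorbing state, so the same code covers both cases. The function name and the example numbers are illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: rewards 1, 0, 2 received after time t with gamma = 0.9
# give 1 + 0.9*0 + 0.81*2 = 2.62.
print(discounted_return([1, 0, 2], gamma=0.9))
```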
33
Policies
  • A policy, π, is a mapping from states, s ∈ S, and actions, a ∈ A(s), to the probability π(s, a) of taking action a when in state s.
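In code, such a stochastic policy is just a probability table per state. A minimal sketch follows; the state and action names are placeholder assumptions matching the earlier MDP sketch.

```python
import random

# pi(s, a): for each state, a distribution over the actions available there.
pi = {
    "high": {"search": 0.8, "wait": 0.2},
    "low":  {"search": 0.1, "recharge": 0.9},
}

def sample_action(policy, s):
    """Draw an action a with probability pi(s, a)."""
    actions, probs = zip(*policy[s].items())
    return random.choices(actions, weights=probs, k=1)[0]
```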

34
Value Functions under a Policy
State-Value Function: V^π(s) = E_π[ R_t | s_t = s ]
Action-Value Function: Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ]
35
Bellman Equation for a Policy π: State-Value Function
V^π(s) = Σ_a π(s, a) Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V^π(s') ]
36
Backup Diagram: State-Value Function
37
Bellman Equation for a Policy π: Action-Value Function
Q^π(s, a) = Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ Σ_{a'} π(s', a') Q^π(s', a') ]
38
Backup Diagram: Action-Value Function
39
Bellman Equation for a Policy π
  • This is a set of equations (in fact, linear), one
    for each state.
  • It specifies the consistency condition between
    values of states and successor states, and
    rewards.
  • Its unique solution is the value function for π.
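Because the system is linear it can be solved exactly, but it is also common to iterate the Bellman equation as an update rule until the values converge (iterative policy evaluation). The sketch below assumes the dictionary layout used for P, R, and pi in the earlier sketches; it is one standard way to solve the equations, not a method prescribed by the slides.

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, theta=1e-8):
    """Iterate the Bellman equation for V^pi until convergence:
    V(s) <- sum_a pi(s,a) * sum_s' P(s'|s,a) * [R(s,a,s') + gamma * V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s].get(a, 0.0)
                * sum(p * (R[(s, a)][s_p] + gamma * V[s_p])
                      for s_p, p in P[(s, a)].items())
                for a in actions if (s, a) in P
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Usage with the illustrative tables from the earlier sketches:
# V = policy_evaluation(["high", "low"], ["search", "wait", "recharge"], P, R, pi)
```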

40
Example (Grid World)
  • State: the agent's position on the grid
  • Actions: north, south, east, west; the resulting state is deterministic.
  • Reward: an action that would take the agent off the grid leaves its position unchanged but yields reward -1.
  • All other actions produce reward 0, except those that move the agent out of the special states A and B, as shown.

State-value function for the equiprobable random policy, γ = 0.9
41
Optimal Policy (π*)
Optimal State-Value Function: V*(s) = max_π V^π(s)
Optimal Action-Value Function: Q*(s, a) = max_π Q^π(s, a)
What is the relation between them?  Q*(s, a) = E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
42
Optimal Value Functions
Bellman Optimality Equations
V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]
Q*(s, a) = Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ max_{a'} Q*(s', a') ]
43
Optimal Value Functions
Bellman Optimality Equations
How do we use the optimal value function to determine the action to take in each state?
How do we compute it? How do we store it?
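One standard way to answer these questions numerically is value iteration, followed by a greedy one-step lookahead to pick the action in each state. The sketch below reuses the illustrative dictionary layout from the earlier sketches; note that storing V as a table with one entry per state is exactly what becomes infeasible when the state space is huge.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """V(s) <- max_a sum_s' P(s'|s,a) * [R(s,a,s') + gamma * V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [sum(p * (R[(s, a)][s_p] + gamma * V[s_p])
                     for s_p, p in P[(s, a)].items())
                 for a in actions if (s, a) in P]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def greedy_policy(V, states, actions, P, R, gamma=0.9):
    """Act greedily with respect to V*: in each state pick the action
    that maximizes the one-step lookahead value."""
    return {
        s: max((a for a in actions if (s, a) in P),
               key=lambda a: sum(p * (R[(s, a)][s_p] + gamma * V[s_p])
                                 for s_p, p in P[(s, a)].items()))
        for s in states
    }
```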
44
Example (Grid World)
[Figure: state-value function V for the random policy vs. V* and π* for the optimal policy]
45
Finding Optimal Solution via Bellman
  • Finding an optimal policy by solving the Bellman
    Optimality Equation requires the following
  • accurate knowledge of environment dynamics
  • enough space and time for computation
  • the Markov Property.

46
Example (Recycling Robot)
47
Example (Recycling Robot)
48
Optimality and Approximation
  • How much space and time do we need?
  • polynomial in number of states (via dynamic
    programming methods)
  • BUT, the number of states is often huge (e.g., backgammon has about 10^20 states).
  • We usually have to settle for approximations.
  • Many RL methods can be understood as
    approximately solving the Bellman Optimality
    Equation.