Title: Reinforcement Learning
1 Reinforcement Learning
2 Content
- Introduction
- Main Elements
- Markov Decision Process (MDP)
- Value Functions
3 Reinforcement Learning
4 Reinforcement Learning
- Learning from interaction (with environment)
- Goal-directed learning
- Learning what to do and its effect
- Trial-and-error search and delayed reward
- The two most important distinguishing features of reinforcement learning
5 Exploration and Exploitation
- The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future.
- Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task (see the sketch below).
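This tradeoff is often handled with ε-greedy action selection: exploit the best-looking action most of the time, but explore at random with a small probability ε. The sketch below is illustrative only; the function name and the toy action-value numbers are assumptions, not from the slides.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

# Toy usage: estimated action values for a 3-armed bandit (made-up numbers).
q = [0.2, 0.5, 0.1]
action = epsilon_greedy(q, epsilon=0.1)
```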
6 Supervised Learning
Training Info: desired (target) outputs
Error = (target output - actual output)
7 Reinforcement Learning
Training Info: evaluations ("rewards" / "penalties")
Objective: get as much reward as possible
8 Reinforcement Learning
9 Main Elements
To maximize value
10 Main Elements
Total reward (long term)
Immediate reward (short term)
To maximize value
11 Example (Bioreactor)
- State
- current temperature and other sensory readings, composition, target chemical
- Actions
- how much heating, stirring are required?
- what ingredients need to be added?
- Reward
- moment-by-moment production of desired chemical
12 Example (Pick-and-Place Robot)
- State
- current positions and velocities of joints
- Actions
- voltages to apply to motors
- Reward
- reach end-position successfully, speed,
smoothness of trajectory
13 Example (Recycling Robot)
- State
- charge level of battery
- Actions
- look for cans, wait for can, go recharge
- Reward
- positive for finding cans, negative for running
out of battery
14 Main Elements
- Environment
- Its state is perceivable
- Reinforcement Function
- To generate reward
- A function of states (or state/action pairs)
- Value Function
- The potential to reach the goal (with maximum total reward)
- To determine the policy
- A function of state
15 The Agent-Environment Interface
Frequently, we model the environment as a Markov Decision Process (MDP).
[Figure: the agent-environment interaction loop between Agent and Environment]
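A minimal sketch of this interface, assuming nothing beyond the loop itself: the Environment class, its method names, and the random dynamics below are placeholders for illustration, not a particular library's API.

```python
import random

class Environment:
    """Placeholder environment: states, actions, and dynamics are made up for illustration."""
    def __init__(self):
        self.states = ["s0", "s1", "s2"]
        self.actions = ["a0", "a1"]
        self.state = None

    def reset(self):
        self.state = random.choice(self.states)
        return self.state

    def step(self, action):
        # A real environment would compute these from (self.state, action);
        # here they are random placeholders.
        self.state = random.choice(self.states)
        reward = random.random()
        return self.state, reward

# The interface itself: the agent observes a state, selects an action,
# and receives a reward plus the next state from the environment.
env = Environment()
state = env.reset()
for t in range(5):
    action = random.choice(env.actions)    # a trivial (random) agent
    state, reward = env.step(action)
```

At each step the agent observes s_t, chooses a_t, and the environment returns r_{t+1} and s_{t+1}; an MDP formalizes exactly this loop.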
16 Reward Function
S: a set of states
A: a set of actions
R: S → ℝ  or  R: S × A → ℝ
- A reward function is closely related to the goal in reinforcement learning.
- It maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state.
17 Goals and Rewards
- The agent's goal is to maximize the total amount of reward it receives.
- This means maximizing not just immediate reward, but cumulative reward in the long run.
18 Goals and Rewards
Can you design another reward function?
[Figure: example task with two candidate reward values, 1 and 0]
19 Goals and Rewards
State                   Reward
Win                     +1
Loss                    -1
Draw or Non-terminal     0
20 Goals and Rewards
The reward signal is the way of communicating to
the agent what we want it to achieve, not how we
want it achieved.
[Figure: example reward assignment using the values 0 and -1]
21 Reinforcement Learning
- Markov Decision Processes
22 Definition
- An MDP consists of
- A set of states S, and a set of actions A
- A transition distribution: P(s′ | s, a) = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }
- Expected next rewards: R(s, a, s′) = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ ]
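To make the definition concrete, the transition distribution and the expected rewards can be stored as simple lookup tables. The two-state MDP below is a made-up example (state and action names are assumptions), shown only as one possible representation.

```python
# States S = {s0, s1}, actions A = {stay, go} (names invented for illustration).
# P[(s, a)] is the transition distribution over next states;
# R[(s, a, s_next)] is the expected next reward.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
}
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "go", "s0"): 0.0,   ("s0", "go", "s1"): 1.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "go", "s0"): -1.0,  ("s1", "go", "s1"): 2.0,
}

# Sanity check: each transition distribution must sum to 1.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```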
23 Example (Recycling Robot)
[Figure: state-transition diagram with battery states High and Low]
24 Example (Recycling Robot)
25 Decision Making
- Many stochastic processes can be modeled within the MDP framework.
- The process is controlled by choosing actions in each state, trying to attain the maximum long-term reward.
How to find the optimal policy?
26 Reinforcement Learning
27 Value Functions
V^π(s)  or  Q^π(s, a)
- To estimate how good it is for the agent to be in a given state
- (or how good it is to perform a given action in a given state).
- The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return.
- Value functions are defined with respect to particular policies.
28 Returns
- Episodic Tasks
- finite-horizon tasks
- terminates after a fixed number of time steps
- indefinite-horizon tasks
- can last arbitrarily long but eventually terminate
- Continual Tasks
- infinite-horizon tasks
29 Finite Horizon Tasks
k-armed bandit problem
Return at time t: R_t = r_{t+1} + r_{t+2} + ... + r_{t+h}   (h: fixed, finite horizon)
Expected return at time t: E[ R_t ]
30 Indefinite Horizon Tasks
Play chess
Return at time t: R_t = r_{t+1} + r_{t+2} + ... + r_T   (T: final time step of the episode)
Expected return at time t: E[ R_t ]
31 Continual Tasks
Control
Return at time t: R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}
Expected return at time t: E[ R_t ]
32 Unified Notation
Reformulation of episodic tasks: an episode is treated as ending in an absorbing state that yields reward 0 forever after.
Discounted return at time t: R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},  with 0 ≤ γ ≤ 1 (γ < 1 for continual tasks)
γ: discounting factor
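A tiny helper makes the discounted return concrete; the function name is arbitrary, and the reward list stands in for r_{t+1}, r_{t+2}, ... of one (truncated) trajectory.

```python
def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
    computed over a finite list of rewards (a truncated sum for continual tasks)."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Three rewards of 1 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```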
33 Policies
- A policy, π, is a mapping from states, s ∈ S, and actions, a ∈ A(s), to the probability π(s, a) of taking action a when in state s.
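A stochastic policy can be stored directly as the table π(s, a) and sampled from. In the sketch below the probabilities are invented for illustration, while the state and action names echo the recycling-robot example.

```python
import random

# pi[s][a] = probability of taking action a in state s (made-up numbers).
pi = {
    "high": {"search": 0.7, "wait": 0.3},
    "low":  {"search": 0.1, "wait": 0.4, "recharge": 0.5},
}

def sample_action(pi, state):
    """Draw an action with probability pi(state, action)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "low"))
```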
34 Value Functions under a Policy
State-Value Function: V^π(s) = E_π[ R_t | s_t = s ]
Action-Value Function: Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ]
35 Bellman Equation for a Policy π: State-Value Function
V^π(s) = Σ_a π(s, a) Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ V^π(s′) ]
36 Backup Diagram: State-Value Function
37 Bellman Equation for a Policy π: Action-Value Function
Q^π(s, a) = Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ Σ_{a′} π(s′, a′) Q^π(s′, a′) ]
38 Backup Diagram: Action-Value Function
39 Bellman Equation for a Policy π
- This is a set of equations (in fact, linear), one for each state.
- It specifies the consistency condition between values of states and successor states, and rewards.
- Its unique solution is the value function for π (an iterative sketch follows below).
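Because the system is linear, V^π can be found by solving it directly or, as sketched below, by iterating the Bellman equation until the values stop changing (iterative policy evaluation). The tabular P, R, and pi dictionaries follow the made-up representation sketched earlier and are assumptions, not the slides' notation.

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, theta=1e-8):
    """Iterate the Bellman equation for a fixed policy pi until the values settle:
    V(s) <- sum_a pi(s,a) * sum_s' P(s'|s,a) * [ R(s,a,s') + gamma * V(s') ]"""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                prob_a = pi.get(s, {}).get(a, 0.0)
                if prob_a == 0.0:
                    continue
                for s_next, p in P.get((s, a), {}).items():
                    v_new += prob_a * p * (R[(s, a, s_next)] + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```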
40 Example (Grid World)
- State: position on the grid
- Actions: north, south, east, west; the resulting state is deterministic.
- Reward: if an action would take the agent off the grid, the agent does not move but receives reward -1.
- Other actions produce reward 0, except actions that move the agent out of the special states A and B as shown.
State-value function for the equiprobable random policy, γ = 0.9
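A numerical check of this slide is straightforward, with one caveat: the figure defining the special states is not reproduced in the text, so the positions of A and B, the jump targets A′ and B′, and the rewards +10 and +5 below follow the standard textbook version of this grid world and should be read as assumptions.

```python
import numpy as np

N, gamma = 5, 0.9
# Assumed special-state dynamics (standard textbook values, not given in the text):
# every action from A jumps to A' with reward +10; every action from B to B' with +5.
A, A_prime = (0, 1), (4, 1)
B, B_prime = (0, 3), (2, 3)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # north, south, west, east

V = np.zeros((N, N))
for _ in range(1000):                          # sweep the Bellman equation to convergence
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            for dr, dc in moves:               # equiprobable random policy: prob 1/4 each
                if (r, c) == A:
                    (nr, nc), reward = A_prime, 10.0
                elif (r, c) == B:
                    (nr, nc), reward = B_prime, 5.0
                else:
                    nr, nc, reward = r + dr, c + dc, 0.0
                    if not (0 <= nr < N and 0 <= nc < N):
                        nr, nc, reward = r, c, -1.0   # off the grid: stay put, reward -1
                V_new[r, c] += 0.25 * (reward + gamma * V[nr, nc])
    V = V_new
print(np.round(V, 1))   # state values of the random policy under the assumed dynamics
```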
41 Optimal Policy (π*)
Optimal State-Value Function: V*(s) = max_π V^π(s)
What is the relation between them?
Optimal Action-Value Function: Q*(s, a) = max_π Q^π(s, a)
42 Optimal Value Functions
Bellman Optimality Equations:
V*(s) = max_a Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ V*(s′) ]
43 Optimal Value Functions
Bellman Optimality Equations:
Q*(s, a) = Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ max_{a′} Q*(s′, a′) ]
How do we apply the value function to determine the action to take in each state?
How to compute? How to store?
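One standard answer to "how to compute" is value iteration, which turns the Bellman optimality equation into a repeated update and stores V as a table with one entry per state. The sketch below reuses the tabular P and R dictionaries assumed earlier.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """Repeatedly apply V(s) <- max_a sum_s' P(s'|s,a) * [ R(s,a,s') + gamma * V(s') ],
    then read off a greedy policy from the converged values."""
    def backup(s, a, V):
        return sum(p * (R[(s, a, s2)] + gamma * V[s2])
                   for s2, p in P[(s, a)].items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(backup(s, a, V) for a in actions if (s, a) in P)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Greedy policy: in each state, pick the action with the largest backed-up value.
    policy = {s: max((a for a in actions if (s, a) in P),
                     key=lambda a: backup(s, a, V))
              for s in states}
    return V, policy
```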
44 Example (Grid World)
[Figure: the random policy and the optimal policy for the grid world, with the optimal state-value function V* and optimal policy π*]
45 Finding Optimal Solution via Bellman
- Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
- accurate knowledge of environment dynamics
- enough space and time for computation
- the Markov Property.
46 Example (Recycling Robot)
47 Example (Recycling Robot)
48 Optimality and Approximation
- How much space and time do we need?
- polynomial in the number of states (via dynamic programming methods)
- BUT, the number of states is often huge (e.g., backgammon has about 10^20 states).
- We usually have to settle for approximations.
- Many RL methods can be understood as approximately solving the Bellman Optimality Equation.