Title: Reinforcement Learning
1 Reinforcement Learning
- Problem
  - Given a Markov Decision Process with unknown transition dynamics
  - Find an optimal policy
2 A few random bits of AI history
- Donald Michie (1961-1963), Matchbox Educable Noughts And Crosses Engine (MENACE)
- Arthur Samuel (1959), checkers
- Claude Shannon (1950s)
3 Paradigm
[Figure: agent-environment interaction loop - the agent sends an action to the environment, and the environment returns a new state and a reward to the agent.]
4 Expected Return
Expected return: the expected discounted reward,
  G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
where \gamma \in [0, 1] is the discount rate.
- Two ways to reason about a control problem
  - Episodic tasks: tasks that are broken down into episodes. The expected return is calculated only over one episode. (Each episode has length T.)
  - Continuing tasks: tasks where the discounted return is calculated out to infinity. (T is infinity.)
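To make the definition concrete, here is a minimal Python sketch of the discounted return for a finite episode; the reward sequence and the discount value 0.9 are made-up numbers for illustration, not from the slides.

def discounted_return(rewards, gamma):
    """Discounted return G_0 = sum_k gamma^k * r_k for a finite episode."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: an episodic task with T = 4 steps (illustrative numbers).
print(discounted_return([0.0, 0.0, 1.0, 5.0], gamma=0.9))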
5 Markov property
A state is Markov when the next state and reward depend only on the current state and action, not on the earlier history:
  P(S_{t+1} = s', R_{t+1} = r | S_t, A_t) = P(S_{t+1} = s', R_{t+1} = r | S_0, A_0, R_1, \dots, S_t, A_t)
6 Markov property example
- Consider the cart and pole system
- Actions accelerate the cart in either direction
- Why do position, velocity, angle, and angular
velocity define a Markov state space?
7 Markov Decision Processes (MDPs)
- An MDP is defined by
  - State space S
  - Action space A
  - Expected reward function r(s, a)
  - Transition probabilities p(s' | s, a)
An MDP is a tuple (S, A, p, r).
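As a rough illustration of the tuple above, here is one way the pieces of an MDP might be held in code; the two-state MDP, its rewards, and the field names are invented for this example, not taken from the slides.

from dataclasses import dataclass

@dataclass
class MDP:
    states: list    # state space S
    actions: list   # action space A
    p: dict         # transition probabilities: (s, a) -> {s': prob}
    r: dict         # expected reward function: (s, a) -> float
    gamma: float    # discount rate

example = MDP(
    states=["s0", "s1"],
    actions=["left", "right"],
    p={("s0", "left"): {"s0": 1.0}, ("s0", "right"): {"s1": 1.0},
       ("s1", "left"): {"s0": 1.0}, ("s1", "right"): {"s1": 1.0}},
    r={("s0", "left"): 0.0, ("s0", "right"): 1.0,
       ("s1", "left"): 0.0, ("s1", "right"): 0.0},
    gamma=0.9,
)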
8 MDP example
9 Value functions
Expected one-step return: r(s, a) = E[R_{t+1} | S_t = s, A_t = a]
State-action transition probabilities: p(s' | s, a) = P(S_{t+1} = s' | S_t = s, A_t = a)
Deterministic policy: a = \pi(s)
Stochastic policy: \pi(a | s) = P(A_t = a | S_t = s)
- Value of a state: the expected return for executing the policy starting in the given state.
- Value function: value as a function of state.
10 Value functions
Value of a state: the expected return for executing policy \pi starting in the given state.
State-value function: v_\pi(s) = E_\pi[G_t | S_t = s]
Action-value function: q_\pi(s, a) = E_\pi[G_t | S_t = s, A_t = a]
11 Bellman Equation
v_\pi(s) = \sum_a \pi(a | s) [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_\pi(s') ]
12 Backup Diagrams
13 Optimal Value Function
- Optimal value function
  - Maximizes the expected return over the episode
- Ranking policies
  - If v_\pi(s) >= v_{\pi'}(s) for all states s
  - Then \pi >= \pi'
- There is always a policy that is at least as good as all others.
  - This is the optimal policy, \pi_*
Optimal state-value function: v_*(s) = max_\pi v_\pi(s)
Optimal action-value function: q_*(s, a) = max_\pi q_\pi(s, a)
14 Bellman optimality equation for v_*
v_*(s) = max_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_*(s') ]
15 Bellman optimality equation for q_*
q_*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' | s, a) max_{a'} q_*(s', a')
16 Policy evaluation
Policy evaluation: determine the value of each state under a given policy.
Recall the Bellman equation:
  v_\pi(s) = \sum_a \pi(a | s) [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_\pi(s') ]
- Policy evaluation algorithm (see the code sketch below)
  - 1. Initialize v_0(s) = 0 for all s
  - 2. For k = 1 until done
    - For all s in S
      - v_k(s) = \sum_a \pi(a | s) [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_{k-1}(s') ]
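Below is a minimal sketch of the iterative policy-evaluation algorithm just listed, run on an invented two-state MDP under a uniform-random policy; the states, rewards, discount, and tolerance are illustrative assumptions, not from the slides.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]
# P[(s, a)] -> {s': probability}, R[(s, a)] -> expected reward (invented example MDP)
P = {("s0", "left"): {"s0": 1.0}, ("s0", "right"): {"s1": 1.0},
     ("s1", "left"): {"s0": 1.0}, ("s1", "right"): {"s1": 1.0}}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}
# PI[s] -> {a: probability}; here a uniform-random policy.
PI = {s: {a: 0.5 for a in ACTIONS} for s in STATES}

def policy_evaluation(pi, theta=1e-8):
    v = {s: 0.0 for s in STATES}              # step 1: v_0(s) = 0
    while True:                               # step 2: sweep until values stop changing
        delta = 0.0
        for s in STATES:
            new_v = sum(pi[s][a] * (R[(s, a)] + GAMMA *
                        sum(p * v[s2] for s2, p in P[(s, a)].items()))
                        for a in ACTIONS)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

print(policy_evaluation(PI))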
17 Policy improvement
Given v_\pi, act greedily with respect to it:
  \pi'(s) = argmax_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_\pi(s') ]
The new policy \pi' is at least as good as \pi.
18 Policy iteration
Policy evaluation: determine the value of each state under a given policy.
- Policy iteration algorithm (sketched in code below)
  - 1. Initialize \pi arbitrarily
  - 2. Repeat until done
    - Do policy evaluation (compute v_\pi)
    - Do policy improvement (make \pi greedy with respect to v_\pi)
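A sketch of the policy-iteration loop above, alternating evaluation and greedy improvement on the same invented two-state MDP; all numerical values are illustrative assumptions.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]
P = {("s0", "left"): {"s0": 1.0}, ("s0", "right"): {"s1": 1.0},
     ("s1", "left"): {"s0": 1.0}, ("s1", "right"): {"s1": 1.0}}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}

def evaluate(pi, theta=1e-8):
    """Iterative policy evaluation for a deterministic policy pi: s -> a."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            a = pi[s]
            new_v = R[(s, a)] + GAMMA * sum(p * v[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def improve(v):
    """Greedy policy improvement with respect to v."""
    return {s: max(ACTIONS, key=lambda a: R[(s, a)] + GAMMA *
                   sum(p * v[s2] for s2, p in P[(s, a)].items()))
            for s in STATES}

def policy_iteration():
    pi = {s: ACTIONS[0] for s in STATES}   # 1. initialize pi arbitrarily
    while True:                            # 2. repeat until the policy is stable
        v = evaluate(pi)                   #    policy evaluation
        new_pi = improve(v)                #    policy improvement
        if new_pi == pi:
            return pi, v
        pi = new_pi

print(policy_iteration())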
19 Value iteration
- Value iteration: similar to policy iteration, except it does not iterate multiple times evaluating a single policy.
  - Do one step of policy evaluation
  - Then do policy improvement
One step of policy evaluation:
  v_{k+1}(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' | s, \pi(s)) v_k(s')
Policy improvement:
  \pi(s) = argmax_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_k(s') ]
20 Value iteration
One step of policy evaluation:
  v_{k+1}(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' | s, \pi(s)) v_k(s')
Policy improvement:
  \pi(s) = argmax_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_k(s') ]
Combined value iteration update:
  v_{k+1}(s) = max_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_k(s') ]
21 Value iteration
- Value iteration algorithm (see the code sketch below)
  - 1. Initialize v_0(s) = 0 for all s
  - 2. For k = 1 until done
    - For all s in S
      - v_k(s) = max_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_{k-1}(s') ]
Output policy \pi such that \pi(s) = argmax_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v(s') ]
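A sketch of the value-iteration algorithm above on the same invented two-state MDP; the convergence tolerance is an illustrative assumption, and the final greedy extraction corresponds to the "output policy" step.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]
P = {("s0", "left"): {"s0": 1.0}, ("s0", "right"): {"s1": 1.0},
     ("s1", "left"): {"s0": 1.0}, ("s1", "right"): {"s1": 1.0}}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}

def q_value(v, s, a):
    """One-step lookahead: r(s,a) + gamma * sum_s' p(s'|s,a) v(s')."""
    return R[(s, a)] + GAMMA * sum(p * v[s2] for s2, p in P[(s, a)].items())

def value_iteration(theta=1e-8):
    v = {s: 0.0 for s in STATES}                            # v_0(s) = 0
    while True:
        delta = 0.0
        for s in STATES:
            new_v = max(q_value(v, s, a) for a in ACTIONS)  # combined update
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    # Output the greedy policy with respect to the converged values.
    pi = {s: max(ACTIONS, key=lambda a: q_value(v, s, a)) for s in STATES}
    return v, pi

print(value_iteration())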
22 Temporal-Difference (TD) Learning
- Dynamic programming
  - Off-line. You need to know the transition probabilities and reward function.
- TD Learning
  - Learns from sampled transitions, so the transition probabilities and reward function do not need to be known, while simultaneously computing the optimal policy.
  - This happens on-line, while the agent is interacting with the environment.
23 Temporal-Difference (TD) Learning
24 Estimating the state-value function using TD(0)
Value function update:
  V(s) \leftarrow V(s) + \alpha [ r + \gamma V(s') - V(s) ]
- Given: a policy \pi to evaluate
- Initialize V arbitrarily
- For each episode, repeat (see the code sketch below)
  - Initialize agent state s
  - Repeat until the episode terminates
    - Select action a from \pi(s)
    - Take action a, observe r, s'
    - V(s) \leftarrow V(s) + \alpha [ r + \gamma V(s') - V(s) ]
    - s \leftarrow s'
- Sample backups vs. full backups
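A minimal TD(0) sketch for the algorithm above; the five-state random-walk environment, the random policy, and the step size alpha = 0.1 are illustrative assumptions, not from the slides.

import random

N = 5                     # non-terminal states 0..4; exits on either side are terminal
ALPHA, GAMMA = 0.1, 1.0

def step(s, a):
    """Environment: returns (reward, next_state); a is -1 (left) or +1 (right)."""
    s2 = s + a
    if s2 == N:
        return 1.0, None  # terminated on the right, reward +1
    if s2 < 0:
        return 0.0, None  # terminated on the left
    return 0.0, s2

V = {s: 0.0 for s in range(N)}           # initialize V arbitrarily (here to zero)
for episode in range(5000):
    s = N // 2                           # initialize agent state
    while s is not None:
        a = random.choice([-1, 1])       # action from the (random) policy
        r, s2 = step(s, a)               # take action a, observe r, s'
        v_next = 0.0 if s2 is None else V[s2]
        V[s] += ALPHA * (r + GAMMA * v_next - V[s])   # TD(0) update
        s = s2

print({s: round(v, 2) for s, v in V.items()})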
25 ε-greedy actions
Greedy (with probability 1 - ε): a = argmax_a Q(s, a)
Random (with probability ε): choose an action uniformly at random
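A small sketch of ε-greedy selection over a tabular Q; the Q[s][a] dictionary layout is an assumption made for illustration.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)            # random action
    return max(actions, key=lambda a: Q[s][a])   # greedy action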
26 SARSA
Update rule:
  Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma Q(s', a') - Q(s, a) ]
- Initialize Q arbitrarily
- For each episode, repeat (sketched in code below)
  - Initialize agent state s
  - Select action a from s using the ε-greedy strategy on Q
  - Repeat until the episode terminates
    - Take action a, observe r, s'
    - Select action a' from s' using the ε-greedy strategy on Q
    - Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma Q(s', a') - Q(s, a) ]
    - s \leftarrow s', a \leftarrow a'
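A sketch of tabular SARSA following the steps above; the corridor environment (move left/right, -1 per step, goal at the right end) and the values of alpha, gamma, and epsilon are invented for illustration.

import random

N, ALPHA, GAMMA, EPS = 6, 0.1, 1.0, 0.1
ACTIONS = [-1, 1]                      # move left / move right
Q = {s: {a: 0.0 for a in ACTIONS} for s in range(N)}

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[s][a])

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return -1.0, s2, s2 == N - 1       # reward, next state, done (goal state)

for episode in range(2000):
    s = 0                              # initialize agent state
    a = eps_greedy(s)                  # select a from s (epsilon-greedy on Q)
    done = False
    while not done:
        r, s2, done = step(s, a)       # take action a, observe r, s'
        a2 = eps_greedy(s2)            # select a' from s' (epsilon-greedy on Q)
        target = r + (0.0 if done else GAMMA * Q[s2][a2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s, a = s2, a2

print(max(Q[0], key=Q[0].get))         # learned greedy action in the start state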
27 Example: windy grid world
28 Q-learning
Update rule:
  Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma max_{a'} Q(s', a') - Q(s, a) ]
- Initialize Q arbitrarily
- For each episode, repeat (sketched in code below)
  - Initialize agent state s
  - Repeat until the episode terminates
    - Select action a from s using the ε-greedy strategy on Q
    - Take action a, observe r, s'
    - Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma max_{a'} Q(s', a') - Q(s, a) ]
    - s \leftarrow s'
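A sketch of tabular Q-learning following the steps above, on the same invented corridor task; note the off-policy target uses the max over a' rather than the action actually selected next.

import random

N, ALPHA, GAMMA, EPS = 6, 0.1, 1.0, 0.1
ACTIONS = [-1, 1]
Q = {s: {a: 0.0 for a in ACTIONS} for s in range(N)}

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[s][a])

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return -1.0, s2, s2 == N - 1       # reward, next state, done

for episode in range(2000):
    s, done = 0, False                 # initialize agent state
    while not done:
        a = eps_greedy(s)              # select a from s (epsilon-greedy on Q)
        r, s2, done = step(s, a)       # take action a, observe r, s'
        target = r + (0.0 if done else GAMMA * max(Q[s2].values()))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2

print(max(Q[0], key=Q[0].get))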
29 Example: cliff world
30 Eligibility Traces
Idea: look more than one step ahead when forming the TD target, blending returns computed over several horizons.
One-step return: G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})
Two-step return: G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})
n-step return: G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})
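A small sketch of computing the n-step return from a recorded trajectory; the reward list, value table, and discount below are made-up numbers for illustration.

def n_step_return(rewards, states, V, t, n, gamma):
    """G_t^(n) = r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n})."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * V[states[t + n]]

# Example with made-up numbers: rewards[k] is the reward received after step k.
rewards = [0.0, 0.0, 1.0, 0.0]
states = ["s0", "s1", "s2", "s3", "s4"]
V = {s: 0.5 for s in states}
print(n_step_return(rewards, states, V, t=0, n=2, gamma=0.9))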
31 SARSA(λ)
- Initialize Q arbitrarily
- For each episode, repeat (sketched in code below)
  - Initialize agent state s, set e(s, a) = 0 for all s, a, and select action a from s using the ε-greedy strategy on Q
  - Repeat until the episode terminates
    - Take action a, observe r, s'
    - Select action a' from s' using the ε-greedy strategy on Q
    - \delta \leftarrow r + \gamma Q(s', a') - Q(s, a)
    - e(s, a) \leftarrow e(s, a) + 1
    - For all s, a
      - Q(s, a) \leftarrow Q(s, a) + \alpha \delta e(s, a)
      - e(s, a) \leftarrow \gamma \lambda e(s, a)
    - s \leftarrow s', a \leftarrow a'
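A sketch of SARSA(λ) with accumulating eligibility traces following the steps above, again on the invented corridor task; the alpha, gamma, epsilon, and lambda values are illustrative assumptions.

import random

N, ALPHA, GAMMA, EPS, LAM = 6, 0.1, 1.0, 0.1, 0.9
ACTIONS = [-1, 1]
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return -1.0, s2, s2 == N - 1        # reward, next state, done

for episode in range(1000):
    e = {sa: 0.0 for sa in Q}           # reset eligibility traces
    s, a, done = 0, eps_greedy(0), False
    while not done:
        r, s2, done = step(s, a)        # take action a, observe r, s'
        a2 = eps_greedy(s2)             # select a' from s' (epsilon-greedy on Q)
        target = r + (0.0 if done else GAMMA * Q[(s2, a2)])
        delta = target - Q[(s, a)]      # TD error
        e[(s, a)] += 1.0                # accumulate trace for the visited pair
        for sa in Q:                    # for all s, a: update Q and decay traces
            Q[sa] += ALPHA * delta * e[sa]
            e[sa] *= GAMMA * LAM
        s, a = s2, a2

print(max(ACTIONS, key=lambda a: Q[(0, a)]))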