1
Reinforcement Learning
  • Problem
  • Given a Markov Decision Process with unknown
    transition dynamics
  • Find an optimal policy

2
A few random bits of AI history
  • Donald Michie (1961-1963), matchbox educable
    noughts and crosses (MENACE)
  • Arthur Samuel (1959), checkers.
  • Claude Shannon (1950s)

3
Paradigm
(Diagram: the agent-environment loop. The agent sends an action to the
environment; the environment returns the next state and a reward.)
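A minimal sketch of this interaction loop in Python. The Environment
interface used here (reset() returning a state, step(action) returning
state, reward, done) and the RandomAgent are assumptions made purely for
illustration; they are not from the slides.

    import random

    # Sketch of the agent-environment loop.
    class RandomAgent:
        def __init__(self, actions):
            self.actions = actions

        def act(self, state):
            # a real agent would consult its policy; here we act at random
            return random.choice(self.actions)

    def run_episode(env, agent, max_steps=100):
        state = env.reset()                         # initial state
        total_reward = 0.0
        for _ in range(max_steps):
            action = agent.act(state)               # agent picks an action
            state, reward, done = env.step(action)  # environment responds
            total_reward += reward                  # reward signal
            if done:
                break
        return total_reward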
4
Expected Return
Expected return: the expected discounted sum of rewards
Discount rate γ
  • Two ways to reason about a control problem
  • Episodic tasks: tasks that are broken down into
    episodes. Expected return is calculated only over
    one episode. (Each episode has length T.)
  • Continuing tasks: tasks where the discounted return
    is calculated out to infinity. (T is infinite.)
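The standard definition of the discounted return (presumably the equation
this slide showed), with discount rate \gamma \in [0, 1]; for episodic
tasks the sum stops at step T, for continuing tasks it runs to infinity:

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
        = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}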

5
Markov property
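The Markov property in its usual form: the next state and reward depend
only on the current state and action, not on the full history.

    \Pr\{S_{t+1}=s', R_{t+1}=r \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0\}
        = \Pr\{S_{t+1}=s', R_{t+1}=r \mid S_t, A_t\}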
6
Markov property example
  • Consider the cart and pole system
  • Actions accelerate the cart in either direction
  • Why do position, velocity, angle, and angular
    velocity define a Markov state space?

7
Markov Decision Processes (MDPs)
  • An MDP is defined by
  • State space
  • Action space
  • Expected reward function
  • Transition probabilities

An MDP is a tuple (S, A, R, P)
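A small sketch of how such a tuple might be written down in Python for a
toy problem. The two-state MDP below (its states, actions, rewards, and
probabilities) is invented purely for illustration.

    # A toy MDP (S, A, R, P) as plain Python data structures.
    states = ["s0", "s1"]
    actions = ["left", "right"]

    # P[(s, a)] is a list of (next_state, probability) pairs.
    P = {
        ("s0", "left"):  [("s0", 0.9), ("s1", 0.1)],
        ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
        ("s1", "left"):  [("s0", 1.0)],
        ("s1", "right"): [("s1", 1.0)],
    }

    # R[(s, a)] is the expected one-step reward for taking a in s.
    R = {
        ("s0", "left"): 0.0, ("s0", "right"): 1.0,
        ("s1", "left"): 0.0, ("s1", "right"): 2.0,
    }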
8
MDP example
9
Value functions
Expected one-step return
State-action transition probabilities
Deterministic policy
Stochastic policy
  • Value of a state: the expected return for executing
    the policy starting in that state.
  • Value function: the value as a function of state.
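The quantities named above, in standard notation (presumably what the
slide displayed):

    r(s, a) = \mathbb{E}[ R_{t+1} \mid S_t = s, A_t = a ]            % expected one-step return
    p(s' \mid s, a) = \Pr\{ S_{t+1} = s' \mid S_t = s, A_t = a \}    % transition probabilities
    a = \pi(s)                                                       % deterministic policy
    \pi(a \mid s) = \Pr\{ A_t = a \mid S_t = s \}                    % stochastic policy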

10
Value functions
Value of a state: the expected return for executing
the policy starting in that state.
State-value function
Action-value function
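The standard state-value and action-value definitions (presumably what
this slide displayed):

    V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]
    Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]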
11
Bellman Equation
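The Bellman equation for V^\pi, in its standard form:

    V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)
                 \left[ r(s, a, s') + \gamma V^{\pi}(s') \right]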
12
Backup Diagrams
13
Optimal Value Function
  • Optimal value function
  • Maximizes the expected return over the episode
  • Ranking policies
  • If Vπ(s) ≥ Vπ'(s) for all states s
  • Then π ≥ π'
  • There is always a policy that is at least as good
    as all others.
  • This is the optimal policy, π*

Optimal state-value function
Optimal action-value function
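The optimal value functions, defined as the maxima over all policies:

    V^{*}(s) = \max_{\pi} V^{\pi}(s) \qquad Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)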
14
Bellman optimality equation for V*
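Assuming this slide concerns V* (the usual ordering), the standard
Bellman optimality equation is:

    V^{*}(s) = \max_{a} \sum_{s'} p(s' \mid s, a)
               \left[ r(s, a, s') + \gamma V^{*}(s') \right]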
15
Bellman optimality equation for Q*
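And for Q*, the corresponding standard optimality equation:

    Q^{*}(s, a) = \sum_{s'} p(s' \mid s, a)
                  \left[ r(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \right]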
16
Policy evaluation
Policy evaluation: determine the value of each
state under a given policy.
Recall the Bellman equation
  • Policy evaluation algorithm
  • 1. Initialize V(s) arbitrarily for all states s
  • 2. For k = 1 until done
  • For all s, apply the Bellman backup (see the
    sketch after this list)
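A compact sketch of iterative policy evaluation in Python. The data
layout (P[(s, a)] as (next_state, probability) pairs, R[(s, a)] as the
expected reward, policy[s] as the chosen action) is an illustrative
assumption, consistent with the toy MDP example earlier.

    # Iterative policy evaluation: repeatedly apply the Bellman backup
    # until the value estimates stop changing.
    def policy_evaluation(states, policy, P, R, gamma=0.9, tol=1e-6):
        V = {s: 0.0 for s in states}          # step 1: initialize V arbitrarily
        while True:                           # step 2: sweep until converged
            delta = 0.0
            for s in states:                  # for all states
                a = policy[s]
                # Bellman backup for the current policy
                v_new = sum(p * (R[(s, a)] + gamma * V[s2])
                            for s2, p in P[(s, a)])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                return V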

17
Policy improvement
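The usual greedy policy improvement step (presumably what this slide
showed): make the new policy greedy with respect to the current value
function.

    \pi'(s) = \arg\max_{a} \sum_{s'} p(s' \mid s, a)
              \left[ r(s, a, s') + \gamma V^{\pi}(s') \right]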
18
Policy iteration
Policy evaluation: determine the value of each
state under a given policy.
  • Policy iteration algorithm
  • 1. Initialize a policy arbitrarily
  • Repeat until done
  • do policy evaluation
  • do policy improvement

19
Value iteration
  • Value iteration: similar to policy iteration,
    except it does not iterate multiple times when
    evaluating a single policy.
  • Do one step of policy evaluation
  • Then do policy improvement

One step of policy evaluation
Policy improvement
20
Value iteration
One step of policy evaluation
Policy improvement
Combined value iteration update
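Folding the one-step evaluation and the improvement together gives the
standard value iteration update:

    V_{k+1}(s) = \max_{a} \sum_{s'} p(s' \mid s, a)
                 \left[ r(s, a, s') + \gamma V_{k}(s') \right]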
21
Value iteration
  • Value iteration algorithm
  • 1. Initialize V(s) arbitrarily for all states s
  • 2. For k = 1 until done
  • For all s, apply the combined value iteration
    update above

Output the policy π that is greedy with respect to the final value function V
22
Temporal-Difference (TD) Learning
  • Dynamic programming
  • Off-line: you need to know the transition
    probabilities and the reward function.
  • TD Learning
  • Estimates values directly from sampled transitions
    and rewards, without a model of the transition
    probabilities or reward function, while also
    computing the optimal policy.
  • This happens on-line, while the agent is
    interacting with the environment.

23
Temporal-Difference (TD) Learning
24
Estimating state-value function using TD(0)
Value function update
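The standard TD(0) update, with step size \alpha:

    V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]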
  • Given a policy π to evaluate
  • Initialize V arbitrarily
  • For each episode, repeat
  • Initialize agent state s
  • repeat until episode terminates
  • take action a, observe r and s'
  • apply the value function update above
  • Sample backups vs. full backups
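A sketch of TD(0) prediction in Python, assuming a hypothetical
environment object with reset()/step() (as in the earlier loop sketch)
and a fixed policy function; all names are illustrative assumptions.

    from collections import defaultdict

    # TD(0) prediction: estimate V for a fixed policy from sampled
    # transitions (sample backups), without a model of the environment.
    def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
        V = defaultdict(float)                 # initialize V arbitrarily (0)
        for _ in range(num_episodes):
            s = env.reset()                    # initialize agent state
            done = False
            while not done:                    # until the episode terminates
                a = policy(s)                  # action from the given policy
                s_next, r, done = env.step(a)  # take action, observe r and s'
                # TD(0) update: move V(s) toward the one-step target
                V[s] += alpha * (r + gamma * V[s_next] - V[s])
                s = s_next
        return V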

25
ε-greedy actions
Greedy
Random
  • ε-greedy
  • With probability 1 - ε, take the greedy action
  • else take a random action
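A small ε-greedy action-selection helper in Python. The Q-table layout
Q[(state, action)] is an illustrative assumption.

    import random

    # epsilon-greedy: with probability epsilon pick a random action,
    # otherwise pick the action with the highest estimated value.
    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions)                 # explore
        return max(actions, key=lambda a: Q[(state, a)])  # exploit (greedy)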

26
SARSA
Update rule
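The standard SARSA update (named after the s, a, r, s', a' quintuple it
uses), with step size \alpha:

    Q(S_t, A_t) \leftarrow Q(S_t, A_t)
        + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]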
  • Initialize Q(s, a) arbitrarily
  • For each episode, repeat
  • Initialize agent state s
  • Select action a from s using the ε-greedy
    strategy on Q
  • repeat until episode terminates
  • take action a, observe r and s'
  • Select action a' from s' using the ε-greedy
    strategy on Q
  • apply the update rule above, then set
    s ← s' and a ← a'

27
Example windy grid world
28
Q-learning
Update rule
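The standard Q-learning update; unlike SARSA it bootstraps from the
greedy action in the next state:

    Q(S_t, A_t) \leftarrow Q(S_t, A_t)
        + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]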
  • Initialize Q(s, a) arbitrarily
  • For each episode, repeat
  • Initialize agent state s
  • repeat until episode terminates
  • Select action a from s using the ε-greedy
    strategy on Q
  • take action a, observe r and s'
  • apply the update rule above, then set s ← s'

29
Example cliff world
30
Eligibility Traces
Idea
One-step return
Two-step return
n-step return
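The standard one-step, two-step, and n-step return targets that motivate
eligibility traces:

    G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})
    G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})
    G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})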
31
SARSA(λ)
  • Initialize Q(s, a) arbitrarily
  • For each episode, repeat
  • Initialize agent state s
  • repeat until episode terminates
  • take action a, observe r and s'
  • Select action a' from s' using the
    ε-greedy strategy on Q
  • for all s, a: update the eligibility traces and
    action values (see the sketch below)
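The usual SARSA(λ) update with accumulating eligibility traces, applied
to every state-action pair on each step. This is the standard form; the
slide's exact variant is not recoverable from the transcript.

    \delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)
    e(S_t, A_t) \leftarrow e(S_t, A_t) + 1
    Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta_t\, e(s, a)
        \quad \text{and} \quad
    e(s, a) \leftarrow \gamma \lambda\, e(s, a) \quad \text{for all } s, a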