Title: Reinforcement Learning
1 Reinforcement Learning
- Problem
  - Given a Markov Decision Process with unknown transition dynamics
  - Find an optimal policy
2 A few random bits of AI history
- Donald Michie (1961-1963), Matchbox Educable Noughts And Crosses Engine (MENACE)
- Arthur Samuel (1959), checkers
- Claude Shannon (1950s)
3 Paradigm
[Figure: agent-environment interaction loop - the agent sends an action to the environment, and the environment returns a new state and a reward to the agent.]
4 Expected Return
Expected return: the expected discounted reward,
  G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
where \gamma \in [0, 1] is the discount rate.
- Two ways to reason about a control problem
  - Episodic tasks: tasks that are broken down into episodes. The expected return is calculated only over one episode. (Each episode has length T.)
  - Continuing tasks: tasks where the discounted return is calculated out to infinity. (T is infinity.)
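To make the definition concrete, here is a minimal Python sketch of the discounted return for a finite episode; the reward sequence and the discount value 0.9 are made-up numbers for illustration, not from the slides.

def discounted_return(rewards, gamma):
    """Discounted return G_0 = sum_k gamma^k * r_k for a finite episode."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: an episodic task with T = 4 steps (illustrative numbers).
print(discounted_return([0.0, 0.0, 1.0, 5.0], gamma=0.9))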
5 Markov property
A state is Markov when the next state and reward depend only on the current state and action, not on the earlier history:
  P(S_{t+1} = s', R_{t+1} = r | S_t, A_t) = P(S_{t+1} = s', R_{t+1} = r | S_0, A_0, R_1, \dots, S_t, A_t)
6 Markov property example
- Consider the cart and pole system
- Actions accelerate the cart in either direction
- Why do position, velocity, angle, and angular
velocity define a Markov state space?
7 Markov Decision Processes (MDPs)
- An MDP is defined by
  - State space S
  - Action space A
  - Expected reward function r(s, a)
  - Transition probabilities p(s' | s, a)
An MDP is a tuple (S, A, p, r).
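As a rough illustration of the tuple above, here is one way the pieces of an MDP might be held in code; the two-state MDP, its rewards, and the field names are invented for this example, not taken from the slides.

from dataclasses import dataclass

@dataclass
class MDP:
    states: list    # state space S
    actions: list   # action space A
    p: dict         # transition probabilities: (s, a) -> {s': prob}
    r: dict         # expected reward function: (s, a) -> float
    gamma: float    # discount rate

example = MDP(
    states=["s0", "s1"],
    actions=["left", "right"],
    p={("s0", "left"): {"s0": 1.0}, ("s0", "right"): {"s1": 1.0},
       ("s1", "left"): {"s0": 1.0}, ("s1", "right"): {"s1": 1.0}},
    r={("s0", "left"): 0.0, ("s0", "right"): 1.0,
       ("s1", "left"): 0.0, ("s1", "right"): 0.0},
    gamma=0.9,
)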
8 MDP example
9 Value functions
Expected one-step return: r(s, a) = E[R_{t+1} | S_t = s, A_t = a]
State-action transition probabilities: p(s' | s, a) = P(S_{t+1} = s' | S_t = s, A_t = a)
Deterministic policy: a = \pi(s)
Stochastic policy: \pi(a | s) = P(A_t = a | S_t = s)
- Value of a state: the expected return for executing the policy starting in the given state.
- Value function: value as a function of state.
10 Value functions
Value of a state: the expected return for executing policy \pi starting in the given state.
State-value function: v_\pi(s) = E_\pi[G_t | S_t = s]
Action-value function: q_\pi(s, a) = E_\pi[G_t | S_t = s, A_t = a]
11 Bellman Equation
v_\pi(s) = \sum_a \pi(a | s) [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_\pi(s') ]
12 Backup Diagrams
13 Optimal Value Function
- Optimal value function
  - Maximizes the expected return over the episode
- Ranking policies
  - If v_\pi(s) >= v_{\pi'}(s) for all states s
  - Then \pi >= \pi'
- There is always a policy that is at least as good as all others.
  - This is the optimal policy, \pi_*
Optimal state-value function: v_*(s) = max_\pi v_\pi(s)
Optimal action-value function: q_*(s, a) = max_\pi q_\pi(s, a)
14 Bellman optimality equation for v_*
v_*(s) = max_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_*(s') ]
15 Bellman optimality equation for q_*
q_*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' | s, a) max_{a'} q_*(s', a')
16 Policy evaluation
Policy evaluation: determine the value of each state under a given policy.
Recall the Bellman equation:
  v_\pi(s) = \sum_a \pi(a | s) [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_\pi(s') ]
- Policy evaluation algorithm (see the code sketch below)
  - 1. Initialize v_0(s) = 0 for all s
  - 2. For k = 1 until done
    - For all s in S
      - v_k(s) = \sum_a \pi(a | s) [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_{k-1}(s') ]
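Below is a minimal sketch of the iterative policy-evaluation algorithm just listed, run on an invented two-state MDP under a uniform-random policy; the states, rewards, discount, and tolerance are illustrative assumptions, not from the slides.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]
# P[(s, a)] -> {s': probability}, R[(s, a)] -> expected reward (invented example MDP)
P = {("s0", "left"): {"s0": 1.0}, ("s0", "right"): {"s1": 1.0},
     ("s1", "left"): {"s0": 1.0}, ("s1", "right"): {"s1": 1.0}}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}
# PI[s] -> {a: probability}; here a uniform-random policy.
PI = {s: {a: 0.5 for a in ACTIONS} for s in STATES}

def policy_evaluation(pi, theta=1e-8):
    v = {s: 0.0 for s in STATES}              # step 1: v_0(s) = 0
    while True:                               # step 2: sweep until values stop changing
        delta = 0.0
        for s in STATES:
            new_v = sum(pi[s][a] * (R[(s, a)] + GAMMA *
                        sum(p * v[s2] for s2, p in P[(s, a)].items()))
                        for a in ACTIONS)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

print(policy_evaluation(PI))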
17 Policy improvement
Given v_\pi, act greedily with respect to it:
  \pi'(s) = argmax_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_\pi(s') ]
The new policy \pi' is at least as good as \pi.
18 Policy iteration
Policy evaluation: determine the value of each state under a given policy.
- Policy iteration algorithm (sketched in code below)
  - 1. Initialize \pi arbitrarily
  - 2. Repeat until done
    - Do policy evaluation (compute v_\pi)
    - Do policy improvement (make \pi greedy with respect to v_\pi)
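A sketch of the policy-iteration loop above, alternating evaluation and greedy improvement on the same invented two-state MDP; all numerical values are illustrative assumptions.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]
P = {("s0", "left"): {"s0": 1.0}, ("s0", "right"): {"s1": 1.0},
     ("s1", "left"): {"s0": 1.0}, ("s1", "right"): {"s1": 1.0}}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}

def evaluate(pi, theta=1e-8):
    """Iterative policy evaluation for a deterministic policy pi: s -> a."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            a = pi[s]
            new_v = R[(s, a)] + GAMMA * sum(p * v[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def improve(v):
    """Greedy policy improvement with respect to v."""
    return {s: max(ACTIONS, key=lambda a: R[(s, a)] + GAMMA *
                   sum(p * v[s2] for s2, p in P[(s, a)].items()))
            for s in STATES}

def policy_iteration():
    pi = {s: ACTIONS[0] for s in STATES}   # 1. initialize pi arbitrarily
    while True:                            # 2. repeat until the policy is stable
        v = evaluate(pi)                   #    policy evaluation
        new_pi = improve(v)                #    policy improvement
        if new_pi == pi:
            return pi, v
        pi = new_pi

print(policy_iteration())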
19 Value iteration
- Value iteration: similar to policy iteration, except it does not iterate multiple times evaluating a single policy.
  - Do one step of policy evaluation
  - Then do policy improvement
One step of policy evaluation:
  v_{k+1}(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' | s, \pi(s)) v_k(s')
Policy improvement:
  \pi(s) = argmax_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_k(s') ]
20 Value iteration
One step of policy evaluation:
  v_{k+1}(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' | s, \pi(s)) v_k(s')
Policy improvement:
  \pi(s) = argmax_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_k(s') ]
Combined value iteration update:
  v_{k+1}(s) = max_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_k(s') ]
21 Value iteration
- Value iteration algorithm (see the code sketch below)
  - 1. Initialize v_0(s) = 0 for all s
  - 2. For k = 1 until done
    - For all s in S
      - v_k(s) = max_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v_{k-1}(s') ]
Output policy \pi such that \pi(s) = argmax_a [ r(s, a) + \gamma \sum_{s'} p(s' | s, a) v(s') ]
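A sketch of the value-iteration algorithm above on the same invented two-state MDP; the convergence tolerance is an illustrative assumption, and the final greedy extraction corresponds to the "output policy" step.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]
P = {("s0", "left"): {"s0": 1.0}, ("s0", "right"): {"s1": 1.0},
     ("s1", "left"): {"s0": 1.0}, ("s1", "right"): {"s1": 1.0}}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}

def q_value(v, s, a):
    """One-step lookahead: r(s,a) + gamma * sum_s' p(s'|s,a) v(s')."""
    return R[(s, a)] + GAMMA * sum(p * v[s2] for s2, p in P[(s, a)].items())

def value_iteration(theta=1e-8):
    v = {s: 0.0 for s in STATES}                            # v_0(s) = 0
    while True:
        delta = 0.0
        for s in STATES:
            new_v = max(q_value(v, s, a) for a in ACTIONS)  # combined update
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    # Output the greedy policy with respect to the converged values.
    pi = {s: max(ACTIONS, key=lambda a: q_value(v, s, a)) for s in STATES}
    return v, pi

print(value_iteration())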
22 Temporal-Difference (TD) Learning
- Dynamic programming
  - Off-line. You need to know the transition probabilities and reward function.
- TD Learning
  - Learns from sampled transitions, so the transition probabilities and reward function do not need to be known, while simultaneously computing the optimal policy.
  - This happens on-line, while the agent is interacting with the environment.
23 Temporal-Difference (TD) Learning
24 Estimating the state-value function using TD(0)
Value function update:
  V(s) \leftarrow V(s) + \alpha [ r + \gamma V(s') - V(s) ]
- Given: a policy \pi to evaluate
- Initialize V arbitrarily
- For each episode, repeat (see the code sketch below)
  - Initialize agent state s
  - Repeat until the episode terminates
    - Select action a from \pi(s)
    - Take action a, observe r, s'
    - V(s) \leftarrow V(s) + \alpha [ r + \gamma V(s') - V(s) ]
    - s \leftarrow s'
- Sample backups vs. full backups
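A minimal TD(0) sketch for the algorithm above; the five-state random-walk environment, the random policy, and the step size alpha = 0.1 are illustrative assumptions, not from the slides.

import random

N = 5                     # non-terminal states 0..4; exits on either side are terminal
ALPHA, GAMMA = 0.1, 1.0

def step(s, a):
    """Environment: returns (reward, next_state); a is -1 (left) or +1 (right)."""
    s2 = s + a
    if s2 == N:
        return 1.0, None  # terminated on the right, reward +1
    if s2 < 0:
        return 0.0, None  # terminated on the left
    return 0.0, s2

V = {s: 0.0 for s in range(N)}           # initialize V arbitrarily (here to zero)
for episode in range(5000):
    s = N // 2                           # initialize agent state
    while s is not None:
        a = random.choice([-1, 1])       # action from the (random) policy
        r, s2 = step(s, a)               # take action a, observe r, s'
        v_next = 0.0 if s2 is None else V[s2]
        V[s] += ALPHA * (r + GAMMA * v_next - V[s])   # TD(0) update
        s = s2

print({s: round(v, 2) for s, v in V.items()})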
25 ε-greedy actions
Greedy (with probability 1 - ε): a = argmax_a Q(s, a)
Random (with probability ε): choose an action uniformly at random
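A small sketch of ε-greedy selection over a tabular Q; the Q[s][a] dictionary layout is an assumption made for illustration.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)            # random action
    return max(actions, key=lambda a: Q[s][a])   # greedy action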
26 SARSA
Update rule:
  Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma Q(s', a') - Q(s, a) ]
- Initialize Q arbitrarily
- For each episode, repeat (sketched in code below)
  - Initialize agent state s
  - Select action a from s using the ε-greedy strategy on Q
  - Repeat until the episode terminates
    - Take action a, observe r, s'
    - Select action a' from s' using the ε-greedy strategy on Q
    - Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma Q(s', a') - Q(s, a) ]
    - s \leftarrow s', a \leftarrow a'
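A sketch of tabular SARSA following the steps above; the corridor environment (move left/right, -1 per step, goal at the right end) and the values of alpha, gamma, and epsilon are invented for illustration.

import random

N, ALPHA, GAMMA, EPS = 6, 0.1, 1.0, 0.1
ACTIONS = [-1, 1]                      # move left / move right
Q = {s: {a: 0.0 for a in ACTIONS} for s in range(N)}

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[s][a])

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return -1.0, s2, s2 == N - 1       # reward, next state, done (goal state)

for episode in range(2000):
    s = 0                              # initialize agent state
    a = eps_greedy(s)                  # select a from s (epsilon-greedy on Q)
    done = False
    while not done:
        r, s2, done = step(s, a)       # take action a, observe r, s'
        a2 = eps_greedy(s2)            # select a' from s' (epsilon-greedy on Q)
        target = r + (0.0 if done else GAMMA * Q[s2][a2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s, a = s2, a2

print(max(Q[0], key=Q[0].get))         # learned greedy action in the start state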
27 Example: windy grid world
28 Q-learning
Update rule:
  Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma max_{a'} Q(s', a') - Q(s, a) ]
- Initialize Q arbitrarily
- For each episode, repeat (sketched in code below)
  - Initialize agent state s
  - Repeat until the episode terminates
    - Select action a from s using the ε-greedy strategy on Q
    - Take action a, observe r, s'
    - Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma max_{a'} Q(s', a') - Q(s, a) ]
    - s \leftarrow s'
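A sketch of tabular Q-learning following the steps above, on the same invented corridor task; note the off-policy target uses the max over a' rather than the action actually selected next.

import random

N, ALPHA, GAMMA, EPS = 6, 0.1, 1.0, 0.1
ACTIONS = [-1, 1]
Q = {s: {a: 0.0 for a in ACTIONS} for s in range(N)}

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[s][a])

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return -1.0, s2, s2 == N - 1       # reward, next state, done

for episode in range(2000):
    s, done = 0, False                 # initialize agent state
    while not done:
        a = eps_greedy(s)              # select a from s (epsilon-greedy on Q)
        r, s2, done = step(s, a)       # take action a, observe r, s'
        target = r + (0.0 if done else GAMMA * max(Q[s2].values()))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2

print(max(Q[0], key=Q[0].get))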
29 Example: cliff world
30 Eligibility Traces
Idea: look more than one step ahead when forming the TD target, blending returns computed over several horizons.
One-step return: G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})
Two-step return: G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})
n-step return: G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})
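A small sketch of computing the n-step return from a recorded trajectory; the reward list, value table, and discount below are made-up numbers for illustration.

def n_step_return(rewards, states, V, t, n, gamma):
    """G_t^(n) = r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n})."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * V[states[t + n]]

# Example with made-up numbers: rewards[k] is the reward received after step k.
rewards = [0.0, 0.0, 1.0, 0.0]
states = ["s0", "s1", "s2", "s3", "s4"]
V = {s: 0.5 for s in states}
print(n_step_return(rewards, states, V, t=0, n=2, gamma=0.9))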
31 SARSA(λ)
- Initialize Q arbitrarily
- For each episode, repeat (sketched in code below)
  - Initialize agent state s, set e(s, a) = 0 for all s, a, and select action a from s using the ε-greedy strategy on Q
  - Repeat until the episode terminates
    - Take action a, observe r, s'
    - Select action a' from s' using the ε-greedy strategy on Q
    - \delta \leftarrow r + \gamma Q(s', a') - Q(s, a)
    - e(s, a) \leftarrow e(s, a) + 1
    - For all s, a
      - Q(s, a) \leftarrow Q(s, a) + \alpha \delta e(s, a)
      - e(s, a) \leftarrow \gamma \lambda e(s, a)
    - s \leftarrow s', a \leftarrow a'
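A sketch of SARSA(λ) with accumulating eligibility traces following the steps above, again on the invented corridor task; the alpha, gamma, epsilon, and lambda values are illustrative assumptions.

import random

N, ALPHA, GAMMA, EPS, LAM = 6, 0.1, 1.0, 0.1, 0.9
ACTIONS = [-1, 1]
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return -1.0, s2, s2 == N - 1        # reward, next state, done

for episode in range(1000):
    e = {sa: 0.0 for sa in Q}           # reset eligibility traces
    s, a, done = 0, eps_greedy(0), False
    while not done:
        r, s2, done = step(s, a)        # take action a, observe r, s'
        a2 = eps_greedy(s2)             # select a' from s' (epsilon-greedy on Q)
        target = r + (0.0 if done else GAMMA * Q[(s2, a2)])
        delta = target - Q[(s, a)]      # TD error
        e[(s, a)] += 1.0                # accumulate trace for the visited pair
        for sa in Q:                    # for all s, a: update Q and decay traces
            Q[sa] += ALPHA * delta * e[sa]
            e[sa] *= GAMMA * LAM
        s, a = s2, a2

print(max(ACTIONS, key=lambda a: Q[(0, a)]))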