Title: Chapter 5: Monte Carlo Methods

1
Chapter 5: Monte Carlo Methods
  • Monte Carlo methods learn from complete sample returns
  • Only defined for episodic tasks
  • Monte Carlo methods learn directly from experience
  • On-line: no model necessary and still attains optimality
  • Simulated: no need for a full model

2
Monte Carlo Policy Evaluation
  • Goal: learn Vπ(s)
  • Given: some number of episodes under π which contain s
  • Idea: average returns observed after visits to s
  • Every-visit MC: average returns for every time s is visited in an episode
  • First-visit MC: average returns only for the first time s is visited in an episode
  • Both converge asymptotically

3
First-visit Monte Carlo Policy Evaluation
Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state-value function
    Returns(s) ← empty list, for all s ∈ S
Repeat forever:
    Generate an episode using π
    For each state s appearing in the episode:
        R ← return following the first occurrence of s
        Append R to Returns(s)
        V(s) ← average(Returns(s))
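A minimal Python sketch of the first-visit algorithm above; generate_episode is a hypothetical helper (not in the slides) that runs the policy for one episode and returns a list of (state, reward) pairs.

    # Sketch of first-visit MC policy evaluation (episodic task assumed).
    from collections import defaultdict

    def first_visit_mc(policy, generate_episode, num_episodes=10000, gamma=1.0):
        returns = defaultdict(list)   # Returns(s): returns observed after first visits to s
        V = defaultdict(float)        # state-value estimates
        for _ in range(num_episodes):
            episode = generate_episode(policy)          # [(s_t, r_{t+1}), ...]
            G = 0.0
            for t in reversed(range(len(episode))):     # walk backwards, accumulating the return
                state, reward = episode[t]
                G = gamma * G + reward
                if state not in (s for s, _ in episode[:t]):   # first occurrence of this state?
                    returns[state].append(G)
                    V[state] = sum(returns[state]) / len(returns[state])
        return V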
4
Blackjack example
  • Object: have your card sum be greater than the dealer's without exceeding 21
  • States (200 of them):
  • current sum (12-21)
  • dealer's showing card (ace-10)
  • do I have a usable ace?
  • Reward: +1 for winning, 0 for a draw, -1 for losing
  • Actions: stick (stop receiving cards), hit (receive another card)
  • Policy: stick if my sum is 20 or 21, else hit (see the sketch below)
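A minimal sketch of this fixed policy, assuming states are (player_sum, dealer_showing, usable_ace) tuples and actions are the strings 'hit' and 'stick' (encodings are illustrative, not from the slides).

    def stick_on_20_policy(state):
        player_sum, dealer_showing, usable_ace = state
        # Stick only with a sum of 20 or 21; otherwise hit.
        return 'stick' if player_sum >= 20 else 'hit'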

5
Blackjack Value Functions
  • After many MC evaluations of state visits, the state-value function is well approximated
  • The parameters Dynamic Programming needs are difficult to formulate here!
  • For instance, with a given hand and a decision to stick, what would the expected return be?

6
Backup Diagram for Monte Carlo
  • The entire episode is included, whereas DP uses only one-step transitions
  • Only one choice at each state (unlike DP, which considers all possible transitions in one step)
  • Estimates for all states are independent, so MC does not bootstrap (build on other estimates)
  • Time required to estimate one state does not depend on the total number of states

7
The Power of Monte Carlo
Example - Elastic Membrane (Dirichlet Problem)
How do we compute the shape of the membrane or
bubble attached to a fixed frame?
8
Two Approaches
Relaxation: iterate on the grid and compute local averages (like DP iterations)
Kakutani's algorithm (1945): use many random walks and average the values at the boundary points they reach (like the MC approach)
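A small Python sketch of the random-walk idea, assuming a grid domain with hypothetical is_boundary and boundary_value helpers (not part of the slides).

    import random

    def estimate_membrane_height(x, y, is_boundary, boundary_value, num_walks=1000):
        # Kakutani-style estimate: average the frame (boundary) values hit by random walks.
        total = 0.0
        for _ in range(num_walks):
            cx, cy = x, y
            while not is_boundary(cx, cy):
                dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
                cx, cy = cx + dx, cy + dy
            total += boundary_value(cx, cy)
        return total / num_walks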
9
Monte Carlo Estimation of Action Values (Q)
  • Monte Carlo is most useful when a model is not available
  • We want to learn Q
  • Qπ(s,a): average return starting from state s and action a, thereafter following π (see the definition below)
  • Converges asymptotically if every state-action pair is visited infinitely often
  • To assure this, we must maintain exploration so that many state-action pairs are visited
  • Exploring starts: every state-action pair has a non-zero probability of being the starting pair
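For reference, the definition behind the Qπ bullet in standard notation (assuming discount factor γ and final step T of the episode):

    Q^{\pi}(s,a) = E_{\pi}\left[ R_t \mid s_t = s,\ a_t = a \right],
    \qquad R_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T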

10
Monte Carlo Control (to approximate an optimal
policy)
Generalized Policy Iteration (GPI)
  • MC policy iteration: policy evaluation by approximating Qπ using MC methods, followed by policy improvement
  • Policy improvement step: make the policy greedy with respect to the Qπ (action-value) function; no model is needed to construct the greedy policy (see below)
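Written out in standard notation, the greedy construction from action values needs no model, unlike the corresponding construction from state values (P and R denote transition probabilities and expected rewards):

    \pi(s) = \arg\max_{a} Q^{\pi}(s,a)
    \qquad\text{vs.}\qquad
    \pi(s) = \arg\max_{a} \sum_{s'} P^{a}_{ss'}\left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]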

11
Convergence of MC Control
  • The policy improvement theorem tells us that each greedy policy is at least as good as the previous one
  • This assumes exploring starts and an infinite number of episodes for MC policy evaluation
  • To work around the latter:
  • update only to a given level of performance
  • alternate between evaluation and improvement on an episode-by-episode basis

12
Monte Carlo Exploring Starts
The fixed point is the optimal policy π*. A formal proof of this convergence is one of the fundamental open questions of RL. (A sketch of the algorithm follows.)
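A minimal Python sketch of Monte Carlo ES; random_start and generate_episode are hypothetical environment helpers (not defined in the slides), and actions is the list of available actions.

    from collections import defaultdict
    import random

    def mc_es(actions, random_start, generate_episode, num_episodes=100000, gamma=1.0):
        Q = defaultdict(float)   # action-value estimates
        N = defaultdict(int)     # visit counts for incremental averaging
        policy = {}              # greedy policy, improved after every episode

        def act(s):
            # Follow the current greedy policy; pick randomly for states not yet seen.
            return policy.get(s, random.choice(actions))

        for _ in range(num_episodes):
            s0, a0 = random_start()                    # exploring start: random (state, action)
            episode = generate_episode(act, s0, a0)    # [(s_t, a_t, r_{t+1}), ...]
            G = 0.0
            for t in reversed(range(len(episode))):
                s, a, r = episode[t]
                G = gamma * G + r
                if (s, a) not in ((x, y) for x, y, _ in episode[:t]):  # first visit to (s, a)?
                    N[(s, a)] += 1
                    Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]           # running average of returns
                    policy[s] = max(actions, key=lambda b: Q[(s, b)])  # greedy improvement
        return policy, Q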
13
Blackjack Example continued
  • Exploring starts are easy to enforce here by generating the starting state-action pairs randomly
  • Initial policy: as described before (stick only at 20 or 21)
  • Initial action-value function: zero everywhere

14
On-policy Monte Carlo Control
  • On-policy: learn about or improve the policy currently being executed
  • How do we get rid of exploring starts?
  • Need soft policies: π(s,a) > 0 for all s and a
  • e.g., an ε-soft policy
  • probability of action selection (see the formula below)
  • Similar to GPI: move the policy towards the greedy policy while keeping it ε-soft
  • Converges to the best ε-soft policy
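The action-selection probabilities referred to above, in the standard ε-greedy form (a member of the ε-soft class), with |A(s)| the number of actions available in state s:

    \pi(s,a) =
    \begin{cases}
      1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s,a') \\
      \dfrac{\varepsilon}{|A(s)|} & \text{otherwise}
    \end{cases}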

15
On-policy MC Control
16
Learning about π while following another policy
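The standard way to do this is importance sampling (notation assumed): each return R_i observed while following a behavior policy π′ is weighted by how likely its actions were under π relative to π′, and the weighted returns are averaged.

    w_i = \prod_{k} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)},
    \qquad
    V^{\pi}(s) \approx \frac{\sum_{i=1}^{n_s} w_i R_i}{\sum_{i=1}^{n_s} w_i}

The product runs over the steps of episode i after its first visit to s, and n_s is the number of episodes containing s. The environment's transition probabilities cancel in the ratio, so no model is required.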
17
Off-policy Monte Carlo control
  • On-policy: estimates the value of a policy while following it
  • so the policy that generates behavior and the policy whose result is estimated are identical
  • Off-policy: uses separate behavior and estimation policies
  • Behavior policy: generates behavior in the environment
  • may be randomized so that all actions are sampled
  • Estimation policy: the policy being learned about
  • may be deterministic (greedy)
  • Off-policy methods estimate one policy while following another
  • Average the returns from the behavior policy, which must select all actions of the estimation policy with nonzero probability
  • May be slow to find improvements, since learning only uses the parts of episodes that follow the last nongreedy action

18
Off-policy MC control
19
Incremental Implementation
  • MC can be implemented incrementally
  • saves memory
  • Compute the weighted average of each return

Non-incremental form and its incremental equivalent:
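In standard notation (assumed here), for returns R_k with weights w_k, the two forms are:

    \text{non-incremental:}\quad V_n = \frac{\sum_{k=1}^{n} w_k R_k}{\sum_{k=1}^{n} w_k}
    \qquad
    \text{incremental equivalent:}\quad V_{n+1} = V_n + \frac{w_{n+1}}{W_{n+1}}\left[ R_{n+1} - V_n \right],
    \quad W_{n+1} = W_n + w_{n+1},\ W_0 = 0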
20
Racetrack Exercise
  • States: grid square plus horizontal and vertical velocity
  • Rewards: -1 on track, -5 off track
  • Only right turns allowed
  • Actions: add +1, -1, or 0 to the velocity components
  • 0 < velocity < 5
  • Stochastic: 50% of the time it moves 1 extra square up or right (see the sketch below)
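A small sketch of one time step of these dynamics, assuming position and velocity are (x, y) tuples; exact handling of the velocity bounds and off-track squares is left to the exercise.

    import random

    def step(position, velocity, accel):
        # accel is a pair of increments from {-1, 0, +1} added to the velocity components;
        # keeping the result inside the slide's 0 < velocity < 5 bound is assumed elsewhere.
        vx, vy = velocity[0] + accel[0], velocity[1] + accel[1]
        x, y = position[0] + vx, position[1] + vy
        # Stochastic dynamics from the slide: 50% of the time, one extra square up or right.
        if random.random() < 0.5:
            dx, dy = random.choice([(1, 0), (0, 1)])
            x, y = x + dx, y + dy
        return (x, y), (vx, vy)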

21
Summary
  • MC has several advantages over DP:
  • Can learn directly from interaction with the environment
  • No need for full models
  • No need to learn about ALL states
  • Less harmed by violations of the Markov property (later in the book)
  • MC methods provide an alternate policy evaluation process
  • One issue to watch for: maintaining sufficient exploration
  • exploring starts, soft policies
  • No bootstrapping (as opposed to DP)