Reinforcement Learning
Rutgers CS440, Fall 2003

Transcript and Presenter's Notes

Title: Reinforcement Learning


1
Reinforcement Learning
  • Reading: Ch. 21, AIMA, 2nd Ed.

2
Outline
  • What is RL?
  • Methods for RL.
  • Note: only a brief overview, no in-depth coverage.

3
What is Reinforcement Learning (RL)?
  • Learning so far: learning probabilistic models
    (BNs) or functions (NNs).
  • Learning what/how to do from feedback
    (reward/reinforcement).
  • Chess playing: learn how to play from feedback
    (won/lost game).
  • Learning to speak, crawl, ...
  • Learning user preferences for web searching.
  • MDP: find the optimal policy using a known model.
  • The optimal policy maximizes the expected total reward.
  • RL: learn the optimal policy from rewards.
  • Do not know the environment model.
  • Do not know the reward function.
  • Know how well something is done (e.g., won / lost).

4
Types of RL
  • MDP: actions, states, rewards.
  • Passive learning: the policy is fixed; learn the utility
    of states (and the rest of the model).
  • Active learning: the policy is not fixed; learn the utility
    as well as the optimal policy.

5
Passive RL
  • The policy is known and fixed; need to learn how good
    it is and the environment model.
  • Learn U(st), but do not know P(st | st-1, at-1)
    or R(st).
  • Method: conduct trials, receive a sequence of
    actions, states and rewards (at, st, rt), and compute
    the model parameters and the utility. One example trial
    is shown below; a small code sketch of its
    representation follows it.

at:  NA    A    A    A
st:  NL    NL   L    L
rt:  -20   0    20   20
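
For the sketches on the following slides, a trial like the one above can be stored as a time-ordered list of (action, state, reward) triples. This is only an assumed representation for illustration; the state labels NL/L and action labels NA/A are taken from the trace above.

```python
# One observed trial as a list of (action, state, reward) triples,
# matching the trace above (assumed representation, not from the slides).
trial = [
    ("NA", "NL", -20),
    ("A",  "NL",   0),
    ("A",  "L",   20),
    ("A",  "L",   20),
]
```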
6
Direct utility estimation
  • Observe (at, st, rt); estimate U(st) from counts
    (inductive learning). A small sketch follows this slide.

Sample 1:
at:  NA    A    A    A
st:  NL    NL   L    L
rt:  -20   0    20   20

Sample 2:
at:  NA    A    A    NA   A
st:  NL    NL   L    L    L
rt:  -20   0    20   5    20
  • Example (γ = 1):
  • Sample 1: U(NL) = -20 + 0 + 20 + 20 = 20 (first visit of NL),
    U(NL) = 0 + 20 + 20 = 40 (second visit).
  • Sample 2: U(NL) = -20 + 0 + 20 + 5 + 20 = 25,
    U(NL) = 0 + 20 + 5 + 20 = 45.
  • On average, U(NL) = (20 + 40 + 25 + 45) / 4 = 32.5.
  • Drawback: does not use the fact that the utilities of
    states are dependent (Bellman equations)!
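
A minimal sketch of direct utility estimation under the stated assumption γ = 1, using the trial representation from the earlier sketch: the utility of a state is the observed reward-to-go averaged over every visit to that state.

```python
from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed reward-to-go over every visit to each state."""
    totals = defaultdict(float)   # sum of observed reward-to-go per state
    visits = defaultdict(int)     # number of visits per state
    for trial in trials:
        rewards = [r for (_, _, r) in trial]
        for t, (_, state, _) in enumerate(trial):
            # Reward-to-go from time t to the end of the trial.
            reward_to_go = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
            totals[state] += reward_to_go
            visits[state] += 1
    return {s: totals[s] / visits[s] for s in totals}

sample1 = [("NA", "NL", -20), ("A", "NL", 0), ("A", "L", 20), ("A", "L", 20)]
sample2 = [("NA", "NL", -20), ("A", "NL", 0), ("A", "L", 20), ("NA", "L", 5), ("A", "L", 20)]
print(direct_utility_estimate([sample1, sample2])["NL"])   # (20 + 40 + 25 + 45) / 4 = 32.5
```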

7
Adaptive dynamic programming
  • Take into account the constraints described by the
    Bellman equations.
  • Algorithm: for each sample, at each time step:
  • Estimate P(st | st-1, at-1), e.g.,
    P(L | NL, A) = #(L, NL, A) / #(NL, A).
  • Compute U(st) from R(st) and P(st | st-1, at-1)
    using the Bellman equations, solved exactly or by
    iterative updates.
  • Drawback: usually (too) many states. A small sketch
    of the idea follows this slide.
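
A minimal sketch of the adaptive dynamic programming idea, assuming the same trial representation as above: count transitions to estimate P(st | st-1, at-1) and the empirical action frequencies of the fixed policy, then evaluate the policy by repeated Bellman-equation sweeps. The absorbing end-of-trial marker is an added assumption (not on the slide) so that the γ = 1 sums stay finite.

```python
from collections import defaultdict

END = "<end>"  # absorbing end-of-trial marker (an assumption, not from the slide)

def adp_policy_evaluation(trials, gamma=1.0, sweeps=100):
    """Estimate P(s_t | s_t-1, a_t-1) and R(s) from counts, then solve the
    fixed-policy Bellman equations by repeated sweeps."""
    trans = defaultdict(lambda: defaultdict(int))       # (s, a) -> {s': count}
    act_count = defaultdict(lambda: defaultdict(int))   # s -> {a: count}, empirical fixed policy
    r_sum, r_cnt = defaultdict(float), defaultdict(int)
    for trial in trials:
        for a, s, r in trial:
            r_sum[s] += r
            r_cnt[s] += 1
        steps = list(trial) + [(None, END, 0.0)]        # terminate each trial in END
        for (a_prev, s_prev, _), (_, s_next, _) in zip(steps, steps[1:]):
            trans[(s_prev, a_prev)][s_next] += 1        # counts for P(s_t | s_t-1, a_t-1)
            act_count[s_prev][a_prev] += 1
    reward = {s: r_sum[s] / r_cnt[s] for s in r_sum}    # average observed R(s)
    utility = {s: 0.0 for s in reward}
    utility[END] = 0.0
    for _ in range(sweeps):                             # iterative Bellman updates
        for s in reward:
            n_pi = sum(act_count[s].values())
            expected = 0.0
            for a, n_a in act_count[s].items():
                n_sa = sum(trans[(s, a)].values())
                for s_next, n in trans[(s, a)].items():
                    # P(a | s) * P(s' | s, a) * U(s'), both estimated from counts
                    expected += (n_a / n_pi) * (n / n_sa) * utility[s_next]
            utility[s] = reward[s] + gamma * expected
    del utility[END]
    return utility

# E.g. adp_policy_evaluation([sample1, sample2]) evaluates the fixed policy
# behind the two sample trials from the previous slide.
```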

8
TD-Learning
  • Only update U-values for observed transitions.
  • Algorithm:
  • Receive a new sample pair (st, st+1).
  • Assume that only the transition st → st+1 can occur.
  • Compute the update of U (shown below, followed by a
    small sketch).
  • Does not need to compute model parameters! (Yet it
    converges to the right solution.)

U(st) ← U(st) + α ( R(st) + γ U(st+1) − U(st) )
(new value = old value + α × (value computed from the Bellman equation − old value))
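
A minimal sketch of this TD update for passive policy evaluation, again assuming the trial representation from above; the learning rate alpha and the number of replay passes are illustrative choices, not values from the slide.

```python
from collections import defaultdict

def td_policy_evaluation(trials, alpha=0.05, gamma=1.0, passes=500):
    """TD learning: update U only along observed transitions; no model parameters."""
    utility = defaultdict(float)
    for _ in range(passes):                      # replay the recorded trials repeatedly
        for trial in trials:
            for (_, s, r), (_, s_next, _) in zip(trial, trial[1:]):
                # U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
                utility[s] += alpha * (r + gamma * utility[s_next] - utility[s])
            _, s_last, r_last = trial[-1]
            # Treat the last state of a trial as terminal: the target is just R(s).
            utility[s_last] += alpha * (r_last - utility[s_last])
    return dict(utility)

# E.g. td_policy_evaluation([sample1, sample2]); a decaying alpha would be
# needed for exact convergence, but a small fixed alpha gives similar values.
```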