Learning to Maximize Reward: Reinforcement Learning

1
Learning to Maximize Reward: Reinforcement Learning
Brian C. Williams, 16.412J/6.834J, October 28th, 2002
Slides adapted from Manuela Veloso, Reid Simmons, and Tom Mitchell, CMU
2
Reading
  • Today: Reinforcement Learning
  • Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
  • Read "Reinforcement Learning: A Survey" by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996) 237-285.
  • For Markov Decision Processes:
  • Read 1st/2nd ed. AIMA Chapter 17, sections 1-4.
  • Optional Reading: "Planning and Acting in Partially Observable Stochastic Domains," by L. Kaelbling, M. Littman and A. Cassandra, Artificial Intelligence 101 (1998) 99-134.

3
Markov Decision Processes and Reinforcement
Learning
  • Motivation
  • Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
  • Nondeterministic MDPs
  • Function Approximators
  • Model-based Learning
  • Summary

4
Example: TD-Gammon [Tesauro, 1995]
  • Learns to play Backgammon
  • Situations:
  • Board configurations (~10^20)
  • Actions:
  • Moves
  • Rewards:
  • +100 if win
  • -100 if lose
  • 0 for all other states
  • Trained by playing 1.5 million games against itself.
  • Currently roughly equal to the best human player.

5
Reinforcement Learning Problem
  • Given: Repeatedly
  • Executed action
  • Observed state
  • Observed reward
  • Learn action policy π : S → A
  • that maximizes life reward r_0 + γ r_1 + γ² r_2 + ... from any start state.
  • Discount 0 < γ < 1
  • Note:
  • Unsupervised learning
  • Delayed reward

Goal: Learn to choose actions that maximize life reward r_0 + γ r_1 + γ² r_2 + ... (computed in the sketch below)
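As a concrete illustration of the discounted life reward above, here is a minimal Python sketch; the reward sequence in the example call is made up, not from the slides:

```python
# Discounted return: r0 + gamma*r1 + gamma^2*r2 + ...
def discounted_return(rewards, gamma=0.9):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Hypothetical reward sequence: 0, 0, then 100 two steps later.
print(discounted_return([0, 0, 100], gamma=0.9))  # 0 + 0 + 0.81*100 = 81.0
```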
6
How About Learning the Policy Directly?
  1. π : S → A
  2. Fill out table entries for π by collecting statistics on training pairs <s, a>.
  3. Where does a come from?

7
How About Learning the Value Function?
  • Have the agent learn the value function V^π*, denoted V*.
  • Given the learned V*, the agent selects the optimal action by one-step lookahead:
  • π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]   (sketched in code below)
  • Problem:
  • This works well if the agent knows the environment model:
  • δ : S x A → S
  • r : S x A → ℝ
  • With no model, the agent can't choose an action from V*.
  • With a model, V* could be computed via value iteration, so why learn it?
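To make the one-step lookahead concrete, here is a minimal Python sketch, assuming the deterministic model δ and reward r are given as plain dictionaries; the state/action names and values below are hypothetical, not from the slides:

```python
# One-step lookahead: pi*(s) = argmax_a [ r(s,a) + gamma * V(delta(s,a)) ]
# Assumes a known deterministic model; delta, r, and V below are toy placeholders.
gamma = 0.9
delta = {("s1", "E"): "s2", ("s1", "S"): "s3"}   # transition model delta(s, a) -> s'
r     = {("s1", "E"): 0.0,  ("s1", "S"): 0.0}    # reward model r(s, a)
V     = {"s1": 0.0, "s2": 90.0, "s3": 81.0}      # estimated values V(s')

def greedy_action(s, actions):
    return max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])

print(greedy_action("s1", ["E", "S"]))  # picks "E": 0 + 0.9*90 > 0 + 0.9*81
```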

8
How About Learning the Model as Well?
  • Have the agent learn δ and r from statistics on training instances <s_t, r_t+1, s_t+1>.
  • Compute V* by value iteration: V_t+1(s) ← max_a [ r(s,a) + γ V_t(δ(s,a)) ]
  • The agent selects the optimal action by one-step lookahead:
  • π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]
  • Problem: a viable strategy for many problems, but
  • When do you stop learning the model and compute V*?
  • It may take a long time to converge on the model.
  • We would like to continuously interleave learning and acting, but repeatedly computing V* is costly.
  • How can we avoid learning the model and V* explicitly?

9
Eliminating the Model with Q Functions
  • π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]
  • Key idea:
  • Define a function that encapsulates V*, δ and r:
  • Q(s,a) = r(s,a) + γ V*(δ(s,a))
  • From the learned Q, we can choose an optimal action without knowing δ or r:
  • π*(s) = argmax_a Q(s,a)
  • V*: cumulative reward of being in s.
  • Q: cumulative reward of being in s and taking action a.

10
Markov Decision Processes and Reinforcement
Learning
  • Motivation
  • Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
  • Nondeterministic MDPs
  • Function Approximators
  • Model-based Learning
  • Summary

11
How Do We Learn Q?
  • Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t))
  • Idea:
  • Create an update rule similar to the Bellman equation.
  • Perform updates on training examples <s_t, a_t, r_t+1, s_t+1>:
  • Q(s_t,a_t) ← r_t+1 + γ V*(s_t+1)
  • How do we eliminate V*?
  • Q and V* are closely related:
  • V*(s) = max_a Q(s,a)
  • Substituting Q for V*:
  • Q(s_t,a_t) ← r_t+1 + γ max_a' Q(s_t+1,a')

Called a backup
12
Q-Learning for Deterministic Worlds
  • Let Q̂ denote the current approximation to Q. (A runnable sketch of this loop follows below.)
  • Initially:
  • For each s, a, initialize the table entry Q̂(s,a) ← 0
  • Observe the initial state s_0
  • Do for all time t:
  • Select an action a_t and execute it
  • Receive immediate reward r_t+1
  • Observe the new state s_t+1
  • Update the table entry for Q̂(s_t,a_t) as follows:
  • Q̂(s_t,a_t) ← r_t+1 + γ max_a' Q̂(s_t+1,a')
  • s_t ← s_t+1
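A minimal Python sketch of the deterministic Q-learning loop above, assuming a toy environment exposed through hypothetical step(s, a) and actions(s) helpers (those names are not from the slides):

```python
import random
from collections import defaultdict

def q_learning(step, actions, start_state, episodes=100, steps=50, gamma=0.9):
    """Tabular Q-learning for a deterministic world.

    step(s, a) -> (reward, next_state) and actions(s) -> list of actions
    are assumed helpers describing the environment (hypothetical names).
    """
    Q = defaultdict(float)                      # Q-hat table, initialized to 0
    for _ in range(episodes):
        s = start_state
        for _ in range(steps):
            a = random.choice(actions(s))       # exploration policy (random here)
            r, s_next = step(s, a)              # execute action, observe reward and new state
            # Backup: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next))
            s = s_next
    return Q
```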

13
Example: Q-Learning Update
[Figure: grid world before the update; Q̂(s1, a_right) = 72, and the arcs out of s2 carry Q̂ values 63, 81, 100; γ = 0.9; 0 reward received]
14
Example: Q-Learning Update
[Figure: after moving right from s1 to s2 (action a_right), Q̂(s1, a_right) is updated from 72 to 90; the arcs out of s2 carry Q̂ values 63, 81, 100; γ = 0.9; 0 reward received]
  • Q̂(s1, a_right) ← r(s1, a_right) + γ max_a' Q̂(s2, a')
  •                ← 0 + 0.9 × max{63, 81, 100}
  •                ← 90
  • Note: if rewards are non-negative,
  • For all s, a, n: Q̂_n(s,a) ≤ Q̂_n+1(s,a)
  • For all s, a, n: 0 ≤ Q̂_n(s,a) ≤ Q(s,a)

15
Q-Learning Iterations (Episodic)
  • Start at the upper left and move clockwise; table entries initially 0; γ = 0.8
  • Q̂(s,a) ← r + γ max_a' Q̂(s',a')

(Tracking the entries Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) along the clockwise path.)
Episode 1: Q̂(s1,E) = 0


16
Q-Learning Iterations (Episodic)
  • Start at the upper left and move clockwise; table entries initially 0; γ = 0.8
  • Q̂(s,a) ← r + γ max_a' Q̂(s',a')

(Tracking the entries Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) along the clockwise path.)
Episode 1: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = 0


17
Q-Learning Iterations
  • Start at the upper left and move clockwise; γ = 0.8
  • Q̂(s,a) ← r + γ max_a' Q̂(s',a')

(Tracking the entries Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) along the clockwise path.)
Episode 1: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = 0, Q̂(s4,W) = r + γ max_a' Q̂(s5,a') = 10 + 0.8×0 = 10


18
Q-Learning Iterations
  • Start at the upper left and move clockwise; γ = 0.8
  • Q̂(s,a) ← r + γ max_a' Q̂(s',a')

(Tracking the entries Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) along the clockwise path.)
Episode 1: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = 0, Q̂(s4,W) = r + γ max_a' Q̂(s5,a') = 10 + 0.8×0 = 10
Episode 2: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = r + γ max{Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8×max{10,0} = 8

19
Q-Learning Iterations
  • Start at the upper left and move clockwise; γ = 0.8
  • Q̂(s,a) ← r + γ max_a' Q̂(s',a')

(Tracking the entries Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) along the clockwise path.)
Episode 1: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = 0, Q̂(s4,W) = r + γ max_a' Q̂(s5,a') = 10 + 0.8×0 = 10
Episode 2: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = r + γ max{Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8×max{10,0} = 8, Q̂(s4,W) = 10
Episode 3: Q̂(s1,E) = 0, Q̂(s2,E) = r + γ max{Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8×max{0,8} = 6.4
20
Q-Learning Iterations
  • Start at the upper left and move clockwise; γ = 0.8
  • Q̂(s,a) ← r + γ max_a' Q̂(s',a')

(Tracking the entries Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) along the clockwise path.)
Episode 1: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = 0, Q̂(s4,W) = r + γ max_a' Q̂(s5,a') = 10 + 0.8×0 = 10
Episode 2: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = r + γ max{Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8×max{10,0} = 8, Q̂(s4,W) = 10
Episode 3: Q̂(s1,E) = 0, Q̂(s2,E) = r + γ max{Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8×max{0,8} = 6.4, Q̂(s3,S) = 8, Q̂(s4,W) = 10
21
Q-Learning Iterations
  • Start at the upper left and move clockwise; γ = 0.8
  • Q̂(s,a) ← r + γ max_a' Q̂(s',a')

(Tracking the entries Q̂(s1,E), Q̂(s2,E), Q̂(s3,S), Q̂(s4,W) along the clockwise path.)
Episode 1: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = 0, Q̂(s4,W) = r + γ max_a' Q̂(s5,a') = 10 + 0.8×0 = 10
Episode 2: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = r + γ max{Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8×max{10,0} = 8, Q̂(s4,W) = 10
Episode 3: Q̂(s1,E) = 0, Q̂(s2,E) = r + γ max{Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8×max{0,8} = 6.4, Q̂(s3,S) = 8, Q̂(s4,W) = 10
22
Example Summary: Value Iteration and Q-Learning
[Figure: the grid world annotated with its r(s,a) values]
23
Exploration vs. Exploitation
  • How do you pick actions as you learn?
  • Greedy Action Selection:
  • Always select the action that looks best:
  • π(s) = argmax_a Q̂(s,a)
  • Probabilistic Action Selection:
  • Likelihood of a_i is proportional to its current Q̂ value:
  • P(a_i | s) = k^Q̂(s,a_i) / Σ_j k^Q̂(s,a_j)   (see the sketch below)
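A minimal sketch of the probabilistic action selection above, assuming the k^Q̂ form (larger k favors exploitation, k near 1 favors exploration); the Q̂ values and k below are made-up illustrations:

```python
import random

def select_action(Q, s, actions, k=2.0):
    """Pick an action with probability proportional to k ** Q(s, a).

    Q is a dict mapping (state, action) -> estimated value (assumed layout).
    """
    weights = [k ** Q.get((s, a), 0.0) for a in actions]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(actions, weights=probs, k=1)[0]

# Hypothetical example: with k=2, the action with the higher Q is picked more often.
Q = {("s1", "E"): 3.0, ("s1", "S"): 1.0}
print(select_action(Q, "s1", ["E", "S"]))
```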

24
Markov Decision Processes and Reinforcement
Learning
  • Motivation
  • Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
  • Nondeterministic MDPs
  • Function Approximators
  • Model-based Learning
  • Summary

25
TD(λ): Temporal Difference Learning (from lecture slides for Machine Learning, T. Mitchell, McGraw Hill, 1997)
  • Q-learning: reduce the discrepancy between successive Q estimates.
  • One-step time difference:
  • Q^(1)(s_t,a_t) = r_t + γ max_a Q̂(s_t+1, a)
  • Why not two steps?
  • Q^(2)(s_t,a_t) = r_t + γ r_t+1 + γ² max_a Q̂(s_t+2, a)
  • Or n?
  • Q^(n)(s_t,a_t) = r_t + γ r_t+1 + ... + γ^(n-1) r_t+n-1 + γ^n max_a Q̂(s_t+n, a)
  • Blend all of these (a small code sketch follows below):
  • Q^λ(s_t,a_t) = (1-λ) [ Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + ... ]
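A minimal sketch of the n-step estimates and their λ-weighted blend, computed over a finite recorded trajectory; the trajectory layout and the Q̂ dictionary are illustrative assumptions, not from the slides:

```python
def n_step_estimate(rewards, states, Q, actions, t, n, gamma=0.9):
    """Q^(n)(s_t, a_t): n discounted rewards plus a bootstrapped tail.

    rewards[i] is the reward received on the transition from states[i] to
    states[i+1], so len(states) == len(rewards) + 1 (assumed layout).
    """
    g = sum((gamma ** i) * rewards[t + i] for i in range(n))
    s_tail = states[t + n]
    return g + (gamma ** n) * max(Q.get((s_tail, a), 0.0) for a in actions)

def lambda_estimate(rewards, states, Q, actions, t, lam=0.5, gamma=0.9):
    """Q^lambda(s_t, a_t) = (1 - lam) * sum_n lam^(n-1) * Q^(n)(s_t, a_t),
    truncated at the end of the recorded trajectory."""
    n_max = len(rewards) - t
    total = sum((lam ** (n - 1)) *
                n_step_estimate(rewards, states, Q, actions, t, n, gamma)
                for n in range(1, n_max + 1))
    return (1 - lam) * total
```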

26
Eligibility Traces
  • Idea: perform backups on the N previous data points, as well as the most recent data point.
  • Select which data to back up based on frequency of visitation.
  • Bias toward frequently visited data using a geometric decay γ^(i-j).

27
Markov Decision Processes and Reinforcement
Learning
  • Motivation
  • Learning policies through reinforcement
  • Nondeterministic MDPs
  • Value Iteration
  • Q Learning
  • Function Approximators
  • Model-based Learning
  • Summary

28
Nondeterministic MDPs
  • State transitions become probabilistic: δ(s,a,s')

29
Nondeterministic Case
  • How do we redefine cumulative reward to handle non-determinism?
  • Define V and Q based on expected values:
  • V^π(s_t) = E[ r_t + γ r_t+1 + γ² r_t+2 + ... ]
  • V^π(s_t) = E[ Σ_i γ^i r_t+i ]
  • Q(s_t,a_t) = E[ r(s_t,a_t) + γ V*(δ(s_t,a_t)) ]

30
Value Iteration for Nondeterministic MDPs
  • V_1(s) = 0 for all s
  • t = 1
  • loop
  •   t = t + 1
  •   loop for all s in S
  •     loop for all a in A
  •       Q_t(s,a) = r(s,a) + γ Σ_{s' in S} δ(s,a,s') V_t-1(s')
  •     end loop
  •     V_t(s) = max_a Q_t(s,a)
  •   end loop
  • until |V_t(s) - V_t-1(s)| < ε for all s in S   (a runnable sketch follows below)
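A minimal Python sketch of the loop above, assuming the model is given as dictionaries R[(s,a)] and T[(s,a)] = {s': probability}; those names are hypothetical, not from the slides:

```python
def value_iteration(S, A, R, T, gamma=0.8, eps=1e-6):
    """Nondeterministic value iteration.

    R[(s, a)]      -> immediate reward r(s, a), defined for every (s, a)
    T[(s, a)][s2]  -> transition probability delta(s, a, s2)
    """
    V = {s: 0.0 for s in S}
    while True:
        Q = {(s, a): R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if all(abs(V_new[s] - V[s]) < eps for s in S):   # values close enough
            return V_new, Q
        V = V_new
```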

31
Q-Learning for Nondeterministic MDPs
  • Q(s,a) = r(s,a) + γ Σ_{s' in S} δ(s,a,s') max_a' Q(s',a')
  • Alter the training rule for the nondeterministic case:
  • Q̂_n(s_t,a_t) ← (1 - α_n) Q̂_n-1(s_t,a_t) + α_n [ r_t+1 + γ max_a' Q̂_n-1(s_t+1,a') ]
  • where α_n = 1 / (1 + visits_n(s,a))
  • Convergence of Q̂ to Q can still be proven [Watkins and Dayan, 92].   (a code sketch follows below)
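A minimal sketch of the decayed-learning-rate update above, with the environment again assumed to be exposed through hypothetical step(s, a) and actions(s) helpers:

```python
import random
from collections import defaultdict

def nondeterministic_q_learning(step, actions, s0, steps=10000, gamma=0.8):
    """Q-learning with alpha_n = 1 / (1 + visits_n(s, a)).

    step(s, a) -> (reward, next_state) samples the stochastic environment
    (assumed helper); actions(s) lists the actions available in s.
    """
    Q = defaultdict(float)
    visits = defaultdict(int)
    s = s0
    for _ in range(steps):
        a = random.choice(actions(s))          # simple exploration policy
        r, s_next = step(s, a)
        visits[(s, a)] += 1
        alpha = 1.0 / (1 + visits[(s, a)])
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next))
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s_next
    return Q
```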

32
Markov Decision Processes and Reinforcement
Learning
  • Motivation
  • Learning policies through reinforcement
  • Nondeterministic MDPs
  • Function Approximators
  • Model-based Learning
  • Summary

33
Function Approximation
[Figure: block diagram; inputs s and a feed a function approximator that outputs Q̂(s,a) and is trained from targets or error signals]
  • Function Approximators:
  • Backprop Neural Network
  • Radial Basis Function Network
  • CMAC Network
  • Nearest Neighbor, Memory-based
  • Decision Tree
  • (In the figure, the network-based approximators above are bracketed as gradient-descent methods.)
34
Function Approximation Example: Adjusting Network Weights
[Figure: the same block diagram; s and a feed the function approximator, which outputs Q̂(s,a) and is trained from targets or error]
  • Function Approximator:
  • Q̂(s,a) = f(s,a,w)
  • Update: gradient-descent Sarsa (a code sketch follows below)
  • w ← w + α [ r_t+1 + γ Q̂(s_t+1,a_t+1) - Q̂(s_t,a_t) ] ∇_w f(s_t,a_t,w)
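A minimal numpy sketch of the gradient-descent Sarsa update above, using a linear approximator f(s,a,w) = w·φ(s,a) so that ∇_w f = φ(s,a); the feature function phi is a hypothetical stand-in, not part of the slides:

```python
import numpy as np

def sarsa_update(w, phi, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One gradient-descent Sarsa step for a linear approximator.

    f(s, a, w) = w . phi(s, a), so grad_w f = phi(s, a).
    phi(s, a) -> feature vector (assumed helper, returns a numpy array).
    """
    q_sa = w @ phi(s, a)
    q_next = w @ phi(s_next, a_next)
    td_error = r + gamma * q_next - q_sa
    return w + alpha * td_error * phi(s, a)

# Hypothetical usage with a 4-dimensional feature vector:
# w = np.zeros(4)
# w = sarsa_update(w, phi, s, a, r, s_next, a_next)
```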

35
Example: TD-Gammon [Tesauro, 1995]
  • Learns to play Backgammon
  • Situations:
  • Board configurations (~10^20)
  • Actions:
  • Moves
  • Rewards:
  • +100 if win
  • -100 if lose
  • 0 for all other states
  • Trained by playing 1.5 million games against itself.
  • Currently roughly equal to the best human player.

36
Example: TD-Gammon [Tesauro, 1995]
[Figure: neural network. Input: raw board position (# of pieces at each position), random initial weights, 0-160 hidden units. Output: V(s), the predicted probability of winning. Training signal: the TD error V(s_t+1) - V(s_t); on a win the outcome is 1, on a loss the outcome is 0.]
37
Markov Decision Processes and Reinforcement
Learning
  • Motivation
  • Learning policies through reinforcement
  • Nondeterministic MDPs
  • Function Approximators
  • Model-based Learning
  • Summary

38
Model-based Learning: Certainty-Equivalence Method
  • For every step:
  • Use the new experience to update the model parameters:
  • Transitions
  • Rewards
  • Solve the model for V* and π*, using:
  • Value iteration, or
  • Policy iteration.
  • Use the policy to choose the next action.

39
Learning the Model
  • For each state-action pair <s,a> visited, accumulate:
  • Mean Transition:
  • T(s, a, s') = number-times-seen(s, a → s') / number-times-tried(s, a)
  • Mean Reward: R(s, a)   (a code sketch follows below)
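A minimal sketch of the certainty-equivalence model estimates above, accumulating counts and reward sums per (s, a) pair; the class and method names are hypothetical:

```python
from collections import defaultdict

class ModelEstimate:
    """Maximum-likelihood estimates of T(s,a,s') and R(s,a) from experience."""

    def __init__(self):
        self.tried = defaultdict(int)           # number-times-tried(s, a)
        self.seen = defaultdict(int)            # number-times-seen(s, a -> s')
        self.reward_sum = defaultdict(float)    # running sum of rewards for (s, a)

    def record(self, s, a, r, s_next):
        self.tried[(s, a)] += 1
        self.seen[(s, a, s_next)] += 1
        self.reward_sum[(s, a)] += r

    def T(self, s, a, s_next):
        n = self.tried[(s, a)]
        return self.seen[(s, a, s_next)] / n if n else 0.0

    def R(self, s, a):
        n = self.tried[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0
```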

40
Comparison of Model-based and Model-free methods
  • Temporal Differencing / Q-Learning: only does computation for the states the system actually visits.
  • Good real-time performance
  • Inefficient use of data
  • Model-based methods: compute the best estimates for every state on every time step.
  • Efficient use of data
  • Terrible real-time performance
  • What is a middle ground?

41
Dyna: A Middle Ground [Sutton, Intro to RL, 97]
  • At each step, incrementally:
  • Update the model based on new data
  • Update the policy based on new data
  • Update the policy based on the updated model
  • Performance, until optimal, on a Grid World:
  • Q-Learning:
  • 531,000 Steps
  • 531,000 Backups
  • Dyna:
  • 61,908 Steps
  • 3,055,000 Backups

42
Dyna Algorithm
  • Given state s:
  • Choose action a using the estimated policy.
  • Observe the new state s' and reward r.
  • Update T and R of the model.
  • Update V at <s, a>:
  • V(s) ← max_a [ r(s,a) + γ Σ_s' T(s,a,s') V(s') ]
  • Perform k additional updates:
  • Pick k random states s_1, s_2, ..., s_k
  • Update each V(s_j):
  • V(s_j) ← max_a [ r(s_j,a) + γ Σ_s' T(s_j,a,s') V(s') ]   (one step is sketched in code below)
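A minimal sketch of one Dyna step, combining the model estimate sketched earlier with one real backup and k simulated backups; ModelEstimate is the hypothetical class from the "Learning the Model" sketch, and choose_action, step, and actions are assumed helpers:

```python
import random

def dyna_step(s, V, model, actions, choose_action, step, k=10, gamma=0.9):
    """One Dyna iteration: act, update the model, back up V at s and at k random states."""
    a = choose_action(s, V)                  # pick action from the estimated policy
    r, s_next = step(s, a)                   # act in the real environment
    model.record(s, a, r, s_next)            # update T and R (ModelEstimate sketch above)

    def backup(state):
        return max(model.R(state, act) +
                   gamma * sum(model.T(state, act, s2) * V.get(s2, 0.0) for s2 in V)
                   for act in actions(state))

    V[s] = backup(s)                         # backup from real experience
    for s_j in random.sample(list(V), min(k, len(V))):
        V[s_j] = backup(s_j)                 # k simulated backups on random known states
    return s_next
```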

43
Markov Decision Processes and Reinforcement
Learning
  • Motivation
  • Learning policies through reinforcement
  • Nondeterministic MDPs
  • Function Approximators
  • Model-based Learning
  • Summary

44
Ongoing Research
  • Handling cases where the state is only partially observable
  • Design of optimal exploration strategies
  • Extend to continuous actions and states
  • Learn and use δ : S x A → S
  • Scaling up in the size of the state space
  • Function approximators (neural net instead of table)
  • Generalization
  • Macros
  • Exploiting substructure
  • Multiple learners: multi-agent reinforcement learning

45
Markov Decision Processes (MDPs)
  • Model:
  • Finite set of states, S
  • Finite set of actions, A
  • Probabilistic state transitions, δ(s,a)
  • Reward for each state and action, R(s,a)
  • Process:
  • Observe state s_t in S
  • Choose action a_t in A
  • Receive immediate reward r_t
  • State changes to s_t+1

[Figure: the process unrolled over time: s0, a0, r0, s1, a1, r1, ...]
46
Crib Sheet: MDPs by Value Iteration
  • Insight: optimal values can be calculated iteratively using Dynamic Programming.
  • Algorithm:
  • Iteratively calculate values using Bellman's Equation:
  • V_t+1(s) ← max_a [ r(s,a) + γ V_t(δ(s,a)) ]
  • Terminate when the values are close enough:
  • |V_t+1(s) - V_t(s)| < ε
  • The agent selects the optimal action by one-step lookahead on V*:
  • π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

47
Crib Sheet: Q-Learning for Deterministic Worlds
  • Let Q̂ denote the current approximation to Q.
  • Initially:
  • For each s, a, initialize the table entry Q̂(s,a) ← 0
  • Observe the current state s
  • Do forever:
  • Select an action a and execute it
  • Receive immediate reward r
  • Observe the new state s'
  • Update the table entry for Q̂(s,a) as follows:
  • Q̂(s,a) ← r + γ max_a' Q̂(s',a')
  • s ← s'