Reinforcement Learning

1
Reinforcement Learning
2
Overview
  • Introduction
  • Q-learning
  • Exploration vs. Exploitation
  • Evaluating RL algorithms
  • On-Policy Learning: SARSA
  • Model-based Q-learning

3
What Does Q-Learning Learn?
  • Does Q-learning give the agent an optimal
    policy?

5
Exploration vs. Exploitation
  • Q-learning does not explicitly tell the agent
    what to do
  • just computes a Q-function Q(s,a) that allows the
    agent to see, for every state, which action has
    the highest expected reward
  • Given a Q-function, there are two things that the
    agent can do
  • Exploit the knowledge accumulated so far, and
    choose the action that maximizes Q(s,a) in a
    given state (greedy behavior)
  • Explore new actions, hoping to improve its
    estimate of the optimal Q-function, i.e., do not
    choose the action suggested by the current Q(s,a)
  • When to explore and when to exploit?
  • Never exploring may lead to being stuck in a
    suboptimal course of actions
  • Exploring too much is a waste of the knowledge
    accumulated via experience
  • Must find the right compromise

6
Exploration Strategies
  • Hard to come up with an optimal exploration
    policy (problem is widely studied in statistical
    decision theory)
  • But intuitively, any such strategy should be
    greedy in the limit of infinite exploration
    (GLIE), i.e.
  • Try each action an unbounded number of times, to
    avoid the possibility of missing an optimal
    action because of an unusually bad series of
    outcomes (we discussed this before)
  • choose the predicted best action when, in the
    limit, it has found the optimal value
    function/policy
  • We will look at a few exploration strategies
  • ε-greedy
  • soft-max
  • Optimism in the face of uncertainty

7
ε-greedy
  • Choose a random action with probability ε and
    choose a best action with probability 1 − ε
  • Eventually converges to an optimal policy because
    it ensures that the first GLIE condition (try
    every action an unbounded number of times) is
    satisfied via the ε-random selection
  • But it is rather slow, because it does not really
    become fully greedy in the limit
  • It always chooses a non-optimal action with
    probability ε, while ideally you would want to
    explore more at the beginning and become greedier
    as estimates become more accurate
  • A possible solution is to vary ε over time (a
    minimal sketch is given below)
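A minimal sketch of ε-greedy selection (not from the original
slides; the Q-table layout, action list, and decay schedule are
illustrative assumptions):

    import random

    def epsilon_greedy(Q, state, actions, epsilon):
        """Pick a random action with probability epsilon, else a greedy one.

        Q is assumed to be a dict mapping (state, action) pairs to estimates.
        """
        if random.random() < epsilon:
            return random.choice(actions)                       # explore
        return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

    # To become greedier over time, epsilon can be decayed, e.g.
    # epsilon = 1.0 / k, where k is the number of steps taken so far.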

8
Soft-Max
  • Takes into account improvements in the estimates of
    the expected reward function Q(s,a)
  • Choose action a in state s with a probability
    proportional to the current estimate of Q(s,a)
  • τ in the soft-max formula (see below) influences
    how randomly actions are chosen
  • if τ is high, the exponentials approach 1, the
    fraction approaches 1/(number of actions), and
    each action has approximately the same
    probability of being chosen (exploration or
    exploitation?)
  • As τ is reduced, actions with higher Q(s,a) are
    more likely to be chosen
  • as τ → 0, the exponential with the highest Q(s,a)
    dominates, and the best action is always chosen
    (exploration or exploitation?)
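The slide's formula is not reproduced in this transcript; the
standard soft-max (Boltzmann) rule it describes is

    P(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}

where τ is the temperature parameter discussed in the bullets.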

10
Optimism under Uncertainty
  • Initialize Q-function to values that encourage
    exploration
  • Amounts to giving an optimistic estimate for the
    initial values of Q(s,a)
  • Make the agent believe that there are wonderful
    rewards scattered all over, so that it explores
    all over
  • May take a long time to converge
  • A state gets to look bad when all its actions
    look bad
  • But when all actions lead to states that look
    good, it takes a long time to retrieve realistic
    Q-values via sheer exploration
  • Works fast only if the original values are a
    close approximation of the final values
  • This strategy does not work when the dynamics of
    the environment change over time.
  • Exploration happens only in the initial phases of
    learning, and so it can't keep track of changes in
    the environment.

11
Optimism under Uncertainty Revised
  • Another approach: favor exploration of
    rarely-tried actions, but stop pursuing them
    after enough evidence that they have low utility
  • This can be done by defining an exploration
    function
  • f(Q(s,a), N(s,a))
  • where N(s,a) is the number of times a has
    been tried in s
  • It determines how greed (preference for high values
    of Q) is traded off against curiosity (low values
    of N)

12
Exploration Function
  • There are many such functions; here is a simple
    one (see below)
  • This is the function that is used to select the
    next action to try in the current state
  • R⁺ is an optimistic estimate of the best possible
    reward obtainable in any state
  • Ne is a fixed parameter that forces the agent to
    try each action at least that many times in each
    state
  • After that many tries, the agent stops relying
    on the initial overestimates and uses the
    potentially more accurate current Q-values
  • This takes care of the problems of relying only
    on optimistic estimates throughout the process.
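The function itself is not reproduced in this transcript; a simple
exploration function consistent with the description above (and
standard in the literature) is

    f(u, n) = \begin{cases} R^{+} & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}

so f(Q(s,a), N(s,a)) returns the optimistic estimate R⁺ until a has
been tried Ne times in s, and the current Q(s,a) afterwards.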

13
Modified Q-learning
a ← argmax_a f(Q(s,a), N(s,a))
We obviously need to modify the code to maintain the
counts N(s,a); a sketch is given below.
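A minimal sketch of Q-learning modified in this way (environment
handling omitted; the values of R⁺, Ne, α, and γ are illustrative
assumptions, not from the slides):

    import collections

    def make_modified_q_learner(actions, alpha=0.1, gamma=0.9,
                                R_plus=10.0, N_e=5):
        """Q-learning with an optimism-under-uncertainty exploration function."""
        Q = collections.defaultdict(float)   # Q[(s, a)]
        N = collections.defaultdict(int)     # N[(s, a)]: times a was tried in s

        def f(u, n):
            # Exploration function: stay optimistic until (s, a) was tried N_e times.
            return R_plus if n < N_e else u

        def choose_action(s):
            # argmax_a f(Q(s, a), N(s, a))
            return max(actions, key=lambda a: f(Q[(s, a)], N[(s, a)]))

        def update(s, a, r, s2):
            N[(s, a)] += 1
            best_next = max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

        return choose_action, update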
14
Overview
  • Introduction
  • Q-learning
  • Exploration vs. Exploitation
  • Evaluating RL algorithms
  • On-Policy Learning: SARSA
  • Model-based Q-learning

15
Evaluating RL Algorithms
  • Two possible measures
  • Quality of the optimal policy
  • Reward received while looking for the policy
  • If there is a lot of time for learning before the
    agent is deployed, then quality of the learned
    policy is the measure to consider
  • If the agent has to learn while being deployed,
    it may not get to the optimal policy for a long
    time
  • Reward received while learning is the measure to
    look at, e.g., plot cumulative reward as a
    function of the number of steps
  • One algorithm dominates another if its plot is
    consistently above

16
Evaluating RL Algorithms
  • Plots for example 11.7 in textbook (p. 480), with
  • Either a fixed or a variable α
  • Different initial values for Q(s,a)

17
Evaluating RL Algorithms
  • Lots of variability in each algorithm for
    different runs
  • for fair comparison, run each algorithm several
    times and report average behavior
  • Relevant statistics of the plot:
  • Asymptotic slope: how good the policy is after
    the algorithm stabilizes
  • Plot minimum: how much reward must be sacrificed
    before starting to gain (cost of learning)
  • Zero-crossing: how long it takes for the
    algorithm to recuperate its cost of learning

18
Overview
  • Introduction
  • Q-learning
  • Exploration vs. Exploitation
  • Evaluating RL algorithms
  • On-Policy Learning: SARSA
  • Model-based Q-learning

19
Learning before vs. during deployment
  • As we saw earlier, there are two possible modes of
    operation for our learning agents
  • Act in the environment to learn how it works,
    i.e., to learn an optimal policy, then use this
    policy to act (there is a learning phase before
    deployment)
  • Learn as you go, that is, start operating in the
    environment right away and learn from actions
    (learning happens during deployment)
  • If there is time to learn before deployment, the
    agent should try to do its best to learn as much
    as possible about the environment
  • even engage in locally suboptimal behaviors,
    because this will guarantee reaching an optimal
    policy in the long run
  • If learning while at work, suboptimal behaviors
    could be too costly

20
Example
  • Consider, for instance, our sample grid game
  • the optimal policy is to go up in S0
  • But if the agent includes some exploration in its
    policy (e.g., selects 20% of its actions
    randomly), exploring in S2 could be dangerous
    because it may cause hitting the -100 wall
  • No big deal if the agent is not deployed yet, but
    not ideal otherwise
  • Q-learning would not detect this problem
  • It does off-policy learning, i.e., it focuses on
    the optimal policy
  • On-policy learning addresses this problem

21
On-policy learning: SARSA
  • On-policy learning learns the value of the policy
    being followed.
  • e.g., act greedily 80% of the time and act
    randomly 20% of the time
  • Better to be aware of the consequences of
    exploration as it happens, and avoid outcomes
    that are too costly while acting, rather than
    looking for the true optimal policy
  • SARSA
  • So called because it uses ⟨state, action, reward,
    state, action⟩ experiences rather than the
    ⟨state, action, reward, state⟩ used by Q-learning
  • Instead of looking for the best action at every
    step, it evaluates the actions suggested by the
    current policy
  • Uses this info to revise it

22
On-policy learning: SARSA
  • Given an experience ⟨s, a, r, s′, a′⟩, SARSA
    updates Q(s,a) as follows
  • What's different from Q-learning?

23
On-policy learning: SARSA
  • Given an experience ⟨s, a, r, s′, a′⟩, SARSA
    updates Q(s,a) as follows
  • While Q-learning was using a different update
  • There is no longer a MAX operator in the equation;
    instead, it uses the Q-value of the action
    suggested by the policy (both updates are
    reproduced below)
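The two update equations compared on this slide appear as images in
the original deck; in standard notation they are

    SARSA:       Q(s,a) \leftarrow Q(s,a) + \alpha \, [\, r + \gamma \, Q(s',a') - Q(s,a) \,]
    Q-learning:  Q(s,a) \leftarrow Q(s,a) + \alpha \, [\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,]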

24
On-policy learning: SARSA
  • Does SARSA remind you of any other algorithm we
    have seen before?

25
Policy Iteration
  • Algorithm
  • 1. π ← an arbitrary initial policy, U ← a vector of
    utility values, initially 0
  • 2. Repeat until no change in π
  • (a) Compute new utilities given π and the current U
    (policy evaluation)
  • (b) Update π as if the utilities were correct
    (policy improvement)

(Annotations on the slide's equations: expected value of following
the current πi from s; expected value of following another action
in s; the policy improvement step.)
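The annotated equations are shown as images in the original slides;
in standard form, with the general reward R(s,a,s′) used elsewhere
in this deck, they are

    Policy evaluation:   U_i(s) \leftarrow \sum_{s'} P(s' \mid s, \pi_i(s)) \, [\, R(s, \pi_i(s), s') + \gamma \, U_i(s') \,]
    Policy improvement:  \pi_{i+1}(s) \leftarrow \arg\max_{a} \sum_{s'} P(s' \mid s, a) \, [\, R(s, a, s') + \gamma \, U_i(s') \,]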
26
(Figure: Q-value backups for k = 1.)
Only immediate rewards are included in the update,
as with Q-learning.
27
(Figure: Q-value backups for k = 1 and k = 2.)
SARSA backs up the reward of the next action,
rather than the max reward.
28
Comparing SARSA and Q-learning
  • For the little 6-state world
  • The policy learned by Q-learning (80% greedy) is to
    go up in s0 to reach s4 quickly and get the big
    +10 reward

29
Comparing SARSA and Q-learning
  • The policy learned by SARSA (80% greedy) is to go
    left in s0
  • Safer, because it avoids the chance of getting the
    -100 reward in s2
  • but non-optimal ⇒ lower Q-values

30
SARSA Algorithm
  • This could be, for instance, any ε-greedy
    strategy:
  • Choose a random action a fraction ε of the time,
    and the max action the rest of the time

If the random step is chosen here and has a bad
negative reward, this will affect the value of
Q(s,a). Next time in s, a may no longer be the
action selected because of its lowered Q-value.
(A sketch of the algorithm is given below.)
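A minimal sketch of tabular SARSA with an ε-greedy behaviour policy
(the env.reset()/env.step() interface and the hyper-parameter values
are illustrative assumptions, not from the slides):

    import random

    def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.2):
        """Tabular SARSA with an epsilon-greedy behaviour policy."""
        Q = {}  # Q[(state, action)], defaulting to 0

        def q(s, a):
            return Q.get((s, a), 0.0)

        def pick(s):
            if random.random() < epsilon:
                return random.choice(actions)           # explore
            return max(actions, key=lambda a: q(s, a))  # act greedily

        for _ in range(episodes):
            s = env.reset()
            a = pick(s)
            done = False
            while not done:
                s2, r, done = env.step(a)   # assumed environment interface
                a2 = pick(s2)
                # On-policy update: uses the action a2 actually chosen next,
                # so a costly exploratory step lowers Q(s, a).
                target = r + (0.0 if done else gamma * q(s2, a2))
                Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
                s, a = s2, a2
        return Q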
31
Another Example
  • Gridworld with
  • Deterministic actions up, down, left, right
  • Start from S and arrive at G
  • Reward is -1 for all transitions, except those
    into the region marked Cliff
  • Falling into the cliff causes the agent to be
    sent back to the start with r = -100

32
Another Example
  • Because of negative reward for every step taken,
    the optimal policy over the four standard actions
    is to take the shortest path along the cliff
  • But if the agent adopts an ε-greedy action
    selection strategy with ε = 0.1, walking along the
    cliff is dangerous
  • The optimal path that considers exploration is to
    go around as far as possible from the cliff

33
Q-learning vs. SARSA
  • Q-learning learns the optimal policy, but because
    it does so without taking exploration into
    account, it does not do so well while the agent
    is exploring
  • It occasionally falls into the cliff, so its
    reward per episode is not that great
  • SARSA has better on-line performance (reward per
    episode), because it learns to stay away from the
    cliff while exploring
  • But note that if ε → 0, SARSA and Q-learning would
    asymptotically converge to the optimal policy

34
Problem with Model-free methods
  • Q-learning and SARSA are model-free methods
  • What does this mean?

36
Problems with Model-free Methods
  • Q-learning and SARSA are model-free methods
  • They do not need to learn the transition and
    reward models; these are implicitly taken into
    account via the experiences
  • Sounds handy, but there is a major disadvantage
  • How often does the agent get to update its
    Q-estimates?
  • Only after a new experience comes in
  • Great if the agent acts very frequently, not so
    great if actions are sparse, because it wastes
    computation time

37
Model-based methods
  • Idea
  • learn the MDP and interleave acting and
    planning.
  • After each experience,
  • update the transition probabilities and the reward
    model,
  • do some steps of (asynchronous) value iteration
    to get better estimates of the state utilities U(s)
    given the current model and reward function
  • Remember that there is the following link between
    Q-values and utility values (see below)
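The link referred to is shown as an equation image in the original
slides; in standard form it is

    Q(s,a) = \sum_{s'} P(s' \mid s, a) \, [\, R(s,a,s') + \gamma \, U(s') \,], \qquad U(s) = \max_{a} Q(s,a)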

38
VI algorithm
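The algorithm itself is shown as a figure in the original slides;
its core step, as the next slide notes, is the Bellman update
applied to every state on every iteration:

    U_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a) \, [\, R(s,a,s') + \gamma \, U_k(s') \,]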
39
Asynchronous Value Iteration
  • The basic version of value iteration applies
    the Bellman update to all states at every
    iteration
  • This is in fact not necessary
  • On each iteration we can apply the update only to
    a chosen subset of states
  • Given certain conditions on the value function
    used to initialize the process, asynchronous
    value iteration converges to an optimal policy
  • Main advantage
  • one can design heuristics that allow the
    algorithm to concentrate on states that are
    likely to belong to the optimal policy
  • For example, if I have no intention of ever doing
    research in AI, there is no point in exploring the
    states that would result from that choice
  • Much faster convergence

40
Asynchronous VI algorithm
(The algorithm is shown as a figure in the original slides: at each
step the Bellman update is applied to some selected state s.)
41
Model-based RL algorithm
controller Prioritized Sweeping
  inputs: S, a set of states; A, a set of actions;
    γ, the discount; c, a prior count
  internal state: real arrays Q[S,A] and R[S,A,S];
    integer array T[S,A,S]; previous state s;
    previous action a

Assumes a reward function that is as general as
possible, i.e., depending on all of s, a, s′.
42
(Annotations on the algorithm figure:)
  • T[s,a,s′]: counts of the events in which performing
    action a in s generated s′
  • R[s,a,s′]: TD-based estimate of R(s,a,s′)
  • Asynchronous value iteration steps
  • What is this c for?
  • Why is the reward inside the summation?
  • Frequency of the transition from s1 to s2 via a1
(A code sketch is given below.)
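A minimal sketch of such a model-based learner (a simplification of
the algorithm in the figure: it keeps transition counts, a running
reward estimate, and runs a few asynchronous value-iteration-style
updates after each experience; the containers and parameter values
are illustrative assumptions):

    import collections, random

    def make_model_based_learner(states, actions, gamma=0.9,
                                 prior_count=1.0, vi_steps=10):
        Q = collections.defaultdict(float)   # Q[(s, a)]
        T = collections.defaultdict(float)   # transition counts T[(s, a, s2)]
        R = collections.defaultdict(float)   # reward estimates R[(s, a, s2)]

        def add_experience(s, a, r, s2):
            T[(s, a, s2)] += 1.0
            # TD-style running average of R(s, a, s2)
            R[(s, a, s2)] += (r - R[(s, a, s2)]) / T[(s, a, s2)]
            # A few asynchronous value-iteration steps, always including
            # the state-action pair just experienced.
            pairs = [(s, a)] + [(random.choice(states), random.choice(actions))
                                for _ in range(vi_steps)]
            for (si, ai) in pairs:
                # prior_count plays the role of the prior count c: it keeps
                # the probabilities well defined for rarely seen transitions.
                total = (sum(T[(si, ai, sj)] for sj in states)
                         + prior_count * len(states))
                Q[(si, ai)] = sum(
                    (T[(si, ai, sj)] + prior_count) / total *
                    # the reward sits inside the summation because it
                    # depends on the resulting state s'
                    (R[(si, ai, sj)] + gamma * max(Q[(sj, aj)] for aj in actions))
                    for sj in states)
            return Q

        return add_experience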
44
Discussion
  • Which states to update?
  • At least the state s from which the latest
    experience was generated
  • Then either select states randomly, or
  • States that are likely to get their Q-values
    changed because they can reach states with
    Q-values that have changed the most
  • How many steps of asynchronous value-iteration to
    perform?
  • As many as can be done before having to act again

45
Q-learning vs. Model-based
  • Is it better to learn a model and a utility
    function or an action value function with no
    model?
  • Still an open question
  • Model-based approaches require less data to learn
    well, but they can be computationally more
    expensive (time per iteration)
  • Q-learning takes longer because it does not
    enforce consistency among Q-values via the model
  • Especially true when the environment becomes more
    complex
  • In games such as chess and backgammon,
    model-based approaches have been more successful
    than Q-learning methods
  • Cost/ease of acting needs to be factored in

46
Overview
  • Introduction
  • Q-learning
  • Exploration vs. Exploitation
  • Evaluating RL algorithms
  • On-Policy Learning: SARSA
  • Model-Based Methods
  • Reinforcement Learning with Features


47
Problem with state-based methods
  • In all the Q-learning methods that we have seen,
    the goal is to fill out the State-Action matrix
    with good Q-values
  • In order to do that, we need to make sure that
    the agent's experiences visit all the states
  • Problem?

48
Problem with state-based methods
  • Model-based variations have been shown to handle
    spaces with around 10,000 states reasonably well
  • Two-dimensional maze-like environments
  • The real world is much more complex than that
  • Chess has on the order of 10^47 states
  • Backgammon has on the order of 10^20 states
  • Infeasible to visit all of them to learn how to
    play the game!
  • Additional problem with state-based methods:
  • information about one state cannot be used for
    similar states.

49
Alternative Approach
  • If we have more knowledge about the world, we can
  • approximate the Q-function using a function of
    state/action features
  • Most typical is a linear function of the
    features.
  • A linear function of variables X1, …, Xn is of
    the form shown below
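The form referred to is shown as an image in the original slides;
a linear function of X1, …, Xn is

    f_{w}(X_1, \ldots, X_n) = w_0 + w_1 X_1 + \cdots + w_n X_n

and the approximated Q-function correspondingly uses features of
(s, a) in the role of the Xi.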

50
What are these features?
  • They are properties of the world states and
    actions that may be relevant to perform well in
    the world
  • However, if and how they are relevant is not clear
    enough to create well-defined rules of action
    (policies)
  • For instance, possible features in chess would be
  • Approximate material value of each piece (pawn
    = 1, knight = 3, bishop = 3, rook = 5, queen = 9)
  • King safety
  • Good pawn structure
  • Expert players have heuristics to use these
    features for playing well, but they cannot be
    formalized in machine-ready ways

51
SARSA with Linear Function Approximation
  • Suppose that F1, …, Fn are features of states
    and actions in our world
  • If a new experience ⟨s, a, r, s′, a′⟩ is
    observed, it provides a new value to update Q(s,a)

52
SARSA with Linear Function Approximation
  • We use this experience to adjust the parameters
    w1, …, wn so as to minimize the squared error
  • Does it remind you of anything we have already
    seen?

53
Gradient descent
  • So we have an expression for the error over Q(s,a)
    as a function of the parameters w1, …, wn
  • We want to minimize it
  • Gradient descent
  • To find the minimum of a real-valued function
    f(x1, …, xn)
  • Assign arbitrary values to x1, …, xn,
  • then repeatedly update each xi (see below)
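The update referenced above is shown as an image in the original
slides; the standard gradient-descent step is

    x_i \leftarrow x_i - \eta \, \frac{\partial f}{\partial x_i}

where η is a (typically small) step size.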

54
Gradient Descent Search
Each set of weights defines a point on the error
surface. Given a point on the surface, look at the
slope of the surface along the axis formed by each
weight, i.e., the partial derivative of the error Err
with respect to each weight wj.
55
SARSA with Linear Function Approximation
  • If we set each weight to follow the gradient of
    this error, we obtain the update below
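The resulting update is shown as an equation image in the original
slides; in standard form, with Q_w(s,a) = Σ_i w_i F_i(s,a), it is

    w_i \leftarrow w_i + \eta \, [\, r + \gamma \, Q_w(s',a') - Q_w(s,a) \,] \, F_i(s,a)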

56
Algorithm
(The algorithm is shown as a figure in the original slides.
Annotations: error over Q(s,a); parameter adjustment via gradient
descent. A code sketch is given below.)
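A minimal sketch of the algorithm (the features(s, a) callback, the
env interface, and the hyper-parameter values are illustrative
assumptions, not from the slides):

    import random

    def sarsa_linear_fa(env, actions, features, n_features,
                        episodes=1000, eta=0.01, gamma=0.9, epsilon=0.1):
        """SARSA with a linear approximation Q_w(s,a) = sum_i w_i * F_i(s,a)."""
        w = [0.0] * n_features

        def q(s, a):
            return sum(wi * fi for wi, fi in zip(w, features(s, a)))

        def pick(s):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: q(s, a))

        for _ in range(episodes):
            s = env.reset()
            a = pick(s)
            done = False
            while not done:
                s2, r, done = env.step(a)   # assumed environment interface
                a2 = pick(s2)
                # Temporal-difference error of the current linear estimate
                delta = r + (0.0 if done else gamma * q(s2, a2)) - q(s, a)
                # Gradient-descent adjustment of each weight
                f = features(s, a)
                for i in range(n_features):
                    w[i] += eta * delta * f[i]
                s, a = s2, a2
        return w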
57
Example
  • 25 grid locations
  • A prize could be at one of the corners, or there
    is no prize.
  • If there is no prize, at each time step there is a
    probability that a prize appears at one of the
    corners.
  • Landing on the prize gives a reward of +10, and the
    prize disappears.
  • Monsters can appear at any time at one of the
    locations marked M.
  • The agent gets damaged if a monster appears at
    the square the agent is on.
  • If the agent is already damaged, it receives a
    reward of -10.
  • The agent can get repaired by visiting the repair
    station marked R.

58
  • 4 actions: up, down, left and right.
  • These move the agent one step, usually in the
    direction indicated by the name,
  • but sometimes in one of the other directions.
  • If the agent crashes into an outside wall or one
    of the interior walls (the thick lines near the
    location R), it remains where it was and receives
    a reward of -1.

59
  • The state consists of 4 components ⟨X, Y, P, D⟩,
  • X is the X-coordinate of the agent,
  • Y is the Y-coordinate of the agent,
  • P is the position of the prize
  • (P = i if there is a prize at location Pi; a fifth
    value of P indicates that there is no prize),
  • D is Boolean and is true when the agent is damaged
  • As the monsters are transient, there is no need
    to include them as part of the state.
  • There are thus 5 × 5 × 5 × 2 = 250 states.
  • The agent does not know any of the story given
    here.
  • It just knows that there are 250 states, and 4
    actions, and which state it is in at every time
    and what reward was received at each time.
  • This game is difficult to learn
  • Visiting R is seemingly innocuous, until the
    agent has determined that being damaged is bad,
    and that visiting R makes it not damaged.
  • It needs to stumble upon this while trying to
    collect the prizes.
  • The states where there is no prize available do
    not last very long. Moreover, it has to learn all
    this without being given the concept of "damaged".

60
Feature-based representation
  • F1(s,a) = 1 if action a would most likely take the
    agent from state s into a location where a
    monster could appear, and 0 otherwise.
  • F2(s,a) = 1 if action a would most likely take the
    agent into a wall, and 0 otherwise.
  • F3(s,a) has value 1 if the step a would most
    likely take the agent towards a prize.
  • F4(s,a) has value 1 if the agent is damaged in
    state s and action a takes it towards the repair
    station.
  • F5(s,a) has value 1 if the agent is damaged and
    action a would most likely take the agent into a
    location where a monster could appear, and 0
    otherwise.
  • same as F1(s,a), but is only applicable when the
    agent is damaged.
  • F6(s,a) has value 1 if the agent is damaged in
    state s and has value 0 otherwise.
  • F7(s,a) has value 1 if the agent is not damaged
    in state s and has value 0 otherwise.
  • F8(s,a) has value 1 if the agent is damaged and
    there is a prize ahead in direction a
  • F9(s,a) has value 1 if the agent is not damaged
    and there is a prize ahead in direction a

61
Feature-based representation
  • F10(s,a) has the value of x, the agent's horizontal
    position, in state s if there is a prize at
    location P0 in state s
  • i.e., the distance from the left wall when there is
    a prize at location P0
  • F11(s,a) has the value 4 − x, where x is the
    horizontal position in state s, if there is a
    prize at location P0 in state s,
  • i.e., the distance from the right wall when there is
    a prize at location P0.
  • F12(s,a) to F29(s,a) are like F10 and F11 for
    different combinations of the prize location and
    the distance from each of the 4 walls.
  • For the case where the prize is at location P0,
    the y distance could take into account the wall.
  • http://www.cs.ubc.ca/spider/poole/demos/rl/sGameFA.html

62
Discussion
  • Finding the right features is difficult
  • The author of TD-Gammon, a program that uses RL
    to learn to play Backgammon, took over 5 years to
    come up with a reasonable set of features
  • It reached the performance level of the top three
    players worldwide