Evolutionary Reinforcement Learning Systems

Transcript and Presenter's Notes

1
Evolutionary Reinforcement Learning Systems
  • Presented by Alp Sardag

2
Goal
  • Two main branches of reinforcement learning:
  • Search the space of functions that assign values
    to the utility of states.
  • Search the space of functions that assign values
    to behaviors (Q-learning).
  • Describe the evolutionary algorithm approach to
    reinforcement learning and compare and contrast
    it with the more common TD methods.

3
Sequential Decision Task
This is an example of a sequential decision task;
the optimal sequence for an agent that starts
from a1 is R,D,R,D,D,R,R,D.
4
Learning from reinforcements
  • The EA and TD methods share similarities:
  • Both solve difficult sequential decision tasks.
  • These methods are normally model-free, whereas
    dynamic programming approaches learn from a
    complete mathematical model of the underlying
    system.

5
Supervised vs. Unsupervised Learning
  • In supervised learning the agent receives the
    correct decisions paired with specific sensory
    input.
  • In reinforcement learning, the agent does not
    learn from examples of correct behavior but uses
    system payoffs as a guide to form effective
    decision policies.
  • A decision maker may not receive feedback after
    every decision.
  • They are applicable in problems where significant
    domain knowledge is either unavailable or costly
    to obtain.

6
TD Reinforcement Learning
  • Q-learning update scheme:
  • Q(x,a) ← Q(x,a) + α(R(x) + γ max_b Q(y,b) − Q(x,a))
  • In the long run, the Q-values will converge to
    the optimal values.
  • A reinforcement learning system can thus use the
    Q-values to evaluate each action that is possible
    from a given state (see the sketch below).
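
A minimal Python sketch of this tabular update (names such as alpha,
gamma, and the dictionary-backed Q table are illustrative assumptions):

    # One update: Q(x,a) += alpha * (r + gamma * max_b Q(y,b) - Q(x,a))
    from collections import defaultdict

    Q = defaultdict(float)          # maps (state, action) to estimated value
    alpha, gamma = 0.1, 0.9         # learning rate and discount (assumed values)
    actions = ["R", "D", "L", "U"]  # grid moves, as in the slide-3 example

    def q_update(x, a, r, y):
        """Update Q(x,a) from reward r and successor state y."""
        best_next = max(Q[(y, b)] for b in actions)
        Q[(x, a)] += alpha * (r + gamma * best_next - Q[(x, a)])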

7
Evolutionary Algorithm
  • Evolutionary algorithms are global search
    techniques.
  • They are built on Darwin's theory of evolution by
    natural selection.
  • Numerous potential solutions are encoded in
    structures, called chromosomes.
  • During each iteration, the EA evaluates solutions
    and generates offspring based on the fitness of
    each solution in the task.
  • Substructures, or genes, of the solutions are
    then modified through genetic operators such as
    mutation or recombination.
  • The idea: structures that led to good solutions
    in previous evaluations can be mutated or
    combined to form even better solutions.

8
Basic Steps in Evolutionary Algorithm
  • Procedure EA
  • begin
  •   t ← 0
  •   initialize P(t)
  •   evaluate structures in P(t)
  •   while termination condition not satisfied do
  •   begin
  •     t ← t + 1
  •     select P(t) from P(t-1)
  •     alter structures in P(t)
  •     evaluate structures in P(t)
  •   end
  • end.
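
A runnable Python sketch of this loop, assuming a bit-string encoding,
one-point recombination, and a user-supplied fitness function (all names
here are illustrative):

    import random

    def evolve(fitness, n_pop=50, n_genes=20, p_mut=0.01, n_gens=100):
        """Generic EA loop following the pseudocode above."""
        # initialize P(t) with random bit-string chromosomes
        pop = [[random.randint(0, 1) for _ in range(n_genes)]
               for _ in range(n_pop)]
        for _ in range(n_gens):             # termination: fixed generation count
            ranked = sorted(pop, key=fitness, reverse=True)
            parents = ranked[: n_pop // 2]  # select P(t) from P(t-1)
            pop = []
            while len(pop) < n_pop:         # alter structures in P(t)
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, n_genes)
                child = a[:cut] + b[cut:]   # one-point recombination
                child = [g ^ (random.random() < p_mut) for g in child]  # mutation
                pop.append(child)
        return max(pop, key=fitness)

For example, evolve(sum) treats the number of 1-bits as fitness (the
classic OneMax task).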

9
Similarities EA and RL
  • EA methods can learn online through direct
    interaction with the underlying system.
  • They are adaptive to changes.
  • Online learning is often time-consuming and
    dangerous. Therefore researchers in EA and TD
    learning train in simulation and then apply the
    learned control policy to the real system.

10
Differences between TD and EA approach
  • These can be summarized along three dimensions:
  • Policy representation
  • Credit assignment
  • Memory

11
Policy Representation
  • What they represent:
  • TD methods learn a value function Q(x,a) → v.
  • EA methods use a direct mapping from states to
    recommended actions (e.g. Q(x) → a); see the
    sketch below.
  • Conclusion: TD methods try to solve a more
    difficult problem than the reinforcement task
    poses. In addition to choosing the best action,
    the value function tells us why.
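
The contrast in concrete terms (a minimal sketch; the tables and example
entries are assumptions):

    # Value-function representation (TD): (state, action) -> estimated value
    q_table = {("a1", "R"): 0.72, ("a1", "D"): 0.65}

    # Direct policy representation (EA): state -> recommended action
    policy = {"a1": "R"}

    # TD derives the action by comparing values; EA reads it off directly.
    def td_action(state, actions=("R", "D")):
        return max(actions, key=lambda a: q_table.get((state, a), 0.0))

    assert td_action("a1") == policy["a1"]  # both recommend "R" here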

12
The size of hypothesis space
  • The hypothesis space for a direct policy
    representation (e.g. Q(x) → a):
  • The total number of functions is c^n,
  • where n is the number of states and c is the
    number of actions.
  • The hypothesis space for a value-function
    representation (e.g. Q(x,a) → v):
  • The total number of functions is w^(cn),
  • where w is the number of possible values.
  • NOTE: the size of the hypothesis space reflects
    neither the difficulty of the problem nor the
    efficiency of the methods that search the space.
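
A worked instance of the count (the numbers n = 16, c = 4, w = 10 are
assumed for illustration):

    n, c, w = 16, 4, 10
    print(c ** n)        # 4294967296, i.e. c^n direct policies
    print(w ** (c * n))  # 10^64 value functions, i.e. w^(cn)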

13
Credit Assignment
  • TD reinforcement learning: credit is chained
    backward. In this manner, payoffs are distributed
    across the sequence of actions, and a single
    reward value becomes associated with each
    individual state-decision pair.
  • EA reinforcement learning: rewards are only
    associated with sequences of decisions. Thus,
    which individual decisions are most responsible
    for a good or poor decision policy is irrelevant
    to the EA (the contrast is sketched below).
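
A schematic contrast (illustrative; the episode format and numbers are
assumptions):

    # One episode: a list of (state, action, reward) steps.
    episode = [("a1", "R", 0), ("a2", "D", 0), ("a3", "R", 1)]

    # TD-style: chain credit backward so each (state, action) pair
    # receives its own value.
    gamma, G, td_credit = 0.9, 0.0, {}
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        td_credit[(state, action)] = G   # per-decision credit

    # EA-style: one fitness value for the whole decision sequence.
    ea_fitness = sum(r for _, _, r in episode)  # never split per step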

14
Issue in Credit Assignment
  • In TD, the update is focused on a single state
    and action.
  • In EA, after a recombination the evaluation
    covers the entire sequence of actions.

15
Memory
  • TD reinforcement learning maintains statistics
    for every state-action pair. This method retains
    information about both good and bad state-action
    pairs.
  • EA reinforcement learning maintains information
    only about good state-action pairs. Thus, memory
    losses occur.

16
Example
  • Unlike TD, the information content of the EA
    population actually decreases during learning.
  • State loss can also occur in EA reinforcement
    learning.

17
Issues in RL
  • Exploration vs. Exploitation
  • Perceptual Aliasing
  • Generalization
  • Unstable Environments

18
Exploration vs. Exploitation for TD
  • Too much exploration could result in lower
    average rewards and too little could prevent the
    learning system from discovering new optimal
    states.
  • Solution 1
  • Solution 2
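
One standard TD-side remedy is ε-greedy action selection, which explores
a random action with probability ε and exploits otherwise (a minimal
sketch; the parameter values are assumptions, and this is not necessarily
either of the solutions the slide intends):

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """Pick a random action with probability epsilon, else the
        action with the highest estimated Q-value."""
        if random.random() < epsilon:
            return random.choice(actions)   # explore
        return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit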

19
Exploration vs. Exploitation for EA
  • Too much exploration could result in lower
    average rewards and too little could prevent the
    learning system from discovering new optimal
    states.
  • The solution comes from the nature of EA. Early
    in an evolutionary search, selection pressure is
    low because most policies within the population
    have low fitness. As evolution proceeds, highly
    fit policies evolve, increasing the selective
    pressure within the population (see the sketch
    below).
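
A minimal sketch of this self-regulating pressure, using
fitness-proportional (roulette-wheel) selection over a population and its
fitness list (names are illustrative):

    import random

    def roulette_select(population, fitnesses):
        """Pick a parent with probability proportional to fitness.
        Early on, near-equal low fitnesses make the draw nearly
        uniform (exploration); later, a few highly fit policies
        dominate it (exploitation)."""
        return random.choices(population, weights=fitnesses, k=1)[0]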

20
Perceptual Aliasing or Hidden State
  • In real world situations, the agent often will
    not have access to complete information on the
    state of its world.
  • TD methods are vulnerable to hidden states.
    Ambiguous state information misleads the TD
    method.
  • Since EA methods associate values with entire
    decision sequences, credits are based on net
    results.

21
Example
In TD reinforcement learning
22
Generalization
  • The generalization of policy decisions from one
    area of the state space to another.
  • The number of possible states of the world grows
    exponentially with the size of the task. The
    solution is to apply action decisions from
    observed states to unobserved states (e.g. ANNs,
    rule bases); see the sketch below.
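
A minimal sketch of such generalization via linear function approximation,
assuming states are encoded as feature vectors (all details illustrative):

    # Unseen states with similar features inherit similar values,
    # so one update generalizes across many states.
    weights = [0.0, 0.0, 0.0]

    def q_value(features):
        return sum(w * f for w, f in zip(weights, features))

    def update(features, target, lr=0.05):
        err = target - q_value(features)
        for i, f in enumerate(features):
            weights[i] += lr * err * f  # step shared by all similar states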

23
Comparison of EA and TD
  • Generalization of TD: in some large-scale
    problems, approximating the value function works
    well, while in some simple toy problems it fails.
    In a discrete table one update affects one state,
    whereas in the generalized case it affects more
    than one state and can create noise. (Failures on
    toy problems arguably matter less.)
  • Generalization of EA: since it makes less
    frequent updates and bases them on more global
    information, it is more difficult for a single
    observation to affect the global decision
    problem.

24
Unstable Environments
  • The agent must adapt its current decision policy
    in response to changes that occur in its
    world.(e.g. Faulty sensors, new obstructions,
    etc.)
  • Because TD makes constant updates to the decision
    policy, it should respond to changes as soon as
    they occur.
  • Since EA does not update any policy until an
    individual or a population of individuals has
    been completely evaluated over several actions,
    its response to changes is delayed.

25
Evolutionary Reinforcement Learning
Implementations
  • Learning Classifier Systems
  • SAMUEL
  • GENITOR
  • SANE

26
Learning Classifier Systems
  • Messages trigger classifiers, which are symbolic
    if-then rules that map sensory input to actions.
  • Classifiers are in a competition, resolved by a
    bidding algorithm.
  • Classifiers' messages may trigger new classifiers.
  • Classifier systems predate TD learning.
  • An EA selects, mutates, and recombines the
    classifiers that received the most credit from
    the bidding algorithm.
  • The results were disappointing.

27
SAMUEL
  • A system that learns to solve sequential decision
    problems.
  • SAMUEL searches the space of decision policies
    for sets of condition-action rules.
  • Each individual is a rule set that specifies its
    behavior.
  • Each gene is a rule that maps states to actions.
  • Three major components:
  • Problem-specific module: task environment
    simulation.
  • Performance module: interacts with the world and
    obtains payoffs.
  • Learning module: uses a GA to evolve behaviors.

28
GENITOR
  • GENITOR uses neural networks to represent the
    decision policy, meaning it generalizes over the
    state space.
  • GENITOR relies solely on its EA to adjust the
    weights in the NN.
  • Each NN (individual) is represented in the
    population as a sequence of connection weights
    (see the sketch below).
  • Crossover is applied on the basis of offspring
    performance.
  • Genetic operators are applied asynchronously.
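
A minimal sketch of this chromosome encoding with one-point crossover
over weight strings (illustrative only; not GENITOR's exact operators):

    import random

    # Each individual is a flat sequence of connection weights.
    parent_a = [random.uniform(-1, 1) for _ in range(12)]
    parent_b = [random.uniform(-1, 1) for _ in range(12)]

    def crossover(a, b):
        """One-point crossover over the weight string."""
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    child = crossover(parent_a, parent_b)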

29
SANE
  • SANE (Symbiotic, Adaptive Neuro-Evolution) was
    designed as a fast, efficient method for building
    NNs in domains where it is not possible to
    generate training data.
  • The NN forms a direct mapping from sensors to
    actions and provides effective generalization
    over the state space.
  • Individuals are complete NNs.
  • Two separate populations (sketched below):
  • A population of neurons: the building blocks of
    NNs.
  • A population of network blueprints: combinations
    of the building blocks of NNs.
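
A data-structure sketch of the two populations (shapes and names are
illustrative assumptions, not SANE's actual encoding):

    import random

    # Population 1: hidden neurons, each defined by connection weights.
    neurons = [{"in_w":  [random.uniform(-1, 1) for _ in range(4)],
                "out_w": [random.uniform(-1, 1) for _ in range(2)]}
               for _ in range(40)]

    # Population 2: blueprints, each a combination of neuron indices
    # that together assemble one complete candidate network.
    blueprints = [random.sample(range(len(neurons)), 8) for _ in range(20)]

    def build_network(bp):
        return [neurons[i] for i in bp]  # a complete NN individual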

30
SANE
31
Conclusion
  • EA and TD learn through interactions with the
    actual system.
  • EA and TD do not require a precise mathematical
    model of the domain.
  • The main differences:
  • Policy representation
  • Credit assignment
  • Memory