1
Genetic Programming and Genetic Algorithms
  • General Introduction

2
Introduction
  • This series of lectures tries to cover the area of search from a different perspective.
  • We first observe that every program is a function from a domain to a range: a program takes an input from an acceptable set of inputs and generates an output (side-effects could be part of the output). Sometimes the program 'is aware' of the exact set of acceptable inputs and reacts appropriately to inputs outside that set; most of the time such awareness is limited, and so the function the program corresponds to may produce unpredictable (or undesirable) input-output pairs outside a small part of the possible set of inputs.

3
Introduction
  • If the function can be represented in terms of already known functions, either as algebraic formulae or as exact (domain, range) pairing rules, we can describe the function explicitly and end up with a program for computing the input-output relation.
  • If the function cannot be so represented, we have a problem.
  • If the function CAN be so represented, we may still (and, with high probability, do) have a problem (NP-Complete, anyone?).
  • Can we solve the problem(s)?
  • The answer will be, by and large, NO, BUT...

4
Introduction
  • If we do not have explicit rules to take us from
    an input to an output, what can we expect to
    have?
  • A finite collection of valid input-output pairs
  • A way of evaluating whether a collection of
    input-output pairs produced is more or less
    desirable than some other collection of pairs
    with the same input components
  • A way of evaluating when our process of function
    construction can stop, either because we are not
    generating better functions or because some
    other cost (time or space) is becoming
    unacceptable.

5
Introduction
  • In some instances, we have a function and we are trying to find an input-output pair with some desired characteristics - e.g., a maximum, a minimum or a saddle-point of the function. If the function is given in a simple analytical form, Calculus techniques may be adequate. If the function is given in a very complex form (or, at least, one not amenable to simple analytic techniques), an intelligent search (using known properties of the function to reduce the search space) may be the only available strategy.

6
Introduction
  • One of the early results was obtained by McCulloch and Pitts (starting in the 1940s) via their study of perceptron-like input-output networks, with an input layer of nodes connected to an output layer of nodes.
  • By the late 1960s this setup had been shown to be inadequate to represent useful functions (e.g., XOR) (Minsky and Papert, Perceptrons, MIT Press, 1969). Later on, other people showed that the introduction of a third layer was adequate for the approximation of any desired well-behaved function; the level of approximation was tied to the number of nodes in this intermediate layer.
  • More specifically, we have:

7
Introduction
  • The Universal Approximation Theorem. Let φ(·) be a non-constant, bounded and monotone-increasing continuous function on R. Let Ip denote the p-dimensional closed unit hypercube [0, 1]^p. Let C(Ip) denote the space of continuous functions Ip -> R. Then, given any function f ∈ C(Ip) and ε > 0, there exist an integer M and sets of real constants ai, θi and wij, where i = 1, ..., M and j = 1, ..., p, such that we may define

      F(x1, ..., xp) = Σ_{i=1..M} ai φ( Σ_{j=1..p} wij xj - θi )

    as an approximate realization of the function f(·); that is, |F(x1, ..., xp) - f(x1, ..., xp)| < ε for all (x1, ..., xp) ∈ Ip.
  • Proof. See the references in Simon Haykin, Neural Networks, a Comprehensive Foundation, Macmillan, 1994, pp. 181-182. (There is a second edition out.)

8
Introduction
  • We can observe:
  • the logistic function 1/(1 + exp(-v)), used as the nonlinearity in a neuron model, satisfies the conditions on φ(·) above;
  • the network can be thought of as having p input nodes and a single hidden layer with M neurons; the inputs are x1, x2, ..., xp;
  • the hidden neuron i has synaptic weights wi1, wi2, ..., wip and threshold θi;
  • the network output is a linear combination of the outputs of the hidden neurons, with a1, ..., aM defining the coefficients of this combination.
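
A minimal numerical sketch of this architecture (not part of the original slides): the weights, thresholds and output coefficients below are random placeholders rather than trained values, but the code computes exactly the sum F(x1, ..., xp) = Σ ai φ(Σ wij xj - θi) with the logistic function as φ.

import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def make_network(p, M, rng):
    w = rng.normal(size=(M, p))      # synaptic weights w_ij of the M hidden neurons
    theta = rng.normal(size=M)       # thresholds theta_i
    a = rng.normal(size=M)           # output coefficients a_i
    def F(x):                        # x: a length-p input vector in [0, 1]^p
        return float(a @ logistic(w @ x - theta))
    return F

rng = np.random.default_rng(0)
F = make_network(p=3, M=10, rng=rng)
print(F(np.array([0.2, 0.5, 0.9])))  # one (untrained) network output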

9
Introduction
  • The theorem itself is a simple existence theorem - a generalization of Fourier's result going back to the 1820s. It tells us that a single layer suffices, but it does not tell us anything about efficiency of representation, minimality, ease of training, size of the required set of hidden nodes, etc.
  • It allows for further results relating the accuracy of estimation of a function f by the approximation F to the accuracy of the empirical fit - i.e., a way of relating M, the number of hidden nodes, to N, the size of the training set. One can also show that there exists a choice of M so that the rate of convergence of training is O((1/N)^(1/2)) times a logarithmic factor.

10
Introduction
  • One of the problems with single-hidden-layer perceptrons is that all intermediate information is global: it becomes impossible to interpret anything as a local feature.
  • Introducing two hidden layers allows one to interpret the information at the layer closest to the input as local, while the information at the layer closest to the output is global.
  • Other types of problems lead in a natural way to two-layer networks.

11
Introduction
  • The point to be made is that, although mathematical theory supports the approximability of most reasonable classes of functions from an n-dimensional euclidean space to an m-dimensional one, coming up with a usable approximation in a reasonable amount of time (= use of computational resources), given a small amount of information, remains (for now and forever) a completely non-trivial problem.
  • The problem is, essentially, one of search: among all possible approximations, find one that
  • fits the known input/output relation best (e.g., with minimum least-square error);
  • allows us to interpolate at points where we do not have data;
  • has acceptable cost in both construction and evaluation.

12
Introduction
  • What general method can we devise?
  • Random Search: start with a pair, add another, and so on, rejecting all those that do not meet acceptable (input, output) criteria. You may use continuity - which, essentially, says that nearby input values should lead to nearby output values. After a while, you may have enough of a function to perform (linear) interpolation with some feeling that you won't have too many surprises.
  • Non-Random Search: find a clever general-purpose algorithm that will allow us to build a good function with acceptable cost.
  • For the second one, recall that just about all the NP-Complete problems we know require a complete enumeration of all possible configurations to guarantee finding a solution, so...

13
Introduction
  • More formally:
  • The No Free Lunch Theorems: no search algorithm is superior to any other algorithm on average across all possible problems.
  • Consequence: algorithms that perform better than random search on one class of problems will perform worse than random search on another.
  • Consequence: all algorithms we devise will have to be tailored to the search domain - it is the use of specific domain-dependent information that gives us algorithms better than random search. Outside of a given domain, our results will be (much) worse.

14
Introduction
  • We are going to concentrate on some aspects of search and of function construction through search - the techniques to be introduced attempt to find computational analogues to methods that (to the best of our current knowledge) appear to be used in biological evolution.
  • Since biological evolution appears to be based on modification of the DNA passed from parent(s) to offspring, we are going to have to find ways to:
  • a) encode all desired characteristics of a problem in a data structure that can support DNA-type modifications (whatever they will be, but they must include some analogues of chromosomal recombination and mutation applied to strings over some alphabet) and still remain meaningful in our context;
  • b) construct an evaluation function (equivalent to evolutionary 'fitness' determination) on either the underlying data structure or on a representation

15
Introduction
  • derivable from the original one (i.e., the genotype);
  • c) devise a strategy for 'differential reproduction', so that 'fitter' individuals produce more offspring with a high chance of possessing desirable characteristics, while guarding against genetic overspecialization.
  • A lot of the terms are in quotes, since it is not obvious how the analogy with biological systems will be implemented; as they say, the devil is in the details.

16
Introduction
  • The Problem of Representation. How do we represent the information to be searched for?
  • In DNA-based search we deal with an alphabet of 4 letters (A, C, G, T), and finite strings over that alphabet. The length of the strings is (more or less) fixed, and parent-child information exchange seems to consist (at least at a first approximation - and for two-parent species) of the joining together of two single parental chromosomal strands into a double descendant strand, with dominant and recessive alleles for nearly all the genes so transmitted. Two other known mechanisms involve movement of sections of the string from one location to another, and single-locus changes (point mutations). Besides these and some other known mechanisms, there may well be many unknown ones.

17
Introduction
  • Some early representations used fixed-length strings of binary digits: each position had a 0 or a 1 (a bit-gene). A function scored the string. Evolution of a solution involved:
  1. Determining which individuals would reproduce.
  2. Selecting the pairs of individuals contributing to the offspring.
  3. Determining how they would so contribute.
  4. Determining the role and frequency of point-mutations.
  5. Defining the new population.
  6. Repeating the process from 1. above.
  7. Determining when termination was reached.
  8. Extracting the best individual as the solution.
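
A compact sketch of the loop just listed, for fixed-length bit strings. The fitness function (a simple count of 1-bits), the fitness-proportional selection, the one-point crossover and the parameter values are placeholder choices made only to keep the sketch runnable; they are not the specific choices made later in the lecture.

import random

N_BITS, POP_SIZE, GENERATIONS, P_MUT = 20, 30, 50, 0.01

def fitness(chrom):                  # placeholder scoring function: count of 1-bits
    return sum(chrom)

def select(pop):                     # fitness-proportional ("roulette wheel") choice
    total = sum(fitness(c) for c in pop)
    r = random.uniform(0, total)
    acc = 0.0
    for c in pop:
        acc += fitness(c)
        if acc >= r:
            return c
    return pop[-1]

def crossover(p1, p2):               # one-point crossover
    cut = random.randrange(1, N_BITS)
    return p1[:cut] + p2[cut:]

def mutate(chrom):                   # independent per-gene bit flips
    return [1 - g if random.random() < P_MUT else g for g in chrom]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):       # build each new generation from the old one
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP_SIZE)]
best = max(pop, key=fitness)         # extract the best individual as the "solution"
print(best, fitness(best))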

18
Introduction
  • There are several slightly different ways of looking at the evolution of a population via genetic-analogue methods:
  • 1) Each generation corresponds to another subset of possible individuals. Convergence will correspond to a family of subsets of decreasing diameter (under some metric).

19
Introduction
  • 2) Given a population of M individuals with N binary genes each, the number of states such a population can be in can be counted explicitly (see the count below). Finding a solution means, essentially, finding a limiting state for the evolving family of populations. A best individual of the limit population is our desired solution.
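
One standard way to make this count explicit, assuming a population is treated as an unordered multiset of M individuals drawn, with repetition, from the 2^N possible bit strings (the convention used, e.g., in Nix and Vose's Markov-chain model of the simple GA), is the binomial coefficient

    C(2^N + M - 1, M) = (2^N + M - 1)! / ( M! (2^N - 1)! ),

a quantity that grows very rapidly in both M and N.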

20
Introduction
  • 3) We can interpret each possible individual as a
    point in some space. The evaluation function
    defines a function from this space to R. We look
    for the maximum of this function over the space.
    The graph of this function provides us with what
    is called a fitness landscape. The evolution
    mechanism creates sets of different points in the
    domain from one generation to the next. We stop
    when several successive generations have not
    produced a substantially better individual, or
    when a given amount of computational resources
    has been expended.

21
Introduction
  • We need to consider how the offspring will receive the information from the parents, and the mechanisms that correspond to point-mutation and, possibly, to larger-scale mutation (e.g., repositioning of substrings within a chromosome).
  • Start from binary strings: each parent is represented by a binary string of length N. A reasonable analog to the contribution of single chromosomal strands from each parent, with dominant and recessive alleles, may be to just take a prefix substring of (random) length 0 <= n <= N from the first parent, a suffix substring of length N - n from the second parent, and concatenate them in the same prefix-suffix order. This provides a new chromosome of length N. Mutation can be simulated by walking down the new string and randomly resetting the bits.
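
A small sketch of this prefix/suffix recombination and of the mutation walk, on character-string chromosomes; the chromosome length, cut-point distribution and mutation rate below are illustrative choices only.

import random

def recombine(parent1, parent2):
    # Take a random-length prefix of parent1 and the matching suffix of parent2;
    # the result is again a chromosome of length N.
    N = len(parent1)
    n = random.randint(0, N)          # prefix length, 0 <= n <= N
    return parent1[:n] + parent2[n:]

def mutate(chrom, p_mut=0.01):
    # Walk down the string, flipping each bit independently with probability p_mut.
    return "".join(("1" if g == "0" else "0") if random.random() < p_mut else g
                   for g in chrom)

p1 = "1111111111"
p2 = "0000000000"
print(mutate(recombine(p1, p2)))      # e.g. a block of 1s then 0s, with rare flips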

22
Introduction
  • At the simplest level, we now have a data structure which is just a fixed-length string of bits, an evaluation function to provide relative ranks of different individuals, a mechanism for enforcing differential reproduction, and a mechanism to provide the next generation. A variant may include allowing some of the best individuals of one generation to have exact copies appear in the next.
  • A question that begs to be asked is: given a current generation, can we say anything about the evolution of genetic patterns from one generation to the next? This is crucial to our being able to believe that the process set in motion has some convergence properties, rather than just leading us to a completely random population.

23
Introduction
  • Other Representations. Some problems have
    natural representations in terms of larger
    alphabets (actual DNA comes to mind - 4 letters)
    or in terms of continuous quantities (requiring
    floating point numbers over various ranges). If
    the cardinality of the alphabet is a power of 2,
    we can still use the same bit-oriented
    mechanisms, and the chromosome at the next
    generation will remain meaningful.
  • If the alphabet is made up of continuous
    ranges, the problem of representation becomes
    more complex. A possible solution involves using
    a contiguous range of bits to represent each more
    structured entity, with the caveat that mutation
    and recombination must be constrained not to exit
    the appropriate ranges. Another solution
    involves accepting floating point values (rather
    than bits) as genes.

24
Introduction
  • This may simplify the interpretation (and
    implementation) of recombination and mutation,
    but we are still left with the problem of
    guaranteeing meaning for the results of such
    actions.
  • Another representational problem arises in what is properly called 'genetic programming', where the object being evolved is a program that attempts to compute a specific (only partially known) function. Natural representations of programs may involve trees, where the nodes are functions (interior nodes) or parameter values (leaves - if the program does not require iteration or recursion), or graphs. Although it is always possible to reduce everything to bits, any intuition is likely to be so removed from the bit-representation as to be useless.

25
Introduction
A difference in approach between genetic algorithms and genetic programming can be exemplified in the two diagrams below: in genetic algorithms, once the problem is codified into a data structure, we just apply the genetic algorithm; in genetic programming, the interaction with the original problem remains more direct and ongoing.
26
Introduction
  • More specifically, the genetic algorithm approach
    results in defining a binary string and then
    modifying and evaluating binary strings,
    constructing successive generations using (at
    least) crossover and mutation operators

27
Introduction
  • The genetic programming approach leads to a cycle

28
Introduction
  • The string is replaced by a tree, and the tree can be modified in ways that are more complex than those supported by a string (the apparent cycle is not crucial - the methods are all cyclical in nature). More crucial is the observation that we cannot limit a program to a fixed number of tree nodes: the search would be much too limited. This implies that the usual methods (which we will study in more detail later) for evaluating convergence will have to be modified - if they are applicable at all.

29
Introduction
  • Why should anything converge? By allowing a best element of the population P to survive from one generation to the next, we can ensure that the derived evaluation function F(t) = max_{x ∈ Pt} f(x) is monotone non-decreasing in t, but this does not mean that we should expect improvement or convergence. Another approach, based on the idea of a schema, provides a probabilistic approach, still with substantial drawbacks.

30
Introduction
  • Some Examples: 1 - Function Optimization. You are given the function f(x) = x·sin(10πx) + 1.0 over the closed interval [-1.0, 2.0], and you are expected to find the value of x in that range that maximizes it [Z. Michalewicz, p. 18].
  • An analytic approach would first compute the zeros of the first derivative (the function possesses a first derivative at every point in (-1.0, 2.0), so any interior maxima or minima will appear only at zeros of the derivative). There are finitely many such values over the given interval. We can now evaluate the function at all such values, plus -1.0 and 2.0 (solving tan(10πx) + 10πx = 0 will require some numerics - nothing too hard). A value of x for which we attain the maximum of this finite set, plus endpoints, provides us with a correct (input, output) pair, and a solution to our problem.
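
A rough numeric sketch of this analytic route (the grid resolution and tolerance are arbitrary choices): scan [-1.0, 2.0] for sign changes of f'(x) = sin(10πx) + 10πx·cos(10πx), refine each one by bisection, and compare f at the critical points and the endpoints.

import math

f  = lambda x: x * math.sin(10 * math.pi * x) + 1.0
df = lambda x: math.sin(10 * math.pi * x) + 10 * math.pi * x * math.cos(10 * math.pi * x)

def bisect(g, a, b, tol=1e-12):
    # Standard bisection on an interval known to contain a sign change of g.
    while b - a > tol:
        m = 0.5 * (a + b)
        if g(a) * g(m) <= 0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)

candidates = [-1.0, 2.0]                              # always include the endpoints
grid = [-1.0 + 3.0 * i / 30000 for i in range(30001)]
for a, b in zip(grid, grid[1:]):
    if df(a) * df(b) < 0:                             # a zero of f' lies in (a, b)
        candidates.append(bisect(df, a, b))

best = max(candidates, key=f)
print(best, f(best))                                  # should land near x ~ 1.85, f ~ 2.85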

31
Introduction
  • In the absence of analytic techniques, what can we do? We can choose a finite random set of points in [-1.0, 2.0], evaluate the function at those points, choose a point where the function achieves a largest value, and stop.
  • A modification would entail choosing a small random set, finding the x-value where the function has the largest value, choosing a second random set near this value, and repeating the process with smaller and smaller sets near better and better values. Stop when you don't improve from one generation to the next, or when you run out of computational cycles.
  • The second technique, just like the first, is likely to leave us stuck near a relative maximum which is not optimal. Can we do better?

32
Introduction
  • How do we devise a genetic algorithm? Essentially, we want to add some intelligence to this random search: try to avoid getting stuck on local maxima, and direct the search so that it is - hopefully - more efficient than strictly random. How?
  • Representation: what precision do we want? Let's choose six places after the decimal point (there has to be a point beyond which we don't care). Six decimal places means 3 x 1,000,000 = 3,000,000 intervals over [-1.0, 2.0]. Notice that 2,097,152 = 2^21 < 3,000,000 < 2^22 = 4,194,304, so we will use a string of 22 bits to represent numbers in the desired range.
  • We now have binary strings b21 b20 ... b1 b0, which we can convert to decimal numbers in [-1.0, 2.0].

33
Introduction
  • The two chromosomes 00...0 and 11...1 correspond to the endpoints -1.0 and 2.0, respectively. All others correspond to interior points. The evaluation function simply takes a binary chromosome, say v, transforms it into a decimal number, say dec(v), and evaluates f at that number: eval(v) = f(dec(v)).
  • Initialization. Create a random initial population of chromosomes.
  • Ranking. Rank the chromosomes according to the evaluation function.
  • Reproduction. Normalize the rankings so that each rank corresponds to an appropriate subinterval of [0, 1]. Run the random number generator twice, using the subintervals to determine probability of choice, to obtain two parents.
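
A sketch of the decoding and evaluation just described. The affine map taking the 22-bit integer onto [-1.0, 2.0] is the standard one (all-zeros to -1.0, all-ones to 2.0); the selection routine below is one simple way to realize the "subintervals of [0, 1]" idea, not necessarily the lecture's exact procedure.

import math, random

N_BITS = 22

def dec(v):
    # Map a 22-bit string v onto a real number in [-1.0, 2.0].
    return -1.0 + int(v, 2) * 3.0 / (2**N_BITS - 1)

def eval_chrom(v):
    x = dec(v)
    return x * math.sin(10 * math.pi * x) + 1.0    # eval(v) = f(dec(v))

def select(pop):
    # Fitness-proportional choice: each chromosome owns a subinterval of [0, 1]
    # whose width is proportional to its (shifted) evaluation.
    scores = [eval_chrom(v) for v in pop]
    low = min(scores)
    weights = [s - low + 1e-9 for s in scores]
    return random.choices(pop, weights=weights, k=1)[0]

pop = ["".join(random.choice("01") for _ in range(N_BITS)) for _ in range(50)]
parent1, parent2 = select(pop), select(pop)        # two parents for one mating
print(dec(parent1), eval_chrom(parent1))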

34
Introduction
  • When the parents are found, we create the offspring. Two operators are used: crossover and mutation. Select, randomly, the gene after which the crossover takes place. The first parent contributes the part of its chromosome ending at that gene (inclusive); the second parent contributes the final part of its chromosome to the offspring. If mutation is to be included, one must successively use the random number generator to determine whether each gene of the offspring is to be changed. Repeat the process until a number of offspring equal to the desired population is obtained.
  • Next Generation. We now have a new generation, for the process to be repeated.

35
Introduction
  • Several issues must be resolved:
  • the size of the initial population (50, in this case);
  • the number of generations the process is allowed to continue (150, in this case);
  • the probability of crossover (the probability that a chromosome undergoes crossover: pc = 0.25);
  • the probability of mutation (the probability that a gene is flipped: pm = 0.01).
  • None of them can be optimally determined.
  • A run provides the following results.

36
Introduction
  • The best individuals discovered within 150
    generations are given in the table below.

37
Introduction
  • Some Examples: 2 - The Prisoner's Dilemma. This is discussed in Michalewicz's book, in Mitchell's, and, quite extensively, in D. B. Fogel's.
  • The Problem: two individuals are held prisoner and are under pressure to confess to some undesirable activity implicating the other. The options (and rewards/punishments) are summarized in the table below.

38
Introduction
  • The aim of the game is to find a strategy that, over the long run, will maximize one's gains. At any one point, the maximizing strategy would have the winner defect and the loser hold out, so there is always a temptation to defect. In fact, defection is always the safest individual choice at each point. On the other hand, a sequence of mutual defections has a combined payoff much smaller than that of a sequence of mutual cooperations. We can compute:
  • An infinite sequence of random choices (each configuration has probability 0.25) has an expected return (for each of the players) of 2.25, with an expected cumulative return of 4.5.
  • An infinite sequence of defections, with the other player choosing randomly, has an expected return (for the defector) of 3 and of 0.5 (for the other), so the expected cumulative return is only 3.5.

39
Introduction
  • 3. Exercise: compute the expected returns, individual and cumulative, for at least two other sequences of actions.
  • Representing a solution. Ideally, we should have a complete memory of the past to determine the next decision. Since this is not possible, a decision has to be made on the basis of a finite memory. Michalewicz's text uses the previous three moves. In fact (Mitchell's and Fogel's books), tournaments (human and computer) were organized by R. Axelrod in the '70s and '80s to try to determine best strategies, and he decided to use the three-move memory we will use here. If a chromosome contains information about the three previous moves, and since there are 4 possible outcomes for each move, we have to keep track of 64 (= 4·4·4) possible games - 64 different histories.

40
Introduction
  • If we order the histories in some canonical order, a 64-bit array allows us to associate a response with each history. We must also prime the system with three initial games, with two bits required for each (an index into the histories). Total: 70 bits for a chromosome (a sketch in code appears below).
  • Choose an initial population. Create a number, N, of random 70-bit strings.
  • Test each player to determine effectiveness. Use the strategy encoded in the chromosome to play games against all other players. The fitness is the average score over all the games played. The original study by Axelrod had each member of the population play against a set of given strategies - culled from one of the human and computer tournaments. The initial game sequence provides an index into the bit-string, so one could use the first six bits to determine the

41
Introduction
  • initial strategy for each player (player no. 1 uses its initial conditions to determine the game sequence to be continued by both).
  • Select the players to breed. A player with an average score (within a standard deviation of the average) is given one mating; a player with a score one or more standard deviations above average is given two matings; one with a score one or more standard deviations below average is given no matings (some minor adjustment may be necessary to keep the population at constant size).
  • Breed. Randomly pick pairs, and create two descendants per mating using both crossover and mutation.
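
A sketch of how a 70-bit chromosome can drive such a player: 64 response bits indexed by the last three games, plus 6 bits priming the three initial games. The history ordering and the payoff values (chosen to be consistent with the expected returns quoted earlier: 3/3 for mutual cooperation, 1/1 for mutual defection, 5/0 for defection against cooperation) are assumptions, not necessarily the encoding used by Axelrod or Michalewicz.

import random

# Payoffs per game, as (row player, column player); an assumed but consistent table.
PAYOFF = {("C", "C"): (3, 3), ("D", "D"): (1, 1),
          ("D", "C"): (5, 0), ("C", "D"): (0, 5)}

def history_index(history):
    # Map three (my_move, other_move) pairs onto 0..63, two bits per game.
    idx = 0
    for mine, theirs in history:
        idx = (idx << 2) | ((mine == "D") << 1) | (theirs == "D")
    return idx

def play(chrom_a, chrom_b, rounds=64):
    # chrom_*: 70 bits -- 64 response bits plus 6 bits priming three initial games.
    prime = [("D" if chrom_a[64 + 2 * i] else "C", "D" if chrom_a[65 + 2 * i] else "C")
             for i in range(3)]                     # player A's priming bits seed both
    hist_a = list(prime)
    hist_b = [(t, m) for (m, t) in prime]           # B sees the mirrored history
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = "D" if chrom_a[history_index(hist_a)] else "C"
        move_b = "D" if chrom_b[history_index(hist_b)] else "C"
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a = hist_a[1:] + [(move_a, move_b)]
        hist_b = hist_b[1:] + [(move_b, move_a)]
    return score_a, score_b

a = [random.randint(0, 1) for _ in range(70)]
b = [random.randint(0, 1) for _ in range(70)]
print(play(a, b))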

42
Introduction
  • Results. Experiments lead to a number of strategies being discovered:
  • Continue cooperation after three initial cooperations: (CC)(CC)(CC) leads to C.
  • Defect when the other player defects: (CC)(CC)(CD) leads to D.
  • Continue to cooperate after cooperation has been restored: (CD)(CD)(CC) leads to C.
  • Cooperate when cooperation has been restored after your own defection: (DC)(CC)(CC) leads to C.
  • Defect after three mutual defections: (DD)(DD)(DD) leads to D.
  • If the payoff for successful defection is increased to 6, strategies can develop with expected payoffs > 3.0 (Fogel).

43
Introduction
  • The evolved strategies can be represented as finite-state machines, which also means that the final best strategy can be interpreted as a formal program - the best evolved.
  • This was an example of co-evolution, where the individuals in a population were pitted against one another and caused the population to evolve through mutual interactions.
  • For much more detail, see the papers by Axelrod or the book by D. B. Fogel, Evolutionary Computation.

44
Introduction
  • Some Examples: 3 - The Traveling Salesman Problem. This is a well-known NP-Complete problem, which has some reasonable approximation schemes - at least in the case in which the distances between nodes satisfy a triangle inequality (see Cormen et al., Ch. 35). There can be no expectation of an exact solution (find a tour of minimum length) in anything but exponential time, but it may be possible to beat the quality of the known deterministic approximation algorithms without giving up too much time.
  • What is a chromosome? A reasonable interpretation is that a chromosome represents a tour. Since the complete graph consists of N nodes, is the chromosome a binary string (array) or an integer array?

45
Introduction
  • This is an important decision because of the genetic operators: do we split at a bit or at a full node index (integer)?
  • If we use a bit representation, our chromosome will need N·ceiling(log2 N) bits. Using crossover or mutation at the bit level cannot guarantee that the new chromosome represents a new tour, or, possibly, even that we have a path in the graph (if 2^ceiling(log2 N) > N, some bit patterns name non-existent nodes).
  • If we use an integer representation (N integers), and we use our genetic operators at integer boundaries, we can at least guarantee that we still have a path, although maybe not a tour.

46
Introduction
  • The choice in this case is of an integer representation for the chromosome, essentially because:
  • it avoids one set of problems - the possible introduction of non-existent nodes - and,
  • it permits some simpler crossover repair algorithms to make sure that no stillborn descendants are allowed into the population. The repair algorithms can be incorporated into the genetic operators.
  • Mutation can be handled in a similar way.
  • Initialization. For a population of size M, pick M random permutations on N items. Another possibility is to use a greedy algorithm to construct M approximate solutions, and start from those.
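
A sketch of the integer (permutation) representation together with one common tour-preserving recombination, order crossover, and a tour-preserving mutation. The lecture defers the actual operator details, so the specific operators and the toy distance matrix below are only illustrative choices.

import random

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def order_crossover(p1, p2):
    # Copy a slice of p1, then fill the remaining positions with the missing
    # cities in the order they appear in p2; the child is always a valid tour.
    n = len(p1)
    i, j = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[i:j] = p1[i:j]
    fill = [c for c in p2 if c not in child]
    for pos in range(n):
        if child[pos] is None:
            child[pos] = fill.pop(0)
    return child

def swap_mutation(tour, p_mut=0.05):
    # Exchange two cities; unlike bit flips, this keeps the tour valid.
    tour = tour[:]
    if random.random() < p_mut:
        a, b = random.sample(range(len(tour)), 2)
        tour[a], tour[b] = tour[b], tour[a]
    return tour

N = 8
dist = [[abs(a - b) + 1 for b in range(N)] for a in range(N)]   # toy distance matrix
p1, p2 = random.sample(range(N), N), random.sample(range(N), N)
child = swap_mutation(order_crossover(p1, p2))
print(child, tour_length(child, dist))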

47
Introduction
  • Evaluation and Ranking. Straightforward for each individual - just compute the value of the tour it represents.
  • Breeding. Both crossover and mutation must be implemented carefully to preserve a tour and to maintain a relationship with the parents. We will look at the details at some later date.
  • Results. The algorithm, as described, appears to be better than random search, but is not very efficient: a 100-city tour, after 20,000 generations, gives a value for the best tour found about 9.4% above optimum (Michalewicz, p. 26).

48
Introduction
  • Some Examples: 4 - Tree-Based Genetic Programming. Assume we are given a function, either explicitly or, more likely, as a set of (input, output) pairs. Assume we have a finite set of pairs derived from the function y = x^2. We want to construct the best interpolating function we can, obviously starting only from the set of (input, output) pairs.
  • A program can be viewed as a tree structure, where the leaves are terminals (= parameter values), while the interior nodes are function calls whose children are values provided from farther down the tree.

49
Introduction
  • One would start with an initial population of trees and evaluate the individuals on the set of (input, output) pairs, assigning each a value dependent on how good the match is between the output values computed by the program and those given. One would then apply some tree-compatible genetic operators to generate the new population.
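
A sketch of how such trees might be represented (as nested tuples) and scored against (input, output) pairs drawn from y = x^2; the node set, the protected division and the error measure are simple illustrative choices (cf. the simple-gp.c program mentioned later).

def evaluate(tree, x):
    # Interior nodes are (operator, left, right); leaves are 'x' or constants.
    if tree == "x":
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    a, b = evaluate(left, x), evaluate(right, x)
    if op == "+": return a + b
    if op == "-": return a - b
    if op == "*": return a * b
    if op == "/": return a / b if b != 0 else 1.0   # "protected" division
    raise ValueError(op)

def fitness(tree, pairs):
    # Sum of squared errors over the (input, output) pairs; lower is better.
    return sum((evaluate(tree, x) - y) ** 2 for x, y in pairs)

pairs = [(x / 10.0, (x / 10.0) ** 2) for x in range(-10, 11)]   # samples of y = x^2
perfect = ("*", "x", "x")                                       # one exact interpolant
candidate = ("+", ("*", "x", "x"), ("/", "x", 3.0))
print(fitness(perfect, pairs), fitness(candidate, pairs))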
50
Introduction
  • Another example of next generation

51
Introduction
  • Final result: we have a true match, although all we really know is that we have achieved an exact interpolation of the given (input, output) set.

52
Introduction
  • The program simple-gp.c evolves such a solution.
  • It develops a formula (with some understanding of the need to take care of zero-divisions) which, complex as it appears, can, in fact, be reduced to the actual function; unfortunately, it does not look much like the actual function on a first reading. This is not an unusual problem, since the rules for canonical representation of rational functions are fairly hard to implement - see some of the early programs for symbolic computation.

53
Introduction
  • This approach raises a number of representational questions: how do we represent a program? What is a chromosome? What is mutation? Etc.
  • Part of the problem is that a constant-length chromosome might correspond to a rather limited family of trees, making the whole evolutionary process moot. On the other hand, introducing variable-length chromosomes with very large alphabets (= primitive functions and parameter values) may grow our chromosomes to enormous size (although our own DNA may be trying to warn us). Furthermore, we are looking for some theoretical models to at least give plausibility to our methods; such theoretical models are likely to be far too complicated.

54
Introduction
  • Some other ideas on Genetic Programming. Other questions arise on the meaning of a 'program': as we indicated in the Prisoner's Dilemma, one can evolve finite-state machines that are quite efficient. Those are, undeniably, programs. The next question might be how to represent (and define and apply genetic operators for) stack machines (supporting recursion), assembly-language machines, graph-reduction machines (which are used in the compilation and optimization of functional-language programs), etc.
  • And all of this requires some kind of supporting theoretical framework.