Genetic Programming and Genetic Algorithms
  • General Introduction

  • This series of lectures tries to cover the area
    of search from a different perspective.
  • We first observe that every program is a
    function from a domain to a range: a program
    takes an input from an acceptable set of inputs
    and generates an output (side-effects could be
    part of the output). Sometimes the program is
    aware of the exact set of acceptable inputs
    and reacts appropriately to inputs outside that
    set. Most of the time such awareness is limited,
    and so the function the program corresponds to
    may produce unpredictable (or undesirable)
    input-output pairs outside a small part of the
    possible set of inputs.

  • If the function can be represented in terms of
    already known functions either in terms of
    algebraic formulae or in terms of exact (domain,
    range) pairing rules, we can describe the
    function explicitly and we can end with a program
    for computing the input-output relation.
  • If the function cannot be so represented, we have
    a problem
  • If the function CAN be so represented, we may
    still (and, with high probability, do) have a
    problem (NP-Complete, anyone?)
  • Can we solve the problem(s)?
  • The answer will be, by and large, NO, BUT

  • If we do not have explicit rules to take us from
    an input to an output, what can we expect to
    have?
  • A finite collection of valid input-output pairs
  • A way of evaluating whether a collection of
    input-output pairs produced is more or less
    desirable than some other collection of pairs
    with the same input components
  • A way of evaluating when our process of function
    construction can stop, either because we are not
    generating better functions or because some
    other cost (time or space) is becoming
    prohibitive

  • In some instances, we have a function and we are
    trying to find an input-output pair with some desired
    characteristics - e.g., a maximum, a minimum or a
    saddle-point of the function. If the function is
    given with a simple analytical form, Calculus
    techniques may be adequate. If the function is
    given in a very complex form (or, at least, not
    amenable to simple analytic techniques), an
    intelligent search (using known properties of
    the function to reduce the search space) may be
    the only available strategy.

  • One of the early results was obtained by
    McCulloch and Pitts (starting in the 1940s) via
    their study of perceptrons: input-output networks
    with an input layer of nodes connected to an
    output layer of nodes.
  • By the late 1960s this setup had been shown to
    be inadequate to represent useful functions
    (e.g. XOR) (Minsky and Papert, Perceptrons, MIT
    Press, 1969). Later on, other people showed that
    the introduction of a third layer was adequate
    for the approximation of any desired
    well-behaved function; the level of
    approximation was tied to the number of nodes in
    this intermediate layer.
  • More specifically, we have

  • The Universal Approximation Theorem. Let $\varphi(\cdot)$ be
    a non-constant, bounded and monotone-increasing
    continuous function on $\mathbb{R}$. Let $I_p$ denote the
    p-dimensional closed unit hypercube $[0, 1]^p$. Let
    $C(I_p)$ denote the space of continuous functions
    $I_p \to \mathbb{R}$. Then, given any function $f \in C(I_p)$ and
    $\varepsilon > 0$, there exist an integer $M$ and sets of real
    constants $\alpha_i$, $\theta_i$ and $w_{ij}$, where $i = 1, \ldots, M$ and
    $j = 1, \ldots, p$, such that we may define
    $$F(x_1, \ldots, x_p) = \sum_{i=1}^{M} \alpha_i \, \varphi\Big( \sum_{j=1}^{p} w_{ij} x_j - \theta_i \Big)$$
    as an approximate realization of the function $f(\cdot)$;
    that is, $|F(x_1, \ldots, x_p) - f(x_1, \ldots, x_p)| < \varepsilon$
    for all $(x_1, \ldots, x_p) \in I_p$.
  • Proof. See the references in Simon Haykin,
    Neural Networks: A Comprehensive Foundation,
    Macmillan, 1994, pp. 181-182. (There is a second
    edition out.)

  • We can observe:
  • the logistic function $1/(1 + \exp(-v))$ used as
    the nonlinearity in a neuron model satisfies the
    conditions on $\varphi(\cdot)$ above;
  • the network can be thought of as having p input
    nodes and a single hidden layer with M neurons;
    the inputs are $x_1, x_2, \ldots, x_p$;
  • the hidden neuron i has synaptic weights
    $w_{i1}, w_{i2}, \ldots, w_{ip}$ and threshold $\theta_i$;
  • the network output is a linear combination of
    the outputs of the hidden neurons, with $\alpha_1, \ldots, \alpha_M$
    defining the coefficients of this combination.
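
The network described by these observations can be sketched in a few lines of Python. This is only an illustration: the function names, the example weights and the convention that the threshold is subtracted from the weighted sum are assumptions, not something the theorem prescribes.

```python
import math

def logistic(v):
    """Logistic nonlinearity 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def network_output(x, alpha, w, theta):
    """Single-hidden-layer network with p inputs and M hidden neurons.

    x     : list of p inputs
    alpha : list of M output coefficients
    w     : M lists of p synaptic weights (one list per hidden neuron)
    theta : list of M thresholds (assumed to be subtracted)
    """
    total = 0.0
    for a_i, w_i, th_i in zip(alpha, w, theta):
        v = sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) - th_i
        total += a_i * logistic(v)
    return total

# Hypothetical example with p = 2 inputs and M = 3 hidden neurons.
alpha = [1.0, -2.0, 0.5]
w = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]]
theta = [0.0, 0.5, -1.0]
value = network_output([0.0, 0.0], alpha, w, theta)
```

The theorem only says that suitable values of M, the coefficients, the weights and the thresholds exist for a given accuracy; finding them is the search problem discussed in the rest of these notes.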

  • The theorem itself is a simple existence theorem
    - a generalization of Fourier's result going back
    to the 1820s. It tells us that a single hidden layer
    suffices, but it does not tell us anything about
    efficiency of representation, minimality, ease of
    training, size of the required set of hidden
    nodes, etc.
  • It allows for further results relating the
    accuracy of estimation of a function f by the
    approximation F to the accuracy of the empirical
    fit - i.e., a way of relating M, the number of
    hidden nodes, to N, the size of the training set.
    One can also show that there exists a choice of
    M so that the rate of convergence of training is
    $O((1/N)^{1/2})$ times a logarithmic factor.

  • One of the problems with single hidden layer
    perceptrons is that all intermediate information
    is global: it becomes impossible to interpret
    anything as a local feature.
  • Introducing two hidden layers allows one to
    interpret the information at the layer closest to
    the input as local while the information at the
    layer closest to the output is global.
  • Other types of problems lead in a natural way to
    two-layer networks.

  • The point to be made is that, although
    mathematical theory supports the approximability
    of most reasonable classes of functions from an
    n-dimensional euclidean space to an m-dimensional
    one, coming up with a usable approximation in a
    reasonable amount of time (i.e., use of computational
    resources) given a small amount of information
    remains (for now and forever) a completely
    non-trivial problem.
  • The problem is, essentially, one of search:
    among all possible approximations, find one that
  • fits the known input/output relation best
    (e.g., with minimum least square error)
  • allows us to interpolate at points where we do
    not have data
  • has acceptable cost in both construction and
    use.
  • What general method can we devise?
  • Random Search: start with a pair, add another,
    and so on, rejecting all those that do not meet
    acceptable (input, output) criteria. You may use
    continuity - which, essentially, says that nearby
    input values should lead to nearby output values.
    After a while, you may have enough of a function
    to perform (linear) interpolation with some
    feeling that you won't have too many surprises.
  • Non-Random Search: find a clever general purpose
    algorithm that will allow us to build a good
    function with acceptable cost.
  • For the second one, recall that, as far as is
    currently known, just about all the NP-Complete
    problems require (in the worst case) a complete
    enumeration of all possible configurations to
    guarantee finding a solution.

  • More formally:
  • The No Free Lunch Theorems: no search algorithm
    is superior to any other algorithm on average
    across all possible problems.
  • Consequence: algorithms that perform better than
    random search on one class of problems will
    perform worse than random search on another.
  • Consequence: all algorithms we devise will have
    to be tailored to the search domain - it is the
    use of specific domain-dependent information that
    gives us algorithms better than random search.
    Outside of a given domain, our results will be
    (much) worse.

  • We are going to concentrate on some aspects of
    search and of function construction through
    search - the techniques to be introduced attempt
    to find computational analogues to methods that
    (to the best of our current knowledge) appear
    to be used in biological evolution.
  • Since biological evolution appears to be based on
    modification of the DNA passed from parent(s)
    to offspring, we are going to have to find ways to:
  • encode all desired characteristics of a problem
    in a data structure that can support DNA-type
    modifications (whatever they will be, but they
    must include some analogues of chromosomal
    recombination and mutation applied to strings
    over some alphabet) and still remain meaningful
    in our context
  • construct an evaluation function (equivalent to
    evolutionary fitness determination) on either the
    underlying data structure

  • derivable from the original one (i.e., the
  • devise a strategy for differential
    reproduction, so that fitter individuals
    produce more offspring with a high chance of
    possessing desirable characteristics, while
    guarding against genetic overspecialization.
  • A lot of the terms are in quotes, since it is not
    obvious how the analogy with biological systems
    will be implemented; as they say, the devil is in
    the details.

  • The Problem of Representation. How do we
    represent the information to be searched for?
  • In DNA-based search we deal with an alphabet
    over 4 letters (A, C, G, T), and finite strings
    over that alphabet. The length of the strings is
    (more or less) fixed and parent-child information
    exchange seems to consist (at least at a first
    approximation - and for two-parent species) of
    the joining together of two single parental
    chromosomal strands into a double descendant
    strand with dominant and recessive alleles for
    nearly all the genes so transmitted. Two other
    known mechanisms involve movement of sections of
    the string from one location to another and
    single-locus changes (point mutations). Besides
    some other known mechanisms, there may well be
    many unknown ones.

  • Some early representations used fixed-length
    strings of binary digits: each position had a 0
    or a 1 (a bit-gene). A function scored the
    string. Evolution of a solution involved:
  • 1) Determining which individuals would reproduce.
  • 2) Selecting the pairs of individuals contributing
    to the offspring.
  • 3) Determining how they would so contribute.
  • 4) Determining the role and frequency of
    mutation.
  • 5) Defining the new population.
  • 6) Repeating the process from 1) above.
  • 7) Determining when termination was reached.
  • 8) Extraction of the best individual as the
    solution.
  • There are several slightly different ways of
    looking at the evolution of a population via
    genetic-analogue methods:
  • 1) Each generation corresponds to another subset of
    possible individuals. Convergence will
    correspond to a family of subsets of decreasing
    diameter (under some metric).

  • 2) Given a population of M individuals with N
    binary genes each, the number of states such a
    population can be in is given by the
    formula Finding a solution
    means, essentially, finding a limiting state for
    the evolving family of populations. A best
    individual of the limit population is our desired

  • 3) We can interpret each possible individual as a
    point in some space. The evaluation function
    defines a function from this space to R. We look
    for the maximum of this function over the space.
    The graph of this function provides us with what
    is called a fitness landscape. The evolution
    mechanism creates sets of different points in the
    domain from one generation to the next. We stop
    when several successive generations have not
    produced a substantially better individual, or
    when a given amount of computational resources
    has been expended.

  • We need to consider how the offspring will
    receive the information from the parents, and the
    mechanisms that correspond to point-mutation and,
    possibly, to larger scale mutation (e.g.,
    repositioning of substrings within a chromosome).
  • Start from binary strings: each parent is
    represented by a binary string of length N. A
    reasonable analog to the contribution of single
    chromosomal strands from each parent, with
    dominant and recessive alleles, may be to just
    take a prefix substring of (random) length n,
    $0 \le n \le N$, from the first parent, a suffix substring of
    length N - n from the second parent and
    concatenate them in the same prefix-suffix order.
    This provides a new chromosome of length N.
    Mutation can be simulated by walking down the
    new string and randomly resetting the bits.
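
The two operators just described can be sketched as follows. The function names and the `p_flip` parameter are hypothetical, and "randomly resetting the bits" is interpreted here as flipping each bit independently with a small probability, which is one common reading rather than the only one.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Prefix of random length n (0 <= n <= N) from parent_a,
    suffix of length N - n from parent_b, in the same order."""
    n = rng.randint(0, len(parent_a))  # randint is inclusive at both ends
    return parent_a[:n] + parent_b[n:]

def mutate(chromosome, p_flip, rng):
    """Walk down the string, flipping each bit with probability p_flip."""
    return [bit ^ 1 if rng.random() < p_flip else bit for bit in chromosome]

rng = random.Random(0)
mother = [0] * 10
father = [1] * 10
child = one_point_crossover(mother, father, rng)
mutant = mutate(child, 0.1, rng)
```

With an all-zeros and an all-ones parent, the child is always a run of zeros followed by a run of ones, which makes the crossover point easy to see.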

  • At the simplest level, we now have a data
    structure which is just a fixed length string of
    bits, an evaluation function to provide relative
    ranks of different individuals, a mechanism for
    enforcing differential reproduction, and a
    mechanism to provide the next generation. A
    variant may include allowing some of the best
    individuals of one generation to have exact
    copies appear in the next.
  • A question that begs to be asked is: given a
    current generation, can we say anything about the
    evolution of genetic patterns from one generation
    to the next? This is crucial to our being able
    to believe that the process set in motion has
    some convergence properties, rather than just
    leading us to a completely random population.

  • Other Representations. Some problems have
    natural representations in terms of larger
    alphabets (actual DNA comes to mind - 4 letters)
    or in terms of continuous quantities (requiring
    floating point numbers over various ranges). If
    the cardinality of the alphabet is a power of 2,
    we can still use the same bit-oriented
    mechanisms, and the chromosome at the next
    generation will remain meaningful.
  • If the alphabet is made up of continuous
    ranges, the problem of representation becomes
    more complex. A possible solution involves using
    a contiguous range of bits to represent each more
    structured entity, with the caveat that mutation
    and recombination must be constrained not to exit
    the appropriate ranges. Another solution
    involves accepting floating point values (rather
    than bits) as genes.

  • This may simplify the interpretation (and
    implementation) of recombination and mutation,
    but we are still left with the problem of
    guaranteeing meaning for the results of such
    operations.
  • Another representational problem arises in what
    is properly called genetic programming, where
    the object being evolved is a program that
    attempts to compute a specific (only partially
    known) function. Natural representation of
    programs may involve trees, where the nodes are
    functions (interior nodes) or parameter values
    (leaves - if the program does not require
    iteration or recursion), or graphs. Although it
    is always possible to reduce everything to bits,
    any intuition is likely to be so removed from the
    bit-representation as to be useless.

  • A difference in approach between genetic
    algorithms and genetic programming can be
    exemplified in the two diagrams below: in genetic
    algorithms, once the problem is codified into a
    data structure, we just apply the genetic
    algorithm; in genetic programming the interaction
    with the original problem remains more direct and
  • More specifically, the genetic algorithm approach
    results in defining a binary string and then
    modifying and evaluating binary strings,
    constructing successive generations using (at
    least) crossover and mutation operators

  • The genetic programming approach leads to a cycle

  • The string is replaced by a tree, and the tree
    can be modified in ways that are more complex
    than those supported by a string (the apparent
    cycle is not crucial - the methods are all
    cyclical in nature). More crucial is the
    observation that we cannot limit a program to a
    fixed number of tree nodes: the search would be
    much too limited. This implies that the usual
    methods (which we will study in more detail
    later) for evaluating convergence will have to
    be modified - if they are applicable at all.

  • Why should anything converge? By allowing a
    best element of the population P to survive
    from one generation to the next, we can ensure
    that the derived evaluation function
    $F(t) = \max_{x \in P_t} f(x)$ is monotone non-decreasing in t,
    but this does not mean that we should expect
    improvement or convergence. Another approach,
    based on the idea of a schema, provides a
    probabilistic approach, still with substantial

  • Some Examples 1 - Function Optimization. You
    are given the function $f(x) = x \sin(10\pi x) + 1.0$
    over the closed interval $[-1.0, 2.0]$, and you are
    expected to find the value of x in that range
    that maximizes it [Z. Michalewicz, p. 18].
  • An analytic approach would first compute the
    zeros of the first derivative (the function
    possesses a first derivative at every point in
    $(-1.0, 2.0)$, so any interior maxima or minima will
    appear only at zeros of the derivative). There
    are finitely many such values over the given
    interval. We can now evaluate the function at
    all such values, plus -1.0 and 2.0 (solving
    $\tan(10\pi x) + 10\pi x = 0$ will require some numerics -
    nothing too hard). A value of x for which we
    attain the maximum of this finite set, plus
    endpoints, provides us with a correct (input,
    output) pair, and a solution to our problem.

  • In the absence of analytic techniques, what can
    we do? We can choose a finite random set of
    points in $[-1.0, 2.0]$, evaluate the function at
    those points, choose a point where the function
    achieves a largest value and stop.
  • A modification would entail choosing a small
    random set, finding the x-value where the
    function has the largest value, choosing a second
    random set near this value, and repeating the
    process with smaller and smaller sets near better
    and better values. Stop when you don't improve
    from one generation to the next, or when you
    run out of computational cycles.
  • The second technique, just like the first, is
    likely to leave us stuck near a relative
    maximum which is not optimal. Can we do better?
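
Both random techniques can be sketched as below, taking the example function to be x sin(10 pi x) + 1.0 on [-1.0, 2.0]. The names, the sample sizes, the shrink factor and the fixed number of rounds (used here in place of a "no improvement" stopping test) are all illustrative assumptions.

```python
import math
import random

def f(x):
    """The example function x * sin(10 * pi * x) + 1.0."""
    return x * math.sin(10 * math.pi * x) + 1.0

def random_search(lo, hi, samples, rng):
    """Evaluate f at random points in [lo, hi]; return the best point."""
    xs = [rng.uniform(lo, hi) for _ in range(samples)]
    return max(xs, key=f)

def shrinking_search(lo, hi, samples, rounds, shrink, rng):
    """Repeat random search in smaller and smaller windows
    centred on the best point found so far."""
    best = random_search(lo, hi, samples, rng)
    radius = (hi - lo) / 2
    for _ in range(rounds):
        radius *= shrink
        a, b = max(lo, best - radius), min(hi, best + radius)
        candidate = random_search(a, b, samples, rng)
        if f(candidate) > f(best):
            best = candidate
    return best

x_plain = random_search(-1.0, 2.0, 20, random.Random(1))
x_refined = shrinking_search(-1.0, 2.0, 20, 10, 0.5, random.Random(1))
```

Because the window only ever shrinks around the current best point, a poor first sample can trap the refined search on a local maximum, which is exactly the weakness noted above.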

  • How do we devise a genetic algorithm?
    Essentially, we want to add some intelligence
    to this random search: try to avoid getting stuck
    on local maxima, and direct the search so that it
    is - hopefully - more efficient than strictly
    random. How?
  • Representation: what precision do we want? Let's
    choose six places after the decimal point (there
    has to be a point beyond which we don't care).
    Six decimal places means $3 \times 1000000$ intervals over
    $[-1.0, 2.0]$. Notice that
    $2097152 = 2^{21} < 3000000 < 2^{22} = 4194304$,
    so we will use a string of 22
    bits to represent numbers in the desired range.
  • We now have binary strings $b_{21} b_{20} \ldots b_1 b_0$, which we
    can convert to decimal in $[-1.0, 2.0]$.
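
One way to carry out this conversion is a linear map from the integer value of the 22-bit string onto the interval. The sketch below assumes that the all-zeros string should land on -1.0 and the all-ones string on 2.0; the names `decode`, `N_BITS`, `LO` and `HI` are hypothetical.

```python
N_BITS = 22
LO, HI = -1.0, 2.0

def decode(bits):
    """Map a bit string b21 ... b0 (most significant bit first,
    given as a str of '0'/'1') to a real number in [LO, HI]."""
    value = int(bits, 2)                      # integer in 0 .. 2**22 - 1
    return LO + value * (HI - LO) / (2 ** N_BITS - 1)

left_end = decode("0" * N_BITS)
right_end = decode("1" * N_BITS)
interior = decode("1000101110110101000111")
```

Adjacent bit strings then differ by 3 / (2^22 - 1), which is below the six-decimal-place resolution asked for.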

  • The two chromosomes 000...0 and 111...1 correspond to
    the endpoints -1.0 and 2.0, respectively. All
    others correspond to interior points. The
    evaluation function simply takes a binary
    chromosome, say v, transforms it into a decimal
    number, say dec(v), and evaluates f at that
    number: eval(v) = f(dec(v)).
  • Initialization. Create a random initial
    population of chromosomes.
  • Ranking. Rank the chromosomes according to the
    evaluation function.
  • Reproduction. Normalize the rankings so that each
    rank corresponds to an appropriate subinterval of
    $[0, 1]$. Run the random number generator twice,
    using the subintervals to determine probability
    of choice, to obtain two parents.

  • When the parents are found, we create the
    offspring. Two operators are used: crossover and
    mutation. Select, randomly, the gene after which
    the crossover takes place. The first parent
    contributes the part of its chromosome ending at
    that gene (inclusive); the second parent
    contributes the final part of its chromosome to
    the offspring. If mutation is to be included,
    one must successively use the random number
    generator to determine if each gene of the
    offspring is to be changed. Repeat the process
    until a number of offspring equal to the desired
    population is obtained.
  • Next Generation. We now have a new generation,
    for the process to be repeated.
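
The whole loop (initialization, ranking, rank-based choice of parents, one-point crossover, per-bit mutation, next generation) might look like the sketch below. It is not the program that produced the results in these notes: the linear rank weights, the default parameter values, the seed and the elitism step (copying the best individual unchanged, the variant mentioned earlier) are assumptions made for the illustration.

```python
import math
import random

N_BITS = 22
LO, HI = -1.0, 2.0

def f(x):
    return x * math.sin(10 * math.pi * x) + 1.0

def decode(bits):
    """List of 0/1 genes (most significant first) -> real number in [LO, HI]."""
    value = int("".join(map(str, bits)), 2)
    return LO + value * (HI - LO) / (2 ** N_BITS - 1)

def evaluate(bits):
    return f(decode(bits))

def next_generation(population, p_mutation, rng):
    ranked = sorted(population, key=evaluate)        # worst first
    weights = list(range(1, len(ranked) + 1))        # better rank, bigger share of [0, 1]
    offspring = [ranked[-1][:]]                      # elitism (assumed variant)
    while len(offspring) < len(population):
        mother, father = rng.choices(ranked, weights=weights, k=2)
        cut = rng.randint(1, N_BITS - 1)             # crossover after gene cut - 1
        child = mother[:cut] + father[cut:]
        child = [b ^ 1 if rng.random() < p_mutation else b for b in child]
        offspring.append(child)
    return offspring

def run_ga(pop_size=50, generations=150, p_mutation=0.01, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(N_BITS)]
                  for _ in range(pop_size)]
    history = []                                     # best fitness per generation
    for _ in range(generations):
        population = next_generation(population, p_mutation, rng)
        history.append(max(evaluate(c) for c in population))
    best = max(population, key=evaluate)
    return decode(best), evaluate(best), history

best_x, best_f, history = run_ga(pop_size=20, generations=30, seed=1)
```

Because the best chromosome is always copied forward, the recorded best fitness can never decrease from one generation to the next; without that step it can.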

  • Several issues must be resolved:
  • the size of the initial population (50, in this
    case)
  • the number of generations the process is allowed
    to continue (150, in this case)
  • the probability of crossover (the probability
    that a chromosome undergoes crossover, $p_c$)
  • the probability of mutation (the probability
    that a gene is flipped, $p_m = 0.01$).
  • None of them can be optimally determined
  • A run provides the following results

  • The best individuals discovered within 150
    generations are given in the table below.

  • Some Examples 2 - The Prisoner's Dilemma. This
    is discussed in Michalewicz's book, in
    Mitchell's, and, quite extensively, in D. B.
    Fogel's.
  • The Problem: two individuals are held prisoner
    and are under pressure to confess to some
    undesirable activity implicating the other. The
    options (and rewards/punishments) are summarized
    in the table below.

  • The aim of the game is to find a strategy that,
    over the long run, will maximize one's gains. At
    any one point, the maximizing strategy would have
    the winner defect and the loser hold out, so
    there is always a temptation to defect. In fact,
    defection is always the safest individual
    choice at each point. On the other hand, a
    sequence of mutual defections has a combined
    payoff much smaller than that of a sequence of
    mutual cooperations. We can compute:
  • 1. An infinite sequence of random choices (each
    configuration has probability 0.25) has an
    expected return (for each of the players) of
    2.25, with an expected cumulative return of 4.5.
  • 2. An infinite sequence of defections, with the
    other choosing randomly, has an expected return
    (for the defector) of 3 and of 0.5 (for the
    other), so the expected cumulative return is only
    3.5.

  • 3. Exercise: compute the expected returns, individual
    and cumulative, for at least two other sequences
    of actions.
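
The expected returns quoted above can be checked mechanically. The sketch below assumes the payoff values commonly used in Axelrod's tournaments (5 for a successful defection, 3 for mutual cooperation, 1 for mutual defection, 0 for being defected against); they are an assumption here, chosen because they reproduce the figures 2.25, 3 and 0.5 given in the text. The function name is hypothetical.

```python
from itertools import product

# (my move, other's move) -> my payoff. Assumed values, see the note above.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def expected_return(p_me, p_other):
    """Expected per-round payoff to 'me' when the two players
    cooperate independently with probabilities p_me and p_other."""
    total = 0.0
    for me, other in product("CD", repeat=2):
        prob_me = p_me if me == "C" else 1.0 - p_me
        prob_other = p_other if other == "C" else 1.0 - p_other
        total += prob_me * prob_other * PAYOFF[(me, other)]
    return total

both_random = expected_return(0.5, 0.5)        # each player, case 1
defector = expected_return(0.0, 0.5)           # always defect vs. random
random_victim = expected_return(0.5, 0.0)      # random vs. always defect
```

Other pairings for the exercise can be explored by changing the two probabilities, for example `expected_return(1.0, 1.0)` for mutual cooperation.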
  • Representing a solution. Ideally, we should have
    a complete memory of the past to determine the
    next decision. Since this is not possible, a
    decision has to be made on the basis of a finite
    memory. Michalewicz's text uses the previous
    three moves. In fact (Mitchell's and Fogel's
    books), tournaments (human and computer) were
    organized by R. Axelrod in the '70s and '80s
    to try to determine best strategies, and he
    decided to use the three-move memory we will
    use here. If a chromosome contains information
    about the three previous moves, and since there
    are 4 possible outcomes for each move, we have to
    keep track of 64 (= 4 x 4 x 4) possible games - 64
    different histories.

  • If we order the histories in some canonical
    order, a 64-bit array allows us to associate a
    response to each history. We must also prime the
    system with three initial games, with two bits
    required for each (index into histories). Total:
    70 bits for a chromosome.
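
A possible layout for such a chromosome is sketched below. The canonical order of outcomes, the choice of 0 for cooperate and 1 for defect, and the placement of the six priming bits at the front are all assumptions; any fixed convention would do.

```python
# Outcome of one round as (own move, opponent's move); order is an assumption.
OUTCOMES = ["CC", "CD", "DC", "DD"]

def history_index(last_three):
    """Base-4 index (0..63) of three outcomes, oldest first."""
    index = 0
    for outcome in last_three:
        index = 4 * index + OUTCOMES.index(outcome)
    return index

def initial_history(chromosome):
    """Bits 0..5: three assumed 'previous' rounds, two bits each."""
    prime = chromosome[:6]
    return [OUTCOMES[2 * prime[2 * k] + prime[2 * k + 1]] for k in range(3)]

def next_move(chromosome, last_three):
    """Bits 6..69: response to each of the 64 histories (0 = C, 1 = D)."""
    table = chromosome[6:]
    return "D" if table[history_index(last_three)] else "C"

# Hypothetical example: defect exactly when the opponent defected last round,
# starting from an assumed history of three mutual cooperations.
tit_for_tat = [0] * 6 + [1 if i % 4 in (1, 3) else 0 for i in range(64)]
move = next_move(tit_for_tat, initial_history(tit_for_tat))
```

The example chromosome ignores all but the most recent round, which shows that familiar short-memory strategies are special cases of the 64-entry table.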
  • Choose an initial population. Create a number, N,
    of random 70-bit strings.
  • Test each player to determine effectiveness. Use
    the strategy encoded in the chromosome to play
    games against all other players. The fitness
    is the average score over all the games played.
    The original study by Axelrod had each member of
    the population play against a set of given
    strategies - culled from one of the human and
    computer tournaments. The initial game sequence
    provides an index into the bit-string, so one
    could use the first six bits to determine the
    initial strategy for each player (player no. 1
    uses its initial conditions to determine the game
    sequence to be continued by both).
  • Select the players to breed. A player with an
    average score (within a standard deviation of
    average) is given one mating; a player with a
    score one or more standard deviations above
    average is given two matings; one with a score
    one or more standard deviations below average is
    given no matings (some minor adjustment may be
    necessary to keep the population of constant
    size).
  • Breed. Randomly pick pairs, and create two
    descendants per mating using both crossover and
    mutation.
  • Results. Experiments lead to a number of
    strategies being discovered.
  • Continue cooperation after three initial
    cooperations: (CC)(CC)(CC) leads to C.
  • Defect when the other player defects:
    (CC)(CC)(CD) leads to D.
  • Continue to cooperate after cooperation has been
    restored: (CD)(CD)(CC) leads to C.
  • Cooperate when cooperation has been restored
    after your own defection: (DC)(CC)(CC) leads to
    C.
  • Defect after three mutual defections:
    (DD)(DD)(DD) leads to D.
  • If the payoff for successful defection is
    increased to 6, strategies can develop with
    expected payoffs > 3.0 (Fogel).

  • The evolved strategies can be represented as
    finite-state machines, which can also mean that
    the final best strategy can be interpreted as a
    formal program - the best evolved.
  • This was an example of co-evolution, where the
    individuals in a population were pitted against
    one another and caused the population to evolve
    through mutual interactions.
  • For much more detail, see the papers by Axelrod
    or the book by D. B. Fogel Evolutionary

  • Some Examples 3 - The Traveling Salesman
    Problem. This is a well-known NP-Complete
    problem, which has some reasonable
    approximation schemes - at least in the case in
    which the distances between nodes satisfy a
    triangle inequality (see Cormen et al., Ch. 35).
    There can be no expectation of an exact solution
    (find a tour of minimum length) in anything but
    exponential time, but it may be possible to
    beat the quality of the known deterministic
    approximation algorithms without giving up too
    much time.
  • What is a chromosome? A reasonable
    interpretation for a chromosome is that it
    represents a tour. Since the complete graph
    consists of N nodes, is the chromosome a binary
    string (array) or an integer array?

  • This is an important decision because of the
    genetic operators: do we split at a bit or at a
    full node index (integer)?
  • If we use a bit representation, our chromosome
    will need $N \lceil \log_2 N \rceil$ bits. Using crossover
    or mutation at the bit level cannot guarantee
    that the new chromosome represents a new tour,
    or, possibly, even that we have a path in the
    graph (if $2^{\lceil \log_2 N \rceil} > N$).
  • If we use integer representation (N integers),
    and we use our genetic operators at integer
    boundaries, we can at least guarantee that we
    still have a path, although maybe not a tour.
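
A small sketch makes both points concrete: the size of a bit-level chromosome, and the way a plain one-point crossover at integer boundaries keeps every entry a valid node while still failing to produce a tour. The helper names and the two example parents are hypothetical.

```python
def bits_needed(n_cities):
    """N * ceiling(log2 N) bits for a bit-level chromosome."""
    bits_per_city = (n_cities - 1).bit_length()   # ceiling(log2 N) for N >= 1
    return n_cities * bits_per_city

def is_tour(candidate, n_cities):
    """A tour visits every node 0 .. N-1 exactly once."""
    return sorted(candidate) == list(range(n_cities))

parent_a = [0, 1, 2, 3, 4, 5]
parent_b = [3, 5, 0, 4, 2, 1]
cut = 3
child = parent_a[:cut] + parent_b[cut:]   # one-point crossover at an integer boundary
```

Here `child` is `[0, 1, 2, 4, 2, 1]`: every entry is an existing node, but nodes 1 and 2 are visited twice and nodes 3 and 5 are never visited, so some repair is required.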

  • The choice in this case is of an integer
    representation for the chromosome, essentially
    because:
  • it avoids one set of problems - the possible
    introduction of non-existent nodes; and,
  • it permits some simpler cross-over repair
    algorithms to make sure that no stillborn
    descendants are allowed into the population. The
    repair algorithms can be incorporated into the
    genetic operators.
  • Mutation can be handled in a similar way.
  • Initialization. For a population of size M, pick
    M random permutations on N items. Another
    possibility is to use a greedy algorithm to
    construct M approximate solutions, and start from
    there.
  • Evaluation and Ranking. Straightforward for each
    individual - just compute the value of the tour
    it represents.
  • Breeding. Both crossover and mutation must be
    implemented carefully to preserve a tour and to
    maintain a relationship with the parents. We
    will look at the details at some later date.
  • Results. The algorithm, as described, appears to
    be better than random search, but is not very
    efficient: a 100 city tour, after 20,000
    generations, gives a value for the best tour
    found about 9.4% above optimum (Michalewicz, p.
  • Some Examples 4 - Tree-Based Genetic
    Programming. Assume we are given a function,
    either explicitly or, more likely, as a set of
    (input, output) pairs. Assume we have a finite
    set of pairs derived from the function $y = x^2$.
    We want to construct the best interpolating
    function we can, obviously starting only from
    the set of (input, output) pairs.
  • A program can be viewed as a tree structure,
    where the leaves are terminals (= parameter
    values), while the interior nodes are function
    calls whose children are values provided from
    farther down the tree.
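
One lightweight way to hold such a tree is as nested tuples, evaluated recursively from the leaves up. The representation, the choice of four arithmetic primitives and the "protected" division that returns 1.0 near a zero denominator are assumptions made for this sketch.

```python
import operator

def protected_div(a, b):
    """Division that returns 1.0 instead of failing near b = 0 (assumed convention)."""
    return a / b if abs(b) > 1e-9 else 1.0

FUNCTIONS = {"+": operator.add, "-": operator.sub,
             "*": operator.mul, "/": protected_div}

def evaluate(tree, x):
    """Evaluate a program tree at input x.

    A tree is the terminal "x", a numeric constant,
    or a tuple (function symbol, left subtree, right subtree)."""
    if tree == "x":
        return x
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))

square = ("*", "x", "x")                      # the target y = x^2 as a tree
same_function = ("-", ("*", ("+", "x", 1.0), "x"), "x")   # (x + 1) * x - x
```

The second tree computes the same function as the first, which illustrates why an evolved formula may not look like the target even when it matches it.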

  • One would start with an initial population of
    trees, evaluate the individuals on the set of
    (input, output) pairs, assigning each a value
    dependent on how good the match is between the
    output values computed by the program and those
    given. One would then apply some tree-compatible
    genetic operators to generate the new population.
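
The evaluation step and one tree-compatible operator can be sketched as follows, with trees held as nested tuples. Everything here is illustrative: the sum-of-squared-errors score (lower is better), the subtree-swapping crossover and all names are assumptions, not a description of any particular program.

```python
import operator
import random

FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, x):
    """A tree is "x", a numeric constant, or (symbol, left, right)."""
    if tree == "x":
        return x
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))

def fitness(tree, pairs):
    """Sum of squared errors over the (input, output) pairs; 0 is a perfect match."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in pairs)

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        for k in (1, 2):
            yield from subtrees(tree[k], path + (k,))

def replace_at(tree, path, new_subtree):
    """Return a copy of tree with the node at path replaced."""
    if not path:
        return new_subtree
    parts = list(tree)
    parts[path[0]] = replace_at(tree[path[0]], path[1:], new_subtree)
    return tuple(parts)

def subtree_crossover(parent_a, parent_b, rng):
    """Replace a random node of parent_a by a random subtree of parent_b."""
    path, _ = rng.choice(list(subtrees(parent_a)))
    _, donor = rng.choice(list(subtrees(parent_b)))
    return replace_at(parent_a, path, donor)

pairs = [(float(x), float(x * x)) for x in range(-3, 4)]
rng = random.Random(0)
child = subtree_crossover(("+", "x", 1.0), ("*", "x", "x"), rng)
```

Unlike fixed-length strings, the child can be larger or smaller than either parent, so tree size has to be watched during a run.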
  • Another example of next generation

  • Final result: we have a true match, although all
    we really know is that we have achieved an exact
    interpolation of the given (input, output) set.

  • The program simple-gp.c evolves such a solution.
  • It develops a formula (with some understanding of
    the need to take care of zero-divisions) which,
    complex as it appears, can, in fact, be reduced to
    an actual function; unfortunately it does not
    look much like the actual function on a first
    reading. This is not an unusual problem, since
    the rules for canonical representation of
    rational functions are fairly hard to implement;
    see some of the early programs for symbolic
  • This approach raises a number of representational
    questions: how do we represent a program? What is
    a chromosome? What is mutation? etc.
  • Part of the problem is that a constant length
    chromosome might correspond to a rather limited
    family of trees, making the whole evolutionary
    process moot. On the other hand, introducing
    variable length chromosomes with very large
    alphabets (= primitive functions and parameter
    values) may grow our chromosomes to enormous size
    (although our own DNA may be trying to warn us).
    Furthermore, we are looking for some theoretical
    models to at least give plausibility to our
    methods; such theoretical models are likely to be
    far too complicated.

  • Some other ideas on Genetic Programming. Other
    questions would arise on the meaning of a
    program: as we indicated in the Prisoner's
    Dilemma, one can evolve finite-state machines
    that are quite efficient. Those are, undeniably,
    programs. The next question might be on how to
    represent (and define and apply genetic
    operators) for stack machines (supporting
    recursion), assembly-language machines,
    graph-reduction machines (which are used in the
    compilation and optimization of functional
    language programs), etc.
  • And all of this requires some kind of supporting
    theoretical framework.
