Title: Coevolving Solutions to the Shortest Common Superstring Problem
1Coevolving Solutions to the Shortest Common
Superstring Problem
- Assaf Zaritsky and Moshe Sipper
- Ben-Gurion University, Israel
- www.cs.bgu.ac.il/assafza
2Outline
- The Shortest Common Superstring problem.
- DNA sequencing and the input domain.
- Standard and cooperative coevolutionary genetic algorithm (GA), with experimental results.
- The Puzzle approach, with experimental results.
- The Co-Puzzle algorithm, with experimental results.
- Conclusions and future work.
3The Shortest Common Superstring Problem (SCS)
- Let S = {s1, ..., sn} be a set of strings (blocks) over some alphabet Σ. A superstring of S is a string x such that each si in S is a substring of x.
- Problem: find the shortest (common) superstring.
- NP-hard (the decision version is NP-complete).
- MAX-SNP hard.
- Motivation: DNA sequencing, data compression.
4SCS Example
- S = {ate, half, lethal, alpha, alfalfa}
- A trivial superstring is atehalflethalalphaalfalfa, of length 25 (a simple concatenation of all blocks).
- A shortest common superstring is lethalphalfalfate, of length 17.
- Note that a compressed permutation of the blocks is actually a superstring.
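The example above can be checked with a few lines of Python. This is a minimal sketch, assuming that "compressing" a permutation means appending each block to the text built so far while reusing the longest suffix/prefix overlap; the helper names `overlap` and `compress` are illustrative and do not come from the slides.

```python
from typing import List


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(blocks: List[str]) -> str:
    """Append the blocks in order, overlapping each one with the text so far."""
    text = ""
    for block in blocks:
        text += block[overlap(text, block):]
    return text


S = ["ate", "half", "lethal", "alpha", "alfalfa"]
trivial = "".join(S)  # plain concatenation, no compression
best = compress(["lethal", "alpha", "half", "alfalfa", "ate"])
print(len(trivial), best, len(best))  # 25 lethalphalfalfate 17
```

The permutation passed to `compress` is the one that yields the 17-character superstring from the slide; other permutations generally give longer results.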
5Approximation Algorithms
- Several linear approximation algorithms for SCS have been proposed, most of which rely on greedy approaches.
- GREEDY:
- The most widely used heuristic in DNA sequencing.
- Conjecture [Blum 1994, Sweedyk 1999]: the superstring produced by GREEDY is of length at most two times the optimal.
- We are not aware of any previous evolutionary approach to the SCS problem.
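For reference, the GREEDY heuristic is usually described as: repeatedly merge the two strings with the largest overlap until one string remains. The sketch below follows that textbook description; the tie-breaking rule (first pair found in sorted order) and the function names are assumptions, not details taken from the slides.

```python
from typing import List


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(blocks: List[str]) -> str:
    # Duplicates and blocks contained in other blocks add nothing, so drop them.
    unique = sorted(set(blocks))
    strings = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(strings) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (best_i, best_j)]
        strings.append(merged)
    return strings[0] if strings else ""


result = greedy_superstring(["ate", "half", "lethal", "alpha", "alfalfa"])
print(result, len(result))
```

This version recomputes all pairwise overlaps in every round, which is quadratic per merge; it is meant to show the idea, not to be efficient.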
7DNA Sequencing
The most common usage of the SCS problem.
8DNA Sequencing (contd)
- The problem: read a string of DNA.
- Only short DNA strands can be read in the laboratory.
- To sequence a long DNA strand:
- (The DNA sequence appears in many copies.)
- Cut the DNA into short fragments using restriction enzymes.
- Sequence each of the resulting fragments.
- Order those fragments using an SCS algorithm.
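The cutting step can be imitated in software to produce test inputs. The following is only an illustrative sketch: the function name, the fragment-length range, the number of copies, and the uniform random cut points are all hypothetical choices, not the input-generation procedure used in the experiments.

```python
import random
from typing import List


def cut_into_fragments(dna: str, copies: int, min_len: int, max_len: int,
                       rng: random.Random) -> List[str]:
    """Cut each copy of `dna` into consecutive fragments of random length.

    The last fragment of a copy may be shorter than `min_len`.
    """
    fragments = []
    for _ in range(copies):
        start = 0
        while start < len(dna):
            end = min(len(dna), start + rng.randint(min_len, max_len))
            fragments.append(dna[start:end])
            start = end
    return fragments


rng = random.Random(0)  # fixed seed so the run is repeatable
dna = "".join(rng.choice("ACGT") for _ in range(60))
blocks = cut_into_fragments(dna, copies=3, min_len=8, max_len=12, rng=rng)
print(len(blocks), blocks[:3])
```

Because each copy is cut at different positions, fragments from different copies overlap, and those overlaps are what an SCS algorithm exploits when ordering the fragments.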
9The Input Domain
The input strings used in the experiments were
inspired by DNA sequencing
10Input Generation Setup Parameters
NB: increasing the number of blocks results in exponential growth of the problem's complexity.
12Simple Genetic Algorithm
produce an initial population of individuals
evaluate the fitness of all individuals
while termination condition not met do
    select fitter individuals for reproduction
    recombine individuals
    mutate individuals
    evaluate the fitness of the modified individuals
    generate a new population
end while
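The loop above can be written as a small generic function. This is a sketch under stated assumptions: fitness is non-negative and minimized, selection is fitness-proportionate on the inverted score 1 / (1 + fitness), and termination is a fixed number of generations. The name `evolve` and its parameters are illustrative.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def evolve(population: List[T],
           fitness: Callable[[T], float],
           recombine: Callable[[T, T, random.Random], T],
           mutate: Callable[[T, random.Random], T],
           generations: int,
           rng: random.Random) -> T:
    """Run a simple generational GA and return the best individual seen."""
    best = min(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        # Assumption: lower fitness is better, so invert it to get weights.
        weights = [1.0 / (1.0 + s) for s in scores]
        parents = rng.choices(population, weights=weights, k=2 * len(population))
        # Recombine consecutive parent pairs, then mutate each child.
        population = [
            mutate(recombine(parents[2 * i], parents[2 * i + 1], rng), rng)
            for i in range(len(population))
        ]
        best = min([best] + population, key=fitness)
    return best
```

The problem-specific parts (representation, fitness, recombination, mutation) are passed in as functions, so the same loop can serve any of the GA variants discussed in the talk.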
13EA Success Stories
http://evonet.lri.fr/evoweb/resources/evolution_work/all.php
http://www.genetic-programming.com/humancompetitive.html
16Simple GA for the SCS Problem
- Given a set of strings as input, generate an initial population of random candidate solutions.
- The fitness of each individual depends on its length and accuracy.
- The GA uses selection, recombination, and mutation to create the next generation, each individual of which is then evaluated.
- These steps are repeated a predefined number of times or until the solution is deemed satisfactory.
17Simple GA for the SCS Problem (contd)
- Blocks of the input set are atomic components.
- Representation: an individual's genome is represented as a sequence of blocks.
- An individual may have missing blocks or contain duplicate copies of the same block.
- Permutation representation: good or bad?
18Simple GA for the SCS Problem (contd)
- Evaluation: the fitness of an individual is the length of its compressed genome plus the total length of the blocks that are not covered by the individual (lower is better).
- Genetic operators:
- Fitness-proportionate selection.
- Two-point recombination. Allows growth and reduction in the genome's length.
- Block-change mutation.
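The evaluation rule can be sketched in Python as follows. The sketch assumes that compressing a genome means overlapping each block with the text built so far, and that a block counts as covered when it is a substring of the compressed genome; the function names are illustrative. The two calls at the end use the block set from the worked example (s1 = 0011, s2 = 1100, s3 = 1001, s4 = 111) and reproduce its fitness values of 9 and 8.

```python
from typing import List


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome: List[str]) -> str:
    """Append the blocks in order, overlapping each one with the text so far."""
    text = ""
    for block in genome:
        text += block[overlap(text, block):]
    return text


def fitness(genome: List[str], blocks: List[str]) -> int:
    """Compressed length plus the total length of uncovered blocks."""
    text = compress(genome)
    uncovered = sum(len(b) for b in blocks if b not in text)
    return len(text) + uncovered


blocks = ["0011", "1100", "1001", "111"]
s1, s2, s3, s4 = blocks
print(fitness([s2, s1], blocks))          # |110011| + |111| = 6 + 3 = 9
print(fitness([s4, s2, s1, s4], blocks))  # |11100111| = 8
```

Note that s3 = 1001 is not in the first genome but still counts as covered, because it occurs inside the compressed string 110011.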
19Simple GA for the SCS Problem (example)
- S = {s1, s2, s3, s4}; s1 = 0011, s2 = 1100, s3 = 1001, s4 = 111.
- Fitness(<s2,s1>) = |110011| + |111| = 6 + 3 = 9.
- Fitness(<s4,s2,s1,s4>) = |11100111| = 8.
- Recombination:
- p1 = <s1,s2,s3,s4>
- p2 = <s4,s1,s3,s2>
- p3 = recombine1(p1,p2) = <s1,s1,s3,s2,s4>
- p4 = recombine2(p1,p2) = <s4,s2,s3>
- mutate(<s1,s2,s2>) = <s1,s4,s2>
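The two operators in this example can be sketched as below. The sketch assumes that two-point recombination picks a segment independently in each parent and swaps the two segments, which is what lets children grow or shrink. The cut points (1, 3) in p1 and (1, 4) in p2 are one choice that reproduces the children p3 and p4 from the slide; they are inferred, not stated in the source, and all function names are illustrative.

```python
import random
from typing import List, Tuple

Genome = List[str]


def two_point_recombine(p1: Genome, p2: Genome,
                        cut1: Tuple[int, int],
                        cut2: Tuple[int, int]) -> Tuple[Genome, Genome]:
    """Swap the segment p1[a1:b1] with p2[a2:b2]; lengths may change."""
    (a1, b1), (a2, b2) = cut1, cut2
    child1 = p1[:a1] + p2[a2:b2] + p1[b1:]
    child2 = p2[:a2] + p1[a1:b1] + p2[b2:]
    return child1, child2


def random_cut(genome: Genome, rng: random.Random) -> Tuple[int, int]:
    """Pick two cut points a <= b anywhere in the genome."""
    a, b = sorted(rng.randint(0, len(genome)) for _ in range(2))
    return a, b


def block_change_mutation(genome: Genome, blocks: List[str],
                          rng: random.Random) -> Genome:
    """Replace one randomly chosen position with a random input block."""
    mutant = list(genome)
    if mutant:
        mutant[rng.randrange(len(mutant))] = rng.choice(blocks)
    return mutant


p1 = ["s1", "s2", "s3", "s4"]
p2 = ["s4", "s1", "s3", "s2"]
p3, p4 = two_point_recombine(p1, p2, (1, 3), (1, 4))
print(p3)  # ['s1', 's1', 's3', 's2', 's4']
print(p4)  # ['s4', 's2', 's3']
```

In a full GA the cut points would come from `random_cut` rather than being fixed by hand.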
20Coevolution
- Simultaneous evolution of two or more species with coupled fitness.
- Coevolving species either compete or cooperate.
- Competitive coevolution: the fitness of an individual is based on direct competition with individuals of other species, which in turn evolve separately in their own populations (predator-prey).
21Cooperative Coevolution
22Cooperative Coevolution (contd)
- Cooperative coevolution involves a number of independently evolving species.
- Interaction between species occurs via the fitness function only.
- The fitness of an individual depends on its ability to collaborate with individuals from other species.
23Cooperative Coevolution (contd)
Source: Potter & De Jong (1997)
24Cooperative Coevolutionary Algorithm for the SCS
Problem
- Two species evolve simultaneously.
- The first species contains prefixes of candidate solutions to the SCS problem at hand.
- The second species contains candidate suffixes.
- The fitness of an individual in each species depends on how well it interacts with representatives from the other species to construct a global solution.
25Cooperative Coevolutionary Algorithm for the SCS
Problem (evaluation process)
Merge
26Cooperative Coevolutionary Algorithm for the SCS
Problem (evaluation process)
Evaluate
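The merge-then-evaluate process can be sketched as follows. It is an illustration under assumptions: the representative of the other species is simply passed in (how it is chosen, for example the best individual of the previous generation, is not stated here), and the merged genome is scored with the simple-GA fitness (compressed length plus uncovered block lengths). All function names are illustrative.

```python
from typing import List


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def fitness(genome: List[str], blocks: List[str]) -> int:
    """Compressed length plus the total length of uncovered blocks."""
    text = ""
    for block in genome:
        text += block[overlap(text, block):]
    return len(text) + sum(len(b) for b in blocks if b not in text)


def evaluate_with_partner(individual: List[str], representative: List[str],
                          blocks: List[str], individual_is_prefix: bool) -> int:
    """Merge an individual with the other species' representative, then evaluate."""
    if individual_is_prefix:
        merged = individual + representative
    else:
        merged = representative + individual
    return fitness(merged, blocks)


blocks = ["0011", "1100", "1001", "111"]
s1, s2, s3, s4 = blocks
prefix, suffix = [s4, s2], [s1, s4]
print(evaluate_with_partner(prefix, suffix, blocks, individual_is_prefix=True))   # 8
print(evaluate_with_partner(suffix, prefix, blocks, individual_is_prefix=False))  # 8
```

Both calls build the same global solution <s4,s2,s1,s4>, so the prefix and the suffix receive the same score from this collaboration.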
27Experiments
Compare: GREEDY, standard GA, cooperative coevolution.
28Experimental Setup
Each type of GA was executed twice on each problem instance; the better run of the two was used for statistical purposes.
29Results Experiment I (50 blocks)
30Results Experiment II (80 blocks)
31Results Summary
[Table: average of the best superstring lengths, by algorithm (GREEDY, Genetic, Cooperative) and problem size (50 blocks, 80 blocks); values not preserved.]
32Conclusion
The collaboration between the two populations results in a good decomposition of the problem into two smaller sub-problems, each of which is solved using a standard GA.
34The Puzzle Algorithm
35The Schema Theorem
Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm. (Holland, 1975)
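In its usual textbook form, for a GA with fitness-proportionate selection, one-point crossover, and bitwise mutation, the theorem is stated as the inequality below. The notation is the standard one and is an addition here, not taken from the slides: m(H,t) is the number of instances of schema H at generation t, f(H,t) is their average fitness, \bar{f}(t) is the population's average fitness, \delta(H) is the defining length of H, o(H) is its order, l is the string length, and p_c and p_m are the crossover and mutation probabilities.

```latex
E\big[m(H,\,t+1)\big] \;\ge\; m(H,t)\,\frac{f(H,t)}{\bar{f}(t)}
\left[\,1 \;-\; p_c\,\frac{\delta(H)}{l-1} \;-\; o(H)\,p_m\right]
```

The bracketed factor is where "short" (small \delta(H)) and "low-order" (small o(H)) enter: such schemata are less likely to be disrupted by crossover and mutation.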
36Building Blocks Hypothesis
A genetic algorithm seeks near-optimal
performance through the juxtaposition of short,
low-order, high-performance schemata, called the
building blocks.
37Our Interpretation
The success of GAs stems from their ability to
combine quality sub-solutions (building blocks)
from separate individuals in order to form better
global solutions.
38The Main Assumption
Problems in nature have an inherent structural design. Even when the structure is not known explicitly, GAs detect it implicitly and gradually enhance good building blocks.
39A Problem
Recombination may destroy quality building blocks
found by the GA.
40The Preservation of Favoured Building Blocks in the Struggle for Fitness: The Puzzle Algorithm
41Puzzle Algorithm: The Idea
- Improve the recombination operator.
- Preserve good building blocks discovered by the GA by selecting recombination loci that do not destroy good building blocks.
- Result: assembly of good building blocks to construct better solutions (as in a puzzle).
42Puzzle Algorithm (contd)
- Two populations:
- 1. Candidate solutions: as in the simple GA.
- 2. Building blocks: each individual is a sequence of blocks contained in at least one candidate solution.
43Puzzle Algorithm (contd)
- Interaction between candidate solutions and building blocks is through the fitness function.
- Interaction between building blocks and candidate solutions is through constraints on recombination points.
[Diagram labels: fitness evaluation; crossover location.]
44Puzzle Algorithm Zoom In
46The Candidate Solutions Population
- Representation, fitness evaluation, selection, and mutation are identical to the simple GA.
- A recombination-aid vector aids in selecting the recombination loci.
- The recombination-aid vector is updated by building-block individuals.
47The Building Blocks Population
- An individual is represented as a sequence of blocks, contained in at least one candidate solution.
- The fitness of an individual is the average of the fitness of the candidate solutions containing it.
- Fitness-proportionate selection.
48The Building Blocks Population (cont)
- Unisex individuals.
- Two modification operators:
- Expansion: increase the genome by one block. Occurs with high probability.
- Exploration: die, and start over as a new 2-block individual. Occurs with low probability.
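The building-block population described above can be sketched as follows. Several details are assumptions made for illustration: a building block "is contained in" a candidate solution when it appears as a contiguous run of blocks in that genome, expansion adds the neighbouring block on either side of an occurrence, and the exploration probability `p_explore = 0.1` is a hypothetical value. None of the function names come from the slides.

```python
import random
from typing import List, Optional

Genome = List[str]


def contains(genome: Genome, bb: Genome) -> bool:
    """True if `bb` occurs as a contiguous run of blocks in `genome`."""
    n = len(bb)
    return any(genome[i:i + n] == bb for i in range(len(genome) - n + 1))


def building_block_fitness(bb: Genome, candidates: List[Genome],
                           candidate_fitness: List[float]) -> Optional[float]:
    """Average fitness of the candidate solutions that contain `bb`."""
    scores = [f for g, f in zip(candidates, candidate_fitness) if contains(g, bb)]
    return sum(scores) / len(scores) if scores else None


def new_two_block_individual(candidates: List[Genome], rng: random.Random) -> Genome:
    """Exploration: restart as a random 2-block window of some candidate.

    Assumes at least one candidate has two or more blocks.
    """
    host = rng.choice([g for g in candidates if len(g) >= 2])
    i = rng.randrange(len(host) - 1)
    return host[i:i + 2]


def expand(bb: Genome, candidates: List[Genome], rng: random.Random) -> Genome:
    """Expansion: grow by one block while staying inside some candidate."""
    n, options = len(bb), []
    for g in candidates:
        for i in range(len(g) - n + 1):
            if g[i:i + n] == bb:
                if i > 0:
                    options.append([g[i - 1]] + bb)
                if i + n < len(g):
                    options.append(bb + [g[i + n]])
    return rng.choice(options) if options else bb


def modify(bb: Genome, candidates: List[Genome], rng: random.Random,
           p_explore: float = 0.1) -> Genome:
    """Apply exploration with low probability, expansion otherwise."""
    if rng.random() < p_explore:
        return new_two_block_individual(candidates, rng)
    return expand(bb, candidates, rng)
```

Because there is no recombination between building blocks ("unisex individuals"), `modify` is the only way an individual in this population changes.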
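49Building Blocks and Candidate Solutions
[Diagram: fitness evaluation, with fitness values f1, f2, f3, f4.]
50Building Blocks and Candidate Solutions
[Diagram: fitness evaluation, with fitness values f1, f2, f3, f4, followed by an update of the recombination-aid vector.]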
51Update Recombination-aid vector
54Recombination-loci selection
Ties are broken arbitrarily
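The diagrams for the update and selection steps are not available here, so the following is only one plausible reading, with every detail an assumption: the recombination-aid vector has one entry per boundary between adjacent blocks of a candidate solution, each building block occurring in the candidate writes its fitness (lower is better) onto the boundaries it spans, and recombination picks the least-protected boundary, breaking ties at random. The function names are hypothetical.

```python
import math
import random
from typing import List

Genome = List[str]


def update_aid_vector(genome: Genome, building_blocks: List[Genome],
                      bb_fitness: List[float]) -> List[float]:
    """aid[i] guards the boundary between genome[i] and genome[i + 1].

    Assumption: an entry holds the best (lowest) fitness of any building
    block spanning that boundary, or infinity if no block spans it.
    """
    aid = [math.inf] * max(len(genome) - 1, 0)
    for bb, fit in zip(building_blocks, bb_fitness):
        n = len(bb)
        for start in range(len(genome) - n + 1):
            if genome[start:start + n] == bb:
                for locus in range(start, start + n - 1):
                    aid[locus] = min(aid[locus], fit)
    return aid


def select_locus(aid: List[float], rng: random.Random) -> int:
    """Pick the least-protected boundary; ties are broken at random.

    Assumes the genome has at least two blocks, so `aid` is non-empty.
    """
    worst = max(aid)
    return rng.choice([i for i, v in enumerate(aid) if v == worst])


genome = ["s1", "s2", "s3", "s4"]
aid = update_aid_vector(genome, [["s1", "s2"], ["s2", "s3"]], [5.0, 7.0])
print(aid)                                   # [5.0, 7.0, inf]
print(select_locus(aid, random.Random(0)))   # 2
```

In this example the boundary between s3 and s4 is spanned by no building block, so cutting there destroys nothing that the building-block population currently values.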
55Experiments
Compare GREEDY, Standard GA, Puzzle
56Building Blocks - Experimental Setup
57Results Experiment III (50 blocks)
58Results Experiment IV (80 blocks)
Did we lose to cooperative?
NO!
59Results Summary
[Table: average of the best superstring lengths, by algorithm (GREEDY, Genetic, Puzzle) and problem size (50 blocks, 80 blocks); values not preserved.]
61Relations Between The Algorithms
Co-Puzzle
GA
62The Co-Puzzle Algorithm
[Diagram: four populations (candidate prefixes, candidate suffixes, and a possible-building-blocks population for each), linked by fitness evaluation and crossover location arrows.]
63Experiments
Compare GREEDY, Cooperative Coevolution,
Co-Puzzle
64Results Experiment V (80 blocks)
65Results Experiment VI (50 blocks)
????
66Results Summary
[Table: size of shortest common superstring, by algorithm (GREEDY, Cooperative, Co-puzzle) and problem size (50 blocks, 80 blocks); values not preserved.]
67Proposal The Messy Puzzle Algorithm
68The Messy Puzzle Algorithm
- The linkage problem.
- Messy genes:
- Variable-length genome.
- A gene is an ordered pair <allele-locus, allele-value>.
- Handling over- and under-specification.
- Example
69The Messy Puzzle Algorithm (cont)
A building block's genome is represented as a sequence of messy genes.
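As a reminder of how messy genes are usually decoded, here is a minimal sketch. It uses the conventions common in the messy-GA literature, which may differ from what the proposed Messy Puzzle algorithm would use: over-specification is resolved by letting the first occurrence of a locus win, and under-specification is resolved by filling missing loci from a template. The name `express` and the template are illustrative.

```python
from typing import List, Sequence, Tuple

MessyGene = Tuple[int, str]  # (allele locus, allele value)


def express(genome: Sequence[MessyGene], template: Sequence[str]) -> List[str]:
    """Decode a variable-length messy genome into a fixed-length chromosome."""
    chromosome = list(template)  # under-specification: fall back to the template
    seen = set()
    for locus, value in genome:  # over-specification: first occurrence wins
        if locus not in seen and 0 <= locus < len(chromosome):
            chromosome[locus] = value
            seen.add(locus)
    return chromosome


# Locus 2 is specified twice; loci 1 and 3 are not specified at all.
print(express([(2, "1"), (0, "0"), (2, "0")], template="1111"))
# ['0', '1', '1', '1']
```

Since each gene carries its own locus, genes that belong together can sit next to each other in the genome regardless of where they land in the decoded chromosome, which is what makes this representation relevant to the linkage problem.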
70Example The MAXCUT Problem
- The MAXCUT problem.
- The main difficulty: identifying the related vertices.
- Possible solution:
- Use some sort of messy genes to put related genes close together.
- Use the Puzzle approach to keep them together.
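For context, MAXCUT asks for a partition of a graph's vertices into two sides that maximizes the total weight of edges crossing between them. A candidate solution can be scored as in the sketch below; the edge-list representation, the function name, and the small example graph are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int, float]  # (vertex u, vertex v, weight)


def cut_weight(edges: Sequence[Edge], side: Sequence[int]) -> float:
    """Total weight of edges whose endpoints lie on different sides."""
    return sum(w for u, v, w in edges if side[u] != side[v])


edges: List[Edge] = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0), (2, 3, 2.0)]
side = [0, 1, 1, 0]  # side[v] is 0 or 1 for vertex v
print(cut_weight(edges, side))  # 1.0 + 1.0 + 2.0 = 4.0
```

With a fixed-position encoding, `side[u]` and `side[v]` of two adjacent vertices may be far apart in the genome; a messy encoding would let such related genes be stored side by side.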
72Results Summary
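[Table: size of shortest common superstring, by algorithm (GREEDY, Cooperative, Co-puzzle, Puzzle) and problem size (50, 80, 90, and 100 blocks); 20 problem instances per experiment; values not preserved. Annotations, in extraction order: 83 better, 50 blocks; 42 better, 80 blocks; 25 better, 90 blocks; 13 better, 100 blocks.]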
73Larger Problems - Using More Species
[Table: size of shortest common superstring, by algorithm (GREEDY, Co-puzzle, 3-Co-puzzle) and problem size (110 blocks, 120 blocks); values not preserved.]
74Conclusions
- Cooperative coevolution might prove deleterious when too many species are used.
- When a suitable number of species is used, cooperative coevolution improves performance by decomposing the problem into several easier subproblems.
75(Conjectured) Scaling Analysis of Cooperative
Coevolution
76Conclusions (cont)
- Evolving a population of building blocks to aid in the selection of recombination loci drastically improves the performance of a standard GA.
- Cooperation between cooperative coevolution and Puzzle ultimately improves global performance.
77Future Work
- The Messy Puzzle Algorithm.
- Scaling analysis of cooperative coevolution.
- Test the (Co-)Puzzle approach on other problem domains.
- A hybrid GA.
- Tackle larger problems.
- Comparison to greedy-stochastically based
local-search algorithms.