Title: Coevolving Solutions to the Shortest Common Superstring Problem
 1Coevolving Solutions to the Shortest Common 
Superstring Problem
- Assaf Zaritsky  Moshe Sipper 
- Ben-Gurion University, Israel 
- www.cs.bgu.ac.il/assafza
2Outline
- The Shortest Common Superstring problem. 
- DNA sequencing and the input domain. 
- Standard and cooperative coevolutionary genetic 
 algorithms (GA): experimental results.
- The Puzzle approach: experimental results. 
- The Co-Puzzle algorithm: experimental results. 
- Conclusions and future work.
3The Shortest Common Superstring Problem (SCS)
- Let S = {s1, ..., sn} be a set of strings (blocks) 
 over some alphabet Σ. A superstring of S is a
 string x such that each si in S is a substring of
 x.
- Problem: find a shortest (common) superstring of S. 
- NP-complete (decision version). 
- MAX-SNP hard. 
- Motivation: DNA sequencing, data compression.
4SCS Example
- S = {ate, half, lethal, alpha, alfalfa} 
- A trivial superstring is 
 atehalflethalalphaalfalfa, of length 25 (a simple
 concatenation of all blocks).
- A shortest common superstring is 
 lethalphalfalfate, of length 17.
- Note that a compressed permutation of the blocks 
 (each block merged with the text before it at
 their maximal overlap) is itself a superstring.
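The compression step in the example can be sketched as follows. This is a minimal illustration, assuming "compressed" means that each block is appended to the text built so far at their longest suffix/prefix overlap; the function names `overlap` and `compress` are ours, not the authors'.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(blocks: list[str]) -> str:
    """Merge blocks left to right, overlapping each with the text so far."""
    text = ""
    for block in blocks:
        # Append only the part of the block not already matched by the overlap.
        text += block[overlap(text, block):]
    return text


order = ["lethal", "alpha", "half", "alfalfa", "ate"]
superstring = compress(order)
print(superstring, len(superstring))
```

Every block of the permutation appears in the output by construction, which is why any compressed permutation is a superstring; only its length depends on the order chosen.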
5Approximation Algorithms
- Several linear (constant-factor) approximation 
 algorithms for SCS have been proposed, most of
 which rely on greedy approaches.
- GREEDY 
-  The most widely used heuristic in DNA 
 sequencing.
- Conjecture (Blum 1994, Sweedyk 1999): the 
 superstring produced by GREEDY is at most twice
 as long as the optimal one.
- We are not aware of any previous evolutionary 
 approach to the SCS problem.
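For reference, the usual textbook form of GREEDY repeatedly merges the two strings with the largest overlap until one string remains. The sketch below follows that description; the tie-breaking rule (first maximum found in sorted order) and the names `overlap` and `greedy_superstring` are our assumptions, and the slides do not say which variant was used in the experiments.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(blocks: list[str]) -> str:
    """Repeatedly merge the ordered pair of strings with maximal overlap."""
    unique = sorted(set(blocks))
    # Blocks contained in another block are covered for free, so drop them.
    strings = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(strings) > 1:
        k, a, b = max(
            ((overlap(a, b), a, b) for a, b in permutations(strings, 2)),
            key=lambda item: item[0],
        )
        strings.remove(a)
        strings.remove(b)
        strings.append(a + b[k:])
    return strings[0] if strings else ""


blocks = ["ate", "half", "lethal", "alpha", "alfalfa"]
result = greedy_superstring(blocks)
print(result, len(result))
```

Different tie-breaking rules can give superstrings of different lengths on the same input, which is one reason GREEDY's output is not always optimal.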
7DNA Sequencing
The most common usage of the SCS problem. 
 8DNA Sequencing (contd)
- The problem: read a string of DNA. 
- Only short DNA strands can be read in the laboratory. 
- To sequence a long DNA strand: 
-  (The DNA sequence appears in many copies.) 
-  Cut the DNA into short fragments using restriction 
 enzymes.
-  Sequence each of the resulting fragments. 
-  Order those fragments using an SCS algorithm. 
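The procedure above can be imitated in software to produce test inputs. The sketch below is only an assumption-laden toy: it cuts several copies of a random DNA string at random positions (real restriction enzymes cut at specific recognition sites), and the names and parameters (`random_dna`, `cut_copy`, `shotgun_fragments`, `min_len`, `max_len`) are hypothetical rather than the setup used in the experiments.

```python
import random


def random_dna(length: int, rng: random.Random) -> str:
    """A random string over the DNA alphabet."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def cut_copy(dna: str, min_len: int, max_len: int, rng: random.Random) -> list[str]:
    """Cut one copy of dna into consecutive fragments of random length.

    The last fragment may be shorter than min_len.
    """
    fragments, pos = [], 0
    while pos < len(dna):
        size = rng.randint(min_len, max_len)
        fragments.append(dna[pos:pos + size])
        pos += size
    return fragments


def shotgun_fragments(dna: str, copies: int, min_len: int, max_len: int,
                      seed: int = 0) -> list[str]:
    """Cut several copies independently and pool the distinct fragments."""
    rng = random.Random(seed)
    blocks = set()
    for _ in range(copies):
        blocks.update(cut_copy(dna, min_len, max_len, rng))
    return sorted(blocks)


dna = random_dna(200, random.Random(1))
blocks = shotgun_fragments(dna, copies=3, min_len=20, max_len=30)
print(len(blocks), "blocks")
```

Because each copy is cut at different places, fragments from different copies overlap, and those overlaps are what an SCS algorithm exploits to recover the order.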
9The Input Domain
The input strings used in the experiments were 
inspired by DNA sequencing 
 10Input Generation Setup Parameters
N.B.: increasing the number of blocks results in 
exponential growth of the problem's complexity. 
12Simple Genetic Algorithm
produce an initial population of individuals
evaluate fitness of all individuals
while termination condition not met do
  select fitter individuals for reproduction
  recombine individuals
  mutate individuals
  evaluate fitness of modified individuals
  generate a new population
end while
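The loop above can be written as a small generic routine. This is a sketch under our own assumptions: fitness is maximized and strictly positive (so it can be used directly as a selection weight), the whole population is replaced each generation, and the toy problem (maximizing the number of 1 bits) is ours and has nothing to do with SCS.

```python
import random


def run_ga(init, fitness, recombine, mutate, pop_size=30, generations=50, seed=0):
    """Generic GA loop with fitness-proportionate selection (maximization)."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        # Fitter individuals are drawn more often as parents.
        parents = rng.choices(population, weights=scores, k=pop_size)
        next_gen = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[(i + 1) % pop_size]
            child1, child2 = recombine(a, b, rng)
            next_gen += [mutate(child1, rng), mutate(child2, rng)]
        population = next_gen[:pop_size]
    return max(population, key=fitness)


# Toy problem (an assumption for illustration): maximize the number of 1 bits.
def init(rng):
    return [rng.randint(0, 1) for _ in range(20)]


def fitness(bits):
    return sum(bits) + 1  # +1 keeps every selection weight positive


def recombine(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(bits, rng):
    return [bit ^ 1 if rng.random() < 1 / len(bits) else bit for bit in bits]


best = run_ga(init, fitness, recombine, mutate)
print(sum(best), "ones out of", len(best))
```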
 13EA Success Stories
 http://evonet.lri.fr/evoweb/resources/evolution_work/all.php
http://www.genetic-programming.com/humancompetitive.html 
 16Simple GA for the SCS Problem
- Given a set of strings as input, generate an 
 initial population of random candidate solutions.
- The fitness of each individual depends on its 
 length and accuracy.
- The GA uses selection, recombination, and 
 mutation to create the next generation, each
 individual of which is then evaluated.
- These steps are repeated a predefined number of 
 times or until the solution is deemed
 satisfactory.
17Simple GA for the SCS Problem (contd)
- Blocks of the input set are atomic components. 
- Representation: an individual's genome is 
 represented as a sequence of blocks.
-  An individual may have missing blocks or contain 
 duplicate copies of the same block.
- Permutation representation: good or bad? 
18Simple GA for the SCS Problem (contd)
- Evaluation: the fitness of an individual is the 
 length of its compressed genome plus the total
 length of the blocks that are not covered by the
 individual (lower is better).
- Genetic operators: 
-  Fitness-proportionate selection. 
-  Two-point recombination. Allows growth and 
 reduction in genome length.
-  Block-change mutation.
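The evaluation rule can be sketched as below. We assume that a block counts as "covered" when it occurs as a substring of the compressed genome, and that compression merges consecutive blocks at their maximal overlap; the function names are ours.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def compress(genome: list[str]) -> str:
    """Merge the genome's blocks left to right at their maximal overlaps."""
    text = ""
    for block in genome:
        text += block[overlap(text, block):]
    return text


def fitness(genome: list[str], blocks: list[str]) -> int:
    """Compressed length plus total length of uncovered blocks (lower is better)."""
    text = compress(genome)
    penalty = sum(len(b) for b in blocks if b not in text)
    return len(text) + penalty


blocks = ["0011", "1100", "1001", "111"]
print(fitness(["1100", "0011"], blocks))                 # genome <s2, s1>
print(fitness(["111", "1100", "0011", "111"], blocks))   # genome <s4, s2, s1, s4>
```

The penalty term lets incomplete genomes compete with complete ones while still steering the search towards individuals that cover every block.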
19Simple GA for the SCS Problem (example)
- S = {s1, s2, s3, s4}, s1 = 0011, s2 = 1100, 
 s3 = 1001, s4 = 111.
- Fitness(<s2,s1>) = |110011| + |111| = 6 + 3 = 9. 
- Fitness(<s4,s2,s1,s4>) = |11100111| = 8. 
- Recombination: 
-  p1 = <s1,s2,s3,s4> 
-  p2 = <s4,s1,s3,s2> 
-  p3 = recombine1(p1,p2) = <s1,s1,s3,s2,s4> 
-  p4 = recombine2(p1,p2) = <s4,s2,s3> 
- mutate(<s1,s2,s2>) = <s1,s4,s2>
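The recombination example is consistent with a two-point crossover in which each parent has its own pair of cut points, so the exchanged segments may differ in length. The sketch below reproduces the example under that reading; how the authors actually choose the cut points is not stated, and `random_cuts` and `block_change_mutation` are hypothetical helpers.

```python
import random


def two_point_recombine(p1, p2, cuts1, cuts2):
    """Swap the segment p1[a1:b1] with the segment p2[a2:b2]."""
    a1, b1 = cuts1
    a2, b2 = cuts2
    child1 = p1[:a1] + p2[a2:b2] + p1[b1:]
    child2 = p2[:a2] + p1[a1:b1] + p2[b2:]
    return child1, child2


def random_cuts(genome, rng):
    """Two cut points in sorted order, anywhere from 0 to len(genome)."""
    a, b = sorted(rng.randint(0, len(genome)) for _ in range(2))
    return a, b


def block_change_mutation(genome, blocks, rng):
    """Replace one randomly chosen position of a non-empty genome."""
    i = rng.randrange(len(genome))
    return genome[:i] + [rng.choice(blocks)] + genome[i + 1:]


p1 = ["s1", "s2", "s3", "s4"]
p2 = ["s4", "s1", "s3", "s2"]
p3, p4 = two_point_recombine(p1, p2, cuts1=(1, 3), cuts2=(1, 4))
print(p3)  # five blocks: the child grew
print(p4)  # three blocks: the child shrank
```

With cuts (1, 3) in p1 and (1, 4) in p2, the children match p3 and p4 from the slide, which shows how the operator lets genomes grow and shrink.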
20Coevolution
- Simultaneous evolution of two or more species 
 with coupled fitness.
- Coevolving species either compete or cooperate. 
- Competitive coevolution: the fitness of an 
 individual is based on direct competition with
 individuals of other species, which in turn evolve
 separately in their own populations (e.g.,
 predator and prey).
21Cooperative Coevolution 
 22Cooperative Coevolution (contd)
- Cooperative Coevolution involves a number of 
 independently evolving species.
- Interaction between species occurs via the 
 fitness function only.
- The fitness of an individual depends on its 
 ability to collaborate with individuals from
 other species.
23Cooperative Coevolution (contd)
Source: Potter & De Jong (1997) 
 24Cooperative Coevolutionary Algorithm for the SCS 
Problem 
- Two species evolve simultaneously. 
- First species contains prefixes of candidate 
 solutions to the SCS problem at hand.
- Second species contains candidate suffixes. 
- Fitness of an individual in each species depends 
 on how well it interacts with representatives
 from the other species to construct a global
 solution.
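The merge-then-evaluate idea can be sketched as follows. We assume that a prefix is scored by concatenating it with one representative suffix (and vice versa) and applying the simple GA's fitness to the merged genome; how representatives are chosen is not stated in the slides, and `evaluate_species` is a hypothetical name.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def scs_fitness(genome: list[str], blocks: list[str]) -> int:
    """Compressed length plus total length of uncovered blocks (lower is better)."""
    text = ""
    for block in genome:
        text += block[overlap(text, block):]
    return len(text) + sum(len(b) for b in blocks if b not in text)


def evaluate_species(individuals, representative, blocks, is_prefix):
    """Score each individual by merging it with the other species' representative."""
    scores = []
    for individual in individuals:
        if is_prefix:
            merged = individual + representative
        else:
            merged = representative + individual
        scores.append(scs_fitness(merged, blocks))
    return scores


blocks = ["0011", "1100", "1001", "111"]
prefixes = [["111", "1100"], ["1001"]]
suffix_representative = ["0011", "111"]
print(evaluate_species(prefixes, suffix_representative, blocks, is_prefix=True))
```

In this scheme neither species is ever scored on its own: a prefix is only as good as the complete solution it forms with the suffix it is paired with.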
25Cooperative Coevolutionary Algorithm for the SCS 
Problem (evaluation process) 
Merge 
 26Cooperative Coevolutionary Algorithm for the SCS 
Problem (evaluation process) 
Evaluate 
 27Experiments
Compare GREEDY, Standard GA, Cooperative 
Coevolution 
 28Experimental Setup
Each type of GA was executed twice on each 
problem instance; the better of the two runs was 
used for statistical purposes. 
 29Results Experiment I (50 blocks) 
 30Results Experiment II (80 blocks) 
 31Results Summary 
Table: average of the best superstring lengths, by algorithm (GREEDY, Genetic, Cooperative) and problem size (50 blocks, 80 blocks). 
 32Conclusion
The collaboration between the two populations 
results in a good decomposition of the problem 
into two smaller sub-problems, each of which is 
solved using a standard GA. 
34The Puzzle Algorithm 
 35The Schema Theorem
Short, low-order, above-average schemata receive 
exponentially increasing trials in subsequent 
generations of a genetic algorithm (Holland, 
1975). 
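In its usual textbook form, stated for fitness-proportionate selection, one-point crossover and bitwise mutation, the theorem reads as follows. The symbols are the standard ones and do not appear on the slide: m(H, t) is the number of instances of schema H at generation t, f(H, t) their average fitness, \bar{f}(t) the population's average fitness, \delta(H) the defining length of H, o(H) its order, l the string length, and p_c, p_m the crossover and mutation probabilities.

```latex
\[
  \mathbb{E}\bigl[m(H, t+1)\bigr] \;\ge\;
  m(H, t)\,\frac{f(H, t)}{\bar{f}(t)}
  \left[\, 1 - p_c\,\frac{\delta(H)}{l - 1} - o(H)\,p_m \right]
\]
```

The bracketed factor is the probability that H survives crossover and mutation, which is why short (small \delta(H)) and low-order (small o(H)) schemata are the ones that benefit.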
 36Building Blocks Hypothesis
A genetic algorithm seeks near-optimal 
performance through the juxtaposition of short, 
low-order, high-performance schemata, called the 
building blocks. 
 37Our Interpretation
The success of GAs stems from their ability to 
combine quality sub-solutions (building blocks) 
from separate individuals in order to form better 
global solutions. 
 38The Main Assumption
Problems in nature have an inherent structural 
design. Even when the structure is not known 
explicitly, GAs detect it implicitly and gradually 
enhance good building blocks. 
 39A Problem
Recombination may destroy quality building blocks 
found by the GA. 
 40The Preservation of Favoured Building Blocks in 
the Struggle for Fitness: The Puzzle Algorithm 
 41Puzzle Algorithm: The Idea
- Improve the recombination operator. 
- Preserve good building blocks discovered by the 
 GA by selecting recombination loci that do not
 destroy them.
- Result: assembly of good building blocks to 
 construct better solutions (as in a puzzle).
42Puzzle Algorithm (contd)
- Two populations: 
-  1. Candidate solutions: as in the simple GA. 
-  2. Building blocks: each individual is a 
 sequence of blocks contained in at least one
 candidate solution.
43Puzzle Algorithm (contd)
- Candidate solutions influence building blocks 
 through the fitness function.
- Building blocks influence candidate solutions 
 through constraints on recombination points.
Fitness evaluation
Crossover location 
 44Puzzle Algorithm Zoom In 
 45Puzzle Algorithm Zoom In 
 46The Candidate Solutions Population
- Representation, fitness evaluation, selection, 
 and mutation are identical to the simple GA.
- A recombination-aid vector aids in selecting the 
 recombination loci.
- The recombination-aid vector is updated by the 
 building-block individuals.
47The Building Blocks Population
- An individual is represented as a sequence of 
 blocks, contained in at least one candidate
 solution.
- Fitness of an individual is the average of the 
 fitness of candidate solutions containing it.
- Fitness-proportionate selection.
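The fitness rule for building blocks can be sketched as below. We assume that "containing" means the building block occurs as a contiguous run of blocks inside the candidate's genome; the function names are ours, and returning `None` for a building block found in no candidate is our own convention.

```python
def contains(genome, block_seq):
    """True if block_seq occurs as a contiguous run inside genome."""
    n = len(block_seq)
    return any(genome[i:i + n] == block_seq for i in range(len(genome) - n + 1))


def building_block_fitness(block_seq, candidates, candidate_fitness):
    """Average fitness of the candidate solutions that contain block_seq."""
    scores = [f for genome, f in zip(candidates, candidate_fitness)
              if contains(genome, block_seq)]
    return sum(scores) / len(scores) if scores else None


candidates = [["s1", "s2", "s3"], ["s4", "s2", "s3"], ["s3", "s2"]]
candidate_fitness = [9, 11, 20]
print(building_block_fitness(["s2", "s3"], candidates, candidate_fitness))
```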
48The Building Blocks Population (cont)
- Unisex individuals. 
- Two modification operators: 
-  Expansion: the individual's genome grows by one 
 block. Occurs with high probability.
-  Exploration: the individual dies and starts 
 over as a new 2-block individual. Occurs with
 low probability.
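One way the two operators might look is sketched below. Several details are our assumptions and not stated in the slides: expansion extends the building block to the left or right by the neighbouring block in some candidate solution that contains it (so the result is still contained in a candidate), exploration picks a random adjacent pair from a candidate with at least two blocks, and the value `p_explore=0.1` is purely illustrative.

```python
import random


def expand(seq, candidates, rng):
    """Grow seq by one block, following a candidate solution that contains it."""
    options = []
    n = len(seq)
    for genome in candidates:
        for i in range(len(genome) - n + 1):
            if genome[i:i + n] == seq:
                if i > 0:
                    options.append([genome[i - 1]] + seq)
                if i + n < len(genome):
                    options.append(seq + [genome[i + n]])
    # If no candidate allows growth, leave the building block unchanged.
    return rng.choice(options) if options else seq


def explore(candidates, rng):
    """Start over as a random pair of adjacent blocks from some candidate."""
    hosts = [genome for genome in candidates if len(genome) >= 2]
    genome = rng.choice(hosts)
    i = rng.randrange(len(genome) - 1)
    return genome[i:i + 2]


def modify(seq, candidates, rng, p_explore=0.1):
    """Exploration with low probability, expansion otherwise."""
    if rng.random() < p_explore:
        return explore(candidates, rng)
    return expand(seq, candidates, rng)


rng = random.Random(0)
candidates = [["s1", "s2", "s3", "s4"]]
print(modify(["s2", "s3"], candidates, rng))
```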
49Building Blocks  Candidate Solutions
Fitness evaluation (figure labels: f1, f2, f3, f4) 
 50Building Blocks  Candidate Solutions
Fitness evaluation (figure labels: f1, f2, f3, f4)
Update recombination-aid vector 
 51Update Recombination-aid vector 
 54Recombination-loci selection 
 Ties are broken arbitrarily 
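The figures for the update and selection steps are not preserved here, so the sketch below is only one plausible reading. We assume the vector has one entry per link between adjacent blocks of a candidate solution, that each entry records the strength of the strongest building block spanning that link, and that crossover cuts the weakest link, with ties broken at random. The `strength` score (higher means more worth preserving) is hypothetical; with a lower-is-better fitness it would first have to be converted.

```python
import random


def update_aid_vector(genome, building_blocks):
    """aid[i] = strength of the strongest building block spanning link i.

    Link i joins genome[i] and genome[i + 1]; building_blocks is a list of
    (block sequence, strength) pairs.
    """
    aid = [0.0] * max(len(genome) - 1, 0)
    for seq, strength in building_blocks:
        n = len(seq)
        for start in range(len(genome) - n + 1):
            if genome[start:start + n] == seq:
                for link in range(start, start + n - 1):
                    aid[link] = max(aid[link], strength)
    return aid


def choose_locus(aid, rng):
    """Cut at a weakest link (genome of length >= 2); ties broken at random."""
    weakest = min(aid)
    ties = [i for i, value in enumerate(aid) if value == weakest]
    return rng.choice(ties) + 1  # cut index: genome[:cut] and genome[cut:]


genome = ["s1", "s2", "s3", "s4", "s1"]
building_blocks = [(["s2", "s3"], 5.0), (["s2", "s3", "s4"], 3.0)]
aid = update_aid_vector(genome, building_blocks)
cut = choose_locus(aid, random.Random(0))
print(aid, cut)
```

Under this reading, a cut never falls inside a protected run of blocks while an unprotected link is available, which is the preservation effect the Puzzle algorithm is after.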
 55Experiments
Compare GREEDY, Standard GA, Puzzle 
 56Building Blocks - Experimental Setup 
 57Results Experiment III (50 blocks) 
 58Results Experiment IV (80 blocks)
Did we lose to cooperative?
NO! 
 59Results Summary 
Table: average of the best superstring lengths, by algorithm (GREEDY, Genetic, Puzzle) and problem size (50 blocks, 80 blocks). 
61Relations Between The Algorithms 
Co-Puzzle
GA 
 62The Co-Puzzle Algorithm
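Diagram: two species, each paired with its own population of possible building blocks.
- Candidate prefixes population and its possible 
 building blocks population, linked by fitness
 evaluation and crossover location.
- Candidate suffixes population and its possible 
 building blocks population, linked by fitness
 evaluation and crossover location.
- The prefixes and suffixes populations are linked 
 to each other by fitness evaluation.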
 63Experiments
Compare GREEDY, Cooperative Coevolution, 
Co-Puzzle 
 64Results Experiment V (80 blocks) 
 65Results Experiment VI (50 blocks)
???? 
 66Results Summary
Table: size of shortest common superstring, by algorithm (GREEDY, Cooperative, Co-puzzle) and problem size (50 blocks, 80 blocks). 
 67Proposal The Messy Puzzle Algorithm 
 68The Messy Puzzle Algorithm
- The linkage problem. 
- Messy genes: 
-  Variable-length genome. 
-  A gene is an ordered pair 
 <allele-locus, allele-value>.
-  Handling over- and under-specification. 
- Example 
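A small illustration of messy genes, following the classic messy-GA conventions rather than anything stated on the slide: over-specification (the same locus appearing more than once) is resolved first-come-first-served, and under-specification (a locus that never appears) is filled in from a template. The function name `decode` and the template values are assumptions for this sketch.

```python
def decode(messy_genome, template):
    """Express a list of (locus, value) genes over a fixed-length template."""
    expressed = list(template)
    seen = set()
    for locus, value in messy_genome:
        if locus in seen:
            continue  # over-specification: the first occurrence wins
        seen.add(locus)
        expressed[locus] = value
    # Loci never mentioned keep the template's value (under-specification).
    return expressed


genome = [(2, 1), (0, 1), (2, 0)]  # locus 2 is specified twice; loci 1 and 3 never
print(decode(genome, template=[0, 0, 0, 0]))
```

Because each gene carries its own locus, related genes can sit next to each other in the genome regardless of their position in the expressed solution, which is what makes the representation relevant to the linkage problem.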
69The Messy Puzzle Algorithm (cont)
A building block's genome is represented as a 
sequence of messy genes. 
 70Example: The MAXCUT Problem
- The MAXCUT problem. 
- The main difficulty: identifying the related 
 vertices.
- Possible solution: 
-  Use some sort of messy genes to put related genes 
 close together.
-  Use the Puzzle approach to keep them together. 
72Results Summary
Chart: size of shortest common superstring, by algorithm (GREEDY, Cooperative, Co-puzzle, Puzzle) and problem size; 20 problem instances per experiment.
Chart annotations, in the order they appear: "83 better", 50 blocks, "42 better", 80 blocks, "25 better", 90 blocks, "13 better", 100 blocks. 
 73Larger Problems - Using More Species
Table: size of shortest common superstring, by algorithm (GREEDY, Co-puzzle, 3-Co-puzzle) and problem size (110 blocks, 120 blocks). 
 74Conclusions
- Cooperative coevolution might prove deleterious 
 when too many species are used.
- When a suitable number of species is used, 
 cooperative coevolution improves performance by
 decomposing the problem into several easier
 subproblems.
75(Conjectured) Scaling Analysis of Cooperative 
Coevolution 
 76Conclusions (cont)
- Evolving a population of building blocks to aid 
 in the selection of recombination loci
 drastically improves the performance of a
 standard GA.
- Combining cooperative coevolution with Puzzle 
 further improves overall performance.
77Future Work
- The Messy Puzzle Algorithm. 
- Scaling analysis of cooperative coevolution. 
- Test the (Co-) Puzzle approach on other problem 
 domains.
- A hybrid GA. 
- Tackle larger problems. 
- Comparison to greedy-stochastically based 
 local-search algorithms.