Title: Multiple Sequence Alignment Using A Genetic Algorithm
1 Multiple Sequence Alignment Using A Genetic Algorithm
2 Project Goals
Goal 1
Extend my work from last year to produce a
program that performs multiple alignments using a
genetic algorithm
Goal 2
Evaluate the performance of my program by varying
the number of sequences, the length of the
sequences, and the similarity of the sequences
Goal 3
Compare the quality of my alignments as well as
the time it takes to generate them to one or
more commonly used alignment programs
3 History of Multiple Alignment using GAs
Programs that perform multiple alignment using a
genetic algorithm
- SAGA (Sequence Alignment by Genetic Algorithm)
breaks possible alignments into pieces and then
introduces gaps at various positions.
- L. Abdesslem, M. Soham, and B. Mohamed produced
an alignment program by combining a genetic
algorithm with quantum computing principles. The
quantum computing portion was intended to reduce
the number of generations needed to obtain an
alignment.
- MAGA (Multiple Alignment by Genetic Algorithm)
for protein sequences implemented a genetic
algorithm with a basic scoring function, random
introduction of gaps, and linear propagation to
the next generation. Results were impressive for
small alignments with high similarity but not
very good for larger, more divergent alignments.
4 Review of Multiple Alignment
Definition
Given a set of sequences, gaps are inserted into
each of the sequences in such a way that all
characters in a column are as similar as possible
Example
[Figure: example alignment]
Comments
This problem is known to be NP-Hard and is
additionally complicated by the fact that there
is no obvious way to score an alignment
5 Review of Genetic Algorithms
Choose an initial population
Evaluate the fitness of each individual
Select individuals to reproduce
Breed a new generation via mutation and crossover
Evaluate the fitness of the offspring
Replace part of the population with the offspring
Report the best Individual
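The steps above can be sketched as a generic loop. This is a minimal Python sketch, not the program described in these slides: the name run_ga, its parameters, and the use of truncation selection (breeding from the better half of the population) are assumptions made for illustration.

```python
import random

def run_ga(initial_population, fitness, mutate, crossover,
           generations=100, n_offspring=10, rng=None):
    """Generic GA loop (hypothetical helper, not the author's code)."""
    rng = rng or random.Random()
    population = list(initial_population)          # choose an initial population
    if len(population) < 2:
        raise ValueError("need at least two individuals to breed")
    for _ in range(generations):
        # Evaluate fitness and rank the population, best first.
        ranked = sorted(population, key=fitness, reverse=True)
        # Select individuals to reproduce (assumed: the better half).
        parents = ranked[:max(2, len(ranked) // 2)]
        # Always keep at least the current best individual.
        keep = max(1, len(ranked) - n_offspring)
        # Breed a new generation via crossover followed by mutation.
        offspring = []
        for _ in range(len(ranked) - keep):
            a, b = rng.sample(parents, 2)
            offspring.append(mutate(crossover(a, b, rng), rng))
        # Replace the worst part of the population with the offspring.
        population = ranked[:keep] + offspring
    # Report the best individual.
    return max(population, key=fitness)
```

Because the best individual is always carried over, the reported fitness can never be worse than the best fitness in the initial population.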
6 Review of Mutation and Crossover
Mutation
Definition: Mutation occurs when a randomly
chosen bit of an individual is changed to a
different state
Example: Box c is mutated from blue to purple and
becomes c'
[Figure: Mutation Event]
Crossover
Definition: Crossover occurs when part of one
individual is replaced with the analogous
part of a second individual
Example: abdc and abcd undergo crossover to
produce abcd
[Figure: Crossover]
7 Choosing an Initial Population
Current Scheme
Currently I begin with an alignment produced by
another program and copy it multiple times to
create an initial population
[Figure: Alignments 1-4]
Improved Scheme
An easy improvement to implement would be to
copy the initial alignment multiple times and
then randomly distribute the gaps within each
alignment
[Figure: Alignments 1-4]
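The improved scheme could look roughly like the following Python sketch. The names shuffle_gaps and initial_population are hypothetical, and the sketch assumes an alignment is stored as a list of equal-length strings with "-" as the gap character.

```python
import random

def shuffle_gaps(row, rng):
    """Return a copy of one aligned row with its gaps placed at random
    positions; the residues keep their original order."""
    residues = [c for c in row if c != "-"]
    n_gaps = len(row) - len(residues)
    gap_positions = set(rng.sample(range(len(row)), n_gaps))
    remaining = iter(residues)
    return "".join("-" if i in gap_positions else next(remaining)
                   for i in range(len(row)))

def initial_population(alignment, size, rng=None):
    """Copy the starting alignment `size` times, redistributing the gaps
    independently in every row of every copy."""
    rng = rng or random.Random()
    return [[shuffle_gaps(row, rng) for row in alignment]
            for _ in range(size)]
```

Each copy keeps the same row length and the same number of gaps per row as the starting alignment, so all individuals remain comparable column by column.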
Ideal Scheme
Ideally I would like to start with sequences
without gaps and find a way to introduce gaps
into the alignment. This would avoid the
problem of having to start with an alignment
generated by another program.
[Figure: Alignments 1-4]
8 Evaluating Fitness
Naive Scoring Function
- One point for each matching pair of characters in a column
- A gap does not match a gap
[Figure: example column; total score for this column: 4]
Formula
Score(column) = nC2(A) + nC2(C) + nC2(G) + nC2(T),
where nC2(X) is the number of ways you can choose
2 items from the n occurrences of X in the column,
i.e. n(n-1)/2
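As a rough illustration of the naive scoring function, here is a short Python sketch. The function names are hypothetical, and the sketch counts matching pairs for any non-gap character rather than only A, C, G, and T.

```python
from collections import Counter
from math import comb

def column_score(column):
    """Number of matching pairs in one column; gaps never match."""
    counts = Counter(c for c in column if c != "-")
    return sum(comb(n, 2) for n in counts.values())

def alignment_score(alignment):
    """Sum of column scores over an alignment given as equal-length rows."""
    return sum(column_score(col) for col in zip(*alignment))
```

For instance, a column containing A, A, A, C, C has three A-A pairs and one C-C pair, so column_score("AAACC") evaluates to 4.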
Improved Scoring Function
Ideally, I would like to implement a scoring
function with more biological significance. The
most likely choice at present is a pairwise
function with affine gap penalties, but I intend
to research other possibilities as well.
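One possible shape for such a pairwise function is a sum-of-pairs score with affine gap penalties, sketched below in Python. Everything here is an assumption for illustration: the function names, the match, mismatch, gap_open, and gap_extend values, and the convention that the first position of a gap run costs gap_open while each further position costs gap_extend. Columns where both rows have a gap are skipped.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score two aligned rows with affine gap penalties (assumed values)."""
    score = 0
    gap_side = None                      # which row the current gap run is in
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue                     # column absent from this pair
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Opening a new gap run costs more than extending one.
            score += gap_extend if gap_side == side else gap_open
            gap_side = side
        else:
            score += match if x == y else mismatch
            gap_side = None
    return score

def sum_of_pairs(alignment, **params):
    """Add up pair_score over every pair of rows in the alignment."""
    return sum(pair_score(a, b, **params)
               for a, b in combinations(alignment, 2))
```

With the default values, pair_score("AC--T", "ACGGT") is 1 + 1 - 3 - 1 + 1 = -1: one long gap is penalized less than two separate gaps would be.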
9 Mutation
Current Scheme
Step 1: Determine whether or not to mutate a gap
using a random number generator
[Figure: gap selected for mutation]
Step 2: Determine how far that gap will be
moved using a random number generator
[Figure: gap to be moved 2 spaces]
Step 3: Move the gap
[Figure: alignment after the gap has been moved]
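The three mutation steps can be sketched in Python as follows. This is only an illustration under stated assumptions: a row is a string with "-" for gaps, and the names move_gap and mutate_row and the parameters p_mutate and max_shift are hypothetical.

```python
import random

def move_gap(row, index, shift):
    """Move the gap at `index` by `shift` positions (negative = left),
    clamping the move to the ends of the row."""
    if row[index] != "-":
        raise ValueError("no gap at the given index")
    chars = list(row)
    del chars[index]
    target = min(max(index + shift, 0), len(chars))
    chars.insert(target, "-")
    return "".join(chars)

def mutate_row(row, p_mutate, max_shift, rng):
    """Apply the three steps to each gap position of one row."""
    # Gap positions are taken from the original row; after a move, a
    # position may no longer hold a gap, so it is re-checked. This
    # simplification means a moved gap can occasionally be visited again.
    for i in [k for k, c in enumerate(row) if c == "-"]:
        if row[i] == "-" and rng.random() < p_mutate:                # step 1
            shift = rng.choice((-1, 1)) * rng.randint(1, max_shift)  # step 2
            row = move_gap(row, i, shift)                            # step 3
    return row
```

Moving a gap never changes the row length or the order of the residues, so a mutated row still represents the same sequence.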
10 Crossover
Current Scheme
A crossover alignment is built by randomly
selecting a first sequence from one of the two
alignments, a second sequence from one of the
two alignments, etc.
Example
[Figure: Alignment 1 and Alignment 2 combined into a Crossover Alignment]
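A minimal Python sketch of this row-wise crossover is shown below. It assumes both parents align the same sequences in the same order and have rows of equal length, which holds when every individual is derived from one starting alignment by moving gaps; the function name is hypothetical.

```python
import random

def crossover(parent_a, parent_b, rng=None):
    """Build a child alignment by picking each row at random from one of
    the two parent alignments."""
    rng = rng or random.Random()
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must contain the same number of sequences")
    return [rng.choice((row_a, row_b))
            for row_a, row_b in zip(parent_a, parent_b)]
```

Since whole rows are copied, each sequence in the child is intact; only the combination of gap placements across rows is new.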
11 Termination
Current Scheme
The algorithm terminates after a set number of
generations
Improved Scheme
The algorithm terminates when one of the
following three criteria is met:
- The alignment has not improved in a set number of
generations
- The algorithm has run for a maximum number of
generations
- The algorithm has run for a set amount of time
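The three stopping criteria could be checked once per generation with something like the Python sketch below. The function name, the parameter names, and the default limits are all assumptions; best_scores is taken to be a list holding the best score seen in each generation so far.

```python
import time

def should_stop(generation, best_scores, start_time,
                max_generations=1000, patience=50, max_seconds=60.0,
                now=None):
    """Return True when any of the three termination criteria is met."""
    now = time.monotonic() if now is None else now
    # Criterion: maximum number of generations reached.
    if generation >= max_generations:
        return True
    # Criterion: time budget used up.
    if now - start_time >= max_seconds:
        return True
    # Criterion: no improvement during the last `patience` generations.
    if len(best_scores) > patience:
        recent_best = max(best_scores[-patience:])
        earlier_best = max(best_scores[:-patience])
        if recent_best <= earlier_best:
            return True
    return False
```

The optional now argument exists only so the time check can be exercised without waiting; in normal use it is left as None and start_time comes from time.monotonic().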
Additional Ideas
- Add a function to graph scores vs. generations
- Output the best alignment every n generations so
that the user can kill the program at any time
12 Parameter Optimization
Parameters to Algorithm
- Probability of gap getting mutated
- Gap Mutation
- Probability of crossover
- Probability of alignment making it to next
generation
- Population size
- Number of generations
Ideas for Optimization
- I hope to implement a GARLI-type scheme where
parameters are adjusted based on the alignment and the
number of generations that have been bred
- I also intend to run several alignments and try
to figure out which default parameters work well
13 Time Optimization
Goal
Make my algorithm run as fast and efficiently
as possible while still producing the best
alignment
Ideas for improvement
- Find/build a good data structure to store
alignments
- Implement some (or all) of the algorithm in C
14 Comparing to Other Programs
Ideas for testing quality of my algorithm
- Find at least one other well-known alignment
program
- Run several different test cases on both my
program and a well-established alignment program
- Compare quality of alignments
- Compare running times
15 References
- Wikipedia pages on Multiple Alignment and Genetic
Algorithms
- Gusfield, Dan. Algorithms on Strings, Trees, and
Sequences. New York: Cambridge University Press,
1997.
- Harada, Yoshitomo, Masato Wayama, and Toshio
Shimizu. An Inspection of the Multiple Alignment
Method with use of a Genetic Algorithm. Genome
Informatics 8 (1997): 272-273.
- Abdesslem, L., M. Soham, and B. Mohamed.
Multiple Sequence Alignment by Quantum Genetic
Algorithm. Parallel and Distributed Processing
Symposium, 2006. IPDPS 2006. 20th International,
25-29 April 2006, 8 pp.