Computational Molecular Biology
  • Multiple Sequence Alignment

Sequence Alignment
  • Problem Definition
  • Given 2 DNA or protein sequences
  • Find the best match between them
  • What is an Alignment
  • Given 2 strings S and S'
  • Goal: make the lengths of S and S' the same by
    inserting spaces (written -- here; another symbol
    is sometimes used) into these strings

A -- T C -- A
-- C T C A A
Matches, Mismatches and Indels
  • Match: two aligned, identical characters in an
    alignment
  • Mismatch: two aligned, unequal characters
  • Indel: a character aligned with a space

A A C T A C T -- C C T A A C A C T -- --
-- -- C T C C T A C C T -- -- T A C T T T
10 matches, 2 mismatches, 7 indels
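
The counts above can be checked mechanically. The following is a minimal sketch, assuming the 19-column reading of the example (each row written as a string, with a single "-" standing for a space); the function name classify_columns is illustrative and not from the slides.

```python
def classify_columns(row1, row2, gap="-"):
    """Count matches, mismatches and indels in a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            continue  # two opposing spaces are ignored
        if a == gap or b == gap:
            indels += 1      # a character aligned with a space
        elif a == b:
            matches += 1     # two aligned, identical characters
        else:
            mismatches += 1  # two aligned, unequal characters
    return matches, mismatches, indels


# The example alignment, one character per column (assumed reading).
row1 = "AACTACT-CCTAACACT--"
row2 = "--CTCCTACCT--TACTTT"
print(classify_columns(row1, row2))  # (10, 2, 7)
```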
Basic Algorithmic Problem
  • Find the alignment of the two strings that
  • maximizes m, where m = (#matches - #mismatches -
    #indels)
  • or minimizes m, where m is the SP-score of the
    alignment
  • m defines the similarity of the two strings; such
    an alignment is called an Optimal Global Alignment
  • Biologically a mismatch represents a mutation,
    whereas an indel represents a historical
    insertion or deletion of a single character
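
One standard way to compute the maximizing value of m is dynamic programming over prefixes. The sketch below assumes the unit weights suggested by the slide (+1 per match, -1 per mismatch, -1 per indel); the function name and parameter names are illustrative.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, indel=-1):
    """Best value of (#matches - #mismatches - #indels) over all alignments."""
    n, m = len(s), len(t)
    # best[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * indel  # s[:i] against spaces only
    for j in range(1, m + 1):
        best[0][j] = j * indel  # t[:j] against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            best[i][j] = max(
                best[i - 1][j - 1] + pair,  # align s[i-1] with t[j-1]
                best[i - 1][j] + indel,     # s[i-1] opposite a space
                best[i][j - 1] + indel,     # t[j-1] opposite a space
            )
    return best[n][m]


print(global_alignment_score("ATCA", "CTCAA"))  # 1
```

With these weights the two strings of the first example score 1 (three matches, one mismatch, one indel), which is better than the alignment drawn on the slide (three matches, three indels).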

Multiple Sequence Alignment
  • Problem Definition
  • Similar to the sequence alignment problem but the
    input has more than 2 strings
  • Challenges
  • NP-hard, MAX-SNP-hard
  • Approximation guarantee factor 2 - 2/k, where k is
    the number of the input sequences.
  • More work to reduce the time and space complexity

Sum of Pairs Score (SP-Score)
  • Given a finite alphabet Σ and Σ' = Σ ∪ {-}, where
    - denotes a space
  • Consider k sequences over Σ that we want to
    align. After an alignment, each sequence has
    length l
  • A score d(a,b) is assigned to each pair of
    letters a, b in Σ'

  • The SP-Score of an alignment A is defined as
    follows
  • Consider a matrix of l columns and k rows where
    the rows represent the sequences and the columns
    represent the letters
  • SP-Score is the sum of the scores of all columns
  • Score of each column is the sum of the scores of
    all distinct unordered pairs of letters in the
    column
  • Or we can view it as the sum of pairwise sequence
    alignment values.
  • Find an (optimal) alignment to minimize the
    SP-Score value
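
The column-by-column definition translates directly into code. This is a minimal sketch, assuming the alignment is given as k equal-length strings with "-" for a space and assuming a unit cost d (0 for equal symbols, including two spaces, and 1 otherwise); sp_score and unit_cost are illustrative names.

```python
from itertools import combinations


def unit_cost(a, b):
    """Assumed pair score: 0 for equal symbols (or two spaces), else 1."""
    return 0 if a == b else 1


def sp_score(rows, d=unit_cost):
    """Sum over columns of the scores of all unordered pairs in the column."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            total += d(a, b)
    return total


rows = ["AC-T", "A-GT", "ACGT"]
print(sp_score(rows))  # 4
```

The same value is obtained by adding the scores of the three induced pairwise alignments, which is the "sum of pairwise alignment values" view.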

  • Proving that MSA with an SP-Score that is a
    metric is NP-hard

Some Notations
Some Basic Properties
  • Lemma 1: Let s1, s2 be two sequences over Σ such
    that l1 = |s1|, l2 = |s2|, l2 ≥ l1 and there are m
    symbols of s1 that are not in s2. Then every
    alignment of the set {s1, s2} has at least
    m + l2 - l1 columns that are not matches

The construction
  • Reduce vertex cover (also called node cover) to
    MSA.
  • Vertex cover
  • Instance: a graph G = (V,E) and an integer k ≤ |V|
  • Question: is there a vertex cover V1 of G of size
    k or less?
  • MSA
  • Instance: a set S = {s1, ..., sn} of finite
    sequences over a fixed alphabet Σ, an SP-score
    and an integer C
  • Question: is there a multiple alignment of the
    sequences in S that is of value C or less?
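
To make the source problem of the reduction concrete, here is a brute-force sketch of the vertex cover decision question. It only illustrates the question being asked (it takes exponential time) and is not part of the reduction itself; the function name is illustrative.

```python
from itertools import combinations


def has_vertex_cover(vertices, edges, k):
    """Is there a set of at most k vertices touching every edge?"""
    for size in range(k + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            # A cover must contain at least one endpoint of every edge.
            if all(u in chosen or v in chosen for u, v in edges):
                return True
    return False


triangle = [(0, 1), (1, 2), (0, 2)]
print(has_vertex_cover([0, 1, 2], triangle, 1))  # False
print(has_vertex_cover([0, 1, 2], triangle, 2))  # True
```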

SP-Score (alphabet of size 6)
The Reduction
So, we have: T is a set of C2 sequences t, and X
contains C1 sequences x(k), where C1 and C2 will be
determined later
An Example
  • By the above construction, an optimal alignment A
    of S is obtained when A satisfies certain
    properties (called standard alignment)
  • The value of a standard alignment is bounded by a
    given threshold C only when G has a vertex cover
    of size k
  • How to obtain this
  • Force the d's of the test sequences to be aligned
    with b's of the edge sequences
  • Only one b of each edge sequence can be aligned
    to a d
  • The number of such alignments determines the
    value of the alignment

Standard Alignment
  • Let U_S and U_{S,X} denote the upper bounds of
    D(A_S) and D(A_{S,X}) respectively
  • By Corollary 8 and Lemma 9, the standard
    alignment has value not greater than D_SD + U_S
  • where D_SD = D(A_X) + D(A_T) + D(A_{X,T}) +
    D(A_{S,T}) over a standard alignment A
  • Now, let C1 > U_S and C2 > U_S + U_{S,X}; we can
    prove that an optimal alignment must be a
    standard one

  • Show the NP-hardness of any scoring matrix in
    a broad class M
  • Show that there is a scoring matrix M0 such that
    MSA for M0 is MAX-SNP hard

Interesting Observation
  • Found by brute force: optimal MSAs contain very
    few gaps
  • This suggests studying gap limitations
  • Put an upper bound on the number of gaps one can
    insert during the alignment
  • Special cases
  • Gap-0: no gaps are allowed, but we can shift the
    strings for an alignment (insert spaces at the
    beginning or at the end of a string)
  • Gap-0-1: a gap-0 alignment such that the gap at
    the beginning or at the end of each string is
    exactly one space
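
A gap-0-1 alignment of k equal-length strings is fixed by one binary choice per string: the single space goes either at the end (the string stays in place) or at the beginning (the string shifts by one). The sketch below enumerates all 2^k choices and keeps the one with the smallest SP-score. It assumes equal-length inputs and a unit cost; all names are illustrative.

```python
from itertools import combinations, product


def unit_cost(a, b):
    """Assumed pair score: 0 for equal symbols (or two spaces), else 1."""
    return 0 if a == b else 1


def best_gap_0_1_alignment(strings, d=unit_cost, gap="-"):
    """Try every gap-0-1 alignment and return (best SP value, rows)."""
    if len({len(s) for s in strings}) != 1:
        raise ValueError("this sketch assumes strings of equal length")
    best_value, best_rows = None, None
    for shifts in product((0, 1), repeat=len(strings)):
        # shift 0: space appended at the end; shift 1: space at the front
        rows = [gap + s if shift else s + gap
                for s, shift in zip(strings, shifts)]
        value = sum(d(a, b)
                    for column in zip(*rows)
                    for a, b in combinations(column, 2))
        if best_value is None or value < best_value:
            best_value, best_rows = value, rows
    return best_value, best_rows


print(best_gap_0_1_alignment(["ACG", "CGT"]))  # (2, ['ACG-', '-CGT'])
```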

Problem Definition
  • Given a finite alphabet Σ
  • Scoring matrix
  • For i, j > 0, s_{i,j} represents the penalty for
    aligning a_i with a_j
  • For i > 0, s_{0,i} and s_{i,0} are called indel
    penalties
  • Gap opening penalties (in addition to the indel
    penalties) for aligning a_i with the first or
    last space in a string of spaces

Generic Scoring Matrix
Where Σ = {A,T}; x, y, z are fixed nonnegative
numbers and u > max{0, vA, vT} holds
  • Let M2 be the class of all scoring matrices that
    contain a generic submatrix M
  • Let M1 be the class of all scoring matrices that
    contain a submatrix isomorphic to a generic
    matrix M with z > vT.
  • Let M be the class of all scoring matrices that
    contain a submatrix isomorphic to a generic
    matrix M with y > u and z > vT.
  • Theorem 1
  • (a) The gap-0-1 multiple alignment problem is
    NP-hard for every scoring matrix M in M2.
  • (b) The gap-0 multiple alignment problem is
    NP-hard for every M in M1
  • (c) The multiple alignment problem is NP-hard for
    every M in M
  • Note that M is quite broad and covers most
    scoring schemes used in biological applications.

  • Reduce MAX-CUT-B to the alignment problem
  • Given G = (V,E) where k = |V| and each vertex has
    degree at most B
  • Find a partition of V into two disjoint sets that
    maximizes the number of edges crossing between
    these two sets
  • Given a graph G = (V,E) with k vertices v0, ...,
    v_{k-1} and l edges e0, ..., e_{l-1}, we will
    construct a set of k^2 sequences t0, ...,
    t_{k^2-1} as follows

  • For each vertex vi, construct a sequence ti such
    that
  • for each edge em = (vh, vi) incident at vi, h < i,
    n < k^5, set
  • where ti,j represents the character at the jth
    position in ti.
  • For other j, let ti,j = T
  • For i ≥ k, set ti = T T ... T with length k^12 l

An Example
Proof of Theorem 1(a)
  • We will show that a gap-0-1 alignment will
    partition V into two disjoint subsets V0 and V1
  • V0: all vertices vi such that ti remains in place
    (a space is appended at the end)
  • V1: all vertices vi such that ti shifts to the
    right (a space is inserted at the beginning)
  • Thus, based on the alignment, we can find the
    cut. And vice versa, based on the cut, we can
    find the alignment
  • What remains is to prove that if k is
    sufficiently large, the optimal gap-0-1 alignment
    yields a partition of V with maximum edge cut.

Proof of Theorem 1(a)
  • Let c denote the size of the cut based on the
    alignment A
  • Consider all the sequences ti after that
    alignment A
  • The total indel penalty is of order O(k^4)
    (it appears in the first and last columns of the
    SP-score matrix)
  • The total number of mismatches before the
    alignment is 3k^5 l(k^2-1)
  • To maximally reduce this number
  • 1 A-A match removes 2 A-T mismatches
  • For each edge (vh, vi), if its endpoints are in
    different subsets (of the partition), then a
    total of k^5 A-A matches between sequences th and
    ti are created
  • No other A-T mismatches can be eliminated
  • Thus the SP-score is
  • k^12 l vT k^2(k^2-1)/2 + 3k^5 l(u-vT)(k^2-1)
    - c k^5 (2u-vA-vT) + O(k^4)

Theorem 2
  • Consider the following scoring matrix M0 for the
    alphabet Σ0 = {A,T,C}.
  • The gap-0-1 MSA problem is MAX-SNP-hard
  • The gap-0 MSA problem is MAX-SNP-hard
  • The MSA problem is MAX-SNP-hard

MAX-SNP-hard Proof
  • To prove that a problem A' is MAX-SNP-hard, we
    need to L-reduce a problem A, which is known to
    be MAX-SNP-hard, to A'
  • L-reduction
  • There are two polynomial-time algorithms f, g and
    constants a, b > 0 such that for each instance I
    of A
  • f produces an instance I' = f(I) of A' such that
    OPT(I') ≤ a·OPT(I)
  • Given any solution of I' with cost c', g produces
    a solution of I with cost c such that
    |c - OPT(I)| ≤ b·|c' - OPT(I')|

Proof of Theorem 2
  • To prove MSA (with M0, the scoring matrix
    mentioned before) MAX-SNP-hard
  • L-reduce MAX-CUT-B to another optimization
    problem, called A, which L-reduces to a scaled
    version of MSA
  • Problem A
  • Given a graph G = (V,E) with bounded degree B. For
    every partition P = {V0, V1}, let cP be the size
    of the cut determined by P and let
    dP = 3|E| - 2cP.
  • Find the partition P of V that minimizes dP

Show A is MAX-SNP-hard
  • Let f and g be identity functions
  • Set a = 3B and b = 2; we can easily prove the two
    properties of the L-reduction since
  • cP ≥ |E|/B for an optimal partition P, and
    dP = 3|E| - 2cP ≤ 3|E|
  • Any increase of cP by 1 decreases dP by 2
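
The relation between the cut size cP and the objective dP = 3|E| - 2cP of Problem A can be checked on small graphs by brute force. The sketch below is only an illustration with exponential running time; vertices are assumed to be numbered 0..n-1 and the function names are illustrative.

```python
from itertools import product


def cut_size(edges, side):
    """Number of edges whose endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def best_partition(num_vertices, edges):
    """Return (dP, cP, side) for a partition minimizing dP = 3|E| - 2cP."""
    best = None
    for side in product((0, 1), repeat=num_vertices):
        c = cut_size(edges, side)
        d = 3 * len(edges) - 2 * c
        if best is None or d < best[0]:
            best = (d, c, side)
    return best


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(best_partition(4, cycle))  # (4, 4, (0, 1, 0, 1))
```

Minimizing dP and maximizing cP pick the same partitions, since dP falls by 2 whenever cP rises by 1.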

Show A L-reduces to scaled MSA
Similar to the above construction, we have
  • Similar to the proof of Theorem 1, we have the
    optimal SP-score where
  • If the SP-score is scaled by a factor of k^(-5/2)
    for an MSA of k sequences, then A L-reduces to
    MSA.

How do GAs work?
  • Create a population of random solutions
  • Use natural selection
  • crossover and mutation to improve the solutions
  • Stop when some stopping criterion is satisfied,
    such as
  • No improvement in the fitness function
  • The improvement is less than some threshold
  • The number of iterations exceeds some limit

Terms and Definitions
  • Chromosomes
  • Potential solutions
  • Population
  • Collection of chromosomes
  • Generations
  • Successive populations

Terms and Definitions
  • Crossover
  • Exchange of genes between two chromosomes
  • Mutation
  • Random change of one or more genes in a
    chromosome
  • Elitism
  • Copy the best solutions without doing crossover
    or mutation.

Terms and Definitions
  • Offspring
  • New chromosome created by crossover between two
    parent chromosomes
  • Fitness function
  • Measures how good a chromosome is.
  • Encoding scheme
  • How do we represent every chromosome/gene?
  • Binary, combination, syntax trees.

Why are GAs attractive?
  • No need for a particular algorithm to solve the
    given problem. Only the fitness function is
    required to evaluate the quality of the
    candidate solutions
  • Implicitly a parallel technique that can be
    implemented efficiently on powerful parallel
    computers for demanding large scale problems.

Basic Outline of a GA
  • Initial population composed of random
    chromosomes, called the first generation
  • Evaluate the fitness of each chromosome in the
    population
  • Create a new population
  • Select two parent chromosomes from the population
    according to their fitness
  • Crossover (with some probability) to form a new
    offspring
  • Mutation (with some probability) to mutate the
    new offspring
  • Place the new offspring in the new population
  • Process is repeated until a satisfactory solution
    is found
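
The outline can be sketched as a small generic loop. This is one possible instantiation and not the specific GA of the slides: it assumes a binary encoding, tournament selection of size 3, one-point crossover, bit-flip mutation and elitism of one chromosome. The parameter values and the name run_ga are illustrative.

```python
import random


def run_ga(fitness, length, pop_size=30, generations=50,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximize `fitness` over bit lists; returns (best, best-per-generation)."""
    rng = random.Random(seed)
    # First generation: random chromosomes.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        new_population = [ranked[0][:]]  # elitism: copy the best unchanged
        while len(new_population) < pop_size:
            # Select two parents according to fitness (tournament of 3).
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, length)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            new_population.append(child)
        population = new_population
    return max(population, key=fitness), history


# Toy fitness: number of ones in the chromosome.
best, history = run_ga(sum, length=20)
print(sum(best), history[0], history[-1])
```

Because the best chromosome is always copied forward, the recorded best fitness never decreases from one generation to the next.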

  • Mutation Operation
  • Modify a single parent
  • Try to avoid local minima

Let's see some running examples
  • Minimum of a function
  • Elitism
  • The travelling salesman problem

Multiple Sequence Alignment
  • Fitness function is used to compare the different
    alignments
  • Based on the number of matching symbols and the
    number and size of gaps
  • Also called the cost function
  • Different weights for different types of matches
  • Gap costs
  • can be simple and count the total matching
    symbols
  • can be complicated and consider the type of
    matching symbols, location in the sequence,
    neighboring symbols etc.

  • Approximation Algorithms

Scoring method
  • Score zero for a match or for two opposing spaces
  • Score one for a mismatch or for a character
    opposite a space

  • Assume that two opposing spaces have a zero value
  • Assume other values satisfy the triangle
    inequality
  • s(x,z) ≤ s(x,y) + s(y,z)
  • s(x,z) = cost of transforming character x into
    character z

Objective Functions
  • Two objective functions
  • SP
  • The sum of the values of pairwise alignments
    induced by an alignment A
  • TA
  • Using the topology of the tree, map the strings
    to the nodes of the tree
  • The sum of the selected pairwise alignments is
    called tree alignment

Center Star Method
  • For a set of k strings X
  • Choose a center string Xc of X which minimizes
    Σ_{j≠c} D(Xc,Xj)
  • Let M = min Σ_{j≠c} D(Xc,Xj)
  • Center star is a star tree of k nodes with the
    center node labeled Xc and each of the k-1
    remaining nodes labeled by a distinct string in
    X \ {Xc}
  • If Xi and Xj are strings labeling adjacent nodes
    of tree T, then alignment of Xi and Xj induced by
    A(T) has value D(Xi,Xj)

Center Star Method Alg Ac
  • Do an optimal alignment for each pair (Xc, Xj)
    for all j ≠ c
  • s0 = max number of spaces placed before the first
    char of Xc
  • sf = max number of spaces placed after the last
    char of Xc
  • si = max number of spaces placed between Xc(i)
    and Xc(i+1)

Center Star Method Alg Ac
  • For Xc, insert s0, si, and sf spaces at the
    beginning, between characters, and at the end of
    Xc respectively. Call the result Xc'
  • Then for each Xj, do the optimal alignment with
    Xc' without modifying Xc'
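
The two steps above can be sketched as follows. The code assumes unit costs (0 for a match, 1 for a mismatch or a character opposite a space) and that "-" does not occur in the input strings. The "slots" play the role of s0, si and sf: slot p collects the characters that a pairwise alignment places before character p of Xc, and the last slot is the one after the final character. All function names are illustrative.

```python
def align_pair(s, t, gap="-"):
    """Optimal unit-cost global alignment; returns (cost, s_row, t_row)."""
    n, m = len(s), len(t)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (s[i - 1] != t[j - 1]),
                             cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    i, j, s_row, t_row = n, m, [], []
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i and j and cost[i][j] == cost[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            s_row.append(s[i - 1]); t_row.append(t[j - 1]); i -= 1; j -= 1
        elif i and cost[i][j] == cost[i - 1][j] + 1:
            s_row.append(s[i - 1]); t_row.append(gap); i -= 1
        else:
            s_row.append(gap); t_row.append(t[j - 1]); j -= 1
    return cost[n][m], "".join(reversed(s_row)), "".join(reversed(t_row))


def center_star(strings, gap="-"):
    """Return (index of the center, rows of the star alignment)."""
    k = len(strings)
    dist = [[align_pair(a, b, gap)[0] for b in strings] for a in strings]
    c = min(range(k), key=lambda i: sum(dist[i]))  # minimizes sum of D(Xc,Xj)
    center, n = strings[c], len(strings[c])
    inserted, opposite = {}, {}
    for j in range(k):
        if j == c:
            continue
        _, c_row, x_row = align_pair(center, strings[j], gap)
        slots, facing, p = [""] * (n + 1), [], 0
        for a, b in zip(c_row, x_row):
            if a == gap:
                slots[p] += b      # character of Xj opposite a space of Xc
            else:
                facing.append(b)   # symbol of Xj opposite Xc(p)
                p += 1
        inserted[j], opposite[j] = slots, facing
    # Slot widths: the maximum number of spaces any alignment needs there.
    width = [max((len(inserted[j][p]) for j in inserted), default=0)
             for p in range(n + 1)]

    def build(slots, facing):
        parts = [slots[p].ljust(width[p], gap) + facing[p] for p in range(n)]
        parts.append(slots[n].ljust(width[n], gap))
        return "".join(parts)

    rows = [None] * k
    rows[c] = build([""] * (n + 1), list(center))
    for j in inserted:
        rows[j] = build(inserted[j], opposite[j])
    return c, rows


strings = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]
c, rows = center_star(strings)
print(c)
print("\n".join(rows))
```

By construction, the pairwise alignment that the result induces between the center and every other string still has its optimal value D(Xc,Xj); only pairs of non-center strings may be aligned suboptimally.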

  • d(Xi,Xj) ≥ D(Xi,Xj), where d(Xi,Xj) is the value
    of the pairwise alignment of Xi and Xj induced by
    Ac
  • V(Ac) = Σ_{i<j} d(Xi,Xj)
  • V(Ac) is at most twice the value of the optimal
    multiple alignment of X

  • Lemma 3.1: For any 2 strings Xi, Xj, we have
  • d(Xi,Xj) ≤ d(Xi,Xc) + d(Xc,Xj) (triangle
    inequality)
  • = D(Xi,Xc) + D(Xc,Xj)

  • Let A* be the optimal multiple alignment of the k
    strings
  • Define V(A*) = Σ_{i<j} d*(Xi,Xj), where d* is the
    pairwise value induced by A*

  • Theorem 3.1
  • V(Ac) / V(A*) ≤ 2(k-1)/k < 2
  • Proof

  • Requires all pairwise alignments
  • Computationally expensive
  • Faster: randomized alignments
  • Randomly select a string Xi
  • Build a multiple alignment with the star centered
    at Xi
  • Select the best multiple alignment A from p such
    stars
  • At most (k-1)p pairwise alignments need to be
    computed
Randomized Alignments
  • Theorem 3.2: For any r > 1, let e(r) be the
    expected number of stars needed to be chosen at
    random before the value of the best resulting
    alignment is within a factor of 2 + 1/(r-1) of
    the optimal alignment. Then e(r) ≤ r.
  • e(r) is independent of k and the length of the
    strings

Proof of Theorem 3.2
  • For r = 2, for each string Xi
  • define M(i) = Σ_j D(Xi,Xj); then M(c) = M
  • From Theorem 3.1,
  • Σ_{(i,j)} D(Xi,Xj) = Σ_i M(i) ≤ 2(k-1)M, so the
    average value of M(i) < 2M
  • Since min M(i) = M, the median of M(i) < 3M
  • Expected number of centers selected before one
    with M(i) no larger than the median is at most 2

  • Suppose the median is δM for 1 ≤ δ ≤ 3
  • Then Σ_{(i,j)} D(Xi,Xj) ≥ kM/2 + kδM/2
  • Value of the alignment obtained from any below
    median star ≤ 2(k-1)δM
  • Therefore, error ratio for this star ≤ 2δ /
    (1/2 + δ/2)
  • When δ = 3, error ratio ≤ 3.
  • So we have e(2) ≤ 2

  • Now generalize this proof for r > 2
  • At least k/r stars have M(i) less than or equal
    to (2r-1)M/(r-1)
  • Minimum M(i) is M
  • Mean < 2M
  • Expected number of stars to pick before one with
    M(i) ≤ δM is r, for 1 ≤ δ ≤ (2r-1)/(r-1)
  • error ratio ≤ 2δ / (1/r + (r-1)δ/r)
  • ≤ (2r-1)/(r-1) = 2 + 1/(r-1)

Theorem 3.3
  • Picking p stars at random, the best resulting
    alignment will have value within a factor of
    2 + 1/(r-1) of the optimal with probability at
    least
  • 1 - [(r-1)/r]^p
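
The bound is easy to tabulate. The sketch below evaluates the guaranteed factor 2 + 1/(r-1) and the probability bound 1 - [(r-1)/r]^p, and finds the smallest p that reaches a target probability; the function names are illustrative.

```python
def error_factor(r):
    """Guaranteed approximation factor 2 + 1/(r-1), for r > 1."""
    if r <= 1:
        raise ValueError("r must be greater than 1")
    return 2 + 1 / (r - 1)


def success_probability(r, p):
    """Lower bound 1 - ((r-1)/r)^p after picking p random stars."""
    return 1 - ((r - 1) / r) ** p


def stars_needed(r, target):
    """Smallest p whose probability bound reaches `target` (0 < target < 1)."""
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    p = 1
    while success_probability(r, p) < target:
        p += 1
    return p


print(error_factor(2))            # 3.0
print(success_probability(2, 3))  # 0.875
print(stars_needed(2, 0.99))      # 7
```

For r = 2, seven random stars already give a factor-3 guarantee with probability above 0.99, independent of k and of the string lengths.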

Center Star Method
  • Proof
  • From Theorem 3.2, suppose the median value was
    actually 3M
  • For half the stars M(i) = M and M(i) = 3M for the
    other half
  • Σ_{(i,j)} D(Xi,Xj) = 2kM
  • optimal SP alignment can be obtained from any
    center string Xi with M(i) = M
  • Probability of selecting such a string is one-half

Tree Alignment Method
  • Typical approach
  • first find multiple alignment and then build a
    tree showing the evolutionary derivations
  • Another approach (called tree alignment)
  • first choose the topology of the tree and then
    map the strings to the nodes of the tree
  • Alignment is the pairwise alignments of the
    strings at the ends of the edges of the tree

Formal Definitions
  • Let K be an input set of k strings
  • Let K' ⊇ K be a set of strings containing K
  • Evolutionary tree TK' for K is a tree
  • with at least k nodes
  • each string in K' labels exactly one node and
    each node gets exactly one label in K'
  • The value of TK' is V(TK') = Σ D(X,Y), summed
    over the edges (X,Y) of TK'
  • the problem is to find a set of strings K' and
    a tree TK' for K which minimizes V(TK')

  • The alignment value D(X,Y) is interpreted as
    the minimum cost to transform string X to
    string Y
  • The sum of the alignment values of the edges
    gives the evolutionary cost implied by the tree.

  • Let G be a graph with k nodes, each labeled with
    a distinct string in K
  • Each edge (X,Y) has a weight D(X,Y)
  • Find the MST of G. This MST is an evolutionary
    tree for K
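
The MST heuristic can be sketched with Prim's algorithm on the complete graph of pairwise distances. The code assumes unit-cost edit distance as D(X,Y); the function names are illustrative.

```python
def edit_distance(s, t):
    """Unit-cost edit distance, used here as the edge weight D(X,Y)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b),  # match or mismatch
                           prev[j] + 1,             # a opposite a space
                           cur[j - 1] + 1))         # b opposite a space
        prev = cur
    return prev[-1]


def mst_tree(strings):
    """Prim's algorithm; returns (edges as (u, v, weight), total weight)."""
    k = len(strings)
    dist = [[edit_distance(a, b) for b in strings] for a in strings]
    in_tree, edges = {0}, []
    while len(in_tree) < k:
        # Cheapest edge leaving the current tree.
        u, v = min(((u, v) for u in in_tree
                    for v in range(k) if v not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        in_tree.add(v)
        edges.append((u, v, dist[u][v]))
    return edges, sum(w for _, _, w in edges)


edges, value = mst_tree(["AAAA", "AAAT", "AATT", "TTTT"])
print(edges, value)  # [(0, 1, 1), (1, 2, 1), (2, 3, 2)] 4
```

The returned total is V(MST); it uses only the input strings as node labels, whereas the optimal evolutionary tree may add further strings.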

  • Let T denote the optimal evolutionary tree for K.
  • Prove V(MST)/V(T) < 2
  • Let C be a traversal of the edges of T which
    traverses every edge exactly once in each
    direction
  • Let C1, ..., Ck be the order in which C encounters
    the strings of K
  • Let V(C) = D(Ck,C1) + Σ_{i<k} D(Ci,Ci+1)

  • Corollary 4.1: V(C) ≤ 2V(T)
  • Let D(Ci,Ci+1) be the largest distance between
    any adjacent strings in the traversal C
  • Lemma 4.2
  • V(MST) ≤ V(C) - D(Ci,Ci+1) ≤ V(C) - V(C)/k

  • Theorem 4.1
  • For any set K of k strings, we have
  • V(MST) / V(Tk) ≤ 2(k-1)/k < 2
  • Theorem 4.2
  • V(MST) / V(Tk) ≤ (k-1)/k · V(C)/V(Tk) ≤ 2(k-1)/k
  • Corollary 4.2
  • V(Tk) ≥ k·V(MST) / (2(k-1))

Constrained MSA
  • General SP MSA problem
  • NP-completeness has already been established
  • Approximation algorithms have been developed
  • Heuristics are also available
  • Constrained MSA
  • Biologists often have additional knowledge of
    data (e.g. active site residues)
  • Additional knowledge can specify matches at
    certain locations
  • Models allow users to provide additional
    information

Definition of CMSA Problem
  • Suppose that P = p1p2...pα is a common
    subsequence of S1, S2, ..., Sk
  • The constrained multiple sequence alignment of S
    with respect to P is
  • an MSA A with the constraint that there are α
    columns in A, c1, c2, ..., cα with c1 < c2 < ...
    < cα, such that the characters of column ci,
    1 ≤ i ≤ α, are all equal to pi.

Optimal CPSA
Dynamic Algorithm
Time and Space Complexities
  • The improvement of CPSA in turn improves the time
    and space complexity of
  • Progressive CMSA from O(αkn^4) and O(αn^4) to
    O(αk^2 n^2) and O(αn^2).
  • Optimal CMSA
  • This Optimal CMSA algorithm involves the creation
    of a matrix with k+1 dimensions.
  • (Assume d(x,y) is the distance function and
    satisfies the triangle inequality.)
  • Let D(i1, ..., ik; γ) be the optimal CMSA
    score matrix for
  • S1[1..i1], ..., Sk[1..ik] where P[1..γ] is
    aligned in γ columns.
  • Then the optimal alignment score is D(n1, ...,
    nk; α), where ni = |Si|.
  • Computing D
  • D(0^k; 0) = 0
  • Let ej = 0 or 1, with ej·Sj[ij] = Sj[ij] when
    ej = 1, while ej = 0 represents a space, and
  • d(x1, ..., xk) = Σ_{1≤i<j≤k} d(xi, xj).
  • D(i1, i2, ..., ik; γ) is the minimum of
  • if S1[i1] = ... = Sk[ik] = P[γ],
  • D(i1-1, ..., ik-1; γ-1) + d(S1[i1], ...,
    Sk[ik])
  • min over e ∈ {0,1}^k, e ≠ 0^k, of
    (D(i1-e1, ..., ik-ek; γ) + d(e1·S1[i1], ...,
    ek·Sk[ik])).
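
For k = 2 the recurrence reduces to a three-dimensional table D(i, j; γ). The sketch below is a direct, unoptimized instance of that special case (it is not the improved CPSA algorithm mentioned above) and assumes a unit-cost d by default; the function name and the use of infinity for infeasible constraints are illustrative choices.

```python
import math


def unit_cost(a, b):
    """Assumed distance: 0 for equal symbols, 1 otherwise."""
    return 0 if a == b else 1


def constrained_pair_score(s, t, p, d=unit_cost, gap="-"):
    """Optimal score of aligning s and t so that p appears, in order,
    in columns where both strings carry the same character of p."""
    n, m, alpha = len(s), len(t), len(p)
    inf = math.inf
    # D[i][j][g]: best score for s[:i], t[:j] with p[:g] aligned in g columns
    D = [[[inf] * (alpha + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    D[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue  # base case; D[0][0][g] stays infinite for g > 0
            for g in range(alpha + 1):
                best = inf
                if i and j:
                    pair = d(s[i - 1], t[j - 1])
                    if g and s[i - 1] == t[j - 1] == p[g - 1]:
                        # this column realizes the constraint character p[g-1]
                        best = min(best, D[i - 1][j - 1][g - 1] + pair)
                    best = min(best, D[i - 1][j - 1][g] + pair)
                if i:
                    best = min(best, D[i - 1][j][g] + d(s[i - 1], gap))
                if j:
                    best = min(best, D[i][j - 1][g] + d(gap, t[j - 1]))
                D[i][j][g] = best
    return D[n][m][alpha]


print(constrained_pair_score("GAT", "TAG", ""))   # 2
print(constrained_pair_score("GAT", "TAG", "A"))  # 2
print(constrained_pair_score("GAT", "TAG", "G"))  # 4
```

Forcing the two G's into one column doubles the cost in this example, and a constraint that is not a common subsequence yields an infinite score.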

CMSA (Center Star)
  • The Center-Star method proposed for the general
    MSA problem can be modified to apply to the CMSA
    problem.
  • Consider each sequence as the center, Sc.
    Consider each list of positions where Sc is
    aligned with P.
  • Find the minimum star-sum score for Sc.
  • Create a constrained alignment matrix by merging
    the constrained pairwise sequence alignments
    between Sc and Sj.

CMSA (Center Star)
  • The recurrence of Thm. 3.1 is only slightly
    modified

