Computational Molecular Biology

Description:

Computational Molecular Biology Multiple Sequence Alignment – PowerPoint PPT presentation

Slides: 87

Provided by: MYTRATHAI

Learn more at: https://www.cise.ufl.edu

1
Computational Molecular Biology

Multiple Sequence Alignment

2
Sequence Alignment

Problem Definition
Given two DNA or protein sequences,
find the best match between them
What is an Alignment?
Given two strings S and S'
Goal: make the lengths of S and S' equal by
inserting spaces (written --, sometimes denoted by
a dedicated space symbol) into these strings

A -- T C -- A
-- C T C A A
3
Matches, Mismatches and Indels

Match: two aligned, identical characters in an
alignment
Mismatch: two aligned, unequal characters
Indel: a character aligned with a space

A A C T A C T -- C C T A A C A C T -- -- -- -- C
T C C T A C C T -- -- T A C T T T
10 matches, 2 mismatches, 7 indels
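
The three column types defined on this slide can be counted mechanically. The following is a minimal sketch, assuming each aligned row is a Python string and a single `-` character stands for a space; the function name `classify_columns` is illustrative and not from the slides. It is applied to the small six-column alignment from the Sequence Alignment slide.

```python
def classify_columns(row1, row2, gap="-"):
    """Count (matches, mismatches, indels) in a two-row alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            continue  # two opposing spaces: counted as nothing in this sketch
        if a == gap or b == gap:
            indels += 1      # a character aligned with a space
        elif a == b:
            matches += 1     # two aligned, identical characters
        else:
            mismatches += 1  # two aligned, unequal characters
    return matches, mismatches, indels


# Rows "A -- T C -- A" and "-- C T C A A" written with one character per column.
print(classify_columns("A-TC-A", "-CTCAA"))
```

For this alignment the columns are three indels and three matches, with no mismatches.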
4
Basic Algorithmic Problem

Find the alignment of the two strings that
maximizes m, where m = (matches - mismatches - indels),
or minimizes m, where m is the SP-score of the alignment
m defines the similarity of the two strings; this problem
is also called Optimal Global Alignment
Biologically a mismatch represents a mutation,
whereas an indel represents a historical
insertion or deletion of a single character
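
The optimal global alignment of two strings can be found by dynamic programming. The sketch below uses the minimization form of the problem and assumes unit costs (0 for a match, 1 for a mismatch or indel); these costs, the function name `global_alignment`, and the `-` gap character are assumptions for illustration, not taken from the slides.

```python
def global_alignment(s, t, mismatch=1, indel=1, gap="-"):
    """Return (cost, aligned_s, aligned_t) for a minimum-cost global alignment."""
    n, m = len(s), len(t)
    # D[i][j] = minimum cost of aligning s[:i] with t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel
    for j in range(1, m + 1):
        D[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                          D[i - 1][j] + indel,    # s[i-1] opposite a space
                          D[i][j - 1] + indel)    # t[j-1] opposite a space
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            if D[i][j] == D[i - 1][j - 1] + sub:
                row_s.append(s[i - 1])
                row_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + indel:
            row_s.append(s[i - 1])
            row_t.append(gap)
            i -= 1
        else:
            row_s.append(gap)
            row_t.append(t[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


cost, a, b = global_alignment("ATCA", "CTCAA")
print(cost)
print(a)
print(b)
```

The table has (n+1)(m+1) entries and each is filled in constant time, so the running time is proportional to the product of the two string lengths.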

5
Multiple Sequence Alignment

Problem Definition
Similar to the sequence alignment problem but the
input has more than 2 strings
Challenges
NP-hard and MAX-SNP-hard
Approximation guarantee factor $2 - 2/k$, where k is the number
of input sequences
More work is needed to reduce the time and space complexity

6
Sum of Pairs Score (SP-Score)

Given a finite alphabet $\Sigma$, extended by a symbol
that denotes a space
Consider k sequences over $\Sigma$ that we want to
align. After an alignment, each sequence has
length l
A score d is assigned to each pair of letters

7
SP-Score

The SP-Score of an alignment A is defined as follows
Consider a matrix of l columns and k rows where
the rows represent the sequences and the columns
represent the aligned letters
SP-Score is the sum of the scores of all columns
Score of each column is the sum of the scores of
all distinct unordered pairs of letters in the
column
Or we can view as sum of pairwise sequence
alignment values.
Find an (optimal) alignment to minimize the
SP-Score value
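
The column-by-column definition above translates directly into code. This is a minimal sketch assuming the unit scoring used later in the deck (0 for a match or two opposing spaces, 1 otherwise) and `-` as the space character; `sp_score` and `unit_cost` are hypothetical names.

```python
from itertools import combinations


def unit_cost(a, b):
    """0 for identical symbols (including two spaces), 1 otherwise."""
    return 0 if a == b else 1


def sp_score(rows, cost=unit_cost):
    """Sum, over all columns, of the cost of every unordered pair in the column."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*rows):                  # one column of the k x l matrix
        for a, b in combinations(column, 2):   # distinct unordered pairs
            total += cost(a, b)
    return total


alignment = ["A-TC-A",
             "-CTCAA",
             "ACTC-A"]
print(sp_score(alignment))
```

Summing the same costs row pair by row pair gives the same total, which is the "sum of pairwise alignment values" view mentioned on the slide.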

Proving That MSA with a Metric SP-Score
Is NP-hard

9
Some Notations
10
Some Basic Properties

Lemma 1: Let $s_1, s_2$ be two sequences over $\Sigma$ such
that $l_1 = |s_1|$, $l_2 = |s_2|$, $l_2 \ge l_1$, and there are m
symbols of $s_1$ that are not in $s_2$. Then every
alignment of the set $\{s_1, s_2\}$ has at least $m + l_2 - l_1$
mismatches

12
The construction

Reduce vertex cover (also called node cover) to MSA.
Vertex cover
Instance: a graph $G = (V, E)$ and an integer $k \le |V|$
Question: is there a vertex cover $V_1$ of G of size
k or less?
MSA
Instance: a set $S = \{s_1, \ldots, s_n\}$ of finite sequences
over a fixed alphabet $\Sigma$, an SP-score, and an
integer C
Question: is there a multiple alignment of the
sequences in S that is of value C or less?

13
SP-Score (alphabet of size 6)
14
The Reduction
So, we have , T is a set of C2 sequences t
and X contains C1 sequences x(k), where C1 and C2
will be determined later
15
An Example
16
Intuition

By the above construction, an optimal alignment A
of S is obtained when A satisfies certain
properties (such an A is called a standard alignment)
The value of a standard alignment is bounded by a
given threshold C only when G has a vertex cover
of size k
How to obtain this
Force the d's of the test sequences to be aligned
with the b's of the edge sequences
Only one b of each edge sequence can be aligned
to a d
The number of such alignments determines the value
of the alignment

17
Standard Alignment
22

Let $U_S$ and $U_{S,X}$ denote the upper bounds of $D(A_S)$
and $D(A_{S,X})$ respectively
By Corollary 8 and Lemma 9, the standard
alignment has value not greater than $D_{SD} + U_S + U_{S,X}$,
where $D_{SD} = D(A_X) + D(A_T) + D(A_{X,T}) + D(A_{S,T})$
over a standard alignment A
Now, letting $C_1 > U_S$ and $C_2 > U_S + U_{S,X}$, we can prove
that an optimal alignment must be a standard one

25

Show the NP-hardness of any scoring matrix in
a broad class M
Show that there is a scoring matrix M0 such that
MSA for M0 is MAX-SNP hard

26
Interesting Observation

Brute-force computation shows that optimal MSAs
contain very few gaps
This suggests studying gap limitations:
put an upper bound on the number of gaps one can
insert during the alignment
Special cases
Gap-0: no gaps are allowed, but we can shift the
strings for an alignment (insert spaces at the
beginning or at the end of a string)
Gap-0-1: a gap-0 alignment in which the gap at
the beginning or at the end of each string is
exactly one space

27
Problem Definition

Given a finite alphabet
Scoring matrix
For i, j > 0, $s_{i,j}$ represents the penalty for
aligning $a_i$ with $a_j$
For i > 0, $s_{0,i}$ and $s_{i,0}$ are called indel
penalties
Gap opening penalties (in addition to the indel
penalties) apply when aligning $a_i$ with the first or last
space in a string of spaces

28
Generic Scoring Matrix
Where $\Sigma = \{A, T\}$; x, y, z are fixed nonnegative
numbers; and $u > \max\{0, v_A, v_T\}$ holds

Let $M_2$ be the class of all scoring matrices that
contain a generic submatrix M
Let $M_1$ be the class of all scoring matrices that
contain a submatrix isomorphic
to a generic matrix M with $z > v_T$.
Let the class M consist of all scoring matrices that
contain a submatrix isomorphic
to a generic matrix M with $y > u$ and $z > v_T$.
Theorem 1
(a) The gap-0-1 multiple alignment problem is NP-hard
for every scoring matrix
in $M_2$.
(b) The gap-0 multiple alignment problem is
NP-hard for every scoring matrix in $M_1$
(c) The multiple alignment problem is NP-hard for
every scoring matrix in the class M
Note that the class M is quite broad and covers most
scoring schemes used in
biological applications.

29
Reduction

Reduce from MAX-CUT-B
Given $G = (V, E)$ where $k = |V|$ and each vertex has
degree at most B
Find a partition of V into two disjoint sets that
maximizes the number of edges crossing
between these two sets
Given a graph $G = (V, E)$ with k vertices $v_0, \ldots, v_{k-1}$
and l edges $e_0, \ldots, e_{l-1}$, we construct a set
of $k^2$ sequences $t_0, \ldots, t_{k^2-1}$ as follows

30
Reduction

For each vertex $v_i$, construct a sequence $t_i$ such
that
for each edge $e_m = (v_h, v_i)$ incident at $v_i$, $h < i$,
$n < k^5$, set
where $t_{i,j}$ represents the character at the jth
position in $t_i$.
For other j, let $t_{i,j} = T$
For $i \ge k$, set $t_i = TT \cdots T$ with length $k^{12} l$

31
An Example
32
Proof of Theorem 1(a)

We will show that a gap-0-1 alignment
partitions V into two disjoint subsets $V_0$ and $V_1$
$V_0$: all vertices $v_i$ such that $t_i$ remains in place
(a space is appended at the end)
$V_1$: all vertices $v_i$ such that $t_i$ shifts to the
right
Thus, based on the alignment, we can find the
cut. And vice versa, based on the cut, we can
find the alignment
What remains is to prove that if k is sufficiently
large, the optimal gap-0-1 alignment yields a
partition of V with maximum edge cut.

33
Proof of Theorem 1(a)

Let c denote the size of the cut based on the alignment A
Consider all the sequences $t_i$ after that
alignment A
The total indel penalty is of order $O(k^4)$
(indels appear only in the first and last columns of the SP
score matrix)
The total number of A-T mismatches before the
alignment is $3k^5 l(k^2-1)$
To maximally reduce this number
1 A-A match removes 2 A-T mismatches
For each edge $(v_h, v_i)$, if its endpoints are in different
subsets (of the partition), then a total of $k^5$
A-A matches between sequences $t_h$ and $t_i$ are
created
No other A-T mismatches can be eliminated
Thus the SP-score is
$k^{12} l\, v_T\, k^2(k^2-1)/2 + 3k^5 l(u - v_T)(k^2-1) - c\,k^5(2u - v_A - v_T) + O(k^4)$

34
Theorem 2

Consider the following scoring matrix $M_0$ for the
alphabet $\Sigma_0 = \{A, T, C\}$.
The gap-0-1 MSA problem is MAX-SNP-hard
The gap-0 MSA problem is MAX-SNP-hard
The MSA problem is MAX-SNP-hard

35
MAX-SNP-hard Proof

To prove that problem A is MAX-SNP-hard, we need to
L-reduce a problem A', which is known to be MAX-SNP-hard, to A
L-reduction
There are two polynomial-time algorithms f, g and
constants a, b > 0 such that for each instance I
of A'
f produces an instance I' = f(I) of A such that
$OPT(I') \le a \cdot OPT(I)$
Given any solution of I' with cost c', g produces
a solution of I with cost c such that
$|c - OPT(I)| \le b\,|c' - OPT(I')|$

36
Proof of Theorem 2

To prove that MSA (with $M_0$, the scoring matrix
mentioned before) is MAX-SNP-hard
L-reduce MAX-CUT-B to another optimization
problem, called A, which in turn L-reduces to a scaled
version of MSA
Problem A
Given a graph $G = (V, E)$ with bounded degree B. For
every partition $P = \{V_0, V_1\}$, let $c_P$ be the size of the
cut determined by P.
Find the partition P of V that minimizes
$d_P = 3|E| - 2c_P$

37
Show A is MAX-SNP-hard

Let f and g be identity functions
Setting a = 3B and b = 2, we can easily prove the two
properties of the L-reduction, since
the optimal cut satisfies $c_P \ge |E|/B$ and $d_P = 3|E| - 2c_P \le 3|E|$
Any increase of $c_P$ by 1 decreases $d_P$ by 2
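
The relationship between the cut size and the objective of Problem A can be checked on small graphs by exhaustive search. The sketch below is only an illustration of the definitions (it is exponential in the number of vertices and is not part of the reduction); the names `cut_size` and `best_partition` and the example graphs are assumptions.

```python
from itertools import product


def cut_size(edges, side):
    """Number of edges whose endpoints lie on different sides of the partition."""
    return sum(1 for u, v in edges if side[u] != side[v])


def best_partition(num_vertices, edges):
    """Return (d_P, c_P, side) minimizing d_P = 3|E| - 2 c_P by brute force."""
    best = None
    for side in product((0, 1), repeat=num_vertices):
        c = cut_size(edges, side)
        d = 3 * len(edges) - 2 * c
        if best is None or d < best[0]:
            best = (d, c, side)
    return best


# A 4-cycle is bipartite, so every edge can cross the cut.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(best_partition(4, square)[:2])

# A triangle can have at most two of its three edges in the cut.
triangle = [(0, 1), (1, 2), (0, 2)]
print(best_partition(3, triangle)[:2])
```

Because d_P decreases by exactly 2 whenever c_P increases by 1, the partition that minimizes d_P is also a maximum cut.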

38
Show A L-reduces to scaled MSA
Similar to the above construction, we have
39

Similar to the proof of Theorem 1, we have the
optimal SP-score where
If the SP-score is scaled by a factor of $k^{-5/2}$
for an MSA of k sequences, then A L-reduces to MSA.

40
GENETIC ALGORITHMS
41
How do GAs work?

Create a population of random solutions
Use natural selection,
crossover, and mutation to improve the solutions
Stop when some criterion is satisfied,
such as
No improvement in the fitness function
The improvement is less than a given
threshold
The number of iterations exceeds a given
threshold

42
Terms and Definitions

Chromosomes: potential solutions
Population: a collection of chromosomes
Generations: successive populations

43
Terms and Definitions

Crossover: exchange of genes between two chromosomes
Mutation: random change of one or more genes in a
chromosome
Elitism: copying the best solutions unchanged, without crossover
or mutation

44
Terms and Definitions

Offspring: a new chromosome created by crossover between two
parent chromosomes
Fitness function: measures how good a chromosome is
Encoding scheme: how we represent every chromosome/gene,
e.g. binary strings, combinations, syntax trees

45
Why are GAs attractive?

No need for a problem-specific algorithm to solve the
given problem. Only the fitness function is
required to evaluate the quality of the
solutions.
Implicitly a parallel technique that can be
implemented efficiently on powerful parallel
computers for demanding large-scale problems.

46
Basic Outline of a GA

Initial population composed of random
chromosomes, called first generation
Evaluate the fitness of each chromosome in the
population
Create a new population
Select two parent chromosomes from a population
according to their fitness
Crossover (with some probability) to form a new
offspring
Mutation (with some probability) to mutate new
offspring
Place new offspring in a new population
Process is repeated until a satisfactory solution
evolves
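
The outline above can be sketched in a few lines. This is a toy illustration, not the deck's implementation: it assumes binary chromosomes, two-way tournament selection, one-point crossover, bit-flip mutation, and single-chromosome elitism, and it stops after a fixed number of generations. The function name `run_ga` and all parameter values are hypothetical.

```python
import random


def run_ga(fitness, length, pop_size=20, generations=40,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximize `fitness` over bit lists; return (best, best-fitness history)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    # First generation: random chromosomes.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        new_population = [best[:]]  # elitism: keep the best unchanged
        while len(new_population) < pop_size:
            # Selection according to fitness (assumed: 2-way tournaments).
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate:      # one-point crossover
                cut = rng.randrange(1, length)
                child = p1[:cut] + p2[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            new_population.append(child)
        population = new_population
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


# Toy fitness: the number of 1 bits in the chromosome.
best, history = run_ga(sum, length=16)
print(sum(best), history[0], history[-1])
```

With elitism the best fitness can never decrease from one generation to the next, which is why the history is non-decreasing.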

48
Operations

Mutation Operation
Modify a single parent
Try to avoid local minima

49
Let's see some running examples

Minimum of a function
http://cs.felk.cvut.cz/~xobitko/ga/example_f.html
Elitism
http://cs.felk.cvut.cz/~xobitko/ga/params.html
The travelling salesman problem
http://cs.felk.cvut.cz/~xobitko/ga/tspexample.html

50
Multiple Sequence Alignment

A fitness function is used to compare the different
alignments
Based on the number of matching symbols and the
number and size of gaps
Also called the cost function
Can use different weights for different types of matches
and for gap costs
Can be simple and count the total matching
symbols
Can be complicated and consider the type of
matching symbols, location in the sequence,
neighboring symbols, etc.

Approximation Algorithms

52
Scoring method

Score zero for a match or for two opposing spaces
Score one for a mismatch or for a character
opposite a space

53
Assumptions

Assume that two opposing spaces have a zero value
Assume the other values satisfy the triangle inequality
$s(x,z) \le s(x,y) + s(y,z)$
s(x,z): cost of transforming character x into
character z

54
Objective Functions

Two objective functions
SP
The sum of the values of pairwise alignments
induced by an alignment A
TA
Using the topology of the tree, map the strings
to the nodes of the tree
The sum of the selected pairwise alignments is
called tree alignment

55
Center Star Method

For a set of k strings X
Choose a center string $X_c$ of X which minimizes
$\sum_{j \ne c} D(X_c, X_j)$
Let $M = \min_c \sum_{j \ne c} D(X_c, X_j)$
The center star is a star tree of k nodes with the
center node labeled $X_c$ and each of the k-1
remaining nodes labeled by a distinct string in
$X \setminus \{X_c\}$
If $X_i$ and $X_j$ are strings labeling adjacent nodes
of tree T, then the alignment of $X_i$ and $X_j$ induced by
A(T) has value $D(X_i, X_j)$
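
Choosing the center string only needs the pairwise optimal alignment values. The sketch below assumes the unit scoring from the Scoring method slide, so D is the ordinary edit distance; `edit_distance` and `choose_center` are illustrative names and the three input strings are made up.

```python
def edit_distance(s, t):
    """Optimal pairwise alignment value D(s, t) under unit costs."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b),  # match or mismatch
                           prev[j] + 1,             # a opposite a space
                           cur[j - 1] + 1))         # b opposite a space
        prev = cur
    return prev[-1]


def choose_center(strings):
    """Return (c, M): the index minimizing the sum of D(X_c, X_j), and that sum."""
    best_c, best_m = None, None
    for c, xc in enumerate(strings):
        total = sum(edit_distance(xc, xj)
                    for j, xj in enumerate(strings) if j != c)
        if best_m is None or total < best_m:
            best_c, best_m = c, total
    return best_c, best_m


X = ["ATCA", "CTCAA", "ATCAA"]
print(choose_center(X))
```

This step computes all k(k-1)/2 pairwise alignments, which is the cost noted later under Disadvantages.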

56
Center Star Method Alg Ac

Do an optimal alignment for each pair $(X_c, X_j)$
for all $j \ne c$
$s_0$ = maximum number of spaces placed before the first
character of $X_c$ in any of these alignments
$s_f$ = maximum number of spaces placed after the last
character of $X_c$
$s_i$ = maximum number of spaces placed between $X_c(i)$
and $X_c(i+1)$

57
Center Star Method Alg Ac

For $X_c$, insert $s_0$, $s_i$, and $s_f$ spaces at the
beginning, between characters, and at the end of $X_c$
respectively. Call the result $X_c'$
Then for each $X_j$, do the optimal alignment with $X_c'$
without inserting further spaces into $X_c'$

58
Analysis

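$d(X_i, X_j)$: value of the pairwise alignment of $X_i$ and $X_j$ induced by $A_c$
$d(X_i, X_j) \ge D(X_i, X_j)$
$V(A_c) = \sum_{i<j} d(X_i, X_j)$
$V(A_c)$ is at most twice the value of the optimal
multiple alignment of X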

59
Analysis

Lemma 3.1: For any 2 strings $X_i, X_j$, we have
$d(X_i, X_j) \le d(X_i, X_c) + d(X_c, X_j)$
$= D(X_i, X_c) + D(X_c, X_j)$
(the first step is the triangle inequality)

60
Analysis

Let A* be the optimal multiple alignment of the k strings
in X
Define $V(A^*) = \sum_{i<j} d^*(X_i, X_j)$, where $d^*$ is the pairwise value induced by A*

61
Analysis

Theorem 3.1
$V(A_c) / V(A^*) \le 2(k-1)/k < 2$
Proof

62
Disadvantages

Requires all pairwise alignments
Computationally expensive
Faster, Randomized alignments
Randomly select string Xi
Build multiple alignment with star centered at Xi
Select best multiple alignment A from p such
stars
At most (k-1)p pairwise alignments need to be
computed

63
Randomized Alignments

Theorem 3.2: For any r > 1, let e(r) be the
expected number of stars that need to be chosen at
random before the value of the best resulting
alignment is within a factor of $2 + 1/(r-1)$ of the
optimal alignment. Then $e(r) \le r$.
e(r) is independent of k and the length of the
strings.

64
Proof of Theorem 3.2

For r = 2, for each string $X_i$
define $M(i) = \sum_j D(X_i, X_j)$; then M(c) = M
From Theorem 3.1,
$\sum_{(i,j)} D(X_i, X_j) = \sum_i M(i) \le 2(k-1)M$, so the average
value of M(i) is < 2M
Since min M(i) = M, the median of M(i) is < 3M
The expected number of centers selected before a selected
M(i) is at most the median is 2

65
Proof

Suppose the median is $\alpha M$ for $1 \le \alpha \le 3$
Then $\sum_{(i,j)} D(X_i, X_j) \ge kM/2 + k\alpha M/2$
Value of the alignment obtained from any below-median
star $\le 2(k-1)\alpha M$
Therefore, the error ratio for this star is $\le 2\alpha / (1/2 + \alpha/2)$
When $\alpha \le 3$, the error ratio is $\le 3$.
So we have $e(2) \le 2$

66
Proof

Now generalize this proof for r > 2
At least k/r stars have M(i) less than or equal
to $(2r-1)M/(r-1)$
Minimum M(i) is M
Mean < 2M
The expected number of stars to pick before finding one with $M(i) < \alpha M$
is at most r, for $1 \le \alpha \le (2r-1)/(r-1)$
Error ratio $\le 2\alpha / [1/r + (r-1)\alpha/r] \le (2r-1)/(r-1) = 2 + 1/(r-1)$

67
Theorem 3.3

Picking p stars at random, the best resulting
alignment will have value within a factor of
$2 + 1/(r-1)$ of the optimal with probability at least
$1 - ((r-1)/r)^p$

68
Center Star Method

Proof
From Theorem 3.2, if the median value were actually 3M
For half the stars M(i) = M, and M(i) = 3M for the
other half
$\sum_{(i,j)} D(X_i, X_j) = 2kM$
The optimal SP alignment can be obtained from any
center string $X_i$ with M(i) = M
The probability of selecting such a string is one-half

69
Tree Alignment Method

Typical approach
first find multiple alignment and then build a
tree showing the evolutionary derivations
Another approach (called tree alignment)
first choose the topology of the tree and then
map the strings to the nodes of the tree
Alignment is the pairwise alignments of the
strings at the ends of the edges of the tree

70
Formal Definitions

Let K be an input set of k strings
Let $K' \supseteq K$ be a set of strings containing K
An evolutionary tree $T_{K'}$ for K is a tree
with at least k nodes
in which each string in K' labels exactly one node and each
node gets exactly one label in K'
The value of $T_{K'}$ is $V(T_{K'}) = \sum_{(X,Y)} D(X, Y)$, summed over the edges (X, Y) of the tree
The problem is to find a set of strings K' and a tree
$T_{K'}$ for K which minimizes $V(T_{K'})$

The alignment value D(X,Y) is interpreted as
the "minimum cost" to transform string X to
string Y
The sum of the alignment values of the edges
gives the evolutionary cost implied by the tree.

72
Method

Let G be a graph with k nodes, each labeled with a
distinct string in K
Each edge (X,Y) has weight D(X,Y)
Find the minimum spanning tree (MST) of G. This MST is an evolutionary
tree for K
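
The method on this slide can be sketched with Prim's algorithm on the complete graph of pairwise alignment values. As an assumption for illustration, D is taken to be the unit-cost edit distance; `edit_distance` and `mst_tree` are hypothetical names, and the simple quadratic-per-step search is chosen for readability rather than speed.

```python
def edit_distance(s, t):
    """Optimal pairwise alignment value D(s, t) under unit costs."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def mst_tree(strings):
    """Prim's algorithm; returns (tree edges as index pairs, V(MST))."""
    k = len(strings)
    dist = [[edit_distance(a, b) for b in strings] for a in strings]
    in_tree = {0}          # grow the tree from the first string
    edges, value = [], 0
    while len(in_tree) < k:
        # Cheapest edge leaving the current tree.
        w, u, v = min((dist[u][v], u, v)
                      for u in in_tree
                      for v in range(k) if v not in in_tree)
        in_tree.add(v)
        edges.append((u, v))
        value += w
    return edges, value


K = ["ATCA", "CTCAA", "ATCAA"]
edges, value = mst_tree(K)
print(edges, value)
```

The returned tree uses only the input strings as node labels, which is what makes it a valid (though not necessarily optimal) evolutionary tree for K.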

73
Analysis

Let T denote the optimal evolutionary tree for K.
Prove $V(MST)/V(T) < 2$, where V(T) is the optimum (OPT)
Let C be a traversal of the edges of T which
traverses every edge exactly once in each
direction
Let $C_1, \ldots, C_k$ be the strings of K in the order that C encounters them
Let $V(C) = D(C_k, C_1) + \sum_{i<k} D(C_i, C_{i+1})$

74
Analysis
75
Analysis

Corollary 4.1: $V(C) \le 2V(T)$
Let $D(C_i, C_{i+1})$ be the largest distance between any
adjacent strings in the traversal C
Lemma 4.2
$V(MST) \le V(C) - D(C_i, C_{i+1}) \le V(C) - V(C)/k$

76
Analysis

Theorem 4.1
For any set K of k strings, we have
$V(MST) / V(T_K) \le 2(k-1)/k < 2$
Theorem 4.2
$V(MST) / V(T_K) \le \frac{k-1}{k} \cdot V(C)/V(T_K) \le 2(k-1)/k$
Corollary 4.2
$V(T_K) \ge k\,V(MST) / (2(k-1))$

77
Constrained MSA
78
Motivation

General SP MSA problem
NP-completeness has already been established
Approximation algorithms have been developed
Heuristics are also available
Constrained MSA
Biologists often have additional knowledge of
data (e.g. active site residues)
Additional knowledge can specify matches at
certain locations
Models allow users to provide additional
constraints

79
Definition of CMSA Problem

Suppose that $P = p_1 p_2 \ldots p_a$ is a common
subsequence of $S_1, S_2, \ldots, S_K$
The constrained multiple sequence alignment of
$S = \{S_1, \ldots, S_K\}$ with respect to P is
an MSA A with the constraint that there are a
columns in A, $c_1, c_2, \ldots, c_a$ with $c_1 < c_2 < \cdots < c_a$,
such that the characters of column $c_i$, $1 \le i \le a$,
are all equal to $p_i$.

80
Optimal CPSA
81
Dynamic Algorithm
82
Time and Space Complexities
83
CMSA

The improvement of CPSA in turn improves the time and
space complexity of
Progressive CMSA from $O(akn^4)$ and $O(an^4)$ to
$O(ak^2n^2)$ and $O(an^2)$.
Optimal CMSA
This Optimal CMSA algorithm involves the creation
of a matrix with k+1 dimensions.
(Assume d(x,y) is the distance function and
satisfies the triangle inequality.)
Let $D(i_1, \ldots, i_k; \gamma)$ be the optimal CMSA
score matrix for
$S_1[1..i_1], \ldots, S_k[1..i_k]$ where $P[1..\gamma]$ is
aligned in $\gamma$ columns.
Then the optimal alignment score is $D(n_1, \ldots, n_k; a)$,
where $n_i = |S_i|$.
Computing D
$D(0^k; 0) = 0$
Let $e_j = 0$ or 1, with $e_j S_j[i_j]$ denoting $S_j[i_j]$ if $e_j = 1$ and
a space if $e_j = 0$, and
$d(x_1, \ldots, x_k) = \sum_{1 \le i < j \le k} d(x_i, x_j)$.
$D(i_1, i_2, \ldots, i_k; \gamma)$ is the minimum of
if $S_1[i_1] = \cdots = S_k[i_k] = P[\gamma]$:
$D(i_1 - 1, \ldots, i_k - 1; \gamma - 1) + d(S_1[i_1], \ldots, S_k[i_k])$
$\min_{e \in \{0,1\}^k} \big( D(i_1 - e_1, \ldots, i_k - e_k; \gamma) + d(e_1 S_1[i_1], \ldots, e_k S_k[i_k]) \big)$
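
For k = 2 the recurrence above reduces to a three-dimensional table that is easy to write out. The following sketch assumes unit costs for d (0 for a match, 1 for a mismatch or indel) and considers only moves that consume at least one character; the function name `constrained_pairwise_score` and the example strings are illustrative, not from the slides.

```python
import math


def constrained_pairwise_score(s1, s2, p, mismatch=1, indel=1):
    """Minimum alignment cost of s1 and s2 in which p appears in len(p)
    columns whose two characters both equal the corresponding p character."""
    n1, n2, a = len(s1), len(s2), len(p)
    INF = math.inf
    # D[i][j][g]: best cost for s1[:i], s2[:j] with p[:g] already placed.
    D = [[[INF] * (a + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            if i == 0 and j == 0:
                continue  # base case; D[0][0][g] stays infinite for g > 0
            for g in range(a + 1):
                best = INF
                if i > 0 and j > 0:
                    x, y = s1[i - 1], s2[j - 1]
                    pair = 0 if x == y else mismatch
                    if g > 0 and x == y == p[g - 1]:
                        # This column satisfies the next constraint character.
                        best = min(best, D[i - 1][j - 1][g - 1] + pair)
                    best = min(best, D[i - 1][j - 1][g] + pair)
                if i > 0:
                    best = min(best, D[i - 1][j][g] + indel)
                if j > 0:
                    best = min(best, D[i][j - 1][g] + indel)
                D[i][j][g] = best
    return D[n1][n2][a]


print(constrained_pairwise_score("AAB", "BAA", ""))   # no constraint
print(constrained_pairwise_score("AAB", "BAA", "B"))  # the two B's must share a column
```

Forcing the two B characters into one column pushes every A opposite a space, so the constrained score is larger than the unconstrained one. If the constraint string is not a common subsequence of the inputs, the function returns infinity.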

84
CMSA (Center Star)

The Center-Star method proposed for the general
MSA problem can be modified to apply to the CMSA
problem.
Consider each sequence as the center, Sc.
Consider each list position that Sc is aligned
with P.
Find the minimum star-sum score for $S_c$.
Create a constrained alignment matrix by merging
the
constrained pairwise sequence alignments between
$S_c$ and $S_j$.

85
CMSA (Center Star)