Naive method Boyer-Moore Algorithm/Knuth-Morris Pratt

About This Presentation

Title:

Naive method Boyer-Moore Algorithm/Knuth-Morris Pratt

Description:

Naive method Boyer-Moore Algorithm/Knuth-Morris Pratt Algorithm Semi-global Comparison Semi-global Comparison Local Alignment Local Alignment Find the best fit ... – PowerPoint PPT presentation

Number of Views:184

Avg rating:3.0/5.0

Slides: 89

Provided by: dsmiUnisi

Category:

more less

Transcript and Presenter's Notes

Title: Naive method Boyer-Moore Algorithm/Knuth-Morris Pratt

1
Algorithms for biological sequence
Comparison and AlignmentSara Brunetti,
Dipartimento di Scienze Matematiche e
Informatica University of Siena,
Italy,sara.brunetti_at_unisi.it

2
History

1953 DNA structure, Watson e Crick
1975 development of the sequencing technique,
Ranger, Maxam e Gilbert
1980 American Supreme Court decides that bacteria
genetically modified were patented
1990 beginning of the Genome Project
Goals
sequence the entire human genome producing the
complete DNA trascript
produce maps of the genome showing locations of
expressed sites
2000 Tony Blair and Bill Clinton announce the
completion of the human genome sequencing
Cost 3 109 euros

Amounts of data
Human genome 3x109 bp word 5000 km long
(Roma-Capo Nord)
Contained in 1015 cells
Macromolecular structures 35460 entry (14 Marz
06, PDB)
Bioinformatics
study of problems of storage, organization and
distribution of large amounts of genomic data

4
Problems

Reconstructing long sequences of DNA from
overlapping sequence fragments.
Storing and retrieving DNA sequences in
Databases.
Searching databases for related sequences and
subsequences.

The human genome project is an international
multi-component effort to read out the
instruction book for human biology that is our
genome, and to understand what its telling us
Biological problems
Determine genetic maps from probe data
Determine the functionality of a protein
Characterize a protein
Prediction of protein structure
Protein interaction

6
Computational biology

study of mathematical and combinatorial problems
of modeling biological processes in the cell,
interpreting the data and providing theories
about their biological relations
Data representation
Problem formulation
(Efficient) algorithm design

7
Data representation

Alphabet
Italian
A B C D E F G H I L M N O P Q R S T U V Z
English
A B C D E F G H I J K L M N O P Q R S T U V W Y Z
DNA
A C G T
Protein
A Q W E R T Y I P L K H F D S C V N M
Binary
0 1

8
Data representation strings

DNA
String ACCGTATATAAAAGGCCGGGTT
Length 22

suffix
substring
prefix
9
Similarities and differences?

-Differences between the human genome and the
chimpanzee genome 2
-Differences betweeen human and worm 50
-Similarity between two humans 99,9
But genome length 3 109 bp
They can differ into 3 106 positions

10
Similarities and differences

The resemblance of two DNA sequences taken from
different organisms can be explained by the
theory that all contemporary genetic material has
one ancestral ancient DNA.
According to this theory, during the course of
evolution mutations occurred, creating
differences between families of contemporary
species.
Most of these changes are due to local mutations,
each modifying the DNA sequence at a specific
manner.

11
Biological motivation

Learning about the functionality or structure of
a protein without performing any experiments
Basic idea
In biomolecular sequences (DNA, RNA,
Aminoacid sequences) high similarity usually
implies significant functional or structural
similarity.
Usually 25 sequence identity suffice two
proteins to have same 3-dim structure and almost
identical function

WARNING Sequence similarities implies functional
similarities, but the reverse is not necessarily
true!
Beside sequences other levels to enquire 3D
protein structure, cellular biochemistry or
morphology etc., but sequences are easier to
study.

13
DNA Sequence Comparison First Success Story

Finding sequence similarities with genes of known
function is a common approach to infer a newly
sequenced genes function
In 1984 Russell Doolittle and colleagues found
similarities between cancer-causing gene and
normal growth factor (PDGF) gene

14
Cystic Fibrosis

Cystic fibrosis (CF) is a chronic and frequently
fatal genetic disease of the body's mucus glands
(abnormally high level of mucus in glands). CF
primarily affects the respiratory systems in
children.
Mucus is a slimy material that coats many
epithelial surfaces and is secreted into fluids
such as saliva

15
Cystic Fibrosis Inheritance

In early 1980s biologists hypothesized that CF is
an autosomal recessive disorder caused by
mutations in a gene that remained unknown till
1989
Heterozygous carriers are asymptomatic
Must be homozygously recessive in this gene in
order to be diagnosed with CF

16
Cystic Fibrosis Finding the Gene
17
Finding Similarities between the Cystic Fibrosis
Gene and ATP binding proteins

ATP binding proteins are present on cell membrane
and act as transport channel
In 1989 biologists found similarity between the
cystic fibrosis gene and ATP binding proteins
A plausible function for cystic fibrosis gene,
given the fact that CF involves sweet secretion
with abnormally high sodium level

18
Cystic Fibrosis Mutation Analysis

If a high of cystic fibrosis (CF) patients have
a certain mutation in the gene and the normal
patients dont, then that could be an indicator
of a mutation that is related to CF
A certain mutation was found in 70 of CF
patients, convincing evidence that it is a
predominant genetic diagnostics marker for CF

19
Cystic Fibrosis and CFTR Gene
20
Cystic Fibrosis and the CFTR Protein

CFTR (Cystic Fibrosis Transmembrane conductance
Regulator) protein is acting in the cell membrane
of epithelial cells that secrete mucus
These cells line the airways of the nose, lungs,
the stomach wall, etc.

21
Mechanism of Cystic Fibrosis
The CFTR protein (1480 amino acids) regulates a
chloride ion channel Adjusts the wateriness of
fluids secreted by the cell Those with cystic
fibrosis are missing one single amino acid in
their CFTR Mucus ends up being too thick,
affecting many organs
22
Bring in the Bioinformaticians

Gene similarities between two genes with known
and unknown function alert biologists to some
possibilities
Computing a similarity score between two genes
tells how likely it is that they have similar
functions

23
Sequence Comparison Problems

Informally find which parts of sequences are
alike and which parts are different.
Given two sequences over the same alphabet, about
of the same length (?10.000 char.), and almost
equal, find the places where differences occur.
Problem 1) the same gene is sequenced by two
laboratories and they want to compare the
results.
2) Given two sequences with a few hundred of
char., find two similar sub-strings (one from
each sequence).
3) Same as Problem 2), but one sequence is
compared with thousand of others.
Problems 2), 3) in searching local similarities
in large databases of bio-sequences.

24
Sequence Comparison Problems

4) Given two sequences with a few hundred of
char., find a prefix of one similar to the suffix
of the other.
Problem 4) in the fragment assembly procedure in
large scale DNA sequencing.
We introduce a single basic algorithmic idea to
solve all the above problems.

25
Pairwise alignment

How to compare two sequences?
Alignment
Similarity

26
Sequence alignment an example
s ATGCAGCTGAGCATCG
t ATACAGCGAGTATCG
27
Sequence alignment an example
s ATGCAGCTGAGCATCG
t A
T
A
C
A
G
C
G
A
G
T
A
T
C
G
28
Sequence alignment

Sequence 1 s(s1,,sm) of size m
Sequence2 t(t1,,tn) of size n
An alignment (s,t) between s and t is obtained
by insertion of spaces in arbitrary positions
along the sequences so that they end up with the
same size
s1 s2 sl
t1 t2 tl
(si,ti) pair of characters in s and t or -
Not allowed (-,-)

29
Longest path in a network

The nodes of the network are (i,j) where each
(i,j) node denotes an aligned pair (si,tj).
There are edges between nodes (i,j) and (k,l) if
i lt k and j lt l (order preserving)
We add a source and a sink
(The weights on the edges are given by function p)

A
T
A
G
C
0 1 2 2 3 3 4
A

s A T G A t T A
G C A
T
G
A
0 0 1 2 3 4 5
30
The Alignment Grid

Every alignment path is from source to sink

31
Edit Distance vs Hamming Distance
Edit distance may compare i-th letter of v
with j-th letter of w
Hamming distance always compares i-th letter
of v with i-th letter of w
V - ATATATAT
V ATATATAT
Just one shift
Make it all line up
W TATATATA
W TATATATA
Hamming distance Edit
distance d(v, w)8
d(v, w)2 Computing Hamming distance
Computing edit distance is a trivial task
is a non-trivial
task
32
Edit Distance Example

TGCATAT ? ATCCGAT in 5 steps
TGCATAT ? (delete last T)
TGCATA ? (delete last A)
TGCAT ? (insert A at front)
ATGCAT ? (substitute C for 3rd G)
ATCCAT ? (insert G before last A)
ATCCGAT (Done)
What is the edit distance? 5?

33
Edit Distance Example (contd)
TGCATAT ? ATCCGAT in 4 steps TGCATAT ? (insert
A at front) ATGCATAT ? (delete 6th T) ATGCATA ?
(substitute G for 5th A) ATGCGTA ? (substitute
C for 3rd G) ATCCGAT (Done) Can it be
done in 3 steps???
34
Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G
T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) ,
(2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6),
(7,7)
- Corresponding path -
35
Alignments in Edit Graph (contd)

and represent indels in v and w with
score 0.
represent matches with score 1.
The score of the alignment path is 5.

36
Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an
alignment
37
Alignment as a Path in the Edit Graph
0122345677 v AT_GTTAT_ w ATCGT_A_C
0123455667 (0,0) , (1,1) , (2,2), (2,3),
(3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
38
Alignment as a Path in the Edit Graph
Old Alignment 0122345677 v AT_GTTAT_ w
ATCGT_A_C 0123455667
New Alignment 0122345677 v AT_GTTAT_ w
ATCG_TA_C 0123445667
39
Sequence similarity an example
ATGCAGCTGAGCATCG ATACAGC GAGTATCG

We assign a score to each alignment
p(a,a) 1
p(a,b) -1
p(a,-)p(-,a)g -2
Score of the example 13 2 1-1 9
The best alignment between s and t is the one
with maximal total score (similarity).

40
Global alignment
41
Global alignment

Given two sequences s and t of roughly the same
length, determine the alignment of s and t with
maximal score
AC - GCTTTG
- CATG TAT-
(NeedlemanWunsch Algorithm)
Motivation the same gene is sequenced by two
laboratories and they want to compare the results

42
Number of alignments

How many ways s can be aligned with t?
s1 s2 sl
t1 t2 tl
Max(n,m) lt l lt nm
s1 .. sm - - - -
- - - - - t1 tn
- s1 .-. sm -
t1 . tn
f(i,j)alignments of one sequence of i letters
with another of j letters
f(n,m)f(n-1,m)f(n-1,m-1)f(n,m-1) and
f(n,n)?(1?2)2n1 ?n as n??

Es. two sequences of length 1000 have the
following number of possible alignments
f(1000,1000)?(1?2)2001 ?100010 767,4.. !!!!!!!!
(there are 10 80 elementary particles in the
universe)

44
Dynamic Programming Algorithm

Basic Idea of Dynamic Programming a problem is
solved taking advantage of the already solved
sub-problems.
Each optimal alignment contains optimal
alignments of the subproblems (example)
- GCTGATATAGCT
GGGTGAT -TAGCT
Additivity of the penalty function
Three essential components
Recurrence relation
Tabular computation
Traceback

45
Dynamic Programming Algorithm Recurrence
relation

Sequence 1 s of size m
Sequence 2 t of size n
si..j sub-string from char i to
char j of s.
M(i,j) is the score of the best alignment between
s1..i and t1..j
M(j,0) M(0,j)-2j
M(m,n) is computed by solving the more general
problem of computing M(i,j) for all i,j
No top-down approach, but bottom up
The computation is arranged in a (m1) (n1)
array M

Mi,j-1 - 2
1, if si tj Mi,j max
Mi-1,j-1 p(i,j), p(i,j)
Mi-1,j -2
-1, if si ? tj
46
Dynamic Programming Algorithm Tabular computation
s t A G C
A A A C
row 0 comparison between t and an empty
sequence. column 0
0
-2
-4
-6
comparison between s and an empty sequence

1 -4 -4 1
-3 -6 -1 1
-2
-3
Mi,j is computed by observing the 3 previous
entries Mi-1,j-1, Mi,j-1 and Mi-1,j.
M
-4 -1 0 -2
-6 -3 -2 -1 -8 -5 -4 -1
Mi-1,j-1 a new char of s and a new char of t
are considered 1 is added in
case of match and 1 in case of mismatch.
Align s1..i-1 with t1..j-1 Mi,j-1 a
new char of the sequence t is considered
corresponding to a space in s
(-2). Align s1..i with t1..j-1 and
match a space with tj Mi-1,j a new char of
the sequence s is considered corresponding
to a space in t (-2). Align
s1..i-1 with t1..j and match a space with si
47
Dynamic Programming Algorithm- Traceback
Trace back to find the best alignment(s)
solution2
t A G C
s
0
-2
-4
-6
A A A C
1
1 -3
-2
-4 -1 0 -2
-6 -3 -2 -1 -8 -5 -4 -1
Best Score
-1
48
Algorithm Similarity

input S,T,m,n
output M
for i ?1 to m do
M(i,0) ?i g
for j ?0 to n do
M(0,j) ? j g
for i ? 1 to m do
for j ?1 to n do
M(i,j) ? max( M(i-1,j)g
M(i-1,j-1)p(i,j)
M(i,j-1)g )
return M
Complexity O(nm)

49
Align (i, j, len)input i, j, array
M obtained by Similarity Alg.output alignment
in align-s align-t, vectors of length lenif i
0 and j 0 then len 0 else if i gt 0 and
Mi, j Mi-1, j cs then Align
(i-1,j,len) len len1 align-s
si align-t (space) else if i gt
0 and jgt0 and Mi, j Mi-1, j-1 c(i,j)
then Align (i-1,j-1,len) len
len1 align-s si align-t tj
else Align (i,j-1,len) len len1
align-s (space) align-t tj

First call Align(m,n,len)
max (s,t) ? len ? m n
Algorithm Align finds solution1.
By inverting the order of the if statements it
is possible to find
the other solutions.

50
Complexity

Algorithm Similarity takes O( m n) time and
space
Algorithm Align takes O (m n) time
Let hmn
T(h) k for h ? 2
T(h) T(h-1) k, for h gt 2 (k
constant)
T(h) O(h) O(mn)
Algorithm Similarity can be refined to run with
O(mn) space.
In a row by row computation store the last
and the current
row only.
Algorithm Align can be designed to run with
O(mn) space with a
divide and conquer strategy.
It is not a trivial task!

The basic algorithm Similarity can be modified
to solve a variety of different problems!!
51
More about similarity and distance
52
Similarity and distance

Two approaches to comparing strings
Similarity measures how much the strings are
alike
Its definition derives from the concept of one
ancestral ancient DNA
An alignment (s,t) of the strings s and t is
obtained by inserting space characters in them in
such a way that
st
Removal of - from s gives s
Removal of - from t gives t
For every i, either si or ti is not
A scoring system (p,g) has members
pAxA-gtR, glt0
additive scoring
sim(s,t)max score(s,t)

53
Similarity and distance

Distance measures how much the strings differ
Its definition derives from the concept of
mutations
A distance d on E is dExE-gtR
d(x,x)0 for all x in E and d(x,y)gt0 for xltgty
d(x,y)d(y,x) for all x,y in E
d(x,y)ltd(x,z)d(y,z) for all x,y in E
An allignment is obtained by successive
applications of a number of admissible operations
transforming s into t
substitution a-gtb
insertion or deletion of any character (indel)
A cost measure (c,h) has members
cAxA-gtR, hgt0

When are similarity and distance algorithms
equivalent?
When sequences are aligned by distance in global
alignment, there is a similarity algorithm that
gives the same set of optimal alignments, and
vice versa
The measures are related by the formula
p(a,b)M-c(a,b) g-hM/2
dist(s,t)sim(s,t)M/2(st)
Es.
Edit distance, M0gt p(a,a)0, p(a,b)-1, g-1
M2gt p(a,a)2, p(a,b)1,
g0
Same set of optimal solutions, different scores.
Usually 0ltMltmax c(a,b)

aligned letterslt-gtsubstitutions l
spaceslt-gtindel operations r
score(s,t) ?il p(ai,bi)rg
cost(s-gtt) ?il c(ai,bi)rh
score(s,t)cost(s-gtt)lMr M/2
if global alignment st2lr,
score(s,t)cost(s-gtt)M/2(st)
dist(s-gtt)min(M/2(st)- score(s,t))
M/2(st)- sim(s,t)

56
Semi-global alignment
57
Semi-global Comparison

Find the best fit of a short sequence t of size n
into a larger sequence s of size m
s1 sk
sl sm
s
t
The solution to this problem as formulated above
will take time proportional to
?k1..m ?lk..m n(l-k)O(nm3)

58
(Exact matching)

Problem given a pattern p and a larger string s,
find all the occurrences of the pattern p in s
Is there an occurrence?
How many times p occurrs in s?
Naive method
Boyer-Moore Algorithm/Knuth-Morris Pratt Algorithm

59
Semi-global Comparison
Ignore the spaces at the beginning and at the end
of a sequence. Problem Find the highest score
semi-global alignment between t and substring
(prefix of a suffix) of s. s
CAGCA -CTTGG ATTCTCGG t - - -
CAGCTTGG(- - - - - - - - 1. Ignore final spaces.
Find the best score between t and a prefix of
s.
Mi,j of problem1 contains the best score
between s1..i and t1..j, hence take the
maximum value Mi,n in the last column n.
There is no need to reach the last row.
60
Semi-global Comparison
s CAGCA -CTTGG ATTCTCGG t - - -)CAGCTTGG - - -
- - - - - 2. Ignore initial spaces Find the
best alignment between t and a suffix of
s. Mi,j now contains the best score between
t1..j and a suffix of s1..i, hence in the
first column we have all zeroes.
C A G C . C
0 A 0 G 0
C 0 1 A
Initial char !!Join solutions 1 and 2 to
solve semi-global
comparison!!
61
Local Alignment
62
Local Alignment

Find the best fit between a sub-string of s and
a sub-string of t.
s1 sk
si sm
s
t
t1 th tj
tn
Motivation Ignore streches of non-coding DNA

63
Local Alignment Algorithm SmithWaterman
Global Alignment Local Alignmentbetter
alignment to find conserved segment
--T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-G
AC

AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT-
-C
tccCAGTTATGTCAGgggacacgagcatgcagag
ac

aattgccgccgtcgttttcagCAGTTATGTCAGatc
64
Local Alignment Example
Local alignment
Global alignment
65
Local Alignment Algorithm SmithWaterman

The LA problem is still solved computing M.
Mi,j holds the value of the best alignment
between a suffix
of s1..i and a suffix of t1..j.
The first row and the first column are
initialized with zeros.

66
Local Alignment
Mi,j-1 - 2
1 if si tj Mi,j max Mi-1,j-1
p(i,j), p(i,j)
Mi-1,j -2 -1 if si ?
tj 0
For any entry Mi,j there exists always the
alignment between the empty suffixes of s1..i
and of t1..j with score 0 At the end choose the
entry Mi,j with maximal score in any position.
Start align tracing back, as before, from there
until you find a value 0.
67
Example
HEGAWGHEE PAW HEAE
68
End free-space alignment -Motivation

Find the best fit of substrings of s and t, where
at least one of these substrings must be a prefix
of the original string and one must be a suffix?
Motivation in the shotgun sequence assembly
procedure, one has a large set of partially
overlapping substrings that come from many copies
of one original but unknown DNA sequences.
The problem is to use comparisons of pairs of
substrings to infer the correct original string.

69
End free-space alignment
70
Example
HEGAWGHEE PAW HEAE
71
Kinds of Alignment
72
Complexity of Alignments
The space complexity could be a critical
bottleneck. How we can improve such a
complexity? Linear-Space Alignment Hirschbergs
algorithm -- Miller and Myers algorithm
73
Extensions to the basic algorithm

Hirschbergs linear space method for alignment
uses a divide-et-conquer strategy

The computation of sim(s,t) can be easily done in
linear space
Each row (or column) depends only on the
preceding one
Optimal alignment in linear space?
Divide et conquer strategy divide the problem
into smaller subproblems and combine their
solutions to the solution of the whole problem

temp
0 A G C
old
0 AAA C
0
-2
-4
-6
a0
1
-2
1 -3
BestScore(s1..m,t1..n,a)
-4 -1 0 -2
-6 -3 -2 -1 -8 -5 -4 -1
75

The idea is to consider the possible
configurations of an alignment
1.
(s1..i-1 si si1..m
- - t1..n
for j0)
s1..i-1 si si1..m
t1..j-1 tj tj1..n
for 1ltjltn-1
(s1..i-1 si si1..m
t1..n - -
for jn)

2.
s1..i-1 si si1..m
t1..j - tj1..n
for oltjltn

An optimal alignment must satisfied also
Optimal(s1..i-1, t1..j-1) (si,tj)
Optimal(si1..m,tj1..n)
for 0ltjltn, in case 1
or
Optimal(s1..i-1, t1..j) (si,-)
Optimal(si1..m,tj1..n)
for 0ltjltn, in case 2
For a fixed i,
align s1..i-1 with a prefix of t, and si1..m
with a suffix of t, for 0ltjltn
Take im/2 M(m,n)maxj ( M(m/2,j)Mr(m/2,n-j) )

77
Recursive algorithm

Compute BestScore(s1..m/2,t1..n,a)
-gtM(m/2,n)
Compute BestScore(sm/2..m,tn..1,b)
-gtMr(m/2,n)
Find the column k so that
M(m/2,k)Mr(m/2,n-k)M(m,n)
Recursively partition the problem to two
subproblems
Find the path from (0,0) to (m/2,k)
Find the path from (m,n) to (m/2,n-k)
Space complexity min(m,n)
At each step save a and b, and the traceback
pointers for the cells in a and b, and then for
(m/2,k) only
Time complexity T(m,n)number of times a best
score is computed
T(m,n)2T(m,n)T(m/2,k)T(m/2,n-k)O(mn)

78
k
m/2
m/2
n-k
79
Gap penalty
80
Gap penalty function
Gap consecutive number (kgt1) of spaces. From
Biology we know that when mutations are involved,
gap of k spaces are more probable than k isolated
spaces. One concrete example is given by the
c-DNA matching. In the previous problems the cost
w(k) of k internal consecutive spaces was
proportional to k, w(k) k g. Now w(k) h
kg where h g is the cost of the first space
of a gap and g
the cost of the following ones, kgt1.
CA-----CTTGG
hg
g
w(k) h 5g
g
g
g
gap
81

Attention!
The scoring system is no more additive, i.e. we
cannot break an alignment in two parts and expect
the total score to be the sum of the partial
scores
AAC- - -AATTC C GACTAC
ACTACCT - - - - - - CGC - -
The scoring of an alignment is done at the block
level

82
Similarities with gap
We need three matrices a, b, c, with the
following meaning ai,j maximum score of an
alignment between s1..i and t1..j
where si is matched with tj. bi,j maximum
score of an alignment between s1..i and t1..j
that ends in a - aligned with
tj. ci,j maximum score of an alignment
between s1..i and t1..j that ends
in si aligned with a - . Where
ai-1,j-1
ai,j p(i,j) max
bi-1,j-1
ci-1,j-1
ai,j-1 -(hg)
ai-1,j -(hg) bi,j max bi,j-1-g
ci,j max
(bi-1,j-(hg) ) (ci,j-1
-(hg))
ci-1,j-g
First space
83
Initialization a0,0 0, ai,0 - ? for 0 ?
i ? m, a0,j - ? for 0?j ? n bi,0 - ?
for 0 ? i ? m b0,j -(hgj) for 0 ? j ?
n ci,0 -(hgi) for 0 ? i ? m c0,j - ?
for 0 ? j ? n

am,n Final result
Get the maximum among bm,n

cm,n
Trace back to obtain the optimal alignment,
remembering the current position and which array
belongs to. Time
O(mn) Space 3(mn) O(mn)
84
Other gap penalty models

Constant.
Affine.
Convex
each additional space in a gap contributes less
to the gap weight than the previous space (ex.
Log(q))
the problem is solvable in O(nm log(m)) time
Arbitrary
Any gap weight function is acceptable
the problem is solvable in O(nm (mn)) time

85
Multiple alignment
86
Multiple Alignment

Compare multiple sequences s1, s2, . . . , sk
over the same alphabet
Insert spaces to make them of the same size.
MQPI LLL
M LR - LL-
MK - I LLL
MP P V LIL
no column is made exclusively of spaces
SCORING IS MORE COMPLEX!

87
Motivations

Fact evolutionary and functionally related
molecular strings can differ significantely and
yet preserve the same 3-d and 2-d structure
It is a very important tool to extract and
represent biologically important similarities
(critical common patterns) from a set of several
strings.
Critical patterns may suggest evolutionary
history, a common structure of the protein
product, or a common biological function.
Since critical patterns can be widely dispersed
or faint, they may be not apparent in the
comparison of two strings (ex. hemoglobin).
Such patterns are used to characterize families
of proteins (famility representations)
These representations can be used to identify
other potential members of the family

88
SP(Sum of Pairs)-measure

Possible Scoring Function
SP-function Sum of pair-wise scores of all
pairs of symbols in the
column
SP-score (I, -, I, V) p(I,-) p(I,I) p(-,V)
p(I,V) p(-,I) p(I,V)
p(a, b) pair-wise score of char a and b. a
or/and b can be space.
p(- ,-) 0 (common practice).

89
From multiple alignment Alg. We can also derive
pairwise alignment by just removing columns made
of all spaces. 1.
PEAALYGRFT - - - IKSDUW
2. PEAALYGRFT - - - IKSDUW
3. PESLAYNKF - - -S IKSDUW
4. PEALNYG RY - - -SSESDUW
5. PEALNYGWY - - -SSESDUW
6. PEVIRMQDD
NPFSFQSDYY Projection
2. PEAALYGRF T- - - IKSDUW
4. PEALNYGRY - - -SSESDUW Property if
p(-,-) 0, then for a multiple alignment ?, we
have SP-score (?) ??
score (?ij)
iltj Where ?ij is the pairwise
alignment induced by ? on si and sj
90
Combining Optimal Pairwise Alignments into
Multiple Alignment
In the optimal multiple alignment is not true
that induced pairwise alignments ?ij are
optimal for all i,j
91
Compute Optimal Alignment (O.A.) according to
SP-score
Dynamic Programming algorithm on k-dimensional
array M of size n x n x n nk for k sequences

Mi1, i2, , ik holds the score of the O.A. of
s1 .. i1, , s1 .. ik.
Each entry of M depends on 2k -1 other entries,
one for each possible configuration of the
current column.
A A A A -
B B B B .. -
forbidden configuration
C C - - -
D - D - -
To compute SP-score O(k2) steps are required to
sum the scores
of the k(k-1) pair-wise alignment.
In total O(nk 2k k2) . O.A. is
NP-complete(WangJiang)

92
We develop the exact algorithm only for k3.
Consider 3 sequences s1, s2, s3 of size n1, n2,
n3. Initialization M 0,0,0
0 M i,j,0 sp-score(i,j)
(i j) p(x,-) M i,0,k
sp-score(i,k) (i k) p(x,-)
M 0,j,k sp-score(j,k) (j k) p(x,-)
MQPILLL
MQP MLR- LL-
M3,3,0 MLR
MK- ILLL - - -
c(M,M)c(Q,L)c(P,R)c(M,-) c(Q,-) c(P,-)
c(M,-)
c(L,-) c(R,-).
93
Mi,j,k is the maximum value among 7 values
S
A

V S N S
S N A
- - - A S

A
N
S
S
V
S
N
94
Multiple Alignment Projections
A 3-D alignment can be projected onto the 2-D
plane to represent an alignment between a pair of
sequences.
All 3 Pairwise Projections of the Multiple
Alignment
95
Algorithm Optimal Alignment of 3 sequences
For i 1 to n1 for j 1 to n2
for k 1 to n3 if s1(i) s2(j) then
c(i,j) cmatch else
c(i,j) cmis if s1(i) s3(k) then c(i,k)
cmatch else c(i,k)
cmis if s2(j) s3(k) then c(j,k) cmatch
else c(j,k) cmis
m1 Mi-1,j-1,k-1 c(i,j) c(i,k)
c(j,k) m2 Mi-1,j-1,k c(i,j) 2csp
m3 Mi-1,j,k-1 c(i,k) 2csp 1
space on 1 sequence m4 Mi,j-1,k-1
c(j,k) 2csp m5 Mi-1,j,k 2csp
m6 Mi,j-1,k 2csp 1 space on two
sequences m7 Mi,j,k-1 2csp
Mi,j,k max (m1, , m7)
O(n1n2n3)

96
Heuristic Algorithms

Star alignment (do not yield any guarantee of
quality of the resulting)

97
Star Alignment-Gusfield

Build a multiple alignment ? based upon the
pairwise alignments between a fixed sequence, the
center of the star, and the others
is such that its projections ?ij are optimal when
either i or j is the center
Select one of the sequences as center, say sc
Compute an optimal alignment between sc and si,
for all i
Merge the alignments by inserting spaces in the
sequences

98
s4
s3
s5
sc
s2
sk

How selecting the center sc?
Try with all and then pick the best
Compute all the pairwise alignments and take the
best
How inserting the spaces? once a gap in sc,
always a gap
Example
Let s1 be the center
ATTGCCATT
ATGGCCATT
ATCCAATTTT
ATCTTCTT
ACTGACC

s1
99

ATTGCCATT
ATGGCCATT
ATTGCCATT - -
ATC- CAATTTT
ATTGCCATT - -
ATCTTC TT - -
ATTGCCATT - -
ACTGACC - - - -

ATTGCCATT - -
ATGGCCATT - -
ATC- CAATTTT
ATCTTC TT - -
ACTGACC - - - -

100

Define the score in terms of distance d instead
of similarity the triangle inequality holds
Complexity
Computing the center as the best sequences
minimizing ?id(Si,Sc), takes O(k2 n2)
Computing and merging the pairwise alignment
takes
?i1 k-1O((i n)n)O(k2 n2)

101

Let M be denotes the optimal multiple alignment
and M the computed
multiple alignment
d(Si,Sj) is the score of the pairwise alignment
of Si,Sj induced by M
d(M) is the sum of d(Si,Sj) for iltj
Theorem d(M) ? (2-2/k)d(M)lt2 d(M)
Computational experiments show that the
approximation is overly pessimistic
Finding solutions are within 15 from the optimum
on average

102
The profile

Given a multiple sequence alignment M involving N
sequences of length l a profile P is a l ?
S?- matrix whose columns are probability
vectors denoting the frequencies of each symbol
in the corresponding alignment column.
Example
A-GTTTA
AC-TTA--
ACGTAAG-
-C-TA---

103
(No Transcript)
104
(No Transcript)
105
(No Transcript)
106
Aligning a string to a profile

Given a profile P and a string S, how well does S
fit P?
P can be considered as a string of profile
columns positions, and aligning S to P consists
in inserting spaces into S and P.
Scoring a char of S to every char in a column of
P
Example a,a -gt2 a,- and a,b -gt -1 a,c-gt -3
S aabbc P
s(a,1)0.75x2 0.25x3
is the contribute of the first column of P
A dynamic programming algorithm permits to
compute the optimal al.