Title: Sequence alignment and similarity search the gory details
1Sequence alignment and similarity searchthe gory
details
- Program
- Exact algorithms to find best local or global
alignment - Effect on scoring systems on alignment properties
- Amino acid substitution matrices
- Heuristic search algorithms
EPFL Bioinformatics I 31 Oct 2005
2Needleman-Wunsch algorithm for finding the
optimal global alignment
Principle The optimal alignment between
sequences a1 aN and b1 bM is one of the
following three alignments a) the optimal
alignment between a1 aN-1 and b1 bM-1
extended by a match step b) the optimal alignment
between a1 aN and b1 bM-1 extended by a insert
step b) the optimal alignment between a1 aN-1
and b1 bM extended by a delete step
A C C G C A T A G C A A C C C G C A T G G C -
match step
insert step
delete step
The optimal alignment score can be computed
recursively by computing the optimal alignment
scores for pairs of subsequences starting with
a1 and b1.
EPFL Bioinformatics I 31 Oct 2005
3Computation of optimal global alignment score
Recursive equations X0,0 0, for i 1 N
X0,i 0, for j 1 M Xj,0 0, for i 1 N,
j 1 M Xi,j max( Xi-1,j-1
s(ai,bj), Xi-1,j d Xi,j-1 d) Xopt
max(Xi,M,XN,j) Notation s(ai,bj) match score
for residue pair ai,bj d penalty for inserting
or deleting 1 residue
Scoring system match/mismatch 1/-1, indel -2
ACCG--CATAGCA ACCGCCCATGGC-
EPFL Bioinformatics I 31 Oct 2005
4Computing the optimal local alignment score
Recursive equations X0,0 0, for i 1 N
X0,i 0, for j 1 M Xj,0 0, for i 1 N,
j 1 M Xi,j max(0 Xi-1,j-1
s(ai,bj), Xi-1,j d Xi,j-1 d) Xopt
max(Xi,j,) Notation s(ai,bj) match score for
residue pair ai,bj d penalty for inserting or
deleting 1 residue
Scoring system match/mismatch 1/-1, indel -2
ACCGC ACCGC
EPFL Bioinformatics I 31 Oct 2005
5More on alignment types and alignment algorithms
Global sequence alignment algorithms were
introduced in Needleman SB, Wunsch CD (1970). A
general method applicable to the search for
similarities in the amino acid sequence of two
proteins. J Mol Biol. 1970 48, 443-453. Local
sequence alignment algorithms were introduced in
Smith TF, Waterman MS (1981). Identification of
common molecular subsequences. J Mol
Biol.147195-197. The two alignment types are
therefore also called Needleman-Wunsch and
Smith-Waterman algorithms. Global alignment
algorithms come in two flavors with end-gap
weighting without end-gap weighting (version on
previous slide) Note further The algorithms
presented on the previous slides are for simple
gap penalty functions of the type gap penalty
b gap_length (b 0) Similar algorithms exist
for gap penalty functions of the so-called
affine type gap penalty a b gap_length
(a, b 0)
EPFL Bioinformatics I 31 Oct 2005
6Finding multiple local alignments
- Problem
- Find the N best disjoint local alignments between
two sequences - Definition
- Two local alignments are disjoint if they have no
residue pair in common - Algorithm
- Find best local alignment using a standard
Smith-Waterman algorithm and record coordinates
of used residue pairs. - Find next best local alignment using a modified
Smith-Waterman algorithm under the constraint
that already used residue pairs cannot be used
anymore. - Repeat step 3 until N best alignments are found.
After each round, record coordinates of newly
used residue pairs.
EPFL Bioinformatics I 31 Oct 2005
7Effect of scoring system on local sequence
alignment
Simple scoring system for DNA sequences match
m, mismatch r, penalty for gap of length l
dl. The average score for aligned residues has
to be 0, therefore identity 100 -r /
(m-r) (with gaps, the percentage needs to be
even higher) Example m 1, r -3 ?
identity 100 3 / (13) 75 Note, however,
that with this scoring a system, an alignment of
exactly 75 identity and no gaps will still not
exceed a score of 0. In reality, optimal
alignments found with such a scoring system will
have substantially higher identity.
EPFL Bioinformatics I 31 Oct 2005
8Influence of scoring system on local alignments
Example 1
gttest1 test sequence 1 TAACTAATTCATACCAACAAAGAATGG
TTGGTCAAGAAGGGAAATCTGACACATGCCCAA GCTCATGGGACTTAGG
TGCATGACTGTGGGAAAGAAGTCAGTCACAAAGGCCAGAGGCTA CACGA
TTCCCTTCTGCAGGCTATCTGGAGGTTGCAGATTTATAGAAGCAGAGGAA
AAGGA
gttest2 test sequence 2 TCTACATACAATGCAATATTATTCAGC
CTTAAAAAGGAAAGCGCTTCTGACACAGGCTAC ACCATGGATAGTCCTT
GACGGCATATTGCCGAGTGAAATGAGCTAGTCACAAAAAGACAA ATACT
GTATGATTTCAACTTACATGAGGCACCTGGAGGAATGAAGTCCATAGAGA
CAGAA
Scoring system match 2, mismatch -2, gap
-4 gal_length
64.7 identity in 139 nt overlap score 48
test1 TCTGACACATGCCCAAGC-TC-ATGGGACTT-AGGTGCA
TGACTGT-G-G-GAAAGAAGT
test2
TCTGACACAGGCTACACCATGGATAGTCCTTGACG-GCAT-ATTGCCGAG
TGAAATGAGC test1 CAGTCACAAAG-GCCAGAGGCTACACGATTC
C--CTT-C-TGCAGGCTATCTGGAGGT-T
test2 TAGTCACAAAAAGACAAATACTGTATGATTTCAACTTACAT
G-AGGC-ACCTGGAGGAAT test1 GCAGATTTATAGAAGCAGA
test2
GAAG-TCCATAGAGACAGA
EPFL Bioinformatics I 31 Oct 2005
9Influence of scoring system on local alignments
Example 2
gttest1 test sequence 1 TAACTAATTCATACCAACAAAGAATGG
TTGGTCAAGAAGGGAAATCTGACACATGCCCAA GCTCATGGGACTTAGG
TGCATGACTGTGGGAAAGAAGTCAGTCACAAAGGCCAGAGGCTA CACGA
TTCCCTTCTGCAGGCTATCTGGAGGTTGCAGATTTATAGAAGCAGAGGAA
AAGGA
gttest2 test sequence 2 TCTACATACAATGCAATATTATTCAGC
CTTAAAAAGGAAAGCGCTTCTGACACAGGCTAC ACCATGGATAGTCCTT
GACGGCATATTGCCGAGTGAAATGAGCTAGTCACAAAAAGACAA ATACT
GTATGATTTCAACTTACATGAGGCACCTGGAGGAATGAAGTCCATAGAGA
CAGAA
Scoring system match 2, mismatch -3, gap
-4 gal_length
68.4 identity in 79 nt overlap score
22 test1 AGTCACAAAG-GCCAGAGGCTACACGATTCC--CTT-C-
TGCAGGCTATCTGGAGGT-TG
test2
AGTCACAAAAAGACAAATACTGTATGATTTCAACTTACATG-AGGC-ACC
TGGAGGAATG test1 CAGATTTATAGA-AGCAGA
test2 AAG-TCCATAGAGA-CAGA
EPFL Bioinformatics I 31 Oct 2005
10More on amino acid substitution matrices
Amino acid substitution matrices are based on
observed or expected residue pair
frequencies. The so-called Dayhoff or PAM
matrices are based on an evolutionary model
assuming amino acid pair specific substitution
frequencies. PAM stays for percent amino acid
change. Some substitutions are frequent, for
instance A?S, others are very rare, for instance,
W?D. PAM matrices are computed for a given
evolutionary distance, for instance the popular
PAM250 matrix corresponds to an evolutionary
distance of 2.5 substitutions per amino acid. The
parameters of the Dayhoff model are derived from
multiple alignments of closely related sequences
(PAM distance less than 10). The parameters of
the Blosum matrices are derived from gap-free
multiple alignments (block alignment) with a
fixed percentage of identity between sequence
pairs. For instance, Blosum62 corresponds to 62
sequence identity. The matrix scores are
computed as log-odds scores where pab is the
probability to observe an aligned residue pair
a,b, and qa, qb are the frequencies of amino
acids a and b, respectively, in the sequences.
EPFL Bioinformatics I 31 Oct 2005
11Heuristic similarity search algorithms
The Smith-Waterman algorithm takes about O(NM)
time to compute the best local alignment for two
sequences of length N and M. The O() notation
means proportional to the term inside the
parentheses. The Smith-Waterman algorithms is a
so-called rigorous algorithms guaranteed to find
the optimal solution to the problem. Blast is a
heuristic local similarity search algorithm not
guaranteed to find the best local alignment but
much faster. It uses computing time close to
O(NM). Algorithms like Blast are usually used
to compare one query sequence against many
sequences in a database. The algorithm uses a
fast method to find common k-words (subsequences
of length k) between two sequences. If there are
no common words, the sequence pair is not further
analysed. Otherwise, an attempt is made to extend
the word matches, first by an un-gapped
alignment, and finally by a Smith-Waterman-like
algorithm. For DNA searches, the word length of
Blast is 11 bases. Thus sequence matches which do
not share a subsequence of 11 identical bases,
will not be found.
EPFL Bioinformatics I 31 Oct 2005