Title: Pairwise Alignments and Sequence Database Searches: Algorithms
1Pairwise Alignments and Sequence Database
Searches Algorithms
- BIOINFORMATICS I
- Protein and DNA Sequence Analysis
- Jaime E. Ramirez-Vick, PhD
2Sequence Analysis - Overview
3Types of Database Searches
- Sequence against database of abstractions from
sequences - Sequence against database of sequences
- Abstraction from your sequence against a database
of sequences - Sometimes against a database of abstraction from
sequences - Sequence database searches are carried out by
doing pairwise alignments
4Focus What to Remember
- The different kinds of alignments and algorithms
- The model of sequence evolution implied in
database searching and sequence alignment -
Scoring Matrices - Differences, in terms of restrictions placed on
the model of sequence evolution, among the three
main database searching (local alignment)
algorithms - The nature of the knowledge or information
encoded in amino acid or nucleotide similarity
matrices - The effect of different gap penalty choices on
the model of sequence evolution - The relationship between the amount of
information in an alignment and statistical
significance
5What are you doing?????
- Database search
- Separate the library sequences that are related
to your query sequence (by some model) from the
sequences that are not. - SEQUENCE ALIGNMENTS
- Alignment
- Represents our best guess of the history of
substitutions at each position.
6What is a sequence alignment?
- THE PROBLEM
- Given two sequences of letters, and a scoring
scheme for evaluating matching letters, find the
optimal pairing of letters from one sequence to
letters of the other sequence. - Align
- THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
- THIS IS A SHORT SENTENCE.
- Alignments
- THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
- THIS IS A SHORT SENTENCE.
- OR
- THIS IS A SHORTSENTENCE.
From Russ Altman - Stanford University
7Why align sequences?
- Lots of sequences with unknown structure and
function. A few sequences with known structure
and function. - If they align, they are similar, maybe due to
common descent. - If they are similar, then they might have the
same structure or function. - If one of them has known structure/function, then
alignment to the other yields insight about how
the structure or function works.
From Russ Altman - Stanford University
8Database Searching An Overview
- Database searching is the application of
previously determined experimental knowledge to
the problem of discovering the biochemistry and
physiology of a newly discovered gene or its
protein product. - This involves finding an homologous gene or
protein - one related to our newly determined
sequence through a common evolutionary ancestor.
9Knowledge in Database Searching
- Search programs Smith-Waterman, BLAST, FASTA
- Each is a mathematical method for finding the
best fit between two sets of discreet data (e.g.,
biological sequences) - Any limitations in finding the best fit between
sequences can be interpreted as restrictions on
the model of evolution relating the sequences and
what kinds of changes are allowed. - Similarity scores for alignments PAM, Blosum,
DNA - For a given evolutionary divergence what
substitutions are likely to be observed and which
are unlikely to be observed - Sequence Databases GenBank, PIR, Swiss-Prot,
UNIPROT - Many sequences where we already have knowledge of
their biochemistry or physiology - Conserved features in the sequence and their
functional role
10The Model of Sequence Evolution
- The sequences being sought with the query
sequence have an ancestral sequence in common
from which they and the query sequence have
descended. - Our best guess of the actual path of evolution is
the path that requires the fewest evolutionary
events. - All substitutions are not equally likely and
should be weighted to account for this. - Insertions and deletions are less likely than
substitutions and should be weighted to account
for this.
11Database Searching The Modelwith the
assumption of reversibility
Unobserved Ancestral Sequence
Descendant 1 gctggaaggcat
Descendant 2 gcagagcact
12Database Searching The Choices
- The choice of search algorithm determines which,
if any, restrictions you place on the
evolutionary model of the database search this
determines the sensitivity, selectivity, and
computational requirements - The choice of scoring table (similarity matrix)
determines the degree of divergence and pattern
of substitutions that will be found - The choice of gap penalty influences the shape of
the distribution of alignment scores for random
sequences and hence your ability to distinguish
homologous from nonhomologous sequences
13Drawing alignments
- Exact Matches OK, Inexact Costly, Gaps cheap.
- This is a rather longer sentence than the next.
- This is a sentence.
- OR
- This is a rather longer sentence than the next.
- This is a shortsentence.
- Exact Matches OK, Inexact Moderate, Gaps cheap.
- This is a rather longer sentence than the next.
- This is a short sentence.
- Exact Matches cheap, Inexact cheap, Gaps
expensive. - This is a rather longer sentence than the next.
- This is a short sentence.
From Russ Altman - Stanford University
14Outline of Database Searching
- Algorithms
- Dynamic Programming (Needleman-Wunsch,
Smith-Waterman, Sellers) - Heuristics (FASTA, BLAST)
- Similarity Scores
- Information theory (log-odds) and entropy
(information content in bits) - Approaches for creating matrices (PAM vs. BLOSUM,
Protein vs. Nucleic acid) - Efficiency multiple matrices?
- Gaps
- InDel Insertion/Deletion
- Statistical Significance
- Database as a reference distribution
- Information content of the alignment
15Database Searching Algorithm
- Dynamic Programming Algorithm
- required computational resources are proportional
to the products of the lengths of the sequences - more sensitive, less selective
- rigorous - finds the best possible alignment
given a set of scores - Needleman-Wunsch (global)
- Smith-Waterman (local)
- Sellers (quasi-global)
- Heuristics
- less sensitive, more selective, not rigorous -
local minima problem - local alignments only
- FASTA (Word search)
- BLAST (Maximal Segment Pairs)
16Search Programs Fitting the Model
- Needleman-Wunsch
- Aligns the entire lengths of a pair of sequences,
thus assumes they are related over their entire
lengths - Places the fewest restrictions on the model of
sequence evolution implicit in the scores and
context - Mathematically rigorous - all possible alignments
are considered - Computationally intensive requires memory and
time proportional to the product of the lengths
of the sequences
17Dynamic Programming Needleman-Wunsch
NWi,j max NWi-1,j-1 s(ai,bj) NWi-1,j g
NWi,j-1 g
NWi,j is the cumulative similarity of the
alignment up to residue i in sequence A and
residue j in sequence B
s(ai,bj) the similarity score for aligning
residue ai with residue bj (e.g., from a PAM of
Blosum table)
g is the gap penalty, g open gap extend gap
gap length
18Computation Grid for Needleman-Wunsch
a5
a6
a1
a2
a3
a4
(open gap 3 extend gap)
b1
b2
(open gap 2 extend gap)
NW3,3 s(a4,b4)
b3
(open gap extend gap)
b4
NW4-l,4 g l 1, 2, or 3
b5
19Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
- 0 G C T G G A A G G C A
T - 0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70
-77 -84 - G -7 5 -2 -9
- C -14 -2 10
- A -21 -9
- G -28
- A -35
- G -42
- C -49
- A -56
- C -63
- T -70
20Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
21Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
22 Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
23Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
- 0 G C T G G A A G G C A
T - 0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70
-77 -84 - G -7 5 -2 -9 -16 -23 -30 -37 -44 -51 -58
-65 -72 - C -14 -2 10 3 -4 -11 -18 -25 -32 -39 -46
-53 -60 - A -21 -9 3 6 -1 -8 -6 -13 -20 -27 -34
-41 -48 - G -28 -16 -4 -1 11 4 -3 -10 -8 -15 -22
-29 -36 - A -35 -23 -11 -8 4 7 9 2 -5 -12 -19
-17 -24 - G -42 -30 -18 -15 -3 9 3 5 7 0 -7
-14 -21 - C -49 -37 -25 -22 -10 2 5 -1 1 3 5 -2
-9 - A -56 -44 -32 -29 -17 -5 7 10 3 -3 -1
10 3 - C -63 -51 -39 -36 -24 -12 0 3 6 -1 2 3
6 - T -70 -58 -46 -34 -31 -19 -7 -4 -1 2 -5 -2
8
G C T G G A A G G C A - T G
C - A G - A - G C A C T
24Global vs. Local Alignment
- Global alignment find alignment in which total
score is highest, perhaps at expense of areas of
great local similarity. - Local alignment find alignment in which the
highest scoring subsequences are identified, at
the expense of the overall score. - Local alignment can be done with minor
modifications of the global alignment algorithm!
From Russ Altman - Stanford University
25Dynamic Programming Smith-Waterman
- Finds local alignments - the best matching
regions in a pair of sequences rather than
aligning their entire lengths. Ignores poorly
matching regions. - Mathematically rigorous - finds the best
alignment between two sequences given the choice
of similarity scores and gap penalties. - Solves the problem of aligning sequences that
share a functional or structural domain in common
but are not related over their entire length. - Requires computational resources proportional to
the product of the lengths of the sequences.
26Dynamic Programming Smith-Waterman
SWi,j max SWi-1,j-1 s(ai,bj) SWi-1,j g
SWi,j-1 g 0
SWi,j is the cumulative similarity of the
alignment up to residue i in sequence A and
residue j in sequence B
s(ai,bj) the similarity score for aligning
residue ai with residue bj (e.g., a PAM of
Blosum table)
g is the gap penalty, g open gap extend gap
gap length
0, zero, prevents poorly matching regions from
hiding good matching regions
27Computation Grid for Smith-Waterman
a5
a6
a1
a2
a3
a4
(open gap 3 extend gap)
b1
b2
(open gap 2 extend gap)
SW3,3 s(a4,b4)
b3
(open gap extend gap)
b4
SW4-l,4 g l 1, 2, or 3
b5
28Smith-Waterman Alignment PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
- 0 G C T G G A A G G C
A T - 0 0 0 0 0 0 0 0 0 0 0 0
0 0 - G 0 5 0 0 5 5 0 0 5 5 0
0 0 - C 0 0 10 3 0 1 1 0 0 1 10
3 0 - A 0 0 3 6 0 0 6 6 0 0 3
15 8 - G 0 5 0 0 11 5 0 2 11 5 0
8 11 - A 0 0 1 0 4 7 10 5 4 7 1
5 4 - G 0 5 0 0 5 9 3 6 10 9 3
0 1 - C 0 0 10 3 0 2 5 0 3 6 14
7 0 - A 0 0 3 6 0 0 7 10 3 0 7
19 12 - C 0 0 5 0 2 0 0 3 6 0 5
12 15 - T 0 0 0 10 3 0 0 0 0 2 0
5 17
G A A G - G C A G C A G A G C A
29SW Alignment Waterman-Eggert Extension
(MaxSegs) PAM 47, Match 5, Mismatch -4, Open
Gap 0, Extend Gap -7
- 0 G C T G G A A G G C
A T - 0 0 0 0 0 0 0 0 0 0 0 0
0 0 - G 0 5 0 0 5 0 0 0 5 5 0
0 0 - C 0 0 10 3 0 1 0 0 0 1 10
3 0 - A 0 0 3 6 0 0 6 0 0 0 3
15 8 - G 0 5 0 0 11 5 0 2 0 5 0
8 11 - A 0 0 1 0 4 7 10 5 0 0 1
5 4 - G 0 5 0 0 5 9 3 6 10 0 0
0 1 - C 0 0 10 3 0 2 5 0 3 6 0
0 0 - A 0 0 3 6 0 0 7 10 3 0 2
0 0 - C 0 0 5 0 2 0 0 3 6 0 5
0 0 - T 0 0 0 10 3 0 0 0 0 2 0
0 0
1ST (Score 19) G A A G - G C A G C
A G A G C A
2ND (15) G C A G C A
3RD (11) G C T G G C A G
30Sellers (Quasi-Global) Alignment PAM 47, Match
5, Mismatch -4, Open Gap 0, Extend Gap -7
- 0 G C T G G A A G G C
A T - 0 0 0 0 0 0 0 0 0 0 0 0
0 0 - G 0 5 -2 -4 5 5 -2 -4 5 5 -2
-4 -4 - C 0 -2 10 3 -2 1 1 -6 -2 1 10
3 -4 - A 0 -4 3 6 -1 -6 6 6 -1 -6 3
15 8 - G 0 5 -2 -1 11 4 -1 2 11 4 -3
8 11 - A 0 -2 1 -6 4 7 9 4 4 7 0
1 4 - G 0 5 -2 -3 1 9 3 5 9 9 3
-4 -3 - C 0 -2 10 3 -4 2 5 -1 2 5 14
7 0 - A 0 -4 3 6 -1 -5 7 10 3 -2 7
19 12 - C 0 -4 1 -1 2 -5 0 3 6 -1 3
12 15 - T 0 -4 -6 5 -2 -2 -5 -4 -1 2 -4
5 17
NWi,j max NWi-1,j-1 s(ai,bj) NWi-1,j g
NWi,j-1 g
G A A G - G C A - T G C A G A G C
A C T
31Comparison of Alignments from the Algorithms PAM
47, Match 5, Mismatch -4, Open Gap 0,
Extend Gap -7
- 0 G C T G G A A G G C
A T - 0
- G
- C
- A
- G
- A
- G
- C
- A
- C
- T
Needleman-Wunsch (global)
G C T G G A A G G C A - T G C - A G - A - G C A C
T
Smith-Waterman (local)
1ST (Score 19) G A A G - G C A G C A G A G C
A
2ND (15) G C A G C A
3RD (11) G C T G G C A G
Sellers (Quasi-Global)
G A A G - G C A - T G C A G A G C A C T
32Dynamic Programming Algorithms
- Place no restrictions on the model of evolution
- Mathematically rigorous -- finds maximum
similarity - Computationally demanding
- CPU scales ? 3 length sequence 1 length
sequence 2 - Memory scales ? length sequence 1 length
sequence 2 - Algorithmic Variants
- Global -- Needleman-Wunsch aligns entire
lengths of sequences - Quasi-global -- Sellers fits a short sequence
into a longer sequence - Local -- Smith-Waterman finds most similar
regions
33Comparison of Algorithms
Sequence 1 -gt
- Needleman-Wunsch
- Global alignment
- Must start in upper left corner and finish in
lower right corner. - Smith-Waterman
- Local Alignment
- Small regions of high similarity
- Sellers
- Quasiglobal alignment
- Must start on left or top edge and finish on
right or bottom edge - In well behaved cases, forces an alignment of
smaller sequence into larger sequence
Sequence 2 ?
Global
Local-1
Quasi- global
Local-2
34Coagulation pre-proenzymes
35Heuristic Programs FASTA and BLAST
- Does a fast initial screen of each database
sequence to see if the similarity with the query
sequence is high enough to warrant a more
complete examination. - The initial screen is based on finding highly
similar, small, fixed length segments of sequence
called words or tuples. - This initial screen can be thought of as a
restriction on the model of sequence evolution
relative to that embodied in dynamic programming
algorithms. - Entails some loss of sensitivity and sometimes a
gain in selectivity over dynamic programming.
36Generalized Heuristic Algorithm Structure
- Initial Word Search
- Exact
- Similarity
- First evaluation Threshold length or score
- Initial Alignment
- Join words or
- Extend words toward maximal segment pair
- Final, optimized alignment
- Join MSPs
- Banded Smith-Waterman (Fasta, prior BLASTs)
- Centered Smith-Waterman (BLAST 2)
37Search Programs Fitting the Model
- FASTA
- Restricts the evolutionary model to require a few
unmutated dipeptides or hexanucleotides - Carefully considers only alignments including the
unmutated sites - The window size limits the maximum cumulative
gaps - Plus or minus 32 residues in the GCG
implementation - Much faster, more selective, and less sensitive
than Smith-Waterman searches for sequences with
several identical dipeptides or hexanucleotides
38A Word List for FASTA, Word Size 6
-
- G C T G A A G G C A T
- G C T G A A
- C T G A A G
- T G A A G A
- G A A G G C
- A A G G C A
- A G G C A T
39FASTA Algorithm
- 4 steps
- use lookup table to find all identities at least
ktup long, find regions of identities (Fig.A) - rescan 10 regions (diagonals) with highest
density of identities using PAM250 (Fig.B) - join regions if possible without decreasing score
below threshold (Fig.C) - rescore ala Smith-Waterman 32 residues around
initial region (Note doesnt save alignment)
(Fig.D) - Pearson and Lippman 1988
40Search Programs Fitting the Model
- BLAST
- Requires the evolutionary model to include a site
with a long stretch of mostly identities and
conservative substitutions, with very few
unlikely substitutions and no insertions or
deletions - Less sensitive and more selective than
Smith-Waterman when the evolutionary model is
appropriate - Intended for proteins - BLAST is the least
effective algorithm for nucleic acid sequences - The restrictions allow rigorous statistical
evaluation of the results of the database search - Recent extension allows insertions and deletions
through a second pass Smith-Waterman alignment
41BLAST Algorithm
- 3 steps
- Compile a list of high-scoring words
- scan database for hits
- extend hits
42BLAST Algorithm
43BLAST Algorithm
44BLAST Algorithm
45BLAST 2 - Improved Sensitivity
- BLAST 2 requires two different words to match
- Both matches must be on the same diagonal of the
path graph - The matches must not overlap
- The pair of matches must be within a limited
number of amino acids of one another, typically
40 amino acids along the diagonal - This allows the threshold score for each match to
be lowered from the value used in the original
BLAST - These matches are extended as in the original
BLAST - If the resulting HSP score is high enough a
limited Smith-Waterman alignment is done.
46BLAST 2 Smith-Waterman
- Finds the highest scoring 11 amino acid segment
in the HSPs from the first pass BLAST2 analysis - Takes the middle aligned pair of amino acids from
the segment as the starting point for the
Smith-Waterman alignment - Extends the alignment in both directions until
the alignment score drop to some limit X below
the highest Smith-Waterman score calculated up to
that point for the current pair of sequences - This allows odd shaped areas rather than the band
used in FASTA and the earlier gapped BLAST