Pairwise Alignments and Sequence Database Searches: Algorithms

About This Presentation

Title:

Pairwise Alignments and Sequence Database Searches: Algorithms

Description:

Search programs: Smith-Waterman, BLAST, FASTA ... Smith-Waterman Alignment. PAM 47, Match = 5, Mismatch ... SW Alignment Waterman-Eggert Extension (MaxSegs) ... – PowerPoint PPT presentation

Number of Views:75

Avg rating:3.0/5.0

Slides: 47

Provided by: jaimeram

Category:

more less

Transcript and Presenter's Notes

Title: Pairwise Alignments and Sequence Database Searches: Algorithms

1
Pairwise Alignments and Sequence Database
Searches Algorithms

BIOINFORMATICS I
Protein and DNA Sequence Analysis
Jaime E. Ramirez-Vick, PhD

2
Sequence Analysis - Overview
3
Types of Database Searches

Sequence against database of abstractions from
sequences
Sequence against database of sequences
Abstraction from your sequence against a database
of sequences
Sometimes against a database of abstraction from
sequences
Sequence database searches are carried out by
doing pairwise alignments

4
Focus What to Remember

The different kinds of alignments and algorithms
The model of sequence evolution implied in
database searching and sequence alignment -
Scoring Matrices
Differences, in terms of restrictions placed on
the model of sequence evolution, among the three
main database searching (local alignment)
algorithms
The nature of the knowledge or information
encoded in amino acid or nucleotide similarity
matrices
The effect of different gap penalty choices on
the model of sequence evolution
The relationship between the amount of
information in an alignment and statistical
significance

5
What are you doing?????

Database search
Separate the library sequences that are related
to your query sequence (by some model) from the
sequences that are not.
SEQUENCE ALIGNMENTS
Alignment
Represents our best guess of the history of
substitutions at each position.

6
What is a sequence alignment?

THE PROBLEM
Given two sequences of letters, and a scoring
scheme for evaluating matching letters, find the
optimal pairing of letters from one sequence to
letters of the other sequence.
Align
THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
THIS IS A SHORT SENTENCE.
Alignments
THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
THIS IS A SHORT SENTENCE.
OR
THIS IS A SHORTSENTENCE.

From Russ Altman - Stanford University
7
Why align sequences?

Lots of sequences with unknown structure and
function. A few sequences with known structure
and function.
If they align, they are similar, maybe due to
common descent.
If they are similar, then they might have the
same structure or function.
If one of them has known structure/function, then
alignment to the other yields insight about how
the structure or function works.

From Russ Altman - Stanford University
8
Database Searching An Overview

Database searching is the application of
previously determined experimental knowledge to
the problem of discovering the biochemistry and
physiology of a newly discovered gene or its
protein product.
This involves finding an homologous gene or
protein - one related to our newly determined
sequence through a common evolutionary ancestor.

9
Knowledge in Database Searching

Search programs Smith-Waterman, BLAST, FASTA
Each is a mathematical method for finding the
best fit between two sets of discreet data (e.g.,
biological sequences)
Any limitations in finding the best fit between
sequences can be interpreted as restrictions on
the model of evolution relating the sequences and
what kinds of changes are allowed.
Similarity scores for alignments PAM, Blosum,
DNA
For a given evolutionary divergence what
substitutions are likely to be observed and which
are unlikely to be observed
Sequence Databases GenBank, PIR, Swiss-Prot,
UNIPROT
Many sequences where we already have knowledge of
their biochemistry or physiology
Conserved features in the sequence and their
functional role

10
The Model of Sequence Evolution

The sequences being sought with the query
sequence have an ancestral sequence in common
from which they and the query sequence have
descended.
Our best guess of the actual path of evolution is
the path that requires the fewest evolutionary
events.
All substitutions are not equally likely and
should be weighted to account for this.
Insertions and deletions are less likely than
substitutions and should be weighted to account
for this.

11
Database Searching The Modelwith the
assumption of reversibility
Unobserved Ancestral Sequence
Descendant 1 gctggaaggcat
Descendant 2 gcagagcact
12
Database Searching The Choices

The choice of search algorithm determines which,
if any, restrictions you place on the
evolutionary model of the database search this
determines the sensitivity, selectivity, and
computational requirements
The choice of scoring table (similarity matrix)
determines the degree of divergence and pattern
of substitutions that will be found
The choice of gap penalty influences the shape of
the distribution of alignment scores for random
sequences and hence your ability to distinguish
homologous from nonhomologous sequences

13
Drawing alignments

Exact Matches OK, Inexact Costly, Gaps cheap.
This is a rather longer sentence than the next.
This is a sentence.
OR
This is a rather longer sentence than the next.
This is a shortsentence.
Exact Matches OK, Inexact Moderate, Gaps cheap.
This is a rather longer sentence than the next.
This is a short sentence.
Exact Matches cheap, Inexact cheap, Gaps
expensive.
This is a rather longer sentence than the next.
This is a short sentence.

From Russ Altman - Stanford University
14
Outline of Database Searching

Algorithms
Dynamic Programming (Needleman-Wunsch,
Smith-Waterman, Sellers)
Heuristics (FASTA, BLAST)
Similarity Scores
Information theory (log-odds) and entropy
(information content in bits)
Approaches for creating matrices (PAM vs. BLOSUM,
Protein vs. Nucleic acid)
Efficiency multiple matrices?
Gaps
InDel Insertion/Deletion
Statistical Significance
Database as a reference distribution
Information content of the alignment

15
Database Searching Algorithm

Dynamic Programming Algorithm
required computational resources are proportional
to the products of the lengths of the sequences
more sensitive, less selective
rigorous - finds the best possible alignment
given a set of scores
Needleman-Wunsch (global)
Smith-Waterman (local)
Sellers (quasi-global)
Heuristics
less sensitive, more selective, not rigorous -
local minima problem
local alignments only
FASTA (Word search)
BLAST (Maximal Segment Pairs)

16
Search Programs Fitting the Model

Needleman-Wunsch
Aligns the entire lengths of a pair of sequences,
thus assumes they are related over their entire
lengths
Places the fewest restrictions on the model of
sequence evolution implicit in the scores and
context
Mathematically rigorous - all possible alignments
are considered
Computationally intensive requires memory and
time proportional to the product of the lengths
of the sequences

17
Dynamic Programming Needleman-Wunsch
NWi,j max NWi-1,j-1 s(ai,bj) NWi-1,j g
NWi,j-1 g

NWi,j is the cumulative similarity of the
alignment up to residue i in sequence A and
residue j in sequence B

s(ai,bj) the similarity score for aligning
residue ai with residue bj (e.g., from a PAM of
Blosum table)
g is the gap penalty, g open gap extend gap
gap length
18
Computation Grid for Needleman-Wunsch
a5
a6
a1
a2
a3
a4
(open gap 3 extend gap)
b1
b2
(open gap 2 extend gap)
NW3,3 s(a4,b4)
b3
(open gap extend gap)
b4
NW4-l,4 g l 1, 2, or 3
b5
19
Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7

0 G C T G G A A G G C A
T
0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70
-77 -84
G -7 5 -2 -9
C -14 -2 10
A -21 -9
G -28
A -35
G -42
C -49
A -56
C -63
T -70

20
Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7

21
Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7

22

Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
23
Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7

0 G C T G G A A G G C A
T
0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70
-77 -84
G -7 5 -2 -9 -16 -23 -30 -37 -44 -51 -58
-65 -72
C -14 -2 10 3 -4 -11 -18 -25 -32 -39 -46
-53 -60
A -21 -9 3 6 -1 -8 -6 -13 -20 -27 -34
-41 -48
G -28 -16 -4 -1 11 4 -3 -10 -8 -15 -22
-29 -36
A -35 -23 -11 -8 4 7 9 2 -5 -12 -19
-17 -24
G -42 -30 -18 -15 -3 9 3 5 7 0 -7
-14 -21
C -49 -37 -25 -22 -10 2 5 -1 1 3 5 -2
-9
A -56 -44 -32 -29 -17 -5 7 10 3 -3 -1
10 3
C -63 -51 -39 -36 -24 -12 0 3 6 -1 2 3
6
T -70 -58 -46 -34 -31 -19 -7 -4 -1 2 -5 -2
8

G C T G G A A G G C A - T G
C - A G - A - G C A C T
24
Global vs. Local Alignment

Global alignment find alignment in which total
score is highest, perhaps at expense of areas of
great local similarity.
Local alignment find alignment in which the
highest scoring subsequences are identified, at
the expense of the overall score.
Local alignment can be done with minor
modifications of the global alignment algorithm!

From Russ Altman - Stanford University
25
Dynamic Programming Smith-Waterman

Finds local alignments - the best matching
regions in a pair of sequences rather than
aligning their entire lengths. Ignores poorly
matching regions.
Mathematically rigorous - finds the best
alignment between two sequences given the choice
of similarity scores and gap penalties.
Solves the problem of aligning sequences that
share a functional or structural domain in common
but are not related over their entire length.
Requires computational resources proportional to
the product of the lengths of the sequences.

26
Dynamic Programming Smith-Waterman
SWi,j max SWi-1,j-1 s(ai,bj) SWi-1,j g
SWi,j-1 g 0
SWi,j is the cumulative similarity of the
alignment up to residue i in sequence A and
residue j in sequence B

s(ai,bj) the similarity score for aligning
residue ai with residue bj (e.g., a PAM of
Blosum table)
g is the gap penalty, g open gap extend gap
gap length
0, zero, prevents poorly matching regions from
hiding good matching regions
27
Computation Grid for Smith-Waterman
a5
a6
a1
a2
a3
a4
(open gap 3 extend gap)
b1
b2
(open gap 2 extend gap)
SW3,3 s(a4,b4)
b3
(open gap extend gap)
b4
SW4-l,4 g l 1, 2, or 3
b5
28
Smith-Waterman Alignment PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7

0 G C T G G A A G G C
A T
0 0 0 0 0 0 0 0 0 0 0 0
0 0
G 0 5 0 0 5 5 0 0 5 5 0
0 0
C 0 0 10 3 0 1 1 0 0 1 10
3 0
A 0 0 3 6 0 0 6 6 0 0 3
15 8
G 0 5 0 0 11 5 0 2 11 5 0
8 11
A 0 0 1 0 4 7 10 5 4 7 1
5 4
G 0 5 0 0 5 9 3 6 10 9 3
0 1
C 0 0 10 3 0 2 5 0 3 6 14
7 0
A 0 0 3 6 0 0 7 10 3 0 7
19 12
C 0 0 5 0 2 0 0 3 6 0 5
12 15
T 0 0 0 10 3 0 0 0 0 2 0
5 17

G A A G - G C A G C A G A G C A
29
SW Alignment Waterman-Eggert Extension
(MaxSegs) PAM 47, Match 5, Mismatch -4, Open
Gap 0, Extend Gap -7

0 G C T G G A A G G C
A T
0 0 0 0 0 0 0 0 0 0 0 0
0 0
G 0 5 0 0 5 0 0 0 5 5 0
0 0
C 0 0 10 3 0 1 0 0 0 1 10
3 0
A 0 0 3 6 0 0 6 0 0 0 3
15 8
G 0 5 0 0 11 5 0 2 0 5 0
8 11
A 0 0 1 0 4 7 10 5 0 0 1
5 4
G 0 5 0 0 5 9 3 6 10 0 0
0 1
C 0 0 10 3 0 2 5 0 3 6 0
0 0
A 0 0 3 6 0 0 7 10 3 0 2
0 0
C 0 0 5 0 2 0 0 3 6 0 5
0 0
T 0 0 0 10 3 0 0 0 0 2 0
0 0

1ST (Score 19) G A A G - G C A G C
A G A G C A
2ND (15) G C A G C A
3RD (11) G C T G G C A G
30
Sellers (Quasi-Global) Alignment PAM 47, Match
5, Mismatch -4, Open Gap 0, Extend Gap -7

0 G C T G G A A G G C
A T
0 0 0 0 0 0 0 0 0 0 0 0
0 0
G 0 5 -2 -4 5 5 -2 -4 5 5 -2
-4 -4
C 0 -2 10 3 -2 1 1 -6 -2 1 10
3 -4
A 0 -4 3 6 -1 -6 6 6 -1 -6 3
15 8
G 0 5 -2 -1 11 4 -1 2 11 4 -3
8 11
A 0 -2 1 -6 4 7 9 4 4 7 0
1 4
G 0 5 -2 -3 1 9 3 5 9 9 3
-4 -3
C 0 -2 10 3 -4 2 5 -1 2 5 14
7 0
A 0 -4 3 6 -1 -5 7 10 3 -2 7
19 12
C 0 -4 1 -1 2 -5 0 3 6 -1 3
12 15
T 0 -4 -6 5 -2 -2 -5 -4 -1 2 -4
5 17

NWi,j max NWi-1,j-1 s(ai,bj) NWi-1,j g
NWi,j-1 g
G A A G - G C A - T G C A G A G C
A C T
31
Comparison of Alignments from the Algorithms PAM
47, Match 5, Mismatch -4, Open Gap 0,
Extend Gap -7

0 G C T G G A A G G C
A T
0
G
C
A
G
A
G
C
A
C
T

Needleman-Wunsch (global)
G C T G G A A G G C A - T G C - A G - A - G C A C
T
Smith-Waterman (local)
1ST (Score 19) G A A G - G C A G C A G A G C
A
2ND (15) G C A G C A
3RD (11) G C T G G C A G
Sellers (Quasi-Global)
G A A G - G C A - T G C A G A G C A C T
32
Dynamic Programming Algorithms

Place no restrictions on the model of evolution
Mathematically rigorous -- finds maximum
similarity
Computationally demanding
CPU scales ? 3 length sequence 1 length
sequence 2
Memory scales ? length sequence 1 length
sequence 2
Algorithmic Variants
Global -- Needleman-Wunsch aligns entire
lengths of sequences
Quasi-global -- Sellers fits a short sequence
into a longer sequence
Local -- Smith-Waterman finds most similar
regions

33
Comparison of Algorithms
Sequence 1 -gt

Needleman-Wunsch
Global alignment
Must start in upper left corner and finish in
lower right corner.
Smith-Waterman
Local Alignment
Small regions of high similarity
Sellers
Quasiglobal alignment
Must start on left or top edge and finish on
right or bottom edge
In well behaved cases, forces an alignment of
smaller sequence into larger sequence

Sequence 2 ?
Global
Local-1
Quasi- global
Local-2
34
Coagulation pre-proenzymes
35
Heuristic Programs FASTA and BLAST

Does a fast initial screen of each database
sequence to see if the similarity with the query
sequence is high enough to warrant a more
complete examination.
The initial screen is based on finding highly
similar, small, fixed length segments of sequence
called words or tuples.
This initial screen can be thought of as a
restriction on the model of sequence evolution
relative to that embodied in dynamic programming
algorithms.
Entails some loss of sensitivity and sometimes a
gain in selectivity over dynamic programming.

36
Generalized Heuristic Algorithm Structure

Initial Word Search
Exact
Similarity
First evaluation Threshold length or score
Initial Alignment
Join words or
Extend words toward maximal segment pair
Final, optimized alignment
Join MSPs
Banded Smith-Waterman (Fasta, prior BLASTs)
Centered Smith-Waterman (BLAST 2)

37
Search Programs Fitting the Model

FASTA
Restricts the evolutionary model to require a few
unmutated dipeptides or hexanucleotides
Carefully considers only alignments including the
unmutated sites
The window size limits the maximum cumulative
gaps
Plus or minus 32 residues in the GCG
implementation
Much faster, more selective, and less sensitive
than Smith-Waterman searches for sequences with
several identical dipeptides or hexanucleotides

38
A Word List for FASTA, Word Size 6

G C T G A A G G C A T
G C T G A A
C T G A A G
T G A A G A
G A A G G C
A A G G C A
A G G C A T

39
FASTA Algorithm

4 steps
use lookup table to find all identities at least
ktup long, find regions of identities (Fig.A)
rescan 10 regions (diagonals) with highest
density of identities using PAM250 (Fig.B)
join regions if possible without decreasing score
below threshold (Fig.C)
rescore ala Smith-Waterman 32 residues around
initial region (Note doesnt save alignment)
(Fig.D)
Pearson and Lippman 1988

40
Search Programs Fitting the Model

BLAST
Requires the evolutionary model to include a site
with a long stretch of mostly identities and
conservative substitutions, with very few
unlikely substitutions and no insertions or
deletions
Less sensitive and more selective than
Smith-Waterman when the evolutionary model is
appropriate
Intended for proteins - BLAST is the least
effective algorithm for nucleic acid sequences
The restrictions allow rigorous statistical
evaluation of the results of the database search
Recent extension allows insertions and deletions
through a second pass Smith-Waterman alignment

41
BLAST Algorithm

3 steps
Compile a list of high-scoring words
scan database for hits
extend hits

42
BLAST Algorithm
43
BLAST Algorithm
44
BLAST Algorithm
45
BLAST 2 - Improved Sensitivity

BLAST 2 requires two different words to match
Both matches must be on the same diagonal of the
path graph
The matches must not overlap
The pair of matches must be within a limited
number of amino acids of one another, typically
40 amino acids along the diagonal
This allows the threshold score for each match to
be lowered from the value used in the original
BLAST
These matches are extended as in the original
BLAST
If the resulting HSP score is high enough a
limited Smith-Waterman alignment is done.

46
BLAST 2 Smith-Waterman

Finds the highest scoring 11 amino acid segment
in the HSPs from the first pass BLAST2 analysis
Takes the middle aligned pair of amino acids from
the segment as the starting point for the
Smith-Waterman alignment
Extends the alignment in both directions until
the alignment score drop to some limit X below
the highest Smith-Waterman score calculated up to
that point for the current pair of sequences
This allows odd shaped areas rather than the band
used in FASTA and the earlier gapped BLAST