Pairwise Alignments and Sequence Database Searches: Algorithms - PowerPoint PPT Presentation

1 / 46
About This Presentation
Title:

Pairwise Alignments and Sequence Database Searches: Algorithms

Description:

Search programs: Smith-Waterman, BLAST, FASTA ... Smith-Waterman Alignment. PAM 47, Match = 5, Mismatch ... SW Alignment Waterman-Eggert Extension (MaxSegs) ... – PowerPoint PPT presentation

Number of Views:75
Avg rating:3.0/5.0
Slides: 47
Provided by: jaimeram
Category:

less

Transcript and Presenter's Notes

Title: Pairwise Alignments and Sequence Database Searches: Algorithms


1
Pairwise Alignments and Sequence Database
Searches Algorithms
  • BIOINFORMATICS I
  • Protein and DNA Sequence Analysis
  • Jaime E. Ramirez-Vick, PhD

2
Sequence Analysis - Overview
3
Types of Database Searches
  • Sequence against database of abstractions from
    sequences
  • Sequence against database of sequences
  • Abstraction from your sequence against a database
    of sequences
  • Sometimes against a database of abstraction from
    sequences
  • Sequence database searches are carried out by
    doing pairwise alignments

4
Focus What to Remember
  • The different kinds of alignments and algorithms
  • The model of sequence evolution implied in
    database searching and sequence alignment -
    Scoring Matrices
  • Differences, in terms of restrictions placed on
    the model of sequence evolution, among the three
    main database searching (local alignment)
    algorithms
  • The nature of the knowledge or information
    encoded in amino acid or nucleotide similarity
    matrices
  • The effect of different gap penalty choices on
    the model of sequence evolution
  • The relationship between the amount of
    information in an alignment and statistical
    significance

5
What are you doing?????
  • Database search
  • Separate the library sequences that are related
    to your query sequence (by some model) from the
    sequences that are not.
  • SEQUENCE ALIGNMENTS
  • Alignment
  • Represents our best guess of the history of
    substitutions at each position.

6
What is a sequence alignment?
  • THE PROBLEM
  • Given two sequences of letters, and a scoring
    scheme for evaluating matching letters, find the
    optimal pairing of letters from one sequence to
    letters of the other sequence.
  • Align
  • THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
  • THIS IS A SHORT SENTENCE.
  • Alignments
  • THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
  • THIS IS A SHORT SENTENCE.
  • OR
  • THIS IS A SHORTSENTENCE.

From Russ Altman - Stanford University
7
Why align sequences?
  • Lots of sequences with unknown structure and
    function. A few sequences with known structure
    and function.
  • If they align, they are similar, maybe due to
    common descent.
  • If they are similar, then they might have the
    same structure or function.
  • If one of them has known structure/function, then
    alignment to the other yields insight about how
    the structure or function works.

From Russ Altman - Stanford University
8
Database Searching An Overview
  • Database searching is the application of
    previously determined experimental knowledge to
    the problem of discovering the biochemistry and
    physiology of a newly discovered gene or its
    protein product.
  • This involves finding an homologous gene or
    protein - one related to our newly determined
    sequence through a common evolutionary ancestor.

9
Knowledge in Database Searching
  • Search programs Smith-Waterman, BLAST, FASTA
  • Each is a mathematical method for finding the
    best fit between two sets of discreet data (e.g.,
    biological sequences)
  • Any limitations in finding the best fit between
    sequences can be interpreted as restrictions on
    the model of evolution relating the sequences and
    what kinds of changes are allowed.
  • Similarity scores for alignments PAM, Blosum,
    DNA
  • For a given evolutionary divergence what
    substitutions are likely to be observed and which
    are unlikely to be observed
  • Sequence Databases GenBank, PIR, Swiss-Prot,
    UNIPROT
  • Many sequences where we already have knowledge of
    their biochemistry or physiology
  • Conserved features in the sequence and their
    functional role

10
The Model of Sequence Evolution
  • The sequences being sought with the query
    sequence have an ancestral sequence in common
    from which they and the query sequence have
    descended.
  • Our best guess of the actual path of evolution is
    the path that requires the fewest evolutionary
    events.
  • All substitutions are not equally likely and
    should be weighted to account for this.
  • Insertions and deletions are less likely than
    substitutions and should be weighted to account
    for this.

11
Database Searching The Modelwith the
assumption of reversibility
Unobserved Ancestral Sequence
Descendant 1 gctggaaggcat
Descendant 2 gcagagcact
12
Database Searching The Choices
  • The choice of search algorithm determines which,
    if any, restrictions you place on the
    evolutionary model of the database search this
    determines the sensitivity, selectivity, and
    computational requirements
  • The choice of scoring table (similarity matrix)
    determines the degree of divergence and pattern
    of substitutions that will be found
  • The choice of gap penalty influences the shape of
    the distribution of alignment scores for random
    sequences and hence your ability to distinguish
    homologous from nonhomologous sequences

13
Drawing alignments
  • Exact Matches OK, Inexact Costly, Gaps cheap.
  • This is a rather longer sentence than the next.
  • This is a sentence.
  • OR
  • This is a rather longer sentence than the next.
  • This is a shortsentence.
  • Exact Matches OK, Inexact Moderate, Gaps cheap.
  • This is a rather longer sentence than the next.
  • This is a short sentence.
  • Exact Matches cheap, Inexact cheap, Gaps
    expensive.
  • This is a rather longer sentence than the next.
  • This is a short sentence.

From Russ Altman - Stanford University
14
Outline of Database Searching
  • Algorithms
  • Dynamic Programming (Needleman-Wunsch,
    Smith-Waterman, Sellers)
  • Heuristics (FASTA, BLAST)
  • Similarity Scores
  • Information theory (log-odds) and entropy
    (information content in bits)
  • Approaches for creating matrices (PAM vs. BLOSUM,
    Protein vs. Nucleic acid)
  • Efficiency multiple matrices?
  • Gaps
  • InDel Insertion/Deletion
  • Statistical Significance
  • Database as a reference distribution
  • Information content of the alignment

15
Database Searching Algorithm
  • Dynamic Programming Algorithm
  • required computational resources are proportional
    to the products of the lengths of the sequences
  • more sensitive, less selective
  • rigorous - finds the best possible alignment
    given a set of scores
  • Needleman-Wunsch (global)
  • Smith-Waterman (local)
  • Sellers (quasi-global)
  • Heuristics
  • less sensitive, more selective, not rigorous -
    local minima problem
  • local alignments only
  • FASTA (Word search)
  • BLAST (Maximal Segment Pairs)

16
Search Programs Fitting the Model
  • Needleman-Wunsch
  • Aligns the entire lengths of a pair of sequences,
    thus assumes they are related over their entire
    lengths
  • Places the fewest restrictions on the model of
    sequence evolution implicit in the scores and
    context
  • Mathematically rigorous - all possible alignments
    are considered
  • Computationally intensive requires memory and
    time proportional to the product of the lengths
    of the sequences

17
Dynamic Programming Needleman-Wunsch
NWi,j max NWi-1,j-1 s(ai,bj) NWi-1,j g
NWi,j-1 g

NWi,j is the cumulative similarity of the
alignment up to residue i in sequence A and
residue j in sequence B

s(ai,bj) the similarity score for aligning
residue ai with residue bj (e.g., from a PAM of
Blosum table)
g is the gap penalty, g open gap extend gap
gap length
18
Computation Grid for Needleman-Wunsch
a5
a6
a1
a2
a3
a4
(open gap 3 extend gap)
b1
b2
(open gap 2 extend gap)
NW3,3 s(a4,b4)
b3
(open gap extend gap)
b4
NW4-l,4 g l 1, 2, or 3
b5
19
Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
  • 0 G C T G G A A G G C A
    T
  • 0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70
    -77 -84
  • G -7 5 -2 -9
  • C -14 -2 10
  • A -21 -9
  • G -28
  • A -35
  • G -42
  • C -49
  • A -56
  • C -63
  • T -70

20
Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7

21
Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7

22

Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
23
Needleman-Wunsch Alignments PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
  • 0 G C T G G A A G G C A
    T
  • 0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70
    -77 -84
  • G -7 5 -2 -9 -16 -23 -30 -37 -44 -51 -58
    -65 -72
  • C -14 -2 10 3 -4 -11 -18 -25 -32 -39 -46
    -53 -60
  • A -21 -9 3 6 -1 -8 -6 -13 -20 -27 -34
    -41 -48
  • G -28 -16 -4 -1 11 4 -3 -10 -8 -15 -22
    -29 -36
  • A -35 -23 -11 -8 4 7 9 2 -5 -12 -19
    -17 -24
  • G -42 -30 -18 -15 -3 9 3 5 7 0 -7
    -14 -21
  • C -49 -37 -25 -22 -10 2 5 -1 1 3 5 -2
    -9
  • A -56 -44 -32 -29 -17 -5 7 10 3 -3 -1
    10 3
  • C -63 -51 -39 -36 -24 -12 0 3 6 -1 2 3
    6
  • T -70 -58 -46 -34 -31 -19 -7 -4 -1 2 -5 -2
    8

G C T G G A A G G C A - T G
C - A G - A - G C A C T
24
Global vs. Local Alignment
  • Global alignment find alignment in which total
    score is highest, perhaps at expense of areas of
    great local similarity.
  • Local alignment find alignment in which the
    highest scoring subsequences are identified, at
    the expense of the overall score.
  • Local alignment can be done with minor
    modifications of the global alignment algorithm!

From Russ Altman - Stanford University
25
Dynamic Programming Smith-Waterman
  • Finds local alignments - the best matching
    regions in a pair of sequences rather than
    aligning their entire lengths. Ignores poorly
    matching regions.
  • Mathematically rigorous - finds the best
    alignment between two sequences given the choice
    of similarity scores and gap penalties.
  • Solves the problem of aligning sequences that
    share a functional or structural domain in common
    but are not related over their entire length.
  • Requires computational resources proportional to
    the product of the lengths of the sequences.

26
Dynamic Programming Smith-Waterman
SWi,j max SWi-1,j-1 s(ai,bj) SWi-1,j g
SWi,j-1 g 0
SWi,j is the cumulative similarity of the
alignment up to residue i in sequence A and
residue j in sequence B


s(ai,bj) the similarity score for aligning
residue ai with residue bj (e.g., a PAM of
Blosum table)
g is the gap penalty, g open gap extend gap
gap length
0, zero, prevents poorly matching regions from
hiding good matching regions
27
Computation Grid for Smith-Waterman
a5
a6
a1
a2
a3
a4
(open gap 3 extend gap)
b1
b2
(open gap 2 extend gap)
SW3,3 s(a4,b4)
b3
(open gap extend gap)
b4
SW4-l,4 g l 1, 2, or 3
b5
28
Smith-Waterman Alignment PAM 47, Match 5,
Mismatch -4, Open Gap 0, Extend Gap -7
  • 0 G C T G G A A G G C
    A T
  • 0 0 0 0 0 0 0 0 0 0 0 0
    0 0
  • G 0 5 0 0 5 5 0 0 5 5 0
    0 0
  • C 0 0 10 3 0 1 1 0 0 1 10
    3 0
  • A 0 0 3 6 0 0 6 6 0 0 3
    15 8
  • G 0 5 0 0 11 5 0 2 11 5 0
    8 11
  • A 0 0 1 0 4 7 10 5 4 7 1
    5 4
  • G 0 5 0 0 5 9 3 6 10 9 3
    0 1
  • C 0 0 10 3 0 2 5 0 3 6 14
    7 0
  • A 0 0 3 6 0 0 7 10 3 0 7
    19 12
  • C 0 0 5 0 2 0 0 3 6 0 5
    12 15
  • T 0 0 0 10 3 0 0 0 0 2 0
    5 17



G A A G - G C A G C A G A G C A
29
SW Alignment Waterman-Eggert Extension
(MaxSegs) PAM 47, Match 5, Mismatch -4, Open
Gap 0, Extend Gap -7
  • 0 G C T G G A A G G C
    A T
  • 0 0 0 0 0 0 0 0 0 0 0 0
    0 0
  • G 0 5 0 0 5 0 0 0 5 5 0
    0 0
  • C 0 0 10 3 0 1 0 0 0 1 10
    3 0
  • A 0 0 3 6 0 0 6 0 0 0 3
    15 8
  • G 0 5 0 0 11 5 0 2 0 5 0
    8 11
  • A 0 0 1 0 4 7 10 5 0 0 1
    5 4
  • G 0 5 0 0 5 9 3 6 10 0 0
    0 1
  • C 0 0 10 3 0 2 5 0 3 6 0
    0 0
  • A 0 0 3 6 0 0 7 10 3 0 2
    0 0
  • C 0 0 5 0 2 0 0 3 6 0 5
    0 0
  • T 0 0 0 10 3 0 0 0 0 2 0
    0 0



1ST (Score 19) G A A G - G C A G C
A G A G C A
2ND (15) G C A G C A
3RD (11) G C T G G C A G
30
Sellers (Quasi-Global) Alignment PAM 47, Match
5, Mismatch -4, Open Gap 0, Extend Gap -7
  • 0 G C T G G A A G G C
    A T
  • 0 0 0 0 0 0 0 0 0 0 0 0
    0 0
  • G 0 5 -2 -4 5 5 -2 -4 5 5 -2
    -4 -4
  • C 0 -2 10 3 -2 1 1 -6 -2 1 10
    3 -4
  • A 0 -4 3 6 -1 -6 6 6 -1 -6 3
    15 8
  • G 0 5 -2 -1 11 4 -1 2 11 4 -3
    8 11
  • A 0 -2 1 -6 4 7 9 4 4 7 0
    1 4
  • G 0 5 -2 -3 1 9 3 5 9 9 3
    -4 -3
  • C 0 -2 10 3 -4 2 5 -1 2 5 14
    7 0
  • A 0 -4 3 6 -1 -5 7 10 3 -2 7
    19 12
  • C 0 -4 1 -1 2 -5 0 3 6 -1 3
    12 15
  • T 0 -4 -6 5 -2 -2 -5 -4 -1 2 -4
    5 17



NWi,j max NWi-1,j-1 s(ai,bj) NWi-1,j g
NWi,j-1 g
G A A G - G C A - T G C A G A G C
A C T
31
Comparison of Alignments from the Algorithms PAM
47, Match 5, Mismatch -4, Open Gap 0,
Extend Gap -7
  • 0 G C T G G A A G G C
    A T
  • 0
  • G
  • C
  • A
  • G
  • A
  • G
  • C
  • A
  • C
  • T

Needleman-Wunsch (global)
G C T G G A A G G C A - T G C - A G - A - G C A C
T
Smith-Waterman (local)
1ST (Score 19) G A A G - G C A G C A G A G C
A
2ND (15) G C A G C A
3RD (11) G C T G G C A G
Sellers (Quasi-Global)
G A A G - G C A - T G C A G A G C A C T
32
Dynamic Programming Algorithms
  • Place no restrictions on the model of evolution
  • Mathematically rigorous -- finds maximum
    similarity
  • Computationally demanding
  • CPU scales ? 3 length sequence 1 length
    sequence 2
  • Memory scales ? length sequence 1 length
    sequence 2
  • Algorithmic Variants
  • Global -- Needleman-Wunsch aligns entire
    lengths of sequences
  • Quasi-global -- Sellers fits a short sequence
    into a longer sequence
  • Local -- Smith-Waterman finds most similar
    regions

33
Comparison of Algorithms
Sequence 1 -gt
  • Needleman-Wunsch
  • Global alignment
  • Must start in upper left corner and finish in
    lower right corner.
  • Smith-Waterman
  • Local Alignment
  • Small regions of high similarity
  • Sellers
  • Quasiglobal alignment
  • Must start on left or top edge and finish on
    right or bottom edge
  • In well behaved cases, forces an alignment of
    smaller sequence into larger sequence

Sequence 2 ?
Global
Local-1
Quasi- global
Local-2
34
Coagulation pre-proenzymes
35
Heuristic Programs FASTA and BLAST
  • Does a fast initial screen of each database
    sequence to see if the similarity with the query
    sequence is high enough to warrant a more
    complete examination.
  • The initial screen is based on finding highly
    similar, small, fixed length segments of sequence
    called words or tuples.
  • This initial screen can be thought of as a
    restriction on the model of sequence evolution
    relative to that embodied in dynamic programming
    algorithms.
  • Entails some loss of sensitivity and sometimes a
    gain in selectivity over dynamic programming.

36
Generalized Heuristic Algorithm Structure
  • Initial Word Search
  • Exact
  • Similarity
  • First evaluation Threshold length or score
  • Initial Alignment
  • Join words or
  • Extend words toward maximal segment pair
  • Final, optimized alignment
  • Join MSPs
  • Banded Smith-Waterman (Fasta, prior BLASTs)
  • Centered Smith-Waterman (BLAST 2)

37
Search Programs Fitting the Model
  • FASTA
  • Restricts the evolutionary model to require a few
    unmutated dipeptides or hexanucleotides
  • Carefully considers only alignments including the
    unmutated sites
  • The window size limits the maximum cumulative
    gaps
  • Plus or minus 32 residues in the GCG
    implementation
  • Much faster, more selective, and less sensitive
    than Smith-Waterman searches for sequences with
    several identical dipeptides or hexanucleotides

38
A Word List for FASTA, Word Size 6
  • G C T G A A G G C A T
  • G C T G A A
  • C T G A A G
  • T G A A G A
  • G A A G G C
  • A A G G C A
  • A G G C A T

39
FASTA Algorithm
  • 4 steps
  • use lookup table to find all identities at least
    ktup long, find regions of identities (Fig.A)
  • rescan 10 regions (diagonals) with highest
    density of identities using PAM250 (Fig.B)
  • join regions if possible without decreasing score
    below threshold (Fig.C)
  • rescore ala Smith-Waterman 32 residues around
    initial region (Note doesnt save alignment)
    (Fig.D)
  • Pearson and Lippman 1988

40
Search Programs Fitting the Model
  • BLAST
  • Requires the evolutionary model to include a site
    with a long stretch of mostly identities and
    conservative substitutions, with very few
    unlikely substitutions and no insertions or
    deletions
  • Less sensitive and more selective than
    Smith-Waterman when the evolutionary model is
    appropriate
  • Intended for proteins - BLAST is the least
    effective algorithm for nucleic acid sequences
  • The restrictions allow rigorous statistical
    evaluation of the results of the database search
  • Recent extension allows insertions and deletions
    through a second pass Smith-Waterman alignment

41
BLAST Algorithm
  • 3 steps
  • Compile a list of high-scoring words
  • scan database for hits
  • extend hits

42
BLAST Algorithm
43
BLAST Algorithm
44
BLAST Algorithm
45
BLAST 2 - Improved Sensitivity
  • BLAST 2 requires two different words to match
  • Both matches must be on the same diagonal of the
    path graph
  • The matches must not overlap
  • The pair of matches must be within a limited
    number of amino acids of one another, typically
    40 amino acids along the diagonal
  • This allows the threshold score for each match to
    be lowered from the value used in the original
    BLAST
  • These matches are extended as in the original
    BLAST
  • If the resulting HSP score is high enough a
    limited Smith-Waterman alignment is done.

46
BLAST 2 Smith-Waterman
  • Finds the highest scoring 11 amino acid segment
    in the HSPs from the first pass BLAST2 analysis
  • Takes the middle aligned pair of amino acids from
    the segment as the starting point for the
    Smith-Waterman alignment
  • Extends the alignment in both directions until
    the alignment score drop to some limit X below
    the highest Smith-Waterman score calculated up to
    that point for the current pair of sequences
  • This allows odd shaped areas rather than the band
    used in FASTA and the earlier gapped BLAST
Write a Comment
User Comments (0)
About PowerShow.com