Pairwise Sequence Alignment Part 1

Provided by: macieksa

1
Pairwise Sequence Alignment Part 1
  • VIBE Education Edition (VIBE-Ed) Initiative

2
Overview
  • Part 1: Introduction
  • Why compare sequences?
  • Dynamic Programming Algorithms (Global vs. Local)
  • Heuristic algorithms (K-tuple / Word-size)
  • Scoring Matrices
  • Part 2: Statistics of Similarity Searches
  • Scoring Matrices, continued
  • Statistics of similarity searching

3
Why compare sequences?
  • Nature is conservative
  • Incremental modifications give rise to genetic
    diversity and novel function
  • Detection of similarity between sequences allows
    us to transfer information about one sequence to
    other similar sequences with reasonable, though
    not always total, confidence

4
Sequence Alignment
  • Before we can make comparative statements about
    two (nucleic acid or protein) sequences, we have
    to produce a pairwise sequence alignment
  • What is the optimal alignment between two
    sequences?
  • Quantitative? Match/mismatch? Gaps/extensions? Is
    an optimal alignment always significant? Random
    sequences?

5
Protein Evolution
  • For many proteins, evolutionary history can be
    traced back more than 1 billion years
  • Evolutionary time scales / Tree of Life
  • Sequence Homology vs. Sequence Similarity
  • Homology means common ancestry

6
Three alignments, three meanings
(the middle line of each alignment marks identical residues)

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
           G   VK HGKKV  A     AH D        LS LH  KL
HBB_HUMAN  GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
                  H  KV      A               L  L   H  K
LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
           GS    G      D L     H  D   A  AL D      AH
F11G11.2   GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

7
Pairwise Sequence Alignment Methods
  • Dynamic Programming
  • Global Alignment (Needleman-Wunsch)
  • Local Alignment (Smith-Waterman)
  • Word, or k-tuple methods
  • FASTA
  • BLAST

8
Dynamic Programming Algorithm
  • Provides the best (optimal) alignment between two
    sequences
  • Includes matches, mismatches and gaps to maximize
    the number of matched characters
  • Scoring: match, mismatch, gap (non-affine vs.
    affine)
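
The difference between non-affine and affine gap scoring can be sketched as follows. This is only an illustration: the function names and the affine values (-5 to open, -1 to extend) are assumptions chosen for the example, and some tools define the opening cost slightly differently.

```python
def linear_gap_penalty(length, gap=-3):
    """Non-affine (linear) model: every gap position costs the same."""
    return gap * length


def affine_gap_penalty(length, gap_open=-5, gap_extend=-1):
    """Affine model: the first gap position costs gap_open,
    each further position in the same gap costs gap_extend.
    The default values are hypothetical."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# Compare the two models for gaps of increasing length.
for n in (1, 2, 5):
    print(n, linear_gap_penalty(n), affine_gap_penalty(n))
```

Under the affine model one long gap is cheaper than several short ones of the same total length, which is the usual motivation for it.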

9
Example
  • Find optimal global alignment for sequences
  • GAAGA
  • and
  • GTTTAAG

10
Define Rules
  • Score for match = 1
  • Score for mismatch = -1
  • Score for gap = -3

12
Implement Rules
  • When moving horizontally (gap):
    Alignment Score = Existing Score + Gap Score
  • When moving vertically (gap):
    Alignment Score = Existing Score + Gap Score
  • When moving diagonally (match or mismatch):
    Alignment Score = Existing Score + Corner Value
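
A minimal sketch of how these three moves fill the scoring table, using the example sequences and the match, mismatch and gap scores defined on the earlier slides. The function name is an assumption made for this illustration, and only the final score is returned (no traceback).

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-3):
    """Fill the global-alignment table and return the optimal score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: only gaps are possible.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            corner = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + corner,  # diagonal move
                score[i - 1][j] + gap,         # vertical move (gap)
                score[i][j - 1] + gap,         # horizontal move (gap)
            )
    return score[-1][-1]


print(needleman_wunsch_score("GAAGA", "GTTTAAG"))  # prints -7
```

The bottom-right cell holds the score of the best global alignment; recovering the alignment itself would additionally require tracing back through the table.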

21
Optimal Global Alignments
1) GAAGA__   score -7
   GTTTAAG
2) G_A_AGA   score -7
   GTTTAAG
3) G__AAGA   score -7
   GTTTAAG
4) G_AAGA_   score -7
   GTTTAAG
5) GAAG_A_   score -7
   GTTTAAG
6) GAA_GA_   score -7
   GTTTAAG
22
Global vs. Local Alignment
23
Smith-Waterman
  • Dynamic programming - local alignment
  • Compares query to each sequence in database.
  • Performs full pairwise comparison.
  • More sensitive, but much slower than heuristic
    algorithms (BLAST or FASTA)

25
Optimal Local Alignment
  • Sequences: GAAGA and GTTTAAG
  • Optimal local alignment (score 3):
    AAG
    AAG
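
The local alignment of GAAGA and GTTTAAG can be reproduced with a short Smith-Waterman sketch. It assumes the same scores as the global example (match 1, mismatch -1, gap -3); the function name and the "_" gap symbol are choices made for this illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-3):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            corner = match if a[i - 1] == b[j - 1] else mismatch
            # Unlike global alignment, a cell never drops below zero.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + corner,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        corner = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + corner:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("_")
            i -= 1
        else:
            top.append("_")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("GAAGA", "GTTTAAG"))  # prints (3, 'AAG', 'AAG')
```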
26
Heuristic (word or k-tuple based) algorithms
  • Make reasonable assumptions about the nature of
    sequence alignments and try out only the most
    likely alignments
  • Find a perfect match (word, k-tuple)
  • Extend the alignment until either one of the
    sequences ends or the score drops below a
    threshold
  • Much faster than dynamic programming methods, but
    less sensitive
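
The find-then-extend idea can be sketched as below. This is a simplified illustration, not the procedure of any particular tool: it takes only the first exact word match, extends without gaps, and stops when the running score falls more than `max_drop` below the best score seen. All names and default values are assumptions.

```python
MATCH, MISMATCH = 1, -1  # illustrative scores


def find_seed(query, subject, k):
    """Return (query_pos, subject_pos) of the first exact k-letter match."""
    words = {}
    for i in range(len(query) - k + 1):
        words.setdefault(query[i:i + k], i)
    for j in range(len(subject) - k + 1):
        i = words.get(subject[j:j + k])
        if i is not None:
            return i, j
    return None


def extend(query, subject, qi, sj, step, max_drop):
    """Extend without gaps in one direction; return (best gain, letters kept)."""
    score = best = 0
    kept = n = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += MATCH if query[qi] == subject[sj] else MISMATCH
        n += 1
        if score > best:
            best, kept = score, n
        elif best - score > max_drop:
            break  # score dropped too far: stop extending
        qi += step
        sj += step
    return best, kept


def seed_and_extend(query, subject, k=3, max_drop=2):
    """Return (score, matched query segment) or None if no seed is found."""
    seed = find_seed(query, subject, k)
    if seed is None:
        return None
    qi, sj = seed
    right_gain, right_len = extend(query, subject, qi + k, sj + k, 1, max_drop)
    left_gain, left_len = extend(query, subject, qi - 1, sj - 1, -1, max_drop)
    total = k * MATCH + left_gain + right_gain
    return total, query[qi - left_len:qi + k + right_len]


print(seed_and_extend("GAAGA", "GTTTAAG"))  # prints (3, 'AAG')
```

Only the neighborhood of a seed is examined, which is where the speed comes from, and also why an alignment without any exact word match is missed.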

27
FASTA (Pearson and Lipman 1988)
  • Combination of word (k-tup) search and
    Smith-Waterman algorithm
  • The query sequence is divided into small words of
    a certain size.
  • The initial comparison of the query sequence to
    the database is performed using these words.
  • If several words are located on the same diagonal
    in an array, the region surrounding that diagonal
    is analyzed further.
  • Search time is proportional only to the size of
    the database, not to (database × query sequence)
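
The word lookup and the "same diagonal" test can be illustrated with a small sketch. It is not FASTA's actual implementation; the function names are hypothetical. A word found at position i in the query and position j in a database sequence lies on diagonal j - i, so counting hits per diagonal shows where contiguous identities cluster.

```python
from collections import Counter, defaultdict


def word_positions(seq, ktup):
    """Lookup table: each word of length ktup -> positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i)
    return table


def diagonal_hits(query, subject, ktup):
    """Count word matches on each diagonal (subject offset minus query offset)."""
    table = word_positions(query, ktup)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in table.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits


print(diagonal_hits("GAAGA", "GTTTAAG", 2))  # prints Counter({3: 2})
```

Here both matching words (AA and AG) fall on diagonal 3, which marks the AAG region as worth analyzing further. The query table is built once, and each database position then costs only a lookup.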

28
The ktup value
  • The ktup (for k-tuples) value stands for the
    length of the word used to search for identity.
  • For proteins, a ktup value of 3 would give a hash
    table of 20^3 elements (8000 entries).
  • The higher the ktup value, the less likely you
    will get a match unless it is identical (remember
    the dot plots).
  • The lower the ktup value, the more background you
    will have.
  • The higher the ktup value, the faster the analysis
    (fewer diagonals).

Typical ktup values
ktup  analysis
1     proteins - distantly related
2     proteins - somewhat related (default)
3     DNA - default
29
FASTA Steps
  • 1. Local regions of identity are found
    (identical offset values in a contiguous sequence
    vs. different offset values)
  • 2. Rescore the local regions using scoring
    matrices (diagonals are extended)
  • 3. Eliminate short diagonals below a cutoff score
  • 4. Create a gapped alignment in a narrow segment
    and then perform S-W alignment
30
Summary of FASTA steps
  • 1. Analyzes database for identical matches that
    are contiguous.
  • 2. Longest diagonals are scored again using the
    PAM matrix (or other matrix). The best scores
    are saved as init1 scores.
  • 3. Short diagonals are removed.
  • 4. Long diagonals that are neighbors are joined.
    The score for this joined region is initn.
    This score may be lower due to a penalty for a
    gap.
  • 5. An S-W dynamic programming alignment is
    performed around the joined sequences to give an
    opt score.
  • Thus, the time-consuming S-W step is performed
    only on the top-scoring sequences.

31
FASTA Versions
  • fasta: compares a protein sequence to a protein
    database, or a nucleotide sequence to a nucleotide
    database
  • fastx, fasty: compare a translated query sequence
    to a protein sequence database (forward or
    backward translation of the query)
  • tfastx, tfasty: compare a protein query sequence
    to a nucleotide sequence database that has been
    translated into three forward and three reverse
    reading frames
32
BLAST (Karlin and Altschul 1990)
  • Basic Local Alignment Search Tool
  • Database is pre-indexed to increase speed
  • The initial search is done for a word of length
    "W" that scores at least "T" when compared to the
    query using a substitution matrix.
  • Word hits are then extended in either direction
    in an attempt to generate an alignment with a
    score exceeding the threshold of "S".
  • The "T" parameter dictates the speed and
    sensitivity of the search.
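
The role of "W" and "T" can be illustrated with a toy sketch that lists, for each query word of length W, every word scoring at least T against it. The four-letter alphabet and the scoring function below are invented for this example and stand in for a real substitution matrix; this is not how BLAST itself is implemented.

```python
from itertools import product

ALPHABET = "ILKR"  # toy alphabet, hypothetical
SIMILAR = {frozenset("IL"), frozenset("KR")}  # pairs treated as conservative


def toy_score(x, y):
    """Stand-in for a substitution matrix lookup (values are made up)."""
    if x == y:
        return 4
    if frozenset((x, y)) in SIMILAR:
        return 2
    return -2


def neighborhood(query, w, t):
    """Map each query offset to the words of length w scoring at least t."""
    result = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        result[i] = sorted(
            "".join(cand)
            for cand in product(ALPHABET, repeat=w)
            if sum(toy_score(a, b) for a, b in zip(qword, cand)) >= t
        )
    return result


print(neighborhood("ILK", 3, 10))  # prints {0: ['IIK', 'ILK', 'ILR', 'LLK']}
print(neighborhood("ILK", 3, 12))  # prints {0: ['ILK']}
```

Lowering T enlarges the word list, so more database hits are found and extended (more sensitive, slower); raising T does the opposite.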

33
BLAST Versions
  • BLASTN: compares a nucleotide query to a
    nucleotide database.
  • BLASTP: compares a protein query to a protein
    database.
  • BLASTX: compares a translated nucleotide query to
    a protein database.
  • TBLASTN: compares a protein query to a translated
    nucleotide database.
  • TBLASTX: compares a translated nucleotide query to
    a translated nucleotide database.
34
Scoring Matrices
  • The alignment score represents the odds of
    obtaining that score between sequences known to
    be related, relative to the odds of obtaining it
    by chance alignment of unrelated sequences
  • When the correct scoring matrix is used,
    alignment statistics are meaningful
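
The odds interpretation can be sketched as a scaled log-odds ratio for a single residue pair. The frequencies below are invented for illustration, and the scale factor of 10 with a base-10 logarithm is an assumption about the convention; real matrices are derived from large sets of aligned sequences.

```python
import math


def log_odds_score(pair_freq_related, freq_a, freq_b, scale=10.0):
    """Scaled log of (pair frequency in related sequences) / (chance expectation)."""
    odds = pair_freq_related / (freq_a * freq_b)
    return round(scale * math.log10(odds))


# Hypothetical numbers: two residues that each make up 10% of all positions.
print(log_odds_score(0.02, 0.1, 0.1))   # pair seen twice as often as chance: prints 3
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen half as often as chance: prints -3
```

Positive scores mark substitutions seen more often in related sequences than expected by chance, negative scores the opposite; summing the scores along an alignment corresponds to multiplying the odds.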

35
Dayhoff PAM Matrix (Point Accepted Mutation)
  • Lists the likelihood of change from one amino
    acid to another in homologous protein sequences
    during evolution
  • PAM 1 estimated using 1572 changes in 71 groups
    of protein sequences that were at least 85%
    similar
  • Assumes each amino acid change at a site is
    independent of previous changes at the site
  • PAM 250 (20% similarity) obtained by multiplying
    PAM1 by itself 250 times
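
The "multiply PAM1 by itself" step can be sketched with a toy two-letter alphabet. The 2x2 mutation probability matrix below is invented for illustration (a real PAM1 matrix is 20x20), and the helper names are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1": each letter has a 1% chance of changing per step.
toy_pam1 = [
    [0.99, 0.01],
    [0.01, 0.99],
]
toy_pam250 = mat_pow(toy_pam1, 250)

for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 steps the toy probabilities are close to 0.5 everywhere, showing how repeated small changes erase most of the original signal. The independence assumption stated on the slide is what justifies plain matrix multiplication.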

36
Blocks Amino Acid Substitution (BLOSUM) Matrix
  • Based on the observed amino acid substitutions in
    blocks (a large set of 2000 conserved amino acid
    patterns)
  • Used 500 families of related proteins
  • Not based on explicit evolutionary model, but
    from considering all amino acid changes observed
    in an aligned region from a related family of
    proteins.

37
PAM250 Scoring Matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6
38
Summary
  • Choose appropriate algorithm (speed vs.
    sensitivity)
  • Use smallest database that will answer your
    question
  • Default matrices may not always give a meaningful
    score