Pairwise Sequence Alignment Part 1 presentation

1
Pairwise Sequence Alignment Part 1

VIBE Education Edition (VIBE-Ed) Initiative

2
Overview

Part 1: Introduction
Why compare sequences?
Dynamic Programming Algorithms (Global vs. Local)
Heuristic algorithms (K-tuple / Word-size)
Scoring Matrices
Part 2: Statistics of Similarity Searches
Scoring Matrices, cont'd
Statistics of similarity searching

3
Why compare sequences?

Nature is conservative
Incremental modifications give rise to genetic
diversity and novel function
Detection of similarity between sequences allows
us to transfer information about one sequence to
other similar sequences with reasonable, though
not always total, confidence

4
Sequence Alignment

Before we can make comparative statements about
two (nucleic acid or protein) sequences, we have
to produce a pairwise sequence alignment
What is the optimal alignment between two
sequences?
Quantitative? Match/mismatch? Gaps/extensions? Is
an optimal alignment always significant? Random
sequences?

5
Protein Evolution

For many proteins, evolutionary history can be
traced back more than 1 billion years
Evolutionary time scales / Tree of Life
Sequence Homology vs. Sequence Similarity

Homology means common ancestry

6
Three alignments, three meanings

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
           G   VK HGKKV  A     AH D        LS LH  KL
HBB_HUMAN  GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
                  H  KV      A               L  L   H  K
LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
           GS    G      D L     H  D   A  AL D      AH
F11G11.2   GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

7
Pairwise Sequence Alignment Methods

Dynamic Programming
Global Alignment (Needleman-Wunsch)
Local Alignment (Smith-Waterman)
Word, or k-tuple methods
FASTA
BLAST

8
Dynamic Programming Algorithm

Provides the best (optimal) alignment between two
sequences
Includes matches, mismatches and gaps to maximize
the number of matched characters
Scores are assigned to matches, mismatches and
gaps (non-affine vs. affine gap penalties)
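
The difference between non-affine and affine gap scoring can be sketched as follows. The function names and the penalty values (-3 per position, -5 to open, -1 to extend) are illustrative assumptions, and the affine convention used here (open cost covers the first position) is only one of several in use.

```python
def linear_gap_penalty(length, per_gap=-3):
    """Non-affine: every gap position costs the same amount."""
    return per_gap * length


def affine_gap_penalty(length, gap_open=-5, gap_extend=-1):
    """Affine: an opening cost for the first position, a smaller cost for each further one."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# Compare the two schemes for gaps of increasing length.
for k in (1, 2, 5):
    print(k, linear_gap_penalty(k), affine_gap_penalty(k))
```

Under these assumed values a single long gap is cheaper with the affine scheme than with the linear one, which is the usual motivation for affine penalties.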

9
Example

Find optimal global alignment for sequences
GAAGA
and
GTTTAAG

10
Define Rules

Score for match = 1
Score for mismatch = -1
Score for gap = -3

12
Implement Rules

When Moving Horizontally (gap)
Alignment Score = Existing Score + Gap Score
When Moving Vertically (gap)
Alignment Score = Existing Score + Gap Score
When Moving Diagonally (match or mismatch)
Alignment Score = Existing Score + Corner Value
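
The three rules above can be sketched as a matrix fill followed by a traceback. This is a minimal illustration using the scores from the Define Rules slide (match 1, mismatch -1, gap -3); the function names are hypothetical, and the traceback returns only one of the possibly many optimal alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-3):
    """Fill the global-alignment matrix; rows follow a, columns follow b."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):            # first column: vertical moves only
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, cols):            # first row: horizontal moves only
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            corner = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + corner,   # diagonal move
                          F[i - 1][j] + gap,          # vertical move
                          F[i][j - 1] + gap)          # horizontal move
    return F[-1][-1], F


def traceback(a, b, F, match=1, mismatch=-1, gap=-3):
    """Walk back from the bottom-right cell, preferring diagonal moves on ties."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        corner = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + corner:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("_"); i -= 1
        else:
            top.append("_"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


score, matrix = needleman_wunsch("GAAGA", "GTTTAAG")
top, bottom = traceback("GAAGA", "GTTTAAG", matrix)
print(score)
print(top)
print(bottom)
```

Changing the tie-breaking order in the traceback yields other alignments with the same score.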

21
Optimal Global Alignments
Each alignment below scores -7.
1) GAAGA__
   GTTTAAG
2) G_A_AGA
   GTTTAAG
3) G__AAGA
   GTTTAAG
4) G_AAGA_
   GTTTAAG
5) GAAG_A_
   GTTTAAG
6) GAA_GA_
   GTTTAAG
22
Global vs. Local Alignment
23
Smith-Waterman

Dynamic programming - local alignment
Compares query to each sequence in database.
Performs full pairwise comparison.
More sensitive, but much slower than heuristic
algorithms (BLAST or FASTA)

25
Optimal Local Alignment
Sequences: GAAGA and GTTTAAG
AAG
AAG
Score: 3
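
A local alignment differs from the global case in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at a zero. The sketch below assumes the same scores as the global example (match 1, mismatch -1, gap -3); the function name is hypothetical.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-3):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            corner = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                          # restart the alignment
                          H[i - 1][j - 1] + corner,   # diagonal move
                          H[i - 1][j] + gap,          # vertical move
                          H[i][j - 1] + gap)          # horizontal move
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        corner = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + corner:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("_"); i -= 1
        else:
            top.append("_"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("GAAGA", "GTTTAAG"))
```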
26
Heuristic (word or k-tuple based) algorithms

Make reasonable assumptions about nature of
sequence alignments and try out only most
likely alignments
Find perfect match (word, k-tuple)
Extend alignment until
One or the other sequence ends
Score drops below a threshold
Much faster than dynamic programming methods, but
less sensitive
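
The extension step can be sketched as follows, assuming a perfect word match of length k has already been located at query[i:i+k] and subject[j:j+k]. The function name, the ungapped extension, and the stopping rule (stop once the running score falls `drop` below the best score seen) are assumptions made for illustration; the slide only states that extension stops when the score drops below a threshold.

```python
def extend(query, subject, i, j, k, match=1, mismatch=-1, drop=2):
    """Extend an exact word match in both directions without gaps."""
    score = best = k * match

    # Extend to the right of the word.
    right = best_right = 0
    while i + k + right < len(query) and j + k + right < len(subject):
        same = query[i + k + right] == subject[j + k + right]
        score += match if same else mismatch
        right += 1
        if score > best:
            best, best_right = score, right
        elif best - score >= drop:
            break

    # Extend to the left, starting again from the best score so far.
    score = best
    left = best_left = 0
    while i - left - 1 >= 0 and j - left - 1 >= 0:
        same = query[i - left - 1] == subject[j - left - 1]
        score += match if same else mismatch
        left += 1
        if score > best:
            best, best_left = score, left
        elif best - score >= drop:
            break

    return (best,
            query[i - best_left:i + k + best_right],
            subject[j - best_left:j + k + best_right])


# The word "AA" occurs at position 1 of GAAGA and position 4 of GTTTAAG.
print(extend("GAAGA", "GTTTAAG", 1, 4, 2))
```

Only the positions around the seed are examined, which is why this is faster than filling a full dynamic programming matrix, and also why alignments without a perfect word match are missed.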

27
FASTA (Pearson and Lippman 1988)

Combination of word (k-tup) search and
Smith-Waterman algorithm
The query sequence is divided into small words of
certain size.
The initial comparison of the query sequence to
the database is performed using these words.
If these words are located on the same diagonal
in an array, the region surrounding the diagonal
is analyzed further.
Search time is proportional only to the size of the
database, not to (database × query sequence)
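
The word lookup and the diagonal bookkeeping can be sketched as below. A word shared at query position i and database position j lies on the diagonal with offset j - i, so counting hits per offset shows which diagonals deserve a closer look. The function name is hypothetical and this is not the FASTA source code.

```python
from collections import Counter


def diagonal_hits(query, subject, ktup):
    """Count shared words of length ktup per diagonal (offset = j - i)."""
    # Index every word of the query once.
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)

    # A single pass over the database sequence looks each word up in the index.
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            hits[j - i] += 1
    return hits


print(diagonal_hits("GAAGA", "GTTTAAG", 2))
```

Because the query is indexed once and the database sequence is scanned once, the scan cost in this sketch grows with the length of the database sequence plus the number of word hits.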

28
The ktup value

The ktup (for k-tuples) value stands for the
length of the word used to search for identity.
For proteins a ktup value of 3 would give a hash
table of 20^3 elements (8000 entries).
The higher the ktup value the less likely you
will get a match unless it is identical (remember
the dot plots).
The lower the ktup value the more background you
will have.
The higher the ktup value the faster the analysis
(fewer diagonals).

Typical ktup values
ktup  analysis
1     proteins - distantly related
2     proteins - somewhat related (default)
3     DNA - default
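
The table size quoted above follows directly from the alphabet size raised to the word length. The short sketch below (function name assumed) prints the number of possible words for the ktup values listed on this slide.

```python
def lookup_table_size(alphabet_size, ktup):
    """Number of distinct words of length ktup over the given alphabet."""
    return alphabet_size ** ktup


for ktup in (1, 2, 3):
    print("protein ktup", ktup, "->", lookup_table_size(20, ktup), "words")
print("DNA ktup 3 ->", lookup_table_size(4, 3), "words")
```

A larger table means each individual word is rarer, which is why higher ktup values produce fewer chance hits and fewer diagonals to examine.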
29
FASTA Steps
1. Local regions of identity are found (identical
offset values in a contiguous sequence lie on one
diagonal; different offset values do not).
Diagonals are extended
2. Rescore the local regions using scoring matrices
3. Eliminate short diagonals below a cutoff score
4. Create a gapped alignment in a narrow segment and
then perform S-W alignment
30
Summary of FASTA steps

1. Analyzes database for identical matches that
are contiguous.
2. Longest diagonals are scored again using the
PAM matrix (or other matrix). The best scores
are saved as init1 scores.
3. Short diagonals are removed.
4. Long diagonals that are neighbors are joined.
The score for this joined region is initn.
This score may be lower due to a penalty for a
gap.
5. An S-W dynamic programming alignment is
performed around the joined sequences to give an
opt score.
Thus, the time-consuming S-W step is performed
only on top scoring sequences

31
FASTA Versions
fasta: compares a protein sequence to a protein
database or a nucleotide sequence to a nucleotide
database
fastx, fasty: compare a translated query sequence
to a protein sequence database (forward or
backward translation of the query)
tfastx, tfasty: compare a protein query sequence
to a nucleotide sequence database that has been
translated into three forward and three reverse
reading frames
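
Translation into three forward and three reverse reading frames can be sketched as below. The code assumes the standard genetic code and an unambiguous A/C/G/T sequence; the function names are hypothetical and incomplete trailing codons are simply dropped.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations keyed by frame label (+1..+3 forward, -1..-3 reverse)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


print(six_frames("ATGGCC"))
```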
32
BLAST (Karlin and Altschul 1990)

Basic Local Alignment Search Tool
Database is pre-indexed to increase speed
The initial search is done for a word of length
"W" that scores at least "T" when compared to the
query using a substitution matrix.
Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of "S".
The "T" parameter dictates the speed and
sensitivity of the search.
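
The role of "W" and "T" can be illustrated by listing every word of length W that scores at least T against a query word. The sketch below is a toy: it uses only four residues (A, R, N, D) with their PAM250 values taken from the matrix slide of this presentation, and the function names are assumptions. A real search would use the full 20-letter alphabet.

```python
from itertools import product

# PAM250 values for a four-residue subset, stored once per unordered pair.
SCORES = {
    ("A", "A"): 2, ("A", "R"): -2, ("A", "N"): 0, ("A", "D"): 0,
    ("R", "R"): 6, ("R", "N"): 0, ("R", "D"): -1,
    ("N", "N"): 2, ("N", "D"): 2,
    ("D", "D"): 4,
}


def pair_score(x, y):
    """Look up a symmetric score regardless of residue order."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def neighborhood(word, threshold, alphabet="ARND"):
    """All words of len(word) scoring at least threshold against word."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        s = sum(pair_score(q, c) for q, c in zip(word, candidate))
        if s >= threshold:
            hits.append(("".join(candidate), s))
    return hits


print(neighborhood("ND", 4))   # W = 2, T = 4
print(neighborhood("ND", 6))   # raising T shrinks the word list
```

A lower T gives more words to look up in the database (more sensitive, slower); a higher T gives fewer (faster, less sensitive).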

33
BLAST Versions
BLASTN: Compares a nucleotide query to a
nucleotide database.
BLASTP: Compares a protein query to a protein
database.
BLASTX: Compares a translated nucleotide query to
a protein database.
TBLASTN: Compares a protein query to a translated
nucleotide database.
TBLASTX: Compares a translated nucleotide query to
a translated nucleotide database.
34
Scoring Matrices

The alignment score represents the odds of
obtaining that score between sequences known to be
related, relative to the odds of obtaining it by
chance alignment between unrelated sequences
When the correct scoring matrix is used,
alignment statistics are meaningful
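
One common way to turn such an odds ratio into a matrix entry is a scaled logarithm. The sketch below assumes a 10 x log10 scaling and uses made-up frequencies purely for illustration; none of the numbers come from a real matrix.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=10.0):
    """Scaled log of (pair frequency in related sequences) / (frequency expected by chance).

    q_ij: how often residues i and j are aligned in related sequences
    p_i, p_j: background frequencies of the two residues
    """
    return scale * math.log10(q_ij / (p_i * p_j))


# Hypothetical frequencies: background 0.1 each, so chance pairing is 0.01.
print(log_odds_score(0.02, 0.1, 0.1))    # pair seen more often than chance: positive
print(log_odds_score(0.01, 0.1, 0.1))    # pair seen as often as chance: zero
print(log_odds_score(0.005, 0.1, 0.1))   # pair seen less often than chance: negative
```

Adding such scores along an alignment corresponds to multiplying the per-position odds.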

35
Dayhoff PAM Matrix (Point Accepted Mutation)

Lists the likelihood of change from one amino
acid to another in homologous protein sequences
during evolution
PAM 1 estimated using 1572 changes in 71 groups
of protein sequences that were at least 85%
similar
Assumes each amino acid change at a site is
independent of previous changes at the site
PAM 250 (20% similarity) obtained by multiplying
PAM1 by itself 250 times
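
Repeated multiplication of a mutation probability matrix can be sketched with a toy two-letter alphabet. The 2 x 2 matrix below (1% change per step) is an invented stand-in for PAM1, not Dayhoff's data, and the helper names are assumptions.

```python
def mat_mul(A, B):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n):
    """Multiply M by itself n times, starting from the identity matrix."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Toy "PAM1": each letter has a 1% chance of changing in one step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(x, 4) for x in row])
```

Each row of the result still sums to 1, and the diagonal settles near the chance level rather than falling to zero, because later changes at a site can reverse earlier ones.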

36
Blocks Amino Acid Substitution (BLOSUM) Matrix

Based on the observed amino acid substitutions in
blocks (large set of 2000 conserved amino acid
patterns)
Used 500 families of related proteins
Not based on explicit evolutionary model, but
from considering all amino acid changes observed
in an aligned region from a related family of
proteins.

37
PAM250 Scoring Matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6
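
A scoring matrix is used as a lookup table: each aligned pair of residues contributes one entry to the alignment score. The sketch below copies the values from this slide into a text block, parses them, and scores the first twelve ungapped positions of the HBA_HUMAN / HBB_HUMAN alignment shown earlier. The function names are hypothetical, and gaps are not handled.

```python
PAM250_TEXT = """
A R N D C Q E G H I L K M F P S T W Y V B Z
A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 2 1
R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 1 2
N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 4 3
D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 5 4
C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -3 -4
Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 3 5
E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 4 5
G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 2 1
H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 3 3
I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -1 -1
L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -2 -1
K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 2 2
M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -1 0
F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -3 -4
P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 1 1
S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 2 1
T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 2 1
W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -4 -4
Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -2 -3
V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 0 0
B 2 1 4 5 -3 3 4 2 3 -1 -2 2 -1 -3 1 2 2 -4 -2 0 6 5
Z 1 2 3 4 -4 5 5 1 3 -1 -1 2 0 -4 1 1 1 -4 -3 0 5 6
"""


def parse_matrix(text):
    """Turn the whitespace-separated table into a nested dict: matrix[row][column]."""
    lines = [line.split() for line in text.strip().splitlines()]
    columns = lines[0]
    return {row[0]: dict(zip(columns, map(int, row[1:]))) for row in lines[1:]}


PAM250 = parse_matrix(PAM250_TEXT)


def ungapped_score(seq1, seq2, matrix=PAM250):
    """Sum the matrix entries for each aligned residue pair (no gaps allowed)."""
    return sum(matrix[a][b] for a, b in zip(seq1, seq2))


print(PAM250["W"]["W"], PAM250["C"]["C"])
print(ungapped_score("GSAQVKGHGKKV", "GNPKVKAHGKKV"))
```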
38
Summary

Choose appropriate algorithm (speed vs.
sensitivity)
Use smallest database that will answer your
question
Default matrices may not always give a meaningful
score
