Title: Sequence Alignment
1Sequence Alignment
2Sequence Alignment
- A procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences being compared.
- Two sequences are said to be aligned by writing them across in two rows
- Identical (or similar) characters are matches, and non-identical characters are mismatches.
- Gaps can be introduced in either (or both) sequences to produce a better alignment.
3Sequence alignment between two zinc finger
protein sequences
Colors
- Red: Small (small + hydrophobic (incl. aromatic -Y)) - AVFPMILW
- Blue: Acidic
- Magenta: Basic - RHK
- Green: Hydroxyl + Amine + Basic - Q - STYHCNGQ
- Gray: Others
Symbols
- * Identical
- : Conserved substitutions (same color group)
- . Semi-conserved substitution
Example from Wikipedia, http://en.wikipedia.org/wiki/Sequence_alignment
4Types of Sequence Alignments
- Portion of sequences aligned
  - Global alignment - aligns sequences over their entire length
    - Dotmatrix, Needleman-Wunsch, ClustalW
  - Local alignment - determines the longest/best subsequence pair that gives maximum similarity
    - Smith-Waterman, BLAST
- Number of sequences
  - Pairwise alignment - only two sequences compared
    - Dotmatrix, Needleman-Wunsch, Smith-Waterman, BLAST
  - Multiple alignment - multiple sequences compared
    - ClustalW, MEME
5Dot Plot
- Global, pairwise alignment method
- Full visual comparison of two sequences
- Gives the big picture: a visual depiction of the sequence relationship
- Steps
  - Create a two-dimensional matrix, placing the N-terminal end (in the case of proteins) in the top-left corner
  - For each cell, place a dot at the intersection if the residues in the row and the column match
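The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular tool; the function names `dot_plot` and `render`, the choice to put the second sequence on the rows, and the example sequence are all assumptions made for the sketch.

```python
def dot_plot(seq_a, seq_b):
    """Build the matrix M: M[i][j] is True when seq_b[i] equals seq_a[j].

    Assumption: seq_a runs across the columns and seq_b down the rows,
    so the start of both sequences sits in the top-left corner.
    """
    return [[b == a for a in seq_a] for b in seq_b]


def render(matrix, seq_a, seq_b):
    """Draw the matrix as text, with '*' for a dot and '.' for an empty cell."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, matrix):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


# Hypothetical example: comparing a sequence with itself
example = "GATTACA"
print(render(dot_plot(example, example), example, example))
```

Comparing a sequence with itself always fills the main diagonal; the dots off the diagonal come from residues that occur more than once.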
6 Anatomy of a Dot Plot
Matrix, M, is a two-dimensional grid.
A cell is the intersection of a row, i, and a column, j.
We move through M in row-wise fashion: M(i,j)
[Figure: matrix M with axes labeled Sequence A and Sequence B, marked "i entries" and "j entries"]
7Anatomy of a Dot Plot
In this example, identities were found at M(1,1),
M(2,2), M(3,3).
Connecting the dots, we can see a diagonal, the
identity diagonal
Because the sequences are the same, this is an
intrasequence comparison.
8Dot Plots
This is an intrasequence comparison (inversion in
sequence A)
Note inversion in this portion of Sequence B
9Dot Plot Patterns
Gaps: dissimilarity
Displaced Main Diagonal
Main Diagonal
Similar, but not identical
An indel (insertion/deletion)
Displacement of main diagonal parallel to the
sequence with the insertion
10Dot Plot Patterns
Repeated sequence
Non-self-dotplot (different sequences), tandem
duplication
Self-dotplot (same sequence), tandem duplication
ABCDEFGEFGHIJKLMNO
11Dot Plot Patterns
Number of diags. Interval between diags.
Complex sequence expansion
Inversion (Transposition)
12Dot Plot Patterns
?
Palindrome (Intrastrand)
5' GGCGG 3'
Intrasequence comparison is the method of choice for characterizing internal repeats
Previous examples from http://bioinformatics.weizmann.ac.il/courses/BCG/lectures/02_pairwise/2.2methods/01dotplots.html
13Dot Matrices
protein sequences
DNA sequences
14Random Matches in Dot Matrix
- When comparing DNA sequences, a random match at any one position occurs with probability 1/4 (four bases, assumed equally likely)
- When comparing protein sequences, the probability is 1/20 (twenty amino acids)
- Thus, for comparisons of protein-coding DNA sequences, we should translate them to amino acids first
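To see why translating first helps, the arithmetic can be sketched as follows. The function name `expected_random_dots` and the sequence lengths are hypothetical, and the calculation assumes residues are independent and equally frequent, which real sequences only approximate.

```python
def expected_random_dots(len_a, len_b, alphabet_size):
    """Expected number of chance dots in a window-size-1 dot plot.

    Assumption: every residue is drawn independently and uniformly,
    so each cell holds a dot with probability 1 / alphabet_size.
    """
    return len_a * len_b / alphabet_size


# Hypothetical 300-base coding regions, compared as DNA and as 100-residue proteins
dna_noise = expected_random_dots(300, 300, 4)
protein_noise = expected_random_dots(100, 100, 20)
print(dna_noise, protein_noise)
```

Under these assumptions the DNA comparison is expected to carry 22500 chance dots and the protein comparison 500, both from the smaller matrix and from the lower per-cell probability.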
15To Reduce Random Noise in Dot Matrix
- Specify a window size, w
  - Look at w consecutive residues from each of the two sequences
- Specify a stringency
  - Among the w pairs of residues, count how many pairs match within the window; place a dot only if the count is at least the stringency
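A minimal sketch of the window and stringency filter is shown below. The name `windowed_dot_plot` is hypothetical, and placing the dot at the start of the window (rather than its center, as some tools do) is an assumption.

```python
def windowed_dot_plot(seq_a, seq_b, window, stringency):
    """Dot plot filtered by window size and stringency.

    M[i][j] is True when at least `stringency` of the `window` residue
    pairs starting at seq_b[i] and seq_a[j] are identical.
    Assumption: the dot is recorded at the window's first position.
    """
    rows = len(seq_b) - window + 1
    cols = len(seq_a) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_b[i + k] == seq_a[j + k] for k in range(window))
            row.append(matches >= stringency)
        matrix.append(row)
    return matrix


# Hypothetical example: one mismatch inside a window of 3
print(windowed_dot_plot("ACG", "ATG", window=3, stringency=2))  # dot kept
print(windowed_dot_plot("ACG", "ATG", window=3, stringency=3))  # dot dropped
```

With a window of 1 and a stringency of 1 this reduces to the simple dot matrix.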
16Simple Dot Matrix, Window Size 1
17Window Size is 3
18Window Size is 3, Stringency is 2
19DNA Sequences
single residue identity
16 out of 23 identical
20Protein Sequences
single residue identity
6 out of 23 identical
21Two examples of dotplots
- http://emboss.umdnj.edu
  - Dottup
- http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
  - Dotlet - Java-based, interactive
- Rat cytochrome c
  - Protein NP_036971
  - retrieve the mRNA based on this protein
  - Genomic DNA NW_047691, 2934400-2934900
- Human zinc finger
  - S52507
  - S52508
22Two examples of dotplots
- http://emboss.umdnj.edu
  - Dottup
- http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
  - Dotlet - Java-based, interactive
- HOXB4
  - Chicken
  - NM_205294, NW_001471737 5341000 - 5359000
- HOXD8
  - Chicken
  - NM_205354, NW_001471688 1907000 - 1910000
23Dot Plot
- Advantages
  - All possible matches of residues between two sequences are found
  - Good for finding direct and inverted repeats
  - Allows for fast visual inspection
- Disadvantages
  - Random matches cause noise
  - Computer cannot visually detect diagonals
  - Diagonals can be missed by visual inspection
  - Impractical for a large number of comparisons
- Conclusions
  - For DNA comparisons
    - Use long windows and high stringencies
  - For protein comparisons
    - Use short windows and low stringencies
    - For a short domain of partial similarity, use a longer window and a small stringency
24Needleman-Wunsch
- Global, pairwise alignment method
- Uses a technique called dynamic programming
  - i.e., guaranteed to find the alignment giving the maximum score between two sequences
- Works well for aligning sequences that are similar and roughly equal in size
- Steps
  - Construct a matrix similar to a dot plot of the two sequences
  - Assign similarity scores to each cell in the matrix
  - Trace through the scores in the matrix to find the optimal path
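The three steps can be sketched as follows. This is a minimal version with a single linear gap penalty; the function name, the default scores (match 1, mismatch -1, gap -1), the tie-breaking order in the traceback, and the example sequences are all assumptions for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the bottom-right corner to recover one optimal path
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


# Hypothetical example sequences
print(needleman_wunsch("ACGT", "AGT"))
```

Several paths can share the maximum score; this sketch returns only the one its tie-breaking order reaches first.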
25Similarity Scores
- A method of assigning a score for aligning two amino acids or two DNA bases to each other
- Represented in a matrix similar to this

         A    G    C    T
    A    1   -3   -3   -3
    G   -3    1   -3   -3
    C   -3   -3    1   -3
    T   -3   -3   -3    1
26Needleman-Wunsch Example
27Needleman-Wunsch Example
28Needleman-Wunsch Example
29Smith-Waterman
- Based on Needleman-Wunsch
- Instead of looking at each sequence in its entirety, compare segments of all possible lengths and choose whichever optimizes the similarity measure
- Assign a negative score for a mismatch, and a negative score for each insertion/deletion that depends on its length
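A minimal sketch of the local variant follows. It differs from the global method in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The function name, the default scores, and the example sequences are assumptions, and for brevity the gap penalty here is a fixed cost per gap position rather than a length-dependent one.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(0,  # start a fresh local alignment here
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# Hypothetical sequences sharing only a short common segment
print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```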
30Linear Scores
- Match 2, Mismatch -1, Gap -2
G A A T T C C G T T A G G
A T _ C _ G _ _ A
- Changing the size of the gap doesn't affect the score; only the total number of gap positions (four in both alignments) counts
G A A T T C C G T T A G G
A T _ _ C G _ _ A
31Affine Gap Penalties
- Match 2, Mismatch -1, Gap Opening -2, Gap
Extension -1
G A A T T C C G T T A G G
A T _ C _ G _ _ A
- Changing the size of the gap does affect the score
G A A T T C C G T T A G G
A T _ _ C G _ _ A
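The difference between the two schemes can be checked with a short calculation on the gap runs alone. Reading the underscores in the two alignments gives runs of lengths 1, 1, 2 and 2, 2. The function names are hypothetical, and the affine convention used here (the opening penalty covers the first gap position, and each further position pays the extension penalty) is an assumption; other conventions charge the opening on top of the first extension.

```python
def linear_gap_cost(gap_lengths, gap=-2):
    """Every gap position costs the same, however the positions are grouped."""
    return sum(gap * length for length in gap_lengths)


def affine_gap_cost(gap_lengths, gap_open=-2, gap_extend=-1):
    """Assumed convention: gap_open pays for the first position of a run,
    gap_extend for each additional position in the same run."""
    return sum(gap_open + gap_extend * (length - 1) for length in gap_lengths)


first_alignment = [1, 1, 2]   # three gaps, four gap positions
second_alignment = [2, 2]     # two gaps, four gap positions

print(linear_gap_cost(first_alignment), linear_gap_cost(second_alignment))
print(affine_gap_cost(first_alignment), affine_gap_cost(second_alignment))
```

Under the linear scheme both groupings cost -8. Under the affine scheme the alignment with fewer, longer gaps costs -6 against -7, so it is preferred.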
32Affine Gap Penalties
- Affine gap penalties provide an incentive for the alignment algorithm to keep the sequence together where possible, rather than inserting large numbers of small gaps
- $W_k = 1 + (1/3)k$
  - Gap opening penalty = 1 1/3
  - Gap extension penalty = 1/3 × length of gap
33Illustration of Dynamic Programming
34Intuition of Dynamic Programming
If we already have the optimal solution to

    XY
    AB

then we know the next pair of characters will be one of

    XYZ      XY-      XYZ
    ABC  or  ABC  or  AB-

(where - indicates a gap). So we can extend the match by determining which of these has the highest score.
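Written as a recurrence, the three ways of extending the alignment correspond to the three terms of a maximum. The notation is an assumption introduced here for illustration: $F(i,j)$ is the best score for aligning the first $i$ characters of one sequence with the first $j$ characters of the other, $s(x_i, y_j)$ is the similarity score for the pair, and $g$ is a (negative) per-position gap penalty.

```latex
F(i,j) = \max
\begin{cases}
  F(i-1,\,j-1) + s(x_i, y_j) & \text{align } x_i \text{ with } y_j \\
  F(i-1,\,j) + g             & \text{align } x_i \text{ with a gap} \\
  F(i,\,j-1) + g             & \text{align } y_j \text{ with a gap}
\end{cases}
\qquad F(i,0) = i\,g, \quad F(0,j) = j\,g, \quad F(0,0) = 0
```

Each cell depends only on its three neighbors above and to the left, which is why the matrix can be filled row by row.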
35Illustration of Gotoh's Algorithm
36Gotoh
- Local, pairwise alignment method
- Uses a technique called dynamic programming
  - i.e., guaranteed to find the alignment giving the maximum score between two sequences
- De facto standard for performing local alignments
- Steps
  - Construct a matrix similar to a dot plot of the two sequences
  - Assign similarity scores to each cell in the matrix
  - Trace through the scores in the matrix to find the optimal path
37Example: match 1, mismatch -1, gap -1
38Example: match 1, mismatch -1, gap -1
39Example: match 1, mismatch -1, gap -1
40Example: match 1, mismatch -1, gap -1
41Example: match 1, mismatch -1, gap -1
42Example: match 1, mismatch -1, gap -1
43Example: match 1, mismatch -1, gap -1
44Example: match 1, mismatch -1, gap -1
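With the scores named in these slide titles (match 1, mismatch -1, gap -1), the score of any finished alignment can be checked by adding up its columns. The sketch below assumes a fixed cost per gap position; the function name `score_alignment` and the example rows are hypothetical and are not the sequences from the slides.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Sum the column scores of an alignment written as two equal-length rows.

    Assumption: '-' marks a gap and every gap position costs `gap`.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# Hypothetical alignment: three matches and one gap
print(score_alignment("ACGT", "A-GT"))
```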