Title: Sequence Alignment I
1Sequence Alignment I
- Dot plots
- Dynamic Programming
2Why align sequences?
- conserved sequences?conserved function
- Assess ancestry among homologs (sequences with
common ancestory) to help in gene finding and
annotation - Find consensus motifs among related sequences
(e.g., regulatory and structural regions) - Estimate the rate of evolution
3Definitions
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
4Definitions
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
5Problem
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
6Are the sequences related?
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
7Types of Alignment
- Local
- Global
- Gap-free
- Gapped
8Methods of Alignment
- Dot matrix
- Dynamic programming
- K-tuple
9Dot plots
- To evaluate/visualize similarity between two
sequences - Create a matrix when sequence 1 is a row vector
and sequence 2 is a column vector.
10Dot plot (identity matrix)
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
11Self-alignment
12Self-alignment with sliding window
13Self-alignment with sliding window
14Dotmatrix
A T C A A
A 1 0 0 1 1
T 0 1 0 0 0
C 0 0 1 0 0
G 0 0 0 0 0
A 1 0 0 1 1
Seq1 ATCAA Seq2 ATCGA
15Function dotplot1.m
- function dotmatrixdotplot1(seq1,seq2)
- OUTPUT dotmatrix is an n x m matrix with 1s
and 0s, where 0 means a mismatch and 1 means a
match between two nucleotides - INPUT seq1 and seq2 are strings made of a,t,c,
or g. (seq1 is a row vector 1 x m and seq2 is a
column vector (n x1).
16Run dotplot1.m
- Place dotplot1.m into your directory.
- Open a Matlab session.
- Open the file named dotplot1.m under the file
option. - Define seq1 and seq2 as the following and run
the program - gtgt seq1 attataagg
- gtgt seq2 attataggg
- gtgt dotmatrix dotplot1(seq1,seq2)
- Next, copy and paste each line of the dotplot1.m
on the command window without using to track
what the program is doing.
17Dot plot
Seq1'attataagg Seq2'attataggg'
A match is red (1) a mismatch is blue (0)
18Dot plot with sliding windows
- Sliding windows consider more than one position
at a time. - Similarity cutoffs (threshold) also allow to get
rid of positions with low similarity across the
window. - Sliding window can reduce the noise in the dot
plot.
19Dot plot with sliding windows
Use a sliding window of size w, for example w 3,
such that only positions with w number of
consecutive matches along the diagonal count.
A T C A A
A 1 0 0 1 1
T 0 1 0 0 0
C 0 0 1 0 0
G 0 0 0 0 0
A 1 0 0 1 1
A T C A A
A 3 0 0 0 0
T 0 3 0 0 0
C 0 0 3 0 0
G 0 0 0 0 0
A 0 0 0 0 0
?
20Dot plot with sliding windows
3
0
3
21Dot plot with sliding windows
3
0
0
3
0
0
3
0
0
2
0
0
22Dot plot with sliding windows
A T C A A
A 3 0 0 0 0
T 0 2 0 0 0
C 0 0 2 0 0
G 0 0 0 0 0
A 0 0 0 0 0
23Function dotplot2.m
- function dotmatrix,dotdotplot1(seq1,seq2,w,t)
- Introduced two variables
- W size of the sliding window
- T threshold, number of matches along the
diagonal to assume a match.
24Function dotplot2.mWorksheet (due end of lecture)
- Examine the code in dotplot2.m to see what
commands have been used to slide the window and
count the number of consecutive matches. Briefly
write down your assessment. - Type in different sequences for seq1 and seq2 (no
longer than 30) and try two different w and t
values to run dotplot2.m
25Dot plot with sliding windows
Seq1'attataagg Seq2'attataggg'
A match is red a mismatch is blue
Sliding window size 3 Threshold is 3
Sliding window size 3 Threshold is 2
Sliding window size 1 Threshold is 1
Seq1'attataagg Seq2'attataggg'
Seq1'attataagg Seq2'attataggg'
26Alignment
- Is a pairwise match between the characters of
each sequence. - A true alignment reflects evolutionarily common
ancestry (homology).
27Changes that occur in sequences
- A mutation that replaces one character with
another is a substitution. - An insertion that adds one or more positions and
a deletion that deletes one or more positions are
known as indels (gaps)
28Alignment example
- AATCTATA
- AAGATA
- AATCTATA
- AAGATA
- AATCTATA
- AAGATA
29Gap-free alignment match and mismatch
- An alignment receives for each aligned pair of
identical residues (the match score) and the
penalty for aligned pair of nonidentical residues
(mismatch score). - ? ?
n
match score if seq1seq2
mismatch score if seq1?seq2
- where n is the length of the longer sequence.
30Gap-free alignment example
- AATCTATA
- AAGATA
- AATCTATA
- AAGATA
- AATCTATA
- AAGATA
Alignment scores would be 4, 1, 3, respectively
if the match score is 1 and mismatch score is 0.
31Gaps
- Indels complicate alignments by increasing the
number of possible alignments between two or more
sequences.
32Alignment match, mismatch, and gap penalty
- An alignment receives a score for each aligned
pair of identical residues (the match score) and
the penalty for aligned pair of nonidentical
residues (mismatch score), and a penalty for
insertion of gaps. - ? ?
gap penalty if seq1 - or seq2 -
n
match score if no gaps and seq1seq2
mismatch score if no gaps and seq1?seq2
- where n is the length of the longer sequence.
33Gap-free alignment example
- AATCTATA
- AAG-AT-A
- AATCTATA
- AA-G-ATA
- AATCTATA
- AA--GATA
Alignment scores would be 1, 3, 3, respectively
if the match score is 1 mismatch score is 0 and
gap penalty is -1.
34Origination and Length Penalties
- Simple gap penalties lead to many optimal
alignments (those having the same score). - Mutations are rare, invoking fewest number of
unlikely events is evolutionarily sound. - 3-nt indel would be more common than multiple
single indels.
35Origination and Length Penalties
- Origination penalty starting a new series of
gaps in one of the sequences being aligned. - Length penalty number of sequential missing
characters.
36Gap-free alignment example
- AATCTATA
- AAG-AT-A
- AATCTATA
- AA-G-ATA
- AATCTATA
- AA--GATA
Alignment scores would be -3, -1, 1,
respectively if the match score is 1 mismatch
score is 0 and origination penalty is -2 and
length penalty is -1.
37Scoring matrices
- Mismatch penalty can be used to provide further
discrimination - Two protein sequences, one of which has an
alanine in a given position A substitution to
valine (another small hydrophobic aa) would have
less impact than a change to lysine (a large,
charged residue). - One can weigh these substitutions differently
based on the likelihood of occurrence over time
or based on other characteristics.
38Scoring matrices
- A scoring matrix is used to score each nongap
position in the alignment. - Nucleotide sequences
- Amino acid sequences
39Identity matrix
40Scoring matrices for DNA sequences
A T C G
A 5 -4 -4 -4
T -4 5 -4 -4
C -4 -4 5 -4
G -4 5 -4 5
A T C G
A 1 -5 -5 -1
T -5 1 -1 -5
C -5 -1 1 -5
G -1 -5 -5 1
BLAST matrix
Transition/Transversion matrix
41Scoring Matrices
Need to know the frequency of one amino acid
substituting for another versus that event
happening by chance alone based on frequency of
occurrence of each amino acid odds ratio
P(ab)/q(a)q(b)
42Scoring matrices for amino acids
- Blosum
- PAM (point accepted mutation)
43Alignment and score for aa
44BLOSUM
- Ungapped alignments of related proteins are
grouped using clustering techniques, substitution
rates between clusters are calculated. - A BLOSUM-62 matrix is appropriate for comparing
sequences of approximately 62 sequence
similarity.
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
45PAM matrices
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
46PAM matrices
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
47PAM-1
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
48PAM1(multiplied by 10000)
49PAM1PAM1PAM2
50PAM250 PAM1250
51The Point-Accepted-Mutation (PAM) model of
evolution and the PAM scoring matrix
Observed Difference
Evolutionary Distance In PAMs
1 5 10 20 40 50 60 70 80
1 5 11 23 56 80 112 159 246
52Final Scoring Matrix is the Log-Odds Scoring
Matrix
Replacement amino acid
Original amino acid
Frequency of amino acid b
Mutational probability matrix number
53PAM unit
- PAM-1 is 1 substitution per 100 residues.
54Point accepted mutation (PAM) matrix
- Calculated by observing the substitutions that
occur in alignments between similar sequences
with very high (gt85) identity.
55Construct a multiple sequence alignment
- ACGCTAFKI
- GCGCTAFKI
- ACGCTAFKL
- GCGCTGFKI
- GCGCTLFKI
- ASGCTAFKL
- ACACTAFKL
56Construct a multiple sequence alignment
A phylogenetic tree is created indicating the
order in which various substitutions taken place.
A?G
I?L
A?G
A?L
G?A
C?S
57Generating PAM matrix
- For each amino acid, the frequency with which it
is substituted by every other amino acid is
calculated. Substitutions are considered
symmetric A?G counts as G?A - For example for FGA count all A?G and G?A
substitutions.
58Construct a multiple sequence alignment
FGA A?G A?G G?A 3
A?G
I?L
A?G
A?L
G?A
C?S
59Relative mutability, mi
- Number of times the amino acid is substituted by
any other amino acid in the phylogenetic tree. - This number is then divided by the total number
of mutations that could have affected the
residue. - This denominator is the total number of subs
across entire tree times two, multiplied by freq.
of the amino acid, times a scaling factor
60Construct a multiple sequence alignment
FGA A?G A?G G?A 3
A?G
I?L
A?G
A?L
G?A
C?S
61Relative mutability of A mA
- Total of 4 mutations involving A.
- Total number mutations in the entire tree is 6
should be multiplied by two 6 x 2 12 - Relative frequency of A residues (10 As out of
7x963 residues) is 10/63 0.159. - Thus mA 4/12 x 0.159 x 100 0.0209
62Mutation probability of A to G MGA
- mA 4/12 x 0.159 x 100 0.0209
- MGA mA multiplied by (A?G and G?A subs) and
then this is divided by (all subs involving A) - MGA (0.0209 x 3)/4 0.0156
63Scoring matrix RGA
- log(Mij/fi), where Mij mutation probability of ij
and fi equals to relative frequency of j type
residue. For example f(G) 10/63 0.1587 - RGA log(MGA/f(G)) log(0.0156/0.1587)
- -1.01
64PAM matrix calculator
- http//www.bioinformatics.nl/tools/pam.html
65Choice of matrix
- PAM vs. BLOSUM
- polar residues are less variable in BLOSUM
- Since PAM is based on pairs of entire sequences
with less than 15 divergence, most of the
variations counted are in surface loop regions,
regions under low evolutionary constraints. Asn
is one of the most mutable residues. - BLOSUM matrices are based on conserved regions of
more distantly related proteins and thus ignore
residues in surface loop regions. The mutability
of Asn is close to the average mutability. - Surface loop regions preferentially contain polar
residues - Thus, BLOSUM matrices strongly overestimate the
conservation of polar residues relative to PAM
matrices. -
- Matrix Conservation polar/apolar
- PAM 0.49
- BLOSUM 0.93
66Choice of matrix
- General the choice of matrix should be governed
by the use to which the matrix is put. - Searches for distantly similar sequences, in
which only the BLOCKS are likely to be conserved
should rely on BLOSUM matrices. - Complete alignments of globular protein sequences
should use PAM matrices. - Proteins with extensive transmembrane helices
should be aligned using the transmembrane matrix.
67Choice of matrix
http//mcb.berkeley.edu/labs/king/blast/docs/matri
x_info.html
68Dynamic Programming
- Exhaustive search search all possible
alignments NOT FEASIBLE - Dynamic programming a method of breaking a
problem apart into reasonably sized subproblems
and using these partial results to compute the
final answer. - Needleman and Wunsch
69How to start the alignment?
- How to break down the problem
- Seq1 CACGA
- Seq2 CGA
- Three possible ways to start the problem
- C of seq1 and C of seq2 match
- C ACGA
- C GA
- A gap is inserted into 1st position of seq1
- - CACGA
- C GA
- A gap is inserted into 1st position of seq2
- C ACGA
- - CGA
70How to end the alignment?
- If we knew the score for the best alignment, we
can track back the steps to reach to the best
score. - Use a table to store each step of the alignment
to refer back later on. - Depends on storing partial sequence alignments so
you dont do it again and again.
71Dynamic Programming
72Partial scores table
A C T C G
0 -1 -2 -3 -4 -5
A -1
C -2
A -3
G -4
T -5
A -6
G -7
Initialize a matrix where seq1 is on the rows and
seq2 is on the colums Fill in the first row and
first column with multiples of gap penalty, in
this case it equals to -1.
73Partial scores table
- Start with the first residue of each sequence
- Should A match A?
- The alignment score between A of seq1 and A of
seq2 can come from - diagonal match (1) or mismatch (0)
- From top means a gap (-1) inserted in the first
sequence (from left to right) - From left means that a gap (-1) is inserted in
the second sequence (from top to bottom) - SELECT THE MAXIMUM AMONG THREE
A C T C G
0 -1 -2 -3 -4 -5
A -1 1
C -2
A -3
G -4
T -5
A -6
G -7
74Partial scores table
- Continue towards right
- Does A match C
- diagonal match (1) or mismatch (0)
- From top means a gap (-1) inserted in the first
sequence (from left to right) - From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0
C -2
A -3
G -4
T -5
A -6
G -7
75Partial scores table
- Continue towards right
- Does A match T
- diagonal match (1) or mismatch (0)
- From top means a gap (-1) inserted in the first
sequence (from left to right) - From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1
C -2
A -3
G -4
T -5
A -6
G -7
76Partial scores table
- Continue towards right
- Does A match C
- diagonal match (1) or mismatch (0)
- From top means a gap (-1) inserted in the first
sequence (from left to right) - From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2
C -2
A -3
G -4
T -5
A -6
G -7
77Partial scores table
- Continue towards right
- Does A match G
- diagonal match (1) or mismatch (0)
- From top means a gap (-1) inserted in the first
sequence (from left to right) - From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2
A -3
G -4
T -5
A -6
G -7
78Partial scores table
- Go to second row and continue towards right
- Does C match A
- diagonal match (1) or mismatch (0)
- From top means a gap (-1) inserted in the first
sequence (from left to right) - From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0
A -3
G -4
T -5
A -6
G -7
79Partial scores table
- Continue towards right
- Does C match C
- diagonal match (1) or mismatch (0)
- From top means a gap (-1) inserted in the first
sequence (from left to right) - From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2
A -3
G -4
T -5
A -6
G -7
80Partial scores table
- Continue towards right
- Does C match T
- diagonal match (1) or mismatch (0)
- From top means a gap (-1) inserted in the first
sequence (from left to right) - From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1
A -3
G -4
T -5
A -6
G -7
81Partial scores table
- Continue towards right
- Does C match C
- diagonal match (1) or mismatch (0)
- From top means a gap (-1) inserted in the first
sequence (from left to right) - From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1 0
A -3
G -4
T -5
A -6
G -7
82Partial scores table
- Continue towards right
- Does C match G
- diagonal match (1) or mismatch (0)
- From top means a gap (-1) inserted in the first
sequence (from left to right) - From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1 0 -1
A -3
G -4
T -5
A -6
G -7
83Partial scores table
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1 0 -1
A -3 -1 1 2 1 0
G -4 -2 0 1 2 2
T -5 -3 -1 1 1 2
A -6 -4 -2 0 1 1
G -7 -5 -3 -1 0 2
Fill in all cells in the partial scores
table While you fill in keep tract of from which
direction you have carried away the scores from.
84Partial scores table
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1 0 -1
A -3 -1 1 2 1 0
G -4 -2 0 1 2 2
T -5 -3 -1 1 1 2
A -6 -4 -2 0 1 1
G -7 -5 -3 -1 0 2
Back trace your steps from the optimal alignment
score. Sometimes more than one optimal alignment
is possible.
85Partial scores table
From the end point G G TCG TAG --TCG AGTAG AC
-TCG ACAGTAG
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1 0 -1
A -3 -1 1 2 1 0
G -4 -2 0 1 2 2
T -5 -3 -1 1 1 2
A -6 -4 -2 0 1 1
G -7 -5 -3 -1 0 2
86Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
87Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
88Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
89Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
90Dynamic Programming
Mismatch between D and V is -3 gap is -8
91Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
92Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
93Trace-back
94Global alignments
- Needleman and Wunch algorithm is global
- compares two sequences in their entirety
- a gap penalty is assessed regardless of whether
gaps are located internally within a sequence, or
at the end of one of both sequences.
95Semi-global alignment
- Terminal gaps are undermined because they are
generally due to incomplete data and have no
biological significance. - How is it different than the global algorithm?
96Semi-global vs. global
- In global a vertical move equals to a gap in the
horizontal axis while a move to left equals to a
gap in the vertical axis they are both
penalized. - If one allows initial gaps without penalty in the
sequences, should set the first row and first
column of the partial scores table to 0.
97Semi-global alignment
A C A C T G A T C G
0 0 0 0 0 0 0 0 0 0 0 0
A 0 1 0 1 0 0 0 1 0 0 0
C 0 0 2 1 2 1 0 0 1 1 0
A 0 1 1 3 2 2 1 1 0 1 1
C 0 0 2 2 4 3 2 1 1 1 1
T 0 0 1 2 3 5 4 3 2 1 1
G 0 0 0 1 2 3 6 6 6 6 6
Horizontal moves on the bottom row and vertical
moves on the right most column are penalty free.
98Semi-global alignment
ACACTGATCG ACACTG----
99Local alignment
- Smith-Waterman Algorithm
- If you have a long sequence, and want to find any
subsequences that are similar to any part of
yeast genome. - Local alignment finds the best matching
subsequences within two search sequences.
100Modify global alignment algorithm
- Smith-Waterman Algorithm
- Place a zero in any position in the table if all
of the other methods result in scores lower than
zero.
101Local Alignment
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
102local alignment
A A C C T A T A G C T
0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 1 0 1 0 0
C 0 0 0 1 1 0 0 0 0 0 2 1
G 0 0 0 0 0 0 0 0 0 1 0 1
A 0 1 1 0 0 0 1 0 1 0 0 0
T 0 0 0 0 0 1 0 2 1 0 0 1
A 0 1 1 0 0 0 2 0 3 2 1 0
T 0 0 0 0 0 1 1 3 2 2 1 2
A 0 1 1 0 0 0 2 2 4 3 2 1
103Multiple Sequence Alignment
104GAPS
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
105Use align1 to align enter x and y in the command
line (gtgt)
- gtgt x'ggagaggat'
- x
- ggagaggat
- gtgt y'ccagacct'
- y
- ccagacct
- gtgt align1
106Use align1 to align enter x and y in the command
line (gtgt)
- gtgtalign1
- xalign
- ggagaggat
- yalign
- ccagacc-t
- ggagaggat
- ccagacc-t
107Use alignloc1 to align type in alignloc1 at the
gtgt
- gtgtalignloc1
- xalign
- aga
- yalign
- aga
- aga
- aga
108What is different?No gap penalty at the
initiation
- Initial Conditions for matrices F and I
- for i 2M1
- F(i,1) (i-1)0
- I(i,1) 1 Ivertical
- end
- for j 2N1
- F(1,j) (j-1)0
- I(1,j) 3 Ihorizontal
- end
109What is different?Maximization
- F(i,j) I(i,j) max(F(i-1,j)g F(i-1,j-1)w
F(i,j-1)g 0)
110What is different?Find the max value of the F
matrix
- Yj,fjmax(max(F)) find column id
- Yi,fimax(max(F')) find row id
- ifi(1) set current i
- jfj(1) set current j
111What is different?When maximum value is 0 then
stop
- if I(i,j) 1
- i i-1
- xrev(k) x(i)
- yrev(k) '-'
- elseif I(i,j) 2
- i i-1 j j-1
- xrev(k) x(i)
- yrev(k) y(j)
- elseif I(i,j) 3
- j j-1
- xrev(k) '-'
- yrev(k) y(j)
- else
- break
- end