Sequence Alignment I

About This Presentation

Title:

Sequence Alignment I

Description:

Assess ancestry among homologs (sequences with common ancestory) to help in gene ... AA--GATA ... AA--GATA. Alignment scores would be -3, -1, 1, respectively; ... – PowerPoint PPT presentation

Number of Views:37

Avg rating:3.0/5.0

Slides: 112

Provided by: fenBilk

Category:

more less

Transcript and Presenter's Notes

Title: Sequence Alignment I

1
Sequence Alignment I

Dot plots
Dynamic Programming

2
Why align sequences?

conserved sequences?conserved function
Assess ancestry among homologs (sequences with
common ancestory) to help in gene finding and
annotation
Find consensus motifs among related sequences
(e.g., regulatory and structural regions)
Estimate the rate of evolution

3
Definitions
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
4
Definitions
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
5
Problem
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
6
Are the sequences related?
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
7
Types of Alignment

Local
Global
Gap-free
Gapped

8
Methods of Alignment

Dot matrix
Dynamic programming
K-tuple

9
Dot plots

To evaluate/visualize similarity between two
sequences
Create a matrix when sequence 1 is a row vector
and sequence 2 is a column vector.

10
Dot plot (identity matrix)
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
11
Self-alignment
12
Self-alignment with sliding window
13
Self-alignment with sliding window
14
Dotmatrix
A T C A A
A 1 0 0 1 1
T 0 1 0 0 0
C 0 0 1 0 0
G 0 0 0 0 0
A 1 0 0 1 1
Seq1 ATCAA Seq2 ATCGA
15
Function dotplot1.m

function dotmatrixdotplot1(seq1,seq2)
OUTPUT dotmatrix is an n x m matrix with 1s
and 0s, where 0 means a mismatch and 1 means a
match between two nucleotides
INPUT seq1 and seq2 are strings made of a,t,c,
or g. (seq1 is a row vector 1 x m and seq2 is a
column vector (n x1).

16
Run dotplot1.m

Place dotplot1.m into your directory.
Open a Matlab session.
Open the file named dotplot1.m under the file
option.
Define seq1 and seq2 as the following and run
the program
gtgt seq1 attataagg
gtgt seq2 attataggg
gtgt dotmatrix dotplot1(seq1,seq2)
Next, copy and paste each line of the dotplot1.m
on the command window without using to track
what the program is doing.

17
Dot plot
Seq1'attataagg Seq2'attataggg'
A match is red (1) a mismatch is blue (0)
18
Dot plot with sliding windows

Sliding windows consider more than one position
at a time.
Similarity cutoffs (threshold) also allow to get
rid of positions with low similarity across the
window.
Sliding window can reduce the noise in the dot
plot.

19
Dot plot with sliding windows
Use a sliding window of size w, for example w 3,
such that only positions with w number of
consecutive matches along the diagonal count.
A T C A A
A 1 0 0 1 1
T 0 1 0 0 0
C 0 0 1 0 0
G 0 0 0 0 0
A 1 0 0 1 1
A T C A A
A 3 0 0 0 0
T 0 3 0 0 0
C 0 0 3 0 0
G 0 0 0 0 0
A 0 0 0 0 0
?
20
Dot plot with sliding windows
3
0
3
21
Dot plot with sliding windows
3
0
0
3
0
0
3
0
0
2
0
0
22
Dot plot with sliding windows
A T C A A
A 3 0 0 0 0
T 0 2 0 0 0
C 0 0 2 0 0
G 0 0 0 0 0
A 0 0 0 0 0
23
Function dotplot2.m

function dotmatrix,dotdotplot1(seq1,seq2,w,t)
Introduced two variables
W size of the sliding window
T threshold, number of matches along the
diagonal to assume a match.

24
Function dotplot2.mWorksheet (due end of lecture)

Examine the code in dotplot2.m to see what
commands have been used to slide the window and
count the number of consecutive matches. Briefly
write down your assessment.
Type in different sequences for seq1 and seq2 (no
longer than 30) and try two different w and t
values to run dotplot2.m

25
Dot plot with sliding windows
Seq1'attataagg Seq2'attataggg'
A match is red a mismatch is blue
Sliding window size 3 Threshold is 3
Sliding window size 3 Threshold is 2
Sliding window size 1 Threshold is 1
Seq1'attataagg Seq2'attataggg'
Seq1'attataagg Seq2'attataggg'
26
Alignment

Is a pairwise match between the characters of
each sequence.
A true alignment reflects evolutionarily common
ancestry (homology).

27
Changes that occur in sequences

A mutation that replaces one character with
another is a substitution.
An insertion that adds one or more positions and
a deletion that deletes one or more positions are
known as indels (gaps)

28
Alignment example

AATCTATA
AAGATA
AATCTATA
AAGATA
AATCTATA
AAGATA

29
Gap-free alignment match and mismatch

An alignment receives for each aligned pair of
identical residues (the match score) and the
penalty for aligned pair of nonidentical residues
(mismatch score).
? ?

n
match score if seq1seq2
mismatch score if seq1?seq2

where n is the length of the longer sequence.

30
Gap-free alignment example

AATCTATA
AAGATA
AATCTATA
AAGATA
AATCTATA
AAGATA

Alignment scores would be 4, 1, 3, respectively
if the match score is 1 and mismatch score is 0.
31
Gaps

Indels complicate alignments by increasing the
number of possible alignments between two or more
sequences.

32
Alignment match, mismatch, and gap penalty

An alignment receives a score for each aligned
pair of identical residues (the match score) and
the penalty for aligned pair of nonidentical
residues (mismatch score), and a penalty for
insertion of gaps.
? ?

gap penalty if seq1 - or seq2 -
n
match score if no gaps and seq1seq2
mismatch score if no gaps and seq1?seq2

where n is the length of the longer sequence.

33
Gap-free alignment example

AATCTATA
AAG-AT-A
AATCTATA
AA-G-ATA
AATCTATA
AA--GATA

Alignment scores would be 1, 3, 3, respectively
if the match score is 1 mismatch score is 0 and
gap penalty is -1.
34
Origination and Length Penalties

Simple gap penalties lead to many optimal
alignments (those having the same score).
Mutations are rare, invoking fewest number of
unlikely events is evolutionarily sound.
3-nt indel would be more common than multiple
single indels.

35
Origination and Length Penalties

Origination penalty starting a new series of
gaps in one of the sequences being aligned.
Length penalty number of sequential missing
characters.

36
Gap-free alignment example

AATCTATA
AAG-AT-A
AATCTATA
AA-G-ATA
AATCTATA
AA--GATA

Alignment scores would be -3, -1, 1,
respectively if the match score is 1 mismatch
score is 0 and origination penalty is -2 and
length penalty is -1.
37
Scoring matrices

Mismatch penalty can be used to provide further
discrimination
Two protein sequences, one of which has an
alanine in a given position A substitution to
valine (another small hydrophobic aa) would have
less impact than a change to lysine (a large,
charged residue).
One can weigh these substitutions differently
based on the likelihood of occurrence over time
or based on other characteristics.

38
Scoring matrices

A scoring matrix is used to score each nongap
position in the alignment.
Nucleotide sequences
Amino acid sequences

39
Identity matrix
40
Scoring matrices for DNA sequences
A T C G
A 5 -4 -4 -4
T -4 5 -4 -4
C -4 -4 5 -4
G -4 5 -4 5
A T C G
A 1 -5 -5 -1
T -5 1 -1 -5
C -5 -1 1 -5
G -1 -5 -5 1
BLAST matrix
Transition/Transversion matrix
41
Scoring Matrices
Need to know the frequency of one amino acid
substituting for another versus that event
happening by chance alone based on frequency of
occurrence of each amino acid odds ratio
P(ab)/q(a)q(b)
42
Scoring matrices for amino acids

Blosum
PAM (point accepted mutation)

43
Alignment and score for aa
44
BLOSUM

Ungapped alignments of related proteins are
grouped using clustering techniques, substitution
rates between clusters are calculated.
A BLOSUM-62 matrix is appropriate for comparing
sequences of approximately 62 sequence
similarity.

http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
45
PAM matrices
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
46
PAM matrices
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
47
PAM-1
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
48
PAM1(multiplied by 10000)
49
PAM1PAM1PAM2
50
PAM250 PAM1250
51
The Point-Accepted-Mutation (PAM) model of
evolution and the PAM scoring matrix
Observed Difference
Evolutionary Distance In PAMs
1 5 10 20 40 50 60 70 80
1 5 11 23 56 80 112 159 246
52
Final Scoring Matrix is the Log-Odds Scoring
Matrix

S (a,b) 10 log10(Mab/Pb)

Replacement amino acid
Original amino acid
Frequency of amino acid b
Mutational probability matrix number
53
PAM unit

PAM-1 is 1 substitution per 100 residues.

54
Point accepted mutation (PAM) matrix

Calculated by observing the substitutions that
occur in alignments between similar sequences
with very high (gt85) identity.

55
Construct a multiple sequence alignment

ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL

56
Construct a multiple sequence alignment
A phylogenetic tree is created indicating the
order in which various substitutions taken place.
A?G
I?L
A?G
A?L
G?A
C?S
57
Generating PAM matrix

For each amino acid, the frequency with which it
is substituted by every other amino acid is
calculated. Substitutions are considered
symmetric A?G counts as G?A
For example for FGA count all A?G and G?A
substitutions.

58
Construct a multiple sequence alignment
FGA A?G A?G G?A 3
A?G
I?L
A?G
A?L
G?A
C?S
59
Relative mutability, mi

Number of times the amino acid is substituted by
any other amino acid in the phylogenetic tree.
This number is then divided by the total number
of mutations that could have affected the
residue.
This denominator is the total number of subs
across entire tree times two, multiplied by freq.
of the amino acid, times a scaling factor

60
Construct a multiple sequence alignment
FGA A?G A?G G?A 3
A?G
I?L
A?G
A?L
G?A
C?S
61
Relative mutability of A mA

Total of 4 mutations involving A.
Total number mutations in the entire tree is 6
should be multiplied by two 6 x 2 12
Relative frequency of A residues (10 As out of
7x963 residues) is 10/63 0.159.
Thus mA 4/12 x 0.159 x 100 0.0209

62
Mutation probability of A to G MGA

mA 4/12 x 0.159 x 100 0.0209
MGA mA multiplied by (A?G and G?A subs) and
then this is divided by (all subs involving A)
MGA (0.0209 x 3)/4 0.0156

63
Scoring matrix RGA

log(Mij/fi), where Mij mutation probability of ij
and fi equals to relative frequency of j type
residue. For example f(G) 10/63 0.1587
RGA log(MGA/f(G)) log(0.0156/0.1587)
-1.01

64
PAM matrix calculator

http//www.bioinformatics.nl/tools/pam.html

65
Choice of matrix

PAM vs. BLOSUM
polar residues are less variable in BLOSUM
Since PAM is based on pairs of entire sequences
with less than 15 divergence, most of the
variations counted are in surface loop regions,
regions under low evolutionary constraints. Asn
is one of the most mutable residues.
BLOSUM matrices are based on conserved regions of
more distantly related proteins and thus ignore
residues in surface loop regions. The mutability
of Asn is close to the average mutability.
Surface loop regions preferentially contain polar
residues
Thus, BLOSUM matrices strongly overestimate the
conservation of polar residues relative to PAM
matrices.
Matrix Conservation polar/apolar
PAM 0.49
BLOSUM 0.93

66
Choice of matrix

General the choice of matrix should be governed
by the use to which the matrix is put.
Searches for distantly similar sequences, in
which only the BLOCKS are likely to be conserved
should rely on BLOSUM matrices.
Complete alignments of globular protein sequences
should use PAM matrices.
Proteins with extensive transmembrane helices
should be aligned using the transmembrane matrix.

67
Choice of matrix
http//mcb.berkeley.edu/labs/king/blast/docs/matri
x_info.html
68
Dynamic Programming

Exhaustive search search all possible
alignments NOT FEASIBLE
Dynamic programming a method of breaking a
problem apart into reasonably sized subproblems
and using these partial results to compute the
final answer.
Needleman and Wunsch

69
How to start the alignment?

How to break down the problem
Seq1 CACGA
Seq2 CGA
Three possible ways to start the problem
C of seq1 and C of seq2 match
C ACGA
C GA
A gap is inserted into 1st position of seq1
- CACGA
C GA
A gap is inserted into 1st position of seq2
C ACGA
- CGA

70
How to end the alignment?

If we knew the score for the best alignment, we
can track back the steps to reach to the best
score.
Use a table to store each step of the alignment
to refer back later on.
Depends on storing partial sequence alignments so
you dont do it again and again.

71
Dynamic Programming
72
Partial scores table
A C T C G
0 -1 -2 -3 -4 -5
A -1
C -2
A -3
G -4
T -5
A -6
G -7
Initialize a matrix where seq1 is on the rows and
seq2 is on the colums Fill in the first row and
first column with multiples of gap penalty, in
this case it equals to -1.
73
Partial scores table

Start with the first residue of each sequence
Should A match A?
The alignment score between A of seq1 and A of
seq2 can come from
diagonal match (1) or mismatch (0)
From top means a gap (-1) inserted in the first
sequence (from left to right)
From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)
SELECT THE MAXIMUM AMONG THREE

A C T C G
0 -1 -2 -3 -4 -5
A -1 1
C -2
A -3
G -4
T -5
A -6
G -7
74
Partial scores table

Continue towards right
Does A match C
diagonal match (1) or mismatch (0)
From top means a gap (-1) inserted in the first
sequence (from left to right)
From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)

A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0
C -2
A -3
G -4
T -5
A -6
G -7
75
Partial scores table

Continue towards right
Does A match T
diagonal match (1) or mismatch (0)
From top means a gap (-1) inserted in the first
sequence (from left to right)
From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)

A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1
C -2
A -3
G -4
T -5
A -6
G -7
76
Partial scores table

Continue towards right
Does A match C
diagonal match (1) or mismatch (0)
From top means a gap (-1) inserted in the first
sequence (from left to right)
From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)

A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2
C -2
A -3
G -4
T -5
A -6
G -7
77
Partial scores table

Continue towards right
Does A match G
diagonal match (1) or mismatch (0)
From top means a gap (-1) inserted in the first
sequence (from left to right)
From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)

A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2
A -3
G -4
T -5
A -6
G -7
78
Partial scores table

Go to second row and continue towards right
Does C match A
diagonal match (1) or mismatch (0)
From top means a gap (-1) inserted in the first
sequence (from left to right)
From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)

A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0
A -3
G -4
T -5
A -6
G -7
79
Partial scores table

Continue towards right
Does C match C
diagonal match (1) or mismatch (0)
From top means a gap (-1) inserted in the first
sequence (from left to right)
From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)

A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2
A -3
G -4
T -5
A -6
G -7
80
Partial scores table

Continue towards right
Does C match T
diagonal match (1) or mismatch (0)
From top means a gap (-1) inserted in the first
sequence (from left to right)
From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)

A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1
A -3
G -4
T -5
A -6
G -7
81
Partial scores table

Continue towards right
Does C match C
diagonal match (1) or mismatch (0)
From top means a gap (-1) inserted in the first
sequence (from left to right)
From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)

A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1 0
A -3
G -4
T -5
A -6
G -7
82
Partial scores table

Continue towards right
Does C match G
diagonal match (1) or mismatch (0)
From top means a gap (-1) inserted in the first
sequence (from left to right)
From left means that a gap (-1) is inserted in
the second sequence (from top to bottom)

A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1 0 -1
A -3
G -4
T -5
A -6
G -7
83
Partial scores table
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1 0 -1
A -3 -1 1 2 1 0
G -4 -2 0 1 2 2
T -5 -3 -1 1 1 2
A -6 -4 -2 0 1 1
G -7 -5 -3 -1 0 2
Fill in all cells in the partial scores
table While you fill in keep tract of from which
direction you have carried away the scores from.
84
Partial scores table
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1 0 -1
A -3 -1 1 2 1 0
G -4 -2 0 1 2 2
T -5 -3 -1 1 1 2
A -6 -4 -2 0 1 1
G -7 -5 -3 -1 0 2
Back trace your steps from the optimal alignment
score. Sometimes more than one optimal alignment
is possible.
85
Partial scores table
From the end point G G TCG TAG --TCG AGTAG AC
-TCG ACAGTAG
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0 2 1 0 -1
A -3 -1 1 2 1 0
G -4 -2 0 1 2 2
T -5 -3 -1 1 1 2
A -6 -4 -2 0 1 1
G -7 -5 -3 -1 0 2
86
Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
87
Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
88
Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
89
Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
90
Dynamic Programming
Mismatch between D and V is -3 gap is -8
91
Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
92
Dynamic Programming
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
93
Trace-back
94
Global alignments

Needleman and Wunch algorithm is global
compares two sequences in their entirety
a gap penalty is assessed regardless of whether
gaps are located internally within a sequence, or
at the end of one of both sequences.

95
Semi-global alignment

Terminal gaps are undermined because they are
generally due to incomplete data and have no
biological significance.
How is it different than the global algorithm?

96
Semi-global vs. global

In global a vertical move equals to a gap in the
horizontal axis while a move to left equals to a
gap in the vertical axis they are both
penalized.
If one allows initial gaps without penalty in the
sequences, should set the first row and first
column of the partial scores table to 0.

97
Semi-global alignment
A C A C T G A T C G
0 0 0 0 0 0 0 0 0 0 0 0
A 0 1 0 1 0 0 0 1 0 0 0
C 0 0 2 1 2 1 0 0 1 1 0
A 0 1 1 3 2 2 1 1 0 1 1
C 0 0 2 2 4 3 2 1 1 1 1
T 0 0 1 2 3 5 4 3 2 1 1
G 0 0 0 1 2 3 6 6 6 6 6
Horizontal moves on the bottom row and vertical
moves on the right most column are penalty free.
98
Semi-global alignment
ACACTGATCG ACACTG----
99
Local alignment

Smith-Waterman Algorithm
If you have a long sequence, and want to find any
subsequences that are similar to any part of
yeast genome.
Local alignment finds the best matching
subsequences within two search sequences.

100
Modify global alignment algorithm

Smith-Waterman Algorithm
Place a zero in any position in the table if all
of the other methods result in scores lower than
zero.

101
Local Alignment
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
102
local alignment
A A C C T A T A G C T
0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 1 0 1 0 0
C 0 0 0 1 1 0 0 0 0 0 2 1
G 0 0 0 0 0 0 0 0 0 1 0 1
A 0 1 1 0 0 0 1 0 1 0 0 0
T 0 0 0 0 0 1 0 2 1 0 0 1
A 0 1 1 0 0 0 2 0 3 2 1 0
T 0 0 0 0 0 1 1 3 2 2 1 2
A 0 1 1 0 0 0 2 2 4 3 2 1
103
Multiple Sequence Alignment
104
GAPS
http//ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/
CourseHome/
105
Use align1 to align enter x and y in the command
line (gtgt)