Title: Sequence Analysis
1Sequence Analysis
- CSC 487/687 Introduction to computing for
Bioinformatics
2Aligning Sequences
- Sequences
- Representing proteins or nucleic acid (DNA/RNA)
molecules
- Order of amino acids (for proteins) or nucleotides
(for DNA/RNA) along one chain
- Sequence alignment
- The identification of residue-residue
correspondences
- Any assignment of correspondences that preserves
the order of residues within the sequences
3Evolutionary Basis of Sequence Alignment
- Identity: quantity that describes how much
two sequences are alike in the strictest terms.
- Similarity: quantity that relates how much
two amino acid sequences are alike.
- Homology: a conclusion drawn from data
suggesting that two genes share a common
evolutionary history.
4Evolutionary Basis of Sequence Alignment
- Homologous sequences
- Related by evolution (common ancestors)
- Alignment of homologous sequences
- Identifying relationship between the sequence
elements - Match up characters coming from same characters
in ancestor
5Alignment and Evolution
- Assume we know the evolutionary history relating q
and d
- The true alignment can be found using h as a
template
- h GLVS T
- q GLISVT
- d GIV--T
6Alignment and Evolution
- Given an alignment, several different
evolutionary histories may be (equally) plausible
- Example
- Alignment
- q GLISVT
- d G-I-VT
- One possible history
        h = GLIVT
          /   \
  - -> S /     \ L -> -
        /       \
q = GLISVT     d = GIVT
7Global and Local Alignment
- Global
- Assuming that the complete sequences are the
results of evolution from the same ancestor
sequence
- Local
- Align segments of the sequences so that the
segments are evolutionarily related
8Pairwise sequence alignments Vs Multiple sequence
alignments
- Pairwise sequence alignment: an alignment of two
sequences
- Multiple sequence alignment: a mutual alignment
of more than two sequences
9The dotplot
10The dotplot
- Captures not only the overall similarity of two
sequences, but also the complete set and relative
quality of different possible alignments
- Diagonal move: a residue from each sequence is
paired, with no gap introduced
- Horizontal move: a gap is introduced in the
sequence indexing the rows
- Vertical move: a gap is introduced in the
sequence indexing the columns
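To make the dotplot concrete, here is a minimal sketch that marks every cell where the row character equals the column character. The function names `dotplot` and `render` and the use of INVEST / INTEREST as input are illustrative assumptions; real dotplot tools usually add windowing and similarity thresholds, which are left out here.

```python
def dotplot(rows: str, cols: str) -> list[list[bool]]:
    """Cell [i][j] is True when rows[i] and cols[j] are the same character."""
    return [[r == c for c in cols] for r in rows]


def render(rows: str, cols: str) -> str:
    """Draw the dotplot as text: '*' for a matching cell, '.' otherwise."""
    grid = dotplot(rows, cols)
    lines = ["  " + cols]  # header: the sequence indexing the columns
    for r, row in zip(rows, grid):
        lines.append(r + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("INVEST", "INTEREST"))
```

Runs of `*` along a diagonal correspond to stretches that can be aligned without gaps.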
11Dotplots and alignments
- A path through the dotplot can be read as an edit
script
- Each move performs an operation: a
substitution, an insertion or a deletion.
- When the end of the path is reached, the effect
will be to change one sequence into the other.
- Several different sequences of edit operations
may convert one string to the other in the same
number of steps.
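The idea of reading a path as an edit script can be sketched as follows. This is an illustrative example, not the course's own code: an alignment is assumed to be given as two gapped rows using "-" as the blank, and `edit_script`, `apply_script` and `n_edits` are hypothetical helper names.

```python
GAP = "-"  # assumed blank symbol


def edit_script(row_a: str, row_b: str) -> list[tuple[str, str]]:
    """Read an alignment column by column as a list of (x, y) operations."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    return list(zip(row_a, row_b))


def apply_script(a: str, script: list[tuple[str, str]]) -> str:
    """Apply the operations to sequence a and return the resulting string."""
    out, pos = [], 0
    for x, y in script:
        if x == GAP and y == GAP:
            raise ValueError("two blanks cannot be aligned")
        if x != GAP:  # substitution, match or deletion consumes a[pos]
            if pos >= len(a) or a[pos] != x:
                raise ValueError("script does not match the input sequence")
            pos += 1
        if y != GAP:  # substitution, match or insertion emits y
            out.append(y)
    if pos != len(a):
        raise ValueError("script does not consume the whole sequence")
    return "".join(out)


def n_edits(script: list[tuple[str, str]]) -> int:
    """Number of operations that actually change something."""
    return sum(1 for x, y in script if x != y)


if __name__ == "__main__":
    for rows in (("IN--VEST", "INTEREST"), ("INV--EST", "INTEREST")):
        s = edit_script(*rows)
        print(apply_script("INVEST", s), n_edits(s))
```

Both scripts turn INVEST into INTEREST using the same number of changing operations, which illustrates the last bullet above.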
12Dotplots and alignments
- Although a sequence of edit operations derived
from an optimal alignment may correspond to an
actual evolutionary pathway, it is impossible to
prove that it does.
- The larger the edit distance, the larger the
number of reasonable evolutionary pathways
between two sequences.
13Dotplots and alignments
- Dotplots between pairs of proteins with
increasingly distant relationships.
- Dotplot comparisons of the sulphydryl
proteinase papain from papaya with four
homologues: the close relative kiwi fruit
actinidin, and the more distant relatives human
procathepsin L, human cathepsin B, and a
homologue from Staphylococcus aureus.
14Example
15Example
16Example
17Example
18Measures of sequence similarity
- Hamming distance: the number of positions with
mismatching characters.
- Edit distance: the minimum number of edit
operations required to change one string into
the other.
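The two measures can be sketched in a few lines. This is a minimal illustration under stated assumptions: the Hamming distance is only defined here for strings of equal length, and the edit distance counts each substitution, insertion and deletion as one operation. The function names are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimum number of unit-cost substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]  # deleting the first i characters of a
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(hamming("agtc", "cgta"))
    print(edit_distance("INVEST", "INTEREST"))
```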
19What is an Alignment?
- A global alignment of two sequences A and B
contains all characters of A and B in the same
order
- one symbol from A can be aligned with one symbol
from B
- a symbol can be aligned with a blank, written as -
- two blanks cannot be aligned
- Every symbol from A and from B must be aligned
- Example
- A = INVEST, B = INTEREST
IN--VEST   INV--EST   IN-V--EST
INTEREST   INTEREST   IN-TEREST
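The rules above can be turned into a small check. This is a sketch, assuming an alignment is given as two rows of equal length with "-" as the blank; `is_global_alignment` is a hypothetical helper name.

```python
GAP = "-"  # the blank symbol


def is_global_alignment(row_a: str, row_b: str, a: str, b: str) -> bool:
    """True when the two rows form a global alignment of a and b."""
    if len(row_a) != len(row_b):
        return False  # every column needs one entry from each row
    if any(x == GAP and y == GAP for x, y in zip(row_a, row_b)):
        return False  # two blanks cannot be aligned
    # Removing the blanks must give back all characters in the same order.
    return row_a.replace(GAP, "") == a and row_b.replace(GAP, "") == b


if __name__ == "__main__":
    print(is_global_alignment("IN--VEST", "INTEREST", "INVEST", "INTEREST"))
    print(is_global_alignment("INV--EST", "INTEREST", "INVEST", "INTEREST"))
```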
20Computing Alignments
- There exist a large number of alignments for a
pair of sequences
- In order to use a computer to do the alignment
process in a meaningful way, we need
- Scoring scheme: a mathematical way to calculate
the goodness of candidate alignments
- Search method: an algorithm able to identify high
scoring alignments
21Choosing Scoring Scheme
- Scoring scheme should be
- Simple to allow for
- efficient calculation and
- search for best alignment
- Biologically meaningful (give score to
biologically good alignments)
22Simple Scoring Scheme
- Assign score to each column in the alignment
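- Columns are of the following sorts
- two aligned characters x and y: score R(x,y)
- a character aligned with a blank: score -g
(g is the gap penalty)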
- Alignment score: sum of the scores over all columns
- R: matrix giving the score for all possible character
pairs (e.g., all pairs of amino acid symbols)
23Alignment Score Example
- R: identity matrix (identical characters score 1,
unequal characters score 0)
- g = 1
- ALIGN1
V  -  E  I  T  G  E  I  S  T
P  R  E  -  T  E  R  I  -  T
0 -1  1 -1  1  0  0  1 -1  1    Score = 1
- ALIGN2
V  E  I  T  G  E  I  S  T
P  R  E  T  -  E  R  I  T
0  0  0  1 -1  1  0  0  1       Score = 2
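The column scores in this example can be reproduced with a short sketch. It assumes the identity matrix for R (1 for identical characters, 0 otherwise), a gap penalty g = 1 charged per blank, and rows typed without spaces using "-" as the blank; `column_scores` is an illustrative name.

```python
GAP = "-"  # the blank symbol


def column_scores(row_a: str, row_b: str, gap_penalty: int = 1) -> list[int]:
    """Score each alignment column: identity matrix R, -g for a blank."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    scores = []
    for x, y in zip(row_a, row_b):
        if x == GAP or y == GAP:
            scores.append(-gap_penalty)       # character against a blank
        else:
            scores.append(1 if x == y else 0)  # R(x, y), identity matrix
    return scores


if __name__ == "__main__":
    align1 = column_scores("V-EITGEIST", "PRE-TERI-T")
    align2 = column_scores("VEITGEIST", "PRET-ERIT")
    print(align1, sum(align1))
    print(align2, sum(align2))
```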
24Finding the Minimum Scoring Alignment
- Large number of possible alignments: we cannot
generate all of them and score each to find the best
- Task: align
- A = a1a2...am and
- B = b1b2...bn
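To get a feel for how quickly the number of alignments grows, the sketch below counts them. It assumes that two alignments are different whenever their sequences of columns differ (so a gap in A followed by a gap in B is counted separately from the reverse order); under that assumption the last column is one of three kinds, which gives the recurrence used in the code. `count_alignments` is an illustrative name.

```python
def count_alignments(m: int, n: int) -> int:
    """Number of global alignments of sequences of lengths m and n.

    The last column is (a, b), (a, -) or (-, b), so
    N(i, j) = N(i-1, j-1) + N(i-1, j) + N(i, j-1), with N(i, 0) = N(0, j) = 1.
    """
    prev = [1] * (n + 1)  # row for the empty prefix of A
    for _ in range(m):
        cur = [1]         # only one way to align a prefix with nothing
        for j in range(1, n + 1):
            cur.append(prev[j] + cur[j - 1] + prev[j - 1])
        prev = cur
    return prev[n]


if __name__ == "__main__":
    for length in (1, 2, 3, 10, 20):
        print(length, count_alignments(length, length))
```

Even for two sequences of length 20 the count is far too large to enumerate, which is the motivation for a smarter search method.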
25Independence Between Sub-alignments
- Observations
- The score of the alignment up to and including
character i from A and character j from B is
independent of how the rest of the sequences are
aligned
- The best solution to (i,j) can be locked, its
score recorded in Di,j
- Dm,n is the score of the best global alignment
- Amenable to dynamic programming
26Dynamic programming algorithm
- Individual edit operations include
- Substitution of bj for ai: represented (ai, bj)
- Deletion of ai from sequence A: represented
(ai,-)
- Deletion of bj from sequence B: represented
(-,bj)
27Dynamic programming algorithm
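- A cost function d is defined on edit operations
- d(ai, bj): cost of a mutation in an alignment in
which position i of sequence A corresponds to
position j of sequence B
- d(ai,-) or d(-,bj): cost of a deletion or
insertion
- The minimum weighted distance between sequences A
and B is defined as
- D(A,B) = min ( sum of d(x,y) ), where the sum runs
over the edit operations (x,y) in one sequence of
operations converting A into B, and the minimum
is taken over all such sequences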
28Three Alternative Alignment Ends
- The alignment between a1a2...ai and b1b2...bj
ends in one of three ways
- ai aligned with a blank
a1..i-1  ai
b1..j    -
- ai aligned with bj
a1..i-1  ai
b1..j-1  bj
- bj aligned with a blank
a1..i    -
b1..j-1  bj
- To calculate Di,j we pick the one that gives the
lowest cost
29Recurrence Relation
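Assume that Di-1,j, Di-1,j-1, Di,j-1 have been
calculated already
- ai aligned with a blank (a1..i-1 against b1..j,
then ai over -): cost Di-1,j + d(ai,-)
- ai aligned with bj (a1..i-1 against b1..j-1,
then ai over bj): cost Di-1,j-1 + d(ai,bj)
- bj aligned with a blank (a1..i against b1..j-1,
then - over bj): cost Di,j-1 + d(-,bj)
Di,j = min( Di-1,j + d(ai,-),
            Di-1,j-1 + d(ai,bj),
            Di,j-1 + d(-,bj) )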
30Basis of Recursion
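- Aligning the empty string to a string of length i
(resp. j) can only be done by aligning that string
to i (resp. j) blanks
- So D0,0 = 0, Di,0 = d(a1,-) + ... + d(ai,-) and
D0,j = d(-,b1) + ... + d(-,bj)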
31Calculating Score of Best Alignment Using Matrix
H matrix
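The matrix calculation can be sketched as below. This is a minimal illustration, not the course's reference implementation: it assumes a unit cost function (0 for identical characters, 1 for a substitution, insertion or deletion), uses "-" as the blank, and the names `unit_cost`, `fill_matrix` and `traceback` are hypothetical.

```python
GAP = "-"  # the blank symbol


def unit_cost(x: str, y: str) -> int:
    """Assumed cost function d: 0 for a match, 1 for any other operation."""
    return 0 if x == y else 1


def fill_matrix(a: str, b: str, d=unit_cost) -> list[list[int]]:
    """D[i][j] = minimum cost of aligning a[:i] with b[:j]."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # basis: prefix of a against blanks
        D[i][0] = D[i - 1][0] + d(a[i - 1], GAP)
    for j in range(1, n + 1):          # basis: prefix of b against blanks
        D[0][j] = D[0][j - 1] + d(GAP, b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + d(a[i - 1], GAP),           # ai over a blank
                D[i - 1][j - 1] + d(a[i - 1], b[j - 1]),  # ai over bj
                D[i][j - 1] + d(GAP, b[j - 1]),           # blank over bj
            )
    return D


def traceback(a: str, b: str, D, d=unit_cost) -> tuple[str, str]:
    """Recover one optimal alignment by walking back from D[m][n]."""
    i, j = len(a), len(b)
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + d(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + d(a[i - 1], GAP):
            row_a.append(a[i - 1]); row_b.append(GAP); i -= 1
        else:
            row_a.append(GAP); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    D = fill_matrix("INVEST", "INTEREST")
    print(D[-1][-1])
    print(*traceback("INVEST", "INTEREST", D), sep="\n")
```

When several alignments share the minimum cost, the traceback returns only one of them; the order in which the three cases are tested decides which.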
32Time Complexity
- Sequences of lengths n and m: O(nm) time, one
constant-time step per cell of the matrix
- Two sequences of length l: O(l^2) time