Title: Sequence Alignment
1 Sequence Alignment
- Csc 487/687 Computing for bioinformatics
2 Refining the Scoring Scheme: Scoring Matrix
- A scoring matrix measures the relative probability of any particular substitution.
- The relative frequencies of such changes are used to form a scoring matrix for substitutions.
- A likely change scores higher than a rare one.
3 Scoring matrix for nucleic acid sequences
- A simple scheme for substitutions
  - 1 for a match, -1 for a mismatch.
- A more complicated scheme based on the higher frequency of transition mutations than transversion mutations
  - Transitions: a ↔ g and t ↔ c
  - Transversions: (a or g) ↔ (t or c)
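The two schemes above can be sketched in a few lines of Python. The function names and the transition and transversion weights (-1 and -2) are assumptions chosen for illustration; the slides only fix 1 for a match and -1 for a mismatch in the simple scheme.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair (weights are illustrative assumptions)."""
    x, y = x.lower(), y.lower()
    if x == y:
        return match
    # A transition stays within one chemical class (a<->g or t<->c);
    # a transversion swaps a purine for a pyrimidine or vice versa.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def alignment_score(s1, s2, **weights):
    """Sum the pair scores over two equal-length, gap-free sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(nucleotide_score(x, y, **weights) for x, y in zip(s1, s2))


print(alignment_score("acgt", "gcgt"))  # one transition, three matches
print(alignment_score("acgt", "acga"))  # one transversion, three matches
```

Setting `transition` and `transversion` to the same value gives back the simple match/mismatch scheme. Gaps and ambiguity codes are not handled in this sketch.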
4 Refining the Scoring Scheme: Scoring Matrix
- The scheme should return high values for alignments of homologous proteins.
- It should give higher scores to alignments of amino acids that are often seen in corresponding positions in homologous proteins.
5 Scoring Matrices
- Importance of scoring matrices
  - Scoring matrices appear in all analyses involving sequence comparisons.
  - The choice of matrix can strongly influence the outcome of the analysis.
  - Scoring matrices implicitly represent a particular theory of relationships.
  - Understanding the theory underlying a given scoring matrix can aid in making the proper choice.
6 Identity Matrix
     A  C  I  L
  A  1
  C  0  1
  I  0  0  1
  L  0  0  0  1
Simplest type of scoring matrix
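As a minimal sketch, the identity matrix can be held as a dictionary keyed by residue pairs. The helper names `identity_matrix` and `score_alignment` are hypothetical, and the four-letter alphabet is just the one shown on the slide.

```python
def identity_matrix(alphabet):
    """Return a dict scoring 1 for identical residues and 0 otherwise."""
    return {(a, b): int(a == b) for a in alphabet for b in alphabet}


def score_alignment(s1, s2, matrix):
    """Sum matrix scores over two equal-length, gap-free sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(s1, s2))


matrix = identity_matrix("ACIL")
print(matrix[("I", "L")])                      # 0: different residues
print(score_alignment("CALL", "CAIL", matrix))  # 3: three identical positions
```

Under this matrix an I/L pair scores the same as any other mismatch, however chemically similar the two residues are.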
7 Similarity
It is easy to score whether an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to assign a score to amino acids that are somewhat similar.
[Figure: structures of isoleucine and leucine, each drawn with its NH3+ and CO2- groups]
Should they get a 0 (non-identical), a 1 (identical), or something in between?
8 Scoring matrices
- Give a score for each pair of amino acids
- Should reflect
  - The degree of biological relatedness
  - The probability that two amino acids occurring in different sequences have a common ancestor
- Should be symmetric
- Substitution matrices
  - The probability that an amino acid a is changed to amino acid b (in a certain evolutionary time)
  - Are generally not symmetric
9 Scoring matrices
- Identity matrix (scoring 0/1)
- Use of the distances in the genetic code
- Use of amino acid similarities based on physico-chemical properties
- Scoring matrices based on experimental data (PAM, BLOSUM)
10 Dayhoff's PAM Matrices
- Based on experimental data
- t = evolutionary time interval
- Sequences from 34 superfamilies were used
- Divide the sequences into groups (71) of homologous sequences, and make a multiple alignment for each of them
- Construct evolutionary trees for each group, and estimate the mutations that have occurred
- Define an evolutionary model to explain the evolution
- Construct substitution matrices: for each amino acid pair (a, b), an estimate of the probability that amino acid a has mutated to amino acid b in time interval t
- Construct scoring matrices from the substitution matrices
- Note that a and b are variables that stand for any amino acid.
11 Example
12 The model of evolution
- The probability of a mutation in a position is independent of
  - Position and neighbouring residues
  - Previous mutations in the position
- A biological (evolutionary) clock is assumed (meaning a constant rate of mutation)
- This means that evolutionary time can be measured in number of mutations (here substitutions)
- The measure is PAM (Point Accepted Mutation)
- 1 PAM is one accepted mutation per 100 residues
13 The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix
A 1-PAM unit is equivalent to 1 mutation found in a stretch of 2 aligned sequences, each containing 100 amino acids.
Example 1
..CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
..CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
Length 100, 1 mismatch, PAM distance 1.
A k-PAM unit is equivalent to k 1-PAM units (or $M^k$).
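The counting in Example 1 can be sketched as follows. Only the fragments visible on the slide are compared; the aligned length of 100 is taken from the slide because the rest of each sequence is elided. The function names are hypothetical. Counting mismatches per 100 residues is a reasonable estimate only for closely related sequences, since repeated substitutions at the same position are not visible in the alignment.

```python
def count_mismatches(seq1, seq2):
    """Count positions where two equal-length aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(seq1, seq2) if a != b)


def mismatches_per_100(mismatches, aligned_length):
    """Express a mismatch count as mismatches per 100 aligned residues."""
    return 100.0 * mismatches / aligned_length


# Visible parts of the two sequences from Example 1.
frag1 = "CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV"
frag2 = "CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV"

n = count_mismatches(frag1, frag2)
# Assumption: the elided residues match, and the full length is 100 as stated.
pam_estimate = mismatches_per_100(n, 100)
print(n, pam_estimate)
```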
14 Substitution matrix $M^1$
15 Calculate $M^z$ by matrix multiplication, shown for $z = 2$
- $z = 2$ means two mutations per 100 residues
- A residue a can be changed to residue b after 2 PAM for the following reasons
  - a is mutated to b in the first PAM and unchanged in the next, with probability $M_{ab} M_{bb}$
  - a is unchanged in the first PAM and changed to b in the next, with probability $M_{aa} M_{ab}$
  - a is mutated to an amino acid x in the first PAM and then to b in the next, with probability $M_{ax} M_{xb}$, x being any amino acid other than a or b
- These three cases are mutually exclusive, hence
  $$M^2_{ab} = M_{ab} M_{bb} + M_{aa} M_{ab} + \sum_{x \neq a, b} M_{ax} M_{xb} = \sum_{x} M_{ax} M_{xb}$$
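The three cases above add up to an ordinary matrix product, which a short sketch can confirm. The 3-state matrix below is a made-up toy (not real PAM data) so that the arithmetic is easy to follow; a real matrix would be 20 x 20. The helper names are hypothetical.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][x] * B[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, z):
    """Return M multiplied by itself z times (z = 0 gives the identity)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(z):
        result = mat_mul(result, M)
    return result


# Toy substitution matrix: row a, column b holds P(a mutates to b) in 1 PAM.
M1 = [[0.98, 0.01, 0.01],
      [0.02, 0.97, 0.01],
      [0.01, 0.02, 0.97]]

M2 = mat_power(M1, 2)

# The three cases from the slide for a = 0, b = 1 (x = 2 is the only other state).
a, b, x = 0, 1, 2
by_cases = M1[a][b] * M1[b][b] + M1[a][a] * M1[a][b] + M1[a][x] * M1[x][b]
print(M2[a][b], by_cases)
```

Each row of `M2` still sums to 1, as expected for a matrix of probabilities.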
16 Final Scoring Matrix is the Log-Odds Scoring Matrix
[Formula: log-odds score for a pair (a, b), annotated with: a = original amino acid, b = replacement amino acid, $f_b$ = frequency of amino acid b, z = mutational probability matrix number]
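The formula itself did not survive on this slide, so the sketch below assumes the form conventionally used for PAM scores, $S_{ab} = 10 \log_{10}(M^z_{ab} / f_b)$ rounded to the nearest integer. The scale factor 10, the base-10 logarithm, the function name and the input numbers are all assumptions for illustration.

```python
import math


def log_odds_score(m_ab, f_b, scale=10.0):
    """Log-odds score: how much more likely a -> b is than meeting b by chance.

    m_ab  -- probability that a is replaced by b (an entry of M^z)
    f_b   -- background frequency of amino acid b
    scale -- assumed scaling factor before rounding
    """
    return round(scale * math.log10(m_ab / f_b))


# Hypothetical values, not taken from a real PAM matrix.
print(log_odds_score(0.10, 0.05))  # replacement seen more often than chance
print(log_odds_score(0.05, 0.05))  # exactly as often as chance
print(log_odds_score(0.01, 0.05))  # less often than chance
```

A positive score marks a replacement that occurs more often than expected by chance, a negative score one that occurs less often.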
17 $M^{250}$
18 PAM-250 scoring matrix
19 BLOSUM (Henikoff & Henikoff)
- Performs best in identifying distant relationships
- Makes use of the much larger amount of data that has become available since Dayhoff's work
- Based on the BLOCKS database of aligned protein sequences
20 BLOSUM (Henikoff & Henikoff)
- Make multiple alignments and discover blocks not containing gaps (over 2,000 blocks were used)
    ...KIFIMK.......GDEVK...
    ...NLFKTR       GDSKK...
       KIFKTK       GDPKA
       KLFESR       GDAER
       KIFKGR       GDAAK
- For each column in each block, they counted the number of occurrences of each pair of amino acids (210 different pairs: $20 \cdot 21 / 2$)
- A block of length w from an alignment of n sequences has $w \, n(n-1)/2$ occurrences of amino acid pairs
- Let $h_{ab}$ be the number of occurrences of the pair (a, b) in all blocks ($h_{ab} = h_{ba}$)
- T = total number of pairs
- $f_{ab} = h_{ab} / T$
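The pair counting can be sketched on the second block from the slide (five sequences, five columns). Because the counts are symmetric, each unordered pair is stored once under a sorted key; that storage choice and the name `count_pairs` are assumptions of this sketch, not part of the original method description.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered amino acid pairs, column by column, in an ungapped block."""
    width = len(block[0])
    if any(len(row) != width for row in block):
        raise ValueError("all rows of a block must have the same length")
    h = Counter()
    for col in range(width):
        column = [row[col] for row in block]
        # Every pair of sequences contributes one residue pair per column.
        for a, b in combinations(column, 2):
            h[tuple(sorted((a, b)))] += 1
    return h


block = ["GDEVK", "GDSKK", "GDPKA", "GDAER", "GDAAK"]
h = count_pairs(block)
T = sum(h.values())                              # w * n * (n - 1) / 2
f = {pair: count / T for pair, count in h.items()}

print(T)                # 5 columns * 5 * 4 / 2 = 50 pairs
print(h[("G", "G")])    # 10: the first column is all G
print(f[("K", "K")])    # 4 K-K pairs out of 50
```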
21 Gap weighting
- CLUSTAL-W
  - For aligning DNA sequences
    - Use of identity matrix for substitution
    - Gap penalties: 10 for gap initiation and 0.1 for gap extension by one residue
  - For aligning protein sequences
    - BLOSUM62 matrix
    - Gap penalties: 11 for gap initiation and 1 for gap extension by one residue
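A minimal sketch of how an initiation and an extension penalty combine into the cost of one gap. It assumes a gap of k residues costs initiation + k × extension; programs differ on whether the first residue is also charged the extension penalty, so treat that convention, and the name `gap_penalty`, as assumptions.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Cost of one gap of `length` residues (assumed: open + length * extend)."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


# Penalty pairs quoted on the slide: (initiation, extension).
DNA_PENALTIES = (10.0, 0.1)
PROTEIN_PENALTIES = (11.0, 1.0)

print(gap_penalty(3, *DNA_PENALTIES))      # a 3-residue gap in a DNA alignment
print(gap_penalty(3, *PROTEIN_PENALTIES))  # a 3-residue gap in a protein alignment
```

With a small extension penalty, one long gap is much cheaper than several short ones, which favours alignments with few, longer gaps.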