Title: Scoring matrices
1 Scoring matrices
2 Scoring Matrix Types
- Identity matrix: exact matches receive one score
and non-exact matches a different score (say 1 and
0, or 6 and 1 for local alignment).
- Mutation data matrix: a scoring matrix compiled
from observed protein point mutations
(PAM, BLOSUM).
- Physical properties matrix: amino acids with
similar properties (e.g. hydrophobicity)
receive a high score.
- Genetic code matrix: amino acids are scored
based on similarities in their coding triplets
(codons).
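As an illustration of the simplest of these types, the following minimal Python sketch scores an ungapped alignment with an identity matrix. The function names `identity_score` and `score_alignment` and the two example sequences are hypothetical, chosen only for this illustration.

```python
def identity_score(a, b, match=1, mismatch=0):
    """Identity matrix: one score for exact matches, another otherwise."""
    return match if a == b else mismatch


def score_alignment(s, t, score=identity_score):
    """Sum the per-column scores of two equal-length, ungapped sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(score(a, b) for a, b in zip(s, t))


# Hypothetical example sequences: 6 matching columns, 4 mismatching ones.
seq_s = "HEAGAWGHEE"
seq_t = "HEAPAWHEAE"
total_1_0 = score_alignment(seq_s, seq_t)
total_6_1 = score_alignment(seq_s, seq_t,
                            lambda a, b: identity_score(a, b, 6, 1))
print(total_1_0, total_6_1)
```

The same `score_alignment` function works with any other scoring function, which is how the mutation data, physical property, and genetic code matrices would be plugged in.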
3 Substitution Matrix
- Some amino acids substitute easily for one another because of
similar physicochemical properties
- Isoleucine for Valine (both small, hydrophobic)
- Serine for Threonine (both polar)
- Such changes are called conservative
- Thus, we need a way to increase the sensitivity of the
alignment algorithm
- Solution: a substitution matrix
- Therefore, we need a range of values that depend
on the nature of the sequences being compared
- Identical amino acids > conservative
substitutions > nonconservative substitutions
4 Choice of scoring matrix is dictated by the
alignment goals
- Two proteins are homologous if (and only if) they
are evolutionarily related (have a common
ancestor)
- Homologous proteins are likely to have related
functions (and have the same fold)
- Scoring matrices must in some way model our
understanding of protein evolution.
- Based on the result of the search we have to be
able to decide if the discovered sequence
similarity could happen by chance or is a
signature of likely homology.
5 BLOSUM
- Block: a short contiguous interval of multiple
aligned sequences.
- BLOCKS: a database of 3,000 blocks of highly
conserved sequences representing hundreds of
protein groups.
- http://www.blocks.fhcrc.org/
- BLOCKS → substitution frequencies → log-odds
score.
- Within each block, cluster the sequences that fall within a
certain similarity threshold (80% similarity
yields BLOSUM80) and have each such cluster
represented by one sequence, or average its
contribution.
- BLOSUM62 is most similar to PAM250 (and is believed to be
better).
6 BLOSUM METHOD
[Figure: flow diagram with the labels "Data base" and "Data base of blocks"]
7 Methods
- Deriving a frequency table from a database of
blocks.
The frequency table consists of all possible amino
acid pairs in a column.
- For a column with 9 A and 1 S there are $8 + 7 + \dots + 1 = 36$ AA pairs
- 9 AS or SA pairs
- no SS pairs
A block of width $w$ and depth $s$
contributes $w\,s(s-1)/2$ pairs; here $1 \times 10 \times (10-1)/2 = 45$.
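The pair counting for a single column can be sketched in a few lines of Python. This is a minimal illustration and not the original BLOCKS software; the function name `count_column_pairs` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(column):
    """Count unordered amino acid pairs in one alignment column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        # Sort each pair so that AS and SA are counted together.
        counts[tuple(sorted((a, b)))] += 1
    return counts


column = "AAAAAAAAAS"  # one column of depth 10: 9 A and 1 S
pairs = count_column_pairs(column)
print(pairs[("A", "A")], pairs[("A", "S")], pairs[("S", "S")])  # 36 9 0
```

Summing such counters over every column of every block would give the full frequency table of amino acid pairs.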
8 METHODS
- The result of this counting is a frequency table
listing the number of times each of the
$20 + 19 + \dots + 1 = 210$ different amino acid pairs occurs
among the blocks.
- The table is used to calculate a matrix
representing the odds ratio between these observed
frequencies and those expected by chance.
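9 METHODS
Observed frequency of each pair: $q_{ij} = f_{ij} \Big/ \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}$
$f_{AA} = 36$, $f_{AS} = 9$; $q_{AA} = 36/45 = 0.8$, $q_{AS} = 9/45 = 0.2$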
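10 Methods
Expected probability of amino acid $i$ occurring in a pair: $p_i = q_{ii} + \sum_{j \ne i} q_{ij}/2$
$p_A = (36 + 9/2)/45 = 0.9$, $p_S = (0 + 9/2)/45 = 0.1$
- for $i = j$: $e_{ij} = p_i p_j$
- $e_{AA} = p_A p_A = 0.9 \times 0.9 = 0.81$
- for $i \ne j$: $e_{ij} = p_i p_j + p_j p_i = 2 p_i p_j$
- $e_{AS} = p_A p_S + p_S p_A = 2 p_A p_S = 2 (0.9 \times 0.1) = 0.18$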
11 Methods
- An odds ratio matrix is calculated where each
entry is $q_{ij}/e_{ij}$
- The logarithm of the odds ratio (lod) is taken in bit units:
- $S_{ij} = \log_2 (q_{ij}/e_{ij})$
- The lod is then used as the score
- If the observed frequency is
- the same as expected, then $S_{ij} = 0$
- less than expected, then $S_{ij} < 0$
- more than expected, then $S_{ij} > 0$
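The chain from pair counts to lod scores (observed frequency q, amino acid probability p, expected frequency e, then the base-2 logarithm of q/e) can be sketched as follows. This is a minimal illustration for the 9 A / 1 S column, not the published implementation; the function name `blosum_log_odds` is hypothetical, and the scores are left unrounded and unscaled.

```python
import math


def blosum_log_odds(pair_counts):
    """Return {(i, j): S_ij in bits} from unordered pair counts f_ij."""
    total = sum(pair_counts.values())
    q = {pair: f / total for pair, f in pair_counts.items()}

    # p_i = q_ii + sum over j != i of q_ij / 2
    p = {}
    for (i, j), q_ij in q.items():
        if i == j:
            p[i] = p.get(i, 0.0) + q_ij
        else:
            p[i] = p.get(i, 0.0) + q_ij / 2
            p[j] = p.get(j, 0.0) + q_ij / 2

    scores = {}
    for (i, j), q_ij in q.items():
        # e_ij = p_i * p_j on the diagonal, 2 * p_i * p_j off the diagonal
        e_ij = p[i] * p[j] if i == j else 2 * p[i] * p[j]
        if q_ij > 0:  # pairs that were never observed have no finite score
            scores[(i, j)] = math.log2(q_ij / e_ij)
    return scores


scores = blosum_log_odds({("A", "A"): 36, ("A", "S"): 9})
for pair, s in sorted(scores.items()):
    print(pair, round(s, 3))
```

In this toy column AA pairs are observed slightly less often than expected (0.8 against 0.81), so their score is slightly negative, while AS pairs are observed more often than expected (0.2 against 0.18) and score positive.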
12 METHODS
- Clustering segments within blocks
- Sequences are clustered within blocks, and each
cluster is weighted. This is done by specifying a
clustering percentage: sequence segments
that are identical for at least that percentage
of amino acids are grouped together.
- The lod matrix derived from a database of blocks
in which sequences that are identical at ≥ 80% of
aligned residues are clustered is referred to as
BLOSUM80, and so forth.
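A minimal sketch of such clustering is shown below. It assumes single-linkage grouping (a segment joins a cluster if it reaches the threshold against any member) and assumes that each cluster's total weight of 1 is shared equally among its members; the function names and the three example segments are hypothetical.

```python
def percent_identity(s, t):
    """Percentage of identical positions in two equal-length segments."""
    matches = sum(a == b for a, b in zip(s, t))
    return 100.0 * matches / len(s)


def cluster_segments(segments, threshold):
    """Group segment indices; linked clusters are merged (single linkage)."""
    clusters = []  # each cluster is a sorted list of indices into `segments`
    for idx, seg in enumerate(segments):
        linked = [c for c in clusters
                  if any(percent_identity(seg, segments[m]) >= threshold
                         for m in c)]
        merged = [idx]
        for c in linked:
            merged.extend(c)
        clusters = [c for c in clusters if c not in linked]
        clusters.append(sorted(merged))
    return clusters


def segment_weights(clusters):
    """Assumption: every cluster counts as one sequence, shared equally."""
    return {idx: 1.0 / len(c) for c in clusters for idx in c}


# Hypothetical block: the first two segments are 90% identical.
segments = ["ACDEFGHIKL", "ACDEFGHIKV", "WYWYWYWYWY"]
clusters = cluster_segments(segments, threshold=80)
weights = segment_weights(clusters)
print(clusters, weights)
```

With these weights, pairs would be counted between clusters rather than within them, so a group of near-identical sequences does not dominate the frequency table.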
13 The Dayhoff Matrix (PAM)
- Developed by Margaret Dayhoff, 1978.
- Counted the likelihood of all possible substitutions
in closely related proteins.
- Derived the mutability matrix $M_{i,j}$:
- the probability that $A_i$ mutates to $A_j$ in one
evolutionary unit (1 PAM).
- Multiplying $M$ by itself extrapolates to higher
evolutionary orders ($M^k$).
14 PAM units
- Log-odds approach: scores are proportional to the log
of the ratio of target frequencies to background
frequencies
- PAM: Point Accepted Mutation / Percent Accepted
Mutation
- Two sequences S and T are defined to be one PAM
unit diverged if a series of accepted point
mutations (and no insertions/deletions) can convert
S to T with an average of one mutation per 100
residues.
- Point accepted mutation: a mutation of one residue
that has been accepted by evolution.
15 PAM units
- Problem 1: given two sequences, you cannot tell
their PAM distance in the strict sense of the
above definition, since one residue could mutate
more than once
- BUT: if you take sequences that are closely
related, then the problem above is unlikely to occur.
- Problem 2: a change could happen by
deletion/insertion
16 PAM Matrices - Summary
- There is a sequence of PAM matrices
- PAMn attempts to provide proper scoring for
sequences that diverged n PAM units.
- The PAMn matrix is obtained from PAM1 assuming a Markov
model of protein evolution in which the transition
probabilities in 1 PAM step are given by PAM1.
- $\mathrm{PAM}n = (\mathrm{PAM}1)^n$
- PAM1 is constructed from highly similar
sequences (believed to be at most a few PAM
units apart) so that Problems 1 and 2 are unlikely to
occur.
17 Computation representation
- Define
- $f_p(a)$: the probability of occurrence of
amino acid $a$.
- $f(a,b)$: the number of times the mutation $a \leftrightarrow b$ was observed
( $f(a,b) = f(b,a)$ )
- $f(a) = \sum_{b \ne a} f(a,b)$
- $m(a)$: the mutability of amino acid $a$, $m(a) = f(a) / f_p(a)$
18 Computation representation (cont.)
- $M(a,b)$: the probability of amino acid $a$ changing
to amino acid $b$
- $M(a,b) = \Pr(a \to b)$
- $= \Pr(a \to b \mid a \text{ changed}) \, \Pr(a \text{ changed})$
- $= f(a,b) \, m(a) / f(a)$
- (the conditional probability above is
estimated as the ratio between the $a \leftrightarrow b$ mutations
and the total number of mutations involving $a$)
- $M(a,a) = 1 - m(a)$: the probability that $a$ is unchanged
(the diagonal elements)
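The construction of M from the counts can be sketched as below. The three-letter alphabet, the counts, and the frequencies are made-up toy values, and the function name is hypothetical. One assumption is added that the definitions above leave implicit: the relative mutabilities f(a)/f_p(a) are rescaled by a constant so that, on average, 1 residue in 100 changes (one PAM).

```python
def mutation_probability_matrix(pair_counts, freqs, target_change=0.01):
    """Build M[(a, b)] = Pr(a -> b) from symmetric mutation counts."""
    letters = sorted(freqs)

    def f_ab(a, b):
        # Counts are symmetric: f(a, b) = f(b, a).
        return pair_counts.get((a, b), pair_counts.get((b, a), 0))

    # f(a): total number of observed mutations involving a
    f = {a: sum(f_ab(a, b) for b in letters if b != a) for a in letters}
    relative = {a: f[a] / freqs[a] for a in letters}
    # Assumed normalization: the sum of f_p(a) * m(a) equals target_change.
    scale = target_change / sum(freqs[a] * relative[a] for a in letters)
    m = {a: scale * relative[a] for a in letters}

    M = {}
    for a in letters:
        for b in letters:
            if a == b:
                M[(a, b)] = 1.0 - m[a]
            elif f[a] > 0:
                M[(a, b)] = f_ab(a, b) / f[a] * m[a]
            else:
                M[(a, b)] = 0.0
    return M, m


# Toy data (hypothetical): three residues with made-up counts.
freqs = {"A": 0.5, "S": 0.3, "T": 0.2}
pair_counts = {("A", "S"): 30, ("A", "T"): 10, ("S", "T"): 20}
M, m = mutation_probability_matrix(pair_counts, freqs)
for a in sorted(freqs):
    print(a, [round(M[(a, b)], 5) for b in sorted(freqs)])
```

Each row of the result sums to 1, with the diagonal entry carrying the probability that the residue is unchanged.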
19 Relatedness odds Matrix
- $M(a,b)$ gives the probability that amino acid $a$
will change to $b$ in a related sequence in a given
interval
- $f_p(b)$ is the chance of a random occurrence of
amino acid $b$
- $\mathrm{Score}(a,b) = 10 \log_{10} \big( M(a,b) / f_p(b) \big)$
- (symmetric matrix)
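The score formula can be checked numerically with a small sketch. The two-letter alphabet and its numbers are hypothetical; they were chosen so that frequency times mutation probability is the same in both directions, which is the property that makes the score matrix symmetric.

```python
import math


def relatedness_score(M, freqs, a, b):
    """10 * log10 of the odds that a aligned with b reflects relatedness."""
    return 10 * math.log10(M[(a, b)] / freqs[b])


# Toy values (hypothetical): 0.6 * 0.01 equals 0.4 * 0.015.
freqs = {"X": 0.6, "Y": 0.4}
M = {("X", "X"): 0.99, ("X", "Y"): 0.01,
     ("Y", "X"): 0.015, ("Y", "Y"): 0.985}

score_xy = relatedness_score(M, freqs, "X", "Y")
score_yx = relatedness_score(M, freqs, "Y", "X")
score_xx = relatedness_score(M, freqs, "X", "X")
print(round(score_xy, 2), round(score_yx, 2), round(score_xx, 2))
```

Unchanged residues score positive because they occur more often in related sequences than chance alone predicts; the rare substitution scores negative.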
20 PAM
- Consider two amino acids (or nucleotides) $i$ and $j$
with frequencies $f_i$ and $f_j$.
- $P(\text{random alignment of } i \text{ and } j) = f_i f_j$.
21 PAM
22 Long Distance Evolution
- There is a different mutation probability matrix
for each evolutionary interval. These can be
derived from the one for 1 PAM by matrix
multiplication.
- e.g. in 2 PAM units of evolution:
- $a \to c \to b$ ($c$ can be anything, including $a$ or
$b$)
- In general, $M^n$ is the transition probability
matrix for a period of $n$ units of evolution
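Written out, the two-step case sums over every possible intermediate residue c, which is exactly one entry of the matrix product of M with itself. The following is a standard Markov chain identity, stated here under the assumption that successive PAM steps are independent:

```latex
M^{2}(a,b) = \sum_{c} M(a,c)\, M(c,b),
\qquad
M^{n}(a,b) = \sum_{c} M^{n-1}(a,c)\, M(c,b)
```

The term with c = a covers the case where the change happens only in the second step, and the term with c = b the case where it happens only in the first.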
23 Estimation of Evolutionary Distance
- There is a different mutation probability matrix for each
evolutionary interval measured in PAMs.
- Calculate the percentage of amino acids that will
be observed to change on average in the
interval:
- $P = 100 \big( 1 - \sum_i f_p(i) \, M(i,i) \big)$
- A PAM250 matrix usually represents two sequences
which have about 20% identity
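The difference between PAM distance and observed change can be illustrated with a small sketch. It uses a hypothetical four-letter alphabet with equal frequencies and equal mutation probabilities rather than real amino acid data, and the function names are chosen only for this illustration.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Return M multiplied by itself n times (identity for n = 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


def percent_changed(M, freqs):
    """P = 100 * (1 - sum_i f(i) * M(i, i))."""
    return 100.0 * (1.0 - sum(f * M[i][i] for i, f in enumerate(freqs)))


# Toy 1-PAM matrix (hypothetical): 4 letters, 1% total change per row.
off = 0.01 / 3
pam1 = [[0.99 if i == j else off for j in range(4)] for i in range(4)]
freqs = [0.25] * 4

p1 = percent_changed(pam1, freqs)
p250 = percent_changed(mat_pow(pam1, 250), freqs)
print(round(p1, 3), round(p250, 1))
```

After 250 steps the observed change is far below 250%, and below 100%, because later mutations hit positions that have already changed and some of them restore the original letter.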
24 Nucleotide PAM scoring matrices
Assuming equal probability for each mutation, PAM1
would be:
      A      T      G      C
A    .99   .0033  .0033  .0033
T   .0033   .99   .0033  .0033
G   .0033  .0033   .99   .0033
C   .0033  .0033  .0033   .99
Some models score transitions (purine to purine,
pyrimidine to pyrimidine) higher than transversions:
      A      T      G      C
A    .99   .002   .006   .002
T   .002    .99   .002   .006
G   .006   .002    .99   .002
C   .002   .006   .002    .99
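Both kinds of nucleotide matrix can be generated from two parameters, as in the following minimal sketch. It assumes that every row must sum to 1, that the total change per row is 1%, and that each base has exactly one transition partner and two transversion partners; the function name and the `transition_weight` parameter are hypothetical.

```python
PURINES = {"A", "G"}
BASES = "ATGC"


def nucleotide_pam1(transition_weight=1.0, total_change=0.01):
    """Build a 1-PAM nucleotide mutation probability matrix.

    transition_weight is how many times more likely a transition is
    than each single transversion (1.0 gives the uniform model).
    """
    transversion = total_change / (transition_weight + 2.0)
    transition = transition_weight * transversion
    matrix = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                matrix[(a, b)] = 1.0 - total_change
            elif (a in PURINES) == (b in PURINES):
                matrix[(a, b)] = transition    # A<->G or T<->C
            else:
                matrix[(a, b)] = transversion  # purine <-> pyrimidine
    return matrix


uniform = nucleotide_pam1()
biased = nucleotide_pam1(transition_weight=3.0)
for a in BASES:
    print(a, [round(biased[(a, b)], 4) for b in BASES])
```

Raising such a matrix to the n-th power and taking log-odds against the base frequencies would give a PAMn scoring matrix for nucleotides, in the same way as for amino acids.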
26 Discrimination of real local alignment from
by-chance alignment
- Method: compute the mutual information
- $\sum_x \sum_y p(x,y) \log \dfrac{p(x,y)}{p(x)\,p(y)}$
- Recall that the score is $s(x,y) = \log \dfrac{p(x,y)}{p(x)\,p(y)}$
- Thus we simply compute
- $\sum_{x=1}^{20} \sum_{y=1}^{20} p(x,y) \, s(x,y)$
- Examples (in bits):
- PAM160: 0.7; PAM250: 0.36
- Higher mutual information → better discrimination
between true and by-chance alignments.
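The double sum can be computed directly from a table of target pair frequencies, as in this minimal sketch. The two-letter joint distribution is a hypothetical toy example, and the function name is chosen only for this illustration.

```python
import math


def mutual_information_bits(joint):
    """Sum of p(x, y) * log2(p(x, y) / (p(x) * p(y))) over all pairs.

    joint[(x, y)] holds ordered pair frequencies that sum to 1.
    """
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0:  # pairs with zero frequency contribute nothing
            total += p * math.log2(p / (px[x] * py[y]))
    return total


# Toy target frequencies (hypothetical) over the two letters A and S.
joint = {("A", "A"): 0.8, ("A", "S"): 0.1,
         ("S", "A"): 0.1, ("S", "S"): 0.0}
h = mutual_information_bits(joint)

# If aligned residues were independent, every score would be 0.
independent = {(x, y): px * py
               for x, px in (("A", 0.9), ("S", 0.1))
               for y, py in (("A", 0.9), ("S", 0.1))}
h0 = mutual_information_bits(independent)
print(round(h, 4), round(h0, 4))
```

A value near zero means the matrix carries almost no information for telling related pairs from chance pairs, which is why the lower value for PAM250 implies that longer alignments are needed to reach significance.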
27 Problems with PAM
- PAM1 is defined in terms of amino acid mutations
rather than the number of nucleotide changes.
- Some mutations may be rare and underrepresented in
PAM1 (which is based on closely related proteins
only).
- The mutation rate depends on the position of an
amino acid in the structure.
- It requires construction of a phylogenetic tree, which in
turn needs scoring matrices for proper
construction (this remains a problem for many other
methods).
28 Some more problems with PAM Matrices
- Derived from global alignments of closely related
sequences.
- Matrices for greater evolutionary distances are
extrapolated from those for lesser ones.
- The number with the matrix (PAM40, PAM100) refers
to the evolutionary distance; greater numbers are
greater distances.
- Does not take into account different evolutionary
rates between conserved and non-conserved
regions.
29 BLOSUM matrices
- BLOcks SUbstitution Matrix
- Amino acid substitution matrices from protein
blocks
S. HENIKOFF and J. HENIKOFF, Proc. Natl. Acad.
Sci. USA, Vol. 89, pp. 10915-10919, November
1992, Biochemistry
30 Comparison to PAM
- The BLOSUM series, derived from alignments in
blocks, is fundamentally different from the
Dayhoff PAM series, which is derived from the
estimation of mutation rates.
- Nevertheless, the BLOSUM series, based on percent
clustering of aligned segments in blocks, can be
compared to the Dayhoff matrices, based on percent
accepted mutation (PAM), using the measure of
average information per residue pair in bit
units, called relative entropy.
31 Comparison between BLOSUM 62 and PAM 160
- BLOSUM 62 is less tolerant of substitutions
involving hydrophilic amino acids, while it is
more tolerant of substitutions involving
hydrophobic amino acids.
- For rare amino acids, especially cysteine and
tryptophan, BLOSUM 62 is typically more tolerant
of mismatches than is PAM 160.
32 PAM vs BLOSUM
- Dayhoff estimated mutation rates from
substitutions observed in closely related
proteins and extrapolated those rates to model
distant relationships.
- In the BLOSUM approach, frequencies were obtained
directly from relationships represented in the
blocks, regardless of evolutionary distance.
- The Dayhoff frequency table included 36 pairs in
which no accepted point mutations occurred.
33 Differences Between the PAM and BLOSUM Approach
- In contrast, the pairs counted with BLOSUM
included no fewer than 2369 occurrences of any
particular substitution.
- The BLOSUM matrices depend only on the identity
and composition of the protein groups in Prosite.
- Therefore, there is no expectation that these
substitution matrices will change significantly
in the future.
34 PAM Versus BLOSUM
- PAM is based on an evolutionary model.
- BLOSUM is based on protein families.
- PAM is based on global alignment.
- BLOSUM is based on local alignment.