Scoring the Alignment of Amino Acid Sequences

Provided by: LEIN150
Learn more at: http://cs.gettysburg.edu
1
Scoring the Alignment of Amino Acid Sequences
  • Constructing PAM and Blosum Matrices

2
Quotes from page 11 of our Lab Manual
Proteins are huge molecules made up of large
numbers of amino acids. The proteins are
usually 100 to 500 amino acids long. There are 20
different amino acids that make up the proteins.
3
Name            Abbr.    Linear structure formula
Alanine         Ala  A   CH3-CH(NH2)-COOH
Arginine        Arg  R   HN=C(NH2)-NH-(CH2)3-CH(NH2)-COOH
Asparagine      Asn  N   H2N-CO-CH2-CH(NH2)-COOH
Aspartic acid   Asp  D   HOOC-CH2-CH(NH2)-COOH
Cysteine        Cys  C   HS-CH2-CH(NH2)-COOH
Glutamine       Gln  Q   H2N-CO-(CH2)2-CH(NH2)-COOH
Glutamic acid   Glu  E   HOOC-(CH2)2-CH(NH2)-COOH
Glycine         Gly  G   NH2-CH2-COOH
Histidine       His  H   NH-CH=N-CH=C-CH2-CH(NH2)-COOH
Isoleucine      Ile  I   CH3-CH2-CH(CH3)-CH(NH2)-COOH
Leucine         Leu  L   (CH3)2-CH-CH2-CH(NH2)-COOH
Lysine          Lys  K   H2N-(CH2)4-CH(NH2)-COOH
Methionine      Met  M   CH3-S-(CH2)2-CH(NH2)-COOH
Phenylalanine   Phe  F   Ph-CH2-CH(NH2)-COOH
Proline         Pro  P   NH-(CH2)3-CH-COOH
Serine          Ser  S   HO-CH2-CH(NH2)-COOH
Threonine       Thr  T   CH3-CH(OH)-CH(NH2)-COOH
Tryptophan      Trp  W   Ph-NH-CH=C-CH2-CH(NH2)-COOH
Tyrosine        Tyr  Y   HO-p-Ph-CH2-CH(NH2)-COOH
Valine          Val  V   (CH3)2-CH-CH(NH2)-COOH
4
The number, variety, and chemical properties of
the amino acids make scoring a pair of amino
acids a much more complicated problem than
scoring a pair of nucleotides. In the late 1970s
Dayhoff, Schwartz, and Orcutt decided to look at
a database of similar proteins having common
ancestors and obtain substitution frequency data.
They looked at 71 groupings of protein data that
differed in no more than 15% of their residues,
i.e. were at least 85% similar. They then built
phylogenetic trees in which each transition from
generation to generation has as few changes as
possible, given the data, in each ancestral
sequence. From these trees a value is determined
for the entry A_ab in a matrix giving the
frequency data for each pairing of amino acids
a and b.
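The counting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `count_substitutions` and the toy sequences are hypothetical, and the ancestor/descendant pairs are assumed to have already been read off the parsimonious trees.

```python
from collections import Counter

def count_substitutions(pairs):
    """Count symmetric substitution frequencies A_ab.

    `pairs` is a list of (ancestor, descendant) strings of equal length.
    Positions where the residue is unchanged are skipped, because only
    changes are recorded.
    """
    counts = Counter()
    for ancestor, descendant in pairs:
        if len(ancestor) != len(descendant):
            raise ValueError("sequences must be aligned to equal length")
        for a, b in zip(ancestor, descendant):
            if a != b:
                # Sort the pair so that a->b and b->a share one cell.
                counts[tuple(sorted((a, b)))] += 1
    return counts

# Hypothetical toy data, not taken from Dayhoff's protein families.
toy_pairs = [("ACGH", "ACGH"), ("ACGH", "DCGH"), ("DCGH", "DCAH"), ("ACGH", "ACAH")]
A = count_substitutions(toy_pairs)
```

Storing each pair in sorted order reflects the assumption that changes in the two directions are treated as symmetrical.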
5
Constructing a Parsimonious Phylogenetic Tree
(taken from page 40 of Krane and Raymer)
Dayhoff and her team used sequences that were at
least 85% similar and calculated the frequency
with which each amino acid was substituted for
each of the other amino acids.
6
Dayhoff's Data
NOTE: The diagonals are blank since only the
changes are recorded. Also, the upper
triangular half of the matrix is not shown since
it is assumed that the changes a -> b and b -> a
are symmetrical.
7
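Calculating the Entry in the Substitution Matrix
Let
  P(b | a, t) = probability that a is replaced by b
                in t time units, adjusted for
                divergence time (Dayhoff time units)
  q_a q_b     = probability that a would be paired
                with b at random
              = (frequency of a)(frequency of b)
  s(a, b | t) = the entry at position (a, b) or
                (b, a) in the scoring matrix
Then the score is the log of the ratio between the
observed probability of the pair, q_a P(b | a, t),
and the random expectation q_a q_b:
$$ s(a, b \mid t) = \log \frac{q_a \, P(b \mid a, t)}{q_a \, q_b} = \log \frac{P(b \mid a, t)}{q_b} $$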
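As a rough numerical illustration, the sketch below assumes the log-odds form s(a, b | t) = log(P(b | a, t) / q_b) and a made-up two-letter alphabet. The probabilities, frequencies, and function names are all hypothetical, and the natural logarithm is an arbitrary choice of base.

```python
import math

def log_odds_score(p_b_given_a, q_b):
    """Log-odds score: log of P(b | a, t) divided by the frequency q_b."""
    return math.log(p_b_given_a / q_b)

def alignment_score(seq1, seq2, score):
    """Sum the per-column scores of an ungapped alignment."""
    return sum(score[(a, b)] for a, b in zip(seq1, seq2))

# Hypothetical frequencies and substitution probabilities for letters A and B.
q = {"A": 0.5, "B": 0.5}
P = {("A", "A"): 0.9, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.9}   # P[(a, b)] stands for P(b | a, t)

# Build the score table, then score a short ungapped alignment.
scores = {(a, b): log_odds_score(p, q[b]) for (a, b), p in P.items()}
total = alignment_score("AAB", "ABB", scores)
```

Because each column contributes a logarithm, the total equals the log of the product of the three odds ratios (1.8 × 0.2 × 1.8 in this toy case).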
8
The Probabilities Found by Dayhoff
The entry in cell M_ab is the probability that a
would be replaced by b in one Dayhoff time unit,
multiplied by 100. Thus, for example, Alanine
would be replaced by Proline 0.22% of the time.
9
Note: The previous matrix is NOT the scoring
matrix. It is used to derive the scoring matrix.
Recall that $s(a, b \mid t) = \log \frac{P(b \mid a, t)}{q_b}$.
The probability matrix is the main tool for
deriving a sensible scoring matrix. To find the
probability that amino acid a will be replaced by
amino acid b t time units later, we need to
calculate the (a, b)-th entry of the matrix M^t,
the t-th power of M. After calculating this
entry, we apply the log-odds formula. The
logarithm is used in the scoring formula because
it allows us, among other things, to add the
scores of the aligned residues when we compute
the score for an overall alignment of two
sequences.
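The matrix power can be sketched in plain Python. The two-letter matrix below is hypothetical (a real M would be 20 × 20), and the helper names `mat_mul` and `mat_pow` are illustrative, not from the lab manual.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, t):
    """Return the t-th power of m (t >= 0) by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while t > 0:
        if t & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        t >>= 1
    return result

# Hypothetical one-time-unit substitution matrix; each row sums to 1.
M = [[0.99, 0.01],
     [0.02, 0.98]]
M2 = mat_pow(M, 2)
M250 = mat_pow(M, 250)
```

Entry `M250[a][b]` then plays the role of P(b | a, t) for t = 250 in the log-odds formula. In this toy case the rows of `M250` are already close to the stationary frequencies (2/3, 1/3).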
10
The matrix having scores found from the original
probability matrix is called a 1 PAM matrix. PAM
stands for Point Accepted Mutation or Percent
Accepted Mutation. Dayhoff's term was Accepted
Point Mutation, but PAM rolls off the tongue
more easily than APM. The 1 means that, given the
degree of similarity between the sequences used
to make up the matrix, the scores in this matrix
are derived from the frequencies for one
evolutionary time unit. Matrices representing
longer times are called PAMt matrices and are
derived from M^t. The most widely used matrix is
PAM250, the log-odds matrix based on M^250, the
250th power of M. This matrix reflects the
probability of change over a long period of time.
However, for closely related sequences, say mouse
and rat MSH2, a PAM10 matrix may be more
appropriate.
11
The PAM250 Matrix
We only show the top half because the bottom half
is a reflection of the top half, i.e. S_ab = S_ba.
12
  • Discussion of PAM
  • The 1 PAM matrix was derived by constructing
    hypothetical phylogenetic trees relating
    sequences in 71 families.
  • The higher the power of the matrix, the more
    evolutionary time units represented by the
    matrix.
  • Criticism: raising M to high powers does not
    capture the true difference between short-time
    substitutions and long-time substitutions.
  • Note: short-time substitutions are dominated by
    amino acid substitutions that come from a single
    base change in the codon triplet of an amino
    acid, whereas long-time substitutions show
    all kinds of codon changes.

13
BLOSUM (BLOck Substitution Matrix) Matrices
14
The criticism given at the end of the last
discussion is that the large PAM matrices tend to
minimize the effects of short-time substitutions
such as L <-> I, L <-> V, and Y <-> F.
In 1991-1992 Henikoff and Henikoff used the
BLOCKS database at the Fred Hutchinson Cancer
Research Center. This database contains blocks of
multiple alignments of more distantly related
sequences. Such a database can be used to derive
scores more directly.
15
Methodology
  • Sequences from each block were clustered
  • Two sequences were placed in the same cluster if
    their percent identity was at or above some
    level, say a%
  • The frequency A_ab is calculated from observing
    residue a in one cluster aligned against
    residue b in another cluster.
  • Corrections are made for clusters of differing
    sizes
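The clustering step can be sketched as follows. This is a minimal illustration, assuming that a sequence joins a cluster when it reaches the identity threshold with at least one member (single linkage); the function names and the toy block are hypothetical.

```python
def percent_identity(s1, s2):
    """Percentage of aligned positions at which two sequences agree."""
    matches = sum(1 for a, b in zip(s1, s2) if a == b)
    return 100.0 * matches / len(s1)

def cluster_block(sequences, threshold):
    """Group the sequences of one block by percent identity.

    A sequence is linked to a cluster if it is at least `threshold`
    percent identical to any member; clusters linked by the same
    sequence are merged.
    """
    clusters = []
    for seq in sequences:
        linked, rest = [], []
        for cluster in clusters:
            hit = any(percent_identity(seq, m) >= threshold for m in cluster)
            (linked if hit else rest).append(cluster)
        merged = [seq] + [m for cluster in linked for m in cluster]
        clusters = rest + [merged]
    return clusters

# Hypothetical ungapped block of four aligned sequences.
block = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "AAAAAAACCC"]
clusters_62 = cluster_block(block, 62)
clusters_95 = cluster_block(block, 95)
```

With a threshold of 62 the three A-rich sequences fall into one cluster and the all-C sequence stands alone; at 95 every sequence is its own cluster.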

16
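Calculating the Matrix Entries
Let the following be determined from the observed
data:
  q_a  = the fraction of pairings that include an a
  p_ab = the fraction of pairings of a and b
Then
$$ q_a = \frac{\sum_b A_{ab}}{\sum_{c,d} A_{cd}} $$
and
$$ p_{ab} = \frac{A_{ab}}{\sum_{c,d} A_{cd}} $$
The score is calculated as
$$ s(a, b) = \log \frac{p_{ab}}{q_a \, q_b} $$
These values are then scaled and rounded to make
calculations easier.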
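A small sketch of this calculation is given below, assuming the standard definitions q_a = Σ_b A_ab / Σ_cd A_cd, p_ab = A_ab / Σ_cd A_cd, and s(a, b) = log(p_ab / (q_a q_b)). The counts are hypothetical, and scaling to half-bit units (2 × log base 2) before rounding is one common convention, used here as an assumption.

```python
import math

def blosum_scores(pair_counts, scale=2.0):
    """Compute rounded log-odds scores from symmetric pair counts.

    `pair_counts[(a, b)]` holds A_ab and must contain both (a, b)
    and (b, a). Pairs that were never observed get no score.
    """
    alphabet = sorted({a for a, _ in pair_counts})
    total = sum(pair_counts.values())
    # q_a: fraction of pairings that include an a.
    q = {a: sum(pair_counts.get((a, b), 0) for b in alphabet) / total
         for a in alphabet}
    scores = {}
    for a in alphabet:
        for b in alphabet:
            p_ab = pair_counts.get((a, b), 0) / total
            if p_ab > 0:
                scores[(a, b)] = round(scale * math.log2(p_ab / (q[a] * q[b])))
    return scores

# Hypothetical counts for a two-letter alphabet.
toy_counts = {("A", "A"): 6, ("A", "B"): 2,
              ("B", "A"): 2, ("B", "B"): 6}
toy_scores = blosum_scores(toy_counts)
```

Here identical pairs occur 1.5 times as often as chance predicts and score +1, while mixed pairs occur half as often as chance predicts and score -2.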
17
If we set the limit, a, to 62%, then we have the
BLOSUM62 Matrix
18
  • The most popular BLOSUM matrices are BLOSUM62
    and BLOSUM50.
  • BLOSUM62 is used mainly for ungapped matching.
  • BLOSUM50 is used for alignments with gaps.
  • Note: the lower the number, the longer the time
    span in evolutionary units.

19
Differences Between PAM and BLOSUM
PAM assumes that substitution probabilities for
highly related proteins can be extrapolated to
the probabilities for distantly related
proteins. BLOSUM matrices are based on the
observation of more distantly related protein
alignments. NOTE: Both types of matrices use
log-odds values in their scoring systems.