Title: Introduction to bioinformatics lecture 8
1 Introduction to bioinformatics, lecture 8
- Deriving amino acid exchange matrices (II) and
Multiple sequence alignment (I)
2 Summary: Dayhoff's PAM matrices
- Derived from global alignments of closely related sequences.
- Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
- The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
- Several later groups have attempted to extend Dayhoff's methodology or re-apply her analysis using later databases with more examples.
- Extensions of Dayhoff's methodology:
  > Jones, Thornton and coworkers used the same methodology as Dayhoff but with modern databases (CABIOS 8:275).
  > Gonnet and coworkers (Science 256:1443) used a slightly different (but theoretically equivalent) methodology.
  > Henikoff & Henikoff (Proteins 17:49) compared these two newer versions of the PAM matrices with Dayhoff's originals.
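The extrapolation to greater distances can be sketched as repeated matrix multiplication: if the mutation probability matrix for 1 PAM is known, the matrix for n PAMs is that matrix raised to the n-th power. The example below is only a sketch under stated assumptions: it uses a made-up three-letter alphabet with invented probabilities (the real PAM1 matrix is 20 x 20 and estimated from data), and the names PAM1_TOY, mat_mul and pam_n are hypothetical.

```python
# Toy mutation probability matrix for a 3-letter alphabet.
# Rows are the original residue, columns the replacement; each row sums to 1.
# All values are invented for illustration, they are NOT Dayhoff's numbers.
PAM1_TOY = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Extrapolate to n PAMs by multiplying the 1-PAM matrix with itself (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


pam250 = pam_n(PAM1_TOY, 250)
for row in pam250:
    print(["%.3f" % p for p in row])
```

Because every step multiplies by the same 1-PAM matrix, any error in that matrix is carried into every extrapolated matrix, which is the weakness the PAM versus BLOSUM comparison later refers to.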
3 The BLOSUM matrices (BLOcks SUbstitution Matrix)
- The BLOSUM series of matrices was created by Steve Henikoff and colleagues (PNAS 89:10915).
- Derived from local, ungapped alignments of distantly related sequences.
- All matrices are directly calculated; no extrapolations are used.
- Again, the observed frequency of each pair is compared to the expected frequency (which is essentially the product of the frequencies of each residue in the dataset). These ratios are then converted into a log-odds matrix.
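The observed-versus-expected comparison can be sketched on a toy block. This is a simplified illustration, not the published procedure: the block is invented, the alphabet has two letters, sequence clustering is skipped, and scores are left as unrounded log2 values. The names BLOCK and log_odds are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Toy ungapped block: rows are sequences, columns are aligned positions (invented data).
BLOCK = ["AAC", "AAC", "ACC", "CAC"]


def log_odds(block):
    """Return observed pair frequencies q, residue frequencies p and log2-odds scores."""
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1

    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Residue frequency: identical pairs count fully, mixed pairs count half for each residue.
    p = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    scores = {}
    for (x, y), freq in q.items():
        # Expected frequency is the product of the residue frequencies
        # (times 2 for mixed pairs, because x-y and y-x are the same unordered pair).
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = math.log2(freq / expected)
    return q, p, scores


q, p, scores = log_odds(BLOCK)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

Pairs seen more often than expected by chance get a positive score, and pairs seen less often get a negative one.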
4 The Blocks Database
- The Blocks Database contains multiple alignments of conserved regions in protein families.
- Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
- The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the random distribution of matches. It is these calibrated blocks that make up the BLOCKS database.
- The database can be searched by e-mail and World Wide Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and nucleotide sequences.
5 The Blocks Database
Gapless alignment blocks
6 The BLOSUM series
- BLOSUM30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90.
- The number after the matrix (BLOSUM62) refers to the percent identity threshold applied to the blocks (in the BLOCKS database) used to construct the matrix: sequences within a block that are >= 62% identical are clustered and counted as one.
- No extrapolations are made in going to higher evolutionary distances.
- High number: closely related sequences. Low number: distant sequences.
- BLOSUM62 is the most popular: best for general alignment.
7 The log-odds matrix for BLOSUM62
8 PAM versus BLOSUM
PAM:
- Based on an explicit evolutionary model
- Derived from small, closely related proteins with at most 15% divergence
- Higher PAM numbers to detect more remote sequence similarities
- Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM:
- Based on empirical frequencies
- Uses much larger, more diverse set of protein sequences (30-90% ID)
- Lower BLOSUM numbers to detect more remote sequence similarities
- Errors in BLOSUM arise from errors in alignment
9 Comparing exchange matrices
- To compare amino acid exchange matrices, the
"Entropy" value can be used. This is a relative
entropy value (H) which describes the amount of
information available per aligned residue pair.
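As a rough sketch of how such a value can be computed: the relative entropy is the expected log-odds score per aligned pair, H = sum of q_ij * log2(q_ij / e_ij), where q_ij is the observed (target) frequency of the pair and e_ij its expected frequency by chance. The frequencies below are invented for a two-letter alphabet, and the function name relative_entropy is hypothetical.

```python
import math


def relative_entropy(q, e):
    """H = sum over residue pairs of q_ij * log2(q_ij / e_ij), in bits per aligned pair."""
    return sum(q[pair] * math.log2(q[pair] / e[pair]) for pair in q if q[pair] > 0)


# Invented frequencies for unordered pairs over the alphabet {A, C}.
e = {("A", "A"): 0.25, ("A", "C"): 0.50, ("C", "C"): 0.25}                 # chance expectation
q_close = {("A", "A"): 0.45, ("A", "C"): 0.10, ("C", "C"): 0.45}           # conserved alignments
q_distant = {("A", "A"): 0.30, ("A", "C"): 0.40, ("C", "C"): 0.30}         # diverged alignments

print("H (close)   =", round(relative_entropy(q_close, e), 3))
print("H (distant) =", round(relative_entropy(q_distant, e), 3))
```

In this toy setting the matrix built from conserved alignments has the higher H, and H falls towards zero as the target frequencies approach the chance expectation.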
10 Specialized matrices
- Claverie (J. Mol. Biol. 234:1140) developed a set of substitution matrices designed explicitly for finding possible frameshifts in protein sequences. These matrices are designed solely for use in protein-protein comparisons; they should not be used with programs which blindly translate DNA (e.g. BLASTX, TBLASTN).
11 Specialized matrices
- Rather than starting from alignments generated by sequence comparison, Risler et al. (1988) and later Overington et al. (1992) only considered proteins for which an experimentally determined three-dimensional structure was available.
- They then aligned similar proteins on the basis of their structure rather than sequence and used the resulting sequence alignments as their database from which to gather substitution statistics. In principle, the Risler or Overington matrices should give more reliable results than either PAM or BLOSUM. However, the comparatively small number of available protein structures (particularly in the Risler et al. study) limited the reliability of their statistics.
- Overington et al. (1992) developed further matrices that consider the local environment of the amino acids.
12 A note on reliability
- All these matrices are designed using standard evolutionary models.
- It is important to understand that evolution is not the same for all proteins, not even for the same regions of proteins.
- No single matrix performs best on all sequences. Some are better for sequences with few gaps, and others are better for sequences with fewer identical amino acids.
- Therefore, when aligning sequences, applying a general model to all cases is not ideal. Rather, re-adjustment can be used to make the general model better fit the given data.
13 Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995)
14 Summary
- If an ORF exists, then align at protein level.
- Amino acid substitution matrices reflect the log-odds ratio between the evolutionary and random model and can therefore help in determining homology via the alignment score.
- The evolutionary and random models depend on the generalized data used to derive them. This is not an ideal solution.
- Apart from the PAM and BLOSUM series, a great number of further matrices have been developed.
- Matrices have been made based on DNA, protein structure, information content, etc.
- For local alignment, BLOSUM62 is often superior; for distant (global) alignments, BLOSUM50, GONNET, or (still) PAM250 work well.
- Remember that gap penalties are always a problem: unlike the matrices themselves, there is no formal way to calculate their values. You can follow recommended settings, but these are based on trial and error and not on a formal framework.
15 Biological definitions for related sequences
- Homologues are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues.
- Orthologues are similar sequences in two different organisms that have arisen due to a speciation event. Orthologues typically retain identical or similar functionality throughout evolution.
- Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event.
- Xenologues are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.
16 So this means ...
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
17 Multiple sequence alignment
- Sequences can be conserved across species and perform similar or identical functions.
  > hold information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved
  > identification of regions or domains that are critical to functionality.
- Sequences can be mutated or rearranged to perform an altered function.
  > which changes in the sequences have caused a change in the functionality.
Multiple sequence alignment: the idea is to take three or more sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment.
18 What to ask yourself
- How do we get a multiple alignment? (three or more sequences)
- What is our aim? Do we go for max accuracy, least computational time or the best compromise?
- What do we want to achieve each time?
19 Sequence-sequence alignment
[Figure: one sequence aligned against another; axis labels: sequence, sequence]
20 Multiple alignment methods
- Multi-dimensional dynamic programming
  > extension of pairwise sequence alignment.
- Progressive alignment
  > incorporates phylogenetic information to guide the alignment process
- Iterative alignment
  > correct for problems with progressive alignment by repeatedly realigning subgroups of sequences
21 Simultaneous multiple alignment: Multi-dimensional dynamic programming
- The combinatorial explosion
- 2 sequences of length n
- n^2 comparisons
- Comparison number increases exponentially
- i.e. n^N where n is the length of the sequences, and N is the number of sequences
- Impractical for even a small number of short sequences
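To make the growth concrete, here is a minimal sketch of dynamic programming extended from two to three sequences. It fills a three-dimensional table and returns only the optimal score (no traceback). The sum-of-pairs scoring and the match, mismatch and gap values are assumptions chosen for illustration, and the function name sp_score3 is hypothetical.

```python
from itertools import product


def sp_score3(a, b, c, match=1, mismatch=-1, gap=-2):
    """Optimal sum-of-pairs score for three sequences via 3-D dynamic programming."""

    def pair(x, y):
        # Score one pair of symbols within an alignment column.
        if x == "-" and y == "-":
            return 0
        if x == "-" or y == "-":
            return gap
        return match if x == y else mismatch

    la, lb, lc = len(a), len(b), len(c)
    neg_inf = float("-inf")
    table = [[[neg_inf] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    table[0][0][0] = 0

    # Each cell has 2^3 - 1 = 7 predecessors: every way of consuming
    # a residue from at least one of the three sequences.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]

    for i in range(la + 1):
        for j in range(lb + 1):
            for k in range(lc + 1):
                if i == 0 and j == 0 and k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    x = a[pi] if di else "-"
                    y = b[pj] if dj else "-"
                    z = c[pk] if dk else "-"
                    column = pair(x, y) + pair(x, z) + pair(y, z)
                    best = max(best, table[pi][pj][pk] + column)
                table[i][j][k] = best
    return table[la][lb][lc]


print(sp_score3("ACGT", "AGT", "ACT"))

# Number of table cells for N sequences of length n is (n + 1) ** N.
for n_seqs in (2, 3, 5, 8):
    print(n_seqs, "sequences of length 250:", (250 + 1) ** n_seqs, "cells")
```

The table has one axis per sequence and each cell looks at 2^N - 1 predecessors, so both memory and time blow up as sequences are added.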
22 Multi-dimensional dynamic programming (Murata et al., 1985)
[Figure: axes labelled Sequence 1, Sequence 2 and Sequence 3]
23 The MSA approach
- MSA (Lipman et al., 1989, PNAS 86, 4412)
- MSA restricts the amount of memory by computing bounds that approximate the centre of a multi-dimensional hypercube.
- Calculate all pair-wise alignment scores.
- Use the scores to predict a tree.
- Calculate pair weights based on the tree (lower bound).
- Produce a heuristic alignment based on the tree.
- Calculate the maximum weight for each sequence pair (upper bound).
- Determine the spatial positions that must be calculated to obtain the optimal alignment.
- Perform the optimal alignment.
- Report the weight found compared to the maximum weight previously found (measure of divergence).
- Extremely slow and memory intensive.
- Max 8-9 sequences of 250 residues.
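The first step of this procedure, calculating all pair-wise alignment scores, can be sketched with a plain global alignment. This is not the MSA program's own code: the match, mismatch and linear gap values are assumptions, the sequences are invented, and the names nw_score and all_pair_scores are hypothetical. The bounding steps are not shown.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def all_pair_scores(seqs):
    """Score every unordered pair of named sequences."""
    return {(m, n): nw_score(seqs[m], seqs[n])
            for m, n in combinations(sorted(seqs), 2)}


SEQS = {"s1": "ACGT", "s2": "AGT", "s3": "ACGT"}  # invented toy sequences
scores = all_pair_scores(SEQS)
for names, score in scores.items():
    print(names, score)
```

These pairwise scores are what the later steps turn into a tree and into the bounds that limit how much of the hypercube has to be filled.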
24 The DCA approach
- DCA (Stoye et al., 1997, Appl. Math. Lett. 10(2), 67-73)
- Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint.
- This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences.
- This procedure is re-iterated until the sequences are sufficiently short.
- Optimal alignment by MSA.
- Finally, the resulting short alignments are concatenated.
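The divide, align and concatenate structure can be sketched as a short recursion. This shows only the skeleton and makes two loud simplifications: the cut is always taken at the exact midpoint (how the published method selects a suitable cut position is not shown), and the base case is a placeholder that pads with gaps where the slide calls for an optimal alignment by MSA. All names are hypothetical.

```python
def base_align(seqs):
    """Placeholder for the optimal aligner used on short pieces.

    ASSUMPTION: it only pads on the right with gaps so that the rows have equal length.
    """
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def divide_and_conquer(seqs, max_len=4):
    """Cut every sequence in two, align the halves recursively, then concatenate."""
    if max(len(s) for s in seqs) <= max_len:
        return base_align(seqs)

    # ASSUMPTION: naive midpoint cut, standing in for a properly chosen cut position.
    cuts = [len(s) // 2 for s in seqs]
    left = divide_and_conquer([s[:c] for s, c in zip(seqs, cuts)], max_len)
    right = divide_and_conquer([s[c:] for s, c in zip(seqs, cuts)], max_len)
    return [l + r for l, r in zip(left, right)]


SEQS = ["ACDEFGHIKLMN", "ACDFGHKLMN", "ACEFGHIKLN"]  # invented toy sequences
alignment = divide_and_conquer(SEQS)
for row in alignment:
    print(row)
```

The point of the recursion is that the expensive aligner only ever sees short pieces, so its exponential cost stays bounded.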
25 So in effect ...
[Figure: axes labelled Sequence 1, Sequence 2 and Sequence 3]
26 Multiple alignment methods
- Multi-dimensional dynamic programming
  > extension of pairwise sequence alignment.
- Progressive alignment
  > incorporates phylogenetic information to guide the alignment process
- Iterative alignment
  > correct for problems with progressive alignment by repeatedly realigning subgroups of sequences
27 The progressive alignment method
- Underlying idea: usually we are interested in aligning families of sequences that are evolutionarily related.
- Principle: construct an approximate phylogenetic tree for the sequences to be aligned and then build up the alignment by progressively adding sequences in the order specified by the tree.
- But before going into details, some notes on multiple alignment profiles.
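One way to obtain the order in which sequences are added is to cluster them by pairwise distance. The sketch below uses UPGMA (average linkage), which is an assumption on my part: the slide only asks for an approximate tree and does not name a method. The distances are invented and the function name upgma_order is hypothetical.

```python
def upgma_order(names, dist):
    """Return the merges (cluster_a, cluster_b) in the order UPGMA performs them.

    dist maps frozenset({name1, name2}) to the distance between two sequences.
    """
    clusters = [(name,) for name in names]

    def cluster_distance(c1, c2):
        # Average distance over all pairs of members (average linkage).
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        candidates = [(cluster_distance(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j]
        _, i, j = min(candidates)
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


NAMES = ["s1", "s2", "s3", "s4"]
DIST = {  # invented pairwise distances
    frozenset(("s1", "s2")): 0.1, frozenset(("s3", "s4")): 0.2,
    frozenset(("s1", "s3")): 0.6, frozenset(("s1", "s4")): 0.7,
    frozenset(("s2", "s3")): 0.5, frozenset(("s2", "s4")): 0.8,
}
for step in upgma_order(NAMES, DIST):
    print("align", step[0], "with", step[1])
```

Each merge corresponds to one alignment step: first two sequences, later a sequence against a block or two blocks against each other, which is why a representation for blocks of sequences is needed.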
28 How to represent a block of sequences?
- Historically: consensus sequence, a single sequence that best represents the amino acids observed at each alignment position.
- Modern methods: alignment profile, a representation that retains the information about frequencies of amino acids observed at each alignment position.
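The difference between the two representations shows up directly on a small block. The example below is a sketch on invented data: the consensus keeps one letter per column, while the profile keeps the observed frequencies. The names BLOCK, consensus and profile are hypothetical.

```python
from collections import Counter

# Invented block of four aligned sequences, one string per sequence.
BLOCK = ["ACDW", "ACEW", "AGDW", "SCDY"]


def consensus(block):
    """Most frequent amino acid in each alignment column."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*block))


def profile(block):
    """Amino acid frequencies in each alignment column."""
    n_seqs = len(block)
    return [{aa: count / n_seqs for aa, count in Counter(column).items()}
            for column in zip(*block)]


print(consensus(BLOCK))
for position, column in enumerate(profile(BLOCK), start=1):
    print(position, column)
```

The consensus discards the fact that S, G, E and Y were ever observed, while the profile retains them with their frequencies.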
29 Multiple alignment profiles (Gribskov et al. 1987)
- Gribskov created a probe: a group of typical sequences of functionally related proteins that have been aligned by similarity in sequence or three-dimensional structure (in his case globins and immunoglobulins).
- Then he constructed a profile, which consists of a sequence position-specific scoring matrix M(p,a) composed of 21 columns and N rows (N = length of probe).
- The first 20 columns of each row specify the score for finding, at that position in the target, each of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position (gap-opening and gap-extension).
30 Multiple alignment profiles
[Figure: alignment with regions labelled Core region, Gapped region, Core region, and the profile derived from it]
        position i ->
A       fA..         fA..         fA..
C       fC..         fC..         fC..
D       fD..         fD..         fD..
...     ...          ...          ...
W       fW..         fW..         fW..
Y       fY..         fY..         fY..
-       Gapo, gapx   Gapo, gapx   Gapo, gapx
Position-dependent gap penalties
31 Profile building
- Example: each aa is represented as a frequency; penalties as weights.
                position i ->
A               0.3   0.5   0
C               0.1   0     0.5
D               0     0     0.2
...             ...   ...   ...
W               0.3   0     0.1
Y               0.3   0.5   0.2
Gap penalties   0.5   1.0   1.0
Position-dependent gap penalties
32 Profile-sequence alignment
[Figure: a sequence aligned against a profile; axis labels: sequence, ACDVWY]
33 Sequence to profile alignment
Alignment column: A A V V L
Profile position: 0.4 A, 0.2 L, 0.4 V
Score of amino acid L in a sequence that is aligned against this profile position:
Score = 0.4*s(L,A) + 0.2*s(L,L) + 0.4*s(L,V)
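The same calculation as a small sketch. The profile frequencies come from the slide; the substitution scores in S are illustrative values in the style of BLOSUM62 and should be treated as assumptions, so check them against the real matrix before relying on them. The names S, s and column_score are hypothetical.

```python
# ASSUMED substitution scores for the three pairs needed here (BLOSUM62-style values).
S = {("L", "A"): -1, ("L", "L"): 4, ("L", "V"): 1}


def s(x, y):
    """Symmetric lookup in the exchange matrix."""
    return S[(x, y)] if (x, y) in S else S[(y, x)]


def column_score(residue, profile_column):
    """Frequency-weighted exchange score of one residue against one profile position."""
    return sum(freq * s(residue, aa) for aa, freq in profile_column.items())


# Profile position built from the alignment column A A V V L.
profile_column = {"A": 0.4, "L": 0.2, "V": 0.4}
score = column_score("L", profile_column)
print(round(score, 3))
```

With these assumed values the score works out as 0.4*(-1) + 0.2*4 + 0.4*1.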
34 Profile-profile alignment
[Figure: a profile aligned against a profile; axis labels: profile (A C D . . Y), profile (ACDVWY)]
35 Profile to profile alignment
Profile position 1: 0.4 A, 0.2 L, 0.4 V
Profile position 2: 0.75 G, 0.25 S
Match score of these two alignment columns using the a.a. frequencies at the corresponding profile positions:
Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G)
      + 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S)
s(x,y) is the value in the amino acid exchange matrix (e.g. PAM250, BLOSUM62) for amino acid pair (x,y)
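A sketch of the same column-against-column score in code. The two frequency columns are taken from the slide; the exchange values in S are illustrative, BLOSUM62-style assumptions and should be checked against the real matrix. The names S, s and profile_profile_score are hypothetical.

```python
# ASSUMED exchange scores for the six pairs needed here (BLOSUM62-style values).
S = {
    ("A", "G"): 0, ("L", "G"): -4, ("V", "G"): -3,
    ("A", "S"): 1, ("L", "S"): -2, ("V", "S"): -2,
}


def s(x, y):
    """Symmetric lookup in the exchange matrix."""
    return S[(x, y)] if (x, y) in S else S[(y, x)]


def profile_profile_score(col1, col2):
    """Sum over all residue pairs of freq1 * freq2 * s(x, y)."""
    return sum(f1 * f2 * s(x, y)
               for x, f1 in col1.items()
               for y, f2 in col2.items())


column_1 = {"A": 0.4, "L": 0.2, "V": 0.4}
column_2 = {"G": 0.75, "S": 0.25}
score = profile_profile_score(column_1, column_2)
print(round(score, 3))
```

When one of the two columns holds a single residue with frequency 1, the double sum collapses to the sequence-to-profile score, so one function covers both cases.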
36 So, for scoring profiles ...
- Think of sequence-sequence alignment.
- Same principles but more information for each position.
- Reminder:
- The sequence pair alignment score S comes from the sum of the positional scores M(aa_i, aa_j) (i.e. the substitution matrix values at each alignment position minus penalties if applicable).
- Profile alignment scores are exactly the same, but the positional scores are more complex.