Introduction to bioinformatics lecture 8 - PowerPoint PPT Presentation

Slides: 37
Provided by: pir80

Transcript and Presenter's Notes

1
Introduction to bioinformatics, lecture 8
  • Deriving amino acid exchange matrices (II) and
    Multiple sequence alignment (I)

2
Summary: Dayhoff's PAM matrices
  • Derived from global alignments of closely
    related sequences.
  • Matrices for greater evolutionary distances are
    extrapolated from those for lesser ones.
  • The number with the matrix (PAM40, PAM100)
    refers to the evolutionary distance; greater
    numbers are greater distances.
  • Several later groups have attempted to extend
    Dayhoff's methodology or re-apply her analysis
    using later databases with more examples.
  • Extensions of Dayhoff's methodology:
    -> Jones, Thornton and coworkers used the same
    methodology as Dayhoff but with modern databases
    (CABIOS 8:275).
    -> Gonnet and coworkers (Science 256:1443) used a
    slightly different (but theoretically equivalent)
    methodology.
    -> Henikoff and Henikoff (Proteins 17:49) compared
    these two newer versions of the PAM matrices
    with Dayhoff's originals.

3
The BLOSUM matrices (BLOcks SUbstitution Matrix)
  • The BLOSUM series of matrices was created by
    Steve Henikoff and colleagues (PNAS 89:10915).
  • Derived from local, un-gapped alignments of
    distantly related sequences.
  • All matrices are directly calculated; no
    extrapolations are used.
  • Again the observed frequency of each pair is
    compared to the expected frequency (which is
    essentially the product of the frequencies of
    each residue in the dataset). The ratios are
    then converted into a log-odds matrix.
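
The observed-versus-expected comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the Henikoff procedure itself: it skips the sequence clustering step, and the function name `log_odds_matrix`, the half-bit scaling, and the toy alignment columns are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(columns):
    """Return {(a, b): score} in half-bit units from ungapped alignment columns."""
    # Count every unordered residue pair seen within each column.
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Observed pair frequencies q_ab.
    q = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Residue frequencies p_a implied by the observed pairs.
    p = Counter()
    for (a, b), freq in q.items():
        p[a] += freq / 2
        p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected frequency: p_a * p_a for identical pairs, 2 * p_a * p_b otherwise.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Toy block with three columns (hypothetical data).
scores = log_odds_matrix(["AAA", "AAS", "SSS"])
```

Pairs seen more often than chance predicts get positive scores and pairs seen less often get negative ones, which is the property the alignment score relies on.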

4
The Blocks Database
  • The Blocks Database contains multiple
    alignments of conserved regions in protein
    families.
  • Blocks are multiply aligned un-gapped segments
    corresponding to the most highly conserved
    regions of proteins.
  • The blocks for the BLOCKS database are made
    automatically by looking for the most highly
    conserved regions in groups of proteins
    represented in the PROSITE database. These blocks
    are then calibrated against the SWISS-PROT
    database to obtain a measure of the random
    distribution of matches. It is these calibrated
    blocks that make up the BLOCKS database.
  • The database can be searched by e-mail and
    World Wide Web (WWW) servers
    (http://blocks.fhcrc.org/help) to classify
    protein and nucleotide sequences.

5
The Blocks Database
Gapless alignment blocks
6
The BLOSUM series
  • BLOSUM30, 35, 40, 45, 50, 55, 60, 62, 65, 70,
    75, 80, 85, 90.
  • The number after the matrix (BLOSUM62) refers
    to the percent identity threshold applied to
    the blocks (in the BLOCKS database) used to
    construct the matrix: sequences within a block
    that are at least 62% identical are clustered
    and counted as one, so only pairs less than
    62% identical contribute substitution counts.
  • No extrapolations are made in going to higher
    evolutionary distances.
  • High number: closely related sequences. Low
    number: distant sequences.
  • BLOSUM62 is the most popular and a good default
    for general alignment.

7
The log-odds matrix for BLOSUM62
8
PAM versus BLOSUM
PAM
  • Based on an explicit evolutionary model
  • Derived from small, closely related proteins with
    at most 15% divergence
  • Higher PAM numbers to detect more remote sequence
    similarities
  • Errors in PAM1 are scaled 250x in PAM250
BLOSUM
  • Based on empirical frequencies
  • Uses a much larger, more diverse set of protein
    sequences (30-90% identity)
  • Lower BLOSUM numbers to detect more remote
    sequence similarities
  • Errors in BLOSUM arise from errors in alignment

9
Comparing exchange matrices
  • To compare amino acid exchange matrices, the
    "Entropy" value can be used. This is a relative
    entropy value (H) which describes the amount of
    information available per aligned residue pair.
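
As a hedged sketch of how such a relative entropy could be computed, the snippet below sums q_ab * log2(q_ab / e_ab) over all unordered residue pairs, where q_ab is the observed pair frequency and e_ab the frequency expected by chance. The function name `relative_entropy` and the two-letter toy frequencies are assumptions for illustration only.

```python
import math


def relative_entropy(q, p):
    """Relative entropy H in bits per aligned residue pair.

    q: {(a, b): observed frequency} over unordered pairs (sums to 1).
    p: {a: background frequency} of each residue (sums to 1).
    """
    h = 0.0
    for (a, b), q_ab in q.items():
        # Expected frequency of the unordered pair under the random model.
        e_ab = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        if q_ab > 0:
            h += q_ab * math.log2(q_ab / e_ab)
    return h


# Toy two-letter alphabet (hypothetical frequencies).
p = {"A": 5 / 9, "S": 4 / 9}
q = {("A", "A"): 4 / 9, ("A", "S"): 2 / 9, ("S", "S"): 3 / 9}
h_conserved = relative_entropy(q, p)

# When observed pairs match chance expectation exactly, H is zero.
p_flat = {"A": 0.5, "S": 0.5}
q_random = {("A", "A"): 0.25, ("A", "S"): 0.5, ("S", "S"): 0.25}
h_random = relative_entropy(q_random, p_flat)
```

Matrices built from closely related sequences carry more information per position and therefore have a higher H than matrices built for distant relationships.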

10
Specialized matrices
  • Claverie (J. Mol. Biol. 234:1140) developed a set
    of substitution matrices designed explicitly
    for finding possible frameshifts in protein
    sequences. These matrices are designed solely
    for use in protein-protein comparisons; they
    should not be used with programs which blindly
    translate DNA (e.g. BLASTX, TBLASTN).

11
Specialized matrices
  • Rather than starting from alignments generated
    by sequence comparison, Risler et al. (1988)
    and later Overington et al. (1992) only
    considered proteins for which an experimentally
    determined three-dimensional structure was
    available.
  • They then aligned similar proteins on the basis
    of their structure rather than sequence and
    used the resulting sequence alignments as their
    database from which to gather substitution
    statistics. In principle, the Risler or
    Overington matrices should give more reliable
    results than either PAM or BLOSUM. However, the
    comparatively small number of available protein
    structures (particularly in the Risler et al.
    study) limited the reliability of their
    statistics.
  • Overington et al (1992) developed further
    matrices that consider the local environment of
    the amino acids.

12
A note on reliability
  • All these matrices are designed using standard
    evolutionary models.
  • It is important to understand that evolution is
    not the same for all proteins, not even for the
    same regions of proteins.
  • No single matrix performs best on all
    sequences. Some are better for sequences with
    few gaps, and others are better for sequences
    with fewer identical amino acids.
  • Therefore, when aligning sequences, applying a
    general model to all cases is not ideal. Rather,
    re-adjustment can be used to make the general
    model better fit the given data.

13
Pair-wise alignment quality versus sequence
identity (Vogt et al., JMB 249, 816-831, 1995)
14
Summary
  • If an ORF exists, then align at the protein level.
  • Amino acid substitution matrices reflect the
    log-odds ratio between the evolutionary and
    random model and can therefore help in
    determining homology via the alignment score.
  • The evolutionary and random models depend on
    the generalized data used to derive them. This
    is not an ideal solution.
  • Apart from the PAM and BLOSUM series, a great
    number of further matrices have been developed.
  • Matrices have been made based on DNA, protein
    structure, information content, etc.
  • For local alignment, BLOSUM62 is often
    superior; for distant (global) alignments,
    BLOSUM50, GONNET, or (still) PAM250 work well.
  • Remember that gap penalties are always a
    problem; unlike the matrices themselves, there
    is no formal way to calculate their values.
    You can follow recommended settings, but these
    are based on trial and error and not on a
    formal framework.

15
Biological definitions for related sequences
  • Homologues are similar sequences in two different
    organisms that have been derived from a common
    ancestor sequence. Homologues can be described
    as either orthologues or paralogues.
  • Orthologues are similar sequences in two
    different organisms that have arisen due to a
    speciation event. Orthologues typically retain
    identical or similar functionality throughout
    evolution.
  • Paralogues are similar sequences within a single
    organism that have arisen due to a gene
    duplication event.
  • Xenologues are similar sequences that do not
    share the same evolutionary origin, but rather
    have arisen out of horizontal transfer events
    through symbiosis, viruses, etc.

16
So this means
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
17
Multiple sequence alignment
  • Sequences can be conserved across species and
    perform similar or identical functions.
    -> They hold information about which regions have
    high mutation rates over evolutionary time and
    which are evolutionarily conserved.
    -> This allows identification of regions or
    domains that are critical to functionality.
  • Sequences can be mutated or rearranged to perform
    an altered function.
    -> We can ask which changes in the sequences have
    caused a change in the functionality.

Multiple sequence alignment: the idea is to take
three or more sequences and align them so that
the greatest number of similar characters are
aligned in the same column of the alignment.
18
What to ask yourself
  • How do we get a multiple alignment? (three or
    more sequences)
  • What is our aim? Do we go for maximum accuracy,
    least computational time or the best
    compromise?
  • What do we want to achieve each time?

19
Sequence-sequence alignment
sequence
sequence
20
Multiple alignment methods
  • Multi-dimensional dynamic programming
    -> extension of pairwise sequence alignment.
  • Progressive alignment
    -> incorporates phylogenetic information to guide
    the alignment process.
  • Iterative alignment
    -> corrects for problems with progressive
    alignment by repeatedly realigning subgroups of
    sequences.

21
Simultaneous multiple alignment: multi-dimensional
dynamic programming
  • The combinatorial explosion
  • 2 sequences of length n
  • n^2 comparisons
  • Comparison number increases exponentially
  • i.e. n^N, where n is the length of the sequences,
    and N is the number of sequences
  • Impractical for even a small number of short
    sequences
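
To make the growth concrete, here is a small sketch that evaluates the n^N estimate from the bullets above. The helper name `dp_cells` and the chosen sequence length of 250 are assumptions for the example.

```python
def dp_cells(n, num_sequences):
    """Number of comparisons for N sequences of length n, using the n^N estimate."""
    return n ** num_sequences


# Growth for sequences of 250 residues as more sequences are added.
for k in (2, 3, 5, 8):
    print(f"{k} sequences: {dp_cells(250, k):.3e} comparisons")
```

Each additional sequence multiplies the work by n, which is why full dynamic programming is limited to a handful of sequences.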

22
Multi-dimensional dynamic programming (Murata et
al., 1985)
Sequence 1
Sequence 3
Sequence 2
23
The MSA approach
  • MSA (Lipman et al., 1989, PNAS 86, 4412)
  • MSA restricts the amount of memory by computing
    bounds that approximate the centre of a
    multi-dimensional hypercube.
  • Calculate all pair-wise alignment scores.
  • Use the scores to predict a tree.
  • Calculate pair weights based on the tree (lower
    bound).
  • Produce a heuristic alignment based on the tree.
  • Calculate the maximum weight for each sequence
    pair (upper bound).
  • Determine the spatial positions that must be
    calculated to obtain the optimal alignment.
  • Perform the optimal alignment.
  • Report the weight found compared to the maximum
    weight previously found (measure of divergence).
  • Extremely slow and memory intensive.
  • Max 8-9 sequences of 250 residues.

24
The DCA approach
  • DCA (Stoye et al., 1997, Appl. Math. Lett. 10(2),
    67-73)
  • Each sequence is cut in two behind a suitable cut
    position somewhere close to its midpoint.
  • This way, the problem of aligning one family of
    (long) sequences is divided into the two
    problems of aligning two families of (shorter)
    sequences.
  • This procedure is re-iterated until the sequences
    are sufficiently short.
  • Optimal alignment by MSA.
  • Finally, the resulting short alignments are
    concatenated.

25
So in effect
Sequence 1
Sequence 3
Sequence 2
26
Multiple alignment methods
  • Multi-dimensional dynamic programming
    -> extension of pairwise sequence alignment.
  • Progressive alignment
    -> incorporates phylogenetic information to guide
    the alignment process.
  • Iterative alignment
    -> corrects for problems with progressive
    alignment by repeatedly realigning subgroups of
    sequences.

27
The progressive alignment method
  • Underlying idea: usually we are interested in
    aligning families of sequences that are
    evolutionarily related.
  • Principle: construct an approximate phylogenetic
    tree for the sequences to be aligned and then
    build up the alignment by progressively adding
    sequences in the order specified by the tree.
  • But before going into details, some notes on
    multiple alignment profiles.

28
How to represent a block of sequences?
  • Historically: consensus sequence, a single
    sequence that best represents the amino acids
    observed at each alignment position.
  • Modern methods: alignment profile, a
    representation that retains the information about
    frequencies of amino acids observed at each
    alignment position.
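
The difference between the two representations can be shown with a short sketch. The function names `column_profile` and `consensus`, the alphabetical tie-breaking rule, and the toy block are assumptions for illustration.

```python
from collections import Counter


def column_profile(block):
    """Per-column residue frequencies of an ungapped block of equal-length sequences."""
    profile = []
    for column in zip(*block):
        counts = Counter(column)
        profile.append({aa: n / len(column) for aa, n in counts.items()})
    return profile


def consensus(profile):
    """Most frequent residue at each position (ties broken alphabetically)."""
    return "".join(min(col, key=lambda aa: (-col[aa], aa)) for col in profile)


# Toy block of five aligned sequences (hypothetical data).
block = ["ACDWY", "ACEWY", "VCDWF", "VCDWY", "LCDWY"]
profile = column_profile(block)
print(consensus(profile))
print(profile[0])
```

In the first column A and V are equally frequent, so the consensus has to pick one arbitrarily, while the profile keeps both frequencies along with the rarer L.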

29
Multiple alignment profiles (Gribskov et al. 1987)
  • Gribskov created a probe: a group of typical
    sequences of functionally related proteins that
    have been aligned by similarity in sequence or
    three-dimensional structure (in his case globins
    and immunoglobulins).
  • Then he constructed a profile, which consists of
    a sequence position-specific scoring matrix
    M(p,a) composed of 21 columns and N rows (N =
    length of probe).
  • The first 20 columns of each row specify the
    score for finding, at that position in the
    target, each of the 20 amino acid residues. An
    additional column contains a penalty for
    insertions or deletions at that position
    (gap-opening and gap-extension).

30
Multiple alignment profiles
[Figure: an alignment with two core regions and a gapped region, and the profile built from it]
i    A     C     D     ...  W     Y     -
     fA..  fC..  fD..  ...  fW..  fY..  Gapo, gapx
     fA..  fC..  fD..  ...  fW..  fY..  Gapo, gapx
     fA..  fC..  fD..  ...  fW..  fY..  Gapo, gapx
Position dependent gap penalties
31
Profile building
  • Example: each amino acid is represented as a
    frequency, and gap penalties as weights.

i    A    C    D    ...  W    Y    Gap penalty
     0.3  0.1  0    ...  0.3  0.3  0.5
     0.5  0    0    ...  0    0.5  1.0
     0    0.5  0.2  ...  0.1  0.2  1.0
Position dependent gap penalties
32
Profile-sequence alignment
sequence
ACDVWY
33
Sequence to profile alignment
Alignment column: A A V V L
Profile position: 0.4 A, 0.2 L, 0.4 V
Score of amino acid L in a sequence that is aligned
against this profile position:
Score = 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 * s(L, V)
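
The score of residue L against this profile position is a frequency-weighted sum of exchange values, which might be coded as below. The function name `residue_vs_column` is hypothetical, and the three exchange values are assumed for illustration (chosen in the style of a half-bit matrix such as BLOSUM62); substitute values from whichever matrix is actually in use.

```python
# Exchange values s(L, x); assumed for illustration only.
S = {("L", "A"): -1, ("L", "L"): 4, ("L", "V"): 1}


def residue_vs_column(residue, column, s):
    """Frequency-weighted score of one residue against one profile column."""
    return sum(freq * s[(residue, aa)] for aa, freq in column.items())


# Profile position from the slide: 0.4 A, 0.2 L, 0.4 V.
column = {"A": 0.4, "L": 0.2, "V": 0.4}
score = residue_vs_column("L", column, S)
print(round(score, 2))
```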
34
Profile-profile alignment
profile
A C D . . Y
profile
ACDVWY
35
Profile to profile alignment
Profile position 1: 0.4 A, 0.2 L, 0.4 V
Profile position 2: 0.75 G, 0.25 S
Match score of these two alignment columns using
the amino acid frequencies at the corresponding
profile positions:
Score = 0.4 * 0.75 * s(A,G) + 0.2 * 0.75 * s(L,G) + 0.4 * 0.75 * s(V,G)
      + 0.4 * 0.25 * s(A,S) + 0.2 * 0.25 * s(L,S) + 0.4 * 0.25 * s(V,S)
s(x,y) is the value in the amino acid exchange
matrix (e.g. PAM250, BLOSUM62) for amino acid pair
(x,y).
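
The match score of two profile columns sums over every residue pair, weighted by both frequencies. A minimal sketch follows; the function name `column_vs_column` is hypothetical and the six exchange values are assumed for illustration, not read from a specific matrix file.

```python
# Exchange values s(x, y); assumed for illustration only.
S = {
    ("A", "G"): 0, ("L", "G"): -4, ("V", "G"): -3,
    ("A", "S"): 1, ("L", "S"): -2, ("V", "S"): -2,
}


def column_vs_column(col_x, col_y, s):
    """Sum of f_x(a) * f_y(b) * s(a, b) over all residue pairs of two profile columns."""
    total = 0.0
    for a, freq_a in col_x.items():
        for b, freq_b in col_y.items():
            total += freq_a * freq_b * s[(a, b)]
    return total


# The two profile positions from the slide.
col_1 = {"A": 0.4, "L": 0.2, "V": 0.4}
col_2 = {"G": 0.75, "S": 0.25}
score = column_vs_column(col_1, col_2, S)
print(round(score, 2))
```

With columns that each hold a single residue at frequency 1.0, this reduces to a plain matrix lookup, so sequence-sequence scoring is a special case.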
36
So, for scoring profiles
  • Think of sequence-sequence alignment.
  • Same principles but more information for each
    position.
  • Reminder
  • The sequence pair alignment score S comes from
    the sum of the positional scores M(aa_i, aa_j)
    (i.e. the substitution matrix values at each
    alignment position minus penalties if applicable)
  • Profile alignment scores are exactly the same,
    but the positional scores are more complex
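
As a reminder of the sequence-pair case, the sum of positional scores minus penalties can be sketched as follows. The function name `alignment_score`, the flat per-position gap penalty, and the toy matrix values are assumptions; real programs usually separate gap-opening from gap-extension penalties.

```python
def alignment_score(row_a, row_b, matrix, gap_penalty):
    """Sum substitution values over aligned positions, subtracting a penalty per gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score -= gap_penalty
        else:
            # The matrix is symmetric, so accept either key order.
            score += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    return score


# Toy symmetric matrix (hypothetical values).
toy_matrix = {("A", "A"): 4, ("A", "S"): 1, ("S", "S"): 4}
total = alignment_score("AAS-", "ASSA", toy_matrix, gap_penalty=2)
print(total)
```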