Multiple sequence alignment based on Larry Hunters Slides - PowerPoint PPT Presentation

1
Multiple sequence alignment (based on Larry Hunter's slides)
  • Generalize our pairwise alignment of sequences to
    include more than two homologous proteins.
  • Looking at more than two sequences gives us much
    more information
  • Which amino acids are required? correlated?
  • Evolutionary/phylogenetic relationships

2
Phylogenetic Trees
Analysis of 20 samples of cytochrome c protein sequences.
Numbers represent nucleotide substitutions in the gene
for cytochrome c.
3
Sample MSA
FOS_RAT     MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_MOUSE   MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_CHICK   MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANF 60
FOSB_MOUSE  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
FOSB_HUMAN  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54

FOS_RAT     IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE   IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK   VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVPAPAPPAAYSRPAVL 112
FOSB_MOUSE  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTS----YSTPGLS 110
FOSB_HUMAN  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTS----YSTPGMS 110

FOS_RAT     KTMSGGRAQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 152
FOS_MOUSE   KTVSGGRAQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 152
FOS_CHICK   KAP-GGRGQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 151
FOSB_MOUSE  AYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAA 170
FOSB_HUMAN  GYSSGGASGSGGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAA 170

4
Optimal MSA
  • Use Dynamic Programming?
  • Optimal alignment algorithm exists, but is
    O(2^n l^n), where n is the number of sequences and l
    is the length of the longest sequence.
  • 10 sequences of length 100 take 2^10 × 100^10 ≈ 10^23
    operations, around 1 million years at 3 GHz
  • Exponential algorithms strike again.
  • So, approximation approaches?
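
A quick check of the arithmetic on this slide, as a minimal sketch. It assumes one operation per clock cycle and a 365.25-day year; the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exact_msa_operations(n_sequences: int, max_length: int) -> int:
    """Operation count for the O(2^n * l^n) dynamic-programming bound."""
    return (2 ** n_sequences) * (max_length ** n_sequences)


ops = exact_msa_operations(10, 100)
# Assumption: a 3 GHz machine completes one operation per cycle.
years = ops / 3e9 / SECONDS_PER_YEAR
print(f"{ops:.2e} operations, about {years:.2e} years")
```

Adding one more sequence of the same length multiplies the count by 2 × 100, which is why the exact method stops being practical so quickly.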

5
Progressive MSA
  • Start with pairwise alignments of closely related
    sequences, and then add more distantly related
    sequences one at a time.
  • Requires information (assumptions) about the
    phylogenetic relationship a priori.
  • Can be estimated from all pairwise comparisons.
  • Give total MSA score based on sum of pairwise
    scores
  • Perhaps weighted to reduce the influence of very
    similar sequences.
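
The sum-of-pairs score mentioned above can be sketched as follows. The match, mismatch and gap values are placeholder assumptions (a real scorer would use a substitution matrix), and the weighting scheme, a product of per-sequence weights, is only one possible way to down-weight very similar sequences.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two rows of an MSA column by column (toy scoring values)."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a gap aligned with a gap costs nothing here
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(alignment, weights=None):
    """Total MSA score: weighted sum of all pairwise row scores."""
    if weights is None:
        weights = [1.0] * len(alignment)
    total = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        total += weights[i] * weights[j] * pair_score(alignment[i], alignment[j])
    return total


msa = ["MMFSG", "MMFSG", "-MFQA"]
print(sum_of_pairs(msa))                   # unweighted
print(sum_of_pairs(msa, [0.5, 0.5, 1.0]))  # two identical rows down-weighted
```

With equal weights the two identical rows dominate the total; halving their weights shifts the score toward the pairs that actually differ.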

6
Gaps in Progressive MSAs
  • How to score gaps?
  • Want to align gaps with each other over all
    sequences. A gap in a pairwise alignment that
    matches a gap in another pairwise alignment
    should cost less than introducing a totally new
    gap.
  • Possible that a new gap could be made to match
    an older one by shifting around the original
    pairwise alignment, but at great computational
    cost.
  • Change gap penalty near conserved domains of
    various kinds (e.g. secondary structure,
    hydrophobic regions)
  • CLUSTALW: http://www.ebi.ac.uk/clustalw

7
Greedy algorithms
  • Progressive MSA programs make the best alignment
    of a new sequence with the existing ones they can
    at the time, and then never revisit the decision.
  • Even if changing an old decision (e.g. the gaps)
    could increase the score, this approach doesn't.
  • Approach is called greedy (because it takes the
    best first), and is a common way to resolve
    exponential problems.
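
A minimal sketch of the greedy idea, limited to choosing the order in which sequences get added. The distances are invented toy numbers and the function name is hypothetical; real progressive aligners such as CLUSTALW build a guide tree rather than a simple list.

```python
def greedy_join_order(distances):
    """Return the order in which sequences would be added to the alignment.

    distances maps frozenset({a, b}) to a pairwise distance.
    """
    names = sorted({name for pair in distances for name in pair})
    # Start from the closest pair.
    closest = min(distances, key=lambda pair: (distances[pair], sorted(pair)))
    order = sorted(closest)
    remaining = [name for name in names if name not in order]
    while remaining:
        # Greedy step: take the sequence nearest to anything already placed,
        # and never revisit earlier choices.
        def nearest(name):
            best_distance = min(distances[frozenset((name, placed))] for placed in order)
            return (best_distance, name)

        best = min(remaining, key=nearest)
        order.append(best)
        remaining.remove(best)
    return order


# Toy distances between four hypothetical sequences A-D.
toy = {
    frozenset("AB"): 0.5, frozenset("AC"): 0.6, frozenset("AD"): 0.9,
    frozenset("BC"): 0.2, frozenset("BD"): 0.7, frozenset("CD"): 0.4,
}
print(greedy_join_order(toy))
```

Each step commits to the locally best choice, so the cost is a handful of comparisons per sequence instead of a search over all possible orders.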

8
Problems with progressive MSA
  • Depends crucially on the quality of the pairwise
    alignments, particularly among the closest
    matches.
  • No suitable resolution to the problem of gap
    penalties over multiple sequences.
  • Works reasonably well for closely related
    sequences. Even then, manual adjustments are
    common.

9
Iterative MSA methods
  • The idea here is to start with a reasonable
    approximation to the optimal MSA (e.g. by using a
    progressive method) and then tweak it to
    improve it.
  • Various optimization techniques have been tried
    here (e.g. genetic algorithms and simulated annealing).
  • Key is the scoring function for the whole MSA.
  • Also, what steps to take that are likely to
    improve the score.

10
Block based methods
  • Another approach is to start with short local
    alignments (sometimes called blocks) and then to
    reduce the problem to aligning the regions between
    the blocks
  • Divide and conquer is another common CS
    approach to exponential problems.
  • How to find the blocks?
  • DIALIGN (local alignment methods)
  • DCA (divide and conquer alignments)
  • Tmsa (identify patterns and use them to define
    blocks).

11
Databases of MSAs
  • Once they have been calculated, they can be saved
    and shared
  • Pfam database of protein families. Alignments of
    large numbers of homologous proteins.
  • http://www.sanger.ac.uk/Software/Pfam/index.shtml
  • TIGRFAMs database of protein families curated
    for function, rather than homology
  • http://www.tigr.org/TIGRFAMs/index.shtml

12
More web sites
  • Web sites offer multiple approaches to MSA.
  • Interfaces to multiple different programs
  • http://searchlauncher.bcm.tmc.edu/multi-align
  • http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli
  • Main web-based MSA servers
  • http://www.ebi.ac.uk/clustalw
  • http://baboon.math.berkeley.edu/mavid/ (genomic
    seqs)
  • See course website for many more listings

13
Protein motifs
  • Recall that local alignments can identify similar
    regions in non-homologous proteins
  • These regions (sometimes called domains) often
    have shared structure and/or function.
  • Example: zinc-finger DNA-binding motif

14
Zinc-finger DNA binding motif
15
Protein motifs
  • How to define them?
  • Consensus sequence
  • Regular expression
  • Profile (probability for each amino acid at each
    position)

16
ProSite consensus sequences
17
Recognizing ProSite patterns
  • L14 ribosomal protein pattern:
    [GA]-[LIV](3)-x(9,10)-[DNS]-G-x(4)-[FY]-x(2)-[NT]-x(2)-V-[LIV]
  • Some matching sequences:
  • GIIIACGHLIPQTNGACRTYILNDRVV
  • GVLLWQPKHCSNAADGAWAWFAATAAVL
  • ALIVEANIIILSISGRATTFHATSAVI
  • ProSite patterns can be translated into regular
    expressions, although the bounded-length patterns
    (e.g. [LIV](3,5)) are unwieldy to write down as
    regexps.
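
A minimal sketch of that translation, assuming the L14 pattern reads [GA]-[LIV](3)-x(9,10)-[DNS]-G-x(4)-[FY]-x(2)-[NT]-x(2)-V-[LIV] once ProSite's square brackets are written out. The function name is hypothetical, and only a subset of the syntax is covered: [..] classes, {..} exclusions, x wildcards, (n) and (n,m) repeats, and the < and > anchors.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of ProSite pattern syntax into a Python regex."""
    pieces = []
    for element in pattern.strip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat suffix such as (3) or (9,10).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any amino acid except these
        # [..] classes and single letters are already valid regex.
        piece = core + ("{" + repeat + "}" if repeat else "")
        pieces.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(pieces)


L14 = "[GA]-[LIV](3)-x(9,10)-[DNS]-G-x(4)-[FY]-x(2)-[NT]-x(2)-V-[LIV]"
l14_regex = prosite_to_regex(L14)
print(l14_regex)
for seq in ["GIIIACGHLIPQTNGACRTYILNDRVV",
            "GVLLWQPKHCSNAADGAWAWFAATAAVL",
            "ALIVEANIIILSISGRATTFHATSAVI"]:
    print(seq, bool(re.fullmatch(l14_regex, seq)))
```

To scan a longer protein for the motif, `re.search` would be used in place of `re.fullmatch`.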

18
Example of ProSite
  • [AC]-x-V-x(4)-{ED}
    This pattern is translated as
    [Ala or Cys]-any-Val-any-any-any-any-{any but Glu
    or Asp}.
  • <A-x-[ST](2)-x(0,1)-V
    This pattern, which must be in the N-terminal of the
    sequence ('<'), is translated as
    Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val.
  • <{C}*>
    This pattern describes all sequences which do not
    contain any Cysteines.
  • IIRIFHLRNI
    This pattern describes all sequences which contain
    the subsequence 'IIRIFHLRNI'.

19
Regular expressions
  • Wide use in computer science. Central to the Perl
    language (see also BioPerl).
  • For proteins, a language like ProSite patterns is
    more intuitive, but often equivalent.

20
Profiles
  • Rather than identifying only the consensus
    (i.e. most common) amino acid at a particular
    location, we can assign a probability to each
    amino acid in each position of the domain.
  • Example:

          1     2     3
      A   .1    .5    .25
      C   .3    .1    .25
      D   .2    .2    .25
      E   .4    .2    .25
21
Applying a profile
  • Calculate score (probability of match) for a
    profile at each position in a sequence by
    multiplying individual probabilities over a
    sliding window.
  • Can transform probability to significance given
    random distribution assumption
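
A minimal sketch of the sliding-window calculation, using a three-position profile over the letters A, C, D and E with the values A = .1/.5/.25, C = .3/.1/.25, D = .2/.2/.25, E = .4/.2/.25. The test sequence "CEADE" is invented, and comparing against a uniform background of 0.25 per letter is an assumption standing in for a proper significance calculation.

```python
import math

# Probability of each letter at positions 1-3 of the motif.
PROFILE = {
    "A": [0.1, 0.5, 0.25],
    "C": [0.3, 0.1, 0.25],
    "D": [0.2, 0.2, 0.25],
    "E": [0.4, 0.2, 0.25],
}


def window_scores(sequence, profile):
    """Slide the profile along the sequence; return (start, window, probability)."""
    width = len(next(iter(profile.values())))
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        p = math.prod(profile[aa][i] for i, aa in enumerate(window))
        scores.append((start, window, p))
    return scores


scores = window_scores("CEADE", PROFILE)
best = max(scores, key=lambda item: item[2])
# Assumed uniform background: each of the four letters has probability 0.25.
odds = best[2] / 0.25 ** 3
print(scores)
print("best window:", best, "odds vs. background:", odds)
```

In practice the products are usually replaced by sums of log-probabilities, which avoids underflow on long profiles.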

23
Using motifs
  • Great for annotating a sequence with no strong
    homologs.
  • InterPro is a uniform interface to many
    different motif methods and databases
  • ProSite
  • PRINTS (fingerprints: multiple motifs)
  • ProDom (like Pfam, but for domains)
  • SMART (mobile domains)

24
InterPro example
25
InterPro example (cont'd)
  • Then, match the pattern to a protein database

26
How do we create motifs?
  • General problem of inducing patterns from
    sequences is difficult
  • Classic language-learning result (Gold): context-free
    grammars cannot be induced from positive examples
    alone
  • Many patterns are compatible with any MSA. How
    to decide which constituents are required?
  • In the general case, we need positive examples (in
    the class) but also near misses: sequences that
    are similar but not members of the class.
  • Not absolutely true for protein sequences.

27
Finding Consensus Sequences
  • Based on local MSAs.
  • ProSite consensus patterns are built from an MSA
    using (Amos Bairoch's) biological intuition, then
    tweaked by calculating the sensitivity and
    specificity of the patterns over SwissProt.
  • True (and false) positives are defined by Bairoch's
    understanding.
  • Not an automatable procedure!

28
Creating profiles
  • Given a local MSA, creating a profile is
    straightforward.
  • Calculate frequency of each amino acid at each
    position to create profile.
  • What to do about zero frequencies?
  • Could be sampling errors, not real zero
    probabilities.
  • Zero probabilities always make zero scores!
  • Regularization
  • pseudocounts
  • Dirichlet mixtures (blend in background
    frequencies)

29
Profile example
  • MSA
      BBB
      ABC
      ABD
      CBA
  • Counts
          1     2     3
      A   2     0     1
      B   1     4     1
      C   1     0     1
      D   0     0     1
  • Counts after adding 1 pseudocount
          1     2     3
      A   3     1     2
      B   2     5     2
      C   2     1     2
      D   1     1     2
  • Profile from the raw counts
          1     2     3
      A   .5    0     .25
      B   .25   1     .25
      C   .25   0     .25
      D   0     0     .25
  • Profile from the counts with pseudocount
          1     2     3
      A   .37   .12   .25
      B   .25   .63   .25
      C   .25   .12   .25
      D   .12   .12   .25
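
The profile example can be reproduced with a short sketch. It takes the four aligned sequences BBB, ABC, ABD and CBA over the alphabet A-D and builds the profile with and without a pseudocount of 1; the function name is hypothetical.

```python
from collections import Counter


def build_profile(sequences, alphabet, pseudocount=0):
    """Per-position letter frequencies from a gap-free local MSA."""
    length = len(sequences[0])
    profile = {letter: [] for letter in alphabet}
    # Every column has the same total: one count per sequence plus
    # one pseudocount per letter of the alphabet.
    total = len(sequences) + pseudocount * len(alphabet)
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        for letter in alphabet:
            profile[letter].append((counts[letter] + pseudocount) / total)
    return profile


msa = ["BBB", "ABC", "ABD", "CBA"]
raw = build_profile(msa, "ABCD")
smoothed = build_profile(msa, "ABCD", pseudocount=1)
print(raw)
print(smoothed)
```

The smoothed values come out as 0.375, 0.125 and 0.625, which the slide shows rounded to two decimals. No entry is zero after smoothing, so a single unseen letter no longer forces a window's score to zero.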
30
Feature alphabets
  • Amino acids can be grouped by their
    characteristics
  • Size, hydrophobicity, ionizability, etc.
  • An amino acid is generally in more than one group
  • Can set different regularizers (pseudocounts) for
    each different feature
  • Most useful when there are multiple features
    (otherwise many amino acids get same pseudocount)