Multiple Alignment by profile HMM training and Phylogenetic Trees

Title: Probability Theory and Basic Alignment of String Sequences. Author: apstjhan. Last modified by: Nastya. Created: 11/17/2004 1:36:46 PM.


1
Multiple Alignment by profile HMM
training and Phylogenetic Trees
  • Elze de Groot
  • Anastacia Berdnikova

2
Topics
  • Multiple alignment with known HMM
  • HMM training from unaligned sequences
  • Avoiding local maxima
  • Simulated annealing
  • Noise injection
  • Stochastic sampling traceback algorithm
  • Model surgery
  • Phylogenetic trees

3
Multiple alignment with known profile HMM
  • Multiple alignment and model known -> align a large
    number of other family members
  • Calculate the Viterbi alignment for every sequence
  • Residues in the same match state are aligned in
    columns
  • That's a difference between profile HMM alignment and
    traditional multiple alignment
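
As a minimal sketch of how per-sequence Viterbi paths turn into alignment columns: the function below assumes each path is already given as a list of (state kind, state index) pairs, which is a hypothetical representation chosen for illustration. Match residues are printed in upper case, deletions as '-', and insertions in lower case padded with '.'.

```python
def align_from_paths(seqs, paths, n_match):
    """Build alignment rows from state paths.

    paths[i] is a list of (kind, k) pairs with kind in 'M', 'I', 'D';
    k is the state index (1..n_match for M and D, 0..n_match for I).
    """
    rows = []
    for seq, path in zip(seqs, paths):
        match = ['-'] * (n_match + 1)   # one column per match state
        ins = [''] * (n_match + 1)      # residues emitted by each insert state
        pos = 0
        for kind, k in path:
            if kind == 'M':
                match[k] = seq[pos].upper()
                pos += 1
            elif kind == 'I':
                ins[k] += seq[pos].lower()
                pos += 1
            # 'D' emits nothing, so the column keeps its '-'
        rows.append((match, ins))
    # Each insert region is padded to the longest insertion seen there.
    widths = [max(len(ins[k]) for _, ins in rows) for k in range(n_match + 1)]
    out = []
    for match, ins in rows:
        line = ins[0].ljust(widths[0], '.')
        for k in range(1, n_match + 1):
            line += match[k] + ins[k].ljust(widths[k], '.')
        out.append(line)
    return out
```

Residues that share a match state end up in the same column without any pairwise comparison between sequences.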

4
Example
  • Model estimated from an alignment

5
Example continued
  • The most probable paths and alignment

6
Profile HMM training from unaligned sequences
  • Algorithm

7
Initial Model
  • Choose length of model
    - M is the number of match states
    - set M to be the average length of the training sequences
  • Choose initial models carefully
  • Randomness in choice of initial model
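
A minimal sketch of this initialisation, under stated assumptions: M is the rounded mean sequence length, and each match state gets a random emission distribution drawn from a flat Dirichlet (via normalised gamma draws). The function name and the choice of a flat Dirichlet are illustrative, not taken from the slides.

```python
import random

def initial_model(seqs, alphabet, seed=0):
    """Pick the model length and random match emissions (illustrative only)."""
    rng = random.Random(seed)
    n_match = round(sum(len(s) for s in seqs) / len(seqs))

    def random_dist(n):
        # Normalised Gamma(1, 1) draws give a sample from Dirichlet(1, ..., 1).
        w = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
        total = sum(w)
        return [v / total for v in w]

    emissions = [dict(zip(alphabet, random_dist(len(alphabet))))
                 for _ in range(n_match)]
    return n_match, emissions
```

Running training from several seeds and keeping the best-scoring model is one simple way to use the randomness.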

8
Parameter Estimation
  • Use forward and backward variables to re-estimate
    emission and transition probability parameters
  • Baum-Welch re-estimation can be replaced by
    the Viterbi alternative

9
Forward Algorithm
10
Backward algorithm
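
The recursions on the forward and backward slides did not survive in this transcript. As a reference sketch, the standard profile-HMM forms are given below. The notation is an assumption rather than something recovered from the slides: M_k, I_k, D_k are the match, insert and delete states of module k, X_k stands for any of the three, a are transition and e emission probabilities, and f, b are the forward and backward variables.

```latex
\begin{align*}
f_{M_k}(i) &= e_{M_k}(x_i)\,\bigl[ f_{M_{k-1}}(i-1)\,a_{M_{k-1}M_k}
            + f_{I_{k-1}}(i-1)\,a_{I_{k-1}M_k}
            + f_{D_{k-1}}(i-1)\,a_{D_{k-1}M_k} \bigr] \\
f_{I_k}(i) &= e_{I_k}(x_i)\,\bigl[ f_{M_k}(i-1)\,a_{M_k I_k}
            + f_{I_k}(i-1)\,a_{I_k I_k}
            + f_{D_k}(i-1)\,a_{D_k I_k} \bigr] \\
f_{D_k}(i) &= f_{M_{k-1}}(i)\,a_{M_{k-1}D_k}
            + f_{I_{k-1}}(i)\,a_{I_{k-1}D_k}
            + f_{D_{k-1}}(i)\,a_{D_{k-1}D_k} \\
b_{X_k}(i) &= b_{M_{k+1}}(i+1)\,a_{X_k M_{k+1}}\,e_{M_{k+1}}(x_{i+1})
            + b_{I_k}(i+1)\,a_{X_k I_k}\,e_{I_k}(x_{i+1})
            + b_{D_{k+1}}(i)\,a_{X_k D_{k+1}}
\end{align*}
```
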
11
Baum-Welch re-estimation equations
  • Expected emission counts from sequence x
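
The equation for this slide is missing from the transcript. A sketch of the usual form, assuming f and b are the forward and backward variables, P(x) is the total probability of sequence x, and E_{M_k}(a), E_{I_k}(a) are the expected numbers of times residue a is emitted from match and insert state k:

```latex
\begin{align*}
E_{M_k}(a) &= \frac{1}{P(x)} \sum_{i \,\mid\, x_i = a} f_{M_k}(i)\, b_{M_k}(i) \\
E_{I_k}(a) &= \frac{1}{P(x)} \sum_{i \,\mid\, x_i = a} f_{I_k}(i)\, b_{I_k}(i)
\end{align*}
```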

12
Baum-Welch re-estimation equations
  • Expected transition counts from sequence x
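
The equation for this slide is also missing. A sketch of the usual form, assuming X_k stands for any of M_k, I_k, D_k, f and b are the forward and backward variables, a and e are the current transition and emission probabilities, and P(x) is the total probability of sequence x:

```latex
\begin{align*}
A_{X_k M_{k+1}} &= \frac{1}{P(x)} \sum_i f_{X_k}(i)\, a_{X_k M_{k+1}}\, e_{M_{k+1}}(x_{i+1})\, b_{M_{k+1}}(i+1) \\
A_{X_k I_k}     &= \frac{1}{P(x)} \sum_i f_{X_k}(i)\, a_{X_k I_k}\, e_{I_k}(x_{i+1})\, b_{I_k}(i+1) \\
A_{X_k D_{k+1}} &= \frac{1}{P(x)} \sum_i f_{X_k}(i)\, a_{X_k D_{k+1}}\, b_{D_{k+1}}(i)
\end{align*}
```

The delete-state term has no emission factor and no shift in i because delete states are silent.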

13
Avoiding local maxima
  • Baum-Welch is guaranteed to find a local maximum
  • Not guaranteed that it is anywhere near the global optimum
    or a biologically reasonable solution
  • Reason: models are long -> many ways to reach a
    wrong solution

14
Avoiding local maxima
  • Use stochastic search algorithm
  • Commonly used: simulated annealing

15
Simulated annealing
  • Some compounds only crystallise if they are
    slowly annealed from high to low temperature
  • Optimisation problem: minimise an energy function
    E(x)
  • Maximising a function is the same as minimising the
    negative of that function

16
Simulated annealing (2)
  • Introduce a temperature T
  • Probability of state x is given by the Gibbs
    distribution: P(x) = (1/Z) exp(-E(x)/T)
  • Partition function: Z = ∫ exp(-E(x)/T) dx
  • x is usually multidimensional, so it is impractical to
    calculate Z

17
Simulated annealing (3)
  • T -> 0: all configurations except those with the lowest
    energy have probability 0 (system is frozen)
  • T -> ∞: all configurations have the same probability
    (system is molten)
  • As with crystallisation, the minimum can be found by
    sampling this distribution at high temperature
    first and then at gradually decreasing temperatures
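
A minimal generic sketch of the idea, not specific to HMMs. It assumes a one-dimensional x, a Metropolis acceptance rule (one common way to sample a Gibbs distribution without computing Z) and a geometric cooling schedule; all parameter names and defaults are illustrative.

```python
import math
import random

def simulated_annealing(energy, x0, t_start=5.0, t_end=0.01,
                        steps=5000, step_size=0.5, seed=0):
    """Minimise energy(x) by sampling at slowly decreasing temperature."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for s in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (s / (steps - 1))
        cand = x + rng.uniform(-step_size, step_size)
        e_cand = energy(cand)
        # Metropolis rule: always accept downhill moves,
        # accept uphill moves with probability exp(-dE / T).
        if e_cand <= e or rng.random() < math.exp(-(e_cand - e) / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e
```

At high T almost every move is accepted, so the search can leave local minima; as T falls the walk settles into whichever basin it is in.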

18
Simulated annealing for HMM
  • Natural energy function: the negative log of the
    likelihood, -log P(data | θ)
  • Sampling from it is non-trivial; the two methods I'm
    going to mention are approximations
19
Noise injection
  • Add noise to the counts estimated in the
    forward-backward procedure and let the size of the
    noise decrease slowly
  • In Krogh et al. (1994) the noise was generated by
    a random walk in the initial model
20
Simulated annealing Viterbi estimation
  • If there are N sequences, there's an exact
    translation from the N paths π^1, ..., π^N to the
    parameters of the model
  • Treat the paths as fundamental parameters in
    which to maximise the likelihood
  • Simulated annealing done in these variables
    instead of the model parameters

21
Simulated annealing Viterbi estimation
  • Each path π is sampled with probability
    P(x, π | θ)^(1/T) / Z
  • Denominator is Z, the partition function -> sum
    over all paths
  • Can be obtained by a modified forward algorithm
    using exponentiated transition and emission
    parameters

22
Simulated annealing Viterbi estimation
  • Exponentiated transition parameter
  • â_ij = a_ij^(1/T)
  • Exponentiated emission parameter
  • ê_j(x) = e_j(x)^(1/T)
  • Used in place of the unmodified probability
    parameters in the forward algorithm
  • Z is the result of the forward algorithm
23
Simulated annealing Viterbi estimation
  • Algorithm: stochastic sampling traceback
    algorithm for HMMs

Initialisation: π_(L+1) = End. Recursion: for L+1 ≥ i ≥ 1, ...
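
A minimal sketch of a stochastic sampling traceback for a small generic HMM. It assumes the same hypothetical dictionary layout (`start`, `trans`, `emit`) with no explicit end state, so the last state is drawn in proportion to the final forward values; each earlier state k is then drawn in proportion to f(i-1, k) times the exponentiated transition into the state already chosen.

```python
import random

def sample_path(x, start, trans, emit, T=1.0, rng=random):
    """Sample a state path with probability proportional to P(x, path)^(1/T)."""
    states = list(start)
    p = 1.0 / T
    # Forward pass with exponentiated parameters; f[i][k] covers x[0..i].
    f = [{k: (start[k] * emit[k][x[0]]) ** p for k in states}]
    for sym in x[1:]:
        prev = f[-1]
        f.append({l: emit[l][sym] ** p
                     * sum(prev[k] * trans[k][l] ** p for k in states)
                  for l in states})
    # Traceback: draw the last state, then walk backwards one position at a time.
    weights = [f[-1][k] for k in states]
    path = [rng.choices(states, weights=weights)[0]]
    for i in range(len(x) - 1, 0, -1):
        nxt = path[-1]
        weights = [f[i - 1][k] * trans[k][nxt] ** p for k in states]
        path.append(rng.choices(states, weights=weights)[0])
    path.reverse()
    return path
```

Replacing each random draw with an argmax would turn this back into an ordinary Viterbi-style traceback.
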
24
Simulated annealing Viterbi vs Viterbi
  • Key difference
  • Viterbi selects the most probable path for each
    sequence
  • Simulated annealing samples each path according
    to the likelihood of the path

25
Model Surgery
  • While training a model, two things can happen:
  • (a) some match states are redundant and should be
    absorbed into an insert state
  • (b) one or more insert states absorb too much
    sequence, in which case they should be expanded

26
Model Surgery
  • How much is a certain transition used by the training
    sequences?
  • Usage of a match state is the sum of counts for all
    letters in the state

27
Model surgery
  • If a match state is used by fewer than ½ of the sequences
    -> delete the module
  • If more than ½ of the sequences use the transitions
    into an insert state, it is expanded into new
    modules
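
A minimal sketch of the decision rule above, assuming usage is already available as the number of training sequences passing through each match and insert state. The function and parameter names are hypothetical (they only echo the thresholds on the slide), and indices are 0-based positions in the input lists.

```python
def surgery_plan(match_usage, insert_usage, n_seqs,
                 cut_match=0.5, cut_insert=0.5):
    """Return (match states to delete, insert states to expand)."""
    delete = [k for k, used in enumerate(match_usage)
              if used < cut_match * n_seqs]
    expand = [k for k, used in enumerate(insert_usage)
              if used > cut_insert * n_seqs]
    return delete, expand
```

After applying the plan the model length changes, so training is normally rerun on the modified model.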

28
Model surgery Example SAM
  • I ran SAM with and without model
    surgery
  • Same 7 sequences as in the example before
  • Parameters <cutinsert 0.25> <cutmatch 0.5> ->
    delete any match state used by fewer than half
    the sequences, and insert match states for any
    insert node used by more than one quarter of
    the sequences

29
Model surgery Example SAM
  • Without model surgery
  • >seq1
  • FPHFD.....L...S.....-HGSAQ
  • >seq2
  • FESFG.....D...LstpdaVMGNPK
  • >seq3
  • FDRFKhlkteA...E.....MKASED
  • >seq4
  • FTQFA.....G...Kdles.IKGTAP
  • >seq5
  • FPKFK.....G...LttadqLKKSAD
  • >seq6
  • FSFLK.....GtseV.....PQNNPE
  • >seq7
  • FGFSG.....A...-.....--SDPG
  • With model surgery
  • >seq1
  • FPHF.DLS-..-..--HGSAQ
  • >seq2
  • FESF.GDLStpD..AVMGNPK
  • >seq3
  • FDRF.KHLK..TeaEMKASED
  • >seq4
  • FTQFaGKDL..E..SIKGTAP
  • >seq5
  • FPKF.KGLTtaD..QLKKSAD
  • >seq6
  • FSFL.KGTS..E..VPQNNPE
  • >seq7
  • FGFS.G---..-..--ASDPG

30
Building phylogenetic trees
31
Overview
  • The tree of life description
  • Background on trees

32
Multiple alignment and trees
  • Alignment of sequences should take account of
    their evolutionary relationship (Sankoff, Morel &
    Cedergren, 1973).
  • Several progressive alignment algorithms use a
    guide tree (to guide the clustering process).
  • We begin to build trees.

33
The tree of life
  • The similarity of molecular mechanisms of the
    organisms that have been studied strongly
    suggests that all organisms on Earth had a common
    ancestor. Thus any set of species is related,
    and this relationship is called a phylogeny.
  • Usually the relationship can be represented by a
    phylogenetic tree.

34
  • Zuckerkandl & Pauling's 1962 paper showed that
    molecular sequences provide sets of morphological
    characters that can carry a large amount of
    information.
  • An assumption: the sequences we want to analyse
    phylogenetically have descended from some
    common ancestral gene in a common ancestral
    species.
  • Gene duplication exists -> we have to check the
    assumption carefully.

35
Gene duplication and speciation
  • By another mechanism, gene duplication, two
    sequences can also be separated and diverge from
    the common ancestor.
  • Genes which diverged because of speciation are
    called orthologues. Genes which diverged by gene
    duplication are called paralogues.

36
A tree of orthologues: alpha haemoglobins
HBA_ACCGE, HBA_AEGMO, HBA_AILFU, HBA_AILME,
HBA_ALCAA, HBA_ALLMI, HBA_AMBME, HBA_ANAPL
(SWISS-PROT).
37
A tree of paralogues: HBAT_HUMAN, HBAZ_HUMAN,
HBA_HUMAN, HBB_HUMAN, HBD_HUMAN, HBE_HUMAN,
HBG_HUMAN, MYG_HUMAN (SWISS-PROT).
38
Background on trees
  • All trees will be assumed to be binary (an edge
    that branches splits into two daughter edges).
  • Each edge of the tree has a certain amount of
    evolutionary divergence associated with it. We
    adopt the general term length, which will be
    represented by the lengths of edges in figures.
  • A true biological phylogeny has a root, or
    ultimate ancestor of all sequences.

39
Rooted and unrooted tree
40
  • A tree with a given labelling will be called a
    labelled branching pattern.
  • We refer to this as the tree topology and denote
    it by T.
  • Lengths of the edges: t_i, with a suitable
    numbering scheme for the i's.

41
Counting and labelling
  • Rooted tree
  • n leaves, plus (n-1) branch nodes in addition to
    leaves -> we have 2n-1 nodes in all, and 2n-2
    edges.
  • leaves 1..n, branch nodes n+1 .. 2n-1,
    (2n-1)th node is root.

42
Counting and labelling
  • Unrooted tree
  • n leaves, 2n-2 nodes and 2n-3 edges.
  • a root can be added at any of its edges -> we can
    get 2n-3 rooted trees.

43
Number of rooted and unrooted trees
A root can be added at any edge, producing 2n-3
rooted trees from each unrooted tree -> there are
(2n-3) times as many rooted trees as unrooted
trees, for a given number n of leaves.
44
Instead of the root, we can add an extra edge or
branch with a distinct label in its leaf.
45
  • There are three such trees with four leaves, each with
    (2n-3) = 5 edges; they are distinct labelled branching
    patterns.
  • There are then five ways of adding a further
    branch labelled with a distinct label (5),
    giving in all 3 x 5 = 15 unrooted trees with five
    leaves.
  • The number of unrooted trees with n leaves is
    equal to 3 · 5 · ... · (2n-5) = (2n-5)!! So, we have
    (2n-3)!! rooted trees with n leaves.
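
The counting argument can be checked numerically with a short sketch. The function names are illustrative; the formulas are the (2n-5)!! and (2n-3)!! counts stated above.

```python
def double_factorial_odd(k):
    """k!! for odd k >= -1, with (-1)!! = 1!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def n_unrooted(n):
    """Number of labelled unrooted binary trees with n >= 3 leaves."""
    return double_factorial_odd(2 * n - 5)

def n_rooted(n):
    """Number of labelled rooted binary trees with n >= 2 leaves."""
    return double_factorial_odd(2 * n - 3)
```

For n = 4 and n = 5 this gives 3 and 15 unrooted trees, matching the slide, and `n_rooted(n)` is always (2n-3) times `n_unrooted(n)`.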

46
Building phylogenetic trees
  • Questions?

47
Exercise 7.2
  • The trees with three and four leaves in Figure
    7.3 all have the same unlabelled branching
    pattern. For both rooted and unrooted trees, how
    many leaves do there have to be to obtain more
    than one unlabelled branching pattern? Find a
    recurrence relation for the number of rooted
    trees. (Hint: consider the trees formed by
    joining two trees at their root.)

48
Exercise 7.2
49
Exercise 7.3
  • All trees considered so far have been binary, but
    one can envisage ternary trees that, in their
    rooted form, have three branches descending from
    a branch node. If there are m branch nodes in an
    unrooted ternary tree, how many leaves are there
    and how many edges?

50
Exercise 7.4
  • Consider next a composite unrooted tree with m
    ternary branch nodes and n binary branch nodes.
    How many leaves are there, and how many edges?
    Let N_{m,n} denote the number of distinct labelled
    branching patterns of this tree. Extend the
    counting argument for binary trees to show that
  • N_{m,n} = (3m + 2n - 1) N_{m,n-1} + (n + 1) N_{m-1,n+1}
  • (Hint: the first term after the '=' counts
    the number of ways that a new edge can be added
    to an existing edge, thereby creating an
    additional binary node; the second term
    corresponds to edges added at binary nodes,
    thereby producing ternary nodes.)