http://creativecommons.org/licenses/by-sa/2.0/ - PowerPoint PPT Presentation

1
http://creativecommons.org/licenses/by-sa/2.0/
2
BNFO 602 Lecture 1
  • Usman Roshan

3
Phylogenetics
  • Study of how species relate to each other
  • "Nothing in biology makes sense except in the
    light of evolution" (Theodosius Dobzhansky, Am.
    Biol. Teacher, 1973)
  • Rich in computational problems
  • Fundamental tool in comparative bioinformatics

4
Why phylogenetics?
  • Study of evolution
    • Origin and migration of humans
    • Origin and spread of disease
  • Many applications in comparative bioinformatics
    • Sequence alignment
    • Motif detection (phylogenetic motifs,
      evolutionary trace, phylogenetic footprinting)
    • Correlated mutation (useful for structural
      contact prediction)
    • Protein interaction
    • Gene networks
    • Vaccine development
    • And many more

5
Phylogeny Problem
[Figure: input sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT for the taxa U, V, W, X, Y, and an output tree with leaves X, U, Y, V, W]
6
Bipartitions
  • Phylogenies are equivalent to sets of
    bipartitions: removing an edge splits the leaves
    into two groups
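
A minimal sketch of reading bipartitions off a tree. The nested-tuple tree encoding and the function name `bipartitions` are assumptions made for this illustration, not part of the lecture. Only non-trivial bipartitions (both sides with at least two leaves) are kept, since the trivial ones are shared by every tree on the same taxa.

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions induced by the edges of a tree.

    `tree` is a nested tuple of taxon names (hypothetical encoding),
    e.g. (("A", "B"), ("C", "D")).
    """
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    taxa = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)          # leaves below the edge above this node
        other = taxa - side          # leaves on the other side of that edge
        if len(side) > 1 and len(other) > 1:
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


# Two different rootings of the same unrooted tree give the same bipartitions.
t1 = (("A", "B"), ("C", "D"))
t2 = ("A", ("B", ("C", "D")))
t3 = (("A", "C"), ("B", "D"))
print(bipartitions(t1) == bipartitions(t2))
print(bipartitions(t1) == bipartitions(t3))
```

Comparing the bipartition sets of two trees is one way to quantify the topological differences discussed on the next slide.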

7
Topological differences
8
Phylogeny Problem
  • Two main methodologies
    • Alignment first and phylogeny second
      • Construct alignment using one of the MANY
        alignment programs in the literature
      • Do manual (eye) adjustments if necessary
      • Apply a phylogeny reconstruction method
      • Fast but biologically not realistic
      • Phylogeny is highly dependent on accuracy of
        alignment (but so is the alignment on the
        phylogeny!)
    • Simultaneous alignment and phylogeny
      reconstruction
      • Output both an alignment and phylogeny
      • Computationally much harder
      • Biologically more realistic as insertions,
        deletions, and mutations occur during the
        evolutionary process

9
First methodology
  • Compute alignment (for now we assume we are given
    an alignment)
  • Construct a phylogeny (two approaches)
    • Distance-based methods
      • Input: distance matrix containing pairwise
        statistical estimates computed from the
        aligned sequences
      • Output: phylogenetic tree
      • Fast but less accurate
    • Character-based methods
      • Input: sequence alignment
      • Output: phylogenetic tree
      • Accurate but computationally very hard

10
Distance-based methods
11
Evolution on a single edge
  • Poisson process
    • Number of changes in a fixed time interval t is
      independent of changes in any other
      non-overlapping time interval u
    • Number of changes in time interval t is
      proportional to the length of the interval
    • No changes in time interval of length 0
  • Let X be the number of nucleotide changes on a
    single edge. We assume X is a Poisson process
    with expected number of changes $\lambda$
  • Probability dictates that
    $P(X = k) = \frac{e^{-\lambda}\lambda^{k}}{k!}$

12
Evolution on a single edge
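  • We want to compute $p(e)$ (the probability of a
    nucleotide change on edge e)
  • The probability of observing a change is just the
    sum of probabilities of observing k changes over
    all possible values of k (excluding even ones
    because those changes cannot be seen):
    $p(e) = \sum_{k\ \mathrm{odd}} \frac{e^{-\lambda(e)}\lambda(e)^{k}}{k!} = \frac{1}{2}\left(1 - e^{-2\lambda(e)}\right)$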

13
Evolution on a single edge
  • Expected number of nucleotide changes on a given
    edge is given by
    $\lambda(e) = -\frac{1}{2}\ln\left(1 - 2p(e)\right)$
  • Key: $\lambda(e)$ is additive

14
Additivity
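  • Assume we have a path of k edges and that
    $p_1, p_2, \ldots, p_k$ are the probabilities of
    change on each edge of the path
  • Using induction we can show that the probability
    of change along the whole path is
    $p = \frac{1}{2}\left(1 - \prod_{i=1}^{k}(1 - 2p_i)\right)$
  • Multiplicative term is hard to deal with and does
    not easily decompose into a product or sum of
    the $p_i$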

15
Additivity
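  • But the expected number of nucleotide changes on
    the path is elegant: with p the probability of
    change along the path and $p_i$ the probability
    of change on its i-th edge,
    $\lambda = -\frac{1}{2}\ln(1 - 2p) = \sum_{i=1}^{k} -\frac{1}{2}\ln(1 - 2p_i) = \sum_{i=1}^{k} \lambda_i$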
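
The additivity claim can be checked numerically. The sketch below assumes the two-state (0/1) model implied by "even numbers of changes cannot be seen", so that an observed change has probability 1/2 (1 - exp(-2 lambda)); the function names are chosen for this illustration only.

```python
import math


def change_probability(lam):
    """P(observed change) on an edge with expected number of changes lam."""
    return 0.5 * (1.0 - math.exp(-2.0 * lam))


def odd_poisson_sum(lam, terms=60):
    """Same probability, summed directly over odd k of the Poisson pmf."""
    return sum(math.exp(-lam) * lam ** k / math.factorial(k)
               for k in range(1, terms, 2))


def expected_changes(p):
    """Invert change_probability: lam = -1/2 ln(1 - 2p), valid for p < 1/2."""
    return -0.5 * math.log(1.0 - 2.0 * p)


def path_probability(edge_probs):
    """Change probability along a path: 1/2 (1 - prod(1 - 2 p_i))."""
    product = 1.0
    for p in edge_probs:
        product *= 1.0 - 2.0 * p
    return 0.5 * (1.0 - product)


lams = [0.1, 0.25, 0.4]                      # hypothetical edge lengths
probs = [change_probability(lam) for lam in lams]
path_lam = expected_changes(path_probability(probs))
print(path_lam, sum(lams))                   # the two values agree up to rounding
```

The probabilities themselves do not add along the path, but the expected numbers of changes do, which is why distance methods work with the latter.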

16
Evolutionary models
  • Simple {0,1} alphabet evolutionary model
    • i.i.d. model
    • Uniformly random root sequence
  • Jukes-Cantor
    • Uniformly random root sequence
    • i.i.d. model

17
Evolutionary models
  • General Markov Model
    • Uniformly random root sequence
    • i.i.d. model
  • For time reversible models

18
Variation across sites
  • Standard assumption of how sites can vary is that
    each site has a multiplicative scaling factor
  • Typically these scaling factors are drawn from a
    Gamma distribution (or Gamma plus invariant)

19
Special issues
  • Molecular clock: the expected number of changes
    for a site is proportional to time
  • No-common-mechanism model: there is a random
    variable for every combination of edge and site

20
Evolutionary distance estimation
21
Estimating evolutionary distances
  • For sequences A and B what is the evolutionary
    distance under the Jukes-Cantor model?
    • A: ACCTGTGGGTAACCACCC
    • B: ACCTGAGGGATAGGTCCG
  • But we don't know the probability p of a change
    at a site, so it has to be estimated

22
Estimating evolutionary distances
  • Assume nucleotide changes are Bernoulli trials
    (i.i.d. trials of success or failure)
  • p is the probability of heads (a change) in each
    of n Bernoulli trials (n is sequence length)
  • Compute a maximum likelihood estimate for p

[Figure: the two sequences with a per-site change indicator]
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1
23
Estimating evolutionary distance
  • We want to find the value of p that maximizes the
    probability of observing k changes in n sites,
    $P = \binom{n}{k} p^{k} (1 - p)^{n-k}$
  • Set dP/dp to 0 and solve for p to get
    $\hat{p} = k/n$

24
Estimating evolutionary distances
  • $\hat{p} = 5/18$ for the indicator vector below
  • Continuing in this manner we estimate the
    distance for all pairs of sequences in the
    alignment
  • We now have a distance matrix under a
    biologically sound evolutionary model

[Figure: the two sequences with a per-site change indicator]
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1
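
A minimal sketch of the estimation step, taking the slide's 0/1 indicator vector as given. The conversion from the estimated change probability to a distance uses the standard Jukes-Cantor correction, -(3/4) ln(1 - (4/3) p); the function names are assumptions made for this illustration.

```python
import math


def mle_change_probability(indicators):
    """Maximum likelihood estimate k/n from a 0/1 change-indicator vector."""
    return sum(indicators) / len(indicators)


def log_likelihood(p, k, n):
    """Binomial log-likelihood of k changes in n sites (constant term dropped)."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def jukes_cantor_distance(p):
    """Jukes-Cantor correction, defined only for p < 3/4."""
    if p >= 0.75:
        raise ValueError("distance undefined for p >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Indicator vector from the slide: 5 ones out of 18 sites.
indicators = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1]
p_hat = mle_change_probability(indicators)
d_hat = jukes_cantor_distance(p_hat)
print(p_hat, d_hat)
```

The corrected distance is larger than the raw fraction of differing sites because it accounts for changes hidden by later changes at the same site.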
25
Distance methods
26
Distance methods
  • UPGMA: similar to hierarchical clustering but not
    additive
  • Neighbor-joining: more sophisticated and additive
  • What is additivity?

27
Additivity
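
A distance matrix is additive when some tree with edge lengths reproduces every pairwise distance as a path length. One standard test is the four-point condition: for every four taxa, the two largest of the three pairwise sums are equal. The sketch below assumes distances stored in a dict keyed by `frozenset` pairs, and the example matrix comes from a hypothetical tree (leaf edges 1, 2, 3, 4 and internal edge 5), not from the lecture.

```python
from itertools import combinations


def is_additive(dist, taxa, tol=1e-9):
    """Four-point condition check on a symmetric distance table."""
    for a, b, c, d in combinations(taxa, 4):
        sums = sorted([
            dist[frozenset((a, b))] + dist[frozenset((c, d))],
            dist[frozenset((a, c))] + dist[frozenset((b, d))],
            dist[frozenset((a, d))] + dist[frozenset((b, c))],
        ])
        # The two largest sums must coincide.
        if sums[2] - sums[1] > tol:
            return False
    return True


def make_dist(pairs):
    """Build the table from a dict like {"AB": 3}; single-letter taxa only."""
    return {frozenset(pair): value for pair, value in pairs.items()}


# Path lengths in the hypothetical tree ((A:1, B:2):5, (C:3, D:4)).
tree_dist = make_dist({"AB": 3, "AC": 9, "AD": 10, "BC": 10, "BD": 11, "CD": 7})
# The same table with one entry perturbed.
noisy_dist = make_dist({"AB": 3, "AC": 6, "AD": 10, "BC": 10, "BD": 11, "CD": 7})

print(is_additive(tree_dist, "ABCD"))
print(is_additive(noisy_dist, "ABCD"))
```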
28
UPGMA
  • UPGMA is not additive but works for ultrametric
    trees. Takes O(n^2) time

[Figure: distance matrix on the taxa A, B, C, D and the corresponding ultrametric tree, with edge labels 10, 10, 3, 3, 3, 3]
       B    C    D
  A    6   26   26
  B        26   26
  C              6
29
UPGMA
  1. Initialize n clusters where each cluster i
    contains the sequence i
  2. Find closest pair of clusters i, j, using
    distances in matrix D
  3. Make them neighbors in the tree by adding new
    node (ij), and set distance from (ij) to i and j
    as $D_{ij}/2$
  4. Update distance matrix D: for all clusters k
    compute the following ($n_i$ and $n_j$ are the
    sizes of clusters i and j respectively)
    $D_{(ij),k} = \frac{n_i D_{ik} + n_j D_{jk}}{n_i + n_j}$
  5. Delete columns and rows for i and j in D and add
    new ones corresponding to cluster (ij) with
    distances as computed above
  6. Go to step 2 until only one cluster is left
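
The steps above can be sketched as follows. This is a small illustration, not an optimized implementation: distances are kept in a dict keyed by `frozenset` pairs, clusters are nested tuples, and the update uses the size-weighted average (n_i D_ik + n_j D_jk) / (n_i + n_j). The two example matrices are the ones that appear in the UPGMA figures of this lecture; the function and variable names are assumptions.

```python
from itertools import combinations


def upgma(names, dist):
    """Return (tree, root_height); tree is a nested tuple of taxon names."""
    size = {name: 1 for name in names}       # cluster sizes n_i
    d = dict(dist)                           # working copy of D
    clusters = list(names)
    height = 0.0
    while len(clusters) > 1:
        # Closest pair of clusters under the current matrix.
        i, j = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        dij = d[frozenset((i, j))]
        new = (i, j)
        height = dij / 2.0                   # height of the new node (ij)
        # Size-weighted average distance from (ij) to every other cluster k.
        for k in clusters:
            if k != i and k != j:
                d[frozenset((new, k))] = (
                    size[i] * d[frozenset((i, k))] + size[j] * d[frozenset((j, k))]
                ) / (size[i] + size[j])
        size[new] = size[i] + size[j]
        clusters = [c for c in clusters if c != i and c != j] + [new]
    return clusters[0], height


def make_dist(pairs):
    """Build the table from a dict like {"AB": 6}; single-letter taxa only."""
    return {frozenset(pair): value for pair, value in pairs.items()}


ultrametric = make_dist({"AB": 6, "AC": 26, "AD": 26, "BC": 26, "BD": 26, "CD": 6})
non_ultrametric = make_dist({"AB": 13, "AC": 16, "AD": 26, "BC": 12, "BD": 19, "CD": 13})

print(upgma("ABCD", ultrametric))
print(upgma("ABCD", non_ultrametric))
```

On the first matrix the sketch pairs A with B and C with D under a root at height 13. On the second it merges B and C first, because 12 is the smallest entry, and then attaches A at height 7.25.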

30
UPGMA
[Figure: distance matrix on the taxa A, B, C, D and the tree built by UPGMA, with labels 13, 13, 3, 3, 3, 3]
       B    C    D
  A    6   26   26
  B        26   26
  C              6
31
UPGMA
  • Doesn't work (in general) for non-ultrametric
    trees

[Figure: distance matrix on the taxa A, B, C, D and a non-ultrametric tree, with edge labels 3, 3, 3, 3, 10, 10]
       B    C    D
  A   13   16   26
  B        12   19
  C             13
32
UPGMA
  • UPGMA constructs an incorrect tree here

[Figure: distance matrix on the taxa A, B, C, D and the tree built by UPGMA, with labels 7.25, 7.25, 7.25, 7.25, 6, 6]
       B    C    D
  A   13   16   26
  B        12   19
  C             13
33
UPGMA
  • Bipartition (BC,AD) is not in the true tree

[Figure: the true tree (edge labels 3, 3, 3, 3, 10, 10) next to the UPGMA tree (labels 7.25, 7.25, 7.25, 7.25, 6, 6), both on the taxa A, B, C, D]
34
Neighbor joining
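  • Additive; O(n^3) time in its standard form
  • Initialization: same as UPGMA
  • For each species i compute
    $u_i = \sum_{k \neq i} \frac{D_{ik}}{n - 2}$
  • Select i and j for which
    $D_{ij} - u_i - u_j$ is minimum
  • Make them neighbors in the tree by adding new
    node (ij), and set the distances from (ij) to i
    and j as
    $d_{i,(ij)} = \frac{1}{2}D_{ij} + \frac{1}{2}(u_i - u_j)$ and
    $d_{j,(ij)} = \frac{1}{2}D_{ij} + \frac{1}{2}(u_j - u_i)$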

35
Neighbor joining
  1. Update distance matrix D: for all clusters k
    compute
    $D_{(ij),k} = \frac{1}{2}\left(D_{ik} + D_{jk} - D_{ij}\right)$
  2. Delete columns and rows for i and j in D and add
    new ones corresponding to cluster (ij) with
    distances as computed above
  3. Repeat from the "For each species compute" step
    until two nodes/clusters are left
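
A minimal sketch of neighbor joining under a common textbook formulation, assumed here because the formulas did not survive in the transcript: u_i is the sum of distances from i divided by (n - 2), the pair minimizing D_ij - u_i - u_j is joined, branch lengths are D_ij/2 plus or minus (u_i - u_j)/2, and the new row is (D_ik + D_jk - D_ij)/2. The data layout and names are assumptions, and the example matrix is the one from the figures in this lecture.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a list of (node, neighbor, length) edges of the NJ tree."""
    d = dict(dist)                            # working copy of D
    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Average-like correction term for every current node.
        u = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        dij = d[frozenset((i, j))]
        new = (i, j)
        edges.append((i, new, 0.5 * dij + 0.5 * (u[i] - u[j])))
        edges.append((j, new, 0.5 * dij + 0.5 * (u[j] - u[i])))
        # Distances from the new node to all remaining nodes.
        for k in nodes:
            if k != i and k != j:
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        nodes = [x for x in nodes if x != i and x != j] + [new]
    a, b = nodes                              # connect the last two nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


def make_dist(pairs):
    """Build the table from a dict like {"AB": 13}; single-letter taxa only."""
    return {frozenset(pair): value for pair, value in pairs.items()}


slide_matrix = make_dist({"AB": 13, "AC": 16, "AD": 26, "BC": 12, "BD": 19, "CD": 13})
for edge in neighbor_joining("ABCD", slide_matrix):
    print(edge)
```

On this matrix the first join is A with B, whereas UPGMA's closest-pair rule starts with B and C.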

36
NJ
  • NJ constructs the correct tree for additive
    matrices

[Figure: distance matrix on the taxa A, B, C, D and the tree, with edge labels 3, 3, 3, 3, 10, 10]
       B    C    D
  A   13   16   26
  B        12   19
  C             13
37
Simulation studies
38
Simulation studies
  • The true evolutionary tree is never known in
    practice. Simulation allows us to study accuracy
    of methods under biologically realistic scenarios
  • The mathematics behind phylogenetics is often
    complex and challenging. Simulation allows us to
    study algorithms when theoretical analysis is
    not possible, and to examine algorithm
    performance under various conditions such as
    different evolutionary rates, sequence lengths,
    or numbers of taxa

39
Statistical consistency
  • As sequence lengths tend to infinity the distance
    estimation improves and eventually leads to the
    true additive matrix
  • If a method like NJ is then applied we get the
    true tree.
  • In practice, however, we have limited sequence
    length. Therefore we want to know how much
    sequence length a method requires to achieve low
    error

40
Convergence rates
  • Can be studied experimentally or theoretically
  • Theoretical results offer loose bounds
  • Experiments (under simulation) provide more
    realistic bounds on sequence lengths

41
Sequence length requirements
42
Sequence length requirements
43
Typical performance study
44
Sequence lengths for NJ
Sequence lengths required to obtain 90% accuracy
45
Error rate of NJ
46
Improving sequence length requirements
  • Later we will look at Disk-Covering Methods and
    study sequence length requirements of other
    methods (in addition to NJ)

47
Maximum Parsimony
  • Character based method
  • NP-hard (equivalent to a Steiner tree problem in
    sequence space)
  • Widely-used in phylogenetics
  • Slower than NJ but more accurate
  • Faster than ML
  • Assumes i.i.d.

48
Maximum Parsimony
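  • Input: set S of n aligned sequences of length k
  • Output: a phylogenetic tree T
    • leaf-labeled by sequences in S
    • additional sequences of length k labeling the
      internal nodes of T
    • such that $\sum_{(u,v) \in E(T)} H(u, v)$ is
      minimized, where H(u, v) is the Hamming
      distance between the sequences at u and v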

49
Maximum parsimony (example)
  • Input: four sequences
    • ACT
    • ACA
    • GTT
    • GTA
  • Question: which of the three trees has the best
    MP score?

50
Maximum Parsimony
[Figure: the three possible tree topologies on the leaves ACT, ACA, GTT, GTA]
51
Maximum Parsimony
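[Figure: the three trees with internal-node sequences and per-edge change counts; MP scores 7, 5, and 4. The optimal MP tree has MP score 4: it groups ACT with ACA and GTT with GTA, with internal nodes labeled ACA and GTA and edge changes 1, 2, 1]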
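
For a fixed topology, the best internal labeling can be scored with Fitch's algorithm, which is swapped in here as a standard technique and is not named on the slides. The sketch assumes rooted binary trees written as nested tuples of equal-length sequences; the position of the root does not affect the score. Because it returns the minimum over all internal labelings, its value for a topology can be lower than the score of a tree drawn with one particular labeling.

```python
def fitch_score(tree):
    """Minimum number of changes on a rooted binary tree (Fitch's algorithm)."""
    def post_order(node):
        # Returns (candidate state set per site, number of changes so far).
        if isinstance(node, str):
            return [{c} for c in node], 0
        left, right = node
        left_sets, left_cost = post_order(left)
        right_sets, right_cost = post_order(right)
        sets, cost = [], left_cost + right_cost
        for a, b in zip(left_sets, right_sets):
            common = a & b
            if common:
                sets.append(common)      # children agree on some state
            else:
                sets.append(a | b)       # disagreement costs one change
                cost += 1
        return sets, cost

    return post_order(tree)[1]


topologies = [
    (("ACT", "ACA"), ("GTT", "GTA")),
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "GTA"), ("ACA", "GTT")),
]
for topology in topologies:
    print(topology, fitch_score(topology))
```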
52
Maximum Parsimony computational complexity
53
Local search strategies
54
Local search for MP
  • Determine a candidate solution s
  • While s is not a local minimum
    • Find a neighbor s' of s such that
      MP(s') < MP(s)
    • If found, set s = s'
    • Else return s and exit
  • Time complexity unknown: could take forever or
    end quickly depending on starting tree and local
    move
  • Need to specify how to construct starting tree
    and local move
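
The loop above can be sketched on the four-sequence example from the Maximum Parsimony slides. Two pieces are assumptions for this illustration: trees are scored with Fitch's algorithm (a standard method, not named in the lecture), and the neighborhood is the NNI neighborhood of a four-taxon tree, which consists of the other two topologies.

```python
def fitch_score(tree):
    """Minimum number of changes on a rooted binary tree (Fitch's algorithm)."""
    def post_order(node):
        if isinstance(node, str):
            return [{c} for c in node], 0
        left_sets, left_cost = post_order(node[0])
        right_sets, right_cost = post_order(node[1])
        sets, cost = [], left_cost + right_cost
        for a, b in zip(left_sets, right_sets):
            if a & b:
                sets.append(a & b)
            else:
                sets.append(a | b)
                cost += 1
        return sets, cost

    return post_order(tree)[1]


def nni_neighbors(tree):
    """NNI neighbors of a four-taxon tree ((a, b), (c, d)): swap b with c or d."""
    (a, b), (c, d) = tree
    return [((a, c), (b, d)), ((a, d), (b, c))]


def local_search(start, neighbors, score):
    """First-improvement local search; stops at a local minimum of `score`."""
    current = start
    while True:
        improved = next(
            (s for s in neighbors(current) if score(s) < score(current)), None)
        if improved is None:
            return current               # no better neighbor: local minimum
        current = improved


start = (("ACT", "GTA"), ("ACA", "GTT"))
result = local_search(start, nni_neighbors, fitch_score)
print(result, fitch_score(result))
```

With only three topologies the search cannot get stuck in a poor local minimum; on larger trees the outcome depends on the starting tree and the move, as the slide notes.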

55
Starting tree for MP
  • Random phylogeny: O(n) time
  • Greedy-MP

56
Greedy-MP
Greedy-MP takes O(n^2 k^2) time
57
Local moves for MP: NNI
  • For each edge we get two different topologies
  • Neighborhood size is 2n-6

58
Local moves for MP: SPR
  • Neighborhood size is quadratic in number of taxa
  • Computing the minimum number of SPR moves between
    two rooted phylogenies is NP-hard

59
Local moves for MP: TBR
  • Neighborhood size is cubic in number of taxa
  • Computing the minimum number of TBR moves between
    two unrooted phylogenies is NP-hard

60
Local optima is a problem
61
Iterated local search: escape local optima by perturbation
[Figure labels: Local optimum, Local search]
62
Iterated local search: escape local optima by perturbation
[Figure labels: Local optimum, Local search, Perturbation, Output of perturbation]
63
Iterated local search: escape local optima by perturbation
[Figure labels: Local optimum, Local search, Perturbation, Output of perturbation, Local search]
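
A minimal sketch of iterated local search on a toy problem. Everything concrete here is an assumption for illustration: the "landscape" is a short list of scores indexed by position, neighbors are adjacent positions, and the perturbation is a random jump of up to three positions. For MP, the solutions would be trees, the neighbors NNI/SPR/TBR moves, and the score the MP score.

```python
import random


def local_search(start, neighbors, score):
    """First-improvement local search; stops at a local minimum of `score`."""
    current = start
    while True:
        improved = next(
            (s for s in neighbors(current) if score(s) < score(current)), None)
        if improved is None:
            return current
        current = improved


def iterated_local_search(start, neighbors, score, perturb, iterations, rng):
    """Alternate perturbation and local search, keeping the best optimum found."""
    best = local_search(start, neighbors, score)
    for _ in range(iterations):
        candidate = local_search(perturb(best, rng), neighbors, score)
        if score(candidate) < score(best):
            best = candidate             # accept only strict improvements
    return best


# Toy landscape with local minima at positions 1, 3, 5 and the global one at 7.
values = [5, 3, 4, 2, 6, 1, 7, 0, 8]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(values)]


def score(i):
    return values[i]


def perturb(i, rng):
    # Random jump of at most three positions, clamped to the landscape.
    return min(max(i + rng.randint(-3, 3), 0), len(values) - 1)


plain = local_search(0, neighbors, score)
best = iterated_local_search(0, neighbors, score, perturb, 200, random.Random(0))
print(plain, score(plain))
print(best, score(best))
```

Plain local search from position 0 stops at the first local minimum it reaches, while the perturbation step gives the search a chance to land in the basin of a better one.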
64
ILS for MP
  • Ratchet
  • Iterative-DCM3
  • TNT

65
Next time
  • Performance studies on local search for MP
  • Maximum likelihood
  • Alignment