http://creativecommons.org/licenses/by-sa/2.0/ - PowerPoint PPT Presentation

1
http://creativecommons.org/licenses/by-sa/2.0/
2
BNFO 602 Lecture 1

Usman Roshan

3
Phylogenetics

Study of how species relate to each other
"Nothing in biology makes sense except in the
light of evolution" (Theodosius Dobzhansky, Am.
Biol. Teacher, 1973)
Rich in computational problems
Fundamental tool in comparative bioinformatics

4
Why phylogenetics?

Study of evolution
Origin and migration of humans
Origin and spread of disease
Many applications in comparative bioinformatics
Sequence alignment
Motif detection (phylogenetic motifs,
evolutionary trace, phylogenetic footprinting)
Correlated mutation (useful for structural
contact prediction)
Protein interaction
Gene networks
Vaccine development
And many more

5
Phylogeny Problem
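[Figure: five taxa U, V, W, X, Y with the sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT (the taxon-to-sequence pairing was lost in extraction), and a tree with leaves X, U, Y, V, W.]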
6
Bipartitions

Phylogenies are equivalent to bipartitions
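
A minimal sketch of the statement above, assuming trees are written as nested Python tuples of leaf names (a representation chosen here for illustration, not taken from the lecture). Each internal edge splits the leaves into two sets, and two trees have the same unrooted topology when they induce the same set of non-trivial bipartitions. The function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions (splits) induced by the edges of the tree."""
    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Skip trivial splits that separate a single leaf (or nothing) from the rest.
        if len(below) > 1 and len(other) > 1:
            splits.add(frozenset([below, other]))
        return below

    visit(tree)
    return splits


t1 = (("A", "B"), ("C", "D"))
t2 = (("B", "C"), ("A", "D"))
t3 = ((("A", "B"), "C"), "D")  # different rooting, same unrooted topology as t1

print(bipartitions(t1) == bipartitions(t3))  # same set of splits
print(bipartitions(t1) == bipartitions(t2))  # different set of splits
```

Comparing split sets in this way is also one common way to quantify topological differences between two trees.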

7
Topological differences
8
Phylogeny Problem

Two main methodologies
Alignment first and phylogeny second
Construct alignment using one of the MANY
alignment programs in the literature
Do manual (eye) adjustments if necessary
Apply a phylogeny reconstruction method
Fast but not biologically realistic
Phylogeny is highly dependent on accuracy of
alignment (but so is the alignment on the
phylogeny!)
Simultaneous alignment and phylogeny
reconstruction
Output both an alignment and phylogeny
Computationally much harder
Biologically more realistic as insertions,
deletions, and mutations occur during the
evolutionary process

9
First methodology

Compute alignment (for now we assume we are given
an alignment)
Construct a phylogeny (two approaches)
Distance-based methods
Input: distance matrix containing pairwise
statistical estimates of the distance between aligned sequences
Output: phylogenetic tree
Fast but less accurate
Character-based methods
Input: sequence alignment
Output: phylogenetic tree
Accurate but computationally very hard

10
Distance-based methods
11
Evolution on a single edge

Poisson process
Number of changes in a fixed time interval t is
independent of changes in any other
non-overlapping time interval u
Expected number of changes in time interval t is
proportional to the length of the interval
No changes in time interval of length 0
Let X be the number of nucleotide changes on a
single edge. We assume X is a Poisson process
with expected value $\lambda$ on that edge
Probability dictates that $\Pr[X = k] = e^{-\lambda} \lambda^k / k!$

12
Evolution on a single edge

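We want to compute $p(e)$ (the probability of
observing a nucleotide change on edge e)
The probability of observing a change is just the
sum of probabilities of observing k changes over
all possible values of k (excluding even ones
because those changes cannot be seen)
$p(e) = \sum_{k \text{ odd}} e^{-\lambda} \lambda^k / k! = \tfrac{1}{2}\left(1 - e^{-2\lambda}\right)$, where $\lambda$ is the expected number of changes on e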

13
Evolution on a single edge

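Expected number of nucleotide changes on a given
edge e is given by $\lambda(e) = -\tfrac{1}{2} \ln\left(1 - 2p(e)\right)$, where $p(e)$ is the probability of observing a change on e
Key: $\lambda(e)$ is additive along a path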
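
A small numerical check of the single-edge argument, assuming the two-state (0,1) model in which an even number of changes at a site is invisible. The sketch sums the Poisson probabilities over odd k, compares the result with the closed form (1 - e^(-2λ))/2, and inverts it to recover the expected number of changes. The function names and the value λ = 0.3 are illustrative choices, not part of the lecture.

```python
import math


def prob_odd_changes(lam, terms=60):
    """Sum Poisson probabilities over odd k (truncated series)."""
    return sum(
        math.exp(-lam) * lam**k / math.factorial(k) for k in range(1, terms, 2)
    )


def prob_change(lam):
    """Closed form for the probability of an observable change."""
    return 0.5 * (1.0 - math.exp(-2.0 * lam))


def expected_changes(p):
    """Invert the closed form: expected number of changes given p < 1/2."""
    return -0.5 * math.log(1.0 - 2.0 * p)


lam = 0.3
print(prob_odd_changes(lam), prob_change(lam))
print(expected_changes(prob_change(lam)))
```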

14
Additivity

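Assume we have a path of k edges and that p1,
p2, ..., pk are the probabilities of change on each
edge of the path
Using induction we can show that the probability of change along the whole path is $p = \tfrac{1}{2}\left(1 - \prod_{i=1}^{k} (1 - 2p_i)\right)$
The multiplicative term is hard to deal with and does
not easily decompose into a product or sum of
the pi's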

15
Additivity

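But the expected number of nucleotide changes on
the path is elegant: $\lambda = \sum_{i=1}^{k} \lambda_i$, where $\lambda_i = -\tfrac{1}{2} \ln(1 - 2p_i)$ is the expected number of changes on edge i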
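
The contrast between the two quantities can be checked numerically. This sketch assumes the two-state model, where changes on consecutive edges compose as p = (1 - Π(1 - 2p_i))/2 and the expected number of changes is -ln(1 - 2p)/2. The edge probabilities are made-up values for illustration.

```python
import math


def path_change_probability(edge_probs):
    """Probability of an observable change along a path (two-state model)."""
    product = 1.0
    for p in edge_probs:
        product *= 1.0 - 2.0 * p
    return 0.5 * (1.0 - product)


def expected_changes(p):
    """Expected number of changes corresponding to change probability p."""
    return -0.5 * math.log(1.0 - 2.0 * p)


edge_probs = [0.05, 0.10, 0.20]  # hypothetical per-edge change probabilities
path_p = path_change_probability(edge_probs)

# The probabilities do not add, but the expected numbers of changes do.
print(path_p, sum(edge_probs))
print(expected_changes(path_p), sum(expected_changes(p) for p in edge_probs))
```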

16
Evolutionary models

Simple {0,1} alphabet evolutionary model
i.i.d. model
uniformly random root sequence
Jukes-Cantor
Uniformly random root sequence
i.i.d. model

17
Evolutionary models

General Markov Model
Uniformly random root sequence
i.i.d. model
For time reversible models

18
Variation across sites

Standard assumption of how sites can vary is that
each site has a multiplicative scaling factor
Typically these scaling factors are drawn from a
Gamma distribution (or Gamma plus invariant)

19
Special issues

Molecular clock: the expected number of changes
for a site is proportional to time
No-common-mechanism model: there is a random
variable for every combination of edge and site

20
Evolutionary distance estimation
21
Estimating evolutionary distances

For sequences A and B what is the evolutionary
distance under the Jukes-Cantor model?
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
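Under Jukes-Cantor the distance is $d_{AB} = -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} p\right)$, where p is the probability that a site differs between A and B
But we don't know what p is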

22
Estimating evolutionary distances

Assume nucleotide changes are Bernoulli trials
(i.i.d. trials of success or failure)
p is the probability of a head (a changed site) in each of n Bernoulli trials
(n is sequence length)
Compute a maximum likelihood estimate for p

ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 1
23
Estimating evolutionary distance

We want to find the value of p that maximizes the
probability $P = p^k (1-p)^{n-k}$ of observing k changed sites out of n
Set dP/dp to 0 and solve for p to get $\hat{p} = k/n$

24
Estimating evolutionary distances

$\hat{p}$ = 7/18
Continuing in this manner we estimate the distance for
all pairs of sequences in the alignment
We now have a distance matrix under a
biologically sound evolutionary model

ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 1
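
The estimation steps can be sketched in a few lines. This is a minimal illustration, assuming the standard Jukes-Cantor correction d = -(3/4) ln(1 - (4/3) p) and using the two example sequences from the slides. The function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Maximum likelihood estimate of p: the fraction of sites that differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(p_hat):
    """Jukes-Cantor distance; infinite when the sequences are saturated."""
    if p_hat >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


seq_a = "ACCTGTGGGTAACCACCC"
seq_b = "ACCTGAGGGATAGGTCCG"
p_hat = p_distance(seq_a, seq_b)
print(p_hat, jukes_cantor_distance(p_hat))
```

Applying the same two functions to every pair of aligned sequences fills in the distance matrix used by the distance methods.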
25
Distance methods
26
Distance methods

UPGMA: similar to hierarchical clustering but not
additive
Neighbor-joining: more sophisticated and additive
What is additivity?

27
Additivity
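
One standard way to test whether a distance matrix is additive (that is, whether it fits some tree with edge lengths exactly) is the four-point condition: for every four taxa, the two largest of the three pairwise sums must be equal. The sketch below assumes distances stored in a dictionary keyed by unordered pairs. Both example matrices are illustrative assumptions, the first built from a hypothetical tree with leaf edges A=10, B=3, C=3, D=10 and an internal edge of length 6.

```python
from itertools import combinations


def is_additive(names, dist, tol=1e-9):
    """Check the four-point condition on every quartet of taxa."""

    def d(a, b):
        return dist[frozenset((a, b))]

    for a, b, c, e in combinations(names, 4):
        sums = sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


names = ["A", "B", "C", "D"]
pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]

additive = dict(zip(map(frozenset, pairs), [13, 19, 26, 12, 19, 13]))
perturbed = dict(zip(map(frozenset, pairs), [13, 16, 26, 12, 19, 13]))

print(is_additive(names, additive))
print(is_additive(names, perturbed))
```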
28
UPGMA

UPGMA is not additive but works for
ultrametric trees. Takes O(n^2) time

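[Figure: distance matrix on A, B, C, D with d(A,B) = 6, d(C,D) = 6, and all other distances 26, next to the corresponding ultrametric tree (edge labels 3, 3, 3, 3, 10, 10).]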
29
UPGMA

Initialize n clusters where each cluster i
contains the sequence i
Find closest pair of clusters i, j, using
distances in matrix D
Make them neighbors in the tree by adding new
node (ij), and set distance from (ij) to i and j
as $D_{ij}/2$
Update distance matrix D: for all clusters k do
the following (ni and nj are size of clusters i
and j respectively): $D_{(ij),k} = \frac{n_i D_{ik} + n_j D_{jk}}{n_i + n_j}$
Delete columns and rows for i and j in D and add
new ones corresponding to cluster (ij) with
distances as computed above
Go to step 2 (find the closest pair) until only one cluster is left
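
The steps above can be sketched as follows. This is a simple illustration rather than an optimized implementation: it rescans all cluster pairs in every round, so it does not achieve the O(n^2) bound. Clusters are represented as nested tuples and distances as a dictionary keyed by unordered pairs, both of which are choices made here. The example input is an ultrametric matrix with d(A,B) = d(C,D) = 6 and every other distance 26, which appears to be the example used in the figure.

```python
from itertools import combinations


def upgma(names, dist):
    """Return (tree, heights): a nested-tuple tree and the height of each merge."""
    size = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    heights = {}
    while len(size) > 1:
        # Closest pair of current clusters.
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (i, j)
        heights[merged] = d[frozenset((i, j))] / 2
        # Size-weighted average distance from the new cluster to every other one.
        for k in size:
            if k != i and k != j:
                d[frozenset((merged, k))] = (
                    size[i] * d[frozenset((i, k))] + size[j] * d[frozenset((j, k))]
                ) / (size[i] + size[j])
        size[merged] = size.pop(i) + size.pop(j)
    (root,) = size
    return root, heights


ultrametric = {
    ("A", "B"): 6, ("A", "C"): 26, ("A", "D"): 26,
    ("B", "C"): 26, ("B", "D"): 26, ("C", "D"): 6,
}
tree, heights = upgma(["A", "B", "C", "D"], ultrametric)
print(tree)
print(heights[tree])
```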

30
UPGMA
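[Figure: the same distance matrix and the tree UPGMA builds from it: A is joined with B and C with D, and the two clusters are then joined (labels 3, 3, 3, 3, 6, 13, 13 as extracted).]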
31
UPGMA

Doesn't work (in general) for non-ultrametric
trees

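[Figure: distance matrix on A, B, C, D (entries 13, 16, 26, 12, 19, 13 as extracted, in reading order) next to a non-ultrametric tree with edge labels 3, 3, 3, 3, 10, 10.]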
32
UPGMA

UPGMA constructs incorrect tree here

[Figure: the same distance matrix next to the tree UPGMA builds from it (edge labels 6, 6, 7.25 as extracted).]
33
UPGMA

Bipartition (BC,AD) is not in true tree

[Figure: the true tree (edge labels 3, 3, 3, 3, 10, 10) next to the UPGMA tree (edge labels 6, 6, 7.25), which groups B with C.]
34
Neighbor joining

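Additive; the standard algorithm takes O(n^3) time
Initialization same as UPGMA
For each species i compute $u_i = \sum_{k \neq i} D_{ik} / (n - 2)$
Select i and j for which $D_{ij} - u_i - u_j$
is minimum
Make them neighbors in the tree by adding new
node (ij), and set distance from (ij) to i and j
as $d_{i,(ij)} = \tfrac{1}{2}(D_{ij} + u_i - u_j)$ and $d_{j,(ij)} = \tfrac{1}{2}(D_{ij} + u_j - u_i)$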

35
Neighbor joining

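Update distance matrix D: for all clusters k do
the following: $D_{(ij),k} = \tfrac{1}{2}(D_{ik} + D_{jk} - D_{ij})$
Delete columns and rows for i and j in D and add
new ones corresponding to cluster (ij) with
distances as computed above
Go back to the step that computes the per-species values and repeat until two nodes/clusters are left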
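
A minimal sketch of neighbor joining, assuming the common textbook formulation with u_i = Σ_k D_ik / (n - 2) and the selection criterion D_ij - u_i - u_j. It returns the tree as a list of (node, neighbor, length) edges, with internal nodes named by the pair they join. The input matrix is an illustrative assumption: it is additive and was generated from a hypothetical tree with leaf edges A=10, B=3, C=3, D=10 and an internal edge of length 6.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a list of (node, neighbor, branch_length) edges."""
    nodes = list(names)
    d = {frozenset(pair): value for pair, value in dist.items()}
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Average distance of each node to all others, scaled by n - 2.
        u = {
            a: sum(d[frozenset((a, b))] for b in nodes if b != a) / (n - 2)
            for a in nodes
        }
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]],
        )
        dij = d[frozenset((i, j))]
        parent = (i, j)
        edges.append((i, parent, 0.5 * (dij + u[i] - u[j])))
        edges.append((j, parent, 0.5 * (dij + u[j] - u[i])))
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k != i and k != j:
                d[frozenset((parent, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k != i and k != j] + [parent]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


additive = {
    ("A", "B"): 13, ("A", "C"): 19, ("A", "D"): 26,
    ("B", "C"): 12, ("B", "D"): 19, ("C", "D"): 13,
}
edges = neighbor_joining(["A", "B", "C", "D"], additive)
for edge in edges:
    print(edge)
```

On this input the closest pair is (B, C), which is not a pair of neighbors in the generating tree, so a closest-pair rule would join the wrong taxa first. The corrected criterion joins A with B and C with D instead.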

36
NJ

NJ constructs the correct tree for additive
matrices

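[Figure: distance matrix on A, B, C, D (entries 13, 16, 26, 12, 19, 13 as extracted, in reading order) next to the tree NJ builds from it (edge labels 3, 3, 3, 3, 10, 10).]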
37
Simulation studies
38
Simulation studies

The true evolutionary tree is never known in
practice. Simulation allows us to study accuracy
of methods under biologically realistic scenarios
The mathematics behind phylogenetics is often
complex and challenging. Simulation allows us to
study algorithms when theoretical analysis is not possible
and also to examine algorithm performance under
various conditions such as different evolutionary
rates, sequence lengths, or numbers of taxa

39
Statistical consistency

As sequence lengths tend to infinity the distance
estimation improves and eventually leads to the
true additive matrix
If a method like NJ is then applied we get the
true tree.
In practice, however, we have limited sequence
length. Therefore we want to know how much
sequence length a method requires to achieve low
error
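
The effect of sequence length on distance estimation can be illustrated with a small simulation. This sketch assumes the Jukes-Cantor model and takes a shortcut: instead of evolving full sequences, it draws for each site whether the two sequences differ, using p = (3/4)(1 - e^(-4d/3)). The true distance 0.3 and the sequence lengths are arbitrary illustrative values.

```python
import math
import random


def jc_distance(p_hat):
    """Jukes-Cantor distance from the observed fraction of differing sites."""
    if p_hat >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


def estimate_from_simulation(true_distance, n_sites, rng):
    """Simulate n_sites independent sites and return the estimated distance."""
    p = 0.75 * (1.0 - math.exp(-4.0 * true_distance / 3.0))
    changed = sum(1 for _ in range(n_sites) if rng.random() < p)
    return jc_distance(changed / n_sites)


rng = random.Random(0)  # fixed seed so runs are repeatable
for n_sites in (100, 10_000, 100_000):
    print(n_sites, estimate_from_simulation(0.3, n_sites, rng))
```

As the number of sites grows, the estimates should cluster more tightly around the true distance, which is the behaviour that statistical consistency describes.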

40
Convergence rates

Can be studied experimentally or theoretically
Theoretical results offer loose bounds
Experiments (under simulation) provide more
realistic bounds on sequence lengths

41
Sequence length requirements
42
Sequence length requirements
43
Typical performance study
44
Sequence lengths for NJ
Sequence lengths required to obtain 90% accuracy
45
Error rate of NJ
46
Improving sequence length requirements

Later we will look at Disk-Covering Methods and
study sequence length requirements of other
methods (in addition to NJ)

47
Maximum Parsimony

Character based method
NP-hard (reduction to the Steiner tree problem)
Widely used in phylogenetics
Slower than NJ but more accurate
Faster than ML
Assumes sites are i.i.d.

48
Maximum Parsimony

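Input: set S of n aligned sequences of length k
Output: a phylogenetic tree T
leaf-labeled by sequences in S
additional sequences of length k labeling the
internal nodes of T
such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where $H(i,j)$ is the Hamming distance between the sequences labeling nodes i and j.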

49
Maximum parsimony (example)

Input: four sequences
ACT
ACA
GTT
GTA
Question: which of the three trees has the best
MP score?

50
Maximum Parsimony
[Figure: the three possible tree topologies on the leaves ACT, ACA, GTT, GTA.]
51
Maximum Parsimony
[Figure: the three topologies with internal node labels and per-edge change counts. MP scores: 7, 5, and 4. The tree with score 4, which groups ACT with ACA and GTT with GTA (internal labels ACA and GTA), is the optimal MP tree.]
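
The score of a fixed tree can be computed with Fitch's algorithm, which chooses internal labels that minimize the number of changes. The sketch below assumes each unrooted four-taxon tree is rooted on its internal edge (the rooting does not affect the Fitch score) and is written as a nested tuple of leaf sequences. The function names are hypothetical.

```python
def fitch_site(tree, site):
    """Return (candidate_states, changes) for one site on a rooted binary tree."""
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_states, left_cost = fitch_site(tree[0], site)
    right_states, right_cost = fitch_site(tree[1], site)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed at this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, n_sites):
    """Sum the minimum number of changes over all sites."""
    return sum(fitch_site(tree, site)[1] for site in range(n_sites))


trees = [
    (("ACT", "ACA"), ("GTT", "GTA")),
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "GTA"), ("ACA", "GTT")),
]
for tree in trees:
    print(tree, parsimony_score(tree, 3))
```

With optimally chosen internal labels the three topologies score 4, 5, and 6. A larger value for a topology, such as 7, corresponds to a specific labeling of the internal nodes that is not optimal for that topology. The best tree is the same either way.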
52
Maximum Parsimony computational complexity
53
Local search strategies
54
Local search for MP

Determine a candidate solution s
While s is not a local minimum
Find a neighbor s' of s such that MP(s') < MP(s)
If found set s = s'
Else return s and exit
Time complexity unknown: could take forever or
end quickly depending on starting tree and local
move
Need to specify how to construct starting tree
and local move
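
The loop above can be sketched on the four-sequence example. This illustration assumes a tiny search space: the three topologies on ACT, ACA, GTT, GTA, where each tree's neighbors are the other two (which is what NNI gives on four taxa). Scores are computed with Fitch's algorithm. All names are hypothetical.

```python
def fitch_score(tree, n_sites):
    """Parsimony score of a rooted binary nested-tuple tree (Fitch's algorithm)."""

    def visit(node, site):
        if isinstance(node, str):
            return {node[site]}, 0
        (ls, lc), (rs, rc) = visit(node[0], site), visit(node[1], site)
        if ls & rs:
            return ls & rs, lc + rc
        return ls | rs, lc + rc + 1

    return sum(visit(tree, site)[1] for site in range(n_sites))


def local_search(start, neighbors, score):
    """Move to an improving neighbor until none exists."""
    current = start
    while True:
        improving = [t for t in neighbors(current) if score(t) < score(current)]
        if not improving:
            return current  # local minimum
        current = improving[0]


TREES = [
    (("ACT", "GTA"), ("ACA", "GTT")),
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "ACA"), ("GTT", "GTA")),
]


def neighbors(tree):
    return [t for t in TREES if t != tree]


def score(tree):
    return fitch_score(tree, 3)


result = local_search(TREES[0], neighbors, score)
print(result, score(result))
```

Taking the first improving neighbor is one of several possible policies. Taking the best neighbor instead changes the path through tree space but not the stopping condition.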

55
Starting tree for MP

Random phylogeny: O(n) time
Greedy-MP

56
Greedy-MP
Greedy-MP takes O(n^2 k^2) time
57
Local moves for MP: NNI

For each edge we get two different topologies
Neighborhood size is 2n-6
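
A short sketch of where the count comes from, assuming an unrooted binary tree on n leaves. Such a tree has n - 3 internal edges, and an internal edge that separates subtrees {a, b} from {c, d} admits two alternative arrangements, giving 2(n - 3) = 2n - 6 neighbors. The function names are hypothetical.

```python
def nni_neighbors(a, b, c, d):
    """Given an internal edge separating (a, b) from (c, d), return the two
    alternative topologies obtained by swapping one subtree across the edge."""
    return [((a, c), (b, d)), ((a, d), (b, c))]


def nni_neighborhood_size(n_leaves):
    """Two neighbors for each of the n - 3 internal edges."""
    return 2 * (n_leaves - 3)


print(nni_neighbors("A", "B", "C", "D"))
print(nni_neighborhood_size(4), nni_neighborhood_size(10))
```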

58
Local moves for MP: SPR

Neighborhood size is quadratic in number of taxa
Computing the minimum number of SPR moves between
two rooted phylogenies is NP-hard

59
Local moves for MP: TBR

Neighborhood size is cubic in number of taxa
Computing the minimum number of TBR moves between
two unrooted phylogenies is NP-hard

60
Local optima are a problem
61
Iterated local search: escape local optima by
perturbation
[Figure, built up over slides 61 to 63: local search reaches a local optimum, a perturbation moves the solution away from it, and local search is run again from the output of the perturbation.]
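
The idea can be sketched on a toy problem. The example below assumes a one-dimensional landscape of made-up scores standing in for tree space, with neighbors being adjacent positions and the perturbation a random jump. None of this is specific to MP, and all names are hypothetical.

```python
import random


def local_search(x, score, neighbors):
    """Descend to a local minimum by always taking the best neighbor."""
    while True:
        best = min(neighbors(x), key=score)
        if score(best) >= score(x):
            return x
        x = best


def iterated_local_search(start, score, neighbors, perturb, rounds):
    """Alternate perturbation and local search, keeping the best solution seen."""
    best = local_search(start, score, neighbors)
    for _ in range(rounds):
        candidate = local_search(perturb(best), score, neighbors)
        if score(candidate) < score(best):
            best = candidate
    return best


# Toy landscape with local minima at positions 1 and 4 and the global minimum at 6.
values = [5, 3, 4, 6, 2, 7, 1, 8]


def score(i):
    return values[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(values)]


rng = random.Random(0)  # fixed seed so runs are repeatable
best = iterated_local_search(
    0, score, neighbors, lambda i: rng.randrange(len(values)), rounds=50
)
print(local_search(0, score, neighbors), best)
```

Plain local search from position 0 stops at the first local minimum it reaches, while the perturbation steps give later rounds a chance to start in the basin of a better one.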
64
ILS for MP