Class 9: Phylogenetic Trees
1
Class 9: Phylogenetic Trees
2
The Tree of Life
D'après Ernst Haeckel, 1891
3
Evolution
  • Many theories of evolution
  • Basic idea:
  • speciation events lead to the creation of different
    species
  • Speciation caused by physical separation into
    groups where different genetic variants become
    dominant
  • Any two species share a (possibly distant) common
    ancestor

4
Phylogenies
  • A phylogeny is a tree that describes the sequence
    of speciation events that led to the formation of
    a set of current-day species
  • Leaves - current-day species
  • Nodes - hypothetical most recent common ancestors
  • Edge lengths - time from one speciation to the
    next

[Figure: example phylogeny with leaves Aardvark, Bison, Chimp, Dog, Elephant]
5
Phylogenetic Tree
  • Topology: bifurcating
  • Leaves: 1..N
  • Internal nodes: N+1..2N-2

6
Example Primate evolution
[Figure: primate phylogeny with divergence times of roughly 20-25, 35-37, and 40-45 million years ago]
7
How to construct a Phylogeny?
  • Until the mid-1950s, phylogenies were constructed by
    experts based on their opinions (subjective
    criteria)
  • Since then, the focus has been on objective criteria for
    constructing phylogenetic trees
  • Thousands of articles in the last decades
  • Important for many aspects of biology
  • Classification (systematics)
  • Understanding biological mechanisms

8
Morphological vs. Molecular
  • Classical phylogenetic analysis: morphological
    features
  • number of legs, lengths of legs, etc.
  • Modern biological methods allow the use of molecular
    features
  • Gene sequences
  • Protein sequences
  • Analysis is based on homologous sequences (e.g.,
    globins) in different species

9
Dangers in Molecular Phylogenies
  • We have to remember that gene/protein sequences
    can be homologous for different reasons
  • Orthologs -- sequences diverged after a
    speciation event
  • Paralogs -- sequences diverged after a
    duplication event
  • Xenologs -- sequences diverged after a horizontal
    transfer (e.g., by virus)

10
Dangers of Paralogues
[Figure: a gene duplication followed by speciation events, giving rise to genes 1A, 2A, 3A and 1B, 2B, 3B]
11
Dangers of Paralogs
  • If we only consider 1A, 2B, and 3A...

[Same figure: gene duplication followed by speciation events, yielding genes 1A, 2A, 3A and 1B, 2B, 3B]
12
Types of Trees
  • A natural model to consider is that of rooted
    trees

[Figure: rooted tree; the root is the common ancestor of all current-day species]
13
Types of Trees
  • Depending on the model, data from current day
    species does not distinguish between different
    placements of the root

[Figure: two different placements of the root on the same tree]
14
Types of trees
  • An unrooted tree represents the same phylogeny
    without the root node

15
Positioning Roots in Unrooted Trees
  • We can estimate the position of the root by
    introducing an outgroup
  • a set of species that are definitely distant from
    all the species of interest

[Figure: unrooted tree over Aardvark, Bison, Chimp, Dog, and Elephant, with Falcon as the outgroup; the proposed root lies on the branch leading to Falcon]
16
Types of Data
  • Distance-based
  • Input is a matrix of distances between species
  • Can be the fraction of residues they disagree on,
    or minus the alignment score between them, etc.
  • Character-based
  • Examine each character (e.g., residue) separately

17
Simple Distance-Based Method
  • Input: distance matrix between species
  • Outline:
  • Cluster species together
  • Initially, clusters are singletons
  • At each iteration, combine the two closest clusters
    into a new one

18
UPGMA Clustering
  • Let Ci and Cj be clusters; define the distance
    between them as the average pairwise distance:
    d(Ci,Cj) = (1 / (|Ci| |Cj|)) Σ_{x in Ci, y in Cj} d(x,y)
  • When combining two clusters, Ci and Cj, to form a
    new cluster Ck, the distance to any other cluster Cl is
    d(Ck,Cl) = (|Ci| d(Ci,Cl) + |Cj| d(Cj,Cl)) / (|Ci| + |Cj|)
    (a sketch of the full loop follows below)
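As a concrete illustration, here is a minimal Python sketch of the UPGMA loop just described. The symmetric distance dictionary D, the frozenset keys, and the '+'-joined cluster names are illustrative choices, not part of the slides.

from itertools import combinations

def upgma(names, D):
    """UPGMA sketch. names: species list; D[(x, y)] = D[(y, x)] = distance."""
    clusters = {n: {n} for n in names}                     # cluster id -> member species
    dist = {frozenset(p): D[p] for p in combinations(names, 2)}
    merges = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)                     # two closest clusters
        a, b = sorted(pair)
        merges.append((a, b, dist[pair] / 2.0))            # new node at height d(Ci,Cj)/2
        na, nb = len(clusters[a]), len(clusters[b])
        new = a + "+" + b
        members = clusters.pop(a) | clusters.pop(b)
        # d(Ck, Cl) = (|Ci| d(Ci, Cl) + |Cj| d(Cj, Cl)) / (|Ci| + |Cj|)
        for other in clusters:
            dist[frozenset((new, other))] = (
                na * dist[frozenset((a, other))] +
                nb * dist[frozenset((b, other))]) / (na + nb)
        clusters[new] = members
        dist = {p: v for p, v in dist.items() if a not in p and b not in p}
    return merges

# Toy usage: merge order and heights for three species.
D = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
D.update({(y, x): v for (x, y), v in list(D.items())})     # make symmetric
print(upgma(["A", "B", "C"], D))                           # [('A', 'B', 1.0), ('A+B', 'C', 2.0)]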

19
Molecular Clock
  • UPGMA implicitly assumes that all distances
    measure time in the same way

[Figure: two example trees with branch lengths 1-4 illustrating the molecular-clock assumption]
20
Additivity
  • A weaker requirement is additivity
  • In real tree, distances between species are the
    sum of distances between intermediate nodes

[Figure: tree with leaves i, j, k and branch lengths a, b, c along the paths between them]
21
Consequences of Additivity
  • Suppose input distances are additive
  • For any three leaves i, j, k whose pairwise paths meet
    at an internal node m:
    d(i,j) + d(i,k) - d(j,k) = 2 d(i,m)
  • Thus d(i,m) = (d(i,j) + d(i,k) - d(j,k)) / 2

[Figure: leaves i, j, k joined at internal node m, with branch lengths a, b, c]
22
Neighbor Joining
  • Can we use this fact to construct trees?
  • Let D(i,j) = d(i,j) - (r_i + r_j)
  • where r_i = (1 / (|L| - 2)) Σ_{k in L} d(i,k)
  • Theorem: if D(i,j) is minimal (among all pairs of
    leaves), then i and j are neighbors in the tree

23
Neighbor Joining
  • Set L to contain all leaves
  • Iteration:
  • Choose i,j such that D(i,j) is minimal
  • Create a new node k, and set
    d(k,m) = (d(i,m) + d(j,m) - d(i,j)) / 2 for every other m,
    d(i,k) = (d(i,j) + r_i - r_j) / 2, d(j,k) = d(i,j) - d(i,k)
  • Remove i,j from L, and add k
  • Terminate: when |L| = 2, connect the two remaining
    nodes (a sketch of the full loop follows below)
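A minimal Python sketch of this neighbor-joining loop, under the same assumptions as the UPGMA sketch above (a symmetric distance dictionary; node names like "node0" are made up for illustration):

from itertools import combinations

def neighbor_joining(leaves, d):
    """NJ sketch. leaves: list of names; d[(x, y)] = d[(y, x)] = distance."""
    L = list(leaves)
    dist = {frozenset(p): d[p] for p in combinations(L, 2)}
    edges, counter = [], 0                                  # edges: (node, node, branch length)
    while len(L) > 2:
        r = {i: sum(dist[frozenset((i, m))] for m in L if m != i) / (len(L) - 2)
             for i in L}
        # Choose i, j minimizing D(i, j) = d(i, j) - r_i - r_j.
        i, j = min(combinations(L, 2),
                   key=lambda p: dist[frozenset(p)] - r[p[0]] - r[p[1]])
        k = "node%d" % counter; counter += 1
        dij = dist[frozenset((i, j))]
        edges.append((i, k, 0.5 * (dij + r[i] - r[j])))     # branch length i -> k
        edges.append((j, k, 0.5 * (dij + r[j] - r[i])))     # branch length j -> k
        for m in L:                                         # d(k,m) = (d(i,m) + d(j,m) - d(i,j)) / 2
            if m not in (i, j):
                dist[frozenset((k, m))] = 0.5 * (dist[frozenset((i, m))] +
                                                 dist[frozenset((j, m))] - dij)
        L = [m for m in L if m not in (i, j)] + [k]
    a, b = L                                                # terminate: connect the last two nodes
    edges.append((a, b, dist[frozenset((a, b))]))
    return edges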

24
Distance Based Methods
  • If we make strong assumptions on distances, we
    can reconstruct trees
  • In real life, distances are not additive
  • Sometimes they are close to additive

25
Character Based Methods
  • We start with a multiple alignment
  • Assumptions
  • All sequences are homologous
  • Each position in alignment is homologous
  • Positions evolve independently
  • No gaps
  • We seek to explain the evolution of each position
    in the alignment

26
Parsimony
  • Character-based method
  • A way to score trees (but not to build trees!)
  • Assumptions
  • Independence of characters (no interactions)
  • The best tree is the one in which the fewest changes take place

27
A Simple Example
  • What is the parsimony score of

A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
28
A Simple Example
A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
  • Each column is scored separately.
  • Let's look at the first column
  • The minimal tree has one evolutionary change

[Figure: tree labeled with the column-1 states (leaves C, C, C, T, T); a single T → C change explains the column]
29
Evaluating Parsimony Scores
  • How do we compute the parsimony score for a given
    tree?
  • Traditional parsimony:
  • Each base change has a cost of 1
  • Weighted parsimony:
  • Each change is weighted by its cost c(a,b)

30
Traditional Parsimony
  • Solved independently for each position
  • Linear time solution (a sketch follows below)
[Figure: example tree with candidate state sets at the nodes, e.g. {a}, {a}, {a,g}, {a}]
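The linear-time, unit-cost procedure is commonly known as Fitch's algorithm. Below is a minimal Python sketch for a single alignment column; the nested-tuple tree encoding and the four-leaf usage example (column 2 of the alignment used later in the lecture) are illustrative assumptions.

def fitch_score(tree, states):
    """Unit-cost parsimony for one column. tree: nested 2-tuples of leaf names;
    states: leaf name -> character observed at this column."""
    changes = 0

    def candidates(node):
        nonlocal changes
        if isinstance(node, str):                       # leaf: its observed character
            return {states[node]}
        left, right = (candidates(child) for child in node)
        if left & right:                                # children agree on some character
            return left & right
        changes += 1                                    # otherwise one change is forced here
        return left | right

    candidates(tree)
    return changes

# Column 2 of the 4-species example (G, C, T, A) on a hypothetical topology ((1,2),(3,4)): 3 changes.
print(fitch_score((("sp1", "sp2"), ("sp3", "sp4")),
                  {"sp1": "G", "sp2": "C", "sp3": "T", "sp4": "A"}))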
31
Evaluating Weighted Parsimony
  • Dynamic programming on the tree
  • S(i,a) = cost of the subtree rooted at i if i is labeled
    by a
  • Initialization:
  • For each leaf i, set S(i,a) = 0 if i is labeled by
    a, otherwise S(i,a) = ∞
  • Iteration:
  • if k is a node with children i and j, then
    S(k,a) = min_b (S(i,b) + c(a,b))
           + min_b (S(j,b) + c(a,b))
  • Termination:
  • cost of the tree is min_a S(r,a), where r is the root
    (a sketch follows below)
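A short Python sketch of this dynamic program (weighted parsimony, often attributed to Sankoff). The tree representation and the unit-cost matrix in the usage lines are illustrative assumptions; with unit costs it reproduces the traditional parsimony score.

import math

def weighted_parsimony(tree, states, alphabet, cost):
    """tree: nested 2-tuples of leaf names; states: leaf -> character;
    cost[(a, b)]: substitution cost c(a, b)."""

    def S(node):
        # dict a -> cost of the subtree rooted at `node` when it is labeled a
        if isinstance(node, str):                               # leaf initialization
            return {a: 0.0 if states[node] == a else math.inf for a in alphabet}
        Si, Sj = (S(child) for child in node)
        # S(k, a) = min_b (S(i, b) + c(a, b)) + min_b (S(j, b) + c(a, b))
        return {a: min(Si[b] + cost[(a, b)] for b in alphabet) +
                   min(Sj[b] + cost[(a, b)] for b in alphabet)
                for a in alphabet}

    return min(S(tree).values())                                # cost of the tree: min_a S(r, a)

alphabet = "ACGT"
unit = {(a, b): 0 if a == b else 1 for a in alphabet for b in alphabet}
print(weighted_parsimony((("sp1", "sp2"), ("sp3", "sp4")),
                         {"sp1": "G", "sp2": "C", "sp3": "T", "sp4": "A"},
                         alphabet, unit))                       # 3.0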

32
Cost of Evaluating Parsimony
  • The score is evaluated on each position independently.
    Scores are then summed over all positions.
  • If there are n nodes, m characters, and k
    possible values for each character, then
    complexity is O(nmk)
  • By keeping traceback information, we can
    reconstruct most parsimonious values at each
    ancestor node

33
Maximum Parsimony
Position:    1 2 3 4 5 6 7 8 9 10
Species 1:   A G G G T A A C T G
Species 2:   A C G A T T A T T A
Species 3:   A T A A T T G T C T
Species 4:   A A T G T T G T C G
How many possible unrooted trees?
34
Maximum Parsimony
How many possible unrooted trees?
Position:    1 2 3 4 5 6 7 8 9 10
Species 1:   A G G G T A A C T G
Species 2:   A C G A T T A T T A
Species 3:   A T A A T T G T C T
Species 4:   A A T G T T G T C G
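For n labeled species there are (2n - 5)!! = 3 · 5 · ... · (2n - 5) unrooted binary tree topologies; for the four species above that gives 3, which is what the next slides enumerate. A tiny Python check (the function name is just an illustrative choice):

def num_unrooted_trees(n):
    """(2n - 5)!! unrooted binary topologies on n labeled leaves (n >= 3)."""
    count = 1
    for k in range(3, 2 * n - 4, 2):        # 3, 5, ..., 2n - 5
        count *= k
    return count

print(num_unrooted_trees(4))    # 3
print(num_unrooted_trees(10))   # 2027025 -- the count grows super-exponentially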
35
Maximum Parsimony
How many substitutions?
[Figure: the three candidate topologies; MP marks the most parsimonious tree]
36
Maximum Parsimony
Position:   1 2 3 4 5 6 7 8 9 10
1:          A G G G T A A C T G
2:          A C G A T T A T T A
3:          A T A A T T G T C T
4:          A A T G T T G T C G
37
Maximum Parsimony
Position:   1 2 3 4 5 6 7 8 9 10
1:          A G G G T A A C T G
2:          A C G A T T A T T A
3:          A T A A T T G T C T
4:          A A T G T T G T C G
38
Maximum Parsimony
Column 2:  1 - G,  2 - C,  3 - T,  4 - A
39
Maximum Parsimony
Position:   1 2 3 4 5 6 7 8 9 10
1:          A G G G T A A C T G
2:          A C G A T T A T T A
3:          A T A A T T G T C T
4:          A A T G T T G T C G
40
Maximum Parsimony
Position:   1 2 3 4 5 6 7 8 9 10
1:          A G G G T A A C T G
2:          A C G A T T A T T A
3:          A T A A T T G T C T
4:          A A T G T T G T C G
41
Maximum Parsimony
Column 4:  1 - G,  2 - A,  3 - A,  4 - G
42
Maximum Parsimony
43
Maximum Parsimony
Position:   1 2 3 4 5 6 7 8 9 10
1:          A G G G T A A C T G
2:          A C G A T T A T T A
3:          A T A A T T G T C T
4:          A A T G T T G T C G
Changes:    0 3 2 2 0 1 1 1 1 3    Total: 14
44
Searching for Trees
45
Searching for the Optimal Tree
  • Exhaustive Search
  • Very intensive
  • Branch and Bound
  • A compromise
  • Heuristic
  • Fast
  • Usually starts with NJ

46
Phylogenetic Tree Assumptions
  • Topology: bifurcating
  • Leaves: 1..N
  • Internal nodes: N+1..2N-2
  • Lengths: t = {ti}, one for each branch
  • Phylogenetic tree = (Topology, Lengths) = (T, t)

47
Probabilistic Methods
  • The phylogenetic tree represents a generative
    probabilistic model (like HMMs) for the observed
    sequences.
  • Background probabilities: q(a)
  • Mutation probabilities: P(a|b,t)
  • Models for evolutionary mutations
  • Jukes-Cantor
  • Kimura 2-parameter model
  • Such models are used to derive the probabilities

48
Jukes-Cantor model
  • A model for mutation rates
  • Mutation occurs at a constant rate
  • Each nucleotide is equally likely to mutate into
    any other nucleotide with rate a.

49
Kimura 2-parameter model
  • Allows a different rate for transitions and
    transversions.

50
Mutation Probabilities
  • The rate matrix R is used to derive the mutation
    probability matrix S
  • S is obtained by integration. For Jukes-Cantor:
    S_aa(t) = 1/4 (1 + 3 e^{-4at}),  S_ab(t) = 1/4 (1 - e^{-4at}) for a ≠ b
  • q can be obtained by setting t to infinity
    (a sketch follows below)
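A small numerical sketch of these formulas. The rate value 0.01 (the rate a on the slide, called alpha here) is an arbitrary illustrative choice, and the check against the matrix exponential of the rate matrix uses numpy/scipy.

import numpy as np
from scipy.linalg import expm

ALPHA = 0.01                     # illustrative mutation rate, not from the slides
BASES = "ACGT"

def jc_prob(a, b, t, alpha=ALPHA):
    """Jukes-Cantor mutation probability P(b | a, t) from the closed form above."""
    if a == b:
        return 0.25 + 0.75 * np.exp(-4 * alpha * t)
    return 0.25 - 0.25 * np.exp(-4 * alpha * t)

def jc_matrix(t, alpha=ALPHA):
    """4x4 substitution matrix S(t); each row sums to 1."""
    return np.array([[jc_prob(a, b, t, alpha) for b in BASES] for a in BASES])

# S(t) should equal exp(R t) for the JC rate matrix (off-diagonal alpha, diagonal -3 alpha).
R = ALPHA * (np.ones((4, 4)) - 4.0 * np.eye(4))
print(np.allclose(jc_matrix(10.0), expm(R * 10.0)))   # True
print(jc_matrix(1e6).round(3))                        # all entries ~0.25: the stationary q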

51
Mutation Probabilities
  • Both models satisfy the following properties:
  • Lack of memory:
    P(b|a, t+s) = Σ_c P(c|a,t) P(b|c,s)
  • Reversibility:
    q(a) P(b|a,t) = q(b) P(a|b,t)
  • There exist stationary probabilities q(a) such that
    Σ_a q(a) P(b|a,t) = q(b)

52
Probabilistic Approach
  • Given P, q, the tree topology, and the branch lengths,
    we can compute the probability of a full labeling, e.g.
    P(x1,...,x5 | T, t) = q(x5) P(x4|x5,t4) P(x1|x4,t1) P(x2|x4,t2) P(x3|x5,t3)

[Figure: tree with leaves x1, x2, x3, internal node x4, root x5, and branch lengths t1, t2, t3, t4]
53
Computing the Tree Likelihood
  • We are interested in the probability of the observed
    data given the tree and branch lengths
  • Computed by summing over the internal nodes
  • This can be done efficiently using an upward
    (post-order) traversal of the tree.

54
Tree Likelihood Computation
  • Define P(Lk | a) = prob. of the leaves below node k,
    given that xk = a
  • Init: for leaves, P(Lk | a) = 1 if xk = a, 0 otherwise
  • Iteration: if k is a node with children i and j, then
    P(Lk | a) = (Σ_b P(b|a,ti) P(Li | b)) (Σ_c P(c|a,tj) P(Lj | c))
  • Termination: the likelihood is Σ_a q(a) P(Lroot | a)
    (a sketch follows below)
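A compact Python sketch of this upward pass (often called Felsenstein's pruning algorithm). The nested-tuple tree encoding is an illustrative choice; the usage line assumes the jc_matrix helper from the earlier sketch and the topology reconstructed above (x4 joins x1 and x2, the root x5 joins x4 and x3).

import numpy as np

BASES = "ACGT"

def column_likelihood(node, column, P, q):
    """Likelihood of one alignment column.
    node: leaf name, or ((child, branch_len), (child, branch_len));
    column: leaf name -> observed base; P(t): matrix with P(t)[a, b] = P(b | a, t);
    q: background/stationary distribution over bases."""

    def L(n):
        # Vector of P(L_n | a) over all states a.
        if isinstance(n, str):                               # leaf: indicator of the observed base
            return np.array([1.0 if b == column[n] else 0.0 for b in BASES])
        (ci, ti), (cj, tj) = n
        # P(L_k | a) = (sum_b P(b|a,ti) P(L_i|b)) * (sum_c P(c|a,tj) P(L_j|c))
        return (P(ti) @ L(ci)) * (P(tj) @ L(cj))

    return float(q @ L(node))                                # sum_a q(a) P(L_root | a)

# Example tree: leaves x1, x2, x3, internal node x4, root x5 (branch lengths made up).
tree = (((("x1", 0.1), ("x2", 0.2)), 0.4), ("x3", 0.3))
q = np.full(4, 0.25)
print(column_likelihood(tree, {"x1": "A", "x2": "A", "x3": "C"}, jc_matrix, q))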

55
Maximum Likelihood (ML)
  • Score each tree by its likelihood P(X1,...,Xm | T, t)
  • Assumption of independent positions:
    the likelihood is a product of per-position likelihoods
  • Branch lengths t can be optimized
  • Gradient ascent
  • EM
  • We look for the highest-scoring tree
  • Exhaustive search
  • Sampling methods (Metropolis)

56
Optimal Tree Search
  • Perform search over possible topologies

[Figure: search over the parameter space with local maxima; parametric optimization (EM) within each topology]
57
Computational Problem
  • Such procedures are computationally expensive!
  • Computing the optimal parameters for each candidate
    requires a non-trivial optimization step.
  • We spend non-negligible computation on each candidate,
    even if it is a low-scoring one.
  • In practice, such learning procedures can only
    consider small sets of candidate structures

58
Structural EM
  • Idea: use the parameters found for the current topology
    to help evaluate new topologies.
  • Outline:
  • Perform the search in (T, t) space.
  • Use EM-like iterations:
  • E-step: use the current solution to compute expected
    sufficient statistics for all topologies
  • M-step: select a new topology based on these
    expected sufficient statistics

59
The Complete-Data Scenario
  • Suppose we observe H, the ancestral sequences.

60
Expected Likelihood
  • Start with a tree (T0, t0)
  • Compute the expected complete-data log-likelihood
    Q(T, t) = E[log P(X, H | T, t) | X, T0, t0]
  • Formal justification:
  • Define Q(T, t) as above
  • Theorem: Q(T, t) ≥ Q(T0, t0) implies l(T, t) ≥ l(T0, t0)
  • Consequence: improvement in expected score ⇒
    improvement in likelihood

61
Proof
  • Theorem
  • Simple application of Jensen's inequality

62
Algorithm Outline
Unlike standard EM for trees, we compute all
possible pairwise statistics. Time: O(N²M)
63
Algorithm Outline
Pairwise weights
This stage also computes the branch length for
each pair (i,j)
64
Algorithm Outline
Max. Spanning Tree
Fast greedy procedure to find the tree.
By construction Q(T,t) ≥ Q(T0,t0); thus
l(T,t) ≥ l(T0,t0)
65
Algorithm Outline
Fix Tree
Remove redundant nodes; add nodes to break large
degrees. This operation preserves the likelihood:
l(T1,t) = l(T,t) ≥ l(T0,t0)
66
Assessing trees the Bootstrap
  • Often we don't trust that the tree found is the
    correct one.
  • Bootstrapping:
  • Sample (with replacement) n positions from the
    alignment
  • Learn the best tree for each sample
  • Look for tree features that are frequent across the
    sampled trees.
  • For some models this procedure approximates the
    tree posterior P(T | X1,...,Xn)
    (a sketch follows below)
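A minimal Python sketch of the resampling loop; build_tree stands for any tree-construction routine (NJ, parsimony search, ML) and is taken as a black box here.

import random

def bootstrap_trees(alignment, build_tree, n_samples=100):
    """alignment: dict species -> aligned sequence (all of equal length n);
    build_tree: function from such a dict to a tree; returns one tree per resample."""
    n = len(next(iter(alignment.values())))
    trees = []
    for _ in range(n_samples):
        cols = [random.randrange(n) for _ in range(n)]       # sample n positions with replacement
        resampled = {sp: "".join(seq[c] for c in cols) for sp, seq in alignment.items()}
        trees.append(build_tree(resampled))
    return trees

# Tree features (e.g., particular splits of the species) that recur in a large
# fraction of the returned trees are considered well supported.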

67
Algorithm Outline
Construct bifurcation T1
New Tree
Thm: l(T1,t1) ≥ l(T0,t0)
These steps are then repeated until convergence