Phylogeny II: Parsimony, ML, SEMPHY (slide transcript)
Provided by: NirFri (45 slides)
1
Phylogeny II: Parsimony, ML, SEMPHY
2
Phylogenetic Tree
  • Topology: bifurcating
  • Leaves: 1…N
  • Internal nodes: N+1…2N−2

3
Character Based Methods
  • We start with a multiple alignment
  • Assumptions
  • All sequences are homologous
  • Each position in alignment is homologous
  • Positions evolve independently
  • No gaps
  • We seek to explain the evolution of each position
    in the alignment

4
Parsimony
  • Character-based method
  • A way to score trees (but not to build trees!)
  • Assumptions
  • Independence of characters (no interactions)
  • The best tree is the one on which the fewest changes take place

5
A Simple Example
  • What is the parsimony score of the following sequences?

A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
6
A Simple Example
A: CAGGTA  B: CAGACA  C: CGGGTA  D: TGCACT  E: TGCGTA
  • Each column is scored separately.
  • Let's look at the first column.
  • The minimal tree has one evolutionary change.

(Tree figure: leaf states C, C, C, T, T; one internal branch carries the single change T → C.)
7
Evaluating Parsimony Scores
  • How do we compute the Parsimony score for a given
    tree?
  • Traditional Parsimony
  • Each base change has a cost of 1
  • Weighted Parsimony
  • Each change is weighted by the score c(a,b)

8
Traditional Parsimony
  • Solved independently for each position
  • Linear time solution

(Tree figure: internal nodes carry candidate state sets, e.g. {a} and {a,g}.)
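The linear-time procedure the slide refers to is Fitch's algorithm; here is a minimal sketch for a single alignment column. The dict-based tree encoding, node names, and the particular topology for the slide-5 species are illustrative assumptions, not taken from the deck.

```python
def fitch_score(tree, root, leaf_states):
    """Fitch's algorithm: minimum number of state changes for one
    character on a fixed binary tree, with unit cost per change.

    tree: dict mapping each internal node to its (left, right) children;
    leaf_states: dict mapping each leaf to its observed character."""
    sets, score = {}, 0

    def visit(node):
        nonlocal score
        if node in leaf_states:            # leaf: singleton state set
            sets[node] = {leaf_states[node]}
            return
        left, right = tree[node]
        visit(left)
        visit(right)
        common = sets[left] & sets[right]
        if common:                         # children agree: keep intersection
            sets[node] = common
        else:                              # disagreement: take union, count one change
            sets[node] = sets[left] | sets[right]
            score += 1

    visit(root)
    return score

# First column of the slide-5 example: A=C, B=C, C=C, D=T, E=T,
# on an assumed topology ((A,B),(C,(D,E))).
tree = {"root": ("u", "v"), "u": ("A", "B"), "v": ("C", "w"), "w": ("D", "E")}
states = {"A": "C", "B": "C", "C": "C", "D": "T", "E": "T"}
```

On this column the algorithm reports a single change, matching the one T → C mutation shown on the slide.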
9
Evaluating Weighted Parsimony
  • Dynamic programming on the tree
  • S(i,a) = cost of the tree rooted at i, if i is labeled by a
  • Initialization:
  • For each leaf i, set S(i,a) = 0 if i is labeled by a; otherwise S(i,a) = ∞
  • Iteration:
  • If k is a node with children i and j, then
    S(k,a) = min_b(S(i,b) + c(a,b)) + min_b(S(j,b) + c(a,b))
  • Termination:
  • The cost of the tree is min_a S(r,a), where r is the root
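The recurrence above (Sankoff's algorithm) can be coded directly. A sketch, with the same dict-based tree encoding assumed as before; `cost` is the slide's c(a,b):

```python
import math

def weighted_parsimony(tree, root, leaf_label, cost, alphabet):
    """Sankoff's dynamic program for weighted parsimony.
    S[k][a] = minimal cost of the subtree rooted at k, given label a."""
    S = {}

    def visit(k):
        if k in leaf_label:                # leaf: 0 for its label, inf otherwise
            S[k] = {a: 0.0 if a == leaf_label[k] else math.inf
                    for a in alphabet}
            return
        i, j = tree[k]
        visit(i)
        visit(j)
        # S(k,a) = min_b(S(i,b) + c(a,b)) + min_b(S(j,b) + c(a,b))
        S[k] = {a: min(S[i][b] + cost(a, b) for b in alphabet)
                 + min(S[j][b] + cost(a, b) for b in alphabet)
                for a in alphabet}

    visit(root)
    return min(S[root].values())           # min_a S(r, a)
```

With the unit cost c(a,b) = 1 for a ≠ b, this reduces to traditional parsimony and reproduces the score of 1 for the first column of the slide-5 example.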

10
Cost of Evaluating Parsimony
  • The score is evaluated on each position independently; the scores are then summed over all positions.
  • If there are n nodes, m characters, and k
    possible values for each character, then
    complexity is O(nmk)
  • By keeping traceback information, we can
    reconstruct most parsimonious values at each
    ancestor node

11
Maximum Parsimony
Position:   1 2 3 4 5 6 7 8 9 10
Species 1:  A G G G T A A C T G
Species 2:  A C G A T T A T T A
Species 3:  A T A A T T G T C T
Species 4:  A A T G T T G T C G
How many possible unrooted trees?
12
Maximum Parsimony
How many possible unrooted trees?
(Alignment repeated; see slide 11.)
13
Maximum Parsimony
How many substitutions?
14
Maximum Parsimony
(Alignment repeated; see slide 11.)
15
Maximum Parsimony
(Alignment repeated; see slide 11.)
16
Maximum Parsimony
Column 2: Species 1: G, Species 2: C, Species 3: T, Species 4: A
17
Maximum Parsimony
(Alignment repeated; see slide 11.)
18
Maximum Parsimony
(Alignment repeated; see slide 11.)
19
Maximum Parsimony
Column 4: Species 1: G, Species 2: A, Species 3: A, Species 4: G
20
Maximum Parsimony
21
Maximum Parsimony
(Alignment as in slide 11.)
Substitutions per position: 0 3 2 2 0 1 1 1 1 3; total = 14
22
Searching for Trees
23
Searching for the Optimal Tree
  • Exhaustive search
  • Very computationally intensive
  • Branch and bound
  • A compromise
  • Heuristic search
  • Fast
  • Usually starts from an NJ (neighbor-joining) tree

24
Phylogenetic Tree Assumptions
  • Topology: bifurcating
  • Leaves: 1…N
  • Internal nodes: N+1…2N−2
  • Lengths: t = {t_i}, one per branch
  • Phylogenetic tree = (Topology, Lengths) = (T, t)

25
Probabilistic Methods
  • The phylogenetic tree represents a generative
    probabilistic model (like HMMs) for the observed
    sequences.
  • Background probabilities: q(a)
  • Mutation probabilities: P(a→b, t)
  • Models for evolutionary mutations:
  • Jukes-Cantor
  • Kimura 2-parameter model
  • Such models are used to derive the probabilities

26
Jukes-Cantor model
  • A model for mutation rates
  • Mutation occurs at a constant rate
  • Each nucleotide is equally likely to mutate into any other nucleotide, with rate α.

27
Kimura 2-parameter model
  • Allows a different rate for transitions and
    transversions.
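The deck's own formulas for this slide were in a figure; the sketch below uses one common parameterization of the Kimura 2-parameter substitution probabilities (transition rate `alpha`, transversion rate `beta`), which is an assumption about notation, not taken from the slides:

```python
import math

# Transitions are A<->G and C<->T; every other substitution is a transversion.
PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def k2p_prob(a, b, t, alpha, beta):
    """Kimura 2-parameter probability of observing b after time t, given a."""
    e1 = math.exp(-4 * beta * t)
    e2 = math.exp(-2 * (alpha + beta) * t)
    if a == b:
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:   # transition
        return 0.25 + 0.25 * e1 - 0.5 * e2
    return 0.25 - 0.25 * e1                          # transversion
```

Setting alpha > beta makes transitions more probable than transversions, which is exactly the asymmetry this model adds over Jukes-Cantor.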

28
Mutation Probabilities
  • The rate matrix R is used to derive the mutation probability matrix S
  • S is obtained by integration. For Jukes-Cantor:
    P(a→a, t) = 1/4 + 3/4 · e^(−4αt)
    P(a→b, t) = 1/4 − 1/4 · e^(−4αt)  for b ≠ a
  • q(a) = 1/4 can be obtained by letting t → ∞
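The integrated Jukes-Cantor matrix can be sanity-checked in a few lines (the rate symbol α is written `alpha`):

```python
import math

def jc_prob(a, b, t, alpha):
    """Jukes-Cantor mutation probability P(a -> b, t):
    1/4 + 3/4 * exp(-4*alpha*t) if b == a, else 1/4 - 1/4 * exp(-4*alpha*t)."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e
```

Each row sums to 1 for every t, and as t → ∞ every entry tends to 1/4, recovering the stationary background q(a).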

29
Mutation Probabilities
  • Both models satisfy the following properties:
  • Lack of memory: P(a→c, t1 + t2) = Σ_b P(a→b, t1) · P(b→c, t2)
  • Reversibility: there exist stationary probabilities P_a such that P_a · P(a→b, t) = P_b · P(b→a, t)

30
Probabilistic Approach
  • Given P, q, the tree topology, and the branch lengths, we can compute the joint probability of an assignment to all nodes:

(Tree figure: root x5 with internal node x4; leaves x1, x2, x3; branch lengths t1…t4.)
P(x1, …, x5) = q(x5) · P(x5→x4, t4) · P(x4→x1, t1) · P(x4→x2, t2) · P(x5→x3, t3)
31
Computing the Tree Likelihood
  • We are interested in the probability of observed
    data given tree and branch lengths
  • Computed by summing over internal nodes
  • This can be done efficiently using an upward (leaves-to-root) traversal of the tree.

32
Tree Likelihood Computation
  • Define P(L_k | a) = probability of the leaves below node k, given that x_k = a
  • Initialization: for each leaf k, P(L_k | a) = 1 if x_k = a, and 0 otherwise
  • Iteration: if k is a node with children i and j, then
    P(L_k | a) = [Σ_b P(a→b, t_i) · P(L_i | b)] · [Σ_b P(a→b, t_j) · P(L_j | b)]
  • Termination: the likelihood is Σ_a q(a) · P(L_root | a)
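The recursion above is Felsenstein's pruning algorithm. A minimal sketch for one alignment column, with the same dict-based tree encoding assumed earlier and Jukes-Cantor plugged in as the substitution model:

```python
import math

def jc(a, b, t, alpha=1.0):
    """Jukes-Cantor substitution probability, used here as P(a -> b, t)."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def tree_likelihood(tree, root, leaf_state, branch, prob, q, alphabet="ACGT"):
    """Felsenstein's pruning algorithm for one alignment column.
    P[k][a] = probability of the leaves below node k, given x_k = a;
    branch[child] is the length of the branch above `child`."""
    P = {}

    def visit(k):
        if k in leaf_state:                # init: 1 iff the leaf shows a
            P[k] = {a: 1.0 if a == leaf_state[k] else 0.0 for a in alphabet}
            return
        i, j = tree[k]
        visit(i)
        visit(j)
        # P(L_k|a) = [sum_b P(a->b,t_i) P(L_i|b)] * [sum_b P(a->b,t_j) P(L_j|b)]
        P[k] = {a: sum(prob(a, b, branch[i]) * P[i][b] for b in alphabet)
                 * sum(prob(a, b, branch[j]) * P[j][b] for b in alphabet)
                for a in alphabet}

    visit(root)
    return sum(q[a] * P[root][a] for a in alphabet)   # sum_a q(a) P(L_r | a)
```

Because the result is a proper probability of the leaf data, summing it over all possible leaf assignments gives exactly 1, which makes a convenient correctness check.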

33
Maximum Likelihood (ML)
  • Score each tree by its likelihood P(X | T, t)
  • Assumption of independent positions
  • Branch lengths t can be optimized
  • Gradient ascent
  • EM
  • We look for the highest scoring tree
  • Exhaustive search
  • Sampling methods (Metropolis)

34
Optimal Tree Search
  • Perform search over possible topologies

(Figure: search over parameter space with parametric optimization (EM); the search can get stuck in local maxima.)
35
Computational Problem
  • Such procedures are computationally expensive!
  • Computing the optimal parameters for each candidate requires a non-trivial optimization step.
  • Non-negligible computation is spent on every candidate, even low-scoring ones.
  • In practice, such learning procedures can only
    consider small sets of candidate structures

36
Structural EM
  • Idea: use the parameters found for the current topology to help evaluate new topologies.
  • Outline:
  • Perform search in (T, t) space.
  • Use EM-like iterations
  • E-step use current solution to compute expected
    sufficient statistics for all topologies
  • M-step select new topology based on these
    expected sufficient statistics

37
The Complete-Data Scenario
  • Suppose we observe H, the ancestral sequences.

38
Expected Likelihood
  • Start with a tree (T0, t0)
  • Compute the expected log-likelihood Q(T, t)
  • Formal justification:
  • Define Q(T, t) = E[log P(X, H | T, t)], with the expectation taken over P(H | X, T0, t0)
  • Theorem: l(T, t) − l(T0, t0) ≥ Q(T, t) − Q(T0, t0)
  • Consequence: an improvement in expected score ⇒ an improvement in likelihood

39
Proof
  • Theorem: l(T, t) − l(T0, t0) ≥ Q(T, t) − Q(T0, t0)
  • A simple application of Jensen's inequality
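One standard way to spell the argument out, assuming the notation above (X observed, H ancestral, Q the expected complete-data log-likelihood under P(H | X, T_0, t_0)):

```latex
\begin{aligned}
l(T,t) - l(T_0,t_0)
  &= \log \frac{P(X \mid T,t)}{P(X \mid T_0,t_0)}
   = \log \sum_H P(H \mid X, T_0, t_0)\,
     \frac{P(X, H \mid T, t)}{P(X, H \mid T_0, t_0)} \\
  &\ge \sum_H P(H \mid X, T_0, t_0)\,
     \log \frac{P(X, H \mid T, t)}{P(X, H \mid T_0, t_0)}
   = Q(T,t) - Q(T_0,t_0),
\end{aligned}
```

where the inequality is Jensen's, applied to the concave logarithm.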

40
Algorithm Outline
Unlike standard EM for trees, we compute all possible pairwise statistics. Time: O(N²M).
41
Algorithm Outline
Pairwise weights
This stage also computes the branch length for
each pair (i,j)
42
Algorithm Outline
Max. Spanning Tree
A fast greedy procedure finds the tree. By construction, Q(T, t) ≥ Q(T0, t0); thus l(T, t) ≥ l(T0, t0).
43
Algorithm Outline
Fix Tree
Remove redundant nodes; add nodes to break up vertices of large degree. This operation preserves likelihood: l(T1, t) = l(T, t) ≥ l(T0, t0).
44
Assessing Trees: the Bootstrap
  • Often we don't trust that the tree found is the correct one.
  • Bootstrapping:
  • Sample (with replacement) n positions from the alignment
  • Learn the best tree for each sample
  • Look for tree features that are frequent across the sampled trees.
  • For some models this procedure approximates the tree posterior P(T | X1, …, Xn)
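The resampling step can be sketched as follows; the function name and seeding are illustrative:

```python
import random

def bootstrap_samples(alignment, n_replicates, seed=0):
    """Bootstrap an alignment: sample columns with replacement.
    alignment: list of equal-length strings (rows = species).
    Yields one resampled alignment per replicate."""
    rng = random.Random(seed)
    n_cols = len(alignment[0])
    for _ in range(n_replicates):
        # Draw n_cols column indices with replacement, then rebuild each row.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        yield ["".join(row[c] for c in cols) for row in alignment]
```

Each replicate is then handed to the tree-building procedure, and features such as clades are scored by how often they recur across replicates.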

45
Algorithm Outline
Construct bifurcation T1
New Tree
Theorem: l(T1, t1) ≥ l(T0, t0)
These steps are then repeated until convergence.