The distance between sequences, Part I. - PowerPoint PPT Presentation

1 / 111
About This Presentation
Title:

The distance between sequences, Part I.

Description:

and we want to know how similar they are. What is the basis for their ... Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. ... – PowerPoint PPT presentation

Number of Views:42
Avg rating:3.0/5.0
Slides: 112
Provided by: elizabe102
Category:

less

Transcript and Presenter's Notes

Title: The distance between sequences, Part I.


1
The distance between sequences, Part I.
  • Foundations
  • M. Elizabeth Corey, Ph.D.
  • UCSC Extension and UC Berkeley Extension

2
A simple start
  • Suppose we have two sequences, A and B
  • A a1, a2, , am
  • and
  • B b1, b2, , bn
  • and we want to know how similar they are.
  • What is the basis for their similarity?

3
The practical measure
  • What we usually do is obtain an alignment and
    then score using the sum of the pairwise scores

4
The nice metric
  • Wouldnt it be nice if we could simply say that
    the distance between sequences was the geometric
    sum of the distances between loci in the
    sequences?

5
Dynamic programming methods
  • Compuational method dating from the 40s,
    introduced to biology as Needleman-Wunsch in
    1969.
  • A numerical value is assigned to every cell in
    the array giving the similarity/dissimilarity of
    residues
  • The example shown
  • match 1
  • mismatch null (value 0)

6
Dynamic programming methods
  • GOAL For each cell find the maximum possible
    score for an alignment ending at that point
  • Searchs subrow and subcolumn, as shown, for
    highest score
  • Adds this to the score for the current row
  • Proceeds row by row through the array

7
Maximum bipartite matching
  • Series of solutions, starting with Dijskta,
    1950s
  • Find the set of matches that provide maximum
    flow.
  • Each match, ai to bj, has a capacity equal to its
    pairwise score.

A
B
s(a1, b1)
8
Alignments not really the problem
  • Optimal alignment falls into a set of problems
    with a long history in computer science.
  • The underlying metric for distances between
    sequences falls in the province of biology.

9
Beguiled by a matrix(PAM)
10
PAM
  • PAM starts with closely related sequences from 34
    superfamilies, grouped into 71 evolutionary
    trees.
  • PAM rests on a measure of amino acid
    mutability.
  • PAM attempts to capture a representative slice of
    evolutionary behavior.

11
PAM (From Dayhoff, Schwartz and Orcutt)
  • Obtain alignments for homologous proteins
  • Compute scoring matrix elements using
  • where aij is substitution frequency, mi is the
    mutability of i and G is a proportionality
    constant.
  • Extrapolate to longer evolutionary distances by
    using S(t0)n

12
Limitations of PAM matrices
  • PAM matrices are built from alignments with gt 85
    identity.
  • The entries in the initial scoring matrix, S(t1)
    arise from short time interval substitutions
    raising S(1) to a higher power may not capture
    some interesting substitutions with longer rate
    constants.

13
The Gutzwiller temptation
  • An abstract dynamic system (M, m, ft)
  • a measurable space, M, composed of the set of all
    sequences.
  • a measure m based on transition probabilities.
  • a group of automorphisms, ft, that map M onto
    itself, that preserves m and where the variable t
    runs through the integers.

14
Whats Bernoulli got to do with it?
  • A scheme with subshift
  • The measure on M is generated by the sets Ai,j,k
    a ai j, ai1 k whose measure is given by
    a matrix of transition probabilities pjk gt 0.
  • A future event a1 depends on a0 hence, memory.
  • Realized in the geodesic flow on a compact closed
    surface of constant negative curvature.

15
System behaviors
  • Ergodicity Transition probabilities are
    positive recurrent and aperiodic.
  • Mixing Inheritance and Mendelian exceptions
    lead to mixing.
  • K-systems Speciation events rigidly segregate
    M other segregations exist.

16
Our salad days
  • Jukes-Cantor
  • HGY
  • Kimura 2-Parameter
  • PAM
  • BLOSUM

17
General Stationary Time-reversible Model
. pCrCA pGrGA pTrTA
pArAC . pGrGC pTrTC
pArAG pCrCG . pTrTG
pArAT pCrCT pGrGT .
R
(Diagonal elements such that rows sum to zero)
Time reversibility pirij pjrji
18
General Stationary Time-reversible Model
  • P(t) eRt
  • Given rates, one can find transition
    probabilities, and vice-versa.

19
Jukes-Cantor
-3a a a a
a -3a a a
a a -3a a
a a a -3a
R
20
Kimura 2-Parameter
A C G T
. b a b
b . b a
a b . b
b a b .
R
a/b transition/transversion bias
21
HKY (Hasegawa, Kishino, Yano)
. mpC mkpG mpT
mpA . mpG mkpT
mkpA mpC . mpT
mpA mkpC mpG .
R
k transversion / transition
22
The BLOSUMn matrices
  • Start with multiple, ungapped alignments of
    proteins found using PROTOMAT.
  • Build clusters by placing together sequences with
    N identity.
  • Measure the score for each pair defined as
  • sij 2log2(pij/eij)
  • eij is expected probability of occurrence of the
    i,j pair
  • pij is observed probability of the i,j pair.

23
Limitations
  • Naive approach measure frequencies of aligned
    pairs and gaps in randomly selected confirmed
    alignments to get pij, use a random set of
    sequences to obtain eij.
  • Difficulty 1 it is difficult to get a good
    random sample of sequences or alignments
    databases are biased.
  • Difficulty 2 When sequences diverge from a
    common ancestor recently, pij is small and s is
    strongly negative. When sequences diverged long
    ago, pij tends to eij and s approaches zero.

24
A short compendium of distances and scores
  • Jukes-Cantor distance
  • Kimura distance
  • Dayhoff evolutionary distance
  • BLOSUM scores
  • Profile scores
  • Average scores

25
References
  • Gu, X. Li, W, 1996. A general additive distance
    with time-reversibility and rate variation among
    nucleotide sites. Proc. Natl. Acad. Sci. USA 93
    4671-4676.
  • Hasegawa, M., Kishino, H., Yano, T., 1985.
    Dating of the human-ape splitting by a molecular
    clock of mitochondrial DNA. J. Mol. Evol. 22
    160-174.
  • Sanderson, M. J. Shaffer, H. B., 2002.
    Troubleshooting molecular phylogenetic analyses.
    Annu. Rev. Ecol. Syst. 33 49-72.

26
The distance between sequences, Part II.
  • Careful Measures
  • M. Elizabeth Corey, Ph.D.
  • UCSC Extension and UC Berkeley Extension

27
Exceptions to Mendels Laws
  • The theory a chromosomal basis of inheritance
  • Some so-called exceptions
  • linkage and recombination
  • gene conversion
  • transposition and mobile genetic elements
  • A plethora of other mutations point mutations,
    reversions, deletions, frameshifts, duplications,
    inversions
  • Exceptions do not result in rejection of
    Mendelian genetics but a better understanding of
    the mechanisms underlying Mendelian inheritance.

28
Mutation frequencies(mutations/generation)
  • Frequency of point mutation 10-7 to 10-8
  • Reversion of point mutations 10-8. Sometimes
    called back mutation, sometimes called
    convergence.
  • Reversion of deletion mutations undetectably
    small.
  • Loss of function mutations result in grossly
    lower biological fitness. The rate of extinction
    due to gross loss of function is much great
    than the rate of reversion, so the line will die
    long before reversion can occur. In the
    aggregate, the record will show a
    pseudo-reversion.

29
Mutation frequencies
  • Deletions 10-6 dependent on chromosomal
    region. Caveat May be underestimated less
    detectable because they are often lethal .
  • Frameshifts 10-6 often repaired.
  • Duplications 10-3 - E. coli approximately 0.l
    of a culture for a given region of the
    chromosome.
  • Inversions hard to detect, not always mutations
  • Gene Conversions still unknown. Reparative.
  • mutators increase mutation frequencies by 100,
    they work on hot spots

30
Protein-based inheritance Prions
  • Proteins that change their shape in response to
    fluctuating environmental pressures, and then
    maintain that shape during mitosis and meiosis,
    constitute a form of cellular memory.
  • Various structural conformations are propagated
    outside of the traditional genetic framework.

31
Hsp90 and Sup35
  • A buffer for silent polymorphisms Hsp90
  • promotes the folding of signal tranducers
  • buffers the effects of many silent polymorphisms
  • may serve as a capacitor of evolutionary change
    storing and releasing genetic variation
  • Epigenetic inheritance The Sup35 prion

32
James Joyces List
  • Milk
  • Call mom!
  • Lettuce
  • Plumb the smithy of my soul for the unborn
    race-consciousness
  • Rent
  • --------------------------------------
  • Thriving in fluctuating environments by
    exploiting pre-existing genetic variations.

33
References
  •    Recent Publications on Conformational Change
    and Evolution        
  • Queitsch, C., Sangster, T.A. and Lindquist, S.
    2002. Hsp90 as a capacitor of phenotypic
    variation. Nature 417 618-624.         
  • Jensen, M.A., True, H.L., Chernoff, Y.O., and
    Lindquist, S., 2001 Molecular Population Genetics
    and Evolution of a Prion-like Protein in
    Saccaromyces cerevisiae. Genetics 159
    527-525.          
  • True, H.L., and Lindquist, S.L. 2000. A yeast
    prion provides an exploratory mechanism for
    genetic variation and phenotypic diversity.
    Nature 407 477-483.          
  • Rutherford, S.L. and Lindquist, S. 1998. Hsp90 as
    a capacitor for morphological evolution. Nature
    396 336-342.

34
Mutations and time
  • Take a series of sequences and figure out how
    different they are by counting up their
    substitutions.

A
5 substitutions
6 substitutions
B
3 substitutions
C
35
Mutations and time
  • What process takes us from A to B to C?

A
gene conversion
No direct ancestry
B
frameshift (repairable) 2 point accepted mutations
C
36
Counting mutations
  • Consider a counting process N(t), t e T where
    N(ti) N(tj) is the number of mutations in the
    time interval (ti,tj.

A
N(tAB) 1 GC
No direct ancestry but we can still count
substitutions N (tAC) 6 PM
B
N(tBC) 1 FS, 2 PM
C
37
Times on the edges of the tree
  • The interoccurrence times between mutations, t1
    0, , t1 t2 t1, ti ti ti-1, are
    exponential variables with mean 1/b such that
  • Pti gt h e-bh
  • and
  • Pti lt h 1 - e-bh
  • for hgt 0.

38
Edge times
  • Gene conversions bgc 1 gc/2,000 years
  • Frame shifts bfs 1 shift/5,000 years
  • Point mutations bpm 1 pm/10,000 years
  • Just an wild guess

A
1/bgc 2,000 yrs
1/bpm 60,000 yrs
B
2/bpm 1/bfs 25,000 yrs
C
39
Edge times
  • Population of A 105
  • Population of B 106
  • Population of C I dont care.

A
1/Na bgc 20 10-2 yrs
1/bpm 60 10-2 yrs
B
2/ Nb bpm 1/ Nb bfs 2510-3 yrs
C
40
Calculating divergence times
  • Doolittle, D.F., Fend, D-F, Tsang, S., Cho, G and
    Little, E. Determining Divergence Times of the
    Major Kingdoms of Living Organisms With a Protein
    Clock. Science, 271, pp. 470-477, 1996.

41
Calculating divergence times
  • Task Build a model for evolutionary time based
    on pairwise distances, dij, and the fossil record
  • Start with the vertebrate fossil record - the
    biogeochemistry gives reliable times.
  • Map the fossil-based phylogeny to the sequence
    based phylogeny and compare edge lengths.
  • Adjust the sequence-based time model to match the
    vertebrate fossil record.

42
Using the fossil record
43
Readjusting the clock
  • After sampling the vertebrate fossil-record and
    fitting the sequence data to the fossil-record ,
    they maintain the same clock.
  • Result Eukaryotes and Prokaryotes diverged
    about 2.5 billion years ago.

44
On fitting the fossil record to sequence data
  • Challenges unequal rates of change in different
    species due to
  • different reproductive cycles in different
    species
  • different base population sizes in different
    species.
  • Obtaining bacterial mutation rates using
    vertebrate mutation rates when we are looking at
    the evolution of populations how viable is it?

45
Population mutation
  • Suppose an average rate of mutation per site is
    about 10-7 (ignoring duplications).
  • Compare lengths of reproductive cycles
  • Prokaryotes (blue-green algae and bacteria) 20
    minutes to an hour per generation.
  • Humans US, average time to first child is 24.8
    years.
  • How many times does a bacteria reproduce in the
    time it takes a human being to reproduce?
  • 24 365 25 219,000
  • So if we are comparing bacterial mutation rates
    to human mutation rates and we looking at
    aggregate populations, we have to adjust by a
    factor of 106.

46
Population mutation
  • Size of the base population on planet earth
  • 5 1030 prokaryotes (UG, Bill Whitman) -
    including about a mole of bacteria
  • 3 109 humans
  • How many bacteria are there, propagating how
    fast, in comparison to humans? Worst case ratio?
  • Calculate using
  • base population rate of generation number of
    mutable genes
  • (1023 106103)
  • -------------------------- 1018
  • (109 1104)

47
One final issue The Success Question
  • When mutations succeed, they succeed within an
    ecological niche.
  • So when we ask When did a species arise?, it is
    not enough to ask about the likelihood of a
    certain kind of mutation, one must also ask
    what is the likelihood that that mutation arose
    in a niche that would support it?
  • So, dont forget about acceptance rates.

48
The FOXP2 point mutation
  • Enard et al, Molecular evolution of FOXP2, a
    gene involved in speech and language, Nature,
    Vol. 418, August 22, 2002

49
Silent/expressed mutations in FOXP2
  • Edge labels are Amino Acid / DNA substitutions

OHG
0/7
1/2
HG
0/2
2/2
Orangutan
Gorilla
Human
50
Selective sweeps
  • Measures for determining the existence of a
    sweep
  • Tajimas D from Genetics, 1989 (conservative)
  • Fay and Wus H from Hitchhiking under positive
    Darwinian selection, Genetics, 2000.
  • Also, Griffiths and Tavare estimate selection
    using linked SNP data

51
Population mutation rates
  • Pia 4Na bi - the population mutation rate for
    site i in species a, where Na is the effective
    population size of species a and bi is the
    mutation rate per generation at site i.

52
Tajimas D for FOXP2
  • -2.20
  • P 0.03
  • S/an 0.079
  • S is the sample size
  • an is the number of segregating sites.

53
Discovering different rate constants
54
Finding the time of appearance of the FOXP2
segregation
  • Sample current human population worldwide.
  • Generate trees with different times for the human
    sequence data.
  • Measure the likelihood of the different trees.

55
Multiple rates
  • The automorphism f, mapping M onto itself, used
    to be a simple shift operation.
  • Now, it incorporates several underlying
    processes, including
  • mutation of the bases (mutation rate)
  • expression of the mutations (expression rate)
  • stabilization of a conformational phenotype
    (stabilization rate)
  • success of the substitution (acceptance rate)

56
The distance between sequences, Part III.
  • Algorithms for phylogenies
  • M. Elizabeth Corey, Ph.D.
  • UCSC Extension and UC Berkeley Extension

57
Motivation
  • Phylogenies provide measures of similarity and
    can lay a foundation for scoring alignments.
  • Rate structures provide indicators for motifs.
  • Branch points allow us to identify and classify
    interesting bases.
  • If the branch points are in phenotypic trees, the
    mutating bases can be used as phenotypic
    identifiers.
  • If the branch points are in genotypic trees,
    mutating (nonsilent) bases can be used as genetic
    identifiers.

58
What goes into a phylogeny?
Pairwise Alignment
  • Distance measures (UPGMA, NN)
  • Site info (MLE and Parsimony)
  • Substitution scores
  • Equilibrium distributions for MLE

Multiple Alignment
Phylogenies
Transitional probability data
59
What do we get in return?
Pairwise Alignment
  • Guide trees
  • Rates and probabilities
  • Scoring matrices Scoring matrices

Multiple Alignment
Phylogenies
Transitional probability data
60
Part III Goals
  • Depict methods for finding guide trees for
    progressive multiple alignment.
  • Clarify the differences between MLE, Maximum
    Parsimony and Distance Methods and identify the
    optimization techniques appropriate for each.
  • Define a new approach for faster identification
    of near-optimal phylogenies.

61
Progressive multiple alignment
  • Choose a set of scores for sequence comparison
  • Alignment scores from Needleman-Wunsch,
    Smith-Waterman and variants.
  • Consensus word score from BLAST, PSI-BLAST and
    others
  • Substitution (scoring) matrices PAM, BLOSUM,
    Jukes-Cantor, etc.
  • Construct a reputable guide tree
  • Hierachical clustering (UPGMA, Neighbor-Joining,
    Fitch and Margoliash)
  • Maximum Parsimony (simple or weighted).
  • Maximum Likelihood Estimation (MLE)
  • Use the guide tree to produce an alignment

62
Tree evaluation - Parsimony
  • Given a semi-labeled tree, it is possible
    to determine the trees internal nodes (ancestral
    sequences) using a parsimony algorithm.
  • Evaluation function A summation of the scored
    mutations in the parsimonious tree.

63
Parsimony - Illustrated
ABC node 3, cost is 3
A(B or D)C node 1, cost is 1
A(B or C) (E or C) node 2, cost is 2
ACC
ABC
ADC
ABE
64
Example Simple Parsimony
  • Initialization
  • Set the cost, C 0. Set k 2n-1, where n is
    the number of sequences.
  • Recursion to compute node, Nk
  • if k is a leaf node, Nk sequence k
  • if k is not a leaf node
  • Compute Ni and Nj for the daughter nodes of Nk.
  • where the intersection of Ni and Nj is nonempty,
  • otherwise increment the cost by the number of
    nonmatching residues and set
  • Termination
  • Minimum cost of tree C.

65
Tree evaluation Distance methods
  • Given a set of alignment scores, but without
    assuming a tree topology, it is possible to
    determine a tree and its edge lengths using a
    distance method. This is sometimes called
    minimum evolution and includes the hierarchical
    clustering methods.
  • Evaluation function The sum of the edge lengths.

66
Hierarchical Clustering Illustrated UPGMA
t1 t2 ½d12
6
7
6
1
2
4
5
1
2
8
9
8
6
7
½d68
6
7
1
2
4
5
3
1
2
4
5
3
From Durbin et al, 2001
67
Algorithm UPGMA
  • Input N sequences and their relative distances,
    dij
  • Initialization
  • Assign each sequence to its own cluster, Ci.
  • Define a leaf of T for each sequence and place
    at height 0.
  • Iteration
  • Pick two clusters Ci, Cj such that dij is
    minimal.
  • Define a new cluster k by Ck Ci,Cj.
  • Define a new set of distances dkl between Ck
    and all current clusters.
  • Define a node k with daughter nodes i and j, and
    place it at height hik ½dik.
  • Add k to the set of current clusters and remove i
    and j.
  • Termination
  • Rooted When only two clusters i, j, remain, add
    the root at height ½dik.

68
Tree evaluation - MLE
  • Given a tree topology and sequences preassigned
    to each leaf, it is possible to determine a
    trees edge lengths using maximum likelihood
    estimation.
  • Evaluation function the likelihood of the tree.

69
Estimating Likelihood
  • Estimate branch lengths by viewing evolution as a
    random process
  • Requires a probability model of evolution as a
    function of time.
  • For DNA one can use Jukes-Cantor model (all
    nucleotides have same substitution rates), or
    Kimura model (different rates for transitions and
    transversions).
  • For proteins one can use Dayhoff, but in the
    probability form not the log-odds form.

70
Estimating Likelihood
  • S1, etc. are the bases or residues observed in
    the extant and ancestral taxa.
  • v lt where l is the substitution rate and t is
    absolute time
  • Pi,j(v) is the probability that the residue at
    node si becomes residue at node sj in time v
  • p0 is the prior probability of the bases or
    nucleotides at any position
  • The likelihood for this tree is
  • L p0P0,5(v5) P5,1(v1) P5,2(v2) P0,6(v6)
    P6,3(v3) P6,4(v4)

71
Example Likelihood
  • For each mutating site in a set of sequences
  • Initialization
  • Set k 2n-1, where n is the number of sequences.
  • Recursion
  • Compute P(Lka) for each symbol, a, in the
    alphabet as follows
  • If k is a leaf node
  • if xk,u a, then P(Lka) 1,
  • else Pk(a) 0.
  • if k is not a leaf node
  • Compute P(Lia), P(Lja) for all a at daughters
    i,j
  • Set P(Lka) Sb,cP(ba,ti) P(Lib) P(ca,tj)
    P(Ljc).
  • Termination
  • Likelihood for site u pa SaP(L2n-1a)
  • (pa is the equilibrium value of the probability
    distribution for a.)
  • Concluding step Combine the likelihoods for
    each site.

72
Maximizing Likelihood Estimation over edge times
  • Likehood estimation includes a step for computing
    the likelihood of some character a at node k
    given the subtree of k.
  • While we know that there is the possibility of
    substitutions leading to a, these depend on how
    long a time we have to make those substitutions
    and we do not know the edge times of the tree.
    We must explore a series of possible times in
    order to to maximize the likelihood.
  • A method that maximizes likelihoods over edge
    times is what is usually referred to as MLE.
  • Standard MLE procedures do not maximize
    likelihoods over all topologies of the tree.

73
Comparisons between MLE, Parsimony and Distance
Methods
Algorithm Requires semi-labeled tree Requires scored alignments Order Results Edge weights Results Internal tree nodes Resulting Tree Is Ultrametric
MLE Yes No La2n-1 2an2 Transitional probabilities subtree probability Yes
Parsimony Yes No 2an2 Mutation counts Ancestral sequences No
Distance Methods No Yes 2n2 Distance measures e.g. alignment scores UPGMA a cluster of sequences UPGMA - no NN - yes.
74
Exploring different topologies
  • Successive addition and rearrangement
  • Very common method (see Phylip programs
    including PROTPARS, DNAPARS, DNACOMP, DNAML,
    DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX and
    DOLLOP)
  • Sequences are taken in the order that they appear
    in the input file and successively added to a
    tree.
  • MCMC

75
Successive addition
  • Initialization
  • Place the set of sequences into L.
  • Create a tree,T, with one node the root.
  • Iteration for each sequence in L
  • Remove a sequence from L and add it as a leaf to
    T.
  • Apply a process of local rearrangement (in
    Felsensteins package, there are (n-1)(2n-3)
    arrangements.)
  • Score each locally arranged tree.
  • set T to equal the best scoring tree.
  • Termination Globally rearrange the tree by
    swapping subtrees, score each globally rearranged
    tree and accept the tree with the best score.

76
Markov Chain Monte Carlo
  • A Bayesian method for phylogenetic inference
  • Moderately new method rooted in molecular
    dynamics.
  • Topologies are randomly generated and scored so
    that a representative set of most likely tree
    topologies can be identified.
  • Mau, Newton and Larget (1998) apply MCMC to
    sample trees using Bayes theorem. The following
    explanation is based on their methodology - the
    mistakes are mine, the facts and foundations,
    theirs.

77
Introduction to the method
t is the set of all semi-labeled trees
78
Introduction to the method
Sampling the set of trees
Q2
Q1
Q3
79
Introduction to the method
Q01
Q12
Q23
to
t2
t1

a
c
b
d
a
b
c
d
a
b
c
d
A Chain of Accepted Samples
80
Introduction to the method
The partitioned space with representatives t1,,
t3
t1
t3
81
MCMC propaganda
  • allow exact inference provided certain convergent
    criteria are demonstrated.
  • are efficient and can handle many more taxa or
    sequences.
  • measure uncertainty during tree construction (no
    bootstrapping needed.)

82
Summary of the Algorithm
  1. Choose a starting tree
  2. Perturb the current trees topology and branch
    lengths to find a new tree.
  3. Measure the likelihood for the new tree.
  4. Compare the new tree to the last tree and decide
    whether or not to accept it into the chain.
  5. If youve got a sufficiently long chain, check
    the characteristics of your sample to see if
    there is convergence to a set of representative
    topologies. If so, stop. Otherwise, to to 2.

83
Subproblems to be discussed
  1. How do we represent the tree so it that is easy
    to operate on? Cophentic matrices.
  2. What is our perturbation operator?
  3. How do we build our sampling chain?
  4. When are we done sampling?

84
The Cophenetic Matrix
  • Some Notation
  • t a topology
  • n a node
  • a(n) the ancestor of a node
  • L a leaf node (the leaves are the current
    record)
  • I an internal node (the historical record)

85
Cophenetic Trees
  • Labeled history (t1, t2) provides an order on
    coalescent levels.

I0
level 0
t1
level 1
I1
t2
level 2
I2
L3
L1
L2
86
Example A Cophenetic Tree
t1 0.8 t20.3 t30.7 t40.5 t50.9 t61.5 total
4.7
These trees are described in terms of nodes
coalescing or merging backwards in time.
87
Example Cophenetic Matrix
  • The cophenetic matrix for the previous tree.
  • The tree representation (s, a) is
  • (5,7,4,1,2,6,3), (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)

Leaf 5 7 4 1 2 6 3
5 0 9.4 9.4 9.4 9.4 9.4 9.4
7 0 1.6 4.6 6.4 6.4 6.4
4 0 4.6 6.4 6.4 6.4
1 0 6.4 6.4 6.4
2 0 3.6 3.6
6 0 2.2
3 0
88
The Cophenetic Matrix
  • Theorem For any weighted binary tree with
    labeled leaf nodes, the tree topology and branch
    lengths can be uniquely determined using the
    within-tree distances between all pairs of leaf
    nodes. (Lapoint and Legendre, 1992)
  • Note, each permutation of the leaf labels
    generates a different n x n symmetric matrix of
    distance distances.

89
What is the perturbation operator?
  • Q is the proposal function and it has two stages
  • Q1 randomly selects a new leaf order
  • Q2 perturbs the values of the matrix
    supradiagonals.
  • The proposal mechanism is symmetrical
  • Q(tn,tn1) Q(tn1,tn)

90
Details on Q1 and Q2
  • Q1 samples one of the 2n-1 leaf orderings of the
    current tree model.
  • Q2 simultaneously and independently modifies the
    elements of the superdiagonal by creating a
    uniform distribution (ai ? d) where d is
    constant.
  • By applying both types of perturbations, Q1 and
    Q2, all the permutations of trees can be reached.

91
Illustration of Q2
92
Subproblems to be discussed
  1. How do we represent the tree so it that is easy
    to operate on? Use cophenetic matrices.
  2. What is our perturbation operator? Q.
  3. How do we build our sampling chain? Apply
    Metropolis-Hastings
  4. When are we done?

93
Acceptance with Metropolis-Hastings
  • Given a tree t, Metropolis-Hastings
  • 1. Applies Q to build a new tree, t.
  • 2. Always accepts the new tree when it is more
    likely than the old one and sometimes accepts it
    when it is less likely than the old one.

94
Acceptance with Metropolis-Hastings the
algorithm
  • If P(t) gt P(t)
  • accept t into the chain.
  • else
  • accept t into the chain with probability P(t)
    / P(t)

95
Acceptance with Metropolis-Hastings
  • The final step in evaluating the acceptance test
    is evaluating
  • P(t) / P(t)
  • This is easy P(t) is approximated using the LE
    of t.

96
Size of chain and convergence
  • How many trees do you have to propose before you
    begin to get a good enough sample? Mau et al
    1998 sample over about 2500 trees for Clarkia, a
    phylogeny with 9 leaves
  • How do you test that you are done? At the end of
    the run, we say that we have convergence if there
    is a small set of topologies with high relative
    frequency in the chain.
  • Whats the result? The topologies with the
    highest frequencies are the reported
    reconstructions.

97
Mixing
  • To obtain a confidence measure, the algorithm
    must be run more than once each run generates a
    chain of accepted trees.
  • When chains mix well when they come up with the
    same representative topologies, starting from
    different tree topologies.
  • If running a sufficient number of independent
    chains is computationally prohibitive, Suchard et
    al, 2002, provide a poor man's estimate of the
    uncertainty.

98
Example with binary data(from Mau, et al, 1998)
  • 9 species of genus Clarkia (California plants)
  • 120 restriction sites
  • Data translated into a 9 x 120 matrix of zeroes
    and ones, representing the absence or presence of
    a restriction site in the genome of each species.

99
Running the MCMC algorithm
  • Random starting trees
  • Chains of length 250,000 were subsampled at rate
    of 1/100 2500 trees
  • Each run took 20 minutes on a Sparc 10.
  • Convergence was inferred by reproducibility
    across runs with very different starting trees.

100
The most common topologies for Clarkia
A 1,2 B 3,4 C 5,6 D8,9
101
References
  • Smouse and Li (1989) introduced the Bayesian
    paradigm, but not the notation, to the phylogeny
    reconstruction problem.
  • Goldman (1993) used non-Bayesian Monte Carlo
    tests of significance to assess the adequacy of
    evolutionary models.
  • Griffiths and Tavare (1994) constructed Markov
    chains to compute likelihoods for ancestral
    inference.
  • Mau, Newton and Larget (1998) apply MCMC to
    sample trees using Bayes theorem.

102
Drill-down Rates
  • The way I use it, and I admit this is quirky,
    motif means the genetic profile for a functional
    structure. Using the following definitions
  • Let rG be the rate of mutation for a gene.
  • Let rE be the rate of expressed mutation for the
    protein G encodes.
  • Let rS be the rate of structural mutation for the
    protein G encodes.
  • Let rF be the rate of functional mutation for the
    protein G encodes.
  • rG gt rE gt rS gt rF
  • Note that the rate of neutral mutations is rN
    rG rF.
  • The true rate of mutation for a motif is rF,
    the observed rate of mutation for members of a
    motif in a genotypic tree is rG. If we want
    motif branchings, we eliminate all branchings in
    the phylogeny occuring with rates rN.

103
Drill-down Semi-labeled trees
  • Trees with a defined branching pattern and
    defined leaf labels but WITHOUT edge lengths or
    internal node labels.
  • In our terms, phylogenies with known branching
    patterns but without information about ancestors
    or mutation times.

nccbac
nacbac
nccnaa
ncbbbc
104
Drill-down Progressive Alignments
  • As you move up the tree, add to sum of
    characters in growing alignment

105
Progressive Alignments
  • Sum of characters in growing alignment can be
    represented in a table of values called
    afrequency matrix or a profile

106
Progressive Alignments
  • Alignments are frozen once they are made. Scores
    are then calculated between aligned positions
    tabulated in a frequency matrix, using a scoring
    table
  • Sij 2 GG 1AG

A G S T
A 4 1 3 10
G 3 2 6
S 2 14
T 8
107
Algorithm Neighbor-joining
  • Input N sequences and their relative distances,
    dij
  • Initialization
  • Define a leaf of T for each sequence
  • Iteration
  • Pick two nodes i,j such that dij (ri rj) is
    minimal.
  • Define a new set of distances, dkl between k
    and all current nodes.
  • Define a node k with daughter nodes i and j, and
    place it at edge length eik ½(dij ri rj)
    and ejk dij dik.
  • Add k to the set of current nodes and remove i
    and j.
  • Termination
  • Unrooted When only two nodes i, j, remain, add
    an edge of length dij/2.

108
Comparison Neighbor-joining and UPGMA
  • Minimization
  • UPGMA uses dij
  • Nearest-neighbor uses dij (ri rj) where
  • Distance measures
  • For distances between leaves i and j
  • dij is the same in both algorithms.
  • For distances between nodes k and m
  • UPGMA uses dik 1/CiCj Sp in Ci, q in Cjdpq
  • Nearest-neighbor uses dkm ½ (dim djm dij)
    where i and j are the daughters of k.
  • Edge lengths
  • UPGMA set the height of node k to ½ the distance
    between daughters i,j (½ dij).
  • Nearest neighbor sets the edge length between k
    and daughters j to ½(dij ri rj), daughter k
    to dij dik.

109
Drill-down MLE
a P(Lka) P(ca,ti) P(ba,tj)
P(ba,tj)
P(ca,ti)
ncbbcbc P(Ljb) 1
nccbabc P(Lic) 1
site u 3 simplest case
110
Drill-down Enumerating topologies
111
Drill-down Acceptance with Metropolis-Hastings
  • A proposed tree t is accepted with probability
  • However, by detailed balance you can step forward
    or backward with equal probability
  • Q(t,t) Q(t, t)
  • Hence our test becomes
Write a Comment
User Comments (0)
About PowerShow.com