Title: The distance between sequences, Part I.
1The distance between sequences, Part I.
- Foundations
- M. Elizabeth Corey, Ph.D.
- UCSC Extension and UC Berkeley Extension
2A simple start
- Suppose we have two sequences, A and B
- A a1, a2, , am
- and
- B b1, b2, , bn
- and we want to know how similar they are.
- What is the basis for their similarity?
3The practical measure
- What we usually do is obtain an alignment and
then score using the sum of the pairwise scores
4The nice metric
- Wouldnt it be nice if we could simply say that
the distance between sequences was the geometric
sum of the distances between loci in the
sequences?
5Dynamic programming methods
- Compuational method dating from the 40s,
introduced to biology as Needleman-Wunsch in
1969. - A numerical value is assigned to every cell in
the array giving the similarity/dissimilarity of
residues - The example shown
- match 1
- mismatch null (value 0)
6Dynamic programming methods
- GOAL For each cell find the maximum possible
score for an alignment ending at that point - Searchs subrow and subcolumn, as shown, for
highest score - Adds this to the score for the current row
- Proceeds row by row through the array
7Maximum bipartite matching
- Series of solutions, starting with Dijskta,
1950s - Find the set of matches that provide maximum
flow. - Each match, ai to bj, has a capacity equal to its
pairwise score.
A
B
s(a1, b1)
8Alignments not really the problem
- Optimal alignment falls into a set of problems
with a long history in computer science. - The underlying metric for distances between
sequences falls in the province of biology.
9Beguiled by a matrix(PAM)
10PAM
- PAM starts with closely related sequences from 34
superfamilies, grouped into 71 evolutionary
trees. - PAM rests on a measure of amino acid
mutability. - PAM attempts to capture a representative slice of
evolutionary behavior.
11PAM (From Dayhoff, Schwartz and Orcutt)
- Obtain alignments for homologous proteins
- Compute scoring matrix elements using
- where aij is substitution frequency, mi is the
mutability of i and G is a proportionality
constant. - Extrapolate to longer evolutionary distances by
using S(t0)n
12Limitations of PAM matrices
- PAM matrices are built from alignments with gt 85
identity. - The entries in the initial scoring matrix, S(t1)
arise from short time interval substitutions
raising S(1) to a higher power may not capture
some interesting substitutions with longer rate
constants.
13The Gutzwiller temptation
- An abstract dynamic system (M, m, ft)
- a measurable space, M, composed of the set of all
sequences. - a measure m based on transition probabilities.
- a group of automorphisms, ft, that map M onto
itself, that preserves m and where the variable t
runs through the integers.
14Whats Bernoulli got to do with it?
- A scheme with subshift
- The measure on M is generated by the sets Ai,j,k
a ai j, ai1 k whose measure is given by
a matrix of transition probabilities pjk gt 0. - A future event a1 depends on a0 hence, memory.
- Realized in the geodesic flow on a compact closed
surface of constant negative curvature.
15System behaviors
- Ergodicity Transition probabilities are
positive recurrent and aperiodic. - Mixing Inheritance and Mendelian exceptions
lead to mixing. - K-systems Speciation events rigidly segregate
M other segregations exist.
16Our salad days
- Jukes-Cantor
- HGY
- Kimura 2-Parameter
- PAM
- BLOSUM
17General Stationary Time-reversible Model
. pCrCA pGrGA pTrTA
pArAC . pGrGC pTrTC
pArAG pCrCG . pTrTG
pArAT pCrCT pGrGT .
R
(Diagonal elements such that rows sum to zero)
Time reversibility pirij pjrji
18General Stationary Time-reversible Model
- P(t) eRt
- Given rates, one can find transition
probabilities, and vice-versa.
19Jukes-Cantor
-3a a a a
a -3a a a
a a -3a a
a a a -3a
R
20Kimura 2-Parameter
A C G T
. b a b
b . b a
a b . b
b a b .
R
a/b transition/transversion bias
21HKY (Hasegawa, Kishino, Yano)
. mpC mkpG mpT
mpA . mpG mkpT
mkpA mpC . mpT
mpA mkpC mpG .
R
k transversion / transition
22The BLOSUMn matrices
- Start with multiple, ungapped alignments of
proteins found using PROTOMAT. - Build clusters by placing together sequences with
N identity. - Measure the score for each pair defined as
- sij 2log2(pij/eij)
- eij is expected probability of occurrence of the
i,j pair - pij is observed probability of the i,j pair.
23Limitations
- Naive approach measure frequencies of aligned
pairs and gaps in randomly selected confirmed
alignments to get pij, use a random set of
sequences to obtain eij. - Difficulty 1 it is difficult to get a good
random sample of sequences or alignments
databases are biased. - Difficulty 2 When sequences diverge from a
common ancestor recently, pij is small and s is
strongly negative. When sequences diverged long
ago, pij tends to eij and s approaches zero.
24A short compendium of distances and scores
- Jukes-Cantor distance
- Kimura distance
- Dayhoff evolutionary distance
- BLOSUM scores
- Profile scores
- Average scores
25References
- Gu, X. Li, W, 1996. A general additive distance
with time-reversibility and rate variation among
nucleotide sites. Proc. Natl. Acad. Sci. USA 93
4671-4676. - Hasegawa, M., Kishino, H., Yano, T., 1985.
Dating of the human-ape splitting by a molecular
clock of mitochondrial DNA. J. Mol. Evol. 22
160-174. - Sanderson, M. J. Shaffer, H. B., 2002.
Troubleshooting molecular phylogenetic analyses.
Annu. Rev. Ecol. Syst. 33 49-72.
26The distance between sequences, Part II.
- Careful Measures
- M. Elizabeth Corey, Ph.D.
- UCSC Extension and UC Berkeley Extension
27Exceptions to Mendels Laws
- The theory a chromosomal basis of inheritance
- Some so-called exceptions
- linkage and recombination
- gene conversion
- transposition and mobile genetic elements
- A plethora of other mutations point mutations,
reversions, deletions, frameshifts, duplications,
inversions - Exceptions do not result in rejection of
Mendelian genetics but a better understanding of
the mechanisms underlying Mendelian inheritance.
28Mutation frequencies(mutations/generation)
- Frequency of point mutation 10-7 to 10-8
- Reversion of point mutations 10-8. Sometimes
called back mutation, sometimes called
convergence. - Reversion of deletion mutations undetectably
small. - Loss of function mutations result in grossly
lower biological fitness. The rate of extinction
due to gross loss of function is much great
than the rate of reversion, so the line will die
long before reversion can occur. In the
aggregate, the record will show a
pseudo-reversion.
29Mutation frequencies
- Deletions 10-6 dependent on chromosomal
region. Caveat May be underestimated less
detectable because they are often lethal . - Frameshifts 10-6 often repaired.
- Duplications 10-3 - E. coli approximately 0.l
of a culture for a given region of the
chromosome. - Inversions hard to detect, not always mutations
- Gene Conversions still unknown. Reparative.
- mutators increase mutation frequencies by 100,
they work on hot spots
30Protein-based inheritance Prions
- Proteins that change their shape in response to
fluctuating environmental pressures, and then
maintain that shape during mitosis and meiosis,
constitute a form of cellular memory. - Various structural conformations are propagated
outside of the traditional genetic framework.
31Hsp90 and Sup35
- A buffer for silent polymorphisms Hsp90
- promotes the folding of signal tranducers
- buffers the effects of many silent polymorphisms
- may serve as a capacitor of evolutionary change
storing and releasing genetic variation - Epigenetic inheritance The Sup35 prion
32James Joyces List
- Milk
- Call mom!
- Lettuce
- Plumb the smithy of my soul for the unborn
race-consciousness - Rent
- --------------------------------------
- Thriving in fluctuating environments by
exploiting pre-existing genetic variations.
33References
- Â Â Recent Publications on Conformational Change
and Evolution        - Queitsch, C., Sangster, T.A. and Lindquist, S.
2002. Hsp90 as a capacitor of phenotypic
variation. Nature 417 618-624.         - Jensen, M.A., True, H.L., Chernoff, Y.O., and
Lindquist, S., 2001 Molecular Population Genetics
and Evolution of a Prion-like Protein in
Saccaromyces cerevisiae. Genetics 159
527-525.          - True, H.L., and Lindquist, S.L. 2000. A yeast
prion provides an exploratory mechanism for
genetic variation and phenotypic diversity.
Nature 407 477-483.          - Rutherford, S.L. and Lindquist, S. 1998. Hsp90 as
a capacitor for morphological evolution. Nature
396 336-342.
34Mutations and time
- Take a series of sequences and figure out how
different they are by counting up their
substitutions.
A
5 substitutions
6 substitutions
B
3 substitutions
C
35Mutations and time
- What process takes us from A to B to C?
A
gene conversion
No direct ancestry
B
frameshift (repairable) 2 point accepted mutations
C
36Counting mutations
- Consider a counting process N(t), t e T where
N(ti) N(tj) is the number of mutations in the
time interval (ti,tj.
A
N(tAB) 1 GC
No direct ancestry but we can still count
substitutions N (tAC) 6 PM
B
N(tBC) 1 FS, 2 PM
C
37Times on the edges of the tree
- The interoccurrence times between mutations, t1
0, , t1 t2 t1, ti ti ti-1, are
exponential variables with mean 1/b such that - Pti gt h e-bh
- and
- Pti lt h 1 - e-bh
- for hgt 0.
38Edge times
- Gene conversions bgc 1 gc/2,000 years
- Frame shifts bfs 1 shift/5,000 years
- Point mutations bpm 1 pm/10,000 years
- Just an wild guess
A
1/bgc 2,000 yrs
1/bpm 60,000 yrs
B
2/bpm 1/bfs 25,000 yrs
C
39Edge times
- Population of A 105
- Population of B 106
- Population of C I dont care.
A
1/Na bgc 20 10-2 yrs
1/bpm 60 10-2 yrs
B
2/ Nb bpm 1/ Nb bfs 2510-3 yrs
C
40Calculating divergence times
- Doolittle, D.F., Fend, D-F, Tsang, S., Cho, G and
Little, E. Determining Divergence Times of the
Major Kingdoms of Living Organisms With a Protein
Clock. Science, 271, pp. 470-477, 1996.
41Calculating divergence times
- Task Build a model for evolutionary time based
on pairwise distances, dij, and the fossil record - Start with the vertebrate fossil record - the
biogeochemistry gives reliable times. - Map the fossil-based phylogeny to the sequence
based phylogeny and compare edge lengths. - Adjust the sequence-based time model to match the
vertebrate fossil record.
42Using the fossil record
43Readjusting the clock
- After sampling the vertebrate fossil-record and
fitting the sequence data to the fossil-record ,
they maintain the same clock. - Result Eukaryotes and Prokaryotes diverged
about 2.5 billion years ago.
44On fitting the fossil record to sequence data
- Challenges unequal rates of change in different
species due to - different reproductive cycles in different
species - different base population sizes in different
species. - Obtaining bacterial mutation rates using
vertebrate mutation rates when we are looking at
the evolution of populations how viable is it?
45Population mutation
- Suppose an average rate of mutation per site is
about 10-7 (ignoring duplications). - Compare lengths of reproductive cycles
- Prokaryotes (blue-green algae and bacteria) 20
minutes to an hour per generation. - Humans US, average time to first child is 24.8
years. - How many times does a bacteria reproduce in the
time it takes a human being to reproduce? - 24 365 25 219,000
- So if we are comparing bacterial mutation rates
to human mutation rates and we looking at
aggregate populations, we have to adjust by a
factor of 106.
46Population mutation
- Size of the base population on planet earth
- 5 1030 prokaryotes (UG, Bill Whitman) -
including about a mole of bacteria - 3 109 humans
- How many bacteria are there, propagating how
fast, in comparison to humans? Worst case ratio?
- Calculate using
- base population rate of generation number of
mutable genes - (1023 106103)
- -------------------------- 1018
- (109 1104)
47One final issue The Success Question
- When mutations succeed, they succeed within an
ecological niche. - So when we ask When did a species arise?, it is
not enough to ask about the likelihood of a
certain kind of mutation, one must also ask
what is the likelihood that that mutation arose
in a niche that would support it? - So, dont forget about acceptance rates.
48The FOXP2 point mutation
- Enard et al, Molecular evolution of FOXP2, a
gene involved in speech and language, Nature,
Vol. 418, August 22, 2002
49Silent/expressed mutations in FOXP2
- Edge labels are Amino Acid / DNA substitutions
OHG
0/7
1/2
HG
0/2
2/2
Orangutan
Gorilla
Human
50Selective sweeps
- Measures for determining the existence of a
sweep - Tajimas D from Genetics, 1989 (conservative)
- Fay and Wus H from Hitchhiking under positive
Darwinian selection, Genetics, 2000. - Also, Griffiths and Tavare estimate selection
using linked SNP data
51Population mutation rates
- Pia 4Na bi - the population mutation rate for
site i in species a, where Na is the effective
population size of species a and bi is the
mutation rate per generation at site i.
52Tajimas D for FOXP2
- -2.20
- P 0.03
- S/an 0.079
- S is the sample size
- an is the number of segregating sites.
53Discovering different rate constants
54Finding the time of appearance of the FOXP2
segregation
- Sample current human population worldwide.
- Generate trees with different times for the human
sequence data. - Measure the likelihood of the different trees.
55Multiple rates
- The automorphism f, mapping M onto itself, used
to be a simple shift operation. - Now, it incorporates several underlying
processes, including - mutation of the bases (mutation rate)
- expression of the mutations (expression rate)
- stabilization of a conformational phenotype
(stabilization rate) - success of the substitution (acceptance rate)
56The distance between sequences, Part III.
- Algorithms for phylogenies
- M. Elizabeth Corey, Ph.D.
- UCSC Extension and UC Berkeley Extension
57Motivation
- Phylogenies provide measures of similarity and
can lay a foundation for scoring alignments. - Rate structures provide indicators for motifs.
- Branch points allow us to identify and classify
interesting bases. - If the branch points are in phenotypic trees, the
mutating bases can be used as phenotypic
identifiers. - If the branch points are in genotypic trees,
mutating (nonsilent) bases can be used as genetic
identifiers.
58What goes into a phylogeny?
Pairwise Alignment
- Distance measures (UPGMA, NN)
- Site info (MLE and Parsimony)
- Substitution scores
- Equilibrium distributions for MLE
Multiple Alignment
Phylogenies
Transitional probability data
59What do we get in return?
Pairwise Alignment
-
- Guide trees
- Rates and probabilities
- Scoring matrices Scoring matrices
Multiple Alignment
Phylogenies
Transitional probability data
60Part III Goals
- Depict methods for finding guide trees for
progressive multiple alignment. - Clarify the differences between MLE, Maximum
Parsimony and Distance Methods and identify the
optimization techniques appropriate for each. - Define a new approach for faster identification
of near-optimal phylogenies.
61Progressive multiple alignment
- Choose a set of scores for sequence comparison
- Alignment scores from Needleman-Wunsch,
Smith-Waterman and variants. - Consensus word score from BLAST, PSI-BLAST and
others - Substitution (scoring) matrices PAM, BLOSUM,
Jukes-Cantor, etc. - Construct a reputable guide tree
- Hierachical clustering (UPGMA, Neighbor-Joining,
Fitch and Margoliash) - Maximum Parsimony (simple or weighted).
- Maximum Likelihood Estimation (MLE)
- Use the guide tree to produce an alignment
62Tree evaluation - Parsimony
- Given a semi-labeled tree, it is possible
to determine the trees internal nodes (ancestral
sequences) using a parsimony algorithm. - Evaluation function A summation of the scored
mutations in the parsimonious tree.
63Parsimony - Illustrated
ABC node 3, cost is 3
A(B or D)C node 1, cost is 1
A(B or C) (E or C) node 2, cost is 2
ACC
ABC
ADC
ABE
64Example Simple Parsimony
- Initialization
- Set the cost, C 0. Set k 2n-1, where n is
the number of sequences. - Recursion to compute node, Nk
- if k is a leaf node, Nk sequence k
- if k is not a leaf node
- Compute Ni and Nj for the daughter nodes of Nk.
- where the intersection of Ni and Nj is nonempty,
- otherwise increment the cost by the number of
nonmatching residues and set - Termination
- Minimum cost of tree C.
65Tree evaluation Distance methods
- Given a set of alignment scores, but without
assuming a tree topology, it is possible to
determine a tree and its edge lengths using a
distance method. This is sometimes called
minimum evolution and includes the hierarchical
clustering methods. - Evaluation function The sum of the edge lengths.
66Hierarchical Clustering Illustrated UPGMA
t1 t2 ½d12
6
7
6
1
2
4
5
1
2
8
9
8
6
7
½d68
6
7
1
2
4
5
3
1
2
4
5
3
From Durbin et al, 2001
67Algorithm UPGMA
- Input N sequences and their relative distances,
dij - Initialization
- Assign each sequence to its own cluster, Ci.
- Define a leaf of T for each sequence and place
at height 0. - Iteration
- Pick two clusters Ci, Cj such that dij is
minimal. - Define a new cluster k by Ck Ci,Cj.
- Define a new set of distances dkl between Ck
and all current clusters. - Define a node k with daughter nodes i and j, and
place it at height hik ½dik. - Add k to the set of current clusters and remove i
and j. - Termination
- Rooted When only two clusters i, j, remain, add
the root at height ½dik.
68Tree evaluation - MLE
- Given a tree topology and sequences preassigned
to each leaf, it is possible to determine a
trees edge lengths using maximum likelihood
estimation. - Evaluation function the likelihood of the tree.
69Estimating Likelihood
- Estimate branch lengths by viewing evolution as a
random process - Requires a probability model of evolution as a
function of time. - For DNA one can use Jukes-Cantor model (all
nucleotides have same substitution rates), or
Kimura model (different rates for transitions and
transversions). - For proteins one can use Dayhoff, but in the
probability form not the log-odds form.
70Estimating Likelihood
- S1, etc. are the bases or residues observed in
the extant and ancestral taxa. - v lt where l is the substitution rate and t is
absolute time - Pi,j(v) is the probability that the residue at
node si becomes residue at node sj in time v - p0 is the prior probability of the bases or
nucleotides at any position - The likelihood for this tree is
- L p0P0,5(v5) P5,1(v1) P5,2(v2) P0,6(v6)
P6,3(v3) P6,4(v4)
71Example Likelihood
- For each mutating site in a set of sequences
- Initialization
- Set k 2n-1, where n is the number of sequences.
- Recursion
- Compute P(Lka) for each symbol, a, in the
alphabet as follows - If k is a leaf node
- if xk,u a, then P(Lka) 1,
- else Pk(a) 0.
- if k is not a leaf node
- Compute P(Lia), P(Lja) for all a at daughters
i,j - Set P(Lka) Sb,cP(ba,ti) P(Lib) P(ca,tj)
P(Ljc). - Termination
- Likelihood for site u pa SaP(L2n-1a)
- (pa is the equilibrium value of the probability
distribution for a.) - Concluding step Combine the likelihoods for
each site.
72Maximizing Likelihood Estimation over edge times
- Likehood estimation includes a step for computing
the likelihood of some character a at node k
given the subtree of k. - While we know that there is the possibility of
substitutions leading to a, these depend on how
long a time we have to make those substitutions
and we do not know the edge times of the tree.
We must explore a series of possible times in
order to to maximize the likelihood. - A method that maximizes likelihoods over edge
times is what is usually referred to as MLE. - Standard MLE procedures do not maximize
likelihoods over all topologies of the tree.
73Comparisons between MLE, Parsimony and Distance
Methods
Algorithm Requires semi-labeled tree Requires scored alignments Order Results Edge weights Results Internal tree nodes Resulting Tree Is Ultrametric
MLE Yes No La2n-1 2an2 Transitional probabilities subtree probability Yes
Parsimony Yes No 2an2 Mutation counts Ancestral sequences No
Distance Methods No Yes 2n2 Distance measures e.g. alignment scores UPGMA a cluster of sequences UPGMA - no NN - yes.
74Exploring different topologies
- Successive addition and rearrangement
- Very common method (see Phylip programs
including PROTPARS, DNAPARS, DNACOMP, DNAML,
DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX and
DOLLOP) - Sequences are taken in the order that they appear
in the input file and successively added to a
tree. - MCMC
75Successive addition
- Initialization
- Place the set of sequences into L.
- Create a tree,T, with one node the root.
- Iteration for each sequence in L
- Remove a sequence from L and add it as a leaf to
T. - Apply a process of local rearrangement (in
Felsensteins package, there are (n-1)(2n-3)
arrangements.) - Score each locally arranged tree.
- set T to equal the best scoring tree.
- Termination Globally rearrange the tree by
swapping subtrees, score each globally rearranged
tree and accept the tree with the best score.
76Markov Chain Monte Carlo
- A Bayesian method for phylogenetic inference
- Moderately new method rooted in molecular
dynamics. - Topologies are randomly generated and scored so
that a representative set of most likely tree
topologies can be identified. - Mau, Newton and Larget (1998) apply MCMC to
sample trees using Bayes theorem. The following
explanation is based on their methodology - the
mistakes are mine, the facts and foundations,
theirs.
77Introduction to the method
t is the set of all semi-labeled trees
78Introduction to the method
Sampling the set of trees
Q2
Q1
Q3
79Introduction to the method
Q01
Q12
Q23
to
t2
t1
a
c
b
d
a
b
c
d
a
b
c
d
A Chain of Accepted Samples
80Introduction to the method
The partitioned space with representatives t1,,
t3
t1
t3
81MCMC propaganda
- allow exact inference provided certain convergent
criteria are demonstrated. - are efficient and can handle many more taxa or
sequences. - measure uncertainty during tree construction (no
bootstrapping needed.)
82Summary of the Algorithm
- Choose a starting tree
- Perturb the current trees topology and branch
lengths to find a new tree. - Measure the likelihood for the new tree.
- Compare the new tree to the last tree and decide
whether or not to accept it into the chain. - If youve got a sufficiently long chain, check
the characteristics of your sample to see if
there is convergence to a set of representative
topologies. If so, stop. Otherwise, to to 2.
83Subproblems to be discussed
- How do we represent the tree so it that is easy
to operate on? Cophentic matrices. - What is our perturbation operator?
- How do we build our sampling chain?
- When are we done sampling?
84The Cophenetic Matrix
- Some Notation
- t a topology
- n a node
- a(n) the ancestor of a node
- L a leaf node (the leaves are the current
record) - I an internal node (the historical record)
85Cophenetic Trees
- Labeled history (t1, t2) provides an order on
coalescent levels.
I0
level 0
t1
level 1
I1
t2
level 2
I2
L3
L1
L2
86Example A Cophenetic Tree
t1 0.8 t20.3 t30.7 t40.5 t50.9 t61.5 total
4.7
These trees are described in terms of nodes
coalescing or merging backwards in time.
87Example Cophenetic Matrix
- The cophenetic matrix for the previous tree.
- The tree representation (s, a) is
- (5,7,4,1,2,6,3), (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)
Leaf 5 7 4 1 2 6 3
5 0 9.4 9.4 9.4 9.4 9.4 9.4
7 0 1.6 4.6 6.4 6.4 6.4
4 0 4.6 6.4 6.4 6.4
1 0 6.4 6.4 6.4
2 0 3.6 3.6
6 0 2.2
3 0
88The Cophenetic Matrix
- Theorem For any weighted binary tree with
labeled leaf nodes, the tree topology and branch
lengths can be uniquely determined using the
within-tree distances between all pairs of leaf
nodes. (Lapoint and Legendre, 1992) - Note, each permutation of the leaf labels
generates a different n x n symmetric matrix of
distance distances.
89What is the perturbation operator?
- Q is the proposal function and it has two stages
- Q1 randomly selects a new leaf order
- Q2 perturbs the values of the matrix
supradiagonals. - The proposal mechanism is symmetrical
- Q(tn,tn1) Q(tn1,tn)
90Details on Q1 and Q2
- Q1 samples one of the 2n-1 leaf orderings of the
current tree model. - Q2 simultaneously and independently modifies the
elements of the superdiagonal by creating a
uniform distribution (ai ? d) where d is
constant. - By applying both types of perturbations, Q1 and
Q2, all the permutations of trees can be reached.
91Illustration of Q2
92Subproblems to be discussed
- How do we represent the tree so it that is easy
to operate on? Use cophenetic matrices. - What is our perturbation operator? Q.
- How do we build our sampling chain? Apply
Metropolis-Hastings - When are we done?
93Acceptance with Metropolis-Hastings
- Given a tree t, Metropolis-Hastings
- 1. Applies Q to build a new tree, t.
- 2. Always accepts the new tree when it is more
likely than the old one and sometimes accepts it
when it is less likely than the old one.
94Acceptance with Metropolis-Hastings the
algorithm
- If P(t) gt P(t)
- accept t into the chain.
- else
- accept t into the chain with probability P(t)
/ P(t)
95Acceptance with Metropolis-Hastings
- The final step in evaluating the acceptance test
is evaluating - P(t) / P(t)
- This is easy P(t) is approximated using the LE
of t.
96Size of chain and convergence
- How many trees do you have to propose before you
begin to get a good enough sample? Mau et al
1998 sample over about 2500 trees for Clarkia, a
phylogeny with 9 leaves - How do you test that you are done? At the end of
the run, we say that we have convergence if there
is a small set of topologies with high relative
frequency in the chain. - Whats the result? The topologies with the
highest frequencies are the reported
reconstructions.
97Mixing
- To obtain a confidence measure, the algorithm
must be run more than once each run generates a
chain of accepted trees. - When chains mix well when they come up with the
same representative topologies, starting from
different tree topologies. - If running a sufficient number of independent
chains is computationally prohibitive, Suchard et
al, 2002, provide a poor man's estimate of the
uncertainty.
98Example with binary data(from Mau, et al, 1998)
- 9 species of genus Clarkia (California plants)
- 120 restriction sites
- Data translated into a 9 x 120 matrix of zeroes
and ones, representing the absence or presence of
a restriction site in the genome of each species.
99Running the MCMC algorithm
- Random starting trees
- Chains of length 250,000 were subsampled at rate
of 1/100 2500 trees - Each run took 20 minutes on a Sparc 10.
- Convergence was inferred by reproducibility
across runs with very different starting trees.
100The most common topologies for Clarkia
A 1,2 B 3,4 C 5,6 D8,9
101References
- Smouse and Li (1989) introduced the Bayesian
paradigm, but not the notation, to the phylogeny
reconstruction problem. - Goldman (1993) used non-Bayesian Monte Carlo
tests of significance to assess the adequacy of
evolutionary models. - Griffiths and Tavare (1994) constructed Markov
chains to compute likelihoods for ancestral
inference. - Mau, Newton and Larget (1998) apply MCMC to
sample trees using Bayes theorem.
102Drill-down Rates
- The way I use it, and I admit this is quirky,
motif means the genetic profile for a functional
structure. Using the following definitions - Let rG be the rate of mutation for a gene.
- Let rE be the rate of expressed mutation for the
protein G encodes. - Let rS be the rate of structural mutation for the
protein G encodes. - Let rF be the rate of functional mutation for the
protein G encodes. - rG gt rE gt rS gt rF
- Note that the rate of neutral mutations is rN
rG rF. - The true rate of mutation for a motif is rF,
the observed rate of mutation for members of a
motif in a genotypic tree is rG. If we want
motif branchings, we eliminate all branchings in
the phylogeny occuring with rates rN.
103Drill-down Semi-labeled trees
- Trees with a defined branching pattern and
defined leaf labels but WITHOUT edge lengths or
internal node labels. - In our terms, phylogenies with known branching
patterns but without information about ancestors
or mutation times.
nccbac
nacbac
nccnaa
ncbbbc
104Drill-down Progressive Alignments
- As you move up the tree, add to sum of
characters in growing alignment
105Progressive Alignments
- Sum of characters in growing alignment can be
represented in a table of values called
afrequency matrix or a profile
106Progressive Alignments
- Alignments are frozen once they are made. Scores
are then calculated between aligned positions
tabulated in a frequency matrix, using a scoring
table - Sij 2 GG 1AG
A G S T
A 4 1 3 10
G 3 2 6
S 2 14
T 8
107Algorithm Neighbor-joining
- Input N sequences and their relative distances,
dij - Initialization
- Define a leaf of T for each sequence
- Iteration
- Pick two nodes i,j such that dij (ri rj) is
minimal. - Define a new set of distances, dkl between k
and all current nodes. - Define a node k with daughter nodes i and j, and
place it at edge length eik ½(dij ri rj)
and ejk dij dik. - Add k to the set of current nodes and remove i
and j. - Termination
- Unrooted When only two nodes i, j, remain, add
an edge of length dij/2.
108Comparison Neighbor-joining and UPGMA
- Minimization
- UPGMA uses dij
- Nearest-neighbor uses dij (ri rj) where
- Distance measures
- For distances between leaves i and j
- dij is the same in both algorithms.
- For distances between nodes k and m
- UPGMA uses dik 1/CiCj Sp in Ci, q in Cjdpq
- Nearest-neighbor uses dkm ½ (dim djm dij)
where i and j are the daughters of k. - Edge lengths
- UPGMA set the height of node k to ½ the distance
between daughters i,j (½ dij). - Nearest neighbor sets the edge length between k
and daughters j to ½(dij ri rj), daughter k
to dij dik.
109Drill-down MLE
a P(Lka) P(ca,ti) P(ba,tj)
P(ba,tj)
P(ca,ti)
ncbbcbc P(Ljb) 1
nccbabc P(Lic) 1
site u 3 simplest case
110Drill-down Enumerating topologies
111Drill-down Acceptance with Metropolis-Hastings
- A proposed tree t is accepted with probability
-
- However, by detailed balance you can step forward
or backward with equal probability - Q(t,t) Q(t, t)
- Hence our test becomes