The distance between sequences, Part I.

About This Presentation

Title:

The distance between sequences, Part I.

Description:

and we want to know how similar they are. What is the basis for their ... Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. ... – PowerPoint PPT presentation

Number of Views:42

Avg rating:3.0/5.0

Slides: 112

Provided by: elizabe102

Category:

more less

Transcript and Presenter's Notes

Title: The distance between sequences, Part I.

1
The distance between sequences, Part I.

Foundations
M. Elizabeth Corey, Ph.D.
UCSC Extension and UC Berkeley Extension

2
A simple start

Suppose we have two sequences, A and B
A a1, a2, , am
and
B b1, b2, , bn
and we want to know how similar they are.
What is the basis for their similarity?

3
The practical measure

What we usually do is obtain an alignment and
then score using the sum of the pairwise scores

4
The nice metric

Wouldnt it be nice if we could simply say that
the distance between sequences was the geometric
sum of the distances between loci in the
sequences?

5
Dynamic programming methods

Compuational method dating from the 40s,
introduced to biology as Needleman-Wunsch in
1969.
A numerical value is assigned to every cell in
the array giving the similarity/dissimilarity of
residues
The example shown
match 1
mismatch null (value 0)

6
Dynamic programming methods

GOAL For each cell find the maximum possible
score for an alignment ending at that point
Searchs subrow and subcolumn, as shown, for
highest score
Adds this to the score for the current row
Proceeds row by row through the array

7
Maximum bipartite matching

Series of solutions, starting with Dijskta,
1950s
Find the set of matches that provide maximum
flow.
Each match, ai to bj, has a capacity equal to its
pairwise score.

A
B
s(a1, b1)
8
Alignments not really the problem

Optimal alignment falls into a set of problems
with a long history in computer science.
The underlying metric for distances between
sequences falls in the province of biology.

9
Beguiled by a matrix(PAM)
10
PAM

PAM starts with closely related sequences from 34
superfamilies, grouped into 71 evolutionary
trees.
PAM rests on a measure of amino acid
mutability.
PAM attempts to capture a representative slice of
evolutionary behavior.

11
PAM (From Dayhoff, Schwartz and Orcutt)

Obtain alignments for homologous proteins
Compute scoring matrix elements using
where aij is substitution frequency, mi is the
mutability of i and G is a proportionality
constant.
Extrapolate to longer evolutionary distances by
using S(t0)n

12
Limitations of PAM matrices

PAM matrices are built from alignments with gt 85
identity.
The entries in the initial scoring matrix, S(t1)
arise from short time interval substitutions
raising S(1) to a higher power may not capture
some interesting substitutions with longer rate
constants.

13
The Gutzwiller temptation

An abstract dynamic system (M, m, ft)
a measurable space, M, composed of the set of all
sequences.
a measure m based on transition probabilities.
a group of automorphisms, ft, that map M onto
itself, that preserves m and where the variable t
runs through the integers.

14
Whats Bernoulli got to do with it?

A scheme with subshift
The measure on M is generated by the sets Ai,j,k
a ai j, ai1 k whose measure is given by
a matrix of transition probabilities pjk gt 0.
A future event a1 depends on a0 hence, memory.
Realized in the geodesic flow on a compact closed
surface of constant negative curvature.

15
System behaviors

Ergodicity Transition probabilities are
positive recurrent and aperiodic.
Mixing Inheritance and Mendelian exceptions
lead to mixing.
K-systems Speciation events rigidly segregate
M other segregations exist.

16
Our salad days

Jukes-Cantor
HGY
Kimura 2-Parameter
PAM
BLOSUM

17
General Stationary Time-reversible Model
. pCrCA pGrGA pTrTA
pArAC . pGrGC pTrTC
pArAG pCrCG . pTrTG
pArAT pCrCT pGrGT .
R
(Diagonal elements such that rows sum to zero)
Time reversibility pirij pjrji
18
General Stationary Time-reversible Model

P(t) eRt
Given rates, one can find transition
probabilities, and vice-versa.

19
Jukes-Cantor
-3a a a a
a -3a a a
a a -3a a
a a a -3a
R
20
Kimura 2-Parameter
A C G T
. b a b
b . b a
a b . b
b a b .
R
a/b transition/transversion bias
21
HKY (Hasegawa, Kishino, Yano)
. mpC mkpG mpT
mpA . mpG mkpT
mkpA mpC . mpT
mpA mkpC mpG .
R
k transversion / transition
22
The BLOSUMn matrices

Start with multiple, ungapped alignments of
proteins found using PROTOMAT.
Build clusters by placing together sequences with
N identity.
Measure the score for each pair defined as
sij 2log2(pij/eij)
eij is expected probability of occurrence of the
i,j pair
pij is observed probability of the i,j pair.

23
Limitations

Naive approach measure frequencies of aligned
pairs and gaps in randomly selected confirmed
alignments to get pij, use a random set of
sequences to obtain eij.
Difficulty 1 it is difficult to get a good
random sample of sequences or alignments
databases are biased.
Difficulty 2 When sequences diverge from a
common ancestor recently, pij is small and s is
strongly negative. When sequences diverged long
ago, pij tends to eij and s approaches zero.

24
A short compendium of distances and scores

Jukes-Cantor distance
Kimura distance
Dayhoff evolutionary distance
BLOSUM scores
Profile scores
Average scores

25
References

Gu, X. Li, W, 1996. A general additive distance
with time-reversibility and rate variation among
nucleotide sites. Proc. Natl. Acad. Sci. USA 93
4671-4676.
Hasegawa, M., Kishino, H., Yano, T., 1985.
Dating of the human-ape splitting by a molecular
clock of mitochondrial DNA. J. Mol. Evol. 22
160-174.
Sanderson, M. J. Shaffer, H. B., 2002.
Troubleshooting molecular phylogenetic analyses.
Annu. Rev. Ecol. Syst. 33 49-72.

26
The distance between sequences, Part II.

Careful Measures
M. Elizabeth Corey, Ph.D.
UCSC Extension and UC Berkeley Extension

27
Exceptions to Mendels Laws

The theory a chromosomal basis of inheritance
Some so-called exceptions
linkage and recombination
gene conversion
transposition and mobile genetic elements
A plethora of other mutations point mutations,
reversions, deletions, frameshifts, duplications,
inversions
Exceptions do not result in rejection of
Mendelian genetics but a better understanding of
the mechanisms underlying Mendelian inheritance.

28
Mutation frequencies(mutations/generation)

Frequency of point mutation 10-7 to 10-8
Reversion of point mutations 10-8. Sometimes
called back mutation, sometimes called
convergence.
Reversion of deletion mutations undetectably
small.
Loss of function mutations result in grossly
lower biological fitness. The rate of extinction
due to gross loss of function is much great
than the rate of reversion, so the line will die
long before reversion can occur. In the
aggregate, the record will show a
pseudo-reversion.

29
Mutation frequencies

Deletions 10-6 dependent on chromosomal
region. Caveat May be underestimated less
detectable because they are often lethal .
Frameshifts 10-6 often repaired.
Duplications 10-3 - E. coli approximately 0.l
of a culture for a given region of the
chromosome.
Inversions hard to detect, not always mutations
Gene Conversions still unknown. Reparative.
mutators increase mutation frequencies by 100,
they work on hot spots

30
Protein-based inheritance Prions

Proteins that change their shape in response to
fluctuating environmental pressures, and then
maintain that shape during mitosis and meiosis,
constitute a form of cellular memory.
Various structural conformations are propagated
outside of the traditional genetic framework.

31
Hsp90 and Sup35

A buffer for silent polymorphisms Hsp90
promotes the folding of signal tranducers
buffers the effects of many silent polymorphisms
may serve as a capacitor of evolutionary change
storing and releasing genetic variation
Epigenetic inheritance The Sup35 prion

32
James Joyces List

Milk
Call mom!
Lettuce
Plumb the smithy of my soul for the unborn
race-consciousness
Rent
--------------------------------------
Thriving in fluctuating environments by
exploiting pre-existing genetic variations.

33
References

Recent Publications on Conformational Change
and Evolution
Queitsch, C., Sangster, T.A. and Lindquist, S.
2002. Hsp90 as a capacitor of phenotypic
variation. Nature 417 618-624.
Jensen, M.A., True, H.L., Chernoff, Y.O., and
Lindquist, S., 2001 Molecular Population Genetics
and Evolution of a Prion-like Protein in
Saccaromyces cerevisiae. Genetics 159
527-525.
True, H.L., and Lindquist, S.L. 2000. A yeast
prion provides an exploratory mechanism for
genetic variation and phenotypic diversity.
Nature 407 477-483.
Rutherford, S.L. and Lindquist, S. 1998. Hsp90 as
a capacitor for morphological evolution. Nature
396 336-342.

34
Mutations and time

Take a series of sequences and figure out how
different they are by counting up their
substitutions.

A
5 substitutions
6 substitutions
B
3 substitutions
C
35
Mutations and time

What process takes us from A to B to C?

A
gene conversion
No direct ancestry
B
frameshift (repairable) 2 point accepted mutations
C
36
Counting mutations

Consider a counting process N(t), t e T where
N(ti) N(tj) is the number of mutations in the
time interval (ti,tj.

A
N(tAB) 1 GC
No direct ancestry but we can still count
substitutions N (tAC) 6 PM
B
N(tBC) 1 FS, 2 PM
C
37
Times on the edges of the tree

The interoccurrence times between mutations, t1
0, , t1 t2 t1, ti ti ti-1, are
exponential variables with mean 1/b such that
Pti gt h e-bh
and
Pti lt h 1 - e-bh
for hgt 0.

38
Edge times

Gene conversions bgc 1 gc/2,000 years
Frame shifts bfs 1 shift/5,000 years
Point mutations bpm 1 pm/10,000 years
Just an wild guess

A
1/bgc 2,000 yrs
1/bpm 60,000 yrs
B
2/bpm 1/bfs 25,000 yrs
C
39
Edge times

Population of A 105
Population of B 106
Population of C I dont care.

A
1/Na bgc 20 10-2 yrs
1/bpm 60 10-2 yrs
B
2/ Nb bpm 1/ Nb bfs 2510-3 yrs
C
40
Calculating divergence times

Doolittle, D.F., Fend, D-F, Tsang, S., Cho, G and
Little, E. Determining Divergence Times of the
Major Kingdoms of Living Organisms With a Protein
Clock. Science, 271, pp. 470-477, 1996.

41
Calculating divergence times

Task Build a model for evolutionary time based
on pairwise distances, dij, and the fossil record
Start with the vertebrate fossil record - the
biogeochemistry gives reliable times.
Map the fossil-based phylogeny to the sequence
based phylogeny and compare edge lengths.
Adjust the sequence-based time model to match the
vertebrate fossil record.

42
Using the fossil record
43
Readjusting the clock

After sampling the vertebrate fossil-record and
fitting the sequence data to the fossil-record ,
they maintain the same clock.
Result Eukaryotes and Prokaryotes diverged
about 2.5 billion years ago.

44
On fitting the fossil record to sequence data

Challenges unequal rates of change in different
species due to
different reproductive cycles in different
species
different base population sizes in different
species.
Obtaining bacterial mutation rates using
vertebrate mutation rates when we are looking at
the evolution of populations how viable is it?

45
Population mutation

Suppose an average rate of mutation per site is
about 10-7 (ignoring duplications).
Compare lengths of reproductive cycles
Prokaryotes (blue-green algae and bacteria) 20
minutes to an hour per generation.
Humans US, average time to first child is 24.8
years.
How many times does a bacteria reproduce in the
time it takes a human being to reproduce?
24 365 25 219,000
So if we are comparing bacterial mutation rates
to human mutation rates and we looking at
aggregate populations, we have to adjust by a
factor of 106.

46
Population mutation

Size of the base population on planet earth
5 1030 prokaryotes (UG, Bill Whitman) -
including about a mole of bacteria
3 109 humans
How many bacteria are there, propagating how
fast, in comparison to humans? Worst case ratio?
Calculate using
base population rate of generation number of
mutable genes
(1023 106103)
-------------------------- 1018
(109 1104)

47
One final issue The Success Question

When mutations succeed, they succeed within an
ecological niche.
So when we ask When did a species arise?, it is
not enough to ask about the likelihood of a
certain kind of mutation, one must also ask
what is the likelihood that that mutation arose
in a niche that would support it?
So, dont forget about acceptance rates.

48
The FOXP2 point mutation

Enard et al, Molecular evolution of FOXP2, a
gene involved in speech and language, Nature,
Vol. 418, August 22, 2002

49
Silent/expressed mutations in FOXP2

Edge labels are Amino Acid / DNA substitutions

OHG
0/7
1/2
HG
0/2
2/2
Orangutan
Gorilla
Human
50
Selective sweeps

Measures for determining the existence of a
sweep
Tajimas D from Genetics, 1989 (conservative)
Fay and Wus H from Hitchhiking under positive
Darwinian selection, Genetics, 2000.
Also, Griffiths and Tavare estimate selection
using linked SNP data

51
Population mutation rates

Pia 4Na bi - the population mutation rate for
site i in species a, where Na is the effective
population size of species a and bi is the
mutation rate per generation at site i.

52
Tajimas D for FOXP2

-2.20
P 0.03
S/an 0.079
S is the sample size
an is the number of segregating sites.

53
Discovering different rate constants
54
Finding the time of appearance of the FOXP2
segregation

Sample current human population worldwide.
Generate trees with different times for the human
sequence data.
Measure the likelihood of the different trees.

55
Multiple rates

The automorphism f, mapping M onto itself, used
to be a simple shift operation.
Now, it incorporates several underlying
processes, including
mutation of the bases (mutation rate)
expression of the mutations (expression rate)
stabilization of a conformational phenotype
(stabilization rate)
success of the substitution (acceptance rate)

56
The distance between sequences, Part III.

Algorithms for phylogenies
M. Elizabeth Corey, Ph.D.
UCSC Extension and UC Berkeley Extension

57
Motivation

Phylogenies provide measures of similarity and
can lay a foundation for scoring alignments.
Rate structures provide indicators for motifs.
Branch points allow us to identify and classify
interesting bases.
If the branch points are in phenotypic trees, the
mutating bases can be used as phenotypic
identifiers.
If the branch points are in genotypic trees,
mutating (nonsilent) bases can be used as genetic
identifiers.

58
What goes into a phylogeny?
Pairwise Alignment

Distance measures (UPGMA, NN)
Site info (MLE and Parsimony)
Substitution scores
Equilibrium distributions for MLE

Multiple Alignment
Phylogenies
Transitional probability data
59
What do we get in return?
Pairwise Alignment

Guide trees
Rates and probabilities
Scoring matrices Scoring matrices

Multiple Alignment
Phylogenies
Transitional probability data
60
Part III Goals

Depict methods for finding guide trees for
progressive multiple alignment.
Clarify the differences between MLE, Maximum
Parsimony and Distance Methods and identify the
optimization techniques appropriate for each.
Define a new approach for faster identification
of near-optimal phylogenies.

61
Progressive multiple alignment

Choose a set of scores for sequence comparison
Alignment scores from Needleman-Wunsch,
Smith-Waterman and variants.
Consensus word score from BLAST, PSI-BLAST and
others
Substitution (scoring) matrices PAM, BLOSUM,
Jukes-Cantor, etc.
Construct a reputable guide tree
Hierachical clustering (UPGMA, Neighbor-Joining,
Fitch and Margoliash)
Maximum Parsimony (simple or weighted).
Maximum Likelihood Estimation (MLE)
Use the guide tree to produce an alignment

62
Tree evaluation - Parsimony

Given a semi-labeled tree, it is possible
to determine the trees internal nodes (ancestral
sequences) using a parsimony algorithm.
Evaluation function A summation of the scored
mutations in the parsimonious tree.

63
Parsimony - Illustrated
ABC node 3, cost is 3
A(B or D)C node 1, cost is 1
A(B or C) (E or C) node 2, cost is 2
ACC
ABC
ADC
ABE
64
Example Simple Parsimony

Initialization
Set the cost, C 0. Set k 2n-1, where n is
the number of sequences.
Recursion to compute node, Nk
if k is a leaf node, Nk sequence k
if k is not a leaf node
Compute Ni and Nj for the daughter nodes of Nk.
where the intersection of Ni and Nj is nonempty,
otherwise increment the cost by the number of
nonmatching residues and set
Termination
Minimum cost of tree C.

65
Tree evaluation Distance methods

Given a set of alignment scores, but without
assuming a tree topology, it is possible to
determine a tree and its edge lengths using a
distance method. This is sometimes called
minimum evolution and includes the hierarchical
clustering methods.
Evaluation function The sum of the edge lengths.

66
Hierarchical Clustering Illustrated UPGMA
t1 t2 ½d12
6
7
6
1
2
4
5
1
2
8
9
8
6
7
½d68
6
7
1
2
4
5
3
1
2
4
5
3
From Durbin et al, 2001
67
Algorithm UPGMA

Input N sequences and their relative distances,
dij
Initialization
Assign each sequence to its own cluster, Ci.
Define a leaf of T for each sequence and place
at height 0.
Iteration
Pick two clusters Ci, Cj such that dij is
minimal.
Define a new cluster k by Ck Ci,Cj.
Define a new set of distances dkl between Ck
and all current clusters.
Define a node k with daughter nodes i and j, and
place it at height hik ½dik.
Add k to the set of current clusters and remove i
and j.
Termination
Rooted When only two clusters i, j, remain, add
the root at height ½dik.

68
Tree evaluation - MLE

Given a tree topology and sequences preassigned
to each leaf, it is possible to determine a
trees edge lengths using maximum likelihood
estimation.
Evaluation function the likelihood of the tree.

69
Estimating Likelihood

Estimate branch lengths by viewing evolution as a
random process
Requires a probability model of evolution as a
function of time.
For DNA one can use Jukes-Cantor model (all
nucleotides have same substitution rates), or
Kimura model (different rates for transitions and
transversions).
For proteins one can use Dayhoff, but in the
probability form not the log-odds form.

70
Estimating Likelihood

S1, etc. are the bases or residues observed in
the extant and ancestral taxa.
v lt where l is the substitution rate and t is
absolute time
Pi,j(v) is the probability that the residue at
node si becomes residue at node sj in time v
p0 is the prior probability of the bases or
nucleotides at any position
The likelihood for this tree is
L p0P0,5(v5) P5,1(v1) P5,2(v2) P0,6(v6)
P6,3(v3) P6,4(v4)

71
Example Likelihood

For each mutating site in a set of sequences
Initialization
Set k 2n-1, where n is the number of sequences.
Recursion
Compute P(Lka) for each symbol, a, in the
alphabet as follows
If k is a leaf node
if xk,u a, then P(Lka) 1,
else Pk(a) 0.
if k is not a leaf node
Compute P(Lia), P(Lja) for all a at daughters
i,j
Set P(Lka) Sb,cP(ba,ti) P(Lib) P(ca,tj)
P(Ljc).
Termination
Likelihood for site u pa SaP(L2n-1a)
(pa is the equilibrium value of the probability
distribution for a.)
Concluding step Combine the likelihoods for
each site.

72
Maximizing Likelihood Estimation over edge times

Likehood estimation includes a step for computing
the likelihood of some character a at node k
given the subtree of k.
While we know that there is the possibility of
substitutions leading to a, these depend on how
long a time we have to make those substitutions
and we do not know the edge times of the tree.
We must explore a series of possible times in
order to to maximize the likelihood.
A method that maximizes likelihoods over edge
times is what is usually referred to as MLE.
Standard MLE procedures do not maximize
likelihoods over all topologies of the tree.

73
Comparisons between MLE, Parsimony and Distance
Methods
Algorithm Requires semi-labeled tree Requires scored alignments Order Results Edge weights Results Internal tree nodes Resulting Tree Is Ultrametric
MLE Yes No La2n-1 2an2 Transitional probabilities subtree probability Yes
Parsimony Yes No 2an2 Mutation counts Ancestral sequences No
Distance Methods No Yes 2n2 Distance measures e.g. alignment scores UPGMA a cluster of sequences UPGMA - no NN - yes.
74
Exploring different topologies

Successive addition and rearrangement
Very common method (see Phylip programs
including PROTPARS, DNAPARS, DNACOMP, DNAML,
DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX and
DOLLOP)
Sequences are taken in the order that they appear
in the input file and successively added to a
tree.
MCMC

75
Successive addition

Initialization
Place the set of sequences into L.
Create a tree,T, with one node the root.
Iteration for each sequence in L
Remove a sequence from L and add it as a leaf to
T.
Apply a process of local rearrangement (in
Felsensteins package, there are (n-1)(2n-3)
arrangements.)
Score each locally arranged tree.
set T to equal the best scoring tree.
Termination Globally rearrange the tree by
swapping subtrees, score each globally rearranged
tree and accept the tree with the best score.

76
Markov Chain Monte Carlo

A Bayesian method for phylogenetic inference
Moderately new method rooted in molecular
dynamics.
Topologies are randomly generated and scored so
that a representative set of most likely tree
topologies can be identified.
Mau, Newton and Larget (1998) apply MCMC to
sample trees using Bayes theorem. The following
explanation is based on their methodology - the
mistakes are mine, the facts and foundations,
theirs.

77
Introduction to the method
t is the set of all semi-labeled trees
78
Introduction to the method
Sampling the set of trees
Q2
Q1
Q3
79
Introduction to the method
Q01
Q12
Q23
to
t2
t1

a
c
b
d
a
b
c
d
a
b
c
d
A Chain of Accepted Samples
80
Introduction to the method
The partitioned space with representatives t1,,
t3
t1
t3
81
MCMC propaganda

allow exact inference provided certain convergent
criteria are demonstrated.
are efficient and can handle many more taxa or
sequences.
measure uncertainty during tree construction (no
bootstrapping needed.)

82
Summary of the Algorithm

Choose a starting tree
Perturb the current trees topology and branch
lengths to find a new tree.
Measure the likelihood for the new tree.
Compare the new tree to the last tree and decide
whether or not to accept it into the chain.
If youve got a sufficiently long chain, check
the characteristics of your sample to see if
there is convergence to a set of representative
topologies. If so, stop. Otherwise, to to 2.

83
Subproblems to be discussed

How do we represent the tree so it that is easy
to operate on? Cophentic matrices.
What is our perturbation operator?
How do we build our sampling chain?
When are we done sampling?

84
The Cophenetic Matrix

Some Notation
t a topology
n a node
a(n) the ancestor of a node
L a leaf node (the leaves are the current
record)
I an internal node (the historical record)

85
Cophenetic Trees

Labeled history (t1, t2) provides an order on
coalescent levels.

I0
level 0
t1
level 1
I1
t2
level 2
I2
L3
L1
L2
86
Example A Cophenetic Tree
t1 0.8 t20.3 t30.7 t40.5 t50.9 t61.5 total
4.7
These trees are described in terms of nodes
coalescing or merging backwards in time.
87
Example Cophenetic Matrix

The cophenetic matrix for the previous tree.
The tree representation (s, a) is
(5,7,4,1,2,6,3), (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)

Leaf 5 7 4 1 2 6 3
5 0 9.4 9.4 9.4 9.4 9.4 9.4
7 0 1.6 4.6 6.4 6.4 6.4
4 0 4.6 6.4 6.4 6.4
1 0 6.4 6.4 6.4
2 0 3.6 3.6
6 0 2.2
3 0
88
The Cophenetic Matrix

Theorem For any weighted binary tree with
labeled leaf nodes, the tree topology and branch
lengths can be uniquely determined using the
within-tree distances between all pairs of leaf
nodes. (Lapoint and Legendre, 1992)
Note, each permutation of the leaf labels
generates a different n x n symmetric matrix of
distance distances.

89
What is the perturbation operator?

Q is the proposal function and it has two stages
Q1 randomly selects a new leaf order
Q2 perturbs the values of the matrix
supradiagonals.
The proposal mechanism is symmetrical
Q(tn,tn1) Q(tn1,tn)

90
Details on Q1 and Q2

Q1 samples one of the 2n-1 leaf orderings of the
current tree model.
Q2 simultaneously and independently modifies the
elements of the superdiagonal by creating a
uniform distribution (ai ? d) where d is
constant.
By applying both types of perturbations, Q1 and
Q2, all the permutations of trees can be reached.

91
Illustration of Q2
92
Subproblems to be discussed

How do we represent the tree so it that is easy
to operate on? Use cophenetic matrices.
What is our perturbation operator? Q.
How do we build our sampling chain? Apply
Metropolis-Hastings
When are we done?

93
Acceptance with Metropolis-Hastings

Given a tree t, Metropolis-Hastings
1. Applies Q to build a new tree, t.
2. Always accepts the new tree when it is more
likely than the old one and sometimes accepts it
when it is less likely than the old one.

94
Acceptance with Metropolis-Hastings the
algorithm

If P(t) gt P(t)
accept t into the chain.
else
accept t into the chain with probability P(t)
/ P(t)

95
Acceptance with Metropolis-Hastings

The final step in evaluating the acceptance test
is evaluating
P(t) / P(t)
This is easy P(t) is approximated using the LE
of t.

96
Size of chain and convergence

How many trees do you have to propose before you
begin to get a good enough sample? Mau et al
1998 sample over about 2500 trees for Clarkia, a
phylogeny with 9 leaves
How do you test that you are done? At the end of
the run, we say that we have convergence if there
is a small set of topologies with high relative
frequency in the chain.
Whats the result? The topologies with the
highest frequencies are the reported
reconstructions.

97
Mixing

To obtain a confidence measure, the algorithm
must be run more than once each run generates a
chain of accepted trees.
When chains mix well when they come up with the
same representative topologies, starting from
different tree topologies.
If running a sufficient number of independent
chains is computationally prohibitive, Suchard et
al, 2002, provide a poor man's estimate of the
uncertainty.

98
Example with binary data(from Mau, et al, 1998)

9 species of genus Clarkia (California plants)
120 restriction sites
Data translated into a 9 x 120 matrix of zeroes
and ones, representing the absence or presence of
a restriction site in the genome of each species.

99
Running the MCMC algorithm

Random starting trees
Chains of length 250,000 were subsampled at rate
of 1/100 2500 trees
Each run took 20 minutes on a Sparc 10.
Convergence was inferred by reproducibility
across runs with very different starting trees.

100
The most common topologies for Clarkia
A 1,2 B 3,4 C 5,6 D8,9
101
References

Smouse and Li (1989) introduced the Bayesian
paradigm, but not the notation, to the phylogeny
reconstruction problem.
Goldman (1993) used non-Bayesian Monte Carlo
tests of significance to assess the adequacy of
evolutionary models.
Griffiths and Tavare (1994) constructed Markov
chains to compute likelihoods for ancestral
inference.
Mau, Newton and Larget (1998) apply MCMC to
sample trees using Bayes theorem.

102
Drill-down Rates

The way I use it, and I admit this is quirky,
motif means the genetic profile for a functional
structure. Using the following definitions
Let rG be the rate of mutation for a gene.
Let rE be the rate of expressed mutation for the
protein G encodes.
Let rS be the rate of structural mutation for the
protein G encodes.
Let rF be the rate of functional mutation for the
protein G encodes.
rG gt rE gt rS gt rF
Note that the rate of neutral mutations is rN
rG rF.
The true rate of mutation for a motif is rF,
the observed rate of mutation for members of a
motif in a genotypic tree is rG. If we want
motif branchings, we eliminate all branchings in
the phylogeny occuring with rates rN.

103
Drill-down Semi-labeled trees

Trees with a defined branching pattern and
defined leaf labels but WITHOUT edge lengths or
internal node labels.
In our terms, phylogenies with known branching
patterns but without information about ancestors
or mutation times.

nccbac
nacbac
nccnaa
ncbbbc
104
Drill-down Progressive Alignments

As you move up the tree, add to sum of
characters in growing alignment

105
Progressive Alignments

Sum of characters in growing alignment can be
represented in a table of values called
afrequency matrix or a profile

106
Progressive Alignments

Alignments are frozen once they are made. Scores
are then calculated between aligned positions
tabulated in a frequency matrix, using a scoring
table
Sij 2 GG 1AG

A G S T
A 4 1 3 10
G 3 2 6
S 2 14
T 8
107
Algorithm Neighbor-joining

Input N sequences and their relative distances,
dij
Initialization
Define a leaf of T for each sequence
Iteration
Pick two nodes i,j such that dij (ri rj) is
minimal.
Define a new set of distances, dkl between k
and all current nodes.
Define a node k with daughter nodes i and j, and
place it at edge length eik ½(dij ri rj)
and ejk dij dik.
Add k to the set of current nodes and remove i
and j.
Termination
Unrooted When only two nodes i, j, remain, add
an edge of length dij/2.

108
Comparison Neighbor-joining and UPGMA

Minimization
UPGMA uses dij
Nearest-neighbor uses dij (ri rj) where
Distance measures
For distances between leaves i and j
dij is the same in both algorithms.
For distances between nodes k and m
UPGMA uses dik 1/CiCj Sp in Ci, q in Cjdpq
Nearest-neighbor uses dkm ½ (dim djm dij)
where i and j are the daughters of k.
Edge lengths
UPGMA set the height of node k to ½ the distance
between daughters i,j (½ dij).
Nearest neighbor sets the edge length between k
and daughters j to ½(dij ri rj), daughter k
to dij dik.

109
Drill-down MLE
a P(Lka) P(ca,ti) P(ba,tj)
P(ba,tj)
P(ca,ti)
ncbbcbc P(Ljb) 1
nccbabc P(Lic) 1
site u 3 simplest case
110
Drill-down Enumerating topologies
111
Drill-down Acceptance with Metropolis-Hastings

A proposed tree t is accepted with probability
However, by detailed balance you can step forward
or backward with equal probability
Q(t,t) Q(t, t)
Hence our test becomes

Write a Comment

User Comments (0)