Title: Phylogenetic Trees
1Phylogenetic Trees
2What is a phylogenetic tree?
- A phylogenetic tree is a data structure that
stores information regarding the relationship of
several sequences.
- Given an appropriate scoring function, a given
sequence can always be found to be more related
to one sequence than to another, non-identical
sequence. In other words, these relationships
can be approximated by a binary tree.
3What relationships are stored in the tree
structure?
- The relationship represented in a phylogenetic
tree is a measure of homology.
- The actual definition of homology is biological
in nature (and its precise definition is a matter
of some debate), but computationally it is
usually thought of in terms of an identity or
similarity score d_{i,j} between two entities
(taxa, sequences, etc.).
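As a rough illustration of such a score, the sketch below computes the fraction of identical positions between two sequences that have already been aligned. The function name `identity_score`, the gap character `-`, and the decision to skip gapped columns are assumptions made for this example, not definitions taken from the slides.

```python
def identity_score(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are ignored
        compared += 1
        if a == b:
            matches += 1
    # With no comparable columns, report zero similarity.
    return matches / compared if compared else 0.0
```

A dissimilarity (distance-like) value can be taken as one minus this score, although real analyses usually apply a substitution matrix rather than plain identity.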
4How is this homology represented in a tree?
- As seen before, sequence identity/similarity is
calculated by aligning two or more sequences in a
multiple sequence alignment.
- Thus, a phylogenetic tree is simply an
arrangement of the data inherent within a
multiple sequence alignment into a tree.
5Why use phylogenetic trees?
- This arrangement is useful to biologists because
it organizes the sequences into their projected
evolutionary history.
- Due to the construction method of a phylogenetic
tree, each node represents a common ancestor.
- The distance from the leaves to this common
ancestor is a measure of the evolutionary
distance between the leaves.
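A minimal sketch of this idea, assuming each node stores the length of the branch to its parent: the distance between two leaves is then the sum of branch lengths on the path through their most recent common ancestor. The `Node` class, the function names, and the branch lengths are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    branch_length: float = 0.0          # length of the edge to the parent
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def path_to(root: Node, name: str) -> Optional[List[Node]]:
    """Return the nodes from `root` down to the node called `name`, or None."""
    if root.name == name:
        return [root]
    for child in (root.left, root.right):
        if child is not None:
            sub = path_to(child, name)
            if sub is not None:
                return [root] + sub
    return None


def leaf_distance(root: Node, a: str, b: str) -> float:
    """Sum of branch lengths between two leaves via their common ancestor."""
    path_a, path_b = path_to(root, a), path_to(root, b)
    if path_a is None or path_b is None:
        raise KeyError("leaf not found in tree")
    shared = 0
    while (shared < min(len(path_a), len(path_b))
           and path_a[shared] is path_b[shared]):
        shared += 1  # skip ancestors common to both leaves
    return (sum(n.branch_length for n in path_a[shared:])
            + sum(n.branch_length for n in path_b[shared:]))


# Hypothetical three-leaf tree with made-up branch lengths.
seq1 = Node("sequence 1", 1.0)
seq2 = Node("sequence 2", 1.0)
anc12 = Node("ancestor of 1 and 2", 2.0, seq1, seq2)
seq3 = Node("sequence 3", 3.0)
root = Node("root", 0.0, anc12, seq3)
```

Here sequences 1 and 2 meet at their immediate ancestor, while sequences 1 and 3 meet only at the root, so the second pair is further apart.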
6Example Phylogenetic trees
Problem: given several sequences, determine how
to arrange them according to evolutionary
distance.
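The slides do not name a method at this point. One classic distance-based approach is UPGMA (average-linkage clustering), sketched below purely as an illustration: it repeatedly joins the two closest clusters. The function name `upgma`, the nested-tuple output, and the example distances are assumptions, and branch lengths are not computed here.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster leaves by average linkage; return the topology as nested tuples.

    `distances` maps frozenset({a, b}) to the distance between labels a and b.
    """
    clusters = {name: (name, 1) for name in names}  # label -> (subtree, leaf count)
    d = dict(distances)
    while len(clusters) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = f"({a},{b})"
        # Distance to every remaining cluster is the size-weighted average
        # of the distances from the two clusters being merged.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
    (tree, _size), = clusters.values()
    return tree


# Made-up pairwise distances for four sequences.
example = {
    frozenset(("seq1", "seq2")): 2.0,
    frozenset(("seq1", "seq3")): 6.0,
    frozenset(("seq2", "seq3")): 6.0,
    frozenset(("seq1", "seq4")): 10.0,
    frozenset(("seq2", "seq4")): 10.0,
    frozenset(("seq3", "seq4")): 10.0,
}
topology = upgma(["seq1", "seq2", "seq3", "seq4"], example)
```

With these distances, seq1 and seq2 are joined first, then seq3, and seq4 last. UPGMA assumes a constant rate of change along all branches, which is a strong assumption given that mutation rates vary.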
7Example Phylogenetic trees
[Figure: two example trees, each with leaves
sequence 1, sequence 2, sequence 3, and
sequence 4. Branch lengths are labeled x1, y1, z1
in the first tree and x2, y2, z2 in the second.]
The first arrangement differs from the second
only in that sequence 4 is more closely related
to the others (z1 < z2). The evolutionary
distances between sequences 1, 2, and 3 remain the
same, as evidenced by the horizontal distance from
each leaf to the respective node connecting them
(x1 = x2 and y1 = y2).
8Why use phylogenetic trees?
- Thus, the tree structure provides a context for
the evolutionary information inherent in a
sequence alignment.
- This context represents the divergent
relationship of the sequences.
- Furthermore, this contextual information is
weighted according to evolutionary distance.
9Why use phylogenetic trees? Summary
- The tree structure shows that two sequences are
related, how they are related in the context of
other sequences, and how distantly they are
related.
10How is a phylogenetic tree constructed?
- While the weighted score d_{i,j} is known from the
multiple sequence alignment, the rate of
evolutionary change is not known.
- The rate of change depends on mutation rates,
which are not constant.
11How is a phylogenetic tree constructed?
- Mutations can occur in several ways:
  - forward mutations in one sequence from the
    original sequence
  - backward mutations in one sequence towards the
    original sequence
  - parallel mutations in two or more sequences
  - insertions in one or more sequences
  - deletions in one or more sequences
12How is a phylogenetic tree constructed?
- To reiterate from the sequence alignment
discussions: not all mutations are alike.
- Mutation rates vary between organisms.
- Mutation rates vary with amino acid type.
- Mutation rates vary with the environment the
amino acids are in (mutations in the core of the
protein are less likely than those at the
surface).
13How is a phylogenetic tree constructed?
- Mutation rates vary with mutation type
(substitutions are more likely than
insertions/deletions).
- Mutations that conserve properties of the
original amino acid (such as charge, size, etc.)
are more likely than those that modify or invert
those properties (such as changing a
positively-charged amino acid to a neutral or
negatively-charged one).
14Methods to calculate rate of evolutionary change
- Poisson process model
  - Define the Unit Evolutionary Time T_u (the
    average time to produce one substitution per
    100 amino acids) by T_u = 1/(100 λ), and solve
    for the rate λ.
  - Find the probability of a substitution and
    approximate the rate of change from theoretical
    considerations.
  - Not used because this model assumes that the
    rate is independent of residue position and
    amino acid type.
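A small sketch of the Poisson calculation, assuming λ denotes the substitution rate per site per unit time, so that one substitution per 100 sites in time T_u means λ T_u = 1/100. Under a Poisson process, the probability of at least one substitution at a site within time t is 1 - exp(-λ t). The function names are hypothetical.

```python
import math


def rate_from_unit_time(t_u: float) -> float:
    """Solve lambda * T_u = 1/100 for the per-site substitution rate lambda."""
    return 1.0 / (100.0 * t_u)


def prob_at_least_one_substitution(rate: float, t: float) -> float:
    """Poisson probability that a site has had one or more substitutions by time t."""
    return 1.0 - math.exp(-rate * t)
```

Note that the same rate is applied to every site and every amino acid, which is exactly the limitation the slide points out.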
15Methods to calculate rate of evolutionary change
- Amino acid substitution matrix
  - Use an empirically determined matrix, obtained by
    comparing many protein sequences, to estimate the
    probability p_{i,j} that during one evolutionary
    time unit T_u, amino acid i will be substituted by
    residue j.
  - The substitution matrix obtained, M = (p_{i,j}), is
    called the PAM1 matrix (one point accepted mutation
    per 100 residues).
16Methods to calculate rate of evolutionary change
- Since the matrix was calculated using proteins
with small evolutionary distances (close
homologs), the matrix is then scaled to
approximate the probabilities for proteins with
larger evolutionary distances (more remote
homologs).
- PAM250 is one commonly used matrix (corresponding
to M^250, the 250th power of the PAM1 matrix).
- (One problem with this is that zero to the 250th
power is still zero.)
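The scaling step can be sketched as a matrix power. The example below uses a toy 3-state transition matrix with made-up numbers (not real PAM1 values, which cover 20 amino acids) and a hypothetical helper `mat_pow` that raises it to the 250th power by repeated squaring.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Return m raised to the non-negative integer power n (repeated squaring)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy stand-in for PAM1: each row sums to 1, values are invented.
TOY_PAM1 = [
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
]
toy_pam250 = mat_pow(TOY_PAM1, 250)
```

Each row of the result still sums to 1, and the diagonal entries shrink as more evolutionary time is modeled.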
17Methods to calculate rate of evolutionary change
- Using this model, the probability p that a
substitution at a given site (at either position
i in sequence X or position j in sequence Y) has
occurred during t time units is

  $$p = 1 - \sum_{i=1}^{20} p_{i,i}^{(2t)} \, \pi_i$$

- where p_{i,i}^{(2t)} is the i-th diagonal entry of
M^{2t} and π is a column vector (π_1, ..., π_20)^T
representing the amino acid composition frequency
of a given polypeptide
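A numerical sketch of this probability, under the assumption that the term inside the sum is the i-th diagonal entry of the 2t-th power of M (the chance that residue i is unchanged after 2t time units), weighted by the composition frequency of residue i. The 3-state matrix, the uniform frequencies, and the function names are made up for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def substitution_probability(m, pi_freq, t):
    """p = 1 - sum_i pi_i * [M^(2t)]_ii for a whole number of time units t."""
    size = len(m)
    power = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(2 * t):          # build M^(2t) one factor at a time
        power = mat_mul(power, m)
    return 1.0 - sum(pi_freq[i] * power[i][i] for i in range(size))


# Toy inputs: invented transition matrix and uniform composition.
TOY_M = [
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
]
TOY_PI = [1 / 3, 1 / 3, 1 / 3]
```

The exponent 2t reflects that two sequences which diverged t time units ago are separated by 2t units of evolution in total.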
18Methods to calculate rate of evolutionary change
- Currently, NCBI's BLAST search tool offers
substitution matrices including PAM30, PAM70,
BLOSUM62, BLOSUM80, and BLOSUM45. The default
choice (and thus the most commonly used
substitution matrix) is BLOSUM62.
- BLOSUM was created in a manner similar to that of
the PAM matrix, but using a more diverse set of
sequences.
- Thus it is (arguably) more accurate when
comparing more diverse proteins with less
sequence homology.
- Since the function of BLAST is to identify
(possibly remote) homologs, it is ideal for BLAST
searches.
19Nucleotide Sequences
- Nucleotide sequences are handled differently due
to their unique properties
  - there is redundancy in the genetic code (multiple
    nucleotide codons specify a given amino acid)
  - nucleotide substitutions don't always translate
    to amino acid substitutions
    - many third positions in the codon
    - introns and other non-coding regions
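To make the redundancy concrete, the sketch below checks whether a single-codon change alters the encoded amino acid. Only a small excerpt of the standard genetic code is included, written as DNA codons; the table name `CODON_EXCERPT` and the function `is_synonymous` are hypothetical.

```python
# Excerpt of the standard genetic code (DNA codons); not the full 64-codon table.
CODON_EXCERPT = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "GAT": "Asp", "GAC": "Asp",
    "GAA": "Glu", "GAG": "Glu",
    "TGG": "Trp",
    "TGA": "Stop",
}


def is_synonymous(codon_before: str, codon_after: str) -> bool:
    """True if both codons encode the same amino acid (or both are stops)."""
    return CODON_EXCERPT[codon_before] == CODON_EXCERPT[codon_after]
```

For example, GCT to GCC changes only the third position and still encodes alanine, whereas GAT to GAA swaps aspartate for glutamate and TGG to TGA introduces a stop codon.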
20Nucleotide Sequences
- some substitutions alter the protein sequence at
more than a single position
- creation of stop codon
- frameshift mutation
- altered promoter/operator/splice site, etc.
- may totally destroy protein functionality
- may instead alter only amount of expression
- other sites (such as poly-A binding) may allow
normal protein to be expressed, but mRNA is
rapidly degraded, or not exported from the
nucleus, or secluded in some cellular organelle
21Nucleotide Sequences
- computer simulations have found that the genetic
code is optimized such that many nucleotide
substitutions change an amino acid to a related
type, thus preserving properties such as
hydrophobicity or charge.
22Nucleotide Sequences
- Other features are shared
- substitution rate is species-dependent
- forward/backward/parallel mutations
- insertions/deletions - although they must occur
in multiples of 3 nucleotides to avoid a
frameshift mutation
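The multiple-of-3 condition can be illustrated by splitting a coding sequence into codons before and after a deletion. The helper names and the short example sequence are made up for this sketch.

```python
def codons(seq: str) -> list:
    """Split a sequence into complete codons, dropping any trailing 1-2 bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def delete(seq: str, start: int, length: int) -> str:
    """Return `seq` with `length` bases removed beginning at index `start`."""
    return seq[:start] + seq[start + length:]


original = "ATGGCTGAATGG"                  # ATG GCT GAA TGG
in_frame = delete(original, 3, 3)          # removes the whole GCT codon
frameshift = delete(original, 3, 1)        # removes a single base
```

After the 3-base deletion the downstream codons GAA and TGG are unchanged, while the 1-base deletion shifts the reading frame and changes every codon that follows.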
23Nucleotide Sequences
- The PAM matrices assumed a discrete Markov chain
where
  - the PAM1 matrix is the transition matrix of the
    Markov chain
  - the parameters are estimated from close homologs
    using local sequence alignment
  - it is assumed that the two sequences being
    compared are generated using one application of
    the transition matrix, and thus that multiple
    substitutions did not occur
24Nucleotide Sequences
- it is also assumed that the evolutionary distance
of more distantly related sequences can be
modeled by n-fold iteration of the Markov chain,
although this allows only evolutionary
distances that are multiples of the evolutionary
distance used for setting up the PAM matrix
- again, the only multiple of zero is zero...
25Nucleotide Sequences
- Can use a continuous Markov process instead of a
discrete Markov chain to avoid the assumptions of
no multiple substitutions and discrete
evolutionary time
- the long formal definition is omitted, but it is
basically analogous to a Markov chain, except that
the transition matrix is replaced by a matrix of
transition probability functions that depend on a
time parameter t
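One standard concrete case, not named in the slides, is the Jukes-Cantor model for nucleotides, in which every base changes to each of the other three at the same rate α. Its transition probabilities are explicit functions of t: a base is unchanged with probability 1/4 + (3/4) exp(-4αt) and has become one specific other base with probability 1/4 - (1/4) exp(-4αt). The sketch below assumes that model; the function name is hypothetical.

```python
import math


def jc_transition_probability(same: bool, alpha: float, t: float) -> float:
    """Jukes-Cantor probability of observing the same base (or one given
    different base) after time t, with per-pair substitution rate alpha."""
    decay = math.exp(-4.0 * alpha * t)
    if same:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay
```

Because t is a real number, any evolutionary distance can be modeled, and multiple substitutions at a site are accounted for by the functions themselves rather than by iterating a fixed matrix.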