Phylogenetic Trees presentation

Title: Phylogenetic Trees

1
Phylogenetic Trees

What?
How?
Why?
Methods

2
What is a phylogenetic tree?

A phylogenetic tree is a data structure that
stores information about the relationships among
several sequences.
Given an appropriate scoring function, a given
sequence can always be found to be more related
to one sequence than to another, non-identical
sequence. In other words, these relationships
can be approximated by a binary tree.

3
What relationships are stored in the tree
structure?

The relationship represented in a phylogenetic
tree is a measure of homology.
The actual definition of homology is biological
in nature (and its precise definition is also a
matter of some debate), but computationally it is
usually thought of in terms of an identity or
similarity score d_{i,j} between two entities i
and j (taxa, sequences, etc.).
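
As a rough illustration of such a score, the sketch below computes a simple identity score between two already-aligned sequences. The function name identity_score and the choice to skip columns containing a gap ("-") are assumptions made for this example; the slides do not specify a particular scoring function.

```python
def identity_score(a: str, b: str) -> float:
    """Fraction of compared alignment columns with identical residues.

    Assumption for this sketch: columns where either sequence has a
    gap character '-' are skipped rather than counted as mismatches.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gap column: not compared
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0  # nothing to compare
    return matches / compared


# Three of the four aligned positions agree.
score = identity_score("ACGT", "ACGA")
```

A distance-like quantity can then be taken as 1 - identity_score(a, b), although real pipelines usually apply a substitution matrix rather than raw identity.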

4
How is this homology represented in a tree?

As seen before, sequence identity/similarity is
calculated by aligning two or more sequences in a
multiple sequence alignment.
Thus, a phylogenetic tree is simply an
arrangement of the data inherent within a
multiple sequence alignment into a tree.

5
Why use phylogenetic trees?

This arrangement is useful to biologists because
it organizes the sequences into their projected
evolutionary history.
Due to the construction method of a phylogenetic
tree, each node represents a common ancestor.
The distance from the leaves to this common
ancestor is a measure of the evolutionary
distance between the leaves.
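
The idea that internal nodes stand for common ancestors and that branch lengths encode evolutionary distance can be sketched as a small binary tree. The Node class, its field names, and the toy branch lengths below are hypothetical choices for illustration, not something given in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    branch_length: float = 0.0        # distance from this node to its parent
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def path_to(node: Optional[Node], leaf_name: str) -> Optional[List[Node]]:
    """Return the list of nodes from `node` down to the named leaf, or None."""
    if node is None:
        return None
    if node.is_leaf():
        return [node] if node.name == leaf_name else None
    for child in (node.left, node.right):
        sub = path_to(child, leaf_name)
        if sub is not None:
            return [node] + sub
    return None


def leaf_distance(root: Node, a: str, b: str) -> float:
    """Sum of branch lengths on the path between leaves a and b."""
    pa, pb = path_to(root, a), path_to(root, b)
    if pa is None or pb is None:
        raise KeyError("leaf not found")
    k = 0  # length of the shared prefix; pa[k - 1] is the common ancestor
    while k < min(len(pa), len(pb)) and pa[k] is pb[k]:
        k += 1
    return sum(n.branch_length for n in pa[k:] + pb[k:])


# Toy tree: ((seq1:1, seq2:1)anc12:2, seq3:3)root
anc12 = Node("anc12", 2.0, Node("seq1", 1.0), Node("seq2", 1.0))
root = Node("root", 0.0, anc12, Node("seq3", 3.0))
```

In this toy tree, seq1 and seq2 share the ancestor anc12 and are closer to each other than either is to seq3.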

6
Example Phylogenetic trees

[Figure: four positions marked "?"]
Problem: given several sequences, determine how
to arrange them according to evolutionary
distance.

7
Example Phylogenetic trees

[Figure: two example trees, each with leaves
sequence 1 to sequence 4; branch distances are
labeled x1, y1, z1 in the first tree and x2, y2,
z2 in the second]
The first arrangement differs from the second
only in that sequence 4 is more closely related
to the others (z1 < z2). The evolutionary
distances between sequences 1, 2, and 3 remain
the same, as evidenced by the horizontal distance
from each leaf to the respective node connecting
them (x1 = x2 and y1 = y2).

8
Why use phylogenetic trees?

Thus, the tree structure provides a context for
the evolutionary information inherent in a
sequence alignment.
This context represents the divergent
relationship of the sequences.
Furthermore, this contextual information is
weighted according to evolutionary distance.

9
Why use phylogenetic trees? Summary

The tree structure shows that two sequences are
related, how they are related in the context of
other sequences, and how distantly they are
related.

10
How is a phylogenetic tree constructed?

While the weighted score d_{i,j} is known from the
multiple sequence alignment, the rate of
evolutionary change is not known.
The rate of change depends on mutation rates,
which are not constant.

11
How is a phylogenetic tree constructed?

Mutations can occur in several ways:
forward mutations in one sequence from the
original sequence
backward mutations in one sequence towards the
original sequence
parallel mutations in two or more sequences
insertions in one or more sequences
deletions in one or more sequences

12
How is a phylogenetic tree constructed?

To reiterate from the sequence alignment
discussions: not all mutations are alike.
Mutation rates vary between organisms.
Mutation rates vary with amino acid type.
Mutation rates vary with the environment the
amino acids are in (mutations in the core of the
protein are less likely than those at the
surface).

13
How is a phylogenetic tree constructed?

Mutation rates vary with mutation type
(substitutions more likely than
insertions/deletions).
Mutations that conserve properties of the
original amino acid (such as charge, size, etc.)
are more likely than those that modify or invert
those properties (such as changing a
positively-charged amino acid to a neutral or
negatively-charged one).

14
Methods to calculate rate of evolutionary change

Poisson process model
Define the Unit Evolutionary Time T_u (the average
time to produce one substitution per 100 amino
acids): T_u = 1/(100 λ); solve for λ.
Find the probability of a substitution and
approximate the rate of change from theoretical
considerations.
Not used, because this model assumes that the
rate is independent of residue position and amino
acid type.
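
As a sketch of the relation on this slide, assume that λ denotes the substitution rate per residue per unit of time (the symbol is garbled in the transcript, so λ is an assumed name). One expected substitution per 100 residues over the time T_u then gives the first line, and the standard Poisson result for "at least one event in time t" gives the second:

```latex
100\,\lambda\,T_u = 1
\quad\Longrightarrow\quad
T_u = \frac{1}{100\,\lambda},
\qquad
\lambda = \frac{1}{100\,T_u}

P(\text{at least one substitution at a site within time } t) = 1 - e^{-\lambda t}
```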

15
Methods to calculate rate of evolutionary change

Amino acid substitution matrix
Use an empirically determined matrix, obtained by
comparing many protein sequences, to estimate the
probability p_{i,j} that during one evolutionary
time unit T_u, amino acid i will be substituted by
amino acid j.
The resulting substitution matrix M = (p_{i,j}) is
called the PAM1 matrix (one point accepted
mutation per 100 residues).

16
Methods to calculate rate of evolutionary change

Since the matrix was calculated using proteins
with small evolutionary distances (close
homologs), the matrix is then scaled to
approximate the probabilities for proteins with
larger evolutionary distances (more remote
homologs).
PAM250 is one commonly used matrix (corresponding
to M^250, the 250th power of the PAM1 matrix).
(One problem with this is that zero to the 250th
power is still zero.)
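
The scaling step can be sketched as a plain matrix power. The 3-state matrix TOY_PAM1 below is a made-up stand-in for the real 20 x 20 PAM1 matrix, and the helper names are assumptions for this example only.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """n-th power of a square matrix by repeated squaring (n >= 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical 3-state "PAM1-like" matrix: each row sums to 1 and the
# diagonal dominates, as for closely related sequences.
TOY_PAM1 = [
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
```

After 250 applications the rows still sum to 1, but the diagonal entries are much smaller than in the one-step matrix, reflecting the larger evolutionary distance.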

17
Methods to calculate rate of evolutionary change

Using this model, the probability p that a
substitution at a given site (at either position
i in sequence X or position j in sequence Y) has
occurred during t time units is

p = 1 - \sum_{i=1}^{20} p_{i,i}^{(2t)} \pi_i

where p_{i,i}^{(2t)} is the (i, i) entry of M^{2t}
and \pi is a column vector (\pi_1, ..., \pi_20)^T
representing the amino acid composition frequency
of a given polypeptide.
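
A minimal numerical sketch of this formula follows. It reads the slide's garbled expression as one minus the composition-weighted sum of the diagonal entries of the 2t-th power of M, which is an interpretation of the transcript. The 3-state matrix and uniform composition are invented toy values, and prob_substitution is a hypothetical name.

```python
def prob_substitution(m, pi, t):
    """1 - sum_i pi[i] * (M^(2t))[i][i] for a row-stochastic matrix m.

    Each diagonal entry is obtained by starting in state i and applying
    the transition matrix 2t times (t must be a non-negative integer).
    """
    n = len(m)
    stay = 0.0
    for i in range(n):
        row = [1.0 if k == i else 0.0 for k in range(n)]  # start in state i
        for _ in range(2 * t):
            row = [sum(row[k] * m[k][j] for k in range(n)) for j in range(n)]
        stay += pi[i] * row[i]  # probability of being in state i again
    return 1.0 - stay


# Toy stand-ins for the 20 x 20 matrix and the 20 composition frequencies.
TOY_M = [
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
]
TOY_PI = [1 / 3, 1 / 3, 1 / 3]

p_one_unit = prob_substitution(TOY_M, TOY_PI, 1)
```

The exponent 2t reflects that both sequences have diverged for t units from their common ancestor, so they are separated by 2t units in total.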

18
Methods to calculate rate of evolutionary change

NCBI's BLAST search tool offers, among others,
PAM30 and PAM70 as well as BLOSUM45, BLOSUM62,
and BLOSUM80. The default choice for protein
searches (and thus the most commonly used
substitution matrix) is BLOSUM62.
BLOSUM was created in a manner similar to that of
the PAM matrix, but using a more diverse set of
sequences.
Thus BLOSUM is (arguably) more accurate when
comparing more diverse proteins with less
sequence homology.
Since the function of BLAST is to identify
(possibly remote) homologs, BLOSUM is well suited
to BLAST searches.

19
Nucleotide Sequences

Nucleotide sequences are handled differently due
to their unique properties:
there is redundancy in the genetic code (multiple
nucleotide codons specify a given amino acid)
nucleotide substitutions don't always translate
to amino acid substitutions:
many third positions in the codon
introns and other non-coding regions
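
The redundancy of the genetic code can be illustrated with a small lookup. CODON_TABLE below is only a hand-picked subset of the standard genetic code (DNA codons), and is_synonymous is a hypothetical helper name.

```python
# Subset of the standard genetic code, enough to show redundancy.
CODON_TABLE = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "GAT": "Asp", "GAC": "Asp",
    "GAA": "Glu", "GAG": "Glu",
    "TGG": "Trp",
    "TGA": "Stop",
}


def is_synonymous(codon_a: str, codon_b: str) -> bool:
    """True if both codons encode the same amino acid (or both are stops)."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


# A third-position change in an alanine codon leaves the protein unchanged.
silent = is_synonymous("GCT", "GCA")
# A third-position change from Asp to Glu does alter the protein.
altering = is_synonymous("GAT", "GAA")
```

The same table also shows how a single substitution (TGG to TGA) can turn a tryptophan codon into a stop codon.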

20
Nucleotide Sequences

some substitutions alter the protein sequence at
more than a single position
creation of stop codon
frameshift mutation
altered promoter/operator/splice site, etc.
may totally destroy protein functionality
may instead alter only amount of expression
other sites (such as poly-A binding) may allow
normal protein to be expressed, but mRNA is
rapidly degraded, or not exported from the
nucleus, or secluded in some cellular organelle

21
Nucleotide Sequences

Computer simulations have found that the genetic
code is optimized such that many nucleotide
substitutions change an amino acid to a related
type, thus preserving properties such as
hydrophobicity or charge.

22
Nucleotide Sequences

Other features are shared:
substitution rate is species-dependent
forward/backward/parallel mutations
insertions/deletions - although they must occur
in multiples of 3 nucleotides to avoid a
frameshift mutation
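
The multiple-of-3 condition can be written as a one-line check. The function name causes_frameshift is an assumption for this sketch, and it considers only the length of a single insertion or deletion inside a coding region.

```python
def causes_frameshift(indel_length: int) -> bool:
    """True if an insertion/deletion of this many nucleotides shifts the frame."""
    if indel_length < 0:
        raise ValueError("indel length must be non-negative")
    return indel_length % 3 != 0


# Lengths 1 and 2 shift the reading frame; 3 and 6 do not.
shifts = [n for n in (1, 2, 3, 4, 6) if causes_frameshift(n)]
```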

23
Nucleotide Sequences

The PAM matrices assume a discrete Markov chain,
where
the PAM1 matrix is the transition matrix of the
Markov chain
the parameters are estimated from close homologs
using local sequence alignment
it is assumed that the two sequences being
compared are generated using one application of
the transition matrix, and thus that multiple
substitutions did not occur

24
Nucleotide Sequences

it is also assumed that the evolutionary distance
of more distantly related sequences can be
modeled by iterating the Markov chain n times,
although this allows only evolutionary distances
that are integer multiples of the evolutionary
distance used for setting up the PAM matrix
again, the only multiple of zero is zero...

25
Nucleotide Sequences

A continuous-time Markov process can be used
instead of a discrete Markov chain to avoid the
assumptions of no multiple substitutions and
discrete evolutionary time.
The long formal definition is omitted here, but
it is basically analogous to a Markov chain,
except that the transition matrix is replaced by
a matrix of transition probability functions that
depend on a time parameter t.
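
One common way to obtain such time-dependent transition probabilities is P(t) = exp(Q t) for a rate matrix Q whose rows sum to zero. The slides omit the formal definition, so the sketch below is an assumption about the intended construction. TOY_Q is an invented 3-state rate matrix, and the truncated Taylor series is adequate only when the entries of Q t are small; a library routine such as scipy.linalg.expm would be the usual choice in practice.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(q, t, terms=40):
    """Approximate P(t) = exp(Q t) by a truncated Taylor series."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # (Q t)^0 / 0! = identity
    for k in range(1, terms + 1):
        # term_k = term_{k-1} * (Q t) / k  ==  (Q t)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


# Hypothetical rate matrix: off-diagonal rates 0.01, rows sum to zero.
TOY_Q = [
    [-0.02, 0.01, 0.01],
    [0.01, -0.02, 0.01],
    [0.01, 0.01, -0.02],
]

# Unlike the n-th power of a one-step matrix, t need not be an integer.
p_half = transition_probabilities(TOY_Q, 0.5)
```

Because t is a real number, distances between the multiples available from a discrete chain are also covered, and P(s + t) = P(s) P(t) holds for any non-negative s and t.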
