Title: Intro to Phylogenetic Tree Reconstruction
1 Intro to Phylogenetic Tree Reconstruction
- Phylogeny: the (evolutionary) relationships between any set of species
- Hypothesis: all organisms on Earth are evolutionarily related via a common ancestor
- Evidence: similarity of many molecular mechanisms and genetic materials
2 Intro to Phylogenetic Tree Reconstruction
- A phylogeny can be represented as a tree.
- Two types of phylogenetics:
  - classic phylogenetics: based on morphological characters
  - modern phylogenetics: based on information extracted from sequence data (DNA, RNA and proteins), where the characters are sites in the sequences
- Assumption:
  - Sequences have descended from common ancestral genes/species, but it is difficult to distinguish orthologues from paralogues
  - The phylogenetic tree of a group of sequences does not necessarily represent the true phylogenetic tree of the host species
3 Intro to Phylogenetic Tree Reconstruction
- leaves: species
- internal nodes: (hypothetical) ancestors
- nodes: species or character values (states)
- edges: evolutionary relationships between nodes
- edge lengths: evolutionary distance between nodes (evolutionary time)
- we restrict ourselves to binary trees only
  - this is OK, as we can use distances of 0 to represent nodes with more than two children
- rooted vs. unrooted trees
  - the root represents the ultimate ancestor of the group of sequences (a rooted tree includes hierarchy)
4 Intro to Phylogenetic Tree Reconstruction
- Phylogenetic Tree Reconstruction (Inference) Problem
- Given:
  - n species
  - m characters
  - for each species, values for all characters
- Want: a fully labelled phylogenetic tree that 'best' explains the given data (i.e. maximizes a target function (score))
- Assumptions:
  - characters are mutually independent
  - after two species diverge, their further evolution is independent of each other
- Simple solution: check all possible trees and pick the best one
  - problem: too many possibilities to check
  - n species -> $(2n-3)!!$ different rooted trees
  - n = 20 -> about $10^{21}$ trees
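To get a feel for how quickly the search space grows, the count $(2n-3)!!$ can be evaluated directly. The following is a minimal sketch; the function name `rooted_tree_count` is chosen here for illustration and does not come from the slides.

```python
def rooted_tree_count(n: int) -> int:
    """Number of rooted binary trees on n labelled leaves: (2n-3)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2).
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


for n in (3, 4, 5, 10, 20):
    print(n, rooted_tree_count(n))
```

For n = 20 the result lies between $10^{21}$ and $10^{22}$, which is why exhaustive enumeration is not practical.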
5 Intro to Phylogenetic Tree Reconstruction
- Distance-Based Algorithms
- Idea:
  - begin with a set of distances $d_{ij}$ between each pair i, j of sequences
  - find the tree that predicts the observed sequence data as accurately as possible
- How to find the tree:
  - 1. General idea: given the pairwise distances $d_{ij}$ and a tree T predicting pairwise distances $d_{ij}'$, find the T that minimizes SSQ(T) => Least Squares Method, but this problem is NP-complete
6 Intro to Phylogenetic Tree Reconstruction
- Distance-Based Algorithms
- 2. Clustering: UPGMA (Unweighted Pair Group Method Using Arithmetic Averages)
  - Idea: cluster the sequences; at each stage, merge two groups and create a new node in the tree
  - build the tree bottom-up from the leaves
  - result: a rooted tree with the molecular clock property (MCP)
  - MCP: 1:1 correspondence between distance and evolutionary time
  - not always true in reality: some sequences evolve faster
  - if the 'true' tree doesn't have the MCP, UPGMA may give incorrect results
7 Intro to Phylogenetic Tree Reconstruction
- Distance-Based Algorithms
- 3. Clustering: Neighbor Joining
  - guaranteed to generate the correct tree in polynomial time if the distances are additive
  - (additivity is weaker than the MCP, so more reasonable; still, it is not always true)
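Additivity can be tested with the four-point condition: for every four taxa, of the three sums $D_{ij}+D_{kl}$, $D_{ik}+D_{jl}$, $D_{il}+D_{jk}$, the two largest must be equal. The sketch below assumes the distances are given as a symmetric list-of-lists matrix; the name `is_additive`, the tolerance, and the example matrix are illustrative choices, not part of the slides.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix D."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        # The two largest of the three sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances read off a tree with leaf branches 1, 2, 4, 5 and inner branch 3.
D = [[0, 3, 8, 9],
     [3, 0, 9, 10],
     [8, 9, 0, 9],
     [9, 10, 9, 0]]
print(is_additive(D))
```

Checking all quadruples takes $O(n^4)$ time, so this is meant as a diagnostic for small data sets only.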
8 Phylogenetics and Phylogenetic Trees
9 Why do a phylogenetic analysis?
- important for deciphering relationships in gene function and in protein structure and function in different organisms
- helps to use the genetic information of a model organism to analyze a second organism
- helps to sort out gene family relationships
- valuable tool for tracing the evolutionary history of genes
10 Performing a phylogenetic analysis
- Start with a reasonable multiple sequence alignment
- Examine either the sequence variation in each column or the number of differences between each pair of sequences
- Produce a tree representation of the sequences based on their similarities/differences
11 Methods of evaluating sequence relationships
- sequence A ERKSIQDLFQSFTLFERRLLIEF
- sequence B ERLSISELIGSLRLYERRLIIEY
- sequence C DRKSISDLIGSLRLA---LLIEF
- sequence D DRK---DLISSLRKA---LLIEW
- 1. Account for all column variations
  - A,B and C,D form similar groups based on col. 1
  - A,C,D form a group based on col. 3
- 2. Count differences between sequences
  - A,B: 17/23 similar, 6/23 different
  - C,D: 21/23 similar, 2/23 different
- C and D are very closely related by either method
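Both ways of looking at the alignment can be sketched in a few lines. The helper names `column_groups` and `mismatches` are hypothetical. Note that `mismatches` counts exact character differences (with a gap treated as an ordinary character), which is stricter than the "similar" tallies quoted above; those may count conservative substitutions as similar, so the numbers need not agree.

```python
from collections import defaultdict

ALIGNMENT = {
    "A": "ERKSIQDLFQSFTLFERRLLIEF",
    "B": "ERLSISELIGSLRLYERRLIIEY",
    "C": "DRKSISDLIGSLRLA---LLIEF",
    "D": "DRK---DLISSLRKA---LLIEW",
}


def column_groups(alignment, col):
    """Group sequence names by the character found in a 1-based column."""
    groups = defaultdict(list)
    for name, seq in alignment.items():
        groups[seq[col - 1]].append(name)
    return dict(groups)


def mismatches(s, t):
    """Count aligned positions at which the two characters differ."""
    return sum(a != b for a, b in zip(s, t))


print(column_groups(ALIGNMENT, 1))  # groups by column 1
print(column_groups(ALIGNMENT, 3))  # groups by column 3
print(mismatches(ALIGNMENT["A"], ALIGNMENT["B"]))
```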
12 What is a tree?
- a graphical representation of the sequence similarities among a group of nucleic acid or protein sequences
13?????
- ?????????? ????????? ( Molecular Phylogenetics
)- ??? ??? ????? ?????? ???????????? ???
??????????? ?? ???? ??? ????? ????? ?? ????????
????????? ?????? ?????????. - ????? phylogeny ?????? ?????? ??? ????? ???
????? ???' ?? ???????? ( Phylogenetic Tree - PT
).
14??????
- ????? ??? ?? taxonomic units , ?''? ??????
???? ?????? ?????? ???????? ?? ??????????. - ????? ???????? ?? operational taxonomic units
(OTUs)???????? ?? ??????????? ?? ?? ????? (???? ,
?? ????? ???). - ????? ??????? ??????? ?? ??????????? ?? ?? ?????
?? ???????? ????? ????? ?????? ??????????? (??
????) ???????? ???' ??????? ????????. - ????? ??? ???????? ?? ?????? ???? ???' ????? ???
?- scaled ?- unscaled edges . ?''? ???? ???? ????
????? ???? ?????? ( scaled ) ?? ??.
16 Example 1
- Two alternative representations of a phylogenetic tree for five OTUs.
- (a) Unscaled branches: extant OTUs are lined up and nodes are positioned proportionally to times of divergence.
- (b) Scaled branches: lengths of branches are proportional to the numbers of molecular changes.
17?????? - ????
- additive tree ??? ?? ?? ?????? ????? ??
?"????" ??? ????/?????????? ???????? ???, ???? ,
?- (b) ?? ???? ?????, ????? ????? ??? A ?- B ???
213. -
- ??? ???? ????? ??? ???? ????? ( rooted tree ) ??
??? ???? ( unrooted tree ) .
18 Example 2
- Rooted and unrooted phylogenetic trees.
- Arrows indicate the unique path leading from the root (R) to OTU D.
19?????? - ????
- additive tree species tree ??? ?? ?????? ????.
- orthologous genes ???? ????? ?????
??????????? ?????. - paralogous genes ???? ????? ????? ????? ????????
- homologous genes ???? ????? ???? ????.
- clade ( monophyletic group ) ???? ?????
???????? ??? ?????? ?? ???? ???? ??.
20 Example 3
Phylogenetic tree of birds, reptiles, and mammals. The reptiles do not constitute a natural clade, since they share ancestors with the birds, which are not included in the Reptilia. Birds and crocodiles, on the other hand, constitute a clade (Archosauria), since they share a common ancestor (black box) not shared by any other organism.
21 Phylogenetic Prediction
- A phylogenetic analysis of a family of related nucleic acid or protein sequences is a determination of how the family might have been derived during evolution.
- The evolutionary relationships among the sequences are depicted by placing the sequences as outer branches on a tree.
- The branching relationships on the inner part of the tree then reflect the degree to which different sequences are related.
22 Phylogenetic Prediction
- Two sequences that are very much alike will be located as neighboring outside branches and will be joined by a common branch beneath them.
- The object of phylogenetic analysis is to discover all of the branching relationships in the tree and the branch lengths.
- Chapter 6 of David Mount's Bioinformatics presents procedures for phylogenetic analysis, with an emphasis on the complexity of the problem and advice for solving difficult analyses.
23 Relationship of Phylogenetic Analysis to Sequence Alignment
- The most common method of multiple sequence alignment is the progressive alignment method (as used in CLUSTALW).
- The progression of the alignment is supposed to represent a reliable history of the evolutionary changes that have occurred.
- A sequence alignment reveals which positions in the sequences were conserved and which diverged from a common ancestor sequence, as illustrated in the next slide.
24 Sequence similarity
- Origin of similar sequences.
- Sequences 1 and 2 are each assumed to be derived from a common ancestor sequence. Some of the ancestor sequence can be inferred from conserved positions in the two sequences.
- For positions that vary, there are two possible choices at these sites in the ancestor.
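The inference described here can be illustrated with a small sketch: conserved positions yield a single ancestral candidate, while varying positions yield two. The function names and the two short example sequences are made up for illustration.

```python
def ancestor_candidates(seq1, seq2):
    """Per aligned position: one residue if conserved, else both candidates."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return [{a} if a == b else {a, b} for a, b in zip(seq1, seq2)]


def render(candidates):
    """Write ambiguous positions as [XY] and conserved ones as a single letter."""
    parts = []
    for residues in candidates:
        letters = "".join(sorted(residues))
        parts.append(letters if len(residues) == 1 else "[" + letters + "]")
    return "".join(parts)


print(render(ancestor_candidates("ACGTA", "ACCTA")))  # AC[CG]TA
```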
26 Methods of tree reconstruction
- ???? ??? ????? ?????? ??
- Distance Matrix Method (DMM)
- Maximum Parsimony Methods
- Maximum Likelihood Methods
- Method of Invariants
- Mount pp. 248-254
27 Maximum Parsimony Method
- This method predicts the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences.
- A multiple sequence alignment is required to predict variation.
- For each aligned position, the phylogenetic trees (PTs) that require the smallest number of changes are identified.
- This method is used for sequences that are quite similar and for small numbers of sequences.
- One or more unrooted trees are predicted.
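Counting the changes a given tree requires at each aligned position is commonly done with Fitch's algorithm; the slides do not name a specific procedure, so this is a swapped-in illustration. The sketch assumes a tree written as nested pairs of leaf names (rooted arbitrarily for convenience, since the count does not depend on where the root is placed) and a gap-free toy alignment invented for the example.

```python
def fitch(tree, states):
    """Return (candidate state set, number of changes) for one site.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    states: dict mapping each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint sets: one change is needed on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site change counts over all aligned positions."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


toy = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "GGG"}
candidates = [
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
]
best = min(candidates, key=lambda t: parsimony_score(t, toy))
print(best, parsimony_score(best, toy))
```

With four sequences there are only three unrooted topologies, so all can be scored; for larger sets the number of candidate trees grows too fast for this, which is one reason the method is applied to small numbers of sequences.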
28 Maximum Parsimony Method
- The main programs are in the PHYLIP package:
  - 1. DNAPARS: treats gaps as a fifth nucleotide state.
  - 2. DNAPENNY: branch-and-bound search.
  - 3. DNACOMP
  - and so on
29 Methods of tree reconstruction
- ???? ??? ????? ?????? ??
- Distance Matrix Method (DMM)
- Maximum Parsimony Methods
- Maximum Likelihood Methods
- Method of Invariants
- ??????? ?????? ???? ????? ?"????" ??? ??????
(???????). - ?????? ????? ??? ?????? ???? ???????? ??????.
- ?????? ??? ????? ?? ?????? ????? ??? ????
(??????) ?? ??????? ??? ????? / ?????????? ??????.
30 Least Squares Method
- ????? , ???? ????? NP-complete, ????????? ???? ??
??? ?????????? UPGMA - ( Unweighted Pair Group Method with Arithmetic
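mean ) .
- Input: the distance matrix $D_{ij}$ and the weights $w_{ij}$
- Find the tree T that minimizes SSQ(T), where $d_{ij}$ is the distance between i and j predicted by T:
- $SSQ(T) = \sum_{i=1}^{n} \sum_{j \neq i} w_{ij} (D_{ij} - d_{ij})^2$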
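Evaluating SSQ(T) for one candidate tree is straightforward once the tree's predicted distances are known; the hard part is the search over trees. The sketch below assumes both the observed distances `D` and the tree-predicted distances `d` are given as full symmetric matrices, and uses unit weights when none are supplied. The function name `ssq` and the example values are illustrative.

```python
def ssq(D, d, w=None):
    """Weighted sum of squared differences over all ordered pairs i != j."""
    n = len(D)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                weight = 1.0 if w is None else w[i][j]
                total += weight * (D[i][j] - d[i][j]) ** 2
    return total


observed = [[0, 3, 8], [3, 0, 9], [8, 9, 0]]
predicted = [[0, 3, 8], [3, 0, 7], [8, 7, 0]]
print(ssq(observed, predicted))  # 8.0
```

As in the formula, each unordered pair is counted twice (once as (i, j) and once as (j, i)).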
31 UPGMA: Unweighted Pair Group Method with Arithmetic mean
- Initialization
- 1. Initialize n clusters with the given species, one species per cluster.
- 2. Set the size of each cluster to 1.
- 3. In the output tree T, assign a leaf for each species.
32 UPGMA: Unweighted Pair Group Method with Arithmetic mean
- Iteration
- 1. Find the i and j that have the smallest distance $D_{ij}$.
- 2. Create a new cluster (ij), which has $n_{(ij)} = n_i + n_j$ members.
- 3. Connect i and j on the tree to a new node, which corresponds to the new cluster (ij), and give the two branches connecting i and j to (ij) length $D_{ij}/2$ each.
33 UPGMA: Unweighted Pair Group Method with Arithmetic mean
- Iteration
- 4. Compute the distance from the new cluster to all other clusters k (except for i and j, which are no longer relevant) as a weighted average of the distances from its components: $D_{(ij),k} = \frac{n_i}{n_i + n_j} D_{ik} + \frac{n_j}{n_i + n_j} D_{jk}$
- 5. Delete the columns and rows in D that correspond to clusters i and j, and add a column and row for cluster (ij), with $D_{(ij),k}$ computed as above.
- 6. Return to 1 until there is only one cluster left.
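The initialization and iteration steps can be put together in a short sketch. It assumes a symmetric list-of-lists distance matrix and represents each internal node as a tuple `(left, right, height)`, where the height is half the distance at which the two clusters were merged; this representation and the function name `upgma` are choices made here. For clarity it rescans all pairs for the minimum in every round, so it is not an efficient implementation.

```python
def upgma(D, names):
    """Build a UPGMA tree; leaves are names, inner nodes (left, right, height)."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Step 1: closest pair of clusters (keys are stored as (small, large)).
        (i, j), dij = min(dist.items(), key=lambda item: item[1])
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        del dist[(i, j)]
        # Steps 4-5: size-weighted average distance to every remaining cluster.
        for k in clusters:
            dik = dist.pop((min(i, k), max(i, k)))
            djk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        # Steps 2-3: new cluster (ij), placed at height D_ij / 2.
        clusters[next_id] = ((ti, tj, dij / 2), ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0]


D = [[0, 2, 6, 6],
     [2, 0, 6, 6],
     [6, 6, 0, 4],
     [6, 6, 4, 0]]
print(upgma(D, ["a", "b", "c", "d"]))
# (('a', 'b', 1.0), ('c', 'd', 2.0), 3.0)
```

The length of the branch from a child to its parent is the parent's height minus the child's height (leaves have height 0).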
34 UPGMA: Unweighted Pair Group Method with Arithmetic mean
- Complexity
- The time and space complexity of UPGMA is $O(n^2)$, since there are n-1 iterations, with O(n) work in each one.
35 Example 4
(a) The true phylogenetic tree. (b) The erroneous phylogenetic tree reconstructed by using the UPGMA method, which does not take into account the possibility of unequal substitution rates along different branches.
36 Neighbor Joining (NJ)
- ????????? NJ ???? ???? ????? ?????? ??? ??????
?? Least Square Method. - ???? ??? ???? ?? ???? ???? ????????. ?????? ???
???? ?? ?- clusters ??? ???? ?????? ??? ???? , ??
?????? ???? ???????? . - ??? ??????? ????????? ???? ????? ?? ?- ancestor
????? ?? ??? species ???. - ??? ????? i????? ui ???? ???? ?????? ??? ?????
???' ????? -
- $u_i = \sum_{k \neq i} \frac{D_{ik}}{n-2}$
- ?? ??? ?????? ?? ???????? ?? ???? ????? ??????
(minimum-evolution criterion), ???????? i ?- j
?????? ?- cluster ???? ??? ?? ????????? ?? ???
?????? ???? ?? - $D_{ij} - u_i - u_j$ ??? ???? ?????. ??????? $d_{k,(ij)}$ ??
????? ????? ??????? ???' ????? ?? ???? ?????
???????? .
37 Neighbor Joining (NJ)
- Initialization: same as in UPGMA
- 1. Initialize n clusters with the given species, one species per cluster.
- 2. Set the size of each cluster to 1.
- 3. In the output tree T, assign a leaf for each species.
38 Neighbor Joining (NJ)
- Iteration
- 1. For each species i, compute $u_i = \sum_{k \neq i} \frac{D_{ik}}{n-2}$
- 2. Choose the i and j for which $D_{ij} - u_i - u_j$ is smallest.
- 3. Join clusters i and j to a new cluster (ij), with a corresponding node in T. Calculate the branch lengths from i and j to the new node as:
  - $d_{i,(ij)} = \frac{1}{2} D_{ij} + \frac{1}{2}(u_i - u_j)$
  - $d_{j,(ij)} = \frac{1}{2} D_{ij} + \frac{1}{2}(u_j - u_i)$
- 4. Compute the distance between the new cluster and each other cluster k:
  - $D_{(ij),k} = \frac{D_{ik} + D_{jk} - D_{ij}}{2}$
39 Neighbor Joining (NJ)
- Iteration
- 5. Delete clusters i and j from the tables, and replace them by (ij).
- 6. If more than two nodes (clusters) remain, go back to 1. Otherwise, connect the two remaining nodes by a branch of length $D_{ij}$.
- ??? ????? ?? NJ ??? O(n²) , ??? UPGMA.
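The NJ iteration can be sketched as follows, using the formulas for $u_i$, the selection criterion $D_{ij} - u_i - u_j$, the branch lengths, and the updated distances exactly as listed in the steps. The function name, the edge-list output format, and the labels `N0`, `N1`, ... for internal nodes are assumptions made for this example. The sketch rescans all pairs in every round and is written for readability rather than speed.

```python
def neighbor_joining(D, names):
    """Return the unrooted NJ tree as a list of (node, node, branch_length)."""
    nodes = list(names)
    dist = {a: {b: D[i][j] for j, b in enumerate(names) if i != j}
            for i, a in enumerate(names)}
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: u_i for every current node.
        u = {a: sum(dist[a].values()) / (n - 2) for a in nodes}
        # Step 2: pair minimizing D_ij - u_i - u_j.
        pairs = ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:])
        i, j = min(pairs, key=lambda p: dist[p[0]][p[1]] - u[p[0]] - u[p[1]])
        # Step 3: join i and j under a new node (label is hypothetical).
        new = f"N{counter}"
        counter += 1
        dij = dist[i][j]
        edges.append((i, new, 0.5 * dij + 0.5 * (u[i] - u[j])))
        edges.append((j, new, 0.5 * dij + 0.5 * (u[j] - u[i])))
        # Steps 4-5: distances from the new node; drop i and j from the tables.
        others = [k for k in nodes if k not in (i, j)]
        dist[new] = {}
        for k in others:
            dk = (dist[i][k] + dist[j][k] - dij) / 2
            dist[new][k] = dk
            dist[k][new] = dk
            del dist[k][i], dist[k][j]
        del dist[i], dist[j]
        nodes = others + [new]
    # Step 6: connect the two remaining nodes.
    a, b = nodes
    edges.append((a, b, dist[a][b]))
    return edges


# Additive distances from a tree with leaf branches 1, 2, 4, 5 and inner branch 3.
D = [[0, 3, 8, 9],
     [3, 0, 9, 10],
     [8, 9, 0, 9],
     [9, 10, 9, 0]]
for edge in neighbor_joining(D, ["a", "b", "c", "d"]):
    print(edge)
```

Because the example matrix is additive, the branch lengths returned match those of the tree the distances were read from.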
40 Feng-Doolittle algorithm
- Sequence-sequence alignments: the usual pairwise alignment.
- Sequence-group: the highest-scoring pairwise alignment between the sequence and a member of the group determines the sequence-group alignment.
- Group-group: again, the best pairwise sequence alignment among all pairs determines the alignment.
- After an alignment is completed, gap symbols are replaced with a neutral X character.