Title: Molecular Evolution
1 Molecular Evolution
- Multiple alignment-based tree construction methods (parsimony)
- Statistical evaluation of phylogenetic trees: bootstrapping
EPFL Bioinformatics I 23 Jan 2006
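2 Parsimony: A simple example
[Figure: three candidate unrooted trees for the input sequences 1. AAG, 2. AAA, 3. GGA and 4. AGA, with internal nodes 5 and 6; each branch is labelled with its number of substitutions]
- Top tree: leaves 1. AAG and 2. AAA join node 5. AAA; leaves 3. GGA and 4. AGA join node 6. AGA. Total cost: 3
- Second tree: leaves 1. AAG and 3. GGA join node 5. AAA; leaves 2. AAA and 4. AGA join node 6. AAA. Total cost: 4
- Third tree: leaves 1. AAG and 4. AGA join node 5. AAA; leaves 2. AAA and 3. GGA join node 6. AAA. Total cost: 4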
The tree on the top explains the input sequences
with the fewest substitutions.
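The three totals can be checked by summing, over all branches, the number of positions at which the two end-point sequences differ. The following is a minimal sketch; the helper names (`hamming`, `tree_cost`) and the dictionary layout are illustrative assumptions, not part of the lecture material.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_cost(sequences, edges):
    """Sum of substitution counts over all branches (unit cost per substitution)."""
    return sum(hamming(sequences[u], sequences[v]) for u, v in edges)


leaves = {1: "AAG", 2: "AAA", 3: "GGA", 4: "AGA"}

# Each candidate: sequences at internal nodes 5 and 6, plus the list of branches.
candidates = {
    "(1,2)|(3,4)": ({5: "AAA", 6: "AGA"}, [(1, 5), (2, 5), (5, 6), (3, 6), (4, 6)]),
    "(1,3)|(2,4)": ({5: "AAA", 6: "AAA"}, [(1, 5), (3, 5), (5, 6), (2, 6), (4, 6)]),
    "(1,4)|(2,3)": ({5: "AAA", 6: "AAA"}, [(1, 5), (4, 5), (5, 6), (2, 6), (3, 6)]),
}

costs = {
    name: tree_cost({**leaves, **internal}, edges)
    for name, (internal, edges) in candidates.items()
}
print(costs)  # {'(1,2)|(3,4)': 3, '(1,3)|(2,4)': 4, '(1,4)|(2,3)': 4}
```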
3 Parsimony: overview
- Goal: to explain the MSA with a minimal number of mutational events, i.e. to find the tree with the minimal cost
- Input: a multiple sequence alignment (MSA)
- Major components:
  - A cost function for a tree given an MSA, which simultaneously defines the branch lengths
  - An algorithm which finds a tree with the minimal cost
- Output:
  - An unrooted tree (topology plus branch lengths)
  - A total cost
  - An ancestral sequence for each non-terminal node
4 Parsimony: cost function
Each column of a multiple sequence alignment (MSA) is treated separately. The total cost of a tree given an MSA is the sum of the costs for each column of the MSA. The total cost of a tree is also the sum of the costs for each of its branches (edges).
The cost for two identical residues is zero. The cost for a residue substitution a→b is either uniformly 1 (traditional parsimony) or depends on the residue types, s(a,b) ≥ 0 (weighted parsimony). Note that PAM or BLOSUM-type substitution matrices are unsuitable for parsimony because they contain mixed (negative and positive) scores.
Gap characters are treated like residues. This corresponds to a proportional gap penalty function of the form gap_weight = a × gap_length.
In the case of simple (traditional) parsimony, the total cost of the tree corresponds to the minimum number of mutational events required to generate the input MSA. Note, however, that an insertion of n residues is counted as n mutations, which is not necessarily biologically sound.
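As a rough illustration of the cost function, the sketch below computes the cost of a single branch under unit costs and under one possible weighting. The transition/transversion weights and the gap weight in `ts_tv_cost` are hypothetical choices made for this example only; the names `unit_cost`, `ts_tv_cost` and `branch_cost` are likewise assumptions.

```python
PURINES = {"A", "G"}
GAP = "-"


def unit_cost(a, b):
    """Traditional parsimony: 0 for identical residues, 1 for any substitution."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Hypothetical weighted scheme for DNA: transitions 1, transversions 2, gaps 2."""
    if a == b:
        return 0
    if GAP in (a, b):
        return 2  # assumed gap weight; the gap character is treated like a residue
    same_class = (a in PURINES) == (b in PURINES)
    return 1 if same_class else 2


def branch_cost(seq_u, seq_v, cost=unit_cost):
    """Cost of one branch: columns are scored independently and summed."""
    return sum(cost(a, b) for a, b in zip(seq_u, seq_v))


# A three-residue insertion is counted as three separate events.
print(branch_cost("AC---GT", "ACTTTGT"))      # 3
# The same A/C difference costs 1 under unit costs and 2 under the weighted scheme.
print(branch_cost("AAG", "CAG"))              # 1
print(branch_cost("AAG", "CAG", ts_tv_cost))  # 2
```

All costs here are non-negative, which is the property that log-odds matrices such as PAM or BLOSUM lack.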
5 Parsimony: algorithm
Problem 1: For a given column and tree topology, how to find the optimal residue assignments for the internal nodes? The solution is obtained by an efficient recursive algorithm, which simultaneously returns the total cost for this column and the cost for each edge (branch length).
Problem 2: How to find the optimal tree?
- Enumeration: Try all possible trees. Guaranteed to find the optimal tree. Infeasible for more than a few sequences (see below).
- Branch and bound: Try all trees which are possibly optimal. Trees which contain too costly sub-trees are successively excluded from the search. Guaranteed to find the optimal tree. Orders of magnitude faster than simple enumeration. Still infeasible for large numbers of sequences.
- Heuristic algorithms: For instance, first build a tree for a small subset of sequences using branch and bound, then add the remaining sequences successively. Fast, but not guaranteed to find the optimal tree.
Remember:
- 3 sequences: 1 unrooted tree
- 4 sequences: 3 unrooted trees
- 5 sequences: 15 unrooted trees
- 10 sequences: 2,027,025 unrooted trees
- 20 sequences: 221,643,095,476,699,771,875 unrooted trees
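The tree counts above follow the standard formula for strictly bifurcating unrooted trees with n labelled leaves, (2n-5)!! = 3 × 5 × ... × (2n-5). A small sketch reproduces them; the function name `count_unrooted_trees` is an illustrative assumption.

```python
def count_unrooted_trees(n):
    """Number of bifurcating unrooted tree topologies on n labelled leaves: (2n-5)!!"""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    for factor in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= factor
    return count


for n in (3, 4, 5, 10, 20):
    print(n, f"{count_unrooted_trees(n):,}")
```

Each added sequence multiplies the count by the number of branches it can attach to, which is why exhaustive enumeration stops being practical so quickly.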
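The slides do not name the recursive algorithm used for Problem 1. One well-known candidate for unit costs is Fitch's algorithm, sketched here under that assumption. The sketch performs only the bottom-up pass, so it returns the cost but not the ancestral residues or per-edge costs. The unrooted tree is rooted arbitrarily on its central edge, which does not change the score under symmetric costs. The nested-tuple tree representation and the function names are illustrative choices.

```python
def fitch_column(tree, column):
    """Bottom-up Fitch pass for one alignment column.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    column : dict mapping leaf name -> residue in this column
    Returns (set of candidate residues at this node, cost of the subtree).
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch_column(tree[0], column)
    right_set, right_cost = fitch_column(tree[1], column)
    common = left_set & right_set
    if common:
        # The children agree on at least one residue: no substitution needed here.
        return common, left_cost + right_cost
    # The children disagree: one substitution is required on one of the two branches.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree, alignment):
    """Total cost of a topology: columns are treated separately and summed."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_column(tree, column)[1]
    return total


alignment = {"1": "AAG", "2": "AAA", "3": "GGA", "4": "AGA"}

# Enumeration for four sequences: all three unrooted topologies.
topologies = {
    "(1,2)|(3,4)": (("1", "2"), ("3", "4")),
    "(1,3)|(2,4)": (("1", "3"), ("2", "4")),
    "(1,4)|(2,3)": (("1", "4"), ("2", "3")),
}
costs = {name: parsimony_cost(tree, alignment) for name, tree in topologies.items()}
best = min(costs, key=costs.get)
print(costs)  # {'(1,2)|(3,4)': 3, '(1,3)|(2,4)': 4, '(1,4)|(2,3)': 4}
print(best)   # (1,2)|(3,4)
```

For four sequences the enumeration strategy is trivial; the costs agree with the totals in the simple example.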
6 Statistical evaluation of trees: bootstrapping
[Figure: unrooted tree with leaves 1-5 and internal nodes 6-8]
- Motivation: Some branching patterns in a tree may be uncertain for statistical reasons (short sequences, small number of mutational events).
- Goal of bootstrapping: To assess the statistical robustness of each edge of the tree.
- Note that each edge divides the leaf nodes into two subsets. For instance, edge 7-8 divides the leaves into the subsets {1,2,3} and {4,5}. However, is this short edge statistically robust?
- Method: Try to generate trees from subsets of the input data as follows:
  - Randomly modify the input MSA by eliminating some columns and replacing them with copies of existing ones. This results in duplication of columns.
  - Compute a tree for each modified input MSA.
  - For each edge of the tree derived from the real MSA, determine the fraction of trees derived from modified MSAs which contain an edge that divides the leaves into the same subsets. This fraction is called the bootstrap value. Edges with low bootstrap values (e.g. < 0.9) are considered unreliable.
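The method above can be sketched as follows. Columns are drawn with replacement, a tree is built for each replicate, and each edge of the reference tree is scored by how often its leaf split reappears. The tree builder is passed in as a parameter. The `toy_quartet_splits` builder used here is a hypothetical stand-in that only handles four sequences and is not a real parsimony search; all function names are assumptions for illustration.

```python
import random


def resample_columns(alignment, rng):
    """Draw as many columns as the MSA has, with replacement (some drop out, some repeat)."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def canonical_split(side, all_leaves):
    """Represent an edge by the two leaf subsets it separates."""
    side = frozenset(side)
    return frozenset({side, frozenset(all_leaves) - side})


def bootstrap_values(alignment, build_splits, replicates=100, seed=0):
    """Fraction of replicate trees containing each split of the reference tree."""
    rng = random.Random(seed)
    reference = build_splits(alignment)
    counts = {split: 0 for split in reference}
    for _ in range(replicates):
        replicate_splits = build_splits(resample_columns(alignment, rng))
        for split in reference:
            if split in replicate_splits:
                counts[split] += 1
    return {split: n / replicates for split, n in counts.items()}


def toy_quartet_splits(alignment):
    """Hypothetical tree builder for exactly four sequences.

    Picks the pairing with the smallest summed within-pair Hamming distance
    and returns the single internal edge as a set of splits.
    """
    a, b, c, d = sorted(alignment)

    def dist(x, y):
        return sum(p != q for p, q in zip(alignment[x], alignment[y]))

    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    best = min(pairings, key=lambda p: dist(*p[0]) + dist(*p[1]))
    return {canonical_split(best[0], alignment)}


alignment = {"1": "AAG", "2": "AAA", "3": "GGA", "4": "AGA"}
values = bootstrap_values(alignment, toy_quartet_splits, replicates=200, seed=1)
for split, value in values.items():
    print(sorted(sorted(side) for side in split), value)
```

With only three columns the bootstrap value is expected to be unstable, which illustrates the "short sequences" motivation; a real analysis would plug in a proper tree-construction method for `build_splits`.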