Title: Clustering methods
1Clustering methods
Objective: Given n taxa and the distance d(i,j)
between every pair of taxa i and j, how can one
determine the best-fitting phylogenetic tree for
the data?
2Simplest solution
For every possible tree topology with n leaves,
determine the intermediate ancestors that require
the minimum number of mutations, then keep the
best topology. This is infeasible because the
number of possible binary trees grows faster than
exponentially with n.
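To make the growth concrete, the sketch below counts binary tree topologies on n labeled leaves using the standard closed forms (2n-5)!! for unrooted and (2n-3)!! for rooted trees. The function names are illustrative and do not come from the slides.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ..., with k!! = 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_binary_trees(n: int) -> int:
    """Unrooted binary topologies on n >= 3 labeled leaves: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n - 5)


def num_rooted_binary_trees(n: int) -> int:
    """Rooted binary topologies on n >= 2 labeled leaves: (2n - 3)!!."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial(2 * n - 3)


for n in (4, 10, 20):
    print(n, num_unrooted_binary_trees(n), num_rooted_binary_trees(n))
```

Already at n = 10 there are 2,027,025 unrooted topologies, which is why exhaustive evaluation is not an option for realistic data sets.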
3Tree building methods
Two methods of classification
4Definitions
5Definitions
- A tree is an undirected, connected, acyclic graph.
- Nodes x and y are immediate neighbors if there is an
undirected edge between x and y.
- Because a tree is connected and acyclic, there is a
unique path between any two distinct nodes (vertices).
- A rooted tree has a distinguished node called the root.
- A leaf, or external node, has no children.
6Definitions
- Non-leaf nodes are called internal nodes.
- The depth of a tree is one less than the maximum
number of nodes on a path from the root to a leaf.
- An ordered tree is a rooted tree in which the
children of each internal node are ordered: it
matters whether a child is the leftmost child,
the second child, etc.
- A tree is binary if every node has at most two
children; otherwise it is multifurcating.
7Ultrametric trees
- If the distance d(i,j) between every pair of
leaves i, j of a tree T is simply the sum of the
edge weights along the path connecting i and j,
then d is called an additive tree metric.
- If, in addition, the path length from the root r
of T to every leaf of T is identical, then d is
called an ultrametric.
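Both properties can be tested directly on a distance matrix without building the tree. The sketch below uses the standard three-point condition for ultrametrics (in every triple of leaves the two largest distances are equal) and the four-point condition for additive metrics (in every quadruple the two largest of the three pairwise sums are equal). The function names and example matrices are illustrative assumptions.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Three-point condition: the two largest distances in each triple agree."""
    for i, j, k in combinations(range(len(d)), 3):
        _, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True


def is_additive(d, tol=1e-9):
    """Four-point condition: the two largest pairwise sums in each quadruple agree."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted((d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Leaves A, B, C, D with root-to-leaf path length 3 everywhere.
ultrametric_d = [[0, 2, 6, 6],
                 [2, 0, 6, 6],
                 [6, 6, 0, 4],
                 [6, 6, 4, 0]]

# Quartet AB|CD with pendant edges 1, 2, 1, 3 and internal edge 1.
additive_d = [[0, 3, 3, 5],
              [3, 0, 4, 6],
              [3, 4, 0, 4],
              [5, 6, 4, 0]]

print(is_ultrametric(ultrametric_d), is_additive(ultrametric_d))
print(is_ultrametric(additive_d), is_additive(additive_d))
```

The second matrix is additive but not ultrametric, which matches the definition: every ultrametric is additive, but not the other way round.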
8Ultrametric trees
- Condition: the evolution rate is constant across
all taxa involved in the construction of the
phylogenetic tree.
- Result: the expected number of substitutions
from the root to any leaf is the same; i.e., a
constant rate of nucleotide substitution gives
rise to an ultrametric tree.
9Ultrametric trees
10Clustering Algorithms
- Objective: repeatedly cluster the data by
grouping the closest elements.
- Used for phylogeny construction and for grouping
similar results from gene expression microarray
data.
11Clustering Algorithms
- Distance-based methods
  - UPGMA
  - WPGMA
  - Fitch-Margoliash
- Character-based (criterion-based) methods
  - Maximum Likelihood
12UPGMA
Pair group method: clusters are repeatedly
amalgamated in pairs. UPGMA (Unweighted Pair Group
Method with Arithmetic mean) starts from a
distance matrix D that holds the sequence
alignment distance between every pair of
sequences. The algorithm works under the
hypothesis of an ultrametric tree.
13UPGMA
Algorithm
INPUT: n x n distance matrix D.
1) Initialize set C to consist of the n initial
singleton clusters {1}, ..., {n}.
2) Initialize the function dist(c,d) on C by
defining, for all {i} and {j} in C,
dist({i},{j}) = D(i,j)
14UPGMA
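The transcript lists only the initialization steps, so the sketch below completes the procedure under an assumption: it follows the common textbook formulation in which the two closest clusters are merged, the new node is placed at half their distance, and the distance from the merged cluster to any other cluster is the mean over all leaf pairs. The function name, the nested-tuple tree encoding and the example matrix are illustrative and not taken from the slides.

```python
from itertools import combinations


def upgma(D, labels):
    """Return a nested tuple (left, right, height) built from distance matrix D.

    Assumes the size-proportional averaging rule; leaves are plain labels.
    """
    n = len(labels)
    # cluster id -> (subtree, number of leaves in the cluster)
    clusters = {i: (labels[i], 1) for i in range(n)}
    dist = {frozenset(p): D[p[0]][p[1]] for p in combinations(range(n), 2)}
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda p: dist[frozenset(p)])
        d_ab = dist[frozenset((a, b))]
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to every remaining cluster.
        for c in clusters:
            dist[frozenset((next_id, c))] = (
                size_a * dist[frozenset((a, c))]
                + size_b * dist[frozenset((b, c))]
            ) / (size_a + size_b)
        # Under the ultrametric hypothesis the new node sits at height d_ab / 2.
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2.0), size_a + size_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


D = [[0, 2, 6, 6],
     [2, 0, 6, 6],
     [6, 6, 0, 4],
     [6, 6, 4, 0]]
print(upgma(D, ["A", "B", "C", "D"]))
```

On this ultrametric input the result groups A with B at height 1, C with D at height 2, and joins both groups at height 3.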
15WPGMA
- Modification of UPGMA.
- Weight the clusters by their size.
- If the clusters amalgamated at each step are of
roughly the same size, then WPGMA is essentially
UPGMA.
- Algorithm
- INPUT: n x n distance matrix D.
- 1) Initialize set C to consist of the n initial
singleton clusters {1}, ..., {n}.
- 2) Initialize the function dist(c,d) on C by
defining, for all {i} and {j} in C,
dist({i},{j}) = D(i,j)
16WPGMA
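The difference between the two pair-group variants lies only in how the distance from a merged cluster to a third cluster is updated. The sketch below contrasts a plain mean with a size-weighted mean and shows that they coincide when the merged clusters have equal size. Naming of the two rules is not uniform across textbooks, so the functions are named after what they compute; the names and numbers are illustrative assumptions.

```python
def simple_mean_update(d_ac, d_bc):
    """Distance from the merged cluster (a, b) to c, ignoring cluster sizes."""
    return (d_ac + d_bc) / 2.0


def size_weighted_update(d_ac, d_bc, size_a, size_b):
    """Distance from the merged cluster (a, b) to c, weighted by cluster size."""
    return (size_a * d_ac + size_b * d_bc) / (size_a + size_b)


# Equal cluster sizes: both rules give the same value.
print(simple_mean_update(4.0, 8.0), size_weighted_update(4.0, 8.0, 3, 3))

# Unequal cluster sizes: the larger cluster pulls the weighted mean towards it.
print(simple_mean_update(4.0, 8.0), size_weighted_update(4.0, 8.0, 1, 3))
```

With sizes 1 and 3 the weighted rule returns 7.0 instead of 6.0, so the two methods can produce different trees once cluster sizes diverge.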
17Fitch-Margoliash
This method modifies WPGMA in how branch lengths
are determined, and hence produces the same
topology as WPGMA. It repeatedly determines the
two closest clusters a and b, temporarily groups
all other clusters into c, and uses the
three-leaf relations among a, b and c to
determine the branch lengths x and y.
NOTE: Produces an unrooted tree.
18Fitch-Margoliash
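For three clusters a, b and c joined at a single internal node, the pairwise distances determine the three branch lengths uniquely: d(a,b) = x + y, d(a,c) = x + z and d(b,c) = y + z. The sketch below solves these equations and, as an assumption about how the composite cluster c is handled, takes the distance from a or b to c as the plain mean over the clusters grouped into c. Function names and the example matrix are illustrative.

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Solve d_ab = x + y, d_ac = x + z, d_bc = y + z for (x, y, z)."""
    x = (d_ab + d_ac - d_bc) / 2.0  # branch leading to a
    y = (d_ab + d_bc - d_ac) / 2.0  # branch leading to b
    z = (d_ac + d_bc - d_ab) / 2.0  # branch leading to the composite cluster c
    return x, y, z


def mean_distance_to_rest(D, i, rest):
    """Average distance from taxon i to the taxa temporarily grouped into c."""
    return sum(D[i][k] for k in rest) / len(rest)


# Additive example: pendant edges A=1, B=2, C=1, D=3, internal edge 1.
D = [[0, 3, 3, 5],
     [3, 0, 4, 6],
     [3, 4, 0, 4],
     [5, 6, 4, 0]]
a, b, rest = 0, 1, [2, 3]
d_ac = mean_distance_to_rest(D, a, rest)
d_bc = mean_distance_to_rest(D, b, rest)
print(three_point_branch_lengths(D[a][b], d_ac, d_bc))
```

Here x and y come out as 1 and 2, the pendant branch lengths of A and B in the tree that generated the matrix.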
20Maximum Likelihood
- This approach turns the phylogenetic problem
inside out.
- It searches for the evolutionary model, including
the tree itself, that has the highest likelihood
of producing the observed data.
- The likelihood is derived for each base position
in the alignment.
- The likelihood is calculated in terms of the
probability that the pattern of variation at a
site would be produced by a particular
substitution process, given a particular tree and
the overall observed base frequencies.
- The likelihood of a site is the sum of the
probabilities of each possible reconstruction of
substitutions under a particular substitution
process.
21Maximum Likelihood
The likelihoods for all the sites are multiplied
to give an overall likelihood of the tree (i.e.,
the probability of the data given a tree and a
substitution process). A good tree will have
many sites with high likelihood, so that their
product is high. If there is no phylogenetic
signal in the data, all random trees will have
comparable likelihoods.
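In practice the product over sites is usually computed as a sum of logarithms, because multiplying many small probabilities underflows in floating point. The sketch below illustrates this with made-up per-site likelihoods; the values and the function name are assumptions for illustration only.

```python
import math


def tree_log_likelihood(site_likelihoods):
    """Sum of per-site log-likelihoods, equal to the log of their product."""
    return sum(math.log(p) for p in site_likelihoods)


# Four hypothetical per-site likelihoods.
site_likelihoods = [0.02, 0.015, 0.03, 0.01]
product = math.prod(site_likelihoods)
log_l = tree_log_likelihood(site_likelihoods)
print(product, math.exp(log_l))

# With many sites the raw product underflows, the log form does not.
many_sites = [1e-5] * 100
print(math.prod(many_sites))
print(tree_log_likelihood(many_sites))
```

Comparing trees by log-likelihood gives the same ranking as comparing by likelihood, since the logarithm is monotonic.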
22Maximum Likelihood
The substitution model should be optimized to fit
the observed data. Examples:
- Transition bias shows up as a large number of
sites that contain only purines or only
pyrimidines. A model that assumes no such bias
will therefore perform poorly.
- If a large fraction of the sites have a single
base and another large fraction have equal base
frequencies, a model that assumes all sites
evolve at the same rate will be less accurate
than a model that allows rate heterogeneity.
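A quick way to look for these patterns is to classify the columns of an alignment. The sketch below counts constant columns, variable columns that contain only purines (A, G) or only pyrimidines (C, T), and columns that mix the two classes. The toy alignment, the category names and the function name are illustrative assumptions, and gaps or ambiguity codes are not handled.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def classify_columns(alignment):
    """Count alignment columns by the kind of variation they show."""
    counts = {"constant": 0, "transition_only": 0, "transversion": 0}
    for column in zip(*alignment):
        bases = set(column)
        if len(bases) == 1:
            counts["constant"] += 1
        elif bases <= PURINES or bases <= PYRIMIDINES:
            # Variable, but every base is in the same chemical class.
            counts["transition_only"] += 1
        else:
            counts["transversion"] += 1
    return counts


alignment = ["ACGTAC",
             "GCGTAT",
             "ACATCC"]
print(classify_columns(alignment))
```

A strong excess of transition-only columns over mixed columns suggests a model with a separate transition rate, and a large share of constant columns next to highly variable ones suggests allowing rate heterogeneity.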
23Maximum Likelihood
Drawback: The biggest disadvantage of ML is that
it is computationally expensive. It is impractical
to perform a complete search that simultaneously
optimizes the substitution model and the tree for
a given data set.
Alternative: A better option is to first estimate
the ML parameters of the substitution model. This
is applied iteratively: search for better ML
trees, re-estimate the parameters, then search for
better trees again.
24Maximum Likelihood
- Felsenstein's model
- Builds a phylogenetic tree from sequence data
whose likelihood is a local maximum.
- Determine the optimal branch lengths for the
topology.
- Add new sequences one by one, making local
modifications to the topology.
- Optimize the new branch lengths again.
- For an unrooted tree T having m edges, there are
m possibilities for adjoining a new leaf taxon.
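The last point can be made concrete by enumerating the candidate topologies. In the sketch below an unrooted tree is an edge list, and a new leaf is attached by splitting one edge with a new internal node. The edge-list encoding, the node names and the function name are illustrative assumptions.

```python
def add_leaf_on_each_edge(edges, new_leaf, new_internal):
    """Return one candidate tree per edge, with new_leaf attached on that edge."""
    trees = []
    for index, (u, v) in enumerate(edges):
        rest = edges[:index] + edges[index + 1:]
        # Split edge (u, v) at new_internal and hang the new leaf off it.
        trees.append(rest + [(u, new_internal),
                             (new_internal, v),
                             (new_internal, new_leaf)])
    return trees


# The only unrooted binary tree on three leaves: a star with center "x".
three_leaf_tree = [("A", "x"), ("B", "x"), ("C", "x")]
four_leaf_trees = add_leaf_on_each_edge(three_leaf_tree, "D", "y")
print(len(four_leaf_trees), [len(t) for t in four_leaf_trees])

five_leaf_trees = [t5
                   for t4 in four_leaf_trees
                   for t5 in add_leaf_on_each_edge(t4, "E", "z")]
print(len(five_leaf_trees))
```

Each insertion adds two edges, so a tree on n leaves has 2n-3 edges and the number of candidates grows with every added taxon, which is why only local modifications are evaluated at each step.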
25Maximum Likelihood
- Felsenstein's model assumes:
- n nucleotide sequences are given, each of the
same length m.
- No insertions or deletions have occurred in
constructing the phylogenetic tree.
- The evolutionary process is a reversible Markov
process whose substitution rate matrix is
determined by nucleotide-specific substitution
rates.
- Base changes at different sites of the length-m
oligonucleotide are independent events (an
assumption that is not biologically valid).
26Likelihood of a tree
The likelihood of this tree can be calculated
by filling in all possible sequences for the
internal nodes. Even under the assumption that
there are no insertions or deletions, this leads
to a combinatorial explosion.
SOLUTION: Consider each site independently of
the others. This drastically reduces the
complexity of the calculation.
27Likelihood of a tree Pr[data | tree]
Consider the following tree.
Using the independence assumption, the likelihood
of the complete tree is
L = L(1) x L(2) x ... x L(m),
where L(k) is the likelihood of the tree at site k.
Thus, to obtain the likelihood of a tree with m
independent sites, one has to multiply the
likelihoods of the m site-specific trees.
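A minimal sketch of this calculation follows. It sums over all reconstructions at the internal nodes of one site with the usual bottom-up recursion and then multiplies the per-site values. The slides do not name a substitution model, so the Jukes-Cantor model with uniform base frequencies is an assumption here, as are the tree encoding (an internal node is a tuple of (child, branch length) pairs, a leaf is a sequence name), the branch lengths and the sequences.

```python
import math

BASES = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of base x becoming y along branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def conditional_likelihoods(node, site):
    """Map each base b to Pr[leaf data below node | node carries b]."""
    if isinstance(node, str):  # leaf: the observed base has probability 1
        return {b: 1.0 if site[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = conditional_likelihoods(child, site)
        for b in BASES:
            # Sum over every base the child could carry.
            result[b] *= sum(jc_prob(b, c, t) * below[c] for c in BASES)
    return result


def site_likelihood(tree, site):
    """Likelihood of one alignment column, with uniform root frequencies."""
    root = conditional_likelihoods(tree, site)
    return sum(0.25 * root[b] for b in BASES)


def tree_likelihood(tree, sequences):
    """Product of the site likelihoods over all m columns."""
    m = len(next(iter(sequences.values())))
    likelihood = 1.0
    for k in range(m):
        site = {name: seq[k] for name, seq in sequences.items()}
        likelihood *= site_likelihood(tree, site)
    return likelihood


tree = (((("s1", 0.1), ("s2", 0.1)), 0.2), ("s3", 0.3))
sequences = {"s1": "ACGT", "s2": "ACGA", "s3": "GCGT"}
print(tree_likelihood(tree, sequences))
```

For long alignments the product would be accumulated as a sum of logarithms to avoid underflow; the plain product is kept here to mirror the statement above.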
28Summary
Distance-based methods such as UPGMA are fast but
less accurate. Character-based methods are more
accurate but computationally expensive. Choose
the method according to the requirements.