Clustering methods - PowerPoint PPT Presentation


Provided by: sri121


Transcript and Presenter's Notes

Title: Clustering methods

1
Clustering methods
Objective: Given n taxa, where the distance d(i,j)
between every pair of taxa i, j is available, how can
one determine the best-fitting phylogenetic tree for
the data?
2
Simplest solution
Determine the minimum number of mutations
yielding intermediate ancestors for all possible
tree topologies having n leaves. Infeasible,
since the number of possible binary trees grows
faster than exponentially in n.
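
To give a sense of the growth, here is a minimal sketch that counts rooted binary tree topologies on n labeled leaves using the standard closed form (2n-3)!!. The formula is a well-known result that the slides do not state, and the function name is hypothetical.

```python
def num_rooted_binary_trees(n: int) -> int:
    """Number of rooted binary tree topologies on n labeled leaves: (2n-3)!!."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1
    # The k-th leaf (k >= 2) can be attached in 2k-3 ways: on any of the
    # 2k-4 edges of the current rooted tree, or above its root.
    for k in range(2, n + 1):
        count *= 2 * k - 3
    return count


for n in (3, 4, 10, 20):
    print(n, num_rooted_binary_trees(n))
```

Already at n = 10 there are 34,459,425 topologies, so scoring every topology is hopeless for realistic data sets.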
3
Tree building methods
Two methods of classification
4
Definitions
5
Definitions

A tree is an undirected, connected, acyclic
graph.
Thus there is a unique path between any two
distinct nodes (vertices) in a tree.
Nodes x, y are immediate neighbors if there is an
undirected edge between x and y.
A rooted tree has a distinguished node called the
root.
A leaf or external node has no children.

6
Definitions

Non-leaf nodes are called internal nodes.
The depth of a tree is one less than the maximum
number of nodes on a path from the root to leaf.
An ordered tree is a rooted tree such that the
children of internal nodes are ordered: it
matters whether a child is the leftmost
child, the second child, etc.
A tree is binary if every node has at most two
children;
otherwise it is multifurcating.

7
Ultrametric trees

If the value d(i,j) of the distance function
between any two leaves i, j
of a tree T is simply the sum of edge weights
along the path
connecting i and j, then d is called an
additive metric (or tree metric).
If in addition the path length from the root r of
tree T to every leaf
of T is identical, then d is called an
ultrametric.
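
One way to test a distance matrix for this property is the three-point condition: for every triple of leaves, the two largest of the three pairwise distances must be equal. This condition is a standard characterization of ultrametrics and is not taken from the slides. The sketch below assumes d is a symmetric matrix given as a list of lists; the function name and tolerance are illustrative.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Check the three-point condition on a symmetric distance matrix."""
    n = len(d)
    for i, j, k in combinations(range(n), 3):
        # Sort the three pairwise distances of the triple.
        _, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        # Ultrametric: the two largest distances coincide.
        if abs(largest - middle) > tol:
            return False
    return True


clock_like = [[0, 2, 6],
              [2, 0, 6],
              [6, 6, 0]]
additive_only = [[0, 3, 7],
                 [3, 0, 8],
                 [7, 8, 0]]
print(is_ultrametric(clock_like))     # True
print(is_ultrametric(additive_only))  # False
```

The second matrix fits a tree with branch lengths 1, 2 and 6, so it is additive, but its leaves are not equidistant from any root.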

8
Ultrametric trees

Conditions
The evolution rate is constant across all taxa
involved in the construction of the phylogenetic tree.
Result
The expected number of substitutions from the
root to any leaf is
the same,
i.e., a constant evolution rate for nucleotide
substitution gives rise
to an ultrametric tree.

9
Ultrametric trees
10
Clustering Algorithms

Objectives
Repeatedly cluster the data by grouping the
closest elements.
Used for
Phylogeny construction.
Group similar results from gene expression
microarray data.

11
Clustering Algorithms

Distance-Based Methods:
UPGMA
WPGMA
Fitch-Margoliash
Character-Based & Criterion-Based Methods:
Maximum Likelihood

12
UPGMA
Pair Group Method: pairs of clusters are repeatedly
amalgamated. UPGMA stands for Unweighted Pair Group
Method with Arithmetic mean. The distance between
sequences (from a sequence alignment) has been
recorded in a distance matrix D. The algorithm works
under the hypothesis of an ultrametric tree.
13
UPGMA
Algorithm
INPUT: n x n distance matrix D.
1) Initialize set C to consist of n initial
singleton clusters {1}, ..., {n}.
2) Initialize function dist(c,d) on C by defining,
for all {i} and {j} in C,
dist({i},{j}) = D(i,j)
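
The remaining steps are not in the transcript. As a hedged sketch, the loop below continues from the initialization above using the standard average-linkage update, in which the distance from a merged cluster to any other cluster is the mean over all leaf pairs (equivalently, a size-weighted average of the two old distances). Some texts attach the names UPGMA and WPGMA the other way around, so treat the update rule, the function name, and the data layout as assumptions rather than the presenter's exact algorithm.

```python
from itertools import combinations


def upgma(D, labels):
    """Return (tree, root_height); tree is a nested tuple of labels.

    D is a symmetric n x n distance matrix given as a list of lists.
    """
    # Step 1: one singleton cluster per taxon: id -> (subtree, size).
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    # Step 2: dist on singleton clusters is just D.
    dist = {frozenset((i, j)): D[i][j]
            for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)
    height = 0.0
    while len(clusters) > 1:
        # Find and remove the closest pair of active clusters.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Under the ultrametric hypothesis the new node sits halfway.
        height = dist.pop(pair) / 2.0
        # Distance from the merged cluster to every remaining cluster:
        # size-weighted average of the two old distances.
        for c in clusters:
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            dist[frozenset((next_id, c))] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b))
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree, height


D = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 10],
     [10, 10, 10, 0]]
tree, root_height = upgma(D, ["A", "B", "C", "D"])
print(tree, root_height)
```

On this ultrametric input the sketch joins A and B first, then C, then D, and places the root at height 5.0, half the largest distance.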
14
UPGMA
15
WPGMA

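Modification of UPGMA
Weight the clusters by their size when averaging
distances to a newly merged cluster.
(Naming conventions differ: many texts call the
size-weighted update UPGMA and the plain average of
the two cluster distances WPGMA.)
If the sizes of the amalgamated clusters are roughly
the same, then
WPGMA is essentially UPGMA.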
Algorithm
INPUT: n x n distance matrix D.
1) Initialize set C to consist of n initial
singleton clusters {1}, ..., {n}.
2) Initialize function dist(c,d) on C by defining,
for all {i} and {j} in C,
dist({i},{j}) = D(i,j)
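
The pair-group variants differ only in how the distance from a merged cluster (a joined with b) to another cluster c is updated. Because texts disagree on which variant is called UPGMA and which WPGMA, the sketch below names the two rules by what they compute; both function names are hypothetical.

```python
def size_weighted_update(d_ac, d_bc, size_a, size_b):
    """Average over all leaf pairs: each merged cluster counts by its size."""
    return (size_a * d_ac + size_b * d_bc) / (size_a + size_b)


def simple_average_update(d_ac, d_bc):
    """Plain mean of the two cluster distances, ignoring cluster sizes."""
    return (d_ac + d_bc) / 2.0


# Unequal sizes: the two rules disagree.
print(size_weighted_update(4.0, 8.0, 3, 1))  # 5.0
print(simple_average_update(4.0, 8.0))       # 6.0

# Equal sizes: the two rules coincide.
print(size_weighted_update(4.0, 8.0, 2, 2))  # 6.0
```

The last call illustrates the remark above that the two methods agree when the amalgamated clusters have the same size.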
16
WPGMA
17
Fitch-Margoliash
This method modifies WPGMA in branch length
determination and hence produces the same topology
as WPGMA. It repeatedly determines the closest two
clusters a, b, temporarily groups all other
clusters into c, and uses the three-leaf
observation (the pairwise distances among a, b and c
determine the lengths of the three branches meeting
at their common node) to determine the branch
lengths x, y. NOTE: Produces an unrooted tree.
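
The branch-length step can be sketched with the standard three-point formulas: if branches of length x, y and z lead from one internal node to a, b and c, then d(a,b) = x + y, d(a,c) = x + z and d(b,c) = y + z, which can be solved for x, y and z. The function name is hypothetical, and the sketch assumes that d(a,c) and d(b,c) have already been computed as average distances from a and b to the taxa grouped into c.

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Solve x + y = d_ab, x + z = d_ac, y + z = d_bc for (x, y, z)."""
    x = (d_ab + d_ac - d_bc) / 2.0  # branch leading to a
    y = (d_ab + d_bc - d_ac) / 2.0  # branch leading to b
    z = (d_ac + d_bc - d_ab) / 2.0  # branch leading to c
    return x, y, z


x, y, z = three_point_branch_lengths(3.0, 7.0, 8.0)
print(x, y, z)  # 1.0 2.0 6.0
```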
18
Fitch-Margoliash
19
(No Transcript)
20
Maximum Likelihood

This approach turns the phylogenetic problem
inside out.
It searches for the evolutionary model, including
the tree itself,
that has the highest likelihood of producing the
observed data.
The likelihood is computed for each base position
(site) in the alignment.
The likelihood is calculated in terms of the
probability that the
pattern of variation at a site would be produced
by a particular
substitution process, given a particular tree and
the overall observed
base frequencies.
The likelihood of a site becomes the sum of the
probabilities of each possible
reconstruction of substitutions under a particular
substitution process.

21
Maximum Likelihood
The likelihoods for all the sites are multiplied
to give an overall likelihood of the tree (i.e.,
the probability of the data given a tree and a
substitution process). A good tree will have
many sites with high likelihood, so that their
product is high. If there is no phylogenetic
signal in the data, all random trees will have
comparable likelihoods.
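
In practice the product of many small per-site likelihoods underflows floating-point arithmetic, so implementations usually sum log-likelihoods instead. The sketch below illustrates this with made-up per-site values; the numbers and the function name are assumptions chosen only for illustration.

```python
import math


def log_likelihood(site_likelihoods):
    """Sum of per-site log-likelihoods (log of the product over sites)."""
    return sum(math.log(p) for p in site_likelihoods)


# 200 hypothetical sites, each with likelihood 0.001.
sites = [1e-3] * 200

# The direct product is 1e-600, which underflows to zero in double precision.
print(math.prod(sites))

# The log-likelihood remains a perfectly ordinary number.
print(log_likelihood(sites))
```

Comparing trees by log-likelihood gives the same ranking as comparing by likelihood, since the logarithm is monotonic.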
22
Maximum Likelihood
The substitution model should be optimized to fit
the observed data. Examples: Transition bias is
revealed by a large number of sites that
include only purines or only pyrimidines; a
model that assumes no bias will therefore perform
poorly. If a large fraction of the sites have a
single base, and another large fraction have
equal base frequencies, a model which assumes all
sites evolve equally will be less accurate than a
model that allows rate heterogeneity.
23
Maximum Likelihood
Drawback: ML's biggest disadvantage is that it is
computationally expensive. It is impractical to
perform a complete search that simultaneously
optimizes the substitution model and the tree for a
given data set. Alternative: A better option is to
first obtain an ML estimate of the substitution
model, then proceed iteratively: search for better
ML trees, re-estimate the parameters, then search
for better trees again.
24
Maximum Likelihood

Felsenstein's model
Produces a phylogenetic tree from sequence data
whose likelihood is a local maximum.
Determine optimal branch lengths for the topology.
Add new sequences one by one, making local
modifications to the topology.
Again optimize the new branch lengths.
For an unrooted tree T having m edges there are m
possibilities of adjoining a new leaf taxon.

25
Maximum Likelihood

Felsenstein's model assumes:
n nucleotide sequences are given, each of the
same length m.
No insertions or deletions have occurred in
constructing the phylogenetic tree.
The evolutionary process is a reversible Markov
process whose substitution rate matrix is
determined by nucleotide-specific substitution
rates.
Base changes at different sites of the length-m
oligonucleotide are independent events (not
biologically valid).

26
Likelihood of a tree
The likelihood of this tree can be calculated
by filling in all possible sequences for internal
nodes. Even with the assumption that there are no
insertions or deletions, this leads to a
combinatorial explosion.
SOLUTION: Consider each site independently from
the others. This drastically reduces the complexity
of the calculation.
27
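Likelihood of a tree: Pr[data | tree]. Consider
the following tree.
Using the independence assumption, the likelihood of
the complete tree is the product of the per-site
likelihoods:
Pr[data | tree] = Pr[site 1 | tree] x ... x Pr[site m | tree]
Thus, to obtain the likelihood of a tree with m
independent sites, one has to multiply the
likelihoods of m site-specific trees.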
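
A minimal sketch of the per-site calculation follows. It assumes the Jukes-Cantor substitution model (one reversible model among many, chosen here only for simplicity), equal base frequencies at the root, and a hypothetical tree encoding in which a leaf is a sequence string and an internal node is a pair of (child, branch length) tuples. The recursion sums over internal-node bases from the leaves upward, which avoids enumerating every assignment of bases to internal nodes.

```python
import math

BASES = "ACGT"


def jc69_prob(x, y, t):
    """Jukes-Cantor probability that base x becomes y along a branch of
    length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partial(node, site):
    """Return {base: Pr[leaf data below node | node carries base]}."""
    if isinstance(node, str):  # leaf: the observed base is certain
        return {b: 1.0 if node[site] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:  # the two subtrees evolve independently
        below = partial(child, site)
        for b in BASES:
            result[b] *= sum(jc69_prob(b, c, t) * below[c] for c in BASES)
    return result


def site_likelihood(tree, site):
    """Likelihood of one alignment column, with uniform root frequencies."""
    root = partial(tree, site)
    return sum(0.25 * root[b] for b in BASES)


def log_likelihood(tree, n_sites):
    """Log of the product of the m site-specific likelihoods."""
    return sum(math.log(site_likelihood(tree, s)) for s in range(n_sites))


# Hypothetical three-taxon tree over four aligned sites.
inner = (("ACGT", 0.1), ("ACGA", 0.1))
tree = ((inner, 0.05), ("ACGG", 0.2))
print(log_likelihood(tree, 4))
```

Each site costs time proportional to the number of nodes, so the whole tree is scored in time linear in both the number of taxa and the number of sites.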
28
Summary
Distance-based methods like UPGMA are fast but
less accurate. Character-based methods are more
accurate but computationally expensive. Choose the
method according to requirements.
