Clustering methods - PowerPoint PPT Presentation

Provided by: sri121

1
Clustering methods
Objective: Given n taxa, where the distance d(i,j)
between taxa i and j is available, how can one
determine the best-fitting phylogenetic tree for
the data?
2
Simplest solution
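Determine the minimum number of mutations
yielding intermediate ancestors for every possible
tree topology having n leaves. This is infeasible,
since the number of possible binary trees grows
faster than exponentially in n: there are
(2n-3)!! = 1·3·5···(2n-3) rooted binary trees on
n labeled leaves.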
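To make the growth concrete, here is a minimal sketch that counts binary tree topologies on n labeled leaves using the standard double-factorial formulas. The function names are illustrative and do not come from the slides.

```python
def num_rooted_binary_trees(n: int) -> int:
    """Count rooted binary tree topologies on n labeled leaves: (2n-3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= k
    return count


def num_unrooted_binary_trees(n: int) -> int:
    """Count unrooted binary tree topologies on n labeled leaves: (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, num_rooted_binary_trees(n), num_unrooted_binary_trees(n))
```

Already at n = 10 there are more than 34 million rooted topologies, which is why exhaustive enumeration is not an option.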
3
Tree building methods
Two methods of classification
4
Definitions
5
Definitions
  • A tree is an undirected, connected, acyclic
    graph.
  • Because a tree is connected and acyclic, there
    is a unique path between any two distinct nodes
    (vertices).
  • Nodes x, y are immediate neighbors if there is an
    undirected edge between x and y.
  • A rooted tree has a distinguished node called the
    root.
  • A leaf or external node has no children.

6
Definitions
  • Non-leaf nodes are called internal nodes.
  • The depth of a tree is one less than the maximum
    number of nodes on a path from the root to a leaf.
  • An ordered tree is a rooted tree in which the
    children of internal nodes are ordered: it makes
    a difference whether a child is the leftmost
    child, the second child, etc.
  • A tree is binary if every node has at most two
    children; otherwise it is multifurcating.
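The definitions above can be sketched as a small data structure. This is only an illustration: the `Node` class and the helper names are assumptions, not something defined in the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a rooted, ordered tree; children are kept left to right."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A leaf (external node) has no children.
        return not self.children


def depth(root: Node) -> int:
    """Maximum number of nodes on a root-to-leaf path, minus one."""
    if root.is_leaf():
        return 0
    return 1 + max(depth(child) for child in root.children)


def is_binary(root: Node) -> bool:
    """True if every node has at most two children."""
    return len(root.children) <= 2 and all(is_binary(c) for c in root.children)


# Example: ((A, B), C) is binary with depth 2; a three-way split is not binary.
binary_tree = Node("root", [Node("u", [Node("A"), Node("B")]), Node("C")])
star_tree = Node("root", [Node("A"), Node("B"), Node("C")])
```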

7
Ultrametric trees
  • If the value d(i,j) of the distance function
    between all leaves i, j of a tree T is simply the
    sum of edge weights along the path connecting i
    and j, then d is called an additive tree metric.
  • If, in addition, the path length from the root r
    of T to every leaf of T is identical, then d is
    called ultrametric.
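Both properties can be tested directly on a distance matrix, without building a tree. The sketch below uses the standard three-point condition (in every triple of taxa, the two largest distances are equal) for ultrametricity and the four-point condition for additivity. It assumes D is a symmetric matrix of a metric, given as a list of lists; the function names and the tolerance parameter are illustrative.

```python
from itertools import combinations


def is_ultrametric(D, tol=1e-9):
    """Three-point condition: in each triple the two largest distances tie."""
    for i, j, k in combinations(range(len(D)), 3):
        _, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True


def is_additive(D, tol=1e-9):
    """Four-point condition: of the three pairings of any four taxa,
    the two largest distance sums are equal."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted((D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances from the clock-like tree ((A,B),(C,D)): ultrametric and additive.
clock_like = [[0, 2, 8, 8],
              [2, 0, 8, 8],
              [8, 8, 0, 4],
              [8, 8, 4, 0]]

# Quartet with unequal branch lengths: additive but not ultrametric.
unequal_rates = [[0, 3, 8, 9],
                 [3, 0, 9, 10],
                 [8, 9, 0, 9],
                 [9, 10, 9, 0]]
```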

8
Ultrametric trees
  • Condition: the evolution rate is constant across
    all taxa involved in the construction of the
    phylogenetic tree.
  • Result: the expected number of substitutions
    from the root to any leaf is the same, i.e., a
    constant rate of nucleotide substitution gives
    rise to an ultrametric tree.

9
Ultrametric trees
10
Clustering Algorithms
  • Objective: repeatedly cluster the data by
    grouping the closest elements.
  • Used for phylogeny construction.
  • Used for grouping similar results from gene
    expression microarray data.

11
Clustering Algorithms
  • Distance Based Methods: UPGMA, WPGMA,
    Fitch-Margoliash
  • Character Based & Criterion Based Methods:
    Maximum Likelihood

12
UPGMA
Pair group method: clusters are repeatedly
amalgamated in pairs.
UPGMA (Unweighted Pair Group Method with
Arithmetic mean): the distances between sequences,
obtained from sequence alignment, are given in a
distance matrix D. The algorithm works under the
hypothesis of an ultrametric tree.
13
UPGMA
Algorithm
INPUT: n × n distance matrix D.
1) Initialize the set C to consist of the n initial
singleton clusters {1}, ..., {n}.
2) Initialize the function dist(c,d) on C by
defining, for all {i} and {j} in C,
dist({i},{j}) = D(i,j)
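The transcript lists only the two initialization steps. A minimal sketch of the full procedure follows, with the remaining loop (merge the closest pair, recompute distances, repeat) filled in from the commonly described form of UPGMA rather than from the slides. It assumes the size-weighted average as the distance update, which is the usual textbook definition of UPGMA; the function name and the nested-tuple tree output are illustrative.

```python
from itertools import combinations


def upgma(D, labels):
    """Cluster taxa by repeatedly merging the closest pair of clusters.

    D is a symmetric n x n distance matrix (list of lists) and labels
    names the taxa. Returns (tree, root_height), where tree is a nested
    tuple of labels.
    """
    n = len(labels)
    # Steps 1 and 2: singleton clusters and their initial distances.
    # Each cluster id maps to (subtree, number of leaves).
    clusters = {i: (labels[i], 1) for i in range(n)}
    dist = {frozenset(p): D[p[0]][p[1]] for p in combinations(range(n), 2)}
    next_id, height = n, 0.0
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda p: dist[frozenset(p)])
        (tree_a, size_a), (tree_b, size_b) = clusters[a], clusters[b]
        # Under the ultrametric hypothesis the new node sits halfway.
        height = dist[frozenset((a, b))] / 2.0
        # Distance from the merged cluster to every other cluster:
        # average over all leaf pairs (size-weighted average).
        for c in clusters:
            if c not in (a, b):
                dist[frozenset((next_id, c))] = (
                    size_a * dist[frozenset((a, c))]
                    + size_b * dist[frozenset((b, c))]
                ) / (size_a + size_b)
        del clusters[a], clusters[b]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, height


example_D = [[0, 2, 8, 8],
             [2, 0, 8, 8],
             [8, 8, 0, 4],
             [8, 8, 4, 0]]
example_tree, example_height = upgma(example_D, ["A", "B", "C", "D"])
```

On this ultrametric input the sketch first joins A and B, then C and D, and finally joins the two pairs at height 4.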
14
UPGMA
15
WPGMA
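  • Modification of UPGMA.
  • Weight the clusters by their size when averaging
    distances. (Naming varies between texts: in the
    more common convention, the size-weighted average
    is the rule called UPGMA, and the simple average
    is called WPGMA.)
  • If the sizes of the two clusters being
    amalgamated are roughly the same, then WPGMA is
    essentially UPGMA.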
  • Algorithm
  • INPUT: n × n distance matrix D.
  • 1) Initialize the set C to consist of the n
    initial singleton clusters {1}, ..., {n}.
  • 2) Initialize the function dist(c,d) on C by
    defining, for all {i} and {j} in C,

dist({i},{j}) = D(i,j)
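The two pair-group variants differ only in how the distance from a newly merged cluster a ∪ b to another cluster c is computed. The slides' own update formula is not in the transcript, so the sketch below simply shows the two candidate rules side by side. The functions are named for what they compute rather than for either acronym, and all names are illustrative.

```python
def simple_average(d_ac, d_bc, size_a, size_b):
    """Average the two cluster distances, ignoring cluster sizes."""
    return (d_ac + d_bc) / 2.0


def size_weighted_average(d_ac, d_bc, size_a, size_b):
    """Average the two cluster distances, weighting each by its cluster size."""
    return (size_a * d_ac + size_b * d_bc) / (size_a + size_b)


# Equal cluster sizes: both rules agree.
equal_sizes = (simple_average(4.0, 8.0, 2, 2),
               size_weighted_average(4.0, 8.0, 2, 2))

# Unequal sizes (3 leaves versus 1): the rules diverge.
unequal_sizes = (simple_average(4.0, 8.0, 3, 1),
                 size_weighted_average(4.0, 8.0, 3, 1))
```

When the merged clusters have the same size the two rules coincide, which matches the remark that the two methods are then essentially the same.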
16
WPGMA
17
Fitch-Margoliash
This method modifies WPGMA only in how branch
lengths are determined, and hence produces the
same topology as WPGMA. It repeatedly determines
the two closest clusters a, b, temporarily groups
all other clusters into c, and uses the fact that
the three pairwise distances among a, b, c
determine the three branch lengths meeting at
their common internal node to obtain the branch
lengths x, y. NOTE: Produces an unrooted tree.
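The branch-length step can be sketched with the standard three-point formula: if a, b, c hang off one internal node with branch lengths x, y, z, then d(a,b) = x + y, d(a,c) = x + z and d(b,c) = y + z, which can be solved for the three lengths. The function name is illustrative, and the sketch assumes the caller has already averaged the distances from a and from b to the taxa grouped into c.

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Solve x + y = d_ab, x + z = d_ac, y + z = d_bc for (x, y, z)."""
    x = (d_ab + d_ac - d_bc) / 2.0  # branch from a to the internal node
    y = (d_ab + d_bc - d_ac) / 2.0  # branch from b to the internal node
    z = (d_ac + d_bc - d_ab) / 2.0  # branch from c to the internal node
    return x, y, z


# Example: distances 5, 7, 8 give branch lengths 2, 3, 5.
example_lengths = three_point_branch_lengths(5.0, 7.0, 8.0)
```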
18
Fitch-Margoliash
20
Maximum Likelihood
  • This approach turns the phylogenetic problem
    inside out.
  • It searches for the evolutionary model, including
    the tree itself, that has the highest likelihood
    of producing the observed data.
  • The likelihood is derived for each base position
    in the alignment.
  • The likelihood is calculated in terms of the
    probability that the pattern of variation at a
    site would be produced by a particular
    substitution process, given a particular tree and
    the overall observed base frequencies.
  • The likelihood for a site is the sum of the
    probabilities of each possible reconstruction of
    substitutions under a particular substitution
    process.

21
Maximum Likelihood
The likelihoods for all the sites are multiplied
to give an overall likelihood of the tree (i.e.,
the probability of the data given a tree and a
substitution process). A good tree will have
many sites with high likelihood, so that their
product is high. If there is no phylogenetic
signal in the data, all random trees will have
comparable likelihoods.
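In practice the product of many small site likelihoods underflows floating-point arithmetic, so implementations usually sum logarithms instead. A minimal sketch of that idea follows; the function name and the example values are illustrative and not taken from the slides.

```python
import math


def tree_log_likelihood(site_likelihoods):
    """Sum of log site likelihoods; equals the log of their product."""
    return sum(math.log(p) for p in site_likelihoods)


# 1000 sites, each with likelihood 1e-4: the plain product underflows
# to 0.0, while the log-likelihood stays an ordinary finite number.
sites = [1e-4] * 1000
naive_product = math.prod(sites)
log_likelihood = tree_log_likelihood(sites)
```

Comparing trees by log-likelihood gives the same ranking as comparing them by likelihood, because the logarithm is monotonic.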
22
Maximum Likelihood
The substitution model should be optimized to fit
the observed data. Examples: Transition bias
shows up as a large number of sites that contain
only purines or only pyrimidines; a model that
assumes no such bias will then perform poorly.
If a large fraction of the sites have a single
base, and another large fraction have equal base
frequencies, a model that assumes all sites
evolve at the same rate will be less accurate
than a model that allows rate heterogeneity.
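One quick way to look for transition bias is to count transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse) between two aligned sequences. The sketch below is only a diagnostic illustration, not a substitution model; the function name is an assumption, and gaps or ambiguous characters are skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_transitions_transversions(seq1, seq2):
    """Return (transitions, transversions) between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue  # identical site, gap, or ambiguous base
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# A/G and C/T are transitions, T/A is a transversion.
example_counts = count_transitions_transversions("AACGT", "GATGA")
```

A transition count well above the transversion count suggests that a model with a separate transition rate fits better than one treating all substitutions alike.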
23
Maximum Likelihood
Drawback: ML's biggest disadvantage is that it is
computationally expensive. It is impractical to
perform a complete search that simultaneously
optimizes the substitution model and the tree for
a given data set. Alternative: A better option is
to first estimate the parameters of the
substitution model by ML. This is applied
iteratively: search for better ML trees, then
re-estimate the parameters, then search for
better trees again.
24
Maximum Likelihood
  • Felsenstein's model
  • Builds a phylogenetic tree from sequence data
    whose likelihood is a local maximum.
  • Determine optimal branch lengths for the
    topology.
  • Add new sequences one by one, making local
    modifications to the topology.
  • Again optimize the new branch lengths.
  • For an unrooted tree T having m edges, there are
    m possibilities for adjoining a new leaf taxon.

25
Maximum Likelihood
  • Felsenstein's model assumes:
  • n nucleotide sequences are given, each of the
    same length m.
  • No insertions or deletions have occurred in
    constructing the phylogenetic tree.
  • The evolutionary process is a reversible Markov
    process whose substitution rate matrix is
    determined by nucleotide-specific substitution
    rates.
  • Base changes at different sites of the length-m
    oligonucleotide are independent events (not
    biologically valid).

26
Likelihood of a tree
The likelihood of this tree can be calculated
by filling in all possible sequences for the
internal nodes. Even with the assumption that
there are no insertions or deletions, this leads
to a combinatorial explosion.
SOLUTION: Consider each site independently from
the others. This drastically reduces the
complexity of the calculation.
27
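Likelihood of a tree
Pr[data | tree]
Consider the following tree (figure not in the
transcript). Using the independence assumption,
the likelihood of the complete tree is the
product of the per-site likelihoods:
$\Pr[\text{data} \mid T] = \prod_{k=1}^{m} \Pr[\text{data}_k \mid T]$
Thus, to obtain the likelihood of a tree with m
independent sites, one has to multiply the
likelihoods of the m site-specific trees.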
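A minimal sketch of this per-site computation is given below, using Felsenstein's pruning recursion to sum over all internal-node bases. Several things here are assumptions rather than content from the slides: the Jukes-Cantor substitution model with uniform base frequencies, the tree encoding (a leaf is a taxon name, an internal node is a tuple of (child, branch length) pairs), and all function and taxon names.

```python
import math

BASES = "ACGT"


def jc69(t):
    """Jukes-Cantor transition probabilities for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def partial(node, site):
    """Pr[observed bases below node | node carries base x], for each x."""
    if isinstance(node, str):  # leaf: the observed base is fixed
        return [1.0 if base == site[node] else 0.0 for base in BASES]
    result = [1.0] * 4
    for child, t in node:
        P, below = jc69(t), partial(child, site)
        for x in range(4):
            # Sum over every base y the child could carry.
            result[x] *= sum(P[x][y] * below[y] for y in range(4))
    return result


def site_likelihood(tree, site):
    """Likelihood of one alignment column, with uniform root frequencies."""
    return sum(0.25 * value for value in partial(tree, site))


def tree_likelihood(tree, alignment):
    """Product of the site likelihoods over all m alignment columns."""
    m = len(next(iter(alignment.values())))
    likelihood = 1.0
    for k in range(m):
        column = {name: seq[k] for name, seq in alignment.items()}
        likelihood *= site_likelihood(tree, column)
    return likelihood


# Hypothetical three-taxon example: ((s1:0.1, s2:0.1):0.2, s3:0.3).
tree = (((("s1", 0.1), ("s2", 0.1)), 0.2), ("s3", 0.3))
alignment = {"s1": "AC", "s2": "AG", "s3": "AC"}
```

The recursion visits each node once per site, so the cost grows linearly with the number of taxa and sites instead of enumerating every assignment of sequences to internal nodes.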
28
Summary
Distance-based methods like UPGMA are fast but
less accurate. Character-based methods are more
accurate but computationally expensive. Choose
the method depending on the requirements.