Bioinformatics Algorithms and Data Structures - PowerPoint PPT Presentation

UNIVERSITY OF SOUTH CAROLINA. College of Engineering & Information Technology

Provided by: john244
Learn more at: https://www.cse.sc.edu

1
Bioinformatics Algorithms and Data Structures
  • Chapter 17.1 Strings and Evolutionary Trees
  • Lecturer Dr. Rose
  • Slides by Dr. Rose
  • April 8, 2003

2
Strings and Evolutionary Trees
  • the great Tree of Life fills with its dead and
    broken branches the crust of the earth, and
    covers the surface with its ever-branching and
    beautiful ramifications. - Darwin

3
Strings and Evolutionary Trees
  • There are three competing approaches to creating
    classification trees:
  • Evolutionary taxonomy
  • Numerical taxonomy
  • Cladistics

4
Strings and Evolutionary Trees
  • Evolutionary taxonomy
  • classification informed by evolutionary theory
  • Fills in internal nodes corresponding to common
    ancestors.

5
Strings and Evolutionary Trees
  • Numerical taxonomy (Phenetics)
  • Studies the relationship between groups of
    organisms based on the degree of similarity
  • Similarity can be in terms of molecular,
    phenotypic or anatomical data.
  • The resulting graph, which is a tree-like network
    is called a phenogram.
  • Maximum Likelihood method.

6
Strings and Evolutionary Trees
  • Cladistics
  • Characterized by character-state methods, e.g.,
    maximum parsimony.
  • Guiding principle
  • not all character states shared by organisms
    provide evolutionary information
  • Important to restrict consideration to
    evolutionarily significant states.

7
Strings and Evolutionary Trees
  • Cladistics continued
  • Willi Hennig Society
  • http://www.cladistics.org/

8
Strings and Evolutionary Trees
  • Tree building algorithms
  • Distance-based methods
  • Input: distance data such as sequence edit
    distance
  • Output: weighted tree with pairwise distances
    matching evolutionary distance
  • We will consider data that is
  • Ultrametric (section 17.1)
  • Additive but not ultrametric (section 17.2)
  • Nonadditive data (no section)

9
Strings and Evolutionary Trees
  • Tree building algorithms continued
  • Maximum-parsimony methods
  • Character-based methods
  • Input: character data (often aligned sequences)
  • Output: tree with
  • input taxa at leaves
  • inferred taxa at internal nodes
  • Goal: minimize the total cost of mutations,
    i.e., maximize parsimony.
  • Seeks a tree that has the minimum cost over all
    possible trees
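
As a rough illustration of "total cost of mutations", the sketch below scores one fixed candidate tree with Fitch's small-parsimony algorithm. Fitch's algorithm is not named in these slides; it is swapped in here as a standard way to count mutations on a given rooted binary tree, assuming unit cost per mutation. The tree encoding (nested 2-tuples of taxon names), the function names, and the toy alignment are all hypothetical. The search over all possible trees is not shown.

```python
def fitch_cost(tree, states):
    """Return (candidate state set, mutation count) for one alignment column.

    tree   -- a taxon name (leaf) or a 2-tuple (left, right) of subtrees
    states -- dict mapping taxon name -> character at this column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_cost(left, states)
    right_set, right_cost = fitch_cost(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra mutation here.
        return common, left_cost + right_cost
    # The children disagree: one mutation is needed on some branch.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch cost over every column of an aligned set of sequences."""
    length = len(next(iter(alignment.values())))
    total = 0
    for site in range(length):
        column = {taxon: seq[site] for taxon, seq in alignment.items()}
        total += fitch_cost(tree, column)[1]
    return total


# Hypothetical toy data: four taxa, three aligned sites.
alignment = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "ACT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

A maximum-parsimony method would prefer whichever candidate tree yields the lower score.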

10
Ultrametric trees and distances
  • Before discussing ultrametric distances, consider
    additive distances
  • Defn. Additive distances are distances which can
    be fitted to an unrooted tree such that all
    pairwise taxa distances are equal to the sum of
    the branch lengths connecting them. (Table and
    figure from http://imbs.massey.ac.nz/Research/MolEvol/Farside/DNA/00312.html)
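
The definition above can be sketched in code: given an unrooted tree with branch lengths, the additive distance between two taxa is the sum of the branch lengths on the path joining them. The tree below (taxa A to D, internal nodes x and y, and all branch lengths) is a made-up example, not the table from the cited page, and the function name is hypothetical.

```python
from collections import defaultdict


def tree_distances(edges, taxa):
    """Pairwise path lengths between taxa in a weighted, unrooted tree.

    edges -- iterable of (node_u, node_v, branch_length)
    taxa  -- names of the leaf nodes of interest
    """
    adjacency = defaultdict(list)
    for u, v, w in edges:
        adjacency[u].append((v, w))
        adjacency[v].append((u, w))

    def distances_from(source):
        # A tree has exactly one path between two nodes, so a plain
        # traversal accumulates the correct branch-length sums.
        dist = {source: 0}
        stack = [source]
        while stack:
            node = stack.pop()
            for neighbor, w in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + w
                    stack.append(neighbor)
        return dist

    result = {}
    for a in taxa:
        dist = distances_from(a)
        result[a] = {b: dist[b] for b in taxa}
    return result


# Hypothetical tree: A and B hang off x, C and D hang off y.
edges = [("A", "x", 2), ("B", "x", 3), ("x", "y", 1), ("C", "y", 4), ("D", "y", 5)]
d = tree_distances(edges, ["A", "B", "C", "D"])
print(d["A"]["C"])  # 2 + 1 + 4 = 7
```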

11
Ultrametric trees and distances
  • Ultrametric distances are more constrained than
    additive distances.
  • Defn. Ultrametric distances are distances that
  • fit a tree so that the distance between any two
    taxa is equal to the sum of the branches joining
    them.
  • for any three taxa i, j and k, the two largest
    distances are equal, i.e.,
  • If $d_{ik} > d_{jk}$ then $d_{ik} = d_{ij}$
  • Else if $d_{ik} > d_{ij}$ then $d_{ik} = d_{jk}$
  • Else $d_{ij} = d_{jk}$
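
The three-point condition above can be checked mechanically. The following is a minimal sketch, assuming D is given as a symmetric list of lists with a zero diagonal; the function name, the optional tolerance, and both example matrices are hypothetical.

```python
from itertools import combinations


def is_ultrametric(D, tol=0.0):
    """True if, for every triple of taxa, the two largest distances are equal.

    D   -- symmetric n-by-n matrix (list of lists); symmetry is assumed,
           not verified
    tol -- allowed difference between the two largest values
    """
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        _, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True


# Hypothetical matrices for illustration.
good = [[0, 2, 6, 6],
        [2, 0, 6, 6],
        [6, 6, 0, 4],
        [6, 6, 4, 0]]
bad = [[0, 2, 5],
       [2, 0, 6],
       [5, 6, 0]]
print(is_ultrametric(good), is_ultrametric(bad))
```

Sorting each triple avoids writing out the three if/else branches separately; the test is equivalent to them.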

12
Ultrametric trees and distances
  • Q: What is an ultrametric tree?
  • An ultrametric tree T for an n-by-n symmetric
    distance matrix D is a rooted tree with the
    following properties:
  • T has n leaves, each labeled by a unique row of D.
  • Each internal node is labeled by an entry from D
    and has at least two children.
  • The numbers labeling internal nodes strictly
    decrease along any path from the root to a leaf.
  • For any two leaves i and j, D(i, j) is the label
    of the least common ancestor of i and j in T.
  • The distances in D must be ultrametric.
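
To make these properties concrete, here is a small sketch that stores a labeled tree, checks that labels strictly decrease from the root, and reads D(i, j) off the least common ancestors. The encoding is an assumption made for illustration: a leaf is an integer row index and an internal node is a tuple (label, list_of_children). The example tree is hypothetical.

```python
def leaves(tree):
    """All leaf indices below (or equal to) this node."""
    if not isinstance(tree, tuple):
        return [tree]
    _, children = tree
    return [leaf for child in children for leaf in leaves(child)]


def labels_decrease(tree, parent_label=float("inf")):
    """True if internal labels strictly decrease on every root-to-leaf path."""
    if not isinstance(tree, tuple):
        return True
    label, children = tree
    return label < parent_label and all(
        labels_decrease(child, label) for child in children
    )


def matrix_from_tree(tree, n):
    """Build D so that D[i][j] is the label of the least common ancestor."""
    D = [[0] * n for _ in range(n)]

    def fill(node):
        if not isinstance(node, tuple):
            return
        label, children = node
        groups = [leaves(child) for child in children]
        # Leaves in different child subtrees have this node as their LCA.
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for i in groups[a]:
                    for j in groups[b]:
                        D[i][j] = D[j][i] = label
        for child in children:
            fill(child)

    fill(tree)
    return D


# Hypothetical tree: root labeled 6, with subtrees labeled 2 and 4.
tree = (6, [(2, [0, 1]), (4, [2, 3])])
print(labels_decrease(tree))
print(matrix_from_tree(tree, 4))
```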

13
Ultrametric trees and distances
  • Consider the following example from the textbook
  • Verify the ultrametric condition, i.e., for any
    three taxa, two of the distances will be the same
    and at least as large as the third distance.

14
Ultrametric trees and distances
  • Interpretation of ultrametric trees as
    evolutionary trees
  • The leaves are the existing OTUs (operational
    taxonomic units)
  • The internal nodes are the divergence events
  • A divergence event is a point where the
    evolutionary histories of two OTUs split.

15
Ultrametric trees and distances
  • Q: If taxa A and B diverge at time t, which
    statements are implied by the meaning of
    divergence?
  • A is the ancestor of B
  • B is the ancestor of A
  • Neither A nor B is an ancestor of the other.
  • Neither A nor B has a living ancestor.

16
Ultrametric trees and distances
  • If the branching order and the time of each
    divergence are known
  • The label at each internal node is the time of
    the divergence event corresponding to that node.
  • The labels from the root to leaves must be
    strictly increasing.
  • D(i, j) is the time that taxa i and j diverged.
  • The author calls T a min-ultrametric tree.

17
Ultrametric trees and distances
  • Equivalently, if the evolutionary history is
    known
  • The label at each internal node is the time that
    has passed since the divergence event
    corresponding to that node.
  • The labels from the root to leaves must be
    strictly decreasing.
  • D(i, j) is the time since taxa i and j diverged.
  • T is an ultrametric tree for D.

18
Ultrametric trees and distances
  • Defn. The symmetric matrix D defines an
    ultrametric distance iff for any three indices i,
    j and k, the two largest distances are equal,
    i.e.,
  • If $d_{ik} > d_{jk}$ then $d_{ik} = d_{ij}$
  • Else if $d_{ik} > d_{ij}$ then $d_{ik} = d_{jk}$
  • Else $d_{ij} = d_{jk}$
  • Call D ultrametric if it defines ultrametric
    distances.
  • Thm. Distance matrix D has an ultrametric tree
    iff D is an ultrametric matrix. (Proof page 451)

19
Ultrametric trees and distances
  • Proof.
  • (⇒) If T is ultrametric then D is ultrametric.
  • If T is ultrametric (draw T)
  • each internal node v is labeled D(i, j), where i
    and j are leaves and v is their least common
    ancestor.
  • For any three leaves i, j, k in T, let u be
    their least common ancestor; then
  • u is labeled by at least two of D(i, j), D(i, k),
    D(j, k), i.e., two of these are equal.
  • Furthermore, the third value labels u or a
    descendant of u, so it is no larger than the
    other two.
  • Therefore D is ultrametric.

20
Ultrametric trees and distances
  • Proof.
  • (⇐) If D is ultrametric then there is an
    ultrametric tree T for D.
  • If D is ultrametric
  • The number of distinct entries d in each row i
    defines the number of nodes from the root to leaf
    i.
  • Each node in this path is labeled in decreasing
    order with a distinct label.
  • Any node v on this path labeled D(i, j) must be
    the least common ancestor of leaves i and j.

21
Ultrametric trees and distances
  • Proof (⇐), continued
  • The path to leaf i partitions the n-1 remaining
    leaves into d-1 classes.
  • Each distinct node on the path to i is labeled by
    the distance from i to the leaves in that
    partition.
  • Example

22
Ultrametric trees and distances
  • Proof (⇐), continued
  • We want to recursively find the ultrametric tree
    for each of the d-1 partitions and then combine
    them.
  • Consider the partition defined by internal node
    v.
  • Let j be a leaf contained in this partition.
  • Let l be some other leaf. There are three cases:
  • l is in the same partition as j.
  • l is in a partition between i and node v.
  • l is in a partition between node v and the root.

23
Ultrametric trees and distances
  • Proof (⇐), continued
  • The three cases. Let i = A, j = F.
  • l is in the same partition as j. Example: l = B
  • l is in a partition between i and v. Example:
    l = D
  • l is in a partition between v and the root.
    Example: l = C

24
Ultrametric trees and distances
  • Proof (⇐), continued
  • Case 1 (i = A, j = F, l = B)
  • D(i, j) = D(i, l), thus D(j, l) ≤ D(i, j). Why?
  • ⇒ So we can add the subtree containing j and l,
    knowing that D(j, l) is correct.

25
Ultrametric trees and distances
Typos in text
  • Proof (⇐), continued
  • Case 2 (i = A, j = F, l = D)
  • D(i, l) < D(i, j), thus D(i, j) = D(j, l)
  • ⇒ So we can add the subtree at v containing j,
    knowing that D(j, l) is correct, i.e., v is
    labeled correctly.

26
Ultrametric trees and distances
Typos in text
  • Proof (⇐), continued
  • Case 3 (i = A, j = F, l = C)
  • D(i, l) > D(i, j), thus D(i, l) = D(j, l)
  • ⇒ So we can add the subtree at v containing j,
    knowing that D(j, l) is correct, i.e., it labels
    their least common ancestor.

27
Ultrametric trees and distances
  • Proof (⇐), continued
  • In each of the three cases, the ultrametric tree
    defined by v can be correctly attached to v.
  • Hence, using recursion, we can construct the
    ultrametric tree T for D.
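
The recursive construction in the proof can be sketched as follows. This is an illustration under stated assumptions, not the textbook's algorithm verbatim: D is assumed to be ultrametric already (no validation is done), a leaf is an integer row index, an internal node is a tuple (label, list_of_children), and the function name is hypothetical. No claim is made about running time.

```python
def build_ultrametric_tree(D, leaves=None):
    """Build a labeled tree for an ultrametric matrix D.

    Follows the idea of the proof: pick a leaf i, split the remaining
    leaves into classes by their distance to i, build each class
    recursively, and hang the class subtrees on the path to i.
    """
    if leaves is None:
        leaves = list(range(len(D)))
    if len(leaves) == 1:
        return leaves[0]

    i, rest = leaves[0], leaves[1:]
    classes = {}
    for j in rest:
        classes.setdefault(D[i][j], []).append(j)

    # Grow the path upward from leaf i: the smallest distance labels the
    # node closest to i, the largest distance labels the root.
    subtree = i
    for dist in sorted(classes):
        side = build_ultrametric_tree(D, classes[dist])
        if isinstance(side, tuple) and side[0] == dist:
            # The class's own root carries the same label, so merge it
            # into this node to keep labels strictly decreasing.
            children = [subtree] + side[1]
        else:
            children = [subtree, side]
        subtree = (dist, children)
    return subtree


# Hypothetical ultrametric matrix for four taxa.
D = [[0, 2, 6, 6],
     [2, 0, 6, 6],
     [6, 6, 0, 4],
     [6, 6, 4, 0]]
print(build_ultrametric_tree(D))
```

The merge step is what produces nodes with more than two children when three or more leaves are mutually equidistant.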

28
Ultrametric trees and distances
  • Gusfield presents two related theorems
  • Thm. If D is ultrametric, then the ultrametric
    tree for D is unique.
  • This is a consequence of the fact that the nodes
    that appear on the path to a given leaf i must
    appear in every ultrametric tree for D.
  • Thm. If D is ultrametric, then the ultrametric
    tree for D can be constructed in $O(n^2)$ time.

29
Ultrametric trees and distances
  • Given ultrametric data we can:
  • Reconstruct evolutionary history.
  • Find the relative divergence times
  • Find the exact tree topology
  • Q: How do we get ultrametric data?
  • Consider the molecular clock theory.

30
Molecular Clock Theory
  • Proposed by Emile Zuckerkandl and Linus Pauling.
  • Idea: accepted mutations occur at a constant rate
    for a given protein.
  • There are three important issues:
  • Accepted mutations are those that still allow
    the protein to function properly.
  • Lethal mutations will be selected against and
    should not accumulate.

31
Molecular Clock Theory
  • The theoretical clock rate is protein specific.
  • Different proteins will have different clocks.
  • Gusfield mentions hemoglobin and cytochrome c.
  • Both are stable and similar in all mammals.
  • However, hemoglobin mutates faster than
    cytochrome c.

32
Molecular Clock Theory
  • The implication is that the number of mutations
    is proportional to length of the time interval.
  • Requirement: the interval must be long enough.
  • The length of an interval can be measured by the
    number of mutations.
  • This requires that the clock be calibrated.

33
Molecular Clock Theory
  • Assumptions
  • all DNA mutates at the same rate
  • Observed accepted rate differences are due to
    different constraints
  • Natural selection at the organism level
  • Physical chemistry at the molecular level

34
Molecular Clock Theory
  • Q: How would we use the molecular clock to
    collect ultrametric data?
  • Find a common protein for two taxa of interest
  • Determine the number of accepted mutational
    differences, say k.
  • By molecular clock theory each taxon contributed
    k/2 accepted mutations.
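
A minimal sketch of these steps, assuming the two protein sequences are already aligned to the same length and that each mismatched position counts as one accepted mutation (repeated changes at the same site would be undercounted). The function name and both sequences are made up.

```python
def accepted_differences(seq_a, seq_b):
    """Count positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


# Hypothetical aligned fragments of the same protein in two taxa.
k = accepted_differences("MKTAYIAK", "MKSAYLAK")

# Under the molecular clock, each lineage is credited with half of k.
per_lineage = k / 2
print(k, per_lineage)
```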

35
Molecular Clock Theory
  • Do this for each pair of n taxa.
  • Result: n choose 2 numbers satisfying the
    requirement for an ultrametric tree.
  • The ultrametric tree will describe the true
    evolutionary history for the n taxa.
  • Great huh?
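
Repeating the pairwise count over all taxa gives the n choose 2 numbers mentioned above. The sketch below builds that matrix and tests whether the two largest distances agree for every triple. The helper names and the three toy sequences are hypothetical; the sequences were chosen so that the condition holds, which real data need not do.

```python
from itertools import combinations


def mismatch_count(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Nested dict of pairwise mismatch counts, keyed by taxon name."""
    names = sorted(seqs)
    D = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        D[a][b] = D[b][a] = mismatch_count(seqs[a], seqs[b])
    return D


def satisfies_three_point(D):
    """True if the two largest distances are equal for every triple."""
    for i, j, k in combinations(sorted(D), 3):
        _, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if middle != largest:
            return False
    return True


# Hypothetical aligned sequences for three taxa.
seqs = {"A": "AAAAAA", "B": "AAAATT", "C": "CCAAGG"}
D = distance_matrix(seqs)
print(D["A"]["B"], D["A"]["C"], D["B"]["C"])
print(satisfies_three_point(D))
```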

36
Molecular Clock Theory
  • If only the molecular clock theory were correct.
  • In the real world, the situation is complicated.
  • Molecular clock rates can and do diverge.
  • Sometimes there are common mutation rates.
  • Sometimes the mutation rates diverge.

37
Additive Distance Trees