Reconstruction on trees and Phylogeny 3

Slides: 19
Provided by: chris802
Transcript and Presenter's Notes

Title: Reconstruction on trees and Phylogeny 3


1
Reconstruction on trees and Phylogeny 3
Elchanan Mossel, U.C. Berkeley
mossel_at_stat.berkeley.edu, http://www.cs.berkeley.edu/mossel/
Supported by Microsoft Research and the Miller Institute
2
Phylogeny
  • A phylogeny is the true set of evolutionary
    relationships between groups of living things.

[Figure: example family tree with nodes Noah, Shem, Ham, Japheth, Cush, Mizraim, Kannan]
3
History of Phylogeny
  • Prehistory: animal kingdom or plant kingdom.
  • Intuitively
  • More scientifically: morphology, fossils, etc.
    (Darwin)
  • But: is a human more like a great ape or like a
    chimpanzee?

[Figure labels: No brain, can't move; Stupid, walks; Stupid, swims; Stupid, flies; Too smart, barely moves]
4
Molecular Phylogeny
  • Molecular phylogeny: based on DNA, RNA or protein
    sequences of organisms.
  • Rooted / unrooted trees: evolution from a common
    ancestor is modeled on a rooted tree. Usually
    we reconstruct unrooted trees.
  • Mutation mechanisms:
  • Substitutions
  • Transpositions
  • Insertions, deletions, etc.
  • We will only consider substitutions
  • and assume sequences are aligned.

[Figure: family tree annotated with aligned DNA sequences. Noah: acctga. Shem, Ham, Japheth: acctga, acctaa, acctga. Put, Cush, Mizraim, Kannan: acctga, acctga, agctga, acctga.]
5
Genetic substitution models and trees
  • Assumption 1: Letters of sequences (characters)
    evolve independently and identically.
  • Assumption 2: Trees are binary: all internal
    degrees are 3 (bifurcating speciation; results
    valid if degrees are at least 3).
  • Given a set of species (labeled vertices) X, an
    X-tree is a tree which has X as the set of
    leaves.
  • Two X-trees T1 and T2 are identical if there is a
    graph isomorphism between T1 and T2 that is the
    identity map on X.

[Figure: three X-trees on the leaves a, b, c, d, with internal vertices u, v, w and mutation matrices Me on the edges]
6
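Substitution model: finite state space
  • Finite set $A$ of information values ($|A| = 4$ for
    DNA).
  • Tree $T = (V, E)$ rooted at $r$.
  • Each vertex $v \in V$ has information $\sigma_v \in A$.
  • Each edge $e = (v, u)$, where $v$ is the parent of $u$, has a
    mutation matrix $M_e$ of size $|A| \times |A|$,
  • with entries $M_{i,j}(v,u) = P[\sigma_u = j \mid \sigma_v = i]$.
  • We will focus on the CFN model.
  • A character is $(\sigma_v)_{v \in T}$.
  • For each character $\sigma$, the data is
    $\sigma_{\partial T} = (\sigma_v)_{v \in \partial T}$,
  • where $\partial T$ is the boundary of the tree, $|\partial T| = n$.
  • We are given $k$ independent characters
    $\sigma^1_{\partial T}, \ldots, \sigma^k_{\partial T}$.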

7

A diagram
Length of sequence!
  • We want to know the number $k$ of characters needed to
    reconstruct the tree with $n$ leaves, given a
    range $[\theta_{\min}, \theta_{\max}]$ for the mutation rate $\theta$.

8

Phylogeny: conjectures and results
Statistical physics                        | Phylogeny      | Status / source
Binary tree in ordered phase               | $k = O(\log n)$  | conjecture
Binary tree unordered                      | $k = poly(n)$    | conjecture
Percolation, critical $\theta = 1/2$         | Random cluster | M-Steel 2003
Ising model, critical $\theta$: $2\theta^2 = 1$ | CFN            | M-2003
Sub-critical representation                | High mutation  | M-2003
Problems: How general? What is the critical
point? (extremality vs. spectral)
9
The CFN model
  • Cavender-Farris-Neyman model.
  • 2 data types: $+1$ and $-1$ (purine-pyrimidine).
  • Mutation along edge $e$: with probability $\theta(e)$, copy
    the data from the parent. Otherwise, choose $+1$ or $-1$ with
    probability ½ each, independently of everything else.
  • Theorem [CFN]: Suppose that for all $e$,
    $1 - \varepsilon > \theta(e) > \varepsilon > 0$.
  • Then, given $k$ characters of the process at the $n$
    leaves,
  • it is possible to reconstruct the underlying
    topology with probability $1 - \delta$ if
    $k = n^{O(-\log \varepsilon)}$.

Steel 94: trick to extend to general $M_e$, provided
that $\det(M_e) \notin [-1, -1+\varepsilon] \cup [-\varepsilon, \varepsilon] \cup [1-\varepsilon, 1]$.
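The CFN mutation rule on this slide can be sketched as a small simulation. This is only an illustration under stated assumptions: the tree is a complete binary tree, every edge uses the same copy probability `theta`, and the root value is drawn uniformly from +1/-1 (the slide does not specify the root distribution). The function name `simulate_cfn` is hypothetical.

```python
import random


def simulate_cfn(levels, theta, rng):
    """Sample one CFN character on a complete binary tree.

    Assumptions (not from the slides): uniform root value, the same
    theta on every edge. Returns the +1/-1 values at the 2**levels leaves.
    """
    values = [rng.choice((1, -1))]  # assumed uniform root value
    for _ in range(levels):
        next_values = []
        for parent in values:
            for _child in range(2):
                if rng.random() < theta:
                    # with probability theta, copy the parent's value
                    next_values.append(parent)
                else:
                    # otherwise draw a fresh uniform +1/-1 value
                    next_values.append(rng.choice((1, -1)))
        values = next_values
    return values


rng = random.Random(0)
leaves = simulate_cfn(levels=3, theta=0.8, rng=rng)
print(leaves)  # one character observed at the 8 leaves
```

Calling `simulate_cfn` k times with the same tree gives k independent characters, which is the data the reconstruction theorems assume.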
10
Phase transition for the CFN model
  • Theorem 1 [M2003]: Suppose that $n = 3 \cdot 2^q$ and
  • $T$ is a uniformly chosen $(q+1)$-level 3-regular
    $X$-tree.
  • For all $e$, $\theta(e) < \theta$.
  • Then in order to reconstruct the topology with
    probability $\ge 0.1$, at least
    $k = \Omega(n^{-2\log_2(\theta) - 1})$ characters are needed.
  • Proof: information-theoretic variant of the proof
    for the random cluster model.
  • The same proof applies to any model for which the
    reconstruction problem is unsolvable,
  • more formally, to models for which $I(\sigma_\rho, \sigma_n)$
    decays exponentially fast in $n$.
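As a small arithmetic check, the exponent in the lower bound, read here as $-2\log_2(\theta) - 1$, is positive exactly when $2\theta^2 < 1$, which matches the critical value $2\theta^2 = 1$ listed for the CFN model. The sketch below just evaluates that exponent; the function name is hypothetical and the constants hidden by $\Omega$ are ignored.

```python
import math


def lower_bound_exponent(theta):
    """Exponent of n in the bound k = Omega(n ** exponent),
    read as -2 * log2(theta) - 1."""
    return -2.0 * math.log2(theta) - 1.0


critical_theta = 1.0 / math.sqrt(2.0)  # solves 2 * theta**2 == 1
for theta in (0.5, 0.6, critical_theta, 0.8):
    print(f"theta={theta:.3f}  exponent={lower_bound_exponent(theta):+.3f}")
```

Below the critical value the exponent is positive, so the number of characters needed grows polynomially in n; above it the exponent is negative and the bound is vacuous.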

11
CFN: logarithmic reconstruction
  • Theorem 2 [M2003]: If $T$ is an $X$-tree on $n$ leaves such that
  • for all $e$, $\theta_{\min} \le \theta(e) \le \theta_{\max}$, with
    $2\theta_{\min}^2 > 1$ and $\theta_{\max} < 1$,
  • then $k = O(\log n - \log \delta)$ characters suffice to
    reconstruct the topology with probability $1 - \delta$.
  • Need either a balanced tree: all leaves at the
    same distance from a root,
  • or a molecular clock: $\theta(e) = e^{-t(e)}$, where $t(e)$
    is the time interval between the two endpoints of
    the edge, and all leaves are at the same time.

12
Main Lemma [M2003]
  • Lemma: Suppose that $2\theta_{\min}^2 > 1$. Then there
    exist an $L$ and $\varepsilon > 0$ such that the CFN model on
    the binary tree of $L$ levels with
  • $\theta(e) \ge \theta_{\min}$ for all $e$ not adjacent to $\partial T$,
  • $\theta(e) \ge \varepsilon\,\theta_{\min}$ for all $e$ adjacent to $\partial T$,
  • satisfies $E[\sigma_r \, \mathrm{Maj}(\sigma_{\partial T})] \ge \varepsilon$.
  • Roughly: given boundary data of quality $\ge \varepsilon$, we
    can reconstruct the root data with quality $\ge \varepsilon$.
  • In phylogeny, we can treat known pieces of the tree
    as vertices.
  • Main problem: how to reconstruct pieces of the
    tree?
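The quantity in the lemma, the correlation between the root value and the majority vote of the leaves, can be estimated numerically. The following is a rough Monte Carlo sketch, not the lemma's proof: it assumes a uniform root value, the same `theta` on every edge (including edges adjacent to the boundary), and random tie-breaking in the majority vote. All function names are hypothetical.

```python
import random


def sample_root_and_leaves(levels, theta, rng):
    """One CFN sample on a complete binary tree: (root value, leaf values)."""
    root = rng.choice((1, -1))  # assumed uniform root value
    values = [root]
    for _ in range(levels):
        # each vertex has two children; each child copies with prob. theta
        values = [
            parent if rng.random() < theta else rng.choice((1, -1))
            for parent in values
            for _child in range(2)
        ]
    return root, values


def majority(values, rng):
    """Majority vote of +1/-1 values; ties are broken at random (assumption)."""
    total = sum(values)
    if total == 0:
        return rng.choice((1, -1))
    return 1 if total > 0 else -1


def majority_quality(levels, theta, trials, rng):
    """Monte Carlo estimate of E[root value * Maj(leaf values)]."""
    acc = 0
    for _ in range(trials):
        root, leaves = sample_root_and_leaves(levels, theta, rng)
        acc += root * majority(leaves, rng)
    return acc / trials


rng = random.Random(2)
quality = majority_quality(levels=4, theta=0.9, trials=4000, rng=rng)
print(f"estimated E[sigma_r * Maj] = {quality:.3f}")
```

Here `theta = 0.9` satisfies $2\theta^2 > 1$, the regime of the lemma; rerunning with smaller `theta` or more levels gives a feel for how the quality degrades.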

13
Metric spaces on trees
  • Let $D$ be a positive function on the edges $E$.
  • Define $D(u,v) = \sum \{ D(e) : e \in \mathrm{path}(u,v) \}$.
  • Claim: Given $D(v,u)$ for all $v$ and $u$ in $\partial T$, it is
    possible to reconstruct the topology of $T$.
  • Proof: It suffices to find $d(u,v)$ for all $u, v \in \partial T$,
    where $d$ is the graph-metric distance.
  • $d(u_1,u_2) = 2$ iff for all $w_1$ and $w_2$ it holds that
  • $D(u_1,u_2,w_1,w_2) := D(u_1,w_1) + D(u_2,w_2) - D(u_1,u_2) - D(w_1,w_2) > 0$
    (four-point condition).
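A minimal sketch of the four-point test, using the quantity $D(u_1,w_1) + D(u_2,w_2) - D(u_1,u_2) - D(w_1,w_2)$ and requiring it to be positive for every pair of other leaves. The example tree is an assumption chosen for illustration: a quartet with cherries {a, b} and {c, d}, leaf edges of length 1 and an internal edge of length 2. The names `four_point` and `is_cherry` are hypothetical.

```python
from itertools import combinations


def four_point(D, u1, u2, w1, w2):
    """D(u1,u2,w1,w2) = D(u1,w1) + D(u2,w2) - D(u1,u2) - D(w1,w2)."""
    return D[u1][w1] + D[u2][w2] - D[u1][u2] - D[w1][w2]


def is_cherry(D, leaves, u1, u2):
    """True if the four-point quantity is positive for all other pairs,
    i.e. u1 and u2 are at graph distance 2 (siblings)."""
    others = [w for w in leaves if w not in (u1, u2)]
    return all(
        four_point(D, u1, u2, w1, w2) > 0
        for w1 in others
        for w2 in others
        if w1 != w2
    )


# Assumed example: quartet tree ab|cd, leaf edges 1.0, internal edge 2.0.
leaves = ["a", "b", "c", "d"]
pair_dist = {
    ("a", "b"): 2.0, ("c", "d"): 2.0,
    ("a", "c"): 4.0, ("a", "d"): 4.0,
    ("b", "c"): 4.0, ("b", "d"): 4.0,
}
D = {u: {} for u in leaves}
for (u, v), dist in pair_dist.items():
    D[u][v] = D[v][u] = dist  # symmetric distance table

cherries = [
    (u, v) for u, v in combinations(leaves, 2) if is_cherry(D, leaves, u, v)
]
print(cherries)  # → [('a', 'b'), ('c', 'd')]
```

For the true cherry {a, b} the quantity equals 4, twice the internal edge length, while for the non-cherry {a, c} it equals -4, so the sign separates the two cases with room for estimation error.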

[Figure: two quartet configurations of the leaves u1, u2, w1, w2]
14
Metric spaces on trees
  • Continue by replacing known sub-trees $T'$ on
    vertices $(v_1, \ldots, v_r)$ by a single vertex $v$.
  • The distance between $(v_1, \ldots, v_r)$ and $(u_1, \ldots, u_s)$ is
    defined as $d(v_1, u_1)$.
  • $D(u_1,u_2,w_1,w_2) > 0 \Rightarrow D(u_1,u_2,w_1,w_2) \ge 2 \min_e D(e)$.
  • It suffices to have $D$ with accuracy $\min_e D(e)/4$.

15
Metric spaces on trees
  • Let $T$ be a balanced tree.
  • The $L$-topology of $T$ is
  • $d_L(u,v) = \min\{d(u,v), 2L\}$.
  • Claim: If $T$ is balanced, then in order to recover
    the $L$-topology of $T$ it suffices to have:
  • for each leaf $u$ of $T$, a set $U(u)$ containing all
    elements at distance at most $2L+2$ from $u$;
  • for all $u$ and all $w_1, w_2, w_3, w_4 \in U(u)$, the sign of
    $D(w_1,w_2,w_3,w_4)$.
  • Proof: If $d(u_1,u_2) > 2$, then either
  • $u_2$ is not in $U(u_1)$, or,
  • letting $v$ be a sister of $u_1$ and $v'$ a cousin of $v$,
  • $D(u_1,v,u_2,v') > 0$.
  • We have a witness that $u_1$ and $u_2$ are not siblings.

[Figure: leaves u1 and u2, with v the sister of u1 and v' a cousin of v]
16
Proof of CFN theorem
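  • Define $D(e) = -\log \theta(e)$.
  • $D(u,v) = -\log(\mathrm{Cov}(\sigma_v,\sigma_u))$, where
    $\mathrm{Cov}(\sigma_v,\sigma_u) = E[\sigma_v \sigma_u]$.
  • Estimate $\mathrm{Cov}(\sigma_v,\sigma_u)$ by $\mathrm{Cor}(\sigma_v,\sigma_u)$, where
    $\mathrm{Cor}(\sigma_v,\sigma_u) = \frac{1}{k}\sum_{i=1}^{k} \sigma^i_v \sigma^i_u$.
  • Need $D$ with accuracy $m = \min_e D(e)/4 \ge c\varepsilon$, or
  • $\mathrm{Cor} = (1 \pm c\varepsilon)\,\mathrm{Cov}$.
  • $\mathrm{Cor}(\sigma_v,\sigma_u)$ is an average of $k$ i.i.d. $\pm 1$ variables
    with expected value $\mathrm{Cov}(\sigma_v,\sigma_u)$.
  • $\mathrm{Cov}(\sigma_v,\sigma_u)$ may be as small as
    $\varepsilon^{2\,\mathrm{depth}(T)} = n^{-O(-\log \varepsilon)}$.
  • Given $k = n^{O(-\log \varepsilon)}$ characters, it is possible
    to estimate $D$ and therefore reconstruct $T$ with
    high probability.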
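The estimation step can be sketched on the smallest possible example: two leaves hanging off a common root. Under the CFN rule the expected product of the two leaf values is the product of the edge parameters, so $-\log$ of the empirical correlation should approach $D(e_u) + D(e_v)$. The setup below (uniform root, edge parameters 0.9 and 0.8, 20000 samples, the function names) is assumed for illustration only.

```python
import math
import random


def sample_pair(theta_u, theta_v, rng):
    """One CFN sample at two leaves u, v that share a parent (the root)."""
    root = rng.choice((1, -1))  # assumed uniform root value

    def child(theta):
        # copy the root with probability theta, else a fresh uniform value
        return root if rng.random() < theta else rng.choice((1, -1))

    return child(theta_u), child(theta_v)


def estimate_distance(samples):
    """-log of the empirical correlation (1/k) * sum of products."""
    cor = sum(su * sv for su, sv in samples) / len(samples)
    if cor <= 0:
        raise ValueError("empirical correlation is not positive; need more samples")
    return -math.log(cor)


rng = random.Random(1)
theta_u, theta_v = 0.9, 0.8
samples = [sample_pair(theta_u, theta_v, rng) for _ in range(20000)]
d_hat = estimate_distance(samples)
d_true = -math.log(theta_u) - math.log(theta_v)  # D(e_u) + D(e_v)
print(f"estimated D = {d_hat:.3f}, true D = {d_true:.3f}")
```

When the true correlation is tiny, as for leaves far apart in a deep tree, the relative error of the empirical correlation grows, which is why this direct approach needs polynomially many characters.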

17
Reconstructing the topology [M2003]
  • The algorithm: repeat the following:
  • Reconstruct the topology up to $\ell$ levels from the
    boundary using the 4-point method.
  • For each sample, reconstruct the data $\ell$ levels
    from the boundary using the majority algorithm.
  • Reconstruction near the boundary takes $O(\log n)$
    samples.
  • By the main lemma, quality stays above $\varepsilon$.

18
Proving the main Lemma
  • Need to estimate $E[\sigma_r \, \mathrm{Maj}(\sigma_{\partial T})]$. The estimate has two
    parts:
  • Case 1: For all $e$ adjacent to $\partial T$, $\theta(e)$ is small.
    Here we use a perturbation argument, i.e.
    estimate partial derivatives of $E[\sigma_r \, \mathrm{Maj}(\sigma_{\partial T})]$
    with respect to various variables (using
    something like Russo's formula).
  • Case 2: Some $e$ adjacent to $\partial T$ has large $\theta(e)$. Use
    percolation theory arguments.
  • Both cases use isoperimetric estimates for the
    discrete cube.