Genome Rearrangement Phylogeny

1
Genome Rearrangement Phylogeny
Robert K. Jansen, School of Biology, University of Texas at Austin
Bernard M.E. Moret, Department of Computer Science, University of New Mexico
Li-San Wang and Tandy Warnow, Department of Computer Sciences, University of Texas at Austin
2
Outline
  • Introduction
  • Genome rearrangement phylogeny reconstruction
  • Application
  • Other methods
  • Future research

3
New Phylogenetic Signals
  • High-throughput sequencing efforts lead to
    larger datasets
  • Challenge: inferring deep evolutionary events
  • Biologists are turning to rare genomic changes:
  • Rare
  • Large state space
  • High signal-to-noise ratio
  • Potential for clarifying early evolution
  • Best studied: gene order evolution
    (genome rearrangement)

4
Genomes As Signed Permutations
1 5 3 4 -2 -6 or 5 1 6 2 -4 -3, etc.
5
Gene Order Data
  • Rare changes on the genomic scale
  • Large state space
  • DNA: 4 states/character
  • Protein (amino acid sequence): 20
    states/character
  • Circular gene order with 120 genes: 2^119 × 119!
    states/character
  • High signal-to-noise ratio

6
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
Inversion
1 2 -6 -5 -4 -3 7 8 9 10
Transposition
1 2 7 8 3 4 5 6 9 10
Inverted Transposition
1 2 7 8 -6 -5 -4 -3 9 10
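The three events can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names and the index convention (half-open slices on a list of signed integers) are assumptions. An inversion reverses a segment and flips the sign of every gene in it; a transposition moves a segment; an inverted transposition does both.

```python
def inversion(genome, i, j):
    """Reverse genome[i:j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] so that it follows genome[j:k] (i < j < k)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like a transposition, but the moved segment is also inverted."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


start = list(range(1, 11))
print(inversion(start, 2, 6))                  # segment 3 4 5 6 inverted in place
print(transposition(start, 2, 6, 8))           # segment 3 4 5 6 moved after 7 8
print(inverted_transposition(start, 2, 6, 8))  # moved and inverted
```

Each call starts from the identity genome 1 2 ... 10 and acts on the segment 3 4 5 6, as on the slide.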
7
Edit Distances Between Genomes
  • (INV) Inversion distance [Hannenhalli & Pevzner
    1995]
  • Computable in linear time [Moret et al. 2001]
  • (BP) Breakpoint distance [Watterson et al. 1982]
  • Computable in linear time
  • NJ(BP) [Blanchette, Kunisawa, Sankoff 1999]

A = 1 2 3 4 5 6 7 8 9 10
B = 1 2 3 -8 -7 -6 4 5 9 10
BP(A,B) = 3
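The breakpoint count for this pair can be checked with a short sketch. It assumes signed circular genomes stored as Python lists; the helper names are hypothetical. An adjacency (a, b) read on one strand is the same adjacency as (-b, -a) read on the other, so each is stored in one canonical form.

```python
def adjacencies(genome):
    """Set of signed adjacencies of a circular genome, independent of strand."""
    n = len(genome)
    result = set()
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]
        # (a, b) on one strand equals (-b, -a) on the other; keep one form.
        result.add(min((a, b), (-b, -a)))
    return result


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [1, 2, 3, -8, -7, -6, 4, 5, 9, 10]
print(breakpoint_distance(A, B))  # 3
```

The three breakpoints are the junctions 3|-8, -6|4 and 5|9 in B.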
8
Our Model: the Generalized Nadeau-Taylor Model
[STOC'01]
  • Three types of events:
  • Inversions (INV)
  • Transpositions (TRP)
  • Inverted Transpositions (ITP)
  • Events of the same type are equiprobable
  • Probabilities of the three types have fixed ratio
  • We focus on signed circular genomes in this talk.
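A simulation under a model of this shape might look like the sketch below. It rests on stated assumptions: the circular genome is stored as a list cut at a fixed point, an event type is drawn with fixed probabilities `p_inv` and `p_trp` (the remainder going to inverted transpositions), and the segment endpoints are then drawn uniformly. Uniform endpoints are a simplification and may not match the model's equiprobability over distinct events exactly. All names and defaults are hypothetical.

```python
import random


def apply_random_event(genome, p_inv, p_trp, rng):
    """Apply one random inversion, transposition or inverted transposition."""
    n = len(genome)
    r = rng.random()
    if r < p_inv:
        i, j = sorted(rng.sample(range(n + 1), 2))
        return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]
    i, j, k = sorted(rng.sample(range(n + 1), 3))
    moved = genome[i:j]
    if r >= p_inv + p_trp:
        # Inverted transposition: the moved segment is also reversed and negated.
        moved = [-g for g in reversed(moved)]
    return genome[:i] + genome[j:k] + moved + genome[k:]


def evolve(genome, num_events, p_inv=1 / 3, p_trp=1 / 3, seed=0):
    """Apply num_events random events with a fixed ratio of event types."""
    rng = random.Random(seed)
    for _ in range(num_events):
        genome = apply_random_event(genome, p_inv, p_trp, rng)
    return genome


descendant = evolve(list(range(1, 121)), num_events=40)
print(len(descendant))  # still 120 genes; only order and signs change
```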

9
Simulation Study Protocol
[Diagram: evolutionary process (known in simulation) → synthetic input → phylogenetic method → inferred tree]
10
Quantifying Error
11
Outline
  • Genome rearrangement evolution
  • Genome rearrangement phylogeny reconstruction
  • Application
  • Other methods
  • Future research

12
Gene Order Parsimony
13
Breakpoint Phylogeny [Sankoff & Blanchette 1998]
  • Maximum Parsimony-style problem
  • Find tree(s), leaf-labeled by genomes, with
    shortest breakpoint length
  • NP-hard problem on two levels:
  • Find the shortest tree (the space of trees has
    exponential size)
  • Given a tree, find its breakpoint length (NP-hard
    even for a tree with 3 leaves, but can be reduced
    to TSP)
  • BPAnalysis [Sankoff & Blanchette 1998]
  • Takes 200 years to compute our 13-taxon dataset
    on a Sun workstation

14
BPAnalysis
  • Tree length evaluation for EVERY tree
  • Given a fixed tree topology, evaluate the tree
    length
  • Iteratively evaluate the median problem (tree
    length for a 3-leaf tree)

15
GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)
  • http://www.cs.unm.edu/~moret/GRAPPA/
  • Uses lower-bound techniques to speed up the search
  • Used on real datasets, producing thousand-fold
    speedups over BPAnalysis [ISMB'01]
  • Contributors (led by Bernard Moret at UNM): U.
    New Mexico, U. Texas at Austin, Università di
    Bologna, Italy

16
The Circular Lowerbound of the Length of a Tree
  • Given a tree, we can lowerbound its length very
    quickly
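One way to read the circular bound: walk around the tree so that the leaves are met in a circular order. Every tree edge is then crossed exactly twice, and each leaf-to-leaf distance is at most the length of the path between the two leaves, so half the sum of distances between consecutive leaves cannot exceed the tree length. The sketch below assumes that reading. The function names are hypothetical, and the five-taxon distances are borrowed from the additive-matrix slide purely as convenient numbers, not as breakpoint distances.

```python
def circular_sum(leaf_order, dist):
    """Sum of distances between consecutive leaves around the circular order."""
    n = len(leaf_order)
    return sum(dist[leaf_order[i]][leaf_order[(i + 1) % n]] for i in range(n))


def circular_lower_bound(leaf_order, dist):
    """Half the circular sum: a lower bound on the length of the tree."""
    return circular_sum(leaf_order, dist) / 2


names = ["S1", "S2", "S3", "S4", "S5"]
matrix = [
    [0, 9, 15, 14, 17],
    [9, 0, 14, 13, 16],
    [15, 14, 0, 13, 16],
    [14, 13, 13, 0, 13],
    [17, 16, 16, 13, 0],
]
dist = {a: dict(zip(names, row)) for a, row in zip(names, matrix)}

# Leaves listed in the order a walk around the tree ((S1,S2),S3,(S4,S5)) meets them.
print(circular_lower_bound(names, dist))  # 33.0
```

For these additive distances the bound is tight: 33 is also the total edge length of the tree that realizes them. Comparing the unhalved circular sum against twice a tree length is the same test.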

17
The Lowerbound Technique
  • Avoid any tree X without potential:
  • any tree X whose lowerbound lb(X) is higher than
    twice the length c(T) of the best tree T found so far
  • Finding a good starting tree quickly is of utmost
    importance
  • We turn to distance-based methods:
  • Neighbor joining (NJ) [Saitou & Nei 1987]
  • Weighbor [Bruno et al. 2000]

18
Additive Distance Matrix and True Evolutionary
Distance (T.E.D.)
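Distance matrix:
      S1  S2  S3  S4  S5
S1     0   9  15  14  17
S2         0  14  13  16
S3             0  13  16
S4                 0  13
S5                     0
[Figure: tree on S1-S5 realizing the matrix. Leaf edges: S1 = 5, S2 = 4, S3 = 7, S4 = 5, S5 = 8. Internal edges: 3 (between the S1/S2 pair and the point where S3 attaches) and 1 (between that point and the S4/S5 pair).]
Theorem [Waterman et al. 1977]: Given an $m \times m$
additive distance matrix, we can reconstruct a
tree realizing the distance in $O(m^2)$ time.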
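Additivity itself can be tested with the four-point condition: for every four taxa, the two largest of the three pairwise distance sums must be equal. The sketch below checks that condition on the five-taxon matrix from this slide. It is a check of additivity only, not the O(m^2) reconstruction the theorem refers to, and the function name is hypothetical.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Four-point condition: in every quadruple the two largest sums agree."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Rows and columns are S1..S5, as on the slide.
D = [
    [0, 9, 15, 14, 17],
    [9, 0, 14, 13, 16],
    [15, 14, 0, 13, 16],
    [14, 13, 13, 0, 13],
    [17, 16, 16, 13, 0],
]
print(is_additive(D))  # True
```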
19
Error Tolerance of Neighbor Joining
  • Theorem [Atteson 1999]: Let $D_{ij}$ be the true
    evolutionary distances, and $d_{ij}$ be the
    estimated distances for T. Let $\epsilon$ be the
    length of the shortest edge in T. If for all taxa
    i, j we have $|D_{ij} - d_{ij}| < \epsilon/2$, then neighbor joining returns T.
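For orientation, here is a compact neighbor-joining sketch together with a check of the closeness condition in the theorem. It is a textbook-style version that returns only the topology, as nested tuples without branch lengths; it is not the implementation used in the study, and all names are hypothetical. The input is the five-taxon additive matrix from the previous slide, whose shortest edge has length 1.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the NJ topology as nested tuples (branch lengths omitted)."""
    nodes = list(names)
    d = {a: {b: float(dist[a][b]) for b in names if b != a} for a in names}
    while len(nodes) > 3:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair minimizing (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        u = (a, b)
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = (d[a][c] + d[b][c] - d[a][b]) / 2
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    return tuple(nodes)


def within_atteson_radius(true_d, est_d, shortest_edge):
    """True if every estimate is within half the shortest edge of the truth."""
    return all(abs(true_d[a][b] - est_d[a][b]) < shortest_edge / 2
               for a in true_d for b in true_d[a])


names = ["S1", "S2", "S3", "S4", "S5"]
matrix = [
    [0, 9, 15, 14, 17],
    [9, 0, 14, 13, 16],
    [15, 14, 0, 13, 16],
    [14, 13, 13, 0, 13],
    [17, 16, 16, 13, 0],
]
dist = {a: dict(zip(names, row)) for a, row in zip(names, matrix)}
tree = neighbor_joining(names, dist)
print(tree)
```

On this input S1 and S2 are joined first, and the returned topology groups S1 with S2 and S4 with S5.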

20
BP and INV
[Plots: INV vs. K and BP/2 vs. K; 120 genes, inversion-only evolution; K = actual number of inversions]
21
NJ(BP) [Blanchette, Kunisawa, Sankoff 1999] and
NJ(INV)
[Plots: two panels, "Transpositions/inverted transpositions only" and "Inversion only"; 120 genes, 160 leaves, uniformly random tree]
22
Estimate True Evolutionary Distances Using BP
  • To use the scatter plot to estimate the actual
    number of events (K):
  • (1) Compute BP/2
  • (2) From the curve, look up the corresponding
    value of K
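The look-up idea can be illustrated with a back-of-the-envelope curve. The sketch below is not Exact-IEBP, Approx-IEBP or EDE. It assumes inversion-only evolution on a circular genome with n genes, where each inversion cuts 2 of the n adjacencies, and it ignores the chance that a broken adjacency is later restored. Under those assumptions the expected breakpoint count after k inversions is roughly n(1 - (1 - 2/n)^k), which can be inverted in closed form. It works with BP directly rather than BP/2, and the function names are hypothetical.

```python
import math


def expected_bp(n, k):
    """Approximate expected breakpoint count after k random inversions."""
    return n * (1 - (1 - 2 / n) ** k)


def estimate_k(n, bp):
    """Invert expected_bp: the k whose expected breakpoint count equals bp."""
    if bp >= n:
        return math.inf  # saturated: the curve approaches n but never reaches it
    return math.log(1 - bp / n) / math.log(1 - 2 / n)


n = 120
bp = expected_bp(n, 40)
print(round(bp, 2), round(estimate_k(n, bp), 2))
```

The curve flattens as k grows, which is why the raw breakpoint count underestimates large numbers of events and a corrected estimate is needed.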

[Plot: BP/2 vs. K; 120 genes, inversion-only evolution; K = actual number of inversions]
23
True Evolutionary Distance (t.e.d.) Estimators
for Gene Order Data
IEBP: Inverting the Expected BreakPoint distance
EDE: Empirically Derived Estimator
24
True Evolutionary Distance Estimators
[Plots: Exact-IEBP vs. K and BP vs. K; 120 genes, inversion-only evolution; K = actual number of inversions]
25
Variance of True Evolutionary Distance Estimators
  • There are new distance-based phylogeny
    reconstruction methods (though designed for DNA
    sequences)
  • Weighbor [Bruno et al. 2000]
  • These methods use the variance of good t.e.d.s,
    and yield more accurate trees than NJ.
  • Variance estimates for the t.e.d.s [Wang WABI'02]
  • Weighbor(IEBP), Weighbor(EDE)

[Plot: K vs. Exact-IEBP (120 genes)]
26
Using T.E.D. Helps
[Plots: 120 genes, 160 leaves, uniformly random tree; transpositions/inverted transpositions only (180 runs per figure)]
27
Observations
  • EDE is the best distance estimator when used with
    NJ and Weighbor.
  • True evolutionary distance estimators are
    reliable even when we do not know the GNT model
    parameters (the probability ratios of the three
    types of events).

28
Outline
  • Genome rearrangement evolution
  • Genome rearrangement phylogeny reconstruction
  • Application
  • Other methods
  • Future research

29
Percentage of Trees Eliminated Through Bounding
[ISMB'01]
30
Campanulaceae cpDNA
  • 13 taxa (tobacco as outgroup)
  • 105 gene segments
  • GRAPPA finds 216 trees with shortest breakpoint
    length (out of 654,729,075 trees)
  • Running Time
  • BPAnalysis takes 2 centuries on a Sun workstation
  • GRAPPA takes 1.5 hours on a 512-node supercluster
  • About 2300-fold speedup on a single node
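The figures on this slide can be sanity-checked with a little arithmetic. The sketch assumes the standard count of unrooted binary trees on n labeled leaves, (2n - 5)!!, and reads "2 centuries" as 200 years of single-workstation time. The function name is hypothetical.

```python
def num_unrooted_binary_trees(n_leaves):
    """(2n - 5)!! unrooted binary trees on n labeled leaves (n >= 3)."""
    count = 1
    for odd in range(3, 2 * n_leaves - 4, 2):
        count *= odd
    return count


print(num_unrooted_binary_trees(12))  # 654729075
print(num_unrooted_binary_trees(13))  # 13749310575

# Per-node speedup: BPAnalysis hours divided by GRAPPA node-hours.
bpanalysis_hours = 200 * 365.25 * 24
grappa_node_hours = 1.5 * 512
print(round(bpanalysis_hours / grappa_node_hours))  # 2283
```

The quoted 654,729,075 equals the count for 12 leaves; the slide does not say how the number was derived. The ratio of about 2283 is consistent with the quoted "about 2300-fold".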

31
Campanulaceae [Moret et al. ISMB 2001]
Strict consensus of 216 optimal trees found by GRAPPA
6 out of 10 max. edges found
32
Outline
  • Genome rearrangement evolution
  • Genome rearrangement phylogeny reconstruction
  • Application
  • Other methods
  • Future research

33
Fast Approaches for Genome Rearrangement
Phylogeny
  • Basic technique: encode data as strings and apply
    maximum parsimony
  • Running time exponential in the number of
    genomes, but polynomial in the number of genes
    (faster than GRAPPA)
  • MPBE [ISMB'00]: Maximum Parsimony using Binary
    Encodings
  • MPME [Boore et al. Nature '95, PSB'02]: Maximum
    Parsimony using Multi-state Encodings
  • The length of a tree using these two methods is a
    lowerbound of the true breakpoint length [Bryant
    '01]

34
Maximum Parsimony using Binary Encoding (MPBE)
Input genomes (circular), each with its equivalent reverse-strand reading
A = 1 2 3 4 = -4 -3 -2 -1
B = 1 -4 -3 -2 = 2 3 4 -1
C = 1 2 -3 -4 = 4 3 -2 -1
MPBE strings (one column per adjacency)
A 1 1 1 1 0 0 0 0 0
B 0 1 1 0 1 1 0 0 0
C 1 0 0 0 0 0 1 1 1
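A binary encoding of this kind can be sketched as follows: list every adjacency that occurs in any genome, then give each genome a 1 in the columns of the adjacencies it contains. The genome signs used below (A = 1 2 3 4, B = 1 -4 -3 -2, C = 1 2 -3 -4) are an assumption, chosen because they reproduce the three strings on the slide. The column order (first appearance) and the function names are also assumptions.

```python
def circular_adjacencies(genome):
    """Adjacencies of a circular genome in a strand-independent form."""
    n = len(genome)
    result = []
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        result.append(min((a, b), (-b, -a)))  # (a, b) is the same as (-b, -a)
    return result


def mpbe_encode(genomes):
    """Map each genome name to its binary presence/absence string."""
    columns = []
    for genome in genomes.values():
        for adj in circular_adjacencies(genome):
            if adj not in columns:
                columns.append(adj)
    encoded = {}
    for name, genome in genomes.items():
        present = set(circular_adjacencies(genome))
        encoded[name] = [int(col in present) for col in columns]
    return encoded


genomes = {"A": [1, 2, 3, 4], "B": [1, -4, -3, -2], "C": [1, 2, -3, -4]}
for name, bits in mpbe_encode(genomes).items():
    print(name, *bits)
```

Nine distinct adjacencies occur across the three genomes, hence nine binary characters.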
35
Maximum Parsimony using Multistate Encoding (MPME)
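Input genomes (circular), each with its equivalent reverse-strand reading
A = 1 2 3 4 = -4 -3 -2 -1
B = 1 -4 -3 -2 = 2 3 4 -1
C = 1 2 -3 -4 = 4 3 -2 -1
MPME strings (one site per signed gene; the state is the gene that follows it)
Site  1  2  3  4 -1 -2 -3 -4
A     2  3  4  1 -4 -1 -2 -3
B    -4  3  4 -1  2  1 -2 -3
C     2 -3 -2  3  4 -1 -4  1
We use PAUP to solve Maximum Parsimony. Constraint: the number of states per site cannot exceed 32.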
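The multistate encoding can be sketched in the same spirit: one site per signed gene, whose state is the signed gene that immediately follows it, reading either strand of the circular genome. As before, the genome signs (A = 1 2 3 4, B = 1 -4 -3 -2, C = 1 2 -3 -4) are an assumption, and the site order 1..n followed by -1..-n and the function name are hypothetical.

```python
def mpme_encode(genome):
    """State at site g is the signed gene that follows g in the circular genome."""
    n = len(genome)
    successor = {}
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        successor[a] = b     # reading forward, b follows a
        successor[-b] = -a   # reading the other strand, -a follows -b
    sites = list(range(1, n + 1)) + [-g for g in range(1, n + 1)]
    return [successor[s] for s in sites]


genomes = {"A": [1, 2, 3, 4], "B": [1, -4, -3, -2], "C": [1, 2, -3, -4]}
for name, genome in genomes.items():
    print(name, *mpme_encode(genome))
```

With n genes each site can take up to 2n - 2 states, which is one way a cap on the number of states per site can become binding on larger genomes.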
36
NJ vs MP (120 genes, 160 genomes)
All three event types equiprobable (datasets that
exceed 32-state limit for MPME are dropped)
37
Inversion Phylogeny
  • Inversion median has higher running time than
    breakpoint median
  • Inversion phylogeny overall has shorter running
    time than breakpoint phylogeny, and returns more
    accurate trees [Moret et al. WABI'02]

38
DCM-GRAPPA [Moret & Tang 2003]
  • Disk-Covering Method: divide the original problem
    into subproblems [Huson, Nettles, Parida, Warnow
    and Yooseph, 1998]
  • Uses inversion distance
  • DCM-GRAPPA can now process thousands of genomes,
    each having hundreds of genes

39
Ongoing and Future Research
  • Genome rearrangement phylogeny with unequal gene
    content (duplications, deletions, etc.)
  • Non-uniform genome rearrangement
    models (segment-length dependent model, hotspots)

40
Acknowledgements
  • University of Texas: Tandy Warnow (Advisor),
    Robert K. Jansen, Stacia Wyman, Luay Nakhleh,
    Usman Roshan, Cara Stockham, Jerry Sun
  • University of New Mexico: Bernard M.E. Moret,
    David Bader, Jijun Tang, Mi Yan
  • Central Washington University: Linda Raubeson

41
Phylolab, Department of Computer Sciences, University of Texas at Austin
Please visit us at http://www.cs.utexas.edu/users/phylo/