Title: Combinatorial and Statistical Approaches in Gene Rearrangement Analysis
1Combinatorial and Statistical Approaches in Gene
Rearrangement Analysis
- Jijun Tang
- Computer Science and Engineering
- University of South Carolina
- jtang_at_cse.sc.edu
- (803) 777-8923
2Outline
- Backgrounds
- Branch-and-Bound Algorithms for the Median Problem
- Maximum Likelihood Methods for Phylogenetic Reconstruction
- Post-Analysis
- Conclusions
4Simple Rearrangements
5Phylogenetic Reconstruction
6Rearrangement Phylogeny
9Median Problem
Goal: find M so that $D_{AM} + D_{BM} + D_{CM}$ is minimized
NP-hard for most metric distances
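To make the objective concrete, here is a minimal brute-force sketch of the median problem on tiny genomes. It is illustrative only: the breakpoint distance on single linear chromosomes is swapped in as a simple stand-in for the distance D, and the function names are hypothetical. Exhaustive enumeration like this is feasible only for a handful of genes.

```python
from itertools import permutations, product

def adjacency_set(genome):
    """Adjacencies of a signed linear genome, in both reading directions."""
    adj = set()
    for x, y in zip(genome, genome[1:]):
        adj.add((x, y))
        adj.add((-y, -x))
    return adj

def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are absent from g2 (stand-in for D)."""
    adj2 = adjacency_set(g2)
    return sum((x, y) not in adj2 for x, y in zip(g1, g1[1:]))

def brute_force_median(a, b, c):
    """Try every signed permutation M and keep the one minimizing D_AM + D_BM + D_CM."""
    n = len(a)
    best_score, best_genome = None, None
    for order in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            m = [s * g for s, g in zip(signs, order)]
            score = sum(breakpoint_distance(m, g) for g in (a, b, c))
            if best_score is None or score < best_score:
                best_score, best_genome = score, m
    return best_score, best_genome

score, median = brute_force_median([1, 2, 3], [1, 2, 3], [1, -2, 3])
```

The search space has 2^n * n! candidates, which is why the exact solvers discussed next rely on bounds to prune it.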
10Multichromosomal Reversal Median problem
- To find a median genome that minimizes the summation of the multichromosomal HP distances on the three edges
- Events considered: reversal, translocation, fusion, fission
- Exact and heuristic solvers exist for the Unichromosomal Reversal Median Problem (reversals are the only events)
11Capless Breakpoint Graph
- Genome A → Non-perfect Matching M(A)
- Let a, b be adjacent genes in A. Then $(a^t, b^h)$ is an edge in M(A)
- A genome is composed of a set of edges and ends.
- Matchings naturally correspond to Undirected Genomes (flipping a chromosome does not alter the matching)
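The construction of M(A) can be sketched as follows. This is an illustration under stated assumptions: a genome is represented as a list of linear chromosomes of signed genes, each gene g has a head `gh` and a tail `gt`, and a positive gene is read head-to-tail (so adjacent positive genes a, b give the edge (a^t, b^h) as on the slide). The function names are hypothetical.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene in reading order."""
    name = abs(gene)
    head, tail = f"{name}h", f"{name}t"
    return (head, tail) if gene > 0 else (tail, head)

def matching(genome):
    """Build the edge set M(A) and the set of chromosome ends."""
    edges, ends = set(), set()
    for chromosome in genome:
        ends.add(extremities(chromosome[0])[0])    # left end of the chromosome
        ends.add(extremities(chromosome[-1])[1])   # right end of the chromosome
        for a, b in zip(chromosome, chromosome[1:]):
            edges.add(frozenset((extremities(a)[1], extremities(b)[0])))
    return edges, ends

def flip(chromosome):
    """Read a chromosome from the other end: reverse the order and negate the signs."""
    return [-g for g in reversed(chromosome)]

A = [[-5, 1, 6], [3, 2, 4]]          # assumed two-chromosome genome
edges, ends = matching(A)
flipped = matching([flip(A[0]), A[1]])
```

Comparing `flipped` with `(edges, ends)` shows the last bullet in action: flipping a chromosome leaves both the edges and the ends unchanged.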
12Example
- Example Genomes
- A -5, 1, 6, 3 , 2, 4
- B 1, 6 , -5, -4, -3, -2
Adjacency Graph
13Capless Breakpoint Graph
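Figure legend: B-end, A-end
- Denote by C(A,B) the number of cycles, and by |AB|, |AA|, |BB| the numbers of AB-paths, AA-paths, BB-paths in G(A,B); n is the number of genes
- n = 6, C(A,B) = 1, |AB| = 4
- $d_{HP} = 6 - 1 - 4/2 = 3$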
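The counts in this example can be checked with a short sketch. It assumes the example genomes split into chromosomes as A = (-5, 1, 6), (3, 2, 4) and B = (1, 6), (-5, -4, -3, -2); that split is an assumption, chosen because it reproduces the counts stated on the slide. Vertices are gene extremities, an A-end is a vertex with no edge from M(A), and the helper names are hypothetical.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene in reading order."""
    name = abs(gene)
    head, tail = f"{name}h", f"{name}t"
    return (head, tail) if gene > 0 else (tail, head)

def partner_map(genome):
    """Map each matched extremity to its partner in the genome's matching."""
    partner = {}
    for chromosome in genome:
        for a, b in zip(chromosome, chromosome[1:]):
            u, v = extremities(a)[1], extremities(b)[0]
            partner[u], partner[v] = v, u
    return partner

def count_components(genome_a, genome_b):
    """Count cycles and AB-, AA-, BB-paths in the union of M(A) and M(B)."""
    partner = {"A": partner_map(genome_a), "B": partner_map(genome_b)}
    other = {"A": "B", "B": "A"}
    genes = sorted(abs(g) for chromosome in genome_a for g in chromosome)
    vertices = [f"{g}{side}" for g in genes for side in "ht"]
    counts = {"cycles": 0, "AB": 0, "AA": 0, "BB": 0}
    seen = set()
    for v in vertices:                       # paths: start from an end vertex
        if v in seen or (v in partner["A"] and v in partner["B"]):
            continue
        start = "A" if v not in partner["A"] else "B"
        colour, cur = other[start], v
        seen.add(cur)
        while cur in partner[colour]:        # follow edges of alternating genomes
            cur = partner[colour][cur]
            seen.add(cur)
            colour = other[colour]
        counts["".join(sorted(start + colour))] += 1
    for v in vertices:                       # everything left lies on a cycle
        if v in seen:
            continue
        counts["cycles"] += 1
        cur, colour = v, "A"
        while cur not in seen:
            seen.add(cur)
            cur = partner[colour][cur]
            colour = other[colour]
    return counts

A = [[-5, 1, 6], [3, 2, 4]]
B = [[1, 6], [-5, -4, -3, -2]]
counts = count_components(A, B)
n = 6
value = n - counts["cycles"] - counts["AB"] / 2   # the arithmetic on the slide
```

With these genomes the sketch finds one cycle (the shared adjacency of genes 1 and 6) and four AB-paths, two of which are isolated vertices that are ends in both genomes.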
14A Lower Bound of the HP Distance
- A simpler lower bound involves only the numbers of genes, cycles, and paths.
- Derived from Hannenhalli and Pevzner (1995)
- dHP (A,B)n C(A,B) - AB/2 AA - BB
- Pseudo-cycle of A and B
15Pseudo-cycle distance Median Problem
- Pseudo-cycle distance
- Pseudo-cycle distance Median Problem (PMP): to find a median genome that minimizes the summation of the pseudo-cycle distances on the three edges
- We use the pseudo-cycle distance as a lower bound for the HP distance to derive an RMP solver
16Branch-and-Bound Algorithm
- Enumerate the solution genomes gene by gene (Genome Enumeration)
- After enumerating a gene, compute an upper bound based on the partial solution genome
- Bound: check whether the upper bound of the partial solution is less than a criterion
- Branch:
- If it is true, the partial genome is discarded; enumerate another gene
- Otherwise, update the criterion and continue enumeration
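The enumerate, bound, and branch loop above can be sketched on a toy problem. This is not the solver from the talk: it uses the breakpoint distance on single linear chromosomes as a stand-in, and it is written in the equivalent minimization form (a lower bound on the score of a partial genome is compared against the best score found so far). Genes are assumed to be numbered 1..n and all names are hypothetical.

```python
def adjacency_set(genome):
    """Adjacencies of a signed linear genome, in both reading directions."""
    adj = set()
    for x, y in zip(genome, genome[1:]):
        adj.add((x, y))
        adj.add((-y, -x))
    return adj

def branch_and_bound_median(genomes):
    """Return (score, median) minimizing the summed breakpoint distance."""
    n = len(genomes[0])
    adj_sets = [adjacency_set(g) for g in genomes]

    def breakpoints(x, y):
        # Number of input genomes in which the adjacency (x, y) is broken.
        return sum((x, y) not in adj for adj in adj_sets)

    def score(genome):
        return sum(breakpoints(x, y) for x, y in zip(genome, genome[1:]))

    # Initial criterion: the best of the input genomes themselves.
    start = min(genomes, key=score)
    best = [score(start), list(start)]

    def extend(prefix, used, cost):
        # Bound: breakpoints already incurred can never be undone,
        # so cost is a lower bound for every completion of prefix.
        if cost >= best[0]:
            return
        if len(prefix) == n:
            best[0], best[1] = cost, list(prefix)
            return
        # Branch: enumerate the next gene and its orientation.
        for gene in range(1, n + 1):
            if gene in used:
                continue
            for signed in (gene, -gene):
                extra = breakpoints(prefix[-1], signed) if prefix else 0
                prefix.append(signed)
                used.add(gene)
                extend(prefix, used, cost + extra)
                prefix.pop()
                used.remove(gene)

    extend([], set(), 0)
    return best[0], best[1]

score, median = branch_and_bound_median([[1, 2, 3, 4], [1, 2, -3, 4], [1, 2, 3, 4]])
```

The tighter the bound, the earlier partial genomes are discarded, which is the motivation for the pseudo-cycle bound developed in the following slides.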
17Genome Enumeration for Multichromosome Genomes
Genome Enumeration: for genomes on genes 1, 2, 3
(Figure: enumeration tree with branches labeled 2 and -2)
18Features
- Main Components
- Contraction Operation
- Upper Bound on the number of pseudo-cycles
- Genome enumeration
- Extension of Caprara's method for unichromosomal genomes (1999)
19Contraction Operation
- Contraction of an edge $e = (a^t, b^h)$ on M(A): M(A)/e
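A minimal sketch of the contraction, following the usual rule for matchings in Caprara-style median solvers: if e is already in M(A) it is simply removed; otherwise the two edges meeting e are replaced by one edge joining their far endpoints. How ends are treated when e touches an unmatched vertex is an assumption made here for illustration (the surviving partner becomes an end), as are the data layout and the function name.

```python
def contract(matching, ends, e):
    """Return (M/e, ends) after contracting edge e = (u, v) on a matching.

    matching: set of frozenset({x, y}) edges; ends: set of unmatched vertices.
    """
    u, v = e
    partner = {}
    for edge in matching:
        x, y = tuple(edge)
        partner[x], partner[y] = y, x
    pu, pv = partner.get(u), partner.get(v)
    # Drop every edge touching u or v; both vertices disappear from the graph.
    new_matching = {edge for edge in matching if u not in edge and v not in edge}
    new_ends = set(ends) - {u, v}
    if pu == v:
        pass                                   # e was in the matching: just removed
    elif pu is not None and pv is not None:
        new_matching.add(frozenset((pu, pv)))  # join the two far endpoints
    else:
        # Assumption: a partner left without a mate becomes a chromosome end.
        new_ends.update(p for p in (pu, pv) if p is not None)
    return new_matching, new_ends

# Matching of the single chromosome (1, 3, 4, 2, 5).
M = {frozenset(p) for p in [("1t", "3h"), ("3t", "4h"), ("4t", "2h"), ("2t", "5h")]}
E = {"1h", "5t"}
contracted, contracted_ends = contract(M, E, ("1t", "2h"))
```

In this example the edges (1t, 3h) and (4t, 2h) are replaced by the single edge (3h, 4t).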
20Upper Bound on the Number of Pseudo-cycles
- Let S be a genome and $Z = \{G_1, G_2, G_3\}$ a set of three input genomes
- The maximal $\gamma(S,Z)$ is denoted by $\gamma^*$
- Based on the triangle inequality, an upper bound on the number of pseudo-cycles can be derived
21Notes
- $qn - \gamma^*$ is a lower bound on the sum of pseudo-cycle distances between any S and each genome in $Z = \{G_1, G_2, G_3\}$
- Given an edge e, assume genome S contains e and maximizes $\gamma(S,Z)$; let $Z' = \{G_1/e, G_2/e, G_3/e\}$, and assume $S'$ maximizes $\gamma(S',Z')$; then $S = S' \cup \{e\}$
22Upper Bound Test
- In a step of the algorithm, the current partial solution is $S_i = \{e_1, e_2, \ldots, e_i\}$
- The upper bound of $\gamma(S,Z)$ over genomes containing $S_i$ is denoted $UB_{S_i}$
- Let UB be the current upper bound
- If $UB_{S_i} < UB$, then the best upper bound of the genomes containing $S_i$ is worse than UB
23Branch-and-Bound Algorithm for Multichromosomal
Genomes
- Compute an initial Upper Bound (UB) from the input genomes.
- In each step, either an end or an edge is fixed in the solution.
- End Fixing: mark a node as an end of a chromosome.
- Edge Fixing: fix an edge e to the current partial solution genome $S_i$.
24Genome Enumeration for Multichromosome Genomes
Genome Enumeration: for genomes on genes 1, 2, 3
(Figure: enumeration tree with branches labeled 2 and -2)
- Red line: end fixing
- Black line: edge fixing
25Properties
- Can be extended to compute a given tree using iterative or progressive approaches
- However, median computation is still difficult
- Large nuclear genomes
- Complex events
- We also need to search for the best tree in the large tree space
- N species
- 20 species
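To give a sense of how large the tree space is, the sketch below counts tree topologies under one common convention: unrooted, fully resolved (binary) trees on N labeled species, of which there are (2N - 5)!! = 1 * 3 * 5 * ... * (2N - 5). Whether the slide counted rooted or unrooted trees is not recoverable from the transcript, so the choice of convention is an assumption; the function name is hypothetical.

```python
def num_unrooted_binary_trees(n_species):
    """Number of unrooted binary tree topologies on n labeled species: (2n - 5)!!."""
    if n_species < 3:
        raise ValueError("need at least 3 species")
    count = 1
    for k in range(3, 2 * n_species - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

trees_for_20 = num_unrooted_binary_trees(20)   # on the order of 10**20
```

Under this convention, 20 species already give roughly 2.2 * 10^20 topologies, so exhaustive search over trees is out of reach.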
26Statistical Approaches
- Combinatorial approaches are the focus of genome rearrangement research
- Only one MCMC method exists
- Maximum Likelihood methods have been very popular in sequence phylogenetic analysis
- Bootstrapping (data resampling) is a popular method to assess the quality of obtained trees
- Hard to directly apply ML and bootstrapping to gene order
27Sequence ML Phylogeny
- For each position, generate all possible tree structures
- Based on the evolutionary model, calculate the likelihood of these trees and sum them to get the column likelihood
- Calculate the tree likelihood by multiplying the likelihoods for each position
- Choose the tree with the greatest likelihood
28Example
A acgcaa
B acataa
C atgtca
D gcgtta
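The column-by-column likelihood computation can be sketched on the four sequences above. Everything beyond the sequences is an assumption made for illustration: the tree ((A,B),(C,D)) with three unobserved nodes, the Jukes-Cantor substitution model, a single branch length t on every edge, and the function names. The sum is done by brute force over all assignments of states to the unobserved nodes, not with an optimized pruning algorithm.

```python
import math
from itertools import product

STATES = "acgt"

def jc69(x, y, t):
    """Jukes-Cantor probability of observing y after time t, starting from x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def column_likelihood(column, t=0.1):
    """Sum over all states of the three unobserved nodes for one alignment column."""
    a, b, c, d = column
    total = 0.0
    for root, u, v in product(STATES, repeat=3):
        p = 0.25                                   # uniform prior at the root
        p *= jc69(root, u, t) * jc69(root, v, t)   # root to the two inner nodes
        p *= jc69(u, a, t) * jc69(u, b, t)         # inner node u to leaves A, B
        p *= jc69(v, c, t) * jc69(v, d, t)         # inner node v to leaves C, D
        total += p
    return total

def log_likelihood(sequences, t=0.1):
    """Multiply column likelihoods (sum of logs) across all positions."""
    return sum(math.log(column_likelihood(col, t)) for col in zip(*sequences))

sequences = ["acgcaa", "acataa", "atgtca", "gcgtta"]   # A, B, C, D from the slide
loglik = log_likelihood(sequences)
```

Repeating this for each candidate tree (and optimizing branch lengths) and keeping the tree with the largest value is the selection step named in the last bullet of the procedure.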
29All Possible Evolutionary Paths (Column 1)
(Figure: three rows of candidate states a, c, g, t)
30Likelihood for One Path
(Figure: one path, with states a, a, a, g)
31Sum of All Paths (Column 1)
(Figure: three rows of candidate states a, c, g, t)
32Whole Sequence
33MLBE
- Convert the gene orders into binary sequences based on adjacencies
- Convert the binary sequences into protein or DNA sequences
- Use RAxML to compute an ML tree on the sequences
- Binary encoding was used before for parsimony analysis, with reasonable results
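The first step, turning gene orders into binary sequences, can be sketched as below. The details are assumptions made for illustration: genomes are single linear chromosomes, an adjacency (x, y) is treated as identical to (-y, -x), chromosome ends are ignored, and the function names are hypothetical. The later conversion to protein or DNA characters and the RAxML run are not shown.

```python
def canonical(x, y):
    """Pick one representative for an adjacency and its reverse complement."""
    return min((x, y), (-y, -x))

def binary_encode(genomes):
    """Encode each genome as a 0/1 string, one column per observed adjacency.

    genomes: dict mapping a name to a signed gene order (list of ints).
    """
    per_genome = {
        name: {canonical(x, y) for x, y in zip(order, order[1:])}
        for name, order in genomes.items()
    }
    columns = sorted(set().union(*per_genome.values()))
    encoded = {
        name: "".join("1" if col in adj else "0" for col in columns)
        for name, adj in per_genome.items()
    }
    return columns, encoded

columns, encoded = binary_encode({"G1": [1, 2, 3], "G2": [1, -2, 3]})
```

Each column records the presence or absence of one adjacency, so genomes that share many adjacencies receive similar sequences.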
34Binary Encoding
35MLBE Sequences
36Experimental Setup
- Generate random trees of N taxa
- Each tree is equally likely
- Birth-death model is preferred
- Starting from the root, apply r events along each edge
- r is the expected number of events
- Actual number is sampled between 1 and 2r
- Compare the inferred tree with the true tree using the RF rate
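Evolving a genome along one edge of the simulated tree can be sketched as follows. Only inversions are modeled here, the number of events is drawn uniformly from 1..2r as one reading of the setup above, and the function names are hypothetical; tree generation and the RF comparison are not shown.

```python
import random

def apply_inversion(genome, i, j):
    """Reverse the segment genome[i..j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j + 1])] + genome[j + 1:]

def evolve_edge(genome, r, rng):
    """Apply a random number of inversions along one edge; return (child, events)."""
    events = rng.randint(1, 2 * r)      # assumption: uniform on 1..2r
    for _ in range(events):
        i, j = sorted(rng.sample(range(len(genome)), 2))
        genome = apply_inversion(genome, i, j)
    return genome, events

rng = random.Random(0)                  # fixed seed so runs are repeatable
child, events = evolve_edge(list(range(1, 11)), 3, rng)
```

Applying `evolve_edge` recursively from the root down to the leaves yields one simulated dataset whose true tree is known.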
38Experimental Results (Equal Content 1)
80% inversion, 20% transposition
39Experimental Results (Equal Content 2)
80% inversion, 20% transposition
40Experimental Results (Unequal 1)
90% inversion, 10% deletion/insertion/duplication, 5-30 genes per segment
41Experimental Results (Unequal 2)
90% inversion, 10% deletion/insertion/duplication, 5-30 genes per segment
42Multistate Encoding
43MLME Results (200 genes 20 genomes)
44MLME Results (1000 genes 20 genomes)
45Post Analysis
- Bootstrapping has been widely used to assess the quality of sequence phylogenies
- The same procedure is impossible for gene order data since there is only one character
- We tested the jackknifing procedure on simulated data to answer:
- Is jackknifing useful?
- What is the best jackknifing rate?
- What is the threshold for the support values?
46DNA bootstrapping
47Bootstrapping Results
48Jackknifing Procedure
- Generate a new dataset by removing half of the genes from the original genomes (orders are preserved)
- Compute a tree on the new dataset
- Repeat K times and obtain K replicates
- Obtain a consensus tree with support values
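The resampling step of this procedure can be sketched as below. Tree inference and the consensus step are left out, and the function names and the `rate` parameter are hypothetical. The same set of genes is removed from every genome, and the remaining genes keep their order and sign.

```python
import random

def drop_genes(genomes, removed):
    """Delete the given genes from every genome, preserving order and sign."""
    return [[g for g in genome if abs(g) not in removed] for genome in genomes]

def jackknife_replicates(genomes, rate, k, rng):
    """Yield k datasets, each with a random fraction `rate` of the genes removed."""
    genes = sorted(abs(g) for g in genomes[0])
    n_remove = round(rate * len(genes))
    for _ in range(k):
        removed = set(rng.sample(genes, n_remove))
        yield drop_genes(genomes, removed)

genomes = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, -4, 5, 2, 8, 10, 9, -7, -6, 3],
]
reduced = drop_genes(genomes, {2, 4, 6, 8, 10})
replicates = list(jackknife_replicates(genomes, 0.5, 3, random.Random(1)))
```

Each replicate would then be passed to the tree-building method, and the K resulting trees summarized in a consensus tree with support values.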
49An Example
- Original Genomes
- 1 2 3 4 5 6 7 8 9 10
- 1 -4 5 2 8 10 9 -7 -6 3
- New Genomes
- 1 3 5 7 9
- 1 5 9 -7 3
50Jackknifing Rate
51Support Value Threshold - FP
Up to 90% of FPs can be identified with 85% as the threshold
52Trees with FP
53Support Value Threshold - FN
54Low Support Branches
55Jackknife Properties
- Jackknifing is necessary and useful for gene order phylogeny, and a large number of errors can be identified
- A 40% jackknifing rate is reasonable
- 85% is a conservative threshold; 75% can also be used
- Low support branches should be examined in detail
56Conclusions
- Great progress has been made in genome rearrangement research
- We are able to handle real-size data
- Now the question is what data
- Data quality and biological modeling
- Ancestral genome reconstruction is still difficult
- Putting everything together has just started
57Thank You!